Machine Learning model for grading essays written in English.

By,

Anand Sakhare

MISM-BIDA Graduate Student | Carnegie Mellon University

Mobile: (412)708-7836

Email: asakhare@andrew.cmu.edu

LinkedIn: https://www.linkedin.com/in/anand-sakhare/

##### The input data is a cleaned-up version of the dataset from a Kaggle competition.
Competition Name - The Hewlett Foundation: Automated Essay Scoring
Source: https://www.kaggle.com/c/asap-aes

Brief description of the steps that I followed to build the model:

1. Read all the data from the CSV file
2. Convert the essay text to lowercase
3. Remove the stop words
4. Lemmatize all the words in the essays so that variation is reduced
5. Remove punctuation from the essays
6. Compute geometrical features such as length and number of digits
7. For each essay, count the number of words belonging to each part of speech
8. Compute a tfidf sparse matrix and reduce its dimensionality
9. Join all the features (reduced tfidf features, part of speech features and geometrical features)
10. Split the data into training and testing data
11. Create a grid for hyperparameter tuning of the random forest classifier
12. Perform hyperparameter tuning to find the best estimator and parameters for the random forest classifier
13. Predict the output on the test data and evaluate the model
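
As a rough illustration of steps 2, 3 and 5, the sketch below cleans a single essay using only the standard library. The stop-word set and the function name `preprocess` are hypothetical stand-ins chosen for this example; lemmatization (step 4) is left out because it needs the WordNet data.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the notebook
# combines the lists from the stop_words and nltk packages instead.
STOP_WORDS = {"i", "that", "the", "a", "and", "to", "of"}

def preprocess(essay, stop_words=STOP_WORDS):
    """Lowercase, drop stop words, then strip punctuation (steps 2, 3 and 5)."""
    lowered = essay.lower()
    kept = [word for word in lowered.split() if word not in stop_words]
    return re.sub(r"[^\w\s]", "", " ".join(kept))

cleaned = preprocess("Dear @CAPS1, I think that computers help the people!")
print(cleaned)  # dear caps1 think computers help people
```

Because punctuation is stripped last, a token such as `the,` would not match the stop-word list in this order of steps.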

##### Please make sure the following libraries are installed on your system before running this file.
pandas, NumPy, nltk, stop_words, scikit-learn, matplotlib (re and collections are part of the Python standard library)

In [1]:
#Import all the data
import pandas as pd
data = pd.read_csv('train_set_rel3.csv',encoding='latin-1')
data = data.dropna(axis=0, how='any')      #Drop NA values
df = data
df = data[['essay','domain1_score']]
df.columns = ['text','y'] #Take essays and the scores
df.head()

                                                text     y
0  Dear local newspaper, I think effects computer...   8.0
1  Dear @CAPS1 @CAPS2, I believe that using compu...   9.0
2  Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...   7.0
3  Dear Local Newspaper, @CAPS1 I have found that...  10.0
4  Dear @LOCATION1, I know having computers has a...   8.0


In [2]:
%%time
#Make all the text essays lowercase
df['text'] = df['text'].str.lower()
print(df.head())

                                                text     y
0  dear local newspaper, i think effects computer...   8.0
1  dear @caps1 @caps2, i believe that using compu...   9.0
2  dear, @caps1 @caps2 @caps3 more and more peopl...   7.0
3  dear local newspaper, @caps1 i have found that...  10.0
4  dear @location1, i know having computers has a...   8.0
Wall time: 77.2 ms


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
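
The warning above is pandas' SettingWithCopyWarning. It is raised because `df` was created by selecting columns from `data`, so pandas cannot tell whether the assignment should also affect `data`. One common way to avoid it, sketched here on a hypothetical two-row frame rather than the competition CSV, is to take an explicit copy when the columns are selected.

```python
import pandas as pd

# Hypothetical two-row stand-in for the competition CSV.
data = pd.DataFrame({
    "essay": ["Dear Local Newspaper", "Dear @CAPS1"],
    "domain1_score": [8.0, 9.0],
    "essay_set": [1, 1],
})

# .copy() makes df an independent DataFrame, so later assignments
# modify df itself rather than a possible view of data.
df = data[["essay", "domain1_score"]].copy()
df.columns = ["text", "y"]
df["text"] = df["text"].str.lower()

print(df["text"].tolist())     # lowercased essays
print(data["essay"].tolist())  # original frame is unchanged
```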
  


In [3]:
%%time
#Remove the stopwords : stopwords are taken from two libraries - 'stop_words' and 'nltk'
essay_list = df['text']
from stop_words import get_stop_words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words_list = set(get_stop_words('en')).union(set(stopwords.words('english')))  #Combining the stop words from both the packages

processed_essays = []
essay_list = list(essay_list)
for i in essay_list:
    temp = i
    parsed_essay = " ".join([word for word in temp.split() if word not in stop_words_list])
    processed_essays.append(parsed_essay)
#df['text'] = processed_essays

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\anand\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Wall time: 12.5 s


In [4]:
%%time
#Lemmatize all the words in the essays
import nltk
nltk.download('wordnet')
#Source: https://stackoverflow.com/questions/771918/how-do-i-do-word-stemming-or-lemmatization
from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()

s = ""
for i in range(len(processed_essays)):
    s = ""
    for word in processed_essays[i].split():
        s = s + lmtzr.lemmatize(word) + " "
    processed_essays[i] = s

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\anand\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Wall time: 7.58 s


In [5]:
%%time
#Remove Punctuation
#Reference: https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python
import re
for i in range(len(processed_essays)):
    processed_essays[i] = re.sub(r'[^\w\s]','',processed_essays[i])

Wall time: 268 ms
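
To make the effect of the pattern `[^\w\s]` concrete, here is a small sketch on made-up strings. The helper name `strip_punctuation` is introduced only for this example.

```python
import re

def strip_punctuation(text):
    """Delete every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)

first = strip_punctuation("dear @caps1, i don't know!")
second = strip_punctuation("state-of-the-art_idea")
print(first)   # dear caps1 i dont know
print(second)  # stateoftheart_idea
```

Characters are deleted rather than replaced by a space, so hyphenated words are merged, and the underscore survives because `\w` includes it.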


In [6]:
#Append to the dataframe df
df['text'] = processed_essays



In [7]:
#Compute the length (number of characters) of each essay - this will serve as a geometrical parameter
df['length'] = df['text'].str.len()
df.head()



                                                text     y  length
0  dear local newspaper think effect computer peo...   8.0    1110
1  dear caps1 caps2 believe using computer benefi...   9.0    1460
2  dear caps1 caps2 caps3 people use computers ev...   7.0     952
3  dear local newspaper caps1 found many expert s...  10.0    2110
4  dear location1 know computer positive effect p...   8.0    1466


In [8]:
#Compute the number of digits in each essay - this will serve as a geometrical parameter
#sum() already returns 0 when an essay has no digits, so no special case is needed
df['DIGITS'] = [sum(c.isdigit() for c in text) for text in df['text']]
df.head()



                                                text     y  length  DIGITS
0  dear local newspaper think effect computer peo...   8.0    1110       5
1  dear caps1 caps2 believe using computer benefi...   9.0    1460      10
2  dear caps1 caps2 caps3 people use computers ev...   7.0     952       7
3  dear local newspaper caps1 found many expert s...  10.0    2110      41
4  dear location1 know computer positive effect p...   8.0    1466       4


In [9]:
%%time
#Count the parts of speech from each of the essay
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
counts = []
for essay in processed_essays:
    tokenize = nltk.word_tokenize(essay)
    tagged = nltk.pos_tag(tokenize)
    counts.append(tagged)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\anand\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\anand\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
Wall time: 1min 31s


In [10]:
#Convert the counter item to a dictionary and create a list of these dictionaries
from collections import Counter
pos_counter = []
for i in counts:
    a = Counter(tag for word,tag in i)
    pos_counter.append(dict(a))

In [11]:
#Build the word count for each part of speech in a fixed tag order - use 0 where no word belongs to a specific type
#reference: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/
#Tags that are not listed here (for example VBN, VBP and VBZ) are not counted
orderedNames = ['CC','CD','DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNS', 'NNP', 'NNPS', 'PDT', 'POS', 'PRP'
               , 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'WDT', 'WP', 'WP$', 'WRB']
import numpy as np
pos_tag_count = []

for j in pos_counter:
    #dict.get returns 0 for tags that do not occur in the essay
    pos_tag_count.append([j.get(i, 0) for i in orderedNames])
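
The two cells above can be illustrated on a single essay with a short sketch. The tagged tokens and the shortened tag order below are hypothetical; in the notebook they come from `nltk.pos_tag` and `orderedNames`.

```python
from collections import Counter

# Hypothetical output of nltk.pos_tag for one short essay.
tagged = [("dear", "JJ"), ("newspaper", "NN"), ("think", "VB"),
          ("computer", "NN"), ("help", "VB"), ("people", "NNS")]

# Hypothetical short tag order; the notebook uses the longer orderedNames list.
tag_order = ["JJ", "NN", "NNS", "VB", "RB"]

tag_counts = Counter(tag for _, tag in tagged)
# A Counter returns 0 for missing keys, so absent tags need no special case.
row = [tag_counts[tag] for tag in tag_order]
print(row)  # [1, 2, 1, 2, 0]
```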

In [12]:
#Append the counts for each essay in the dataframe df
pos_tag_count = np.asarray(pos_tag_count)  #2-D array: one row per essay, one column per tag
for i in range(len(orderedNames)):
    df[orderedNames[i]] = pos_tag_count[:,i]



In [13]:
%%time
#Compute a tfidf sparse matrix based on the processed_essays
from sklearn.feature_extraction.text import TfidfVectorizer
#Create a sparse matrix: sublinear_tf uses 1 + log(tf), max_df=0.5 ignores terms found in more than half of the essays
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
X_tfidf = vectorizer.fit_transform(processed_essays)

Wall time: 1.27 s
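
For readers who want to see what the weights mean, the sketch below recomputes a sublinear tf-idf by hand on four made-up documents. It assumes scikit-learn's documented defaults of a smoothed idf and L2 row normalisation, splits on whitespace only, and skips the built-in English stop-word filtering, so it approximates rather than reproduces the vectorizer above. The function name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs, max_df=0.5):
    """Sublinear tf-idf with smoothed idf and L2-normalised rows."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # Drop terms that occur in more than max_df of the documents.
    vocab = sorted(t for t, freq in doc_freq.items() if freq <= max_df * n_docs)
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = []
        for term in vocab:
            if counts[term] == 0:
                weights.append(0.0)
                continue
            tf = 1.0 + math.log(counts[term])                           # sublinear tf
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0   # smoothed idf
            weights.append(tf * idf)
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

docs = ["computer help people", "computer hurt people people",
        "book help", "computer game"]
vocab, rows = tfidf_rows(docs)
print(vocab)  # ['book', 'game', 'help', 'hurt', 'people']
```

The word "computer" appears in three of the four documents, which is more than half, so `max_df=0.5` removes it from the vocabulary.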


In [14]:
#Convert the tfidf sparse matrix to a dense matrix and append to a dataframe
l = pd.DataFrame(X_tfidf.todense())
print(l.head())

   0      1      2      3      4      5      6      7      8      9      \
0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
1    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
2    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
3    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
4    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   

   ...    43432  43433  43434  43435  43436  43437  43438  43439  43440  43441  
0  ...      0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0  
1  ...      0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0  
2  ...      0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0  
3  ...      0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0  
4  ...      0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0  

[5 rows x 43442 columns]


In [15]:
print(df.head())

                                                text     y  length  DIGITS  \
0  dear local newspaper think effect computer peo...   8.0    1110       5   
1  dear caps1 caps2 believe using computer benefi...   9.0    1460      10   
2  dear caps1 caps2 caps3 people use computers ev...   7.0     952       7   
3  dear local newspaper caps1 found many expert s...  10.0    2110      41   
4  dear location1 know computer positive effect p...   8.0    1466       4   

   CC  CD  DT  EX  FW  IN ...   RP  TO  UH  VB  VBD  VBG  WDT  WP  WP$  WRB  
0   0   0   0   0   0   3 ...    0   0   0  12    4    8    0   4    0    0  
1   0   8   3   0   0   9 ...    0   0   0  19    8   15    0   0    0    0  
2   0   2   0   0   0   1 ...    0   0   0   6    1    7    0   0    0    0  
3   0   0   2   1   0   3 ...    0   0   0   5   12    5    0   0    0    0  
4   0   5   4   0   0   3 ...    1   0   0   7    0    5    0   0    0    0  

[5 rows x 36 columns]


In [16]:
#%%time
#from sklearn.manifold import TSNE
#tsne = TSNE(n_components=2, verbose=1, perplexity=40)
#feature_vectors_tsne2d = tsne.fit_transform(l)

#Commenting out the above code for TSNE because of the execution time 

In [17]:
#Reducing the dimensions of the matrix using PCA 
from sklearn.decomposition import PCA

pca = PCA(n_components=4)  # project data down to 4 dimensions
feature_vectors_pca = pca.fit_transform(X_tfidf.toarray())  # toarray() returns a dense ndarray (todense() returns np.matrix)
l = pd.DataFrame(feature_vectors_pca)
#Append the part of speech features, features from tfidf dense matrix and the geometrical features
#pd.concat aligns rows on the index, so rows whose index labels do not match get NaN and are dropped below
result = pd.concat([df, l], axis=1)
result = result.dropna(axis=0, how='any')

In [25]:
print(result.head())

                                                text     y  length  DIGITS   CC   CD   DT   EX   FW   IN  ...  \
0  dear local newspaper think effect computer peo...   8.0  1110.0     5.0  0.0  0.0  0.0  0.0  0.0  3.0  ...   
1  dear caps1 caps2 believe using computer benefi...   9.0  1460.0    10.0  0.0  8.0  3.0  0.0  0.0  9.0  ...   
2  dear caps1 caps2 caps3 people use computers ev...   7.0   952.0     7.0  0.0  2.0  0.0  0.0  0.0  1.0  ...   
3  dear local newspaper caps1 found many expert s...  10.0  2110.0    41.0  0.0  0.0  2.0  1.0  0.0  3.0  ...   
4  dear location1 know computer positive effect p...   8.0  1466.0     4.0  0.0  5.0  4.0  0.0  0.0  3.0  ...   

    VBD   VBG  WDT   WP  WP$  WRB          0          1         2          3
0   4.0   8.0  0.0  4.0  0.0  0.0  -0.088176  -0.163062  0.019372  -0.043956
1   8.0  15.0  0.0  0.0  0.0  0.0  -0.076961   -0.13874  0.041166    -0.0135
2   1.0   7.0  0.0  0.0  0.0  0.0  -0.084502  -0.183798  0.042169  -0.050475
3  12.0   5.0  0.0  0.0  0.0  0.0  -0.063925  -0.138634  0.011709  -0.033999
4   0.0   5.0  0.0  0.0  0.0  0.0  -0.091834    -0.1436  0.053373  -0.038958
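
The float values in the output above (1110.0 rather than 1110) may indicate that the concatenation introduced NaN values before `dropna` ran. `pd.concat(..., axis=1)` pairs rows by index label, not by position, so a frame whose index has gaps does not line up with a frame that has a fresh 0..n-1 index. The sketch below shows the effect on hypothetical data; the column name `pc0` is made up.

```python
import pandas as pd

# Hypothetical frame whose index has a gap, as happens after dropna().
df = pd.DataFrame({"length": [1110, 1460, 952]}, index=[0, 2, 3])
# Hypothetical reduced features with a fresh 0..n-1 index.
pca_features = pd.DataFrame({"pc0": [-0.09, -0.08, -0.06]})

# Aligning on labels loses rows and pairs some rows with the wrong features.
misaligned = pd.concat([df, pca_features], axis=1).dropna()
# Resetting the index first pairs the rows by position instead.
aligned = pd.concat([df.reset_index(drop=True), pca_features], axis=1)

print(len(misaligned), len(aligned))  # 2 3
```

In the misaligned result, the essay with length 1460 ends up next to the features of the third essay, which is why resetting the index before concatenating is usually the safer choice.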


Although PCA assumes linear structure, I am using it for dimensionality reduction here to keep the execution time low. TSNE might capture nonlinear structure better, but it takes much longer to run.
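
If memory or run time becomes a concern, one alternative worth considering is scikit-learn's `TruncatedSVD`, which accepts a sparse matrix directly and so avoids building the dense matrix. This is a different technique from the PCA used in this notebook (it does not centre the data), so the components will not match. The small matrix below is a hypothetical stand-in for `X_tfidf`.

```python
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical tiny sparse matrix standing in for X_tfidf.
X_sparse = csr_matrix([
    [0.0, 0.7, 0.0, 0.7, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.7],
    [0.0, 0.6, 0.8, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.4, 0.0],
    [0.0, 0.0, 0.6, 0.0, 0.8],
])

svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(X_sparse)  # works on sparse input directly
print(reduced.shape)  # (5, 2)
```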

In [18]:
X = result.iloc[:,2:]
y = result.iloc[:,1]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True,random_state=42)


##### I tried multiple models and compared their accuracy. RandomForestClassifier gives better accuracy than 'DecisionTreeClassifier', 'OneVsRestClassifier' and 'svm'.

##### Hence I have decided to further fine-tune the random forest classifier.



In [19]:
#Creating a grid space with following hyperparameters
max_depth = [int(x) for x in np.linspace(10, 80, num = 5)]
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 150, num = 3)]
#bootstrap = [True, False]
criterion = ["gini", "entropy"]
random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'criterion' : criterion,
               #'bootstrap': bootstrap
              }
print(random_grid)

{'n_estimators': [50, 100, 150], 'max_depth': [10, 27, 45, 62, 80], 'criterion': ['gini', 'entropy']}
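
To get a feel for the size of this search, the sketch below enumerates every combination in the printed grid with the standard library. It illustrates the number of candidates only, not the order in which GridSearchCV visits them; each candidate is then fitted once per cross-validation fold.

```python
from itertools import product

# Values copied from the printed grid above.
random_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [10, 27, 45, 62, 80],
    "criterion": ["gini", "entropy"],
}

names = list(random_grid)
candidates = [dict(zip(names, values))
              for values in product(*random_grid.values())]
print(len(candidates))  # 30
print(candidates[0])    # {'n_estimators': 50, 'max_depth': 10, 'criterion': 'gini'}
```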


In [20]:
%%time
#Perform grid search with the grid above to tune the Random Forest Classifier
#reference: http://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import  GridSearchCV
clf = RandomForestClassifier()
rf_random =  GridSearchCV(estimator = clf, param_grid=random_grid)
rf_random.fit(X_train, y_train)
print(rf_random.best_params_)



{'criterion': 'entropy', 'max_depth': 80, 'n_estimators': 150}
Wall time: 5min 39s


In [21]:
#Predict
model = rf_random.best_estimator_
y_pred = model.predict(X_test)

In [22]:
#Model Evaluation
from sklearn.metrics import accuracy_score
a = accuracy_score(y_test, y_pred)
print("RandomForestClassifier")
print("Accuracy Score in % : ")
print(a*100)

RandomForestClassifier
Accuracy Score in % : 
54.08320493066255


In [23]:
from sklearn.metrics import mean_squared_error
from math import sqrt
rms = sqrt(mean_squared_error(y_test, y_pred))
print("RMSE : " + str(rms))

RMSE : 4.06159243744237


Since there are about 60 classes for the prediction model, the ROC curve does not provide good insight into the model performance.

The accuracy score, which comes out at about 54.1% in the run above, is a good indicator of the model performance.
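
The two metrics reported above measure different things: accuracy counts only exact matches, while RMSE also reflects how far off the wrong predictions are. A minimal sketch on hypothetical scores (not taken from the test set) shows how each is computed.

```python
from math import sqrt

# Hypothetical true and predicted scores for six essays.
y_true = [8, 9, 7, 10, 8, 3]
y_pred = [8, 9, 8, 10, 6, 3]

# Fraction of essays whose predicted score matches exactly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Square root of the mean squared difference between the scores.
rmse = sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

print(f"Accuracy: {accuracy:.3f}")  # Accuracy: 0.667
print(f"RMSE: {rmse:.3f}")          # RMSE: 0.913
```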

##### Most important features

In [24]:
feat_imp = model.feature_importances_
#ind = np.argsort(feat_imp)[::-1]
#feat_imp[ind]
print(feat_imp)

[1.26062543e-01 3.94223648e-02 3.85475460e-03 1.78369122e-02
 1.22905812e-02 2.34276736e-03 2.24958485e-03 2.48056842e-02
 6.49354530e-02 8.53329283e-03 7.99198048e-03 0.00000000e+00
 5.32236260e-03 8.38523623e-02 4.32955344e-02 4.88684640e-03
 1.65243157e-06 2.06751582e-04 1.67061170e-06 1.58155435e-02
 2.84688589e-03 4.88411755e-02 8.09246736e-03 9.40035898e-04
 8.05944669e-03 2.68559158e-03 4.63692318e-04 2.96573269e-02
 5.43098527e-02 3.06087159e-02 2.05306897e-03 1.59594681e-03
 5.34773210e-05 1.52898239e-03 5.27755835e-02 1.10112019e-01
 8.68592318e-02 9.48078565e-02]


The most important feature in the feature space is length, which represents the length of the essay. Several other features also carry some importance, such as the use of certain words in the essays (captured by the tfidf components), the number of digits and some specific parts of speech.
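
The raw array is easier to read once each value is paired with its column name and sorted. The sketch below does this with the importances printed above, assuming the columns are in the order produced by `result.iloc[:,2:]` (length, DIGITS, the part-of-speech tags, then the four PCA components). The labels `pca_0` to `pca_3` are hypothetical names for the columns that the notebook calls 0 to 3.

```python
# Importances copied from the output above.
importances = [
    1.26062543e-01, 3.94223648e-02, 3.85475460e-03, 1.78369122e-02,
    1.22905812e-02, 2.34276736e-03, 2.24958485e-03, 2.48056842e-02,
    6.49354530e-02, 8.53329283e-03, 7.99198048e-03, 0.00000000e+00,
    5.32236260e-03, 8.38523623e-02, 4.32955344e-02, 4.88684640e-03,
    1.65243157e-06, 2.06751582e-04, 1.67061170e-06, 1.58155435e-02,
    2.84688589e-03, 4.88411755e-02, 8.09246736e-03, 9.40035898e-04,
    8.05944669e-03, 2.68559158e-03, 4.63692318e-04, 2.96573269e-02,
    5.43098527e-02, 3.06087159e-02, 2.05306897e-03, 1.59594681e-03,
    5.34773210e-05, 1.52898239e-03, 5.27755835e-02, 1.10112019e-01,
    8.68592318e-02, 9.48078565e-02,
]

pos_tags = ['CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS',
            'MD', 'NN', 'NNS', 'NNP', 'NNPS', 'PDT', 'POS', 'PRP', 'PRP$',
            'RB', 'RBR', 'RBS', 'RP', 'TO', 'UH', 'VB', 'VBD', 'VBG',
            'WDT', 'WP', 'WP$', 'WRB']
# Assumed column order; pca_0..pca_3 are hypothetical labels.
names = ['length', 'DIGITS'] + pos_tags + ['pca_0', 'pca_1', 'pca_2', 'pca_3']

# Sort the (name, importance) pairs from most to least important.
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked[:5]:
    print(f"{name:8s} {value:.4f}")
```

Under that assumed ordering, the top five are length, three of the PCA components and the noun count (NN).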

~End of Document