# Automatic Essay Scoring

# Motivation

Automated essay scoring requires quantifying not only grammar but semantics, discourse and pragmatics. Two different approaches were explored - traditional NLP features with logistic regression, and word vector representations with a LSTM. Because the contest was held seven years ago, the contest entries all involved feature engineering and making use of different regressors such as linear regression and k-nearest neighbours. Recent attempts at the contest involve neural models, but not much work has been done in this area. Trying out the two different techniques allowed for a better learning experience with regards to finding which traditional NLP features correlate most to the final score, as well as providing the opportunity to build upon the current state of the art neural model and work towards beating it. 

# Approach

We explored classical natural language techniques by designing handcrafted features and performing logistic regression, and experimented with different word vector techniques with a LSTM. 

## Features

Features were designed to judge language fluidity, diction, structure, organization, originality and quality of the content. The selected features were as follows.  

1.Language quality and originality.  

- TF-IDF vectors: A TF-IDF vectorizer was trained on the essays and 400 features were selected as unigrams, bigrams, or trigrams. We ensured that each n-gram was observed at least five times in the essay but occurred in no more than 90% of the essays. Then, each n-gram was fed as a binary feature with a weight of one if it appeared, and zero otherwise.
 
- Doc2Vec: A Doc2Vec model was built from the essays, and a concatenation of the maximum and minimum vectors for each essay was fed as a feature. This allows us to encode semantic meaning from the essays, and concatenation performed better than summing or averaging the vectors. 

2.Numerical features. 
- Basic text features: Word count, average word length, and sentence count. 
- Part of speech counts: Number of nouns, verbs, foreign words, adjectives, adverbs, and conjunctions.

3.Structure and organization. 
- Punctuation: Number of exclamation marks and question marks.

## Logistic Regression 

Logistic regression was used as the learning model to make predictions based on the features. 5-fold cross validation was used in training and testing the model to avoid over-fitting.

## Long Short-Term Memory

Long short-term memory units are a modification to recurrent units that use three gates to forget information or preserve it. The model consists of two LSTM layers, a dropout layer, and a dense output layer. The dropout rate was set to 50% to guard against over-fitting. 
${W}$ vectors represent the weight for the input vectors, ${U}$ vectors are the weights for the previous cell output, ${x_t}$ is the input vector at time t, ${h_t}$ is the output vector at time t, and ${\circ}$ represents element-wise multiplication.  


The input gate is expressed as:

\begin{equation}
i_t  =  \sigma (W_i .x_t + U_i.h_{t- 1} + b_i)
\end{equation}

The forget gate is expressed as:

\begin{equation}
f_t  =  \sigma (W_i .x_t + U_f.h_{t- 1} + b_f)
\end{equation}

The output of the element-wise product of the previous state and the forget gate is ${S_{t- 1}} \circ f $. Then the output is:

\begin{equation}
 s_t = s_{t-1} \circ f + g  \circ f
\end{equation}

Lastly, the output gate:

\begin{equation}
o_t  =  \sigma (W_o .x_t + U_o.h_{t- 1} + b_o)
\end{equation}

The final result is put through tanh squashing, as this ensures the range is from -1 to 1. This stage is as follows:

\begin{equation}
h_t  =  \tanh{o_t} \circ {s_t}
\end{equation}

The outputs of this final layer are then fed into a dense layer, a densely-connected layer. It implements output = activation(dot(input, weights)) where activation is the element-wise ReLu activation function and weights is a weights matrix created by the layer:

\begin{equation}
d(h_t) = o_t \circ ReLu((h_t .W_t))
\end{equation}


# Data

The data set was provided from Hewlett Foundation’s Automated Student Assessment Prize competition on Kaggle. The dataset can be found in the Data folder, it is the file called 'training_set_rel3.xls'. Descriptions for each essay set are shown there as well.
	
- 12977 essay samples 
- 80% of the essays were used for training, and 20% for testing 
- 8 different essay prompts, each of which have a corresponding set of essays 
- Each set has a unique grading scale 
- Average word length of 150-550 words per essay 
- 2 essay sets are argumentative, 4 are response essays, and 2 are narrative (source dependent)

# Code

## Setup

In [200]:
# Import necessary modules
import numpy as np
import pandas as pd
import gensim 
from gensim.models.doc2vec import Doc2Vec
import nltk
from nltk.corpus import stopwords
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import cohen_kappa_score

stopwords = set(stopwords.words('english'))

First, we can set up the dataframes and explore the data. We will drop columns that we don't need and those with NaN values. There was one row without a domain1_score, which I removed. Some essays also contained domain2 or domain3 scores, but since not all the data has that field, we will ignore those fields. 

In [201]:
data = pd.ExcelFile('./data/training_set_rel3.xls')
df = data.parse("training_set")
df = df.drop('rater1_domain1', 1)
df = df.drop('rater2_domain1', 1)
df = df.dropna(axis = 1)

df.head()

Unnamed: 0,essay_id,essay_set,essay,domain1_score
0,1,1,"Dear local newspaper, I think effects computer...",8
1,2,1,"Dear @CAPS1 @CAPS2, I believe that using compu...",9
2,3,1,"Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...",7
3,4,1,"Dear Local Newspaper, @CAPS1 I have found that...",10
4,5,1,"Dear @LOCATION1, I know having computers has a...",8


In [202]:
# Get an essay sample
essays = df['essay']
essays[0]

"Dear local newspaper, I think effects computers have on people are great learning skills/affects because they give us time to chat with friends/new people, helps us learn about the globe(astronomy) and keeps us out of troble! Thing about! Dont you think so? How would you feel if your teenager is always on the phone with friends! Do you ever time to chat with your friends or buisness partner about things. Well now - there's a new way to chat the computer, theirs plenty of sites on the internet to do so: @ORGANIZATION1, @ORGANIZATION2, @CAPS1, facebook, myspace ect. Just think now while your setting up meeting with your boss on the computer, your teenager is having fun on the phone not rushing to get off cause you want to use it. How did you learn about other countrys/states outside of yours? Well I have by computer/internet, it's a new way to learn about what going on in our time! You might think your child spends a lot of time on the computer, but ask them so question about the econom

## Doc2Vec Model

Next, we need to convert the essays into vector representations. The following is from a [Gensim Doc2Vec tutorial](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb). This allows us to learn paragraph and document embeddings via the distributed memory and distributed bag of words models. 

We create a Doc2Vec model with a vector size with 50 words and iterate over the training data 40 times. The minimum word count is two in order to discard uncommon words.

In [216]:
# Function to get all text from each essay - to build doc2vec
def all_essays(df):
    for (i, essay) in enumerate(df['essay']):
        yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(essay), [i])
        

all_essay_lst = all_essays(df)

# Instaniate the Doc2Vec model
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)

# Build dictionary of all the unique words and their frequencies
model.build_vocab(all_essay_lst)

%time model.train(all_essay_lst, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 54.6 ms, sys: 37.6 ms, total: 92.2 ms
Wall time: 99.5 ms


## Feature Extraction

Features were designed to judge language fluidity, diction, structure, organization, originality and quality of the content. The selected features were as follows.  

1.Language quality and originality.  

- TF-IDF vectors: A TF-IDF vectorizer was trained on the essays and 400 features were selected as unigrams, bigrams, or trigrams. We ensured that each n-gram was observed at least five times in the essay but occurred in no more than 90% of the essays. Then, each n-gram was fed as a binary feature with a weight of one if it appeared, and zero otherwise.
 
- Doc2Vec: A Doc2Vec model was built from the essays, and a concatenation of the maximum and minimum vectors for each essay was fed as a feature. This allows us to encode semantic meaning from the essays, and concatenation performed better than summing or averaging the vectors. 

2.Numerical features. 
- Basic text features: Word count, average word length, and sentence count. 
- Part of speech counts: Number of nouns, verbs, foreign words, adjectives, adverbs, and conjunctions.

3.Structure and organization. 
- Punctuation: Number of exclamation marks and question marks.

In [217]:
num_rows = df.shape[0]
essays = df['essay'].values

#Initialize dataframe columns
df['word_count'] = np.nan 
df['sentence_count'] = np.nan
df['avg_word_length'] = np.nan 
df['num_exclamation_marks'] = np.nan
df['num_question_marks'] = np.nan
df['num_stopwords'] = np.nan
df['word2vec_concat'] = np.nan

df['noun_count'] = np.nan
df['verb_count'] = np.nan
df['foreign_count'] = np.nan
df['adj_count'] = np.nan
df['conj_count'] = np.nan
df['adv_count'] = np.nan

def get_pos_tags(essay):
    nouns = verbs = foreign = adj = adv = conj = 0
    tokens = nltk.word_tokenize(essay)
    for token in tokens:
        pos_tag = nltk.pos_tag(nltk.word_tokenize(token))
        for (_, tag) in (pos_tag):
            if tag[0] == "N":
                nouns += 1
            elif tag[0] == "V":
                verbs += 1
            elif tag[0:2] == "FW":
                foreign += 1
            elif tag[0] == "J":
                adj += 1
            elif tag[0] == "R":
                adv += 1
            elif tag[0:2] == "CC" or tag[0:2] == "IN":
                conj += 1
    
    return [nouns, verbs, foreign, adj, adv, conj]


for i in range(num_rows):
    
    # Turn essay into list of words
    text = essays[i].split(" ")
    
    # Set word count
    df.set_value(i,'word_count', len(text))
    
    # Sentence count
    df.set_value(i, 'sentence_count', len(nltk.tokenize.sent_tokenize(essays[i])))
    
    # Average word length
    word_len = sum(len(word) for word in text) / len(text)
    df.set_value(i, 'avg_word_length', word_len)
    
    # Number of exclamation marks
    df.set_value(i, "num_exclamation_marks", sum(word.count("!") for word in essays[i]))
    
    # Number of question marks
    df.set_value(i, "num_question_marks", sum(word.count("?") for word in essays[i]))
    
    # Number of stop words
    df.set_value(i, "num_stopwords", sum([1 for word in text if word.lower() in stopwords]))

    # Doc2Vec conversion - min + max
    df.set_value(i, 'word2vec_concat', min(model.docvecs[i]) + max(model.docvecs[i]))
    
    # POS tag counts
    pos_lst = get_pos_tags(essays[i])
    df.set_value(i,'noun_count', pos_lst[0])
    df.set_value(i,'verb_count', pos_lst[1])
    df.set_value(i,'foreign_count', pos_lst[2])
    df.set_value(i,'adj_count', pos_lst[3])
    df.set_value(i,'adv_count', pos_lst[4])
    df.set_value(i,'conj_count', pos_lst[5])



### TF-IDF vectors

A TF-IDF vectorizer was trained on the essays and 400 features were selected as unigrams, bigrams, or trigrams. We ensured that each n-gram was observed at least five times in the essay but occurred in no more than 90% of the essays. Then, each n-gram was fed as a binary feature with a weight of one if it appeared, and zero otherwise.

In [218]:
def get_tfidf_vectors(essays):
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_df=0.9, min_df=5, max_features=400, stop_words="english", binary=True)
    tfidf_vectors = vectorizer.fit_transform(essays)
    new_df = pd.DataFrame(tfidf_vectors.toarray(), columns=vectorizer.get_feature_names())
    
    return pd.concat([df, new_df], axis=1)

df = get_tfidf_vectors(essays)

We can now view the updated training set that includes all of these new features. 

In [226]:
df.head()

Unnamed: 0,essay_id,essay_set,essay,domain1_score,word_count,sentence_count,avg_word_length,num_exclamation_marks,num_question_marks,num_stopwords,...,world,wouldn,write,writing,wrong,year,years,yes,york,young
0,1,1,"Dear local newspaper, I think effects computer...",8,338.0,16.0,4.550296,4.0,2.0,168.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,1,"Dear @CAPS1 @CAPS2, I believe that using compu...",9,419.0,20.0,4.463007,1.0,1.0,189.0,...,0.121001,0.151485,0.173362,0.173767,0.0,0.0,0.0,0.0,0.0,0.0
2,3,1,"Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...",7,279.0,14.0,4.526882,0.0,0.0,140.0,...,0.12576,0.0,0.0,0.0,0.15899,0.0,0.0,0.0,0.0,0.0
3,4,1,"Dear Local Newspaper, @CAPS1 I have found that...",10,524.0,27.0,5.041985,2.0,1.0,222.0,...,0.111728,0.0,0.160076,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,1,"Dear @LOCATION1, I know having computers has a...",8,465.0,30.0,4.526882,0.0,0.0,236.0,...,0.116441,0.145776,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Logistic Regression
We use sklearn for the logistic regression. First, we split up our dataset. We want to train on our new features without the score and essay removed, and test on the score column. Then, we can use 5 fold cross-validation to split up the two sets. Finally, we will train the logistic regression model on our training set.

The linear equation is:

\begin{equation}
y = b_0 + {b_1}x 
\end{equation}


Logistic regression is similar to linear regression but it produces a logistic curve, which is limited to values between 0 and 1. The curve is created by using the natural logarithm of the probabibility of the target variable, rather than the probability. The logistic equation is: 

\begin{equation}
p = \frac{{1}}{e^{-(b_0 + {b_1}x)}}
\end{equation}

The constant, $b_0$ moves the curve left and right and the slope, $b_1$ determines the steepness of the curve.

In [230]:
from sklearn.model_selection import StratifiedKFold

x = df.drop(['domain1_score', 'essay'], axis=1)
y = df['domain1_score']

# 5 fold cross validation to avoid overfitting
x = np.array(x)
y = np.array(y)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in kfold.split(x, y):
    X_train, X_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
logistic_reg = LogisticRegression()
logistic_reg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Now, we can test our model on the unseen data and get an accuracy score and a quadratic weighted kappa score.

In [231]:
predictions = logistic_reg.predict(X_test)
print('Logistic regression classifier accuracy:', logistic_reg.score(X_test, y_test))

Logistic regression classifier accuracy: 0.5670423630003887


In [232]:
print(cohen_kappa_score(predictions, y_test, weights="quadratic"))

0.8472772087653904


## Long Short-Term Memory Network

Long short-term memory units are a modification to recurrent units that use three gates to forget information or preserve it. The model consists of two LSTM layers, a dropout layer, and a dense output layer. The dropout rate was set to 50% to guard against over-fitting.

The first layer of the LSTM has 300 units, 40% of which are dropped for the linear transformation of the input, and 40% of which are dropped for the linear transformation of the recurrent state. The second layer has 64 units, and 40% of units are dropped for the linear transformation of the recurrent state. Then, it runs a Dropout layer, which randomly sets 50% of input units to 0 at each update during training as a way to reduce overfitting. Lastly, it goes to a Dense layer, which is a densely-connected layer. It implements output = activation(dot(input, weights)) where activation is the element-wise ReLu activation function and weights is a weights matrix created by the layer. 

In [96]:
import os
import pandas as pd

X = pd.ExcelFile('./data/training_set_rel3.xls')
X = data.parse("training_set")
y = X['domain1_score']
X = X.dropna(axis=1)
X = X.drop(columns=['rater1_domain1', 'rater2_domain1'])

In [189]:
from keras.layers import Embedding, LSTM, Dense, Dropout, Lambda, Flatten
from keras.models import Sequential, load_model, model_from_config
import keras.backend as K

def get_model():
    model = Sequential([
        LSTM(300, dropout=0.4, recurrent_dropout=0.4, input_shape=[1, 300], return_sequences=True),
        LSTM(64, recurrent_dropout=0.4),
        Dropout(0.5),
        Dense(1, activation='relu')
    ])

    model.compile(loss='mean_squared_error', optimizer='rmsprop', metrics=['mae'])
    model.summary()

    return model

In [122]:
import numpy as np
import nltk
import re
from nltk.corpus import stopwords
from gensim.models import Word2Vec

stopwords = set(stopwords.words('english'))

def essay_to_list(essay):
    # Remove the tags
    essay = re.sub("[^a-zA-Z]", " ", essay)
    words = essay.lower().split()
    return [w for w in words if not w in stopwords]

def essay_to_sentences(essay):
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sentences = tokenizer.tokenize(essay.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(essay_to_list(raw_sentence))
    return sentences

# Generate feature vector for the words
def get_feature_vector(words, model, num_features, vec_type="sum"):
    feature_vector = np.zeros((num_features,),dtype="float32")
    num_words = 0.
    index2word_set = set(model.wv.index2word)
    
    max_vec =  np.zeros((num_features,),dtype="float32")
    min_vec =  np.ones((num_features,),dtype="float32")

    for word in words:
        if word in index2word_set:
            num_words += 1
            max_vec = np.maximum(model[word], feature_vector)
            min_vec = np.minimum(model[word], feature_vector)
            feature_vector = np.add(feature_vector, model[word]) 
    
    # return min vector + max vector
    if vec_type == "min+max":
        return np.add(min_vec, max_vec) 
    
    # average of vectors
    elif vec_type == "average":
        return np.divide(feature_vector, num_words)

    # return sum of word2vec vectors
    return feature_vector

# Generate word vectors from the mdoel
def generate_essay_vectors(essays, model, num_features, vec_type="sum"):
    essayfeature_vectors = np.zeros((len(essays),num_features),dtype="float32")
    for (i, essay) in enumerate(essays):
        essayfeature_vectors[i] = get_feature_vector(essay, model, num_features, vec_type)
    return essayfeature_vectors

In [123]:
from sklearn.cross_validation import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import cohen_kappa_score

def train_model(X, y, dataset, vec_type="sum"):
    count = 1
    results = []
    
    for train_set, test_set in dataset:
        print("Fold #", count)
        X_test, X_train, y_test, y_train = X.iloc[test_set], X.iloc[train_set], y.iloc[test_set], y.iloc[train_set]
        
        train_essays = X_train['essay']
        test_essays = X_test['essay']
        
        sentences = []
        
        for essay in train_essays:
            sentences += essay_to_sentences(essay)
                
        # Initialize variables for word2vec model
        num_features = 300 
        min_word_count = 40
        num_workers = 4
        context = 10
        downsampling = 1e-7

        # Train the word2vec model
        model = Word2Vec(sentences, workers=num_workers, size=num_features, min_count = min_word_count, window = context, sample = downsampling)
        model.init_sims(replace=True)

        clean_train_essays = []
        
        # Generate training and testing data word vectors.
        for essay_vec in train_essays:
            clean_train_essays.append(essay_to_list(essay_vec))
        train_vectors = generate_essay_vectors(clean_train_essays, model, num_features, vec_type)
        
        clean_test_essays = []
        for essay_vec in test_essays:
            clean_test_essays.append(essay_to_list( essay_vec))
        test_vectors = generate_essay_vectors(clean_test_essays, model, num_features, vec_type)
        
        train_vectors = np.array(train_vectors)
        test_vectors = np.array(test_vectors)

        # Reshape the train and test vectors to 3 dimensions - 1 represents one timestamp 
        train_vectors = np.reshape(train_vectors, (train_vectors.shape[0], 1, train_vectors.shape[1]))
        test_vectors = np.reshape(test_vectors, (test_vectors.shape[0], 1, test_vectors.shape[1]))
        
        # Call the LSTM to get the score predictions 
        lstm_model = get_model()
        lstm_model.fit(train_vectors, y_train, batch_size=64, epochs=50)
        y_pred = lstm_model.predict(test_vectors)
        
        # Round the prediction to the nearest integer
        y_pred = np.around(y_pred)
        
        # Evaluate the model: quadratic kappa score of predictions against human grading
        result = cohen_kappa_score(y_test.values, y_pred, weights='quadratic')
        print("QWK: ", result)
        results.append(result)
        
        count += 1

    return results

In [124]:
# Evaluate the model
dataset = KFold(len(X), n_folds=5, shuffle=True)

results_min_max = train_model(X, y, dataset, "min+max")
print("Average Quadratic Weighted Kappa after 5-fold cross validation for min + max word2vec ",np.around(np.array(results_min_max).mean(),decimals=4))

Fold # 1




_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_124 (LSTM)              (None, 1, 300)            721200    
_________________________________________________________________
lstm_125 (LSTM)              (None, 64)                93440     
_________________________________________________________________
dropout_64 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_64 (Dense)             (None, 1)                 65        
Total params: 814,705
Trainable params: 814,705
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
E

Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
QWK:  0.9674684772448536
Fold # 3
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_128 (LSTM)              (None, 1, 300)            721200    
_________________________________________________________________
lstm_129 (LSTM)              (None, 64)                93440     
_________________________________________________________________
dropout_66 (Dropout)         (None

Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
QWK:  0.9727294307657198
Fold # 4
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_130 (LSTM)              (None, 1, 300)            721200    
_________________________________________________________________
lstm_131 (LSTM)              (None, 64)                93440     
_________________________________________________________________
dropout_67 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_67 (Dense)             (None, 1)    

Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
QWK:  0.9685796236967704
Fold # 5
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_132 (LSTM)              (None, 1, 300)            721200    
_________________________________________________________________
lstm_133 (LSTM)              (None, 64)                93440     
_________________________________________________________________
dropout_68 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_68 (Dense)             (None, 1)                 65        
Total params: 814,705
Trainable params: 814,705
Non-trainable params: 0
________________________________________________

Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
QWK:  0.9747481388630764
Average Quadratic Weighted Kappa after 5-fold cross validation for min + max word2vec  0.9713


In [125]:
results_average = train_model(X, y, dataset, "average")
print("Average Quadratic Weighted Kappa after 5-fold cross validation for average word2vec ",np.around(np.array(results_average).mean(),decimals=4))

Fold # 1




_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_134 (LSTM)              (None, 1, 300)            721200    
_________________________________________________________________
lstm_135 (LSTM)              (None, 64)                93440     
_________________________________________________________________
dropout_69 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_69 (Dense)             (None, 1)                 65        
Total params: 814,705
Trainable params: 814,705
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
E

Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
QWK:  0.9027899887321981
Fold # 3
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_138 (LSTM)              (None, 1, 300)            721200    
_________________________________________________________________
lstm_139 (LSTM)              (None, 64)                93440     
_________________________________________________________________
dropout_71 (Dropout)         (None

Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
QWK:  0.9000602569154881
Fold # 4
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_140 (LSTM)              (None, 1, 300)            721200    
_________________________________________________________________
lstm_141 (LSTM)              (None, 64)                93440     
_________________________________________________________________
dropout_72 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_72 (Dense)             (None, 1)    

Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
QWK:  0.9060309973632426
Fold # 5
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_142 (LSTM)              (None, 1, 300)            721200    
_________________________________________________________________
lstm_143 (LSTM)              (None, 64)                93440     
_________________________________________________________________
dropout_73 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_73 (Dense)             (None, 1)                 65        
Total params: 814,705
Trainable params: 814,705
Non-trainable params: 0
____________________________________

Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
QWK:  0.9175050916768875
Average Quadratic Weighted Kappa after 5-fold cross validation for average word2vec  0.9073


In [126]:
results_sum = train_model(X, y, dataset)
print("Average Quadratic Weighted Kappa after 5-fold cross validation for sum word2vec ",np.around(np.array(results_sum).mean(),decimals=4))

Fold # 1




_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_144 (LSTM)              (None, 1, 300)            721200    
_________________________________________________________________
lstm_145 (LSTM)              (None, 64)                93440     
_________________________________________________________________
dropout_74 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_74 (Dense)             (None, 1)                 65        
Total params: 814,705
Trainable params: 814,705
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
E

Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
QWK:  0.9672305787073104
Fold # 3
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_148 (LSTM)              (None, 1, 300)            721200    
_________________________________________________________________
lstm_149 (LSTM)              (None, 64)                93440     
_________________________________________________________________
dropout_76 (Dropout)         (None

Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
QWK:  0.9717122099661926
Fold # 4
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_150 (LSTM)              (None, 1, 300)            721200    
_________________________________________________________________
lstm_151 (LSTM)              (None, 64)                93440     
_________________________________________________________________
dropout_77 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_77 (Dense)             (None, 1)                

Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
QWK:  0.968605139334301
Fold # 5
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_152 (LSTM)              (None, 1, 300)            721200    
_________________________________________________________________
lstm_153 (LSTM)              (None, 64)                93440     
_________________________________________________________________
dropout_78 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_78 (Dense)             (None, 1)                 65        
Total params: 814,705
Trainable params: 814,705
Non-trainable params: 0
_____________________________________________________________

Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
QWK:  0.9731241327645618
Average Quadratic Weighted Kappa after 5-fold cross validation for sum word2vec  0.9706


In [127]:
# Print final results
print("Average Quadratic Weighted Kappa after 5-fold cross validation for min + max word2vec ",np.around(np.array(results_min_max).mean(),decimals=4))
print("Average Quadratic Weighted Kappa after 5-fold cross validation for average word2vec ",np.around(np.array(results_average).mean(),decimals=4))
print("Average Quadratic Weighted Kappa after 5-fold cross validation for sum word2vec ",np.around(np.array(results_sum).mean(),decimals=4))

Average Quadratic Weighted Kappa after 5-fold cross validation for min + max word2vec  0.9713
Average Quadratic Weighted Kappa after 5-fold cross validation for average word2vec  0.9073
Average Quadratic Weighted Kappa after 5-fold cross validation for sum word2vec  0.9706


# Experimental Setup

## K-fold cross validation
Each model was trained on 80% of the testing set, and testing was performed on the final 20%. For each model, 5-fold cross validation was used for training and testing to avoid over-fitting. 5-fold cross-validation involves randomly partitioning the set into 5 equal sized subsets, of which a single subset is used as validation data and the remaining 4 subsets are used for testing. Then, the cross-validation process is repeated 5 times so that each of the subsets are used as the validation data once and only once. This helps reduce over-fitting by using each sample for both training and validation. Furthermore, it was used for most of the other models in this area, so it helped to even the playing field for a more fair comparison between different models.

## Metrics
The evaluation metric was Quadratic Weighted Kappa, QWK, as per the ASAP competition. It takes into account as the baseline the possibility of agreement occuring by chance, and it typically varies from 0 (random agreement) to 1 (complete agreement). It is also possible to get a negative score if there is less agreeement than expected by chance. It is calculated between the model's predictions for the scores and the human grading scores for each essay. The QWK for each model is reported as the average from the five fold cross validation.

\begin{equation}
k = 1 - \frac{{\sum_{i, j}}{W_{i, j}}{O_{i, j}}}{{\sum_{i, j}}{W_{i, j}}{E_{i, j}}}
\end{equation}

# Results
It is worth noting that the Kaggle competition had a test set for which the ground truth was not publicly released. Because of this, we could only test by using 20% of the training set as a testing set. This means it is not a fair comparison to directly compare our results with the Kaggle competition. Results for the models are as follows, all of which are based on a 5-fold cross validation using 80% of the training data for training and 20% for testing. 


| Model  |  QWK Score | 
|---|---|
|Kaggle competition best score|0.801|
|Human grading|0.860 | 
|Logistic regression|0.847|
|State of the art (LSTM) |0.961 |
|Average Word2Vec and LSTM| 0.907|
|Min+max Word2Vec and LSTM|0.971|
|Sum Word2Vec and LSTM|0.971|


# Analysis of Results
The best score during the Kaggle competition was a QWK of 0.801. The Kaggle competition had a larger, hidden test set for which no gold standard is available, so not being able to test on that set is a contributer to our score being so high. The state of the art performance on this dataset was 0.96 in research and 0.961 from open-source work. We built upon their LSTM architecture and used the hyperparameters from the state of the art, and performed better both by taking the min plus the max of the vector representations of the essays (0.971) and got the highest score by using the sum of the vector representations of the sentences (0.971). 

# Future Work
Given the limitations of the dataset, it would be nice to have more robust data to test the models on. For example, four of the datasets were graded for content and not writing ability. Writing quality is a very important measure to take essay score into account, so it would be more beneficial to train the models on essay sets that were graded based on writing ability. We also only used the first human grader's score as the score, since a lot of essays did not include a second human grader score. Testing the models on the average of both scores would allow for a better comparison between human grading and the machine grading. 
