# Deep Learning for Automated Essay Scoring

## Introduction

Automated essay scoring (AES) is an NLP task that aims to predict the score of an essay based on a certain set of essay quality metrics. The score depends on the grammatical, organizational, and content features of the essays. Human raters establish rubrics and provide scores based on these criteria. However, employing human raters can pose challenges due to the large number of essays to be graded (which slows down the feedback loop) and the inconsistent grades (different raters may assign different scores to the same essay, or a rater may assign different scores to the same essay if evaluated on different days).

AES systems are computer systems that simulate the scoring behavior of human raters and address the problems described above. Several kinds of models are used in AES systems, but the most crucial aspect of any of them is essay representation, or encoding: capturing features from the essays that help measure their quality. Manual feature engineering can extract lexical, syntactic, or semantic features, and this approach has been employed in industrial AES systems. Its drawbacks are limited generalizability and the effort that the feature engineering itself requires.

Deep learning has become a go-to approach for numerous artificial intelligence tasks, often achieving strong results. It reduces the need for manual feature engineering because the model learns its own representation of the input during training.

In this project, I demonstrate the use of deep learning for automated essay scoring.

## Implementation

### Libraries
As the following code snippet shows, I imported a number of libraries from <code>PyTorch</code>, <code>scikit-learn</code>, and the Python standard library (<code>collections</code>), along with <code>pandas</code>, <code>matplotlib</code>, and <code>numpy</code>.
 - <code>torch</code> is a deep learning framework, which I used for building, training, and testing my models.
 - <code>torchtext</code> is a PyTorch sub-library for text data, which I used to tokenize and vectorize the essays.
 - <code>pandas</code> is a data manipulation tool, which I used for loading the data from disk.
 - <code>matplotlib</code> is a data visualization library, which I used for generating graphs.
 - <code>numpy</code> is a library for large, multi-dimensional arrays, which I used for transforming data into arrays.
 - <code>scikit-learn</code> is a popular machine learning library, which I used for measuring rater agreement (<code>cohen_kappa_score</code>).

In [None]:
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from torchtext.data import get_tokenizer
from torchtext.vocab import vocab
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score
from collections import Counter, OrderedDict

tokenizer = get_tokenizer('basic_english')

### The Model
The following section shows the design of a multilayer perceptron model. The model has an embedding layer, two linear layers, and two activation functions (ReLU and sigmoid).
 - The embedding layer: <code>nn.Embedding</code> maps each token index to a dense vector of size <code>embedding_dim</code>. The intent is for these vectors to capture semantic and syntactic information from the essays.
 - The linear layers: <code>nn.Linear</code> represents a linear transformation. The model has two linear layers. The first takes the pooled embedding of size <code>embedding_dim</code> and generates an output of size <code>hidden_dim</code>. The second generates the score (output).
 - Activation functions: <code>nn.ReLU</code> and <code>nn.Sigmoid</code> add nonlinearity to the outputs of the linear layers. ReLU is applied to the first linear layer, whereas sigmoid is applied to the second linear layer.

This class has a constructor (<code>__init__</code>) and a forward pass (<code>forward</code>). The constructor sets up the layers and the activation functions. The <code>forward</code> method defines the order of computation. The essay input (already transformed into token indices) passes through the embedding layer, which is meant to capture the semantic and syntactic information of the essay. Mean pooling is then applied over the token dimension, so each essay is represented by a single vector. The first linear layer takes the mean-pooled vector, and a ReLU activation function is applied to its output. Finally, the second linear layer generates a value, which the sigmoid activation function squashes into a score between 0 and 1.

In [None]:
class MLP(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, num_classes)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, essay):
        embedded = self.embedding(essay)
        hidden = torch.mean(embedded, dim=1)
        layer1 = self.linear1(hidden)
        layer1_relu = self.relu(layer1)
        layer2 = self.linear2(layer1_relu)
        output = self.sigmoid(layer2)
        return output
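
As a quick sanity check of the forward pass described above, the sketch below pushes a toy batch through a compact copy of the model. The class is repeated only so that the snippet runs on its own, and the sizes (a vocabulary of 20, 4 essays of 12 tokens) are made-up values for illustration, not the settings used later.

```python
import torch
from torch import nn

class MLP(nn.Module):
    # Same layers as the class above, repeated so the snippet is self-contained.
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, essay):
        # (batch, tokens) -> (batch, tokens, embedding_dim) -> (batch, embedding_dim)
        pooled = torch.mean(self.embedding(essay), dim=1)
        hidden = torch.relu(self.linear1(pooled))
        return torch.sigmoid(self.linear2(hidden))  # (batch, num_classes)

torch.manual_seed(0)
model = MLP(vocab_size=20, embedding_dim=8, hidden_dim=16, num_classes=1)
batch = torch.randint(0, 20, (4, 12))  # 4 toy essays, 12 token indices each
output = model(batch)
print(output.shape)  # one value per essay
```

Each essay yields a single value between 0 and 1, which is why the training targets are normalized to the same range.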

### Custom Dataset class
I created a custom dataset class, <code>ASAPDataset</code>, that takes a list of samples and the vocabulary. The class has three methods: <code>__init__()</code>, <code>__len__()</code>, and <code>__getitem__()</code>. <code>__getitem__()</code> fetches a sample from the ASAP-AES dataset at the given index, tokenizes the essay, and maps its tokens to vocabulary indices. The <code>Dataset</code> base class provides a mechanism to load, preprocess, and iterate over the dataset.

In addition, a <code>collate_fn</code> function handles padding within a batch. It first finds the length of the longest essay in the batch and then pads every shorter essay to that length, so that all essay vectors in the batch have the same length. The padding is represented by <code>0</code>.

In [None]:
class ASAPDataset(Dataset):
    def __init__(self, data, vocab):
        self.data = data
        self.vocab = vocab

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sample = self.data[index]
        essay = sample[2]
        essay = self.vocab(tokenizer(essay))
        score = sample[3]
        return essay, score

def collate_fn(batch):
    essays, scores = zip(*batch)
    max_length = max([len(entry) for entry in essays])
    padded_essays = []
    for tokens in essays:
        padded_essay = tokens + [0] * (max_length - len(tokens))
        padded_essays.append(padded_essay)
    return torch.tensor(padded_essays, dtype=torch.int64), torch.tensor(scores, dtype=torch.int64)
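
To make the padding step concrete, here is a minimal pure-Python sketch of the same logic on a toy batch. The helper name `pad_batch` and the token indices are made up for illustration; the real `collate_fn` additionally converts the result to tensors.

```python
def pad_batch(essays, pad_index=0):
    # Pad every list of token indices to the length of the longest one.
    max_length = max(len(tokens) for tokens in essays)
    return [tokens + [pad_index] * (max_length - len(tokens)) for tokens in essays]

batch = [[5, 3, 9], [7], [2, 4]]
padded = pad_batch(batch)
print(padded)  # [[5, 3, 9], [7, 0, 0], [2, 4, 0]]
```

Because the model averages the embeddings over all positions, the padded positions also contribute to the mean. Masking them out would be a possible refinement, but it is not what this notebook does.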

### The Learning
In this section, I defined two functions, <code>training</code> and <code>testing</code>. The training function takes the model, the optimizer, the data, the loss function, and the prompt number. The model is an instance of the MLP class, the optimizer is RMSprop (as set in the training loop further below), the data is batched by the <code>DataLoader</code>, and the criterion is the mean squared error (MSE) loss.

The training function is responsible for the learning component of the model. The model takes a batch of essays and produces one output per essay. Each output lies in the range 0-1 because it is squashed by the <code>sigmoid</code> activation function. To compute the loss, the actual scores are transformed into the same 0-1 range using <code>Min-Max Normalization</code>.
$$
    \text{normalized score} = \frac{\text{score} - \min}{\max - \min}
$$
where score is the essay score, min is the minimum score of the essay's prompt, and max is the maximum score of the essay's prompt.
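
As a worked example, assume an essay from prompt 1, whose score range is 2 to 12 according to the <code>essay_set</code> dictionary defined later, with a score of 8:
$$
    \frac{8 - 2}{12 - 2} = 0.6
$$
Going the other way, a model output of 0.6 for the same prompt maps back to $0.6 \times (12 - 2) + 2 = 8$.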

The other function in this section is the testing function, which is used to evaluate the performance of the model. For evaluation, the output values are transformed back into the original score range of the prompt. The model is evaluated on two measures: the loss, and the agreement of the AES system with the human raters. The loss is computed with <code>MSELoss</code> from <code>PyTorch</code>.
$$
    \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\text{output}_i - \text{score}_i)^2
$$
where $\text{output}_i$ is the score predicted by the model for essay $i$, $\text{score}_i$ is the actual score from the dataset, and $n$ is the number of essays. The predicted score takes a different form in training and testing. In the training phase, the predicted score is in the range 0-1, whereas during testing it is transformed into the score range of the prompt (see the <code>essay_set</code> variable for the actual score range of each prompt).

The other metric used to measure the performance of the model is rater agreement. <code>scikit-learn</code> has an implementation of Cohen's kappa, <code>cohen_kappa_score</code>, which measures the level of agreement between two raters. Here it is called with <code>weights='quadratic'</code>, which gives the quadratic weighted kappa (QWK). The score ranges from -1 to 1, where 1 indicates complete agreement, 0 indicates agreement equivalent to chance, and -1 indicates complete disagreement.
$$
    k = 1 - \frac{\sum_{i,j} W_{i,j}O_{i,j}}{\sum_{i,j} W_{i,j}E_{i,j}}
$$
where $O_{i,j}$ is a histogram matrix holding the number of essays with an actual rating of $i$ and a predicted rating of $j$, and $E_{i,j}$ is a histogram matrix of expected ratings, calculated as the outer product of the histogram vector of actual ratings and the histogram vector of predicted ratings, normalized so that $E$ and $O$ have the same sum.
$$
    W_{i, j} = \frac{(i-j)^2}{(R-1)^2}
$$
where $W_{i,j}$ is a weight matrix that is calculated based on the difference between actual and predicted values, and $R$ is the number of possible ratings.
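
The formulas above can be sketched directly in NumPy. This is only an illustration under the assumption that ratings are integers in a known range; the function name `quadratic_weighted_kappa` is made up here, and the notebook itself relies on `cohen_kappa_score` from scikit-learn.

```python
import numpy as np

def quadratic_weighted_kappa(actual, predicted, min_rating, max_rating):
    n = max_rating - min_rating + 1  # R, the number of possible ratings
    # O[i, j]: number of essays with actual rating i and predicted rating j.
    observed = np.zeros((n, n))
    for a, p in zip(actual, predicted):
        observed[a - min_rating, p - min_rating] += 1
    # E: outer product of the two rating histograms, scaled to the same sum as O.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # W[i, j] = (i - j)^2 / (R - 1)^2
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 0, 3)
reversed_ = quadratic_weighted_kappa([0, 1, 2, 3], [3, 2, 1, 0], 0, 3)
print(perfect, reversed_)  # complete agreement vs. complete disagreement
```

Note that `cohen_kappa_score` builds its label set from the observed data unless `labels` is passed, so its result can differ from this sketch when some ratings in the range never occur.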

In [None]:
def training(model, optimizer, data, criterion, prompt):
    model.train()
    loss = 0.0
    for (essay, scores) in data:
        optimizer.zero_grad()
        
        output = model(essay)
        scores = torch.tensor([min_max_normalization(score.item(), prompt) for score in scores], dtype=torch.float32)
        scores = scores.reshape(-1, 1)
        loss = criterion(output, scores)
        loss.backward()
        optimizer.step()

def testing(model, data, criterion, prompt):
    model.eval()
    total_loss = 0.0
    scores_4_qwk, output_4_qwk = [], []
    with torch.no_grad():
        for (essay, scores) in data:
            scores = scores.reshape(-1, 1)
            output = model(essay)
            output = torch.tensor([scaler(out.item(), prompt) for out in output ], dtype=torch.float32)
            output = output.reshape(-1, 1)
            loss = criterion(output, scores.float())  # cast integer targets so both tensors are float
            total_loss += loss
            
            scores_4_qwk.append(scores)
            output_4_qwk.append(output)
            
    score_list = [score.item() for tensor_score in scores_4_qwk for score in tensor_score]
    output_list = [int(output.item()) for tensor_output in output_4_qwk for output in tensor_output]
    
    qwk = cohen_kappa_score(score_list, output_list, weights='quadratic')
    return qwk, total_loss / len(data)
        

### Loading the dataset
The ASAP-AES dataset is a popular dataset among AES researchers. The dataset can be downloaded from [Kaggle](https://www.kaggle.com/c/asap-aes). The dataset has 12976 entries and 28 columns (features). For this project, I am only interested in 4 features, namely essay_id, essay_set, essay, and domain1_score.
 - essay_id is a unique id column for each entry.
 - essay_set is the essay category. There are 8 essay sets, and each set represents a different question and a different scoring range.
 - essay is the student's text response to the given prompt. This column is the important feature in the scoring process.
 - domain1_score is the score column. This field is the sum of the scores from two raters. In this project, the target value is this field.

In [None]:
file_path = '/kaggle/input/asap-aes/training_set_rel3.tsv'
columns = ['essay_id', 'essay_set', 'essay', 'domain1_score']
asap = pd.read_csv(file_path, sep='\t', encoding='ISO-8859-1', usecols=columns)
asap.head()

The following section contains an essay_set dictionary and two functions. The essay_set dictionary contains the score range of each prompt. For example, essay_set 1 has a minimum value of 2 and a maximum value of 12. These values are taken from the dataset description.

The min_max_normalization function transforms a score from the score range of its prompt into the range 0-1, and the scaler function transforms a value in the range 0-1 back into the score range of the prompt, rounded to the nearest integer.

In [None]:
essay_set = {
    1: (2, 12),
    2: (1, 6),
    3: (0, 3),
    4: (0, 3),
    5: (0, 4),
    6: (0, 4),
    7: (0, 30),
    8: (0, 60)
}
def min_max_normalization(score, prompt):
    return (score - essay_set[prompt][0]) / (essay_set[prompt][1] - essay_set[prompt][0])
def scaler(score, prompt):
    return round(score * (essay_set[prompt][1] - essay_set[prompt][0]) + essay_set[prompt][0])
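
As a check that the two functions really are inverses on the valid scores, the following sketch repeats them (with the same score ranges) and runs a round trip over every integer score of every prompt. The loop and the variable names in it are only for illustration.

```python
essay_set = {1: (2, 12), 2: (1, 6), 3: (0, 3), 4: (0, 3),
             5: (0, 4), 6: (0, 4), 7: (0, 30), 8: (0, 60)}

def min_max_normalization(score, prompt):
    low, high = essay_set[prompt]
    return (score - low) / (high - low)

def scaler(score, prompt):
    low, high = essay_set[prompt]
    return round(score * (high - low) + low)

# Every valid integer score should survive normalization followed by rescaling.
round_trip_ok = all(
    scaler(min_max_normalization(score, prompt), prompt) == score
    for prompt, (low, high) in essay_set.items()
    for score in range(low, high + 1)
)
print(round_trip_ok)   # True
print(scaler(0.25, 1))  # 0.25 * 10 + 2 = 4.5, which round() turns into 4
```

The last line shows a detail worth knowing: Python's `round` sends exact halves to the nearest even integer, so a model output that lands exactly between two scores is not always rounded up.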

The dataset is split into training, validation, and test sets: 60% of the data is used for training, 20% for validation, and the remaining 20% for testing. <code>random_split</code> from PyTorch is used to split the data into the three sets, with a fixed seed so that the split is reproducible.

In [None]:
def split_dataset(prompt): 
    train, val, test = torch.utils.data.random_split(asap[asap['essay_set']==prompt].values, [0.6, 0.2, 0.2], generator=torch.Generator().manual_seed(42))
    return train, val, test
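
The behavior of the split can be seen on toy data. The sketch below assumes a PyTorch version in which `random_split` accepts fractions (1.13 or later, as far as I know), and uses a list of ten integers in place of the essay rows.

```python
import torch
from torch.utils.data import random_split

samples = list(range(10))  # stand-in for the rows of one prompt

def split(seed):
    generator = torch.Generator().manual_seed(seed)
    return random_split(samples, [0.6, 0.2, 0.2], generator=generator)

train, val, test = split(42)
train_again, _, _ = split(42)

print(len(train), len(val), len(test))       # 6 2 2
print(train.indices == train_again.indices)  # same seed, same split
```

Each returned object is a `Subset` that stores indices into the original data, so the three parts never overlap.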

### Text representation
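The following section shows where the text is transformed into numbers. A vocabulary is built from the tokens of the essays, and every token is represented by an integer index; tokens that are not in the vocabulary map to <code>&lt;unk&gt;</code>.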

In [None]:
def essay_vectorizer(text):
    tokenized_essay = [tokenizer(field[2].lower()) for field in text]

    tokens = set()
    for essay in tokenized_essay:
        tokens.update(essay)
    
    tokens = sorted(tokens)  # sort so that token indices are reproducible across runs

    vocab_essay = vocab(OrderedDict([(token, 1) for token in tokens]), specials=['<unk>'])
    vocab_essay.set_default_index(vocab_essay['<unk>'])

    return vocab_essay
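
What the vocabulary does can be illustrated without torchtext. The following is a plain-dictionary stand-in, not torchtext's implementation; `build_vocab`, `encode`, and the toy essays are hypothetical names and data used only for this sketch.

```python
def build_vocab(tokenized_essays, unk_token='<unk>'):
    # The unknown token takes index 0; the remaining tokens follow in sorted order.
    tokens = sorted({token for essay in tokenized_essays for token in essay})
    index_of = {unk_token: 0}
    for token in tokens:
        index_of[token] = len(index_of)
    return index_of

def encode(tokens, index_of, unk_token='<unk>'):
    # Tokens missing from the vocabulary fall back to the unknown index.
    unk_index = index_of[unk_token]
    return [index_of.get(token, unk_index) for token in tokens]

train_essays = [['the', 'cat', 'sat'], ['the', 'dog']]
vocab_map = build_vocab(train_essays)
encoded = encode(['the', 'bird', 'sat'], vocab_map)
print(vocab_map)  # {'<unk>': 0, 'cat': 1, 'dog': 2, 'sat': 3, 'the': 4}
print(encoded)    # [4, 0, 3]
```

Since the vocabulary is built from the training split only, words that appear only in the validation or test essays are encoded as `<unk>`. In the notebook's setup `<unk>` appears to sit at index 0 as well, which is the same value `collate_fn` uses for padding.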

In [None]:
def essay_dataloader(prompt, batch_size):

    train, val, test = split_dataset(prompt)
    vocab_essay = essay_vectorizer(train)

    asap_train = ASAPDataset(train, vocab_essay)
    train_dl = DataLoader(asap_train, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)

    asap_val = ASAPDataset(val, vocab_essay)
    val_dl = DataLoader(asap_val, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

    asap_test = ASAPDataset(test, vocab_essay)
    test_dl = DataLoader(asap_test, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
    
    return train_dl, val_dl, test_dl, vocab_essay

In [None]:
def graphs(result, title, xlabel, ylabel, num_epochs, prompt):
    epochs = np.arange(1, num_epochs + 1, 1)
    #result = np.array(result)

    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    if isinstance(result[0], list):
        for idx, prompt_result in enumerate(result):
            plt.plot(epochs, np.array(prompt_result), label=f'Prompt {idx + 1}')
    else:
        plt.plot(epochs, np.array(result), label=f'Prompt {prompt}')
    plt.legend()
    plt.show()

In the ASAP-AES dataset, there are eight prompts. Therefore, a separate model is trained, validated, and tested for each prompt.

In [None]:
prompt = 1
embedding_dim = 50
criterion = nn.MSELoss()
hidden_dim = 100
EPOCHS = 50
batch_size = 32
num_classes = 1

qwk_prompts_val, mse_prompts_val, qwk_prompts_test, mse_prompts_test  = [], [], [], []

for prompt in range(1, 9):
    train_dl, val_dl, test_dl, vocab_essay = essay_dataloader(prompt, batch_size)
    vocab_size = len(vocab_essay)
    
    model = MLP(vocab_size, embedding_dim, hidden_dim, num_classes)
    optimizer = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9)
    
    qwk_epoch, mse_epoch = [], []
    for epoch in range(0, EPOCHS):
        training(model, optimizer, train_dl, criterion, prompt) # Training
        qwk, mse = testing(model, val_dl, criterion, prompt) # Validation
        
        qwk_epoch.append(qwk)
        mse_epoch.append(mse.item())
        
    qwk_prompts_val.append(qwk_epoch)
    mse_prompts_val.append(mse_epoch)

    #Testing
    qwk_test, mse_test = testing(model, test_dl, criterion, prompt)
    qwk_prompts_test.append(qwk_test)
    mse_prompts_test.append(mse_test.item())  # store a plain float, as for the validation MSE

### Validation visualization
The following graphs visualize rater agreement and error on the validation set. The <code>y-axis</code> of <code>Model Error(MSE)</code> shows an error of 0-35. The error values are this large because the output scores are transformed back into the original score range of the prompts during validation and testing, so the error is measured in score points rather than on the 0-1 scale.
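
The effect of the score range on the error can be sketched with made-up numbers. The example assumes prompt 8, whose range is 0 to 60, and leaves out the rounding step of <code>scaler</code> to keep the arithmetic simple.

```python
def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

low, high = 0, 60  # score range of prompt 8
pred_norm, target_norm = [0.5, 0.6], [0.6, 0.5]  # toy values on the 0-1 scale

# Map both lists back to the original score range.
pred_orig = [p * (high - low) + low for p in pred_norm]
target_orig = [t * (high - low) + low for t in target_norm]

mse_norm = mse(pred_norm, target_norm)  # about 0.01
mse_orig = mse(pred_orig, target_orig)  # about 36
print(mse_norm, mse_orig)
```

An error of 0.1 on the normalized scale corresponds to 6 score points for this prompt, so the MSE grows by a factor of $(max - min)^2 = 3600$.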

In [None]:
graph_settings = {
    'title': ('Model performance(QWK)', 'Model Error(MSE)'),
    'xlabel': '# Epochs',
    'ylabel': ('Agreement (QWK)', 'Error (MSE)')
}

graphs(qwk_prompts_val, graph_settings['title'][0], graph_settings['xlabel'], graph_settings['ylabel'][0], EPOCHS, prompt)

In [None]:
graphs(mse_prompts_val, graph_settings['title'][1], graph_settings['xlabel'], graph_settings['ylabel'][1], EPOCHS, prompt)

### Model Testing
The following graph shows the performance of the model on the test set for each prompt. The number of epochs is fixed to 50 and the number of linear layers to 2.

In [None]:
prompts = range(1, 9)
plt.bar(prompts, qwk_prompts_test)
plt.xlabel('Prompts')
plt.ylabel('Agreement (QWK)')
plt.title('Model performance (QWK)')
plt.show()

## Summary
In this project, I demonstrated the use of a multilayer perceptron for the automated essay scoring task. The performance varies with the number of epochs and the number of layers.