# Deep Learning for Automated Essay Scoring

## Introduction

Automated essay scoring (AES) is an NLP task that aims to predict the score of an essay based on a certain set of essay quality metrics. The score depends on the grammatical, organizational, and content features of the essays. Human raters establish rubrics and provide scores based on these criteria. However, employing human raters can pose challenges due to the large number of essays to be graded (which slows down the feedback loop) and the inconsistent grades (different raters may assign different scores to the same essay, or a rater may assign different scores to the same essay if evaluated on different days).

AES systems are computer systems that simulate the scoring characteristics of human raters and address the aforementioned problems. There are several models used in AES systems. The most crucial aspect of an AES system is essay representation or encoding. Essay representation involves capturing useful features from the essays that help measure their quality. Manual feature engineering can extract features in the form of lexical, syntactic, or semantic features. This approach has been employed in industrial AES systems. However, such approaches have drawbacks in terms of generalizability and requiring feature engineering tasks.

Deep learning has become a go-to approach for numerous artificial intelligence tasks, consistently achieving outstanding performance results. Deep learning eliminates the need for feature engineering as it learns automatically behind the scenes.

In this project, I will demonstrate the use of deep learning for automated essay scoring tasks.

## Implementation

### Libraries
As it is seen in the following code snippet, I imported a number of libraries from <code>PyTorch</code>, <code>sklearn</code> and <code>python (collections)</code>. 
 - <code>torch</code> is a deep learning framework that I used it for building, training and testing my models.
 - <code>torchtext</code> is sub-library in PyTorch for text data that I used it to vectorize and tokenize the essays.
 - <code>Pandas</code> is a data manipulation tool which I used it for loading the data from the disk.
 - <code>matplotlib</code> is data visualization library which I used it for generating graphs.
 - <code>numpy</code> is large and multi-dimensional arrays library which I used it for tranforming data into array.
 - <code>scikit-learn</code> is a popular machine learning library which I used it for measuring rater agreement(<code>cohen_kappa_score</code>)

In [1]:
import os
import math
import time
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
import random
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
from sklearn.model_selection import KFold
from sklearn.metrics import cohen_kappa_score
import matplotlib.pyplot as plt
import seaborn as sns

from transformers import AutoTokenizer, AutoModel


### The Model
The following section shows the design of a multilayer perceptron model. The model has an embedding layer, 2 linear layers, 2 acitivation functions (ReLU and Sigmoid). 
 - The embedding layer: <code>nn.Embedding</code> is used to capture semantic and syntactic information from the essays.
 - The linear layers: <code>nn.Linear</code> represents linear transformation. In the model, there are two linear layers. The first linear layer takes an input from the embedding layer with embedding dimension (embedding_dim) size and generate an output with hidden dimension (hidden_dim) size. The second linear layer is used to generate the score (output).
 - Activation function: <code>nn.ReLU</code> and <code>nn.Sigmoid</code> are the activation function used to transform the linear layer into nonlinear. ReLU is applied to the first linear layer whereas sigmoid is applied tothe second linear layer.

This class has a constructor (<code>__init__</code>) and a forward pass (<code>forward</code>). In the constructor, the functions and the linear layers are setted. In the <code>forward</code> method, the order of computation is defined. The essay input (transformed into numbers) passed through the embedding layer. The intution here is it will capture semantic and syntactic information of the essay. And then, a mean pooling is applied. The first linear layer took the ouput of the averge pooled values and a ReLU actication function is applied over it. Finally, the second linear layer generates a value and adjusted using sigmoid activation function into a score.

In [2]:

def normalize_scores(scores, set_id, min_scores, max_scores):
    mi = min_scores[set_id-1]
    ma = max_scores[set_id-1]
    return (scores - mi) / (ma - mi)

def denormalize_scores(scores_norm, set_id, min_scores, max_scores):
    mi = min_scores[set_id-1]
    ma = max_scores[set_id-1]
    return scores_norm * (ma - mi) + mi

In [3]:
class EssayDataset(Dataset):
    def __init__(self, texts, scores, tokenizer, max_len=512, embedder_name=None):
        self.texts = texts
        self.scores = scores
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.embedder_name = embedder_name  # keep track of which model we use

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        score = self.scores[idx]

        # If using E5, prepend "passage: " to the essay
        if self.embedder_name and "e5" in self.embedder_name.lower():
            text = "passage: " + text

        enc = self.tokenizer(
            text,
            max_length=self.max_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )

        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            "score": torch.tensor(score, dtype=torch.float)
        }


### Custom Dataset class
I created a custom data <code>ASAPDataset</code> that takes a list of data and the vocab. This class contains three methods <code>__init__()</code>, <code>__len__()</code>, and <code>__getitem__()</code>. <code>__getitem__()</code> fetchs a sample from asap-aes dataset based on the given index. The <code>Dataset</code> provides a mechanism to load, preprocess, and iterate over the dataset.

In addition, there is <code>collate_fn</code> function defined to handle padding within the essay vectors. First, the maximum token length is identified and then set all the vectors of the essays to have the same length. The padding is represented using <code>0</code>.

In [4]:
def get_encoder(model_name, unfreeze_last_n=0):
    if model_name == "roberta":
        name = "roberta-base"
    elif model_name == "bge":
        name = "BAAI/bge-base-en"
    elif model_name == "e5-base":
        name = "intfloat/e5-base-v2"
    elif model_name == "e5-large":
        name = "intfloat/e5-large"
    elif model_name == "qwen":
        name = "Qwen/Qwen3-Embedding-0.6B"   # qwen 0.6B close checkpoint
    elif model_name == "deberta":
        name = "microsoft/deberta-v3-base"
    else:
        raise ValueError("Unknown model")
    
    tokenizer = AutoTokenizer.from_pretrained(name, use_fast=True)
    encoder = AutoModel.from_pretrained(name)
    hidden_size = encoder.config.hidden_size

    # Freeze all params
    for param in encoder.parameters():
        param.requires_grad = False

    # Unfreeze last `unfreeze_last_n` layers if requested
    if hasattr(encoder, "encoder"):  # works for RoBERTa/E5/BGE
        layers = encoder.encoder.layer
        for layer in layers[-unfreeze_last_n:]:
            for param in layer.parameters():
                param.requires_grad = True
    elif hasattr(encoder, "model"):  # some Qwen variants wrap transformer under .model
        if hasattr(encoder.model, "layers"):
            layers = encoder.model.layers
            for layer in layers[-unfreeze_last_n:]:
                for param in layer.parameters():
                    param.requires_grad = True

    return encoder, tokenizer, hidden_size


### The Learning
In this section, I defined two functions, <code>training</code> and <code>testing</code>. The training function takes model, optimizer, dataset, and loss function. The model is an instance of the MLP class, optimizer is setted to Adam optimizer, the data is a batched data processd by the <code>DataLoader</code> and the criterion is a mean squared loss (MSE) function.

The training function is responsible for the learning component of the model. The model takes a batch of essays and produce an output with similar batch size. The output is in the range of 0-1 as it is squashed using <code>sigmoid</code> activation function. By transforming the actual score into the range of 0-1, loss of the model is computed. For transformation of the actual score into 0-1, I employed <code>Min-Max Normalization</code>.
$$
    min-max-normalization = \frac{score - min}{max  - min}
$$
where score is the essay score, min is the minimum score in the dataset and max is the maximum score in the dataset.

The other function in this section is, testing function. This function is used to evaluate the performance of the model. To evaluate the model, the output values are transformed into the actual score format. The model is evaluated against minimizing the loss and the agreement of AES system with the human raters. For minimizing the loss, <code>MSELoss</code> from <code>PyTorch</code> is employed.
$$
    MSE = \frac{1}{n}\sum(output - scores)^2
$$
where the output is score predicted by the model and scores are actual score from the dataset. The predicted score in both training and testing has a different form. In training phase, the predict score is in the range of 0-1 whereas during testing it is transformed into the range of in the dataset (please check for the actual scores range of the essay in the <code>essay_set</code> variable).

The other metrics used to measure the performance of the model is the raters' agreement. <code>scikit-learn</code> has an implementation of Cohen's kappa, <code>cohen_kappa_score</code>. This metrics measures the agreement level between raters. The score ranges from -1 to 1, where 1 indicates complete agreement, 0 agreement equivalent to chance and -1 complete disagreement.
$$
    k = 1 - \frac{\sum W_{i,j}O_{i,j}}{\sum W_{i,j}E_{i,j}}
$$
where $O_{i,j}$ is a histogram matrix with the number of predicted labels that have a rating  of $i$ (actual) that received a predicted value $j$, $E_{i,j}$ is a histogram matrix of expected ratings calculated as the outer product between the actual rating's histogram vector of ratings and the predicted rating's histogram vector of ratings.
$$
    W_{i, j} = \frac{(i-j)^2}{(R-1)^2}
$$
where $W_{i,j}$ is a weight matrix that is calculated based on the difference between actual and predicted values, and $R$ is the rating range.

In [5]:
class EssayRegressor(nn.Module):
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 1)
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        return self.mlp(cls_embedding)


### Loading the dataset
The ASAP-AES dataset is a popular dataset among AES researchers. The dataset can be downloaded from [Kaggle](https://www.kaggle.com/c/asap-aes). The dataset has 12976 entries and 28 columns (features). But for this project, I am only interested on 4 features, namely essay_id, essay_set, essay, and domain1_score.
 - essay_id is a unique id column for each entry
 - essay_set is an essay category. There are 8 essay sets, and each set represent different questions and different scoring range.
 - essay is a text response to the prompt given by student. This column is importtant feautre in the scoring process.
 - domain1_score is a score column. This field the summation of scores from two raters. In this project, the target value is this field.

In [6]:
file_path = 'training_set_rel3_cleaned.tsv'
columns = ['essay_id', 'essay_set', 'essay', 'domain1_score']
asap = pd.read_csv(file_path, sep='\t', encoding='ISO-8859-1', usecols=columns)
min_scores = [int(asap[asap["essay_set"] == s]["domain1_score"].min()) for s in range(1, 9)]
max_scores = [int(asap[asap["essay_set"] == s]["domain1_score"].max()) for s in range(1, 9)]
asap.head()

Unnamed: 0,essay_id,essay_set,essay,domain1_score
0,1,1,"Dear local newspaper, I think effects computer...",8
1,2,1,"Dear @CAPS1 @CAPS2, I believe that using compu...",9
2,3,1,"Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...",7
3,4,1,"Dear Local Newspaper, @CAPS1 I have found that...",10
4,5,1,"Dear @LOCATION1, I know having computers has a...",8


The following section contains an essay_set dictionary and two functions. The essay_set dictionary contains the score range of each prompts. For example, essay_set 1 has minimum value of 2 and a maximum value 12. These values are taken from the dataset description. 

The min_max_normalization and scaler functions are used to transform the scores from score range of the dataset into the range of 0-1 and vice versa.

The dataset is splited into train, validation and test dataset. For training, 60% of the data is used, 20% of the data is used for validation and the rest is used for testing. <code>random_split</code> from PyTorch is used to spilt the data into the three settings.

In [7]:
from sklearn.model_selection import train_test_split

def split_dataset(prompt, test_size=0.2, val_size=0.2, seed=42):
    df_prompt = asap[asap["essay_set"] == prompt].copy()

    train_val, test = train_test_split(df_prompt, test_size=test_size, random_state=seed)
    train, val = train_test_split(train_val, test_size=val_size/(1-test_size), random_state=seed)

    return train, val, test

from torch.utils.data import DataLoader

def get_dataloaders(train_df, val_df, test_df, tokenizer, prompt,
                    batch_size=16, max_len=512):
    """
    Build PyTorch dataloaders for a prompt (expects pandas DataFrames).
    Scores are normalized with normalize_scores before being passed to the Dataset.
    """
    # Normalize arrays (vectorized)
    train_scores_norm = normalize_scores(train_df["domain1_score"].values, prompt, min_scores, max_scores)
    val_scores_norm   = normalize_scores(val_df["domain1_score"].values,   prompt, min_scores, max_scores)
    test_scores_norm  = normalize_scores(test_df["domain1_score"].values,  prompt, min_scores, max_scores)

    train_dataset = EssayDataset(train_df["essay"].values, train_scores_norm, tokenizer, max_len=max_len)
    val_dataset   = EssayDataset(val_df["essay"].values,   val_scores_norm,   tokenizer, max_len=max_len)
    test_dataset  = EssayDataset(test_df["essay"].values,  test_scores_norm,  tokenizer, max_len=max_len)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader   = DataLoader(val_dataset,   batch_size=batch_size, shuffle=False)
    test_loader  = DataLoader(test_dataset,  batch_size=batch_size, shuffle=False)

    return train_loader, val_loader, test_loader





In [8]:
import torch
import torch.optim as optim
from tqdm import tqdm

def train_all_prompts(embedder,
                      prompts=range(1,9),
                      num_epochs=10,
                      batch_size=16,
                      lr=2e-5,
                      patience=3,
                      max_len=512,
                      device=None):
    """
    Loop over essay sets (prompts) and train separately for each.
    Saves results to results.csv (train_and_evaluate does that).
    Returns: pandas DataFrame summarizing returns from results.csv for the embedder run.
    """
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    summary = []

    for prompt in prompts:
        print("\n" + "="*30)
        print(f" Training Essay Set {prompt} with embedder '{embedder}'")
        print("="*30)

        # 1) get encoder & tokenizer for this embedder
        encoder, tokenizer, hidden_size = get_encoder(embedder)

        # 2) split dataset for this prompt
        train_df, val_df, test_df = split_dataset(prompt, test_size=0.2, val_size=0.2, seed=42)

        # 3) dataloaders (scores normalized inside)
        train_loader, val_loader, test_loader = get_dataloaders(train_df, val_df, test_df,
                                                                tokenizer, prompt,
                                                                batch_size=batch_size, max_len=max_len)

        # 4) model, optimizer, criterion
        model = EssayRegressor(encoder, hidden_size)
        optimizer = optim.AdamW(model.parameters(), lr=lr)
        criterion = nn.MSELoss()

        # 5) train/evaluate for this prompt
        train_losses, val_losses, val_qwks = train_and_evaluate(
            model, train_loader, val_loader, test_loader,
            optimizer, criterion, device, prompt, embedder,
            num_epochs, patience
        )

        # 6) read the last row for this prompt from results.csv (optional) and append to summary
        try:
            df_results = pd.read_csv("results.csv")
            df_prompt = df_results[(df_results["embedder"]==embedder) & (df_results["prompt"]==prompt)]
            if not df_prompt.empty:
                row = df_prompt.iloc[-1].to_dict()
            else:
                row = {"embedder": embedder, "prompt": prompt, "best_val_qwk": None, "best_val_mse": None, "test_qwk": None, "test_mse": None}
        except FileNotFoundError:
            row = {"embedder": embedder, "prompt": prompt, "best_val_qwk": None, "best_val_mse": None, "test_qwk": None, "test_mse": None}

        row["train_losses"] = train_losses
        row["val_losses"] = val_losses
        row["val_qwks"] = val_qwks
        summary.append(row)

    summary_df = pd.DataFrame(summary)
    return summary_df


### Text representation
The following section shows where the texts are transformed into numbers. Every text in the essay is represented by a number.

In [9]:
import os
import pandas as pd
from tqdm import tqdm

def train_and_evaluate(model, train_loader, val_loader, test_loader, 
                       optimizer, criterion, device, prompt, embedder, 
                       num_epochs=10, patience=3):

    model = model.to(device)
    best_val_qwk = -1.0
    best_val_mse = float("inf")
    patience_counter = 0

    train_losses, val_losses, val_qwks = [], [], []

    for epoch in range(1, num_epochs+1):
        model.train()
        running_loss = 0.0

        # Training loop with progress bar
        progress_bar = tqdm(train_loader, desc=f"Prompt {prompt} Epoch {epoch}", leave=False)
        for batch in progress_bar:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            scores = batch["score"].to(device).unsqueeze(1)

            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, scores)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            progress_bar.set_postfix(loss=loss.item())

        avg_train_loss = running_loss / len(train_loader)

        # Validation
        val_qwk, val_mse = evaluate(model, val_loader, criterion, prompt, device)
        train_losses.append(avg_train_loss)
        val_losses.append(val_mse)
        val_qwks.append(val_qwk)

        print(f"Prompt {prompt}, Epoch {epoch}: "
              f"Train Loss={avg_train_loss:.4f}, "
              f"Val QWK={val_qwk:.4f}, Val MSE={val_mse:.4f}")

        # Early stopping
        if val_qwk > best_val_qwk:
            best_val_qwk = val_qwk
            best_val_mse = val_mse
            patience_counter = 0
            torch.save(model.state_dict(), f"best_model_{embedder}_prompt{prompt}.pt")
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("Early stopping triggered.")
                break

    # Load best model before testing
    model.load_state_dict(torch.load(f"best_model_{embedder}_prompt{prompt}.pt"))

    # Final test evaluation
    test_qwk, test_mse = evaluate(model, test_loader, criterion, prompt, device)
    print(f"✅ Prompt {prompt} | Test QWK={test_qwk:.4f}, Test MSE={test_mse:.4f}")

    # Save results to CSV
    results_df = pd.DataFrame([{
        "embedder": embedder,
        "prompt": prompt,
        "best_val_qwk": best_val_qwk,
        "best_val_mse": best_val_mse,
        "test_qwk": test_qwk,
        "test_mse": test_mse
    }])

    results_df.to_csv("results.csv", mode="a", header=not os.path.exists("results.csv"), index=False)

    return train_losses, val_losses, val_qwks


In [10]:
from sklearn.metrics import cohen_kappa_score, mean_squared_error
import numpy as np

def evaluate(model, loader, criterion, prompt, device):
    """
    Evaluate model on loader.
    Returns: (qwk_on_raw_scale, mse_on_raw_scale)
    Note: loader should supply normalized scores (as used during training).
    """
    model.eval()
    all_preds_norm, all_labels_norm = [], []
    total_loss = 0.0
    n_batches = 0

    with torch.no_grad():
        for batch in loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            scores = batch["score"].to(device).unsqueeze(1)    # normalized scores

            outputs = model(input_ids, attention_mask)        # normalized-pred outputs
            loss = criterion(outputs, scores)
            total_loss += loss.item()
            n_batches += 1

            # Collect as numpy arrays
            all_preds_norm.extend(outputs.squeeze(1).cpu().numpy())
            all_labels_norm.extend(scores.squeeze(1).cpu().numpy())

    if len(all_preds_norm) == 0:
        return 0.0, 0.0  # safe fallback

    # Convert to numpy arrays
    all_preds_norm = np.array(all_preds_norm, dtype=float).flatten()
    all_labels_norm = np.array(all_labels_norm, dtype=float).flatten()

    # Denormalize both (raw score scale)
    preds_raw = denormalize_scores(all_preds_norm, prompt, min_scores, max_scores)
    labels_raw = denormalize_scores(all_labels_norm, prompt, min_scores, max_scores)

    # Round predictions to nearest integer and clip to valid range for QWK
    low, high = min_scores[prompt-1], max_scores[prompt-1]
    preds_rounded = np.clip(np.rint(preds_raw), low, high).astype(int)
    labels_int = labels_raw.astype(int)

    # QWK (on integer original-score scale) and raw-scale MSE
    try:
        qwk = cohen_kappa_score(labels_int, preds_rounded, weights="quadratic")
    except Exception:
        qwk = 0.0

    mse_raw = mean_squared_error(labels_raw, preds_raw)

    avg_loss = total_loss / n_batches if n_batches > 0 else 0.0
    return qwk, mse_raw


In [None]:
# choose device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Example: train Roberta per-prompt
summary_bge= train_all_prompts(
    embedder="bge",
    prompts=range(1,9),
    num_epochs=10,
    batch_size=8,
    lr=2e-5,
    patience=3,
    max_len=256,
    device=device
)

# See summary
print(summary_bge[["embedder","prompt","best_val_qwk","test_qwk","test_mse"]])


In [11]:
# choose device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Example: train Roberta per-prompt
summary_e5base = train_all_prompts(
    embedder="e5-base",
    prompts=range(1,9),
    num_epochs=10,
    batch_size=8,
    lr=2e-5,
    patience=3,
    max_len=256,
    device=device
)

# See summary
print(summary_e5base[["embedder","prompt","best_val_qwk","test_qwk","test_mse"]])



 Training Essay Set 1 with embedder 'e5-base'


                                                                                 

Prompt 1, Epoch 1: Train Loss=0.0285, Val QWK=0.5793, Val MSE=1.3398


                                                                                 

Prompt 1, Epoch 2: Train Loss=0.0170, Val QWK=0.6535, Val MSE=1.1099


                                                                                 

Prompt 1, Epoch 3: Train Loss=0.0134, Val QWK=0.6673, Val MSE=1.1773


                                                                                 

Prompt 1, Epoch 4: Train Loss=0.0107, Val QWK=0.7116, Val MSE=1.0656


                                                                                 

Prompt 1, Epoch 5: Train Loss=0.0102, Val QWK=0.7042, Val MSE=1.0644


                                                                                 

Prompt 1, Epoch 6: Train Loss=0.0090, Val QWK=0.6186, Val MSE=1.2026


                                                                                 

Prompt 1, Epoch 7: Train Loss=0.0074, Val QWK=0.7187, Val MSE=1.1444


                                                                                 

Prompt 1, Epoch 8: Train Loss=0.0073, Val QWK=0.6959, Val MSE=1.0340


                                                                                 

Prompt 1, Epoch 9: Train Loss=0.0065, Val QWK=0.6937, Val MSE=1.3473


                                                                                  

Prompt 1, Epoch 10: Train Loss=0.0057, Val QWK=0.6340, Val MSE=1.2697
Early stopping triggered.
✅ Prompt 1 | Test QWK=0.7244, Test MSE=1.0517

 Training Essay Set 2 with embedder 'e5-base'


                                                                                 

Prompt 2, Epoch 1: Train Loss=0.0227, Val QWK=0.5741, Val MSE=0.3444


                                                                                 

Prompt 2, Epoch 2: Train Loss=0.0143, Val QWK=0.5270, Val MSE=0.3733


                                                                                 

Prompt 2, Epoch 3: Train Loss=0.0123, Val QWK=0.5810, Val MSE=0.3536


                                                                                 

Prompt 2, Epoch 4: Train Loss=0.0096, Val QWK=0.6904, Val MSE=0.3281


                                                                                 

Prompt 2, Epoch 5: Train Loss=0.0072, Val QWK=0.6621, Val MSE=0.3137


                                                                                  

Prompt 2, Epoch 6: Train Loss=0.0071, Val QWK=0.6634, Val MSE=0.3513


                                                                                 

Prompt 2, Epoch 7: Train Loss=0.0056, Val QWK=0.6689, Val MSE=0.3435
Early stopping triggered.
✅ Prompt 2 | Test QWK=0.4957, Test MSE=0.4048

 Training Essay Set 3 with embedder 'e5-base'


                                                                                 

Prompt 3, Epoch 1: Train Loss=0.0544, Val QWK=0.6627, Val MSE=0.2926


                                                                                 

Prompt 3, Epoch 2: Train Loss=0.0395, Val QWK=0.5639, Val MSE=0.4070


                                                                                 

Prompt 3, Epoch 3: Train Loss=0.0346, Val QWK=0.6553, Val MSE=0.3253


                                                                                 

Prompt 3, Epoch 4: Train Loss=0.0274, Val QWK=0.6605, Val MSE=0.3076
Early stopping triggered.
✅ Prompt 3 | Test QWK=0.6480, Test MSE=0.3062

 Training Essay Set 4 with embedder 'e5-base'


                                                                                 

Prompt 4, Epoch 1: Train Loss=0.0516, Val QWK=0.7880, Val MSE=0.2482


                                                                                 

Prompt 4, Epoch 2: Train Loss=0.0299, Val QWK=0.7554, Val MSE=0.2641


                                                                                 

Prompt 4, Epoch 3: Train Loss=0.0244, Val QWK=0.8076, Val MSE=0.2266


                                                                                 

Prompt 4, Epoch 4: Train Loss=0.0195, Val QWK=0.8159, Val MSE=0.2303


                                                                                 

Prompt 4, Epoch 5: Train Loss=0.0149, Val QWK=0.8157, Val MSE=0.2828


                                                                                 

Prompt 4, Epoch 6: Train Loss=0.0116, Val QWK=0.8318, Val MSE=0.2363


                                                                                  

Prompt 4, Epoch 7: Train Loss=0.0083, Val QWK=0.7760, Val MSE=0.2846


                                                                                 

Prompt 4, Epoch 8: Train Loss=0.0070, Val QWK=0.8154, Val MSE=0.2432


                                                                                 

Prompt 4, Epoch 9: Train Loss=0.0065, Val QWK=0.8148, Val MSE=0.2593
Early stopping triggered.
✅ Prompt 4 | Test QWK=0.7906, Test MSE=0.2634

 Training Essay Set 5 with embedder 'e5-base'


                                                                                 

Prompt 5, Epoch 1: Train Loss=0.0369, Val QWK=0.7993, Val MSE=0.3009


                                                                                 

Prompt 5, Epoch 2: Train Loss=0.0204, Val QWK=0.7610, Val MSE=0.3057


                                                                                 

Prompt 5, Epoch 3: Train Loss=0.0177, Val QWK=0.7617, Val MSE=0.2916


                                                                                 

Prompt 5, Epoch 4: Train Loss=0.0140, Val QWK=0.7684, Val MSE=0.2937
Early stopping triggered.
✅ Prompt 5 | Test QWK=0.7870, Test MSE=0.2855

 Training Essay Set 6 with embedder 'e5-base'


                                                                                 

Prompt 6, Epoch 1: Train Loss=0.0502, Val QWK=0.7287, Val MSE=0.3031


                                                                                 

Prompt 6, Epoch 2: Train Loss=0.0231, Val QWK=0.8066, Val MSE=0.3074


                                                                                 

Prompt 6, Epoch 3: Train Loss=0.0182, Val QWK=0.8203, Val MSE=0.3082


                                                                                 

Prompt 6, Epoch 4: Train Loss=0.0167, Val QWK=0.7996, Val MSE=0.3876


                                                                                 

Prompt 6, Epoch 5: Train Loss=0.0124, Val QWK=0.8416, Val MSE=0.2849


                                                                                 

Prompt 6, Epoch 6: Train Loss=0.0106, Val QWK=0.8247, Val MSE=0.3016


                                                                                 

Prompt 6, Epoch 7: Train Loss=0.0096, Val QWK=0.8258, Val MSE=0.2645


                                                                                 

Prompt 6, Epoch 8: Train Loss=0.0083, Val QWK=0.8256, Val MSE=0.2729
Early stopping triggered.
✅ Prompt 6 | Test QWK=0.8192, Test MSE=0.2761

 Training Essay Set 7 with embedder 'e5-base'


                                                                                

Prompt 7, Epoch 1: Train Loss=0.0462, Val QWK=0.7263, Val MSE=9.0067


                                                                                 

Prompt 7, Epoch 2: Train Loss=0.0207, Val QWK=0.7683, Val MSE=9.5226


                                                                                 

Prompt 7, Epoch 3: Train Loss=0.0164, Val QWK=0.7560, Val MSE=8.1444


                                                                                 

Prompt 7, Epoch 4: Train Loss=0.0147, Val QWK=0.7690, Val MSE=7.9610


                                                                                 

Prompt 7, Epoch 5: Train Loss=0.0118, Val QWK=0.7716, Val MSE=7.3610


                                                                                 

Prompt 7, Epoch 6: Train Loss=0.0106, Val QWK=0.7758, Val MSE=7.3055


                                                                                 

Prompt 7, Epoch 7: Train Loss=0.0088, Val QWK=0.7246, Val MSE=9.7677


                                                                                 

Prompt 7, Epoch 8: Train Loss=0.0075, Val QWK=0.7562, Val MSE=7.6330


                                                                                 

Prompt 7, Epoch 9: Train Loss=0.0075, Val QWK=0.7607, Val MSE=8.3218
Early stopping triggered.
✅ Prompt 7 | Test QWK=0.8181, Test MSE=6.7164

 Training Essay Set 8 with embedder 'e5-base'


                                                                               

Prompt 8, Epoch 1: Train Loss=0.0399, Val QWK=0.5193, Val MSE=22.5325


                                                                               

Prompt 8, Epoch 2: Train Loss=0.0171, Val QWK=0.4966, Val MSE=20.9020


                                                                               

Prompt 8, Epoch 3: Train Loss=0.0134, Val QWK=0.3431, Val MSE=28.8511


                                                                               

Prompt 8, Epoch 4: Train Loss=0.0125, Val QWK=0.4934, Val MSE=30.7605
Early stopping triggered.
✅ Prompt 8 | Test QWK=0.4210, Test MSE=24.2088
  embedder  prompt  best_val_qwk  test_qwk   test_mse
0  e5-base       1      0.718743  0.724386   1.051668
1  e5-base       2      0.690352  0.495698   0.404785
2  e5-base       3      0.662742  0.648000   0.306180
3  e5-base       4      0.831782  0.790569   0.263362
4  e5-base       5      0.799291  0.786982   0.285463
5  e5-base       6      0.841562  0.819156   0.276126
6  e5-base       7      0.775820  0.818064   6.716422
7  e5-base       8      0.519301  0.420953  24.208826


In [11]:
# choose device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Example: train Roberta per-prompt
summary_deberta = train_all_prompts(
    embedder="deberta",
    prompts=range(4,9),
    num_epochs=4,
    batch_size=8,
    lr=2e-5,
    patience=3,
    max_len=256,
    device=device
)

# See summary
print(summary_deberta[["embedder","prompt","best_val_qwk","test_qwk","test_mse"]])


 Training Essay Set 4 with embedder 'deberta'


                                                                                

Prompt 4, Epoch 1: Train Loss=0.0727, Val QWK=0.7404, Val MSE=0.3121


                                                                                

Prompt 4, Epoch 2: Train Loss=0.0373, Val QWK=0.6695, Val MSE=0.3424


                                                                                 

Prompt 4, Epoch 3: Train Loss=0.0291, Val QWK=0.8272, Val MSE=0.2366


                                                                                 

Prompt 4, Epoch 4: Train Loss=0.0253, Val QWK=0.7656, Val MSE=0.2872
✅ Prompt 4 | Test QWK=0.7880, Test MSE=0.2477

 Training Essay Set 5 with embedder 'deberta'


                                                                                 

Prompt 5, Epoch 1: Train Loss=0.0667, Val QWK=0.7294, Val MSE=0.3695


                                                                                 

Prompt 5, Epoch 2: Train Loss=0.0282, Val QWK=0.7740, Val MSE=0.3534


                                                                                 

Prompt 5, Epoch 3: Train Loss=0.0216, Val QWK=0.7370, Val MSE=0.3852


                                                                                 

Prompt 5, Epoch 4: Train Loss=0.0186, Val QWK=0.6862, Val MSE=0.4194
✅ Prompt 5 | Test QWK=0.7501, Test MSE=0.3772

 Training Essay Set 6 with embedder 'deberta'


                                                                                 

Prompt 6, Epoch 1: Train Loss=0.0638, Val QWK=0.7009, Val MSE=0.3838


                                                                                 

Prompt 6, Epoch 2: Train Loss=0.0273, Val QWK=0.8051, Val MSE=0.2536


                                                                                 

Prompt 6, Epoch 3: Train Loss=0.0227, Val QWK=0.8016, Val MSE=0.2508


                                                                                      

Prompt 6, Epoch 4: Train Loss=0.0186, Val QWK=0.8184, Val MSE=0.2382
✅ Prompt 6 | Test QWK=0.8185, Test MSE=0.2254

 Training Essay Set 7 with embedder 'deberta'


                                                                                 

Prompt 7, Epoch 1: Train Loss=0.0594, Val QWK=0.5839, Val MSE=16.0890


                                                                                

Prompt 7, Epoch 2: Train Loss=0.0268, Val QWK=0.6670, Val MSE=11.5367


                                                                                 

Prompt 7, Epoch 3: Train Loss=0.0189, Val QWK=0.7130, Val MSE=9.4810


                                                                                 

Prompt 7, Epoch 4: Train Loss=0.0170, Val QWK=0.7436, Val MSE=8.5375
✅ Prompt 7 | Test QWK=0.7587, Test MSE=8.6854

 Training Essay Set 8 with embedder 'deberta'


                                                                                        

Prompt 8, Epoch 1: Train Loss=0.0429, Val QWK=0.4212, Val MSE=22.6998


                                                                               

Prompt 8, Epoch 2: Train Loss=0.0185, Val QWK=0.6646, Val MSE=17.6292


                                                                               

Prompt 8, Epoch 3: Train Loss=0.0134, Val QWK=0.6528, Val MSE=16.5421


                                                                               

Prompt 8, Epoch 4: Train Loss=0.0104, Val QWK=0.7447, Val MSE=14.1352
✅ Prompt 8 | Test QWK=0.6711, Test MSE=14.5994
  embedder  prompt  best_val_qwk  test_qwk   test_mse
0  deberta       4      0.827238  0.787950   0.247699
1  deberta       5      0.774021  0.750101   0.377177
2  deberta       6      0.818390  0.818468   0.225355
3  deberta       7      0.743593  0.758668   8.685395
4  deberta       8      0.744720  0.671137  14.599413
