**Introduction**

In this notebook, we evaluate aspects of reliability, validity and fairness while modeling the data with a classification model. To run it, press the "play" button of each cell in order. We start by importing the packages and downloading the data from a Google Drive folder:

In [None]:
import os

os.environ["PYTHONHASHSEED"] = str(42)

In [None]:
import numpy as np
import pandas as pd

In [None]:
!wget -O training_set_rel3.tsv 'https://drive.google.com/uc?export=download&id=1ptegxIM5hB6fZNTqXg7xachk-Ax9dGdY'

--2025-01-28 10:15:53--  https://drive.google.com/uc?export=download&id=1ptegxIM5hB6fZNTqXg7xachk-Ax9dGdY
Resolving drive.google.com (drive.google.com)... 142.251.2.100, 142.251.2.138, 142.251.2.139, ...
Connecting to drive.google.com (drive.google.com)|142.251.2.100|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1ptegxIM5hB6fZNTqXg7xachk-Ax9dGdY&export=download [following]
--2025-01-28 10:15:53--  https://drive.usercontent.google.com/download?id=1ptegxIM5hB6fZNTqXg7xachk-Ax9dGdY&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 142.250.141.132, 2607:f8b0:4023:c0b::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|142.250.141.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16337165 (16M) [application/octet-stream]
Saving to: ‘training_set_rel3.tsv’


2025-01-28 10:16:03 (58.5 MB/s) - ‘training_set_rel3.tsv’ 

In [None]:
file_path = 'training_set_rel3.tsv'
columns = ['essay_id', 'essay_set', 'essay', 'domain1_score', 'rater1_domain1', 'rater2_domain1']
asap = pd.read_csv(file_path, sep='\t', encoding='ISO-8859-1', usecols=columns)

The following code prints the resulting DataFrame:

In [None]:
print(asap)

       essay_id  essay_set                                              essay  \
0             1          1  Dear local newspaper, I think effects computer...   
1             2          1  Dear @CAPS1 @CAPS2, I believe that using compu...   
2             3          1  Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...   
3             4          1  Dear Local Newspaper, @CAPS1 I have found that...   
4             5          1  Dear @LOCATION1, I know having computers has a...   
...         ...        ...                                                ...   
12971     21626          8   In most stories mothers and daughters are eit...   
12972     21628          8   I never understood the meaning laughter is th...   
12973     21629          8  When you laugh, is @CAPS5 out of habit, or is ...   
12974     21630          8                                 Trippin' on fen...   
12975     21633          8   Many people believe that laughter can improve...   

       rater1_domain1  rate

Initially, the scores have a different range in each essay set. To start our analysis, we therefore inspect the score range of each set and then normalize the scores to a range from 0 to 1:

In [None]:
sets = asap['essay_set'].unique()
scores = pd.DataFrame(asap, columns=['essay_set', 'domain1_score'])
scores_grp = scores.groupby(['essay_set'], as_index=False)
essay = pd.DataFrame(sets, columns=['sets'])
essay['counts'] = scores_grp.count()['domain1_score']
essay['min'] = scores_grp.min()['domain1_score']
essay['max'] = scores_grp.max()['domain1_score']
essay['med'] = scores_grp.median()['domain1_score']
print(essay)

   sets  counts  min  max   med
0     1    1783    2   12   8.0
1     2    1800    1    6   3.0
2     3    1726    0    3   2.0
3     4    1770    0    3   1.0
4     5    1805    0    4   2.0
5     6    1800    0    4   3.0
6     7    1569    2   24  16.0
7     8     723   10   60  37.0


In [None]:
scores = {}

for es in sets:
    min_es = asap[asap['essay_set'] == es].domain1_score.min()
    max_es =  asap[asap['essay_set'] == es].domain1_score.max()
    scores[es] = (min_es, max_es)
scores

{1: (2, 12),
 2: (1, 6),
 3: (0, 3),
 4: (0, 3),
 5: (0, 4),
 6: (0, 4),
 7: (2, 24),
 8: (10, 60)}

In [None]:
def minmax_scaler(es, score):
    return (score - scores[es][0]) / (scores[es][1] - scores[es][0])

def inverse_scaler(es, score):
    return round(score * (scores[es][1] - scores[es][0]) + scores[es][0])

In [None]:
def scale_dataset(asap):
    for row in range(len(asap)):
        asap.loc[row, 'nscore'] = minmax_scaler(asap.loc[row, 'essay_set'], asap.loc[row, 'domain1_score'])
    return asap

In [None]:
asap = scale_dataset(asap)
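To make the scaling concrete, the following plain-Python sketch applies the same min-max formula to single scores, using the score ranges printed above. The names `SCORE_RANGES`, `minmax` and `inverse_minmax` are illustrative and not part of the notebook.

```python
# Score ranges (min, max) per essay set, copied from the output above
SCORE_RANGES = {1: (2, 12), 2: (1, 6), 3: (0, 3), 4: (0, 3),
                5: (0, 4), 6: (0, 4), 7: (2, 24), 8: (10, 60)}

def minmax(essay_set, score):
    """Map a raw score to [0, 1] within its essay set."""
    lo, hi = SCORE_RANGES[essay_set]
    return (score - lo) / (hi - lo)

def inverse_minmax(essay_set, nscore):
    """Map a normalized score back to the nearest raw score."""
    lo, hi = SCORE_RANGES[essay_set]
    return round(nscore * (hi - lo) + lo)

# The median raw scores of set 1 (8) and set 8 (37) on the common scale
print(minmax(1, 8))    # (8 - 2) / (12 - 2)
print(minmax(8, 37))   # (37 - 10) / (60 - 10)
print(inverse_minmax(1, minmax(1, 8)))
```

After scaling, the lowest score of every set maps to 0 and the highest to 1, so scores from different sets can be compared on one scale.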

The following line of code prints the DataFrame, which now contains the normalized score in the new column nscore.

In [None]:
print(asap)

       essay_id  essay_set                                              essay  \
0             1          1  Dear local newspaper, I think effects computer...   
1             2          1  Dear @CAPS1 @CAPS2, I believe that using compu...   
2             3          1  Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...   
3             4          1  Dear Local Newspaper, @CAPS1 I have found that...   
4             5          1  Dear @LOCATION1, I know having computers has a...   
...         ...        ...                                                ...   
12971     21626          8   In most stories mothers and daughters are eit...   
12972     21628          8   I never understood the meaning laughter is th...   
12973     21629          8  When you laugh, is @CAPS5 out of habit, or is ...   
12974     21630          8                                 Trippin' on fen...   
12975     21633          8   Many people believe that laughter can improve...   

       rater1_domain1  rate

We are now prepared to start with our analysis.

**Framing the Task as a Classification Problem**

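To treat the task as a classification problem, we define two classes based on the nscore. For simplicity, we round the nscore to the nearest integer: nscores below 0.5 become class 0 and nscores above 0.5 become class 1. Note that np.round rounds half to even, so an nscore of exactly 0.5 is assigned to class 0.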

In [None]:
asap['nclass'] = np.round(asap['nscore'])
asap['nclass'] = asap['nclass'].astype(int)
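Because the class boundary sits exactly at 0.5, the rounding rule matters. The sketch below uses Python's built-in `round`, which follows the same round-half-to-even rule as `np.round`, and contrasts it with an explicit threshold. The helper `class_by_threshold` is a hypothetical alternative, not what the notebook uses.

```python
def class_by_rounding(nscore):
    """Class assignment via rounding (a half is rounded to the even integer)."""
    return int(round(nscore))

def class_by_threshold(nscore, threshold=0.5):
    """Hypothetical alternative: scores at or above the threshold are class 1."""
    return int(nscore >= threshold)

for nscore in (0.4, 0.5, 0.6):
    print(nscore, class_by_rounding(nscore), class_by_threshold(nscore))
```

The two rules differ only for an nscore of exactly 0.5, which occurs whenever a raw score lies at the midpoint of its range (for example, a score of 7 in set 1).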

The new classes are stored in the new column nclass.

In [None]:
print(asap)

       essay_id  essay_set                                              essay  \
0             1          1  Dear local newspaper, I think effects computer...   
1             2          1  Dear @CAPS1 @CAPS2, I believe that using compu...   
2             3          1  Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...   
3             4          1  Dear Local Newspaper, @CAPS1 I have found that...   
4             5          1  Dear @LOCATION1, I know having computers has a...   
...         ...        ...                                                ...   
12971     21626          8   In most stories mothers and daughters are eit...   
12972     21628          8   I never understood the meaning laughter is th...   
12973     21629          8  When you laugh, is @CAPS5 out of habit, or is ...   
12974     21630          8                                 Trippin' on fen...   
12975     21633          8   Many people believe that laughter can improve...   

       rater1_domain1  rate

In [None]:
!pip install accelerate
!pip install datasets
!pip install transformers

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 480.6/480.6 kB 13.4 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 11.3 MB/s eta 0:00:00
Downloading fsspec-2024.9.0-py3-none-any.whl 

In [None]:
import datasets
from datasets import load_dataset
import transformers
import accelerate
import random
import torch
from transformers import set_seed

To make our results as reproducible as possible, we first set a seed.

In [None]:
def set_gen_seed(seed):
    set_seed(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_gen_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

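We transform asap into the format of Hugging Face's datasets library and define training (train_df), validation (valid_df) and test (test_df) sets. The split is performed within each essay set, so that every set contributes about 60% of its essays to training and about 20% each to validation and testing.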

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data_by_group(df, group_var, train_frac=0.6, test_frac=0.2, random_state=42):
    """
    Splits a DataFrame into training, validation, and test sets, with splits performed within each level of the specified group variable.

    Parameters:
    - df: pandas DataFrame to split.
    - group_var: String name of the column containing group identifiers.
    - train_frac: Fraction of the data to allocate to the training set.
    - test_frac: Fraction of the data to allocate to the test set. The remainder goes to the validation set.
    - random_state: Random state for reproducibility.

    Returns:
    - train_df: Training data DataFrame.
    - val_df: Validation data DataFrame.
    - test_df: Test data DataFrame.
    """
    train_df = pd.DataFrame()
    val_df = pd.DataFrame()
    test_df = pd.DataFrame()

    for _, group_df in df.groupby(group_var):
        # Split current group into training and temp (validation + test) sets
        train, temp = train_test_split(group_df, train_size=train_frac, random_state=random_state)

        # Split temp into validation and test sets
        validation, test = train_test_split(temp, test_size=test_frac/(1 - train_frac), random_state=random_state)

        # Append current group's splits to the overall datasets
        train_df = pd.concat([train_df, train])
        val_df = pd.concat([val_df, validation])
        test_df = pd.concat([test_df, test])

    return train_df, val_df, test_df
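As a sanity check on the split fractions, the following sketch computes the expected size of each split from the essay counts per set reported above. It assumes that scikit-learn rounds a fractional `train_size` down and a fractional `test_size` up, which is our reading of `train_test_split` rather than something the notebook states; `expected_split_sizes` is an illustrative name.

```python
import math

# Number of essays per essay set, copied from the summary table above
COUNTS = {1: 1783, 2: 1800, 3: 1726, 4: 1770, 5: 1805, 6: 1800, 7: 1569, 8: 723}

def expected_split_sizes(counts, train_frac=0.6, test_frac=0.2):
    """Return the total (train, validation, test) sizes of a per-set split."""
    n_train = n_valid = n_test = 0
    for n in counts.values():
        # First split: train vs. temp (assumed: the train size is rounded down)
        train = math.floor(train_frac * n)
        temp = n - train
        # Second split: temp into validation and test (assumed: the test size is rounded up)
        test = math.ceil(test_frac / (1 - train_frac) * temp)
        n_train += train
        n_valid += temp - test
        n_test += test
    return n_train, n_valid, n_test

print(expected_split_sizes(COUNTS))
```

Under these assumptions the totals are 7783, 2596 and 2597 essays, which agrees with the dataset sizes reported by the tokenization step further down.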

We further copy nscore into a new column labels. This is necessary so that the score is handled correctly by Hugging Face's transformer models, which expect the target under this name:

In [None]:
asap['labels'] = asap['nscore']

In [None]:
train_df, valid_df, test_df = split_data_by_group(asap[['essay','essay_set','labels', 'rater1_domain1', 'rater2_domain1']], 'essay_set', train_frac=0.6, test_frac=0.2, random_state=42)

In [None]:
print(train_df)

                                                   essay  essay_set  labels  \
1661   Dear @LOCATION1 press, I have recently heard a...          1    0.70   
1525   To: @ORGANIZATION1 goes so fast, and the most ...          1    0.60   
881    Dear local newspaper, I've heard that you were...          1    0.60   
1468   Dear local news paper, This paper is going to ...          1    0.40   
730    Honestly, I totally and absolutely believe tha...          1    0.60   
...                                                  ...        ...     ...   
12324   There are a couple things that can lead stran...          8    0.50   
12359   In a relationship you should be able to trust...          8    0.68   
12523   Laughter is a huge part oh building friendshi...          8    0.50   
12688   I think that laughter is a key element to any...          8    0.60   
12355   I'm a tell you about moments sometimes even a...          8    0.40   

       rater1_domain1  rater2_domain1  
1661       

We load the pretrained DistilBERT model and tokenize all data sets with its tokenizer:

In [None]:
import torch
from transformers import AutoModel, AutoTokenizer

model_ckpt = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)



config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_ckpt, model_max_length=512)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [None]:
def tokenize(batch):
  return tokenizer(batch["essay"], truncation = True, padding='max_length', max_length=512)

In [None]:
from datasets import Dataset

train_ds = Dataset.from_pandas(train_df)
valid_ds = Dataset.from_pandas(valid_df)
test_ds = Dataset.from_pandas(test_df)

In [None]:
train_encoded = train_ds.map(tokenize, batched = True, batch_size= None)
valid_encoded = valid_ds.map(tokenize, batched = True, batch_size= None)
test_encoded = test_ds.map(tokenize, batched = True, batch_size= None)

Map:   0%|          | 0/7783 [00:00<?, ? examples/s]

Map:   0%|          | 0/2596 [00:00<?, ? examples/s]

Map:   0%|          | 0/2597 [00:00<?, ? examples/s]

To obtain suitable metrics, we further import and define the mean squared error for the regression problem.
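The notebook does not show this definition. A minimal plain-Python sketch of the mean squared error, with the root mean squared error as a companion in the units of the normalized scores, could look as follows; the function names and the toy values are illustrative assumptions.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, expressed in the units of the scores."""
    return math.sqrt(mean_squared_error(y_true, y_pred))

# Toy normalized scores (made up for illustration)
y_true = [0.6, 0.4, 0.5]
y_pred = [0.5, 0.4, 0.7]
print(mean_squared_error(y_true, y_pred))
print(root_mean_squared_error(y_true, y_pred))
```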

**Testing Standards in Classification Models**

To prepare the evaluation of testing standards in classification models, we first predict the ability classes with a fine-tuned language model (DistilBERT).



In [None]:
asap['labels'] = asap['nclass']

In [None]:
train_df2, valid_df2, test_df2 = split_data_by_group(asap[['essay','essay_set','labels','rater1_domain1', 'rater2_domain1']], 'essay_set', train_frac=0.6, test_frac=0.2, random_state=None)

In [None]:
train_ds2 = Dataset.from_pandas(train_df2)
valid_ds2 = Dataset.from_pandas(valid_df2)
test_ds2 = Dataset.from_pandas(test_df2)

In [None]:
train_df2["essay"][0]

"Dear local newspaper, I think effects computers have on people are great learning skills/affects because they give us time to chat with friends/new people, helps us learn about the globe(astronomy) and keeps us out of troble! Thing about! Dont you think so? How would you feel if your teenager is always on the phone with friends! Do you ever time to chat with your friends or buisness partner about things. Well now - there's a new way to chat the computer, theirs plenty of sites on the internet to do so: @ORGANIZATION1, @ORGANIZATION2, @CAPS1, facebook, myspace ect. Just think now while your setting up meeting with your boss on the computer, your teenager is having fun on the phone not rushing to get off cause you want to use it. How did you learn about other countrys/states outside of yours? Well I have by computer/internet, it's a new way to learn about what going on in our time! You might think your child spends a lot of time on the computer, but ask them so question about the econom

In [None]:
train_encoded2 = train_ds2.map(tokenize, batched = True, batch_size= None)
valid_encoded2 = valid_ds2.map(tokenize, batched = True, batch_size= None)
test_encoded2 = test_ds2.map(tokenize, batched = True, batch_size= None)

Map:   0%|          | 0/7783 [00:00<?, ? examples/s]

Map:   0%|          | 0/2596 [00:00<?, ? examples/s]

Map:   0%|          | 0/2597 [00:00<?, ? examples/s]

In [None]:
from transformers import AutoModelForSequenceClassification

num_labels = 2
model2 = (AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels = num_labels)).to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import Trainer, TrainingArguments

batch_size = 8
logging_steps = len(train_encoded2)// batch_size
model_name = "finetuned-dataset2"
training_args = TrainingArguments(output_dir = model_name,
                                  num_train_epochs=3,
                                  learning_rate = 2e-5,
                                  weight_decay = 0.01,
                                  report_to="none",
                                  evaluation_strategy="epoch",
                                  logging_steps = logging_steps)



In [None]:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  f1 = f1_score(labels, preds, average = "weighted")
  acc = accuracy_score(labels, preds)
  return{"accuracy": acc, "f1": f1}
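To illustrate what `compute_metrics` reports, the following plain-Python sketch computes the accuracy and the support-weighted F1 score by hand on a small made-up example. It is meant to mirror `f1_score(..., average="weighted")`, not to replace it; all names are illustrative.

```python
def accuracy(labels, preds):
    """Share of predictions that match the labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def weighted_f1(labels, preds):
    """Per-class F1 scores, averaged with weights equal to the class support."""
    total = 0.0
    for c in set(labels):
        tp = sum(l == c and p == c for l, p in zip(labels, preds))
        fp = sum(l != c and p == c for l, p in zip(labels, preds))
        fn = sum(l == c and p != c for l, p in zip(labels, preds))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn  # number of true instances of class c
        total += support * f1
    return total / len(labels)

labels = [0, 0, 0, 1, 1]  # made-up true classes
preds = [0, 0, 1, 1, 1]   # made-up predicted classes
print(accuracy(labels, preds), weighted_f1(labels, preds))
```

With two classes of unequal size, the weighted F1 gives the larger class more influence, which is why it can differ slightly from the accuracy in the training log.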

In [None]:
trainer = Trainer(model = model2, args = training_args,
                  compute_metrics = compute_metrics,
                  train_dataset = train_encoded2,
                  eval_dataset = valid_encoded2,
                  tokenizer = tokenizer)
trainer.train()


Epoch  Training Loss  Validation Loss  Accuracy  F1
1      0.4496         0.376124         0.83359   0.830081
2      0.3483         0.360188         0.851695  0.851532
3      0.279          0.411102         0.851695  0.850476


TrainOutput(global_step=2919, training_loss=0.3587736975824069, metrics={'train_runtime': 567.1643, 'train_samples_per_second': 41.168, 'train_steps_per_second': 5.147, 'total_flos': 3092981291218944.0, 'train_loss': 0.3587736975824069, 'epoch': 3.0})

In [None]:
model2.save_pretrained("finetuned-classification-model")

To calculate the interrater reliability in R, we compute the predictions for the training, validation and test sets and save them as CSV files.



In [None]:
import numpy as np
from scipy.special import softmax

training_predictions = trainer.predict(train_encoded2)

training_probabilities = softmax(training_predictions.predictions, axis=1)

training_class_predictions = np.argmax(training_probabilities, axis=1)

validation_predictions = trainer.predict(valid_encoded2)

validation_probabilities = softmax(validation_predictions.predictions, axis=1)

validation_class_predictions = np.argmax(validation_probabilities, axis=1)

test_predictions = trainer.predict(test_encoded2)

test_probabilities = softmax(test_predictions.predictions, axis=1)

test_class_predictions = np.argmax(test_probabilities, axis=1)
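Since softmax preserves the ordering of the logits, the predicted class is the same whether the argmax is taken over the logits or over the probabilities; the probabilities are only needed if confidence values are of interest. The sketch below shows this for made-up logits in plain Python.

```python
import math

def softmax(logits):
    """Numerically stable softmax of a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [[1.2, -0.3], [-2.0, 0.5], [0.1, 0.4]]  # made-up model outputs
for row in logits:
    probs = softmax(row)
    print(argmax(row), argmax(probs), [round(p, 3) for p in probs])
```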

In [None]:
import pandas as pd

# Save the predicted classes of each split
df = pd.DataFrame(training_class_predictions)
df.to_csv("training_class_predictions.csv")

df = pd.DataFrame(validation_class_predictions)
df.to_csv("validation_class_predictions.csv")

df = pd.DataFrame(test_class_predictions)
df.to_csv("test_class_predictions.csv")

# Save the corresponding essays, labels and human ratings
train_df2.to_csv("training_data.csv")
valid_df2.to_csv("validation_data.csv")
test_df2.to_csv("test_data.csv")

**Classification Models: Split-half Reliability**

We create new datasets that contain the first and the second half of each essay, split at sentence boundaries.

In [None]:
import pandas as pd
import re

# Sample DataFrame
data = {
    'essay': ["The quick brown fox jumps over the lazy dog. This is another sentence! This is a third sentence.", "What a beautiful day? Let's go for a walk."]
}

df = pd.DataFrame(data)

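# Function to split a text into a first and a second half at sentence boundaries
def split_sentences_to_halves(text):
    # Remove leading and trailing apostrophes
    text = text.strip("''")

    # Split into sentences at whitespace that follows "." or "?",
    # skipping abbreviations such as "e.g." or "Mr."
    # Note: "!" is not treated as a sentence boundary here.
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)

    # With an odd number of sentences, the second half gets the extra one;
    # a single-sentence text yields an empty first half.
    midpoint = len(sentences) // 2

    first_half = " ".join(sentences[:midpoint])
    second_half = " ".join(sentences[midpoint:])

    return first_half, second_half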

# Apply the function to create the new columns
df[['first_half', 'second_half']] = df['essay'].apply(split_sentences_to_halves).apply(pd.Series)

print(df)


                                               essay  \
0  The quick brown fox jumps over the lazy dog. T...   
1         What a beautiful day? Let's go for a walk.   

                                     first_half  \
0  The quick brown fox jumps over the lazy dog.   
1                         What a beautiful day?   

                                         second_half  
0  This is another sentence! This is a third sent...  
1                               Let's go for a walk.  
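Two properties of this splitting rule are worth checking before applying it to the essays: an exclamation mark does not end a sentence, and a text with a single sentence yields an empty first half. The sketch below restates the function from the cell above under the illustrative name `split_halves` so that it runs on its own.

```python
import re

# Same pattern as above: split at whitespace that follows "." or "?"
SENTENCE_END = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')

def split_halves(text):
    """Return the first and second half of a text, split by sentences."""
    sentences = SENTENCE_END.split(text.strip("''"))
    midpoint = len(sentences) // 2
    return " ".join(sentences[:midpoint]), " ".join(sentences[midpoint:])

print(split_halves("Hello world! How are you? Fine."))
print(split_halves("Only one sentence."))
```

An empty first half is still tokenized and classified, so very short essays may deserve a separate look when interpreting the agreement between the two halves.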


We apply the function to our essays:

In [None]:
train_rel = train_df2.copy()
valid_rel = valid_df2.copy()
test_rel = test_df2.copy()

In [None]:
train_rel[['first_half', 'second_half']] = train_rel['essay'].apply(split_sentences_to_halves).apply(pd.Series)
valid_rel[['first_half', 'second_half']] = valid_rel['essay'].apply(split_sentences_to_halves).apply(pd.Series)
test_rel[['first_half', 'second_half']] = test_rel['essay'].apply(split_sentences_to_halves).apply(pd.Series)

We create new datasets in which the essay column contains only the first half (suffix _a) or only the second half (suffix _b):

In [None]:
train_rel_a = train_rel.copy()
valid_rel_a = valid_rel.copy()
test_rel_a = test_rel.copy()

train_rel_a = train_rel_a.drop(columns=['essay', 'second_half'])
train_rel_a = train_rel_a.rename(columns={'first_half': 'essay'})

valid_rel_a = valid_rel_a.drop(columns=['essay', 'second_half'])
valid_rel_a = valid_rel_a.rename(columns={'first_half': 'essay'})

test_rel_a = test_rel_a.drop(columns=['essay', 'second_half'])
test_rel_a = test_rel_a.rename(columns={'first_half': 'essay'})

In [None]:
train_rel_b = train_rel.copy()
valid_rel_b = valid_rel.copy()
test_rel_b = test_rel.copy()

train_rel_b = train_rel_b.drop(columns=['essay', 'first_half'])
train_rel_b = train_rel_b.rename(columns={'second_half': 'essay'})

valid_rel_b = valid_rel_b.drop(columns=['essay', 'first_half'])
valid_rel_b = valid_rel_b.rename(columns={'second_half': 'essay'})

test_rel_b = test_rel_b.drop(columns=['essay', 'first_half'])
test_rel_b = test_rel_b.rename(columns={'second_half': 'essay'})

In [None]:
train_rel2_a = Dataset.from_pandas(train_rel_a)
valid_rel2_a = Dataset.from_pandas(valid_rel_a)
test_rel2_a = Dataset.from_pandas(test_rel_a)

train_rel2_b = Dataset.from_pandas(train_rel_b)
valid_rel2_b = Dataset.from_pandas(valid_rel_b)
test_rel2_b = Dataset.from_pandas(test_rel_b)

We encode the texts:

In [None]:
train_encoded2_a = train_rel2_a.map(tokenize, batched = True, batch_size= None)
valid_encoded2_a = valid_rel2_a.map(tokenize, batched = True, batch_size= None)
test_encoded2_a = test_rel2_a.map(tokenize, batched = True, batch_size= None)

train_encoded2_b = train_rel2_b.map(tokenize, batched = True, batch_size= None)
valid_encoded2_b = valid_rel2_b.map(tokenize, batched = True, batch_size= None)
test_encoded2_b = test_rel2_b.map(tokenize, batched = True, batch_size= None)

Map:   0%|          | 0/7783 [00:00<?, ? examples/s]

Map:   0%|          | 0/2596 [00:00<?, ? examples/s]

Map:   0%|          | 0/2597 [00:00<?, ? examples/s]

Map:   0%|          | 0/7783 [00:00<?, ? examples/s]

Map:   0%|          | 0/2596 [00:00<?, ? examples/s]

Map:   0%|          | 0/2597 [00:00<?, ? examples/s]

We get the predictions for each half of the training, validation and test sets.

In [None]:
import numpy as np
from scipy.special import softmax

training_predictions_a = trainer.predict(train_encoded2_a)

training_probabilities_a = softmax(training_predictions_a.predictions, axis=1)

training_class_predictions_a = np.argmax(training_probabilities_a, axis=1)

training_predictions_b = trainer.predict(train_encoded2_b)

training_probabilities_b = softmax(training_predictions_b.predictions, axis=1)

training_class_predictions_b = np.argmax(training_probabilities_b, axis=1)

In [None]:
validation_predictions_a = trainer.predict(valid_encoded2_a)

validation_probabilities_a = softmax(validation_predictions_a.predictions, axis=1)

validation_class_predictions_a = np.argmax(validation_probabilities_a, axis=1)

validation_predictions_b = trainer.predict(valid_encoded2_b)

validation_probabilities_b = softmax(validation_predictions_b.predictions, axis=1)

validation_class_predictions_b = np.argmax(validation_probabilities_b, axis=1)

In [None]:
test_predictions_a = trainer.predict(test_encoded2_a)

test_probabilities_a = softmax(test_predictions_a.predictions, axis=1)

test_class_predictions_a = np.argmax(test_probabilities_a, axis=1)

test_predictions_b = trainer.predict(test_encoded2_b)

test_probabilities_b = softmax(test_predictions_b.predictions, axis=1)

test_class_predictions_b = np.argmax(test_probabilities_b, axis=1)

In [None]:
from sklearn.metrics import confusion_matrix

cm_train = confusion_matrix(training_class_predictions_a, training_class_predictions_b)
cm_valid = confusion_matrix(validation_class_predictions_a, validation_class_predictions_b)
cm_test = confusion_matrix(test_class_predictions_a, test_class_predictions_b)

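We obtain the following confusion matrices for the training, validation and test data. Rows correspond to the class predicted from the first half, columns to the class predicted from the second half: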

In [None]:
print(cm_train)

[[5036 1045]
 [ 496 1206]]


In [None]:
print(cm_valid)

[[1727  334]
 [ 129  406]]


In [None]:
print(cm_test)

[[1674  377]
 [ 150  396]]
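One way to summarize such a matrix is the raw agreement between the two halves together with Cohen's kappa, which corrects for chance agreement. The notebook computes its reliability coefficients in R, so the plain-Python sketch below is only an illustration; that kappa is the coefficient of interest is an assumption.

```python
def agreement_and_kappa(cm):
    """Raw agreement and Cohen's kappa for a square confusion matrix."""
    n = sum(sum(row) for row in cm)
    k = len(cm)
    observed = sum(cm[i][i] for i in range(k)) / n
    row_totals = [sum(cm[i]) for i in range(k)]
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return observed, (observed - expected) / (1 - expected)

cm_test = [[1674, 377], [150, 396]]  # test-set matrix printed above
observed, kappa = agreement_and_kappa(cm_test)
print(f"agreement = {observed:.3f}, kappa = {kappa:.3f}")
```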


**Classification Model Validity Check: Evaluating the Effect of Inserting Random Letters**

We insert 10 random letters into each essay and inspect how the predictions change. The following function was provided by Google Gemini.

In [None]:
import pandas as pd
import random
import string

# Sample DataFrame (use your own data)
data = {
    'essay': ["''The quick brown fox jumps over the lazy dog. This is another sentence!''", "''What a beautiful day? Let's go for a walk.''"]
}

df = pd.DataFrame(data)

def introduce_typos(text, num_typos=10):
    # Remove leading and trailing apostrophes
    text = text.strip("''")

    # Candidate insertion points: the positions of all non-space characters
    insertion_points = [i for i in range(len(text)) if text[i] != ' ']

    # Randomly select insertion points (fewer if the text is very short)
    selected_points = random.sample(insertion_points, min(num_typos, len(insertion_points)))

    # Insert a random lowercase letter at each point. The points refer to the
    # original text, so earlier insertions can shift where later letters land.
    for point in selected_points:
        random_letter = random.choice(string.ascii_lowercase)
        text = text[:point] + random_letter + text[point:]

    return "''" + text + "''"  # Wrap the result in '' (also if the input had none)

# Apply the function to create the 'essay_error' column
df['essay_error'] = df['essay'].apply(introduce_typos)

print(df)

                                               essay  \
0  ''The quick brown fox jumps over the lazy dog....   
1     ''What a beautiful day? Let's go for a walk.''   

                                         essay_error  
0  ''Trhe qugicxak brownb nfoxs jumps over the la...  
1  ''Wshat a zbeautiwful hdiayy? aLet's gzo ffor ...  
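In the function above, the insertion points are drawn from the original text, so letters inserted earlier shift the positions used later. A hedged variant that inserts from right to left keeps every letter directly in front of the character that was originally selected, and a dedicated `random.Random` instance makes the result repeatable. The name `insert_random_letters` and the `seed` parameter are illustrative assumptions, and unlike the function above this variant does not wrap the text in quotation marks.

```python
import random
import string

def insert_random_letters(text, num_typos=10, seed=None):
    """Insert random lowercase letters in front of randomly chosen non-space characters."""
    rng = random.Random(seed)
    points = [i for i, ch in enumerate(text) if ch != ' ']
    chosen = rng.sample(points, min(num_typos, len(points)))
    chars = list(text)
    # Going from right to left leaves the smaller indices valid
    for point in sorted(chosen, reverse=True):
        chars.insert(point, rng.choice(string.ascii_lowercase))
    return "".join(chars)

original = "The quick brown fox jumps over the lazy dog."
noisy = insert_random_letters(original, num_typos=10, seed=42)
print(noisy)
```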


We apply this function to our essays:

In [None]:
train_typo = train_df2.copy()
valid_typo = valid_df2.copy()
test_typo = test_df2.copy()

In [None]:
train_typo['essay_error'] = train_typo['essay'].apply(introduce_typos)
valid_typo['essay_error'] = valid_typo['essay'].apply(introduce_typos)
test_typo['essay_error'] = test_typo['essay'].apply(introduce_typos)

In [None]:
train_typo = train_typo.drop(columns=['essay'])
train_typo = train_typo.rename(columns={'essay_error': 'essay'})

valid_typo = valid_typo.drop(columns=['essay'])
valid_typo = valid_typo.rename(columns={'essay_error': 'essay'})

test_typo = test_typo.drop(columns=['essay'])
test_typo = test_typo.rename(columns={'essay_error': 'essay'})

In [None]:
train_val_2 = Dataset.from_pandas(train_df2)
train_typo_2 = Dataset.from_pandas(train_typo)

valid_val_2 = Dataset.from_pandas(valid_df2)
valid_typo_2 = Dataset.from_pandas(valid_typo)

test_val_2 = Dataset.from_pandas(test_df2)
test_typo_2 = Dataset.from_pandas(test_typo)

We encode the texts:

In [None]:
train_val_encoded2 = train_val_2.map(tokenize, batched = True, batch_size= None)
train_typo_encoded2 = train_typo_2.map(tokenize, batched = True, batch_size= None)

valid_val_encoded2 = valid_val_2.map(tokenize, batched = True, batch_size= None)
valid_typo_encoded2 = valid_typo_2.map(tokenize, batched = True, batch_size= None)

test_val_encoded2 = test_val_2.map(tokenize, batched = True, batch_size= None)
test_typo_encoded2 = test_typo_2.map(tokenize, batched = True, batch_size= None)

Map:   0%|          | 0/7783 [00:00<?, ? examples/s]

Map:   0%|          | 0/7783 [00:00<?, ? examples/s]

Map:   0%|          | 0/2596 [00:00<?, ? examples/s]

Map:   0%|          | 0/2596 [00:00<?, ? examples/s]

Map:   0%|          | 0/2597 [00:00<?, ? examples/s]

Map:   0%|          | 0/2597 [00:00<?, ? examples/s]

Getting the predictions and comparing them:

In [None]:
import numpy as np
from scipy.special import softmax

orig_predictions_train = trainer.predict(train_val_encoded2)

orig_probabilities_train = softmax(orig_predictions_train.predictions, axis=1)

orig_class_predictions_train = np.argmax(orig_probabilities_train, axis=1)

typo_predictions_train = trainer.predict(train_typo_encoded2)

typo_probabilities_train = softmax(typo_predictions_train.predictions, axis=1)

typo_class_predictions_train = np.argmax(typo_probabilities_train, axis=1)

In [None]:
import numpy as np
from scipy.special import softmax

orig_predictions_valid = trainer.predict(valid_val_encoded2)

orig_probabilities_valid = softmax(orig_predictions_valid.predictions, axis=1)

orig_class_predictions_valid = np.argmax(orig_probabilities_valid, axis=1)

typo_predictions_valid = trainer.predict(valid_typo_encoded2)

typo_probabilities_valid = softmax(typo_predictions_valid.predictions, axis=1)

typo_class_predictions_valid = np.argmax(typo_probabilities_valid, axis=1)

In [None]:
import numpy as np
from scipy.special import softmax

orig_predictions_test = trainer.predict(test_val_encoded2)

orig_probabilities_test = softmax(orig_predictions_test.predictions, axis=1)

orig_class_predictions_test = np.argmax(orig_probabilities_test, axis=1)

typo_predictions_test = trainer.predict(test_typo_encoded2)

typo_probabilities_test = softmax(typo_predictions_test.predictions, axis=1)

typo_class_predictions_test = np.argmax(typo_probabilities_test, axis=1)

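We get the following confusion matrices. Rows correspond to the class predicted for the original essay, columns to the class predicted for the essay with typos: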

In [None]:
from sklearn.metrics import confusion_matrix

cm_typo_train = confusion_matrix(orig_class_predictions_train, typo_class_predictions_train)
cm_typo_valid = confusion_matrix(orig_class_predictions_valid, typo_class_predictions_valid)
cm_typo_test = confusion_matrix(orig_class_predictions_test, typo_class_predictions_test)

In [None]:
print(cm_typo_train)

[[2911   33]
 [ 345 4494]]


In [None]:
print(cm_typo_valid)

[[ 939    9]
 [ 137 1511]]


In [None]:
print(cm_typo_test)

[[ 947   10]
 [ 153 1487]]
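From these matrices one can read off how many predictions changed after the typos were inserted and in which direction. The following sketch does the arithmetic for the test-set matrix printed above; the function name is illustrative.

```python
def changed_predictions(cm):
    """Share of changed predictions and the counts per direction (binary case)."""
    n = sum(sum(row) for row in cm)
    zero_to_one = cm[0][1]  # class 0 on the original, class 1 with typos
    one_to_zero = cm[1][0]  # class 1 on the original, class 0 with typos
    return (zero_to_one + one_to_zero) / n, zero_to_one, one_to_zero

cm_typo_test = [[947, 10], [153, 1487]]  # test-set matrix printed above
share, up, down = changed_predictions(cm_typo_test)
print(f"{share:.1%} changed: {up} from 0 to 1, {down} from 1 to 0")
```

In all three splits, most of the changed predictions move from class 1 to class 0.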


**Fairness Evaluation in Classification Models: The Accuracy is Comparable over all Topics**

We first get the predictions per essay. To do so, we make sure that the model is on the available device (GPU if present, otherwise CPU).

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model2.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


Getting the predictions for the validation set:

In [None]:
validation_predictions = trainer.predict(valid_encoded2)

In [None]:
import numpy as np
from scipy.special import softmax

# Applying softmax to convert logits to probabilities
validation_probabilities = softmax(validation_predictions.predictions, axis=1)

validation_class_predictions = np.argmax(validation_probabilities, axis=1)
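
The two steps above (softmax, then argmax) can be illustrated on a few made-up logits. This is a minimal sketch: `softmax_rows` is a hypothetical NumPy stand-in for `scipy.special.softmax(..., axis=1)`, and the logits are invented for illustration, not taken from the model.

```python
import numpy as np

def softmax_rows(logits):
    """Numerically stable row-wise softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical logits for three essays and two classes
demo_logits = np.array([[2.0, -1.0], [0.3, 0.4], [-1.5, 3.0]])

demo_probabilities = softmax_rows(demo_logits)           # each row sums to 1
demo_classes = np.argmax(demo_probabilities, axis=1)     # index of the most probable class

print(demo_probabilities.round(3))
print(demo_classes)
```

Because softmax preserves the ordering within each row, taking the argmax of the logits directly yields the same classes; the softmax step is only needed when the probabilities themselves are of interest.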


Getting the predictions for the training set:

In [None]:
training_predictions = trainer.predict(train_encoded2)

training_probabilities = softmax(training_predictions.predictions, axis=1)

training_class_predictions = np.argmax(training_probabilities, axis=1)

Getting the predictions for the test set:

In [None]:
test_predictions = trainer.predict(test_encoded2)

test_probabilities = softmax(test_predictions.predictions, axis=1)

test_class_predictions = np.argmax(test_probabilities, axis=1)

Getting confusion matrices for each essay topic for the training, validation and test sets.

In [None]:
train_encoded2

Dataset({
    features: ['essay', 'essay_set', 'labels', 'rater1_domain1', 'rater2_domain1', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 7783
})

In [None]:
y_train_true = np.array(train_encoded2["labels"])
y_train_set = np.array(train_encoded2["essay_set"])
y_train_pred = training_class_predictions

y_valid_true = np.array(valid_encoded2["labels"])
y_valid_set = np.array(valid_encoded2["essay_set"])
y_valid_pred = validation_class_predictions

y_test_true = np.array(test_encoded2["labels"])
y_test_set = np.array(test_encoded2["essay_set"])
y_test_pred = test_class_predictions

In [None]:
y_train = np.column_stack((y_train_set, y_train_true, y_train_pred))
y_valid = np.column_stack((y_valid_set, y_valid_true, y_valid_pred))
y_test = np.column_stack((y_test_set, y_test_true, y_test_pred))

Calculating the confusion matrices:

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
# Find unique groups
groups = np.unique(y_train[:, 0])

# Dictionary to hold confusion matrices
confusion_matrices = {}

for group in groups:
    # Filter rows for the current group
    group_data = y_train[y_train[:, 0] == group]

    # True values are in the second column, predictions in the third
    true_values = group_data[:, 1]
    predictions = group_data[:, 2]

    # Compute the confusion matrix for the current group
    cm = confusion_matrix(true_values, predictions)

    # Store the confusion matrix (raw counts) using the group as the key;
    # use cm / cm.sum() instead to store proportions
    confusion_matrices[group] = cm

# Display the confusion matrices
for group, cm in confusion_matrices.items():
    print(f"Confusion Matrix for Group {group}:")
    print(np.round(cm, 2), "\n")

Confusion Matrix for Group 1:
[[135  41]
 [ 16 877]] 

Confusion Matrix for Group 2:
[[520  63]
 [ 68 429]] 

Confusion Matrix for Group 3:
[[333  53]
 [ 45 604]] 

Confusion Matrix for Group 4:
[[565  29]
 [ 16 452]] 

Confusion Matrix for Group 5:
[[524  39]
 [ 36 484]] 

Confusion Matrix for Group 6:
[[323  37]
 [ 24 696]] 

Confusion Matrix for Group 7:
[[210  52]
 [ 16 663]] 

Confusion Matrix for Group 8:
[[106  64]
 [  7 256]] 



In [None]:
# Find unique groups
groups = np.unique(y_valid[:, 0])

# Dictionary to hold confusion matrices
confusion_matrices = {}

for group in groups:
    # Filter rows for the current group
    group_data = y_valid[y_valid[:, 0] == group]

    # True values are in the second column, predictions in the third
    true_values = group_data[:, 1]
    predictions = group_data[:, 2]

    # Compute the confusion matrix for the current group
    cm = confusion_matrix(true_values, predictions)

    # Store the confusion matrix using the group as the key
    confusion_matrices[group] = cm

# Display the confusion matrices
for group, cm in confusion_matrices.items():
    print(f"Confusion Matrix for Group {group}:")
    print(np.round(cm, 2), "\n")

Confusion Matrix for Group 1:
[[ 37  20]
 [  9 291]] 

Confusion Matrix for Group 2:
[[141  41]
 [ 32 146]] 

Confusion Matrix for Group 3:
[[ 91  44]
 [ 28 182]] 

Confusion Matrix for Group 4:
[[162  14]
 [ 22 156]] 

Confusion Matrix for Group 5:
[[178  34]
 [ 20 129]] 

Confusion Matrix for Group 6:
[[102  26]
 [ 19 213]] 

Confusion Matrix for Group 7:
[[ 60  27]
 [ 15 212]] 

Confusion Matrix for Group 8:
[[27 29]
 [ 5 84]] 



In [None]:
# Find unique groups
groups = np.unique(y_test[:, 0])

# Dictionary to hold confusion matrices
confusion_matrices = {}

for group in groups:
    # Filter rows for the current group
    group_data = y_test[y_test[:, 0] == group]

    # True values are in the second column, predictions in the third
    true_values = group_data[:, 1]
    predictions = group_data[:, 2]

    # Compute the confusion matrix for the current group
    cm = confusion_matrix(true_values, predictions)

    # Store the confusion matrix using the group as the key
    confusion_matrices[group] = cm

# Display the confusion matrices
for group, cm in confusion_matrices.items():
    print(f"Confusion Matrix for Group {group}:")
    print(np.round(cm, 2), "\n")

Confusion Matrix for Group 1:
[[ 28  29]
 [ 10 290]] 

Confusion Matrix for Group 2:
[[123  52]
 [ 44 141]] 

Confusion Matrix for Group 3:
[[108  17]
 [ 29 192]] 

Confusion Matrix for Group 4:
[[163  14]
 [ 15 162]] 

Confusion Matrix for Group 5:
[[178  22]
 [ 26 135]] 

Confusion Matrix for Group 6:
[[102  26]
 [ 31 201]] 

Confusion Matrix for Group 7:
[[ 57  30]
 [ 11 216]] 

Confusion Matrix for Group 8:
[[25 33]
 [ 7 80]] 
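
To compare topics directly, each confusion matrix can be reduced to an accuracy and to a recall per class. The following is a minimal sketch, assuming the test-set matrices are copied by hand from the output above; `summarize` is a hypothetical helper name, and rows are taken as true classes and columns as predicted classes, following the `confusion_matrix(true_values, predictions)` call.

```python
import numpy as np

# Test-set confusion matrices copied from the output above
# (rows: true class, columns: predicted class)
cm_test = {
    1: [[28, 29], [10, 290]],
    2: [[123, 52], [44, 141]],
    3: [[108, 17], [29, 192]],
    4: [[163, 14], [15, 162]],
    5: [[178, 22], [26, 135]],
    6: [[102, 26], [31, 201]],
    7: [[57, 30], [11, 216]],
    8: [[25, 33], [7, 80]],
}

def summarize(cm):
    """Return (accuracy, recall for class 0, recall for class 1) of a 2x2 matrix."""
    cm = np.asarray(cm)
    accuracy = np.trace(cm) / cm.sum()
    recall_low = cm[0, 0] / cm[0].sum()
    recall_high = cm[1, 1] / cm[1].sum()
    return accuracy, recall_low, recall_high

test_summary = {group: summarize(cm) for group, cm in cm_test.items()}

for group, (acc, r_low, r_high) in test_summary.items():
    print(f"Topic {group}: accuracy = {acc:.3f}, "
          f"recall low = {r_low:.3f}, recall high = {r_high:.3f}")
```

On these test-set matrices, accuracy ranges from roughly 0.72 (topic 8) to roughly 0.92 (topic 4), and the recall for the low-score class falls below 0.5 for topics 1 and 8.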



**Experimental Validation Check: Applying LIME and SHAP Values**

As a demonstration, we also calculate SHAP and LIME values. Here, the individual tokens have no specific meaning, i.e., there is no clear expectation as to whether including a given token should increase or decrease the predicted score.

In [None]:
pred = transformers.pipeline("text-classification", model=model2, tokenizer=tokenizer, top_k=None, truncation=True )

Device set to use cuda:0


In [None]:
!pip install shap



In [None]:
import shap
explainer = shap.Explainer(pred)

In [None]:
shap_values = explainer(train_encoded2["essay"][:3])

Token indices sequence length is longer than the specified maximum sequence length for this model (556 > 512). Running this sequence through the model will result in indexing errors
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
PartitionExplainer explainer: 4it [00:16,  8.41s/it]


In [None]:
shap.plots.text(shap_values)



LIME Values:

In [None]:
!pip install lime

from lime.lime_text import LimeTextExplainer

class_names = ['Low Score', 'High Score']
explainer = LimeTextExplainer(class_names=class_names)

Collecting lime
  Downloading lime-0.2.0.1.tar.gz (275 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: lime
  Building wheel for lime (setup.py) ... done
  Created wheel for lime: filename=lime-0.2.0.1-py3-none-any.whl size=283834 sha256=938d25e661c42a9bc50d30dc51da7fd7dc1e8b2d89b43e891f48b730ed9f90ab
  Stored in directory: /root/.cache/pip/wheels/85/fa/a3/9c2d44c9f3cd77cf4e533b58900b2bf4487f2a17e8ec212a3d
Successfully built lime
Installing collected packages: lime
Successfully installed lime-0.2.0.1


In [None]:
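import torch
import torch.nn.functional as F

def predictor(texts):
  # Tokenize the texts and move the tensors to the device the model is on
  inputs = tokenizer(list(texts), return_tensors="pt", padding=True, truncation=True).to(model2.device)
  # No gradients are needed for inference
  with torch.no_grad():
    outputs = model2(**inputs)
  # Move the probabilities back to the CPU before converting to NumPy
  probas = F.softmax(outputs.logits, dim=1).cpu().numpy()
  return probas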

In [None]:
explainer = LimeTextExplainer(class_names=class_names)