In [None]:
#This notebook is by Anastasia Ruzmaikina for the Kaggle Competition Learning Agency Lab - Automated Essay Scoring 2.0

The first automated essay scoring competition to tackle automated grading of student-written essays was twelve years ago. How far have we come from this initial competition? With an updated dataset and light years of new ideas we hope to see if we can get to the latest in automated grading to provide a real impact to overtaxed teachers who continue to have challenges with providing timely feedback, especially in underserved communities.

The goal of this competition is to train a model to score student essays. Your efforts are needed to reduce the high expense and time required to hand grade these essays. Reliable automated techniques could allow essays to be introduced in testing, a key indicator of student learning that is currently commonly avoided due to the challenges in grading.

Essay writing is an important method to evaluate student learning and performance. It is also time-consuming for educators to grade by hand. Automated Writing Evaluation (AWE) systems can score essays to supplement an educator’s other efforts. AWEs also allow students to receive regular and timely feedback on their writing. However, due to their costs, many advancements in the field are not widely available to students and educators. Open-source solutions to assess student writing are needed to reach every community with these important educational tools.
Previous efforts to develop open-source AWEs have been limited by small datasets that were not nationally diverse or focused on common essay formats. The first Automated Essay Scoring competition scored student-written short-answer responses; however, this is a writing task not often used in the classroom. To improve upon earlier efforts, a more expansive dataset that includes high-quality, realistic classroom writing samples was required. Further, to broaden the impact, the dataset should include samples across economic and geographic populations to mitigate the potential for algorithmic bias.
In this competition, you will work with the largest open-access writing dataset aligned to current standards for student-appropriate assessments. Can you help produce an open-source essay scoring algorithm that improves upon the original Automated Student Assessment Prize (ASAP) competition hosted in 2012?

In this notebook, I use DeBERTa-v3-small to score student essays. The accuracy of the scores on the competition test set is 71.5%.
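
A note on the metric: as far as I know, the competition leaderboard is scored with quadratic weighted kappa (QWK) rather than plain accuracy. Treat that as an assumption and check the competition's evaluation page. The sketch below is a plain-Python version of QWK for integer ratings; the function name and the toy ratings are made up for illustration and are not part of the notebook's pipeline.

```python
from collections import Counter


def quadratic_weighted_kappa(y_true, y_pred, min_rating=1, max_rating=6):
    """Cohen's kappa with quadratic weights for integer ratings."""
    if max_rating <= min_rating:
        raise ValueError("max_rating must be larger than min_rating")
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    n_items = len(y_true)
    n_ratings = max_rating - min_rating + 1

    # Counts of each (true, predicted) pair and the two marginal histograms.
    observed = Counter(zip(y_true, y_pred))
    true_hist = Counter(y_true)
    pred_hist = Counter(y_pred)

    numerator = 0.0    # weighted disagreement actually observed
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            weight = (i - j) ** 2 / (n_ratings - 1) ** 2
            expected = true_hist[i] * pred_hist[j] / n_items
            numerator += weight * observed[(i, j)]
            denominator += weight * expected

    if denominator == 0:
        # Both raters gave one identical constant rating; by convention
        # this sketch reports perfect agreement.
        return 1.0
    return 1.0 - numerator / denominator


# Toy ratings on a 1-4 scale: three exact matches and one near miss.
y_true = [1, 2, 3, 4]
y_pred = [1, 2, 3, 3]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.75
print(round(quadratic_weighted_kappa(y_true, y_pred, 1, 4), 3))  # 0.875
```

Unlike accuracy, QWK penalizes a prediction more the further it lands from the true score, so the two numbers are not interchangeable.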

In [None]:
import torch
# Release cached GPU memory left over from earlier runs and report current usage
torch.cuda.empty_cache()
print(torch.cuda.memory_summary(device=None, abbreviated=False))

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import TrainingArguments, Trainer

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
import torch
torch.cuda.empty_cache()

In [None]:
df = pd.read_csv('/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv')
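# Keep about 20% of the essays, sampled within each score so the score distribution is preserved
df = df.groupby('score').sample(frac=0.201)
df_train = df.reset_index(drop=True)
df_train['full_text'] = df_train['full_text'].str.lower()
# Shift scores from 1-6 to 0-5 so they can serve as class labels
df_train['score'] = df_train['score'] - 1
df_train['input'] = 'TEXT: ' + df_train['full_text']
# Keep only the columns the model needs: 'score' and 'input'
df_train1 = df_train.drop(['essay_id', 'full_text'], axis=1)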
ds = Dataset.from_pandas(df_train1)
print(ds)
df_train
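
The per-score sampling above can be tried on toy data. This is a minimal sketch, assuming a pandas version that provides `DataFrameGroupBy.sample`; the toy frame and the `random_state=0` seed are illustrative additions (the notebook's own call does not fix a seed, so its subset changes from run to run).

```python
import pandas as pd

# Toy data standing in for train.csv: the scores are imbalanced on purpose.
toy = pd.DataFrame({
    "essay_id": [f"e{i}" for i in range(20)],
    "score": [1] * 4 + [3] * 12 + [6] * 4,
})

# Draw the same fraction from every score group.
# random_state makes the draw repeatable.
subset = toy.groupby("score").sample(frac=0.5, random_state=0)

# Each score keeps half of its rows, so the class proportions are unchanged.
print(subset["score"].value_counts().sort_index())
```

Passing a fixed `random_state` to the real call would make the training subset, and therefore the reported score, reproducible.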

In [None]:
model_nm = '/kaggle/input/debertav3small'
tokz = AutoTokenizer.from_pretrained(model_nm)
# No truncation is requested, so long essays are tokenized in full
def tok_func(x): return tokz(x["input"])
tok_ds = ds.map(tok_func, batched=True)
# Trainer expects the target column to be named 'labels'
tok_ds = tok_ds.rename_columns({'score': 'labels'})
# Hold out 15% of the sampled essays for validation
dds = tok_ds.train_test_split(0.15, seed=420)
print(dds)


In [None]:
df_test = pd.read_csv('/kaggle/input/learning-agency-lab-automated-essay-scoring-2/test.csv')
df_test['full_text'] = df_test['full_text'].str.lower()
df_test['input'] = 'TEXT: ' + df_test['full_text']
df_test1 = df_test.drop(['full_text', 'essay_id'], axis=1)
eval_ds = Dataset.from_pandas(df_test1).map(tok_func, batched=True)
print(eval_ds)
df_test

In [None]:
sub = pd.read_csv('/kaggle/input/learning-agency-lab-automated-essay-scoring-2/sample_submission.csv')
sub

In [None]:
bs = 1
epochs = 2
lr = 4.15e-6

args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
# Six output classes, one per score (labels 0-5)
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=6)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz)
trainer.train();

# Class logits for each test essay, one column per class
preds = trainer.predict(eval_ds).predictions.astype(float)
print(preds)

submission = df_test.essay_id.copy().to_frame()
# argmax over the raw logits gives the class (0-5); add 1 to return to the 1-6 score scale
submission["score"] = np.argmax(preds, axis=1) + 1

submission.to_csv("/kaggle/working/submission.csv", index=False)
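
The mapping from model outputs to scores can be checked on hand-made logits. This is a minimal sketch: `logits_to_scores` and `compute_metrics` are hypothetical helpers, not functions from the notebook, and the logits are invented. `Trainer` accepts a `compute_metrics` callable that receives the predictions and labels, so a function of this shape could be passed to it to report validation accuracy after each epoch.

```python
import numpy as np


def logits_to_scores(logits):
    """Map an (n_essays, 6) array of class logits to essay scores 1..6."""
    return np.argmax(logits, axis=1) + 1


def compute_metrics(eval_pred):
    """Hypothetical Trainer callback: accuracy on the zero-based labels."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": float(np.mean(predictions == labels))}


# Invented logits for three essays, one column per class 0..5.
logits = np.array([
    [-1.2, 0.3, 2.5, 0.1, -0.7, -2.0],   # largest at class 2, so score 3
    [0.9, 0.2, -0.5, -1.0, -1.5, -2.2],  # largest at class 0, so score 1
    [-2.0, -1.0, 0.0, 1.1, 3.4, 2.9],    # largest at class 4, so score 5
])
labels = np.array([2, 0, 5])  # zero-based labels, as used during training

print(logits_to_scores(logits))  # [3 1 5]
print(compute_metrics((logits, labels)))
```

Only the ordering of the logits within a row matters for `argmax`, so the raw values can be used directly without rescaling.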

In [None]:
sub1 = pd.read_csv('/kaggle/working/submission.csv')
sub1

In [None]:
import os
def remove_folder_contents(folder):
    for the_file in os.listdir(folder):
        file_path = os.path.join(folder, the_file)
        try:
            if os.path.isfile(file_path):
                os.unlink(file_path)
            elif os.path.isdir(file_path):
                remove_folder_contents(file_path)
                os.rmdir(file_path)
        except Exception as e:
            print(e)

folder_path = '/kaggle/working'
#remove_folder_contents(folder_path)
#os.rmdir(folder_path)
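
A shorter variant of the cleanup helper can be tried safely on a throwaway directory before pointing it at `/kaggle/working`. This is a sketch using only the standard library; `clear_directory` is a hypothetical name, and unlike the helper above it lets errors propagate instead of printing them.

```python
import os
import shutil
import tempfile


def clear_directory(folder):
    """Delete everything inside `folder` but keep the folder itself."""
    # Collect the entries first so nothing is deleted while scanning.
    with os.scandir(folder) as it:
        entries = list(it)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)  # remove a subfolder and its contents
        else:
            os.unlink(entry.path)      # remove a file or symlink


# Exercise the helper on a temporary directory, not on real outputs.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "outputs", "checkpoint-1"))
    with open(os.path.join(tmp, "submission.csv"), "w") as fh:
        fh.write("essay_id,score\n")
    clear_directory(tmp)
    print(os.listdir(tmp))  # []
```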