<a href="https://colab.research.google.com/github/iam-Dylan/automated-essay-scoring/blob/main/deberta.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Learning Agency Lab - Automated Essay Scoring 2.0

- Course: Intelligent Data Analysis
- Group: 10

# **EXPERIMENTS WITH A LARGE LANGUAGE MODEL**

##  **A. Data Preprocessing**


### **1. Import the required libraries**

+ Install the required libraries.

In [1]:
!pip install datasets
!pip install accelerate -U

In [2]:
import pandas as pd
import numpy as np
import string
import re

from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
from datasets import Dataset

from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score

import warnings
warnings.simplefilter('ignore')

### **2. Load the data**
- To keep file paths consistent across machines, the group downloads the CSV files directly from Google Drive.

In [3]:
# Direct-download URL of the training CSV file on Google Drive
TRAIN_ID = '1hUhF4f-gGTixo_-b-ytez01_swNBslIG'
url = f"https://drive.google.com/uc?export=download&id={TRAIN_ID}"
# Read the CSV file from the URL
try:
    train = pd.read_csv(url)
    display(train.head())
except Exception as e:
    print(f"An error occurred: {e}")

Unnamed: 0,essay_id,full_text,score
0,000d118,Many people have car where they live. The thin...,3
1,000fe60,I am a scientist at NASA that is discussing th...,3
2,001ab80,People always wish they had the same technolog...,4
3,001bdc0,"We all heard about Venus, the planet without a...",4
4,002ba53,"Dear, State Senator\n\nThis is a letter to arg...",3


In [4]:
TEST_ID = '1kJa0kIeP0RpAFFcKa1QpFP7o4xtpxjet'
url = f"https://drive.google.com/uc?export=download&id={TEST_ID}"
# Read the CSV file from the URL
try:
    test = pd.read_csv(url)
    display(test.head())
except Exception as e:
    print(f"An error occurred: {e}")

Unnamed: 0,essay_id,full_text
0,000d118,Many people have car where they live. The thin...
1,000fe60,I am a scientist at NASA that is discussing th...
2,001ab80,People always wish they had the same technolog...


### **3. Preprocess the data**

The text must be **cleaned** to normalize it and remove unnecessary elements before further processing.

- All text is converted to **lowercase** for consistency, so that upper- and lowercase forms are not treated as different.
- **HTML tags**, user mentions (starting with @), **hashtags** (starting with #), and **URLs** are removed to keep only the actual text content.
- **Special characters** (here, non-breaking spaces) and **digits**, which usually carry little semantic value, are removed or normalized.
- **Repeated punctuation marks** (consecutive periods or commas) are collapsed into a single character.
- **Contractions** are expanded to their full forms for consistency. Adapted from: [Expand Contractions](https://www.kaggle.com/code/xianhellg/more-feature-engineering-feature-selection-0-817?scriptVersionId=173223907&cellId=11)
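The cleaning steps listed above can be sketched on a single string. This is a minimal illustration, not the notebook's full implementation: `clean_text_demo`, `DEMO_CONTRACTIONS`, and the sample sentence are hypothetical names and data introduced here, and the contraction table is a two-entry subset.

```python
import re

# Hypothetical two-entry subset of the contraction table, for illustration only.
DEMO_CONTRACTIONS = {"don't": "do not", "it's": "it is"}

def clean_text_demo(text):
    text = text.lower()                                 # lowercase
    text = re.sub(r'<.*?>', '', text)                   # HTML tags
    text = re.sub(r'@\w+\s*', '', text)                 # @mentions
    text = re.sub(r'#\w+', '', text)                    # #hashtags
    text = re.sub(r'http\S+|www\S+', '', text)          # URLs
    text = text.replace('\xa0', ' ')                    # non-breaking spaces
    text = re.sub(r'\d+', '', text)                     # digits
    for short, full in DEMO_CONTRACTIONS.items():       # contractions
        text = text.replace(short, full)
    text = re.sub(r'\.+', '.', text)                    # repeated periods
    text = re.sub(r',+', ',', text)                     # repeated commas
    return text.strip()

raw = "<p>Don't visit http://example.com NOW... It's 95,, points!</p>"
cleaned = clean_text_demo(raw)
print(cleaned)
```

Removing tokens leaves the surrounding whitespace in place, so the cleaned text can contain double spaces; subword tokenizers generally tolerate this.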

In [5]:
def expand_contractions(text):
    contractions_dict = {
    "ain't": "am not", "aren't": "are not", "can't": "cannot", "can't've": "cannot have", "'cause": "because", "could've": "could have",
    "couldn't": "could not", "couldn't've": "could not have", "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not",
    "hadn't've": "had not have", "hasn't": "has not", "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have","he'll": "he will", "he'll've": "he will have", "he's": "he is",
    "how'd": "how did","how'd'y": "how do you","how'll": "how will","how's": "how is",
    "I'd": "I would",
    "I'd've": "I would have","I'll": "I will","I'll've": "I will have","I'm": "I am","I've": "I have","isn't": "is not",
    "it'd": "it had",
    "it'd've": "it would have","it'll": "it will","it'll've": "it will have","it's": "it is",
    "let's": "let us","ma'am": "madam","mayn't": "may not","might've": "might have","mightn't": "might not","mightn't've": "might not have",
    "must've": "must have","mustn't": "must not","mustn't've": "must not have",
    "needn't": "need not","needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not","oughtn't've": "ought not have",
    "shan't": "shall not","sha'n't": "shall not","shan't've": "shall not have",
    "she'd": "she would",
    "she'd've": "she would have","she'll": "she will","she'll've": "she will have","she's": "she is",
    "should've": "should have","shouldn't": "should not","shouldn't've": "should not have",
    "so've": "so have","so's": "so is",
    "that'd": "that would",
    "that'd've": "that would have","that's": "that is",
    "there'd": "there had",
    "there'd've": "there would have","there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have","they'll": "they will","they'll've": "they will have","they're": "they are","they've": "they have",
    "to've": "to have","wasn't": "was not","weren't": "were not",
    "we'd": "we had",
    "we'd've": "we would have","we'll": "we will","we'll've": "we will have","we're": "we are","we've": "we have",
    "what'll": "what will","what'll've": "what will have","what're": "what are","what's": "what is","what've": "what have",
    "when's": "when is","when've": "when have",
    "where'd": "where did","where's": "where is","where've": "where have",
    "who'll": "who will","who'll've": "who will have","who's": "who is","who've": "who have","why's": "why is","why've": "why have",
    "will've": "will have","won't": "will not","won't've": "will not have",
    "would've": "would have","wouldn't": "would not","wouldn't've": "would not have",
    "y'all": "you all","y'alls": "you alls","y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are",
    "y'all've": "you all have","you'd": "you had","you'd've": "you would have","you'll": "you will","you'll've": "you will have",
    "you're": "you are",  "you've": "you have"
    }
    # Match longer keys first (so "can't've" wins over "can't"), escape them,
    # and ignore case: clean_text() lowercases the text before calling this
    # function, so keys such as "I'm" would otherwise never match.
    lookup = {key.lower(): value for key, value in contractions_dict.items()}
    keys = sorted(lookup, key=len, reverse=True)
    contractions_re = re.compile('|'.join(re.escape(key) for key in keys), re.IGNORECASE)

    return contractions_re.sub(lambda match: lookup[match.group(0).lower()], text)

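def clean_text(text):
    text = text.lower()                          # normalize case
    text = re.sub(r'<.*?>', '', text)            # remove HTML tags
    text = re.sub(r'@\w+\s*', '', text)          # remove @mentions
    text = re.sub(r'#\w+', '', text)             # remove #hashtags
    text = re.sub(r'http\S+|www\S+', '', text)   # remove URLs
    text = text.replace(u'\xa0', ' ')            # non-breaking space -> regular space
    text = re.sub(r'\d+', '', text)              # remove digits
    text = expand_contractions(text)             # e.g. "don't" -> "do not"
    text = re.sub(r'\.+', '.', text)             # collapse repeated periods
    text = re.sub(r'\,+', ',', text)             # collapse repeated commas
    text = text.strip()

    return text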

train['full_text'] = train['full_text'].apply(clean_text)
test['full_text'] = test['full_text'].apply(clean_text)

##  **B. Model Building**


### **1. Prepare the data**

In [6]:
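# Shift the scores down by one so labels start at 0, and cast them to float32
# so the model is trained as a regressor (num_labels=1).
train['label'] = (train['score'] - 1).astype('float32')

# Hold out 20% of the essays for validation, stratified by score.
train, val = train_test_split(train, test_size=0.2, stratify=train['score'])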

### **2. Process the input data**

Computers cannot directly understand and process natural-language text the way humans do. Tokenization converts text into smaller units (tokens) that the model can work with more easily.
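To make the idea concrete, here is a toy word-level tokenizer. It is only a sketch: the real `microsoft/deberta-v3-small` tokenizer uses a learned subword vocabulary, whereas `build_vocab` and `encode` are hypothetical helpers that split on whitespace. The special-token ids (0 to 3) follow the tokenizer printout in this notebook.

```python
def build_vocab(texts):
    # Reserve the same special-token ids as the tokenizer printed in this notebook.
    vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab, max_length=8):
    # [CLS] + word ids, truncated so that [SEP] always fits, then padded.
    ids = [vocab["[CLS]"]] + [vocab.get(w, vocab["[UNK]"]) for w in text.split()]
    ids = ids[: max_length - 1] + [vocab["[SEP]"]]
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [vocab["[PAD]"]] * pad,
        "attention_mask": [1] * len(ids) + [0] * pad,
    }

vocab = build_vocab(["many people have cars"])
encoded = encode("many people have bikes", vocab)
print(encoded["input_ids"])       # "bikes" is out of vocabulary and maps to [UNK]
print(encoded["attention_mask"])  # zeros mark the padding positions
```

The attention mask tells the model which positions hold real tokens and which are padding, which matters here because every essay is padded or truncated to a fixed length.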

In [7]:
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-small')
tokenizer

DebertaV2TokenizerFast(name_or_path='microsoft/deberta-v3-small', vocab_size=128000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	128000: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [8]:
def tokenize_function(sample):
    tokenized = tokenizer(sample['full_text'],
                          add_special_tokens = True,
                          max_length = 1024,
                          padding = 'max_length',
                          truncation=True)
    return tokenized

### **3. Create the datasets**

In [9]:
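def get_dataset(df):
    # Drop the pandas index; otherwise the shuffled index left by
    # train_test_split is stored as an extra `__index_level_0__` column.
    ds = Dataset.from_pandas(df, preserve_index=False)
    return ds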

train_ds = get_dataset(train).map(tokenize_function).remove_columns(['essay_id', 'full_text', 'score'])
valid_ds = get_dataset(val).map(tokenize_function).remove_columns(['essay_id', 'full_text', 'score'])

Map:   0%|          | 0/13845 [00:00<?, ? examples/s]

Map:   0%|          | 0/3462 [00:00<?, ? examples/s]

### **4. Build the evaluation function**

In [10]:
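def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # The regression head returns shape (n, 1): flatten it, clip to the label
    # range 0-5, and round to the nearest integer class before scoring.
    preds = predictions.reshape(-1).clip(0, 5).round(0).astype(int)
    qwk = cohen_kappa_score(labels.astype(int), preds, weights='quadratic')
    results = {
        'qwk': qwk
    }
    return results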
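For reference, the quadratic weighted kappa (QWK) computed above by `cohen_kappa_score(..., weights='quadratic')` can be written out by hand. The NumPy sketch below is illustrative: `quadratic_weighted_kappa` is a hypothetical helper, and it assumes six classes (labels 0 to 5), matching the clipping range used in `compute_metrics`.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=6):
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    # Observed confusion matrix: rows are true labels, columns are predictions.
    observed = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1

    # Expected matrix if true labels and predictions were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic penalty: grows with the squared distance between the classes.
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5])
one_off = quadratic_weighted_kappa([0, 1, 2, 3, 4, 5], [1, 1, 2, 3, 4, 5])
print(perfect, one_off)
```

Because the penalty is quadratic, predicting a 2 for an essay scored 5 costs nine times more than predicting a 4, which suits ordinal targets such as essay scores.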

### **5. Set the training parameters**

In [11]:
training_args = TrainingArguments(
    output_dir='output',
    fp16=True,
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    weight_decay=0.01,
    evaluation_strategy='epoch',
    metric_for_best_model='qwk',
    save_strategy='epoch',
    save_total_limit=1,
    load_best_model_at_end=True,
    report_to='none',
    warmup_ratio=0.0,
    lr_scheduler_type='cosine',
    optim='adamw_torch',
    logging_first_step=True,
)
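With `lr_scheduler_type='cosine'` and `warmup_ratio=0.0`, the learning rate starts at `learning_rate` and decays along a half cosine towards zero. The sketch below reproduces that curve in plain Python under the assumption that the standard half-cosine formula applies; `cosine_lr` is a hypothetical helper, not a Transformers function. The step count is derived from the 13,845 training examples, batch size 8, and 4 epochs used in this notebook.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-5, warmup_steps=0):
    # Linear warmup (unused here, since warmup_ratio=0.0).
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    # Half-cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# 13,845 training examples, batch size 8, 4 epochs.
total_steps = math.ceil(13845 / 8) * 4

for step in (0, total_steps // 2, total_steps):
    print(step, cosine_lr(step, total_steps))
```

The decay is slow at the start and end of training and fastest in the middle, so the last epoch runs at a small learning rate.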

In [12]:
config = AutoConfig.from_pretrained(
    pretrained_model_name_or_path='microsoft/deberta-v3-small',
    attention_probs_dropout_prob=0.0,
    hidden_dropout_prob=0.0,
    num_labels=1)

### **6. Train the model**

In [13]:
model = AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-v3-small', config=config)
model

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DebertaV2ForSequenceClassification(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(128100, 768, padding_idx=0)
      (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True)
      (dropout): StableDropout()
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-5): 6 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=768, out_features=768, bias=True)
              (key_proj): Linear(in_features=768, out_features=768, bias=True)
              (value_proj): Linear(in_features=768, out_features=768, bias=True)
              (pos_dropout): StableDropout()
              (dropout): StableDropout()
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=T

In [14]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [15]:
trainer.train()

Epoch,Training Loss,Validation Loss,Qwk
1,0.3687,0.348461,0.777052
2,0.3173,0.30148,0.818423
3,0.2651,0.296586,0.817038
4,0.2126,0.295057,0.824532


TrainOutput(global_step=6924, training_loss=0.30657971518222893, metrics={'train_runtime': 5666.3538, 'train_samples_per_second': 9.773, 'train_steps_per_second': 1.222, 'total_flos': 1.46723500505088e+16, 'train_loss': 0.30657971518222893, 'epoch': 4.0})

### **7. Examples from the training set**

In [16]:
sample = train.sample(5)
sample_ds = get_dataset(sample).map(tokenize_function).remove_columns(['essay_id', 'full_text', 'score'])

predictions = trainer.predict(sample_ds).predictions
sample['pred'] = predictions.clip(0,5).round(0) + 1
sample[[ 'full_text', 'score', 'pred']]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Unnamed: 0,full_text,score,pred
1884,helping out people in other contries with the ...,3,3.0
12650,i do not think that the new software would be ...,4,3.0
13283,"in todays society, technology is widely used a...",3,3.0
1134,the chance that computers would have the abili...,2,2.0
8028,today i -luke- think you should join the proga...,2,2.0


### **8. Make predictions**

In [17]:
test_ds = get_dataset(test).map(tokenize_function).remove_columns(['essay_id', 'full_text'])
predictions = trainer.predict(test_ds).predictions
test['score'] = predictions.clip(0,5).round(0) + 1
test

Map:   0%|          | 0/3 [00:00<?, ? examples/s]

Unnamed: 0,essay_id,full_text,score
0,000d118,many people have car where they live. the thin...,3.0
1,000fe60,i am a scientist at nasa that is discussing th...,3.0
2,001ab80,people always wish they had the same technolog...,5.0
