# Fine-tuning DistilBERT for Text Readability Classification

Code cells cite their sources in comments wherever third-party code has been referenced or adapted, and a Harvard-style citation list is provided at the bottom of this notebook.

### Project Overview:

The purpose of this project is to create a text readability classifier (inspired by the Flesch-Kincaid readability tests) that determines whether a piece of text is easy or hard to read. I use English textbooks from South-East Asian and Middle Eastern regions as datasets. Since most readability classifiers are built on data from the United Kingdom or the United States, I thought it would be interesting to approach the problem with data from non-Western regions and see whether a model trained on it can predict readability accurately for English text from across the world. After building the classifier, I test it on speech and interview transcripts of various politicians as a use case, to gain more insight into their speaking styles.

### Project Aim:

1) To construct a model that gives writers more control over their writing, so that they can structure their work for their intended audience.

### Installing and Importing the Required Libraries:

In [1]:
!pip install transformers
!pip install datasets
# the imports below also need these packages
!pip install clean-text textstat nltk



In [2]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.tokenize.casual import casual_tokenize
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
import re
from cleantext import clean
from nltk import word_tokenize
import textstat
from transformers import DistilBertTokenizerFast
import torch
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_metric 


Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.


### Selection of Data:

For this project, I'm using English textbooks of varying grades from different countries. I found all of them on [Library Genesis](https://www.libgen.is/) and, since they were PDF files, converted them to text files using [Zamzar File Converter](https://www.zamzar.com/). I initially tried Python modules such as PDFMiner and PyPDF for this task, but kept running into errors, as most of the code I found on Stack Overflow was not compatible with the latest version of Python.

For this notebook, I have used **first and tenth grade textbooks from India from the National Council of Educational Research and Training Publication (2009 Edition)**, which can be found [here](https://libgen.is/search.php?req=ncert+english&open=0&res=25&view=simple&phrase=1&column=def). 


### Preprocessing the Data:

I used regular expressions and the clean-text library to prepare the data for the classification task. The `read_and_clean` function reads a given text file and cleans its contents, replacing line breaks according to the conditions described in the comments. This is necessary because the text files were converted from image-heavy PDFs and do not follow a consistent pattern of line breaks and punctuation. The text is then split after every full stop ('.'), and any sentence with fewer than three tokens is discarded, as it would be of little use.

In [3]:
def remove(text):
    text = re.sub(r"#\S+", " ", text)  # remove hashtags
    text = re.sub(r'\w*\d+\w*', '', text)  # remove numbers and any word containing a digit
    text = re.sub(r'[^a-zA-Z0-9\n\?!\.]', ' ', text)  # replace special characters with spaces
    text = text.strip(" ")  # trim leading and trailing spaces
    text = text.strip(".")  # trim leading and trailing full stops
    return text


In [4]:
# read the file and clean it.
def read_and_clean(file_name):
    # read the file (the context manager closes it afterwards)
    with open(file_name, 'r') as fs:
        book1 = fs.read()
    # convert 2 or more consecutive line breaks to ". "
    book1 = re.sub(r"\n{2,}", ". ", book1)
    # convert 2 or more consecutive whitespace characters to ". "
    book1 = re.sub(r"\s{2,}", ". ", book1)
    # convert a single line break to a space if it is followed by whitespace and a lowercase letter
    book1 = re.sub(r"\n{1}(?=\s[a-z])", " ", book1)
    # convert a single line break to a space if it is followed directly by a lowercase letter
    book1 = re.sub(r"\n{1}(?=[a-z])", " ", book1)
    # convert all remaining line breaks to ". "
    book1 = re.sub(r"\n", ". ", book1)
    total = []

    # https://pypi.org/project/clean-text/
    # NOTE: clean() returns the cleaned string; the result is not assigned here,
    # so book1 is left unchanged by this call.
    clean(book1,
        no_urls=True)

    # split the text after every ". "
    for i in book1.split(". "):
        # clean each piece using the remove() function above
        clean_text = remove(i)
        # tokenise the sentence and keep it only if it has more than 2 tokens
        if len(word_tokenize(clean_text)) > 2:
            total.append(clean_text)
    # return the final list of sentences
    return total
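
The line-break handling and sentence filter can be tried on a small in-memory string. This is a minimal sketch, not the notebook's exact pipeline: `normalise_line_breaks` and `split_sentences` are hypothetical helper names, the whitespace rule is omitted for brevity, and a plain `str.split()` stands in for `word_tokenize` so the example needs no NLTK data.

```python
import re

def normalise_line_breaks(raw):
    # Hypothetical helper: a reduced version of the rules in read_and_clean.
    text = re.sub(r"\n{2,}", ". ", raw)       # paragraph gap -> sentence boundary
    text = re.sub(r"\n(?=[a-z])", " ", text)  # break before lowercase -> same sentence
    text = re.sub(r"\n", ". ", text)          # any other break -> sentence boundary
    return text

def split_sentences(text, min_tokens=3):
    # Assumption: whitespace tokens approximate word_tokenize for plain text.
    sentences = []
    for chunk in text.split(". "):
        chunk = chunk.strip(" .")
        if len(chunk.split()) >= min_tokens:
            sentences.append(chunk)
    return sentences

sample = "The cat sat on\nthe mat\n\nA dog\nBirds fly over the hill"
sentences = split_sentences(normalise_line_breaks(sample))
print(sentences)
```

The wrapped line "The cat sat on / the mat" is rejoined into one sentence, while the two-word fragment "A dog" is dropped by the length filter.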
    

### Labelling the Data and Calling the Functions

In [5]:
# reading the grade one file
grade_one_sentence = read_and_clean("../data/gradeoneindia.txt")


In [6]:
label = 0
new_examples1 = []
for i in grade_one_sentence:
    # read_and_clean has already applied this length filter; kept as a safeguard
    if len(word_tokenize(i)) > 2:
        new_examples1.append([i, label])

In [7]:
new_examples1 = new_examples1[16:] # slicing the few sentences in the beginning to remove the contents page.

In [8]:
# read the grade ten file
grade_ten_sentence = read_and_clean("../data/gradetenindia.txt")


In [9]:
label = 1
new_examples2 = []
for i in grade_ten_sentence:
    # read_and_clean has already applied this length filter; kept as a safeguard
    if len(word_tokenize(i)) > 2:
        new_examples2.append([i, label])

In [10]:
# new_examples2 = new_examples2[16:] # slicing the few sentences in the beginning to remove the contents page.

In [11]:
new_examples2 = new_examples2[600:] # dropping the first 600 grade ten sentences to reduce the class imbalance

### Checking for Data Imbalance:

In [12]:
len(new_examples1)

263

In [13]:
len(new_examples2)

558

In [14]:
len(new_examples1)+len(new_examples2) 

821
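
With 263 grade one and 558 grade ten sentences, a model that always predicts label 1 would already score about 68% (558 / 821), which is worth keeping in mind when reading the accuracy figures later. One common way to remove the remaining imbalance is to randomly downsample the larger class. The sketch below is an illustration under assumptions, not what this notebook does: `downsample_to_minority` is a hypothetical helper, and the toy sentences are placeholders that only reproduce the class counts above.

```python
import random
from collections import Counter

def downsample_to_minority(examples, seed=0):
    # Group the [text, label] pairs by label.
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append([text, label])
    # Keep as many examples per class as the smallest class has.
    smallest = min(len(rows) for rows in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for rows in by_label.values():
        balanced.extend(rng.sample(rows, smallest))
    rng.shuffle(balanced)
    return balanced

# Placeholder data with the same class counts as in this notebook.
toy = [[f"easy sentence {i}", 0] for i in range(263)]
toy += [[f"hard sentence {i}", 1] for i in range(558)]

balanced = downsample_to_minority(toy)
counts = Counter(label for _, label in balanced)
majority_share = 558 / 821
print(counts, round(majority_share, 2))
```

Random sampling keeps sentences from the whole book, whereas slicing from the front keeps only the later chapters.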

### Checking the Readability Scores using the Textstat Library:

In [15]:
# the context manager closes the file once it has been read
with open('../data/gradeoneindia.txt', 'r') as fs:
    bookone = fs.read()

In [16]:
# https://pypi.org/project/textstat/
bookonescore = round(textstat.flesch_kincaid_grade(bookone))
bookonescore

7

In [17]:
# the context manager closes the file once it has been read
with open('../data/gradetenindia.txt', 'r') as fs:
    booktwo = fs.read()

In [18]:
# https://pypi.org/project/textstat/
booktwoscore = round(textstat.flesch_kincaid_grade(booktwo))
booktwoscore

16
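
For reference, the Flesch-Kincaid grade level is a linear function of the average sentence length and the average number of syllables per word. The sketch below applies the published formula to counts supplied by hand; the function name and the example counts are illustrative assumptions, and textstat uses its own rules for counting sentences and syllables, so its scores will not necessarily match a hand count.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    # Average sentence length and average syllables per word drive the score.
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Illustrative counts: 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade(100, 5, 150)
print(round(grade, 2))  # 9.91
```

Longer sentences and longer words both push the grade up, which is consistent with the grade ten book scoring higher than the grade one book above. Very short, simple sentences can push the result below zero.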

### Organising the Labelled Data together using a Pandas Dataframe

In [19]:
# DataFrame.append was removed in pandas 2.0, so the frame is built in a single call
dataset = pd.DataFrame(new_examples2 + new_examples1, columns=["text", "label"])

### Splitting the Data into Train and Test Sets

In [20]:
# https://realpython.com/train-test-split-python-data/
from sklearn.model_selection import train_test_split
train_texts, test_texts, train_labels, test_labels = train_test_split(list(dataset["text"]), list(dataset["label"]), test_size=.1)

### Tokenization:


In [21]:
# https://github.com/rasbt/stat453-deep-learning-ss21/blob/main/L19/distilbert-classifier/01_distilbert-simple.ipynb
# https://huggingface.co/docs/transformers/custom_datasets
from transformers import DistilBertTokenizerFast
import torch
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

# https://analyticsindiamag.com/python-guide-to-huggingface-distilbert-smaller-faster-cheaper-distilled-bert/
# https://huggingface.co/transformers/training.html
from datasets import load_metric 
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
# downloading the distilbert model
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.w

In [22]:
# https://github.com/rasbt/stat453-deep-learning-ss21/blob/main/L19/distilbert-classifier/01_distilbert-simple.ipynb
# https://huggingface.co/docs/transformers/custom_datasets

# converting the text to numbers
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
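
To make `truncation=True` and `padding=True` more concrete, here is a toy illustration of what the encodings look like. It is a sketch under loud assumptions: `encode_batch` and the tiny vocabulary are hypothetical, words are split on whitespace, and no special tokens are added, whereas the real DistilBERT tokenizer uses WordPiece sub-words and adds `[CLS]` and `[SEP]`.

```python
def encode_batch(texts, vocab, max_length=8, pad_id=0, unk_id=1):
    # Map each lowercased word to an id and truncate to max_length.
    token_ids = [
        [vocab.get(word, unk_id) for word in text.lower().split()][:max_length]
        for text in texts
    ]
    # Pad every sequence to the longest one in the batch.
    longest = max(len(ids) for ids in token_ids)
    input_ids, attention_mask = [], []
    for ids in token_ids:
        padding = longest - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_vocab = {"the": 2, "cat": 3, "sat": 4}  # hypothetical vocabulary
encodings = encode_batch(["The cat sat", "The dog"], toy_vocab)
print(encodings)
```

The result has the same two keys that the dataset class below iterates over: one row of ids per text, plus a mask that separates real tokens from padding.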

### Data Loader Developed Using Dataset Class:

In [23]:
# https://github.com/rasbt/stat453-deep-learning-ss21/blob/main/L19/distilbert-classifier/01_distilbert-simple.ipynb
# https://huggingface.co/docs/transformers/custom_datasets

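# creating a dataset class so we can provide the encoded texts and labels to the model
# (the class name is kept from the cited IMDb tutorial; it holds the textbook sentences here)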

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

### Train Model:

In [24]:
# https://huggingface.co/docs/transformers/custom_datasets

# getting all the arguments for the model
training_args = TrainingArguments(
    output_dir='results',          # output directory
    num_train_epochs=6,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
)

# https://huggingface.co/transformers/training.html

# metric to compute accuracy 
metric = load_metric("accuracy") 

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# creating the trainer which trains the model on data
trainer = Trainer(
    model=model,                          
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset,           # evaluation dataset
    compute_metrics=compute_metrics
)
# start training
trainer.train()

***** Running training *****
  Num examples = 738
  Num Epochs = 6
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 282



Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=282, training_loss=0.11696170238738364, metrics={'train_runtime': 2623.9744, 'train_samples_per_second': 1.688, 'train_steps_per_second': 0.107, 'total_flos': 75611977192224.0, 'train_loss': 0.11696170238738364, 'epoch': 6.0})
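
The numbers in the training log can be checked by hand. The sketch below assumes that the test set size is rounded up when taking 10% of the 821 examples, that the final partial batch of each epoch still counts as a step, and that training runs on a single device with no gradient accumulation (as the log states). The helper names are hypothetical.

```python
import math

def split_sizes(n_samples, test_fraction):
    # Assumption: the test set size is rounded up, the rest goes to training.
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test

def optimisation_steps(n_train, batch_size, epochs):
    # Assumption: the last, smaller batch of each epoch is kept.
    return math.ceil(n_train / batch_size) * epochs

n_train, n_test = split_sizes(821, 0.1)
total_steps = optimisation_steps(n_train, batch_size=16, epochs=6)
print(n_train, n_test, total_steps)  # 738 83 282
```

Under these assumptions the result agrees with the 738 training examples and 282 optimization steps reported in the log, and with the 83 evaluation examples reported in the next section.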

### Evaluate:

In [25]:
# start evaluation
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 83
  Batch size = 64


{'eval_loss': 0.41143596172332764,
 'eval_accuracy': 0.9036144578313253,
 'eval_runtime': 49.0509,
 'eval_samples_per_second': 1.692,
 'eval_steps_per_second': 0.041,
 'epoch': 6.0}
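
The `eval_accuracy` value is the share of test sentences whose highest-scoring logit matches the true label. A dependency-free sketch of the same computation is shown below; `argmax` and `accuracy` are hypothetical helpers and the toy logits are made up for illustration.

```python
def argmax(row):
    # Index of the largest value in a list of scores.
    return max(range(len(row)), key=lambda i: row[i])

def accuracy(logits, labels):
    # Fraction of rows whose top-scoring class equals the reference label.
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

toy_logits = [[2.0, -1.0], [-0.5, 0.7], [0.1, 0.3], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]
toy_accuracy = accuracy(toy_logits, toy_labels)
print(toy_accuracy)        # 0.75, three of four correct
print(round(75 / 83, 4))   # 0.9036
```

An accuracy of 0.9036 on 83 examples corresponds to 75 correct predictions, so each misclassified sentence moves the score by roughly 1.2 percentage points.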

Evaluation accuracy across the runs I tried (the last entry corresponds to the output above):

- 93% for 5 epochs
- 95% for 10 epochs
- 92% for 4 epochs (after balancing the dataset)
- 90% for 6 epochs (after balancing the dataset, with a test size of 0.1)

### Adding in Some New Strings to Test the Model:

In [30]:
# https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification
inputs = tokenizer("I am under siege. I agree with you on that.", return_tensors="pt") 
labels = torch.tensor([1]).unsqueeze(0) 
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits

In [31]:
logits

tensor([[-3.4372,  3.9164]], grad_fn=<AddmmBackward0>)

We can now check which label the model favours by comparing the two logits. Index 0 corresponds to label 0 (grade one) and index 1 to label 1 (grade ten). If the first logit is greater than the second, the cell prints '0'; otherwise it prints '1'.

In [32]:
if logits[0][0] > logits[0][1]: 
  print(0)
else:
  print(1) 


1
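
Logits are unnormalised scores, so comparing them gives the predicted label but not a probability. Applying a softmax turns them into values that sum to one. The sketch below does this in plain Python for the two logits printed above; the `softmax` helper is written out by hand here for illustration rather than taken from the notebook.

```python
import math

def softmax(scores):
    # Subtracting the maximum keeps the exponentials numerically stable.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-3.4372, 3.9164]  # values copied from the cell output above
probs = softmax(logits)
predicted = probs.index(max(probs))
print(predicted, [round(p, 4) for p in probs])
```

For these logits, almost all of the probability mass falls on label 1, which matches the printed prediction.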


### Cross-checking it with the Textstat Library:

In [29]:
# https://pypi.org/project/textstat/
trump = round(textstat.flesch_kincaid_grade("I am under siege. I agree with you on that."))
trump

-1

#### `(All observations and findings shall be included in the critical essay).`

### Citation List:    

#### Websites:

1) Alammar, J., 2021. A Visual Guide to Using BERT for the First Time. [online] Jalammar.github.io. Available at: <https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/> [Accessed 4 December 2021].

2) Davis, A., 2021. The fundamentals of programming - Python Video Tutorial | LinkedIn Learning, formerly Lynda.com. [online] LinkedIn. Available at: <https://www.linkedin.com/learning/programming-foundations-fundamentals-3/the-fundamentals-of-programming?autoAdvance=true&autoSkip=false&autoplay=true&resume=true&u=57077561> [Accessed 24 October 2021].

3) Dib, F., 2021. regex101: build, test, and debug regex. [online] regex101. Available at: <https://regex101.com/> [Accessed 4 December 2021].

4) Huggingface.co. 2021. DistilBERT. [online] Available at: <https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification> [Accessed 11 December 2021].

5) Huggingface.co. 2021. Fine-tuning a pretrained model. [online] Available at: <https://huggingface.co/transformers/training.html> [Accessed 9 December 2021].

6) Huggingface.co. 2021. How to fine-tune a model for common downstream tasks. [online] Available at: <https://huggingface.co/docs/transformers/custom_datasets> [Accessed 5 December 2021].

7) Libgen.is. 2021. Library Genesis. [online] Available at: <https://www.libgen.is/> [Accessed 4 November 2021].

8) McCallum, L., 2021. NLP Week 4.1 - Classification Task Notebook. [online] GitHub. Available at: <https://git.arts.ac.uk/lmccallum/nlp-21-22/blob/master/NLP%20Week%204.1%20-%20Classification%20Task.ipynb> [Accessed 16 November 2021].

9) Nisbet, J., 2021. Python for students - Python Video Tutorial | LinkedIn Learning, formerly Lynda.com. [online] LinkedIn. Available at: <https://www.linkedin.com/learning/python-for-students/python-for-students?autoAdvance=true&autoSkip=false&autoplay=true&resume=false&u=57077561> [Accessed 18 October 2021].

10) Portilla, J., 2021. Natural Language Processing with Python. [online] Udemy. Available at: <https://www.udemy.com/course/nlp-natural-language-processing-with-python/?ranMID=39197&ranEAID=JVFxdTr9V80&ranSiteID=JVFxdTr9V80-gIa4CDf8o_3HXX8ZIg_F1g&LSNPUBID=JVFxdTr9V80&utm_source=aff-campaign&utm_medium=udemyads> [Accessed 27 October 2021].

11) Real Python. 2021. Split Your Dataset With scikit-learn's train_test_split() – Real Python. [online] Realpython.com. Available at: <https://realpython.com/train-test-split-python-data/> [Accessed 5 December 2021].

12) Raschka, S., 2021. L19.6 DistilBert Movie Review Classifier in PyTorch. [online] Youtube.com. Available at: <https://www.youtube.com/watch?v=emDmznRlsWw> [Accessed 5 December 2021].

13) Raschka, S., 2021. Simple distilBERT Repository. [online] GitHub. Available at: <https://github.com/rasbt/stat453-deep-learning-ss21/blob/main/L19/distilbert-classifier/01_distilbert-simple.ipynb> [Accessed 5 December 2021].

14) Rose, D., 2021. Artificial Intelligence Foundations: Neural Networks Video Tutorial | LinkedIn Learning, formerly Lynda.com. [online] LinkedIn. Available at: <https://www.linkedin.com/learning/artificial-intelligence-foundations-neural-networks/welcome?autoAdvance=true&autoSkip=false&autoplay=true&resume=true&u=57077561> [Accessed 6 December 2021].

15) Medium. 2021. TDS Tutorial: Fine-Tuning Hugging Face Model with Custom Dataset. [online] Available at: <https://towardsdatascience.com/fine-tuning-hugging-face-model-with-custom-dataset-82b8092f5333> [Accessed 6 December 2021].

16) PyPI. 2021. clean-text. [online] Available at: <https://pypi.org/project/clean-text/> [Accessed 14 November 2021].

17) PyPI. 2021. textstat. [online] Available at: <https://pypi.org/project/textstat/> [Accessed 15 November 2021].

18) Sanh, V., 2021. 🏎 Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT. [online] Medium. Available at: <https://medium.com/huggingface/distilbert-8cf3380435b5> [Accessed 6 December 2021].

19) Stack Abuse. 2021. Using Regex for Text Manipulation in Python. [online] Available at: <https://stackabuse.com/using-regex-for-text-manipulation-in-python/> [Accessed 16 November 2021].

20) Verma, A., 2021. Python Guide to HuggingFace DistilBERT - Smaller, Faster & Cheaper Distilled BERT. [online] Analytics India Magazine. Available at: <https://analyticsindiamag.com/python-guide-to-huggingface-distilbert-smaller-faster-cheaper-distilled-bert/> [Accessed 6 December 2021].

21) Zamzar.com. 2021. Zamzar - video converter, audio converter, image converter, eBook converter. [online] Available at: <https://www.zamzar.com/> [Accessed 7 November 2021].