# Using Naive Bayes for Text Readability Classification

Wherever third-party code has been used or adapted, the code cell cites its source in a comment, and a Harvard-style citation list is provided at the bottom of this notebook.

### Project Overview:

The purpose of this project is to create a text readability classifier (inspired by the Flesch-Kincaid readability tests) that determines whether a piece of text is easy or hard to read. I use English textbooks from South-East Asian / Middle Eastern areas as datasets. Since most readability classifiers are built on data from the United Kingdom / United States, I thought it would be interesting to approach the problem with data from non-western regions, to see whether a model trained on it can predict readability accurately for English text from around the world. After building the classifier, I test it on speech / interview transcripts of various politicians as a use case, to get more insight into their speaking styles.

### Project Aim:

1) To construct a model that gives writers more control over their writing, so that they can structure their work for their intended audience.

### Installing and Importing the Required Libraries:

In [1]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.tokenize.casual import casual_tokenize
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
import textstat
import re
from cleantext import clean
from nltk import word_tokenize

Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.


### Selection of Data:

For this project, I'm using English textbooks for various grade levels from different countries. I found all of them on [Library Genesis](https://www.libgen.is/). Since they were PDF files, I converted them to text files using [Zamzar File Converter](https://www.zamzar.com/). I initially tried Python modules such as PDF Miner and PyPDF for this task, but kept running into errors because most of the code I found on StackOverflow was not compatible with the latest version of Python.

For this notebook, I have used **fourth and twelfth grade English textbooks from Afghanistan, published by the country's Ministry of Education (2011 Edition)**, which can be found [here](https://libgen.is/search.php?req=afghanistan+english&lg_topic=libgen&open=0&view=simple&res=25&phrase=1&column=def).


### Preprocessing the Data:

I've used regular expressions and the clean-text library to prepare the data before the classification task. I defined a `read_and_clean` function that reads a given text file and cleans its contents, replacing line breaks according to the conditions described in the comments. This is needed because the text files were converted from image-heavy PDFs and do not follow a consistent pattern of punctuation. After that, I split the text at every full stop followed by a space ('. ') and discard any fragment with two or fewer tokens, as such fragments are of little use.

In [2]:
def remove(text):
    text = re.sub(r"#\S+", " ", text) # remove hashtags
    text = re.sub(r"https*\S+", " ", text) # remove URLs
    text = re.sub(r'\w*\d+\w*', '', text) # remove numbers and any word containing a digit
    text = re.sub(r'[^a-zA-Z0-9\n\?!\.]', ' ', text) # replace special characters with spaces
    text = text.strip(" ")
    text = text.strip(".")
    return text


In [3]:
# Read a text file and return a list of cleaned sentences.
def read_and_clean(file_name):
    # read the file (the context manager closes it afterwards)
    with open(file_name, 'r') as fs:
        book1 = fs.read()
    # convert 2 or more consecutive line breaks to ". "
    book1 = re.sub(r"\n{2,}", ". ", book1)
    # convert 2 or more consecutive whitespace characters to ". "
    book1 = re.sub(r"\s{2,}", ". ", book1)
    # convert a single line break to a space if it is followed by a space and a lowercase letter
    book1 = re.sub(r"\n{1}(?=\s[a-z])", " ", book1)
    # convert a single line break to a space if it is followed by a lowercase letter
    book1 = re.sub(r"\n{1}(?=[a-z])", " ", book1)
    # convert all remaining line breaks to ". "
    book1 = re.sub(r"\n", ". ", book1)
    total = []

    # https://pypi.org/project/clean-text/
    # Note: the return value is not assigned, so this call leaves book1 unchanged.
    clean(book1,
        no_urls=True)

    # split the text after every '. '
    for i in book1.split(". "):
        # clean it using the above function
        clean_text = remove(i)
        # keep the fragment as a sentence only if it has more than 2 tokens
        if len(word_tokenize(clean_text)) > 2:
            total.append(clean_text)
    # return the final list
    return total
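
To see how the line-break rules behave, here is a minimal sketch that applies the same substitutions to a made-up string. The `normalise_breaks` helper and the sample text are illustrative assumptions and not part of the notebook, and plain `str.split()` stands in for `word_tokenize` so that the example runs without NLTK data.

```python
import re

def normalise_breaks(raw):
    """Apply the same line-break substitutions, in the same order."""
    text = re.sub(r"\n{2,}", ". ", raw)           # blank lines end a sentence
    text = re.sub(r"\s{2,}", ". ", text)          # runs of whitespace end a sentence
    text = re.sub(r"\n(?=\s[a-z])", " ", text)    # break + space + lowercase: same sentence
    text = re.sub(r"\n(?=[a-z])", " ", text)      # break + lowercase: same sentence
    text = re.sub(r"\n", ". ", text)              # any other break ends a sentence
    return text

# Hypothetical sample resembling text extracted from a PDF
sample = "The cat sat\non the mat\n\nLesson Two\nBirds can fly"

normalised = normalise_breaks(sample)

# Keep only fragments with more than two whitespace-separated tokens
sentences = [s for s in normalised.split(". ") if len(s.split()) > 2]
print(sentences)
```

In this sample the heading "Lesson Two" is dropped because it has only two tokens, which is the kind of fragment the length filter is meant to remove.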

### Labelling the Data and Calling the Functions

In [4]:
# read the grade four file
grade_one_sentence = read_and_clean("../data/gradefourafghan.txt")


In [5]:
label = 0
new_examples1 = []
for i in grade_one_sentence:  
    if len(word_tokenize(i)) >2:
        new_examples1 = new_examples1 + [[i, label]]

In [6]:
new_examples1 = new_examples1[16:] # slicing the few sentences in the beginning to remove the contents page.

In [7]:
# read the grade twelve file
grade_ten_sentence = read_and_clean("../data/gradetwelveafghan.txt")


In [8]:
label = 1
new_examples2 = []
for i in grade_ten_sentence:  
    if len(word_tokenize(i))>2:
        new_examples2 = new_examples2 + [[i, label]]

In [9]:
# new_examples2 = new_examples2[17:] # slicing the few sentences in the beginning to remove the contents page.

In [10]:
new_examples2 = new_examples2[500:] # dropping the first 500 sentences to reduce the data imbalance and the risk of overfitting

### Checking for Data Imbalance:

In [11]:
len(new_examples1)

923

In [12]:
len(new_examples2) 

2601

In [13]:
len(new_examples1)+len(new_examples2) 

3524
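
The counts above show that the two classes are still uneven after slicing. As a rough point of reference (a sketch, not part of the original notebook), the snippet below computes each class's share and the accuracy that a hypothetical classifier would reach by always predicting the larger class. The variable names are illustrative; the counts are the ones printed above.

```python
# Sentence counts per label, taken from the cell outputs above
counts = {0: 923, 1: 2601}

total_sentences = sum(counts.values())

# Share of the dataset that belongs to each label
shares = {label: n / total_sentences for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most common label
majority_baseline = max(shares.values())

for label, share in shares.items():
    print("label %d: %.1f%% of sentences" % (label, share * 100))
print("majority-class baseline accuracy: %.1f%%" % (majority_baseline * 100))
```

Any accuracy figure for the trained model is easier to interpret when compared against this baseline.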

### Checking the Readability Scores using the Textstat Library:

In [14]:
with open('../data/gradefourafghan.txt', 'r') as fs:
    bookone = fs.read()

In [15]:
# https://pypi.org/project/textstat/
bookonescore = round(textstat.flesch_kincaid_grade(bookone))
bookonescore

4

In [16]:
with open('../data/gradetwelveafghan.txt', 'r') as fs:
    booktwo = fs.read()

In [17]:
# https://pypi.org/project/textstat/
booktwoscore = round(textstat.flesch_kincaid_grade(booktwo))
booktwoscore

6

### Organising the Labelled Data together using a Pandas Dataframe

In [18]:
# DataFrame.append was removed in pandas 2.0, so the frame is built directly from the labelled examples
dataset = pd.DataFrame(new_examples2 + new_examples1, columns=["text", "label"])

### TFIDF:

In [19]:
# https://git.arts.ac.uk/lmccallum/nlp-21-22/blob/master/NLP%20Week%204.1%20-%20Classification%20Task.ipynb
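# Tokeniser!
vectoriser = TfidfVectorizer(tokenizer=casual_tokenize)
# Fitting the model
tfidf_model = vectoriser.fit(dataset["text"])
# Getting vectors for everything; toarray() returns a plain ndarray,
# whereas todense() returns an np.matrix that recent scikit-learn versions reject
tfidf = tfidf_model.transform(dataset["text"]).toarray()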
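
For readers unfamiliar with TF-IDF, here is a small sketch on a made-up three-sentence corpus. It uses `TfidfVectorizer` with its default tokeniser rather than `casual_tokenize`, purely to keep the example short; the corpus and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" occurs in every sentence, "cat" in only one
toy_corpus = ["the cat sat", "the dog sat", "the dog barked loudly"]

toy_vectoriser = TfidfVectorizer()
toy_tfidf = toy_vectoriser.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each word to its column index in the matrix
vocab = toy_vectoriser.vocabulary_

# Weights for the first sentence: the rarer word gets the larger weight
first = toy_tfidf[0]
print("the:", first[vocab["the"]], "cat:", first[vocab["cat"]])

# By default each row is scaled to unit length
row_norms = np.linalg.norm(toy_tfidf, axis=1)
print("row norms:", row_norms)
```

Words that appear in every sentence carry little weight, while words that are specific to a few sentences carry more, which is what lets the classifier pick up on vocabulary differences between the two textbooks.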



### Train Model: 

In [20]:
# https://git.arts.ac.uk/lmccallum/nlp-21-22/blob/master/NLP%20Week%204.1%20-%20Classification%20Task.ipynb
features = tfidf
labels = np.array(dataset["label"], dtype = int)
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.1, random_state=0)
gnb = GaussianNB()
# Training the model
model = gnb.fit(X_train, y_train)
# Test
y_pred = model.predict(X_test)
num_incorrect = (y_test != y_pred).sum()
total = y_test.shape[0]
acc = (total - num_incorrect) / total * 100
print("Number of mislabeled points out of a total %d points : %d, %0.3f" % (total, num_incorrect, acc))

Number of mislabeled points out of a total 353 points : 13, 96.317
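
The vectoriser in this notebook is fitted on the full dataset before the train/test split. An alternative ordering is to split the raw text first and fit TF-IDF on the training portion only, so that the test sentences play no part in building the vocabulary. The sketch below shows that ordering on made-up sentences. The data, the 25% test size, and the use of `accuracy_score` and `confusion_matrix` are assumptions for illustration, not the notebook's method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: label 0 = easier text, label 1 = harder text
texts = [
    "this is a cat", "the dog can run", "i see a red ball",
    "she has a big hat", "we go to school", "he likes to play",
    "the government announced comprehensive economic reforms",
    "environmental degradation threatens agricultural productivity",
    "technological innovation accelerates industrial development",
    "international cooperation requires diplomatic negotiation",
    "scientific research contributes to medical advancement",
    "educational institutions promote intellectual development",
]
labels = [0] * 6 + [1] * 6

# Split the raw sentences first, keeping the class ratio in both parts
train_txt, test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the vocabulary on the training sentences only
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_txt).toarray()
X_test = vec.transform(test_txt).toarray()

clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print("accuracy:", acc)
print("confusion matrix (rows = true label, columns = predicted):")
print(cm)
```

The confusion matrix shows errors per class, which is more informative than a single accuracy figure when one class is much larger than the other.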


### Adding in Some New Strings to Test the Model:

In [21]:
my_new_text = ["I am under siege. I agree with you on that."]

In [22]:
# https://git.arts.ac.uk/lmccallum/nlp-21-22/blob/master/NLP%20Week%204.1%20-%20Classification%20Task.ipynb
# turning new text into vectors
new_tfidf = tfidf_model.transform(my_new_text).toarray()

In [23]:
# https://git.arts.ac.uk/lmccallum/nlp-21-22/blob/master/NLP%20Week%204.1%20-%20Classification%20Task.ipynb
# trying new data 
your_new_data = new_tfidf
y_pred = model.predict(your_new_data)
# looking at predictions
for i, t in enumerate(my_new_text):
    print(t," -> class:", y_pred[i])

I am under siege. I agree with you on that.  -> class: 1
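
One thing worth keeping in mind when testing new strings is that a fitted `TfidfVectorizer` only knows the vocabulary it saw during fitting, and `transform` silently ignores every other word. The sketch below illustrates this on a made-up vocabulary (the sentences and variable names are assumptions for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training sentences that define the vocabulary
demo_vectoriser = TfidfVectorizer().fit(
    ["the cat sat on the mat", "the dog ran in the park"]
)

# Every word here was seen during fitting
seen = demo_vectoriser.transform(["the cat ran"]).toarray()

# None of these words were seen, so the vector is all zeros
unseen = demo_vectoriser.transform(["parliament legislation"]).toarray()

print("non-zero features (seen words):  ", np.count_nonzero(seen))
print("non-zero features (unseen words):", np.count_nonzero(unseen))
```

A new sentence whose words are mostly absent from the textbooks therefore reaches the classifier as a nearly empty vector, and its predicted class should be read with that in mind.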


### Cross-checking it with the Textstat Library:

In [24]:
# https://pypi.org/project/textstat/
trump = round(textstat.flesch_kincaid_grade("I am under siege. I agree with you on that."))
trump

-1
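
For context, the Flesch-Kincaid grade level is computed as 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59, so very short sentences made of short words can score below zero. The sketch below implements that formula directly. The `fk_grade` helper is hypothetical, and the word, sentence and syllable counts are hand counts; textstat estimates syllables automatically, so its result for the same text can differ.

```python
def fk_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level computed from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Hand count for "I am under siege. I agree with you on that."
# 10 words, 2 sentences, 12 syllables (assumed; automatic counters may disagree)
hand_counted = fk_grade(words=10, sentences=2, syllables=12)

# A hypothetical text of two three-word sentences made only of one-syllable words
very_simple = fk_grade(words=6, sentences=2, syllables=6)

print(round(hand_counted, 2), round(very_simple, 2))
```

Because the formula subtracts a constant of 15.59, any text with few words per sentence and close to one syllable per word lands near or below grade zero.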

#### `(All observations and findings shall be included in the critical essay).`

### Citation List:    

#### Websites:

1) Davis, A., 2021. The fundamentals of programming - Python Video Tutorial | LinkedIn Learning, formerly Lynda.com. [online] LinkedIn. Available at: <https://www.linkedin.com/learning/programming-foundations-fundamentals-3/the-fundamentals-of-programming?autoAdvance=true&autoSkip=false&autoplay=true&resume=true&u=57077561> [Accessed 24 October 2021].

2) Dib, F., 2021. regex101: build, test, and debug regex. [online] regex101. Available at: <https://regex101.com/> [Accessed 4 December 2021].

3) Libgen.is. 2021. Library Genesis. [online] Available at: <https://www.libgen.is/> [Accessed 4 November 2021].

4) McCallum, L., 2021. NLP Week 4.1 - Classification Task Notebook. [online] GitHub. Available at: <https://git.arts.ac.uk/lmccallum/nlp-21-22/blob/master/NLP%20Week%204.1%20-%20Classification%20Task.ipynb> [Accessed 16 November 2021].

5) Nisbet, J., 2021. Python for students - Python Video Tutorial | LinkedIn Learning, formerly Lynda.com. [online] LinkedIn. Available at: <https://www.linkedin.com/learning/python-for-students/python-for-students?autoAdvance=true&autoSkip=false&autoplay=true&resume=false&u=57077561> [Accessed 18 October 2021].

6) Portilla, J., 2021. Natural Language Processing with Python. [online] Udemy. Available at: <https://www.udemy.com/course/nlp-natural-language-processing-with-python/?ranMID=39197&ranEAID=JVFxdTr9V80&ranSiteID=JVFxdTr9V80-gIa4CDf8o_3HXX8ZIg_F1g&LSNPUBID=JVFxdTr9V80&utm_source=aff-campaign&utm_medium=udemyads> [Accessed 27 October 2021].

7) Real Python, 2021. Practical Text Classification With Python and Keras – Real Python. [online] Realpython.com. Available at: <https://realpython.com/python-keras-text-classification/> [Accessed 2 December 2021].

8) Rose, D., 2021. Artificial Intelligence Foundations: Neural Networks Video Tutorial | LinkedIn Learning, formerly Lynda.com. [online] LinkedIn. Available at: <https://www.linkedin.com/learning/artificial-intelligence-foundations-neural-networks/welcome?autoAdvance=true&autoSkip=false&autoplay=true&resume=true&u=57077561> [Accessed 6 December 2021].

9) PyPI. 2021. clean-text. [online] Available at: <https://pypi.org/project/clean-text/> [Accessed 14 November 2021].

10) PyPI. 2021. textstat. [online] Available at: <https://pypi.org/project/textstat/> [Accessed 15 November 2021].

11) Stack Abuse. 2021. Using Regex for Text Manipulation in Python. [online] Available at: <https://stackabuse.com/using-regex-for-text-manipulation-in-python/> [Accessed 16 November 2021].

12) Zamzar.com. 2021. Zamzar - video converter, audio converter, image converter, eBook converter. [online] Available at: <https://www.zamzar.com/> [Accessed 7 November 2021].
