# Using a Convolutional Neural Network for Text Readability Classification

Wherever third-party code has been referenced or adapted, the code cell cites it in a comment, and a Harvard-style citation list is included at the bottom of this notebook.

### Project Overview:

The purpose of this project is to create a text readability classifier (inspired by the Flesch-Kincaid readability tests) that determines whether a piece of text is easy or hard to read. I shall be using English textbooks from South-East Asian / Middle Eastern regions as datasets. Since most readability classifiers are built on data from the United Kingdom / United States, I thought it would be interesting to approach this problem using data from non-western regions, to see whether a model trained on it can predict readability accurately for English text from across the world. After building the classifier, I shall test it on speech / interview transcripts of various politicians as a use case, to gain more insight into their speaking styles.

### Project Aim:

1) To construct a model that gives writers more control over their writing, so that they can structure their work for their intended audience.

### Installing and Importing the Required Libraries:

In [65]:
import numpy as np
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Conv1D, GlobalMaxPooling1D
from gensim.models.keyedvectors import KeyedVectors
from nltk.tokenize.casual import casual_tokenize
from sklearn.model_selection import train_test_split
import pandas as pd
from random import shuffle
import regex as re
from cleantext import clean
from nltk import word_tokenize
import textstat

### Selection of Data:

For this project, I'm using English textbooks of varying grades from different countries. I found all of them on [Library Genesis](https://www.libgen.is/) and, since they were PDF files, converted them to text files using [Zamzar File Converter](https://www.zamzar.com/). I initially tried Python modules such as PDFMiner and PyPDF for this task, but kept running into errors because most of the code I found on Stack Overflow was not compatible with the latest version of Python.

For this notebook, I have used first and tenth grade textbooks from India, published by NCERT, which can be found [here](https://libgen.is/search.php?req=ncert+english&open=0&res=25&view=simple&phrase=1&column=def).


### Preprocessing the Data:

I've used regular expressions and the clean-text library to prepare the data before the classification task. I defined a `read_and_clean` function that reads a given text file and cleans its contents. Because the text files were converted from image-heavy PDFs, their line breaks do not follow any consistent pattern, so the function replaces each line break with either a space or a full stop depending on its context (as described in the comments). After that, the text is split into sentences at every full stop ('.'), and any sentence with two or fewer words is discarded as it won't be of much use.

In [66]:
def remove(text):
    text = re.sub(r"#\S+", " ", text)  # remove hashtags
    text = re.sub(r'\w*\d+\w*', '', text)  # remove any token that contains a digit
    text = re.sub(r'[^a-zA-Z0-9\n\?!\.]', ' ', text)  # replace special characters with spaces
    text = text.strip(" ")  # trim surrounding spaces
    text = text.strip(".")  # trim leading and trailing full stops
    return text

In [67]:
# Read a text file and return a list of cleaned sentences.
def read_and_clean(file_name):
    # read the whole file; the context manager closes it afterwards
    with open(file_name, 'r') as fs:
        book1 = fs.read()
    # two or more consecutive line breaks become a sentence boundary
    book1 = re.sub(r"\n{2,}", ". ", book1)
    # two or more consecutive whitespace characters become a sentence boundary
    book1 = re.sub(r"\s{2,}", ". ", book1)
    # a single line break followed by a space and a lowercase letter becomes a space
    book1 = re.sub(r"\n{1}(?=\s[a-z])", " ", book1)
    # a single line break followed directly by a lowercase letter becomes a space
    book1 = re.sub(r"\n{1}(?=[a-z])", " ", book1)
    # all remaining line breaks become sentence boundaries
    book1 = re.sub(r"\n", ". ", book1)
    total = []

    # https://pypi.org/project/clean-text/
    # NOTE: the return value is not assigned, so this call leaves book1 unchanged
    clean(book1, no_urls=True)

    # split into sentences at every '. '
    for i in book1.split(". "):
        # clean each sentence using the remove() function above
        clean_text = remove(i)
        # keep the sentence only if it has more than two tokens
        if len(word_tokenize(clean_text)) > 2:
            total.append(clean_text)
    # return the final list of sentences
    return total
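
To make the line-break rules easier to follow, here is a minimal sketch that applies the same five substitutions, in the same order, to a short made-up string. The helper name `normalise_line_breaks` and the sample text are illustrative assumptions, and the standard-library `re` module stands in for `regex`.

```python
import re

def normalise_line_breaks(text):
    """Apply the five line-break rules in the same order as read_and_clean."""
    text = re.sub(r"\n{2,}", ". ", text)        # blank line(s): sentence boundary
    text = re.sub(r"\s{2,}", ". ", text)        # run of whitespace: sentence boundary
    text = re.sub(r"\n(?=\s[a-z])", " ", text)  # break before space + lowercase: same sentence
    text = re.sub(r"\n(?=[a-z])", " ", text)    # break before lowercase: same sentence
    text = re.sub(r"\n", ". ", text)            # any other break: sentence boundary
    return text

# Made-up sample, not taken from the data files
sample = "My house is red\n\nSonu lived in a house\nof straw\nHe ran away"
sentences = normalise_line_breaks(sample).split(". ")
print(sentences)
# ['My house is red', 'Sonu lived in a house of straw', 'He ran away']
```

The line break before "of straw" is treated as a wrapped line and joined, while the break before "He" starts a new sentence because the next character is a capital letter.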

### Labelling the Data and Calling the Functions:

In [68]:
# read the grade one file
grade_one_sentence = read_and_clean("../data/gradeoneindia.txt")

In [74]:
label = 0  # 0 = grade one
new_examples1 = []
for i in grade_one_sentence:
    if len(word_tokenize(i)) > 2:
        new_examples1.append([i, label])

In [75]:
new_examples1 = new_examples1[16:] # slicing the few sentences in the beginning to remove the contents page.

In [71]:
new_examples1

[['The Tailor and his Friend', 1],
 ['My house is red', 1],
 ['a little house  A happy child am I', 1],
 ['I laugh and play the whole day long  I hardly ever cry', 1],
 ['I have a tree  a green  green tree  To shade me from the sun  And under it I often sit  When all my play is done',
  1],
 ['Read and match the words with the pictures', 1],
 ['Draw a line', 1],
 ['There are many kinds of houses', 1],
 ['Circle the ones you have seen', 1],
 ['Draw your house here and talk about it', 1],
 ['The following have lost their babies', 1],
 ['Trace along the maze using different colours and find them', 1],
 ['One has been done for you', 1],
 ['Once there were three little pigs Sonu  Monu and Gonu', 1],
 ['Sonu lived in a house of straw', 1],
 ['Monu lived in a house of sticks and One day a big bad wolf came to Sonu s house',
  1],
 ['He said   I will huff and puff and I will blow your house down.  So he huffed and he puffed and he blew the house down',
  1],
 ['Sonu ran to Monu s house', 1],
 

In [72]:
# read the grade ten file
grade_ten_sentence = read_and_clean("../data/gradetenindia.txt")


In [76]:
label = 1  # 1 = grade ten
new_examples2 = []
for i in grade_ten_sentence:
    if len(word_tokenize(i)) > 2:
        new_examples2.append([i, label])

In [None]:
# new_examples2 = new_examples2[17:] # slicing the few sentences in the beginning to remove the contents page.

In [77]:
new_examples2 = new_examples2[500:] # slicing further to avoid overfitting due to data imbalance

In [42]:
new_examples2

[['They often waited for Wanda Petronski   to have fun with her', 10],
 ['Most of the children in Room Thirteen didn t have names like that', 10],
 ['They had names easy to say  like Thomas  Smith or Allen', 10],
 ['There was one boy named Bounce  Willie Bounce  and people thought that was funny  but not funny in the same way that Petronski was',
  10],
 ['Wanda didn t have any friends', 10],
 ['She came to school alone and went home alone', 10],
 ['She always wore a faded blue dress that didn t hang right', 10],
 ['It was clean  but it looked as though it had never been ironed properly',
  10],
 ['She didn t have any friends  but a lot of girls talked to her', 10],
 ['Sometimes  they surrounded her in the school yard as she stood watching the little girls play hopscotch on the worn hard ground',
  10],
 ['Wanda   Peggy would say in a most courteous manner as though she were talking to Miss Mason',
  10],
 ['Wanda   she d say  giving one of her friends a nudge   tell us', 10],
 ['How m

### Checking for Data Imbalance:

In [63]:
len(new_examples1)

263

In [64]:
len(new_examples2)

658

In [43]:
len(new_examples1)+len(new_examples2)

921
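
The two classes are still uneven after slicing (263 grade-one sentences against 658 grade-ten sentences). One possible way to compensate, which is not what this notebook does, is to weight each class inversely to its frequency. The sketch below computes such weights from the counts above; the helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by total / (number of classes * class count)."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}

# Label lists rebuilt from the counts printed above (0 = grade one, 1 = grade ten)
labels = [0] * 263 + [1] * 658
weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

A dictionary of this shape could be passed to Keras through the `class_weight` argument of `model.fit`, so that mistakes on the smaller class count for more during training.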

### Checking the Readability Scores using the Textstat Library:

In [78]:
with open('../data/gradeoneindia.txt', 'r') as fs:
    bookone = fs.read()

In [79]:
# https://pypi.org/project/textstat/
bookonescore = round(textstat.flesch_kincaid_grade(bookone))
bookonescore

7

In [80]:
with open('../data/gradetenindia.txt', 'r') as fs:
    booktwo = fs.read()

In [81]:
# https://pypi.org/project/textstat/
booktwoscore = round(textstat.flesch_kincaid_grade(booktwo))
booktwoscore

16
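
For reference, the Flesch-Kincaid grade level is defined as 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below implements that formula with a deliberately crude syllable counter (groups of consecutive vowels), so its numbers will not match textstat, which counts syllables in its own way. The function names and the two sample texts are illustrative assumptions.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (at least one per word)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fk_grade_sketch(text):
    """Approximate Flesch-Kincaid grade using the heuristic syllable counter."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

easy = "My house is red. Draw a line."
hard = ("Sometimes they surrounded her in the school yard as she stood "
        "watching the little girls play hopscotch on the worn hard ground.")
print(fk_grade_sketch(easy) < fk_grade_sketch(hard))  # True
```

Longer sentences and longer words both push the grade up, which is why the second sample scores higher than the first.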

### Organising the Labelled Data together using a Pandas Dataframe:

In [44]:
# DataFrame.append was removed in pandas 2.0, so build the frame directly
dataset = pd.DataFrame(new_examples2 + new_examples1, columns=["text", "label"])

### Downloading Pre-trained Vectors trained on part of Google News dataset:

In [82]:
# https://code.google.com/archive/p/word2vec/
!pip install wget
import wget
import os
# download into ../data, which is where the loading cell below expects the file
if not os.path.isfile("../data/GoogleNews-vectors-negative300.bin.gz"):
    print("downloading")
    wget.download("https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz",
                  out="../data/GoogleNews-vectors-negative300.bin.gz")
else:
    print("you already have it!")

In [47]:
embeddings_file = "../data/GoogleNews-vectors-negative300.bin.gz"
wv = KeyedVectors.load_word2vec_format(embeddings_file, binary=True, limit=200000)

### Classification:

In [48]:
# Code adapted from the NLP in Action book ch7 https://github.com/totalgood/nlpia
word_vectors = wv  # reuse the vectors loaded in the previous cell instead of reading the file again
def tokenize_and_vectorize(dataset):
    vectorized_data = []
    for sample in dataset:
        tokens = casual_tokenize(sample)
        sample_vecs = []
        for token in tokens:
            try:
                sample_vecs.append(word_vectors[token])
            except KeyError:
                pass  # No matching token in the Google w2v vocab
        vectorized_data.append(sample_vecs)

    return vectorized_data

def pad_trunc(data, maxlen):
    """Pad each sample with zero vectors, or truncate it, to exactly maxlen vectors."""
    new_data = []

    # Take the vector length from the first non-empty sample, since a sample
    # is empty when none of its tokens are in the word2vec vocabulary
    vector_len = next(len(sample[0]) for sample in data if len(sample) > 0)
    zero_vector = [0.0] * vector_len

    for sample in data:
        # Copy (and truncate) so the caller's list is not modified in place
        temp = list(sample[:maxlen])
        # Append zero vectors until the sample is maxlen long
        temp.extend([zero_vector] * (maxlen - len(temp)))
        new_data.append(temp)
    return new_data
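
The sketch below illustrates the two steps above on toy data: looking up token vectors while skipping out-of-vocabulary words, then padding or truncating to a fixed length. The two-dimensional `toy_vectors` dictionary and the helper names `vectorize` and `pad_or_truncate` are made up for illustration and stand in for the 300-dimensional word2vec vectors.

```python
def vectorize(tokens, vectors):
    """Look up each token, silently skipping out-of-vocabulary words."""
    return [vectors[t] for t in tokens if t in vectors]

def pad_or_truncate(sample, maxlen, dims):
    """Return a new list of exactly maxlen vectors without mutating sample."""
    clipped = sample[:maxlen]
    padding = [[0.0] * dims for _ in range(maxlen - len(clipped))]
    return clipped + padding

# Toy 2-dimensional "embeddings" (assumed values, for illustration only)
toy_vectors = {"my": [0.1, 0.2], "house": [0.3, 0.4], "is": [0.5, 0.6], "red": [0.7, 0.8]}

short = vectorize(["my", "house", "zzz"], toy_vectors)        # "zzz" is out of vocabulary
longer = vectorize(["my", "house", "is", "red"], toy_vectors)

print(pad_or_truncate(short, maxlen=3, dims=2))
# [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
print(pad_or_truncate(longer, maxlen=3, dims=2))
# [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
```

Every sample ends up with the same number of vectors, which is what allows the training data to be reshaped into a single `(samples, maxlen, embedding_dims)` array.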

In [49]:
features = tokenize_and_vectorize(dataset["text"].values)
x_train, x_test, y_train, y_test = train_test_split(features, dataset["label"], test_size=0.2, random_state=0)

In [50]:
type(x_train)

list

In [51]:
maxlen = 50
embedding_dims = 300    # Length of the token vectors for passing into the Convnet

In [52]:
np.array(x_train).shape,np.array(x_test).shape

((736,), (185,))

In [53]:
x_train = pad_trunc(x_train, maxlen)
x_test = pad_trunc(x_test, maxlen)

In [54]:
type(x_train)

list

In [55]:
np.array(x_train).shape

(736, 50, 300)

In [56]:
x_train = np.reshape(x_train, (len(x_train), maxlen, embedding_dims))
y_train = np.array(y_train)
x_test = np.reshape(x_test, (len(x_test), maxlen, embedding_dims))
y_test = np.array(y_test)

In [58]:
np.array(x_train).shape,np.array(x_test).shape

((736, 50, 300), (185, 50, 300))

In [59]:
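batch_size = 32    # samples per gradient update
filters = 1        # number of convolution filters
kernel_size = 10   # number of tokens covered by each filter window
hidden_dims = 10   # units in the dense hidden layer
epochs = 7         # passes over the training data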

### Building the Model: 

In [60]:
# Code adapted from the NLP in Action book ch7 https://github.com/totalgood/nlpia
print('Build model...')
model = Sequential()

# Adding a Conv1D layer, which learns `filters` word-group filters,
# each spanning kernel_size tokens:
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1,
                 input_shape=(maxlen, embedding_dims)))
# Using max pooling:
model.add(GlobalMaxPooling1D())
# Adding a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.4))
model.add(Activation('relu'))
# Projecting onto a single unit output layer, and squashing it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()  # prints the table itself; wrapping it in print() would also print None

Build model...
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_1 (Conv1D)            (None, 41, 1)             3001      
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 1)                 0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                20        
_________________________________________________________________
dropout_1 (Dropout)          (None, 10)                0         
_________________________________________________________________
activation_2 (Activation)    (None, 10)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 11        
_________________________________________________________________
activation_3 (Activation)    (None, 1) 
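
The output shapes and parameter counts in the summary above can be checked by hand. The short calculation below reproduces them from the hyperparameters defined earlier; the variable names on the left (`conv_steps`, `conv_params` and so on) are just labels chosen for this sketch.

```python
maxlen, embedding_dims = 50, 300
filters, kernel_size, hidden_dims = 1, 10, 10

# 'valid' padding with stride 1: the window fits in maxlen - kernel_size + 1 positions
conv_steps = maxlen - kernel_size + 1

# each filter has kernel_size * embedding_dims weights plus one bias
conv_params = kernel_size * embedding_dims * filters + filters

# global max pooling keeps one value per filter, which feeds the hidden layer
hidden_params = filters * hidden_dims + hidden_dims

# single sigmoid output unit: one weight per hidden unit plus one bias
output_params = hidden_dims * 1 + 1

print(conv_steps, conv_params, hidden_params, output_params)  # 41 3001 20 11
print(conv_params + hidden_params + output_params)            # 3032
```

With only one filter, the pooled representation of each sentence is a single number, so almost all of the trainable parameters sit in the convolution layer.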

In [61]:
# https://stackoverflow.com/questions/58636087/tensorflow-valueerror-failed-to-convert-a-numpy-array-to-a-tensor-unsupporte

x_train = np.asarray(x_train).astype(np.float32)
y_train = np.asarray(y_train).astype(np.float32)
x_test = np.asarray(x_test).astype(np.float32)
y_test = np.asarray(y_test).astype(np.float32)

### Train Model: 

In [62]:
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))
model_structure = model.to_json()
with open("cnn_model.json", "w") as json_file:
    json_file.write(model_structure)

model.save_weights("cnn_weights.h5")
print('Model saved.')

Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
Model saved.
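
Because the final layer is a sigmoid, the model outputs a value between 0 and 1 for each sentence, where values near 1 correspond to the grade-ten label and values near 0 to the grade-one label. A minimal sketch of how such outputs might be turned into labels for a transcript is shown below. The helper name `summarise_predictions`, the 0.5 threshold, and the example probabilities are all assumptions for illustration, not results from this model.

```python
def summarise_predictions(probabilities, threshold=0.5):
    """Map sigmoid outputs to 'easy'/'hard' and report the share of hard sentences."""
    labels = ["hard" if p >= threshold else "easy" for p in probabilities]
    share_hard = labels.count("hard") / len(labels) if labels else 0.0
    return labels, share_hard

# Placeholder values standing in for model.predict output; not real results
example_probabilities = [0.12, 0.81, 0.55, 0.30]
labels, share_hard = summarise_predictions(example_probabilities)
print(labels, share_hard)
# ['easy', 'hard', 'hard', 'easy'] 0.5
```

Reporting the share of "hard" sentences per transcript gives a single figure that can be compared across speakers.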


#### `(All observations and findings shall be included in the critical essay).`

### Citation List:    


#### Websites:

1) Code.google.com. 2022. Google Code Archive - Word2Vec. [online] Available at: <https://code.google.com/archive/p/word2vec/> [Accessed 5 December 2021].

2) Davis, A., 2021. The fundamentals of programming - Python Video Tutorial | LinkedIn Learning, formerly Lynda.com. [online] LinkedIn. Available at: <https://www.linkedin.com/learning/programming-foundations-fundamentals-3/the-fundamentals-of-programming?autoAdvance=true&autoSkip=false&autoplay=true&resume=true&u=57077561> [Accessed 24 October 2021].

3) Dib, F., 2021. regex101: build, test, and debug regex. [online] regex101. Available at: <https://regex101.com/> [Accessed 4 December 2021].

4) GitHub. 2021. GitHub - totalgood/nlpia: Examples and libraries for "Natural Language Processing in Action" book. [online] Available at: <https://github.com/totalgood/nlpia> [Accessed 5 December 2021].

5) Libgen.is. 2021. Library Genesis. [online] Available at: <https://www.libgen.is/> [Accessed 4 November 2021].

6) McCallum, L., 2021. NLP Week 4.1 - Classification Task Notebook. [online] GitHub. Available at: <https://git.arts.ac.uk/lmccallum/nlp-21-22/blob/master/NLP%20Week%204.1%20-%20Classification%20Task.ipynb> [Accessed 16 November 2021].

7) McCallum, L., 2021. NLP Week 5.1 CNNs Notebook. [online] GitHub. Available at: <https://git.arts.ac.uk/lmccallum/nlp-21-22/blob/master/NLP%20Week%205.1%20CNNs.ipynb> [Accessed 2 December 2021].

8) Nisbet, J., 2021. Python for students - Python Video Tutorial | LinkedIn Learning, formerly Lynda.com. [online] LinkedIn. Available at: <https://www.linkedin.com/learning/python-for-students/python-for-students?autoAdvance=true&autoSkip=false&autoplay=true&resume=false&u=57077561> [Accessed 18 October 2021].

9) Portilla, J., 2021. Natural Language Processing with Python. [online] Udemy. Available at: <https://www.udemy.com/course/nlp-natural-language-processing-with-python/?ranMID=39197&ranEAID=JVFxdTr9V80&ranSiteID=JVFxdTr9V80-gIa4CDf8o_3HXX8ZIg_F1g&LSNPUBID=JVFxdTr9V80&utm_source=aff-campaign&utm_medium=udemyads> [Accessed 27 October 2021].

10) Rose, D., 2021. Artificial Intelligence Foundations: Neural Networks Video Tutorial | LinkedIn Learning, formerly Lynda.com. [online] LinkedIn. Available at: <https://www.linkedin.com/learning/artificial-intelligence-foundations-neural-networks/welcome?autoAdvance=true&autoSkip=false&autoplay=true&resume=true&u=57077561> [Accessed 6 December 2021].

11) PyPI. 2021. clean-text. [online] Available at: <https://pypi.org/project/clean-text/> [Accessed 14 November 2021].

12) PyPI. 2021. textstat. [online] Available at: <https://pypi.org/project/textstat/> [Accessed 15 November 2021].

13) Stack Abuse. 2021. Using Regex for Text Manipulation in Python. [online] Available at: <https://stackabuse.com/using-regex-for-text-manipulation-in-python/> [Accessed 16 November 2021].

14) T., Carvalho, V. and Fedor, Z., 2021. Tensorflow - ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float). [online] Stack Overflow. Available at: <https://stackoverflow.com/questions/58636087/tensorflow-valueerror-failed-to-convert-a-numpy-array-to-a-tensor-unsupporte> [Accessed 5 December 2021].

15) Zamzar.com. 2021. Zamzar - video converter, audio converter, image converter, eBook converter. [online] Available at: <https://www.zamzar.com/> [Accessed 7 November 2021].
