<a href="https://colab.research.google.com/github/adunuthulan/LanguageLevel/blob/master/LanguageLevel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Analyzing The Reading Level of Text with Machine Learning**
By Nirav Adunuthula
Started Oct 11, 2019

## Why Predict Reading Level?
For beginners, learning a language is difficult. You need to speak, hear, and read the language to gain full literacy. For people unable to join a class, the internet is a great resource for free material; however, there isn't always a clear guide to which books are at a person's level and will help them improve their literacy. The question I've run into is this: **how would one categorize writing into a reading level?**

Here are some features off the top of my head that logically would correspond to reading level:
* The complexity of the words used/the maturity level
* The average word/sentence length
* The use of complex punctuation/grammar (colons, dashes, etc.)

There might be other features that we are not considering or that are not clear to us. For that, we can use Machine Learning. 
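
As a rough illustration of the hand-picked features listed above, here is a minimal sketch that computes average word length, average sentence length, and a count of "complex" punctuation for one passage. The function name `surface_features`, the regex-based sentence split, and the choice of punctuation characters are all assumptions made for this example; they are not part of the notebook's pipeline.

```python
import re

def surface_features(text):
    """Compute a few simple readability features for one passage."""
    # Naive sentence split: whitespace that follows ., ! or ? (an assumption;
    # nltk.sent_tokenize handles abbreviations and other edge cases better).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words are runs of letters, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    # Characters picked here as a stand-in for "complex" punctuation.
    complex_punct = sum(text.count(ch) for ch in ":;()")
    n_words = max(len(words), 1)      # guard against empty input
    n_sents = max(len(sentences), 1)
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / n_sents,
        "complex_punct_per_sentence": complex_punct / n_sents,
    }

sample = "People eat shrimp. Shrimp comes from the ocean."
features = surface_features(sample)
print(features)
```

Features like these could be fed to a simple classifier as a baseline before trying a learned text representation.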

## The Data

But in order to use ML, we need data. Books used pedagogically in schools have fairly well-defined reading levels, given by the grade at which they are taught, 
so I shall use them as my training data. Words by themselves are difficult to categorize into reading levels ('the' is probably in every work of literature), so **I shall feed in chapters of books alongside the grade they are taught at as my data.**

In [None]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

TensorFlow 2.x selected.


In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras
import tensorflow_hub as hub
import tensorflow_datasets as tfds

# Helper libraries
import numpy as np
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import itertools
import random

print("Tensorflow version: ", tf.__version__)
print("Hub version: ", hub.__version__)

Tensorflow version:  2.0.0
Hub version:  0.7.0


In [None]:
nltk.download("book")

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to /root/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to /root/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package dependency_treebank is already up-to-date!
[nltk_data]    | Downloadi

True

In [None]:
#mount drive when using Google Colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#import readingdata in a csv file and put it in a pandas data frame

path = "/content/drive/My Drive/Colab Notebooks/LanguageLevelData/CSV/readingdata.csv"
rd = pd.read_csv(path)
print(rd)

                                                Text   ReadingLevel
0  Pedro wants to ride his skateboard. Pedro has ...              1
1  People eat shrimp. Shrimp comes from the ocean...              1
2  Look at all of these big buildings! This is a ...              2
3  The environment is all around you. Rocks, soil...              2
4  Cactus pygmy-owls are little birds. They are a...              3


### Pre-processing the Data
The data is in the form (String text, int grade_level). We will extract some features from it, such as the number of sentences in each text and how often each word is repeated. In the feature list used to train a classifier, we will ignore punctuation and possibly discount common words like 'and' or 'a'.

In [None]:
#Preprocess the text data which is in the form (String text, int grade_level)
sentences = []
for row in rd.itertuples():
  sentences.append(nltk.sent_tokenize(row[1].lower()))

#get the number of sentences in each text for future feature processing
num_sent = [len(sent) for sent in sentences]

tokenized_sentences = []
for row in rd.itertuples():
  tokenized_sentences.append(nltk.word_tokenize(row[1].lower()))
print(tokenized_sentences)

#find the frequency of words in sentences
word_freq = [nltk.FreqDist(t_sent) for t_sent in tokenized_sentences]
print ("Found %d unique word tokens in book 1" % len(word_freq[0].items()))

#Put the labels into an array and put the sentences with the labels
rd_token = rd.copy()
print(rd_token)

[['pedro', 'wants', 'to', 'ride', 'his', 'skateboard', '.', 'pedro', 'has', 'pads', 'for', 'his', 'knees', '.', 'he', 'also', 'has', 'pads', 'for', 'his', 'elbows', '.', 'he', 'has', 'pads', 'for', 'his', 'hands', '.', 'he', 'puts', 'on', 'his', 'helmet', '.', 'pedro', 'puts', 'on', 'his', 'safety', 'shoes', '.', 'he', 'has', 'his', 'skateboard', '.', 'let', "'s", 'have', 'fun', '!'], ['people', 'eat', 'shrimp', '.', 'shrimp', 'comes', 'from', 'the', 'ocean', '.', 'people', 'eat', 'clams', '.', 'clams', 'come', 'from', 'the', 'ocean', '.', 'people', 'eat', 'lobsters', '.', 'lobsters', 'come', 'from', 'the', 'ocean', '.', 'people', 'eat', 'small', 'fish', '.', 'small', 'fish', 'come', 'from', 'the', 'ocean', '.', 'people', 'eat', 'big', 'fish', '.', 'big', 'fish', 'come', 'from', 'the', 'ocean', '.', 'people', 'eat', 'mussels', '.', 'mussels', 'come', 'from', 'the', 'ocean', '.', 'people', 'eat', 'many', 'foods', 'from', 'the', 'ocean', '.'], ['look', 'at', 'all', 'of', 'these', 'big', 
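
The token lists printed above still contain punctuation and very common words. The sketch below shows one way the filtering described earlier might look. `STOP_WORDS` is a deliberately tiny, hypothetical list and `content_word_counts` is a name invented for this example; in practice a fuller list such as `nltk.corpus.stopwords.words('english')` would be a more sensible choice.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "for", "on", "his", "he", "has"}

def content_word_counts(tokens):
    """Count tokens, skipping punctuation-only tokens and stop words."""
    kept = [
        tok for tok in tokens
        # keep a token only if it has at least one letter or digit
        if any(ch.isalnum() for ch in tok) and tok not in STOP_WORDS
    ]
    return Counter(kept)

tokens = ['pedro', 'wants', 'to', 'ride', 'his', 'skateboard', '.',
          'pedro', 'has', 'pads', 'for', 'his', 'knees', '.']
counts = content_word_counts(tokens)
print(counts.most_common(3))
```

Whether dropping stop words helps here is an open question: for reading level, the proportion of very common words may itself be a useful signal.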

In [None]:
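def shuff(batch_size):
  #copy the first batch_size texts so the original list keeps its order
  sent_shuff = sentences[0:batch_size]
  #random.shuffle works in place and returns None, so return the list itself
  random.shuffle(sent_shuff)
  return sent_shuff

def batch(batch_size):
  #first batch_size texts, in their original order
  return sentences[0:batch_size]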

### Using the TensorFlow Dataset Pipeline
Instead of pre-processing the data into tokenized sentences for our own ML analysis, we can also convert the CSV into a `tf.data.Dataset` so it can be fed directly into TensorFlow.

In [None]:
target = rd.pop(' ReadingLevel')
dataset = tf.data.Dataset.from_tensor_slices((rd.values[:, 0], target.values))

for feat, targ in dataset.take(5):
  print ('Features: {}, Target: {}'.format(feat, targ))

Features: b"Pedro wants to ride his skateboard. Pedro has pads for his knees. he also has pads for his elbows. He has pads for his hands. He puts on his helmet. Pedro puts on his safety shoes. He has his skateboard. Let's have fun!", Target: 1
Features: b'People eat shrimp. Shrimp comes from the ocean. People eat clams. Clams come from the ocean. People eat lobsters. Lobsters come from the ocean. People eat small fish. Small fish come from the ocean. People eat big fish. Big fish come from the ocean. People eat mussels. Mussels come from the ocean. People eat many foods from the ocean.', Target: 1
Features: b'Look at all of these big buildings! This is a city. A city is an urban community. An urban community is a place where many people live. Do you know what these big buildings are? They are apartment buildings. Many people in urban communities live in apartments. People in all communities work. Some people in cities work in factories. Factories make things like cars, tools, and toys.

In [None]:
#the percent of the data that goes into the training set VS validation set
percentTV = .8

#split up the data, using a percentage for the training set and all the data for testing
num_take = tf.cast((len(rd)*percentTV), tf.int64)
train_data = dataset.take(num_take)
validation_data = dataset.skip(num_take).take(len(rd)-num_take)
test_data = dataset
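
The cell above reuses the full dataset as `test_data`, so the test score includes examples the model was trained on. With more data, a held-out test set would give a more honest estimate. The following is a minimal sketch of a three-way split on a plain Python list; the function name `three_way_split`, the fractions, and the seed are assumptions for illustration and are not taken from this notebook.

```python
import random

def three_way_split(items, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle a copy of items and cut it into train/validation/test lists."""
    shuffled = list(items)
    # A seeded Random instance keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out
    return train, val, test

rows = list(range(20))  # stand-in for 20 (text, grade) pairs
train, val, test = three_way_split(rows)
print(len(train), len(val), len(test))
```

With only the five rows shown earlier, a split like this leaves too few examples in each part to be meaningful, so it mainly becomes useful once more chapters are added.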



PS: How to flatten a nested list (a list of lists) into a single list

`flattened = [val for sublist in sentences for val in sublist]`
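
An equivalent way to do the same flattening uses `itertools`, which this notebook already imports. This is a small sketch on made-up data; the `sentences` list here is a stand-in and not the notebook's variable.

```python
import itertools

# Stand-in data: one inner list of sentences per text.
sentences = [["pedro wants to ride.", "let's have fun!"], ["people eat shrimp."]]

# chain.from_iterable walks each inner list in order without nesting loops.
flattened = list(itertools.chain.from_iterable(sentences))
print(flattened)
```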

## Setting up a Model with Tensorflow

#### Using Our Pre-Processed Text

#### Using TensorFlow-Hub
Rather than training on hand-picked features, we can also let the model look at the raw text and reading levels and come up with its own way of classifying the text. We will use TensorFlow Hub to turn the text into a fixed-size embedding that Keras can use.


In [None]:
#Set up the layers of the model. At first we will make the machine self-identify features of the text.

embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)

#hub_layer()

model = tf.keras.Sequential()

model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 20)                400020    
_________________________________________________________________
dense (Dense)                (None, 16)                336       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________


In [None]:
#following the tensorflow.org tutorial on ML on text
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
for a, b in train_data.batch(num_take).take(5):
  print("Feature shape: ", a.shape)
  print("Target shape: ", b.shape)

Feature shape:  (4,)
Target shape:  (4,)


In [None]:
history = model.fit(train_data.shuffle(num_take).batch(2),
                    epochs=4,
                    validation_data=validation_data.batch(len(rd)-num_take),
                    verbose=1)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


Now let's see how the model performs. Two values are returned: the loss (a number that represents our error, where lower is better) and the accuracy.

In [None]:
results = model.evaluate(test_data.batch(5), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

1/1 - 0s - loss: -5.6327e-01 - accuracy: 0.4000
loss: -0.563
accuracy: 0.400
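
A negative loss is a warning sign here. Binary cross-entropy assumes labels of 0 or 1, but the `ReadingLevel` values shown earlier run from 1 to 3. The sketch below writes out the per-example formula by hand (it is not Keras's implementation, which also clips probabilities) to show how a label above 1 can drive the value below zero.

```python
import math

def binary_crossentropy(y_true, p):
    """Per-example binary cross-entropy for a predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

loss_valid = binary_crossentropy(1, 0.99)   # label inside {0, 1}: positive loss
loss_grade3 = binary_crossentropy(3, 0.99)  # grade used as label: (1 - y) is negative
print(loss_valid, loss_grade3)
```

One possible remedy, offered as an assumption rather than something this notebook does, is to treat each grade as its own class: a final `Dense` layer with one unit per grade and a softmax activation, trained with `sparse_categorical_crossentropy` on labels shifted to start at 0. Treating the grade as a regression target with a mean squared error loss is another option. The 0.400 accuracy is also consistent with a sigmoid output saturating near 1, since two of the five rows shown have label 1, though that explanation is a guess.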


In [None]:
moun