# Machine Learning based Text Difficulty Classification

## Problem Definition

More than 1.5 billion people are currently learning English. Reading English books is an excellent way to improve, but learners need books that they can understand and are comfortable with, and such books are hard to find. To help, we built a regression model that analyzes a text and outputs its difficulty level. Unlike traditional manual approaches, this one combines many useful text features. English learners can use this classification system to find books that they can read and understand.

## Feasibility Analysis

Logistic regression is a type of generalized linear model. For this problem, we assume that sentence length, word length, and word difficulty relate to text difficulty in a roughly linear way, and that the output for each text is independent of the others. Under these assumptions, a logistic regression model is a reasonable fit for the problem.

In this scenario, we can begin by using the above features as input and the text difficulty level as output.
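
To make the feasibility argument concrete, the following minimal sketch shows how a multinomial logistic regression turns a linear combination of features into one probability per difficulty level. The feature values, weights, biases, and the function name `predict_level` are made up for illustration; they are not values from the trained model.

```python
import math


def softmax(scores):
    """Convert raw linear scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_level(features, weights, biases):
    """Return (most likely level, probabilities); levels are numbered from 1."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    return probs.index(max(probs)) + 1, probs


# Hypothetical normalized features: [sentence length, word length, word difficulty]
hard_text = [0.8, 0.7, 0.9]
easy_text = [0.1, 0.1, 0.1]

# Hypothetical weights, one row per difficulty level (books 1-4)
weights = [
    [-3.0, -3.0, -3.0],
    [-1.0, -1.0, -1.0],
    [1.0, 1.0, 1.0],
    [3.0, 3.0, 3.0],
]
biases = [3.0, 1.5, -0.5, -3.0]

hard_level, hard_probs = predict_level(hard_text, weights, biases)
easy_level, easy_probs = predict_level(easy_text, weights, biases)
print(hard_level, [round(p, 3) for p in hard_probs])
print(easy_level, [round(p, 3) for p in easy_probs])
```

Larger feature values push the score of the higher levels up, so the predicted level grows with the features, which is the linear relationship the analysis above relies on.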

## Implementation
First, we need a data set that is trustworthy, clearly labels each text with a difficulty level, and is large enough. In the end, we chose books 1-4 of New Concept English.

We used the book number as the difficulty label. The series covers most of English grammar, and with 160 texts the data set is large enough.

![](img/NCE.png)

Most of the data-loading code is in `api.py`, while the main code is in `text_diff.py`.

In [4]:
import re

from api.api import *
from textstat.textstat import legacy_round, textstatistics

Now, let's start to extract the features from the text and classify the difficulty. The **basic features** are the text length, sentence length, word length, and number of sentences. In most cases, the larger those features are, the more difficult the text.

In [5]:
# Feature: text length
def features_len_text(text):
    # Normalize to a number between 0..1
    len_text = [len(text) / 3000]
    return len_text


# Feature: sentence length
def features_len_sentence(text):
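    # Characters and common words to replace with spaces before splitting
    # (the quote and comma entries are assumed; the original list was garbled)
    replace_list = [
        '\n', ':', ';', '"', ',',
        '--', ' $', ' he ', ' him ', ' she ', ' her ', ' the ', '1', '2',
        '3', '4', '5', '6', '7', '8', '9', '0', ' they ', ' them '
    ]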

    for i in replace_list:
        text = text.replace(i, ' ')

    sentences = re.split('[.?!]', text)

    # Ignore all sentences shorter than 5 characters
    sentences = [i for i in sentences if len(i) >= 5]

    num_of_sentence = len(sentences)
    len_sentence = [0] * num_of_sentence

    words_array = [(i.strip().split(" ")) for i in sentences]

    for i in range(0, num_of_sentence):
        for word in words_array[i]:
            if len(word) > 1:
                len_sentence[i] += 1

    # Ignore all sentences where length is bigger than 3-sigma
    len_sentence = mean_3_sigma(len_sentence)

    # Build a sentence length histogram with 20 buckets;
    # lengths of 20 or more wrap around into bucket 0
    len_sentence_hist = [0] * 20
    for i in len_sentence:
        len_sentence_hist[min(20, i) % 20] += 1
    len_sentence_hist = [i / sum(len_sentence_hist) for i in len_sentence_hist]

    return len_sentence_hist


# Feature: word length
def features_len_word(text):

    # Split into single words, and convert every word to its base form, for example "apples" to "apple"
    words = splitwords(text)
    len_word = [len(i) for i in words]

    # Build a word length histogram with 5 buckets;
    # the bucket index is the word length (capped at 21) modulo 5
    len_word_hist = [0] * 5
    for i in len_word:
        len_word_hist[min(21, i) % 5] += 1
    len_word_hist = [i / sum(len_word_hist) for i in len_word_hist]

    return len_word_hist


# Feature: number of sentences
def features_num_sentence(text):
    sentences = re.split('[.?!]', text)
    num_of_sentences = [len(sentences) / 32]

    return num_of_sentences
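
`mean_3_sigma` is not shown in this notebook. As a rough, self-contained illustration of the two ideas used above (discarding outliers beyond three standard deviations, then building a normalized histogram), consider the sketch below. Both helper names are hypothetical, and the choice to clamp overflow values into the last bucket is an assumption made for this sketch; the notebook code handles overflow differently.

```python
import statistics


def drop_3_sigma_outliers(values):
    """Keep only values within three standard deviations of the mean."""
    if len(values) < 2:
        return list(values)
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) <= 3 * sigma]


def normalized_histogram(values, num_buckets):
    """Count values per bucket (overflow goes to the last bucket), then normalize."""
    hist = [0] * num_buckets
    for v in values:
        hist[min(num_buckets - 1, v)] += 1
    total = sum(hist)
    if total == 0:
        return [0.0] * num_buckets
    return [count / total for count in hist]


# Made-up sentence lengths in words; 200 is an obvious outlier
lengths = [8] * 10 + [12] * 10 + [200]
filtered = drop_3_sigma_outliers(lengths)
hist = normalized_histogram(filtered, 20)
print(len(filtered), hist[8], hist[12])
```

Normalizing by the total makes the histogram comparable across texts of different lengths, since each bucket then holds a fraction rather than a raw count.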

We also use more **advanced features**. English words are made of syllables, and in general the more syllables a word has, the harder it is. We can use the `syllable_count` method of `textstatistics` to count the syllables.

We can also use the amount of word repetition to analyze the difficulty of the text: the fewer words are repeated, the more distinct words a reader has to know. Counting the number of unique words therefore gives another indicator of difficulty.

Another advanced feature is word difficulty. The code for it is in `api.py`. One way is to use the text corpus itself to assign a difficulty level to each word. Another way is to use the NGSL word frequency table: the higher a word's frequency, the easier the word is.
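
The frequency-table idea can be sketched as follows. This is a minimal illustration, not the code in `api.py`: the tiny rank table stands in for a real list such as NGSL, its ranks are invented, and the names `word_difficulty`, `mean_word_difficulty`, and `UNKNOWN_RANK` are hypothetical.

```python
import math

# Toy stand-in for a word frequency table: word -> frequency rank
# (1 = most frequent). These ranks are invented for illustration only.
TOY_FREQUENCY_RANK = {"the": 1, "be": 2, "read": 250, "book": 300, "analyze": 1500}

# Rank assigned to words missing from the table (assumed rare, hence hard)
UNKNOWN_RANK = 3000


def word_difficulty(word, rank_table=TOY_FREQUENCY_RANK, max_rank=UNKNOWN_RANK):
    """Map a frequency rank to a difficulty in [0, 1]; rarer words score higher."""
    rank = min(rank_table.get(word.lower(), max_rank), max_rank)
    return math.log(rank) / math.log(max_rank)


def mean_word_difficulty(words):
    """Average difficulty of a list of words (0.0 for an empty list)."""
    if not words:
        return 0.0
    return sum(word_difficulty(w) for w in words) / len(words)


easy = mean_word_difficulty(["the", "book"])
hard = mean_word_difficulty(["analyze", "epistemology"])
print(round(easy, 3), round(hard, 3))
```

Using the logarithm of the rank compresses the long tail of rare words, so the difference between ranks 1 and 100 counts for more than the difference between ranks 2000 and 2100.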

In [6]:
# Feature: syllables per word
def features_syllables_word(text):
    words = splitwords(text)
    syllable_word = [textstatistics().syllable_count(i) for i in words]

    # Make a histogram with 5 buckets;
    # words with 5 or more syllables wrap around into bucket 0
    syllable_word_hist = [0] * 5
    for i in syllable_word:
        syllable_word_hist[min(5, i) % 5] += 1
    syllable_word_hist = [
        i / sum(syllable_word_hist) for i in syllable_word_hist
    ]

    return syllable_word_hist


# Feature: number of unique words
def features_unique_words(text):
    words_array = splitwords(text)
    # Only count the unique words
    unique_words = set(words_array)

    len_unique_words = [len(unique_words) / 230]

    return len_unique_words

We stop adding features here and focus on how to use them for model training. In the list `newFeatures`, every element is a reference to one of the feature extraction functions defined above.

In [7]:
print("1. Assigning difficulty level to each word")
textbooks_data = load_textbooks_data()
diff_level = get_diff_level(textbooks_data)

print("2. Generating features from training data")
newFeatures = [
    features_len_text, features_len_sentence,
    features_len_word, features_num_sentence, 
    features_syllables_word, features_unique_words
]

1. Assigning difficulty level to each word
2. Generating features from training data
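
`get_feats_labels` is not shown in this notebook. One plausible way to combine the functions in `newFeatures` is to call each one on the text and concatenate the lists they return into a single feature vector. The sketch below shows that idea with two toy feature functions; `build_feature_vector` and both toy functions are hypothetical names, and the real helper may work differently.

```python
def toy_len_text(text):
    """Toy feature: text length scaled by an arbitrary constant."""
    return [len(text) / 100]


def toy_num_sentences(text):
    """Toy feature: count of sentence-ending punctuation marks, scaled."""
    return [sum(text.count(mark) for mark in ".?!") / 10]


def build_feature_vector(text, feature_functions):
    """Call every feature function and concatenate the lists they return."""
    vector = []
    for extract in feature_functions:
        vector.extend(extract(text))
    return vector


sample = "I like tea. Do you?"
vector = build_feature_vector(sample, [toy_len_text, toy_num_sentences])
print(vector)
```

Because every feature function returns a list, scalar features and histogram features can be mixed freely, and adding a feature only requires appending another function to the list.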


The code below uses the logistic regression and order (ordinal) regression wrappers from the SenseTime API to train the models and evaluate them on the test data. With this, a basic end-to-end flow is in place.

In [8]:
testtime = 10
order_acc_list = []
logic_acc_list = []

print("3. Training the regression models, and evaluating the accuracy")
for i in range(testtime):
    print("Start", i, "running ...")

    shuffle_data(10)

    train_data = load_train_data()
    train_feats, train_labels = get_feats_labels(train_data,
                                                 diff_level,
                                                 newFeatures=newFeatures,
                                                 diff_use=1)

    test_data = load_test_data()
    test_feats, test_labels = get_feats_labels(test_data,
                                               diff_level,
                                               newFeatures=newFeatures,
                                               diff_use=1)

    # Running the logistic regression model training and evaluation
    model = logistic_regression()
    model.train(train_feats, train_labels)

    pred_y = model.pred(test_feats)
    acc = accuracy(pred_y, test_labels)
    logic_acc_list.append(acc)

    # Running the order regression model training and evaluation
    model = order_regression()
    model.train(train_feats, train_labels)

    pred_y = model.pred(test_feats)
    acc = accuracy(pred_y, test_labels)
    order_acc_list.append(acc)

print("The final model outputs:")
print("10x logistic regression model evals: ", logic_acc_list)
print("10x order regression model evals: ", order_acc_list)
print("Logic_avg_acc     ", sum(logic_acc_list) / testtime)
print("Order_avg_acc     ", sum(order_acc_list) / testtime)

3. Training the regression models, and evaluating the accuracy
Start 0 running ...
Start 1 running ...
Start 2 running ...
Start 3 running ...
Start 4 running ...
Start 5 running ...
Start 6 running ...
Start 7 running ...
Start 8 running ...
Start 9 running ...
The final model outputs:
10x logistic regression model evals:  [0.9, 0.9, 0.8, 0.975, 0.9, 0.875, 0.825, 0.95, 0.8, 0.925]
10x order regression model evals:  [0.875, 0.925, 0.825, 0.95, 0.95, 0.825, 0.9, 0.975, 0.875, 0.975]
Logic_avg_acc      0.8850000000000001
Order_avg_acc      0.9075000000000001
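
The averages hide how much the accuracy varies from run to run. As a small supplementary sketch, the snippet below recomputes the means from the accuracy lists printed above and adds the standard deviation and a per-run comparison, which is possible because both models are evaluated on the same split in each run. The variable `order_wins` is introduced here for illustration.

```python
import statistics

# Accuracies copied from the output above
logic_acc_list = [0.9, 0.9, 0.8, 0.975, 0.9, 0.875, 0.825, 0.95, 0.8, 0.925]
order_acc_list = [0.875, 0.925, 0.825, 0.95, 0.95, 0.825, 0.9, 0.975, 0.875, 0.975]

for name, accs in [("logistic", logic_acc_list), ("order", order_acc_list)]:
    print(name,
          "mean:", round(statistics.mean(accs), 4),
          "stdev:", round(statistics.stdev(accs), 4),
          "min:", min(accs), "max:", max(accs))

# Number of runs in which order regression scored strictly higher
order_wins = sum(o > l for o, l in zip(order_acc_list, logic_acc_list))
print("order regression wins", order_wins, "of", len(order_acc_list), "runs")
```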


## Further Improvements

Next, we explore several approaches to improving the accuracy.
#### First, we can add more features. 

For example, we can use the difficulty level of each sentence based on its grammar pattern.

To score the difficulty of a sentence, we need to analyze its grammar and judge whether it is easy or hard to understand. We can use syntactic parsing to generate a syntax tree and identify the grammar pattern of the sentence, then score each sentence based on its pattern. The `StanfordDependencyParser` toolkit from `nltk` could be a good fit for this.
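
One simple way to score a sentence from its parse is the depth of its dependency tree, since deeper trees tend to indicate more nested clauses. The sketch below assumes the parse is already available as a list of head indices (the kind of information a dependency parser provides) and does not call any parser. The function name `dependency_depth` and the hand-written parses are illustrative assumptions.

```python
def dependency_depth(heads):
    """Return the depth of a dependency tree.

    heads[i] is the 1-based index of the head of token i + 1, and 0 marks
    the root. A single-token sentence has depth 1; an empty one has depth 0.
    """
    if not heads:
        return 0

    def depth_of(token):
        depth = 1
        while heads[token - 1] != 0:
            token = heads[token - 1]
            depth += 1
        return depth

    return max(depth_of(token) for token in range(1, len(heads) + 1))


# Hand-written parse of "Birds sing": Birds -> sing, sing = root
simple = [2, 0]
# Hand-written parse of "I think you know that he left":
# I -> think, think = root, you -> know, know -> think,
# that -> left, he -> left, left -> know
nested = [2, 0, 4, 2, 7, 7, 4]

print(dependency_depth(simple), dependency_depth(nested))
```

A per-sentence depth like this could be turned into a histogram in the same way as sentence length and then appended to the feature list.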

#### Second, we can tune the parameters. 

There are two types of parameters: data-related and model-related.

Below are examples of the data-related parameters.
```python
len(unique_words) / 230
len(sentences) / 32
```
The feature extraction code above contains several such constants, chosen based on the texts themselves. Their purpose is to keep the extracted feature values between 0 and 1.
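
Instead of hard-coding such constants, one alternative is to derive them from the training data and clip anything that falls outside the training range. The sketch below illustrates this under stated assumptions: `fit_scale` and `normalize` are hypothetical helpers, and the counts are made up. It is not how the constants above were chosen.

```python
def fit_scale(raw_values):
    """Pick a normalization constant: the largest value seen in training."""
    largest = max(raw_values)
    return largest if largest > 0 else 1


def normalize(value, scale):
    """Scale a raw count into [0, 1], clipping values beyond the training range."""
    return min(1.0, max(0.0, value / scale))


# Made-up unique-word counts for a handful of training texts
train_unique_word_counts = [45, 120, 230, 180]
scale = fit_scale(train_unique_word_counts)

print(scale, normalize(115, scale), normalize(400, scale))
```

Clipping matters at test time: a text with more unique words than any training text would otherwise produce a feature value above 1.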

For the model-related parameters, in logistic regression we can consider two: `C` and `max_iter`:
```python
LogisticRegression(C=5000, max_iter=10000)
```
`C` is the inverse of the regularization strength: as `C` increases, the regularization weakens and the model is allowed to fit the training data more closely. `max_iter` is the maximum number of iterations the solver may take to converge.
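
The effect of `C` can be seen on a toy problem without any library. The sketch below fits a one-feature logistic regression by plain gradient descent with an L2 penalty scaled by `1 / C`, and compares the learned weight for a small and a large `C`. The function `fit_logistic_1d`, the data, the learning rate, and the step count are all assumptions chosen for illustration; a library solver will differ in detail.

```python
import math


def fit_logistic_1d(xs, ys, C, learning_rate=0.05, steps=2000):
    """Fit y ~ sigmoid(w * x + b) by gradient descent with an L2 penalty on w.

    The objective is mean log-loss + w**2 / (2 * C * n), a rescaled form of
    0.5 * w**2 + C * (sum of log-losses).
    """
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w, grad_b = 0.0, 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x / n
            grad_b += (p - y) / n
        grad_w += w / (C * n)  # gradient of the L2 penalty
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up, non-separable 1-D data
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w_strong_reg, _ = fit_logistic_1d(xs, ys, C=0.01)
w_weak_reg, _ = fit_logistic_1d(xs, ys, C=100.0)
print(round(w_strong_reg, 3), round(w_weak_reg, 3))
```

With a small `C` the penalty dominates and the weight stays close to zero; with a large `C` the weight grows toward the unregularized fit. A very large value such as `C=5000` therefore amounts to almost no regularization.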

After parameter tuning, the accuracy approaches 90%, which is relatively good.

#### Third, we can explore other ML models. 

We found that order regression is another good model for this task, and it leaves room to improve the accuracy further. Besides regression models, we can also explore other models such as RandomForest and XGBoost.
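
For readers unfamiliar with it, one common formulation of order (ordinal) regression maps each text to a single score and compares that score against ordered thresholds. This respects the fact that book 3 lies between books 2 and 4, which a plain multi-class classifier ignores. The sketch below illustrates only the prediction step with made-up weights and thresholds; it is not necessarily how `order_regression` in the API is implemented, and `predict_ordinal_level` is a hypothetical name.

```python
import bisect


def predict_ordinal_level(features, weights, thresholds):
    """Return a level in 1..len(thresholds) + 1 from a single linear score.

    thresholds must be sorted in ascending order; the level is 1 plus the
    number of thresholds that lie at or below the score.
    """
    score = sum(w * x for w, x in zip(weights, features))
    return bisect.bisect_right(thresholds, score) + 1


weights = [1.0, 2.0]           # hypothetical learned weights
thresholds = [0.5, 1.0, 1.5]   # hypothetical cut points between levels 1-4

level_low = predict_ordinal_level([0.1, 0.1], weights, thresholds)
level_mid = predict_ordinal_level([0.4, 0.4], weights, thresholds)
level_high = predict_ordinal_level([0.9, 0.9], weights, thresholds)
print(level_low, level_mid, level_high)
```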

At the end of the project, we tested order regression and logistic regression 10 times each, reshuffling the data every time. `shuffle_data` is a function that randomly selects the test data from the whole data set. Considering that there are 40 text files for each difficulty level, passing 10 to `shuffle_data` seems an appropriate choice.
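
The body of `shuffle_data` is not shown in this notebook. The sketch below is one way such a split could be done, under the assumption that the argument 10 means ten test texts per difficulty level (which would give 40 test texts, consistent with the reported accuracies being multiples of 0.025). The function name `stratified_holdout` and the placeholder file names are hypothetical.

```python
import random


def stratified_holdout(texts_by_level, num_test_per_level, seed=None):
    """Randomly hold out the same number of texts from every level.

    texts_by_level maps a level to a list of texts. Returns (train, test),
    each a list of (text, level) pairs.
    """
    rng = random.Random(seed)
    train, test = [], []
    for level, texts in texts_by_level.items():
        shuffled = list(texts)
        rng.shuffle(shuffled)
        test.extend((t, level) for t in shuffled[:num_test_per_level])
        train.extend((t, level) for t in shuffled[num_test_per_level:])
    return train, test


# Placeholder file names: 4 levels with 40 texts each, as in the data set
texts_by_level = {
    level: [f"book{level}_text{i}" for i in range(40)] for level in range(1, 5)
}
train, test = stratified_holdout(texts_by_level, 10, seed=0)
print(len(train), len(test))
```

Holding out the same number of texts from every level keeps the test set balanced, so the accuracy is not dominated by any single book.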

The results are shown below. 

![](img/result.png)

Overall, for this problem, order regression is on average better than logistic regression, achieving a **final accuracy above 90%**.