# Text Readability

## Problem Description

**Motivation**

For this project, I chose to use the [CommonLit Readability Prize](https://www.kaggle.com/competitions/commonlitreadabilityprize) competition from Kaggle. The task in this competition is to build a model that will accurately determine the reading level of text, given only the text.

While the competition's primary goal is to assist administrators, teachers, and students in a school setting, I had a slightly different motivation. In my professional life, I work as a product manager on a chatbot product that helps users manage their health and welfare benefits in the United States. A big focus of ours is evaluating whether the chatbot helps users complete their tasks. The content designers for the chatbot's responses are all college-educated experts in the benefits administration field; the users of the chatbot, however, have varying levels of education and benefits domain knowledge.

In addition, health and welfare benefits in the United States are very complicated.  Even users with high reading proficiency are often simply trying to get tasks done without much effort on their part. Their ability to accomplish tasks may be hindered by content that's difficult to understand.

Therefore, assessing the readability of our chatbot's responses may help us identify areas for improvement. Aiming for simpler responses that most of our users can read easily will reduce friction and help them accomplish their tasks. With benefits being such a complicated field, a chatbot that makes this material easy for everyone to understand will set us apart in the industry.

I also had technical motivations for this choice of project.  The data and task lend themselves nicely to experimenting with various models and approaches.  I chose to focus on feature engineering and model architecture.

**Approach**

The task here is a regression task.  The score to predict is 'BT Easiness'.  This metric was derived from a study where users were given two excerpts of text and asked to rate which one was easier for a student to read.  The score ranks the documents by difficulty based on the probability that the document may be easier than other documents.
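To make the scoring idea concrete, here is a minimal sketch that assumes "BT" refers to a Bradley-Terry-style model fit to pairwise "which is easier?" judgments. The notebook does not describe the fitting procedure, so the function name `bradley_terry_scores`, the toy win counts, and the iteration count are illustrative assumptions, not the method used to build the dataset.

```python
import numpy as np


def bradley_terry_scores(wins, n_iter=200):
    """Estimate centered log-strengths from pairwise win counts.

    wins[i, j] is the number of times excerpt i was judged easier than
    excerpt j (hypothetical toy data, not taken from the CLEAR Corpus).
    """
    wins = np.asarray(wins, dtype=float)
    comparisons = wins + wins.T      # how often each pair was compared
    total_wins = wins.sum(axis=1)    # total wins per excerpt
    strength = np.ones(wins.shape[0])

    for _ in range(n_iter):
        # Minorization-maximization update for the Bradley-Terry model
        pair_sum = strength[:, None] + strength[None, :]
        denom = (comparisons / pair_sum).sum(axis=1)
        strength = total_wins / denom
        strength /= strength.sum()   # fix the arbitrary scale

    scores = np.log(strength)
    return scores - scores.mean()    # center so the scores sum to zero


# Three hypothetical excerpts, each pair compared 10 times
toy_wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
scores = bradley_terry_scores(toy_wins)
print(scores)
```

In this toy setup a higher score means the excerpt won more of its comparisons, which mirrors how a higher BT Easiness value indicates an easier excerpt.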

I chose to experiment with the following 9 regression models, all of which predict BT Easiness:
- Bidirectional LSTM models with the following inputs:
    - only the text excerpt.
    - the text excerpt and some engineered features from the text (such as the mean number of words per sentence, the mean syllable count, etc.).
    - the text excerpt, the engineered features, and the counts of parts of speech.
    - the text excerpt, the engineered features, the counts of parts of speech, and the counts of word origins (etymology).
- Fine-tuned RoBERTa models with the following inputs:
    - only the text excerpt.
    - the text excerpt and some engineered features from the text (such as the mean number of words per sentence, the mean syllable count, etc.).
    - the text excerpt, the engineered features, and the counts of parts of speech.
    - the text excerpt, the engineered features, the counts of parts of speech, and the counts of word origins (etymology).
- An XGBoost model including only the engineered features from the text, but not the text itself.

**A note about my submission**

This notebook is my main submission. It contains the Exploratory Data Analysis, a summary of the results, and other details.  The 9 models themselves are implemented in separate notebooks as follows:

| Description | Notebook, executed on Kaggle | Notebook, Github |
| - | - | - |
| Bidirectional LSTM, only text | https://www.kaggle.com/code/focusleft/commonlit-readability-lstm/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-lstm.ipynb |
| Bidirectional LSTM, text and sentence features | https://www.kaggle.com/code/focusleft/commonlit-readability-lstm-with-sentence-features/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-lstm-with-sentence-features.ipynb |
| Bidirectional LSTM, text, sentence, and POS features | https://www.kaggle.com/code/focusleft/commonlit-readability-lstm-with-sentence-pos/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-roberta-with-sent-pos.ipynb |
| Bidirectional LSTM, text, sentence, POS, and origin features | https://www.kaggle.com/code/focusleft/commonlit-readability-lstm-with-sent-pos-lang/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-roberta-with-sent-pos-lang.ipynb |
| RoBERTa, only text | https://www.kaggle.com/code/focusleft/commonlit-readability-roberta/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-roberta.ipynb |
| RoBERTa, text and sentence features | https://www.kaggle.com/code/focusleft/commonlit-readability-roberta-with-sent-features/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-roberta-with-sent-features.ipynb |
| RoBERTa, text, sentence, and POS features | https://www.kaggle.com/code/focusleft/commonlit-readability-roberta-with-sent-pos/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-roberta-with-sent-pos.ipynb |
| RoBERTa, text, sentence, POS, and origin features | https://www.kaggle.com/code/focusleft/commonlit-readability-roberta-with-sent-pos-lang/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-roberta-with-sent-pos-lang.ipynb |
| XGBoost | https://www.kaggle.com/code/focusleft/commonlit-readability-xgboost?scriptVersionId=146704819 | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-xgboost.ipynb |

## Imports

In [1]:
# Data manipulation libraries
import pandas as pd
import numpy as np

# Data visualization libraries
import plotly.express as px
import plotly.graph_objects as go

# Natural language processing libraries
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import wordnet
from nltk.tag import pos_tag
from collections import Counter

# Syllable counting library
import syllapy

# Etymology library
import ety

# Progress bar library
from tqdm import tqdm

# Initialize the progress bar for pandas
tqdm.pandas()

# Download NLTK resources
nltk.download("wordnet")
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\focus\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\focus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\focus\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## Exploratory Data Analysis

### Data Description

First, I read the data and review the imported data types and general form of the data.

The data was pulled from [the CLEAR Corpus](https://www.commonlit.org/blog/introducing-the-clear-corpus-an-open-dataset-to-advance-research-28ff8cfea84a/).  This is a more complete and detailed dataset than was originally included in the Kaggle competition.

The data consists of 4724 records.  Each record contains:
- some descriptive information about the source of the text (author, title, publication year, etc.)
- the text excerpt itself, a paragraph of text from the source described
- the BT Easiness label for the excerpt text that I will attempt to predict
- various other readability indices
- data from the Kaggle competition, including the top contenders' predictions

**Data Citation:**

Brown, M. (2021, December 6). Introducing: The CLEAR Corpus, an open dataset to advance research. CommonLit. https://www.commonlit.org/blog/introducing-the-clear-corpus-an-open-dataset-to-advance-research-28ff8cfea84a/

In [2]:
# Use a raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(
    r"kaggle\input\clear-corpus-6-01-clear-corpus-6-01\CLEAR Corpus 6.01 - CLEAR Corpus 6.01.csv"
)

In [3]:
df.info(max_cols=None)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4724 entries, 0 to 4723
Data columns (total 40 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   ID                                  4724 non-null   int64  
 1   Last Changed                        139 non-null    float64
 2   Author                              4724 non-null   object 
 3   Title                               4724 non-null   object 
 4   Anthology                           2712 non-null   object 
 5   URL                                 4724 non-null   object 
 6   Source                              4724 non-null   object 
 7   Pub Year                            4715 non-null   float64
 8   Category                            4724 non-null   object 
 9   Location                            4724 non-null   object 
 10  License                             4724 non-null   object 
 11  MPAA
Max                            4724 no

In [4]:
df.head(10)

|  | ID | Last Changed | Author | Title | Anthology | URL | Source | Pub Year | Category | Location | ... | CAREC_M | CARES | CML2RI | firstPlace_pred | secondPlace_pred | thirdPlace_pred | fourthPlace_pred | fifthPlace_pred | sixthPlace_pred | Kaggle split |
| - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 0 | 400 |  | Carolyn Wells | Patty's Suitors |  | http://www.gutenberg.org/cache/epub/5631/pg563... | gutenberg | 1914.0 | Lit | mid | ... | 0.11952 | 0.457534 | 12.097815 | -0.383831 | -0.283604 | -0.346879 | -0.28162 | -0.247767 | -0.289945 | Train |
| 1 | 401 |  | Carolyn Wells | Two Little Women on a Holiday |  | http://www.gutenberg.org/cache/epub/5893/pg589... | gutenberg | 1917.0 | Lit | mid | ... | 0.04921 | 0.46251 | 22.550179 | -0.260307 | -0.20996 | -0.061565 | -0.234231 | -0.201347 | -0.156156 | Train |
| 2 | 402 |  | Carolyn Wells | Patty Blossom |  | http://www.gutenberg.org/cache/epub/20945/pg20... | gutenberg | 1917.0 | Lit | mid | ... | 0.09724 | 0.369259 | 18.125279 | -0.615037 | -0.5306 | -0.527847 | -0.55018 | -0.565762 | -0.538852 | Train |
| 3 | 403 |  | CHARLES KINGSLEY | THE WATER-BABIES\nA Fairy Tale for a Land-Baby |  | http://www.gutenberg.org/files/25564/25564-h/2... | gutenberg | 1863.0 | Lit | mid | ... | 0.08856 | 0.390759 | 10.95946 | -1.528806 | -1.525546 | -1.471455 | -1.265776 | -1.422547 | -1.393155 | Test |
| 4 | 404 |  | Charles Kingsley | HOW THE ARGONAUTS WERE DRIVEN INTO THE UNKNOWN... | The Heroes\n or Greek Fairy Tales for my... | http://www.gutenberg.org/files/677/677-h/677-h... | gutenberg | 1889.0 | Lit | mid | ... | 0.08798 | 0.389226 | 3.19596 | -1.335586 | -1.321922 | -1.163985 | -1.122501 | -1.185518 | -1.271324 | Train |
| 5 | 405 |  | Charles Madison Curry\n Erle Elsworth C... | The Three Little Bears | Children's Literature\n A Textbook of So... | http://www.gutenberg.org/files/25545/25545-h/2... | gutenberg | 1920.0 | Lit | mid | ... | 0.36885 | 0.301666 | 28.990105 | 0.341717 | 0.376123 | 0.353762 | 0.343373 | 0.361875 | 0.368739 | Train |
| 6 | 406 |  | Clair W. Hayes | The Boy Allies On the Firing Line\n Or, ... |  | http://www.gutenberg.org/files/12870/12870-h/1... | gutenberg | 1915.0 | Lit | mid | ... | 0.16523 | 0.419842 | 12.766583 | -1.070515 | -1.022543 | -0.971631 | -0.991998 | -1.040679 | -1.075716 | Train |
| 7 | 407 |  | Clair W. Hayes | The Boy Allies in Great Peril |  | http://www.gutenberg.org/cache/epub/12682/pg12... | gutenberg | 1916.0 | Lit | mid | ... | 0.18656 | 0.484475 | 14.130141 | -1.390635 | -1.55488 | -1.581937 | -1.666938 | -1.540613 | -1.600628 | Train |
| 8 | 408 |  | Clair W. Hayes | The Boy Allies At Verdun |  | http://www.gutenberg.org/cache/epub/13020/pg13... | gutenberg | 1917.0 | Lit | start | ... | 0.12905 | 0.430107 | 10.216473 | -1.041028 | -1.093127 | -0.959339 | -0.954064 | -1.023613 | -0.903814 | Train |
| 9 | 409 |  | Claude A. Labelle | The Ranger Boys and the Border Smugglers |  | http://www.gutenberg.org/files/25514/25514-h/2... | gutenberg | 1922.0 | Lit | mid | ... | 0.07326 | 0.37727 | 16.497078 | -0.273477 | -0.281462 | -0.281333 | -0.280411 | -0.311182 | -0.310186 | Train |


### Excerpt Review

#### Lowest BT Easiness Ratings

I've listed the five lowest BT Easiness ratings and their excerpts.  These do appear to be relatively difficult to read.

In [5]:
lowest_bt_easiness_samples = (
    df[["Excerpt", "BT Easiness"]].sort_values("BT Easiness", ascending=True).head(5)
)
highest_bt_easiness_samples = (
    df[["Excerpt", "BT Easiness"]].sort_values("BT Easiness", ascending=False).head(5)
)

for _, sample in lowest_bt_easiness_samples.iterrows():
    print(f"BT Easiness: {sample['BT Easiness']}\n{sample['Excerpt']}\n\n")

BT Easiness: -3.676267773
The commutator is peculiar, consisting of only three segments of a copper ring, while in the simplest of other continuous current generators several times that number exist, and frequently 120! segments are to be found. These three segments are made so as to be removable in a moment for cleaning or replacement. They are mounted upon a metal support, and are surrounded on all sides by a free air space, and cannot, therefore, lose their insulated condition. This feature of air insulation is peculiar to this system, and is very important as a factor in the durability of the commutator. Besides this, the commutator is sustained by supports carried in flanges upon the shaft, which flanges, as an additional safeguard, are coated all over with hard rubber, one of the finest known insulators. It may be stated, without fear of contradiction, that no other commutator made is so thoroughly insulated and protected. The three commutator segments virtually constitute a sing

#### Highest BT Easiness Ratings

I've listed the five highest BT Easiness ratings and their excerpts.  These do appear to be relatively easy to read.

In [6]:
for _, sample in highest_bt_easiness_samples.iterrows():
    print(f"BT Easiness: {sample['BT Easiness']}\n{sample['Excerpt']}\n\n")

BT Easiness: 1.711389827
When you think of dinosaurs and where they lived, what do you picture? Do you see hot, steamy swamps, thick jungles, or sunny plains? Dinosaurs lived in those places, yes. But did you know that some dinosaurs lived in the cold and the darkness near the North and South Poles?
This surprised scientists, too. Paleontologists used to believe that dinosaurs lived only in the warmest parts of the world. They thought that dinosaurs could only have lived in places where turtles, crocodiles, and snakes live today. Later, these dinosaur scientists began finding bones in surprising places.
One of those surprising fossil beds is a place called Dinosaur Cove, Australia. One hundred million years ago, Australia was connected to Antarctica. Both continents were located near the South Pole. Today, paleontologists dig dinosaur fossils out of the ground. They think about what those ancient bones must mean.


BT Easiness: 1.658697523
The next morning Lizzy met her friend Spider a

### BT Easiness Distribution

In [7]:
fig = px.histogram(df["BT Easiness"], nbins=30)
fig.update_layout(bargap=0.1)

fig.show()

### Correlation between BT Easiness and other readability indices

I've reviewed the correlation between the BT Easiness rating and other calculated readability indices.  Unsurprisingly, there are strong correlations among them all.  This suggests that BT Easiness is a reasonable indicator of readability when compared to other standard indices.

In [8]:
# Calculate the correlation matrix for various readability indices
correlation_matrix = df[
    [
        "BT Easiness",
        "Flesch-Reading-Ease",
        "Flesch-Kincaid-Grade-Level",
        "Automated Readability Index",
        "SMOG Readability",
        "New Dale-Chall Readability Formula",
        "CAREC",
        "CAREC_M",
        "CARES",
        "CML2RI",
    ]
].corr()

# Create a heatmap using Plotly to visualize the correlation matrix
fig = go.Figure(
    data=go.Heatmap(
        z=correlation_matrix.values,
        x=correlation_matrix.index,
        y=correlation_matrix.columns,
        colorscale="balance",
        colorbar=dict(title="Correlation"),
        text=correlation_matrix.values.round(2),
    )
)

# Update the layout of the heatmap
fig.update_layout(
    title="Correlation Plot for Readability Indices",
    xaxis_title="Columns",
    yaxis_title="Columns",
    height=600,
    width=600,
)

# Display the heatmap
fig.show()

### Data Cleaning, Text Preprocessing, and Feature Engineering

Below, I pre-process the text data and extract features from it.  The code takes a text excerpt as input and returns the cleaned text along with a set of linguistic statistics.

The following code:

- Converts the input text to lowercase
- Tokenizes the text into sentences using the Natural Language Toolkit (`nltk`), and each sentence is further tokenized into words. POS tags are assigned to each word.
- For each word in the text, the code does the following:
  - Determines the word's origin language (etymology).
  - Retrieves the full name of the POS tag from the `pos_mapping`.
  - Counts the number of syllables in the word.
  - Calculates the length of the word.
- Calculates various linguistic statistics, such as the mean syllable count, the number of sentences, the mean sentence length, the mean word length, and the total number of words.

In [9]:
# Define a dictionary that maps Part-of-Speech (POS) tags to their full names
pos_mapping = {
    "CC": "Coordinating Conjunction",
    "CD": "Cardinal Digit",
    "DT": "Determiner",
    "EX": "Existential There",
    "FW": "Foreign Word",
    "IN": "Preposition or Subordinating Conjunction",
    "JJ": "Adjective",
    "JJR": "Adjective, Comparative",
    "JJS": "Adjective, Superlative",
    "LS": "List Item Marker",
    "MD": "Modal",
    "NN": "Noun, Singular or Mass",
    "NNS": "Noun, Plural",
    "NNP": "Proper Noun, Singular",
    "NNPS": "Proper Noun, Plural",
    "PDT": "Predeterminer",
    "POS": "Possessive Ending",
    "PRP": "Personal Pronoun",
    "PRP$": "Possessive Pronoun",
    "RB": "Adverb",
    "RBR": "Adverb, Comparative",
    "RBS": "Adverb, Superlative",
    "RP": "Particle",
    "TO": "to",
    "UH": "Interjection",
    "VB": "Verb, Base Form",
    "VBD": "Verb, Past Tense",
    "VBG": "Verb, Gerund or Present Participle",
    "VBN": "Verb, Past Participle",
    "VBP": "Verb, Non-3rd Person Singular Present",
    "VBZ": "Verb, 3rd Person Singular Present",
    "WDT": "Wh-determiner",
    "WP": "Wh-pronoun",
    "WP$": "Possessive Wh-pronoun",
    "WRB": "Wh-adverb",
}


# Define a function to process text
def process_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Initialize lists to store information about words
    word_origins = []
    word_pos = []
    syllable_counts = []
    sentence_lengths = []
    word_lengths = []

    # Tokenize the text into sentences
    sentences = sent_tokenize(text)

    # Process each sentence
    for sentence in sentences:
        # Tokenize the sentence into words and get their POS tags
        tokens = nltk.word_tokenize(sentence)
        pos_tags = nltk.pos_tag(tokens)

        # Calculate sentence length and process each word in the sentence
        sentence_lengths.append(len(pos_tags))
        for token, pos in pos_tags:
            # Get the language origin of the word using etymology
            origin = ety.origins(token)
            if origin:
                origin = origin[0].language.name
            else:
                origin = "unknown"
            word_origins.append(origin)

            # Get the full name of the POS tag using the mapping dictionary
            full_pos_name = pos_mapping.get(pos, pos)
            word_pos.append(full_pos_name)

            # Calculate the syllable count of the word using syllapy
            syllables = syllapy.count(token)
            syllable_counts.append(syllables)

            # Calculate the length of the word
            word_lengths.append(len(token))

    # Calculate various statistics based on the processed text
    processed_excerpt = text
    origin_counts = Counter(word_origins)
    pos_counts = Counter(word_pos)
    mean_syllable_count = np.mean(syllable_counts)
    num_sentences = len(sentences)
    mean_sentence_length = np.mean(sentence_lengths)
    num_words = np.sum(sentence_lengths)
    mean_word_length = np.mean(word_lengths)

    # Return the processed information as a tuple
    return (
        word_origins,
        origin_counts,
        word_pos,
        pos_counts,
        syllable_counts,
        mean_syllable_count,
        num_sentences,
        mean_sentence_length,
        mean_word_length,
        num_words,
        processed_excerpt,
    )

In [10]:
df[
    [
        "word_origins",
        "word_origin_counts",
        "pos",
        "pos_counts",
        "syllable_counts",
        "mean_syllable_count",
        "num_sentences",
        "mean_sentence_length",
        "mean_word_length",
        "num_words",
        "processed_excerpt",
    ]
] = df["Excerpt"].progress_apply(lambda x: pd.Series(process_text(x)))

100%|██████████| 4724/4724 [02:06<00:00, 37.40it/s]
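The column assignment above relies on `apply` returning a DataFrame when the applied function returns a `pd.Series`. The following minimal sketch shows the same pattern on toy data; `toy_stats` and `toy_df` are hypothetical stand-ins for `process_text` and the real dataset, and plain `apply` is used in place of `progress_apply` so that `tqdm` is not required.

```python
import pandas as pd


def toy_stats(text):
    """Hypothetical stand-in for process_text: returns a tuple of statistics."""
    words = text.split()
    return (len(words), sum(len(w) for w in words) / len(words))


toy_df = pd.DataFrame({"Excerpt": ["the cat sat", "a dog barked loudly"]})

# Wrapping the tuple in pd.Series makes apply return a DataFrame, whose
# columns are assigned positionally to the new column names on the left.
toy_df[["num_words", "mean_word_length"]] = toy_df["Excerpt"].apply(
    lambda x: pd.Series(toy_stats(x))
)
print(toy_df)
```

The order of the names on the left-hand side must match the order of the values in the returned tuple, since the assignment is positional.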


### Correlation between BT Easiness and Word Origins

Below, I calculate and plot the correlation between BT Easiness and the word origin counts from the excerpt.  

Most word origins show little or no correlation with BT Easiness, so I have excluded any origin whose correlation coefficient has an absolute value of 0.05 or less from the plot below.  Some origins do show a notable correlation: a greater presence of words with a Middle English origin is associated with higher readability, while a greater presence of words with French and Latin origins is associated with lower readability.
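As a small illustration of the mechanics, the sketch below turns a list of per-excerpt `Counter` objects into a count matrix and correlates each origin with the target. The counts and easiness values are made-up toy data, and `corrwith` is used here as a compact alternative to computing the full correlation matrix.

```python
from collections import Counter

import pandas as pd

# Hypothetical per-excerpt origin counts and easiness scores
toy_counts = [
    Counter({"Middle English": 5, "Latin": 1}),
    Counter({"Middle English": 3, "Latin": 2, "French": 1}),
    Counter({"Latin": 4, "French": 3}),
]
toy_easiness = pd.Series([1.0, 0.0, -1.0], name="BT Easiness")

# One column per origin; origins absent from an excerpt become 0
count_matrix = pd.DataFrame(toy_counts).fillna(0)

# Pearson correlation of each origin column with the target
corr = count_matrix.corrwith(toy_easiness)

# Keep only origins whose absolute correlation exceeds the threshold
print(corr[corr.abs() > 0.05].sort_values(ascending=False))
```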

In [11]:
# Get word origins and BT Easiness
origins_df = pd.DataFrame(df["BT Easiness"]).join(
    pd.DataFrame(df["word_origin_counts"].tolist()).fillna(0)
)

# Calculate the correlation matrix between columns in 'origins_df'
correlation_matrix = origins_df.corr()

# Extract correlations of 'BT Easiness' column with other columns, sort them in descending order
bt_easiness_corr = correlation_matrix["BT Easiness"]
sorted_index = bt_easiness_corr.sort_values(ascending=False).index

# Reorder 'bt_easiness_corr' based on sorted index and remove the self-correlation
bt_easiness_corr = bt_easiness_corr[sorted_index]
bt_easiness_corr = bt_easiness_corr.drop("BT Easiness")

# Filter correlations with an absolute value greater than 0.05
bt_easiness_corr = bt_easiness_corr[abs(bt_easiness_corr) > 0.05]

# Create a bar chart using 'bt_easiness_corr' data with specified attributes
fig = go.Figure(
    data=go.Bar(
        x=bt_easiness_corr.index,
        y=bt_easiness_corr.values,
        marker=dict(color=bt_easiness_corr.values, colorscale="balance"),
        text=bt_easiness_corr.values.round(2),
        textposition="inside",
        cliponaxis=True,
    )
)

# Rotate the text labels on the x-axis by 90 degrees
fig.update_traces(textangle=90)

# Configure layout settings for the plot
fig.update_layout(
    title="Correlation of 'BT Easiness' with Word Origin Language",
    xaxis_title="Columns",
    yaxis_title="Correlation",
    xaxis_tickangle=-45,
    yaxis=dict(automargin=True),
    uniformtext_minsize=12,
    uniformtext_mode="show",
    height=700,
    width=1000,
)

# Display the interactive plot
fig.show()

### Correlation between BT Easiness and Parts of Speech

I plot the correlation between BT Easiness and Parts of Speech counts below.  There does appear to be a relatively strong correlation with certain parts of speech and BT Easiness.

In [12]:
# Join BT Easiness with one count column per part of speech (missing tags become 0)
pos_df = pd.DataFrame(df["BT Easiness"]).join(
    pd.DataFrame(df["pos_counts"].tolist()).fillna(0)
)

correlation_matrix = pos_df.corr()
bt_easiness_corr = correlation_matrix["BT Easiness"]
sorted_index = bt_easiness_corr.sort_values(ascending=False).index
bt_easiness_corr = bt_easiness_corr[sorted_index]
bt_easiness_corr = bt_easiness_corr.drop("BT Easiness")

fig = go.Figure(
    data=go.Bar(
        x=bt_easiness_corr.index,
        y=bt_easiness_corr.values,
        marker=dict(color=bt_easiness_corr.values, colorscale="balance"),
        text=bt_easiness_corr.values.round(2),
        textposition="inside",
        cliponaxis=True,
    )
)

fig.update_traces(textangle=90)

fig.update_layout(
    title="Correlation of 'BT Easiness' with Parts of Speech",
    xaxis_title="Columns",
    yaxis_title="Correlation",
    xaxis_tickangle=-45,
    yaxis=dict(automargin=True),
    uniformtext_minsize=12,
    uniformtext_mode="show",
    height=700,
    width=1200,
)

fig.show()

### Correlation between BT Easiness and Sentence Composition Statistics

I also plot the correlation between BT Easiness and certain sentence composition statistics, such as mean word length and mean syllable count.  Several of these features also appear to be correlated with BT Easiness.

In [13]:
# Select BT Easiness and the sentence composition features
features_df = df[
    [
        "BT Easiness",
        "mean_syllable_count",
        "num_sentences",
        "mean_sentence_length",
        "mean_word_length",
        "num_words",
    ]
]

correlation_matrix = features_df.corr()
bt_easiness_corr = correlation_matrix["BT Easiness"]
sorted_index = bt_easiness_corr.sort_values(ascending=False).index
bt_easiness_corr = bt_easiness_corr[sorted_index]
bt_easiness_corr = bt_easiness_corr.drop("BT Easiness")

fig = go.Figure(
    data=go.Bar(
        x=bt_easiness_corr.index,
        y=bt_easiness_corr.values,
        marker=dict(color=bt_easiness_corr.values, colorscale="balance"),
        text=bt_easiness_corr.values.round(2),
        textposition="inside",
        cliponaxis=True,
    )
)

fig.update_traces(textangle=90)

fig.update_layout(
    title="Correlation of 'BT Easiness' with Sentence Composition Features",
    xaxis_title="Columns",
    yaxis_title="Correlation",
    xaxis_tickangle=-45,
    yaxis=dict(automargin=True),
    uniformtext_minsize=12,
    uniformtext_mode="show",
    height=700,
    width=700,
)

fig.show()

### Syllable Count Distribution

I plot the distribution of syllable counts, plus a plot of BT Easiness vs. Syllable Count.  This reflects the same correlation we saw above.

In [14]:
fig = px.histogram(df["mean_syllable_count"], nbins=30)
fig.update_layout(bargap=0.1)

fig.show()

In [15]:
fig = px.scatter(
    df,
    x=df["mean_syllable_count"],
    y=df["BT Easiness"],
    title="Scatterplot of Mean Syllable Count vs BT Easiness",
)
fig.show()

### Sentence Count Distribution

I plot the distribution of sentence counts, plus a plot of BT Easiness vs. Sentence Count.  This reflects the same correlation we saw above.

In [16]:
fig = px.histogram(df["num_sentences"], nbins=30)
fig.update_layout(bargap=0.1)

fig.show()

In [17]:
fig = px.scatter(
    df,
    x=df["num_sentences"],
    y=df["BT Easiness"],
    title="Scatterplot of Number of Sentences vs BT Easiness",
)
fig.show()

### Sentence Length Distribution

I plot the distribution of sentence lengths, plus a plot of BT Easiness vs. Mean Sentence Length.  This reflects the same correlation we saw above.

In [18]:
fig = px.histogram(df["mean_sentence_length"], nbins=30)
fig.update_layout(bargap=0.1)

fig.show()

In [19]:
fig = px.scatter(
    df,
    x=df["mean_sentence_length"],
    y=df["BT Easiness"],
    title="Scatterplot of Mean Sentence Length vs BT Easiness",
)
fig.show()

### Word Length Distribution

I plot the distribution of mean word lengths, plus a plot of BT Easiness vs. Mean Word Length.  This reflects the same correlation we saw above.

In [20]:
fig = px.histogram(df["mean_word_length"], nbins=30)
fig.update_layout(bargap=0.1)

fig.show()

In [21]:
fig = px.scatter(
    df,
    x=df["mean_word_length"],
    y=df["BT Easiness"],
    title="Scatterplot of Mean Word Length vs BT Easiness",
)
fig.show()

### Number of Words Distribution

I plot the distribution of number of words, plus a plot of BT Easiness vs. Number of Words.  This reflects the same lack of correlation we saw above.

In [22]:
fig = px.histogram(df["num_words"], nbins=30)
fig.update_layout(bargap=0.1)

fig.show()

In [23]:
fig = px.scatter(
    df,
    x=df["num_words"],
    y=df["BT Easiness"],
    title="Scatterplot of Number of Words vs BT Easiness",
)
fig.show()

## Models

I chose to experiment with the following 9 regression models, all of which predict BT Easiness:
- Bidirectional LSTM models with the following inputs:
    - only the text excerpt.
    - the text excerpt and some engineered features from the text (such as the mean number of words per sentence, the mean syllable count, etc.).
    - the text excerpt, the engineered features, and the counts of parts of speech.
    - the text excerpt, the engineered features, the counts of parts of speech, and the counts of word origins (etymology).
- Fine-tuned RoBERTa models with the following inputs:
    - only the text excerpt.
    - the text excerpt and some engineered features from the text (such as the mean number of words per sentence, the mean syllable count, etc.).
    - the text excerpt, the engineered features, and the counts of parts of speech.
    - the text excerpt, the engineered features, the counts of parts of speech, and the counts of word origins (etymology).
- An XGBoost model including only the engineered features from the text, but not the text itself.

For the two text-only models, the text input is passed through the language model (Bidirectional LSTM or RoBERTa) and then through densely connected layers.

For the six neural models that used my engineered features, I employed a hybrid model with two inputs.  One was the text input, following the same architecture as the text-only models above.  The other was a continuous input holding the engineered features, encoded and processed through densely connected layers.  The outputs of the two branches are concatenated and processed through to the final regression output.  The XGBoost model uses the engineered features alone, with no text branch.
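The data flow of such a two-input model can be sketched in a framework-agnostic way with NumPy. This is only an illustration of the concatenation step under assumed shapes: the layer sizes, the random weights, and the `dense` helper are hypothetical, and the actual architectures are in the linked notebooks.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(x, w, b):
    """Fully connected layer with ReLU activation."""
    return np.maximum(x @ w + b, 0.0)


batch_size, text_dim, n_features = 4, 16, 5

# Stand-in for the pooled output of the text encoder (LSTM or RoBERTa)
text_encoding = rng.normal(size=(batch_size, text_dim))
# Stand-in for the engineered features of the same excerpts
features = rng.normal(size=(batch_size, n_features))

# Feature branch: its own dense layer
w_feat, b_feat = rng.normal(size=(n_features, 8)), np.zeros(8)
feature_encoding = dense(features, w_feat, b_feat)

# Merge the two branches, then apply a shared dense layer and a linear output
merged = np.concatenate([text_encoding, feature_encoding], axis=1)
w_hidden, b_hidden = rng.normal(size=(merged.shape[1], 8)), np.zeros(8)
w_out, b_out = rng.normal(size=(8, 1)), np.zeros(1)
prediction = dense(merged, w_hidden, b_hidden) @ w_out + b_out

print(merged.shape, prediction.shape)
```

The final layer has no activation because the target is a continuous score, so the model produces one unbounded value per excerpt.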

Each model was trained for a maximum of 30 epochs with early stopping.  The full dataset was split into 80% training data and 20% validation data for each attempt.
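The early stopping rule can be illustrated with a short, self-contained sketch. The 30-epoch cap comes from the text above, while the patience of 3 epochs, the function name, and the validation losses are assumptions made purely for illustration.

```python
def train_with_early_stopping(val_losses, max_epochs=30, patience=3):
    """Return (stop_epoch, best_epoch, best_loss) for a run of validation losses.

    val_losses stands in for the per-epoch validation loss a real training
    loop would compute; patience=3 is an assumed value.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Stop once the loss has failed to improve for `patience` epochs
                return epoch, best_epoch, best_loss

    return min(len(val_losses), max_epochs), best_epoch, best_loss


# Hypothetical validation losses: improvement stalls after epoch 4
losses = [1.20, 0.95, 0.80, 0.70, 0.72, 0.71, 0.74, 0.69]
stop_epoch, best_epoch, best_loss = train_with_early_stopping(losses)
print(stop_epoch, best_epoch, best_loss)
```

With these toy losses, training stops at epoch 7 and the epoch 4 result is kept as the best, which corresponds to reporting the best validation error per model.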

For more details about the models themselves, please see the code and detail in the following links.

| Description | Notebook, executed on Kaggle | Notebook, Github |
| - | - | - |
| Bidirectional LSTM, only text | https://www.kaggle.com/code/focusleft/commonlit-readability-lstm/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-lstm.ipynb |
| Bidirectional LSTM, text and sentence features | https://www.kaggle.com/code/focusleft/commonlit-readability-lstm-with-sentence-features/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-lstm-with-sentence-features.ipynb |
| Bidirectional LSTM, text, sentence, and POS features | https://www.kaggle.com/code/focusleft/commonlit-readability-lstm-with-sentence-pos/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-roberta-with-sent-pos.ipynb |
| Bidirectional LSTM, text, sentence, POS, and origin features | https://www.kaggle.com/code/focusleft/commonlit-readability-lstm-with-sent-pos-lang/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-roberta-with-sent-pos-lang.ipynb |
| RoBERTa, only text | https://www.kaggle.com/code/focusleft/commonlit-readability-roberta/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-roberta.ipynb |
| RoBERTa, text and sentence features | https://www.kaggle.com/code/focusleft/commonlit-readability-roberta-with-sent-features/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-roberta-with-sent-features.ipynb |
| RoBERTa, text, sentence, and POS features | https://www.kaggle.com/code/focusleft/commonlit-readability-roberta-with-sent-pos/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-roberta-with-sent-pos.ipynb |
| RoBERTa, text, sentence, POS, and origin features | https://www.kaggle.com/code/focusleft/commonlit-readability-roberta-with-sent-pos-lang/notebook | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-roberta-with-sent-pos-lang.ipynb |
| XGBoost | https://www.kaggle.com/code/focusleft/commonlit-readability-xgboost?scriptVersionId=146704819 | https://github.com/corymills/dtsa-5511-final-project/blob/main/commonlit-readability-xgboost.ipynb |

## Results and Analysis

For each model, I have listed the best mean squared error achieved.

| Description | Best Mean Squared Error | 
| - | - | 
| Bidirectional LSTM, only text | 0.6548 *Best Score | 
| Bidirectional LSTM, text and sentence features | 0.6820 | 
| Bidirectional LSTM, text, sentence, and POS features | 0.6614 | 
| Bidirectional LSTM, text, sentence, POS, and origin features | 0.6657 | 
| RoBERTa, only text | 1.1398 | 
| RoBERTa, text and sentence features | 0.8998 | 
| RoBERTa, text, sentence, and POS features | 0.8168 | 
| RoBERTa, text, sentence, POS, and origin features | 0.8025 | 
| XGBoost | 0.6840  | 
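For reference, mean squared error is the average of the squared differences between the predicted and true BT Easiness scores, so lower is better. A minimal sketch follows; the function name and the sample values are made up for illustration and are not taken from the dataset.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only.
targets = [-0.5, 0.0, 1.0, -2.0]
predictions = [-1.0, 0.5, 1.0, -1.0]
mse = mean_squared_error(targets, predictions)
print(mse)  # 0.375
```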

I was surprised by these results.  I expected the RoBERTa models to outperform both the LSTM and XGBoost models simply because they were pre-trained, but they actually performed the worst.

I also expected the addition of the engineered features to lead to improvements, particularly since I saw fairly convincing correlations in my EDA.  The RoBERTa models did improve notably with the additional features, but the same features had a slightly negative impact on the LSTM models.

I was also surprised that XGBoost, a tree-based model that took only the engineered features as input, performed comparably to the LSTM models.  This suggests that deep learning models are not always better suited to a problem than less complex models.

See the individual model notebooks for plots of their performance.

## Conclusion

**Overview**

I attempted multiple deep learning model configurations as well as an XGBoost tree-based model.  I also performed data cleaning and feature extraction in an attempt to improve the output of my models.

Of the 9 models attempted, I found that the Bidirectional LSTM that took only the excerpt text as input performed the best, with a mean squared error score of 0.6548.

**Opportunities and Improvements**

In general, this dataset is relatively small at fewer than 5,000 excerpts.  If I were to continue work on these models, I would attempt to supplement this dataset with additional data.  I suspect that, with more training data available, the RoBERTa models would outperform the others.

I would also have liked to spend more time experimenting with model architectures, such as adding and adjusting the layers of each model.

## References

https://www.kaggle.com/competitions/commonlitreadabilityprize

https://www.commonlit.org/blog/introducing-the-clear-corpus-an-open-dataset-to-advance-research-28ff8cfea84a/

https://educationaldatamining.org/EDM2021/virtual/static/pdf/EDM21_paper_35.pdf?ref=commonlit.org