# COGS 118B - Final Project

# Discovering Semantic Patterns In Word Difficulty Using Clustering

## Group members

- Anh Tran
- Eric Song
- Kendrick Nguyen

# Abstract 

This project seeks underlying patterns in English words that contribute to their difficulty. We define word difficulty as the level of complexity in understanding a particular word. Although word difficulty is largely subjective and experience-dependent, we would like to examine whether certain semantic features of English words also carry independent and/or latent significance for difficulty. This project builds on a closely related study, *Word Difficulty Prediction Using Convolutional Neural Networks* (Basu, Garain, and Naskar, 2019)<a name="avishek"></a>[<sup>[1]</sup>](#avisheknote).

Our project employs exploratory data analysis and machine learning algorithms to find semantic patterns that contribute to word difficulty, using the study's corpus dataset, which includes features such as `Length`, `Log_Freq_HAL`, and `I_Zscore`. In our case, we define `I_Zscore` as a metric of word difficulty (0 being easy to understand and 1 being hard to understand). We applied several unsupervised clustering algorithms to find underlying patterns, but the resulting clusters scored poorly on the silhouette metric. Because the silhouette scores were consistently low across algorithms, we concluded that our project was limited by the relatively small corpus dataset and by unaccounted external factors that could also contribute to word difficulty.

# Background

English is currently the most spoken language in the world at 1.456 billion speakers <a name="wiki"></a>[<sup>[2]</sup>](#wikinote). A large portion of these speakers are learning it as a second language <a name="wiki"></a>[<sup>[2]</sup>](#wikinote). Learners often find the language difficult and encounter challenges such as complex pronunciation and non-obvious rule sets (for instance, think of “read” and “read”<a name="adjective"></a>[<sup>[6]</sup>](#adjectivenote)). To better understand how second-language learners acquire English, many researchers examine how difficult a particular English word is to learn. For example, the Flesch-Kincaid readability tests were created to measure how difficult an English passage is to grasp<a name="flesch"></a>[<sup>[3]</sup>](#fleschnote). The test was developed out of the U.S. Navy's need to gauge the reading comprehension level of its recruits. It takes the total number of words, sentences, and syllables in a passage and plugs them into an equation that produces a score.
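
To make the "equation" concrete, here is a minimal sketch of the two commonly published Flesch-Kincaid formulas (reading ease and grade level). The function names and the example counts are hypothetical, and counting syllables in real text is left out; the functions only take the three totals as inputs.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch reading ease: higher scores mean easier text."""
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))


def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level: approximate U.S. school grade."""
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)


# Hypothetical passage: 100 words, 5 sentences, 150 syllables
ease = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)
print(round(ease, 3), round(grade, 2))  # 59.635 9.91
```

Both scores depend only on average sentence length and average syllables per word, so they describe passages rather than individual words.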

Another research group that analyzed English words was Basu et al.<a name="avishek"></a>[<sup>[1]</sup>](#avisheknote). Building on the English Lexicon Project, they used traditional machine learning models as well as a convolutional neural network based prediction model to predict word difficulty. We will build on the foundations that this study and the English Lexicon Project laid out. In particular, we will be using their `I_Zscore` as a metric of word difficulty. The `I_Zscore` is the “standardized mean lexical decision latency for each word” <a name="lexicon"></a>[<sup>[7]</sup>](#lexiconnote). The lexical decision latency is the time it takes to read a word and decide whether that word is in the English language or not <a name="lexical"></a>[<sup>[8]</sup>](#lexicalnote). Presumably, this gives us a way to decide how difficult a word is: harder words may have a higher lexical decision latency than easier words, as the English Lexicon Project explores.

We will, in part, use unsupervised machine learning techniques to try to discover underlying patterns that separate words classified as easy (closer to 0 on the `I_Zscore`) from words classified as hard (closer to 1 on the `I_Zscore`).
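
The report does not name the clustering algorithms that were used. Purely as an illustration of the kind of technique meant here, the sketch below implements k-means (Lloyd's algorithm) in plain Python. The `kmeans` helper, the choice of two features (`Length`, `Log_Freq_HAL`), `k = 2`, and the initial centroids are all assumptions made for the example; the five points are the sample rows from the dataset preview, and the features are left unscaled for brevity.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm on tuples of floats; returns (labels, centroids)."""
    centroids = list(centroids)  # work on a copy of the initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid (squared Euclidean)
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centroids[k])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# (Length, Log_Freq_HAL) for: a, aah, Aaron, aback, abacus
points = [(1, 16.18), (3, 5.4), (5, 9.29), (5, 5.96), (6, 6.24)]
labels, centers = kmeans(points, centroids=points[:2])
print(labels)   # [0, 1, 1, 1, 1]
print(centers)
```

With this initialization the very frequent word "a" ends up in a cluster by itself. In a real analysis the features would normally be standardized first so that no single column dominates the distance.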

By discovering patterns among English words, such as similarities in pronunciation or length, English speakers and learners could leverage these patterns to learn new words that follow a similar convention. These patterns could also give speakers and learners insights into and expectations about word difficulty, which can inform people’s subjective opinions on how language is used and learned.

# Problem Statement

This project asks what makes an English word semantically difficult and whether easy or difficult words share underlying similarities that are not immediately obvious, using clustering to look for them. We measure a word's difficulty by its `I_Zscore`, taken from the *Word Difficulty Prediction Using Convolutional Neural Networks* study<a name="avishek"></a>[<sup>[1]</sup>](#avisheknote). We measure success in two ways: whether the clusters we find correspond well with `I_Zscore`, and the silhouette score of the clustering. To find underlying patterns, we will examine and compare the semantic features (e.g., `Length`, `Log_Freq_HAL`, `I_Mean_Accuracy`) of the words within each cluster.
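
For readers unfamiliar with the silhouette score, the following is a minimal sketch of its standard definition for one-dimensional data: for each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the point's score is `(b - a) / max(a, b)`. The helper name `silhouette_score_1d` and the toy values are hypothetical, and the sketch assumes at least two clusters; in practice a library routine such as `sklearn.metrics.silhouette_score` would be used instead.

```python
def silhouette_score_1d(points, labels):
    """Mean silhouette coefficient for 1-D points (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(p - q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy values: two tight, well-separated groups
values = [0.0, 0.1, 1.0, 1.1]
good = silhouette_score_1d(values, [0, 0, 1, 1])  # matches the groups
bad = silhouette_score_1d(values, [0, 1, 0, 1])   # cuts across the groups
print(round(good, 3), round(bad, 3))
```

Scores near 1 indicate compact, well-separated clusters, scores near 0 indicate overlapping clusters, and negative scores suggest points assigned to the wrong cluster.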

# Data

Our dataset of choice is a corpus dataset from the *Word Difficulty Prediction Using Convolutional Neural Networks* study (Basu, Garain, and Naskar, 2019)<a name="avishek"></a>[<sup>[1]</sup>](#avisheknote). The words were tokenized from the SUBTLEXUS corpus of 131 million words.

- The raw dataset is also published on Kaggle: https://www.kaggle.com/datasets/kkhandekar/word-difficulty. 
- Size: 9 variables, 40,481 observations
- Description: An observation consists of the `Word`, `Length`, `Freq_HAL`, `Log_Freq_HAL`, `I_Mean_RT`, `I_Zscore`, `I_SD`, `Obs`, and `I_Mean_Accuracy`.
- The critical variable for our problem statement is `I_Zscore`, as it denotes the difficulty of a word. We treat this value as ranging from 0 (SIMPLE) to 1 (DIFFICULT). Because it is a standardized score, individual values can fall outside this range (for example, -0.01 for "a" in the sample below).
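
As a quick sanity check on the stated 0 to 1 range, the sketch below uses only the five rows displayed by `df.head()` in this section, so it says nothing about the full dataset. The variable names are hypothetical; on the real data the equivalent check would be `df['I_Zscore'].min()` and `df['I_Zscore'].max()`.

```python
# (Word, I_Zscore) pairs copied from the five sample rows of the dataset preview
sample = [("a", -0.01), ("aah", 0.21), ("Aaron", -0.11),
          ("aback", 0.11), ("abacus", 0.65)]

z_values = [z for _, z in sample]
z_min, z_max = min(z_values), max(z_values)

# Words whose score falls outside the nominal [0, 1] range
outside = [word for word, z in sample if not 0.0 <= z <= 1.0]

print(f"min={z_min}, max={z_max}, outside [0, 1]: {outside}")
# min=-0.11, max=0.65, outside [0, 1]: ['a', 'Aaron']
```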

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('./data/WordDifficulty.csv')
df.head()

| | Word | Length | Freq_HAL | Log_Freq_HAL | I_Mean_RT | I_Zscore | I_SD | Obs | I_Mean_Accuracy |
|---|---|---|---|---|---|---|---|---|---|
| 0 | a | 1 | 10610626 | 16.18 | 798.92 | -0.01 | 333.85 | 24.0 | 0.73 |
| 1 | aah | 3 | 222 | 5.4 | 816.43 | 0.21 | 186.03 | 21.0 | 0.62 |
| 2 | Aaron | 5 | 10806 | 9.29 | 736.06 | -0.11 | 289.01 | 32.0 | 0.97 |
| 3 | aback | 5 | 387 | 5.96 | 796.27 | 0.11 | 171.61 | 15.0 | 0.45 |
| 4 | abacus | 6 | 513 | 6.24 | 964.4 | 0.65 | 489.0 | 15.0 | 0.47 |


The following is a brief description of each feature:

- `Length`: Number of characters

- `Freq_HAL`: Hyperspace Analogue to Language frequency norms based on the HAL corpus of 131 million words. Higher values may indicate more frequent words in a corpus.

- `Log_Freq_HAL`: Logarithmic transformation of `Freq_HAL`

- `I_Mean_RT`: Individual mean reaction time, associated with lexical decision time

- `I_Zscore`: Z-score of individual reaction times, associated with word difficulty

- `I_SD`: Standard deviation of individual reaction times

- `Obs`: Number of observations (participants) recorded for the respective word

- `I_Mean_Accuracy`: Individual mean accuracy score, average accuracy score in tasks related to word difficulty

This dataset appears to have been partly preprocessed before publication: some features come with a transformed or standardized version of themselves. At a glance, we can perform feature selection by removing the `Freq_HAL`, `I_Mean_RT`, `I_SD`, and `Obs` columns, since the dataset already offers transformed versions of the same information. The critical feature `I_Zscore` is a function of `I_Mean_RT` and `I_SD`, so including the latter two is redundant.

In [2]:
# Drop the Freq_HAL, I_Mean_RT, I_SD, and Obs columns, then drop rows with missing values
df = df.drop(['Freq_HAL', 'I_Mean_RT', 'I_SD', 'Obs'], axis=1)
df = df.dropna().reset_index(drop=True)

# Apply lower to words
df['Word'] = df['Word'].str.lower()

In [3]:
# Remove quotes, asterisks, and other punctuation with a vectorized regex replace
pattern = r"[\"*!?.,']"
df['Word'] = df['Word'].str.replace(pattern, '', regex=True)

# Remove duplicates and rebuild a contiguous index
df = df.drop_duplicates('Word').reset_index(drop=True)

This dataset can be strengthened by extracting more features, such as:

- Vowel Count, also correlated to syllables
- Entropy, a measure of the unpredictability of the word's characters. Computed from $H(x)=-\sum{p(x)\log_2{p(x)}}$, where $p(x)$ is the relative frequency of character $x$ in the word
- Parts of speech category. Tagged based on the [Penn Treebank Project](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
- Sentiment score

In [4]:
# Count vowels (a, e, i, o, u) in each word; words are already lowercased
df['Vowels'] = df['Word'].str.count(r'[aeiou]')

In [5]:
from collections import Counter
import math

# Word entropy
def calculate_entropy(word):
    # Frequency of each character
    char_counts = Counter(word)

    # Calculate the probability of each character
    total_chars = len(word)
    char_probabilities = {char: count / total_chars for char, count in char_counts.items()}

    # Calculate the entropy
    entropy = -sum(prob * math.log2(prob) for prob in char_probabilities.values())

    return entropy

entropy_values = [calculate_entropy(word) for word in df['Word']]
df['Entropy'] = entropy_values
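
As a hand check of the entropy feature, take "aah": the character probabilities are 2/3 for "a" and 1/3 for "h", so the entropy is -(2/3 · log2(2/3) + 1/3 · log2(1/3)) ≈ 0.918296. The sketch below restates the calculation compactly under the hypothetical name `char_entropy` so that it runs on its own, and compares it with the `Entropy` values shown for the sample rows of the preprocessed dataset.

```python
from collections import Counter
import math


def char_entropy(word):
    """Shannon entropy (in bits) of the character distribution of `word`."""
    counts = Counter(word)
    n = len(word)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


for w in ["aah", "aback", "abacus"]:
    print(w, round(char_entropy(w), 6))
# aah 0.918296
# aback 1.921928
# abacus 2.251629
```

A word made of a single repeated character scores 0, and the score grows as characters become more varied and evenly distributed, so longer words with few repeats tend to score higher.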

In [6]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download(['vader_lexicon', 'averaged_perceptron_tagger'])

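# Getting parts of speech (pos_tag expects a list of tokens).
# Note: the tagger uses neighboring tokens as context, so tags for
# isolated dictionary words should be read as approximate.
word_tags = nltk.pos_tag(df['Word'].tolist())
word_tags = [tag for _, tag in word_tags]

df['PoS'] = word_tags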



In [7]:
# Getting sentiment score, we look at the compound score for a final vote
sia = SentimentIntensityAnalyzer()
sentiment_scores = [sia.polarity_scores(word)['compound'] for word in df['Word']]

df['SentimentScore'] = sentiment_scores

In [9]:
# Save preprocessed dataset
df.to_csv('./data/NewWordDifficulty.csv', index=False)
df.head()

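| | Word | Length | Log_Freq_HAL | I_Zscore | I_Mean_Accuracy | Vowels | Entropy | PoS | SentimentScore |
|---|---|---|---|---|---|---|---|---|---|
| 0 | a | 1 | 16.18 | -0.01 | 0.73 | 1 | -0.0 | DT | 0.0 |
| 1 | aah | 3 | 5.4 | 0.21 | 0.62 | 2 | 0.918296 | JJ | 0.0 |
| 2 | aaron | 5 | 9.29 | -0.11 | 0.97 | 3 | 1.921928 | NN | 0.0 |
| 3 | aback | 5 | 5.96 | 0.11 | 0.45 | 2 | 1.921928 | NN | 0.0 |
| 4 | abacus | 6 | 6.24 | 0.65 | 0.47 | 3 | 2.251629 | NN | 0.0 |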


# Proposed Solution

In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Provide enough detail (e.g., algorithmic description and/or theoretical properties) to convince us that your solution is applicable. Make sure to describe how the solution will be tested.  

If you know details already, describe how (e.g., library used, function calls) you plan to implement the solution in a way that is reproducible.

If it is appropriate to the problem statement, describe a benchmark model<a name="sota"></a>[<sup>[3]</sup>](#sotanote) against which your solution will be compared. 

# Evaluation Metrics

Propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).

# Results

You may have done tons of work on this. Not all of it belongs here. 

Reports should have a __narrative__. Once you've looked through all your results over the quarter, decide on one main point and 2-4 secondary points you want us to understand. Include the detailed code and analysis results of those points only; you should spend more time/code/plots on your main point than the others.

If you went down any blind alleys that you later decided to not pursue, please don't abuse the TAs' time by throwing in 81 lines of code and 4 plots related to something you actually abandoned.  Consider deleting things that are not important to your narrative.  If it's slightly relevant to the narrative or you just want us to know you tried something, you could keep it in by summarizing the result in this report in a sentence or two, moving the actual analysis to another file in your repo, and providing us a link to that file.

### Subsection 1

You will likely have different subsections as you go through your report. For instance you might start with an analysis of the dataset/problem and from there you might be able to draw out the kinds of algorithms that are / aren't appropriate to tackle the solution.  Or something else completely if this isn't the way your project works.

### Subsection 2

Another likely section is if you are doing any feature selection through cross-validation or hand-design/validation of features/transformations of the data

### Subsection 3

Probably you need to describe the base model and demonstrate its performance.  Maybe you include a learning curve to show whether you have enough data to do train/validate/test split or have to go to k-folds or LOOCV or ???

### Subsection 4

Perhaps some exploration of the model selection (hyper-parameters) or algorithm selection task. Validation curves, plots showing the variability of performance across folds of the cross-validation, etc. If you're doing one, the outcome of the null hypothesis test or parsimony principle check to show how you are selecting the best model.

### Subsection 5 

Maybe you do model selection again, but using a different kind of metric than before?



# Discussion

### Interpreting the result

OK, you've given us quite a bit of tech information above, now it's time to tell us what to pay attention to in all that.  Think clearly about your results, decide on one main point and 2-4 secondary points you want us to understand. Highlight HOW your results support those points.  You probably want 2-5 sentences per point.

### Limitations

Are there any problems with the work?  For instance would more data change the nature of the problem? Would it be good to explore more hyperparams than you had time for?   

### Ethics & Privacy

If your project has obvious potential concerns with ethics or data privacy discuss that here.  Almost every ML project put into production can have ethical implications if you use your imagination. Use your imagination.

Even if you can't come up with an obvious ethical concern that should be addressed, you should know that a large number of ML projects that go into production have unintended consequences and ethical problems once in production. How will your team address these issues?

Consider a tool to help you address the potential issues such as https://deon.drivendata.org

### Conclusion

Reiterate your main point and in just a few sentences tell us how your results support it. Mention how this work would fit in the background/context of other work in this field if you can. Suggest directions for future work if you want to.

# Footnotes
<a name="lorenznote"></a>1.[^](#lorenz): Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html<br> 
<a name="admonishnote"></a>2.[^](#admonish): Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.<br>
<a name="sotanote"></a>3.[^](#sota): Perhaps the current state of the art solution such as you see on [Papers with code](https://paperswithcode.com/sota). Or maybe not SOTA, but rather a standard textbook/Kaggle solution to this kind of problem
