# COGS 118B - Project Proposal

# Names

- Anh Tran
- Eric Song
- Kendrick Nguyen

# Abstract

The heart of this project is to determine underlying patterns among English words that may contribute to their difficulty. We define word difficulty as the level of complexity in understanding a particular word. Although word difficulty is subjective and experience-dependent, we will quantify it using features such as pronunciation, length, ease of use, and Hyperspace Analog to Language (HAL) frequency, among others. We will use exploratory data analysis together with unsupervised and supervised machine learning algorithms to discover which features contribute to a given pattern and to the corresponding word difficulty. The outputs of these algorithms, such as spatial representations, clustering metrics, and regression metrics, will be used to verify these underlying patterns and their relation to word difficulty.

# Background

English is currently the most spoken language in the world, with 1.456 billion speakers <a name="wiki"></a>[<sup>[2]</sup>](#wikinote). A large portion of these speakers are learning it as a second language <a name="wiki"></a>[<sup>[2]</sup>](#wikinote). Learners often find the language difficult and encounter challenges such as complex pronunciation and non-obvious rules (for instance, think of “read” and “read”, which are spelled identically but pronounced differently depending on tense<a name="adjective"></a>[<sup>[6]</sup>](#adjectivenote)). To better understand how English is learned as a second language, many researchers study how difficult a particular English word or passage is to learn. For example, the Flesch-Kincaid readability tests were created to measure how difficult an English passage is to grasp<a name="flesch"></a>[<sup>[3]</sup>](#fleschnote). The tests grew out of the U.S. Navy's need to gauge the reading comprehension level of its recruits. They plug the total number of words, sentences, and syllables into an equation to produce a score.
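To make the "equation" above concrete, the two commonly cited Flesch-Kincaid formulas can be sketched as follows. The constants are the standard published ones described in the cited article; the function names are our own and purely illustrative, and this is not part of our proposed method.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return (
        206.835
        - 1.015 * (total_words / total_sentences)
        - 84.6 * (total_syllables / total_words)
    )


def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade."""
    return (
        0.39 * (total_words / total_sentences)
        + 11.8 * (total_syllables / total_words)
        - 15.59
    )


# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
ease = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)
print(f"reading ease = {ease:.3f}, grade level = {grade:.2f}")
```

Both scores operate on whole passages, which is why a word-level measure such as the one introduced next is needed for our purposes.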

Another research group that analyzed English words is Basu et al. <a name="avishek"></a>[<sup>[1]</sup>](#avisheknote). Building on the English Lexicon Project, they used traditional machine learning models as well as a convolutional neural network based model to predict word difficulty. We will build on the foundations that this project and the English Lexicon Project laid out. In particular, we will use their I_Zscore as a metric of word difficulty. The I_Zscore is the “standardized mean lexical decision latency for each word” <a name="lexicon"></a>[<sup>[7]</sup>](#lexiconnote). The lexical decision latency is the time it takes to read a word and decide whether that word is in the English language or not <a name="lexical"></a>[<sup>[8]</sup>](#lexicalnote). We therefore treat it as a proxy for how difficult a word is: harder words may have a higher lexical decision latency than easier words, a relationship the English Lexicon Project explores.

We will, in part, use unsupervised machine learning techniques to try to discover underlying patterns that distinguish words classified as easy (closer to 0 on the I_Zscore) from words classified as hard (closer to 1 on the I_Zscore).

By discovering certain patterns among English words, such as similarities in their pronunciation or length, English speakers and learners could leverage these patterns to learn new words that follow a similar convention. These patterns could also give speakers and learners insights into, and expectations about, word difficulty, which can inform people’s subjective opinions on how language is used and learned.





# Problem Statement

The goal of this project is to determine how difficult an English word is and whether difficult words share some underlying similarity that isn't immediately obvious. We define difficulty as the `I_Zscore` obtained from [this dataset](https://www.kaggle.com/datasets/kkhandekar/word-difficulty/data). We are trying to determine whether underlying characteristics of the data (for instance, word length, vowel count, or the presence of certain groupings of letters) can help us group words together and predict the difficulty of new words. Our success can be measured in ways such as finding clusters that correspond well with the `I_Zscore` (as potentially determined by an adjusted Rand score) and testing the accuracy of predictions on new words.
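As a rough illustration of the word-level characteristics mentioned above, the following sketch derives length, vowel count, and letter groupings (here, character bigrams) from a single word. The helper name `word_features`, the choice of bigrams, and the vowel set are our own assumptions for illustration, not features fixed by the dataset.

```python
from collections import Counter

# Assumption: only a, e, i, o, u are counted as vowels ("y" is excluded).
VOWELS = set("aeiou")


def word_features(word):
    """Return simple surface features for one word (hypothetical helper)."""
    w = word.lower()
    # Character bigrams: every pair of adjacent letters in the word.
    bigrams = [w[i:i + 2] for i in range(len(w) - 1)]
    return {
        "length": len(w),
        "vowel_count": sum(ch in VOWELS for ch in w),
        "bigrams": Counter(bigrams),
    }


features = word_features("banana")
print(features["length"], features["vowel_count"], dict(features["bigrams"]))
```

Features like these could be appended as extra columns next to the dataset's existing ones before clustering or regression.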

### Data: WORD DIFFICULTY PREDICTION

- Source: https://www.kaggle.com/datasets/kkhandekar/word-difficulty. This dataset is publicly available on Kaggle and was used by a prior project to predict the difficulty of English words <a name="avishek"></a>[<sup>[1]</sup>](#avisheknote), which we credit accordingly.
- Size: 40481 observations of 9 variables
- Description: Each observation consists of the `Word`, `Length`, `Freq_HAL`, `Log_Freq_HAL`, `I_Mean_RT`, `I_Zscore`, `I_SD`, `Obs`, and `I_Mean_Accuracy`.
- Critical variable: `I_Zscore` is the critical variable for our problem statement, as it denotes the difficulty of a word. This value ranges between 0 and 1, with 0 being SIMPLE and 1 being DIFFICULT.


# Proposed Solution

Word difficulty is subjective and experience-dependent. Therefore, this project is not intended to formulate a universal model and metric for judging how difficult it is to understand different English words. Instead, it concentrates on discovering patterns among English words and their correlations with word difficulty. In other words, our general solution is to extract relational characteristics and features of English words and use them to evaluate correlations with word difficulty as measured by `I_Zscore`.

Features already present in the dataset, such as `Length`, `Freq_HAL`, and `I_Zscore`, will be used to discover patterns. Since we are not limited to this subset, additional features, such as grammatical category (e.g., noun, verb, adjective), n-grams (sequences of contiguous letters within a word), phonetics, word embeddings/tokens, and sentiment scores, could be assessed and employed. An additional route could be discovering a "hidden" or custom feature derived from one or more of these features.

These experiments will be carried out through:

- **Data Augmentation:** To create additional dataset features, such as each word's grammatical category, n-grams, phonetics, word embeddings/tokens, and sentiment scores. This could be achieved by joining an additional word/dictionary-based dataset or by using a public word/dictionary API to obtain these features. Obtaining more features could strengthen our analysis of word patterns and difficulty.

- **Exploratory Data Analysis (EDA):** To summarize the main characteristics of the dataset by visualizing its default features. For instance, EDA could help us understand the distribution of quantitative features and, if applicable, identify redundant features to eliminate.

- **Dimension Reduction & Feature Selection:** To extract the most important features in the dataset and reduce its dimensionality. This step works in conjunction with EDA, as it provides insights into the most prevalent features or patterns in English words. This will mostly be accomplished using Principal Component Analysis (PCA) and various feature selection techniques offered by `sklearn`’s API.

- **Various Clustering Algorithms:**

  - K-Means Clustering
  - Hierarchical Clustering
  - Word Embedding-Based Clustering

- **Various Regression Algorithms:**

  - Linear Regression
  - Random Forest Regression
  - XGBoost Regression

- **Algorithm Evaluation:** See more in the Evaluation Metrics section.
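The steps above could fit together roughly as in the following sketch, which standardizes a feature table, reduces it with PCA, clusters it with K-Means, and compares the clusters against a three-class split of the difficulty score. All data here is randomly generated as a stand-in (it is not the Kaggle dataset), and the number of components, number of clusters, and tertile split are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the real feature table (NOT the Kaggle data).
n_words = 300
X = np.column_stack([
    rng.integers(2, 15, size=n_words),   # stand-in for Length
    rng.normal(8.0, 2.0, size=n_words),  # stand-in for Log_Freq_HAL
    rng.integers(1, 6, size=n_words),    # stand-in for a vowel count
]).astype(float)
difficulty = rng.random(n_words)         # stand-in for I_Zscore

# Split the continuous score into three equal-sized classes (tertiles):
# 0 = easy, 1 = medium, 2 = hard.
cut_points = np.quantile(difficulty, [1 / 3, 2 / 3])
difficulty_class = np.digitize(difficulty, cut_points)

# Standardize, project onto 2 principal components, then cluster.
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)

# Internal quality (silhouette) and agreement with the difficulty classes (ARI).
sil = silhouette_score(X_2d, cluster_labels)
ari = adjusted_rand_score(difficulty_class, cluster_labels)
print(f"silhouette={sil:.3f}  adjusted_rand={ari:.3f}")
```

Because the stand-in features are unrelated to the stand-in score, the adjusted Rand score should sit near zero here; on real data, a value well above zero would suggest the clusters track difficulty.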

# Evaluation Metrics

Using our particular dataset, the `I_Zscore` variable will be used as a "ground-truth" label. Since `I_Zscore` is continuous, we may split it into three classes ("easy", "medium", and "hard"). The evaluation metric for our clustering algorithms is the Silhouette Score, which quantifies the density and separation of clusters. This metric is bounded between -1 and 1, and a higher positive score is desired.

Meanwhile, values near "0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar."<a name="sklearn"></a>[<sup>[4]</sup>](#sklearnnote).


The Silhouette Score can be represented mathematically as follows,<a name="fleischer"></a>[<sup>[5]</sup>](#fleischernote) where

$a$: The mean distance between a sample and all other points in the same cluster

$b$: The mean distance between a sample and all points in the next nearest cluster

$$\text{score} = \frac{b-a}{\max(a,b)}$$
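For intuition, the formula can be computed by hand on a toy example. The following is a minimal from-scratch sketch using Euclidean distance; the function names are our own, the zero score for a singleton cluster is a convention we assume, and in practice we would rely on the `sklearn` implementation cited above. It assumes at least two clusters and that not all points coincide.

```python
import math


def silhouette_sample(index, points, labels):
    """Silhouette value (b - a) / max(a, b) for a single sample."""
    own = labels[index]
    p = points[index]

    # a: mean distance to the other points in the same cluster.
    same = [q for i, q in enumerate(points) if labels[i] == own and i != index]
    if not same:
        return 0.0  # assumed convention for a cluster with one member
    a = sum(math.dist(p, q) for q in same) / len(same)

    # b: smallest mean distance to the points of any other cluster.
    b = min(
        sum(math.dist(p, q) for i, q in enumerate(points) if labels[i] == other)
        / labels.count(other)
        for other in set(labels)
        if other != own
    )
    return (b - a) / max(a, b)


def silhouette_mean(points, labels):
    """Average silhouette value over all samples."""
    n = len(points)
    return sum(silhouette_sample(i, points, labels) for i in range(n)) / n


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_mean(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette_mean(points, [0, 1, 0, 1])   # each cluster spans both groups
print(f"good={good:.3f}  bad={bad:.3f}")
```

The well-separated labeling scores close to 1, while the mixed labeling scores below 0, matching the interpretation given above.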

We would like to visualize how certain patterns and features relate to difficulty; thus, visual inspection of 2D plots will be another form of evaluation.

The regression algorithms will be used to relate the features to `I_Zscore` in a one-vs-all manner. To measure how strongly the features explain `I_Zscore`, we use $R^2$, the proportion of variance explained by the model. This value typically ranges from 0 to 1 (it can be negative when a model fits worse than predicting the mean), and a higher value indicates a better fit. Briefly, this is mathematically represented as

$$R^2 = 1 - \frac{\sum(y-\hat{y})^2}{\sum(y-\bar{y})^2}$$
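As a quick check of the formula, the following minimal sketch computes $R^2$ directly from its definition on made-up numbers. The function name `r_squared` and the toy values are hypothetical and serve only to show how the score behaves.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]  # toy targets, not real I_Zscore values
print(r_squared(y, [1.0, 2.0, 3.0, 4.0]))  # perfect predictions
print(r_squared(y, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
print(r_squared(y, [4.0, 3.0, 2.0, 1.0]))  # worse than predicting the mean
```

Perfect predictions give 1, always predicting the mean gives 0, and predictions worse than the mean give a negative value.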

# Ethics & Privacy

The dataset that we are using is a public English word dataset that has various features and characteristics that may contribute to word difficulty. Some of these characteristics were obtained from other research works, and we will take care to give proper credit where it is due. Otherwise, we do not anticipate any privacy or ethical concerns in our project.

The project can be biased, since the methods and reasoning we will employ for suggesting underlying patterns and difficulty apply strictly to English words. Thus, our results will not necessarily translate or generalize to other languages.

Another facet that we need to consider is how our results may be used or interpreted. Our models and our results should in no way be used as a metric to judge individuals. Language learning is not a uniform process between different people and our model should not be used as a way to gauge someone's progress in learning a language.

# Team Expectations 

* *Show up to and participate in weekly team meetings*
* *Do the work you're in charge of - contribute to the team!*
* *Communicate through the group chat and make sure that the work we are doing is understandable.*
* *Support each other and accept the diversity of individuals.*

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/11  |  5 PM |  Brainstorm topics/questions (all)  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research |
| 2/16  |  5 PM |  Do background research on topic (all) | Discuss ideal dataset(s) and ethics; draft project proposal |
| 2/20  | 5 PM  | Edit, finalize, and submit proposal; Search for datasets (all)  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/22  | 5 PM  | Import & wrangle data; do some EDA (Anh) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 3/1  | 5 PM  | Finalize wrangling/EDA; Begin programming for project (Eric & Kendrick) | Discuss/edit project code; Complete project |
| 3/8  | 5 PM  | Complete analysis; Draft results/conclusion/discussion (all)| Discuss/edit full project |
| 3/10  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes
<a name="avisheknote"></a>1.[^](#avishek): Avishek, G et al. (2019) Word Difficulty Prediction Using Convolutional Neural Networks. https://github.com/garain/Word-Difficulty-Prediction/blob/master/WORD_DIFFICULTY.pdf<br>
<a name="wikinote"></a>2.[^](#wiki): Wikipedia contributors (18 Feb. 2024) List of Languages By Total Number of Speakers. https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers<br>
<a name="fleschnote"></a>3.[^](#flesch): Wikipedia contributors (27 Dec. 2023) Flesch–Kincaid readability tests. https://en.wikipedia.org/w/index.php?title=Flesch%E2%80%93Kincaid_readability_tests&oldid=1192056958<br>
<a name="sklearnnote"></a>4.[^](#sklearn): sklearn.metrics.silhouette_score. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html<br>
<a name="fleischernote"></a>5.[^](#fleischer): Lecture 6: K-Means etc. https://github.com/COGS118B/Lecture/blob/main/L06_Kmeans.pdf<br>
<a name="adjectivenote"></a>6.[^](#adjective): Order of adjectives. https://dictionary.cambridge.org/us/grammar/british-grammar/adjectives-order<br>
<a name="lexiconnote"></a>7.[^](#lexicon): Balota, D. A., Yap, M. J., Cortese, M. J., Hutchison, K. A., Kessler, B., Loftis, B., Neely, J. H., Nelson, D. L., Simpson, G. B., & Treiman, R. (2007) The English Lexicon Project. https://link.springer.com/content/pdf/10.3758/BF03193014.pdf<br>
<a name="lexicalnote"></a>8.[^](#lexical): Daniel, Z., Stephanie, M. (2000) Lexical Decision. https://www.sciencedirect.com/topics/social-sciences/lexical-decision#:~:text=One%20measure%20of%20the%20relative,the%20English%20language%20or%20not.<br>
