## Contents
- [1. Get Data](#1.-Get-Data)
- [2. Split Data](#2.-Split-Data)
- [3. Preprocessing](#3.-Preprocessing)
  - [3.1. Clean the Data](#3.1.-Clean-the-Data)
  - [3.2. Transform the Data](#3.2.-Transform-the-Data)
  - [3.3. Feature Engineering](#3.3.-Feature-Engineering)
    - [3.3.1 Feature Engineering - Numerical Features](#331-Feature-Engineering---Numerical-Features)
      - [Feature creation](#Feature-creation)
      - [Relevance Analysis of Numerical Features](#Relevance-Analysis-of-Numerical-Features)
        - [Steps taken](#Steps-taken)
        - [Plots Examples](#Plots-Examples)
          - [Relevant Feature: Mistakes Dist Ratio](#Relevant-Feature-Mistakes-Dist-Ratio)
          - [Non-Relevant Feature: Average Sentence Length](#Non-Relevant-Feature-Average-Sentence-Length)
        - [General Observations for Low Relevance Features](#General-Observations-for-Low-Relevance-Features)
      - [Correlation Analysis of Numerical Features](#Correlation-Analysis-of-Numerical-Features)
    - [3.3.2 Feature Engineering - TF-IDF](#332-Feature-Engineering---TF-IDF)
    - [3.3.3 Feature Engineering - Combine Numerical and TF-IDF features](#333-Feature-Engineering---Combine-Numerical-and-TF-IDF-features)
    - [3.3.4 Feature Engineering - Word2Vect](#334-Feature-Engineering---Word2Vect)
    - [3.3.5 Feature Engineering - BERT](#335-Feature-Engineering---BERT)
- [4. Experiments](#4.-Experiments)
  - [4.1. Experiment 1 (train models on the initial 17k data set)](#4.1.-Experiment-1-(train-models-on-the-initial-17k-data-set))
    - [4.1.1 Experiment Summary](#411-Experiment-Summary)
    - [4.1.2 Results](#412-Results)
      - [4.1.2.1 Numerical Features Only](#4121-Numerical-Features-Only)
      - [4.1.2.2 TF-IDF Features Only](#4122-TF-IDF-Features-Only)
      - [4.1.2.3 Combined Numerical and TF-IDF Features](#4123-Combined-Numerical-and-TF-IDF-Features)
      - [4.1.2.4 Word2Vec Features Only](#4124-Word2Vec-Features-Only)
      - [4.1.2.5 BERT](#4125-BERT)
    - [4.1.3 Summary of Results and Recommendations](#413-Summary-of-Results-and-Recommendations)
  - [4.2. Experiment 2 (train models on expanded data set by essays with underrepresented scores 1,5,6)](#4.2.-Experiment-2-(train-models-on-expanded-data-set-by-essays-with-underrepresented-scores-1,5,6))
    - [4.2.1 Comparison of initial data, preprocessing and model training for Experiment 1 and Experiment 2](#421-Comparison-of-initial-data-preprocessing-and-model-training-for-Experiment-1-and-Experiment-2)
    - [4.2.2 Comparison of the results from Experiment 1 and Experiment 2](#422-Comparison-of-the-results-from-Experiment-1-and-Experiment-2)
      - [4.2.2.1 Numerical Features Only](#4221-Numerical-Features-Only)
      - [4.2.2.2 TF-IDF Features Only](#4222-TF-IDF-Features-Only)
      - [4.2.2.3 Combined Numerical and TF-IDF Features](#4223-Combined-Numerical-and-TF-IDF-Features)
      - [4.2.2.4 Word2Vec Features Only](#4224-Word2Vec-Features-Only)
      - [4.2.2.5 BERT](#4225-BERT)
    - [4.2.3 Summary of Results and Recommendations](#423-Summary-of-Results-and-Recommendations)
- [5. Dimensionality Reduction](#5.-Dimensionality-Reduction)
- [6. Hyperparameter tuning](#6.-Hyperparameter-tuning)
- [7. Predict on Unseen Data](#7.-Predict-on-Unseen-Data)


## 1. Get Data

The dataset for this project comes from the Learning Agency Lab Automated Essay Scoring 2 competition on Kaggle. The goal is to train a model to score student essays, with the evaluation metric being quadratic weighted kappa.
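For reference, quadratic weighted kappa (QWK) compares predicted and true scores and penalizes each disagreement by the squared distance between the two scores. The block below is a minimal pure-Python sketch of the metric, assuming integer scores on the 1-6 scale; the function name `quadratic_weighted_kappa` is illustrative and is not taken from the project notebooks.

```python
from collections import Counter

def quadratic_weighted_kappa(y_true, y_pred, min_score=1, max_score=6):
    """QWK = 1 - sum(w * observed) / sum(w * expected), with w = (i - j)^2 / (K - 1)^2."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    k = max_score - min_score + 1
    observed = Counter(zip(y_true, y_pred))  # counts of (true, predicted) pairs
    true_hist = Counter(y_true)
    pred_hist = Counter(y_pred)
    numerator = 0.0
    denominator = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (k - 1) ** 2
            numerator += weight * observed[(i, j)]
            # Expected count if true and predicted scores were independent.
            denominator += weight * true_hist[i] * pred_hist[j] / n
    if denominator == 0:
        return 1.0  # degenerate case: every rating is the same single value
    return 1.0 - numerator / denominator

perfect = quadratic_weighted_kappa([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6])
reversed_kappa = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], min_score=1, max_score=3)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.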

The dataset comprises approximately 17,000 student-written argumentative essays, each scored on a scale of 1 to 6. The dataset is split into training and test sets.

- **train.csv**: Contains essays and their corresponding scores.
  - `essay_id`: The unique ID of the essay.
  - `full_text`: The full essay response.
  - `score`: The holistic score of the essay on a 1-6 scale.

- **test.csv**: Unseen test data on Kaggle side. It has the same fields as train.csv, excluding the score. The rerun test set has approximately 8,000 observations.
    - Note: This unseen dataset is hidden on Kaggle's side and can additionally be used to evaluate model performance on unseen data. In this project, we split the training data 80/20 to create our own test set, which is used during model testing.

- **sample_submission.csv**: A submission file in the correct format.
  - `essay_id`: The unique ID of the essay.
  - `score`: The predicted holistic score of the essay on a 1-6 scale.

For more details, refer to the [Learning Agency Lab Automated Essay Scoring 2 Kaggle Page](https://www.kaggle.com/c/learning-agency-lab-automated-essay-scoring-2).

- **Initial EDA**
- The dataset contains **17,307 entries** and **3 columns**.
- The 'score' column shows an **imbalanced distribution**, with scores 1, 5, and 6 having significantly fewer samples, as illustrated in the plot below:

<img src="score_distribution_plot.png" alt="Score Distribution" width="800" height="600"/>

## 2. Split Data
The data was split into training (80%) and testing (20%) sets using stratified sampling to ensure that the score proportions are maintained in both sets. This approach preserves the original distribution of scores, which is crucial for accurately handling imbalanced datasets (see plot for details).
- The `train_split` dataset contains **13,845 entries** and **3 columns**.
- The `test_split` dataset contains **3,462 entries** and **3 columns**.
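The stratified 80/20 split can be sketched with scikit-learn's `train_test_split`. The example below uses a small synthetic, imbalanced list of scores as a stand-in for the real `score` column, and `random_state=42` is an arbitrary choice, not necessarily the value used in the project.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the `score` column (not the real data).
scores = [1] * 50 + [2] * 250 + [3] * 350 + [4] * 250 + [5] * 80 + [6] * 20
essay_ids = list(range(len(scores)))

# stratify=scores keeps the share of each score the same in both parts.
train_ids, test_ids, train_scores, test_scores = train_test_split(
    essay_ids, scores, test_size=0.2, stratify=scores, random_state=42
)

def proportions(labels):
    """Share of each score in a list of labels."""
    counts = Counter(labels)
    return {score: counts[score] / len(labels) for score in sorted(counts)}

train_share = proportions(train_scores)
test_share = proportions(test_scores)
```

Comparing `train_share` and `test_share` is a quick way to confirm that rare scores such as 6 appear in both sets in the same proportion.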

<img src="score_distribution_train_test_plot.png" alt="Score Distribution Train and Test" width="1200" height="900"/>

## 3. Preprocessing
### 3.1 Clean the Data
This stage includes handling missing values, removing duplicates, and correcting errors.

- **Missing Values:**
  - There are no missing values in the `full_text` column.

- **Duplicate Detection:**
  - Two approaches were utilized to find duplicates:
    1. Finding entries that are exactly identical.
        - No duplicates found.
    2. TF-IDF vectorization followed by a similarity measure such as cosine similarity.
      - **Threshold 0.95:** Returns 1 duplicate.
      - **Threshold 0.9:** Returns 2 duplicates.
        - Texts found with threshold 0.9 have similar lengths and identical scores. The only difference (found by visual inspection) is that one essay has a PII placeholder where the other has a real value, e.g., `PROPER_NAME` in one essay and `Luk` in the other.
      - **Threshold 0.8:** Returns 7 duplicates.
        - The matched texts have different scores, apparently because of their different lengths, so this threshold was not used. It appears that some students might have copied essays from friends and extended them.
    -  See the code in "Analyzing duplicates" Notebook
- **Replacing PII Placeholders:**
  - During duplicate detection, it was noticed that `full_text` contains placeholders for PII data. The volume of placeholders is as follows:
    ```
    PROPER_NAME      252
    EMAIL_ADDRESS      2
    STUDENT_NAME       7
    OTHER_PII         28
    LOCATION_NAME     12
    SCHOOL_NAME       14
    GENERIC_NAME       1
    PHONE_NUMBER       2
    STREET_ADDRESS     2
    STATE_NAME         1
    TEST_NAME          1
    CITY_STATE         1
    ```

  - Given the low frequency of most placeholders, it was decided to skip replacing these placeholders with real names. The effort required to accurately replace these terms may not be justified by the potential (and likely minimal) improvements in model performance.
  - See the code in "Analyzing PII placeholders" Notebook
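The TF-IDF based duplicate check described above can be sketched as follows. This is a simplified illustration on three toy sentences, not the code from the "Analyzing duplicates" notebook; the helper name `find_near_duplicates` and the toy texts are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_near_duplicates(texts, threshold=0.9):
    """Return (i, j, similarity) for each pair of texts at or above the threshold."""
    tfidf = TfidfVectorizer().fit_transform(texts)
    similarity = cosine_similarity(tfidf)  # dense n x n matrix
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if similarity[i, j] >= threshold:
                pairs.append((i, j, float(similarity[i, j])))
    return pairs

toy_essays = [
    "Dear PROPER_NAME, I think students should join one school club.",
    "Dear Luk, I think students should join one school club.",
    "Driverless cars are not ready for busy city streets.",
]
near_duplicates = find_near_duplicates(toy_essays, threshold=0.8)
```

The first two toy texts differ only in the placeholder versus the real name, which mirrors the case found at threshold 0.9. The full pairwise matrix grows quadratically with the number of essays, so on the complete dataset the comparison would likely need to be done in batches.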

### 3.2 Transform the Data
This stage includes several key steps to prepare the text for machine learning models. The following transformations were applied:

- **Standardizing Contractions**: Expanded common contractions using a predefined dictionary.
- **Removing HTML Tags**: Eliminated HTML tags to retain only plain text.
- **Removing Special Characters and Punctuation**: Cleansed text by removing special characters and punctuation.
- **Removing Words with Numbers**: Removed words containing numbers and any trailing 's.
- **Removing Stop Words**: Eliminated common stop words.
- **Removing Non-ASCII Characters**: Removed non-ASCII characters, including emojis.
- **Tokenization and Lemmatization**: Applied NLTK's tokenization and lemmatization using WordNet POS tags.
- **Identifying Misspelled Words**: Identified and counted misspelled words using a spell checker.
- **Stemming (Removed)**: Initially applied stemming but later removed it in favor of lemmatization for better results.
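The rule-based part of these transformations can be illustrated with a short, dependency-free sketch. The contraction dictionary and stop-word set below are tiny hypothetical stand-ins, the order of the steps is one reasonable choice rather than the project's exact pipeline, and tokenization, lemmatization, and spell checking are left out.

```python
import re

# Tiny illustrative stand-ins (assumptions), not the project's actual lists.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
STOP_WORDS = {"i", "a", "the", "is", "it", "to", "and"}

def clean_text(text):
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)   # standardize contractions
    text = re.sub(r"<[^>]+>", " ", text)               # remove HTML tags
    text = text.encode("ascii", "ignore").decode()     # remove non-ASCII characters
    text = re.sub(r"'s\b", "", text)                   # drop trailing 's
    text = re.sub(r"\b\w*\d\w*\b", " ", text)          # remove words containing digits
    text = re.sub(r"[^a-z\s]", " ", text)              # remove punctuation and special characters
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

cleaned = clean_text("<p>I don't like the 2nd test's rules, it's unfair!</p>")
```

Contractions are expanded before the possessive `'s` is stripped, so that `it's` becomes `it is` rather than `it`.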

### 3.3. Feature Engineering

In the preprocessing stage, we utilize various technologies and techniques to transform the raw data into a format suitable for machine learning models. We will use the following techniques:

- **Numerical Features**
Processing and normalizing numerical data features.

- **TF-IDF**
Using Term Frequency-Inverse Document Frequency (TF-IDF) to convert text data into numerical vectors based on word frequency.

- **Word2Vec**
Employing Word2Vec to create dense vector representations of words, capturing semantic meanings and relationships.

- **BERT**
Utilizing Bidirectional Encoder Representations from Transformers (BERT) for creating contextualized word embeddings, enhancing the understanding of word context in sentences.

- **SBERT**
Using Sentence-BERT (SBERT) to generate embeddings for entire sentences, optimizing for tasks that involve sentence-level semantics.

### 3.3.1 Feature Engineering - Numerical Features
This stage involves feature creation, relevance analysis, and correlation analysis. Of the 48 features created, 8 were retained: those that are relevant for predicting scores and whose pairwise Pearson correlation coefficients are below 0.9.

#### Feature creation
We created 48 numerical features, including:

1. **Text Analysis Features**: Computes text features such as word count, stopword count, punctuation count, sentence lengths, etc.
2. **Ratio Features**: Calculates ratios of different features like distinct words ratio, mistakes ratio, and transitional phrases ratio.
3. **Text Statistics Features**: Uses the textstat library to compute readability and complexity metrics such as Flesch reading ease, SMOG index, Coleman-Liau index, and others.

#### Relevance Analysis of Numerical Features:
To analyze the relevance of these features, we used a combination of statistical tests and visualizations. The methods include:

- **F-statistic and p-value**: We used the F-statistic to measure the relationship between each feature and the target variable, with a p-value threshold of 0.05 indicating statistical significance.
- **Plot Feature per Classes**: Visualizing the distribution of features across score classes to intuitively assess their relevance.

By combining these methods, we identified 20 relevant features and 28 non-relevant features. This comprehensive approach ensures that selected features are consistently relevant across multiple tests, enhancing the accuracy of our scoring prediction model.
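As an illustration of the F-statistic step, the sketch below runs scikit-learn's `f_classif` (a one-way ANOVA F-test per feature) on synthetic data. The two features are invented for the example: one depends on the score and one is pure noise. The project's actual features and thresholds for the High, Medium, and Low categories are not reproduced here.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
scores = rng.integers(1, 7, size=600)                 # synthetic scores 1-6
informative = scores + rng.normal(0, 0.5, size=600)   # depends on the score
noise = rng.normal(0, 1, size=600)                    # independent of the score
X = np.column_stack([informative, noise])

# One F-statistic and p-value per feature column.
f_stats, p_values = f_classif(X, scores)
is_significant = p_values < 0.05
```

A large F-statistic means the feature's mean differs strongly between score classes relative to the spread within each class, which is what the box plots and violin plots then confirm visually.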

##### Steps taken
1. **F-statistic and p-value Analysis**: An initial pass with a p-value threshold of 0.05 identified 3 non-relevant features.

2. **Categorization Based on F-Statistics**: Established thresholds for High, Medium, and Low relevance categories based on F-statistics, which were later verified through plot analysis. See:

<img src="f_stat_df.png" alt="f-statistic for Numeric Features" width="400" height="300"/>
*The chart shows the F-statistics for various numeric features. Note: Higher f-statistic values indicate greater relevance and the feature's importance for prediction.*

<img src="f_statistics_with_relevance_categories.png" alt="Categorization of Numeric Features" width="800" height="600"/>
*The chart shows the F-statistics for various numeric features categorized by their relevance (High, Medium, Low) in predicting essay scores.*

3. **Plot Analysis**: Analyzed plots for each category to conclude the relevance of features.

##### Plots Examples

###### Relevant Feature: Mistakes Dist Ratio
The plots below show the distribution of the 'mistakes_dist_ratio' feature across different score classes. The clear trend and separation between score classes indicate that 'mistakes_dist_ratio' is a relevant feature for predicting scores.

<img src="mistakes_dist_ratio_boxplot.png" alt="Mistakes Dist Ratio Boxplot" width="800" height="600"/>
*The boxplot shows a decreasing trend in 'mistakes_dist_ratio' with higher scores, indicating its relevance.*

<img src="mistakes_dist_ratio_violinplot.png" alt="Mistakes Dist Ratio Violin Plot" width="800" height="600"/>
*The violin plot highlights the separation between score classes, further confirming the relevance of 'mistakes_dist_ratio'.*

###### Non-Relevant Feature: Average Sentence Length
The plots below illustrate the distribution of 'avg_sentence_length' across score classes. The lack of clear trends and the presence of significant overlap suggest that 'avg_sentence_length' is not a strong predictor of the score.

<img src="avg_sentence_length_boxplot.png" alt="Average Sentence Length Boxplot" width="800" height="600"/>
*The boxplot shows overlapping distributions of 'avg_sentence_length' across score classes, indicating low relevance.*

<img src="avg_sentence_length_violinplot.png" alt="Average Sentence Length Violin Plot" width="800" height="600"/>
*The violin plot confirms the lack of clear trends and significant overlap, further suggesting low relevance.*

##### General Observations for Low Relevance Features

- **Lack of Clear Trend**: Features do not show a consistent trend across score classes, indicating a weak correlation with the target variable.
- **High Variability**: Significant outliers across all score classes diminish the predictive power of these features.
- **Overlapping Distributions**: Overlap in distributions across score classes reduces their relevance in prediction tasks.

The lack of clear trends, high variability, and overlapping distributions suggest that low-relevance features are not significant predictors of the score. These features are unlikely to contribute meaningfully to the accuracy and reliability of the scoring prediction model.

#### Correlation Analysis of Numerical Features:
The purpose of this analysis is to reduce multicollinearity, which can negatively impact the performance of machine learning models.
The Pearson correlation coefficient was used for the correlation analysis. 
Features with a high correlation (greater than 0.9) were excluded to minimize redundancy and improve model performance.
From the 20 relevant numerical features, the following 8 features were retained for further use (correlation < 0.9):
- 'reading_time'
- 'mistakes_dist_ratio'
- 'polysyllabcount'
- 'sentence_count'
- 'difficult_words'
- 'comma_count'
- 'transitional_phrases_c'
- 'text_dist_words_ratio'
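A correlation filter of this kind can be sketched with NumPy as shown below. This is a greedy variant written for illustration: it uses the absolute Pearson correlation (an assumption) and keeps the first feature of each highly correlated pair, so the result depends on column order. The synthetic data and the helper name `drop_highly_correlated` are not from the project.

```python
import numpy as np

def drop_highly_correlated(X, names, threshold=0.9):
    """Keep a feature only if |Pearson r| with every already kept feature is below threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for idx in range(len(names)):
        if all(corr[idx, k] < threshold for k in kept):
            kept.append(idx)
    return [names[i] for i in kept]

rng = np.random.default_rng(0)
word_count = rng.normal(400, 100, size=500)
reading_time = word_count / 200 + rng.normal(0, 0.01, size=500)  # nearly a rescaled copy
comma_count = rng.normal(20, 5, size=500)                        # unrelated feature

X = np.column_stack([reading_time, word_count, comma_count])
kept = drop_highly_correlated(X, ["reading_time", "word_count", "comma_count"])
```

Here `word_count` is dropped because it is almost perfectly correlated with `reading_time`, while `comma_count` is kept.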

### 3.3.2 Feature Engineering - TF-IDF
**TF-IDF feature extraction**
- Starting with max_df=0.99 and min_df=10 in the TfidfVectorizer is a strategic choice to reduce the initial vocabulary size of 54,056 to a more manageable number. 
  - The max_df=0.99 parameter excludes terms that appear in more than 99% of the documents, removing very common terms that are less informative. 
  - The min_df=10 parameter excludes terms that appear in fewer than 10 documents, eliminating rare terms that are unlikely to be useful for generalizing across the dataset.
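The effect of `max_df` and `min_df` is easiest to see on a tiny corpus. The sketch below wraps `TfidfVectorizer` in a hypothetical helper `build_tfidf` whose defaults match the values above; the toy call uses smaller thresholds (`max_df=0.7`, `min_df=2`) only because the corpus has four documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def build_tfidf(texts, max_df=0.99, min_df=10):
    """Fit a TF-IDF vectorizer that drops very common and very rare terms."""
    vectorizer = TfidfVectorizer(max_df=max_df, min_df=min_df)
    matrix = vectorizer.fit_transform(texts)
    return vectorizer, matrix

toy_corpus = [
    "students should join clubs",
    "students like sports clubs",
    "students drive cars",
    "cars need drivers",
]
# "students" appears in 3 of 4 documents (75% > 70%) and is dropped by max_df;
# terms that appear in only one document are dropped by min_df.
vectorizer, matrix = build_tfidf(toy_corpus, max_df=0.7, min_df=2)
vocabulary = sorted(vectorizer.vocabulary_)
```

A float `max_df` is interpreted as a fraction of documents, while an integer `min_df` is an absolute document count.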

**Dimensionality Reduction - PCA**
- TF-IDF extracted 6168 features, which is too many for a training set of about 13000 rows. As a rule of thumb, the number of features should not exceed roughly 10% of the number of rows, so a dimensionality reduction technique was applied.
- Based on the analysis, 1300 components were selected to capture slightly over 85% of the variance. Choosing fewer components would capture significantly less variance and may not be sufficient for effective data representation. See the histogram and plot below.

<img src="explained_variance_ratio_histogram.png" alt="Explained Variance ratio Histogram" width="800" height="600"/>
*The histogram shows that the first few principal components explain a significant portion of the variance, while the contribution of each subsequent component rapidly decreases.*

<img src="cumulative_explained_variance_ratio_plot.png" alt="Cumulative Explained Variance Ratio Plot" width="800" height="600"/>
*The plot shows that 1300 components are required to capture slightly over 85% of the total variance. The curve rises steeply initially and then starts to flatten out, indicating diminishing returns for each additional component.*
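Choosing the number of components from the cumulative explained variance can be sketched as follows. The example uses a small synthetic dense matrix rather than the TF-IDF matrix, and the helper name `components_for_variance` is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(X, target=0.85):
    """Smallest number of principal components whose cumulative explained variance reaches target."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # Index of the first cumulative value >= target, converted to a count.
    n_components = int(np.searchsorted(cumulative, target)) + 1
    return min(n_components, len(cumulative)), cumulative

rng = np.random.default_rng(0)
# Ten synthetic features; the first two carry most of the variance.
column_scales = np.array([10, 5, 1, 1, 1, 1, 1, 1, 1, 1])
X = rng.normal(size=(200, 10)) * column_scales
n_components, cumulative = components_for_variance(X, target=0.85)
```

scikit-learn can also pick the count directly when `n_components` is given as a fraction, for example `PCA(n_components=0.85)`.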

### 3.3.3 Feature Engineering - Combine Numerical and TF-IDF features
In later stages, we can experiment with models on different sets of features (e.g., numerical, TF-IDF, or combined features). Here, we focus on preparing combined numerical and TF-IDF features.

- **Numerical Features**: Given the range of TF-IDF values (approximately -0.52 to 0.69), numerical features were scaled so that every feature contributes on a comparable scale. Scaling is necessary because numerical features have different ranges and magnitudes, and it prevents features with larger ranges from dominating the results.
- **TF-IDF Features**: These vectors are already normalized, representing term frequencies normalized by document frequencies. Scaling TF-IDF features again could distort their inherent meaning.
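A minimal sketch of this combination step is shown below. It assumes z-score standardization for the numerical block (the scaler actually used in the project is not stated) and works on synthetic arrays; `fit_scaler` and `combine_features` are hypothetical helper names.

```python
import numpy as np

def fit_scaler(train_numerical):
    """Learn per-column mean and standard deviation from the training rows only."""
    mean = train_numerical.mean(axis=0)
    std = train_numerical.std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid division by zero
    return mean, std

def combine_features(numerical, tfidf_block, mean, std):
    """Standardize the numerical columns and append the TF-IDF block unchanged."""
    return np.hstack([(numerical - mean) / std, tfidf_block])

rng = np.random.default_rng(0)
train_numerical = rng.normal(loc=[300.0, 15.0], scale=[100.0, 5.0], size=(50, 2))
train_tfidf = rng.uniform(-0.5, 0.7, size=(50, 3))  # stand-in for the TF-IDF block

mean, std = fit_scaler(train_numerical)
train_combined = combine_features(train_numerical, train_tfidf, mean, std)
```

The same `mean` and `std` learned on the training rows should be reused when transforming the test rows, so that no information from the test set leaks into the scaling.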

### 3.3.4 Feature Engineering - Word2Vect
Using the Word2Vec technique for essay scoring tasks can capture the semantic relationships and contextual meanings of words, providing richer and more nuanced text representations that can improve the accuracy of scoring based on content and coherence. 

Exploratory Data Analysis (EDA) was performed to understand the dataset characteristics such as vocabulary size, minimum text length, maximum text length, average text length, and average sentence length. Based on these insights, the following parameters were chosen for the Word2Vec model:

- **vector_size=300:** This larger size helps capture more semantic nuances, which is beneficial for a diverse and rich vocabulary.
- **window=5:** A window size of 5 is sufficient to capture the context within the average sentence length of approximately 20 words.
- **min_count=10:** Filters out infrequent words, reducing noise and focusing on more common and likely more informative words.
- **workers=32:** Utilizes all 32 CPU cores to speed up the training process, making efficient use of the available computational resources.
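Word2Vec produces one vector per word, so each essay still has to be turned into a single fixed-length vector before it can be fed to the models. How this was done in the project is not described here; the sketch below shows one common option, averaging the vectors of all known words. The dictionary `toy_vectors` is a two-dimensional stand-in for the word vectors of a trained model.

```python
import numpy as np

def essay_vector(tokens, word_vectors, vector_size):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(vector_size)
    return np.mean(known, axis=0)

# Hypothetical 2-dimensional word vectors (the real model uses vector_size=300).
toy_vectors = {"car": np.array([1.0, 0.0]), "road": np.array([0.0, 1.0])}
vec = essay_vector(["car", "road", "unknown"], toy_vectors, vector_size=2)
empty = essay_vector(["unknown"], toy_vectors, vector_size=2)
```

Words filtered out by `min_count` have no vector and are simply skipped by the average.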

### 3.3.5 Feature Engineering - BERT
BERT (Bidirectional Encoder Representations from Transformers) provides contextualized word embeddings that capture the meaning of words based on their context within a sentence. This is particularly beneficial for tasks like essay scoring, where understanding the nuanced meaning of sentences and their coherence is crucial. 

**Exploratory Data Analysis**

BERT accepts at most 512 tokens per input, so longer texts are truncated. The text lengths were therefore explored.

Text length (in tokens) statistics by score:

| Score | Count | Mean   | Std Dev | Min | 25% | 50% | 75%    | Max  |
|-------|-------|--------|---------|-----|-----|-----|--------|------|
| 1     | 1001  | 319.93 | 127.52  | 168 | 229 | 282 | 376.00 | 1109 |
| 2     | 3778  | 308.34 | 115.18  | 164 | 232 | 280 | 350.75 | 1824 |
| 3     | 5022  | 422.31 | 120.77  | 182 | 338 | 403 | 485.00 | 1361 |
| 4     | 3141  | 566.60 | 130.21  | 266 | 478 | 548 | 632.00 | 1608 |
| 5     | 776   | 752.27 | 160.13  | 427 | 638 | 730 | 847.25 | 1617 |
| 6     | 125   | 908.69 | 179.38  | 611 | 785 | 890 | 982.00 | 1582 |

- Maximum token length: 1824- 
Percentage of texts within BERT's token limit: 70.50%

<img src="text_length_distribution_with_scores.png" alt="Text Length Distribution" width="800" height="600"/>
*We can see that texts longer than 512 tokens (all with score=6 and majority with score=5) will lose the content beyond this limit, potentially leading to an incomplete representation of the text's quality and coherence, which are crucial for essay scoring.*


**Alternative Approaches**:
- **Longer Context Models:** Consider using models designed for longer contexts, such as Longformer or BigBird, which can handle longer sequences without truncation.
- **Chunking:** Split longer texts into chunks of 512 tokens and aggregate their embeddings, although this might not fully capture the global context and coherence of the entire text.
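The chunking alternative can be sketched without any model code. The example below splits a sequence of token IDs into consecutive chunks and averages per-chunk embeddings element-wise. The chunk size of 510 is an assumption that leaves room for BERT's `[CLS]` and `[SEP]` tokens within the 512-token limit, and mean pooling is only one possible way to aggregate.

```python
def chunk_token_ids(token_ids, max_len=510):
    """Split token IDs into consecutive chunks of at most max_len tokens."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

def mean_pool(chunk_embeddings):
    """Average equally sized embedding vectors element-wise."""
    n = len(chunk_embeddings)
    return [sum(values) / n for values in zip(*chunk_embeddings)]

# The longest essay in the dataset has 1824 tokens.
chunks = chunk_token_ids(list(range(1824)))
chunk_lengths = [len(chunk) for chunk in chunks]

# Toy 2-dimensional embeddings standing in for 768-dimensional chunk embeddings.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

With this scheme the longest essay becomes four chunks, the last one much shorter than the others, which is one reason a weighted average might be preferable to a plain mean.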

However, let's proceed with BERT to see its results first, and consider alternative approaches if necessary.

**BERT Embedding Process**
- The `full_text` column is converted to lowercase to ensure consistency with the bert-base-uncased model, which is used to capture the full context and semantics of the text. 
- Texts longer than 512 tokens are truncated, and all texts are padded to a maximum of 512 tokens.
- The get_bert_embedding function extracts 768-dimensional embeddings from the [CLS] token for each text, capturing rich contextual information for essay scoring.


## 4. Experiments
## 4.1. Experiment 1 (train models on the initial 17k data set) 

This experiment used 80% of the initial `train.csv` dataset (with 20% left unseen), splitting it again into 80/20 for train and test sets (stratified by score). 

Performance on both the train (11074 rows) and test (2769 rows) sets was analyzed to evaluate various regression models on different feature sets:
- numerical features,
- TF-IDF features,
- combined numerical and TF-IDF features,
- Word2Vec features,
- BERT embeddings (gave up due to long run time).

The models include Linear Regression, Random Forest Regressor, AdaBoost Regressor, CatBoost Regressor, XGBoost Regressor, and LightGBM Regressor.

Detailed results are shown below.

### 4.1.1 Experiment Summary

This experiment used 80% of the initial train.csv dataset (20% is still unseen), splitting it again into 80/20. The performance of the seen data of both train_split and test_split sets was analyzed to evaluate the performance of various regression models on different feature sets: numerical features, TF-IDF features, combined numerical and TF-IDF features, Word2Vec features, and BERT embeddings. The models tested include Linear Regression, Random Forest Regressor, AdaBoost Regressor, CatBoost Regressor, XGBoost Regressor, and LightGBM Regressor. Detailed results are shown below.
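Regression models output continuous values, while QWK is computed on discrete scores, so the predictions have to be mapped back to the 1-6 scale before scoring. The sketch below shows one simple way to do this, rounding half up and clipping to the valid range; it is an assumption about this step, not necessarily what the project notebooks do.

```python
import math

def to_score(prediction, min_score=1, max_score=6):
    """Round a continuous prediction half up and clip it to the valid score range."""
    rounded = math.floor(prediction + 0.5)
    return max(min_score, min(max_score, rounded))

raw_predictions = [0.3, 2.4, 3.6, 7.2]
discrete_scores = [to_score(p) for p in raw_predictions]
```

`math.floor(x + 0.5)` is used instead of the built-in `round` because `round` rounds exact halves to the nearest even integer.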

### 4.1.2 Results

#### 4.1.2.1 Numerical Features Only

| Model                  | QWK Score (Train) | QWK Score (Test) |
|------------------------|-------------------|------------------|
| Linear Regression      | 0.438766          | 0.661431         |
| Random Forest Regressor| 0.719541          | 0.716102         |
| AdaBoost Regressor     | 0.716374          | 0.707997         |
| CatBoost Regressor     | 0.738748          | 0.714213         |
| XGBoost Regressor      | 0.760924          | 0.725878         |
| LightGBM Regressor     | 0.768217          | 0.705879         |

| Model                  | QWK Score (Train) | QWK Score (Test) |
|------------------------|-------------------|------------------|
| CatBoost Classifier    | 0.738946          | 0.697195         |
| XGBoost Classifier     | 0.719882          | 0.690799         |
| LightGBM Classifier    | 0.910527          | 0.679942         |
| Random Forest Classifier| 0.674805         | 0.670565         |
| Logistic Regression    | 0.671035          | 0.670444         |
| AdaBoost Classifier    | 0.339885          | 0.335948         |

#### 4.1.2.2 TF-IDF Features Only

| Model                  | QWK Score (Train) | QWK Score (Test) |
|------------------------|-------------------|------------------|
| Linear Regression      | 0.650407          | 0.485852         |
| CatBoost Regressor     | 0.725470          | 0.583873         |
| XGBoost Regressor      | 0.848442          | 0.520738         |
| LightGBM Regressor     | 0.951018          | 0.533020         |

| Model                  | QWK Score (Train) | QWK Score (Test) |
|------------------------|-------------------|------------------|
| Logistic Regression    | 0.663757          | 0.578316         |
| LightGBM Classifier    | 1.000000          | 0.513538         |
| CatBoost Classifier    | 0.788927          | 0.510062         |
| XGBoost Classifier     | 0.993334          | 0.506426         |

#### 4.1.2.3 Combined Numerical and TF-IDF Features

| Model                  | QWK Score (Train) | QWK Score (Test) |
|------------------------|-------------------|------------------|
| Linear Regression      | 0.697232          | 0.672009         |
| CatBoost Regressor     | 0.870019          | 0.737538         |
| XGBoost Regressor      | 0.942193          | 0.752054         |
| LightGBM Regressor     | 0.970799          | 0.738065         |

| Model                  | QWK Score (Train) | QWK Score (Test) |
|------------------------|-------------------|------------------|
| CatBoost Classifier    | 0.847677          | 0.728796         |
| XGBoost Classifier     | 0.995395          | 0.723560         |
| LightGBM Classifier    | 1.000000          | 0.718091         |

#### 4.1.2.4 Word2Vec Features Only

| Model                  | QWK Score (Train) | QWK Score (Test) |
|------------------------|-------------------|------------------|
| Linear Regression      | 0.355899          | 0.351317         |
| CatBoost Regressor     | 0.523580          | 0.373310         |
| XGBoost Regressor      | 0.673253          | 0.461169         |
| LightGBM Regressor     | 0.781056          | 0.364984         |

| Model                  | QWK Score (Train) | QWK Score (Test) |
|------------------------|-------------------|------------------|
| XGBoost Classifier     | 0.887020          | 0.567956         |
| LightGBM Classifier    | 0.998553          | 0.563399         |
| CatBoost Classifier    | 0.701877          | 0.557159         |
| AdaBoost Classifier    | 0.426707          | 0.391434         |
| Random Forest Classifier| 0.267254         | 0.250111         |

*Word2Vec with a Simple NN using TensorFlow*

- QWK Score on Train Set: 0.6412155326583289
- QWK Score on Test Set: 0.5535504981689063

#### 4.1.2.5 BERT

| Model                  | QWK Score (Train) | QWK Score (Test) |
|------------------------|-------------------|------------------|
| CatBoost Classifier    | 0.823684          | 0.679635         |
| XGBoost Classifier     | 0.975234          | 0.690236         |
| LightGBM Classifier    | 1.000000          | 0.683610         |

| Model                  | QWK Score (Train) | QWK Score (Test) |
|------------------------|-------------------|------------------|
| Linear Regression      | 0.557390          | 0.506605         |
| CatBoost Regressor     | 0.756849          | 0.591912         |
| XGBoost Regressor      | 0.796409          | 0.596577         |
| LightGBM Regressor     | 0.932279          | 0.568143         |

*BERT with a Simple NN using TensorFlow*

- QWK Score on Train Set: 0.7216489652760272
- QWK Score on Test Set: 0.5787502629917947

*Note: AdaBoost and Random Forest*

- **Resource Usage:** The laptop became unresponsive when running AdaBoost and Random Forest, so these models were abandoned due to excessive resource consumption.

### 4.1.3 Summary of Results and Recommendations

1. **Proceed with Combined Numerical and TF-IDF Features:**
   - **Reason:** <span style="color:red">The combination of numerical and TF-IDF features yields the best performance across models, particularly with XGBoost Regressor (QWK Score of 0.752054 on the test set).</span>  This indicates that the combination of different types of features provides a more comprehensive representation of the data, leading to better model performance.
   - Word2Vec might have performed worse than Numerical+TF-IDF because the context captured by Word2Vec embeddings may not have been sufficient to fully represent the nuanced criteria used by teachers to score essays, such as coherence, structure, and specific content, which are better captured by the combined numerical and TF-IDF features.

2. **Undersampled Scores Issue:**
   - **Observation:** Scores 1, 5, and 6, which are undersampled, are sometimes not predicted at all, even with stratified splitting. This indicates a need for additional training data to better represent these scores.
   - **Examples:**
     - Distinct predicted values on the test set for CatBoost Classifier: `[1, 2, 3, 4, 5]`
     - Distinct predicted values on the test set for Logistic Regression: `[2, 3, 4, 5]`

3. **Note on Train and Test Score Differences:**
   - **Reason for Differences:** The significant differences between training and test scores, especially with TF-IDF and Word2Vec features, suggest overfitting. The models are performing well on the training data but failing to generalize to the test data.
   - **Overfitting Indicators:** High training scores coupled with much lower test scores are classic signs of overfitting. This occurs when a model learns the training data too well, including noise and outliers, but fails to capture the underlying patterns applicable to the test data.

4. **Next Steps:**
   - **Address Overfitting:** Implement regularization techniques, use cross-validation, apply hyperparameter tuning to improve model generalizability.
   - **Further Feature Engineering:** Explore additional features or combinations that may improve model performance.
   - **Feature Selection for TF-IDF:** Consider reducing the number of TF-IDF features as part of hyperparameter tuning. This could help mitigate overfitting and improve model performance. Try different values for `max_features` or adjust `max_df` and `min_df` parameters to reduce the feature space.
   - **Model Ensemble:** Consider ensemble methods like stacking or blending to leverage the strengths of multiple models. However, these methods may involve increased computational cost, complexity, and longer training times. Therefore, it might be best to forgo this approach.

5. **Models to Proceed with:**
   - **Focus on High-Performing Models:** XGBoost Regressor and CatBoost Regressor are strong candidates due to their high QWK scores on both numerical and combined features. <span style="color:red">CatBoost Regressor might generalize better to unseen data given its slightly smaller gap between train and test scores.</span>
   - **Balanced Approach:** While we can proceed with the best-performing models, it's still beneficial to try different models with proper cross-validation and regularization to ensure robustness and generalizability.
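The train-test gaps referred to above can be computed directly from the regressor results in section 4.1.2.3 (combined numerical and TF-IDF features). The variable names are illustrative; the numbers are copied from that table.

```python
# (train QWK, test QWK) for the regressors on combined numerical and TF-IDF features.
combined_results = {
    "Linear Regression": (0.697232, 0.672009),
    "CatBoost Regressor": (0.870019, 0.737538),
    "XGBoost Regressor": (0.942193, 0.752054),
    "LightGBM Regressor": (0.970799, 0.738065),
}

# A larger gap between train and test QWK suggests stronger overfitting.
gaps = {
    model: round(train - test, 6)
    for model, (train, test) in combined_results.items()
}
best_on_test = max(combined_results, key=lambda model: combined_results[model][1])
```

XGBoost Regressor has the highest test score, while CatBoost Regressor has a smaller train-test gap (about 0.13 versus about 0.19).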


## 4.2. Experiment 2 (train models on expanded data set by essays with underrepresented scores 1,5,6)

Experiment 2 builds on Experiment 1 by improving the dataset. The original dataset of about 17,000 rows was expanded with additional essays sourced from the Internet, focusing on the underrepresented scores 1, 5, and 6. Consequently, all preprocessing steps, including cleaning, transforming, feature engineering, and model training, were re-executed. The key differences from Experiment 1 are outlined below.

### 4.2.1 Comparison of initial data, preprocessing and model training for Experiment 1 and Experiment 2

| Aspect                        | Experiment 1                                                            | Experiment 2                                                            |
|-------------------------------|-------------------------------------------------------------------------|-------------------------------------------------------------------------|
| **Initial EDA**              |                                                                         |                                                                         |
| Dataset Size                  | 17,307 entries, 3 columns                                               | 22,567 entries, 3 columns                                               |
| Score Distribution            | Imbalanced, scores 1, 5, and 6 underrepresented                         | Imbalanced, scores 1, 5, and 6 underrepresented                         |
| Score Distribution Plot       |                                                                         | see the plot below                                                      |
| **Duplicates Handling**       | Duplicates deleted                                                      | Duplicates retained                                                     |
| **Split Data**                |                                                                         |                                                                         |
| Training Set Size             | 13,845 entries, 3 columns                                               | 18,053 entries, 3 columns                                               |
| Testing Set Size              | 3,462 entries, 3 columns                                                | 4,514 entries, 3 columns                                                |
| Split Method                  | Stratified sampling                                                     | Stratified sampling                                                     |
| **Model Training**            |                                                                         |                                                                         |
| Models Trained                | Linear Regression, Random Forest Regressor, AdaBoost Regressor, CatBoost Regressor, XGBoost Regressor, LightGBM Regressor, BERT, Word2Vec | the same except BERT (BERT skipped due to high resource consumption and poor results) |
| Features Used                 | Numerical Features Only, TF-IDF Features Only, Combined Numerical and TF-IDF Features, Word2Vec Features Only | the same |

<img src="score_distribution_plot_exp_2.png" alt="Score Distribution" width="1200" height="900"/>

*The plot on the left shows the distribution of scores in the training dataset for Experiment 1, highlighting the imbalance, while the plot on the right compares the distribution of scores in the original dataset (df1) and the combined dataset (df_combined) for Experiment 2, demonstrating the improved representation of underrepresented scores 1, 5, and 6.*
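
Both experiments use an 80/20 stratified split. The following is a minimal sketch of such a split with scikit-learn, not the notebook code itself: the toy score list, the `essays` placeholder texts, and `random_state=42` are assumptions made for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the essay table: an imbalanced score column (hypothetical counts).
scores = [1] * 10 + [2] * 30 + [3] * 30 + [4] * 15 + [5] * 10 + [6] * 5
essays = [f"essay {i}" for i in range(len(scores))]

# stratify=scores keeps the share of every score the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    essays, scores, test_size=0.2, stratify=scores, random_state=42
)

print("train:", sorted(Counter(y_train).items()))
print("test: ", sorted(Counter(y_test).items()))
```

Without `stratify`, a rare score such as 6 could end up almost entirely in one of the two splits.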



### 4.2.2 Comparison of the results from Experiment 1 and Experiment 2

#### 4.2.2.1 Numerical Features Only

| Model                   | Experiment 1 Train | Experiment 1 Test | Experiment 2 Train | Experiment 2 Test |
|-------------------------|--------------------|-------------------|--------------------|-------------------|
| Linear Regression       | 0.438766           | 0.661431          | 0.438766           | 0.661431          |
| Random Forest Regressor | 0.719541           | 0.716102          | 0.719541           | 0.716102          |
| AdaBoost Regressor      | 0.716374           | 0.707997          | 0.716374           | 0.707997          |
| CatBoost Regressor      | 0.738748           | 0.714213          | 0.738748           | 0.714213          |
| XGBoost Regressor       | 0.760924           | 0.725878          | 0.760924           | 0.725878          |
| LightGBM Regressor      | 0.768217           | 0.705879          | 0.768217           | 0.705879          |
| CatBoost Classifier     | 0.738946           | 0.697195          | 0.840578           | 0.833498          |
| XGBoost Classifier      | 0.719882           | 0.690799          | 0.836412           | 0.828504          |
| LightGBM Classifier     | 0.910527           | 0.679942          | 0.942789           | 0.814261          |
| Random Forest Classifier| 0.674805           | 0.670565          | 0.808717           | 0.810941          |
| Logistic Regression     | 0.671035           | 0.670444          | 0.805838           | 0.812265          |
| AdaBoost Classifier     | 0.339885           | 0.335948          | 0.633158           | 0.634854          |

#### 4.2.2.2 TF-IDF Features Only

| Model                   | Experiment 1 Train | Experiment 1 Test | Experiment 2 Train | Experiment 2 Test |
|-------------------------|--------------------|-------------------|--------------------|-------------------|
| Linear Regression       | 0.650407           | 0.485852          | 0.650407           | 0.485852          |
| CatBoost Regressor      | 0.725470           | 0.583873          | 0.725470           | 0.583873          |
| XGBoost Regressor       | 0.848442           | 0.520738          | 0.848442           | 0.520738          |
| LightGBM Regressor      | 0.951018           | 0.533020          | 0.951018           | 0.533020          |
| Logistic Regression     | 0.663757           | 0.578316          | 0.793493           | 0.739610          |
| LightGBM Classifier     | 1.000000           | 0.513538          | 1.000000           | 0.714014          |
| CatBoost Classifier     | 0.788927           | 0.510062          | 0.833477           | 0.696338          |
| XGBoost Classifier      | 0.993334           | 0.506426          | 0.993676           | 0.712289          |

#### 4.2.2.3 Combined Numerical and TF-IDF Features

| Model                   | Experiment 1 Train | Experiment 1 Test | Experiment 2 Train   | Experiment 2 Test   |
|-------------------------|--------------------|-------------------|----------------------|---------------------|
| Linear Regression       | 0.697232           | 0.672009          | 0.758918             | 0.781659            |
| <span style="color:red">CatBoost Regressor</span>      | 0.870019           | 0.737538          | <span style="color:red">0.904795</span> | <span style="color:red">0.861679</span> |
| XGBoost Regressor       | 0.942193           | 0.752054          | 0.949607             | 0.864658            |
| LightGBM Regressor      | 0.970799           | 0.738065          | 0.961129             | 0.850613            |
| CatBoost Classifier     | 0.847677           | 0.728796          | 0.902166             | 0.851082            |
| XGBoost Classifier      | 0.995395           | 0.723560          | 0.995851             | 0.854455            |
| LightGBM Classifier     | 1.000000           | 0.718091          | 1.000000             | 0.849610            |



#### 4.2.2.4 Word2Vec Features Only

| Model                   | Experiment 1 Train | Experiment 1 Test | Experiment 2 Train | Experiment 2 Test |
|-------------------------|--------------------|-------------------|--------------------|-------------------|
| Linear Regression       | 0.355899           | 0.351317          | 0.355899           | 0.351317          |
| CatBoost Regressor      | 0.523580           | 0.373310          | 0.523580           | 0.373310          |
| XGBoost Regressor       | 0.673253           | 0.461169          | 0.673253           | 0.461169          |
| LightGBM Regressor      | 0.781056           | 0.364984          | 0.781056           | 0.364984          |
| XGBoost Classifier      | 0.887020           | 0.567956          | 0.887020           | 0.567956          |
| LightGBM Classifier     | 0.998553           | 0.563399          | 0.998553           | 0.563399          |
| CatBoost Classifier     | 0.701877           | 0.557159          | 0.701877           | 0.557159          |
| AdaBoost Classifier     | 0.426707           | 0.391434          | 0.426707           | 0.391434          |
| Random Forest Classifier| 0.267254           | 0.250111          | 0.267254           | 0.250111          |

*Word2Vec with a simple feedforward neural network using TensorFlow Library*

| Metric                  | Experiment 1       | Experiment 2       |
|-------------------------|--------------------|--------------------|
| QWK Score on Train Set  | 0.6412155326583289 | 0.6412155326583289 |
| QWK Score on Test Set   | 0.5535504981689063 | 0.5535504981689063 |

#### 4.2.2.5 BERT

| Model                   | Experiment 1 Train | Experiment 1 Test  | Experiment 2 Train | Experiment 2 Test |
|-------------------------|--------------------|--------------------|--------------------|-------------------|
| CatBoost Classifier     | 0.823684           | 0.679635           | skipped            | skipped           |
| XGBoost Classifier      | 0.975234           | 0.690236           | skipped            | skipped           |
| LightGBM Classifier     | 1.000000           | 0.683610           | skipped            | skipped           |
| Linear Regression       | 0.557390           | 0.506605           | skipped            | skipped           |
| CatBoost Regressor      | 0.756849           | 0.591912           | skipped            | skipped           |
| XGBoost Regressor       | 0.796409           | 0.596577           | skipped            | skipped           |
| LightGBM Regressor      | 0.932279           | 0.568143           | skipped            | skipped           |

*BERT with a simple feedforward neural network using TensorFlow Library*

| Metric                  | Experiment 1       | Experiment 2       |
|-------------------------|--------------------|--------------------|
| QWK Score on Train Set  | 0.7216489652760272 | skipped            |
| QWK Score on Test Set   | 0.5787502629917947 | skipped            |

### 4.2.3 Summary of Results and Recommendations

**Results Improvement**: Experiment 2 shows improved QWK scores across most models compared to Experiment 1, particularly for combined numerical and TF-IDF features.

**Train-Test Difference**: The gap between train and test QWK scores has generally narrowed in Experiment 2, indicating better generalization.

**Model to Proceed With**: The **CatBoost Regressor** with combined numerical and TF-IDF features demonstrates the best balance of high QWK scores and reduced overfitting.
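
The train-test gap behind this choice can be recomputed from the table in 4.2.2.3. The snippet below is only an arithmetic check on the reported regressor scores; the variable names are illustrative.

```python
# QWK scores for combined numerical + TF-IDF features, copied from section 4.2.2.3.
# Each entry: (exp1_train, exp1_test, exp2_train, exp2_test)
scores = {
    "CatBoost Regressor": (0.870019, 0.737538, 0.904795, 0.861679),
    "XGBoost Regressor": (0.942193, 0.752054, 0.949607, 0.864658),
    "LightGBM Regressor": (0.970799, 0.738065, 0.961129, 0.850613),
}

# Gap = train QWK minus test QWK; a smaller gap suggests less overfitting.
gaps = {}
for model, (tr1, te1, tr2, te2) in scores.items():
    gaps[model] = (round(tr1 - te1, 6), round(tr2 - te2, 6))
    print(f"{model:20s} gap exp1 = {gaps[model][0]:.3f}   gap exp2 = {gaps[model][1]:.3f}")

smallest_gap = min(gaps, key=lambda m: gaps[m][1])
print("smallest Experiment 2 gap:", smallest_gap)
```

XGBoost Regressor reaches a marginally higher test score, but CatBoost Regressor has the smallest gap in Experiment 2, which is the trade-off referred to above.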

**Next Steps**:
1. **TF-IDF Dimensionality Reduction**: Apply PCA to reduce the TF-IDF feature space while retaining most of the variance, which could help mitigate overfitting. Alternatively, shrink the vocabulary directly by trying different values for `max_features` or adjusting the `max_df` and `min_df` parameters.
2. **Hyperparameter Tuning**: Perform grid search or random search to fine-tune the CatBoost Regressor parameters.
3. **Further Address Overfitting**: Apply regularization techniques to narrow the remaining gap between train and test scores.
4. **Cross-Validation**: Implement cross-validation to ensure robustness and generalizability of the model.
5. **Model Evaluation**: Continuously evaluate the model on a separate validation set to monitor performance and adjust as needed.
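
As an illustration of the first step, the sketch below shows how `max_df`, `min_df`, and `max_features` shrink a TF-IDF vocabulary in scikit-learn. The four toy sentences and the chosen thresholds are assumptions for demonstration, not the values used in the notebooks.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

essays = [
    "the students wrote a short essay about school",
    "the essay argues that school uniforms help students",
    "driverless cars are the future of transport",
    "the author claims driverless cars are unsafe",
]

# Baseline: keep every term.
full = TfidfVectorizer().fit(essays)

# Restricted vocabulary:
#   max_df=0.6     drops terms that occur in more than 60% of the essays,
#   min_df=2       drops terms that occur in fewer than 2 essays,
#   max_features=5 keeps at most the 5 most frequent remaining terms.
small = TfidfVectorizer(max_df=0.6, min_df=2, max_features=5).fit(essays)

print(len(full.vocabulary_), "terms ->", len(small.vocabulary_), "terms")
print(sorted(small.vocabulary_))
```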

## 5. Dimensionality Reduction

PCA (Principal Component Analysis) was applied to reduce the number of TF-IDF features from 1300 to 1000, 700, and 500 components, which explain 80%, 75%, and 70% of the variance, respectively.
See Notebooks 15.1, 15.2, 15.3, 16, and 17 in Experiment 2 for detailed steps and analysis.
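
A minimal sketch of this step, assuming that PCA is fitted on the TF-IDF block only and that the numerical features are appended unchanged (which matches the feature counts in the table below). The random matrices are synthetic stand-ins, so the printed variance share has no relation to the 70% to 80% reported here.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 essays, 50 "TF-IDF" columns, 8 numerical features.
# (The real matrices have 1300 TF-IDF columns and 8 numerical columns.)
tfidf = rng.random((200, 50))
numerical = rng.random((200, 8))

# Reduce only the TF-IDF block; the numerical features are kept as they are.
pca = PCA(n_components=20, random_state=0)
tfidf_reduced = pca.fit_transform(tfidf)

# Share of the TF-IDF variance retained by the kept components.
explained = pca.explained_variance_ratio_.sum()

combined = np.hstack([numerical, tfidf_reduced])
print(f"explained variance: {explained:.2%}")
print("combined shape:", combined.shape)  # (200, 28)
```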

The following results were achieved using CatBoostRegressor (iterations=600, depth=4, learning_rate=0.1):

| Dataset                              | Count of PCA Components | Count of Features | QWK Score (Train) | QWK Score (Test) |
|--------------------------------------|-------------------------|-------------------|-------------------|------------------|
| combined_features_exp_2.csv          | 1300                    | 1308              | 0.904795          | 0.861679         |
| combined_features_exp_2_pca_1000.csv | 1000                    | 1008              | 0.906795          | 0.862143         |
| combined_features_exp_2_pca_700.csv  | 700                     | 708               | 0.906436          | 0.862702         |
| combined_features_exp_2_pca_500.csv  | 500                     | 508               | 0.902056          | 0.867482         |

PCA leaves the scores almost unchanged: with 500 components the test QWK rises slightly, from 0.8617 to 0.8675. A gap of about 0.035 between the train and test scores remains, indicating that some overfitting persists.


## 6. Hyperparameter tuning

After performing Randomized Search Cross-Validation with CatBoostRegressor on multiple datasets, the best hyperparameters were identified based on the Quadratic Weighted Kappa (QWK) score. 

**Datasets tried:**
- 'combined_features_exp_2.csv'
- 'combined_features_exp_2_pca_1000.csv'
- 'combined_features_exp_2_pca_700.csv'
- 'combined_features_exp_2_pca_500.csv'

**Parameters tried:**
- iterations: 300, 500, 600, 1000
- depth: 4, 6, 8
- learning_rate: 0.01, 0.05, 0.1
- l2_leaf_reg: 1, 3, 5, 7, 9
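
The search space above can be written down as follows. This is a sketch under assumptions, not the notebook code: the `qwk` helper maps regressor outputs to scores by rounding and clipping to 1..6 (the notebooks may do this differently), and `n_iter=20` and `random_state=42` are illustrative values.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Search space copied from the list above.
param_distributions = {
    "iterations": [300, 500, 600, 1000],
    "depth": [4, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1],
    "l2_leaf_reg": [1, 3, 5, 7, 9],
}


def qwk(y_true, y_pred):
    """QWK for a regressor: round to the nearest score and clip to 1..6 (assumption)."""
    y_pred = np.clip(np.rint(y_pred), 1, 6).astype(int)
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")


# Scorer object that RandomizedSearchCV accepts through its `scoring` argument.
qwk_scorer = make_scorer(qwk, greater_is_better=True)

# Size of the full grid versus the random subset a randomized search would try.
n_total = len(ParameterGrid(param_distributions))
candidates = list(ParameterSampler(param_distributions, n_iter=20, random_state=42))
print(n_total, "combinations in the full grid;", len(candidates), "sampled")
```

With CatBoost installed, the same dictionary and scorer can be passed to `RandomizedSearchCV(CatBoostRegressor(verbose=0), param_distributions, n_iter=20, scoring=qwk_scorer, cv=5)`; the `n_iter` and `cv` values in that call are again only examples.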

**The following parameters were chosen:**

- **Dataset:** combined_features_exp_2_pca_500.csv (508 features: 8 numerical features and 500 PCA components of the TF-IDF features, explaining 70% of the variance)

- **Params:** `{'learning_rate': 0.01, 'l2_leaf_reg': 7, 'iterations': 500, 'depth': 4}`

**Reasons for this choice:**

1. **Balanced Performance:** 
   - The selected model has a good balance between the train and test QWK scores, indicating it is not significantly overfitting. 
   
| Metric                | QWK Score         | Standard Deviation |
|-----------------------|-------------------|--------------------|
| Mean CV QWK Score     | 0.832616553       | 0.003274856        |
| Train QWK Score       | <span style="color:red">0.837511829</span>       | 0.001020374        |
| Test QWK Score        | <span style="color:red">0.836941581</span>       | 0.004895276        |

2. **Generalization:** 
   - Among models with similar performance, we prefer those with fewer iterations, because they are less likely to overfit and tend to generalize better to unseen data.

3. **Performance Improvement:** 
   - The chosen configuration reaches a test QWK of 0.837, which is clearly better than the result achieved with the same model and feature set in Experiment 1 (train 0.870, <span style="color:red">test 0.738</span>).

By choosing these parameters, we aim to achieve a model that performs well on new data while maintaining stability and consistency.


## 7. Predict on Unseen Data

Predictions were made on the unseen data set (`test_split.csv`), which was put aside at the very beginning and contains 20% of the oversampled data set. The data was transformed in the same way as the `train_split` data set: the pre-trained scaler, TF-IDF vectorizer, PCA, and model were applied.
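
The important detail in this step is that every preprocessing object is fitted on the training split only and then merely applied to the unseen data. The sketch below illustrates that pattern on toy data. Everything in it is hypothetical: the texts, the two numerical columns, the helper `build_features`, and `Ridge`, which is used as a lightweight stand-in for the tuned CatBoostRegressor.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

train_texts = ["good clear essay", "weak essay with errors",
               "clear strong argument", "many errors here"]
train_num = np.array([[120.0, 2.0], [80.0, 9.0], [150.0, 1.0], [60.0, 12.0]])
train_scores = np.array([4, 2, 5, 1])

# Fit every preprocessing step on the training split only.
scaler = StandardScaler().fit(train_num)
vectorizer = TfidfVectorizer().fit(train_texts)
pca = PCA(n_components=2, random_state=0).fit(
    vectorizer.transform(train_texts).toarray()
)


def build_features(texts, num):
    """Apply the already fitted steps: transform() only, never fit_transform()."""
    tfidf_reduced = pca.transform(vectorizer.transform(texts).toarray())
    return np.hstack([scaler.transform(num), tfidf_reduced])


# Ridge is a stand-in for the tuned CatBoostRegressor (assumption).
model = Ridge().fit(build_features(train_texts, train_num), train_scores)

# Unseen data goes through exactly the same fitted objects.
test_texts = ["clear essay with few errors"]
test_num = np.array([[110.0, 3.0]])
test_features = build_features(test_texts, test_num)
raw = model.predict(test_features)
pred = np.clip(np.rint(raw), 1, 6).astype(int)  # map to a 1..6 score (assumption)
print(pred)
```

Refitting the scaler, vectorizer, or PCA on the test split would leak information from the unseen data and change the meaning of the feature columns.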

**Results:**
- **Quadratic Weighted Kappa Score:** 0.8359

- **Confusion Matrix** (rows: actual score, columns: predicted score):

|   | 1   | 2    | 3    | 4    | 5   | 6  |
|---|-----|------|------|------|-----|----|
| 1 | 1   | 204  | 44   | 18   | 0   | 0  |
| 2 | 0   | 1055 | 241  | 44   | 2   | 0  |
| 3 | 0   | 258  | 706  | 270  | 22  | 0  |
| 4 | 0   | 4    | 188  | 461  | 132 | 0  |
| 5 | 0   | 0    | 1    | 90   | 591 | 4  |
| 6 | 0   | 0    | 0    | 0    | 160 | 18 |

- **Detailed Analysis** (misclassifications only):

| Actual | Predicted | Count |
|--------|-----------|-------|
| 1      | 2         | 204   |
| 1      | 3         | 44    |
| 1      | 4         | 18    |
| 1      | 5         | 0     |
| 1      | 6         | 0     |
| 2      | 1         | 0     |
| 2      | 3         | 241   |
| 2      | 4         | 44    |
| 2      | 5         | 2     |
| 2      | 6         | 0     |
| 3      | 1         | 0     |
| 3      | 2         | 258   |
| 3      | 4         | 270   |
| 3      | 5         | 22    |
| 3      | 6         | 0     |
| 4      | 1         | 0     |
| 4      | 2         | 4     |
| 4      | 3         | 188   |
| 4      | 5         | 132   |
| 4      | 6         | 0     |
| 5      | 1         | 0     |
| 5      | 2         | 0     |
| 5      | 3         | 1     |
| 5      | 4         | 90    |
| 5      | 6         | 4     |
| 6      | 1         | 0     |
| 6      | 2         | 0     |
| 6      | 3         | 0     |
| 6      | 4         | 0     |
| 6      | 5         | 160   |

The detailed analysis provides insights into specific misclassifications. Notable points include:

- Overall, 2,832 of 4,514 instances (about 63%) are classified correctly. Classes 5 and 2 are recognized best (about 86% and 79% correct, respectively).
- Almost all instances of class 1 are misclassified, mostly as class 2 (204) and class 3 (44); only 1 of 267 is predicted correctly.
- Instances of class 3 are often misclassified as the neighboring classes 2 (258) and 4 (270).
- Class 6 is rarely predicted: 160 of 178 instances of class 6 are predicted as class 5.
- Most errors are off by one class (1,547 of 1,682); larger errors are comparatively rare, which is why the QWK score stays high.
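
The observations above can be checked directly from the confusion matrix. The following sketch uses plain Python and the standard QWK definition with weights `(i - j)^2`; it recomputes the accuracy, the per-class recall, and the QWK score from the counts in the table.

```python
# Confusion matrix from the table above: rows = actual score, columns = predicted score (1..6).
cm = [
    [1, 204, 44, 18, 0, 0],
    [0, 1055, 241, 44, 2, 0],
    [0, 258, 706, 270, 22, 0],
    [0, 4, 188, 461, 132, 0],
    [0, 0, 1, 90, 591, 4],
    [0, 0, 0, 0, 160, 18],
]

k = len(cm)
total = sum(sum(row) for row in cm)
row_tot = [sum(row) for row in cm]
col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]

# Overall accuracy and per-class recall (share of each actual score predicted correctly).
correct = sum(cm[i][i] for i in range(k))
recall = {i + 1: cm[i][i] / row_tot[i] for i in range(k)}

# Quadratic weighted kappa: 1 - sum(w * O) / sum(w * E), with w = (i - j)^2 and
# E the counts expected if actual and predicted scores were independent.
observed = sum((i - j) ** 2 * cm[i][j] for i in range(k) for j in range(k))
expected = sum((i - j) ** 2 * row_tot[i] * col_tot[j] / total
               for i in range(k) for j in range(k))
qwk = 1 - observed / expected

print(f"accuracy: {correct}/{total} = {correct / total:.3f}")
for score, r in recall.items():
    print(f"class {score}: recall {r:.3f}")
print(f"QWK: {qwk:.4f}")  # 0.8359, matching the reported score
```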

**Conclusion:**
<span style="color:red">The QWK score on the unseen data set is 0.8359</span>, which is consistent with the QWK scores observed during cross-validation, training, and testing in the hyperparameter tuning step (0.833 to 0.838). This indicates that the model has generalized well to the unseen data.
