### Essay Scoring Task Workflow

<img src="essay_scoring_task_workflow.png" alt="Essay Scoring Task Workflow" width="1200" height="900"/>

### Experiments Overview

Experiment 1 and Experiment 2 were trained using the same features and models. 

- **Experiment 1:** Utilized the initial dataset provided on the Kaggle page. However, the Quadratic Weighted Kappa (QWK) score was relatively low.
- **Experiment 2:** Built upon Experiment 1 by enhancing the dataset. The original dataset of 17,000 rows was expanded by incorporating additional essays sourced from the Internet, particularly focusing on underrepresented scores of 1, 5, and 6.
- **Idea about the topic split (not implemented):** An additional idea involved splitting the entire dataset into several subsets based on topics identified through specialized techniques. The plan was to predict scores for each topic separately and then implement an ensemble model. Initial analyses and topic-based data splitting were performed. While some topics yielded good prediction results, one or two topics did not perform well. Given the satisfactory results achieved in Experiment 2, it was decided to discontinue this approach. Note: The topic-splitting approach is not presented in the repository.

### Data Split
The 22,567 entries were split into `train_split` and `test_split` datasets in an 80/20 proportion. The `test_split` dataset was set aside as unseen data.
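
The split described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual code: the column names `text` and `score`, the toy data, the `random_state`, and the use of stratification are all assumptions, since the source only states the 80/20 proportion and the split names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the essay table; the real column names may differ.
essays = pd.DataFrame({
    "text": [f"essay {i}" for i in range(100)],
    "score": [1 + i % 6 for i in range(100)],
})

# Hold out 20% as unseen data. Stratifying on the score keeps the score
# distribution similar in both parts (an assumption, not stated in the source).
train_split, test_split = train_test_split(
    essays, test_size=0.2, stratify=essays["score"], random_state=42
)

print(len(train_split), len(test_split))  # 80 20
```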

### Features
- The **Numeric features** box in the workflow covers the extraction and use of numerical features from the dataset, such as reading time and the mistakes distribution ratio.
- The **TF-IDF features** box covers the extraction and use of TF-IDF features, which weight each word by how often it appears in an essay relative to how common it is across all essays.
- The **Combined numeric and TF-IDF features** box joins both feature sets so that the model can draw on the strengths of each.
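
A minimal sketch of how the three feature sets could be built and combined. The two numeric features are simplified stand-ins for `sentence_count` and `comma_count`; the way they are computed here, the toy essays, and the default `TfidfVectorizer` settings are assumptions rather than the project's actual implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

essays = [
    "Cars are useful. However, they pollute the air.",
    "I think school should start later, because students need sleep.",
    "Technology helps people. It also distracts them, sadly.",
]


def numeric_features(text: str) -> list[float]:
    """Simplified stand-ins for two numeric features named in this README."""
    sentence_count = sum(text.count(p) for p in ".!?")
    comma_count = text.count(",")
    return [float(sentence_count), float(comma_count)]


numeric = np.array([numeric_features(t) for t in essays])

# TF-IDF weights each term by its frequency in an essay, discounted by how
# many essays contain it.
tfidf = TfidfVectorizer().fit_transform(essays)

# Combined matrix: numeric columns first, then the TF-IDF columns.
combined = hstack([csr_matrix(numeric), tfidf]).tocsr()
print(numeric.shape, tfidf.shape, combined.shape)
```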

### Models Overview

- **CatBoost Regressor** showed the best result overall. 

- **XGBoost Regressor** also demonstrated high performance, but the difference between train and test results was larger compared to CatBoost Regressor, indicating potential overfitting.


- **Word2Vec** did not perform well. This is likely because the specific words used in the essays carry more significance in the scoring process by teachers, which is effectively captured by TF-IDF. Word2Vec focuses on the semantic meaning and relationships between words, which may not be as critical for this particular task.

- **BERT** did not yield satisfactory results either. One contributing factor could be that only 70.50% of the texts were within BERT's token limit. Furthermore, all texts with a score of 6 were truncated, and the majority of texts with a score of 5 were truncated. This truncation could lead to a loss of important information, negatively impacting the model's performance.

- **Neural Networks:** A simple feedforward neural network built with the TensorFlow library was combined with Word2Vec and BERT. However, the model took too long to run and did not show promising results.

Based on these observations, it was decided to proceed with CatBoost Regressor.

### Dimensionality Reduction
- Of the 48 numerical features created, 8 remained after relevance analysis and correlation analysis.
- 500 TF-IDF features were left after applying PCA (Principal Component Analysis).
- Training therefore used 508 features in total.

Remaining numerical features:
- `reading_time`
- `mistakes_dist_ratio`
- `polysyllabcount`
- `sentence_count`
- `difficult_words`
- `comma_count`
- `transitional_phrases_c`
- `text_dist_words_ratio`
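
The PCA step and the final feature matrix can be sketched as below. The data is synthetic and the sizes are scaled down (10 components instead of 500) so the example runs quickly; variable names and the random seed are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 essays, 50 dense TF-IDF columns, 8 numeric columns.
# (The project kept 500 PCA components; 10 are used here for speed.)
tfidf_dense = rng.random((200, 50))
numeric = rng.random((200, 8))

# Project the TF-IDF block onto its leading principal components.
pca = PCA(n_components=10, random_state=0)
tfidf_reduced = pca.fit_transform(tfidf_dense)

# Share of the TF-IDF variance retained by the kept components.
explained = float(pca.explained_variance_ratio_.sum())

# Final training matrix: selected numeric features plus PCA components.
features = np.hstack([numeric, tfidf_reduced])
print(features.shape, round(explained, 3))
```

When evaluating on held-out data, the PCA should be fitted on the training split only and then applied to the test split with `pca.transform`, so that no information from the test essays leaks into the projection.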

### Metric
The Quadratic Weighted Kappa (QWK) score was the required metric for evaluating the performance of the essay scoring models.
- This metric is particularly suitable for this task because it measures the agreement between two sets of scores while considering the ordinal nature of the data.
- QWK penalizes larger discrepancies more heavily than smaller ones. This is important for score prediction because an incorrect prediction of 1 vs. 3 is worse than 1 vs. 2.
- QWK corrects for agreement that would occur by chance, so a model cannot score well simply by predicting the most frequent scores. This matters for a fair evaluation here because the score distribution is imbalanced.
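
The distance-dependent penalty can be seen on a toy example using scikit-learn's `cohen_kappa_score` with `weights="quadratic"`. The score vectors below are made up for illustration only.

```python
from sklearn.metrics import cohen_kappa_score

y_true = [1, 2, 3, 4, 5, 6]
pred_off_by_one = [2, 2, 3, 4, 5, 6]  # first essay scored 2 instead of 1
pred_off_by_two = [3, 2, 3, 4, 5, 6]  # first essay scored 3 instead of 1

# weights="quadratic" makes the penalty grow with the squared score distance.
qwk_near = cohen_kappa_score(y_true, pred_off_by_one, weights="quadratic")
qwk_far = cohen_kappa_score(y_true, pred_off_by_two, weights="quadratic")

print(qwk_near > qwk_far)  # True
```

Both predictions contain a single mistake, yet the one that misses by two score points receives the lower QWK.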
  
### Hyperparameter Tuning
After performing Randomized Search Cross-Validation with CatBoostRegressor on multiple datasets, the best hyperparameters were identified based on the Quadratic Weighted Kappa (QWK) score. 

The following parameters were chosen:

- Dataset: combined_features_exp_2_pca_500.csv (508 features: 8 numerical and 500 TF-IDF, PCA explaining 70% of variance)

- Params: `{'learning_rate': 0.01, 'l2_leaf_reg': 7, 'iterations': 500, 'depth': 4}`
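
A hedged sketch of how such a search could be set up. Several details are assumptions and not taken from the source: the candidate values in `param_distributions` (only the winning combination is reported here), the number of sampled candidates and folds, and the conversion of the regressor's continuous output to integer scores by rounding and clipping. The helper names `qwk_for_regressor` and `build_search` are hypothetical.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV


def qwk_for_regressor(y_true, y_pred, min_score=1, max_score=6):
    """QWK after rounding continuous predictions to valid integer scores."""
    rounded = np.clip(np.rint(y_pred), min_score, max_score).astype(int)
    return cohen_kappa_score(
        np.asarray(y_true).astype(int), rounded, weights="quadratic"
    )


def build_search(n_iter=20, cv=5, seed=42):
    """Randomized search over an illustrative CatBoost parameter space."""
    # Imported here so the scorer above can be used without CatBoost installed.
    from catboost import CatBoostRegressor

    # Candidate values are assumptions chosen for illustration.
    param_distributions = {
        "learning_rate": [0.01, 0.03, 0.1],
        "l2_leaf_reg": [1, 3, 5, 7],
        "iterations": [500, 1000, 1500],
        "depth": [4, 6, 8],
    }
    return RandomizedSearchCV(
        CatBoostRegressor(verbose=0, random_seed=seed),
        param_distributions=param_distributions,
        n_iter=n_iter,
        scoring=make_scorer(qwk_for_regressor),
        cv=cv,
        random_state=seed,
    )
```

With a feature matrix `X_train` and score vector `y_train` available, calling `build_search().fit(X_train, y_train)` would run the search, and `best_params_` on the fitted object would hold the selected combination.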

The selected model has a good balance between the train and test QWK scores, indicating it is not significantly overfitting. 
   
| Metric                | QWK Score         | Standard Deviation |
|-----------------------|-------------------|--------------------|
| Mean CV QWK Score     | 0.832616553       | 0.003274856        |
| Train QWK Score       | <span style="color:red">0.837511829</span>       | 0.001020374        |
| Test QWK Score        | <span style="color:red">0.836941581</span>       | 0.004895276        |

Among models with similar performance, we preferred those with fewer iterations, because such models are less likely to overfit and tend to generalize better to unseen data.

The chosen configuration performs well and clearly improves on Experiment 1 (train 0.870, <span style="color:red">test 0.737</span>), where the gap between the train and test scores was much larger.

### Prediction Results on Unseen Data
The unseen data is the `test_split` dataset that was set aside at the beginning of the task.

- **Quadratic Weighted Kappa Score:** <span style="color:red">0.8359</span>

- **Confusion Matrix:**

|   | 1   | 2    | 3    | 4    | 5   | 6  |
|---|-----|------|------|------|-----|----|
| 1 | 1   | 204  | 44   | 18   | 0   | 0  |
| 2 | 0   | 1055 | 241  | 44   | 2   | 0  |
| 3 | 0   | 258  | 706  | 270  | 22  | 0  |
| 4 | 0   | 4    | 188  | 461  | 132 | 0  |
| 5 | 0   | 0    | 1    | 90   | 591 | 4  |
| 6 | 0   | 0    | 0    | 0    | 160 | 18 |

The QWK score on the unseen data set is 0.8359, which is consistent with the QWK scores observed during the training and validation process. This indicates that the model has generalized well to the unseen data.
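
The reported score can be cross-checked directly from the confusion matrix above. The sketch below is a plain-Python implementation of the standard QWK formula (one minus the ratio of observed to chance-expected quadratic disagreement); the function name is hypothetical. Because QWK is symmetric, the result does not depend on whether rows hold the actual or the predicted scores.

```python
def quadratic_weighted_kappa(matrix):
    """QWK from a square confusion matrix of ordinal score counts."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    observed = 0.0  # weighted disagreement actually observed
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2
            observed += weight * matrix[i][j]
            expected += weight * row_totals[i] * col_totals[j] / total
    return 1.0 - observed / expected


confusion = [
    [1, 204, 44, 18, 0, 0],
    [0, 1055, 241, 44, 2, 0],
    [0, 258, 706, 270, 22, 0],
    [0, 4, 188, 461, 132, 0],
    [0, 0, 1, 90, 591, 4],
    [0, 0, 0, 0, 160, 18],
]

n_essays = sum(sum(row) for row in confusion)
qwk = quadratic_weighted_kappa(confusion)
print(n_essays)       # 4514
print(round(qwk, 4))  # 0.8359
```

The matrix contains 4,514 essays, which matches 20% of the 22,567 entries, and it reproduces the reported QWK of 0.8359.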