The fine-tuning process below uses randomized search for hyperparameter tuning, cross-validation to check robustness, and L2 regularization (`l2_leaf_reg`) to limit overfitting.

The purpose of this code is to:
- Evaluate each hyperparameter set sampled by RandomizedSearchCV.
- Fit a new model with each hyperparameter set on the full training data.
- Evaluate each model on the test set and record its QWK (quadratic weighted kappa) score.

## Content
- Hyperparameter Tuning with Randomized Search
- Hyperparameter Tuning Results
- Conclusion

## Hyperparameter Tuning with Randomized Search

In [36]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score, make_scorer
from catboost import CatBoostRegressor
from sklearn.model_selection import RandomizedSearchCV
import time

# List of datasets to process
datasets = [
    'combined_features_exp_2.csv',
    'combined_features_exp_2_pca_1000.csv',
    'combined_features_exp_2_pca_700.csv',
    'combined_features_exp_2_pca_500.csv'
]

results = []

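# Map continuous regression outputs to integer score classes.
# Assumes target_classes are consecutive integers starting at 1 (here 1..6).
def discretize_predictions(predictions, target_classes):
    # Bin edges sit halfway between adjacent classes: 0.5, 1.5, ..., 6.5
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5, num=len(target_classes) + 1)
    discretized = np.digitize(predictions, bins) - 1
    # Clamp out-of-range predictions to the lowest or highest class
    discretized = np.clip(discretized, 0, len(target_classes) - 1)
    return discretized + 1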

# Custom scorer
def qwk_scorer(y_true, y_pred):
    target_classes = np.sort(np.unique(y_true))
    y_pred_discretized = discretize_predictions(y_pred, target_classes)
    return cohen_kappa_score(y_true, y_pred_discretized, weights='quadratic')

# Define the parameter grid for RandomizedSearchCV
param_dist = {
    'iterations': [300, 500, 600, 1000],
    'depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'l2_leaf_reg': [1, 3, 5, 7, 9]
}

# Process each dataset
for dataset in datasets:
    # Load the datasets
    combined_features_df = pd.read_csv(dataset)
    df_transformed = pd.read_csv('transformed_data_exp_2.csv')

    print(f"Working on Split, Train, Validate for {dataset}")
    start_time = time.time()

    # Split Data
    X = combined_features_df
    y = df_transformed['score']
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

    # Check the distribution of the target classes in the training data
    print("Distribution of target classes in the training data:")
    print(y_train.value_counts())

    # Check the distribution of the target classes in the test data
    print("Distribution of target classes in the test data:")
    print(y_test.value_counts())

    # Initialize the CatBoost Regressor model
    model = CatBoostRegressor(random_seed=42, silent=True)

    # Perform Randomized Search
    random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10, scoring=make_scorer(qwk_scorer), cv=3, random_state=42, verbose=1, return_train_score=True)

    target_classes = np.sort(np.unique(y))

    print(f"Performing Randomized Search for {dataset}...")
    start_time = time.time()
    random_search.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for Randomized Search on {dataset}: {end_time - start_time} seconds")

    # Evaluate each parameter set found during the random search
    for i in range(len(random_search.cv_results_['params'])):
        # Instantiate a new model with the current parameters
        model = CatBoostRegressor(**random_search.cv_results_['params'][i], random_seed=42, silent=True)
        model.fit(X_train, y_train)

        # Predict on the test set
        y_test_pred = model.predict(X_test)
        y_test_pred_discretized = discretize_predictions(y_test_pred, target_classes)
        qwk_test_score = cohen_kappa_score(y_test, y_test_pred_discretized, weights='quadratic')

        results.append({
            'Dataset': dataset,
            'Params': random_search.cv_results_['params'][i],
            'Mean CV QWK Score': random_search.cv_results_['mean_test_score'][i],
            'Std CV QWK Score': random_search.cv_results_['std_test_score'][i],
            'Mean Train QWK Score': random_search.cv_results_['mean_train_score'][i],
            'Std Train QWK Score': random_search.cv_results_['std_train_score'][i],
            'QWK Score (Test)': qwk_test_score
        })

results_df = pd.DataFrame(results)
print(results_df)


Working on Split, Train, Validate for combined_features_exp_2.csv
Distribution of target classes in the training data:
score
2    4294
3    4017
4    2513
5    2194
1     854
6     568
Name: count, dtype: int64
Distribution of target classes in the test data:
score
2    1074
3    1005
4     628
5     548
1     214
6     142
Name: count, dtype: int64
Performing Randomized Search for combined_features_exp_2.csv...
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Elapsed time for Randomized Search on combined_features_exp_2.csv: 1469.1450400352478 seconds
Working on Split, Train, Validate for combined_features_exp_2_pca_1000.csv
Distribution of target classes in the training data:
score
2    4294
3    4017
4    2513
5    2194
1     854
6     568
Name: count, dtype: int64
Distribution of target classes in the test data:
score
2    1074
3    1005
4     628
5     548
1     214
6     142
Name: count, dtype: int64
Performing Randomized Search for combined_features_exp_2_pca_1000.cs

In [37]:
results_df = pd.DataFrame(results)
print(results_df)

                                 Dataset  \
0            combined_features_exp_2.csv   
1            combined_features_exp_2.csv   
2            combined_features_exp_2.csv   
3            combined_features_exp_2.csv   
4            combined_features_exp_2.csv   
5            combined_features_exp_2.csv   
6            combined_features_exp_2.csv   
7            combined_features_exp_2.csv   
8            combined_features_exp_2.csv   
9            combined_features_exp_2.csv   
10  combined_features_exp_2_pca_1000.csv   
11  combined_features_exp_2_pca_1000.csv   
12  combined_features_exp_2_pca_1000.csv   
13  combined_features_exp_2_pca_1000.csv   
14  combined_features_exp_2_pca_1000.csv   
15  combined_features_exp_2_pca_1000.csv   
16  combined_features_exp_2_pca_1000.csv   
17  combined_features_exp_2_pca_1000.csv   
18  combined_features_exp_2_pca_1000.csv   
19  combined_features_exp_2_pca_1000.csv   
20   combined_features_exp_2_pca_700.csv   
21   combined_features_exp_2_pca

In [48]:
results_df.to_csv('results_hyperparameter_tuning_exp_2.csv', index=False)
print('File exported')

File exported
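
The scoring path used above (`discretize_predictions` followed by `cohen_kappa_score(..., weights='quadratic')`) can be illustrated with a small standard-library sketch. It assumes the score classes are the consecutive integers 1 to 6, as in the class distributions printed above. The function names `discretize` and `quadratic_weighted_kappa` and the sample values are hypothetical and are not part of the notebook, which relies on NumPy and scikit-learn for these steps.

```python
import math


def discretize(predictions, n_classes=6):
    """Round each prediction to the nearest class in 1..n_classes.

    Halves round up and out-of-range values are clamped, which matches
    bin edges at 0.5, 1.5, ..., n_classes + 0.5.
    """
    return [min(max(math.floor(p + 0.5), 1), n_classes) for p in predictions]


def quadratic_weighted_kappa(y_true, y_pred, n_classes=6):
    """QWK = 1 - sum(w * observed) / sum(w * expected), with w_ij = (i - j)^2.

    Assumes labels are integers in 1..n_classes and that the labels are not
    all identical (otherwise the denominator is zero).
    """
    n = len(y_true)
    # Observed confusion matrix: rows are true classes, columns are predictions
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t - 1][p - 1] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2  # disagreement cost grows quadratically
            weighted_observed += weight * observed[i][j]
            # Expected count if true and predicted labels were independent
            weighted_expected += weight * true_hist[i] * pred_hist[j] / n
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical raw regressor outputs and their true scores
raw_predictions = [0.2, 1.5, 2.49, 3.7, 5.1, 7.3]
true_scores = [1, 2, 3, 4, 5, 6]

labels = discretize(raw_predictions)
qwk = quadratic_weighted_kappa(true_scores, labels)
print(labels, round(qwk, 4))
```

Because the weights grow with the squared distance between classes, predicting a 2 for a true 3 is penalized far less than predicting a 6, which is why QWK suits ordinal targets such as essay scores.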


## Hyperparameter Tuning Results

| Dataset                            | Params                                                                       | Mean CV QWK Score | Std CV QWK Score | Mean Train QWK Score | Std Train QWK Score | QWK Score (Test) | Mean Train - Test QWK | Mean Train - Mean CV QWK |
|------------------------------------|-----------------------------------------------------------------------------|-------------------|------------------|----------------------|---------------------|-------------------|-----------------------|-----------------------|
| combined_features_exp_2.csv        | {'learning_rate': 0.05, 'l2_leaf_reg': 3, 'iterations': 500, 'depth': 4}     | 0.856785316       | 0.002313883      | 0.890393826          | 0.001004777         | 0.862944872       | 0.027448954           | 0.03360851            |
| combined_features_exp_2.csv        | {'learning_rate': 0.01, 'l2_leaf_reg': 9, 'iterations': 600, 'depth': 4}     | 0.836155371       | 0.002945146      | 0.842439295          | 0.001422746         | 0.839618041       | 0.002821254           | 0.006283924           |
| combined_features_exp_2.csv        | {'learning_rate': 0.01, 'l2_leaf_reg': 3, 'iterations': 600, 'depth': 8}     | 0.842026364       | 0.004105418      | 0.869182568          | 0.000449432         | 0.849394721       | 0.019787847           | 0.027156204           |
| combined_features_exp_2.csv        | {'learning_rate': 0.01, 'l2_leaf_reg': 3, 'iterations': 500, 'depth': 6}     | 0.837544298       | 0.003376512      | 0.848248694          | 0.000969295         | 0.842222122       | 0.006026572           | 0.010704396           |
| combined_features_exp_2.csv        | {'learning_rate': 0.05, 'l2_leaf_reg': 7, 'iterations': 500, 'depth': 8}     | 0.856820014       | 0.000800015      | 0.945462504          | 0.001891942         | 0.863310033       | 0.082152471           | 0.08864249            |
| combined_features_exp_2.csv        | {'learning_rate': 0.01, 'l2_leaf_reg': 1, 'iterations': 500, 'depth': 4}     | 0.83316856        | 0.002523337      | 0.838370204          | 0.001419475         | 0.835363111       | 0.003007094           | 0.005201645           |
| combined_features_exp_2.csv        | {'learning_rate': 0.01, 'l2_leaf_reg': 7, 'iterations': 500, 'depth': 4}     | 0.832376472       | 0.002343155      | 0.836947592          | 0.00119573          | 0.834471069       | 0.002476523           | 0.00457112            |
| combined_features_exp_2.csv        | {'learning_rate': 0.1, 'l2_leaf_reg': 5, 'iterations': 300, 'depth': 6}      | 0.859615523       | 0.00234149       | 0.928651932          | 0.001188534         | 0.867254191       | 0.061397742           | 0.06903641            |
| combined_features_exp_2.csv        | {'learning_rate': 0.1, 'l2_leaf_reg': 5, 'iterations': 1000, 'depth': 6}     | 0.860744079       | 0.002020475      | 0.998536951          | 0.00012059          | 0.870449464       | 0.128087488           | 0.137792872           |
| combined_features_exp_2.csv        | {'learning_rate': 0.05, 'l2_leaf_reg': 9, 'iterations': 1000, 'depth': 6}    | 0.862169775       | 0.002204736      | 0.960449369          | 0.001390719         | 0.86589132        | 0.094558049           | 0.098279594           |
| combined_features_exp_2_pca_1000.csv| {'learning_rate': 0.05, 'l2_leaf_reg': 3, 'iterations': 500, 'depth': 4}     | 0.859160156       | 0.002217783      | 0.890518882          | 0.001828432         | 0.862972038       | 0.027546844           | 0.031358726           |
| combined_features_exp_2_pca_1000.csv| {'learning_rate': 0.01, 'l2_leaf_reg': 9, 'iterations': 600, 'depth': 4}     | 0.836970488       | 0.002921298      | 0.842693743          | 0.001856109         | 0.840338335       | 0.002355408           | 0.005723256           |
| combined_features_exp_2_pca_1000.csv| {'learning_rate': 0.01, 'l2_leaf_reg': 3, 'iterations': 600, 'depth': 8}     | 0.842247775       | 0.003154087      | 0.869938321          | 0.000936115         | 0.849434081       | 0.02050424            | 0.027690545           |
| combined_features_exp_2_pca_1000.csv| {'learning_rate': 0.01, 'l2_leaf_reg': 3, 'iterations': 500, 'depth': 6}     | 0.838138884       | 0.003442996      | 0.848459666          | 0.000867957         | 0.841810504       | 0.006649162           | 0.010320782           |
| combined_features_exp_2_pca_1000.csv| {'learning_rate': 0.05, 'l2_leaf_reg': 7, 'iterations': 500, 'depth': 8}     | 0.858478784       | 0.003460752      | 0.946094698          | 0.000485199         | 0.862505367       | 0.083589331           | 0.087615915           |
| combined_features_exp_2_pca_1000.csv| {'learning_rate': 0.01, 'l2_leaf_reg': 1, 'iterations': 500, 'depth': 4}     | 0.833506151       | 0.003518263      | 0.838665827          | 0.001491746         | 0.836503986       | 0.002161841           | 0.005159675           |
| combined_features_exp_2_pca_1000.csv| {'learning_rate': 0.01, 'l2_leaf_reg': 7, 'iterations': 500, 'depth': 4}     | 0.833148593       | 0.00287556       | 0.837147701          | 0.001311759         | 0.835976385       | 0.001171315           | 0.003999107           |
| combined_features_exp_2_pca_1000.csv| {'learning_rate': 0.1, 'l2_leaf_reg': 5, 'iterations': 300, 'depth': 6}      | 0.859590139       | 0.001540672      | 0.926364175          | 0.001107256         | 0.865086468       | 0.061277707           | 0.066774036           |
| combined_features_exp_2_pca_1000.csv| {'learning_rate': 0.1, 'l2_leaf_reg': 5, 'iterations': 1000, 'depth': 6}     | 0.862131176       | 0.002602016      | 0.998205616          | 0.000229146         | 0.86912991        | 0.129075706           | 0.13607444            |
| combined_features_exp_2_pca_1000.csv| {'learning_rate': 0.05, 'l2_leaf_reg': 9, 'iterations': 1000, 'depth': 6}    | 0.862212895       | 0.002900367      | 0.959139894          | 0.000662427         | 0.868818305       | 0.090321589           | 0.096926999           |
| combined_features_exp_2_pca_700.csv | {'learning_rate': 0.05, 'l2_leaf_reg': 3, 'iterations': 500, 'depth': 4}     | 0.858854258       | 0.001931357      | 0.889828213          | 0.00055954          | 0.86191427        | 0.027913943           | 0.030973955           |
| combined_features_exp_2_pca_700.csv | {'learning_rate': 0.01, 'l2_leaf_reg': 9, 'iterations': 600, 'depth': 4}     | 0.837709483       | 0.004040692      | 0.842798618          | 0.001410994         | 0.840913769       | 0.001884848           | 0.005089135           |
| combined_features_exp_2_pca_700.csv | {'learning_rate': 0.01, 'l2_leaf_reg': 3, 'iterations': 600, 'depth': 8}     | 0.845353516       | 0.003565102      | 0.869857577          | 0.00079851          | 0.851001292       | 0.018856285           | 0.024504061           |
| combined_features_exp_2_pca_700.csv | {'learning_rate': 0.01, 'l2_leaf_reg': 3, 'iterations': 500, 'depth': 6}     | 0.840372791       | 0.004308134      | 0.849619663          | 0.000852309         | 0.84419019        | 0.005429474           | 0.009246873           |
| combined_features_exp_2_pca_700.csv | {'learning_rate': 0.05, 'l2_leaf_reg': 7, 'iterations': 500, 'depth': 8}     | 0.859698792       | 0.000681382      | 0.9448475            | 0.000676832         | 0.864220086       | 0.080627414           | 0.085148707           |
| combined_features_exp_2_pca_700.csv | {'learning_rate': 0.01, 'l2_leaf_reg': 1, 'iterations': 500, 'depth': 4}     | 0.833763453       | 0.003467285      | 0.839154003          | 0.001293811         | 0.835471943       | 0.00368206            | 0.005390551           |
| combined_features_exp_2_pca_700.csv | {'learning_rate': 0.01, 'l2_leaf_reg': 7, 'iterations': 500, 'depth': 4}     | 0.832611666       | 0.00336891       | 0.837428077          | 0.00150747          | 0.83591           | 0.001518077           | 0.004816411           |
| combined_features_exp_2_pca_700.csv | {'learning_rate': 0.1, 'l2_leaf_reg': 5, 'iterations': 300, 'depth': 6}      | 0.860832137       | 0.001543695      | 0.92562342           | 0.00069879          | 0.865411548       | 0.060211872           | 0.064791283           |
| combined_features_exp_2_pca_700.csv | {'learning_rate': 0.1, 'l2_leaf_reg': 5, 'iterations': 1000, 'depth': 6}     | 0.86214159        | 0.002622933      | 0.997385187          | 7.42E-05            | 0.867234767       | 0.13015042            | 0.135243598           |
| combined_features_exp_2_pca_700.csv | {'learning_rate': 0.05, 'l2_leaf_reg': 9, 'iterations': 1000, 'depth': 6}    | 0.864421173       | 0.001447394      | 0.957499661          | 0.001348913         | 0.867610764       | 0.089888897           | 0.093078488           |
| combined_features_exp_2_pca_500.csv | {'learning_rate': 0.05, 'l2_leaf_reg': 3, 'iterations': 500, 'depth': 4}     | 0.85979636        | 0.002051874      | 0.889430401          | 0.001704951         | 0.865081395       | 0.024349005           | 0.02963404            |
| combined_features_exp_2_pca_500.csv | {'learning_rate': 0.01, 'l2_leaf_reg': 9, 'iterations': 600, 'depth': 4}     | 0.837850217       | 0.003536244      | 0.843456471          | 0.001645281         | 0.840161407       | 0.003295064           | 0.005606254           |
| combined_features_exp_2_pca_500.csv | {'learning_rate': 0.01, 'l2_leaf_reg': 3, 'iterations': 600, 'depth': 8}     | 0.84748457        | 0.003306987      | 0.871741671          | 0.000619883         | 0.852632204       | 0.019109467           | 0.024257101           |
| combined_features_exp_2_pca_500.csv | {'learning_rate': 0.01, 'l2_leaf_reg': 3, 'iterations': 500, 'depth': 6}     | 0.840527963       | 0.003120887      | 0.850188657          | 0.000979343         | 0.840891835       | 0.009296822           | 0.009660693           |
| combined_features_exp_2_pca_500.csv | {'learning_rate': 0.05, 'l2_leaf_reg': 7, 'iterations': 500, 'depth': 8}     | 0.860333984       | 0.002416314      | 0.944709026          | 0.00112634          | 0.867929119       | 0.076779907           | 0.084375043           |
| combined_features_exp_2_pca_500.csv | {'learning_rate': 0.01, 'l2_leaf_reg': 1, 'iterations': 500, 'depth': 4}     | 0.833728862       | 0.003564335      | 0.838947538          | 0.001261587         | 0.835217338       | 0.003730201           | 0.005218676           |
| combined_features_exp_2_pca_500.csv | {'learning_rate': 0.01, 'l2_leaf_reg': 7, 'iterations': 500, 'depth': 4}     | 0.832616553       | 0.003274856      | 0.837511829          | 0.001020374         | 0.836941581       | 0.000570248           | 0.004895276           |
| combined_features_exp_2_pca_500.csv | {'learning_rate': 0.1, 'l2_leaf_reg': 5, 'iterations': 300, 'depth': 6}      | 0.859468314       | 0.001741257      | 0.92553654           | 0.000256544         | 0.870563859       | 0.054972682           | 0.066068226           |
| combined_features_exp_2_pca_500.csv | {'learning_rate': 0.1, 'l2_leaf_reg': 5, 'iterations': 1000, 'depth': 6}     | 0.861536723       | 0.001818931      | 0.99654072           | 0.000393156         | 0.871573062       | 0.124967658           | 0.135003997           |
| combined_features_exp_2_pca_500.csv | {'learning_rate': 0.05, 'l2_leaf_reg': 9, 'iterations': 1000, 'depth': 6}    | 0.86442626        | 0.001268183      | 0.955630536          | 0.000268897         | 0.870655122       | 0.084975415           | 0.091204276           |
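
The last two columns of the table are differences against the mean train score, so they can be recomputed and used to rank configurations by how much they overfit. The sketch below does this for four of the `combined_features_exp_2_pca_500.csv` rows, with values copied from the table. The dictionary keys and variable names are illustrative assumptions, not names from the notebook.

```python
# Four pca_500 rows copied from the table above
rows = [
    {"params": {"learning_rate": 0.05, "l2_leaf_reg": 3, "iterations": 500, "depth": 4},
     "cv": 0.85979636, "train": 0.889430401, "test": 0.865081395},
    {"params": {"learning_rate": 0.01, "l2_leaf_reg": 7, "iterations": 500, "depth": 4},
     "cv": 0.832616553, "train": 0.837511829, "test": 0.836941581},
    {"params": {"learning_rate": 0.1, "l2_leaf_reg": 5, "iterations": 1000, "depth": 6},
     "cv": 0.861536723, "train": 0.99654072, "test": 0.871573062},
    {"params": {"learning_rate": 0.05, "l2_leaf_reg": 9, "iterations": 1000, "depth": 6},
     "cv": 0.86442626, "train": 0.955630536, "test": 0.870655122},
]

# Gap between train and held-out scores: a large gap suggests overfitting
for row in rows:
    row["train_test_gap"] = row["train"] - row["test"]
    row["train_cv_gap"] = row["train"] - row["cv"]

# Rank from least to most overfit, and find the highest test score
by_gap = sorted(rows, key=lambda r: r["train_test_gap"])
best_test = max(rows, key=lambda r: r["test"])

for row in by_gap:
    print(f'{row["train_test_gap"]:.6f}  test={row["test"]:.6f}  {row["params"]}')
```

In this subset, the configuration with the smallest train-test gap scores about 0.035 QWK lower on the test set than the configuration with the highest test score, so the ranking makes the trade-off between stability and raw score explicit.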


## Conclusion

From the hyperparameter tuning results, we chose the following parameters for the final model:  
**Dataset:** combined_features_exp_2_pca_500.csv (508 features: 8 numerical and 500 TF-IDF, PCA explaining 70% of variance)

**Params:** `{'learning_rate': 0.01, 'l2_leaf_reg': 7, 'iterations': 500, 'depth': 4}`.

**Reasons for this choice:**

1. **Balanced Performance:** 
   - The selected model's train and test QWK scores are nearly identical (0.8375 vs. 0.8369, a gap of about 0.0006), which indicates little or no overfitting.
   
| Metric                | QWK Score         | Standard Deviation |
|-----------------------|-------------------|--------------------|
| Mean CV QWK Score     | 0.832616553       | 0.003274856        |
| Mean Train QWK Score  | 0.837511829       | 0.001020374        |
| Test QWK Score        | 0.836941581       | n/a (single run)   |

2. **Generalization:** 
   - Among models with similar performance, we prefer those with fewer iterations because they are less likely to overfit and tend to generalize better to unseen data.

3. **Performance Improvement:** 
   - The chosen configuration performs well and improves on the results achieved in Experiment 1.

By choosing these parameters, we aim to achieve a model that performs well on new data while maintaining stability and consistency.
