## Contents
- [Summary of Steps](#Summary-of-Steps)
- [Imports](#Imports)
- [Classification Models](#Classification-Models)
  - [Numerical Features](#Numerical-Features)
  - [TF-IDF Features](#TF-IDF-Features)
  - [Combined Numerical and TF-IDF Features](#Combined-Numerical-and-TF-IDF-Features)
- [Regression Models](#Regression-Models)
  - [Numerical Features](#Numerical-Features)
  - [TF-IDF Features](#TF-IDF-Features)
  - [Combined Numerical and TF-IDF Features](#Combined-Numerical-and-TF-IDF-Features)
- [Word2Vec Features](#Word2Vec-Features)
  - [Option 1: Traditional Machine Learning Classification Models](#Option-1:-Traditional-Machine-Learning-Classification-Models)
  - [Option 2: Traditional Machine Learning Regression Models](#Option-2:-Traditional-Machine-Learning-Regression-Models)
  - [Option 3: Using Neural Networks](#Option-3:-Using-Neural-Networks)
- [BERT](#BERT)


## Summary of Steps

1. **Load Data:** Load the numerical-features CSV (`numeric_features_added_exp_2.csv` in the cells below).
2. **Define Features:** Define the numerical features and extract them from the DataFrame.
3. **Split Data:** Split the data into training and test sets.
4. **Train Model:** Train the `CatBoostRegressor` on the training data.
5. **Predict:** Make predictions on the test data.
6. **Discretize Predictions:** Discretize both the predictions and the actual test labels.
7. **Evaluate:** Compute the Quadratic Weighted Kappa Score to evaluate the model.
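
Step 7 relies on Quadratic Weighted Kappa, which the cells below compute with `cohen_kappa_score(..., weights='quadratic')`. As a rough illustration of what that metric measures, here is a dependency-free sketch. The function name `quadratic_weighted_kappa` and its `labels` argument are assumptions made for this example and are not part of the notebook.

```python
def quadratic_weighted_kappa(y_true, y_pred, labels):
    """Quadratic Weighted Kappa for two rating sequences over an ordered label set.

    Illustrative sketch only; `labels` lists every class in ascending order.
    The score is undefined (division by zero) when all ratings fall in one class.
    """
    n = len(labels)
    index = {label: i for i, label in enumerate(labels)}
    total = len(y_true)

    # Observed matrix: observed[i][j] counts items with true class i and predicted class j.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[index[t]][index[p]] += 1

    # Row and column totals define the agreement expected by chance.
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            # The penalty grows with the squared distance between the two classes.
            weight = (i - j) ** 2 / (n - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * true_hist[i] * pred_hist[j] / total

    return 1.0 - weighted_observed / weighted_expected


# Perfect agreement scores 1; fully reversed ratings score close to -1.
print(quadratic_weighted_kappa([1, 2, 3], [1, 2, 3], labels=[1, 2, 3]))
print(quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], labels=[1, 2, 3]))
```

Because the penalty is quadratic in the class distance, predicting a 5 for a true 6 costs far less than predicting a 1, which is why the metric suits ordinal scores.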


Note: The LightGBM models are commented out because they generate verbose logs. LightGBM was evaluated earlier and then disabled because it was not the best-performing model.

## Imports

In [33]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from catboost import CatBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import time

In [35]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import time

## Classification Models

### Numerical Features

In [39]:
# Load the dataset with the engineered numerical features
df = pd.read_csv('numeric_features_added_exp_2.csv')

# Define numerical features
numerical_features = [
    'reading_time',
    'mistakes_dist_ratio',
    'polysyllabcount',
    'sentence_count',
    'difficult_words',
    'comma_count',
    'transitional_phrases_c',
    'text_dist_words_ratio'
]

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = df[numerical_features]
y = df['score'] - 1  # Adjust class labels to start from 0
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest Classifier': RandomForestClassifier(n_estimators=600, max_depth=4, random_state=42),
    'AdaBoost Classifier': AdaBoostClassifier(n_estimators=100, random_state=42),
    'CatBoost Classifier': CatBoostClassifier(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Classifier': XGBClassifier(objective='multi:softmax', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Classifier': LGBMClassifier(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on training set for {name}: {np.unique(y_train_pred)}")
    kappa_train_score = cohen_kappa_score(y_train + 1, y_train_pred, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on test set for {name}: {np.unique(y_test_pred)}")
    kappa_test_score = cohen_kappa_score(y_test + 1, y_test_pred, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})




Working on Split, Train, Validate
Distribution of target classes in the training data:
score
1    4294
2    4017
3    2513
4    2194
0     854
5     568
Name: count, dtype: int64
Distribution of target classes in the test data:
score
1    1074
2    1005
3     628
4     548
0     214
5     142
Name: count, dtype: int64
Working on Logistic Regression...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Elapsed time for Logistic Regression: 6.297773838043213 seconds
Distinct predicted values on training set for Logistic Regression: [1 2 3 4 5 6]
Distinct predicted values on test set for Logistic Regression: [1 2 3 4 5 6]
Working on Random Forest Classifier...
Elapsed time for Random Forest Classifier: 11.210044860839844 seconds
Distinct predicted values on training set for Random Forest Classifier: [2 3 4 5 6]
Distinct predicted values on test set for Random Forest Classifier: [2 3 4 5 6]
Working on AdaBoost Classifier...
Elapsed time for AdaBoost Classifier: 2.1219630241394043 seconds
Distinct predicted values on training set for AdaBoost Classifier: [1 2 3 4 5 6]
Distinct predicted values on test set for AdaBoost Classifier: [1 2 3 4 5 6]
Working on CatBoost Classifier...
Elapsed time for CatBoost Classifier: 4.910132169723511 seconds
Distinct predicted values on training set for CatBoost Classifier: [1 2 3 4 5 6]
Distinct predicted values on test set for CatBoost Classifier: [1 2 3

In [40]:
# Display results in a DataFrame
results_df = pd.DataFrame(results).sort_values(by='QWK Score (Test)', ascending=False)
print(results_df)

                      Model  QWK Score (Train)  QWK Score (Test)
3       CatBoost Classifier           0.840578          0.833498
4        XGBoost Classifier           0.836412          0.828504
0       Logistic Regression           0.805838          0.812265
1  Random Forest Classifier           0.808717          0.810941
2       AdaBoost Classifier           0.633158          0.634854


### TF-IDF Features

In [41]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import time

# Load the datasets
df_transformed = pd.read_csv('transformed_data_exp_2.csv')
df_tfidf = pd.read_csv('tfidf_features_exp_2.csv')

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = df_tfidf
y = df_transformed['score'] - 1  # Adjust class labels to start from 0
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    #'Random Forest Classifier': RandomForestClassifier(n_estimators=600, max_depth=4, random_state=42),
    #'AdaBoost Classifier': AdaBoostClassifier(n_estimators=100, random_state=42),
    'CatBoost Classifier': CatBoostClassifier(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Classifier': XGBClassifier(objective='multi:softmax', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Classifier': LGBMClassifier(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on training set for {name}: {np.unique(y_train_pred)}")
    kappa_train_score = cohen_kappa_score(y_train + 1, y_train_pred, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on test set for {name}: {np.unique(y_test_pred)}")
    kappa_test_score = cohen_kappa_score(y_test + 1, y_test_pred, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})

Working on Split, Train, Validate
Distribution of target classes in the training data:
score
1    4294
2    4017
3    2513
4    2194
0     854
5     568
Name: count, dtype: int64
Distribution of target classes in the test data:
score
1    1074
2    1005
3     628
4     548
0     214
5     142
Name: count, dtype: int64
Working on Logistic Regression...
Elapsed time for Logistic Regression: 5.57770562171936 seconds
Distinct predicted values on training set for Logistic Regression: [1 2 3 4 5 6]
Distinct predicted values on test set for Logistic Regression: [1 2 3 4 5 6]
Working on CatBoost Classifier...
Elapsed time for CatBoost Classifier: 112.0259006023407 seconds
Distinct predicted values on training set for CatBoost Classifier: [1 2 3 4 5 6]
Distinct predicted values on test set for CatBoost Classifier: [1 2 3 4 5 6]
Working on XGBoost Classifier...
Elapsed time for XGBoost Classifier: 219.37614035606384 seconds
Distinct predicted values on training set for XGBoost Classifier: [1 2 3

In [42]:
# Display results in a DataFrame
results_df = pd.DataFrame(results).sort_values(by='QWK Score (Test)', ascending=False)
print(results_df)

                 Model  QWK Score (Train)  QWK Score (Test)
0  Logistic Regression           0.793493          0.739610
2   XGBoost Classifier           0.993676          0.712289
1  CatBoost Classifier           0.833477          0.696338


### Combined Numerical and TF-IDF Features

In [43]:
# Load the datasets
combined_features_df = pd.read_csv('combined_features_exp_2.csv')
df_transformed = pd.read_csv('transformed_data_exp_2.csv')

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = combined_features_df
y = df_transformed['score'] - 1  # Adjust class labels to start from 0
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    #'Random Forest Classifier': RandomForestClassifier(n_estimators=600, max_depth=4, random_state=42),
    #'AdaBoost Classifier': AdaBoostClassifier(n_estimators=100, random_state=42),
    'CatBoost Classifier': CatBoostClassifier(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Classifier': XGBClassifier(objective='multi:softmax', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Classifier': LGBMClassifier(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on training set for {name}: {np.unique(y_train_pred)}")
    kappa_train_score = cohen_kappa_score(y_train + 1, y_train_pred, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on test set for {name}: {np.unique(y_test_pred)}")
    kappa_test_score = cohen_kappa_score(y_test + 1, y_test_pred, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})

Working on Split, Train, Validate
Distribution of target classes in the training data:
score
1    4294
2    4017
3    2513
4    2194
0     854
5     568
Name: count, dtype: int64
Distribution of target classes in the test data:
score
1    1074
2    1005
3     628
4     548
0     214
5     142
Name: count, dtype: int64
Working on CatBoost Classifier...
Elapsed time for CatBoost Classifier: 60.9344277381897 seconds
Distinct predicted values on training set for CatBoost Classifier: [1 2 3 4 5 6]
Distinct predicted values on test set for CatBoost Classifier: [1 2 3 4 5 6]
Working on XGBoost Classifier...
Elapsed time for XGBoost Classifier: 214.78234338760376 seconds
Distinct predicted values on training set for XGBoost Classifier: [1 2 3 4 5 6]
Distinct predicted values on test set for XGBoost Classifier: [1 2 3 4 5 6]


In [44]:
# Display results in a DataFrame
results_df = pd.DataFrame(results).sort_values(by='QWK Score (Test)', ascending=False)
print(results_df)

                 Model  QWK Score (Train)  QWK Score (Test)
1   XGBoost Classifier           0.995851          0.854455
0  CatBoost Classifier           0.902166          0.851082


## Regression Models

### Numerical Features

In [45]:
# Load the dataset with the engineered numerical features
df = pd.read_csv('numeric_features_added_exp_2.csv')

# Define numerical features
numerical_features = [
    'reading_time',
    'mistakes_dist_ratio',
    'polysyllabcount',
    'sentence_count',
    'difficult_words',
    'comma_count',
    'transitional_phrases_c',
    'text_dist_words_ratio'
]

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = df[numerical_features]
y = df['score']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest Regressor': RandomForestRegressor(n_estimators=600, max_depth=4, random_state=42),
    'AdaBoost Regressor': AdaBoostRegressor(n_estimators=100, random_state=42),
    'CatBoost Regressor': CatBoostRegressor(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Regressor': XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Regressor': LGBMRegressor(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

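# Map continuous predictions onto the discrete score classes.
# Bin edges sit halfway between consecutive classes (0.5, 1.5, ..., 6.5 for scores 1-6).
def discretize_predictions(predictions, target_classes):
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5, num=len(target_classes) + 1)
    discretized = np.digitize(predictions, bins) - 1
    # Clip out-of-range predictions into the first or last class
    discretized = np.clip(discretized, 0, len(target_classes) - 1)
    # Shift the zero-based bin index back to the 1-based score scale
    return discretized + 1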

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train)
    y_train_pred_scaled = (y_train_pred - y_train_pred.min()) / (y_train_pred.max() - y_train_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_train_discretized = discretize_predictions(y_train, target_classes)
    y_train_pred_discretized = discretize_predictions(y_train_pred_scaled, target_classes)
    kappa_train_score = cohen_kappa_score(y_train_discretized, y_train_pred_discretized, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test)
    y_test_pred_scaled = (y_test_pred - y_test_pred.min()) / (y_test_pred.max() - y_test_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_test_discretized = discretize_predictions(y_test, target_classes)
    y_test_pred_discretized = discretize_predictions(y_test_pred_scaled, target_classes)
    kappa_test_score = cohen_kappa_score(y_test_discretized, y_test_pred_discretized, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})


Working on Split, Train, Validate
Distribution of target classes in the training data:
score
2    4294
3    4017
4    2513
5    2194
1     854
6     568
Name: count, dtype: int64
Distribution of target classes in the test data:
score
2    1074
3    1005
4     628
5     548
1     214
6     142
Name: count, dtype: int64
Working on Linear Regression...
Elapsed time for Linear Regression: 0.014182329177856445 seconds
Working on Random Forest Regressor...
Elapsed time for Random Forest Regressor: 16.106038808822632 seconds
Working on AdaBoost Regressor...
Elapsed time for AdaBoost Regressor: 0.6758365631103516 seconds
Working on CatBoost Regressor...
Elapsed time for CatBoost Regressor: 1.6053357124328613 seconds
Working on XGBoost Regressor...
Elapsed time for XGBoost Regressor: 0.6350545883178711 seconds


In [46]:
# Show results
results_df = pd.DataFrame(results)
print(results_df)

                     Model  QWK Score (Train)  QWK Score (Test)
0        Linear Regression           0.500947          0.643906
1  Random Forest Regressor           0.800033          0.799370
2       AdaBoost Regressor           0.791092          0.792753
3       CatBoost Regressor           0.847720          0.832627
4        XGBoost Regressor           0.850914          0.834900
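
The regression cells rescale each batch of predictions with min-max scaling before binning them at half-integer edges. The sketch below is a dependency-free illustration of how that rescaling interacts with the binning. It assumes consecutive integer classes 1 to 6, and the names `round_to_class` and `min_max_rescale` are hypothetical helpers written for this example rather than functions from the notebook.

```python
import math


def round_to_class(predictions, lowest=1, highest=6):
    """Round half up to the nearest class, then clip into [lowest, highest]."""
    classes = []
    for value in predictions:
        nearest = math.floor(value + 0.5)  # 1.5 maps to 2, matching half-integer bin edges
        classes.append(min(max(nearest, lowest), highest))
    return classes


def min_max_rescale(predictions, lowest=1, highest=6):
    """Stretch a batch of predictions so its minimum and maximum hit the label range."""
    lo, hi = min(predictions), max(predictions)
    return [(p - lo) / (hi - lo) * (highest - lowest) + lowest for p in predictions]


raw = [2.4, 3.1, 4.6]
print(round_to_class(raw))                   # classes from the raw predictions
print(round_to_class(min_max_rescale(raw)))  # classes after batch-wise rescaling
```

With rescaling, the smallest prediction in a batch always lands in the lowest class and the largest in the highest, so the class assigned to one essay depends on which other essays are scored alongside it. Rounding the raw predictions avoids that dependence, at the cost of rarely reaching the extreme classes when the regressor's outputs are compressed toward the mean.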


### TF-IDF Features

In [47]:
# Load the datasets
df_transformed = pd.read_csv('transformed_data_exp_2.csv')
df_tfidf = pd.read_csv('tfidf_features_exp_2.csv')

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = df_tfidf
y = df_transformed['score']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    'Linear Regression': LinearRegression(),
    # 'Random Forest Regressor': RandomForestRegressor(n_estimators=600, max_depth=4, random_state=42),
    # 'AdaBoost Regressor': AdaBoostRegressor(n_estimators=100, random_state=42),
    'CatBoost Regressor': CatBoostRegressor(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Regressor': XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Regressor': LGBMRegressor(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

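# Map continuous predictions onto the discrete score classes.
# Bin edges sit halfway between consecutive classes (0.5, 1.5, ..., 6.5 for scores 1-6).
def discretize_predictions(predictions, target_classes):
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5, num=len(target_classes) + 1)
    discretized = np.digitize(predictions, bins) - 1
    # Clip out-of-range predictions into the first or last class
    discretized = np.clip(discretized, 0, len(target_classes) - 1)
    # Shift the zero-based bin index back to the 1-based score scale
    return discretized + 1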

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train)
    y_train_pred_scaled = (y_train_pred - y_train_pred.min()) / (y_train_pred.max() - y_train_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_train_discretized = discretize_predictions(y_train, target_classes)
    y_train_pred_discretized = discretize_predictions(y_train_pred_scaled, target_classes)
    kappa_train_score = cohen_kappa_score(y_train_discretized, y_train_pred_discretized, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test)
    y_test_pred_scaled = (y_test_pred - y_test_pred.min()) / (y_test_pred.max() - y_test_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_test_discretized = discretize_predictions(y_test, target_classes)
    y_test_pred_discretized = discretize_predictions(y_test_pred_scaled, target_classes)
    kappa_test_score = cohen_kappa_score(y_test_discretized, y_test_pred_discretized, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})


Working on Split, Train, Validate
Distribution of target classes in the training data:
score
2    4294
3    4017
4    2513
5    2194
1     854
6     568
Name: count, dtype: int64
Distribution of target classes in the test data:
score
2    1074
3    1005
4     628
5     548
1     214
6     142
Name: count, dtype: int64
Working on Linear Regression...
Elapsed time for Linear Regression: 2.2922353744506836 seconds
Working on CatBoost Regressor...
Elapsed time for CatBoost Regressor: 14.47586464881897 seconds
Working on XGBoost Regressor...
Elapsed time for XGBoost Regressor: 33.5309784412384 seconds


In [48]:
# Show results
results_df = pd.DataFrame(results)
print(results_df)

                Model  QWK Score (Train)  QWK Score (Test)
0   Linear Regression           0.690214          0.693764
1  CatBoost Regressor           0.807107          0.718420
2   XGBoost Regressor           0.901976          0.699006


### Combined Numerical and TF-IDF Features

In [51]:
# Load the datasets
combined_features_df = pd.read_csv('combined_features_exp_2.csv')
df_transformed = pd.read_csv('transformed_data_exp_2.csv')

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = combined_features_df
y = df_transformed['score']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    'Linear Regression': LinearRegression(),
    #'Random Forest Regressor': RandomForestRegressor(n_estimators=600, max_depth=4, random_state=42),
    #'AdaBoost Regressor': AdaBoostRegressor(n_estimators=100, random_state=42),
    'CatBoost Regressor': CatBoostRegressor(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Regressor': XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Regressor': LGBMRegressor(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

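# Map continuous predictions onto the discrete score classes.
# Bin edges sit halfway between consecutive classes (0.5, 1.5, ..., 6.5 for scores 1-6).
def discretize_predictions(predictions, target_classes):
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5, num=len(target_classes) + 1)
    discretized = np.digitize(predictions, bins) - 1
    # Clip out-of-range predictions into the first or last class
    discretized = np.clip(discretized, 0, len(target_classes) - 1)
    # Shift the zero-based bin index back to the 1-based score scale
    return discretized + 1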

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train)
    y_train_pred_scaled = (y_train_pred - y_train_pred.min()) / (y_train_pred.max() - y_train_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_train_discretized = discretize_predictions(y_train, target_classes)
    y_train_pred_discretized = discretize_predictions(y_train_pred_scaled, target_classes)
    kappa_train_score = cohen_kappa_score(y_train_discretized, y_train_pred_discretized, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test)
    y_test_pred_scaled = (y_test_pred - y_test_pred.min()) / (y_test_pred.max() - y_test_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_test_discretized = discretize_predictions(y_test, target_classes)
    y_test_pred_discretized = discretize_predictions(y_test_pred_scaled, target_classes)
    kappa_test_score = cohen_kappa_score(y_test_discretized, y_test_pred_discretized, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})


Working on Split, Train, Validate
Distribution of target classes in the training data:
score
2    4294
3    4017
4    2513
5    2194
1     854
6     568
Name: count, dtype: int64
Distribution of target classes in the test data:
score
2    1074
3    1005
4     628
5     548
1     214
6     142
Name: count, dtype: int64
Working on Linear Regression...
Elapsed time for Linear Regression: 2.24367618560791 seconds
Working on CatBoost Regressor...
Elapsed time for CatBoost Regressor: 14.668799638748169 seconds
Working on XGBoost Regressor...
Elapsed time for XGBoost Regressor: 34.756917238235474 seconds


In [52]:
# Show results
results_df = pd.DataFrame(results)
print(results_df)

                Model  QWK Score (Train)  QWK Score (Test)
0   Linear Regression           0.758918          0.781659
1  CatBoost Regressor           0.904795          0.861679
2   XGBoost Regressor           0.949607          0.864658


## Word2Vec Features

### Option 1: Traditional Machine Learning Classification Models

In [59]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import time
import ast

# Load the datasets
embeddings_df = pd.read_csv('word2vec_features_exp_2.csv')

# Function to convert space-separated string of numbers (with brackets) to a list of floats
def convert_to_list(embedding_str):
    # Remove the square brackets
    embedding_str = embedding_str.replace('[', '').replace(']', '')
    # Split the string by spaces and convert to list of floats
    return [float(num) for num in embedding_str.split()]

# Apply the function to the word2vec_embedding column
embeddings_df['word2vec_embedding'] = embeddings_df['word2vec_embedding'].apply(convert_to_list)

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = np.vstack(embeddings_df['word2vec_embedding'].values)
y = embeddings_df['score'] - 1  # Adjust class labels to start from 0
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    'Random Forest Classifier': RandomForestClassifier(n_estimators=600, max_depth=4, random_state=42),
    'AdaBoost Classifier': AdaBoostClassifier(n_estimators=100, random_state=42),
    'CatBoost Classifier': CatBoostClassifier(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Classifier': XGBClassifier(objective='multi:softmax', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Classifier': LGBMClassifier(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on training set for {name}: {np.unique(y_train_pred)}")
    kappa_train_score = cohen_kappa_score(y_train + 1, y_train_pred, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on test set for {name}: {np.unique(y_test_pred)}")
    kappa_test_score = cohen_kappa_score(y_test + 1, y_test_pred, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})


Working on Split, Train, Validate
Distribution of target classes in the training data:
score
1    4294
2    4017
3    2513
4    2194
0     854
5     568
Name: count, dtype: int64
Distribution of target classes in the test data:
score
1    1074
2    1005
3     628
4     548
0     214
5     142
Name: count, dtype: int64
Working on Random Forest Classifier...
Elapsed time for Random Forest Classifier: 303.306348323822 seconds
Distinct predicted values on training set for Random Forest Classifier: [2 3 5]
Distinct predicted values on test set for Random Forest Classifier: [2 3 5]
Working on AdaBoost Classifier...
Elapsed time for AdaBoost Classifier: 387.48888397216797 seconds
Distinct predicted values on training set for AdaBoost Classifier: [1 2 3 4 5 6]
Distinct predicted values on test set for AdaBoost Classifier: [1 2 3 4 5 6]
Working on CatBoost Classifier...
Elapsed time for CatBoost Classifier: 56.398645639419556 seconds
Distinct predicted values on training set for CatBoost Classi

In [60]:
# Display results in a DataFrame
results_df = pd.DataFrame(results).sort_values(by='QWK Score (Test)', ascending=False)
print(results_df)

                      Model  QWK Score (Train)  QWK Score (Test)
3        XGBoost Classifier           0.918547          0.733207
2       CatBoost Classifier           0.785205          0.706903
1       AdaBoost Classifier           0.643776          0.627829
0  Random Forest Classifier           0.542995          0.534292
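
The QWK scores in the table come from `cohen_kappa_score(..., weights='quadratic')`. As a rough illustration of what that metric rewards and penalizes, the sketch below recomputes it with NumPy only. The helper `quadratic_weighted_kappa` and the toy label lists are hypothetical and not part of the notebook; the intent is only to show that a prediction far from the true score costs more than a near miss.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, classes):
    """Hypothetical helper: quadratic weighted kappa from first principles."""
    classes = list(classes)
    index = {c: i for i, c in enumerate(classes)}
    n = len(classes)

    # Observed confusion matrix: rows are true classes, columns are predictions
    observed = np.zeros((n, n))
    for t, p in zip(y_true, y_pred):
        observed[index[t], index[p]] += 1

    # Expected counts under chance agreement: outer product of the marginals
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic penalty grows with the squared distance between classes
    i, j = np.indices((n, n))
    weights = (i - j) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

classes = [1, 2, 3, 4, 5, 6]
y_true = [1, 2, 3, 4, 5, 6]
k_near = quadratic_weighted_kappa(y_true, [1, 2, 3, 4, 5, 5], classes)  # one-step miss
k_far = quadratic_weighted_kappa(y_true, [1, 2, 3, 4, 5, 1], classes)   # five-step miss
print(k_near, k_far)
```

In this toy case a single one-step miss keeps the score close to 1, while a single five-step miss lowers it sharply, which is why QWK suits ordinal targets such as essay scores.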


#### Option 2: Traditional Machine Learning Regression Models

In [61]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from catboost import CatBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import time
import ast

# Load the datasets
embeddings_df = pd.read_csv('word2vec_features_exp_2.csv')

# Function to convert space-separated string of numbers (with brackets) to a list of floats
def convert_to_list(embedding_str):
    # Remove the square brackets
    embedding_str = embedding_str.replace('[', '').replace(']', '')
    # Split the string by spaces and convert to list of floats
    return [float(num) for num in embedding_str.split()]

# Apply the function to the word2vec_embedding column
embeddings_df['word2vec_embedding'] = embeddings_df['word2vec_embedding'].apply(convert_to_list)

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = np.vstack(embeddings_df['word2vec_embedding'].values)
y = embeddings_df['score']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    'Linear Regression': LinearRegression(),
    #'Random Forest Regressor': RandomForestRegressor(n_estimators=600, max_depth=4, random_state=42),
    #'AdaBoost Regressor': AdaBoostRegressor(n_estimators=100, random_state=42),
    'CatBoost Regressor': CatBoostRegressor(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Regressor': XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Regressor': LGBMRegressor(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

# Function to discretize predictions
def discretize_predictions(predictions, target_classes):
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5, num=len(target_classes) + 1)
    discretized = np.digitize(predictions, bins) - 1
    discretized = np.clip(discretized, 0, len(target_classes) - 1)
    return discretized + 1

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train)
    y_train_pred_scaled = (y_train_pred - y_train_pred.min()) / (y_train_pred.max() - y_train_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_train_discretized = discretize_predictions(y_train, target_classes)
    y_train_pred_discretized = discretize_predictions(y_train_pred_scaled, target_classes)
    kappa_train_score = cohen_kappa_score(y_train_discretized, y_train_pred_discretized, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test)
    y_test_pred_scaled = (y_test_pred - y_test_pred.min()) / (y_test_pred.max() - y_test_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_test_discretized = discretize_predictions(y_test, target_classes)
    y_test_pred_discretized = discretize_predictions(y_test_pred_scaled, target_classes)
    kappa_test_score = cohen_kappa_score(y_test_discretized, y_test_pred_discretized, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})


Working on Split, Train, Validate
Distribution of target classes in the training data:
score
2    4294
3    4017
4    2513
5    2194
1     854
6     568
Name: count, dtype: int64
Distribution of target classes in the test data:
score
2    1074
3    1005
4     628
5     548
1     214
6     142
Name: count, dtype: int64
Working on Linear Regression...
Elapsed time for Linear Regression: 4.015959739685059 seconds
Working on CatBoost Regressor...
Elapsed time for CatBoost Regressor: 23.43088173866272 seconds
Working on XGBoost Regressor...
Elapsed time for XGBoost Regressor: 31.195597887039185 seconds


In [62]:
# Show results
results_df = pd.DataFrame(results)
print(results_df)

                Model  QWK Score (Train)  QWK Score (Test)
0   Linear Regression           0.595021          0.622043
1  CatBoost Regressor           0.739214          0.678438
2   XGBoost Regressor           0.813884          0.685875


#### Option 3: Using Neural Networks

A simple feedforward neural network built with the TensorFlow library.

In [63]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import StandardScaler
import time  # needed for time.time() below

# Load the datasets
embeddings_df = pd.read_csv('word2vec_features_exp_2.csv')

# Function to convert space-separated string of numbers (with brackets) to a list of floats
def convert_to_list(embedding_str):
    # Remove the square brackets
    embedding_str = embedding_str.replace('[', '').replace(']', '')
    # Split the string by spaces and convert to list of floats
    return [float(num) for num in embedding_str.split()]

# Apply the function to the word2vec_embedding column
embeddings_df['word2vec_embedding'] = embeddings_df['word2vec_embedding'].apply(convert_to_list)

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = np.vstack(embeddings_df['word2vec_embedding'].values)
y = embeddings_df['score']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Dynamically determine target classes
target_classes = np.sort(np.unique(y))

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Standardize data for neural networks
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define the neural network model
model = Sequential()
model.add(Dense(128, input_dim=X_train_scaled.shape[1], activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='linear'))

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')

# Train the model
history = model.fit(X_train_scaled, y_train, epochs=50, batch_size=32, validation_data=(X_test_scaled, y_test), verbose=2)

# Predict
y_train_pred = model.predict(X_train_scaled).flatten()
y_test_pred = model.predict(X_test_scaled).flatten()

# Function to discretize predictions
def discretize_predictions(predictions, target_classes):
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5, num=len(target_classes) + 1)
    discretized = np.digitize(predictions, bins) - 1
    discretized = np.clip(discretized, 0, len(target_classes) - 1)
    return discretized + 1

# Define the QWK computation function
def compute_qwk(y_true, y_pred, target_classes):
    y_pred_discretized = discretize_predictions(y_pred, target_classes)
    return cohen_kappa_score(y_true, y_pred_discretized, weights='quadratic')

# Evaluate using QWK score
qwk_train = compute_qwk(y_train, y_train_pred, target_classes)
qwk_test = compute_qwk(y_test, y_test_pred, target_classes)

print(f"QWK Score on Train Set: {qwk_train}")
print(f"QWK Score on Test Set: {qwk_test}")


Working on Split, Train, Validate
Distribution of target classes in the training data:
score
2    4294
3    4017
4    2513
5    2194
1     854
6     568
Name: count, dtype: int64
Distribution of target classes in the test data:
score
2    1074
3    1005
4     628
5     548
1     214
6     142
Name: count, dtype: int64




Epoch 1/50
452/452 - 12s - 27ms/step - loss: 1.8031 - val_loss: 1.1005
Epoch 2/50
452/452 - 4s - 8ms/step - loss: 1.1389 - val_loss: 1.0990
Epoch 3/50
452/452 - 3s - 6ms/step - loss: 1.0602 - val_loss: 1.0839
Epoch 4/50
452/452 - 3s - 7ms/step - loss: 0.9994 - val_loss: 0.9974
Epoch 5/50
452/452 - 3s - 8ms/step - loss: 0.9648 - val_loss: 1.0961
Epoch 6/50
452/452 - 4s - 8ms/step - loss: 0.9273 - val_loss: 0.9902
Epoch 7/50
452/452 - 4s - 8ms/step - loss: 0.9125 - val_loss: 1.0044
Epoch 8/50
452/452 - 4s - 8ms/step - loss: 0.8817 - val_loss: 0.8196
Epoch 9/50
452/452 - 4s - 8ms/step - loss: 0.8582 - val_loss: 0.8983
Epoch 10/50
452/452 - 4s - 8ms/step - loss: 0.8417 - val_loss: 0.8315
Epoch 11/50
452/452 - 4s - 8ms/step - loss: 0.8402 - val_loss: 0.7980
Epoch 12/50
452/452 - 4s - 8ms/step - loss: 0.8243 - val_loss: 0.8603
Epoch 13/50
452/452 - 4s - 8ms/step - loss: 0.8143 - val_loss: 0.9262
Epoch 14/50
452/452 - 4s - 8ms/step - loss: 0.8028 - val_loss: 0.8140
Epoch 15/50
452/452 - 4s - 

## BERT

- **Run Time and Resource Usage:** The BERT model consumed substantial resources and took several hours to run, so this approach was abandoned as impractical.