## Contents
- [Summary of Steps](#Summary-of-Steps)
- [Imports](#Imports)
- [Classification Models](#Classification-Models)
  - [Numerical Features](#Numerical-Features)
  - [TF-IDF Features](#TF-IDF-Features)
  - [Combined Numerical and TF-IDF Features](#Combined-Numerical-and-TF-IDF-Features)
- [Regression Models](#Regression-Models)
  - [Numerical Features](#Numerical-Features)
  - [TF-IDF Features](#TF-IDF-Features)
  - [Combined Numerical and TF-IDF Features](#Combined-Numerical-and-TF-IDF-Features)
- [Word2Vec Features](#Word2Vec-Features)
  - [Option 1: Traditional Machine Learning Classification Models](#Option-1:-Traditional-Machine-Learning-Classification-Models)
  - [Option 2: Traditional Machine Learning Regression Models](#Option-2:-Traditional-Machine-Learning-Regression-Models)
  - [Option 3: Using Neural Networks](#Option-3:-Using-Neural-Networks)
- [BERT](#BERT)


## Summary of Steps

1. **Load Data:** Load the `numeric_features_added_v1.csv` file into a DataFrame.
2. **Define Features:** Define the numerical features and extract them from the DataFrame.
3. **Split Data:** Split the data into training and test sets.
4. **Train Model:** Train the `CatBoostRegressor` on the training data.
5. **Predict:** Make predictions on the test data.
6. **Discretize Predictions:** Discretize both the predictions and the actual test labels.
7. **Evaluate:** Compute the Quadratic Weighted Kappa Score to evaluate the model.
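To make step 7 concrete, here is a minimal pure-Python sketch of Quadratic Weighted Kappa written from its standard definition, QWK = 1 - sum(W * O) / sum(W * E). The notebook itself relies on `cohen_kappa_score(..., weights='quadratic')` from scikit-learn; the function name `quadratic_weighted_kappa` and the sample ratings below are illustrative assumptions, and the sketch does not guard against a zero denominator (all ratings identical).

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """QWK for integer ratings in [min_rating, max_rating] (illustrative helper)."""
    n = max_rating - min_rating + 1
    total = len(y_true)

    # Observed matrix O[i][j]: items with true rating i and predicted rating j.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    # Marginal histograms define the expected matrix E under independence.
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic penalty: grows with the squared distance between ratings.
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator


# Invented sample ratings on the 1 to 6 scale used in this notebook.
y_true = [1, 2, 3, 4, 5, 6]
y_near = [2, 2, 3, 4, 5, 5]  # two predictions are off by one
print(quadratic_weighted_kappa(y_true, y_true, 1, 6))
print(quadratic_weighted_kappa(y_true, y_near, 1, 6))
```

Because the penalty is quadratic, predictions that miss by one score cost far less than predictions that miss by several, which is why the metric suits ordinal essay scores.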


## Imports

In [60]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from catboost import CatBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import time

In [62]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import time

In [64]:
# Note: the LightGBM models were commented out after execution to avoid heavy logs

## Classification Models

### Numerical Features

In [66]:
# Load the dataset with the engineered numerical features (split into train/test below)
df = pd.read_csv('numeric_features_added_v1.csv')

# Define numerical features
numerical_features = [
    'reading_time',
    'mistakes_dist_ratio',
    'polysyllabcount',
    'sentence_count',
    'difficult_words',
    'comma_count',
    'transitional_phrases_c',
    'text_dist_words_ratio'
]

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = df[numerical_features]
y = df['score'] - 1  # Adjust class labels to start from 0
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest Classifier': RandomForestClassifier(n_estimators=600, max_depth=4, random_state=42),
    'AdaBoost Classifier': AdaBoostClassifier(n_estimators=100, random_state=42),
    'CatBoost Classifier': CatBoostClassifier(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Classifier': XGBClassifier(objective='multi:softmax', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Classifier': LGBMClassifier(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on training set for {name}: {np.unique(y_train_pred)}")
    kappa_train_score = cohen_kappa_score(y_train + 1, y_train_pred, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on test set for {name}: {np.unique(y_test_pred)}")
    kappa_test_score = cohen_kappa_score(y_test + 1, y_test_pred, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})




Working on Split, Train, Validate
Distribution of target classes in the training data:
score
2    4017
1    3022
3    2513
0     801
4     621
5     100
Name: count, dtype: int64
Distribution of target classes in the test data:
score
2    1005
1     756
3     628
0     200
4     155
5      25
Name: count, dtype: int64
Working on Logistic Regression...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Elapsed time for Logistic Regression: 8.929550170898438 seconds
Distinct predicted values on training set for Logistic Regression: [1 2 3 4 5 6]
Distinct predicted values on test set for Logistic Regression: [2 3 4 5]
Working on Random Forest Classifier...
Elapsed time for Random Forest Classifier: 15.649439573287964 seconds
Distinct predicted values on training set for Random Forest Classifier: [2 3 4 5]
Distinct predicted values on test set for Random Forest Classifier: [2 3 4 5]
Working on AdaBoost Classifier...
Elapsed time for AdaBoost Classifier: 4.318726539611816 seconds
Distinct predicted values on training set for AdaBoost Classifier: [1 2 3 4 5 6]
Distinct predicted values on test set for AdaBoost Classifier: [1 2 3 4 5 6]
Working on CatBoost Classifier...
Elapsed time for CatBoost Classifier: 6.367633104324341 seconds
Distinct predicted values on training set for CatBoost Classifier: [1 2 3 4 5 6]
Distinct predicted values on test set for CatBoost Classifier: [1 2 3 4 5 6]
W

In [67]:
# Display results in a DataFrame
results_df = pd.DataFrame(results).sort_values(by='QWK Score (Test)', ascending=False)
print(results_df)

                      Model  QWK Score (Train)  QWK Score (Test)
3       CatBoost Classifier           0.738946          0.697195
4        XGBoost Classifier           0.719882          0.690799
1  Random Forest Classifier           0.674805          0.670565
0       Logistic Regression           0.671035          0.670444
2       AdaBoost Classifier           0.339885          0.335948
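
The Logistic Regression run in the log above hit the iteration limit, and the scikit-learn warning suggests scaling the data. One way to follow that advice, sketched here on synthetic stand-in data rather than on the notebook's features, is to put a `StandardScaler` in front of the model with `make_pipeline`. The variable names `X_demo`, `y_demo`, and `scaled_logreg` are assumptions for illustration, and this pipeline was not used to produce the scores reported in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data: two features on very different scales (hypothetical values).
X_demo = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1000, 300)])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 1000 > 0).astype(int)

# The scaler is fitted on the data passed to fit() only, then applied
# automatically inside predict(), so a held-out split is never used for fitting.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_demo)
print(scaled_logreg.score(X_demo, y_demo))
```

In the notebook's loop, such a pipeline could replace the `'Logistic Regression'` entry of `models` without changing the rest of the code, since it exposes the same `fit` and `predict` methods.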


### TF-IDF Features

In [68]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import time

# Load the datasets
df_transformed = pd.read_csv('transformed_data_v1.csv')
df_tfidf = pd.read_csv('tfidf_features.csv')

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = df_tfidf
y = df_transformed['score'] - 1  # Adjust class labels to start from 0
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    #'Random Forest Classifier': RandomForestClassifier(n_estimators=600, max_depth=4, random_state=42),
    #'AdaBoost Classifier': AdaBoostClassifier(n_estimators=100, random_state=42),
    'CatBoost Classifier': CatBoostClassifier(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Classifier': XGBClassifier(objective='multi:softmax', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Classifier': LGBMClassifier(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on training set for {name}: {np.unique(y_train_pred)}")
    kappa_train_score = cohen_kappa_score(y_train + 1, y_train_pred, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on test set for {name}: {np.unique(y_test_pred)}")
    kappa_test_score = cohen_kappa_score(y_test + 1, y_test_pred, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})

Working on Split, Train, Validate
Distribution of target classes in the training data:
score
2    4017
1    3022
3    2513
0     801
4     621
5     100
Name: count, dtype: int64
Distribution of target classes in the test data:
score
2    1005
1     756
3     628
0     200
4     155
5      25
Name: count, dtype: int64
Working on Logistic Regression...
Elapsed time for Logistic Regression: 7.91997218132019 seconds
Distinct predicted values on training set for Logistic Regression: [1 2 3 4 5]
Distinct predicted values on test set for Logistic Regression: [1 2 3 4 5]
Working on CatBoost Classifier...
Elapsed time for CatBoost Classifier: 165.75451135635376 seconds
Distinct predicted values on training set for CatBoost Classifier: [1 2 3 4 5 6]
Distinct predicted values on test set for CatBoost Classifier: [1 2 3 4 5]
Working on XGBoost Classifier...
Elapsed time for XGBoost Classifier: 354.85085010528564 seconds
Distinct predicted values on training set for XGBoost Classifier: [1 2 3 4 5 

In [69]:
# Display results in a DataFrame
results_df = pd.DataFrame(results).sort_values(by='QWK Score (Test)', ascending=False)
print(results_df)

                 Model  QWK Score (Train)  QWK Score (Test)
0  Logistic Regression           0.663757          0.578316
1  CatBoost Classifier           0.788927          0.510062
2   XGBoost Classifier           0.993334          0.506426


### Combined Numerical and TF-IDF Features

In [70]:
# Load the datasets
combined_features_df = pd.read_csv('combined_features.csv')
df_transformed = pd.read_csv('transformed_data_v1.csv')

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = combined_features_df
y = df_transformed['score'] - 1  # Adjust class labels to start from 0
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    #'Random Forest Classifier': RandomForestClassifier(n_estimators=600, max_depth=4, random_state=42),
    #'AdaBoost Classifier': AdaBoostClassifier(n_estimators=100, random_state=42),
    'CatBoost Classifier': CatBoostClassifier(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Classifier': XGBClassifier(objective='multi:softmax', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Classifier': LGBMClassifier(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on training set for {name}: {np.unique(y_train_pred)}")
    kappa_train_score = cohen_kappa_score(y_train + 1, y_train_pred, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on test set for {name}: {np.unique(y_test_pred)}")
    kappa_test_score = cohen_kappa_score(y_test + 1, y_test_pred, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})

Working on Split, Train, Validate
Distribution of target classes in the training data:
score
2    4017
1    3022
3    2513
0     801
4     621
5     100
Name: count, dtype: int64
Distribution of target classes in the test data:
score
2    1005
1     756
3     628
0     200
4     155
5      25
Name: count, dtype: int64
Working on CatBoost Classifier...
Elapsed time for CatBoost Classifier: 54.24480748176575 seconds
Distinct predicted values on training set for CatBoost Classifier: [1 2 3 4 5 6]
Distinct predicted values on test set for CatBoost Classifier: [1 2 3 4 5 6]
Working on XGBoost Classifier...
Elapsed time for XGBoost Classifier: 165.84056162834167 seconds
Distinct predicted values on training set for XGBoost Classifier: [1 2 3 4 5 6]
Distinct predicted values on test set for XGBoost Classifier: [1 2 3 4 5 6]


In [71]:
# Display results in a DataFrame
results_df = pd.DataFrame(results).sort_values(by='QWK Score (Test)', ascending=False)
print(results_df)

                 Model  QWK Score (Train)  QWK Score (Test)
0  CatBoost Classifier           0.847677          0.728796
1   XGBoost Classifier           0.995395          0.723560


## Regression Models

### Numerical Features

In [72]:
# Load the dataset with the engineered numerical features (split into train/test below)
df = pd.read_csv('numeric_features_added_v1.csv')

# Define numerical features
numerical_features = [
    'reading_time',
    'mistakes_dist_ratio',
    'polysyllabcount',
    'sentence_count',
    'difficult_words',
    'comma_count',
    'transitional_phrases_c',
    'text_dist_words_ratio'
]

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = df[numerical_features]
y = df['score']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest Regressor': RandomForestRegressor(n_estimators=600, max_depth=4, random_state=42),
    'AdaBoost Regressor': AdaBoostRegressor(n_estimators=100, random_state=42),
    'CatBoost Regressor': CatBoostRegressor(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Regressor': XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Regressor': LGBMRegressor(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

# Function to discretize predictions
def discretize_predictions(predictions, target_classes):
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5, num=len(target_classes) + 1)
    discretized = np.digitize(predictions, bins) - 1
    discretized = np.clip(discretized, 0, len(target_classes) - 1)
    return discretized + 1

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train)
    y_train_pred_scaled = (y_train_pred - y_train_pred.min()) / (y_train_pred.max() - y_train_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_train_discretized = discretize_predictions(y_train, target_classes)
    y_train_pred_discretized = discretize_predictions(y_train_pred_scaled, target_classes)
    kappa_train_score = cohen_kappa_score(y_train_discretized, y_train_pred_discretized, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test)
    y_test_pred_scaled = (y_test_pred - y_test_pred.min()) / (y_test_pred.max() - y_test_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_test_discretized = discretize_predictions(y_test, target_classes)
    y_test_pred_discretized = discretize_predictions(y_test_pred_scaled, target_classes)
    kappa_test_score = cohen_kappa_score(y_test_discretized, y_test_pred_discretized, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})


Working on Split, Train, Validate
Distribution of target classes in the training data:
score
3    4017
2    3022
4    2513
1     801
5     621
6     100
Name: count, dtype: int64
Distribution of target classes in the test data:
score
3    1005
2     756
4     628
1     200
5     155
6      25
Name: count, dtype: int64
Working on Linear Regression...
Elapsed time for Linear Regression: 0.00824737548828125 seconds
Working on Random Forest Regressor...
Elapsed time for Random Forest Regressor: 9.106829643249512 seconds
Working on AdaBoost Regressor...
Elapsed time for AdaBoost Regressor: 0.24966645240783691 seconds
Working on CatBoost Regressor...
Elapsed time for CatBoost Regressor: 1.296612024307251 seconds
Working on XGBoost Regressor...
Elapsed time for XGBoost Regressor: 0.6050937175750732 seconds


In [73]:
# Show results
results_df = pd.DataFrame(results)
print(results_df)

                     Model  QWK Score (Train)  QWK Score (Test)
0        Linear Regression           0.438766          0.661431
1  Random Forest Regressor           0.719541          0.716102
2       AdaBoost Regressor           0.716374          0.707997
3       CatBoost Regressor           0.738748          0.714213
4        XGBoost Regressor           0.760924          0.725878


### TF-IDF Features

In [74]:
# Load the datasets
df_transformed = pd.read_csv('transformed_data_v1.csv')
df_tfidf = pd.read_csv('tfidf_features.csv')

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = df_tfidf
y = df_transformed['score']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    'Linear Regression': LinearRegression(),
    # 'Random Forest Regressor': RandomForestRegressor(n_estimators=600, max_depth=4, random_state=42),
    # 'AdaBoost Regressor': AdaBoostRegressor(n_estimators=100, random_state=42),
    'CatBoost Regressor': CatBoostRegressor(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Regressor': XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Regressor': LGBMRegressor(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

# Function to discretize predictions
def discretize_predictions(predictions, target_classes):
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5, num=len(target_classes) + 1)
    discretized = np.digitize(predictions, bins) - 1
    discretized = np.clip(discretized, 0, len(target_classes) - 1)
    return discretized + 1

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train)
    y_train_pred_scaled = (y_train_pred - y_train_pred.min()) / (y_train_pred.max() - y_train_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_train_discretized = discretize_predictions(y_train, target_classes)
    y_train_pred_discretized = discretize_predictions(y_train_pred_scaled, target_classes)
    kappa_train_score = cohen_kappa_score(y_train_discretized, y_train_pred_discretized, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test)
    y_test_pred_scaled = (y_test_pred - y_test_pred.min()) / (y_test_pred.max() - y_test_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_test_discretized = discretize_predictions(y_test, target_classes)
    y_test_pred_discretized = discretize_predictions(y_test_pred_scaled, target_classes)
    kappa_test_score = cohen_kappa_score(y_test_discretized, y_test_pred_discretized, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})


Working on Split, Train, Validate
Distribution of target classes in the training data:
score
3    4017
2    3022
4    2513
1     801
5     621
6     100
Name: count, dtype: int64
Distribution of target classes in the test data:
score
3    1005
2     756
4     628
1     200
5     155
6      25
Name: count, dtype: int64
Working on Linear Regression...
Elapsed time for Linear Regression: 1.5149664878845215 seconds
Working on CatBoost Regressor...
Elapsed time for CatBoost Regressor: 14.15885615348816 seconds
Working on XGBoost Regressor...
Elapsed time for XGBoost Regressor: 30.564793825149536 seconds


In [75]:
# Show results
results_df = pd.DataFrame(results)
print(results_df)

                Model  QWK Score (Train)  QWK Score (Test)
0   Linear Regression           0.650407          0.485852
1  CatBoost Regressor           0.725470          0.583873
2   XGBoost Regressor           0.848442          0.520738


### Combined Numerical and TF-IDF Features

In [76]:
# Load the datasets
combined_features_df = pd.read_csv('combined_features.csv')
df_transformed = pd.read_csv('transformed_data_v1.csv')

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = combined_features_df
y = df_transformed['score']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    'Linear Regression': LinearRegression(),
    #'Random Forest Regressor': RandomForestRegressor(n_estimators=600, max_depth=4, random_state=42),
    #'AdaBoost Regressor': AdaBoostRegressor(n_estimators=100, random_state=42),
    'CatBoost Regressor': CatBoostRegressor(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Regressor': XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Regressor': LGBMRegressor(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

# Function to discretize predictions
def discretize_predictions(predictions, target_classes):
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5, num=len(target_classes) + 1)
    discretized = np.digitize(predictions, bins) - 1
    discretized = np.clip(discretized, 0, len(target_classes) - 1)
    return discretized + 1

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train)
    y_train_pred_scaled = (y_train_pred - y_train_pred.min()) / (y_train_pred.max() - y_train_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_train_discretized = discretize_predictions(y_train, target_classes)
    y_train_pred_discretized = discretize_predictions(y_train_pred_scaled, target_classes)
    kappa_train_score = cohen_kappa_score(y_train_discretized, y_train_pred_discretized, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test)
    y_test_pred_scaled = (y_test_pred - y_test_pred.min()) / (y_test_pred.max() - y_test_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_test_discretized = discretize_predictions(y_test, target_classes)
    y_test_pred_discretized = discretize_predictions(y_test_pred_scaled, target_classes)
    kappa_test_score = cohen_kappa_score(y_test_discretized, y_test_pred_discretized, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})


Working on Split, Train, Validate
Distribution of target classes in the training data:
score
3    4017
2    3022
4    2513
1     801
5     621
6     100
Name: count, dtype: int64
Distribution of target classes in the test data:
score
3    1005
2     756
4     628
1     200
5     155
6      25
Name: count, dtype: int64
Working on Linear Regression...
Elapsed time for Linear Regression: 1.3538024425506592 seconds
Working on CatBoost Regressor...
Elapsed time for CatBoost Regressor: 14.627174615859985 seconds
Working on XGBoost Regressor...
Elapsed time for XGBoost Regressor: 30.71868872642517 seconds


In [79]:
# Show results
results_df = pd.DataFrame(results)
print(results_df)

                Model  QWK Score (Train)  QWK Score (Test)
0   Linear Regression           0.697232          0.672009
1  CatBoost Regressor           0.870019          0.737538
2   XGBoost Regressor           0.942193          0.752054


## Word2Vec Features

### Option 1: Traditional Machine Learning Classification Models

In [80]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import time

# Load the datasets
embeddings_df = pd.read_csv('word2vec_features.csv')

# Function to convert space-separated string of numbers (with brackets) to a list of floats
def convert_to_list(embedding_str):
    # Remove the square brackets
    embedding_str = embedding_str.replace('[', '').replace(']', '')
    # Split the string by spaces and convert to list of floats
    return [float(num) for num in embedding_str.split()]

# Apply the function to the word2vec_embedding column
embeddings_df['word2vec_embedding'] = embeddings_df['word2vec_embedding'].apply(convert_to_list)

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = np.vstack(embeddings_df['word2vec_embedding'].values)
y = embeddings_df['score'] - 1  # Adjust class labels to start from 0
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    'Random Forest Classifier': RandomForestClassifier(n_estimators=600, max_depth=4, random_state=42),
    'AdaBoost Classifier': AdaBoostClassifier(n_estimators=100, random_state=42),
    'CatBoost Classifier': CatBoostClassifier(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Classifier': XGBClassifier(objective='multi:softmax', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Classifier': LGBMClassifier(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on training set for {name}: {np.unique(y_train_pred)}")
    kappa_train_score = cohen_kappa_score(y_train + 1, y_train_pred, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on test set for {name}: {np.unique(y_test_pred)}")
    kappa_test_score = cohen_kappa_score(y_test + 1, y_test_pred, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})


Working on Split, Train, Validate
Distribution of target classes in the training data:
score
2    4017
1    3022
3    2513
0     801
4     621
5     100
Name: count, dtype: int64
Distribution of target classes in the test data:
score
2    1005
1     756
3     628
0     200
4     155
5      25
Name: count, dtype: int64
Working on Random Forest Classifier...
Elapsed time for Random Forest Classifier: 55.92894721031189 seconds
Distinct predicted values on training set for Random Forest Classifier: [1 2 3 4]
Distinct predicted values on test set for Random Forest Classifier: [1 2 3 4]
Working on AdaBoost Classifier...
Elapsed time for AdaBoost Classifier: 106.02817106246948 seconds
Distinct predicted values on training set for AdaBoost Classifier: [1 2 3 4 5 6]
Distinct predicted values on test set for AdaBoost Classifier: [1 2 3 4 5 6]
Working on CatBoost Classifier...
Elapsed time for CatBoost Classifier: 14.567898750305176 seconds
Distinct predicted values on training set for CatBoost C

In [81]:
# Display results in a DataFrame
results_df = pd.DataFrame(results).sort_values(by='QWK Score (Test)', ascending=False)
print(results_df)

                      Model  QWK Score (Train)  QWK Score (Test)
3        XGBoost Classifier           0.887020          0.567956
2       CatBoost Classifier           0.701877          0.557159
1       AdaBoost Classifier           0.426707          0.391434
0  Random Forest Classifier           0.267254          0.250111
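
All models in this notebook are ranked by the Quadratic Weighted Kappa (QWK), obtained from `cohen_kappa_score(..., weights='quadratic')`. As a rough illustration of what that score measures, the sketch below recomputes it from the standard definition using NumPy only. The helper name `quadratic_weighted_kappa` and the toy labels are assumptions made for this example and are not part of the notebook.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, labels):
    """Quadratic weighted kappa computed from its standard definition."""
    labels = list(labels)
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)

    # Observed agreement: confusion matrix of true vs. predicted labels
    observed = np.zeros((n, n))
    for t, p in zip(y_true, y_pred):
        observed[index[t], index[p]] += 1

    # Expected agreement under independence: outer product of the marginals
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic penalty grows with the squared distance between classes
    grid = np.arange(n)
    weights = (grid[:, None] - grid[None, :]) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

labels = [1, 2, 3, 4, 5, 6]
y_true = [1, 2, 3, 4, 5, 6]
near_miss = [1, 2, 3, 4, 5, 5]  # one prediction off by a single class
far_miss = [1, 2, 3, 4, 5, 1]   # one prediction off by five classes

qwk_perfect = quadratic_weighted_kappa(y_true, y_true, labels)
qwk_near = quadratic_weighted_kappa(y_true, near_miss, labels)
qwk_far = quadratic_weighted_kappa(y_true, far_miss, labels)
print(qwk_perfect, qwk_near, qwk_far)
```

Both imperfect predictions contain exactly one error, yet the far miss is penalized much more heavily. This is why QWK suits ordinal targets such as scores better than plain accuracy.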


#### Option 2: Traditional Machine Learning Regression Models

In [82]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from catboost import CatBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import time
import ast

# Load the datasets
embeddings_df = pd.read_csv('word2vec_features.csv')

# Function to convert space-separated string of numbers (with brackets) to a list of floats
def convert_to_list(embedding_str):
    # Remove the square brackets
    embedding_str = embedding_str.replace('[', '').replace(']', '')
    # Split the string by spaces and convert to list of floats
    return [float(num) for num in embedding_str.split()]

# Apply the function to the word2vec_embedding column
embeddings_df['word2vec_embedding'] = embeddings_df['word2vec_embedding'].apply(convert_to_list)

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = np.vstack(embeddings_df['word2vec_embedding'].values)
y = embeddings_df['score']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    'Linear Regression': LinearRegression(),
    #'Random Forest Regressor': RandomForestRegressor(n_estimators=600, max_depth=4, random_state=42),
    #'AdaBoost Regressor': AdaBoostRegressor(n_estimators=100, random_state=42),
    'CatBoost Regressor': CatBoostRegressor(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Regressor': XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Regressor': LGBMRegressor(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

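# Function to discretize continuous predictions into score classes
def discretize_predictions(predictions, target_classes):
    # Bin edges sit halfway between consecutive classes (e.g. 0.5, 1.5, ..., 6.5)
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5, num=len(target_classes) + 1)
    discretized = np.digitize(predictions, bins) - 1
    # Clip out-of-range predictions into the first/last class
    discretized = np.clip(discretized, 0, len(target_classes) - 1)
    # +1 maps the 0-based bin index back to the 1-based score labels
    return discretized + 1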

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train)
    y_train_pred_scaled = (y_train_pred - y_train_pred.min()) / (y_train_pred.max() - y_train_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_train_discretized = discretize_predictions(y_train, target_classes)
    y_train_pred_discretized = discretize_predictions(y_train_pred_scaled, target_classes)
    kappa_train_score = cohen_kappa_score(y_train_discretized, y_train_pred_discretized, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test)
    y_test_pred_scaled = (y_test_pred - y_test_pred.min()) / (y_test_pred.max() - y_test_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_test_discretized = discretize_predictions(y_test, target_classes)
    y_test_pred_discretized = discretize_predictions(y_test_pred_scaled, target_classes)
    kappa_test_score = cohen_kappa_score(y_test_discretized, y_test_pred_discretized, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})


Working on Split, Train, Validate
Distribution of target classes in the training data:
score
3    4017
2    3022
4    2513
1     801
5     621
6     100
Name: count, dtype: int64
Distribution of target classes in the test data:
score
3    1005
2     756
4     628
1     200
5     155
6      25
Name: count, dtype: int64
Working on Linear Regression...
Elapsed time for Linear Regression: 0.33718204498291016 seconds
Working on CatBoost Regressor...
Elapsed time for CatBoost Regressor: 5.67457389831543 seconds
Working on XGBoost Regressor...
Elapsed time for XGBoost Regressor: 4.95456862449646 seconds
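
Before binning, the loop above min-max rescales each model's raw predictions onto the label range. The sketch below reproduces that step on made-up numbers and, purely for comparison, shows clipping as an alternative that the notebook does not use. The names `raw_preds`, `label_min`, and `label_max` are illustrative assumptions.

```python
import numpy as np

# Made-up raw regression outputs; they need not span the full 1-6 range
raw_preds = np.array([2.2, 2.8, 3.1, 3.9, 4.4])
label_min, label_max = 1, 6  # score range, as in the value counts above

# Min-max rescaling as in the loop above: stretch predictions onto the label range
span = raw_preds.max() - raw_preds.min()
rescaled = (raw_preds - raw_preds.min()) / span * (label_max - label_min) + label_min

# Alternative for comparison (not used in this notebook): clip to the label range
clipped = np.clip(raw_preds, label_min, label_max)

print(rescaled)
print(clipped)
```

Rescaling always maps the smallest prediction in a batch to the lowest score and the largest to the highest, based on the statistics of the batch being predicted. Clipping leaves in-range predictions untouched. Which behavior is preferable is a modeling choice worth checking against held-out data.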


In [83]:
# Show results
results_df = pd.DataFrame(results)
print(results_df)

                Model  QWK Score (Train)  QWK Score (Test)
0   Linear Regression           0.355899          0.351317
1  CatBoost Regressor           0.523580          0.373310
2   XGBoost Regressor           0.673253          0.461169
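
The regression scores above depend on how continuous predictions are mapped to discrete classes. The following sketch walks a few made-up predictions through the same binning logic as `discretize_predictions`. The one deliberate difference, an assumption of this example, is that it indexes into `target_classes` instead of adding 1, so it would also work for labels that start at 0.

```python
import numpy as np

def discretize_predictions(predictions, target_classes):
    # Bin edges halfway between consecutive classes: 0.5, 1.5, ..., 6.5 for classes 1-6
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5,
                       num=len(target_classes) + 1)
    # np.digitize returns 1-based bin positions; shift to 0-based and clip outliers
    idx = np.clip(np.digitize(predictions, bins) - 1, 0, len(target_classes) - 1)
    # Look up the class label for each bin index
    return np.asarray(target_classes)[idx]

target_classes = np.arange(1, 7)
preds = np.array([0.2, 1.49, 1.5, 3.7, 6.9])  # illustrative values only
discrete = discretize_predictions(preds, target_classes)
print(discrete)  # [1 1 2 4 6]
```

Values below the first edge or above the last edge are clipped into the lowest and highest class, and a value exactly on an edge (1.5 here) falls into the upper class. The evenly spaced edges assume consecutive integer classes.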


#### Option 3: Using Neural Networks

A simple feedforward neural network built with the TensorFlow (Keras) library.

In [15]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import StandardScaler
import time  # needed for time.time() below

# Load the datasets
embeddings_df = pd.read_csv('word2vec_features.csv')

# Function to convert space-separated string of numbers (with brackets) to a list of floats
def convert_to_list(embedding_str):
    # Remove the square brackets
    embedding_str = embedding_str.replace('[', '').replace(']', '')
    # Split the string by spaces and convert to list of floats
    return [float(num) for num in embedding_str.split()]

# Apply the function to the word2vec_embedding column
embeddings_df['word2vec_embedding'] = embeddings_df['word2vec_embedding'].apply(convert_to_list)

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X = np.vstack(embeddings_df['word2vec_embedding'].values)
y = embeddings_df['score']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Dynamically determine target classes
target_classes = np.sort(np.unique(y))

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Standardize data for neural networks
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define the neural network model
model = Sequential()
model.add(Dense(128, input_dim=X_train_scaled.shape[1], activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='linear'))

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')

# Train the model
history = model.fit(X_train_scaled, y_train, epochs=50, batch_size=32, validation_data=(X_test_scaled, y_test), verbose=2)

# Predict
y_train_pred = model.predict(X_train_scaled).flatten()
y_test_pred = model.predict(X_test_scaled).flatten()

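# Function to discretize continuous predictions into score classes
def discretize_predictions(predictions, target_classes):
    # Bin edges sit halfway between consecutive classes (e.g. 0.5, 1.5, ..., 6.5)
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5, num=len(target_classes) + 1)
    discretized = np.digitize(predictions, bins) - 1
    # Clip out-of-range predictions into the first/last class
    discretized = np.clip(discretized, 0, len(target_classes) - 1)
    # +1 maps the 0-based bin index back to the 1-based score labels
    return discretized + 1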

# Define the QWK computation function
def compute_qwk(y_true, y_pred, target_classes):
    y_pred_discretized = discretize_predictions(y_pred, target_classes)
    return cohen_kappa_score(y_true, y_pred_discretized, weights='quadratic')

# Evaluate using QWK score
qwk_train = compute_qwk(y_train, y_train_pred, target_classes)
qwk_test = compute_qwk(y_test, y_test_pred, target_classes)

print(f"QWK Score on Train Set: {qwk_train}")
print(f"QWK Score on Test Set: {qwk_test}")


Working on Split, Train, Validate
Distribution of target classes in the training data:
score
3    4017
2    3022
4    2513
1     801
5     621
6     100
Name: count, dtype: int64
Distribution of target classes in the test data:
score
3    1005
2     756
4     628
1     200
5     155
6      25
Name: count, dtype: int64


Epoch 1/50
347/347 - 7s - 21ms/step - loss: 1.5953 - val_loss: 1.0994
Epoch 2/50
347/347 - 2s - 5ms/step - loss: 1.1590 - val_loss: 1.1863
Epoch 3/50
347/347 - 2s - 5ms/step - loss: 1.0383 - val_loss: 1.0890
Epoch 4/50
347/347 - 3s - 8ms/step - loss: 0.9696 - val_loss: 1.1958
Epoch 5/50
347/347 - 2s - 5ms/step - loss: 0.9548 - val_loss: 1.0846
Epoch 6/50
347/347 - 3s - 8ms/step - loss: 0.8994 - val_loss: 1.1531
Epoch 7/50
347/347 - 3s - 7ms/step - loss: 0.8765 - val_loss: 0.9718
Epoch 8/50
347/347 - 2s - 6ms/step - loss: 0.8655 - val_loss: 1.0934
Epoch 9/50
347/347 - 3s - 8ms/step - loss: 0.8411 - val_loss: 0.8995
Epoch 10/50
347/347 - 3s - 7ms/step - loss: 0.8187 - val_loss: 0.9938
Epoch 11/50
347/347 - 3s - 8ms/step - loss: 0.8122 - val_loss: 0.9161
Epoch 12/50
347/347 - 3s - 8ms/step - loss: 0.8045 - val_loss: 0.8496
Epoch 13/50
347/347 - 2s - 7ms/step - loss: 0.7992 - val_loss: 0.9131
Epoch 14/50
347/347 - 2s - 5ms/step - loss: 0.7804 - val_loss: 0.8731
Epoch 15/50
347/347 - 2s - 5
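
The network above is trained on standardized inputs, with the scaler fitted on the training split only and then reused for the test split. The NumPy sketch below shows what that fit/transform split amounts to. The synthetic arrays stand in for the embedding matrices and are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # synthetic stand-in for embeddings
X_test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# "fit": learn per-feature mean and standard deviation from the training split only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# "transform": apply the training statistics to both splits
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))
```

The scaled training features have mean 0 and standard deviation 1 by construction. The test features are only approximately standardized, because their statistics are never used; this keeps information from the test split out of preprocessing.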

## BERT

- **Run Time and Resource Usage:** The BERT approach consumed all available compute resources and took several hours to run (the linear regression fit below alone took about 5.9 hours), so it was abandoned as impractical.

#### Option 1: Traditional Machine Learning Classification Models

In [84]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import time

# Load the datasets
embeddings_df = pd.read_csv('bert_features.csv')

# Load the original dataset with scores to ensure the labels match the embeddings
df_transformed = pd.read_csv('transformed_data_v1.csv')

# Assuming embeddings_df already contains embeddings as numerical columns
# Combine BERT embeddings with the scores
y = df_transformed['score'] - 1  # Adjust class labels to start from 0
X = embeddings_df.values

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    #'Random Forest Classifier': RandomForestClassifier(n_estimators=600, max_depth=4, random_state=42),
    #'AdaBoost Classifier': AdaBoostClassifier(n_estimators=100, random_state=42),
    'CatBoost Classifier': CatBoostClassifier(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Classifier': XGBClassifier(objective='multi:softmax', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Classifier': LGBMClassifier(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on training set for {name}: {np.unique(y_train_pred)}")
    kappa_train_score = cohen_kappa_score(y_train + 1, y_train_pred, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test) + 1  # Adjust predictions back to original scale
    print(f"Distinct predicted values on test set for {name}: {np.unique(y_test_pred)}")
    kappa_test_score = cohen_kappa_score(y_test + 1, y_test_pred, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})




Working on Split, Train, Validate
Distribution of target classes in the training data:
score
2    4017
1    3022
3    2513
0     801
4     621
5     100
Name: count, dtype: int64
Distribution of target classes in the test data:
score
2    1005
1     756
3     628
0     200
4     155
5      25
Name: count, dtype: int64
Working on CatBoost Classifier...
Elapsed time for CatBoost Classifier: 38.15868878364563 seconds
Distinct predicted values on training set for CatBoost Classifier: [1 2 3 4 5 6]
Distinct predicted values on test set for CatBoost Classifier: [1 2 3 4 5 6]
Working on XGBoost Classifier...
Elapsed time for XGBoost Classifier: 81.05140852928162 seconds
Distinct predicted values on training set for XGBoost Classifier: [1 2 3 4 5 6]
Distinct predicted values on test set for XGBoost Classifier: [1 2 3 4 5 6]


In [85]:
# Convert results to DataFrame for better visualization
results_df = pd.DataFrame(results)
print(results_df)

                 Model  QWK Score (Train)  QWK Score (Test)
0  CatBoost Classifier           0.823684          0.679635
1   XGBoost Classifier           0.975234          0.690236


#### Option 2: Traditional Machine Learning Regression Models

In [86]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
from catboost import CatBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import time

# Load the BERT embeddings
embeddings_df = pd.read_csv('bert_features.csv')

# Load the original dataset with scores to ensure the labels match the embeddings
df_transformed = pd.read_csv('transformed_data_v1.csv')

# Combine BERT embeddings with the scores
y = df_transformed['score']
X = embeddings_df.values

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Define models
models = {
    'Linear Regression': LinearRegression(),
    'CatBoost Regressor': CatBoostRegressor(iterations=600, depth=4, learning_rate=0.1, random_seed=42, silent=True),
    'XGBoost Regressor': XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1, max_depth=4, alpha=10, n_estimators=600),
    #'LightGBM Regressor': LGBMRegressor(n_estimators=600, learning_rate=0.1, max_depth=4, random_state=42)
}

# Store results
results = []

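# Function to discretize continuous predictions into score classes
def discretize_predictions(predictions, target_classes):
    # Bin edges sit halfway between consecutive classes (e.g. 0.5, 1.5, ..., 6.5)
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5, num=len(target_classes) + 1)
    discretized = np.digitize(predictions, bins) - 1
    # Clip out-of-range predictions into the first/last class
    discretized = np.clip(discretized, 0, len(target_classes) - 1)
    # +1 maps the 0-based bin index back to the 1-based score labels
    return discretized + 1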

target_classes = np.sort(np.unique(y))

for name, model in models.items():
    print(f"Working on {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    print(f"Elapsed time for {name}: {end_time - start_time} seconds")

    # Predictions on training set
    y_train_pred = model.predict(X_train)
    y_train_pred_scaled = (y_train_pred - y_train_pred.min()) / (y_train_pred.max() - y_train_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_train_discretized = discretize_predictions(y_train, target_classes)
    y_train_pred_discretized = discretize_predictions(y_train_pred_scaled, target_classes)
    kappa_train_score = cohen_kappa_score(y_train_discretized, y_train_pred_discretized, weights='quadratic')

    # Predictions on test set
    y_test_pred = model.predict(X_test)
    y_test_pred_scaled = (y_test_pred - y_test_pred.min()) / (y_test_pred.max() - y_test_pred.min()) * (y_train.max() - y_train.min()) + y_train.min()
    y_test_discretized = discretize_predictions(y_test, target_classes)
    y_test_pred_discretized = discretize_predictions(y_test_pred_scaled, target_classes)
    kappa_test_score = cohen_kappa_score(y_test_discretized, y_test_pred_discretized, weights='quadratic')

    results.append({'Model': name, 'QWK Score (Train)': kappa_train_score, 'QWK Score (Test)': kappa_test_score})




Working on Split, Train, Validate
Distribution of target classes in the training data:
score
3    4017
2    3022
4    2513
1     801
5     621
6     100
Name: count, dtype: int64
Distribution of target classes in the test data:
score
3    1005
2     756
4     628
1     200
5     155
6      25
Name: count, dtype: int64
Working on Linear Regression...
Elapsed time for Linear Regression: 21291.91133761406 seconds
Working on CatBoost Regressor...
Elapsed time for CatBoost Regressor: 25.162798404693604 seconds
Working on XGBoost Regressor...
Elapsed time for XGBoost Regressor: 28.569011688232422 seconds


In [87]:
# Convert results to DataFrame for better visualization
results_df = pd.DataFrame(results)
print(results_df)

                Model  QWK Score (Train)  QWK Score (Test)
0   Linear Regression           0.557390          0.506605
1  CatBoost Regressor           0.756849          0.591912
2   XGBoost Regressor           0.796409          0.596577
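
One way to read the table above is to compare each model's train and test QWK. The sketch below computes that gap in plain Python from the values printed above; the `gap` field is an addition for illustration only.

```python
# QWK scores copied from the results table above
results = [
    {"Model": "Linear Regression", "train": 0.557390, "test": 0.506605},
    {"Model": "CatBoost Regressor", "train": 0.756849, "test": 0.591912},
    {"Model": "XGBoost Regressor", "train": 0.796409, "test": 0.596577},
]

# Gap between train and test QWK: a larger gap suggests more overfitting
for row in results:
    row["gap"] = round(row["train"] - row["test"], 6)

# List models from best to worst test score, with their gap alongside
for row in sorted(results, key=lambda r: r["test"], reverse=True):
    print(f"{row['Model']:<20} test={row['test']:.6f} gap={row['gap']:.6f}")
```

The two boosted models score highest on the test set but also show the largest train/test gaps, which may indicate room for stronger regularization or fewer boosting iterations.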


#### Option 3: Using Neural Networks

In [56]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import StandardScaler
import time

# Load the BERT embeddings
embeddings_df = pd.read_csv('bert_features.csv')

# Load the original dataset with scores to ensure the labels match the embeddings
df_transformed = pd.read_csv('transformed_data_v1.csv')

# Combine BERT embeddings with the scores
y = df_transformed['score']
X = embeddings_df.values

print("Working on Split, Train, Validate")
start_time = time.time()

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Dynamically determine target classes
target_classes = np.sort(np.unique(y))

# Check the distribution of the target classes in the training data
print("Distribution of target classes in the training data:")
print(y_train.value_counts())

# Check the distribution of the target classes in the test data
print("Distribution of target classes in the test data:")
print(y_test.value_counts())

# Standardize data for neural networks
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define the neural network model
model = Sequential()
model.add(Dense(128, input_dim=X_train_scaled.shape[1], activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='linear'))

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')

# Train the model
history = model.fit(X_train_scaled, y_train, epochs=50, batch_size=32, validation_data=(X_test_scaled, y_test), verbose=2)

# Predict
y_train_pred = model.predict(X_train_scaled).flatten()
y_test_pred = model.predict(X_test_scaled).flatten()

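# Function to discretize continuous predictions into score classes
def discretize_predictions(predictions, target_classes):
    # Bin edges sit halfway between consecutive classes (e.g. 0.5, 1.5, ..., 6.5)
    bins = np.linspace(min(target_classes) - 0.5, max(target_classes) + 0.5, num=len(target_classes) + 1)
    discretized = np.digitize(predictions, bins) - 1
    # Clip out-of-range predictions into the first/last class
    discretized = np.clip(discretized, 0, len(target_classes) - 1)
    # +1 maps the 0-based bin index back to the 1-based score labels
    return discretized + 1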

# Define the QWK computation function
def compute_qwk(y_true, y_pred, target_classes):
    y_pred_discretized = discretize_predictions(y_pred, target_classes)
    return cohen_kappa_score(y_true, y_pred_discretized, weights='quadratic')

# Evaluate using QWK score
qwk_train = compute_qwk(y_train, y_train_pred, target_classes)
qwk_test = compute_qwk(y_test, y_test_pred, target_classes)




Working on Split, Train, Validate
Distribution of target classes in the training data:
score
3    4017
2    3022
4    2513
1     801
5     621
6     100
Name: count, dtype: int64
Distribution of target classes in the test data:
score
3    1005
2     756
4     628
1     200
5     155
6      25
Name: count, dtype: int64
Epoch 1/50
347/347 - 2s - 6ms/step - loss: 1.2502 - val_loss: 0.9882
Epoch 2/50
347/347 - 1s - 2ms/step - loss: 0.7721 - val_loss: 0.9516
Epoch 3/50
347/347 - 1s - 2ms/step - loss: 0.6475 - val_loss: 0.8703
Epoch 4/50
347/347 - 1s - 2ms/step - loss: 0.5889 - val_loss: 0.8411
Epoch 5/50
347/347 - 1s - 2ms/step - loss: 0.5490 - val_loss: 1.0387
Epoch 6/50
347/347 - 1s - 2ms/step - loss: 0.5269 - val_loss: 0.8222
Epoch 7/50
347/347 - 1s - 2ms/step - loss: 0.5002 - val_loss: 0.7037
Epoch 8/50
347/347 - 1s - 2ms/step - loss: 0.4840 - val_loss: 0.8296
Epoch 9/50
347/347 - 1s - 2ms/step - loss: 0.4688 - val_loss: 0.8837
Epoch 10/50
347/347 - 1s - 2ms/step - loss: 0.4594 - val_loss: 0.8370
Epoch 11/50
347/347 - 1s - 4ms/step - loss: 0.4582 - val_loss: 0.5535
Epoch 12/50
347/347 - 1s - 2ms/step - loss: 0.4458 - val_loss: 0.9599
Epoch 13/50
347/347 - 1s - 2ms/step - loss: 0.4476 - val_loss: 0.7568
Epoch 14/50
347/347 - 1s - 2ms/step - loss: 0.4317 - val_loss: 0.6697
Epoch 15/50
347/347 - 1s - 2ms/step - lo

In [57]:
print(f"QWK Score on Train Set: {qwk_train}")
print(f"QWK Score on Test Set: {qwk_test}")

QWK Score on Train Set: 0.7216489652760272
QWK Score on Test Set: 0.5787502629917947
