# PROJECT N°2 : ESSAY GRADES PREDICTION

Students :
- Hamza Mostefaoui
- Anthony Reichen
- Noman Ghulam
- Suleiman Adebowale Ojo
- Jordan Porcu

Instructor :
- Assan Sanogo

### OVERVIEW

With given data about essays, essays topic and grades attributed by teachers, we have to create a pipeline that grades the essay from the text itself. 
For doing it, we imagined a pipeline this way :
- Visualizing of the data
- Checking the balance of the data, and if needed, balancing it
- Extracting features from the essay text
- Spliting the whole data set in train, test and validation set
- Make 3 models : 
    1. A classification model to predict the essay set from the text
    2. A regression model to predict the main grade (domain1_score)
    3. For the essay set n°2, there are 2 grades : domain1_score and domain2_score. So, we need a third regression model to predict the domain2_score

### IMPORTATION

In [232]:
# STANDARD
import pandas as pd
import numpy as np

# DATA VIZ
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from plotly.subplots import make_subplots

# FEATURE ENGINEERING 
from src.features import *
import textstat

# MACHINE LEARNING
import xgboost as xgb
import pickle
from xgboost import XGBClassifier,XGBRegressor
from sklearn.model_selection import GridSearchCV,train_test_split
from sklearn.metrics import classification_report, cohen_kappa_score,mean_squared_error
from sklearn.preprocessing import MinMaxScaler,LabelEncoder
from imblearn.over_sampling import RandomOverSampler

training_data = pd.read_excel('data/training_set_rel3.xls')
training_data.head(5)

Unnamed: 0,essay_id,essay_set,essay,rater1_domain1,rater2_domain1,rater3_domain1,domain1_score,rater1_domain2,rater2_domain2,domain2_score,...,rater2_trait3,rater2_trait4,rater2_trait5,rater2_trait6,rater3_trait1,rater3_trait2,rater3_trait3,rater3_trait4,rater3_trait5,rater3_trait6
0,1,1,"Dear local newspaper, I think effects computer...",4.0,4.0,,8.0,,,,...,,,,,,,,,,
1,2,1,"Dear @CAPS1 @CAPS2, I believe that using compu...",5.0,4.0,,9.0,,,,...,,,,,,,,,,
2,3,1,"Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...",4.0,3.0,,7.0,,,,...,,,,,,,,,,
3,4,1,"Dear Local Newspaper, @CAPS1 I have found that...",5.0,5.0,,10.0,,,,...,,,,,,,,,,
4,5,1,"Dear @LOCATION1, I know having computers has a...",4.0,4.0,,8.0,,,,...,,,,,,,,,,


### FUNCTIONS CREATION

`DATAFRAME_SCALER` will be used to scale the dataframe. Since every essay set has its own range of grades, we need to split the dataset for every essay_set, scale it, then mix it back together. Than, we will have a fully scaled dataset, where grades goes from -1 to 1. Moreover, we save a dictionnary that store the model used for every set (and for domain2_score scaling), that will allow us, in the end, to bring predicted grades to the correct range

`FEATURE_ENGINEERING` is made to extract features from the text. We extract 130 features, that explains the text complexity, to help the model understand every aspect of it. Then, we save it in a .pickle file to avoid the long time of execution

`SPLIT_AND_BALANCE` is there to balance the data. We will see in the next part that the data is not balanced for the 8th essay set. So we extract 200 records for the test and validation set, to be sure they are balanced, then we use a RandomOverSampler to balance the remaining data, that we will use to train the model

`reverse_scaling` and `reverse_scaling_2` are made to, in the end, reverse the scaling of the predicted value, and give the same range of grades than the original scores. The "_2" stands for the use in the domain2_score case

In [233]:
def DATAFRAME_SCALER(df, remove_useless_columns=True, range_of_scaling=(-1, 1), display_distribution=False):
    # dictionary to store scalers
    scalers_dict = {}

    # --- DOMAIN 1 HANDLING ---
    scaled_domain1_score_list = []
    for i in np.arange(8):
        df_temp = df[df["essay_set"] == i+1].copy()  # parsing essay set
        domain_score = np.array(df_temp["domain1_score"]).reshape(-1, 1)  # turning score column in array
        scaler = MinMaxScaler(feature_range=range_of_scaling)  # scaler
        scaler.fit(domain_score)  # fitting ...
        df_temp["scaled_domain1_score"] = scaler.transform(domain_score)  # scaling the score column
        scaled_domain1_score_list.append(df_temp[["scaled_domain1_score"]])  # add the scaled column to lists
        
        # save the scaler with essay_set as the key
        scalers_dict[f"domain1_essay_set_{i+1}"] = scaler

    scaled_scores_df = pd.concat(scaled_domain1_score_list)
    df = df.join(scaled_scores_df)

    # --- DOMAIN 2 HANDLING ---
    scaler2 = MinMaxScaler(feature_range=range_of_scaling)
    x = np.array(df["domain2_score"]).reshape(-1, 1)
    scaler2.fit(x)
    df["scaled_domain2_score"] = scaler2.transform(x)

    # Save the scaler for domain2
    scalers_dict["domain2"] = scaler2

    if remove_useless_columns:
        df = df.drop(['rater1_domain1', 'rater2_domain1', 'rater3_domain1', 
                      'rater1_domain2', 'rater2_domain2', 'rater1_trait1',       
                      'rater1_trait2', 'rater1_trait3', 'rater1_trait4', 
                      'rater1_trait5', 'rater1_trait6', 'rater2_trait1', 
                      'rater2_trait2', 'rater2_trait3', 'rater2_trait4', 
                      'rater2_trait5', 'rater2_trait6', 'rater3_trait1', 
                      'rater3_trait2', 'rater3_trait3', 'rater3_trait4', 
                      'rater3_trait5', 'rater3_trait6'], axis=1)

    if display_distribution:
        plt.figure(figsize=(20, 6))
        plt.suptitle("With parsed scaling")

        plt.subplot(1, 2, 1)
        plt.hist(df['scaled_domain1_score'], edgecolor='black')
        plt.title('domain1_score')
        plt.xlabel('value')
        plt.ylabel('freq')

        plt.subplot(1, 2, 2)
        plt.hist(df['scaled_domain2_score'], edgecolor='black')
        plt.title('domain2_score')
        plt.xlabel('value')
        plt.ylabel('freq')

        plt.show()

    # return the processed dataframe and the dictionary of scalers
    return df, scalers_dict

def FEATURE_ENGINEERING(df):
    # creating columns from features.py
    df["count_characters"] = df["essay"].apply(lambda x: count_characters(x))
    df["count_syllables"] = df["essay"].apply(lambda x: count_syllables(x))
    df["count_words"] = df["essay"].apply(lambda x: count_words(x))
    df["count_sentences"] = df["essay"].apply(lambda x: count_sentences(x))
    df["flesch_reading_ease"] = df["essay"].apply(lambda x: get_flesch_reading_ease(x))
    df["gunning_fog"] = df["essay"].apply(lambda x: get_gunning_fog(x))
    df["automated_readability_index"] = df["essay"].apply(lambda x: get_automated_readability_index(x))
    df["smog_index"] = df["essay"].apply(lambda x: get_smog_index(x))
    df["flesch_kincaid_grade"] = df["essay"].apply(lambda x: get_flesch_kincaid_grade(x))
    df["coleman_liau_index"] = df["essay"].apply(lambda x: get_coleman_liau_index(x))
    df["dale_chall_readability_score"] = df["essay"].apply(lambda x: get_dale_chall_readability_score(x))
    df["automated_readability_index"] = df["essay"].apply(lambda x: get_automated_readability_index(x))
    df["dale_chall_readability_score"] = df["essay"].apply(lambda x: get_dale_chall_readability_score(x))
    df["difficult_words"] = df["essay"].apply(lambda x: get_difficult_words(x))
    df["linsear_write_formula"] = df["essay"].apply(lambda x: get_linsear_write_formula(x))
    df["count_awl_words"] = df["essay"].apply(lambda x: count_awl_words(x))
    df["calculate_lexical_diversity"] = df["essay"].apply(lambda x: calculate_lexical_diversity(x))
    df["get_average_heights"] = df["essay"].apply(lambda x: get_average_heights(x))
    df["get_average_connections_at_root"] = df["essay"].apply(lambda x: get_average_connections_at_root(x))
    df["get_length_of_clauses"] = df["essay"].apply(lambda x: get_length_of_clauses(x))
    df["calculate_misspelling_score"] = df["essay"].apply(lambda x: calculate_misspelling_score(x))
    df["detect_slur_usage"] = df["essay"].apply(lambda x: detect_slur_usage(x))
    df["calculate_overusage_of_punctuation"] = df["essay"].apply(lambda x: calculate_overusage_of_punctuation(x))
    df["count_tagged_entity"] = df["essay"].apply(lambda x: count_tagged_entity(x))
    df["count_stop_words"] = df["essay"].apply(lambda x: count_stop_words(x))
    df["count_quoted_words"] = df["essay"].apply(lambda x: count_quoted_words(x))

    tmp_df = pd.DataFrame()
    tmp_df = df["essay"].apply(lambda x: get_pos_tags(x))
    tmp_df = pd.json_normalize(tmp_df)
    df = df.join(tmp_df, how="left")

    tmp_df = pd.DataFrame()
    tmp_df = df["essay"].apply(lambda x: get_word_frequency(x))
    tmp_df = pd.json_normalize(tmp_df)
    df = df.join(tmp_df, how="left")

    tmp_df = pd.DataFrame()
    tmp_df = df["essay"].apply(lambda x: get_sentence_tree_roots(x))
    tmp_df = pd.json_normalize(tmp_df)
    tmp_df.fillna(0, inplace=True)
    df = df.join(tmp_df, how="left")
    
    with open("processed_data.pickle", "wb") as file:
        pickle.dump(df, file)
    return df

def SPLIT_AND_BALANCE(df, test_size=200, val_size=200, random_state=42):
    # oversampling the data
    ros = RandomOverSampler(random_state=random_state)
    X_resampled, y_resampled = ros.fit_resample(df.drop(['essay_set'], axis=1), df['essay_set'])
    resampled_df = pd.concat([X_resampled, y_resampled], axis=1)
    
    # test, train and validation creation
    X_temp, X_test, y_temp, y_test = train_test_split(resampled_df.drop(['essay_set'], axis=1), resampled_df['essay_set'], 
                                                      test_size=test_size, stratify=resampled_df['essay_set'], random_state=random_state)
    X_train, X_validation, y_train, y_validation = train_test_split(X_temp, y_temp, 
                                                                    test_size=val_size, stratify=y_temp, random_state=random_state)
    
    # mixing in dataframes 
    train_set = pd.concat([X_train, y_train], axis=1)
    test_set = pd.concat([X_test, y_test], axis=1)
    validation_set = pd.concat([X_validation, y_validation], axis=1)
    
    return train_set, test_set, validation_set

def reverse_scaling(row):
    scaler_key = f'domain1_essay_set_{int(row["essay_set"])}'
    scaler = scalers_dict[scaler_key]
    inversed_pred = scaler.inverse_transform([[row['pred']]])[0][0]

    rounded_pred = np.round(inversed_pred)
    return rounded_pred

def reverse_scaling_2(row):
    scaler_key = 'domain2'
    scaler = scalers_dict[scaler_key]
    inversed_pred = scaler.inverse_transform([[row['pred']]])[0][0]
    rounded_pred = np.round(inversed_pred)
    return rounded_pred

### <span style="color:RED">DISCLAIMER</span> : takes a long time to execute, so pass to the next cell instead


In [None]:

scaled_df,scalers_dict = DATAFRAME_SCALER(training_data)
featured_df = FEATURE_ENGINEERING(scaled_df)
train_set,test_set,validation_set = SPLIT_AND_BALANCE(featured_df)

# TRAIN, TEST AND VALIDATION SETS CREATION

In [234]:
# creating the scaled dataframe and the scaler dictionnary (for reverse scaling after regression)
scaled_df,scalers_dict = DATAFRAME_SCALER(training_data)
target = "scaled_domain1_score"
train_set = pd.read_csv("data/final/train_set.csv").drop(["Unnamed: 0"],axis=1).dropna(subset=[target])
features = train_set.columns[6:]
test_set = pd.read_csv("data/final/test_set.csv").drop(["Unnamed: 0"],axis=1).dropna(subset=[target])
validation_set = pd.read_csv("data/final/validation_set.csv").drop(["Unnamed: 0"],axis=1).dropna(subset=[target])

# DATA VISUALISATION

### WITHOUT SCALING

In [235]:
fig = make_subplots(rows=1, cols=3)
fig.add_trace(go.Histogram(x=training_data["essay_set"], nbinsx=8, name='essay_set'), row=1, col=1)
fig.add_trace(go.Histogram(x=training_data["domain1_score"], nbinsx=20, name='domain1_score'), row=1, col=2)
fig.update_layout(title_text='raw data distributions')
fig.show()

This is the distribution of essay_set and domain1_score for the raw data that was given to us. Two conclusion came to our mind : 
1. For the essay_set, 8th class is less represented than the others, that will lead to an imbalanced data issue that we managed to solve
2. Since the range are not the same for every essay_set, we can clearly see the issue with the domain1_score distribution. We decided to use our functions to scale the dataset given the context of the esay_set

In [236]:
X_train, y_train = train_set[features], train_set[target]
X_test, y_test = test_set[features], test_set[target]
X_val, y_val = validation_set[features], validation_set[target]

### WITH SCALING

In [237]:
fig = make_subplots(rows=1, cols=3)
fig.add_trace(go.Histogram(x=X_train["essay_set"], nbinsx=8, name='train'), row=1, col=1)
fig.add_trace(go.Histogram(x=X_test["essay_set"], nbinsx=8, name='test'), row=1, col=2)
fig.add_trace(go.Histogram(x=X_val["essay_set"], nbinsx=8, name='validation'), row=1, col=3)
fig.update_layout(title_text='essay_set distribution for each dataset')
fig.show()

fig = make_subplots(rows=1, cols=3)
fig.add_trace(go.Histogram(x=y_train, nbinsx=20, name='train'), row=1, col=1)
fig.add_trace(go.Histogram(x=y_test, nbinsx=20, name='test'), row=1, col=2)
fig.add_trace(go.Histogram(x=y_val, nbinsx=20, name='validation'), row=1, col=3)
fig.update_layout(title_text='domain1_score (scaled) distribution for each dataset')
fig.show()

Now we can see that we split the dataset in train, test and validation set. All these 3 sets are balanced, with no lack of class representation, and that the domain1_score is finely distributed among all the sets

# MODEL CREATION

### STEP 1 : ESSAY_SET CLASSIFICATION 

As said earlier, the first step is to make a classification model that will train over the features to predict the essay_set. It will be interesting to use it for new texts where we don't know which set it come from.

In [238]:
# create sub sets from the original data set. The only things changing here is the target and features
class_X_train = X_train.drop(["essay_set"],axis=1)
class_y_train = X_train["essay_set"]-1

class_X_test = X_test.drop(["essay_set"],axis=1)
class_y_test = X_test["essay_set"]-1

class_X_val = X_val.drop(["essay_set"],axis=1)
class_y_val = X_val["essay_set"]-1

### Model optimization

We use XGBoost Classifier since this library is very complete for machine learning tasks, and give a good result in most of the prediction issues.
We also use GridSearch with Cross Validation to fine-tune the model, and use the best parameters

In [239]:
model = XGBClassifier(objective='multi:softprob', num_class=len(class_y_train.unique()), seed=42)
param_grid = {
    'max_depth': [6, 8],
    'learning_rate': [0.1, 0.2],
    'n_estimators': [100, 150],
    'subsample': [0.8, 1],
}

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=3, verbose=1)
grid_search.fit(class_X_train, class_y_train)

print(f"Best parameters: {grid_search.best_params_}")

Fitting 3 folds for each of 16 candidates, totalling 48 fits
Best parameters: {'learning_rate': 0.2, 'max_depth': 6, 'n_estimators': 150, 'subsample': 0.8}


Now that we have our parameters, we can apply it, and see the model performance

### Model performance

In [240]:
best_params = grid_search.best_params_
class_model_optimized = XGBClassifier(**best_params, objective='multi:softprob', num_class=len(class_y_train.unique()), seed=42)
class_model_optimized.fit(class_X_train, class_y_train)

# prediction on test set
pred_test = class_model_optimized.predict(class_X_test)
class_report_test = classification_report(class_y_test, pred_test)
print("classification report on test :\n", class_report_test)

# prediction on test validation set
pred_val = class_model_optimized.predict(class_X_val)
class_report_val = classification_report(class_y_val, pred_val)
print("classification report on validation :\n", class_report_val)

classification report on test :
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00      1.00      1.00        12
           2       0.93      1.00      0.96        13
           3       1.00      0.85      0.92        13
           4       0.87      1.00      0.93        13
           5       1.00      1.00      1.00        12
           6       1.00      0.92      0.96        13
           7       1.00      1.00      1.00        12

    accuracy                           0.97       100
   macro avg       0.97      0.97      0.97       100
weighted avg       0.97      0.97      0.97       100

classification report on validation :
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       1.00      0.92      0.96        13
           2       0.91      0.83      0.87        12
           3       0.92      0.92      0.92        12
      

Since it's a classification problem, the classification report gives a lot of metrics and is a good way to see model performance on every class. Here we have good result on global and specific prediction, so we can save the model to use it later

In [241]:
with open('models/essay_set_classification_model.pkl', 'wb') as file:
    pickle.dump(class_model_optimized, file)

### STEP 2 : DOMAIN1_SCORE REGRESSION

Now, we can focus on the main part of the project : grade prediction. Again, we will use XGBoost Regressor, with GridSearch and Cross Validation to ensure we use the right model. We will use split sets we created in the beginning and see the results

### Model optimization

In [242]:
param_grid = {
    'max_depth': [3, 5],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [100, 200],
    'subsample': [0.8, 1]
}

grid_search = GridSearchCV(estimator=XGBRegressor(seed=42), param_grid=param_grid, scoring='neg_mean_squared_error', cv=3, verbose=1)
grid_search.fit(X_train, y_train)
print(f"Meilleurs paramètres: {grid_search.best_params_}")

Fitting 3 folds for each of 16 candidates, totalling 48 fits


Meilleurs paramètres: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200, 'subsample': 1}


In [243]:
best_params = grid_search.best_params_
model_optimized = XGBRegressor(**best_params, seed=42)

model_optimized.fit(X_train, y_train)

predictions_test = model_optimized.predict(X_test)
mse_test = mean_squared_error(y_val, predictions_test)
print(f"MSE on test set: {mse_test}")

predictions_val = model_optimized.predict(X_val)
mse_val = mean_squared_error(y_val, predictions_val)
print(f"MSE on validation set: {mse_val}")

MSE on test set: 0.32263248796886346
MSE on validation set: 0.0631256422796088


The MSE is a go-to metric used in a lot of regression problems. But here, if it's interesting, it's not irrevelant enough for 2 reasons :
1. We used it on the scaled values of grading. We will now use reverse the scaling and compare it with the original domain1_score values
2. From the kaggle competition where this project is from, they use the Quadratic Weighted Kappa (QWK) metric. We will use it from sklearn.metrics, the cohen_kappa_score using the parameters "weights" as "quadratic"

In [244]:
# create subset just for result and performance purpose
train_result_df = train_set[["essay_set", "domain1_score", "scaled_domain1_score"]].copy()
test_result_df = test_set[["essay_set", "domain1_score", "scaled_domain1_score"]].copy()
val_result_df = validation_set[["essay_set", "domain1_score", "scaled_domain1_score"]].copy()

# add scaled predictions
val_result_df.loc[:, "pred"] = predictions_val
test_result_df.loc[:, "pred"] = predictions_test

# reverse scaling to display true predictions
val_result_df["reversed_pred"] = val_result_df.apply(reverse_scaling,axis=1)
test_result_df["reversed_pred"] = test_result_df.apply(reverse_scaling,axis=1)

print(f'Test : {cohen_kappa_score(test_result_df["domain1_score"],test_result_df["reversed_pred"], weights="quadratic")}')
print(f'Validation : {cohen_kappa_score(val_result_df["domain1_score"],val_result_df["reversed_pred"], weights="quadratic")}')

Test : 0.9877636680539252
Validation : 0.9920438350098933


As we can see from the kaggle competition, the top 1 from the leaderboard managed to have a QWK of 0.81, but it was 12 years ago and since we don't have access to their work, we assumed that they didn't have the same tools and libraries we have now. So being able to have better results is not surprising now. We are satisfied with our performance, so we can save the model.

In [245]:
with open('models/domain1_score_regression_model.pkl', 'wb') as file:
    pickle.dump(model_optimized, file)

### STEP 3 : DOMAIN2_SCORE REGRESSION

The project is almost over, but we have to deal with one more issue : the essay_set 2 and its domain2_score. Since it's the only one with this score, we decided to only change our way to prepare the data. We did it this way :
1. We took the train_set with almost every records from the original set, already cleaned and scaled
2. We only saved the records with essay set 2
3. We split in a more usual way to get (X_train,y_train), (X_test,y_test) and (X_val,y_val) with train_test_split
4. Then we used the same process from the domain1_score regression

In [246]:
target = "scaled_domain2_score"
X_train_temp_2, y_train_temp_2 = train_set[train_set["essay_set"]==2][features[:-1]], train_set[train_set["essay_set"]==2][target]
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train_temp_2,y_train_temp_2, test_size=0.2, random_state=42)
X_train_2_final, X_val_2, y_train_2_final, y_val_2 = train_test_split(X_train_2,y_train_2, test_size=0.2, random_state=42)

In [247]:
test_result_df_2 = pd.DataFrame(y_test_2).join(train_set["domain2_score"])
val_result_df_2 = pd.DataFrame(y_val_2).join(train_set["domain2_score"])

In [248]:
param_grid = {
    'max_depth': [3, 5],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [100, 200],
    'subsample': [0.8, 1]
}

grid_search = GridSearchCV(estimator=XGBRegressor(seed=42), param_grid=param_grid, scoring='neg_mean_squared_error', cv=3, verbose=1)
grid_search.fit(X_train_2_final, y_train_2_final)
print(f"Meilleurs paramètres: {grid_search.best_params_}")

Fitting 3 folds for each of 16 candidates, totalling 48 fits


Meilleurs paramètres: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}


In [249]:
best_params = grid_search.best_params_
model_optimized_2 = XGBRegressor(**best_params, seed=42)

model_optimized_2.fit(X_train_2_final, y_train_2_final)

predictions_test = model_optimized_2.predict(X_test_2)
mse_test = mean_squared_error(y_test_2, predictions_test)

print(f"MSE on test set: {mse_test}")

predictions_val = model_optimized_2.predict(X_val_2)
mse_val = mean_squared_error(y_val_2, predictions_val)
print(f"MSE on validation set: {mse_val}")

MSE on test set: 0.1155221166102705
MSE on validation set: 0.13775895851855027


In [253]:
test_result_df_2.loc[:, "pred"] = predictions_test
val_result_df_2.loc[:, "pred"] = predictions_val

val_result_df_2["reversed_pred"] = val_result_df_2.apply(reverse_scaling_2,axis=1)
test_result_df_2["reversed_pred"] = test_result_df_2.apply(reverse_scaling_2,axis=1)

print(f'Test : {cohen_kappa_score(test_result_df_2["domain2_score"],test_result_df_2["reversed_pred"], weights="quadratic")}')
print(f'Validation : {cohen_kappa_score(val_result_df_2["domain2_score"],val_result_df_2["reversed_pred"], weights="quadratic")}')

Test : 0.6312690798081115
Validation : 0.6307863141092374


Here, we can see the result is not as good as the first one. But it still statisfies us, so we can now save it

In [252]:
with open('models/domain2_score_regression_model.pkl', 'wb') as file:
    pickle.dump(model_optimized_2, file)

# Conclusion and way of upgrades

As a conclusion, we can say that this project covered a large amount of data topics : 
- Data exploration
- Data engineering
- Machine learning
- Model deployment

Even if we managed to get good results, we can see few ways to enhance the project : 
- Using the tokenized and vectorized text to make predictions. And even add vectorized text to features to compare perfomance for those three ways (features, features + vectorized text and vectorized text only)
- Using different models such as basic sklearn classification models, random forest or else to compare performance
- Using different method of scaling, instead of just MinMaxScaling, like changing the range of data, or even the model itself
- Using different ways of sampling such as SMOTE
- Fine-tuning models with other hyperparameters
- And finally, we could have go deeper using deep learning and neural networks