# Model Training

In [8]:
# Import libraries

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

In [38]:
# Load the CSV dataset
data = pd.read_csv('emotion_data.csv')

# Print the column names and first few rows to verify the structure
print("Columns in the dataset:")
print(data.columns)
print("\nFirst few rows:")
print(data.head())

Columns in the dataset:
Index(['statement', 'status', 'sentiment_score', 'sentiment_category',
       'text_length', 'word_count', 'avg_word_length', 'stopword_count',
       'first_person_pronoun_count', 'keywords_found', 'keyword_count',
       'predicted_emotion'],
      dtype='object')

First few rows:
                                           statement   status  \
0                                         oh my gosh  Anxiety   
1  trouble sleeping confused mind restless heart ...  Anxiety   
2  all wrong back off dear forward doubt stay in ...  Anxiety   
3  ive shifted my focus to something else but im ...  Anxiety   
4  im restless and restless its been a month now ...  Anxiety   

   sentiment_score sentiment_category  text_length  word_count  \
0           0.0000            neutral           10           3   
1          -0.7269           negative           61          10   
2          -0.7351           negative           75          14   
3          -0.4215           negative

### Training the Models

In this setup, the target variable is the predicted_emotion column: I am building models to predict the emotion label assigned to each record in the dataset. The available features range from numeric scores such as sentiment_score and text_length, to categorical details such as sentiment_category, to the text of the statement itself. The first round below uses only the numeric features; the categorical and text features are added in the tuning round.

In [14]:
# Define the target variable
target_col = 'predicted_emotion'

# Select numeric features (update the list if you want to include/exclude other columns)
feature_cols = [
    'sentiment_score', 'text_length', 'word_count', 
    'avg_word_length', 'stopword_count', 
    'first_person_pronoun_count', 'keyword_count'
]

X = data[feature_cols]
y = data[target_col]

print("Selected Features:")
print(X.head())
print("\nTarget Variable:")
print(y.head())

Selected Features:
   sentiment_score  text_length  word_count  avg_word_length  stopword_count  \
0           0.0000           10           3                3               1   
1          -0.7269           61          10                6               3   
2          -0.7351           75          14                5               5   
3          -0.4215           59          11                5               3   
4          -0.4939           66          14                4               8   

   first_person_pronoun_count  keyword_count  
0                           1              0  
1                           0              0  
2                           0              0  
3                           1              0  
4                           0              0  

Target Variable:
0     surprise
1         fear
2         fear
3    Uncertain
4         fear
Name: predicted_emotion, dtype: object


I split the dataset into 80% training and 20% testing so that the models have plenty of data to learn from while a separate portion remains for evaluating performance on unseen examples. An 80/20 split is a common choice, and here it leaves 9,230 rows in the test set, which is enough to give a realistic picture of how well each model generalizes to new data.

In [17]:
# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
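
The split above is a plain random split. Because the emotion labels are heavily imbalanced, one option worth considering is a stratified split, which keeps each class's share the same in the training and test sets. The following is a minimal sketch on synthetic data (the `X_demo` and `y_demo` names and their values are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (hypothetical values)
rng = np.random.default_rng(42)
y_demo = pd.Series(
    ["Uncertain"] * 700 + ["sadness"] * 200 + ["fear"] * 80 + ["joy"] * 20
)
X_demo = pd.DataFrame({"sentiment_score": rng.uniform(-1, 1, len(y_demo))})

# stratify=y_demo keeps each class's proportion the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Compare class shares in the two splits side by side
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

In the notebook itself this would amount to adding `stratify=y` to the existing `train_test_split` call. I have not rerun the models with it, so the reported numbers all come from the unstratified split.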

In my initial approach, I focused on key numeric features (sentiment score, text length, word count, average word length, stopword count, first-person pronoun count, and keyword count) to capture signals within the text that might indicate different emotions. I split the dataset into training and testing sets and applied feature scaling so that features measured on different scales are comparable, which matters most for distance- and margin-based models such as K-Nearest Neighbors and SVM. Then, I built five baseline models (Logistic Regression, Decision Tree, Random Forest, SVM, and K-Nearest Neighbors) to evaluate various algorithmic approaches and establish a performance benchmark. This process showed how much predictive power my engineered features carry on their own and set the stage for further refinements and the exploration of more complex models, like neural networks, later on.
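
A useful reference point for any benchmark on imbalanced labels is the accuracy of a classifier that always predicts the most frequent class. In the test split reported in this notebook, 6,553 of 9,230 rows are Uncertain, so such a classifier would score about 0.71. A minimal sketch with scikit-learn's `DummyClassifier`, using small made-up label arrays rather than the real data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels with roughly the imbalance seen in this dataset (hypothetical data)
y_train_demo = np.array(["Uncertain"] * 71 + ["sadness"] * 17 + ["fear"] * 12)
y_test_demo = np.array(["Uncertain"] * 7 + ["sadness"] * 2 + ["fear"] * 1)

# The dummy classifier ignores the features, so placeholders are enough
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predicts the most frequent label seen during fit
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train_demo, y_train_demo)

baseline_acc = accuracy_score(y_test_demo, dummy.predict(X_test_demo))
print(f"Majority-class baseline accuracy: {baseline_acc:.2f}")
```

Any model whose accuracy sits close to this number has learned little beyond the class frequencies, whatever its headline score.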

In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Define the models with a fixed random_state for reproducibility
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Support Vector Machine': SVC(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier()
}

# Dictionary to store model results
results = {}

for name, model in models.items():
    # Train the model using the scaled training data
    model.fit(X_train_scaled, y_train)
    
    # Make predictions on the test set
    y_pred = model.predict(X_test_scaled)
    
    # Evaluate model performance
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    
    # Store the results
    results[name] = {'accuracy': accuracy, 'report': report}
    
    # Print performance metrics
    print(f'{name} Accuracy: {accuracy:.4f}')
    print(report)
    print('-------------------------------------')

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Logistic Regression Accuracy: 0.7007
              precision    recall  f1-score   support

   Uncertain       0.71      0.97      0.82      6553
       anger       0.00      0.00      0.00       167
     disgust       0.00      0.00      0.00         9
        fear       0.41      0.02      0.04       668
         joy       0.00      0.00      0.00       232
     neutral       0.00      0.00      0.00        12
     sadness       0.31      0.06      0.10      1520
    surprise       0.00      0.00      0.00        69

    accuracy                           0.70      9230
   macro avg       0.18      0.13      0.12      9230
weighted avg       0.59      0.70      0.60      9230

-------------------------------------
Decision Tree Accuracy: 0.5804
              precision    recall  f1-score   support

   Uncertain       0.75      0.73      0.74      6553
       anger       0.02      0.02      0.02       167
     disgust       0.00      0.00      0.00         9
        fear       0.13   


Random Forest Accuracy: 0.6928
              precision    recall  f1-score   support

   Uncertain       0.72      0.94      0.82      6553
       anger       0.00      0.00      0.00       167
     disgust       0.00      0.00      0.00         9
        fear       0.29      0.04      0.07       668
         joy       0.21      0.04      0.07       232
     neutral       0.00      0.00      0.00        12
     sadness       0.36      0.14      0.20      1520
    surprise       0.00      0.00      0.00        69

    accuracy                           0.69      9230
   macro avg       0.20      0.15      0.15      9230
weighted avg       0.60      0.69      0.62      9230

-------------------------------------



Support Vector Machine Accuracy: 0.7100
              precision    recall  f1-score   support

   Uncertain       0.71      1.00      0.83      6553
       anger       0.00      0.00      0.00       167
     disgust       0.00      0.00      0.00         9
        fear       0.00      0.00      0.00       668
         joy       0.00      0.00      0.00       232
     neutral       0.00      0.00      0.00        12
     sadness       0.00      0.00      0.00      1520
    surprise       0.00      0.00      0.00        69

    accuracy                           0.71      9230
   macro avg       0.09      0.12      0.10      9230
weighted avg       0.50      0.71      0.59      9230

-------------------------------------
K-Nearest Neighbors Accuracy: 0.6817
              precision    recall  f1-score   support

   Uncertain       0.73      0.92      0.81      6553
       anger       0.00      0.00      0.00       167
     disgust       0.00      0.00      0.00         9
        fear     


In [44]:
# Fine-tuning the parameters to increase accuracy

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Re-create the feature matrix X with all desired columns
feature_cols = ['sentiment_score', 'text_length', 'word_count', 
                'avg_word_length', 'stopword_count', 
                'first_person_pronoun_count', 'keyword_count', 
                'sentiment_category', 'status', 'statement', 'keywords_found']
X = data[feature_cols]
y = data['predicted_emotion']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Work on copies and fill missing text values
X_train = X_train.copy()
X_test = X_test.copy()
for col in ['statement', 'keywords_found']:
    X_train[col] = X_train[col].fillna('')
    X_test[col] = X_test[col].fillna('')

# Define the preprocessor to handle each feature type
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['sentiment_score', 'text_length', 'word_count', 
                                     'avg_word_length', 'stopword_count', 
                                     'first_person_pronoun_count', 'keyword_count']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['sentiment_category', 'status']),
        ('txt_stmt', TfidfVectorizer(max_features=500), 'statement'),
        ('txt_kw', TfidfVectorizer(max_features=100), 'keywords_found')
    ])

# Define models with a more concise parameter grid for faster tuning
models = {
    'Logistic Regression': (
        LogisticRegression(max_iter=1000, random_state=42),
        {'classifier__C': [0.1, 1]}
    ),
    'Decision Tree': (
        DecisionTreeClassifier(random_state=42),
        {'classifier__max_depth': [None, 5], 
         'classifier__min_samples_split': [2, 5]}
    ),
    'Random Forest': (
        RandomForestClassifier(random_state=42),
        {'classifier__n_estimators': [50, 100], 
         'classifier__max_depth': [None, 10]}
    ),
    'Support Vector Machine': (
        SVC(random_state=42),
        {'classifier__C': [0.1, 1], 
         'classifier__kernel': ['linear', 'rbf']}
    ),
    'K-Nearest Neighbors': (
        KNeighborsClassifier(),
        {'classifier__n_neighbors': [3, 5], 
         'classifier__weights': ['uniform', 'distance']}
    )
}

results = {}

# Loop through each model, performing a simplified grid search and evaluation
for name, (model, param_grid) in models.items():
    print(f"--- Tuning and evaluating {name} ---")
    
    # Build the pipeline: preprocessor then classifier
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    
    # Use GridSearchCV with fewer parameter options and 3-fold CV for speed
    grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
    grid_search.fit(X_train, y_train)
    
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)
    
    acc = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    
    results[name] = {'accuracy': acc, 'report': report, 'best_params': grid_search.best_params_}
    
    print(f"{name} - Best Parameters: {grid_search.best_params_}")
    print(f"{name} - Test Accuracy: {acc:.4f}")
    print(report)
    print('-------------------------------------')

--- Tuning and evaluating Logistic Regression ---
Fitting 3 folds for each of 2 candidates, totalling 6 fits



Logistic Regression - Best Parameters: {'classifier__C': 1}
Logistic Regression - Test Accuracy: 0.7542
              precision    recall  f1-score   support

   Uncertain       0.78      0.93      0.85      6553
       anger       0.55      0.04      0.07       167
     disgust       0.00      0.00      0.00         9
        fear       0.63      0.34      0.44       668
         joy       0.53      0.07      0.13       232
     neutral       0.00      0.00      0.00        12
     sadness       0.64      0.40      0.49      1520
    surprise       0.00      0.00      0.00        69

    accuracy                           0.75      9230
   macro avg       0.39      0.22      0.25      9230
weighted avg       0.72      0.75      0.72      9230

-------------------------------------
--- Tuning and evaluating Decision Tree ---
Fitting 3 folds for each of 4 candidates, totalling 12 fits



Decision Tree - Best Parameters: {'classifier__max_depth': 5, 'classifier__min_samples_split': 2}
Decision Tree - Test Accuracy: 0.7249
              precision    recall  f1-score   support

   Uncertain       0.74      0.96      0.83      6553
       anger       0.00      0.00      0.00       167
     disgust       0.00      0.00      0.00         9
        fear       0.59      0.17      0.27       668
         joy       0.00      0.00      0.00       232
     neutral       0.00      0.00      0.00        12
     sadness       0.58      0.19      0.29      1520
    surprise       0.00      0.00      0.00        69

    accuracy                           0.72      9230
   macro avg       0.24      0.17      0.17      9230
weighted avg       0.66      0.72      0.66      9230

-------------------------------------
--- Tuning and evaluating Random Forest ---
Fitting 3 folds for each of 4 candidates, totalling 12 fits



Random Forest - Best Parameters: {'classifier__max_depth': None, 'classifier__n_estimators': 100}
Random Forest - Test Accuracy: 0.7432
              precision    recall  f1-score   support

   Uncertain       0.75      0.97      0.84      6553
       anger       0.00      0.00      0.00       167
     disgust       0.00      0.00      0.00         9
        fear       0.75      0.26      0.38       668
         joy       0.75      0.01      0.03       232
     neutral       0.00      0.00      0.00        12
     sadness       0.67      0.22      0.33      1520
    surprise       0.50      0.01      0.03        69

    accuracy                           0.74      9230
   macro avg       0.43      0.18      0.20      9230
weighted avg       0.72      0.74      0.68      9230

-------------------------------------
--- Tuning and evaluating Support Vector Machine ---
Fitting 3 folds for each of 4 candidates, totalling 12 fits



Support Vector Machine - Best Parameters: {'classifier__C': 1, 'classifier__kernel': 'linear'}
Support Vector Machine - Test Accuracy: 0.7494
              precision    recall  f1-score   support

   Uncertain       0.76      0.95      0.85      6553
       anger       0.00      0.00      0.00       167
     disgust       0.00      0.00      0.00         9
        fear       0.66      0.29      0.40       668
         joy       0.00      0.00      0.00       232
     neutral       0.00      0.00      0.00        12
     sadness       0.66      0.32      0.43      1520
    surprise       0.00      0.00      0.00        69

    accuracy                           0.75      9230
   macro avg       0.26      0.19      0.21      9230
weighted avg       0.70      0.75      0.70      9230

-------------------------------------
--- Tuning and evaluating K-Nearest Neighbors ---
Fitting 3 folds for each of 4 candidates, totalling 12 fits



K-Nearest Neighbors - Best Parameters: {'classifier__n_neighbors': 5, 'classifier__weights': 'uniform'}
K-Nearest Neighbors - Test Accuracy: 0.7109
              precision    recall  f1-score   support

   Uncertain       0.75      0.91      0.82      6553
       anger       0.00      0.00      0.00       167
     disgust       0.00      0.00      0.00         9
        fear       0.54      0.31      0.39       668
         joy       0.36      0.05      0.09       232
     neutral       0.00      0.00      0.00        12
     sadness       0.44      0.27      0.33      1520
    surprise       0.00      0.00      0.00        69

    accuracy                           0.71      9230
   macro avg       0.26      0.19      0.21      9230
weighted avg       0.66      0.71      0.67      9230

-------------------------------------


I decided to include XGBoost in my analysis because of its proven efficiency and effectiveness in handling complex datasets. Its ability to build ensembles of decision trees using gradient boosting allows for robust handling of non-linear relationships and interactions between features. Given the variety of models I was comparing, I wanted to see if XGBoost could provide a performance boost, especially in terms of accuracy and generalization, by leveraging its regularization techniques to reduce overfitting. This approach was particularly appealing as it helped me validate the strength of my engineered features and explore a more advanced method beyond the baseline algorithms.

In [32]:
# XGBoost Model Training

# Install xgboost if it is not already available in the environment
!pip install xgboost

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from xgboost import XGBClassifier

# Encode target variable to numeric labels (XGBClassifier expects integer class labels)
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Define the XGBoost classifier
# (use_label_encoder is omitted: recent xgboost versions ignore it and only emit a warning)
xgb_model = XGBClassifier(eval_metric='mlogloss', random_state=42)

# Train on the scaled numeric features from the first split (X_train_scaled),
# not on the full ColumnTransformer feature set used in the tuning round
xgb_model.fit(X_train_scaled, y_train_encoded)

# Make predictions on the test set
y_pred_xgb = xgb_model.predict(X_test_scaled)

# Decode predictions back to original labels if needed
y_pred_labels = label_encoder.inverse_transform(y_pred_xgb)

# Evaluate model performance
accuracy_xgb = accuracy_score(y_test_encoded, y_pred_xgb)
report_xgb = classification_report(y_test_encoded, y_pred_xgb, target_names=label_encoder.classes_)

print('XGBoost Accuracy: {:.4f}'.format(accuracy_xgb))
print(report_xgb)

XGBoost Accuracy: 0.7029
              precision    recall  f1-score   support

   Uncertain       0.72      0.97      0.83      6553
       anger       0.00      0.00      0.00       167
     disgust       0.00      0.00      0.00         9
        fear       0.36      0.02      0.04       668
         joy       0.24      0.03      0.05       232
     neutral       0.00      0.00      0.00        12
     sadness       0.37      0.08      0.13      1520
    surprise       1.00      0.01      0.03        69

    accuracy                           0.70      9230
   macro avg       0.34      0.14      0.13      9230
weighted avg       0.61      0.70      0.61      9230



In [34]:
# Fine-tuning the XGBoost parameters to improve the model's accuracy

from sklearn.model_selection import GridSearchCV

# Encode target variable to numeric labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Define the XGBoost classifier
# (use_label_encoder is omitted: recent xgboost versions ignore it and only emit a warning)
xgb_model = XGBClassifier(eval_metric='mlogloss', random_state=42)

# Define a parameter grid for fine-tuning (3 * 3 * 3 * 2 * 2 = 108 candidates)
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Set up GridSearchCV to search for the best hyperparameters
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    cv=3,  # You can adjust the number of folds
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Run grid search on the training data (scaled numeric features only)
grid_search.fit(X_train_scaled, y_train_encoded)

# Print the best parameters and best score from grid search
print("Best parameters found:", grid_search.best_params_)
print("Best cross-validation accuracy:", grid_search.best_score_)

# Evaluate the best estimator on the test set
best_xgb = grid_search.best_estimator_
y_pred_best = best_xgb.predict(X_test_scaled)

accuracy_best = accuracy_score(y_test_encoded, y_pred_best)
print("Tuned XGBoost Test Accuracy: {:.4f}".format(accuracy_best))
print("\nClassification Report:")
print(classification_report(y_test_encoded, y_pred_best, target_names=label_encoder.classes_))


Fitting 3 folds for each of 108 candidates, totalling 324 fits


Best parameters found: {'colsample_bytree': 0.8, 'learning_rate': 0.01, 'max_depth': 7, 'n_estimators': 300, 'subsample': 0.8}
Best cross-validation accuracy: 0.7095105400573779
Tuned XGBoost Test Accuracy: 0.7102

Classification Report:
              precision    recall  f1-score   support

   Uncertain       0.71      1.00      0.83      6553
       anger       0.00      0.00      0.00       167
     disgust       0.00      0.00      0.00         9
        fear       1.00      0.00      0.01       668
         joy       1.00      0.00      0.01       232
     neutral       0.00      0.00      0.00        12
     sadness       0.44      0.01      0.02      1520
    surprise       0.00      0.00      0.00        69

    accuracy                           0.71      9230
   macro avg       0.39      0.13      0.11      9230
weighted avg       0.68      0.71      0.59      9230




### Comparison Summary

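In the first round, I trained five baseline models (Logistic Regression, Decision Tree, Random Forest, SVM, and K-Nearest Neighbors) on the scaled numeric features only. This gave a quick benchmark, but it also exposed a class-imbalance problem: about 71% of the test rows (6,553 of 9,230) are labeled Uncertain, and the models mostly predict that class. The SVM, for example, predicted Uncertain for every row, so its 0.7100 accuracy simply equals the majority-class rate. The unconstrained Decision Tree scored lowest (0.5804), which may reflect overfitting, although I did not measure training accuracy to confirm it.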
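
Accuracy alone can hide this kind of behavior, so it helps to look at metrics that weight every class equally. The sketch below uses made-up labels (not the notebook's data) to show how a model that only ever predicts the majority class scores on accuracy, balanced accuracy, and macro-averaged F1:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Hypothetical labels: 90 majority-class rows and 10 minority-class rows
y_true = np.array(["Uncertain"] * 90 + ["fear"] * 10)

# A model that always predicts the majority class
y_pred = np.array(["Uncertain"] * 100)

acc = accuracy_score(y_true, y_pred)
# Balanced accuracy is the mean of per-class recall
bal_acc = balanced_accuracy_score(y_true, y_pred)
# zero_division=0 scores classes with no predicted samples as 0 without a warning
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy:          {acc:.2f}")
print(f"balanced accuracy: {bal_acc:.2f}")
print(f"macro F1:          {macro_f1:.2f}")
```

The same `zero_division` argument is accepted by `classification_report`, which is one way to silence the undefined-metric warnings that appear when a class is never predicted.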

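In the next round, I expanded the feature set by combining numeric, categorical, and text-based data through a ColumnTransformer, and applied TF-IDF vectorization to the text features. With this richer preprocessing pipeline and a small grid search for hyperparameter tuning, every model improved on its baseline accuracy: Logistic Regression from 0.7007 to 0.7542, Decision Tree from 0.5804 to 0.7249, Random Forest from 0.6928 to 0.7432, SVM from 0.7100 to 0.7494, and K-Nearest Neighbors from 0.6817 to 0.7109. Recall on the rarest classes (anger, disgust, neutral, surprise) nevertheless stayed at or near zero.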
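
To make the preprocessing step concrete, here is a reduced sketch of a ColumnTransformer on a three-row toy frame (the rows and the `toy_preprocessor` name are invented for illustration). Note that the text column is passed as a plain string rather than a list, because TfidfVectorizer expects a one-dimensional sequence of documents:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical frame with one column of each feature type
toy = pd.DataFrame({
    "sentiment_score": [0.0, -0.73, 0.42],
    "sentiment_category": ["neutral", "negative", "positive"],
    "statement": [
        "oh my gosh",
        "trouble sleeping restless heart",
        "feeling calm today",
    ],
})

toy_preprocessor = ColumnTransformer(transformers=[
    # Lists select 2-D column blocks for the scaler and the encoder
    ("num", StandardScaler(), ["sentiment_score"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sentiment_category"]),
    # A bare string selects a 1-D column, which is what TfidfVectorizer needs
    ("txt", TfidfVectorizer(), "statement"),
])

features = toy_preprocessor.fit_transform(toy)

# One scaled column + one column per category + one column per vocabulary term
print(features.shape)
```

The resulting matrix concatenates the three blocks column-wise, which is why the downstream classifier can treat numeric, categorical, and text signals as a single feature vector.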

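Comparing the two rounds of XGBoost, both of which used only the scaled numeric features, the initial run served as a baseline while the tuned version, despite a much longer runtime (324 fits), yielded only a slight accuracy gain, from 0.7029 to 0.7102 (about 0.7 percentage points). At the same time, macro-averaged F1 fell from 0.13 to 0.11, because the tuned model predicts Uncertain for nearly every row. This suggests that tuning alone brings little without further feature engineering or a way of addressing class imbalance.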
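
One common way to address class imbalance is to weight each training row inversely to its class frequency. The sketch below illustrates the idea with scikit-learn's `compute_sample_weight` on synthetic data. It uses Logistic Regression so that it runs without extra dependencies, and the data and variable names are assumptions for illustration only. `XGBClassifier.fit` also accepts a `sample_weight` argument, so the same weights could be passed there, though I have not tested that on this dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical imbalanced data: 90 rows of class 0 and 10 rows of class 1
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, (90, 2)),
    rng.normal(1.5, 1.0, (10, 2)),
])
y_demo = np.array([0] * 90 + [1] * 10)

# 'balanced' gives each row the weight n_samples / (n_classes * class_count)
weights = compute_sample_weight(class_weight="balanced", y=y_demo)
print("weight for a class-0 row:", round(weights[0], 4))
print("weight for a class-1 row:", round(weights[-1], 4))

# Rows from the rare class now contribute more to the loss
clf = LogisticRegression(max_iter=1000)
clf.fit(X_demo, y_demo, sample_weight=weights)
```

Reweighting usually trades some overall accuracy for better minority-class recall, so it is best judged with macro-averaged metrics rather than accuracy alone.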

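Overall, the tuned Logistic Regression (0.7542) and linear-kernel SVM (0.7494) had the highest test accuracy, followed closely by Random Forest (0.7432). XGBoost reached 0.7102, but it was trained on the numeric features only, so it is not directly comparable with the tuned models that also used the categorical and text features. Logistic Regression in particular remains attractive because it combines the best accuracy here with a low computational cost. Each approach has its strengths and limitations, and further fine-tuning, class rebalancing, or ensemble methods may be required to improve predictive performance, especially on the minority classes.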
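
For a side-by-side view, the test accuracies and macro-averaged F1 scores printed in the tuned-model outputs above can be collected into one table. The numbers below are copied from those reports; only the `summary` frame and its column names are my own:

```python
import pandas as pd

# (model, test accuracy, macro-averaged F1) as printed in the reports above
rows = [
    ("Logistic Regression (tuned)", 0.7542, 0.25),
    ("Decision Tree (tuned)", 0.7249, 0.17),
    ("Random Forest (tuned)", 0.7432, 0.20),
    ("Support Vector Machine (tuned)", 0.7494, 0.21),
    ("K-Nearest Neighbors (tuned)", 0.7109, 0.21),
    ("XGBoost (tuned, numeric features only)", 0.7102, 0.11),
]

summary = (
    pd.DataFrame(rows, columns=["model", "test_accuracy", "macro_f1"])
    .sort_values("test_accuracy", ascending=False)
    .reset_index(drop=True)
)
print(summary.to_string(index=False))
```

Sorting by `macro_f1` instead gives a ranking that reflects performance across all emotion classes rather than mostly the Uncertain class.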