# Emotion Classification Using Random Forest with Hyperparameter Tuning and Cross-Validation

In this Colab notebook, we train a **Random Forest Classifier** for emotion classification on a balanced, augmented dataset. To improve performance and generalization, we apply **hyperparameter tuning** using **GridSearchCV** with 5-fold cross-validation. The search selects the combination of Random Forest parameters with the highest mean cross-validated accuracy on the training set.

### Key Highlights:
1. **Dataset Splitting:**
   - The dataset is divided into training (60%), validation (20%), and testing (20%) subsets.
   - Training data is used for model optimization, validation data for interim evaluation, and test data for final assessment.

2. **TF-IDF Vectorization:**
   - Text data is transformed into numerical features using **TF-IDF Vectorizer**, which weights each term by how often it appears in a document and down-weights terms that appear in many documents.

3. **Hyperparameter Tuning:**
   - GridSearchCV tests combinations of key Random Forest parameters:
     - **n_estimators**: Number of trees in the forest.
     - **max_depth**: Maximum depth of each tree.
     - **min_samples_split**: Minimum samples required to split an internal node.
     - **min_samples_leaf**: Minimum samples required at a leaf node.
     - **bootstrap**: Whether bootstrap sampling is used.

4. **Performance Evaluation:**
   - The optimized model is evaluated on:
     - Validation data: A held-out check of the tuned model after the search (the tuning itself uses 5-fold cross-validation on the training set).
     - Test data: To measure performance on unseen examples.

5. **Confusion Matrix Analysis:**
   - A confusion matrix visualizes the model's predictions on the test set, offering insights into its classification strengths and weaknesses.
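
To make item 5 concrete, here is a minimal sketch of how a confusion matrix is read. The three label names and the toy predictions are hypothetical and are not taken from the notebook's dataset; the sketch only illustrates that rows are true classes, columns are predicted classes, and the diagonal holds the correct predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels, NOT from the notebook's dataset
labels = ["anger", "joy", "sad"]
y_true = ["joy", "joy", "sad", "anger", "sad"]
y_pred = ["joy", "sad", "sad", "anger", "sad"]

# Rows are true classes, columns are predicted classes, both in `labels` order
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Diagonal entries are correct predictions, so accuracy = trace / total
accuracy = np.trace(cm) / cm.sum()
print("Accuracy:", accuracy)

# Per-class recall: correct predictions for a class / true examples of that class
recall = cm.diagonal() / cm.sum(axis=1)
print("Recall per class:", dict(zip(labels, recall)))
```

In this toy case the only off-diagonal count sits in the "joy" row and "sad" column, meaning one true "joy" example was predicted as "sad".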



# Step 1: Import Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Step 2: Load the Augmented Dataset

In [None]:
# Load the augmented dataset
# Purpose: Load the dataset containing balanced text data and emotion labels.
df = pd.read_csv('Augmented_Emotion_Dataset.csv')

# Split the dataset into features (X) and labels (y)
# X: Cleaned text data
# y: Emotion labels
X = df['cleaned_text']
y = df['EMOTION']

# Step 3: Split the Data into Train, Validation, and Test Sets

In [None]:
# Purpose: Divide the dataset into training, validation, and testing subsets.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
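
The two-stage split above yields the 60/20/20 proportions because the second call halves the 40% held out by the first. A small sketch with hypothetical stand-in data (100 made-up texts and four made-up class ids, not the notebook's dataset) shows the arithmetic:

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 texts across 4 made-up classes
X_demo = [f"text {i}" for i in range(100)]
y_demo = [i % 4 for i in range(100)]

# First split: 60% train, 40% held out
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)

# Second split: halve the held-out 40% into validation and test (20% each)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)

sizes = (len(X_tr), len(X_va), len(X_te))
print(sizes)
```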

# Step 4: Text Vectorization using TF-IDF

In [None]:
# Purpose: Transform text data into numerical feature vectors for machine learning input.
tfidf = TfidfVectorizer(max_features=5000)  # Keep at most the 5000 most frequent terms
X_train_tfidf = tfidf.fit_transform(X_train)  # Learn vocabulary and IDF weights from training data only
X_val_tfidf = tfidf.transform(X_val)         # Transform validation data with the fitted vectorizer
X_test_tfidf = tfidf.transform(X_test)       # Transform testing data with the fitted vectorizer
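
The fit-on-train, transform-elsewhere pattern above means that words seen only in validation or test text get no feature column. A minimal sketch on hypothetical toy sentences (not from the notebook's dataset, and using default `TfidfVectorizer` settings rather than `max_features=5000`) illustrates this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts, NOT from the notebook's dataset
train_texts = ["i am happy", "i am sad", "happy happy joy"]
new_texts = ["angry and sad"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_texts)  # learns the vocabulary and IDF weights
X_new = vec.transform(new_texts)       # reuses them; nothing is refit

# The default token pattern drops one-letter tokens such as "i"
vocab = sorted(vec.vocabulary_)
print(vocab)
print(X_tr.shape, X_new.shape)

# "angry" and "and" were never seen during fit, so only "sad" is non-zero
print(X_new.nnz)
```

Both matrices share the same column count, which is what lets a model trained on `X_tr` score `X_new`.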

# Step 5: Tune, Train, and Evaluate the Random Forest Model

In [None]:
# Define the hyperparameter grid for GridSearchCV
# Purpose: Specify the range of parameters for tuning the Random Forest model.
param_grid = {
    'n_estimators': [50, 100, 200],      # Number of trees in the forest
    'max_depth': [10, 20, None],         # Maximum depth of each tree
    'min_samples_split': [2, 5, 10],     # Minimum samples to split an internal node
    'min_samples_leaf': [1, 2, 4],       # Minimum samples required at a leaf node
    'bootstrap': [True, False]           # Whether bootstrap sampling is used
}

# Perform hyperparameter tuning with GridSearchCV
# Purpose: Identify the best combination of hyperparameters using cross-validation.
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy', verbose=2)
grid_search.fit(X_train_tfidf, y_train)

# Retrieve and display the best parameters and cross-validation score
# Purpose: Summarize the hyperparameter tuning results.
best_rf_model = grid_search.best_estimator_  # Best model, refit on the full training set
print("Best Parameters:", grid_search.best_params_)  # Print the best hyperparameters
print("Best Cross-Validation Accuracy:", grid_search.best_score_)  # Mean accuracy across the 5 folds

# Evaluate the best model on the validation set
# Purpose: Assess the performance of the optimized model on validation data.
y_val_pred = best_rf_model.predict(X_val_tfidf)  # Predict on validation data
val_accuracy = accuracy_score(y_val, y_val_pred)  # Calculate validation accuracy
print("\nValidation Accuracy:", val_accuracy)
print("Classification Report on Validation Data:\n", classification_report(y_val, y_val_pred))

# Evaluate the best model on the test set
# Purpose: Measure final performance on unseen test data.
y_test_pred = best_rf_model.predict(X_test_tfidf)  # Predict on test data
test_accuracy = accuracy_score(y_test, y_test_pred)  # Calculate test accuracy
print("\nTest Accuracy:", test_accuracy)
print("Classification Report on Test Data:\n", classification_report(y_test, y_test_pred))


# Confusion matrix visualization
# Purpose: Provide detailed insights into model predictions for the test set.

def plot_confusion_matrix(y_true, y_pred, labels, title="Confusion Matrix"):
    """
    Plot the confusion matrix to analyze prediction results.
    - y_true: True labels
    - y_pred: Predicted labels
    - labels: Unique class labels
    - title: Title for the confusion matrix plot
    """
    cm = confusion_matrix(y_true, y_pred, labels=labels)  # Generate the confusion matrix
    plt.figure(figsize=(10, 7))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)  # Plot heatmap
    plt.title(title)
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.show()

# Call the function to plot the confusion matrix for the test set
plot_confusion_matrix(y_test, y_test_pred, labels=np.unique(y), title="Test Confusion Matrix with Best Parameters")
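
The cost of the grid search in this step follows directly from the size of `param_grid`. As a rough sketch, scikit-learn's `ParameterGrid` can count the candidates before any model is trained; the value lists below are copied from the cell above, and the fold count of 5 matches `cv=5`.

```python
from sklearn.model_selection import ParameterGrid

# Same value lists as the notebook's param_grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 * 3 * 2 combinations
n_folds = 5                                    # matches cv=5 in GridSearchCV
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)
```

These counts agree with the first line of the verbose log. With the default `refit=True`, GridSearchCV also performs one additional fit of the best candidate on the full training set, which that line does not count.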


Fitting 5 folds for each of 162 candidates, totalling 810 fits
[CV] END bootstrap=True, max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.6s
[CV] END bootstrap=True, max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.5s
[CV] END bootstrap=True, max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.7s
[CV] END bootstrap=True, max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.7s
[CV] END bootstrap=True, max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.6s
[CV] END bootstrap=True, max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.9s
[CV] END bootstrap=True, max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.9s
[CV] END bootstrap=True, max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   3.3s
[CV] E