# Experiment 4: Handling Imbalanced Data in Sentiment Classification
This notebook compares techniques for handling class imbalance in sentiment classification, using TF-IDF features over unigrams through trigrams and a Random Forest classifier. Results and metrics are logged to MLflow for analysis.

In [None]:
# Configure AWS credentials before running the notebook.
# This is normally done once in a terminal with `aws configure`.
# To run it from a notebook cell instead, uncomment the next line
# (entering credentials in a notebook is not recommended for production use).
# !aws configure

## Set Up MLflow Tracking
Load environment variables and set the MLflow tracking URI to log experiment results to the remote server.

In [1]:
# Point MLflow at the remote tracking server
from dotenv import load_dotenv
load_dotenv()  # This loads environment variables from .env

import os
import mlflow

mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI"))
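
`os.environ.get` returns `None` rather than raising when the variable is missing, so a misconfigured `.env` can go unnoticed. A minimal sketch of a guard, assuming a hypothetical helper named `resolve_tracking_uri` (it is not part of MLflow or of this notebook):

```python
import os


def resolve_tracking_uri(env=None, var_name="MLFLOW_TRACKING_URI"):
    """Return the tracking URI from `env`, raising if it is missing or empty.

    `resolve_tracking_uri` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    uri = env.get(var_name)
    if not uri:
        raise RuntimeError(
            f"{var_name} is not set; add it to .env or export it before running."
        )
    return uri


# Possible usage in place of the direct os.environ.get(...) call:
# mlflow.set_tracking_uri(resolve_tracking_uri())
```

Failing early here makes it harder to log a full set of runs to an unintended location.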

## Set or Create MLflow Experiment
Set the experiment name in MLflow. If it doesn't exist, MLflow will create it on the tracking server.

In [2]:
# Set or create an experiment
mlflow.set_experiment("Exp 4 - Handling Imbalanced Data")

2025/07/11 04:57:04 INFO mlflow.tracking.fluent: Experiment with name 'Exp 4 - Handling Imbalanced Data' does not exist. Creating a new experiment.


<Experiment: artifact_location='s3://mlflow-bucket-2025/535067134122221828', creation_time=1752235024257, experiment_id='535067134122221828', last_update_time=1752235024257, lifecycle_stage='active', name='Exp 4 - Handling Imbalanced Data', tags={}>

## Import Required Libraries
Import libraries for resampling, text vectorization, model training, evaluation, visualization, and experiment tracking.

In [4]:
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import os

## Load Processed Dataset
Load the preprocessed Reddit sentiment dataset for feature engineering and modeling.

In [5]:
df = pd.read_csv('reddit_preprocessing.csv').dropna(subset=['clean_comment'])
df.shape

(36662, 2)
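
Before choosing an imbalance strategy, it can help to quantify how skewed the labels are. The sketch below uses a small toy frame rather than `reddit_preprocessing.csv`; the label values `-1`, `0`, and `1` and the example comments are assumptions made for illustration. The same `value_counts` calls could be run on `df['category']`.

```python
import pandas as pd

# Toy stand-in for the real data; the labels -1/0/1 are an assumption.
toy_df = pd.DataFrame({
    "clean_comment": ["good", "great", "nice", "meh", "okay", "bad"],
    "category": [1, 1, 1, 0, 0, -1],
})

# Absolute and relative class frequencies.
counts = toy_df["category"].value_counts()
shares = toy_df["category"].value_counts(normalize=True)

# Ratio of the largest class to the smallest class.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(3))
print(f"Majority-to-minority ratio: {imbalance_ratio:.1f}")
```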

## Define Imbalanced Data Experiment Function
Create a function that runs one experiment per imbalance handling method (class weights, SMOTE oversampling, ADASYN, random undersampling, SMOTEENN) using TF-IDF trigram features and Random Forest. Resampling is applied to the training split only. Metrics and artifacts are logged to MLflow.
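
For the class-weights option, scikit-learn documents `class_weight='balanced'` as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula by hand on toy labels, to show how much more weight a rare class receives. The helper name `balanced_class_weights` and the toy labels are illustrative assumptions, not part of the notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Compute n_samples / (n_classes * class_count) for every class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (assumed for illustration): three of class 1, two of 0, one of -1.
toy_labels = [1, 1, 1, 0, 0, -1]
weights = balanced_class_weights(toy_labels)
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

No rows are added or removed with this option; only the contribution of each sample to the training objective changes.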

In [6]:
# Step 1: Function to run the experiment
def run_imbalanced_experiment(imbalance_method):
    ngram_range = (1, 3)  # Unigrams through trigrams
    max_features = 10000  # Upper bound on the TF-IDF vocabulary size

    # Step 2: Train-test split before vectorization and resampling
    X_train, X_test, y_train, y_test = train_test_split(
        df['clean_comment'], df['category'],
        test_size=0.2, random_state=42, stratify=df['category']
    )

    # Step 3: Vectorization using TF-IDF, fit on training data only
    vectorizer = TfidfVectorizer(ngram_range=ngram_range, max_features=max_features)
    X_train_vec = vectorizer.fit_transform(X_train)  # Fit on training data
    X_test_vec = vectorizer.transform(X_test)  # Transform test data

    # Step 4: Handle class imbalance based on the selected method (only applied to the training set)
    if imbalance_method == 'class_weights':
        # Use class_weight in Random Forest
        class_weight = 'balanced'
    else:
        class_weight = None  # Do not apply class_weight if using resampling

        # Resampling Techniques (only apply to the training set)
        if imbalance_method == 'oversampling':
            smote = SMOTE(random_state=42)
            X_train_vec, y_train = smote.fit_resample(X_train_vec, y_train)
        elif imbalance_method == 'adasyn':
            adasyn = ADASYN(random_state=42)
            X_train_vec, y_train = adasyn.fit_resample(X_train_vec, y_train)
        elif imbalance_method == 'undersampling':
            rus = RandomUnderSampler(random_state=42)
            X_train_vec, y_train = rus.fit_resample(X_train_vec, y_train)
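        elif imbalance_method == 'smote_enn':
            smote_enn = SMOTEENN(random_state=42)
            X_train_vec, y_train = smote_enn.fit_resample(X_train_vec, y_train)
        else:
            # Fail loudly instead of silently training without imbalance handling
            raise ValueError(f"Unknown imbalance_method: {imbalance_method!r}")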

    # Step 5: Define and train a Random Forest model
    with mlflow.start_run() as run:
        # Set tags for the experiment and run
        mlflow.set_tag("mlflow.runName", f"Imbalance_{imbalance_method}_RandomForest_TFIDF_Trigrams")
        mlflow.set_tag("experiment_type", "imbalance_handling")
        mlflow.set_tag("model_type", "RandomForestClassifier")

        # Add a description
        mlflow.set_tag("description", f"RandomForest with TF-IDF Trigrams, imbalance handling method={imbalance_method}")

        # Log vectorizer parameters
        mlflow.log_param("vectorizer_type", "TF-IDF")
        mlflow.log_param("ngram_range", ngram_range)
        mlflow.log_param("vectorizer_max_features", max_features)

        # Log Random Forest parameters
        n_estimators = 200
        max_depth = 15

        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("imbalance_method", imbalance_method)

        # Initialize and train the model
        model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42, class_weight=class_weight)
        model.fit(X_train_vec, y_train)

        # Step 6: Make predictions and log metrics
        y_pred = model.predict(X_test_vec)

        # Log accuracy
        accuracy = accuracy_score(y_test, y_pred)
        mlflow.log_metric("accuracy", accuracy)

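        # Log per-class and averaged metrics from the classification report
        # (the scalar "accuracy" entry is skipped by the isinstance check)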
        classification_rep = classification_report(y_test, y_pred, output_dict=True)
        for label, metrics in classification_rep.items():
            if isinstance(metrics, dict):
                for metric, value in metrics.items():
                    mlflow.log_metric(f"{label}_{metric}", value)

        # Log confusion matrix, labelling both axes with the class labels
        class_labels = list(model.classes_)
        conf_matrix = confusion_matrix(y_test, y_pred, labels=class_labels)
        plt.figure(figsize=(8, 6))
        sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues",
                    xticklabels=class_labels, yticklabels=class_labels)
        plt.xlabel("Predicted")
        plt.ylabel("Actual")
        plt.title(f"Confusion Matrix: TF-IDF Trigrams, Imbalance={imbalance_method}")
        # Create the output directory if it doesn't exist
        output_dir = "results/notebook_5"
        os.makedirs(output_dir, exist_ok=True)
        confusion_matrix_path = os.path.join(output_dir, f"confusion_matrix_{imbalance_method}.png")
        plt.savefig(confusion_matrix_path)
        mlflow.log_artifact(confusion_matrix_path)
        plt.close()

        # Log the model
        mlflow.sklearn.log_model(model, f"random_forest_model_tfidf_trigrams_imbalance_{imbalance_method}")

# Step 7: Run experiments for different imbalance methods
imbalance_methods = ['class_weights', 'oversampling', 'adasyn', 'undersampling', 'smote_enn']

for method in imbalance_methods:
    run_imbalanced_experiment(method)



🏃 View run Imbalance_class_weights_RandomForest_TFIDF_Trigrams at: http://ec2-44-249-137-23.us-west-2.compute.amazonaws.com:5000/#/experiments/535067134122221828/runs/7414cd66732143e79aa38868bd204875
🧪 View experiment at: http://ec2-44-249-137-23.us-west-2.compute.amazonaws.com:5000/#/experiments/535067134122221828
🏃 View run Imbalance_oversampling_RandomForest_TFIDF_Trigrams at: http://ec2-44-249-137-23.us-west-2.compute.amazonaws.com:5000/#/experiments/535067134122221828/runs/0e725fa4c29e44c8941f302476224f4d
🧪 View experiment at: http://ec2-44-249-137-23.us-west-2.compute.amazonaws.com:5000/#/experiments/535067134122221828
🏃 View run Imbalance_adasyn_RandomForest_TFIDF_Trigrams at: http://ec2-44-249-137-23.us-west-2.compute.amazonaws.com:5000/#/experiments/535067134122221828/runs/372db707b5a747728e3621e1c96163eb
🧪 View experiment at: http://ec2-44-249-137-23.us-west-2.compute.amazonaws.com:5000/#/experiments/535067134122221828
🏃 View run Imbalance_undersampling_RandomForest_TFIDF_Trigrams at: http://ec2-44-249-137-23.us-west-2.compute.amazonaws.com:5000/#/experiments/535067134122221828/runs/6ea3029613e94dd7b97ba219cd849886
🧪 View experiment at: http://ec2-44-249-137-23.us-west-2.compute.amazonaws.com:5000/#/experiments/535067134122221828
🏃 View run Imbalance_smote_enn_RandomForest_TFIDF_Trigrams at: http://ec2-44-249-137-23.us-west-2.compute.amazonaws.com:5000/#/experiments/535067134122221828/runs/5eb7ba5b5aba4f58b25a510149cecd24
🧪 View experiment at: http://ec2-44-249-137-23.us-west-2.compute.amazonaws.com:5000/#/experiments/535067134122221828
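
The resampling branches of `run_imbalanced_experiment` change only the training split, so the test set keeps its original class distribution. To make the undersampling idea concrete, here is a hand-rolled sketch that downsamples every class to the size of the smallest one. It illustrates the idea behind `RandomUnderSampler` with its default settings; it is not the imbalanced-learn implementation, and the function name and toy data are assumptions.

```python
import random
from collections import Counter, defaultdict


def random_undersample(rows, labels, seed=42):
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Every class is reduced to the minority class size.
    target = min(len(idxs) for idxs in by_class.values())
    keep = []
    for idxs in by_class.values():
        keep.extend(rng.sample(idxs, target))
    keep.sort()  # preserve the original ordering of the kept rows

    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy training split (assumed for illustration); the test split is never resampled.
train_rows = ["a", "b", "c", "d", "e", "f"]
train_labels = [1, 1, 1, 0, 0, -1]
rows_bal, labels_bal = random_undersample(train_rows, train_labels)
print(Counter(labels_bal))
```

The trade-off is visible even at this scale: the classes end up balanced, but half of the toy training rows are discarded.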


## Experiment Summary and Next Steps
This notebook runs one experiment per imbalance handling method to compare their effect on sentiment classification. Results, metrics, and confusion matrices are logged to MLflow for analysis. Review the MLflow UI to compare model performance and select the best imbalance handling strategy for your task.
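
When comparing runs, it helps to know how the logged metric names are formed. The loop in `run_imbalanced_experiment` flattens the nested dictionary from `classification_report(..., output_dict=True)` into names such as `1_precision` or `macro avg_f1-score`. The sketch below isolates that step as a hypothetical helper; the example report is hand-written with placeholder numbers, not results from this experiment.

```python
def flatten_report(report):
    """Turn a nested classification report dict into flat metric names."""
    flat = {}
    for label, metrics in report.items():
        if isinstance(metrics, dict):  # skips the scalar "accuracy" entry
            for metric, value in metrics.items():
                flat[f"{label}_{metric}"] = value
    return flat


# Hand-written dict shaped like classification_report(..., output_dict=True).
# The numbers are placeholders, not results from this experiment.
example_report = {
    "1": {"precision": 0.5, "recall": 0.25, "f1-score": 0.3333, "support": 4},
    "accuracy": 0.5,
    "macro avg": {"precision": 0.5, "recall": 0.25, "f1-score": 0.3333, "support": 4},
}

for name, value in flatten_report(example_report).items():
    print(name, value)
```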

From our analysis so far, the best configuration is:

vectorizer=TF-IDF, ngram_range=(1, 3), max_features=1000, imbalance_method='oversampling'
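
Runs can also be compared programmatically instead of through the UI. `mlflow.search_runs` returns a pandas DataFrame whose columns are prefixed with `metrics.`, `params.`, and `tags.`. The sketch below ranks such a frame by macro-averaged F1, which is less dominated by the majority class than accuracy. The helper `rank_runs`, the method names, and every number in the placeholder frame are assumptions made for illustration; none of them are results from this experiment.

```python
import pandas as pd


def rank_runs(runs, metric="metrics.macro avg_f1-score"):
    """Sort runs by `metric`, best first, keeping a few readable columns."""
    cols = ["params.imbalance_method", "metrics.accuracy"]
    if metric not in cols:
        cols.append(metric)
    return runs[cols].sort_values(metric, ascending=False).reset_index(drop=True)


# Placeholder frame shaped like the output of mlflow.search_runs(...).
# All values are made up for illustration.
placeholder_runs = pd.DataFrame({
    "params.imbalance_method": ["method_a", "method_b", "method_c"],
    "metrics.accuracy": [0.10, 0.30, 0.20],
    "metrics.macro avg_f1-score": [0.15, 0.05, 0.25],
})

ranked = rank_runs(placeholder_runs)
print(ranked)

# With a reachable tracking server, the real frame could come from:
# runs = mlflow.search_runs(experiment_names=["Exp 4 - Handling Imbalanced Data"])
```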