# Experiment 5: ML Algorithms with Hyperparameter Tuning (Optuna & MLflow)
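This notebook is set up to compare ML algorithms for sentiment classification; the cells below tune XGBoost. Features are TF-IDF weights over unigrams, bigrams, and trigrams (`ngram_range=(1, 3)`), and the training set is balanced with SMOTE. Hyperparameters are tuned with Optuna, and the best model and its metrics are logged to MLflow for analysis.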

In [None]:
# Configure AWS credentials with `aws configure` (the MLflow artifacts live in S3).
# This is normally done once in a terminal. To run it from the notebook instead,
# uncomment the next line (not recommended for production use).
# !aws configure

## Set Up MLflow Tracking
Load environment variables and set the MLflow tracking URI to log experiment results to the remote server.

In [1]:
# Load MLFLOW_TRACKING_URI from .env and point MLflow at the tracking server
from dotenv import load_dotenv
load_dotenv()  # This loads environment variables from .env

import os
import mlflow

mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI"))
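
If `MLFLOW_TRACKING_URI` is missing from `.env`, `os.environ.get` returns `None` and the problem may only show up later, when runs do not appear on the remote server. A minimal sketch of a fail-fast check, assuming a hypothetical helper named `require_env` (it is not part of MLflow or python-dotenv):

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty.

    `require_env` is a hypothetical helper written for illustration only.
    """
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it")
    return value


# Intended use in the cell above (commented out so this sketch runs standalone):
# mlflow.set_tracking_uri(require_env("MLFLOW_TRACKING_URI"))
```

Passing a plain dictionary as `env` makes the helper easy to check without touching the real environment.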

## Set or Create MLflow Experiment
Set the experiment name in MLflow. If it doesn't exist, MLflow will create it on the tracking server.

In [2]:
# Set or create an experiment
mlflow.set_experiment("Exp 5 - ML Algos with HP Tuning")

<Experiment: artifact_location='s3://mlflow-bucket-2025/534886635769708033', creation_time=1752235853681, experiment_id='534886635769708033', last_update_time=1752235853681, lifecycle_stage='active', name='Exp 5 - ML Algos with HP Tuning', tags={}>

## Import Required Libraries
Import libraries for hyperparameter tuning (Optuna), model training, evaluation, feature engineering, balancing, visualization, and experiment tracking.

In [3]:
import optuna
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


## Load Processed Dataset
Load the preprocessed Reddit sentiment dataset for feature engineering and modeling.

In [4]:
df = pd.read_csv('reddit_preprocessing.csv').dropna()
df.shape

(36662, 2)

## Preprocessing and Feature Engineering
Remap class labels, drop rows with missing labels, split the data, vectorize the text with TF-IDF over 1- to 3-grams (fitted on the training split only), and balance the training classes with SMOTE.
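
To make the `ngram_range=(1, 3)` setting concrete: it asks for every run of one, two, or three consecutive tokens, not trigrams alone. A rough stdlib-only sketch follows; the `word_ngrams` helper is hypothetical, and plain whitespace splitting stands in for `TfidfVectorizer`'s own tokenizer, which differs in details (for example, its default pattern drops single-character tokens).

```python
from typing import List, Tuple


def word_ngrams(text: str, ngram_range: Tuple[int, int] = (1, 3)) -> List[str]:
    """List all word n-grams of `text` for every n in the inclusive range."""
    tokens = text.lower().split()  # simplification: whitespace tokenization
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of n tokens across the sentence
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("not a good movie"))
# ['not', 'a', 'good', 'movie', 'not a', 'a good', 'good movie',
#  'not a good', 'a good movie']
```

Longer n-grams such as "not a good" can capture negation that single words miss, which is one common motivation for going beyond unigrams in sentiment tasks.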

In [5]:
import plotly
import nbformat

# Step 1: Remap the class labels from [-1, 0, 1] to [2, 0, 1]
# (XGBoost expects class labels numbered 0..n_classes-1)
df['category'] = df['category'].map({-1: 2, 0: 0, 1: 1})

# Step 2: Remove rows where the target label (category) is NaN
df = df.dropna(subset=['category'])

# Step 3: TF-IDF settings
ngram_range = (1, 3)  # Unigrams, bigrams, and trigrams
max_features = 10000  # Keep the 10,000 most frequent n-grams

# Step 4: Train-test split before vectorization and resampling
X_train, X_test, y_train, y_test = train_test_split(
    df['clean_comment'], df['category'],
    test_size=0.2, random_state=42, stratify=df['category'],
)

# Step 5: Vectorization using TF-IDF, fit on training data only
vectorizer = TfidfVectorizer(ngram_range=ngram_range, max_features=max_features)
X_train_vec = vectorizer.fit_transform(X_train)  # Fit on training data
X_test_vec = vectorizer.transform(X_test)  # Transform test data

# Balance the training classes with SMOTE; the test set is left untouched
smote = SMOTE(random_state=42)
X_train_vec, y_train = smote.fit_resample(X_train_vec, y_train)

# Function to log results in MLflow
def log_mlflow(model_name, model, X_train, X_test, y_train, y_test):
    with mlflow.start_run():
        # Name and tag the run
        mlflow.set_tag("mlflow.runName", f"{model_name}_SMOTE_TFIDF_Trigrams")
        mlflow.set_tag("experiment_type", "algorithm_comparison")

        # Log algorithm name as a parameter
        mlflow.log_param("algo_name", model_name)

        # Train model
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        # Log accuracy
        accuracy = accuracy_score(y_test, y_pred)
        mlflow.log_metric("accuracy", accuracy)

        # Log classification report
        classification_rep = classification_report(y_test, y_pred, output_dict=True)
        for label, metrics in classification_rep.items():
            if isinstance(metrics, dict):
                for metric, value in metrics.items():
                    mlflow.log_metric(f"{label}_{metric}", value)

        # Log the model
        mlflow.sklearn.log_model(model, f"{model_name}_model")


# Step 6: Optuna objective function for XGBoost
# Note: each trial is scored on the test split, so the best accuracy reported by
# the study is an optimistic estimate; a separate validation split would avoid this.
def objective_xgboost(trial):
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    learning_rate = trial.suggest_float('learning_rate', 1e-4, 1e-1, log=True)  # log-scale search
    max_depth = trial.suggest_int('max_depth', 3, 10)

    model = XGBClassifier(n_estimators=n_estimators, learning_rate=learning_rate, max_depth=max_depth, random_state=42)
    model.fit(X_train_vec, y_train)
    return accuracy_score(y_test, model.predict(X_test_vec))


# Step 7: Run Optuna for XGBoost, log the best model only
def run_optuna_experiment():
    study = optuna.create_study(direction="maximize")
    study.optimize(objective_xgboost, n_trials=30)

    # Rebuild a model from the best trial's parameters; log_mlflow trains and logs it
    best_model = XGBClassifier(**study.best_params, random_state=42)

    # Log only the best model to MLflow under the algorithm name "XGBoost"
    log_mlflow("XGBoost", best_model, X_train_vec, X_test_vec, y_train, y_test)

    # Plot how the best accuracy evolved across trials
    optuna.visualization.plot_optimization_history(study).show()

    # Plot the relative importance of each hyperparameter
    optuna.visualization.plot_param_importances(study).show()

# Run the experiment for XGBoost
run_optuna_experiment()

[I 2025-07-11 05:52:20,315] A new study created in memory with name: no-name-e342f4b4-156d-4396-9e80-befadd673c05
[I 2025-07-11 05:53:13,648] Trial 0 finished with value: 0.6193917905359334 and parameters: {'n_estimators': 262, 'learning_rate': 0.0009704686227421479, 'max_depth': 9}. Best is trial 0 with value: 0.6193917905359334.
[I 2025-07-11 05:53:36,343] Trial 1 finished with value: 0.6402563752897859 and parameters: {'n_estimators': 169, 'learning_rate': 0.006066526677412434, 'max_depth': 7}. Best is trial 1 with value: 0.6402563752897859.
[I 2025-07-11 05:53:52,280] Trial 2 finished with value: 0.7407609436792582 and parameters: {'n_estimators': 193, 'learning_rate': 0.053598706882898375, 'max_depth': 6}. Best is trial 2 with value: 0.7407609436792582.
[I 2025-07-11 05:53:55,037] Trial 3 finished with value: 0.5089322241920087 and parameters: {'n_estimators': 97, 'learning_rate': 0.000766847735767811, 'max_depth': 3}. Best is trial 2 with value: 0.7407609436792582.
[I 2025-07-11 

🏃 View run XGBoost_SMOTE_TFIDF_Trigrams at: http://ec2-44-249-137-23.us-west-2.compute.amazonaws.com:5000/#/experiments/534886635769708033/runs/0171f3f7569a48dc819a85a0053e13f7
🧪 View experiment at: http://ec2-44-249-137-23.us-west-2.compute.amazonaws.com:5000/#/experiments/534886635769708033
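
The learning rates in the trial log span several orders of magnitude because `suggest_float(..., log=True)` searches on a logarithmic scale. As a rough illustration of what that means, the sketch below draws plain log-uniform samples with the standard library. It is not Optuna's sampler, which by default also adapts its suggestions to earlier trials; the `sample_log_uniform` helper is hypothetical.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the illustration is repeatable
draws = [sample_log_uniform(1e-4, 1e-1, rng) for _ in range(10_000)]

# Each decade of the range receives roughly a third of the draws,
# whereas uniform sampling would put about 90% of them above 1e-2.
for lo in (1e-4, 1e-3, 1e-2):
    share = sum(lo <= d < lo * 10 for d in draws) / len(draws)
    print(f"[{lo:g}, {lo * 10:g}): {share:.2f}")
```

Searching on a log scale is a common choice for learning rates, since the difference between 0.0001 and 0.001 tends to matter as much as the difference between 0.01 and 0.1.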


## Experiment Summary and Next Steps
This notebook tunes XGBoost with Optuna over 30 trials, searching `n_estimators`, `learning_rate`, and `max_depth`. The best configuration is retrained and logged to MLflow with its accuracy and per-class metrics, and Optuna plots the optimization history and parameter importances. Use the MLflow run and the Optuna plots to choose the final model and to see which hyperparameters matter most.
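
When reading the logged metrics in MLflow, it helps to know how they are named: `log_mlflow` flattens the nested dictionary returned by `classification_report(..., output_dict=True)` into keys of the form `<label>_<metric>`. A small sketch of that flattening on a hand-written dictionary; the `flatten_report` helper is hypothetical and the numbers are made-up placeholders, not results from this experiment.

```python
from typing import Any, Dict


def flatten_report(report: Dict[str, Any]) -> Dict[str, float]:
    """Flatten {label: {metric: value}} into {"label_metric": value}.

    Scalar entries (such as the top-level "accuracy") are skipped,
    mirroring the isinstance check in log_mlflow.
    """
    flat = {}
    for label, metrics in report.items():
        if isinstance(metrics, dict):
            for metric, value in metrics.items():
                flat[f"{label}_{metric}"] = value
    return flat


# Placeholder values for illustration only
example_report = {
    "0": {"precision": 0.5, "recall": 0.25, "f1-score": 0.4, "support": 8},
    "accuracy": 0.5,
    "macro avg": {"precision": 0.5, "recall": 0.25, "f1-score": 0.4, "support": 8},
}

for name, value in flatten_report(example_report).items():
    print(name, value)
```

With the three remapped classes, this naming yields metrics such as `0_precision`, `2_recall`, and `macro avg_f1-score`, which can be used to sort or filter runs in the MLflow UI.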