In [1]:
import mlflow
mlflow.set_tracking_uri("http://ec2-13-221-127-40.compute-1.amazonaws.com:5000/")

In [2]:
# Set or create an experiment
mlflow.set_experiment("Exp 3 - TfIdf trigram max features")

2025/12/11 19:21:05 INFO mlflow.tracking.fluent: Experiment with name 'Exp 3 - TfIdf trigram max features' does not exist. Creating a new experiment.


<Experiment: artifact_location='s3://mlflow-bucket-youtube-sentiments-classifier/3', creation_time=1765461068279, experiment_id='3', last_update_time=1765461068279, lifecycle_stage='active', name='Exp 3 - TfIdf trigram max features', tags={}>

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import os

In [4]:
df = pd.read_csv('reddit_preprocessing.csv').dropna(subset=['clean_comment'])
df.shape

(36662, 2)

In [5]:
# Step 1: Function to run the experiment
def run_experiment_tfidf_max_features(max_features):
    ngram_range = (1, 3)  # Trigram setting

    # Step 2: Vectorization using TF-IDF with varying max_features
    vectorizer = TfidfVectorizer(ngram_range=ngram_range, max_features=max_features)

    # Step 3: Stratified 80/20 train/test split on the raw text
    X_train, X_test, y_train, y_test = train_test_split(
        df['clean_comment'], df['category'],
        test_size=0.2, random_state=42, stratify=df['category']
    )

    # Fit the vocabulary on the training text only, then reuse it for the test text
    X_train = vectorizer.fit_transform(X_train)
    X_test = vectorizer.transform(X_test)

    # Step 4: Define and train a Random Forest model
    with mlflow.start_run() as run:
        # Set tags for the experiment and run
        mlflow.set_tag("mlflow.runName", f"TFIDF_Trigrams_max_features_{max_features}")
        mlflow.set_tag("experiment_type", "feature_engineering")
        mlflow.set_tag("model_type", "RandomForestClassifier")

        # Add a description
        mlflow.set_tag("description", f"RandomForest with TF-IDF Trigrams, max_features={max_features}")

        # Log vectorizer parameters
        mlflow.log_param("vectorizer_type", "TF-IDF")
        mlflow.log_param("ngram_range", ngram_range)
        mlflow.log_param("vectorizer_max_features", max_features)

        # Log Random Forest parameters
        n_estimators = 200
        max_depth = 15

        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_depth", max_depth)

        # Initialize and train the model
        model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
        model.fit(X_train, y_train)

        # Step 5: Make predictions and log metrics
        y_pred = model.predict(X_test)

        # Log accuracy
        accuracy = accuracy_score(y_test, y_pred)
        mlflow.log_metric("accuracy", accuracy)

        # Log classification report
        classification_rep = classification_report(y_test, y_pred, output_dict=True)
        for label, metrics in classification_rep.items():
            if isinstance(metrics, dict):
                for metric, value in metrics.items():
                    mlflow.log_metric(f"{label}_{metric}", value)

        # Log confusion matrix
        conf_matrix = confusion_matrix(y_test, y_pred)
        plt.figure(figsize=(8, 6))
        sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues")
        plt.xlabel("Predicted")
        plt.ylabel("Actual")
        plt.title(f"Confusion Matrix: TF-IDF Trigrams, max_features={max_features}")
        plt.savefig("confusion_matrix.png")
        mlflow.log_artifact("confusion_matrix.png")
        plt.close()

        # Log the model
        mlflow.sklearn.log_model(model, f"random_forest_model_tfidf_trigrams_{max_features}")

# Step 6: Test various max_features values
max_features_values = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]

for max_features in max_features_values:
    run_experiment_tfidf_max_features(max_features)

2025/12/11 19:23:58 INFO mlflow.tracking._tracking_service.client: 🏃 View run TFIDF_Trigrams_max_features_1000 at: http://ec2-13-221-127-40.compute-1.amazonaws.com:5000/#/experiments/3/runs/19a5c352cafb45e29109de0a8db56046.
2025/12/11 19:23:58 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://ec2-13-221-127-40.compute-1.amazonaws.com:5000/#/experiments/3.
2025/12/11 19:25:57 INFO mlflow.tracking._tracking_service.client: 🏃 View run TFIDF_Trigrams_max_features_2000 at: http://ec2-13-221-127-40.compute-1.amazonaws.com:5000/#/experiments/3/runs/3aa99dc52ded40a88a8c4e354e0868a8.
2025/12/11 19:25:57 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://ec2-13-221-127-40.compute-1.amazonaws.com:5000/#/experiments/3.
2025/12/11 19:27:07 INFO mlflow.tracking._tracking_service.client: 🏃 View run TFIDF_Trigrams_max_features_3000 at: http://ec2-13-221-127-40.compute-1.amazonaws.com:5000/#/experiments/3/runs/97549e858d144e368ac2bfaccd0

![Screenshot 2025-12-11 221437.png](attachment:ff442e50-b4d9-4803-a51e-f6fdf153992f.png)

# **MLflow Parallel Coordinates Plot: TF-IDF Max-Features Experiment**

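This visualization compares **10 model runs**, each using a different value of the TF-IDF parameter `max_features`, and evaluates three core metrics:

* **Accuracy**
* **Precision**
* **Recall**

The training loop logs per-class metrics under names of the form `{label}_{metric}`, so a leading minus on an axis name (for example `-1_precision`, assuming the `category` column contains a `-1` class) is part of the class label. MLflow only records these values; it does not minimize or negate them.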

Your goal here is to find which `max_features` value leads to the best balanced performance.
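
The metric names on the plot axes come from the logging loop in the training cell, which writes one MLflow metric per `{label}_{metric}` pair of the classification report. The sketch below replays that flattening on a hand-written dictionary shaped like the output of `classification_report(..., output_dict=True)`. The class labels (`-1`, `0`, `1`) and every number in it are placeholders assumed for illustration, not results from these runs, and `flatten_report` is a hypothetical helper name.

```python
# Placeholder report in the shape returned by
# sklearn.metrics.classification_report(..., output_dict=True).
# Labels and numbers are illustrative assumptions, not experiment results.
example_report = {
    "-1": {"precision": 0.90, "recall": 0.10, "f1-score": 0.18, "support": 100},
    "0": {"precision": 0.60, "recall": 0.80, "f1-score": 0.69, "support": 200},
    "1": {"precision": 0.70, "recall": 0.75, "f1-score": 0.72, "support": 200},
    "accuracy": 0.65,  # a bare float, so the isinstance check below skips it
    "macro avg": {"precision": 0.73, "recall": 0.55, "f1-score": 0.53, "support": 500},
    "weighted avg": {"precision": 0.70, "recall": 0.64, "f1-score": 0.60, "support": 500},
}


def flatten_report(report):
    """Mirror the notebook's loop: one '<label>_<metric>' entry per nested value."""
    flat = {}
    for label, metrics in report.items():
        if isinstance(metrics, dict):
            for metric, value in metrics.items():
                flat[f"{label}_{metric}"] = value
    return flat


flat_metrics = flatten_report(example_report)
for name in sorted(flat_metrics):
    print(name, flat_metrics[name])
```

With a class labelled `-1`, the resulting names include `-1_precision` and `-1_recall`, so the minus belongs to the name rather than to the value. The top-level `accuracy` entry is skipped by this loop, which is why the notebook logs accuracy separately.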

---

# 🔍 **1. Impact of `vectorizer_max_features`**

The runs cover `max_features` from 1000 to 10000 in steps of 1000. The plot axis marks values around:

```
1000
2000
3000
5000
8000
10000
```

### **Key Finding**

✔ The models with **max_features ≈ 1000–3000** show **more consistent precision and recall**.
✔ But the **best overall accuracy (~0.6618)** is achieved at **max_features = 10000**.

This suggests:

* Smaller vocabularies give more stable precision and recall
* A large vocabulary (10k features) captures more signal → best accuracy
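
To make the vocabulary-size effect concrete, here is a minimal sketch of what `max_features` does inside `TfidfVectorizer`: it keeps only the n-grams with the highest term frequency across the corpus and drops the rest. The toy corpus and the helper name `vocabulary_size` are assumptions made for illustration; the notebook itself fits the vectorizer on `clean_comment`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus, used only to show the effect of max_features.
toy_corpus = [
    "the video was great and the music was great",
    "the video was boring",
    "great music but boring video",
    "did not like the video at all",
]


def vocabulary_size(max_features):
    """Fit a 1- to 3-gram TF-IDF vectorizer; return (vocabulary size, matrix shape)."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=max_features)
    matrix = vectorizer.fit_transform(toy_corpus)
    return len(vectorizer.vocabulary_), matrix.shape


full_size, full_shape = vocabulary_size(None)   # no cap: every n-gram is kept
capped_size, capped_shape = vocabulary_size(5)  # cap: only the 5 most frequent n-grams

print("uncapped:", full_size, full_shape)
print("capped:  ", capped_size, capped_shape)
```

The capped matrix has one column per retained n-gram, so raising `max_features` widens the feature matrix the Random Forest sees, up to the number of distinct n-grams in the training text.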

---

# 🎯 **2. Best Performing Model**

### ⭐ **Highest Accuracy Model**

* **`vectorizer_max_features = 10000`**
* Accuracy ≈ **0.6618**
* Precision is lower (≈ 0.88)
* Recall is highest (≈ 0.134)

This model achieves the **best recall** of the ten runs, meaning it captures more positive cases, with some precision trade-off. At ≈ 0.134 the recall is still low in absolute terms.

---

# ⚖ **3. Trade-off Analysis**

### **High Accuracy (10k features)**

* Good for capturing complex text patterns
* Highest recall
* Precision dips slightly

### **Medium Features (3k–5k)**

* Precision is the best here (≈ 0.97–1.00)
* Recall is more stable but lower
* Accuracy slightly lower (≈ 0.652–0.656)

### **Low Features (1k)**

* Poor accuracy
* Precision/recall unstable
* Model underfits due to limited vocabulary
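
High precision alongside low recall, as reported above, is the usual picture when a classifier rarely predicts a class but is mostly right when it does. A minimal sketch with a hypothetical confusion matrix (the counts are invented for illustration, and `per_class_precision_recall` is a hypothetical helper) shows how both numbers fall out of the same matrix:

```python
def per_class_precision_recall(conf_matrix):
    """Return {class_index: (precision, recall)} for a square confusion matrix.

    Rows are actual classes and columns are predicted classes, the layout
    used by sklearn.metrics.confusion_matrix.
    """
    n = len(conf_matrix)
    scores = {}
    for k in range(n):
        true_pos = conf_matrix[k][k]
        predicted_k = sum(conf_matrix[row][k] for row in range(n))  # column sum
        actual_k = sum(conf_matrix[k])                              # row sum
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        scores[k] = (precision, recall)
    return scores


# Hypothetical 3-class matrix: class 0 is rarely predicted, but when it is,
# the prediction is almost always correct.
hypothetical_matrix = [
    [10, 60, 30],
    [1, 95, 4],
    [0, 20, 80],
]

for cls, (precision, recall) in per_class_precision_recall(hypothetical_matrix).items():
    print(f"class {cls}: precision={precision:.2f} recall={recall:.2f}")
```

In this made-up matrix, class 0 is predicted only 11 times and 10 of those are correct, which gives high precision, while 90 of its 100 actual samples are missed, which gives low recall.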

---

# 🧠 **4. Final Recommendation**

If your goal is:

### ✔ **Best Overall Accuracy** → Choose **max_features = 10,000**

### ✔ **Best Precision (minimizing false positives)** → Choose **max_features ≈ 3,000–5,000**

### ✔ **Best Recall (capturing more true positives)** → **max_features = 10,000**
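
The same goal-based selection can be expressed as a small lookup over run summaries. This is a sketch under stated assumptions: the `runs` list, its field names, and its numbers are hypothetical placeholders rather than the logged results, and `best_run` is an illustrative helper, not part of MLflow.

```python
# Hypothetical run summaries, sorted by max_features.
# The numbers are placeholders for illustration, not the logged results.
runs = [
    {"max_features": 1000, "accuracy": 0.61, "precision": 0.90, "recall": 0.05},
    {"max_features": 4000, "accuracy": 0.65, "precision": 0.98, "recall": 0.09},
    {"max_features": 10000, "accuracy": 0.66, "precision": 0.88, "recall": 0.13},
]


def best_run(run_summaries, metric):
    """Return the run with the highest value of `metric`.

    max() returns the first maximal element, so with the list sorted by
    max_features a tie goes to the smaller vocabulary.
    """
    return max(run_summaries, key=lambda run: run[metric])


for objective in ("accuracy", "precision", "recall"):
    winner = best_run(runs, objective)
    print(f"best {objective}: max_features={winner['max_features']} "
          f"({objective}={winner[objective]})")
```

Picking the winner per metric keeps the choice explicit: the objective decides which run is preferred, and a different objective can select a different vocabulary size.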

---

# 📝 **Professional Summary (for client)**

> The experiment evaluated TF-IDF vectorizers with increasing vocabulary sizes. We observed that smaller vocabularies (3k–5k features) produced the most consistent precision, while the largest vocabulary (10k features) delivered the highest accuracy and recall. Based on the overall performance balance, the TF-IDF model with **10,000 features** is recommended as it captures the most informative text signals and provides the strongest predictive capability.