## BINARY CLASSIFICATION

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np

In [2]:
# Load dataset

text_cleaned = pd.read_csv("../../assets/Cleaned_Tweets.csv")
text_cleaned.head()

Unnamed: 0,tweet,emotion,product_filled,cleaned_tweets,text_length
0,.@wesley83 i have a 3g iphone. after 3 hrs twe...,negative,iphone,3g iphone 3 hrs tweeting rise_austin dead need...,12
1,@jessedee know about @fludapp ? awesome ipad/i...,positive,ipad or iphone app,know awesome ipadiphone app youll likely appre...,14
2,@swonderlin can not wait for #ipad 2 also. the...,positive,ipad,wait ipad 2 also sale sxsw,6
3,@sxsw i hope this year's festival isn't as cra...,negative,ipad or iphone app,hope years festival isnt crashy years iphone a...,9
4,@sxtxstate great stuff on fri #sxsw: marissa m...,positive,google,great stuff fri sxsw marissa mayer google tim ...,15


In [5]:
binary_df = text_cleaned[text_cleaned["emotion"].isin(['positive', 'negative'])]
binary_df


Unnamed: 0,tweet,emotion,product_filled,cleaned_tweets,text_length
0,.@wesley83 i have a 3g iphone. after 3 hrs twe...,negative,iphone,3g iphone 3 hrs tweeting rise_austin dead need...,12
1,@jessedee know about @fludapp ? awesome ipad/i...,positive,ipad or iphone app,know awesome ipadiphone app youll likely appre...,14
2,@swonderlin can not wait for #ipad 2 also. the...,positive,ipad,wait ipad 2 also sale sxsw,6
3,@sxsw i hope this year's festival isn't as cra...,negative,ipad or iphone app,hope years festival isnt crashy years iphone a...,9
4,@sxtxstate great stuff on fri #sxsw: marissa m...,positive,google,great stuff fri sxsw marissa mayer google tim ...,15
...,...,...,...,...,...
9076,@mention your pr guy just convinced me to swit...,positive,iphone,pr guy convinced switch back iphone great sxsw...,10
9078,&quot;papyrus...sort of like the ipad&quot; - ...,positive,ipad,quotpapyrussort like ipadquot nice lol sxsw la...,7
9079,diller says google tv &quot;might be run over ...,negative,other google product or service,diller says google tv quotmight run playstatio...,13
9084,i've always used camera+ for my iphone b/c it ...,positive,ipad or iphone app,ive always used camera iphone bc image stabili...,17


In [9]:
X_binary = binary_df["cleaned_tweets"]
y_binary = binary_df["emotion"]

y_binary = y_binary.map({'negative': 0, 'positive': 1})


Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_binary, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)

In [10]:


# Split the training portion again to obtain a validation set for model comparison
Xb_temp, Xb_val, yb_temp, yb_val = train_test_split(
    Xb_train, yb_train, test_size=0.2, random_state=42, stratify=yb_train
)
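The two nested 80/20 splits above leave three partitions. As a quick sanity check, the sketch below works out the share of the binary dataset each one receives. The variable names are illustrative only and do not appear in the notebook.

```python
# Fractions passed as test_size in the two train_test_split calls above
test_fraction = 0.2            # first split: held-out test set
val_fraction_of_train = 0.2    # second split: validation share of the training part

# Share of the full binary dataset that ends up in each partition
train_fraction = (1 - test_fraction) * (1 - val_fraction_of_train)
val_fraction = (1 - test_fraction) * val_fraction_of_train

print(f"train (Xb_temp): {train_fraction:.0%}")
print(f"validation (Xb_val): {val_fraction:.0%}")
print(f"test (Xb_test): {test_fraction:.0%}")
```

This matches the supports reported in the outputs further down: 568 validation rows against 710 test rows is the same 0.16 : 0.20 ratio.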



In [11]:
# Define pipeline
sentiment_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1,2))),
    ("clf", LogisticRegression(max_iter=1000))
])

# Fit pipeline
sentiment_pipeline.fit(Xb_temp, yb_temp)

# Predictions
y_pred = sentiment_pipeline.predict(Xb_val)

# Evaluation
print("Classification Report:\n", classification_report(yb_val, y_pred))
print("Confusion Matrix:\n", confusion_matrix(yb_val, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.07      0.12        91
           1       0.85      1.00      0.92       477

    accuracy                           0.85       568
   macro avg       0.85      0.53      0.52       568
weighted avg       0.85      0.85      0.79       568

Confusion Matrix:
 [[  6  85]
 [  1 476]]



SUMMARY:
Your validation results show very poor recall for class 0 (negative): only 6 of the 91 negative tweets (7%) are identified, so the model predicts positive (class 1) almost all the time. The 85% accuracy is misleading here, because always predicting positive would already score about 84% (477 of 568).

This is common with imbalanced datasets. For this "positive" vs "negative" task, the likely causes are:

The class distribution is heavily skewed: positives outnumber negatives roughly 5 to 1 in the validation set (477 vs 91).

No balancing technique was applied (e.g., class weights or oversampling).

The model is learning features that mostly correlate with positives.
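One of the balancing options named above is class weighting. As a rough sketch, scikit-learn's `class_weight="balanced"` setting weights each class by `n_samples / (n_classes * class_count)`. The counts below are the validation supports from the report above, used here only as a stand-in for the training distribution (an assumption, although the stratified split should keep the ratio close).

```python
# Class counts taken from the validation report above (91 negative, 477 positive).
# Assumption: the training split has roughly the same ratio because of stratify=.
counts = {0: 91, 1: 477}

n_samples = sum(counts.values())
n_classes = len(counts)

# Heuristic behind class_weight="balanced": n_samples / (n_classes * class_count)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label, weight in balanced_weights.items():
    print(f"class {label}: weight {weight:.2f}")
```

Under this heuristic, each negative example would count about five times as much as a positive one in the loss, which is an alternative to generating synthetic minority samples.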

In [13]:
# Define pipeline

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

smote_pipeline = ImbPipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1,2))),
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000, class_weight=None))  # class_weight not needed since we use SMOTE
])

# Fit pipeline
smote_pipeline.fit(Xb_temp, yb_temp)

# Predictions
y_pred = smote_pipeline.predict(Xb_val)

# Evaluation
print("Classification Report:\n", classification_report(yb_val, y_pred))
print("Confusion Matrix:\n", confusion_matrix(yb_val, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.54      0.53      0.53        91
           1       0.91      0.91      0.91       477

    accuracy                           0.85       568
   macro avg       0.72      0.72      0.72       568
weighted avg       0.85      0.85      0.85       568

Confusion Matrix:
 [[ 48  43]
 [ 41 436]]


SMOTE helped your model pick up more of the minority class (label 0, negative) compared with the earlier results.

Here’s the change in validation performance compared to the previous run:

Recall for class 0 went from 0.07 → 0.53, a big improvement: the model now catches 48 of the 91 negative tweets instead of 6.

Precision for class 0 dropped substantially (0.86 → 0.54). This is expected: after oversampling, the model predicts negative far more often, so more positive tweets are mislabeled as negative (41 instead of 1).

Class 1 now sits at 0.91 for both precision and recall (previously 0.85 precision and 1.00 recall).

Overall accuracy stayed at 85%, but the model is now more balanced in detecting both classes: macro-average recall rose from 0.53 to 0.72.
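To make the precision/recall trade-off concrete, the sketch below recomputes the class 0 metrics directly from the two validation confusion matrices printed above. The helper `class0_metrics` is a hypothetical name introduced for illustration and is not part of the notebook.

```python
# Confusion matrices copied from the two validation runs above.
# Rows are true labels, columns are predicted labels (0 = negative, 1 = positive).
baseline_cm = [[6, 85], [1, 476]]
smote_cm = [[48, 43], [41, 436]]


def class0_metrics(cm):
    """Return (precision, recall, accuracy), treating class 0 as the class of interest."""
    tp = cm[0][0]  # negatives predicted as negative
    fn = cm[0][1]  # negatives predicted as positive
    fp = cm[1][0]  # positives predicted as negative
    tn = cm[1][1]  # positives predicted as positive
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return precision, recall, accuracy


for name, cm in [("baseline", baseline_cm), ("SMOTE", smote_cm)]:
    precision, recall, accuracy = class0_metrics(cm)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

The accuracy is nearly identical in both runs because the 42 extra negatives caught are offset by 40 extra positives lost. That is why accuracy alone hides the change.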

In [14]:
param_grid = {
    "tfidf__max_features": [3000, 5000, 7000],
    "tfidf__ngram_range": [(1,1), (1,2)],
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__solver": ["liblinear", "lbfgs"],
    "clf__penalty": ["l2"]  # liblinear & lbfgs both work with L2
}

# Grid search
grid_search = GridSearchCV(
    smote_pipeline,
    param_grid,
    cv=5,
    scoring="f1_macro",
    n_jobs=-1,
    verbose=2
)

# Fit grid search
grid_search.fit(Xb_train, yb_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Evaluate the best estimator (refit on all of Xb_train) on the held-out test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(Xb_test)

# classification_report and confusion_matrix are already imported in the first cell
print("\nClassification Report:\n", classification_report(yb_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(yb_test, y_pred))

Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best Parameters: {'clf__C': 10, 'clf__penalty': 'l2', 'clf__solver': 'liblinear', 'tfidf__max_features': 7000, 'tfidf__ngram_range': (1, 1)}

Classification Report:
               precision    recall  f1-score   support

           0       0.63      0.58      0.61       114
           1       0.92      0.94      0.93       596

    accuracy                           0.88       710
   macro avg       0.78      0.76      0.77       710
weighted avg       0.87      0.88      0.88       710


Confusion Matrix:
 [[ 66  48]
 [ 38 558]]


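Key takeaways from your results (these are test-set scores from a model refit on the full training split, so the comparison with the earlier validation scores is only indicative):

Recall for class 0 (minority) is 0.58, compared with 0.53 on the validation set, meaning the tuned model catches somewhat more minority samples.

Recall for class 1 (majority) stayed strong at 0.94.

Overall accuracy is 88%, compared with 85% on validation.

Macro avg recall (important for imbalanced data) is 0.76, up from 0.72.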
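The log line "Fitting 5 folds for each of 48 candidates, totalling 240 fits" follows directly from the grid definition. A minimal sketch of that arithmetic is shown below, using only the standard library; scikit-learn's `ParameterGrid` performs the equivalent enumeration.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    "tfidf__max_features": [3000, 5000, 7000],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__solver": ["liblinear", "lbfgs"],
    "clf__penalty": ["l2"],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

Because the count is a product of the list lengths, adding one more value to any parameter scales the total number of fits, which is worth checking before widening the grid.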

In [18]:
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score



models = {
    "SVM": LinearSVC(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
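    # use_label_encoder is ignored by recent XGBoost releases (see the warning in the output) and can be dropped
    "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric="logloss", random_state=42),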
    "Gradient Boosting": GradientBoostingClassifier(random_state=42)
}

results = {}

for name, clf in models.items():
    pipe = ImbPipeline([
        ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1,2))),
        ("smote", SMOTE(random_state=42)),
        ("clf", clf)
    ])
    
    # Fit
    pipe.fit(Xb_temp, yb_temp)
    
    # Predict
    y_pred = pipe.predict(Xb_val)
    
    # Evaluate
    acc = accuracy_score(yb_val, y_pred)
    print(f"\n{name} Accuracy: {acc:.4f}")
    print(classification_report(yb_val, y_pred))
    
    results[name] = acc

print("\nSummary of Results:", results)
    


SVM Accuracy: 0.8662
              precision    recall  f1-score   support

           0       0.60      0.49      0.54        91
           1       0.91      0.94      0.92       477

    accuracy                           0.87       568
   macro avg       0.75      0.72      0.73       568
weighted avg       0.86      0.87      0.86       568


Random Forest Accuracy: 0.8908
              precision    recall  f1-score   support

           0       0.84      0.40      0.54        91
           1       0.90      0.99      0.94       477

    accuracy                           0.89       568
   macro avg       0.87      0.69      0.74       568
weighted avg       0.89      0.89      0.87       568



Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



XGBoost Accuracy: 0.8574
              precision    recall  f1-score   support

           0       0.61      0.31      0.41        91
           1       0.88      0.96      0.92       477

    accuracy                           0.86       568
   macro avg       0.74      0.63      0.66       568
weighted avg       0.84      0.86      0.84       568


Gradient Boosting Accuracy: 0.8398
              precision    recall  f1-score   support

           0       0.50      0.49      0.50        91
           1       0.90      0.91      0.90       477

    accuracy                           0.84       568
   macro avg       0.70      0.70      0.70       568
weighted avg       0.84      0.84      0.84       568


Summary of Results: {'SVM': 0.8661971830985915, 'Random Forest': 0.8908450704225352, 'XGBoost': 0.8573943661971831, 'Gradient Boosting': 0.8397887323943662}
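The summary above ranks the models by accuracy alone, which can mislead on imbalanced data. As a rough illustration, the sketch below compares that ranking with one based on macro-average recall, using the rounded per-class recalls copied from the reports above. The two-decimal rounding makes the macro values approximate.

```python
# Validation supports copied from the reports above
n_negative, n_positive = 91, 477

# Accuracy of a trivial model that always predicts "positive"
majority_baseline = n_positive / (n_negative + n_positive)

# Accuracy and (class 0 recall, class 1 recall) per model, copied from the output above
accuracy = {
    "SVM": 0.8662,
    "Random Forest": 0.8908,
    "XGBoost": 0.8574,
    "Gradient Boosting": 0.8398,
}
recalls = {
    "SVM": (0.49, 0.94),
    "Random Forest": (0.40, 0.99),
    "XGBoost": (0.31, 0.96),
    "Gradient Boosting": (0.49, 0.91),
}

# Macro recall gives both classes equal weight regardless of their size
macro_recall = {name: (r0 + r1) / 2 for name, (r0, r1) in recalls.items()}

best_by_accuracy = max(accuracy, key=accuracy.get)
best_by_macro_recall = max(macro_recall, key=macro_recall.get)

print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
print(f"best by accuracy: {best_by_accuracy}")
print(f"best by macro recall: {best_by_macro_recall}")
```

Two things stand out. Gradient Boosting's accuracy is no better than the always-positive baseline even though it finds about half of the negatives. And Random Forest leads on accuracy while recalling only 40% of negative tweets.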


In [20]:
# Define models and param grids
models_and_params = {
    "SVM": (
        LinearSVC(random_state=42),
        {
            "clf__C": [0.1, 1, 10]
        }
    ),
    "Random Forest": (
        RandomForestClassifier(random_state=42),
        {
            "clf__n_estimators": [100, 200],
            "clf__max_depth": [None, 10, 20]
        }
    ),
    "XGBoost": (
        # use_label_encoder is ignored by recent XGBoost releases and can be dropped
        XGBClassifier(use_label_encoder=False, eval_metric="logloss", random_state=42),
        {
            "clf__n_estimators": [100, 200],
            "clf__max_depth": [3, 6, 10],
            "clf__learning_rate": [0.1, 0.3]
        }
    ),
    "Gradient Boosting": (
        GradientBoostingClassifier(random_state=42),
        {
            "clf__n_estimators": [100, 200],
            "clf__learning_rate": [0.05, 0.1, 0.3],
            "clf__max_depth": [3, 5]
        }
    )
}

results = {}

for name, (clf, param_grid) in models_and_params.items():
    print(f"\n Running GridSearch for {name}...")
    
    pipe = ImbPipeline([
        ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1,2))),
        ("smote", SMOTE(random_state=42)),
        ("clf", clf)
    ])
    
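    # Note: scoring="accuracy" favors majority-class predictions on this imbalanced data;
    # "f1_macro" (used in the earlier grid search) would weight both classes equally.
    grid = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy", n_jobs=-1, verbose=1)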
    grid.fit(Xb_temp, yb_temp)
    
    y_pred = grid.predict(Xb_val)
    
    acc = accuracy_score(yb_val, y_pred)
    print(f"{name} Best Params: {grid.best_params_}")
    print(f"{name} Accuracy: {acc:.4f}")
    print(classification_report(yb_val, y_pred))
    
    results[name] = {"Accuracy": acc, "Best Params": grid.best_params_}

print("\n Summary of Results:")
for model, res in results.items():
    print(f"{model}: Accuracy={res['Accuracy']:.4f}, Best Params={res['Best Params']}")


 Running GridSearch for SVM...
Fitting 3 folds for each of 3 candidates, totalling 9 fits
SVM Best Params: {'clf__C': 1}
SVM Accuracy: 0.8662
              precision    recall  f1-score   support

           0       0.60      0.49      0.54        91
           1       0.91      0.94      0.92       477

    accuracy                           0.87       568
   macro avg       0.75      0.72      0.73       568
weighted avg       0.86      0.87      0.86       568


 Running GridSearch for Random Forest...
Fitting 3 folds for each of 6 candidates, totalling 18 fits
Random Forest Best Params: {'clf__max_depth': None, 'clf__n_estimators': 200}
Random Forest Accuracy: 0.8908
              precision    recall  f1-score   support

           0       0.84      0.40      0.54        91
           1       0.90      0.99      0.94       477

    accuracy                           0.89       568
   macro avg       0.87      0.69      0.74       568
weighted avg       0.89      0.89      0.87       568

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


XGBoost Best Params: {'clf__learning_rate': 0.3, 'clf__max_depth': 3, 'clf__n_estimators': 200}
XGBoost Accuracy: 0.8574
              precision    recall  f1-score   support

           0       0.61      0.31      0.41        91
           1       0.88      0.96      0.92       477

    accuracy                           0.86       568
   macro avg       0.74      0.63      0.66       568
weighted avg       0.84      0.86      0.84       568


 Running GridSearch for Gradient Boosting...
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Gradient Boosting Best Params: {'clf__learning_rate': 0.3, 'clf__max_depth': 3, 'clf__n_estimators': 200}
Gradient Boosting Accuracy: 0.8609
              precision    recall  f1-score   support

           0       0.59      0.43      0.50        91
           1       0.90      0.94      0.92       477

    accuracy                           0.86       568
   macro avg       0.74      0.69      0.71       568
weighted avg       0.85      0.8