# Sentiment Analysis: Training & Evaluation

This notebook trains a sentiment classifier on Twitter airline tweets.

**Workflow:**
1. Load and explore data
2. Text preprocessing
3. Split into train/val/test sets (stratified, 80% / 2.5% / 17.5%)
4. Train TF-IDF + LogisticRegression
5. Evaluate and analyze
6. Save model artifacts

**Dataset:** 14,640 airline tweets (negative, neutral, positive)

In [1]:
import sys
sys.path.insert(0, '/data')
import sentiment_utils

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Reload so that edits to sentiment_utils.py take effect without restarting the kernel
import importlib
import sentiment_utils
importlib.reload(sentiment_utils)

from sentiment_utils import (
    load_data,
    preprocess_dataframe,
    split_data,
    vectorize_and_train,
    evaluate_model,
    get_feature_importance,
    LABEL_MAP,
)

from joblib import dump

DATA_PATH = "data/Tweets.csv"
VECT_PATH = "tfidf_vectorizer.joblib"
MODEL_PATH = "logreg_sentiment_model.joblib"

print("✓ Imports OK")

✓ Imports OK


## Load Data

In [3]:
df_raw = load_data(DATA_PATH)
print(f"Shape: {df_raw.shape}")
print(f"\nFirst few rows:")
print(df_raw.head())

✓ Loaded 14640 tweets from data/Tweets.csv
Shape: (14640, 2)

First few rows:
  airline_sentiment                                               text
0           neutral                @VirginAmerica What @dhepburn said.
1          positive  @VirginAmerica plus you've added commercials t...
2           neutral  @VirginAmerica I didn't today... Must mean I n...
3          negative  @VirginAmerica it's really aggressive to blast...
4          negative  @VirginAmerica and it's a really big bad thing...


## EDA with YData Profiling

In [4]:
from ydata_profiling import ProfileReport

print("Generating profiling report...")
profile = ProfileReport(
    df_raw,
    title="Airline Sentiment - EDA",
    minimal=False
)
profile.to_file("airline_sentiment_profile.html")
print("✓ Report saved to airline_sentiment_profile.html")

Generating profiling report...



✓ Report saved to airline_sentiment_profile.html





## Raw Distribution

In [5]:
raw_counts = df_raw['airline_sentiment'].value_counts()

fig, ax = plt.subplots(figsize=(10, 5))
colors = ['#d32f2f', '#f57c00', '#388e3c']
raw_counts.plot(kind='bar', color=colors, ax=ax)
ax.set_title('Label Distribution', fontsize=14, fontweight='bold')
ax.set_xlabel('Sentiment')
ax.set_ylabel('Count')
ax.set_xticklabels(ax.get_xticklabels(), rotation=0)

for i, v in enumerate(raw_counts):
    ax.text(i, v + 100, f'{v}\n({v/len(df_raw)*100:.1f}%)', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nClass Imbalance:")
print(f"  Negative: {raw_counts['negative']/len(df_raw)*100:.1f}%")
print(f"  Neutral:  {raw_counts['neutral']/len(df_raw)*100:.1f}%")
print(f"  Positive: {raw_counts['positive']/len(df_raw)*100:.1f}%")
print(f"  Ratio (neg/pos): {raw_counts['negative']/raw_counts['positive']:.1f}x")


Class Imbalance:
  Negative: 62.7%
  Neutral:  21.2%
  Positive: 16.1%
  Ratio (neg/pos): 3.9x
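
With 62.7% of tweets labeled negative, accuracy alone is easy to misread: a model that always predicts the majority class already scores 62.7%. The sketch below works through that baseline using the percentages printed above. The `shares` dictionary and the variable names are illustrative and are not part of `sentiment_utils`.

```python
# Class shares copied from the "Class Imbalance" output above.
shares = {"negative": 0.627, "neutral": 0.212, "positive": 0.161}

# A majority-class baseline always predicts the most frequent label.
majority = max(shares, key=shares.get)
baseline_accuracy = shares[majority]

# Recall is 1.0 for the majority class and 0.0 for every other class.
per_class_recall = {label: 1.0 if label == majority else 0.0 for label in shares}
macro_recall = sum(per_class_recall.values()) / len(shares)

print(f"Majority class:    {majority}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Baseline macro recall: {macro_recall:.3f}")
```

The baseline's macro recall is only one third, which is why macro-averaged metrics are a more informative yardstick than raw accuracy on this dataset.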


## Preprocess

In [6]:
df_clean = preprocess_dataframe(df_raw)

print(f"\nCleaned shape: {df_clean.shape}")
print(f"\nSample cleaning:")
for idx in range(3):
    print(f"\n[Original]")
    print(f"  {df_clean.iloc[idx]['text'][:100]}...")
    print(f"[Cleaned]")
    print(f"  {df_clean.iloc[idx]['clean_text']}")


✓ Preprocessed - Label distribution:
  negative  :  9178 ( 62.7%)
  neutral   :  3099 ( 21.2%)
  positive  :  2363 ( 16.1%)

Cleaned shape: (14640, 4)

Sample cleaning:

[Original]
  @VirginAmerica What @dhepburn said....
[Cleaned]
  what said

[Original]
  @VirginAmerica plus you've added commercials to the experience... tacky....
[Cleaned]
  plus you ve added commercials to the experience tacky

[Original]
  @VirginAmerica I didn't today... Must mean I need to take another trip!...
[Cleaned]
  i didn t today must mean i need to take another trip
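
The implementation of `preprocess_dataframe` lives in `sentiment_utils` and is not shown in this notebook. Judging only from the samples above, the cleaning appears to lowercase the text, drop @mentions, and replace non-letter characters with spaces. The sketch below reproduces those three samples. `clean_tweet` and its regular expressions are hypothetical (the URL stripping in particular is an assumption), and the real function may differ.

```python
import re

# Hypothetical patterns inferred from the sample output; not taken from sentiment_utils.
_URL = re.compile(r"https?://\S+")
_MENTION = re.compile(r"@\w+")
_NON_LETTER = re.compile(r"[^a-z\s]")
_WHITESPACE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Lowercase, strip URLs and @mentions, keep letters only, collapse spaces."""
    text = text.lower()
    text = _URL.sub(" ", text)
    text = _MENTION.sub(" ", text)
    text = _NON_LETTER.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


print(clean_tweet("@VirginAmerica What @dhepburn said."))
```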


## Text Length Analysis

In [7]:
df_clean['text_len'] = df_clean['clean_text'].str.len()
df_clean['word_count'] = df_clean['clean_text'].str.split().str.len()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

df_clean['text_len'].hist(bins=50, ax=axes[0], color='skyblue', edgecolor='black')
axes[0].set_title('Character Length', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Length')
axes[0].set_ylabel('Count')

df_clean['word_count'].hist(bins=50, ax=axes[1], color='lightcoral', edgecolor='black')
axes[1].set_title('Word Count', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Words')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()

print(f"\nStats:")
print(f"  Avg length: {df_clean['text_len'].mean():.0f} chars")
print(f"  Avg words: {df_clean['word_count'].mean():.1f}")


Stats:
  Avg length: 85 chars
  Avg words: 16.5


## Split Data

In [8]:
X_train, X_val, X_test, y_train, y_val, y_test = split_data(df_clean)

print(f"\nClass distribution in splits:")
for split_name, y_split in [('Train', y_train), ('Val', y_val), ('Test', y_test)]:
    print(f"\n{split_name}:")
    for label_id in range(3):
        count = (y_split == label_id).sum()
        pct = count / len(y_split) * 100
        print(f"  {LABEL_MAP[label_id]:10s}: {count:5d} ({pct:5.1f}%)")


✓ Data split:
  Train: 11712 ( 80.0%)
  Val:     366 (  2.5%)
  Test:   2562 ( 17.5%)

Class distribution in splits:

Train:
  negative  :  7343 ( 62.7%)
  neutral   :  2479 ( 21.2%)
  positive  :  1890 ( 16.1%)

Val:
  negative  :   229 ( 62.6%)
  neutral   :    78 ( 21.3%)
  positive  :    59 ( 16.1%)

Test:
  negative  :  1606 ( 62.7%)
  neutral   :   542 ( 21.2%)
  positive  :   414 ( 16.2%)
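
The near-identical class percentages in all three splits suggest that `split_data` stratifies by label. Its source is not shown here, so the following is only a dependency-free sketch of the idea: shuffle the indices of each class separately, then slice each class by the same fractions. The function name, the seed, and the per-class rounding are assumptions. In practice scikit-learn's `train_test_split` with its `stratify` argument is the usual tool, and the exact split sizes may differ by a few rows from the output above.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.80, 0.025, 0.175), seed=42):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Label counts taken from the preprocessing output (0=negative, 1=neutral, 2=positive).
labels = [0] * 9178 + [1] * 3099 + [2] * 2363
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```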


## Train Model

In [9]:
vectorizer, model = vectorize_and_train(X_train, y_train)


✓ Vectorizing...
  Vocab: 7576 | Shape: (11712, 7576)
✓ Training model...
  Done
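
The summary at the end of this notebook lists balanced class weights. Under scikit-learn's `class_weight='balanced'` convention, each class receives the weight `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the training counts printed in the split step. Whether `vectorize_and_train` passes exactly this option is an assumption, since its source is not shown.

```python
# Training-set counts copied from the "Class distribution in splits" output.
train_counts = {"negative": 7343, "neutral": 2479, "positive": 1890}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# Balanced weighting: rarer classes get proportionally larger weights.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

for label, weight in class_weights.items():
    print(f"{label:10s}: {weight:.3f}")
```

With these weights, an error on a positive tweet costs roughly 3.9 times as much as an error on a negative tweet, mirroring the negative-to-positive ratio in the data.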


## Evaluate

In [10]:
val_metrics = evaluate_model(vectorizer, model, X_val, y_val, "Validation")


EVALUATION: Validation
Accuracy:  0.7951
Precision: 0.7533
Recall:    0.7730
F1-Score:  0.7583

Confusion Matrix:
[[188  32   9]
 [ 14  60   4]
 [  6  10  43]]

Report:
              precision    recall  f1-score   support

    negative       0.90      0.82      0.86       229
     neutral       0.59      0.77      0.67        78
    positive       0.77      0.73      0.75        59

    accuracy                           0.80       366
   macro avg       0.75      0.77      0.76       366
weighted avg       0.81      0.80      0.80       366



In [11]:
test_metrics = evaluate_model(vectorizer, model, X_test, y_test, "Test")


EVALUATION: Test
Accuracy:  0.7760
Precision: 0.7249
Recall:    0.7444
F1-Score:  0.7312

Confusion Matrix:
[[1311  235   60]
 [ 107  383   52]
 [  54   66  294]]

Report:
              precision    recall  f1-score   support

    negative       0.89      0.82      0.85      1606
     neutral       0.56      0.71      0.62       542
    positive       0.72      0.71      0.72       414

    accuracy                           0.78      2562
   macro avg       0.72      0.74      0.73      2562
weighted avg       0.79      0.78      0.78      2562
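
The headline Precision, Recall, and F1-Score values match the "macro avg" row of the report, that is, the unweighted mean of the per-class scores. As a check, the sketch below recomputes them from the test confusion matrix printed above (rows are true labels, columns are predictions). The `macro_metrics` helper is illustrative and is not the code inside `evaluate_model`.

```python
# Test-set confusion matrix copied from the output above.
cm = [
    [1311, 235, 60],   # true negative
    [107, 383, 52],    # true neutral
    [54, 66, 294],     # true positive
]


def macro_metrics(cm):
    """Compute accuracy and macro-averaged precision, recall, and F1."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total

    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))  # column sum
        actual = sum(cm[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    return {
        "accuracy": accuracy,
        "precision_macro": sum(precisions) / n,
        "recall_macro": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
    }


metrics = macro_metrics(cm)
for name, value in metrics.items():
    print(f"{name:16s}: {value:.4f}")
```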



## Confusion Matrix

In [12]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

cm = test_metrics['confusion_matrix']
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=[LABEL_MAP[i] for i in range(3)],
            yticklabels=[LABEL_MAP[i] for i in range(3)])
axes[0].set_title('Confusion Matrix', fontsize=12, fontweight='bold')
axes[0].set_ylabel('True')
axes[0].set_xlabel('Predicted')

cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_norm, annot=True, fmt='.2%', cmap='Greens', ax=axes[1],
            xticklabels=[LABEL_MAP[i] for i in range(3)],
            yticklabels=[LABEL_MAP[i] for i in range(3)])
axes[1].set_title('Normalized', fontsize=12, fontweight='bold')
axes[1].set_ylabel('True')
axes[1].set_xlabel('Predicted')

plt.tight_layout()
plt.show()

## Error Analysis

In [13]:
X_test_vec = vectorizer.transform(X_test)
y_pred = model.predict(X_test_vec)

errors = y_test != y_pred
error_count = errors.sum()
total = len(y_test)

print(f"\nMisclassifications: {error_count}/{total} ({error_count/total*100:.1f}%)")

print(f"\nBreakdown by true label:")
for true_id in range(3):
    true_label = LABEL_MAP[true_id]
    mask = (y_test == true_id) & errors
    if mask.sum() > 0:
        print(f"\n  When true={true_label}:")
        for pred_id in range(3):
            pred_label = LABEL_MAP[pred_id]
            count = ((y_test == true_id) & (y_pred == pred_id) & errors).sum()
            if count > 0:
                print(f"    Predicted {pred_label}: {count}x")

# Show a few examples
error_idx = np.where(errors)[0]
if len(error_idx) > 0:
    print(f"\nExample errors:")
    for i, idx in enumerate(error_idx[:3]):
        true_label = LABEL_MAP[int(y_test.iloc[idx])]
        pred_label = LABEL_MAP[int(y_pred[idx])]
        text = X_test.iloc[idx]
        
        print(f"\n{i+1}. Text: {text[:80]}...")
        print(f"   True: {true_label}, Pred: {pred_label}")


Misclassifications: 574/2562 (22.4%)

Breakdown by true label:

  When true=negative:
    Predicted neutral: 235x
    Predicted positive: 60x

  When true=neutral:
    Predicted negative: 107x
    Predicted positive: 52x

  When true=positive:
    Predicted negative: 54x
    Predicted neutral: 66x

Example errors:

1. Text: when you ve got to get there we ve got excuses usairwaysfail...
   True: negative, Pred: neutral

2. Text: i managed to get sorted out over the phone good luck dealing with the snow in te...
   True: positive, Pred: neutral

3. Text: call gate d in clt and get me on this flight...
   True: neutral, Pred: negative
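
The breakdown above can also be ranked to show which confusion dominates. The sketch below sorts the off-diagonal cells of the test confusion matrix and measures how many errors involve the neutral class on either side. The variable names are illustrative.

```python
LABELS = ["negative", "neutral", "positive"]

# Test-set confusion matrix (rows = true label, columns = predicted label).
cm = [
    [1311, 235, 60],
    [107, 383, 52],
    [54, 66, 294],
]

# Collect (count, true_label, predicted_label) for every off-diagonal cell.
confusions = sorted(
    (
        (cm[t][p], LABELS[t], LABELS[p])
        for t in range(len(LABELS))
        for p in range(len(LABELS))
        if t != p
    ),
    reverse=True,
)

total_errors = sum(count for count, _, _ in confusions)
neutral_errors = sum(
    count for count, true, pred in confusions if "neutral" in (true, pred)
)
neutral_share = neutral_errors / total_errors

for count, true, pred in confusions:
    print(f"{true:>8s} -> {pred:<8s}: {count:4d} ({count / total_errors:.1%})")
print(f"Errors involving neutral: {neutral_share:.1%}")
```

About four in five errors involve the neutral class, so the boundary between neutral and the two polar classes is the main place where the model struggles.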


## Feature Importance

In [14]:
importance = get_feature_importance(vectorizer, model, top_n=15)

print("\nTop words per class:")
for label, features in importance.items():
    print(f"\n{label.upper()}:")
    for i, (word, coef) in enumerate(list(features.items())[:10], 1):
        print(f"  {i:2d}. {word:15s} ({coef:6.3f})")


Top words per class:

NEGATIVE:
   1. not             ( 3.833)
   2. no              ( 3.441)
   3. delayed         ( 3.378)
   4. hours           ( 3.277)
   5. worst           ( 2.878)
   6. your            ( 2.631)
   7. hour            ( 2.614)
   8. cancelled       ( 2.580)
   9. delay           ( 2.501)
  10. nothing         ( 2.424)

NEUTRAL:
   1. is there        ( 2.167)
   2. can you         ( 2.141)
   3. hi              ( 2.123)
   4. is your         ( 1.677)
   5. do you          ( 1.653)
   6. destinationdragons ( 1.585)
   7. do              ( 1.522)
   8. need            ( 1.520)
   9. any             ( 1.514)
  10. can             ( 1.507)

POSITIVE:
   1. great           ( 5.634)
   2. thanks          ( 4.890)
   3. awesome         ( 4.078)
   4. love            ( 3.797)
   5. amazing         ( 3.360)
   6. thank           ( 3.180)
   7. thx             ( 2.918)
   8. thank you       ( 2.744)
   9. best            ( 2.696)
  10. the best        ( 2.659)
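
Entries such as "thank you", "the best", and "is there" are bigram features, consistent with the unigram + bigram setting listed in the summary. The sketch below shows how word n-grams of length 1 and 2 are enumerated from a cleaned tweet. It is a simplified illustration with a made-up sample sentence. scikit-learn's `TfidfVectorizer` with `ngram_range=(1, 2)` produces comparable features using its own tokenizer, and it is an assumption that `vectorize_and_train` is configured that way.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# Made-up example sentence, already lowercased and stripped of punctuation.
features = word_ngrams("thank you for the best flight")
print(features)
```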


## Visualize Features

In [15]:
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

colors_map = {'negative': '#d32f2f', 'neutral': '#f57c00', 'positive': '#388e3c'}

for idx, (label, features) in enumerate(importance.items()):
    words = list(features.keys())[:10]
    coefs = list(features.values())[:10]
    
    axes[idx].barh(words, coefs, color=colors_map[label])
    axes[idx].set_title(f'{label.upper()}', fontweight='bold')
    axes[idx].set_xlabel('Coefficient')
    axes[idx].invert_yaxis()

plt.tight_layout()
plt.show()

## Save Artifacts

In [16]:
dump(vectorizer, VECT_PATH)
dump(model, MODEL_PATH)

print(f"✓ Vectorizer -> {VECT_PATH}")
print(f"✓ Model -> {MODEL_PATH}")
print(f"\n✓ Ready for inference!")

✓ Vectorizer -> tfidf_vectorizer.joblib
✓ Model -> logreg_sentiment_model.joblib

✓ Ready for inference!


## Summary

In [17]:
print(f"""
╔════════════════════════════════════════╗
║  TRAINING COMPLETE                     ║
╚════════════════════════════════════════╝

Dataset: 14,640 airline tweets
Classes: negative (63%), neutral (21%), positive (16%)

Model: Logistic Regression + TF-IDF
- Vocab size: {len(vectorizer.vocabulary_):,} terms
- N-grams: unigrams + bigrams
- Class weights: balanced

Results (Test Set):
- Accuracy:  {test_metrics['accuracy']:.3f}
- Precision: {test_metrics['precision_macro']:.3f}
- Recall:    {test_metrics['recall_macro']:.3f}
- F1-Score:  {test_metrics['f1_macro']:.3f}

Artifacts saved:
✓ {VECT_PATH}
✓ {MODEL_PATH}
✓ airline_sentiment_profile.html

Next: Open sentiment.API.ipynb for inference
""")


╔════════════════════════════════════════╗
║  TRAINING COMPLETE                     ║
╚════════════════════════════════════════╝

Dataset: 14,640 airline tweets
Classes: negative (63%), neutral (21%), positive (16%)

Model: Logistic Regression + TF-IDF
- Vocab size: 7,576 terms
- N-grams: unigrams + bigrams
- Class weights: balanced

Results (Test Set):
- Accuracy:  0.776
- Precision: 0.725
- Recall:    0.744
- F1-Score:  0.731

Artifacts saved:
✓ tfidf_vectorizer.joblib
✓ logreg_sentiment_model.joblib
✓ airline_sentiment_profile.html

Next: Open sentiment.API.ipynb for inference

