# Tweet Sentiment Analysis: From Baseline to State-of-the-Art

#### This notebook serves as the primary workspace for developing and comparing sentiment analysis models. We will follow a structured approach:
1.  **Setup and Data Exploration**: Load libraries and understand the dataset.
2.  **Universal Text Preprocessing**: Create a robust cleaning pipeline for our text data.
3.  **Part 1: Baseline Models (Scikit-learn)**: Implement and evaluate classic machine learning models using TF-IDF.
4.  **Part 2: Deep Learning Models**: Build and evaluate an LSTM model and discuss the implementation of a Transformer (RoBERTa).
5.  **Model Comparison & Final Selection**: Compare the results and choose the best model.
6.  **Saving and Predicting with the Final Model**: Save the chosen model and use it for inference on new tweets.

### 1. Setup and Data Exploration
##### First, let's import all necessary libraries, download NLTK data, and perform a brief exploratory data analysis (EDA).


#### 1.1 Imports

In [5]:
import pandas as pd
import numpy as np
import re
import nltk
import joblib
import os
import matplotlib.pyplot as plt
import seaborn as sns

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Scikit-learn Imports
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score
from sklearn.preprocessing import label_binarize

# Deep Learning Imports
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, SpatialDropout1D, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping

# Configure plots
sns.set_style('whitegrid')

2025-10-02 23:48:55.622193: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-10-02 23:48:55.624423: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-10-02 23:48:55.809358: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

#### 1.2 NLTK Downloads

In [6]:
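# Download NLTK data (only needs to be done once).
# Each resource is checked separately, so a partially populated
# nltk_data directory is completed instead of failing later.
# Note: newer NLTK releases may also need 'punkt_tab' for word_tokenize.
NLTK_RESOURCES = {
    'stopwords': 'corpora/stopwords',
    'punkt': 'tokenizers/punkt',
    'wordnet': 'corpora/wordnet',
}
for resource, path in NLTK_RESOURCES.items():
    try:
        nltk.data.find(path)
    except LookupError:
        print(f"Downloading NLTK resource '{resource}'...")
        nltk.download(resource)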

#### 1.3 Data Loading and Initial Analysis

In [7]:
# Define file paths
# Note: The '..' moves one directory up from /notebooks to the project root
DATASET_PATH = '../data/raw/Tweets.csv'
MODEL_DIR = '../models'

# Create models directory if it doesn't exist
os.makedirs(MODEL_DIR, exist_ok=True)

# Load the data
df = pd.read_csv(DATASET_PATH)

# Let's focus on the columns we need: 'text' and 'airline_sentiment'
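# Select the columns and drop missing rows in one step; this avoids
# calling an in-place method on a slice of the original DataFrame.
df = df[['text', 'airline_sentiment']].dropna()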

print("Dataset Info:")
df.info()
print("\nFirst 5 rows:")
print(df.head())

FileNotFoundError: [Errno 2] No such file or directory: '../data/raw/Tweets.csv'

#### 1.4 Sentiment Distribution

In [None]:
# Visualizing the count of each sentiment class helps us understand if the dataset is imbalanced.
plt.figure(figsize=(8, 5))
sns.countplot(x='airline_sentiment', data=df, order=['positive', 'neutral', 'negative'])
plt.title('Distribution of Sentiments')
plt.xlabel('Sentiment')
plt.ylabel('Number of Tweets')
plt.show()

### 2. Universal Text Preprocessing

##### This is a crucial step to clean the raw text. We will create a single function that will be used across all models to ensure consistency. The pipeline includes:
1.  **Regular expression** cleaning
2.  **Case normalization**
3.  **Tokenization**
4.  **Stopword removal**
5.  **Lemmatization**

In [None]:
# Initialize preprocessing tools
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
custom_stopwords = {'american', 'us', 'airways', 'air', 'airline', 'jetblue', 'virgin', 'united', 'southwest', 'flight'}
stop_words.update(custom_stopwords)

def preprocess_text(text: str) -> str:
    """Applies the full text cleaning pipeline to a single string."""
    if not isinstance(text, str):
        return ""

    # 1. Regex cleaning: remove URLs, @mentions, and the '#' symbol (the hashtag word itself is kept)
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'@\w+|#', '', text)
    #    Then remove all remaining non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # 2. Case normalization
    text = text.lower()
    # 3. Tokenization
    tokens = word_tokenize(text)
    # 4-5. Stopword removal and lemmatization (single-character tokens are dropped as well)
    processed_tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and len(word) > 1]

    return " ".join(processed_tokens)


# Apply the preprocessing function to our text column
print("Preprocessing text data... (This may take a moment)")
df['processed_text'] = df['text'].apply(preprocess_text)
print("Preprocessing complete.")
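To make the effect of the regex stage concrete, here is a minimal standard-library sketch of that stage alone. The helper name `regex_clean` and the sample tweet are illustrative assumptions, not part of the notebook; tokenization, stopword removal, and lemmatization are deliberately left out so the example runs without NLTK.

```python
import re

def regex_clean(text: str) -> str:
    """Regex-only portion of the cleaning pipeline (hypothetical helper, no NLTK needed)."""
    text = re.sub(r'http\S+|www\S+', '', text)   # drop URLs
    text = re.sub(r'@\w+|#', '', text)           # drop @mentions and the '#' symbol
    text = re.sub(r'[^a-zA-Z\s]', '', text)      # keep letters and whitespace only
    return " ".join(text.lower().split())        # lowercase and collapse whitespace

# Made-up tweet used purely for illustration
sample = "@VirginAmerica Loved the #service!! Details: https://t.co/abc123 :)"
cleaned = regex_clean(sample)
print(cleaned)
```

Note that the hashtag word (`service`) survives because only the `#` character is stripped, while the mention and the URL disappear entirely.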

### 3. Part 1: Baseline Models (Scikit-learn)
##### We will start by training and evaluating several strong baseline models.

#### 3.1 Feature Extraction (TF-IDF with N-Grams) & Data Splitting
##### We convert our cleaned text into numerical features using **TF-IDF Vectorization**. We include **N-Grams** (`ngram_range=(1, 2)`) to capture both single words and two-word phrases, which often carry more meaning than words in isolation.

In [None]:
# Define features (X) and target (y)
X = df['processed_text']
y = df['airline_sentiment']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Initialize and fit the TF-IDF Vectorizer on the training data
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(f"Shape of TF-IDF matrix for training data: {X_train_tfidf.shape}")
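The following standard-library sketch illustrates what `ngram_range=(1, 2)` adds to the feature space and how a smoothed inverse document frequency weights rare terms more heavily. The helper `word_ngrams` and the three-document corpus are assumptions for illustration; the IDF expression follows the smoothed formula documented for scikit-learn, but the sketch skips term-frequency scaling and L2 normalization, so it is not a drop-in replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of a whitespace-tokenized text (hypothetical helper)."""
    tokens = text.split()
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]

# Tiny made-up corpus of already-cleaned tweets
corpus = ["great flight", "bad flight", "great crew"]

# Document frequency: in how many documents each n-gram occurs
doc_freq = Counter(term for doc in corpus for term in set(word_ngrams(doc)))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(corpus)
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

features = word_ngrams("not good service")
print(features)
print(round(idf["flight"], 3), round(idf["great flight"], 3))
```

The bigram `not good` becomes its own feature, which is how a bag-of-words model can pick up simple negation. The rarer bigram `great flight` also receives a higher IDF than the more common unigram `flight`.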

#### 3.2 Model Training and Evaluation
We will train and evaluate the following Scikit-learn models:
- **Support Vector Machine (SVM)**: A powerful model that finds an optimal hyperplane to separate classes.
- **Random Forests**: An ensemble method that builds multiple decision trees and merges them to get a more accurate and stable prediction.
- **Logistic Regression**: A reliable and interpretable linear model.
- **Multinomial Naive Bayes**: A probabilistic model that works very well for text classification.

*A Note on K-Means*: K-Means is an **unsupervised clustering** algorithm, meaning it groups data without predefined labels. Since our goal is **supervised classification** (predicting known sentiment labels), K-Means is not an appropriate choice for this task.

In [None]:
# Define the models we want to train
models = {
    "Linear SVM": LinearSVC(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42, n_jobs=-1),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Multinomial Naive Bayes": MultinomialNB()
}

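# Binarize the labels for AUC calculation.
# The column order must match model.classes_ (sorted alphabetically by scikit-learn),
# because predict_proba / decision_function return their columns in that order.
y_test_binarized = label_binarize(y_test, classes=['negative', 'neutral', 'positive'])
# Display order used for the reports and confusion matrices
class_labels = ['positive', 'neutral', 'negative']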

for name, model in models.items():
    print(f"--- Training {name} ---")
    model.fit(X_train_tfidf, y_train)
    y_pred = model.predict(X_test_tfidf)
    
    # Evaluation
    print(f"\n--- Evaluation for {name} ---")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print("\nClassification Report:")
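    # Pass `labels` together with `target_names`; otherwise the names are applied
    # to the alphabetically sorted classes and the rows end up mislabeled.
    print(classification_report(y_test, y_pred, labels=class_labels, target_names=class_labels))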
    
    if hasattr(model, "predict_proba"):
        y_pred_proba = model.predict_proba(X_test_tfidf)
    else: # For SVM which uses decision_function
        y_pred_proba = model.decision_function(X_test_tfidf)
        # We need to reshape for 3-class problem
        if len(y_pred_proba.shape) == 1:
             y_pred_proba = np.vstack([-y_pred_proba, y_pred_proba]).T
    
    # Ensure y_pred_proba has 3 columns for 3 classes for AUC calculation
    if y_pred_proba.shape[1] == 2 and len(class_labels) == 3:
        # A common case for binary classifiers on multi-class data
        # We can't calculate multi-class AUC directly, so we'll skip it.
        print("Skipping Macro-Average AUC for this model.")
    elif y_pred_proba.shape[1] != len(class_labels):
         print("Skipping Macro-Average AUC due to shape mismatch.")
    else:
        auc_score = roc_auc_score(y_test_binarized, y_pred_proba, multi_class='ovr', average='macro')
        print(f"Macro-Average One-vs-Rest AUC: {auc_score:.4f}\n")


    cm = confusion_matrix(y_test, y_pred, labels=class_labels)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_labels, yticklabels=class_labels)
    plt.title(f'Confusion Matrix for {name}')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
    print("-" * 50 + "\n")
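The macro-averaged one-vs-rest AUC is only meaningful when column `k` of the score matrix belongs to the same class as column `k` of the binarized labels. The sketch below computes the metric by hand with the standard library, using the rank interpretation of AUC, and shows how a mismatched class order distorts the result. The function names and the four example rows are illustrative assumptions, not part of the notebook.

```python
def binary_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(labels, score_rows, classes):
    """Average the one-vs-rest AUC over classes; column k of score_rows is read as classes[k]."""
    aucs = []
    for k, cls in enumerate(classes):
        y_bin = [1 if lab == cls else 0 for lab in labels]
        aucs.append(binary_auc(y_bin, [row[k] for row in score_rows]))
    return sum(aucs) / len(aucs)

# Made-up scores whose columns follow the alphabetical order negative, neutral, positive
labels = ["negative", "neutral", "positive", "negative"]
score_rows = [
    [0.70, 0.20, 0.10],
    [0.20, 0.50, 0.30],
    [0.10, 0.30, 0.60],
    [0.40, 0.35, 0.25],
]

aligned = macro_ovr_auc(labels, score_rows, ["negative", "neutral", "positive"])
misaligned = macro_ovr_auc(labels, score_rows, ["positive", "neutral", "negative"])
print(aligned, misaligned)
```

With the correct column order this toy classifier ranks every class perfectly. Reading the same scores in the wrong order makes a perfect model look worse than chance, which is why the binarized labels should follow `model.classes_`.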

### 4. Part 2: Deep Learning Models
##### Now we'll build more complex models: a Recurrent Neural Network (LSTM) and discuss a state-of-the-art Transformer.

#### 4.1 Advanced Model: RNN/LSTM
This requires a different preprocessing pipeline to convert text into sequences of integers for the Embedding layer.

#### 4.1.1 Preprocessing for LSTM (Tokenization & Padding)

In [None]:
# Keras Tokenizer parameters
MAX_NB_WORDS = 10000  # Max number of words in the vocabulary
MAX_SEQUENCE_LENGTH = 100 # Max length of a tweet
EMBEDDING_DIM = 128 # Dimension of the word embeddings

# Create and fit the tokenizer
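# The backslash is escaped explicitly to avoid an invalid-escape warning; the filter set is unchanged.
keras_tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)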
keras_tokenizer.fit_on_texts(df['processed_text'].values)

# Convert text to sequences and pad them
X_seq = keras_tokenizer.texts_to_sequences(df['processed_text'].values)
X_pad = pad_sequences(X_seq, maxlen=MAX_SEQUENCE_LENGTH)

# One-hot encode the labels
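# Request a float dtype explicitly: recent pandas versions return boolean columns by default
y_encoded = pd.get_dummies(df['airline_sentiment'], dtype='float32').values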

# Split the data for the LSTM model
X_train_lstm, X_test_lstm, y_train_lstm, y_test_lstm = train_test_split(
    X_pad, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)

print(f"Shape of LSTM training data: {X_train_lstm.shape}")
print(f"Shape of LSTM training labels: {y_train_lstm.shape}")
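As a rough illustration of what the tokenizer and `pad_sequences` do, the sketch below rebuilds the two steps with the standard library. It assumes the Keras defaults: index 0 is reserved for padding, only the `num_words - 1` most frequent words are kept, and both padding and truncation happen at the front of the sequence. The helper names and the toy texts are hypothetical.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Map the (num_words - 1) most frequent words to the indices 1, 2, ... (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(num_words - 1), start=1)}

def encode_and_pad(text, vocab, maxlen):
    """Convert text to word indices, drop unknown words, then left-truncate and left-pad."""
    seq = [vocab[word] for word in text.split() if word in vocab]
    seq = seq[-maxlen:]                        # keep the last maxlen tokens
    return [0] * (maxlen - len(seq)) + seq     # pad with zeros at the front

# Made-up, already-cleaned tweets
texts = ["delayed again delayed", "great crew", "delayed crew"]
vocab = fit_vocab(texts, num_words=3)

short = encode_and_pad("great crew delayed again", vocab, maxlen=4)
long = encode_and_pad("delayed crew delayed crew delayed", vocab, maxlen=4)
print(vocab, short, long)
```

Words outside the vocabulary (`great`, `again`) are silently dropped. This is why `MAX_NB_WORDS` and `MAX_SEQUENCE_LENGTH` together decide how much of each tweet the LSTM actually sees.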

#### 4.1.2 Building and Training the LSTM Model

In [None]:
model_lstm = Sequential()
model_lstm.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X_pad.shape[1]))
model_lstm.add(SpatialDropout1D(0.2))
model_lstm.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2)))
model_lstm.add(Dense(3, activation='softmax'))

model_lstm.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# summary() prints the table itself and returns None, so it is not wrapped in print()
model_lstm.summary()

# Train the model with early stopping
history = model_lstm.fit(
    X_train_lstm, y_train_lstm,
    epochs=5,
    batch_size=64,
    validation_split=0.1,
    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)]
)

#### 4.1.3 Evaluating the LSTM Model

In [None]:
# Evaluate on the test set
loss, accuracy = model_lstm.evaluate(X_test_lstm, y_test_lstm, verbose=2)
print(f"\nLSTM Model Accuracy: {accuracy:.4f}")

# Generate classification report and confusion matrix
y_pred_lstm_proba = model_lstm.predict(X_test_lstm)
y_pred_lstm = np.argmax(y_pred_lstm_proba, axis=1)
y_test_labels = np.argmax(y_test_lstm, axis=1)

# The order of classes from pd.get_dummies is alphabetical: negative, neutral, positive
class_mapping = {0: 'negative', 1: 'neutral', 2: 'positive'} 
y_pred_lstm_labels = np.vectorize(class_mapping.get)(y_pred_lstm)
y_test_actual_labels = np.vectorize(class_mapping.get)(y_test_labels)

print("\nLSTM Classification Report:")
print(classification_report(y_test_actual_labels, y_pred_lstm_labels, labels=class_labels))

cm_lstm = confusion_matrix(y_test_actual_labels, y_pred_lstm_labels, labels=class_labels)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_lstm, annot=True, fmt='d', cmap='Blues', xticklabels=class_labels, yticklabels=class_labels)
plt.title('Confusion Matrix for LSTM Model')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
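The label decoding above relies on the one-hot columns being in alphabetical class order. A minimal plain-Python sketch of that round trip follows; the helper names `one_hot` and `decode` are illustrative assumptions.

```python
# Alphabetical class order, matching the one-hot column order described above
classes = sorted(["positive", "neutral", "negative"])

def one_hot(label):
    """Encode a label as a one-hot list following the order of `classes`."""
    return [1 if label == cls else 0 for cls in classes]

def decode(probabilities):
    """Return the class whose column holds the highest probability (argmax)."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best]

encoded = one_hot("neutral")
decoded = decode([0.1, 0.2, 0.7])
print(classes, encoded, decoded)
```

If the mapping from column index to class name were written in any other order, every prediction would be decoded to the wrong label without raising an error.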

#### 4.2 State-of-the-Art Model: Transformer (RoBERTa)
**Transformers (like RoBERTa)** are among the strongest architectures for text classification. RoBERTa (A Robustly Optimized BERT Pretraining Approach) keeps BERT's architecture but changes the pre-training recipe (longer training on more data, dynamic masking, and no next-sentence-prediction objective), which often yields better downstream performance.

#### 4.2.1 A Note on Computational Resources
**Warning**: Fine-tuning a Transformer model is highly resource-intensive and slow without a GPU. To make this notebook runnable in a standard environment, we will **train on a small subset of the data (1000 samples) for only one epoch**. The resulting accuracy will not be optimal but will serve as a proof-of-concept for the implementation pipeline.

#### 4.2.2 Preparing Data for RoBERTa

In [None]:
# For RoBERTa, we use the original, un-preprocessed text.
# X_train / X_test hold the cleaned text, so we look up the raw tweets
# through the row indices preserved by train_test_split.
# Requires the Hugging Face `transformers` package with TensorFlow support.
from transformers import RobertaTokenizer, TFRobertaForSequenceClassification

# Create a smaller subset for demonstration purposes
SUBSET_SIZE = 1000
X_train_sub = df.loc[X_train.index[:SUBSET_SIZE], 'text']
y_train_sub = y_train[:SUBSET_SIZE]
X_test_sub = df.loc[X_test.index[:SUBSET_SIZE], 'text']
y_test_sub = y_test[:SUBSET_SIZE]


# Load RoBERTa Tokenizer
tokenizer_roberta = RobertaTokenizer.from_pretrained('roberta-base')

# Tokenize the data subsets
train_encodings = tokenizer_roberta(X_train_sub.tolist(), truncation=True, padding=True, max_length=128)
test_encodings = tokenizer_roberta(X_test_sub.tolist(), truncation=True, padding=True, max_length=128)

# Convert labels to one-hot encoding (float dtype, columns in alphabetical class order)
y_train_encoded = pd.get_dummies(y_train_sub, dtype='float32').values
y_test_encoded = pd.get_dummies(y_test_sub, dtype='float32').values

# Create TensorFlow Datasets
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), y_train_encoded))
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings), y_test_encoded))

#### 4.2.3 Loading and Compiling the RoBERTa Model

In [None]:
# Load pre-trained model
model_roberta = TFRobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=3)

# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
model_roberta.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

# summary() prints the table itself and returns None, so it is not wrapped in print()
model_roberta.summary()
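The loss is configured with `from_logits=True` because the classification head returns raw scores rather than probabilities. The standard-library sketch below shows the computation this implies for a single example: a numerically stable softmax followed by cross-entropy against a one-hot target. The function names and the logit values are illustrative assumptions, and the sketch ignores batching and averaging.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(one_hot, logits):
    """Cross-entropy between a one-hot target and the softmax of raw logits."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

# Made-up logits for the three sentiment classes
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
loss = cross_entropy_from_logits([1.0, 0.0, 0.0], logits)

# An untrained head that outputs equal logits gives a loss of ln(3) for three classes
uniform_loss = cross_entropy_from_logits([0.0, 1.0, 0.0], [0.0, 0.0, 0.0])
print(probs, loss, uniform_loss)
```

A first-epoch loss close to `ln(3) ≈ 1.10` is therefore a sign that the model has not yet learned anything beyond guessing uniformly.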

#### 4.2.4 Fine-Tuning the Model

In [None]:
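# Fine-tune the model on our subset.
# The datasets are already batched, so `batch_size` must not be passed to fit().
print("\nFine-tuning RoBERTa model on a subset of data...")
roberta_history = model_roberta.fit(
    train_dataset.shuffle(SUBSET_SIZE).batch(16),  # buffer covers the whole subset
    epochs=1,
    validation_data=test_dataset.batch(16)  # validation data does not need shuffling
)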
print("Fine-tuning complete.")

#### 4.2.5 Evaluating the RoBERTa Model

In [None]:
# Evaluate on the test subset
loss_roberta, accuracy_roberta = model_roberta.evaluate(test_dataset.batch(16))
print(f"\nRoBERTa Model Accuracy on Subset: {accuracy_roberta:.4f}")

# Generate classification report and confusion matrix
y_pred_roberta_logits = model_roberta.predict(test_dataset.batch(16)).logits
y_pred_roberta = np.argmax(y_pred_roberta_logits, axis=1)
y_test_roberta_labels = np.argmax(y_test_encoded, axis=1)


y_pred_roberta_mapped = np.vectorize(class_mapping.get)(y_pred_roberta)
y_test_actual_roberta_mapped = np.vectorize(class_mapping.get)(y_test_roberta_labels)


print("\nRoBERTa Classification Report (on Subset):")
print(classification_report(y_test_actual_roberta_mapped, y_pred_roberta_mapped, labels=class_labels))

cm_roberta = confusion_matrix(y_test_actual_roberta_mapped, y_pred_roberta_mapped, labels=class_labels)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_roberta, annot=True, fmt='d', cmap='Blues', xticklabels=class_labels, yticklabels=class_labels)
plt.title('Confusion Matrix for RoBERTa Model (on Subset)')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

### 5. Model Comparison & Final Selection
Here we will programmatically compare the results stored from each model run and make a final decision.

#### 5.1 Results Leaderboard

In [None]:
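# Collect the test accuracy of every model evaluated above.
# The baseline models were fitted in Section 3.2; `accuracy` and
# `accuracy_roberta` come from the LSTM and RoBERTa evaluation cells.
results = {name: accuracy_score(y_test, model.predict(X_test_tfidf)) for name, model in models.items()}
results['LSTM'] = accuracy
results['RoBERTa (subset)'] = accuracy_roberta

# Create a DataFrame from the results dictionary
results_df = pd.DataFrame(list(results.items()), columns=['Model', 'Accuracy'])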
results_df = results_df.sort_values(by='Accuracy', ascending=False).reset_index(drop=True)

print("--- Model Performance Leaderboard ---")
print(results_df)

# Visualize the results
plt.figure(figsize=(10, 6))
sns.barplot(x='Accuracy', y='Model', data=results_df, palette='viridis')
plt.title('Model Accuracy Comparison')
plt.xlim(0.5, 1.0)
plt.show()

#### 5.2 Final Decision
Let's briefly summarize the results based on the leaderboard:
- **Scikit-learn Models**: The `Linear SVM` and `Logistic Regression` models provide excellent baseline accuracies and are extremely fast to train. They are strong contenders for any text classification task.
- **LSTM Model**: This model typically achieves a competitive accuracy, demonstrating its ability to understand word order and context, which the TF-IDF models cannot do.
- **RoBERTa (on Subset)**: The accuracy on the small subset is for demonstration only and cannot be directly compared. However, it establishes a working pipeline. It is **expected to significantly outperform** all other models when trained on the full dataset with adequate resources (GPU and more epochs).

**Decision**: For this project, which aims for a balance of high performance and manageable complexity, the **LSTM model** is the best choice. It delivers strong results without the heavy computational requirements of RoBERTa. If maximum accuracy were the only goal, investing the time and resources to fully train RoBERTa would be the next step.

We will select the **LSTM model** as our final, savable artifact.

### 6. Saving and Predicting with the Final Model
We will save the trained LSTM model and its corresponding Keras tokenizer so we can use them for inference without retraining.


#### 6.1 Saving the Artifacts

In [None]:
# Define final model paths
LSTM_MODEL_PATH = os.path.join(MODEL_DIR, 'final_lstm_model.keras')
TOKENIZER_PATH = os.path.join(MODEL_DIR, 'final_keras_tokenizer.pkl')

# Save the Keras model and tokenizer
model_lstm.save(LSTM_MODEL_PATH)
joblib.dump(keras_tokenizer, TOKENIZER_PATH)

print(f"Final Keras Tokenizer saved to: {TOKENIZER_PATH}")
print(f"Final LSTM Model saved to: {LSTM_MODEL_PATH}")
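Before relying on saved artifacts for inference, it can help to confirm that they reload to an equivalent object. The sketch below shows such a round-trip check with the standard library only: `pickle` stands in for `joblib`, a plain dictionary stands in for the fitted tokenizer, and a temporary directory stands in for `MODEL_DIR`. All of these are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted tokenizer's vocabulary
vocab = {"delayed": 1, "crew": 2}

with tempfile.TemporaryDirectory() as model_dir:
    path = os.path.join(model_dir, "tokenizer.pkl")

    # Save the artifact
    with open(path, "wb") as fh:
        pickle.dump(vocab, fh)

    # Reload it, as an inference script would
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

round_trip_ok = restored == vocab
print(round_trip_ok)
```

The same pattern applies to the real tokenizer: reload it right after saving and compare a few encoded sequences against the in-memory version.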

#### 6.2 Prediction Function

In [None]:
def predict_new_tweet(text: str):
    """Loads final model artifacts and predicts sentiment for a new text string."""
    # Load the saved artifacts
    try:
        loaded_tokenizer = joblib.load(TOKENIZER_PATH)
        loaded_model = tf.keras.models.load_model(LSTM_MODEL_PATH)
    except FileNotFoundError:
        print("Model files not found. Please train and save the model first.")
        return

    # Preprocess and tokenize the new text
    processed_text = preprocess_text(text)
    sequence = loaded_tokenizer.texts_to_sequences([processed_text])
    padded_sequence = pad_sequences(sequence, maxlen=MAX_SEQUENCE_LENGTH)

    # Predict
    prediction_proba = loaded_model.predict(padded_sequence)[0]
    
    # Get class with highest probability
    class_labels = ['negative', 'neutral', 'positive'] # Alphabetical order from get_dummies
    prediction_label = class_labels[np.argmax(prediction_proba)]
    probabilities = dict(zip(class_labels, prediction_proba))

    print(f"\nTweet: '{text}'")
    print(f"Predicted Sentiment: -> {prediction_label} <-")
    print("Probabilities:")
    for sentiment, prob in probabilities.items():
        print(f"  - {sentiment}: {prob:.4f}")
    print("-" * 30)

#### 6.3 Test Cases

In [None]:
predict_new_tweet("I am so happy with their service, it was an amazing journey!")
predict_new_tweet("The plane was dirty and the staff was rude. Never flying with them again.")
predict_new_tweet("My flight from JFK to LAX is on time.")
predict_new_tweet("@AmericanAir you are the worst. My flight is delayed again!")

NameError: name 'joblib' is not defined