# Twitter Sentiment Analysis - Evaluation & Interpretability

This notebook focuses on evaluating the trained models and interpreting their predictions:

1. Load the best-performing models
2. Evaluate models on the test set
3. Generate classification reports and confusion matrices
4. Apply SHAP values for model interpretability
5. Compare feature importance across different models
6. Analyze misclassified examples

## 1. Setup and Imports

In [0]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import os
import warnings
warnings.filterwarnings('ignore')

# Machine learning libraries
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix, roc_auc_score
import scipy.sparse as sp

# Deep learning libraries
import torch
import torch.nn as nn

# Interpretability
import shap

# Visualization settings
plt.style.use('ggplot')
sns.set(style='whitegrid')
%matplotlib inline

## 2. Load Models and Test Data

In [0]:
# For Google Colab, uncomment these lines to mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
# features_dir = '/content/drive/MyDrive/path/to/features'
# models_dir = '/content/drive/MyDrive/path/to/models'
# results_dir = '/content/drive/MyDrive/path/to/results'

# For local development
features_dir = '../data/features'
models_dir = '../models'
results_dir = '../results'
os.makedirs(results_dir, exist_ok=True)

# Load model comparison results
model_results = pd.read_csv(os.path.join(results_dir, 'model_comparison_results.csv'))

# Get top performing models (assumes the CSV rows are already sorted best-first)
top_models = model_results.head(3)
print("Top 3 performing models:")
top_models

In [0]:
# Load test data
# Load labels
y = np.load(os.path.join(features_dir, 'labels.npy'))

# Load features
X_bow = sp.load_npz(os.path.join(features_dir, 'bow_features.npz'))
X_tfidf = sp.load_npz(os.path.join(features_dir, 'tfidf_features.npz'))
X_word2vec = np.load(os.path.join(features_dir, 'word2vec_features.npy'))
X_glove = np.load(os.path.join(features_dir, 'glove_features.npy'))
X_bert = np.load(os.path.join(features_dir, 'bert_features.npy'))

print("Features loaded successfully.")

# Recreate the train/val/test splits. This reproduces the training-time test set
# only if training used the same test_size, val_size, and random_state.
from sklearn.model_selection import train_test_split

def split_data(X, y, test_size=0.15, val_size=0.15, random_state=42):
    # First split: training + validation vs test
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    
    # Second split: training vs validation
    # Adjust validation size to be a percentage of the training + validation set
    val_ratio = val_size / (1 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_val, y_train_val, test_size=val_ratio, random_state=random_state, stratify=y_train_val
    )
    
    return X_train, X_val, X_test, y_train, y_val, y_test

# Create train/val/test splits for each feature type
# BoW features
_, _, bow_test, _, _, y_test = split_data(X_bow, y)

# TF-IDF features
_, _, tfidf_test, _, _, _ = split_data(X_tfidf, y, random_state=42)

# Word2Vec features
_, _, w2v_test, _, _, _ = split_data(X_word2vec, y, random_state=42)

# GloVe features
_, _, glove_test, _, _, _ = split_data(X_glove, y, random_state=42)

# BERT features
_, _, bert_test, _, _, _ = split_data(X_bert, y, random_state=42)

print(f"Test set: {y_test.shape[0]} samples")

In [0]:
# Load label encoder
with open(os.path.join(models_dir, 'label_encoder.pkl'), 'rb') as f:
    label_encoder = pickle.load(f)

# Display label encoding
print("Label Encoding:")
for i, label in enumerate(label_encoder.classes_):
    print(f"{i} -> {label}")

## 3. Load and Evaluate Best Models

In [0]:
# Function to load a saved sklearn model
def load_sklearn_model(model_path):
    with open(model_path, 'rb') as f:
        model = pickle.load(f)
    return model

# Function to load BiLSTM model
def load_bilstm_model(model_path, input_dim):
    # Define BiLSTM model class
    class BiLSTMClassifier(nn.Module):
        def __init__(self, input_dim, hidden_dim, output_dim, n_layers, dropout):
            super(BiLSTMClassifier, self).__init__()
            self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=n_layers, bidirectional=True, dropout=dropout, batch_first=True)
            self.fc = nn.Linear(hidden_dim * 2, output_dim)
            
        def forward(self, text):
            # text shape: [batch size, input dim]
            # We need to add sequence length dimension for LSTM
            text = text.unsqueeze(1)  # Now: [batch size, 1, input dim]
            
            output, (hidden, cell) = self.lstm(text)
            
            # Concatenate the final forward and backward hidden states
            hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
            
            return self.fc(hidden)
    
    # Create model with same architecture
    hidden_dim = 128
    output_dim = len(label_encoder.classes_)
    n_layers = 2
    dropout = 0.5
    
    model = BiLSTMClassifier(input_dim, hidden_dim, output_dim, n_layers, dropout)
    
    # Load weights
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.load_state_dict(torch.load(model_path, map_location=device))
    model = model.to(device)
    model.eval()
    
    return model, device

# Function to evaluate model and generate report
def evaluate_model(model, X_test, y_test, model_name, feature_name, is_bilstm=False, device=None):
    if is_bilstm:
        # Predict using BiLSTM
        X_test_tensor = torch.FloatTensor(X_test).to(device)
        with torch.no_grad():
            outputs = model(X_test_tensor)
            _, y_pred = torch.max(outputs, 1)
            y_pred = y_pred.cpu().numpy()
    else:
        # Predict using sklearn model
        y_pred = model.predict(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    report = classification_report(y_test, y_pred, target_names=label_encoder.classes_, output_dict=True)
    conf_matrix = confusion_matrix(y_test, y_pred)
    
    # Print results
    print(f"\n{model_name} with {feature_name} - Test Set Performance:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"F1 Score (weighted): {f1:.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))
    
    # Plot confusion matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
                xticklabels=label_encoder.classes_,
                yticklabels=label_encoder.classes_)
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.title(f'Confusion Matrix - {model_name} with {feature_name}')
    plt.tight_layout()
    plt.savefig(os.path.join(results_dir, f'confusion_matrix_{model_name.lower()}_{feature_name.lower()}.png'))
    plt.show()
    
    return {
        'model_name': model_name,
        'feature_name': feature_name,
        'accuracy': accuracy,
        'f1_score': f1,
        'report': report,
        'confusion_matrix': conf_matrix,
        'predictions': y_pred
    }

In [0]:
# Evaluate each top model
test_results = []

for _, row in top_models.iterrows():
    model_name = row['Model']
    feature_name = row['Feature']
    
    print(f"\nEvaluating {model_name} with {feature_name}...")
    
    # Load the appropriate test set
    if feature_name == 'BoW':
        X_test_features = bow_test
    elif feature_name == 'TF-IDF':
        X_test_features = tfidf_test
    elif feature_name == 'Word2Vec':
        X_test_features = w2v_test
    elif feature_name == 'GloVe':
        X_test_features = glove_test
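    elif feature_name == 'BERT':
        X_test_features = bert_test
    else:
        # Fail loudly instead of silently reusing features from a previous iteration
        raise ValueError(f"Unknown feature type: {feature_name}")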
    
    # Load the model
    if model_name == 'BiLSTM':
        model_path = os.path.join(models_dir, f'bilstm_{feature_name.lower()}.pt')
        model, device = load_bilstm_model(model_path, X_test_features.shape[1])
        result = evaluate_model(model, X_test_features, y_test, model_name, feature_name, is_bilstm=True, device=device)
    else:
        # Customize this part based on how you saved your models
        model_path = os.path.join(models_dir, f'{model_name.lower()}_{feature_name.lower()}.pkl')
        model = load_sklearn_model(model_path)
        result = evaluate_model(model, X_test_features, y_test, model_name, feature_name)
    
    test_results.append(result)
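
The imports at the top include `roc_auc_score`, but the evaluation above reports only accuracy and weighted F1. A minimal sketch of how a one-vs-rest ROC AUC could be computed for a multi-class classifier that exposes `predict_proba` is shown below. It uses a synthetic dataset and a `LogisticRegression` stand-in, both of which are assumptions made for illustration and not the models saved by this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the sentiment features (hypothetical data)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Class probabilities, one column per class, in the order of clf.classes_
proba = clf.predict_proba(X_te)

# Macro-averaged one-vs-rest AUC; multi_class must be set for more than two classes
auc_ovr = roc_auc_score(y_te, proba, multi_class='ovr', average='macro')
print(f"ROC AUC (one-vs-rest, macro): {auc_ovr:.4f}")
```

For a binary problem, `roc_auc_score` expects the scores of the positive class only (for example `proba[:, 1]`) and `multi_class` is not needed.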

## 4. Model Interpretation with SHAP

In [0]:
# Function to interpret sklearn model with SHAP
def interpret_sklearn_model(model, X_train, X_test, feature_names, model_name, feature_type):
    print(f"\nGenerating SHAP values for {model_name} with {feature_type}...")
    
    # KernelExplainer calls the model on perturbed copies of the input, so the
    # feature dimension must match what the model was trained on: subsample rows,
    # never columns. Rows are sampled before densifying to limit memory use.
    rng = np.random.default_rng(42)
    X_train_sample = X_train
    if X_train_sample.shape[0] > 500:
        indices = rng.choice(X_train_sample.shape[0], 500, replace=False)
        X_train_sample = X_train_sample[indices]
    X_test_sample = X_test
    if X_test_sample.shape[0] > 50:
        test_indices = rng.choice(X_test_sample.shape[0], 50, replace=False)
        X_test_sample = X_test_sample[test_indices]

    # Convert sparse matrices to dense if needed
    if sp.issparse(X_train_sample):
        X_train_sample = X_train_sample.toarray()
    if sp.issparse(X_test_sample):
        X_test_sample = X_test_sample.toarray()
    # Use class probabilities when available, otherwise fall back to hard predictions
    predict_fn = model.predict_proba if hasattr(model, 'predict_proba') else model.predict
    explainer = shap.KernelExplainer(predict_fn, X_train_sample)

    # Sample test set for speed if necessary
    if X_test_sample.shape[0] > 50:
        test_indices = np.random.choice(X_test_sample.shape[0], 50, replace=False)
        X_test_sample = X_test_sample[test_indices]

    shap_values = explainer.shap_values(X_test_sample)

    # Plot summary
    plt.figure(figsize=(12, 8))
    shap.summary_plot(shap_values, X_test_sample, feature_names=feature_names, show=False)
    plt.title(f"SHAP Summary Plot - {model_name} with {feature_type}")
    plt.tight_layout()
    plt.savefig(os.path.join(results_dir, f'shap_summary_{model_name.lower()}_{feature_type.lower()}.png'))
    plt.show()

    return shap_values

In [0]:
# Apply SHAP to the best models (only for sklearn models)
for result in test_results:
    if result['model_name'] != 'BiLSTM':  # Skip BiLSTM for now
        model_name = result['model_name']
        feature_name = result['feature_name']
        
        # Load the model
        model_path = os.path.join(models_dir, f'{model_name.lower()}_{feature_name.lower()}.pkl')
        model = load_sklearn_model(model_path)
        
        # Load appropriate features
        if feature_name == 'BoW':
            X_train_features, _, X_test_features, _, _, _ = split_data(X_bow, y)
            # Load vectorizer to get feature names
            with open(os.path.join(models_dir, 'bow_vectorizer.pkl'), 'rb') as f:
                vectorizer = pickle.load(f)
            feature_names = vectorizer.get_feature_names_out()
        elif feature_name == 'TF-IDF':
            X_train_features, _, X_test_features, _, _, _ = split_data(X_tfidf, y)
            # Load vectorizer to get feature names
            with open(os.path.join(models_dir, 'tfidf_vectorizer.pkl'), 'rb') as f:
                vectorizer = pickle.load(f)
            feature_names = vectorizer.get_feature_names_out()
        elif feature_name == 'Word2Vec':
            X_train_features, _, X_test_features, _, _, _ = split_data(X_word2vec, y)
            feature_names = [f"dim_{i}" for i in range(X_train_features.shape[1])]
        elif feature_name == 'GloVe':
            X_train_features, _, X_test_features, _, _, _ = split_data(X_glove, y)
            feature_names = [f"dim_{i}" for i in range(X_train_features.shape[1])]
        elif feature_name == 'BERT':
            X_train_features, _, X_test_features, _, _, _ = split_data(X_bert, y)
            feature_names = [f"dim_{i}" for i in range(X_train_features.shape[1])]
        
        # Interpret model
        interpret_sklearn_model(model, X_train_features, X_test_features, feature_names, model_name, feature_name)
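
The loop above discards the SHAP values returned by `interpret_sklearn_model`, so the importances are only visible in the plots. One way to turn them into a ranked table that can be compared across models is to average the absolute SHAP value per feature. The sketch below assumes the values arrive either as a list with one `(n_samples, n_features)` array per class or as a single array with a trailing class axis; which of the two you get depends on the installed SHAP version. The helper `rank_features_by_shap` and the toy inputs are hypothetical.

```python
import numpy as np

def rank_features_by_shap(shap_values, feature_names, top_k=10):
    """Rank features by mean absolute SHAP value (hypothetical helper)."""
    if isinstance(shap_values, list):
        # One (n_samples, n_features) array per class: stack into a trailing class axis
        stacked = np.stack(shap_values, axis=-1)
    else:
        stacked = np.asarray(shap_values)

    abs_vals = np.abs(stacked)
    if abs_vals.ndim == 3:
        # Average over samples and classes
        importance = abs_vals.mean(axis=(0, 2))
    else:
        # Single-output case: average over samples only
        importance = abs_vals.mean(axis=0)

    order = np.argsort(importance)[::-1][:top_k]
    return [(feature_names[i], float(importance[i])) for i in order]

# Toy SHAP values for 2 samples, 3 features, 2 classes (made-up numbers)
toy_shap = [
    np.array([[0.1, -0.5, 0.0], [0.2, 0.4, 0.0]]),
    np.array([[-0.1, 0.5, 0.0], [-0.2, -0.4, 0.0]]),
]
toy_names = ['good', 'bad', 'the']

ranking = rank_features_by_shap(toy_shap, toy_names, top_k=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Collecting one such ranking per model would make it possible to compare which words or embedding dimensions each model relies on.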

## 5. Analyze Misclassified Examples

In [0]:
# Load original dataset to examine misclassified examples
dataset_path = '../data/cleaned_tweets.csv'
df = pd.read_csv(dataset_path)
print(f"Loaded cleaned dataset with shape: {df.shape}")

# Function to analyze misclassified examples
def analyze_misclassifications(result, df, label_encoder):
    model_name = result['model_name']
    feature_name = result['feature_name']
    y_pred = result['predictions']
    
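    # Get test indices: split_data returns (X_train, X_val, X_test, ...), so splitting
    # an index array puts the test row positions in the third slot
    _, _, test_indices, _, _, _ = split_data(np.arange(len(df)), y)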
    
    # Create DataFrame with test set results
    test_df = df.iloc[test_indices].copy()
    test_df['true_sentiment'] = label_encoder.inverse_transform(y_test)
    test_df['predicted_sentiment'] = label_encoder.inverse_transform(y_pred)
    test_df['correct'] = test_df['true_sentiment'] == test_df['predicted_sentiment']
    
    # Filter misclassified examples
    misclassified = test_df[~test_df['correct']]
    
    print(f"\nMisclassified Examples for {model_name} with {feature_name}:")
    print(f"Total misclassified: {len(misclassified)} out of {len(test_df)} ({len(misclassified)/len(test_df)*100:.2f}%)")
    
    # Count by true sentiment
    true_sentiment_counts = misclassified['true_sentiment'].value_counts()
    print("\nMisclassifications by True Sentiment:")
    for sentiment, count in true_sentiment_counts.items():
        total_in_class = test_df[test_df['true_sentiment'] == sentiment].shape[0]
        print(f"{sentiment}: {count} out of {total_in_class} ({count/total_in_class*100:.2f}%)")
    
    # Show confusion patterns
    print("\nConfusion Patterns:")
    confusion_counts = misclassified.groupby(['true_sentiment', 'predicted_sentiment']).size().reset_index()
    confusion_counts.columns = ['True Sentiment', 'Predicted Sentiment', 'Count']
    confusion_counts = confusion_counts.sort_values('Count', ascending=False)
    print(confusion_counts)
    
    # Display some examples
    print("\nSample Misclassified Examples:")
    for i, (sentiment_pair, group) in enumerate(misclassified.groupby(['true_sentiment', 'predicted_sentiment'])):
        true_sentiment, pred_sentiment = sentiment_pair
        print(f"\nTrue: {true_sentiment}, Predicted: {pred_sentiment}")
        for _, row in group.head(2).iterrows():
            print(f"Tweet: {row['tweet']}")
            print(f"Cleaned: {row['cleaned_tweet']}")
            print("-----")
        
        if i >= 2:  # Limit to a few examples
            break
    
    return misclassified

In [0]:
# Analyze misclassifications for the best model
best_result = max(test_results, key=lambda r: r['f1_score'])  # Highest weighted F1 on the test set
misclassified_examples = analyze_misclassifications(best_result, df, label_encoder)
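
The confusion patterns printed above can also be read straight from the confusion matrix that `evaluate_model` stores in each result. The sketch below shows one possible way to list the most frequent off-diagonal cells. The helper `most_confused_pairs`, the class names, and the counts are all made up for illustration.

```python
import numpy as np

def most_confused_pairs(conf_matrix, class_names, top_k=3):
    """Return (true, predicted, count) for the largest off-diagonal cells (hypothetical helper)."""
    cm = np.asarray(conf_matrix).copy()
    np.fill_diagonal(cm, 0)  # Ignore correct predictions

    # Sort all cells by count, largest first, and map flat positions back to (row, col)
    flat_order = np.argsort(cm, axis=None)[::-1][:top_k]
    rows, cols = np.unravel_index(flat_order, cm.shape)

    return [
        (class_names[r], class_names[c], int(cm[r, c]))
        for r, c in zip(rows, cols)
        if cm[r, c] > 0
    ]

# Toy confusion matrix: rows are true labels, columns are predictions (made-up counts)
toy_cm = np.array([
    [50, 5, 2],
    [8, 40, 1],
    [3, 12, 30],
])
toy_classes = ['Negative', 'Neutral', 'Positive']

pairs = most_confused_pairs(toy_cm, toy_classes, top_k=2)
for true_label, pred_label, count in pairs:
    print(f"{true_label} predicted as {pred_label}: {count}")
```

Rows follow the scikit-learn convention used by `confusion_matrix`, where entry `[i, j]` counts samples of true class `i` predicted as class `j`.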

## 6. Summary and Conclusions

In [0]:
# Create summary table of test results
test_summary = pd.DataFrame([
    {'Model': r['model_name'], 
     'Feature': r['feature_name'], 
     'Accuracy': r['accuracy'], 
     'F1 Score': r['f1_score']} for r in test_results
])

print("Test Set Performance Summary:")
test_summary

In [0]:
# Plot model comparison
plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='F1 Score', hue='Feature', data=test_summary)
plt.title('Test Set Performance (F1 Score)', fontsize=15)
plt.xlabel('Model', fontsize=12)
plt.ylabel('F1 Score', fontsize=12)
plt.ylim(0, 1.0)
plt.tight_layout()
plt.savefig(os.path.join(results_dir, 'test_performance_comparison.png'))
plt.show()

## 7. Key Findings and Insights

Based on our evaluation, we can draw the following conclusions:

1. **Best Model Performance**:
   - The [MODEL] with [FEATURE] achieved the highest F1 score of [SCORE] on the test set.
   - This indicates that [INSIGHT ABOUT THE MODEL/FEATURE COMBINATION].

2. **Feature Representation Impact**:
   - Traditional features (BoW, TF-IDF) performed [COMPARISON] compared to modern embeddings (Word2Vec, GloVe, BERT).
   - This suggests that [INSIGHT ABOUT FEATURE REPRESENTATIONS FOR THIS TASK].

3. **Classification Challenges**:
   - The most commonly confused classes were [CLASS] misclassified as [CLASS].
   - This is likely due to [REASON, e.g., semantic similarity, ambiguity in tweets, etc.].

4. **Feature Importance**:
   - SHAP analysis revealed that [KEY FEATURES] were most important for sentiment classification.
   - These features correspond to [SEMANTIC MEANING, e.g., positive/negative sentiment words, specific topics].

5. **Recommendations**:
   - For practical applications, we recommend using [MODEL] with [FEATURE] due to [REASON].
   - To further improve performance, we could [SUGGESTION, e.g., collect more data, use ensemble methods, etc.].

These findings demonstrate the effectiveness of various approaches for Twitter sentiment analysis in gaming contexts and provide insights into how different feature representations capture sentiment information.