<a href="https://colab.research.google.com/github/ariesslin/ie7500-g1-tweet-sentiment-nlp/blob/main/scripts/3.%20Model_Development.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="background-color:#e6f2ff; border-left:8px solid #0059b3; padding:20px; margin:20px 0;">
  <h2 style="color:#003366;"><strong>3. Model Development</strong></h2>
  <p style="color:#333333;">Model Selection and Preliminary Performance Testing</p>
</div>

## Model Development Overview: Multi-Architecture Sentiment Classification

This notebook presents a structured comparison of three complementary approaches to tweet sentiment classification: **TF-IDF + Logistic Regression**, **Bidirectional LSTM**, and **DistilBERT**. Each model represents a different point on the spectrum from traditional machine learning to state-of-the-art deep learning.

### Three-Model Comparison Strategy

Our multi-model approach systematically evaluates:

1. **Classical Baseline**: TF-IDF + Logistic Regression for interpretable, fast classification
2. **Deep Learning**: Bidirectional LSTM for context-aware sequential modeling  
3. **Transformer**: DistilBERT for sophisticated contextual understanding

### Evaluation Methodology

All models are evaluated using consistent metrics including accuracy, precision, recall, F1-score, and computational efficiency to enable fair comparison and informed model selection.

### Notebook Structure

This notebook provides high-level overviews of each approach, with detailed implementations available in separate focused notebooks:

- **Section 3.1**: Baseline model overview → Also refer to **[3a-Logistic-Regression.ipynb](./3a-Logistic-Regression.ipynb)**
- **Section 3.2**: LSTM model overview → Also refer to **[3b-LSTM.ipynb](./3b-LSTM.ipynb)**  
- **Section 3.3**: Transformer model overview → Also refer to **[3c-BERT.ipynb](./3c-BERT.ipynb)**
- **Section 3.4**: Comparative analysis and results



<div style="background-color:#e6f2ff; border-left:8px solid #0059b3; padding:20px; margin:20px 0;">
  <h2 style="color:#003366;"><strong>3.1 Baseline Model – TF-IDF + Logistic Regression</strong></h2>
  <p style="color:#333333;">Overview of TF-IDF vectorization + logistic regression baseline model.</p>
</div>


## Baseline Model Overview: TF-IDF + Logistic Regression

The **TF-IDF + Logistic Regression** model serves as our baseline for tweet sentiment classification. This approach combines term frequency-inverse document frequency vectorization with a linear classifier to provide a strong, interpretable foundation for comparison.

### Key Features of the Baseline Model:

**TF-IDF Vectorization:**
- Converts raw tweets into numerical feature vectors
- Measures word importance relative to the entire corpus
- Handles sparse, high-dimensional text data effectively
- Applies L2 normalization to prevent length bias
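
The weighting and normalization steps above can be illustrated with a minimal, standard-library sketch. It assumes whitespace tokenization and the smoothed IDF formula that scikit-learn's `TfidfVectorizer` uses by default; the helper name `tfidf_l2` and the toy tweets are hypothetical and not taken from the implementation notebook.

```python
import math
from collections import Counter


def tfidf_l2(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        # L2 normalization gives every non-empty tweet unit length,
        # so long tweets do not dominate purely by word count.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


tweets = ["good good day", "bad day", "good vibes only"]  # toy examples
vectors = tfidf_l2(tweets)
for tweet, vec in zip(tweets, vectors):
    print(tweet, {term: round(weight, 3) for term, weight in vec.items()})
```

In the second tweet, "bad" receives a higher weight than "day" because it appears in fewer documents of the toy corpus.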

**Logistic Regression Classifier:**
- Linear model that's fast to train and highly interpretable  
- Provides probability estimates for sentiment predictions
- Works well with TF-IDF's sparse feature representation
- Enables coefficient analysis to understand word influences

**Hyperparameter Optimization:**
- Grid search across TF-IDF parameters (max_features, ngram_range, min_df, max_df)
- Logistic regression tuning (regularization strength, class weighting)
- 3-fold cross-validation for robust parameter selection
- Weighted F1-score optimization for balanced performance
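
The search described above can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the grid used in the implementation notebook: the toy corpus, the candidate values in `param_grid`, and `max_iter=1000` are all placeholders chosen so the example runs quickly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = positive, 0 = negative.
texts = [
    "love this phone", "great day today", "love the great weather",
    "so happy today", "happy with this phone", "great great service",
    "hate this phone", "awful day today", "hate the awful weather",
    "so sad today", "sad about this phone", "awful awful service",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate values are illustrative; the real grid is in 3a-Logistic-Regression.ipynb.
param_grid = {
    "tfidf__max_features": [50, None],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [1, 2],
    "tfidf__max_df": [0.9, 1.0],
    "clf__C": [0.1, 1.0, 10.0],               # inverse regularization strength
    "clf__class_weight": [None, "balanced"],
}

# 3-fold cross-validation, selecting the combination with the best weighted F1.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_weighted")
search.fit(texts, labels)

print(search.best_params_)
print(f"best weighted F1 (CV): {search.best_score_:.3f}")
```

Putting the vectorizer inside the `Pipeline` means TF-IDF statistics are refit on each training fold, which keeps validation-fold vocabulary from leaking into the features.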

### Performance Summary:
- **Validation Accuracy**: ~78.16%
- **Precision**: ~77.00%
- **Recall**: ~80.32%
- **F1 Score**: ~78.62%
- **Strengths**: Fast, interpretable, solid baseline performance
- **Limitations**: Context-insensitive, struggles with negation and nuance

This baseline establishes the performance threshold that our deep learning models (LSTM and BERT) should surpass.


<div style="background-color:#e6f2ff; border-left:8px solid #0059b3; padding:20px; margin:20px 0;">
  <h2 style="color:#003366;"><strong>3.2 Deep Learning Model – Bidirectional LSTM</strong></h2>
  <p style="color:#333333;">Bidirectional LSTM with Word2Vec embeddings for sequence-aware sentiment analysis.</p>
</div>

## Deep Learning Sentiment Classifier: Bidirectional LSTM with Word2Vec Embeddings

The **Bidirectional LSTM + Word2Vec** model represents our deep learning approach for tweet sentiment classification, designed to capture the sequential nature of language and semantic relationships between words.

### Key Features of the LSTM Model:

**Word2Vec Embeddings:**
- Dense word vectors that capture semantic relationships between words
- Trained specifically on our tweet corpus for domain-specific representations  
- 100-dimensional embeddings with 100% vocabulary coverage
- Words with similar meanings positioned close together in embedding space

**Bidirectional LSTM Architecture:**
- Processes text in both forward and backward directions for complete context understanding
- Two-layer bidirectional LSTM with 128 and 64 hidden units respectively
- Dropout layers (0.5 and 0.3) to prevent overfitting during training
- Captures long-term dependencies and sequential patterns in tweets
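
To give a sense of scale, the recurrent layers' trainable parameter counts can be worked out from the sizes listed above. This sketch assumes standard LSTM cells (four gates with biases), forward and backward outputs concatenated (the Keras `Bidirectional` default), and the two layers stacked directly; dropout layers add no parameters. The helper names are illustrative.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Trainable parameters of one standard LSTM layer (4 gates, with biases)."""
    return 4 * (units * (input_dim + units) + units)


def bilstm_params(input_dim: int, units: int) -> int:
    """A bidirectional layer holds two independent LSTMs."""
    return 2 * lstm_params(input_dim, units)


embedding_dim = 100                          # Word2Vec vector size
layer1 = bilstm_params(embedding_dim, 128)   # 234,496
layer2 = bilstm_params(2 * 128, 64)          # input is the 256-d concatenated output: 164,352

print(f"BiLSTM layer 1: {layer1:,}")
print(f"BiLSTM layer 2: {layer2:,}")
print(f"Recurrent total: {layer1 + layer2:,}")  # 398,848
```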

**Advanced Features:**
- Handles negation, word order, and contextual sentiment better than baseline
- Early stopping with patience=3 to prevent overfitting
- Nadam optimizer for improved convergence on noisy tweet data
- Sequence padding to handle variable tweet lengths efficiently
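
Two of the mechanisms above, sequence padding and early stopping with `patience=3`, can be sketched in plain Python. The function names and the validation-loss values are hypothetical. The padding helper truncates and pads on the right, whereas Keras' `pad_sequences` pads and truncates on the left by default.

```python
def pad_sequences(sequences, max_len, pad_id=0):
    """Truncate or right-pad each token-id list to exactly max_len."""
    return [seq[:max_len] + [pad_id] * (max_len - len(seq)) for seq in sequences]


def early_stopping(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch


print(pad_sequences([[5, 3], [7, 8, 9, 4]], max_len=3))
# → [[5, 3, 0], [7, 8, 9]]

# Made-up loss curve that bottoms out at epoch 6.
losses = [0.52, 0.47, 0.45, 0.44, 0.435, 0.43, 0.44, 0.45, 0.46, 0.47]
print(early_stopping(losses, patience=3))
# → (9, 6): stop after epoch 9, keep the weights from epoch 6
```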

### Performance Summary:
- **Validation Accuracy**: ~80.25%
- **Precision**: ~79.79%
- **Recall**: ~81.02%
- **F1 Score**: ~80.40%
- **Strengths**: Context-aware, handles sequence patterns, semantic understanding
- **Limitations**: Computationally expensive, requires more training time

This LSTM model bridges the gap between simple linear models and sophisticated transformers, providing strong performance through sequential modeling.


<div style="background-color:#e6f2ff; border-left:8px solid #0059b3; padding:20px; margin:20px 0;">
  <h2 style="color:#003366;"><strong>3.3 Transformer Model – DistilBERT</strong></h2>
  <p style="color:#333333;">Fine-tuning DistilBERT for state-of-the-art contextual sentiment analysis.</p>
</div>


## Transformer: DistilBERT Fine-tuning

The **DistilBERT** model represents our state-of-the-art approach for tweet sentiment classification, leveraging pre-trained transformer architecture for deep contextual understanding.

### Key Features of the DistilBERT Model:

**Pre-trained Transformer Architecture:**
- Distilled version of BERT that retains about 97% of BERT's performance with 40% fewer parameters and roughly 60% faster inference
- Pre-trained on 16GB of text data, providing rich contextual representations
- Bidirectional attention mechanism for complete sentence understanding
- Fine-tuned specifically for binary sentiment classification

**Advanced NLP Capabilities:**
- Handles complex linguistic patterns like sarcasm, negation, and context-dependent sentiment
- Understands word relationships across entire tweet sequences simultaneously
- Processes subword tokens for better handling of informal social media language
- Maximum sequence length of 96 tokens, sized for short tweet text

**Training Configuration:**
- 3 epochs with learning rate of 2e-5 for effective fine-tuning
- Batch size of 32 for efficient GPU utilization
- Early stopping and weight decay for regularization
- Optimized sequence length of 96 tokens based on EDA findings
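
The settings above can be gathered into a small configuration object. This is a sketch, not the notebook's code: the class name `FineTuneConfig`, the checkpoint name `distilbert-base-uncased`, the `weight_decay` value, and the training-set size in the usage line are assumptions, since the source does not state them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    model_name: str = "distilbert-base-uncased"  # assumed checkpoint
    num_epochs: int = 3
    learning_rate: float = 2e-5
    batch_size: int = 32
    max_length: int = 96          # tokens per tweet after truncation/padding
    weight_decay: float = 0.01    # placeholder: value not given in the source

    def steps_per_epoch(self, n_train: int) -> int:
        """Optimizer steps needed to see every training example once."""
        return math.ceil(n_train / self.batch_size)

    def total_steps(self, n_train: int) -> int:
        """Upper bound on steps; early stopping may end training sooner."""
        return self.num_epochs * self.steps_per_epoch(n_train)


config = FineTuneConfig()
print(config.total_steps(n_train=10_000))  # hypothetical dataset size → 939
```

The field names map naturally onto the usual fine-tuning arguments (epochs, learning rate, per-device batch size, weight decay), so the same values can be passed to whichever training loop is used.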

### Performance Summary:
- **Validation Accuracy**: ~81.18%
- **Precision**: ~81.52%
- **Recall**: ~80.63%
- **F1 Score**: ~81.07%
- **Strengths**: Superior contextual understanding, handles complex linguistic patterns
- **Limitations**: Computationally expensive, requires significant GPU resources

This transformer model achieves the highest overall accuracy and F1 score, demonstrating excellent contextual understanding for robust sentiment classification.


<div style="background-color:#e6f2ff; border-left:8px solid #0059b3; padding:20px; margin:20px 0;">
  <h2 style="color:#003366;"><strong>3.4 Model Comparison and Results Analysis</strong></h2>
  <p style="color:#333333;"></p>
</div>


## Comprehensive Model Comparison and Performance Analysis

This section presents a detailed comparative evaluation of three complementary approaches to tweet sentiment classification: **TF-IDF + Logistic Regression**, **Bidirectional LSTM**, and **DistilBERT**. The analysis encompasses performance metrics, computational efficiency, and practical deployment considerations based on results from the detailed implementation notebooks.

---

### 1. Logistic Regression — Validation Results

**Performance Metrics:**
- **Accuracy**: 78.16%
- **Precision**: 77.00%
- **Recall**: 80.32%
- **F1 Score**: 78.62%

**Confusion Matrix:**

|                   | **Predicted Negative** | **Predicted Positive** |
|-------------------|------------------------|-------------------------|
| **Actual Negative** | 54,639                 | 26,013                  |
| **Actual Positive** | 15,678                 | 100,000+                |

**Sentiment Analysis Insights:**
- Logistic Regression shows solid baseline performance and favors **recall**, which means it's highly sensitive to detecting **positive sentiments** in tweets.
- However, the high number of **false positives (26,013)** suggests it struggles to distinguish **genuinely negative tweets**, often misclassifying them as positive.
- This behavior may be due to:
  - Over-simplification of tweet content and lack of contextual understanding.
  - Tweets containing mixed signals (e.g., sarcasm or slang) being interpreted incorrectly.

- **Use Case Suitability**:
  - Adequate for large-scale monitoring where missing positive sentiment is riskier than mistakenly labeling negative tweets as positive (e.g., brand loyalty tracking).
  - Not ideal where negative sentiment needs precise monitoring (e.g., social crisis detection).

---

### 2. LSTM — Validation and Training Results

**Performance Metrics:**
- **Accuracy**: 80.25%
- **Precision**: 79.79%
- **Recall**: 81.02%
- **F1 Score**: 80.40%

**Confusion Matrix:**

|                   | **Predicted Negative** | **Predicted Positive** |
|-------------------|------------------------|-------------------------|
| **Actual Negative** | 95,344                 | 24,616                  |
| **Actual Positive** | 22,770                 | 97,187                  |
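
The reported LSTM metrics can be cross-checked against the confusion matrix above. A minimal sketch, treating "positive" as the positive class; the helper name `binary_metrics` is illustrative and not taken from the implementation notebook.

```python
def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


lstm = binary_metrics(tn=95_344, fp=24_616, fn=22_770, tp=97_187)
for name, value in lstm.items():
    print(f"{name}: {value:.2%}")
# accuracy: 80.25%
# precision: 79.79%
# recall: 81.02%
# f1: 80.40%
```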

**Training Summary:**
- Validation accuracy and loss stabilized around epoch 6, suggesting that’s the best point to stop training.
- Model learned rapidly early on, but overfitting started appearing after epoch 6.

**Sentiment Analysis Insights:**
- The LSTM model is especially well-suited for sentiment classification on tweets due to its ability to model **sequence dependencies**.
- Tweets often contain **non-standard grammar**, **emojis**, and **elongations** (e.g., "soooo goood"), which LSTM handles more effectively than a linear model.
- Strong performance in **both recall and precision** implies it can:
  - Accurately detect **positive** sentiments.
  - Avoid incorrectly labeling **negative tweets** as positive.

- **Use Case Suitability**:
  - Ideal for real-time sentiment dashboards and public opinion tracking tools.
  - Offers a strong balance between catching enthusiastic sentiment and minimizing false optimism.

---

### 3. BERT — Validation and Training Results

**Performance Metrics:**
- **Accuracy**: 81.18%
- **Precision**: 81.52%
- **Recall**: 80.63%
- **F1 Score**: 81.07%

**Confusion Matrix:**

|                   | **Predicted Negative** | **Predicted Positive** |
|-------------------|------------------------|-------------------------|
| **Actual Negative** | ~60,000                | 20,991                  |
| **Actual Positive** | 16,150                 | 100,000+                |

**Training Summary:**
- Training completed in ~31 minutes with early stopping after detecting optimal performance.
- Mixed precision training and optimized sequence length (96 tokens) enabled efficient fine-tuning.

**Sentiment Analysis Insights:**
- DistilBERT achieves the best **overall accuracy** and **F1 score**, demonstrating balanced performance across all metrics.
- The transformer's attention mechanism excels at capturing contextual nuances in tweets with complex language patterns.
- Slightly lower recall compared to LSTM suggests DistilBERT is more conservative in positive predictions, leading to higher precision in sentiment detection.

- **Use Case Suitability**:
  - Excellent for **automated sentiment scoring**, content moderation, or **flagging key influencer reactions**.
  - Especially beneficial when **false positives** (mistakenly labeling negativity as positivity) are more damaging (e.g., in reputation management or crisis alerts).

---

### 4. Side-by-Side Model Comparison

| **Metric**              | **TF-IDF + LogReg** | **LSTM**           | **DistilBERT**     |
|-------------------------|---------------------|--------------------|--------------------| 
| **Accuracy**            | 78.16%              | 80.25%             | **81.18%**         |
| **Precision**           | 77.00%              | 79.79%             | **81.52%**         |
| **Recall**              | 80.32%              | **81.02%**         | 80.63%             |
| **F1 Score**            | 78.62%              | 80.40%             | **81.07%**         |
| **False Positives**     | 26,013              | 24,616             | **20,991**         |
| **False Negatives**     | **15,678**          | 22,770             | 16,150             |
| **Training Time**       | ~5 minutes          | ~11 minutes (early stop)       | ~30 minutes (early stop)        |
| **Inference Speed**     | **Very Fast**       | Fast               | Moderate           |
| **Interpretability**    | **High**            | Low                | Low                |
| **Resource Needs**      | **Low**             | Moderate           | High               |
| **Context Understanding**| Limited            | Good               | **Excellent**      |

**Overall Insights:**
- **Logistic Regression** remains a decent choice for high-volume, low-compute environments, where identifying most of the **positive sentiments** is more critical than precision.
- **LSTM** achieves the highest **recall** (81.02%) while keeping precision close behind (79.79%), so it picks up **positive sentiment** well without many false detections. Its sequential modeling helps handle emotive expressions common in tweets.
- **DistilBERT** offers the best **precision**, accuracy, and F1 score, indicating it is the most reliable when assigning **positive sentiment**, though its slightly lower recall means it may miss some positive tweets, for example those that use creative or ambiguous language.

### Decision Framework for Model Selection

#### Choose **TF-IDF + Logistic Regression** When:
- **Speed is Critical**: Sub-second response times required
- **Resource Constraints**: CPU-only environments or extreme scale requirements  
- **Interpretability Needed**: Must understand individual feature contributions
- **Rapid Prototyping**: Quick experimentation and baseline establishment

#### Choose **Bidirectional LSTM** When:
- **Balanced Performance**: Need good precision-recall balance
- **Sequential Understanding**: Context awareness without full transformer overhead
- **Mid-Range Hardware**: Limited to consumer-grade GPUs
- **Real-time Applications**: Sentiment dashboards and monitoring systems

#### Choose **DistilBERT** When:
- **Accuracy is Critical**: Small improvements justify computational cost
- **Complex Language Patterns**: Must handle sarcasm, negation, context-dependent sentiment
- **Sufficient Resources**: GPU infrastructure and computational budget available
- **Production Quality**: Building customer-facing or high-stakes applications

---

### Key Insights and Recommendations

**Performance Progression:**
The models show a clear progression from traditional ML to deep learning, with accuracy gains of ~2.1 percentage points (baseline to LSTM) and ~3.0 percentage points (baseline to DistilBERT).

**Computational Trade-offs:**
- DistilBERT achieves the highest precision but takes roughly **2.7x as long to train** as the LSTM (~30 vs ~11 minutes)
- Early stopping significantly reduced training times for both deep learning models while maintaining performance
- LSTM provides efficient training with good performance gains over the baseline
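
The accuracy gains and the training-time ratio quoted in this section follow from simple arithmetic on the reported figures. The sketch below uses the validation accuracies and the approximate training times from the comparison table; the variable names are illustrative.

```python
accuracy = {"TF-IDF + LogReg": 78.16, "LSTM": 80.25, "DistilBERT": 81.18}  # percent
train_minutes = {"TF-IDF + LogReg": 5, "LSTM": 11, "DistilBERT": 30}       # approximate

baseline = "TF-IDF + LogReg"
gains = {}
for model in ("LSTM", "DistilBERT"):
    # Absolute gain in percentage points, and the same gain relative to the baseline.
    gains[model] = accuracy[model] - accuracy[baseline]
    relative = gains[model] / accuracy[baseline] * 100
    print(f"{model}: +{gains[model]:.2f} points ({relative:.2f}% relative)")
# LSTM: +2.09 points (2.67% relative)
# DistilBERT: +3.02 points (3.86% relative)

time_ratio = train_minutes["DistilBERT"] / train_minutes["LSTM"]
print(f"DistilBERT / LSTM training time: {time_ratio:.1f}x")
# DistilBERT / LSTM training time: 2.7x
```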

**Error Analysis:**
- **False Positives**: DistilBERT (20,991) < LSTM (24,616) < LogReg (26,013)
- **False Negatives**: LogReg (15,678) < DistilBERT (16,150) < LSTM (22,770)
- **DistilBERT** shows the best overall error distribution with lowest false positives

**Deployment Recommendations:**
- **High-Throughput Systems**: TF-IDF + Logistic Regression
- **Balanced Production Applications**: Bidirectional LSTM  
- **Accuracy-Critical Applications**: DistilBERT with optimization techniques

This comprehensive comparison enables informed model selection based on specific application requirements, computational constraints, and performance priorities.
