# Sentiment Analysis Assignment - Template

**Course:** [Insert Course Name]  
**Assignment:** Text Classification - Sentiment Analysis  
**Team Members:**
- [Team Member 1]
- [Team Member 2]
- [Team Member 3]
- [Team Member 4]

## Assignment Requirements
- Compare traditional ML (Logistic Regression, SVM, Naive Bayes) vs Deep Learning (RNN, LSTM, GRU)
- Use a publicly available dataset (IMDB, Twitter sentiment, Amazon reviews)
- Conduct EDA with statistical analysis and visualizations
- Apply various text preprocessing and embedding techniques
- Include 2+ experiment tables with parameter variations
- Evaluate with appropriate metrics (MSE, cross-entropy, etc.)
- Submit PDF report + GitHub repo

## Team Contributions
| Member | Tasks |
|--------|-------|
| [Name] | [Specific contributions] |
| [Name] | [Specific contributions] |
| [Name] | [Specific contributions] |
| [Name] | [Specific contributions] |

## 1. Import Libraries

In [None]:
# TODO: Import necessary libraries
# Data manipulation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Text preprocessing
# TODO: Add NLTK, sklearn text processing imports

# Traditional ML models
# TODO: Add sklearn models (LogisticRegression, SVC, MultinomialNB)

# Deep Learning
# TODO: Add TensorFlow/Keras imports for LSTM, GRU, etc.

# Evaluation
# TODO: Add metrics imports

# Set random seeds for reproducibility
np.random.seed(42)
# TODO: Set TensorFlow seed
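
The seeding step above could be collected into one helper, shown here as a minimal sketch. The name `set_seeds` is an assumption of this example, and the TensorFlow branch is optional so the cell still runs before TensorFlow is installed.

```python
import random

import numpy as np


def set_seeds(seed=42):
    """Seed the Python, NumPy and (if installed) TensorFlow random generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import tensorflow as tf  # optional: only needed for the deep learning sections

        tf.random.set_seed(seed)
    except ImportError:
        pass  # TensorFlow is not installed; the other two seeds are still set


set_seeds(42)
first = (random.random(), float(np.random.rand()))

set_seeds(42)
second = (random.random(), float(np.random.rand()))

# Re-seeding reproduces the same draws.
print(first == second)
```

Seeding makes random draws repeatable, but some GPU operations can still be non-deterministic, so small run-to-run differences in deep learning results are possible.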

## 2. Dataset Selection and Loading

**Choose ONE dataset:**
- IMDB Movie Reviews (50k reviews)
- Twitter Sentiment Dataset
- Amazon Product Reviews
- Custom dataset

**Justify your choice:**

In [None]:
# TODO: Load your chosen dataset
# Option 1: Load from Keras datasets
# Option 2: Load from CSV file
# Option 3: Download from online source

# Example structure:
# df = pd.read_csv('your_dataset.csv')
# or
# from tensorflow.keras.datasets import imdb

print("Dataset loaded successfully!")
print(f"Shape: {df.shape if 'df' in locals() else 'Load dataset first'}")
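
As an illustration of Option 2, here is a minimal sketch that reads a CSV and checks its columns. The column names `text` and `sentiment` follow the names used later in this template. The in-memory `csv_source` is a stand-in for a real file path, and its three rows are made-up examples.

```python
import io

import pandas as pd

# Stand-in for a real file: replace `csv_source` with the path to your dataset.
csv_source = io.StringIO(
    "text,sentiment\n"
    '"A wonderful, moving film.",positive\n'
    '"Dull plot and wooden acting.",negative\n'
    '"I would watch it again.",positive\n'
)

df = pd.read_csv(csv_source)

# Fail early if the columns the rest of the notebook relies on are missing.
required = {"text", "sentiment"}
missing = required - set(df.columns)
if missing:
    raise ValueError(f"Dataset is missing columns: {sorted(missing)}")

print(f"Shape: {df.shape}")
print(df.head())
```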

## 3. Exploratory Data Analysis (EDA)

**Requirements:**
- Statistical analysis of dataset
- Visualizations to understand data distribution
- Class balance analysis
- Text length analysis
- Word frequency analysis

In [None]:
# TODO: Basic dataset information
print("=== Dataset Overview ===")
# Shape, columns, data types, missing values

# TODO: Display sample data

In [None]:
# TODO: Sentiment distribution analysis
print("=== Sentiment Distribution ===")
# Value counts, percentages

# TODO: Create visualizations
# Pie chart, bar chart

In [None]:
# TODO: Text length analysis
print("=== Text Length Analysis ===")
# Character count, word count statistics

# TODO: Create histograms and box plots
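
The class balance and text length analyses above might start from something like the following sketch. The four-row DataFrame is toy data standing in for the loaded dataset, and the helper columns `char_count` and `word_count` are names chosen for this example.

```python
import pandas as pd

# Toy data standing in for the loaded dataset (columns `text` and `sentiment`).
df = pd.DataFrame({
    "text": [
        "A wonderful, moving film.",
        "Dull plot and wooden acting.",
        "I would watch it again.",
        "Not worth the ticket price.",
    ],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# Class balance: absolute counts and percentages per label.
counts = df["sentiment"].value_counts()
percentages = df["sentiment"].value_counts(normalize=True) * 100
print(pd.DataFrame({"count": counts, "percent": percentages.round(1)}))

# Text length: characters and whitespace-separated words per document.
df["char_count"] = df["text"].str.len()
df["word_count"] = df["text"].str.split().str.len()
print(df[["char_count", "word_count"]].describe())

# Length broken down by class can reveal systematic differences between labels.
print(df.groupby("sentiment")["word_count"].mean())
```

The `describe()` output feeds directly into histograms and box plots, for example via `df["word_count"].plot(kind="hist")`.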

In [None]:
# TODO: Word frequency analysis
print("=== Word Frequency Analysis ===")
# Most common words overall
# Most common words by sentiment

# TODO: Create word clouds (optional)
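
One possible way to count words overall and per sentiment uses only the standard library, as sketched below. The sample texts and the tiny `STOPWORDS` set are illustrative assumptions. A real analysis would use the full dataset and a complete stopword list.

```python
import re
from collections import Counter

# Toy (text, label) pairs standing in for the dataset rows.
samples = [
    ("A great film with a great cast", "positive"),
    ("Great fun from start to finish", "positive"),
    ("A boring film with a weak plot", "negative"),
]

# Tiny illustrative stopword list; a real one (e.g. NLTK's) is much longer.
STOPWORDS = {"a", "with", "from", "to", "the"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens minus stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]


overall = Counter()
by_label = {}
for text, label in samples:
    tokens = tokenize(text)
    overall.update(tokens)
    by_label.setdefault(label, Counter()).update(tokens)

print("Overall:", overall.most_common(3))
for label, counter in by_label.items():
    print(label, counter.most_common(3))
```

Words that are frequent in one class but rare in the other are usually more informative than words that are frequent everywhere.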

## 4. Text Preprocessing

**Requirements:**
- Handle missing values
- Tokenization
- Stopword removal
- Text cleaning (punctuation, numbers, etc.)
- **Justify your preprocessing choices**

In [None]:
# TODO: Create text preprocessing function
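def preprocess_text(text):
    """
    Preprocess a single text and return the cleaned string.

    Steps:
    1. Lowercase
    2. Remove punctuation
    3. Remove numbers
    4. Tokenization
    5. Remove stopwords
    6. Stemming/Lemmatization
    """
    # Start from the raw text so the function runs before the steps are filled in
    processed_text = text
    # TODO: Implement preprocessing steps
    return processed_text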

# TODO: Apply preprocessing to dataset
# df['cleaned_text'] = df['text'].apply(preprocess_text)

# TODO: Show before/after examples
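
The cleaning steps listed above can be sketched with the standard library alone. The function name `clean_text_sketch` and the small `STOPWORDS` set are assumptions of this example, and stemming/lemmatization is left out on purpose.

```python
import re
import string

# Small illustrative stopword set (an assumption); swap in a fuller list such as NLTK's.
# Negations like "not" are deliberately kept because they can flip sentiment.
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to", "in"}


def clean_text_sketch(text):
    """Lowercase, strip punctuation and digits, tokenize, and drop stopwords."""
    if not isinstance(text, str):  # treat missing values (None, NaN) as empty text
        return ""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


before = "This movie was NOT good: 2/10, a waste of time!"
after = clean_text_sketch(before)
print("Before:", before)
print("After: ", after)
```

Removing punctuation before tokenizing also strips apostrophes, so "don't" becomes "dont". Whether that is acceptable is one of the trade-offs to discuss in the justification.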

### Preprocessing Justification

**TODO: Explain your preprocessing choices:**
- Why did you choose specific steps?
- How do they benefit your models?
- Any trade-offs considered?

## 5. Feature Engineering and Embeddings

**Requirements:**
- Implement multiple embedding techniques
- Compare TF-IDF, Word2Vec, GloVe, etc.
- Justify choices

In [None]:
# TODO: Split data into train/test
from sklearn.model_selection import train_test_split

# X = df['cleaned_text']
# y = df['sentiment']
# X_train, X_test, y_train, y_test = train_test_split(...)
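
A hedged sketch of the split, using toy lists in place of `df['cleaned_text']` and `df['sentiment']`. The 80/20 ratio is an assumption. `stratify` keeps the class ratio the same in both splits, which matters when the dataset is imbalanced.

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for df['cleaned_text'] and df['sentiment'].
texts = [f"review number {i}" for i in range(20)]
labels = [i % 2 for i in range(20)]  # 10 negative (0), 10 positive (1)

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,     # hold out 20% for testing
    random_state=42,   # same seed as the rest of the notebook
    stratify=labels,   # keep the class ratio identical in both splits
)

print(len(X_train), len(X_test))
print("Positive share in test:", sum(y_test) / len(y_test))
```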

In [None]:
# TODO: TF-IDF Vectorization
print("Creating TF-IDF features...")
# from sklearn.feature_extraction.text import TfidfVectorizer

# TODO: Implement and fit TF-IDF
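
A minimal TF-IDF sketch on toy texts. The key point is that the vectorizer is fitted on the training texts only and then reused to transform the test texts, so no vocabulary or document-frequency information leaks from the test set. The `ngram_range` value is just one option to experiment with.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy texts standing in for the cleaned training and test splits.
train_texts = ["great film great cast", "boring film weak plot", "great fun"]
test_texts = ["weak cast but great fun", "unseen words only"]

# Unigrams and bigrams; other settings (max_features, min_df) are worth testing too.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))

X_train_tfidf = vectorizer.fit_transform(train_texts)  # learn vocabulary + IDF weights
X_test_tfidf = vectorizer.transform(test_texts)        # reuse them, do not refit

print("Train matrix:", X_train_tfidf.shape)
print("Test matrix: ", X_test_tfidf.shape)
# A test document made only of unseen words becomes an all-zero row.
print("Non-zero entries in second test row:", X_test_tfidf[1].nnz)
```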

In [None]:
# TODO: Bag of Words (Count Vectorization)
print("Creating BoW features...")
# from sklearn.feature_extraction.text import CountVectorizer

# TODO: Implement Count Vectorizer

In [None]:
# TODO: Word2Vec Embeddings
print("Creating Word2Vec embeddings...")
# from gensim.models import Word2Vec

# TODO: Train Word2Vec model
# TODO: Convert texts to embeddings
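
One common way to turn word vectors into a fixed-length document vector is to average them, as in the sketch below. The 3-dimensional `word_vectors` dictionary is hypothetical. With a trained gensim model, the lookups would go through `model.wv` instead.

```python
import numpy as np

# Hypothetical 3-dimensional vectors; in practice look words up in a trained
# model, e.g. gensim's `model.wv[word]`.
word_vectors = {
    "great": np.array([0.9, 0.1, 0.0]),
    "film": np.array([0.1, 0.8, 0.1]),
    "boring": np.array([-0.8, 0.2, 0.1]),
}
EMBEDDING_DIM = 3


def document_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


docs = [["great", "film"], ["boring", "film", "overall"], ["unknown"]]
X = np.vstack([document_vector(d, word_vectors, EMBEDDING_DIM) for d in docs])
print(X.shape)
print(X)
```

Averaging discards word order, so the resulting features suit the traditional models. The recurrent models instead consume the per-token sequence.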

In [None]:
# TODO: Optional - GloVe embeddings
# Load pre-trained GloVe vectors if available

## 6. Traditional Machine Learning Models

**Requirements:**
- Implement at least: Logistic Regression, SVM, Naive Bayes
- Test with different feature sets
- Hyperparameter tuning

In [None]:
# TODO: Implement Logistic Regression
print("Training Logistic Regression...")
# from sklearn.linear_model import LogisticRegression

# TODO: Train and evaluate model

In [None]:
# TODO: Implement SVM
print("Training SVM...")
# from sklearn.svm import SVC

# TODO: Train and evaluate model

In [None]:
# TODO: Implement Naive Bayes
print("Training Naive Bayes...")
# from sklearn.naive_bayes import MultinomialNB

# TODO: Train and evaluate model

In [None]:
# TODO: Store and compare results
traditional_results = []
# Store results in list of dictionaries for comparison
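
The three models can share one training loop that appends a dictionary per model, as sketched below on toy data. The texts, labels, and dictionary keys are assumptions of this example, and the scores it prints say nothing about real performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Toy data (1 = positive, 0 = negative) standing in for the real splits.
train_texts = [
    "great film", "wonderful acting", "loved it", "great fun",
    "boring film", "terrible acting", "hated it", "boring mess",
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["great acting", "boring acting"]
y_test = [1, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Naive Bayes": MultinomialNB(),
}

traditional_results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    traditional_results.append({
        "model": name,
        "features": "TF-IDF",
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    })

for row in traditional_results:
    print(row)
```

Keeping the feature type in each row makes it easy to rerun the loop with BoW or Word2Vec features and compare the rows side by side.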

## 7. Deep Learning Models

**Requirements:**
- Implement at least: RNN, LSTM, GRU
- Compare architectures
- Hyperparameter experiments

In [None]:
# TODO: Prepare data for deep learning
print("Preparing sequences for deep learning...")
# from tensorflow.keras.preprocessing.text import Tokenizer
# from tensorflow.keras.preprocessing.sequence import pad_sequences

# TODO: Tokenize and pad sequences
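
To make explicit what tokenizing and padding do, here is a standard-library sketch of the idea. It is not the Keras implementation: the helper names and the reserved ids are assumptions. It pads at the end, whereas Keras `pad_sequences` pads at the front by default.

```python
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # reserved indices: padding and out-of-vocabulary


def build_vocab(texts, max_words=10000):
    """Map the most frequent training words to integer ids starting at 2."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_words))}


def encode_and_pad(text, vocab, max_len):
    """Convert text to ids, truncate to max_len, and right-pad with PAD_ID."""
    ids = [vocab.get(word, OOV_ID) for word in text.split()][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# The vocabulary is built from training texts only.
train_texts = ["great film great cast", "boring film"]
vocab = build_vocab(train_texts)

padded = encode_and_pad("great unseen film", vocab, max_len=5)
truncated = encode_and_pad("great film great cast boring", vocab, max_len=3)
print(vocab)
print(padded)
print(truncated)
```

The embedding layer's `vocab_size` must cover the reserved ids as well, here `len(vocab) + 2`.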

In [None]:
# TODO: Create LSTM model
def create_lstm_model(vocab_size, embedding_dim=100, lstm_units=64):
    # TODO: Implement LSTM architecture
    # Use Sequential API
    # Layers: Embedding, LSTM, Dense, Dropout
    pass

# TODO: Train LSTM model

In [None]:
# TODO: Create GRU model
def create_gru_model(vocab_size, embedding_dim=100, gru_units=64):
    # TODO: Implement GRU architecture
    pass

# TODO: Train GRU model

In [None]:
# TODO: Create Bidirectional LSTM
def create_bidirectional_lstm(vocab_size, embedding_dim=100, lstm_units=64):
    # TODO: Implement Bidirectional LSTM
    pass

# TODO: Train Bidirectional LSTM

In [None]:
# TODO: Store deep learning results
deep_learning_results = []
# Store results for comparison

## 8. Experiment Tables (Required)

**Create at least 2 experiment tables with parameter variations**

### Experiment 1: Traditional ML Hyperparameter Tuning

| Model | Feature Type | Hyperparameters | Accuracy | Precision | Recall | F1-Score | MSE | Cross-Entropy |
|-------|--------------|-----------------|----------|-----------|--------|----------|-----|---------------|
| TODO  | TODO         | TODO            | TODO     | TODO      | TODO   | TODO     | TODO| TODO          |
| TODO  | TODO         | TODO            | TODO     | TODO      | TODO   | TODO     | TODO| TODO          |

In [None]:
# TODO: Implement hyperparameter tuning for traditional ML
# Use GridSearchCV or manual parameter testing
experiment1_results = []

# Example parameters to test:
# Logistic Regression: C values, solvers
# SVM: C values, kernels
# Different feature sets: TF-IDF vs BoW vs Word2Vec
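
A possible shape for Experiment 1 is a `Pipeline` searched with `GridSearchCV`, sketched below on toy data. Putting the vectorizer inside the pipeline means it is refitted on each training fold, so the validation folds stay unseen. The grid values and the 12 toy texts are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data (1 = positive, 0 = negative); use the real training split instead.
texts = [
    "great film", "wonderful acting", "loved every minute",
    "great fun", "brilliant and moving", "a wonderful story",
    "boring film", "terrible acting", "hated every minute",
    "dull and slow", "a terrible mess", "boring story",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameter names follow the pattern <step name>__<parameter>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

# One row per parameter combination, ready to copy into the experiment table.
experiment1_results = [
    {**params, "mean_cv_accuracy": score}
    for params, score in zip(
        search.cv_results_["params"], search.cv_results_["mean_test_score"]
    )
]

print("Best parameters:", search.best_params_)
for row in experiment1_results:
    print(row)
```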

### Experiment 2: Deep Learning Architecture Variations

| Model | Embedding Dim | Hidden Units | Dropout | Batch Size | Epochs | Accuracy | F1-Score | Cross-Entropy |
|-------|---------------|--------------|---------|------------|--------|----------|----------|---------------|
| TODO  | TODO          | TODO         | TODO    | TODO       | TODO   | TODO     | TODO     | TODO          |
| TODO  | TODO          | TODO         | TODO    | TODO       | TODO   | TODO     | TODO     | TODO          |

In [None]:
# TODO: Implement architecture variations for deep learning
experiment2_results = []

# Example parameters to test:
# Embedding dimensions: 50, 100, 200
# Hidden units: 32, 64, 128
# Dropout rates: 0.3, 0.5, 0.7
# Batch sizes: 16, 32, 64
# Learning rates: 0.001, 0.01, 0.1
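
The parameter lists above can be expanded into concrete configurations with `itertools.product`, as in this sketch. `run_experiment` is a hypothetical placeholder that does no training. It only shows the row structure each run would return.

```python
from itertools import product

# A subset of the example values listed above.
search_space = {
    "embedding_dim": [50, 100, 200],
    "hidden_units": [32, 64, 128],
    "dropout": [0.3, 0.5],
    "batch_size": [32, 64],
}

# Every combination of the values above, as one dict per configuration.
configs = [
    dict(zip(search_space, values)) for values in product(*search_space.values())
]
print("Number of configurations:", len(configs))


def run_experiment(config):
    """Hypothetical placeholder: build, train, and score one model for `config`."""
    # Replace this body with real training; metrics are left empty on purpose.
    return {**config, "accuracy": None, "f1": None, "cross_entropy": None}


# A full grid grows quickly, so running only a few configurations is often enough.
experiment2_results = [run_experiment(c) for c in configs[:3]]
for row in experiment2_results:
    print(row)
```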

## 9. Model Evaluation and Metrics

**Requirements:**
- Use appropriate performance metrics
- Justify metric choices
- Include MSE and cross-entropy loss

In [None]:
# TODO: Implement comprehensive evaluation
def evaluate_model(y_true, y_pred, y_pred_proba=None):
    """
    Calculate comprehensive evaluation metrics
    """
    # TODO: Calculate all required metrics
    # Accuracy, Precision, Recall, F1-score
    # MSE, Cross-entropy loss (computed from y_pred_proba when it is provided)
    pass

# TODO: Apply to all models
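
A minimal sketch of such an evaluation function for the binary case, assuming labels encoded as 0/1 and `y_pred_proba` holding the predicted probability of the positive class. The name `evaluate_model_sketch` is an assumption. MSE computed on probabilities is the Brier score, and both it and cross-entropy are skipped when no probabilities are available.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    log_loss,
    mean_squared_error,
    precision_score,
    recall_score,
)


def evaluate_model_sketch(y_true, y_pred, y_pred_proba=None):
    """Return a dict of classification metrics for binary 0/1 labels."""
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }
    if y_pred_proba is not None:
        # Both losses compare the predicted probability with the 0/1 outcome.
        metrics["mse"] = mean_squared_error(y_true, y_pred_proba)
        metrics["cross_entropy"] = log_loss(y_true, y_pred_proba, labels=[0, 1])
    return metrics


# Toy labels and probabilities, thresholded at 0.5 to get hard predictions.
y_true = [1, 0, 1, 1, 0]
y_proba = [0.9, 0.2, 0.4, 0.8, 0.6]
y_pred = [int(p >= 0.5) for p in y_proba]

results = evaluate_model_sketch(y_true, y_pred, y_proba)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

`SVC` does not expose probabilities unless it is created with `probability=True`, so for such models the two probability-based entries would be missing from the dictionary.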

### Metric Justification

**TODO: Explain why you chose each metric:**

1. **Accuracy**: [Your justification]
2. **Precision**: [Your justification]
3. **Recall**: [Your justification]
4. **F1-Score**: [Your justification]
5. **MSE**: [Your justification]
6. **Cross-Entropy**: [Your justification]

In [None]:
# TODO: Create confusion matrices
# For best performing models
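
A confusion matrix for a binary task can be computed as sketched below. The labels are toy values. Passing `labels=[0, 1]` fixes the row and column order so that `ravel()` yields the counts as TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

The matrix can be passed to `sns.heatmap(cm, annot=True, fmt="d")` for a plotted version.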

In [None]:
# TODO: Create performance comparison visualizations
# Bar charts, heatmaps, etc.

## 10. Results Analysis and Discussion

**Requirements:**
- Compare traditional ML vs deep learning
- Discuss key findings
- Explain performance variations
- Suggest potential improvements

In [None]:
# TODO: Summarize all results
print("=== Final Results Summary ===")

# TODO: Create comprehensive comparison table
final_results = pd.DataFrame()

# TODO: Identify best performing model
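
The result lists from the earlier sections can be merged into one table and ranked, as in this sketch. The rows are placeholders with made-up numbers that only show the structure. The column names and the choice of F1 as the ranking metric are assumptions.

```python
import pandas as pd

# Placeholder rows only: these numbers are NOT real results.
traditional_results = [
    {"model": "model_a", "family": "traditional", "accuracy": 0.80, "f1": 0.79},
    {"model": "model_b", "family": "traditional", "accuracy": 0.85, "f1": 0.84},
]
deep_learning_results = [
    {"model": "model_c", "family": "deep learning", "accuracy": 0.83, "f1": 0.86},
]

final_results = pd.DataFrame(traditional_results + deep_learning_results)

# Rank by the metric chosen as primary (F1 here) and take the top row.
final_results = final_results.sort_values("f1", ascending=False).reset_index(drop=True)
best = final_results.iloc[0]

print(final_results)
print(f"Best by F1: {best['model']} ({best['family']})")

# Average per model family gives a quick traditional vs deep learning view.
print(final_results.groupby("family")[["accuracy", "f1"]].mean())
```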

### Key Findings

**TODO: Discuss your findings:**

1. **Best performing model**: [Model name and performance]
2. **Traditional ML vs Deep Learning**: [Comparison and insights]
3. **Feature representation impact**: [TF-IDF vs Word2Vec vs others]
4. **Preprocessing impact**: [Effect of different preprocessing steps]
5. **Hyperparameter sensitivity**: [Most important parameters]

### Performance Variations Explanation

**TODO: Explain why certain models performed better:**

- **Dataset characteristics**: [Size, complexity, domain]
- **Model suitability**: [Why certain models work better]
- **Feature engineering**: [Impact of different representations]
- **Hyperparameter choices**: [Effect of parameter tuning]

### Potential Improvements

**TODO: Suggest improvements:**

1. **Data improvements**: [More data, better preprocessing]
2. **Model enhancements**: [Architecture modifications, ensembles]
3. **Feature engineering**: [Advanced embeddings, feature selection]
4. **Hyperparameter optimization**: [Advanced tuning methods]
5. **Evaluation improvements**: [Additional metrics, cross-validation]

## 11. Conclusions

**TODO: Write your conclusions:**

1. **Main findings summary**
2. **Best approach recommendation**
3. **Lessons learned**
4. **Future work suggestions**

## 12. References

**TODO: Add your references:**

1. Dataset source
2. Literature references
3. Code/library references
4. Any other relevant sources

---

## Assignment Checklist

**Before submission, ensure you have:**

- [ ] Chosen and loaded an appropriate dataset
- [ ] Completed comprehensive EDA with visualizations
- [ ] Implemented text preprocessing with justification
- [ ] Applied multiple embedding techniques (TF-IDF, Word2Vec, etc.)
- [ ] Implemented traditional ML models (LR, SVM, NB)
- [ ] Implemented deep learning models (RNN, LSTM, GRU)
- [ ] Created 2+ experiment tables with parameter variations
- [ ] Used appropriate evaluation metrics (MSE, cross-entropy, etc.)
- [ ] Justified metric choices
- [ ] Compared results and discussed findings
- [ ] Explained performance variations
- [ ] Suggested potential improvements
- [ ] Documented team member contributions
- [ ] Prepared PDF report
- [ ] Set up GitHub repository with README
- [ ] Ensured code is well-documented and reproducible