# Emotion Detection — Full Project

**Notebook:** reproducible end-to-end workflow for preprocessing, baseline training, transformer fine-tuning, evaluation, analysis, and demo.

_Run cells sequentially. Use a `python3` kernel._

## 0 — Setup

Install required packages (run once). If you use a virtualenv, activate it first. The versions listed in `requirements.txt` are suggested for reproducibility; adjust if needed.

In [None]:
# Install common dependencies (uncomment if running in a fresh environment)
# !python3 -m pip install -U pip
# !python3 -m pip install -r requirements.txt

# Minimal install commands you can uncomment if needed:
# !python3 -m pip install datasets transformers evaluate torch scikit-learn pandas joblib matplotlib seaborn streamlit emoji nbformat
print("Ready. If packages are missing, install them using pip as commented above.")

## 1 — Dataset overview

We use the TweetEval `emotion` dataset from CardiffNLP (via Hugging Face `datasets`). This cell loads the dataset and prints the size of each split and the label names.

In [None]:
from datasets import load_dataset
ds = load_dataset("cardiffnlp/tweet_eval", "emotion")
for split in ds:
    print(split, len(ds[split]))
print("Labels:", ds["train"].features["label"].names)
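The class distribution itself can be counted from the integer labels. The following is a minimal sketch: `class_distribution` is a hypothetical helper (not part of the project), and the toy label list stands in for `ds["train"]["label"]` so the block runs without downloading anything.

```python
from collections import Counter

def class_distribution(label_ids, label_names):
    """Return {label_name: (count, fraction)} for a sequence of integer labels."""
    counts = Counter(label_ids)
    total = len(label_ids)
    return {
        name: (counts.get(i, 0), counts.get(i, 0) / total if total else 0.0)
        for i, name in enumerate(label_names)
    }

# Toy data for illustration only; in the notebook, pass ds["train"]["label"]
# and ds["train"].features["label"].names instead.
toy_names = ["anger", "joy", "optimism", "sadness"]
toy_labels = [0, 0, 0, 1, 1, 2, 3, 3]
dist = class_distribution(toy_labels, toy_names)
for name, (count, frac) in dist.items():
    print(f"{name:<10} {count:>4}  {frac:.1%}")
```

A skewed distribution here is the reason the baseline below uses stratified splitting and `class_weight='balanced'`.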

## 2 — Preprocessing

Apply the project's `preprocess_tweet` function (from `src.preprocessing`), including any changes you made to it during Week 1. If the import fails, the cell falls back to a simple placeholder cleanup.

In [None]:
# Import project's preprocessing pipeline
import sys, os
PROJECT_ROOT = os.path.abspath(os.getcwd())  # assumes the notebook is run from the project root
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)
try:
    from src.preprocessing import preprocess_tweet
except Exception as e:
    print("Could not import src.preprocessing.preprocess_tweet:", e)
    # Fallback simple preprocessor
    import re
    def preprocess_tweet(x):
        x = str(x).lower()
        x = re.sub(r'http\S+', '', x)
        x = re.sub(r'[^\w\s#@]', '', x)
        x = re.sub(r'\s+', ' ', x).strip()
        return x

# Quick sample
sample = ds['train']['text'][:5]
print([preprocess_tweet(t) for t in sample])
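To make the fallback's behaviour concrete, the sketch below repeats its four steps on a made-up tweet. `fallback_preprocess` is a hypothetical name used only here; the project's own `preprocess_tweet` may behave differently.

```python
import re

def fallback_preprocess(text):
    """Same steps as the fallback preprocessor defined in the cell above."""
    text = str(text).lower()                  # lowercase
    text = re.sub(r"http\S+", "", text)       # drop URLs
    text = re.sub(r"[^\w\s#@]", "", text)     # drop punctuation and emoji, keep # and @
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

raw = "Loving this!! 😍 http://t.co/abc #happy @user"  # made-up tweet
print(fallback_preprocess(raw))  # loving this #happy @user
```

Note that the fallback removes emoji along with punctuation, which is why the emoji counts in section 6 are computed on the raw `text` column rather than on `text_clean`.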

## 3 — Baseline: TF-IDF + Logistic Regression

Train a fast baseline model. The training script `src/train_baseline.py` handles the full run; the cell below is a minimal inline example that trains on a 2,000-tweet sample of the training split to keep the notebook demo quick.

In [None]:
# Quick baseline training (small subset to keep runtime reasonable in notebook)
import numpy as np, pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from joblib import dump

# prepare small sample
train_df = pd.DataFrame(ds['train'])
train_df['text_clean'] = train_df['text'].apply(preprocess_tweet)
# take a sample for quick demo (comment out to use full data)
sample_df = train_df.sample(n=2000, random_state=42) if len(train_df)>2000 else train_df

y = sample_df['label'].values

# split before vectorizing so the TF-IDF statistics are learned from training text only
texts_tr, texts_val, y_tr, y_val = train_test_split(
    sample_df['text_clean'], y, test_size=0.2, random_state=42, stratify=y)

vec = TfidfVectorizer(ngram_range=(1,2), max_features=10000, min_df=2)
X_tr = vec.fit_transform(texts_tr)
X_val = vec.transform(texts_val)

# the deprecated multi_class argument is omitted; saga fits a multinomial model for multiclass targets by default
clf = LogisticRegression(solver='saga', max_iter=1000, class_weight='balanced', random_state=42)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_val)
print(classification_report(y_val, y_pred, target_names=ds['train'].features['label'].names))

# save small demo model
os.makedirs('models', exist_ok=True)
dump(clf, 'models/demo_logreg.joblib')
dump(vec, 'models/demo_tfidf.joblib')
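The baseline passes `class_weight='balanced'`, which scikit-learn documents as weighting each class by `n_samples / (n_classes * class_count)`, so rarer emotions contribute more to the loss. A small worked sketch with made-up counts follows; `balanced_weights` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def balanced_weights(labels):
    """Per-class weights computed as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in sorted(counts.items())}

# Made-up imbalanced labels: class 0 is four times as common as class 2.
toy_y = [0] * 8 + [1] * 6 + [2] * 2
weights = balanced_weights(toy_y)
for cls, w in weights.items():
    print(f"class {cls}: weight {w:.3f}")
```

In this toy case the rarest class receives four times the weight of the most common one, mirroring the ratio of their counts.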

## 4 — Evaluation & Visuals

Plot the confusion matrix for the small validation set produced above. Per-class F1 scores appear in the classification report printed by the baseline cell.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
labels = ds['train'].features['label'].names
cm = confusion_matrix(y_val, y_pred)
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', xticklabels=labels, yticklabels=labels, cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix (demo)')
plt.show()
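Per-class F1 can also be read directly off a confusion matrix, which helps when interpreting the heatmap. The sketch below assumes rows are true labels and columns are predictions (the convention `sklearn.metrics.confusion_matrix` uses); `per_class_f1` is a hypothetical helper and the matrix is made up.

```python
def per_class_f1(cm, names):
    """Compute F1 per class from a square confusion matrix (rows = true, cols = predicted)."""
    scores = {}
    for i, name in enumerate(names):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(len(cm))) - tp  # predicted as class i, actually another class
        fn = sum(cm[i]) - tp                              # actually class i, predicted as another class
        denom = 2 * tp + fp + fn
        scores[name] = 2 * tp / denom if denom else 0.0
    return scores

# Made-up 3-class matrix for illustration; in the notebook, pass cm.tolist() and labels.
toy_cm = [[8, 1, 1],
          [2, 6, 2],
          [0, 3, 7]]
f1 = per_class_f1(toy_cm, ["a", "b", "c"])
for name, score in f1.items():
    print(f"{name}: F1 = {score:.2f}")
```

The resulting dictionary can be passed to `pd.Series(f1).plot.bar()` if a bar chart of per-class F1 is wanted alongside the heatmap.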

## 5 — Feature importance (TF-IDF + LR)

List top tokens per class using the trained Logistic Regression model.

In [None]:
import numpy as np
feature_names = np.array(vec.get_feature_names_out())
coefs = clf.coef_
topk = 15
for i, label in enumerate(labels):
    top_idx = np.argsort(coefs[i])[-topk:][::-1]
    top_feats = feature_names[top_idx]
    print(f"Top features for {label}:", ', '.join(top_feats))

## 6 — Linguistic analysis

Compute tweet lengths and emoji counts per class (train split).

In [None]:
import emoji
df_train = pd.DataFrame(ds['train'])
df_train['text_clean'] = df_train['text'].apply(preprocess_tweet)
df_train['char_len'] = df_train['text'].astype(str).apply(len)
# emoji.emoji_count treats multi-codepoint emoji (ZWJ sequences, skin-tone variants) as one emoji each
df_train['emoji_count'] = df_train['text'].apply(lambda t: emoji.emoji_count(str(t)))
grouped = df_train.groupby('label')[['char_len','emoji_count']].mean().rename(columns={'char_len':'avg_len','emoji_count':'avg_emoji'})
print(grouped)
# plot avg length
grouped['avg_len'].plot.bar(title='Average tweet length by label')

## 7 — Transformer fine-tuning (RoBERTa)

Fine-tuning a transformer is computationally expensive. Use `src/train_transformer.py` for full runs. Below is a short demonstration of how to load a pretrained TweetEval RoBERTa model for inference.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
model_name = 'cardiffnlp/twitter-roberta-base-emotion'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer, top_k=None)
print(classifier('I am extremely happy and excited today!'))
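With `top_k=None` the pipeline returns a score for every label rather than only the best one. Depending on the `transformers` version and how the pipeline is called, the scores for a single text may arrive as a flat list of dicts or nested one level deeper, so the sketch below handles both. `top_emotion` is a hypothetical helper and the scores are invented, not real model output.

```python
def top_emotion(result):
    """Return (label, score) with the highest score from a pipeline result for one text."""
    # Unwrap one level if the scores arrive nested, e.g. [[{...}, {...}]].
    scores = result[0] if result and isinstance(result[0], list) else result
    best = max(scores, key=lambda d: d["score"])
    return best["label"], best["score"]

# Hypothetical scores in the label/score dict format used by text-classification pipelines.
fake_result = [[
    {"label": "joy", "score": 0.91},
    {"label": "optimism", "score": 0.06},
    {"label": "anger", "score": 0.02},
    {"label": "sadness", "score": 0.01},
]]
label, score = top_emotion(fake_result)
print(f"{label} ({score:.2f})")
```

In the notebook this would be called as `top_emotion(classifier(some_text))`.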

## 8 — Streamlit Demo

To run the Streamlit app locally (after training and saving models to `models/`):

```bash
# from project root
export STREAMLIT_SERVER_HEADLESS=true
streamlit run src/streamlit_app.py
```

The Model Insights tab reads files from `outputs/`; run `src/feature_analysis.py` and `src/evaluate_model.py` to generate them.

## 9 — Reproducibility & next steps

- Freeze dependencies: `python3 -m pip freeze > requirements.txt` and prune non-essential packages.
- For the final submission, include the small demo files in `models/` and the visuals in `outputs/`.
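
The scikit-learn calls in this notebook pass `random_state=42`. A comparable fixed seed for the other random number generators might look like the sketch below. `set_seed` is a hypothetical helper, not part of the project, and it seeds NumPy and PyTorch only if they are installed.

```python
import os
import random

def set_seed(seed=42):
    """Seed the RNGs this notebook may touch; optional libraries are skipped if missing."""
    random.seed(seed)
    # Only affects Python processes started after this point, not the current interpreter.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass

# Re-seeding reproduces the same draw from Python's built-in generator.
set_seed(42)
first = random.random()
set_seed(42)
second = random.random()
print(first == second)  # True
```

Seeding alone does not guarantee bit-identical transformer training across hardware, so treat it as one reproducibility measure alongside pinned dependencies.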

---

Good luck. Run the cells in order, and use the existing `src/*.py` scripts for heavy jobs.