# HW5: NLP

**Goal.** In this homework you'll practice basic text processing: tokenization, normalization, simple stemming, and dictionary-based sentiment analysis. You'll work with a small dataset of review-like texts and answer **5 questions** (a–e).

**What you’ll do**
1. Load a small dataset of short texts with gold sentiment labels.
2. Tokenize and normalize the text (lowercasing, punctuation/number handling).
3. Remove stopwords.
4. Apply a rule-based stemmer (toy stemmer for learning).
5. Build a dictionary-based sentiment scorer and evaluate it.


## 0) Dataset

In [None]:
# We'll define a dataset inline: a list of dicts with `text` and `label` in {`pos`, `neg`, `neu`}.
# In a real assignment this might be loaded from a CSV.
dataset = [
    {"text": "I loved the coffee and the sunny patio. Great vibes!", "label": "pos"},
    {"text": "The service was slow and the food was cold. Not coming back.", "label": "neg"},
    {"text": "Average experience. Some parts were fine, others meh.", "label": "neu"},
    {"text": "Absolutely fantastic staff—friendly and helpful. Will recommend!", "label": "pos"},
    {"text": "Terrible. Waited forever and the order was wrong.", "label": "neg"},
    {"text": "It was okay overall; the dessert was nice but pricey.", "label": "neu"},
    {"text": "So happy with the quick service and tasty sandwich.", "label": "pos"},
    {"text": "Disappointed by the portion size and bland flavors.", "label": "neg"},
    {"text": "Nothing special, but not bad either.", "label": "neu"},
    {"text": "Amazing atmosphere, loved every minute there!", "label": "pos"},
    {"text": "The coffee was amazing; I was loving it and quickly ordered seconds.", "label": "pos"},
    {"text": "The sandwich tastes great but the soup was blandly seasoned.", "label": "neu"},
    {"text": "We waited and waited; the order was wrong and slowly arriving.", "label": "neg"},
    {"text": "Absolutely fantastic staff—friendly and helpful; I loved the service.", "label": "pos"},
    {"text": "Not coming back—disappointing and overpriced.", "label": "neg"},
    {"text": "Prices were okay, portions felt small; desserts were nicely presented.", "label": "neu"},
    {"text": "The food was not good at all.", "label": "neg"}
]



len(dataset), dataset[:2]


### Tokenization
Write a simple tokenizer that:
- lowercases text,
- removes punctuation and digits,
- splits on whitespace,
- keeps only alphabetic tokens (a–z).

**a)** After tokenizing all texts, what is the **vocabulary size** (number of unique tokens)?
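
To see what these rules do to awkward input, here is a minimal sketch on two made-up strings (not from the dataset). The name `tokenize_sketch` is hypothetical and only used for illustration; it follows the same four rules in a single expression.

```python
import re

def tokenize_sketch(text: str) -> list[str]:
    """Lowercase, turn every run of non a-z characters into a space, then split."""
    return re.sub(r"[^a-z]+", " ", text.lower()).split()

# Illustrative inputs (assumptions, not part of the homework data).
print(tokenize_sketch("Staff—friendly & helpful, 10/10!"))
print(tokenize_sketch("Don't miss the café."))
```

The first call yields `['staff', 'friendly', 'helpful']`: the dash, ampersand, and digits all act as separators. The second yields `['don', 't', 'miss', 'the', 'caf']`, which shows two side effects of keeping only a–z: contractions split at the apostrophe, and accented letters are dropped.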


In [None]:
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Lowercase
    text = text.lower()
    # Replace anything not a-z with space
    text = re.sub(r"[^a-z]+", " ", text)
    # Split on whitespace and filter empties
    toks = [t for t in text.split() if t]
    return toks

# Apply tokenizer
tokenized = [tokenize(item["text"]) for item in dataset]
tokenized[:3]


In [None]:
# TODO: Q1 (question a): compute and print the vocabulary size.
vocab = sorted({tok for doc in tokenized for tok in doc})
print("Vocabulary size:", len(vocab))
# Also, if helpful, print a few sample tokens:
print(vocab[:25])


### Normalization & Frequency
Using your tokenized output, build a frequency table.

**b)** What are the **top 5 most frequent tokens** across the corpus? (List them with counts.)
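
As a quick illustration of how a frequency table behaves, the sketch below counts tokens in three made-up documents with `collections.Counter`. The names `docs_sketch` and `freq_sketch` are hypothetical.

```python
from collections import Counter

# Three tiny, made-up tokenized documents (assumption for illustration).
docs_sketch = [["good", "coffee"], ["good", "service"], ["slow", "service"]]

# Flatten all documents into one token stream and count occurrences.
freq_sketch = Counter(tok for doc in docs_sketch for tok in doc)

print(freq_sketch.most_common(2))
```

This prints `[('good', 2), ('service', 2)]`. Tokens with equal counts are returned in the order they were first seen, so ties in a top-5 list depend on document order.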


In [None]:
n = 5
freq = Counter(tok for doc in tokenized for tok in doc)
top5 = freq.most_common(n)
print("Top-5 tokens:", top5)

### Stopword Removal
**c)** A stopword list is provided below. Which two words in it should not be treated as stopwords, and why?

In [None]:
stopwords = {
    "the", "and", "was", "is", "it", "but", "so", "with", "nice", "by", "a", "an",
    "of", "to", "for", "pricey"
}

def remove_stopwords(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in stopwords]

tokenized_nostop = [remove_stopwords(doc) for doc in tokenized]
Counter(tok for doc in tokenized_nostop for tok in doc).most_common(5)


### Simple Stemmer
Implement a very simple rule-based stemmer that strips a few common English suffixes: `ing`, `ed`, `ly`, `s`. The suffixes are checked in that order, only the first match is stripped (at most one suffix per token), and a suffix is stripped only if the token is longer than the suffix.
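
The rule is easiest to judge on a few words. The sketch below is a hypothetical `toy_stem_sketch` that follows the description above, applied to some hand-picked words (assumptions chosen to show both useful and harmful stems).

```python
def toy_stem_sketch(token: str) -> str:
    """Strip the first matching suffix from (ing, ed, ly, s), at most once."""
    for suffix in ("ing", "ed", "ly", "s"):
        # Only strip when something is left over after removing the suffix.
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token

examples = ["loving", "loved", "lovingly", "tastes", "was", "sing"]
stems = {word: toy_stem_sketch(word) for word in examples}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Here `loving` and `loved` both collapse to `lov`, which is the intended effect. `lovingly` becomes `loving` because only one suffix is removed per token. `was -> wa` and `sing -> s` are over-stemming errors: the rule has no notion of whether the remainder is a real word.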

In [None]:
def toy_stem(token: str) -> str:
    # Apply in order: ing, ed, ly, s
    for suf in ["ing", "ed", "ly", "s"]:
        if token.endswith(suf) and len(token) > len(suf):
            return token[: -len(suf)]
    return token

def stem_doc(tokens: list[str]) -> list[str]:
    return [toy_stem(t) for t in tokens]

# Stemming can run on tokens with or without stopwords; here we use the stopword-free version.
stemmed = [stem_doc(doc) for doc in tokenized_nostop]
vocab_before = sorted({tok for doc in tokenized_nostop for tok in doc})
vocab_after = sorted({tok for doc in stemmed for tok in doc})
print("Vocab size before stemming:", len(vocab_before))
print("Vocab size after stemming:", len(vocab_after))
print("Sample before→after:", list(zip(vocab_before[:15], [toy_stem(t) for t in vocab_before[:15]])))


### Dictionary-Based Sentiment
Below is a positive/negative lexicon for a very simple dictionary-based sentiment score:
- Score(text) = (# of positive word hits) − (# of negative word hits).  
- If the score > 0 → **pos**, score < 0 → **neg**, score == 0 → **neu**.

> Note: This is intentionally simplistic and will misclassify some items. That's part of the learning!
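
The scoring rule can be worked through by hand on a toy lexicon. In the sketch below, `POS_SKETCH`, `NEG_SKETCH`, `score_sketch`, and `label_sketch` are hypothetical names, and the two small word sets are assumptions, not the lexicon used in this homework.

```python
# Tiny illustrative lexicons (assumptions, not the homework lexicon).
POS_SKETCH = {"great", "tasty"}
NEG_SKETCH = {"slow", "cold"}

def score_sketch(tokens: list[str]) -> int:
    """Positive hits minus negative hits."""
    pos_hits = sum(t in POS_SKETCH for t in tokens)
    neg_hits = sum(t in NEG_SKETCH for t in tokens)
    return pos_hits - neg_hits

def label_sketch(score: int) -> str:
    """Map the sign of the score to a label."""
    return "pos" if score > 0 else "neg" if score < 0 else "neu"

for toks in (["tasty", "and", "great"], ["cold", "soup"], ["great", "food", "slow", "service"]):
    s = score_sketch(toks)
    print(toks, s, label_sketch(s))
```

The three token lists score 2, -1, and 0, giving `pos`, `neg`, and `neu`. The last case shows that a mixed review and a review with no lexicon hits at all both land on `neu`, since only the difference is kept.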


In [None]:
positive_lex = {
    "good", "love", "loved", "great", "fantastic", "friendly", "helpful", "recommend",
    "amazing", "happy", "tasty", "nice", "quick", "sunny", "vibe", "atmosphere"
}
negative_lex = {
    "slow", "cold", "terrible", "waited", "wrong", "disappointed", "bland", "pricey",
    "forever", "bad"
}

def sentiment_score(tokens: list[str]) -> int:
    # count exact matches (pre-stemming tokens tend to work better for tiny lexicons)
    pos_hits = sum(1 for t in tokens if t in positive_lex)
    neg_hits = sum(1 for t in tokens if t in negative_lex)
    return pos_hits - neg_hits

def label_from_score(score: int) -> str:
    if score > 0:
        return "pos"
    elif score < 0:
        return "neg"
    else:
        return "neu"

preds = []
for toks, item in zip(tokenized, dataset):
    s = sentiment_score(toks)
    preds.append(label_from_score(s))

list(zip([d["text"] for d in dataset[:5]], [d["label"] for d in dataset[:5]], preds[:5]))


### Evaluation
Compute **accuracy** of the dictionary-based classifier against the gold labels. Also compute a confusion matrix.

**d)** What is the accuracy (0–1), and where do most errors occur (which true label is most often predicted as which other label)?
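
As a reminder of how the two metrics relate, the sketch below computes accuracy and the most frequent error type on six made-up label pairs. The names `gold_sketch`, `pred_sketch`, `pairs`, and `errors` are hypothetical, and the labels are invented for illustration.

```python
from collections import Counter

# Made-up gold and predicted labels (assumption for illustration).
gold_sketch = ["pos", "neg", "neu", "neu", "pos", "neu"]
pred_sketch = ["pos", "neg", "pos", "pos", "pos", "neu"]

# Each (true, predicted) pair is one cell of the confusion matrix.
pairs = Counter(zip(gold_sketch, pred_sketch))

# Accuracy is the share of pairs on the diagonal (true == predicted).
accuracy_sketch = sum(g == p for g, p in zip(gold_sketch, pred_sketch)) / len(gold_sketch)

# Off-diagonal cells are the errors.
errors = Counter({pair: n for pair, n in pairs.items() if pair[0] != pair[1]})

print("Accuracy:", round(accuracy_sketch, 3))
print("Most common error (true, predicted):", errors.most_common(1))
```

With these labels, four of six predictions match (accuracy 0.667), and the only error type is `neu` predicted as `pos`, which occurs twice.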


In [None]:
labels = ["neg", "neu", "pos"]  # fixed order for display

gold = [d["label"] for d in dataset]
pred = preds

# Accuracy
acc = sum(g==p for g,p in zip(gold, pred)) / len(gold)
print("Accuracy:", round(acc, 3))

# Confusion matrix
cm = { (g,p):0 for g in labels for p in labels }
for g,p in zip(gold, pred):
    cm[(g,p)] += 1

# Pretty print
print("\nConfusion Matrix (rows=true, cols=pred):")
header = [""] + labels
rows = []
for g in labels:
    row = [g] + [cm[(g,p)] for p in labels]
    rows.append(row)

# Simple printout
print("\t".join(header))
for r in rows:
    print("\t".join(str(x) for x in r))

# TODO: Q4 (question d): in a short sentence below, report the accuracy and where most errors happen.


In [None]:
# Print misclassified items: (text, gold label, predicted label)
for d, g, p in zip(dataset, gold, pred):
    if g != p:
        print((d["text"], g, p))


### Negation Handling
Modify the scoring so that the presence of **`not`** reverses the next word’s polarity (only for the next single token). Then recompute predictions and accuracy.
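
The single-token scope of this rule matters. The sketch below is one possible way to express the rule with a flag instead of index arithmetic; `POS_SKETCH`, `NEG_SKETCH`, and `score_with_negation_sketch` are hypothetical names, and the tiny lexicons are assumptions for illustration only.

```python
# Tiny illustrative lexicons (assumptions, not the homework lexicon).
POS_SKETCH = {"good", "great"}
NEG_SKETCH = {"bad"}

def score_with_negation_sketch(tokens: list[str]) -> int:
    """Sum word polarities, flipping the polarity of the token right after 'not'."""
    score = 0
    flip = False  # True only when the previous token was "not"
    for t in tokens:
        polarity = int(t in POS_SKETCH) - int(t in NEG_SKETCH)  # +1, -1, or 0
        score += -polarity if flip else polarity
        flip = (t == "not")
    return score

for toks in (["not", "good"], ["not", "bad"], ["not", "very", "good"]):
    print(toks, score_with_negation_sketch(toks))
```

The scores are -1, 1, and 1. The first two are flipped as intended, but in `not very good` the flip is spent on `very`, so `good` still counts as positive. A one-token window is a deliberate simplification and will miss negations separated by an intensifier.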

In [None]:
# TODO: Q4b — Implement a simple negation-aware scorer and compare accuracy
def sentiment_score_negation(tokens: list[str]) -> int:
    score = 0
    i = 0
    while i < len(tokens):
        t = tokens[i]
        if t == "not" and i + 1 < len(tokens):
            nxt = tokens[i+1]
            if nxt in positive_lex:
                score -= 1  # flip next positive to negative
                i += 2
                continue
            elif nxt in negative_lex:
                score += 1  # flip next negative to positive
                i += 2
                continue
        # default behavior
        if t in positive_lex:
            score += 1
        elif t in negative_lex:
            score -= 1
        i += 1
    return score

def predict_with(func):
    return [label_from_score(func(toks)) for toks in tokenized]

# The original accuracy is stored in `acc` (computed in the Evaluation cell above).
# Compute the new accuracy with negation handling:
pred_neg = predict_with(sentiment_score_negation)

gold = [d["label"] for d in dataset]
acc_neg = sum(g == p for g, p in zip(gold, pred_neg)) / len(gold)

print("Original accuracy:", round(acc, 3))
print("Negation-handling accuracy:", round(acc_neg, 3))

# Show side-by-side differences for quick inspection
diffs = [(i, gold[i], preds[i], pred_neg[i], dataset[i]["text"]) for i in range(len(gold)) if preds[i] != pred_neg[i]]
print("\nCases where prediction changed due to negation handling:")
for i, g, p_old, p_new, txt in diffs:
    print(f"[{i}] GOLD={g} OLD={p_old} NEW={p_new} :: {txt}")


**e)** Why is negation handling important in dictionary sentiment methods? Give an example from the dataset (or make a realistic example) where naive counting would fail.


### Error Analysis
Pick **3 misclassified** examples (if fewer than 3, pick all misclassified) and briefly explain **why** the dictionary approach failed.

In [None]:
mis = [(i, dataset[i]["text"], gold[i], pred[i]) for i in range(len(gold)) if gold[i] != pred[i]]
print("Misclassified examples:", len(mis))
for i, text, g, p in mis:
    print(f"[{i}] GOLD={g} PRED={p} :: {text}")
# TODO: Q5 — Write your brief explanations below.
