Intro to NLP Practical<br>
======================<br>
Students will work through problems on n-grams, bigram probabilities, out-of-vocabulary (OOV) handling, and text classifiers.<br>

In [1]:
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

Toy corpus for language modeling

In [2]:
corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go"
]

--- Part 1: Preprocessing ---

 Q1.1 Sequence notation<br>
Exercise: Write the following sentence in sequence notation (w_1, ..., w_n):<br>
"Mary had a little lamb, its fleece was white as snow"

In [4]:
sentence = "Mary had a little lamb, its fleece was white as snow"

tokens = sentence.split()
print(tokens)


['Mary', 'had', 'a', 'little', 'lamb,', 'its', 'fleece', 'was', 'white', 'as', 'snow']
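
The whitespace split above leaves the comma attached to `'lamb,'`. A minimal sketch of one way to drop punctuation and print the sentence as w_1 ... w_n follows; the regular expression and the `w_i=token` output format are illustrative assumptions, not part of the exercise.

```python
import re

sentence = "Mary had a little lamb, its fleece was white as snow"

# Keep only runs of word characters, so the comma after "lamb" is dropped.
tokens = re.findall(r"\w+", sentence)

# Write the sentence as w_1 ... w_n using 1-based indices.
notation = ", ".join(f"w_{i}={tok}" for i, tok in enumerate(tokens, start=1))

print(f"n = {len(tokens)}")
print(notation)
```

Whether punctuation should be dropped or kept as separate tokens depends on the task; this sketch simply discards it.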


 Q1.2 Add start/end tokens<br>
Exercise: Write a function to tokenize the corpus and add <s> and </s> markers to each sentence

In [5]:
def add_start_end(sentence):
    """Split on whitespace and wrap the tokens in <s> ... </s> markers."""
    return ["<s>"] + sentence.split() + ["</s>"]

tokenized_corpus = [add_start_end(sentence) for sentence in corpus]

print(tokenized_corpus)

[['<s>', 'Mary', 'had', 'a', 'little', 'lamb', '</s>'], ['<s>', 'Its', 'fleece', 'was', 'white', 'as', 'snow', '</s>'], ['<s>', 'And', 'everywhere', 'that', 'Mary', 'went', '</s>'], ['<s>', 'The', 'lamb', 'was', 'sure', 'to', 'go', '</s>']]


--- Part 2: N-grams & Probabilities ---

 Q2.1 Extract unigrams, bigrams, trigrams

In [6]:
unigrams = []
bigrams = []
trigrams = []

for sentence in tokenized_corpus:
    unigrams.extend(sentence)
    # Pair each token with its successor(s) by zipping shifted slices.
    bigrams.extend(zip(sentence[:-1], sentence[1:]))
    trigrams.extend(zip(sentence[:-2], sentence[1:-1], sentence[2:]))

print(unigrams)
print(bigrams)
print(trigrams)

['<s>', 'Mary', 'had', 'a', 'little', 'lamb', '</s>', '<s>', 'Its', 'fleece', 'was', 'white', 'as', 'snow', '</s>', '<s>', 'And', 'everywhere', 'that', 'Mary', 'went', '</s>', '<s>', 'The', 'lamb', 'was', 'sure', 'to', 'go', '</s>']
[('<s>', 'Mary'), ('Mary', 'had'), ('had', 'a'), ('a', 'little'), ('little', 'lamb'), ('lamb', '</s>'), ('<s>', 'Its'), ('Its', 'fleece'), ('fleece', 'was'), ('was', 'white'), ('white', 'as'), ('as', 'snow'), ('snow', '</s>'), ('<s>', 'And'), ('And', 'everywhere'), ('everywhere', 'that'), ('that', 'Mary'), ('Mary', 'went'), ('went', '</s>'), ('<s>', 'The'), ('The', 'lamb'), ('lamb', 'was'), ('was', 'sure'), ('sure', 'to'), ('to', 'go'), ('go', '</s>')]
[('<s>', 'Mary', 'had'), ('Mary', 'had', 'a'), ('had', 'a', 'little'), ('a', 'little', 'lamb'), ('little', 'lamb', '</s>'), ('<s>', 'Its', 'fleece'), ('Its', 'fleece', 'was'), ('fleece', 'was', 'white'), ('was', 'white', 'as'), ('white', 'as', 'snow'), ('as', 'snow', '</s>'), ('<s>', 'And', 'everywhere'), ('A
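
The three extraction lines above follow one pattern: zip the token list with copies of itself shifted by 0, 1, ..., n-1 positions. A minimal sketch of that generalization is below; the helper name `ngrams` is hypothetical and not part of the notebook.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    # zip stops at the shortest slice, so exactly len(tokens) - n + 1 tuples result.
    return list(zip(*(tokens[i:] for i in range(n))))


sentence = ["<s>", "Mary", "had", "a", "little", "lamb", "</s>"]
print(ngrams(sentence, 2)[:2])
print(ngrams(sentence, 3)[:2])
```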

 Q2.2 Bigram probabilities<br>
Exercise: Write a function to compute P(w_i | w_{i-1})

In [13]:
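def bigram_probabilities(bigrams):
    """Maximum-likelihood estimate: P(w2 | w1) = count(w1, w2) / count(w1)."""
    bigram_counts = Counter(bigrams)
    # Context counts come from the global `unigrams` list built in Q2.1.
    unigram_counts = Counter(unigrams)
    probabilities = {}
    for (word1, word2), count in bigram_counts.items():
        probabilities[(word1, word2)] = count / unigram_counts[word1]
    return probabilities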
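
The function above divides each bigram count by the count of its first word, which is the maximum-likelihood estimate. Written out, with $C(\cdot)$ denoting a count over the tokenized corpus (this symbol is an assumption introduced here, not notation from the notebook):

$$
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1},\, w_i)}{C(w_{i-1})}
$$

For example, "Mary" occurs twice in the toy corpus and the bigram ("Mary", "had") occurs once, so

$$
P(\text{had} \mid \text{Mary}) = \frac{1}{2} = 0.5
$$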

 Q2.3 Sentence probability<br>
Exercise: Compute the probability of "Mary had a little lamb"

In [20]:
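# Estimate probabilities from the whole corpus, then look up the sentence's bigrams.
bigram_probs = bigram_probabilities(bigrams)
sentence_bigrams = bigrams[:6]  # bigrams of "<s> Mary had a little lamb </s>"
total_prob = 1
for bigram in sentence_bigrams:
    prob = bigram_probs[bigram]
    print(f"P({bigram[1]} | {bigram[0]}) = {prob}")
    total_prob *= prob

print("total_prob = " + str(total_prob))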

P(Mary | <s>) = 0.25
P(had | Mary) = 0.5
P(a | had) = 1.0
P(little | a) = 1.0
P(lamb | little) = 1.0
P(</s> | lamb) = 0.5
total_prob = 0.0625
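
Multiplying many probabilities below 1 shrinks toward zero quickly on longer sentences, so the same product is often computed as a sum of logarithms. A minimal sketch, assuming the six bigram probabilities printed above are copied into a list by hand:

```python
import math

# Bigram probabilities for "<s> Mary had a little lamb </s>", copied from the output above.
probs = [0.25, 0.5, 1.0, 1.0, 1.0, 0.5]

# Summing logs is equivalent to multiplying the probabilities.
log_prob = sum(math.log(p) for p in probs)

print("log P(sentence) =", log_prob)
print("P(sentence)     =", math.exp(log_prob))
```

Exponentiating the sum recovers the product up to floating-point rounding. Note that `math.log(0)` raises an error, so zero-probability bigrams need separate handling.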


Q2.4 Handling OOV/UNK<br>
Exercise: Replace unseen words with <UNK> and recompute


In [23]:
def replace_unseen_words(corpus, vocab):
    """Tokenize each sentence, mapping out-of-vocabulary words to <UNK>."""
    return [
        [word if word in vocab else "<UNK>" for word in sentence.split()]
        for sentence in corpus
    ]

vocab = set(unigrams)
example_corpus = ["Mary had a little lamb", "Its fleece was black as snow"]
tokenized_corpus_with_unk = replace_unseen_words(example_corpus, vocab)
print(tokenized_corpus_with_unk)

[['Mary', 'had', 'a', 'little', 'lamb'], ['Its', 'fleece', 'was', '<UNK>', 'as', 'snow']]
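
The cell above covers the replacement step but not the "recompute" step. Because `<UNK>` never occurs in the training corpus, a plain maximum-likelihood bigram model assigns probability 0 to any sentence containing it. One possible way around this is add-one (Laplace) smoothing with `<UNK>` added to the vocabulary. This is an assumption made for illustration, not a method prescribed by the exercise, and the names `smoothed_prob` and `sentence_prob` are hypothetical.

```python
from collections import Counter

corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go",
]


def tokenize(sentence):
    return ["<s>"] + sentence.split() + ["</s>"]


train = [tokenize(s) for s in corpus]
unigram_counts = Counter(w for sent in train for w in sent)
bigram_counts = Counter(b for sent in train for b in zip(sent[:-1], sent[1:]))
# Reserve one vocabulary slot for unseen words.
vocab = set(unigram_counts) | {"<UNK>"}


def smoothed_prob(prev, word):
    """Add-one estimate of P(word | prev); never zero, even for unseen bigrams."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + len(vocab))


def sentence_prob(sentence):
    """Map OOV words to <UNK>, then multiply the smoothed bigram probabilities."""
    tokens = [w if w in vocab else "<UNK>" for w in tokenize(sentence)]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= smoothed_prob(prev, word)
    return prob


print(sentence_prob("Mary had a little lamb"))
print(sentence_prob("Its fleece was black as snow"))  # "black" becomes <UNK>
```

Another common approach, not shown here, is to replace rare training words with `<UNK>` before counting, so that the model learns `<UNK>` statistics directly.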


--- Part 3: Classifier ---

 Q3.1 Naive Bayes sentiment classifier

# 📽 Exercise 3.1: Sentiment Classification on toy dataset

In this exercise, you will build a simple sentiment classification model that predicts whether a given sentence is **positive** or **negative**.

---

## ✏️ Instructions:


### 1️⃣ Perform Feature Extraction
- Use **TF-IDF Vectorization** to convert the sentences into numerical features.


---

### 2️⃣ Train a Machine Learning Classifier
- Use any classifier you are familiar with (e.g., **Logistic Regression** or **Naive Bayes**).
- Split the data into **training** and **testing** sets.
- Train the classifier on the training data.


🚀 **Goal:** By the end of this exercise, you should be able to:
- Apply **feature extraction** to text data.
- Train and evaluate a **text classification model** using **machine learning**.

In [29]:
train_texts = [
    "I love my dog",
    "This food is great",
    "I hate waiting",
    "The movie was boring",
    "Happy with my phone",
    "This is awful"
]
train_labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

In [33]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

In [31]:
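# Learn the vocabulary and IDF weights, then build the document-term matrix.
tfidf = TfidfVectorizer()
train_features = tfidf.fit_transform(train_texts)
# Show the sparse matrix as a table with one column per vocabulary term.
tfidf_df = pd.DataFrame(train_features.toarray(), columns=tfidf.get_feature_names_out())
tfidf_df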

,awful,boring,dog,food,great,happy,hate,is,love,movie,my,phone,the,this,waiting,was,with
0,0.0,0.0,0.611713,0.0,0.0,0.0,0.0,0.0,0.611713,0.0,0.501613,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.546779,0.546779,0.0,0.0,0.448367,0.0,0.0,0.0,0.0,0.0,0.448367,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0
3,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.5,0.0,0.0,0.5,0.0
4,0.0,0.0,0.0,0.0,0.0,0.521823,0.0,0.0,0.0,0.0,0.427903,0.521823,0.0,0.0,0.0,0.0,0.521823
5,0.653044,0.0,0.0,0.0,0.0,0.0,0.0,0.535506,0.0,0.0,0.0,0.0,0.0,0.535506,0.0,0.0,0.0
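
The notebook stops after feature extraction, so step 2 of the instructions (split, train) is still open. A minimal sketch of one way to finish it is below, assuming Naive Bayes as the classifier and a two-sentence held-out split. With only six examples the predictions carry little weight, and the split size and random seed are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "I love my dog",
    "This food is great",
    "I hate waiting",
    "The movie was boring",
    "Happy with my phone",
    "This is awful",
]
train_labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

# Hold out two sentences, one per class thanks to stratification.
X_train, X_test, y_train, y_test = train_test_split(
    train_texts, train_labels, test_size=2, stratify=train_labels, random_state=0
)

# The pipeline fits TF-IDF on the training split only, avoiding leakage.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
for text, gold, pred in zip(X_test, y_test, predictions):
    print(f"{text!r}: gold={gold}, predicted={pred}")
```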


# 📽 Exercise 3.2: Movie Review Classification using the Movie Reviews Corpus

In this exercise, you will build a simple text classification model that predicts whether a given **movie review** is **positive** or **negative** using the **NLTK Movie Reviews Corpus**.

This is a classic example of text classification at the **document level**: each example is a whole review, not a single sentence.

---

## ✏️ Instructions:

### 1️⃣ Load the Data
- Import the **Movie Reviews corpus** from **NLTK**.
- Create a dataset where each example is a review and the label is either `'positive'` or `'negative'`.

---

### 2️⃣ Perform Feature Extraction
- Use **TF-IDF Vectorization** to convert the reviews into numerical features.


---

### 3️⃣ Train a Machine Learning Classifier
- Use any classifier you are familiar with (e.g., **Logistic Regression** or **Naive Bayes**).
- Split the data into **training** and **testing** sets.
- Train the classifier on the training data.

---

### 4️⃣ Evaluate the Classifier
- Use **accuracy** and a **classification report** to evaluate your model on the test set.
- Think about: How well does the model perform? Which reviews are harder to classify?

---

✅ You are free to explore:
- Trying different classifiers.
- Visualizing the results (e.g., confusion matrix).

---

🚀 **Goal:** By the end of this exercise, you should be able to:
- Apply **feature extraction** to text data.
- Train and evaluate a **text classification model** using **machine learning**.
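
The four steps above can be sketched as follows. This is one possible solution outline, not the reference answer: the function names `load_movie_reviews` and `train_and_evaluate`, the 80/20 split, the seed, and the choice of logistic regression are all assumptions. The corpus labels in NLTK are `'pos'` and `'neg'`, which the sketch keeps as-is.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def load_movie_reviews():
    """Return (texts, labels) from the NLTK corpus; labels are 'pos' or 'neg'."""
    # Imported lazily so the rest of the sketch runs without NLTK installed.
    from nltk.corpus import movie_reviews

    texts, labels = [], []
    for category in movie_reviews.categories():
        for fileid in movie_reviews.fileids(category):
            texts.append(movie_reviews.raw(fileid))
            labels.append(category)
    return texts, labels


def train_and_evaluate(texts, labels, test_size=0.2, seed=42):
    """Fit TF-IDF + logistic regression and score it on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    report = classification_report(y_test, predictions, zero_division=0)
    return accuracy, report


# Usage, after running nltk.download("movie_reviews") once:
# texts, labels = load_movie_reviews()
# accuracy, report = train_and_evaluate(texts, labels)
# print(accuracy)
# print(report)
```

Swapping `LogisticRegression` for `MultinomialNB` in the pipeline is a one-line change, which makes it easy to compare classifiers as the exercise suggests.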

 Q3.3 Discussion: Why bigrams vs unigrams?<br>

 Q3.4 Limitations of n-grams
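
Data sparsity is one limitation that can be measured directly on the toy corpus. The sketch below counts how many of the possible bigrams over the vocabulary are actually observed; it rebuilds the tokenized corpus so that it runs on its own, and the variable names are illustrative.

```python
corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go",
]

sentences = [["<s>"] + s.split() + ["</s>"] for s in corpus]
vocab = {w for sent in sentences for w in sent}
observed = {b for sent in sentences for b in zip(sent[:-1], sent[1:])}

# Every ordered pair of vocabulary items is a candidate bigram.
possible = len(vocab) ** 2

print(f"vocabulary size:   {len(vocab)}")
print(f"observed bigrams:  {len(observed)}")
print(f"possible bigrams:  {possible}")
print(f"fraction observed: {len(observed) / possible:.3f}")
```

The unobserved fraction grows with n, since the number of candidate n-grams is the vocabulary size raised to the power n while the corpus stays the same size.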

--- Part 4: Wrap-up Reflection ---

 Discussion Questions<br>
1. Why do we need <UNK> tokens?<br>
2. Why do we add start/end tokens?<br>
3. Why not always use higher n-grams?<br>
4. How do classifiers differ from language models?