# NLP Practical Exam — Text Processing + Language Modeling (90 minutes)

**Instructions**
- Work in this notebook only.
- Write short, clear comments to justify *tool choices* (regex vs NLTK, etc.).
- Do **not** use external NLP libraries beyond **NLTK**, **NumPy**, **PyTorch** (PyTorch not needed here).
- Keep outputs readable (print key variables).

**Total: 11 points**


## Given text

```python
text = ("In mid-February 2026, the CEO of OpenAI, Sam Altman, visited Barcelona. He is 1.86m tall and met with researchers from U.P.C. and U.N.E.S.C.O. A report valued the project at $3.2 billion.")
```

> Treat the text as *synthetic exam data* (no fact-checking needed).


## Questions

1. **(1 pt)** Sentence splitting using **regex + NLTK**.
2. **(1 pt)** Regex normalization: acronyms, height meters→centimeters, money `$X.Y billion` → `x point y billion` (words).
3. **(1 pt)** Lowercase **except** proper nouns; join multiword proper nouns with underscore (e.g., `Sam Altman → Sam_Altman`). Keep acronyms uppercase.
4. **(1 pt)** Tokenize (tool of your choice).
5. **(1 pt)** Remove stopwords (tool of your choice); keep entity tokens.
6. **(1 pt)** Create bigrams with pure Python.
7. **(2 pt)** Build a bigram LM (MLE) and `predict_next(prev_word, top_k=3)`.

8. **(2 pt)** Implement a simple **BPE** on: `corpus = "low lower newest widest"` (≥5 merges or until no merges).
9. **(1 pt)** Compute Accuracy/Precision/Recall/F1 for an invented confusion matrix (explain with comments).


In [47]:
import re
import math
import nltk
from collections import Counter, defaultdict

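# NLTK downloads (safe to run multiple times)
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # recent NLTK releases load the Punkt data from this package
nltk.download("stopwords", quiet=True)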

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

text = ("In mid-February 2026, the CEO of OpenAI, Sam Altman, visited Barcelona. "
        "He is 1.86m tall and met with researchers from U.P.C. and U.N.E.S.C.O. "
        "A report valued the project at $3.2 billion.")

print(text)


In mid-February 2026, the CEO of OpenAI, Sam Altman, visited Barcelona. He is 1.86m tall and met with researchers from U.P.C. and U.N.E.S.C.O. A report valued the project at $3.2 billion.


## Q1

In [48]:
# Q1 (1 pt): Sentence splitting (regex + NLTK)
# - Use regex to protect acronyms like U.P.C. so they don't break sentence boundaries.
# - Then use nltk.sent_tokenize.
#
# Return: sentences (list of strings)

# TODO: implement protect_acronym_dots and restore_acronym_dots (or equivalent)
# TODO: apply sent_tokenize

# First, implement the function that protects acronyms.
def protect_acronym_dots(text):
    # Replace the dots inside acronyms with a placeholder.
    # The lambda rewrites only the matched acronym, so dots elsewhere in the text are untouched.
    # Limitation: the last dot of an acronym is masked even when it also ends a sentence,
    # so sent_tokenize cannot split after "U.N.E.S.C.O." in the given text.
    text = re.sub(r"\b([A-Z]\.)+", lambda m: m.group(0).replace(".", "<DOT>"), text)
    # Also protect decimal numbers (e.g., 1.86, 3.2) so their dots are not read as sentence ends.
    text = re.sub(r"\b(\d+)\.(\d+)", r"\1<DOT>\2", text)
    
    return text

def restore_acronym_dots(text):
    # This function restores the dots in acronyms by replacing the placeholder back to dots.
    return text.replace("<DOT>", ".")

# Now, I will apply the protect_acronym_dots function to the text
protected_text = protect_acronym_dots(text)

# Next, I will use nltk.sent_tokenize to split the protected text into sentences
sentences = sent_tokenize(protected_text)

# Restore the acronym dots in each sentence
sentences = [restore_acronym_dots(sentence) for sentence in sentences]

# Finally, print the resulting sentences.
print(sentences)


['In mid-February 2026, the CEO of OpenAI, Sam Altman, visited Barcelona.', 'He is 1.86m tall and met with researchers from U.P.C. and U.N.E.S.C.O. A report valued the project at $3.2 billion.']
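
The output above has two sentences rather than three: the last dot of `U.N.E.S.C.O.` is masked along with the others, so no boundary is left before "A report". One possible variant is sketched below. It assumes that an acronym followed by whitespace and a capital letter (or by the end of the text) closes a sentence, and it uses a plain regex splitter as a stand-in for `sent_tokenize` so that it runs without NLTK data. The names `protect_dots_keep_boundary`, `split_sentences`, `sample`, and `sentences_sketch` are hypothetical.

```python
import re

DOT = "<DOT>"  # same placeholder string as in the cell above

def protect_dots_keep_boundary(text):
    """Mask acronym and decimal dots, but keep a dot that also ends a sentence."""
    def mask(match):
        masked = match.group(0).replace(".", DOT)
        rest = match.string[match.end():]
        # Heuristic (assumption): end of text, or whitespace plus a capital
        # letter, means the acronym's last dot is also the sentence-final period.
        if rest == "" or re.match(r"\s+[A-Z]", rest):
            masked = masked[:-len(DOT)] + "."
        return masked

    text = re.sub(r"\b(?:[A-Z]\.){2,}", mask, text)
    # Mask the dot between two digits (1.86, 3.2).
    return re.sub(r"(\d)\.(\d)", r"\1" + DOT + r"\2", text)

def split_sentences(text):
    """Regex stand-in for sent_tokenize: split after . ! ? when a capital follows."""
    protected = protect_dots_keep_boundary(text)
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", protected)
    return [part.replace(DOT, ".") for part in parts]

sample = ("In mid-February 2026, the CEO of OpenAI, Sam Altman, visited Barcelona. "
          "He is 1.86m tall and met with researchers from U.P.C. and U.N.E.S.C.O. "
          "A report valued the project at $3.2 billion.")

sentences_sketch = split_sentences(sample)
for sentence in sentences_sketch:
    print(sentence)
```

The heuristic has a cost: an acronym followed by a proper noun in mid-sentence (for example "U.P.C. Barcelona") would be split as well. The protected string can be passed to `sent_tokenize` instead of the regex splitter to satisfy the regex + NLTK requirement.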


## Q2

In [49]:
# Q2 (1 pt): Regex normalization
# Convert:
#  - U.P.C. -> UPC, U.N.E.S.C.O. -> UNESCO (general rule: remove dots in acronyms)
#  - 1.86m -> 186 centimeters (general: X.YZm -> int(round(float(X.YZ)*100)) centimeters)
#  - $3.2 billion -> three point two billion  (digits 0-9 are enough)
#
# Return: text_norm
def normalize_text(text):
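    # Remove the dots in acronyms (e.g., U.P.C. -> UPC).
    # Side effect: a dot that also ends the sentence is removed too
    # ("U.N.E.S.C.O. A report" becomes "UNESCO A report").
    text = re.sub(r"\b([A-Z]\.)+", lambda m: m.group(0).replace(".", ""), text)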
    
    # Convert meters to centimeters.
    # The helper below is the replacement callback passed to re.sub.
    def meters_to_cm(match):
        meters = float(match.group(1))
        cm = int(round(meters * 100))
        return f"{cm} centimeters"
    text = re.sub(r"(\d+\.\d+)m\b", meters_to_cm, text)
       
    # Spell out dollar amounts digit by digit (units: billion, million, thousand) with another helper.
    def dollar_to_words(match):
        number = match.group(1)
        unit = match.group(2)
        # I create a dictionary to map digits to their word representations.
        digit_map = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
                     "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
        # Convert each character and join with spaces so the result reads "three point two".
        words = " ".join("point" if c == '.' else digit_map.get(c, c) for c in number)
        return f"{words} {unit}"
    text = re.sub(r"\$(\d+\.?\d*)\s+(billion|million|thousand)", dollar_to_words, text)
    
    return text

text = ("In mid-February 2026, the CEO of OpenAI, Sam Altman, visited Barcelona. "
        "He is 1.86m tall and met with researchers from U.P.C. and U.N.E.S.C.O. "
        "A report valued the project at $3.2 billion.")

text_norm = normalize_text(text)
print(text_norm)


In mid-February 2026, the CEO of OpenAI, Sam Altman, visited Barcelona. He is 186 centimeters tall and met with researchers from UPC and UNESCO A report valued the project at three point two billion.


## Q3

In [None]:
# Q3 (1 pt): Lowercase except proper nouns + underscore multiword proper nouns
# Requirements:
# - Convert to lowercase except:
#   - Acronyms (ALL CAPS) stay uppercase (e.g., UNESCO, UPC, CEO)
#   - MixedCase tokens stay as-is (e.g., OpenAI)
#   - Multiword proper nouns joined with underscore (Sam Altman -> Sam_Altman) and preserved
#
# Return: text_case


text_case = None

# print(text_case)


In mid-february 2026 , the CEO of OpenAI , Sam_Altman , visited Barcelona . He is 186 centimeters tall and met with researchers from UPC and UNESCO a report valued the project at three point two billion .
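
A minimal sketch of one way to approach Q3 with plain regex and string methods, not a reference solution. It rests on stated heuristics: runs of capitalized words are joined with underscores, all-caps tokens of length two or more count as acronyms, tokens with an inner capital (such as `OpenAI`) are kept, and a capitalized word is treated as a proper noun only when it is not sentence-initial. The function name `lowercase_except_proper` and the `sample_norm` string (an assumed normalized input that keeps the period after `UNESCO`) are hypothetical.

```python
import re

def lowercase_except_proper(text):
    """Lowercase a text while keeping acronyms, MixedCase tokens, and proper nouns."""
    # Join runs of Capitalized words: "Sam Altman" -> "Sam_Altman".
    text = re.sub(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b",
                  lambda m: m.group(0).replace(" ", "_"), text)
    out = []
    sentence_start = True
    for token in text.split():
        core = token.strip(".,;:!?")
        keep = (
            "_" in core                                        # joined proper noun
            or (len(core) >= 2 and core.isupper())             # acronym
            or any(ch.isupper() for ch in core[1:])            # inner capital (OpenAI)
            or (core[:1].isupper() and not sentence_start)     # mid-sentence proper noun
        )
        out.append(token if keep else token.lower())
        # The next token starts a sentence if this one ends with . ! or ?
        sentence_start = token.endswith((".", "!", "?"))
    return " ".join(out)

# Assumed input: the Q2 target text, with the sentence-final period kept after UNESCO.
sample_norm = ("In mid-February 2026, the CEO of OpenAI, Sam Altman, visited Barcelona. "
               "He is 186 centimeters tall and met with researchers from UPC and UNESCO. "
               "A report valued the project at three point two billion.")

text_case_sketch = lowercase_except_proper(sample_norm)
print(text_case_sketch)
```

Known limits of these heuristics: a sentence-initial single-word proper noun is lowercased, a sentence-initial word followed by a capitalized word is joined to it (for example "In Barcelona"), and hyphenated tokens with an inner capital such as `mid-February` are kept as-is.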


## Q4

In [51]:
# Q4 (1 pt): Tokenization
# Use a tokenizer of your choice (e.g., nltk.word_tokenize).
# Return: tokens (list)

tokens = None

# print(tokens)
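
As an illustration of the tool choice in Q4, the sketch below uses a single regular expression instead of `nltk.word_tokenize`. The motivation is that `\w` already covers the underscore, so joined names such as `Sam_Altman` stay whole, and the optional `-\w+` part keeps hyphenated words together. `TOKEN_RE`, `tokenize`, and the sample sentence are hypothetical names and data.

```python
import re

# A word (letters, digits, underscore), optionally hyphenated, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Return word and punctuation tokens in order of appearance."""
    return TOKEN_RE.findall(text)

sample_sentence = "In mid-february 2026, the CEO of OpenAI, Sam_Altman, visited Barcelona."
tokens_sketch = tokenize(sample_sentence)
print(tokens_sketch)
```

This pattern would split a decimal such as `1.86` into three tokens, which is acceptable here only because the Q2 normalization removes decimals before tokenization.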


## Q5

In [52]:
# Q5 (1 pt): Stopword removal
# - Remove English stopwords
# - Do NOT remove entity tokens like OpenAI, Sam_Altman, Barcelona, UNESCO, UPC
# Return: tokens_nostop

tokens_nostop = None

# print(tokens_nostop)
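
A small sketch of stopword filtering that protects entity tokens. To stay self-contained it uses a tiny hand-written stopword set (`SMALL_STOPWORDS`, an illustrative assumption); in the notebook the set could instead be built from `stopwords.words("english")`. A token is treated as an entity when it contains an underscore or any uppercase letter, which assumes the casing from Q3 has already been applied. All names below are hypothetical.

```python
# Illustrative subset only; not the NLTK stopword list.
SMALL_STOPWORDS = {"the", "of", "is", "and", "with", "from", "a", "at", "in", "he", "us"}

def remove_stopwords(tokens, stop_set):
    """Drop stopwords but keep tokens that look like entities."""
    kept = []
    for tok in tokens:
        # Entity heuristic: underscore-joined name or any uppercase letter.
        is_entity = "_" in tok or any(ch.isupper() for ch in tok)
        if is_entity or tok.lower() not in stop_set:
            kept.append(tok)
    return kept

example_tokens = ["the", "CEO", "of", "OpenAI", ",", "Sam_Altman", ",", "visited", "Barcelona", "."]
print(remove_stopwords(example_tokens, SMALL_STOPWORDS))

# The entity check matters when an acronym collides with a stopword once lowercased.
print(remove_stopwords(["the", "US", "and", "us"], SMALL_STOPWORDS))
```

Checking the entity condition before the stopword lookup is what keeps `US` while dropping `us`.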


## Q6

In [53]:
# Q6 (1 pt): Bigrams with pure Python (no NLTK bigrams helper)
# Return: bigrams = [(w1, w2), ...]

bigrams = None

# print(bigrams)
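
For Q6, pairing a list with itself shifted by one position is enough to produce bigrams in pure Python. The helper name `make_bigrams` and the example tokens are hypothetical.

```python
def make_bigrams(tokens):
    """Return consecutive token pairs; an input shorter than two tokens yields []."""
    return list(zip(tokens, tokens[1:]))

example = ["CEO", "OpenAI", "Sam_Altman", "visited", "Barcelona"]
print(make_bigrams(example))
```

`zip` stops at the shorter input, so no index arithmetic or bounds check is needed.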


## Q7

In [54]:
# Q7 (2 pt): Bigram Language Model + next-word prediction
# Build:
# - bigram_counts[(w1,w2)]
# - context_counts[w1]
# - model[w1][w2] = P(w2|w1) = count(w1,w2)/count(w1)
#
# Then implement:
# def predict_next(prev_word, model, top_k=3): -> list[(next_word, prob)] sorted

bigram_counts = None
context_counts = None
model = None

def predict_next(prev_word, model, top_k=3):
    # TODO
    return None

# Example:
# print(predict_next("OpenAI", model, top_k=3))
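
The MLE estimate in the comments above, P(w2|w1) = count(w1,w2)/count(w1), can be sketched as follows. This is one possible reading, not a reference solution: `context_counts` counts a word only when it starts a bigram (every token except the last), so that the probabilities for each context sum to 1. The names `train_bigram_lm`, `predict_next_sketch`, and the toy token list are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram_lm(tokens):
    """Build bigram counts, context counts, and the MLE model P(w2 | w1)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count w1 only where it is followed by another token (assumption, see lead-in).
    context_counts = Counter(tokens[:-1])
    model = defaultdict(dict)
    for (w1, w2), count in bigram_counts.items():
        model[w1][w2] = count / context_counts[w1]
    return bigram_counts, context_counts, model

def predict_next_sketch(prev_word, model, top_k=3):
    """Return up to top_k (next_word, prob) pairs, most probable first."""
    candidates = model.get(prev_word, {})
    # Sort by descending probability; break ties alphabetically for stable output.
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]

toy_tokens = "the cat sat on the mat the cat ran".split()
_, _, toy_model = train_bigram_lm(toy_tokens)
print(predict_next_sketch("the", toy_model))
print(predict_next_sketch("unseen", toy_model))
```

An unseen context returns an empty list because plain MLE assigns no probability mass to it; smoothing would be needed to change that.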


## Q8

In [55]:
# Q8 (2 pt): Simple BPE (Byte Pair Encoding) on a tiny corpus
corpus = "low lower newest widest"

# Requirements:
# - Represent each word as characters + </w>
# - Compute pair frequencies (weighted by word frequency)
# - Merge most frequent pair
# - Do at least 5 merges (or stop if no pairs)
#
# Deliver:
# - merges: list of merges in order
# - final segmented version of each word

merges = None

# TODO: implement BPE helper functions:
# - get_vocab_from_corpus
# - get_pair_frequencies
# - merge_pair_in_vocab

# print(merges)
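
The requirements above can be sketched with the three helpers named in the TODO list. This is a minimal illustration rather than a reference solution. One detail is an assumption of this sketch: when several pairs share the highest frequency, the lexicographically smallest pair is merged, so the merge order depends on that tie-breaking rule. `learn_bpe`, `merges_sketch`, and `final_vocab` are hypothetical names.

```python
from collections import Counter

def get_vocab_from_corpus(corpus):
    """Map each word, as a tuple of characters plus </w>, to its frequency."""
    word_freqs = Counter(corpus.split())
    return {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}

def get_pair_frequencies(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair_in_vocab(pair, vocab):
    """Replace every occurrence of the pair with the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        merged[tuple(new_symbols)] = freq
    return merged

def learn_bpe(corpus, num_merges=5):
    """Run up to num_merges merges; stop early if no pairs remain."""
    vocab = get_vocab_from_corpus(corpus)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_frequencies(vocab)
        if not pairs:
            break
        # Highest count first; ties go to the lexicographically smallest pair (assumption).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        vocab = merge_pair_in_vocab(best, vocab)
        merges.append(best)
    return merges, vocab

merges_sketch, final_vocab = learn_bpe("low lower newest widest", num_merges=5)
print(merges_sketch)
for symbols in final_vocab:
    print(" ".join(symbols))
```

With this tie-breaking rule the five merges build `est</w>` first and then `low`, so `low` ends up as `low </w>` and `newest` as `n e w est</w>`. A different tie-breaking rule would give a different but equally valid merge order.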


## Q9

In [56]:
# Q9 (1 pt): Metrics — Accuracy, Precision, Recall, F1
# Invent a confusion matrix (TP, FP, FN, TN) and compute metrics.
# Explain each formula briefly in comments.

TP = None
FP = None
FN = None
TN = None

accuracy = None
precision = None
recall = None
f1 = None

# print(accuracy, precision, recall, f1)
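
A short sketch of the four metrics on an invented confusion matrix (the counts below are made up for illustration, as the question asks). The helper names `safe_div` and `classification_metrics` are hypothetical; division by zero is mapped to 0.0 as a convention of this sketch.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (convention of this sketch)."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, fp, fn, tn):
    # Accuracy: share of all predictions that are correct.
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    # Precision: share of predicted positives that are truly positive.
    precision = safe_div(tp, tp + fp)
    # Recall: share of actual positives that were found.
    recall = safe_div(tp, tp + fn)
    # F1: harmonic mean of precision and recall.
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Invented counts: 40 true positives, 10 false positives, 5 false negatives, 45 true negatives.
metrics_example = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics_example.items():
    print(f"{name}: {value:.4f}")
```

For these counts, accuracy is 85/100 and precision is 40/50, while recall is 40/45; F1 equals 2TP/(2TP + FP + FN) = 80/95.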
