# NLP Practical Exam — Text Processing + Language Modeling (90 minutes)

**Instructions**
- Work in this notebook only.
- Write short, clear comments to justify *tool choices* (regex vs NLTK, etc.).
- Do **not** use external NLP libraries beyond **NLTK**, **NumPy**, **PyTorch** (PyTorch not needed here).
- Keep outputs readable (print key variables).

**Total: 10 points**


## Given text

```python
text = ("In mid-February 2026, the CEO of OpenAI, Sam Altman, visited Barcelona. He is 1.86m tall and met with researchers from U.P.C. and U.N.E.S.C.O. A report valued the project at $3.2 billion.")
```

> Treat the text as *synthetic exam data* (no fact-checking needed).


## Questions

1. **(1 pt)** Sentence splitting using **regex + NLTK**.
2. **(1 pt)** Regex normalization: acronyms, height meters→centimeters, money `$X.Y billion` → `x point y billion` (words).
3. **(1 pt)** Lowercase **except** proper nouns; join multiword proper nouns with underscore (e.g., `Sam Altman → Sam_Altman`). Keep acronyms uppercase.
4. **(1 pt)** Tokenize (tool of your choice).
5. **(1 pt)** Remove stopwords (tool of your choice); keep entity tokens.
6. **(1 pt)** Create bigrams with pure Python.
7. **(2 pt)** Build a bigram LM (MLE) and `predict_next(prev_word, top_k=3)`.

8. **(2 pt)** Implement a simple **BPE** on: `corpus = "low lower newest widest"` (≥5 merges or until no merges).
9. **(1 pt)** Compute Accuracy/Precision/Recall/F1 for an invented confusion matrix (explain with comments).


In [207]:
import re
import math
import nltk
from collections import Counter, defaultdict

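# NLTK downloads (safe to run multiple times)
nltk.download("punkt", quiet=True)
# Recent NLTK releases look for "punkt_tab" instead of "punkt"; requesting both covers either case.
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)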

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

text = ("In mid-February 2026, the CEO of OpenAI, Sam Altman, visited Barcelona. "
        "He is 1.86m tall and met with researchers from U.P.C. and U.N.E.S.C.O. "
        "A report valued the project at $3.2 billion.")

print(text)


In mid-February 2026, the CEO of OpenAI, Sam Altman, visited Barcelona. He is 1.86m tall and met with researchers from U.P.C. and U.N.E.S.C.O. A report valued the project at $3.2 billion.


## Q1

In [208]:
# Q1 (1 pt): Sentence splitting (regex + NLTK)
# - Use regex to protect acronyms like U.P.C. so they don't break sentence boundaries.
# - Then use nltk.sent_tokenize.
#
# Return: sentences (list of strings)

# TODO: implement protect_acronym_dots and restore_acronym_dots (or equivalent)
# TODO: apply sent_tokenize

# First, I will implement the function to protect acronyms
def protect_acronym_dots(text):
    # This function replaces the dots in acronyms with a placeholder so that sent_tokenize ignores them.
    # The lambda rewrites only the matched acronym, so dots elsewhere in the text are left untouched.
    # Limitation: the acronym's final dot is masked too, so a sentence that ends with an acronym
    # (here "... U.N.E.S.C.O. A report ...") is not split at that point.
    text = re.sub(r"\b([A-Z]\.)+", lambda m: m.group(0).replace(".", "<DOT>"), text)
    # We also protect decimal numbers (1.86, 3.2) so that their dot is not read as a sentence end
    text = re.sub(r"\b(\d+)\.(\d+)", r"\1<DOT>\2", text)
    
    return text

def restore_acronym_dots(text):
    # This function restores the dots in acronyms by replacing the placeholder back to dots.
    return text.replace("<DOT>", ".")

# Now, I will apply the protect_acronym_dots function to the text
protected_text = protect_acronym_dots(text)

# Next, I will use nltk.sent_tokenize to split the protected text into sentences
sentences = sent_tokenize(protected_text)

# Restore the acronym dots in each sentence
sentences = [restore_acronym_dots(sentence) for sentence in sentences]

# Finally, I will print the sentences
print(sentences)


['In mid-February 2026, the CEO of OpenAI, Sam Altman, visited Barcelona.', 'He is 1.86m tall and met with researchers from U.P.C. and U.N.E.S.C.O. A report valued the project at $3.2 billion.']
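
The output above contains only two sentences because the final dot of `U.N.E.S.C.O.` was masked together with the inner ones. The following is a minimal sketch of one possible fix, under the assumption that an acronym ends a sentence when it is followed by whitespace and a capital letter (or by the end of the text). To stay self-contained it uses a simple regex splitter as a stand-in for `sent_tokenize`; the names `protect_dots` and `split_sentences` are illustrative and not part of the exam template.

```python
import re

PLACEHOLDER = "<DOT>"

def protect_dots(text):
    # Mask the dot inside decimal numbers such as 1.86 or 3.2.
    text = re.sub(r"(?<=\d)\.(?=\d)", PLACEHOLDER, text)

    def mask_acronym(match):
        acronym = match.group(0)
        rest = match.string[match.end():]
        # Assumption: end of text, or whitespace + capital letter, marks a sentence boundary.
        ends_sentence = rest.strip() == "" or re.match(r"\s+[A-Z]", rest) is not None
        if ends_sentence:
            # Mask the inner dots but keep the last one as the sentence-ending period.
            return acronym[:-1].replace(".", PLACEHOLDER) + "."
        return acronym.replace(".", PLACEHOLDER)

    # Two or more "letter + dot" groups count as an acronym.
    return re.sub(r"\b(?:[A-Z]\.){2,}", mask_acronym, text)

def split_sentences(text):
    protected = protect_dots(text)
    # Split after ., ! or ? when whitespace and a capital letter follow.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", protected)
    return [part.replace(PLACEHOLDER, ".") for part in parts]

sample = ("In mid-February 2026, the CEO of OpenAI, Sam Altman, visited Barcelona. "
          "He is 1.86m tall and met with researchers from U.P.C. and U.N.E.S.C.O. "
          "A report valued the project at $3.2 billion.")

demo_sentences = split_sentences(sample)
for sentence in demo_sentences:
    print(sentence)
```

The heuristic has a known weak spot: an acronym in the middle of a sentence that is followed by a capitalized word (for example `U.P.C. Barcelona`) would be split incorrectly.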


## Q2

In [209]:
# Q2 (1 pt): Regex normalization
# Convert:
#  - U.P.C. -> UPC, U.N.E.S.C.O. -> UNESCO (general rule: remove dots in acronyms)
#  - 1.86m -> 186 centimeters (general: X.YZm -> int(round(float(X.YZ)*100)) centimeters)
#  - $3.2 billion -> three point two billion  (digits 0-9 are enough)
#
# Return: text_norm
def normalize_text(text):
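    # Step 1: we remove the dots between acronym letters (U.P.C. -> UPC.)
    text = re.sub(r"([A-Z])\.(?=[A-Z]\.)", r"\1", text)
    # Step 2: we remove the remaining dot after a capital letter when whitespace follows (UPC. -> UPC).
    # Limitation: this also drops a sentence-ending period that coincides with the acronym's
    # last dot, which is why the output reads "UNESCO A report".
    text = re.sub(r"([A-Z])\.(?=\s)", r"\1", text)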
    
    # We convert meters to centimeters.
    # meters_to_cm converts one regex match; re.sub then applies it to every match in the text.
    def meters_to_cm(match):
        meters = float(match.group(1))
        cm = int(round(meters * 100))
        return f"{cm} centimeters"
    text = re.sub(r"(\d+\.\d+)m\b", meters_to_cm, text)
       
    # We convert digits to words for amounts in billions
    def dollar_to_words(match):
        number = match.group(1)
        unit = match.group(2)
        # We map each digit to its word with a dictionary and join the words with spaces.
        digit_map = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
                     "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
        # We also need to handle the decimal point, which we will convert to "point".
        words = " ".join("point" if c == "." else digit_map.get(c, c) for c in number)
        return f"{words} {unit}"
    text = re.sub(r"\$(\d+\.?\d*)\s+(billion|million|thousand)", dollar_to_words, text)
    
    return text


text_norm = normalize_text(text)
print(text_norm)





In mid-February 2026, the CEO of OpenAI, Sam Altman, visited Barcelona. He is 186 centimeters tall and met with researchers from UPC and UNESCO A report valued the project at three point two billion.
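
In the output above, the period that ended the second sentence disappeared together with the last dot of `U.N.E.S.C.O.`. A minimal sketch of an acronym rule that keeps it, assuming (as a heuristic, not as the exam's reference solution) that an acronym followed by whitespace and a capital letter, or by the end of the text, closes a sentence. The name `collapse_acronyms` is hypothetical.

```python
import re

def collapse_acronyms(text):
    def collapse(match):
        letters = match.group(0).replace(".", "")
        rest = match.string[match.end():]
        # Assumption: end of text, or whitespace + capital letter, marks a sentence boundary.
        ends_sentence = rest.strip() == "" or re.match(r"\s+[A-Z]", rest) is not None
        # Re-attach one period when the acronym's last dot also ended the sentence.
        return letters + "." if ends_sentence else letters

    # Two or more "letter + dot" groups count as an acronym.
    return re.sub(r"\b(?:[A-Z]\.){2,}", collapse, text)

acronym_sample = "He met with researchers from U.P.C. and U.N.E.S.C.O. A report valued the project."
acronym_result = collapse_acronyms(acronym_sample)
print(acronym_result)
```

Requiring at least two `letter.` groups also avoids touching a single capital letter that happens to end a sentence.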


## Q3

In [210]:
# Q3 (1 pt): Lowercase except proper nouns + underscore multiword proper nouns
# Requirements:
# - Convert to lowercase except:
#   - Acronyms (ALL CAPS) stay uppercase (e.g., UNESCO, UPC, CEO)
#   - MixedCase tokens stay as-is (e.g., OpenAI)
#   - Multiword proper nouns joined with underscore (Sam Altman -> Sam_Altman) and preserved
#
# Return: text_case

def lowercase_except_proper_nouns(text):
    # We split the text into tokens using word_tokenize
    tokens = word_tokenize(text)
    
    # We will use a list to store the processed tokens
    processed_tokens = []
    
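    for token in tokens:
        # Acronyms (all caps, more than one letter) stay uppercase
        if token.isupper() and len(token) > 1 and token.isalpha():
            processed_tokens.append(token)
        # MixedCase tokens (an uppercase letter after the first character, e.g. OpenAI) stay as-is
        elif any(c.isupper() for c in token[1:]) and any(c.islower() for c in token):
            processed_tokens.append(token)
        else:
            # Everything else is lowercased. Limitation: this includes single Capitalized
            # proper nouns such as "Barcelona"; only the joining step below restores names.
            processed_tokens.append(token.lower())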
    
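    # Now we need to join multiword proper nouns with underscores.
    # Heuristic: two consecutive Capitalized alphabetic tokens (checked on the original,
    # not yet lowercased tokens) are treated as one name. Longer names are joined two words at a time.
    final_tokens = []
    i = 0
    while i < len(processed_tokens):
        # We check whether the current token and the next token are both Capitalized words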
        if (i < len(processed_tokens) - 1 and 
            tokens[i] and tokens[i][0].isupper() and tokens[i][1:].islower() and tokens[i].isalpha() and
            tokens[i+1] and tokens[i+1][0].isupper() and tokens[i+1][1:].islower() and tokens[i+1].isalpha()):
            final_tokens.append(tokens[i] + "_" + tokens[i+1])
            i += 2
        else:
            final_tokens.append(processed_tokens[i])
            i += 1
    
    return " ".join(final_tokens)

text_case = lowercase_except_proper_nouns(text_norm)

print(text_case)




in mid-February 2026 , the CEO of OpenAI , Sam_Altman , visited barcelona . he is 186 centimeters tall and met with researchers from UPC and UNESCO a report valued the project at three point two billion .
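
The output above lowercases `Barcelona`, although Q3 asks for proper nouns to be preserved. One way to handle single-word names is sketched below, assuming that a Capitalized word that does not start a sentence is a proper noun. This is a shape-based heuristic chosen for illustration (a sentence-initial name would still be lowercased); `case_fold_keep_names` and `is_capitalized` are hypothetical names, and the token list is a shortened, hand-written version of the exam text.

```python
SENTENCE_END = {".", "!", "?"}

def is_capitalized(token):
    # "Sam" or "Barcelona": alphabetic, first letter upper, remaining letters lower.
    return token.isalpha() and token[0].isupper() and token[1:].islower()

def case_fold_keep_names(tokens):
    result = []
    i = 0
    sentence_start = True
    while i < len(tokens):
        token = tokens[i]
        if is_capitalized(token) and not sentence_start:
            # Collect the whole run of Capitalized tokens and join it with underscores.
            j = i
            while j < len(tokens) and is_capitalized(tokens[j]):
                j += 1
            result.append("_".join(tokens[i:j]))
            i = j
            continue
        if token.isalpha() and token.isupper() and len(token) > 1:
            result.append(token)          # acronym: keep uppercase
        elif any(c.isupper() for c in token[1:]) and any(c.islower() for c in token):
            result.append(token)          # MixedCase: keep as-is
        else:
            result.append(token.lower())  # ordinary word, number or punctuation
        sentence_start = token in SENTENCE_END
        i += 1
    return result

case_tokens = ["In", "mid-February", "2026", ",", "the", "CEO", "of", "OpenAI", ",",
               "Sam", "Altman", ",", "visited", "Barcelona", ".",
               "He", "is", "186", "centimeters", "tall", "."]
case_result = case_fold_keep_names(case_tokens)
print(" ".join(case_result))
```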


## Q4

In [211]:
# Q4 (1 pt): Tokenization
# Use a tokenizer of your choice (e.g., nltk.word_tokenize).
# Return: tokens (list)

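def tokenize_text(text):
    # We use NLTK's word_tokenize: it separates punctuation into its own tokens
    # and keeps hyphenated words such as "mid-February" together.
    tokens = word_tokenize(text)
    return tokens

# Note: the input here is text_norm (Q2 output), not text_case (Q3 output),
# so the tokens keep their original casing and "Sam", "Altman" remain separate.
tokens = tokenize_text(text_norm)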

print(tokens)


['In', 'mid-February', '2026', ',', 'the', 'CEO', 'of', 'OpenAI', ',', 'Sam', 'Altman', ',', 'visited', 'Barcelona', '.', 'He', 'is', '186', 'centimeters', 'tall', 'and', 'met', 'with', 'researchers', 'from', 'UPC', 'and', 'UNESCO', 'A', 'report', 'valued', 'the', 'project', 'at', 'three', 'point', 'two', 'billion', '.']
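
Since Q4 allows any tool, a regex tokenizer is a possible alternative when the handling of `_` and `-` should be explicit, for example to keep underscore-joined names from Q3 as single tokens. The pattern below is a minimal sketch and an assumption of this example; it covers ASCII letters and digits only and would need extending for decimals, currency amounts or accented characters.

```python
import re

# A word is a run of letters/digits, optionally continued by "-" or "_" plus another run;
# any other non-space, non-word character becomes a single-character token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[_-][A-Za-z0-9]+)*|[^\w\s]")

def regex_tokenize(text):
    return TOKEN_PATTERN.findall(text)

regex_tokens = regex_tokenize("in mid-February 2026, the CEO of OpenAI, Sam_Altman, visited Barcelona.")
print(regex_tokens)
```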


## Q5

In [212]:
# Q5 (1 pt): Stopword removal
# - Remove English stopwords
# - Do NOT remove entity tokens like OpenAI, Sam_Altman, Barcelona, UNESCO, UPC
# Return: tokens_nostop

def remove_stopwords(tokens):
    # We build a set of English stopwords so that membership tests are fast.
    stop_words = set(stopwords.words("english"))
    # We keep a token if it is not a stopword (compared in lowercase) or if it is a known entity.
    # The entity tokens are hard-coded in a set because this text contains only a few of them.
    entity_tokens = {"OpenAI", "Sam_Altman", "Barcelona", "UNESCO", "UPC"}
    tokens_nostop = [token for token in tokens if token.lower() not in stop_words or token in entity_tokens]
    return tokens_nostop

tokens_nostop = remove_stopwords(tokens)

print(tokens_nostop)


['mid-February', '2026', ',', 'CEO', 'OpenAI', ',', 'Sam', 'Altman', ',', 'visited', 'Barcelona', '.', '186', 'centimeters', 'tall', 'met', 'researchers', 'UPC', 'UNESCO', 'report', 'valued', 'project', 'three', 'point', 'two', 'billion', '.']
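
The hard-coded entity set works for this text. A shape-based check is one way to generalise it, sketched below under two assumptions: the tokens were already case-folded as in Q3 (so sentence-initial words such as `In` are lowercase), and any token that still contains an uppercase letter or an underscore is an entity. `STOPWORDS` is a small hand-written stand-in for NLTK's English list, and all names here are illustrative.

```python
# Small illustrative stopword set (an assumption; NLTK's list is much longer).
STOPWORDS = {"the", "of", "is", "and", "with", "from", "at", "a", "in", "he", "it", "who"}

def looks_like_entity(token):
    # Underscore-joined names, acronyms and MixedCase tokens all match this check.
    return "_" in token or any(c.isupper() for c in token)

def remove_stopwords_keep_entities(tokens, stop_words=STOPWORDS):
    # Entities are kept even if their lowercase form is a stopword (e.g. "WHO", "IT").
    return [t for t in tokens if looks_like_entity(t) or t.lower() not in stop_words]

stop_sample = ["the", "CEO", "of", "OpenAI", ",", "Sam_Altman", "met", "with",
               "WHO", "and", "IT", "staff", "in", "Barcelona"]
stop_result = remove_stopwords_keep_entities(stop_sample)
print(stop_result)
```

Checking the entity shape before lowercasing matters for acronyms whose lowercase form is an ordinary stopword.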


## Q6

In [213]:
# Q6 (1 pt): Bigrams with pure Python (no NLTK bigrams helper)
# Return: bigrams = [(w1, w2), ...]

def create_bigrams(tokens):
    # First, I create an empty list to store the bigrams
    bigrams = []
    # Then, I iterate through the tokens and pair each token with the next one
    for i in range(len(tokens) - 1):
        # Here we append a tuple of the current token and the next token to the bigrams list
        bigrams.append((tokens[i], tokens[i+1]))
    return bigrams

# I didn't know whether you wanted bigrams with or without stopwords, so I created them for both cases.
bigrams = create_bigrams(tokens)
bigrams_nostop = create_bigrams(tokens_nostop)

print(bigrams)
print(bigrams_nostop)


[('In', 'mid-February'), ('mid-February', '2026'), ('2026', ','), (',', 'the'), ('the', 'CEO'), ('CEO', 'of'), ('of', 'OpenAI'), ('OpenAI', ','), (',', 'Sam'), ('Sam', 'Altman'), ('Altman', ','), (',', 'visited'), ('visited', 'Barcelona'), ('Barcelona', '.'), ('.', 'He'), ('He', 'is'), ('is', '186'), ('186', 'centimeters'), ('centimeters', 'tall'), ('tall', 'and'), ('and', 'met'), ('met', 'with'), ('with', 'researchers'), ('researchers', 'from'), ('from', 'UPC'), ('UPC', 'and'), ('and', 'UNESCO'), ('UNESCO', 'A'), ('A', 'report'), ('report', 'valued'), ('valued', 'the'), ('the', 'project'), ('project', 'at'), ('at', 'three'), ('three', 'point'), ('point', 'two'), ('two', 'billion'), ('billion', '.')]
[('mid-February', '2026'), ('2026', ','), (',', 'CEO'), ('CEO', 'OpenAI'), ('OpenAI', ','), (',', 'Sam'), ('Sam', 'Altman'), ('Altman', ','), (',', 'visited'), ('visited', 'Barcelona'), ('Barcelona', '.'), ('.', '186'), ('186', 'centimeters'), ('centimeters', 'tall'), ('tall', 'met'), ('me
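
The index-based loop in `create_bigrams` can also be written with `zip`, which still counts as pure Python. A short equivalent sketch follows; `create_bigrams_zip` is a name introduced here for illustration.

```python
def create_bigrams_zip(tokens):
    # zip stops at the shorter input, so pairing the list with itself shifted by one
    # yields exactly len(tokens) - 1 bigrams (and none for lists shorter than two).
    return list(zip(tokens, tokens[1:]))

bigram_sample = ["visited", "Barcelona", ".", "186", "centimeters"]
bigram_result = create_bigrams_zip(bigram_sample)
print(bigram_result)
```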

## Q7

In [214]:
# Q7 (2 pt): Bigram Language Model + next-word prediction
# Build:
# - bigram_counts[(w1,w2)]
# - context_counts[w1]
# - model[w1][w2] = P(w2|w1) = count(w1,w2)/count(w1)
#
# Then implement:
# def predict_next(prev_word, model, top_k=3): -> list[(next_word, prob)] sorted

def build_bigram_model(bigrams):
    # We create dictionaries to store bigram counts and context counts
    bigram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    
    # We count the bigrams and contexts
    for w1, w2 in bigrams:
        bigram_counts[(w1, w2)] += 1
        context_counts[w1] += 1
    
    # We build the probability model P(w2|w1) = count(w1,w2) / count(w1)
    model = defaultdict(dict)
    for (w1, w2), count in bigram_counts.items():
        model[w1][w2] = count / context_counts[w1]
    
    return model, bigram_counts, context_counts


def predict_next(prev_word, model, top_k=3):
    # We get all possible next words and their probabilities for the given previous word
    if prev_word not in model:
        return []
    
    # We get the dictionary of next words and probabilities
    next_words = model[prev_word]
    
    # We sort by probability in descending order and get top k
    sorted_predictions = sorted(next_words.items(), key=lambda x: x[1], reverse=True)
    
    # We return the top k predictions as a list of (word, probability) tuples
    return sorted_predictions[:top_k]


# We build the model using the bigrams we already created
model, bigram_counts, context_counts = build_bigram_model(bigrams)



# Example:
print(predict_next("OpenAI", model, top_k=3))


[(',', 1.0)]
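
In the exam text almost every context word occurs once, so most predictions have probability 1.0. The toy example below, with an invented token sequence, makes the MLE formula `P(w2|w1) = count(w1,w2) / count(w1)` easier to check by hand. It is a sketch with illustrative names (`toy_model`, `predict_next_sorted`), and it breaks ties alphabetically, which is an assumption rather than a requirement of Q7.

```python
from collections import Counter, defaultdict

toy_tokens = "the cat sat on the mat the cat".split()
toy_bigrams = list(zip(toy_tokens, toy_tokens[1:]))

pair_counts = Counter(toy_bigrams)
first_word_counts = Counter(w1 for w1, _ in toy_bigrams)

# MLE estimate: relative frequency of each continuation given the previous word.
toy_model = defaultdict(dict)
for (w1, w2), count in pair_counts.items():
    toy_model[w1][w2] = count / first_word_counts[w1]

def predict_next_sorted(prev_word, model, top_k=3):
    candidates = model.get(prev_word, {})
    # Sort by descending probability, then alphabetically so ties are deterministic.
    ranked = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]

toy_prediction = predict_next_sorted("the", toy_model)
print(toy_prediction)  # "the" is followed by "cat" twice and by "mat" once
```

For every context word the probabilities of its continuations sum to 1, which is a quick sanity check for any MLE bigram model.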


## Q8

In [None]:
# Q8 (2 pt): Simple BPE (Byte Pair Encoding) on a tiny corpus
corpus = "low lower newest widest"

# Requirements:
# - Represent each word as characters + </w>
# - Compute pair frequencies (weighted by word frequency)
# - Merge most frequent pair
# - Do at least 5 merges (or stop if no pairs)
#
# Deliver:
# - merges: list of merges in order
# - final segmented version of each word

# TODO: implement BPE helper functions:
# - get_vocab_from_corpus
# - get_pair_frequencies
# - merge_pair_in_vocab

from collections import defaultdict, Counter  # already imported in the first cell; repeated so this cell runs on its own

def get_vocab_from_corpus(corpus):
    # We split the corpus into words and count their frequencies
    words = corpus.split()
    word_freq = Counter(words)
    
    # We represent each word as a list of characters with </w> at the end
    vocab = {}
    for word, freq in word_freq.items():
        vocab[tuple(list(word) + ["</w>"])] = freq
    
    return vocab

def get_pair_frequencies(vocab):
    # We count all adjacent pairs weighted by word frequency
    pairs = defaultdict(int)
    
    for word, freq in vocab.items():
        # We iterate through the characters in each word
        for i in range(len(word) - 1):
            pair = (word[i], word[i + 1])
            pairs[pair] += freq
    
    return pairs



[('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't'), ('est', '</w>')]
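
The cell above ends before `merge_pair_in_vocab` and the merge loop, although the printed merge list shows that they were run. The following self-contained sketch is one way to complete the procedure. It assumes that ties between equally frequent pairs are broken by first occurrence (which is what `max` over an insertion-ordered dictionary does); under that assumption it reproduces the merge list printed above. `learn_bpe` is a hypothetical helper name.

```python
from collections import Counter

def get_vocab_from_corpus(corpus):
    # Each word becomes a tuple of characters plus the end-of-word marker.
    return {tuple(word) + ("</w>",): freq for word, freq in Counter(corpus.split()).items()}

def get_pair_frequencies(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair_in_vocab(pair, vocab):
    # Replace every occurrence of the pair with the concatenated symbol.
    merged_vocab = {}
    for word, freq in vocab.items():
        symbols = []
        i = 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                symbols.append(word[i] + word[i + 1])
                i += 2
            else:
                symbols.append(word[i])
                i += 1
        merged_vocab[tuple(symbols)] = freq
    return merged_vocab

def learn_bpe(corpus, num_merges=5):
    vocab = get_vocab_from_corpus(corpus)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_frequencies(vocab)
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # ties: the first pair encountered wins
        merges.append(best)
        vocab = merge_pair_in_vocab(best, vocab)
    return merges, vocab

bpe_merges, bpe_vocab = learn_bpe("low lower newest widest", num_merges=5)
print(bpe_merges)
for segmented_word in bpe_vocab:
    print(" ".join(segmented_word))
```

After five merges the words are segmented as `low </w>`, `low e r </w>`, `n e w est</w>` and `w i d est</w>`, which covers the second deliverable listed in the question.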


## Q9

In [216]:
# Q9 (1 pt): Metrics — Accuracy, Precision, Recall, F1
# Invent a confusion matrix (TP, FP, FN, TN) and compute metrics.
# Explain each formula briefly in comments.

TP = None
FP = None
FN = None
TN = None

accuracy = None
precision = None
recall = None
f1 = None

# print(accuracy, precision, recall, f1)
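
The stub above leaves every value unset. One way to fill it in is sketched below with an invented confusion matrix for a binary classifier evaluated on 100 examples; the counts are made up for illustration, as the question requests.

```python
# Invented counts (assumption): 100 examples in total.
TP = 40  # positives correctly predicted as positive
FP = 10  # negatives wrongly predicted as positive
FN = 5   # positives wrongly predicted as negative
TN = 45  # negatives correctly predicted as negative

# Accuracy: share of all predictions that are correct.
accuracy = (TP + TN) / (TP + FP + FN + TN)

# Precision: of everything predicted positive, how much really is positive.
precision = TP / (TP + FP)

# Recall: of everything that really is positive, how much was found.
recall = TP / (TP + FN)

# F1: harmonic mean of precision and recall; it is low if either one is low.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these counts recall is higher than precision, because the invented classifier makes more false-positive than false-negative errors.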
