# Hands-On NLP - Class 1: Foundations of Text Analysis

**Date:** January 9, 2026

**Goal:** This first notebook explores the fundamental questions of NLP:
*   What is a character? (Unicode)
*   What is a word? (Tokenization)
*   How do words behave? (Zipf's Law, Type/Token Ratios)

**Instructions:**
1. Run the cells to see the results.
2. Look for **🚧 TODO:** markers. These are exercises for you.
3. For "Explain" questions, write your answer in the markdown cell below the question.

This notebook is designed to be completed **individually** in class (1h30).

<span style="color:magenta">Student name:</span>

* 🚧 TODO: ... fill in your name ...

## Setup

In [None]:
import collections
import re
import unicodedata
import os
import zipfile
import urllib.request
from collections import Counter
from pathlib import Path

import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm

# Setup
tqdm.pandas()
sns.set_style("darkgrid")
sns.set_context("notebook")
pd.set_option("display.precision", 2)



## Download the Stackexchange Dataset provided by EleutherAI

* original Github repository: https://github.com/EleutherAI/stackexchange-dataset 
* original data from: https://archive.org/details/stackexchange 
* small subset for this class: https://gerdes.fr/saclay/honlp/texts.zip

In [None]:
# Download and unzip the texts dataset if not already present

DATA_DIR = Path("texts")

if not DATA_DIR.exists():
    print("Downloading texts.zip...")
    url = "https://gerdes.fr/saclay/honlp/texts.zip"
    zip_path = "texts.zip"
    urllib.request.urlretrieve(url, zip_path)
    
    print("Extracting texts.zip...")
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(".")
    
    # Clean up zip file
    os.remove(zip_path)
    print("Done!")

# Verify Data
if not DATA_DIR.exists():
    raise FileNotFoundError(f"Could not find 'texts' directory at {DATA_DIR.absolute()}")

print(f"Data directory: {DATA_DIR.absolute()}")




## Part 1: Text as Data (Unicode & Encoding)

Before we process "words", we must handle "characters". Modern text is almost always Unicode (UTF-8).
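
As a rough illustration of what "Unicode (UTF-8)" means in practice, the sketch below encodes a few characters and prints how many bytes each one needs. The four sample characters are our own choice and are not part of the class dataset.

```python
# Illustrative characters (our own choice): Latin a, e with acute, euro sign, waving hand.
samples = ["a", "\u00e9", "\u20ac", "\U0001F44B"]

utf8_lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")  # str -> bytes
    utf8_lengths[ch] = len(encoded)
    print(f"U+{ord(ch):04X} {ch!r}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")

# Decoding the bytes gives back the original character.
round_trip_ok = all(ch.encode("utf-8").decode("utf-8") == ch for ch in samples)
print("Round trip OK:", round_trip_ok)
```

A code point is an abstract number; the encoding decides how many bytes store it, which is why `len(text)` and `len(text.encode("utf-8"))` can differ.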

### 1.1 Exploring Unicode

**🚧 TODO:**
1. Why is UTF-8 the most common encoding on the web? What does UTF-8 stand for?
2. Explore the text_sample below. What do you notice? If you're looking at this in VSCode, why do you see boxes around some characters?

**Answer here:**

...

In [None]:
# Example string with diversity
text_sample = "Hello! 👋 123 ÿ €ĕŁ茶Ꝣीが€,!≫■✅🤗\u200c\u200e\u3000\xa0\xad AΑАO〇0"

Write a function `analyze_string(text)` that prints the name of each character in the text using `unicodedata.name()`.
Each line should look like this:

```
'a' (U+0061): LATIN SMALL LETTER A
```


In [None]:
def analyze_string(text):
    print(f"Analyzing: {text}")
    for char in text:
        try:
            # 🚧 TODO:
            ... your code here ...
        print(f"'{char}' ... your code here ...")

analyze_string(text_sample)

### 1.2 The Mystery Character

**🚧 TODO:**
Identify the highest-numbered non-Chinese Unicode character in the following string.
What is it called? Where did you find it?

In [None]:
mystery_string = "Hello 🌍! This is a test with some weird chars: ﷽, 🐍, and 🫀."

analyze_string(mystery_string)

# Code to find max explicitly:
# 🚧 TODO:
... your code here ...
print(f"\nMax char by code point: ... your code here ...")


**Answer:**

*(Write your explanation here. Example: The character is the Earth emoji...)*

---

### 1.3 Visual Lookalikes (Homoglyphs)

**🚧 TODO:**
Look at the two groups of three characters in `text_sample` above that look similar.
Run `analyze_string` on them. Why do they have different Unicode code points? How can this be a problem for text processing and internet security?

**Answer:**
... your answer here ...
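
To make the security angle concrete, here is a minimal sketch with a hypothetical spoofed name in which the Latin `a` is replaced by the Cyrillic `а` (U+0430). The helper `scripts_used` is our own illustration: it takes the first word of each character's Unicode name as a rough proxy for its script, which is only a heuristic and would raise `ValueError` on unnamed characters.

```python
import unicodedata

genuine = "paypal"
spoofed = "p\u0430yp\u0430l"  # hypothetical example: U+0430 replaces both Latin a's

# The two strings may render identically, yet they are different sequences of code points.
print(genuine, spoofed, genuine == spoofed)

def scripts_used(text):
    """First word of each character's Unicode name, used as a rough script label."""
    return {unicodedata.name(ch).split()[0] for ch in text}

genuine_scripts = scripts_used(genuine)
spoofed_scripts = scripts_used(spoofed)
print(genuine_scripts)
print(spoofed_scripts)
```

A string that mixes scripts inside a single word is not necessarily malicious, but it is a cheap signal worth inspecting.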


## Part 2: Loading & Visualizing Corpora

We will load text files from the `texts` folder.

In [None]:
CORPORA = ["mythology", "woodworking", "robotics", "hsm", "health", "portuguese"]

corpora_text = {}
stats = []

# 🚧 TODO: Complete the code to load the data
# Create a dictionary `corpora_text` mapping corpus name -> full string content
# And a list `stats` with info per corpus

for corpus in tqdm(CORPORA):
    corpus_path = DATA_DIR / corpus

    texts = []
    files = list(corpus_path.glob("*.txt"))
    ... your code here ...
    stats.append({
        "corpus": corpus, 
        "files_n": len(files), 
        "chars_n": len(full_text)
    })

df = pd.DataFrame(stats).set_index("corpus")
df['text'] = ... your code here ...
...
df

### 2.1 Character, Type, and Token Ratios

In [None]:
# 🚧 TODO: Visualize character counts per corpus (bar plot)
plt.figure(figsize=(10, 5))
sns.barplot(x=... your code here ...)
plt.title("Total Characters per Corpus")
plt.xticks(rotation=45)
plt.show()

**Observation:**
Look at the plot above. Is our dataset balanced?

*   **No.** Some corpora are much larger than others (e.g., maybe "woodworking" vs "hsm").
*   **Consequence:** Comparing raw counts (like "number of unique characters") directly is unfair. We need normalized metrics.

### 2.2 Character Frequency Analysis

**Why do we do this?**
*   To identify the **language** (Portuguese will have `ã`, `ç`).
*   To spot **artifacts** (encoding errors, weird symbols).
*   To fingerprint the **domain** (Math symbols in Robotics? Emojis in informal text?).
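
A minimal sketch of the first point, using two made-up sentences rather than the class corpora: counting only the characters outside the ASCII range already hints at the language. The helper name `non_ascii_counts` is our own.

```python
from collections import Counter

def non_ascii_counts(text):
    """Count only the characters outside the ASCII range (code point > 127)."""
    return Counter(ch for ch in text if ord(ch) > 127)

# Two made-up sentences, not taken from the corpora.
pt_counts = non_ascii_counts("A ação começou de manhã")
en_counts = non_ascii_counts("The action started in the morning")

print("Portuguese sample:", pt_counts.most_common())
print("English sample:   ", en_counts.most_common())
```

On real corpora the same count can also surface artifacts, since unexpected symbols tend to show up in the long tail of the distribution.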

Let's look at the distribution of characters in "mythology".

In [None]:
# 🚧 TODO: Create a frequency histogram for the most frequent characters in mythology
# Then create a log-log plot to see if the characters follow Zipf's law.
# Provide your analysis below.

# Solution:
if "mythology" in df.index:
    myth_text = df.loc["mythology", "text"]
    ... your code here ...

    # Plot top 30
    plt.figure(figsize=(12, 5))
    sns.barplot(... your code here ...)
    plt.title("Top 30 Characters in Mythology")
    plt.show()

    # Log-Log Plot
    plt.figure(figsize=(6, 4))
    plt.loglog(... your code here ...)
    plt.title("Character Distribution (Log-Log) - Zipfian?")
    plt.xlabel("Rank")
    plt.ylabel("Frequency")
    plt.show()

**Observation:**
### 🚧 TODO:
... your observations here ...

### 2.3 Character Richness (Diversity)

Does every corpus use the same variety of characters?

**Method 1: The Naive Approach**
Calculate `char_types_n` (number of unique characters) and divide by `chars_n`.

In [None]:
# 🚧 TODO: Add 'char_types_n' and 'char_type_ratio' columns
df["char_types_n"] = ... your code here ...
df["char_type_ratio"] = ... your code here ...

# Plot
plt.figure(figsize=(10, 5))
ax1 = sns.barplot(x=df.index, y="char_type_ratio", data=df, hue="corpus")
ax1.yaxis.set_major_formatter(mtick.PercentFormatter(1.0, decimals=2))
plt.title("Naive Diversity Ratio (Types / Total Chars)")
plt.xticks(rotation=45)
plt.show()

**Wait a minute!** 🛑

Look at the previous results. The smallest corpus often looks "richest".

**The Problem:**
As you read more text, finding *new* characters (or words) becomes harder.
Type-Token Ratio (TTR) naturally decreases as text length increases.
Comparing TTR on corpora of different sizes is **not fair**.
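
The effect is easy to reproduce on a toy text. The sketch below uses a made-up sentence repeated many times (not one of the corpora), so that no new characters can appear after the first pass, and measures the ratio on prefixes of growing length.

```python
# A tiny made-up text, repeated so that no new characters appear after the first pass.
text = "the quick brown fox jumps over the lazy dog " * 50

def type_ratio(chunk):
    """Unique characters divided by total characters."""
    return len(set(chunk)) / len(chunk)

prefix_ratios = {n: type_ratio(text[:n]) for n in (10, 100, 1000)}
for n, ratio in prefix_ratios.items():
    print(f"first {n:>4} chars: {len(set(text[:n])):>2} types, ratio = {ratio:.3f}")
```

The number of types saturates quickly while the denominator keeps growing, so the ratio falls even though the text has not become any less "rich".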

**Method 2: Fixed-Size Window (The Correct Way)**
We should compare the richness on the **same amount of text** (e.g., the first 10,000 characters).

In [None]:
# 🚧 TODO: Calculate diversity averaged over sliding windows
N = 10000

def get_sliding_diversity(text, window_size):
    if len(text) < window_size:
        if len(text) == 0: return 0.0
        return len(set(text)) / len(text)

    # We take non-overlapping chunks for efficiency (Mean TTR)
    ratios = []
    for i in range(0, len(text) - window_size + 1, window_size):
        chunk = ... your code here ...
    return np.mean(ratios)

df["fixed_window_diversity"] = df["text"].apply(lambda t: get_sliding_diversity(t, N))

plt.figure(figsize=(10, 5))
ax2 = sns.barplot(x=df.index, y="fixed_window_diversity", data=df, hue="corpus")
ax2.yaxis.set_major_formatter(mtick.PercentFormatter(1.0, decimals=2))
plt.title(f"Fair Diversity Ratio (Avg over {N}-char windows)")
plt.xticks(rotation=45)
plt.show()

**Question:** Now compare the "Naive" vs "Fixed-Window" plots. 
1. Did the ranking change?
2. Provide a hand-wavy explanation of the two "diversity winner" corpora. You may have to look into the texts to answer this.

**Answer:**
... your observations here ...

## Part 3: Tokenization

**Goal:** Split the text into meaningful units (words).

We will compare methods:
1.  **Simple Split:** `text.split()` (Splits on whitespace).
2.  **Regex Split:** `re.split(r'\W+', text)` (Splits on runs of non-word characters, i.e. anything other than letters, digits, and underscore).
    *   *Variant:* `re.split(r'(\W+)', text)` (Keeps delimiters).
3.  **Linguistic Split:** `nltk.word_tokenize` (Smart rules).

In [None]:
sentence = "Wait\u2014what?! I can't believe it's 2026..."  # \u2014 is an em dash
print(f"Original: {sentence}\n")

# Method 1: Simple Split
print("1. text.split():")
print(sentence.split())
# Problem: Punctuation sticks to words ("2026..." is one token)

# Method 2a: Regex Split (\W+) - eats punctuation
print("\n2a. Regex split (\\W+):")
print(re.split(r'\W+', sentence))
# Problem: "can't" becomes "can", "t". Punctuation is gone.

# Method 2b: Regex Split with Capturing Group ((\W+)) - keeps punctuation
print("\n2b. Regex split with capturing ((\\W+)):")
# By wrapping the separator pattern in parentheses, split() returns the separators too!
print(re.split(r'(\W+)', sentence))
# Better: We keep the punctuation, but it's treated as separate tokens.

# Method 3: NLTK
print("\n3. NLTK:")
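try:
    print(nltk.word_tokenize(sentence))
except LookupError:
    # Tokenizer data is missing. Recent NLTK releases look for 'punkt_tab',
    # older ones for 'punkt', so we request both.
    nltk.download('punkt')
    nltk.download('punkt_tab')
    print(nltk.word_tokenize(sentence))
# Advantage: "n't" is handled, punctuation is preserved as separate tokens.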

### 🚧 TODO:
**Conclusion:**
*   **Split** ... your observations here ...
*   **Regex** ... your observations here ...
*   **NLTK** (and spaCy) ... your observations here ...

### 3.1 Subword Tokenization (BPE)

Large Language Models use **Byte-Pair Encoding (BPE)** to fix the "Out Of Vocabulary" (OOV) problem.

**The Goal:** Algorithmically find the best subwords to represent a text.
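
To see why subwords help with OOV words, consider the sketch below. It is not BPE itself: the two vocabularies are hypothetical and hand-written, and `segment` is a simple greedy longest-match helper of our own that falls back to single characters when nothing matches.

```python
def segment(word, subwords):
    """Greedy longest-match segmentation; unknown single characters are kept as-is."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first, then shrink down to one character.
        for end in range(len(word), start, -1):
            if word[start:end] in subwords or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces

word_vocab = {"walk", "jump"}                       # hypothetical word-level vocabulary
subword_vocab = {"walk", "jump", "ing", "ed", "s"}  # hypothetical subword inventory

print("walking" in word_vocab)             # the whole word is out of vocabulary
print(segment("walking", subword_vocab))   # but it can be built from known pieces
print(segment("jumped", subword_vocab))
```

BPE differs in that it learns the subword inventory from corpus statistics instead of relying on a hand-written set.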

**Step 1: Preparation**
We start with a raw string, tokenize it into words, and then split each word into characters.
We append `</w>` to the end of each word to mark the boundary.

In [None]:
# We choose invariant roots (no spelling changes) to show clean merges.
# walk: 4 forms * 4 repetitions = 16 times. jump: 4 forms * 3 repetitions = 12 times.
# suffixes: ing, s, ed appear 7 times each (4 with walk + 3 with jump).
# Result: 'walk' and 'jump' should become tokens, then suffixes attach.
raw_text = "walk walking walks walked " * 4 + "jump jumping jumps jumped " * 3

def get_vocab(text):
    vocab = collections.defaultdict(int)
    # Use regex to split words and keep punctuation separate
    words = re.findall(r"\w+|[^\w\s]", text)
    for word in words:
        # Add spaces between chars and </w> at the end
        vocab[' '.join(list(word)) + ' </w>'] += 1
    return vocab

vocab = get_vocab(raw_text)
print("Initial Vocab:", vocab)

**Step 2: Find the Best Pair**

**🚧 TODO 1:** Write `get_stats(vocab)` to find one of the most frequent pairs of adjacent symbols.

In [None]:
def get_stats(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i], symbols[i+1]] += freq
    return pairs

pairs_step1 = get_stats(vocab)
best_pair = max(pairs_step1, key=pairs_step1.get)
print(f"One of the most frequent pairs: {best_pair} (Count: {pairs_step1[best_pair]})")

**Step 3: Merge and Iterate**

**🚧 TODO 2:** Write `merge_vocab(pair, v_in)` and run a loop for **N merges**.

In [None]:
def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

# 🚧 TODO: Run 12 merges to see meaningful subwords
current_vocab = vocab.copy()
print("Starting BPE Merges...\n")

steps_log = []

for i in range(12):
    pairs_iter = get_stats(current_vocab)
    if not pairs_iter:
        break
    ... your code here ...
    steps_log.append(f"Step {i+1}: Merged {best_iter}")
    print(f"Step {i+1}: Merged {best_iter}")

print("\nResulting Vocabulary:")
# Display nicely
vocab_table = [{"Tokenized Word": k, "Frequency": v} for k, v in current_vocab.items()]
print(pd.DataFrame(vocab_table))

**Step 4: Real World Example**

**🚧 TODO 3:** Train BPE on the 'woodworking' corpus!
1. Build initial vocab from `woodworking` text (first 100k chars).
2. Run 2000 merges.
3. Show some words that are ONE token (e.g. "wood") vs words that are SPLIT (e.g. "un-believ-able").

In [None]:
# 1. Get Text
train_text = df.loc["woodworking", "text"][:100000] # Small subset for speed

# 2. Build Vocab
wood_vocab = get_vocab(train_text)

# 3. Train (2000 merges)
for _i in range(2000):
    pairs_train = get_stats(wood_vocab)
    if not pairs_train: break
    best_train = ... your code here ...
    wood_vocab = merge_vocab(best_train, wood_vocab)

# 4. Analysis
print("Training done. Analyzing results...")
results = []

for word_seq in wood_vocab:
    wood_tokens = word_seq.split()
    original_word = "".join(wood_tokens).replace("</w>", "")
    # A word is a "Full Token" if it's a single subword (+ end marker)
    # i.e., either ['word</w>'] or ['X', '</w>']
    is_full_token = (len(wood_tokens) == 1) or (len(wood_tokens) == 2 and wood_tokens[-1] == '</w>')
    if is_full_token:
        results.append({"Word": original_word, "Type": "Full Token", "Sequence": str(wood_tokens)})
    else:
        results.append({"Word": original_word, "Type": "Split", "Sequence": str(wood_tokens)})

df_results = pd.DataFrame(results)

print(f"\nAnalysis ({len(results)} total words):")

print("\n--- Examples of Full Tokens (Learned) ---")
print(df_results[df_results["Type"] == "Full Token"].head(10))

print("\n--- Examples of Split Words (Morphology/Rare) ---")
print(df_results[df_results["Type"] == "Split"].head(10))

print(f"\n--- Vocabulary Statistics ---")
print(f"Total unique words in corpus: {len(results)}")
print(f"Full Tokens (single subword): {len(df_results[df_results['Type'] == 'Full Token'])}")
print(f"Split Words (multiple subwords): {len(df_results[df_results['Type'] == 'Split'])}")


## Part 4: Token Statistics (Zipf's Law)

In [None]:
# Use NLTK to tokenize 'woodworking'
wood_text = df.loc["woodworking", "text"]

# 🚧 TODO: Tokenize (you can limit the size)
tokens = nltk. ... your code here ...

# 🚧 TODO: Plot Zipf
counts = Counter(tokens)
freqs = ... your code here ...

plt.figure(figsize=(8, 5))
plt.loglog(range(1, len(freqs)+1), freqs, marker=".")
plt.title("Token Frequency (Zipf's Law) - Woodworking")
plt.xlabel("Rank")
plt.ylabel("Frequency")
plt.show()

## Part 5: Conclusion

You have successfully explored the atomic units of NLP!

**🎉 CONGRATULATIONS! 🎉**

You survived Unicode (barely), tamed the Tokenizers, and validated Zipf's Law without needing a lawyer!

Go treat yourself to a `\u1F355`!

One last TODO: How to find out what to get?

In [None]:
# This looks easy:
print("\u1F355")
# But wait: it doesn't show what to get.
# Why? Python has two types of Unicode escape sequences:

# 1. \uXXXX (exactly 4 hex digits) -> For the Basic Multilingual Plane (BMP) only.
#    "\u1F355" is read as "\u1F35" ('ἵ', GREEK SMALL LETTER IOTA WITH DASIA AND OXIA),
#    followed by the literal "5".
# 2. \UXXXXXXXX (exactly 8 hex digits) -> For any Unicode character (including emoji).
# U+1F355 lies outside the BMP, so it needs the 8-digit form.
print("\U0001F355")

# 3. By Name
print("\N{CUCUMBER}") # that's good for health but not the right name of the treat. 
# TODO: correct it to the right treat to get now. But how to find the name of \U0001F355?

# How to find the name?
treat_char = "\U0001F355"
print(f"Name of {treat_char} is: TODO: ... your code here ...