# Lab 1 - Lesson 1: Dataset exploration and Zipf

**What we use**
- `datasets` to download and slice a small dataset subset.
- `Counter` to count tokens and build a frequency dictionary.
- `matplotlib` to plot a Zipf curve on log-log axes.

**Goals**
- Load a small dataset subset with `datasets`.
- Compute token frequencies and plot Zipf on log-log axes.
- Inspect samples and dataset fields.

**Notes to keep in mind**
- More word types, shorter lemmas, and a larger highest rank often indicate a more morphologically rich language.
- A shallower Zipf slope often indicates a more morphologically rich language.
- The straightness of a Zipf plot is not related to morphology.
- If the same surface form is counted under more than one lemma, that form is ambiguous.
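
The last note can be made concrete with a small sketch. The `(surface, lemma)` pairs below are hand-written for illustration and do not come from any dataset used in this lab, and the names `annotated` and `lemmas_by_surface` are hypothetical. The lab itself does not lemmatize, so this only shows the idea.

```python
from collections import defaultdict

# Hypothetical hand-annotated (surface form, lemma) pairs, for illustration only.
annotated = [
    ("saw", "see"),    # "I saw it"
    ("saw", "saw"),    # "a sharp saw"
    ("left", "leave"), # "she left"
    ("left", "left"),  # "turn left"
    ("dogs", "dog"),
    ("dogs", "dog"),
]

# Collect the set of lemmas observed for each surface form.
lemmas_by_surface = defaultdict(set)
for surface, lemma in annotated:
    lemmas_by_surface[surface].add(lemma)

# A surface form is ambiguous if it was seen with more than one lemma.
ambiguous = sorted(w for w, lemmas in lemmas_by_surface.items() if len(lemmas) > 1)
print(ambiguous)
```

Here `dogs` always maps to one lemma, so only `saw` and `left` are reported.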


## Step 1: Load and inspect the dataset
We start by loading a dataset from the Hugging Face Hub and selecting a small subset
so the notebook runs quickly.
The Hugging Face Hub is an online repository of datasets and models; `datasets.load_dataset` downloads from the Hub when needed.


In [None]:
import random
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from datasets import load_dataset

SEED = 42
random.seed(SEED)
np.random.seed(SEED)


In [None]:
# Dataset choice
# Option A: name = 'ag_news', text_field = 'text'
# Option B: name = 'wikitext', config = 'wikitext-2-raw-v1', text_field = 'text'
name = 'ag_news'
config = None
text_field = 'text'

if config:
    ds = load_dataset(name, config, split='train')
else:
    ds = load_dataset(name, split='train')

subset = ds.select(range(2000))
print(subset.features)
print(subset[0])


### Practice A: Create two subsets
Create two different slices of the dataset and compare their sizes.


In [None]:
# TODO: Create two non-overlapping subsets from ds
# Hint: use ds.select(range(start, end))
# Hint: keep the subsets the same size for comparison
# TODO: print sizes with len(subset_a) and len(subset_b)

# Write your code below


In [None]:
for i in range(3):
    print('---')
    print(subset[i][text_field])


## Step 2: Tokenize and count
We use a simple whitespace tokenizer to keep the logic clear.
Then we count tokens and types with `Counter`.
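
To illustrate the difference between tokens and types, here is a minimal sketch on a made-up sentence. The `toy_*` names are hypothetical and not part of the lab code. It also shows one limitation of whitespace splitting: punctuation stays attached to words.

```python
from collections import Counter

# Made-up sentence, for illustration only.
toy_text = "The cat sat on the mat and the dog sat too"
toy_tokens = toy_text.lower().split()
toy_counts = Counter(toy_tokens)

print("tokens:", len(toy_tokens))   # every occurrence counts
print("types:", len(toy_counts))    # distinct tokens only
print(toy_counts.most_common(2))

# Whitespace splitting keeps punctuation, so these become two different types.
toy_punct = "Mat, mat.".lower().split()
print(toy_punct)
```

The sentence has 11 tokens but only 8 types, because `the` occurs three times and `sat` twice.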


In [None]:
def tokenize_whitespace(text):
    return text.lower().split()

def get_token_counts(texts):
    # Accumulate whitespace-token counts across all texts
    counts = Counter()
    for t in texts:
        counts.update(tokenize_whitespace(t))
    return counts


In [None]:
texts = [ex[text_field] for ex in subset if ex[text_field].strip()]
counts = get_token_counts(texts)
total_tokens = sum(counts.values())
total_types = len(counts)

print('tokens:', total_tokens)
print('types:', total_types)
print('top-20:', counts.most_common(20))

assert sum(counts.values()) == total_tokens


### Practice B: Extra frequency stats
Compute type/token ratio and average whitespace length.


**Type/token ratio** is the number of unique word types divided by the total number of tokens.
It is a rough proxy for lexical diversity in a sample.

**Average whitespace length** is the average number of whitespace-separated tokens per text (one dataset example).
It is a simple length measure before any advanced tokenization.
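
As a worked example of both definitions, the sketch below computes them by hand on three made-up strings. The `toy_*` names are hypothetical; the exercise in the next cell should use the real `texts` and `counts`.

```python
# Three made-up texts, for illustration only.
toy_texts = ["a b a", "c d e f", "a"]
toy_tokenized = [t.lower().split() for t in toy_texts]

# Flatten to one token list: 3 + 4 + 1 = 8 tokens.
toy_all_tokens = [tok for toks in toy_tokenized for tok in toks]
toy_n_tokens = len(toy_all_tokens)

# Distinct tokens: a, b, c, d, e, f = 6 types.
toy_n_types = len(set(toy_all_tokens))

toy_ttr = toy_n_types / toy_n_tokens          # 6 / 8
toy_avg_len = toy_n_tokens / len(toy_texts)   # 8 / 3

print("type/token ratio:", toy_ttr)
print("average whitespace length:", toy_avg_len)
```

Type/token ratio tends to fall as a sample grows, because frequent words keep repeating while new types appear more slowly. Compare it only across samples of similar size.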


In [None]:
def avg_whitespace_len(texts):
    # TODO: compute the average number of whitespace tokens per text
    # Hint: lengths = [len(tokenize_whitespace(t)) for t in texts]
    # Hint: avoid division by zero with max(len(lengths), 1)
    raise NotImplementedError

# TODO: compute type_token_ratio and avg_len
# Hint: type_token_ratio = total_types / total_tokens
# TODO: print the results


## Step 3: Plot Zipf
We first plot on linear axes to see why the long tail is hard to read,
then switch to log-log axes for a clearer Zipf pattern.


### Step 3a: Linear scale (less informative)
On linear axes, a few very frequent tokens dominate the plot and the long tail is squeezed against the axes.


In [None]:
freqs = sorted(counts.values(), reverse=True)
ranks = range(1, len(freqs) + 1)

# Linear-scale version (for contrast)
plt.figure(figsize=(6, 4))
plt.plot(ranks, freqs)
plt.xlabel('rank')
plt.ylabel('frequency')
plt.title('Zipf plot (linear scale)')
plt.show()


### Step 3b: Log-log scale (more informative)
Log-log axes compress the long tail and make Zipf behavior visible.
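
Log-log axes help because, under an idealized Zipf's law, frequency is proportional to $1 / r^s$ for rank $r$. Then $\log f = \log C - s \log r$, which is a straight line with slope $-s$. The sketch below checks this on synthetic frequencies with $s = 1$, not on the lab data. The names `toy_ranks`, `toy_freqs`, and the constant `10000.0` are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic, perfectly Zipfian frequencies: f(r) = C / r with C = 10000.
toy_ranks = np.arange(1, 1001)
toy_freqs = 10000.0 / toy_ranks

# Fit a straight line to log(frequency) as a function of log(rank).
slope, intercept = np.polyfit(np.log(toy_ranks), np.log(toy_freqs), 1)
print("estimated slope:", round(float(slope), 3))
```

On real counts the curve usually bends at the very top ranks and in the tail of rare words, so a fitted slope depends on which rank range you include. Treat it as a rough summary, not an exact measurement.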


In [None]:
freqs = sorted(counts.values(), reverse=True)
ranks = range(1, len(freqs) + 1)

plt.figure(figsize=(6, 4))
plt.loglog(ranks, freqs)
plt.xlabel('rank')
plt.ylabel('frequency')
plt.title('Zipf plot (log-log scale)')
plt.show()


### Practice C: Compare Zipf plots
Plot Zipf curves for two different subsets on the same axes.


In [None]:
# TODO: build texts_a and texts_b from subset_a and subset_b
# Hint: filter out empty strings with .strip()
# TODO: compute counts and frequency lists for both subsets
# Hint: counts = get_token_counts(texts)
# TODO: plot both Zipf curves on the same axes with labels
# Hint: use plt.loglog for each and plt.legend()


**Homework**
- Pick TWO different datasets from the Hugging Face Hub that we did **not** use in the lesson.
  Suggested options: `imdb`, `yelp_polarity`, `dbpedia_14`, `rotten_tomatoes`, `trec`.
- For each dataset: create Zipf plots for two non-overlapping subsets (keep sizes equal).
- For each dataset: print top-10 tokens for each subset and note 2 differences.
- For each dataset: compute type/token ratio for each subset and compare.
