## Question 3: Natural Language Processing

In [2]:
corpuses = [
"nlp/01-introduction.csv",
"nlp/02-data-exploration.csv",
"nlp/03-decision-trees.csv",
"nlp/04-regression.csv",
"nlp/05-support-vector-machines.csv",
"nlp/06-neural-networks-1.csv",
"nlp/07-neural-networks-2.csv",
"nlp/08-evaluation.csv",
"nlp/09-clustering.csv",
"nlp/10-frequent-itemsets.csv",
]

In [3]:
# Load the dataset
import pandas as pd
from pathlib import Path

dfs = []

for path in corpuses:
    df = pd.read_csv(path)

    # extract lecture name, e.g. "01-introduction"
    lecture_name = Path(path).stem
    df["lecture"] = lecture_name

    dfs.append(df)

# concatenate into one large dataframe
df_all = pd.concat(dfs, ignore_index=True)

df_all.shape

(1793, 4)

In [4]:
df_all.head()

Unnamed: 0,start,end,text,lecture
0,2.35,31.36,"Okay, a very warm welcome to this first lectur...",01-introduction
1,31.36,59.41,than the normal lectures will will be let me j...,01-introduction
2,59.92,90.26,So let's just dive into it. So what are the ru...,01-introduction
3,90.35,120.32,You also see that there is a long list of topi...,01-introduction
4,120.32,149.54,And here you can see a list of topics. So it's...,01-introduction


### a)

In [5]:
from collections import Counter
import plotly.express as px
# concatenate all text
all_text = " ".join(df_all["text"].astype(str))

# split by spaces only (no preprocessing)
words = all_text.split(" ")

# count word frequencies
word_counts = Counter(words)

# convert to DataFrame
df_freq = (
    pd.DataFrame(word_counts.items(), columns=["word", "count"])
      .sort_values("count", ascending=False)
      .head(25)
)

# plot
fig = px.bar(
    df_freq,
    x="word",
    y="count",
    title="25 Most Frequent Words in IDS Lecture Transcripts",
    labels={"word": "Word", "count": "Frequency"}
)

fig.update_layout(xaxis_tickangle=-45)
fig.show()

**Problems with the basic word frequency approach**

1. Case sensitivity splits the counts of a single word.
   The histogram contains both “and” and “And” as separate entries, although they are the same word. Fix: convert the text to lowercase before counting.

2. High-frequency function words dominate the histogram.
   Stop words such as “the”, “is”, “to”, and “a” are the most frequent entries but carry little meaning, so they crowd out more informative terms. Fix: remove stop words, or use a weighting scheme such as TF-IDF that down-weights terms occurring in most documents.
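
Both fixes can be tried on a small example. The following is a minimal sketch; the sample sentence and the two-word stop-word set are made up for illustration and are not taken from the transcripts.

```python
from collections import Counter

# Toy input; the sentence and the stop-word set are illustrative only.
sample = "And the data and the model and The Data"
tiny_stop_words = {"and", "the"}

# Raw counting, as in the cell above: case variants are separate keys.
raw_counts = Counter(sample.split(" "))

# Fix 1: lowercase before counting so "And" and "and" collapse into one entry.
lower_counts = Counter(sample.lower().split(" "))

# Fix 2: drop stop words so content-bearing words surface.
content_counts = Counter(
    w for w in sample.lower().split(" ") if w not in tiny_stop_words
)

print(raw_counts["and"], raw_counts["And"])
print(lower_counts["and"])
print(content_counts.most_common())
```

After both steps only `data` and `model` remain, which is the kind of content-bearing vocabulary the histogram is meant to show.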

### b)

In [6]:
import nltk
from nltk.corpus import stopwords
import string
from nltk.tokenize import word_tokenize
# download required nltk resources
nltk.download("punkt_tab")
nltk.download("stopwords")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [7]:
# Preprocessing
stop_words = set(stopwords.words("english"))

def preprocess_text(text):
    # lowercase
    text = text.lower()

    # remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))

    # tokenize using nltk punkt tokenizer
    tokens = word_tokenize(text)

    # remove stopwords
    tokens = [t for t in tokens if t not in stop_words]

    return tokens
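
The order of the steps matters: punctuation is removed before tokenization, so contractions lose their apostrophe and become a single token (this is why `lets` shows up in the tokenized output for "let's"). A small sketch of the effect, using a made-up sentence and `str.split` as a stand-in for `word_tokenize` so that it runs without NLTK data:

```python
import string

# Same kind of translation table as in preprocess_text:
# it deletes every ASCII punctuation character.
punct_table = str.maketrans("", "", string.punctuation)

sample = "So let's dive in; it's a data-driven course."
cleaned = sample.lower().translate(punct_table)

# str.split() is only a stand-in for word_tokenize (an assumption for brevity).
tokens = cleaned.split()
print(tokens)
```

Hyphenated words are fused as well (`datadriven`). Typographic apostrophes such as ’ are not part of `string.punctuation`, so they would survive this step.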

In [8]:
# Add tokenized_text column to the dataframe
df_all["tokenized_text"] = df_all["text"].astype(str).apply(preprocess_text)

df_all.head()

Unnamed: 0,start,end,text,lecture,tokenized_text
0,2.35,31.36,"Okay, a very warm welcome to this first lectur...",01-introduction,"[okay, warm, welcome, first, lecture, introduc..."
1,31.36,59.41,than the normal lectures will will be let me j...,01-introduction,"[normal, lectures, let, get, started, mentione..."
2,59.92,90.26,So let's just dive into it. So what are the ru...,01-introduction,"[lets, dive, rules, game, right, people, know,..."
3,90.35,120.32,You also see that there is a long list of topi...,01-introduction,"[also, see, long, list, topics, course, data, ..."
4,120.32,149.54,And here you can see a list of topics. So it's...,01-introduction,"[see, list, topics, broad, course, opportunity..."


In [9]:
df_intro = df_all[df_all["lecture"] == "01-introduction"]
all_tokens_intro = [
    token
    for tokens in df_intro["tokenized_text"]
    for token in tokens
]

token_counts = Counter(all_tokens_intro)

top_25 = (
    pd.DataFrame(token_counts.most_common(25),
                 columns=["token", "count"])
)
fig = px.bar(
    top_25,
    x="token",
    y="count",
    title="25 Most Frequent Tokens in 01-Introduction (After Preprocessing)",
    labels={"token": "Token", "count": "Frequency"}
)

fig.update_layout(xaxis_tickangle=-45)
fig.show()

### c)

In [10]:
target_tokens = [
    "data", "decision", "predict",
    "derivative", "network", "easy", "database"
]

rows = []

# Count token frequencies per lecture
for lecture, df_lecture in df_all.groupby("lecture"):
    # flatten tokens for this lecture
    tokens = [
        token
        for tokens_list in df_lecture["tokenized_text"]
        for token in tokens_list
    ]

    counts = Counter(tokens)

    for token in target_tokens:
        rows.append({
            "lecture": lecture,
            "token": token,
            "count": counts.get(token, 0)
        })

df_token_counts = pd.DataFrame(rows)
df_token_counts.head()

Unnamed: 0,lecture,token,count
0,01-introduction,data,126
1,01-introduction,decision,3
2,01-introduction,predict,9
3,01-introduction,derivative,0
4,01-introduction,network,9


In [11]:
fig = px.bar(
    df_token_counts,
    x="token",
    y="count",
    color="lecture",
    title="Frequency of Selected Tokens by Lecture",
    labels={"token": "Token", "count": "Frequency"},
)

fig.update_layout(barmode="stack")
fig.show()

The token “data” appears frequently across almost all lectures, indicating that data-related concepts are central throughout the course.
Tokens such as “decision” and “predict” are mainly concentrated in lectures on decision trees and regression, while “derivative” and “network” occur predominantly in the neural network lectures.
More specific terms like “database” appear almost exclusively in the frequent itemsets lecture and are rare elsewhere.
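
Statements such as "mainly concentrated in" are easier to check when each count is expressed as a share of the token's total over all lectures. A minimal sketch, assuming rows shaped like the `rows` list built above; `example_rows` and its counts are made-up placeholders, not the real values.

```python
from collections import defaultdict

# Hypothetical rows in the same shape as `rows` above; the counts are placeholders.
example_rows = [
    {"lecture": "03-decision-trees", "token": "decision", "count": 30},
    {"lecture": "04-regression", "token": "decision", "count": 10},
    {"lecture": "03-decision-trees", "token": "derivative", "count": 0},
    {"lecture": "04-regression", "token": "derivative", "count": 5},
]

# Total occurrences of each token over all lectures.
totals = defaultdict(int)
for r in example_rows:
    totals[r["token"]] += r["count"]

# Share of each token's occurrences that falls into each lecture.
shares = {
    (r["token"], r["lecture"]): r["count"] / totals[r["token"]]
    for r in example_rows
    if totals[r["token"]] > 0
}

for (token, lecture), share in sorted(shares.items()):
    print(f"{token:<12}{lecture:<20}{share:.0%}")
```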

### d)

In [12]:
from nltk.util import ngrams
from nltk.probability import ConditionalFreqDist
import random

In [13]:
def preprocess_for_lm(text: str):
    # lowercase
    text = text.lower()
    # remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # tokenize with nltk punkt_tab tokenizer
    return word_tokenize(text)

In [14]:

def build_ngram_cfd(df_all, n: int, text_col="text"):
    """
    Builds a ConditionalFreqDist mapping:
      context (tuple of length n-1) -> FreqDist(next_token)
    Treats each row/segment independently and pads with <s>, </s>.
    """
    cfd = ConditionalFreqDist()
    pad_left = ["<s>"] * (n - 1)

    for segment in df_all[text_col].astype(str):
        tokens = preprocess_for_lm(segment)
        padded = pad_left + tokens + ["</s>"]

        # for each n-gram, add count of next_token given context
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i : i + (n - 1)])
            next_tok = padded[i + (n - 1)]
            cfd[context][next_tok] += 1

    return cfd
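
To make the padding concrete, the sketch below builds the same context-to-next-token counts on two made-up segments. It uses `defaultdict(Counter)` as a standard-library stand-in for `ConditionalFreqDist` (an assumption so that the block runs without NLTK); `toy_ngram_counts` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict

def toy_ngram_counts(segments, n):
    """Count next tokens per (n-1)-token context; each segment is padded with <s> and </s>."""
    counts = defaultdict(Counter)
    for segment in segments:
        padded = ["<s>"] * (n - 1) + segment.split() + ["</s>"]
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i : i + n - 1])
            counts[context][padded[i + n - 1]] += 1
    return counts

# Made-up segments, already lowercased and free of punctuation.
segments = ["data science is fun", "data science is hard"]
trigram_counts = toy_ngram_counts(segments, n=3)

print(dict(trigram_counts[("<s>", "<s>")]))      # tokens that start a segment
print(dict(trigram_counts[("science", "is")]))   # two possible continuations
print(dict(trigram_counts[("is", "fun")]))       # end-of-segment marker
```

Contexts that contain `<s>` are only ever created at the start of a segment, so a padded seed context can only match text that opens a segment.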

In [15]:
def predict_next_token(cfd: ConditionalFreqDist, context: tuple):
    """
    Sample next token from cfd[context], using:
      - candidates sorted lexicographically
      - weights = counts
      - random.seed(3213) before each sampling
    Returns None if unseen context.
    """
    if context not in cfd:
        return None

    fd = cfd[context]
    if fd.N() == 0:
        return None

    candidates = sorted(fd.keys())
    weights = [fd[w] for w in candidates]

    random.seed(3213)
    return random.choices(candidates, weights=weights, k=1)[0]

In [16]:
def generate_text(cfd: ConditionalFreqDist, n: int, seed_text: str, max_new_tokens: int = 30):
    """
    Generate at most max_new_tokens tokens (not counting the seed tokens).
    - Applies same preprocessing to seed text as dataset.
    - Pads with <s> if seed shorter than n-1.
    - Stops if predicts </s> or context unseen.
    Returns list of tokens: seed_tokens + generated_tokens (excluding </s>).
    """
    seed_tokens = preprocess_for_lm(seed_text)

    # initial context = last n-1 tokens, left-padded with <s> if needed
    needed = (n - 1) - len(seed_tokens)
    if needed > 0:
        context_tokens = ["<s>"] * needed + seed_tokens
    else:
        context_tokens = seed_tokens[-(n - 1):]

    generated = []
    context = tuple(context_tokens)

    for _ in range(max_new_tokens):
        next_tok = predict_next_token(cfd, context)
        if next_tok is None or next_tok == "</s>":
            break
        generated.append(next_tok)
        # slide the context window: append the new token, drop the oldest one
        context = (context + (next_tok,))[1:]

    return seed_tokens + generated

In [17]:
seed = "introduction to data"

for n in [2, 3, 4, 5]:
    cfd = build_ngram_cfd(df_all, n=n, text_col="text")
    out_tokens = generate_text(cfd, n=n, seed_text=seed, max_new_tokens=30)
    print(f"n={n}:")
    print(" ".join(out_tokens))
    print()

n=2:
introduction to data set on the or regression model prediction model prediction model prediction model prediction model prediction model prediction model prediction model prediction model prediction model prediction model prediction model prediction model

n=3:
introduction to data science master and its not shown so again this other one is not possible and so on the part that is possible for specific tasks so the people that invented

n=4:
introduction to data science so this lecture will be a lecture on regression they will be provided to you so the question earlier was like how to interpret these factors the different rate

n=5:
introduction to data



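For n = 2, the generated text quickly falls into a loop (repeating “model prediction”). The model conditions on only one previous word, and because the random seed is reset before every draw, each context always produces the same next token. Once a word recurs, the sequence therefore repeats until the 30-token limit is reached.
With n = 3 and n = 4, the text becomes more coherent and lecture-like, with longer meaningful phrases and more topic consistency, because more context is taken into account when predicting the next word.
For n = 5, no token is generated at all. The seed has only three tokens, so the four-token context is left-padded to `("<s>", "introduction", "to", "data")`, which can only match a segment that begins with exactly these words. Evidently no segment does, so the context is unseen and generation stops immediately. More generally, longer contexts are sparser in the dataset, so unseen contexts become more likely as n grows.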
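
The effect of re-seeding can be reproduced on a made-up bigram table. Since `random.seed(3213)` runs before every draw, the sampled token depends only on the context, so generation becomes a deterministic walk that cycles as soon as a word repeats. The tokens and counts in `toy_bigrams` are illustrative and not taken from the corpus.

```python
import random

# Hypothetical bigram counts: context word -> {next word: count}.
toy_bigrams = {
    "model": {"prediction": 3, "training": 2},
    "prediction": {"model": 4, "error": 1},
    "training": {"data": 1},
    "error": {"model": 1},
    "data": {"model": 1},
}

def sample_next(word):
    """Weighted sampling with the seed reset before every draw, as in predict_next_token."""
    candidates = sorted(toy_bigrams[word])
    weights = [toy_bigrams[word][w] for w in candidates]
    random.seed(3213)
    return random.choices(candidates, weights=weights, k=1)[0]

# Re-seeding makes each draw a pure function of the context.
draws = {sample_next("model") for _ in range(100)}
print(len(draws))  # 1

# Follow the walk until a word repeats; from there on the output cycles.
word, path = "model", ["model"]
while True:
    word = sample_next(word)
    if word in path:
        break
    path.append(word)
print(path, "-> back to", word)
```

Seeding once before generation instead of before each draw would restore variation between draws, but that would deviate from the sampling rule stated in the docstring of `predict_next_token`.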

### e)

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import SnowballStemmer
import numpy as np

In [19]:
STOP_WORDS = set(stopwords.words("english"))
STEMMER = SnowballStemmer("english")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

In [20]:
def analyzer_english_stem_stop(text: str):
    # lowercase
    text = text.lower()
    # remove punctuation
    text = text.translate(PUNCT_TABLE)
    # tokenize (punkt_tab)
    tokens = word_tokenize(text)
    # stopword removal + stemming
    out = []
    for t in tokens:
        if t in STOP_WORDS:
            continue
        # optional tiny cleanup: ignore empty tokens
        if not t.strip():
            continue
        out.append(STEMMER.stem(t))
    return out

In [21]:
def cosine_sim_sparse(X, q_vec):
    """
    X: (n_docs, n_terms) sparse TF-IDF matrix
    q_vec: (1, n_terms) sparse TF-IDF vector
    Returns: (n_docs,) numpy array of cosine similarities
    """
    # dot products
    dots = (X @ q_vec.T).toarray().ravel()

    # norms
    X_norm = np.sqrt(X.multiply(X).sum(axis=1)).A1
    q_norm = np.sqrt(q_vec.multiply(q_vec).sum())

    denom = (X_norm * q_norm)
    denom[denom == 0] = 1e-12
    return dots / denom
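
For reference, the quantity computed here is the dot product of the two vectors divided by the product of their Euclidean norms. A dependency-free sketch of the same computation on plain Python lists, including the zero-vector guard; the vectors are made up and `cosine` is a hypothetical helper, not part of the notebook.

```python
import math

def cosine(x, q, eps=1e-12):
    """Cosine similarity of two equal-length vectors; 0.0 if either one is all zeros."""
    dot = sum(a * b for a, b in zip(x, q))
    denom = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in q))
    # Same guard as above: a zero denominator is replaced by a tiny constant.
    # The dot product is 0 in that case, so the result is 0.0.
    return dot / (denom if denom != 0 else eps)

docs = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
query = [2.0, 0.0, 4.0]

sims = [cosine(d, query) for d in docs]
print(sims)
```

The first document points in the same direction as the query, the second shares no terms with it, and the third is empty. Note that `TfidfVectorizer` L2-normalizes its rows by default (`norm="l2"`), so with default settings the dot product alone already gives the same ranking; the explicit normalization acts as a safeguard.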

In [22]:
# lecture-level corpus (Level 1) + retrieval
def build_lecture_corpus(df_all):
    # aggregate all text per lecture
    lec_docs = (
        df_all.groupby("lecture")["text"]
        .apply(lambda s: " ".join(s.astype(str)))
        .reset_index()
        .rename(columns={"text": "doc"})
    )
    return lec_docs  # columns: lecture, doc

def retrieve_top_k_lectures(df_all, query: str, k: int = 2):
    lec_docs = build_lecture_corpus(df_all)

    vec = TfidfVectorizer(analyzer=analyzer_english_stem_stop)
    X = vec.fit_transform(lec_docs["doc"].values)         # (n_lectures, vocab)
    q = vec.transform([query])                           # (1, vocab)

    sims = cosine_sim_sparse(X, q)
    top_idx = np.argsort(-sims)[:k]

    results = []
    for idx in top_idx:
        results.append({
            "lecture": lec_docs.loc[idx, "lecture"],
            "score": float(sims[idx]),
        })
    return results

In [23]:
# Segment-level corpus + retrieval (Level 2)
def retrieve_top_m_segments_in_lecture(df_all, lecture: str, query: str, m: int = 2):
    df_lec = df_all[df_all["lecture"] == lecture].copy()

    # fit a new vectorizer on this lecture's segments
    vec = TfidfVectorizer(analyzer=analyzer_english_stem_stop)
    X = vec.fit_transform(df_lec["text"].astype(str).values)  # (n_segments, vocab)
    q = vec.transform([query])

    sims = cosine_sim_sparse(X, q)
    top_idx = np.argsort(-sims)[:m]

    segs = []
    for idx in top_idx:
        row = df_lec.iloc[idx]
        segs.append({
            "start": float(row["start"]),
            "end": float(row["end"]),
            "score": float(sims[idx]),
            "text": str(row["text"]),
        })
    return segs

In [24]:
# Hierarchical retrieval wrapper: returns k lectures, each with m timestamps
def hierarchical_retrieve(df_all, query: str, k: int = 2, m: int = 2):
    top_lectures = retrieve_top_k_lectures(df_all, query=query, k=k)

    nested = []
    for lec in top_lectures:
        lecture_name = lec["lecture"]
        segs = retrieve_top_m_segments_in_lecture(df_all, lecture=lecture_name, query=query, m=m)
        nested.append({
            "lecture": lecture_name,
            "lecture_score": lec["score"],
            "segments": segs
        })
    return nested

In [25]:
def print_results(query, results):
    print(f"\nQuery: {query}\n" + "-"*80)
    for i, r in enumerate(results, 1):
        print(f"{i}) Lecture: {r['lecture']}   | lecture_score={r['lecture_score']:.4f}")
        for j, s in enumerate(r["segments"], 1):
            print(f"   {j}. [{s['start']:.2f}, {s['end']:.2f}]  seg_score={s['score']:.4f}")
            print(f"      text: {s['text']}")
        print()


k, m = 2, 2

queries = [
    "gradient descent approach",
    "beer and diapers"
]

for q in queries:
    res = hierarchical_retrieve(df_all, query=q, k=k, m=m)
    print_results(q, res)


Query: gradient descent approach
--------------------------------------------------------------------------------
1) Lecture: 04-regression   | lecture_score=0.0756
   1. [3324.14, 3353.17]  seg_score=0.2849
      text: steep curve, which kind of, okay, here we predict an error, and here we predict good, but in between, the decision boundary is kind of, maybe if you define your function correctly, it's at least defined, but it's not a smooth surface. So this also means that we can't do the approach from before because we don't have these properties. So here we have the decision, but you can imagine that if you plug this into the error function, you also get a non-continuous surface, so you can't do the gradient descent approach without any modifications. And what you need to do,
   2. [1907.18, 1935.92]  seg_score=0.2806
      text: once you fix like the w1 parameter that there would be at least one value for w0 that kind of is the best performing. But I agree it kind of looks very fl

**Query: gradient descent approach**

The retrieval ranks 04-regression highest, which is a sensible result: its top segment explicitly discusses why plain gradient descent cannot be applied to a non-smooth, non-continuous error surface and leads into how the approach has to be modified. The second-ranked lecture (06-neural-networks-1) is also a good match: its top segment connects gradient descent to having a “nice” derivative (the sigmoid derivative) and to weight updates, which fits the optimization context.

**Query: beer and diapers**

The top lecture (10-frequent-itemsets) is exactly the expected match: both top segments directly mention the classic “beer and diapers” co-occurrence example and relate it to itemsets and frequency monotonicity. The second match (01-introduction) is much weaker and more general: it talks about pattern mining and “people who buy beer also buy chips”, which explains its very low lecture score and why it ranks far below the frequent-itemsets lecture.