# Week 01

Content:
1. Softmax
2. Byte-Pair
3. Tokenizer
4. NLP-Pipeline

## Softmax

Remark: you earn the bonus for this exercise if you answer at least 3 of the 4 questions.

Here we answer the question of why softmax is called "softmax" and investigate its characteristics. The $i \text{-th}$ component of the softmax is given by:

$$f(x_i) := \text{softmax}(x_i) = \frac{\exp x_i}{ \sum_{j = 1}^k \exp x_j}$$


And thus the vector is $f(\mathbf x) = [f(x_1), \dots, f(x_k)]$ where $\mathbf x = (x_1, \dots, x_k)$.
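
The definition can be checked numerically with a few lines of Python. This is only a minimal sketch using the standard library; the function name `softmax` and the input values are chosen for illustration. Subtracting the maximum before exponentiating is a common stability trick that is not part of the exercise: the factor $\exp(-\max_j x_j)$ cancels between numerator and denominator, so the result is unchanged.

```python
import math

def softmax(xs):
    # shift by the maximum so that exp() never overflows; the result is unchanged
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)
print(sum(probs))
```

The components are all positive, sum to one up to floating-point rounding, and keep the ordering of the inputs, which is what questions 1 and 3 below ask you to prove in general.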


1. show that $f(\mathbf x)$ can be interpreted as a probability distribution. To do so, show that $\sum_{i=1}^k f(x_i) = 1$ and $f(x_i) > 0$ for all $i$ and all $\mathbf x \in \mathbb{R}^k$

> ![](aufgabe_1-1.png)

2. show that softmax is $C^\infty$, i.e. that you can calculate the derivative of $f(x_i)$ as often as you want with respect to $x_i$. You can use the fact that $\exp x$ is smooth (smooth means $C ^ \infty$)

> ![](aufgabe_1-2.png)

3. show that if $x_i \lt x_j$ then $f(x_i) \lt f(x_j)$. Does the converse hold as well?

> ![](aufgabe_1-3.png)

4. what is the limit of $f(x_i)$ for $x_i \to + \infty$? Same question for $x_i \to - \infty$.

> ![](aufgabe_1-4.png)
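
The limits in question 4 can be sanity-checked numerically. The sketch below assumes that the other components stay fixed while $x_i$ varies; the helper `softmax_component` and the fixed values in `others` are hypothetical and only serve the illustration.

```python
import math

def softmax_component(i, xs):
    # i-th softmax component, computed with the usual max-shift for stability
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    return exps[i] / sum(exps)

others = [0.5, -1.0]  # components that are held fixed
for x_i in (-50.0, 0.0, 50.0):
    print(x_i, softmax_component(0, [x_i] + others))
```

For very negative $x_i$ the component approaches 0 and for very large $x_i$ it approaches 1, which is the sense in which softmax is a smooth version of picking the maximum.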


## Byte-pair encoding

Implement byte-pair encoding and reproduce the example in chapter 2.5.2 of [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/):

!["Byte Pair Encoding"](byte_pair_encoding.png)

In [29]:
import re
from collections import Counter

def getTokenDistribution(corpus):

    # count how often each pair of adjacent tokens occurs in the corpus
    pairs = Counter()
    for word in corpus:
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += 1

    # return a Counter with the pair as key and the frequency as value
    return pairs

def mergeVocabulary(pair, corpus):

    newCorpus = []

    # match the pair only at whole-token boundaries; a plain str.replace
    # could also match across tokens, e.g. ('e', 's') inside 'ne st'
    bigram = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    merged = ''.join(pair)
    for word in corpus:
        # make new token by concatenating and update the corpus
        newWord = bigram.sub(lambda _: merged, word)
        newCorpus.append(newWord)

    return newCorpus

def bytePairEncoding(corpus, numberOfMerges):
    
    # all unique characters in the corpus
    # initial set of tokens is characters
    vocabulary = set(char for word in corpus for char in word)
    corpus = [' '.join(word) for word in corpus]

    # merge tokens k times
    for _ in range(numberOfMerges):
        pairs = getTokenDistribution(corpus)
        if not pairs:
            break
        # most frequent pair of adjacent tokens in the corpus
        mostFrequentPair = max(pairs, key=pairs.get)

        # make new token by concatenating and update the corpus
        corpus = mergeVocabulary(mostFrequentPair, corpus)

        # update the vocabulary
        vocabulary.add(''.join(mostFrequentPair))

    return vocabulary

corpus = ["low", "lower", "newest", "widest"]
vocab = bytePairEncoding(corpus, numberOfMerges=10)
print("Final Vocabulary:", vocab)

Final Vocabulary: {'l', 'wi', 'new', 'd', 'n', 'r', 'lowe', 'lo', 's', 'w', 'i', 'e', 't', 'low', 'est', 'ne', 'es', 'newest', 'o', 'lower'}
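
The run above uses an unweighted four-word corpus without an end-of-word marker, so it does not reproduce the textbook example. The following sketch is one possible variant that weights every pair by its word frequency and appends an end-of-word symbol `_`. The word counts in `toy_counts` are an assumption taken from memory of the textbook's toy corpus and should be checked against the figure above; the function names are hypothetical.

```python
from collections import Counter

def count_pairs(word_freqs):
    # weight every adjacent symbol pair by the frequency of the word it occurs in
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs

def merge_pair(pair, word_freqs):
    # rewrite every word, replacing each occurrence of `pair` by the merged symbol
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(word_counts, num_merges):
    # start from single characters plus the end-of-word marker "_"
    word_freqs = {tuple(word) + ("_",): n for word, n in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: the first pair seen wins
        word_freqs = merge_pair(best, word_freqs)
        merges.append(best)
    return merges

# assumed toy corpus (word: count); verify against the textbook figure
toy_counts = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
merges = learn_bpe(toy_counts, num_merges=8)
print(merges)
```

Working on tuples of symbols instead of space-joined strings avoids accidental matches across token boundaries. Note that the order of merges with equal counts depends on the tie-breaking rule, here insertion order.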


## Tokenizer

Go through the HuggingFace tutorial on [tokenizers](https://huggingface.co/learn/nlp-course/chapter2/4) and answer:

- why do we need tokenizers?

> Tokenizers are necessary because LLMs cannot work with words directly; they need a numerical representation of the text.

- what is the difference between a character-based, word-based and a subword-based tokenizer? What are the advantages and disadvantages of each?

> ### Word-based tokenizer
> * Tokens are created at word boundaries.
> * Words that are not in the vocabulary are marked with an unknown token (e.g. `<UNK>`).
> * Each word gets its own ID.
> 
> #### Advantages
> * Simple
> 
> #### Disadvantages
> * Related words get completely different IDs
> * Large vocabulary
> * All unknown tokens look the same to the model, even though the underlying text can have different meanings
>
> ### Character-based tokenizer
> * The text is split into individual characters --> a token is a single character
> 
> #### Advantages
> * Smaller vocabulary
> * Fewer unknown tokens
> 
> #### Disadvantages
> * Large number of tokens --> the model has to process more tokens
> * A single token carries less meaning than with a word-based tokenizer
>
> ### Subword-based tokenizer
> * Combination of word-based and character-based tokenizers
> * Frequent words should not be split into subwords (dog --> dog)
> * Rare words should be split into meaningful subwords (dogs --> dog, s)
> 
> #### Advantages
> * Balance between vocabulary size and expressiveness
> * Fewer unknown tokens, because unknown words can be decomposed into known subwords
> * Related words often share common subwords
> 
> #### Disadvantages
> * More complex tokenization
> * Not all subwords correspond to meaningful linguistic units
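
The three strategies can be compared on a tiny example. The sketch below is purely illustrative: the vocabulary `toy_vocab` and the greedy longest-match helper `subword_tokenize` are hypothetical and do not reflect how the HuggingFace tokenizers are implemented.

```python
text = "dogs chase dog"

# word-based: split on whitespace, one token per word
word_tokens = text.split()

# character-based: one token per character (spaces dropped here for brevity)
char_tokens = [ch for ch in text if ch != " "]

# subword-based: greedy longest match against a hypothetical vocabulary
toy_vocab = {"dog", "chase", "s"}

def subword_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            # no known piece starts here: emit an unknown token for this character
            tokens.append("<UNK>")
            start += 1
    return tokens

subword_tokens = [t for w in text.split() for t in subword_tokenize(w, toy_vocab)]

# a model never sees strings, only the integer IDs looked up in the vocabulary
token_to_id = {tok: i for i, tok in enumerate(sorted(toy_vocab))}
ids = [token_to_id[t] for t in subword_tokens]

print(word_tokens)
print(len(char_tokens))
print(subword_tokens)
print(ids)
```

With a word-based tokenizer "dogs" and "dog" would be unrelated entries, while the subword split lets both share the token `dog`; the character-based split needs 12 tokens for the same three words.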


## NLP Pipeline

Recall the NLP-Pipeline:

```{mermaid}
%%| echo: false
flowchart TD
    A[Data Acquisition] --> B[Preprocessing and Normalization]
    B --> C[Modelling]
    C --> C
    C --> D[Model evaluation]
    D --> |more preprocessing needed| B
    D --> |more/different data needed| A
    D --> E[Added Value]
    E --> |reiterate| A
```

Here we are going to do some preprocessing and normalization steps. Your tasks are:



1. **Data Collection**: Collect a corpus of text data. It is completely up to you.
2. **Data Cleaning**: Clean the collected data by removing any irrelevant information such as HTML tags, URLs, numbers, etc. This step depends on the corpus you chose:
    - count the vocabulary size before and after cleaning
3. **Tokenization**: Apply a tokenizer (e.g. using https://www.nltk.org/):
    - count the vocabulary size before and after tokenization
    - how much time is needed per word on average?
4. **Stopwords Removal**: Identify and remove stopwords from the tokens. Stopwords are common words that do not contribute much to the meaning of a sentence, such as 'the', 'is', 'in', etc.
    - count the vocabulary size before and after stopword removal
5. **Stemming and Lemmatization**: Apply stemming and lemmatization techniques to the tokens and observe the differences. Stemming reduces inflected (or sometimes derived) words to their word stem, base or root form, which need not be a valid word. Lemmatization, on the other hand, reduces words to their lemma, the linguistically correct dictionary form, using a vocabulary and morphological analysis.
    - count the vocabulary size for the stemmed and lemmatized vocabulary
6. **Compare**: 
    - create a barplot of the results you have collected above (https://plotly.com/python/bar-charts/)

Useful libraries:

- NLTK, Spacy
- plotly or matplotlib for plotting graphs

### 1. Get data

In [30]:
from nltk.corpus import gutenberg
import nltk
nltk.download('gutenberg')

raw_text = gutenberg.raw('shakespeare-hamlet.txt')



### 2. Data Cleaning


In [42]:
import re

def clean_text(text):
    cleaned = re.sub(r"<.*?>", " ", text)  # remove HTML tags
    cleaned = re.sub(r"http\S+|www\S+", " ", cleaned)  # remove URLs
    cleaned = re.sub(r"[^A-Za-z\s]", " ", cleaned)  # remove numbers and punctuation
    cleaned = re.sub(r"\s+", " ", cleaned).strip()  # remove extra spaces
    return cleaned

cleaned_text = clean_text(raw_text)

vocab_before = len(set(raw_text.split()))
vocab_after = len(set(cleaned_text.split()))

print("Vocabulary size before cleaning:", vocab_before)
print("Vocabulary size after cleaning:", vocab_after)

Vocabulary size before cleaning: 7422
Vocabulary size after cleaning: 5430
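
To see what each substitution does, the same cleaning steps can be run on a short string. The sample below is hypothetical and exercises all four patterns, including the HTML and URL patterns that a plain-text corpus may never trigger; the function is repeated only to keep the sketch self-contained.

```python
import re

def clean_text(text):
    cleaned = re.sub(r"<.*?>", " ", text)               # remove HTML tags
    cleaned = re.sub(r"http\S+|www\S+", " ", cleaned)   # remove URLs
    cleaned = re.sub(r"[^A-Za-z\s]", " ", cleaned)      # remove numbers and punctuation
    cleaned = re.sub(r"\s+", " ", cleaned).strip()      # collapse whitespace
    return cleaned

# hypothetical sample, not taken from the corpus
sample = "<p>Visit http://example.com: 'Tis o'er, Hamlet's 2 friends!</p>"
cleaned_sample = clean_text(sample)
print(cleaned_sample)
```

Because apostrophes are replaced by spaces, contractions such as "o'er" and possessives such as "Hamlet's" are split into separate pieces. This affects the vocabulary counts and is worth keeping in mind when interpreting them.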


### 3. Tokenization

In [43]:
import time
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')

start = time.perf_counter()  # monotonic, high-resolution clock for short timings
tokens = word_tokenize(cleaned_text)
end = time.perf_counter()

vocab_tokenized = len(set(tokens))
avg_time_per_word = (end - start) / len(tokens)

print("Vocabulary size:", vocab_tokenized)
print("Average time per word:", avg_time_per_word)

Vocabulary size: 5428
Average time per word: 1.9758370121523408e-06




### 4. Stopwords

In [44]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
tokens_no_stop = [word for word in tokens if word.lower() not in stop_words]

vocab_no_stop = len(set(tokens_no_stop))

print("Vocabulary size without stop words:", vocab_no_stop)

Vocabulary size without stop words: 5230




### 5. Stemming and Lemmatization

In [45]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed = [stemmer.stem(word) for word in tokens_no_stop]
lemmatized = [lemmatizer.lemmatize(word.lower()) for word in tokens_no_stop]

vocab_stemmed = len(set(stemmed))
vocab_lemmatized = len(set(lemmatized))

print("Vocabulary size after stemming:", vocab_stemmed)
print("Vocabulary size after lemmatization:", vocab_lemmatized)



Vocabulary size after stemming: 3649
Vocabulary size after lemmatization: 4322
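
The stemmed vocabulary is smaller than the lemmatized one partly because stemming strips suffixes aggressively, and partly because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, so most verb forms stay unchanged. The toy sketch below illustrates the conceptual difference only; `toy_stem` and `TOY_LEMMAS` are hypothetical and far simpler than the Porter stemmer or WordNet.

```python
def toy_stem(word):
    # rule-based: strip the first matching suffix, no dictionary involved
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# dictionary-based: map known word forms to their lemma (hypothetical entries)
TOY_LEMMAS = {"studies": "study", "studying": "study", "was": "be", "better": "good"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

words = ["studies", "studying", "was", "better", "walked"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The stem "studi" is not a word, whereas every lemma is a dictionary form; on the other hand the lemmatizer leaves "walked" untouched because it is not in its lookup table.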


### 6. Compare

In [46]:
import plotly.graph_objects as go

labels = [
    "Original Raw",
    "Cleaned",
    "Tokenized",
    "No Stopwords",
    "Stemmed",
    "Lemmatized"
]
values = [
    vocab_before,
    vocab_after,
    vocab_tokenized,
    vocab_no_stop,
    vocab_stemmed,
    vocab_lemmatized
]

fig = go.Figure([go.Bar(x=labels, y=values)])
fig.update_layout(title="Vocabulary Size Across Preprocessing Stages",
                  xaxis_title="Stage", yaxis_title="Vocabulary Size")
fig.show()