# 🧠 NLP 101 for Programmers

## Featuring The Hitchhiker’s Guide to the Galaxy
### ⏱️ Duration: ~30 minutes
### 🛠️ Requirements: Python 3, Jupyter Notebook or any Python IDE, nltk, scikit-learn, pandas, matplotlib, seaborn

### 🗂️ Overview

Welcome to your first dive into NLP! In this tutorial, we’ll explore how machines process and represent text. We’ll cover:
- Tokenization – breaking down text into individual units
- Bag of Words (BoW) – a simple representation of text
- TF-IDF – identifying important words in context
- Cosine similarity – measuring how similar two texts are

You'll work on short excerpts from The Hitchhiker’s Guide to the Galaxy and complete four exercises along the way.

## 📦 Setup

In [None]:
import nltk

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

## 🧪 Exercise 1: Tokenization

**Goal:** Break a passage into tokens

**Optional:** Preprocess it to remove punctuation, numbers and stopwords

**Super Optional:** Visualize the result with a bar plot

### 📖 Sample Text:

In [None]:
text = """Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy lies 
a small unregarded yellow sun."""

### 🧰 Tools:

- `word_tokenize` from `nltk.tokenize`
- `Counter` from `collections`

### 💻 Task:
- Tokenize the above text.
- Count the number of unique tokens.
- Print the top 5 most frequent tokens.

### ✅ Expected Output (example):

```python
Tokens: ['far', 'out', 'in', 'the', 'uncharted', 'backwaters', ...]
Unique tokens: 20
Most frequent: [('the', 4), ('of', 3), ('far', 1), ...]
```

In [None]:
# your code goes here

### 📖 Solution

In [None]:
from nltk.tokenize import word_tokenize
from collections import Counter

tokens = word_tokenize(text.lower())
freq_dist = Counter(tokens)

print(f"Tokens: {tokens}")
print(f"Unique tokens: {len(set(tokens))}")
print(f"Most frequent: {freq_dist.most_common(5)}")
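
To get a feel for what a tokenizer does, here is a rough standard-library approximation. It is only a sketch: the regular expression and the helper name `simple_tokenize` are assumptions made for illustration, not how `word_tokenize` works internally, and the two can disagree on contractions, abbreviations and similar edge cases.

```python
import re
from collections import Counter

text = """Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy lies
a small unregarded yellow sun."""

def simple_tokenize(s):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", s.lower())

simple_tokens = simple_tokenize(text)
simple_counts = Counter(simple_tokens)

print(f"Tokens: {simple_tokens}")
print(f"Unique tokens: {len(simple_counts)}")
print(f"Most frequent: {simple_counts.most_common(5)}")
```

Note that the final period counts as a token of its own, which is why punctuation shows up in the frequency counts until it is filtered out.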

In [None]:
from nltk.corpus import stopwords

# Build the stopword set once; calling stopwords.words() per token is slow
stop_words = set(stopwords.words('english'))

# Preprocessing function
def preprocess(text):
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha()]  # remove punctuation/numbers
    tokens = [t for t in tokens if t not in stop_words]  # remove stopwords
    # TODO: lemmatize
    # TODO: stem
    return tokens

tokens = preprocess(text)
freq_dist = Counter(tokens)

print(f"Tokens: {tokens}")
print(f"Unique tokens: {len(set(tokens))}")
print(f"Most frequent: {freq_dist.most_common(5)}")

In [None]:
tokens = word_tokenize(text.lower())
freq_dist = Counter(tokens)
most_common = freq_dist.most_common(5)
words, counts = zip(*most_common)
sns.barplot(x=list(words), y=list(counts))
plt.title("Top 5 Words in Sample Text")
plt.xticks(rotation=45)
plt.show()

## 🧪 Exercise 2: Bag of Words

**Goal:** Represent text as a word-count vector

**Optional:** Visualize the result with a heatmap

### 📖 Sample Text:

In [None]:
docs = [
    "The ships hung in the sky in much the same way that bricks don’t.",
    "Time is an illusion. Lunchtime doubly so.",
    "The Answer to the Great Question... Of Life, the Universe and Everything... Is... Forty-two."
]

### 🧰 Tools:

`CountVectorizer` from `sklearn.feature_extraction.text`

### 💻 Task:
- Convert the 3 texts into a Bag of Words representation.
- Print the vocabulary.
- Print the count matrix as a DataFrame for readability.

### ✅ Expected Output (example):

```python
Vocabulary: ['answer', 'bricks', 'don’t', 'everything', ...]
BoW Matrix:
|        | answer | bricks | don’t | everything | ... |
|--------|--------|--------|-------|------------|-----|
| Text 1 | 0      | 1      | 1     | 0          | ... |
| Text 2 | 0      | 0      | 0     | 0          | ... |
| Text 3 | 1      | 0      | 0     | 1          | ... |
```

In [None]:
# your code goes here

### 📖 Solution

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize vectorizer
vectorizer = CountVectorizer(stop_words='english')
bow = vectorizer.fit_transform(docs)

df_bow = pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out())

print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Matrix:\n", df_bow)
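
To see what the vectorizer is computing, the sketch below builds a count matrix by hand with the standard library. The regular expression mimics the spirit of `CountVectorizer`'s default token pattern (lowercased runs of two or more word characters), but the helper `tokenize` is an assumption for illustration, and no stopwords are removed here.

```python
import re
from collections import Counter

docs = [
    "The ships hung in the sky in much the same way that bricks don’t.",
    "Time is an illusion. Lunchtime doubly so.",
    "The Answer to the Great Question... Of Life, the Universe and Everything... Is... Forty-two.",
]

def tokenize(doc):
    # Lowercase and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(d) for d in docs]

# The vocabulary is the sorted set of all tokens; each becomes one column
vocabulary = sorted(set(t for toks in tokenized for t in toks))

# One row per document, one count per vocabulary word
counts = [Counter(toks) for toks in tokenized]
bow_rows = [[c[word] for word in vocabulary] for c in counts]

for i, row in enumerate(bow_rows, start=1):
    nonzero = {w: n for w, n in zip(vocabulary, row) if n}
    print(f"Text {i}: {nonzero}")
```

With this pattern the curly apostrophe is not a word character, so "don’t" contributes the token `don`, and "Forty-two" splits into `forty` and `two`.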

In [None]:
sns.heatmap(df_bow, annot=True, cmap="YlGnBu", cbar=False)
plt.title("Bag of Words Matrix")
plt.xlabel("Words")
plt.ylabel("Text Index")
plt.show()

## 🧪 Exercise 3: TF-IDF

**Goal:** Identify the most meaningful words in each sentence

### 🧰 Tools:

`TfidfVectorizer` from `sklearn.feature_extraction.text`

### 💻 Task:
- Convert the same texts into TF-IDF vectors.
- Print the resulting matrix as a DataFrame.
- Highlight the top 3 words with the highest TF-IDF scores per text.

### ✅ Expected Output (example):

```python
TF-IDF Matrix:
|        | answer | bricks | don’t | everything | ... |
|--------|--------|--------|-------|------------|-----|
| Text 1 | 0.0    | 0.707  | 0.707 | 0.0        | ... |
| Text 2 | 0.0    | 0.0    | 0.0   | 0.0        | ... |
| Text 3 | 0.5    | 0.0    | 0.0   | 0.5        | ... |

Top words:
- Text 1: bricks, don’t, sky
- Text 2: illusion, lunchtime, doubly
- Text 3: answer, everything, universe
```

In [None]:
# your code goes here

### 📖 Solution

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(docs)
df_tfidf = pd.DataFrame(tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", df_tfidf)
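
The task also asks for the top 3 words per text. The sketch below computes TF-IDF by hand and ranks the words, using the smoothed formula that scikit-learn documents as its default (`idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization). The helpers `tokenize`, `tfidf_vector` and `top_words` are names assumed for illustration, and no stopwords are removed, so the ranking differs from the solution above.

```python
import math
import re
from collections import Counter

docs = [
    "The ships hung in the sky in much the same way that bricks don’t.",
    "Time is an illusion. Lunchtime doubly so.",
    "The Answer to the Great Question... Of Life, the Universe and Everything... Is... Forty-two.",
]

def tokenize(doc):
    # Lowercased runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
df = Counter(t for toks in tokenized for t in set(toks))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}

def tfidf_vector(tokens):
    """Return {word: weight} for one document, scaled to unit L2 length."""
    tf = Counter(tokens)
    weights = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

def top_words(row, k=3):
    """Return the k highest-weighted words of one document."""
    return sorted(row, key=row.get, reverse=True)[:k]

tfidf_rows = [tfidf_vector(toks) for toks in tokenized]

for i, row in enumerate(tfidf_rows, start=1):
    print(f"Text {i}: {', '.join(top_words(row))}")
```

Because "the" occurs in two of the three texts it gets a lower IDF than a word like "bricks", yet its raw count of three still puts it on top. TF-IDF down-weights common words but does not remove them, which is one reason the solution passes `stop_words='english'`.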


In [None]:
sns.heatmap(df_tfidf, annot=False, cmap="coolwarm", linewidths=0.5)
plt.title("TF-IDF Scores per Word")
plt.xlabel("Words")
plt.ylabel("Text Index")
plt.show()

## 🧪 Exercise 4: Cosine Similarity Between Texts

**Goal:** Find which texts are most similar using vector math

**Optional:** Compare with preprocessed texts

### 🧰 Tools:

`cosine_similarity` from `sklearn.metrics.pairwise`

`heatmap` from `seaborn`

### 💻 Task:
- Calculate the Cosine Similarity between all vectors
- Print the resulting matrix as a DataFrame.
- Create a heatmap to visualize the most similar texts

### ✅ Expected Output (example):

```python
Text 1 is most similar to Text 3
```

In [None]:
# your code goes here

### 📖 Solution

In [None]:
# TODO: compare with preprocessed texts
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(tfidf)
df_similarity = pd.DataFrame(similarity_matrix, index=[f"Text {i+1}" for i in range(len(docs))],
                             columns=[f"Text {i+1}" for i in range(len(docs))])

sns.heatmap(df_similarity, annot=True, cmap="Blues")
plt.title("Cosine Similarity Between Texts")
plt.show()
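
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that out with the standard library. The function name `cosine` and the toy vectors are assumptions for illustration; in the notebook, `cosine_similarity` does the same job on the whole TF-IDF matrix at once.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy count vectors over a hypothetical 4-word vocabulary
vec_a = [1, 1, 0, 0]
vec_b = [1, 0, 1, 0]
vec_c = [0, 0, 0, 2]

print(cosine(vec_a, vec_a))  # identical direction, close to 1
print(cosine(vec_a, vec_b))  # one shared word out of two, close to 0.5
print(cosine(vec_a, vec_c))  # no shared words, exactly 0
```

Two texts that share no vocabulary have a similarity of 0 regardless of their length, so on a very small corpus with stopwords removed the heatmap can be mostly zeros off the diagonal.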

In [None]:
docs = [
    "The ships hung in the sky in much the same way that bricks don’t.",
    "Time is an illusion. Lunchtime doubly so.",
    "The Answer to the Great Question... Of Life, the Universe and Everything... Is... Forty-two."
]

In [None]:
docs

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Build the stopword set once instead of once per token
stop_words = set(stopwords.words('english'))

# Preprocessing function
def preprocess(text):
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha()]  # remove punctuation/numbers
    tokens = [t for t in tokens if t not in stop_words]  # remove stopwords
    # TODO: lemmatize
    # TODO: stem
    return tokens

# TfidfVectorizer expects strings, so join each token list back into one string
preprocessed_docs = [" ".join(preprocess(doc)) for doc in docs]

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(preprocessed_docs)

similarity_matrix = cosine_similarity(tfidf)
labels = [f"Text {i+1}" for i in range(len(docs))]
df_similarity = pd.DataFrame(similarity_matrix, index=labels, columns=labels)

sns.heatmap(df_similarity, annot=True, cmap="Blues")
plt.title("Cosine Similarity Between Preprocessed Texts")
plt.show()