# 🧠 NLP 101 for Programmers

## Featuring The Hitchhiker’s Guide to the Galaxy
### ⏱️ Duration: ~30 minutes
### 🛠️ Requirements: Python 3, Jupyter Notebook or any Python IDE, nltk, scikit-learn, pandas, matplotlib, seaborn

### 🗂️ Overview

Welcome to your first dive into NLP! In this tutorial, we’ll explore how machines process and understand text. We’ll start with:
- Tokenization – breaking down text into individual units
- Bag of Words (BoW) – a simple representation of text
- TF-IDF – identifying important words in context

You'll work on short excerpts from The Hitchhiker’s Guide to the Galaxy and complete six exercises along the way.

## 📦 Setup

In [None]:
import nltk

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download("wordnet")

## 🧪 Exercise 1: Tokenization

**Goal:** Break a passage into tokens and print the English stopwords

**Optional:** Preprocess it to remove punctuation, numbers and stopwords

**Super Optional:** Visualize the result with a bar plot

### 📖 Sample Text:

In [None]:
text = """Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy lies 
a small unregarded yellow sun."""

### 🧰 Tools:

`word_tokenize` from `nltk.tokenize`

`Counter` from `collections`

### 💻 Task:
- Print the English stopwords.
- Tokenize the above text.
- Count the number of unique tokens.
- Print the top 5 most frequent tokens.

### ✅ Expected Output (example):

```python
Tokens: ['far', 'out', 'in', 'the', 'uncharted', 'backwaters', ...]
Unique tokens: 20
Most frequent: [('the', 4), ('of', 3), ('far', 1), ...]

{'to', "don't", 'd', 'having'...
```

In [None]:
# your code goes here

### 📖 Solution

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter

tokens = word_tokenize(text.lower())
freq_dist = Counter(tokens)

print(f"Tokens: {tokens}")
print(f"Unique tokens: {len(set(tokens))}")
print(f"Most frequent: {freq_dist.most_common(5)}")

stops = set(stopwords.words('english'))
print(stops)

In [None]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

ps = PorterStemmer()
wnl = WordNetLemmatizer()

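# Build the stopword set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))

# Preprocessing function
def preprocess(text):
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha()]  # keep alphabetic tokens only (drops punctuation/numbers)
    tokens = [t for t in tokens if t not in stop_words]  # remove stopwords
    tokens = [wnl.lemmatize(t) for t in tokens]  # lemmatize (treats tokens as nouns by default)
    tokens = [ps.stem(t) for t in tokens]  # stem
    return tokens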

tokens = preprocess(text)
freq_dist = Counter(tokens)

print(f"Tokens: {tokens}")
print(f"Unique tokens: {len(set(tokens))}")
print(f"Most frequent: {freq_dist.most_common(5)}")

In [None]:
tokens = word_tokenize(text.lower())
freq_dist = Counter(tokens)
most_common = freq_dist.most_common(5)
words, counts = zip(*most_common)
sns.barplot(x=list(words), y=list(counts))
plt.title("Top 5 Words in Sample Text")
plt.xticks(rotation=45)
plt.show()
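
For intuition, the tokenize-and-count step can be approximated with the standard library alone. The sketch below is not how `word_tokenize` works internally; `simple_tokenize` is a hypothetical helper that simply keeps runs of letters, so punctuation and numbers are dropped rather than returned as separate tokens.

```python
import re
from collections import Counter

text = ("Far out in the uncharted backwaters of the unfashionable end of "
        "the western spiral arm of the Galaxy lies a small unregarded yellow sun.")

def simple_tokenize(s):
    # Hypothetical helper: lowercase, then keep runs of ASCII letters only
    return re.findall(r"[a-z]+", s.lower())

tokens = simple_tokenize(text)
freq = Counter(tokens)

print(len(tokens))          # 24
print(len(freq))            # 19
print(freq.most_common(2))  # [('the', 4), ('of', 3)]
```

Because the final period is discarded here, this version reports one unique token fewer than the `word_tokenize` solution above.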

## 🧪 Exercise 2: Bag of Words

**Goal:** Represent text as a word-count vector

**Optional:** Visualize the result with a heatmap

### 📖 Sample Text:

In [None]:
docs = [
    "The ships hung in the sky in much the same way that bricks don’t.",
    "Time is an illusion. Lunchtime doubly so.",
    "The Answer to the Great Question... Of Life, the Universe and Everything... Is... Forty-two.",
    "It was a particular type of rain he particularly disliked, particularly when he was driving. He had a number for it. It was rain type 17.",
    "He blinked, and understood nothing."
]

### 🧰 Tools:

`CountVectorizer` from `sklearn.feature_extraction.text`

### 💻 Task:
- Convert the 5 texts into a Bag of Words representation.
- Print the vocabulary.
- Print the count matrix as a DataFrame for readability.

### ✅ Expected Output (example):

```python
Vocabulary: ['answer', 'bricks', 'don’t', 'everything', ...]
BoW Matrix:
|        | answer | bricks | don’t | everything | ... |
|--------|--------|--------|-------|------------|-----|
| Text 1 | 0      | 1      | 1     | 0          | ... |
| Text 2 | 0      | 0      | 0     | 0          | ... |
| Text 3 | 1      | 0      | 0     | 1          | ... |
```

In [None]:
# your code goes here

### 📖 Solution

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize vectorizer
vectorizer = CountVectorizer(stop_words='english')
bow = vectorizer.fit_transform(docs)

df_bow = pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out())

print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Matrix:\n", df_bow)

In [None]:
sns.heatmap(df_bow, annot=True, cmap="YlGnBu", cbar=False)
plt.title("Bag of Words Matrix")
plt.xlabel("Words")
plt.ylabel("Text Index")
plt.show()
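
To see what a Bag of Words matrix is underneath, here is a minimal from-scratch sketch using only the standard library. It assumes a tokenizer that roughly mirrors `CountVectorizer`'s default behaviour (lowercasing, words of two or more characters) and does no stopword removal; the third sentence is made up for illustration so that the texts share some words.

```python
import re
from collections import Counter

docs = [
    "Time is an illusion. Lunchtime doubly so.",
    "He blinked, and understood nothing.",
    "Time, he understood, is an illusion.",  # made-up sentence, for overlap only
]

def tokenize(doc):
    # Lowercase and keep words of at least two characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# One Counter per document, then a sorted vocabulary over all documents
counts = [Counter(tokenize(d)) for d in docs]
vocab = sorted(set().union(*counts))

# Each row is a document, each column a vocabulary word (Counter returns 0 for missing words)
bow = [[c[w] for w in vocab] for c in counts]

print(vocab)
for row in bow:
    print(row)
```

Each row has one entry per vocabulary word, so word order inside a text is lost; only the counts survive.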

## 🧪 Exercise 3: TF-IDF

**Goal:** Identify the most meaningful words in each text

### 🧰 Tools:

`TfidfVectorizer` from `sklearn.feature_extraction.text`

### 💻 Task:
- Convert the same texts into TF-IDF vectors.
- Print the resulting matrix as a DataFrame.
- Highlight the top 3 words with the highest TF-IDF scores per text.

### ✅ Expected Output (example):

```python
TF-IDF Matrix:
|        | answer | bricks | don’t | everything | ... |
|--------|--------|--------|-------|------------|-----|
| Text 1 | 0.0    | 0.707  | 0.707 | 0.0        | ... |
| Text 2 | 0.0    | 0.0    | 0.0   | 0.0        | ... |
| Text 3 | 0.5    | 0.0    | 0.0   | 0.5        | ... |

Top words:
- Text 1: bricks, don’t, sky
- Text 2: illusion, lunchtime, doubly
- Text 3: answer, everything, universe
```

In [None]:
# your code goes here

### 📖 Solution

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(docs)
df_tfidf = pd.DataFrame(tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", df_tfidf)


In [None]:
sns.heatmap(df_tfidf, annot=False, cmap="coolwarm", linewidths=0.5)
plt.title("TF-IDF Scores per Word")
plt.xlabel("Words")
plt.ylabel("Text Index")
plt.show()
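
The task also asks for the top 3 words per text. The sketch below shows one way to get them, together with a from-scratch TF-IDF so the weighting is visible. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches scikit-learn's documented defaults as far as this sketch goes; `tfidf_vectors` and `top_words` are hypothetical helper names, and the third sentence is made up.

```python
import math
import re
from collections import Counter

docs = [
    "Time is an illusion. Lunchtime doubly so.",
    "He blinked, and understood nothing.",
    "Time, he understood, is an illusion.",  # made-up sentence, for overlap only
]

def tokenize(doc):
    return re.findall(r"\b\w\w+\b", doc.lower())

def tfidf_vectors(docs):
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(w for c in counts for w in c)
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in df}
    vectors = []
    for c in counts:
        raw = {w: tf * idf[w] for w, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        vectors.append({w: v / norm for w, v in raw.items()})  # L2-normalize
    return vectors

def top_words(vector, k=3):
    # Words with the highest TF-IDF weight first
    return sorted(vector, key=vector.get, reverse=True)[:k]

vectors = tfidf_vectors(docs)
for i, vec in enumerate(vectors, start=1):
    print(f"Text {i}: {', '.join(top_words(vec))}")
```

Words that occur in only one text get a higher IDF than words shared across texts, which is why they tend to rise to the top. With the DataFrame from the solution, `df_tfidf.iloc[i].nlargest(3)` should give the same kind of ranking.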

## 🧪 Exercise 4: Word-Level Distance Comparisons

**Goal:** Explore how close or far apart individual words are in the vector space created by the TF-IDF vectorization in Exercise 3. Each word is represented by its column of the TF-IDF matrix (one weight per text).

**Optional:** Implement a simple k-Nearest Neighbours

### 🧰 Tools:

`cosine_similarity`, `euclidean_distances` (or any other metric) from `sklearn.metrics.pairwise`


### 💻 Task:
- Select two words from the vocabulary.
- Calculate the distance between them.
- Optional: Pick a word, calculate its distance to every other word, sort the distances in ascending order, and print the k nearest words.

### ✅ Expected Output (example):

```python
Word 1 has distance 0.457234 to Word 2

# optional
The 3 nearest words to word 1 are:
Word 2: 0.0123
Word 7: 0.1543
Word 4: 0.2872
```

In [None]:
# your code goes here

### 📖 Solution

In [None]:
from sklearn.metrics.pairwise import euclidean_distances

# Get vocabulary
vocab = tfidf_vectorizer.get_feature_names_out()
vocab_dict = {word: idx for idx, word in enumerate(vocab)}

# Pick two words
word1 = "illusion"
word2 = "rain"

# Ensure both words exist in vocab
if word1 in vocab_dict and word2 in vocab_dict:
    vec1 = tfidf[:, vocab_dict[word1]].toarray()
    vec2 = tfidf[:, vocab_dict[word2]].toarray()
    
    eucl = euclidean_distances(vec1.T, vec2.T)[0][0]
    
    print(f"Euclidean distance between '{word1}' and '{word2}': {eucl:.4f}")
else:
    print("One of the words is not in the vocabulary.")

In [None]:
# Choose target word
target_word = "rain"

# Ensure word exists
if target_word in vocab_dict:
    target_vec = tfidf[:, vocab_dict[target_word]].toarray()
    
    distances = {}
    for word in vocab:
        if word == target_word:
            continue
        word_vec = tfidf[:, vocab_dict[word]].toarray()
        dist = euclidean_distances(target_vec.T, word_vec.T)[0][0]
        distances[word] = dist

    # Sort by closest
    sorted_words = sorted(distances.items(), key=lambda x: x[1])
    
    print(f"\nTop 5 words closest to '{target_word}' by euclidean distance:")
    for word, dist in sorted_words[:5]:
        print(f"{word}: {dist:.4f}")
else:
    print("Target word is not in the vocabulary.")
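
The optional k-nearest-neighbours step boils down to "compute all distances, sort, take the first k". Here is a small dependency-free sketch of that idea. The word vectors are made-up numbers standing in for TF-IDF columns, and `euclidean` and `nearest` are hypothetical helpers, not scikit-learn functions.

```python
import math

# Made-up term vectors: one weight per document, as in a TF-IDF column
word_vectors = {
    "rain":     [0.0, 0.0, 0.6, 0.0],
    "type":     [0.0, 0.0, 0.4, 0.0],
    "illusion": [0.5, 0.0, 0.0, 0.0],
    "time":     [0.5, 0.0, 0.0, 0.3],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(target, vectors, k=3):
    # Distance from the target to every other word, closest first
    dists = [(w, euclidean(vectors[target], v))
             for w, v in vectors.items() if w != target]
    return sorted(dists, key=lambda pair: pair[1])[:k]

for word, dist in nearest("rain", word_vectors, k=2):
    print(f"{word}: {dist:.4f}")
```

With only five short texts, words that appear in the same text end up with nearly identical vectors, so "closeness" here mostly reflects co-occurrence rather than meaning.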

## 🧪 Exercise 5: Cosine Similarity Between Texts

**Goal:** Find which texts are most similar using vector math

**Optional:** Try different metrics (check out `sklearn.metrics.pairwise`)

**Super Optional:** Compare with preprocessed texts

### 🧰 Tools:

`cosine_similarity` from `sklearn.metrics.pairwise`

`heatmap` from `seaborn`

### 💻 Task:
- Calculate the Cosine Similarity between all vectors
- Print the resulting matrix as a DataFrame.
- Create a heatmap to visualize the most similar texts

### ✅ Expected Output (example):

A heatmap of document similarity

In [None]:
# your code goes here

### 📖 Solution

In [None]:
# TODO: compare with preprocessed texts
from sklearn.metrics.pairwise import cosine_similarity

# Function to plot heatmaps
def plot_heatmap(matrix, title, labels):
    df = pd.DataFrame(matrix, index=labels, columns=labels)
    plt.figure(figsize=(8, 6))
    sns.heatmap(df, annot=True, cmap="viridis")
    plt.title(title)
    plt.show()

labels = [f'Doc {i+1}' for i in range(len(docs))]

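# `tfidf` is the TF-IDF matrix computed in Exercise 3
cos_sim = cosine_similarity(tfidf)
print(pd.DataFrame(cos_sim, index=labels, columns=labels).round(3))
plot_heatmap(cos_sim, "Cosine Similarity", labels)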

In [None]:
from sklearn.metrics.pairwise import euclidean_distances, manhattan_distances

# Compute distance matrices on the TF-IDF matrix from Exercise 3
eucl_dist = euclidean_distances(tfidf)
manh_dist = manhattan_distances(tfidf)

plot_heatmap(eucl_dist, "Euclidean Distance", labels)
plot_heatmap(manh_dist, "Manhattan Distance", labels)
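
Cosine similarity itself is just a dot product divided by the two vector lengths. The sketch below writes it out by hand on toy vectors and checks a useful identity: for unit-length vectors, squared Euclidean distance equals `2 - 2 * cosine`. The helper names are hypothetical and the vectors are made up.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    # Dot product divided by the product of the vector lengths
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    length = norm(a)
    return [x / length for x in a]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

cos = cosine(a, b)
dist = euclidean(normalize(a), normalize(b))

print(f"cosine similarity: {cos:.2f}")               # 0.96
print(f"squared distance (unit vectors): {dist**2:.2f}")  # 0.08
print(f"2 - 2 * cosine: {2 - 2 * cos:.2f}")          # 0.08
```

Since `TfidfVectorizer` L2-normalizes each row by default, the Euclidean heatmap should rank pairs of texts in the same order as the cosine heatmap, only inverted (small distance, high similarity).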

## 🧪 Exercise 6: Full Book Analysis – Frequency & Cosine Similarity

**Goal:** Dive into the full text of *The Hitchhiker’s Guide to the Galaxy* to find the most frequent words and analyze text similarity.

**Optional:** Calculate the distances from exercise 4 again.


### 🧰 Tools:

`CountVectorizer`, `TfidfVectorizer` from `sklearn.feature_extraction.text`  

`cosine_similarity` from `sklearn.metrics.pairwise`  

`seaborn.heatmap`  


### 💻 Task:

- Load the full text of *The Hitchhiker’s Guide to the Galaxy*.
- Split the book into segments (e.g. paragraphs or chunks of N sentences).
- Vectorize the segments using TF-IDF.
- Compute the cosine similarity between all segments.
- Create a heatmap showing segment similarity.
- List the top 20 most frequent words in the book.
- Optional: Recalculate the word distances from Exercise 4 using the book's TF-IDF vectors.


### ✅ Expected Output (example):

- A heatmap showing which parts of the book are most similar  
- A DataFrame of cosine similarity values  
- A printed list of the top 20 most frequent words  


In [None]:
# your code goes here

### 📖 Solution

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re

# 1. Load the book (example assumes it's in a .txt file)
with open("data/guide.txt", "r", encoding="utf-8") as file:
    text = file.read()

# 2. Split into sentences, then group them into segments of 10 sentences
#    (\s+ also splits where a sentence ends at a line break)
segments = re.split(r'(?<=[.!?])\s+', text)
chunk_size = 10
chunks = [' '.join(segments[i:i+chunk_size]) for i in range(0, len(segments), chunk_size)]


# 3. TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(chunks)

# 4. Cosine similarity
cos_sim_matrix = cosine_similarity(X)

# 5. Heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(cos_sim_matrix, cmap="coolwarm")
plt.title("Cosine Similarity Between Text Segments")
plt.xlabel("Segment Index")
plt.ylabel("Segment Index")
plt.show()

# 6. Most frequent words
count_vec = CountVectorizer(stop_words='english')
word_matrix = count_vec.fit_transform([text])
sum_words = word_matrix.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in count_vec.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
print("Top 20 Most Frequent Words:")
for word, freq in words_freq[:20]:
    print(f"{word}: {freq}")


from sklearn.metrics.pairwise import euclidean_distances

# Get vocabulary
vocab = vectorizer.get_feature_names_out()
vocab_dict = {word: idx for idx, word in enumerate(vocab)}

# Pick two words
word1 = "illusion"
word2 = "rain"

# Ensure both words exist in vocab
if word1 in vocab_dict and word2 in vocab_dict:
    vec1 = X[:, vocab_dict[word1]].toarray()
    vec2 = X[:, vocab_dict[word2]].toarray()
    
    eucl = euclidean_distances(vec1.T, vec2.T)[0][0]
    
    print(f"Euclidean distance between '{word1}' and '{word2}': {eucl:.4f}")
else:
    print("One of the words is not in the vocabulary.")


# Choose target word
target_word = "rain"

# Ensure word exists
if target_word in vocab_dict:
    target_vec = X[:, vocab_dict[target_word]].toarray()
    
    distances = {}
    for word in vocab:
        if word == target_word:
            continue
        word_vec = X[:, vocab_dict[word]].toarray()
        dist = euclidean_distances(target_vec.T, word_vec.T)[0][0]
        distances[word] = dist

    # Sort by closest
    sorted_words = sorted(distances.items(), key=lambda x: x[1])
    
    print(f"\nTop 5 words closest to '{target_word}' by euclidean distance:")
    for word, dist in sorted_words[:5]:
        print(f"{word}: {dist:.4f}")
else:
    print("Target word is not in the vocabulary.")
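
The segmenting step is easy to get subtly wrong, so it helps to try it on a tiny input first. The sketch below wraps the regex split and the chunking in a hypothetical helper, `chunk_sentences`, and runs it on a few sentences from the earlier exercises. The regex is a rough heuristic: it will also split after abbreviations such as "Mr.", which `nltk.sent_tokenize` would likely handle better.

```python
import re

def chunk_sentences(text, chunk_size):
    # Split after ., ! or ? followed by whitespace (spaces or line breaks)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Group consecutive sentences into chunks of `chunk_size`
    return [" ".join(sentences[i:i + chunk_size])
            for i in range(0, len(sentences), chunk_size)]

sample = ("Time is an illusion. Lunchtime doubly so.\n"
          "He blinked, and understood nothing. He had a number for it. "
          "It was rain type 17.")

chunks = chunk_sentences(sample, chunk_size=2)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

The last chunk may be shorter than `chunk_size`; with a very small final chunk, its TF-IDF vector is dominated by a handful of words, which can show up as an outlier row in the heatmap.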