# 🧠 NLP 101 for Programmers

## Featuring The Hitchhiker’s Guide to the Galaxy
### ⏱️ Duration: ~30 minutes
### 🛠️ Requirements: Python 3, Jupyter Notebook or any Python IDE, nltk, scikit-learn, pandas, matplotlib, seaborn

### 🗂️ Overview

Welcome to your first dive into NLP! In this tutorial, we’ll explore how machines process and understand text. We’ll start with:
- Tokenization – breaking down text into individual units
- Bag of Words (BoW) – a simple representation of text
- TF-IDF – identifying important words in context

You'll work on short excerpts from The Hitchhiker’s Guide to the Galaxy and complete six exercises along the way.

## 📦 Setup

In [None]:
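import nltk  # tokenizers and stopword lists

import pandas as pd  # DataFrames for readable matrices

import matplotlib.pyplot as plt  # plotting backend
import seaborn as sns  # bar plots and heatmaps

# One-time downloads of the NLTK resources used in the exercises
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('punkt_tab')  # tokenizer tables needed by newer NLTK releases
nltk.download('stopwords')  # stopword lists, including English
nltk.download('wordnet')    # WordNet data (e.g. for lemmatization)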

## 🧪 Exercise 1: Tokenization

**Goal:** Break a passage into tokens and print the English stopwords

**Optional:** Preprocess it to remove punctuation, numbers and stopwords

**Super Optional:** Visualize the result with a bar plot

### 📖 Sample Text:

In [None]:
text = """Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy lies 
a small unregarded yellow sun."""

### 🧰 Tools:

`word_tokenize` from `nltk.tokenize`

`Counter` from `collections`

### 💻 Task:
- Print the English stopwords.
- Tokenize the above text.
- Count the number of unique tokens.
- Print the top 5 most frequent tokens.

### ✅ Expected Output (example):

```python
Tokens: ['Far', 'out', 'in', 'the', 'uncharted', 'backwaters', ...]
Unique tokens: 20
Most frequent: [('the', 4), ('of', 3), ('Far', 1), ...]

{'to', "don't", 'd', 'having'...
```
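To see what the counting step does without any downloads, here is a minimal sketch that swaps `word_tokenize` for a rough regular-expression tokenizer. The regex is an assumption made for illustration only; `word_tokenize` handles contractions, abbreviations and quotes far more carefully, so results can differ on other texts.

```python
import re
from collections import Counter

text = """Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy lies
a small unregarded yellow sun."""

# Rough stand-in for word_tokenize (illustration only):
# a token is a run of word characters or a single punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", text)

counts = Counter(tokens)
print("Tokens:", tokens[:6], "...")
print("Unique tokens:", len(counts))
print("Most frequent:", counts.most_common(2))

# Optional preprocessing: keep alphabetic tokens only (drops the final '.')
word_tokens = [t for t in tokens if t.isalpha()]
print("Unique word tokens:", len(set(word_tokens)))
```

The final period counts as a token of its own, which is why dropping punctuation lowers the number of unique tokens by one.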

In [None]:
# your code goes here

## 🧪 Exercise 2: Bag of Words

**Goal:** Represent text as a word-count vector

**Optional:** Visualize the result with a heatmap

### 📖 Sample Text:

In [None]:
docs = [
    "The ships hung in the sky in much the same way that bricks don’t.",
    "Time is an illusion. Lunchtime doubly so.",
    "The Answer to the Great Question... Of Life, the Universe and Everything... Is... Forty-two.",
    "It was a particular type of rain he particularly disliked, particularly when he was driving. He had a number for it. It was rain type 17.",
    "He blinked, and understood nothing."
]

### 🧰 Tools:

`CountVectorizer` from `sklearn.feature_extraction.text`

### 💻 Task:
- Convert the 5 texts into a Bag of Words representation.
- Print the vocabulary.
- Print the count matrix as a DataFrame for readability.

### ✅ Expected Output (example):

```python
Vocabulary: ['answer', 'bricks', 'don’t', 'everything', ...]
BoW Matrix:
|        | answer | bricks | don’t | everything | ... |
|--------|--------|--------|-------|------------|-----|
| Text 1 | 0      | 1      | 1     | 0          | ... |
| Text 2 | 0      | 0      | 0     | 0          | ... |
| Text 3 | 1      | 0      | 0     | 1          | ... |
```
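To make the idea of a word-count vector concrete, the following sketch builds a Bag of Words by hand for two of the sample texts. It is not a replacement for `CountVectorizer`; the helper `tokenize` is a hypothetical name, and its regex is chosen to mimic the vectorizer's default `token_pattern` (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

docs = [
    "The ships hung in the sky in much the same way that bricks don’t.",
    "Time is an illusion. Lunchtime doubly so.",
]

def tokenize(doc):
    # Hypothetical helper: lowercase, then keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# The vocabulary is the sorted set of all tokens seen in any document
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# One row per document, one column per vocabulary term
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print("Vocabulary:", vocabulary)
for i, row in enumerate(matrix, start=1):
    print(f"Text {i}:", row)
```

Note that with this token pattern the curly apostrophe splits `don’t` into `don` and a single `t`, and the one-character `t` is dropped. If contractions matter for your analysis, a custom tokenizer is worth considering.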

In [None]:
# your code goes here

## 🧪 Exercise 3: TF-IDF

**Goal:** Identify the most meaningful words in each sentence

### 🧰 Tools:

`TfidfVectorizer` from `sklearn.feature_extraction.text`

### 💻 Task:
- Convert the same texts into TF-IDF vectors.
- Print the resulting matrix as a DataFrame.
- Highlight the top 3 words with the highest TF-IDF scores per text.

### ✅ Expected Output (example):

```python
TF-IDF Matrix:
|        | answer | bricks | don’t | everything | ... |
|--------|--------|--------|-------|------------|-----|
| Text 1 | 0.0    | 0.707  | 0.707 | 0.0        | ... |
| Text 2 | 0.0    | 0.0    | 0.0   | 0.0        | ... |
| Text 3 | 0.5    | 0.0    | 0.0   | 0.5        | ... |

Top words:
- Text 1: bricks, don’t, sky
- Text 2: illusion, lunchtime, doubly
- Text 3: answer, everything, universe
```
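The scores in such a matrix come from a short formula, which the sketch below spells out by hand. It assumes the weighting that scikit-learn documents as the `TfidfVectorizer` default: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helpers `tokenize` and `top_terms` are hypothetical names introduced for this example.

```python
import math
import re
from collections import Counter

docs = [
    "The ships hung in the sky in much the same way that bricks don’t.",
    "Time is an illusion. Lunchtime doubly so.",
    "The Answer to the Great Question... Of Life, the Universe and Everything... Is... Forty-two.",
    "It was a particular type of rain he particularly disliked, particularly when he was driving. He had a number for it. It was rain type 17.",
    "He blinked, and understood nothing.",
]

def tokenize(doc):
    # Hypothetical helper: lowercased tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(doc) for doc in docs]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

# Document frequency: in how many texts does each term occur?
df = Counter(t for tokens in tokenized for t in set(tokens))
# Smoothed IDF (assumed to match the scikit-learn default)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

tfidf = []
for tokens in tokenized:
    tf = Counter(tokens)
    row = [tf[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in row))
    tfidf.append([v / norm for v in row])  # L2-normalize each row

def top_terms(doc_index, k=3):
    # Rank the vocabulary by score for one text, highest first
    ranked = sorted(zip(vocabulary, tfidf[doc_index]), key=lambda p: p[1], reverse=True)
    return [term for term, _ in ranked[:k]]

for i in range(n_docs):
    print(f"Text {i + 1}:", ", ".join(top_terms(i)))
```

Terms that appear in only one text get the largest IDF, so words such as `blinked` outrank words such as `he` or `and` that occur in several texts. Ties are broken arbitrarily here, so the exact top three can vary for texts where many words share the same score.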

In [None]:
# your code goes here

## 🧪 Exercise 4: Word-Level Distance Comparisons

**Goal:** Explore how close or far apart individual words are in the vector space created during vectorization (Exercises 2 and 3).

**Optional:** Implement a simple k-nearest-neighbours search

### 🧰 Tools:

`cosine_similarity` (or any other metric) from `sklearn.metrics.pairwise`


### 💻 Task:
- Select two words from the vocabulary.
- Calculate the distance between them.
- Optional: Pick a word, calculate its distance to every other word, sort the distances in ascending order, and print the k nearest words.

### ✅ Expected Output (example):

```python
Word 1 has distance 0.457234 to Word 2

# optional
The 3 nearest words to word 1 are:
Word 2: 0.0123
Word 7: 0.1543
Word 4: 0.2872
```
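One way to read this exercise, sketched below, is to treat each word as the vector of its counts across the five texts (a column of the Bag of Words matrix) and to use cosine distance, defined here as one minus cosine similarity. Both choices are assumptions for illustration; the exercise leaves the word representation and the metric open. `cosine_distance` and `nearest_words` are hypothetical helper names.

```python
import math
import re
from collections import Counter

docs = [
    "The ships hung in the sky in much the same way that bricks don’t.",
    "Time is an illusion. Lunchtime doubly so.",
    "The Answer to the Great Question... Of Life, the Universe and Everything... Is... Forty-two.",
    "It was a particular type of rain he particularly disliked, particularly when he was driving. He had a number for it. It was rain type 17.",
    "He blinked, and understood nothing.",
]

def tokenize(doc):
    # Hypothetical helper: lowercased tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

counts = [Counter(tokenize(doc)) for doc in docs]
vocabulary = sorted(set().union(*counts))

# Each word is represented by its counts in Text 1..5 (assumed representation)
word_vectors = {word: [c[word] for c in counts] for word in vocabulary}

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_words(word, k=3):
    # Distances from `word` to every other word, smallest first
    distances = [
        (other, cosine_distance(word_vectors[word], vector))
        for other, vector in word_vectors.items()
        if other != word
    ]
    return sorted(distances, key=lambda pair: pair[1])[:k]

d = cosine_distance(word_vectors["he"], word_vectors["blinked"])
print(f"'he' has distance {d:.6f} to 'blinked'")

print("The 3 nearest words to 'he' are:")
for other, distance in nearest_words("he", k=3):
    print(f"{other}: {distance:.4f}")
```

With only five texts, many words share identical count vectors and therefore sit at distance 0 from each other, so the neighbour lists become more informative on a larger corpus.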

## 🧪 Exercise 5: Cosine Similarity Between Texts

**Goal:** Find which texts are most similar using vector math

**Optional:** Try different metrics (check out `sklearn.metrics.pairwise`)

**Super Optional:** Compare with preprocessed texts

### 🧰 Tools:

`cosine_similarity` from `sklearn.metrics.pairwise`

`heatmap` from `seaborn`

### 💻 Task:
- Calculate the cosine similarity between all document vectors.
- Print the resulting matrix as a DataFrame.
- Create a heatmap to visualize the most similar texts

### ✅ Expected Output (example):

A heatmap plot of document similarity
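Before plotting, it can help to see the numbers behind the heatmap. The sketch below computes pairwise cosine similarity by hand on plain word-count vectors; this choice of vectors is an assumption, and TF-IDF vectors would work the same way. The helper names `tokenize` and `cosine_sim` are hypothetical. The resulting nested list is the kind of square matrix you would hand to `seaborn.heatmap`.

```python
import math
import re
from collections import Counter

docs = [
    "The ships hung in the sky in much the same way that bricks don’t.",
    "Time is an illusion. Lunchtime doubly so.",
    "The Answer to the Great Question... Of Life, the Universe and Everything... Is... Forty-two.",
    "It was a particular type of rain he particularly disliked, particularly when he was driving. He had a number for it. It was rain type 17.",
    "He blinked, and understood nothing.",
]

def tokenize(doc):
    # Hypothetical helper: lowercased tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

counts = [Counter(tokenize(doc)) for doc in docs]
vocabulary = sorted(set().union(*counts))
vectors = [[c[term] for term in vocabulary] for c in counts]

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Square matrix: similarity[i][j] compares Text i+1 with Text j+1
similarity = [[cosine_sim(u, v) for v in vectors] for u in vectors]

for i, row in enumerate(similarity, start=1):
    print(f"Text {i}:", "  ".join(f"{value:.2f}" for value in row))

# Most similar pair of distinct texts
pairs = [(i, j) for i in range(len(docs)) for j in range(i + 1, len(docs))]
best_pair = max(pairs, key=lambda p: similarity[p[0]][p[1]])
print(f"Most similar: Text {best_pair[0] + 1} and Text {best_pair[1] + 1}")
```

On raw counts the top pair is driven largely by the shared word `the`, which is a good motivation for the optional comparison with preprocessed texts.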

In [None]:
# your code goes here

## 🧪 Exercise 6: Full Book Analysis – Frequency & Cosine Similarity

**Goal:** Dive into the full text of *The Hitchhiker’s Guide to the Galaxy* to find the most frequent words and analyze text similarity.

**Optional:** Calculate the word distances from Exercise 4 again, this time on the full book.


### 🧰 Tools:

`CountVectorizer`, `TfidfVectorizer` from `sklearn.feature_extraction.text`  

`cosine_similarity` from `sklearn.metrics.pairwise`  

`seaborn.heatmap`  


### 💻 Task:

- Load the full text of *The Hitchhiker’s Guide to the Galaxy*.
- Split the book into segments (e.g. paragraphs or chunks of N sentences).
- Vectorize the segments using TF-IDF.
- Compute the cosine similarity between all segments.
- Create a heatmap showing segment similarity.
- List the top 20 most frequent words in the book.
- Optional: Recalculate the word distances from Exercise 4.


### ✅ Expected Output (example):

- A heatmap showing which parts of the book are most similar  
- A DataFrame of cosine similarity values  
- A printed list of the top 20 most frequent words  
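
The segmentation and word-frequency steps can be sketched as follows. The `book` string is a short placeholder built from the sample sentences, not the real book text, and the sentence splitter is a deliberately naive regex; `split_into_segments` and `top_words` are hypothetical helper names. In the exercise you would read the full text from a file of your own and then vectorize the resulting segments as in Exercises 3 and 5.

```python
import re
from collections import Counter

def split_into_segments(text, sentences_per_segment=2):
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    n = sentences_per_segment
    return [" ".join(sentences[i:i + n]) for i in range(0, len(sentences), n)]

def top_words(text, k=20):
    # Lowercased tokens of 2+ word characters, most frequent first
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens).most_common(k)

# Placeholder text standing in for the full book (assumption for this sketch)
book = (
    "Time is an illusion. Lunchtime doubly so. He blinked, and understood nothing. "
    "It was a particular type of rain he particularly disliked, particularly when "
    "he was driving. He had a number for it."
)

segments = split_into_segments(book, sentences_per_segment=2)
for i, segment in enumerate(segments, start=1):
    print(f"Segment {i}: {segment}")

print("Top words:", top_words(book, k=5))
```

Chunks of a fixed number of sentences keep the segments comparable in length, which tends to make the similarity heatmap easier to read than one built from paragraphs of very different sizes.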


In [None]:
# your code goes here