<p style="text-align: center; font-size: 28px;"><b>NLP Basics</b></p>


This notebook summarizes the key concepts explored throughout the **Natural Language Processing (NLP)** lessons.
It provides an overview of foundational methods, models, and ethical considerations that define the field.

### What’s Inside

* **Foundations of NLP** – bridging linguistics, computer science, and AI
* **Text Preprocessing** – cleaning, tokenization, normalization
* **Parsing & Syntax** – analyzing grammatical structure
* **Language Models** – Bag-of-Words, n-grams, and LSTMs
* **Topic Modeling & Similarity** – identifying themes and measuring relatedness
* **Language Prediction** – generating text and suggestions
* **Ethical Considerations** – addressing bias and privacy in NLP systems

Use this notebook to revisit core ideas and reinforce your understanding of how NLP enables machines to process and generate human language.


# Introduction to Natural Language Processing (NLP)

Look around at the technologies we use every day:

* **Spellcheck and autocorrect**
* **Auto-generated video captions**
* **Virtual assistants** like Amazon’s Alexa
* **Autocomplete**
* **News site recommendations**

What do they all have in common?
They rely on **Natural Language Processing (NLP)** — a field at the intersection of **linguistics**, **artificial intelligence**, and **computer science**.

The goal of NLP is to enable computers to **interpret**, **analyze**, and **generate** human languages such as English or Spanish.

NLP’s origins trace back to around **1950**, when **Alan Turing** proposed his famous **Turing Test** — a way to evaluate whether a computer can use language convincingly enough to make humans believe it’s human.

Today, NLP powers a wide range of applications beyond imitating human conversation, from **detecting spam emails** and **analyzing bias in tweets** to **improving accessibility** for people with disabilities.

While many programming languages can be used for NLP, **Python** stands out thanks to its extensive open-source libraries such as **NLTK (Natural Language Toolkit)**.
In this module, you’ll use **Python** to get your **first hands-on experience with NLP**.


# Text Preprocessing

> "You never know what you have... until you clean your data."
> ~ Unknown (or possibly made up)

Before diving into model training or sentiment analysis, you must **clean and prepare your text data** — and this process is known as **text preprocessing**. It’s the essential first step for any NLP task.

Without preprocessing, your computer would interpret `"the"`, `"The"`, and `"<p>The"` as completely different words. Preprocessing brings structure and consistency to raw text, ensuring that models can interpret it meaningfully.

## Common Text Preprocessing Tasks

* **Noise Removal**
  Strip text of irrelevant or distracting elements such as HTML tags, punctuation, or special characters.
  *Example:* removing `<p>`, `#`, `@`, etc.

* **Tokenization**
  Split text into smaller units (tokens) such as words or sentences.
  *Example:* `"NLP is fun!" → ["NLP", "is", "fun", "!"]` (punctuation becomes its own token unless noise removal has already stripped it)

* **Normalization**
  Bring all tokens to a standard format. This can include several sub-steps:

  * **Stemming** — A rough cut that chops off prefixes and suffixes.
    *Example:* “booing” → “boo”, “booed” → “boo”, but “computer” → “comput”.

  * **Lemmatization** — A more refined approach using linguistic knowledge to find base forms.
    *Example:* “am” → “be”, “are” → “be”.

Other helpful operations:

* Lowercasing text for consistency
* Removing **stopwords** (e.g., “the”, “is”, “and”)
* Fixing **spelling errors** or typos

## Tools You’ll Use

Most of these steps can be handled efficiently with **Regular Expressions (Regex)** and **NLTK (Natural Language Toolkit)** in Python.
These tools make text cleaning systematic, repeatable, and ready for analysis.
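
As a minimal sketch of how these steps chain together, the snippet below uses only Python's `re` module. The `preprocess` function and the tiny `STOPWORDS` set are assumptions made for this example (NLTK ships a much longer stopword list), and the whitespace split is a simplification of real tokenization.

```python
import re

# Tiny illustrative stopword list; NLTK's English list is much longer.
STOPWORDS = {"the", "is", "and", "a", "of"}

def preprocess(raw):
    no_tags = re.sub(r"<[^>]+>", " ", raw)       # noise removal: drop HTML tags
    no_punct = re.sub(r"[^\w\s]", " ", no_tags)  # noise removal: drop punctuation
    tokens = no_punct.lower().split()            # lowercase, then split on whitespace
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("<p>The squid is fun, and the NLP is fun!</p>")
print(tokens)
```

After this pass, `"the"`, `"The"`, and `"<p>The"` all collapse to the same token before the stopword filter removes it.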


In [6]:
# Step 1: Import all necessary libraries for text preprocessing
# - re: for regex-based text cleaning (removing punctuation and symbols)
# - nltk: for tokenization and linguistic processing
# - PorterStemmer: for stemming words (cutting down to their root form)
# - WordNetLemmatizer: for lemmatizing words (smarter root form extraction)
# - get_part_of_speech: helper function to map tokens to POS tags (noun, verb, etc.)
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from utils.part_of_speech import get_part_of_speech

In [8]:
# Step 2: Define the sample text to be processed
# This text contains verbs, nouns, and adverbs that will demonstrate how
# stemming and lemmatization treat words differently.
text = (
    "So many squids are jumping out of suitcases these days that you can barely go anywhere "
    "without seeing one burst forth from a tightly packed valise. I went to the dentist the other day, "
    "and sure enough I saw an angry one jump out of my dentist's bag within minutes of arriving. "
    "She hardly even noticed."
)

# Step 3: Clean the text using regex to remove punctuation and special characters
# The pattern '\W+' matches each run of one or more non-word characters
# (anything other than letters, digits, or underscore) and replaces the run with a single space.
cleaned = re.sub(r'\W+', ' ', text)

# Step 4: Tokenize the cleaned text into a list of individual words
# This allows us to perform stemming and lemmatization on each token.
tokenized = word_tokenize(cleaned)

# Step 5: Initialize the PorterStemmer and apply it to all tokens
# Stemming reduces words to their root forms but can sometimes produce non-dictionary results.
# Example: "jumping" → "jump", "hardly" → "hardli"
stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokenized]

# Step 6: Initialize the WordNetLemmatizer and apply it to all tokens
# Passing a part-of-speech tag tells the lemmatizer which base form to look up;
# without one, WordNetLemmatizer treats every token as a noun.
# It produces cleaner, dictionary-based root forms (e.g., "went" → "go").
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized]

# Step 7: Print out the results to compare stemming vs lemmatization
# You’ll notice that stemming produces rougher cuts, while lemmatization keeps valid words.
print("Stemmed text:")
print(stemmed)
print("\nLemmatized text:")
print(lemmatized)

Stemmed text:
['so', 'mani', 'squid', 'are', 'jump', 'out', 'of', 'suitcas', 'these', 'day', 'that', 'you', 'can', 'bare', 'go', 'anywher', 'without', 'see', 'one', 'burst', 'forth', 'from', 'a', 'tightli', 'pack', 'valis', 'i', 'went', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'i', 'saw', 'an', 'angri', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minut', 'of', 'arriv', 'she', 'hardli', 'even', 'notic']

Lemmatized text:
['So', 'many', 'squid', 'be', 'jump', 'out', 'of', 'suitcase', 'these', 'day', 'that', 'you', 'can', 'barely', 'go', 'anywhere', 'without', 'see', 'one', 'burst', 'forth', 'from', 'a', 'tightly', 'pack', 'valise', 'I', 'go', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'I', 'saw', 'an', 'angry', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minute', 'of', 'arrive', 'She', 'hardly', 'even', 'notice']


# Parsing Text

After cleaning and preprocessing your text, the next step is to understand how words **relate to each other** — this is where **parsing** comes in. Parsing is the process of segmenting text and analyzing its **syntactic structure** (grammar and relationships between words).

## Key Parsing Techniques in NLP

* **Part-of-Speech (POS) Tagging**
  POS tagging identifies whether each word is a **noun**, **verb**, **adjective**, or other grammatical category.
  With NLTK, this is automatic, and far faster than tagging by hand (though, like your grammar teacher, taggers still make the occasional mistake).

* **Named Entity Recognition (NER)**
  NER detects **proper nouns** like people, places, and organizations.
  Example: “**Natalia** moved to **Berlin**” → entities = `[Natalia: PERSON, Berlin: LOCATION]`.
  Recognizing named entities helps infer the **topic** or **context** of the text.

* **Dependency Grammar Trees**
  These structures show how words depend on one another — for example, which noun a verb acts upon.
  Libraries like **spaCy** make this process much easier by automatically building dependency trees, although even they can struggle with ambiguous sentences.

## Syntax Ambiguity Example

Consider this sentence:

> *I saw a cow under a tree with binoculars.*

Who has the binoculars — you, the cow, or the tree?
Even humans need context to interpret it correctly! This shows why **syntactic parsing** is both complex and essential.

## Regex and Phrase Chunking

For simpler, rule-based parsing, **Regular Expressions (Regex)** can be used to extract structured patterns like:

* Email addresses
* Postal codes
* Specific phrase patterns (when combined with POS tagging)

Regex parsing provides flexibility and control, especially for well-defined textual patterns, complementing statistical and deep learning–based parsing methods.
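
A small sketch of rule-based extraction with `re` follows. The two patterns are intentionally simplified assumptions: `EMAIL` does not implement the full email address grammar, and `US_ZIP` covers only US-style 5-digit and ZIP+4 codes. The sample text is made up for the example.

```python
import re

text = ("Contact ada@example.com or grace.hopper@navy.example.org; "
        "ship to 90210 or 10001-1234.")

# Simplified patterns for illustration; real-world formats have more edge cases.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
US_ZIP = re.compile(r"\b\d{5}(?:-\d{4})?\b")

emails = EMAIL.findall(text)
zips = US_ZIP.findall(text)

print(emails)
print(zips)
```

Patterns like these work well when the format is rigid; for anything that depends on grammar or meaning, the statistical parsers described above are a better fit.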


In [11]:
!pip install spacy
!python -m spacy download en_core_web_sm


In [15]:
# Step 1: Import required libraries and the example squid text
# - spaCy: for dependency parsing
# - NLTK Tree: to visualize syntax trees
# - squids_text: example text with silly squid sentences
import spacy
from nltk import Tree

squids_text = "So many squids are jumping out of suitcases these days. You can barely go anywhere without seeing one. I went to the dentist the other day. Sure enough, I saw an angry one jump out of my dentist's bag. She hardly even noticed."

# Step 2: Load the English dependency parser
# spaCy 3 removed the "en" shortcut, so load the model by its full package name
# and prompt for installation if it is missing.
try:
    dependency_parser = spacy.load("en_core_web_sm")
except OSError:
    raise OSError("No English spaCy model found. Install it with: python -m spacy download en_core_web_sm")

# Step 3: Parse the example squid text into a spaCy Doc object
parsed_squids = dependency_parser(squids_text)

# Step 4: Define a custom sentence and parse it as well
# You can replace this text to visualize your own sentence structure.
my_sentence = "The quick brown fox jumps over the lazy dog with binoculars."
my_parsed_sentence = dependency_parser(my_sentence)

# Step 5: Define a recursive helper to convert a spaCy parse tree into an NLTK Tree
# Each node is a token, and its children are syntactically related words.
def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        parsed_child_nodes = [to_nltk_tree(child) for child in node.children]
        return Tree(node.text, parsed_child_nodes)
    else:
        return node.text

# Step 6: Print dependency trees for the squid sentences
# Each tree shows how words depend on each other syntactically.
for sent in parsed_squids.sents:
    to_nltk_tree(sent.root).pretty_print()

# Step 7: Print dependency tree for the custom sentence
for sent in my_parsed_sentence.sents:
    to_nltk_tree(sent.root).pretty_print()

        jumping                
  _________|________________    
 |   |   squids    out      |  
 |   |     |        |       |   
 |   |    many      of     days
 |   |     |        |       |   
are  .     So   suitcases these

          go                       
  ________|____________________     
 |   |    |       |      |  without
 |   |    |       |      |     |    
 |   |    |       |      |   seeing
 |   |    |       |      |     |    
You can barely anywhere  .    one  

          went               
  _________|_________         
 |   |     to        |       
 |   |     |         |        
 |   |  dentist     day      
 |   |     |      ___|____    
 I   .    the   the     other

                   saw                           
  __________________|_________                    
 |   |   |    |              jump                
 |   |   |    |      _________|__________         
 |   |   |    |     |    |    |         out      
 |   |   |    |     |    |    |          |        

# Language Models: Bag-of-Words

Language models help computers understand and predict language by analyzing how often certain words, letters, or phrases appear in a collection of texts, known as a *corpus*. These models are **probabilistic**, meaning they estimate how likely a given word or phrase is to occur based on patterns found in data. Once trained, they can make predictions on new, unseen text.

One of the simplest types is the **unigram** or **bag-of-words** model. It completely ignores grammar and word order, focusing only on how many times each word appears. For instance, after preprocessing and tokenizing the sentence:

> “The squids jumped out of the suitcases.”

the model would produce a mapping like:

```python
{"the": 2, "squid": 1, "jump": 1, "out": 1, "of": 1, "suitcase": 1}
```

If we analyze another sentence —

> “Why are your suitcases full of jumping squids?”

we’d get:

```python
{"why": 1, "be": 1, "your": 1, "suitcase": 1, "full": 1, "of": 1, "jump": 1, "squid": 1}
```

Even though the two sentences have different word orders and grammatical structures, the key words *“jump,” “squid,”* and *“suitcase”* appear in both.

This illustrates why bag-of-words is useful for tasks like **topic modeling** or **sentiment analysis** — it captures the main subjects of a text without worrying about syntax. However, when word order or grammatical relationships matter, more advanced models are needed.
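
The overlap between the two example sentences can be checked directly with `collections.Counter`. This sketch simply reuses the two mappings shown above; the variable names and the one-word `stopwords` set are chosen for illustration.

```python
from collections import Counter

first = Counter({"the": 2, "squid": 1, "jump": 1, "out": 1, "of": 1, "suitcase": 1})
second = Counter({"why": 1, "be": 1, "your": 1, "suitcase": 1,
                  "full": 1, "of": 1, "jump": 1, "squid": 1})

# Words that appear in both bags, regardless of position in the sentence.
shared = sorted(set(first) & set(second))

# Dropping the stopword "of" leaves only the content words.
stopwords = {"of"}
shared_content = [word for word in shared if word not in stopwords]

print(shared)
print(shared_content)
```

Because a bag-of-words is just a mapping from word to count, comparing two texts reduces to comparing dictionary keys and values.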


In [17]:
# ---------- Step 1: Import libraries ----------
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter
from utils.looking_glass import looking_glass_text
from utils.part_of_speech import get_part_of_speech

# Download required NLTK resources (only needed once)
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

# ---------- Step 2: Load text ----------
text = looking_glass_text  # replace with your own string if needed

# ---------- Step 3: Preprocess ----------
# Lowercase & remove punctuation
cleaned = re.sub(r"\W+", " ", text).lower()

# Tokenize into words
tokenized = word_tokenize(cleaned)

# Remove stopwords (common words like "the", "and", etc.)
stop_words = stopwords.words("english")
filtered = [word for word in tokenized if word not in stop_words]

# ---------- Step 4: Lemmatize ----------
normalizer = WordNetLemmatizer()
normalized = [normalizer.lemmatize(token, get_part_of_speech(token)) for token in filtered]
print(normalized)

[nltk_data] Downloading package punkt to /home/alamanna1/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/alamanna1/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/alamanna1/nltk_data...


['however', 'egg', 'get', 'large', 'large', 'human', 'come', 'within', 'yard', 'saw', 'eye', 'nose', 'mouth', 'come', 'close', 'saw', 'clearly', 'humpty', 'dumpty', 'cant', 'anybody', 'else', 'say', 'im', 'certain', 'name', 'write', 'face', 'might', 'write', 'hundred', 'time', 'easily', 'enormous', 'face', 'humpty', 'dumpty', 'sit', 'leg', 'cross', 'like', 'turk', 'top', 'high', 'wallsuch', 'narrow', 'one', 'alice', 'quite', 'wonder', 'could', 'keep', 'balanceand', 'eye', 'steadily', 'fix', 'opposite', 'direction', 'didnt', 'take', 'least', 'notice', 'think', 'must', 'stuff', 'figure', 'exactly', 'like', 'egg', 'say', 'aloud', 'stand', 'hand', 'ready', 'catch', 'every', 'moment', 'expect', 'fall', 'provoke', 'humpty', 'dumpty', 'say', 'long', 'silence', 'look', 'away', 'alice', 'speak', 'call', 'eggvery', 'say', 'look', 'like', 'egg', 'sir', 'alice', 'gently', 'explain', 'egg', 'pretty', 'know', 'add', 'hop', 'turn', 'remark', 'sort', 'compliment', 'people', 'say', 'humpty', 'dumpty', 

[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/alamanna1/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [19]:
# ---------- Step 5: Create Bag-of-Words ----------
# Each unique word gets counted — order doesn’t matter
bag_of_looking_glass_words = Counter(normalized)

# ---------- Step 6: Display most common words ----------
print("Bag of words:")
print(bag_of_looking_glass_words)

Bag of words:
Counter({'humpty': 19, 'dumpty': 19, 'say': 19, 'alice': 16, 'name': 7, 'like': 7, 'think': 7, 'look': 6, 'im': 5, 'know': 5, 'mean': 5, 'go': 5, 'egg': 4, 'fall': 4, 'king': 4, 'would': 4, 'dont': 4, 'come': 3, 'write': 3, 'might': 3, 'sit': 3, 'one': 3, 'didnt': 3, 'take': 3, 'must': 3, 'stand': 3, 'hand': 3, 'remark': 3, 'never': 3, 'last': 3, 'wall': 3, 'horse': 3, 'men': 3, 'almost': 3, 'ask': 3, 'shape': 3, 'good': 3, 'another': 3, 'however': 2, 'large': 2, 'saw': 2, 'eye': 2, 'mouth': 2, 'cant': 2, 'face': 2, 'time': 2, 'narrow': 2, 'quite': 2, 'could': 2, 'long': 2, 'away': 2, 'speak': 2, 'call': 2, 'gently': 2, 'explain': 2, 'add': 2, 'turn': 2, 'conversation': 2, 'couldnt': 2, 'much': 2, 'interrupt': 2, 'course': 2, 'short': 2, 'laugh': 2, 'there': 2, 'cry': 2, 'make': 2, 'riddle': 2, 'thats': 2, 'behind': 2, 'book': 2, 'may': 2, 'ear': 2, 'little': 2, 'afraid': 2, 'old': 2, 'id': 2, 'get': 1, 'human': 1, 'within': 1, 'yard': 1, 'nose': 1, 'close': 1, 'clearly':

In [20]:
# ---------- Do it with a new sentence ----------

# ---------- Step 2: Load text ----------
text = "Artificial intelligence is transforming industries and creating new opportunities every day."

# ---------- Step 3: Preprocess ----------
# Lowercase & remove punctuation
cleaned = re.sub(r"\W+", " ", text).lower()

# Tokenize into words
tokenized = word_tokenize(cleaned)

# Remove stopwords (common words like "the", "and", etc.)
stop_words = stopwords.words("english")
filtered = [word for word in tokenized if word not in stop_words]

# ---------- Step 4: Lemmatize ----------
normalizer = WordNetLemmatizer()
normalized = [normalizer.lemmatize(token, get_part_of_speech(token)) for token in filtered]

# ---------- Step 5: Create Bag-of-Words ----------
# Each unique word gets counted; order doesn't matter
bag_of_sentence_words = Counter(normalized)

# ---------- Step 6: Display the word counts ----------
print("Bag of words:")
print(bag_of_sentence_words)

Bag of words:
Counter({'artificial': 1, 'intelligence': 1, 'transform': 1, 'industry': 1, 'create': 1, 'new': 1, 'opportunity': 1, 'every': 1, 'day': 1})


# **Language Models: N-Gram and Neural Language Models (NLM)**

When we want to analyze entire phrases or predict the next word in a sentence, we need a model that looks beyond individual words.
Unlike **Bag-of-Words**, which ignores word order, the **N-Gram** model takes into account how words appear in sequence. It examines a contiguous window of *n* words and estimates how likely each word is to appear given the *n − 1* words before it.

For instance, in a **bigram** model (*n = 2*), the sentence:

> “The squids jumped out of the suitcases. The squids were furious.”

would produce frequency counts such as the following, where the empty string `''` stands for a sentence boundary:

```python
{('', 'the'): 2, ('the', 'squids'): 2, ('squids', 'jumped'): 1,
 ('jumped', 'out'): 1, ('out', 'of'): 1, ('of', 'the'): 1,
 ('the', 'suitcases'): 1, ('suitcases', ''): 1,
 ('squids', 'were'): 1, ('were', 'furious'): 1, ('furious', ''): 1}
```

This approach captures short-range dependencies — it “remembers” what word tends to follow another — making it better suited for predicting the next word or analyzing common word pairs.
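
The counts above can be reproduced in a few lines, and turned into a simple next-word estimate. This is a minimal sketch: the pre-tokenized `sentences` list and the helper `next_word_probability` are assumptions for illustration, and the probability is a plain maximum-likelihood ratio of counts.

```python
from collections import Counter

# Already lowercased and tokenized, one list per sentence.
sentences = [
    ["the", "squids", "jumped", "out", "of", "the", "suitcases"],
    ["the", "squids", "were", "furious"],
]

bigram_counts = Counter()
for tokens in sentences:
    padded = [""] + tokens + [""]  # "" marks the sentence boundary
    bigram_counts.update(zip(padded, padded[1:]))

def next_word_probability(previous, nxt):
    """Estimate P(nxt | previous) as count(previous, nxt) / count(previous, *)."""
    total = sum(c for (p, _), c in bigram_counts.items() if p == previous)
    return bigram_counts[(previous, nxt)] / total if total else 0.0

p_squids = next_word_probability("the", "squids")
print(bigram_counts[("the", "squids")], round(p_squids, 3))
```

Here "the" is followed by "squids" twice and "suitcases" once, so the model estimates a two-in-three chance that "squids" comes next.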

However, the N-Gram model faces two main challenges:

1. **Unknown words:** if the model encounters a new term like *“mailbox”* that never appeared during training, it can’t assign a meaningful probability. Techniques such as *smoothing* can partially fix this but not perfectly.
2. **Data sparsity:** as *n* grows (for example, from bigram to trigram or higher), the number of unique sequences explodes, while examples for each become rare — making reliable probability estimates difficult.
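
The smoothing idea from the first point can be sketched with add-one (Laplace) smoothing, one of the simplest variants. The `<unk>` placeholder token and the function name `smoothed_probability` are assumptions for this example; production systems typically use more refined schemes.

```python
from collections import Counter

tokens = "the squids jumped out of the suitcases".split()

# Reserve a placeholder so out-of-vocabulary words map to a known entry.
vocab = set(tokens) | {"<unk>"}

bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])  # how often each word is followed by something

def smoothed_probability(previous, word):
    """Add-one estimate of P(word | previous); never returns zero."""
    previous = previous if previous in vocab else "<unk>"
    word = word if word in vocab else "<unk>"
    return (bigram_counts[(previous, word)] + 1) / (context_counts[previous] + len(vocab))

p_seen = smoothed_probability("the", "squids")
p_unseen = smoothed_probability("the", "mailbox")
print(p_seen, p_unseen)
```

The unseen word "mailbox" now gets a small non-zero probability, at the cost of taking some probability mass away from bigrams that were actually observed.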

To overcome these limitations, researchers developed **Neural Language Models (NLMs)**.
Instead of relying on explicit counts, NLMs use neural networks to learn dense **contextual representations** of words and predict the next word from them. Because similar words end up with similar representations, these models generalize better to word sequences they never saw during training and can capture longer-range dependencies than a fixed *n*-word window.

Common NLM architectures include **LSTMs (Long Short-Term Memory networks)** and **Transformer models**, which form the foundation for today’s most advanced language systems like GPT.


In [23]:
# ---------- N-Gram Exploration with 'Through the Looking Glass' Text ----------

import nltk, re
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from collections import Counter
from utils.looking_glass_full import looking_glass_full_text

In [24]:
# ---------- Step 1: Preprocess the original text ----------
cleaned = re.sub(r'\W+', ' ', looking_glass_full_text).lower()
tokenized = word_tokenize(cleaned)

# ---------- Step 2: Create different N-Grams ----------
# Bigrams (n = 2)
looking_glass_bigrams = ngrams(tokenized, 2)
looking_glass_bigrams_frequency = Counter(looking_glass_bigrams)

print("Looking Glass Bigrams:")
print(looking_glass_bigrams_frequency.most_common(10))

Looking Glass Bigrams:
[(('of', 'the'), 101), (('said', 'the'), 98), (('in', 'a'), 97), (('in', 'the'), 90), (('as', 'she'), 82), (('you', 'know'), 72), (('a', 'little'), 68), (('the', 'queen'), 67), (('said', 'alice'), 67), (('to', 'the'), 66)]


In [25]:
# Trigrams (n = 3)
looking_glass_trigrams = ngrams(tokenized, 3)
looking_glass_trigrams_frequency = Counter(looking_glass_trigrams)

print("\nLooking Glass Trigrams:")
print(looking_glass_trigrams_frequency.most_common(10))


Looking Glass Trigrams:
[(('the', 'red', 'queen'), 54), (('the', 'white', 'queen'), 31), (('said', 'in', 'a'), 21), (('she', 'went', 'on'), 18), (('said', 'the', 'red'), 17), (('thought', 'to', 'herself'), 16), (('the', 'queen', 'said'), 16), (('said', 'to', 'herself'), 14), (('said', 'humpty', 'dumpty'), 14), (('the', 'knight', 'said'), 14)]


In [26]:
# Larger N-Grams (n > 3)
looking_glass_ngrams = ngrams(tokenized, 5)
looking_glass_ngrams_frequency = Counter(looking_glass_ngrams)

print("\nLooking Glass 5-grams:")
print(looking_glass_ngrams_frequency.most_common(10))


Looking Glass 5-grams:
[(('one', 'and', 'one', 'and', 'one'), 8), (('and', 'one', 'and', 'one', 'and'), 7), (('for', 'a', 'minute', 'or', 'two'), 6), (('the', 'lion', 'and', 'the', 'unicorn'), 6), (('as', 'well', 'as', 'she', 'could'), 5), (('is', 'worth', 'a', 'thousand', 'pounds'), 4), (('the', 'walrus', 'and', 'the', 'carpenter'), 4), (('said', 'to', 'herself', 'as', 'she'), 4), (('twas', 'brillig', 'and', 'the', 'slithy'), 3), (('brillig', 'and', 'the', 'slithy', 'toves'), 3)]


In [28]:
# Larger N-Grams (n > 3)
looking_glass_20grams = ngrams(tokenized, 20)
looking_glass_ngrams_frequency = Counter(looking_glass_20grams)

print("\nLooking Glass 20-grams:")
print(looking_glass_ngrams_frequency.most_common(10))


Looking Glass 20-grams:
[(('twas', 'brillig', 'and', 'the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble', 'in', 'the', 'wabe', 'all', 'mimsy', 'were', 'the', 'borogoves', 'and', 'the'), 3), (('brillig', 'and', 'the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble', 'in', 'the', 'wabe', 'all', 'mimsy', 'were', 'the', 'borogoves', 'and', 'the', 'mome'), 3), (('and', 'the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble', 'in', 'the', 'wabe', 'all', 'mimsy', 'were', 'the', 'borogoves', 'and', 'the', 'mome', 'raths'), 3), (('the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble', 'in', 'the', 'wabe', 'all', 'mimsy', 'were', 'the', 'borogoves', 'and', 'the', 'mome', 'raths', 'outgrabe'), 3), (('chapter', 'i', 'looking', 'glass', 'house', 'one', 'thing', 'was', 'certain', 'that', 'the', 'white', 'kitten', 'had', 'had', 'nothing', 'to', 'do', 'with', 'it'), 1), (('i', 'looking', 'glass', 'house', 'one', 'thing', 'was', 'certain', 'that', 'the', 'white', 'kitten', 'had', 'had', 'not

# **Topic Models**

Sometimes, a text is so long or complex that the main themes aren’t immediately clear. **Topic modeling** is an area of NLP designed to uncover these *hidden* or *latent* topics automatically.
It groups together words that frequently appear in the same contexts, helping us identify the underlying structure or themes in a collection of documents.

A common first step in this process is to calculate **term frequency–inverse document frequency (TF-IDF)**.
While it might sound counterintuitive, TF-IDF actually *downweights* very common words (like “the” or “is”) and *upweights* rarer, more distinctive ones — because those unique words are often what define a topic.
Python libraries such as **gensim** and **scikit-learn (sklearn)** provide easy-to-use tools for performing this weighting automatically.
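
To see the weighting at work, here is a hand-rolled sketch using one common textbook formulation, tf × log(N / df). Libraries such as scikit-learn apply smoothed and normalized variants, so their numbers will differ. The three toy documents and the `tf_idf` helper are made up for illustration.

```python
import math
from collections import Counter

# Three tiny, already preprocessed "documents".
docs = [
    "the squid jump out of the suitcase".split(),
    "the dentist see the squid".split(),
    "the suitcase be full".split(),
]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """Score each term as (relative frequency in doc) * log(N / document frequency)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(len(docs) / doc_freq[term])
        for term, count in counts.items()
    }

scores = tf_idf(docs[0])
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term:10s} {score:.3f}")
```

"the" appears in every document, so its score is zero even though it is the most frequent word in the first document, while "jump", which appears in only one document, scores highest together with the other single-document terms.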

Once you have a bag-of-words or TF-IDF representation, you can apply a statistical method called **Latent Dirichlet Allocation (LDA)**. LDA is defined over raw word counts, so feeding it TF-IDF weights is a practical shortcut rather than part of the original model.
LDA looks for words that tend to occur together across different documents, revealing clusters that represent potential topics. For instance, “suitcase,” “jump,” and “squid” might frequently appear in the same context, suggesting a “traveling squids” topic.

To visualize these relationships, you can use **word2vec**, a neural embedding model that maps words into a spatial representation.
In this space, words that share similar meanings or appear in similar contexts are placed closer together.
So in our earlier squid example, *“suitcase,” “jump,”* and *“squid”* would likely form a tight cluster — a clear illustration of how words connect within topics.
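
"Closer together" is usually measured with cosine similarity between word vectors. The sketch below uses hand-picked three-dimensional vectors purely for illustration; they are not output from a trained word2vec model, whose vectors typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy embeddings, NOT learned from data.
toy_vectors = {
    "squid": [0.9, 0.8, 0.1],
    "suitcase": [0.8, 0.9, 0.2],
    "dentist": [0.1, 0.2, 0.9],
}

close = cosine_similarity(toy_vectors["squid"], toy_vectors["suitcase"])
far = cosine_similarity(toy_vectors["squid"], toy_vectors["dentist"])
print(round(close, 3), round(far, 3))
```

With real embeddings the same comparison is how "nearest neighbour" word lists and cluster plots are produced.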


In [32]:
# ---------- Topic Modeling: BoW vs TF-IDF with custom stop_list ----------

import nltk, re
from utils.sherlock_holmes import (
    bohemia_ch1, bohemia_ch2, bohemia_ch3,
    boscombe_ch1, boscombe_ch2, boscombe_ch3
)
from utils.preprocessing import preprocess_text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# ---------- Prepare corpus ----------
corpus = [bohemia_ch1, bohemia_ch2, bohemia_ch3, boscombe_ch1, boscombe_ch2, boscombe_ch3]
preprocessed_corpus = [preprocess_text(ch) for ch in corpus]

# ---------- Update stop_list (ensure >= 10 and deduplicated) ----------
stop_list = [
    "say", "see", "holmes", "shall", "man", "upon", "know", "quite", "one",
    "well", "could", "would", "take", "may", "think", "come", "go", "little",
    "must", "look"
]
stop_list = sorted(set(w.lower() for w in stop_list))  # dedupe & lowercase

# ---------- Filter stop words from the preprocessed corpus for BoW ----------
def filter_out_stop_words(corpus, stops):
    out = []
    for chapter in corpus:
        tokens = chapter.split()
        out.append(" ".join([w for w in tokens if w not in stops]))
    return out

filtered_for_stops = filter_out_stop_words(preprocessed_corpus, stop_list)

# ---------- Vectorize ----------
bag_of_words_creator = CountVectorizer()
bag_of_words = bag_of_words_creator.fit_transform(filtered_for_stops)

tfidf_creator = TfidfVectorizer(min_df=0.2)
tfidf = tfidf_creator.fit_transform(preprocessed_corpus)

# ---------- LDA on BoW ----------
lda_bow = LatentDirichletAllocation(
    n_components=10, learning_method="online", random_state=42
)
lda_bow_topics = lda_bow.fit_transform(bag_of_words)

# ---------- LDA on TF-IDF ----------
lda_tfidf = LatentDirichletAllocation(
    n_components=10, learning_method="online", random_state=42
)
lda_tfidf_topics = lda_tfidf.fit_transform(tfidf)

# ---------- Helper to print top terms ----------
def top_terms(model, vectorizer, top_k=5):
    feature_names = vectorizer.get_feature_names_out()
    for topic_id, topic in enumerate(model.components_):
        top_idxs = topic.argsort()[:-top_k-1:-1]
        terms = " ".join(feature_names[i] for i in top_idxs)
        print(f"Topic #{topic_id + 1}: {terms}")

print("~~~ Topics found by bag of words LDA ~~~")
top_terms(lda_bow, bag_of_words_creator, top_k=5)

print("\n\n~~~ Topics found by tf-idf LDA ~~~")
top_terms(lda_tfidf, tfidf_creator, top_k=5)

~~~ Topics found by bag of words LDA ~~~
Topic #1: hand eye son still leave
Topic #2: hand mccarthy street find give
Topic #3: young never walk word find
Topic #4: street room house hand make
Topic #5: majesty king sherlock photograph mr
Topic #6: father mr case remark coroner
Topic #7: mccarthy father son find young
Topic #8: photograph room make find give
Topic #9: leave father mr case back
Topic #10: two speak case doubt side


~~~ Topics found by tf-idf LDA ~~~
Topic #1: circumstance thoroughly however truly miss
Topic #2: scarlet many mumble advantage minor
Topic #3: eight wave tie cloud double
Topic #4: another consider still instinct many
Topic #5: convey utter scene opening farm
Topic #6: king say holmes call alarm
Topic #7: hard favour effect write toss
Topic #8: presumably watch understand three wire
Topic #9: hold clergyman indicate six deadly
Topic #10: holmes say upon know man


# **Text Similarity**

If you’ve ever had your phone autocorrect a message into something embarrassing, you’ve experienced one of the toughest challenges in Natural Language Processing: **text similarity**.
Determining how closely two words or sentences resemble each other — whether in spelling, sound, or meaning — underlies many NLP tasks, from autocorrect and plagiarism detection to recommendation systems.

A foundational approach to comparing words is the **Levenshtein distance** (also called *edit distance*).
It measures the minimum number of changes — **insertions**, **deletions**, or **substitutions** — required to transform one word into another.
For example, converting *“bees”* into *“beans”* takes one substitution (changing “e” → “a”) and one insertion (“n”), giving a distance of **2**.
The smaller the distance, the more similar the words.
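
To make the definition concrete, here is a minimal sketch of the standard dynamic-programming approach to edit distance. The function name `levenshtein` is chosen for illustration and is not part of the lesson code; in practice a library routine does the same job.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `source` into `target`."""
    # previous[j] is the distance between the characters of `source`
    # processed so far (minus the current one) and the first j of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,                      # delete s_char
                current[j - 1] + 1,                   # insert t_char
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("bees", "beans"))  # prints 2
```

Only two rows of the table are kept at a time, so memory grows with the length of `target` rather than with the product of both lengths.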

Beyond spelling, similarity can also be **phonetic** — based on how words *sound*.
Humans easily distinguish between *“euthanasia”* and *“youth in Asia”*, but for machines, this is extremely difficult.
Advanced systems handle this by incorporating phonetic distance measures and even keyboard layout proximity to simulate human-like correction.

Text similarity is also crucial in other contexts:

* **Lexical similarity** compares the overlap in vocabulary (useful for plagiarism detection).
* **Semantic similarity** measures shared meaning — for example, finding articles or books similar in theme or topic to one you’ve read.
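
One simple way to put a number on lexical similarity is the Jaccard index: the count of distinct words two texts share, divided by the count of distinct words in either. The sketch below is an illustrative choice rather than a method from the lesson; the function name and the tokenizing regular expression are assumptions.

```python
import re


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Share of distinct words that two texts have in common (0.0 to 1.0)."""
    words_a = set(re.findall(r"[a-z0-9']+", text_a.lower()))
    words_b = set(re.findall(r"[a-z0-9']+", text_b.lower()))
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


# The two sentences share 3 of 7 distinct words.
print(jaccard_similarity("the cat sat on the mat", "the cat lay on the rug"))
```

Because it only compares vocabulary, this measure says nothing about meaning: two paraphrases with no words in common score 0.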

In essence, text similarity helps machines understand not just how words *look* or *sound*, but what they *mean* in relation to one another.


In [34]:
import nltk
# NLTK has a built-in function
# to check Levenshtein distance:
from nltk.metrics import edit_distance

def print_levenshtein(string1, string2):
    print("The Levenshtein distance from '{0}' to '{1}' is {2}!".format(
        string1, string2, edit_distance(string1, string2))
    )

# Check the distance between any two words here!
print_levenshtein("fart", "target")

# Assign passing strings here:
# "code" → "molds" needs 3 edits: substitute 'c'→'m', insert 'l', substitute 'e'→'s'
three_away_from_code = "molds"

# "chunk" → "churn" needs 2 edits: 'n'→'r', 'k'→'n'
two_away_from_chunk = "churn"

print_levenshtein("code", three_away_from_code)
print_levenshtein("chunk", two_away_from_chunk)

The Levenshtein distance from 'fart' to 'target' is 3!
The Levenshtein distance from 'code' to 'molds' is 3!
The Levenshtein distance from 'chunk' to 'churn' is 2!


# Language Prediction & Text Generation

How does your phone’s keyboard know what you want to type next? How does a search engine complete your query before you finish typing?
Both are examples of **language prediction**, a core application of **Natural Language Processing (NLP)** that predicts upcoming text from the words that precede it.

## Common Applications

* **Autosuggest** and **Autocomplete** (e.g., search bars, messaging apps)
* **Suggested Replies** (e.g., “Sounds good!”, “See you soon!”)
* **Predictive Text** in keyboards and chatbots

## Choosing a Language Model

Language prediction starts with selecting a **language model**, which defines how text sequences are represented and predicted.

### 1. Bag of Words (BoW)

A **BoW** model represents text as unordered word counts, so it ignores word order entirely. As a result, its predictions fall back to the most common words in the training corpus, which is not ideal for generating meaningful sequences.
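
A small sketch shows why. The corpus and the helper `predict_next_word` are invented for illustration; the point is that a bag of words has nothing to consult except overall frequency.

```python
from collections import Counter

# Toy training corpus, invented for this example
corpus = "the cat sat on the mat and the dog sat on the rug"
bag_of_words = Counter(corpus.split())


def predict_next_word(preceding_text: str) -> str:
    # preceding_text is deliberately unused: word counts carry no order
    # information, so every context gets the same answer.
    return bag_of_words.most_common(1)[0][0]


print(predict_next_word("the cat sat on"))  # prints "the"
print(predict_next_word("and the dog"))     # prints "the"
```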

### 2. N-gram Models & Markov Chains

An **n-gram** model uses a sliding window of *n* words (or characters) to predict the next token.
It typically relies on a **Markov chain**, which is **memoryless**: each prediction depends only on the preceding *n* − 1 tokens in the window, not on anything that came earlier.

**Example:**

> Starting with “I ate so many grilled cheese”,
> a **trigram** model (n=3) might predict **“sandwiches”**
> because in the training corpus, the word “sandwiches” frequently follows “grilled cheese”.

The Markov property limits this model’s understanding — it can’t “remember” context beyond its window.
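
The grilled-cheese example can be sketched in a few lines. This is a minimal illustration under stated assumptions: the toy corpus and the helper names `train_trigram_model` and `predict_next` are invented here, and the prediction is simply the most frequent follower rather than a sampled one.

```python
from collections import Counter, defaultdict


def train_trigram_model(tokens):
    """Map each pair of consecutive words to a Counter of the words that followed it."""
    model = defaultdict(Counter)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        model[(first, second)][third] += 1
    return model


def predict_next(model, text):
    """Return the most frequent follower of the last two words, or None if unseen."""
    context = tuple(text.lower().split()[-2:])
    followers = model.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus, invented for illustration
corpus = (
    "i love grilled cheese sandwiches . "
    "grilled cheese sandwiches are the best . "
    "she ordered grilled cheese fries ."
).split()

model = train_trigram_model(corpus)
print(predict_next(model, "I ate so many grilled cheese"))  # prints "sandwiches"
```

Only the last two words of the prompt are ever consulted; "I ate so many" has no influence on the result, which is the Markov property in action.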

### 3. Neural Language Models (LSTMs)

A more advanced method uses **Long Short-Term Memory (LSTM)** networks — a type of **recurrent neural network (RNN)**.
LSTMs maintain an internal memory state, allowing them to model **long-term dependencies** in text.

This makes them far more effective at generating coherent, contextually relevant text — a foundation for modern **text generation**, **translation**, and **chatbots**.


In [37]:
# Step 1: Import necessary libraries and training documents
# - nltk: for tokenization
# - re: for text cleaning
# - random: for stochastic text generation
# - defaultdict, deque: for managing key-value lists and context windows
# - document1, document2, document3: contain your chosen text data (stories or lyrics)
import nltk, re, random
from nltk.tokenize import word_tokenize
from collections import defaultdict, deque
from utils.document1 import training_doc1
from utils.document2 import training_doc2
from utils.document3 import training_doc3

# Step 2: Define a MarkovChain class to build and use a bigram-based model
class MarkovChain:
    def __init__(self):
        # lookup_dict stores possible next words for each word in the corpus
        self.lookup_dict = defaultdict(list)
        self._seeded = False
        self.__seed_me()

    # Step 3: Seed the random number generator once per instance
    # (with no rand_seed, Python picks its own seed, so output varies between runs)
    def __seed_me(self, rand_seed=None):
        if self._seeded is not True:
            try:
                if rand_seed is not None:
                    random.seed(rand_seed)
                else:
                    random.seed()
                self._seeded = True
            except NotImplementedError:
                self._seeded = False

    # Step 4: Add a document’s text to the Markov chain
    # This preprocesses the text and updates lookup_dict with word pairs
    def add_document(self, text):
        preprocessed_list = self._preprocess(text)
        pairs = self.__generate_tuple_keys(preprocessed_list)
        for pair in pairs:
            self.lookup_dict[pair[0]].append(pair[1])

    # Step 5: Preprocess text by cleaning and tokenizing
    # - Replaces runs of non-word characters (punctuation, symbols) with a space
    # - Converts to lowercase
    # - Tokenizes using NLTK
    def _preprocess(self, text):
        cleaned = re.sub(r'\W+', ' ', text).lower()
        tokenized = word_tokenize(cleaned)
        return tokenized

    # Step 6: Create word pairs (bigrams) for building the Markov chain
    def __generate_tuple_keys(self, data):
        if len(data) < 1:
            return
        for i in range(len(data) - 1):
            yield [data[i], data[i + 1]]

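    # Step 7: Generate text based on the trained Markov chain
    # - Starts from the first key in lookup_dict (the first word of the first document added)
    # - Picks each next word at random from the words observed after the current one;
    #   repeated entries in the list make frequent followers more likely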
    def generate_text(self, max_length=50):
        context = deque()
        output = []
        if len(self.lookup_dict) > 0:
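            # Note: this call has no effect, because __init__ already seeded the
            # generator and __seed_me does nothing once _seeded is True.
            self.__seed_me(rand_seed=len(self.lookup_dict))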
            chain_head = [list(self.lookup_dict)[0]]
            context.extend(chain_head)

            while len(output) < (max_length - 1):
                next_choices = self.lookup_dict[context[-1]]
                if len(next_choices) > 0:
                    next_word = random.choice(next_choices)
                    context.append(next_word)
                    output.append(context.popleft())
                else:
                    break
            output.extend(list(context))
        return " ".join(output)

# Step 8: Initialize the MarkovChain and add the three training documents
my_markov = MarkovChain()
my_markov.add_document(training_doc1)
my_markov.add_document(training_doc2)
my_markov.add_document(training_doc3)

# Step 9: Generate a short piece of text using the trained model
generated_text = my_markov.generate_text()

# Step 10: Print the generated text for inspection
print(generated_text)

there s buying a stairway lies on the truth by the present our shadows taller than our life is the opinion of chance my unconquerable soul there s still turns to pole i have the best revenge is humming and the stores are two meanings in case you will find


# Advanced NLP Topics

You’ve now explored the foundations of **Natural Language Processing (NLP)** — but this is just the beginning. The field is vast, with advanced topics and applications driven largely by **deep learning** and **neural networks**.

## Naive Bayes Classifiers

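A **Naive Bayes classifier** is a **supervised machine learning algorithm** that applies **Bayes’ theorem** to estimate the probability of each class given the input, under the “naive” assumption that features (such as individual words) are independent of one another given the class.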
It’s particularly effective for:

* **Sentiment analysis** — determining whether text conveys positive or negative emotion
* **Spam detection** — classifying emails as spam or legitimate messages

Despite its simplicity, Naive Bayes remains a strong baseline model in NLP tasks.
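
To show how the probabilities combine, here is a minimal pure-Python sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is an illustration, not the lesson's implementation: the tiny training set and the function names `train_naive_bayes` and `classify` are invented for this example.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_docs):
    """Count documents and words per class from (text, label) pairs."""
    doc_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary


def classify(text, doc_counts, word_counts, vocabulary):
    """Return the label with the highest posterior log-probability."""
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # log P(label): the prior
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # log P(word | label), with add-one smoothing
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data, invented for illustration
training_data = [
    ("great fun and very clear", "positive"),
    ("clear and engaging lesson", "positive"),
    ("boring and confusing", "negative"),
    ("confusing and far too long", "negative"),
]
model = train_naive_bayes(training_data)
print(classify("a clear and fun lesson", *model))  # prints "positive"
```

Summing logarithms instead of multiplying raw probabilities avoids numerical underflow on longer texts. Smoothing keeps a word that appears in only one class from driving the other class's probability to zero.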

## Machine Translation

Modern **machine translation** systems (like Google Translate) have improved dramatically using **neural networks** and **LSTM (Long Short-Term Memory)** architectures.
However, even these models struggle with nuances such as idioms, cultural context, and tone — areas where human understanding still outperforms machines.

## Accessibility and Assistive Technology

NLP plays a major role in **language accessibility**:

* **Text-to-speech (TTS)** converts written text into spoken words.
* **Speech recognition** enables voice commands and dictation.

These technologies, powered by neural language models, have made digital environments much more inclusive for people with disabilities.

## Detecting Bias in Language

NLP can also help uncover **bias** in text — whether political, social, or cultural.
By analyzing patterns in word usage, tone, and framing, NLP tools can highlight subtle biases in speeches, articles, or media coverage, promoting more transparent communication.

In short, NLP is evolving rapidly — from sentiment analysis to speech synthesis — transforming the way humans and machines understand each other.


In [48]:
import os, sys, importlib, pprint

# Ensure project root is on sys.path
if os.getcwd() not in sys.path:
    sys.path.insert(0, os.getcwd())

# Hard-reload the modules to defeat caching
import utils.review as rv
import utils.train_matrix as tm
importlib.reload(rv)
importlib.reload(tm)

# Now import the names exactly
from utils.review import counter
from utils.train_matrix import training_counts

from sklearn.naive_bayes import MultinomialNB

In [49]:
# Step 2: Add your review as a string to be classified
# Write a short text summarizing your experience or impression of the NLP lesson.
review = (
    "The lesson on Natural Language Processing has been very engaging and informative, "
    "introducing core NLP concepts clearly, though it contained so many new ideas that I "
    "might need a bit more time to consolidate them."
)

# Step 3: Convert your review text into numerical feature counts
# This uses the same vocabulary learned during training.
review_counts = counter.transform([review])

# Step 4: Initialize and train a Naive Bayes classifier on the provided dataset
# The labels indicate 0 = negative and 1 = positive reviews.
classifier = MultinomialNB()
training_labels = [0] * 1000 + [1] * 1000
classifier.fit(training_counts, training_labels)

# Step 5: Predict the probability of the review being positive or negative
neg = (classifier.predict_proba(review_counts)[0][0] * 100).round()
pos = (classifier.predict_proba(review_counts)[0][1] * 100).round()

# Step 6: Display a custom message depending on classification confidence
if pos > 50:
    print("Thank you for your positive review!")
elif neg > 50:
    print("We're sorry this hasn't been the best possible lesson for you! We're always looking to improve.")
else:
    print("Naive Bayes cannot determine if this is negative or positive. Thank you or we're sorry?")

# Step 7: Print the computed probabilities for transparency
print(
    "\nAccording to our trained Naive Bayes classifier, "
    "the probability that your review was negative was {0}% "
    "and the probability it was positive was {1}%.".format(neg, pos)
)

Thank you for your positive review!

According to our trained Naive Bayes classifier, the probability that your review was negative was 28.0% and the probability it was positive was 72.0%.


# Challenges and Considerations

With the power of **Natural Language Processing (NLP)** comes significant responsibility. As NLP technologies continue to evolve, it’s important to reflect on the **ethical, linguistic, and privacy challenges** they bring.

## 1. Language and Cultural Bias

Many NLP tools are developed primarily for **English** and often by **English-speaking researchers**.
As a result:

* Models may perform poorly on other languages.
* They can inherit **cultural and linguistic biases** specific to English speakers.
* This creates inequities in global accessibility and accuracy.

## 2. Sociolinguistic Diversity

Even within English, variation is immense — by **region**, **social background**, **gender**, and **dialect**.
Imagine if Amazon Alexa only understood **wealthy men from coastal U.S. cities** — this highlights how limited datasets can result in biased systems.
When designing NLP systems, we must ask: *Who is the tool really built for?*

## 3. Bias Propagation

NLP systems can **amplify bias** present in their training data.
Even with good intentions, developers may inadvertently create tools that reproduce or worsen societal prejudices.
A key responsibility of NLP practitioners is to identify, mitigate, and document potential biases in both the data and the model behavior.

## 4. Privacy Concerns

NLP tools often depend on **large volumes of personal data** — conversations, emails, or voice inputs.
Developers must consider:

* **Who collects the data?**
* **How much data is stored?**
* **What is it used for?**

Transparency and responsible data management are essential to maintain user trust and protect privacy.

---

In short, building ethical NLP systems requires not just technical skill but **awareness, empathy, and accountability** — ensuring that language technologies serve *all* users fairly and responsibly.


In [52]:
# Step 2: Add a baseline review in standard English (no slang),
# for comparison with the slang version
review = "It was fun!"

# Step 3: Convert your review into numerical token counts using the pre-trained vectorizer
review_counts = counter.transform([review])

# Step 4: Initialize the Naive Bayes classifier and define labels for training
# 0 = negative review, 1 = positive review
classifier = MultinomialNB()
training_labels = [0] * 1000 + [1] * 1000

# Step 5: Fit the classifier on the existing training dataset
classifier.fit(training_counts, training_labels)

# Step 6: Predict probabilities for the review being negative or positive
neg = (classifier.predict_proba(review_counts)[0][0] * 100).round()
pos = (classifier.predict_proba(review_counts)[0][1] * 100).round()

# Step 7: Print classification result based on probability thresholds
if pos > 50:
    print("Naive Bayes classifies this as positive.")
elif neg > 50:
    print("Naive Bayes classifies this as negative.")
else:
    print("Naive Bayes cannot determine if this is negative or positive.")

# Step 8: Print the computed probabilities for both sentiment categories
print(
    "\nAccording to our trained Naive Bayes classifier, "
    "the probability that your review was negative was {0}% "
    "and the probability it was positive was {1}%.".format(neg, pos)
)

Naive Bayes classifies this as positive.

According to our trained Naive Bayes classifier, the probability that your review was negative was 16.0% and the probability it was positive was 84.0%.


In [53]:
# Step 2: Add your review text containing slang or informal expressions
# Try replacing with slang like "lit", "dope", "fire", or "vibes" to see how the model reacts.
review = "It was lit!"

# Step 3: Convert your review into numerical token counts using the pre-trained vectorizer
review_counts = counter.transform([review])

# Step 4: Initialize the Naive Bayes classifier and define labels for training
# 0 = negative review, 1 = positive review
classifier = MultinomialNB()
training_labels = [0] * 1000 + [1] * 1000

# Step 5: Fit the classifier on the existing training dataset
classifier.fit(training_counts, training_labels)

# Step 6: Predict probabilities for the review being negative or positive
neg = (classifier.predict_proba(review_counts)[0][0] * 100).round()
pos = (classifier.predict_proba(review_counts)[0][1] * 100).round()

# Step 7: Print classification result based on probability thresholds
if pos > 50:
    print("Naive Bayes classifies this as positive.")
elif neg > 50:
    print("Naive Bayes classifies this as negative.")
else:
    print("Naive Bayes cannot determine if this is negative or positive.")

# Step 8: Print the computed probabilities for both sentiment categories
print(
    "\nAccording to our trained Naive Bayes classifier, "
    "the probability that your review was negative was {0}% "
    "and the probability it was positive was {1}%.".format(neg, pos)
)

Naive Bayes classifies this as negative.

According to our trained Naive Bayes classifier, the probability that your review was negative was 66.0% and the probability it was positive was 34.0%.


# NLP Review — Final Remarks

You’ve covered a lot of ground in understanding **Natural Language Processing (NLP)** — the field that blends **computer science**, **linguistics**, and **artificial intelligence** to help computers understand and generate human language.

## Key Takeaways

* **NLP Foundations**
  NLP enables computers to process, interpret, and respond to human language in meaningful ways.

* **NLTK Library**
  Python’s **Natural Language Toolkit (NLTK)** provides a rich set of tools for tokenization, stemming, lemmatization, parsing, and more.

* **Text Preprocessing**
  Cleaning and preparing raw text — removing noise, normalizing case, and tokenizing — is a crucial first step for all NLP tasks.

* **Parsing**
  Parsing analyzes grammatical structure and syntax to understand relationships between words and phrases.

* **Language Models**
  Models like **Bag-of-Words**, **n-grams**, and **neural networks (LSTMs)** predict or generate text based on statistical or learned patterns.

* **Topic Modeling**
  Uncovers hidden themes or subjects in large text corpora using algorithms such as LDA (Latent Dirichlet Allocation).

* **Text Similarity**
  Measures how closely related two pieces of text are — essential for search, clustering, and recommendation systems.

* **Language Prediction**
  Powers applications like autocomplete, autosuggest, and smart replies by predicting likely next words or phrases.

* **Ethical & Social Considerations**
  NLP systems must address issues of **bias**, **fairness**, and **privacy**, ensuring inclusivity and responsible use of language data.

---

By mastering these concepts, you now have a strong foundation in how NLP helps machines *understand, analyze, and generate* human language — responsibly and intelligently.
