# Class 6 Notebook – Natural Language Processing (NLP) Basics

This notebook introduces **Natural Language Processing (NLP)** using a small, end-to-end example:
working with raw text, doing basic **tokenization** and **cleaning**, and preparing it for later modeling.

We will connect code back to the Class 6 deck concepts:
- Tokenization and stop words
- Stemming vs lemmatization (conceptual)
- TF–IDF (Term Frequency – Inverse Document Frequency): a short example in Steps 4–5
- Basic text classification (e.g., positive vs negative phrases): a toy example here, more in the separate demos

**Objective**: Build a tiny NLP pipeline that:
1. Pre-processes text (tokenization, lowercasing, stop words, simple cleaning)
2. Creates a tiny labeled text dataset
3. Shows how simple preprocessing changes the text

A follow‑up notebook (`NLP_Demos.ipynb`) shows how to add TF–IDF and classifiers on top of this.

Run the first code cell to confirm your environment works.

## Run in the browser (no local setup)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adzuci/ai-fundamentals/blob/class-6-natural-language-processing/class-6-natural-language-processing/01_class_6_nlp_basics.ipynb)

> Tip: Make sure you are comfortable with basic Python and scikit-learn from Classes 2–3 before this notebook.

## What is NLP (Natural Language Processing)?

**Natural Language Processing (NLP)** is about getting computers to work with **human language**:
understanding text, extracting information, and generating language.

Common applications (from the deck):
- **Question Answering** – Answering questions from text or a knowledge base.
- **Information Extraction** – Pulling structured fields from text (e.g., meeting *Time*, *Venue*).
- **Machine Translation** – Translating between languages.
- **Text summarization / keyword extraction** – Shortening long documents or extracting key phrases.
- **Sentiment analysis** – Detecting whether text is positive, negative, or neutral.
- **Context analysis / topic detection** – Understanding what a conversation or document is about.

In this notebook we focus on a **very small slice** of NLP:
- Turning text into numeric features (TF–IDF)
- Training a small classifier for a toy sentiment-like task.

## STEP 1: Install and import libraries

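We use:
- **NumPy** for arrays
- **re** (regular expressions) for simple text cleaning
- **NLTK** for sentence and word tokenization
- **scikit-learn** for the TF–IDF features and the classifier in Steps 4–6.

> Scikit-learn, TF–IDF, and classifiers are covered in more depth in `NLP_Demos.ipynb`.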

In [12]:
# Environment sanity check + imports
import platform

print("Python:", platform.python_version())
print("OS:", platform.system(), platform.release())

try:
    import numpy as np
    import re
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize, sent_tokenize

    print("NumPy:", np.__version__)
    print("All libraries imported successfully!")
except ModuleNotFoundError as exc:
    print("Missing dependency:", exc)
    print("Install with: python -m pip install numpy nltk")
    raise

Python: 3.10.14
OS: Darwin 25.2.0
NumPy: 2.2.6
All libraries imported successfully!


In [13]:
# Concept: Sentence and word tokenization with NLTK (for in-class exercise)
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Make sure the punkt models are available (needed once per environment).
# If the downloads fail (e.g., no internet) and the models are not already installed,
# the tokenizer calls below raise LookupError; the later cells do not use NLTK.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

text = 'NLP is amazing. It helps computers understand language'
print(text)

# Sentence-level tokenization
my_sentences = sent_tokenize(text)
print('Sentences:', my_sentences)

# Word-level tokenization
my_words = word_tokenize(text)
print('Words:', my_words)

NLP is amazing. It helps computers understand language
Sentences: ['NLP is amazing.', 'It helps computers understand language']
Words: ['NLP', 'is', 'amazing', '.', 'It', 'helps', 'computers', 'understand', 'language']
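
If the NLTK models are unavailable, a rough approximation of both tokenizers can be written with the standard `re` module. This is a minimal sketch, not what NLTK does internally; the function names `regex_sent_tokenize` and `regex_word_tokenize` are made up for this illustration.

```python
import re

def regex_sent_tokenize(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace (a rough heuristic)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def regex_word_tokenize(text: str) -> list[str]:
    """Return runs of word characters, plus each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = 'NLP is amazing. It helps computers understand language'
print('Sentences:', regex_sent_tokenize(sample))
print('Words:', regex_word_tokenize(sample))
```

On this sample the result matches the NLTK output shown above. The heuristic breaks on abbreviations such as "e.g." and splits contractions differently from NLTK, which is why a trained tokenizer is preferable for real text.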


## STEP 2: Create a tiny text dataset

To keep things simple (and fast for teaching), we’ll create a **very small** dataset:
short phrases labeled as **positive (1)** or **negative (0)**.

In real projects you would load thousands of examples from files or a database.

In [14]:
# Concept: Tiny labeled text dataset (sentiment-like)
texts = [
    'I love this product, it works great',
    'This is the best course I have taken',
    'Absolutely wonderful experience',
    'I hate this, it is terrible',
    'Really bad experience, would not recommend',
    'The support was awful and slow',
]

# Labels: 1 = positive, 0 = negative
labels = np.array([1, 1, 1, 0, 0, 0])

for text, label in zip(texts, labels):
    sentiment = 'positive' if label == 1 else 'negative'
    print(f'{sentiment.upper():8} | {text}')

print('\nNumber of examples:', len(texts))

POSITIVE | I love this product, it works great
POSITIVE | This is the best course I have taken
POSITIVE | Absolutely wonderful experience
NEGATIVE | I hate this, it is terrible
NEGATIVE | Really bad experience, would not recommend
NEGATIVE | The support was awful and slow

Number of examples: 6
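
For a larger dataset, the same `texts` / `labels` pair would typically come from a file. The sketch below assumes a CSV with hypothetical column names `text` and `label`, and uses an in-memory string as a stand-in for the file so that it runs anywhere.

```python
import csv
import io

# Stand-in for a real file; with a file on disk you would use open(path, newline='').
csv_data = io.StringIO(
    'text,label\n'
    '"I love this product, it works great",1\n'
    '"I hate this, it is terrible",0\n'
)

loaded_texts, loaded_labels = [], []
for row in csv.DictReader(csv_data):
    loaded_texts.append(row['text'])          # quoted fields may contain commas
    loaded_labels.append(int(row['label']))   # CSV values arrive as strings

print(loaded_texts)
print(loaded_labels)
```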


## STEP 3: Text pre-processing (concepts)

Before vectorization, NLP systems usually do some **pre-processing**:
- **Tokenization / Segmentation**: Split text into tokens (often words).
- **Lowercasing**: Treat `Course` and `course` as the same token.
- **Stop words**: Remove very common words (e.g., *the*, *and*, *from*) that carry little information.
- **Stemming vs Lemmatization** (conceptual):
  - *Stemming*: heuristic chop of suffixes (e.g., 'processing' → 'process') — may not be a valid word.
  - *Lemmatization*: map a word to its dictionary form (lemma), e.g., 'doing' → 'do'.
- **Named Entities**: Recognize real-world names (people, places, organizations).

In this minimal example we’ll do only **lowercasing** and basic cleanup to keep the code small,
but the ideas map directly onto more advanced pipelines.
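
To make the stemming vs lemmatization distinction concrete, here is a toy sketch in plain Python. Both `toy_stem` and the `TOY_LEMMAS` dictionary are invented for illustration; real stemmers (e.g., Porter) and lemmatizers (e.g., WordNet-based) use far more rules and a full lexicon.

```python
def toy_stem(word: str) -> str:
    """Strip the first matching suffix; no dictionary check, so the result may not be a word."""
    for suffix in ('ing', 'ed', 'es', 's'):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for a real lemmatizer's lexicon.
TOY_LEMMAS = {
    'processing': 'process',
    'studies': 'study',
    'doing': 'do',
    'was': 'be',
    'better': 'good',
}

def toy_lemmatize(word: str) -> str:
    """Look the word up; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ['processing', 'studies', 'doing', 'was', 'better']:
    print(f'{word:12} stem={toy_stem(word):12} lemma={toy_lemmatize(word)}')
```

The stemmer turns `studies` into `studi`, which is not a word, while the dictionary lookup returns `study`. The lookup also handles irregular forms such as `was` and `better` that no suffix rule could reach.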

In [15]:
# Concept: Simple text cleaning function
def simple_preprocess(text: str) -> str:
    """Lowercase and remove non-letter characters (very simple).
    In a real system you would use a library (spaCy, NLTK, etc.).
    """
    text = text.lower()
    # Keep letters and spaces only
    text = re.sub(r'[^a-z\s]', '', text)
    # Collapse multiple spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

cleaned_texts = [simple_preprocess(t) for t in texts]
for original, cleaned in zip(texts, cleaned_texts):
    print(f'ORIGINAL: {original}')
    print(f'CLEANED : {cleaned}')
    print('-' * 40)

ORIGINAL: I love this product, it works great
CLEANED : i love this product it works great
----------------------------------------
ORIGINAL: This is the best course I have taken
CLEANED : this is the best course i have taken
----------------------------------------
ORIGINAL: Absolutely wonderful experience
CLEANED : absolutely wonderful experience
----------------------------------------
ORIGINAL: I hate this, it is terrible
CLEANED : i hate this it is terrible
----------------------------------------
ORIGINAL: Really bad experience, would not recommend
CLEANED : really bad experience would not recommend
----------------------------------------
ORIGINAL: The support was awful and slow
CLEANED : the support was awful and slow
----------------------------------------
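
Stop-word removal, listed in the concepts above but not applied by `simple_preprocess`, could be layered on top of the same cleaning steps. A minimal sketch, assuming a hand-written stop-word set (`TOY_STOP_WORDS` and `clean_and_filter` are hypothetical names, not part of NLTK or scikit-learn):

```python
import re

# Hypothetical, deliberately tiny stop-word set; real lists are much longer.
TOY_STOP_WORDS = {'i', 'it', 'is', 'this', 'the', 'was', 'and', 'have', 'would', 'not'}

def clean_and_filter(text: str) -> list[str]:
    """Lowercase, keep letters and spaces only, split on whitespace, drop stop words."""
    cleaned = re.sub(r'[^a-z\s]', '', text.lower())
    return [tok for tok in cleaned.split() if tok not in TOY_STOP_WORDS]

print(clean_and_filter('I love this product, it works great'))
print(clean_and_filter('Really bad experience, would not recommend'))
```

Note the trade-off in the second phrase: dropping `not` leaves `recommend` on its own, so stop-word lists that include negations can erase exactly the signal a sentiment task needs.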


## STEP 4: TF–IDF vectorization

Computers can’t work directly with raw strings, so we convert text to **vectors**.
A very common approach is **TF–IDF (Term Frequency – Inverse Document Frequency)**:

- **Term Frequency (TF)**: How often a term appears in a document.
- **Inverse Document Frequency (IDF)**: How rare a term is across the corpus.
- **TF–IDF score**: TF × IDF — high when a word is frequent in a document but not common everywhere.

We’ll use scikit-learn’s `TfidfVectorizer` to:
1. Tokenize the text
2. Remove simple English stop words
3. Build a vocabulary
4. Compute TF–IDF features.
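
The TF × IDF arithmetic can be worked by hand on a few toy documents. This sketch uses the textbook formula with a natural logarithm; the documents and the helper name `tf_idf` are invented for illustration.

```python
import math
from collections import Counter

# Three toy documents, already tokenized (hypothetical, chosen for easy arithmetic).
docs = [
    ['great', 'product', 'great', 'support'],
    ['awful', 'support'],
    ['great', 'course'],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc: list[str]) -> dict[str, float]:
    """Textbook TF-IDF: (count / doc length) * ln(n_docs / document frequency)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

for doc in docs:
    print({term: round(score, 3) for term, score in tf_idf(doc).items()})
```

In the first document, `product` appears once and `great` twice, yet `product` scores higher because it occurs in only one document. scikit-learn's `TfidfVectorizer` defaults differ from this textbook version: it uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each row, so its numbers will not match these.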

In [16]:
# Concept: TF–IDF vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    preprocessor=simple_preprocess,
    stop_words='english'  # drop common English stop words
)

X = vectorizer.fit_transform(texts)

print('Shape of TF–IDF matrix:', X.shape)
print('Vocabulary size:', len(vectorizer.vocabulary_))

feature_names = vectorizer.get_feature_names_out()
print('Some features:', feature_names[:10])

Shape of TF–IDF matrix: (6, 18)
Vocabulary size: 18
Some features: ['absolutely' 'awful' 'bad' 'best' 'course' 'experience' 'great' 'hate'
 'love' 'product']


## STEP 5: Train a simple classifier

We now have:
- `X`: TF–IDF features (sparse matrix)
- `labels`: 0/1 sentiment-like labels

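We’ll train a **Logistic Regression** classifier, a standard choice for text classification.
With only six examples (four for training, two for testing), the scores below illustrate the workflow and say little about real-world accuracy.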
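
As a rough picture of what the classifier computes: logistic regression takes a weighted sum of the feature values and passes it through a sigmoid to get a probability. The weights below are hypothetical numbers chosen for illustration; a fitted model learns its own from the training data.

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for illustration only; not taken from a trained model.
weights = {'love': 1.2, 'great': 0.9, 'hate': -1.3, 'awful': -1.1}
bias = 0.0

def predict_positive_probability(features: dict[str, float]) -> float:
    """Weighted sum of feature values plus bias, passed through the sigmoid."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return sigmoid(z)

# A made-up TF-IDF-like vector; 'course' has no weight here and contributes nothing.
p = predict_positive_probability({'love': 0.7, 'course': 0.7})
print('P(positive) =', round(p, 3), '->', 'positive' if p >= 0.5 else 'negative')
```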

In [18]:
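# Concept: Train/test split + classifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(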
    X, labels, test_size=0.33, random_state=42, stratify=labels
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('\nClassification report:\n')
print(classification_report(y_test, y_pred, target_names=['negative', 'positive']))

Accuracy: 0.5

Classification report:

              precision    recall  f1-score   support

    negative       0.50      1.00      0.67         1
    positive       0.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2
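
The figures in the report can be reproduced by hand. They are consistent with a test set holding one negative and one positive example, with the model predicting negative for both. That is an inference from the numbers above, not something the cell prints. The helper `precision_recall_f1` is a hypothetical name for this sketch.

```python
def precision_recall_f1(y_true, y_pred, target_class):
    """Compute precision, recall and F1 for one class, returning 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target_class and p == target_class)
    fp = sum(1 for t, p in pairs if t != target_class and p == target_class)
    fn = sum(1 for t, p in pairs if t == target_class and p != target_class)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Assumed test split: one negative (0), one positive (1); both predicted negative.
y_true_demo = [0, 1]
y_pred_demo = [0, 0]

for name, cls in [('negative', 0), ('positive', 1)]:
    prec, rec, f1 = precision_recall_f1(y_true_demo, y_pred_demo, cls)
    print(f'{name:8} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}')
```

Because the model never predicts the positive class, precision for that class has a zero denominator and is reported as 0.00. With only two test examples, a single prediction moves accuracy by 50 percentage points.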





## STEP 6: Use the model on new text

Now we can use the trained model to classify **new phrases** that were not seen during training.

In [19]:
# Concept: Predict sentiment of new text
new_texts = [
    'I really love this course',
    'The product was awful and I hate it',
    'It was okay, not great but not terrible either',
]

new_X = vectorizer.transform(new_texts)
new_pred = clf.predict(new_X)

for text, label in zip(new_texts, new_pred):
    sentiment = 'positive' if label == 1 else 'negative'
    print(f'{sentiment.upper():8} | {text}')

POSITIVE | I really love this course
NEGATIVE | The product was awful and I hate it
NEGATIVE | It was okay, not great but not terrible either


## Connecting back to the NLP deck

In this tiny example we touched several key NLP ideas from the slides:

- **Tokenization / Segmentation**: `TfidfVectorizer` tokenizes text into words under the hood.
- **Stop words**: We removed common English stop words with `stop_words='english'`.
- **TF–IDF**: We represented each document as a TF–IDF vector (term frequency × inverse document frequency).
- **Classification**: We trained a Logistic Regression classifier on these vectors.
- **Context**: A bag-of-words model like this one only sees which words occur in a document, not the order they appear in, so it does *not* capture long-range context the way a modern large language model (LLM) does.
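
The limit on context is easy to demonstrate: a bag-of-words representation counts words and discards their order. In this small sketch (the two sentences are invented for illustration), opposite meanings produce identical word counts.

```python
from collections import Counter

sentence_a = 'the support was great not awful'
sentence_b = 'the support was awful not great'

# A bag of words is just a word -> count mapping; position is thrown away.
bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print('Same sentence:', sentence_a == sentence_b)
print('Same bag of words:', bag_a == bag_b)
```

Any model fed only these counts, TF–IDF weighted or not, cannot tell the two sentences apart.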

Concepts from the deck that we did *not* code here:
- **POS tagging** (Noun, Verb, etc.) and **Word Sense Disambiguation** (which meaning of a word like 'bank').
- **Named Entities** (people, places, organizations).
- **Topic models** (e.g., LDA, NMF) for discovering themes across many documents.

## Next steps

- Swap `LogisticRegression` for another classifier (e.g., `LinearSVC`).
- Experiment with your own small text dataset.
- Explore topic modeling (NMF / LDA) on a collection of documents.
- Compare this classical pipeline to modern transformer-based NLP (e.g., BERT, GPT).