# Class 6 Notebook – Natural Language Processing (NLP) Basics

This notebook introduces **Natural Language Processing (NLP)** using a small, end-to-end example:
working with raw text, doing basic **tokenization** and **cleaning**, and then building a tiny
**TF–IDF + Logistic Regression** text classifier.

We will connect code back to the Class 6 deck concepts:
- Tokenization and stop words
- Stemming vs lemmatization (conceptual)
- TF–IDF (Term Frequency – Inverse Document Frequency)
- Basic text classification (e.g., positive vs negative phrases)

**Objective**: Build a tiny NLP pipeline that:
1. Creates a tiny labeled text dataset
2. Pre-processes text (tokenization, lowercasing, stop words, simple cleaning)
3. Turns cleaned text into TF–IDF vectors
4. Trains and evaluates a simple classifier

A follow-up notebook (`NLP_Demos.ipynb`) goes deeper into TF–IDF variations and additional models.

Run the first code cell to confirm your environment works.

## Run in the browser (no local setup)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adzuci/ai-fundamentals/blob/class-6-natural-language-processing/class-6-natural-language-processing/01_class_6_nlp_basics.ipynb)

> Tip: Make sure you are comfortable with basic Python and scikit-learn from Classes 2–3 before this notebook.

## What is NLP (Natural Language Processing)?

**Natural Language Processing (NLP)** is about getting computers to work with **human language**:
understanding text, extracting information, and generating language.

Common applications (from the deck):
- **Question Answering** – Answering questions from text or a knowledge base.
- **Information Extraction** – Pulling structured fields from text (e.g., meeting *Time*, *Venue*).
- **Machine Translation** – Translating between languages.
- **Text summarization / keyword extraction** – Shortening long documents or extracting key phrases.
- **Sentiment analysis** – Detecting whether text is positive, negative, or neutral.
- **Context analysis / topic detection** – Understanding what a conversation or document is about.

In this notebook we focus on a **very small slice** of NLP:
- Turning text into numeric features (TF–IDF)
- Training a small classifier for a toy sentiment-like task.

## Table of contents

1. [STEP 1 – Install and import libraries](#step-1-install-and-import-libraries)
2. [STEP 2 – Create a tiny text dataset](#step-2-create-a-tiny-text-dataset)
3. [STEP 3 – Text pre-processing](#step-3-text-pre-processing-concepts)
4. [STEP 4 – TF–IDF vectorization](#step-4-tf–idf-vectorization)
5. [STEP 5 – Train a simple classifier](#step-5-train-a-simple-classifier)
6. [STEP 6 – Use the model on new text](#step-6-use-the-model-on-new-text)
7. [NLTK vs scikit-learn (quick comparison)](#nltk-vs-scikit-learn-quick-comparison)
8. [🌱 Concept: Stemming](#-concept-stemming)
9. [🌱 Concept: Lemmatization](#-concept-lemmatization)

## STEP 1: Install and import libraries

We use:
- **NumPy** for arrays.
- **re** (regular expressions) for simple text cleaning.
- **NLTK** for tokenization, stop words, and stemming/lemmatization.
- **scikit-learn** for TF–IDF features and a simple Logistic Regression classifier.

> `NLP_Demos.ipynb` builds on this notebook with additional demos and variations.

In [None]:
# Environment sanity check + imports
import platform  # Python / OS info only

print("Python:", platform.python_version())
print("OS:", platform.system(), platform.release())

try:
    import numpy as np  # numerical arrays and simple data work
    import re  # regular expressions for basic text cleaning
    import nltk  # core NLTK package (tokenization, corpora)
    from nltk.corpus import stopwords  # common stop word lists (English, etc.)
    from nltk.tokenize import word_tokenize, sent_tokenize  # word- and sentence-level tokenizers

    # scikit-learn: text features + simple ML models
    from sklearn.feature_extraction.text import TfidfVectorizer  # TF–IDF vectorizer
    from sklearn.model_selection import train_test_split  # train/test splitting
    from sklearn.linear_model import LogisticRegression  # logistic regression classifier
    from sklearn.metrics import accuracy_score, classification_report  # evaluation metrics

    print("NumPy:", np.__version__)
    print("All libraries imported successfully!")
except ModuleNotFoundError as exc:
    print("Missing dependency:", exc)
    print("Install with: python -m pip install numpy nltk scikit-learn")
    raise

Python: 3.10.14
OS: Darwin 25.2.0
NumPy: 2.2.6
All libraries imported successfully!


In [2]:
# Concept: Sentence and word tokenization with NLTK (for in-class exercise)
# Download NLTK resources once per environment. If they are already installed (or you
# are offline), you can comment these lines out; the tokenizers below still need the
# 'punkt' data to be present.
nltk.download('punkt', quiet=True)      # sentence / word tokenizer models
nltk.download('punkt_tab', quiet=True)  # extra punkt data in newer NLTK versions
nltk.download('stopwords')              # stop word lists (used later or in demos)

text = 'NLP is amazing. It helps computers understand language'
print(text)

# Sentence-level tokenization
my_sentences = sent_tokenize(text)
print('Sentences:', my_sentences)

# Word-level tokenization
my_words = word_tokenize(text)
print('Words:', my_words)

NLP is amazing. It helps computers understand language
Sentences: ['NLP is amazing.', 'It helps computers understand language']
Words: ['NLP', 'is', 'amazing', '.', 'It', 'helps', 'computers', 'understand', 'language']


[nltk_data] Downloading package stopwords to /Users/adam/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
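
To make the idea of tokenization concrete, here is a minimal sketch that approximates sentence and word splitting with the standard-library `re` module. The function names `regex_sent_tokenize` and `regex_word_tokenize` are hypothetical helpers written for this illustration; they are not part of NLTK, and NLTK's tokenizers handle many cases (abbreviations, contractions, quotes) that these simple patterns do not.

```python
import re

def regex_sent_tokenize(text: str) -> list[str]:
    """Naive sentence split: break after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def regex_word_tokenize(text: str) -> list[str]:
    """Naive word split: runs of word characters, plus each punctuation mark on its own."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "NLP is amazing. It helps computers understand language"
print("Sentences:", regex_sent_tokenize(sample))
print("Words:", regex_word_tokenize(sample))
```

On this particular sentence the sketch produces the same tokens as the NLTK output above, but that agreement should not be expected on messier text.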


## STEP 2: Create a tiny text dataset

To keep things simple (and fast for teaching), we’ll create a **very small** dataset:
short phrases labeled as **positive (1)** or **negative (0)**.

In real projects you would load thousands of examples from files or a database.
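
As a rough sketch of what loading from a file might look like, the example below reads labeled text with the standard-library `csv` module. The column names `text` and `label` and the file contents are assumptions made up for illustration; an in-memory `io.StringIO` stands in for a real file so the cell runs anywhere.

```python
import csv
import io

# Hypothetical CSV contents; in practice this would be open("some_file.csv", newline="").
csv_data = io.StringIO(
    "text,label\n"
    "I love this product,1\n"
    "Really bad experience,0\n"
)

loaded_texts, loaded_labels = [], []
for row in csv.DictReader(csv_data):
    loaded_texts.append(row["text"])          # raw phrase
    loaded_labels.append(int(row["label"]))   # 1 = positive, 0 = negative

print(loaded_texts)
print(loaded_labels)
```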

In [3]:
# Concept: Tokenizing a different sentence
mytext2 = "NLP is an interesting field of AI and useful to create a Model"

my_words2 = word_tokenize(mytext2)
print(my_words2)

['NLP', 'is', 'an', 'interesting', 'field', 'of', 'AI', 'and', 'useful', 'to', 'create', 'a', 'Model']


In [4]:
# Concept: Tiny labeled text dataset (sentiment-like)
texts = [
    'I love this product, it works great',
    'This is the best course I have taken',
    'Absolutely wonderful experience',
    'I hate this, it is terrible',
    'Really bad experience, would not recommend',
    'The support was awful and slow',
]

# Labels: 1 = positive, 0 = negative
labels = np.array([1, 1, 1, 0, 0, 0])

for text, label in zip(texts, labels):
    sentiment = 'positive' if label == 1 else 'negative'
    print(f'{sentiment.upper():8} | {text}')

print('\nNumber of examples:', len(texts))

POSITIVE | I love this product, it works great
POSITIVE | This is the best course I have taken
POSITIVE | Absolutely wonderful experience
NEGATIVE | I hate this, it is terrible
NEGATIVE | Really bad experience, would not recommend
NEGATIVE | The support was awful and slow

Number of examples: 6


## STEP 3: Text pre-processing (concepts)

Before vectorization, NLP systems usually do some **pre-processing**:
- **Tokenization / Segmentation**: Split text into tokens (often words).
- **Lowercasing**: Treat `Course` and `course` as the same token.
- **Stop words**: Remove very common words (e.g., *the*, *and*, *from*) that carry little information.
- **Stemming vs Lemmatization** (conceptual):
  - *Stemming*: heuristic chop of suffixes (e.g., 'processing' → 'process'); the result may not be a valid word.
  - *Lemmatization*: map a word to its dictionary form (lemma), e.g., 'doing' → 'do'.
- **Named Entities**: Recognize real-world names (people, places, organizations).

In this minimal example we’ll do only **lowercasing** and basic cleanup to keep the code small,
but the ideas map directly onto more advanced pipelines.
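
The first three ideas in the list (tokenization, lowercasing, stop word removal) can be sketched in a few lines of plain Python. `TOY_STOP_WORDS` is a deliberately tiny, hypothetical stop word set chosen for this example; real lists, such as NLTK's or scikit-learn's, are much longer.

```python
import re

# Hypothetical, deliberately tiny stop word set for illustration only.
TOY_STOP_WORDS = {"i", "this", "it", "is", "the", "and", "was", "have"}

def tokenize_and_filter(text: str) -> list[str]:
    """Lowercase, split into alphabetic tokens, then drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOP_WORDS]

print(tokenize_and_filter("I love this product, it works great"))
print(tokenize_and_filter("The support was awful and slow"))
```

Note what is left after filtering: mostly the words that carry the sentiment, which is the motivation for removing stop words before vectorization.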

In [5]:
# Concept: Simple text cleaning function
def simple_preprocess(text: str) -> str:
    """Lowercase and remove non-letter characters (very simple).
    In a real system you would use a library (spaCy, NLTK, etc.).
    """
    text = text.lower()
    # Keep letters and spaces only
    text = re.sub(r'[^a-z\s]', '', text)
    # Collapse multiple spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

cleaned_texts = [simple_preprocess(t) for t in texts]
for original, cleaned in zip(texts, cleaned_texts):
    print(f'ORIGINAL: {original}')
    print(f'CLEANED : {cleaned}')
    print('-' * 40)

ORIGINAL: I love this product, it works great
CLEANED : i love this product it works great
----------------------------------------
ORIGINAL: This is the best course I have taken
CLEANED : this is the best course i have taken
----------------------------------------
ORIGINAL: Absolutely wonderful experience
CLEANED : absolutely wonderful experience
----------------------------------------
ORIGINAL: I hate this, it is terrible
CLEANED : i hate this it is terrible
----------------------------------------
ORIGINAL: Really bad experience, would not recommend
CLEANED : really bad experience would not recommend
----------------------------------------
ORIGINAL: The support was awful and slow
CLEANED : the support was awful and slow
----------------------------------------


## STEP 4: TF–IDF vectorization (with scikit-learn)

Computers can’t work directly with raw strings, so we convert text to **vectors**.
A very common approach is **TF–IDF (Term Frequency – Inverse Document Frequency)**:

- **Term Frequency (TF)**: How often a term appears in a document.
- **Inverse Document Frequency (IDF)**: How rare a term is across the corpus.
- **TF–IDF score**: TF × IDF; high when a word is frequent in a document but not common everywhere.

In this step we use scikit-learn’s `TfidfVectorizer` (imported above) to:
1. Tokenize the text
2. Remove simple English stop words
3. Build a vocabulary
4. Compute TF–IDF features for each example.
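
To see what the TF × IDF score means numerically, here is a from-scratch sketch using only the standard library. It uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which matches the default settings of `TfidfVectorizer`. The three token lists in `docs` are a made-up mini corpus, and the helper names are hypothetical.

```python
import math
from collections import Counter

# Made-up mini corpus, already tokenized and stop-word filtered.
docs = [
    ["love", "product", "works", "great"],
    ["wonderful", "experience"],
    ["bad", "experience", "recommend"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def smoothed_idf(term: str) -> float:
    """Smoothed inverse document frequency: rarer terms get larger values."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(doc: list[str]) -> dict[str, float]:
    """TF x IDF per term, scaled so the vector has unit (L2) length."""
    tf = Counter(doc)
    weights = {t: count * smoothed_idf(t) for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(docs[1])
for term, weight in sorted(vec.items()):
    print(f"{term:12} {weight:.3f}")
```

In the second document, `wonderful` appears in only one document while `experience` appears in two, so `wonderful` ends up with the larger weight.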

In [6]:
# Concept: TF–IDF vectorization (using scikit-learn)
# This turns each cleaned text into a numeric vector that a classifier can use.
# Requires the STEP 1 imports; run that cell first if you see a NameError for TfidfVectorizer.
vectorizer = TfidfVectorizer(
    preprocessor=simple_preprocess,
    stop_words='english'  # drop common English stop words
)

X = vectorizer.fit_transform(texts)

print('Shape of TF–IDF matrix:', X.shape)
print('Vocabulary size:', len(vectorizer.vocabulary_))

feature_names = vectorizer.get_feature_names_out()
print('Some features:', feature_names[:10])


## STEP 5: Train a simple classifier (Logistic Regression)

We now have:
- `X`: TF–IDF features (sparse matrix)
- `labels`: 0/1 sentiment-like labels

Here we use **Logistic Regression** from scikit-learn as a simple baseline text classifier.
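
As a rough picture of what a trained logistic regression does at prediction time, the sketch below computes a weighted sum of feature values and passes it through a sigmoid. The weights in `toy_weights` are invented for illustration and are not the values `clf` learns; the real model fits its weights from the training data.

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: positive values push toward class 1, negative toward class 0.
toy_weights = {"love": 1.2, "great": 0.8, "awful": -1.1, "hate": -1.3}
toy_bias = 0.0

def predict_proba_positive(features: dict[str, float]) -> float:
    """Probability of the positive class for a {term: tf-idf weight} dict."""
    z = toy_bias + sum(toy_weights.get(t, 0.0) * v for t, v in features.items())
    return sigmoid(z)

for features in ({"love": 0.7, "course": 0.7}, {"awful": 0.7, "hate": 0.7}):
    p = predict_proba_positive(features)
    print(features, "->", "positive" if p >= 0.5 else "negative")
```

A term with no learned weight (here `course`) contributes nothing to the sum, so the prediction rests entirely on the words the model has a weight for.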

In [None]:
# Concept: Train/test split + classifier (Logistic Regression)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42, stratify=labels
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification report:\n")
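# zero_division=0 reports 0.0 (instead of warning) when a class is never predicted
print(classification_report(y_test, y_pred, target_names=["negative", "positive"], zero_division=0))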

Accuracy: 0.5

Classification report:

              precision    recall  f1-score   support

    negative       0.50      1.00      0.67         1
    positive       0.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2
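
The report above can be checked by hand. Its numbers are consistent with a two-example test set (one negative, one positive) in which the classifier predicted negative for both. That reading is inferred from the support and recall columns; the lists `demo_true` and `demo_pred` below are reconstructed for illustration, not copied from the notebook's variables.

```python
demo_true = [0, 1]  # assumed test labels: one negative, one positive
demo_pred = [0, 0]  # assumed predictions: negative for both

def precision_recall_f1(y_true, y_hat, cls):
    """Per-class precision, recall and F1, returning 0.0 where a ratio is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_hat) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_hat) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_hat) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

for cls, name in [(0, "negative"), (1, "positive")]:
    p, r, f = precision_recall_f1(demo_true, demo_pred, cls)
    print(f"{name:8} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With a test set of two examples, each prediction moves accuracy by 50 percentage points, so the score says very little about the model.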





## STEP 6: Use the model on new text

Now we can use the trained model to classify **new phrases** that were not seen during training.
We transform the new text with the **same** `TfidfVectorizer` and then call `clf.predict`.
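
Why the **same** vectorizer matters: the vocabulary (which word maps to which column) is fixed when the vectorizer is fitted, and words that were never seen are dropped at transform time. The sketch below shows that behavior with plain word counts instead of TF–IDF weights, to keep it short. The names `vocab` and `toy_transform` are hypothetical stand-ins, not scikit-learn APIs.

```python
import re
from collections import Counter

train_docs = ["i love this course", "the support was awful"]

# "fit": build the vocabulary (term -> column index) from training text only.
vocab = {
    term: idx
    for idx, term in enumerate(sorted({t for d in train_docs for t in d.split()}))
}

def toy_transform(text: str) -> list[int]:
    """Count known terms into their fixed columns; unknown terms are ignored."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    row = [0] * len(vocab)
    for term, count in counts.items():
        if term in vocab:
            row[vocab[term]] = count
    return row

print(vocab)
print(toy_transform("I really love this new course"))
```

The words `really` and `new` are not in the vocabulary, so they leave no trace in the output row. Fitting a second vectorizer on the new text would produce different columns that the trained classifier could not interpret.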

In [None]:
# Concept: Predict sentiment of new text
new_texts = [
    "I really love this course",
    "The product was awful and I hate it",
    "It was okay, not great but not terrible either",
]

new_X = vectorizer.transform(new_texts)
new_pred = clf.predict(new_X)

for text, label in zip(new_texts, new_pred):
    sentiment = "positive" if label == 1 else "negative"
    print(f"{sentiment.upper():8} | {text}")

POSITIVE | I really love this course
NEGATIVE | The product was awful and I hate it
NEGATIVE | It was okay, not great but not terrible either


## Connecting back to the NLP deck

In this tiny example we touched several key NLP ideas from the slides:

- **Tokenization / Segmentation**: splitting text into sentences and words.
- **Stop words & cleaning**: lowercasing, removing non-letter characters, and dropping English stop words in the vectorizer.
- **TF–IDF**: representing each document as a TF–IDF vector.
- **Classification**: training a simple Logistic Regression classifier on TF–IDF features.

Concepts we *only mentioned* but did not explore in depth here:
- **POS tagging** (Noun, Verb, etc.) and **Word Sense Disambiguation** (which meaning of a word like 'bank').
- **Named Entities** (people, places, organizations).
- **Topic models** (e.g., LDA, NMF) for discovering themes across many documents.

## Next steps

- Try changing the tiny dataset and see how the classifier behaves.
- Experiment with different preprocessing rules (e.g., keep punctuation, change stop words).
- Swap `LogisticRegression` for another classifier (e.g., `LinearSVC`).
- Explore topic modeling (NMF / LDA) or modern transformer-based NLP (e.g., BERT, GPT) in follow-up notebooks.

## NLTK vs scikit-learn (quick comparison)

| Library | Purpose | What it’s good at |
|---|---|---|
| **NLTK** | Natural Language Processing | Tokenization, stop words, stemming, lemmatization |
| **scikit-learn** | Machine Learning | TF–IDF, feature extraction, classification models |

**In practice:**
- Use **NLTK** to prepare and clean text (tokens, stop words, stemming/lemmatization).
- Use **scikit-learn** to turn text into **numbers** (e.g., TF–IDF vectors) and train **models** (e.g., Logistic Regression).

## 🌱 Concept: Stemming

**Stemming** reduces words to a root form by chopping off suffixes.

Examples:
- `running` → `run`
- `singing` → `sing`
- `studies` → `studi`

⚠️ **Stems are not always real words.** The goal is consistency, not perfect grammar.

**Why we use stemming:**
- Groups inflected forms together (e.g., `run`, `runs`, `running` → `run`); irregular forms such as `ran` are usually left unchanged
- Reduces vocabulary size
- Can improve model performance, especially on small datasets
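
To show what "chopping off suffixes" looks like in code, here is a toy suffix stripper. It is not the Porter algorithm: the rules in `TOY_SUFFIX_RULES` are a hypothetical, much smaller set chosen for this example, and it skips Porter's extra steps (for instance, it turns `running` into `runn` rather than `run`).

```python
# Hypothetical (suffix, replacement) rules, tried in order; first match wins.
TOY_SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word: str) -> str:
    """Strip one known suffix, but only if at least 3 characters remain."""
    word = word.lower()
    for suffix, replacement in TOY_SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["studies", "singing", "talked", "plays", "ran"]:
    print(w, "->", toy_stem(w))
```

Two things are visible even in this toy: the stem `studi` is not a real word, and the irregular form `ran` is left alone because no suffix rule applies to it.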

## 🌱 Concept: Lemmatization

**Lemmatization** reduces words to their base or dictionary form (known as a *lemma*).
Unlike stemming, which often just chops off suffixes, lemmatization uses a vocabulary
and morphological analysis of words to return a valid word.

Examples (assuming the lemmatizer is told the right part of speech, such as verb or adjective):
- `running` → `run`
- `ran` → `run`
- `better` → `good`
- `studies` → `study`

**Why we use lemmatization:**
- Groups morphologically related words together (e.g., all forms of a verb to its infinitive).
- Produces valid words, which can be helpful for downstream tasks or interpretability.
- Reduces vocabulary size, similar to stemming, but with more linguistic accuracy.
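
The key difference from stemming is that lemmatization is a lookup, and the lookup depends on the part of speech. The sketch below imitates that with a hand-written table. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names for this illustration; a real lemmatizer such as NLTK's `WordNetLemmatizer` consults the WordNet database instead.

```python
# Hypothetical lemma table keyed by (word, part of speech): 'v' verb, 'a' adjective, 'n' noun.
TOY_LEMMAS = {
    ("ran", "v"): "run",
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos); fall back to the word itself if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("better"))       # treated as a noun: no entry, returned unchanged
print(toy_lemmatize("better", "a"))  # treated as an adjective
print(toy_lemmatize("ran", "v"))     # treated as a verb
```

The default of `"n"` mirrors how `WordNetLemmatizer.lemmatize` treats words as nouns unless a `pos` argument is passed, which is why the part of speech matters for examples like `ran` and `better`.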

In [None]:
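# Concept: Lemmatization with NLTK's WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Download the WordNet corpus if not already present
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)  # Open Multilingual WordNet data

# (word, WordNet POS tag) pairs: 'v' = verb, 'a' = adjective, 'n' = noun.
# lemmatize() treats every word as a noun unless a POS tag is passed,
# which would leave "running", "ran", and "better" unchanged.
words_for_lemmatization = [
    ("running", "v"), ("runs", "v"), ("ran", "v"),
    ("better", "a"), ("best", "a"),
    ("studies", "v"), ("studying", "v"), ("studied", "v"),
    ("geese", "n"), ("mice", "n"),
]

lemmatizer = WordNetLemmatizer()

lemmatized_words = [lemmatizer.lemmatize(w, pos=p) for w, p in words_for_lemmatization]
print("Original words   :", [w for w, _ in words_for_lemmatization])
print("Lemmatized words :", lemmatized_words)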

In [None]:
# Concept: Stemming with NLTK's PorterStemmer
from nltk.stem import PorterStemmer

words = ["running", "singing", "talking", "playing", "run", "studies", "ran"]

mystemmer = PorterStemmer()

mystems = [mystemmer.stem(w) for w in words]
print("Original words:", words)
print("Stemmed words :", mystems)