# 3.1 Natural Language Processing and Language Modeling

## Attribution
This notebook was adapted for the **Generative AI: Theory and Applications** MSc Module at UWS from material created by NVIDIA and Dartmouth College, which is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Source materials available at: https://developer.nvidia.com/gen-ai-teaching-kit-syllabus (NVIDIA Deep Learning Institute Generative AI Teaching Kit) 

## Overview

Welcome to the first notebook in the third week. In this notebook we will look further into some of the Natural Language Processing (NLP) tasks and methods that we have seen so far.
In particular, by the end of this notebook you will:

- Process documents using Bag-of-Words models
- Use an N-Gram model to select the most likely next word
- Use Named-Entity Recognition to identify key terms in a text

In [None]:
# PLEASE UNCOMMENT AND RUN THE FOLLOWING LINE IF YOU ARE USING A COLAB NOTEBOOK
# !pip install -qqq nltk scikit-learn transformers torch spacy matplotlib

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
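from IPython.display import display, Javascript

# Restart the kernel so the packages and model installed above are picked up.
# This JavaScript call targets the classic Jupyter Notebook interface; in
# JupyterLab or Colab, restart the kernel/runtime from the menu instead.
display(Javascript('IPython.notebook.kernel.restart();'))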

## 1. Bag-of-Words (BOW) Representation:
**Bag of Words (BoW) Explanation**

What is Bag of Words?
- The Bag of Words (BoW) model is a simple and widely used technique in natural language processing (NLP) to represent text data.
- It converts text into numerical features by counting the occurrences of each word in a document, ignoring grammar, order, or context.
- The output is often a sparse matrix where:
  - Each row corresponds to a document.
  - Each column corresponds to a unique word in the corpus.
  - The value represents the frequency (or binary presence) of the word in the document.

How Does Bag of Words Work?
1. **Tokenization**: Split the text into individual words or tokens.
2. **Vocabulary Creation**: Compile a list of unique words across all documents.
3. **Vectorization**: Represent each document as a vector of word frequencies or binary indicators.
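
The three steps above can be sketched in plain Python before reaching for a library. This is a minimal illustration only: the whitespace tokenizer and the names `tokenize`, `build_vocabulary`, `vectorize`, and `toy_docs` are assumptions made for this example, not part of scikit-learn or NLTK.

```python
from collections import Counter

def tokenize(text):
    """Step 1: lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_vocabulary(docs):
    """Step 2: map each unique token to a column index, in sorted order."""
    unique_tokens = sorted({token for doc in docs for token in tokenize(doc)})
    return {token: index for index, token in enumerate(unique_tokens)}

def vectorize(doc, vocabulary):
    """Step 3: return a list of counts, one entry per vocabulary word."""
    counts = Counter(tokenize(doc))
    return [counts[token] for token in vocabulary]

toy_docs = ["dog bites man", "man bites dog", "dog eats dog food"]
vocabulary = build_vocabulary(toy_docs)
vectors = [vectorize(doc, vocabulary) for doc in toy_docs]

print("Vocabulary:", vocabulary)
for doc, vector in zip(toy_docs, vectors):
    print(f"{doc!r:22} -> {vector}")
```

The first two documents contain the same words in a different order, so they end up with the same vector. That is the word-order loss that a count-based representation accepts by design.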

Usefulness of Bag of Words
- **Simplicity**: Easy to implement and understand, making it a good starting point for text representation.
- **Feature Engineering**: Provides a straightforward way to generate features for machine learning models.
- **Compatibility**: Works well with traditional machine learning algorithms like Naive Bayes, Logistic Regression, and SVM.
- **Baseline Model**: Serves as a benchmark for evaluating more sophisticated NLP models.

Limitations of Bag of Words
1. **Loss of Context**:
   - Ignores word order and semantic relationships between words.
   - Example: "dog bites man" and "man bites dog" contain the same words, so they get identical vectors.
2. **High Dimensionality**:
   - For large vocabularies, the feature space becomes huge, leading to sparse data and increased computational cost.
3. **No Weighting**:
   - Frequent but less important words (e.g., "the", "is") can dominate the representation.
   - Mitigated by using term frequency-inverse document frequency (TF-IDF).
4. **Poor Generalization**:
   - Unseen words in new data are not represented, leading to issues in handling out-of-vocabulary (OOV) words.
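
To make the TF-IDF idea from limitation 3 concrete, here is a rough sketch using the textbook formula (raw count multiplied by `log(N / document frequency)`). The documents and the names `tfidf_docs` and `tf_idf` are invented for illustration. Library implementations such as scikit-learn's `TfidfVectorizer` use smoothed and normalized variants, so their numbers will differ from these.

```python
import math
from collections import Counter

tfidf_docs = [
    "the meeting is at the office",
    "the prize is free",
    "claim the free prize now",
]
tokenized = [doc.split() for doc in tfidf_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Weight each word's raw count by log(N / document frequency)."""
    counts = Counter(tokens)
    return {word: count * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()}

weights = tf_idf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:8} {weight:.3f}")
```

Although "the" is the most frequent word in the first document, it appears in every document, so its weight drops to zero. Rarer words such as "meeting" receive the highest weights.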

When to Use Bag of Words?
- For small to medium-sized datasets where simplicity and speed are prioritized.
- When context and semantic meaning are less critical for the task.
- As a baseline to compare against more advanced models like Word2Vec, GloVe, or transformers.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
import random
from collections import defaultdict, Counter

# For Bag-of-Words classification
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Ensure NLTK data is downloaded (particularly 'punkt' for tokenization)
nltk.download('punkt', quiet=True)

"""
1. BAG-OF-WORDS EXAMPLE:
    - Convert text documents into simple frequency vectors.
    - Train a naive Bayes classifier on a tiny spam vs. not_spam dataset.
    - Evaluate the classifier accuracy on a hold-out test set.
"""
print("=== Bag-of-Words Example ===")

# Step 1: Create a small dataset of documents and labels
# Each document is a short text, and each label indicates whether it is "spam" or "not_spam"
documents = [
    "Win big prizes now",                         # spam
    "Congratulations you won free tickets",       # spam
    "Meeting at the office tomorrow",             # not spam
    "Let's schedule a call about the project",    # not spam
    "Click here to claim your prize!",            # spam
    "Weekly report is due by Friday"              # not spam
]
labels = ["spam", "spam", "not_spam", "not_spam", "spam", "not_spam"]

# Step 2: Convert text data into Bag-of-Words (BOW) feature vectors
# The CountVectorizer converts each document into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# The vocabulary is the list of unique words across all documents
# Each word is assigned a column in the feature matrix
print("Extracted Vocabulary:", vectorizer.vocabulary_)  # Print the mapping of words to indices
print("BOW Feature Matrix Shape:", X.shape)  # Shape = (#documents, #unique_words)

# Step 3: Split the dataset into training and test sets
# 67% of the data will be used for training, and 33% for testing
X_train, X_test, y_train, y_test = train_test_split(X, labels,
                                                    test_size=0.33,
                                                    random_state=42)

# Step 4: Train a Naive Bayes classifier
# Naive Bayes is a simple algorithm that works well with BOW representations
clf = MultinomialNB()
clf.fit(X_train, y_train)  # Train the model on the training set

# Step 5: Make predictions on the test set
y_pred = clf.predict(X_test)

# Step 6: Evaluate the classifier
# Calculate accuracy by comparing predictions to ground truth labels
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on our tiny dataset: {accuracy:.2f}")
print("Test Predictions:", y_pred)  # Predicted labels for the test set
print("Test Ground Truth:", y_test)  # Actual labels for the test set
print()

Let's try again with a larger dataset, generated from message templates.

In [None]:
import random

# Templates for more varied spam and not_spam messages
spam_templates = [
    "Win a {prize} by clicking here now!",
    "Congratulations! You are eligible for a free {item}.",
    "Claim your exclusive offer for {discount}% off!",
    "Don't miss out on this limited-time deal!",
    "You have won {amount} dollars! Act fast to redeem.",
    "Click to receive your special bonus reward."
]

not_spam_templates = [
    "The meeting is scheduled for {day}.",
    "Let's discuss the {topic} in our next call.",
    "Reminder: {task} is due by {deadline}.",
    "Please review the {document} and provide feedback.",
    "Join us for the upcoming {event} this {day}.",
    "Updates on the {project} will be shared soon."
]

# Parameters for generating varied messages
spam_placeholders = {
    "prize": ["car", "trip to Paris", "iPhone", "gift card"],
    "item": ["laptop", "TV", "smartphone", "headphones"],
    "discount": [20, 30, 50, 70],
    "amount": [1000, 5000, 10000, 50000]
}

not_spam_placeholders = {
    "day": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
    "topic": ["budget", "new project", "strategy plan", "team goals"],
    "task": ["report", "presentation", "proposal", "draft"],
    "deadline": ["next week", "Friday", "end of the month"],
    "document": ["report", "proposal", "presentation slides", "spreadsheet"],
    "event": ["workshop", "seminar", "webinar", "meeting"],
    "project": ["marketing campaign", "software development", "product launch"]
}

# Function to populate templates with random placeholders
def generate_messages(templates, placeholders, count):
    messages = []
    for _ in range(count):
        template = random.choice(templates)
        message = template.format(**{k: random.choice(v) for k, v in placeholders.items()})
        messages.append(message)
    return messages


count = 10
# Generate half as much spam as not_spam messages
spam_messages = generate_messages(spam_templates, spam_placeholders, count//2)
not_spam_messages = generate_messages(not_spam_templates, not_spam_placeholders, count)

# Combine the two classes (train_test_split below shuffles before splitting)
documents = spam_messages + not_spam_messages
labels = ["spam"] * (count//2) + ["not_spam"] * count



# 1) Convert text data to BOW features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
print("Extracted Vocabulary:", vectorizer.vocabulary_)  # Mapping of words to column indices
print("BOW Feature Matrix Shape:", X.shape)

# 2) Train a simple classifier (Naive Bayes)
X_train, X_test, y_train, y_test = train_test_split(X, labels,
                                                    test_size=0.33,
                                                    random_state=42)
clf = MultinomialNB()
clf.fit(X_train, y_train)

# 3) Evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the generated dataset: {accuracy:.2f}")
print("Test Predictions:", y_pred)
print("Test Ground Truth:", y_test)
print()


Try varying the total number of messages and the ratio of spam to not-spam. Here we generate half as many spam messages as not-spam messages; consider whether that ratio is realistic and how it affects the classifier's accuracy.
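
One way to reason about the effect of the class ratio is to compare against a trivial baseline that always predicts the most common label. The sketch below is an illustration only: `majority_baseline_accuracy` is a helper written for this example, the second ratio is hypothetical, and the baseline is computed over all labels rather than over a held-out test split.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a 'classifier' that always predicts the most common label."""
    most_common_label, most_common_count = Counter(labels).most_common(1)[0]
    return most_common_label, most_common_count / len(labels)

# Same class ratio as the generated dataset: 5 spam, 10 not_spam
balanced_ish = ["spam"] * 5 + ["not_spam"] * 10
label, baseline = majority_baseline_accuracy(balanced_ish)
print(f"Always predicting {label!r} scores {baseline:.2f} accuracy")

# A far more skewed (hypothetical) ratio: 1 spam message per 99 others
skewed = ["spam"] * 1 + ["not_spam"] * 99
label, baseline = majority_baseline_accuracy(skewed)
print(f"Always predicting {label!r} scores {baseline:.2f} accuracy")
```

If a trained classifier does not beat this baseline, a high accuracy figure says more about the class ratio than about the model.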



---



## 2. N-Gram Language Model:

What is an N-Gram Model?
- An N-Gram model is an extension of the Bag-of-Words (BoW) model that captures sequences of N consecutive words (or tokens) in the text.
- Instead of treating individual words as independent features, N-Grams represent combinations of words to preserve some context.
- For example:
  - For a text: "I love machine learning", the N-Grams with N=2 (bigrams) are:
    - ["I love", "love machine", "machine learning"]
  - For N=3 (trigrams), the output would be:
    - ["I love machine", "love machine learning"]
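
The bigram and trigram lists above can be reproduced with a few lines of Python. This is a minimal sketch that splits on whitespace; `generate_ngrams` is a name chosen for this example, not a library function.

```python
def generate_ngrams(text, n):
    """Return all runs of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I love machine learning"
bigrams = generate_ngrams(sentence, 2)
trigrams = generate_ngrams(sentence, 3)
print(bigrams)
print(trigrams)
```

A sentence of `T` tokens yields `T - N + 1` N-Grams, so a text shorter than `N` tokens produces an empty list.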

Why Use N-Grams?
- **Preserving Context**: Unlike BoW, N-Grams capture the relationship between words, providing limited context.
- **Improved Performance**: Useful in tasks like sentiment analysis, where the sequence of words matters (e.g., "not good" vs. "very good").
- **Better Features**: Helps in distinguishing phrases and expressions that cannot be inferred from individual words alone.

How Does an N-Gram Model Work?
1. **Tokenization**: Break the text into individual tokens.
2. **N-Gram Generation**: Combine N consecutive tokens into phrases.
3. **Vocabulary Creation**: Create a list of unique N-Grams across the corpus.
4. **Vectorization**: Represent each document as a vector of N-Gram frequencies or binary indicators.
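
As a rough sketch of these four steps, the example below builds bigram count vectors for two short reviews. The reviews and the names `ngram_counts`, `bigram_vocab`, and `bigram_vectors` are assumptions made for illustration, and tokenization is simple whitespace splitting.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count the n-grams (as token tuples) in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

reviews = ["the film was not good", "the film was very good"]

# Steps 1-3: tokenize, generate bigrams, and collect the shared vocabulary
bigram_vocab = sorted({gram for review in reviews for gram in ngram_counts(review, 2)})

# Step 4: one count vector per document, one column per bigram
bigram_vectors = [[ngram_counts(review, 2)[gram] for gram in bigram_vocab]
                  for review in reviews]

for gram in bigram_vocab:
    print(gram)
print(bigram_vectors)
```

With single-word features, both reviews would share the column for "good". With bigrams, "not good" and "very good" become separate columns, which lets a downstream classifier tell the two reviews apart.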

Usefulness of N-Gram Models
- **Contextual Representation**: Captures short-term dependencies between words, useful in many NLP tasks.
- **Flexibility**: Can be applied with varying values of N to balance between word independence (N=1) and full dependency modeling (higher N).
- **Baseline Model**: Offers a more context-aware representation than BoW, serving as a bridge to more advanced models.

Limitations of N-Gram Models
1. **Data Sparsity**:
   - Larger N values exponentially increase the number of possible N-Grams, making the feature space sparse.
   - Requires a lot of data to avoid overfitting.
2. **Loss of Long-Range Context**:
   - Captures only local relationships between words; cannot model long-range dependencies.
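   - Example: in "The dog that lived next door barked", the subject "dog" and the verb "barked" are four words apart, so a bigram or trigram model never sees them in the same window, unlike in "The dog barked".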
3. **High Computational Cost**:
   - Larger N-Grams lead to high dimensionality, increasing storage and computation requirements.
4. **Vocabulary Explosion**:
   - The number of unique N-Grams grows rapidly with the size of the corpus and the value of N.
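
The sparsity and vocabulary-growth points can be checked on a toy corpus. The sketch below compares the number of distinct N-Grams actually observed with the number that are theoretically possible (`vocab_size ** n`). The text and the helper `count_unique_ngrams` are assumptions for this example.

```python
def count_unique_ngrams(tokens, n):
    """Number of distinct n-grams observed in a token list."""
    return len({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})

toy_text = (
    "i love natural language processing . i love machine learning . "
    "i enjoy writing code for nlp tasks . language models can generate text ."
)
toy_tokens = toy_text.split()
vocab_size = len(set(toy_tokens))

for n in (1, 2, 3):
    observed = count_unique_ngrams(toy_tokens, n)
    possible = vocab_size ** n
    print(f"N={n}: {observed} observed out of {possible} possible n-grams")
```

The number of possible N-Grams grows exponentially with `N`, while the number observed can never exceed the length of the corpus. The gap between the two is why larger values of `N` need far more data.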

When to Use N-Gram Models?
- For tasks where word sequences and short-term context are important.
- When working with small to medium datasets where advanced models like transformers are impractical.
- As a baseline to compare against more sophisticated context-aware models.

In [None]:
import nltk
nltk.download('punkt_tab')

In [None]:
from nltk.tokenize import word_tokenize
from collections import defaultdict, Counter
import random

"""
2. N-GRAM LANGUAGE MODEL EXAMPLE:
    - We'll implement a basic bigram model: P(word_n | word_n-1).
    - We'll train it on a small corpus and then generate random text.
"""
print("=== N-Gram Language Model Example ===")

# Step 1: Create a small corpus
# This corpus contains a few sentences about NLP and machine learning.
corpus = """
I love natural language processing.
I love machine learning.
I enjoy writing code for NLP tasks.
Language models can generate text.
"""

# Step 2: Tokenize the corpus
# Convert the text into a list of individual tokens.
# word_tokenize splits punctuation (e.g. ".") into separate tokens;
# lowercasing first means "Language" and "language" count as the same word.
tokens = word_tokenize(corpus.lower())  # Convert to lowercase for consistency
print("Tokens in the corpus:", tokens)

# Step 3: Count bigrams
# Create a nested dictionary where bigram_counts[word1][word2] = frequency.
# This means we count how many times a specific word is followed by another word.
bigram_counts = defaultdict(Counter)
for i in range(len(tokens) - 1):  # Loop through tokens, stopping before the last one
    first_word = tokens[i]
    second_word = tokens[i + 1]
    bigram_counts[first_word][second_word] += 1  # Increment the count for the bigram

# Step 4: Inspect bigram counts
# Print the bigrams that follow the word "i" as an example.
print("\nExample bigram counts for 'i':", bigram_counts["i"])

# Step 5: Define a function to compute the next word distribution
def next_word_distribution(current_word):
    """
    Given the current_word, return a list of (next_word, probability) tuples.
    Probabilities are computed by normalizing the raw frequency counts.
    """
    next_word_counts = bigram_counts[current_word]  # Get the counts of words following current_word
    total_count = sum(next_word_counts.values())  # Total number of bigrams that start with current_word
    distribution = []
    for word, count in next_word_counts.items():
        # Calculate the probability of each next_word
        distribution.append((word, count / total_count))
    return distribution

# Step 6: Generate text using the bigram model
# Start with a random word from the corpus and generate text by sampling from bigram probabilities.
current_word = random.choice(list(bigram_counts.keys()))  # Pick a random starting word
generated = [current_word]  # Initialize the generated text with the starting word
for _ in range(10):  # Generate 10 additional words
    dist = next_word_distribution(current_word)  # Get the distribution of next words
    if not dist:  # If there are no following words, stop generation
        break
    words, probs = zip(*dist)  # Separate the words and their probabilities
    chosen = random.choices(words, weights=probs, k=1)[0]  # Sample the next word based on probabilities
    generated.append(chosen)  # Add the chosen word to the generated text
    current_word = chosen  # Update the current word

# Step 7: Display the generated text
# Combine the generated words into a sentence and print it.
print("\nGenerated Text (Bigram Model):")
print(" ".join(generated))
print()
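
The cell above samples the next word at random in proportion to its probability. To instead select the single most likely next word, one option is to take the most frequent follower. The sketch below is self-contained and uses whitespace splitting rather than NLTK; `train_bigram_counts`, `most_likely_next`, and the toy tokens are names and data assumed for this example.

```python
from collections import Counter, defaultdict

def train_bigram_counts(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        counts[first][second] += 1
    return counts

def most_likely_next(counts, word):
    """Return (next_word, probability) for the most frequent follower, or None."""
    followers = counts.get(word)  # .get avoids creating an entry for unseen words
    if not followers:
        return None
    next_word, count = followers.most_common(1)[0]
    return next_word, count / sum(followers.values())

toy_tokens = ("i love natural language processing . "
              "i love machine learning . i enjoy writing code .").split()
toy_counts = train_bigram_counts(toy_tokens)

print(most_likely_next(toy_counts, "i"))
print(most_likely_next(toy_counts, "unseen"))
```

Always taking the top word makes generation deterministic and prone to loops, which is one reason the cell above samples. When two followers are tied, `most_common` returns the one that was seen first.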

## 3. Named Entity Recognition (NER) / Classification:

What is Named Entity Recognition (NER)?
- Named Entity Recognition (NER) is a subtask of natural language processing (NLP) that identifies and classifies named entities in text into predefined categories.
- Examples of named entities include:
  - **Person**: Names of individuals (e.g., "Albert Einstein").
  - **Organization**: Companies, institutions, or groups (e.g., "Google", "UN").
  - **Location**: Geographic locations (e.g., "Paris", "Mount Everest").
  - **Date/Time**: Temporal expressions (e.g., "January 1, 2024", "3 PM").
  - **Others**: Product names, monetary values, percentages, etc.

Why Use NER?
- **Information Extraction**: Automatically extract structured information from unstructured text.
- **Question Answering**: Identify relevant entities in a document to answer specific queries.
- **Data Enrichment**: Annotate and link entities to external knowledge bases (e.g., Wikipedia).
- **Content Analysis**: Analyze trends or patterns involving specific entities in large text corpora.

How Does NER Work?
1. **Text Input**: The raw text to be analyzed.
2. **Tokenization**: Break down the text into individual tokens (words or subwords).
3. **Entity Recognition**: Use a pre-trained NER model or custom-trained model to identify entities.
4. **Entity Classification**: Assign each entity to one of the predefined categories.
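
To make the four steps tangible without a trained model, here is a toy sketch that recognizes entities by looking them up in a hand-written dictionary (a "gazetteer"). This is not how spaCy's statistical model works. The table `GAZETTEER` and the function `toy_ner` are hypothetical and exist only for this illustration; the labels simply mimic names such as `PERSON`, `ORG`, and `GPE` used by spaCy's English models.

```python
# Hypothetical gazetteer: a hand-written lookup table, not a trained model
GAZETTEER = {
    ("barack", "obama"): "PERSON",
    ("united", "states"): "GPE",
    ("paris",): "GPE",
    ("apple", "inc"): "ORG",
}
MAX_ENTITY_LENGTH = max(len(key) for key in GAZETTEER)

def toy_ner(text):
    """Return (entity_text, label) pairs found by longest-match dictionary lookup."""
    # Step 2: naive tokenization (split on spaces, strip surrounding punctuation)
    tokens = [token.strip(".,!?") for token in text.split()]
    entities = []
    i = 0
    while i < len(tokens):
        # Steps 3-4: try the longest candidate span first, then shorter ones
        for length in range(min(MAX_ENTITY_LENGTH, len(tokens) - i), 0, -1):
            span = tokens[i:i + length]
            label = GAZETTEER.get(tuple(token.lower() for token in span))
            if label:
                entities.append((" ".join(span), label))
                i += length
                break
        else:
            i += 1  # no entity starts at this token
    return entities

print(toy_ner("Barack Obama visited Paris before returning to the United States."))
```

A lookup table cannot handle names it has never seen, and it cannot use context to resolve ambiguity, which is why practical systems rely on trained models instead.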

Tools for NER
- **spaCy**: A popular NLP library that provides pre-trained models for NER.
- **Hugging Face Transformers**: Provides pre-trained NER models built on transformer architectures such as BERT and RoBERTa.
- **Custom Models**: Train your own NER models using labeled data.

Usefulness of NER
- **Real-Time Applications**: Powering chatbots, virtual assistants, and recommendation systems.
- **Business Intelligence**: Extracting actionable insights from text (e.g., contracts, news).
- **Healthcare**: Identifying medical terms, drug names, or patient details in clinical notes.
- **Finance**: Analyzing financial documents for entities like companies and stock symbols.

Limitations of NER
1. **Domain Dependence**:
   - Pre-trained models may not perform well in specialized domains (e.g., legal, biomedical).
   - Requires domain-specific training data for customization.
2. **Ambiguity**:
   - Entities can have multiple meanings (e.g., "Apple" as a company vs. fruit).
   - Context is critical for disambiguation, which can be challenging for models.
3. **Language Limitations**:
   - Models trained on one language may not generalize well to others without additional training.
4. **Scalability**:
   - Processing large datasets in real time may require significant computational resources.

When to Use NER?
- To identify key entities in large unstructured datasets for downstream analysis.
- When building systems that rely on entity-specific logic, such as search engines or recommendation systems.
- In tasks requiring structured data extraction from text, like contract analysis or customer feedback processing.


In [None]:
# If you are using Colab, uncomment the following line to install spaCy
# !pip install -qqq spacy
# Download the small English language model
!python -m spacy download en_core_web_sm

In [None]:
# Import necessary libraries
import spacy
import random
import time
import matplotlib.pyplot as plt

# Load the spaCy language model
nlp = spacy.load("en_core_web_sm")  # Load the small English model for NLP tasks

# Step 1: Create a list of sample texts
texts = [
    "Barack Obama was the 44th President of the United States.",
    "Apple Inc. is headquartered in Cupertino, California.",
    "The Eiffel Tower is a famous landmark in Paris, France.",
    "Marie Curie was a pioneering scientist in the field of radioactivity.",
    "The Amazon rainforest is the largest tropical rainforest in the world."
]

# Function to process text with a simulated delay
def process_text_with_delay(text):
    """
    Processes a given text with spaCy to extract named entities.
    Adds a small random delay to simulate longer processing times.

    Args:
        text (str): The input text to process.

    Returns:
        doc: spaCy processed Doc object containing parsed text and entities.
    """
    doc = nlp(text)  # Process the text with the spaCy model
    time.sleep(random.uniform(0.5, 2))  # Introduce a delay (0.5 to 2 seconds) for simulation
    return doc

# Step 2: Process each text and extract named entities
all_entities = []  # List to store all extracted entities across texts
for text in texts:
    doc = process_text_with_delay(text)  # Process the text
    # Extract entities as tuples of (text, label, explanation)
    entities = [(ent.text, ent.label_, spacy.explain(ent.label_)) for ent in doc.ents]
    all_entities.extend(entities)  # Add entities to the global list

    # Print extracted entities for the current text
    print(f"Entities in '{text}':")
    for ent_text, ent_label, ent_explanation in entities:
        print(f"  - {ent_text} ({ent_label}): {ent_explanation}")
    print("-" * 20)  # Separator for better readability

# Step 3: Analyze entity frequencies
entity_frequencies = {}  # Dictionary to store frequencies of entities by type
for ent_text, ent_label, _ in all_entities:
    # Update the frequency of each entity under its label
    if ent_label not in entity_frequencies:
        entity_frequencies[ent_label] = {}
    if ent_text in entity_frequencies[ent_label]:
        entity_frequencies[ent_label][ent_text] += 1
    else:
        entity_frequencies[ent_label][ent_text] = 1

# Print a summary of entity frequencies
print("\nEntity Frequencies:")
for ent_label, entities in entity_frequencies.items():
    print(f"  {ent_label}:")
    for ent_text, frequency in entities.items():
        print(f"    - {ent_text}: {frequency}")

# Step 4: Visualize entity type frequencies using a bar chart
entity_labels = list(entity_frequencies.keys())  # Extract entity labels
entity_counts = [sum(entities.values()) for entities in entity_frequencies.values()]  # Total entity mentions per label

plt.figure(figsize=(10, 5))  # Set figure size
plt.bar(entity_labels, entity_counts, color='skyblue')  # Create a bar chart
plt.title('Frequency of Entity Types')  # Add a title to the chart
plt.xlabel('Entity Type')  # Label for the x-axis
plt.ylabel('Frequency')  # Label for the y-axis
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()  # Adjust layout to prevent overlapping labels
plt.show()  # Display the chart