# Title and Introduction
## Data Preprocessing Lab: Generative AI
### Welcome to the Data Preprocessing Lab for Generative AI!
   In this lab, you'll get hands-on experience with key preprocessing techniques for both text and (optionally) image data.

   **Learning Objectives:**


*  Understand and apply core data preprocessing techniques.
*  Explore word embedding techniques (Word2Vec/GloVe, BERT).
*  Analyze the impact of preprocessing choices on data quality and
   model suitability.
*  Practice using cosine similarity for comparing embeddings.

## Part 1: Environment Setup

First, we'll install and import all necessary libraries. Run the following cell to set up your environment.

In [None]:
# SECTION 1: Environment Setup
#############################
# This cell installs and imports all necessary libraries for our text preprocessing pipeline.
# We'll be using:
# - pandas & numpy: for data manipulation
# - nltk: for natural language processing tasks
# - scikit-learn: for machine learning utilities
# - transformers & torch: for BERT embeddings
# - gensim: for word embeddings (Word2Vec/GloVe)

# Install required packages
%pip install pandas numpy nltk scikit-learn transformers torch datasets gensim seaborn matplotlib

# TODO: Import the required libraries
# Hint: You need pandas, numpy, nltk, and sklearn components
# YOUR CODE HERE - import the basic libraries
import pandas as pd
import numpy as np
import nltk
# Add more imports as needed...

# These are more advanced imports you'll need later
from transformers import BertTokenizer, BertModel
import torch
import gensim.downloader as api
from gensim.models import KeyedVectors

# Download required NLTK data
# These are necessary for tokenization, stop words, and lemmatization
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

print("Setup complete! All required libraries have been imported.")



Setup complete! All required libraries have been imported.


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## Part 2: Loading and Exploring the BBC News Dataset

We'll now load the BBC News dataset you used in previous assignments and perform an initial exploration of its contents.

In [None]:
# SECTION 2: Data Loading and Initial Exploration
##############################################
# Here we load the BBC News dataset and perform initial analysis
# Understanding our data is crucial before applying any preprocessing

# Load the dataset
df = pd.read_csv('bbc-news-data.csv', encoding='latin1')

# TODO: Rename the columns to match our processing pipeline
# Hint: The original columns are 'Text' and 'Category'
# YOUR CODE HERE
df = df.rename(columns={'Text':'text', 'Category': 'category'})

# TODO: Perform basic data exploration
# TASK 1: Display the first few rows and basic information about the dataset
# Hint: Use pandas' head(), info(), and describe() methods
# YOUR CODE HERE
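# Only the last expression in a cell is displayed automatically,
# so print the intermediate results explicitly.
print(df.head())
df.info()  # info() prints its summary directly
print(df.describe())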

# TASK 2: Analyze the distribution of categories
# Hint: Use value_counts() on the category column
# YOUR CODE HERE
print(df['category'].value_counts())

# TASK 3: Calculate and display basic text statistics
# Compute each article's length in characters (averaged per category below)
df['text_length'] = df['text'].str.len()

# TODO: Create a visualization of text lengths by category
# Hint: Use seaborn's boxplot
# YOUR CODE HERE
import seaborn as sns
import matplotlib.pyplot as plt

# Create a boxplot to visualize text lengths by category
plt.figure(figsize=(10, 6))
sns.boxplot(x='category', y='text_length', data=df)
plt.title('Text Lengths by Category')
plt.xlabel('Category')
plt.ylabel('Text Length (characters)')
plt.show()

# Display your findings
print("\nDataset Statistics:")
# TODO: Add code to display your findings
# YOUR CODE HERE
print("\n📊 Average Text Length per Category:")
print(df.groupby('category')['text_length'].mean())

print("\n✅ Dataset exploration complete!")

ParserError: Error tokenizing data. C error: Expected 30 fields in line 7, saw 34
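
A `ParserError` like the one above reports rows with inconsistent field counts, which is commonly caused by `read_csv` splitting on the wrong delimiter. The sketch below shows one way to check this on a few sample rows. The real delimiter of `bbc-news-data.csv` is not stated in this lab, so the helper `guess_delimiter` and the sample rows are hypothetical illustrations only.

```python
import csv
import io

def guess_delimiter(sample_text, candidates=(",", "\t", ";", "|")):
    """Return the candidate that splits every row into the same number (>1) of fields."""
    best = None
    for delim in candidates:
        rows = list(csv.reader(io.StringIO(sample_text), delimiter=delim))
        counts = {len(row) for row in rows if row}
        if len(counts) == 1:
            n_fields = counts.pop()
            # Prefer the delimiter that produces the most columns consistently
            if n_fields > 1 and (best is None or n_fields > best[1]):
                best = (delim, n_fields)
    return best[0] if best else None

# Hypothetical rows: tab-separated fields whose text contains commas
sample = (
    "category\ttitle\tcontent\n"
    "business\tAd sales boost\tProfits rose, helped by ads, search, and more.\n"
    "sport\tCup final\tA late goal, then extra time.\n"
)
detected = guess_delimiter(sample)
print(repr(detected))
```

If the detected delimiter is not a comma, passing it to `pd.read_csv` through the `sep` argument is the usual fix.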


### Comprehension Questions - Data Exploration

Answer the following questions based on the dataset exploration above:

1. What are the dimensions of our dataset?
2. How many different categories are there in the news articles?
3. Is the dataset balanced across categories? Why might this matter?
4. Are there any missing values that need to be addressed?

## Part 3: Text Preprocessing

We'll now implement basic text preprocessing steps to clean our data.

In [None]:
# SECTION 3: Text Cleaning and Preprocessing
########################################
# This section implements fundamental text preprocessing steps:
# 1. Converting to lowercase (why? -> maintains consistency)
# 2. Removing special characters (why? -> reduces noise)
# 3. Handling whitespace (why? -> standardizes format)

########################################

def clean_text(text):
    """
    Performs basic text cleaning operations.

    Parameters:
    text (str): Input text to be cleaned

    Returns:
    str: Cleaned text
    """
    # TODO: Implement the following steps:
    # 1. Convert to lowercase
    # 2. Remove URLs and emails
    # 3. Remove special characters but keep sentence structure
    # 4. Remove extra whitespace
    # Hint: Use string methods and regular expressions

    # YOUR CODE HERE
    return text  # Replace with your cleaned text

# Test the function with a sample
sample_text = "Hello, World! This is a TEST... 123"
print("Original:", sample_text)
print("Cleaned:", clean_text(sample_text))

# Apply to the entire dataset
df['cleaned_text'] = df['text'].apply(clean_text)
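
The four cleaning steps listed in the TODO can be sketched with regular expressions as follows. This is one possible approach, not the reference solution: the function name `clean_text_sketch`, the exact patterns, and the choice to keep `. , ! ? '` as "sentence structure" are all assumptions.

```python
import re

def clean_text_sketch(text):
    """Lowercase, strip URLs/emails, drop special characters, collapse whitespace."""
    text = text.lower()
    # Remove URLs (http/https or starting with www.)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Remove email addresses
    text = re.sub(r"\S+@\S+\.\S+", " ", text)
    # Keep letters, digits, whitespace, and basic sentence punctuation
    text = re.sub(r"[^a-z0-9\s.,!?']", " ", text)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(clean_text_sketch("Hello, World! This is a TEST... 123"))
print(clean_text_sketch("Visit https://example.com or mail me@example.com  NOW #news"))
```

Whether digits and punctuation should survive depends on the downstream model: averaged GloVe vectors gain little from punctuation, while BERT's tokenizer is designed to handle it.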




## Part 4: Tokenization and Advanced Processing

Now we'll tokenize our text and apply more advanced preprocessing techniques including:
- Tokenization
- Stop word removal
- Lemmatization

In [None]:
# SECTION 4: Tokenization and Advanced Processing
#############################################
# This section implements more sophisticated NLP techniques:
# - Tokenization: splitting text into words
# - Stop word removal: removing common words
# - Lemmatization: reducing words to their base form
# Check whether you need to install any additional libraries
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Initialize our tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def tokenize_and_process(text):
    """
    Performs advanced text processing including tokenization,
    stop word removal, and lemmatization.

    Parameters:
    text (str): Cleaned text to process

    Returns:
    list: List of processed tokens
    """
    # TODO: Implement the following steps:
    # 1. Tokenize the text
    # 2. Remove stop words
    # 3. Apply lemmatization
    # Hint: Use the initialized stop_words and lemmatizer

    # YOUR CODE HERE
    tokens = []  # Replace with actual tokenization
    processed_tokens = []  # Replace with processed tokens

    return processed_tokens

# Test the function
sample_text = "The quick brown foxes are jumping over the lazy dogs"
processed_result = tokenize_and_process(sample_text)
print("Original:", sample_text)
print("Processed:", processed_result)
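
The shape of the tokenize, filter, lemmatize pipeline can be illustrated without downloading any NLTK data. In the sketch below, `process_tokens`, the tiny stop word set, and the lookup-table lemmatizer are hypothetical stand-ins; in the lab you would pass the NLTK `stop_words` set and `lemmatizer.lemmatize` instead.

```python
import re

def process_tokens(text, stop_words, lemmatize):
    """Tokenize on letters, drop stop words, then lemmatize what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatize(tok) for tok in tokens if tok not in stop_words]

# Toy stand-ins (assumptions for illustration, not NLTK's actual resources)
toy_stop_words = {"the", "are", "over"}
toy_lemmas = {"foxes": "fox", "dogs": "dog", "jumping": "jump"}

result = process_tokens(
    "The quick brown foxes are jumping over the lazy dogs",
    toy_stop_words,
    lambda tok: toy_lemmas.get(tok, tok),  # unknown words pass through unchanged
)
print(result)
```

Note that NLTK's `WordNetLemmatizer.lemmatize` treats each word as a noun by default, so verb forms such as "jumping" may be left unchanged unless a part-of-speech argument like `pos='v'` is supplied.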

## Part 5: Word Embeddings with GloVe

We'll now generate word embeddings using pre-trained GloVe vectors. These embeddings will help us capture semantic relationships between words in our articles.

In [None]:
# SECTION 5: Word Embeddings with GloVe
###################################
# This section generates word embeddings using pre-trained GloVe vectors
# Word embeddings capture semantic relationships between words
# by representing them as dense vectors in a high-dimensional space

# Load pre-trained GloVe embeddings
glove_model = api.load("glove-wiki-gigaword-100")

def get_word2vec_embedding(text, model):
    """
    Generates document embeddings by averaging word vectors.

    Parameters:
    text (str): Input text
    model: Pre-trained word embedding model

    Returns:
    numpy.array: Document embedding vector
    """
    # TODO: Implement the following steps:
    # 1. Tokenize the input text
    # 2. Get embedding for each token
    # 3. Average the embeddings
    # Hint: Handle words not in vocabulary

    # YOUR CODE HERE
    embeddings = []  # Replace with actual embeddings

    return np.mean(embeddings, axis=0) if embeddings else np.zeros(model.vector_size)

# Apply to a sample of the dataset
sample_size = 100
sample_df = df.head(sample_size).copy()
sample_df['glove_embedding'] = sample_df['cleaned_text'].apply(
    lambda x: get_word2vec_embedding(x, glove_model)
)
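
Averaging word vectors while skipping out-of-vocabulary tokens can be sketched with plain NumPy. Here a small dictionary stands in for the GloVe model, and `average_embedding` and `toy_vectors` are hypothetical names. gensim's `KeyedVectors` supports the same `token in model` and `model[token]` operations, so the pattern should carry over.

```python
import numpy as np

def average_embedding(tokens, vectors, dim):
    """Mean of the vectors for known tokens; a zero vector if none are known."""
    found = [vectors[tok] for tok in tokens if tok in vectors]
    if not found:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(found, axis=0)

# Toy 3-dimensional "embeddings" (made-up values for illustration)
toy_vectors = {
    "market": np.array([1.0, 0.0, 2.0]),
    "shares": np.array([3.0, 2.0, 0.0]),
}

doc_vec = average_embedding(["market", "shares", "zzz"], toy_vectors, dim=3)
empty_vec = average_embedding(["zzz"], toy_vectors, dim=3)
print(doc_vec, empty_vec)
```

Returning a zero vector for documents with no known tokens keeps the column shape uniform, but such rows have no direction, so their cosine similarity to anything is undefined and needs separate handling.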

## Part 6: BERT Embeddings

Now we'll use BERT to generate contextual embeddings. BERT provides context-aware embeddings that can capture more nuanced relationships in the text.

In [None]:
# SECTION 6: BERT Embeddings
#########################
# This section implements BERT (Bidirectional Encoder Representations from Transformers)
# BERT provides context-aware embeddings, meaning the same word can have different
# embeddings based on its context in the sentence.
# Key differences from GloVe:
# - Contextual (words have different vectors based on context)
# - Deep bidirectional (considers both left and right context)
# - Pre-trained with masked language modeling on large corpora
#   (GloVe is also pre-trained, but yields one static vector per word)


# Load BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_embedding(text, max_length=512):
    """
    Generates BERT embeddings for a text.

    Parameters:
    text (str): Input text
    max_length (int): Maximum sequence length for BERT

    Returns:
    numpy.array: BERT embedding vector
    """
    # TODO: Implement the following steps:
    # 1. Tokenize the text using BERT tokenizer
    # 2. Generate BERT embeddings
    # 3. Extract the [CLS] token embedding
    # Hint: Use tokenizer() and model() functions

    # YOUR CODE HERE
    # Step 1: Tokenize
    inputs = None  # Replace with actual tokenization

    # Step 2: Generate embeddings
    with torch.no_grad():
        outputs = None  # Replace with model output

    # Step 3: Extract [CLS] token embedding
    sentence_embedding = None  # Replace with correct embedding extraction

    return sentence_embedding

# Test the function
test_text = "This is a test sentence for BERT embeddings."
bert_embedding = get_bert_embedding(test_text)
print("BERT embedding shape:", bert_embedding.shape)
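
Step 3 is an indexing operation on a tensor of shape `(batch, sequence_length, hidden_size)`, where the `[CLS]` token sits at position 0. The sketch below uses a random NumPy array as a stand-in for the model output (in the lab, that tensor would come from `outputs.last_hidden_state`). The small sizes and variable names are assumptions for illustration. Masked mean pooling is shown alongside as a common alternative, not as part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for BERT output; hidden_size is 768 for bert-base-uncased
batch_size, seq_len, hidden_size = 2, 5, 8
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# 1 marks real tokens, 0 marks padding (second sequence has 2 padded positions)
attention_mask = np.array([[1, 1, 1, 1, 1],
                           [1, 1, 1, 0, 0]])

# [CLS] embedding: first token of every sequence
cls_embeddings = last_hidden_state[:, 0, :]

# Alternative: average over real tokens only, ignoring padding
mask = attention_mask[:, :, None]
mean_embeddings = (last_hidden_state * mask).sum(axis=1) / mask.sum(axis=1)

print(cls_embeddings.shape, mean_embeddings.shape)
```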

## Part 7: Comparing Embeddings

Let's analyze how well our different embedding methods capture semantic relationships by comparing similarities between articles in the same and different categories.

In [None]:
# SECTION 7: Similarity Analysis
############################
# This section implements methods to compare different embedding approaches
# We'll analyze how well each embedding type captures semantic relationships
# by comparing similarities between articles in the same and different categories
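
Later cells expect similarity matrices named `glove_similarities` and `bert_similarities`. A minimal sketch of building such a matrix is shown below: normalize each embedding to unit length, then take dot products. The helper name `cosine_similarity_matrix` and the toy vectors are assumptions; `sklearn.metrics.pairwise.cosine_similarity` is an existing alternative.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    X = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero; zero rows stay zero
    unit = X / norms
    return unit @ unit.T

toy = np.array([[1.0, 0.0],
                [2.0, 0.0],   # same direction as row 0
                [0.0, 3.0],   # orthogonal to row 0
                [0.0, 0.0]])  # zero vector (e.g. no known tokens)
sims = cosine_similarity_matrix(toy)
print(np.round(sims, 2))
```

In the lab this could be applied as `cosine_similarity_matrix(np.vstack(sample_df['glove_embedding'].to_list()))`, assuming that column holds one fixed-length vector per article.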


## Part 8: Detailed Similarity Analysis

In [None]:
# SECTION 8: Detailed Similarity Analysis
####################################
# This section analyzes how well our embeddings capture
# semantic relationships between articles

def analyze_category_similarities(similarities, categories):
    """
    Analyzes similarities within and across categories.

    Parameters:
    similarities (numpy.array): Similarity matrix
    categories (list): List of category labels

    Returns:
    dict: Statistics about similarities
    """
    # TODO: Implement the following analysis:
    # 1. Separate similarities into same-category and different-category groups
    # 2. Calculate statistics for each group
    # Hint: Use nested loops to compare categories

    # YOUR CODE HERE
    same_category_sims = []
    diff_category_sims = []

    # Your implementation here

    return {
        'same_category': {
            'mean': np.mean(same_category_sims),
            'std': np.std(same_category_sims)
        },
        'diff_category': {
            'mean': np.mean(diff_category_sims),
            'std': np.std(diff_category_sims)
        }
    }

# Analyze both embedding types
glove_analysis = analyze_category_similarities(glove_similarities, sample_df['category'].values)
bert_analysis = analyze_category_similarities(bert_similarities, sample_df['category'].values)

# Print results
print("=== Similarity Analysis Results ===")
# TODO: Format and display the analysis results
# YOUR CODE HERE
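
The same-category versus different-category split can also be done without nested loops by using boolean masks. The sketch below is one way to do it, with a hypothetical helper `split_similarities` and a made-up 3x3 matrix. It uses only the upper triangle so each pair is counted once and self-similarities on the diagonal are excluded.

```python
import numpy as np

def split_similarities(similarities, categories):
    """Return (same_category_sims, diff_category_sims) over unique article pairs."""
    sims = np.asarray(similarities, dtype=float)
    labels = np.asarray(categories)
    same_label = labels[:, None] == labels[None, :]
    # k=1 skips the diagonal, so an article is never compared with itself
    upper = np.triu(np.ones(sims.shape, dtype=bool), k=1)
    return sims[upper & same_label], sims[upper & ~same_label]

toy_sims = np.array([[1.0, 0.9, 0.2],
                     [0.9, 1.0, 0.1],
                     [0.2, 0.1, 1.0]])
toy_labels = ["sport", "sport", "tech"]

same, diff = split_similarities(toy_sims, toy_labels)
print("same:", same, "diff:", diff)
```

Including the diagonal would inflate the same-category mean, because every article has similarity 1.0 with itself.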

## Part 9: Visualizations

In [None]:
# SECTION 9: Visualization of Results
################################
# This section creates visualizations to help us understand
# the differences between our embedding approaches
################################

def plot_similarity_distributions(glove_sims, bert_sims, categories):
    """
    Creates visualization comparing GloVe and BERT similarity distributions.

    Parameters:
    glove_sims (numpy.array): GloVe similarity matrix
    bert_sims (numpy.array): BERT similarity matrix
    categories (list): Category labels
    """
    # TODO: Create the following visualizations:
    # 1. Histogram or density plot of similarities
    # 2. Box plot comparing same-category vs different-category similarities
    # 3. Add appropriate labels and titles
    # Hint: Use plt.subplots() for multiple plots

    # YOUR CODE HERE
    plt.figure(figsize=(15, 5))

    # Add your visualization code here

    plt.tight_layout()
    plt.show()

# Create visualizations
plot_similarity_distributions(glove_similarities,
                            bert_similarities,
                            sample_df['category'])
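
A possible layout for the requested plots is sketched below using `plt.subplots()`, as the hint suggests. The function `plot_same_vs_diff` is a hypothetical helper, and the similarity values are synthetic random numbers generated only so the example runs on its own; they say nothing about how GloVe or BERT actually behave.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_same_vs_diff(same_sims, diff_sims, method_name):
    """Histogram and box plot of same- vs different-category similarities."""
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(12, 4))

    ax_hist.hist(same_sims, bins=20, alpha=0.6, label="same category")
    ax_hist.hist(diff_sims, bins=20, alpha=0.6, label="different category")
    ax_hist.set_xlabel("Cosine similarity")
    ax_hist.set_ylabel("Number of article pairs")
    ax_hist.set_title(f"{method_name}: similarity distribution")
    ax_hist.legend()

    ax_box.boxplot([same_sims, diff_sims])
    ax_box.set_xticks([1, 2])
    ax_box.set_xticklabels(["same", "different"])
    ax_box.set_ylabel("Cosine similarity")
    ax_box.set_title(f"{method_name}: same vs different category")

    fig.tight_layout()
    return fig

# Synthetic placeholder data (not real results)
rng = np.random.default_rng(42)
fake_same = rng.normal(0.7, 0.1, size=200)
fake_diff = rng.normal(0.5, 0.1, size=800)
fig = plot_same_vs_diff(fake_same, fake_diff, "Example method")
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards. Calling the helper once per embedding method gives directly comparable panels.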

## Part 10: Statistical Comparison

In [None]:
# SECTION 10: Statistical Comparison
###############################
# This section performs statistical tests to compare
# the effectiveness of different embedding approaches
###############################

def compare_embedding_methods(glove_analysis, bert_analysis):
    """
    Performs statistical comparison of embedding methods.

    Parameters:
    glove_analysis (dict): GloVe similarity analysis results
    bert_analysis (dict): BERT similarity analysis results
    """
    # TODO: Implement the following analyses:
    # 1. Calculate effect sizes for both methods
    # 2. Perform statistical tests comparing the methods
    # 3. Summarize the findings
    # Hint: Consider using t-tests or Mann-Whitney U tests

    # YOUR CODE HERE

    # Print summary of findings
    print("=== Statistical Comparison Results ===")
    # Add your summary here

# Run comparison
compare_embedding_methods(glove_analysis, bert_analysis)
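
An effect size such as Cohen's d measures how far apart the same-category and different-category similarities are, in units of their pooled standard deviation. The sketch below computes it from raw similarity values; `cohens_d` and the toy numbers are assumptions for illustration. Note that the analysis dictionaries above hold only means and standard deviations, so tests such as `scipy.stats.ttest_ind` or `scipy.stats.mannwhitneyu` would need the raw similarity arrays instead.

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Made-up similarity values, chosen so the arithmetic is easy to follow
toy_same = [0.7, 0.8, 0.9]   # mean 0.8, sample std 0.1
toy_diff = [0.3, 0.4, 0.5]   # mean 0.4, sample std 0.1
d = cohens_d(toy_same, toy_diff)
print(f"Cohen's d = {d:.2f}")
```

A larger d for one embedding method suggests it separates categories more cleanly. Because pairwise similarities share articles and are therefore not independent, any p-values computed on them should be read with caution.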


## Part 11: Metrics and Evaluation

In [None]:
# SECTION 11: Performance Evaluation
##############################
# This section calculates various metrics to evaluate
# the quality of our embeddings

##############################

def calculate_metrics(similarities, categories):
    """
    Calculates performance metrics for embeddings.

    Parameters:
    similarities (numpy.array): Similarity matrix
    categories (list): Category labels

    Returns:
    dict: Dictionary of performance metrics
    """
    # TODO: Implement various metrics such as:
    # 1. Category separation score
    # 2. Silhouette score
    # 3. Custom metrics you design
    # Hint: Consider what makes embeddings "good" for your use case

    # YOUR CODE HERE
    metrics = {}

    return metrics

# Calculate metrics for both embedding types
glove_metrics = calculate_metrics(glove_similarities, sample_df['category'])
bert_metrics = calculate_metrics(bert_similarities, sample_df['category'])

# Display results
print("=== Performance Metrics ===")
# TODO: Format and display the metrics
# YOUR CODE HERE
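
As one example of a custom metric, the sketch below checks how often an article's most similar neighbor belongs to the same category. This is an illustrative choice rather than a metric prescribed by the lab; `nearest_neighbor_agreement` and the toy matrix are hypothetical.

```python
import numpy as np

def nearest_neighbor_agreement(similarities, categories):
    """Fraction of articles whose most similar other article shares their category."""
    sims = np.array(similarities, dtype=float)  # copy so the input is not modified
    labels = np.asarray(categories)
    np.fill_diagonal(sims, -np.inf)  # an article may not be its own neighbor
    nearest = sims.argmax(axis=1)
    return float(np.mean(labels[nearest] == labels))

toy_sims = np.array([[1.0, 0.9, 0.2, 0.1],
                     [0.9, 1.0, 0.3, 0.2],
                     [0.2, 0.3, 1.0, 0.4],
                     [0.1, 0.2, 0.4, 1.0]])
toy_labels = ["sport", "sport", "tech", "business"]

score = nearest_neighbor_agreement(toy_sims, toy_labels)
print(score)
```

Unlike a difference of means, this score depends only on the ranking of similarities, so it is unaffected by embedding methods that compress all similarities into a narrow range.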


## Part 12: Final Analysis Questions

1. Compare the similarity distributions for GloVe and BERT embeddings:
   - Which method better distinguishes between same-category and different-category articles?
   - What might explain the differences in performance?

2. Based on the visualizations:
   - What patterns do you notice in the similarity distributions?
   - Are there any unexpected results?

3. Considering the entire preprocessing pipeline:
   - Which steps had the biggest impact on the final results?
   - What additional preprocessing steps might improve the results?
   - How would you modify this pipeline for different types of text data?

4. Ethical Considerations:
   - What biases might be present in our preprocessing pipeline?
   - How might these biases affect the analysis of news articles?
   - What steps could we take to mitigate these biases?

### Assessment Criteria:

* Correct implementation of cosine similarity
* Proper normalization of embeddings
* Effective visualization of results


## Grading Rubric

* Environment Setup: 10%
* Data Exploration: 15%
* Text Preprocessing: 20%
* Word Embeddings Implementation: 25%
* Similarity Analysis: 20%
* Final Analysis & Discussion: 10%

## Common Issues and Solutions

1. Memory Issues:

* Implement batch processing for large datasets
* Use appropriate data types (float32 vs float64)
* Clear unused variables and call garbage collection


2. Performance Optimization:

* Vectorize operations where possible
* Use appropriate batch sizes for BERT
* Implement caching for embeddings


3. Error Handling:

* Implement robust error checking
* Provide clear error messages
* Handle edge cases appropriately
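
The batch processing and `float32` suggestions under Memory Issues can be combined in a small helper, sketched below. The names `iter_batches`, `embed_in_batches`, and `fake_embed_batch` are hypothetical, and the fake embedding function merely counts characters and spaces so the example runs without a model. In practice, a function that runs BERT on a list of texts would take its place.

```python
import numpy as np

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_batch, batch_size=16):
    """Embed texts batch by batch and stack the results as float32."""
    chunks = [
        np.asarray(embed_batch(batch), dtype=np.float32)
        for batch in iter_batches(texts, batch_size)
    ]
    if not chunks:
        return np.empty((0, 0), dtype=np.float32)
    return np.vstack(chunks)

def fake_embed_batch(batch):
    # Stand-in for a real model: [character count, space count] per text
    return [[len(text), text.count(" ")] for text in batch]

texts = ["a b", "hello", "one two three", "x", "y z"]
embeddings = embed_in_batches(texts, fake_embed_batch, batch_size=2)
print(embeddings.shape, embeddings.dtype)
```

Storing embeddings as `float32` halves their memory footprint relative to `float64`, and processing in batches keeps only one batch of model activations in memory at a time.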