# Week 1: Introduction to Natural Language Processing (NLP)

## Learning Objectives
- Understand the fundamentals of Natural Language Processing
- Learn about tokenization, embeddings, and vector representations
- Explore text preprocessing techniques
- Get a first look at basic language models

## Table of Contents
1. [What is NLP?](#what-is-nlp)
2. [Text Preprocessing](#text-preprocessing)
3. [Tokenization](#tokenization)
4. [Word Embeddings](#word-embeddings)
5. [Vector Representations](#vector-representations)
6. [Basic Language Models](#basic-language-models)
7. [Exercises](#exercises)

## What is NLP?

Natural Language Processing (NLP) is a field of artificial intelligence concerned with the interaction between computers and human language. It develops algorithms and models that can understand, interpret, and generate human language for practical tasks such as translation, search, and summarization.

In [None]:
# Install required packages
!pip install nltk spacy transformers torch numpy pandas matplotlib seaborn scikit-learn
!python -m spacy download en_core_web_sm

In [None]:
import nltk
import spacy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re

# Download NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer data from this resource
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

## Text Preprocessing

Text preprocessing is a crucial step in NLP that involves cleaning and preparing raw text data for analysis.

In [None]:
def preprocess_text(text):
    """Basic text preprocessing function"""
    # Convert to lowercase
    text = text.lower()
    
    # Keep only ASCII letters and whitespace
    # (drops digits, punctuation, and accented characters)
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Example
sample_text = "Hello World! This is a sample text with numbers 123 and symbols @#$."
preprocessed = preprocess_text(sample_text)
print(f"Original: {sample_text}")
print(f"Preprocessed: {preprocessed}")
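
The regex above silently drops accented letters, so "café" becomes "caf". As a minimal sketch, assuming accents should be folded to their ASCII base letters, the hypothetical helper `preprocess_text_ascii` (not part of the original notebook) normalizes Unicode before cleaning:

```python
import re
import unicodedata

def preprocess_text_ascii(text):
    """Fold accents to ASCII, then apply the same cleaning steps as above."""
    # NFKD splits a letter like 'é' into 'e' plus a combining accent mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping non-ASCII characters removes the combining marks
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    lowered = ascii_text.lower()
    # Keep only ASCII letters and whitespace
    letters_only = re.sub(r"[^a-z\s]", "", lowered)
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", letters_only).strip()

print(preprocess_text_ascii("Café déjà vu, 123!"))
```

Whether folding accents is appropriate depends on the language and the task; for some languages it changes word meanings.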

## Tokenization

Tokenization is the process of breaking down text into individual tokens (words, subwords, or characters).

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

# Sample text
text = "Natural Language Processing is fascinating. It helps computers understand human language."

# Word tokenization
words = word_tokenize(text)
print("Word tokens:", words)

# Sentence tokenization
sentences = sent_tokenize(text)
print("\nSentence tokens:", sentences)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words and word.isalpha()]
print("\nFiltered words:", filtered_words)
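
The cell above covers word and sentence tokens. The sketch below illustrates the other two granularities mentioned earlier, characters and subwords, using only the standard library. The function names and `toy_vocab` are hypothetical, and the greedy longest-match split is a simplified illustration, not the algorithm of any particular tokenizer library:

```python
import re

def word_tokens(text):
    # Words, plus each punctuation mark as its own token
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokens(text):
    # Every character (including spaces) becomes a token
    return list(text)

def greedy_subword_tokens(word, vocab):
    """Split a word by repeatedly taking the longest prefix found in vocab.

    Falls back to a single character when no vocabulary entry matches.
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one character long
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        tokens.append(word[start:end])
        start = end
    return tokens

# Made-up vocabulary for illustration only
toy_vocab = {"token", "ization", "un", "believ", "able"}

print(word_tokens("Tokenization is fun!"))
print(char_tokens("NLP"))
print(greedy_subword_tokens("tokenization", toy_vocab))
print(greedy_subword_tokens("unbelievable", toy_vocab))
```

Subword tokens let a model handle words it has never seen as a whole, at the cost of longer token sequences.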

## Word Embeddings

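Word embeddings are dense vector representations of words that capture semantic relationships: words used in similar contexts end up with nearby vectors. Note that the `en_core_web_sm` model loaded above does not ship static word vectors, so spaCy falls back to context-sensitive tensors and the vectors and similarity scores below are only rough. Larger models such as `en_core_web_md` include real word vectors.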

In [None]:
# Using spaCy for word embeddings
doc = nlp("king queen man woman")

# Get word vectors
for token in doc:
    print(f"Word: {token.text}")
    print(f"Vector shape: {token.vector.shape}")
    print(f"First 5 dimensions: {token.vector[:5]}")
    print("-" * 30)

In [None]:
# Calculate similarity between words
def calculate_similarity(word1, word2):
    token1 = nlp(word1)[0]
    token2 = nlp(word2)[0]
    return token1.similarity(token2)

# Examples
print(f"Similarity between 'king' and 'queen': {calculate_similarity('king', 'queen'):.3f}")
print(f"Similarity between 'king' and 'car': {calculate_similarity('king', 'car'):.3f}")
print(f"Similarity between 'happy' and 'joyful': {calculate_similarity('happy', 'joyful'):.3f}")
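
By default, spaCy's `similarity` is a cosine similarity between vectors. The sketch below computes it by hand; the three-dimensional `toy_vectors` are made up for illustration and are not real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction, so report zero similarity
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical vectors, chosen so that 'king' and 'queen' point the same way
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "car": [0.1, 0.0, 0.9],
}

print(f"king vs queen: {cosine_similarity(toy_vectors['king'], toy_vectors['queen']):.3f}")
print(f"king vs car:   {cosine_similarity(toy_vectors['king'], toy_vectors['car']):.3f}")
```

Cosine similarity depends only on direction, not on vector length, which is why it is a common choice for comparing embeddings.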

## Vector Representations

Whole documents can also be represented as vectors. Two classic sparse representations are bag-of-words counts and TF-IDF weights.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample documents
documents = [
    "I love natural language processing",
    "Machine learning is fascinating",
    "Deep learning models are powerful",
    "Natural language understanding is important"
]

# Bag of Words (Count Vectorizer)
count_vectorizer = CountVectorizer()
bow_matrix = count_vectorizer.fit_transform(documents)

print("Bag of Words representation:")
print("Features:", count_vectorizer.get_feature_names_out())
print("Matrix shape:", bow_matrix.shape)
print("Matrix:\n", bow_matrix.toarray())
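
To make the bag-of-words matrix less of a black box, here is a minimal hand-rolled sketch. It mimics two `CountVectorizer` defaults, lowercasing and keeping only tokens of two or more word characters (which is why "I" does not appear among the features). The function name `bag_of_words` and the sample documents are hypothetical:

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of documents."""
    # Lowercase and keep tokens with at least two word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # Sorted vocabulary gives each term a fixed column index
    vocab = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document, one column per vocabulary term
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

vocab, matrix = bag_of_words(["I love NLP", "NLP love love"])
print("Features:", vocab)
print("Matrix:", matrix)
```

Word order is discarded entirely: "dog bites man" and "man bites dog" get the same row.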

In [None]:
# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

print("\nTF-IDF representation:")
print("Features:", tfidf_vectorizer.get_feature_names_out())
print("Matrix shape:", tfidf_matrix.shape)
print("Matrix:\n", tfidf_matrix.toarray().round(3))
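
The TF-IDF weights can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf` and the tiny count matrix are hypothetical:

```python
import math

def tfidf(count_matrix):
    """Turn a document-term count matrix into L2-normalized TF-IDF rows."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF: rare terms get a larger weight
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_matrix:
        weighted = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # Scale each row to unit length; leave all-zero rows as zeros
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Term 0 occurs in one document, term 1 occurs in both
counts = [[1, 1], [0, 1]]
weights = tfidf(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the first row the rarer term receives the higher weight even though both terms occur once, which is the point of the IDF factor.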

## Basic Language Models

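An n-gram language model predicts the next word from the previous n-1 words, using counts of word sequences seen in training. Its main limitations are that it cannot continue from a context it has never seen and that it ignores everything before the last n-1 words.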

In [None]:
from collections import defaultdict
import random

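class SimpleNGramModel:
    """Toy n-gram model: maps each (n-1)-word context to the words that followed it."""

    def __init__(self, n=2):
        # With n=1 the context would be empty and the slicing below would misbehave
        if n < 2:
            raise ValueError("n must be at least 2 so that each word has a context")
        self.n = n
        self.ngrams = defaultdict(list)
    
    def train(self, text):
        words = text.lower().split()
        for i in range(len(words) - self.n + 1):
            context = ' '.join(words[i:i+self.n-1])
            next_word = words[i+self.n-1]
            # Duplicates are kept so frequent continuations are sampled more often
            self.ngrams[context].append(next_word)
    
    def predict_next(self, context):
        # Unseen contexts have no continuation
        if context in self.ngrams:
            return random.choice(self.ngrams[context])
        return None
    
    def generate(self, start_text, length=10):
        words = start_text.lower().split()
        for _ in range(length):
            # Use the last n-1 words as the context for the next prediction
            context = ' '.join(words[-(self.n-1):])
            next_word = self.predict_next(context)
            if next_word is None:
                break  # dead end: this context never appeared in training
            words.append(next_word)
        return ' '.join(words)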

# Example usage
training_text = "natural language processing is a field of artificial intelligence that focuses on the interaction between computers and human language"

model = SimpleNGramModel(n=2)
model.train(training_text)

generated_text = model.generate("natural language", length=8)
print(f"Generated text: {generated_text}")
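
One limitation of n-gram models is that any word pair missing from the training text gets probability zero. The sketch below estimates bigram probabilities from counts and shows how add-alpha (Laplace) smoothing avoids the zero. The function name `bigram_probability` and the toy sentence are hypothetical:

```python
from collections import Counter

def bigram_probability(tokens, prev, word, alpha=0.0):
    """Estimate P(word | prev) from counts, with optional add-alpha smoothing."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    # The last token never acts as a context, so leave it out
    contexts = Counter(tokens[:-1])
    vocab_size = len(set(tokens))
    denominator = contexts[prev] + alpha * vocab_size
    if denominator == 0:
        return 0.0  # unseen context and no smoothing
    return (bigrams[(prev, word)] + alpha) / denominator

tokens = "the cat sat on the mat".split()

print(bigram_probability(tokens, "the", "cat"))           # seen bigram
print(bigram_probability(tokens, "the", "sat"))           # unseen bigram, unsmoothed
print(bigram_probability(tokens, "the", "sat", alpha=1))  # unseen bigram, smoothed
```

Smoothing takes a little probability mass from seen bigrams and spreads it over unseen ones, so generated or scored text never hits an impossible transition.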

## Exercises

### Exercise 1: Text Analysis
Analyze a piece of text and extract basic statistics.

In [None]:
def analyze_text(text):
    """Analyze text and return basic statistics"""
    # Reference solution (try writing your own version first)
    doc = nlp(text)
    
    # Basic statistics
    num_characters = len(text)
    num_words = len([token for token in doc if token.is_alpha])
    num_sentences = len(list(doc.sents))
    num_unique_words = len(set([token.text.lower() for token in doc if token.is_alpha]))
    
    # Part-of-speech tags
    pos_tags = [token.pos_ for token in doc if token.is_alpha]
    pos_counts = Counter(pos_tags)
    
    # Named entities
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    
    return {
        'num_characters': num_characters,
        'num_words': num_words,
        'num_sentences': num_sentences,
        'num_unique_words': num_unique_words,
        'lexical_diversity': num_unique_words / num_words if num_words > 0 else 0,
        'pos_counts': dict(pos_counts),
        'entities': entities
    }

# Test with sample text
sample = "Apple Inc. is an American multinational technology company. It was founded by Steve Jobs in California."
analysis = analyze_text(sample)
print("Text Analysis Results:")
for key, value in analysis.items():
    print(f"{key}: {value}")

### Exercise 2: Build a Simple Text Classifier
Create a basic sentiment classifier using traditional ML approaches.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Sample sentiment data
texts = [
    "I love this movie", "This is amazing", "Great product", "Excellent service",
    "I hate this", "This is terrible", "Bad quality", "Worst experience ever",
    "It's okay", "Not bad", "Average product", "Could be better"
]

labels = [1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2]  # 1: positive, 0: negative, 2: neutral

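# Split the raw texts first so the vectorizer never sees the test set
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# Create features: fit TF-IDF on the training texts only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)

# Train classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)

# Evaluate (with only 12 examples the scores are illustrative, not meaningful)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
# Pass labels explicitly so target_names line up even if a class is missing
print(classification_report(
    y_test, y_pred,
    labels=[0, 1, 2],
    target_names=['Negative', 'Positive', 'Neutral'],
    zero_division=0
))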
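
To show roughly what a multinomial Naive Bayes classifier computes, here is a minimal from-scratch sketch. It is not scikit-learn's implementation: it uses raw word counts instead of TF-IDF, whitespace tokenization, and add-one smoothing. The names `train_nb` and `predict_nb` and the four training sentences are hypothetical:

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels, alpha=1.0):
    """Collect per-class document counts and word counts."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_docs, word_counts, vocab, alpha

def predict_nb(model, text):
    """Return the class with the highest log posterior score."""
    class_docs, word_counts, vocab, alpha = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        # Start from the log prior of the class
        score = math.log(class_docs[label] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                likelihood = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
                score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

model = train_nb(
    ["great movie", "love it", "terrible movie", "hate it"],
    [1, 1, 0, 0],  # 1: positive, 0: negative
)
print(predict_nb(model, "great"))
print(predict_nb(model, "terrible movie"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.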

## Summary

In this module, we covered:
- Fundamentals of NLP
- Text preprocessing techniques
- Tokenization methods
- Word embeddings and vector representations
- Basic language models
- Simple text classification

## Next Steps
In the next module, we'll explore Transformers and LLM system design, building upon these foundational concepts.

## Additional Resources
- [NLTK Documentation](https://www.nltk.org/)
- [spaCy Documentation](https://spacy.io/)
- [Scikit-learn Text Processing](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
- [Word2Vec Paper](https://arxiv.org/abs/1301.3781)
- [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)