# 📚 Natural Language Processing (NLP) Fundamentals

## Complete Guide from Text Processing to Transformers

**What You'll Learn:**
- Text preprocessing and tokenization
- Word embeddings (Word2Vec, GloVe)
- Sequence models (RNN, LSTM, GRU)
- Attention mechanisms
- Transformer architecture
- Real NLP applications

**Prerequisites:** Deep Learning Fundamentals (Notebook 06)

---

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re

# NLP libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Deep learning
try:
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    TORCH_AVAILABLE = True
except ImportError:
    print("PyTorch not installed. Some examples will use NumPy instead.")
    TORCH_AVAILABLE = False

# Download NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # tokenizer data required by newer NLTK releases
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

# Plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Libraries imported successfully!")

---

## Part 1: Text Preprocessing

### 1.1 Tokenization

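**Tokenization:** Breaking text into individual tokens (words, subwords, or characters).

**Why it's important:**
- It is the first step in almost every NLP pipeline
- It fixes the vocabulary the model sees, so it directly affects vocabulary size, out-of-vocabulary rates, and sequence length
- Different tasks suit different strategies (for example, word tokens for BoW/TF-IDF, subword tokens for neural models)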

In [None]:
# Sample text
text = """
Natural Language Processing (NLP) is fascinating! It allows computers to understand, 
interpret, and generate human language. Modern NLP uses deep learning techniques.
"""

# Word tokenization
word_tokens = word_tokenize(text.lower())
print("Word tokens:")
print(word_tokens[:20])
print(f"\nTotal word tokens: {len(word_tokens)}")

# Sentence tokenization
sent_tokens = sent_tokenize(text)
print(f"\nSentences: {len(sent_tokens)}")
for i, sent in enumerate(sent_tokens, 1):
    print(f"{i}. {sent.strip()}")

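The definition above mentions word, subword, and character tokens, while the NLTK cell only shows words and sentences. The sketch below contrasts the three granularities using only the standard library. The helper names (`word_tokens`, `char_tokens`, `char_ngrams`) are made up for illustration, and the character n-grams are only a toy stand-in for subword units: real subword tokenizers such as BPE or WordPiece learn their vocabulary from data.

```python
import re

def word_tokens(text):
    """Split into lowercase words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def char_tokens(text):
    """Treat every non-space character as a token."""
    return [ch for ch in text.lower() if not ch.isspace()]

def char_ngrams(word, n=3):
    """Toy subword units: overlapping character n-grams with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

sample = "NLP is fascinating!"
print(word_tokens(sample))        # ['nlp', 'is', 'fascinating', '!']
print(len(char_tokens(sample)))   # 17
print(char_ngrams("nlp"))         # ['<nl', 'nlp', 'lp>']
```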
### 1.2 Text Cleaning

In [None]:
def clean_text(text):
    """Complete text cleaning pipeline."""
    # Lowercase
    text = text.lower()
    
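    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # Remove email addresses (must run before punctuation is stripped, while '@' is still present)
    text = re.sub(r'\S+@\S+', '', text)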
    
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

# Example
dirty_text = "Check out https://example.com! Email: test@test.com. Price: $99.99"
clean = clean_text(dirty_text)

print(f"Original: {dirty_text}")
print(f"Cleaned:  {clean}")

### 1.3 Stopword Removal and Stemming/Lemmatization

In [None]:
# Stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
filtered_tokens = [w for w in word_tokens if w not in stop_words and w.isalpha()]
print(f"Tokens after stopword removal: {len(filtered_tokens)}")
print(filtered_tokens[:15])

# Stemming (faster, less accurate)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in filtered_tokens[:5]]
print(f"\nStemmed: {stemmed}")

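# Lemmatization (slower, dictionary-based, usually more accurate)
# Note: lemmatize() treats every word as a noun unless a POS tag is passed,
# e.g. lemmatizer.lemmatize("better", pos="a") returns "good".
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in filtered_tokens[:5]]
print(f"Lemmatized: {lemmatized}")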

### 1.4 Complete Preprocessing Pipeline

In [None]:
class TextPreprocessor:
    """Complete text preprocessing pipeline."""
    
    def __init__(self, lowercase=True, remove_stopwords=True, 
                 lemmatize=True, min_token_length=2):
        self.lowercase = lowercase
        self.remove_stopwords = remove_stopwords
        self.lemmatize = lemmatize
        self.min_token_length = min_token_length
        
        if remove_stopwords:
            self.stop_words = set(stopwords.words('english'))
        
        if lemmatize:
            self.lemmatizer = WordNetLemmatizer()
    
    def preprocess(self, text):
        """Preprocess a single text."""
        # Clean (clean_text() always lowercases, so lowercase=False currently has no effect)
        text = clean_text(text)
        
        # Tokenize
        tokens = word_tokenize(text.lower() if self.lowercase else text)
        
        # Filter
        tokens = [t for t in tokens if len(t) >= self.min_token_length]
        
        # Remove stopwords
        if self.remove_stopwords:
            tokens = [t for t in tokens if t not in self.stop_words]
        
        # Lemmatize
        if self.lemmatize:
            tokens = [self.lemmatizer.lemmatize(t) for t in tokens]
        
        return tokens
    
    def preprocess_corpus(self, texts):
        """Preprocess multiple texts."""
        return [self.preprocess(text) for text in texts]

# Usage
preprocessor = TextPreprocessor()
processed = preprocessor.preprocess(text)
print(f"Processed tokens: {processed}")

---

## Part 2: Text Representation

### 2.1 Bag of Words (BoW)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "I love machine learning",
    "Machine learning is amazing",
    "I love deep learning",
    "Deep learning uses neural networks"
]

# Create BoW
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# Display
feature_names = vectorizer.get_feature_names_out()
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=feature_names)
print("Bag of Words Matrix:")
print(bow_df)

# Visualize
plt.figure(figsize=(12, 4))
sns.heatmap(bow_df, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Bag of Words Representation')
plt.xlabel('Words')
plt.ylabel('Documents')
plt.tight_layout()
plt.show()

### 2.2 TF-IDF (Term Frequency-Inverse Document Frequency)

**Formula:**
$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

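Where:
- $\text{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total terms in } d}$
- $\text{IDF}(t) = \log\left(\frac{\text{total documents}}{\text{documents containing } t}\right)$

Note: scikit-learn's `TfidfVectorizer` (used below) applies a smoothed variant by default, $\text{IDF}(t) = \ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1$ with $n$ the number of documents, uses raw counts for TF, and L2-normalizes each row, so its values will not match the textbook formula exactly.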

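As a worked example, the sketch below computes the formula exactly as written above, in plain Python, on four lowercase sentences. The helper names `tf`, `idf`, and `tf_idf` are illustrative and not part of any library. Note how a word that occurs in every document gets a weight of zero.

```python
import math

docs = [
    "i love machine learning",
    "machine learning is amazing",
    "i love deep learning",
    "deep learning uses neural networks",
]
tokenized = [d.split() for d in docs]

def tf(term, doc_tokens):
    """Relative frequency of `term` within one document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """log(total documents / documents containing term).
    Assumes the term occurs in at least one document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# "learning" appears in all 4 documents, so IDF = log(4/4) = 0
print(tf_idf("learning", tokenized[0], tokenized))            # 0.0
# "love" appears in 2 of 4 documents: 0.25 * ln(2)
print(round(tf_idf("love", tokenized[0], tokenized), 4))      # 0.1733
# "amazing" appears in 1 of 4 documents: 0.25 * ln(4)
print(round(tf_idf("amazing", tokenized[1], tokenized), 4))   # 0.3466
```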
In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Display
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(), 
    columns=tfidf_vectorizer.get_feature_names_out()
)
print("TF-IDF Matrix:")
print(tfidf_df.round(3))

# Visualize
plt.figure(figsize=(12, 4))
sns.heatmap(tfidf_df, annot=True, fmt='.2f', cmap='YlOrRd', cbar=True)
plt.title('TF-IDF Representation')
plt.xlabel('Words')
plt.ylabel('Documents')
plt.tight_layout()
plt.show()

### 2.3 Word Embeddings - Word2Vec from Scratch

**Word2Vec:** Learns dense vector representations in which words that appear in similar contexts get similar vectors.

**Two architectures:**
1. **CBOW (Continuous Bag of Words):** Predict center word from context
2. **Skip-gram:** Predict context words from center word

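The two architectures differ mainly in how training examples are built from a context window. The sketch below makes that concrete for one sentence; `skipgram_pairs` and `cbow_pairs` are hypothetical helper names, and the window size of 1 is chosen only to keep the output short.

```python
def skipgram_pairs(tokens, window=1):
    """(center, context) pairs: one training example per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_pairs(tokens, window=1):
    """(context_words, center) pairs: one training example per position."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

sentence = "i love deep learning".split()

print(skipgram_pairs(sentence))
# [('i', 'love'), ('love', 'i'), ('love', 'deep'),
#  ('deep', 'love'), ('deep', 'learning'), ('learning', 'deep')]

print(cbow_pairs(sentence))
# [(['love'], 'i'), (['i', 'deep'], 'love'),
#  (['love', 'learning'], 'deep'), (['deep'], 'learning')]
```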
In [None]:
class Word2VecSimple:
    """Simple Word2Vec implementation (Skip-gram)."""
    
    def __init__(self, embedding_dim=50, window_size=2, learning_rate=0.01):
        self.embedding_dim = embedding_dim
        self.window_size = window_size
        self.learning_rate = learning_rate
    
    def build_vocab(self, sentences):
        """Build vocabulary from sentences."""
        words = [word for sent in sentences for word in sent]
        self.vocab = sorted(set(words))  # sorted so word indices are reproducible across runs
        self.vocab_size = len(self.vocab)
        
        # Word to index mapping
        self.word2idx = {w: i for i, w in enumerate(self.vocab)}
        self.idx2word = {i: w for w, i in self.word2idx.items()}
        
        print(f"Vocabulary size: {self.vocab_size}")
    
    def initialize_embeddings(self):
        """Initialize embedding matrices."""
        # Input embeddings (center words)
        self.W1 = np.random.randn(self.vocab_size, self.embedding_dim) * 0.01
        
        # Output embeddings (context words)
        self.W2 = np.random.randn(self.embedding_dim, self.vocab_size) * 0.01
    
    def generate_training_data(self, sentences):
        """Generate (center, context) pairs."""
        training_data = []
        
        for sentence in sentences:
            indices = [self.word2idx[w] for w in sentence]
            
            for center_idx in range(len(indices)):
                center_word = indices[center_idx]
                
                # Context window
                start = max(0, center_idx - self.window_size)
                end = min(len(indices), center_idx + self.window_size + 1)
                
                for context_idx in range(start, end):
                    if context_idx != center_idx:
                        context_word = indices[context_idx]
                        training_data.append((center_word, context_word))
        
        return training_data
    
    def softmax(self, x):
        """Stable softmax."""
        exp_x = np.exp(x - np.max(x))
        return exp_x / exp_x.sum()
    
    def forward(self, center_word_idx):
        """Forward pass."""
        # Get center word embedding
        h = self.W1[center_word_idx]  # (embedding_dim,)
        
        # Compute output scores
        u = np.dot(h, self.W2)  # (vocab_size,)
        
        # Softmax
        y_pred = self.softmax(u)
        
        return h, u, y_pred
    
    def backward(self, center_word_idx, context_word_idx, h, y_pred):
        """Backward pass."""
        # Create target vector
        y_true = np.zeros(self.vocab_size)
        y_true[context_word_idx] = 1
        
        # Error
        error = y_pred - y_true  # (vocab_size,)
        
        # Gradients
        dW2 = np.outer(h, error)  # (embedding_dim, vocab_size)
        dW1 = np.dot(self.W2, error)  # (embedding_dim,)
        
        # Update
        self.W2 -= self.learning_rate * dW2
        self.W1[center_word_idx] -= self.learning_rate * dW1
        
        return error
    
    def train(self, sentences, epochs=100):
        """Train Word2Vec model."""
        self.build_vocab(sentences)
        self.initialize_embeddings()
        
        training_data = self.generate_training_data(sentences)
        print(f"Training samples: {len(training_data)}")
        
        for epoch in range(epochs):
            loss = 0
            
            for center_idx, context_idx in training_data:
                h, u, y_pred = self.forward(center_idx)
                error = self.backward(center_idx, context_idx, h, y_pred)
                
                # Cross-entropy loss
                loss += -np.log(y_pred[context_idx] + 1e-10)
            
            if (epoch + 1) % 20 == 0:
                avg_loss = loss / len(training_data)
                print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")
    
    def get_embedding(self, word):
        """Get embedding for a word."""
        idx = self.word2idx.get(word)
        if idx is not None:
            return self.W1[idx]
        return None
    
    def most_similar(self, word, top_n=5):
        """Find most similar words."""
        word_vec = self.get_embedding(word)
        if word_vec is None:
            return []
        
        # Cosine similarity
        similarities = []
        for w in self.vocab:
            if w != word:
                w_vec = self.get_embedding(w)
                sim = np.dot(word_vec, w_vec) / (
                    np.linalg.norm(word_vec) * np.linalg.norm(w_vec)
                )
                similarities.append((w, sim))
        
        # Sort by similarity
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_n]

# Train on sample corpus
corpus = [
    "i love machine learning".split(),
    "i love deep learning".split(),
    "machine learning is great".split(),
    "deep learning uses neural networks".split(),
    "neural networks are powerful".split()
]

w2v = Word2VecSimple(embedding_dim=10, window_size=2, learning_rate=0.05)
w2v.train(corpus, epochs=100)

# Test similarity
print("\nMost similar to 'learning':")
for word, sim in w2v.most_similar('learning', top_n=5):
    print(f"  {word}: {sim:.4f}")

---

## Part 3: Sequence Models (RNN, LSTM, GRU)

### 3.1 Recurrent Neural Networks (RNN)

**Key Idea:** Process a sequence one step at a time while maintaining a hidden state that summarizes the inputs seen so far

**Equations:**
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$y_t = W_{hy} h_t + b_y$$

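To connect the equations to code, here is a minimal NumPy sketch that unrolls the recurrence over a short sequence. The parameter names mirror the symbols above; the dimensions, the random initialization, and the function name `rnn_forward` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, seq_len = 4, 3, 2, 5  # illustrative sizes

# Parameters named after the symbols in the equations
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def rnn_forward(xs):
    """Unroll the recurrence over a sequence of input vectors."""
    h = np.zeros(hidden_dim)                        # h_0
    hs, ys = [], []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)    # h_t
        hs.append(h)
        ys.append(W_hy @ h + b_y)                   # y_t
    return np.stack(hs), np.stack(ys)

xs = rng.normal(size=(seq_len, input_dim))
hs, ys = rnn_forward(xs)
print(hs.shape, ys.shape)  # (5, 3) (5, 2)
```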
In [None]:
if TORCH_AVAILABLE:
    class SimpleRNN(nn.Module):
        """Simple RNN for sequence classification."""
        
        def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
            super().__init__()
            
            self.embedding = nn.Embedding(vocab_size, embedding_dim)
            self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, output_dim)
        
        def forward(self, x):
            # x: (batch_size, seq_len)
            embedded = self.embedding(x)  # (batch_size, seq_len, embedding_dim)
            
            # RNN
            output, hidden = self.rnn(embedded)
            # output: (batch_size, seq_len, hidden_dim)
            # hidden: (1, batch_size, hidden_dim)
            
            # Use last hidden state
            final_hidden = hidden.squeeze(0)  # (batch_size, hidden_dim)
            
            # Classification
            logits = self.fc(final_hidden)  # (batch_size, output_dim)
            
            return logits
    
    # Example
    rnn = SimpleRNN(vocab_size=1000, embedding_dim=50, hidden_dim=128, output_dim=2)
    
    # Dummy input
    x = torch.randint(0, 1000, (4, 10))  # Batch of 4, seq_len 10
    output = rnn(x)
    print(f"RNN output shape: {output.shape}")  # (4, 2)
else:
    print("PyTorch not available. Install with: pip install torch")

### 3.2 LSTM (Long Short-Term Memory)

**Problem with RNN:** Vanishing gradients in long sequences

**LSTM Solution:** Gating mechanisms to control information flow

**Gates:**
1. **Forget gate:** What to forget from cell state
2. **Input gate:** What new information to add
3. **Output gate:** What to output

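The three gates can be sketched as a single LSTM time step in NumPy. This is a minimal illustration under stated assumptions: all four weight blocks are stacked into one matrix `W` in the order forget, input, candidate, output (a layout chosen here for readability; `nn.LSTM` stores its weights differently), and `lstm_step` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W: (4 * hidden_dim, input_dim + hidden_dim), b: (4 * hidden_dim,)
    Row blocks (assumed order): forget, input, candidate, output.
    """
    hidden_dim = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0 * hidden_dim:1 * hidden_dim])   # forget gate
    i = sigmoid(z[1 * hidden_dim:2 * hidden_dim])   # input gate
    g = np.tanh(z[2 * hidden_dim:3 * hidden_dim])   # candidate cell values
    o = sigmoid(z[3 * hidden_dim:4 * hidden_dim])   # output gate
    c_t = f * c_prev + i * g        # drop old content, add new content
    h_t = o * np.tanh(c_t)          # expose a gated view of the cell state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim + hidden_dim))
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (3,) (3,)
```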
In [None]:
if TORCH_AVAILABLE:
    class SentimentLSTM(nn.Module):
        """LSTM for sentiment analysis."""
        
        def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, 
                     n_layers=2, dropout=0.5):
            super().__init__()
            
            self.embedding = nn.Embedding(vocab_size, embedding_dim)
            self.lstm = nn.LSTM(
                embedding_dim, 
                hidden_dim, 
                num_layers=n_layers,
                dropout=dropout if n_layers > 1 else 0,
                batch_first=True
            )
            self.fc = nn.Linear(hidden_dim, output_dim)
            self.dropout = nn.Dropout(dropout)
        
        def forward(self, x):
            # Embedding
            embedded = self.dropout(self.embedding(x))
            
            # LSTM
            output, (hidden, cell) = self.lstm(embedded)
            
            # Use last layer's hidden state
            final_hidden = hidden[-1]  # (batch_size, hidden_dim)
            
            # Classification
            logits = self.fc(self.dropout(final_hidden))
            
            return logits
    
    # Example
    lstm_model = SentimentLSTM(
        vocab_size=5000, 
        embedding_dim=100, 
        hidden_dim=256, 
        output_dim=2,
        n_layers=2,
        dropout=0.5
    )
    
    print(lstm_model)
    print(f"\nTotal parameters: {sum(p.numel() for p in lstm_model.parameters()):,}")
else:
    print("PyTorch not available")

---

## Part 4: Attention Mechanisms

### 4.1 Attention Intuition

**Problem:** In an encoder-decoder setup, an RNN/LSTM compresses the entire input sequence into a single fixed-size vector

**Solution:** Let the model "attend" to different parts of the input at each step

**Attention Weights:**
$$\alpha_{ij} = \frac{\exp(\text{score}(h_i, h_j))}{\sum_k \exp(\text{score}(h_i, h_k))}$$

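The formula is a softmax over scores, so the weights are positive and sum to 1. The NumPy sketch below assumes a plain dot product as the `score` function (the formula above leaves it unspecified) and uses made-up names (`attention_weights`, `encoder_states`, `decoder_state`) for one query attending over five states.

```python
import numpy as np

def attention_weights(query, keys):
    """Softmax over dot-product scores between one query and each key."""
    scores = keys @ query                        # one score per position
    exp_scores = np.exp(scores - scores.max())   # subtract max for stability
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))   # 5 positions, 8-dim states
decoder_state = rng.normal(size=8)         # the state doing the attending

alpha = attention_weights(decoder_state, encoder_states)
context = alpha @ encoder_states           # weighted sum of the states

print(alpha.round(3))
print(alpha.sum())      # sums to 1 up to floating-point error
print(context.shape)    # (8,)
```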
In [None]:
if TORCH_AVAILABLE:
    class Attention(nn.Module):
        """Attention mechanism."""
        
        def __init__(self, hidden_dim):
            super().__init__()
            # A single linear unit produces one score per time step
            self.attention = nn.Linear(hidden_dim, 1)
        
        def forward(self, hidden_states):
            """
            Args:
                hidden_states: (batch_size, seq_len, hidden_dim)
            Returns:
                context: (batch_size, hidden_dim)
                attention_weights: (batch_size, seq_len)
            """
            # Compute one unnormalized score per position
            scores = self.attention(hidden_states).squeeze(-1)  # (batch, seq_len)
            
            # Normalize the scores over the sequence dimension
            attention_weights = F.softmax(scores, dim=-1)  # (batch, seq_len)
            
            # Weighted sum
            context = torch.bmm(
                attention_weights.unsqueeze(1),  # (batch, 1, seq_len)
                hidden_states  # (batch, seq_len, hidden_dim)
            ).squeeze(1)  # (batch, hidden_dim)
            
            return context, attention_weights
    
    # Visualize attention
    def visualize_attention(tokens, attention_weights):
        """Visualize attention weights."""
        plt.figure(figsize=(10, 2))
        plt.bar(range(len(tokens)), attention_weights)
        plt.xticks(range(len(tokens)), tokens, rotation=45)
        plt.ylabel('Attention Weight')
        plt.title('Attention Distribution')
        plt.tight_layout()
        plt.show()
    
    # Example
    attention_layer = Attention(hidden_dim=128)
    hidden_states = torch.randn(1, 8, 128)  # 1 batch, 8 tokens
    context, weights = attention_layer(hidden_states)
    
    tokens = ["I", "love", "machine", "learning", "and", "deep", "learning", "!"]
    visualize_attention(tokens, weights[0].detach().numpy())
else:
    print("PyTorch not available")

### 4.2 Self-Attention (Scaled Dot-Product Attention)

**Key Innovation:** Each position attends to every position in the same sequence (including itself)

**Formula:**
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:
- Q: Query matrix
- K: Key matrix  
- V: Value matrix
- $d_k$: Dimension of keys (for scaling)

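A NumPy sketch of the formula, followed by a quick check of why the $\sqrt{d_k}$ scaling is there: for vectors with unit-variance entries, the raw dot product has a standard deviation near $\sqrt{d_k}$, which would push the softmax toward near one-hot outputs. The function name and the sizes are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D arrays of shape (seq_len, d)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 64
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)               # (4, 64)
print(weights.sum(axis=-1))    # every row sums to 1

# Why scale: compare the spread of raw and scaled dot products
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))
raw = (q * k).sum(axis=1)
print(raw.std(), (raw / np.sqrt(d_k)).std())   # roughly 8 and 1
```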
In [None]:
if TORCH_AVAILABLE:
    class SelfAttention(nn.Module):
        """Self-attention layer (scaled dot-product)."""
        
        def __init__(self, embed_dim):
            super().__init__()
            self.embed_dim = embed_dim
            
            # Query, Key, Value projections
            self.query = nn.Linear(embed_dim, embed_dim)
            self.key = nn.Linear(embed_dim, embed_dim)
            self.value = nn.Linear(embed_dim, embed_dim)
        
        def forward(self, x):
            """
            Args:
                x: (batch_size, seq_len, embed_dim)
            Returns:
                output: (batch_size, seq_len, embed_dim)
                attention_weights: (batch_size, seq_len, seq_len)
            """
            # Project to Q, K, V
            Q = self.query(x)  # (batch, seq_len, embed_dim)
            K = self.key(x)
            V = self.value(x)
            
            # Scaled dot-product attention
            scores = torch.matmul(Q, K.transpose(-2, -1))  # (batch, seq_len, seq_len)
            scores = scores / (self.embed_dim ** 0.5)  # Scale by sqrt(d_k)
            
            attention_weights = F.softmax(scores, dim=-1)
            
            # Apply attention to values
            output = torch.matmul(attention_weights, V)  # (batch, seq_len, embed_dim)
            
            return output, attention_weights
    
    # Example
    self_attn = SelfAttention(embed_dim=64)
    x = torch.randn(2, 5, 64)  # 2 batches, 5 tokens, 64 dims
    output, attn_weights = self_attn(x)
    
    print(f"Output shape: {output.shape}")
    print(f"Attention weights shape: {attn_weights.shape}")
    
    # Visualize attention matrix
    plt.figure(figsize=(6, 5))
    sns.heatmap(
        attn_weights[0].detach().numpy(), 
        cmap='Blues', 
        annot=True, 
        fmt='.2f'
    )
    plt.title('Self-Attention Matrix')
    plt.xlabel('Key Position')
    plt.ylabel('Query Position')
    plt.show()
else:
    print("PyTorch not available")

---

## Part 5: Transformer Architecture

### 5.1 Multi-Head Attention

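**Idea:** Run several attention operations ("heads") in parallel, each on its own lower-dimensional projection of the input, then concatenate the results

**Benefits:**
- Different heads can learn different types of relationships
- More expressive than a single head at similar cost, since each head works in `embed_dim / num_heads` dimensions
- Often improves accuracy in practice compared with single-head attention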

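Most of the work in a multi-head implementation is bookkeeping: splitting the embedding dimension into heads and merging it back. The NumPy sketch below shows only that reshaping step (no attention is computed), with small illustrative sizes, so the effect of each `reshape` and `transpose` can be checked directly.

```python
import numpy as np

batch, seq_len, embed_dim, num_heads = 2, 5, 8, 4
head_dim = embed_dim // num_heads

x = np.arange(batch * seq_len * embed_dim, dtype=float).reshape(batch, seq_len, embed_dim)

# Split the last dimension into (heads, head_dim), then move heads forward
heads = x.reshape(batch, seq_len, num_heads, head_dim).transpose(0, 2, 1, 3)
print(heads.shape)   # (2, 4, 5, 2)

# Head 0 sees features 0-1 of every token, head 1 sees features 2-3, and so on
print(np.array_equal(heads[:, 0], x[:, :, 0:head_dim]))   # True

# Merging reverses the two steps and recovers the original layout
merged = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, embed_dim)
print(np.array_equal(merged, x))   # True
```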
In [None]:
if TORCH_AVAILABLE:
    class MultiHeadAttention(nn.Module):
        """Multi-head attention."""
        
        def __init__(self, embed_dim, num_heads):
            super().__init__()
            assert embed_dim % num_heads == 0
            
            self.embed_dim = embed_dim
            self.num_heads = num_heads
            self.head_dim = embed_dim // num_heads
            
            # Projections for all heads (batched)
            self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
            self.out_proj = nn.Linear(embed_dim, embed_dim)
        
        def forward(self, x):
            batch_size, seq_len, embed_dim = x.shape
            
            # Project and split into Q, K, V
            qkv = self.qkv_proj(x)  # (batch, seq_len, 3*embed_dim)
            qkv = qkv.reshape(batch_size, seq_len, 3, self.num_heads, self.head_dim)
            qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, batch, heads, seq_len, head_dim)
            q, k, v = qkv[0], qkv[1], qkv[2]
            
            # Scaled dot-product attention
            scores = torch.matmul(q, k.transpose(-2, -1))  # (batch, heads, seq_len, seq_len)
            scores = scores / (self.head_dim ** 0.5)  # scale by sqrt(d_k), where d_k = head_dim
            attention = F.softmax(scores, dim=-1)
            
            # Apply attention
            out = torch.matmul(attention, v)  # (batch, heads, seq_len, head_dim)
            
            # Concatenate heads
            out = out.transpose(1, 2).contiguous()  # (batch, seq_len, heads, head_dim)
            out = out.reshape(batch_size, seq_len, embed_dim)
            
            # Final projection
            out = self.out_proj(out)
            
            return out, attention
    
    # Example
    mha = MultiHeadAttention(embed_dim=128, num_heads=8)
    x = torch.randn(2, 10, 128)
    output, attention = mha(x)
    
    print(f"Output shape: {output.shape}")
    print(f"Attention shape: {attention.shape}")
else:
    print("PyTorch not available")

### 5.2 Complete Transformer Encoder

**Components:**
1. Multi-head self-attention
2. Feed-forward network
3. Layer normalization
4. Residual connections

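Components 2 to 4 fit together as `LayerNorm(x + sublayer(x))`. The NumPy sketch below illustrates that wrapper around a feed-forward sublayer. It is a simplification under stated assumptions: the learnable scale and shift of layer normalization and the dropout are omitted, and the helper names (`layer_norm`, `feed_forward`, `residual_block`) are chosen for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance.
    The learnable scale and shift parameters are omitted in this sketch."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: Linear -> ReLU -> Linear."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def residual_block(x, sublayer):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
seq_len, embed_dim, ff_dim = 5, 8, 32
x = rng.normal(size=(seq_len, embed_dim))
W1, b1 = rng.normal(scale=0.1, size=(embed_dim, ff_dim)), np.zeros(ff_dim)
W2, b2 = rng.normal(scale=0.1, size=(ff_dim, embed_dim)), np.zeros(embed_dim)

y = residual_block(x, lambda t: feed_forward(t, W1, b1, W2, b2))
print(y.shape)                         # (5, 8)
print(np.abs(y.mean(axis=-1)).max())   # close to 0: every token is re-centered
```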
In [None]:
if TORCH_AVAILABLE:
    class TransformerEncoderLayer(nn.Module):
        """Single Transformer encoder layer."""
        
        def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
            super().__init__()
            
            # Multi-head attention
            self.self_attn = MultiHeadAttention(embed_dim, num_heads)
            
            # Feed-forward network
            self.ff = nn.Sequential(
                nn.Linear(embed_dim, ff_dim),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(ff_dim, embed_dim)
            )
            
            # Layer normalization
            self.norm1 = nn.LayerNorm(embed_dim)
            self.norm2 = nn.LayerNorm(embed_dim)
            
            self.dropout = nn.Dropout(dropout)
        
        def forward(self, x):
            # Multi-head attention with residual
            attn_out, _ = self.self_attn(x)
            x = self.norm1(x + self.dropout(attn_out))
            
            # Feed-forward with residual
            ff_out = self.ff(x)
            x = self.norm2(x + self.dropout(ff_out))
            
            return x
    
    class TransformerEncoder(nn.Module):
        """Complete Transformer encoder for classification."""
        
        def __init__(self, vocab_size, embed_dim, num_heads, ff_dim, 
                     num_layers, max_seq_len, num_classes, dropout=0.1):
            super().__init__()
            
            # Token embedding
            self.token_embed = nn.Embedding(vocab_size, embed_dim)
            
            # Positional embedding
            self.pos_embed = nn.Embedding(max_seq_len, embed_dim)
            
            # Encoder layers
            self.layers = nn.ModuleList([
                TransformerEncoderLayer(embed_dim, num_heads, ff_dim, dropout)
                for _ in range(num_layers)
            ])
            
            # Classification head
            self.classifier = nn.Linear(embed_dim, num_classes)
            
            self.dropout = nn.Dropout(dropout)
        
        def forward(self, x):
            batch_size, seq_len = x.shape
            
            # Token embeddings
            token_emb = self.token_embed(x)  # (batch, seq_len, embed_dim)
            
            # Positional embeddings
            positions = torch.arange(seq_len, device=x.device).unsqueeze(0)
            pos_emb = self.pos_embed(positions)  # (1, seq_len, embed_dim)
            
            # Combine
            x = self.dropout(token_emb + pos_emb)
            
            # Pass through encoder layers
            for layer in self.layers:
                x = layer(x)
            
            # Global average pooling
            x = x.mean(dim=1)  # (batch, embed_dim)
            
            # Classification
            logits = self.classifier(x)
            
            return logits
    
    # Example
    transformer = TransformerEncoder(
        vocab_size=5000,
        embed_dim=128,
        num_heads=8,
        ff_dim=512,
        num_layers=4,
        max_seq_len=100,
        num_classes=2,
        dropout=0.1
    )
    
    print(transformer)
    print(f"\nTotal parameters: {sum(p.numel() for p in transformer.parameters()):,}")
    
    # Test
    x = torch.randint(0, 5000, (4, 50))  # 4 sequences of length 50
    output = transformer(x)
    print(f"\nOutput shape: {output.shape}")  # (4, 2)
else:
    print("PyTorch not available")

---

## Part 6: Practical NLP Application

### 6.1 Sentiment Analysis with Pre-trained Model

In [None]:
# Using transformers library
try:
    from transformers import pipeline
    
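    # Load sentiment analysis pipeline
    # (no model is specified, so the library downloads its default English sentiment model on first use)
    sentiment_analyzer = pipeline("sentiment-analysis")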
    
    # Test sentences
    sentences = [
        "I love this product! It's amazing!",
        "This is terrible. I hate it.",
        "It's okay, nothing special.",
        "Absolutely fantastic! Best purchase ever!"
    ]
    
    results = sentiment_analyzer(sentences)
    
    for sent, result in zip(sentences, results):
        print(f"Text: {sent}")
        print(f"Sentiment: {result['label']} (confidence: {result['score']:.3f})\n")
    
except ImportError:
    print("Install transformers: pip install transformers")

### 6.2 Text Generation

In [None]:
try:
    from transformers import GPT2LMHeadModel, GPT2Tokenizer
    
    # Load model and tokenizer
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    
    # Generate text
    prompt = "Machine learning is"
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    
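    output = model.generate(
        input_ids,
        max_length=50,  # total length in tokens, including the prompt
        num_return_sequences=3,
        temperature=0.8,
        top_k=50,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token; reuse EOS to avoid a warning
    )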
    
    print(f"Prompt: {prompt}\n")
    for i, seq in enumerate(output, 1):
        text = tokenizer.decode(seq, skip_special_tokens=True)
        print(f"Generation {i}: {text}\n")
    
except ImportError:
    print("Install transformers: pip install transformers")

---

## 📝 Summary

### Key Concepts

1. **Text Preprocessing:**
   - Tokenization (word, sentence, subword)
   - Cleaning and normalization
   - Stopword removal, stemming, lemmatization

2. **Text Representation:**
   - Bag of Words (BoW)
   - TF-IDF
   - Word embeddings (Word2Vec, GloVe)

3. **Sequence Models:**
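   - RNN: Processes a sequence step by step, carrying a hidden state
   - LSTM: Mitigates the vanishing gradient problem with gates and a cell state
   - GRU: Lighter alternative to LSTM (fewer gates, no separate cell state)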

4. **Attention:**
   - Focus on relevant parts of input
   - Self-attention: Each token attends to all tokens
   - Multi-head attention: Multiple attention patterns

5. **Transformers:**
   - Pure attention-based architecture (no recurrence)
   - Processes all positions in parallel, so training is typically faster than with RNNs
   - State-of-the-art for most NLP tasks

### Interview Questions

1. **What is the difference between stemming and lemmatization?**
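   - Stemming: Crude rule-based suffix stripping; the result need not be a real word ("running" → "run", "studies" → "studi")
   - Lemmatization: Dictionary-based and part-of-speech aware; returns a valid base form ("better" → "good" when tagged as an adjective)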

2. **Why is TF-IDF better than Bag of Words?**
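   - Downweights words that appear in most documents ("the", "and")
   - Upweights words that are rare across the corpus and therefore more informative
   - Usually gives better features for retrieval and linear classifiers, though it still ignores word order and meaning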

3. **What problem does LSTM solve compared to RNN?**
   - Mitigates the vanishing gradient problem
   - Can learn longer-range dependencies than a plain RNN
   - Gating mechanisms control what the cell state forgets, stores, and exposes

4. **How does attention help in NLP?**
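   - Focuses on relevant parts of input
   - No fixed-size bottleneck: the model can look back at every input position instead of a single summary vector
   - Somewhat interpretable (attention weights can be visualized, though they are not a guaranteed explanation of the prediction)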

5. **Why are Transformers faster than LSTMs?**
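   - No recurrence: all positions are processed in parallel during training, whereas an LSTM must step through tokens one at a time
   - The computation is mostly large matrix multiplications, which use GPUs efficiently
   - Caveat: self-attention cost grows quadratically with sequence length, and autoregressive generation is still token by token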

### Next Steps

- **Advanced NLP:** See [ADVANCED_NLP_TECHNIQUES.md](../ADVANCED_NLP_TECHNIQUES.md)
- **Modern Techniques:** See [MODERN_ML_AI_TECHNIQUES_2024_2025.md](../MODERN_ML_AI_TECHNIQUES_2024_2025.md)
- **Projects:** Build sentiment analysis, text classification, or chatbot
- **Fine-tuning:** Learn to fine-tune BERT, GPT for specific tasks

---

**Congratulations!** You've completed NLP Fundamentals. You now understand text processing, embeddings, sequence models, attention, and Transformers!

**Next:** [10 - Computer Vision with CNNs](./10_computer_vision.ipynb)