<a href="https://colab.research.google.com/github/debojit11/ml_nlp_dl_transformers/blob/main/DL_week_12.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 12: MLPs – Neural Networks for NLP


# **SECTION 1: Welcome & Objectives**

In [1]:
print("Welcome to Week 12!")
print("This week, you'll:")
print("- Understand how Multi-Layer Perceptrons (MLPs) work")
print("- Build a simple MLP from scratch using PyTorch")
print("- Use it for text classification (spam vs ham)")
print("- Compare it to classical models like Logistic Regression")

Welcome to Week 12!
This week, you'll:
- Understand how Multi-Layer Perceptrons (MLPs) work
- Build a simple MLP from scratch using PyTorch
- Use it for text classification (spam vs ham)
- Compare it to classical models like Logistic Regression


# **SECTION 2: What’s an MLP?**

### What’s an MLP (Multi-Layer Perceptron)?
An MLP is a type of **feedforward neural network**.  
It is built from layers of **neurons**, each computing a weighted sum of its inputs followed by an activation function.

Structure:
- Input Layer (e.g., TF-IDF vector)
- One or more **Hidden Layers** with ReLU
- Output Layer (e.g., probability of spam)

MLPs learn patterns in data using **backpropagation** and **gradient descent**.
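To make backpropagation and gradient descent concrete, here is a minimal sketch of one forward pass, one backward pass, and one manual weight update for a tiny one-hidden-layer network. The sizes (4 samples, 3 features, 5 hidden units), the learning rate, and the random data are arbitrary choices for illustration and are not part of the spam model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy data: 4 samples with 3 features each, binary labels (illustrative only).
x = torch.randn(4, 3)
y = torch.tensor([0.0, 1.0, 1.0, 0.0])

# Parameters of a one-hidden-layer MLP: 3 -> 5 -> 1, with small random weights.
W1 = (0.1 * torch.randn(3, 5)).requires_grad_()
b1 = torch.zeros(5, requires_grad=True)
W2 = (0.1 * torch.randn(5, 1)).requires_grad_()
b2 = torch.zeros(1, requires_grad=True)
params = [W1, b1, W2, b2]

def forward(inputs):
    h = torch.relu(inputs @ W1 + b1)               # hidden layer: weighted sum + ReLU
    return torch.sigmoid(h @ W2 + b2).squeeze(1)   # one probability per sample

# Forward pass and binary cross-entropy loss.
p = forward(x)
loss = F.binary_cross_entropy(p, y)

# Backpropagation: autograd fills .grad for every parameter.
loss.backward()

# One gradient descent step: move each parameter against its gradient.
lr = 0.1
b2_before = b2.detach().clone()
with torch.no_grad():
    for param in params:
        param -= lr * param.grad
        param.grad.zero_()

loss_after = F.binary_cross_entropy(forward(x), y)
print(f"loss before update: {loss.item():.4f}")
print(f"loss after update:  {loss_after.item():.4f}")
```

In the rest of the notebook, `loss.backward()` and `optimizer.step()` do the same two jobs, with Adam choosing the step size per parameter.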

---

## ✍️ Example: Spam Classifier with MLP (PyTorch)

- Input: 1000-d TF-IDF vector from SMS messages
- Output: Probability that message is spam
- Architecture:
  - Linear(1000 → 128) + ReLU
  - Dropout
  - Linear(128 → 1) + Sigmoid

---


# **SECTION 3: Load & Vectorize SMS Data**

We're building a spam classifier using PyTorch. Here's how the key components work:

- **Input:** Each SMS message is transformed into a 1000-dimensional **TF-IDF vector**, representing the importance of words in the message.
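As a quick illustration of what a TF-IDF vector looks like, the sketch below vectorizes three made-up messages (they are not from the SMS dataset) with a small vocabulary cap of 10 instead of 1000. Each row is one message, each column is one kept term, and rows are L2-normalized by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up messages, used only for illustration.
toy_messages = [
    "Win a free prize now",
    "Are we meeting for lunch today",
    "Free entry to win cash",
]

# Same settings as the notebook's vectorizer, but with a tiny vocabulary cap.
toy_vectorizer = TfidfVectorizer(stop_words="english", max_features=10)
toy_X = toy_vectorizer.fit_transform(toy_messages).toarray()

# Length of each row vector; TfidfVectorizer uses norm="l2" by default.
row_norms = np.linalg.norm(toy_X, axis=1)

print("vocabulary:", sorted(toy_vectorizer.vocabulary_))
print("matrix shape:", toy_X.shape)  # (number of messages, number of kept terms)
print(toy_X.round(2))
print("row norms:", row_norms.round(2))
```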

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import torch
from torch.utils.data import Dataset, DataLoader

In [3]:
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep='\t', names=["label", "message"])
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})

In [4]:
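# Keep at most the 1000 most frequent non-stop-word terms as TF-IDF features.
# Note: the vectorizer is fit on all messages before the split below, so the
# vocabulary and IDF weights are also computed from the test messages.
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X = vectorizer.fit_transform(df['message']).toarray()  # dense array, one row per message
y = df['label_num'].values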

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# **SECTION 4: PyTorch Dataset & Dataloader**

- **Dataset and DataLoader (Section 4):**
    - We create a custom `SMSDataset` class to efficiently load and access our SMS data (features and labels).
    - The `DataLoader` then wraps this dataset, providing batches of data for training and testing, and handles shuffling the training data to prevent learning order-dependent patterns.

In [6]:
class SMSDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

In [7]:
train_ds = SMSDataset(X_train, y_train)
test_ds = SMSDataset(X_test, y_test)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)
test_dl = DataLoader(test_ds, batch_size=32)
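To see what a `DataLoader` yields, the sketch below iterates over one built from random stand-in data with the same layout as the TF-IDF matrix. The array sizes are made up, and `TensorDataset` stands in for `SMSDataset` only so the snippet runs on its own.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 100 rows with 1000 features each, plus 0/1 labels.
rng = np.random.default_rng(0)
fake_X = rng.random((100, 1000)).astype(np.float32)
fake_y = rng.integers(0, 2, size=100).astype(np.float32)

# TensorDataset returns (features, label) pairs, like SMSDataset does.
fake_ds = TensorDataset(torch.from_numpy(fake_X), torch.from_numpy(fake_y))
fake_dl = DataLoader(fake_ds, batch_size=32, shuffle=True)

# Collect the size of every batch in one pass over the data.
batch_sizes = [xb.shape[0] for xb, yb in fake_dl]

first_xb, first_yb = next(iter(fake_dl))
print(first_xb.shape, first_yb.shape)  # torch.Size([32, 1000]) torch.Size([32])
print(batch_sizes)                     # [32, 32, 32, 4]
```

The last batch is smaller whenever the dataset size is not a multiple of `batch_size`, which is worth remembering when reshaping model outputs.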

# **SECTION 5: Define MLP Model**

- **MLP Model (Section 5):**
    - We define our `MLP` class, inheriting from `nn.Module`.
    - The network architecture consists of:
        - A **Linear layer** that takes the 1000-dimensional input and transforms it into a 128-dimensional hidden representation.
        - A **ReLU (Rectified Linear Unit) activation function** introduces non-linearity, allowing the network to learn complex relationships.
        - **Dropout** is a regularization technique that randomly zeroes a fraction of the hidden activations (0.2 in this model) during training, which helps reduce overfitting. It is switched off in evaluation mode.
        - Another **Linear layer** maps the 128-dimensional hidden representation to a 1-dimensional output.
        - A **Sigmoid activation function** squashes the output to a value between 0 and 1, representing the probability of the message being spam.
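The difference between training and evaluation mode is easiest to see on dropout alone. This small sketch passes a vector of ones through `nn.Dropout(0.2)` in both modes; the input size of 1000 is an arbitrary choice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)
x = torch.ones(1, 1000)

# Training mode: about 20% of entries are zeroed, the rest are scaled by 1 / (1 - 0.2).
drop.train()
train_out = drop(x)

# Evaluation mode: dropout does nothing.
drop.eval()
eval_out = drop(x)

train_values = train_out.unique().tolist()
zero_fraction = (train_out == 0).float().mean().item()

print("distinct values in training mode:", train_values)
print("fraction zeroed:", zero_fraction)
print("eval output unchanged:", torch.equal(eval_out, x))
```

The rescaling of surviving activations keeps the expected input to the next layer the same in both modes.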

In [8]:
import torch.nn as nn

In [9]:
class MLP(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)


In [10]:
model = MLP(X.shape[1])
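A quick sanity check on a model like this is to count its parameters and confirm the output shape. The sketch below rebuilds the same layer stack inline so it runs on its own, and it assumes the input has exactly 1000 features.

```python
import torch
import torch.nn as nn

# Same layer stack as the MLP class, written inline for a standalone check.
net = nn.Sequential(
    nn.Linear(1000, 128),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(128, 1),
    nn.Sigmoid(),
)

# Only the two Linear layers have parameters (weights and biases).
n_params = sum(p.numel() for p in net.parameters())
print(n_params)  # 128257 = (1000*128 + 128) + (128*1 + 1)

# A batch of 4 inputs gives 4 probabilities, each in its own row.
net.eval()
with torch.no_grad():
    out = net(torch.rand(4, 1000))
print(out.shape)  # torch.Size([4, 1])
```

The trailing dimension of size 1 is why the training loop squeezes the predictions before comparing them with the labels.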

# **SECTION 6: Train the MLP**

- **Training Loop (Section 6):**
    - We define the **loss function** (`nn.BCELoss`) suitable for binary classification.
    - We choose an **optimizer** (`Adam`) to update the model's weights based on the gradients.
    - The training loop iterates through the data for a specified number of **epochs**:
        - In each epoch, the model is set to **training mode** (`model.train()`).
        - We iterate through the batches of data provided by the `train_dl`.
        - For each batch, we perform the **forward pass** to get predictions.
        - We calculate the **loss** by comparing the predictions to the true labels.
        - We perform **backpropagation** (`loss.backward()`) to compute gradients.
        - We update the model's weights using the **optimizer** (`optimizer.step()`).
        - We reset the gradients before the next batch (`optimizer.zero_grad()`).
        - After each epoch, the model is set to **evaluation mode** (`model.eval()`).
        - We disable gradient calculation (`with torch.no_grad()`) during evaluation to save memory and computation.
        - We iterate through the `test_dl` to get predictions on the test set.
        - We calculate and print the **accuracy** of the model on the test set.
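For intuition about the loss, binary cross-entropy averages `-(y * log(p) + (1 - y) * log(1 - p))` over the batch. The sketch below checks a hand-written version against `nn.BCELoss` on three made-up predictions.

```python
import torch
import torch.nn as nn

# Made-up predicted probabilities and true labels (1 = spam, 0 = ham).
preds = torch.tensor([0.9, 0.2, 0.6])
labels = torch.tensor([1.0, 0.0, 0.0])

# Binary cross-entropy written out by hand, averaged over the batch.
manual = -(labels * torch.log(preds) + (1 - labels) * torch.log(1 - preds)).mean()

# The built-in loss used in the notebook.
builtin = nn.BCELoss()(preds, labels)

print(manual.item(), builtin.item())
```

The third prediction (0.6 for a ham message) contributes the most to the loss, because it is the one furthest from its label.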

In [11]:
from torch.optim import Adam
from sklearn.metrics import accuracy_score

In [12]:
loss_fn = nn.BCELoss()
optimizer = Adam(model.parameters(), lr=1e-3)

In [14]:
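for epoch in range(10):
    # Training: dropout is active in train mode.
    model.train()
    for xb, yb in train_dl:
        preds = model(xb).squeeze(1)  # (batch, 1) -> (batch,), also safe for a batch of one
        loss = loss_fn(preds, yb)
        loss.backward()               # compute gradients
        optimizer.step()              # update weights
        optimizer.zero_grad()         # clear gradients for the next batch

    # Evaluation: dropout is disabled and no gradients are tracked.
    model.eval()
    with torch.no_grad():
        all_preds = []
        for xb, yb in test_dl:
            preds = model(xb).squeeze(1)
            all_preds.extend((preds > 0.5).int().tolist())  # threshold at 0.5
        acc = accuracy_score(y_test, all_preds)
    print(f"Epoch {epoch+1}, Accuracy: {acc:.4f}")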

Epoch 1, Accuracy: 0.9892
Epoch 2, Accuracy: 0.9892
Epoch 3, Accuracy: 0.9892
Epoch 4, Accuracy: 0.9892
Epoch 5, Accuracy: 0.9892
Epoch 6, Accuracy: 0.9892
Epoch 7, Accuracy: 0.9892
Epoch 8, Accuracy: 0.9892
Epoch 9, Accuracy: 0.9892
Epoch 10, Accuracy: 0.9892


- **Output:** Probability that message is spam (a value between 0 and 1).

---

## 🔍 Why Use MLPs for Text?

Pros:
- Can learn **non-linear patterns** in the data, potentially capturing more complex relationships than linear models.
- More **expressive** than simple models like logistic regression, especially with multiple hidden layers.
- Can leverage different input representations like **embeddings** (dense vector representations of words) or simpler methods like TF-IDF.

Cons:
- **Don’t inherently capture word order** or sequential dependencies in the text, as they treat the input as a bag of features.
- May **struggle with long sequences**: if tokens are fed in position by position (for example, as concatenated embeddings), the input size grows with the sequence length.
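The word-order limitation can be shown directly: two sentences with the same words in a different order get identical TF-IDF vectors, so any model fed those vectors cannot tell them apart. The two sentences are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Same words, opposite meaning.
pair = ["dog bites man", "man bites dog"]
vecs = TfidfVectorizer().fit_transform(pair).toarray()

same = np.allclose(vecs[0], vecs[1])
print(vecs)
print("identical vectors:", same)  # True
```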

---


## 🧪 Performance Notes

On simple bag-of-words features like TF-IDF:
- MLP performance is often **comparable to or slightly better** than logistic regression for tasks like spam classification.
- Using **deeper networks** (more hidden layers) might improve performance but also increases the risk of **overfitting**, especially with limited data. Careful tuning and regularization are crucial.
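To run the comparison yourself, a logistic regression baseline on the same kind of features could look like the sketch below. The helper name `logreg_baseline` and the toy messages are hypothetical. Unlike the notebook's pipeline, it fits the vectorizer on the training split only, so the scores are not strictly like-for-like.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def logreg_baseline(messages, labels, max_features=1000, seed=42):
    """Train TF-IDF + logistic regression and return test accuracy."""
    msg_train, msg_test, y_train, y_test = train_test_split(
        messages, labels, test_size=0.2, random_state=seed
    )
    # Fit the vocabulary and IDF weights on the training messages only.
    vec = TfidfVectorizer(stop_words="english", max_features=max_features)
    X_train = vec.fit_transform(msg_train)
    X_test = vec.transform(msg_test)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

# Toy usage with made-up messages (1 = spam, 0 = ham).
toy_spam = [f"win free cash prize offer {i}" for i in range(10)]
toy_ham = [f"see you at lunch tomorrow {i}" for i in range(10)]
toy_acc = logreg_baseline(toy_spam + toy_ham, [1] * 10 + [0] * 10)
print(f"toy accuracy: {toy_acc:.2f}")
```

On the SMS data, the call would be `logreg_baseline(df['message'], df['label_num'])`, assuming `df` is loaded as in Section 3.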

---

## 🧠 Real-World Use Cases

| Use Case                  | MLP Role                             |
|---------------------------|--------------------------------------|
| Spam/Ham Classifier       | Text classification using TF-IDF features. |
| Sentiment Analysis        | Provides a simple deep learning baseline, especially with aggregated word embeddings as input. |
| Intent Detection          | Can be effective when used with pre-trained word embeddings or sentence embeddings as input features. |
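As a rough sketch of the "aggregated word embeddings" idea, the model below averages the embeddings of a message's tokens (ignoring padding) and feeds the result to the same kind of MLP. The class name `MeanPoolMLP`, the vocabulary size, and the dimensions are assumptions for illustration, and the embeddings here are randomly initialized rather than pre-trained.

```python
import torch
import torch.nn as nn

class MeanPoolMLP(nn.Module):
    """Average token embeddings, then classify with a small MLP."""

    def __init__(self, vocab_size, embed_dim=50, hidden_dim=128, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor, padded with pad_idx.
        mask = (token_ids != self.pad_idx).unsqueeze(-1).float()  # (batch, seq_len, 1)
        summed = (self.embed(token_ids) * mask).sum(dim=1)        # sum over real tokens
        counts = mask.sum(dim=1).clamp(min=1.0)                   # avoid division by zero
        return self.net(summed / counts)                          # (batch, 1)

# Padding should not change the prediction for the same tokens.
emb_model = MeanPoolMLP(vocab_size=20)
emb_model.eval()
with torch.no_grad():
    out_short = emb_model(torch.tensor([[3, 5]]))
    out_padded = emb_model(torch.tensor([[3, 5, 0, 0]]))
print(out_short.item(), out_padded.item())
```

Mean pooling still discards word order, so this changes the input representation without removing the bag-of-features limitation.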

---

## 🚧 Limitations

- MLPs take a fixed-size feature vector with no notion of position, so they **don’t inherently understand word order** or sequential context.
- Unlike Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), MLPs **don’t share weights** across different parts of the input, making them less efficient for processing sequential data.
- Performance can be **highly sensitive to the input representation**. The quality of the features (e.g., TF-IDF, embeddings) significantly impacts the model's ability to learn.

---


Next week, we’ll go from MLPs to **Recurrent Neural Networks (RNNs)**, which are built for **sequential data** like sentences! 🧠📈

# **SECTION 7: Wrap-up**

### 🎉 You Trained an MLP!
- It learned to classify messages as spam/ham
- You used a **dense neural network** with hidden layers
- Performance is often similar to logistic regression with TF-IDF, but can improve with more features, deeper models, or embeddings

➡️ Next week: we explore **RNNs & LSTMs** for handling sequential data like sentences.

# **SECTION 8: Exercises**

### ✍️ Exercises:
1. Change the number of hidden units and observe accuracy.
2. Add another hidden layer and compare.
3. Use other vectorizers like `CountVectorizer` or `HashingVectorizer`.
4. Try embeddings instead of TF-IDF for input.
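As a possible starting point for exercises 1 and 2, the sketch below builds an MLP from a list of hidden layer sizes. The helper names `make_mlp` and `count_params` are hypothetical, and the layer pattern simply repeats the Linear + ReLU + Dropout block used in Section 5.

```python
import torch
import torch.nn as nn

def make_mlp(input_dim, hidden_dims=(128,), dropout=0.2):
    """Build Linear + ReLU + Dropout blocks, followed by a sigmoid output."""
    layers = []
    prev = input_dim
    for h in hidden_dims:
        layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(dropout)]
        prev = h
    layers += [nn.Linear(prev, 1), nn.Sigmoid()]
    return nn.Sequential(*layers)

def count_params(module):
    return sum(p.numel() for p in module.parameters())

baseline = make_mlp(1000)                       # same shape as the notebook's model
wider = make_mlp(1000, hidden_dims=(256,))      # exercise 1: more hidden units
deeper = make_mlp(1000, hidden_dims=(128, 64))  # exercise 2: an extra hidden layer

for name, m in [("baseline", baseline), ("wider", wider), ("deeper", deeper)]:
    print(name, count_params(m))
```

Any of these can be trained with the loop from Section 6 by assigning it to `model` and creating a new optimizer for its parameters.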