# Assignment 2: Pre-trained Word Embeddings on LSTM Text Classification (Transfer Learning)
- In an LSTM-based text classification model, compare
  1. a randomly initialized embedding layer
  2. an embedding layer initialized with pre-trained GloVe vectors
- to determine whether pre-training acts as a regularization mechanism.

## 0. Environment Check

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

## 1. Dataset: AG_NEWS

We download the CSV version of the `AG_NEWS` dataset directly.

https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset

- 4 classes (World, Sports, Business, Sci/Tech)
- Format: `label,title,text`

In [None]:
import os

data_dir = "data_ag_news"
os.makedirs(data_dir, exist_ok=True)

train_csv = os.path.join(data_dir, "train.csv")
test_csv = os.path.join(data_dir, "test.csv")

if not os.path.exists(train_csv) or not os.path.exists(test_csv):
    # AG_NEWS CSV (widely used mirror)
    !wget -O $train_csv https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
    !wget -O $test_csv  https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv

print("Files:", os.listdir(data_dir))

### 1-1. Loading the CSV and Checking the Data

In [None]:
import pandas as pd

train_df = pd.read_csv(train_csv, header=None)
test_df = pd.read_csv(test_csv, header=None)

# CSV structure: [label, title, text]
train_df.columns = ["label", "title", "text"]
test_df.columns = ["label", "title", "text"]

# Combine title + text into a single input sentence
train_df["text"] = train_df["title"] + " " + train_df["text"]
test_df["text"] = test_df["title"] + " " + test_df["text"]

# Optionally remove unused columns
train_df = train_df[["label", "text"]]
test_df = test_df[["label", "text"]]

print("Train size:", len(train_df))
print("Test size:", len(test_df))
train_df.head()

## 2. Tokenization & Vocabulary

- We use a very simple **whitespace tokenizer**.
- Only words with frequency ≥ `min_freq` are included in the vocab.
- We manually add the special tokens `<pad>` and `<unk>`.

### 🔶 TODO 1
- Try changing `min_freq`, `max_vocab_size`, etc., and observe how they affect performance and speed.

In [None]:
from collections import Counter
from typing import List

PAD_TOKEN = "<pad>"
UNK_TOKEN = "<unk>"

def tokenizer(text: str) -> List[str]:
    return text.lower().strip().split()

def build_vocab(texts, min_freq: int = 5, max_vocab_size: int = 20000):
    counter = Counter()
    for t in texts:
        counter.update(tokenizer(t))
    # Sort by highest frequency
    most_common = counter.most_common(max_vocab_size)
    itos = [PAD_TOKEN, UNK_TOKEN]
    for word, freq in most_common:
        if freq >= min_freq:
            itos.append(word)
    stoi = {w: i for i, w in enumerate(itos)}
    return itos, stoi

itos, stoi = build_vocab(train_df["text"].tolist(), min_freq=5, max_vocab_size=20000)
vocab_size = len(itos)

print("Vocab size:", vocab_size)
print("First 10 tokens:", itos[:10])

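For TODO 1, one way to see what `min_freq` does before retraining is to measure how many words survive the cut and what fraction of all tokens they cover. The sketch below is only an illustration: `toy_texts` and `vocab_stats` are hypothetical names that do not appear in the notebook, and the counting mirrors `build_vocab` under that assumption.

```python
from collections import Counter

# Toy corpus standing in for train_df["text"].tolist() (hypothetical data)
toy_texts = [
    "stocks rise as markets rally",
    "markets fall as oil prices rise",
    "team wins final as fans rally",
    "new chip boosts phone speed",
]

def vocab_stats(texts, min_freq, max_vocab_size):
    """Return (vocab size including <pad>/<unk>, fraction of tokens covered)."""
    counter = Counter()
    for t in texts:
        counter.update(t.lower().strip().split())
    # Same rule as build_vocab: take the most frequent words, then drop rare ones
    kept = [w for w, c in counter.most_common(max_vocab_size) if c >= min_freq]
    covered = sum(counter[w] for w in kept)
    total = sum(counter.values())
    return len(kept) + 2, covered / total

for mf in (1, 2, 3):
    size, coverage = vocab_stats(toy_texts, min_freq=mf, max_vocab_size=20000)
    print(f"min_freq={mf}: vocab_size={size}, token coverage={coverage:.2f}")
```

Tokens that fall outside the vocabulary are all mapped to `<unk>`, so lower coverage means the model sees less lexical detail, while a smaller vocabulary means a smaller embedding table.
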
## 3. Dataset & DataLoader Definition

- Convert text into sequences of integer IDs  
- Truncate or pad sequences to `max_len`  
- Convert labels into integers in the range 0~3  

### Label Mapping
- **0 → World**  
- **1 → Sports**  
- **2 → Business**  
- **3 → Sci/Tech**

### 🔶 TODO 2
- Try changing the `max_len` value and observe how it affects performance and training time.

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

def text_to_ids(text: str, stoi, max_len: int = 100):
    tokens = tokenizer(text)
    ids = [stoi.get(tok, stoi[UNK_TOKEN]) for tok in tokens]
    if len(ids) > max_len:
        ids = ids[:max_len]
    else:
        ids = ids + [stoi[PAD_TOKEN]] * (max_len - len(ids))
    return ids

class NewsDataset(Dataset):
    def __init__(self, df, stoi, max_len=100):
        self.df = df.reset_index(drop=True)
        self.stoi = stoi
        self.max_len = max_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        # original labels are 1~4, we convert to 0~3
        label = int(row["label"]) - 1
        text = str(row["text"])
        ids = text_to_ids(text, self.stoi, self.max_len)
        return torch.tensor(label, dtype=torch.long), torch.tensor(ids, dtype=torch.long)

max_len = 100
batch_size = 128

train_dataset = NewsDataset(train_df, stoi, max_len=max_len)
test_dataset = NewsDataset(test_df, stoi, max_len=max_len)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

labels, texts = next(iter(train_loader))
print("Batch labels shape:", labels.shape)
print("Batch texts shape:", texts.shape)

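For TODO 2, a quick way to pick candidate `max_len` values is to check how many texts would be truncated and how much of each batch would be padding. The following is a minimal sketch with made-up data; `truncation_report` and `toy_docs` are hypothetical names, and the tokenization is assumed to match the whitespace tokenizer above.

```python
def truncation_report(texts, max_len):
    """Return (fraction of texts truncated, fraction of positions that are padding)."""
    lengths = [len(t.lower().strip().split()) for t in texts]
    truncated = sum(1 for n in lengths if n > max_len) / len(lengths)
    pad_positions = sum(max(max_len - n, 0) for n in lengths)
    return truncated, pad_positions / (max_len * len(lengths))

# Toy documents with 3, 6, 1, and 10 tokens (hypothetical data)
toy_docs = ["a b c", "a b c d e f", "a", "a b c d e f g h i j"]

for candidate in (2, 5, 10):
    truncated, padding = truncation_report(toy_docs, candidate)
    print(f"max_len={candidate}: truncated={truncated:.2f}, padding={padding:.2f}")
```

In the notebook, the same function could be called on `train_df["text"].tolist()`. A larger `max_len` loses less text but makes every LSTM step count grow, which is where the training-time difference comes from.
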
## 4. LSTM Text Classifier

- Embedding layer  
- Bi-LSTM  
- Collect the final hidden states and feed them into a Linear layer for classification  

We experiment with two versions of the embedding layer:  
1. Randomly initialized (baseline)  
2. Initialized with GloVe (Transfer Learning)

In [None]:
import torch.nn as nn

pad_idx = stoi[PAD_TOKEN]

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes,
                 num_layers=1, bidirectional=True, dropout=0.2,
                 embedding_matrix=None, freeze_embedding=False):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        if embedding_matrix is not None:
            self.embedding.weight.data.copy_(torch.tensor(embedding_matrix, dtype=torch.float))
        self.embedding.weight.requires_grad = not freeze_embedding

        self.num_directions = 2 if bidirectional else 1
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=bidirectional,
            dropout=dropout if num_layers > 1 else 0.0,
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim * self.num_directions, num_classes)

    def forward(self, x):
        emb = self.embedding(x)  # (B, L, E)
        output, (h_n, c_n) = self.lstm(emb)
        # Use only the hidden state of the last layer
        last_layer_h = h_n[-self.num_directions:, :, :]  # (num_directions, B, H)
        last_h = last_layer_h.transpose(0, 1).reshape(x.size(0), -1)
        out = self.fc(self.dropout(last_h))
        return out

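The indexing of `h_n` in `forward` can be checked in isolation. The sketch below uses toy sizes (all values are hypothetical) and shows that the slice-and-reshape used above yields the concatenation of the top layer's forward and backward final states, which can also be read directly from `output`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, L, E, H, num_layers = 4, 7, 5, 3, 2  # toy sizes (hypothetical)

lstm = nn.LSTM(input_size=E, hidden_size=H, num_layers=num_layers,
               batch_first=True, bidirectional=True)
emb = torch.randn(B, L, E)
output, (h_n, c_n) = lstm(emb)

# h_n has shape (num_layers * num_directions, B, H);
# its last two entries are the top layer's forward and backward states
last_layer_h = h_n[-2:, :, :]                          # (2, B, H)
last_h = last_layer_h.transpose(0, 1).reshape(B, -1)   # (B, 2H)

# Equivalent, more explicit form
last_h_cat = torch.cat([h_n[-2], h_n[-1]], dim=1)

# Same vectors read from `output`: forward state at the last time step,
# backward state at the first time step
from_output = torch.cat([output[:, -1, :H], output[:, 0, H:]], dim=1)

print(h_n.shape, last_h.shape)
print(torch.equal(last_h, last_h_cat), torch.allclose(last_h, from_output, atol=1e-6))
```

Because the sequences here are right-padded to `max_len`, the forward direction's final state is computed after reading the `<pad>` positions. Packing the batch with `torch.nn.utils.rnn.pack_padded_sequence` is one possible way to avoid this and could be tried as an optional experiment.
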
## 5. Train / Evaluation Functions

Feel free to modify the optimizer, scheduler, etc., if needed for your experiments.

In [None]:
import torch.optim as optim

def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0.0
    correct = 0
    total = 0

    for labels, texts in loader:
        # Move data to device
        labels = labels.to(device)
        texts = texts.to(device)

        optimizer.zero_grad()
        outputs = model(texts)
        loss = criterion(outputs, labels)

        # Backpropagation
        loss.backward()
        optimizer.step()

        # Accumulate loss and accuracy
        total_loss += loss.item() * labels.size(0)
        preds = outputs.argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)

    return total_loss / total, correct / total


def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for labels, texts in loader:
            # Move data to device
            labels = labels.to(device)
            texts = texts.to(device)

            outputs = model(texts)
            loss = criterion(outputs, labels)

            # Accumulate loss and accuracy
            total_loss += loss.item() * labels.size(0)
            preds = outputs.argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)

    return total_loss / total, correct / total

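If you do change the optimizer setup, the sketch below shows one possible pattern: a step learning-rate schedule plus gradient-norm clipping. It is not part of the assignment code; `toy_model` is a stand-in for `LSTMClassifier`, and the `step_size`, `gamma`, and `max_norm` values are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
toy_model = nn.Linear(8, 4)  # stand-in for LSTMClassifier (hypothetical)
toy_criterion = nn.CrossEntropyLoss()
toy_optimizer = optim.Adam(toy_model.parameters(), lr=1e-3)
# Halve the learning rate every 2 epochs (arbitrary values)
toy_scheduler = optim.lr_scheduler.StepLR(toy_optimizer, step_size=2, gamma=0.5)

x = torch.randn(16, 8)
y = torch.randint(0, 4, (16,))

lrs = []
for epoch in range(1, 5):
    toy_optimizer.zero_grad()
    loss = toy_criterion(toy_model(x), y)
    loss.backward()
    # Clip the global gradient norm before the update
    nn.utils.clip_grad_norm_(toy_model.parameters(), max_norm=1.0)
    toy_optimizer.step()
    lrs.append(toy_optimizer.param_groups[0]["lr"])
    toy_scheduler.step()  # once per epoch, after the optimizer updates

print(lrs)
```

In the notebook, the clipping call would go between `loss.backward()` and `optimizer.step()` inside `train_one_epoch`, and the scheduler step would go at the end of each epoch in the training loop.
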
## 6. Experiment 1: Random Initialization (Baseline)

### 🔶 TODO 3
- Run the cell below to evaluate the LSTM trained with **random embeddings**.
- Try adjusting `hidden_dim`, `num_layers`, and `num_epochs` to compare performance.

In [None]:
vocab_size = len(itos)
embed_dim = 100
hidden_dim = 128
num_classes = 4
num_epochs = 5
lr = 1e-3

baseline_model = LSTMClassifier(
    vocab_size=vocab_size,
    embed_dim=embed_dim,
    hidden_dim=hidden_dim,
    num_classes=num_classes,
    bidirectional=True,
    embedding_matrix=None,
    freeze_embedding=False,
).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(baseline_model.parameters(), lr=lr)

for epoch in range(1, num_epochs + 1):
    train_loss, train_acc = train_one_epoch(baseline_model, train_loader, criterion, optimizer, device)
    val_loss, val_acc = evaluate(baseline_model, test_loader, criterion, device)
    print(f"[Baseline] Epoch {epoch}: "
          f"TrainLoss={train_loss:.4f} Acc={train_acc:.4f} | "
          f"TestLoss={val_loss:.4f} Acc={val_acc:.4f}")

## 7. Experiment 2: Pre-trained GloVe Initialization

Now we initialize the embedding layer using **GloVe 6B 100d**.

### 7-1. Downloading the GloVe File (only once)

In [None]:
glove_zip = "glove.6B.zip"
glove_txt = "glove.6B.100d.txt"

if not os.path.exists(glove_txt):
    if not os.path.exists(glove_zip):
        !wget http://nlp.stanford.edu/data/glove.6B.zip
    !unzip -o glove.6B.zip

print("GloVe file exists:", os.path.exists(glove_txt))

### 7-2. Loading GloVe and Building the Embedding Matrix
- Run the code below to create the `embedding_matrix`.

In [None]:
import numpy as np

embedding_index = {}
with open(glove_txt, encoding="utf-8") as f:
    for line in f:
        values = line.rstrip().split(" ")
        word = values[0]
        vector = np.asarray(values[1:], dtype="float32")
        embedding_index[word] = vector

print("GloVe vocab size:", len(embedding_index))

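embedding_dim = 100
embedding_matrix = np.random.normal(scale=0.6, size=(vocab_size, embedding_dim)).astype("float32")
# Keep the <pad> row at zero: copying this matrix into nn.Embedding would
# otherwise overwrite the zero vector that padding_idx sets up
embedding_matrix[stoi[PAD_TOKEN]] = 0.0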

oov_count = 0
for idx, word in enumerate(itos):
    vec = embedding_index.get(word, None)
    if vec is not None:
        embedding_matrix[idx] = vec
    else:
        oov_count += 1

print("OOV words:", oov_count, "/", vocab_size)

### 7-3. Training the LSTM Initialized with GloVe

### 🔶 TODO 4
- Try both `freeze_embedding=True` and `freeze_embedding=False`.
- Compare how test accuracy and training speed differ between the two settings.

In [None]:
glove_model = LSTMClassifier(
    vocab_size=vocab_size,
    embed_dim=embedding_dim,
    hidden_dim=hidden_dim,
    num_classes=num_classes,
    bidirectional=True,
    embedding_matrix=embedding_matrix,
    freeze_embedding=False,  # TODO: also try setting this to True
).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(glove_model.parameters(), lr=lr)

for epoch in range(1, num_epochs + 1):
    train_loss, train_acc = train_one_epoch(glove_model, train_loader, criterion, optimizer, device)
    val_loss, val_acc = evaluate(glove_model, test_loader, criterion, device)
    print(f"[GloVe Init] Epoch {epoch}: "
          f"TrainLoss={train_loss:.4f} Acc={train_acc:.4f} | "
          f"TestLoss={val_loss:.4f} Acc={val_acc:.4f}")

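For TODO 4, it can help to confirm what freezing actually changes. The sketch below is an illustration only: it uses `nn.Embedding.from_pretrained` with a random stand-in for the GloVe matrix (the notebook instead copies the matrix and sets `requires_grad`), and `toy_vocab_size`, `toy_embed_dim`, and `count_trainable` are hypothetical names.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_vocab_size, toy_embed_dim = 1000, 100                 # toy sizes (hypothetical)
pretrained = torch.randn(toy_vocab_size, toy_embed_dim)   # stand-in for the GloVe matrix

def count_trainable(module):
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

frozen = nn.Embedding.from_pretrained(pretrained, freeze=True, padding_idx=0)
tuned = nn.Embedding.from_pretrained(pretrained, freeze=False, padding_idx=0)

head = nn.Linear(toy_embed_dim, 4)
ids = torch.randint(1, toy_vocab_size, (8, 12))

for name, emb in (("frozen", frozen), ("tuned", tuned)):
    # Mean-pool the embeddings and run one backward pass
    loss = head(emb(ids).mean(dim=1)).sum()
    loss.backward()
    print(name, "trainable params:", count_trainable(emb),
          "| has grad:", emb.weight.grad is not None)
```

With frozen embeddings no gradient is computed for the embedding table, which is the main source of any speed difference. Passing only `[p for p in model.parameters() if p.requires_grad]` to the optimizer is an optional tidy-up.
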
## 8. Analysis (Report)

🔶 Write brief analyses for the following:

1. **Test Accuracy Comparison**  
   - Random init vs. GloVe (freeze / non-freeze)

2. **Training**  
   - Discuss which setting learns better/faster.

3. **Effect of Pre-trained Embeddings**  
   - Briefly describe whether pre-training appears to act as a form of regularization.

4. **Embedding Analysis**  
   - Compare the learned embeddings from Random init vs. GloVe init.
   - You may examine word similarities, PCA/TSNE visualizations, OOV handling differences, etc.
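
As a starting point for the embedding analysis, the sketch below shows cosine-similarity nearest neighbors and a 2-D PCA projection using NumPy only. The vocabulary and vectors are toy values, and `nearest_neighbors`, `pca_2d`, `toy_itos`, `toy_stoi`, and `toy_weights` are hypothetical names that do not appear in the notebook.

```python
import numpy as np

def nearest_neighbors(word, itos, stoi, weights, k=3):
    """Return the k words with the highest cosine similarity to `word`."""
    normed = weights / (np.linalg.norm(weights, axis=1, keepdims=True) + 1e-8)
    sims = normed @ normed[stoi[word]]
    order = np.argsort(-sims)
    return [(itos[i], float(sims[i])) for i in order if i != stoi[word]][:k]

def pca_2d(weights):
    """Project the rows of `weights` onto their first two principal components."""
    centered = weights - weights.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy vocabulary and vectors (hypothetical data)
toy_itos = ["cat", "dog", "car", "bus"]
toy_stoi = {w: i for i, w in enumerate(toy_itos)}
toy_weights = np.array([[1.0, 0.1, 0.0],
                        [0.9, 0.2, 0.0],
                        [0.0, 0.1, 1.0],
                        [0.1, 0.0, 0.9]], dtype="float32")

print(nearest_neighbors("cat", toy_itos, toy_stoi, toy_weights, k=2))
print(pca_2d(toy_weights).shape)
```

In the notebook, the learned vectors could be extracted with `baseline_model.embedding.weight.detach().cpu().numpy()` (and likewise for `glove_model`) and passed in together with `itos` and `stoi`, so the neighbors of the same word can be compared across the two models.
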