In [1]:
# !pip install torch==2.2.0 torchtext==0.17.0 -f https://download.pytorch.org/whl/torch_stable.html
# !pip install -U scikit-learn
!nvidia-smi

Mon Dec  9 21:02:21 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce GTX 1660 Ti     Off | 00000000:01:00.0 Off |                  N/A |
| N/A   44C    P3               9W /  45W |      6MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
import torch
import torchtext

print("Torch Version:", torch.__version__)
print("Torchtext Version:", torchtext.__version__)

Torch Version: 2.2.0+rocm5.7
Torchtext Version: 0.17.0+cpu


In [3]:
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchtext.vocab import Vocab, vocab
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
import re
from collections import Counter
from typing import List, Tuple, Dict, Optional, Any

## Feedforward Neural Network (FFNN)


### Data Loading

We will use the same named entity recognition (NER) dataset as in Assignment #2. First, download the data and take a look at the first 50 lines:


Each line corresponds to a word. Different sentences are separated by an additional line break. Take "EU NNP I-NP ORG" as an example. "EU" is a word. "NNP" and "I-NP" are tags for POS tagging and chunking, which we will ignore. "ORG" is the tag for NER, which is our prediction target. There are 5 possible values for the NER tag: ORG, PER, LOC, MISC, and O.
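
To make the format concrete, here is a minimal sketch of how a single line splits into its four fields. The helper name `parse_conll_line` is hypothetical and only used for illustration; the notebook's own parsing is done in `read_data_file`.

```python
from typing import Tuple


def parse_conll_line(line: str) -> Tuple[str, str]:
    """Split one 'word POS chunk NER' line and keep only the word and the NER tag."""
    # The POS and chunking tags are ignored, as described above.
    word, _pos, _chunk, ner = line.split()
    return word, ner


word, tag = parse_conll_line("EU NNP I-NP ORG")
print(word, tag)  # EU ORG
```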


First, we write a dataloader for loading the dataset into mini-batches used for training the model. See [torch.utils.data](https://pytorch.org/docs/stable/data.html) for how dataloaders work in PyTorch. In short, we typically need to do two things:

1. Define a [map-style dataset](https://pytorch.org/docs/stable/data.html#map-style-datasets) by subclassing [Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) and overriding 3 methods: `__init__`, `__getitem__`, and `__len__`.
1. Create a [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) by calling its constructor. We have to specify the dataset and a few hyperparameters such as batch size.

We have done most of the work for you. As a simple exercise, try to understand the code and implement `__len__`.
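
As a warm-up, the two steps can be sketched on a toy problem. `SquaresDataset` is a made-up dataset for illustration only and is unrelated to the NER data; it just shows the three overridden methods and how a `DataLoader` turns single items into batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Toy map-style dataset: item i is the pair (i, i * i)."""

    def __init__(self, n: int) -> None:
        super().__init__()
        self.items = [(i, i * i) for i in range(n)]

    def __getitem__(self, idx: int):
        x, y = self.items[idx]
        return {"x": torch.tensor(x), "y": torch.tensor(y)}

    def __len__(self) -> int:
        return len(self.items)


# 10 items with batch_size=4 give batches of sizes 4, 4, and 2.
loader = DataLoader(SquaresDataset(10), batch_size=4, shuffle=False)
batches = list(loader)
print(len(batches))
print(batches[0]["x"])
```

Each batch is a dictionary with the same keys as a single item; the per-item tensors are stacked along a new first dimension.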


In [4]:
# A sentence is a list of (word, tag) tuples.
# For example, [("hello", "O"), ("world", "O"), ("!", "O")]
Sentence = List[Tuple[str, str]]


def read_data_file(
    datapath: str,
) -> Tuple[List[Sentence], Dict[str, int], Dict[str, int]]:
    """
    Read and preprocess input data from the file `datapath`.
    Example:
    ```
        sentences, word_cnt, tag_cnt = read_data_file("eng.train")
    ```
    Return values:
        `sentences`: a list of sentences, including words and NER tags
        `word_cnt`: a Counter object, the number of occurrences of each word
        `tag_cnt`: a Counter object, the number of occurrences of each NER tag
    """
    sentences: List[Sentence] = []
    word_cnt: Dict[str, int] = Counter()
    tag_cnt: Dict[str, int] = Counter()

    for sentence_txt in open(datapath).read().split("\n\n"):
        if "DOCSTART" in sentence_txt:
            # Ignore dummy sentences at the beginning of each document.
            continue
        # Read a new sentence
        sentences.append([])
        for token in sentence_txt.split("\n"):
            w, _, _, t = token.split()
            # Replace all digits with "0" to reduce out-of-vocabulary words
            w = re.sub(r"\d", "0", w)
            word_cnt[w] += 1
            tag_cnt[t] += 1
            sentences[-1].append((w, t))

    return sentences, word_cnt, tag_cnt



## Implement the `__len__` function below **(1 point)**


In [5]:

class FixedWindowDataset(Dataset):
    """
    Each data example is a word, its NER tag (the target), and a fixed window centered around it (the input).
    """

    def __init__(
        self,
        datapath: str,
        window_size: int,
        words_vocab: Optional[Vocab] = None,
        tags_vocab: Optional[Vocab] = None,
    ) -> None:
        """
        Initialize the dataset by reading from datapath.
        """
        super().__init__()
        self.examples = []
        START = "<START>"
        END = "<END>"
        UNKNOWN = "<UNKNOWN>"

        print("Loading data from %s" % datapath)
        sentences, word_cnt, tag_cnt = read_data_file(datapath)

        # Extract windows
        for sent in sentences:
            words = [START for _ in range(window_size)]
            tags = [None for _ in range(window_size)]
            for w, t in sent:
                words.append(w)
                tags.append(t)
            words.extend([END for _ in range(window_size)])
            tags.extend([None for _ in range(window_size)])

            for i, t in enumerate(tags[window_size:-window_size], start=window_size):
                self.examples.append(
                    {
                        "word": words[i],
                        "tag": t,
                        "context": words[i - window_size : i + window_size + 1],
                    }
                )

        print("%d examples loaded." % len(self.examples))

        # set vocabs
        if words_vocab is None:
            words_vocab = vocab(word_cnt, specials=[START, END, UNKNOWN]) # automatically create a vocabulary from words in dataset
            words_vocab.set_default_index(words_vocab[UNKNOWN])
        self.words_vocab = words_vocab
        self.unknown_idx = self.words_vocab[UNKNOWN]
        self.start_idx = self.words_vocab[START]
        self.end_idx = self.words_vocab[END]

        if tags_vocab is None:
            tags_vocab = vocab(tag_cnt, specials=[]) # automatically create tags vocabulary from tags in dataset
        self.tags_vocab = tags_vocab

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        """
        Get the idx'th example in the dataset.
        Convert words and the tag to indexes.
        """
        example = self.examples[idx]
        word = example["word"]
        tag = example["tag"]
        context = example["context"]
        return {
            "word": word,
            "word_idx": self.words_vocab[word],
            "tag": tag,
            "tag_idx": self.tags_vocab[tag],
            "context": context,
            "context_idxs": torch.tensor(
                [self.words_vocab[w] for w in context]
            ),
        }

    def __len__(self) -> int:
        """
        Return the number of examples in the dataset.
        """
        # TODO: Implement this method
        return len(self.examples)




In [6]:
def create_fixed_window_dataloaders(
    batch_size: int, window_size: int, shuffle: bool = True
) -> Tuple[DataLoader, DataLoader, FixedWindowDataset]:
    """
    Create the dataloaders for training and validation.
    Also return the training dataset, which holds the word and tag vocabularies.
    """
    ds_train = FixedWindowDataset("eng.train", window_size)
    # Re-use the vocabulary of the training data
    ds_val = FixedWindowDataset("eng.val", window_size, words_vocab=ds_train.words_vocab, tags_vocab=ds_train.tags_vocab)
    loader_train = DataLoader(
        ds_train, batch_size, shuffle, drop_last=True, pin_memory=True
    )
    loader_val = DataLoader(ds_val, batch_size, pin_memory=True)
    return loader_train, loader_val, ds_train

Let's test our dataloader. Try to understand the output, as it will save you time later.


In [7]:
def check_fixed_window_dataloader() -> None:
    loader_train, _, _ = create_fixed_window_dataloaders(
        batch_size=3, window_size=2, shuffle=False
    )
    print("Iterating on the training data..")
    for i, data_batch in enumerate(loader_train):
        if i == 0:
            print(data_batch)
            print(len(data_batch["context"]))
            print(data_batch["context_idxs"].shape)
    print("Done!")


check_fixed_window_dataloader()

Loading data from eng.train
203621 examples loaded.
Loading data from eng.val
49086 examples loaded.
Iterating on the training data..
{'word': ['EU', 'rejects', 'German'], 'word_idx': tensor([3, 4, 5]), 'tag': ['ORG', 'O', 'MISC'], 'tag_idx': tensor([0, 1, 2]), 'context': [('<START>', '<START>', 'EU'), ('<START>', 'EU', 'rejects'), ('EU', 'rejects', 'German'), ('rejects', 'German', 'call'), ('German', 'call', 'to')], 'context_idxs': tensor([[0, 0, 3, 4, 5],
        [0, 3, 4, 5, 6],
        [3, 4, 5, 6, 7]])}
5
torch.Size([3, 5])
Done!
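
The `context` entry in the output above may look surprising: it holds 5 tuples of 3 words rather than 3 windows of 5 words. A minimal pure-Python sketch reproduces this, assuming the default batching groups lists of strings position by position (which is consistent with the printed batch). The helper `extract_windows` is hypothetical and mirrors the padding logic in `FixedWindowDataset`.

```python
from typing import List


def extract_windows(words: List[str], window_size: int) -> List[List[str]]:
    """Pad with <START>/<END> and return one fixed window per word."""
    padded = ["<START>"] * window_size + words + ["<END>"] * window_size
    return [
        padded[i - window_size : i + window_size + 1]
        for i in range(window_size, window_size + len(words))
    ]


sentence = ["EU", "rejects", "German", "call", "to"]
windows = extract_windows(sentence, window_size=2)

batch = windows[:3]              # the first batch of size 3
transposed = list(zip(*batch))   # grouped by position inside the window

print(batch[0])       # ['<START>', '<START>', 'EU', 'rejects', 'German']
print(transposed[0])  # ('<START>', '<START>', 'EU')
```

The tensor `context_idxs`, in contrast, keeps the `batch_size x (2 * window_size + 1)` layout, which is what the model consumes.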


### Implement the Model **(4 points)**

Next, let's implement feedforward neural networks following the description of Problem 1 in Assignment #3.

Models in PyTorch are subclasses of [torch.nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module). You have to override `__init__` for initializing the model and `forward` for calculating the forward pass. Check out this [tutorial](https://pytorch.org/tutorials/beginner/nn_tutorial.html#) if you are not sure how torch.nn.Module works.

PyTorch provides a wide array of [neural network layers](https://pytorch.org/docs/stable/nn.html) as building blocks for your model. Here are some of them that may be relevant:

- [nn.Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#torch.nn.Embedding)
- [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear)
- [torch.sigmoid](https://pytorch.org/docs/stable/generated/torch.sigmoid.html#torch.sigmoid) or [nn.Sigmoid](https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html#torch.nn.Sigmoid)

One difference from Problem 3 of Assignment #2 is that we do not apply softmax when calculating $\hat{y}^{(t)}$. Instead, the loss function [F.cross_entropy](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.cross_entropy) applies it internally. For details, compare it with [F.nll_loss](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.nll_loss), which expects log-probabilities as input.
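
The relationship between the two loss functions can be checked in a few lines. This is a small sketch with made-up logits and targets, not part of the assignment code: `F.cross_entropy` on raw logits should match `F.nll_loss` applied to `F.log_softmax` of the same logits.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 5)            # 4 examples, 5 tags (made-up values)
targets = torch.tensor([0, 1, 4, 2])  # made-up gold tag indexes

# cross_entropy takes raw logits and applies log-softmax internally.
loss_ce = F.cross_entropy(logits, targets)
# nll_loss expects log-probabilities, so log-softmax is applied by hand.
loss_nll = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

print(torch.allclose(loss_ce, loss_nll))  # True
```

This is why the model's `forward` should return logits: applying softmax in the model and then `F.cross_entropy` would normalize twice.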


In [8]:
class FFNN(nn.Module):
    """
    Feedforward Neural Networks for NER
    """

    def __init__(
        self, words_vocab: Vocab, tags_vocab: Vocab, window_size: int, d_emb: int, d_hidden: int
    ) -> None:
        """
        Initialize a two-layer feedforward neural network with sigmoid activation.
        Parameters:
            `words_vocab`: vocabulary of words
            `tags_vocab`: vocabulary of tags
            `window_size`: size of the context window (w in Problem 3 of Assignment #2)
            `d_emb`: dimension of word embeddings (D in Problem 3 of Assignment #2)
            `d_hidden`: dimension of the hidden layer (H in Problem 3 of Assignment #2)
        """
        super().__init__()
        # TODO: Create the word embeddings (nn.Embedding), the hidden layer (nn.Linear), and the output layer (nn.Linear).
        self.words_vocab = words_vocab
        self.tags_vocab = tags_vocab
        self.window_size = window_size
        self.d_emb = d_emb
        self.d_hidden = d_hidden

        self.embedding = nn.Embedding(len(words_vocab), d_emb)
        self.hidden_layer = nn.Linear((2 * window_size + 1) * d_emb, d_hidden)
        self.output_layer = nn.Linear(d_hidden, len(tags_vocab))

    def forward(self, context_idxs: torch.Tensor) -> torch.Tensor:
        r"""
        Given the word indexes in a context window, predict the logits of the NER tag.
        Parameters:
            `context_idxs`: a batch_size x (2 * window_size + 1) tensor
                          context_idxs[i] contains word indexes in the window of the i'th data example.
        Return values:
            `logits`: a batch_size x 5 tensor (\hat{y}^{(t)} in Problem 3 of Assignment #2, without softmax)
                    logits[i][j] is the output score (before softmax) of the i'th example for tag j.
        """
        # TODO: Implement the forward pass of the two-layer FFNN with sigmoid hidden layer.
        #       Do not apply softmax, since we will use F.cross_entropy as the loss function.
        
                                                    # context_idxs: batch_size x (2 * window_size + 1) tensor
        context_emb = self.embedding(context_idxs)  # batch_size x (2 * window_size + 1) x d_emb tensor
        batch_size, _, _ = context_emb.size()
        context_flat = context_emb.view(batch_size, -1)
        hidden_output = torch.sigmoid(self.hidden_layer(context_flat))
        logits = self.output_layer(hidden_output)   # batch_size x 5 tensor
        
        return logits

Optionally, let's do a simple sanity check of your implementation. In `check_ffnn`, we load a batch of data examples and pass it through the FFNN.


In [9]:
# Some helper code
def get_device() -> torch.device:
    """
    Use GPU when it is available; use CPU otherwise.
    """
    return torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

print(get_device())

cpu


In [10]:
def check_ffnn() -> None:
    # Hyperparameters
    batch_size = 3
    d_emb = 64
    d_hidden = 128
    window_size = 3
    # Create the dataloaders and the model
    loader_train, _, ds_train = create_fixed_window_dataloaders(batch_size, window_size)
    model = FFNN(ds_train.words_vocab, ds_train.tags_vocab, window_size, d_emb, d_hidden)
    device = get_device()
    model.to(device)
    print(model)
    # Get the first batch
    data_batch = next(iter(loader_train))
    # Move the data to the same device as the model (GPU if available)
    context_idxs = data_batch["context_idxs"].to(device, non_blocking=True)
    tag_idx = data_batch["tag_idx"].to(device, non_blocking=True)
    # Run the forward pass
    print("Input tensor shape:", context_idxs.size())
    logits = model(context_idxs)
    print("Output tensor shape:", logits.size())

check_ffnn()

Loading data from eng.train
203621 examples loaded.
Loading data from eng.val
49086 examples loaded.
FFNN(
  (words_vocab): Vocab()
  (tags_vocab): Vocab()
  (embedding): Embedding(20103, 64)
  (hidden_layer): Linear(in_features=448, out_features=128, bias=True)
  (output_layer): Linear(in_features=128, out_features=5, bias=True)
)
Input tensor shape: torch.Size([3, 7])
Output tensor shape: torch.Size([3, 5])
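
The layer sizes in the printout follow directly from the hyperparameters. As a quick arithmetic check (a sketch; the helper `ffnn_sizes` is hypothetical and not part of the assignment), the hidden layer's `in_features` is the number of words in the window times the embedding dimension:

```python
from typing import Tuple


def ffnn_sizes(
    vocab_size: int, num_tags: int, window_size: int, d_emb: int, d_hidden: int
) -> Tuple[int, int]:
    """Return (input width of the hidden layer, total number of parameters)."""
    in_features = (2 * window_size + 1) * d_emb       # flattened window of embeddings
    n_embedding = vocab_size * d_emb                  # embedding table
    n_hidden = in_features * d_hidden + d_hidden      # weights + biases
    n_output = d_hidden * num_tags + num_tags         # weights + biases
    return in_features, n_embedding + n_hidden + n_output


in_features, total = ffnn_sizes(
    vocab_size=20103, num_tags=5, window_size=3, d_emb=64, d_hidden=128
)
print(in_features)  # 448, matching in_features in the printout above
print(total)
```

Note that the embedding table accounts for the large majority of the parameters.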


In [11]:
ds_train = FixedWindowDataset("eng.train", 3)

Loading data from eng.train
203621 examples loaded.


### Training and Validation **(4 points)**

Having implemented the model, the next step is to implement functions for training and validation.


In [12]:
def eval_metrics(ground_truth: List[int], predictions: List[int]) -> Dict[str, Any]:
    """
    Calculate various evaluation metrics such as accuracy and F1 score
    Parameters:
        `ground_truth`: the list of ground truth NER tags
        `predictions`: the list of predicted NER tags
    """
    f1_scores = f1_score(ground_truth, predictions, average=None)
    return {
        "accuracy": accuracy_score(ground_truth, predictions),
        "average f1": np.mean(f1_scores),
        "f1": f1_scores,
        "confusion matrix": confusion_matrix(ground_truth, predictions),
    }


def train_ffnn(
    model: nn.Module,
    loader: DataLoader,
    optimizer: optim.Optimizer,
    device: torch.device,
    silent: bool = False,  # whether to print the training loss
) -> Tuple[float, Dict[str, Any]]:
    """
    Train the FFNN model.
    Return values:
        1. the average training loss
        2. training metrics such as accuracy and F1 score
    """
    model.train()
    ground_truth = []
    predictions = []
    losses = []
    report_interval = 100

    for i, data_batch in enumerate(loader):
        context_idxs = data_batch["context_idxs"].to(device, non_blocking=True)
        tag_idx = data_batch["tag_idx"].to(device, non_blocking=True)

        # TODO:
        # 1. Perform the forward pass to calculate the model's output. Save it to the variable "logits".
        # 2. Calculate the loss using the output and the ground truth tags. Save it to the variable "loss".
        # 3. Perform the backward pass to calculate the gradient.
        # 4. Use the optimizer to update model parameters.
        # Caveat: You may need to call optimizer.zero_grad(). Figure out what it does!
        
        logits = model(context_idxs)
        loss = F.cross_entropy(logits, tag_idx)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.item())
        ground_truth.extend(tag_idx.tolist())
        predictions.extend(logits.argmax(dim=-1).tolist())

        if not silent and i > 0 and i % report_interval == 0:
            print(
                "\t[%06d/%06d] Loss: %f"
                % (i, len(loader), np.mean(losses[-report_interval:]))
            )

    return np.mean(losses), eval_metrics(ground_truth, predictions)


def validate_ffnn(
    model: nn.Module, loader: DataLoader, device: torch.device
) -> Tuple[float, Dict[str, Any]]:
    """
    Validate the FFNN model.
    Return values:
        1. the average validation loss
        2. validation metrics such as accuracy and F1 score
    """
    model.eval()
    ground_truth = []
    predictions = []
    losses = []

    with torch.no_grad():

        for data_batch in loader:
            context_idxs = data_batch["context_idxs"].to(device, non_blocking=True)
            tag_idx = data_batch["tag_idx"].to(device, non_blocking=True)

            # TODO: Similar to what you did in train_ffnn, but only step 1 and 2.

            logits = model(context_idxs)
            loss = F.cross_entropy(logits, tag_idx)

            losses.append(loss.item())
            ground_truth.extend(tag_idx.tolist())
            predictions.extend(logits.argmax(dim=-1).tolist())

    return np.mean(losses), eval_metrics(ground_truth, predictions)


def train_val_loop_ffnn(hyperparams: Dict[str, Any]) -> None:
    """
    Train and validate the FFNN model for a number of epochs.
    """
    print("Hyperparameters:", hyperparams)
    # Create the dataloaders
    loader_train, loader_val, ds_train = create_fixed_window_dataloaders(
        hyperparams["batch_size"], hyperparams["window_size"]
    )
    # Create the model
    model = FFNN(
        ds_train.words_vocab,
        ds_train.tags_vocab,
        hyperparams["window_size"],
        hyperparams["d_emb"],
        hyperparams["d_hidden"],
    )
    device = get_device()
    model.to(device)
    print(model)
    # Create the optimizer
    optimizer = optim.RMSprop(
        model.parameters(), hyperparams["learning_rate"], weight_decay=hyperparams["l2"]
    )

    # Train and validate
    for i in range(hyperparams["num_epochs"]):
        print("*" * 80 )
        print(f"Epoch #{i+1}")

        print("Training..")
        loss_train, metrics_train = train_ffnn(
            model, loader_train, optimizer, device, silent=True
        )
        print("Training loss: ", loss_train)
        print("Training metrics:")
        for k, v in metrics_train.items():
            print("\t", k, ": ", v)

        print("Validating..")
        loss_val, metrics_val = validate_ffnn(model, loader_val, device)
        print("Validation loss: ", loss_val)
        print("Validation metrics:")
        for k, v in metrics_val.items():
            print("\t", k, ": ", v)

    print("************ Training Done! ************")
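
The caveat about `optimizer.zero_grad()` in `train_ffnn` can be seen in isolation. PyTorch accumulates gradients across calls to `backward()`, so without a reset each step would use the sum of all gradients seen so far. The sketch below uses a single made-up parameter `w` and resets its gradient by hand, which is conceptually what the optimizer does for every model parameter.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

# Two backward passes without a reset: the gradients (2 each) add up.
(2 * w).sum().backward()
(2 * w).sum().backward()
accumulated = w.grad.clone()

# Reset the gradient, then run one backward pass: only the fresh gradient remains.
w.grad.zero_()
(2 * w).sum().backward()
fresh = w.grad.clone()

print(accumulated, fresh)  # tensor([4.]) tensor([2.])
```

Calling `optimizer.zero_grad()` before `loss.backward()`, as in `train_ffnn`, ensures that each update uses only the current batch's gradient.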

We are ready to run experiments! Let's train the model for 5 epochs, with `window_size=2`. After each epoch, we perform validation and print the evaluation metrics.


In [13]:
train_val_loop_ffnn(
    {
        "batch_size": 512,
        "d_emb": 64,
        "d_hidden": 128,
        "window_size": 2,
        "num_epochs": 5,
        "learning_rate": 0.01,
        "l2": 1e-6,
    }
)

Hyperparameters: {'batch_size': 512, 'd_emb': 64, 'd_hidden': 128, 'window_size': 2, 'num_epochs': 5, 'learning_rate': 0.01, 'l2': 1e-06}
Loading data from eng.train
203621 examples loaded.
Loading data from eng.val
49086 examples loaded.
FFNN(
  (words_vocab): Vocab()
  (tags_vocab): Vocab()
  (embedding): Embedding(20103, 64)
  (hidden_layer): Linear(in_features=320, out_features=128, bias=True)
  (output_layer): Linear(in_features=128, out_features=5, bias=True)
)
********************************************************************************
Epoch #1
Training..
Training loss:  0.18752444131060572
Training metrics:
	 accuracy :  0.9448648063602015
	 average f1 :  0.8018547587051851
	 f1 :  [0.74054702 0.97656275 0.66527402 0.82443958 0.80245042]
	 confusion matrix :  [[  6796   2018    231    467    499]
 [   623 167356    391    644    263]
 [   256   1209   2707    153    258]
 [   311   1699     77   8845    175]
 [   357   1186    149    241   6353]]
Validating..
Validation los

Please re-run with `window_size=1`. How does the final performance change?


In [14]:
train_val_loop_ffnn(
    {
        "batch_size": 512,
        "d_emb": 64,
        "d_hidden": 128,
        "window_size": 1,
        "num_epochs": 5,
        "learning_rate": 0.01,
        "l2": 1e-6,
    }
)

Hyperparameters: {'batch_size': 512, 'd_emb': 64, 'd_hidden': 128, 'window_size': 1, 'num_epochs': 5, 'learning_rate': 0.01, 'l2': 1e-06}
Loading data from eng.train
203621 examples loaded.
Loading data from eng.val
49086 examples loaded.
FFNN(
  (words_vocab): Vocab()
  (tags_vocab): Vocab()
  (embedding): Embedding(20103, 64)
  (hidden_layer): Linear(in_features=192, out_features=128, bias=True)
  (output_layer): Linear(in_features=128, out_features=5, bias=True)
)
********************************************************************************
Epoch #1
Training..
Training loss:  0.18278633777793468
Training metrics:
	 accuracy :  0.9454354927581864
	 average f1 :  0.8045951413786071
	 f1 :  [0.74138211 0.97710196 0.68659716 0.82891048 0.788984  ]
	 confusion matrix :  [[  6850   1902    247    451    562]
 [   550 167359    253    576    547]
 [   351   1108   2733    176    214]
 [   323   1718     34   8871    156]
 [   393   1190    112    228   6360]]
Validating..
Validation loss:  0.14408108143349332
Validation metrics:
	 accuracy :  0.9553640549240109
	 average f1 :  0.8398596878413714
	 f1 :  [0.76190476 0.98072959 0.76614216 0.84455225 0

### Question **(1 point)**

If everything works as expected, you should see the loss decrease and the accuracy increase for both training and validation. The final accuracy can be pretty high; you should probably debug if it's below 92%. However, **is accuracy a good metric for this problem? Why?** Hint: look at the F1 scores for different tags and the confusion matrix.

**TODO: Please fill in your answer here**


Accuracy alone is not a good metric for this problem. Here's why:

1. **Imbalanced classes**: The `O` tag (index 1) dominates the data. In the epoch-1 training confusion matrix for `window_size=2`, 169,277 of the 203,264 examples (about 83%) are `O`, so a model that always predicts `O` would already reach roughly 83% accuracy without finding a single entity.

2. **F1 score**: The F1 score, the harmonic mean of precision and recall, accounts for both false positives and false negatives on each tag. In the same epoch, the F1 score is about 0.98 for `O` but only about 0.67 for tag index 2 (`MISC`) and 0.74 for tag index 0 (`ORG`), a gap that the 94.5% accuracy hides.

3. **Confusion matrix**: The confusion matrix shows which errors the model makes. For every entity tag, the largest off-diagonal entry is in the `O` column; for example, 2,018 of the 10,011 `ORG` examples are predicted as `O`. Accuracy alone does not reveal this.

In summary, accuracy gives a rough idea of performance but is inflated by the `O` class; the per-tag F1 scores and the confusion matrix are more informative for NER.
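
The numbers behind this argument can be recomputed from the epoch-1 training confusion matrix printed for `window_size=2`. The sketch below assumes the scikit-learn convention that rows are true tags and columns are predicted tags; the "always predict the most frequent tag" baseline is an illustration and was not actually trained.

```python
import numpy as np

# Epoch-1 training confusion matrix for window_size=2 (rows: truth, columns: prediction).
cm = np.array(
    [
        [6796, 2018, 231, 467, 499],
        [623, 167356, 391, 644, 263],
        [256, 1209, 2707, 153, 258],
        [311, 1699, 77, 8845, 175],
        [357, 1186, 149, 241, 6353],
    ]
)

total = cm.sum()
accuracy = np.trace(cm) / total
# Accuracy of a model that always predicts the most frequent true tag.
majority_baseline = cm.sum(axis=1).max() / total

precision = np.diag(cm) / cm.sum(axis=0)
recall = np.diag(cm) / cm.sum(axis=1)
f1 = 2 * precision * recall / (precision + recall)

print("accuracy:", accuracy)
print("majority baseline:", majority_baseline)
print("per-tag F1:", f1)
print("average F1:", f1.mean())
```

The recomputed accuracy and F1 scores should agree with the values reported in the training log, while the baseline shows how much of the accuracy comes from the dominant tag alone.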
