## PyTorch Exercise: Augmenting the LSTM part-of-speech tagger with character-level features
This notebook implements a POS tagger using PyTorch. The model uses both word embeddings and character-level embeddings to predict the POS tag of each word in a sentence. The notebook is built following the instructions from the PyTorch tutorial: "Sequence Models and Long Short-Term Memory Networks" (https://docs.pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html).

The text for training and validation is the first 4 chapters of "Moby Dick" by Herman Melville. 

LEARNING OUTCOMES:
- Understand how to implement sequence labeling (tagging) models with LSTMs in PyTorch.
- Use cases of embeddings and Bidirectional LSTMs in NLP tasks.
- Regularization and dropout techniques to prevent overfitting in deep learning models.
- Batching and padding techniques for variable-length sequences in NLP.
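
As a rough illustration of the padding and packing idea mentioned in the last outcome, here is a minimal sketch with made-up token ids (the values and the padding id `0` are assumptions for the example, not taken from the notebook's data). It pads three sequences of different lengths into one tensor and then packs them so that a recurrent layer would only visit the real time steps.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Three toy sequences of different lengths (arbitrary token ids).
seqs = [torch.tensor([1, 2, 3, 4]), torch.tensor([5, 6]), torch.tensor([7, 8, 9])]
lengths = [len(s) for s in seqs]

# Pad to a rectangular (batch, max_len) tensor; 0 is the padding id in this sketch.
padded = pad_sequence(seqs, batch_first=True, padding_value=0)

# Add a feature dimension and pack, so only the 4 + 2 + 3 real steps are kept.
packed = pack_padded_sequence(
    padded.unsqueeze(-1).float(), lengths, batch_first=True, enforce_sorted=False
)

print(padded)             # shorter rows are filled with 0 on the right
print(packed.data.shape)  # first dimension is the number of real time steps
```

The notebook applies the same two steps at the character level: each word is a sequence of character indices, and the words of one sentence form the batch.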

BEST RESULTS:
- Training Accuracy: 0.95
- Validation Accuracy: 0.87

In [None]:
# Import required libraries
import spacy
import numpy as np
import re

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

import torch
from torch import optim, nn
import torch.nn.functional as F

print("Cuda is: ", torch.cuda.is_available())

Cuda is:  True


In [None]:
# Load text and process with spaCy to extract sentences
from google.colab import drive
drive.mount('/content/drive')

file_loc = '/content/drive/MyDrive/Colab Notebooks/moby_dick_four_chapters.txt'

with open(file_loc, 'r', encoding='utf-8') as f:
    whole_text = f.read()

nlp_en = spacy.load('en_core_web_sm')
doc = nlp_en(whole_text)
all_sents = list(doc.sents)
len(all_sents)  # Total number of sentences

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


453

In [None]:
# Define relevant POS tags and set every other tag to X
MAJOR_TAGS = ['NOUN', 'VERB', 'ADJ', 'ADV', 'PRON', 'DET', 'ADP', 'AUX', 'PROPN', 'NUM', 'X']

def modify_word(word):
    return word.lower().strip()

def filter_pos_tag(pos_tag):
    return pos_tag if pos_tag in MAJOR_TAGS else 'X'

# Extract (words, tags) pairs from sentences, removing punctuation
all_data = [
    (
        [modify_word(token.text) for token in sent if not token.is_punct and not token.is_space],
        [filter_pos_tag(token.pos_) for token in sent if not token.is_punct and not token.is_space]
    )
    for sent in all_sents
]

# Build vocabularies: word → index, character → index, POS tag → index
word_to_ix = {}
for sent, tags in all_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

chars = ''.join(list(word_to_ix.keys()))
chars = list(set(chars))
char_to_idx = {ch: idx for idx, ch in enumerate(chars)}

pos_to_ix = {pos: idx for idx, pos in enumerate(MAJOR_TAGS)}
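
To make the vocabulary construction concrete, the following sketch runs the same idea on two made-up sentences. The `toy_` names are hypothetical and exist only for this example. It also sorts the character set, which is one possible way to keep the character indices stable between runs; the cell above does not do this.

```python
# Toy corpus in the same (words, tags) layout as all_data; the sentences are made up.
toy_data = [
    (["call", "me", "ishmael"], ["VERB", "PRON", "PROPN"]),
    (["call", "the", "whale"], ["VERB", "DET", "NOUN"]),
]

toy_word_to_ix = {}
for words, _tags in toy_data:
    for word in words:
        # setdefault assigns the next free index only the first time a word is seen.
        toy_word_to_ix.setdefault(word, len(toy_word_to_ix))

# Sorting the character set makes the mapping reproducible across runs
# (the iteration order of a set of strings can differ between Python processes).
toy_chars = sorted(set("".join(toy_word_to_ix)))
toy_char_to_idx = {ch: idx for idx, ch in enumerate(toy_chars)}

print(toy_word_to_ix)
print(toy_char_to_idx)
```

Note that a word or character that never appears in the corpus has no index in these dictionaries, so looking it up raises a `KeyError`.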

In [None]:
# BiLSTM POS Tagger with Character-level Features
# Architecture: Word Embeddings + Character LSTM → BiLSTM → POS Tags
class POSTagger(nn.Module):
    def __init__(self, word_embedding_dim, char_embedding_dim, char_hidden_dim, word_hidden_dim, vocab_size, num_chars, output_dim, dropout_rate=0.3):
        super(POSTagger, self).__init__()
        self.word_embedding = nn.Embedding(vocab_size, word_embedding_dim)
        self.char_embedding = nn.Embedding(num_chars, char_embedding_dim)
        
        self.embedding_dropout = nn.Dropout(dropout_rate)

        # Character LSTM: extracts character-level features for each word
        self.char_lstm = nn.LSTM(char_embedding_dim, char_hidden_dim, batch_first=True)
        
        # Bidirectional LSTM: processes word + character features with context from both directions
        self.lstm = nn.LSTM(word_embedding_dim + char_hidden_dim, word_hidden_dim, batch_first=True, bidirectional=True)
        
        self.lstm_dropout = nn.Dropout(dropout_rate)
        # Output layer: bidirectional LSTM outputs *2 dimensions (forward + backward)
        self.fc = nn.Linear(word_hidden_dim * 2, output_dim)
        
    def forward(self, words, word_lengths, padded_chars):
        words_embedded = self.embedding_dropout(self.word_embedding(words))
        
        # Get character embeddings and pack sequences for efficiency
        # Packing skips padding tokens, preventing the LSTM from learning false patterns
        char_embedded = self.embedding_dropout(self.char_embedding(padded_chars))
    
        # Get lengths for packing
        packed_chars = nn.utils.rnn.pack_padded_sequence(
            char_embedded, word_lengths, batch_first=True, enforce_sorted=False
        )
    
        # h_n holds the hidden state after the last real character of each word.
        # (Indexing the re-padded output at position -1 would return zeros for
        # every word shorter than the longest word in the sentence.)
        _, (h_n, _) = self.char_lstm(packed_chars)
        word_char_embedded = h_n[-1]  # (num_words, char_hidden_dim)
        
        combined = torch.cat((words_embedded.unsqueeze(0), 
                              word_char_embedded.unsqueeze(0)), dim=2)
        lstm_out, _ = self.lstm(combined)
        lstm_out = self.lstm_dropout(lstm_out)
        
        tag_space = self.fc(lstm_out.squeeze(0)) # (num_words, output_dim)
        output = F.log_softmax(tag_space, dim=1)
        return output
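
When character sequences are packed, the state after each word's final character is returned in `h_n`, whereas the last column of the re-padded output is only meaningful for the longest word. The sketch below checks this on two toy "words" with random inputs; the sizes and the seed are arbitrary assumptions for illustration.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=3, hidden_size=2, batch_first=True)

# Two "words": one with 4 characters, one with 2 (zero-padded to length 4).
x = torch.randn(2, 4, 3)
x[1, 2:] = 0.0
lengths = [4, 2]

packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, (h_n, _) = lstm(packed)
out, _ = pad_packed_sequence(packed_out, batch_first=True)

# For the short word, the last column of the re-padded output is all zeros ...
print(out[1, -1])
# ... while h_n[-1] matches the output at its last real character.
print(h_n[-1][1], out[1, lengths[1] - 1])
```

With `enforce_sorted=False`, PyTorch returns `h_n` in the original batch order, so row `i` of `h_n[-1]` corresponds to word `i`.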

In [None]:
# Hyperparameters
WORD_EMBEDDING_DIM = 12
CHAR_EMBEDDING_DIM = 6
CHAR_HIDDEN_DIM = 6
HIDDEN_DIM = 16
VOCAB_SIZE = len(word_to_ix)
print("Vocab size: ", VOCAB_SIZE)
NUM_CHARS = len(char_to_idx)
print("Num chars: ", NUM_CHARS)
OUTPUT_DIM = len(MAJOR_TAGS)
DROPOUT_RATE = 0.35

Vocab size:  2718
Num chars:  38
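
With these hyperparameters, the size of the model can be worked out by hand. The sketch below assumes the architecture defined earlier and one extra character-embedding row for padding (as in the model instantiation later in the notebook); `lstm_params` is a hypothetical helper written for this example. The result agrees with the parameter count the notebook prints after creating the model.

```python
def lstm_params(input_dim, hidden_dim, bidirectional=False):
    # Each of the 4 gates has an input weight matrix, a recurrent weight matrix,
    # and two bias vectors (PyTorch stores b_ih and b_hh separately).
    per_direction = 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    return per_direction * (2 if bidirectional else 1)

vocab_size, num_chars, num_tags = 2718, 38, 11
word_dim, char_dim, char_hidden, word_hidden = 12, 6, 6, 16

total = (
    vocab_size * word_dim                      # word embedding table
    + (num_chars + 1) * char_dim               # char embedding table, +1 row for padding
    + lstm_params(char_dim, char_hidden)       # character LSTM
    + lstm_params(word_dim + char_hidden, word_hidden, bidirectional=True)  # BiLSTM
    + (2 * word_hidden) * num_tags + num_tags  # linear layer: weights + bias
)
print(total)  # 38157
```

Most of the parameters (32616 of 38157) sit in the word embedding table, which is worth keeping in mind given how small the training text is.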


In [None]:
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


# Vectorize data: convert words/chars to indices and pad sequences
all_data_vec = [
    (
        prepare_sequence(seq, word_to_ix),
        [len(word) for word in seq],  # Character counts for packing
        nn.utils.rnn.pad_sequence([prepare_sequence(word, char_to_idx) for word in seq], 
                                  batch_first=True,
                                  padding_value=NUM_CHARS),  # Pad with a dedicated index, one past the last real character
        prepare_sequence(tags, pos_to_ix) 
    )
    for seq, tags in all_data
]
print(all_data[0], all_data_vec[0])

train_selection = np.random.choice(len(all_data_vec), size=int(0.8*len(all_data_vec)), replace=False).tolist()
val_selection = [i for i in range(len(all_data_vec)) if i not in train_selection]
training_data = [all_data_vec[i] for i in train_selection]
validation_data = [all_data_vec[i] for i in val_selection]

(['call', 'me', 'ishmael'], ['VERB', 'PRON', 'PROPN']) (tensor([0, 1, 2]), [4, 2, 7], tensor([[24, 16, 33, 33, 38, 38, 38],
        [28, 30, 38, 38, 38, 38, 38],
        [ 8, 34, 37, 28, 16, 30, 33]]), tensor([1, 4, 8]))
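
The split above changes on every run because no random seed is set. If a repeatable split is wanted, one option is sketched below. It uses the standard-library `random` module instead of `np.random.choice`, and the function name `split_indices` and the seed value are assumptions made for this example.

```python
import random

def split_indices(n_items, train_fraction=0.8, seed=0):
    """Return disjoint (train, validation) index lists for n_items examples."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG, so global random state is untouched
    cut = int(train_fraction * n_items)
    return indices[:cut], indices[cut:]

train_idx, val_idx = split_indices(453)   # 453 sentences, as reported earlier
print(len(train_idx), len(val_idx))       # 362 91
```

Slicing one shuffled list also guarantees that the two sets are disjoint without a membership test per index.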


In [None]:
# Initialize model, optimizer, and loss function
model = POSTagger(WORD_EMBEDDING_DIM, CHAR_EMBEDDING_DIM, CHAR_HIDDEN_DIM, HIDDEN_DIM, 
                  VOCAB_SIZE, 
                  NUM_CHARS + 1, # Add 1 for padding index
                  OUTPUT_DIM,
                  dropout_rate=DROPOUT_RATE)
optimizer = optim.Adam(model.parameters(), lr=0.005, weight_decay=1e-5)  # L2 regularization via weight_decay
loss_function = nn.NLLLoss()

print("# params in the model: ", sum(p.numel() for p in model.parameters()))

# params in the model:  38157
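
The model returns log-probabilities (`F.log_softmax`), which is why it is paired with `nn.NLLLoss`. As a small check on random toy scores (not the notebook's data), the sketch below shows that this combination gives the same value as `nn.CrossEntropyLoss` applied to raw scores.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 11)            # 5 words, 11 tags (raw, unnormalized scores)
targets = torch.randint(0, 11, (5,))   # made-up gold tag indices

nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
ce = nn.CrossEntropyLoss()(logits, targets)

print(torch.allclose(nll, ce))  # True
```

Either pairing works; what matters is not to apply `log_softmax` and then `CrossEntropyLoss`, which would normalize twice.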


In [None]:
# Evaluation function
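def calculate_accuracy(model, validation_data, print_report=True):
    model.eval()  # Disable dropout so predictions are deterministic
    preds = []
    targets = []
    with torch.no_grad():  # Gradients are not needed for evaluation
        for words, word_lengths, padded_chars, y in validation_data:
            output = model(words, word_lengths, padded_chars)
            output = output.argmax(dim=1)
            preds.append(output)
            targets.append(y)

    preds = torch.cat(preds, dim=0).detach().cpu().numpy()
    targets = torch.cat(targets, dim=0).detach().cpu().numpy()
    
    if print_report:
        # Pass labels explicitly so the report works even if a tag is absent;
        # zero_division=0 silences warnings for tags that are never predicted
        print(classification_report(targets, preds, labels=list(range(len(MAJOR_TAGS))),
                                    target_names=MAJOR_TAGS, zero_division=0))
    return accuracy_score(targets, preds)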

# Baseline: model accuracy before training
calculate_accuracy(model, validation_data)

              precision    recall  f1-score   support

        NOUN       0.00      0.00      0.00       477
        VERB       0.12      0.39      0.18       300
         ADJ       0.06      0.03      0.04       210
         ADV       0.02      0.01      0.01       171
        PRON       0.09      0.00      0.01       292
         DET       0.00      0.00      0.00       299
         ADP       0.04      0.01      0.01       309
         AUX       0.03      0.06      0.04       127
       PROPN       0.01      0.09      0.02        53
         NUM       0.00      0.00      0.00        11
           X       0.10      0.26      0.15       233

    accuracy                           0.08      2482
   macro avg       0.04      0.08      0.04      2482
weighted avg       0.05      0.08      0.04      2482





0.08138597904915391

In [None]:
# Early stopping: prevents overfitting by stopping when validation accuracy plateaus
class EarlyStopping:
    def __init__(self, patience=5, min_delta=0.001):
        self.patience = patience  # Wait this many checks before stopping
        self.min_delta = min_delta  # Minimum improvement threshold
        self.counter = 0
        self.best_score = None
        self.early_stop = False
        
    def __call__(self, val_acc):
        if self.best_score is None:
            self.best_score = val_acc
        elif val_acc > self.best_score + self.min_delta:
            self.best_score = val_acc
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True

early_stopping = EarlyStopping(patience=5, min_delta=0.002)
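
To see how `patience` and `min_delta` interact, the sketch below feeds a made-up series of validation accuracies through the same logic. The class is repeated so the snippet runs on its own; the accuracy values and the settings `patience=3`, `min_delta=0.002` are chosen for illustration and are not results from this notebook.

```python
class EarlyStopping:
    # Same logic as the class above, repeated so this snippet is self-contained.
    def __init__(self, patience=5, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_score = None
        self.early_stop = False

    def __call__(self, val_acc):
        if self.best_score is None:
            self.best_score = val_acc
        elif val_acc > self.best_score + self.min_delta:
            self.best_score = val_acc
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True

stopper = EarlyStopping(patience=3, min_delta=0.002)
made_up_val_acc = [0.60, 0.70, 0.75, 0.751, 0.75, 0.749, 0.752, 0.75]

stopped_at = None
for epoch, acc in enumerate(made_up_val_acc, start=1):
    stopper(acc)
    if stopper.early_stop:
        stopped_at = epoch
        break

print(stopped_at, stopper.best_score)  # 6 0.75
```

The gain from 0.75 to 0.751 is smaller than `min_delta`, so it counts as "no improvement" and the counter keeps growing until it reaches the patience limit at epoch 6.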

In [None]:
# Training loop with validation and early stopping
NUM_EPOCHS = 100
losses = []
for epoch in range(NUM_EPOCHS):
    # Training phase
    model.train()  # Enable dropout for regularization
    total_loss = 0
    for words, word_lengths, padded_chars, tags in training_data:
        model.zero_grad()
        tag_scores = model(words, word_lengths, padded_chars)
        loss = loss_function(tag_scores, tags.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    losses.append(total_loss)
    
    # Validation phase
    model.eval()  # Disable dropout for evaluation
    with torch.no_grad():
        train_acc = calculate_accuracy(model, training_data, print_report=False)
        val_acc = calculate_accuracy(model, validation_data, print_report=False)
    
    print(f"Epoch {epoch+1}/{NUM_EPOCHS}, Loss: {total_loss:.4f}")
    print(f"Training accuracy: {train_acc:.4f}, Validation accuracy: {val_acc:.4f}")
    
    # Early stopping check
    early_stopping(val_acc)
    if early_stopping.early_stop:
        print(f"\nEarly stopping triggered at epoch {epoch+1}")
        break

Epoch 1/100, Loss: 677.5242
Training accuracy: 0.6127, Validation accuracy: 0.5967
Epoch 2/100, Loss: 465.9638
Training accuracy: 0.7386, Validation accuracy: 0.7172
Epoch 3/100, Loss: 371.2629
Training accuracy: 0.8152, Validation accuracy: 0.7736
Epoch 4/100, Loss: 306.9255
Training accuracy: 0.8637, Validation accuracy: 0.8155
Epoch 5/100, Loss: 259.1399
Training accuracy: 0.8935, Validation accuracy: 0.8247
Epoch 6/100, Loss: 220.4558
Training accuracy: 0.9146, Validation accuracy: 0.8509
Epoch 7/100, Loss: 190.0280
Training accuracy: 0.9333, Validation accuracy: 0.8634
Epoch 8/100, Loss: 165.3925
Training accuracy: 0.9411, Validation accuracy: 0.8618
Epoch 9/100, Loss: 153.5479
Training accuracy: 0.9466, Validation accuracy: 0.8662
Epoch 10/100, Loss: 142.6477
Training accuracy: 0.9552, Validation accuracy: 0.8747
Epoch 11/100, Loss: 133.7159
Training accuracy: 0.9552, Validation accuracy: 0.8687
Epoch 12/100, Loss: 120.4329
Training accuracy: 0.9617, Validation accuracy: 0.8703
E

In [None]:
# Final evaluation on validation set
calculate_accuracy(model, validation_data)

              precision    recall  f1-score   support

        NOUN       0.77      0.89      0.82       409
        VERB       0.85      0.77      0.81       318
         ADJ       0.67      0.74      0.70       190
         ADV       0.81      0.70      0.75       156
        PRON       0.97      0.96      0.96       308
         DET       0.97      0.98      0.98       242
         ADP       0.96      0.98      0.97       300
         AUX       0.91      0.96      0.93       125
       PROPN       0.71      0.39      0.50        62
         NUM       1.00      0.85      0.92        20
           X       0.95      0.91      0.93       240

    accuracy                           0.87      2370
   macro avg       0.87      0.83      0.84      2370
weighted avg       0.87      0.87      0.87      2370



0.8704641350210971
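
The report shows that PROPN has the lowest recall (0.39), but not which tags those tokens were assigned instead. A confusion matrix, already imported at the top of the notebook, is one way to look into that. The sketch below uses made-up gold and predicted tags purely to show the call and how to read the result; it says nothing about the actual errors of this model.

```python
from sklearn.metrics import confusion_matrix

labels = ["NOUN", "PROPN", "ADJ"]
# Made-up gold and predicted tags, for illustration only.
gold = ["NOUN", "NOUN", "PROPN", "PROPN", "PROPN", "ADJ", "ADJ"]
pred = ["NOUN", "NOUN", "NOUN", "NOUN", "PROPN", "ADJ", "NOUN"]

cm = confusion_matrix(gold, pred, labels=labels)

# Rows are gold tags, columns are predicted tags (both in the order of `labels`).
for tag, row in zip(labels, cm):
    print(f"{tag:>6}", row.tolist())
```

In this toy case the PROPN row reads `[2, 1, 0]`: two of the three proper nouns were predicted as NOUN. Applied to the notebook, the same call could take the `targets` and `preds` arrays built inside `calculate_accuracy`.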

#### Further things to do

- Train it further to improve accuracy.
- Experiment with different model architectures.