**Group information**

| Family name | First name | Email address |
| ----------- | ---------- | ------------- |
|             |            |               |
|             |            |               |
|             |            |               |

# Language - Practice

This tutorial examines the use of deep language models to predict the political affiliation of tweet authors, either Republican or Democrat. Using a labelled [dataset](https://www.kaggle.com/datasets/kapastor/democratvsrepublicantweets?resource=download) of approximately 75,000 tweets of varying lengths, the goal is to learn a function that maps textual input to the probability that a given tweet is authored by a Democrat. The modelling approach should explicitly capture both the semantic and syntactic structure of the language.

**Large language models** can be powerful learning tools, but make sure to continue asking questions until you fully understand the produced answer and can judge its correctness. Simply copy-pasting generated code does not contribute to learning and is a waste of your class time.

In [None]:
# Install dependencies
%pip install torchtext -q



In [5]:
# Packages
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import shutil
import torch
import torchinfo
import torchtext
import tqdm
import umap

from sklearn import metrics, model_selection
from torch import nn, optim, utils
from urllib import request

# Device
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
device = torch.device(device)


In [None]:
# Utilities
def download_data():
    if os.getcwd().endswith('/data'):
        print('Data folder already exists')
    else:
        request.urlretrieve('https://www.dropbox.com/scl/fo/z31b6hon9625a02n19whg/AGGw2KPRyRk-qSDRVeQbutc?rlkey=t2ea3a3snykgu2y35l99rakbu&dl=1', 'data.zip')
        shutil.unpack_archive('data.zip', 'data')
        os.remove('data.zip')
        os.chdir('data')

In [None]:
# Execute on first run
download_data()

**1. Descriptive statistics**

Load the `dataset.csv` file, display a sample of tweets along with their corresponding party labels. Compute the frequency distribution of the labels. Using `torchtext.data.utils.get_tokenizer`, tokenise the tweets. Identify the most common hashtags for each party.
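
As a rough illustration of the counting steps above, the sketch below uses a handful of made-up tweets instead of `dataset.csv`. The column names `Party` and `Tweet` are assumptions, and a plain regular expression stands in for the `torchtext` tokenizer; check how your chosen tokenizer treats the `#` character before relying on it.

```python
import re
from collections import Counter

# Toy stand-in for dataset.csv; the column names 'Party' and 'Tweet' are assumptions.
rows = [
    {"Party": "Democrat",   "Tweet": "Protect the #ACA and #VotingRights now"},
    {"Party": "Democrat",   "Tweet": "Proud to defend the #ACA today"},
    {"Party": "Republican", "Tweet": "The #TaxReform bill helps families"},
    {"Party": "Republican", "Tweet": "#TaxReform and #Jobs go together"},
]

def tokenise(text):
    """Lower-case the text and split it into words, keeping '#' attached to hashtags."""
    return re.findall(r"#?\w+", text.lower())

# Frequency distribution of the labels
label_counts = Counter(row["Party"] for row in rows)
print(label_counts)

# One hashtag counter per party
hashtags = {}
for row in rows:
    counter = hashtags.setdefault(row["Party"], Counter())
    counter.update(tok for tok in tokenise(row["Tweet"]) if tok.startswith("#"))

for party, counter in hashtags.items():
    print(party, counter.most_common(2))
```

With a pandas data frame, the same counts could be obtained with `value_counts()` on the label column and a `Counter` built per group.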

**2. Build vocabulary**

Load pre-trained GloVe embeddings using `torchtext.vocab.GloVe`. Build the vocabulary using `torchtext.vocab.build_vocab_from_iterator`, keeping tokens that appear at least five times and limiting the vocabulary to the 10,000 most frequent. Report the match rate as well as examples of matched and unmatched tokens.

Note: For efficiency, the cached file retains only the subset of tokens present in the vocabulary, rather than the full GloVe dictionary pre-trained on 27 billion tokens. To download the full set of embeddings, delete the cached file `glove.twitter.27B.100d.txt` or specify a different cache directory.

In [None]:
# Loads pre-trained embeddings
embeds = torchtext.vocab.GloVe(name='twitter.27B', dim=100, cache='.')
vocab  = torchtext.vocab.build_vocab_from_iterator(iterator=dataset['tokens'], min_freq=5, specials=['<pad>', '<unk>'], max_tokens=10_000)
vocab.set_default_index(vocab['<unk>']) # Unmatched words are assigned the "unknown" token
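
One possible way to compute the match rate is sketched below. It assumes that tokens missing from the pre-trained dictionary receive an all-zero vector, which is, to my understanding, the default behaviour of `get_vecs_by_tokens`. The `itos` list and `vectors` tensor are small hypothetical stand-ins for `vocab.get_itos()` and `embeds.get_vecs_by_tokens(vocab.get_itos())`.

```python
import torch

# Hypothetical stand-ins for vocab.get_itos() and embeds.get_vecs_by_tokens(...)
itos = ["<pad>", "<unk>", "the", "vote", "#maga", "xqzt"]
vectors = torch.tensor([
    [0.0, 0.0, 0.0],    # <pad>: special token, no pre-trained vector
    [0.0, 0.0, 0.0],    # <unk>: special token, no pre-trained vector
    [0.1, 0.3, -0.2],   # the
    [0.5, -0.1, 0.4],   # vote
    [0.2, 0.2, 0.9],    # #maga
    [0.0, 0.0, 0.0],    # xqzt: not found, assumed to be zero-initialised
])

# A token counts as matched when its vector is not entirely zero
matched = vectors.abs().sum(dim=1) > 0
match_rate = matched.float().mean().item()

matched_tokens = [tok for tok, m in zip(itos, matched.tolist()) if m]
unmatched_tokens = [tok for tok, m in zip(itos, matched.tolist()) if not m]

print(f"Match rate: {match_rate:.1%}")
print("Matched:", matched_tokens)
print("Unmatched:", unmatched_tokens)
```

Note that the special tokens `<pad>` and `<unk>` count as unmatched in this sketch; you may prefer to exclude them from the denominator.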

**4. Data loaders**

Map tokens to their corresponding vocabulary indices and encode party labels as `0.0` (Republican) and `1.0` (Democrat). Split the data into training (80%) and test (20%) samples. Using the provided `TweetDataset` and `collate_fn`, initialise training and test `utils.data.DataLoader` instances with `batch_size=128`, enabling shuffling for the training loader.

In [None]:
class TweetDataset(torch.utils.data.Dataset):
    
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

def collate_fn(batch, pad_index=vocab['<pad>']):
    X, y = zip(*batch)
    X = [torch.tensor(x, dtype=torch.long) for x in X]
    X = nn.utils.rnn.pad_sequence(X, batch_first=True, padding_value=pad_index)
    y = torch.tensor(y, dtype=torch.float)
    return X, y
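
The steps described in this section could be wired together roughly as follows. This is a minimal sketch on toy data: the dictionary `stoi` is a hypothetical stand-in for the `torchtext` vocabulary, the stratified split and `random_state` are my own choices, and the dataset class and collate function are repeated only so that the example runs on its own.

```python
import torch
from sklearn import model_selection
from torch import nn
from torch.utils.data import DataLoader, Dataset

# Toy stand-ins: a plain dict replaces the torchtext vocabulary (hypothetical)
stoi = {"<pad>": 0, "<unk>": 1, "vote": 2, "tax": 3, "health": 4}
tokens = [["vote", "health"], ["tax"], ["vote", "tax", "cuts"], ["health"], ["tax", "vote"]] * 4
parties = ["Democrat", "Republican", "Republican", "Democrat", "Republican"] * 4

# Tokens -> indices (unknown words fall back to <unk>), parties -> 0.0 / 1.0
X = [[stoi.get(tok, stoi["<unk>"]) for tok in toks] for toks in tokens]
y = [1.0 if party == "Democrat" else 0.0 for party in parties]

# 80% training, 20% test
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

class TweetDataset(Dataset):
    def __init__(self, X, y):
        self.X, self.y = X, y
    def __len__(self):
        return len(self.y)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

def collate_fn(batch, pad_index=stoi["<pad>"]):
    X, y = zip(*batch)
    X = [torch.tensor(x, dtype=torch.long) for x in X]
    X = nn.utils.rnn.pad_sequence(X, batch_first=True, padding_value=pad_index)
    return X, torch.tensor(y, dtype=torch.float)

train_loader = DataLoader(TweetDataset(X_train, y_train), batch_size=128, shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(TweetDataset(X_test, y_test), batch_size=128, shuffle=False, collate_fn=collate_fn)

# Each batch is padded to the length of its longest tweet
X_batch, y_batch = next(iter(train_loader))
print(X_batch.shape, y_batch.shape)
```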

**5. Model structure**

Create an embedding layer that maps token indices to embedding vectors using `torch.nn.Embedding.from_pretrained`, with `freeze=True` to keep the embeddings fixed during training.

Define a PyTorch model class that maps a batch of token indices of shape $b \times t$ to one scalar score per tweet using an embedding layer (producing a tensor of shape $b \times t \times d$), a 32-unit bidirectional LSTM encoder, dropout regularisation, and a single-unit output layer with no activation. For numerical stability, `nn.BCEWithLogitsLoss` expects raw logit scores rather than probabilities; the sigmoid transformation is applied internally within the loss function.
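
To see why working with logits helps, the sketch below compares a naive binary cross-entropy (sigmoid first, then logarithm) with one common numerically stable rewriting, $\max(z, 0) - zy + \log(1 + e^{-|z|})$. This is an illustration in plain Python, not a claim about PyTorch's exact internal implementation; the function names are hypothetical.

```python
import math

def bce_naive(z, y):
    """Cross-entropy computed from the probability: breaks down for large |z|."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def bce_from_logit(z, y):
    """Algebraically equivalent form that never takes the log of a rounded 0 or 1."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# Moderate logit: both versions agree
print(bce_naive(0.5, 1.0), bce_from_logit(0.5, 1.0))

# Confident but wrong prediction: the probability rounds to exactly 1.0,
# so the naive version ends up evaluating log(0)
try:
    naive = bce_naive(40.0, 0.0)
except ValueError:
    naive = float("nan")
stable = bce_from_logit(40.0, 0.0)
print(naive, stable)
```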

Instantiate the model with a dropout probability of $0.5$, as these models quickly overfit, and print its structure using `torchinfo.summary`.

In [None]:
# Embedding layer
embedding_layer = torch.nn.Embedding.from_pretrained(
    embeddings=embeds.get_vecs_by_tokens(vocab.get_itos()),
    freeze=True,
    padding_idx=vocab['<pad>']
)
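
One possible model class is sketched below. It is a simplified version under stated assumptions: random vectors stand in for the GloVe embeddings, the class name `TweetClassifier` is hypothetical, and padding is not masked inside the LSTM, so the forward direction also reads the trailing `<pad>` positions. Packing the sequences with `nn.utils.rnn.pack_padded_sequence` would be a natural refinement.

```python
import torch
from torch import nn

class TweetClassifier(nn.Module):
    """Embedding -> bidirectional LSTM -> dropout -> single logit per tweet."""

    def __init__(self, embedding_layer, hidden_size=32, dropout=0.5):
        super().__init__()
        self.embedding = embedding_layer
        self.lstm = nn.LSTM(
            input_size=embedding_layer.embedding_dim,
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=True,
        )
        self.dropout = nn.Dropout(dropout)
        self.output = nn.Linear(2 * hidden_size, 1)

    def forward(self, X):
        E = self.embedding(X)                    # (b, t) -> (b, t, d)
        _, (h_n, _) = self.lstm(E)               # h_n: (2, b, hidden_size)
        h = torch.cat([h_n[0], h_n[1]], dim=1)   # final forward and backward states
        return self.output(self.dropout(h)).squeeze(1)  # (b,) raw logits

# Random stand-in for the pre-trained embedding matrix (50 tokens, 100 dimensions)
stand_in_embedding = nn.Embedding.from_pretrained(torch.randn(50, 100), freeze=True, padding_idx=0)
model = TweetClassifier(stand_in_embedding, hidden_size=32, dropout=0.5)
print(model)

# Forward pass on a dummy batch of 4 tweets with 7 tokens each
model.eval()
with torch.no_grad():
    logits = model(torch.randint(1, 50, (4, 7)))
print(logits.shape)
```

In the notebook, you would pass the `embedding_layer` defined above instead of the random stand-in, and print the structure with `torchinfo.summary`.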

**6. Model training** 

Define the appropriate loss function `nn.BCEWithLogitsLoss` and an optimisation algorithm (e.g. `optim.AdamW`).  Write a PyTorch training loop to estimate the model parameters using the training sample, with a maximum of 25 epochs and a learning rate of `1e-3`. Remember to move the model and the batch data to the correct device.

Optionally, implement an early-stopping procedure on the validation sample to monitor generalisation performance across epochs, stop training when overfitting begins, and restore the parameters that perform best out of sample.
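
A training loop with a simple early-stopping rule might look like the sketch below. Everything except the loss, optimiser, epoch limit, and learning rate is an assumption: the data are synthetic, a small `nn.EmbeddingBag` model stands in for the LSTM classifier, and the patience of three epochs is an arbitrary choice.

```python
import copy
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic stand-in data: the label is 1 when the token ids are high on average
X = torch.randint(0, 20, (512, 12))
y = (X.float().mean(dim=1) > 9.5).float()
train_loader = DataLoader(TensorDataset(X[:384], y[:384]), batch_size=128, shuffle=True)
valid_loader = DataLoader(TensorDataset(X[384:], y[384:]), batch_size=128)

# Hypothetical stand-in for the bidirectional LSTM classifier
model = nn.Sequential(nn.EmbeddingBag(20, 8), nn.Linear(8, 1)).to(device)
criterion = nn.BCEWithLogitsLoss()
optimiser = optim.AdamW(model.parameters(), lr=1e-3)

def evaluate(loader):
    """Average loss over a loader, with dropout and gradients disabled."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for X_b, y_b in loader:
            X_b, y_b = X_b.to(device), y_b.to(device)
            total += criterion(model(X_b).squeeze(1), y_b).item() * len(y_b)
            n += len(y_b)
    return total / n

best_loss, best_state, patience, bad_epochs = float("inf"), None, 3, 0
for epoch in range(25):
    model.train()
    for X_b, y_b in train_loader:
        X_b, y_b = X_b.to(device), y_b.to(device)
        optimiser.zero_grad()
        loss = criterion(model(X_b).squeeze(1), y_b)
        loss.backward()
        optimiser.step()

    valid_loss = evaluate(valid_loader)
    if valid_loss < best_loss:
        # Improvement: remember these parameters
        best_loss, bad_epochs = valid_loss, 0
        best_state = copy.deepcopy(model.state_dict())
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # no improvement for `patience` consecutive epochs

# Restore the parameters that performed best out of sample
model.load_state_dict(best_state)
```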

**7. Threshold selection**

Write a prediction loop for the validation sample and, using the predicted probabilities, select the ROC-based decision threshold that maximises the true positive rate while minimising the false positive rate. Given the class balance in the training sample, what threshold value would you expect?
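
One common reading of this criterion is Youden's J statistic, which picks the threshold maximising $\text{TPR} - \text{FPR}$ along the ROC curve; the point closest to the top-left corner is another option. The sketch below applies the first reading to made-up labels and probabilities.

```python
import numpy as np
from sklearn import metrics

# Made-up validation labels and predicted Democrat probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.30, 0.35, 0.45, 0.40, 0.65, 0.70, 0.80, 0.90])

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_prob)

# Youden's J: vertical distance between the ROC curve and the diagonal
j = tpr - fpr
best = int(np.argmax(j))
threshold = thresholds[best]

print(f"Threshold: {threshold:.2f} (TPR={tpr[best]:.2f}, FPR={fpr[best]:.2f})")
```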

**8. Model performance** 

Apply the prediction loop to the test set, use the selected threshold, and evaluate performance with `metrics.classification_report` and `metrics.confusion_matrix`.
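
For illustration, the two reports can be produced as follows once test labels and probabilities are available. The arrays and the threshold of `0.5` are placeholders, not results from the model.

```python
import numpy as np
from sklearn import metrics

# Placeholder test labels and predicted probabilities
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.2, 0.6, 0.4, 0.7, 0.3, 0.9])
threshold = 0.5  # placeholder: use the threshold selected on the validation sample

y_pred = (y_prob >= threshold).astype(int)

print(metrics.classification_report(y_true, y_pred, target_names=["Republican", "Democrat"]))

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)
```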

**9. Prediction**

Display the tweets with the highest and lowest predicted Democrat probability, as well as those with the most neutral scores. Write a pipeline to test the model on your own custom input sentences.
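
A prediction pipeline for custom sentences could follow the same steps as training: tokenise, map to indices, pad, run the model, and apply the sigmoid. In the sketch below, the dictionary `stoi`, the whitespace tokenisation, and the untrained `nn.EmbeddingBag` model are all hypothetical stand-ins, so the printed probabilities carry no meaning.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the vocabulary and the trained classifier
stoi = {"<pad>": 0, "<unk>": 1, "we": 2, "must": 3, "protect": 4, "healthcare": 5, "cut": 6, "taxes": 7}
model = nn.Sequential(nn.EmbeddingBag(len(stoi), 8, padding_idx=0), nn.Linear(8, 1))

def predict_proba(texts, model, stoi):
    """Return the predicted Democrat probability for each input sentence."""
    encoded = []
    for text in texts:
        # Whitespace tokenisation is a simplification; reuse the training tokenizer in practice
        ids = [stoi.get(tok, stoi["<unk>"]) for tok in text.lower().split()]
        encoded.append(torch.tensor(ids or [stoi["<unk>"]], dtype=torch.long))
    X = nn.utils.rnn.pad_sequence(encoded, batch_first=True, padding_value=stoi["<pad>"])

    model.eval()
    with torch.no_grad():
        logits = model(X).squeeze(1)
    return torch.sigmoid(logits)  # logits -> probabilities

probs = predict_proba(["We must protect healthcare", "Cut taxes"], model, stoi)
print(probs)
```

Sorting the probabilities, or the absolute distance `|p - 0.5|`, then gives the most extreme and the most neutral tweets.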

**10. Sequence representations**

The LSTM layer outputs numerical sequence representations that serve as tweet embeddings.  Extract these embeddings and, for a selected tweet, identify the most similar tweets using cosine similarity. How do these representations compare with those obtained via singular value decomposition? What do you think these distances reflect?
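
Once the embeddings are extracted into a matrix with one row per tweet, the nearest neighbours under cosine similarity can be found as sketched below. The two-dimensional vectors are made up for illustration, and `most_similar` is a hypothetical helper name.

```python
import numpy as np

def most_similar(embeddings, query_index, k=3):
    """Indices and cosine similarities of the k rows closest to the query row."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True).clip(min=1e-12)
    normed = embeddings / norms
    sims = normed @ normed[query_index]   # cosine similarity with every row
    sims[query_index] = -np.inf           # exclude the query itself
    top = np.argsort(-sims)[:k]
    return top, sims[top]

# Made-up tweet embeddings (one row per tweet)
embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
    [2.0, 0.1],
])

top, sims = most_similar(embeddings, query_index=0, k=2)
print(top, sims)
```

Because cosine similarity ignores vector length, the last row is the closest match to the first one even though its norm is twice as large.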

**BONUS (intermediate)** 

Project the tweet representations into a two- or three-dimensional space using a dimensionality-reduction method such as UMAP. Colour points by predicted probability and distinguish correct and incorrect predictions by marker symbol.

**BONUS (advanced)** 

Using `captum.attr.LayerIntegratedGradients`, compute token-level importance by averaging attribution scores with respect to the model's predictions, either for a single instance or across the full dataset. Identify the most influential words, as well as those most predictive of Democratic and Republican classifications.