## STAT8021 / STAT8307
### Assignment 3: Transformer Mechanics, Application, and Pre-training/Fine-tuning Analysis
### DUE: April 27, 2025, Sunday, 11:59 PM

## 1. Understanding Transformer

In [1]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F
import math

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

  from .autonotebook import tqdm as notebook_tqdm


### Transformer: Multi-head Attention




#### Q1 (a) Query, Key, Value


In Transformers, we perform self-attention, which means that the values, keys and query are derived from the input $X \in \mathbb{R}^{\ell \times d_1}$, where $\ell$ is our sequence length. Specifically, we learn parameter matrices $V_i,K_i,Q_i \in \mathbb{R}^{d_1\times d/h}$ to map our input $X$ as follows:

\begin{align}
v_i = V_iX\ \ i \in \{1,\dots,h\}\\
k_i = K_iX\ \ i \in \{1,\dots,h\}\\
q_i = Q_iX\ \ i \in \{1,\dots,h\}
\end{align}

where $i$ refers to the $i$-th head and $h$ is the number of heads.

In [2]:
def get_multihead_qkv(query, key, value, embed_dim, n_heads):
    """
    Inputs:
        - query: Input data to be used as the query, of shape (N, S, hidden_dim)
        - key: Input data to be used as the key, of shape (N, T, hidden_dim)
        - value: Input data to be used as the value, of shape (N, T, hidden_dim)
        - embed_dim: The embedding dimension of q,k,v (d in the formula)
        - n_heads: the number of heads
        Note: In the shape definitions above, N is the batch size, S is the source
        sequence length, T is the target sequence length, and hidden_dim is the hidden dimension of X (d1 in the formula).
    Returns:
        - output: a tuple containg query, key, value with shapes of (N, H, S, head_dim), (N, H, T, head_dim), (N, H, T, head_dim) respectively
    """
    N, S, E = query.shape
    N, T, E = value.shape
    assert embed_dim % n_heads == 0
    
    head_dim = embed_dim // n_heads
    # Notes:
    #  1) Define your projections using nn.Linear() and initialize them following the order q,k,v
    #  2) You'll want to split your shape from (N, T, embed_dim) into (N, T, H, head_dim),
        #     where H is the number of heads.
    #  3) Tensor.view() and Tensor.permute() might help
    # ------------------------------------------------------------------------------------------------------------------------------
    # Write your code here
    q_proj = nn.Linear(E, n_heads * head_dim)
    k_proj = nn.Linear(E, n_heads * head_dim)
    v_proj = nn.Linear(E, n_heads * head_dim)

    q = q_proj(query).view(N, S, n_heads, head_dim).permute(0, 2, 1, 3)
    k = k_proj(key).view(N, T, n_heads, head_dim).permute(0, 2, 1, 3)
    v = v_proj(value).view(N, T, n_heads, head_dim).permute(0, 2, 1, 3)
    
    # ------------------------------------------------------------------------------------------------------------------------------
    
    return q,k,v

In [3]:
torch.manual_seed(123)
batch_size = 1
sequence_length = 3
embed_dim = 8 #d
hidden_dim = 8 #d1
n_heads = 2
data = torch.randn(batch_size, sequence_length, hidden_dim)
q, k, v = get_multihead_qkv(data, data, data, embed_dim, n_heads)
print('The shape of query is {}.'.format(q.shape))
print('The L2 norm of query is {:.4f}.'.format(torch.linalg.norm(q)))

The shape of query is torch.Size([1, 2, 3, 4]).
The L2 norm of query is 2.7053.


#### Q1 (b) Multi-Headed Scaled Dot-Product Attention
In the case of multi-headed attention, we learn a parameter matrix for each head, which gives the model more expressivity to attend to different parts of the input. Let $Y_i$ be the attention output of head $i$. Thus we learn individual matrices $Q_i$, $K_i$ and $V_i$. To keep our overall computation the same as the single-headed case, we choose $Q_i \in \mathbb{R}^{d\times d/h}$, $K_i \in \mathbb{R}^{d\times d/h}$ and $V_i \in \mathbb{R}^{d\times d/h}$. Adding in a scaling term $\frac{1}{\sqrt{d/h}}$ to our simple dot-product attention above, we have

\begin{equation} \label{qkv_eqn}
A_i = \text{softmax}\bigg(\frac{(XQ_i)(XK_i)^\top}{\sqrt{d/h}}\bigg)
\end{equation}

In [4]:
def calculate_multihead_attention(q, k, attn_mask=None):
    """
    Inputs:
        - q: Multi-head query with the shape of (N, H, S, head_dim)
        - k: Multi-head key with the shape of (N, H, T, head_dim)
        - attn_mask (if provided): Array of shape (S, T) where attn_mask[i,j] == 0 indicates token
          j in the key/value should not influence token i in the query output.
        - Note: head_dim refers to embed_dim/n_heads
    Returns:
        - attention_weights: attention tensor with shape of (N, H, S, T)
    """
    # Notes:
    #  1) You need to transpose k
    #  2) You need to set scores to '-inf' where mask==0. Tensor.masked_fill() might help.
    # ------------------------------------------------------------------------------------------------------------------------------
    # Write your code here    
    d_h = q.shape[3]  
    
    qk_proj = torch.matmul(q, k.transpose(-2, -1))
    if attn_mask is not None:
        qk_proj = qk_proj.masked_fill(attn_mask, - torch.inf)
    
    qk_proj = qk_proj / math.sqrt(d_h)
    
    attention_weights = F.softmax(qk_proj, dim=-1)
    
    
    # ------------------------------------------------------------------------------------------------------------------------------
    return attention_weights

In [5]:
torch.manual_seed(456)
# Create a 0/1 mask where 0 means mask out, 1 means keep
mask = torch.randn(sequence_length, sequence_length) < 0.5 # ~50% are 0s (masked)
self_attn_output = calculate_multihead_attention(q, k)
masked_self_attn_output = calculate_multihead_attention(q, k, attn_mask=mask)
print('The shape of attention is {}.'.format(self_attn_output.shape))
print('The L2 norm of self-attention is {:.4f}.'.format(torch.linalg.norm(self_attn_output)))
print('The L2 norm of masked_self_attn_output is {:.4f}.'.format(torch.linalg.norm(masked_self_attn_output)))

The shape of attention is torch.Size([1, 2, 3, 3]).
The L2 norm of self-attention is 1.4747.
The L2 norm of masked_self_attn_output is 2.0186.


#### Q1 (c) Final outputs and Wrap-up

Now we have got our attention $A_i$, and each head's output could be calculated using the following formula.

\begin{equation}
Y_i = A_i(XV_i)
\end{equation}

where $Y_i\in\mathbb{R}^{\ell \times d/h}$, where $\ell$ is our sequence length.



In our implementation, we apply dropout to the attention weights (though in practice it could be used at any step):

\begin{equation} \label{qkvdropout_eqn}
Y_i = \text{dropout}\bigg(A_i\bigg)(XV_i)
\end{equation}

Finally, then the output of the self-attention is a linear transformation of the concatenation of the heads:

\begin{equation}
Y = [Y_1;\dots;Y_h]W
\end{equation}

where $W \in\mathbb{R}^{d\times d}$ and $[Y_1;\dots;Y_h]\in\mathbb{R}^{\ell \times d}$.


In [6]:
class MultiHeadAttention(nn.Module):
    """
    A model layer which implements a simplified version of masked attention, as
    introduced by "Attention Is All You Need" (https://arxiv.org/abs/1706.03762).
    """

    def __init__(self, embed_dim, num_heads, dropout=0.1):
        """
        Construct a new MultiHeadAttention layer.

        Inputs:
         - embed_dim: Dimension of the token embedding
         - num_heads: Number of attention heads
         - dropout: Dropout probability
        """
        super().__init__()
        assert embed_dim % num_heads == 0

        # Linear layers for Q, K, V projections (input dim = output dim = embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.query = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

        # Final linear projection layer
        self.proj = nn.Linear(embed_dim, embed_dim)

        # Dropout layer for attention weights
        self.attn_drop = nn.Dropout(dropout)

        # Store dimensions
        self.n_head = num_heads
        self.emd_dim = embed_dim
        self.head_dim = self.emd_dim // self.n_head

    def forward(self, query, key, value, attn_mask=None):
        """
        Calculate the masked attention output for the provided data, computing
        all attention heads in parallel.

        In the shape definitions below, N is the batch size, S is the source
        sequence length, T is the target sequence length, and E is the embedding
        dimension.

        Inputs:
        - query: Input data to be used as the query, of shape (N, S, E)
        - key: Input data to be used as the key, of shape (N, T, E)
        - value: Input data to be used as the value, of shape (N, T, E)
        - attn_mask: Array of shape (S, T) where mask[i,j] == 0 indicates token
          i in the source should not influence token j in the target.

        Returns:
        - output: Tensor of shape (N, S, E) giving the weighted combination of
          data in value according to the attention weights calculated using key
          and query.
        """
        N, S, E = query.shape
        N, T, E = value.shape

        # Notes:
        #  1) Please do not directly call the functions defined above. Instead, write your code step by step.
        # ------------------------------------------------------------------------------------------------------------------------------
        # Write your code here
        q = self.query(query).view(N, S, self.n_head, self.head_dim).permute(0, 2, 1, 3)
        k = self.key(key).view(N, T, self.n_head, self.head_dim).permute(0, 2, 1, 3)
        v = self.value(value).view(N, T, self.n_head, self.head_dim).permute(0, 2, 1, 3)

        qk_proj = torch.matmul(q, k.transpose(-2, -1))
        if attn_mask is not None:
            qk_proj = qk_proj.masked_fill(attn_mask == 0, - torch.inf)
        
        qk_proj = qk_proj / self.head_dim ** 0.5
        
        attention_weights = F.softmax(qk_proj, dim=-1)

        Y = torch.matmul(self.attn_drop(attention_weights), v).permute(0, 2, 1, 3).reshape(N, T, self.n_head * self.head_dim)
        
        output = self.proj(Y)
        
        # ------------------------------------------------------------------------------------------------------------------------------
        return output

In [7]:
torch.manual_seed(231)


batch_size = 1
sequence_length = 3
attn = MultiHeadAttention(embed_dim, num_heads=2)

# Self-attention.
data = torch.randn(batch_size, sequence_length, embed_dim)
self_attn_output = attn(query=data, key=data, value=data)

# Masked self-attention.
mask = torch.randn(sequence_length, sequence_length) < 0.5
masked_self_attn_output = attn(query=data, key=data, value=data, attn_mask=mask)

# Attention using two inputs.
other_data = torch.randn(batch_size, sequence_length, embed_dim)
attn_output = attn(query=data, key=other_data, value=other_data)

expected_self_attn_output = np.asarray([[
[-0.2494,  0.1396,  0.4323, -0.2411, -0.1547,  0.2329, -0.1936,
          -0.1444],
         [-0.1997,  0.1746,  0.7377, -0.3549, -0.2657,  0.2693, -0.2541,
          -0.2476],
         [-0.0625,  0.1503,  0.7572, -0.3974, -0.1681,  0.2168, -0.2478,
          -0.3038]]])

expected_masked_self_attn_output = np.asarray([[
[-0.1347,  0.1934,  0.8628, -0.4903, -0.2614,  0.2798, -0.2586,
          -0.3019],
         [-0.1013,  0.3111,  0.5783, -0.3248, -0.3842,  0.1482, -0.3628,
          -0.1496],
         [-0.2071,  0.1669,  0.7097, -0.3152, -0.3136,  0.2520, -0.2774,
          -0.2208]]])

expected_attn_output = np.asarray([[
[-0.1980,  0.4083,  0.1968, -0.3477,  0.0321,  0.4258, -0.8972,
          -0.2744],
         [-0.1603,  0.4155,  0.2295, -0.3485, -0.0341,  0.3929, -0.8248,
          -0.2767],
         [-0.0908,  0.4113,  0.3017, -0.3539, -0.1020,  0.3784, -0.7189,
          -0.2912]]])

print('self_attn_output error: ', rel_error(expected_self_attn_output, self_attn_output.detach().numpy()))
print('masked_self_attn_output error: ', rel_error(expected_masked_self_attn_output, masked_self_attn_output.detach().numpy()))
print('attn_output error: ', rel_error(expected_attn_output, attn_output.detach().numpy()))

self_attn_output error:  0.0003772742211599121
masked_self_attn_output error:  0.0001526367643724865
attn_output error:  0.00035224630317522767


Checker: The correct implementation will give an error no more than `e-3`.

### Transformer: Positional Embedding

#### Q1 (d) Positional Embedding

While transformers are able to easily attend to any part of their input, the attention mechanism has no concept of token order. However, for many tasks (especially natural language processing), relative token order is very important. To recover this, the authors add a positional encoding to the embeddings of individual word tokens.

Let us define a matrix $P \in \mathbb{R}^{l\times d}$, where $P_{ij} = $

$$
\begin{cases}
\text{sin}\left(i \cdot 10000^{-\frac{j}{d}}\right) & \text{if j is even} \\
\text{cos}\left(i \cdot 10000^{-\frac{(j-1)}{d}}\right) & \text{otherwise} \\
\end{cases}
$$

Rather than directly passing an input $X \in \mathbb{R}^{l\times d}$ to our network, we instead pass $X + P$.

In [8]:
class PositionalEncoding(nn.Module):
    """
    Encodes information about the positions of the tokens in the sequence. In
    this case, the layer has no learnable parameters, since it is a simple
    function of sines and cosines.
    """
    def __init__(self, embed_dim, dropout=0.1, max_len=5000):
        """
        Construct the PositionalEncoding layer.

        Inputs:
         - embed_dim: the size of the embed dimension
         - dropout: the dropout value
         - max_len: the maximum possible length of the incoming sequence
        """
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        assert embed_dim % 2 == 0
        # Create an array with a "batch dimension" of 1 (which will broadcast
        # across all examples in the batch).
        pe = torch.zeros(1, max_len, embed_dim)
        # Notes:
        #  1) Construct the positional encoding array as described above.
        # ------------------------------------------------------------------------------------------------------------------------------
        # Write your code here
        for i in range(max_len):
            for j in range(embed_dim):
                if j % 2 == 0:
                    pe[0][i][j] = math.sin(i * 10000 ** (- j / embed_dim))
                else:
                    pe[0][i][j] = math.cos(i * 10000 ** (- (j-1) / embed_dim))
                    
        self.attn_drop = nn.Dropout(dropout)
        # ------------------------------------------------------------------------------------------------------------------------------

        # Make sure the positional encodings will be saved with the model
        # parameters (mostly for completeness).
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Element-wise add positional embeddings to the input sequence.

        Inputs:
         - x: the sequence fed to the positional encoder model, of shape
              (N, S, D), where N is the batch size, S is the sequence length and
              D is embed dim
        Returns:
         - output: the input sequence + positional encodings, of shape (N, S, D)
        """
        N, S, D = x.shape
        # Notes:
        #  1) Index into your array of positional encodings, and add the appropriate ones to the input sequence.
        #  2) Don't forget to apply dropout afterward.
        # ------------------------------------------------------------------------------------------------------------------------------
        # Write your code here
        output = self.attn_drop(x + self.pe[:, :S, :])

        # ------------------------------------------------------------------------------------------------------------------------------


        return output

In [9]:
torch.manual_seed(231)

batch_size = 1
sequence_length = 2
embed_dim = 6
data = torch.randn(batch_size, sequence_length, embed_dim)

pos_encoder = PositionalEncoding(embed_dim)
output = pos_encoder(data)

expected_pe_output = np.asarray([[[-1.2340,  1.1127,  1.6978, -0.0865, -0.0000,  1.2728],
                                  [ 0.9028, -0.4781,  0.5535,  0.8133,  1.2644,  1.7034]]])

print('pe_output error: ', rel_error(expected_pe_output, output.detach().numpy()))

pe_output error:  0.00010421011374914356


Checker: The correct implementation will give an error no more than `e-3`.

## 2. Applying Transformer


In [1]:
! pip install transformers datasets evaluate



In [1]:
from datasets import load_dataset, DatasetDict

ag_news_dataset = load_dataset("ag_news")

  from .autonotebook import tqdm as notebook_tqdm
Found cached dataset parquet (C:/Users/23629/.cache/huggingface/datasets/parquet/ag_news-9af2a5926861d22a/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 250.01it/s]


In [2]:
ag_news_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

In [3]:
# Just take the first 100 tokens for speed/running on cpu
def truncate(example):
    return {
        'text': " ".join(example['text'].split()[:100]),
        'label': example['label']
    }

# Take 1024 random examples for train and 128 validation
small_ag_news_dataset = DatasetDict(
    train=ag_news_dataset['train'].shuffle(seed=1111).select(range(1024)).map(truncate),
    val=ag_news_dataset['test'].shuffle(seed=1111).select(range(128)).map(truncate),
)

Loading cached shuffled indices for dataset at C:\Users\23629\.cache\huggingface\datasets\parquet\ag_news-9af2a5926861d22a\0.0.0\14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7\cache-63a729b30640687c.arrow
Loading cached processed dataset at C:\Users\23629\.cache\huggingface\datasets\parquet\ag_news-9af2a5926861d22a\0.0.0\14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7\cache-4626180bb011e782.arrow
Loading cached shuffled indices for dataset at C:\Users\23629\.cache\huggingface\datasets\parquet\ag_news-9af2a5926861d22a\0.0.0\14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7\cache-387c2fe96c3db8c6.arrow
Loading cached processed dataset at C:\Users\23629\.cache\huggingface\datasets\parquet\ag_news-9af2a5926861d22a\0.0.0\14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7\cache-25bdc38418fb4a58.arrow


In [4]:
small_ag_news_dataset['train'][0]

{'text': 'India and Pakistan balk at bold Kashmir peace plan Pakistani President Pervez Musharraf this week urged steps to end the bitter dispute.',
 'label': 0}

In [5]:
small_ag_news_dataset['val'][0]

{'text': 'Nortel warns of lower Q3 revenue TORONTO - Nortel Networks warned Thursday its third-quarter revenue will be below the \\$2.6 billion US preliminary unaudited revenues it reported for the second quarter.',
 'label': 2}

In [6]:
id2label = {
    0: "World", 
    1: "Sports",
    2: "Business",
    3: "Sci/Tech",
    }

##### Q2 (a)

In [7]:
from transformers import DistilBertTokenizerFast

# ------------------------------------------------------------------------------------------------------------------------------
# Write your code here
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def preprocess_function(token):
    return tokenizer(token["text"], padding="max_length", truncation=True)

small_tokenized_dataset = small_ag_news_dataset.map(preprocess_function, batched=True)

small_tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])


# print the frist 3 processed samples
small_tokenized_dataset['train'][:3]
# ------------------------------------------------------------------------------------------------------------------------------

Loading cached processed dataset at C:\Users\23629\.cache\huggingface\datasets\parquet\ag_news-9af2a5926861d22a\0.0.0\14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7\cache-660febad63e88154.arrow
Loading cached processed dataset at C:\Users\23629\.cache\huggingface\datasets\parquet\ag_news-9af2a5926861d22a\0.0.0\14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7\cache-d66dcfb19e8d85d1.arrow


{'label': tensor([0, 3, 1]),
 'input_ids': tensor([[ 101, 2634, 1998,  ...,    0,    0,    0],
         [ 101, 3042, 2194,  ...,    0,    0,    0],
         [ 101, 2148, 4420,  ...,    0,    0,    0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]])}

##### Q2 (b)

In [9]:
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW
from tqdm import tqdm
import os 
from transformers import DistilBertForSequenceClassification
import torch
import evaluate
from torch.utils.data import DataLoader

# ------------------------------------------------------------------------------------------------------------------------------
# Write your code here

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

num_epochs = 3
bsz = 8
lr = 5e-5

train_dataloader = DataLoader(small_tokenized_dataset["train"], batch_size=bsz, shuffle=True)
test_dataloader = DataLoader(small_tokenized_dataset["val"], batch_size=bsz)

# Define your model. optimizer, hyper-parameter and etc.

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=4)
model.to(device)

optimizer = AdamW(model.parameters(), lr=lr)
num_warmup_steps = int(0.1 * num_epochs * len(train_dataloader))

lr_scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, 
                                               num_training_steps=num_epochs * len(train_dataloader))

for epoch in range(num_epochs):
    #train and evaluate your model
    model.train()
    train_correct, train_total = 0, 0
    for batch in tqdm(train_dataloader,desc="Training process:"):
        batch = {k: v.to(device) for k, v in batch.items()}
        if 'label' in batch:
            batch['labels'] = batch.pop('label')        
        outputs = model(**batch)
        loss = outputs.loss
        logits = outputs.logits

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()

        predictions = torch.argmax(logits, dim=-1)
        train_correct += (predictions == batch["labels"]).sum().item()
        train_total += batch["labels"].size(0)

    train_acc = train_correct / train_total

    model.eval()
    test_correct, test_total = 0, 0
    with torch.no_grad():
        for batch in tqdm(test_dataloader, desc="Testing process:"):
            batch = {k: v.to(device) for k, v in batch.items()}
            if 'label' in batch:
                batch['labels'] = batch.pop('label')            
            outputs = model(**batch)
            logits = outputs.logits

            predictions = torch.argmax(logits, dim=-1)
            test_correct += (predictions == batch["labels"]).sum().item()
            test_total += batch["labels"].size(0)

    test_acc = test_correct / test_total

        
    # print the training process
    print("Epoch {}: train acc = {:.4f}, test acc = {:.4f}".format(epoch + 1, train_acc, test_acc))

# ------------------------------------------------------------------------------------------------------------------------------

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.

Epoch 1: train acc = 0.7217, test acc = 0.8750


Training process:: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [07:27<00:00,  6.99s/it]
Testing process:: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:27<00:00,  3.44s/it]


Epoch 2: train acc = 0.9209, test acc = 0.8984


Training process:: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [07:28<00:00,  7.00s/it]
Testing process:: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:27<00:00,  3.47s/it]

Epoch 3: train acc = 0.9658, test acc = 0.8906





##### Q2 (c)

In [14]:
chatgpt_generated_news = [
    "In an exciting match last night, the Los Angeles Lakers defeated the Brooklyn Nets 115-110. Lakers' LeBron James made a comeback after missing several games due to injury and scored 25 points while teammate Anthony Davis added 28 points. Nets' star player Kevin Durant scored 32 points but couldn't lead his team to victory.",
    "Scientists have discovered a new species of dinosaur that roamed the earth 80 million years ago. The species, named Almatherium, was found in Uzbekistan and is believed to be an ancestor of the modern-day armadillo. The discovery sheds new light on the evolution of mammals and their relationship with dinosaurs.",
    "The United Nations has called for an immediate ceasefire in Yemen as the country faces a growing humanitarian crisis. The UN's special envoy for Yemen, Martin Griffiths, urged all parties to end the violence and engage in peace talks. The conflict has left millions of Yemenis at risk of famine and disease.",
    "Amazon has announced that it will be opening its first fulfillment center in New Zealand, creating more than 500 new jobs. The center will be located in Auckland and is expected to open in 2022. This move will allow Amazon to expand its operations in the region and improve delivery times for customers.",
]
prediction_label = []

# ------------------------------------------------------------------------------------------------------------------------------
# Write your code here

# test your finetuned model on chatgpt_genreated_news
model.eval()
for news in chatgpt_generated_news:
    inputs = tokenizer(news, return_tensors="pt", truncation=True, padding=True)
    inputs = {key: value.to(device) for key, value in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=-1).item()
    
    prediction_label.append(predicted_class)

for ids, prediction_label in enumerate(prediction_label):
    print(f"The class of news '{chatgpt_generated_news[ids]}' is {id2label[prediction_label]} \n")

# ------------------------------------------------------------------------------------------------------------------------------


The class of news 'In an exciting match last night, the Los Angeles Lakers defeated the Brooklyn Nets 115-110. Lakers' LeBron James made a comeback after missing several games due to injury and scored 25 points while teammate Anthony Davis added 28 points. Nets' star player Kevin Durant scored 32 points but couldn't lead his team to victory.' is Sports 

The class of news 'Scientists have discovered a new species of dinosaur that roamed the earth 80 million years ago. The species, named Almatherium, was found in Uzbekistan and is believed to be an ancestor of the modern-day armadillo. The discovery sheds new light on the evolution of mammals and their relationship with dinosaurs.' is Sci/Tech 

The class of news 'The United Nations has called for an immediate ceasefire in Yemen as the country faces a growing humanitarian crisis. The UN's special envoy for Yemen, Martin Griffiths, urged all parties to end the violence and engage in peace talks. The conflict has left millions of Yemenis 

##### Q2 (d)

In [15]:
# ------------------------------------------------------------------------------------------------------------------------------
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

# Write your code here

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

def preprocess_function(token):
    return tokenizer(token["text"], padding="max_length", truncation=True)

small_tokenized_dataset = small_ag_news_dataset.map(preprocess_function, batched=True)

small_tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])



# Define your model. optimizer, hyper-parameter and etc.


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

num_epochs = 3
bsz = 8
lr = 5e-5

train_dataloader = DataLoader(small_tokenized_dataset["train"], batch_size=bsz, shuffle=True)
test_dataloader = DataLoader(small_tokenized_dataset["val"], batch_size=bsz)

# Define your model. optimizer, hyper-parameter and etc.

model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=4)
model.to(device)

optimizer = AdamW(model.parameters(), lr=lr)
num_warmup_steps = int(0.1 * num_epochs * len(train_dataloader))

lr_scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, 
                                               num_training_steps=num_epochs * len(train_dataloader))

for epoch in range(num_epochs):
    #train and evaluate your model
    model.train()
    train_correct, train_total = 0, 0
    for batch in tqdm(train_dataloader,desc="Training process:"):
        batch = {k: v.to(device) for k, v in batch.items()}
        if 'label' in batch:
            batch['labels'] = batch.pop('label')        
        outputs = model(**batch)
        loss = outputs.loss
        logits = outputs.logits

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()

        predictions = torch.argmax(logits, dim=-1)
        train_correct += (predictions == batch["labels"]).sum().item()
        train_total += batch["labels"].size(0)

    train_acc = train_correct / train_total

    model.eval()
    test_correct, test_total = 0, 0
    with torch.no_grad():
        for batch in tqdm(test_dataloader, desc="Testing process:"):
            batch = {k: v.to(device) for k, v in batch.items()}
            if 'label' in batch:
                batch['labels'] = batch.pop('label')            
            outputs = model(**batch)
            logits = outputs.logits

            predictions = torch.argmax(logits, dim=-1)
            test_correct += (predictions == batch["labels"]).sum().item()
            test_total += batch["labels"].size(0)

    test_acc = test_correct / test_total

        
    # print the training process
    print("Epoch {}: train acc = {:.4f}, test acc = {:.4f}".format(epoch + 1, train_acc, test_acc))

# ------------------------------------------------------------------------------------------------------------------------------

Downloading tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 25.0/25.0 [00:00<00:00, 5.00kB/s]
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Downloading vocab.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 1.16MB/s]
Downloading merges.txt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 965kB/s]
Downloading tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:01<00:00, 1.33MB/s]
Dow

Epoch 1: train acc = 0.7285, test acc = 0.8438


Training process:: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [15:11<00:00,  7.12s/it]
Testing process:: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:30<00:00,  1.93s/it]


Epoch 2: train acc = 0.9131, test acc = 0.8672


Training process:: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [15:11<00:00,  7.12s/it]
Testing process:: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:30<00:00,  1.89s/it]

Epoch 3: train acc = 0.9492, test acc = 0.9141



