# KAIST AI605 Assignment 4: Sequence and Token Classification with BERT
Instructor: Minjoon Seo (minjoon@kaist.ac.kr)

TA in charge: Seokin Seo (tzs930@kaist.ac.kr)

**Due date**: May 29 (Wed) 11:00pm, 2021

Your name: Seungwoo, Ryu

Your student ID: 20213207

Your collaborators: -

## Assignment Objectives
- Use BERT for sequence classification (Assignment 1)
- Use BERT for token classification (Assignment 2)

## Your Submission
Your submission will be a link to a Colab notebook that has all written answers and is fully executable. You will submit your assignment via KLMS. Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed but it is not a group assignment so make sure your answer and code are your own. Also make sure to mention your collaborators in your assignment with their names and their student ids.

## Grading
The entire assignment is out of 100 points. There are two bonus questions with 40 points altogether. Your final score can be higher than 100 points.


## Environment
You will only use Python 3.7 and PyTorch 1.8, both of which are already available on Colab:

In [None]:
from platform import python_version
import torch

print("python", python_version())
print("torch", torch.__version__)

python 3.7.4
torch 1.8.0+cu111


## 1. Hugging Face Transformers
In this assignment, you will use the `transformers` library by Hugging Face. The library gives you an easy way to use diverse pretrained language models. Specifically, you will redo the sequence classification (sentiment analysis) and token classification (question answering) tasks that you already did in Assignments 1 and 2.

First, install both `transformers` and `datasets` packages:

In [None]:
!pip install transformers datasets



In [None]:
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from collections import OrderedDict
from time import time
from typing import List
from tqdm.auto import tqdm
from datasets import load_metric
from dataclasses import dataclass
from torch.utils.data import DataLoader
from torch.utils.data.dataset import Dataset
from transformers import (AutoTokenizer, 
                          AutoModel,
                          AutoModelForSequenceClassification, 
                          AutoModelForTokenClassification,
                          AutoModelForQuestionAnswering,
                          AutoConfig,
                          BertConfig,
                          BertModel,
                          PreTrainedModel,
                          PreTrainedTokenizer, 
                          BatchEncoding)

In Lecture 17, we walked through how we can use pretrained and finetuned BERT for sequence classification (https://huggingface.co/transformers/task_summary.html#sequence-classification) and token classification (https://huggingface.co/transformers/task_summary.html#extractive-question-answering).
Recall that `bert-base-cased-finetuned-mrpc` denotes a pretrained `bert-base-cased` model that has been finetuned on the `mrpc` dataset.

**Problem 1.1** *(10 points)* Put your favorite emoji here 😇
https://getemoji.com/

Your favorite emoji: 💯

## 2. Sequence Classification with BERT
**Problem 2.1** *(20 points)* The tutorial at https://huggingface.co/transformers/training.html#fine-tuning-in-native-pytorch shows how to finetune a sequence classification model from `bert-base-cased` on the IMDB dataset. Repeat the same process with the SST-2 dataset and report the accuracy here (it is fine to copy and paste code from the documentation).

Note that you can load the SST-2 dataset via `load_dataset('glue', 'sst2')`.

**Answer to Problem 2.1**


Accuracy: About 90%

In [None]:
from datasets import load_dataset, load_metric
dataset = load_dataset('glue', 'sst2')
metric = load_metric('accuracy')

Reusing dataset glue (/home/swryu/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


The dataset does not have labels for `test` data so please use `validation` data as your test data. 

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2).to(device)

train_dataset = dataset['train']
valid_dataset = dataset['validation']
train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=64)
valid_loader = torch.utils.data.DataLoader(valid_dataset)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
CELoss = nn.CrossEntropyLoss()

train_loss_list = []

model.train()
for epoch in range(3):
  train_loss = 0
  for batch_idx, data in enumerate(tqdm(train_loader)):
    feature = data['sentence']
    label = data['label'].to(device)
    tknz_batch = tokenizer(feature,
                          padding='longest',
                          truncation=True)
    tknz_batch = {k: torch.tensor(v).to(device) for k, v in tknz_batch.items()}
    outputs = model(**tknz_batch)
    loss = CELoss(outputs.logits, label)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    train_loss += loss.item()  # accumulate a Python float so the autograd graph is not kept alive

    if batch_idx % 50 == 0:
      print('Epoch: %03d/%03d | Batch %04d/%04d | Loss: %.4f' 
      %(epoch+1, 3, batch_idx, len(train_loader), loss))
          
  train_loss_list.append(train_loss/len(train_loader))

print(f"Loss change: {train_loss_list[0]} -> {train_loss_list[1]} -> {train_loss_list[2]}")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

Epoch: 001/003 | Batch 0000/1053 | Loss: 1.1252
Epoch: 001/003 | Batch 0050/1053 | Loss: 0.3296
Epoch: 001/003 | Batch 0100/1053 | Loss: 0.2505
Epoch: 001/003 | Batch 0150/1053 | Loss: 0.2783
Epoch: 001/003 | Batch 0200/1053 | Loss: 0.2572
Epoch: 001/003 | Batch 0250/1053 | Loss: 0.2584
Epoch: 001/003 | Batch 0300/1053 | Loss: 0.1243
Epoch: 001/003 | Batch 0350/1053 | Loss: 0.2006
Epoch: 001/003 | Batch 0400/1053 | Loss: 0.3084
Epoch: 001/003 | Batch 0450/1053 | Loss: 0.2515
Epoch: 001/003 | Batch 0500/1053 | Loss: 0.2149
Epoch: 001/003 | Batch 0550/1053 | Loss: 0.0898
Epoch: 001/003 | Batch 0600/1053 | Loss: 0.1927
Epoch: 001/003 | Batch 0650/1053 | Loss: 0.2424
Epoch: 001/003 | Batch 0700/1053 | Loss: 0.1238
Epoch: 001/003 | Batch 0750/1053 | Loss: 0.2105
Epoch: 001/003 | Batch 0800/1053 | Loss: 0.1653
Epoch: 001/003 | Batch 0850/1053 | Loss: 0.2956
Epoch: 001/003 | Batch 0900/1053 | Loss: 0.3308
Epoch: 001/003 | Batch 0950/1053 | Loss: 0.1293
Epoch: 001/003 | Batch 1000/1053 | Loss:

Epoch: 002/003 | Batch 0000/1053 | Loss: 0.1716
Epoch: 002/003 | Batch 0050/1053 | Loss: 0.1266
Epoch: 002/003 | Batch 0100/1053 | Loss: 0.0574
Epoch: 002/003 | Batch 0150/1053 | Loss: 0.0960
Epoch: 002/003 | Batch 0200/1053 | Loss: 0.1520
Epoch: 002/003 | Batch 0250/1053 | Loss: 0.0893
Epoch: 002/003 | Batch 0300/1053 | Loss: 0.0871
Epoch: 002/003 | Batch 0350/1053 | Loss: 0.0654
Epoch: 002/003 | Batch 0400/1053 | Loss: 0.0704
Epoch: 002/003 | Batch 0450/1053 | Loss: 0.1128
Epoch: 002/003 | Batch 0500/1053 | Loss: 0.0781
Epoch: 002/003 | Batch 0550/1053 | Loss: 0.1201
Epoch: 002/003 | Batch 0600/1053 | Loss: 0.0925
Epoch: 002/003 | Batch 0650/1053 | Loss: 0.1660
Epoch: 002/003 | Batch 0700/1053 | Loss: 0.1180
Epoch: 002/003 | Batch 0750/1053 | Loss: 0.1991
Epoch: 002/003 | Batch 0800/1053 | Loss: 0.1135
Epoch: 002/003 | Batch 0850/1053 | Loss: 0.1347
Epoch: 002/003 | Batch 0900/1053 | Loss: 0.1472
Epoch: 002/003 | Batch 0950/1053 | Loss: 0.1240
Epoch: 002/003 | Batch 1000/1053 | Loss:

Epoch: 003/003 | Batch 0000/1053 | Loss: 0.1127
Epoch: 003/003 | Batch 0050/1053 | Loss: 0.1796
Epoch: 003/003 | Batch 0100/1053 | Loss: 0.0777
Epoch: 003/003 | Batch 0150/1053 | Loss: 0.0701
Epoch: 003/003 | Batch 0200/1053 | Loss: 0.0067
Epoch: 003/003 | Batch 0250/1053 | Loss: 0.0254
Epoch: 003/003 | Batch 0300/1053 | Loss: 0.0749
Epoch: 003/003 | Batch 0350/1053 | Loss: 0.1357
Epoch: 003/003 | Batch 0400/1053 | Loss: 0.1171
Epoch: 003/003 | Batch 0450/1053 | Loss: 0.0492
Epoch: 003/003 | Batch 0500/1053 | Loss: 0.0633
Epoch: 003/003 | Batch 0550/1053 | Loss: 0.0316
Epoch: 003/003 | Batch 0600/1053 | Loss: 0.0571
Epoch: 003/003 | Batch 0650/1053 | Loss: 0.1549
Epoch: 003/003 | Batch 0700/1053 | Loss: 0.0725
Epoch: 003/003 | Batch 0750/1053 | Loss: 0.0603
Epoch: 003/003 | Batch 0800/1053 | Loss: 0.0708
Epoch: 003/003 | Batch 0850/1053 | Loss: 0.0311
Epoch: 003/003 | Batch 0900/1053 | Loss: 0.0198
Epoch: 003/003 | Batch 0950/1053 | Loss: 0.0447
Epoch: 003/003 | Batch 1000/1053 | Loss:

In [None]:
model.eval()
for data in valid_loader:
  feature = data['sentence']
  label = data['label'].to(device)
  tknz_batch = tokenizer(feature, 
                         padding='longest',
                         truncation=True)
  tknz_batch = {k: torch.tensor(v).to(device) for k, v in tknz_batch.items()}
  with torch.no_grad(): 
    outputs = model(**tknz_batch)
    
  logits = outputs.logits
  predictions = torch.argmax(logits, dim=-1)
  metric.add_batch(predictions=predictions, references=label)

metric.compute()  

{'accuracy': 0.9013761467889908}


**Problem 2.2** *(10 points)* How does your accuracy with BERT compare to your accuracy with LSTM in Assignment 1? How about training speed?


**Answer to Problem 2.2** In Assignment 1, the accuracy with the LSTM was about 80%, and even with dropout the best accuracy I got was about 83%. With BERT, I could easily improve the accuracy by about 7 percentage points. This shows the power of a model pretrained on a large corpus. However, the gain in accuracy came at the expense of training time: training took much longer than with the LSTM, because BERT is a much larger model built from a stack of Transformer encoder layers.
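
To make the comparison above concrete, the short sketch below separates the absolute gain (in percentage points) from the relative reduction in error rate. The LSTM figure is the approximate 83% quoted in the answer and the BERT figure is the validation accuracy reported in Problem 2.1; the helper name `compare_accuracy` is made up for this illustration.

```python
def compare_accuracy(baseline: float, improved: float):
    """Return (absolute gain in percentage points, relative error reduction)."""
    gain_points = (improved - baseline) * 100
    baseline_error = 1.0 - baseline
    improved_error = 1.0 - improved
    error_reduction = (baseline_error - improved_error) / baseline_error
    return gain_points, error_reduction


lstm_acc = 0.83    # approximate best LSTM accuracy quoted in the answer
bert_acc = 0.9014  # BERT validation accuracy from Problem 2.1 (rounded)

gain_points, error_reduction = compare_accuracy(lstm_acc, bert_acc)
print(f"absolute gain: {gain_points:.1f} percentage points")
print(f"relative error reduction: {error_reduction:.0%}")
```

Reading the result as an error reduction is only one possible framing; it is useful here because a gain of a few points near 90% accuracy removes a large share of the remaining mistakes.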



**Problem 2.3** *(10 points)* Try your own sentences and find three failure cases. Explain why you think the model got them wrong.

**Answer to Problem 2.3** The three examples below are admittedly extreme and somewhat nonsensical. Still, if someone writes a word that is crucial for predicting the sentiment with unusual casing, for example `not` as `nOT` or `sad` as `sAD`, the model cannot recover its meaning. For a human, it is trivial to judge that all three sentences are negative, yet the cased model gets every one of them wrong (zero accuracy). This is why, in some cases, an uncased model is the better choice.

In [None]:
sentence = ['I HATE THEM!',
            'I dO nOT LOVE YOU!',
            'I am really sAD hearing that another project assignment comes!']

tknz_batch = tokenizer(sentence, padding='longest', truncation=True)
tknz_batch = {k: torch.tensor(v) for k, v in tknz_batch.items()}

model.cpu().eval()
with torch.no_grad():
  outputs = model(**tknz_batch)

logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)
metric.compute(predictions=predictions, references=torch.LongTensor([0,0,0]))

{'accuracy': 0.0}
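
One plausible mechanism behind these failures is how a cased WordPiece vocabulary fragments unusually cased words. The sketch below implements greedy longest-match-first WordPiece splitting on a tiny, hypothetical vocabulary (`toy_vocab` is invented for illustration and is not the real `bert-base-cased` vocabulary, so the real tokenizer may split these words differently).

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real bert-base-cased vocabulary.
toy_vocab = {"not", "sad", "n", "s", "##O", "##T", "##A", "##D"}

print(wordpiece("not", toy_vocab))          # ['not']
print(wordpiece("nOT", toy_vocab))          # ['n', '##O', '##T']
print(wordpiece("nOT".lower(), toy_vocab))  # ['not']
```

In this toy setting, `nOT` is broken into pieces that share nothing with the single token `not`, so whatever the model learned about `not` during finetuning does not transfer; lowercasing first restores the familiar token.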

**Problem 2.4 (bonus)** *(20 points)*  Try `bert-base-uncased` and analyze if it makes any difference. What is the difference between `cased` and `uncased` in English? How about in Korean?

**Answer to Problem 2.4** Accuracy improved by about 0.7 percentage points (from 90.1% to 90.8%). This is less than I expected, but it is still an improvement. A likely reason is that the uncased model lowercases its input, so words that the cased model handled poorly because of their casing are now mapped to the same tokens. In English, `cased` keeps the distinction between uppercase and lowercase letters, whereas `uncased` lowercases the text before tokenization. Korean has no letter case, so the cased/uncased distinction does not apply. However, as far as I know, NLP on Korean is difficult for other reasons. For example, Korean words can take many colloquial forms (구어체), and because of honorific speech (경어체), it is hard to treat them consistently in context.
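
The cased/uncased distinction can be illustrated with the standard library alone. The sketch below lowercases text and drops combining accent marks, which is, as far as I understand, roughly what the preprocessing for uncased BERT checkpoints does; the function name `uncased_normalize` is made up here and this is not the library's actual implementation.

```python
import unicodedata


def uncased_normalize(text: str) -> str:
    """Lowercase, then drop combining accent marks (uncased-style preprocessing sketch)."""
    text = text.lower()
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("I dO nOT LOVE YOU!"))  # i do not love you!
print(uncased_normalize("Café"))                # cafe

korean = "한국어"
# Hangul has no letter case, so lower() and upper() leave it unchanged.
print(korean.lower() == korean, korean.upper() == korean)
# NFD decomposition splits each Hangul syllable into its jamo, so the string gets longer.
print(len(korean), len(unicodedata.normalize("NFD", korean)))
```

The last line hints at a caveat: although casing itself is irrelevant for Korean, the decomposition step in this kind of normalization is not neutral for Hangul, because syllable blocks are split into individual jamo.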

In [None]:
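tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2).to(device)
# Rebuild the optimizer so that it updates the new model's parameters
# (the optimizer created in Problem 2.1 still refers to the old cased model).
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
train_loss_list = []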

model.train()
for epoch in range(3):
  train_loss = 0
  for batch_idx, data in enumerate(tqdm(train_loader)):
    feature = data['sentence']
    label = data['label'].to(device)
    tknz_batch = tokenizer(feature,
                          padding='longest',
                          truncation=True)
    tknz_batch = {k: torch.tensor(v).to(device) for k, v in tknz_batch.items()}
    outputs = model(**tknz_batch)
    loss = CELoss(outputs.logits, label)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    train_loss += loss.item()  # accumulate a Python float so the autograd graph is not kept alive

    if batch_idx % 50 == 0:
      print('Epoch: %03d/%03d | Batch %04d/%04d | Loss: %.4f' 
      %(epoch+1, 3, batch_idx, len(train_loader), loss))
          
  train_loss_list.append(train_loss/len(train_loader))

print(f"Loss change: {train_loss_list[0]} -> {train_loss_list[1]} -> {train_loss_list[2]}")

model.eval()
for data in valid_loader:
  feature = data['sentence']
  label = data['label'].to(device)
  tknz_batch = tokenizer(feature, 
                         padding='longest',
                         truncation=True)
  tknz_batch = {k: torch.tensor(v).to(device) for k, v in tknz_batch.items()}
  with torch.no_grad(): 
    outputs = model(**tknz_batch)
    
  logits = outputs.logits
  predictions = torch.argmax(logits, dim=-1)
  metric.add_batch(predictions=predictions, references=label)

metric.compute()  

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Epoch: 001/003 | Batch 0000/1053 | Loss: 0.6911
Epoch: 001/003 | Batch 0050/1053 | Loss: 0.2858
Epoch: 001/003 | Batch 0100/1053 | Loss: 0.2995
Epoch: 001/003 | Batch 0150/1053 | Loss: 0.1466
Epoch: 001/003 | Batch 0200/1053 | Loss: 0.1593
Epoch: 001/003 | Batch 0250/1053 | Loss: 0.1977
Epoch: 001/003 | Batch 0300/1053 | Loss: 0.1251
Epoch: 001/003 | Batch 0350/1053 | Loss: 0.1702
Epoch: 001/003 | Batch 0400/1053 | Loss: 0.2936
Epoch: 001/003 | Batch 0450/1053 | Loss: 0.2211
Epoch: 001/003 | Batch 0500/1053 | Loss: 0.0708
Epoch: 001/003 | Batch 0550/1053 | Loss: 0.1532
Epoch: 001/003 | Batch 0600/1053 | Loss: 0.3943
Epoch: 001/003 | Batch 0650/1053 | Loss: 0.2596
Epoch: 001/003 | Batch 0700/1053 | Loss: 0.2320
Epoch: 001/003 | Batch 0750/1053 | Loss: 0.1723
Epoch: 001/003 | Batch 0800/1053 | Loss: 0.1279
Epoch: 001/003 | Batch 0850/1053 | Loss: 0.0784
Epoch: 001/003 | Batch 0900/1053 | Loss: 0.2068
Epoch: 001/003 | Batch 0950/1053 | Loss: 0.1719
Epoch: 001/003 | Batch 1000/1053 | Loss:

Epoch: 002/003 | Batch 0000/1053 | Loss: 0.0543
Epoch: 002/003 | Batch 0050/1053 | Loss: 0.0322
Epoch: 002/003 | Batch 0100/1053 | Loss: 0.0651
Epoch: 002/003 | Batch 0150/1053 | Loss: 0.0353
Epoch: 002/003 | Batch 0200/1053 | Loss: 0.0203
Epoch: 002/003 | Batch 0250/1053 | Loss: 0.0891
Epoch: 002/003 | Batch 0300/1053 | Loss: 0.1247
Epoch: 002/003 | Batch 0350/1053 | Loss: 0.0723
Epoch: 002/003 | Batch 0400/1053 | Loss: 0.2276
Epoch: 002/003 | Batch 0450/1053 | Loss: 0.0602
Epoch: 002/003 | Batch 0500/1053 | Loss: 0.0335
Epoch: 002/003 | Batch 0550/1053 | Loss: 0.0879
Epoch: 002/003 | Batch 0600/1053 | Loss: 0.0632
Epoch: 002/003 | Batch 0650/1053 | Loss: 0.1487
Epoch: 002/003 | Batch 0700/1053 | Loss: 0.0422
Epoch: 002/003 | Batch 0750/1053 | Loss: 0.0198
Epoch: 002/003 | Batch 0800/1053 | Loss: 0.0819
Epoch: 002/003 | Batch 0850/1053 | Loss: 0.2010
Epoch: 002/003 | Batch 0900/1053 | Loss: 0.0478
Epoch: 002/003 | Batch 0950/1053 | Loss: 0.0468
Epoch: 002/003 | Batch 1000/1053 | Loss:

Epoch: 003/003 | Batch 0000/1053 | Loss: 0.0600
Epoch: 003/003 | Batch 0050/1053 | Loss: 0.0372
Epoch: 003/003 | Batch 0100/1053 | Loss: 0.0378
Epoch: 003/003 | Batch 0150/1053 | Loss: 0.0511
Epoch: 003/003 | Batch 0200/1053 | Loss: 0.0228
Epoch: 003/003 | Batch 0250/1053 | Loss: 0.0741
Epoch: 003/003 | Batch 0300/1053 | Loss: 0.1339
Epoch: 003/003 | Batch 0350/1053 | Loss: 0.0446
Epoch: 003/003 | Batch 0400/1053 | Loss: 0.0121
Epoch: 003/003 | Batch 0450/1053 | Loss: 0.1015
Epoch: 003/003 | Batch 0500/1053 | Loss: 0.0179
Epoch: 003/003 | Batch 0550/1053 | Loss: 0.0621
Epoch: 003/003 | Batch 0600/1053 | Loss: 0.1289
Epoch: 003/003 | Batch 0650/1053 | Loss: 0.0181
Epoch: 003/003 | Batch 0700/1053 | Loss: 0.0629
Epoch: 003/003 | Batch 0750/1053 | Loss: 0.0511
Epoch: 003/003 | Batch 0800/1053 | Loss: 0.0083
Epoch: 003/003 | Batch 0850/1053 | Loss: 0.0907
Epoch: 003/003 | Batch 0900/1053 | Loss: 0.1407
Epoch: 003/003 | Batch 0950/1053 | Loss: 0.1674
Epoch: 003/003 | Batch 1000/1053 | Loss:

{'accuracy': 0.908256880733945}

## 3. Token Classification with BERT
**Problem 3.1** *(30 points)* Finetune your `bert-base-cased` model on the `squad` question answering dataset, following a procedure similar to Problem 2.1. Report your accuracy here. For now, if the input is longer than 256 words, take the first 256 words as the input and truncate the rest. You are allowed to copy any code from the documentation.  *Hint*: If you have difficulty with the implementation, take a peek at (but do not copy!) https://github.com/huggingface/transformers/tree/master/examples/pytorch/question-answering, though keep in mind that the answer extraction module there is quite complex. It is okay to keep it simple here and sacrifice a little accuracy.



**Answer to Problem 3.1** I obtained an exact match (EM) score of about 7.2 and an F1 score of about 10.8. I built the inputs in the form `question + context`, following the same procedure as in Problem 2.1. For a fair comparison, I excluded from the evaluation the examples whose answer could not be found within the first 256 tokens, since there are many cases where the answer appears after the 256th token.
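
For reference, the two scores mentioned above can be sketched in plain Python. The code below follows the general recipe of the SQuAD v1.1 evaluation (normalize both strings, then compare exactly or by token overlap); it is a simplified re-implementation written for illustration, not the code behind `load_metric('squad')`, and it returns values in [0, 1] rather than on a 0 to 100 scale.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in PUNCTUATION)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    common = Counter(pred_tokens) & Counter(true_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


pred = "split with Luckett and Rober"
gold = "split with Luckett and Roberson"
print(exact_match(pred, gold), f1_score(pred, gold))
```

A partially correct span such as the one above scores zero on EM but still earns partial credit on F1, which is why F1 is always at least as high as EM.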

In [None]:
dataset = load_dataset('squad')
squad_metric = load_metric('squad')

Reusing dataset squad (/home/swryu/.cache/huggingface/datasets/squad/plain_text/1.0.0/4fffa6cf76083860f85fa83486ec3028e7e32c342c218ff2a620fc6b2868483a)


In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

def start_end_index(squad_dataset, kind: str):
  """
  kind: 'train' or 'validation'
  I only use the first entry of 'text' in the references, because the validation set contains many redundant, overlapping answers.
  I also exclude the few cases whose tokenized 'text' is shorter than 3 tokens.
  """
  qc_list = []
  references = []
  text_dict = OrderedDict()
  start_end_index_dict = OrderedDict()

  dataset = squad_dataset[kind]
  total_data_cnt = len(dataset)
  strange_case_cnt = 0
  index_to_exclude = [] # Indices of the examples to exclude later.

  for i in range(total_data_cnt): # 87599 in training set, 10570 in validation set 
    data = dataset[i]
    question = data['question']
    context = data['context']
    text = data['answers']['text']
    reference_dict = dict()

    if len(text) == 0: # If there's no answer?
      continue

    elif len(text) > 1: # Multiple answers => heavy overlap => keep only the first answer
      temp_dict = dict()
      temp_answer_start = data['answers']['answer_start'][0]
      temp_answer_text = data['answers']['text'][0]
      temp_dict['answer_start'] = [temp_answer_start]
      temp_dict['text'] = [temp_answer_text]
      reference_dict['answers'] = temp_dict

    else:
      reference_dict['answers'] = data['answers']
    reference_dict['id'] = data['id']
    
    concatenated = tokenizer(question, context, truncation=True, max_length=256) 
    text_interest = text[0]
    text_tknz = tokenizer(text_interest)['input_ids']

    start = None
    end = None
    flag = False

    for index in range(len(concatenated['input_ids'])):
        
        if concatenated['input_ids'][index:index+(len(text_tknz)-2)] == text_tknz[1:-1]: # -2 & [1:-1]: because of [CLS] & [SEP] token
            flag = True
            start = index
            end = start + len(text_tknz) - 3  # Problem in here
            start_end_index_dict[i] = [start, end]

            references.append(reference_dict)
            qc_list.append(concatenated)

            assert tokenizer.decode(concatenated['input_ids'][index:index+(len(text_tknz)-2)]) == tokenizer.decode(text_tknz[1:-1]), 'Decoded strings are not matched.'
            break

        if (flag == False) & (index == len(concatenated['input_ids'])-1):
            # Example: 
            # tokenizer.decode([3325, 1114, 22311, 5912, 1105, 6284, 18608]), tokenizer.decode([3325, 1114, 22311, 5912, 1105, 6284, 1200])
            # ('split with Luckett and Roberson', 'split with Luckett and Rober')
            strange_case_cnt += 1
            index_to_exclude.append(i)
            
  print(f"Strange cases count: {strange_case_cnt}")

  # start_end_index_dict: answer start/end indices in the tokenized input
  # references: used later for metric evaluation
  # index_to_exclude: indices of the examples to exclude because of problems in the data itself
  # qc_list: tokenized question+context
  return start_end_index_dict, references, index_to_exclude, qc_list

start_time = time()
preprocessed_for_train = start_end_index(dataset, 'train')
preprocessed_for_valid = start_end_index(dataset, 'validation')
end_time = time()

print(f'elapsed time: {(end_time-start_time)/60}min')

Strange cases count: 1255
Strange cases count: 205
elapsed time: 1.2230188608169557min
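
The span search in `start_end_index` boils down to locating one list of token ids inside another and returning inclusive indices. A minimal standalone sketch of that idea is shown below; `find_span` is a hypothetical helper and the token ids are made up (101 and 102 stand in for `[CLS]` and `[SEP]`).

```python
from typing import List, Optional, Tuple


def find_span(haystack: List[int], needle: List[int]) -> Optional[Tuple[int, int]]:
    """Return the inclusive (start, end) of the first occurrence of needle, or None."""
    n = len(needle)
    if n == 0:
        return None  # an empty answer would otherwise match at position 0
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return start, start + n - 1  # inclusive end index
    return None


# Made-up token ids: 101 and 102 stand in for [CLS] and [SEP].
context_ids = [101, 7, 8, 102, 20, 21, 22, 23, 102]
answer_ids = [101, 21, 22, 102]                  # answer tokenized on its own
span = find_span(context_ids, answer_ids[1:-1])  # drop [CLS] and [SEP]
print(span)  # (5, 6)
```

With the special tokens still included in the answer length, the inclusive end is `start + len(answer_ids) - 3`, which matches the expression used in the function above. The search returns `None` when the answer, tokenized in isolation, does not appear verbatim in the tokenized context, which corresponds to the "strange cases" counted above.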


In [None]:
train_start_end = [preprocessed_for_train[0][k] for k, v in preprocessed_for_train[0].items() if k not in preprocessed_for_train[2]]
valid_start_end = [preprocessed_for_valid[0][k] for k, v in preprocessed_for_valid[0].items() if k not in preprocessed_for_valid[2]]

train_references = preprocessed_for_train[1]
valid_references = preprocessed_for_valid[1]

train_concatenated = preprocessed_for_train[3]
valid_concatenated = preprocessed_for_valid[3]

In [None]:
class TCdataset(Dataset):
  def __init__(self, concatenated_data: List, label_data):
    self.feature = concatenated_data
    self.label = label_data
    self.data = []
    self.concat()

  def concat(self):
    for i in range(len(self.feature)):
      temp = [self.feature[i], self.label[i]]
      self.data.append(temp)

  def __len__(self):
    return len(self.data)

  def __getitem__(self, index):
    return self.data[index]


@dataclass
class TCDatacollator:
  tokenizer: PreTrainedTokenizer
  max_length: int

  def _pad(self, encoded_input, max_length):
    current_input = encoded_input['input_ids']
    pad_needed = len(current_input) < max_length
    cut_needed = len(current_input) >= max_length

    if pad_needed:
        pad_length = max_length - len(current_input)
        encoded_input['input_ids'] = current_input + [self.tokenizer.pad_token_id] * pad_length  # pad with the pad token id (0 for BERT)
        encoded_input['attention_mask'] = encoded_input['attention_mask'] + [0] * pad_length
        encoded_input['token_type_ids'] = encoded_input['token_type_ids'] + [self.tokenizer.pad_token_type_id] * pad_length

    elif cut_needed:
        encoded_input['input_ids'] = current_input[0:max_length]
        encoded_input['attention_mask'] = encoded_input['attention_mask'][0:max_length]
        encoded_input['token_type_ids'] = encoded_input['token_type_ids'][0:max_length]

    return encoded_input

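  def masking_attention(self, encoded_input):
    # Zero the attention mask over the first segment ([CLS], question, first [SEP]),
    # i.e. every position before the first token whose token_type_id is 1.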
    token_type_ids = encoded_input['token_type_ids']
    end_idx = token_type_ids.index(1)
    for i in range(end_idx):
      encoded_input['attention_mask'][i] = 0

    return encoded_input

  def __call__(self, tokenized: List[List]):
    feature_data = []
    label_data = []

    for instance in tokenized:
      label_data.append(instance[1])
      feature_data.append(self.masking_attention(instance[0]))

    feature_batch = {}
    for encoded_input in feature_data:
      outputs_padded = self._pad(encoded_input, self.max_length)
      for key, values in outputs_padded.items():
        if key not in feature_batch:
          feature_batch[key] = []
        feature_batch[key].append(values)
    feature_batch = {
        key: torch.tensor(feature_batch[key]) for key in feature_batch.keys()
    }
    
    feature_batch = BatchEncoding(feature_batch)
    label_data = torch.LongTensor(label_data)

    return feature_batch, label_data
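
The padding step of the collator can be illustrated in isolation. The sketch below is a variation rather than a copy of `_pad`: it pads each batch to its longest sequence (dynamic padding) instead of a fixed `max_length`, and the helper name `pad_batch` and the token ids are invented for the example.

```python
from typing import Dict, List


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one and build the attention mask."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[101, 5, 6, 102], [101, 7, 102]])
print(padded["input_ids"])       # [[101, 5, 6, 102], [101, 7, 102, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding only to the longest sequence in the batch keeps tensors smaller than padding everything to a fixed length, which can reduce memory use and computation when most inputs are short.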

In [None]:
# class TCModel(nn.Module):
#   def __init__(self, model_args, config):
#     super(TCModel, self).__init__()
#     self.config = config
#     self.BertModel = Bert(model_args, config=self.config)
#     #self.start_proj = nn.Linear(self.config.hidden_size, 1)
#     #self.end_proj = nn.Linear(self.config.hidden_size, 1)

#   def forward(self, x):
#     output = self.BertModel(x)
#     #start_output = self.start_proj(output)
#     #end_output = self.end_proj(output)
    
#     return start_output, end_output

# class Bert(PreTrainedModel):
#     def __init__(self, model_args, config):
#         super(Bert, self).__init__(config)
#         self.model = AutoModelForQuestionAnswering.from_pretrained(model_args, config=config)

#     def forward(self, x):
#         output = self.model(**x)
#         hidden = output[2]
#         last_hidden = hidden[-1]
        
#         del output, hidden

#         return last_hidden

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
config = AutoConfig.from_pretrained('bert-base-cased')
                                    #output_hidden_states=True)
# model = TCModel('bert-base-cased', config).to(device)
model = AutoModelForQuestionAnswering.from_pretrained('bert-base-cased').to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
CELoss = nn.CrossEntropyLoss()

train_dataset = TCdataset(train_concatenated, train_start_end)
data_collator = TCDatacollator(tokenizer, config.max_position_embeddings)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                                batch_size=16,
                                                collate_fn=data_collator,
                                                shuffle=True,
                                                drop_last=True)

train_loss_list = []

model.train()
for epoch in range(3):
  train_loss = 0
  for batch_idx, (feature, label) in enumerate(tqdm(train_loader)):
    feature = feature.to(device)
    label = label.to(device)
    
    outputs = model(**feature, start_positions = label[:,0].unsqueeze(-1), 
                          end_positions = label[:,1].unsqueeze(-1))
    loss = outputs.loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    train_loss += loss.item()  # accumulate a Python float so the autograd graph is not kept alive

    if batch_idx % 50 == 0:
      print('Epoch: %03d/%03d | Batch %04d/%04d | Loss: %.4f' 
      %(epoch+1, 3, batch_idx, len(train_loader), loss))
          
  train_loss_list.append(train_loss/len(train_loader))

#torch.save(model.state_dict(), '/home/swryu/BERT.pt')
print(f"Loss change: {train_loss_list[0]} -> {train_loss_list[1]} -> {train_loss_list[2]}")

del train_loader, train_dataset

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and a

Epoch: 001/003 | Batch 0000/5396 | Loss: 6.1880
Epoch: 001/003 | Batch 0050/5396 | Loss: 4.4782
Epoch: 001/003 | Batch 0100/5396 | Loss: 3.8132
Epoch: 001/003 | Batch 0150/5396 | Loss: 3.7035
Epoch: 001/003 | Batch 0200/5396 | Loss: 3.5584
Epoch: 001/003 | Batch 0250/5396 | Loss: 3.4089
Epoch: 001/003 | Batch 0300/5396 | Loss: 3.5540
Epoch: 001/003 | Batch 0350/5396 | Loss: 3.6111
Epoch: 001/003 | Batch 0400/5396 | Loss: 3.5176
Epoch: 001/003 | Batch 0450/5396 | Loss: 3.0958
Epoch: 001/003 | Batch 0500/5396 | Loss: 3.7522
Epoch: 001/003 | Batch 0550/5396 | Loss: 3.7506
Epoch: 001/003 | Batch 0600/5396 | Loss: 2.9920
Epoch: 001/003 | Batch 0650/5396 | Loss: 3.6073
Epoch: 001/003 | Batch 0700/5396 | Loss: 3.8295
Epoch: 001/003 | Batch 0750/5396 | Loss: 3.1477
Epoch: 001/003 | Batch 0800/5396 | Loss: 3.5156
Epoch: 001/003 | Batch 0850/5396 | Loss: 2.8582
Epoch: 001/003 | Batch 0900/5396 | Loss: 3.6547
Epoch: 001/003 | Batch 0950/5396 | Loss: 3.7198
Epoch: 001/003 | Batch 1000/5396 | Loss:

Epoch: 002/003 | Batch 0000/5396 | Loss: 3.4899
Epoch: 002/003 | Batch 0050/5396 | Loss: 2.9654
Epoch: 002/003 | Batch 0100/5396 | Loss: 3.1698
Epoch: 002/003 | Batch 0150/5396 | Loss: 3.2676
Epoch: 002/003 | Batch 0200/5396 | Loss: 3.2993
Epoch: 002/003 | Batch 0250/5396 | Loss: 3.4782
Epoch: 002/003 | Batch 0300/5396 | Loss: 3.9385
Epoch: 002/003 | Batch 0350/5396 | Loss: 3.2470
Epoch: 002/003 | Batch 0400/5396 | Loss: 3.4119
Epoch: 002/003 | Batch 0450/5396 | Loss: 3.4544
Epoch: 002/003 | Batch 0500/5396 | Loss: 3.0175
Epoch: 002/003 | Batch 0550/5396 | Loss: 3.2446
Epoch: 002/003 | Batch 0600/5396 | Loss: 3.2799
Epoch: 002/003 | Batch 0650/5396 | Loss: 3.7610
Epoch: 002/003 | Batch 0700/5396 | Loss: 2.8513
Epoch: 002/003 | Batch 0750/5396 | Loss: 3.5548
Epoch: 002/003 | Batch 0800/5396 | Loss: 3.6414
Epoch: 002/003 | Batch 0850/5396 | Loss: 3.2959
Epoch: 002/003 | Batch 0900/5396 | Loss: 3.5500
Epoch: 002/003 | Batch 0950/5396 | Loss: 3.2973
Epoch: 002/003 | Batch 1000/5396 | Loss:


Epoch: 003/003 | Batch 0000/5396 | Loss: 2.9613
Epoch: 003/003 | Batch 0050/5396 | Loss: 3.0108
Epoch: 003/003 | Batch 0100/5396 | Loss: 2.9963
Epoch: 003/003 | Batch 0150/5396 | Loss: 3.3332
Epoch: 003/003 | Batch 0200/5396 | Loss: 3.0031
Epoch: 003/003 | Batch 0250/5396 | Loss: 3.0275
Epoch: 003/003 | Batch 0300/5396 | Loss: 2.8364
Epoch: 003/003 | Batch 0350/5396 | Loss: 2.6646
Epoch: 003/003 | Batch 0400/5396 | Loss: 2.8803
Epoch: 003/003 | Batch 0450/5396 | Loss: 3.2381
Epoch: 003/003 | Batch 0500/5396 | Loss: 3.1962
Epoch: 003/003 | Batch 0550/5396 | Loss: 3.2599
Epoch: 003/003 | Batch 0600/5396 | Loss: 3.2583
Epoch: 003/003 | Batch 0650/5396 | Loss: 3.1052
Epoch: 003/003 | Batch 0700/5396 | Loss: 2.5415
Epoch: 003/003 | Batch 0750/5396 | Loss: 2.9614
Epoch: 003/003 | Batch 0800/5396 | Loss: 3.1267
Epoch: 003/003 | Batch 0850/5396 | Loss: 3.1132
Epoch: 003/003 | Batch 0900/5396 | Loss: 3.1599
Epoch: 003/003 | Batch 0950/5396 | Loss: 2.8251
Epoch: 003/003 | Batch 1000/5396 | Loss:

In [None]:
predictions = []
valid_dataset = TCdataset(valid_concatenated, valid_start_end)
valid_loader = torch.utils.data.DataLoader(dataset=valid_dataset,
                                           batch_size=1,
                                           collate_fn=data_collator,
                                           shuffle = False,
                                           drop_last=True)
model.eval()
for batch_idx, (feature, _) in enumerate(tqdm(valid_loader)):
    pred_dict = dict()
    # Keep the token ids so the predicted span can be decoded back to text.
    feature_input_ids = feature['input_ids'][0]

    with torch.no_grad():
        feature = feature.to(device)

        outputs = model(**feature)
        start_logit = outputs.start_logits
        end_logit = outputs.end_logits

        # Pick the most likely start and end positions independently.
        start_idx = torch.argmax(start_logit, dim=1)[0].item()
        end_idx = torch.argmax(end_logit, dim=1)[0].item()

        if start_idx <= end_idx:
            words = tokenizer.decode(feature_input_ids[start_idx:end_idx+1])
            pred_dict['prediction_text'] = words
        else:
            # Invalid span (start after end): use a placeholder so the entry still fits the metric format.
            pred_dict['prediction_text'] = 'wrong'

        pred_dict['id'] = valid_references[batch_idx]['id']
        predictions.append(pred_dict)
           
results = squad_metric.compute(predictions=predictions, references=valid_references)
print(results)



{'exact_match': 7.361312108055958, 'f1': 10.79437052268124}
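
The evaluation loop above takes the argmax of the start and end logits independently and falls back to a placeholder when the start lands after the end. A minimal sketch of one common alternative is a joint search over valid spans. The helper `best_span` and the `max_answer_len` cap are assumptions for illustration, not part of the original notebook, and the logits are plain Python lists here so the example runs without a model:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[start] + end_logits[end]
    subject to start <= end < start + max_answer_len."""
    best_score = float("-inf")
    best = (0, 0)
    for s, s_logit in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Independent argmax would give start=3, end=1 (an invalid span);
# the joint search returns a valid pair instead.
start_logits = [0.1, 0.2, 0.0, 2.0]
end_logits = [0.0, 3.0, 0.5, 1.0]
print(best_span(start_logits, end_logits))  # prints (1, 1)
```

In the loop above, this could presumably be called on `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`. Whether it would improve the reported scores has not been tested here.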


**Problem 3.2** *(10 points)* How does your question answering accuracy (F1 and EM) with BERT compare to your accuracy with LSTM and Attention in Assignment 2? How about training speed?


**Answer to Problem 3.2** BERT outperformed the LSTM+Attention model on both EM and F1. In particular, my F1 score with LSTM+Attention was below 1, while BERT reached an F1 about ten times higher (10.79). However, because BERT uses Transformer layers in its encoder, it takes much more computation time than the LSTM. Fine-tuning for 3 epochs took about one and a half hours.


**Problem 3.3** *(10 points)* Try your own context/questions and find three failure cases. Explain why you think the model got them wrong.

**Answer to Problem 3.3** All three cases I made failed to predict the answer exactly. The failure comes from the maximum sequence length of 256: once the question and context are concatenated, everything beyond that length is truncated. If the answer lies in the truncated part (as it does in all three cases below), the model cannot predict it no matter how well its parameters are trained. This is a limitation of the fixed input length, not a sign that the model is under-trained.
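
A minimal sketch of how this failure mode could be checked in isolation. It assumes a whitespace split as a stand-in for the real WordPiece tokenizer and counts only context tokens against `max_length`; both are simplifications. `find_sublist` and `answer_survives_truncation` are hypothetical helpers, not part of the notebook:

```python
def find_sublist(haystack, needle):
    """Return the first index where needle occurs in haystack, or -1."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


def answer_survives_truncation(context, answer, max_length=256):
    """Check whether the answer is still present after keeping only the
    first max_length whitespace tokens of the context (an assumption:
    the real model counts WordPiece tokens, including the question)."""
    kept = context.split()[:max_length]
    return find_sublist(kept, answer.split()) != -1


# Toy context: 300 filler tokens followed by the answer.
toy_context = " ".join(["filler"] * 300) + " the answer is NLP"
print(answer_survives_truncation(toy_context, "NLP", max_length=256))  # prints False
print(answer_survives_truncation(toy_context, "NLP", max_length=512))  # prints True
```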

In [None]:
question1 = 'Which class was the best class you took in a first semester of KAIST graduate life?'
context_base = 'I thankfully got a permission from KAIST, and I started to study in here from March, 2021. ' +\
'I took three classes in this semester. Natural Language Processing, Deep Learning, and Machine Learning for Health Care! ' +\
'I learned a lot of things in those  classes. In Natural Language Processing class, I leart a lot of general topics such as NER, QA, etc. ' +\
'And of course I learnt a lot of models such as RNN, LSTM, Seq2seq models, transformers and transformer variants, and large language models!. ' +\
'In Deep Learning courses, for the first half of the class, I learnt about Deep Learning basics. On the second half, ' +\
'I learnt about some image tasks and natural language processing tasks. ' +\
'For the last, in case of Machine Learning for Health Care class,I learnt a lot of backgrounds about medical tasks: ' +\
'MIMIC3 data, and a lot of visual, language, graphical models. '+\
'All those three classes were very nice. I satisfy at all those three classes. '+\
'Those three classes were really helpful for me though I took about 10 coding assignments. '+\
'Semester is not ended yet, and still have so much things to do. '

context1 = context_base + 'And of course the best class was the class taught by professor Min Joon Seo.'
context2 = context_base + 'I took three classes, and my answer is NLP.'
context3 = context_base + 'Natural Language Processing'

text1 = 'the class taught by professor Min Joon Seo'
text2 = 'NLP'
text3 = 'Natural Language Processing'

In [None]:
def find_index(context, text, max_length=256):
    # Encode (context, text) as a sentence pair, truncated to max_length tokens.
    qc_tknz = tokenizer(context, text, max_length=max_length, truncation=True, return_tensors='pt').to(device)
    text_tknz = tokenizer(text, return_tensors='pt').to(device)
    qc_input_ids = qc_tknz['input_ids'][0]
    qc_input_len = len(qc_input_ids)
    # Answer token ids without the [CLS] and [SEP] special tokens.
    text_ids = text_tknz['input_ids'][0][1:-1]
    text_len = len(text_ids)

    # Locate the answer span inside the encoded input, if it is present.
    start = None
    end = None
    for i in range(qc_input_len - text_len + 1):
        if torch.equal(qc_input_ids[i:i+text_len], text_ids):
            start = i
            end = i + text_len - 1
            break

    model.eval()
    model.to(device)
    with torch.no_grad():
        # Unpack the encoding so input_ids, token_type_ids and attention_mask are passed by name.
        outputs = model(**qc_tknz)
        start_idx = torch.argmax(outputs.start_logits, dim=1)[0].item()
        end_idx = torch.argmax(outputs.end_logits, dim=1)[0].item()
        output = tokenizer.decode(qc_input_ids[start_idx:end_idx+1])

        print(f'The real answer        : {text}')
        print(f'What the model expected: {output}')
        print('-----------------------------------')

    return None

In [None]:
find_index(context1, text1)
find_index(context2, text2)
find_index(context3, text3)

The real answer        : the class taught by professor Min Joon Seo
What the model expected: Natural Language Processing, Deep Learning, and Machine Learning for Health Care
-----------------------------------
The real answer        : NLP
What the model expected: Natural Language Processing, Deep Learning, and Machine Learning for Health Care
-----------------------------------
The real answer        : Natural Language Processing
What the model expected: Natural Language Processing, Deep Learning, and Machine Learning for Health Care
-----------------------------------


**Problem 3.4 (bonus)** *(20 points)* Can we do better than truncating tokens if the input length is too long? Suggest (but do not code) a strategy for a problem like SQuAD when the input has an arbitrary length with a pretrained model like BERT that has a predefined input length.

**Answer to Problem 3.4** I think performance can be improved with a sliding-window technique. For example, bert-base accepts at most 512 tokens, so real inputs have to be cut or padded to fit that length. Plain truncation keeps only the first 512 tokens and discards the rest, which means we lose potentially valuable information at the end of the input. Instead of discarding it, we can slide a window of length 512 from the front of the input to the back, run the model on each window, and then average the last-layer outputs over the windows. This way the information in the later part of the input is also taken into account.
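
A minimal sketch of only the windowing step of such a strategy, assuming the input is already a list of token ids. The helper `sliding_windows` and its `overlap` parameter are hypothetical names chosen for illustration. In practice the question tokens and special tokens would also have to fit in each window, and how to combine the per-window outputs is left open:

```python
def sliding_windows(token_ids, window_size=512, overlap=128):
    """Split token_ids into overlapping windows.

    Consecutive windows start `window_size - overlap` tokens apart, so
    neighbours share `overlap` tokens. Returns (offset, window) pairs,
    where offset is the position of the window's first token in token_ids.
    """
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    step = window_size - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, token_ids[start:start + window_size]))
        if start + window_size >= len(token_ids):
            break
        start += step
    return windows


# A toy input of 1000 "tokens" split into windows of at most 512.
toy_ids = list(range(1000))
for offset, window in sliding_windows(toy_ids):
    print(offset, len(window))
```

The overlap is there so that an answer straddling a window boundary still appears whole in at least one window, provided it is no longer than `overlap` tokens.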