##  HW10: BERT fintuning. 

In this exercise, you are going to learn how to perform fine-tuning on a transformer-based model. First, we will provide a tutorial on fine-tuning the Large Movie Review Dataset (IMDB dataset) using distilBERT (https://arxiv.org/abs/1910.01108). After that, you have to complete the exercise by fine-tuning on the TRUE call-center dataset (HW6). This homework is based on the Hugging Face tutorial (https://huggingface.co/transformers/custom_datasets.html).

### 1. Install transformers library form Hugging Face

In [1]:
# !pip install torch==1.4.0
!pip install transformers
!pip install pythainlp
!pip install sentencepiece

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl (2.0MB)
[K     |████████████████████████████████| 2.0MB 5.7MB/s 
Collecting sacremoses
[?25l  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
[K     |████████████████████████████████| 890kB 18.3MB/s 
Collecting tokenizers<0.11,>=0.10.1
[?25l  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
[K     |████████████████████████████████| 3.2MB 16.6MB/s 
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... [?25l[?25hdone
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=295d261b62b

### 2. Download Large Movie Review Dataset 

In [3]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

--2021-04-02 20:35:47--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2021-04-02 20:35:55 (10.5 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



### 3. Preprocess the dataset  
Large Movie Review Dataset  is a dataset for binary sentiment classification. The input of this dataset is a movie review with its sentiment as a ground truth

In [4]:
from pathlib import Path
from sklearn.model_selection import train_test_split
import numpy as np

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir is "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

In [5]:
print("Unique label is {}, nb. of train data = {}, test_data = {}".format(np.unique(train_labels), len(train_texts), len(test_texts)))
for i in range(5):
  print("Data = {}".format(train_texts[i]))
  print("Label = {}".format(train_labels[i]))

Unique label is [0 1], nb. of train data = 20000, test_data = 25000
Data = The man who directed 'The Third Man' also directed the 'Who Will Buy' sequence in "Oliver!" Now that is talent.<br /><br />I raise my hat to Carol Reed.<br /><br />I know there are 'second units' involved, but still ...<br /><br />And he had to deal with Orson Welles and Oliver Reed ...<br /><br />I suppose quality will out.<br /><br />(It does show in the final scene with Nancy [ avoiding spoiler - everyone has to see Oliver! for the first time sometime ].) How many lines do I need to type.<br /><br />Encouraging people to type too much is not to be encouraged.<br /><br />I hope this counts as the "10th line".
Label = 1
Data = A group of friends decide to take a camping trip into the desert-and find themselves stalked and murdered by a mysterious killer in a black pick-up truck."Mirage" is obviously inspired by Spielberg's "Duel" and Craven's "The Hills Have Eyes".Still this slasher yarn offers plenty of nasty 

After the dataset is processed, we tokenize each input sentence. This tokenizer has a start token of '[CLS'] (id 101) and a seperator token '[SEP]' (id 102) at the end of each sentence. If the word is an Out-of-vocabulary word (OOV), the token id is 100. The tokenized output has the following format :

```python
{
  'input_ids': List[List[Int]]. List of tokenized input sentence.
  'attention_mask' : List[List[Int]].  List of masked token. See cell [7] for example.
}
```

In [None]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=466062.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=28.0, style=ProgressStyle(description_w…




In [None]:
tokenizer([ '[CLS] a' ], truncation=True, padding=True)

{'input_ids': [[101, 101, 1037, 102]], 'attention_mask': [[1, 1, 1, 1]]}

In [None]:
tokenizer( ['Pine apple apple pen  หมา ไก่', 'a b'], truncation=True, padding=True)

{'input_ids': [[101, 7222, 6207, 6207, 7279, 100, 100, 102], [101, 1037, 1038, 102, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0]]}

In [None]:
a = tokenizer(train_texts[:2], truncation=True, padding=True)
print(a)

{'input_ids': [[101, 1000, 26137, 1000, 11652, 1996, 2474, 2335, 2013, 1996, 2392, 1997, 1996, 4966, 3482, 1012, 2027, 2442, 2031, 2042, 7727, 2000, 1996, 2755, 2008, 2107, 1037, 3374, 3538, 1997, 10231, 2001, 2412, 2207, 1012, 1996, 2143, 19223, 2105, 1037, 9129, 1997, 3057, 2040, 2031, 1037, 4295, 2029, 2749, 2068, 2000, 2468, 2064, 3490, 10264, 2015, 1010, 1998, 4028, 7036, 2111, 2074, 2000, 2994, 4142, 1012, 2037, 3096, 14113, 2015, 2125, 2802, 1996, 2143, 1010, 2057, 2036, 2156, 16574, 3456, 1010, 4641, 4385, 2008, 2024, 2055, 2004, 13359, 2004, 1037, 14414, 18001, 2371, 2275, 1012, 2045, 2003, 2019, 9643, 2843, 1997, 3331, 1038, 1008, 2222, 1008, 1008, 29535, 1010, 1037, 2978, 1997, 2529, 12846, 1998, 2070, 6881, 11798, 4477, 15775, 2361, 2040, 17727, 6935, 5644, 1996, 9015, 2545, 1997, 2056, 3096, 7355, 1999, 2010, 9346, 18019, 2000, 1037, 3242, 1010, 2077, 21690, 2068, 1999, 1996, 2132, 1010, 24494, 4691, 2068, 2046, 9017, 1012, 1012, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1

In [None]:
train_encodings = tokenizer(train_texts, add_special_tokens=True,
            max_length=512,
            pad_to_max_length=True,
            return_token_type_ids=True,truncation=True
        )
val_encodings = tokenizer(val_texts, add_special_tokens=True,
            max_length=512,
            pad_to_max_length=True,
            return_token_type_ids=True,truncation=True
        )
test_encodings = tokenizer(test_texts, add_special_tokens=True,
            max_length=512,
            pad_to_max_length=True,
            return_token_type_ids=True,truncation=True
        )



Convert the dataset into training format. You can see the training input format of distilBERT is in https://huggingface.co/transformers/model_doc/distilbert.html. 

In [None]:
train_data = [np.array(train_encodings['input_ids']), np.array(train_encodings['attention_mask'])]
val_data = [np.array(val_encodings['input_ids']), np.array(val_encodings['attention_mask'])]
test_data = [np.array(test_encodings['input_ids']), np.array(test_encodings['attention_mask'])]

### 4. Model fine-tuning
The model we used for fine-tuning is distilBERT (https://arxiv.org/abs/1910.01108), which is a smaller model distilled from the original BERT. Knowledge distillation is a well-known trick for improving the performance of a small model by learning an estimated uncertainty from a larger model instead of using a hard-label. If you want to know more about knowledge distillation, read https://arxiv.org/abs/1503.02531.

#### Model Initialization

In [None]:
from transformers import DistilBertForSequenceClassification
import torch

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels= 2)
model = torch.nn.DataParallel(model.cuda(), device_ids=[0])

LEARNING_RATE =  1e-5
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=442.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=267967963.0, style=ProgressStyle(descri…




Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

#### Set up training generator

In contrast to model.fit which you have used in the previous lab. A more common way to feed the data is to use a generator. It is more memory-efficient than model.fit as the data is only quired when the iterator executes. For example, you can set the generator to load the image from the folder when called instead of storing all of them in the RAM. An example below is a way to create a simple generator, which aggregate the data points into a batch. Both PyTorch and TensorFlow also has a utility module for creating a generator (torch.utils.data.DataLoader for Torch and tf.data.Dataset for Tensorflow) 

In [None]:
def batch_data_generator(data, label, bs = 8, training = True):
  while(True):
    X1= []
    X2 = []
    Y = []
    from sklearn.utils import shuffle
    ids, masks = data[0], data[1]
    if(training):
      ids, masks, label = shuffle(ids, masks, label, random_state = 42)
    for a, b, c in zip(ids, masks, label):
      X1.append(a)
      X2.append(b)
      Y.append(c)
      if(len(X1) == bs):
        yield [np.array(X1), np.array(X2)], np.array(Y)
        X1= []
        X2 = []
        Y = []
    if(len(X1) > 0):
      yield [np.array(X1), np.array(X2)], np.array(Y)
    if(not training):
      yield None
      break


In [None]:
train_generator = batch_data_generator(train_data, np.array(train_labels, dtype = np.int), training = True)

In [None]:
dummy_generator = batch_data_generator(train_data, np.array(train_labels, dtype = np.int), training = True)
X_dummy, Y_dummy = next(dummy_generator)
print(X_dummy[0].shape, X_dummy[1].shape, Y.shape)

(8, 512) (8, 512) (8,)


#### Start Fine-tuning

In [None]:
device = "cuda:0"
from tqdm import tqdm_notebook
from sklearn.metrics import accuracy_score
from collections import deque 

train_acc_stat =  deque(maxlen = 100)
train_loss_stat =  deque(maxlen = 100)

for step in  tqdm_notebook(range(1000)):
    X, Y = next(train_generator)
    ids = torch.tensor(X[0], dtype = torch.long, device = device)
    mask = torch.tensor(X[1], dtype = torch.long, device = device)
    targets = torch.tensor(Y, dtype = torch.long).to(device)

    optimizer.zero_grad()
    outputs = model(ids, mask)
    loss = loss_fn(outputs['logits'], targets)
    
    loss.backward()
    optimizer.step()

    with torch.no_grad():
      train_acc = accuracy_score(Y, outputs['logits'].argmax(axis = 1).cpu().detach().numpy() )
      train_loss = loss.cpu().detach().numpy()
      train_acc_stat.append(train_acc)
      train_loss_stat.append(train_loss)

    if (step + 1) %100==0:
      print("iter = {} train_acc = {}".format(step, np.array(train_acc_stat).mean()))
      print("iter = {} train_loss = {}".format(step, np.array(train_loss_stat).mean()))


    if (step + 1) %500==0:
      #validation step
      with torch.no_grad():
        val_generator = batch_data_generator(val_data, np.array(val_labels, dtype = np.int), training = False)
        y_true = []
        y_pred = []
        while(True):
          d = next(val_generator)
          if(d is None): break
          X, Y = d
          ids = torch.tensor(X[0], dtype = torch.long, device = device)
          mask = torch.tensor(X[1], dtype = torch.long, device = device)
          outputs_cls = model(ids, mask)['logits'].argmax(axis = 1).cpu().detach().numpy()
          y_true.append(Y)
          y_pred.append(outputs_cls)
        y_true = np.concatenate(y_true)
        y_pred = np.concatenate(y_pred)
        print("val acc", accuracy_score(y_true, y_pred))

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  if __name__ == '__main__':


HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))

iter = 99 train_acc = 0.62625
iter = 99 train_loss = 0.6432003974914551
iter = 199 train_acc = 0.8575
iter = 199 train_loss = 0.3508123755455017
iter = 299 train_acc = 0.87375
iter = 299 train_loss = 0.3153204619884491
iter = 399 train_acc = 0.90875
iter = 399 train_loss = 0.25545457005500793
iter = 499 train_acc = 0.91125
iter = 499 train_loss = 0.24945540726184845
val acc 0.892
iter = 599 train_acc = 0.8825
iter = 599 train_loss = 0.2910419702529907
iter = 699 train_acc = 0.88875
iter = 699 train_loss = 0.26341453194618225
iter = 799 train_acc = 0.9025
iter = 799 train_loss = 0.24996988475322723
iter = 899 train_acc = 0.90375
iter = 899 train_loss = 0.2513931393623352
iter = 999 train_acc = 0.91
iter = 999 train_loss = 0.2223140150308609
val acc 0.909



## TODO 
Compare the classification performance between the non-transformer model and the model fine-tuned using pretrained WangchanBERTa on TRUE call-center dataset (HW6). WangchanBERTa (https://arxiv.org/abs/2101.09635) is RoBERTa (https://arxiv.org/abs/1907.11692) trained on thai texts. RoBERTa is also supported in Hugging Face (https://huggingface.co/transformers/model_doc/roberta.html).

For this homework, you may focus only on the object tag.
To successfully fine-tune WangchanBERTa on the TRUE call-center dataset, you should:

1. Preprocess the dataset into the same format as the tutorial.
2. Tokenize the input from 1. See (https://colab.research.google.com/drive/1Kbk6sBspZLwcnOE61adAQo30xxqOQ9ko?usp=sharing&fbclid=IwAR23b8ZEoP6YxlUx7wWEu7dRCrVcyTFrZb3YSgI-nsxe_t4gy-bh8Rv5R9E#scrollTo=kAcpAdkddVQ8) for more details.
3. Process the tokenized input from 1. to the format that could be fed to the model.
4. Initialize WangchanBERTa (<b> you should choose the pretrained weight w.r.t. the tokenizer in 2.</b>)
5. Fine-tune the pretrained model.
6.  (Optional) Before fine-tuning is performed (before step 5), domain adaptation is often performed first by training a masked language model (maskLM). You can train maskLM by following this guideline (https://huggingface.co/transformers/model_doc/bert.html#bertformaskedlm).

In [6]:
!wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv

--2021-04-02 20:51:51--  https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/37u83g55p19kvrl/clean-phone-data-for-students.csv [following]
--2021-04-02 20:51:52--  https://www.dropbox.com/s/raw/37u83g55p19kvrl/clean-phone-data-for-students.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc83d47ee413a96f41abac76b257.dl.dropboxusercontent.com/cd/0/inline/BL4ic_qhxSwXpjV3mLFT7LJKAVrFQZzVUiBEhaKkdDpc7jb1gw6xbIZxlvcitNXUVUr45QZh456d7efPAMKu_N76YVN0yv8vptMK0nbxlXt68PjCnS-7YSJCQD2xBYRlp_crum7jMr2sMWZDRx7DdS7r/file# [following]
--2021-04-02 20:51:52--  https://uc83d47ee413a96f41abac76b257.dl.dropboxusercontent.com/cd/0/inline/BL4ic_qhxSwXpjV3mLFT7LJKAVr

## 1. Preprocess the dataset into the same format as the tutorial.

In [2]:
import pandas as pd
data_df = pd.read_csv('clean-phone-data-for-students.csv')
data_df.head()

Unnamed: 0,Sentence Utterance,Action,Object
0,<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counte...,enquire,payment
1,internet ยังความเร็วอยุ่เท่าไหร ครับ,enquire,package
2,ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้...,report,suspend
3,พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อ...,enquire,internet
4,ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโ...,report,phone_issues


In [4]:
# data cleaning
# to lower
data_df.Action = data_df.Action.str.lower().copy()
data_df.Object = data_df.Object.str.lower().copy()

# drop dup
data_df = data_df.drop_duplicates("Sentence Utterance", keep='first')

data_df = data_df.rename(columns={"Sentence Utterance": "input"})

# strip space before input 
data_df.input = data_df.input.str.strip()

data_df.to_csv('checkpoint.csv', index=False)

In [18]:
object_labels = data_df.Object.unique()

object_to_id = dict(zip(object_labels, range(len(object_labels))))
id_to_object = dict(zip(range(len(object_labels)), object_labels))

def cvt_label_id(label):
    return object_to_id[label]

In [19]:
from pathlib import Path
from sklearn.model_selection import train_test_split
import numpy as np

def read_truevoice_split(data_df):
    texts = list(data_df.input.array)
    labels = list(data_df.Object.apply(cvt_label_id).array)

    return texts, labels

all_texts, all_labels = read_truevoice_split(data_df)
train_texts, test_texts, train_labels, test_labels = train_test_split(all_texts.copy(), all_labels.copy(), test_size=0.2)
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

In [20]:
print("Unique label is {}, nb. of train data = {}, test_data = {}".format(np.unique(train_labels), len(train_texts), len(test_texts)))
for i in range(5):
  print("Data = {}".format(train_texts[i]))
  print("Label = {}".format(train_labels[i]))

Unique label is [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25], nb. of train data = 8568, test_data = 2678
Data = ต้องการจะเปิดใช้บริการค่ะ
Label = 5
Data = ค่ะ คือเครื่องนี้ยังไม่เคยลงทะเบียณใช้ ไวไฟ อ่ะค่ะ อยากใช้ ไวไฟ อ่ะค่ะ
Label = 3
Data = ช่วยเช็คชั่วโมง เน็ต ให้หน่อยว่าเหลือเท่าไหร่ ส่ง ข้อความ แล้วไม่มีตอบกลับ
Label = 18
Data = เข้าอินเตอร์เน็ตไม่ได้ค่ะของทรูมูฟ 3G ค่ะ
Label = 3
Data = จะสอบถามเรื่องซิมมือถือหายค่ะ
Label = 17


## 2. Tokenize the input

## 3. Process the tokenized input
( you should choose the pretrained weight w.r.t. the tokenizer in 2.)

## 4. Initialize WangchanBERTa

## 5. Fine-tune the pretrained model.