<a href="https://colab.research.google.com/github/chantmk/NLP_2021/blob/main/HW10/HW10_BERT_finetuing_finished.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## HW10: BERT fine-tuning

In this exercise, you will learn how to fine-tune a transformer-based model. First, we provide a tutorial on fine-tuning distilBERT (https://arxiv.org/abs/1910.01108) on the Large Movie Review Dataset (IMDB dataset). After that, you have to complete the exercise by fine-tuning on the TRUE call-center dataset (HW5). This homework is based on the Hugging Face tutorial (https://huggingface.co/transformers/custom_datasets.html).

### 1. Install the transformers library from Hugging Face

In [None]:
# !pip install torch==1.4.0
!pip install transformers
!pip install pythainlp
!pip install sentencepiece

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl (2.0MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=753ecfc63db

### 2. Download Large Movie Review Dataset 

In [None]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

--2021-03-31 11:04:25--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2021-03-31 11:04:27 (48.5 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



### 3. Preprocess the dataset
The Large Movie Review Dataset is a dataset for binary sentiment classification. Each input is a movie review, and its sentiment (positive or negative) is the ground-truth label.

In [None]:
from pathlib import Path
from sklearn.model_selection import train_test_split
import numpy as np

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            # Compare strings with ==; `is` tests object identity and is unreliable here.
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

In [None]:
print("Unique label is {}, nb. of train data = {}, test_data = {}".format(np.unique(train_labels), len(train_texts), len(test_texts)))
for i in range(5):
  print("Data = {}".format(train_texts[i]))
  print("Label = {}".format(train_labels[i]))

Unique label is [0 1], nb. of train data = 20000, test_data = 25000
Data = Have you ever seen a movie made up entirely of long wide shots? No? Me, neither. Well, I've finally seen one in "Spring in my Hometown," and I must confess, now I KNOW why people don't do this. The technique is "arty," to be sure, but it's definitely NOT ripe for public consumption. The technique is heavily flawed simply because the viewer has no emotional attachment to the characters, and perhaps that might be the director's whole intentions. I don't know, I can't read minds, and I certainly don't know enough about the director to make a judgement.<br /><br />But one thing about this movie that IS painfully obvious is its ridiculous anti-American sentiments. As an American, I'm well aware of my country's participation in the Korean War, and I'm very well aware that we weren't always angels, but I'll be damn if I'll take this guy's version of how things happened. According to this blind fool, Americans were not 

After the dataset is loaded, we tokenize each input sentence. The tokenizer adds a start token '[CLS]' (id 101) at the beginning and a separator token '[SEP]' (id 102) at the end of each sentence. An out-of-vocabulary (OOV) word is mapped to token id 100. The tokenized output has the following format:

```python
{
  'input_ids': List[List[Int]]. List of token ids, one list per input sentence.
  'attention_mask' : List[List[Int]].  One mask per sentence: 1 for a real token, 0 for padding. See the tokenizer examples below.
}
```
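
To make the relationship between `input_ids` and `attention_mask` concrete, here is a minimal pure-Python sketch of the padding step. The helper `pad_batch` is hypothetical and not part of the transformers library; it only assumes that the padding id is 0, as in the tokenizer outputs shown in this notebook.

```python
# Hypothetical helper (not part of transformers): pad a batch of token-id
# lists to a common length and build the matching attention masks.
PAD_ID = 0  # assumed padding id, matching the tokenizer outputs in this notebook

def pad_batch(batch_ids, pad_id=PAD_ID):
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Token ids copied from the two-sentence tokenizer example in this notebook.
batch = pad_batch([
    [101, 7222, 6207, 6207, 7279, 100, 100, 102],
    [101, 1037, 1038, 102],
])
print(batch['input_ids'][1])       # [101, 1037, 1038, 102, 0, 0, 0, 0]
print(batch['attention_mask'][1])  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The shorter sentence is padded with zeros, and its mask is zero at exactly those positions.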

In [None]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [None]:
tokenizer([ '[CLS] a' ], truncation=True, padding=True)

{'input_ids': [[101, 101, 1037, 102]], 'attention_mask': [[1, 1, 1, 1]]}

In [None]:
tokenizer( ['Pine apple apple pen  หมา ไก่', 'a b'], truncation=True, padding=True)

{'input_ids': [[101, 7222, 6207, 6207, 7279, 100, 100, 102], [101, 1037, 1038, 102, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0]]}

In [None]:
a = tokenizer(train_texts[:2], truncation=True, padding=True)
print(a)

{'input_ids': [[101, 2031, 2017, 2412, 2464, 1037, 3185, 2081, 2039, 4498, 1997, 2146, 2898, 7171, 1029, 2053, 1029, 2033, 1010, 4445, 1012, 2092, 1010, 1045, 1005, 2310, 2633, 2464, 2028, 1999, 1000, 3500, 1999, 2026, 9627, 1010, 1000, 1998, 1045, 2442, 18766, 1010, 2085, 1045, 2113, 2339, 2111, 2123, 1005, 1056, 2079, 2023, 1012, 1996, 6028, 2003, 1000, 2396, 2100, 1010, 1000, 2000, 2022, 2469, 1010, 2021, 2009, 1005, 1055, 5791, 2025, 22503, 2005, 2270, 8381, 1012, 1996, 6028, 2003, 4600, 25077, 3432, 2138, 1996, 13972, 2038, 2053, 6832, 14449, 2000, 1996, 3494, 1010, 1998, 3383, 2008, 2453, 2022, 1996, 2472, 1005, 1055, 2878, 11174, 1012, 1045, 2123, 1005, 1056, 2113, 1010, 1045, 2064, 1005, 1056, 3191, 9273, 1010, 1998, 1045, 5121, 2123, 1005, 1056, 2113, 2438, 2055, 1996, 2472, 2000, 2191, 1037, 16646, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2021, 2028, 2518, 2055, 2023, 3185, 2008, 2003, 16267, 5793, 2003, 2049, 9951, 3424, 1011, 2137, 23541, 1012, 2004, 2019, 2137

In [None]:
# Pad or truncate every review to exactly 512 tokens.
# `padding='max_length'` replaces the deprecated `pad_to_max_length=True`.
tokenizer_kwargs = dict(add_special_tokens=True,
            max_length=512,
            padding='max_length',
            return_token_type_ids=True,
            truncation=True)
train_encodings = tokenizer(train_texts, **tokenizer_kwargs)
val_encodings = tokenizer(val_texts, **tokenizer_kwargs)
test_encodings = tokenizer(test_texts, **tokenizer_kwargs)



Convert the dataset into the training format. The input format expected by distilBERT is described at https://huggingface.co/transformers/model_doc/distilbert.html.

In [None]:
train_data = [np.array(train_encodings['input_ids']), np.array(train_encodings['attention_mask'])]
val_data = [np.array(val_encodings['input_ids']), np.array(val_encodings['attention_mask'])]
test_data = [np.array(test_encodings['input_ids']), np.array(test_encodings['attention_mask'])]

### 4. Model fine-tuning
The model we fine-tune is distilBERT (https://arxiv.org/abs/1910.01108), a smaller model distilled from the original BERT. Knowledge distillation is a well-known technique for improving a small (student) model: instead of training only on hard labels, the student learns to match the output probabilities (soft labels) of a larger teacher model, which carry the teacher's uncertainty across classes. If you want to know more about knowledge distillation, read https://arxiv.org/abs/1503.02531.
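
As an illustration of the soft-label idea, the sketch below computes a temperature-softened cross-entropy between a teacher and a student in pure Python. The logits, the temperature of 2.0, and the function names are assumptions made for this example. It shows only a soft-label term in the spirit of the distillation paper linked above, not the full objective used to train distilBERT.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Made-up logits for a 3-class problem.
teacher = [2.0, 0.5, -1.0]
close_student = [1.8, 0.6, -0.9]   # roughly agrees with the teacher
far_student = [-1.0, 0.5, 2.0]     # ranks the classes in the opposite order

soft_label = softmax(teacher, temperature=2.0)
print("soft label:", soft_label)   # a full distribution, unlike a hard label [1, 0, 0]
print("loss (close student):", soft_label_loss(close_student, teacher))
print("loss (far student):  ", soft_label_loss(far_student, teacher))
```

The student that agrees with the teacher gets the lower loss, and the soft label keeps information about how plausible the teacher found the non-top classes.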

#### Model Initialization

In [None]:
from transformers import DistilBertForSequenceClassification
import torch

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels= 2)
model = torch.nn.DataParallel(model.cuda(), device_ids=[0])

LEARNING_RATE =  1e-5
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

#### Set up training generator

In the previous lab you passed the data directly to `model.fit`. A more common way to feed data is to use a generator. It can be more memory-efficient than handing the whole dataset to `model.fit`, because the data is only loaded when the iterator asks for it. For example, you can set up the generator to load an image from a folder when it is called instead of storing all images in RAM. The example below is a simple generator that aggregates data points into batches. Both PyTorch and TensorFlow also have utility modules for creating such pipelines (`torch.utils.data.DataLoader` for PyTorch and `tf.data.Dataset` for TensorFlow).
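
The lazy-loading behaviour can be seen in a small self-contained sketch. The names `lazy_batches` and `fake_load` are hypothetical and exist only for this illustration; `fake_load` stands in for an expensive read from disk.

```python
def lazy_batches(load_item, n_items, bs=4):
    """Yield lists of up to `bs` items, loading each item only when it is needed."""
    batch = []
    for i in range(n_items):
        batch.append(load_item(i))  # the expensive load happens here, on demand
        if len(batch) == bs:
            yield batch
            batch = []
    if batch:  # final, smaller batch
        yield batch

loaded = []  # records which items have been loaded so far

def fake_load(i):
    # Stand-in for reading an image or a text file from disk.
    loaded.append(i)
    return i * 10

gen = lazy_batches(fake_load, n_items=10, bs=4)
first = next(gen)
print(first)   # [0, 10, 20, 30]
print(loaded)  # [0, 1, 2, 3]
```

After the first `next(gen)`, only the four items of the first batch have been loaded; the remaining items are loaded when later batches are requested.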

In [None]:
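from sklearn.utils import shuffle

def batch_data_generator(data, label, bs = 8, training = True):
  """Yield ([ids, masks], labels) batches; loops over the data forever when training."""
  epoch = 0
  while(True):
    X1 = []
    X2 = []
    Y = []
    # Use per-epoch variables instead of reassigning `label`: overwriting it
    # would misalign ids and labels from the second epoch onwards.
    ids, masks, labels = data[0], data[1], label
    if(training):
      # Change the seed every epoch so each pass sees a different order.
      ids, masks, labels = shuffle(ids, masks, labels, random_state = 42 + epoch)
    for a, b, c in zip(ids, masks, labels):
      X1.append(a)
      X2.append(b)
      Y.append(c)
      if(len(X1) == bs):
        yield [np.array(X1), np.array(X2)], np.array(Y)
        X1 = []
        X2 = []
        Y = []
    if(len(X1) > 0):
      # Emit the last, smaller batch.
      yield [np.array(X1), np.array(X2)], np.array(Y)
    if(not training):
      # A single pass is enough for evaluation; None marks the end.
      yield None
      break
    epoch += 1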


In [None]:
# np.int was removed in recent NumPy releases; use an explicit integer type.
train_generator = batch_data_generator(train_data, np.array(train_labels, dtype = np.int64), training = True)

In [None]:
dummy_generator = batch_data_generator(train_data, np.array(train_labels, dtype = np.int64), training = True)
X_dummy, Y_dummy = next(dummy_generator)
print(X_dummy[0].shape, X_dummy[1].shape, Y_dummy.shape)

(8, 512) (8, 512) (8,)


#### Start Fine-tuning

In [None]:
from collections import deque
from sklearn.metrics import accuracy_score
from tqdm.notebook import tqdm  # replaces the deprecated tqdm.tqdm_notebook

device = "cuda:0"

# Keep only the last 100 steps so the printed statistics are running averages.
train_acc_stat =  deque(maxlen = 100)
train_loss_stat =  deque(maxlen = 100)

for step in tqdm(range(1000)):
    X, Y = next(train_generator)
    ids = torch.tensor(X[0], dtype = torch.long, device = device)
    mask = torch.tensor(X[1], dtype = torch.long, device = device)
    targets = torch.tensor(Y, dtype = torch.long).to(device)

    optimizer.zero_grad()
    outputs = model(ids, mask)
    loss = loss_fn(outputs['logits'], targets)
    
    loss.backward()
    optimizer.step()

    with torch.no_grad():
      train_acc = accuracy_score(Y, outputs['logits'].argmax(axis = 1).cpu().detach().numpy() )
      train_loss = loss.cpu().detach().numpy()
      train_acc_stat.append(train_acc)
      train_loss_stat.append(train_loss)

    if (step + 1) %100==0:
      print("iter = {} train_acc = {}".format(step, np.array(train_acc_stat).mean()))
      print("iter = {} train_loss = {}".format(step, np.array(train_loss_stat).mean()))


    if (step + 1) %500==0:
      #validation step
      with torch.no_grad():
        val_generator = batch_data_generator(val_data, np.array(val_labels, dtype = np.int64), training = False)
        y_true = []
        y_pred = []
        while(True):
          d = next(val_generator)
          if(d is None): break
          X, Y = d
          ids = torch.tensor(X[0], dtype = torch.long, device = device)
          mask = torch.tensor(X[1], dtype = torch.long, device = device)
          outputs_cls = model(ids, mask)['logits'].argmax(axis = 1).cpu().detach().numpy()
          y_true.append(Y)
          y_pred.append(outputs_cls)
        y_true = np.concatenate(y_true)
        y_pred = np.concatenate(y_pred)
        print("val acc", accuracy_score(y_true, y_pred))


iter = 99 train_acc = 0.7475
iter = 99 train_loss = 0.5295358896255493
iter = 199 train_acc = 0.88125
iter = 199 train_loss = 0.3152180016040802
iter = 299 train_acc = 0.88
iter = 299 train_loss = 0.276839017868042
iter = 399 train_acc = 0.9025
iter = 399 train_loss = 0.24649973213672638
iter = 499 train_acc = 0.8975
iter = 499 train_loss = 0.2540905475616455
val acc 0.8808
iter = 599 train_acc = 0.89125
iter = 599 train_loss = 0.273439884185791
iter = 699 train_acc = 0.87375
iter = 699 train_loss = 0.30381909012794495
iter = 799 train_acc = 0.9075
iter = 799 train_loss = 0.24913911521434784
iter = 899 train_acc = 0.90875
iter = 899 train_loss = 0.22936420142650604
iter = 999 train_acc = 0.9025
iter = 999 train_loss = 0.24598951637744904
val acc 0.9082



## TODO 
Compare the classification performance of the non-transformer model with that of a model fine-tuned from pretrained WangchanBERTa on the TRUE call-center dataset (HW5). WangchanBERTa (https://arxiv.org/abs/2101.09635) is a RoBERTa (https://arxiv.org/abs/1907.11692) model trained on Thai text. RoBERTa is also supported in Hugging Face (https://huggingface.co/transformers/model_doc/roberta.html).

To successfully fine-tune WangchanBERTa on the TRUE call-center dataset, you should:

1. Preprocess the dataset into the same format as the tutorial.
2. Tokenize the input from 1. See (https://colab.research.google.com/drive/1Kbk6sBspZLwcnOE61adAQo30xxqOQ9ko?usp=sharing&fbclid=IwAR23b8ZEoP6YxlUx7wWEu7dRCrVcyTFrZb3YSgI-nsxe_t4gy-bh8Rv5R9E#scrollTo=kAcpAdkddVQ8) for more details.
3. Convert the tokenized input from 2. into a format that can be fed to the model.
4. Initialize WangchanBERTa (<b>choose the pretrained weights that match the tokenizer in 2.</b>).
5. Fine-tune the pretrained model.
6. (Optional) Before fine-tuning (before step 5), domain adaptation is often performed first by training a masked language model (masked LM). You can train a masked LM by following this guideline (https://huggingface.co/transformers/model_doc/bert.html#bertformaskedlm).
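
For the optional step 6, a masked language model is trained by hiding some tokens and asking the model to predict them. The sketch below shows only that masking step in pure Python. It is a simplification under stated assumptions: `MASK_ID`, the example token ids, and the function name `mask_tokens` are hypothetical and not taken from the WangchanBERTa vocabulary, and full BERT-style recipes also sometimes keep the original token or substitute a random one. In practice the `DataCollatorForLanguageModeling` class in transformers can prepare such inputs for you.

```python
import random

MASK_ID = 4          # hypothetical mask-token id, not taken from a real vocabulary
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_tokens(input_ids, special_ids, mask_prob=0.15, seed=0):
    """Return (masked_ids, labels) for one tokenized sentence."""
    rng = random.Random(seed)
    masked_ids, labels = [], []
    for tok in input_ids:
        if tok not in special_ids and rng.random() < mask_prob:
            masked_ids.append(MASK_ID)   # hide the token from the model
            labels.append(tok)           # and ask the model to predict it
        else:
            masked_ids.append(tok)
            labels.append(IGNORE_INDEX)  # no loss on unmasked positions
    return masked_ids, labels

# Hypothetical token ids; 5 and 6 stand in for the start and end tokens.
sentence = [5, 120, 87, 933, 41, 412, 77, 6]
# mask_prob is raised to 0.5 only so that this short example shows some masks.
masked_ids, labels = mask_tokens(sentence, special_ids={5, 6}, mask_prob=0.5)
print(masked_ids)
print(labels)
```

Special tokens are never masked, and the loss is computed only at masked positions because every other label is set to the ignored index.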

### Import data

In [None]:
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [None]:
!wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv

--2021-03-31 11:15:40--  https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.18, 2620:100:601a:18::a27d:712
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/37u83g55p19kvrl/clean-phone-data-for-students.csv [following]
--2021-03-31 11:15:41--  https://www.dropbox.com/s/raw/37u83g55p19kvrl/clean-phone-data-for-students.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucfe5a34779825f331468ad9d4d1.dl.dropboxusercontent.com/cd/0/inline/BLu0EqsTfkfJcjyH4Dge0pljaFl4b-HOYM2r_CFHJFD0aC4rQmGYdR3eYVxr1LR29of9NKAPdwmc3j58dRf_-5nsDfEqDd8iXfCt-ii2HxBPaRx9PCmvwOFtc0Ts6PTl_FYeIRljmoOy3IJYNBkxsDAk/file# [following]
--2021-03-31 11:15:41--  https://ucfe5a34779825f331468ad9d4d1.dl.dropboxusercontent.com/cd/0/inline/BLu0EqsTfkfJcjyH4Dge0pljaFl

In [None]:
data_df = pd.read_csv('clean-phone-data-for-students.csv')
data_df

Unnamed: 0,Sentence Utterance,Action,Object
0,<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counte...,enquire,payment
1,internet ยังความเร็วอยุ่เท่าไหร ครับ,enquire,package
2,ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้...,report,suspend
3,พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อ...,enquire,internet
4,ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโ...,report,phone_issues
...,...,...,...
16170,เชื่อมต่ออินเตอร์เน็ตไม่ได้ค่ะ,enquire,internet
16171,โทรออกต่างประเทศค่ะ,enquire,idd
16172,ยอดเงินเหลือเท่าไหร่ค่ะ,enquire,balance
16173,ยอดเงินในระบบ,enquire,balance


In [None]:
data_df.describe()

Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


### Preprocess

In [None]:
def lowerString(dataframe, column):
    newColumn = column + "_clean"
    dataframe[newColumn] = dataframe[column].str.lower().copy()
    return dataframe

def label2num(dataframe, column):
    uniqueLabel = dataframe[column].unique()
    label2numMap = dict(zip(uniqueLabel, range(len(uniqueLabel))))
    num2labelMap = dict(zip(range(len(uniqueLabel)), uniqueLabel))
    dataframe[column+"_id"] = dataframe[column].map(label2numMap)
    return dataframe, label2numMap, num2labelMap
    
def getTextAndLabel(data):
    text = list(data["Sentence Utterance"])
    label = list(data["Object_clean_id"])
    return text, label

In [None]:
clean_df = data_df.copy()
clean_df = clean_df.applymap(lambda x: x.strip())
clean_df = lowerString(clean_df, "Action")
clean_df = lowerString(clean_df, "Object")
clean_df = clean_df.drop_duplicates("Sentence Utterance", keep="first")

In [None]:
map_df = clean_df.copy()
map_df, l2n_object, n2l_object = label2num(map_df, "Object_clean")

In [None]:
text = list(map_df["Sentence Utterance"])
label = list(map_df["Object_clean_id"])

In [None]:
test_size = 0.2
val_size = 0.3
random_state = 42
train_texts, test_texts, train_labels, test_labels = train_test_split(text, label, test_size=test_size, random_state=random_state, stratify=label)
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=val_size, random_state=random_state, stratify=train_labels)

In [None]:
print("Unique label is {}, nb. of train data = {}, test_data = {}".format(np.unique(train_labels), len(train_texts), len(test_texts)))
for i in range(5):
  print("Data = {}".format(train_texts[i]))
  print("Label = {}".format(train_labels[i]))

Unique label is [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25], nb. of train data = 7485, test_data = 2674
Data = ต้องการสมัครแพคเกจของเค้าอ่ะครับ iNetน่ะครับ
Label = 1
Data = ถ้าอยากเติมเงิน กด *123 รึป่าวค่ะ
Label = 0
Data = สอบถามแพ็กเกจ อินเตอร์เน็ต 49 บาทไม่อั้น
Label = 1
Data = สอบถามโปรบีบีเหลือเท่าไหร่
Label = 7
Data = สอบถามเรื่อง อินเตอร์เน็ตบ้าน ใช้งานไม่ได้ครับ
Label = 3


### Tokenization

In [None]:
model_name="wangchanberta-base-att-spm-uncased"
wangchan_tok = AutoTokenizer.from_pretrained(
                f'airesearch/{model_name}',
                revision='main',
                model_max_length=416,)

In [None]:
# Pad or truncate every utterance to exactly 512 tokens.
# `padding='max_length'` replaces the deprecated `pad_to_max_length=True`.
tokenizer_kwargs = dict(add_special_tokens=True,
            max_length=512,
            padding='max_length',
            return_token_type_ids=True,
            truncation=True)
train_encodings = wangchan_tok(train_texts, **tokenizer_kwargs)
val_encodings = wangchan_tok(val_texts, **tokenizer_kwargs)
test_encodings = wangchan_tok(test_texts, **tokenizer_kwargs)



In [None]:
train_data = [np.array(train_encodings['input_ids']), np.array(train_encodings['attention_mask'])]
val_data = [np.array(val_encodings['input_ids']), np.array(val_encodings['attention_mask'])]
test_data = [np.array(test_encodings['input_ids']), np.array(test_encodings['attention_mask'])]

### Model

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(f'airesearch/{model_name}', num_labels=len(np.unique(train_labels)))
model = torch.nn.DataParallel(model.cuda(), device_ids=[0])

LEARNING_RATE =  1e-5
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

Some weights of the model checkpoint at airesearch/wangchanberta-base-att-spm-uncased were not used when initializing CamembertForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at airesearch/wa

In [None]:
# np.int was removed in recent NumPy releases; use an explicit integer type.
train_generator = batch_data_generator(train_data, np.array(train_labels, dtype = np.int64), training = True)

In [None]:
dummy_generator = batch_data_generator(train_data, np.array(train_labels, dtype = np.int64), training = True)
X_dummy, Y_dummy = next(dummy_generator)
print(X_dummy[0].shape, X_dummy[1].shape, Y_dummy.shape)

(8, 512) (8, 512) (8,)


#### Start Fine-tuning

In [None]:
from collections import deque
from sklearn.metrics import accuracy_score
from tqdm.notebook import tqdm  # replaces the deprecated tqdm.tqdm_notebook

device = "cuda:0"

# Keep only the last 100 steps so the printed statistics are running averages.
train_acc_stat =  deque(maxlen = 100)
train_loss_stat =  deque(maxlen = 100)

for step in tqdm(range(1000)):
    X, Y = next(train_generator)
    ids = torch.tensor(X[0], dtype = torch.long, device = device)
    mask = torch.tensor(X[1], dtype = torch.long, device = device)
    targets = torch.tensor(Y, dtype = torch.long).to(device)

    optimizer.zero_grad()
    outputs = model(ids, mask)
    loss = loss_fn(outputs['logits'], targets)
    
    loss.backward()
    optimizer.step()

    with torch.no_grad():
      train_acc = accuracy_score(Y, outputs['logits'].argmax(axis = 1).cpu().detach().numpy() )
      train_loss = loss.cpu().detach().numpy()
      train_acc_stat.append(train_acc)
      train_loss_stat.append(train_loss)

    if (step + 1) %100==0:
      print("iter = {} train_acc = {}".format(step, np.array(train_acc_stat).mean()))
      print("iter = {} train_loss = {}".format(step, np.array(train_loss_stat).mean()))


    if (step + 1) %500==0:
      #validation step
      with torch.no_grad():
        val_generator = batch_data_generator(val_data, np.array(val_labels, dtype = np.int64), training = False)
        y_true = []
        y_pred = []
        while(True):
          d = next(val_generator)
          if(d is None): break
          X, Y = d
          ids = torch.tensor(X[0], dtype = torch.long, device = device)
          mask = torch.tensor(X[1], dtype = torch.long, device = device)
          outputs_cls = model(ids, mask)['logits'].argmax(axis = 1).cpu().detach().numpy()
          y_true.append(Y)
          y_pred.append(outputs_cls)
        y_true = np.concatenate(y_true)
        y_pred = np.concatenate(y_pred)
        print("val acc", accuracy_score(y_true, y_pred))


iter = 99 train_acc = 0.1675
iter = 99 train_loss = 2.8831191062927246
iter = 199 train_acc = 0.255
iter = 199 train_loss = 2.6094448566436768
iter = 299 train_acc = 0.3425
iter = 299 train_loss = 2.3694663047790527
iter = 399 train_acc = 0.42125
iter = 399 train_loss = 2.0279488563537598
iter = 499 train_acc = 0.45375
iter = 499 train_loss = 1.8841148614883423
val acc 0.5028054862842892
iter = 599 train_acc = 0.54125
iter = 599 train_loss = 1.6460411548614502
iter = 699 train_acc = 0.5875
iter = 699 train_loss = 1.4599534273147583
iter = 799 train_acc = 0.58625
iter = 799 train_loss = 1.4296966791152954
iter = 899 train_acc = 0.60625
iter = 899 train_loss = 1.35552179813385
iter = 999 train_acc = 0.3115
iter = 999 train_loss = 2.665616512298584
val acc 0.5221321695760599



In [None]:
with torch.no_grad():
    test_generator = batch_data_generator(test_data, np.array(test_labels, dtype = np.int64), training = False)
    y_true = []
    y_pred = []
    while(True):
        d = next(test_generator)
        if(d is None): break
        X, Y = d
        ids = torch.tensor(X[0], dtype = torch.long, device = device)
        mask = torch.tensor(X[1], dtype = torch.long, device = device)
        outputs_cls = model(ids, mask)['logits'].argmax(axis = 1).cpu().detach().numpy()
        y_true.append(Y)
        y_pred.append(outputs_cls)
    y_true = np.concatenate(y_true)
    y_pred = np.concatenate(y_pred)
    print("test acc", accuracy_score(y_true, y_pred))

test acc 0.5362752430815259


The non-transformer model from HW6 reached an accuracy of 0.5846422338568935, whereas the model fine-tuned from the pre-trained weights reached a slightly lower test accuracy of 0.5362752430815259. This may be caused by a mismatch between the data used for pre-training and the TRUE call-center dataset: the call-center utterances contain many errors and are mostly spoken language, while the pre-trained weights are based mostly on written text.