# Assignment 6 : Pretrain and Transfer Learning (20 pts)

### Before working on the assignment please read papers as following 
- SUPERVISED CONTRASTIVE LEARNING FOR PRE-TRAINED LANGUAGE MODEL FINE-TUNING
  - link: https://openreview.net/pdf?id=cu7IUiO
- Few-Shot Intent Detection via Contrastive Pre-Training and Fine-Tuning
  - link: https://arxiv.org/abs/2109.06349

# Question 5: Training on text classification task  on combine two losses Cross Entropy and Supervised Contrastive. (3.5 pts)

- Cross Entropy loss
$$
\mathcal{L}_\text{CE} =-\frac{1}{m} \sum_{i=1}^{m} yi \cdot log(\hat{yi})
$$

- Supervised Contrastive learning loss
$$
\mathcal{L}_\text{S_cl} = -\frac{1}{T}\sum_{i=1}^{N}\sum_{j=1}^{N} \boldsymbol{1_{yi=yj}}\enspace log \frac{e^{sim(hi,hj) / \tau}}{\sum_{n=1}^{N} e^{sim(hi,hn) / \tau}}
$$
     * detail 
       * ui ~ sentence i 
       * hi ~ BERT(ui) in our case using Roberta as a encoder
       * hi : (batch_size,sequence_len,embed_size)
       * hi is the output of model which is last hidden layers before classifier head in the model architecture
       * 1yi=yj ~ we select only the sample that come from the same class to compute in each i and j
       * T ~ the number of pairs that come from the same classes
       * $\tau$ ~ temperature parameter
       * Sim(x1,x2) : cosine similarity [-1, 1]
       - $\lambda'$ is just weighted of cross entropy loss 
       * Sim function is the cosine similarity 
       * N ~ the number of samples in a batch
$$
sim(A,B) = \cos{(\theta)} = \frac{A\cdot B}{|\!|A|\!||\!|B|\!|}
$$


- Loss total
$$
  \mathcal{L}_\text{total} = \mathcal{L}_\text{s_cl} + \lambda ' \mathcal{L}_{CE}
$$

* you can get cross entropy loss like below 
    * outputs = model(input_ids, labels=labels)
    * loss, logits = outputs[:2]
    * loss : this is cross entropy loss
      
- hint : for this question you will utilize the function CustomTextDataset to force dataloader to have at least one pair that come from the same class
     * eg. batch_size = 4 
     * the labels in a batch should be like [ 0, 21, 43, 0]  
     
5. training this model in the code below on loss_total by do experiment the same as question 4.1, 4.2, 4.3, 4.4, 4.5, 4.6

In [96]:
!pwd

/home/st121532/work/NLP/NLP/Assignment/solutions


In [1]:
import os
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import RobertaConfig, RobertaModel, RobertaTokenizer, RobertaForSequenceClassification
from transformers import AdamW
import random
from IPython.display import clear_output
from utils import create_supervised_pair, supervised_contrasive_loss, Similarity
import matplotlib.pyplot as plt
#comment this if you are not using puffer
os.environ['http_proxy'] = 'http://192.41.170.23:3128'
os.environ['https_proxy'] = 'http://192.41.170.23:3128'

## To download data from file directory both text samples and labels

In [2]:
def load_examples(file_path, do_lower_case=True):
    examples = []
    
    with open('{}/seq.in'.format(file_path),'r',encoding="utf-8") as f_text, open('{}/label'.format(file_path),'r',encoding="utf-8") as f_label:
        for text, label in zip(f_text, f_label):
            
            e = Inputexample(text.strip(),label=label.strip())
            examples.append(e)
            
    return examples

## Each sample has a sentence and label format

In [3]:
class Inputexample(object):
    def __init__(self,text_a,label = None):
        self.text = text_a
        self.label = label

In [4]:
# create custom dataset class
# ===  =  Hint =  ===
# can train on two condition 
# 1.) trainig training with supervise contrastive loss and cross entropy loss using in question 5.) 
#    when self.repeated_label == True:
# 2.) train only cross entropy loss use in question 4.)
#    when self.repeated_label == False:

class CustomTextDataset(Dataset):
    def __init__(self,labels,text,batch_size,repeated_label:bool=False):
        self.labels = labels
        self.text = text
        self.batch_size = batch_size 
        self.count = 0 
        self.batch_labels = []
        self.repeated_label = repeated_label
        
        if self.repeated_label == True:
            print("Train on Combine between Supervised Contrastive and Cross Entropy loss")
            
        else:
            print("Train on Cross Entropy loss")
            
        
        print("len of dataset :",len(self.labels))
              
     
          

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        
        
        
        # write code here for 1)
        if self.repeated_label == True:
        
            if len(np.unique(self.batch_labels)) == self.batch_size - 1:


                while True:
                    idx = np.random.choice(len(self.labels))

                    if self.labels[idx]  in self.batch_labels:

                       
                        break

        self.batch_labels.append(self.labels[idx])
        
        label = self.labels[idx]
        
        data = self.text[idx]
        
        sample = {"Class": label,"Text": data}


    
        return sample

### Train With Combine loss between Cross Entropy and SuperVised Contrastive loss 

In [34]:
def train_contrastive_learnig(model,optimizer,train_loader,tokenizer,valid_loader,device,epochs:int=30):
    
    print(" device using :",device)
    
    train_loss_hist = [] 
    valid_loss_hist = []

    test_acc = []

    min_valid_loss = np.inf
    
    train_loss = 0.0 

    for e in range(epochs):  # loop over the dataset multiple times
 
        running_loss = 0.0
        model.train()
        correct = 0
        for (idx, batch) in enumerate(train_loader):
            sentence = batch["Text"]
            inputs = tokenizer(sentence,padding=True,truncation=True,return_tensors="pt")


            #assert len(np.unique(batch["Class"])) < len(batch["Class"])  
            # move parameter to device
            inputs = {k:v.to(device) for k,v in inputs.items()}

            # map string labels to class idex
            labels = [label_maps[stringtoId] for stringtoId in (batch['Class'])]

            # convert list to tensor
            labels = torch.tensor(labels).unsqueeze(0)
            labels = labels.to(device)


             # clear gradients
            optimizer.zero_grad()
            
            
            outputs = model(**inputs,labels=labels,output_hidden_states=True)     
        
            hidden_states = outputs.hidden_states

            last_hidden_states = hidden_states[12]

            # https://stackoverflow.com/questions/63040954/how-to-extract-and-use-bert-encodings-of-sentences-for-text-similarity-among-sen 
            # (batch_size,seq_len,embed_dim)
            h = last_hidden_states[:,0,:]

            # create pair samples
            T, h_i, h_j, idx_yij = create_supervised_pair(h,batch['Class'],debug=False)

            if h_i is None:
                print("skip this batch")
                skip_time +=1
                continue

            # supervised contrastive loss 
            
            loss_s_cl = supervised_contrasive_loss(device,h_i, h_j, h, T,temp=temp,idx_yij=idx_yij,debug=False)

            # cross entropy loss
            loss_classify, logits = outputs[:2]

            # loss total
            loss = loss_s_cl + (lamda * loss_classify )

            # Calculate gradients
            loss.backward()

            # Update Weights
            optimizer.step()

            # Calculate Loss
            train_loss += loss.item()


        valid_loss = 0.0
        model.eval()     # Optional when not using Model Specific layer

        for (idx, batch) in enumerate(valid_loader):
            
            sentence = batch["Text"]
            inputs = tokenizer(sentence,padding=True,truncation=True,return_tensors="pt")


            #assert len(np.unique(batch["Class"])) < len(batch["Class"])  
            # move parameter to device
            inputs = {k:v.to(device) for k,v in inputs.items()}

            # map string labels to class idex
            labels = [label_maps[stringtoId] for stringtoId in (batch['Class'])]

            # convert list to tensor
            labels = torch.tensor(labels).unsqueeze(0)
            labels = labels.to(device)


             # clear gradients
            optimizer.zero_grad()
            
            
            outputs = model(**inputs,labels=labels,output_hidden_states=True)     
        
            hidden_states = outputs.hidden_states

            last_hidden_states = hidden_states[12]

            # https://stackoverflow.com/questions/63040954/how-to-extract-and-use-bert-encodings-of-sentences-for-text-similarity-among-sen 
            # (batch_size,seq_len,embed_dim)
            h = last_hidden_states[:,0,:]

            # create pair samples
            T, h_i, h_j, idx_yij = create_supervised_pair(h,batch['Class'],debug=False)

            if h_i is None:
                print("skip this batch")
                skip_time +=1
                continue

            # supervised contrastive loss 
            loss_s_cl = supervised_contrasive_loss(device,h_i, h_j, h, T,temp=temp,idx_yij=idx_yij,debug=False)

            # cross entropy loss
            loss_classify, logits = outputs[:2]

            # loss total
            loss = loss_s_cl + (lamda * loss_classify )
            
            # Calculate Loss
            valid_loss += loss.item()
            
        # 5.3 add code to collect loss 
        train_loss_hist.append(train_loss / len(train_loader)) 
        valid_loss_hist.append(valid_loss / len(valid_loader))

        print(f'Epoch {e+1} \t\t Training Loss: {train_loss / len(train_loader)} \t\t Validation Loss: {valid_loss / len(valid_loader)}')

        if min_valid_loss > valid_loss:
            print(f'Validation Loss Decreased({min_valid_loss:.6f}--->{valid_loss:.6f}) \t Saving The Model')
            min_valid_loss = valid_loss   
            
                    # Saving State Dict
            torch.save(model.state_dict(), 'saved_model.pth')
            
    return train_loss_hist, valid_loss_hist

## Define Parameters

In [17]:
N = 5
data = []
labels = []

train_samples = []
train_labels = []

valid_samples = []
valid_labels = []

test_samples = []
test_labels = []

embed_dim = 768
batch_size = 4 
lr= 1e-5  # you can adjust 
temp = 0.3  # you can adjust 
lamda = 0.01  # you can adjust  
skip_time = 0 # the number of time that yi not equal to yj in supervised contrastive loss equation 
device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')

### The Aim of these training is to fine tuning on few shot setting on text classification task

Path example of train, validation and test 

In [18]:
path_5shot = f'./HWU64/train_5/'
valid_path = f'./HWU64/valid/'
test_path = f'./HWU64/test/'

In [19]:
# Download data fewshot 
# https://downgit.github.io/#/home?url=https:%2F%2Fgithub.com%2Fjianguoz%2FFew-Shot-Intent-Detection%2Ftree%2Fmain%2FDatasets%2FHWU64

# load data
train_samples = load_examples(path_5shot)
valid_samples = load_examples(valid_path)
test_samples = load_examples(test_path)


print("===== small train set ====")

for i in range(len(train_samples)):
    data.append(train_samples[i].text)
    labels.append(train_samples[i].label)


train_data = CustomTextDataset(labels,data,batch_size=batch_size,repeated_label=True)
train_loader = DataLoader(train_data,batch_size=batch_size,shuffle=True)



print("===== validation set ====")

data = []
labels = []

for i in range(len(valid_samples)):
    data.append(valid_samples[i].text)
    labels.append(valid_samples[i].label)

valid_data = CustomTextDataset(labels,data,batch_size=batch_size,repeated_label=True)
valid_loader = DataLoader(valid_data,batch_size=batch_size,shuffle=True)


print("===== test set ====")

data = []
labels = []
    
for i in range(len(test_samples)):
    data.append(test_samples[i].text)
    labels.append(test_samples[i].label)

test_data = CustomTextDataset(labels,data,batch_size=batch_size,repeated_label=True)
test_loader = DataLoader(test_data,batch_size=batch_size,shuffle=True)



# got the number of unique classes from dataset
num_class = len(np.unique(np.array(labels)))

# get text label of uniqure classes
unique_label = np.unique(np.array(labels))

# map text label to index classes
label_maps = {unique_label[i]: i for i in range(len(unique_label))}

===== small train set ====
Train on Combine between Supervised Contrastive and Cross Entropy loss
len of dataset : 320
===== validation set ====
Train on Combine between Supervised Contrastive and Cross Entropy loss
len of dataset : 1076
===== test set ====
Train on Combine between Supervised Contrastive and Cross Entropy loss
len of dataset : 1076


### 5.1  freeze weight from pretrain model all layer except classifier 

Download Pretrain Model

In [60]:
# download config of Roberta config 
config = RobertaConfig.from_pretrained("roberta-base",output_hidden_states=True)

#chnage modifying the number of classes
config.num_labels = num_class
# Download pretrain models weight 
model = RobertaForSequenceClassification.from_pretrained('roberta-base')
# change from binary classification to muli-classification and loss automatically change to cross entropy loss
model.num_labels = config.num_labels
# change the output of last layer to num_class that we want to predict
model.classifier.out_proj = nn.Linear(in_features=embed_dim,out_features=num_class)
# move to model to device that we set
model = model.to(device)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifi

In [21]:
device

device(type='cuda', index=1)

In [22]:
# Download tokenizer that use to tokenize sentence into words by using Pretrain from roberta-base
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

In [61]:
## Fine-Tune  model on SuperVised Contrastive loss 
# 5.1 freeze weight from pretrain model all layer except classifier 
model = freeze_layers(model,freeze_layers_count=9)
# Using adam optimizer 
optimizer= AdamW(model.parameters(), lr=lr)

roberta.encoder.layer.9.attention.self.query.weight
roberta.encoder.layer.9.attention.self.query.bias
roberta.encoder.layer.9.attention.self.key.weight
roberta.encoder.layer.9.attention.self.key.bias
roberta.encoder.layer.9.attention.self.value.weight
roberta.encoder.layer.9.attention.self.value.bias
roberta.encoder.layer.9.attention.output.dense.weight
roberta.encoder.layer.9.attention.output.dense.bias
roberta.encoder.layer.9.attention.output.LayerNorm.weight
roberta.encoder.layer.9.attention.output.LayerNorm.bias
roberta.encoder.layer.9.intermediate.dense.weight
roberta.encoder.layer.9.intermediate.dense.bias
roberta.encoder.layer.9.output.dense.weight
roberta.encoder.layer.9.output.dense.bias
roberta.encoder.layer.9.output.LayerNorm.weight
roberta.encoder.layer.9.output.LayerNorm.bias
roberta.encoder.layer.10.attention.self.query.weight
roberta.encoder.layer.10.attention.self.query.bias
roberta.encoder.layer.10.attention.self.key.weight
roberta.encoder.layer.10.attention.self.key.b

In [24]:
train_log, valid_log = train_contrastive_learnig(model,optimizer,train_loader,tokenizer,valid_loader,device,epochs=30)

 device using : cuda:1
Epoch 1 		 Training Loss: 1.4296314895153046 		 Validation Loss: 1.4283918975454282
Validation Loss Decreased(inf--->384.237420) 	 Saving The Model
Epoch 2 		 Training Loss: 2.855183680355549 		 Validation Loss: 1.4286323169793338
Epoch 3 		 Training Loss: 4.2766884624958035 		 Validation Loss: 1.4291732377722361
Epoch 4 		 Training Loss: 5.693604393303394 		 Validation Loss: 1.4304552268804671
Epoch 5 		 Training Loss: 7.104439316689968 		 Validation Loss: 1.434097200078149
Epoch 6 		 Training Loss: 8.509288670122624 		 Validation Loss: 1.4401414212684205
Epoch 7 		 Training Loss: 9.911508025228978 		 Validation Loss: 1.4451070332615792
Epoch 8 		 Training Loss: 11.312921841442584 		 Validation Loss: 1.4478552611787079
Epoch 9 		 Training Loss: 12.713139644265175 		 Validation Loss: 1.446641223581307
Epoch 10 		 Training Loss: 14.113011240959167 		 Validation Loss: 1.4518223974341353
Epoch 11 		 Training Loss: 15.512352658808231 		 Validation Loss: 1.45442379628

In [94]:
# plt.plot(list(torch.tensor(train_log, device= 'cpu')))
# plt.plot(list(torch.tensor(valid_log, device= 'cpu')))
# plt.legend(['train','validation'])
# plt.title("Combine loss by freezing 9 layers")
# plt.show() 

In [42]:
#test_acc = test(model,test_loader=test_loader)

correct : 43


In [95]:
#print(f'Accuracy : {100 * test_acc} %') 

### 5.2  freeze all from top embeddings to encoder layers (9)

Download Pretrain Model

In [82]:
def freeze_layers(model,freeze_layers_count:int=0):

        """
        model : model object that we create 
        freeze_layers_count : the number of layers to freeze 
        """
        # write the code here
    
        # should not more than the number of layers in a backbone
        assert freeze_layers_count <= 12  
        if freeze_layers_count <= 0:
            pass
        else:
            for name, param in model.named_parameters():
                # print(type(name))

                keys = name.split(".")

                if str(freeze_layers_count) in keys or 'classifier' in keys:
                    break

                param.requires_grad = False 


        #print all parameter that we want to train from scratch 
        
        for name, param in model.named_parameters():
            
            if param.requires_grad == True:
                 
                print(name)
        
    
        return model     

In [91]:
# download config of Roberta config 
config = RobertaConfig.from_pretrained("roberta-base",output_hidden_states=True)

#chnage modifying the number of classes
config.num_labels = num_class
# Download pretrain models weight 
model = RobertaForSequenceClassification.from_pretrained('roberta-base')
# change from binary classification to muli-classification and loss automatically change to cross entropy loss
model.num_labels = config.num_labels
# change the output of last layer to num_class that we want to predict
model.classifier.out_proj = nn.Linear(in_features=embed_dim,out_features=num_class)
# move to model to device that we set
model = model.to(device)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifi

In [92]:
## Fine-Tune  model on SuperVised Contrastive loss 
# 5.1 freeze weight from pretrain model all layer except classifier 
model = freeze_layers(model,freeze_layers_count=12)
# Using adam optimizer 
optimizer= AdamW(model.parameters(), lr=lr)

classifier.dense.weight
classifier.dense.bias
classifier.out_proj.weight
classifier.out_proj.bias


In [93]:
train_log, valid_log = train_contrastive_learnig(model,optimizer,train_loader,tokenizer,valid_loader,device,epochs=30)

 device using : cuda:1
Epoch 1 		 Training Loss: 1.444173736870289 		 Validation Loss: 1.428385520513173
Validation Loss Decreased(inf--->384.235705) 	 Saving The Model
Epoch 2 		 Training Loss: 2.909568244218826 		 Validation Loss: 1.4286169846261745
Epoch 3 		 Training Loss: 4.340028534829616 		 Validation Loss: 1.4287136367705675
Epoch 4 		 Training Loss: 5.768017871677875 		 Validation Loss: 1.4291461595372197
Epoch 5 		 Training Loss: 7.184777577221394 		 Validation Loss: 1.4295452460923603
Epoch 6 		 Training Loss: 8.63200129121542 		 Validation Loss: 1.4300799644569482
Epoch 7 		 Training Loss: 10.053215716779231 		 Validation Loss: 1.4309153547960587
Epoch 8 		 Training Loss: 11.490088197588921 		 Validation Loss: 1.4320489460650874
Epoch 9 		 Training Loss: 12.962871822714806 		 Validation Loss: 1.4333633800421506
Epoch 10 		 Training Loss: 14.401282618939877 		 Validation Loss: 1.4346479877663367
Epoch 11 		 Training Loss: 15.837058457732201 		 Validation Loss: 1.436261667194

In [None]:
# plt.plot(list(torch.tensor(train_log, device= 'cpu')))
# plt.plot(list(torch.tensor(valid_log, device= 'cpu')))
# plt.legend(['train','validation'])
# plt.title("Combine loss by freezing 12 layers")
# plt.show() 

In [42]:
#test_acc = test(model,test_loader=test_loader)

correct : 43


In [95]:
#print(f'Accuracy : {100 * test_acc} %') 