# Bird Classification - Final Project

### BERT fine-tuning for document classification

In [1]:
!pip install transformers torch



In [2]:
import os
import re
import numpy as np 
from sklearn.metrics import accuracy_score

import transformers
from transformers import BertTokenizer, BertModel, ElectraModel, ElectraTokenizer

import torch
from torch import cuda
#from tqdm import tqdm_notebook as tqdm
from tqdm.notebook import tqdm
device = 'cuda' if cuda.is_available() else 'cpu'

device

'cuda'

- use X.txt and YL1.txt 

In [6]:
with open('X.txt') as f:
    X = [line.strip() for line in f]
with open('YL1.txt') as f:
    y = [int(line.strip()) for line in f]

len(X), len(y), max(y)

(46985, 46985, 6)

### An easy train/test split

In [7]:
train_X = X[:46000]
train_y = np.array(y[:46000])
test_X = X[46000:]
test_y = np.array(y[46000:])

len(train_X), len(train_y), len(test_X), len(test_y)

(46000, 46000, 985, 985)
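
The slice above keeps the file order, so if `X.txt` happens to be sorted by class or by source, the last 985 lines may not be representative. A minimal sketch of a shuffled split with a fixed seed follows; `shuffled_split` and its parameters are hypothetical names, not part of the original notebook, and the toy data only stands in for `X` and `y`.

```python
import random

def shuffled_split(texts, labels, n_train, seed=0):
    """Shuffle (text, label) pairs together, then cut at n_train."""
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)  # fixed seed keeps the split reproducible
    train, test = pairs[:n_train], pairs[n_train:]
    train_X, train_y = map(list, zip(*train))
    test_X, test_y = map(list, zip(*test))
    return train_X, train_y, test_X, test_y

# Toy data standing in for X and y
toy_X = [f"doc {i}" for i in range(10)]
toy_y = [i % 3 for i in range(10)]
tr_X, tr_y, te_X, te_y = shuffled_split(toy_X, toy_y, n_train=8)
print(len(tr_X), len(te_X))
```

Shuffling texts and labels as pairs keeps each label attached to its document.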

In [9]:
# not needed for training or evaluation, but useful for mapping examples
labels = {
    0:'Names',
    1:'Frequency',
    2:'Colors',
    3:'Size',
    4:'Weight',
    5:'Genus',
    6:'Family',
    7:'Description'
}

len(labels)

7

### Fine-tune BERT on the dataset

*Project language fine-tuning*

*loss function needs to change* 

*num-out needs to change*

*Not a binary task, so we will need to use something like cross entropy - torch uses index values*

*try to get around 80% on the test set*
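
The note about index values can be made concrete. The sketch below is plain Python (no torch), written on the assumption that it mirrors what `torch.nn.CrossEntropyLoss` computes with its default mean reduction: a log-softmax over the raw scores, then the negative log-probability at the target index. `cross_entropy` is a hypothetical helper for illustration only.

```python
import math

def cross_entropy(logits, targets):
    """Mean of -log softmax(logits[i])[targets[i]] over the batch."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum - row[t]
    return total / len(targets)

# Two examples, seven classes; targets are class indices, not one-hot vectors
logits = [[0.0] * 7, [5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
targets = [3, 0]
print(cross_entropy(logits, targets))
```

The first row has uniform scores, so its loss is log(7); the second row strongly favors the correct class, so its loss is close to zero.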

## Torch Data Set

In [10]:
class MultiLabelDataset(torch.utils.data.Dataset):

    def __init__(self, text, labels, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.text = text
        self.targets = labels
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = self.text[index]
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            truncation=True,
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]


        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            #'targets' : self.targets[index].clone().detach().long()
            'targets': torch.tensor(self.targets[index], dtype=torch.long)
        }
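
What `max_length`, padding to that length, and `truncation=True` do to a single example can be sketched without the tokenizer. This is an illustration only: `pad_or_truncate` is a hypothetical helper, and the ids 101, 102, and 0 are assumed to be the `[CLS]`, `[SEP]`, and `[PAD]` ids of `bert-base-uncased`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed bert-base-uncased special ids

def pad_or_truncate(token_ids, max_len):
    """Wrap ids in [CLS]/[SEP], cut to max_len, right-pad, and build the mask."""
    body = token_ids[: max_len - 2]          # leave room for the two special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                    # 1 marks real tokens
    pad = max_len - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad  # 0 marks padding

ids, mask = pad_or_truncate([7592, 2088], max_len=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Every example comes out with the same length, which is what lets the `DataLoader` stack them into a batch tensor.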

## Bert Class
- first "layer" is a pre-trained BERT model
- you can add whatever layers you want after that

In [11]:
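class BERTClass(torch.nn.Module):
    def __init__(self, NUM_OUT):
        super(BERTClass, self).__init__()
        # Pre-trained encoder. To try ELECTRA, swap in ElectraModel and its
        # checkpoint name, and adjust the classifier input size to match.
        self.l1 = BertModel.from_pretrained("bert-base-uncased")
        # Defined but not used in forward() below
        self.pre_classifier = torch.nn.Linear(768, 256)
        # bert-base-uncased has hidden size 768, and forward() feeds the
        # [CLS] vector straight into the classifier
        self.classifier = torch.nn.Linear(768, NUM_OUT)
        # Defined but not used in forward() below
        self.dropout = torch.nn.Dropout(0.5)
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]   # (batch, seq_len, hidden)
        pooler = hidden_state[:, 0]  # representation of the [CLS] token
        output = self.classifier(pooler)
        # CrossEntropyLoss applies log-softmax itself, so this softmax is
        # redundant for training; it is kept because the reported losses used it
        output = self.softmax(output)
        return output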
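
One consequence of ending `forward` with a softmax is worth checking by hand. `CrossEntropyLoss` applies its own log-softmax, so feeding it probabilities in [0, 1] puts a floor under the loss: even a perfectly confident, correct prediction cannot reach zero. The sketch below computes that floor in plain Python, assuming K classes and a one-hot probability vector; `loss_floor` is a hypothetical helper.

```python
import math

def loss_floor(num_classes):
    """Cross-entropy when the 'logits' are a one-hot probability vector.

    The correct class scores 1.0 and the others 0.0, so the loss is
    -log(e^1 / (e^1 + (K - 1) * e^0)).
    """
    return -math.log(math.e / (math.e + (num_classes - 1)))

print(round(loss_floor(7), 4))  # floor for the seven-class task
```

For seven classes the floor is about 1.17, which would be consistent with training losses that settle near 1.29 rather than approaching zero. Returning the raw classifier output and leaving the softmax to the loss is the usual alternative.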

## Helpful functions:
### Loss
- This task has seven classes, so it uses categorical cross-entropy loss (`torch.nn.CrossEntropyLoss`), which takes class indices as targets
- A binary task would use binary cross-entropy
- Tasks whose targets are distributions rather than single labels should use KL divergence
- Tasks with continuous targets (regression) should use something like RMSE loss
### Train
- Steps through the data batch by batch
- grabs ids, masks, and token_type_ids which are required inputs for BERT
- inputs are passed through the model, compared to targets, computes loss function, backprops
### Validation
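- Takes a model, passes the test inputs through it without tracking gradients, and takes the argmax as the predicted class
- The targets must be collected from the loader here, because the loader may have shuffled them relative to `test_y`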

In [13]:
def loss_fn(outputs, targets):
    return torch.nn.CrossEntropyLoss()(outputs, targets)

def train(model, training_loader, optimizer):
    model.train()
    for data in tqdm(training_loader):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.long)

        outputs = model(ids, mask, token_type_ids)
        loss = loss_fn(outputs, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # This is the loss of the last batch only, not an epoch average
    return loss
    
def validation(model, testing_loader):
    model.eval()
    fin_targets=[]
    fin_outputs=[]
    with torch.no_grad():
        for data in tqdm(testing_loader):
            targets = data['targets']
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
            
            outputs = model(ids, mask, token_type_ids)
            #outputs = torch.sigmoid(outputs).cpu().detach()
            outputs = torch.argmax(outputs, dim=1).cpu().detach()
            fin_outputs.extend(outputs.tolist())
            fin_targets.extend(targets.tolist())
    return torch.tensor(fin_outputs), torch.tensor(fin_targets)
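
Overall accuracy can hide classes the model rarely or never predicts. A small sketch of per-class accuracy from the two lists that `validation` collects follows; `per_class_accuracy` is a hypothetical helper, and it takes plain Python lists (for example `guesses.tolist()`) rather than tensors.

```python
from collections import Counter

def per_class_accuracy(predictions, targets):
    """Return {class_index: fraction of that class predicted correctly}."""
    total = Counter(targets)
    correct = Counter(t for p, t in zip(predictions, targets) if p == t)
    return {c: correct[c] / total[c] for c in sorted(total)}

preds = [0, 1, 1, 2, 2, 2]
targs = [0, 1, 2, 2, 2, 0]
print(per_class_accuracy(preds, targs))
```

A class with a score near zero points to a problem that the single accuracy figure would not show.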

## The Tokenizer:

In [14]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
#tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')

# what does the tokenizer do?
print(train_X[5])
# train_data[5] = ' '.join(train_data[5])
tokenizer.encode_plus(
            train_X[5],
            None,
            add_special_tokens=True,
            max_length=128,#may be 
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            truncation=True,
            return_token_type_ids=True
        )


(Objective) In order to increase classification accuracy of tea-category identification (TCI) system, this paper proposed a novel approach. (Method) The proposed methods first extracted 64 color histogram to obtain color information, and 16 wavelet packet entropy to obtain the texture information. With the aim of reducing the 80 features, principal component analysis was harnessed. The reduced features were used as input to generalized eigenvalue proximal support vector machine (GEPSVM). Winner-takes-all (WTA) was used to handle the multiclass problem. Two kernels were tested, linear kernel and Radial basis function (RBF) kernel. Ten repetitions of 10-fold stratified cross validation technique were used to estimate the out-of-sample errors. We named our method as GEPSVM + RBF + WTA and GEPSVM + WTA. (Result) The results showed that PCA reduced the 80 features to merely five with explaining 99.90% of total variance. The recall rate of GEPSVM + RBF + WTA achieved the highest overall reca



{'input_ids': [101, 1006, 7863, 1007, 1999, 2344, 2000, 3623, 5579, 10640, 1997, 5572, 1011, 4696, 8720, 1006, 22975, 2072, 1007, 2291, 1010, 2023, 3259, 3818, 1037, 3117, 3921, 1012, 1006, 4118, 1007, 1996, 3818, 4725, 2034, 15901, 4185, 3609, 2010, 3406, 13113, 2000, 6855, 3609, 2592, 1010, 1998, 2385, 4400, 7485, 14771, 23077, 2000, 6855, 1996, 14902, 2592, 1012, 2007, 1996, 6614, 1997, 8161, 1996, 3770, 2838, 1010, 4054, 6922, 4106, 2001, 17445, 2098, 1012, 1996, 4359, 2838, 2020, 2109, 2004, 7953, 2000, 18960, 1041, 29206, 10175, 5657, 4013, 9048, 9067, 2490, 9207, 3698, 1006, 16216, 4523, 2615, 2213, 1007, 1012, 3453, 1011, 3138, 1011, 2035, 1006, 21925, 1007, 2001, 2109, 2000, 5047, 1996, 4800, 26266, 3291, 1012, 2048, 16293, 2015, 2020, 7718, 1010, 7399, 16293, 1998, 15255, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## Training Setup 

In [15]:
MAX_LEN = 128
BATCH_SIZE = 8
EPOCHS = 3
NUM_OUT = 7 # seven classes (labels 0-6); a binary task would use 2
LEARNING_RATE = 2e-05

training_data = MultiLabelDataset(train_X, torch.from_numpy(train_y), tokenizer, MAX_LEN)
test_data = MultiLabelDataset(test_X, torch.from_numpy(test_y), tokenizer, MAX_LEN)

train_params = {'batch_size': BATCH_SIZE,
                
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }    

training_loader = torch.utils.data.DataLoader(training_data, **train_params)
testing_loader = torch.utils.data.DataLoader(test_data, **test_params)
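
With these settings the number of batches per epoch follows directly from the dataset sizes. A quick check, assuming the loaders keep the final partial batch (the `DataLoader` default, `drop_last=False`); `batches_per_epoch` is a hypothetical helper:

```python
import math

def batches_per_epoch(num_examples, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(num_examples / batch_size)

print(batches_per_epoch(46000, 8))  # 5750 training batches
print(batches_per_epoch(985, 8))    # 124 test batches
```

Each training epoch therefore performs 5750 optimizer steps at a batch size of 8; a larger batch size reduces the step count proportionally.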

## Train and Evaluate 

In [16]:
model = BERTClass(NUM_OUT)
model.to(device)    

optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

for epoch in range(EPOCHS):
    loss = train(model, training_loader, optimizer)
    print(f'Epoch: {epoch}, Loss:  {loss.item()}')
    guesses, targets = validation(model, testing_loader)
    # accuracy_score takes (y_true, y_pred)
    print('accuracy on test set {}'.format(accuracy_score(targets.numpy(), guesses.numpy())))

Epoch: 0, Loss:  1.790196418762207
accuracy on test set 0.7309644670050761

Epoch: 1, Loss:  1.2916243076324463
accuracy on test set 0.750253807106599

Epoch: 2, Loss:  1.2908127307891846
accuracy on test set 0.766497461928934


### RESULTS
- BERT with cross-entropy loss (`CrossEntropyLoss`), batch size 64: loss 1.35137, accuracy 0.8324873
- BERT with cross-entropy loss (`CrossEntropyLoss`), batch size 8: loss 1.41937, accuracy 0.758375
- ELECTRA with cross-entropy loss (`CrossEntropyLoss`), batch size 8: loss 1.291624, accuracy 0.766497

## Questions
### What does the BERT Tokenizer do?
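- The BERT tokenizer splits the text into WordPiece subword tokens and maps each one to an integer id from its vocabulary. It adds the special `[CLS]` (101) and `[SEP]` (102) tokens, then pads or truncates the sequence to `max_length`. `encode_plus` returns a dict with three keys: `input_ids`, `token_type_ids`, and `attention_mask`.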
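
How a tokenizer of this kind breaks an unfamiliar word into subword pieces can be illustrated with a toy version of greedy longest-match-first splitting, which is the idea behind WordPiece. The vocabulary below is made up for the example and `wordpiece` is a hypothetical helper; the real tokenizer uses the roughly 30,000-entry `vocab.txt` shipped with `bert-base-uncased`.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"his", "##to", "##gram", "color", "tea"}
print(wordpiece("histogram", toy_vocab))  # ['his', '##to', '##gram']
```

Each resulting piece is then looked up in the vocabulary to get the integer that appears in `input_ids`.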

### What loss function did you use? Why did you choose that loss function?
- I used cross-entropy loss (`torch.nn.CrossEntropyLoss`). I chose it because this is a single-label classification task over seven classes, and cross-entropy measures how far the predicted class scores are from the true class.

### Try different batch sizes (e.g., 8 vs 16 vs 32). How does that affect your results?
- When I tried a batch size of 8, the results were worse than with a batch size of 64. The accuracy was 0.758375 and the cross-entropy loss was 1.41937, compared with 0.8324873 and 1.35137 at batch size 64, so the loss was higher and the accuracy lower.

### Try another huggingface model (e.g., ELECTRA or RoBERTa) and compare results (you will need to change the name of the model and the tokenizer). How do the results compare to BERT?
- ELECTRA did slightly better than BERT. At a batch size of 8 the two are close, but ELECTRA reached an accuracy of 0.766497 with a loss of 1.291624, against 0.758375 and 1.41937 for BERT. After the first epoch I was not sure ELECTRA would come out ahead, but by the third epoch it had edged past BERT. I think that with further testing across different batch sizes and more epochs the results could differ.

### What is the power of fine-tuning (as opposed to pre-training)?
- Fine-tuning is a form of transfer learning: instead of training a model from scratch, we take a pre-trained model and transfer what it learned from prior tasks to the new task, with some adjustments. Pre-training, by contrast, starts from the ground up, with the model learning from its data for the first time.