<a href="https://colab.research.google.com/github/atherfawaz/BERT-Supervised/blob/master/RoBERTa%20-%20TPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supervised Learning with BERT

**Question: Supervised Learning with BERT.**
The Astrological department believes that a person's astrological sign can be guessed from their behavior. An organization is collecting blog posts by different people from various sources. You have been tasked with building a deep learning model that uses an individual's posts to predict which of the 12 star signs that individual belongs to. You also need to predict the person's gender.

In [1]:
#from google.colab import drive
#drive.mount('/content/gdrive')
%cd /content/drive/My Drive/Ebryx/blogs_train

/content/drive/My Drive/Ebryx/blogs_train


# Unzip the dataset

In [None]:
!tar -xvf  'blogs_train.tar.xz'
%cd /content/drive/My Drive/Ebryx/blogs_train

# Dataset
Each file name contains the gender, age, occupation, and astrological sign of the blogger, preceded by a numeric ID. For example, `4115891.male.24.Student.Leo.xml` is one file. A single file contains a set of blog posts separated by date. To illustrate, this is what a sample file looks like:

```
<Blog>
  <date>31,May,2004</date>
    <post>
      Well, everyone got up and going this morning.  It's still raining, but that's okay with me.  Sort of suits my mood.  I could easily have stayed home in bed with my book and the cats.  This has been a lot of rain though!..
    </post>
</Blog>
```
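The naming scheme above can be unpacked with a plain string split. The following is a minimal sketch, assuming every file name has exactly six dot-separated fields; the helper name `parse_blog_filename` and the dictionary keys are hypothetical and are not part of the original notebook.

```python
def parse_blog_filename(filename):
    """Split '<id>.<gender>.<age>.<occupation>.<sign>.xml' into its fields."""
    parts = filename.split(".")
    if len(parts) != 6 or parts[-1] != "xml":
        # Assumption: anything else (e.g. an archive file) is not a blog file
        raise ValueError(f"unexpected file name: {filename}")
    blogger_id, gender, age, occupation, sign, _ = parts
    return {
        "id": blogger_id,
        "gender": gender,
        "age": int(age),
        "occupation": occupation,
        "sign": sign,
    }


meta = parse_blog_filename("4115891.male.24.Student.Leo.xml")
print(meta["gender"], meta["age"], meta["occupation"], meta["sign"])
```

Raising on unexpected names makes it obvious when a non-blog file ends up in the input directory.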



# Parsing the dataset
The dataset is parsed from the separate XML files into a pandas DataFrame for display and easy access. Some XML files have encoding problems; they are opened with `errors='ignore'`, so bytes that cannot be decoded as UTF-8 are dropped. This could cost the model a little accuracy later, but the effect is expected to be small.
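The parsing cell below opens each file with `errors='ignore'`. As a small illustration of what that setting does, the sketch here decodes a made-up byte string that is not valid UTF-8, once with `ignore` and once with `replace` for comparison. The sample bytes are an assumption chosen for the demonstration, not data from the blog corpus.

```python
# 'é' encoded as Latin-1 is the single byte 0xE9, which is invalid in UTF-8 here
raw = "caf\u00e9 rain".encode("latin-1")

# errors='ignore' silently drops the undecodable byte
ignored = raw.decode("utf-8", errors="ignore")

# errors='replace' keeps a visible U+FFFD marker instead
replaced = raw.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

Dropping bytes keeps the text clean for the tokenizer, at the cost of losing the occasional character.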

In [None]:
import os
import pandas as pd
import numpy as np
import codecs
from bs4 import BeautifulSoup
from progressbar import ProgressBar
pbar = ProgressBar()

print('PARSING FILES....')

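# Keep only the blog XML files; other entries (e.g. the tar archive) lack the metadata fields
FILES = [f for f in os.listdir() if f.endswith('.xml')]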
#print('File count: ', len(FILES))
#print(FILES)
#FILES = ['4115891.male.24.Student.Leo.xml', '4115958.male.16.Communications-Media.Libra.xml', '4116071.male.26.Arts.Sagittarius.xml', '4116243.female.24.Manufacturing.Sagittarius.xml']

posts_arr = []
sign_arr = []
gender_arr = []
age_arr = []
occupation_arr = []

for to_fetch in pbar(FILES):
    #print('Parsing file:', to_fetch)
    # File names look like <id>.<gender>.<age>.<occupation>.<sign>.xml
    gender, age, occupation, sign = to_fetch.split('.')[-5:-1]
    with codecs.open(to_fetch, 'r', encoding='utf-8', errors='ignore') as fp:
      # fp already yields decoded text, so from_encoding is not needed
      soup = BeautifulSoup(fp, 'lxml')
      posts = soup.find_all('post')
      #print(soup.prettify())
      for post in posts:
        clean_str = post.text
        clean_str = clean_str.replace('\r', '')
        clean_str = clean_str.replace('\n', '')
        posts_arr.append(clean_str)
        sign_arr.append(sign)
        age_arr.append(age)
        occupation_arr.append(occupation)
        gender_arr.append(gender)

df = pd.DataFrame({'Gender': gender_arr, 'Age': age_arr, 'Occupation': occupation_arr, 'Post': posts_arr, 'Sign': sign_arr})

#df.head(50)
df.to_csv('/content/drive/My Drive/Ebryx/dataset.csv', encoding='utf-8', index=False)

In [None]:
df.head(100)

# Loading BERT for finetuning

This section builds on the Hugging Face model implementation and [this notebook](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb#scrollTo=JrBr2YesGdO_).


In [2]:
!pip install transformers;

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.8.1.rc1
  Downloading https://files.pythonhosted.org/packages/40/d0/30d5f8d221a0ed981a186c8eb986ce1c94e3a6e87f994eae9f4aa5250217/tokenizers-0.8.1rc1-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB

In [3]:
!nvidia-smi

Tue Aug 25 05:26:46 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Setting up training environment, imports, and hyperparameters

In [1]:
# Code for TPU packages install
!curl -q https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5115  100  5115    0     0  12092      0 --:--:-- --:--:-- --:--:-- 12092
Updating... This may take around 2 minutes.
Uninstalling torch-1.6.0+cu101:
  Successfully uninstalled torch-1.6.0+cu101
Uninstalling torchvision-0.7.0+cu101:
  Successfully uninstalled torchvision-0.7.0+cu101
Copying gs://tpu-pytorch/wheels/torch-nightly+20200515-cp36-cp36m-linux_x86_64.whl...
\ [1 files][ 91.0 MiB/ 91.0 MiB]                                                
Operation completed over 1 objects/91.0 MiB.                                     
Copying gs://tpu-pytorch/wheels/torch_xla-nightly+20200515-cp36-cp36m-linux_x86_64.whl...
\ [1 files][119.5 MiB/119.5 MiB]                                                
Operation completed over 1 objects/119.5 MiB.                                    
Copying gs://tpu-pytorch/wheels/torchvision-nightly

In [1]:
# Importing the libraries needed
# Importing stock ml libraries

import pandas as pd
import torch
import transformers
from sklearn.utils import shuffle
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertModel, DistilBertTokenizer
from progressbar import ProgressBar
pbar = ProgressBar()

# Defining some key variables that will be used later on in the training
MAX_LEN = 512
TRAIN_BATCH_SIZE = 32
VALID_BATCH_SIZE = 4
EPOCHS = 1
LEARNING_RATE = 1e-05
# Use the same checkpoint as the model defined below so the vocabulary matches
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

# Loading dataset from CSV

The zodiac sign labels are converted to integer class indices (0 to 11). The gender column is loaded as well, but it is dropped before training, so the model below predicts the sign only.
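The `encode_cat` helper in the next cell assigns each new label the next free integer, in first-seen order. The sketch below reproduces that idea in plain Python and also builds the reverse mapping needed to turn predicted indices back into sign names. The names `build_label_encoder`, `sign_to_id`, and `id_to_sign` are hypothetical, and the sample labels are taken from the example file names mentioned earlier.

```python
def build_label_encoder():
    """Return an encode function plus the label -> id mapping it fills in."""
    mapping = {}

    def encode(label):
        # A label seen for the first time gets the next unused integer
        if label not in mapping:
            mapping[label] = len(mapping)
        return mapping[label]

    return encode, mapping


encode_sign, sign_to_id = build_label_encoder()
signs = ["Leo", "Libra", "Sagittarius", "Sagittarius", "Leo"]
sign_ids = [encode_sign(s) for s in signs]

# Reverse mapping, useful for decoding model predictions
id_to_sign = {i: s for s, i in sign_to_id.items()}

print(sign_ids)
print(id_to_sign)
```

Because the ids depend on the order in which labels are encountered, the mapping should be saved alongside the model if predictions need to be decoded in a later session.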

In [2]:
# Load the parsed posts and encode the zodiac sign as an integer label
df = pd.read_csv('/content/drive/My Drive/Ebryx/dataset.csv')
df = df[['Post', 'Sign', 'Gender']]

encode_dict = {}

def encode_cat(x):
    if x not in encode_dict.keys():
      encode_dict[x]=len(encode_dict)
    return encode_dict[x]

df['Sign'] = df['Sign'].apply(encode_cat)
df['Post'] = df['Post'].str.replace('\r','').str.replace('\t','').str.replace('\xa0', '')

# Gender is dropped here: this model predicts the sign only
df = df[['Post', 'Sign']]

df = shuffle(df)
df.head(10)

Unnamed: 0,Post,Sign
282230,I also enjoyed the writer's circl...,11
196254,really i thought i finally had this fig...,8
354149,Victims aren't we all Woke up this mo...,8
46474,I am back...I have this great urge to b...,0
315440,This morning I walked around the ...,10
191281,Best Film I Saw in 2003: Da Return ...,9
283583,I got to thinking about what makes a TV...,10
184039,.~*~. LOOK OUT! ïòð Sasha i...,4
1397,why? I have been trying to post ...,8
68426,It just bites the big one that I can't ...,1


In [3]:
class Triage(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        post = str(self.data.Post[index])
        post = " ".join(post.split())
        inputs = self.tokenizer.encode_plus(
            post,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            truncation=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'targets': torch.tensor(self.data.Sign[index], dtype=torch.long)
        } 
    
    def __len__(self):
        return self.len

# Creating the dataset and dataloader for the neural network

train_size = 0.8
train_dataset=df.sample(frac=train_size,random_state=200)
test_dataset=df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)


print("FULL Dataset: {}".format(df.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

training_set = Triage(train_dataset, tokenizer, MAX_LEN)
testing_set = Triage(test_dataset, tokenizer, MAX_LEN)

FULL Dataset: (380720, 2)
TRAIN Dataset: (304576, 2)
TEST Dataset: (76144, 2)


In [4]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

In [5]:
# Custom model: a dense layer, dropout, and a 12-way classifier on top of DistilBERT's first-token output

class DistillBERTClass(torch.nn.Module):
    def __init__(self):
        super(DistillBERTClass, self).__init__()
        self.l1 = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.3)
        self.classifier = torch.nn.Linear(768, 12)

    def forward(self, input_ids, attention_mask):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

model = DistillBERTClass()
model.to(device)

# Creating the loss function and optimizer
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

In [6]:
# Function to count the correct predictions in a batch

def calcuate_accu(big_idx, targets):
    n_correct = (big_idx==targets).sum().item()
    return n_correct

# Defining the training function on the 80% of the dataset for tuning the distilbert model

def train(epoch):
    tr_loss = 0
    n_correct = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    model.train()
    for i,data in enumerate(training_loader, 0):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.long)

        outputs = model(ids, mask)
        loss = loss_function(outputs, targets)
        tr_loss += loss.item()
        big_val, big_idx = torch.max(outputs.data, dim=1)
        n_correct += calcuate_accu(big_idx, targets)

        nb_tr_steps += 1
        nb_tr_examples+=targets.size(0)
        
        if i%5==0:
            loss_step = tr_loss/nb_tr_steps
            accu_step = (n_correct*100)/nb_tr_examples 
            print(f"[ {i} ] Loss: {loss_step:.3f} | Accuracy: {accu_step:.3f}")

        optimizer.zero_grad()
        loss.backward()
        # Apply the weight update (GPU/CPU path)
        optimizer.step()

    print(f'The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}')
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Training Loss Epoch: {epoch_loss}")
    print(f"Training Accuracy Epoch: {epoch_accu}")

    return 


for epoch in range(EPOCHS):
    train(epoch)

[ 0 ] Loss: 2.498 | Accuracy: 6.250
[ 5 ] Loss: 2.491 | Accuracy: 9.375
[ 10 ] Loss: 2.491 | Accuracy: 7.670
[ 15 ] Loss: 2.488 | Accuracy: 7.812
[ 20 ] Loss: 2.487 | Accuracy: 7.440
[ 25 ] Loss: 2.488 | Accuracy: 8.534
[ 30 ] Loss: 2.490 | Accuracy: 8.165
[ 35 ] Loss: 2.489 | Accuracy: 8.420
[ 40 ] Loss: 2.490 | Accuracy: 8.384
[ 45 ] Loss: 2.488 | Accuracy: 8.628
[ 50 ] Loss: 2.488 | Accuracy: 8.762
[ 55 ] Loss: 2.488 | Accuracy: 8.426
[ 60 ] Loss: 2.488 | Accuracy: 8.402
[ 65 ] Loss: 2.488 | Accuracy: 8.333
[ 70 ] Loss: 2.487 | Accuracy: 8.451
[ 75 ] Loss: 2.486 | Accuracy: 8.717
[ 80 ] Loss: 2.484 | Accuracy: 8.873
[ 85 ] Loss: 2.484 | Accuracy: 8.757
[ 90 ] Loss: 2.484 | Accuracy: 8.791
[ 95 ] Loss: 2.484 | Accuracy: 8.757


KeyboardInterrupt: ignored
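
A useful reference point for the log above is the score of a model that guesses uniformly among the 12 signs. The short calculation below is a sketch under the assumption that the classes are roughly balanced, which the notebook does not verify. The logged loss and accuracy over the first 95 steps sit close to these values, which suggests the model had not yet moved beyond chance when training was interrupted.

```python
import math

NUM_CLASSES = 12

# Cross-entropy of a uniform prediction over 12 classes is ln(12)
chance_loss = math.log(NUM_CLASSES)

# Expected accuracy (in percent) when guessing uniformly
chance_accuracy = 100.0 / NUM_CLASSES

print(f"Chance-level loss: {chance_loss:.3f}")          # 2.485
print(f"Chance-level accuracy: {chance_accuracy:.3f}")  # 8.333
```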

# Validation

In [None]:
from sklearn import metrics

def validation(epoch):
    model.eval()
    fin_targets=[]
    fin_outputs=[]
    with torch.no_grad():
        for _, data in enumerate(testing_loader, 0):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.long)
            # The dataset yields no token_type_ids and forward() takes only ids and mask
            outputs = model(ids, mask)
            # Single-label 12-way task: the prediction is the class with the largest logit
            preds = torch.argmax(outputs, dim=1)
            fin_targets.extend(targets.cpu().numpy().tolist())
            fin_outputs.extend(preds.cpu().numpy().tolist())
    return fin_outputs, fin_targets

for epoch in range(EPOCHS):
    outputs, targets = validation(epoch)
    accuracy = metrics.accuracy_score(targets, outputs)
    f1_score_micro = metrics.f1_score(targets, outputs, average='micro')
    f1_score_macro = metrics.f1_score(targets, outputs, average='macro')
    print(f"Accuracy Score = {accuracy}")
    print(f"F1 Score (Micro) = {f1_score_micro}")
    print(f"F1 Score (Macro) = {f1_score_macro}")
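
Since the model above is trained with `CrossEntropyLoss` over 12 classes, one way to score it is to take the index of the largest logit as the predicted class and compare it with the integer target. The following pure-Python sketch shows that step together with accuracy and macro-averaged F1 on a made-up three-class example. The helper names and the toy logits are assumptions for illustration; in the notebook itself, scikit-learn's `metrics` functions play this role.

```python
def argmax(scores):
    """Index of the largest value in a list of per-class scores."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted class equals the target."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy logits for four examples and three classes (illustrative values only)
logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
targets = [1, 0, 2, 1]

preds = [argmax(row) for row in logits]
print(preds)
print(accuracy(targets, preds))
print(macro_f1(targets, preds))
```

Macro F1 weights every sign equally, so it is the more informative number if some signs turn out to be rarer than others.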