
# Supervised Learning with BERT

***Question*: Supervised Learning with BERT
The Astrological department believes that a person's astrological sign can be guessed from their behavior. An organization is collecting blog posts written by different people from various sources. You have been tasked with building a deep learning model that uses these posts to predict which of the 12 star signs an individual belongs to. You also need to predict the gender of that person.**

In [1]:
#from google.colab import drive
#drive.mount('/content/drive')
%cd /content/drive/My Drive/Ebryx/blogs_train

/content/drive/My Drive/Ebryx/blogs_train


In [None]:
!tar -xvf  'blogs_train.tar.xz'
%cd /content/drive/My Drive/Ebryx/blogs_train

# Dataset
Each file name encodes the blogger's ID, gender, age, occupation, and astrological sign. For example, 4115891.male.24.Student.Leo.xml is one file. A single file contains a set of blog posts, each preceded by its date. To illustrate, this is what a sample file looks like:

```
<Blog>
  <date>31,May,2004</date>
    <post>
      Well, everyone got up and going this morning.  It's still raining, but that's okay with me.  Sort of suits my mood.  I could easily have stayed home in bed with my book and the cats.  This has been a lot of rain though!..
    </post>
</Blog>
```
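
The naming convention described above can be unpacked with a few lines of Python. This is a minimal sketch: the helper name `parse_blog_filename` and the dictionary keys are illustrative assumptions and do not appear in the notebook itself.

```python
def parse_blog_filename(filename):
    """Split '<id>.<gender>.<age>.<occupation>.<sign>.xml' into its fields."""
    parts = filename.split('.')
    # Expect exactly six dot-separated parts ending in 'xml'.
    if len(parts) != 6 or parts[-1] != 'xml':
        raise ValueError(f'unexpected file name: {filename}')
    blogger_id, gender, age, occupation, sign, _ = parts
    return {
        'id': blogger_id,
        'gender': gender,
        'age': int(age),
        'occupation': occupation,
        'sign': sign,
    }

info = parse_blog_filename('4115891.male.24.Student.Leo.xml')
print(info)
```

Rejecting names that do not match the pattern keeps stray files, such as the downloaded archive, from being silently mislabelled.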



# Parsing the dataset
Parse the dataset from the separate files into a pandas DataFrame for display and easy access. Some XML files contain encoding issues, and the problematic contents of those files have been replaced by random numbers. While this could affect the accuracy of the model later, the effect is expected to be small.
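
The parsing cell in this section opens each file with `errors='ignore'`. To illustrate what that error handler does with malformed input, here is a small standard-library sketch. The byte string is a made-up example, not taken from the dataset.

```python
# A made-up byte string: valid UTF-8 text with one invalid byte (0xff) inside.
raw = b'It is still rain\xffing today.'

# Strict decoding (the default) raises on the invalid byte.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='ignore' silently drops bytes that cannot be decoded.
ignored = raw.decode('utf-8', errors='ignore')

# errors='replace' substitutes U+FFFD, so the damage stays visible.
replaced = raw.decode('utf-8', errors='replace')

print(strict_ok)
print(ignored)
print(replaced)
```

Dropping bytes keeps the pipeline running but hides how much text was affected; `errors='replace'` is one alternative if that needs to be measured.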

In [None]:
import os
import pandas as pd
import numpy as np
import codecs
from bs4 import BeautifulSoup
from progressbar import ProgressBar
pbar = ProgressBar()

print('PARSING FILES....')

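# Only parse the blog XML files; the directory also contains the archive itself
FILES = [f for f in os.listdir() if f.endswith('.xml')]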
#print('File count: ', len(FILES))
#print(FILES)
#FILES = ['4115891.male.24.Student.Leo.xml', '4115958.male.16.Communications-Media.Libra.xml', '4116071.male.26.Arts.Sagittarius.xml', '4116243.female.24.Manufacturing.Sagittarius.xml']

posts_arr = []
sign_arr = []
gender_arr = []
age_arr = []
occupation_arr = []

for to_fetch in pbar(FILES):
    #print('Parsing file:', to_fetch)
    # File name layout: <id>.<gender>.<age>.<occupation>.<sign>.xml
    gender, age, occupation, sign = to_fetch.split('.')[-5:-1]
    # errors='ignore' drops any bytes that are not valid UTF-8
    with codecs.open(to_fetch, 'r', encoding='utf-8', errors='ignore') as fp:
        # fp yields already-decoded text, so from_encoding is not needed
        soup = BeautifulSoup(fp, 'lxml')
        posts = soup.find_all('post')
        #print(soup.prettify())
        for post in posts:
            # Strip line breaks so each post occupies a single line
            clean_str = post.text.replace('\r', '').replace('\n', '')
            posts_arr.append(clean_str)
            sign_arr.append(sign)
            age_arr.append(age)
            occupation_arr.append(occupation)
            gender_arr.append(gender)

df = pd.DataFrame({'Gender': gender_arr, 'Age': age_arr, 'Occupation': occupation_arr, 'Post': posts_arr, 'Sign': sign_arr})

#df.head(50)
df.to_csv('/content/drive/My Drive/Ebryx/dataset.csv', encoding='utf-8', index=False)

In [None]:
df.head(100)

# Loading DistilBERT for fine-tuning

This section builds on the Hugging Face model implementation and [this repository](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb#scrollTo=JrBr2YesGdO_).


In [3]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.8.1.rc1
  Downloading https://files.pythonhosted.org/packages/40/d0/30d5f8d221a0ed981a186c8eb986ce1c94e3a6e87f994eae9f4aa5250217/tokenizers-0.8.1rc1-cp36-cp36m-manylinux1_x86_64.whl

In [4]:
!nvidia-smi

Tue Aug 18 04:51:24 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [1]:
# Importing the libraries needed
import pandas as pd
import torch
import transformers
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertModel, DistilBertTokenizer

# Defining some key variables that will be used later on in the training
MAX_LEN = 512
TRAIN_BATCH_SIZE = 32
VALID_BATCH_SIZE = 8
EPOCHS = 1
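# Note: 0.01 is very high for fine-tuning a pretrained transformer;
# values around 1e-5 to 5e-5 are more typical
LEARNING_RATE = 0.01
# Use the same checkpoint as the model so the vocabularies match
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')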

# Setting up the device for GPU usage

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
print(device)

cuda


In [2]:
# Creating the dataset and dataloader for the neural network
df = pd.read_csv('/content/drive/My Drive/Ebryx/dataset.csv')
df = df[['Post', 'Sign']]

encode_dict = {}

def encode_cat(x):
    if x not in encode_dict.keys():
        encode_dict[x]=len(encode_dict)
    return encode_dict[x]

df['Sign'] = df['Sign'].apply(lambda x: encode_cat(x))

df.head(100)

Unnamed: 0,Post,Sign
0,I just watched Beauty and the beast...,0
1,This picture shows a Vietnamese ...,0
2,So I just used the term “Bad Ass...,0
3,This is a dumb little story I whipp...,0
4,"Listen, you fuckers, you screwhe...",0
...,...,...
95,\t There is a reason that I haven't be...,6
96,"\t After I wrote that last bit, I prom...",6
97,\t Today is just not my day... I break...,6
98,\t Last night Grandpa got me and asked...,6
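
The label encoding above assigns integer IDs in order of first appearance. To read predictions back as sign names, an inverse mapping is also needed. The following is a minimal standard-library sketch; `build_label_maps` and the sample list are hypothetical and not part of the notebook.

```python
def build_label_maps(labels):
    """Assign IDs in order of first appearance and build the inverse map."""
    label_to_id = {}
    for label in labels:
        if label not in label_to_id:
            label_to_id[label] = len(label_to_id)
    id_to_label = {idx: label for label, idx in label_to_id.items()}
    return label_to_id, id_to_label

# Made-up sample of sign labels, for illustration only.
signs = ['Leo', 'Libra', 'Leo', 'Sagittarius', 'Libra']
label_to_id, id_to_label = build_label_maps(signs)
encoded = [label_to_id[s] for s in signs]

print(encoded)
print(id_to_label[2])
```

Because the IDs depend on the order of the rows, the mapping should be saved alongside the model if predictions are to be decoded later.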


In [3]:
class Triage(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        
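    def __getitem__(self, index):
        # Collapse runs of whitespace before tokenizing
        post = str(self.data.Post[index])
        post = " ".join(post.split())
        # Tokenize, then pad or truncate to exactly max_len tokens
        inputs = self.tokenizer.encode_plus(
            post,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            return_token_type_ids=True,
            truncation=True
        )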
        ids = inputs['input_ids']
        mask = inputs['attention_mask']

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'targets': torch.tensor(self.data.Sign[index], dtype=torch.long)
        } 
    
    def __len__(self):
        return self.len

train_size = 0.8
train_dataset=df.sample(frac=train_size,random_state=200)
test_dataset=df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)


print("FULL Dataset: {}".format(df.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

training_set = Triage(train_dataset, tokenizer, MAX_LEN)
testing_set = Triage(test_dataset, tokenizer, MAX_LEN)

FULL Dataset: (380720, 2)
TRAIN Dataset: (304576, 2)
TEST Dataset: (76144, 2)
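
The `max_length`, padding, and truncation arguments passed to the tokenizer make every example the same length. The following pure-Python sketch shows the idea in simplified form. It is not the tokenizer's actual implementation: `pad_or_truncate`, `pad_id=0`, and the token IDs are assumptions for illustration, and a real tokenizer also preserves special tokens when it truncates.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (ids, mask), both exactly max_len long."""
    # Keep at most max_len tokens.
    ids = list(token_ids[:max_len])
    # Real tokens get mask 1, padding gets mask 0.
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    ids += [pad_id] * padding
    mask += [0] * padding
    return ids, mask

# Made-up token IDs: one sequence shorter than max_len, one longer.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_len=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_len=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why it is returned together with the IDs.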


In [4]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

# Shuffling is not needed for evaluation
test_params = {'batch_size': VALID_BATCH_SIZE,
               'shuffle': False,
               'num_workers': 0
               }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)
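
As a rough guide to how long one epoch is, the number of batches per loader can be worked out from the split sizes and batch sizes reported earlier. This is a small arithmetic sketch, assuming the loaders keep the default `drop_last=False`.

```python
import math

# Split sizes and batch sizes as reported earlier in the notebook.
TRAIN_ROWS, TEST_ROWS = 304576, 76144
TRAIN_BATCH_SIZE, VALID_BATCH_SIZE = 32, 8

# With drop_last=False, a loader yields ceil(n / batch_size) batches.
train_steps = math.ceil(TRAIN_ROWS / TRAIN_BATCH_SIZE)
test_steps = math.ceil(TEST_ROWS / VALID_BATCH_SIZE)

print(train_steps, test_steps)
```

Both loaders come out at 9518 batches, so a full training epoch is several thousand optimizer steps.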

In [5]:
# Creating the customized model, by adding a drop out and a dense layer on top of distil bert to get the final output for the model. 

class DistillBERTClass(torch.nn.Module):
    def __init__(self):
        super(DistillBERTClass, self).__init__()
        self.l1 = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.3)
        self.classifier = torch.nn.Linear(768, 12)

    def forward(self, input_ids, attention_mask):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

model = DistillBERTClass()
model.to(device)

# Creating the loss function and optimizer
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

In [6]:
# Function to count the correct predictions in a batch (used to compute the accuracy)

def calcuate_accu(big_idx, targets):
    n_correct = (big_idx==targets).sum().item()
    return n_correct

# Defining the training function on the 80% of the dataset for tuning the distilbert model

def train(epoch):
    tr_loss = 0
    n_correct = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    model.train()
    for i,data in enumerate(training_loader, 0):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.long)

        outputs = model(ids, mask)
        loss = loss_function(outputs, targets)
        tr_loss += loss.item()
        big_val, big_idx = torch.max(outputs.data, dim=1)
        n_correct += calcuate_accu(big_idx, targets)

        nb_tr_steps += 1
        nb_tr_examples+=targets.size(0)
        
        # Report the running loss and accuracy every 5 steps
        if i % 5 == 0:
            loss_step = tr_loss/nb_tr_steps
            accu_step = (n_correct*100)/nb_tr_examples
            print(f"[ {i} ] Training Loss {loss_step:.3f}---Training Accuracy: {accu_step:.3f}")

        # Clear old gradients, backpropagate, and update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}')
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Training Loss Epoch: {epoch_loss}")
    print(f"Training Accuracy Epoch: {epoch_accu}")

    return 

for epoch in range(EPOCHS):
    train(epoch)

[ 0 ] Training Loss 2.487---Training Accuracy: 6.250
[ 5 ] Training Loss 9.637---Training Accuracy: 6.250
[ 10 ] Training Loss 7.684---Training Accuracy: 7.102
[ 15 ] Training Loss 6.122---Training Accuracy: 7.812
[ 20 ] Training Loss 5.264---Training Accuracy: 8.185
[ 25 ] Training Loss 4.742---Training Accuracy: 7.572
[ 30 ] Training Loss 4.379---Training Accuracy: 7.964
[ 35 ] Training Loss 4.115---Training Accuracy: 8.420
[ 40 ] Training Loss 3.917---Training Accuracy: 8.308
[ 45 ] Training Loss 3.762---Training Accuracy: 8.152
[ 50 ] Training Loss 3.636---Training Accuracy: 8.333
[ 55 ] Training Loss 3.532---Training Accuracy: 8.315
[ 60 ] Training Loss 3.445---Training Accuracy: 8.607
[ 65 ] Training Loss 3.372---Training Accuracy: 8.807
[ 70 ] Training Loss 3.309---Training Accuracy: 8.803
[ 75 ] Training Loss 3.254---Training Accuracy: 8.840
[ 80 ] Training Loss 3.206---Training Accuracy: 9.144
[ 85 ] Training Loss 3.163---Training Accuracy: 9.375
[ 90 ] Training Loss 3.126---T

KeyboardInterrupt: ignored
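
As a sanity check on the log above, it helps to know what uniform random guessing over 12 classes would score. This is a small arithmetic sketch, not part of the original run.

```python
import math

NUM_CLASSES = 12

# Cross-entropy of a uniform prediction over 12 classes is ln(12).
chance_loss = math.log(NUM_CLASSES)

# Expected accuracy of uniform random guessing, in percent.
chance_accuracy = 100.0 / NUM_CLASSES

print(f"{chance_loss:.3f}")
print(f"{chance_accuracy:.3f}")
```

The first logged loss of 2.487 is close to ln(12) ≈ 2.485, and the logged accuracy stays between roughly 6% and 9%, near the 8.333% chance level. This suggests the model had learned little in the steps completed before the run was interrupted.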