<a href="https://colab.research.google.com/github/aaronjoseph/KB_Final/blob/master/3_a_RoBERTa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/88/b1/41130a228dd656a1a31ba281598a968320283f48d42782845f6ba567f00b/transformers-4.2.2-py3-none-any.whl (1.8MB)
[K     |████████████████████████████████| 1.8MB 7.9MB/s 
Collecting sacremoses
[?25l  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
[K     |████████████████████████████████| 890kB 34.6MB/s 
Collecting tokenizers==0.9.4
[?25l  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
[K     |████████████████████████████████| 2.9MB 36.5MB/s 
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... [?25l[?25hdone
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=54710d9dac903b8a735

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch
import seaborn as sns
import transformers
import json
from tqdm import tqdm 
from torch.utils.data import Dataset, DataLoader
from transformers import RobertaModel, RobertaTokenizer
import logging
logging.basicConfig(level=logging.ERROR)

In [3]:
# Setting device as GPU
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
print(device)
# CUDA is being used for traning

cuda


In [4]:
# Data Loading 
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/training.1600000.processed.noemoticon.csv',header=None,error_bad_lines=False,engine='python')
df.columns=['Sentiment', 'id', 'Date', 'Query', 'User', 'Phrase']
df = df.drop(columns=['id', 'Date', 'Query', 'User'], axis=1)
df['Sentiment'] = df.Sentiment.replace(4,1)

In [22]:
df_1 = df[['Phrase', 'Sentiment']]
df_new = df_1.sample(frac=0.1,random_state=200).reset_index(drop=True)

<a id='section03'></a>
### Preparing the Dataset and Dataloader

- Defining few key variables | Which will be used in the training/fine tuning stage
- Dataset Class - Will pre-process the data before feeding it into the Neural Netork
 [Pytorch Documentation on Datasets](https://pytorch.org/docs/stable/data.html)

#### *SentimentData* Dataset Class

- This class is defined to accept the Dataframe as input and generate tokenized output that is used by the Roberta model for training
- I am using the Roberta tokenizer to tokenize the data in the `TITLE` column of the dataframe. 
- The tokenizer uses the `encode_plus` method to perform tokenization and generate the necessary outputs, namely: `ids`, `attention_mask`
- To read further into the tokenizer, [refer to this document](https://huggingface.co/transformers/model_doc/roberta.html#robertatokenizer)
- `target` is the encoded category on the news headline. 
- The *SentimentData* class is used to create 2 datasets, for training and for validation.
- *Training Dataset* is used to fine tune the model: **80% of the original data**
- *Validation Dataset* is used to evaluate the performance of the model. The model has not seen this data during training. 

#### Dataloader

- Dataloader is used to for creating training and validation dataloader that load data to the neural network in a defined manner. This is needed because all the data from the dataset cannot be loaded to the memory at once, hence the amount of dataloaded to the memory and then passed to the neural network needs to be controlled.
- This control is achieved using the parameters such as `batch_size` and `max_len`.
- Training and Validation dataloaders are used in the training and validation part of the flow respectively

In [23]:
# Key Variable Defination
MAX_LEN = 256
TRAIN_BATCH_SIZE = 8
VALID_BATCH_SIZE = 4
# EPOCHS = 1
LEARNING_RATE = 1e-05
tokenizer = RobertaTokenizer.from_pretrained('roberta-base', truncation=True, do_lower_case=True)

In [24]:
class SentimentData(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.Phrase
        self.targets = self.data.Sentiment
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]


        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }

In [25]:
new_df

Unnamed: 0,Phrase,Sentiment
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",0
1,is upset that he can't update his Facebook by ...,0
2,@Kenichan I dived many times for the ball. Man...,0
3,my whole body feels itchy and like its on fire,0
4,"@nationwideclass no, it's not behaving at all....",0
...,...,...
1599995,Just woke up. Having no school is the best fee...,1
1599996,TheWDB.com - Very cool to hear old Walt interv...,1
1599997,Are you ready for your MoJo Makeover? Ask me f...,1
1599998,Happy 38th Birthday to my boo of alll time!!! ...,1


In [26]:
train_size = 0.7
train_data=new_df.sample(frac=train_size,random_state=200)
test_data=new_df.drop(train_data.index).reset_index(drop=True)
train_data = train_data.reset_index(drop=True)

print("FULL Dataset: {}".format(new_df.shape))
print("TRAIN Dataset: {}".format(train_data.shape))
print("TEST Dataset: {}".format(test_data.shape))

FULL Dataset: (1600000, 2)
TRAIN Dataset: (1120000, 2)
TEST Dataset: (480000, 2)


In [27]:
training_set = SentimentData(train_data, tokenizer, MAX_LEN)
testing_set = SentimentData(test_data, tokenizer, MAX_LEN)

train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

### Creating the Neural Network for Fine Tuning

#### Neural Network
 - We will be creating a neural network with the `RobertaClass`. 
 - This network will have the Roberta Language model followed by a `dropout` and finally a `Linear` layer to obtain the final outputs. 
 - The data will be fed to the Roberta Language model as defined in the dataset. 
 - Final layer outputs is what will be compared to the `Sentiment category` to determine the accuracy of models prediction. 
 - We will initiate an instance of the network called `model`. This instance will be used for training and then to save the final trained model for future inference. 
 
#### Loss Function and Optimizer
 - `Loss Function` and `Optimizer` and defined in the next cell.
 - The `Loss Function` is used the calculate the difference in the output created by the model and the actual output. 
 - `Optimizer` is used to update the weights of the neural network to improve its performance.

In [28]:
class RobertaClass(torch.nn.Module):
    def __init__(self):
        super(RobertaClass, self).__init__()
        self.l1 = RobertaModel.from_pretrained("roberta-base")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.3)
        self.classifier = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

In [29]:
model = RobertaClass()
model.to(device)

RobertaClass(
  (l1): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), e

In [30]:
model_parameters = filter(lambda p: p.requires_grad, model.parameters())
params = sum([np.prod(p.size()) for p in model_parameters])
print("Number of Trainable Parameters = ",params)

Number of Trainable Parameters =  125237762


<a id='section05'></a>
### Fine Tuning the Model

- Here the training function of training the model takes place. Also, as defined by EPOCHs
- The dataloader passes data to the model based on the batch size. 
- Subsequent output from the model and the actual category are compared to calculate the loss. 
- Loss value is used to optimize the weights of the neurons in the network.
- After every 5000 steps the loss value is printed in the console.

In [31]:
#Loss Function & Optimizer
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

def calcuate_accuracy(preds, targets):
    n_correct = (preds==targets).sum().item()
    return n_correct

In [32]:
def train(epoch):
    tr_loss = 0
    n_correct = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    model.train()
    for _,data in tqdm(enumerate(training_loader, 0)):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.long)

        outputs = model(ids, mask, token_type_ids)
        loss = loss_function(outputs, targets)
        tr_loss += loss.item()
        big_val, big_idx = torch.max(outputs.data, dim=1)
        n_correct += calcuate_accuracy(big_idx, targets)

        nb_tr_steps += 1
        nb_tr_examples+=targets.size(0)
        
        if _%5000==0:
            loss_step = tr_loss/nb_tr_steps
            accu_step = (n_correct*100)/nb_tr_examples 
            print(f"Training Loss per 5000 steps: {loss_step}")
            print(f"Training Accuracy per 5000 steps: {accu_step}")

        optimizer.zero_grad()
        loss.backward()
        # # When using GPU
        optimizer.step()

    print(f'The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}')
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Training Loss Epoch: {epoch_loss}")
    print(f"Training Accuracy Epoch: {epoch_accu}")

    return 

In [None]:
# Keeping the model at 1 EPOCH
EPOCHS = 1
for epoch in range(EPOCHS):
    train(epoch)


0it [00:00, ?it/s][ATruncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.

1it [00:00,  3.02it/s][A

Training Loss per 5000 steps: 0.7032142877578735
Training Accuracy per 5000 steps: 25.0


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
2it [00:00,  3.26it/s][A
3it [00:00,  3.45it/s][A
4it [00:01,  3.60it/s][A
5it [00:01,  3.71it/s][A
6it [00:01,  3.73it/s][A
7it [00:01,  3.79it/s][A
8it [00:02,  3.87it/s][A
9it [00:02,  3.86it/s][A
10it [00:02,  3.90it/s][A
11it [00:02,  3.93it/s][A
12it [00:03,  3.96it/s][A
13it [00:03,  3.98it/s][A
14it [00:03,  3.99it/s][A
15it [00:03,  3.91it/s][A
16it [00:04,  3.94it/s][A
17it [00:04,  3.89it/s][A
18it [00:04,  3.87it/s][A
19it [00:04,  3.89it/s][A
20it [00:05,  3.93it/s][A
21it [00:05,  3.96it/s][A
22it [00:05,  3.97it/s][A
23it [00:05,  3.98it/s][A
24it [00:06,  4.00it/s][A
25it [00:06,  4.00it/s][A
26it [00:06,  4.02it/s][A
27it [00:06,  4.01it/s][A
28it [00:07,  3.99it/s][A
29it [00:07,  3.93it/s][A
30it [00:07,  3.89it/s][A
31it [00:07,  3.88it/s][A
32it [00:08,  3.92it/s][A
33it [00:08,  3.97it/s][A
34it [00:08,  3.99it/s][A
35it [00:08,  3.99it/s][A
36it [00:09,  3.95it/s][

Training Loss per 5000 steps: 0.40826819931957525
Training Accuracy per 5000 steps: 81.62367526494701


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
5002it [21:00,  3.94it/s][A
5003it [21:00,  3.98it/s][A
5004it [21:00,  4.01it/s][A
5005it [21:00,  4.03it/s][A
5006it [21:01,  4.04it/s][A
5007it [21:01,  4.05it/s][A
5008it [21:01,  4.06it/s][A
5009it [21:01,  4.05it/s][A
5010it [21:02,  3.98it/s][A
5011it [21:02,  4.00it/s][A
5012it [21:02,  4.01it/s][A
5013it [21:02,  4.00it/s][A
5014it [21:03,  3.99it/s][A
5015it [21:03,  3.90it/s][A
5016it [21:03,  3.88it/s][A
5017it [21:03,  3.91it/s][A
5018it [21:04,  3.88it/s][A
5019it [21:04,  3.89it/s][A
5020it [21:04,  3.93it/s][A
5021it [21:04,  3.96it/s][A
5022it [21:05,  4.00it/s][A
5023it [21:05,  4.01it/s][A
5024it [21:05,  3.94it/s][A
5025it [21:05,  3.98it/s][A
5026it [21:06,  3.99it/s][A
5027it [21:06,  4.02it/s][A
5028it [21:06,  4.03it/s][A
5029it [21:06,  4.03it/s][A
5030it [21:07,  4.04it/s][A
5031it [21:07,  4.05it/s][A
5032it [21:07,  4.02it/s][A
5033it [21:07,  3.95it/s][A
5034it 

Training Loss per 5000 steps: 0.3889681040231612
Training Accuracy per 5000 steps: 82.65173482651736


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
10002it [41:59,  4.04it/s][A
10003it [41:59,  4.03it/s][A
10004it [42:00,  4.02it/s][A
10005it [42:00,  4.05it/s][A
10006it [42:00,  3.99it/s][A
10007it [42:00,  3.96it/s][A
10008it [42:01,  3.98it/s][A
10009it [42:01,  3.93it/s][A
10010it [42:01,  3.90it/s][A
10011it [42:01,  3.91it/s][A
10012it [42:02,  3.93it/s][A
10013it [42:02,  3.92it/s][A
10014it [42:02,  3.85it/s][A
10015it [42:02,  3.91it/s][A
10016it [42:03,  3.95it/s][A
10017it [42:03,  4.00it/s][A
10018it [42:03,  4.02it/s][A
10019it [42:03,  4.02it/s][A
10020it [42:04,  4.03it/s][A
10021it [42:04,  4.05it/s][A
10022it [42:04,  4.00it/s][A
10023it [42:04,  3.99it/s][A
10024it [42:05,  4.02it/s][A
10025it [42:05,  4.03it/s][A
10026it [42:05,  3.98it/s][A
10027it [42:05,  3.99it/s][A
10028it [42:06,  3.99it/s][A
10029it [42:06,  3.98it/s][A
10030it [42:06,  3.92it/s][A
10031it [42:06,  3.95it/s][A
10032it [42:07,  3.98it/s][A
10033

Training Loss per 5000 steps: 0.37807415084861595
Training Accuracy per 5000 steps: 83.35444303713086


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
15002it [1:02:57,  4.04it/s][A
15003it [1:02:57,  3.99it/s][A
15004it [1:02:58,  3.95it/s][A
15005it [1:02:58,  3.98it/s][A
15006it [1:02:58,  3.99it/s][A
15007it [1:02:58,  4.01it/s][A
15008it [1:02:59,  4.03it/s][A
15009it [1:02:59,  3.97it/s][A
15010it [1:02:59,  3.98it/s][A
15011it [1:02:59,  4.00it/s][A
15012it [1:03:00,  3.95it/s][A
15013it [1:03:00,  3.96it/s][A
15014it [1:03:00,  3.91it/s][A
15015it [1:03:00,  3.96it/s][A
15016it [1:03:01,  3.98it/s][A
15017it [1:03:01,  4.00it/s][A
15018it [1:03:01,  4.01it/s][A
15019it [1:03:01,  4.02it/s][A
15020it [1:03:02,  4.04it/s][A
15021it [1:03:02,  4.04it/s][A
15022it [1:03:02,  4.03it/s][A
15023it [1:03:02,  4.04it/s][A
15024it [1:03:03,  4.04it/s][A
15025it [1:03:03,  4.04it/s][A
15026it [1:03:03,  4.05it/s][A
15027it [1:03:03,  4.04it/s][A
15028it [1:03:04,  4.03it/s][A
15029it [1:03:04,  4.05it/s][A
15030it [1:03:04,  4.07it/s][A
15031it

Training Loss per 5000 steps: 0.3710747873530251
Training Accuracy per 5000 steps: 83.72081395930203


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
20002it [1:23:54,  3.88it/s][A
20003it [1:23:54,  3.86it/s][A
20004it [1:23:54,  3.91it/s][A
20005it [1:23:55,  3.96it/s][A
20006it [1:23:55,  3.98it/s][A
20007it [1:23:55,  4.01it/s][A
20008it [1:23:55,  4.02it/s][A
20009it [1:23:56,  4.01it/s][A
20010it [1:23:56,  4.01it/s][A
20011it [1:23:56,  4.04it/s][A
20012it [1:23:56,  4.04it/s][A
20013it [1:23:57,  4.03it/s][A
20014it [1:23:57,  4.02it/s][A
20015it [1:23:57,  4.03it/s][A
20016it [1:23:57,  4.05it/s][A
20017it [1:23:58,  4.05it/s][A
20018it [1:23:58,  4.04it/s][A
20019it [1:23:58,  4.04it/s][A
20020it [1:23:58,  3.98it/s][A
20021it [1:23:59,  4.02it/s][A
20022it [1:23:59,  4.04it/s][A
20023it [1:23:59,  4.05it/s][A
20024it [1:23:59,  4.06it/s][A
20025it [1:23:59,  4.05it/s][A
20026it [1:24:00,  3.97it/s][A
20027it [1:24:00,  3.92it/s][A
20028it [1:24:00,  3.88it/s][A
20029it [1:24:01,  3.92it/s][A
20030it [1:24:01,  3.96it/s][A
20031it

Training Loss per 5000 steps: 0.36616025105797323
Training Accuracy per 5000 steps: 83.97364105435783


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
25002it [1:44:50,  3.95it/s][A
25003it [1:44:50,  4.00it/s][A
25004it [1:44:50,  4.02it/s][A
25005it [1:44:51,  4.03it/s][A
25006it [1:44:51,  4.04it/s][A
25007it [1:44:51,  4.05it/s][A
25008it [1:44:51,  4.00it/s][A
25009it [1:44:52,  3.94it/s][A
25010it [1:44:52,  3.91it/s][A
25011it [1:44:52,  3.89it/s][A
25012it [1:44:52,  3.87it/s][A
25013it [1:44:53,  3.91it/s][A
25014it [1:44:53,  3.98it/s][A
25015it [1:44:53,  3.99it/s][A
25016it [1:44:53,  3.96it/s][A
25017it [1:44:54,  4.00it/s][A
25018it [1:44:54,  4.02it/s][A
25019it [1:44:54,  4.03it/s][A
25020it [1:44:54,  4.04it/s][A
25021it [1:44:55,  4.04it/s][A
25022it [1:44:55,  4.05it/s][A
25023it [1:44:55,  4.07it/s][A
25024it [1:44:55,  4.08it/s][A
25025it [1:44:56,  4.02it/s][A
25026it [1:44:56,  4.02it/s][A
25027it [1:44:56,  3.98it/s][A
25028it [1:44:56,  3.98it/s][A
25029it [1:44:57,  4.02it/s][A
25030it [1:44:57,  4.02it/s][A
25031it

Training Loss per 5000 steps: 0.362117808349926
Training Accuracy per 5000 steps: 84.17677744075198



30002it [2:05:47,  3.94it/s][A
30003it [2:05:47,  3.96it/s][A
30004it [2:05:47,  3.97it/s][A
30005it [2:05:48,  3.88it/s][A
30006it [2:05:48,  3.86it/s][A
30007it [2:05:48,  3.82it/s][A
30008it [2:05:48,  3.86it/s][A
30009it [2:05:49,  3.90it/s][A
30010it [2:05:49,  3.87it/s][A
30011it [2:05:49,  3.91it/s][A
30012it [2:05:49,  3.88it/s][A
30013it [2:05:50,  3.91it/s][A
30014it [2:05:50,  3.92it/s][A
30015it [2:05:50,  3.92it/s][A
30016it [2:05:50,  3.84it/s][A
30017it [2:05:51,  3.90it/s][A
30018it [2:05:51,  3.94it/s][A
30019it [2:05:51,  3.89it/s][A
30020it [2:05:51,  3.86it/s][A
30021it [2:05:52,  3.90it/s][A
30022it [2:05:52,  3.95it/s][A
30023it [2:05:52,  3.98it/s][A
30024it [2:05:52,  4.01it/s][A
30025it [2:05:53,  4.02it/s][A
30026it [2:05:53,  4.05it/s][A
30027it [2:05:53,  4.05it/s][A
30028it [2:05:53,  4.00it/s][A
30029it [2:05:54,  4.01it/s][A
30030it [2:05:54,  4.03it/s][A
30031it [2:05:54,  3.99it/s][A
30032it [2:05:54,  4.01it/s][A
30033it

# Validating the Model 

Now testing the data against the 

# Saving the Trained Model Artifact for Inferencing at a later stage

In [None]:
output_model_file = 'pytorch_roberta_sentiment.bin'
output_vocab_file = './'
model_to_save = model 
torch.save(model_to_save, output_model_file)
tokenizer.save_vocabulary(output_vocab_file)
print("Model & Tokenizer saved")