<a href="https://colab.research.google.com/github/diegovianagomes/IMDB-Learning/blob/main/IMDB_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports and Initial Setup
We'll use pre-trained Hugging Face models, so this section installs the transformers package and imports every library the pipeline needs.

In [None]:
!pip -q install transformers

In [None]:
import numpy as np
import pandas as pd
import torch

from torch.utils.data import Dataset, DataLoader
from transformers import BertForSequenceClassification, BertTokenizer, pipeline
from sklearn import preprocessing
from tqdm import tqdm

### Control flags and Hyperparameters

In [None]:
MAX_LENGTH = 512

TRAIN_RATIO = 0.7
VAL_RATIO = 0.2
TEST_RATIO = 0.1

BATCH_SIZE = 16

### CPU/GPU configuration


In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Processing unit:", device)

## Download and explore datasets

In this section we'll load the IMDB reviews dataset from a CSV file and take a first look at it.

In [None]:
df = pd.read_csv("/content/imdb-reviews-pt-br.csv")
#df.head()

Dataset size

In [None]:
print(f"number of examples: {len(df)}")

Distribution of classes

In [None]:
df['sentiment'].value_counts()

In [None]:
df

## Task demonstration in English




In [None]:
pipeline??

In [None]:
#bert_en = pipeline('sentiment-analysis', model='nlptown/bert-base-multilingual-uncased-sentiment')
bert_en = pipeline('sentiment-analysis')

In [None]:
test_instance = 100
df['text_en'][test_instance], bert_en(df['text_en'][test_instance])

## Task demonstration in Portuguese
### Tokenization
---
The model takes numbers rather than words as input, so each review must first be converted to token IDs. We'll use the BERT tokenizer, which splits text into subword units with the WordPiece algorithm.
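
To make the subword idea concrete, here is a minimal sketch of WordPiece-style tokenization: each word is split greedily into the longest pieces found in a vocabulary, and pieces that continue a word carry a `##` prefix. The function name `wordpiece_tokenize` and the tiny vocabulary are made up for illustration; the real tokenizer loads a vocabulary of thousands of pieces from the pre-trained checkpoint and also handles casing, punctuation and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # marks a piece that continues a word
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            # No piece matched: the whole word becomes the unknown token
            return [unk_token]
        tokens.append(current)
        start = end
    return tokens


# Hypothetical toy vocabulary, NOT the one shipped with the model
toy_vocab = {"film", "##e", "##s", "bom", "##mente"}

print(wordpiece_tokenize("filmes", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, "filmes" is split into three pieces, while a word with no matching pieces falls back to the unknown token.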

In [None]:
tokenizer = BertTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')

Our dataset is relatively small, so we can keep it entirely in RAM, already tokenized.

The cell below tokenizes the entire dataset
---
- df_tokenized will be a dictionary-like object with the keys ['input_ids', 'token_type_ids', 'attention_mask']
- input_ids -> the tokenized instances (token IDs)
- token_type_ids -> mask used in sentence-pair classification tasks (it is discarded in this task)
- attention_mask -> mask that tells the model which positions hold real tokens and which hold padding tokens [PAD]
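
As a rough illustration of what padding, truncation and the attention mask mean, the sketch below pads a batch of already-tokenized sequences to a common length and builds the matching mask (1 for real tokens, 0 for padding). The helper `pad_and_mask` and the token IDs are hypothetical; the real tokenizer does this internally and, unlike this toy, keeps the closing special token when it truncates.

```python
def pad_and_mask(sequences, max_length, pad_id=0):
    """Pad (or truncate) lists of token IDs and build their attention masks."""
    # Pad to the longest sequence in the batch, but never beyond max_length
    target_len = min(max(len(seq) for seq in sequences), max_length)

    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target_len]            # naive truncation
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token IDs for two reviews of different lengths
batch = [[101, 7, 102], [101, 7, 8, 9, 102]]
ids, mask = pad_and_mask(batch, max_length=4)

print(ids)
print(mask)
```

Every row ends up with the same length, which is what allows the whole dataset to be stored as one rectangular tensor.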

In [None]:
df_tokenized = tokenizer.batch_encode_plus(df['text_pt'].tolist(), return_tensors='pt', padding=True, truncation=True, max_length=MAX_LENGTH)

Visual inspection of the dataset format

In [None]:
print(df_tokenized["input_ids"].shape, df_tokenized["attention_mask"].shape)

Next we define X and y. Rather than a dictionary, the model inputs are easier to handle as a single tensor X of shape [2, DATASET_LEN, MAX_LENGTH] (the last dimension is the padded sequence length, at most MAX_LENGTH), indexed as follows:

- X[0, :, :] = input_ids
- X[1, :, :] = attention_mask

In [None]:
X = torch.stack((df_tokenized['input_ids'], df_tokenized["attention_mask"]), dim=0)

df['sentiment'] = df['sentiment'].apply(lambda x: 0 if x == 'neg' else 1)

y = torch.Tensor(df['sentiment'].to_numpy())

## DataLoader

A DataLoader wraps a PyTorch Dataset and serves it in batches. Let's define the dataset class that the DataLoader will consume when we feed the model with data during training.

To create a customised dataset in PyTorch, you create a class that inherits from the Dataset base class provided by PyTorch itself. This base class defines the interface through which your data is organised and accessed.

Within this customised class, there are two essential methods that you need to implement:

- `__len__`: returns the total number of samples in the dataset, which tells the DataLoader how many samples are available. For example, if your dataset contains 1000 images, `__len__` should return 1000.
- `__getitem__`: returns one specific sample. It takes an index as input (e.g. idx) and returns the sample at that index, which is how the DataLoader loads each item individually.
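
The two methods can be tried out without PyTorch. The sketch below is a plain-Python stand-in: `ToyDataset` implements the same two methods, and `iterate_batches` is a hypothetical, much simplified imitation of what a DataLoader does with them (ask for the length, then fetch items by index and group them into batches).

```python
import random


class ToyDataset:
    """Plain-Python class exposing the same two methods a PyTorch Dataset needs."""

    def __init__(self, features, labels):
        assert len(features) == len(labels)
        self.features = features
        self.labels = labels

    def __len__(self):
        # Total number of samples
        return len(self.labels)

    def __getitem__(self, idx):
        # One (features, label) pair
        return self.features[idx], self.labels[idx]


def iterate_batches(dataset, batch_size, shuffle=False, seed=0):
    """Simplified imitation of a data loader: yields lists of samples."""
    indices = list(range(len(dataset)))          # relies on __len__
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = [dataset[i] for i in indices[start:start + batch_size]]  # relies on __getitem__
        features, labels = zip(*chunk)
        yield list(features), list(labels)


toy = ToyDataset([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], [0, 1, 0, 1, 1])
batches = list(iterate_batches(toy, batch_size=2))

print(len(toy), toy[2])
print([len(labels) for _, labels in batches])
```

Five samples with a batch size of 2 give two full batches and a final batch with a single sample, which mirrors the default behaviour of a loader that does not drop the last incomplete batch.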

In [None]:
class TextDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.X = self.X.to(device)

        self.y = y
        self.y = self.y.to(device)

        self.len = len(y)

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        return self.X[:, idx], self.y[idx]


The next cell will be responsible for instantiating our training, validation and test loaders.

In [None]:
dataset = TextDataset(X, y)

# Number of instances in each split. The test split takes the remainder so that
# the three sizes always sum to the dataset length, as random_split requires.
num_train_instances = int(np.round(dataset.len * TRAIN_RATIO))
num_val_instances = int(np.round(dataset.len * VAL_RATIO))
num_test_instances = dataset.len - num_train_instances - num_val_instances

print(f"TRAIN: {num_train_instances}, VAL: {num_val_instances}, TEST: {num_test_instances}")


train_split, val_split, test_split = torch.utils.data.random_split(dataset, [num_train_instances, num_val_instances, num_test_instances])

train_loader = torch.utils.data.DataLoader(train_split, batch_size=BATCH_SIZE, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_split, batch_size=BATCH_SIZE, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_split, batch_size=BATCH_SIZE, shuffle=True)

## Train Setup



In [None]:
epochs = 40
steps_per_epoch = 200
epoch_validation_samples = 50

model = BertForSequenceClassification.from_pretrained('neuralmind/bert-base-portuguese-cased').to(device)

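# Freeze the BERT encoder so that only the classification head is trained
# (set requires_grad to True to fine-tune the whole model instead)
for param in model.base_model.parameters():
    param.requires_grad = False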

loss_func = torch.nn.CrossEntropyLoss()
optim = torch.optim.Adam(model.parameters())

acc_calc = lambda output, labels : (labels == output.argmax(axis=1)).sum()

scheduler = torch.optim.lr_scheduler.ExponentialLR(optim, gamma=0.9997)
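
To get a feel for the decay factor, the sketch below computes the learning rate that an exponential schedule produces after a given number of scheduler steps, using lr = initial_lr * gamma ** steps. It assumes Adam's default learning rate of 1e-3 (the optimizer above is created without an explicit value), a scheduler that is stepped once per batch, and epochs that all run the full 200 steps. The helper `exponential_lr` is illustrative, not part of PyTorch.

```python
def exponential_lr(initial_lr, gamma, num_steps):
    """Learning rate after num_steps multiplicative decays by gamma."""
    return initial_lr * gamma ** num_steps


initial_lr = 1e-3        # assumption: Adam's default learning rate
gamma = 0.9997
steps_per_epoch = 200
epochs = 40

for epoch in (1, 10, 20, 40):
    steps = epoch * steps_per_epoch
    print(f"after epoch {epoch:2d} ({steps} steps): lr = {exponential_lr(initial_lr, gamma, steps):.6f}")

final_lr = exponential_lr(initial_lr, gamma, epochs * steps_per_epoch)
```

Under these assumptions the learning rate ends the run at roughly a tenth of its starting value, so a gamma this close to 1 still has a noticeable effect when the scheduler is stepped on every batch.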

In [None]:
epoch_metadata = []

for i in range(epochs):
    num_train_examples = 0
    num_val_examples = 0

    train_hits = 0
    val_hits = 0

    # Losses are accumulated over the whole epoch (reset once per epoch, not per batch)
    train_running_loss = 0
    val_running_loss = 0

    train_bar = tqdm(total = steps_per_epoch, desc = "Train", unit = "steps", position = 0, leave = True)

    model.train()

    for batch_number, (features, labels) in enumerate(train_loader):
        # features has shape [batch, 2, seq_len]: index 0 holds input_ids, index 1 the attention mask
        input_ids, input_masks = features[:, 0, :], features[:, 1, :]

        # Recent transformers versions return an output object, so read loss and logits by name
        outputs = model(input_ids, attention_mask=input_masks, labels=labels.long())
        loss, logits = outputs.loss, outputs.logits

        optim.zero_grad()
        loss.backward()
        optim.step()

        train_running_loss += loss.item()
        softmax_prediction = torch.nn.functional.softmax(logits, dim=1)
        train_hits += acc_calc(softmax_prediction, labels)

        train_bar.update(1)

        num_train_examples += features.shape[0]

        scheduler.step()

        if (batch_number + 1) % steps_per_epoch == 0:
            break

    train_bar.close()

    val_bar = tqdm(total = steps_per_epoch, desc = "Val", unit = "steps", position = 0, leave = True)

    model.eval()

    # Gradients are not needed for validation
    with torch.no_grad():
        for batch_number, (features, labels) in enumerate(val_loader):
            input_ids, input_masks = features[:, 0, :], features[:, 1, :]

            outputs = model(input_ids, attention_mask=input_masks, labels=labels.long())
            loss, logits = outputs.loss, outputs.logits

            val_running_loss += loss.item()
            softmax_prediction = torch.nn.functional.softmax(logits, dim=1)
            val_hits += acc_calc(softmax_prediction, labels)

            num_val_examples += features.shape[0]

            val_bar.update(1)

            if (batch_number + 1) % steps_per_epoch == 0:
                break

    val_bar.close()

    train_acc = torch.true_divide(train_hits, num_train_examples).item()
    val_acc = torch.true_divide(val_hits, num_val_examples).item()

    # Keep the per-epoch metrics for later inspection
    epoch_metadata.append({"epoch": i + 1, "train_loss": train_running_loss, "train_acc": train_acc, "val_loss": val_running_loss, "val_acc": val_acc})

    print(f"EPOCH SUMMARY - {i + 1} \t Train Loss: {train_running_loss} \t Train Acc: {train_acc} \t Val Loss: {val_running_loss} \t Val Acc: {val_acc}")
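
The accuracy bookkeeping in the loop above can be checked on plain lists. The sketch below counts a hit whenever the index of the largest score matches the label, and shows that applying a softmax first does not change the count, because softmax preserves the ordering of the scores. The logits and labels are made up, and the helper names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda k: values[k])


def count_hits(batch_scores, labels):
    """Number of rows whose top-scoring class equals the label."""
    return sum(int(argmax(row) == label) for row, label in zip(batch_scores, labels))


# Made-up logits for a batch of four reviews (class 0 = negative, 1 = positive)
batch_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, -0.2], [1.5, 1.4]]
labels = [0, 1, 0, 0]

hits_from_logits = count_hits(batch_logits, labels)
hits_from_probs = count_hits([softmax(row) for row in batch_logits], labels)
accuracy = hits_from_logits / len(labels)

print(hits_from_logits, hits_from_probs, accuracy)
```

Both counts agree, so the softmax in the loop matters only if the probabilities themselves are needed; the accuracy can be computed from the raw logits.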



In [None]:
model.save_pretrained(f'epoch_{i}')