<a href="https://colab.research.google.com/github/carlosjsaez/MultiClassBERT/blob/main/BERT_Multi_Class_for_Scoring_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is my solution to the task provided by Aeternity. The text commentary is there to assist and guide the reader through the code. I will keep the explanations short, as the work already exceeds the three-hour threshold. Feel free to ask me any questions about the thought process behind it.

We start by loading the data and doing some exploratory data analysis (EDA) on it:

In [3]:
import json
import pandas as pd

# Forward slash avoids the invalid '\A' escape sequence and works on every OS
dir_data = 'data/AMAZON_FASHION_5.json'

data_raw = []
with open(dir_data) as f:
    for line in f:
        data_raw.append(json.loads(line))

# Keep only the fields we need; 'overall' (the star rating) must be present
data = [[row.get('reviewerID'), row.get('verified'), row.get('unixReviewTime'), row.get('reviewText'), row['overall']]
        for row in data_raw]

df = pd.DataFrame(data, columns = ['id', 'verified', 'timestamp', 'review_text', 'target'])

# Data cleaning: nulls, duplicates (that's why we used the unique identifiers of id and timestamp), and only verified users
df = df.dropna(inplace = False)
df.drop_duplicates(inplace = True)
df = df[df.verified == True]
df.reset_index(drop = True, inplace = True)

df['target'] = df.target.astype(int).astype(str)
df = df.sort_values(by = 'target')

In [4]:
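import re

def clean_text(text):
  lower = text.lower()
  # Remove @mentions, every character that is not a letter, space, or tab
  # (so digits and punctuation go too), URLs, and a leading 'rt'
  words = re.sub(r"(@[A-Za-z]+)|([^A-Za-z \t])| (\w+:\/\/\S+)|^rt|http.+?", "", lower)
  # Split and re-join to collapse repeated whitespace
  words2 = words.split()
  # final_words =  [wnl().lemmatize(word , pos = 'v') for word in words2 if word not in stopwords.words('english')]
  final_words = ' '.join(words2)
  return final_words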

df['review_text'] = df['review_text'].apply(clean_text)
df.head()

index,id,verified,timestamp,review_text,target
121,A1BN6I0B2OF7WB,True,1511049600,i usually wear a size and they fit fine these ...,1
294,A199ICSPL9EXJ5,True,1483488000,returning these the pictures on here make the ...,1
308,A3PTZ7IHGU9BA8,True,1481760000,wrong shoes,1
83,A6CXK8NXD50R2,True,1520726400,they looked very cheap,1
39,A276HQXYS553QW,True,1518998400,constantly rolls down,1
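
To make the effect of the cleaning step concrete, here is a small self-contained sketch that applies the same regular expression to two made-up review strings (the sample sentences are assumptions, not rows from the dataset). Note that digits are stripped along with punctuation, which is why sizes disappear from the cleaned reviews.

```python
import re

# Same pattern as in clean_text above: @mentions, any character that is not a
# letter, space, or tab, URLs preceded by a space, a leading 'rt', 'http' remnants
PATTERN = r"(@[A-Za-z]+)|([^A-Za-z \t])| (\w+:\/\/\S+)|^rt|http.+?"

def clean_text(text):
    cleaned = re.sub(PATTERN, "", text.lower())
    return " ".join(cleaned.split())  # collapse repeated whitespace

# Hypothetical sample reviews, for illustration only
samples = [
    "I usually wear a size 8, and they fit fine!",
    "Wrong shoes... see http://example.com now",
]
cleaned_samples = [clean_text(s) for s in samples]
for original, cleaned in zip(samples, cleaned_samples):
    print(f"{original!r} -> {cleaned!r}")
```

The first sample becomes `i usually wear a size and they fit fine`: the size itself is lost, which may matter for fashion reviews.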


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 432 entries, 121 to 215
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           432 non-null    object
 1   verified     432 non-null    bool  
 2   timestamp    432 non-null    int64 
 3   review_text  432 non-null    object
 4   target       432 non-null    object
dtypes: bool(1), int64(1), object(3)
memory usage: 17.3+ KB


In [6]:
df.describe()

statistic,timestamp
count,432.0
mean,1490078000.0
std,28503050.0
min,1261699000.0
25%,1478542000.0
50%,1492085000.0
75%,1508911000.0
max,1530749000.0


In [7]:
df.target.value_counts()

5    287
4     65
3     48
1     17
2     15
Name: target, dtype: int64

Cleaning is finished: we dropped the duplicates (there were many, and we want a single entry per data point), kept only verified users (to try to avoid fake reviews), and kept just the review text as the model input.
The dataset is small (432 reviews) and very imbalanced (287 of them are 5-star), which needs to be tackled when selecting and tuning a model.
As a first step, we will use a stratified train_test_split so that both splits keep the same class proportions.
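
To quantify the imbalance, the sketch below derives inverse-frequency class weights from the counts printed above. The weighting formula (total / (number of classes × class count)) is one common heuristic and an assumption here; it is not used in the training run in this notebook.

```python
# Class counts taken from df.target.value_counts() above
class_counts = {'1': 17, '2': 15, '3': 48, '4': 65, '5': 287}

total = sum(class_counts.values())
n_classes = len(class_counts)

# Inverse-frequency weights: a class with the average count gets weight 1.0,
# rarer classes get proportionally larger weights
class_weights = {label: total / (n_classes * count)
                 for label, count in class_counts.items()}

for label, weight in class_weights.items():
    share = class_counts[label] / total
    print(f"class {label}: {share:.1%} of the data, weight {weight:.2f}")
```

Weights like these could later be passed as a tensor to `torch.nn.CrossEntropyLoss(weight=...)` to penalize mistakes on the rare classes more heavily.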

Regarding model selection, the choice was clear to me from the beginning: a neural network, specifically a pre-trained BERT fine-tuned for classification, was the right approach, as it can be applied quickly and can provide state-of-the-art performance. I had never used it before, but I know how it works and how it relates to PyTorch and TensorFlow, and with help from the literature it should not be complicated to make it work. I checked several articles and repositories (see the references below) to put together working code within the time frame provided.

More info about BERT for classification: https://www.geeksforgeeks.org/sentiment-classification-using-bert/

I discarded traditional text classification models, such as TF-IDF or bag-of-words features with regression or random forests, because transformers are the approach I would actually take if I were to work on this project for Aeternity. Obviously, in three hours we can only get a proof of concept (PoC) with preliminary results, but it better shows what my approach would be in such a scenario.

In [3]:
import torch
from tqdm.notebook import tqdm
%pip install transformers
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

from transformers import BertForSequenceClassification



In [4]:
possible_labels = df.target.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
label_dict

{'1': 0, '2': 1, '3': 2, '4': 3, '5': 4}

In [5]:
# map() converts each rating string to its integer class index
df['label'] = df.target.map(label_dict)

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.2, 
                                                  random_state=21, 
                                                  stratify=df.label.values)

df['data_type'] = ['not_set']*df.shape[0]

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

df.groupby(['target', 'label', 'data_type']).count()

target,label,data_type,id,verified,timestamp,review_text
1,0,train,14,14,14,14
1,0,val,3,3,3,3
2,1,train,12,12,12,12
2,1,val,3,3,3,3
3,2,train,38,38,38,38
3,2,val,10,10,10,10
4,3,train,52,52,52,52
4,3,val,13,13,13,13
5,4,train,229,229,229,229
5,4,val,58,58,58,58


This code uses the PyTorch framework. I also had time to test TensorFlow, but got better results and easier reproducibility with PyTorch.

In [7]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)

max_length = 64

# padding='max_length' replaces the deprecated pad_to_max_length=True, and
# truncation=True makes the cut at max_length explicit
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].review_text.values.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=max_length,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].review_text.values.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=max_length,
    return_tensors='pt'
)


In [8]:

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)

dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)
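
As a rough illustration of what the `input_ids` and `attention_mask` tensors above contain, the sketch below mimics padding and truncation to `max_length` on plain lists of token ids. The helper `pad_or_truncate` and the ids in the example are hypothetical; the real tokenizer also handles the special `[CLS]` and `[SEP]` tokens, which this toy version ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length.

    pad_id=0 is the id of [PAD] in the bert-base-uncased vocabulary.
    """
    kept = token_ids[:max_length]                    # truncate long sequences
    n_pad = max_length - len(kept)                   # amount of padding needed
    input_ids = kept + [pad_id] * n_pad              # pad short sequences
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Hypothetical token ids, for illustration only
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, so short reviews are not penalized for being padded up to 64 tokens.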

In [9]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [10]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 3

dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train), 
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val, 
                                   sampler=SequentialSampler(dataset_val), 
                                   batch_size=batch_size)

In [11]:
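from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# torch.optim.AdamW replaces the deprecated transformers.AdamW;
# weight_decay=0.0 keeps the default of the transformers implementation
optimizer = AdamW(model.parameters(),
                  lr=1e-5,
                  eps=1e-8,
                  weight_decay=0.0)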
                  
epochs = 5

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
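
With `num_warmup_steps=0`, the scheduler decays the learning rate linearly from its initial value to zero over the whole run. The function below is a plain-Python sketch of that multiplier (the name `linear_schedule_factor` is made up for illustration); the total of 575 steps assumes the 115 batches per epoch and 5 epochs of this notebook.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warm-up from 0 to 1
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
num_training_steps = 115 * 5  # batches per epoch * epochs

for step in (0, 115, 345, 575):
    lr = base_lr * linear_schedule_factor(step, 0, num_training_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

By the last epoch the learning rate is already below a fifth of its starting value, which is one possible reason the validation metrics barely move at the end of training.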

In [12]:
import numpy as np
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
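
Because the classes are so imbalanced, it helps to keep in mind how the weighted F1 used above differs from the macro F1. The toy example below (made-up labels, illustrative helper names) computes both by hand: a model that always predicts the majority class still obtains a fairly high weighted F1, while the macro F1 exposes the missed minority class.

```python
def f1_per_class(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_scores(y_true, y_pred):
    labels = sorted(set(y_true))
    per_class = {l: f1_per_class(y_true, y_pred, l) for l in labels}
    # Macro: plain average over classes; weighted: average weighted by class support
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[l] * y_true.count(l) for l in labels) / len(y_true)
    return macro, weighted

# Made-up labels: 8 examples of class 4 and 2 of class 0; the "model" always predicts 4
y_true = [4] * 8 + [0] * 2
y_pred = [4] * 10

macro_f1, weighted_f1 = f1_scores(y_true, y_pred)
print(f"macro F1: {macro_f1:.3f}, weighted F1: {weighted_f1:.3f}")
```

For this reason the weighted F1 reported during training should be read together with the per-class accuracies.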

In [14]:
import random
import numpy as np

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)  # the batches are moved to `device` below, so the model must be there too
seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals
    
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
        # The loss is already averaged over the batch; len(batch) is the number of tensors (3), not the batch size
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
         
        
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')



Epoch 1
Training loss: 0.891123566225819
Validation loss: 0.9443342585501999
F1 Score (Weighted): 0.5333333333333334




Epoch 2
Training loss: 0.7695049102539602
Validation loss: 0.9205619090333067
F1 Score (Weighted): 0.6497447219167153




Epoch 3
Training loss: 0.6568590407462224
Validation loss: 0.9265870925938261
F1 Score (Weighted): 0.642561921139168




Epoch 4
Training loss: 0.6053019105578247
Validation loss: 0.9380458610710399
F1 Score (Weighted): 0.642025522671735




Epoch 5
Training loss: 0.5920070150300213
Validation loss: 0.9380458610710399
F1 Score (Weighted): 0.642025522671735



In [18]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

model.load_state_dict(torch.load('finetuned_BERT_epoch_5.model', map_location=torch.device('cpu')))

_, predictions, true_vals = evaluate(dataloader_validation)
accuracy_per_class(predictions, true_vals)


Class: 1
Accuracy: 0/3

Class: 2
Accuracy: 0/3

Class: 3
Accuracy: 2/10

Class: 4
Accuracy: 3/13

Class: 5
Accuracy: 55/58
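
The per-class counts above can be summarized in two numbers. The short calculation below uses only the figures printed by `accuracy_per_class`: the overall accuracy, and the unweighted mean of the per-class accuracies (often called balanced accuracy). It also compares against a trivial baseline that always predicts class 5.

```python
# (correct, total) per class, copied from the output above
results = {'1': (0, 3), '2': (0, 3), '3': (2, 10), '4': (3, 13), '5': (55, 58)}

n_correct = sum(correct for correct, _ in results.values())
n_total = sum(total for _, total in results.values())

overall_accuracy = n_correct / n_total
balanced_accuracy = sum(c / t for c, t in results.values()) / len(results)
# Accuracy of a trivial baseline that always predicts the majority class '5'
majority_baseline = results['5'][1] / n_total

print(f"overall accuracy:  {overall_accuracy:.3f}")
print(f"balanced accuracy: {balanced_accuracy:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
```

The overall accuracy (60/87) is only slightly above the majority-class baseline (58/87), while the balanced accuracy is much lower, which shows how much the result is driven by class 5.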



These are the results of the first approach to this task. Obviously, they are not perfect and the model needs better fine-tuning: it predicts class 5 well (55/58) but misses every validation example of classes 1 and 2. Still, we are on the right path: we have set up a proper training environment and obtained a first baseline. Next steps should consider:

1.   More training, to be sure that this is the proper convergence point (modifying epochs and batch size).
2.   Modifications to the architecture: adding a better loss function to address the excessive imbalance (one that gives more weight to the less populated classes).
3.   Artificial resampling (over-sampling in this case).
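
As a sketch of the third point, random over-sampling can be done on the training indices alone: minority-class examples are drawn with replacement until every class has as many examples as the largest one. The helper `oversample_indices` is hypothetical, and the label list below only reproduces the training-set class counts from the split above, not the real data.

```python
import random
from collections import Counter, defaultdict

def oversample_indices(labels, seed=17):
    """Return indices into `labels` such that every class appears equally often."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    target = max(len(idxs) for idxs in by_class.values())
    resampled = []
    for idxs in by_class.values():
        resampled.extend(idxs)  # keep every original example
        resampled.extend(rng.choices(idxs, k=target - len(idxs)))  # add random duplicates
    rng.shuffle(resampled)
    return resampled

# Training-set class counts from the stratified split: 14, 12, 38, 52, 229
train_labels = [0] * 14 + [1] * 12 + [2] * 38 + [3] * 52 + [4] * 229
balanced_idx = oversample_indices(train_labels)
print(Counter(train_labels[i] for i in balanced_idx))
```

Only the training split should be resampled; the validation split has to keep its original distribution so that the metrics stay meaningful. In PyTorch, `torch.utils.data.WeightedRandomSampler` is an alternative that achieves a similar effect at the data loader level.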

References:


1.   https://github.com/susanli2016/NLP-with-Python/blob/master/Text_Classification_With_BERT.ipynb
2.   https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb
3.   https://huggingface.co/transformers/model_doc/bert.html
4.   https://www.tensorflow.org/official_models/fine_tuning_bert
5.   https://pytorch.org/docs/stable/generated/
6.   https://medium.com/nerd-for-tech/multi-class-classification-using-bert-3e02a050170d
7.   https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671


