## Text classification using transformer models

### Transformer
- The Transformer architecture, first introduced in the paper <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>, has changed the NLP field significantly.
- Attention combined with the masked language modelling objective made self-supervised learning work at scale and revolutionized NLP. We now have deep learning models that understand and model context very well.
- Large models trained on vast amounts of text data paved the way for better models for all NLP tasks, such as text classification, summarization, and question answering, to name a few.
- Instead of building a model from scratch, we can now fine-tune one of the LLMs for our specific task.

In this notebook we will explore the text / document classification task with one of the LLMs, namely DistilBERT. It is a smaller model whose performance is comparable to the BERT models built by Google. Since transformer literature is widely available, I am leaving a link to the article from which I learned the transformer architecture: <a href="https://jalammar.github.io/illustrated-transformer/">Illustrated Transformer by Jay Alammar</a>


### What is MLM?
Masked language modelling (MLM) is a self-supervised training objective for text understanding, though it can be adapted to other data domains as well. To build language understanding, a model needs representations that capture how words relate to each other and how the same word can stand for different things / concepts depending on the context. MLM builds these by masking parts of a sentence and training the model to predict the masked words / tokens from the context on both sides, that is, bidirectionally.
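
The masking step described above can be sketched in plain Python. This is a simplified illustration, not the full BERT recipe: it assumes whitespace tokens and always substitutes `[MASK]`, whereas BERT masks about 15% of tokens and sometimes keeps the original token or swaps in a random one. The helper name `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for one masked-language-modelling example.

    labels[i] holds the original token where position i was masked, else None,
    so a training loss would only be computed on the masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            labels.append(tok)         # ...and remember it as the prediction target
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "our deeds are the reason of this earthquake".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the model sees the unmasked tokens on both sides of each `[MASK]`, it has to use the full sentence context to recover the hidden token.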

In [42]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
import pandas as pd
import numpy as np
import re
import os
from torch.utils.data import Dataset, DataLoader
import torch

In [2]:
data_dir = '../data/nlp-getting-started/'

train_data = pd.read_csv(os.path.join(data_dir,'train.csv'))

In [3]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [4]:
tokenizer.tokenize(train_data['text'].iloc[0])

['our',
 'deeds',
 'are',
 'the',
 'reason',
 'of',
 'this',
 '#',
 'earthquake',
 'may',
 'allah',
 'forgive',
 'us',
 'all']

In [5]:
def tokenize(examples):
    return tokenizer(examples['text'], truncation=True)


In [61]:
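class TextDataset(Dataset):
    """Tokenizes one text per item; labels are optional (e.g. for the test set)."""

    def __init__(self, text, labels=None):
        super().__init__()
        self.text = text
        self.labels = labels  # None when no labels are available

    def __len__(self):
        return len(self.text)

    def __getitem__(self, idx):
        # Return plain lists (no return_tensors) so the data collator can pad each batch;
        # truncation keeps inputs within the model's maximum length.
        tokenized = tokenizer(self.text.iloc[idx], truncation=True)
        if self.labels is not None:
            tokenized['label'] = int(self.labels.iloc[idx])
        return tokenized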
                
        


In [44]:
from sklearn.model_selection import train_test_split

In [45]:
X_train, X_val, y_train, y_val = train_test_split(train_data['text'], train_data['target'], 
                                                  test_size=0.2, random_state=8)

In [46]:
train_dataset = TextDataset(X_train, y_train)
eval_dataset = TextDataset(X_val, y_val)

In [13]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
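
What the collator does at batch time can be illustrated without the library. The sketch below is not the `DataCollatorWithPadding` implementation; it assumes a pad token id of 0, uses made-up token ids, and the helper name `pad_batch` is hypothetical.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the longest one in this batch and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Illustrative token ids only, not taken from the dataset
batch = pad_batch([[101, 2256, 102], [101, 2256, 15616, 2024, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum length keeps short batches small, which is why the texts do not need to be padded when they are tokenized.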

In [35]:
DataCollatorWithPadding?

Init signature:
DataCollatorWithPadding(
    tokenizer: transformers.tokenization_utils_base.PreTrainedTokenizerBase,
    padding: Union[bool, str, transformers.utils.generic.PaddingStrategy] = True,
    max_length: Optional[int] = None,
    pad_to_multiple_of: Optional[int] = None,
    return_tensors: str = 'pt',
)

In [14]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
import evaluate

accuracy = evaluate.load("accuracy")

In [17]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
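
To make the metric concrete, here is a small NumPy-only sketch of the same computation: take the argmax over the class logits, then compare against the labels. The logits and labels are hypothetical values chosen for illustration, not model output.

```python
import numpy as np

# Hypothetical logits for four examples and two classes (not real model output)
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]])
labels = np.array([0, 1, 0, 0])

# Pick the class with the highest logit in each row
predicted = np.argmax(logits, axis=1)

# Accuracy is the fraction of rows where the prediction matches the label
acc = float((predicted == labels).mean())
print(predicted, acc)
```

Softmax is not needed here because it preserves the ordering of the logits, so the argmax is the same either way.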

In [47]:
training_args = TrainingArguments(
    output_dir="../data/nlp-getting-started/distill_bert_classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    load_best_model_at_end=True,
    
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,  # pad each batch to its longest sequence
    compute_metrics=compute_metrics,
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.458128 | 0.824688 |
| 2 | 0.274300 | 0.495691 | 0.826658 |
| 3 | 0.214300 | 0.615068 | 0.824688 |
| 4 | 0.158300 | 0.758261 | 0.817466 |
| 5 | 0.158300 | 0.892203 | 0.80302 |
| 6 | 0.106400 | 0.963478 | 0.804334 |
| 7 | 0.076500 | 1.104767 | 0.809586 |
| 8 | 0.057700 | 1.14403 | 0.796454 |
| 9 | 0.057700 | 1.176617 | 0.814183 |
| 10 | 0.040600 | 1.205845 | 0.809586 |


TrainOutput(global_step=3810, training_loss=0.12484021324498135, metrics={'train_runtime': 299.337, 'train_samples_per_second': 203.45, 'train_steps_per_second': 12.728, 'total_flos': 853913828980032.0, 'train_loss': 0.12484021324498135, 'epoch': 10.0})
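
In the log above, validation loss is lowest at epoch 1 and rises afterwards while training loss keeps falling, which is the usual sign of overfitting. The sketch below replays the checkpoint selection on the logged values. It assumes that `load_best_model_at_end=True` compares checkpoints by evaluation loss, which is the `Trainer` default when `metric_for_best_model` is not set; `best_epoch` is a hypothetical helper, not part of the library.

```python
# (epoch, validation loss, accuracy) copied from the training log above
log = [
    (1, 0.458128, 0.824688),
    (2, 0.495691, 0.826658),
    (3, 0.615068, 0.824688),
    (4, 0.758261, 0.817466),
    (5, 0.892203, 0.80302),
    (6, 0.963478, 0.804334),
    (7, 1.104767, 0.809586),
    (8, 1.14403, 0.796454),
    (9, 1.176617, 0.814183),
    (10, 1.205845, 0.809586),
]

def best_epoch(rows, key_index=1, lower_is_better=True):
    """Return the epoch whose metric at key_index is best."""
    pick = min if lower_is_better else max
    return pick(rows, key=lambda r: r[key_index])[0]

print(best_epoch(log))                                      # selected by validation loss
print(best_epoch(log, key_index=2, lower_is_better=False))  # selected by accuracy
```

Under that assumption, the model reloaded at the end of training is an early checkpoint, so most of the ten epochs add training time without improving the validation metrics.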

In [62]:
test_data = pd.read_csv(os.path.join(data_dir,"test.csv"))

In [63]:
test_dataset = TextDataset(test_data['text'])

In [50]:
model.eval()


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [52]:
model.save_pretrained("../data/nlp-getting-started/distill_bert_classifier/best_model")

In [75]:
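predictions = list()
device = model.device  # run inference on whichever device the model already lives on
with torch.no_grad():
    for _, row in test_data.iterrows():
        # Tokenize one text at a time and move the tensors to the model's device
        tokens = tokenizer(row['text'], truncation=True,
                           return_tensors="pt").to(device)
        # The predicted class is the index of the largest logit
        predictions.append(model(**tokens).logits.argmax().item())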
        

In [76]:
test_data['target'] = predictions

In [78]:
test_data[['id','target']].to_csv("../data/nlp-getting-started/distill_bert_submission.csv",index=False)