<a href="https://colab.research.google.com/github/VidushiBhatia/Mining-Opinions-using-Transformers-PyTorch/blob/main/Mining_Opinions_to_Predict_Customer_Trends_using_Transformers_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color='teal' size='6'><b>
Mining Opinions to Predict Customer Trends with Transfer Learning and Fine Tuning of Transformers in PyTorch</b></font>

### Overview

* **Objective**: Load a pre-trained model and fine-tune its weights on the chosen dataset; use parallel processing to expedite tokenization
* **Dataset used**: Unprocessed tar file from [Multi-Domain Sentiment Dataset (version 2.0)](https://www.cs.jhu.edu/~mdredze/datasets/sentiment/)

### 1 - Packages

In [None]:
# Install for using ignite
# !pip install torch==1.8.1 pytorch-transformers pytorch-ignite

In [None]:
import os                                        # interact with the operating system: access directories and update paths
import tarfile                                   # the input dataset is a tar file, this package helps in reading it
from bs4 import BeautifulSoup                    # the tar files contain XML datasets, this package helps pulling and manipulating that data
import pandas as pd                              # helps in creating dataframes from the input data
import regex as re                               # for text processing
import string                                    # for text processing
import numpy as np                               # for using numpy arrays

# Relevant torch packages for transfer learning and fine tuning 
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader
from pytorch_transformers import BertTokenizer
from pytorch_transformers.optimization import AdamW
import torch.nn as nn
import torch.nn.functional as F
from ignite.engine import Engine, Events
from ignite.metrics import RunningAverage, Accuracy 
from ignite.handlers import ModelCheckpoint
from ignite.contrib.handlers import CosineAnnealingScheduler, PiecewiseLinear, create_lr_scheduler_with_warmup, ProgressBar
from pytorch_transformers import cached_path

# Data structures
from collections import namedtuple
from typing import Tuple

# Other packages for parallel processing
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import cpu_count

from tqdm.notebook import tqdm                   # for progress bars
from itertools import repeat                     # yields the same object indefinitely; used to pass the processor to every executor.map call
# from tqdm import tqdm
# import warnings



In [None]:
# Check if GPU is running
!nvidia-smi

Wed Jun 16 07:39:38 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### 2 - Load XML to a Dataframe

The dataset used for this notebook has multiple positive and negative review files compressed into a tar format. To get a dataframe with x and y values (text and sentiment labels respectively), we need to execute the following:
1. Extract relevant files from tar
2. Convert XML tree into a dataframe for relevant elements
3. Create a train and test set with processed x and y values
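
Step 1 can be tried without the real dataset. The sketch below builds a tiny in-memory tar archive and selects members by the same `positive.review` / `negative.review` suffixes. The member names, the XML snippets, and the helper names `build_toy_archive` and `read_reviews` are invented for illustration and are not part of the dataset or of this notebook.

```python
import io
import tarfile


def build_toy_archive() -> bytes:
    """Create a small tar.gz in memory; file names and contents are made up."""
    members = {
        "sorted_data/books/positive.review": b"<review><rating>5.0</rating></review>",
        "sorted_data/books/negative.review": b"<review><rating>1.0</rating></review>",
        "sorted_data/books/README": b"not a review file",
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_reviews(archive_bytes: bytes, suffix: str) -> list:
    """Return the raw bytes of every member whose name ends with `suffix`."""
    contents = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for name in tar.getnames():
            if name.endswith(suffix):
                contents.append(tar.extractfile(name).read())
    return contents


archive = build_toy_archive()
pos_content = read_reviews(archive, "positive.review")
neg_content = read_reviews(archive, "negative.review")
print(len(pos_content), len(neg_content))  # 1 1
```

The README member is skipped because it matches neither suffix, which is the same filtering the real archive needs.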


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Helper Function 1: Extract XML data from tar files
def ExtractContent(path):
  # Open the archive in a context manager so the file handle is always closed
  with tarfile.open(path, 'r') as tar:
    # Find relevant file names
    files = tar.getnames()
    pos_files = [f for f in files if f.endswith('positive.review')]
    neg_files = [f for f in files if f.endswith('negative.review')]

    # Extract Positive and Negative reviews as raw bytes
    pos_content = [tar.extractfile(f).read() for f in pos_files]
    neg_content = [tar.extractfile(f).read() for f in neg_files]
  return pos_content, neg_content

In [None]:
# Helper Function 2: Create a dataframe from XML file
def CreateDF(content_list):
  # Check the exhaustive list of elements with soup.find_all() and create a dataframe with only relevant columns
  # columns = ['unique_id','asin','product_name','product_type','helpful','rating','title','date','reviewer','reviewer_location','review_text']

  columns = ['rating','title','review_text']     # only processing relevant columns
  interim_df = []
  for item in content_list:                      # iterate over the positive and negative file lists
    for content in item:                         # iterate over all extracted files
      bs_content = BeautifulSoup(content, 'lxml')
      table_rows = bs_content.find_all("review") # create rows from the root element
      df = pd.DataFrame()
      for c in columns:                          # add corresponding columns to the created dataframe
        values = bs_content.find_all(c)
        if len(values)!=len(table_rows):         # in case the size of each element is not equal
          col = []
          for t in table_rows:
            row_val = t.find_all(c)
            row = [t.text.strip() for t in row_val]
            col.append(row[0])
        else:
          col = [t.text.strip() for t in values]
        df[c] = col
      interim_df.append(df)
  output = pd.concat(interim_df)                # create a master df with all positive, negative reviews
  return output

In [None]:
# LOAD DATA USING HELPER FUNCTIONS

# STEP 1 - Extract relevant content from tar file
path = '/content/drive/My Drive/NLP - Sentiment Analysis & Keyword Extraction/unprocessed.tar.gz'
pos_content, neg_content = ExtractContent(path)

# STEP 2 - Convert relevant elements of XML to dataframe 
content_list = [pos_content, neg_content]
master_df = CreateDF(content_list)

# STEP 3 - Create train, test dataset with x (i.e. text) and y (i.e. labels)
master_df['label'] = (pd.to_numeric(master_df['rating'])>3)*1  # ratings >3 are labeled as positive sentiment
master_df['text'] =  master_df['title'].str.cat(master_df['review_text'], sep=' ', na_rep='?') 
master_df['text'] = master_df['text'].replace(r" +"," ",regex = True) # collapse runs of spaces into a single space
temp_mask = np.random.rand(len(master_df)) < 0.7   # 70% data is train set
train_set = master_df[temp_mask]
test_set  = master_df[~temp_mask]
train_set = train_set.drop(['rating','title','review_text'], axis='columns') # retain only processed columns
test_set = test_set.drop(['rating','title','review_text'], axis='columns')

### 3 - Tokenize Representations

The neural network cannot read raw text; it needs each review encoded as a sequence of integer token ids. To do this, we define a text processing class that takes "text" as input and returns a fixed-length "sequence of integers".

Several tokenizers and vocabularies can map text to these ids. In this notebook, we use the `BertTokenizer` from `pytorch-transformers`.
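
To make the fixed-length encoding concrete, here is a minimal sketch of the truncate-or-pad logic with the `[CLS]` token placed after the last kept token. It assumes a toy six-entry vocabulary (`TOY_VOCAB`) and a hypothetical helper `encode`; the real `BertTokenizer` vocabulary is far larger and assigns different ids.

```python
# Toy vocabulary for illustration only; real BERT ids differ.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "great": 2, "book": 3, "bad": 4, "toy": 5}


def encode(tokens, max_positions):
    """Return exactly `max_positions` ids: tokens, then [CLS], then padding."""
    kept = tokens[:max_positions - 1]                      # leave one slot for [CLS]
    ids = [TOY_VOCAB[t] for t in kept] + [TOY_VOCAB["[CLS]"]]
    ids += [TOY_VOCAB["[PAD]"]] * (max_positions - len(ids))
    return ids


print(encode(["great", "book"], max_positions=5))
# [2, 3, 1, 0, 0]   (short input: padded after [CLS])
print(encode(["bad", "toy", "bad", "toy", "bad", "toy"], max_positions=5))
# [4, 5, 4, 5, 1]   (long input: truncated, [CLS] still last)
```

Every example ends up the same length, which is what allows the encoded reviews to be stacked into a single tensor later.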

In [None]:
class TextProcessing:
    CLS = '[CLS]'                                                 # Special token for sentence classification
    PAD = '[PAD]'                                                 # Special token for padding
    def __init__(self, tokenizer, num_max_positions:int=512):
        self.tokenizer=tokenizer
        self.num_max_positions = num_max_positions
    
    def process_example(self, example: Tuple[int, str]):          # function to convert text strings into tokens of equal length
        label, text = example[0], example[1]
        tokens = self.tokenizer.tokenize(text)
        
        if len(tokens) >= self.num_max_positions:                 # truncate the tokens if they exceed num_max_positions
            tokens = tokens[:self.num_max_positions-1] 
            ids =  self.tokenizer.convert_tokens_to_ids(tokens) + [self.tokenizer.vocab[self.CLS]]
        else:                                                     # pad to ensure that all token arrays are of same length
            pad = [self.tokenizer.vocab[self.PAD]] * (self.num_max_positions-len(tokens)-1)
            ids = self.tokenizer.convert_tokens_to_ids(tokens) + [self.tokenizer.vocab[self.CLS]] + pad
        
        return np.array(ids, dtype='int64'), int(label)

In [None]:
NUM_MAX_POSITIONS = 256 
BATCH_SIZE = 32

# load the pre-trained 'bert-base-cased' tokenizer from pytorch-transformers
from pytorch_transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

# Initialize a processor with the imported tokenizer and TextProcessing class
processor = TextProcessing(tokenizer, num_max_positions=NUM_MAX_POSITIONS)

### 4 - Convert Dataset to DataLoader

In [None]:
# set the configurations for fine tuning pre-trained model to the considered dataset (incl. data loaders, parallel processing, etc.)
LOG_DIR = "/content/drive/My Drive/NLP - Sentiment Analysis & Keyword Extraction/logs/"
CACHE_DIR = "/content/drive/My Drive/NLP - Sentiment Analysis & Keyword Extraction/cache/"

device = "cuda" if torch.cuda.is_available() else "cpu"

FineTuningConfig = namedtuple('FineTuningConfig',
      field_names="num_classes, dropout, init_range, batch_size, lr, max_norm, n_epochs,"
                  "n_warmup, valid_pct, gradient_acc_steps, device, log_dir")

finetuning_config = FineTuningConfig(
                2, 0.1, 0.02, BATCH_SIZE, 6.5e-5, 1.0, 2,10, 0.1, 1, device, LOG_DIR)

In [None]:
# Function to process rows using the text processing class defined earlier
def process_row(processor, row):
    return processor.process_example((row[1]['label'], row[1]['text']))
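
The multiprocessing step relies on `executor.map` receiving one iterable per function argument, with `repeat` supplying the same first argument to every call. The sketch below shows that pairing on a toy function. It swaps in `ThreadPoolExecutor` for `ProcessPoolExecutor` so that it runs in any environment without pickling concerns (both share the same `map` interface), and `scale` is a hypothetical stand-in for `process_row`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat


def scale(factor, value):
    """Hypothetical stand-in for process_row(processor, row)."""
    return factor * value


values = [1, 2, 3, 4]
with ThreadPoolExecutor(max_workers=2) as executor:
    # repeat(10) yields 10 forever; map stops once `values` is exhausted
    # and returns results in input order.
    results = list(executor.map(scale, repeat(10), values))

print(results)  # [10, 20, 30, 40]
```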

In [None]:
# Function to convert dataframe into DataLoader after processing with the BERT tokenizer using process_row function
def create_dataloader(df: pd.DataFrame,
                      processor: TextProcessing,
                      batch_size: int = 32,
                      valid_pct: float = None):
    
    # to enable multiprocessing
    with ProcessPoolExecutor(max_workers=num_cores) as executor:
        result = list(
            tqdm(executor.map(process_row,
                              repeat(processor),
                              df.iterrows(),
                              chunksize=max(1, len(df) // 10)),  # chunksize must be >= 1, even for tiny dataframes
                 desc=f"Processing {len(df)} examples on {num_cores} cores",
                 total=len(df)))

    features = [r[0] for r in result]
    labels = [r[1] for r in result]

    # Compile features and labels to form the dataset
    dataset = TensorDataset(torch.tensor(features, dtype=torch.long),
                            torch.tensor(labels, dtype=torch.long))

    # define train set and valid set based on defined valid percentage in fine tuning configuration
    if valid_pct is not None:
        valid_size = int(valid_pct * len(df))
        train_size = len(df) - valid_size
        valid_dataset, train_dataset = random_split(dataset,
                                                    [valid_size, train_size])
        valid_loader = DataLoader(valid_dataset,
                                  batch_size=batch_size,
                                  shuffle=False)
        train_loader = DataLoader(train_dataset,
                                  batch_size=batch_size,
                                  shuffle=False)
        return train_loader, valid_loader

    data_loader = DataLoader(dataset,
                             batch_size=batch_size,
                             shuffle=False)
    return data_loader

In [None]:
# create train and valid sets by splitting
num_cores = cpu_count()  # for parallel processing
train_dl, valid_dl = create_dataloader(train_set, processor, 
                                    batch_size=finetuning_config.batch_size, 
                                    valid_pct=finetuning_config.valid_pct)

test_dl = create_dataloader(test_set, processor, 
                             batch_size=finetuning_config.batch_size, 
                             valid_pct=None)

Processing 27100 examples on 4 cores
Processing 11448 examples on 4 cores

### 5 - Transfer Learning

In [None]:
# Adapted from HuggingFace's Transfer Learning tutorial
class Transformer(nn.Module):
    def __init__(self, embed_dim, hidden_dim, num_embeddings, num_max_positions, num_heads, num_layers, dropout, causal):
        super().__init__()
        self.causal = causal
        self.tokens_embeddings = nn.Embedding(num_embeddings, embed_dim)
        self.position_embeddings = nn.Embedding(num_max_positions, embed_dim)
        self.dropout = nn.Dropout(dropout)
        self.attentions, self.feed_forwards = nn.ModuleList(), nn.ModuleList()
        self.layer_norms_1, self.layer_norms_2 = nn.ModuleList(), nn.ModuleList()
        for _ in range(num_layers):
            self.attentions.append(nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout))
            self.feed_forwards.append(nn.Sequential(nn.Linear(embed_dim, hidden_dim),
                                                    nn.ReLU(),
                                                    nn.Linear(hidden_dim, embed_dim)))
            self.layer_norms_1.append(nn.LayerNorm(embed_dim, eps=1e-12))
            self.layer_norms_2.append(nn.LayerNorm(embed_dim, eps=1e-12))

    def forward(self, x, padding_mask=None):
        positions = torch.arange(len(x), device=x.device).unsqueeze(-1)
        h = self.tokens_embeddings(x)
        h = h + self.position_embeddings(positions).expand_as(h)
        h = self.dropout(h)

        attn_mask = None
        if self.causal:
            attn_mask = torch.full((len(x), len(x)), -float('Inf'), device=h.device, dtype=h.dtype)
            attn_mask = torch.triu(attn_mask, diagonal=1)

        for layer_norm_1, attention, layer_norm_2, feed_forward in zip(self.layer_norms_1, self.attentions,
                                                                       self.layer_norms_2, self.feed_forwards):
            h = layer_norm_1(h)
            x, _ = attention(h, h, h, attn_mask=attn_mask, need_weights=False, key_padding_mask=padding_mask)
            x = self.dropout(x)
            h = x + h

            h = layer_norm_2(h)
            x = feed_forward(h)
            x = self.dropout(x)
            h = x + h
        return h

In [None]:
# Adapted from HuggingFace's Transfer Learning tutorial
class TransformerWithClfHead(nn.Module):
    def __init__(self, config, fine_tuning_config):
        super().__init__()
        self.config = fine_tuning_config
        self.transformer = Transformer(config.embed_dim, config.hidden_dim, config.num_embeddings,
                                       config.num_max_positions, config.num_heads, config.num_layers,
                                       fine_tuning_config.dropout, causal=not config.mlm)
        self.classification_head = nn.Linear(config.embed_dim, fine_tuning_config.num_classes)
        self.apply(self.init_weights)

    def init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding, nn.LayerNorm)):
            module.weight.data.normal_(mean=0.0, std=self.config.init_range)
        if isinstance(module, (nn.Linear, nn.LayerNorm)) and module.bias is not None:
            module.bias.data.zero_()

    def forward(self, x, clf_tokens_mask, clf_labels=None, padding_mask=None):
        hidden_states = self.transformer(x, padding_mask)

        clf_tokens_states = (hidden_states * clf_tokens_mask.unsqueeze(-1).float()).sum(dim=0)
        clf_logits = self.classification_head(clf_tokens_states)

        if clf_labels is not None:
            loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
            loss = loss_fct(clf_logits.view(-1, clf_logits.size(-1)), clf_labels.view(-1))
            return clf_logits, loss
        return clf_logits
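
In `TransformerWithClfHead.forward`, the classification head reads one vector per example: the hidden state at the `[CLS]` position, selected by multiplying with `clf_tokens_mask` and summing over the sequence axis. The plain-Python sketch below reproduces that pooling on nested lists so the shapes are easy to follow. The function name `pool_at_mask` and the numbers are made up for illustration; the model itself does this with tensor operations.

```python
def pool_at_mask(hidden_states, mask):
    """Sum hidden vectors over the sequence axis, keeping only masked positions.

    hidden_states: nested list shaped [seq_len][batch][embed_dim]
    mask:          nested list shaped [seq_len][batch], True at the [CLS] position
    """
    seq_len, batch = len(mask), len(mask[0])
    embed_dim = len(hidden_states[0][0])
    pooled = [[0.0] * embed_dim for _ in range(batch)]
    for s in range(seq_len):
        for b in range(batch):
            if mask[s][b]:
                for e in range(embed_dim):
                    pooled[b][e] += hidden_states[s][b][e]
    return pooled


# seq_len=3, batch=2, embed_dim=2
hidden = [
    [[1.0, 1.0], [2.0, 2.0]],
    [[3.0, 3.0], [4.0, 4.0]],
    [[5.0, 5.0], [6.0, 6.0]],
]
# [CLS] sits at position 1 for the first example and position 2 for the second
cls_mask = [
    [False, False],
    [True, False],
    [False, True],
]
print(pool_at_mask(hidden, cls_mask))  # [[3.0, 3.0], [6.0, 6.0]]
```

Because each example contains exactly one `[CLS]` token, the masked sum simply picks out that token's hidden state.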

In [None]:
# download pre-trained model and config
state_dict = torch.load(cached_path("https://s3.amazonaws.com/models.huggingface.co/"
                                    "naacl-2019-tutorial/model_checkpoint.pth"), map_location='cpu')

config = torch.load(cached_path("https://s3.amazonaws.com/models.huggingface.co/"
                                        "naacl-2019-tutorial/model_training_args.bin"))

# init model: Transformer base + classifier head
model = TransformerWithClfHead(config=config, fine_tuning_config=finetuning_config).to(finetuning_config.device)

incompatible_keys = model.load_state_dict(state_dict, strict=False)

### 6 - Model Fine Tuning



In [None]:
def update(engine, batch):
    "update function for training"
    model.train()
    inputs, labels = (t.to(finetuning_config.device) for t in batch)
    inputs = inputs.transpose(0, 1).contiguous() # [S, B]
    _, loss = model(inputs, 
                    clf_tokens_mask = (inputs == tokenizer.vocab[processor.CLS]), 
                    clf_labels=labels)
    loss = loss / finetuning_config.gradient_acc_steps
    loss.backward()
    
    torch.nn.utils.clip_grad_norm_(model.parameters(), finetuning_config.max_norm)
    if engine.state.iteration % finetuning_config.gradient_acc_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
    return loss.item()

In [None]:
def inference(engine, batch):
    "update function for evaluation"
    model.eval()
    with torch.no_grad():
        batch, labels = (t.to(finetuning_config.device) for t in batch)
        inputs = batch.transpose(0, 1).contiguous()
        logits = model(inputs,
                       clf_tokens_mask = (inputs == tokenizer.vocab[processor.CLS]),
                       padding_mask = (batch == tokenizer.vocab[processor.PAD]))
    return logits, labels

In [None]:
def predict(model, tokenizer, int2label, input="test"):
    "predict sentiment using model"
    model.eval()                                          # disable dropout so predictions are deterministic
    tok = tokenizer.tokenize(input)
    ids = tokenizer.convert_tokens_to_ids(tok) + [tokenizer.vocab['[CLS]']]
    tensor = torch.tensor(ids, dtype=torch.long)
    tensor = tensor.to(device)
    tensor = tensor.reshape(1, -1)                        # [1, S]
    tensor_in = tensor.transpose(0, 1).contiguous()       # [S, 1]
    with torch.no_grad():                                 # gradients are not needed at inference time
        logits = model(tensor_in,
                       clf_tokens_mask = (tensor_in == tokenizer.vocab['[CLS]']),
                       padding_mask = (tensor == tokenizer.vocab['[PAD]']))
    val, _ = torch.max(logits, 0)
    val = F.softmax(val, dim=0).detach().cpu().numpy()    
    return {int2label[val.argmax()]: val.max(),
            int2label[val.argmin()]: val.min()}
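
The last step of `predict` turns the two raw logits into probabilities with a softmax and maps each class index to its label. A minimal standard-library sketch of that conversion follows; the logit values are made up, and the `softmax` helper here is a hand-written stand-in for `F.softmax`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


int2label = {0: "negative", 1: "positive"}
logits = [-2.0, 2.0]                       # made-up values for illustration
probs = softmax(logits)
scores = {int2label[i]: p for i, p in enumerate(probs)}
print(scores)
```

The probabilities always sum to 1, so with two classes the returned dictionary holds a score and its complement.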

In [None]:
optimizer = AdamW(model.parameters(), lr=finetuning_config.lr, correct_bias=False) 

trainer = Engine(update)
evaluator = Engine(inference)

# add metric to evaluator 
Accuracy().attach(evaluator, "accuracy")

# add evaluator to trainer: eval on valid set after each epoch
@trainer.on(Events.EPOCH_COMPLETED)
def log_validation_results(engine):
    evaluator.run(valid_dl)
    print(f"validation epoch: {engine.state.epoch} acc: {100*evaluator.state.metrics['accuracy']}")
          
# lr schedule: linear warm-up to lr over n_warmup iterations, then linear decay to zero
scheduler = PiecewiseLinear(optimizer, 'lr', [(0, 0.0), (finetuning_config.n_warmup, finetuning_config.lr),
                                              (len(train_dl)*finetuning_config.n_epochs, 0.0)])
trainer.add_event_handler(Events.ITERATION_STARTED, scheduler)


# add progressbar with loss
RunningAverage(output_transform=lambda x: x).attach(trainer, "loss")
ProgressBar(persist=True).attach(trainer, metric_names=['loss'])

# save checkpoints and finetuning config
checkpoint_handler = ModelCheckpoint(finetuning_config.log_dir, 'finetuning_checkpoint', 
                                     save_interval=1, require_empty=False)
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler, {'imdb_model': model})

int2label = {0: 'negative', 1: 'positive'}

# save metadata
torch.save({
    "config": config,
    "config_ft": finetuning_config,
    "int2label": int2label
}, os.path.join(finetuning_config.log_dir, "metadata.bin"))
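
The learning-rate schedule configured above can be reproduced by hand to see what the optimizer receives at each iteration. The sketch below is a plain-Python approximation, not ignite's implementation: `piecewise_linear` is a hypothetical helper, and it assumes the schedule holds its last value once the final milestone has passed. The 763 iterations per epoch follow from 27100 training examples, minus the 10% validation split, in batches of 32.

```python
def piecewise_linear(iteration, milestones):
    """Linearly interpolate between (iteration, value) milestones;
    hold the first/last value outside the covered range."""
    if iteration <= milestones[0][0]:
        return milestones[0][1]
    for (x0, y0), (x1, y1) in zip(milestones, milestones[1:]):
        if iteration <= x1:
            return y0 + (y1 - y0) * (iteration - x0) / (x1 - x0)
    return milestones[-1][1]


LR, N_WARMUP, N_EPOCHS = 6.5e-5, 10, 2     # values from finetuning_config
ITERS_PER_EPOCH = 763                      # ceil((27100 - 2710) / 32)
milestones = [(0, 0.0), (N_WARMUP, LR), (ITERS_PER_EPOCH * N_EPOCHS, 0.0)]

for it in (0, 5, 10, 768, 1526, 3000):
    print(it, piecewise_linear(it, milestones))
```

Under that assumption the rate peaks at iteration 10, falls to roughly half the peak around iteration 768, and is zero from iteration 1526 onward, so any training beyond `n_epochs` epochs runs with a learning rate of zero.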



In [None]:
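# NOTE: the lr schedule reaches zero after finetuning_config.n_epochs (2) epochs, so epochs 3-5
# of this 5-epoch run should leave the weights effectively unchanged
trainer.run(train_dl, max_epochs=5)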

# save model weights
torch.save(model.state_dict(), os.path.join(finetuning_config.log_dir, "model_weights.pth"))

	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:1005.)
  exp_avg.mul_(beta1).add_(1.0 - beta1, grad)


validation epoch: 1 acc: 92.43542435424355
validation epoch: 2 acc: 92.2140221402214
validation epoch: 3 acc: 92.2140221402214
validation epoch: 4 acc: 92.2140221402214
validation epoch: 5 acc: 92.2140221402214

### 7 - Evaluate Model

In [None]:
# evaluate the model on test set
evaluator.run(test_dl)
print(f"Test accuracy: {100*evaluator.state.metrics['accuracy']:.3f}")

Test accuracy: 92.558


### 8 - Predict for a Real Time Input

In [None]:
predict(model, tokenizer, int2label, input = "ah! great book")

{'negative': 0.00842914, 'positive': 0.99157083}

In [None]:
predict(model, tokenizer, int2label, input = "I didn't enjoy the toy as muxh as I imagined")

{'negative': 0.89938086, 'positive': 0.10061909}

### References

* https://github.com/huggingface/naacl_transfer_learning_tutorial
* https://medium.com/swlh/transformer-fine-tuning-for-sentiment-analysis-c000da034bb5
