<a href="https://colab.research.google.com/github/eaishwa/quora-insincere-ques-detection-XLNET/blob/master/Quora_Insincere_ques_detection_XLNET.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using the state-of-the-art NLP model XLNET to detect if a quora question is sincere or not !!

This problem was hosted by Quora on Kaggle in Feb, 2019. Quora is on a mission to improve their platform by removing toxic content from its platform. So they made a labelled dataset available to the public by hosting it as a kaggle competition. The data consists of questions and a target column which marks if the question is sincere or not. Any question that is non-neutral and inflammatory/ not truly intended to ask a question is marked insincere. The goal is to develop a model that can identify if the incoming new question is sincere or not.

For more information on the dataset, plese refer 
https://www.kaggle.com/c/quora-insincere-questions-classification/data.


This notebook gives step by step implementation of using the XLNET model developed by google, to fine tune on the Quora Insincere Questions Identification dataset. This is a guide to implement the fine tuning of the XLENT model for a sentence classification problem. This guide shall be followed for similar downstream tasks provided you have the task specific labelled dataset. This notebook uses the pytorch hugging face package namely PyTorch-Transformers.

For more information on XLNET, please refer to my article
https://towardsdatascience.com/xlnet-explained-in-simple-terms-255b9fb2c97c

To know more about pytorch transformers, please refer
https://huggingface.co/pytorch-transformers/model_doc/xlnet.html#

### Upload the train data from the source.

I had sampled 5000 questions in total - 2500 in each category, to enhance speed and memory utilization. 

In [2]:
from google.colab import files
uploaded = files.upload()

Saving sample.csv to sample.csv


### Installing the pytorch-transformers library and importing the necessary requirements.

In [3]:
!pip install pytorch-transformers

import logging
logging.basicConfig(level=logging.INFO)
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from pytorch_transformers import *
from tqdm import tqdm, trange
import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt
% matplotlib inline



Using TensorFlow backend.
INFO:pytorch_transformers.modeling_bert:Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
INFO:pytorch_transformers.modeling_xlnet:Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [4]:
#data = pd.read_csv(io.BytesIO(uploaded['questions.csv']))
data = pd.read_csv('sample.csv')
data = data.dropna()
print("The percentage of sincere questions is : ")
print(len(data[data['target']==0].index)*100/len(data.index))
print("The percentage of insincere questions is : ")
print(len(data[data['target']==1].index)*100/len(data.index))


The percentage of sincere questions is : 
50.0
The percentage of insincere questions is : 
50.0


### Check the meta information about the dataset

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 5 columns):
target           5000 non-null int64
Unnamed: 1       5000 non-null int64
qid              5000 non-null object
question_text    5000 non-null object
target.1         5000 non-null int64
dtypes: int64(3), object(2)
memory usage: 234.4+ KB


### Prepare input data for XLNET

We need special data preparation methods for XLNET. Following are the steps.

1) Tokenize the sentences using the library. 

2) Append the string "[CLS]" to the beginning of the sentence and "[SEP]" to the end of the sentence. 

3) In case of sentence pairs, we need to append "[SEP]" at the end of the second sentence too. 

4) Obtain the input ids of the tokens from the output of the tokenizer. The model needs this to identify tokens uniquely.

5) XLNET accepts input sequences in fixed sizes such as 128, 256, 320, 384, 512. So we need to truncate larger sequences or pad smaller sequences with 0.

6) A segment mask is to be specified to identify if the input is a single sentence or pair of sentences. Indicative values such 0 for first sentence and 1 for second sentence are used for this purpose.

7) An attention mask is to be specified, to let the model know which are the tokens and which are the paddings we introduced in step 5. 1 indicates token and 0 indicates padding.

This is the format in which the original BERT model was trained by google. So the users are also expected to follow the same for the best results.

In [6]:
# Load pre-trained model tokenizer (vocabulary)

tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')

# function to tokenize and generate input ids for the tokens
# returns a list of input ids

def prep_data(ques):
  all_input_ids = []
  
  for q1 in ques:
    
    # first sentence is appended with [CLS] and [SEP] in the beginning and end
    q1 = '[CLS] ' + q1 + ' [SEP] '
    tokens = tokenizer.tokenize(q1)
    
    # input ids are generated for the tokens (one question pair)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # input ids are stored in a separate list
    all_input_ids.append(input_ids)
    
  return all_input_ids


all_input_ids = prep_data(data['question_text'].values)

INFO:pytorch_transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-large-cased-spiece.model from cache at /root/.cache/torch/pytorch_transformers/5b125ba222ff82664771f63cd8fac9696c24b403fc1ab720d537fe2ceaaf0576.8b10bd978b5d01c21303cc761fc9ecd464419b3bf921864a355ba807cfbfafa8


In [0]:
# set MAX_LEN as one of 128, 256, 320, 384, 512
MAX_LEN = 128

# Pad our input tokens
pad_input_ids = pad_sequences(all_input_ids,
                          maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

In [0]:
# Create attention masks
attention_masks = []

# Create a mask of 1s for each token followed by 0s for padding
for seq in pad_input_ids:
  seq_mask = [float(i>0) for i in seq]
  attention_masks.append(seq_mask)

### Check if GPU is available

In [9]:
import tensorflow as tf

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


### Obtain the BERT model

In [10]:
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=2)
model.cuda()

INFO:pytorch_transformers.modeling_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-config.json from cache at /root/.cache/torch/pytorch_transformers/c9cc6e53904f7f3679a31ec4af244f4419e25ebc8e71ebf8c558a31cbcf07fc8.ef1824921bc0786e97dc88d55eb17aabf18aac90f24bd34c0650529e7ba27d6f
INFO:pytorch_transformers.modeling_utils:Model config {
  "attn_type": "bi",
  "bi_data": false,
  "clamp_len": -1,
  "d_head": 64,
  "d_inner": 3072,
  "d_model": 768,
  "dropout": 0.1,
  "end_n_top": 5,
  "ff_activation": "gelu",
  "finetuning_task": null,
  "initializer_range": 0.02,
  "layer_norm_eps": 1e-12,
  "mem_len": null,
  "n_head": 12,
  "n_layer": 12,
  "n_token": 32000,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "pruned_heads": {},
  "reuse_len": null,
  "same_length": false,
  "start_n_top": 5,
  "summary_activation": "tanh",
  "summary_last_dropout": 0.1,
  "summary_type": "last",
  "summary_use_proj": 

XLNetForSequenceClassification(
  (transformer): XLNetModel(
    (word_embedding): Embedding(32000, 768)
    (layer): ModuleList(
      (0): XLNetLayer(
        (rel_attn): XLNetRelativeAttention(
          (layer_norm): LayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1)
        )
        (ff): XLNetFeedForward(
          (layer_norm): LayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=True)
          (layer_1): Linear(in_features=768, out_features=3072, bias=True)
          (layer_2): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.1)
        )
        (dropout): Dropout(p=0.1)
      )
      (1): XLNetLayer(
        (rel_attn): XLNetRelativeAttention(
          (layer_norm): LayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1)
        )
        (ff): XLNetFeedForward(
          (layer_norm): LayerNorm(torch.Size([768]), eps=1e-12, elementwise

In [11]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

'Tesla K80'

### Run the model

Split the input data into train and validation sets. I keep 20% of the data for validation. Convert all the inputs to tensors which is format for this library.

In [0]:
# Use train_test_split to split our data into train and validation sets for training
labels = data.target.values
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(pad_input_ids, labels, 
                                                            random_state=2018, test_size=0.2)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, pad_input_ids,
                                             random_state=2018, test_size=0.2)

In [0]:
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

The DataLoader allows us to get only that particular batch neeed for that epoch. This helps save lot of memory because we wont be loading the entire data in memory during training.

In [0]:
# Select a batch size for training. For fine-tuning BERT on a specific task, the authors recommend a batch size of 16 or 32
batch_size = 32

# Create an iterator of our data with torch DataLoader. This helps save on memory during training because, unlike a for loop, 
# with an iterator the entire dataset does not need to be loaded into memory

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)


### Get the same hyperparameters as the model and define the Adam Optimizer

In [0]:
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}
]

In [0]:
# This variable contains all of the hyperparemeter information our training loop needs

optimizer = AdamW(optimizer_grouped_parameters,
                     lr=2e-5,
                     correct_bias=False)

### Define a method to compute accuracy

In [0]:
# Function to calculate the accuracy of our predictions vs labels
def accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

### Run the neural network model

1) Obtain the batch of data to compute gradient from the data loader that we created above.

2) Do not store the gradients as they aren't needed in this case.

3) Make a forward pass in the network followed by backward pass (backpropagation).

4) Update network parameters.

5) Track the loss function.

6) Run predictions on the validation set and record accuracy.

7) Run steps 1 to 6 for the number of epochs specified.

In [35]:
train_loss_set = []

# Number of training epochs (authors recommend between 2 and 4)
epochs = 2

# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):
  
  
  # Training
  
  # Set our model to training mode (as opposed to evaluation mode)
  model.train()
  
  # Tracking variables
  tr_loss = 0
  nb_tr_examples, nb_tr_steps = 0, 0
  
  # Train the data for one epoch
  for step, batch in enumerate(train_dataloader):
    
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    
    # Clear out the gradients (by default they accumulate)
    optimizer.zero_grad()
    
    # Forward pass
    outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
    loss, logits1 = outputs[:2]
    
    train_loss_set.append(loss)    
    
    # Backward pass
    loss.backward()
    
    # Update parameters and take a step using the computed gradient
    optimizer.step()
    
    
    # Update tracking variables
    tr_loss += loss
    nb_tr_examples += b_input_ids.size(0)
    nb_tr_steps += 1

  print("Train loss: {}".format(tr_loss/nb_tr_steps))
    
    
  # Validation

  # Put model in evaluation mode to evaluate loss on the validation set
  model.eval()

  # Tracking variables 
  eval_loss, eval_accuracy = 0, 0
  nb_eval_steps, nb_eval_examples = 0, 0

  # Evaluate data for one epoch
  for batch in validation_dataloader:
    
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    
    # Telling the model not to compute or store gradients, saving memory and speeding up validation
    with torch.no_grad():
      
      # Forward pass, calculate logit predictions
      logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
      
#     # Move logits and labels to CPU
      logits1 = logits[0].detach().cpu().numpy()
      label_ids = b_labels.to('cpu').numpy()

    tmp_eval_accuracy = accuracy(logits1, label_ids)
    
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1

  print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))








Epoch:   0%|          | 0/2 [00:00<?, ?it/s][A[A[A[A[A[A[A

Train loss: 0.0677293911576271









Epoch:  50%|█████     | 1/2 [04:40<04:40, 280.11s/it][A[A[A[A[A[A[A

Validation Accuracy: 0.8720703125
Train loss: 0.05677700787782669









Epoch: 100%|██████████| 2/2 [09:19<00:00, 280.00s/it][A[A[A[A[A[A[A






[A[A[A[A[A[A[A

Validation Accuracy: 0.86328125


### Test on new questions to see the performance !!

Upload your test data and apply the same pre-processing techniques and see how the model works!! Please refer the questions I uploaded in the file "results.csv" in this repository.

In [36]:
uploaded = files.upload()

Saving test.csv to test.csv


In [0]:
test = pd.read_csv('test.csv')
all_input_ids = prep_data(test['question_text'].values)

In [0]:
MAX_LEN = 128
# Pad our input tokens
pad_input_ids = pad_sequences(all_input_ids,
                          maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

In [0]:
# Create attention masks
attention_masks = []

# Create a mask of 1s for each token followed by 0s for padding
for seq in pad_input_ids:
  seq_mask = [float(i>0) for i in seq]
  attention_masks.append(seq_mask)

In [0]:
prediction_inputs = torch.tensor(pad_input_ids)
prediction_masks = torch.tensor(attention_masks)
  
batch_size = 32


prediction_data = TensorDataset(prediction_inputs, prediction_masks)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

In [41]:
# Prediction on test set

# Put model in evaluation mode
model.eval()

# Tracking variables 
predictions = []

# Predict 
for batch in prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask = batch
  # Telling the model not to compute or store gradients, saving memory and speeding up prediction
  with torch.no_grad():
    # Forward pass, calculate logit predictions
    logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

  # Move logits and labels to CPU
  logits1 = logits[0].detach().cpu().numpy()
  
  # Store predictions and true labels
  predictions.append(logits1)
  pred_flat = np.argmax(logits1, axis=1).flatten()
  print(pred_flat)

[0 1 0 0 0 0 1 0 0 0 1 1 1 1 0 1 1 1 1 1]


The algorithm works pretty good. The results are uploaded in results.csv file. Please check it to understand the predictions above. With just 5000 samples for fine-tuning, the model performs very well on the unseen data. 

XLNET is indeed the state-of-the-art model for NLP !!!