# Text Classification with RoBERTa using Pytorch.

RoBERTa: A Robustly Optimized BERT Pretraining Approach is a recent [paper](https://arxiv.org/pdf/1907.11692.pdf) published by researchers at Facebook. 
It modifies the key hyper-parameters in BERT model:
uses larger mini-batches, learning rates and step sizes for longer training
differences in masking procedure

Roberta gets higher GLEU score as 88.5


In this notebook, I will be using RoBERTa with [Pytorch-Transformers](https://github.com/huggingface/pytorch-transformers) library. PyTorch-Transformers (formerly known as pytorch-pretrained-bert) is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).  
Most of the code in this notebook is from [run_glue.py](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_glue.py) file in the pytorch-transformers library. This entire notebook is developed using Google Colab.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!nvidia-smi

Mon Dec  7 05:47:34 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
!pip install --upgrade "urllib3==1.25.4" awscli

Collecting urllib3==1.25.4
[?25l  Downloading https://files.pythonhosted.org/packages/91/0d/7777358f672a14b7ae0dfcd29f949f409f913e0578190d6bfa68eb55864b/urllib3-1.25.4-py2.py3-none-any.whl (125kB)
[K     |████████████████████████████████| 133kB 8.7MB/s 
[?25hCollecting awscli
[?25l  Downloading https://files.pythonhosted.org/packages/a6/2f/db65135e5d82992ceb3294eae04bc6ceb7af25663a46f9c2f96c3bab6841/awscli-1.18.190-py2.py3-none-any.whl (3.4MB)
[K     |████████████████████████████████| 3.5MB 9.4MB/s 
Collecting botocore==1.19.30
[?25l  Downloading https://files.pythonhosted.org/packages/9c/a3/1ee497faf994d180df5d14d456eef1ef46ca1ffce617816faa4ff8164608/botocore-1.19.30-py2.py3-none-any.whl (7.0MB)
[K     |████████████████████████████████| 7.0MB 28.7MB/s 
[?25hCollecting colorama<0.4.4,>=0.2.5; python_version != "3.4"
  Downloading https://files.pythonhosted.org/packages/c9/dc/45cdef1b4d119eb96316b3117e6d5708a08029992b2fee2c143c7a0a5cc5/colorama-0.4.3-py2.py3-none-any.whl
Collect

In [4]:
!pip install pytorch-transformers

Collecting pytorch-transformers
[?25l  Downloading https://files.pythonhosted.org/packages/a3/b7/d3d18008a67e0b968d1ab93ad444fc05699403fa662f634b2f2c318a508b/pytorch_transformers-1.2.0-py3-none-any.whl (176kB)
[K     |█▉                              | 10kB 18.5MB/s eta 0:00:01[K     |███▊                            | 20kB 15.4MB/s eta 0:00:01[K     |█████▋                          | 30kB 12.9MB/s eta 0:00:01[K     |███████▍                        | 40kB 12.1MB/s eta 0:00:01[K     |█████████▎                      | 51kB 8.0MB/s eta 0:00:01[K     |███████████▏                    | 61kB 7.5MB/s eta 0:00:01[K     |█████████████                   | 71kB 8.5MB/s eta 0:00:01[K     |██████████████▉                 | 81kB 9.4MB/s eta 0:00:01[K     |████████████████▊               | 92kB 8.6MB/s eta 0:00:01[K     |██████████████████▋             | 102kB 7.8MB/s eta 0:00:01[K     |████████████████████▍           | 112kB 7.8MB/s eta 0:00:01[K     |██████████████████████▎  

In [5]:
import csv
import os
import random
from pathlib import Path
import numpy as np
import pandas as pd
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
from torch.utils.data.distributed import DistributedSampler
from pytorch_transformers import RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer
from pytorch_transformers import AdamW, WarmupLinearSchedule
from tqdm import tqdm, trange, tqdm_notebook
from sklearn.metrics import matthews_corrcoef, f1_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns


### Loading the Dataset.
I am using a small subset of Amazon Reviews Dataset containing only 10000 rows. You can use the whole dataset but it will take a much longer time to train.

In [6]:
dataset = pd.read_csv('/content/drive/My Drive/Colab Notebooks/women_clothes_review/Womens Clothing E-Commerce Reviews.csv')
dataset.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [7]:
dataset = dataset.loc[:, ['Review Text', 'Rating']]
dataset.head()

Unnamed: 0,Review Text,Rating
0,Absolutely wonderful - silky and sexy and comf...,4
1,Love this dress! it's sooo pretty. i happene...,5
2,I had such high hopes for this dress and reall...,3
3,"I love, love, love this jumpsuit. it's fun, fl...",5
4,This shirt is very flattering to all due to th...,5


In [8]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Review Text  22641 non-null  object
 1   Rating       23486 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 367.1+ KB


In [9]:
dataset = dataset.dropna()
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22641 entries, 0 to 23485
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Review Text  22641 non-null  object
 1   Rating       22641 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 530.6+ KB


In [10]:
def get_sentiment(value):
    if value > 3:
        return 1
    else:
        return 0

In [11]:
dataset['Sentiment'] = dataset['Rating'].apply(get_sentiment)
dataset.head()

Unnamed: 0,Review Text,Rating,Sentiment
0,Absolutely wonderful - silky and sexy and comf...,4,1
1,Love this dress! it's sooo pretty. i happene...,5,1
2,I had such high hopes for this dress and reall...,3,0
3,"I love, love, love this jumpsuit. it's fun, fl...",5,1
4,This shirt is very flattering to all due to th...,5,1


In [12]:
dataset.drop(['Rating'], axis=1, inplace=True)
dataset.head()

Unnamed: 0,Review Text,Sentiment
0,Absolutely wonderful - silky and sexy and comf...,1
1,Love this dress! it's sooo pretty. i happene...,1
2,I had such high hopes for this dress and reall...,0
3,"I love, love, love this jumpsuit. it's fun, fl...",1
4,This shirt is very flattering to all due to th...,1


In [13]:
train_df, val_df = train_test_split(dataset, test_size=0.5, random_state=101)
val_df, test_df = train_test_split(val_df, test_size=0.5, random_state=101)

dataset.shape,train_df.shape, val_df.shape, test_df.shape
print(test_df[:1])
print(test_df[-2:])
print(test_df.shape)

                                             Review Text  Sentiment
17229  The comfort of this fabric is terrific. the sw...          1
                                             Review Text  Sentiment
1893   Disappointment city with this one and i am so ...          0
15996  Loved these pants when i tried them on they lo...          0
(5661, 2)


Pytorch-Transformers library requires dataset to be divided in Train, Valid (read Dev) and Test set. In this case I will not be using a Test set. Also the library requires dataset to be in TSV format but since most of the times we get the data in CSV format, I have decided to use a CSV format.

In [14]:
save_dir = Path('/content/drive/My Drive/Colab Notebooks/Amazon Reviews')

In [15]:
if not os.path.exists(save_dir):
    os.mkdir(save_dir)

fullname = os.path.join(save_dir, 'train.csv')    
train_df.to_csv(fullname)

In [16]:
# if not os.path.exists(save_dir):
#     os.mkdir(save_dir)

# fullname = os.path.join(save_dir, 'dev.csv')    
# train_df.to_csv(fullname)

In [17]:
if not os.path.exists(save_dir):
    os.mkdir(save_dir)

fullname = os.path.join(save_dir, 'val.csv')    
val_df.to_csv(fullname)

In [18]:
if not os.path.exists(save_dir):
    os.mkdir(save_dir)

fullname = os.path.join(save_dir, 'test.csv')    
test_df.to_csv(fullname)

In [19]:
class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs a InputExample.
        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
            label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

In [20]:
class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id

In [21]:
class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with open(input_file, "r", encoding="utf-8-sig") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            lines = []
            for line in reader:
                if sys.version_info[0] == 2:
                    line = list(unicode(cell, 'utf-8') for cell in line)
                lines.append(line)
            return lines
          
class AmazonProcessor(DataProcessor):
    """Processor for the Amazon Reviews data set."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.csv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.csv")), "dev")

    def get_test_examples(self, data_dir):
        """See base class."""
        df = pd.read_csv(os.path.join(data_dir,'test.csv'))
        print(df[:1])
        print(df[-2:])

        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.csv")), "test")
        

    def get_labels(self):
        """See base class."""
        return [0, 1]

    #Hack to be compatible with the existing code in transformers library
    def _read_tsv(self, file_path):
        return pd.read_csv(file_path).values.tolist()

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
          # print(f'line: {line}')
          if i == 0:
              print(f'line0: {line[0]}')
              continue
          guid = "%s-%s" % (set_type, i)
          
          text_a = str(line[1])
           # text_b = None
          
          label = line[2]
          examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

In [22]:
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

In [23]:
def simple_accuracy(preds, labels):
  return (preds == labels).mean()
  
def acc_and_f1(preds, labels):
  acc = simple_accuracy(preds, labels)
  f1 = f1_score(y_true=labels, y_pred=preds)
  return {
      "acc": acc,
      "f1": f1,
      "acc_and_f1": (acc + f1) / 2,
  }


def f1_macro(preds, labels):
  return f1_score(y_true=labels, y_pred=preds, average='macro')
  
def compute_metrics(task_name, preds, labels):
  assert len(preds) == len(labels)
  if task_name == "amazon":
    return {"acc": simple_accuracy(preds, labels)}
  else:
    raise KeyError(task_name)

In [24]:
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


### covert_examples_to_features

In [25]:
def convert_examples_to_features(examples, label_list, max_seq_length,
                                 tokenizer, output_mode,
                                 cls_token_at_end=False, pad_on_left=False,
                                 cls_token='[CLS]', sep_token='[SEP]', pad_token=0,
                                 sequence_a_segment_id=0, sequence_b_segment_id=1,
                                 cls_token_segment_id=1, pad_token_segment_id=0,
                                 mask_padding_with_zero=True):
    """ Loads a data file into a list of `InputBatch`s
        `cls_token_at_end` define the location of the CLS token:
            - False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
            - True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]
        `cls_token_segment_id` define the segment id associated to the CLS token (0 for BERT, 2 for XLNet)
    """

    label_map = {label : i for i, label in enumerate(label_list)}
  
    features = []
    for (ex_index, example) in enumerate(examples):
        # print(f'ex_index:{ex_index}')
        # print(f'example:{example.guid}')
        # print(f'example:{example.text_a}')
        # print(f'example:{example.text_b}')
        # print(f'example.label:{example.label}')


        tokens_a = tokenizer.tokenize(example.text_a)
        # print(f'tokens_a:{tokens_a}')

        tokens_b = None
        if example.text_b:
            tokens_b = tokenizer.tokenize(example.text_b)
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is less than the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > max_seq_length - 2:
                tokens_a = tokens_a[:(max_seq_length - 2)]

        # The convention in BERT is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack son ville ? [SEP] no it is not . [SEP]
        #  type_ids:   0   0   0    0    0   0    0   0   1  1  1  1   1   1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids:   0   0   0   0  0     0   0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = tokens_a + [sep_token]
        segment_ids = [sequence_a_segment_id] * len(tokens)

        if tokens_b:
            tokens += tokens_b + [sep_token]
            segment_ids += [sequence_b_segment_id] * (len(tokens_b) + 1)

        if cls_token_at_end:
            tokens = tokens + [cls_token]
            segment_ids = segment_ids + [cls_token_segment_id]
        else:
            tokens = [cls_token] + tokens
            segment_ids = [cls_token_segment_id] + segment_ids

        input_ids = tokenizer.convert_tokens_to_ids(tokens)
        # print(f'tokens:{tokens}')

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding_length = max_seq_length - len(input_ids)
        if pad_on_left:
            input_ids = ([pad_token] * padding_length) + input_ids
            input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
            segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids
        else:
            input_ids = input_ids + ([pad_token] * padding_length)
            input_mask = input_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
            segment_ids = segment_ids + ([pad_token_segment_id] * padding_length)

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length

        if output_mode == "classification":
          # print(f'label_map:{label_map}')
          # print(f'example.label:{example.label}')
          label_id = label_map[example.label]
          # print(f'lable_id:{label_id}')
        elif output_mode == "regression":
          label_id = float(example.label)
        else:
          raise KeyError(output_mode)

        # print(f'input_ids:{input_ids}')
        # print(f'input_mask:{input_mask}')
        # print(f'segment_ids:{segment_ids}')
        # print(f'label_id:{label_id}')

        features.append(
                InputFeatures(input_ids=input_ids,
                              input_mask=input_mask,
                              segment_ids=segment_ids,
                              label_id=label_id))
        


    return features

In [26]:
processor = AmazonProcessor()
label_list = processor.get_labels()
num_labels = len(label_list)
print(num_labels)

2


In [27]:
# config = BertConfig.from_pretrained('bert-base-uncased', num_labels=num_labels)
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
# model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

In [28]:
config = RobertaConfig.from_pretrained('roberta-base', num_labels=num_labels)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForSequenceClassification.from_pretrained('roberta-base', config=config)

100%|██████████| 481/481 [00:00<00:00, 349949.74B/s]
100%|██████████| 898823/898823 [00:00<00:00, 2270873.96B/s]
100%|██████████| 456318/456318 [00:00<00:00, 1399349.44B/s]
100%|██████████| 501200538/501200538 [00:13<00:00, 36382271.19B/s]


In [29]:
def load_and_cache_examples(tokenizer, dataset='train'):  
  if dataset == "train":
      examples = processor.get_train_examples(data_dir)
  elif dataset == "dev":
      examples = processor.get_dev_examples(data_dir)
  else:
      examples = processor.get_test_examples(data_dir)
  
  features = convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, output_mode,
            cls_token_at_end=False,            # xlnet has a cls token at the end
            cls_token=tokenizer.cls_token,
            sep_token=tokenizer.sep_token,
            cls_token_segment_id=0,
            pad_on_left=False,                 # pad on the left for xlnet
            pad_token_segment_id=0)
  # Convert to Tensors and build dataset
  all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
  all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
  all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
  if output_mode == "classification":
      all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)
      print(len(all_label_ids))
  elif output_mode == "regression":
      all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.float)

  dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
  return dataset
  

Hyperparameters are from the library. I have not fine tuned it.

In [30]:
output_mode = 'classification'
max_seq_length = 128
batch_size = 8
max_grad_norm = 1.0
gradient_accumulation_steps=2
num_train_epochs=3
weight_decay=0.0
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [31]:
learning_rate = 2e-5
adam_epsilon = 1e-8
warmup_steps = 0

In [32]:
def train(train_dataset, model, tokenizer):
  """ Train the model """
  train_sampler = RandomSampler(train_dataset)
  train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=batch_size)
  t_total = len(train_dataloader) // gradient_accumulation_steps * num_train_epochs
  # Prepare optimizer and schedule (linear warmup and decay)
  no_decay = ['bias', 'LayerNorm.weight']
  optimizer_grouped_parameters = [
      {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': weight_decay},
      {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
      ]
  optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=adam_epsilon)
  scheduler = WarmupLinearSchedule(optimizer, warmup_steps=warmup_steps, t_total=t_total)
  
  global_step = 0
  tr_loss, logging_loss = 0.0, 0.0
  model.zero_grad()

  train_iterator = tqdm_notebook(range(int(num_train_epochs)), desc="Epoch")
  set_seed(42)
  for _ in train_iterator:
    preds = None
    epoch_iterator = tqdm_notebook(train_dataloader, desc="Iteration")
    for step, batch in enumerate(epoch_iterator):
      model.train()
      batch = tuple(t.to(device) for t in batch)
      inputs = {'input_ids':      batch[0],
                'attention_mask': batch[1],
                'token_type_ids': None,       # XLM and RoBERTa don't use segment_ids
                'labels':         batch[3]}
      outputs = model(**inputs)
   # add code to calculate the acc
      tmp_eval_loss, logits = outputs[:2]
      if preds is None:
        preds = logits.detach().cpu().numpy()
        out_label_ids = inputs['labels'].detach().cpu().numpy()
      else:
        preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
        out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)

      loss = outputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)
      if gradient_accumulation_steps > 1:
        loss = loss / gradient_accumulation_steps
      loss.backward()
      torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
      tr_loss += loss.item()
      if (step + 1) % gradient_accumulation_steps == 0:
          scheduler.step()  # Update learning rate schedule
          optimizer.step()
          model.zero_grad()
          global_step += 1
    if output_mode == "classification":
      preds = np.argmax(preds, axis=1)
    elif output_mode == "regression":
      preds = np.squeeze(preds)
    result = compute_metrics("amazon", preds, out_label_ids)
    print(results)
  return global_step, tr_loss / global_step, result

In [33]:
def train_validation(train_dataset, validation_dataset, model, tokenizer):
  """ Train the model """
  train_sampler = RandomSampler(train_dataset)
  train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=batch_size)
  t_total = len(train_dataloader) // gradient_accumulation_steps * num_train_epochs
  # Prepare optimizer and schedule (linear warmup and decay)
  no_decay = ['bias', 'LayerNorm.weight']
  optimizer_grouped_parameters = [
      {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': weight_decay},
      {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
      ]
  optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=adam_epsilon)
  scheduler = WarmupLinearSchedule(optimizer, warmup_steps=warmup_steps, t_total=t_total)
  
  global_step = 0
  tr_loss, logging_loss = 0.0, 0.0
  model.zero_grad()

  train_iterator = tqdm_notebook(range(int(num_train_epochs)), desc="Epoch")
  set_seed(42)

  best_accuracy = 0

  for _ in train_iterator:
    preds = None
    epoch_iterator = tqdm_notebook(train_dataloader, desc="Iteration")
    for step, batch in enumerate(epoch_iterator):
      model.train()
      batch = tuple(t.to(device) for t in batch)
      inputs = {'input_ids':      batch[0],
                'attention_mask': batch[1],
                'token_type_ids': None,       # XLM and RoBERTa don't use segment_ids
                'labels':         batch[3]}
      outputs = model(**inputs)
    # add code to calculate the acc
      tmp_eval_loss, logits = outputs[:2]
      if preds is None:
        preds = logits.detach().cpu().numpy()
        out_label_ids = inputs['labels'].detach().cpu().numpy()
      else:
        preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
        out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)

      loss = outputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)
      if gradient_accumulation_steps > 1:
        loss = loss / gradient_accumulation_steps
      loss.backward()
      torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
      tr_loss += loss.item()
      if (step + 1) % gradient_accumulation_steps == 0:
          scheduler.step()  # Update learning rate schedule
          optimizer.step()
          model.zero_grad()
          global_step += 1
    if output_mode == "classification":
      preds = np.argmax(preds, axis=1)
    elif output_mode == "regression":
      preds = np.squeeze(preds)
    result = compute_metrics("amazon", preds, out_label_ids)
    train_acc = result['acc']


    print(f'the train accuracy is {train_acc}')
    val_result, preds, out_label_ids, preds_p = evaluate_updated(validation_dataset, model, tokenizer, prefix=global_step)
    val_acc = val_result['acc']
    print(f'the validation accuracy is {val_acc}')

    if  val_acc > best_accuracy:
      best_accuracy = val_acc
      torch.save(model.state_dict(), '/content/drive/My Drive/Colab Notebooks/Amazon Reviews/models/best_model.pt') 
    

  return global_step, tr_loss / global_step, result, preds_p

In [34]:
def sigmoid(x):
  z = 1/(1 + np.exp(-x))
  return z

In [35]:
p = np.array([[ 2.3201258, -2.130842 ],
 [ 1.7659825, -1.7360333],
 [-3.4590256,  3.0680115]])
p1 = sigmoid(p)
print(p1)

[[0.91053019 0.10613508]
 [0.85395734 0.14981748]
 [0.03050083 0.9555538 ]]


In [36]:
def evaluate_updated(dataset, model, tokenizer, prefix=""):
  results = {}
  # eval_dataset = load_and_cache_examples(tokenizer, dataset='dev')
  eval_dataset = dataset
  eval_batch_size = 8
  eval_sampler = SequentialSampler(eval_dataset)
  eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=eval_batch_size)
  eval_loss = 0.0
  nb_eval_steps = 0
  preds = None
  preds_p = None
  out_label_ids = None
  for batch in tqdm_notebook(eval_dataloader, desc="Evaluating"):
    model.eval()
    batch = tuple(t.to(device) for t in batch)
    with torch.no_grad():
      inputs = {'input_ids':      batch[0],
                'attention_mask': batch[1],
                'token_type_ids': None,             # XLM and RoBERTa don't use segment_ids
                'labels':         batch[3]}
      outputs = model(**inputs)
      tmp_eval_loss, logits = outputs[:2]
      eval_loss += tmp_eval_loss.mean().item()
    
    nb_eval_steps += 1
    if preds is None:
        preds = logits.detach().cpu().numpy()
        print(preds)
        print(np.argmax(preds,axis=1))
        out_label_ids = inputs['labels'].detach().cpu().numpy()
        print(f'out_label_ids is: {out_label_ids}')
        preds_p = sigmoid(preds)
        print(f'pred probability is {preds_p}')
    else:
        preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
        out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)
        temp = logits.detach().cpu().numpy()
        preds_p = np.append(preds_p,sigmoid(temp),axis = 0)
  eval_loss = eval_loss / nb_eval_steps
  if output_mode == "classification":
      preds = np.argmax(preds, axis=1)
  elif output_mode == "regression":
      preds = np.squeeze(preds)
  result = compute_metrics("amazon", preds, out_label_ids)
  return result, preds, out_label_ids,preds_p

In [37]:
def evaluate(model, tokenizer, prefix=""):
  results = {}
  eval_dataset = load_and_cache_examples(tokenizer, dataset='dev')
  eval_batch_size = 8
  eval_sampler = SequentialSampler(eval_dataset)
  eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=eval_batch_size)
  eval_loss = 0.0
  nb_eval_steps = 0
  preds = None
  out_label_ids = None
  for batch in tqdm_notebook(eval_dataloader, desc="Evaluating"):
    model.eval()
    batch = tuple(t.to(device) for t in batch)
    with torch.no_grad():
      inputs = {'input_ids':      batch[0],
                'attention_mask': batch[1],
                'token_type_ids': None,             # XLM and RoBERTa don't use segment_ids
                'labels':         batch[3]}
      outputs = model(**inputs)
      tmp_eval_loss, logits = outputs[:2]
      eval_loss += tmp_eval_loss.mean().item()
    
    nb_eval_steps += 1
    if preds is None:
        preds = logits.detach().cpu().numpy()
        print(preds)
        out_label_ids = inputs['labels'].detach().cpu().numpy()
        
    else:
        preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
        out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)
  eval_loss = eval_loss / nb_eval_steps
  if output_mode == "classification":
      preds = np.argmax(preds, axis=1)
  elif output_mode == "regression":
      preds = np.squeeze(preds)
  result = compute_metrics("amazon", preds, out_label_ids)
  return result, preds, out_label_ids

### Train the Model

In [38]:
data_dir= '/content/drive/My Drive/Colab Notebooks/Amazon Reviews'
model.to(device)

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=0)
      (position_embeddings): Embedding(514, 768)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [39]:
# train_dataset = load_and_cache_examples(tokenizer, dataset="train")
# # global_step, tr_loss, result = train(train_dataset, model, tokenizer)

# validation_dataset = load_and_cache_examples(tokenizer, dataset="val")
# global_step, tr_loss, result,preds_p = train_validation(train_dataset, validation_dataset, model, tokenizer)
# print(result)

### Evaluate the Model

In [40]:
# Evaluation
test_dataset = load_and_cache_examples(tokenizer, dataset="test")
model.load_state_dict(torch.load('/content/drive/My Drive/Colab Notebooks/Amazon Reviews/models/best_model.pt') )

   Unnamed: 0                                        Review Text  Sentiment
0       17229  The comfort of this fabric is terrific. the sw...          1
      Unnamed: 0                                        Review Text  Sentiment
5659        1893  Disappointment city with this one and i am so ...          0
5660       15996  Loved these pants when i tried them on they lo...          0
line0: 17229
5660


<All keys matched successfully>

In [41]:
test_dataset.tensors

(tensor([[    0,    38,  2740,  ...,     0,     0,     0],
         [    0,    38,    33,  ...,     0,     0,     0],
         [    0,    85,    18,  ...,     0,     0,     0],
         ...,
         [    0, 31407,  1827,  ...,     0,     0,     0],
         [    0, 39133, 36113,  ...,     0,     0,     0],
         [    0,   226, 12677,  ...,     0,     0,     0]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]),
 tensor([0, 0, 1,  ..., 0, 0, 0]))

In [None]:
   
result, y_pred, labels,preds_p = evaluate_updated(test_dataset, model, tokenizer, prefix="")
y_true = pd.Dataframe(labels)
y_true.to_csv( '/content/drive/My\ Drive/probability_csv_folder')


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  del sys.path[0]


HBox(children=(FloatProgress(value=0.0, description='Evaluating', max=708.0, style=ProgressStyle(description_w…

[[ 3.4504106 -3.1221192]
 [ 3.335963  -3.0291033]
 [-4.1574306  4.0520906]
 [ 3.4588935 -3.1290665]
 [-4.1802464  4.0835824]
 [-4.127984   4.0149918]
 [-4.1551948  4.048172 ]
 [ 2.5322585 -2.3692017]]
[0 0 1 0 1 1 1 0]
out_label_ids is: [0 0 1 0 1 1 1 0]
pred probability is [[0.9692434  0.04220403]
 [0.96564215 0.04612827]
 [0.01540663 0.98291117]
 [0.96949524 0.04192409]
 [0.01506433 0.9834321 ]
 [0.01585975 0.9822766 ]
 [0.01544059 0.9828452 ]
 [0.9263725  0.08555157]]


In [None]:
%cd /content/drive/My\ Drive/probability_csv_folder

In [None]:
file_list = os.listdir('.')
df_name_list =[]
for file in file_list:
  df_name_list.append(file.split('.')[0])
df_name_list

In [None]:
for i,file in enumerate(file_list):
  df = pd.read_csv(file)
  if (i == 2):
    df_name_list[i] = df[1:]
  else:
    df_name_list[i] = df


In [None]:
r = len(df_name_list[0])
r2= len(df_name_list[2])
print(r2)
p = np.zeros((r,2))

for i,df in enumerate(df_name_list):
  p_temp = df.to_numpy()
  p = np.add(p, p_temp[:,1:])
pred_ensemble = np.argmax(p,axis = 1)

In [None]:
def most_frequent(List): 
    dict = {} 
    count, itm = 0, '' 
    for item in reversed(List): 
        dict[item] = dict.get(item, 0) + 1
        if dict[item] >= count : 
            count, itm = dict[item], item 
    return(itm) 
  


In [None]:
p = np.zeros((r,2))
pred_list=[]
for i,df in enumerate(df_name_list):
  p_temp = df.to_numpy()
  pred = np.argmax(p_temp[:,1:],axis=1)
  # pred = pred.reshape(-1,r)
  pred_list.append(pred)
pred_list = np.array(pred_list)
pred_list.shape
pred_vote =[]
for i in range(r):
  temp = pred_list[:,i]
  pred_vote.append(most_frequent(temp))
  


In [None]:
def acc(pred_ensemble,labels):
  cnt = 0
  for pred,label in zip(pred_ensemble,labels):
    if(pred == label):
      cnt +=1
  acc_ensemble = cnt/len(labels)
  return(acc_ensemble)
print(acc_ensemble)

In [None]:
acc_vote = acc(pred_vote,labels)
print(acc_vote)

cnf_mat = confusion_matrix(labels, pred_vote)

abbreviations=['negative','positive']
fig, ax = plt.subplots(1)
ax = sns.heatmap(cnf_mat, ax=ax, cmap=plt.cm.Blues, annot=True, fmt='g')
ax.set_xticklabels(abbreviations)
ax.set_yticklabels(abbreviations)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Class')
plt.ylabel('True Class')
plt.show()

In [None]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

cnf_mat = confusion_matrix(labels, pred_ensemble)

abbreviations=['negative','positive']
fig, ax = plt.subplots(1)
ax = sns.heatmap(cnf_mat, ax=ax, cmap=plt.cm.Blues, annot=True, fmt='g')
ax.set_xticklabels(abbreviations)
ax.set_yticklabels(abbreviations)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Class')
plt.ylabel('True Class')
plt.show()

In [None]:
# result, y_pred, labels = evaluate(model, tokenizer, prefix=global_step)
print(result)
print(len(preds_p)

In [None]:
row_index = [i for i in range(len(preds_p))]
df = pd.DataFrame(data=preds_p, index=row_index, columns=["pred_0", "pred_1"])
df.to_csv('/content/drive/My Drive/Colab Notebooks/women_clothes_review/preds_prob.csv')

In [None]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

cnf_mat = confusion_matrix(labels, y_pred)

abbreviations=['negative','positive']
fig, ax = plt.subplots(1)
ax = sns.heatmap(cnf_mat, ax=ax, cmap=plt.cm.Blues, annot=True, fmt='g')
ax.set_xticklabels(abbreviations)
ax.set_yticklabels(abbreviations)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Class')
plt.ylabel('True Class')
plt.show()