# NLI-TR / Hypothesis only baseline

In this experimentation, we want to show the use of NLI-TR dataset on a particular use case where we would like to answer the main question: 

> *What is the difference between hypothesis-only baseline and the actual models when we fine-tune off-the-shelf models on NLI-TR.*

Thanks **Lasha Abzianidze** for this insightful question in our Gather.town session at EMNLP 2020! 

*Disclaimer*: The code is mostly based on the examples in the following repositories and the documentation of Huggingface Datasets and Transformers.

*   https://github.com/huggingface/transformers
*   https://github.com/huggingface/datasets
*   https://github.com/cgpotts/cs224u

## Setup

In [2]:
!pip install datasets
!pip install transformers

Collecting datasets
[?25l  Downloading https://files.pythonhosted.org/packages/94/f8/ff7cd6e3b400b33dcbbfd31c6c1481678a2b2f669f521ad20053009a9aa3/datasets-1.7.0-py3-none-any.whl (234kB)
[K     |████████████████████████████████| 235kB 7.4MB/s 
[?25hCollecting fsspec
[?25l  Downloading https://files.pythonhosted.org/packages/bc/52/816d1a3a599176057bf29dfacb1f8fadb61d35fbd96cb1bab4aaa7df83c0/fsspec-2021.5.0-py3-none-any.whl (111kB)
[K     |████████████████████████████████| 112kB 12.9MB/s 
Collecting xxhash
[?25l  Downloading https://files.pythonhosted.org/packages/7d/4f/0a862cad26aa2ed7a7cd87178cbbfa824fc1383e472d63596a0d018374e7/xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243kB)
[K     |████████████████████████████████| 245kB 12.8MB/s 
Collecting huggingface-hub<0.1.0
  Downloading https://files.pythonhosted.org/packages/32/a1/7c5261396da23ec364e296a4fb8a1cd6a5a2ff457215c6447038f18c0309/huggingface_hub-0.0.9-py3-none-any.whl
Installing collected packages: fsspec, xxhash, hug

In [3]:
import transformers
import datasets

## Dataset readers

In [4]:
import torch
from datasets import load_dataset

class NLITRReader(torch.utils.data.Dataset):
  def __init__(self, dataset_name, split_name, max_example_num=-1):
    self.dataset = load_dataset('nli_tr', dataset_name)
    self.split_name = split_name
    self.max_example_num = max_example_num

  def read(self):
      count = 0
      for example in self.dataset[self.split_name]:
          if example['label'] == -1: # skip examples having no gold value.
              continue
          count += 1
          if self.max_example_num > 0 and count >= self.max_example_num:
             break
          yield example

In [5]:
import torch
class NLITRDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

## Trainer

In [9]:
import torch
import pandas as pd
from transformers import TrainingArguments, Trainer, AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
    }

MAX_TRAIN_EXAMPLE_NUM = -1
MAX_EVALUATION_EXAMPLE_NUM = -1
class NLITRTrainer():
    def __init__(self, 
                 model_name='bert-base-cased', 
                 dataset_name='snli_tr',
                 evaluation_split='validation',
                 num_labels=3, 
                 hypothesis_only=False):
        self.model_name = model_name
        self.dataset_name = dataset_name
        self.evaluation_split = evaluation_split
        self.hypothesis_only = hypothesis_only
        self.max_train_example_num = MAX_TRAIN_EXAMPLE_NUM
        self.max_evaluation_example_num = MAX_EVALUATION_EXAMPLE_NUM

        print('You can set the values of the following parameters via the global variables MAX_TRAIN_EXAMPLE_NUM and MAX_EVALUATION_EXAMPLE_NUM (-1 to use all examples in the splits)')
        print('max_train_example_num',self.max_train_example_num)
        print('max_evaluation_example_num',self.max_evaluation_example_num)
        self.prepare_for_training()
    
    def prepare_for_training(self):
        self.prepare_model()
        self.prepare_datasets()
        self.prepare_trainer()

    def prepare_model(self):
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.config = AutoConfig.from_pretrained(self.model_name, num_labels=3)
        self.model = AutoModelForSequenceClassification.from_pretrained(self.model_name, config=self.config)
        
    def get_dataset(self, split_name, max_example_num):
        df = pd.DataFrame(list(NLITRReader(dataset_name=self.dataset_name, split_name=split_name, max_example_num=max_example_num).read()))
        labels = df['label'].values.tolist()
        premises = df['premise'].values.tolist()
        if self.hypothesis_only:
            input = self.tokenizer(premises, truncation=True, padding=True)
        else:
            hypotheses = df['hypothesis'].values.tolist()
            input = self.tokenizer(premises, hypotheses, truncation=True, padding=True)
        
        dataset = NLITRDataset(input, labels)
        return dataset

    def prepare_datasets(self):
        self.train_dataset = self.get_dataset('train', max_example_num=self.max_train_example_num)
        self.evaluation_dataset = self.get_dataset(self.evaluation_split, max_example_num=self.max_evaluation_example_num)
      
    def prepare_trainer(self):
        training_args = TrainingArguments(
            output_dir='./results',          # output directory
            num_train_epochs=3,              # total number of training epochs
            per_device_train_batch_size=16,   # batch size per device during training
            per_device_eval_batch_size=4,   # batch size for evaluation
            gradient_accumulation_steps=32,  # gradient accumulation steps to increase effective batch size on GPU.
            warmup_steps=500,                # number of warmup steps for learning rate scheduler
            weight_decay=0.01,               # strength of weight decay
            logging_dir='./logs',            # directory for storing logs
            logging_steps=10
        )

        self.trainer = Trainer(
            model=self.model,                         # the instantiated 🤗 Transformers model to be trained
            args=training_args,                       # training arguments, defined above
            train_dataset=self.train_dataset,         # training dataset
            eval_dataset=self.evaluation_dataset,     # evaluation dataset,
            compute_metrics=compute_metrics
        )

    def train(self):
        train_results = self.trainer.train()
        return train_results
    
    def evaluate(self):
        eval_results = self.trainer.evaluate()
        return eval_results

## Experiment Manager

This is a simple experiment manager that runs a series of experiments with the given set of hyperparameters and returns the resulting metrics.

In [10]:
import copy
import numpy as np
import random

class NLIExperiment:
    def __init__(self, experiment_parameters, seed=1234):
        self.experiment_parameters = experiment_parameters
        self.set_random_seed(seed)
    
    def set_random_seed(self, seed):
        np.random.seed(seed)
        random.seed(seed)
        torch.manual_seed(seed)
    
    def run(self):
        experiment_results = []
        
        for model_name in self.experiment_parameters['model_names']:
            experiment_parameters = {}
            experiment_parameters['model_name'] = model_name

            for dataset_name, evaliation_split_names in self.experiment_parameters['dataset_info'].items():
                experiment_parameters['dataset_name'] = dataset_name
            
                for evaliation_split_name in evaliation_split_names:
                    experiment_parameters['evaliation_split_name'] = evaliation_split_name

                    for param_key, param_values in self.experiment_parameters['params'].items():
                        
                        for param_value in param_values:
                              experiment_parameters[param_key] = param_value
                              print('\n\nA new experiment started...')
                              nlitr_trainer = NLITRTrainer(model_name=model_name, dataset_name=dataset_name, evaluation_split=evaliation_split_name, **{param_key:param_value})

                              print('Training...')
                              train_results = nlitr_trainer.train()
                              print('Evaluating...')
                              eval_results = nlitr_trainer.evaluate()
                              
                              experiment_parameters.update(eval_results)
                              print('\nexperiment parameters:', experiment_parameters)
                              print('experiment results:', eval_results)
                              experiment_results.append(copy.deepcopy(experiment_parameters))
        return experiment_results

## Experiments

Below is a set of sample parameters to get a sense of how the results will look like.  You may execute the code with alternative sets of parameters to get a deeper understanding of the difference between the hypothesis-only baseline and the full models under different conditions.

In [11]:
%%time
# You may also experiment with some alternative values denoted as comment.
experiment_parameters = {
    'model_names' : ['dbmdz/bert-base-turkish-cased'], #alternative values: 'model_names' : ['bert-base-cased', 'bert-base-multilingual-cased', 'dbmdz/bert-base-turkish-cased'] 
    'dataset_info' : {'snli_tr': ['validation', 'test']},   #alternative values: {'snli_tr': ['validation', 'test'], 'multinli_tr': ['validation_matched', 'validation_mismatched']}
    'params' : {'hypothesis_only': [False]}
}

#You may set different values for the size of training and evaluation splits for fast iterations (-1 to use all examples in the splits). 
MAX_TRAIN_EXAMPLE_NUM = -1
MAX_EVALUATION_EXAMPLE_NUM = -1

experiment = NLIExperiment(experiment_parameters)
experiment_result = experiment.run()



A new experiment started...
You can set the values of the following parameters via the global variables MAX_TRAIN_EXAMPLE_NUM and MAX_EVALUATION_EXAMPLE_NUM (-1 to use all examples in the splits)
max_train_example_num 2048
max_evaluation_example_num 512


Some weights of the model checkpoint at dbmdz/bert-base-turkish-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were 

Training...


Step,Training Loss
10,1.1431


Evaluating...



experiment parameters: {'model_name': 'dbmdz/bert-base-turkish-cased', 'dataset_name': 'snli_tr', 'evaliation_split_name': 'validation', 'hypothesis_only': False, 'eval_loss': 1.1216628551483154, 'eval_accuracy': 0.28180039138943247, 'eval_runtime': 3.724, 'eval_samples_per_second': 137.219, 'epoch': 3.0, 'eval_mem_cpu_alloc_delta': 0, 'eval_mem_gpu_alloc_delta': 0, 'eval_mem_cpu_peaked_delta': 0, 'eval_mem_gpu_peaked_delta': 10937856}
experiment results: {'eval_loss': 1.1216628551483154, 'eval_accuracy': 0.28180039138943247, 'eval_runtime': 3.724, 'eval_samples_per_second': 137.219, 'epoch': 3.0, 'eval_mem_cpu_alloc_delta': 0, 'eval_mem_gpu_alloc_delta': 0, 'eval_mem_cpu_peaked_delta': 0, 'eval_mem_gpu_peaked_delta': 10937856}


A new experiment started...
You can set the values of the following parameters via the global variables MAX_TRAIN_EXAMPLE_NUM and MAX_EVALUATION_EXAMPLE_NUM (-1 to use all examples in the splits)
max_train_example_num 2048
max_evaluation_example_num 512


Some weights of the model checkpoint at dbmdz/bert-base-turkish-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were 

Training...


Step,Training Loss
10,1.1646


Evaluating...



experiment parameters: {'model_name': 'dbmdz/bert-base-turkish-cased', 'dataset_name': 'snli_tr', 'evaliation_split_name': 'test', 'hypothesis_only': False, 'eval_loss': 1.1589947938919067, 'eval_accuracy': 0.273972602739726, 'eval_runtime': 3.7754, 'eval_samples_per_second': 135.35, 'epoch': 3.0, 'eval_mem_cpu_alloc_delta': 0, 'eval_mem_gpu_alloc_delta': 0, 'eval_mem_cpu_peaked_delta': 0, 'eval_mem_gpu_peaked_delta': 9088512}
experiment results: {'eval_loss': 1.1589947938919067, 'eval_accuracy': 0.273972602739726, 'eval_runtime': 3.7754, 'eval_samples_per_second': 135.35, 'epoch': 3.0, 'eval_mem_cpu_alloc_delta': 0, 'eval_mem_gpu_alloc_delta': 0, 'eval_mem_cpu_peaked_delta': 0, 'eval_mem_gpu_peaked_delta': 9088512}
CPU times: user 3min 54s, sys: 43.7 s, total: 4min 37s
Wall time: 2min 38s


## Results

And, here is the results 🙂 

It should be noted that these results are obtained using only a fraction of the dataset splits due to the time limitation.  Please feel free to play with the global parameters MAX_TRAIN_EXAMPLE_NUM and MAX_EVALUATION_EXAMPLE_NUM (as explained above) to use a wider portion (or all). Playing with these parameters will help get a deeper understanding on the resulting difference between hypothesis-oly baseline and full models.

In [None]:
experiment_result_df = pd.DataFrame(experiment_result)
experiment_result_df.head(n=100) #show all dataframe

Unnamed: 0,model_name,dataset_name,evaliation_split_name,hypothesis_only,eval_loss,eval_accuracy,epoch
0,dbmdz/bert-base-turkish-cased,snli_tr,validation,True,1.122703,0.322896,1.0
1,dbmdz/bert-base-turkish-cased,snli_tr,validation,False,1.091679,0.412916,1.0
2,dbmdz/bert-base-turkish-cased,multinli_tr,validation_matched,True,1.105768,0.340509,1.0
3,dbmdz/bert-base-turkish-cased,multinli_tr,validation_matched,False,1.105668,0.369863,1.0
