# NLPGroup5: Comparing Pre-Trained Models’ Performance on QQA

### Part 1: Setting up for Evaluation
To get started, we import all the libraries needed for our evaluations.

In [1]:
#Collecting all imports
#transformers 4.11.0 or newer is required
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, pipeline, AutoModelForMultipleChoice, TrainingArguments, Trainer, AutoModelForQuestionAnswering
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch
from sklearn.metrics import f1_score
import numpy as np
import transformers
import json
from evaluate import load
from tqdm.notebook import tqdm
import random


Now we load our dataset from the JSON files and reshape it to fit our evaluation process. See the README for where to find these dataset files on the NumEval @ SemEval 2024 site.

In [2]:
datasets = load_dataset("json", data_files={'train':'NLPGroup5/Project/QQA_Data/QQA_train.json', 
                                           'validation':'NLPGroup5/Project/QQA_Data/QQA_dev.json', 
                                           'test':'NLPGroup5/Project/QQA_Data/QQA_test.json'})
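Because the JSON files have to be downloaded separately, it can help to confirm they are in place before calling `load_dataset`. The following is a minimal sketch using only the standard library; the helper name `missing_files` is hypothetical, and the paths are the ones used in the cell above.

```python
from pathlib import Path

# Paths copied from the load_dataset call above.
DATA_FILES = {
    "train": "NLPGroup5/Project/QQA_Data/QQA_train.json",
    "validation": "NLPGroup5/Project/QQA_Data/QQA_dev.json",
    "test": "NLPGroup5/Project/QQA_Data/QQA_test.json",
}

def missing_files(data_files):
    """Return {split: path} for every split whose file does not exist."""
    return {
        split: path
        for split, path in data_files.items()
        if not Path(path).is_file()
    }

missing = missing_files(DATA_FILES)
if missing:
    print("Missing dataset files:", missing)
```

An empty result means all three splits were found relative to the current working directory.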

In [3]:
#Here we are dropping unnecessary columns for brevity
datasets = datasets.remove_columns(['question_char', 'question_sci_10E',
                         'question_sci_10E_char',
                         'question_mask', 'type',])

We evaluate our models with the "Exact Match" metric from Hugging Face. A predicted answer counts only if it matches the reference <u>completely</u>; anything else scores zero.

In [4]:
exact_match = load("exact_match")
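To make the metric concrete, here is a small pure-Python sketch of what an exact-match rate measures: the fraction of predictions that are string-identical to their reference. The function `exact_match_score` is a hypothetical stand-in written for illustration, not the library's implementation, and the sample strings are invented. The Hugging Face metric also offers normalization options such as `ignore_case`, which this sketch does not model.

```python
def exact_match_score(predictions, references):
    """Fraction of predictions that are string-identical to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

# Invented examples: only the first pair matches exactly.
preds = ["bike", "the car", "Car"]
refs = ["bike", "car", "car"]
print(exact_match_score(preds, refs))
```

Note that a partially correct span ("the car") and a case mismatch ("Car") both count as misses under this strict comparison.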

### Part 2: Evaluation Function and Variable Initialization
Here we define our evaluation function. It loads the pretrained model with a question-answering head and the matching tokenizer via `AutoTokenizer`, then wraps both in a Hugging Face question-answering pipeline. Each example is reshaped into the format the pipeline expects: the question is passed as `question`, and the text of the correct option is passed as `context`. We collect the pipeline's predicted answers alongside the references (the correct answers), then compute and print the exact-match score.

In [5]:
#Initializing our model name variable and performance lists
model_name = ''
modelPerformanceList = []
modelNames = []

def evaluate_hf_model(model_name, dataset):
    global modelPerformance
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)       # Initialize the model
    tokenizer = AutoTokenizer.from_pretrained(model_name)                   # Initialize the tokenizer
    
    #initialize our pipeline with our input model
    processor = pipeline('question-answering', model=model, tokenizer=tokenizer)
    
    #format our dataset
    def dataset_generator(dataset):
        for ex in dataset:
            #Here we clean the option answer to use later for "context"
            cleansedAnswer = ex['answer'].replace(" ", "")
            yield (ex,
                {'question' : ex['question'], 'context': ex[cleansedAnswer]})
    #initialize predictions and references lists        
    predictions = []
    references = []

    # Get predictions
    for ex in tqdm(dataset_generator(datasets[dataset]), total=len(datasets[dataset])):
        predictions.append(processor(ex[1])['answer'])

        #Appending our answer which will be our reference
        references.append(ex[0][ex[0]['answer'].replace(" ", "")])

    # Compute metrics
    modelPerformance = exact_match.compute(predictions=predictions, references=references)
    print('Performance of {} : {}'.format(model_name, modelPerformance))
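The flow of the evaluation function can be traced without downloading a model. The sketch below assumes records carry the fields `question`, `Option1`, `Option2`, and `answer` (inferred from how the function indexes each example) and swaps the real pipeline for a hypothetical stand-in, `first_word_predictor`. The records are invented. Unlike the function above, this version returns its result instead of writing to a global, which is one possible design choice rather than the notebook's.

```python
def build_qa_input(record):
    """Map one record to the dict a question-answering pipeline expects."""
    key = record["answer"].replace(" ", "")  # e.g. "Option 1" -> "Option1"
    return {"question": record["question"], "context": record[key]}

def evaluate_with(predict, records):
    """Run `predict` over the records and return the exact-match rate."""
    predictions, references = [], []
    for record in records:
        qa_input = build_qa_input(record)
        predictions.append(predict(qa_input)["answer"])
        references.append(qa_input["context"])  # text of the correct option
    hits = sum(p == r for p, r in zip(predictions, references))
    return {"exact_match": hits / len(references)}

# Hypothetical stand-in for the pipeline: answers with the context's first word.
def first_word_predictor(qa_input):
    return {"answer": qa_input["context"].split()[0]}

# Invented records, for illustration only.
records = [
    {"question": "Which is slower?", "Option1": "car",
     "Option2": "bike", "answer": "Option 2"},
    {"question": "Which is hotter?", "Option1": "the stove",
     "Option2": "the fridge", "answer": "Option 1"},
]
print(evaluate_with(first_word_predictor, records))
```

The second record shows why multi-word options are harder under exact match: returning only part of the context counts as a miss.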

### Part 3: Evaluation on Train Dataset
In this section we evaluate our models on the "Train" dataset.

#### Baseline Models
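Here we evaluate our "Baseline" models, such as BERT. These checkpoints have no fine-tuned question-answering head, so the `qa_outputs` layer is newly initialized, as the warnings in the output report.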

In [6]:
#We are evaluating on our "train" dataset in this section
datasetEval = 'train'

baseline_models = ['bert-base-uncased',
                   'distilbert/distilbert-base-uncased',
                   'dccuchile/bert-base-spanish-wwm-cased']

for model_name in baseline_models:
    #evaluate_hf_model stores its result in the global modelPerformance
    evaluate_hf_model(model_name, datasetEval)
    modelNames.append(model_name)
    modelPerformanceList.append(modelPerformance)
    print('{} performance saved: {}'.format(model_name, modelPerformance))

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/564 [00:00<?, ?it/s]

Performance of bert-base-uncased : {'exact_match': 0.4875886524822695}
bert-base-uncased performance saved: {'exact_match': 0.4875886524822695}


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Performance of distilbert/distilbert-base-uncased : {'exact_match': 0.4627659574468085}
distilbert/distilbert-base-uncased performance saved: {'exact_match': 0.4627659574468085}


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Performance of dccuchile/bert-base-spanish-wwm-cased : {'exact_match': 0.6063829787234043}
dccuchile/bert-base-spanish-wwm-cased performance saved: {'exact_match': 0.6063829787234043}


#### SQuAD Models
Here we evaluate our "SQuAD" models, such as RoBERTa and Dynamic TinyBERT, which have already been fine-tuned for question answering.

In [7]:
#We are evaluating on our "train" dataset in this section
datasetEval = 'train'

squad_models = ['deepset/roberta-base-squad2',
                'Intel/dynamic_tinybert',
                'distilbert/distilbert-base-cased-distilled-squad',
                'FabianWillner/distilbert-base-uncased-finetuned-squad']

for model_name in squad_models:
    #evaluate_hf_model stores its result in the global modelPerformance
    evaluate_hf_model(model_name, datasetEval)
    modelNames.append(model_name)
    modelPerformanceList.append(modelPerformance)
    print('{} performance saved: {}'.format(model_name, modelPerformance))

Performance of deepset/roberta-base-squad2 : {'exact_match': 0.7801418439716312}
deepset/roberta-base-squad2 performance saved: {'exact_match': 0.7801418439716312}


Performance of Intel/dynamic_tinybert : {'exact_match': 0.7535460992907801}
Intel/dynamic_tinybert performance saved: {'exact_match': 0.7535460992907801}


Performance of distilbert/distilbert-base-cased-distilled-squad : {'exact_match': 0.7836879432624113}
distilbert/distilbert-base-cased-distilled-squad performance saved: {'exact_match': 0.7836879432624113}


Performance of FabianWillner/distilbert-base-uncased-finetuned-squad : {'exact_match': 0.8368794326241135}
FabianWillner/distilbert-base-uncased-finetuned-squad performance saved: {'exact_match': 0.8368794326241135}
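
With both groups evaluated on the train split, the scores can be ranked side by side. The sketch below copies the exact-match values from the outputs above into a dictionary named `train_scores` (a hypothetical name). In the notebook itself the same mapping could be built from `modelNames` and `modelPerformanceList`.

```python
# Exact-match scores copied from the train-split outputs above.
train_scores = {
    "bert-base-uncased": 0.4875886524822695,
    "distilbert/distilbert-base-uncased": 0.4627659574468085,
    "dccuchile/bert-base-spanish-wwm-cased": 0.6063829787234043,
    "deepset/roberta-base-squad2": 0.7801418439716312,
    "Intel/dynamic_tinybert": 0.7535460992907801,
    "distilbert/distilbert-base-cased-distilled-squad": 0.7836879432624113,
    "FabianWillner/distilbert-base-uncased-finetuned-squad": 0.8368794326241135,
}

# Sort from highest to lowest exact-match score.
ranked = sorted(train_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{score:.3f}  {name}")
```

On this split, every SQuAD fine-tuned checkpoint ranks above every baseline checkpoint.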


#### Sentiment Analysis Models
Here we are evaluating our "Sentiment Analysis" models like ...

### Part 4: Evaluation on Validation Dataset
In this section we are evaluating performances for our models on our "Validation" dataset.

#### Baseline Models

#### SQuAD Models

#### Sentiment Analysis Models

### Part 5: Evaluation on Test Dataset
In this section we are evaluating performances for our models on our "Test" dataset.

#### Baseline Models

#### SQuAD Models

#### Sentiment Analysis Models

### Part 6: Results Visualization
In this section we will graph the results by model type and dataset.
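
As a starting point for those graphs, results can be organized as a nested mapping from model type to dataset split. The sketch below is one possible layout, not the notebook's final plotting code: the names `results`, `mean_by_group`, and `text_bar` are hypothetical, only the train-split scores from Part 3 are filled in, and text bars stand in for a real chart.

```python
from statistics import mean

# Train-split exact-match scores from Part 3, grouped by model type.
results = {
    "Baseline": {"train": [0.4875886524822695, 0.4627659574468085,
                           0.6063829787234043]},
    "SQuAD": {"train": [0.7801418439716312, 0.7535460992907801,
                        0.7836879432624113, 0.8368794326241135]},
}

def mean_by_group(results):
    """Average the scores for every (model type, split) pair."""
    return {
        (group, split): mean(scores)
        for group, splits in results.items()
        for split, scores in splits.items()
    }

def text_bar(value, width=40):
    """Render a score in [0, 1] as a bar of '#' characters."""
    return "#" * round(value * width)

for (group, split), avg in mean_by_group(results).items():
    print(f"{group:<10}{split:<8}{avg:.3f} {text_bar(avg)}")
```

Validation and test scores can be added under the same keys once those sections are run.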

#### Baseline Model Visualization

#### SQuAD Model Visualization

#### Sentiment Analysis Model Visualization

### Part 7: Conclusion and Analysis

Our results...