This notebook evaluates how several adversarial attacks affect an ELECTRA-based question-answering model fine-tuned on the SQuAD 2.0 dataset.

### Libraries loading

Imports necessary libraries for natural language processing, machine learning, and data manipulation.

In [1]:
import json
import random
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch
from tqdm import tqdm
import pandas as pd
import nltk
from nltk.translate.bleu_score import sentence_bleu
from nltk.tokenize import word_tokenize
from collections import Counter
import spacy

In [2]:
# Load spaCy for grammatical error detection
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")

In [3]:
# Download necessary NLTK data
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ferhatsarikaya/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Model and tokenizer loading

Loads the deepset/electra-base-squad2 checkpoint, an ELECTRA model fine-tuned on SQuAD 2.0, together with its tokenizer.

In [4]:
# Load model and tokenizer
model = AutoModelForQuestionAnswering.from_pretrained('deepset/electra-base-squad2')
tokenizer = AutoTokenizer.from_pretrained('deepset/electra-base-squad2')



### Data loading function

load_squad_data() reads a SQuAD-format JSON file and returns its top-level 'data' list of articles.

In [5]:
def load_squad_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data['data']
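
As a minimal sketch of the structure that load_squad_data() returns, the example below builds a tiny SQuAD 2.0-style record in memory and walks it with the same article, paragraph, and question nesting the evaluation code relies on. The record contents and the helper name `count_questions` are illustrative assumptions and are not taken from the attack files.

```python
import json

# A tiny, made-up SQuAD 2.0-style document (illustrative only).
sample = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of France?",
                            "is_impossible": False,
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        },
                        {
                            "id": "q2",
                            "question": "What is the capital of Atlantis?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ],
}


def count_questions(articles):
    """Return (answerable, unanswerable) question counts."""
    answerable = unanswerable = 0
    for article in articles:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa["answers"]:
                    answerable += 1
                else:
                    unanswerable += 1
    return answerable, unanswerable


# Round-trip through JSON to mimic reading the file from disk.
articles = json.loads(json.dumps(sample))["data"]
print(count_questions(articles))
```

Questions with an empty 'answers' list are the unanswerable SQuAD 2.0 questions; the evaluation function later in this notebook skips them.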

### Answer prediction

get_answer() uses the ELECTRA model to predict answers for given questions and contexts.

In [6]:
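def get_answer(question, context):
    # Encode the question/context pair, truncated to the 512-token model limit
    inputs = tokenizer.encode_plus(question, context, add_special_tokens=True, return_tensors="pt", max_length=512, truncation=True)
    input_ids = inputs["input_ids"].tolist()[0]

    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        outputs = model(**inputs)

    # Most likely start and end token positions (end is made exclusive for slicing)
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits) + 1

    # The slice is empty (answer == "") when the predicted end precedes the predicted start
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    return answer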
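
Because get_answer() takes the argmax of the start and end logits independently, the predicted end can land before the predicted start, and the "no answer" case of SQuAD 2.0 is not handled explicitly. The sketch below shows one commonly used alternative: score every valid span jointly and compare the best span against the score at position 0 (the [CLS] token). It works on plain Python lists so it can run without the model. The function names, `max_answer_len`, and `null_threshold` are assumptions for illustration, not part of this notebook, and the sketch does not restrict spans to context tokens as a full implementation would.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (score, start, end) of the highest-scoring valid span.

    A span is valid when start <= end and it covers at most
    max_answer_len tokens. Index 0 ([CLS]) is skipped because it is
    conventionally reserved for the "no answer" prediction.
    """
    best = None
    for start in range(1, len(start_logits)):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best


def select_answer(start_logits, end_logits, null_threshold=0.0):
    """Return (start, end) token indices, or None for "no answer"."""
    null_score = start_logits[0] + end_logits[0]
    span = best_span(start_logits, end_logits)
    if span is None or null_score - span[0] > null_threshold:
        return None
    return span[1], span[2]


# Independent argmax would pick start=2 and end=1 here (an empty slice);
# joint scoring picks the valid span (2, 3) instead.
start = [0.1, 0.2, 5.0, 0.3]
end = [0.1, 4.0, 0.2, 3.0]
print(select_answer(start, end))
```

In a real pipeline the two lists would come from `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, and the returned `end` index is inclusive, so the token slice would be `input_ids[start:end + 1]`.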

### Evaluation metrics

calculate_f1_score(): Calculates the F1 score between predicted and ground truth answers.<br />
calculate_bleu_score(): Computes the BLEU score for predicted answers. <br />
count_grammatical_errors(): Estimates grammatical errors in predicted answers using spaCy.

In [7]:
def calculate_f1_score(prediction, ground_truth):
    prediction_tokens = word_tokenize(prediction.lower())
    ground_truth_tokens = word_tokenize(ground_truth.lower())
    
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    
    if num_same == 0:
        return 0
    
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1
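
For comparison, the sketch below computes a token-level F1 with answer normalization in the spirit of the official SQuAD evaluation script: lowercase, strip punctuation, drop English articles, and collapse whitespace before comparing. It uses whitespace splitting instead of word_tokenize so that it has no dependencies; the function names `normalize_answer` and `token_f1` are assumptions for illustration and are not what this notebook uses to produce its reported numbers.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, ground_truth):
    """Token-overlap F1 between two normalized answer strings."""
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(ground_truth).split()
    if not pred_tokens or not true_tokens:
        # Two empty answers count as a match; one empty answer is a miss.
        return float(pred_tokens == true_tokens)
    num_same = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


# One shared token out of three predicted tokens and one reference token:
# precision = 1/3, recall = 1, so F1 = 0.5.
print(token_f1("in Paris, France", "Paris"))
```

With normalization, a prediction such as "The Eiffel Tower." fully matches the reference "eiffel tower", whereas a plain lowercase comparison would treat the article and the trailing period as mismatches.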

In [8]:
def calculate_bleu_score(prediction, ground_truth):
    return sentence_bleu([word_tokenize(ground_truth.lower())], word_tokenize(prediction.lower()))
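
By default, sentence_bleu weights 1-gram to 4-gram precision equally, while most SQuAD answers are only a few tokens long. The dependency-free sketch below counts clipped n-gram matches for a short answer to show why the higher-order counts are often zero, which is what drives BLEU to 0 and triggers NLTK's warning. The helper names `ngrams` and `modified_precisions` and the example strings are assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precisions(hypothesis, reference, max_n=4):
    """Return (matches, total) clipped n-gram counts for n = 1..max_n."""
    stats = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # Counter intersection clips each n-gram count to the reference count.
        matches = sum((hyp_counts & ref_counts).values())
        stats.append((matches, sum(hyp_counts.values())))
    return stats


prediction = "eiffel tower".split()
ground_truth = "the eiffel tower".split()
for n, (matches, total) in enumerate(modified_precisions(prediction, ground_truth), start=1):
    print(f"{n}-gram: {matches}/{total}")
```

A two-token prediction has no 3-grams or 4-grams at all, so unsmoothed BLEU-4 is 0 even when every predicted token is correct. If BLEU is kept as a metric for short answers, passing a `smoothing_function` (for example `SmoothingFunction().method1` from `nltk.translate.bleu_score`) or lower-order `weights` to sentence_bleu is one way to avoid this, at the cost of scores that are no longer comparable with the unsmoothed values reported in this notebook.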


In [9]:
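def count_grammatical_errors(text):
    # Heuristic: spaCy assigns one ROOT per parsed sentence, so every ROOT beyond
    # the first is counted as an "error". This is a rough proxy, not a grammar check.
    # Note: an empty string has no ROOT, so the result is -1.
    doc = nlp(text)
    return len([token for token in doc if token.dep_ == 'ROOT']) - 1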


### Main evaluation function

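evaluate_attack() walks every article, paragraph, and question in a dataset, generates a prediction, and computes the metrics above for each question-answer pair. Only answerable questions are scored, using the first listed answer as ground truth, and examples whose answer text does not occur in the first 512 characters of the context are skipped.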

In [10]:
def evaluate_attack(data, attack_name):
    results = []
    for article in tqdm(data, desc=f"Evaluating {attack_name}"):
        for paragraph in article['paragraphs']:
            context = paragraph['context']
            for qa in paragraph['qas']:
                question = qa['question']
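                if qa['answers']:
                    ground_truth = qa['answers'][0]['text']

                    # Skip examples whose answer does not appear in the first 512 characters
                    # of the context. This is a character window, whereas the model input is
                    # truncated to 512 tokens, so it is only a rough filter.
                    if ground_truth not in context[:512]:
                        continue

                    # Run the model only for examples that are actually scored
                    predicted_answer = get_answer(question, context)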
                    
                    exact_match = predicted_answer.lower() == ground_truth.lower()
                    f1_score = calculate_f1_score(predicted_answer, ground_truth)
                    bleu_score = calculate_bleu_score(predicted_answer, ground_truth)
                    grammatical_errors = count_grammatical_errors(predicted_answer)
                    
                    results.append({
                        'attack': attack_name,
                        'question': question,
                        'context': context[:512],  # Truncate context for storage
                        'ground_truth': ground_truth,
                        'predicted_answer': predicted_answer,
                        'exact_match': exact_match,
                        'f1_score': f1_score,
                        'bleu_score': bleu_score,
                        'grammatical_errors': grammatical_errors
                    })
    return results

### Attack evaluation loop

The script iterates through five adversarial attack datasets (AddAny, AddSent, CEIA, DPAEG, TextFooler) and evaluates the model's performance on each.<br /><br />
Results for all attacks are collected in a list, converted to a pandas DataFrame, and saved to a CSV file named "electra-base-squad2_adversarial_attack_results.csv".

In [11]:
# List of attack files
attack_files = [
    ("SQuAD/squad-v2.0-addany.json", "AddAny"),
    ("SQuAD/squad-v2.0-addsent.json", "AddSent"),
    ("SQuAD/squad-v2.0-CEIA.json", "CEIA"),
    ("SQuAD/squad-v2.0-dpaeg.json", "DPAEG"),
    ("SQuAD/squad-v2.0-textfooler.json", "TextFooler")
]

all_results = []

for file_path, attack_name in attack_files:
    data = load_squad_data(file_path)
    results = evaluate_attack(data, attack_name)
    all_results.extend(results)

# Save results to CSV
df = pd.DataFrame(all_results)
df.to_csv("electra-base-squad2_adversarial_attack_results.csv", index=False)


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
Evaluating AddAny: 100%|████████████████████| 442/442 [2:51:57<00:00, 23.34s/it]
Evaluating AddSent: 100%|███████████████████| 442/442 [2:44:37<00:00, 22.35s/it]
Evaluating CEIA: 100%|██████████████████████| 442/442 [2:35:47<00:00, 21.15s/it]
Evaluating DPAEG: 100%|█████████████████████| 442/442 [2:43:00<00:00, 22.13s/it]
E

### Summaries

Calculates and prints summary statistics for each attack type, then saves them to a second CSV file named "electra-base-squad2_adversarial_attack_summary.csv".

In [12]:
# Calculate and print summary statistics
summary = df.groupby('attack').agg({
    'exact_match': 'mean',
    'f1_score': 'mean',
    'bleu_score': 'mean',
    'grammatical_errors': 'mean',
    'attack': 'count'
})
summary.columns = ['Exact Match', 'F1 Score', 'BLEU Score', 'Avg Grammatical Errors', 'Sample Size']
print(summary)

# Save summary to CSV
summary.to_csv("electra-base-squad2_adversarial_attack_summary.csv")

            Exact Match  F1 Score  BLEU Score  Avg Grammatical Errors  Sample Size
attack
AddAny         0.809581  0.887708    0.212839                0.037795        67945
AddSent        0.805431  0.884460    0.210867                0.037118        67945
CEIA           0.902246  0.919227    0.171513                0.029635        56724
DPAEG          0.755072  0.791677    0.142249                0.084516        56143
TextFooler     0.466026  0.530060    0.084597                0.187982        54351
