The purpose of this notebook is to create an adversarial version of the SQuAD (Stanford Question Answering Dataset) v2.0 dataset using the <b>AddSent</b> attack method. This attack adds a single adversarial sentence to the end of each context paragraph. The key features of this attack are:<br />

<b>Relevance:</b> The adversarial sentence is related to the content of the paragraph, as it uses key words from one of the questions.<br />
<b>Distraction:</b> The sentence is designed to be misleading, typically stating that a key concept is not related to the answer or context.<br />
<b>Consistency:</b> Only one adversarial sentence is added per paragraph, affecting all questions for that paragraph.<br /><br />

The goals of this adversarial dataset are to:
<ul>
    <li>Test the robustness of question-answering models against misleading information.</li>
    <li>Evaluate how well models can distinguish between relevant and irrelevant information, even when the irrelevant information seems related to the question.</li>
    <li>Provide a more challenging dataset for training and evaluating question-answering systems.</li>
</ul>

This type of attack is more targeted than inserting arbitrary sequences of words (as in the AddAny attack) because it creates semantically relevant but misleading content. It challenges models not only to comprehend the text but also to reason about the relevance and truthfulness of information in the context of specific questions.<br /><br />

Because the adversarial sentence is appended to the end of the context, the character offsets of the original answers do not shift, so the original question-answer pairs remain valid in the new, longer contexts. This allows a direct comparison between model performance on the original and adversarial datasets.
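
The point about answer positions can be illustrated with a small sketch. The context, answer, and distractor below are hand-written toy values, not taken from SQuAD. The sketch shows that an offset annotated on the original context still selects the same span after a sentence is appended, whereas searching for the answer text returns its first occurrence, which may be a different mention.

```python
# Toy paragraph (hypothetical): the answer text "1066" appears twice.
context = "A charter of 1066 is lost. The battle itself took place in 1066."
answer_text = "1066"
# Assume the annotated span is the second mention.
answer_start = context.rindex(answer_text)

# Hypothetical distractor in the same style as the notebook's template.
adv_sentence = "However, battle is not related to 1066."
adv_context = context + " " + adv_sentence

# The original offset still selects the annotated span in the longer context.
span = adv_context[answer_start:answer_start + len(answer_text)]

# A plain text search finds the first mention instead.
first_match = adv_context.index(answer_text)

print(span == answer_text)          # the stored offset is still valid
print(first_match == answer_start)  # the search result differs from it
```

In this toy case the search lands on the first mention, so storing its result would silently move the annotation to a different span.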

### Loading libraries

Imports necessary libraries for JSON processing, random selection, NLP tasks (spaCy), file handling, and progress tracking.

In [1]:
import json
import random
import spacy
import os
from tqdm import tqdm

### SpaCy model loading

Loads the small spaCy English language model, using the GPU if one is available.

In [2]:
# Load spaCy model
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")

### generate_adversarial_sentence() function
Selects a random question-answer pair from the given set.<br />
Extracts key words (nouns, verbs, and adjectives) from the question using spaCy.<br />
Generates a distracting sentence based on these key words and the answer.<br />
For questions without answers (in SQuAD v2.0), it creates a general distracting sentence.
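
As a rough illustration of the template step alone, the sketch below fills the two sentence patterns from a hand-written list of key words, so it runs without spaCy. The helper name `fill_template` and the explicit `rng` argument are assumptions made for this example; passing a seeded `random.Random` is one way to make the choice repeatable.

```python
import random

def fill_template(key_words, answer, rng):
    """Hypothetical helper: build a distractor from already-extracted key words."""
    key_word = rng.choice(key_words)
    if answer:
        # Answerable question: claim the key word is unrelated to the answer
        return f"However, {key_word} is not related to {answer}."
    # Unanswerable question (SQuAD v2.0): fall back to a general statement
    return f"However, {key_word} is not relevant to this context."

rng = random.Random(0)  # fixed seed so repeated runs make the same choices
print(fill_template(["battle"], "1066", rng))
print(fill_template(["capital"], "", rng))
```

With single-element key word lists, as here, the output does not depend on the seed; with longer lists the seed determines which word is picked.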

In [3]:
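def generate_adversarial_sentence(qas):
    # Choose a random question-answer pair to base the adversarial sentence on
    qa = random.choice(qas)
    question = qa['question']
    # Unanswerable questions (SQuAD v2.0) have an empty answers list
    answer = qa['answers'][0]['text'] if qa['answers'] else ""

    # Prefer content words (nouns, verbs, adjectives) from the question
    doc = nlp(question)
    key_words = [token.text for token in doc if token.pos_ in ['NOUN', 'VERB', 'ADJ']]
    if not key_words:
        # Fall back to any non-punctuation token
        key_words = [token.text for token in doc if token.pos_ != 'PUNCT']
    if not key_words:
        # Empty or punctuation-only question: use a neutral placeholder word
        # so that random.choice() does not fail on an empty list
        key_words = ["this"]
    key_word = random.choice(key_words)

    if answer:
        distracting_sentence = f"However, {key_word} is not related to {answer}."
    else:
        distracting_sentence = f"However, {key_word} is not relevant to this context."

    return distracting_sentence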

### process_squad_file() function

Reads the input SQuAD JSON file.<br />
Processes each article and paragraph in the dataset.<br />
For each paragraph:
<ul>
    <li>Generates one adversarial sentence using generate_adversarial_sentence().</li>
    <li>Appends this sentence to the original context to create an adversarial context.</li>
    <li>Copies every QA pair into the new paragraph, with answer start positions that refer to the adversarial context.</li>
</ul>

Creates a new JSON structure with the modified data.<br />
Writes this new structure to the output file.
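
The nesting that this function walks (articles, then paragraphs, then QA pairs) can be seen on a tiny hand-written dataset. The content below is invented for illustration and only mirrors the SQuAD v2.0 field names; `iter_qas` is a hypothetical helper, not part of the notebook.

```python
# Hand-written toy data in the SQuAD v2.0 layout (not real SQuAD content)
toy = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of France?",
                            "is_impossible": False,
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        },
                        {
                            "id": "q2",
                            "question": "Who founded Paris?",
                            "is_impossible": True,
                            "answers": [],  # unanswerable: empty list
                        },
                    ],
                }
            ],
        }
    ],
}

def iter_qas(dataset):
    """Yield (title, context, qa) for every QA pair in a SQuAD-style dict."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield article["title"], paragraph["context"], qa

answerable = [qa["id"] for _, _, qa in iter_qas(toy) if qa["answers"]]
unanswerable = [qa["id"] for _, _, qa in iter_qas(toy) if not qa["answers"]]
print(answerable, unanswerable)
```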

In [4]:
def process_squad_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    new_data = {"version": data.get("version", "v2.0"), "data": []}

    for article in tqdm(data['data'], desc="Processing articles"):
        new_article = {"title": article['title'], "paragraphs": []}
        
        for paragraph in article['paragraphs']:
            context = paragraph['context']
            new_qas = []
            
            # Generate one adversarial sentence for the entire paragraph
            adv_sentence = generate_adversarial_sentence(paragraph['qas'])
            
            # Create adversarial context
            adv_context = context + " " + adv_sentence
            
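            for qa in paragraph['qas']:
                # Copy the QA pair and its answer dicts so the loaded data is not mutated
                new_qa = dict(qa, answers=[dict(a) for a in qa['answers']])
                # The adversarial sentence is appended after the original text, so the
                # existing answer_start offsets still point at the annotated spans.
                # Re-deriving them with str.index() would return the first occurrence
                # of the answer text, which is not always the annotated one.
                new_qas.append(new_qa)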
            
            new_paragraph = {
                "context": adv_context,
                "qas": new_qas
            }
            new_article['paragraphs'].append(new_paragraph)
        
        new_data['data'].append(new_article)

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(new_data, f, ensure_ascii=False, indent=2)

### Main execution
Sets the input file path to "SQuAD/train-v2.0.json".<br />
Sets the output file path to "SQuAD/squad-v2.0-addsent.json".<br />
Calls process_squad_file() with these paths.

In [5]:
# Create data set
path = "SQuAD/"
input_file = os.path.join(path, "train-v2.0.json")
output_file = os.path.join(path, "squad-v2.0-addsent.json")
process_squad_file(input_file, output_file)

Processing articles: 100%|████████████████████| 442/442 [00:28<00:00, 15.50it/s]
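
Once a file has been written, one possible sanity check is to reload it and confirm that every stored answer offset still selects the answer text. The sketch below does this on a small hand-written sample saved to a temporary directory instead of the real output file. The function name `misaligned_answer_ids` and the sample data are assumptions made for this example; one offset is deliberately wrong so the check has something to report.

```python
import json
import os
import tempfile

def misaligned_answer_ids(dataset):
    """Return ids of QA pairs whose answer_start does not match the answer text."""
    bad_ids = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if context[start:end] != answer["text"]:
                        bad_ids.append(qa["id"])
                        break  # report each QA pair at most once
    return bad_ids

# Hand-written stand-in for a generated file (not real SQuAD data)
sample = {
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Paris is the capital of France. However, capital is not related to Paris.",
            "qas": [
                {"id": "ok", "question": "What is the capital of France?",
                 "answers": [{"text": "Paris", "answer_start": 0}]},
                # Deliberately off by one: "France" starts at offset 24
                {"id": "shifted", "question": "Which country is Paris the capital of?",
                 "answers": [{"text": "France", "answer_start": 25}]},
            ],
        }],
    }],
}

# Round-trip through a JSON file, as the notebook does with its output
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample-addsent.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)
    with open(sample_path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

bad = misaligned_answer_ids(reloaded)
print(bad)
```

Pointing the same function at the dictionary loaded from the generated file would list any QA pairs whose offsets need attention; an empty list means every answer span lines up with its context.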
