The purpose of this notebook is to create an adversarial version of SQuAD (Stanford Question Answering Dataset) v2.0 using an <b>AddAny</b>-style attack. The attack appends arbitrary, unrelated sentences to the end of each context paragraph. The goal is to test the robustness of question-answering models against irrelevant information.<br /><br />

By adding these arbitrary sentences, the script creates a more challenging dataset. Question-answering models trained or evaluated on this dataset will need to distinguish between relevant and irrelevant information, potentially exposing weaknesses in their comprehension abilities.<br /><br />

This adversarial dataset can be used to:
<ul>
    <li>Evaluate the robustness of existing question-answering models.</li>
    <li>Train more robust models that can handle irrelevant information.</li>
    <li>Study the impact of added noise on model performance.</li>
</ul>

Because the sentences are appended after the original text, the original question-answer pairs remain valid at the same character offsets in the new, longer contexts. This allows for a direct comparison between model performance on the original and adversarial datasets.
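
The offset behaviour can be checked on a toy example. This is a minimal sketch: the context, answer, and offset are made-up illustration values, not SQuAD data. It also shows that a plain string search returns the first occurrence of the answer text, which is not necessarily the annotated one.

```python
# Hypothetical example values (not taken from SQuAD)
context = "Paris is the capital of France. Paris hosted the 1900 Olympics."
answer_text = "Paris"
answer_start = 32  # annotated occurrence: the second "Paris"

# Append one arbitrary sentence, as the notebook does
adv_context = context + " " + "The weather was unusually warm that day."

# The annotated span is unchanged because text was only added after it
span = adv_context[answer_start:answer_start + len(answer_text)]

# A fresh string search finds the first occurrence instead
first_occurrence = adv_context.index(answer_text)

print(span, first_occurrence)
```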

### Loading libraries

Imports the libraries needed for file handling (os), JSON processing (json), random selection (random), NLP (spacy), and progress tracking (tqdm).

In [1]:
import os
import json
import random
import spacy
from tqdm import tqdm

### SpaCy loading

Loads the small English spaCy model (en_core_web_sm), preferring the GPU if one is available. The nlp object is not used by the cells that follow.

In [2]:
# Load spaCy model
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")

### generate_arbitrary_sentence() function

Contains a list of 91 pre-defined arbitrary sentences.<br />
Randomly selects and returns one of these sentences when called.

In [3]:
def generate_arbitrary_sentence():
    arbitrary_sentences = [
        "The weather was unusually warm that day.",
        "A new study shows that coffee might be good for your health.",
        "Scientists have discovered a new species of butterfly.",
        "The local museum is hosting a special exhibition next month.",
        "Recent advancements in technology have revolutionized communication.",
        "A group of researchers published a groundbreaking paper last week.",
        "The city council approved a new urban development plan.",
        "Experts predict significant changes in the job market over the next decade.",
        "A rare astronomical event will be visible in the night sky this weekend.",
        "The national park announced the birth of an endangered species.",
        "The restaurant introduced a new menu featuring exotic dishes.",
        "A famous actor announced his retirement from the film industry.",
        "The university is offering a new course on sustainable energy.",
        "A historic building in the city center was recently renovated.",
        "The annual music festival attracted thousands of visitors.",
        "New research suggests that meditation can improve mental health.",
        "The government launched a campaign to promote recycling.",
        "A local artist unveiled a new sculpture in the town square.",
        "Scientists are exploring the potential of renewable energy sources.",
        "The community center is organizing free workshops for residents.",
        "A groundbreaking ceremony was held for the new hospital wing.",
        "The library extended its hours to accommodate more visitors.",
        "A new bakery opened downtown, specializing in gluten-free pastries.",
        "The zoo welcomed a pair of rare pandas from China.",
        "An ancient manuscript was discovered in a remote monastery.",
        "The local theater is staging a production of Shakespeare's Hamlet.",
        "A tech company unveiled its latest smartphone model.",
        "Volunteers cleaned up the beach as part of an environmental initiative.",
        "A record-breaking heatwave hit the region last summer.",
        "A charity event raised funds for children's education.",
        "The city hosted an international conference on climate change.",
        "A famous chef published a cookbook filled with healthy recipes.",
        "The high school celebrated its centennial anniversary.",
        "A new bike-sharing program was launched in the city.",
        "The botanical garden is hosting a series of gardening workshops.",
        "A local author released a bestselling novel this month.",
        "The national museum opened a new exhibit on ancient Egypt.",
        "A renowned pianist performed at the concert hall last night.",
        "The wildlife reserve is home to several endangered species.",
        "A startup developed an app to help people with disabilities.",
        "The mayor announced plans for a new public transportation system.",
        "A research team found evidence of water on Mars.",
        "The sports team won their championship game in an exciting finish.",
        "A documentary film about climate change received critical acclaim.",
        "The community garden is flourishing thanks to volunteer efforts.",
        "An art gallery displayed works by local artists this weekend.",
        "A historic shipwreck was discovered off the coast.",
        "The orchestra played a symphony by Beethoven to a full house.",
        "A medical breakthrough offers new hope for cancer patients.",
        "The high-speed train service reduced travel time significantly.",
        "A new law was passed to protect endangered wildlife.",
        "The festival featured performances by international musicians.",
        "A children's book was released, inspiring young readers worldwide.",
        "The local market offers a wide variety of organic produce.",
        "A new fitness center opened with state-of-the-art equipment.",
        "The weather forecast predicts heavy snowfall this weekend.",
        "A unique art installation was set up in the city park.",
        "The historic district is known for its beautiful architecture.",
        "A charity organization provided aid to disaster-stricken areas.",
        "The library hosted a reading event for children.",
        "A new app helps users track their carbon footprint.",
        "The film festival showcased independent films from around the world.",
        "A wildlife photographer captured stunning images of a rare bird.",
        "The city introduced a new policy to reduce air pollution.",
        "A culinary school offered classes on international cuisine.",
        "The marathon attracted runners from various countries.",
        "A science fair exhibited innovative projects by students.",
        "The book club discussed a popular novel at their latest meeting.",
        "A new vaccine was developed to combat a viral outbreak.",
        "The theater group performed a modern adaptation of a classic play.",
        "A local band released their debut album to positive reviews.",
        "The national park is a popular destination for hikers and campers.",
        "A new technology aims to make solar energy more efficient.",
        "The charity marathon raised funds for cancer research.",
        "A famous artist donated a painting to the local museum.",
        "The farmers' market features fresh produce from local farms.",
        "A science experiment revealed surprising results about plant growth.",
        "The community center offers after-school programs for children.",
        "A new species of fish was discovered in the deep sea.",
        "The hiking trail offers breathtaking views of the mountains.",
        "A robotics competition challenged students to design innovative robots.",
        "The town square was decorated for the holiday season.",
        "A history professor gave a lecture on ancient civilizations.",
        "The aquarium added a new exhibit featuring marine life from the Arctic.",
        "A famous author gave a talk at the local bookstore.",
        "The cycling race covered challenging terrain and scenic routes.",
        "A local brewery released a new craft beer this month.",
        "The annual fair featured rides, games, and food stalls.",
        "A renewable energy project aims to power the entire community.",
        "The local symphony orchestra played a concert under the stars.",
        "A wildlife documentary highlighted the plight of endangered species."
    ]
    return random.choice(arbitrary_sentences)
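
If reproducible output or distinct sentences per paragraph matter, one option is a seeded random.Random instance combined with sample(). This is a sketch and not part of the notebook: pick_distractors and the three-sentence pool are hypothetical names and values used only for illustration.

```python
import random


def pick_distractors(pool, num_sentences, rng):
    """Return num_sentences distinct sentences drawn from pool."""
    # sample() draws without replacement, so no sentence repeats.
    # It raises ValueError if num_sentences exceeds len(pool).
    return rng.sample(pool, num_sentences)


# Small stand-in pool (the notebook's full list would be used in practice)
pool = [
    "The weather was unusually warm that day.",
    "Scientists have discovered a new species of butterfly.",
    "The city council approved a new urban development plan.",
]

# Two generators created with the same seed produce the same picks
first = pick_distractors(pool, 2, random.Random(0))
second = pick_distractors(pool, 2, random.Random(0))
```

By contrast, calling random.choice() repeatedly, as generate_arbitrary_sentence() does, can return the same sentence twice for one paragraph, and the result changes from run to run unless the global generator is seeded.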


### process_squad_file() function

This is the main function that processes the SQuAD dataset.<br />
It reads the input SQuAD JSON file, appends num_sentences arbitrary sentences to every context, and writes the adversarial version to output_file.

In [4]:
def process_squad_file(input_file, output_file, num_sentences=1):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    # Reuse the version tag of the input file, falling back to "v2.0"
    new_data = {"version": data.get("version", "v2.0"), "data": []}

    for article in tqdm(data['data'], desc="Processing articles"):
        new_article = {"title": article['title'], "paragraphs": []}
        
        for paragraph in article['paragraphs']:
            context = paragraph['context']
            
            # Generate arbitrary sentences
            arbitrary_sentences = [generate_arbitrary_sentence() for _ in range(num_sentences)]
            
            # Create adversarial context
            adv_context = context + " " + " ".join(arbitrary_sentences)
            
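            new_qas = []
            for qa in paragraph['qas']:
                # The sentences are appended after the original text, so the
                # existing answer_start offsets are still correct. Re-searching
                # with adv_context.index() would return the first occurrence of
                # the answer text, which is not always the annotated one.
                new_qa = dict(qa)
                # Copy the answer dicts so the output does not share them with `data`
                new_qa['answers'] = [dict(answer) for answer in qa['answers']]

                new_qas.append(new_qa)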
            
            new_paragraph = {
                "context": adv_context,
                "qas": new_qas
            }
            new_article['paragraphs'].append(new_paragraph)
        
        new_data['data'].append(new_article)

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(new_data, f, ensure_ascii=False, indent=2)

### Dataset processing

process_squad_file() iterates through each article and paragraph in the original dataset.<br />
For each paragraph, it:
<ul>
    <li>Generates num_sentences arbitrary sentences using generate_arbitrary_sentence().</li>
    <li>Appends these sentences to the original context, creating an adversarial context.</li>
    <li>Carries over every QA pair with answer start positions that are valid in the adversarial context; appending text does not shift offsets into the original text.</li>
</ul>
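
The per-paragraph steps above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the notebook's own code: append_distractors is a hypothetical helper, and the paragraph is a made-up record in the SQuAD v2.0 layout (including one unanswerable question).

```python
import copy


def append_distractors(paragraph, distractors):
    """Return a copy of a SQuAD paragraph with distractors appended to its context."""
    # deepcopy keeps the input paragraph (and its nested answers) untouched
    new_paragraph = copy.deepcopy(paragraph)
    new_paragraph["context"] = paragraph["context"] + " " + " ".join(distractors)
    return new_paragraph


# Hypothetical paragraph in SQuAD v2.0 layout
paragraph = {
    "context": "Normandy is a region in France.",
    "qas": [
        {"id": "q1", "question": "Where is Normandy?", "is_impossible": False,
         "answers": [{"text": "France", "answer_start": 24}]},
        {"id": "q2", "question": "Who rules Normandy?", "is_impossible": True,
         "answers": []},
    ],
}

adv = append_distractors(
    paragraph, ["The zoo welcomed a pair of rare pandas from China."]
)

# The annotated offset still points at the answer in the longer context
start = adv["qas"][0]["answers"][0]["answer_start"]
print(adv["context"][start:start + len("France")])
```

Returning a deep copy rather than editing in place keeps the original data available for a side-by-side comparison with the adversarial version.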

### Output generation

Creates a new JSON structure with the modified data.<br />
Writes this new structure to the output file.

In [5]:
# Create data set
path = "SQuAD/"
input_file = os.path.join(path, "train-v2.0.json")
output_file = os.path.join(path, "squad-v2.0-addany.json")
process_squad_file(input_file, output_file, num_sentences=2)

Processing articles: 100%|██████████████████| 442/442 [00:00<00:00, 1513.45it/s]
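
One way to sanity-check the generated file is to confirm that every answer text still matches the context at its recorded offset. The following is a minimal sketch: find_misaligned_answers is a hypothetical helper, and the tiny dataset is made up, with the second offset deliberately wrong to show what a failure looks like.

```python
def find_misaligned_answers(squad):
    """Return ids of questions with an answer that does not match its offset."""
    bad_ids = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa.get("answers", []):
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if context[start:end] != answer["text"]:
                        bad_ids.append(qa["id"])
                        break  # report each question at most once
    return bad_ids


# Hypothetical dataset in SQuAD layout; q2 has a deliberately wrong offset
sample = {
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "The Nile flows north. The weather was unusually warm that day.",
            "qas": [
                {"id": "q1", "question": "Which river?",
                 "answers": [{"text": "Nile", "answer_start": 4}]},
                {"id": "q2", "question": "Which direction?",
                 "answers": [{"text": "north", "answer_start": 0}]},
            ],
        }],
    }],
}

print(find_misaligned_answers(sample))
```

For the real output, the same function can be applied to the dictionary obtained by loading output_file with json.load(); an empty list means every recorded offset matches its answer text.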
