## Task 1: Data Preparation Notebook
Goal: Load the ConflictQA_Dataset.json file and transform it into the two kinds of .jsonl files we need for our finetuning experiments. Each kind is written once per dataset split (for example, train_context_only.jsonl):

- context_only.jsonl: For the baseline experiment (Context-Only).

- exp_ans.jsonl: For the paper's main experiment (Explain-and-Answer).

### Step 1: Load the Dataset

In [1]:
import pandas as pd
import json

json_filename = '/Users/francescodangolo/Desktop/CS 421 - Natural Language Processing/Research Project/qa-with-conflicting-context/data/ConflictQA_Dataset.json'

try:
    df = pd.read_json(json_filename)

    print("--- Dataset Info ---")
    df.info()

    print("\n--- Dataset Splits ---")
    print(df['split'].value_counts())

except FileNotFoundError:
    print(f"Error: File not found at '{json_filename}'. Check that the path is correct.")
except Exception as e:
    print(f"An error occurred: {e}")

--- Dataset Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1617 entries, 0 to 1616
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   annotation_task_id  1617 non-null   int64 
 1   firstAnswer         1617 non-null   object
 2   firstContext        1617 non-null   object
 3   secondAnswerExist   1617 non-null   object
 4   secondAnswer        796 non-null    object
 5   secondContext       796 non-null    object
 6   thirdAnswerExist    796 non-null    object
 7   thirdAnswer         663 non-null    object
 8   thirdContext        663 non-null    object
 9   fourthAnswerExist   109 non-null    object
 10  correctAnswer       1616 non-null   object
 11  reasons             1617 non-null   object
 12  explanation         1617 non-null   object
 13  question            1617 non-null   object
 14  contexts            1617 non-null   object
 15  sources             1617 non-null   object
 16  spl

In [2]:
df = pd.read_json(json_filename)
df.head()

Unnamed: 0,annotation_task_id,firstAnswer,firstContext,secondAnswerExist,secondAnswer,secondContext,thirdAnswerExist,thirdAnswer,thirdContext,fourthAnswerExist,correctAnswer,reasons,explanation,question,contexts,sources,split,ambigqa_answer
0,0,every Monday,"['F', 'E']",B,,,,,,,every Monday,"['B', 'C']","The answer to the question ""When is the AP Men...",When is ap men's basketball poll released?,[Alabama is No. 1 in the final AP Top 25 men's...,"[apnews.com, en.wikipedia.org, www.cbssports.c...",dev,
1,1,38,['A'],B,,,,,,,38,"['E', 'D']","“His Airness” played 179 postseason games, 38 ...",How many 40 point games does michael jordan ha...,[Michael Jordan had the most games in the play...,"[www.statmuse.com, www.statmuse.com, www.statm...",dev,
2,2,Wasim Akram,"['I', 'G']",A,Akram and Mohammad Sami,['G'],B,,,,Wasim Akram,['B'],Context9 clearly states that Wasim Akram is on...,Who is the only bowler to have taken a hattric...,[Australian Peter Siddle is the only bowler to...,"[en.wikipedia.org, www.quora.com, en.wikipedia...",dev,
3,3,Margo Harshman,"['B', 'D', 'F']",B,,,,,,,Margo Harshman,"['B', 'E']","Contexts 2, 4 and 6 from diverse sources menti...",Who plays timothy mcgee's wife on ncis?,[She is best known for her roles as Tawny Dean...,"[en.wikipedia.org, www.distractify.com, www.re...",dev,
4,4,Noah Andrew Ringer,"['A', 'B', 'C', 'D', 'E', 'F']",B,,,,,,,Noah Andrew Ringer,['B'],Noah Andrew Ringer is an American actor who st...,Who played aang in the last airbender movie?,"[Noah Andrew Ringer (born November 18, 1996) i...","[en.wikipedia.org, www.imdb.com, en.wikipedia....",dev,
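
The df.info() summary above lists 1616 non-null correctAnswer values for 1617 rows, so one record has no gold answer. Below is a minimal sketch of how such rows could be located before formatting. It runs on a small hypothetical DataFrame (toy_df and its values are made up for illustration); only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame; only the column names match.
toy_df = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "correctAnswer": ["a1", None, "a3"],
    "split": ["train", "dev", "test"],
})

# Boolean mask that is True where the gold answer is missing.
missing_mask = toy_df["correctAnswer"].isna()
print(f"Rows without a correctAnswer: {missing_mask.sum()}")
print(toy_df.loc[missing_mask, ["question", "split"]])

# One possible policy (an assumption, not the notebook's choice): drop them.
clean_df = toy_df[~missing_mask].reset_index(drop=True)
```

Whether to drop, repair, or keep such a row is a judgment call. The counts printed later in this notebook (410 + 813 + 394 = 1617) indicate that the original run kept every row.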


### Step 2: Define the Transformation Logic

For finetuning, we need clear input and output pairs.

- Input Format (same for both): A single prompt string containing the question followed by all of its contexts, with the contexts joined by a \n---\n separator.
- Output Format 1 (Context-Only): Just the correctAnswer.
- Output Format 2 (Exp & Ans): The explanation, a delimiter (\n\n), and then the correctAnswer.

In [3]:
def create_io_pairs(dataframe, target_split, exp_ans=False):
    """
    Processes a DataFrame split and formats it into input/output pairs.

    Args:
        dataframe (pd.DataFrame): The full DataFrame containing all data.
        target_split (str): The split to process (e.g., "train", "dev", "test").
        exp_ans (bool): If True, creates the 'Explain-and-Answer' output.
                        If False, creates the 'Context-Only' output.

    Returns:
        list: A list of dictionaries, where each dict is {"input": ..., "output": ...}
    """
    split_df = dataframe[dataframe['split'] == target_split].copy()
    
    formatted_data = []
    
    # Iterate over each row in the split
    for _, row in split_df.iterrows():
        # Format the contexts into a single string
        # We'll join them with a clear separator
        all_contexts = "\n---\n".join(row['contexts'])
        
        # Create the standard input prompt
        input_text = f"question: {row['question']}\n\ncontexts: {all_contexts}"
        
        output_text = ""
        if exp_ans:
            # Format for the "Explain-and-Answer" experiment
            output_text = f"{row['explanation']}\n\n{row['correctAnswer']}"
        else:
            # Format for the "Context-Only" baseline
            output_text = row['correctAnswer']
            
        # Append the formatted pair to our list
        formatted_data.append({
            "input": input_text,
            "output": output_text
        })
        
    print(f"Created {len(formatted_data)} I/O pairs for split: '{target_split}' (Exp&Ans={exp_ans})")
    return formatted_data
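
Because the Explain-and-Answer target joins the explanation and the answer with a blank line, a model output in that format has to be split again before the answer can be scored. The sketch below shows one way this might be done. The split_exp_ans helper and the example string are hypothetical; it splits on the last \n\n because an explanation may itself contain blank lines.

```python
def split_exp_ans(output_text):
    """Split an Explain-and-Answer string into (explanation, answer).

    Illustrative helper: everything after the last blank line is
    treated as the answer, everything before it as the explanation.
    """
    explanation, sep, answer = output_text.rpartition("\n\n")
    if not sep:
        # No delimiter found: treat the whole string as the answer.
        return "", output_text.strip()
    return explanation.strip(), answer.strip()


# Hypothetical output with a two-paragraph explanation.
example = "Para one.\n\nPara two mentions the actor.\n\nNoah Andrew Ringer"
explanation, answer = split_exp_ans(example)
print(repr(explanation))
print(repr(answer))
```

This assumes the answer itself never contains a blank line; if that could happen, a more distinctive delimiter would be a safer choice.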

In [10]:
DATA_PATH = "/Users/francescodangolo/Desktop/CS 421 - Natural Language Processing/Research Project/qa-with-conflicting-context/data/splits/"

In [13]:
# Get all the unique splits in our dataset (e.g., ['dev', 'train', 'test'])
splits = df['split'].unique()

for split in splits:
    # --- 1. Create and Save Context-Only file ---
    context_only_data = create_io_pairs(df, split, exp_ans=False)
    context_only_filename = DATA_PATH + f"{split}_context_only.jsonl"
    
    with open(context_only_filename, 'w') as f:
        for item in context_only_data:
            # json.dumps converts the Python dict to a JSON string
            f.write(json.dumps(item) + '\n')
            
    print(f"Saved {context_only_filename}")

    # --- 2. Create and Save Exp-and-Ans file ---
    exp_ans_data = create_io_pairs(df, split, exp_ans=True)
    exp_ans_filename = DATA_PATH + f"{split}_exp_ans.jsonl"
    
    with open(exp_ans_filename, 'w') as f:
        for item in exp_ans_data:
            f.write(json.dumps(item) + '\n')
            
    print(f"Saved {exp_ans_filename}")
    print("---")

print("All files created successfully!")

Created 410 I/O pairs for split: 'dev' (Exp&Ans=False)
Saved /Users/francescodangolo/Desktop/CS 421 - Natural Language Processing/Research Project/qa-with-conflicting-context/data/splits/dev_context_only.jsonl
Created 410 I/O pairs for split: 'dev' (Exp&Ans=True)
Saved /Users/francescodangolo/Desktop/CS 421 - Natural Language Processing/Research Project/qa-with-conflicting-context/data/splits/dev_exp_ans.jsonl
---
Created 813 I/O pairs for split: 'test' (Exp&Ans=False)
Saved /Users/francescodangolo/Desktop/CS 421 - Natural Language Processing/Research Project/qa-with-conflicting-context/data/splits/test_context_only.jsonl
Created 813 I/O pairs for split: 'test' (Exp&Ans=True)
Saved /Users/francescodangolo/Desktop/CS 421 - Natural Language Processing/Research Project/qa-with-conflicting-context/data/splits/test_exp_ans.jsonl
---
Created 394 I/O pairs for split: 'train' (Exp&Ans=False)
Saved /Users/francescodangolo/Desktop/CS 421 - Natural Language Processing/Research Project/qa-with-con

In [17]:
# List all the .jsonl files we just created
%cd /Users/francescodangolo/Desktop/CS 421 - Natural Language Processing/Research Project/qa-with-conflicting-context/data/splits/
!ls -lh *.jsonl

print("\n--- Verifying 'train_context_only.jsonl' ---")
# The 'head' command shows the first two lines of a file
!head -n 2 train_context_only.jsonl

print("\n--- Verifying 'train_exp_ans.jsonl' ---")
!head -n 2 train_exp_ans.jsonl

/Users/francescodangolo/Desktop/CS 421 - Natural Language Processing/Research Project/qa-with-conflicting-context/data/splits
-rw-r--r--  1 francescodangolo  staff   624K Nov  7 17:38 dev_context_only.jsonl
-rw-r--r--  1 francescodangolo  staff   700K Nov  7 17:38 dev_exp_ans.jsonl
-rw-r--r--  1 francescodangolo  staff   1.3M Nov  7 17:38 test_context_only.jsonl
-rw-r--r--  1 francescodangolo  staff   1.4M Nov  7 17:38 test_exp_ans.jsonl
-rw-r--r--  1 francescodangolo  staff   592K Nov  7 17:38 train_context_only.jsonl
-rw-r--r--  1 francescodangolo  staff   643K Nov  7 17:38 train_exp_ans.jsonl

--- Verifying 'train_context_only.jsonl' ---
{"input": "question: Name the landforms that form the boundaries of the peninsular plateau?\n\ncontexts: May 1, 2017 ... The landforms that form the boundaries of Peninsular Plateauare:The Aravali Range 2. The Vindhyan Range, 3. The Satpura Range, 4. The Western\u00a0...\n---\nThe large Deccan Plateau of the Indian Subcontinent is located between th
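
The shell check above only prints raw lines. A short Python sketch that parses each line can confirm that every record is valid JSON with the expected keys. To stay runnable without the real splits, it writes a hypothetical two-record file to a temporary directory; read_jsonl, the file name, and the sample records are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Parse a .jsonl file into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records in the same {"input", "output"} shape as the notebook's.
sample = [
    {"input": "question: q1\n\ncontexts: c1\n---\nc2", "output": "a1"},
    {"input": "question: q2\n\ncontexts: c3", "output": "a2"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_context_only.jsonl"
    # Write one JSON object per line, as in the saving loop.
    with open(path, "w", encoding="utf-8") as f:
        for item in sample:
            f.write(json.dumps(item) + "\n")
    records = read_jsonl(path)

# Every record should expose exactly the two expected keys.
assert all(set(r) == {"input", "output"} for r in records)
print(f"Parsed {len(records)} records")
```

Pointing read_jsonl at one of the real files, such as train_context_only.jsonl, would apply the same check to the generated splits.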