# JSON File Format Description

The generated JSON file is a list of question-answer pairs. Each pair is an object with the attributes shown below.

## Data Format

```json
[
  {
    "id": "<unique_identifier>",
    "Headline": "<headline_text>",
    "Question": "<question_text>",
    "Answer": "<answer_text>"
  },
  ...
]
```

## Attribute Descriptions

- **id**:  
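  A unique identifier for each question-answer pair, formatted as `{row_number}-{qa_number}`. In the script, `row_number` is the DataFrame row index plus one, and `qa_number` is the running count of pairs extracted so far across the whole dataset, so it does not restart for each row.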

- **Headline**:  
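  The text that precedes the question in the source row. For the first question in a row this is everything before the first `Does`; for later questions it is the text between the previous answer and the next `Does`.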

- **Question**:  
  The extracted question: the text that starts with `Does` and ends with the first following `?` on the same line.

- **Answer**:  
  The answer to the extracted question: `Yes` or `No` if either appears within the 30 characters after the question mark (the last occurrence is used), otherwise `Not Given`.
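
The attribute rules above can be checked mechanically. The following is a minimal sketch, not part of the original script: the `sample` records, the `validate_entry` helper, and the constant names are hypothetical and only illustrate the documented shape.

```python
import re

# Hypothetical sample records in the documented shape (not taken from the real dataset).
sample = [
    {"id": "1-1", "Headline": "Gold prices edge higher",
     "Question": "Does the headline mention gold?", "Answer": "Yes"},
    {"id": "1-2", "Headline": "",
     "Question": "Does the headline mention silver?", "Answer": "Not Given"},
]

REQUIRED_KEYS = {"id", "Headline", "Question", "Answer"}
ALLOWED_ANSWERS = {"Yes", "No", "Not Given"}
ID_PATTERN = re.compile(r"\d+-\d+")  # {row_number}-{qa_number}


def validate_entry(entry):
    """Return a list of problems found in one entry (empty if it looks valid)."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    if not ID_PATTERN.fullmatch(entry["id"]):
        problems.append(f"unexpected id format: {entry['id']!r}")
    question = entry["Question"]
    if not (question.startswith("Does") and question.endswith("?")):
        problems.append("question does not start with 'Does' and end with '?'")
    if entry["Answer"] not in ALLOWED_ANSWERS:
        problems.append(f"unexpected answer: {entry['Answer']!r}")
    return problems


for entry in sample:
    print(entry["id"], validate_entry(entry) or "ok")
```

To check a real output file, the same helper could be applied to the list returned by `json.load`.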


```python
import re
import json
import pandas as pd
import time

# Record start time
start_time = time.time()

# Load the dataset
df = pd.read_json("hf://datasets/AdaptLLM/finance-tasks/Headline/test.json")

# Store the extracted results
output_data = []

# Regular expression to match questions that start with "Does" and end with "?",
# and capture up to 30 characters after the question mark to search for "Yes" or "No"
qa_pattern = re.compile(r'(Does.*?\?)([\s\S]{0,30})(Yes|No)?')


# Process each row of the dataset
for idx, row in df.iterrows():
    input_text = row['input']

    while True:
        # Find the question that starts with "Does" and capture the text before "Does" as the Headline
        does_pos = input_text.find('Does')
        if does_pos == -1:
            break  # If "Does" is not found, exit the loop

        headline = input_text[:does_pos].strip()  # Extract the text before "Does" as Headline
        input_text = input_text[does_pos:]  # Process the remaining part starting from "Does"

        # Match the Question and possible Answer
        match = qa_pattern.search(input_text)
        if match:
            question = match[1].strip()  # Extract the matched Question

            # Find all possible answers (Yes/No)
            possible_answers = list(re.finditer(r'(Yes|No)', match[2], re.IGNORECASE))

            if possible_answers:
                last_match = possible_answers[-1]
                answer = last_match.group().capitalize()  # Get the last matched answer and capitalize the first letter
                last_index = last_match.end()
                input_text = match[2][last_index:].strip() + input_text[match.end():].strip()  # Append the remaining part back to input_text
            else:
                answer = "Not Given"  # If no Yes or No is found
                input_text = input_text[match.end():].strip()

            # Generate a JSON entry
            entry = {
                "id": f"{idx + 1}-{len(output_data) + 1}",
                "Headline": headline,
                "Question": question,
                "Answer": answer
            }
            output_data.append(entry)

        else:
            break  # If no question-answer pair is found, end processing of this row

# Save the results as a JSON file
with open('reformatted_data.json', 'w') as f:
    json.dump(output_data, f, indent=4)

# Record end time
end_time = time.time()

# Calculate and print the total runtime
total_time = end_time - start_time
print(f"Total time taken for dataset cleanup and transformation: {total_time:.2f} seconds")

# Print the total number of extracted question-answer pairs
print(f"Total number of question-answer pairs: {len(output_data)}")
```

```text
Total time taken for dataset cleanup and transformation: 7.58 seconds
Total number of question-answer pairs: 123282
```
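
To see how the extraction loop behaves on small inputs, the per-row logic can be isolated in a function and run on toy strings. This is a sketch under stated assumptions: `extract_pairs` is a hypothetical helper that mirrors the loop in the script, and the three input strings are invented for illustration and do not come from the dataset.

```python
import re

# Same pattern as in the script.
qa_pattern = re.compile(r'(Does.*?\?)([\s\S]{0,30})(Yes|No)?')


def extract_pairs(text):
    """Mirror the per-row loop: return (headline, question, answer) tuples."""
    pairs = []
    while True:
        does_pos = text.find('Does')
        if does_pos == -1:
            break
        headline = text[:does_pos].strip()
        text = text[does_pos:]
        match = qa_pattern.search(text)
        if not match:
            break
        question = match[1].strip()
        # Search the (up to) 30 characters after the question mark.
        answers = list(re.finditer(r'(Yes|No)', match[2], re.IGNORECASE))
        if answers:
            last = answers[-1]
            answer = last.group().capitalize()
            text = match[2][last.end():].strip() + text[match.end():].strip()
        else:
            answer = "Not Given"
            text = text[match.end():].strip()
        pairs.append((headline, question, answer))
    return pairs


# Invented inputs, one question each.
answered = extract_pairs("Gold rallies. Does the headline mention gold? Yes")
unanswered = extract_pairs("Silver flat. Does the headline mention silver?")
substring = extract_pairs("Oil dips. Does the headline mention oil? Nobody knows")

print(answered)
print(unanswered)
print(substring)
```

The third input is worth noting: because `Yes` and `No` are matched case-insensitively and without word boundaries, the `no` inside `knows` is taken as the answer. Whether that matters depends on what actually follows the question mark in the source rows.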
