In [59]:
import os
from openai import OpenAI
import pandas as pd
import tqdm

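# Read the key from the environment instead of hardcoding it in the notebook
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))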



In [60]:
def prepare_prompt_question(fact):

    prompt = f"""QUESTION GENERATOR PROMPT

You are a question generator. Given an atomic fact, generate a question that can be answered solely using the information in that fact.

IMPORTANT CONTEXT: All atomic facts are NEWS EVENTS - surprising, tragic, or unprecedented occurrences. These are not routine announcements or expected publications.

CRITICAL: Do NOT automatically include dates just because they appear in the fact. Even for news events, dates should ONLY be included when necessary to distinguish between multiple similar events.

REQUIREMENTS

1. SELF-CONTAINED
- The question must be answerable using ONLY the atomic fact
- No external knowledge, context, or details should be needed
- If someone reads only the fact, they should be able to answer the question completely

2. SPECIFIC AND UNAMBIGUOUS
- Include enough detail to prevent confusion with other events
- Use specific names, locations, or numbers from the fact
- Ensure the question cannot be mistaken for asking about generic information or previous occurrences
- The question should clearly reference the unique circumstances in the fact

3. STRATEGIC USE OF DATES FOR NEWS EVENTS (CRITICAL REQUIREMENT)

BEFORE including any date, ask yourself this question:
"Would someone confuse this with a DIFFERENT similar news event if I removed the date?"

For news events, apply this framework:

INCLUDE dates for:
✓ Recurring natural disasters from active locations (earthquakes in California, hurricanes in Florida, Mount Etna eruptions)
✓ Ongoing crisis updates where numbers/status change (death tolls, evacuations, "as of [date]" statements)
✓ Periodic statistics in news (monthly job reports, quarterly GDP, weekly unemployment)
✓ Multiple similar incidents in the same area/type (shootings, accidents, protests)
✓ Breaking developments in ongoing stories where the same entity makes multiple announcements

DO NOT include dates for:
✗ First-time unprecedented events (specific person's unique controversial action, novel policy announcements)
✗ Unique lawsuits (specific plaintiff + specific defendant + specific claim = already unique)
✗ One-time product/feature launches (specific platform + specific feature name = already unique)
✗ Singular appointments or resignations (specific person to specific position = already unique)
✗ Book releases about news events (specific author + specific title = already unique, even if responding to current events)
✗ Single-occurrence tragedies with unique circumstances (specific location + specific type of incident = already unique)

KEY PRINCIPLE FOR NEWS:
Even surprising news can be unique without dates. Ask: "Has this SPECIFIC combination of people/places/actions happened before or could happen again?" If no, omit the date.

4. DIRECT ANSWER
- Provide the answer after generating the question
- The answer must be directly extractable from the fact
- Do not infer or add information not present in the fact
- Provide complete, detailed answers that include relevant context from the fact
- Match the temporal specificity of the question (if question has no date, minimize date emphasis in answer)

OUTPUT FORMAT
```
<Question>: [your question here]

<Answer>: [answer derived from the fact]
```

EXAMPLES

EXAMPLE 1: Date Required - Recurring Natural Disaster Location
ATOMIC FACT: "The Italian Institute of Geophysics and Volcanology reported strong ash emissions from Mount Etna's summit craters on August 25, 2025, prompting authorities to raise the highest alert level for air traffic."

DECISION PROCESS:
- Mount Etna is a highly active volcano with frequent eruptions
- Sicily experiences multiple volcanic events from Etna each year
- Without date: "Mount Etna eruption" could refer to dozens of events
- This is recurring news from the same location
- DATE IS REQUIRED

```
<Question>: "What prompted authorities to raise the highest alert level for air traffic on August 25, 2025?"

<Answer>: "Authorities raised the highest alert level for air traffic on August 25, 2025, due to strong ash emissions reported from Mount Etna's summit craters by the Italian Institute of Geophysics and Volcanology."
```

EXAMPLE 2: Date Required - Ongoing Crisis with Changing Numbers
ATOMIC FACT: "As of August 26, 2025, Catania's international airport remains open, but officials recommend that travelers check with airlines regarding potential disruptions due to the volcanic eruption."

DECISION PROCESS:
- This is an ongoing volcanic crisis
- Airport status changes day by day
- "As of" explicitly signals temporal dependency
- The same airport could have different statuses on different days during the crisis
- DATE IS REQUIRED

```
<Question>: "What do officials recommend travelers do regarding Catania's international airport as of August 26, 2025?"

<Answer>: "Officials recommend that travelers check with airlines regarding potential disruptions due to the volcanic eruption, even though Catania's international airport remains open as of August 26, 2025."
```

EXAMPLE 3: Date Required - Periodic Financial News
ATOMIC FACT: "BlackRock's iShares Bitcoin Trust ETF took in more than $1.3 billion in fresh cash during the last week of June 2025, contributing to a total of over $4 billion absorbed by all U.S. spot Bitcoin ETFs that month."

DECISION PROCESS:
- ETF inflows are reported continuously as financial news
- The same ETF has different inflow amounts every week/month
- This is periodic statistics
- Without date: impossible to know which reporting period
- DATE IS REQUIRED

```
<Question>: "How much did BlackRock's iShares Bitcoin Trust ETF take in during the last week of June 2025?"

<Answer>: "BlackRock's iShares Bitcoin Trust ETF took in more than $1.3 billion in fresh cash during the last week of June 2025."
```

EXAMPLE 4: Date NOT Required - Unprecedented Policy Action
ATOMIC FACT: "Dr. Vinay Prasad, head of the Center for Biologics Evaluation and Research at the FDA, issued a memo arguing against the broad availability of Covid vaccines, stating that he felt 'differently about certain aspects' of the FDA reviewers' conclusions."

DECISION PROCESS:
- This is a specific person in a specific role taking a specific stance
- Has THIS specific FDA official issued multiple contradictory memos about Covid vaccines? Unlikely
- The combination of: specific person + specific title + specific controversial position = unique event
- This is unprecedented/surprising, but the details themselves make it unique
- Would removing the date create confusion? NO - this specific action by this specific person is identifiable
- DATE IS NOT REQUIRED

```
<Question>: "What position did Dr. Vinay Prasad, head of the Center for Biologics Evaluation and Research at the FDA, take regarding the broad availability of Covid vaccines?"

<Answer>: "Dr. Vinay Prasad, head of the Center for Biologics Evaluation and Research at the FDA, argued against the broad availability of Covid vaccines, stating that he felt 'differently about certain aspects' of the FDA reviewers' conclusions."
```

WRONG VERSION (unnecessarily includes date):
```
<Question>: "What position did Dr. Vinay Prasad take regarding Covid vaccines on July 9, 2025?"
WHY WRONG: The date adds no value. This specific person's specific action is already unique.
```

EXAMPLE 5: Date NOT Required - Unique Lawsuit
ATOMIC FACT: "The American Academy of Pediatrics, along with several doctors, filed a lawsuit against the Department of Health and Human Services on July 23, 2025, claiming that RFK Jr.'s directive restricting Covid vaccine access was 'arbitrary and capricious' and caused significant harm to doctors trying to administer the vaccine."

DECISION PROCESS:
- Specific plaintiff (AAP) + specific defendant (HHS) + specific claim (RFK Jr.'s directive)
- Has the AAP filed multiple lawsuits against HHS about RFK Jr.'s vaccine directives? Highly unlikely
- This is surprising/unprecedented news, but the combination of parties and claim is unique
- Would removing the date create confusion? NO - these specific parties with this specific claim are identifiable
- DATE IS NOT REQUIRED

```
<Question>: "What did the American Academy of Pediatrics claim in their lawsuit against the Department of Health and Human Services regarding RFK Jr.'s directive?"

<Answer>: "The American Academy of Pediatrics, along with several doctors, claimed that RFK Jr.'s directive restricting Covid vaccine access was 'arbitrary and capricious' and caused significant harm to doctors trying to administer the vaccine."
```

WRONG VERSION (unnecessarily includes date):
```
<Question>: "What did the AAP claim in their lawsuit filed on July 23, 2025?"
WHY WRONG: The specific parties and specific claim already make this unique.
```

EXAMPLE 6: Date NOT Required - Technology Feature Launch
ATOMIC FACT: "XRP's ledger, XRPL, activated an automated market maker feature on July 7, 2025, allowing any wallet to convert assets on-chain and earn fees from idle balances, targeting institutional investors."

DECISION PROCESS:
- Specific platform (XRPL) + specific feature (automated market maker)
- Has XRPL activated multiple "automated market maker features"? No - this is a one-time feature launch
- This is surprising/significant tech news, but the specific feature name makes it unique
- Would removing the date create confusion? NO - "XRPL's AMM feature activation" is identifiable
- DATE IS NOT REQUIRED

```
<Question>: "What feature did XRP's ledger (XRPL) activate to allow wallets to convert assets on-chain and earn fees?"

<Answer>: "XRP's ledger (XRPL) activated an automated market maker feature, allowing any wallet to convert assets on-chain and earn fees from idle balances, targeting institutional investors."
```

WRONG VERSION (unnecessarily includes date):
```
<Question>: "What feature did XRPL activate on July 7, 2025?"
WHY WRONG: The specific feature name already makes this unique.
```

EXAMPLE 7: Date NOT Required - Book Publication (Even About Current Events)
ATOMIC FACT: "Dr. Sheryl Gonzalez Ziegler, Psy.D., released her book titled 'The Crucial Years: The Essential Guide to Mental Health and Modern Puberty in Middle Childhood' on July 13, 2025, focusing on child development during the ages of 6-12."

DECISION PROCESS:
- Specific author + specific unique book title
- Has this author released multiple books with this exact title? No
- Even if this book responds to current events, the title itself is unique
- Would removing the date create confusion? NO - the title identifies it
- DATE IS NOT REQUIRED

```
<Question>: "What age range does Dr. Sheryl Gonzalez Ziegler's book 'The Crucial Years: The Essential Guide to Mental Health and Modern Puberty in Middle Childhood' focus on?"

<Answer>: "Dr. Sheryl Gonzalez Ziegler's book 'The Crucial Years: The Essential Guide to Mental Health and Modern Puberty in Middle Childhood' focuses on child development during the ages of 6-12."
```

QUALITY CHECKLIST

Before submitting, verify:
□ Can this be answered using ONLY the atomic fact?
□ Does it contain enough specificity to distinguish from similar events?
□ Is the answer explicitly stated or clearly derivable from the fact?
□ Does the answer include sufficient detail and context from the fact?
□ Would someone reading just the fact understand what is being asked?

TEMPORAL MARKER DECISION (Apply systematically):
□ Is this a recurring type of event from the same location/entity? → Include date
□ Is this an ongoing situation with changing numbers/status? → Include date
□ Is this periodic statistical news? → Include date
□ Is the combination of people/places/actions already unique? → Omit date
□ Would this SPECIFIC event be confused with another similar event without the date? → If NO, omit date

Now generate your question and answer.

Input Atomic Fact: {fact}

### OUTPUT:
"""

    return prompt
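
The date rule in the prompt above can be spot-checked after generation. The sketch below is not part of the original notebook: `contains_date` is a hypothetical helper that flags text carrying a month-name date (the style used in the prompt's own examples), so that dated questions can be reviewed against the INCLUDE / DO NOT include lists. It is a rough filter and will miss other date formats such as `2025-07-07`.

```python
import re

# Hypothetical helper, not from the notebook: matches dates written like
# "August 25, 2025" or "June 2025", the style used in the prompt's examples.
_MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
_DATE_RE = re.compile(rf"\b(?:{_MONTHS})(?:\s+\d{{1,2}},?)?\s+\d{{4}}\b")


def contains_date(text: str) -> bool:
    """Return True if the text contains a month-name date such as 'July 7, 2025'."""
    return _DATE_RE.search(text) is not None


# Two questions taken from the prompt's examples: one dated, one not.
dated = "What prompted authorities to raise the highest alert level for air traffic on August 25, 2025?"
undated = "What feature did XRP's ledger (XRPL) activate to allow wallets to convert assets on-chain and earn fees?"
print(contains_date(dated), contains_date(undated))
```

Applied to the generated pairs, the share of questions for which `contains_date` is true gives a quick sense of how often the model kept a date, per category.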

In [61]:
import json
# with open("extracted_news/atomic_facts_pretty.json" , "r") as f:
#     data = json.load(f)



# Read every JSON file in the extracted_facts folder; the category key is parsed from the file name

data = {}
for filename in os.listdir("extracted_facts"):
    if filename.endswith(".json"):
        with open(os.path.join("extracted_facts", filename), "r") as f:
            category = filename.split("atomic_facts_yahoo_")[1].split("_")[0]
            data[category] = json.load(f)


data.keys()



dict_keys(['unprecented', 'earthquake', 'discovery', 'emergency', 'sinkhole', 'abrupt', 'collapse', 'outbreak', 'strike', 'erupt', 'crash'])
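
The category key is whatever sits between the `atomic_facts_yahoo_` prefix and the next underscore in the file name. A minimal sketch of that parsing as a standalone function follows. The name `category_from_filename` and the sample file names are hypothetical (the notebook does not show the real file names), and two behaviours are additions rather than the original logic: non-matching names return `None` instead of raising `IndexError`, and the `.json` extension is removed before splitting.

```python
from typing import Optional

PREFIX = "atomic_facts_yahoo_"  # prefix used in the notebook's split() call


def category_from_filename(filename: str) -> Optional[str]:
    """Return the category embedded in a facts file name, or None if absent."""
    if not filename.endswith(".json") or PREFIX not in filename:
        return None
    stem = filename[: -len(".json")]          # drop the extension first
    remainder = stem.split(PREFIX, 1)[1]      # text after the prefix
    return remainder.split("_")[0]            # up to the next underscore


# Hypothetical file names, for illustration only.
for name in ["atomic_facts_yahoo_earthquake_2025.json", "notes.json"]:
    print(name, "->", category_from_filename(name))
```

Skipping files that return `None` keeps an unrelated JSON file in the folder from aborting the load.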

In [62]:
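def extract_qa(response):
    """Return (question, answer) parsed from the first fenced block, or (None, None)."""
    try:
        # The prompt asks for a fenced block holding the question and the
        # answer separated by a blank line
        parts = response.split("```")[1].strip().split("\n\n")
        question = parts[0].replace("<Question>:", "").strip()
        answer = parts[1].replace("<Answer>:", "").strip()
        return question, answer
    except (IndexError, AttributeError) as e:
        # IndexError: fence or blank line missing; AttributeError: response is None
        print("Error parsing response:", e)
        return None, None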
    

In [63]:
result = []
for category, each_data in data.items():
    for fact in tqdm.tqdm(each_data):
        # 'Atomic_Facts' holds a JSON-encoded list of fact strings
        for each_fact in json.loads(fact['Atomic_Facts']):
            # Remove code fences so they cannot break the fence-based parsing in extract_qa
            each_fact = each_fact.replace("```", "")
            prompt = prepare_prompt_question(each_fact)
            response = client.responses.create(
                model="gpt-4o-mini",
                instructions="You are a helpful assistant",
                input=prompt,
                max_output_tokens=500,
                temperature=0.1)
            q, a = extract_qa(response.output_text)
            if q and a:
                result.append({
                    "category": category,
                    "fact": each_fact,
                    "question": q,
                    "answer": a
                })
    tqdm.tqdm.write(f"Finished processing category: {category}")

# Create the output folder if it is missing; open() would otherwise raise FileNotFoundError
os.makedirs("extracted_QA", exist_ok=True)
with open("extracted_QA/qa_pairs.json", "w") as f:
    json.dump(result, f, indent=2)



100%|██████████| 166/166 [12:43<00:00,  4.60s/it]


Finished processing category: unprecented


100%|██████████| 172/172 [12:25<00:00,  4.33s/it]


Finished processing category: earthquake


100%|██████████| 175/175 [13:25<00:00,  4.60s/it]


Finished processing category: discovery


100%|██████████| 174/174 [13:57<00:00,  4.81s/it]


Finished processing category: emergency


100%|██████████| 50/50 [04:16<00:00,  5.13s/it]


Finished processing category: sinkhole


100%|██████████| 183/183 [13:48<00:00,  4.52s/it]


Finished processing category: abrupt


100%|██████████| 172/172 [13:49<00:00,  4.82s/it]


Finished processing category: collapse


100%|██████████| 172/172 [12:26<00:00,  4.34s/it]


Finished processing category: outbreak


100%|██████████| 174/174 [15:48<00:00,  5.45s/it]


Finished processing category: strike


100%|██████████| 38/38 [03:54<00:00,  6.18s/it]


Finished processing category: erupt


100%|██████████| 174/174 [12:45<00:00,  4.40s/it]

Finished processing category: crash
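
The run above makes its API calls sequentially for roughly two hours and writes `qa_pairs.json` only at the end, so a single uncaught API error would discard everything collected so far. One way to reduce that risk is to retry failed calls with exponential backoff. The sketch below is an assumption about how this could be done, not part of the original notebook: `call_with_retry` is a hypothetical helper, it retries on any exception (a real run would likely narrow this to the client's transient error types), and the `sleep` parameter exists only so the demo runs instantly.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on any exception with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay * 1, 2, 4, ... with a little random jitter
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            sleep(delay)
    raise RuntimeError("unreachable")


# Demo with a stand-in function that fails twice before succeeding.
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(call_with_retry(flaky_call, sleep=lambda seconds: None), attempts["count"])
```

In the generation loop, the `client.responses.create` call could be wrapped in a zero-argument lambda and passed to `call_with_retry`. Writing `result` to disk after each category would further limit what a failure can lose.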



