# 📓 Generating High-Credibility Synthetic Healthcare Claims

This notebook builds synthetic healthcare claims and uses the instruction-tuned LaMini-Flan-T5-248M model to generate an explanation for each one. These synthetic samples supplement real-world data: adding more credible (true) examples helps correct label imbalance and improve classifier robustness.

## ✅ 1. Imports & Model Pipeline Setup

We import required libraries and load the LaMini-Flan-T5-248M model via Hugging Face's `pipeline`. This lightweight encoder-decoder transformer is instruction-tuned for general reasoning tasks.

In [None]:
!pip install transformers --quiet

from transformers import pipeline
import pandas as pd
import random
from tqdm import tqdm
import re

# Load LaMini-Flan reasoning model
model_name = "MBZUAI/LaMini-Flan-T5-248M"
reasoning_pipeline = pipeline("text2text-generation", model=model_name, device=-1)


Device set to use cpu
The model 'T5ForConditionalGeneration' is not supported for text-generation. Supported models are ['AriaTextForCausalLM', 'BambaForCausalLM', 'BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'DiffLlamaForCausalLM', 'ElectraForCausalLM', 'Emu3ForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FalconMambaForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'Gemma3ForConditionalGeneration', 'Gemma3ForCausalLM', 'GitForCausalLM', 'GlmForCausalLM', 'GotOcr2ForConditionalGeneration', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeo

## 🌀 2. Generate Synthetic Claims in Batch

We build 100 prompts for the model: 10 hand-written claims plus 90 templated claims assembled from randomly chosen topic, action, and outcome phrases. Each prompt is a plausible, fact-like healthcare claim, although not guaranteed to be medically verified.

In [None]:
# Create 100 prompts for the model to expand on
base_prompts = [
    "The claim is that exercise helps prevent heart disease.",
    "The claim is that vaccines are effective at preventing illness.",
    "The claim is that meditation reduces anxiety.",
    "The claim is that regular sleep improves brain function.",
    "The claim is that sunscreen prevents skin cancer.",
    "The claim is that fiber helps with digestion.",
    "The claim is that drinking water supports kidney health.",
    "The claim is that low sodium intake benefits blood pressure.",
    "The claim is that probiotics support gut health.",
    "The claim is that dental hygiene impacts heart health.",
    # Repeat and randomly vary structure
]
# Pad out to 100 with variations
while len(base_prompts) < 100:
    health_topic = random.choice([
        "exercise", "hydration", "mental health", "nutrition",
        "disease prevention", "vaccination", "chronic illness", "cancer prevention"
    ])
    action = random.choice([
        "helps with", "is important for", "is linked to", "is known to reduce", "supports"
    ])
    outcome = random.choice([
        "heart health", "reduced stress", "stronger immunity", "lower cancer risk",
        "lower blood pressure", "improved sleep", "gut health"
    ])
    prompt = f"The claim is that {health_topic} {action} {outcome}."
    base_prompts.append(prompt)


After installing and importing the required packages and loading the model, we build a list of 100 prompts for the reasoning model to expand upon.

1. **Expanding the List to 100 Prompts**:
   - The `while` loop keeps generating prompts until the total reaches 100. Each new prompt combines a randomly chosen health topic, action, and outcome from predefined lists to vary the structure of the claims.

2. **Result**:
   - The final list `base_prompts` contains 100 prompts that can be used to generate explanations and assess the reasoning model's performance. Because each part is drawn independently with `random.choice`, the list is not guaranteed to be free of duplicates.
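If distinct prompts and a repeatable run are wanted, one option is to enumerate every topic/action/outcome combination and sample from that pool without replacement. The following is a minimal sketch under that assumption, not the notebook's own method; `build_unique_prompts` and its `seed` parameter are hypothetical names introduced here.

```python
import itertools
import random

TOPICS = ["exercise", "hydration", "mental health", "nutrition",
          "disease prevention", "vaccination", "chronic illness", "cancer prevention"]
ACTIONS = ["helps with", "is important for", "is linked to", "is known to reduce", "supports"]
OUTCOMES = ["heart health", "reduced stress", "stronger immunity", "lower cancer risk",
            "lower blood pressure", "improved sleep", "gut health"]


def build_unique_prompts(seed_prompts, total=100, seed=42):
    """Pad seed_prompts to `total` entries with distinct templated claims.

    Hypothetical helper: the name and the seed default are assumptions.
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible
    # All 8 * 5 * 7 = 280 topic/action/outcome combinations.
    candidates = [
        f"The claim is that {t} {a} {o}."
        for t, a, o in itertools.product(TOPICS, ACTIONS, OUTCOMES)
    ]
    # Drop any candidate that duplicates a hand-written prompt.
    existing = set(seed_prompts)
    candidates = [c for c in candidates if c not in existing]
    needed = max(0, total - len(seed_prompts))
    # random.sample draws without replacement, so no templated prompt repeats.
    return list(seed_prompts) + rng.sample(candidates, needed)


prompts = build_unique_prompts(
    ["The claim is that exercise helps prevent heart disease."]
)
```

Some combinations in the pool read as the opposite of a credible claim (for example, "is known to reduce" paired with "heart health"), so a filtering step before generation may be worth considering.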


## 💬 3. Generate Natural Language Explanations
We pass each claim, together with a fixed credibility score of 90%, into LaMini-Flan using the prompt:

```
Claim: {claim}
Credibility: {score}%
Explain why this claim is likely accurate.
```

Each explanation aims to justify the claim using scientific-sounding reasoning based on the model's pretrained knowledge.
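As a small illustration, the template can be filled by a helper function. This is a sketch only: `build_explanation_prompt` is a hypothetical name, and its default of 90 simply mirrors the fixed value used in this notebook.

```python
def build_explanation_prompt(claim: str, score: int = 90) -> str:
    """Fill the Claim / Credibility / instruction template for one claim."""
    return (
        f"Claim: {claim}\n"
        f"Credibility: {score}%\n"
        "Explain why this claim is likely accurate."
    )


example = build_explanation_prompt("sunscreen prevents skin cancer", score=90)
print(example)
```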

In [None]:
# Generate explanations
claims = []
explanations = []

print("🧠 Generating explanations for synthetic claims...")
for claim_text in tqdm(base_prompts):
    # Strip down to clean claim for CSV
    clean_claim = claim_text.replace("The claim is that ", "").strip().rstrip(".")

    # Make the prompt explicit for explanation
    prompt = f"Claim: {clean_claim}\nCredibility: 90%\nExplain why this claim is likely accurate."

    output = reasoning_pipeline(prompt, max_length=150, num_return_sequences=1)[0]["generated_text"]

    claims.append(clean_claim)
    explanations.append(output.strip())


🧠 Generating explanations for synthetic claims...


100%|██████████| 100/100 [04:12<00:00,  2.52s/it]


The cell above generates an explanation for each synthetic claim using the reasoning model.

1. **Generating Explanations**:
   - A loop iterates through each claim in `base_prompts`, cleaning the claim by removing the prefix "The claim is that" and stripping any trailing periods.

2. **Creating Prompts**:
   - For each cleaned claim, a prompt is created with a fixed credibility of 90% and the instruction to explain why the claim is likely accurate.

3. **Model Inference**:
   - The `reasoning_pipeline` is used to generate an explanation for each claim. The explanation is returned as the model's output and is appended to the `explanations` list.

4. **Storing Results**:
   - Both the cleaned claims and their corresponding explanations are stored in separate lists, `claims` and `explanations`, which can be used for further evaluation or saving to a file.
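The four steps above can also be sketched as a function that accepts any pipeline-like callable, which makes the loop testable without downloading the model. `explain_claims`, `clean_claim`, and `fake_generate` are hypothetical names; the stub only imitates the return shape of `reasoning_pipeline` (a list of dicts with a `"generated_text"` key).

```python
def clean_claim(text: str) -> str:
    """Drop the 'The claim is that ' prefix and any trailing periods."""
    return text.replace("The claim is that ", "").strip().rstrip(".")


def explain_claims(raw_prompts, generate):
    """Return (claims, explanations) using any pipeline-like callable."""
    claims, explanations = [], []
    for raw in raw_prompts:
        claim = clean_claim(raw)
        prompt = (
            f"Claim: {claim}\nCredibility: 90%\n"
            "Explain why this claim is likely accurate."
        )
        # Assumes the callable returns a list of dicts with a
        # "generated_text" key, as the loop in this notebook does.
        output = generate(prompt)[0]["generated_text"]
        claims.append(claim)
        explanations.append(output.strip())
    return claims, explanations


def fake_generate(prompt):
    """Stand-in for reasoning_pipeline; echoes the first prompt line."""
    first_line = prompt.splitlines()[0]
    return [{"generated_text": f" Stub explanation for: {first_line} "}]


claims, explanations = explain_claims(
    ["The claim is that fiber helps with digestion."], fake_generate
)
```

To run it against the real model, a wrapper such as `lambda p: reasoning_pipeline(p, max_length=150, num_return_sequences=1)` could be passed as `generate`.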


## 🧮 4. Assign High Credibility Scores (Simulated)


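Since all claims are intended to be credible, we simulate a score for each one by drawing a random integer between 71 and 95 (inclusive), mimicking the distribution of credible examples used during classifier training. These stored scores are independent of the fixed 90% value used in the explanation prompt.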

In [None]:
# Assign credibility scores and clean prompts
clean_claims = [c.replace("The claim is that ", "").strip().rstrip(".") for c in claims]
scores = [random.randint(71, 95) for _ in range(len(clean_claims))]

The list comprehension removes the prefix "The claim is that" and any trailing periods. The claims were already cleaned in the previous step, so this pass leaves them unchanged and acts only as a safeguard. The cleaned claims and their associated scores are stored in the `clean_claims` and `scores` lists, respectively. These scores simulate varying levels of credibility for each claim.
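The scores change on every run because no seed is set. If a repeatable dataset is preferred, a seeded generator is one way to get it. This is a sketch under that assumption; `simulate_scores` and its `seed` default are hypothetical and not part of the notebook.

```python
import random


def simulate_scores(n, low=71, high=95, seed=0):
    """Draw n integer scores uniformly from [low, high], inclusive."""
    rng = random.Random(seed)  # local RNG, leaves the global state untouched
    return [rng.randint(low, high) for _ in range(n)]


scores = simulate_scores(100)
```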

## 📊 5. Final Dataset Preview

We preview the final set of synthetic samples, each consisting of:

- A claim
- A simulated credibility score
- A generated explanation

In [None]:
df = pd.DataFrame({
    "claim": clean_claims,
    "score": scores,
    "explanation": explanations
})

# Display the first few rows as the preview
df.head()
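Before relying on the table, it can help to confirm that the three columns line up and that every score falls in the simulated range. The check below is a minimal sketch on plain lists; `check_columns` is a hypothetical helper, and the sample row uses placeholder text rather than real model output.

```python
def check_columns(claims, scores, explanations, low=71, high=95):
    """Raise ValueError if the three columns cannot form a clean table."""
    if not (len(claims) == len(scores) == len(explanations)):
        raise ValueError("columns have different lengths")
    for i, (claim, score, text) in enumerate(zip(claims, scores, explanations)):
        if not claim.strip():
            raise ValueError(f"row {i}: empty claim")
        if not low <= score <= high:
            raise ValueError(f"row {i}: score {score} outside {low}-{high}")
        if not text.strip():
            raise ValueError(f"row {i}: empty explanation")
    return True


# Placeholder row for illustration only.
ok = check_columns(["fiber helps with digestion"], [88], ["placeholder explanation"])
```

In the notebook this could be called as `check_columns(clean_claims, scores, explanations)` before building `df`.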


## 💾 6. Save Synthetic Dataset to CSV

We save the cleaned, formatted dataset as `synthetic_claim_explanations.csv`, which the Streamlit app later uses for:

- Demonstration
- Bias mitigation
- Evaluation consistency

In [None]:
df.to_csv("synthetic_claim_explanations.csv", index=False)
print("✅ Saved 100+ synthetic claims to 'synthetic_claim_explanations.csv'")


✅ Saved 100+ synthetic claims to 'synthetic_claim_explanations.csv'
