# **Synthetic FHIR Data Generation for Graph-Based Learning**  

## **Introduction**  
This notebook focuses on generating **synthetic patient data** in **FHIR format**, simulating **family relationships, inherited diseases, and medical conditions**. The output dataset will be structured for **graph-based machine learning applications**, such as **Graph Neural Networks (GNNs)**.  

Key steps in this notebook:  
- **Extracting FHIR data** into **Python dictionaries** grouped by resource type.  
- **Simulating family relationships** and **inherited diseases** based on epidemiological probabilities.  
- **Creating FHIR-compliant resources** such as `Patient`, `FamilyMemberHistory`, `RelatedPerson`, and `Condition`.  
- **Saving processed data** as structured **FHIR JSON files** for further analysis.  

---  

## **List of Contents**  

This notebook is divided into **four major steps**, covering the extraction, simulation, processing, and saving of FHIR data.  

### [**1. Extracting FHIR JSON Data**](#1-extracting-fhir-json-data)
   **Keywords**: `json.load`, `resources`, `patient_data`  
   - Load **FHIR JSON files** containing structured **patient records**.  
   - Extract and categorize resources into **patients, medical conditions, relationships, and family histories**.  
   - Ensure all patient records are stored in **structured dictionaries** for further processing.  

### [**2. Generating Family Relationships and Inherited Diseases**](#2-generating-family-relationships-and-inherited-diseases)
   **Keywords**: `disease_probabilities`, `create_related_person`, `create_family_member_history`, `create_condition_for_related_person`  
   - Define **probability distributions for hereditary diseases** based on epidemiological data.  
   - Generate **FamilyMemberHistory records**, assigning medical conditions to family members probabilistically.  
   - Create **RelatedPerson records**, simulating familial relationships among patients.  

### [**3. Constructing FamilyMemberHistory and RelatedPerson Records**](#3-constructing-familymemberhistory-and-relatedperson-records)  
   **Keywords**: `used_relationships`, `random.sample`, `related_person_id`, `family_member_history`  
   - Assign **synthetic parents and siblings** to each patient with predefined probabilities.  
   - Ensure **no duplicate relationships** exist between individuals.  
   - Append generated relationships and conditions to **FHIR-compliant resources**.  

### [**4. Saving Processed FHIR Data**](#4-saving-processed-fhir-data)  
   **Keywords**: `json.dump`, `output_folder`, `os.path.join`  
   - Save structured **FHIR patient records** into **separate JSON files**.  
   - Ensure **FHIR-compliant formatting** for downstream machine learning applications.  
   - Store the following files in `synthea/output/processed/`:  
     - **`Patient.json`** → Structured patient demographic information.  
     - **`Condition.json`** → Lists of diagnosed conditions.  
     - **`RelatedPerson.json`** → Synthetic family relationships.  
     - **`FamilyMemberHistory.json`** → Inheritance history of diseases.  

---

## **Installation Requirements**  

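The code in this notebook relies only on the Python standard library (`os`, `json`, `random`, `collections`), so nothing has to be installed to run it. Note that `json` ships with Python and is not a `pip` package. If you also want `pandas` and `numpy` available for downstream analysis, install them using:  

```bash
pip install pandas numpy
```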

---

## **Conclusion**  
This notebook **generates and structures synthetic FHIR data**, simulating **real-world healthcare records** for predictive modeling. The processed dataset is stored in **FHIR JSON format**, making it suitable for **graph-based machine learning applications** such as **Graph Neural Networks (GNNs)**.  

In the next step, this dataset will be used to **train a GNN model** to predict **medical conditions based on family relationships**. 

# 1. Extracting FHIR JSON Data

In [None]:
import os
import json
import random
from collections import defaultdict

# **Input folder with raw Synthea FHIR bundles and output folder for processed files**
data_folder = "synthea/output/fhir"
output_folder = "synthea/output/processed/"
os.makedirs(output_folder, exist_ok=True)

# **Collect all JSON files in the folder**
json_files = [f for f in os.listdir(data_folder) if f.endswith(".json")]

# **Dictionary to store resources by type**
resources = defaultdict(list)

# **Read all files and separate resources by type**
for file in json_files:
    file_path = os.path.join(data_folder, file)
    print(f"Reading file: {file}")

    with open(file_path, "r", encoding="utf-8") as f:
        try:
            data = json.load(f)

            # If JSON is a FHIR Bundle, check for `entry`
            if "entry" in data:
                for entry in data["entry"]:
                    resource = entry["resource"]
                    resource_type = resource["resourceType"]
                    resources[resource_type].append(resource)
            else:
                # If JSON contains a single resource
                resource_type = data["resourceType"]
                resources[resource_type].append(data)

        except json.JSONDecodeError as e:
            print(f"Error reading JSON in {file}: {e}")

# **Retrieve the list of patients**
patients = resources["Patient"]
patient_ids = [p["id"] for p in patients]
patient_data = {p["id"]: p for p in patients}  # Store patient details

print(f"\nTotal number of patients found: {len(patients)}")


Reading file: Abbie917_Abshire638_12f6dd3f-9c76-02ac-540f-8a17f3a6fbf5.json
Reading file: Abbie917_Maggio310_9c87fb9d-f2a6-3149-128a-2a80bd17089c.json
Reading file: Abe604_Rosenbaum794_d7e1e837-68df-de63-a981-10012d427bce.json
Reading file: Adalberto916_Donnelly343_c1a091f1-de53-7b04-20d4-a5dff9ecbfdb.json
Reading file: Adell482_Kuhn96_4de6e038-9511-5a10-fd6a-3fdc7be5512a.json
Reading file: Adell482_Runolfsdottir785_7da148be-b73e-73e3-ed5c-67d7c712a253.json
Reading file: Adria871_Barrows492_6c77c078-4d96-cb2e-9fbe-e52d0ff80b01.json
Reading file: Agnes294_Zoraida650_Daniel959_6fb62c52-b1c7-3126-0764-1ecfc9478136.json
Reading file: Agustin437_Hansen121_e04eb832-fca1-bd64-e376-c2ebabb4a219.json
Reading file: Agustina460_Lueilwitz711_92a61ce4-403b-071c-2281-8f3a75a3319c.json
Reading file: Akiko835_Kerstin790_VonRueden376_12283fb4-d5e1-4ce2-915b-5ce33e1ae56a.json
Reading file: Alan320_Labadie908_5e7e0716-43e2-fd57-bede-5b170e6f3cec.json
Reading file: Alberto639_Wolff180_e2a6b647-b2cb-9632-6
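
The grouping step above can also be written as a small reusable function. The following is a minimal sketch, not the notebook's own code: the helper name `group_resources_by_type` and the in-memory `sample_bundle` are hypothetical, and the sketch assumes that Bundle entries without a `resource` key should simply be skipped rather than raise a `KeyError`.

```python
from collections import defaultdict


def group_resources_by_type(document):
    """Group FHIR resources from a Bundle or a single-resource document."""
    grouped = defaultdict(list)

    if document.get("resourceType") == "Bundle":
        # A Bundle wraps each resource in an `entry` element
        candidates = [entry.get("resource") for entry in document.get("entry", [])]
    else:
        # Anything else is treated as one standalone resource
        candidates = [document]

    for resource in candidates:
        # Skip entries that carry no resource or no resourceType
        if isinstance(resource, dict) and "resourceType" in resource:
            grouped[resource["resourceType"]].append(resource)

    return grouped


# Hypothetical in-memory Bundle used only for illustration
sample_bundle = {
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Condition", "id": "c1"}},
        {"request": {"method": "POST"}},  # entry without a resource
    ],
}

grouped = group_resources_by_type(sample_bundle)
print({resource_type: len(items) for resource_type, items in grouped.items()})
```

Checking `resourceType == "Bundle"` instead of the presence of an `entry` key is a design choice in this sketch; it avoids misclassifying a single resource that happens to have a field named `entry`.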

# 2. Generating Family Relationships and Inherited Diseases

In [None]:
# **Probability of Hereditary Diseases Based on Epidemiological Data**
disease_probabilities = {
    "Diabetes": 0.125,
    "Hypertension": 0.35,
    "Cancer": 0.075,
    "Heart Disease": 0.225,
    "Alzheimer": 0.03,
    "Asthma": 0.10
}

# **Function to Create a RelatedPerson Resource**
def create_related_person(patient_id, related_id, relation_code, relation_display):
    """
    Generates a FHIR `RelatedPerson` resource linking a patient to a related individual.
    
    Args:
        patient_id (str): The ID of the primary patient.
        related_id (str): The ID of the related individual.
        relation_code (str): Code representing the type of relationship.
        relation_display (str): Human-readable relationship name.

    Returns:
        dict: A dictionary representing the `RelatedPerson` resource.
    """
    related_patient = patient_data[related_id]
    
    return {
        "resourceType": "RelatedPerson",
        "id": f"urn:uuid:{related_id}",
        "patient": {"reference": f"Patient/{patient_id}"},
        "relationship": [{
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/v3-RoleCode",
                "code": relation_code,
                "display": relation_display
            }],
            "text": relation_display
        }],
        "name": related_patient.get("name", [{"use": "official", "family": "Unknown"}]),
        "gender": related_patient.get("gender", "unknown"),
        "birthDate": related_patient.get("birthDate", "unknown")
    }

# **Function to Create a FamilyMemberHistory Resource**
def create_family_member_history(patient_id, related_id, relation_code, relation_display):
    """
    Generates a FHIR `FamilyMemberHistory` resource indicating inherited medical conditions.
    
    Args:
        patient_id (str): The ID of the primary patient.
        related_id (str): The ID of the related family member.
        relation_code (str): Code representing the type of relationship.
        relation_display (str): Human-readable relationship name.

    Returns:
        tuple: (FamilyMemberHistory resource (dict) or None, list of inherited conditions)
    """
    inherited_conditions = []
    
    # Assign conditions based on predefined probabilities
    for disease, probability in disease_probabilities.items():
        if random.random() <= probability:
            inherited_conditions.append(disease)
    
    if not inherited_conditions:
        return None, []  # No conditions inherited; empty list matches the documented return type

    family_history = {
        "resourceType": "FamilyMemberHistory",
        "id": f"family-{related_id}",  # Not unique if the same person is related to several patients
        "patient": {"reference": f"Patient/{patient_id}"},
        "relationship": {
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/v3-RoleCode",
                "code": relation_code,
                "display": relation_display
            }]
        },
        "condition": [{
            "code": {
                "coding": [{
                    "system": "http://snomed.info/sct",
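                    # NOTE: placeholder code; 22298006 is the SNOMED CT concept for
                    # myocardial infarction and is reused here for every disease
                    "code": "22298006",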
                    "display": disease
                }],
                "text": disease
            }
        } for disease in inherited_conditions]
    }

    return family_history, inherited_conditions

# **Function to Create a Condition Resource for a RelatedPerson**
def create_condition_for_related_person(related_id, disease):
    """
    Generates a FHIR `Condition` resource for a related person.

    Args:
        related_id (str): The ID of the related individual.
        disease (str): The medical condition assigned to the related person.

    Returns:
        dict: A dictionary representing the `Condition` resource.
    """
    return {
        "resourceType": "Condition",
        "id": f"condition-related-{related_id}-{disease.lower().replace(' ', '-')}",
        "subject": {"reference": f"urn:uuid:{related_id}"},
        "code": {
            "coding": [{
                "system": "http://snomed.info/sct",
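                # NOTE: placeholder code; 22298006 is the SNOMED CT concept for
                # myocardial infarction and is reused here for every disease
                "code": "22298006",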
                "display": disease
            }],
            "text": disease
        }
    }
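
Both `create_family_member_history` and `create_condition_for_related_person` attach the same SNOMED CT code to every disease. One way to give each disease its own code is a lookup table, sketched below. The names `SNOMED_CODES` and `build_condition_code` are hypothetical, and the concept IDs are assumptions that should be verified against a SNOMED CT browser before use.

```python
# Assumed SNOMED CT concept IDs for the diseases in `disease_probabilities`.
# Verify each ID against a SNOMED CT browser before relying on it.
SNOMED_CODES = {
    "Diabetes": "73211009",       # Diabetes mellitus
    "Hypertension": "38341003",   # Hypertensive disorder
    "Cancer": "363346000",        # Malignant neoplastic disease
    "Heart Disease": "56265001",  # Heart disease
    "Alzheimer": "26929004",      # Alzheimer's disease
    "Asthma": "195967001",        # Asthma
}


def build_condition_code(disease):
    """Return a CodeableConcept-style dict for a disease name."""
    code = SNOMED_CODES.get(disease)
    if code is None:
        # Unknown disease: fall back to free text only, without a coding
        return {"text": disease}
    return {
        "coding": [{
            "system": "http://snomed.info/sct",
            "code": code,
            "display": disease,
        }],
        "text": disease,
    }


print(build_condition_code("Asthma"))
```

The returned dictionary has the same shape as the `code` element built in the functions above, so it could be substituted there if the mapping is confirmed.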


# 3. Constructing FamilyMemberHistory and RelatedPerson Records

In [None]:
# **Adding `FamilyMemberHistory`, `RelatedPerson`, and `Condition` to the dataset**
used_relationships = set()

for patient in patients:
    patient_id = patient["id"]
    
    # Identify potential relatives (excluding self and already assigned relations)
    possible_relations = [p for p in patient_ids if p != patient_id and (patient_id, p) not in used_relationships]
    if not possible_relations:
        continue

    family_members = []

    # Assign both parents with a 70% probability
    rand_parent = random.random()
    if rand_parent <= 0.7:
        family_members.append(("FTH", "Father"))
        family_members.append(("MTH", "Mother"))
    elif rand_parent <= 0.9:
        # 20% probability of assigning only one parent
        family_members.append(random.choice([("FTH", "Father"), ("MTH", "Mother")]))
    
    # Randomly assign siblings with a 50% probability
    rand_sibling = random.random()
    if rand_sibling <= 0.5:
        num_siblings = random.randint(1, 2)  # Randomly assign 1 or 2 siblings
        for _ in range(num_siblings):
            family_members.append(random.choice([("BRO", "Brother"), ("SIS", "Sister")]))
    
    # Select a subset of available patient IDs as family members
    chosen_family_members = random.sample(possible_relations, min(len(family_members), len(possible_relations)))

    for related_person_id, (relation_code, relation_display) in zip(chosen_family_members, family_members):
        # Ensure the relationship is not duplicated
        if (related_person_id, patient_id) not in used_relationships:
            used_relationships.add((patient_id, related_person_id))
            used_relationships.add((related_person_id, patient_id))

            # Create RelatedPerson and FamilyMemberHistory records
            related_person = create_related_person(patient_id, related_person_id, relation_code, relation_display)
            family_member_history, inherited_diseases = create_family_member_history(patient_id, related_person_id, relation_code, relation_display)

            # Append to resources dictionary
            resources["RelatedPerson"].append(related_person)
            if family_member_history:
                resources["FamilyMemberHistory"].append(family_member_history)
            
            # Generate Condition records for inherited diseases
            if inherited_diseases:
                for disease in inherited_diseases:
                    condition_for_related = create_condition_for_related_person(related_person_id, disease)
                    print(patient_id, related_person_id, disease)
                    resources["Condition"].append(condition_for_related)

# Display counts of added records
print(f"\n`RelatedPerson` added: {len(resources['RelatedPerson'])}")
print(f"`FamilyMemberHistory` added: {len(resources['FamilyMemberHistory'])}")
print(f"Total `Condition` records (pre-existing plus generated for relatives): {len(resources['Condition'])}")


7da148be-b73e-73e3-ed5c-67d7c712a253 3a644dcd-672c-9579-cdeb-65ce6783da97 Asthma
7da148be-b73e-73e3-ed5c-67d7c712a253 8463087b-be64-1139-b779-97d09881e034 Hypertension
7da148be-b73e-73e3-ed5c-67d7c712a253 8463087b-be64-1139-b779-97d09881e034 Heart Disease
d4f1d88b-aecc-493e-2977-44a72e0de2d9 00a4d481-551d-9741-dd8f-fa88fe29ab79 Hypertension
d4f1d88b-aecc-493e-2977-44a72e0de2d9 8c97920a-fc41-8150-f54e-9dcfc1f48fef Diabetes
d4f1d88b-aecc-493e-2977-44a72e0de2d9 8c97920a-fc41-8150-f54e-9dcfc1f48fef Hypertension
9f7675c1-1f29-10ac-92e5-8aaf367f05c3 2b27a9c6-3b32-83fe-c4eb-ff271de3536b Cancer
9f7675c1-1f29-10ac-92e5-8aaf367f05c3 6afaf446-c5f5-8967-0c74-9174ec37994d Diabetes
9f7675c1-1f29-10ac-92e5-8aaf367f05c3 921dde19-cf57-df8d-5079-556b79c1c12b Asthma
839e461d-9a4d-a110-1fe9-97bd16378bfd 24b71f9a-cda7-8c08-2df2-4f9c9ae5db55 Heart Disease
839e461d-9a4d-a110-1fe9-97bd16378bfd 9f2b7772-a77d-9323-806a-e15deeb08d98 Cancer
c498075d-c7cc-69ba-23c3-0e6a6c188592 be222f9e-05e3-7c64-349b-02949d6222c7
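
The probabilities used in this notebook also determine what to expect from the generated dataset. Because each disease is drawn independently, a relative receives no condition with probability equal to the product of `1 - p` over all diseases, which is roughly 36% for the values above. In those cases no `FamilyMemberHistory` is created, which is why that count is lower than the `RelatedPerson` count. The sketch below works through this arithmetic; the variable names are illustrative, and the relative counts are upper bounds because the notebook caps them by the number of available candidate patients.

```python
import math

# Same values as `disease_probabilities` in the notebook
disease_probabilities = {
    "Diabetes": 0.125,
    "Hypertension": 0.35,
    "Cancer": 0.075,
    "Heart Disease": 0.225,
    "Alzheimer": 0.03,
    "Asthma": 0.10,
}

# Independent draws: P(no condition) is the product of (1 - p)
p_no_condition = math.prod(1 - p for p in disease_probabilities.values())

# Expected number of conditions per relative is the sum of the probabilities
expected_conditions = sum(disease_probabilities.values())

# Parents: 70% both, 20% exactly one, 10% none
expected_parents = 0.7 * 2 + 0.2 * 1

# Siblings: 50% chance of any, then 1 or 2 with equal probability
expected_siblings = 0.5 * (1 + 2) / 2

print(f"P(relative has no condition):       {p_no_condition:.3f}")
print(f"Expected conditions per relative:   {expected_conditions:.3f}")
print(f"Expected parents requested/patient: {expected_parents:.2f}")
print(f"Expected siblings requested/patient: {expected_siblings:.2f}")
```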

# 4. Saving Processed FHIR Data

In [None]:
# **Save All Resources to Separate JSON Files**
for resource_type, resource_list in resources.items():
    output_path = os.path.join(output_folder, f"{resource_type}.json")
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(resource_list, f, indent=4)
    print(f"Saved {resource_type} data to {output_path}")


✅ Patient data saved to synthea/output/processed/Patient.json
✅ Encounter data saved to synthea/output/processed/Encounter.json
✅ Condition data saved to synthea/output/processed/Condition.json
✅ DiagnosticReport data saved to synthea/output/processed/DiagnosticReport.json
✅ DocumentReference data saved to synthea/output/processed/DocumentReference.json
✅ Claim data saved to synthea/output/processed/Claim.json
✅ ExplanationOfBenefit data saved to synthea/output/processed/ExplanationOfBenefit.json
✅ Observation data saved to synthea/output/processed/Observation.json
✅ Immunization data saved to synthea/output/processed/Immunization.json
✅ Procedure data saved to synthea/output/processed/Procedure.json
✅ SupplyDelivery data saved to synthea/output/processed/SupplyDelivery.json
✅ MedicationRequest data saved to synthea/output/processed/MedicationRequest.json
✅ CareTeam data saved to synthea/output/processed/CareTeam.json
✅ CarePlan data saved to s
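
Since the saved files are meant for graph-based learning, it can help to confirm that a file reloads cleanly and that the relationships can be read back as edges. The following is a minimal sketch under stated assumptions: `related_person_edges` and `strip_prefix` are hypothetical helpers, the sample record is made up, and the sketch relies on the `RelatedPerson` layout produced by `create_related_person` (`id` prefixed with `urn:uuid:` and `patient.reference` prefixed with `Patient/`). A temporary directory stands in for `output_folder`.

```python
import json
import os
import tempfile


def strip_prefix(value, prefix):
    """Remove `prefix` from `value` if present."""
    return value[len(prefix):] if value.startswith(prefix) else value


def related_person_edges(related_persons):
    """Turn RelatedPerson resources into (patient, relative, relation_code) edges."""
    edges = []
    for rp in related_persons:
        source = strip_prefix(rp["patient"]["reference"], "Patient/")
        target = strip_prefix(rp["id"], "urn:uuid:")
        code = rp["relationship"][0]["coding"][0]["code"]
        edges.append((source, target, code))
    return edges


# Made-up record with the same layout as the notebook's RelatedPerson resources
sample = [{
    "resourceType": "RelatedPerson",
    "id": "urn:uuid:p2",
    "patient": {"reference": "Patient/p1"},
    "relationship": [{"coding": [{"code": "FTH", "display": "Father"}]}],
}]

# Write and reload the file, mirroring the save step above
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "RelatedPerson.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=4)
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

edges = related_person_edges(loaded)
print(edges)
```

Each tuple is a directed edge from the patient to the relative, labeled with the relationship code. How these edges are fed into a GNN library is left open here.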