# **Synthetic FHIR Data Generation for Graph-Based Learning**  

## **Introduction**  
This notebook focuses on generating **synthetic patient data** in **FHIR format**, simulating **family relationships, inherited diseases, and medical conditions**. The output dataset will be structured for **graph-based machine learning applications**, such as **Graph Neural Networks (GNNs)**.  

Key steps in this notebook:  
- **Extracting FHIR data** into **Python dictionaries** grouped by resource type.  
- **Simulating family relationships** and **inherited diseases** based on epidemiological probabilities.  
- **Creating FHIR-compliant resources** such as `Patient`, `FamilyMemberHistory`, `RelatedPerson`, and `Condition`.  
- **Saving processed data** as structured **FHIR JSON files** for further analysis.  

---  

## **List of Contents**  

This notebook is divided into **four major steps**, covering the extraction, simulation, processing, and saving of FHIR data.  

### [**1. Extracting FHIR JSON Data**](#1-extracting-fhir-json-data)
   **Keywords**: `load_json`, `resources`, `patient_data`  
   - Load **FHIR JSON files** containing structured **patient records**.  
   - Extract and categorize resources into **patients, medical conditions, relationships, and family histories**.  
   - Ensure all patient records are stored in **structured dictionaries** for further processing.  
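
The grouping step can be sketched as a small helper. This is a minimal illustration, not the notebook's exact cell: the function name `group_by_resource_type` is hypothetical, and unlike the cell in Section 1 it detects a Bundle by its `resourceType` rather than by the presence of an `entry` key, and it skips entries that carry no resource.

```python
from collections import defaultdict


def group_by_resource_type(document):
    """Group the resources of a Bundle (or a single resource) by resourceType."""
    grouped = defaultdict(list)
    if document.get("resourceType") == "Bundle":
        candidates = [entry.get("resource") for entry in document.get("entry", [])]
    else:
        candidates = [document]
    for resource in candidates:
        # Ignore entries without a resource or without a resourceType
        if isinstance(resource, dict) and "resourceType" in resource:
            grouped[resource["resourceType"]].append(resource)
    return grouped


# Tiny in-memory Bundle used only for illustration
bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Condition", "id": "c1"}},
        {"fullUrl": "urn:uuid:empty"},  # no resource: skipped
    ],
}
grouped = group_by_resource_type(bundle)
print({resource_type: len(items) for resource_type, items in grouped.items()})
```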

### [**2. Generating Family Relationships and Inherited Diseases**](#2-generating-family-relationships-and-inherited-diseases)
   **Keywords**: `disease_probabilities`, `create_related_person`, `create_family_member_history`, `create_condition_for_related_person`  
   - Define **probability distributions for hereditary diseases** based on epidemiological data.  
   - Generate **FamilyMemberHistory records**, assigning medical conditions to family members probabilistically.  
   - Create **RelatedPerson records**, simulating familial relationships among patients.  
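
The probabilistic assignment amounts to one independent draw per disease. The sketch below uses the probabilities from `disease_probabilities` in Section 2; the helper name `sample_conditions` and the explicit seeded `random.Random` instance are assumptions added here for reproducibility and are not part of the notebook.

```python
import math
import random

disease_probabilities = {
    "Diabetes": 0.125,
    "Hypertension": 0.35,
    "Cancer": 0.075,
    "Heart Disease": 0.225,
    "Alzheimer": 0.03,
    "Asthma": 0.10,
}


def sample_conditions(probabilities, rng):
    """Return the diseases whose independent Bernoulli draw succeeds."""
    return [disease for disease, p in probabilities.items() if rng.random() < p]


# Expected number of conditions per relative: the sum of the probabilities
expected_conditions = sum(disease_probabilities.values())

# Probability that a relative receives no condition at all
p_no_condition = math.prod(1 - p for p in disease_probabilities.values())

rng = random.Random(42)  # fixed seed so repeated runs give the same sample
print(sample_conditions(disease_probabilities, rng))
print(f"expected conditions per relative: {expected_conditions:.3f}")
print(f"probability of no condition: {p_no_condition:.3f}")
```

Under these probabilities a relative receives about 0.9 conditions on average, and roughly 36% of relatives receive none, in which case no `FamilyMemberHistory` record is created for them.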

### [**3. Constructing FamilyMemberHistory and RelatedPerson Records**](#3-constructing-familymemberhistory-and-relatedperson-records)  
   **Keywords**: `used_relationships`, `random.sample`, `related_person_id`, `family_member_history`  
   - Assign **synthetic parents and siblings** to each patient with predefined probabilities.  
   - Ensure **no duplicate relationships** exist between individuals.  
   - Append generated relationships and conditions to **FHIR-compliant resources**.  
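
One way to guarantee that a pair of individuals is linked at most once is to store each pair as an unordered `frozenset`. This is a sketch of that idea only; the notebook itself stores both ordered tuples in `used_relationships`, and the helper name `add_relationship` is hypothetical.

```python
def add_relationship(used_pairs, person_a, person_b):
    """Record an undirected relationship; return False if it is not allowed."""
    if person_a == person_b:
        return False  # a person cannot be related to themselves
    pair = frozenset((person_a, person_b))
    if pair in used_pairs:
        return False  # already linked, in either direction
    used_pairs.add(pair)
    return True


used_pairs = set()
print(add_relationship(used_pairs, "p1", "p2"))  # True
print(add_relationship(used_pairs, "p2", "p1"))  # False: same pair, reversed
```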

### [**4. Saving Processed FHIR Data**](#4-saving-processed-fhir-data)  
   **Keywords**: `json.dump`, `output_folder`, `os.path.join`  
   - Save structured **FHIR patient records** into **separate JSON files**.  
   - Ensure **FHIR-compliant formatting** for downstream machine learning applications.  
   - Store the following files in `synthea/output/processed/`:  
     - **`Patient.json`** → Structured patient demographic information.  
     - **`Condition.json`** → Lists of diagnosed conditions.  
     - **`RelatedPerson.json`** → Synthetic family relationships.  
     - **`FamilyMemberHistory.json`** → Inheritance history of diseases.  
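
Before saving, a lightweight structural check can catch obviously malformed resources. The sketch below is not FHIR validation: it only verifies that every resource carries an `id` and a `resourceType` matching the key it is stored under. The function name `find_invalid_resources` is hypothetical.

```python
def find_invalid_resources(resources_by_type):
    """Return (resource_type, index, problem) tuples for malformed resources."""
    problems = []
    for resource_type, resource_list in resources_by_type.items():
        for index, resource in enumerate(resource_list):
            if resource.get("resourceType") != resource_type:
                problems.append((resource_type, index, "resourceType mismatch"))
            if not resource.get("id"):
                problems.append((resource_type, index, "missing id"))
    return problems


example = {
    "Patient": [{"resourceType": "Patient", "id": "p1"}],
    "Condition": [{"resourceType": "Condition"}],  # no id
}
print(find_invalid_resources(example))
```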

---

## **Installation Requirements**  

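This notebook uses only the Python standard library (`os`, `json`, `random`, `uuid`, `collections`), so nothing has to be installed to run it. Note that `json` ships with Python and cannot be installed with `pip`. `pandas` and `numpy` are not used by the cells below; install them only if you need them for downstream analysis:  

```bash
pip install pandas numpy
```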

---

## **Conclusion**  
This notebook **generates and structures synthetic FHIR data**, simulating **real-world healthcare records** for predictive modeling. The processed dataset is stored in **FHIR JSON format**, making it suitable for **graph-based machine learning applications** such as **Graph Neural Networks (GNNs)**.  

In the next step, this dataset will be used to **train a GNN model** to predict **medical conditions based on family relationships**. 
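
As a rough illustration of how the output could feed a graph model, the sketch below turns `RelatedPerson` resources into a list of `(patient_id, related_id, relation_code)` edges. It assumes the resource shape produced in Section 2 and accepts both the plain `Patient/<id>` reference and the `Patient?identifier=...|<id>` form written in Section 4. The helper names are hypothetical, and no particular GNN library is implied.

```python
def patient_id_from_reference(reference):
    """Extract the patient ID from either reference form used in this notebook."""
    if "|" in reference:
        return reference.rsplit("|", 1)[1]  # Patient?identifier=<system>|<id>
    if reference.startswith("Patient/"):
        return reference[len("Patient/"):]
    return None


def related_person_edges(related_persons):
    """Build (patient_id, related_id, relation_code) edges from RelatedPerson resources."""
    edges = []
    for related_person in related_persons:
        reference = related_person.get("patient", {}).get("reference", "")
        patient_id = patient_id_from_reference(reference)
        if patient_id is None:
            continue  # skip resources without a usable patient reference
        relationships = related_person.get("relationship") or [{}]
        codings = relationships[0].get("coding") or [{}]
        relation_code = codings[0].get("code", "UNKNOWN")
        edges.append((patient_id, related_person["id"], relation_code))
    return edges


example = [{
    "resourceType": "RelatedPerson",
    "id": "r1",
    "patient": {"reference": "Patient/p1"},
    "relationship": [{"coding": [{"code": "FTH", "display": "Father"}]}],
}]
print(related_person_edges(example))  # [('p1', 'r1', 'FTH')]
```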

# 1. Extracting FHIR JSON Data

In [31]:
import os
import json
import random
from collections import defaultdict

# **Folder containing the raw Synthea FHIR sample data; processed output is written to `output_folder`**
data_folder = "synthea/output/sample"
output_folder = "synthea/output/processed/"
os.makedirs(output_folder, exist_ok=True)

# **Collect all JSON files in the folder**
json_files = [f for f in os.listdir(data_folder) if f.endswith(".json")]

# **Dictionary to store resources by type**
resources = defaultdict(list)

# **Read all files and separate resources by type**
for file in json_files:
    file_path = os.path.join(data_folder, file)
    print(f"Reading file: {file}")

    with open(file_path, "r", encoding="utf-8") as f:
        try:
            data = json.load(f)

            # If JSON is a FHIR Bundle, check for `entry`
            if "entry" in data:
                for entry in data["entry"]:
                    resource = entry["resource"]
                    resource_type = resource["resourceType"]
                    resources[resource_type].append(resource)
            else:
                # If JSON contains a single resource
                resource_type = data["resourceType"]
                resources[resource_type].append(data)

        except json.JSONDecodeError as e:
            print(f"Error reading JSON in {file}: {e}")

# **Retrieve the list of patients**
patients = resources["Patient"]
patient_ids = [p["id"] for p in patients]
patient_data = {p["id"]: p for p in patients}  # Store patient details

print(f"\nTotal number of patients found: {len(patients)}")


Reading file: Alvin56_Jerde200_53917de0-aa0c-0700-105a-4f7f893c62f6.json
Reading file: Ardath226_Maude482_Feeney44_611bb297-2f51-beea-a7ec-4df0b59cb135.json
Reading file: Breana975_Harvey63_47e5d1f6-a017-1481-f0a2-49ef0edb3b72.json

Total number of patients found: 3


# 2. Generating Family Relationships and Inherited Diseases

In [32]:
# **Probability of Hereditary Diseases Based on Epidemiological Data**
disease_probabilities = {
    "Diabetes": 0.125,
    "Hypertension": 0.35,
    "Cancer": 0.075,
    "Heart Disease": 0.225,
    "Alzheimer": 0.03,
    "Asthma": 0.10
}

# **Function to Create a RelatedPerson Resource**
def create_related_person(patient_id, related_id, relation_code, relation_display):
    """
    Generates a FHIR `RelatedPerson` resource linking a patient to a related individual.
    
    Args:
        patient_id (str): The ID of the primary patient.
        related_id (str): The ID of the related individual.
        relation_code (str): Code representing the type of relationship.
        relation_display (str): Human-readable relationship name.

    Returns:
        dict: A dictionary representing the `RelatedPerson` resource.
    """
    related_patient = patient_data[related_id]
    
    return {
        "resourceType": "RelatedPerson",
        "id": f"{related_id}",
        "patient": {"reference": f"Patient/{patient_id}"},
        "relationship": [{
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/v3-RoleCode",
                "code": relation_code,
                "display": relation_display
            }],
            "text": relation_display
        }],
        "name": related_patient.get("name", [{"use": "official", "family": "Unknown"}]),
        "gender": related_patient.get("gender", "unknown"),
        "birthDate": related_patient.get("birthDate", "unknown")
    }

# **Function to Create a FamilyMemberHistory Resource**
def create_family_member_history(patient_id, related_id, relation_code, relation_display):
    """
    Generates a FHIR `FamilyMemberHistory` resource indicating inherited medical conditions.
    
    Args:
        patient_id (str): The ID of the primary patient.
        related_id (str): The ID of the related family member.
        relation_code (str): Code representing the type of relationship.
        relation_display (str): Human-readable relationship name.

    Returns:
        tuple: (FamilyMemberHistory resource (dict) or None, list of inherited conditions)
    """
    inherited_conditions = []
    
    # Assign conditions based on predefined probabilities
    for disease, probability in disease_probabilities.items():
        if random.random() <= probability:
            inherited_conditions.append(disease)
    
    if not inherited_conditions:
        return None, []  # No conditions inherited: no resource, empty list (matches the docstring)

    family_history = {
        "resourceType": "FamilyMemberHistory",
        "id": f"{related_id}",  # Reuse the related patient's ID as-is, without a `family-` prefix
        "patient": {"reference": f"Patient/{patient_id}"},
        "relationship": {
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/v3-RoleCode",
                "code": relation_code,
                "display": relation_display
            }]
        },
        "status": "completed",  # Ensure status is present
        "condition": [{
            "code": {
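                "coding": [{
                    "system": "http://snomed.info/sct",
                    # NOTE: 22298006 is the SNOMED CT code for myocardial infarction;
                    # it is used here as a placeholder for every disease.
                    "code": "22298006",
                    "display": disease
                }],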
                "text": disease
            }
        } for disease in inherited_conditions]
    }

    return family_history, inherited_conditions


# **Function to Create a Condition Resource for a RelatedPerson**
def create_condition_for_related_person(related_id, disease):
    """
    Generates a FHIR `Condition` resource for a related person.

    Args:
        related_id (str): The ID of the related individual.
        disease (str): The medical condition assigned to the related person.

    Returns:
        dict: A dictionary representing the `Condition` resource.
    """
    return {
        "resourceType": "Condition",
        "id": f"{related_id}-{disease.lower().replace(' ', '-')}",  # ID is `<related_id>-<disease>`, without a `condition-related-` prefix
        "subject": {"reference": f"urn:uuid:{related_id}"},
        "code": {
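            "coding": [{
                "system": "http://snomed.info/sct",
                # NOTE: 22298006 is the SNOMED CT code for myocardial infarction;
                # it is used here as a placeholder for every disease.
                "code": "22298006",
                "display": disease
            }],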
            "text": disease
        }
    }
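
Both functions above attach the same SNOMED CT code (`22298006`, myocardial infarction) to every disease. A per-disease lookup is one way to avoid that. The sketch below is an assumption, not part of the notebook: the helper `disease_coding` is hypothetical, and the codes listed should be verified against a SNOMED CT browser before use. Diseases without a known code fall back to a text-only `CodeableConcept`.

```python
# Assumed SNOMED CT concept IDs for the notebook's disease labels; verify before use.
SNOMED_CODES = {
    "Diabetes": "73211009",        # Diabetes mellitus
    "Hypertension": "38341003",    # Hypertensive disorder
    "Cancer": "363346000",         # Malignant neoplastic disease
    "Heart Disease": "56265001",   # Heart disease
    "Alzheimer": "26929004",       # Alzheimer's disease
    "Asthma": "195967001",         # Asthma
}


def disease_coding(disease):
    """Return a CodeableConcept for a disease label, with a code when one is known."""
    concept = {"text": disease}
    code = SNOMED_CODES.get(disease)
    if code is not None:
        concept["coding"] = [{
            "system": "http://snomed.info/sct",
            "code": code,
            "display": disease,
        }]
    return concept


print(disease_coding("Asthma"))
print(disease_coding("Gout"))  # unknown label: text only, no coding
```

The returned dictionary could replace the inline `"code": {...}` literal in `create_family_member_history` and `create_condition_for_related_person`.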



# 3. Constructing FamilyMemberHistory and RelatedPerson Records

In [33]:
# **Retrieve the list of patients**
patients = resources["Patient"]
patient_ids = [p["id"] for p in patients]
patient_data = {p["id"]: p for p in patients} 

In [34]:
# **Adding `FamilyMemberHistory`, `RelatedPerson`, and `Condition` to the dataset**
used_relationships = set()

for patient in patients:
    patient_id = patient["id"]
    
    # Identify potential relatives (excluding self and already assigned relations)
    possible_relations = [p for p in patient_ids if p != patient_id and (patient_id, p) not in used_relationships]
    if not possible_relations:
        continue

    family_members = []

    # Parents: 70% chance of both parents, 20% chance of exactly one, 10% chance of none
    rand_parent = random.random()
    if rand_parent <= 0.7:
        family_members.append(("FTH", "Father"))
        family_members.append(("MTH", "Mother"))
    elif rand_parent <= 0.9:
        # The next 20%: assign only one parent, chosen at random
        family_members.append(random.choice([("FTH", "Father"), ("MTH", "Mother")]))
    
    # Randomly assign siblings with a 50% probability
    rand_sibling = random.random()
    if rand_sibling <= 0.5:
        num_siblings = random.randint(1, 2)  # Randomly assign 1 or 2 siblings
        for _ in range(num_siblings):
            family_members.append(random.choice([("BRO", "Brother"), ("SIS", "Sister")]))
    
    # Select a subset of available patient IDs as family members
    chosen_family_members = random.sample(possible_relations, min(len(family_members), len(possible_relations)))

    for related_person_id, (relation_code, relation_display) in zip(chosen_family_members, family_members):
        # Ensure the relationship is not duplicated
        if (related_person_id, patient_id) not in used_relationships:
            used_relationships.add((patient_id, related_person_id))
            used_relationships.add((related_person_id, patient_id))

            # Create RelatedPerson and FamilyMemberHistory records
            related_person = create_related_person(patient_id, related_person_id, relation_code, relation_display)
            family_member_history, inherited_diseases = create_family_member_history(patient_id, related_person_id, relation_code, relation_display)

            # Append to resources dictionary
            resources["RelatedPerson"].append(related_person)
            if family_member_history:
                resources["FamilyMemberHistory"].append(family_member_history)
            
            # Generate Condition records for inherited diseases
            if inherited_diseases:
                for disease in inherited_diseases:
                    condition_for_related = create_condition_for_related_person(related_person_id, disease)
                    print(patient_id, related_person_id, disease)
                    resources["Condition"].append(condition_for_related)

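# Display resource totals. Note: the `Condition` total also includes the conditions
# loaded from the source bundles, not only those generated for related persons.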
print(f"\n`RelatedPerson` added: {len(resources['RelatedPerson'])}")
print(f"`FamilyMemberHistory` added: {len(resources['FamilyMemberHistory'])}")
print(f"`Condition` for RelatedPerson added: {len(resources['Condition'])}")


611bb297-2f51-beea-a7ec-4df0b59cb135 47e5d1f6-a017-1481-f0a2-49ef0edb3b72 Hypertension
611bb297-2f51-beea-a7ec-4df0b59cb135 47e5d1f6-a017-1481-f0a2-49ef0edb3b72 Alzheimer

`RelatedPerson` added: 3
`FamilyMemberHistory` added: 1
`Condition` for RelatedPerson added: 101


# 4. Saving Processed FHIR Data

In [None]:
import os
import json
import uuid

# **Prepare Bundle for RelatedPerson Upload**
related_person_bundle = {
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": []
}

# **Set of RelatedPerson IDs that have already been used**
existing_ids = set()

# **Loop over RelatedPerson resources only**
for related_person in resources["RelatedPerson"]:
    related_person_id = related_person["id"]
    patient_reference = related_person.get("patient", {}).get("reference")

    # **Make sure the reference points to a Patient**
    if not patient_reference or not patient_reference.startswith("Patient/"):
        print(f"Skipping RelatedPerson {related_person_id} - Invalid Patient reference: {patient_reference}")
        continue

    # **Keep only the Patient ID (without the "Patient/" prefix)**
    patient_id = patient_reference.replace("Patient/", "")

    # **Identifier-based search query for the Patient**
    patient_reference_formatted = f"Patient?identifier=https://github.com/synthetichealth/synthea|{patient_id}"

    # **Identifier-based search query for the RelatedPerson**
    related_person_reference_formatted = f"Patient?identifier=https://github.com/synthetichealth/synthea|{related_person_id}"

    # **If the ID has already been used, skip it to avoid duplicates**
    if related_person_id in existing_ids:
        print(f"Skipping duplicate RelatedPerson ID: {related_person_id}")
        continue
    existing_ids.add(related_person_id)

    # **Generate a new ID for fullUrl**
    generated_uuid = str(uuid.uuid4())

    # **Add an extension that stores the query as a valueReference**
    related_person.setdefault("extension", []).append({
        "url": "https://github.com/synthetichealth/synthea/relatedperson-identifier",
        "valueReference": {
            "reference": related_person_reference_formatted
        }
    })

    # **Make "patient" use the identifier-based query reference**
    related_person["patient"] = {"reference": patient_reference_formatted}

    # **Add to the Bundle with the new UUID as fullUrl**
    related_person_bundle["entry"].append({
        "fullUrl": f"urn:uuid:{generated_uuid}",  # Use the new UUID
        "resource": related_person,
        "request": {
            "method": "POST",
            "url": "RelatedPerson"
        }
    })

# **Save as JSON File for Upload**
output_folder = "synthea/output/processed/"
os.makedirs(output_folder, exist_ok=True)
related_person_output_path = os.path.join(output_folder, "FHIR_RelatedPerson_Bundle.json")

with open(related_person_output_path, "w", encoding="utf-8") as f:
    json.dump(related_person_bundle, f, indent=4)

print(f"\nSaved RelatedPerson Bundle data to {related_person_output_path}")


Skipping duplicate RelatedPerson ID: 47e5d1f6-a017-1481-f0a2-49ef0edb3b72

Saved RelatedPerson Bundle data to synthea/output/processed/FHIR_RelatedPerson_Bundle.json


In [36]:
# **Save All Resources to Separate JSON Files**
for resource_type, resource_list in resources.items():
    output_path = os.path.join(output_folder, f"{resource_type}.json")
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(resource_list, f, indent=4)
    print(f"Saved {resource_type} data to {output_path}")


Saved Patient data to synthea/output/processed/Patient.json
Saved Encounter data to synthea/output/processed/Encounter.json
Saved Condition data to synthea/output/processed/Condition.json
Saved DiagnosticReport data to synthea/output/processed/DiagnosticReport.json
Saved DocumentReference data to synthea/output/processed/DocumentReference.json
Saved Claim data to synthea/output/processed/Claim.json
Saved ExplanationOfBenefit data to synthea/output/processed/ExplanationOfBenefit.json
Saved Device data to synthea/output/processed/Device.json
Saved MedicationRequest data to synthea/output/processed/MedicationRequest.json
Saved CareTeam data to synthea/output/processed/CareTeam.json
Saved CarePlan data to synthea/output/processed/CarePlan.json
Saved Observation data to synthea/output/processed/Observation.json
Saved Procedure data to synthea/output/processed/Procedure.json
Saved Immunization data to synthea/output/processed/Immunization.json
Saved SupplyDelivery data to synthea/output/proc

In [37]:
import os
import json

# **Folder containing the sample data**
data_folder = "synthea/output/sample"
output_folder = "synthea/output/processed/"
os.makedirs(output_folder, exist_ok=True)

# **File containing the RelatedPerson Bundle**
related_person_file = "synthea/output/processed/FHIR_RelatedPerson_Bundle.json"

# **Merge all JSON entries from data_folder**
all_entries = []

# **Loop over every file in data_folder**
for file_name in os.listdir(data_folder):
    file_path = os.path.join(data_folder, file_name)

    if not file_name.endswith(".json"):
        continue  # Skip non-JSON files

    print(f"Reading file: {file_name}")

    with open(file_path, "r", encoding="utf-8") as f:
        try:
            data = json.load(f)

            # If this is a Bundle, add its entries
            if "entry" in data:
                all_entries.extend(data["entry"])
            else:
                # If the file holds a single resource, wrap it as an entry
                all_entries.append({"resource": data})
        
        except json.JSONDecodeError as e:
            print(f"Error parsing {file_name}: {e}")

# # **Optional: add the RelatedPerson entries to the merged Bundle (left disabled, as in the original cell)**
# if os.path.exists(related_person_file):
#     with open(related_person_file, "r", encoding="utf-8") as f:
#         try:
#             related_person_bundle = json.load(f)
#             if "entry" in related_person_bundle:
#                 all_entries.extend(related_person_bundle["entry"])
#         except json.JSONDecodeError as e:
#             print(f"Error parsing RelatedPerson JSON: {e}")

# **Build a new Bundle for upload**
# (This block was commented out, which caused `NameError: name 'merged_bundle' is not defined` below.)
merged_bundle = {
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": all_entries
}

# **Save the merged Bundle**
merged_output_path = os.path.join(output_folder, "FHIR_Merged_Bundle.json")
with open(merged_output_path, "w", encoding="utf-8") as f:
    json.dump(merged_bundle, f, indent=4)

print(f"\nMerged Bundle saved to {merged_output_path}")


Reading file: Alvin56_Jerde200_53917de0-aa0c-0700-105a-4f7f893c62f6.json
Reading file: Ardath226_Maude482_Feeney44_611bb297-2f51-beea-a7ec-4df0b59cb135.json
Reading file: Breana975_Harvey63_47e5d1f6-a017-1481-f0a2-49ef0edb3b72.json