# Generate synthetic pregnancy patient pathways

In this file, we generate a set of synthetic pregnancy patient pathways. The resulting dataset can be used to test the Framework for assessing multi-perspective patient pathways.

Note that these synthetic pathways are not meant to be a reliable basis for assessing pregnancy pathways. Their only purpose is to give those who want to test the framework, but currently lack access to a healthcare dataset, an event log with the necessary information.

## 1. Libraries and functions

In [1]:
import numpy as np
import pandas as pd
from datetime import date, timedelta
from numpy.random import choice

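# show full column contents; the fully qualified option name avoids ambiguous partial matches
pd.set_option('display.max_colwidth', None)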

## 2. Synthetic pathways

We start by listing the medical procedures a patient may undergo, each with its probability of occurrence.

In [2]:
interventions_probabilities = {
    'Antenatal Care Visit' : 0.6,
    'Obstetric Ultrasound Scan' : 0.12,
    'Pregnancy Test' : 0.07,
    'Unscheduled Primary Care Visit' : 0.07,
    'Primary Care Visit' : 0.09,
    'Obstetric Morphology Ultrasound Scan' : 0.04,
    'Color Flow Doppler Obstetric Ultrasound Scan' : 0.01
}
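Since `numpy.random.choice` raises an error when the probabilities passed through `p` do not sum to 1, it can help to check the table before sampling. The following is a minimal sketch of such a check; the tolerance value is an assumption, chosen only to absorb floating-point rounding.

```python
import math

# Copy of the procedure probabilities listed above.
interventions_probabilities = {
    'Antenatal Care Visit': 0.6,
    'Obstetric Ultrasound Scan': 0.12,
    'Pregnancy Test': 0.07,
    'Unscheduled Primary Care Visit': 0.07,
    'Primary Care Visit': 0.09,
    'Obstetric Morphology Ultrasound Scan': 0.04,
    'Color Flow Doppler Obstetric Ultrasound Scan': 0.01,
}

total = sum(interventions_probabilities.values())

# Floating-point sums rarely equal 1 exactly, so compare with a tolerance
# (1e-9 is an assumed value, not something prescribed by the notebook).
assert math.isclose(total, 1.0, abs_tol=1e-9), f'probabilities sum to {total}'
```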

Next, we define the probability of the type of healthcare unit given the medical procedure.

In [3]:
unit_type_intervention = {
    'Antenatal Care Visit' : {
        'Primary Care' : 0.8, 
        'Secondary Care' : 0.18 ,
        'Hospital' : 0.02
    },
    'Primary Care Visit' :{
        'Primary Care' : 1, 
        'Secondary Care' : 0,
        'Hospital' : 0
    },
    'Obstetric Ultrasound Scan' : {
        'Primary Care' : 0, 
        'Secondary Care' : 0.9,
        'Hospital' : 0.1
    },
    'Pregnancy Test' : {
        'Primary Care' : 0.95, 
        'Secondary Care' : 0.04 ,
        'Hospital' : 0.01
    },
    'Unscheduled Primary Care Visit' : {
        'Primary Care' : 1.0, 
        'Secondary Care' :  0.0,
        'Hospital' : 0.0
    },
    'Obstetric Morphology Ultrasound Scan' : {
        'Primary Care' : 0.0, 
        'Secondary Care' : 0.3 ,
        'Hospital' : 0.7
    },
    'Color Flow Doppler Obstetric Ultrasound Scan' : {
        'Primary Care' :0.0, 
        'Secondary Care' : 0.0 ,
        'Hospital' : 1.0
    },
}
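To illustrate how such a conditional table is used, the sketch below draws a unit type given a procedure. It works on a one-entry excerpt of the table and uses NumPy's `Generator` API with an arbitrary seed; both the helper name `sample_unit_type` and the seed are assumptions made for this example, and the notebook itself relies on the legacy `numpy.random.choice` function.

```python
import numpy as np

# Excerpt of the conditional table above (one procedure only).
unit_type_given_procedure = {
    'Obstetric Ultrasound Scan': {
        'Primary Care': 0,
        'Secondary Care': 0.9,
        'Hospital': 0.1,
    },
}

rng = np.random.default_rng(0)  # arbitrary seed, assumed for reproducibility


def sample_unit_type(procedure):
    """Draw one unit type from the distribution conditioned on `procedure`."""
    dist = unit_type_given_procedure[procedure]
    return str(rng.choice(list(dist.keys()), p=list(dist.values())))


draws = [sample_unit_type('Obstetric Ultrasound Scan') for _ in range(1000)]

# Unit types with probability 0 can never be drawn.
print('Primary Care' in draws)
```

Because 'Primary Care' has probability 0 for this procedure, it never appears among the draws.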

Similarly, we define the probability of the occupation given the medical procedure.

In [4]:
occupation_intervention = {
    'Antenatal Care Visit' : {
        'Doctor in Obstetrics and Gynaecology' : 0.55,
        'General Practitioner' : 0.2,
        'Nurse' : 0.25,
        'Doctor in radiology' : 0,
        'Nursing Assistant' : 0
    },
    'Obstetric Ultrasound Scan' : {
        'Doctor in Obstetrics and Gynaecology' : 0,
        'General Practitioner' : 0,
        'Nurse' : 0,
        'Doctor in radiology' : 1,
        'Nursing Assistant' : 0
    },
    'Primary Care Visit' :{
        'Doctor in Obstetrics and Gynaecology' : 0.2,
        'General Practitioner' : 0.75,
        'Nurse' : 0.05,
        'Doctor in radiology' : 0,
        'Nursing Assistant' : 0
    },
    'Pregnancy Test' : {
        'Doctor in Obstetrics and Gynaecology' : 0,
        'General Practitioner' : 0,
        'Nurse' : 0.3,
        'Doctor in radiology' : 0,
        'Nursing Assistant' : 0.7
    },
    'Unscheduled Primary Care Visit' : {
        'Doctor in Obstetrics and Gynaecology' : 0.15,
        'General Practitioner' : 0.85,
        'Nurse' : 0,
        'Doctor in radiology' : 0,
        'Nursing Assistant' : 0
    },
    'Obstetric Morphology Ultrasound Scan' : {
        'Doctor in Obstetrics and Gynaecology' : 0,
        'General Practitioner' : 0,
        'Nurse' : 0,
        'Doctor in radiology' : 1,
        'Nursing Assistant' : 0
    },
    'Color Flow Doppler Obstetric Ultrasound Scan' : {
        'Doctor in Obstetrics and Gynaecology' : 0,
        'General Practitioner' : 0,
        'Nurse' : 0,
        'Doctor in radiology' : 1,
        'Nursing Assistant' : 0
    }   
}
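Each inner dictionary in the two conditional tables must be a valid probability distribution, otherwise sampling fails. A possible way to catch typos early is sketched below; `invalid_rows` is a hypothetical helper written for this example, and the 'Broken Row' entry is deliberately wrong to show what the check reports.

```python
import math


def invalid_rows(conditional_table, tol=1e-9):
    """Return the keys whose probabilities are negative or do not sum to 1."""
    return [
        key
        for key, dist in conditional_table.items()
        if any(p < 0 for p in dist.values())
        or not math.isclose(sum(dist.values()), 1.0, abs_tol=tol)
    ]


# Small illustrative table; only the first row comes from the notebook.
example = {
    'Pregnancy Test': {'Nurse': 0.3, 'Nursing Assistant': 0.7},
    'Broken Row': {'Nurse': 0.3, 'Nursing Assistant': 0.6},  # deliberately invalid
}

bad = invalid_rows(example)
print(bad)
```

The same helper could be applied to `unit_type_intervention` and `occupation_intervention`, for which it should return an empty list.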

Now we generate $N$ patient pathways with the three dictionaries created above.

The pathway length follows a binomial distribution with $n=12$ and $p=0.7$.
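As a quick orientation on what these parameters imply, the sketch below computes the mean and standard deviation of the pathway length from the standard binomial formulas, along with the probability of drawing a length of zero. The variable names are chosen for this example only.

```python
import math

n, p = 12, 0.7  # binomial parameters stated above

mean_length = n * p                      # expected number of events per pathway
std_length = math.sqrt(n * p * (1 - p))  # standard deviation of the length
p_empty = (1 - p) ** n                   # probability of a pathway with no events

print(round(mean_length, 2), round(std_length, 2), p_empty)
```

A pathway therefore has about 8.4 events on average, and an empty pathway is possible in principle but extremely unlikely.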

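The time intervals between consecutive events follow a normal distribution with $\mu=30$ and $\sigma=15$ days; each sampled value is truncated to a whole number of days and raised to a minimum of one day.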
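A normal distribution with these parameters occasionally produces values below one day, including negative ones, so the generation code truncates each draw to an integer and enforces a minimum of one day. The sketch below reproduces that step in vectorised form; the `Generator` API, the seed, and the sample size are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, assumed for this example

# Raw gaps in days, drawn from N(30, 15).
raw = rng.normal(30, 15, size=10_000)

# Truncate toward zero (as int() does), then enforce a minimum of one day.
intervals = np.maximum(1, raw.astype(int))

share_clamped = float((raw < 1).mean())  # fraction of draws raised to 1 day
print(intervals.min(), share_clamped)
```

Only a small share of draws falls below one day, so the clamp has little effect on the overall distribution of gaps.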

We consider the healthcare system to have $3$ hospitals, $15$ secondary care units and $50$ primary care units. 
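One way to turn these unit counts into unit labels is sketched below. The dictionary `UNITS_PER_TYPE` and the helper `sample_unit` are hypothetical names introduced for this example; the label format 'Hospital Unit 2' follows the one used in the generated data.

```python
import numpy as np

# Number of units per type, as stated above (hypothetical constant name).
UNITS_PER_TYPE = {'Hospital': 3, 'Secondary Care': 15, 'Primary Care': 50}


def sample_unit(unit_type, rng):
    """Pick one unit of the given type uniformly at random."""
    # Generator.integers excludes the upper bound, hence the + 1.
    unit_number = rng.integers(1, UNITS_PER_TYPE[unit_type] + 1)
    return f'{unit_type} Unit {unit_number}'


rng = np.random.default_rng(0)  # arbitrary seed, assumed for this example
labels = [sample_unit('Hospital', rng) for _ in range(200)]
print(sorted(set(labels)))
```

Keeping the counts in a dictionary avoids hard-coding them in a chain of conditionals and makes it easy to change the size of the simulated healthcare system.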

In [5]:
N = 1000
np.random.seed(10)

list_patients = []
list_dates = []
list_interventions = []
list_occupations = []
list_units = []
list_unit_types = []

for pac in range(N):
    
    patient = 'Patient '+str(pac)
    
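    # Pathway length ~ Binomial(12, 0.7). A draw of 0 (extremely unlikely) is
    # raised to 1, since the date generation below needs at least one event.
    pac_length = max(1, np.random.binomial(12, 0.7))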
    
    pac_interventions = list(choice(
        list(interventions_probabilities.keys()), 
        pac_length, 
        p=list(interventions_probabilities.values())))
    
    pac_occupations = [
        list(choice(list(occupation_intervention[x].keys()), 1, p=list(occupation_intervention[x].values())))[0]
        for x in pac_interventions
    ]
    
    pac_unit_types = [
        list(choice(list(unit_type_intervention[x].keys()), 1, p=list(unit_type_intervention[x].values())))[0]
        for x in pac_interventions
    ]
    
    pac_units = []
    for u_type in pac_unit_types:
        u = u_type + ' Unit ' + str(np.random.randint(1,3+1) if (u_type == 'Hospital')
                                    else np.random.randint(1,15+1) if (u_type == 'Secondary Care')
                                    else np.random.randint(1,50+1))
        pac_units.append(u)
    
    intervals = [max(1,int(x)) for x in np.random.normal(30, 15, pac_length-1)]
    
    dates = [date(2000,1,1)]
    for t in intervals:
        dates.append(dates[-1]+timedelta(days=t))
    dates = [x.strftime('%Y/%m/%d') for x in dates]
    
    list_patients.extend([patient] * pac_length)
    list_dates.extend(dates)
    list_interventions.extend(pac_interventions)
    list_occupations.extend(pac_occupations)
    list_unit_types.extend(pac_unit_types)
    list_units.extend(pac_units)
    

df_sample_data = pd.DataFrame.from_dict({
    'patient_id' : list_patients,
    'date' : list_dates,
    'intervention' : list_interventions,
    'occupation' : list_occupations,
    'unit' : list_units,
    'unit_type' : list_unit_types})

In [6]:
# first lines of the dataframe
df_sample_data.head()

| | patient_id | date | intervention | occupation | unit | unit_type |
|---|---|---|---|---|---|---|
| 0 | Patient 0 | 2000/01/01 | Antenatal Care Visit | Doctor in Obstetrics and Gynaecology | Primary Care Unit 14 | Primary Care |
| 1 | Patient 0 | 2000/01/26 | Obstetric Ultrasound Scan | Doctor in radiology | Secondary Care Unit 13 | Secondary Care |
| 2 | Patient 0 | 2000/02/16 | Pregnancy Test | Nursing Assistant | Primary Care Unit 23 | Primary Care |
| 3 | Patient 0 | 2000/02/22 | Antenatal Care Visit | Nurse | Secondary Care Unit 15 | Secondary Care |
| 4 | Patient 0 | 2000/04/07 | Antenatal Care Visit | Doctor in Obstetrics and Gynaecology | Primary Care Unit 31 | Primary Care |
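The `date` column is stored as text in `%Y/%m/%d` format. To work with the time between events, it can be converted to datetimes and differenced per patient. The sketch below does this for the five dates of Patient 0 shown above; the column name `gap_days` is an assumption made for this example.

```python
import pandas as pd

# The five events of Patient 0 shown in the output above.
events = pd.DataFrame({
    'patient_id': ['Patient 0'] * 5,
    'date': ['2000/01/01', '2000/01/26', '2000/02/16', '2000/02/22', '2000/04/07'],
})

# Parse the text dates, then compute the gap to the previous event per patient.
events['date'] = pd.to_datetime(events['date'], format='%Y/%m/%d')
events['gap_days'] = events.groupby('patient_id')['date'].diff().dt.days

print(events['gap_days'].tolist())
```

The first event of each patient has no predecessor, so its gap is missing (`NaN`); the remaining gaps for this patient are 25, 21, 6 and 45 days.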


In [7]:
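# save file (create the output folder first so that to_csv does not fail if it is missing)
import os
os.makedirs('results', exist_ok=True)
df_sample_data.to_csv('results/sample_synthetic_data.csv', index=False)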
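Saving with `index=False` matters when the file is read back: without it, pandas writes the row index as an extra unnamed column, which then shows up as `Unnamed: 0` on reload. The round trip below is a minimal sketch using a temporary folder and a one-row frame, both of which are assumptions made for this example.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the generated event log.
df = pd.DataFrame({'patient_id': ['Patient 0'], 'date': ['2000/01/01']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.csv')

    # index=False keeps the row index out of the file.
    df.to_csv(path, index=False)

    # Dates come back as text and need to be parsed explicitly.
    reloaded = pd.read_csv(path)
    reloaded['date'] = pd.to_datetime(reloaded['date'], format='%Y/%m/%d')

print(list(reloaded.columns))
```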