# **Final data creation**

#### **Libraries**

We import the necessary libraries.

In [3]:
import pandas as pd
import json
import random
import datetime
from tqdm import tqdm
import pprint
import sys

PP = pprint.PrettyPrinter(sort_dicts=False, depth=6)
sys.path.append('../')
from utilities import generate_random_date, get_random_id

print("> Libraries Imported")

> Libraries Imported


#### **Import patients, conditions and therapies**

First, we import the intermediate data, i.e. the conditions, therapies and patients data.

In [5]:
# import conditions and therapies data
with open('intermediate_data/cond_ther.json', 'r') as file:
    cond_and_ther = json.load(file)
# import patients data
with open('intermediate_data/patients.json', 'r') as file:
    patients_dict = json.load(file)

#### **Create an algorithm for patients conditions and trials generation**

Besides id, name and gender, patients have two other attributes: conditions and trials.
Each patient may have one or more conditions, which have a diagnosed date and a cured date. 
Each condition has at least one trial, which has a start date, an end date and a success rate.
The conditions and trials of a patient are identified by an id and refer to the conditions and therapies previously collected and stored (`cond_ther.json`).
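
As a rough illustration of this layout, a single patient record might look like the sketch below. The values are copied from the first patient in the function output shown later in this notebook; the `condition_of` helper is hypothetical and only shows how a trial points back to its condition.

```python
import datetime

# One patient record, following the structure described above.
patient = {
    "id": "pat_0",
    "name": "Piera",
    "gender": "Female",
    "conditions": [
        {
            "id": "c1",
            "diagnosed": datetime.date(2007, 6, 12),
            "cured": None,         # stays None until a trial succeeds
            "kind": "cond_18",     # id of a condition in cond_ther.json
        }
    ],
    "trials": [
        {
            "id": "c1_t1",
            "start": datetime.date(2009, 3, 5),
            "end": datetime.date(2019, 8, 28),
            "condition": "c1",     # id of the patient's condition being treated
            "therapy": "ther_52",  # id of a therapy in cond_ther.json
            "successful": 63,
        }
    ],
}


def condition_of(patient, trial):
    """Hypothetical helper: return the condition that a trial refers to."""
    return next(c for c in patient["conditions"] if c["id"] == trial["condition"])


print(condition_of(patient, patient["trials"][0])["kind"])  # cond_18
```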

#### **Main Algorithm description**

The main algorithm that generates each patient's conditions and trials works as follows:

- take as input the patients dictionary, the list of condition ids and the list of therapy ids

- for each patient, fill the `conditions` list:
    - choose a random number `N` between 1 and 5 and create `N` conditions for that patient
    - for each condition:
        - create an `id`
        - create a random `diagnosed` date, later than the `diagnosed` date of the previous condition
        - set the `cured` date to `None`
        - choose a random condition id from the `cond_ther.json` data (`kind`), without repeating a condition for the same patient

- for each patient, fill the `trials` list:
    - for each condition:
        - choose a random number `N` between 1 and 5 and create up to `N` trials for that condition
        - for each trial:
            - create an `id`
            - create a `start` date, which has to be later than `diagnosed` and later than the `end` date of the previous trial
            - create an `end` date, which has to be later than `start`
            - store the condition id in the `condition` attribute
            - choose a random therapy id from the `cond_ther.json` data (`therapy`), without repeating a therapy for the same condition
            - choose a random integer for `successful` (the code draws it from 0 to 99); if this number is greater than 75, the condition's `cured` date is set to the trial's `end` date and no further trials are created for that condition
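
The steps above rely on `generate_random_date` and `get_random_id`, which are imported from `utilities` and whose source is not shown in this notebook. The sketch below is only a guess at what such helpers could look like: the call signatures match the calls in the function below, but the default start date and the maximum gaps are assumptions made for illustration.

```python
import datetime
import random


def generate_random_date(min_date=None, type="condition"):
    """Hypothetical stand-in: return a random date strictly after min_date.

    The parameter is named `type` only to match the calls in this notebook.
    """
    # Assumed default lower bound, not taken from utilities.py.
    if min_date is None:
        min_date = datetime.date(2000, 1, 1)
    # Assumption: conditions may be years apart, trials follow more closely.
    max_gap_days = 3 * 365 if type == "condition" else 180
    return min_date + datetime.timedelta(days=random.randint(1, max_gap_days))


def get_random_id(list_of_ids):
    """Hypothetical stand-in: pick one id uniformly at random."""
    return random.choice(list_of_ids)


diagnosed = generate_random_date(type="condition")
start = generate_random_date(min_date=diagnosed, type="trial")
end = generate_random_date(min_date=start, type="trial")
print(diagnosed < start < end)  # True, because every gap is at least one day
```

Adding a gap of at least one day is what makes each date strictly greater than the previous one, which is what the ordering constraints require.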


##### Main algorithm constraints:
- Each patient has at least one and at most 5 conditions;
- Each condition has at least one and at most 5 trials;
- Each patient's conditions are ordered (the diagnosed date of a later condition is greater than the diagnosed date of the earlier one);
- Each trial's start date is greater than the corresponding condition's diagnosed date;
- The trials for the same condition are ordered (the start date of a later trial is greater than the start date of the earlier one);
- If a trial for a condition is successful, no further trials are generated for that condition.
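
These constraints can be checked on generated data. The following is a minimal sketch, not part of the original pipeline: `find_violations` and `SUCCESS_THRESHOLD` are hypothetical names, the threshold of 75 comes from the algorithm description, and the sample patient is copied from the function output shown later in this notebook.

```python
import datetime

SUCCESS_THRESHOLD = 75  # a trial counts as successful above this value


def find_violations(patient):
    """Return a list of constraint violations for one patient (empty if none)."""
    problems = []
    conditions = patient["conditions"]
    if not 1 <= len(conditions) <= 5:
        problems.append("patient must have between 1 and 5 conditions")
    # Conditions must be ordered by diagnosed date.
    for earlier, later in zip(conditions, conditions[1:]):
        if later["diagnosed"] <= earlier["diagnosed"]:
            problems.append(f"{later['id']} is not diagnosed after {earlier['id']}")
    for condition in conditions:
        trials = [t for t in patient["trials"] if t["condition"] == condition["id"]]
        if not 1 <= len(trials) <= 5:
            problems.append(f"{condition['id']} must have between 1 and 5 trials")
        for trial in trials:
            if trial["start"] <= condition["diagnosed"]:
                problems.append(f"{trial['id']} does not start after the diagnosis")
        for earlier, later in zip(trials, trials[1:]):
            if later["start"] <= earlier["start"]:
                problems.append(f"{later['id']} does not start after {earlier['id']}")
            if earlier["successful"] > SUCCESS_THRESHOLD:
                problems.append(f"{later['id']} follows a successful trial")
    return problems


patient = {
    "id": "pat_0",
    "conditions": [
        {"id": "c1", "diagnosed": datetime.date(2007, 6, 12), "cured": None, "kind": "cond_18"},
    ],
    "trials": [
        {"id": "c1_t1", "start": datetime.date(2009, 3, 5), "end": datetime.date(2019, 8, 28),
         "condition": "c1", "therapy": "ther_52", "successful": 63},
        {"id": "c1_t2", "start": datetime.date(2019, 10, 13), "end": datetime.date(2019, 10, 28),
         "condition": "c1", "therapy": "ther_50", "successful": 49},
        {"id": "c1_t3", "start": datetime.date(2019, 11, 3), "end": datetime.date(2019, 11, 9),
         "condition": "c1", "therapy": "ther_46", "successful": 44},
    ],
}

print(find_violations(patient))  # []
```

Returning a list of messages rather than raising on the first failure makes it easier to see every broken constraint for a patient at once.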

#### **Function code definition**:

In [6]:
def generate_full_patients(patients_dict, list_of_conditions_ids, list_of_therapies_ids):

   # generate conditions and trials for patients
   for patient_dict in patients_dict['Patients']:
      
      # print(f"> PATIENT '{patient_dict['name']}' [ID: {patient_dict['id']}]")
      
      # CONDITIONS

      # obtain the number of conditions to generate
      NUM_OF_CONDITIONS = random.sample(range(1,6), 1)[0]
      # print(f'  * Number of conditions: {NUM_OF_CONDITIONS}')

      # get a copy of all the possible conditions
      list_of_conditions_ids_copy = list_of_conditions_ids.copy()

      # create N conditions
      for idx in range(NUM_OF_CONDITIONS):

         # create condition dictionary
         temp_condition = {}

         # set 'id' 
         temp_condition['id'] = f'c{idx+1}'

         # generate a random date

         # if conditions are already present for the patient --> get the last date
         if len(patient_dict['conditions']) > 0:
            last_date = patient_dict['conditions'][-1]['diagnosed']
         else:
            last_date = None

         # generate the date
         temp_condition['diagnosed'] = generate_random_date(min_date=last_date, type="condition")

         # set 'cured' to None
         temp_condition['cured'] = None

         # set 'kind' to a random condition id
         temp_cond_id = get_random_id(list_of_conditions_ids_copy)
         temp_condition['kind'] = temp_cond_id
         list_of_conditions_ids_copy.remove(temp_cond_id) # ensure that there are no duplicate conditions

         # add the condition dictionary to the patient
         patient_dict['conditions'].append(temp_condition)

      # TRIALS
      
      for condition in patient_dict['conditions']:

         # setup
         CREATE_NEW_TRIALS = True

         # obtain the maximum number of trials to generate (1 to 5, as stated in the constraints)
         NUM_OF_TRIALS = random.sample(range(1,6), 1)[0]
         # print(f"  * Number of trials for condition '{condition['id']}': {NUM_OF_TRIALS}")

         # get a copy of all the possible therapies
         list_of_therapies_ids_copy = list_of_therapies_ids.copy()

         temp_min_date = condition['diagnosed']
         for idx in range(NUM_OF_TRIALS):

            if CREATE_NEW_TRIALS:

               # create trial dictionary
               temp_trial = {}

               # set 'id' 
               temp_trial['id'] = condition['id'] + f'_t{idx+1}'

               # set 'start'
               temp_trial['start'] = generate_random_date(min_date=temp_min_date, type="trial")

               # set 'end' 
               temp_trial_end = generate_random_date(min_date=temp_trial['start'], type="trial")
               temp_trial['end'] = temp_trial_end
               temp_min_date = temp_trial_end

               # set 'condition'
               temp_trial['condition'] = condition['id']

               # set 'therapy'
               temp_ther_id = get_random_id(list_of_therapies_ids_copy)
               temp_trial['therapy'] = temp_ther_id
               list_of_therapies_ids_copy.remove(temp_ther_id) # ensure that there are no duplicate therapies

               # set 'successful'
               SUCC = random.sample(range(0,100), 1)[0]
               temp_trial['successful'] = SUCC

               if SUCC > 75:
                  condition['cured'] = temp_trial['end']
                  CREATE_NEW_TRIALS = False

               # add the trial dictionary to the patient
               patient_dict['trials'].append(temp_trial)

      # print("-"*100)
   
   return patients_dict

#### **Function execution**

We can finally execute the function using our patients dictionary.

In [34]:
# setup
list_of_conditions_ids = [x['id'] for x in cond_and_ther['Conditions']]
list_of_therapies_ids = [x['id'] for x in cond_and_ther['Therapies']]

# and execute
full_patients_dict = generate_full_patients(patients_dict, list_of_conditions_ids, list_of_therapies_ids)
full_patients_dict

{'Patients': [{'id': 'pat_0',
   'name': 'Piera',
   'gender': 'Female',
   'conditions': [{'id': 'c1',
     'diagnosed': datetime.date(2007, 6, 12),
     'cured': None,
     'kind': 'cond_18'}],
   'trials': [{'id': 'c1_t1',
     'start': datetime.date(2009, 3, 5),
     'end': datetime.date(2019, 8, 28),
     'condition': 'c1',
     'therapy': 'ther_52',
     'successful': 63},
    {'id': 'c1_t2',
     'start': datetime.date(2019, 10, 13),
     'end': datetime.date(2019, 10, 28),
     'condition': 'c1',
     'therapy': 'ther_50',
     'successful': 49},
    {'id': 'c1_t3',
     'start': datetime.date(2019, 11, 3),
     'end': datetime.date(2019, 11, 9),
     'condition': 'c1',
     'therapy': 'ther_46',
     'successful': 44}]},
  {'id': 'pat_1',
   'name': 'Gianluigi',
   'gender': 'Male',
   'conditions': [{'id': 'c1',
     'diagnosed': datetime.date(2006, 4, 4),
     'cured': None,
     'kind': 'cond_35'},
    {'id': 'c2',
     'diagnosed': datetime.date(2007, 4, 25),
     'cured':

#### **Compose and store the final data**

The last step of our data collection is to merge the conditions and therapies dictionary with the patients dictionary and store the result as a single JSON file.

In [42]:
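# merge the dictionaries (the | operator on dicts requires Python 3.9+)
final_data_dict = cond_and_ther | full_patients_dict

# save it as .json (default=str converts the datetime.date values to strings)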
with open('../../data/full_data.json', 'w') as fp:
    json.dump(final_data_dict, fp, indent=4, default=str)
print("> JSON stored correctly")

> JSON stored correctly
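
Because `default=str` writes every `datetime.date` as an ISO string such as `2007-06-12`, the dates come back as plain strings when the file is read again. The sketch below shows one possible way to restore them, using a small in-memory example instead of the real file. `parse_dates` and `DATE_FIELDS` are hypothetical names, and the sketch assumes the date fields are exactly `diagnosed`, `cured`, `start` and `end`.

```python
import datetime
import json

DATE_FIELDS = {"diagnosed", "cured", "start", "end"}  # assumed date keys


def parse_dates(obj):
    """Hypothetical object_hook: turn ISO date strings back into date objects."""
    for key in DATE_FIELDS.intersection(obj):
        if obj[key] is not None:
            obj[key] = datetime.date.fromisoformat(obj[key])
    return obj


condition = {
    "id": "c1",
    "diagnosed": datetime.date(2007, 6, 12),
    "cured": None,
    "kind": "cond_18",
}

# Same serialisation settings as the cell above, but written to a string.
text = json.dumps(condition, indent=4, default=str)

# object_hook is called for every decoded JSON object, including nested ones.
restored = json.loads(text, object_hook=parse_dates)
print(restored == condition)  # True
```

The same `object_hook` can be passed to `json.load` when reading the stored file.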
