# **Patients data collection**

#### **Libraries**

We import the necessary libraries.

In [29]:
import pandas as pd
import json
import random
from names_dataset import NameDataset

print("> Libraries Imported")

> Libraries Imported


#### **Import NameDataset**

As suggested by the assignment, we use the [First and Last Names Database](https://github.com/philipperemy/name-dataset), a Python library that provides information about names. We instantiate the library's `NameDataset` and extract the 200 most popular Italian female names and the 200 most popular Italian male names.

In [30]:
# import data
nd = NameDataset()
# female names
list_name_F = nd.get_top_names(n=200, gender='Female', country_alpha2='IT')['IT']['F']
list_name_F[:10]

['Maria',
 'Anna',
 'Francesca',
 'Sara',
 'Laura',
 'Antonella',
 'Elena',
 'Angela',
 'Giulia',
 'Daniela']

In [31]:
# male names
list_name_M = nd.get_top_names(n=200, gender='Male', country_alpha2='IT')['IT']['M']
list_name_M[:10]

['Giuseppe',
 'Francesco',
 'Marco',
 'Andrea',
 'Antonio',
 'Alessandro',
 'Luca',
 'Giovanni',
 'Roberto',
 'Stefano']
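
The `['IT']['F']` and `['IT']['M']` indexing in the two cells above suggests that `get_top_names` returns a nested dictionary keyed first by country code and then by gender initial. The sketch below illustrates that lookup on a hand-written stand-in dictionary; `top_names` and `list_name_F_sample` are hypothetical names, and the library itself is not called here.

```python
# Hand-written stand-in for the value returned by get_top_names.
# The names are copied from the output above; the dict itself is illustrative.
top_names = {"IT": {"F": ["Maria", "Anna", "Francesca"]}}

# First key: ISO alpha-2 country code. Second key: gender initial.
list_name_F_sample = top_names["IT"]["F"]
print(list_name_F_sample[:2])
```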

We then create the gender labels: one list per group, with one entry for each name.

In [32]:
# female placeholder
gender_F = ["Female"]*len(list_name_F)
# male placeholder
gender_M = ["Male"]*len(list_name_M)

#### **Create DataFrame**

We bind the name and the gender lists and create a DataFrame:

In [33]:
# bind name and gender lists
names = list_name_F + list_name_M
gender = gender_F + gender_M

# create df
patients_df = pd.DataFrame(
    data=zip(names, gender),
    columns=["patient_name","patient_gender"]
)
patients_df

Unnamed: 0,patient_name,patient_gender
0,Maria,Female
1,Anna,Female
2,Francesca,Female
3,Sara,Female
4,Laura,Female
...,...,...
395,Manu,Male
396,Natale,Male
397,Guglielmo,Male
398,Giordano,Male
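
One detail worth keeping in mind: `zip` stops at the shorter of its inputs, so a length mismatch between `names` and `gender` would silently drop rows instead of raising an error. A minimal sketch of a guard, using hypothetical sample lists (`names_sample`, `gender_sample`) rather than the notebook's data:

```python
names_sample = ["Maria", "Anna", "Giuseppe"]
gender_sample = ["Female", "Female", "Male"]

# Fail early if the two lists are misaligned, since zip would not complain.
if len(names_sample) != len(gender_sample):
    raise ValueError("names and gender lists differ in length")

# Pair each name with its gender label, as the DataFrame constructor does above.
pairs = list(zip(names_sample, gender_sample))
print(pairs[-1])
```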


Sort the patients in alphabetical order:

In [38]:
patients_df = patients_df.sort_values("patient_name").reset_index(drop=True)
patients_df

Unnamed: 0,patient_name,patient_gender
0,Adele,Female
1,Adrian,Male
2,Adriana,Female
3,Adriano,Male
4,Agnese,Female
...,...,...
395,Vittoria,Female
396,Vittorio,Male
397,Viviana,Female
398,Walter,Male


#### **Generate patient IDs**

Finally, we generate the patient IDs by numbering the rows in their current order:

In [31]:
# add column with patient ids
patients_ids = [f"pat_{n}" for n in range(len(patients_df))]
patients_df["patient_id"] = patients_ids

# reorder columns
patients_df = patients_df[["patient_id", "patient_name", "patient_gender"]]

# show results
patients_df

Unnamed: 0,patient_id,patient_name,patient_gender
0,pat_0,Piera,Female
1,pat_1,Gianluigi,Male
2,pat_2,Walter,Male
3,pat_3,Nicholas,Male
4,pat_4,Elisa,Female
...,...,...,...
395,pat_395,Caterina,Female
396,pat_396,Donatella,Female
397,pat_397,Marcella,Female
398,pat_398,Massimo,Male
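
Because each ID is derived from the row position, the ID a patient receives depends on the order of the rows at the moment the IDs are generated. The sketch below reproduces the sort-then-number logic on plain Python lists; the sample rows and the `rows`, `rows_sorted`, and `patients` names are assumptions made for illustration, not the notebook's data.

```python
# Hypothetical sample rows (name, gender); not the notebook's real data.
rows = [("Maria", "Female"), ("Giuseppe", "Male"), ("Anna", "Female")]

# Sort alphabetically by name, mirroring sort_values("patient_name").
rows_sorted = sorted(rows, key=lambda r: r[0])

# Assign sequential ids in the sorted order, following the f"pat_{n}" pattern.
patients = [
    {"patient_id": f"pat_{n}", "patient_name": name, "patient_gender": gender}
    for n, (name, gender) in enumerate(rows_sorted)
]
print(patients[0])
```

Re-running the numbering on a differently ordered table would assign different IDs to the same names, so the ordering step should come first.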


#### **Create a dictionary and store it in a .json file**

In [32]:
# create the dict
RESULT_DICT = {"Patients":[]}
for _, row in patients_df.iterrows():
    RESULT_DICT["Patients"].append(
        {
            "id":row["patient_id"],
            "name":row["patient_name"],
            "gender":row["patient_gender"],
            "conditions": [],
            "trials": [],
        }
    )

# save it as .json (the intermediate_data folder must already exist)
with open('intermediate_data/patients.json', 'w') as fp:
    json.dump(RESULT_DICT, fp, indent=4)
print("> JSON stored correctly")

> JSON stored correctly
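
As a sanity check, the stored file can be read back and compared with the dictionary that was written. The sketch below shows one way to do this under stated assumptions: it uses a hypothetical single-patient payload with the same keys as `RESULT_DICT`, writes to a temporary directory instead of the real `intermediate_data` folder, and creates the folder first so that `open` does not fail on a missing path.

```python
import json
import os
import tempfile

# Hypothetical single-patient payload with the same keys as RESULT_DICT.
sample = {
    "Patients": [
        {"id": "pat_0", "name": "Adele", "gender": "Female",
         "conditions": [], "trials": []}
    ]
}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the output folder before writing into it.
    out_dir = os.path.join(tmp_dir, "intermediate_data")
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "patients.json")

    # Write the payload, then read it back from disk.
    with open(out_path, "w", encoding="utf-8") as fp:
        json.dump(sample, fp, indent=4)
    with open(out_path, encoding="utf-8") as fp:
        loaded = json.load(fp)

# The round trip should preserve the structure exactly.
round_trip_ok = loaded == sample
print(round_trip_ok)
```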
