## **Simulating Data for Python Didactics**
### Quick Crosswalk Cases
### BMI: 
#### **Author(s)**: Dominic DiSanto
#### **Written**: 12/11/2020
#### **Updated**: 12/11/2020
#### **Version**: 0.1




*Quick Crosswalk Cases* are use cases built around simple tasks (that is, single-step or two- to three-step processes) that can be accomplished in Excel, Python, R, and possibly SQL as well. 

The narrative of this case: a clinic wants to know how many of its patients meet the criteria for elevated risk of some complication, and to retrieve those patients' contact information. 

Students will be given a spreadsheet of data and asked to create an Excel file with 4 tabs, containing lists of the patient identifiers and their contact information for 4 sets of criteria:

1) BMI $\geq$ 30  
2) BMI $\geq$ 35  
3) BMI $\geq$ 30 & Age $\geq$ 60  
4) BMI $\geq$ 35 & Age $\geq$ 60  

Students will first need to calculate BMI (given metric height and weight data), then identify the patient IDs of interest, then merge in the corresponding contact information. The last step is exporting (copy/paste, or Python or R exports) the contact information into a final results spreadsheet, specifically as a tab in a spreadsheet, separately for the results from Excel, Python, and R. The idea is that if you were creating these lists for a real document, you would want them to be presentable, so the case should force students/users to do the same: you would more likely send a single xlsx file with 4 tabs than 4 separate spreadsheets. 
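
The workflow described above (calculate BMI, filter by criteria, merge in contact information) might look roughly like the following in pandas. This is only a sketch of one possible solution: the helper name `build_risk_lists`, the tab names, and the toy data are hypothetical, while the column names follow the sheets exported at the end of this notebook.

```python
import pandas as pd

# Hypothetical helper; not part of the case materials.
def build_risk_lists(measures: pd.DataFrame, contacts: pd.DataFrame) -> dict:
    df = measures.copy()
    # BMI = weight (kg) / height (m)^2, rounded to one decimal place
    df["BMI"] = (df["Weight (kg)"] / (df["Height (cm)"] / 100) ** 2).round(1)

    # One boolean mask per set of criteria (tab names are illustrative)
    criteria = {
        "BMI30": df["BMI"] >= 30,
        "BMI35": df["BMI"] >= 35,
        "BMI30_Age60": (df["BMI"] >= 30) & (df["Age"] >= 60),
        "BMI35_Age60": (df["BMI"] >= 35) & (df["Age"] >= 60),
    }
    # Keep the IDs meeting each set of criteria, then merge in contact info
    return {
        name: df.loc[mask, ["ID"]].merge(contacts, on="ID", how="left")
        for name, mask in criteria.items()
    }

# Toy data for illustration only
measures = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Age": [62, 45, 70, 30],
    "Height (cm)": [180.0, 170.0, 160.0, 175.0],
    "Weight (kg)": [100.0, 110.0, 70.0, 60.0],
})
contacts = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "PhoneNo": ["555-0001", "555-0002", "555-0003", "555-0004"],
    "Address": ["1 A St", "2 B St", "3 C St", "4 D St"],
})
lists = build_risk_lists(measures, contacts)
```

Each DataFrame in `lists` could then be written to its own tab with `to_excel` inside a single `pd.ExcelWriter`, in the same way the Data Export section writes the simulated data.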

As an additional complexity, we will then "update" the data. That is, we have 145 new patients (`n2_obs`), and students need to recreate the lists. The idea is that this will be fairly painful in Excel but less so in Python/R, where students simply need to update a file name or export path. 


I may get fancy and create a script that checks students' output, either against a solutions spreadsheet or, to get really fancy, by recreating the data internally in the script (using the same data simulation code and random state/seed as this notebook). The latter would require less effort and file management on the student's/user's part, making the case more accessible.
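
As a rough illustration of what the comparison step in such a checking script might do, the sketch below compares a student's list of IDs against the expected list. The function name `compare_ids` and the example IDs are hypothetical; how the expected IDs are produced (solutions file or re-simulation) is left open.

```python
# Hypothetical checker: compares a student's ID list with the expected one.
def compare_ids(student_ids, expected_ids):
    student, expected = set(student_ids), set(expected_ids)
    return {
        "missing": sorted(expected - student),     # expected but not submitted
        "unexpected": sorted(student - expected),  # submitted but not expected
        "correct": student == expected,
    }

# Illustrative IDs only
report = compare_ids([101, 102, 104], [101, 102, 103])
print(report)
```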

In [1]:
import scipy as sci
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from faker import Faker # Used to simulate the address/phone-number data 
np.random.seed(15951)

# observations in first data set
n1_obs = 756

# observations in data "update"
n2_obs = 145

In [2]:
ids_tot = np.random.randint(111111111, 999999999, n1_obs + n2_obs)
assert len(ids_tot) == len(set(ids_tot))
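
The `assert` above only detects duplicate IDs; it does not prevent them, so a different seed could make the cell fail. One possible alternative, sketched below, redraws until all IDs are unique. The helper name `draw_unique_ids` is hypothetical, and because it uses `np.random.default_rng` rather than the global `np.random.seed` state, it would not reproduce the IDs shown in this notebook.

```python
import numpy as np

# Hypothetical helper; uses a different random stream than the notebook.
def draw_unique_ids(n, low=111111111, high=999999999, seed=15951):
    rng = np.random.default_rng(seed)
    # np.unique drops any duplicates (and sorts the values)
    ids = np.unique(rng.integers(low, high, size=n))
    # Top up until exactly n distinct IDs have been collected
    while ids.size < n:
        extra = rng.integers(low, high, size=n - ids.size)
        ids = np.unique(np.concatenate([ids, extra]))
    rng.shuffle(ids)  # undo the sorted order introduced by np.unique
    return ids

unique_ids = draw_unique_ids(756 + 145)
```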


In [3]:
age = np.random.normal(50, 10, n1_obs + n2_obs)
height_cm = np.round(np.random.normal(180, 20, n1_obs + n2_obs), 1)
bmi = np.round(np.random.normal(27, 5, n1_obs + n2_obs), 1)

In [4]:
bmi_df = pd.DataFrame({'ID':ids_tot,
                       'Age':age.astype(int),
                       'Height (cm)':height_cm,
                       'BMI':bmi})
bmi_df.head()

| | ID | Age | Height (cm) | BMI |
|---|---|---|---|---|
| 0 | 458439593 | 62 | 220.1 | 40.5 |
| 1 | 466502951 | 50 | 192.0 | 27.3 |
| 2 | 780243050 | 62 | 209.8 | 44.0 |
| 3 | 197598223 | 28 | 185.9 | 29.9 |
| 4 | 918135592 | 67 | 148.4 | 28.2 |


In [5]:
# Creating a weight variable back-calculated from the BMI distribution
bmi_df['Weight (kg)'] = np.round(bmi_df['BMI'] * (bmi_df['Height (cm)']/100)**2, 2)
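
Since students recompute BMI from the rounded weight, the recomputed value will generally not match the simulated BMI exactly. The sketch below uses the five rows shown in the `bmi_df.head()` output to illustrate the round trip; rounding the recomputed BMI to one decimal place is an assumption about how a solution might handle this, not something the case prescribes.

```python
import numpy as np

# First five rows from the bmi_df.head() output above
height_cm = np.array([220.1, 192.0, 209.8, 185.9, 148.4])
bmi = np.array([40.5, 27.3, 44.0, 29.9, 28.2])

height_m = height_cm / 100
# Same back-calculation as the notebook: weight = BMI * height (m)^2
weight_kg = np.round(bmi * height_m**2, 2)

# What a student would compute from the exported height and weight
bmi_raw = weight_kg / height_m**2
bmi_back = np.round(bmi_raw, 1)

print(bmi_raw)   # close to, but not exactly, the simulated BMI values
print(bmi_back)  # rounding to one decimal recovers them for these rows
```

Because a value sitting on a threshold (30 or 35) may come back marginally above or below it, comparing the unrounded `bmi_raw` against the cutoffs could shift borderline patients between lists.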

Before deleting the BMI variable, I explore it to ensure we have some samples that meet each set of criteria. I've simulated data for both the original data set (`n1_obs`) and the addendum (`n2_obs`), so I'll check both groups against each set of criteria. We want a small sample size for the fourth (strictest) set of criteria, such that the additional data results in no new observations of interest.

In [6]:
# 1) BMI>=30
# The first count covers all rows (both cohorts); index > n1_obs skips the row at index n1_obs
print("Both Cohorts: " + str(bmi_df[bmi_df['BMI']>=30].shape[0]))
print("Second Cohort: " + str(bmi_df[(bmi_df.index>n1_obs) & (bmi_df['BMI']>=30)].shape[0]))

Both Cohorts: 259
Second Cohort: 44


In [7]:
# 2) BMI>=35
# The first count covers all rows (both cohorts); index > n1_obs skips the row at index n1_obs
print("Both Cohorts: " + str(bmi_df[bmi_df['BMI']>=35].shape[0]))
print("Second Cohort: " + str(bmi_df[(bmi_df.index>n1_obs) & (bmi_df['BMI']>=35)].shape[0]))

Both Cohorts: 55
Second Cohort: 12


In [8]:
# 3) BMI>=30 & Age>=60
# The first count covers all rows (both cohorts); index > n1_obs skips the row at index n1_obs
print("Both Cohorts: " + str(bmi_df[(bmi_df['BMI']>=30) &\
                                 (bmi_df['Age']>=60)].shape[0]))
print("Second Cohort: " + str(bmi_df[(bmi_df.index>n1_obs) & (bmi_df['BMI']>=30) & \
                                 (bmi_df['Age']>=60)].shape[0]))

Both Cohorts: 47
Second Cohort: 9


In [9]:
# 4) BMI>=35 & Age>=60
# The first count covers all rows (both cohorts); index > n1_obs skips the row at index n1_obs
print("Both Cohorts: " + str(bmi_df[(bmi_df['BMI']>=35) &\
                                 (bmi_df['Age']>=60)].shape[0]))
print("Second Cohort: " + str(bmi_df[(bmi_df.index>n1_obs) & (bmi_df['BMI']>=35) & \
                                 (bmi_df['Age']>=60)].shape[0]))

Both Cohorts: 10
Second Cohort: 3
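
Because the two cohorts are distinguished only by row position, the boundary is easy to get wrong (for instance, `index > n1_obs` leaves out the row at index `n1_obs`, which is the first row of the update). A minimal sketch of a helper that makes the split explicit is shown below; the name `count_by_cohort` and the toy data are hypothetical, and it assumes rows `0` to `n_first - 1` form the first cohort.

```python
import numpy as np
import pandas as pd

# Hypothetical helper: count rows meeting a criterion, split by cohort.
def count_by_cohort(df, mask, n_first):
    mask = np.asarray(mask, dtype=bool)
    # Rows 0 .. n_first-1 are the first cohort; the remaining rows are the update
    in_first = np.arange(len(df)) < n_first
    return {
        "first": int((mask & in_first).sum()),
        "second": int((mask & ~in_first).sum()),
    }

# Toy data for illustration only: first cohort is rows 0-2, update is rows 3-4
toy = pd.DataFrame({"BMI": [31.0, 28.0, 36.0, 30.0, 25.0],
                    "Age": [61, 70, 59, 64, 40]})
counts = count_by_cohort(toy, (toy["BMI"] >= 30) & (toy["Age"] >= 60), n_first=3)
print(counts)
```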


### Simulating Contact Information
I use the `Faker` module here for convenience, although it is somewhat slow. From my review of the [Faker documentation](https://faker.readthedocs.io/en/master/), the only way to simulate multiple values appears to be looping over the desired number of simulations.


In [10]:
len(ids_tot)
Faker.seed(456) 
# A quick test suggests this Faker seed must be set in addition to the 
# numpy seed set earlier 

In [11]:
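fake = Faker()  # create one generator up front; building a new Faker() on every call is slow

phone_nos = []
adds = []

# Each Faker call returns a single value, so loop once per patient
for _ in range(n1_obs + n2_obs):
    phone_nos.append(fake.phone_number())
    adds.append(fake.address())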

In [12]:
contact_df = pd.DataFrame({'ID':ids_tot,
                           'PhoneNo':phone_nos,
                           'Address':adds
                          })

contact_df.head()

| | ID | PhoneNo | Address |
|---|---|---|---|
| 0 | 458439593 | 001-766-098-1571 | 22326 Jensen Mountains Suite 987\nJonesmouth, ... |
| 1 | 466502951 | 635.237.2550 | 961 Jennifer Pike Suite 707\nSouth Donmouth, G... |
| 2 | 780243050 | 828.528.2249x47820 | 23551 Mahoney Junction\nWest Brandon, MI 53433 |
| 3 | 197598223 | 0080358260 | 741 Sanchez Stravenue Suite 840\nMichellestad,... |
| 4 | 918135592 | 6632740778 | 581 Holland Cove\nCoryburgh, KS 17977 |


### Data Export

In [13]:
# First data set 
with pd.ExcelWriter('BMI_Data.xlsx') as writer:  
    bmi_df.loc[bmi_df.index<n1_obs, 
           ['ID', 'Age', 'Height (cm)', 'Weight (kg)']].to_excel(excel_writer = writer, 
                                                                 sheet_name = 'HeightWeight',
                                                                 index=False)
    contact_df.loc[contact_df.index<n1_obs].to_excel(excel_writer = writer,
                      sheet_name = 'Contact Info',
                      index=False)

# 'Updated' data set (all observations)
with pd.ExcelWriter('BMI_Data_UPDATE.xlsx') as writer:  
    bmi_df[['ID', 'Age', 'Height (cm)', 'Weight (kg)']].to_excel(excel_writer = writer, 
                                                                 sheet_name = 'HeightWeight',
                                                                 index=False)
    contact_df.to_excel(excel_writer = writer,
                      sheet_name = 'Contact Info',
                      index=False)