# NIH Survey Workbook
## $\color{grey}{\text{Author: Gaurav Anand}}$

### Objective 

Our research partners need help analyzing `NIH All of Us COPE` survey data. They are about to launch a new survey and want "to anticipate, based on previous survey results, places where we might expect issues or challenges when we launch the new survey". The analysis should include response counts, skip rates, and the response frequency distribution for each question, along with any other metrics that help the researchers identify potentially problematic questions (e.g., low response rates, no variability). If they identify questions (or question wording) that provide little information, they can update or modify those questions in the new survey to get better information.

Tasks:
<pre> <p style = "font-family:helvetica;font-size:80%">
1) Use the attached codebook to generate synthetic response data. Add some categorical demographic attributes (e.g., racial identity, gender identity, age category).
              a)    - All the questions in the attached codebook should be included. 
                    - The responses don't have to be realistic. Our research partners will replace your synthetic data with their real data and re-run your notebook. 
                    - Be sure to include values that correspond to non-responses.
              b) Attach a separate script that generates the synthetic data. I would expect the actual data will be a csv file with one row per respondent and one column per question plus columns for user id and demographic information.
 
2) Create a Jupyter Notebook that performs the operational analysis for each question.
              a) Make it clear what type of input data is expected so our research partners can easily use their actual data in place of your synthetic data.
              b) There should be an analysis for each question, and possibly for each question group (if it makes sense).
              c) Provide a way for our partners to stratify these metrics across various demographic categories, for instance skip rates by racial identity.
 
3) Notes that briefly describe the parts of the notebook that need improving if you were given more time.</p>
</pre>
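As a rough illustration of the input layout described in task 1b and the metrics named in the objective, the sketch below builds a tiny hand-made response table and computes response counts, skip rates, and a skip rate stratified by one demographic column. The column names (`user_id`, `gender`, `q1`, `q2`), the use of missing values to mark skipped questions, and the helper `question_metrics` are assumptions made for this example; they are not taken from the codebook.

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per respondent, one column per question,
# plus a user id and demographic columns. NaN marks a skipped question.
responses = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Female", "Male"],
    "q1": ["Yes", np.nan, "No", "Yes"],
    "q2": [np.nan, np.nan, "Often", "Never"],
})
question_cols = ["q1", "q2"]


def question_metrics(df, questions):
    """Return response count, skip rate, and distinct-answer count per question."""
    answered = df[questions].notna()
    return pd.DataFrame({
        "response_count": answered.sum(),
        "skip_rate": 1 - answered.mean(),
        "n_distinct_answers": df[questions].nunique(),
    })


metrics = question_metrics(responses, question_cols)

# Skip rate per question, stratified by a demographic column.
skip_by_gender = responses[question_cols].isna().groupby(responses["gender"]).mean()

print(metrics)
print(skip_by_gender)
```

A question with `n_distinct_answers` equal to 1 shows no variability, and a high `skip_rate` flags a question that respondents tend to avoid. Swapping `responses["gender"]` for another demographic column stratifies the same metric differently.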

## Codebook Data

In [6]:
import pandas as pd

In [7]:
try:
    data = pd.read_excel("All of Us _ Public PPI Codebook - COPE.xlsx")
except ImportError:
    # pandas needs an Excel engine; openpyxl reads .xlsx files (xlrd 2.0+ only reads .xls)
    !pip install openpyxl
    data = pd.read_excel("All of Us _ Public PPI Codebook - COPE.xlsx")


Notes:
* The dataframe contains 365 items:
    * 179 'Question' items, which make up the main content of the survey.
    * 181 'Answer' items, which are the responses that users can select.
    * 4 'Topic' items, which contain miscellaneous text and instructions.
    * 1 'Module Name' item.

The distribution of answer types can be seen below

In [8]:
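# NOTE: data_lower is an alias of data (not a copy), so "Answer Type" is
# lowercased in data as well; later cells rely on the lowercase values.
data_lower = data
data_lower.loc[:,"Answer Type"] = data_lower["Answer Type"].str.lower()
grouped = data_lower.groupby(["Type","Answer Type"])
print("Answer Type Distribution")
# size() counts the rows in each (Type, Answer Type) group
display(pd.DataFrame(grouped.size()))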

Answer Type Distribution


| Type | Answer Type | Count |
|---|---|---|
| Module Name | text | 1 |
| Question | date | 1 |
| Question | descriptive | 1 |
| Question | integer | 19 |
| Question | multi-select | 14 |
| Question | single-select | 138 |
| Question | text | 6 |
| Topic | descriptive | 4 |


### Generating Fake Survey Data

#### Generating demographic information

To generate the demographic information, I used the `SyntheticData` class from the `data_synthesizer` Python file attached alongside this Jupyter Notebook.

Kinds of demographic data generated: 
* Gender
* Race
* States
* Age
* Employment

This data is not very accurate and is just meant to be placeholder data for use in this analysis.
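The `SyntheticData` class lives in the attached script and is not reproduced here. As a rough sketch of what a call such as `datagen.generate_data(genders, p_genders)` might do, the hypothetical function below draws `n` labels from a list of categories with given probabilities using NumPy. The function name `sample_categories` and its signature are assumptions for illustration only, not the actual implementation.

```python
import numpy as np


def sample_categories(categories, probabilities, n, seed=None):
    """Draw n labels from `categories` according to `probabilities`."""
    probabilities = np.asarray(probabilities, dtype=float)
    if len(categories) != len(probabilities):
        raise ValueError("categories and probabilities must have the same length")
    # Normalize so callers can also pass unnormalized weights.
    probabilities = probabilities / probabilities.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(categories, size=n, p=probabilities)


genders = ["Male", "Female", "Other", "Prefer Not to Say"]
sample = sample_categories(genders, [0.4, 0.4, 0.04, 0.16], n=100, seed=126)
print(sample[:5])
```

Passing a fixed `seed` makes the draw repeatable, which matters when research partners re-run the notebook and compare results.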

In [9]:
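import importlib.util  # the imp module is deprecated and was removed in Python 3.12

# Install the optional dependencies only if they are missing.
if importlib.util.find_spec("faker") is None:
    !pip install faker
if importlib.util.find_spec("us") is None:
    !pip install us

import faker
import numpy as np
from us import states
from data_synthesizer import *  # provides SyntheticData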

n = 100 # number of survey takers

datagen = SyntheticData(n)
seed = 126 # arbitrary

np.random.seed(seed) # set seed for reproducibility

# -------- Gender -------
genders = ["Male","Female","Other","Prefer Not to Say"] # initialize gender
p_genders = [0.4,0.4,0.04,0.16] # initialize probability of each gender
gender = datagen.generate_data(genders,p_genders)

# ------- Race --------
races = ["Hispanic/Latino", "American Indian or Alaska Native", "Asian",\
         "Black or African American", "Native Hawaiian or Other Pacific Islander",\
          "White","Two or more races."]
p_races = np.array([np.random.randint(low=1,high=100) for i in range(len(races))])
p_races = p_races / sum(p_races)
race = datagen.generate_data(races,p_races)

# ------- State --------
state_names = [state.name for state in states.STATES_AND_TERRITORIES]
p_states = np.array([np.random.randint(low=1,high=100) for i in range(len(state_names))])
p_states = p_states / sum(p_states)
state = datagen.generate_data(state_names,p_states)

# ------- Age Categories -------
age_categories = ["Under 12 years old.","12-17 years old.", "18-24 years old.",
                    "25-34 years old.", "35-44 years old.","45-54 years old.","Older than 55 years"]
p_age = np.array([np.random.randint(low=1,high=100) for i in range(len(age_categories))])
p_age = p_age / sum(p_age)
age = datagen.generate_data(age_categories, p_age)

# -------- Employment Status --------
employment_categories = [
    "Employed for wages", "Self-employed","Out of work and looking for work",\
    "Out of work but not currently looking for work","A homemaker","A student",\
    "Military","Retired","Unable to work"
]
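# Linearly decreasing weights: the first category is the most likely and the
# last one ("Unable to work") gets weight 0, so it is never sampled.
p_employment = np.array([15*x for x in range(len(employment_categories))[::-1]])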
p_employment = p_employment/ sum(p_employment)
employment = datagen.generate_data(employment_categories,p_employment)

# --------- Generating Demographics
demographics = pd.DataFrame({"Gender":gender, "Ethnicity":race, "State":state,"Age":age,"Employment":employment})
demographics

| | Gender | Ethnicity | State | Age | Employment |
|---|---|---|---|---|---|
| 0 | Male | White | Virginia | Under 12 years old. | Military |
| 1 | Male | Two or more races. | North Carolina | 45-54 years old. | A homemaker |
| 2 | Male | Asian | Georgia | 12-17 years old. | Employed for wages |
| 3 | Male | White | Colorado | 35-44 years old. | A homemaker |
| 4 | Female | Black or African American | Maryland | 25-34 years old. | Self-employed |
| ... | ... | ... | ... | ... | ... |
| 95 | Male | White | Tennessee | Older than 55 years | A homemaker |
| 96 | Female | White | Alabama | 35-44 years old. | Out of work but not currently looking for work |
| 97 | Male | Black or African American | Maryland | 35-44 years old. | Employed for wages |
| 98 | Prefer Not to Say | White | Alabama | 12-17 years old. | A homemaker |


#### Generating COPE survey results

In [13]:
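options = data.query("Type == 'Answer'") # 181 survey answer options
# Behaviour for single-select questions
# # Non-COPE data: single-select questions whose parent code is present and is not "COPE".
# # (Comparing against the string 'NaN' does not filter missing values, so use notna().)
is_single_select = (data["Type"] == "Question") & (data["Answer Type"] == "single-select")
has_parent = data["Parent code"].notna() & (data["Parent code"] != "COPE")
left_df = data[is_single_select & has_parent]
right_df = options[["Display","Parent code","PMI Code"]].rename(columns={"Display":"Answer"})
# NOTE: this joins each question to the answers that share its parent code. If answers
# reference their question through the question's PMI Code, join with
# left_on="PMI Code", right_on="Parent code" instead.
non_cope = datagen.merge(left_df,right_df,on=["Parent code"],indicator = "match")
non_cope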
# # single option sampling

# Behaviour for multi-select questions
# # choices and choice probabilities

# Behaviour for integer questions
# # includes number generation

# Behaviour for description questions
# # includes text generation


Unnamed: 0,Display,Topic,Type,Answer Type,PMI System,PMI Code_x,Parent code,Date of Last Update,Answer,PMI Code_y,match
0,Was the test for COVID-19 positive?,COVID Participant Experience Survey,Question,single-select,http://terminology.pmi-ops.org/CodeSystem/ppi,cdc_covid_19_9_xx24,cdc_covid_19_9_xx25,2020-05-04,,,left_only
1,How long has it been since you last smoked?,COVID Participant Experience Survey,Question,single-select,http://terminology.pmi-ops.org/CodeSystem/ppi,copect_50_xx19,copect_50_xx18,2020-05-04,,,left_only
2,How long has it been since you last used an el...,COVID Participant Experience Survey,Question,single-select,http://terminology.pmi-ops.org/CodeSystem/ppi,lifestyle_2_xx12,lifestyle_2_xx13,2020-05-04,,,left_only
3,In the past month:\n How many standard drinks ...,COVID Participant Experience Survey,Question,single-select,http://terminology.pmi-ops.org/CodeSystem/ppi,audit_c_2,audit_c_1,2020-05-04,Monthly or less,COPE_A_125,both
4,In the past month:\n How many standard drinks ...,COVID Participant Experience Survey,Question,single-select,http://terminology.pmi-ops.org/CodeSystem/ppi,audit_c_2,audit_c_1,2020-05-04,4 or more times a week,COPE_A_16,both
...,...,...,...,...,...,...,...,...,...,...,...
439,Do you think you have used more or less of the...,COVID Participant Experience Survey,Question,single-select,http://terminology.pmi-ops.org/CodeSystem/ppi,tsu_ds5_13_xx44,tsu_ds5_13_xx,2020-05-04,Hallucinogens (for example,COPE_A_222,both
440,Do you think you have used more or less of the...,COVID Participant Experience Survey,Question,single-select,http://terminology.pmi-ops.org/CodeSystem/ppi,tsu_ds5_13_xx44,tsu_ds5_13_xx,2020-05-04,Heroin,COPE_A_223,both
441,Do you think you have used more or less of the...,COVID Participant Experience Survey,Question,single-select,http://terminology.pmi-ops.org/CodeSystem/ppi,tsu_ds5_13_xx44,tsu_ds5_13_xx,2020-05-04,Prescription opioids (for example,COPE_A_224,both
442,Do you think you have used more or less of the...,COVID Participant Experience Survey,Question,single-select,http://terminology.pmi-ops.org/CodeSystem/ppi,tsu_ds5_13_xx44,tsu_ds5_13_xx,2020-05-04,,COPE_A_225,both
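
The single-select step in the cell above is still a placeholder. As a minimal sketch of how it could be filled in once each question is paired with its answer options, the code below samples one option per respondent and blanks out a fraction of the answers to represent non-responses. The question and answer codes, the `skip_prob` parameter, and the function `simulate_single_select` are all hypothetical and are not taken from the codebook or from `data_synthesizer`.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from question code to its answer option codes
# (in the notebook this would be derived from the codebook).
answer_options = {
    "question_1": ["option_a", "option_b", "option_c"],
    "question_2": ["option_yes", "option_no"],
}


def simulate_single_select(answer_options, n_respondents, skip_prob=0.1, seed=None):
    """Sample one answer per respondent and question; NaN marks a skipped question."""
    rng = np.random.default_rng(seed)
    columns = {}
    for question, options in answer_options.items():
        # Uniform choice over the options; object dtype so NaN can be stored.
        picks = rng.choice(options, size=n_respondents).astype(object)
        skipped = rng.random(n_respondents) < skip_prob
        picks[skipped] = np.nan
        columns[question] = picks
    return pd.DataFrame(columns)


survey = simulate_single_select(answer_options, n_respondents=50, skip_prob=0.2, seed=126)
print(survey.head())
```

The resulting frame has one row per respondent and one column per question, so it can be concatenated column-wise with the `demographics` frame to match the expected CSV layout.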


In [35]:
# data[data["PMI Code"] == "cdc_covid_19_9_xx24"]

Unnamed: 0,Display,Topic,Type,Answer Type,PMI System,PMI Code,Parent code,Date of Last Update
182,Your gender,COVID Participant Experience Survey,Answer,,http://terminology.pmi-ops.org/CodeSystem/ppi,COPE_A_100,eds_follow_up_1,2020-05-04
190,Your ancestry or national origins,COVID Participant Experience Survey,Answer,,http://terminology.pmi-ops.org/CodeSystem/ppi,COPE_A_11,eds_follow_up_1,2020-05-04
217,Your education or income level,COVID Participant Experience Survey,Answer,,http://terminology.pmi-ops.org/CodeSystem/ppi,COPE_A_145,eds_follow_up_1,2020-05-04
250,Your religion,COVID Participant Experience Survey,Answer,,http://terminology.pmi-ops.org/CodeSystem/ppi,COPE_A_190,eds_follow_up_1,2020-05-04
256,"Yes, known COVID-19",COVID Participant Experience Survey,Answer,,http://terminology.pmi-ops.org/CodeSystem/ppi,COPE_A_199,copect_17,2020-05-04
288,"Yes, suspected COVID-19",COVID Participant Experience Survey,Answer,,http://terminology.pmi-ops.org/CodeSystem/ppi,COPE_A_32,copect_17,2020-05-04
297,Your sexual orientation,COVID Participant Experience Survey,Answer,,http://terminology.pmi-ops.org/CodeSystem/ppi,COPE_A_45,eds_follow_up_1,2020-05-04
298,Your weight,COVID Participant Experience Survey,Answer,,http://terminology.pmi-ops.org/CodeSystem/ppi,COPE_A_46,eds_follow_up_1,2020-05-04
318,Your height,COVID Participant Experience Survey,Answer,,http://terminology.pmi-ops.org/CodeSystem/ppi,COPE_A_66,eds_follow_up_1,2020-05-04
328,Your age,COVID Participant Experience Survey,Answer,,http://terminology.pmi-ops.org/CodeSystem/ppi,COPE_A_8,eds_follow_up_1,2020-05-04
