# NIH Survey Workbook
## $\color{grey}{\text{Author: Gaurav Anand}}$

### Objective 

Our research partners need help analyzing `NIH All of Us COPE` survey data. They are about to launch a new survey and want "to anticipate, based on previous survey results, places where we might expect issues or challenges when we launch the new survey". The analysis should include response counts, skip rates, and the response frequency distribution for each question, along with any other metrics that help the researchers identify potentially problematic questions (e.g., low response rates, no variability). If they identify questions (or question wording) that provide little information, they can modify those questions on the new survey to get better information.

Tasks:
<pre> <p style = "font-family:helvetica;font-size:80%">
1) Use the attached codebook to generate synthetic response data. Add some categorical demographic attributes (e.g., racial identity, gender identity, age category).
              a)    - All the questions in the attached codebook should be included. 
                    - The responses don't have to be realistic. Our research partners will replace your synthetic data with their real data and re-run your notebook. 
                    - Be sure to include values that correspond to non-responses.
              b) Attach a separate script that generates the synthetic data. I would expect the actual data will be a csv file with one row per respondent and one column per question plus columns for user id and demographic information.
 
2) Create a Jupyter Notebook that performs the operational analysis for each question.
              a) Make it clear what type of input data is expected so our research partners can easily use their actual data in place of your synthetic data.
              b) There should be an analysis for each question, and possibly for each question group (if it makes sense).
              c) Provide a way for our partners to stratify these metrics across various demographic categories, for instance skip rates by racial identity.
 
3) Notes that briefly describe the parts of the notebook that need improving if you were given more time.</p>
</pre>
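
As a rough sketch of the metrics named in the tasks, the following computes response counts and skip rates per question, optionally stratified by one demographic column. It assumes the input layout described in task 1b (one row per respondent, one column per question) and that a non-response is stored as NaN. The column names `user_id`, `Gender`, `q1`, `q2` and the helper name `skip_rates` are hypothetical.

```python
import numpy as np
import pandas as pd


def skip_rates(responses, question_cols, by=None):
    """Return the fraction of missing (NaN) answers per question.

    If `by` names a demographic column, rates are computed per group.
    """
    skipped = responses[question_cols].isna()
    if by is None:
        return skipped.mean()
    return skipped.groupby(responses[by]).mean()


# Toy data in the assumed layout; NaN marks a skipped question.
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "q1": ["Yes", np.nan, "No", "Yes"],
    "q2": [np.nan, np.nan, "Maybe", "No"],
})

overall = skip_rates(toy, ["q1", "q2"])                  # one rate per question
by_gender = skip_rates(toy, ["q1", "q2"], by="Gender")   # rows: groups, columns: questions
response_counts = toy[["q1", "q2"]].notna().sum()        # answered count per question
```

Keeping the stratifying column as a parameter lets the same function serve any demographic attribute present in the real data.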

## Codebook Data

In [2]:
import pandas as pd

In [3]:
try:
    data = pd.read_excel("../data/All of Us _ Public PPI Codebook - COPE.xlsx")
except ImportError:
    !pip install xlrd
    data = pd.read_excel("../data/All of Us _ Public PPI Codebook - COPE.xlsx")




Notes:
* There are 365 items in the dataframe. Of these:
    * 179 are 'Question' items, which make up the main content of the survey.
    * 181 are 'Answer' items, which are the responses that users can select.
    * 4 are 'Topic' items, which contain miscellaneous text and instructions.
    * 1 is the 'Module Name' item.

The distribution of answer types can be seen below

In [4]:
import warnings
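# Note: this is an alias, not a copy, so `data` itself is lowercased as well.
# Later cells query `Answer Type` using lowercase values such as 'single-select'.
data_lower = data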
data_lower.loc[:,"Answer Type"] = data_lower["Answer Type"].str.lower()
grouped = data_lower.groupby(["Type","Answer Type"])
print("Answer Type Distribution")
display(pd.DataFrame(grouped.apply(len)))

Answer Type Distribution


Type,Answer Type,Count
Module Name,text,1
Question,date,1
Question,descriptive,1
Question,integer,19
Question,multi-select,14
Question,single-select,138
Question,text,6
Topic,descriptive,4
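
The same kind of table can be produced with `groupby(...).size()`, which also gives the count column a readable name. This is a minimal sketch on toy rows: only the column names `Type` and `Answer Type` come from the codebook, and the rows themselves are made up for illustration.

```python
import pandas as pd

toy_codebook = pd.DataFrame({
    "Type": ["Question", "Question", "Question", "Topic", "Answer"],
    "Answer Type": ["Single-Select", "single-select", "integer", "descriptive", None],
})

# Normalize case so 'Single-Select' and 'single-select' fall in one group.
toy_codebook["Answer Type"] = toy_codebook["Answer Type"].str.lower()

# groupby drops rows whose key is missing by default. That would explain
# why 'Answer' items are absent from the table above, assuming they carry
# no answer type in the codebook.
counts = (
    toy_codebook.groupby(["Type", "Answer Type"])
    .size()
    .reset_index(name="Count")
)
```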


### Generating Fake Survey Data

#### Generating demographic information

In order to generate the demographic information, I relied on the `SyntheticData` class from the `data_synthesizer` python file that is attached along with this Jupyter Notebook. 

Kinds of demographic data generated: 
* Gender
* Race
* States
* Age
* Employment

These values are sampled at random and are not meant to be realistic; they serve only as placeholder data for this analysis.
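
The `SyntheticData` class itself is not shown in this notebook. As an assumption about what its `generate_data` method does, a weighted categorical sampler could look like the sketch below. `sample_categories` is a hypothetical stand-in, not the actual implementation.

```python
import numpy as np


def sample_categories(categories, weights, n, seed=None):
    """Draw n labels from `categories` with probability proportional to `weights`."""
    weights = np.asarray(weights, dtype=float)
    probabilities = weights / weights.sum()  # normalize so the values sum to 1
    rng = np.random.default_rng(seed)
    return rng.choice(categories, size=n, p=probabilities)


genders = ["Male", "Female", "Other", "Prefer Not to Say"]
sample = sample_categories(genders, [0.4, 0.4, 0.04, 0.16], n=100, seed=126)
```

Normalizing inside the helper means callers can pass raw weights, such as random integers, without dividing by the sum first.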

In [5]:
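import importlib.util  # `imp` is deprecated and was removed in Python 3.12
import sys

import numpy as np  # used below; imported explicitly rather than via the star import

# Install the optional dependencies only if they are missing.
if importlib.util.find_spec("faker") is None:
    !pip install faker
if importlib.util.find_spec("us") is None:
    !pip install us

import faker
from us import states

sys.path.append('../code/')
from data_synthesizer import *  # provides SyntheticData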

n = 100 # number of survey takers

datagen = SyntheticData(n)
seed = 126 # arbitrary

np.random.seed(seed) # set seed for reproducibility

# -------- Gender -------
genders = ["Male","Female","Other","Prefer Not to Say"] # initialize gender
p_genders = [0.4,0.4,0.04,0.16] # initialize probability of each gender
gender = datagen.generate_data(genders,p_genders)

# ------- Race --------
races = ["Hispanic/Latino", "American Indian or Alaska Native", "Asian",\
         "Black or African American", "Native Hawaiian or Other Pacific Islander",\
          "White","Two or more races."]
p_races = np.array([np.random.randint(low=1,high=100) for i in range(len(races))])
p_races = p_races / sum(p_races)
race = datagen.generate_data(races,p_races)

# ------- State --------
state_names = [state.name for state in states.STATES_AND_TERRITORIES]
p_states = np.array([np.random.randint(low=1,high=100) for i in range(len(state_names))])
p_states = p_states / sum(p_states)
state = datagen.generate_data(state_names,p_states)

# ------- Age Categories -------
age_categories = ["Under 12 years old.","12-17 years old.", "18-24 years old.",
                    "25-34 years old.", "35-44 years old.","45-54 years old.","Older than 55 years"]
p_age = np.array([np.random.randint(low=1,high=100) for i in range(len(age_categories))])
p_age = p_age / sum(p_age)
age = datagen.generate_data(age_categories, p_age)

# -------- Employment Status --------
employment_categories = [
    "Employed for wages", "Self-employed","Out of work and looking for work",\
    "Out of work but not currently looking for work","A homemaker","A student",\
    "Military","Retired","Unable to work"
]
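# Linearly decreasing weights (the factor 15 cancels on normalization).
# Note: the last category ("Unable to work") gets weight 0, so it should never be sampled.
p_employment = np.array([15*x for x in range(len(employment_categories))[::-1]])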
p_employment = p_employment/ sum(p_employment)
employment = datagen.generate_data(employment_categories,p_employment)

# --------- Generating Demographics
demographics = pd.DataFrame({"Gender":gender, "Ethnicity":race, "State":state,"Age":age,"Employment":employment})
demographics


Unnamed: 0,Gender,Ethnicity,State,Age,Employment
0,Male,White,Virginia,Under 12 years old.,Military
1,Male,Two or more races.,North Carolina,45-54 years old.,A homemaker
2,Male,Asian,Georgia,12-17 years old.,Employed for wages
3,Male,White,Colorado,35-44 years old.,A homemaker
4,Female,Black or African American,Maryland,25-34 years old.,Self-employed
...,...,...,...,...,...
95,Male,White,Tennessee,Older than 55 years,A homemaker
96,Female,White,Alabama,35-44 years old.,Out of work but not currently looking for work
97,Male,Black or African American,Maryland,35-44 years old.,Employed for wages
98,Prefer Not to Say,White,Alabama,12-17 years old.,A homemaker


#### Generating synthetic answers

In [32]:
single_select_questions = data.query("Type == 'Question' & `Answer Type` == 'single-select'")
questions = data.query("Type == 'Question'")
questions = questions.reset_index(drop=True)
options = data.query("Type == 'Answer' ")
yn = ["Yes","No"]
ynm = ["Yes","No","Maybe"]
answers = dict()
# descriptors 0
answers[0] = [np.nan]
# Section 1
# 1-8
for i in range(1,9):
    answers[i] = ['None of the days (0 days)' ,'A few days (1-2 days)', 'Most days (3-4 days)' ,'Every day']
# 9
answers[9] = ['A lot less than normal','Somewhat less than normal', 'About the same as normal', 'More than normal', "A lot more than normal"]
# 10
answers[10] = ["All of the time", "Most of the time", "Sometimes","Rarely"]
# Section 2
answers[11] = yn
answers[12] = datagen.generate_date()
answers[13] = options[options["Parent code"].str.contains("cdc_covid_19_7_xx23")].Display.to_list()
answers[14] = options[options["Parent code"].str.contains('copect_17')].Display.to_list()
answers[15] = ynm
# Section 3 
answers[16] = ynm
answers[17] = ["Yes", "No","Unknown", "Waiting for Results"]
answers[18] = options[options["Parent code"].str.contains('cdc_covid_19_n_a')].Display.to_list()
answers[19] = ynm
# Section 4
answers[20] = options[options["Parent code"].str.contains('copect_63_xx20')].Display.to_list()
answers[21] = options[options["Parent code"].str.contains('copect_63_xx21')].Display.to_list()
# Section 5
answers[22] = [np.nan]  # no selectable answers; list form matches answers[0]
sec5ans = ['Not at all', 'A little bit', 'Moderately', 'Quite a bit', 'Extremely']
for i in range(23,29):
    answers[i] = sec5ans
answers[29] = options[options["Parent code"].str.contains('cdc_covid_19_18')].Display.to_list()
answers[30] = options[options["Parent code"].str.contains('cdc_covid_19_23')].Display.to_list()
answers[31] = options[options["Parent code"].str.contains('cdc_covid_19_26')].Display.to_list()
answers[32] = options[options["Parent code"].str.contains('cdc_covid_19_25')].Display.to_list()
# Section 6
answers[33] =  options[options["Parent code"].str.contains('lot_r_1')].Display.to_list()
answers[34] =  options[options["Parent code"].str.contains('ukmh_j1')].Display.to_list()
answers[35] =  options[options["Parent code"].str.contains('ukmh_j3')].Display.to_list()
# Section 7
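answers[36] = list(range(15))  # integers 0-14
answers[37] = list(range(14))  # integers 0-13
answers[38] = options[options["Parent code"].str.contains('cu_covid')].Display.to_list()
# answers[39]: free-text "Please specify." follow-up; the assignment was left unfinished in the original cell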

questions.iloc[30:45]


Unnamed: 0,Display,Topic,Type,Answer Type,PMI System,PMI Code,Parent code,Date of Last Update
30,"In the past month, have you experienced the fo...",COVID Participant Experience Survey,Question,multi-select,http://terminology.pmi-ops.org/CodeSystem/ppi,cdc_covid_19_23,COPE,2020-05-04
31,"In the past month, have the following behavior...",COVID Participant Experience Survey,Question,multi-select,http://terminology.pmi-ops.org/CodeSystem/ppi,cdc_covid_19_26,COPE,2020-05-04
32,"In the past month, to cope with social distanc...",COVID Participant Experience Survey,Question,multi-select,http://terminology.pmi-ops.org/CodeSystem/ppi,cdc_covid_19_25,COPE,2020-05-04
33,Choose the answer that best describes how you ...,COVID Participant Experience Survey,Question,single-select,http://terminology.pmi-ops.org/CodeSystem/ppi,lot_r_1,COPE,2020-05-04
34,Choose the answer that best describes how you ...,COVID Participant Experience Survey,Question,single-select,http://terminology.pmi-ops.org/CodeSystem/ppi,ukmh_j1,COPE,2020-05-04
35,Choose the answer that best describes how you ...,COVID Participant Experience Survey,Question,single-select,http://terminology.pmi-ops.org/CodeSystem/ppi,ukmh_j3,COPE,2020-05-04
36,"Not including yourself, how many other people ...",COVID Participant Experience Survey,Question,integer,http://terminology.pmi-ops.org/CodeSystem/ppi,basics_xx,COPE,2020-05-04
37,Think of other people who live with you. How m...,COVID Participant Experience Survey,Question,integer,http://terminology.pmi-ops.org/CodeSystem/ppi,basics_xx20,basics_xx,2020-05-04
38,What type of household do you live in?,COVID Participant Experience Survey,Question,single-select,http://terminology.pmi-ops.org/CodeSystem/ppi,cu_covid,COPE,2020-05-04
39,Please specify.,COVID Participant Experience Survey,Question,text,http://terminology.pmi-ops.org/CodeSystem/ppi,cu_covid_cope_a_204,cu_covid,2020-05-04
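
One way to turn an `answers`-style mapping into synthetic single-select responses, including the non-responses the tasks ask for, is sketched below. The skip probability, the `q<index>` column naming, and the helper `simulate_single_select` are assumptions made for illustration and are not the notebook's actual generator.

```python
import numpy as np
import pandas as pd


def simulate_single_select(answer_options, n, skip_prob=0.1, seed=None):
    """Sample one answer per respondent for each question; NaN marks a skip."""
    rng = np.random.default_rng(seed)
    columns = {}
    for index, choices in answer_options.items():
        # Object dtype lets the column hold both strings and NaN.
        picks = rng.choice(np.array(choices, dtype=object), size=n)
        skipped = rng.random(n) < skip_prob
        picks[skipped] = np.nan
        columns[f"q{index}"] = picks
    return pd.DataFrame(columns)


toy_answers = {11: ["Yes", "No"], 15: ["Yes", "No", "Maybe"]}
responses = simulate_single_select(toy_answers, n=50, skip_prob=0.2, seed=126)
```

Each row of `responses` corresponds to one respondent, so the frame can be joined column-wise with a demographics table of the same length.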


In [7]:
# options[options["Display"].str.contains("breathing")]
options[options["Parent code"].str.contains('cdc_covid_19_23')].Display.to_list()

['Not enough money to pay for medications',
 'Not enough money to pay rent',
 'Did not have a regular place to sleep or stay',
 'Not enough money to pay for gas',
 'Not enough money to pay for food']
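
The lookups above use `str.contains`, which matches substrings (and treats the pattern as a regular expression by default), so a code such as `cu_covid` would also match any longer parent code that contains it. A possible alternative, sketched here on toy rows, is an exact comparison. `options_for` is a hypothetical helper and the toy `Display` values are made up.

```python
import pandas as pd


def options_for(options, parent_code):
    """Return the Display values of answers whose parent code equals `parent_code` exactly."""
    mask = options["Parent code"] == parent_code  # missing parent codes compare as False
    return options.loc[mask, "Display"].to_list()


toy_options = pd.DataFrame({
    "Display": ["Option A", "Option B", "Option C"],
    "Parent code": ["cu_covid", "cu_covid_other", None],
})

exact = options_for(toy_options, "cu_covid")

# For comparison: substring matching also picks up 'cu_covid_other'.
substring = toy_options[
    toy_options["Parent code"].str.contains("cu_covid", na=False)
].Display.to_list()
```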

In [33]:
answers

{0: [nan],
 1: ['None of the days (0 days)',
  'A few days (1-2 days)',
  'Most days (3-4 days)',
  'Every day'],
 2: ['None of the days (0 days)',
  'A few days (1-2 days)',
  'Most days (3-4 days)',
  'Every day'],
 3: ['None of the days (0 days)',
  'A few days (1-2 days)',
  'Most days (3-4 days)',
  'Every day'],
 4: ['None of the days (0 days)',
  'A few days (1-2 days)',
  'Most days (3-4 days)',
  'Every day'],
 5: ['None of the days (0 days)',
  'A few days (1-2 days)',
  'Most days (3-4 days)',
  'Every day'],
 6: ['None of the days (0 days)',
  'A few days (1-2 days)',
  'Most days (3-4 days)',
  'Every day'],
 7: ['None of the days (0 days)',
  'A few days (1-2 days)',
  'Most days (3-4 days)',
  'Every day'],
 8: ['None of the days (0 days)',
  'A few days (1-2 days)',
  'Most days (3-4 days)',
  'Every day'],
 9: ['A lot less than normal',
  'Somewhat less than normal',
  'About the same as normal',
  'More than normal',
  'A lot more than normal'],
 10: ['All of the time