# NIH Data Synthesis
## $\color{grey}{\text{Author: Gaurav Anand}}$

<div class="alert alert-block alert-success">

**Note:** The code in this notebook can be found in script format in the folder `code` as `data_generator.py`    
</div> 

### Objective 

Our research partners need our help analyzing `NIH All of Us COPE` survey data. They are about to launch a new survey and want "to anticipate, based on previous survey results, places where we might expect issues or challenges when we launch the new survey". The analysis should include specific items such as response counts, skip rates, and the response frequency distribution for each question, along with any other metrics that help the researchers identify potentially problematic questions (e.g., low response rates, no variability). If they identify questions (or question wordings) that provide little information, they can update or modify those questions in the new survey to get better information.

Tasks:
<pre> <p style = "font-family:helvetica;font-size:80%">
1) Use the attached codebook to generate synthetic response data. Add some categorical demographic attributes (e.g., racial identity, gender identity, age category).
              a)    - All the questions in the attached codebook should be included. 
                    - The responses don't have to be realistic. Our research partners will replace your synthetic data with their real data and re-run your notebook. 
                    - Be sure to include values that correspond to non-responses.
              b) Attach a separate script that generates the synthetic data. I would expect the actual data will be a csv file with one row per respondent and one column per question plus columns for user id and demographic information.
 
2) Create a Jupyter Notebook that performs the operational analysis for each question.
              a) Make it clear what type of input data is expected so our research partners can easily use their actual data in place of your synthetic data.
              b) There should be an analysis for each question, and possibly for each question group (if it makes sense).
              c) Provide a way for our partners to stratify these metrics across various demographic categories, for instance skip rates by racial identity.
 
3) Notes that briefly describe the parts of the notebook that need improving if you were given more time.</p>
</pre>

## Codebook Data

In [2]:
import pandas as pd

In [3]:
try:
    data = pd.read_excel("../data/All of Us _ Public PPI Codebook - COPE.xlsx")
except ImportError:
    !pip install xlrd
    data = pd.read_excel("../data/All of Us _ Public PPI Codebook - COPE.xlsx")


Notes:
* There are 365 items in the dataframe. Of these 365 items:
    * There are 179 'Questions', which constitute the main content of the survey.
    * There are 181 'Answers', which represent the responses that users can select for the survey.
    * There are 4 'Topic' items, which include miscellaneous text and instructions.
    * The remaining item is the single 'Module Name' entry listed in the table below.

The distribution of answer types can be seen below

In [4]:
data_lower = data.copy()  # copy so the original codebook frame is left untouched
data_lower["Answer Type"] = data_lower["Answer Type"].str.lower()
grouped = data_lower.groupby(["Type", "Answer Type"])
print("Answer Type Distribution")
display(grouped.size().to_frame("Count"))  # number of items per (Type, Answer Type) pair

Answer Type Distribution


Type,Answer Type,Count
Module Name,text,1
Question,date,1
Question,descriptive,1
Question,integer,19
Question,multi-select,14
Question,single-select,138
Question,text,6
Topic,descriptive,4


### Generating Fake Survey Data

#### Generating demographic information

In order to generate the demographic information, I relied on the `SyntheticData` class from the `classes.py` python file that is attached along with this Jupyter Notebook. 

Kinds of demographic data generated: 
* Gender
* Race
* States
* Age
* Employment

This data is not very accurate and is just meant to be placeholder data for use in this analysis.
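
The `SyntheticData` class itself is not reproduced in this notebook. As a rough sketch of what sampling a categorical attribute from given probabilities can look like, the hypothetical helper `sample_categories` below uses NumPy's `Generator.choice`. It is an illustration under that assumption, not the actual `generate_data` implementation.

```python
import numpy as np


def sample_categories(categories, probabilities, n, seed=None):
    """Draw n labels from `categories` according to `probabilities` (hypothetical helper)."""
    probabilities = np.asarray(probabilities, dtype=float)
    if len(categories) != len(probabilities):
        raise ValueError("categories and probabilities must have the same length")
    # Normalize so that raw weights (e.g. random integers) are accepted as well
    probabilities = probabilities / probabilities.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(categories, size=n, p=probabilities)


genders = ["Male", "Female", "Other", "Prefer Not to Say"]
sample = sample_categories(genders, [0.4, 0.4, 0.04, 0.16], n=100, seed=126)
```

Normalizing inside the helper means the same function works for both hand-written probabilities and unnormalized random weights.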

In [5]:
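import importlib.util
import sys

import numpy as np

# Install the optional dependencies only if they are missing
# (the deprecated `imp` module is replaced by importlib here)
if importlib.util.find_spec("faker") is None:
    !pip install faker
if importlib.util.find_spec("us") is None:
    !pip install us

import faker
from us import states

sys.path.append('../code/')
from classes import * 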

n = 100 # number of survey takers
seed = 126 # arbitrary

datagen = SyntheticData(n,seed)

np.random.seed(seed) # set seed for reproducibility

# -------- Gender -------
genders = ["Male","Female","Other","Prefer Not to Say"] # initialize gender
p_genders = [0.4,0.4,0.04,0.16] # initialize probability of each gender
gender = datagen.generate_data(genders,p_genders)

# ------- Race --------
races = ["Hispanic/Latino", "American Indian or Alaska Native", "Asian",\
         "Black or African American", "Native Hawaiian or Other Pacific Islander",\
          "White","Two or more races."]
p_races = np.array([np.random.randint(low=1,high=100) for i in range(len(races))])
p_races = p_races / sum(p_races)
race = datagen.generate_data(races,p_races)

# ------- State --------
state_names = [state.name for state in states.STATES_AND_TERRITORIES]
p_states = np.array([np.random.randint(low=1,high=100) for i in range(len(state_names))])
p_states = p_states / sum(p_states)
state = datagen.generate_data(state_names,p_states)

# ------- Age Categories -------
age_categories = ["Under 12 years old.","12-17 years old.", "18-24 years old.",
                    "25-34 years old.", "35-44 years old.","45-54 years old.","Older than 55 years"]
p_age = np.array([np.random.randint(low=1,high=100) for i in range(len(age_categories))])
p_age = p_age / sum(p_age)
age = datagen.generate_data(age_categories, p_age)

# -------- Employment Status --------
employment_categories = [
    "Employed for wages", "Self-employed","Out of work and looking for work",\
    "Out of work but not currently looking for work","A homemaker","A student",\
    "Military","Retired","Unable to work"
]
# Linearly decreasing weights (15*8, 15*7, ..., 15*0); note that the last category gets weight 0
p_employment = np.array([15*x for x in range(len(employment_categories))[::-1]])
p_employment = p_employment/ sum(p_employment)
employment = datagen.generate_data(employment_categories,p_employment)

# --------- Generating Demographics
demographics = pd.DataFrame({"Gender":gender, "Ethnicity":race, "State":state,"Age":age,"Employment":employment})
demographics


,Gender,Ethnicity,State,Age,Employment
0,Male,White,Virginia,Under 12 years old.,Military
1,Male,Two or more races.,North Carolina,45-54 years old.,A homemaker
2,Male,Asian,Georgia,12-17 years old.,Employed for wages
3,Male,White,Colorado,35-44 years old.,A homemaker
4,Female,Black or African American,Maryland,25-34 years old.,Self-employed
...,...,...,...,...,...
95,Male,White,Tennessee,Older than 55 years,A homemaker
96,Female,White,Alabama,35-44 years old.,Out of work but not currently looking for work
97,Male,Black or African American,Maryland,35-44 years old.,Employed for wages
98,Prefer Not to Say,White,Alabama,12-17 years old.,A homemaker


#### Organizing and Cataloguing questions and answers into a dataframe

To organize the questions and responses, I first had to wrangle the codebook into a processable form. Since `Parent code` and `PMI code` did not directly correspond to the questions and answers given in the survey data, a simple merge/join through either SQL or pandas was not possible. I worked around this by automating most of the joining process and manually sorting the remaining questions and answers against <a href="https://www.nlm.nih.gov/dr2/COPE_Survey_NIH_All_of_Us_Clean_4.27.20.pdf">this survey document</a>. <br>
The script run below handles most of this cleanup and organizing automatically.
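
For intuition, here is a minimal sketch of the idealized case in which every answer row does point at its question through a parent code. The toy codebook, its code values, and the second question are hypothetical. The real codebook did not link this cleanly, which is why the cleaning script needs extra manual steps.

```python
import pandas as pd

# Toy codebook with hypothetical codes (not taken from the real file)
codebook = pd.DataFrame({
    "Type": ["Question", "Answer", "Answer", "Question", "Answer"],
    "PMI code": ["q_tested", "a_never", "a_once", "q_toy", "a_yes"],
    "Parent code": [None, "q_tested", "q_tested", None, "q_toy"],
    "Display": ["How many times were you tested?", "Never", "Once",
                "Toy follow-up question", "Yes"],
})

questions = codebook[codebook["Type"] == "Question"]
answers = codebook[codebook["Type"] == "Answer"]

# Collect the display text of every answer under its parent question code
answer_lists = answers.groupby("Parent code")["Display"].apply(list).rename("answers")

# Attach each answer list to its question; questions without answers get NaN
qa = questions.merge(answer_lists, left_on="PMI code", right_index=True, how="left")
qa = qa[["PMI code", "Display", "answers"]].reset_index(drop=True)
```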

In [10]:
import subprocess
import sys

# Run the data cleaning script from inside the data folder so its output file is created there
subprocess.run([sys.executable, "../code/data_cleaning.py"], cwd="../data", check=True)
print("Format:")
print("question id                       Question                      Possible Answers")
qa_df = pd.read_json('../data/ques_ans_clean.json')
display(qa_df)


Format:
question id                       Question                      Possible Answers


,qid,Display,Answer Type,answers
0,0,You can choose not to answer any question at a...,single-select,
1,1,"In the past month, have recommendations for so...",Single-select,"[None of the days (0 days), A few days (1-2 da..."
2,2,"Thinking about your current social habits, in ...",Single-select,"[None of the days (0 days), A few days (1-2 da..."
3,3,"Thinking about your current social habits, in ...",Single-select,"[None of the days (0 days), A few days (1-2 da..."
4,4,"Thinking about your current social habits, in ...",Single-select,"[None of the days (0 days), A few days (1-2 da..."
...,...,...,...,...
174,174,Other reason - please specify.,text,text
175,175,How many times were you tested?,single-select,"[Never, Once, 2 - 5 times, More than 5 times]"
176,176,Do you personally know someone who has died of...,single-select,"[Yes, No, Not Sure, Prefer Not to Answer]"
177,177,Who do you know who has died? Check all that a...,multi-select,"[Spouse, partner, boyfriend, girlfriend, Paren..."


#### Creating Survey Data
I created survey data by randomly sampling from the answer choices for each question. The generation approach differs by question type:
* text - Generated random text data (3 sentences long).
* date - Generated dates between Feb 14th and the current date.
* integer - Generated numbers between 0 and 100 (dummy data).
* single-select - Randomly selected one response from the answer choices for that question.
* multi-select - Selected a randomly chosen number of answer choices without replacement.
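
The two select-type strategies above can be sketched on their own as follows. The helper names are hypothetical, and single-select answers are drawn uniformly here for simplicity, whereas the notebook draws them with randomly generated probabilities.

```python
import numpy as np

rng = np.random.default_rng(126)


def sample_single_select(choices, n):
    """One answer per respondent, chosen uniformly at random (simplifying assumption)."""
    return list(rng.choice(choices, size=n))


def sample_multi_select(choices, n):
    """Between 1 and len(choices) distinct answers per respondent."""
    responses = []
    for _ in range(n):
        # `high` is exclusive, so add 1 to allow selecting every choice
        k = rng.integers(low=1, high=len(choices) + 1)
        responses.append(list(rng.choice(choices, size=k, replace=False)))
    return responses


choices = ["Never", "Once", "2 - 5 times", "More than 5 times"]
single = sample_single_select(choices, n=5)
multi = sample_multi_select(choices, n=5)
```

Sampling with `replace=False` guarantees that no respondent selects the same option twice in a multi-select question.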

In [6]:
survey_data = pd.DataFrame(np.zeros((n, len(qa_df))), columns=qa_df.qid)  # one row per respondent
print("qid", end=" ")
for qid in qa_df.qid:
    print(qid, end=" ")  # progress indicator

    # -------- Indexing questions -------
    answers = qa_df.loc[qid, 'answers']  # Note: only works if df index and qid match up perfectly
    # answers = qa_df.query(f"qid == {qid}")["answers"]  # use this otherwise

    # -------- Matching types of questions -------
    q_type = qa_df.loc[qid, 'Answer Type']
    selections = q_type.lower() == "multi-select"  # enables multi-select

    if answers is not None:
        # --------- Text questions (generate 3 sentences) ----------
        if answers == 'text':
            survey_data[qid] = survey_data[qid].apply(lambda x: datagen.generate_text(3))
        # --------- Date questions (generate a date between Feb 14th and today) ----------
        elif answers == 'date':
            answers = [datagen.generate_date() for i in range(n)]
            survey_data[qid] = datagen.generate_data(answers,
                                                     datagen.generate_probabilities(len(answers)))
        # --------- Multi-select questions ----------
        elif selections:
            responses = []
            for item_num in range(n):
                # `high` is exclusive, so add 1 to allow selecting every answer choice
                num_answers = np.random.randint(low=1, high=len(answers) + 1)
                response = np.random.choice(answers, size=num_answers, replace=False)
                responses.append(response)
            survey_data[qid] = responses
        # --------- Single-select questions ----------
        else:
            survey_data[qid] = datagen.generate_data(answers,
                                                     datagen.generate_probabilities(len(answers)))
    else:
        # No answer choices listed for this item: leave the column empty
        survey_data[qid] = np.nan
survey_data

qid 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 

qid,0,1,2,3,4,5,6,7,8,9,...,169,170,171,172,173,174,175,176,177,178
0,,Every day,A few days (1-2 days),A few days (1-2 days),Most days (3-4 days),None of the days (0 days),None of the days (0 days),None of the days (0 days),None of the days (0 days),Somewhat less than normal,...,Almost everyday,At least once a week,Almost everyday,Never,"[Your race, Your age, Your religion, Your educ...",Organization again myself want month. Even ont...,Never,Yes,"[Friend, Spouse, partner, boyfriend, girlfrien...",Language manage discuss ok new program soon. D...
1,,A few days (1-2 days),Every day,Most days (3-4 days),Most days (3-4 days),Most days (3-4 days),Every day,Every day,None of the days (0 days),A lot less than normal,...,Almost everyday,A few times a month,Never,At least once a week,[Some other aspect of your physical appearance...,Newspaper share fish party take particular. Cl...,2 - 5 times,Not Sure,"[Co-worker, Parent, Spouse, partner, boyfriend...",Dinner energy spring western others analysis c...
2,,None of the days (0 days),Most days (3-4 days),A few days (1-2 days),A few days (1-2 days),Every day,Every day,A few days (1-2 days),None of the days (0 days),More than normal,...,At least once a week,A few times a month,At least once a week,At least once a week,"[Your education or income level, Your sexual o...",Expect oil response piece although try. Or pro...,Once,Yes,"[Sibling (brother or sister), Grandparent, Chi...",Kid inside box dream ready radio. Real compute...
3,,A few days (1-2 days),None of the days (0 days),None of the days (0 days),None of the days (0 days),None of the days (0 days),Every day,A few days (1-2 days),A few days (1-2 days),More than normal,...,A few times a month,A few times a month,Almost everyday,A few times a month,"[Your religion, Your ancestry or national orig...",Open water even however once. Various upon lat...,Never,Not Sure,[Co-worker],Girl sense learn page expert fine single by. F...
4,,Most days (3-4 days),None of the days (0 days),Most days (3-4 days),A few days (1-2 days),Every day,Most days (3-4 days),A few days (1-2 days),Most days (3-4 days),Somewhat less than normal,...,Almost everyday,Almost everyday,Almost everyday,A few times a month,"[Your ancestry or national origins, Your gende...",Support certainly order great paper. Easy oper...,2 - 5 times,Not Sure,[Friend],Now usually article important join task whole....
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,,Most days (3-4 days),A few days (1-2 days),Most days (3-4 days),A few days (1-2 days),None of the days (0 days),Every day,None of the days (0 days),Most days (3-4 days),About the same as normal,...,At least once a week,A few times a month,At least once a week,Never,"[Your gender, Your religion, Your ancestry or ...",Art food everybody campaign story letter year....,More than 5 times,Prefer Not to Answer,"[Co-worker, Child, Spouse, partner, boyfriend,...",Interview task will let energy true situation....
96,,A few days (1-2 days),A few days (1-2 days),A few days (1-2 days),Most days (3-4 days),Every day,None of the days (0 days),Every day,Most days (3-4 days),A lot less than normal,...,At least once a week,A few times a month,Almost everyday,A few times a month,"[Your ancestry or national origins, Your race,...",Point use yard environmental modern move. Tech...,Once,Not Sure,"[Child, Neighbor, Sibling (brother or sister),...",Word by control theory while. There off reach ...
97,,Every day,Every day,None of the days (0 days),A few days (1-2 days),Every day,Most days (3-4 days),Most days (3-4 days),Every day,Somewhat less than normal,...,A few times a month,A few times a month,Never,Almost everyday,[Your age],Purpose pay college really. Or actually nothin...,Never,Not Sure,[Sibling (brother or sister)],Increase understand ever understand. Easy tell...
98,,Every day,Most days (3-4 days),None of the days (0 days),Every day,Every day,None of the days (0 days),A few days (1-2 days),Most days (3-4 days),More than normal,...,A few times a month,Never,A few times a month,At least once a week,"[Your gender, Some other aspect of your physic...",Matter among million school. Fight any join go...,Never,No,"[Spouse, partner, boyfriend, girlfriend, Neigh...",Western state style want ten. Soon mission ser...


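#### Amputing data (introducing missing values)

To simulate non-response, each answer is independently set to missing with probability 0.2.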
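
Amputing data means deleting observed values on purpose, here to mimic skipped questions. As a minimal vectorized sketch on a toy frame, a boolean mask can be passed to `DataFrame.where`. The names `toy`, `q1`, `q2`, and `missing_rate` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(126)
toy = pd.DataFrame({"q1": ["Yes", "No", "Yes", "No"],
                    "q2": ["Never", "Once", "Never", "Once"]})

missing_rate = 0.2
# True where the answer is kept, False where it is removed
keep_mask = rng.random(toy.shape) > missing_rate
# Cells where the mask is False become NaN
amputed = toy.where(keep_mask)
```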

In [7]:
# Draw one uniform number per (respondent, question); an answer is kept only if its draw exceeds 0.2,
# so each answer is set to missing with probability 0.2
keep_mask = np.random.random((n, len(qa_df.qid))) > 0.2
for i in range(n):
    for j, column in enumerate(survey_data.columns):  # number of columns == keep_mask.shape[1]
        if not keep_mask[i, j]:
            survey_data.loc[i, column] = np.nan
survey_data

qid,0,1,2,3,4,5,6,7,8,9,...,169,170,171,172,173,174,175,176,177,178
0,,Every day,,,Most days (3-4 days),None of the days (0 days),None of the days (0 days),None of the days (0 days),None of the days (0 days),Somewhat less than normal,...,Almost everyday,At least once a week,Almost everyday,,,Organization again myself want month. Even ont...,Never,Yes,"[Friend, Spouse, partner, boyfriend, girlfrien...",
1,,A few days (1-2 days),Every day,Most days (3-4 days),Most days (3-4 days),Most days (3-4 days),Every day,Every day,None of the days (0 days),A lot less than normal,...,Almost everyday,A few times a month,Never,At least once a week,[Some other aspect of your physical appearance...,Newspaper share fish party take particular. Cl...,2 - 5 times,Not Sure,"[Co-worker, Parent, Spouse, partner, boyfriend...",Dinner energy spring western others analysis c...
2,,None of the days (0 days),Most days (3-4 days),A few days (1-2 days),A few days (1-2 days),Every day,Every day,A few days (1-2 days),,More than normal,...,At least once a week,A few times a month,At least once a week,At least once a week,,Expect oil response piece although try. Or pro...,Once,,"[Sibling (brother or sister), Grandparent, Chi...",Kid inside box dream ready radio. Real compute...
3,,,None of the days (0 days),None of the days (0 days),None of the days (0 days),None of the days (0 days),Every day,,,More than normal,...,A few times a month,,,A few times a month,"[Your religion, Your ancestry or national orig...",Open water even however once. Various upon lat...,Never,Not Sure,[Co-worker],Girl sense learn page expert fine single by. F...
4,,Most days (3-4 days),None of the days (0 days),Most days (3-4 days),A few days (1-2 days),Every day,,A few days (1-2 days),Most days (3-4 days),Somewhat less than normal,...,Almost everyday,Almost everyday,Almost everyday,A few times a month,"[Your ancestry or national origins, Your gende...",Support certainly order great paper. Easy oper...,2 - 5 times,Not Sure,,Now usually article important join task whole....
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,,,A few days (1-2 days),,A few days (1-2 days),,Every day,None of the days (0 days),Most days (3-4 days),About the same as normal,...,,,At least once a week,Never,,Art food everybody campaign story letter year....,,Prefer Not to Answer,"[Co-worker, Child, Spouse, partner, boyfriend,...",Interview task will let energy true situation....
96,,A few days (1-2 days),A few days (1-2 days),A few days (1-2 days),,Every day,,,Most days (3-4 days),A lot less than normal,...,At least once a week,,Almost everyday,A few times a month,"[Your ancestry or national origins, Your race,...",Point use yard environmental modern move. Tech...,,,"[Child, Neighbor, Sibling (brother or sister),...",Word by control theory while. There off reach ...
97,,,Every day,None of the days (0 days),A few days (1-2 days),,Most days (3-4 days),Most days (3-4 days),,Somewhat less than normal,...,A few times a month,,Never,Almost everyday,[Your age],Purpose pay college really. Or actually nothin...,Never,Not Sure,[Sibling (brother or sister)],
98,,Every day,Most days (3-4 days),None of the days (0 days),Every day,Every day,None of the days (0 days),A few days (1-2 days),Most days (3-4 days),More than normal,...,A few times a month,Never,A few times a month,At least once a week,,Matter among million school. Fight any join go...,Never,No,"[Spouse, partner, boyfriend, girlfriend, Neigh...",


#### Merging demographics and survey results

In [9]:
user_id = pd.Series(range(len(demographics)), name="user_id")
full_survey_results = pd.concat([user_id, demographics, survey_data], axis=1)
full_survey_results.to_csv("../data/survey.csv", index=False)
full_survey_results
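
With one row per respondent and one column per question, the skip rates named in the objective reduce to the share of missing values per column. The sketch below uses a small hypothetical stand-in for `survey.csv` rather than the generated file, and stratifies by `Gender` as one example of a demographic column.

```python
import pandas as pd

# Toy stand-in for survey.csv: one row per respondent (hypothetical values)
survey = pd.DataFrame({
    "user_id": [0, 1, 2, 3],
    "Gender": ["Male", "Female", "Female", "Male"],
    "1": ["Every day", None, "Every day", None],
    "2": [None, "Once", "Never", "Once"],
})
question_cols = ["1", "2"]

# Skip rate per question: share of respondents with a missing answer
skip_rate = survey[question_cols].isna().mean()

# Skip rate per question, stratified by a demographic column
skip_by_gender = survey[question_cols].isna().groupby(survey["Gender"]).mean()
```

The same pattern works for any other demographic column by swapping the grouping key.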