# SMART Raw Data Pre-processing for Mental Health Trends Dashboard

### Desired Product
* Simplified CSV file with pre-computed statistical information for mental health trends dashboard

### Rationale
* Because the SMART dataset from the CDC is quite large, a responsive JavaScript app needs some of the information pre-computed.
* As an expert in statistical analysis and data science, and the owner of the dashboard, I am the best person to do the pre-computations.
* Building a reproducible pipeline for pre-processing now will make it much easier to maintain the dashboard

## Requirements
*  The standard Python 3.7 environment installed with conda, as well as Pandas (0.24.2) and SciPy (1.2.1)
*  The file `mental_health_env.yml` provides the info for reproducing the environment on Windows 10
*  The raw data should be in a file called `smart_raw_data.csv` in the same directory as this notebook.  
*  The raw data can be downloaded from the Centers for Disease Control and Prevention at:  https://chronicdata.cdc.gov/Behavioral-Risk-Factors/Behavioral-Risk-Factors-Selected-Metropolitan-Area/j32a-sa6u  (click the "Export" button and choose the "CSV" file option, not the Excel one)
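
The requirements above can be checked before running the rest of the notebook. The following is a minimal sketch, not part of the original pipeline: the function name `check_environment` and the exact-match version comparison are assumptions made for illustration, and newer library versions may work without changes.

```python
import importlib
import os

# Versions pinned in the Requirements list above
PINNED = {"pandas": "0.24.2", "scipy": "1.2.1"}

def check_environment(data_path="smart_raw_data.csv", pinned=PINNED):
    """Return a list of warning strings; an empty list means every check passed."""
    warnings = []
    for name, wanted in pinned.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            warnings.append(f"{name} is not installed (expected {wanted})")
            continue
        found = getattr(module, "__version__", "unknown")
        if found != wanted:
            warnings.append(f"{name} {found} found, notebook was written against {wanted}")
    # The raw data must sit next to the notebook
    if not os.path.isfile(data_path):
        warnings.append(f"raw data file not found: {data_path}")
    return warnings

for message in check_environment():
    print("WARNING:", message)
```

A version mismatch is reported as a warning rather than an error, since it does not necessarily mean the notebook will fail.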

In [1]:
# Dependencies
import pandas as pd
import scipy.stats as stats

In [2]:
# Import dataframe
raw_df = pd.read_csv('smart_raw_data.csv')
raw_df.head()

Unnamed: 0,Year,Locationabbr,Locationdesc,Class,Topic,Question,Response,Break_Out,Break_Out_Category,Sample_Size,...,Data_Value_Footnote,DataSource,ClassId,TopicId,LocationID,BreakoutID,BreakOutCategoryID,QuestionID,RESPONSEID,GeoLocation
0,2011,10020,"Abbeville, LA Micropolitan Statistical Area",Health Status,Overall Health,How is your general health?,Excellent,Overall,Overall,85,...,,BRFSS,CLASS08,Topic41,10020,BO1,CAT1,GENHLTH,RESP056,"(29.7868663, -92.290084)"
1,2011,10020,"Abbeville, LA Micropolitan Statistical Area",Health Status,Overall Health,How is your general health?,Very good,Overall,Overall,153,...,,BRFSS,CLASS08,Topic41,10020,BO1,CAT1,GENHLTH,RESP057,"(29.7868663, -92.290084)"
2,2011,10020,"Abbeville, LA Micropolitan Statistical Area",Health Status,Overall Health,How is your general health?,Good,Overall,Overall,152,...,,BRFSS,CLASS08,Topic41,10020,BO1,CAT1,GENHLTH,RESP058,"(29.7868663, -92.290084)"
3,2011,10020,"Abbeville, LA Micropolitan Statistical Area",Health Status,Overall Health,How is your general health?,Fair,Overall,Overall,80,...,,BRFSS,CLASS08,Topic41,10020,BO1,CAT1,GENHLTH,RESP059,"(29.7868663, -92.290084)"
4,2011,10020,"Abbeville, LA Micropolitan Statistical Area",Health Status,Overall Health,How is your general health?,Poor,Overall,Overall,43,...,,BRFSS,CLASS08,Topic41,10020,BO1,CAT1,GENHLTH,RESP060,"(29.7868663, -92.290084)"


In [3]:
# Check length -- we should see about 120k rows
len(raw_df)

120059
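
The "about 120k rows" check above is done by eye. If the pipeline is re-run on a fresh download, a small automated check may be handy. This is a sketch only: `check_row_count` and its 5% default tolerance are hypothetical choices, and the toy frame stands in for `raw_df`.

```python
import pandas as pd

def check_row_count(df, expected, tolerance=0.05):
    """Return True if len(df) is within `tolerance` (a fraction) of `expected`."""
    return abs(len(df) - expected) <= tolerance * expected

# Toy frame standing in for raw_df (98 made-up rows)
toy_df = pd.DataFrame({"Year": [2011] * 98})
print(check_row_count(toy_df, expected=100))  # 98 is within 5% of 100
print(check_row_count(toy_df, expected=200))  # 98 is far from 200
```

With the real data, the call would look like `check_row_count(raw_df, expected=120000)`.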

### Filter and Simplify Data

In [None]:
# First, we'll select only the questions of interest.
# Some of these can be selected by the 'QuestionID' and 'RESPONSEID' columns --
# for instance, for the "do you have health insurance?" question, we need only the "yes" response
# to get an indicator.
# The direct indicators we want are:

In [19]:
# 1) Do you have health insurance (adults 18-64) / yes : QuestionID = _HCVU651, RESPONSEID = RESP046
# Note that nearly all adults aged 65 and over are covered by Medicare, so the question is limited to ages 18-64
has_insurance_df = raw_df.loc[(raw_df.QuestionID == '_HCVU651') & (raw_df.RESPONSEID == 'RESP046')]
has_insurance_df.head()

Unnamed: 0,Year,Locationabbr,Locationdesc,Class,Topic,Question,Response,Break_Out,Break_Out_Category,Sample_Size,...,Data_Value_Footnote,DataSource,ClassId,TopicId,LocationID,BreakoutID,BreakOutCategoryID,QuestionID,RESPONSEID,GeoLocation
9,2011,10020,"Abbeville, LA Micropolitan Statistical Area",Health Care Access/Coverage,Under 65 Coverage,Adults aged 18-64 who have any kind of health ...,Yes,Overall,Overall,270,...,,BRFSS,CLASS07,Topic59,10020,BO1,CAT1,_HCVU651,RESP046,"(29.7868663, -92.290084)"
93,2011,10100,"Aberdeen, SD Micropolitan Statistical Area",Health Care Access/Coverage,Under 65 Coverage,Adults aged 18-64 who have any kind of health ...,Yes,Overall,Overall,299,...,,BRFSS,CLASS07,Topic59,10100,BO1,CAT1,_HCVU651,RESP046,"(45.5222131, -98.7041456)"
175,2011,10420,"Akron, OH Metropolitan Statistical Area",Health Care Access/Coverage,Under 65 Coverage,Adults aged 18-64 who have any kind of health ...,Yes,Overall,Overall,457,...,,BRFSS,CLASS07,Topic59,10420,BO1,CAT1,_HCVU651,RESP046,"(41.1466315, -81.3501066)"
260,2011,10740,"Albuquerque, NM Metropolitan Statistical Area",Health Care Access/Coverage,Under 65 Coverage,Adults aged 18-64 who have any kind of health ...,Yes,Overall,Overall,1898,...,,BRFSS,CLASS07,Topic59,10740,BO1,CAT1,_HCVU651,RESP046,"(35.1165976, -106.4565247)"
343,2011,10900,"Allentown-Bethlehem-Easton, PA-NJ Metropolitan...",Health Care Access/Coverage,Under 65 Coverage,Adults aged 18-64 who have any kind of health ...,Yes,Overall,Overall,708,...,,BRFSS,CLASS07,Topic59,10900,BO1,CAT1,_HCVU651,RESP046,"(40.7892985, -75.3983269)"
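
Since the same QuestionID / RESPONSEID filter is repeated for each indicator, it could be wrapped in a small helper. The sketch below is illustrative: `select_indicator` is a hypothetical name, and the toy rows (including the `RESP_OTHER` ID and all `Sample_Size` values) are made up rather than taken from the SMART data.

```python
import pandas as pd

def select_indicator(df, question_id, response_id):
    """Return a copy of the rows matching one QuestionID / RESPONSEID pair."""
    mask = (df["QuestionID"] == question_id) & (df["RESPONSEID"] == response_id)
    return df.loc[mask].copy()

# Toy rows standing in for raw_df; values are invented for the example
toy_df = pd.DataFrame({
    "QuestionID": ["_HCVU651", "_HCVU651", "_TOTINDA", "GENHLTH"],
    "RESPONSEID": ["RESP046", "RESP_OTHER", "RESP046", "RESP056"],
    "Sample_Size": [270, 30, 250, 85],
})

insured = select_indicator(toy_df, "_HCVU651", "RESP046")
print(len(insured))
```

Returning a copy means later changes to the selection do not trigger pandas' chained-assignment warnings against the raw frame.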


In [20]:
# We should see somewhat over 1000 rows ...
len(has_insurance_df)

1071

In [21]:
# We'll build the selected data by appending one dataframe at a time
# starting with a copy of the first
filtered_df = has_insurance_df.copy()

In [28]:
# 2) Have you participated in any physical activity in the last 30 days? / yes : QuestionID = _TOTINDA, RESPONSEID = RESP046
phys_act_df = raw_df.loc[(raw_df.QuestionID == '_TOTINDA') & (raw_df.RESPONSEID == 'RESP046')]
filtered_df = filtered_df.append(phys_act_df)
len(phys_act_df)

1071

In [29]:
# Check that we are appending properly (lengths should add)
len(filtered_df)

2142

In [30]:
# 3) When did you last have a cholesterol check? / last 5 years : QuestionID = _CHOLCHK, RESPONSEID = RESP072
chol_chk_df = raw_df.loc[(raw_df.QuestionID == '_CHOLCHK') & (raw_df.RESPONSEID == 'RESP072')]
filtered_df = filtered_df.append(chol_chk_df)
len(filtered_df)

2615
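
The append-one-at-a-time pattern above works with the pinned pandas 0.24.2, but `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0. As a possible alternative, the three selections could be collected in a list and combined once with `pd.concat`, which is available in both old and new versions. This is a sketch under assumptions: `INDICATORS` and `build_filtered` are hypothetical names, and the toy frame stands in for `raw_df`.

```python
import pandas as pd

# (QuestionID, RESPONSEID) pairs taken from the cells above
INDICATORS = [
    ("_HCVU651", "RESP046"),  # has health insurance (adults 18-64): yes
    ("_TOTINDA", "RESP046"),  # physical activity in the last 30 days: yes
    ("_CHOLCHK", "RESP072"),  # cholesterol checked in the last 5 years
]

def build_filtered(df, indicators=INDICATORS):
    """Select each indicator and stack the selections into one DataFrame."""
    parts = [
        df.loc[(df["QuestionID"] == q) & (df["RESPONSEID"] == r)]
        for q, r in indicators
    ]
    combined = pd.concat(parts)
    # Same check as in the notebook: the lengths should add
    assert len(combined) == sum(len(p) for p in parts)
    return combined

# Toy rows standing in for raw_df
toy_df = pd.DataFrame({
    "QuestionID": ["_HCVU651", "_TOTINDA", "_CHOLCHK", "GENHLTH"],
    "RESPONSEID": ["RESP046", "RESP046", "RESP072", "RESP056"],
})
filtered = build_filtered(toy_df)
print(len(filtered))
```

Like `append`, `pd.concat` keeps the original row index by default, so rows can still be traced back to the raw data.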