### CDC Data EDA
- This notebook is an exploratory data analysis of the CDC data.


Context: We were given CDC data to explore and analyze in order to understand its trends and patterns.
This document loads the data, investigates its structure, and defines some functions for working with it.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# define function to download and zip the data (this is saved in helpers.py)
import helpers as h

```python
import requests
import zipfile
import os

def download_and_zip(url, zip_path):
    """Download the file at `url` and store it in a new zip archive at `zip_path`."""
    # Send a GET request to the URL (the timeout keeps a stalled connection from hanging)
    response = requests.get(url, timeout=60)
    # Check if the request was successful
    if response.status_code == 200:
        # Save the file to a temporary location
        with open('temp_file', 'wb') as f:
            f.write(response.content)
        # Create a zip archive; ZIP_DEFLATED compresses the file (the default only stores it)
        with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_DEFLATED) as zipf:
            # Add the file to the zip archive
            zipf.write('temp_file', os.path.basename(url))
        # Remove the temporary file
        os.remove('temp_file')
        print("File downloaded and zipped successfully!")
    else:
        print(f"Failed to download the file (HTTP status {response.status_code})")
```
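
One detail of `download_and_zip` is worth noting: `os.path.basename(url)` keeps the query string, so for the URL used in this notebook the archive member is named `rows.csv?accessType=DOWNLOAD`. `pd.read_csv` reads a single-member zip regardless of the member's name, so this does not break the notebook, but a cleaner name can be derived from the URL path alone. A minimal sketch (the helper name `member_name_from_url` is hypothetical and not part of `helpers.py`):

```python
import os
from urllib.parse import urlparse

def member_name_from_url(url):
    # urlparse separates the path from the query string, so only
    # '/api/views/hksd-2xuw/rows.csv' reaches basename
    return os.path.basename(urlparse(url).path)

url = 'https://data.cdc.gov/api/views/hksd-2xuw/rows.csv?accessType=DOWNLOAD'
raw_name = os.path.basename(url)          # query string is kept
clean_name = member_name_from_url(url)    # query string is dropped
print(raw_name)
print(clean_name)
```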


In [3]:
# download and zip
file_url = 'https://data.cdc.gov/api/views/hksd-2xuw/rows.csv?accessType=DOWNLOAD'
zip_path = 'cdc_data.zip'
# Download the file and zip it
h.download_and_zip(file_url, zip_path)

File downloaded and zipped successfully!


In [5]:
# read the CSV straight from the zip archive (pandas infers the compression from the .zip extension) and describe it
cdc_data = pd.read_csv(zip_path)
print(cdc_data.describe())
print(cdc_data.columns)


           YearStart        YearEnd  Response     DataValue  DataValueAlt  \
count  311745.000000  311745.000000       0.0  2.106840e+05  2.106840e+05   
mean     2020.028328    2020.302430       NaN  6.897924e+02  7.308139e+02   
std         1.535006       1.075266       NaN  1.614618e+04  1.828234e+04   
min      2015.000000    2019.000000       NaN  0.000000e+00  0.000000e+00   
25%      2019.000000    2019.000000       NaN  1.240000e+01  1.240000e+01   
50%      2020.000000    2020.000000       NaN  2.700000e+01  2.700000e+01   
75%      2021.000000    2021.000000       NaN  5.830000e+01  5.830000e+01   
max      2022.000000    2022.000000       NaN  2.925456e+06  2.925456e+06   

       LowConfidenceLimit  HighConfidenceLimit  StratificationCategory2  \
count       190373.000000        190378.000000                      0.0   
mean            36.866274            46.092071                      NaN   
std             64.810910            69.765041                      NaN   
min   


In [8]:
cdc_data[0]

KeyError: 0
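
The `KeyError: 0` above arises because `cdc_data[0]` asks pandas for a column labelled `0`, and this frame's columns are strings such as `'YearStart'`. Selecting a row by position goes through `.iloc` instead. A minimal sketch on a small made-up frame (the `toy` frame and its values are illustrative, not CDC data):

```python
import pandas as pd

# Hypothetical toy frame standing in for cdc_data
toy = pd.DataFrame({
    'Topic': ['Asthma', 'Diabetes'],
    'DataValue': [9.1, 11.3],
})

# toy[0] raises KeyError because [] selects columns by label
try:
    toy[0]
    raised = False
except KeyError:
    raised = True

first_row = toy.iloc[0]    # first row by position, returned as a Series
first_rows = toy.head(1)   # first row as a one-row DataFrame
print(raised)
print(first_row)
```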

In [7]:
## print unique values for question column
print(cdc_data['Question'].unique())

['Adults with any disability' 'Arthritis among adults'
 'Influenza vaccination among adults' 'Diabetes among adults'
 'Life expectancy at birth' 'Alcohol use among high school students'
 'Current asthma among adults'
 'Asthma mortality among all people, underlying cause'
 'Short sleep duration among adults'
 'All teeth lost among adults aged 65 years and older'
 'Visited dentist or dental clinic in the past year among adults'
 'Depression among adults' 'High blood pressure among adults'
 'Coronary heart disease mortality among all people, underlying cause'
 'Invasive cancer (all sites combined), incidence'
 '2 or more chronic conditions among adults'
 'Invasive cancer (all sites combined) mortality among all people, underlying cause'
 'Colon and rectum (colorectal) cancer mortality among all people, underlying cause'
 'Cervical cancer mortality among all females, underlying cause'
 'Breast cancer mortality among all females, underlying cause'
 'Prostate cancer mortality among all males

In [9]:
## print unique values for Topic column
print(cdc_data['Topic'].unique())

['Disability' 'Arthritis' 'Immunization' 'Diabetes' 'Health Status'
 'Alcohol' 'Asthma' 'Sleep' 'Oral Health' 'Mental Health'
 'Cardiovascular Disease' 'Cancer' 'Tobacco'
 'Nutrition, Physical Activity, and Weight Status'
 'Chronic Obstructive Pulmonary Disease' 'Social Determinants of Health'
 'Cognitive Health and Caregiving' 'Maternal Health'
 'Chronic Kidney Disease']


In [8]:
# print unique values for LocationAbbr column
print(cdc_data['LocationAbbr'].unique())

['GA' 'GU' 'ME' 'NV' 'OH' 'OK' 'VI' 'WV' 'AL' 'AK' 'DC' 'IL' 'KS' 'NJ'
 'PA' 'SC' 'US' 'VT' 'WA' 'WY' 'AZ' 'AR' 'LA' 'MA' 'OR' 'KY' 'MI' 'MN'
 'MO' 'ID' 'CO' 'NY' 'ND' 'TX' 'NC' 'CT' 'MS' 'VA' 'WI' 'DE' 'FL' 'IA'
 'MT' 'IN' 'CA' 'NE' 'HI' 'NM' 'SD' 'RI' 'NH' 'UT' 'MD' 'TN' 'PR']


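### Print out stratification levels

Each record belongs to a stratification category (`StratificationCategory1`, for example Age or Sex) and to a level within that category (`Stratification1`).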

In [9]:
# print unique values for StratificationCategory1 column
level1 = cdc_data['StratificationCategory1'].unique()
# now for each category print the unique values of Stratification1
for l in level1:
    print(f'\nLevels for {l}')
    print(cdc_data[cdc_data['StratificationCategory1'] == l]['Stratification1'].unique())


Levels for Age
['Age >=65' 'Age 18-44' 'Age 45-64' 'Age 6-11' 'Age 6-9' 'Age 1-5'
 'Age 6-14' 'Age 4 m - 5 y' 'Age 0-44' 'Age 12-17' 'Age 10-13']

Levels for Sex
['Female' 'Male']

Levels for Overall
['Overall']

Levels for Race/Ethnicity
['American Indian or Alaska Native, non-Hispanic' 'White, non-Hispanic'
 'Black, non-Hispanic' 'Asian or Pacific Islander, non-Hispanic'
 'Hispanic' 'Hawaiian or Pacific Islander, non-Hispanic'
 'Asian, non-Hispanic' 'Multiracial, non-Hispanic']

Levels for Grade
['Grade 10' 'Grade 11' 'Grade 12' 'Grade 9']
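
The same category-to-levels mapping can also be collected in one step with `groupby`, which is convenient when the levels are needed later as a dictionary rather than printed. A minimal sketch, assuming the same two column names and using a small made-up frame (`toy`) instead of the CDC data:

```python
import pandas as pd

# Hypothetical toy frame with the two stratification columns
toy = pd.DataFrame({
    'StratificationCategory1': ['Sex', 'Sex', 'Overall', 'Grade', 'Grade'],
    'Stratification1': ['Female', 'Male', 'Overall', 'Grade 9', 'Grade 10'],
})

# For each category, collect the distinct levels in order of appearance
levels = (
    toy.groupby('StratificationCategory1')['Stratification1']
       .unique()
       .apply(list)
       .to_dict()
)
print(levels)
```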


In [10]:
## define a function to extract data for a specific question and stratification category, across all locations
def filter_data(data, question, stratification):
    filtered = data[(data['Question'] == question) & (data['StratificationCategory1'] == stratification)]
    return filtered

# try the function on one question
test = filter_data(cdc_data, 'Influenza vaccination among adults 18–64 who are at increased risk', 'Overall')
print(test.head())

       YearStart  YearEnd LocationAbbr LocationDesc DataSource         Topic  \
14364       2019     2019           AL      Alabama      BRFSS  Immunization   
14431       2019     2019           AK       Alaska      BRFSS  Immunization   
15621       2019     2019           AZ      Arizona      BRFSS  Immunization   
16956       2019     2019           AL      Alabama      BRFSS  Immunization   
17908       2019     2019           AZ      Arizona      BRFSS  Immunization   

                                                Question  Response  \
14364  Influenza vaccination among adults 18–64 who a...       NaN   
14431  Influenza vaccination among adults 18–64 who a...       NaN   
15621  Influenza vaccination among adults 18–64 who a...       NaN   
16956  Influenza vaccination among adults 18–64 who a...       NaN   
17908  Influenza vaccination among adults 18–64 who a...       NaN   

      DataValueUnit            DataValueType  ...  TopicID  QuestionID  \
14364             %     

Now we can see that the result contains two levels of `DataValueType`: 'Crude Prevalence' and 'Age-adjusted Prevalence'. Each location therefore appears more than once for the same question and year.
We have to decide which one to use.
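
Before choosing, it can help to count how many rows each `DataValueType` contributes for the question of interest. As a general rule, age-adjusted values are better suited to comparing locations with different age structures, while crude values describe the burden actually observed in each population. A minimal sketch on a made-up frame (the `toy` rows and the question labels `Q1` and `Q2` are illustrative, not CDC data):

```python
import pandas as pd

# Hypothetical toy frame with the two columns of interest
toy = pd.DataFrame({
    'Question': ['Q1', 'Q1', 'Q1', 'Q2'],
    'DataValueType': ['Crude Prevalence', 'Age-adjusted Prevalence',
                      'Crude Prevalence', 'Number'],
})

# Count rows per DataValueType for a single question
counts = toy.loc[toy['Question'] == 'Q1', 'DataValueType'].value_counts()
print(counts)
```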

In [11]:
def filter_data(data, question, stratification, value_type='Age-adjusted Prevalence'):
    # keep a single DataValueType so crude and age-adjusted rows are not mixed
    mask = (data['Question'] == question) & (data['StratificationCategory1'] == stratification)
    return data[mask & (data['DataValueType'] == value_type)]

In [12]:
test = filter_data(cdc_data, 'Influenza vaccination among adults 18–64 who are at increased risk', 'Overall')
print(test.head())

       YearStart  YearEnd LocationAbbr LocationDesc DataSource         Topic  \
16956       2019     2019           AL      Alabama      BRFSS  Immunization   
17908       2019     2019           AZ      Arizona      BRFSS  Immunization   
19078       2019     2019           AK       Alaska      BRFSS  Immunization   
22210       2019     2019           AR     Arkansas      BRFSS  Immunization   
29170       2019     2019           FL      Florida      BRFSS  Immunization   

                                                Question  Response  \
16956  Influenza vaccination among adults 18–64 who a...       NaN   
17908  Influenza vaccination among adults 18–64 who a...       NaN   
19078  Influenza vaccination among adults 18–64 who a...       NaN   
22210  Influenza vaccination among adults 18–64 who a...       NaN   
29170  Influenza vaccination among adults 18–64 who a...       NaN   

      DataValueUnit            DataValueType  ...  TopicID  QuestionID  \
16956             %  Age

In [19]:
## compute the percentage of missing values in each column
def check_missing(data):
    missing = data.isnull().sum()
    percent_missing = missing / len(data) * 100
    return percent_missing

In [20]:
print(check_missing(cdc_data))

YearStart                      0.000000
YearEnd                        0.000000
LocationAbbr                   0.000000
LocationDesc                   0.000000
DataSource                     0.000000
Topic                          0.000000
Question                       0.000000
Response                     100.000000
DataValueUnit                  0.000000
DataValueType                  0.000000
DataValue                     32.417842
DataValueAlt                  32.417842
DataValueFootnoteSymbol       67.037803
DataValueFootnote             67.037803
LowConfidenceLimit            38.933102
HighConfidenceLimit           38.931499
StratificationCategory1        0.000000
Stratification1                0.000000
StratificationCategory2      100.000000
Stratification2              100.000000
StratificationCategory3      100.000000
Stratification3              100.000000
Geolocation                    1.863382
LocationID                     0.000000
TopicID                        0.000000
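
The output above shows that `Response`, `StratificationCategory2`, `Stratification2`, `StratificationCategory3`, and `Stratification3` are 100% missing, so they carry no information. One possible next step is to drop such columns before further analysis. A minimal sketch on a small made-up frame (the `toy` frame and its values are illustrative, not CDC data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'Response' is entirely missing, 'DataValue' only partly
toy = pd.DataFrame({
    'Topic': ['Asthma', 'Diabetes', 'Sleep'],
    'Response': [np.nan, np.nan, np.nan],
    'DataValue': [9.1, np.nan, 33.0],
})

# Same calculation as check_missing: share of nulls per column, in percent
percent_missing = toy.isnull().mean() * 100

# Columns with no values at all
empty_cols = percent_missing[percent_missing == 100].index.tolist()

# Drop them; partly missing columns such as 'DataValue' are kept
trimmed = toy.drop(columns=empty_cols)
print(empty_cols)
print(trimmed.columns.tolist())
```

Columns that are only partly missing, such as `DataValue` and the confidence limits, need a separate decision, since the missing values may be explained by the footnote columns.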
