# U.S. Chronic Disease Indicators (CDI) Dataset
## Descriptive Analysis
Source: https://data.cdc.gov/Chronic-Disease-Indicators/U-S-Chronic-Disease-Indicators-CDI-/g4ie-h725/about_data

In [4]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
import warnings
warnings.filterwarnings('ignore')

# Load the dataset into a pandas DataFrame and preview the first rows
df = pd.read_csv('us_cdi_dataset.csv')
df.head()

Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,DataSource,Topic,Question,Response,DataValueUnit,DataValueType,...,LocationID,TopicID,QuestionID,DataValueTypeID,StratificationCategoryID1,StratificationID1,StratificationCategoryID2,StratificationID2,StratificationCategoryID3,StratificationID3
0,2014,2014,AR,Arkansas,SEDD; SID,Asthma,Hospitalizations for asthma,,,Number,...,5,AST,AST3_1,NMBR,GENDER,GENM,,,,
1,2018,2018,CO,Colorado,SEDD; SID,Asthma,Hospitalizations for asthma,,,Number,...,8,AST,AST3_1,NMBR,OVERALL,OVR,,,,
2,2018,2018,DC,District of Columbia,SEDD; SID,Asthma,Hospitalizations for asthma,,,Number,...,11,AST,AST3_1,NMBR,OVERALL,OVR,,,,
3,2017,2017,GA,Georgia,SEDD; SID,Asthma,Hospitalizations for asthma,,,Number,...,13,AST,AST3_1,NMBR,GENDER,GENF,,,,
4,2010,2010,MI,Michigan,SEDD; SID,Asthma,Hospitalizations for asthma,,,Number,...,26,AST,AST3_1,NMBR,RACE,HIS,,,,


In [6]:
# print the shape of the dataset
print('The dataset has {} rows and {} columns'.format(df.shape[0], df.shape[1]))

The dataset has 1185676 rows and 34 columns
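
With roughly 1.2 million rows, the text columns are likely to dominate memory use. The sketch below is an optional illustration, not a step from the original notebook: it measures a frame's footprint with `memory_usage(deep=True)` and converts repetitive text columns to the `category` dtype. The `toy` frame and its values are stand-ins for a few CDI columns.

```python
import pandas as pd

# Toy frame standing in for a few CDI columns (illustrative values only)
toy = pd.DataFrame({
    "LocationAbbr": ["AR", "CO", "DC", "GA", "MI"] * 200,
    "Topic": ["Asthma"] * 1000,
    "YearStart": [2014, 2018, 2018, 2017, 2010] * 200,
})

# deep=True counts the actual string payloads, not just the pointers
before = toy.memory_usage(deep=True).sum()

# Low-cardinality text columns are stored more compactly as categories
for col in ["LocationAbbr", "Topic"]:
    toy[col] = toy[col].astype("category")

after = toy.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

The same loop could be applied to `df` for columns such as `LocationAbbr`, `Topic`, or `DataSource`, assuming they stay low-cardinality.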


In [8]:
# print the data types of the columns
print('The data types of the columns are:')
print(df.dtypes)

The data types of the columns are:
YearStart                      int64
YearEnd                        int64
LocationAbbr                  object
LocationDesc                  object
DataSource                    object
Topic                         object
Question                      object
Response                     float64
DataValueUnit                 object
DataValueType                 object
DataValue                     object
DataValueAlt                 float64
DataValueFootnoteSymbol       object
DatavalueFootnote             object
LowConfidenceLimit           float64
HighConfidenceLimit          float64
StratificationCategory1       object
Stratification1               object
StratificationCategory2      float64
Stratification2              float64
StratificationCategory3      float64
Stratification3              float64
GeoLocation                   object
ResponseID                   float64
LocationID                     int64
TopicID                       object
Que
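
The listing above shows `DataValue` stored as `object` while `DataValueAlt` is `float64`, which suggests that `DataValue` holds at least some non-numeric text. One way to inspect this is to coerce the column with `pd.to_numeric` and look at what fails to parse. This is a minimal sketch on a toy frame; the non-numeric entries and the `DataValueNum` column name are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Toy values; "No" and "3,200" are hypothetical examples of unparseable text
toy = pd.DataFrame({"DataValue": ["916", "12.5", None, "No", "3,200"]})

# errors="coerce" turns anything unparseable into NaN instead of raising
toy["DataValueNum"] = pd.to_numeric(toy["DataValue"], errors="coerce")

# Rows that had text but did not parse deserve a look before being discarded
unparsed = toy[toy["DataValue"].notna() & toy["DataValueNum"].isna()]
print(toy)
print(unparsed)
```

On the real data, the equivalent check would be run on `df['DataValue']`, and the `unparsed` rows would show which text values prevent a numeric dtype.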

In [10]:
# count the missing values per column and express them as a percentage of all rows
missing_values = df.isnull().sum()
missing_values_percentage = (missing_values / df.shape[0]) * 100
missing_values_df = pd.DataFrame({'missing_values': missing_values, 'missing_values_percentage': missing_values_percentage})
print('The number of missing values in each column, and the percentage of missing values in each column are:')
print(missing_values_df)

The number of missing values in each column, and the percentage of missing values in each column are:
                           missing_values  missing_values_percentage
YearStart                               0                   0.000000
YearEnd                                 0                   0.000000
LocationAbbr                            0                   0.000000
LocationDesc                            0                   0.000000
DataSource                              0                   0.000000
Topic                                   0                   0.000000
Question                                0                   0.000000
Response                          1185676                 100.000000
DataValueUnit                      152123                  12.830065
DataValueType                           0                   0.000000
DataValue                          378734                  31.942453
DataValueAlt                       381098                  32.141833
D
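
A long missing-value table is easier to read when it is sorted and limited to the columns that actually have gaps. The sketch below shows one way to do that on a toy frame; it relies on `isna().mean()`, which gives the missing fraction directly. The toy data is illustrative, and the column names in `summary` simply mirror the ones used above.

```python
import numpy as np
import pandas as pd

# Toy frame with a complete, a half-missing, and a fully missing column
toy = pd.DataFrame({
    "Topic": ["Asthma", "Asthma", "Asthma", "Asthma"],
    "DataValue": ["916", None, "12.5", None],
    "Response": [np.nan] * 4,
})

# isna().mean() is the fraction of missing entries per column
summary = pd.DataFrame({
    "missing_values": toy.isna().sum(),
    "missing_values_percentage": toy.isna().mean() * 100,
}).sort_values("missing_values_percentage", ascending=False)

# Show only the columns that have at least one missing value
print(summary[summary["missing_values"] > 0])
```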

In [12]:
# remove columns with 100% missing values
df_clean = df.dropna(axis=1, how='all')
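
`dropna(axis=1, how='all')` removes only the columns in which every value is missing; columns with partial gaps, such as `DataValueUnit`, are kept. The toy example below contrasts `how='all'` with `how='any'` and lists the dropped columns. The frame and its values are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Topic": ["Asthma", "Asthma", "Asthma"],
    "Response": [np.nan, np.nan, np.nan],    # entirely missing
    "DataValueUnit": [np.nan, "%", np.nan],  # partly missing
})

# how="all": drop a column only if every value is missing
all_missing = toy.dropna(axis=1, how="all")

# how="any": drop a column if it has at least one missing value
any_missing = toy.dropna(axis=1, how="any")

# Record which columns were removed so the step is auditable
dropped = toy.columns.difference(all_missing.columns).tolist()
print(dropped)
```

Applying the same `columns.difference` call to `df` and `df_clean` would name the columns removed here; the unique-value listing further down covers 24 of the original 34 columns.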

In [13]:
# print the number of unique values in each column
unique_values = df_clean.nunique()
print('The number of unique values in each column is:')
print(unique_values)

The number of unique values in each column is:
YearStart                       16
YearEnd                         16
LocationAbbr                    55
LocationDesc                    55
DataSource                      31
Topic                           17
Question                       203
DataValueUnit                   12
DataValueType                   19
DataValue                    50436
DataValueAlt                 41213
DataValueFootnoteSymbol         17
DatavalueFootnote               18
LowConfidenceLimit           22464
HighConfidenceLimit          24000
StratificationCategory1          3
Stratification1                 11
GeoLocation                     54
LocationID                      55
TopicID                         17
QuestionID                     203
DataValueTypeID                 19
StratificationCategoryID1        3
StratificationID1               13
dtype: int64
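
Most label and ID pairs above have matching counts (for example `Topic` and `TopicID`, both 17), but `Stratification1` has 11 distinct values against 13 for `StratificationID1`, and `GeoLocation` has 54 against 55 locations. A possible way to find where a label maps to more than one ID is to count distinct IDs per label. The sketch uses a toy frame; the labels and the `HISP` code are hypothetical and only illustrate the technique.

```python
import pandas as pd

# Toy frame: two different IDs share the label "Hispanic" (hypothetical)
toy = pd.DataFrame({
    "StratificationID1": ["GENM", "GENF", "OVR", "HIS", "HISP"],
    "Stratification1": ["Male", "Female", "Overall", "Hispanic", "Hispanic"],
})

# Count distinct IDs per label; more than one breaks the one-to-one mapping
ids_per_label = toy.groupby("Stratification1")["StratificationID1"].nunique()
ambiguous = ids_per_label[ids_per_label > 1]
print(ambiguous)
```

Running the same `groupby` on `df_clean` would show which stratification labels account for the two extra IDs. Note that `groupby` skips rows whose label is missing, so IDs attached to a missing label would need a separate check.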
