# ETL Project: Leading causes of death in the United States

#### Aim:
An analysis of the top 10 leading causes of death in the United States. This case study analyzes the top causes of death from 2013 to 2016 by ranking causes of death by year, state, and age-adjusted deaths. It also asks whether U.S. chronic disease indicators provide any insight into risk factors that may be associated with those leading causes of death.

#### Notes:
* The 10 leading causes of death are classified by the International Classification of Diseases, Tenth Revision (ICD-10).
* Age-adjusted death rates are per 100,000 population, based on the 2000 U.S. standard population.


#### Data sources:
* [CDC/NCHS](https://catalog.data.gov/dataset/age-adjusted-death-rates-for-the-top-10-leading-causes-of-death-united-states-2013)

* [CDC](https://catalog.data.gov/dataset/u-s-chronic-disease-indicators-cdi)

#### Tools:
* Jupyter Notebook/Python 3
* MySQL

# Leading Causes Of Death In The United States
## Data Extraction

After downloading the CSV files from our sources, we loaded them into dataframes in a Jupyter Notebook (Python 3) using the pandas function pd.read_csv().

In [12]:
# import dependencies
import pandas as pd
import pymysql
from sqlalchemy import create_engine
engine = create_engine("mysql://root:Codingislife92!@localhost:3306/leading_causes_of_death_db")

In [2]:
# read csv file for data on leading causes of death
leading_causes_death_df = pd.read_csv('Resources/NCHS_-_Leading_Causes_of_Death__United_States.csv')

In [3]:
# preview the dataframe created
leading_causes_death_df.head()

Unnamed: 0,Year,113 Cause Name,Cause Name,State,Deaths,Age-adjusted Death Rate
0,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alabama,2755,55.5
1,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alaska,439,63.1
2,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arizona,4010,54.2
3,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arkansas,1604,51.8
4,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,California,13213,32.0


## Data Transformation
Due to the specificity contained in the column '113 Cause Name' in the leading_causes_death dataframe, it was decided that broader categories of death causes should be used to more closely match the Chronic Disease Indicators' categories for easier grouping of the two datasets. We therefore used the column 'Cause Name' as our main category.

The column 'Age-adjusted Death Rate' was removed because the data source did not document it clearly and we could not use it to rank leading causes of death in a meaningful way. Columns were also renamed for easier analysis in Python 3.

In [4]:
## cleaning the dataframe

# Drop the '113 Cause Name' and 'Age-adjusted Death Rate' columns, then rename the remaining columns in a new dataframe
leading_causes = leading_causes_death_df[['Year', 'Cause Name', 'State','Deaths']].copy()
leading_causes = leading_causes.rename(columns={
    "Year":"year",
    "Cause Name": "cause_name",
    "State": "state",
    "Deaths": "deaths"
})
leading_causes.sort_values(by=["year"]).head()

Unnamed: 0,year,cause_name,state,deaths
10295,1999,Unintentional injuries,Wyoming,258
1418,1999,Alzheimer's disease,Minnesota,1083
8726,1999,Suicide,Illinois,1020
7016,1999,Kidney disease,Michigan,1417
4334,1999,Diabetes,New Hampshire,294


Since the data sets end in 2016, we chose the time frame from 2013 to 2016. This four-year period was used to sample for any possible relationships between leading causes of death and chronic disease indicators.
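As a quick check of the window, the sketch below filters a small made-up dataframe (not the real data) to the years 2013 through 2016. The names `sample`, `window`, and `years_covered` are illustrative assumptions, and `between()` is just one way to express the same filter as the `>= 2013` comparison used in the notebook.

```python
import pandas as pd

# Hypothetical miniature of the cleaned `leading_causes` dataframe
sample = pd.DataFrame({
    "year": [2012, 2013, 2014, 2015, 2016],
    "cause_name": ["Stroke"] * 5,
    "state": ["Ohio"] * 5,
    "deaths": [10, 11, 12, 13, 14],
})

# between() is inclusive on both ends by default, so 2013 and 2016 are kept
window = sample[sample["year"].between(2013, 2016)]
years_covered = sorted(window["year"].unique().tolist())
print(years_covered)
```

Because the source data stops at 2016, the upper bound is redundant here, but stating it makes the intended window explicit.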


In [5]:
# Filter leading_causes for years >=2013
leading_causes_2013_to_2016 = leading_causes.loc[leading_causes['year'] >= 2013,:]
leading_causes_2013_to_2016=leading_causes_2013_to_2016.sort_values(by=['year'])
leading_causes_2013_to_2016.head()

Unnamed: 0,year,cause_name,state,deaths
10281,2013,Unintentional injuries,Wyoming,325
3708,2013,CLRD,Virginia,3181
3726,2013,CLRD,Washington,2933
8946,2013,Suicide,Montana,243
3744,2013,CLRD,West Virginia,1590


One of the categories in "cause_name" is "All causes". Since this is a summation of all the categories in "cause_name", it was decided that the "All causes" category should be removed from the dataset.

### Death analysis in US

Since the data set combines rows for the United States as a whole with rows for the individual states, we separated the two so that we could look at the leading causes of death in the United States and in each individual state by year.

In [6]:
# Remove all causes from dataframe
leading_causes_2013_to_2016_individual = leading_causes_2013_to_2016[leading_causes_2013_to_2016.cause_name != "All causes"]
leading_causes_2013_to_2016_individual.head()

Unnamed: 0,year,cause_name,state,deaths
10281,2013,Unintentional injuries,Wyoming,325
3708,2013,CLRD,Virginia,3181
3726,2013,CLRD,Washington,2933
8946,2013,Suicide,Montana,243
3744,2013,CLRD,West Virginia,1590


After dropping "All causes", we looked at the top ten causes of death for the entire United States. Because the dataset contains values for every state as well as for the United States as a whole, we look at the national figures first and then at the states individually. For example, California's top 10 leading causes of death may differ from the national top 10. Therefore we have to observe the states separately from the United States. Below are the results for the top 10 leading causes of death in the United States for the years 2013-2016.

In [49]:
# Separate US from individual states (2013 to 2016)
leading_causes_2013_to_2016_individual_US = leading_causes_2013_to_2016_individual[leading_causes_2013_to_2016_individual.state == "United States"]
leading_causes_2013_to_2016_individual_US.head()

#leading cause of death in the United States for 2013
leading_causes_2013_individual_US = leading_causes_2013_to_2016_individual_US.loc[leading_causes_2013_to_2016_individual_US['year']==2013,:]
leading_causes_2013_individual_US.sort_values(by=['deaths'], ascending=False).head(10)

Unnamed: 0,year,cause_name,state,deaths
5528,2013,Heart disease,United States,611105
2718,2013,Cancer,United States,584881
3654,2013,CLRD,United States,149205
10162,2013,Unintentional injuries,United States,130557
8334,2013,Stroke,United States,128978
1782,2013,Alzheimer's disease,United States,84767
4590,2013,Diabetes,United States,75578
6462,2013,Influenza and pneumonia,United States,56979
7399,2013,Kidney disease,United States,47112
9270,2013,Suicide,United States,41149


In [77]:
#leading cause of death in the United States for 2014
leading_causes_2014_individual_US = leading_causes_2013_to_2016_individual_US.loc[leading_causes_2013_to_2016_individual_US['year']==2014,:]
leading_causes_2014_individual_US.sort_values(by=['deaths'], ascending=False).head(10)




Unnamed: 0,year,cause_name,state,deaths
5527,2014,Heart disease,United States,614348
2717,2014,Cancer,United States,591700
3653,2014,CLRD,United States,147101
10161,2014,Unintentional injuries,United States,135928
8333,2014,Stroke,United States,133103
1781,2014,Alzheimer's disease,United States,93541
4589,2014,Diabetes,United States,76488
6461,2014,Influenza and pneumonia,United States,55227
7398,2014,Kidney disease,United States,48146
9269,2014,Suicide,United States,42826


In [82]:
#leading cause of death in the United States for 2015
leading_causes_2015_individual_US = leading_causes_2013_to_2016_individual_US.loc[leading_causes_2013_to_2016_individual_US['year']==2015,:]
leading_causes_2015_individual_US.sort_values(by=['deaths'], ascending=False).head(10)



Unnamed: 0,year,cause_name,state,deaths
5526,2015,Heart disease,United States,633842
2716,2015,Cancer,United States,595930
3652,2015,CLRD,United States,155041
10160,2015,Unintentional injuries,United States,146571
8332,2015,Stroke,United States,140323
1780,2015,Alzheimer's disease,United States,110561
4588,2015,Diabetes,United States,79535
6460,2015,Influenza and pneumonia,United States,57062
7397,2015,Kidney disease,United States,49959
9268,2015,Suicide,United States,44193


In [83]:
#leading cause of death in the United States for 2016
leading_causes_2016_individual_US = leading_causes_2013_to_2016_individual_US.loc[leading_causes_2013_to_2016_individual_US['year']==2016,:]
leading_causes_2016_individual_US.sort_values(by=['deaths'], ascending=False).head(10)


Unnamed: 0,year,cause_name,state,deaths
5525,2016,Heart disease,United States,635260
2715,2016,Cancer,United States,598038
10159,2016,Unintentional injuries,United States,161374
3651,2016,CLRD,United States,154596
8331,2016,Stroke,United States,142142
1779,2016,Alzheimer's disease,United States,116103
4587,2016,Diabetes,United States,80058
6459,2016,Influenza and pneumonia,United States,51537
7396,2016,Kidney disease,United States,50046
9267,2016,Suicide,United States,44965
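The four per-year cells above repeat the same filter-and-sort pattern. As a rough sketch, the same ranking can be produced for every year in one pass by sorting and then taking the first rows of each year group. The helper `top_causes_per_year` and the miniature `us_rows` dataframe are hypothetical; the death counts are copied from the tables above.

```python
import pandas as pd

def top_causes_per_year(df, n=10):
    """Return the n rows with the most deaths for each year."""
    ranked = df.sort_values(["year", "deaths"], ascending=[True, False])
    return ranked.groupby("year").head(n)

# Hypothetical miniature of the United States rows (figures from the tables above)
us_rows = pd.DataFrame({
    "year": [2013, 2013, 2013, 2016, 2016, 2016],
    "cause_name": ["Cancer", "Heart disease", "CLRD",
                   "CLRD", "Unintentional injuries", "Heart disease"],
    "state": ["United States"] * 6,
    "deaths": [584881, 611105, 149205, 154596, 161374, 635260],
})

# n=2 keeps the toy output short; the notebook uses the top 10
top_2 = top_causes_per_year(us_rows, n=2)
print(top_2)
```

In this toy data the second-ranked cause differs between 2013 and 2016 only because the sample is truncated; in the full tables, the visible change is that unintentional injuries overtakes CLRD for third place in 2016.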


In [72]:

# ALL US 2013 Top 10 leading deaths
# Isolate only 2013 data
top_10_leading_2013_US = leading_causes_2013_to_2016_individual_US.loc[leading_causes_2013_to_2016_individual_US['year']==2013,:]
top_10_leading_2013_US = top_10_leading_2013_US.sort_values(by=['deaths'], ascending=False).head(10)
top_10_leading_2013_US

Unnamed: 0,year,cause_name,state,deaths
5528,2013,Heart disease,United States,611105
2718,2013,Cancer,United States,584881
3654,2013,CLRD,United States,149205
10162,2013,Unintentional injuries,United States,130557
8334,2013,Stroke,United States,128978
1782,2013,Alzheimer's disease,United States,84767
4590,2013,Diabetes,United States,75578
6462,2013,Influenza and pneumonia,United States,56979
7399,2013,Kidney disease,United States,47112
9270,2013,Suicide,United States,41149


In [73]:
# ALL US 2014 Top 10 leading deaths
# Isolate only 2014 data
top_10_leading_2014_US = leading_causes_2013_to_2016_individual_US.loc[leading_causes_2013_to_2016_individual_US['year']==2014,:]
top_10_leading_2014_US = top_10_leading_2014_US.sort_values(by=['deaths'], ascending=False).head(10)
top_10_leading_2014_US

Unnamed: 0,year,cause_name,state,deaths
5527,2014,Heart disease,United States,614348
2717,2014,Cancer,United States,591700
3653,2014,CLRD,United States,147101
10161,2014,Unintentional injuries,United States,135928
8333,2014,Stroke,United States,133103
1781,2014,Alzheimer's disease,United States,93541
4589,2014,Diabetes,United States,76488
6461,2014,Influenza and pneumonia,United States,55227
7398,2014,Kidney disease,United States,48146
9269,2014,Suicide,United States,42826


In [74]:
# ALL US 2015 Top 10 leading deaths
# Isolate only 2015 data
top_10_leading_2015_US = leading_causes_2013_to_2016_individual_US.loc[leading_causes_2013_to_2016_individual_US['year']==2015,:]
top_10_leading_2015_US = top_10_leading_2015_US.sort_values(by=['deaths'], ascending=False).head(10)
top_10_leading_2015_US

Unnamed: 0,year,cause_name,state,deaths
5526,2015,Heart disease,United States,633842
2716,2015,Cancer,United States,595930
3652,2015,CLRD,United States,155041
10160,2015,Unintentional injuries,United States,146571
8332,2015,Stroke,United States,140323
1780,2015,Alzheimer's disease,United States,110561
4588,2015,Diabetes,United States,79535
6460,2015,Influenza and pneumonia,United States,57062
7397,2015,Kidney disease,United States,49959
9268,2015,Suicide,United States,44193


### Death analysis per state

### Based on the state totals below, the 10 states with the highest death counts are California, Florida, Texas, New York, Pennsylvania, Ohio, Illinois, Michigan, North Carolina, and Georgia

In [53]:
#engine = create_engine("mysql://root:Codingislife92!@localhost:3306/leading_causes_of_death_db")
pymysql.install_as_MySQLdb()
leading_causes_2013_to_2016.to_sql(name='leading_causes', con=engine, if_exists='append', index=False)

In [54]:
pd.read_sql_query('SELECT * FROM leading_causes', con=engine).head()

Unnamed: 0,year,cause_name,state,deaths
0,2013,Unintentional injuries,Wyoming,325
1,2013,CLRD,Virginia,3181
2,2013,CLRD,Washington,2933
3,2013,Suicide,Montana,243
4,2013,CLRD,West Virginia,1590
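One thing to keep in mind with `if_exists='append'` is that every run of the load cell adds the same rows again. The sketch below illustrates the difference between `'append'` and `'replace'`. It assumes an in-memory SQLite database as a stand-in for the MySQL engine, and the `rows` dataframe is a made-up two-row sample.

```python
import sqlite3
import pandas as pd

# Assumption: an in-memory SQLite database stands in for the MySQL engine
conn = sqlite3.connect(":memory:")

rows = pd.DataFrame({
    "year": [2013, 2013],
    "cause_name": ["CLRD", "Suicide"],
    "state": ["Virginia", "Montana"],
    "deaths": [3181, 243],
})

# Running an 'append' load twice stores every row twice
rows.to_sql(name="leading_causes", con=conn, if_exists="append", index=False)
rows.to_sql(name="leading_causes", con=conn, if_exists="append", index=False)
appended = pd.read_sql_query(
    "SELECT COUNT(*) AS n FROM leading_causes", con=conn)["n"].iloc[0]

# 'replace' drops and recreates the table, so a re-run leaves one copy
rows.to_sql(name="leading_causes", con=conn, if_exists="replace", index=False)
replaced = pd.read_sql_query(
    "SELECT COUNT(*) AS n FROM leading_causes", con=conn)["n"].iloc[0]

print(appended, replaced)
```

If the notebook is likely to be re-run from the top, `'replace'` keeps the table contents stable, at the cost of discarding whatever was in the table before.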


In [55]:
# Deaths per state
state_deaths = pd.read_sql_query('SELECT state as State, deaths as total_state_deaths \
                FROM leading_causes', con=engine)
# Group by state
deaths_state = state_deaths.groupby(['State'])
# count total number of deaths per state
deaths_state_df = deaths_state['total_state_deaths'].sum()
deaths_per_state = deaths_state_df.sort_values(ascending=False)
deaths_per_state

State
United States           36302742
California               3594654
Florida                  2641190
Texas                    2567016
New York                 2134936
Pennsylvania             1812200
Ohio                     1620386
Illinois                 1470064
Michigan                 1329726
North Carolina           1203290
Georgia                  1081434
New Jersey                995432
Tennessee                 918922
Virginia                  888782
Indiana                   858830
Missouri                  830076
Massachusetts             761876
Arizona                   741792
Washington                740866
Alabama                   710528
Wisconsin                 707046
Maryland                  640944
South Carolina            640628
Kentucky                  639390
Louisiana                 611966
Minnesota                 572790
Oklahoma                  551108
Colorado                  493130
Oregon                    477310
Mississippi               439700
Arka

In [56]:
# Adding selected columns to mysql
state_deaths.to_sql(name='state_deaths', con=engine, if_exists='append', index=False)

In [57]:
# Deaths in each state per cause per year
state_cause_deaths = pd.read_sql_query('SELECT state, cause_name, year, deaths as total_state_deaths \
                FROM leading_causes', con=engine)
# Group by state, cause name, year
deaths_state = state_cause_deaths.groupby(['state', 'cause_name', 'year'])
# count total deaths
deaths_state_df = deaths_state['total_state_deaths'].sum()
# order the result by state, cause name, and year
deaths_state_df = deaths_state_df.sort_index()
print(deaths_state_df)


state    cause_name               year
Alabama  All causes               2013    100378
                                  2014    100430
                                  2015    103818
                                  2016    104932
         Alzheimer's disease      2013      2796
                                  2014      3770
                                  2015      4564
                                  2016      5014
         CLRD                     2013      6086
                                  2014      6100
                                  2015      6558
                                  2016      6652
         Cancer                   2013     20656
                                  2014     20572
                                  2015     20708
                                  2016     20838
         Diabetes                 2013      2698
                                  2014      2562
                                  2015     

In [58]:
# Adding selected data to mysql
state_cause_deaths.to_sql(name='state_cause_deaths', con=engine, if_exists='append', index=False)

In [59]:
# Deaths per cause
death_causes = pd.read_sql_query('SELECT cause_name, deaths as total_state_deaths \
                FROM leading_causes', con=engine)
# Group by cause name
state_causes_group = death_causes.groupby(['cause_name'])
# count total number of deaths per cause
state_causes_df = state_causes_group['total_state_deaths'].sum()
state_causes_df.sort_values(ascending=False)

cause_name
All causes                 41871610
Heart disease               9978220
Cancer                      9482196
CLRD                        2423772
Unintentional injuries      2297720
Stroke                      2178184
Alzheimer's disease         1619888
Diabetes                    1246636
Influenza and pneumonia      883220
Kidney disease               781052
Suicide                      692532
Name: total_state_deaths, dtype: int64

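### As per above analysis, top causes of death are Heart disease, followed by Cancer and then CLRD (chronic lower respiratory diseases).

"All causes" is an aggregate rather than a cause of death, and these totals sum the "United States" rows together with the individual state rows, so they overstate the national counts shown in the per-year tables.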
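The cause totals above are four times the sum of the national per-year figures (for heart disease, 611,105 + 614,348 + 633,842 + 635,260 = 2,494,555, and 4 × 2,494,555 = 9,978,220). That would be consistent with the "United States" rows being summed alongside the state rows and the table having been loaded twice. A hedged sketch of a query that avoids the first problem is shown below. It assumes SQLite in place of MySQL and uses made-up rows; only the table and column names come from the notebook.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for MySQL

# Hypothetical rows in the shape of the `leading_causes` table
pd.DataFrame({
    "year": [2013, 2013, 2013, 2013, 2013],
    "cause_name": ["All causes", "Heart disease", "Heart disease",
                   "Cancer", "Heart disease"],
    "state": ["Ohio", "Ohio", "Texas", "Ohio", "United States"],
    "deaths": [100, 40, 30, 25, 70],
}).to_sql("leading_causes", conn, index=False)

# Exclude the aggregate cause and the national rows before summing
query = """
    SELECT cause_name, SUM(deaths) AS total_deaths
    FROM leading_causes
    WHERE cause_name <> 'All causes'
      AND state <> 'United States'
    GROUP BY cause_name
    ORDER BY total_deaths DESC
"""
cause_totals = pd.read_sql_query(query, con=conn)
print(cause_totals)
```

Doing the aggregation in SQL also removes the need for the separate pandas `groupby` step after the query.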

In [60]:
death_causes.to_sql(name='death_causes', con=engine, if_exists='append', index=False)

In [21]:
# Separate out states from 2013 to 2016
leading_causes_2013_to_2016_individual_states = leading_causes_2013_to_2016_individual[leading_causes_2013_to_2016_individual.state != "United States"]
leading_causes_2013_to_2016_individual_states.head()

Unnamed: 0,year,cause_name,state,deaths
10281,2013,Unintentional injuries,Wyoming,325
3708,2013,CLRD,Virginia,3181
3726,2013,CLRD,Washington,2933
8946,2013,Suicide,Montana,243
3744,2013,CLRD,West Virginia,1590


In [22]:
# ALL STATES 2013 Top 10 leading deaths and which states had the most deaths in the year 2013
# Isolate only 2013 data
leading_2013_states = leading_causes_2013_to_2016_individual_states.loc[leading_causes_2013_to_2016_individual_states['year']==2013,:]
top_10_leading_2013_states_all = leading_2013_states.sort_values(by=['deaths'], ascending=False).head(10)
top_10_leading_2013_states_all

Unnamed: 0,year,cause_name,state,deaths
4806,2013,Heart disease,California,60299
1998,2013,Cancer,California,57714
5311,2013,Heart disease,New York,44039
2088,2013,Cancer,Florida,42735
4896,2013,Heart disease,Florida,42656
5510,2013,Heart disease,Texas,40203
2700,2013,Cancer,Texas,38412
2502,2013,Cancer,New York,35738
5419,2013,Heart disease,Pennsylvania,31629
2610,2013,Cancer,Pennsylvania,28512


In [23]:
# Top 10 unique states 2013
# Sort data by number of deaths from highest to lowest, keep only each state's first (largest) row for this year, and obtain the top 10 states
top_10_leading_2013_states_unique = leading_2013_states.sort_values(by=['deaths'], ascending=False).drop_duplicates(subset='state', keep='first').head(10)
top_10_leading_2013_states_unique

Unnamed: 0,year,cause_name,state,deaths
4806,2013,Heart disease,California,60299
5311,2013,Heart disease,New York,44039
2088,2013,Cancer,Florida,42735
5510,2013,Heart disease,Texas,40203
5419,2013,Heart disease,Pennsylvania,31629
5365,2013,Heart disease,Ohio,26878
4968,2013,Heart disease,Illinois,24839
5131,2013,Heart disease,Michigan,24156
2520,2013,Cancer,North Carolina,18589
5275,2013,Heart disease,New Jersey,18460
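Note that each row in the "unique states" table is a state's single largest cause, so the ranking compares states by their biggest single-cause count rather than by total deaths. The sketch below shows the sort-then-`drop_duplicates` approach next to a `groupby(...).idxmax()` alternative on a small subset of the 2013 rows above. The names `states_2013`, `by_sort`, and `by_idxmax` are illustrative, and the two approaches are only guaranteed to agree when no state has tied counts.

```python
import pandas as pd

# Hypothetical subset of the 2013 state rows shown above
states_2013 = pd.DataFrame({
    "cause_name": ["Heart disease", "Cancer", "Cancer", "Heart disease"],
    "state": ["California", "California", "Florida", "Florida"],
    "deaths": [60299, 57714, 42735, 42656],
})

# Approach used in the notebook: sort, then keep the first row per state
by_sort = (states_2013.sort_values("deaths", ascending=False)
           .drop_duplicates(subset="state", keep="first"))

# Alternative: pick the row index of the largest count within each state
by_idxmax = states_2013.loc[states_2013.groupby("state")["deaths"].idxmax()]

print(by_sort[["state", "cause_name", "deaths"]])
```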


In [24]:
# ALL STATES 2014 Top 10 leading deaths and which states had the most deaths in the year 2014
# Isolate only 2014 data
leading_2014_states = leading_causes_2013_to_2016_individual_states.loc[leading_causes_2013_to_2016_individual_states['year']==2014,:]
top_10_leading_2014_states_all = leading_2014_states.sort_values(by=['deaths'], ascending=False).head(10)
top_10_leading_2014_states_all

Unnamed: 0,year,cause_name,state,deaths
1997,2014,Cancer,California,58412
4805,2014,Heart disease,California,58189
4895,2014,Heart disease,Florida,44511
2087,2014,Cancer,Florida,43212
5310,2014,Heart disease,New York,43116
5509,2014,Heart disease,Texas,41479
2699,2014,Cancer,Texas,38847
2501,2014,Cancer,New York,35392
5418,2014,Heart disease,Pennsylvania,31353
2609,2014,Cancer,Pennsylvania,28692


In [25]:
# Top 10 unique states 2014
# Sort data by number of deaths from highest to lowest, keep only each state's first (largest) row for this year, and obtain the top 10 states
top_10_leading_2014_states_unique = leading_2014_states.sort_values(by=['deaths'], ascending=False).drop_duplicates(subset='state', keep='first').head(10)
top_10_leading_2014_states_unique

Unnamed: 0,year,cause_name,state,deaths
1997,2014,Cancer,California,58412
4895,2014,Heart disease,Florida,44511
5310,2014,Heart disease,New York,43116
5509,2014,Heart disease,Texas,41479
5418,2014,Heart disease,Pennsylvania,31353
5364,2014,Heart disease,Ohio,27000
4967,2014,Heart disease,Illinois,25024
5130,2014,Heart disease,Michigan,24692
2519,2014,Cancer,North Carolina,19342
5274,2014,Heart disease,New Jersey,18319


In [26]:
# ALL STATES 2015 Top 10 leading deaths and which states had the most deaths in the year 2015
# Isolate only 2015 data
leading_2015_states = leading_causes_2013_to_2016_individual_states.loc[leading_causes_2013_to_2016_individual_states['year']==2015,:]
top_10_leading_2015_states_all = leading_2015_states.sort_values(by=['deaths'], ascending=False).head(10)
top_10_leading_2015_states_all

Unnamed: 0,year,cause_name,state,deaths
4804,2015,Heart disease,California,61289
1996,2015,Cancer,California,59629
4894,2015,Heart disease,Florida,45441
5309,2015,Heart disease,New York,44450
2086,2015,Cancer,Florida,44027
5508,2015,Heart disease,Texas,43298
2698,2015,Cancer,Texas,39121
2500,2015,Cancer,New York,35089
5417,2015,Heart disease,Pennsylvania,32042
2608,2015,Cancer,Pennsylvania,28697


In [27]:
# Top 10 unique states 2015
# Sort data by number of deaths from highest to lowest, keep only each state's first (largest) row for this year, and obtain the top 10 states
top_10_leading_2015_states_unique = leading_2015_states.sort_values(by=['deaths'], ascending=False).drop_duplicates(subset='state', keep='first').head(10)
top_10_leading_2015_states_unique

Unnamed: 0,year,cause_name,state,deaths
4804,2015,Heart disease,California,61289
4894,2015,Heart disease,Florida,45441
5309,2015,Heart disease,New York,44450
5508,2015,Heart disease,Texas,43298
5417,2015,Heart disease,Pennsylvania,32042
5363,2015,Heart disease,Ohio,28069
4966,2015,Heart disease,Illinois,25652
5129,2015,Heart disease,Michigan,24794
2518,2015,Cancer,North Carolina,19322
5273,2015,Heart disease,New Jersey,18647


In [28]:
# ALL STATES 2016 Top 10 leading deaths and which states had the most deaths in the year 2016
# Isolate only 2016 data
leading_2016_states = leading_causes_2013_to_2016_individual_states.loc[leading_causes_2013_to_2016_individual_states['year']==2016,:]
top_10_leading_2016_states_all = leading_2016_states.sort_values(by=['deaths'], ascending=False).head(10)
top_10_leading_2016_states_all

Unnamed: 0,year,cause_name,state,deaths
4803,2016,Heart disease,California,61573
1995,2016,Cancer,California,59515
4893,2016,Heart disease,Florida,45659
2085,2016,Cancer,Florida,44266
5308,2016,Heart disease,New York,44076
5506,2016,Heart disease,Texas,43772
2697,2016,Cancer,Texas,40195
2499,2016,Cancer,New York,35368
5416,2016,Heart disease,Pennsylvania,31990
2607,2016,Cancer,Pennsylvania,28492


In [29]:
# Top 10 unique states 2016
# Sort data by number of deaths from highest to lowest, keep only each state's first (largest) row for this year, and obtain the top 10 states
top_10_leading_2016_states_unique = leading_2016_states.sort_values(by=['deaths'], ascending=False).drop_duplicates(subset='state', keep='first').head(10)
top_10_leading_2016_states_unique

Unnamed: 0,year,cause_name,state,deaths
4803,2016,Heart disease,California,61573
4893,2016,Heart disease,Florida,45659
5308,2016,Heart disease,New York,44076
5506,2016,Heart disease,Texas,43772
5416,2016,Heart disease,Pennsylvania,31990
5362,2016,Heart disease,Ohio,27410
5128,2016,Heart disease,Michigan,25304
4965,2016,Heart disease,Illinois,25013
2517,2016,Cancer,North Carolina,19523
5272,2016,Heart disease,New Jersey,18597


# Chronic Disease Indicators

#### Column Descriptions:
* YearStart - Starting Year
* YearEnd - Ending Year
* LocationAbbr - Location Abbreviation
* LocationDesc - Location Description
* DataSource - Data Source Abbreviation
* Topic - Topic
* Question - Question full-length text
* Response - Response
* DataValueUnit - The unit, such as $, %, years, etc.
* DataValueType - The data type, such as prevalence or mean
* DataValue - Data Value, such as 14.7 or Category 1
* DataValueAlt - Equal to Data Value, but formatting is numeric
* DataValueFootnoteSymbol	- Footnote Symbol
* DatavalueFootnote - Footnote Text
* LowConfidenceLimit - Low Confidence Limit
* HighConfidenceLimit - High Confidence Limit
* StratificationCategory1 - The category of the stratification, such as Gender, Overall, or Race/Ethnicity
* Stratification1 - The stratification within the category, such as Male or Female, White/non-Hispanic, Hispanic, Black/non-Hispanic, American Indian or Alaska Native, Asian or Pacific Islander, Multi-racial/non-Hispanic, Other/non-Hispanic, or Overall
* StratificationCategory2
* Stratification2	
* StratificationCategory3
* Stratification3	
* GeoLocation	- Location code to be used for Geocoding
* ResponseID - Identifier for the Response
* LocationID - Location Identifier
* TopicID	
* QuestionID - Question Identifier
* DataValueTypeID	- Identifier for the Data Value Type
* StratificationCategoryID1 - Identifier for stratification category 1
* StratificationID1 - Identifier for stratification 1
* StratificationCategoryID2	
* StratificationID2	
* StratificationCategoryID3	
* StratificationID3	


In [30]:
# read chronic diseases indicators csv
CDI_df = pd.read_csv('Resources/U.S._Chronic_Disease_Indicators__CDI_.csv')
CDI_df.shape

(519718, 34)

Several columns in this dataset contained missing values or information not relevant to the analysis.

The steps below:
* removed unnecessary columns that contained missing or repetitive information
* separated datasets by location of either 'United States' or individual states
* categorized two larger datasets by either 'Crude Prevalence' or 'Age-adjusted Prevalence', which was then stratified by either overall, male, or female gender types ('Stratification1' column). 
* separated each gender type by year (2013, 2014, 2015, 2016)
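The splitting steps listed above all amount to combining a few boolean filters. A minimal sketch, using a made-up miniature of the CDI data and a hypothetical helper named `subset`, might look like this (column names are taken from the column descriptions; the values are illustrative):

```python
import pandas as pd

# Hypothetical miniature of the CDI data, using the column names listed above
cdi = pd.DataFrame({
    "YearStart": [2016, 2016, 2015, 2016],
    "LocationAbbr": ["US", "US", "US", "AL"],
    "Topic": ["Alcohol"] * 4,
    "DataValue": [16.9, 21.9, 16.3, 13.0],
    "DataValueType": ["Crude Prevalence"] * 4,
    "Stratification1": ["Overall", "Male", "Overall", "Overall"],
})

def subset(df, location, value_type, stratum, year):
    """Filter to one location, prevalence type, gender stratum, and year."""
    mask = (
        (df["LocationAbbr"] == location)
        & (df["DataValueType"] == value_type)
        & (df["Stratification1"] == stratum)
        & (df["YearStart"] == year)
    )
    return df.loc[mask]

us_crude_overall_2016 = subset(cdi, "US", "Crude Prevalence", "Overall", 2016)
print(us_crude_overall_2016)
```

A single parameterized filter like this can stand in for a separate dataframe per location, prevalence type, gender, and year.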

In [31]:
## CLEAN UP DATAFRAME
# Select desired columns
CDI_df = CDI_df[['YearStart', 'LocationAbbr', 'LocationDesc', 'Topic', 'DataValue', 'DataValueUnit', 'DataValueType', 'StratificationCategory1', 'Stratification1']]

CDI_df.shape

(519718, 9)

In [32]:
# Drop any other rows with missing data
CDI_df = CDI_df.dropna(subset = ['DataValue', 'DataValueUnit'])
CDI_df.shape

(336660, 9)

In [33]:
# Remove rows where 'DataValue' == ' '
CDI_df = CDI_df.loc[CDI_df['DataValue'] != ' ']

In [34]:
# Convert DataValue from string to float
CDI_df['DataValue'] = CDI_df['DataValue'].astype(float)
CDI_df.head()

Unnamed: 0,YearStart,LocationAbbr,LocationDesc,Topic,DataValue,DataValueUnit,DataValueType,StratificationCategory1,Stratification1
0,2016,US,United States,Alcohol,16.9,%,Crude Prevalence,Overall,Overall
1,2016,AL,Alabama,Alcohol,13.0,%,Crude Prevalence,Overall,Overall
2,2016,AK,Alaska,Alcohol,18.2,%,Crude Prevalence,Overall,Overall
3,2016,AZ,Arizona,Alcohol,15.6,%,Crude Prevalence,Overall,Overall
4,2016,AR,Arkansas,Alcohol,15.0,%,Crude Prevalence,Overall,Overall
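The blank-removal and float-conversion cells above can also be expressed as a single coercion step. The sketch below is an alternative under stated assumptions, not what the notebook ran: it uses `pd.to_numeric` with `errors="coerce"` on a made-up series, so blanks and non-numeric labels (the column description mentions values such as "Category 1") become NaN and can be dropped together.

```python
import pandas as pd

# Hypothetical raw values: numbers stored as text, a blank, and a non-numeric label
raw = pd.Series(["16.9", " ", "13.0", "Category 1"], name="DataValue")

# errors="coerce" turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")
cleaned = numeric.dropna()
print(cleaned.tolist())
```

The trade-off is that coercion silently discards unexpected values, so it is worth counting how many rows were dropped before relying on the result.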


## Chronic Disease Indicators by United States

Next we looked at the chronic disease indicators for the United States only. The dataset describes disease prevalence for the United States as a whole and for the individual states. The following shows the process we used to clean the dataset so that it covers disease prevalence in the United States only, for the years 2013 to 2016.

Below are the results for the prevalence of each year for United States only.




In [35]:
# Obtain rows for the United States as a whole
CDI_US = CDI_df.loc[CDI_df['LocationAbbr']=="US",:]
CDI_US

Unnamed: 0,YearStart,LocationAbbr,LocationDesc,Topic,DataValue,DataValueUnit,DataValueType,StratificationCategory1,Stratification1
0,2016,US,United States,Alcohol,16.9,%,Crude Prevalence,Overall,Overall
55,2016,US,United States,Alcohol,17.9,%,Age-adjusted Prevalence,Overall,Overall
110,2016,US,United States,Alcohol,21.9,%,Crude Prevalence,Gender,Male
165,2016,US,United States,Alcohol,22.5,%,Age-adjusted Prevalence,Gender,Male
220,2016,US,United States,Alcohol,12.0,%,Crude Prevalence,Gender,Female
275,2016,US,United States,Alcohol,13.1,%,Age-adjusted Prevalence,Gender,Female
874,2016,US,United States,Alcohol,18.7,%,Crude Prevalence,Overall,Overall
1201,2016,US,United States,Alcohol,4.6,Number,Mean,Overall,Overall
1254,2016,US,United States,Alcohol,4.7,Number,Age-adjusted Mean,Overall,Overall
1310,2016,US,United States,Alcohol,5.0,Number,Mean,Gender,Male


In [36]:

# Obtain rows only for the year 2013
CDI_US_2013 = CDI_US.loc[CDI_US['YearStart']==2013, :]
CDI_US_2013

Unnamed: 0,YearStart,LocationAbbr,LocationDesc,Topic,DataValue,DataValueUnit,DataValueType,StratificationCategory1,Stratification1
125505,2013,US,United States,Asthma,10.7,"cases per 1,000,000",Age-adjusted Rate,Overall,Overall
125916,2013,US,United States,Asthma,11.5,"cases per 1,000,000",Crude Rate,Overall,Overall
126417,2013,US,United States,Chronic Kidney Disease,326.3,"cases per 1,000,000","Adjusted by age, sex, race and ethnicity",Overall,Overall
126469,2013,US,United States,Chronic Kidney Disease,115877.0,"cases per 1,000,000",Number,Overall,Overall
126521,2013,US,United States,Chronic Kidney Disease,144.8,"cases per 1,000,000","Adjusted by age, sex, race and ethnicity",Overall,Overall
126573,2013,US,United States,Chronic Kidney Disease,51070.0,"cases per 1,000,000",Number,Overall,Overall
129745,2013,US,United States,Cardiovascular Disease,80.4,%,Age-adjusted Prevalence,Gender,Female
129746,2013,US,United States,Cardiovascular Disease,76.7,%,Age-adjusted Prevalence,Gender,Male
129747,2013,US,United States,Cardiovascular Disease,78.5,%,Age-adjusted Prevalence,Overall,Overall
130183,2013,US,United States,Cardiovascular Disease,82.2,%,Crude Prevalence,Gender,Female


In [37]:
# Obtain rows only for the year 2014
CDI_US_2014 = CDI_US.loc[CDI_US['YearStart']==2014, :]
CDI_US_2014

Unnamed: 0,YearStart,LocationAbbr,LocationDesc,Topic,DataValue,DataValueUnit,DataValueType,StratificationCategory1,Stratification1
117329,2014,US,United States,Chronic Kidney Disease,118014.00,"cases per 1,000,000",Number,Overall,Overall
119724,2014,US,United States,Chronic Kidney Disease,327.50,"cases per 1,000,000","Adjusted by age, sex, race and ethnicity",Overall,Overall
119820,2014,US,United States,Chronic Kidney Disease,146.00,"cases per 1,000,000","Adjusted by age, sex, race and ethnicity",Overall,Overall
119871,2014,US,United States,Chronic Kidney Disease,52159.00,"cases per 1,000,000",Number,Overall,Overall
120252,2014,US,United States,Cardiovascular Disease,48.90,%,Crude Prevalence,Gender,Female
120253,2014,US,United States,Cardiovascular Disease,44.80,%,Crude Prevalence,Gender,Male
120254,2014,US,United States,Cardiovascular Disease,47.10,%,Crude Prevalence,Overall,Overall
120696,2014,US,United States,Cardiovascular Disease,81.40,%,Age-adjusted Prevalence,Gender,Female
120697,2014,US,United States,Cardiovascular Disease,77.50,%,Age-adjusted Prevalence,Gender,Male
120698,2014,US,United States,Cardiovascular Disease,79.40,%,Age-adjusted Prevalence,Overall,Overall


In [38]:
# Obtain rows only for the year 2015
CDI_US_2015 = CDI_US.loc[CDI_US['YearStart']==2015, :]
CDI_US_2015

Unnamed: 0,YearStart,LocationAbbr,LocationDesc,Topic,DataValue,DataValueUnit,DataValueType,StratificationCategory1,Stratification1
58824,2015,US,United States,Alcohol,32.80,%,Crude Prevalence,Overall,Overall
58875,2015,US,United States,Alcohol,17.70,%,Crude Prevalence,Overall,Overall
59257,2015,US,United States,Alcohol,12.60,%,Age-adjusted Prevalence,Gender,Female
59258,2015,US,United States,Alcohol,22.00,%,Age-adjusted Prevalence,Gender,Male
59259,2015,US,United States,Alcohol,17.00,%,Age-adjusted Prevalence,Overall,Overall
59694,2015,US,United States,Alcohol,11.70,%,Crude Prevalence,Gender,Female
59695,2015,US,United States,Alcohol,21.40,%,Crude Prevalence,Gender,Male
59696,2015,US,United States,Alcohol,16.30,%,Crude Prevalence,Overall,Overall
60036,2015,US,United States,Alcohol,17.70,%,Crude Prevalence,Overall,Overall
60459,2015,US,United States,Alcohol,3.40,Number,Age-adjusted Mean,Gender,Female


In [39]:
# Obtain rows only for the year 2016
CDI_US_2016 = CDI_US.loc[CDI_US['YearStart']==2016, :]
CDI_US_2016

Unnamed: 0,YearStart,LocationAbbr,LocationDesc,Topic,DataValue,DataValueUnit,DataValueType,StratificationCategory1,Stratification1
0,2016,US,United States,Alcohol,16.9,%,Crude Prevalence,Overall,Overall
55,2016,US,United States,Alcohol,17.9,%,Age-adjusted Prevalence,Overall,Overall
110,2016,US,United States,Alcohol,21.9,%,Crude Prevalence,Gender,Male
165,2016,US,United States,Alcohol,22.5,%,Age-adjusted Prevalence,Gender,Male
220,2016,US,United States,Alcohol,12.0,%,Crude Prevalence,Gender,Female
275,2016,US,United States,Alcohol,13.1,%,Age-adjusted Prevalence,Gender,Female
874,2016,US,United States,Alcohol,18.7,%,Crude Prevalence,Overall,Overall
1201,2016,US,United States,Alcohol,4.6,Number,Mean,Overall,Overall
1254,2016,US,United States,Alcohol,4.7,Number,Age-adjusted Mean,Overall,Overall
1310,2016,US,United States,Alcohol,5.0,Number,Mean,Gender,Male



Below, all the tables created in Python for the disease prevalence and causes-of-death datasets are uploaded into MySQL.


The rest of the data cleaning and analysis was done in MySQL. There, the disease prevalence data was cleaned further to keep only the ten highest disease prevalences.
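
The MySQL queries themselves are not shown in this notebook. As a minimal sketch of what a "top N" reduction could look like, the example below runs a comparable query against an in-memory SQLite database, used here only so the snippet runs without a MySQL server. The table name `cdi_sample`, the toy rows, and the choice to rank topics by their highest crude prevalence are all assumptions for illustration, not the project's actual query.

```python
import sqlite3

# Toy rows shaped like the CDI tables (illustrative values, not CDC data)
rows = [
    ("Topic A", 10.0, "Crude Prevalence"),
    ("Topic A", 12.5, "Crude Prevalence"),
    ("Topic B", 40.0, "Crude Prevalence"),
    ("Topic B", 99.0, "Age-adjusted Prevalence"),  # excluded by the WHERE clause
    ("Topic C", 5.0, "Crude Prevalence"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cdi_sample (Topic TEXT, DataValue REAL, DataValueType TEXT)"
)
conn.executemany("INSERT INTO cdi_sample VALUES (?, ?, ?)", rows)

# Rank topics by their highest crude prevalence and keep the first N
top_n_sql = """
SELECT Topic, MAX(DataValue) AS max_prevalence
FROM cdi_sample
WHERE DataValueType = 'Crude Prevalence'
GROUP BY Topic
ORDER BY max_prevalence DESC
LIMIT ?
"""
top_topics = conn.execute(top_n_sql, (2,)).fetchall()
print(top_topics)
```

For a top-ten list the limit would be 10. The same `ORDER BY ... DESC LIMIT` pattern is valid in MySQL.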



In [40]:
CDI_US_2013.to_sql(name='CDI_2013', con=engine, if_exists='append', index=False)



In [41]:
pd.read_sql_query('SELECT * FROM CDI_2013', con=engine).head()

Unnamed: 0,YearStart,LocationAbbr,LocationDesc,Topic,DataValue,DataValueUnit,DataValueType,StratificationCategory1,Stratification1
0,2013,US,United States,Asthma,10.7,"cases per 1,000,000",Age-adjusted Rate,Overall,Overall
1,2013,US,United States,Asthma,11.5,"cases per 1,000,000",Crude Rate,Overall,Overall
2,2013,US,United States,Chronic Kidney Disease,326.3,"cases per 1,000,000","Adjusted by age, sex, race and ethnicity",Overall,Overall
3,2013,US,United States,Chronic Kidney Disease,115877.0,"cases per 1,000,000",Number,Overall,Overall
4,2013,US,United States,Chronic Kidney Disease,144.8,"cases per 1,000,000","Adjusted by age, sex, race and ethnicity",Overall,Overall


In [42]:
CDI_US_2014.to_sql(name='CDI_2014', con=engine, if_exists='append', index=False)



In [43]:
pd.read_sql_query('SELECT * FROM CDI_2014', con=engine).head()

Unnamed: 0,YearStart,LocationAbbr,LocationDesc,Topic,DataValue,DataValueUnit,DataValueType,StratificationCategory1,Stratification1
0,2014,US,United States,Chronic Kidney Disease,118014.0,"cases per 1,000,000",Number,Overall,Overall
1,2014,US,United States,Chronic Kidney Disease,327.5,"cases per 1,000,000","Adjusted by age, sex, race and ethnicity",Overall,Overall
2,2014,US,United States,Chronic Kidney Disease,146.0,"cases per 1,000,000","Adjusted by age, sex, race and ethnicity",Overall,Overall
3,2014,US,United States,Chronic Kidney Disease,52159.0,"cases per 1,000,000",Number,Overall,Overall
4,2014,US,United States,Cardiovascular Disease,48.9,%,Crude Prevalence,Gender,Female


In [44]:
CDI_US_2015.to_sql(name='CDI_2015', con=engine, if_exists='append', index=False)



In [45]:
pd.read_sql_query('SELECT * FROM CDI_2015', con=engine).head()

Unnamed: 0,YearStart,LocationAbbr,LocationDesc,Topic,DataValue,DataValueUnit,DataValueType,StratificationCategory1,Stratification1
0,2015,US,United States,Alcohol,32.8,%,Crude Prevalence,Overall,Overall
1,2015,US,United States,Alcohol,17.7,%,Crude Prevalence,Overall,Overall
2,2015,US,United States,Alcohol,12.6,%,Age-adjusted Prevalence,Gender,Female
3,2015,US,United States,Alcohol,22.0,%,Age-adjusted Prevalence,Gender,Male
4,2015,US,United States,Alcohol,17.0,%,Age-adjusted Prevalence,Overall,Overall


In [46]:
CDI_US_2016.to_sql(name='CDI_2016', con=engine, if_exists='append', index=False)



In [47]:
pd.read_sql_query('SELECT * FROM CDI_2016', con=engine).head()

Unnamed: 0,YearStart,LocationAbbr,LocationDesc,Topic,DataValue,DataValueUnit,DataValueType,StratificationCategory1,Stratification1
0,2016,US,United States,Alcohol,16.9,%,Crude Prevalence,Overall,Overall
1,2016,US,United States,Alcohol,17.9,%,Age-adjusted Prevalence,Overall,Overall
2,2016,US,United States,Alcohol,21.9,%,Crude Prevalence,Gender,Male
3,2016,US,United States,Alcohol,22.5,%,Age-adjusted Prevalence,Gender,Male
4,2016,US,United States,Alcohol,12.0,%,Crude Prevalence,Gender,Female


In [63]:

leading_causes_2013_individual_US.to_sql(name='leadingcauses_2013', con=engine, if_exists='append', index=False)

In [64]:
pd.read_sql_query('SELECT * FROM leadingcauses_2013', con=engine).head()

Unnamed: 0,year,cause_name,state,deaths
0,2013,CLRD,United States,149205
1,2013,Diabetes,United States,75578
2,2013,Suicide,United States,41149
3,2013,Alzheimer's disease,United States,84767
4,2013,Cancer,United States,584881


In [78]:
leading_causes_2014_individual_US.to_sql(name='leadingcauses_2014', con=engine, if_exists='append', index=False)

In [84]:
pd.read_sql_query('SELECT * FROM leadingcauses_2014', con=engine).head()

Unnamed: 0,year,cause_name,state,deaths
0,2014,Influenza and pneumonia,United States,55227
1,2014,Heart disease,United States,614348
2,2014,Diabetes,United States,76488
3,2014,Stroke,United States,133103
4,2014,Cancer,United States,591700


In [85]:
leading_causes_2015_individual_US.to_sql(name='leadingcauses_2015', con=engine, if_exists='append', index=False)

In [86]:
pd.read_sql_query('SELECT * FROM leadingcauses_2015', con=engine).head()

Unnamed: 0,year,cause_name,state,deaths
0,2015,Stroke,United States,140323
1,2015,Unintentional injuries,United States,146571
2,2015,Influenza and pneumonia,United States,57062
3,2015,Kidney disease,United States,49959
4,2015,CLRD,United States,155041


In [87]:
leading_causes_2016_individual_US.to_sql(name='leadingcauses_2016', con=engine, if_exists='append', index=False)

In [88]:
pd.read_sql_query('SELECT * FROM leadingcauses_2016', con=engine).head()

Unnamed: 0,year,cause_name,state,deaths
0,2016,Stroke,United States,142142
1,2016,Suicide,United States,44965
2,2016,Unintentional injuries,United States,161374
3,2016,Alzheimer's disease,United States,116103
4,2016,Cancer,United States,598038
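
The loading cells above use `if_exists='append'`, so re-running a cell adds the same rows to the table a second time. The sketch below illustrates the difference between `'append'` and `'replace'`. It assumes an in-memory SQLite connection in place of the MySQL engine, and the table names `demo_append` and `demo_replace` are hypothetical. The two sample rows are taken from the 2013 output above.

```python
import sqlite3

import pandas as pd

# Two rows copied from the leadingcauses_2013 preview
sample = pd.DataFrame(
    {
        "year": [2013, 2013],
        "cause_name": ["CLRD", "Diabetes"],
        "state": ["United States", "United States"],
        "deaths": [149205, 75578],
    }
)

# SQLite stands in for MySQL so the example needs no server
conn = sqlite3.connect(":memory:")


def row_count(table_name):
    """Return the number of rows currently stored in table_name."""
    query = f"SELECT COUNT(*) AS n FROM {table_name}"
    return int(pd.read_sql_query(query, con=conn)["n"].iloc[0])


# Simulate running the same loading cell twice
for _ in range(2):
    sample.to_sql(name="demo_append", con=conn, if_exists="append", index=False)
    sample.to_sql(name="demo_replace", con=conn, if_exists="replace", index=False)

append_rows = row_count("demo_append")    # rows accumulate on every run
replace_rows = row_count("demo_replace")  # table is rebuilt on every run
print(append_rows, replace_rows)
```

If the notebook is likely to be re-run, `'replace'` (as used in the later state-level loading cells) avoids duplicated rows.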


## Chronic Disease Indicators Analysis By State

In [89]:
# Keep only rows for individual locations (everything except the national "US" aggregate)
CDI_states = CDI_df.loc[CDI_df['LocationAbbr'] !="US",:]

# Keep only rows from 2013 onward
CDI_states = CDI_states.loc[CDI_states['YearStart'] >=2013,:]
CDI_states.head(10)

Unnamed: 0,YearStart,LocationAbbr,LocationDesc,Topic,DataValue,DataValueUnit,DataValueType,StratificationCategory1,Stratification1
1,2016,AL,Alabama,Alcohol,13.0,%,Crude Prevalence,Overall,Overall
2,2016,AK,Alaska,Alcohol,18.2,%,Crude Prevalence,Overall,Overall
3,2016,AZ,Arizona,Alcohol,15.6,%,Crude Prevalence,Overall,Overall
4,2016,AR,Arkansas,Alcohol,15.0,%,Crude Prevalence,Overall,Overall
5,2016,CA,California,Alcohol,16.3,%,Crude Prevalence,Overall,Overall
6,2016,CO,Colorado,Alcohol,19.0,%,Crude Prevalence,Overall,Overall
7,2016,CT,Connecticut,Alcohol,16.7,%,Crude Prevalence,Overall,Overall
8,2016,DE,Delaware,Alcohol,17.0,%,Crude Prevalence,Overall,Overall
9,2016,DC,District of Columbia,Alcohol,25.6,%,Crude Prevalence,Overall,Overall
10,2016,FL,Florida,Alcohol,15.5,%,Crude Prevalence,Overall,Overall


In [90]:
# Separate crude prevalence and age-adjusted prevalence into separate dataframes
CDI_states_crude = CDI_states.loc[CDI_states['DataValueType']=="Crude Prevalence",:]
CDI_states_age_adjusted = CDI_states.loc[CDI_states['DataValueType']=="Age-adjusted Prevalence", :]


In [91]:
# Separate CDI_states_crude into 3 stratification categories: Overall, Gender Male, Gender Female,
# sort by descending prevalence % ('DataValue' column)

# Overall
CDI_states_crude_overall = CDI_states_crude.loc[CDI_states_crude['Stratification1']=="Overall",:]
CDI_states_crude_overall = CDI_states_crude_overall.sort_values(by=['DataValue'], ascending=False)
# Male
CDI_states_crude_male = CDI_states_crude.loc[CDI_states_crude['Stratification1']=="Male",:]
CDI_states_crude_male = CDI_states_crude_male.sort_values(by=['DataValue'], ascending=False)
# Female
CDI_states_crude_female = CDI_states_crude.loc[CDI_states_crude['Stratification1']=="Female",:]
CDI_states_crude_female = CDI_states_crude_female.sort_values(by=['DataValue'], ascending=False)

In [92]:
## Separate CDI_states_crude_{overall, male, female} by year

# CDI state crude overall (sco)
CDI_sco_2013 = CDI_states_crude_overall.loc[CDI_states_crude_overall['YearStart']==2013,:]
CDI_sco_2014 = CDI_states_crude_overall.loc[CDI_states_crude_overall['YearStart']==2014,:]
CDI_sco_2015 = CDI_states_crude_overall.loc[CDI_states_crude_overall['YearStart']==2015,:]
CDI_sco_2016 = CDI_states_crude_overall.loc[CDI_states_crude_overall['YearStart']==2016,:]

# CDI state crude male (scm)
CDI_scm_2013 = CDI_states_crude_male.loc[CDI_states_crude_male['YearStart']==2013,:]
CDI_scm_2014 = CDI_states_crude_male.loc[CDI_states_crude_male['YearStart']==2014,:]
CDI_scm_2015 = CDI_states_crude_male.loc[CDI_states_crude_male['YearStart']==2015,:]
CDI_scm_2016 = CDI_states_crude_male.loc[CDI_states_crude_male['YearStart']==2016,:]

# CDI state crude female (scf)
CDI_scf_2013 = CDI_states_crude_female.loc[CDI_states_crude_female['YearStart']==2013,:]
CDI_scf_2014 = CDI_states_crude_female.loc[CDI_states_crude_female['YearStart']==2014,:]
CDI_scf_2015 = CDI_states_crude_female.loc[CDI_states_crude_female['YearStart']==2015,:]
CDI_scf_2016 = CDI_states_crude_female.loc[CDI_states_crude_female['YearStart']==2016,:]

In [93]:
# Separate CDI_states_age_adjusted into 3 stratification categories: Overall, Gender Male, Gender Female, 
# sort by descending prevalence % ('DataValue' column)

# Overall
CDI_states_age_adjusted_overall = CDI_states_age_adjusted.loc[CDI_states_age_adjusted['Stratification1']=="Overall",:]
CDI_states_age_adjusted_overall = CDI_states_age_adjusted_overall.sort_values(by=['DataValue'], ascending=False)
# Male
CDI_states_age_adjusted_male = CDI_states_age_adjusted.loc[CDI_states_age_adjusted['Stratification1']=="Male",:]
CDI_states_age_adjusted_male = CDI_states_age_adjusted_male.sort_values(by=['DataValue'], ascending=False)
# Female
CDI_states_age_adjusted_female = CDI_states_age_adjusted.loc[CDI_states_age_adjusted['Stratification1']=="Female",:]
CDI_states_age_adjusted_female = CDI_states_age_adjusted_female.sort_values(by=['DataValue'], ascending=False)

In [94]:
## Separate CDI_states_age_adjusted_{overall, male, female} by year

# CDI state age-adjusted overall (sao)
CDI_sao_2013 = CDI_states_age_adjusted_overall.loc[CDI_states_age_adjusted_overall['YearStart']==2013,:]
CDI_sao_2014 = CDI_states_age_adjusted_overall.loc[CDI_states_age_adjusted_overall['YearStart']==2014,:]
CDI_sao_2015 = CDI_states_age_adjusted_overall.loc[CDI_states_age_adjusted_overall['YearStart']==2015,:]
CDI_sao_2016 = CDI_states_age_adjusted_overall.loc[CDI_states_age_adjusted_overall['YearStart']==2016,:]

# CDI state age-adjusted male (sam)
CDI_sam_2013 = CDI_states_age_adjusted_male.loc[CDI_states_age_adjusted_male['YearStart']==2013,:]
CDI_sam_2014 = CDI_states_age_adjusted_male.loc[CDI_states_age_adjusted_male['YearStart']==2014,:]
CDI_sam_2015 = CDI_states_age_adjusted_male.loc[CDI_states_age_adjusted_male['YearStart']==2015,:]
CDI_sam_2016 = CDI_states_age_adjusted_male.loc[CDI_states_age_adjusted_male['YearStart']==2016,:]

# CDI state age-adjusted female (saf)
CDI_saf_2013 = CDI_states_age_adjusted_female.loc[CDI_states_age_adjusted_female['YearStart']==2013,:]
CDI_saf_2014 = CDI_states_age_adjusted_female.loc[CDI_states_age_adjusted_female['YearStart']==2014,:]
CDI_saf_2015 = CDI_states_age_adjusted_female.loc[CDI_states_age_adjusted_female['YearStart']==2015,:]
CDI_saf_2016 = CDI_states_age_adjusted_female.loc[CDI_states_age_adjusted_female['YearStart']==2016,:]
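
The cells above create 24 dataframes by hand, one for each combination of value type, stratification, and year. As a sketch of a more compact alternative, the same split could be produced with a single `groupby` and stored in a dictionary. The toy `sample` frame, the `type_codes` and `strata_codes` mappings, and the `splits` dictionary are assumptions introduced for illustration. The key names only mimic the variable names used in this notebook.

```python
import pandas as pd

# Toy rows shaped like CDI_states (illustrative values, not CDC data)
sample = pd.DataFrame(
    [
        (2016, "AL", 13.0, "Crude Prevalence", "Overall"),
        (2016, "AK", 18.2, "Crude Prevalence", "Overall"),
        (2015, "AL", 11.0, "Crude Prevalence", "Male"),
        (2015, "AK", 14.0, "Age-adjusted Prevalence", "Female"),
        (2015, "AK", 3.0, "Mean", "Overall"),  # dropped: not a prevalence type
    ],
    columns=["YearStart", "LocationAbbr", "DataValue",
             "DataValueType", "Stratification1"],
)

# Short codes used to build names such as CDI_sco_2016
type_codes = {"Crude Prevalence": "c", "Age-adjusted Prevalence": "a"}
strata_codes = {"Overall": "o", "Male": "m", "Female": "f"}

# Keep only the value types and strata of interest
mask = (
    sample["DataValueType"].isin(list(type_codes))
    & sample["Stratification1"].isin(list(strata_codes))
)
subset = sample.loc[mask, :]

# One dictionary entry per (value type, stratum, year), sorted by prevalence
splits = {}
group_cols = ["DataValueType", "Stratification1", "YearStart"]
for (value_type, stratum, year), group in subset.groupby(group_cols):
    key = f"CDI_s{type_codes[value_type]}{strata_codes[stratum]}_{year}"
    splits[key] = group.sort_values(by=["DataValue"], ascending=False)

print(sorted(splits))
```

A dictionary keyed this way can then be looped over when loading or exporting, instead of repeating one line per dataframe.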

## Data Loading into MySQL

In [95]:
# Load the top 10 leading causes by unique state for 2013 to 2016
top_10_leading_2013_states_unique.to_sql(name='top_10_2013_unique_states', con=engine, if_exists='replace', index=True)
top_10_leading_2014_states_unique.to_sql(name='top_10_2014_unique_states', con=engine, if_exists='replace', index=True)
top_10_leading_2015_states_unique.to_sql(name='top_10_2015_unique_states', con=engine, if_exists='replace', index=True)
top_10_leading_2016_states_unique.to_sql(name='top_10_2016_unique_states', con=engine, if_exists='replace', index=True)


In [96]:
# Load states crude overall (sco) from 2013 to 2016 data
CDI_sco_2013.to_sql(name='cdi_sco_2013', con=engine, if_exists='replace', index=True)
CDI_sco_2014.to_sql(name='cdi_sco_2014', con=engine, if_exists='replace', index=True)
CDI_sco_2015.to_sql(name='cdi_sco_2015', con=engine, if_exists='replace', index=True)
CDI_sco_2016.to_sql(name='cdi_sco_2016', con=engine, if_exists='replace', index=True)

In [97]:
# Load states crude male (scm) from 2013 to 2016 data
CDI_scm_2013.to_sql(name='cdi_scm_2013', con=engine, if_exists='replace', index=True)
CDI_scm_2014.to_sql(name='cdi_scm_2014', con=engine, if_exists='replace', index=True)
CDI_scm_2015.to_sql(name='cdi_scm_2015', con=engine, if_exists='replace', index=True)
CDI_scm_2016.to_sql(name='cdi_scm_2016', con=engine, if_exists='replace', index=True)


In [98]:
# Load states crude female (scf) from 2013 to 2016 data
CDI_scf_2013.to_sql(name='cdi_scf_2013', con=engine, if_exists='replace', index=True)
CDI_scf_2014.to_sql(name='cdi_scf_2014', con=engine, if_exists='replace', index=True)
CDI_scf_2015.to_sql(name='cdi_scf_2015', con=engine, if_exists='replace', index=True)
CDI_scf_2016.to_sql(name='cdi_scf_2016', con=engine, if_exists='replace', index=True)


In [99]:
# Load states age-adjusted overall (sao) from 2013 to 2016 data
CDI_sao_2013.to_sql(name='cdi_sao_2013', con=engine, if_exists='replace', index=True)
CDI_sao_2014.to_sql(name='cdi_sao_2014', con=engine, if_exists='replace', index=True)
CDI_sao_2015.to_sql(name='cdi_sao_2015', con=engine, if_exists='replace', index=True)
CDI_sao_2016.to_sql(name='cdi_sao_2016', con=engine, if_exists='replace', index=True)


In [100]:
# Load states age-adjusted male (sam) from 2013 to 2016 data
CDI_sam_2013.to_sql(name='cdi_sam_2013', con=engine, if_exists='replace', index=True)
CDI_sam_2014.to_sql(name='cdi_sam_2014', con=engine, if_exists='replace', index=True)
CDI_sam_2015.to_sql(name='cdi_sam_2015', con=engine, if_exists='replace', index=True)
CDI_sam_2016.to_sql(name='cdi_sam_2016', con=engine, if_exists='replace', index=True)


In [101]:
# Load states age-adjusted female (saf) from 2013 to 2016 data
CDI_saf_2013.to_sql(name='cdi_saf_2013', con=engine, if_exists='replace', index=True)
CDI_saf_2014.to_sql(name='cdi_saf_2014', con=engine, if_exists='replace', index=True)
CDI_saf_2015.to_sql(name='cdi_saf_2015', con=engine, if_exists='replace', index=True)
CDI_saf_2016.to_sql(name='cdi_saf_2016', con=engine, if_exists='replace', index=True)
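
The loading cells in this section pass `index=True`, which writes the dataframe index to the table as an extra column. A minimal sketch of that behavior follows, again assuming an in-memory SQLite connection in place of the MySQL engine. The `frames` dictionary and its toy values are hypothetical.

```python
import sqlite3

import pandas as pd

# Toy frames keyed by target table name (illustrative values, not CDC data).
# The non-default index mimics the row labels left over after filtering.
frames = {
    "cdi_sco_2015": pd.DataFrame(
        {"LocationAbbr": ["AL"], "DataValue": [11.0]}, index=[7]
    ),
    "cdi_sco_2016": pd.DataFrame(
        {"LocationAbbr": ["AK", "AL"], "DataValue": [18.2, 13.0]}, index=[2, 1]
    ),
}

# SQLite stands in for MySQL so the example needs no server
conn = sqlite3.connect(":memory:")

# One to_sql call per table, driven by the dictionary
for table_name, frame in frames.items():
    frame.to_sql(name=table_name, con=conn, if_exists="replace", index=True)

# With an unnamed index, pandas stores it in a column called "index"
loaded = pd.read_sql_query("SELECT * FROM cdi_sco_2016", con=conn)
print(loaded.columns.tolist())
```

Keeping the index preserves the original row labels from the source CSV, which can help when tracing a row in MySQL back to the raw data.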


## Export files to CSV for backup

In [102]:
# Export to csv causes by unique states from 2013 to 2016 data
top_10_leading_2013_states_unique.to_csv("top_10_leading_causes_2013_states_unique.csv")
top_10_leading_2014_states_unique.to_csv("top_10_leading_causes_2014_states_unique.csv")
top_10_leading_2015_states_unique.to_csv("top_10_leading_causes_2015_states_unique.csv")
top_10_leading_2016_states_unique.to_csv("top_10_leading_causes_2016_states_unique.csv")

# Export to csv states crude overall (sco) from 2013 to 2016 data
CDI_sco_2013.to_csv("CDI_sco_2013.csv")
CDI_sco_2014.to_csv("CDI_sco_2014.csv")
CDI_sco_2015.to_csv("CDI_sco_2015.csv")
CDI_sco_2016.to_csv("CDI_sco_2016.csv")

# Export to csv states crude male (scm) from 2013 to 2016 data
CDI_scm_2013.to_csv("CDI_scm_2013.csv")
CDI_scm_2014.to_csv("CDI_scm_2014.csv")
CDI_scm_2015.to_csv("CDI_scm_2015.csv")
CDI_scm_2016.to_csv("CDI_scm_2016.csv")

# Export to csv states crude female (scf) from 2013 to 2016 data
CDI_scf_2013.to_csv("CDI_scf_2013.csv")
CDI_scf_2014.to_csv("CDI_scf_2014.csv")
CDI_scf_2015.to_csv("CDI_scf_2015.csv")
CDI_scf_2016.to_csv("CDI_scf_2016.csv")

# Export to csv states age-adjusted overall (sao) from 2013 to 2016 data
CDI_sao_2013.to_csv("CDI_sao_2013.csv")
CDI_sao_2014.to_csv("CDI_sao_2014.csv")
CDI_sao_2015.to_csv("CDI_sao_2015.csv")
CDI_sao_2016.to_csv("CDI_sao_2016.csv")

# Export to csv states age-adjusted male (sam) from 2013 to 2016 data
CDI_sam_2013.to_csv("CDI_sam_2013.csv")
CDI_sam_2014.to_csv("CDI_sam_2014.csv")
CDI_sam_2015.to_csv("CDI_sam_2015.csv")
CDI_sam_2016.to_csv("CDI_sam_2016.csv")

# Export to csv states age-adjusted female (saf) from 2013 to 2016 data
CDI_saf_2013.to_csv("CDI_saf_2013.csv")
CDI_saf_2014.to_csv("CDI_saf_2014.csv")
CDI_saf_2015.to_csv("CDI_saf_2015.csv")
CDI_saf_2016.to_csv("CDI_saf_2016.csv")
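
The export above repeats one `to_csv` call per dataframe. As a sketch, the same backup could be written with a loop over a dictionary of frames. The `frames` dictionary, its toy values, and the temporary `backup_dir` are assumptions for illustration. In the notebook the files are written to the working directory instead.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frames keyed by output file stem (illustrative values, not CDC data)
frames = {
    "CDI_sco_2015": pd.DataFrame(
        {"LocationAbbr": ["AL"], "DataValue": [11.0]}, index=[7]
    ),
    "CDI_sco_2016": pd.DataFrame(
        {"LocationAbbr": ["AK", "AL"], "DataValue": [18.2, 13.0]}, index=[2, 1]
    ),
}

# A temporary directory keeps the example from touching real files
backup_dir = Path(tempfile.mkdtemp())

# Write one CSV per frame, named after its dictionary key
written = []
for name, frame in frames.items():
    csv_path = backup_dir / f"{name}.csv"
    frame.to_csv(csv_path)
    written.append(csv_path.name)

# Read one backup to confirm it round-trips, restoring the saved index
restored = pd.read_csv(backup_dir / "CDI_sco_2016.csv", index_col=0)
print(sorted(written))
```

Because `to_csv` writes the index by default, passing `index_col=0` when reading the backup restores the original row labels.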

