# ETL Project: Leading causes of death in the United States

#### Aim:
An analysis of the top 10 leading causes of death in the United States. This case study analyzes the leading causes of death from 2013 to 2016 by ranking causes by year, state, and age-adjusted deaths. Further, do U.S. chronic disease indicators provide any insight into risk factors that may be associated with those leading causes of death?

#### Notes:
* The 10 leading causes of death are classified by the International Classification of Diseases, Tenth Revision (ICD-10).
* Age-adjusted death rates (per 100,000 population) are based on the 2000 U.S. standard population.


#### Data sources:
* [CDC/NCHS](https://catalog.data.gov/dataset/age-adjusted-death-rates-for-the-top-10-leading-causes-of-death-united-states-2013)

* [CDC](https://catalog.data.gov/dataset/u-s-chronic-disease-indicators-cdi)

#### Tools:
* Jupyter Notebook/Python 3
* MySQL

# Leading Causes Of Death In The United States
## Data Extraction

After downloading the CSV files from our sources, we read them into dataframes in a Jupyter Notebook (Python 3) using the pandas function pd.read_csv().

In [1]:
# import dependencies
import pandas as pd
import pymysql
from sqlalchemy import create_engine

In [2]:
# read csv file for data on leading causes of death
leading_causes_death_df = pd.read_csv('Resources/NCHS_-_Leading_Causes_of_Death__United_States.csv')

In [3]:
# preview the dataframe created
leading_causes_death_df.head()

Unnamed: 0,Year,113 Cause Name,Cause Name,State,Deaths,Age-adjusted Death Rate
0,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alabama,2755,55.5
1,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alaska,439,63.1
2,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arizona,4010,54.2
3,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arkansas,1604,51.8
4,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,California,13213,32.0


## Data Transformation
Because the '113 Cause Name' column in the leading_causes_death dataframe is highly specific, we decided to use broader categories of death causes that more closely match the Chronic Disease Indicators' categories, which makes the two datasets easier to group. The 'Cause Name' column therefore serves as our main category.

The column 'Age-adjusted Death Rate' was removed because the data source did not document it clearly and it could not be used to rank leading causes of death in a meaningful way. Columns were also renamed for easier analysis in Python 3.

In [4]:
## cleaning the dataframe

# Keep the Year, Cause Name, State, and Deaths columns (dropping '113 Cause Name' and 'Age-adjusted Death Rate'), then rename them in a new dataframe
leading_causes = leading_causes_death_df[['Year', 'Cause Name', 'State','Deaths']].copy()
leading_causes = leading_causes.rename(columns={
    "Year":"year",
    "Cause Name": "cause_name",
    "State": "state",
    "Deaths": "deaths"
})
leading_causes.sort_values(by=["year"]).head()

Unnamed: 0,year,cause_name,state,deaths
10295,1999,Unintentional injuries,Wyoming,258
1418,1999,Alzheimer's disease,Minnesota,1083
8726,1999,Suicide,Illinois,1020
7016,1999,Kidney disease,Michigan,1417
4334,1999,Diabetes,New Hampshire,294


Since the data sets end in 2016, we chose the time frame from 2013 to 2016. This four-year period was used to sample for possible relationships between leading causes of death and chronic disease indicators.


In [5]:
# Filter leading_causes for years >=2013
leading_causes_2013_to_2016 = leading_causes.loc[leading_causes['year'] >= 2013,:]
leading_causes_2013_to_2016=leading_causes_2013_to_2016.sort_values(by=['year'])

#### PUT THIS DATAFRAME INTO THE DATABASE
leading_causes_2013_to_2016.head()

Unnamed: 0,year,cause_name,state,deaths
10281,2013,Unintentional injuries,Wyoming,325
3708,2013,CLRD,Virginia,3181
3726,2013,CLRD,Washington,2933
8946,2013,Suicide,Montana,243
3744,2013,CLRD,West Virginia,1590


One of the categories in "cause_name" is "All causes". Since this is an aggregate total rather than an individual cause, the "All causes" category was removed from the dataset.

In [6]:
# Remove all causes from dataframe
leading_causes_2013_to_2016_individual = leading_causes_2013_to_2016[leading_causes_2013_to_2016.cause_name != "All causes"]
leading_causes_2013_to_2016_individual.head()

Unnamed: 0,year,cause_name,state,deaths
10281,2013,Unintentional injuries,Wyoming,325
3708,2013,CLRD,Virginia,3181
3726,2013,CLRD,Washington,2933
8946,2013,Suicide,Montana,243
3744,2013,CLRD,West Virginia,1590


Since the data set combines rows for the United States as a whole with rows for individual states, we separated the two so that leading causes of death could be examined for the whole country and for each state by year.

In [7]:
# Separate US from individual states (2013 to 2016)
leading_causes_2013_to_2016_individual_US = leading_causes_2013_to_2016_individual[leading_causes_2013_to_2016_individual.state == "United States"]
leading_causes_2013_to_2016_individual_US.head()

Unnamed: 0,year,cause_name,state,deaths
3654,2013,CLRD,United States,149205
4590,2013,Diabetes,United States,75578
9270,2013,Suicide,United States,41149
1782,2013,Alzheimer's disease,United States,84767
2718,2013,Cancer,United States,584881


In [8]:
# ALL US 2013 Top 10 leading deaths
# Isolate only 2013 data
top_10_leading_2013_US = leading_causes_2013_to_2016_individual_US.loc[leading_causes_2013_to_2016_individual_US['year']==2013,:]
top_10_leading_2013_US = top_10_leading_2013_US.sort_values(by=['deaths'], ascending=False).head(10)
top_10_leading_2013_US

Unnamed: 0,year,cause_name,state,deaths
5528,2013,Heart disease,United States,611105
2718,2013,Cancer,United States,584881
3654,2013,CLRD,United States,149205
10162,2013,Unintentional injuries,United States,130557
8334,2013,Stroke,United States,128978
1782,2013,Alzheimer's disease,United States,84767
4590,2013,Diabetes,United States,75578
6462,2013,Influenza and pneumonia,United States,56979
7399,2013,Kidney disease,United States,47112
9270,2013,Suicide,United States,41149


In [9]:
# ALL US 2014 Top 10 leading deaths
# Isolate only 2014 data
top_10_leading_2014_US = leading_causes_2013_to_2016_individual_US.loc[leading_causes_2013_to_2016_individual_US['year']==2014,:]
top_10_leading_2014_US = top_10_leading_2014_US.sort_values(by=['deaths'], ascending=False).head(10)
top_10_leading_2014_US

Unnamed: 0,year,cause_name,state,deaths
5527,2014,Heart disease,United States,614348
2717,2014,Cancer,United States,591700
3653,2014,CLRD,United States,147101
10161,2014,Unintentional injuries,United States,135928
8333,2014,Stroke,United States,133103
1781,2014,Alzheimer's disease,United States,93541
4589,2014,Diabetes,United States,76488
6461,2014,Influenza and pneumonia,United States,55227
7398,2014,Kidney disease,United States,48146
9269,2014,Suicide,United States,42826


In [10]:
# ALL US 2015 Top 10 leading deaths
# Isolate only 2015 data
top_10_leading_2015_US = leading_causes_2013_to_2016_individual_US.loc[leading_causes_2013_to_2016_individual_US['year']==2015,:]
top_10_leading_2015_US = top_10_leading_2015_US.sort_values(by=['deaths'], ascending=False).head(10)
top_10_leading_2015_US

Unnamed: 0,year,cause_name,state,deaths
5526,2015,Heart disease,United States,633842
2716,2015,Cancer,United States,595930
3652,2015,CLRD,United States,155041
10160,2015,Unintentional injuries,United States,146571
8332,2015,Stroke,United States,140323
1780,2015,Alzheimer's disease,United States,110561
4588,2015,Diabetes,United States,79535
6460,2015,Influenza and pneumonia,United States,57062
7397,2015,Kidney disease,United States,49959
9268,2015,Suicide,United States,44193


In [11]:
# ALL US 2016 Top 10 leading deaths
# Isolate only 2016 data
top_10_leading_2016_US = leading_causes_2013_to_2016_individual_US.loc[leading_causes_2013_to_2016_individual_US['year']==2016,:]
top_10_leading_2016_US = top_10_leading_2016_US.sort_values(by=['deaths'], ascending=False).head(10)
top_10_leading_2016_US

Unnamed: 0,year,cause_name,state,deaths
5525,2016,Heart disease,United States,635260
2715,2016,Cancer,United States,598038
10159,2016,Unintentional injuries,United States,161374
3651,2016,CLRD,United States,154596
8331,2016,Stroke,United States,142142
1779,2016,Alzheimer's disease,United States,116103
4587,2016,Diabetes,United States,80058
6459,2016,Influenza and pneumonia,United States,51537
7396,2016,Kidney disease,United States,50046
9267,2016,Suicide,United States,44965
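
The four per-year cells above repeat the same filter, sort, and head pattern. As a minimal sketch of how that pattern could be collapsed into a single grouped operation, the example below uses a small toy dataframe (the figures are copied from the 2015 and 2016 tables above). The helper name `top_n_per_year` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for leading_causes_2013_to_2016_individual_US
us = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016, 2016, 2016],
    "cause_name": ["Heart disease", "Cancer", "CLRD",
                   "Heart disease", "Cancer", "Unintentional injuries"],
    "state": ["United States"] * 6,
    "deaths": [633842, 595930, 155041, 635260, 598038, 161374],
})

def top_n_per_year(df, n=10):
    """Return the n rows with the most deaths within each year (hypothetical helper)."""
    # Sort by year ascending, then deaths descending, so head(n) per group is the top n
    ranked = df.sort_values(["year", "deaths"], ascending=[True, False])
    return ranked.groupby("year").head(n)

top2 = top_n_per_year(us, n=2)
print(top2)
```

With the real data, `top_n_per_year(leading_causes_2013_to_2016_individual_US)` would be expected to yield the same rows as the four separate cells, stacked in one dataframe.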


In [12]:
# Separate out states from 2013 to 2016
leading_causes_2013_to_2016_individual_states = leading_causes_2013_to_2016_individual[leading_causes_2013_to_2016_individual.state != "United States"]
leading_causes_2013_to_2016_individual_states.head()

Unnamed: 0,year,cause_name,state,deaths
10281,2013,Unintentional injuries,Wyoming,325
3708,2013,CLRD,Virginia,3181
3726,2013,CLRD,Washington,2933
8946,2013,Suicide,Montana,243
3744,2013,CLRD,West Virginia,1590


In [13]:
# ALL STATES 2013 Top 10 leading deaths and which states had the most deaths in the year 2013
# Isolate only 2013 data
leading_2013_states = leading_causes_2013_to_2016_individual_states.loc[leading_causes_2013_to_2016_individual_states['year']==2013,:]
top_10_leading_2013_states_all = leading_2013_states.sort_values(by=['deaths'], ascending=False).head(10)
top_10_leading_2013_states_all

Unnamed: 0,year,cause_name,state,deaths
4806,2013,Heart disease,California,60299
1998,2013,Cancer,California,57714
5311,2013,Heart disease,New York,44039
2088,2013,Cancer,Florida,42735
4896,2013,Heart disease,Florida,42656
5510,2013,Heart disease,Texas,40203
2700,2013,Cancer,Texas,38412
2502,2013,Cancer,New York,35738
5419,2013,Heart disease,Pennsylvania,31629
2610,2013,Cancer,Pennsylvania,28512


In [14]:
# Top 10 unique states 2013
# Sort data by number of death from highest to lowest, drop any duplicate states for this year, and obtain only the top 10 states
top_10_leading_2013_states_unique = leading_2013_states.sort_values(by=['deaths'], ascending=False).drop_duplicates(subset='state', keep='first').head(10)
top_10_leading_2013_states_unique

Unnamed: 0,year,cause_name,state,deaths
4806,2013,Heart disease,California,60299
5311,2013,Heart disease,New York,44039
2088,2013,Cancer,Florida,42735
5510,2013,Heart disease,Texas,40203
5419,2013,Heart disease,Pennsylvania,31629
5365,2013,Heart disease,Ohio,26878
4968,2013,Heart disease,Illinois,24839
5131,2013,Heart disease,Michigan,24156
2520,2013,Cancer,North Carolina,18589
5275,2013,Heart disease,New Jersey,18460


In [15]:
# ALL STATES 2014 Top 10 leading deaths and which states had the most deaths in the year 2014
# Isolate only 2014 data
leading_2014_states = leading_causes_2013_to_2016_individual_states.loc[leading_causes_2013_to_2016_individual_states['year']==2014,:]
top_10_leading_2014_states_all = leading_2014_states.sort_values(by=['deaths'], ascending=False).head(10)
top_10_leading_2014_states_all

Unnamed: 0,year,cause_name,state,deaths
1997,2014,Cancer,California,58412
4805,2014,Heart disease,California,58189
4895,2014,Heart disease,Florida,44511
2087,2014,Cancer,Florida,43212
5310,2014,Heart disease,New York,43116
5509,2014,Heart disease,Texas,41479
2699,2014,Cancer,Texas,38847
2501,2014,Cancer,New York,35392
5418,2014,Heart disease,Pennsylvania,31353
2609,2014,Cancer,Pennsylvania,28692


In [16]:
# Top 10 unique states 2014
# Sort data by number of death from highest to lowest, drop any duplicate states for this year, and obtain only the top 10 states
top_10_leading_2014_states_unique = leading_2014_states.sort_values(by=['deaths'], ascending=False).drop_duplicates(subset='state', keep='first').head(10)
top_10_leading_2014_states_unique

Unnamed: 0,year,cause_name,state,deaths
1997,2014,Cancer,California,58412
4895,2014,Heart disease,Florida,44511
5310,2014,Heart disease,New York,43116
5509,2014,Heart disease,Texas,41479
5418,2014,Heart disease,Pennsylvania,31353
5364,2014,Heart disease,Ohio,27000
4967,2014,Heart disease,Illinois,25024
5130,2014,Heart disease,Michigan,24692
2519,2014,Cancer,North Carolina,19342
5274,2014,Heart disease,New Jersey,18319


In [17]:
# ALL STATES 2015 Top 10 leading deaths and which states had the most deaths in the year 2015
# Isolate only 2015 data
leading_2015_states = leading_causes_2013_to_2016_individual_states.loc[leading_causes_2013_to_2016_individual_states['year']==2015,:]
top_10_leading_2015_states_all = leading_2015_states.sort_values(by=['deaths'], ascending=False).head(10)
top_10_leading_2015_states_all

Unnamed: 0,year,cause_name,state,deaths
4804,2015,Heart disease,California,61289
1996,2015,Cancer,California,59629
4894,2015,Heart disease,Florida,45441
5309,2015,Heart disease,New York,44450
2086,2015,Cancer,Florida,44027
5508,2015,Heart disease,Texas,43298
2698,2015,Cancer,Texas,39121
2500,2015,Cancer,New York,35089
5417,2015,Heart disease,Pennsylvania,32042
2608,2015,Cancer,Pennsylvania,28697


In [18]:
# Top 10 unique states 2015
# Sort data by number of death from highest to lowest, drop any duplicate states for this year, and obtain only the top 10 states
top_10_leading_2015_states_unique = leading_2015_states.sort_values(by=['deaths'], ascending=False).drop_duplicates(subset='state', keep='first').head(10)
top_10_leading_2015_states_unique

Unnamed: 0,year,cause_name,state,deaths
4804,2015,Heart disease,California,61289
4894,2015,Heart disease,Florida,45441
5309,2015,Heart disease,New York,44450
5508,2015,Heart disease,Texas,43298
5417,2015,Heart disease,Pennsylvania,32042
5363,2015,Heart disease,Ohio,28069
4966,2015,Heart disease,Illinois,25652
5129,2015,Heart disease,Michigan,24794
2518,2015,Cancer,North Carolina,19322
5273,2015,Heart disease,New Jersey,18647


In [19]:
# ALL STATES 2016 Top 10 leading deaths and which states had the most deaths in the year 2016
# Isolate only 2016 data
leading_2016_states = leading_causes_2013_to_2016_individual_states.loc[leading_causes_2013_to_2016_individual_states['year']==2016,:]
top_10_leading_2016_states_all = leading_2016_states.sort_values(by=['deaths'], ascending=False).head(10)
top_10_leading_2016_states_all

Unnamed: 0,year,cause_name,state,deaths
4803,2016,Heart disease,California,61573
1995,2016,Cancer,California,59515
4893,2016,Heart disease,Florida,45659
2085,2016,Cancer,Florida,44266
5308,2016,Heart disease,New York,44076
5506,2016,Heart disease,Texas,43772
2697,2016,Cancer,Texas,40195
2499,2016,Cancer,New York,35368
5416,2016,Heart disease,Pennsylvania,31990
2607,2016,Cancer,Pennsylvania,28492


In [20]:
# Top 10 unique states 2016
# Sort data by number of death from highest to lowest, drop any duplicate states for this year, and obtain only the top 10 states
top_10_leading_2016_states_unique = leading_2016_states.sort_values(by=['deaths'], ascending=False).drop_duplicates(subset='state', keep='first').head(10)
top_10_leading_2016_states_unique

Unnamed: 0,year,cause_name,state,deaths
4803,2016,Heart disease,California,61573
4893,2016,Heart disease,Florida,45659
5308,2016,Heart disease,New York,44076
5506,2016,Heart disease,Texas,43772
5416,2016,Heart disease,Pennsylvania,31990
5362,2016,Heart disease,Ohio,27410
5128,2016,Heart disease,Michigan,25304
4965,2016,Heart disease,Illinois,25013
2517,2016,Cancer,North Carolina,19523
5272,2016,Heart disease,New Jersey,18597
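
The "unique states" cells rely on sorting by deaths before calling `drop_duplicates`, so that `keep='first'` retains each state's single largest cause. Note that this ranks states by their largest individual cause, not by total deaths. The sketch below illustrates the behavior on a toy dataframe (figures copied from the 2016 table above) and compares it with a `groupby(...).idxmax()` formulation, which is shown only as an alternative and is not what the notebook uses.

```python
import pandas as pd

# Toy stand-in for leading_2016_states
states = pd.DataFrame({
    "year": [2016] * 4,
    "cause_name": ["Heart disease", "Cancer", "Heart disease", "Cancer"],
    "state": ["California", "California", "Florida", "Florida"],
    "deaths": [61573, 59515, 45659, 44266],
})

# Sorting first means keep='first' retains each state's highest-death row
via_dedup = (states.sort_values("deaths", ascending=False)
                   .drop_duplicates(subset="state", keep="first"))

# Alternative: look up the row index of the maximum deaths within each state
via_idxmax = states.loc[states.groupby("state")["deaths"].idxmax()]

print(via_dedup)
```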


# Chronic Disease Indicators

#### Column Descriptions:
* YearStart - Starting Year
* YearEnd - Ending Year
* LocationAbbr - Location Abbreviation
* LocationDesc - Location Description
* DataSource - Data Source Abbreviation
* Topic - Topic
* Question - Question full-length text
* Response - Response
* DataValueUnit - The unit, such as $, %, years, etc.
* DataValueType - The data type, such as prevalence or mean
* DataValue - Data Value, such as 14.7 or Category 1
* DataValueAlt - Equal to Data Value, but formatting is numeric
* DataValueFootnoteSymbol - Footnote Symbol
* DatavalueFootnote - Footnote Text
* LowConfidenceLimit - Low Confidence Limit
* HighConfidenceLimit - High Confidence Limit
* StratificationCategory1 - The category of the stratification, such as Gender, Overall, or Race/Ethnicity
* Stratification1 - The stratification within the category, such as Male or Female, White/non-Hispanic, Hispanic, Black/non-Hispanic, American Indian or Alaska Native, Asian or Pacific Islander, Multi-racial/non-Hispanic, Other/non-Hispanic, or Overall
* StratificationCategory2
* Stratification2	
* StratificationCategory3
* Stratification3	
* GeoLocation - Location code to be used for Geocoding
* ResponseID - Identifier for the Response
* LocationID - Location Identifier
* TopicID	
* QuestionID - Question Identifier
* DataValueTypeID - Identifier for the Data Value Type
* StratificationCategoryID1 - Identifier for stratification category 1
* StratificationID1 - Identifier for stratification 1
* StratificationCategoryID2	
* StratificationID2	
* StratificationCategoryID3	
* StratificationID3	


In [21]:
# read chronic diseases indicators csv
CDI_df = pd.read_csv('Resources/U.S._Chronic_Disease_Indicators__CDI_.csv')
CDI_df.head()


Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,DataSource,Topic,Question,Response,DataValueUnit,DataValueType,...,LocationID,TopicID,QuestionID,DataValueTypeID,StratificationCategoryID1,StratificationID1,StratificationCategoryID2,StratificationID2,StratificationCategoryID3,StratificationID3
0,2016,2016,US,United States,BRFSS,Alcohol,Binge drinking prevalence among adults aged >=...,,%,Crude Prevalence,...,59,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
1,2016,2016,AL,Alabama,BRFSS,Alcohol,Binge drinking prevalence among adults aged >=...,,%,Crude Prevalence,...,1,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
2,2016,2016,AK,Alaska,BRFSS,Alcohol,Binge drinking prevalence among adults aged >=...,,%,Crude Prevalence,...,2,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
3,2016,2016,AZ,Arizona,BRFSS,Alcohol,Binge drinking prevalence among adults aged >=...,,%,Crude Prevalence,...,4,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
4,2016,2016,AR,Arkansas,BRFSS,Alcohol,Binge drinking prevalence among adults aged >=...,,%,Crude Prevalence,...,5,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,


This dataset contained several columns with missing information or information not relevant to the analysis.

The steps below:
* removed unnecessary columns that contained missing or repetitive information
* separated datasets by location of either 'United States' or individual states
* split each location dataset by 'DataValueType' into 'Crude Prevalence' and 'Age-adjusted Prevalence', then stratified each by Overall, Male, or Female (the 'Stratification1' column)
* separated each gender type by year (2013, 2014, 2015, 2016)
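
As a compact preview of the steps listed above, here is a minimal sketch on a toy dataframe. The values are made up for illustration and only a subset of the CDI columns is included. It chains the filters into one boolean mask rather than creating intermediate dataframes as the notebook cells do.

```python
import pandas as pd

# Toy rows mimicking a few CDI columns (values are invented for illustration)
cdi = pd.DataFrame({
    "YearStart": [2013, 2014, 2014, 2016, 2012],
    "LocationAbbr": ["US", "AL", "AK", "AL", "AL"],
    "DataValue": [16.9, 13.0, None, 15.0, 14.1],
    "DataValueType": ["Crude Prevalence", "Crude Prevalence", "Crude Prevalence",
                      "Age-adjusted Prevalence", "Crude Prevalence"],
    "Stratification1": ["Overall", "Male", "Overall", "Female", "Overall"],
})

# Drop rows without a data value, then keep individual states from 2013 onward
cleaned = cdi.dropna(subset=["DataValue"])
states_only = cleaned[(cleaned["LocationAbbr"] != "US") & (cleaned["YearStart"] >= 2013)]

# One subset: crude prevalence, male stratification, year 2014
crude_male_2014 = states_only[
    (states_only["DataValueType"] == "Crude Prevalence")
    & (states_only["Stratification1"] == "Male")
    & (states_only["YearStart"] == 2014)
]
print(crude_male_2014)
```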

In [22]:
## CLEAN UP DATAFRAME
# Select desired columns
CDI_df = CDI_df[['YearStart', 'LocationAbbr', 'LocationDesc', 'Topic', 'DataValue', 'DataValueUnit', 'DataValueType', 'StratificationCategory1', 'Stratification1']]

CDI_df.shape

(180591, 9)

In [23]:
# Drop rows where 'DataValue' is missing
CDI_df = CDI_df.dropna(subset = ['DataValue'])
CDI_df.shape

(120949, 9)

In [24]:
# Remove rows where 'DataValue' == ' '
CDI_df = CDI_df.loc[CDI_df['DataValue']!= ' ']
CDI_df.shape

(118918, 9)

In [25]:
# Convert DataValue from string to float

CDI_df['DataValue'] = CDI_df['DataValue'].astype(float)
CDI_df.head()


Unnamed: 0,YearStart,LocationAbbr,LocationDesc,Topic,DataValue,DataValueUnit,DataValueType,StratificationCategory1,Stratification1
0,2016,US,United States,Alcohol,16.9,%,Crude Prevalence,Overall,Overall
1,2016,AL,Alabama,Alcohol,13.0,%,Crude Prevalence,Overall,Overall
2,2016,AK,Alaska,Alcohol,18.2,%,Crude Prevalence,Overall,Overall
3,2016,AZ,Arizona,Alcohol,15.6,%,Crude Prevalence,Overall,Overall
4,2016,AR,Arkansas,Alcohol,15.0,%,Crude Prevalence,Overall,Overall
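
The three cells above drop missing values, filter out blank strings, and then cast to float. One possible alternative, sketched here on a toy Series, is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into NaN so that a single `dropna` covers every case. This is an assumption about what would suit the data, not the notebook's method. It silently discards non-numeric entries (the column description above mentions values such as "Category 1"), which may or may not be desirable.

```python
import pandas as pd

# Toy stand-in for the raw 'DataValue' column
raw = pd.Series(["16.9", " ", "13.0", None, "Category 1"])

# Unparseable entries (blank strings, None, text labels) become NaN
numeric = pd.to_numeric(raw, errors="coerce")

# A single dropna then removes every non-numeric entry
clean = numeric.dropna()
print(clean)
```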


## Chronic Disease Indicators by United States

In [26]:
#### PLACE HOLDER

## Chronic Disease Indicators Analysis By State

In [27]:
# Obtain rows only for individual states (exclude the US aggregate)
CDI_states = CDI_df.loc[CDI_df['LocationAbbr'] !="US",:]

# Obtain rows only for the years >=2013
CDI_states = CDI_states.loc[CDI_states['YearStart'] >=2013,:]
CDI_states.head(10)

Unnamed: 0,YearStart,LocationAbbr,LocationDesc,Topic,DataValue,DataValueUnit,DataValueType,StratificationCategory1,Stratification1
1,2016,AL,Alabama,Alcohol,13.0,%,Crude Prevalence,Overall,Overall
2,2016,AK,Alaska,Alcohol,18.2,%,Crude Prevalence,Overall,Overall
3,2016,AZ,Arizona,Alcohol,15.6,%,Crude Prevalence,Overall,Overall
4,2016,AR,Arkansas,Alcohol,15.0,%,Crude Prevalence,Overall,Overall
5,2016,CA,California,Alcohol,16.3,%,Crude Prevalence,Overall,Overall
6,2016,CO,Colorado,Alcohol,19.0,%,Crude Prevalence,Overall,Overall
7,2016,CT,Connecticut,Alcohol,16.7,%,Crude Prevalence,Overall,Overall
8,2016,DE,Delaware,Alcohol,17.0,%,Crude Prevalence,Overall,Overall
9,2016,DC,District of Columbia,Alcohol,25.6,%,Crude Prevalence,Overall,Overall
10,2016,FL,Florida,Alcohol,15.5,%,Crude Prevalence,Overall,Overall


In [28]:
# Separate crude prevalence and age-adjusted prevalence into separate dataframes
CDI_states_crude = CDI_states.loc[CDI_states['DataValueType']=="Crude Prevalence",:]
CDI_states_age_adjusted = CDI_states.loc[CDI_states['DataValueType']=="Age-adjusted Prevalence", :]


In [29]:
# Separate CDI_states_crude into 3 stratification categories: Overall, Gender Male, Gender Female

# Overall
CDI_states_crude_overall = CDI_states_crude.loc[CDI_states_crude['Stratification1']=="Overall",:]
CDI_states_crude_overall = CDI_states_crude_overall.sort_values(by=['DataValue'], ascending=False)
# Male
CDI_states_crude_male = CDI_states_crude.loc[CDI_states_crude['Stratification1']=="Male",:]
CDI_states_crude_male = CDI_states_crude_male.sort_values(by=['DataValue'], ascending=False)
# Female
CDI_states_crude_female = CDI_states_crude.loc[CDI_states_crude['Stratification1']=="Female",:]
CDI_states_crude_female = CDI_states_crude_female.sort_values(by=['DataValue'], ascending=False)

In [30]:
## Separate CDI_states_crude_{overall, male, female} by year

# CDI state crude overall (sco)
CDI_sco_2013 = CDI_states_crude_overall.loc[CDI_states_crude_overall['YearStart']==2013,:]
CDI_sco_2014 = CDI_states_crude_overall.loc[CDI_states_crude_overall['YearStart']==2014,:]
CDI_sco_2015 = CDI_states_crude_overall.loc[CDI_states_crude_overall['YearStart']==2015,:]
CDI_sco_2016 = CDI_states_crude_overall.loc[CDI_states_crude_overall['YearStart']==2016,:]

# CDI state crude male (scm)
CDI_scm_2013 = CDI_states_crude_male.loc[CDI_states_crude_male['YearStart']==2013,:]
CDI_scm_2014 = CDI_states_crude_male.loc[CDI_states_crude_male['YearStart']==2014,:]
CDI_scm_2015 = CDI_states_crude_male.loc[CDI_states_crude_male['YearStart']==2015,:]
CDI_scm_2016 = CDI_states_crude_male.loc[CDI_states_crude_male['YearStart']==2016,:]

# CDI state crude female (scf)
CDI_scf_2013 = CDI_states_crude_female.loc[CDI_states_crude_female['YearStart']==2013,:]
CDI_scf_2014 = CDI_states_crude_female.loc[CDI_states_crude_female['YearStart']==2014,:]
CDI_scf_2015 = CDI_states_crude_female.loc[CDI_states_crude_female['YearStart']==2015,:]
CDI_scf_2016 = CDI_states_crude_female.loc[CDI_states_crude_female['YearStart']==2016,:]

In [31]:
# Separate CDI_states_age_adjusted into 3 stratification categories: Overall, Gender Male, Gender Female, 
# sort by descending prevalence % ('DataValue' column)

# Overall
CDI_states_age_adjusted_overall = CDI_states_age_adjusted.loc[CDI_states_age_adjusted['Stratification1']=="Overall",:]
CDI_states_age_adjusted_overall = CDI_states_age_adjusted_overall.sort_values(by=['DataValue'], ascending=False)
# Male
CDI_states_age_adjusted_male = CDI_states_age_adjusted.loc[CDI_states_age_adjusted['Stratification1']=="Male",:]
CDI_states_age_adjusted_male = CDI_states_age_adjusted_male.sort_values(by=['DataValue'], ascending=False)
# Female
CDI_states_age_adjusted_female = CDI_states_age_adjusted.loc[CDI_states_age_adjusted['Stratification1']=="Female",:]
CDI_states_age_adjusted_female = CDI_states_age_adjusted_female.sort_values(by=['DataValue'], ascending=False)

In [32]:
## Separate CDI_states_age_adjusted_{overall, male, female} by year

# CDI state age-adjusted overall (sao)
CDI_sao_2013 = CDI_states_age_adjusted_overall.loc[CDI_states_age_adjusted_overall['YearStart']==2013,:]
CDI_sao_2014 = CDI_states_age_adjusted_overall.loc[CDI_states_age_adjusted_overall['YearStart']==2014,:]
CDI_sao_2015 = CDI_states_age_adjusted_overall.loc[CDI_states_age_adjusted_overall['YearStart']==2015,:]
CDI_sao_2016 = CDI_states_age_adjusted_overall.loc[CDI_states_age_adjusted_overall['YearStart']==2016,:]

# CDI state age-adjusted male (sam)
CDI_sam_2013 = CDI_states_age_adjusted_male.loc[CDI_states_age_adjusted_male['YearStart']==2013,:]
CDI_sam_2014 = CDI_states_age_adjusted_male.loc[CDI_states_age_adjusted_male['YearStart']==2014,:]
CDI_sam_2015 = CDI_states_age_adjusted_male.loc[CDI_states_age_adjusted_male['YearStart']==2015,:]
CDI_sam_2016 = CDI_states_age_adjusted_male.loc[CDI_states_age_adjusted_male['YearStart']==2016,:]

# CDI state age-adjusted female (saf)
CDI_saf_2013 = CDI_states_age_adjusted_female.loc[CDI_states_age_adjusted_female['YearStart']==2013,:]
CDI_saf_2014 = CDI_states_age_adjusted_female.loc[CDI_states_age_adjusted_female['YearStart']==2014,:]
CDI_saf_2015 = CDI_states_age_adjusted_female.loc[CDI_states_age_adjusted_female['YearStart']==2015,:]
CDI_saf_2016 = CDI_states_age_adjusted_female.loc[CDI_states_age_adjusted_female['YearStart']==2016,:]
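
The cells above create 24 separately named dataframes (two prevalence types, three stratifications, four years). A minimal sketch of an alternative, assuming a dictionary keyed by `(DataValueType, Stratification1, YearStart)` is acceptable in place of individual variables, is shown below on toy data with invented values. The names `keys` and `subsets` are hypothetical.

```python
import pandas as pd

# Toy stand-in for CDI_states (values are invented for illustration)
cdi_states = pd.DataFrame({
    "YearStart": [2013, 2013, 2014, 2014],
    "DataValueType": ["Crude Prevalence", "Age-adjusted Prevalence",
                      "Crude Prevalence", "Crude Prevalence"],
    "Stratification1": ["Overall", "Male", "Female", "Overall"],
    "DataValue": [16.9, 20.1, 11.3, 17.4],
})

# Group once on all three dimensions; each group is sorted by descending prevalence
keys = ["DataValueType", "Stratification1", "YearStart"]
subsets = {
    key: group.sort_values("DataValue", ascending=False)
    for key, group in cdi_states.groupby(keys)
}

# Equivalent of CDI_sco_2014 in the notebook's naming scheme
sco_2014 = subsets[("Crude Prevalence", "Overall", 2014)]
print(sco_2014)
```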

## Data Loading into MySQL

In [33]:
engine = create_engine("mysql://root:Codingislife92!@localhost:3306/leading_causes_of_death_db")
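
The connection string above embeds the database password directly in the notebook. A minimal sketch of one way to avoid that is to read the password from an environment variable and percent-encode it before building the URL. The variable name `MYSQL_PASSWORD` and the helper `build_mysql_url` are hypothetical, and the `mysql+pymysql` scheme assumes the PyMySQL driver imported at the top of the notebook is the intended one.

```python
import os
from urllib.parse import quote_plus

def build_mysql_url(user, password, host, port, database, driver="pymysql"):
    """Assemble a SQLAlchemy-style MySQL URL, percent-encoding the password."""
    return f"mysql+{driver}://{user}:{quote_plus(password)}@{host}:{port}/{database}"

# MYSQL_PASSWORD is a hypothetical environment variable; an empty string is used if unset
password = os.environ.get("MYSQL_PASSWORD", "")
url = build_mysql_url("root", password, "localhost", 3306, "leading_causes_of_death_db")

# The URL would then be passed to SQLAlchemy (requires a running MySQL server):
# engine = create_engine(url)
```

Percent-encoding matters when the password contains characters such as `@` or `!`, which could otherwise be misread as URL delimiters.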

In [37]:
# Load causes by unique states from 2013 to 2016 data
top_10_leading_2013_states_unique.to_sql(name='top_10_2013_unique_states', con=engine, if_exists='replace', index=True)
top_10_leading_2014_states_unique.to_sql(name='top_10_2014_unique_states', con=engine, if_exists='replace', index=True)
top_10_leading_2015_states_unique.to_sql(name='top_10_2015_unique_states', con=engine, if_exists='replace', index=True)
top_10_leading_2016_states_unique.to_sql(name='top_10_2016_unique_states', con=engine, if_exists='replace', index=True)


In [39]:
# Load sco from 2013 to 2016 data
CDI_sco_2013.to_sql(name='cdi_sco_2013', con=engine, if_exists='replace', index=True)
CDI_sco_2014.to_sql(name='cdi_sco_2014', con=engine, if_exists='replace', index=True)
CDI_sco_2015.to_sql(name='cdi_sco_2015', con=engine, if_exists='replace', index=True)
CDI_sco_2016.to_sql(name='cdi_sco_2016', con=engine, if_exists='replace', index=True)

In [40]:
# Load scm from 2013 to 2016 data
CDI_scm_2013.to_sql(name='cdi_scm_2013', con=engine, if_exists='replace', index=True)
CDI_scm_2014.to_sql(name='cdi_scm_2014', con=engine, if_exists='replace', index=True)
CDI_scm_2015.to_sql(name='cdi_scm_2015', con=engine, if_exists='replace', index=True)
CDI_scm_2016.to_sql(name='cdi_scm_2016', con=engine, if_exists='replace', index=True)


In [41]:
# Load scf from 2013 to 2016 data
CDI_scf_2013.to_sql(name='cdi_scf_2013', con=engine, if_exists='replace', index=True)
CDI_scf_2014.to_sql(name='cdi_scf_2014', con=engine, if_exists='replace', index=True)
CDI_scf_2015.to_sql(name='cdi_scf_2015', con=engine, if_exists='replace', index=True)
CDI_scf_2016.to_sql(name='cdi_scf_2016', con=engine, if_exists='replace', index=True)


In [42]:
# Load sao from 2013 to 2016 data
CDI_sao_2013.to_sql(name='cdi_sao_2013', con=engine, if_exists='replace', index=True)
CDI_sao_2014.to_sql(name='cdi_sao_2014', con=engine, if_exists='replace', index=True)
CDI_sao_2015.to_sql(name='cdi_sao_2015', con=engine, if_exists='replace', index=True)
CDI_sao_2016.to_sql(name='cdi_sao_2016', con=engine, if_exists='replace', index=True)


In [43]:
# Load sam from 2013 to 2016 data
CDI_sam_2013.to_sql(name='cdi_sam_2013', con=engine, if_exists='replace', index=True)
CDI_sam_2014.to_sql(name='cdi_sam_2014', con=engine, if_exists='replace', index=True)
CDI_sam_2015.to_sql(name='cdi_sam_2015', con=engine, if_exists='replace', index=True)
CDI_sam_2016.to_sql(name='cdi_sam_2016', con=engine, if_exists='replace', index=True)
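
The load cells above repeat one `to_sql` call per dataframe. As a sketch of how the same loading could be driven by a loop over a name-to-dataframe mapping, the example below uses an in-memory SQLite connection as a stand-in for the MySQL engine so that it runs without a database server. The toy frames and the `tables` mapping are assumptions for illustration only.

```python
import sqlite3
import pandas as pd

# Hypothetical mapping of table names to dataframes (toy values)
tables = {
    "cdi_sao_2013": pd.DataFrame({"LocationAbbr": ["AL", "AK"], "DataValue": [13.0, 18.2]}),
    "cdi_sao_2014": pd.DataFrame({"LocationAbbr": ["AL"], "DataValue": [12.5]}),
}

# In-memory SQLite stands in for the MySQL engine used in the notebook
con = sqlite3.connect(":memory:")
for name, frame in tables.items():
    frame.to_sql(name=name, con=con, if_exists="replace", index=True)

# Read the row counts back to confirm each table was written
row_counts = {
    name: int(pd.read_sql_query(f"SELECT COUNT(*) AS n FROM {name}", con)["n"].iloc[0])
    for name in tables
}
con.close()
print(row_counts)
```

With the notebook's own objects, `con` would be the SQLAlchemy `engine` and `tables` would map names such as `'cdi_sam_2016'` to the corresponding dataframes.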
