# HealthData: Top 10 Causes of Death by State

We begin by importing the dataset and loading it into a pandas DataFrame.

In [1]:
import pandas as pd

path = r"C:\Users\Basil\Documents\Data Science\Projects\20200506 Coronavirus\1. Original Data\NCHS_-_Leading_Causes_of_Death__United_States.csv"
df = pd.read_csv(path)

In [2]:
df.head(5)

Unnamed: 0,Year,113 Cause Name,Cause Name,State,Deaths,Age-adjusted Death Rate
0,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,United States,169936,49.4
1,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alabama,2703,53.8
2,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alaska,436,63.7
3,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arizona,4184,56.2
4,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arkansas,1625,51.8


First, we remove the United States total rows, which appear in the State column alongside the individual states.

In [3]:
clean_df = df[df.State != 'United States']

In [4]:
clean_df['Cause Name'].unique()

array(['Unintentional injuries', 'All causes', "Alzheimer's disease",
       'Stroke', 'CLRD', 'Diabetes', 'Heart disease',
       'Influenza and pneumonia', 'Suicide', 'Cancer', 'Kidney disease'],
      dtype=object)

The Cause Name column includes an All causes category. Because Tableau sums the rows automatically, keeping this aggregate would double-count deaths, so we can remove these rows.

In [5]:
clean_df.head(5)

Unnamed: 0,Year,113 Cause Name,Cause Name,State,Deaths,Age-adjusted Death Rate
1,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alabama,2703,53.8
2,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alaska,436,63.7
3,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arizona,4184,56.2
4,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arkansas,1625,51.8
5,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,California,13840,33.2


In [6]:
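# .copy() gives an independent DataFrame, so adding a column below does not raise SettingWithCopyWarning
clean_df = clean_df[clean_df["Cause Name"] != 'All causes'].copy()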
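
To illustrate why the aggregate rows are dropped, here is a minimal sketch on a toy frame. The numbers are made up for illustration and are not taken from the dataset; only the column names match.

```python
import pandas as pd

# Made-up rows in the same shape as the dataset (not real figures).
toy = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama"],
    "Cause Name": ["All causes", "Cancer", "Stroke"],
    "Deaths": [100, 30, 20],
})

# Summing every row counts the listed causes twice:
# once under their own name and once inside 'All causes'.
total_with_all = toy["Deaths"].sum()

# Dropping the aggregate row leaves only the individual causes.
causes_only = toy[toy["Cause Name"] != "All causes"]
total_causes_only = causes_only["Deaths"].sum()

print(total_with_all, total_causes_only)
```

Any tool that sums the Deaths column across all rows would report the inflated first total unless the All causes rows are filtered out first.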

We need a column of state abbreviations so that this data can be joined to our COVID-19 deaths dataset. We will copy the State column and then convert the state names to abbreviations.
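
As a rough sketch of what that join could look like in pandas: the COVID-19 table below, including its `state` and `covid_deaths` column names, is hypothetical, and all numbers are made up. The real dataset may be laid out differently.

```python
import pandas as pd

# Stand-in for the cleaned causes-of-death data (made-up numbers).
causes = pd.DataFrame({
    "Abbreviated State": ["AL", "AK"],
    "Cause Name": ["Cancer", "Cancer"],
    "Deaths": [10, 5],
})

# Hypothetical COVID-19 table keyed by state abbreviation.
covid = pd.DataFrame({
    "state": ["AL", "AK"],
    "covid_deaths": [3, 1],
})

# A left join keeps every causes-of-death row and attaches the
# matching COVID-19 figures by abbreviation.
merged = causes.merge(
    covid, left_on="Abbreviated State", right_on="state", how="left"
)
print(merged)
```

The join only works if both tables spell the key the same way, which is why the full state names have to be converted first.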

In [7]:
clean_df['Abbreviated State'] = clean_df['State']

In [8]:
clean_df = clean_df.replace({"Abbreviated State" : {"Alabama" : "AL", 
                                "Alaska" : "AK", 
                                "Arizona" : "AZ",
                                "Arkansas" : "AR",
                                "California" : "CA",
                                "Colorado" : "CO",
                                "Connecticut" : "CT",
                                "Delaware" : "DE",
                                "District of Columbia" : "DC",
                                "Florida" : "FL",
                                "Georgia" : "GA",
                                "Hawaii" : "HI",
                                "Idaho" : "ID",
                                "Illinois" : "IL",
                                "Indiana" : "IN",
                                "Iowa" : "IA",
                                "Kansas" : "KS",
                                "Kentucky" : "KY",
                                "Louisiana" : "LA",
                                "Maine" : "ME",
                                "Maryland" : "MD",
                                "Massachusetts" : "MA",
                                "Michigan" : "MI",
                                "Minnesota" : "MN",
                                "Mississippi" : "MS",
                                "Missouri" : "MO",
                                "Montana" : "MT",
                                "Nebraska" : "NE",
                                "Nevada" : "NV",
                                "New Hampshire" : "NH",
                                "New Jersey" : "NJ",
                                "New Mexico" : "NM",
                                "New York" : "NY",
                                "North Carolina" : "NC",
                                "North Dakota" : "ND",
                                "Ohio" : "OH",
                                "Oklahoma" : "OK",
                                "Oregon" : "OR",
                                "Pennsylvania" : "PA",
                                "Rhode Island" : "RI",
                                "South Carolina" : "SC",
                                "South Dakota" : "SD",
                                "Tennessee" : "TN",
                                "Texas" : "TX",
                                "Utah" : "UT",
                                "Vermont" : "VT",
                                "Virginia" : "VA",
                                "Washington" : "WA",
                                "West Virginia" : "WV",
                                "Wisconsin" : "WI",
                                "Wyoming" : "WY"}})

In [9]:
clean_df.head(5)

Unnamed: 0,Year,113 Cause Name,Cause Name,State,Deaths,Age-adjusted Death Rate,Abbreviated State
1,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alabama,2703,53.8,AL
2,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alaska,436,63.7,AK
3,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arizona,4184,56.2,AZ
4,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arkansas,1625,51.8,AR
5,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,California,13840,33.2,CA
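
Because `replace` leaves any value without a dictionary entry unchanged, a misspelled or missing state name would silently stay as a full name. One possible check is sketched below on a toy frame; 'Puerto Rico' is a hypothetical example of an unmapped value and is not known to be in the dataset.

```python
import pandas as pd

# Toy frame standing in for clean_df; 'Puerto Rico' is a hypothetical
# name with no entry in the mapping.
toy = pd.DataFrame({"State": ["Alabama", "Puerto Rico"]})
toy["Abbreviated State"] = toy["State"]
toy = toy.replace({"Abbreviated State": {"Alabama": "AL"}})

# Anything that is not exactly two characters long was not converted.
not_converted = toy["Abbreviated State"].str.len() != 2
unmapped = toy.loc[not_converted, "State"].unique()
print(unmapped)
```

An empty result from the same check on `clean_df` would suggest that every state name was converted.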


Now that the data is clean, we will save it to a new file.

In [10]:
save_path = r"C:\Users\Basil\Documents\Data Science\Projects\20200506 Coronavirus\2. Prepared Data\Top 10 Deaths by State.csv"
clean_df.to_csv(save_path)
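
Note that `to_csv` writes the DataFrame index as an extra column by default, and pandas reads that column back as `Unnamed: 0`. If the old row numbers are not needed downstream, passing `index=False` is one way to leave them out. The sketch below uses a toy frame and in-memory buffers rather than the real file.

```python
import io
import pandas as pd

# Toy frame with a leftover, non-contiguous index, like clean_df after filtering.
toy = pd.DataFrame({"State": ["Alabama"], "Deaths": [1]}, index=[7])

# Default: the index is written as an extra column with an empty header.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False omits the index column entirely.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```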