# Analyzing NYC leading causes of death

This notebook analyzes the leading causes of death by sex and ethnicity in New York City since 2007. Cause of death is derived from the NYC death certificate, which is issued for every death that occurs in New York City.

[Source from NYC OpenData](https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam)

In [9]:
# imports
import tempfile
from math import pi

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
from bokeh.io import output_notebook, reset_output
from bokeh.models import ColumnDataSource, HoverTool, Legend, Range1d
from bokeh.models.widgets import Panel, Tabs
from bokeh.palettes import (BuGn, BuPu, Category20b, GnBu, Inferno256, PiYG,
                            Plasma, PRGn, PuOr, PuRd, RdPu, RdYlGn, Spectral,
                            Viridis, YlGnBu, YlOrRd, brewer)
from bokeh.plotting import figure, output_file, save, show
from IPython.display import HTML, IFrame, display

# silence SettingWithCopyWarning and use plotly for Series.plot / Series.hist
pd.options.mode.chained_assignment = None
pd.options.plotting.backend = "plotly"

# render Bokeh plots inline in the notebook
reset_output()
output_notebook()

## Load the dataset saved from NYC OpenData

In [10]:
df = pd.read_csv('./data/New_York_City_Leading_Causes_of_Death.csv')
df.head()

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
0,2009,"Nephritis, Nephrotic Syndrome and Nephrisis (N...",F,Other Race/ Ethnicity,.,.,.
1,2013,Influenza (Flu) and Pneumonia (J09-J18),F,Hispanic,204,16.3,18.5
2,2012,"Assault (Homicide: Y87.1, X85-Y09)",M,Other Race/ Ethnicity,.,.,.
3,2007,Essential Hypertension and Renal Diseases (I10...,F,Not Stated/Unknown,5,.,.
4,2014,Cerebrovascular Disease (Stroke: I60-I69),F,White Non-Hispanic,418,29.5,15.6


In [11]:
len(df)

1272

Each of the 1,272 rows pertains to one leading cause of death for one sex and ethnicity in one year.
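One way to check that reading of the data is to look for repeated combinations of year, cause, sex, and ethnicity. The sketch below uses a small made-up frame (`toy`) rather than the real CSV, and the name `key` is just illustrative; on the real data the same call would be run on `df`.

```python
import pandas as pd

# Toy frame standing in for the real CSV (values are illustrative only).
toy = pd.DataFrame({
    "Year": [2009, 2009, 2009, 2010],
    "Leading Cause": ["Diabetes Mellitus (E10-E14)"] * 4,
    "Sex": ["F", "M", "F", "F"],
    "Race Ethnicity": ["Hispanic"] * 4,
    "Deaths": ["215", "196", "215", "."],
})

# The columns that should identify a row if the assumed grain holds.
key = ["Year", "Leading Cause", "Sex", "Race Ethnicity"]

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = toy[toy.duplicated(subset=key, keep=False)]
print(dupes)
```

Inconsistent labels (for example `F` versus `Female`) can hide duplicates, so this check is more informative after the labels have been normalized.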

## Preprocess the data

Check the datatypes of the columns.

In [12]:
df.dtypes

Year                        int64
Leading Cause              object
Sex                        object
Race Ethnicity             object
Deaths                     object
Death Rate                 object
Age Adjusted Death Rate    object
dtype: object

We will convert the object columns to more fitting data types. First, look at the unique values in `Deaths`.

In [13]:
df.Deaths.unique()

array(['.', '204', '5', '418', '26', '618', '267', '139', '124', '166',
       '56', '132', '395', '292', '240', '1840', '39', '142', '134', '33',
       '252', '1354', '15', '8', '1563', '1680', '248', '165', '24', '38',
       '9', '68', '29', '42', '48', '272', '94', '60', '18', '20', '73',
       '30', '4495', '12', '622', '969', '13', '227', '14', '51', '6',
       '50', '473', '2034', '7', '1069', '5168', '237', '71', '182',
       '307', '107', '1013', '137', '281', '84', '2194', '88', '258',
       '87', '1327', '1918', '36', '633', '10', '57', '168', '83', '276',
       '136', '279', '108', '3184', '271', '437', '147', '61', '1193',
       '1375', '163', '102', '410', '219', '2269', '1195', '245', '145',
       '188', '390', '43', '186', '285', '53', '93', '119', '7050', '129',
       '199', '1097', '2068', '66', '52', '49', '6297', '221', '179',
       '4535', '1341', '2316', '229', '3187', '41', '140', '27', '195',
       '164', '235', '2445', '25', '249', '156', '191', '408

We need to clean some of the values first: some rows have `'.'` in `Deaths`, `Death Rate`, or `Age Adjusted Death Rate`.

We drop every row where any of these columns is `'.'`, since we cannot tell whether the rest of the row is valid.
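Before dropping anything, it can help to see how many values the `'.'` placeholder affects in each column. The following is a minimal sketch on a made-up frame (`raw`), assuming `'.'` is the only non-numeric placeholder; `pd.to_numeric(errors="coerce")` turns anything it cannot parse into `NaN`.

```python
import pandas as pd

# Made-up rows mimicking the three numeric columns of the dataset.
raw = pd.DataFrame({
    "Deaths": ["204", ".", "5"],
    "Death Rate": ["16.3", ".", "."],
    "Age Adjusted Death Rate": ["18.5", ".", "."],
})
numeric_cols = ["Deaths", "Death Rate", "Age Adjusted Death Rate"]

# Unparseable entries (here '.') become NaN instead of raising an error.
coerced = raw[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Missing values per column, and the rows that survive a strict drop.
missing_per_column = coerced.isna().sum()
complete_rows = coerced.dropna()
print(missing_per_column)
print(len(complete_rows))
```

Rows such as the third one, which has a death count but no rates, are lost under the strict drop; whether that is acceptable depends on the analysis.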

In [15]:
# drop rows where any numeric column holds the '.' placeholder
numeric_cols = ['Deaths', 'Death Rate', 'Age Adjusted Death Rate']
df = df[(df[numeric_cols] != '.').all(axis=1)].copy()

# change datatypes
df[numeric_cols] = df[numeric_cols].astype('float')
df['Leading Cause'] = df['Leading Cause'].astype('string')
df['Sex'] = df['Sex'].astype('string')
df['Race Ethnicity'] = df['Race Ethnicity'].astype('string')


len(df)


886

Check the unique values to see if other inaccuracies appear.

In [16]:
df.Sex.value_counts()

F         354
M         354
Female     90
Male       88
Name: Sex, dtype: Int64

Sex is recorded both as `F`/`M` and as `Female`/`Male`. We will recode `Female` and `Male` to `F` and `M`.

In [17]:
# recode the long labels; a single .loc call avoids chained assignment
df.loc[df['Sex'] == 'Female', 'Sex'] = 'F'
df.loc[df['Sex'] == 'Male', 'Sex'] = 'M'
df.Sex.value_counts()

F    444
M    442
Name: Sex, dtype: Int64

In [18]:
df['Race Ethnicity'].value_counts()

Hispanic                      199
Asian and Pacific Islander    199
Black Non-Hispanic            178
White Non-Hispanic            176
Other Race/ Ethnicity          67
Not Stated/Unknown             23
Non-Hispanic White             22
Non-Hispanic Black             22
Name: Race Ethnicity, dtype: Int64

The dataset now has 886 rows with valid data.
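The counts above list both `White Non-Hispanic` and `Non-Hispanic White` (and likewise for Black), which look like two spellings of the same group. Assuming they are, a sketch of folding them together is shown below; the `race_aliases` mapping is hypothetical and should be checked against the dataset documentation before use.

```python
import pandas as pd

# Small stand-in for df['Race Ethnicity'].
race = pd.Series([
    "White Non-Hispanic", "Non-Hispanic White",
    "Black Non-Hispanic", "Non-Hispanic Black", "Hispanic",
])

# Hypothetical mapping: treat "Non-Hispanic X" as an alias of "X Non-Hispanic".
race_aliases = {
    "Non-Hispanic White": "White Non-Hispanic",
    "Non-Hispanic Black": "Black Non-Hispanic",
}

# Labels not present in the mapping are left unchanged.
race_clean = race.replace(race_aliases)
counts = race_clean.value_counts()
print(counts)
```

On the real data the same `replace` call would be applied to `df['Race Ethnicity']`.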

Show the 2009 rows for the Hispanic group, sorted by leading cause and sex, to gain an understanding of how the data is stored.

In [19]:
df[(df.Year==2009) & (df['Race Ethnicity']=='Hispanic')].sort_values(by=['Leading Cause', 'Sex'])

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
1075,2009,"Accidents Except Drug Posioning (V01-X39, X43,...",F,Hispanic,56.0,4.7,5.3
341,2009,"Accidents Except Drug Posioning (V01-X39, X43,...",M,Hispanic,170.0,15.1,17.9
971,2009,All Other Causes,F,Hispanic,775.0,64.7,75.3
400,2009,All Other Causes,M,Hispanic,1067.0,95.0,138.1
830,2009,Alzheimer's Disease (G30),F,Hispanic,56.0,4.7,6.2
898,2009,Cerebrovascular Disease (Stroke: I60-I69),F,Hispanic,154.0,12.9,16.0
886,2009,Cerebrovascular Disease (Stroke: I60-I69),M,Hispanic,143.0,12.7,21.1
143,2009,"Chronic Liver Disease and Cirrhosis (K70, K73)",M,Hispanic,119.0,10.6,14.1
802,2009,Chronic Lower Respiratory Diseases (J40-J47),F,Hispanic,155.0,12.9,16.0
1086,2009,Chronic Lower Respiratory Diseases (J40-J47),M,Hispanic,111.0,9.9,18.3


Now sort by the Death Rate in descending order.

In [20]:
df[(df.Year==2009) & (df['Race Ethnicity']=='Hispanic')].sort_values('Death Rate', ascending=False)

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
538,2009,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",M,Hispanic,1382.0,123.1,227.9
257,2009,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,Hispanic,1349.0,112.7,143.8
894,2009,Malignant Neoplasms (Cancer: C00-C97),M,Hispanic,1180.0,105.1,172.2
400,2009,All Other Causes,M,Hispanic,1067.0,95.0,138.1
147,2009,Malignant Neoplasms (Cancer: C00-C97),F,Hispanic,1097.0,91.6,107.2
971,2009,All Other Causes,F,Hispanic,775.0,64.7,75.3
634,2009,Influenza (Flu) and Pneumonia (J09-J18),F,Hispanic,216.0,18.0,22.8
700,2009,Diabetes Mellitus (E10-E14),F,Hispanic,215.0,18.0,22.2
367,2009,Human Immunodeficiency Virus Disease (HIV: B20...,M,Hispanic,196.0,17.5,20.0
642,2009,Influenza (Flu) and Pneumonia (J09-J18),M,Hispanic,183.0,16.3,30.9


### EDA

Now show the distributions of the numeric columns.

In [30]:
# show histograms for numeric columns
numeric_columns = ['Year', 'Deaths', 'Death Rate', 'Age Adjusted Death Rate']
for column in numeric_columns:
    fig = df[column].hist(title = column)
    fig.show()

In [31]:
df.head()

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
1,2013,Influenza (Flu) and Pneumonia (J09-J18),F,Hispanic,204.0,16.3,18.5
4,2014,Cerebrovascular Disease (Stroke: I60-I69),F,White Non-Hispanic,418.0,29.5,15.6
5,2009,Essential Hypertension and Renal Diseases (I10...,M,Asian and Pacific Islander,26.0,5.1,7.2
6,2013,Influenza (Flu) and Pneumonia (J09-J18),M,White Non-Hispanic,618.0,45.9,36.7
7,2007,"Assault (Homicide: Y87.1, X85-Y09)",M,Black Non-Hispanic,267.0,31.3,31.1


Show the breakdown of the categorical columns.

In [39]:
df['Leading Cause'].plot(kind='barh')

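That plot is not readable: it plots the raw string column instead of counts per category. Let's count the values first and use Bokeh instead.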

In [54]:
# function to make a dataframe to count the unique values for each category in a given column
def prep_bokeh_df(df, column):
    # value_counts sorts by frequency; rename_axis names the index so that
    # reset_index yields the two columns [column, 'Count']
    count_df = df[column].value_counts().rename_axis(column).reset_index(name='Count')
    return count_df


In [55]:
# make the new dfs
leading_cause_bar_df = prep_bokeh_df(df, 'Leading Cause')
race_bar_df = prep_bokeh_df(df, 'Race Ethnicity')
sex_bar_df = prep_bokeh_df(df, 'Sex')

In [56]:
# function to plot the counts of a categorical column as a horizontal bar chart
# inputs: the count dataframe, column name, title, color palette
# (a dict keyed by number of colors), height, and width
def bokeh_hbar(df, column, title, colorpalette, height, width):
    TOOLS = "hover,save,pan,box_zoom,reset,wheel_zoom,tap"
    p = figure(plot_height=height,
               plot_width=width,
               title=title,
               tools=TOOLS,
               toolbar_location='right',
               # reversed so the most frequent category is drawn at the top
               y_range=df[column].unique()[::-1])
    p.hbar(y=df[column], right=df['Count'], height=0.75,
           color=colorpalette[df[column].nunique()])
    p.yaxis.axis_label = column
    p.xaxis.axis_label = 'Number of documentations'
    p.select_one(HoverTool).tooltips = [
        (column, '@y'),
        ('Number of documentations', '@right')]
    return p

In [59]:
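# build palettes keyed by the number of categories, as bokeh_hbar expects
# 32 colors: the first 8 PuRd shades repeated 4 times
PuRd_32 = {32: PuRd[9][:-1] * 4}
# 2 colors: the first two PuRd shades
PuRd_2 = {2: PuRd[9][:2]}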


In [63]:
# initialize plots to be put in tabs
p_cause = bokeh_hbar(leading_cause_bar_df, 'Leading Cause', "Number of leading cause documentations",
                     PuRd_32, 450, 650)
p_race = bokeh_hbar(race_bar_df, 'Race Ethnicity', "Number of race documentations",
                    PuRd, 450, 650)
# p_sex = bokeh_hbar(sex_bar_df, 'Sex', "Number of sex documentations",
#                    PuRd_2, 450, 650)

# format in tabs (only the first tab is enabled for now)
tabs = Tabs(tabs=[
    Panel(child=p_cause, title='Leading Cause'),
    # Panel(child=p_race, title='Race'),
    # Panel(child=p_sex, title='Sex')
])

# show(tabs)

show(p_cause)

AttributeError: 'IntegerArray' object has no attribute 'tolist'
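A likely cause of this error: the categorical columns were cast to the `string` dtype, and `value_counts()` on such a column returns nullable `Int64` counts (see the `dtype: Int64` lines in the outputs above). The traceback suggests that the installed Bokeh version cannot serialize the underlying `IntegerArray`. One possible workaround, sketched here on a made-up series, is to convert both columns to plain Python lists before passing them to the glyph method; this is an assumption about the cause, not a confirmed diagnosis.

```python
import pandas as pd

# Made-up categorical column with the nullable string dtype used above.
causes = pd.Series(["A", "B", "A", "A"], dtype="string")

# Same shape as the output of prep_bokeh_df: one row per category with a Count.
counts = causes.value_counts().rename_axis("Leading Cause").reset_index(name="Count")

# Plain Python lists avoid handing pandas extension arrays to the plotting
# library, e.g. p.hbar(y=categories, right=values, ...).
categories = counts["Leading Cause"].astype(str).tolist()
values = counts["Count"].astype("int64").tolist()
print(categories, values)
```

The same conversion could be applied inside `bokeh_hbar` to the `y`, `right`, and `y_range` arguments.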