# Dataset exploration

* [Annual mortality and causes by county, 1979-1988](https://www.cdc.gov/nchs/data_access/cmf.htm)
* [Compressed mortality info, 1968-2016](https://wonder.cdc.gov/controller/datarequest/D140)

[Rainfall and turnout](https://faculty.ucmerced.edu/thansford/Articles/The%20Republicans%20Should%20Pray%20for%20Rain%20-%20Weather,%20Turnour,%20and%20Voting%20in%20U.S.%20Presidential%20Elections.pdf)

[Higher temperatures increase suicide rates in the
United States and Mexico](https://web.stanford.edu/~mburke/papers/BurkeEtAl_NCC_2018.pdf)

As temperatures rise, suicide rates rise.
In countries or regions that go through heat waves, suicide rates differed significantly during those periods.

Is there monthly data?

[WHO mortality data](https://www.who.int/healthinfo/statistics/mortality_rawdata/en/)

[CDC Multiple Mortality Cause files](https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm#Mortality_Multiple)

[Suicides and gun ownership](https://mason.gmu.edu/~atabarro/BriggsTabarrokFirearmsSuicide.pdf)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Suicides by Month, Year, State

`suicides.txt` from https://wonder.cdc.gov/wonder/help/mcd.html, restricted to suicides
* https://wonder.cdc.gov/mcd-icd10.html
* Click Agree
* https://wonder.cdc.gov/controller/datarequest/D77
* Group results by State, Year, Month
* Ages exclude "Not stated"; Hispanic Origin exclude "Not stated"
* Underlying cause of death: X60-X84 (Intentional self-harm)

In [None]:
suicides = pd.read_csv('data/suicides_month_year_state_1999_2017.txt', sep='\t', na_values='Not Applicable')
suicides = suicides.dropna(subset=['State'])
suicides = suicides.drop(columns='Notes')

In [None]:
suicides.head()

In [None]:
suicides['Month'] = suicides['Month Code'].str.slice(-2).astype(np.int64)

In [None]:
suicides.info()

In [None]:
suicides = suicides.rename(columns={'Crude Rate':'suicide_rate'})

# Get population data to calculate rates
CDC only provides population figures for annual-level searches, so run the same query grouped by year:
* https://wonder.cdc.gov/mcd-icd10.html
* Click Agree
* https://wonder.cdc.gov/controller/datarequest/D77
* Group results by State, Year
* Ages exclude "Not stated"; Hispanic Origin exclude "Not stated"
* Underlying cause of death: X60-X84 (Intentional self-harm)

In [None]:
suicides_pop = pd.read_csv('data/suicides_year_1999_2017.txt', sep='\t', na_values='Not Applicable')
suicides_pop = suicides_pop.dropna(subset=['State'])
suicides_pop = suicides_pop.drop(columns='Notes')
suicides_pop

## Add population numbers to our suicides dataframe

In [None]:
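# Replace the monthly file's Population column with the annual figures (a left merge keeps row order)
suicides = suicides.drop(columns='Population').merge(suicides_pop[['State', 'Year', 'Population']], on=['State', 'Year'], how='left')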

In [None]:
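# Monthly deaths per 100,000 residents; stored as suicide_rate, the column used in the rest of the notebook
suicides['suicide_rate'] = suicides.Deaths / suicides.Population * 100_000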
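
As a minimal sketch of the population join and rate calculation, the toy example below uses invented numbers (not CDC data) and hypothetical frame names `monthly` and `annual_pop`. It assumes the annual table has exactly one row per (State, Year); `validate='m:1'` makes pandas raise if that assumption is violated.

```python
import pandas as pd

# Toy monthly deaths (invented numbers, not CDC data).
monthly = pd.DataFrame({
    'State': ['Alabama', 'Alabama', 'Alaska'],
    'Year': [1999, 1999, 1999],
    'Month': [1, 2, 1],
    'Deaths': [50, 40, 10],
})

# Toy annual populations, one row per (State, Year).
annual_pop = pd.DataFrame({
    'State': ['Alabama', 'Alaska'],
    'Year': [1999, 1999],
    'Population': [4_000_000, 500_000],
})

# how='left' keeps every monthly row in its original order;
# validate='m:1' raises MergeError if a (State, Year) pair repeats in annual_pop.
merged = monthly.merge(annual_pop, on=['State', 'Year'], how='left', validate='m:1')

# Monthly deaths per 100,000 residents, using the annual population as the denominator.
merged['suicide_rate'] = merged['Deaths'] / merged['Population'] * 100_000
print(merged)
```

Because the denominator is an annual population, the result is a monthly rate and will be roughly one twelfth of an annual crude rate.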

In [None]:
suicides

# Download heat-wave data
* Database: [North America Land Data Assimilation System (NLDAS) Daily Air Temperatures and Heat Index (1979-2011)](https://wonder.cdc.gov/nasa-nldas.html)
* Group results by State, Year, Month
* Dataset goes from 1999 to 2011

In [None]:
heat = pd.read_csv('data/temps_by_state_month_1999_2011.txt', sep='\t', na_values='Missing')
heat = heat.dropna(subset=['State'])
heat['Month'] = heat['Month, Year Code'].str.slice(-2).astype(np.int64)
heat['Year'] = heat['Month, Year Code'].str.slice(0, 4).astype(np.int64)
heat = heat.drop(columns='Notes')
heat

In [None]:
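# numeric_only=True skips text columns such as the month code, which recent pandas versions refuse to average
heat[heat.Year == 1999].groupby(['State', 'Month']).mean(numeric_only=True).query('State == "Alabama"')

In [None]:
heat.groupby('State').mean(numeric_only=True)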

# Merge suicide and heat data

In [None]:
heat = heat.rename(columns={'Avg Daily Max Air Temperature (F)':'avg_max_t',
                    'Avg Daily Min Air Temperature (F)':'avg_min_t',
                    'Avg Daily Max Heat Index (F)':'avg_max_heat_index',
                    'Month, Year Code' : 'Month Code'})

In [None]:
suicides = suicides.merge(heat[['State', 'Month Code', 'avg_max_t', 'avg_min_t', 'avg_max_heat_index']], 
               on=['State', 'Month Code'])

In [None]:
suicides

In [None]:
suicides['min_t_diff'] = suicides.avg_min_t - suicides.groupby(['State', 'Month']).avg_min_t.transform('mean')
suicides['max_t_diff'] = suicides.avg_max_t - suicides.groupby(['State', 'Month']).avg_max_t.transform('mean')
suicides['heat_index_diff'] = suicides.avg_max_heat_index - suicides.groupby(['State', 'Month']).avg_max_heat_index.transform('mean')
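
The `*_diff` columns measure how far a given month deviates from that state's long-run mean for the same calendar month. A small sketch with invented temperatures (the frame `temps` is hypothetical) shows what `groupby(...).transform('mean')` returns and why the subtraction lines up row by row:

```python
import pandas as pd

# Invented temperatures for one state, two years, two calendar months.
temps = pd.DataFrame({
    'State': ['Alabama'] * 4,
    'Year': [1999, 2000, 1999, 2000],
    'Month': [7, 7, 8, 8],
    'avg_max_t': [90.0, 94.0, 91.0, 89.0],
})

# transform('mean') returns one value per original row (the group mean),
# so it can be subtracted from the column directly.
baseline = temps.groupby(['State', 'Month'])['avg_max_t'].transform('mean')
temps['max_t_diff'] = temps['avg_max_t'] - baseline
print(temps)
```

Within each (State, Month) group the differences sum to zero, so a positive value means a hotter-than-usual month for that state.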

## Write cleaned dataset to file

In [None]:
suicides.to_csv('data/suicides_heat.csv')

# Exploratory Data Analysis

In [None]:
sns.barplot(x='State', y='suicide_rate', data=suicides)

In [None]:
sns.scatterplot(x='avg_max_t', y='suicide_rate',  data=suicides.query('4 < Month < 10'), hue='Month', legend='full')

In [None]:
sns.scatterplot(x='avg_max_heat_index', 
                y='suicide_rate', data=suicides.query('4 < Month < 10'), hue='Month', legend='full')

In [None]:
sns.lmplot(x='avg_max_heat_index', 
                y='suicide_rate', data=suicides.query('4 < Month < 10'), legend='full')

In [None]:
sns.scatterplot(x='heat_index_diff', y='suicide_rate', data=suicides.query('4 < Month < 10'), hue='Month')

What if we compare months where `heat_index_diff` is above 5 with months where it is 5 or below?
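
One way to try that comparison is to split the rows at the threshold and run Welch's t-test on the two groups. The sketch below uses synthetic data in a hypothetical frame `demo`, so its result says nothing about the real dataset; only the 5-degree threshold comes from the question above.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic stand-in for the merged dataset (random numbers, no real signal).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'heat_index_diff': rng.normal(0, 4, size=500),
    'suicide_rate': rng.normal(1.0, 0.3, size=500),
})

threshold = 5  # degrees F above the state-month mean

# Split into unusually hot months and all other months, dropping missing rates.
hot = demo.loc[demo['heat_index_diff'] > threshold, 'suicide_rate'].dropna()
not_hot = demo.loc[demo['heat_index_diff'] <= threshold, 'suicide_rate'].dropna()

# Welch's t-test: does not assume the two groups have equal variance.
result = stats.ttest_ind(hot, not_hot, equal_var=False)
print(len(hot), len(not_hot), result.statistic, result.pvalue)
```

On the real data, `demo` would be replaced by the summer subset of `suicides`.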

In [None]:
sns.lmplot(x='heat_index_diff', y='suicide_rate', data=suicides.query('4 < Month < 10'))

In [None]:
sns.scatterplot(x='max_t_diff', y='suicide_rate', data=suicides.query('4 < Month < 10'), hue='Month')

In [None]:
sns.scatterplot(x='min_t_diff', y='suicide_rate', data=suicides.query('4 < Month < 10'), hue='Month')

There may be a time effect over the years!

In [None]:
sns.scatterplot(x='Month Code', y='suicide_rate', data=suicides[suicides.State == 'California'],hue='Month', legend='full')

There is a strong annual trend.

In [None]:
# Select Deaths before summing so text columns are not aggregated
plt.plot(suicides.groupby('Year').Deaths.sum())

A chi-squared test tells us whether two categorical variables are independent.

It applies when we count occurrences of one outcome across several factors, for example suicide counts summed by state and year:
```
                 99   | 00   | 01    | 02 | 03 ... 17
AL suicides      sum    sum    sum    sum   sum ...
CA 
CO
...
```
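
The state-by-year layout sketched above can be built as a contingency table and passed to `scipy.stats.chi2_contingency`. This is a minimal sketch with invented counts in a hypothetical frame `deaths`, not the CDC data:

```python
import pandas as pd
import scipy.stats as stats

# Invented monthly counts for two states and two years.
deaths = pd.DataFrame({
    'State': ['AL', 'AL', 'AL', 'AL', 'CA', 'CA', 'CA', 'CA'],
    'Year': [1999, 1999, 2000, 2000, 1999, 1999, 2000, 2000],
    'Deaths': [40, 45, 50, 55, 250, 260, 270, 280],
})

# Rows are states, columns are years, cells are summed deaths.
table = deaths.pivot_table(index='State', columns='Year', values='Deaths', aggfunc='sum')

# Tests whether the distribution of deaths across years is independent of state.
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(table)
print(chi2, p_value, dof)
```

The test works on raw counts, so differences in population between states are absorbed into the row totals rather than biasing the result.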

In [None]:
# In summer months: a categorical variable with two levels, low heat and high heat

## Hypothesis 1 - winter gets more suicides than summer

In [None]:
summer = suicides.query('5 < Month < 9')            # June-August
winter = suicides.query('Month < 4 or Month > 10')  # November-March
spring = suicides.query('3 < Month < 6')            # April-May
longsummer = suicides.query('3 < Month < 9')        # April-August
fall = suicides.query('8 < Month < 11')             # September-October

In [None]:
plt.hist(longsummer.suicide_rate, bins='auto', alpha=.5, label='April-August')
plt.hist(winter.suicide_rate, bins='auto', alpha=.5, label='November-March')
# plt.hist(fall.suicide_rate, bins='auto', alpha=.5, label='September-October')
plt.legend()

In [None]:
stats.ttest_ind(winter.suicide_rate, longsummer.suicide_rate, equal_var=False)

## Let's cancel out year effect

In [None]:
# Average over the years so that each observation is a (month, state) pair:
# group by month and state, then take the mean of every numeric column.
suicides_by_month = suicides.groupby(['Month', 'State']).mean(numeric_only=True).reset_index() \
                .drop(columns=['Year', 'Year Code', 'min_t_diff', 'max_t_diff', 'heat_index_diff'])
suicides_by_month

Paired t-test: for each state, compare the mean rate over April-August with the mean rate over November-March.

In [None]:
summer_by_month = suicides_by_month.query('3 < Month < 9')
winter_by_month = suicides_by_month.query('Month < 4 or Month > 10')

In [None]:
summer_by_month.groupby('State')['suicide_rate'].mean()

In [None]:
stats.ttest_rel(summer_by_month.groupby('State')['suicide_rate'].mean(),
                winter_by_month.groupby('State')['suicide_rate'].mean())
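
A paired test only makes sense if the two inputs list the same states in the same order. A minimal sketch of one way to guarantee that is to pivot the seasons into columns so each row is one state. The numbers are invented and the `season` column is a hypothetical addition:

```python
import pandas as pd
import scipy.stats as stats

# Invented per-state seasonal mean rates.
by_season = pd.DataFrame({
    'State': ['AL', 'AL', 'CA', 'CA', 'CO', 'CO', 'NY', 'NY'],
    'season': ['summer', 'winter'] * 4,
    'suicide_rate': [1.2, 1.0, 0.9, 0.8, 1.5, 1.4, 0.7, 0.65],
})

# One row per state, one column per season; dropna removes states missing a season.
wide = by_season.pivot_table(index='State', columns='season',
                             values='suicide_rate', aggfunc='mean').dropna()

# Each state's summer value is paired with the same state's winter value.
result = stats.ttest_rel(wide['summer'], wide['winter'])
print(wide)
print(result.statistic, result.pvalue)
```

A positive statistic here means the summer values are higher on average than the paired winter values.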

In [None]:
# Cohen's d
def cohen_d(x,y):
    nx = len(x)
    ny = len(y)
    dof = nx + ny - 2
    return (np.mean(x) - np.mean(y)) / np.sqrt(((nx-1)*np.std(x, ddof=1) ** 2 + (ny-1)*np.std(y, ddof=1) ** 2) / dof)

cohen_d(winter.suicide_rate, longsummer.suicide_rate)

No. In fact, spring and summer (April-August) have higher rates than winter!

In [None]:
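# histplot replaces the deprecated distplot (requires seaborn >= 0.11)
sns.histplot(suicides.query('3 < Month < 9').avg_max_heat_index.dropna(), kde=True, stat='density')

In [None]:
sns.histplot(suicides.query('3 < Month < 9').heat_index_diff.dropna(), kde=True, stat='density')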

In [None]:
q_75 = np.quantile(suicides.query('3 < Month < 9').avg_max_heat_index.dropna(), .75)
q_75

In [None]:
q_95 = np.quantile(suicides.query('3 < Month < 9').avg_max_heat_index.dropna(), .95)
q_95

In [None]:
low_heat = longsummer.query(f'avg_max_heat_index < {q_75}')
high_heat = longsummer.query(f'avg_max_heat_index > {q_95}')
stats.ttest_ind(low_heat.suicide_rate, high_heat.suicide_rate, equal_var=False)

In [None]:
q_diff_75 = np.quantile(suicides.query('3 < Month < 9').heat_index_diff.dropna(), .75)
q_diff_75

In [None]:
q_diff_95 = np.quantile(suicides.query('3 < Month < 9').heat_index_diff.dropna(), .95)
q_diff_95

In [None]:
suicides.query('3 < Month < 9').groupby('Year').avg_max_heat_index.hist(alpha=.25, bins='auto')

In [None]:
suicides.query('3 < Month < 9').groupby('Year').heat_index_diff.hist(alpha=.25, bins='auto')

In [None]:
# histplot replaces the deprecated distplot (requires seaborn >= 0.11)
sns.histplot(suicides.query('3 < Month < 9').heat_index_diff.dropna(), kde=True, stat='density')

In [None]:
np.quantile(suicides.query('3 < Month < 9').heat_index_diff.dropna(),.25)

# Download suicide data - county & month level
* Database: Multiple Cause of Death, 1999-2017 (D77)
* Group results by State, County, Year, Month
* Underlying Cause of Death: UCD-ICD10 Code X60-X84


In [None]:
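# One DataFrame per year; a separate name keeps the state-level `suicides` DataFrame intact
suicides_county = {}
suicides_county[1999] = pd.read_csv('data/suicides_by_month/suicides_1999.txt', sep='\t')
suicides_county[1999].head()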


# Heat wave days

In [None]:
df_hw = pd.read_csv('data/heat_wave_days_1981.txt', sep='\t')

In [None]:
df_hw.head()

# CDC API (too hard to use)

In [None]:
# "https://wonder.cdc.gov/controller/datarequest/[database ID]"
base_url = 'https://wonder.cdc.gov/controller/datarequest/'
# D60 = North America Land Data Assimilation System (NLDAS) Daily Air Temperatures and Heat Index (1979-2011)
db_id = 'D60'
params = { 'accept_datause_restrictions' : 'true' }
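
For reference, a hedged sketch of how such a request might be assembled. It assumes the WONDER API expects a POST whose form fields are `request_xml` (the query as an XML parameter document) and `accept_datause_restrictions`. The XML here is an empty placeholder, not a working query, and whether database D60 accepts API requests is not verified. The request is built but deliberately not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

base_url = 'https://wonder.cdc.gov/controller/datarequest/'
db_id = 'D60'  # NLDAS daily air temperatures and heat index

# Placeholder only (assumed root element): a real query needs the full list
# of parameters for the chosen database.
request_xml = '<request-parameters></request-parameters>'

# Form fields assumed by this sketch.
payload = {
    'request_xml': request_xml,
    'accept_datause_restrictions': 'true',
}

# Build the POST request without sending it.
req = Request(base_url + db_id,
              data=urlencode(payload).encode('utf-8'),
              method='POST')
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen`, after replacing the placeholder XML with a complete parameter document.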