# Dataset exploration

* [Annual mortality and causes by county, 1979-1988](https://www.cdc.gov/nchs/data_access/cmf.htm)
* [Compressed mortality info, 1968-2016](https://wonder.cdc.gov/controller/datarequest/D140)

[Rainfall and turnout](https://faculty.ucmerced.edu/thansford/Articles/The%20Republicans%20Should%20Pray%20for%20Rain%20-%20Weather,%20Turnour,%20and%20Voting%20in%20U.S.%20Presidential%20Elections.pdf)

[Higher temperatures increase suicide rates in the United States and Mexico](https://web.stanford.edu/~mburke/papers/BurkeEtAl_NCC_2018.pdf)

As temperatures go up, suicide rates go up.
Countries and regions that went through heat waves showed significant differences in suicide rates during those periods.

Is there monthly data?

[WHO mortality data](https://www.who.int/healthinfo/statistics/mortality_rawdata/en/)

[CDC Multiple Mortality Cause files](https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm#Mortality_Multiple)

[Suicides and gun ownership](https://mason.gmu.edu/~atabarro/BriggsTabarrokFirearmsSuicide.pdf)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Suicides by Month, Year, State

`suicides.txt` from https://wonder.cdc.gov/wonder/help/mcd.html, restricted to suicides
* https://wonder.cdc.gov/mcd-icd10.html
* Click Agree
* https://wonder.cdc.gov/controller/datarequest/D77
* Group results by State, Year, Month
* Ages exclude "Not stated"; Hispanic Origin exclude "Not stated"
* Underlying cause of death: X60-X84 (Intentional self-harm)
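
The export steps above produce a text file that still has to be read into pandas. A minimal sketch of one way to load it, assuming the export is tab-delimited and has a `Notes` column that is filled in only on total and footer rows. The helper name `load_wonder_export` and the sample rows are made up for illustration and are not real CDC figures.

```python
import io
import pandas as pd

def load_wonder_export(path_or_buffer):
    """Read a tab-delimited export and keep only the data rows.

    Assumption: rows that carry text in the 'Notes' column are totals
    or footer notes rather than data.
    """
    df = pd.read_csv(path_or_buffer, sep='\t')
    if 'Notes' in df.columns:
        df = df[df['Notes'].isna()].drop(columns='Notes')
    return df.reset_index(drop=True)

# Made-up sample in the assumed layout, not real CDC figures.
sample = (
    "Notes\tState\tYear\tMonth\tDeaths\n"
    "\tAlabama\t1999\t1\t10\n"
    "\tAlabama\t1999\t2\t12\n"
    "Total\t\t\t\t22\n"
)
wonder_df = load_wonder_export(io.StringIO(sample))
print(wonder_df)
```

If the real file uses a different layout, the filter on `Notes` would need to be adapted.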

In [None]:
suicides = pd.read_csv('suicides_heat.csv')
suicides.describe()

# Exploratory Data Analysis

In [None]:
sns.barplot(x='State', y='Crude Rate', data=suicides)

In [None]:
sns.scatterplot(x='avg_max_t', y='Crude Rate',  data=suicides.query('4 < Month < 10'), hue='Month', legend='full')

In [None]:
sns.scatterplot(x='avg_max_heat_index', 
                y='Crude Rate', data=suicides.query('4 < Month < 10'), hue='Month', legend='full')

In [None]:
sns.lmplot(x='avg_max_heat_index', 
                y='Crude Rate', data=suicides.query('4 < Month < 10'), legend='full')

In [None]:
sns.scatterplot(x='heat_index_diff', y='Crude Rate', data=suicides.query('4 < Month < 10'), hue='Month')

What if we compare observations where `heat_index_diff` > 5 against those where `heat_index_diff` < 5?
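
One way to make that comparison is to split on the threshold and run Welch's t-test on the two groups. The sketch below uses a synthetic stand-in DataFrame (`toy`, random values, not the real data) with the same column names as `suicides`. The threshold of 5 comes from the question above, and everything else is an assumption for illustration.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic stand-in for the notebook's data; values are random, not real.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'heat_index_diff': rng.normal(0, 6, size=200),
    'Crude Rate': rng.normal(1.0, 0.2, size=200),
})

threshold = 5  # taken from the question above
hot = toy.loc[toy['heat_index_diff'] > threshold, 'Crude Rate']
not_hot = toy.loc[toy['heat_index_diff'] <= threshold, 'Crude Rate']

# Welch's t-test (equal_var=False) does not assume equal group variances
t_stat, p_value = stats.ttest_ind(hot, not_hot, equal_var=False)
print(len(hot), len(not_hot), t_stat, p_value)
```

On the real data, `toy` would be replaced by `suicides.query('4 < Month < 10')`, after dropping rows with a missing `heat_index_diff`.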

In [None]:
sns.lmplot(x='heat_index_diff', y='Crude Rate', data=suicides.query('4 < Month < 10'))

In [None]:
sns.scatterplot(x='max_t_diff', y='Crude Rate', data=suicides.query('4 < Month < 10'), hue='Month')

In [None]:
sns.scatterplot(x='min_t_diff', y='Crude Rate', data=suicides.query('4 < Month < 10'), hue='Month')

There may be a time effect over the years!

In [None]:
sns.scatterplot(x='Month Code', y='Crude Rate', data=suicides[suicides.State == 'California'],hue='Month', legend='full')

There is a strong annual trend.

In [None]:
plt.style.use('ggplot')  # set the style before plotting so it applies to this figure
plt.plot(suicides.groupby('Year')['Deaths'].sum())  # sum only the Deaths column
plt.xlabel('Year')
plt.ylabel('Suicides')
plt.title('Suicides per Year in the US');

A chi-squared test tells us whether two categorical variables are independent.

Use it to understand occurrences of one outcome across many factors, for example a table of summed suicides by state and year:
```
                 99   | 00   | 01    | 02 | 03 ... 17
AL suicides      sum    sum    sum    sum   sum ...
CA 
CO
...
```
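
A minimal sketch of the chi-squared test on a state-by-year table like the one sketched above, using `scipy.stats.chi2_contingency`. The counts in `toy` are made up for illustration. On the real data the same pivot could be built from `suicides` with its `State`, `Year` and `Deaths` columns.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts, for illustration only.
toy = pd.DataFrame({
    'State': ['AL', 'AL', 'CA', 'CA', 'CO', 'CO'],
    'Year': [1999, 2000, 1999, 2000, 1999, 2000],
    'Deaths': [50, 55, 300, 310, 60, 70],
})

# Rows are states, columns are years, cells are summed deaths.
table = toy.pivot_table(index='State', columns='Year', values='Deaths', aggfunc='sum')

# Tests whether the distribution of deaths across years is independent of state
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(chi2, p_value, dof)
```

The test works on raw counts, so the table should hold `Deaths`, not `Crude Rate`.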

In summer months, treat heat as a categorical variable: low heat vs. high heat.

In [None]:
plt.style.use('ggplot')  # set the style before plotting so it applies to this figure
plt.plot(suicides.groupby('Year')['heat_index_diff'].mean())  # average only the column we plot
# plt.xlabel('Year')
# plt.ylabel('heat index diff')
# plt.title('heat index diff Year in the US');

## Hypothesis 1 - winter gets more suicides than summer

In [None]:
summer = suicides.query('5 < Month < 9')
winter = suicides.query('Month < 4 or Month > 10')
spring = suicides.query('3 < Month < 6')
longsummer = suicides.query('3 < Month < 9')
fall = suicides.query('8 < Month < 11')

In [None]:
plt.hist(longsummer['Crude Rate'], bins='auto', alpha=.5, label='April-August')
plt.hist(winter['Crude Rate'], bins='auto', alpha=.5, label='November-March')
# plt.hist(fall.suicide_rate, bins='auto', alpha=.5, label='September-October')
plt.legend()

In [None]:
stats.ttest_ind(winter['Crude Rate'], longsummer['Crude Rate'], equal_var=False)

In [None]:
# Cohen's d
def cohen_d(x,y):
    nx = len(x)
    ny = len(y)
    dof = nx + ny - 2
    return (np.mean(x) - np.mean(y)) / np.sqrt(((nx-1)*np.std(x, ddof=1) ** 2 + (ny-1)*np.std(y, ddof=1) ** 2) / dof)

cohen_d(winter['Crude Rate'], longsummer['Crude Rate'])

## Let's cancel out year effect

In [None]:
# mean out the years
# observations are states
# group by states mean of rate by month
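# numeric_only=True skips text columns, which recent pandas versions refuse to average
suicides_by_month = (
    suicides.groupby(['Month', 'State'])
    .mean(numeric_only=True)
    .reset_index()
    .drop(columns=['Year', 'Year Code', 'min_t_diff', 'max_t_diff'])
)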
suicides_by_month
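
Averaging over years is one way to remove the year effect. Another option, sketched here as an assumption and not something this notebook does, is to subtract each year's mean rate so that only within-year variation remains. The `toy` frame and the `rate_demeaned` column name are made up for illustration.

```python
import pandas as pd

# Made-up rates: every 2000 value is the 1999 value plus 1.
toy = pd.DataFrame({
    'State': ['AL', 'CA', 'AL', 'CA'],
    'Year': [1999, 1999, 2000, 2000],
    'Crude Rate': [1.0, 2.0, 2.0, 3.0],
})

# Subtract each year's mean rate; transform keeps the original row order.
year_mean = toy.groupby('Year')['Crude Rate'].transform('mean')
toy['rate_demeaned'] = toy['Crude Rate'] - year_mean
print(toy)
```

After demeaning, the two years in the toy data look identical, which is the point: the year-to-year shift is gone while the differences between states remain.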

Paired t-test, with each state as a pair. Run it on the mean of the monthly rates and again on their sum.

In [None]:
summer_by_month = suicides_by_month.query('3 < Month < 9')
winter_by_month = suicides_by_month.query('Month < 4 or Month > 10')

In [None]:
summer_by_month.groupby('State').mean()['Crude Rate']

In [None]:
stats.ttest_rel(summer_by_month.groupby('State').mean()['Crude Rate'], 
                winter_by_month.groupby('State').mean()['Crude Rate'])

In [None]:
cohen_d(summer_by_month.groupby('State').mean()['Crude Rate'], 
        winter_by_month.groupby('State').mean()['Crude Rate'])

In [None]:
stats.ttest_rel(summer_by_month.groupby('State').sum()['Crude Rate'], 
                winter_by_month.groupby('State').sum()['Crude Rate'])

In [None]:
cohen_d(summer_by_month.groupby('State').sum()['Crude Rate'], 
        winter_by_month.groupby('State').sum()['Crude Rate'])

Nope, in fact spring and summer have higher suicide rates than winter!
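
The `cohen_d` calls above pool the two standard deviations, which treats the samples as independent. For paired data, a commonly used alternative is the mean of the differences divided by the standard deviation of the differences. The sketch below is an assumption about what might be useful here, not part of the original analysis. The name `cohen_dz` and the sample values are made up.

```python
import numpy as np

def cohen_dz(x, y):
    """Effect size for paired samples: mean difference / SD of differences."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return diff.mean() / diff.std(ddof=1)

# Made-up paired values (one pair per state), for illustration only.
summer_toy = [1.2, 1.4, 1.1, 1.6]
winter_toy = [1.0, 1.1, 1.0, 1.2]
dz = cohen_dz(summer_toy, winter_toy)
print(dz)
```

The two inputs must be in the same state order, because the differences are taken position by position.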

In [None]:
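# distplot is deprecated in recent seaborn; histplot with a KDE overlay is the closest replacement
sns.histplot(suicides.query('3 < Month < 9').avg_max_heat_index.dropna(), kde=True, stat='density')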

In [None]:
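# distplot is deprecated in recent seaborn; histplot with a KDE overlay is the closest replacement
sns.histplot(suicides.query('3 < Month < 9').heat_index_diff.dropna(), kde=True, stat='density')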

In [None]:
q_75 = np.quantile(suicides.query('3 < Month < 9').avg_max_heat_index.dropna(), .75)
q_75

In [None]:
q_95 = np.quantile(suicides.query('3 < Month < 9').avg_max_heat_index.dropna(), .95)
q_95

In [None]:
q_05 = np.quantile(suicides.query('3 < Month < 9').avg_max_heat_index.dropna(), .05)
q_05

In [None]:
low_heat = longsummer.query(f'avg_max_heat_index < {q_75}')
high_heat = longsummer.query(f'avg_max_heat_index > {q_95}')
stats.ttest_ind(low_heat['Crude Rate'], high_heat['Crude Rate'], equal_var=False)

In [None]:
q_diff_75 = np.quantile(suicides.query('3 < Month < 9').heat_index_diff.dropna(), .75)
q_diff_75

In [None]:
q_diff_95 = np.quantile(suicides.query('3 < Month < 9').heat_index_diff.dropna(), .95)
q_diff_95

In [None]:
suicides.query('3 < Month < 9').groupby('Year').avg_max_heat_index.hist(alpha=.25, bins='auto')

In [None]:
suicides.query('3 < Month < 9').groupby('Year').heat_index_diff.hist(alpha=.25, bins='auto')

In [None]:
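# distplot is deprecated in recent seaborn; histplot with a KDE overlay is the closest replacement
sns.histplot(suicides.query('3 < Month < 9').heat_index_diff.dropna(), kde=True, stat='density')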

In [None]:
np.quantile(suicides.query('3 < Month < 9').heat_index_diff.dropna(),.25)

# SUMMER - LOW/MED/HIGH HEAT BY STATE


In [None]:
longsummer.query('State =="New York"').describe()

In [None]:
summer_by_month.query('State =="New York"')

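In [None]:
def heat_category(temp, low, high):
    """Label a heat index value relative to the low and high cutoffs."""
    # Note: NaN fails both comparisons and falls through to 'medium heat'
    if temp <= low:
        return 'low heat'
    if temp >= high:
        return 'high heat'
    return 'medium heat'

# .copy() avoids SettingWithCopyWarning, since summer_by_month is a slice of suicides_by_month
summer_by_month = summer_by_month.copy()
summer_by_month['heat_index_level'] = summer_by_month['avg_max_heat_index'].map(lambda x: heat_category(x, q_05, q_95))
summer_by_month

In [None]:
# heat_index_level must exist before grouping on it, so this cell runs after the one above
summer_by_month.groupby(['State','heat_index_level']).mean(numeric_only=True).reset_index()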

In [None]:
summer_by_month

In [None]:
# aggfunc='mean' avoids the warning recent pandas versions raise for aggfunc=np.mean
suicides_by_state_and_heat = pd.pivot_table(summer_by_month, index='State', values='Deaths', 
                                            columns='heat_index_level', aggfunc='mean')
suicides_by_state_and_heat

In [None]:
summer_by_month = summer_by_month.dropna()
summer_by_month.info()

In [None]:
summer_by_month.groupby(['State']).count()

In [None]:
suicides.groupby(['State','Year']).sum().reset_index()

In [None]:
from statsmodels.stats.anova import AnovaRM

anovrm_1 = AnovaRM(suicides.groupby(['State','Year']).sum().reset_index(), 'Deaths', 'State', within=['Year'], aggregate_func=np.mean)
res = anovrm_1.fit()

print(res)

In [None]:
east_99 = ['Maryland', 'Delaware', 'New Jersey', 'Rhode Island', 'West Virginia']

In [None]:
longsummer[longsummer.State.isin(east_99) & (longsummer.Year == 1999)]