# Drug Overdose Deaths - a Descriptive and Inferential Analysis

From 2012 to 2018, the state of Connecticut collected and compiled a [sample](https://www.kaggle.com/ruchi798/drug-overdose-deaths) of approximately 5,100 deaths by overdose. Over the course of this period, overdoses increased noticeably, with heroin overdoses alone spiking in 2017 at 3x the rate seen in 2012.

*What is the point of this project?*: to draw our own observations from a statistical analysis of this dataset, in the hope of answering the questions below:

- **What is the distribution of ages, sexes, and races? Who is the most affected?**
- **What is the distribution of places of residence?**
- **Has the magnitude of overdose deaths changed over time? If so, how much?**
- **How has overdose frequency by a given drug changed over time?**
- **Where do the majority of overdose deaths occur?**

Part III is intended as an entry-level illustration of the basic principles of hypothesis testing, using this dataset for demonstrative examples. 

Part IV will summarize and offer suggestions for both future research and public health efforts.

The project is outlined as follows:

- I. Cleaning Data
- II. Descriptive Analysis
    - Demographic Information
    - Drug Information
- III. Inferential Analysis
    - Review of Hypothesis Testing
    - Z-test
    - T-test 
- IV. Summary and Conclusion


Without further ado, we will read in the data and begin searching for areas of interest.

## I. Cleaning Data

In [None]:
#importing modules

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
#reading in dataset
drugsdata = pd.read_csv("drug_OD.csv")

In [None]:
drugsdata.head()

Looks like the age column is stored as a float, which might make analysis more finicky moving forward. Before we continue, we should convert it to an integer type. We'll need to remove the NaN (null) values first. Since there are only three, dropping them is the simplest solution.

In [None]:
drugsdata.dropna(subset = ['Age'], inplace=True)

In [None]:
#and now we can convert the dtype
drugsdata = drugsdata.astype({'Age':'int'})
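
As a minimal sketch of why the nulls have to go first, here is the same drop-then-convert pattern on a tiny made-up frame (the `toy` values are invented for illustration and are not taken from the dataset). A float column that still holds NaN cannot be cast to a plain integer type, so we count the nulls, drop them, and only then convert.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the ages are invented for illustration
toy = pd.DataFrame({'Age': [34.0, np.nan, 51.0, 27.0]})

# How many rows would we lose by dropping the nulls?
n_missing = toy['Age'].isna().sum()

# Drop the null rows first, then the cast to int is safe
toy_clean = toy.dropna(subset=['Age']).astype({'Age': 'int'})

print(n_missing)
print(toy_clean['Age'].tolist())
```

Checking the null count before dropping is a cheap way to confirm that the rows being discarded really are a negligible share of the data.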

In [None]:
#we should also do the same to our Fentanyl columns, which are stored differently than the other drug columns
#despite containing only numerics
drugsdata = drugsdata.astype({'Fentanyl_Analogue':'int'})

In [None]:
#let's also convert our Date column to datetime format. We'll also drop a single null row.  
drugsdata.dropna(subset = ['Date'], inplace=True)
drugsdata['Date'] = pd.to_datetime(drugsdata['Date'])

In [None]:
#while we're here, let's create new columns for month, year, and day
drugsdata['month'] = pd.DatetimeIndex(drugsdata['Date']).month
drugsdata['year'] = pd.DatetimeIndex(drugsdata['Date']).year
drugsdata['day'] = pd.DatetimeIndex(drugsdata['Date']).day

In [None]:
#converting these new columns to integer form again
drugsdata = drugsdata.astype({'month':'int'})
drugsdata = drugsdata.astype({'year':'int'})
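
For reference, the same date parts can also be pulled with the `.dt` accessor on a datetime column, which returns integer values directly. The sketch below uses a small invented frame; the date strings and their `%m/%d/%Y` format are assumptions for illustration and may not match the raw file.

```python
import pandas as pd

# Toy dates, invented for illustration; the format string is an assumption
toy = pd.DataFrame({'Date': ['06/28/2014', '03/21/2013', '12/05/2018']})
toy['Date'] = pd.to_datetime(toy['Date'], format='%m/%d/%Y')

# The .dt accessor exposes the components of each timestamp
toy['month'] = toy['Date'].dt.month
toy['year'] = toy['Date'].dt.year
toy['day'] = toy['Date'].dt.day

print(toy)
```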

In [None]:
#a handful of the values in the manner of death column are not capitalized consistently with the others
drugsdata['MannerofDeath'] = drugsdata['MannerofDeath'].str.upper()

In [None]:
#looking at final column types
drugsdata.info()

In [None]:
#grabbing general stats for our set
drugsdata.describe()

Essentially, our dataset presents basic demographic information, information relevant to the nature of the overdose death, and a binarized set of columns for the relevant substances. It should be noted that several columns contain considerably more null values than others, and we'll handle those moving forward as is appropriate to the questions we start to formulate.

Overall, the dataset is looking much more presentable now, so let's begin drawing up a more insightful descriptive analysis!

## II. Descriptive Analysis

### Demographic Information

We'll start by breaking down sex, age, and race.

In [None]:
#one row in our dataset contains "Unknown" for sex, so given it's a single instance, we'll simply drop it
drugsdata = drugsdata[drugsdata.Sex != 'Unknown']

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(8,8))
#passing the column as x= keeps this working on newer seaborn releases
sns.countplot(x=drugsdata['Sex'])
plt.title('Distribution of Sex', fontsize = 18)
plt.xlabel('Sexes', fontsize = 15)
plt.ylabel('Overdose Frequency', fontsize = 15)

Within this dataset, males account for 2.85x as many overdose deaths as females.
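
A ratio like this can be read straight off `value_counts()`. The sketch below uses a tiny invented `toy` frame, so its ratio is illustrative only and is not the figure reported above.

```python
import pandas as pd

# Toy data, invented for illustration: 6 males and 2 females
toy = pd.DataFrame({'Sex': ['Male'] * 6 + ['Female'] * 2})

counts = toy['Sex'].value_counts()
ratio = counts['Male'] / counts['Female']

print(counts)
print(ratio)
```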

Let's see if there are any differences in sex trends over time.

In [None]:
#grabbing counts for each
male = drugsdata[drugsdata['Sex']=='Male']["year"].value_counts().sort_index()
fem = drugsdata[drugsdata['Sex']=='Female']["year"].value_counts().sort_index()


dfsex = pd.concat([male, fem], axis=1)
dfsex.columns=['Male', 'Female']

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
dfsex.plot(ax=ax)
plt.title('Drug Overdose Deaths by Sex Over Time \n 2012-2018', fontsize = 20)
ax.set_xlabel('Year', fontsize = 17)
ax.set_ylabel('Overdose Deaths Frequency', fontsize = 17)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.legend(prop={"size":15})

We can clearly see that overdose deaths among men are considerably higher.

- Male deaths increased **3X** from 2012 to 2017.
- Female deaths peaked in 2017, an increase of **1.5X** from 2012.
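
Fold changes like these come from dividing one year's row of counts by another's. A minimal sketch follows, using invented yearly counts in a frame shaped like `dfsex` (the numbers are placeholders chosen to give round ratios, not the real counts).

```python
import pandas as pd

# Invented yearly counts with the same layout as dfsex (index = year)
yearly = pd.DataFrame(
    {'Male': [100, 180, 300], 'Female': [40, 50, 60]},
    index=[2012, 2015, 2017],
)

# Divide the 2017 row by the 2012 row to get the fold change per column
fold_change = yearly.loc[2017] / yearly.loc[2012]

print(fold_change)
```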

We will pivot to age data now.

In [None]:
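#pd.cut bins are right-inclusive by default, so a first edge of 20 keeps age 21 out of the '<21' group
#(by the same rule, '70+' here means older than 70)
bins = [0, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, np.inf]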
range_names = ['<21', '21-25', '26-30', '31-35', '36-40', '41-45', '46-50', '51-55', '56-60', '61-65', '66-70', '70+']

drugsdata['AgeRanges'] = pd.cut(drugsdata['Age'], bins, labels = range_names)
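
One detail worth checking when binning ages: `pd.cut` treats each bin as right-inclusive by default, so a value sitting exactly on an edge belongs to the bin that edge closes. The short sketch below, on a handful of made-up ages with a shortened set of bins, shows where the boundary values land.

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on or next to the bin edges
ages = pd.Series([20, 21, 25, 26, 70, 71])

# Shortened, illustrative bins: (0, 20], (20, 25], (25, 30], (30, inf]
bins = [0, 20, 25, 30, np.inf]
labels = ['<21', '21-25', '26-30', '31+']

binned = pd.cut(ages, bins, labels=labels)

print(pd.DataFrame({'age': ages, 'range': binned}))
```

If the first edge were 21 instead of 20, age 21 would fall under the `<21` label, which is why the edge and the label need to be chosen together.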

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(16,8))
sns.countplot(x=drugsdata['AgeRanges'], palette = 'Greens')
plt.title('Distribution of Age', fontsize = 20)
plt.xlabel('Age Ranges', fontsize = 17)
plt.ylabel('Overdose Frequency', fontsize = 17)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)

- Age range with highest count: **46-50**
- Age range with lowest count: **70+**

Taken from earlier stats dataframe:
- Mean age: **42**
- Minimum age: **14**
- Maximum age: **87**

Other observations of note: overdose deaths begin to rise quickly in the first half of the 20s, and likewise decrease quickly around age 60.

In [None]:
fig = plt.figure(figsize = (12, 12))
ax = drugsdata['Race'].value_counts().plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Distribution of Race", fontsize = 18)
#on a horizontal bar chart the counts run along x and the categories along y
plt.xlabel("Overdose Death Frequency", fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.ylabel("Races", fontsize = 15)

- **White** individuals account for about **78%** of the total counts.
- The next two most highly represented groups, **Hispanic, White** individuals and **Black** individuals, together account for about **19%** of the total counts.
- **Asian, Chinese, Native American, Hawaiian, and "unknown" or "other" races** account for a marginal number of cases. They are by far the least represented.
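
Percentage shares like the ones above can be computed with `value_counts(normalize=True)`. A small sketch on invented categories (the proportions here are placeholders, not the dataset's):

```python
import pandas as pd

# Invented category labels, for illustration only
toy = pd.Series(['White'] * 7 + ['Black'] * 2 + ['Asian'] * 1, name='Race')

# normalize=True returns proportions; scale to percentages and round
shares = toy.value_counts(normalize=True).mul(100).round(1)

print(shares)
```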

In [None]:
fig = plt.figure(figsize = (12, 12))
ax = drugsdata['ResidenceState'].value_counts().plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Distribution of State of Residence", fontsize = 20)
#counts run along x, states along y
plt.xlabel("Overdose Frequency", fontsize = 18)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.ylabel("States", fontsize = 18)

- **Important note:** residence state information was only available for 3566 datapoints, so take these figures with a grain of salt.
- Unsurprisingly, available points are composed of **96%** Connecticut residents. 
- The next most frequently represented states are **New York** and **Massachusetts**.

Now let's look at the distribution across counties. For the sake of space, we'll focus on the 15 most heavily affected.

In [None]:
#grabbing the top 15
top15_county = drugsdata['ResidenceCounty'].value_counts()[0:15]

In [None]:
fig = plt.figure(figsize = (12, 12))
top15_county.plot(kind='barh', color = 'lightblue')
plt.gca().invert_yaxis()
plt.title("Distribution of County of Residence (Top 15)", fontsize = 20)
plt.xlabel("Overdose Frequency", fontsize = 18)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.ylabel("Counties", fontsize = 18)

- **Disclaimer**: only 4307 rows of the full dataset contain county information. The above graph represents 4222 of these rows. 
- **Hartford** accounts for about a quarter of all available datapoints, which is unsurprising given that the state capital is located in this county. 
- **Six** other counties, including **New Haven** and **Fairfield**, account for the majority of the remaining cases; the rest report only a handful of cases each.

With our essential demographics covered, we'll move on to visualizing drug information.

### Drug Information

#### Overdose Frequency Over Time

We'll begin by plotting overall overdose death distribution over time. 

In [None]:
fig, ax = plt.subplots(figsize=(9,9))
plot_data = drugsdata.groupby(pd.Grouper(key='Date', freq='M')).count()['Unnamed: 0']
ax = plot_data.plot(ax=ax)
plt.title("Frequency of Overdose Deaths \n 2012 - 2018", fontsize = 16)
plt.xlabel("Timeline", fontsize = 14)
plt.ylabel("Overdose Deaths", fontsize = 14)

That's quite the increase in deaths over the years! This could be attributed to either a legitimate increase in overdose deaths or a change in data collection processes. In a real-world application, this would need to be double-checked (if it hadn't been already), but for the sake of this exercise, we will assume this graphic reflects a genuine trend. 

In terms of our observations:

- Overdose deaths tend to spike in either direction every **three to six months**. 
- For comparison, 2018 appears to have experienced roughly **3x** as many deaths as 2012.

Let's plot distribution by month to see if there are any identifiable seasonal trends.

In [None]:
plt.figure(figsize=(13,8))
sns.countplot(x=drugsdata['month'], color = 'orange')
plt.title('Overdose Death Frequency by Month', fontsize = 18)
plt.xlabel('Month', fontsize = 15)
plt.ylabel('Overdose Death Frequency', fontsize = 15)

- This supports our general observation from the previous graph, which is that **overdose deaths appear to cycle** every few months, as opposed to following a strictly hot/cold seasonal trend.
- **November** shows the highest overall count, with **January** showing the lowest by a small margin.

Let's get a more granular look by breaking down cases by day of the month. 

In [None]:
plt.figure(figsize=(13,8))
sns.countplot(x=drugsdata['day'], color = 'salmon')
plt.title('Overdose Death Frequency by Day of the Month', fontsize = 18)
plt.xlabel('Day', fontsize = 15)
plt.ylabel('Overdose Death Frequency', fontsize = 15)

There appears to be a slight bump in cases represented in the first few days of the month, with a slight downwards trend as weeks go on.

Now we'll shift to visualizing the various drugs of choice involved across deaths.

#### Overdoses by Drugs of Choice

In [None]:
#pulling our counts by year
her = drugsdata[drugsdata['Heroin']==1]["year"].value_counts().sort_index()
morph_NH = drugsdata[drugsdata['Morphine_NotHeroin']==1]["year"].value_counts().sort_index()
coke = drugsdata[drugsdata['Cocaine']==1]["year"].value_counts().sort_index()
fent = drugsdata[drugsdata['Fentanyl']==1]["year"].value_counts().sort_index()
fentA = drugsdata[drugsdata['Fentanyl_Analogue']==1]["year"].value_counts().sort_index()
oxyC = drugsdata[drugsdata['Oxycodone']==1]["year"].value_counts().sort_index()
oxyM = drugsdata[drugsdata['Oxymorphone']==1]["year"].value_counts().sort_index()
eth = drugsdata[drugsdata['Ethanol']==1]["year"].value_counts().sort_index()
hydroC = drugsdata[drugsdata['Hydrocodone']==1]["year"].value_counts().sort_index()
benzo = drugsdata[drugsdata['Benzodiazepine']==1]["year"].value_counts().sort_index()
metha = drugsdata[drugsdata['Methadone']==1]["year"].value_counts().sort_index()
amph = drugsdata[drugsdata['Amphet']==1]["year"].value_counts().sort_index()
#note: assigning to fent here as well would overwrite the fentanyl counts pulled above
tram = drugsdata[drugsdata['Tramad']==1]["year"].value_counts().sort_index()
hydroM = drugsdata[drugsdata['Hydromorphone']==1]["year"].value_counts().sort_index()
other = drugsdata[drugsdata['Other']==1]["year"].value_counts().sort_index()

In [None]:
df = pd.concat([her, morph_NH, coke, fent, fentA, oxyC, oxyM, eth, hydroC, hydroM, benzo, metha, amph, tram, other], axis=1)
df.columns=['Heroin','Morphine, non-Heroin', 'Cocaine', 'Fentanyl', 'Fentanyl analogue', 'Oxycodone', 'Oxymorphone', 'Ethanol',
           'Hydrocodone', 'Hydromorphone', 'Benzodiazepine', 'Methadone', 'Amphetamine', 'Tramadol', 'Other']
ax=df.plot(kind='bar',figsize=(15,10),fontsize=15)
plt.title("Frequency Distribution of Drugs of Choice by Year", fontsize = 18)
plt.xticks(fontsize = 17, rotation=45)
plt.yticks(fontsize = 17)
plt.ylabel("Overdose Death Frequency", fontsize = 15)
plt.legend(prop={"size":15})
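
As a possible alternative to building one variable per drug, the yearly counts can be produced in a single step with `groupby`. This is only a sketch: it assumes the drug columns hold 0/1 integer flags, and the `toy` frame, `drug_cols`, and `by_year` names are invented for illustration.

```python
import pandas as pd

# Toy frame with two 0/1 flag columns, invented for illustration
toy = pd.DataFrame({
    'year':    [2012, 2012, 2013, 2013, 2013],
    'Heroin':  [1, 0, 1, 1, 0],
    'Cocaine': [0, 1, 1, 0, 0],
})

drug_cols = ['Heroin', 'Cocaine']

# Summing 0/1 flags within each year counts the cases involving each drug
by_year = toy.groupby('year')[drug_cols].sum()

print(by_year)
```

Besides being shorter, this avoids accidental reuse of a variable name across many near-identical lines. One difference from the `value_counts()` approach: a year with no cases for a drug shows up as 0 here rather than as a missing value.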

Let's get another perspective by plotting a line chart over time.

In [None]:
fig, ax = plt.subplots(figsize=(18,12))
df.plot(ax=ax)
plt.title('Timeseries of Drugs of Choice', fontsize = 20)
ax.set_xlabel('Year', fontsize = 17)
ax.set_ylabel('Overdose Deaths Frequency', fontsize = 17)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.legend(prop={"size":15})

Between these two charts, a few things are worth noting:

- **Heroin** is by far the most frequently involved drug in death-by-overdose cases. Its usage peaked in 2016 with just short of 500 cases, declining to just short of 400 by 2018. 
- **Benzodiazepine** and **cocaine** are the next most frequent, and likewise took a noticeable upswing over the given timeframe. 
- Tracking of **fentanyl analogues** began in 2017, with an upward trend. This substance outstrips usage of **fentanyl proper**, both comparatively before the analogues' introduction and demonstrably after.
- **Oxycodone** and **Oxymorphone**, both highly potent prescription painkillers, show relatively little change over time. The latter especially accounts for only a few cases per year. 
- Cases involving **ethanol** increased substantially over the years, with 2017 reporting roughly **3x** as many cases as 2012. 


#### Overdose Frequency by Location

In [None]:
fig = plt.figure(figsize = (12, 12))
ax = drugsdata['Location'].value_counts().plot(kind='barh', color = 'darkred')
plt.gca().invert_yaxis()
plt.title("Distribution of Locations of Death", fontsize = 18)
#counts run along x, locations along y
plt.xlabel("Overdose Death Frequency", fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.ylabel("Locations", fontsize = 15)

- **Note:** this plot represents the 5081 rows with available location data.
- Many overdose deaths occur at home, accounting for approximately 53% of all deaths. This is likely because home is where most people feel safe and least afraid of interference when consuming an illicit substance.
- Hospitals are the next most frequent (~31%), followed by "other", which represents a considerable variety of outcomes.

Because "other" accounts for 323 unique string values, it would be impractical to plot every point, especially as many of these values account for only a handful of cases at most. Additionally, many of the strings are not uniformly coded, and plotting them as-is could lead to misjudged conclusions. Instead, I will leave a list of all given outcomes for a holistic sense of the possibilities.

In [None]:
drugsdata['LocationifOther'].unique()

At a glance, this list appears to primarily consist of hotel rooms and others' homes, followed by random buildings and parking lots.

#### Part II Conclusion

We have covered the bases of our most relevant datapoints, retrieving essential descriptive observations so that we might form a basis for more pointed future analytical questions. 

With that in mind, we're going to shift to the review and application of basic inferential statistical techniques within the context of this dataset. 

## III. Inferential Analysis

### A Review of Hypothesis Testing

Before we continue deriving insights from our data, we will briefly review some fundamental inferential concepts. 


#### What is Hypothesis Testing?
At the core of the scientific method is the concept known as *hypothesis testing*, or the formulation of a research question that is testable with the available data. Some examples include:

- Eating spinach during childhood helps one to grow stronger.
- Businesses in large cities are more likely to experience theft than those in small towns.
- Students who score at least 26 on the ACT are more likely to go to college than those who scored lower. 

A hypothesis test has two principal components:

- Null hypothesis $H_0$, or the default statement that there is *no* relationship between two phenomena or groups.
- Alternative hypothesis $H_a$, or the *alternative* to the null, stating that there *is* a relationship between phenomena or groups. This is the statement the researcher typically wants to support.

Statistical inference tests assume the null to be true, requiring contrary evidence in order to disprove it. Ultimately, you either *reject* the null or you *retain* it. 

#### Significance Levels and Directional Tests

Once you have determined your hypothesis, you need to choose an appropriate *significance level* (α), a probability threshold that determines whether you reject or retain the null. The test itself then produces a *p-value*, or probability value: the likelihood of obtaining a test result at least as extreme as the one actually observed, assuming the null is true. Thus, if your p-value is *lower* than the significance level, you can reject the null in favor of the alternative. 

We should also mention the term *critical value*, which represents the point beyond which it is safe to reject the null.

Significance levels are conventionally drawn from a small set of values, for example α = 0.05. Here, if our test were to return a p-value of 0.03, we would reject the null!

Another important aspect to consider is whether your test is going to be *one-tailed* or *two-tailed*. The "tail" refers to the end of the distribution of your selected test statistic. Some tests can be used with either mode, and you will need to select beforehand which is the most appropriate. Generally speaking:

- One-tailed tests are better when you only want to determine the difference between groups in a particular direction, for example determining whether group X scored higher on the GRE on average than group Y.
- Two-tailed tests are better if you are interested in determining a group difference in *either* direction, for example if you wanted to know whether group X scored higher *or* lower on the GRE on average than group Y. 

If you have good reason to suspect a difference in one particular direction, then one-tailed is an appropriate choice, as it will offer more statistical power. Selecting a two-tailed option is reasonable if there is room for doubt. 
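
To make the one-tailed versus two-tailed distinction concrete, here is a small sketch using the standard normal distribution from `scipy.stats`. The test statistic `z = -1.8` is a made-up value, not something computed from our dataset.

```python
from scipy import stats

alpha = 0.05
z = -1.8  # hypothetical test statistic, chosen for illustration

# One-tailed (lower tail): probability of a value at least this low
p_lower = stats.norm.cdf(z)

# Two-tailed: probability of a value at least this far from 0 in either direction
p_two = 2 * stats.norm.sf(abs(z))

print('one-tailed p = %.4f, reject null: %s' % (p_lower, p_lower < alpha))
print('two-tailed p = %.4f, reject null: %s' % (p_two, p_two < alpha))
```

For the same statistic, the two-tailed p-value is twice the one-tailed one, which is the sense in which a one-tailed test offers more power when the direction is known in advance.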

We're almost ready to move on to some concrete examples, but first, it is appropriate to briefly mention the concepts of *type I* and *type II* errors.

#### Type I and Type II Errors

Type I errors are defined by the *rejection* of a *true* null, while type II errors are defined by the *failure to reject* a *false* null. For example, let's say your null hypothesis stated that those who live in disadvantaged neighborhoods have the same access to fresh produce as those who live in middle-class neighborhoods:

- Type I: You conclude that those who live in disadvantaged neighborhoods do *not* have the same access to fresh produce as those in middle-class neighborhoods, despite the null being *true*.
- Type II: You conclude that those who live in disadvantaged neighborhoods *do* have the same access to fresh produce as those in middle-class neighborhoods, despite the null being *false*.

The probability of a type I error is set directly by the significance level you choose, typically 0.01, 0.05, or 0.10. The likelihood of a type II error can be reduced by increasing the sample size.
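
The link between the significance level and type I errors can be illustrated with a quick simulation. This is only a sketch on simulated data (the seed, sample size, and number of trials are arbitrary choices): every sample is drawn from a population where the null is true, so each rejection is a type I error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
alpha = 0.05
n_trials, n = 2000, 50

# The null is true by construction: each sample comes from N(0, 1)
# and we test whether the mean equals 0
samples = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n))

# Two-tailed one-sample z-test for every simulated sample (sigma = 1 is known)
z = samples.mean(axis=1) / (1.0 / np.sqrt(n))
p = 2 * stats.norm.sf(np.abs(z))

# Share of trials in which a true null was rejected
type1_rate = np.mean(p < alpha)
print('Observed type I error rate: %.3f' % type1_rate)
```

The observed rate should land close to the chosen α, which is what it means for the significance level to control the type I error rate.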

We'll sum up by listing the necessary steps to hypothesis testing.

#### Hypothesis Testing Checklist

1. Define our null and alternative hypotheses. 
2. Define the significance level. 
3. Select your statistical test. 
4. Calculate the p-value/critical value. 
5. Test, and reject OR retain the null. 

Now we're ready to dive into using our inferential methods! 

### One-Sample Z-Test

One-sample z-tests are used to determine whether the mean of a given population is greater than, less than, or equal to a given value.

Let's say that an up-and-coming political candidate is running for office in Connecticut. **They assert that the average age of death by drug overdose in the state is below the overall average age of death in the country.** You decide to fact-check this statement by downloading a dataset consisting of a sample of approximately 5100 (n) death-by-overdose cases in Connecticut. You learn that the **sample mean** ($\bar{x}$) age of death is **42**, and later you determine the mean (μ) age of death for the general US population to be [**79**](https://www.cdc.gov/nchs/fastats/life-expectancy.htm) with a standard deviation ($\sigma$) of [**15**](https://www.nber.org/papers/w14093) years. 

While intuition immediately agrees with this statement, let's confirm it with a z-test by walking through the steps outlined above. 

In [None]:
import scipy.stats as stats

#### Step 1 - define null and alternative hypotheses

- Null $H_0$ = The average age of death in our overdose sample is *not* lower than that of the general US population.
    - μ ≥ 79
- Alternative $H_a$ = The average age of death in our overdose sample *is* lower than that of the general US population.
    - μ < 79

#### Step 2 - determine significance level

We will select  $\alpha$ = 0.05. We will also choose a one-tailed test since we're pretty certain the result will go in one direction.

#### Step 3 - selecting our test statistic

We know that the one-sample z-test is appropriate since we know the sample mean and the population standard deviation, and there is a sufficient number of items in our sample.

#### Step 4 - calculate critical value

This form of test requires that we calculate a critical value. It's finally time to return to some coding:

In [None]:
#lower-tail critical value for a one-tailed test at alpha = 0.05
Zcritical = stats.norm.ppf(0.05)
print('Critical value is %3.3f' %Zcritical)

Thus, our critical value is approximately -1.645. Because we chose a one-tailed (lower-tail) test, the rejection region is everything at or below this value; a test statistic above it indicates that we cannot reject the null. 

In other words, in order to reject our null hypothesis, our test result will need to be *less than or equal to* -1.645.

#### Step 5 - calculate test result and determine veracity of null hypothesis

We now have the data necessary to run our one-sample z-test. 

In [None]:
XAvg = 42
mu = 79
sigma = 15
n = 5100

Z = (XAvg - mu)/(sigma/np.sqrt(n))
print('Z value is %2.5f' %Z)

Since our Z value falls far beyond the critical value, deep in the rejection region, we can safely reject the null hypothesis and conclude that our sample of death-by-overdose cases has a lower average age of death than the general population! 
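
The calculation above can also be wrapped into a small helper that returns the p-value alongside the statistic. `one_sample_z` is a hypothetical name introduced here for illustration (it is not a SciPy function), and it assumes a lower-tail alternative as in Step 1.

```python
import numpy as np
from scipy import stats

def one_sample_z(x_bar, mu, sigma, n):
    """Return the z statistic and lower-tail p-value for H_a: mean < mu."""
    z = (x_bar - mu) / (sigma / np.sqrt(n))
    p = stats.norm.cdf(z)  # probability of a z value at least this low
    return z, p

# Same inputs as the cell above
z, p = one_sample_z(42, 79, 15, 5100)
print('Z value is %.3f, p-value is %.5f' % (z, p))
```

With a statistic this extreme, the p-value is numerically indistinguishable from zero, which agrees with the critical-value comparison.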

### Two-Sample T-Test 

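While the z-test is well suited for larger samples with a known population standard deviation, the t-test is more appropriate when that standard deviation is unknown, especially for small samples (a common rule of thumb is n = 30 or fewer). 

Our new candidate, in a recent debate, asserted that **white individuals who died of overdose had a lower mean age than those who were black**.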

Again intrigued, you decide to practice your t-test skills to confirm this statement for yourself. 

#### Step 1 - define null and alternative hypotheses

- Null $H_0$ = White individuals who died of overdose have *not* died at a lower mean age than black individuals.
- Alternative $H_a$ = White individuals who died of overdose *have* died at a lower mean age than black individuals.

With this in mind, the next step is to create a custom dataframe of a random sampling of datapoints for the two races.

In [None]:
data = {'White':[33, 67, 24, 39, 37, 34, 26, 53, 55, 32, 
                 46, 58, 51, 35, 37, 38, 35, 57, 50, 57, 
                 46, 43, 55, 24, 27, 60, 36, 39, 33, 41], 
        'Black':[47, 53, 39, 49, 66, 53, 55, 45, 56, 23, 
                 60, 40, 41, 43, 63, 44, 52, 51, 20, 54, 
                 51, 36, 48, 43, 61, 25, 58, 65, 49, 50]}

df_wb = pd.DataFrame(data) 

#### Step 2 - determine significance level

With our samples established, we're almost ready to dive into the test. We will again use 𝛼 = 0.05. 

#### Step 3 - selecting our test statistic

The t-test is appropriate because we have two small samples and no known population standard deviation. 

The t-test will automatically calculate p-value, so we will combine steps 4 and 5. 

#### Step 4 & 5 - calculate p-value, test result and determine veracity of null hypothesis

In [None]:
t_statistic, p_value  =  stats.ttest_ind(df_wb.White,df_wb.Black)
print('tstat  %1.3f' % t_statistic)    
print('P Value %1.5f' % p_value)

Since the p-value (two-tailed, which is the `ttest_ind` default) is greater than the significance level, we retain the null hypothesis. In other words, in this test white individuals who have died of drug overdose do not appear to have done so at a significantly lower mean age than black individuals. Looks like our candidate struck out this time. 
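
One caveat is worth sketching. `stats.ttest_ind` reports a two-tailed p-value by default, while the alternative in Step 1 is directional. A lower-tail p-value can be obtained from the t distribution directly, as below. This is a hedged side calculation rather than part of the original walkthrough; `p_lower` and `dof` are names introduced for illustration, and the ages are the same sampled values as above.

```python
from scipy import stats

white = [33, 67, 24, 39, 37, 34, 26, 53, 55, 32,
         46, 58, 51, 35, 37, 38, 35, 57, 50, 57,
         46, 43, 55, 24, 27, 60, 36, 39, 33, 41]
black = [47, 53, 39, 49, 66, 53, 55, 45, 56, 23,
         60, 40, 41, 43, 63, 44, 52, 51, 20, 54,
         51, 36, 48, 43, 61, 25, 58, 65, 49, 50]

# Default call: pooled-variance t-test with a two-tailed p-value
t_stat, p_two = stats.ttest_ind(white, black)

# Lower-tail p-value for H_a: mean(white) < mean(black)
dof = len(white) + len(black) - 2
p_lower = stats.t.cdf(t_stat, df=dof)

print('tstat %1.3f' % t_stat)
print('two-tailed p %1.5f' % p_two)
print('one-tailed p %1.5f' % p_lower)
```

Because the t statistic is negative, the lower-tail p-value is half the two-tailed one. On these sampled ages that is enough to move it below 0.05, so the verdict on the candidate's claim is sensitive to which version of the test is used, and a larger sample would be needed to settle it.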

## IV. Summary and Conclusion

### Summary

In this project, we have compiled a collection of descriptive and inferential observations about drug overdose deaths in Connecticut. In part II, we learned:

- Demographics:
    - Sex: men are represented almost 3x as much as women. From 2012-2017, male deaths increased 3x, while female deaths increased 1.5x.
    - Age: overdose deaths rise quickly starting in the early 20s, and similarly decrease in the early 60s. Those within the 46-50 range experienced the most deaths. The mean age of death is 42, with 14 and 87 representing the extreme edges of age at death.
    - Race: overdoses are 78% white. Hispanic-white and black individuals account for 19%.
    - Residence states and counties: unsurprisingly, 96% of deaths in this dataset were represented by Connecticut residents. The next two most frequent are New York and Massachusetts. Hartford, New Haven, and Fairfield counties make up a majority of counties represented. 

- Drugs Information:
    - Distribution over time: 2018 experienced approximately 3x as many deaths as 2012. Deaths appear to spike in either direction every three to six months.
    - Months: distribution across months supports a cycling effect. 
    - Days of the month: slight increase in case frequency in the first few days of the month, with a slight downward trend moving forward.
    - Drugs of choice: heroin is by far the most frequently used, accounting for approximately a third of all overdose cases in 2016. Benzodiazepine and cocaine are the next most frequent. Fentanyl analogues were introduced in this dataset in 2017, and indicated an upward trend. Prescription drugs Oxycodone and Oxymorphone showed little change over time. Ethanol-involved cases increased roughly 3x from 2012-2017. 
    - Overdose locations: deaths at home accounted for about 53% of all cases. Hospitals were the next most common at 31%. "Other" accounted for about 550 cases, and consisted primarily of hotel rooms and others' homes.
    



In part III, we presented an entry-level introduction to the concept of hypothesis testing, covering the fundamental concepts of null and alternative hypotheses, significance levels and directional tests, and type I/type II errors. We practiced these concepts via the application of the z-test and t-test on our dataset, which indicated 1) that the mean age of death in our dataset was lower than that of the general population, and 2) that the mean age of death of white individuals is not meaningfully lower than that of black individuals.

### Conclusion

We have extracted enough information that local and state policy-makers could use our findings to inform their decisions regarding public health initiatives. For example, they could focus on increasing affordable drug rehab and AODA counseling services for the community, as well as offering educational programs for loved ones of those experiencing substance abuse. 

It is worth noting that the Connecticut Department of Public Health [reports](https://portal.ct.gov/DPH/State-Health-Planning/Healthy-CT-2020-Dashboards/52-INJ-Drug-Overdose) that unintentional deaths by overdose have continued to rise since the end point of this data set, and Fentanyl deaths - seen increasing in our earlier graphs - accounted for [1,200 cases in 2019 alone](https://ctmirror.org/2020/02/17/connecticut-drug-overdose-deaths-up-with-fentanyl-leading-fatalities/), even following the slight dip we saw in overall deaths in 2018.

In addition to public health efforts, we have formed the basis of more intensive research efforts. Some suggestions for future analytical directions:

- Delving into the background of these individuals' cases, documentation allowing, to extract more meaningful features to add to our dataset. For example, including history of family drug/physical/emotional abuse, experience with human trafficking, previous experience with mental health services, known comorbidities, highest education received, yearly income, etc.
- With the inclusion of more features, and a comparison set of those who avoided overdose, we could begin developing a machine learning model that can predict one's likelihood of overdosing.