# The explosion of drug abuse suicides in India

I built this quick notebook to look at how the top causes of suicide in India evolved between 2001 and 2012. The data shows that suicides attributed to drug abuse have skyrocketed. <br> Graphs were made using bokeh (http://bokeh.pydata.org/).

In [93]:
import pandas as pd, numpy as np

In [94]:
df=pd.read_csv('Suicides in India 2001-2012.csv')
df.head(3)

Unnamed: 0,State,Year,Type_code,Type,Gender,Age_group,Total
0,A & N Islands,2001,Causes,Illness (Aids/STD),Female,0-14,0
1,A & N Islands,2001,Causes,Bankruptcy or Sudden change in Economic,Female,0-14,0
2,A & N Islands,2001,Causes,Cancellation/Non-Settlement of Marriage,Female,0-14,0


In [95]:
df.shape

(237519, 7)

In [4]:
df.Type_code.value_counts()

Causes                  109200
Means_adopted            67200
Professional_Profile     49263
Education_Status          7296
Social_Status             4560
Name: Type_code, dtype: int64

I started by wondering how the top causes evolved between 2001 and 2012. Since the raw totals differ widely from one cause to another, I indexed each cause against its 2001 total, so that every cause starts at 1.0.

In [5]:
#how have causes evolved over the years? Looking at overall data.
causes=df.loc[df['Type_code']=='Causes']
causes_total=causes.drop(['Gender', 'Age_group'], axis=1)
#sum only the numeric Total column, so the remaining string columns are not aggregated
causes_by_y_state=causes_total.groupby(['State', 'Year', 'Type'])[['Total']].agg('sum')

In [6]:
causes_by_y_state.reset_index(inplace=True)
causes_by_y=causes_by_y_state.groupby(['Year', 'Type'])[['Total']].agg('sum')

In [7]:
causes_by_y.reset_index(inplace=True)

In [8]:
def index_cause_totals(total, cause, year, df):
    #not super efficient to keep recalculating the index values, but the dataset is 312 rows so my cpu will survive.
    #if no index base is available for the given year, use the next year, until an index base exists.
    last_year=df['Year'].max()
    while year<=last_year:
        indexer=df.loc[(df['Year']==year) & (df['Type']==cause), 'Total'].values
        if indexer.size==1:
            return float(total/indexer[0])
        year+=1
    #no index base found in any year, so stop instead of looping forever
    return np.nan

In [13]:
causes_by_y['indexed_total']=causes_by_y.apply(lambda x: index_cause_totals(x['Total'], x['Type'], 2001, causes_by_y), axis=1)

In [14]:
causes_by_y.head(3)

Unnamed: 0,Year,Type,Total,indexed_total
0,2001,Bankruptcy or Sudden change in Economic,2918,1.0
1,2001,Cancellation/Non-Settlement of Marriage,924,1.0
2,2001,Cancer,780,1.0
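
As a side note, the same indexing can be sketched without a row-by-row `apply`. The snippet below is a minimal illustration on an invented toy frame (the `toy` name and its numbers are assumptions, not the real data). It divides each total by the first available year's total for that cause, which matches the fallback behaviour of `index_cause_totals` whenever the base year is 2001 or the first later year with data.

```python
import pandas as pd

# Toy stand-in for causes_by_y; the numbers are invented for illustration.
toy = pd.DataFrame({
    'Year':  [2001, 2002, 2003, 2002, 2003],
    'Type':  ['A', 'A', 'A', 'B', 'B'],
    'Total': [100, 150, 200, 40, 60],
})

# Sort so that the first row of each cause is its earliest available year.
toy = toy.sort_values(['Type', 'Year'])

# Broadcast each cause's base-year total back onto all of its rows.
base = toy.groupby('Type')['Total'].transform('first')
toy['indexed_total'] = toy['Total'] / base

print(toy)
```

Cause `B` has no 2001 row in the toy data, so it ends up indexed against 2002 instead.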


In [15]:
#keeping the 10 largest causes of suicide, then excluding 2 catch-all categories with limited information
top_5=causes_by_y.groupby(['Type'])[['Total']].agg('sum').sort_values(by='Total', ascending=False).iloc[:10]
top_5_list=[x for x in top_5.index if x not in ['Causes Not known', 'Other Causes (Please Specity)']]
top_5_list


['Family Problems',
 'Other Prolonged Illness',
 'Insanity/Mental Illness',
 'Love Affairs',
 'Bankruptcy or Sudden change in Economic',
 'Poverty',
 'Dowry Dispute',
 'Drug Abuse/Addiction']
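
The selection step above can also be sketched with `nlargest`, which avoids the explicit sort. This is only an illustration on invented numbers; `toy`, `catch_all` and `top_causes` are hypothetical names, and the toy keeps the 3 largest causes rather than 10.

```python
import pandas as pd

# Toy stand-in for causes_by_y; the numbers are invented.
toy = pd.DataFrame({
    'Type':  ['Family Problems', 'Causes Not known', 'Poverty', 'Cancer',
              'Family Problems', 'Causes Not known', 'Poverty', 'Cancer'],
    'Year':  [2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002],
    'Total': [500, 400, 100, 20, 550, 420, 130, 25],
})

# Catch-all categories to exclude (spelling kept exactly as in the dataset).
catch_all = ['Causes Not known', 'Other Causes (Please Specity)']

# Sum over all years, keep the 3 largest causes, then drop the catch-alls.
totals = toy.groupby('Type')['Total'].sum()
top_causes = [c for c in totals.nlargest(3).index if c not in catch_all]

print(top_causes)
```

Because the catch-all categories are removed after the cut, the final list can be shorter than the cut-off, which is why the notebook's list has 8 entries rather than 10.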

In [16]:
causes_by_y_top_5=causes_by_y.loc[causes_by_y.Type.isin(top_5_list), ['Year', 'Type', 'indexed_total']]

I needed to look at this graphically. Sadly, suicides attributed to drug abuse seem to be increasing much faster than those from every other top cause. What's going on here?

In [88]:
from bokeh.plotting import figure, output_notebook, show
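#bokeh.charts ships with older bokeh releases; it was later split out into the separate bkcharts package
from bokeh.charts import Line, Bar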
from bokeh.layouts import row
from bokeh.models.widgets import Panel,Tabs
from bokeh.models import ranges
output_notebook(hide_banner=True)

In [86]:
#the graph is a bit too cluttered, if I have time to go back I should replace some of the lines below with an avg.
p=Line(causes_by_y_top_5, x='Year', y='indexed_total', title='Drug Abuses Sky-Rocket',ylabel='Suicides, indexed at 2001 values',
       color='Type', width=800)
p.toolbar_location=None
show(p)
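
One way to reduce the clutter mentioned in the comment above would be to collapse every cause except drug abuse into a single average line before plotting. The sketch below shows only the data preparation, on an invented toy frame; `toy`, `highlight`, `series` and `simplified` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for causes_by_y_top_5; the values are invented.
toy = pd.DataFrame({
    'Year': [2001, 2001, 2001, 2002, 2002, 2002],
    'Type': ['Drug Abuse/Addiction', 'Poverty', 'Love Affairs'] * 2,
    'indexed_total': [1.0, 1.0, 1.0, 2.0, 1.1, 0.9],
})

highlight = 'Drug Abuse/Addiction'

# Keep the highlighted cause's label and relabel every other cause.
toy['series'] = toy['Type'].where(toy['Type'] == highlight,
                                  'Average of other causes')

# Average the indexed totals within each (Year, series) pair.
simplified = (toy.groupby(['Year', 'series'], as_index=False)['indexed_total']
                 .mean())

print(simplified)
```

The resulting frame has two lines per year instead of eight, and could be passed to the plotting call with `color='series'`.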

Where is this increase in drug-abuse suicides coming from? Let's look at age and gender first.

In [24]:
#Get drug abuse data only. Break down by age, gender, and an interaction of the two
causes_sub=causes[['Year', 'Type', 'Gender', 'Age_group', 'Total']].loc[causes.Type == 'Drug Abuse/Addiction']
#sum only the numeric Total column in each breakdown
drug_by_gender_age=causes_sub.groupby(['Year', 'Gender', 'Age_group'])[['Total']].agg('sum').reset_index()
drug_by_gender=causes_sub.groupby(['Year', 'Gender'])[['Total']].agg('sum').reset_index()
drug_by_age=causes_sub.groupby(['Year', 'Age_group'])[['Total']].agg('sum').reset_index()

In [25]:
#a more generalizable function than the last indexer, with the same purpose
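def index_variable_totals(total_value, var_name, var_value, index_year, df):
    #if no index base is available for index_year, use the next year, until an index base exists.
    last_year=df['Year'].max()
    while index_year<=last_year:
        indexer=df.loc[(df['Year']==index_year) & (df[var_name]==var_value), 'Total'].values
        if indexer.size==1:
            return float(total_value/indexer[0])
        index_year+=1
    #no index base found in any year, so stop instead of looping forever
    return np.nan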

In [26]:
#index gender and age_group totals vs 2001 values.
for d,category in zip([drug_by_gender, drug_by_age], ['Gender', 'Age_group']):
    d['indexed_total']=d.apply(lambda x: index_variable_totals(x['Total'],category, x[category], 2001, d), axis=1)

It seems men account for the vast majority of drug-abuse suicides, and the number among men has grown at a shocking pace over the past few years.
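
To put a number on "vast majority", the share of each gender per year can be computed directly from the totals. The sketch below uses an invented toy frame in the same shape as `drug_by_gender`; `toy`, `yearly` and `male_share` are hypothetical names and the figures are not the real ones.

```python
import pandas as pd

# Toy stand-in for drug_by_gender; the numbers are invented.
toy = pd.DataFrame({
    'Year':   [2001, 2001, 2012, 2012],
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Total':  [20, 80, 30, 270],
})

# Total across both genders for each year, aligned with the original rows.
yearly = toy.groupby('Year')['Total'].transform('sum')
toy['share'] = toy['Total'] / yearly

# Male share of the yearly total, indexed by year.
male_share = toy.loc[toy['Gender'] == 'Male'].set_index('Year')['share']

print(male_share)
```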

In [91]:
#create plots
p1=Line(drug_by_gender, x='Year', y='indexed_total', title='Male drug abuses increase',ylabel='Suicides, indexed 2001 values',
       color='Gender', width=550)
p2=Bar(drug_by_gender.loc[drug_by_gender.Year == 2012], 'Gender', values='Total',color='Gender', 
       width=400,height=550, ylabel='Amount, 2012', title='Men constitute majority of drug abuses')
p3=Bar(drug_by_gender.loc[drug_by_gender.Year == 2001], 'Gender', values='Total',color='Gender', 
       width=400,height=550, ylabel='Amount, 2001', title='Men constitute majority of drug abuses')

#scale y axes for p2 p3 to ease comparison
p2.y_range=ranges.Range1d(0,4000)
p3.y_range=ranges.Range1d(0,4000)

#remove toolbars
p1.toolbar_location = p2.toolbar_location = p3.toolbar_location = None

#create tabs
tab1=Panel(child=p2, title='2012')
tab2=Panel(child=p3, title='2001')

layout=Tabs(tabs=[tab1, tab2])
chart=row(p1,layout)
show(chart)

The increase does not seem to be restricted to one age group. Contrary to what I expected, 30-44 year olds account for a much larger share of drug-abuse suicides than 15-29 year olds. This may be a hint that these aren't your run-of-the-mill recreational drug overdoses.

In [92]:
#removing 0-14, since total is so low
drug_by_age=drug_by_age.loc[drug_by_age.Age_group != '0-14']

p1=Line(drug_by_age, x='Year', y='indexed_total', title='Drug abuses increase across age groups',
        ylabel='Suicides, indexed 2001 values',
       color='Age_group', width=550)
p2=Bar(drug_by_age.loc[drug_by_age.Year==2012], 'Age_group', values='Total', color='Age_group',
       ylabel='Total, 2012', width=400, height=550, title='Non-indexed values')
p3=Bar(drug_by_age.loc[drug_by_age.Year==2001], 'Age_group', values='Total', color='Age_group',
       ylabel='Total, 2001', width=400, height=550, title='Non-indexed values')

#scaling the y axis labels to make it easy to compare 2012 to 2001.
p2.y_range=ranges.Range1d(0,1700)
p3.y_range=ranges.Range1d(0,1700)

#creating tabs
tab1=Panel(child=p2, title='2012')
tab2=Panel(child=p3, title='2001')

p1.toolbar_location=p2.toolbar_location=p3.toolbar_location=None
layout=Tabs(tabs=[tab1, tab2])
chart=row(p1,layout)
show(chart)
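
The shared y-axis limits used for the bar charts (4000 and 1700) are hard-coded. A small helper could derive the limit from the data instead, so both tabs stay comparable if the data changes. This is a sketch under assumptions: `shared_upper_bound` is a hypothetical helper and the toy numbers are invented.

```python
import math

import pandas as pd

# Toy stand-in for drug_by_age; the numbers are invented.
toy = pd.DataFrame({
    'Year':      [2001, 2001, 2012, 2012],
    'Age_group': ['15-29', '30-44', '15-29', '30-44'],
    'Total':     [310, 420, 1150, 1630],
})


def shared_upper_bound(df, years, step=100):
    """Largest Total across the given years, rounded up to a multiple of step."""
    peak = df.loc[df['Year'].isin(years), 'Total'].max()
    return int(math.ceil(peak / step) * step)


upper = shared_upper_bound(toy, [2001, 2012])
print(upper)
```

The result could then replace the literal in both assignments, for example `ranges.Range1d(0, upper)`.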

<b> My conclusion at this point is that drug-abuse suicides are increasing in all notable age groups, and especially among men, who account for the vast majority of them. </b>

I now want to understand where this is coming from geographically. Are all states seeing this increase? The problem is that state populations differ vastly. Once again, I'm going to index everything against 2001 values, and start by looking at the 2012 values.

In [126]:
import warnings ; warnings.simplefilter('ignore')

In [107]:
drug_by_state_y=causes.loc[causes.Type=='Drug Abuse/Addiction'].groupby(['State', 'Year'])[['Total']].agg('sum')

In [127]:
drug_by_state_y.reset_index(inplace=True)
drug_by_state_y['indexed_total']=drug_by_state_y.apply(lambda x: 
            index_variable_totals(x['Total'],'State', x['State'], 2001, drug_by_state_y), axis=1)

In [128]:
drug_by_state_y.loc[(drug_by_state_y.Year ==2012) & 
                    (drug_by_state_y.indexed_total<30) &
                    (drug_by_state_y.indexed_total>1)].sort_values('Total', ascending=False)

Unnamed: 0,level_0,index,State,Year,Total,indexed_total
251,251,251,Maharashtra,2012,1689,3.391566
239,239,239,Madhya Pradesh,2012,581,5.330275
215,215,215,Kerala,2012,275,3.313253
371,371,371,Tamil Nadu,2012,242,3.723077
83,83,83,Chhattisgarh,2012,240,2.891566
23,23,23,Andhra Pradesh,2012,164,6.56
347,347,347,Rajasthan,2012,152,1.277311
395,395,395,Uttar Pradesh,2012,124,1.631579
335,335,335,Punjab,2012,91,13.0
203,203,203,Karnataka,2012,77,1.040541
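
The filter above keeps states whose 2012 index is between 1 and 30. Presumably the upper bound is there to drop extreme ratios caused by tiny 2001 bases (a base of zero gives an infinite index), though the notebook does not say so. The sketch below, on an invented toy frame with hypothetical names, splits the rows into kept and dropped so the excluded states can be inspected rather than silently lost.

```python
import numpy as np
import pandas as pd

# Toy stand-in for drug_by_state_y; states and numbers are invented.
toy = pd.DataFrame({
    'State': ['A', 'B', 'C', 'D'],
    'Year': [2012, 2012, 2012, 2012],
    'Total': [500, 40, 12, 90],
    'indexed_total': [3.4, np.inf, 45.0, 0.8],
})

in_2012 = toy['Year'] == 2012
growing = toy['indexed_total'] > 1
plausible = toy['indexed_total'] < 30  # also drops inf from a zero base

# Rows that pass the filter, largest raw totals first.
kept = toy.loc[in_2012 & growing & plausible].sort_values('Total', ascending=False)

# Rows from the same year that the filter excludes.
dropped = toy.loc[in_2012 & ~(growing & plausible)]

print(kept)
print(dropped)
```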


The epidemic seems to be spread across many of the largest states. I used an online tool from indzara.com to quickly generate a heatmap that gives me an idea of the geographic distribution. I graphed the inverse of the 2012 indexed total. Don't take the colors too seriously.

![title](http://i.imgur.com/DnZOBik.png)
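
The inverse of the indexed total used for the heatmap is just `1 / indexed_total`, so a smaller value means a larger increase since 2001. A minimal sketch of preparing that column follows, using three indexed values from the 2012 table above. The names `toy`, `inverse_index` and `heatmap_input` are hypothetical, and the input format the heatmap tool expects is not covered here.

```python
import pandas as pd

# Three states and their 2012 indexed totals, copied from the table above.
toy = pd.DataFrame({
    'State': ['Punjab', 'Kerala', 'Karnataka'],
    'indexed_total': [13.0, 3.313253, 1.040541],
})

# Inverse of the index: values near 0 mean a large increase since 2001,
# values near 1 mean little change.
toy['inverse_index'] = 1 / toy['indexed_total']

# Smallest inverse first, i.e. the steepest increases at the top.
heatmap_input = toy[['State', 'inverse_index']].sort_values('inverse_index')

print(heatmap_input)
```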

A few things stand out:
1. This seems to be concentrated around the middle of the country. Why?
2. Punjab and Haryana have seen high drug abuse levels, despite their neighbouring states not seeing these increased rates. Why? Note that this correlates with what we are seeing in the news today. Maybe they should have run this analysis and done something 5 years ago... http://www.bbc.com/news/world-asia-india-38824478 . This is truly tragic, and further analysis is needed to do something about this.
