## The Effect Of Cigarette Taxes On Tobacco Consumption

All states and the Federal government levy excise taxes on cigarettes. Here, we will examine how a large tax hike altered smoking rates in a population of particular public health interest: pregnant women. In 1993, the Michigan legislature raised the cigarette tax from 25 to 75 cents per pack. The higher tax rate went into effect on May 1, 1994.

The Surgeon General of the US estimates that smoking during pregnancy doubles the chance that a baby will be born with a low birth weight (<2500 grams). Seventeen percent of births are to women who smoked during their pregnancy. In recent years, a number of public health officials have suggested that higher cigarette taxes can be used as a way to improve birth outcomes. We will use the data from the Michigan “experiment” to evaluate this conjecture.

The data for this project are taken from the Natality Detail File, which is an annual census of births in the US. I have taken a 5% random sample of births for the state of Michigan for the 32 months before and the 24 months after the tax hike. I have also included a 5% random sample of data over the same period for two states that had no nominal change in their state cigarette tax rates: Iowa and Pennsylvania.

Variable definitions:

1. MONTH Index that equals 1 in the first month (September 1991), 2 in the second (October 1991), and so on through month 56. Month 33 (May 1994) is the month the new tax went into effect.

2. STATE 2-digit state FIPS code. Michigan is state 26; the control states are Iowa (19) and Pennsylvania (42).

3. SMOKED Dummy variable, =1 if a mother self-reported that she smoked during her pregnancy, =0 otherwise.

4. MRACE3 3-level variable, =1 if mother is white, =2 if Black, =3 if other race.

5. MEDUC6 6-level variable for mother’s education: =1 if <9 years, =2 if 9-11 years, =3 if 12 years, =4 if 13-15 years, =5 if 16+ years, =6 if education was not reported.

6. PARITY 4-level variable for mother’s parity of birth: =1 if this is the first birth, =2 if the second birth, =3 if the third birth, =4 if the fourth or higher birth.

7. HISPANIC Dummy variable, =1 if mother is Hispanic, =0 otherwise.

8. MARRIED Dummy variable, =1 if mother is married, =0 otherwise.

Treat the data from Iowa and Pennsylvania as one control group.


In [5]:
#This is the code that will load the dataset you need. 

In [40]:
# Import basic libraries
%matplotlib inline
import math
import numpy as np
import scipy
from scipy.stats import binom, hypergeom
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

In [7]:
df=pd.read_excel('Michigan.xlsx')

In [8]:
# Load any other libraries you find necessary for your assignment. Good luck!

In [9]:
df.head()

Unnamed: 0,state,month,smoked,ageg,mrace3,parity,meduc6,married,hispanic
0,19,4,0,3,1,2,3,1,1
1,19,3,0,2,1,1,4,0,0
2,19,2,0,4,1,2,5,1,0
3,19,2,0,2,1,1,3,1,0
4,19,2,0,1,2,1,1,0,0


### Answer to question 1

Question 1. Construct two variables: A dummy variable for Michigan (name it MI) and another for the period after the tax rate is increased (HIKE). Calculate the mean smoking rate before and after the tax hike for Michigan and the control group. Using these means, calculate the difference in difference estimate of the impact of higher taxes on smoking in Michigan.
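
As a notational sketch (the symbols are introduced here and are not part of the assignment text), let $\bar{Y}_{g,t}$ be the mean of SMOKED for group $g$ (MI for Michigan, C for Iowa and Pennsylvania pooled) in period $t$ (pre or post hike). The difference in difference estimate is then:

$$
\hat{\delta}_{DD} = \left(\bar{Y}_{MI,\,post} - \bar{Y}_{MI,\,pre}\right) - \left(\bar{Y}_{C,\,post} - \bar{Y}_{C,\,pre}\right)
$$

The second bracket removes the change in smoking that would have happened without the tax, under the assumption that Michigan and the control states would otherwise have followed parallel trends.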

In [10]:
df.state.value_counts()

42    36538
26    30733
19     8755
Name: state, dtype: int64

In [14]:
# Dummy for Michigan (FIPS code 26), built without a row-by-row loop
MI = (df['state'] == 26).astype(int)

In [20]:
df['MI']=MI

In [24]:
df['state'].value_counts()

42    36538
26    30733
19     8755
Name: state, dtype: int64

In [23]:
df.MI.value_counts()

0    45293
1    30733
Name: MI, dtype: int64

In [25]:
# Dummy for the post-hike period (month 33 = May 1994 onward)
HIKE = (df['month'] >= 33).astype(int)

In [26]:
df['HIKE']=HIKE

In [30]:
# Mean smoking rate in each (HIKE, MI) cell
table = pd.pivot_table(df, values='smoked', index=['HIKE'], columns=['MI'], aggfunc='mean')
table

HIKE \ MI,0,1
0,0.195,0.195728
1,0.182792,0.17832


In [43]:
# dif_1 = before/after change in the smoking rate in the treatment group (Michigan)
# dif_2 = before/after change in the smoking rate in the control group (Iowa and Pennsylvania)
# dif_3 = difference-in-difference estimate: the Michigan change net of the control-group change

dif_1 = 0.178320-0.195728
dif_2 = 0.182792-0.195000
dif_3 = dif_1-dif_2

print ('dif_1=',round(dif_1,4))
print ('dif_2=',round(dif_2,4))
print ('dif_3=',round(dif_3,4))

dif_1= -0.0174
dif_2= -0.0122
dif_3= -0.0052
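
The subtraction above hard-codes the cell means. A minimal sketch of the same calculation as a reusable function follows; the name `did_from_means` and the dictionary layout are illustrative assumptions, and the four values are copied from the pivot table above. With the notebook's `table`, the equivalent lookups would be `table.loc[hike, mi]`.

```python
# Cell means of `smoked`, keyed by (HIKE, MI); values copied from the pivot table.
cell_means = {
    (0, 0): 0.195,     # control states, before the hike
    (0, 1): 0.195728,  # Michigan, before the hike
    (1, 0): 0.182792,  # control states, after the hike
    (1, 1): 0.17832,   # Michigan, after the hike
}


def did_from_means(means):
    """Hypothetical helper: difference in difference from four cell means."""
    change_treated = means[(1, 1)] - means[(0, 1)]  # Michigan, after minus before
    change_control = means[(1, 0)] - means[(0, 0)]  # controls, after minus before
    return change_treated - change_control


did = did_from_means(cell_means)
print('DiD estimate =', round(did, 4))
```

Reading the means from a lookup keeps the estimate in sync with the data if the sample or the definition of HIKE changes.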


#### The difference-in-difference estimate is that the tax hike lowered the smoking rate in Michigan by about 0.5 percentage points

### Answer to question 2

Question 2. Using the two variables from Question 1 above and any other necessary variables, calculate a “difference in difference” estimate in a regression framework. How does this estimate compare to the estimate from question 1? Did the tax hike reduce smoking rates by a statistically significant amount?
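
One way to write the regression the question asks for is the following sketch, where the coefficient symbols are notation introduced here rather than taken from the assignment:

$$
\text{SMOKED}_i = \beta_0 + \beta_1\,\text{HIKE}_i + \beta_2\,\text{MI}_i + \beta_3\,(\text{HIKE}_i \times \text{MI}_i) + \varepsilon_i
$$

Because the three dummies saturate the four state-by-period cells, OLS reproduces the cell means exactly. $\beta_0$ is the control-group smoking rate before the hike, $\beta_1$ is the before/after change in the control group, $\beta_2$ is the pre-hike gap between Michigan and the control group, and $\beta_3$ is the difference in difference estimate.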

In [36]:
# Interaction term HIKE x MI; its coefficient is the difference-in-difference estimate
product = df['HIKE'] * df['MI']


In [37]:
df['product']=product

In [46]:
df['HIKE'].value_counts()

0    44231
1    31795
Name: HIKE, dtype: int64

In [45]:
df['product'].value_counts()

0    63083
1    12943
Name: product, dtype: int64

In [48]:
table1 = pd.pivot_table(df, values='smoked', index=['HIKE'],columns=['MI'], aggfunc='count')
table1

HIKE \ MI,0,1
0,26441,17790
1,18852,12943


In [38]:
Y=df.smoked
X=df.loc[:,['HIKE','MI','product']]

In [41]:
# Add the intercept column through the public statsmodels API
X = sm.add_constant(X, prepend=True, has_constant='skip')

In [42]:
model = sm.OLS(Y, X)
# OLS has a closed-form solution, so no optimizer option such as disp is needed
model_fit = model.fit()
print(model_fit.summary())

                            OLS Regression Results                            
Dep. Variable:                 smoked   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     8.588
Date:                Thu, 16 May 2019   Prob (F-statistic):           1.07e-05
Time:                        17:05:15   Log-Likelihood:                -36617.
No. Observations:               76026   AIC:                         7.324e+04
Df Residuals:                   76022   BIC:                         7.328e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.1950      0.002     80.952      0.0

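#### From the result summary, the coefficient on HIKE is negative and significant, which means smoking rates also fell in the control states after May 1994. The difference-in-difference estimate is the coefficient on product (the HIKE × MI interaction), which in this saturated model equals the -0.0052 from question 1. Whether the tax hike itself reduced smoking by a statistically significant amount depends on the t-statistic of that coefficient, not on HIKE.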
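
As a rough cross-check on significance that does not depend on the regression printout, the difference in difference estimate can be compared with a standard error built from the four cell proportions and cell counts in the pivot tables above. This is a back-of-the-envelope sketch, assuming independent observations and the binomial variance p(1 - p)/n within each cell. It is not the nonrobust OLS standard error that statsmodels reports, although with cells this large the two should be close. The names `cells`, `did`, `se` and `z` are illustrative.

```python
import math

# (HIKE, MI) -> (smoking rate, number of births), copied from the two pivot tables.
cells = {
    (0, 0): (0.195, 26441),     # control states, before the hike
    (0, 1): (0.195728, 17790),  # Michigan, before the hike
    (1, 0): (0.182792, 18852),  # control states, after the hike
    (1, 1): (0.17832, 12943),   # Michigan, after the hike
}

# Difference in difference of the four cell means.
did = (cells[(1, 1)][0] - cells[(0, 1)][0]) - (cells[(1, 0)][0] - cells[(0, 0)][0])

# For independent cells, the variance of the estimate is the sum of p(1-p)/n.
variance = sum(p * (1 - p) / n for p, n in cells.values())
se = math.sqrt(variance)
z = did / se

print('DiD =', round(did, 4), ' SE =', round(se, 4), ' z =', round(z, 2))
print('significant at the 5% level:', abs(z) > 1.96)
```

On these numbers the estimate is less than one standard error from zero, which suggests that, in this simple specification, the effect of the tax hike on smoking is not statistically distinguishable from zero at the 5% level.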