## Lab 9 - Regression


### (Based on [RABE](http://www.ilr.cornell.edu/~hadi/RABE4/) 3.15)

A national insurance organization wanted to study the consumption pattern of cigarettes in all 50 states and the District of Columbia. The variables chosen for the study are:

* Age: Median age of a person living in a state.

* HS: Percentage of people over 25 years of age in a state who had completed high school.

* Income: Per capita personal income for a state (income in dollars).

* Black: Percentage of blacks living in a state.

* Female: Percentage of females living in a state.

* Price: Weighted average price (in cents) of a pack of cigarettes in a state.

* Sales: Number of packs of cigarettes sold in a state on a per capita basis.

The data can be found at [http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt](http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt).

Below, specify the null and alternative hypotheses, the test used, and your conclusion using a 5% level of significance.

1. Test the hypothesis that the variable `Female` is not needed in the regression equation relating Sales to the six predictor variables.

2. Test the hypothesis that the variables `Female` and `HS` are not needed in the above regression equation.

3. Compute a 95% confidence interval for the true regression coefficient of the variable `Income`.

4. What percentage of the variation in `Sales` can be accounted for when `Income` is removed from the above regression equation? Which model did you use?

In [7]:
from __future__ import division

import matplotlib as mpl
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf

from numpy.random import randn
from scipy import stats
import matplotlib.pyplot as plt  # seaborn no longer re-exports plt
from statsmodels.stats.anova import anova_lm

In [4]:
data = pd.read_table('http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt')
data.head()

model = smf.ols('Sales ~ Age + HS + Income + Black + Female + Price', data=data).fit()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Sales | R-squared: | 0.321 |
| Model: | OLS | Adj. R-squared: | 0.228 |
| Method: | Least Squares | F-statistic: | 3.464 |
| Date: | Wed, 18 Mar 2015 | Prob (F-statistic): | 0.00686 |
| Time: | 12:29:16 | Log-Likelihood: | -238.86 |
| No. Observations: | 51 | AIC: | 491.7 |
| Df Residuals: | 44 | BIC: | 505.2 |
| Df Model: | 6 | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 103.3448 | 245.607 | 0.421 | 0.676 | -391.644 598.334 |
| Age | 4.5205 | 3.220 | 1.404 | 0.167 | -1.969 11.009 |
| HS | -0.0616 | 0.815 | -0.076 | 0.940 | -1.703 1.580 |
| Income | 0.0189 | 0.010 | 1.855 | 0.070 | -0.002 0.040 |
| Black | 0.3575 | 0.487 | 0.734 | 0.467 | -0.624 1.339 |
| Female | -1.0529 | 5.561 | -0.189 | 0.851 | -12.260 10.155 |
| Price | -3.2549 | 1.031 | -3.156 | 0.003 | -5.334 -1.176 |

| | | | |
|---|---|---|---|
| Omnibus: | 56.254 | Durbin-Watson: | 1.663 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 358.088 |
| Skew: | 2.842 | Prob(JB): | 1.75e-78 |
| Kurtosis: | 14.67 | Cond. No. | 237000.0 |


#### Interpreting the Model Summary
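The p-value in each coefficient row tests the null hypothesis that the coefficient is zero, given the other predictors in the model. A large p-value means the data give little evidence that the coefficient differs from zero. At the 5% level used in this lab, a coefficient with p >= 0.05 is not statistically significant.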

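+ For `Female`, the hypotheses are $H_0: \beta_{Female} = 0$ versus $H_1: \beta_{Female} \neq 0$, tested with the t-test on the coefficient in the full model. The p-value is 0.851 > 0.05, so we fail to reject the null hypothesis: there is no evidence that `Female` is needed in the regression equation once the other five predictors are included.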
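
The t statistic and p-value reported for `Female` can be reproduced from the summary table alone. The sketch below assumes the rounded values printed in the summary (coefficient -1.0529, standard error 5.561, 44 residual degrees of freedom), so it agrees with the table only up to rounding.

```python
from scipy import stats

# Values copied from the model summary (rounded as printed there).
coef_female = -1.0529
se_female = 5.561
df_resid = 44  # 51 observations minus 7 estimated parameters

# t statistic for H0: beta_Female = 0
t_stat = coef_female / se_female

# Two-sided p-value from the t distribution with df_resid degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)

print(round(t_stat, 3), round(p_value, 3))
```

The same calculation applies to any other row of the coefficient table by swapping in that row's coefficient and standard error.
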



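$R^2$, as discussed previously, is the proportion of the variation in the response that the model explains: $R^2 = 1 - SSR/SST$, where $SSR$ is the sum of squared residuals and $SST$ is the total sum of squares about the mean. The adjusted $R^2$ penalizes $R^2$ for the number of predictors relative to the number of observations, so it is the better choice when comparing models with different numbers of predictors. Here $R^2 = 0.321$, so the full model accounts for about 32% of the variation in `Sales`. What counts as a good $R^2$ depends on the application; there is no universal threshold.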
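
As a quick check on the relationship between the two statistics, the adjusted $R^2$ can be recomputed from $R^2$, the number of observations, and the number of predictors. This is a minimal sketch; the helper name `adjusted_r2` is made up for illustration, and the inputs are the rounded values from the summary.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 for a model with an intercept and n_predictors predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Full model: R^2 = 0.321 with 51 observations and 6 predictors
full_adj = adjusted_r2(0.321, n_obs=51, n_predictors=6)
print(round(full_adj, 3))  # 0.228, matching the summary
```
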



#### Test the hypothesis that the variables Female and HS are not needed in the above regression equation.

In [17]:
reduced_mod = smf.ols('Sales ~ Age + Income + Black + Price', data=data).fit()
anova_lm(reduced_mod, model)

| | df_resid | ssr | df_diff | ss_diff | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 0 | 46 | 34959.767412 | 0 | | | |
| 1 | 44 | 34925.968854 | 2 | 33.798558 | 0.02129 | 0.978945 |


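The hypotheses are $H_0: \beta_{Female} = \beta_{HS} = 0$ versus $H_1$: at least one of the two coefficients is nonzero, tested with a partial F-test that compares the reduced model to the full model. Since the p-value for the F-test is 0.98 > 0.05, we fail to reject $H_0$ at the 0.05 level.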
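
The F statistic in the ANOVA table can be reproduced by hand from the two residual sums of squares. The sketch below copies the `ssr` and `df_resid` values from the table; the variable names are chosen for illustration.

```python
from scipy import stats

# Residual sum of squares and residual degrees of freedom from the ANOVA table
ssr_reduced, df_reduced = 34959.767412, 46  # Sales ~ Age + Income + Black + Price
ssr_full, df_full = 34925.968854, 44        # all six predictors

# Partial F statistic: extra variation explained per dropped predictor,
# relative to the residual mean square of the full model
df_diff = df_reduced - df_full
f_stat = ((ssr_reduced - ssr_full) / df_diff) / (ssr_full / df_full)

# Upper-tail probability under F(df_diff, df_full)
p_value = stats.f.sf(f_stat, df_diff, df_full)

print(round(f_stat, 5), round(p_value, 4))
```
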

#### Compute a 95% confidence interval for the true regression coefficient of the variable Income.

In [18]:
model.conf_int(alpha=.05).loc[['Income'], :]  # select the row by name rather than by position

| | 0 | 1 |
|---|---|---|
| Income | -0.001643 | 0.039535 |
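
The interval is the coefficient estimate plus or minus a t critical value times the standard error. The sketch below rebuilds it from the rounded values in the full-model summary (coefficient 0.0189, standard error 0.010), so the endpoints differ slightly from the `conf_int` output, which uses unrounded values. Because the interval contains zero, `Income` is not significant at the 5% level, which agrees with its p-value of 0.070.

```python
from scipy import stats

# Rounded values from the full-model summary
coef_income = 0.0189
se_income = 0.010
df_resid = 44

# Critical value for a two-sided 95% interval
t_crit = stats.t.ppf(0.975, df_resid)

lower = coef_income - t_crit * se_income
upper = coef_income + t_crit * se_income

print(round(t_crit, 3), round(lower, 4), round(upper, 4))
```
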


#### What percentage of the variation in Sales can be accounted for when Income is removed from the above regression equation? Which model did you use?

In [22]:
reduced_mod.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Sales | R-squared: | 0.32 |
| Model: | OLS | Adj. R-squared: | 0.261 |
| Method: | Least Squares | F-statistic: | 5.416 |
| Date: | Wed, 18 Mar 2015 | Prob (F-statistic): | 0.00117 |
| Time: | 13:39:12 | Log-Likelihood: | -238.88 |
| No. Observations: | 51 | AIC: | 487.8 |
| Df Residuals: | 46 | BIC: | 497.4 |
| Df Model: | 4 | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 55.3296 | 62.395 | 0.887 | 0.380 | -70.266 180.925 |
| Age | 4.1915 | 2.196 | 1.909 | 0.062 | -0.228 8.611 |
| Income | 0.0189 | 0.007 | 2.745 | 0.009 | 0.005 0.033 |
| Black | 0.3342 | 0.312 | 1.071 | 0.290 | -0.294 0.962 |
| Price | -3.2399 | 0.999 | -3.244 | 0.002 | -5.250 -1.230 |

| | | | |
|---|---|---|---|
| Omnibus: | 56.03 | Durbin-Watson: | 1.661 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 350.319 |
| Skew: | 2.838 | Prob(JB): | 8.49e-77 |
| Kurtosis: | 14.517 | Cond. No. | 61600.0 |
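
Note that `reduced_mod` drops `Female` and `HS` but still contains `Income`, so its $R^2$ of 0.32 does not answer the question as worded. One way to fit the model that actually omits `Income` is sketched below. The helpers `formula_without` and `fit_without` are hypothetical names introduced for illustration, and the code assumes `data` holds the seven columns listed at the top of the lab.

```python
import statsmodels.formula.api as smf

# The six predictors from the full regression equation
PREDICTORS = ['Age', 'HS', 'Income', 'Black', 'Female', 'Price']

def formula_without(dropped, response='Sales', predictors=PREDICTORS):
    """Build an OLS formula that leaves out the predictors listed in `dropped`."""
    kept = [p for p in predictors if p not in dropped]
    return '{} ~ {}'.format(response, ' + '.join(kept))

def fit_without(data, dropped):
    """Fit the regression on `data` without the predictors listed in `dropped`."""
    return smf.ols(formula_without(dropped), data=data).fit()

print(formula_without(['Income']))  # Sales ~ Age + HS + Black + Female + Price
```

With the `data` frame loaded earlier, `fit_without(data, ['Income']).rsquared` gives the proportion of the variation in `Sales` accounted for once `Income` is removed, and `.rsquared_adj` gives the adjusted version.
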
