### Problem Set 2:
   Tom Curran
   
   MAC30100 Winter 2018
   
   January 22, 2018

#### Question 2

Linear regression and MLE (4 points). Maximum likelihood estimation can be used to estimate the parameters of a regression model. Assume the following linear regression model for what affects the number of weeks that an individual $i$ is sick during the year ($sick_i$).

$$
sick_{i} = \beta_{0} + \beta_{1}age_i + \beta_2children_i + \beta_3tempwinter_i + \epsilon_i, \quad \text{where } \epsilon_i \sim N(0, \sigma^2)
$$

The parameters $(\beta_0, \beta_1, \beta_2, \beta_3, \sigma^2)$ are the parameters of the model that we want to estimate. The variable $age_i$ gives the age of individual $i$ at the end of 2016 (including fractions of a year). The variable $children_i$ states how many children individual $i$ had at the end of 2016. The variable $tempwinter_i$ is the average temperature during the months of January, February, and December 2016 for individual $i$. The data for this model are in the file sick.txt, which contains comma-separated values for 200 individuals on four variables ($sick_i$, $age_i$, $children_i$, $tempwinter_i$), with variable labels in the first row.
***

a) Estimate the parameters of the model $(\beta_0, \beta_1, \beta_2, \beta_3, \sigma^2)$ by maximum likelihood, using the fact that each error term $\epsilon_i$ is distributed normally $N(0, \sigma^2)$. Solving the regression equation for $\epsilon_i$ tells us that the following expression is distributed normally $N(0, \sigma^2)$.

$$sick_i - \beta_0 - \beta_1 age_i - \beta_2 children_i - \beta_3 tempwinter_i \sim N(0,\sigma^2)$$

Estimate $(\beta_0, \beta_1, \beta_2, \beta_3, \sigma^2)$ to maximize the likelihood of seeing the data in sick.txt. Report your estimates, the value of the log-likelihood function, and the estimated variance-covariance matrix of the estimates.
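
For reference, the log-likelihood being maximized follows from the normal density of each error term. A sketch of the derivation, writing $n$ for the number of observations (200 here) and $\epsilon_i$ for the residual of individual $i$:

$$
\ln L(\beta_0, \beta_1, \beta_2, \beta_3, \sigma^2) = \sum_{i=1}^{n} \ln\left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\epsilon_i^2}{2\sigma^2}\right)\right] = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\epsilon_i^2
$$

$$
\epsilon_i = sick_i - \beta_0 - \beta_1 age_i - \beta_2 children_i - \beta_3 tempwinter_i
$$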

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.optimize as opt
import scipy.stats as sts
import statsmodels.formula.api as smf
sickdf= pd.read_csv("sick.txt")
sickdf.describe()

| | sick | age | children | avgtemp_winter |
|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 1.0086 | 40.68385 | 1.67495 | 44.04125 |
| std | 0.504222 | 11.268686 | 0.969761 | 11.101977 |
| min | 0.04 | 12.81 | 0.0 | 16.5 |
| 25% | 0.65 | 33.9675 | 0.97 | 36.1125 |
| 50% | 0.96 | 41.015 | 1.56 | 43.3 |
| 75% | 1.3225 | 47.75 | 2.3225 | 52.1725 |
| max | 2.8 | 74.89 | 4.96 | 68.6 |


***

In [2]:
ols_example = smf.ols('sick ~ age + children + avgtemp_winter', data=sickdf).fit().summary()
ols_example

| | | | |
|---|---|---|---|
| Dep. Variable: | sick | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | 1.0 |
| Method: | Least Squares | F-statistic: | 1815000.0 |
| Date: | Sun, 21 Jan 2018 | Prob (F-statistic): | 0.0 |
| Time: | 20:56:30 | Log-Likelihood: | 876.87 |
| No. Observations: | 200 | AIC: | -1746.0 |
| Df Residuals: | 196 | BIC: | -1733.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.2516 | 0.001 | 254.032 | 0.000 | 0.250 | 0.254 |
| age | 0.0129 | 6.49e-05 | 199.257 | 0.000 | 0.013 | 0.013 |
| children | 0.4005 | 0.001 | 643.790 | 0.000 | 0.399 | 0.402 |
| avgtemp_winter | -0.0100 | 4.51e-05 | -221.388 | 0.000 | -0.010 | -0.010 |

| | | | |
|---|---|---|---|
| Omnibus: | 24.095 | Durbin-Watson: | 1.997 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 7.115 |
| Skew: | -0.002 | Prob(JB): | 0.0285 |
| Kurtosis: | 2.076 | Cond. No. | 290.0 |


In [3]:
numsick = sickdf.sick
age = sickdf.age
children = sickdf.children
wintertemp = sickdf.avgtemp_winter

In [4]:
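def errorSum(coef, variables):
    # Sum of squared residuals of the linear model.
    # coef = (b0, b1, b2, b3, sigma); sigma is unpacked but not used here.
    b0, b1, b2, b3, sigma = coef

    numsick2, age2, children2, wintertemp2 = variables

    # Squared residual for each individual
    errors = (numsick2 - b0 - b1*age2 - b2*children2 - b3*wintertemp2)**2

    sum_errors = errors.sum()

    return sum_errors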

In [5]:
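def loglikelihood(coef, variables):
    # Normal log-likelihood of the regression errors.
    # coef = (b0, b1, b2, b3, sigma), where sigma is the standard deviation.
    n = len(variables[0])  # number of observations, taken from the data passed in

    ll_sigma = coef[-1]

    total_errors = errorSum(coef, variables)

    likelihood = - (n * np.log(2 * np.pi)) / 2 - (n * np.log(ll_sigma ** 2)) / 2 - (1 / (2 * ll_sigma ** 2)) * total_errors

    return likelihood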
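
One way to sanity-check a hand-coded normal log-likelihood like the one above is to compare it with the sum of `scipy.stats.norm.logpdf` over the residuals. The sketch below is illustrative only: the residuals are synthetic (not taken from sick.txt) and `normal_loglik` is a hypothetical helper name.

```python
import numpy as np
import scipy.stats as sts


def normal_loglik(resid, sigma):
    """Closed-form log-likelihood of mean-zero normal residuals (hypothetical helper)."""
    n = resid.size
    return (-n / 2 * np.log(2 * np.pi)
            - n / 2 * np.log(sigma ** 2)
            - (resid ** 2).sum() / (2 * sigma ** 2))


# Synthetic residuals for illustration; these are NOT the sick.txt data
rng = np.random.default_rng(0)
sigma = 0.5
resid = rng.normal(0.0, sigma, size=200)

closed_form = normal_loglik(resid, sigma)
via_scipy = sts.norm.logpdf(resid, loc=0.0, scale=sigma).sum()

# The two values should agree up to floating-point error
print(closed_form, via_scipy)
```

Note that both versions take the standard deviation `sigma`, not the variance, as their scale argument.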

In [6]:
def crit(params, *args):
    # Objective for the minimizer: the negative log-likelihood.
    # args holds the data series (sick, age, children, temp) in that order.
    c_loglikelihood = loglikelihood(params, args)

    c_loglikely_neg = - c_loglikelihood

    return c_loglikely_neg

In [7]:
b0_init = .1
b1_init = .1
b2_init = .1
b3_init = .1
sigma_init = 1

params_init = np.array([b0_init, b1_init, b2_init, b3_init, sigma_init])

mle_args = numsick, age, children, wintertemp

# Leave the betas unbounded; keep sigma strictly positive
bounds = ((None, None), (None, None), (None, None), (None, None), (1e-10, None))

In [8]:
results = opt.minimize(crit, params_init, args = (mle_args), method = 'L-BFGS-B', bounds = bounds)
results

      fun: -876.86506388489283
 hess_inv: <5x5 LbfgsInvHessProduct with dtype=float64>
      jac: array([  1.30695526,  52.52322808,   1.55824864,  74.00824416,   0.90410595])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 666
      nit: 57
   status: 0
  success: True
        x: array([ 0.25164454,  0.01293343,  0.40050135, -0.00999168,  0.0030177 ])

In [9]:
mleb0, mleb1, mleb2, mleb3, sigma = results.x

print("Estimate for Beta 0: ", mleb0)
print('------------------------------------------------------------------')
print("Estimate for Beta 1: ", mleb1)
print('------------------------------------------------------------------')
print("Estimate for Beta 2: ", mleb2)
print('------------------------------------------------------------------')
print("Estimate for Beta 3: ", mleb3)
print('------------------------------------------------------------------')
print("Estimate for Sigma:  ", sigma)
print('------------------------------------------------------------------')

Estimate for Beta 0:  0.251644535194
------------------------------------------------------------------
Estimate for Beta 1:  0.012933429607
------------------------------------------------------------------
Estimate for Beta 2:  0.400501345733
------------------------------------------------------------------
Estimate for Beta 3:  -0.00999167643603
------------------------------------------------------------------
Estimate for Sigma:   0.00301770494473
------------------------------------------------------------------


In [10]:
mle_coef = results.x

mle_vars = numsick, age, children, wintertemp

mle_loglikelihood = loglikelihood(mle_coef, mle_vars)

print("The Log Likelihood value is ", mle_loglikelihood)

The Log Likelihood value is  876.865063885
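
Because the errors are normal, the MLE has a closed form that can serve as a cross-check on the numerical optimizer: the betas equal the OLS coefficients and $\hat\sigma^2$ is the sum of squared residuals divided by $n$. At that point the log-likelihood reduces to $-\frac{n}{2}(\ln 2\pi + \ln\hat\sigma^2 + 1)$; with $\hat\sigma \approx 0.0030177$ and $n = 200$ this evaluates to about 876.87, in line with the value reported above. The sketch below illustrates the idea on synthetic data (the design matrix, `true_beta`, and the noise level are assumptions for illustration, not the sick.txt data).

```python
import numpy as np

# Synthetic data shaped loosely like the problem; NOT the sick.txt data
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([
    np.ones(n),                 # intercept
    rng.normal(40, 11, n),      # stand-in for age
    rng.normal(1.7, 1.0, n),    # stand-in for children
    rng.normal(44, 11, n),      # stand-in for winter temperature
])
true_beta = np.array([0.25, 0.013, 0.4, -0.01])
y = X @ true_beta + rng.normal(0, 0.003, n)

# Closed-form MLE: OLS coefficients and sigma^2 = SSE / n (not n - k)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma_hat = np.sqrt((resid ** 2).sum() / n)

# Log-likelihood at the MLE, written out in full and in concentrated form
loglik_direct = (-n / 2 * np.log(2 * np.pi)
                 - n / 2 * np.log(sigma_hat ** 2)
                 - (resid ** 2).sum() / (2 * sigma_hat ** 2))
loglik_closed = -n / 2 * (np.log(2 * np.pi) + np.log(sigma_hat ** 2) + 1)

print(beta_hat, sigma_hat)
print(loglik_direct, loglik_closed)
```

The division by $n$ rather than $n - k$ is what separates the MLE of $\sigma$ from the usual unbiased OLS estimate.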


In [11]:
offdiagneg = [[1,-1,-1,-1,-1],
             [-1,1,-1,-1,-1],
             [-1,-1,1,-1,-1],
             [-1,-1,-1,1,-1],
             [1,-1,-1,-1,1]]
hess_matrix = results.hess_inv

var_cov_matrix = hess_matrix * offdiagneg

print("Estimated variance-covariance matrix:")
print('------------------------------------------------------------------')
print(var_cov_matrix)

Estimated variance-covariance matrix:
------------------------------------------------------------------
[[ 427.24472524 -187.68681806 -425.55677393 -161.05074325 -152.97357621]
 [ -10.4282363     4.58110311   10.38703421    3.93093842    3.73377675]
 [-185.58416665   81.52634396  184.85096412   69.95631751   66.44780227]
 [   9.18530841   -4.03508657   -9.14901723   -3.46241515   -3.28875335]
 [  15.13295817   -6.6478829   -15.07316713   -5.70438801   -5.41827063]]
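
Note that `results.hess_inv` from L-BFGS-B is an `LbfgsInvHessProduct` linear operator rather than a plain array, so multiplying it by a nested list appears to be evaluated as a matrix product rather than an elementwise sign flip. That would explain why the matrix printed above is not symmetric. A minimal sketch of two alternatives follows, on synthetic, well-scaled data (not sick.txt): converting the operator with `.todense()`, and the analytic asymptotic covariance, which for this model is $\hat\sigma^2 (X'X)^{-1}$ for the betas and $\hat\sigma^2 / (2n)$ for $\hat\sigma$, with zero covariance between them. The function name `neg_loglik` and all data here are assumptions for illustration.

```python
import numpy as np
import scipy.optimize as opt


def neg_loglik(params, y, X):
    """Negative normal log-likelihood; params = (betas..., sigma)."""
    beta, sigma = params[:-1], params[-1]
    resid = y - X @ beta
    n = y.size
    return (n / 2 * np.log(2 * np.pi) + n * np.log(sigma)
            + (resid ** 2).sum() / (2 * sigma ** 2))


# Synthetic, well-scaled data for illustration; NOT the sick.txt data
rng = np.random.default_rng(2)
n, k = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(0, 1.0, n)

x0 = np.append(np.zeros(k), 1.0)
bounds = [(None, None)] * k + [(1e-10, None)]
res = opt.minimize(neg_loglik, x0, args=(y, X),
                   method="L-BFGS-B", bounds=bounds)

# Option 1: dense array form of the L-BFGS inverse-Hessian approximation.
# This is only a low-memory approximation and can be rough.
vcv_lbfgs = res.hess_inv.todense()

# Option 2: analytic asymptotic covariance evaluated at the estimates
beta_hat, sigma_hat = res.x[:-1], res.x[-1]
vcv_analytic = np.zeros((k + 1, k + 1))
vcv_analytic[:k, :k] = sigma_hat ** 2 * np.linalg.inv(X.T @ X)
vcv_analytic[k, k] = sigma_hat ** 2 / (2 * n)

print(np.round(vcv_lbfgs, 4))
print(np.round(vcv_analytic, 4))
```

A valid variance-covariance matrix should be symmetric with positive diagonal entries, which gives a quick check on whichever route is used.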


b) Use a likelihood ratio test to determine the probability that $\beta_0 = 1.0$, $\sigma^2 = 0.01$ and $\beta_1, \beta_2, \beta_3 = 0$. That is, what is the likelihood that age, number of children, and average winter temperature have no effect on the number of sick days?

In [12]:
# Assign the part (b) null-hypothesis values from the question.
# loglikelihood() takes the standard deviation, so sigma^2 = 0.01 means sigma = 0.1.
b_b0, b_b1, b_b2, b_b3, sigma = np.array([1.0, 0, 0, 0, np.sqrt(0.01)])

b_coef = b_b0, b_b1, b_b2, b_b3, sigma

b_variables = numsick, age, children, wintertemp

null_likelihood = loglikelihood(b_coef, b_variables)

alt_likelihood = loglikelihood(results.x, b_variables)

lr_val = 2 * (alt_likelihood - null_likelihood)

pval_null = 1.0 - sts.chi2.cdf(lr_val, 5)

print("p value:", pval_null)

print("The null hypothesis that age, children, and average winter temperature have no effect on the number of sick days is rejected")


p value: 0.0
The null hypothesis that age, children, and average winter temperature have no effect on the number of sick days is rejected
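
The test statistic used here is $LR = 2(\ln L_{MLE} - \ln L_{H_0})$, compared against a $\chi^2$ distribution with 5 degrees of freedom because the null hypothesis fixes all five parameters. A minimal sketch of that calculation as a reusable function is below; `lr_test` is a hypothetical helper name and the log-likelihood values passed to it are made up for illustration. It uses `chi2.sf` (the survival function) in place of `1 - chi2.cdf`, which avoids losing precision when the p-value is very small.

```python
import scipy.stats as sts


def lr_test(loglik_alt, loglik_null, n_restrictions):
    """Likelihood ratio statistic and its chi-squared p-value (hypothetical helper)."""
    lr_stat = 2.0 * (loglik_alt - loglik_null)
    p_value = sts.chi2.sf(lr_stat, df=n_restrictions)
    return lr_stat, p_value


# Made-up log-likelihood values, for illustration only
lr_stat, p_value = lr_test(loglik_alt=-120.0, loglik_null=-125.0, n_restrictions=5)
print("LR statistic:", lr_stat)
print("p value:", p_value)
```

A p-value below the chosen significance level (for example 0.05) leads to rejecting the null hypothesis.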
