In [356]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import statsmodels.formula.api as smf
from statsmodels.api import add_constant
from datetime import datetime
%matplotlib notebook

In [357]:
def graph_residuals(model):
    """Stem plot of the residuals of a fitted OLS model."""
    n_obs = len(model.resid)
    for ix, resid in enumerate(model.resid):
        plt.plot([ix, ix], [0, resid], color="orange")
        plt.scatter(ix, resid, color="orange")

    # Dashed zero line spanning every observation
    plt.hlines(0, -1, n_obs - 0.8, alpha=.5, linestyles="dashed")
    plt.xlim(-0.5, n_obs - 0.8)
    plt.title(r"Residuals between $Y$ and $\hat Y$")
    plt.show()

****
## Exercise 7.16
Table 7.6 gives quarterly data on these variables:

* $Y$: Quantity of roses sold, dozens
* $X_2$: Average wholesale price of roses (\$/dozen)
* $X_3$: Average wholesale price of carnations (\$/dozen)
* $X_4$: Average weekly family disposable income (\$/week)
* $X_5$: The trend variable taking values in $\mathbb{N}$ for the period 1971-III to 1975-II in the Detroit metropolitan area

$$
Y_t = \alpha_1 + \alpha_2 X_{2t} + \alpha_3 X_{3t} + \alpha_4 X_{4t} + \alpha_5 X_{5t} + \varepsilon_t
$$

$$
\ln Y_t = \beta_1 + \beta_2 \ln X_{2t} + \beta_3 \ln X_{3t} + \beta_4 \ln X_{4t} + \beta_5 X_{5t} + \varepsilon_t
$$


- a. Estimate the parameters of the linear model and interpret the results.


- b. Estimate the parameters of the log–linear model and interpret the results.


- c. $\beta_2$, $\beta_3$, and $\beta_4$ give, respectively, the own-price, cross-price, and income elasticities of demand. What are their a priori signs? Do the results concur with the a priori expectations?


- d. How would you compute the own-price, cross-price, and income elasticities for the linear model?


- e. On the basis of your analysis, which model, if either, would you choose and why?

#### a) Fitting the linear model

In [3]:
roses = pd.read_csv("roses.csv")
roses.head()

Unnamed: 0,Y,X2,X3,X4,X5
0,11484,2.26,3.49,158.11,1
1,9348,2.54,2.85,173.36,2
2,8429,3.07,4.06,165.26,3
3,10079,2.91,3.64,172.92,4
4,9240,2.73,3.21,178.46,5


In [4]:
roses_lmod = smf.ols("Y ~ X2 + X3 + X4 + X5", data=roses).fit()
roses_lmod.summary2()



0,1,2,3
Model:,OLS,Adj. R-squared:,0.775
Dependent Variable:,Y,AIC:,269.4803
Date:,2017-04-13 16:43,BIC:,273.3432
No. Observations:,16,Log-Likelihood:,-129.74
Df Model:,4,F-statistic:,13.89
Df Residuals:,11,Prob (F-statistic):,0.000281
R-squared:,0.835,Scale:,940660.0

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,10816.0433,5988.3484,1.8062,0.0983,-2364.2228,23996.3093
X2,-2227.7044,920.4657,-2.4202,0.0340,-4253.6357,-201.7730
X3,1251.1412,1157.0206,1.0813,0.3027,-1295.4441,3797.7265
X4,6.2830,30.6217,0.2052,0.8412,-61.1148,73.6808
X5,-197.3999,101.5612,-1.9437,0.0780,-420.9347,26.1348

0,1,2,3
Omnibus:,1.968,Durbin-Watson:,2.334
Prob(Omnibus):,0.374,Jarque-Bera (JB):,1.094
Skew:,0.639,Prob(JB):,0.579
Kurtosis:,2.904,Condition No.:,4482.0


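At $\alpha = 0.05$, this model is better than an *intercept-only model*, since the F-statistic (13.89, p = 0.000281) is significant at this level.

On the other hand, only $X_2$ is individually significant: the intercept and the coefficients on $X_3$, $X_4$, and $X_5$ all have p-values above 0.05. Holding the other regressors fixed, a \$1 increase in the wholesale price of roses is associated with roughly 2,228 fewer dozens sold per quarter.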
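
The significance of each term can be double-checked from the summary table alone: a t statistic is the coefficient divided by its standard error, and with 11 residual degrees of freedom the two-sided 5% critical value is about 2.201. The sketch below copies the coefficients and standard errors from the table; the critical value comes from a standard t table and is an assumption that does not appear in the notebook output.

```python
# Coefficients and standard errors copied from the linear-model summary.
coefs = {
    "Intercept": (10816.0433, 5988.3484),
    "X2": (-2227.7044, 920.4657),
    "X3": (1251.1412, 1157.0206),
    "X4": (6.2830, 30.6217),
    "X5": (-197.3999, 101.5612),
}

# Two-sided 5% critical value of Student's t with 11 degrees of freedom,
# taken from a standard t table (assumption, not computed here).
T_CRIT = 2.201


def t_statistics(table):
    """Return {name: coef / std_err} for every term in the table."""
    return {name: coef / se for name, (coef, se) in table.items()}


t_stats = t_statistics(coefs)
significant = [name for name, t in t_stats.items() if abs(t) > T_CRIT]

for name, t in t_stats.items():
    print(f"{name:>9}: t = {t:7.4f}")
print("Significant at 5%:", significant)
```

In the notebook itself the same check could be read off `roses_lmod.pvalues`.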

In [360]:
graph_residuals(roses_lmod)


#### b) Fitting the log-model

In [6]:
roses_logmod = smf.ols("""np.log(Y) ~
                       np.log(X2) + np.log(X3) +
                       np.log(X4) + X5""",
                       data=roses).fit()
roses_logmod.summary2()



0,1,2,3
Model:,OLS,Adj. R-squared:,0.726
Dependent Variable:,np.log(Y),AIC:,-9.0821
Date:,2017-04-13 16:43,BIC:,-5.2191
No. Observations:,16,Log-Likelihood:,9.541
Df Model:,4,F-statistic:,10.92
Df Residuals:,11,Prob (F-statistic):,0.000798
R-squared:,0.799,Scale:,0.02584

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,3.5722,4.6952,0.7608,0.4628,-6.7618,13.9061
np.log(X2),-1.1707,0.4883,-2.3974,0.0354,-2.2455,-0.0959
np.log(X3),0.7379,0.6529,1.1303,0.2824,-0.6990,2.1749
np.log(X4),1.1532,0.9020,1.2785,0.2274,-0.8321,3.1385
X5,-0.0301,0.0164,-1.8339,0.0938,-0.0662,0.0060

0,1,2,3
Omnibus:,1.619,Durbin-Watson:,2.049
Prob(Omnibus):,0.445,Jarque-Bera (JB):,0.324
Skew:,0.248,Prob(JB):,0.851
Kurtosis:,3.49,Condition No.:,1297.0


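The *log-linear* model is jointly significant, with an F-statistic of 10.92 (p = 0.000798). Looking at the individual coefficients, however, three regressors ($\ln X_3$, $\ln X_4$, and $X_5$) and the intercept are not significant at $\alpha = 0.05$; only $\ln X_2$ is.

This model also has a lower $R^2$ than the linear model (0.799 versus 0.835), although the two values are not directly comparable because the dependent variables differ ($\ln Y$ versus $Y$).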
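
One common way to put the two fits on the same footing is to transform the log-model fitted values back to levels and compute the squared correlation between $Y$ and $\exp(\widehat{\ln Y})$. The following is a minimal sketch of that idea. `r2_in_levels` is a hypothetical helper, and the demo uses only the five rows shown by `roses.head()` together with the rounded coefficients from the summary, so the printed number is illustrative and not the full-sample value.

```python
import numpy as np


def r2_in_levels(y, log_fitted):
    """Squared correlation between y and exp(fitted ln y)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.exp(np.asarray(log_fitted, dtype=float))
    return float(np.corrcoef(y, y_hat)[0, 1] ** 2)


# First five rows of the rose data, as printed by roses.head().
Y = np.array([11484, 9348, 8429, 10079, 9240], dtype=float)
X2 = np.array([2.26, 2.54, 3.07, 2.91, 2.73])
X3 = np.array([3.49, 2.85, 4.06, 3.64, 3.21])
X4 = np.array([158.11, 173.36, 165.26, 172.92, 178.46])
X5 = np.array([1, 2, 3, 4, 5], dtype=float)

# Rounded coefficients copied from the log-model summary.
log_fitted = (3.5722 - 1.1707 * np.log(X2) + 0.7379 * np.log(X3)
              + 1.1532 * np.log(X4) - 0.0301 * X5)

r2_levels = r2_in_levels(Y, log_fitted)
print("R^2 in levels (first five rows only):", round(r2_levels, 3))
```

With the full data, the call would presumably be `r2_in_levels(roses.Y, roses_logmod.fittedvalues)`, and the result could then be set against the linear model's $R^2$.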

In [361]:
graph_residuals(roses_logmod)


#### c) Interpreting the Parameters

In [8]:
roses_logmod.params

Intercept     3.572156
np.log(X2)   -1.170728
np.log(X3)    0.737938
np.log(X4)    1.153213
X5           -0.030111
dtype: float64

All three estimates concur with the *a priori* expectations. The own-price elasticity $\beta_2$ should be negative, since a rise in the price of roses lowers the quantity sold; the estimate is about $-1.17$. The cross-price elasticity $\beta_3$ should be positive if carnations are a substitute for roses; the estimate is about $0.74$. The income elasticity $\beta_4$ should be positive if roses are a normal good; the estimate is about $1.15$.

Note, however, that only $\beta_2$ is statistically significant at $\alpha = 0.05$, so the signs of $\beta_3$ and $\beta_4$ match expectations but are estimated imprecisely.

#### d) Linear model computations
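In the linear model the slopes are not elasticities. The elasticity of $Y$ with respect to $X_j$ is $\alpha_j \cdot X_j / Y$, which changes from point to point, so it is usually evaluated at the sample means: $\alpha_j \cdot \bar X_j / \bar Y$. The own-price, cross-price, and income elasticities use $\alpha_2$, $\alpha_3$, and $\alpha_4$ with $X_2$, $X_3$, and $X_4$ respectively.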
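
A minimal sketch of the usual shortcut for a linear model, slope times $\bar X / \bar Y$. `elasticity_at_means` is a hypothetical helper, and the demo combines the $X_2$ slope from part (a) with only the five rows shown by `roses.head()`, so the value is illustrative and not the full-sample elasticity.

```python
from statistics import mean


def elasticity_at_means(slope, x, y):
    """Elasticity of y with respect to x in a linear model, at the sample means."""
    return slope * mean(x) / mean(y)


# First five rows of the rose data, as printed by roses.head().
Y = [11484, 9348, 8429, 10079, 9240]
X2 = [2.26, 2.54, 3.07, 2.91, 2.73]

# Slope on X2 copied from the linear-model summary in part (a).
alpha_2 = -2227.7044

own_price = elasticity_at_means(alpha_2, X2, Y)
print(f"Own-price elasticity at the means (five rows only): {own_price:.3f}")
```

On the full data the same quantity would be `roses_lmod.params["X2"] * roses.X2.mean() / roses.Y.mean()`, and likewise for `X3` and `X4`.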

#### e) Choosing a model

I would choose neither as it stands. Both models contain several non-significant variables, and the trend $X_5$ may be superfluous. A better alternative is to first decide which variables carry useful information and then refit a reduced model.

****
## Exercise 7.17

Wildcats are wells drilled to find and produce oil and/or gas in an improved area or to find a new reservoir in a field previously found to be productive of oil or gas or to extend the limit of a known oil or gas reservoir.

- $Y$: The number of wildcats drilled
- $X_2$: Price at the wellhead in the previous period
- $X_3$: domestic output
- $X_4$: GNP constant dollars
- $X_5$: Trend variable

See if the following model fits the data:

$$
    Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 \ln X_{3t} + \beta_4 X_{4t} + \beta_5 X_{5t} + \varepsilon_t
$$

* **a.** Can you offer an a priori rationale to this model?

* **b.** Assuming the model is acceptable, estimate the parameters of the model and their standard errors, and obtain $R^2$ and $\bar R^2$.

* **c.** Comment on your results in view of your prior expectations.

* **d.** What other specification would you suggest to explain wildcat activity? Why?

In [9]:
wildcat = pd.read_csv("wildcat.csv")
wildcat.head()

Unnamed: 0,Y,X2,X3,X4,X5
0,8.01,4.89,5.52,487.67,0
1,9.06,4.83,5.05,490.59,1
2,10.31,4.68,5.41,533.55,2
3,11.76,4.42,6.16,576.57,3
4,12.43,4.36,6.26,598.62,4


#### a) Rationale

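The rationale of this model is to explain the number of wildcats drilled through the wellhead price in the previous period, the logarithm of domestic output (so that its effect is measured per relative change), GNP in constant dollars, and a trend variable that captures the passage of time. A priori, a higher wellhead price should make drilling more attractive, so we would expect $\beta_2 > 0$.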

#### b) Obtaining $R^2$ and $\bar R^2$

In [354]:
wcat_model  = smf.ols("Y ~ X2 + np.log(X3) + X4 + X5",
                      data=wildcat).fit()

plt.plot(wcat_model.fittedvalues, label=r"$\hat Y$")
plt.plot(wildcat.Y, label="$Y$")
plt.legend()
plt.show()


In [19]:
print("R^2: {}".format(round(wcat_model.rsquared,4)))
print("Adj. R^2: {}".format(round(wcat_model.rsquared_adj, 4)))

R^2: 0.6559
Adj. R^2: 0.603
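
The adjusted value follows from $R^2$ through $\bar R^2 = 1 - (1 - R^2)\frac{n-1}{n-k}$, where $n$ is the number of observations and $k$ the number of estimated parameters, intercept included. The sample size of the wildcat data is not shown in this notebook, so the sketch below checks the formula against the rose-demand summaries from Exercise 7.16 instead ($n = 16$, $k = 5$). `adjusted_r2` is a hypothetical helper name.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k parameters (intercept included)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


# R^2 values as reported (to three decimals) in the Exercise 7.16 summaries.
linear = adjusted_r2(0.835, n=16, k=5)
loglin = adjusted_r2(0.799, n=16, k=5)

print(f"linear model:     {linear:.3f}")
print(f"log-linear model: {loglin:.3f}")
```

Up to rounding, both results agree with the `Adj. R-squared` entries of those summaries (0.775 and 0.726).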


#### c) Summary

The model explains about 66% of the variation in the observed data ($R^2 = 0.6559$, $\bar R^2 = 0.603$). Further analysis would help determine whether relevant variables are missing or whether some of the current regressors are not needed.

#### d) Model Adjustment

I would consider the demand for oil and gas as an additional variable to explain the number of wildcats drilled.

****
## Exercise 7.24

Table 7.12 gives data for real consumption expenditure, real income, real wealth, and real interest rates for the U.S. for the years 1947–2000.

- a. Given the data in the table, estimate the linear consumption function using income, wealth, and interest rate. What is the fitted equation?

- b. What do the estimated coefficients indicate about the variables’ relationships to consumption expenditure?

#### a) Estimating the model

In [30]:
econ = pd.read_table("./table_712.txt", sep="\t", usecols=range(5))
econ.head()

Unnamed: 0,ear,C,Yd,wealth,interest
0,1947,976.4,1035.2,5166.8,-10.351
1,1948,998.1,1090.0,5280.8,-4.72
2,1949,1025.3,1095.6,5607.4,1.044
3,1950,1090.9,1192.7,5759.5,0.407
4,1951,1107.1,1227.0,6086.1,-5.283


In [58]:
econ_model = smf.ols("C ~ Yd + wealth + interest",
                    data = econ).fit()

econ_model.summary().tables[1]

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-20.6333,12.827,-1.609,0.114,-46.397,5.130
Yd,0.7340,0.014,53.376,0.000,0.706,0.762
wealth,0.0360,0.002,14.488,0.000,0.031,0.041
interest,-5.5211,2.307,-2.394,0.020,-10.154,-0.888
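
Written out, the coefficient table above corresponds to the following fitted equation, with standard errors in parentheses below each estimate:

$$
\begin{aligned}
\widehat{C}_t = {} & -20.6333 + 0.7340\, Yd_t + 0.0360\, \text{wealth}_t - 5.5211\, \text{interest}_t \\
& (12.827) \qquad (0.014) \qquad\quad (0.002) \qquad\qquad (2.307)
\end{aligned}
$$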


#### b) Explaining the coefficients

The coefficients in the model express the following, in each case holding the other regressors fixed:
- **Yd**: For each extra billion dollars of disposable income, total consumption grows by about 0.73 billion.
- **wealth**: Each extra billion dollars of wealth raises consumption by about 0.036 billion, a small but highly significant effect.
- **interest**: Each one-point rise in the real interest rate lowers total consumption by about 5.5 billion, an effect that is significant at the 5% level.

****
## Exercise 7.25

*Estimating Qualcomm stock prices*. As an example of the polynomial regression, consider data on the weekly stock prices of Qualcomm, Inc., a digital wireless telecommunications designer and manufacturer over the time period of 1995 to 2000. The full data can be found on the textbook’s website in Table 7.13. During the late 1990s, technological stocks were particularly profitable, but what type of regression model will best fit these data? Figure 7.4 shows a basic plot of the data for those years.

This plot does seem to resemble an elongated S curve; there seems to be a slight increase in the average stock price, but then the rate increases dramatically toward the far right side of the graph. As the demand for more specialized phones dramatically increased and the technology boom got under way, the stock price followed suit and increased at a much faster rate.

- a. Estimate a linear model to predict the closing stock price based on time. Does this model seem to fit the data well?

- b. Now estimate a squared model by using both time and time-squared. Is this a better fit than in (a)?

- c. Finally, fit the following cubic or third-degree polynomial:

$$
    Y_i = \beta_0 + \beta_1X_i + \beta_2X_i^2 + \beta_3X_i^3 + \varepsilon_i
$$

where $Y$ is the stock price and $X$ represents time. Which model seems to be the best estimator for the stock prices?

In [177]:
qualcomm = pd.read_table("qualcomm.txt", sep="\t")
qualcomm["date"] = [datetime.strptime(date, "%m/%d/%Y") for date in qualcomm.ate]
del qualcomm["ate"]

qualcomm.head()

Unnamed: 0,time,Close,date
0,1,23.47,1995-01-03
1,2,20.54,1995-01-09
2,3,22.74,1995-01-16
3,4,27.88,1995-01-23
4,5,27.39,1995-01-30


In [192]:
qcom_linear_model = smf.ols("Close ~ time", data=qualcomm).fit()
qcom_2nd_model = smf.ols("Close ~ time + I(time**2)", data=qualcomm).fit()
qcom_3rd_model = smf.ols("Close ~ time + I(time**2) + I(time**3)", data=qualcomm).fit()

qcom_models = [qcom_linear_model,
               qcom_2nd_model,
               qcom_3rd_model]

In [353]:
qualcomm[["date", "Close"]].plot()
qcom_linear_model.fittedvalues.plot()
qcom_2nd_model.fittedvalues.plot()
qcom_3rd_model.fittedvalues.plot()
plt.show()


In [143]:
print("linear model")
qcom_linear_model.summary().tables[1]

linear model


0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-4.6941,6.881,-0.682,0.496,-18.244,8.856
time,0.5805,0.046,12.701,0.000,0.491,0.671


In [145]:
print("2nd Degree Polynomial")
qcom_2nd_model.summary().tables[1]

2nd Degree Polynomial


0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,72.6825,8.147,8.921,0.000,56.639,88.726
time,-1.1915,0.144,-8.266,0.000,-1.475,-0.908
I(time ** 2),0.0068,0.001,12.694,0.000,0.006,0.008


In [146]:
print("3rd Degree Polynomial")
qcom_3rd_model.summary().tables[1]

3rd Degree Polynomial


0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-10.8543,7.670,-1.415,0.158,-25.959,4.250
time,2.6128,0.254,10.286,0.000,2.113,3.113
I(time ** 2),-0.0296,0.002,-13.094,0.000,-0.034,-0.025
I(time ** 3),9.29e-05,5.69e-06,16.326,0.000,8.17e-05,0.000


#### c) Choosing the best model

In [253]:
for model in qcom_models:
    print("{: >40} | R2_adj = {}".format(model.model.formula, model.rsquared_adj))

                            Close ~ time | R2_adj = 0.38230849211923457
               Close ~ time + I(time**2) | R2_adj = 0.6188645887565991
  Close ~ time + I(time**2) + I(time**3) | R2_adj = 0.8125417956906624


Of the three models, the cubic explains the most variation, with an adjusted $R^2$ of about 0.81 against 0.62 for the quadratic and 0.38 for the linear model.

****
## Exercise 8.36
According to the National Bureau of Economic Research, the most recent U.S. business contraction cycle ended in late 2001. Split the data into three sections:

**(1)** 1970–1981, **(2)** 1982–2001, and **(3)** 2002–2005.

- a. Estimate both the model for the full dataset (years 1970–2005) and the third section (post-2002). Using the Chow test, determine if there is a significant break between the third period and the full dataset.

- b. With this new data in Table 8.11, determine if there is still a significant difference between the first set of years (1970–1981) and the full dataset, now that there are more observations available.

- c. Perform the Chow test on the middle period (1982–2001) versus the full dataset to see if the data in this period behave significantly differently than the rest of the data.

In [321]:
cycle = pd.read_table("table8_11.txt",
                      usecols=range(3), thousands=",")
cycle.head()

Unnamed: 0,ear,Savings,Income
0,1970,69.5,735.7
1,1971,80.6,801.8
2,1972,77.2,869.1
3,1973,102.7,978.3
4,1974,113.6,1071.6


In [304]:
def sections(X):
    if X.ear <= 1981:
        return "blue"
    elif 1982 <= X.ear <= 2001:
        return "red"
    else:
        return "green"

cycle["section"] = cycle.apply(sections, 1)

In [362]:
for yval in cycle.values:
    plt.scatter(yval[0], yval[1], color=yval[3])
plt.plot(cycle.ear, cycle.Savings, alpha=0.5);
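
The Chow test asked for in parts (a) to (c) compares the residual sum of squares of one regression pooled over all years with the combined residual sums of squares of separate regressions on two groups of years. Below is a minimal sketch with NumPy only, assuming a simple regression of savings on income. `rss` and `chow_f` are hypothetical helper names, and the demo arrays are synthetic stand-ins, not the values from Table 8.11.

```python
import numpy as np


def rss(x, y):
    """Residual sum of squares from an OLS fit of y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def chow_f(x, y, in_first_group, k=2):
    """Chow F statistic for a break between two groups of observations.

    k is the number of parameters per regression (intercept and slope here).
    Under the null hypothesis of no break, F follows F(k, n - 2k).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = np.asarray(in_first_group, dtype=bool)
    rss_pooled = rss(x, y)
    rss_split = rss(x[mask], y[mask]) + rss(x[~mask], y[~mask])
    n = len(y)
    return ((rss_pooled - rss_split) / k) / (rss_split / (n - 2 * k))


# Synthetic example (not Table 8.11): the slope changes halfway through.
rng = np.random.default_rng(0)
x = np.linspace(1, 20, 40)
first = x <= 10
y = np.where(first, 1 + 0.5 * x, 6 + 0.0 * x) + rng.normal(0, 0.2, x.size)

f_stat = chow_f(x, y, first)
print("Chow F statistic:", round(f_stat, 2))
```

On the notebook's data, a call such as `chow_f(cycle.Income, cycle.Savings, cycle.ear <= 1981)` would test the first period against the remaining years, under the assumption that savings is regressed on income. The statistic is then compared with a critical value of $F(k, n - 2k)$.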
