# Regression

Regression analysis is one of the most widely used
tools in statistical analysis. Most of us have come
across it at some point, either by employing it or by interpreting
its results. It is a powerful technique thanks to both its ease of calculation and the simplicity of its assumptions.
However, these same attributes are the reason why regression is sometimes
misapplied or misinterpreted.

## 4.1 Relationships between Variables: Regression

Consider a situation where you are interested in
determining the association between two (or more) pieces of information; for example, the height of a
child compared to that of her parents, ice cream sales and
temperature, or the body mass of an animal and the mass of its brain. Ultimately,
our goal is to use a model to predict the outcome of the
variable of interest given the values of the other variable(s).
We usually call the quantity of interest the response or
dependent variable and denote it with y. The other quantities are called predictors, explanatory or
independent variables, and we denote them as x. Intuitively, we know that two quantities are correlated if there is a
relationship between them, i.e. the value of one
tells us something about the value of the other.

In a correlation analysis we estimate a value bounded
between -1 and 1, and we call it the correlation coefficient.
This coefficient tells us the strength of the linear association between the two variables. If the two quantities vary in
tandem (if one increases/decreases, the other one does too)
the correlation coefficient is positive, whereas it is negative
when the two quantities move in opposite directions (one increases as the other decreases). It is important to remember that the correlation coefficient measures the strength of the linear relationship between the
variables, and therefore a value of zero does not mean that
there is no relationship at all. It simply indicates that there
is no linear relation between the variables in question.
Also, just because we measure a correlation between two variables, it does not
mean that there is a causal relationship between them.
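
To make the point about linearity concrete, here is a minimal sketch using NumPy's `np.corrcoef`. The data are made up purely for illustration: two exactly linear relations and one quadratic relation evaluated on a symmetric range. The helper name `pearson_r` is our own and not part of any library.

```python
import numpy as np

# Hypothetical data, chosen only for illustration
x = np.linspace(-1.0, 1.0, 21)

y_linear = 2.0 * x + 1.0   # increases in tandem with x
y_opposite = -3.0 * x      # decreases as x increases
y_quadratic = x ** 2       # strong, but non-linear, relationship


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


r_linear = pearson_r(x, y_linear)
r_opposite = pearson_r(x, y_opposite)
r_quadratic = pearson_r(x, y_quadratic)

print(r_linear, r_opposite, r_quadratic)
```

The first two coefficients sit at the two ends of the allowed range, while the quadratic case gives a coefficient of essentially zero even though y is completely determined by x.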

Back to our subject of interest: Francis Galton pioneered the
application of statistical methods to many of his scientific
interests. For instance, he was interested in the
relative size/height of children and their parents (both in animals and plants). Among his observations he noticed
that a tall parent is likely to have a child that is taller than
average. However, the child is likely to be less tall than the
parent. Similarly, a parent that is shorter than average
is likely to have children taller than the parent, but still below
the average. In other words, the difference in height
between parent and offspring is proportional to the parent’s deviation from the typical population. He described this by
saying that the height of the offspring regresses towards a
mediocre point.
All in all, regression thus describes the mean value of a response
variable as a function of one or more explanatory variables.
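
The effect Galton described can be reproduced with a small simulation. The sketch below is not based on his data: heights are expressed as deviations from the population mean in standard units, and the parent-child correlation `rho = 0.5` is an arbitrary value assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
rho = 0.5  # assumed parent-child correlation (illustrative, not Galton's estimate)

# Heights as deviations from the population mean, in standard units
parent = rng.normal(size=n)
child = rho * parent + np.sqrt(1 - rho ** 2) * rng.normal(size=n)

# Look only at families where the parent is clearly taller than average
tall = parent > 1.0
mean_parent_tall = parent[tall].mean()
mean_child_tall = child[tall].mean()

print(f"tall parents:   {mean_parent_tall:.2f}")
print(f"their children: {mean_child_tall:.2f}")
```

Under these assumptions the children of tall parents are, on average, taller than the population mean (a positive deviation) but less tall than their parents, with the ratio of the two averages close to `rho`.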

Linear Regression

![alt text](images/linear_regression.png "Title")

where b0 is the intercept of the line, b1 is the slope of the line, and e denotes a vector of random deviations or
residuals, assumed to be independent and identically
normally distributed. We refer to b0 and b1 as the
regression coefficients. The intercept is the point where
the line crosses the y-axis, and the slope is the change in y for a unit change in x.
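
As a minimal sketch of how b0 and b1 can be estimated, assume the model has the usual form y = b0 + b1 x + e. The data below are synthetic, generated from a line with known (hypothetical) coefficients plus normal noise, and the estimates use the closed-form least-squares expressions for a single predictor.

```python
import numpy as np

# Synthetic data from a known line plus normal noise (values are hypothetical)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
true_b0, true_b1 = 2.0, 0.5
y = true_b0 + true_b1 * x + rng.normal(scale=0.3, size=x.size)

# Closed-form least-squares estimates for one predictor:
# slope = covariance(x, y) / variance(x); the line passes through the means
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residuals: the part of y that the fitted line does not explain
residuals = y - (b0 + b1 * x)

print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}")
```

The estimates should land close to the coefficients used to generate the data, and the residuals of a least-squares fit with an intercept sum to zero up to floating-point error.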

## 4.2 Multivariate Linear Regression

We can extend the model to
include many more variables. For example, let us consider N
observations of the response yi, with i = 1, 2, 3, . . . , N, and
M regressors xj, with j = 1, 2, 3, . . . , M. The multivariate linear regression is written as:

![alt text](images/multivariate_linear_regression.png "Title")
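
A minimal sketch of the same idea in code, assuming the model has the usual form with one intercept and one coefficient per regressor. The data are synthetic and the coefficient values are hypothetical. The intercept is handled by prepending a column of ones to the matrix of regressors, and the coefficients are obtained with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 200, 3  # N observations, M regressors

# Synthetic regressors and hypothetical true coefficients (b0, b1, b2, b3)
X = rng.normal(size=(N, M))
true_coefs = np.array([1.5, -2.0, 0.7, 3.0])

# Design matrix: a leading column of ones carries the intercept b0
design = np.column_stack([np.ones(N), X])
y = design @ true_coefs + rng.normal(scale=0.1, size=N)

# Least-squares solution of design @ coefs ~ y
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)

print(coefs)
```

With this little noise the recovered coefficients should be close to the ones used to generate the data.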

## 4.4 Brain and Body: Regression with One Variable

In [38]:
# Render matplotlib figures inline (preferred over the discouraged %pylab magic)
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import altair as alt
alt.renderers.enable('default')

RendererRegistry.enable('default')

In [39]:
mammals = pd.read_csv("Data/mammals.csv")

In [40]:
alt.Chart(mammals).mark_circle(size=60).encode(
    x='body',
    y='brain'
).interactive()


In [20]:
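# Select the predictor (body mass) and the response (brain mass)
body_data = mammals["body"]
brain_data = mammals["brain"]

In [22]:
import statsmodels.api as sm
# sm.OLS does not add an intercept on its own, so prepend a column of ones
body_data = sm.add_constant(body_data)
reggression1 = sm.OLS(brain_data, body_data).fit()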

In [25]:
# Alternative using R style formula notation
import statsmodels.formula.api as smf
regression2 = smf.ols(formula="brain~body",data=mammals).fit()

In [27]:
reggression1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | brain | R-squared: | 0.873 |
| Model: | OLS | Adj. R-squared: | 0.871 |
| Method: | Least Squares | F-statistic: | 411.2 |
| Date: | Wed, 05 Aug 2020 | Prob (F-statistic): | 1.54e-28 |
| Time: | 08:54:27 | Log-Likelihood: | -447.38 |
| No. Observations: | 62 | AIC: | 898.8 |
| Df Residuals: | 60 | BIC: | 903.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 91.0044 | 43.553 | 2.090 | 0.041 | 3.886 | 178.123 |
| body | 0.9665 | 0.048 | 20.278 | 0.000 | 0.871 | 1.062 |

| | | | |
|---|---|---|---|
| Omnibus: | 92.942 | Durbin-Watson: | 2.339 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1738.656 |
| Skew: | 4.382 | Prob(JB): | 0.0 |
| Kurtosis: | 27.417 | Cond. No. | 936.0 |


In [28]:
print(regression2.summary())

OLS Regression Results                            
Dep. Variable:                  brain   R-squared:                       0.873
Model:                            OLS   Adj. R-squared:                  0.871
Method:                 Least Squares   F-statistic:                     411.2
Date:                Wed, 05 Aug 2020   Prob (F-statistic):           1.54e-28
Time:                        08:54:45   Log-Likelihood:                -447.38
No. Observations:                  62   AIC:                             898.8
Df Residuals:                      60   BIC:                             903.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     91.0044     43.553      2.090      0.041       3.886     178.123
body           0.9665      0.048     20.278      0.000       0.871       1.062

The first quantity to look at in the summary is R², also called the
coefficient of determination. Its value ranges between 0 and 1, and it tells us how well
the model fits the data: it is the fraction of the variance of the response explained by the model.
Having said that, there are some drawbacks to looking only at the value of R². Namely, it never decreases as we
add more explanatory variables to the mix. An increase in the value of R² may therefore
not be due to the explanatory power of the new input, but simply to
the fact that we added that extra input. That is why OLS
also provides the adjusted R². It is very
similar to R², but it introduces a penalty as extra variables
are included in the model. The adjusted R² value increases only in cases where the new input improves the
model more than would be expected by pure chance.

The column named “coef” lists the
estimated values of the coefficients. Notice
that “const” (labelled “Intercept” in the formula API output) corresponds to the intercept of the model.
OLS lists the rest of the coefficients using the names of
the variables included in the model. The “std err” column
corresponds to the standard error of the estimate of
the coefficient; “t” is the so-called t-statistic, the ratio of the estimate to its standard error, and it is used
to assess how statistically significant the coefficient is. The P-value is
listed in the “P>|t|” column and it helps us determine the
significance of the result under the null hypothesis that the coefficient is equal to zero. A small P-value
(typically < 0.05) indicates strong evidence against the null hypothesis, so we keep the value obtained for the coefficient.
Finally, the “[0.025” and “0.975]” columns give us the
lower and upper bounds of the 95% confidence interval.
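
Two of the quantities described above can be checked by hand from the numbers printed in the summary. The sketch below uses the standard formula for the adjusted R², with n observations and p predictors, and the definition of the t-statistic as the coefficient divided by its standard error. All inputs are copied from the summary, so they carry its rounding.

```python
# Values copied from the OLS summary above (rounded as printed)
r_squared = 0.873
n_obs = 62         # "No. Observations"
n_predictors = 1   # "Df Model"

# Adjusted R^2 penalises each extra predictor
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# t-statistic = estimated coefficient / standard error
t_const = 91.0044 / 43.553
t_body = 0.9665 / 0.048

print(round(adj_r_squared, 3))
print(round(t_const, 3), round(t_body, 3))
```

The adjusted R² reproduces the 0.871 reported in the summary. The t-statistic for the intercept agrees with the table, while the one for body comes out slightly below the reported 20.278 because the standard error is printed with only three decimals.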

We can use the fitted model, brain = 91.0044 + 0.9665 × body, to predict the brain mass of a mammal
given its body mass, and this can easily be done with the
predict method of the fitted OLS model. Let us consider new body mass measurements that will be used to predict the brain mass
using the model obtained above. We need to prepare the
new data in a way that is compatible with the model. We
can therefore create an array of 10 new data inputs as
follows:

In [29]:
new_body = np.linspace(0,7000,10)

For the predict method of the model fitted with the formula
API we do not need to add a column of ones to our data.
Instead, we simply pass the new data points as a dictionary keyed by the name of the
independent variable (i.e. exog in StatsModels parlance) used in
the fitted model.

In [30]:
brain_pred = regression2.predict(exog=dict(body=new_body))
print(brain_pred)

0      91.004396
1     842.723793
2    1594.443190
3    2346.162587
4    3097.881985
5    3849.601382
6    4601.320779
7    5353.040176
8    6104.759573
9    6856.478970
dtype: float64


The numbers shown correspond to the brain mass
predictions for the artificial body mass measurements used
as input.
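
These predictions can be cross-checked by hand. The sketch below does not call StatsModels. It rebuilds the column of ones explicitly and multiplies the resulting matrix by the coefficients copied from the summary table. Since those coefficients are rounded, the results agree with the output above only to within a fraction of a unit.

```python
import numpy as np

# Rounded coefficients copied from the summary table
intercept, slope = 91.0044, 0.9665

# Same artificial body mass values used above
new_body = np.linspace(0, 7000, 10)

# Explicit design matrix: a column of ones (intercept) next to the predictor
design = np.column_stack([np.ones_like(new_body), new_body])
brain_manual = design @ np.array([intercept, slope])

print(brain_manual)
```

This is the step that the formula API performs for us when it builds the design matrix from the dictionary of new data.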