# Linear Regressions
- Author: Congxin (David) Xu
- Date: 2021/01/12


## Description

This tutorial is going to discuss how to implement linear regressions in `Python`. We are going to cover:

- Ordinary Least Squares Regression
- Step-wise Regression
- Penalized Linear Regression
  - Lasso Regression
  - Ridge Regression
  - Elastic Net Regression

## Package Dependency

- [`pandas`](https://pandas.pydata.org/)
  - We will mainly use `pandas` for data manipulation and visualization.
- [`numpy`](https://numpy.org/)
  - We will mainly use `numpy` for calculations and data manipulation. 
- [`sklearn`](https://scikit-learn.org/stable/)
  - Title: scikit-learn: machine learning in Python
  - This package provides the `sklearn.linear_model` module, which contains `LinearRegression`, `LassoCV`, `RidgeCV`, and `ElasticNetCV`, the estimators used in this tutorial.
  - The `CV` variants of these estimators choose the penalty by cross validation.

  
## Use Case

- Linear regression models assume a linear relationship between the response variable and the predictors. They can be applied to almost any regression problem with a numeric response.

## Caution

- If you care more about the inference of the model or the interpretation of the model, you need to pay attention to the potential violation of the assumptions of linear regression models. 
- If you care more about the predictive power of the model, you need to pay attention to the accuracy of the model on data it was not trained on.

## Tutorial
Load the required libraries.

In [1]:
import pandas
import numpy
import sklearn.linear_model

The data we will use is the housing price data from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

- Response Variable: **`price`**

**Read and Preview the Training Data**

In [2]:
train = pandas.read_csv(".\\Data\\realestate-train.csv")
train.head()

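| | price | PoolArea | GarageCars | Fireplaces | TotRmsAbvGrd | Baths | SqFeet | CentralAir | Age | LotSize | BldgType | HouseStyle | condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 208.5 | 0 | 2 | 0 | 8 | 3 | 1710 | Y | 5 | 8450 | 1Fam | 2Story | 5 |
| 1 | 140.0 | 0 | 3 | 1 | 7 | 1 | 1717 | Y | 91 | 9550 | 1Fam | 2Story | 5 |
| 2 | 250.0 | 0 | 3 | 1 | 9 | 3 | 2198 | Y | 8 | 14260 | 1Fam | 2Story | 5 |
| 3 | 143.0 | 0 | 2 | 0 | 5 | 2 | 1362 | Y | 16 | 14115 | 1Fam | 1.5Fin | 5 |
| 4 | 307.0 | 0 | 2 | 1 | 7 | 2 | 1694 | Y | 3 | 10084 | 1Fam | 1Story | 5 |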


**Read and Preview the Testing Data**

In [3]:
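# NOTE: this path points to the same file as the training data above;
# replace it with a held-out test file if one is available.
test = pandas.read_csv(".\\Data\\realestate-train.csv")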
test.head()

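| | price | PoolArea | GarageCars | Fireplaces | TotRmsAbvGrd | Baths | SqFeet | CentralAir | Age | LotSize | BldgType | HouseStyle | condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 208.5 | 0 | 2 | 0 | 8 | 3 | 1710 | Y | 5 | 8450 | 1Fam | 2Story | 5 |
| 1 | 140.0 | 0 | 3 | 1 | 7 | 1 | 1717 | Y | 91 | 9550 | 1Fam | 2Story | 5 |
| 2 | 250.0 | 0 | 3 | 1 | 9 | 3 | 2198 | Y | 8 | 14260 | 1Fam | 2Story | 5 |
| 3 | 143.0 | 0 | 2 | 0 | 5 | 2 | 1362 | Y | 16 | 14115 | 1Fam | 1.5Fin | 5 |
| 4 | 307.0 | 0 | 2 | 1 | 7 | 2 | 1694 | Y | 3 | 10084 | 1Fam | 1Story | 5 |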



### Ordinary Least Squares Regression

**Assumptions**

1. The errors, for each fixed value of $x$, have mean 0.
2. The errors, for each fixed value of $x$, have constant variance.
3. The errors are independent.
4. The errors, for each fixed value of $x$, follow a normal distribution.
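
These assumptions can be checked informally from the residuals of a fitted model. The sketch below uses synthetic data (not the housing data) and hypothetical variable names. It fits a line, then compares the residual spread in the lower and upper halves of the fitted values as a crude constant-variance check.

```python
import numpy
import sklearn.linear_model

# Synthetic data (an assumption for illustration): y = 2 + 3x + noise
rng = numpy.random.default_rng(0)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = 2 + 3 * x[:, 0] + rng.normal(0, 1, size=200)

model = sklearn.linear_model.LinearRegression().fit(x, y)
fitted = model.predict(x)
residuals = y - fitted

# With an intercept, OLS residuals average to zero by construction
residual_mean = residuals.mean()

# Crude constant-variance check: residual spread for low vs. high fitted values
low = residuals[fitted <= numpy.median(fitted)]
high = residuals[fitted > numpy.median(fitted)]
spread_ratio = high.std() / low.std()

print(residual_mean, spread_ratio)
```

A ratio far from 1 would hint at non-constant variance. A plot of residuals against fitted values shows the same information visually.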

**For this section, we will just focus on the following predictors:**

- `SqFeet`: *numeric*
- `Age`: *numeric*
- `Baths`: *numeric*
- `TotRmsAbvGrd`: *numeric*
- `BldgType`: *categorical*

Because the last predictor, `BldgType`, is a categorical variable, we need to convert that column to dummy variables. We will use `pandas.get_dummies(df, drop_first=True)`, which produces `n - 1` dummy variables, where `n` is the number of levels in the `BldgType` column. The dropped level serves as the reference category.

In [4]:
pandas.get_dummies(train[['SqFeet', 'Age', 'Baths', 'TotRmsAbvGrd', 'BldgType']], drop_first=True).head()

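| | SqFeet | Age | Baths | TotRmsAbvGrd | BldgType_2fmCon | BldgType_Duplex | BldgType_Twnhs | BldgType_TwnhsE |
|---|---|---|---|---|---|---|---|---|
| 0 | 1710 | 5 | 3 | 8 | 0 | 0 | 0 | 0 |
| 1 | 1717 | 91 | 1 | 7 | 0 | 0 | 0 | 0 |
| 2 | 2198 | 8 | 3 | 9 | 0 | 0 | 0 | 0 |
| 3 | 1362 | 16 | 2 | 5 | 0 | 0 | 0 | 0 |
| 4 | 1694 | 3 | 2 | 7 | 0 | 0 | 0 | 0 |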
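
One practical caveat with dummy encoding: if the data passed to `predict` lacks a level that appeared in training, `get_dummies` produces fewer columns and the model receives a mismatched design matrix. The sketch below, on hypothetical toy frames, aligns the test columns to the training columns with `reindex`.

```python
import pandas

# Hypothetical toy frames; the test frame lacks the "Twnhs" level
toy_train = pandas.DataFrame({"SqFeet": [1500, 900, 1200],
                              "BldgType": ["1Fam", "Duplex", "Twnhs"]})
toy_test = pandas.DataFrame({"SqFeet": [1100, 1300],
                             "BldgType": ["1Fam", "Duplex"]})

X_train = pandas.get_dummies(toy_train, drop_first=True)
X_test = pandas.get_dummies(toy_test, drop_first=True)

# Align the test columns to the training columns, filling absent dummies with 0
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

print(list(X_train.columns))
print(list(X_test.columns))
```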


**Set Up the Model**
- Main Function: [`sklearn.linear_model.LinearRegression()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#)

In [5]:
# Train the linear model
linear_model = sklearn.linear_model.\
    LinearRegression().fit(X = pandas.get_dummies(train[['SqFeet', 'Age', 'Baths', 'TotRmsAbvGrd', 'BldgType']],
                                                  drop_first=True),
                           y = train.price)

In [6]:
# Report the coefficients
linear_model.coef_

array([  0.12599788,  -1.19301576, -16.14740865,  -2.8715537 ,
       -13.88413882, -48.50312778, -34.41530279, -10.47523877])

In [7]:
# Report the intercept
linear_model.intercept_

88.84982744287555
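
The coefficient array follows the column order of the design matrix, so pairing it with the column names makes it easier to read. A small sketch on a hypothetical toy frame, built so that the true coefficients are known:

```python
import pandas
import sklearn.linear_model

# Hypothetical toy data where price = 10 + 0.1 * SqFeet - 1 * Age exactly
toy = pandas.DataFrame({"SqFeet": [1000, 1500, 2000, 2500, 1200],
                        "Age": [10, 5, 20, 1, 30]})
toy["price"] = 10 + 0.1 * toy["SqFeet"] - 1.0 * toy["Age"]

X = toy[["SqFeet", "Age"]]
toy_model = sklearn.linear_model.LinearRegression().fit(X, toy["price"])

# Label each coefficient with the column it belongs to
coef_table = pandas.Series(toy_model.coef_, index=X.columns)
print(coef_table)
print(toy_model.intercept_)
```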

**Making Predictions**

In [8]:
linear_model.predict(X = pandas.get_dummies(test[['SqFeet', 'Age', 'Baths', 'TotRmsAbvGrd', 'BldgType']],
                                            drop_first=True))

array([226.92645952, 160.37546013, 281.96282159, ..., 259.47122497,
       243.22796703, 122.58941395])
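
When predictive power is the goal, it helps to score the predictions against known responses. The following is a minimal sketch on synthetic data with a hypothetical train/test split. On the housing data, the true values would be the `price` column of a held-out set.

```python
import numpy
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection

# Synthetic data (an assumption for illustration)
rng = numpy.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 5 + X @ numpy.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, size=300)

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.25, random_state=0)

model = sklearn.linear_model.LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Root mean squared error and R-squared on the held-out rows
rmse = numpy.sqrt(sklearn.metrics.mean_squared_error(y_test, y_pred))
r2 = sklearn.metrics.r2_score(y_test, y_pred)
print(rmse, r2)
```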

### Stepwise Regression Model
At the time of writing, we did not find a good stepwise regression implementation in `sklearn`, so we do not include one here. The following pages discuss the options:
1. https://datascience.stackexchange.com/questions/937/does-scikit-learn-have-forward-selection-stepwise-regression-algorithm
2. https://stackoverflow.com/questions/15433372/stepwise-regression-in-python
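
Newer scikit-learn releases (0.24 and later) include `sklearn.feature_selection.SequentialFeatureSelector`, which performs a greedy forward or backward search. It scores candidate features by cross validation rather than by p-values or AIC, so it is related to classical stepwise regression but not the same procedure. A minimal sketch on synthetic data, where the number of features to keep is a hypothetical choice:

```python
import numpy
import sklearn.feature_selection
import sklearn.linear_model

# Synthetic data: only the first two of four columns drive the response
rng = numpy.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, size=200)

selector = sklearn.feature_selection.SequentialFeatureSelector(
    sklearn.linear_model.LinearRegression(),
    n_features_to_select=2,   # hypothetical choice for this sketch
    direction="forward",
    cv=5,
)
selector.fit(X, y)

selected = selector.get_support()  # boolean mask over the columns of X
print(selected)
```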


### Penalized Linear Regression

- $\alpha$ controls the strength of the penalty on the size of the coefficients: a larger $\alpha$ shrinks the coefficients more.
- **The R HTML report calls the penalty $\lambda$; this Python report calls it $\alpha$.**
- We will first use cross validation to determine the best penalty for this model.
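
For reference, the objectives that the scikit-learn documentation gives for these estimators are written below, with the intercept omitted for brevity. Here $n$ is the number of observations, $w$ is the coefficient vector, and $\rho$ stands in for `l1_ratio` (the symbol $\rho$ is our notation, not the library's).

$$
\begin{aligned}
\text{Lasso:} \quad & \min_{w} \ \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Ridge:} \quad & \min_{w} \ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Elastic Net:} \quad & \min_{w} \ \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
$$

Setting $\rho = 1$ reduces the Elastic Net objective to the Lasso objective. The Ridge loss is not divided by $2n$, so $\alpha$ values are not directly comparable between Ridge and the other two models.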

#### Lasso Regression 

- **Lasso tends to shrink the coefficients of some predictors all the way to 0, which removes those predictors from the model.**



In [9]:
lasso_cv_model = sklearn.linear_model.LassoCV(cv = 10, random_state = 666).\
    fit(X = train[['SqFeet', 'Age', 'Baths', 
                   'TotRmsAbvGrd', 'PoolArea', 'GarageCars', 
                   'Fireplaces', 'LotSize', 'condition']],
        y = train.price)

**Report the Lasso CV Penalty `alpha_`**.

In [10]:
lasso_cv_model.alpha_

222.75694053734532

**Report the Lasso CV Model Coefficients**

In [11]:
pandas.DataFrame(lasso_cv_model.coef_.reshape(1,-1),
                 columns=['SqFeet', 'Age', 'Baths', 
                          'TotRmsAbvGrd', 'PoolArea', 'GarageCars', 
                          'Fireplaces', 'LotSize', 'condition'])

| | SqFeet | Age | Baths | TotRmsAbvGrd | PoolArea | GarageCars | Fireplaces | LotSize | condition |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.102112 | -0.783919 | -0.0 | -0.0 | 0.0 | 0.0 | 0.0 | 0.00079 | 0.0 |


Only `SqFeet`, `Age`, and `LotSize` have non-zero coefficients, so these are the predictors selected by the cross-validated Lasso model.
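
The same selection can be read off programmatically by keeping the columns whose coefficient is not zero. A minimal sketch on synthetic data, with a fixed, hypothetical `alpha` in place of the cross-validated one:

```python
import numpy
import pandas
import sklearn.linear_model

# Synthetic data: only x0 and x1 drive the response
rng = numpy.random.default_rng(3)
X = pandas.DataFrame(rng.normal(size=(500, 4)),
                     columns=["x0", "x1", "x2", "x3"])
y = 3 * X["x0"] - 2 * X["x1"] + rng.normal(0, 0.5, size=500)

# alpha=0.5 is a hypothetical value chosen for this sketch
lasso = sklearn.linear_model.Lasso(alpha=0.5).fit(X, y)

# Keep the names of the columns with a non-zero coefficient
coefs = pandas.Series(lasso.coef_, index=X.columns)
selected = coefs[coefs != 0].index.tolist()
print(selected)
```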

**Making Predictions**

In [12]:
lasso_cv_model.predict(X = test[['SqFeet', 'Age', 'Baths', 
                                 'TotRmsAbvGrd', 'PoolArea', 'GarageCars', 
                                 'Fireplaces', 'LotSize', 'condition']])

array([225.23103146, 159.39760038, 277.29909107, ..., 244.86400869,
       239.85853225, 118.58125091])

#### Ridge Regression

- **Ridge shrinks the coefficients toward 0 but tends to keep all the predictors.**

In [13]:
ridge_cv_model = sklearn.linear_model.RidgeCV(cv = 10).\
    fit(X = train[['SqFeet', 'Age', 'Baths', 
                   'TotRmsAbvGrd', 'PoolArea', 'GarageCars', 
                   'Fireplaces', 'LotSize', 'condition']],
        y = train.price)

**Report the Ridge CV Penalty `alpha_`**.

In [14]:
ridge_cv_model.alpha_

10.0
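
`RidgeCV` searches only the values passed in `alphas`, which default to `(0.1, 1.0, 10.0)`. A selected value at the edge of that grid, such as the 10.0 reported here, suggests the search range may be too narrow. A sketch with a wider, hypothetical grid on synthetic data:

```python
import numpy
import sklearn.linear_model

# Synthetic data (an assumption for illustration)
rng = numpy.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = X @ numpy.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(0, 1.0, size=200)

# Hypothetical log-spaced grid from 0.001 to 1000
alphas = numpy.logspace(-3, 3, 13)
ridge = sklearn.linear_model.RidgeCV(alphas=alphas, cv=10).fit(X, y)

print(ridge.alpha_)
```

If the chosen value again lands on the largest or smallest grid point, the grid can be extended further in that direction.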

**Report the Ridge CV Model Coefficients**

In [15]:
pandas.DataFrame(ridge_cv_model.coef_.reshape(1,-1),
                 columns=['SqFeet', 'Age', 'Baths', 
                          'TotRmsAbvGrd', 'PoolArea', 'GarageCars', 
                          'Fireplaces', 'LotSize', 'condition'])

| | SqFeet | Age | Baths | TotRmsAbvGrd | PoolArea | GarageCars | Fireplaces | LotSize | condition |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.105516 | -1.082253 | -14.205147 | -3.037337 | 0.037954 | 17.507441 | 12.659982 | 0.00053 | 10.004051 |


**Making Predictions**

In [16]:
ridge_cv_model.predict(X = test[['SqFeet', 'Age', 'Baths', 
                                 'TotRmsAbvGrd', 'PoolArea', 'GarageCars', 
                                 'Fireplaces', 'LotSize', 'condition']])

array([206.6501121 , 176.51334752, 285.10609287, ..., 270.80360447,
       263.17145672, 111.13089136])

#### Elastic Net Regression
- Elastic Net has two tuning parameters: `alpha` (the overall penalty strength) and `l1_ratio` (the mix between the L1 and L2 penalties, where 1 is a pure Lasso penalty).
- We will pass a list of values for `l1_ratio` to the `sklearn.linear_model.ElasticNetCV()` function and let the cross validation help us choose the best value for `l1_ratio`.
    - We use `numpy.arange(0.01, 1.0, 0.01)`, which gives 99 values from 0.01 to 0.99. The stop value 1.0 is excluded.
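
`numpy.arange()` excludes its stop value, so the grid above ends at 0.99 and never tries a pure Lasso penalty (`l1_ratio = 1`). If the full range is wanted, `numpy.linspace()` is one alternative. The sketch below only compares the two grids and is not part of the original analysis.

```python
import numpy

# arange stops before the stop value: 0.01, 0.02, ..., 0.99
grid_arange = numpy.arange(0.01, 1.0, 0.01)

# linspace includes both endpoints: 0.01, 0.02, ..., 1.00
grid_linspace = numpy.linspace(0.01, 1.0, 100)

print(len(grid_arange), grid_arange[-1])
print(len(grid_linspace), grid_linspace[-1])
```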

In [17]:
elastic_net_model = sklearn.linear_model.ElasticNetCV(l1_ratio = numpy.arange(0.01, 1.0, 0.01), 
                                                      cv = 10, random_state = 666).\
    fit(X = train[['SqFeet', 'Age', 'Baths', 
                   'TotRmsAbvGrd', 'PoolArea', 'GarageCars', 
                   'Fireplaces', 'LotSize', 'condition']],
        y = train.price)

**Report the Elastic Net CV L1 Ratio**

In [18]:
elastic_net_model.l1_ratio_

0.99

**Report the Elastic Net Penalty**

In [19]:
elastic_net_model.alpha_

225.00701064378327

**Report the Elastic Net CV Model Coefficients**

In [20]:
pandas.DataFrame(elastic_net_model.coef_.reshape(1,-1),
                 columns=['SqFeet', 'Age', 'Baths', 
                          'TotRmsAbvGrd', 'PoolArea', 'GarageCars', 
                          'Fireplaces', 'LotSize', 'condition'])

| | SqFeet | Age | Baths | TotRmsAbvGrd | PoolArea | GarageCars | Fireplaces | LotSize | condition |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.102138 | -0.781866 | -0.0 | -0.0 | 0.0 | 0.0 | 0.0 | 0.00079 | 0.0 |


**Making Predictions**

In [21]:
elastic_net_model.predict(X = test[['SqFeet', 'Age', 'Baths', 
                                    'TotRmsAbvGrd', 'PoolArea', 'GarageCars', 
                                    'Fireplaces', 'LotSize', 'condition']])

array([225.17355639, 159.51656131, 277.25866825, ..., 244.86995935,
       239.94825671, 118.62033247])