# Linear Regressions
- Author: Congxin (David) Xu
- Date: 2021/01/12


## Description

This tutorial shows how to implement linear regression in `Python`. It covers:

- Ordinary Least Squares Regression
- Step-wise Regression
- Penalized Linear Regression
  - Lasso Regression
  - Ridge Regression
  - Elastic Net Regression

## Package Dependency

- [`pandas`](https://pandas.pydata.org/)
  - We will mainly use `pandas` for data manipulation and visualization.
- [`numpy`](https://numpy.org/)
  - We will mainly use `numpy` for calculations and data manipulation. 
- [`sklearn`](https://scikit-learn.org/stable/)
  - Title: scikit-learn: machine learning in Python
  - This package contains the `sklearn.linear_model.LinearRegression` class, which fits the ordinary least squares regression.
  - We will also use `sklearn.model_selection.GridSearchCV` to perform cross-validation.

  
## Use Case

- Linear regression models assume a linear relationship between the response variable and the predictors. They can be applied to most regression problems.
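
Written out for $p$ predictors, the assumed relationship takes the standard form below. The symbols $\beta_j$ (coefficients) and $\varepsilon$ (error term) are conventional notation introduced here for illustration.

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
$$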

## Caution

- If you care more about the inference of the model or the interpretation of the model, you need to pay attention to the potential violation of the assumptions of linear regression models. 
- If you care more about the predictive power of the model, you need to pay attention to the accuracy of the model.
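
For the second point, one common way to quantify accuracy is to hold out part of the data and compute the root mean squared error (RMSE) and $R^2$ on it. The sketch below is a minimal illustration on synthetic data; the data, the 150/50 split, and the variable names are assumptions and not part of the housing example.

```python
import numpy
import sklearn.linear_model
import sklearn.metrics

# Synthetic data (an assumption for illustration, not the housing data)
rng = numpy.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Hold out the last 50 rows to measure out-of-sample accuracy
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

model = sklearn.linear_model.LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# RMSE is in the units of the response; R^2 is unitless
rmse = numpy.sqrt(sklearn.metrics.mean_squared_error(y_test, y_pred))
r2 = sklearn.metrics.r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```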

## Tutorial
Load the required libraries.

In [10]:
import pandas
import numpy
import sklearn.linear_model

The data we will use is the housing price data from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

- Response Variable: **`price`**

**Read and Preview the Training Data**

In [11]:
train = pandas.read_csv(".\\Data\\realestate-train.csv")
train.head()

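|   | price | PoolArea | GarageCars | Fireplaces | TotRmsAbvGrd | Baths | SqFeet | CentralAir | Age | LotSize | BldgType | HouseStyle | condition |
|---|-------|----------|------------|------------|--------------|-------|--------|------------|-----|---------|----------|------------|-----------|
| 0 | 208.5 | 0 | 2 | 0 | 8 | 3 | 1710 | Y | 5 | 8450 | 1Fam | 2Story | 5 |
| 1 | 140.0 | 0 | 3 | 1 | 7 | 1 | 1717 | Y | 91 | 9550 | 1Fam | 2Story | 5 |
| 2 | 250.0 | 0 | 3 | 1 | 9 | 3 | 2198 | Y | 8 | 14260 | 1Fam | 2Story | 5 |
| 3 | 143.0 | 0 | 2 | 0 | 5 | 2 | 1362 | Y | 16 | 14115 | 1Fam | 1.5Fin | 5 |
| 4 | 307.0 | 0 | 2 | 1 | 7 | 2 | 1694 | Y | 3 | 10084 | 1Fam | 1Story | 5 |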


**Read and Preview the Testing Data**

In [12]:
# Note: this path points at the training file, so `test` holds the same rows as `train`
test = pandas.read_csv(".\\Data\\realestate-train.csv")
test.head()

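|   | price | PoolArea | GarageCars | Fireplaces | TotRmsAbvGrd | Baths | SqFeet | CentralAir | Age | LotSize | BldgType | HouseStyle | condition |
|---|-------|----------|------------|------------|--------------|-------|--------|------------|-----|---------|----------|------------|-----------|
| 0 | 208.5 | 0 | 2 | 0 | 8 | 3 | 1710 | Y | 5 | 8450 | 1Fam | 2Story | 5 |
| 1 | 140.0 | 0 | 3 | 1 | 7 | 1 | 1717 | Y | 91 | 9550 | 1Fam | 2Story | 5 |
| 2 | 250.0 | 0 | 3 | 1 | 9 | 3 | 2198 | Y | 8 | 14260 | 1Fam | 2Story | 5 |
| 3 | 143.0 | 0 | 2 | 0 | 5 | 2 | 1362 | Y | 16 | 14115 | 1Fam | 1.5Fin | 5 |
| 4 | 307.0 | 0 | 2 | 1 | 7 | 2 | 1694 | Y | 3 | 10084 | 1Fam | 1Story | 5 |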



### Ordinary Least Squares Regression

**Assumptions**

1. The errors, for each fixed value of $x$, have mean 0.
2. The errors, for each fixed value of $x$, have constant variance.
3. The errors are independent.
4. The errors, for each fixed value of $x$, follow a normal distribution.
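
A rough way to probe the first two assumptions is to fit the model and compare the residuals across different ranges of a predictor. The sketch below does this on synthetic data with a single predictor; the data, the split at the median, and the variable names are assumptions chosen for illustration, and in practice a residual plot is usually more informative.

```python
import numpy
import sklearn.linear_model

# Synthetic data with a truly linear relationship and constant error variance
rng = numpy.random.default_rng(1)
x = rng.uniform(0, 10, size=300)
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=300)

model = sklearn.linear_model.LinearRegression().fit(x.reshape(-1, 1), y)
residuals = y - model.predict(x.reshape(-1, 1))

# With an intercept, training residuals always average to zero overall,
# so the mean and spread are compared within each half of x instead.
low = residuals[x < numpy.median(x)]
high = residuals[x >= numpy.median(x)]

print(f"mean (low x):  {low.mean():.3f}, mean (high x): {high.mean():.3f}")
print(f"std  (low x):  {low.std():.3f}, std  (high x): {high.std():.3f}")
```

Means far from zero in either half would suggest a non-linear pattern, and clearly different standard deviations would suggest non-constant variance.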

**For this section, we will just focus on the following predictors:**

- `SqFeet`: *numeric*
- `Age`: *numeric*
- `Baths`: *numeric*
- `TotRmsAbvGrd`: *numeric*
- `BldgType`: *categorical*

Because the last predictor, `BldgType`, is a categorical variable, we need to convert that column to dummy variables. We will use the function `pandas.get_dummies(df, drop_first=True)` to get `n - 1` dummy variables, where `n` is the number of levels within the `BldgType` column. The dropped first level serves as the baseline.

In [22]:
pandas.get_dummies(train[['SqFeet', 'Age', 'Baths', 'TotRmsAbvGrd', 'BldgType']], drop_first=True).head()

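|   | SqFeet | Age | Baths | TotRmsAbvGrd | BldgType_2fmCon | BldgType_Duplex | BldgType_Twnhs | BldgType_TwnhsE |
|---|--------|-----|-------|--------------|-----------------|-----------------|----------------|-----------------|
| 0 | 1710 | 5 | 3 | 8 | 0 | 0 | 0 | 0 |
| 1 | 1717 | 91 | 1 | 7 | 0 | 0 | 0 | 0 |
| 2 | 2198 | 8 | 3 | 9 | 0 | 0 | 0 | 0 |
| 3 | 1362 | 16 | 2 | 5 | 0 | 0 | 0 | 0 |
| 4 | 1694 | 3 | 2 | 7 | 0 | 0 | 0 | 0 |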
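
To see which level `drop_first=True` removes, the small sketch below applies `pandas.get_dummies` to a hypothetical three-row frame (the `toy` data is an assumption, not taken from the housing file). Levels are sorted, so `1Fam` comes first and is dropped, which matches the dummy columns shown above.

```python
import pandas

# Hypothetical toy frame with three building types
toy = pandas.DataFrame({
    "SqFeet": [1000, 1500, 2000],
    "BldgType": ["1Fam", "Duplex", "Twnhs"],
})

# Numeric columns pass through; the categorical column becomes n - 1 dummies
dummies = pandas.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())
```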


**Set Up the Model**

In [23]:
# Build the design matrix: numeric predictors plus dummy variables for BldgType
X_train = pandas.get_dummies(train[['SqFeet', 'Age', 'Baths', 'TotRmsAbvGrd', 'BldgType']],
                             drop_first=True)

# Train the linear model
linear_model = sklearn.linear_model.LinearRegression().fit(X=X_train, y=train[['price']])

In [28]:
# Report the coefficients
linear_model.coef_

array([[  0.12599788,  -1.19301576, -16.14740865,  -2.8715537 ,
        -13.88413882, -48.50312778, -34.41530279, -10.47523877]])

In [29]:
# Report the intercept
linear_model.intercept_

array([88.84982744])
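
The coefficient array follows the column order of the design matrix, so it can be easier to read when paired with the column names. The sketch below shows one way to do that on synthetic data; the data and variable names are assumptions for illustration. `numpy.ravel` is used so the same line works whether `coef_` is one-dimensional or, as in the output above, two-dimensional.

```python
import numpy
import pandas
import sklearn.linear_model

# Synthetic predictors and response (assumed values, not the housing data)
rng = numpy.random.default_rng(2)
X = pandas.DataFrame({
    "SqFeet": rng.uniform(800, 3000, size=100),
    "Age": rng.uniform(0, 100, size=100),
})
y = 50 + 0.1 * X["SqFeet"] - 0.5 * X["Age"] + rng.normal(scale=5, size=100)

model = sklearn.linear_model.LinearRegression().fit(X, y)

# Label each coefficient with the column it belongs to
coefficients = pandas.Series(numpy.ravel(model.coef_), index=X.columns)
print(coefficients)
```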

**Making Predictions**

In [30]:
linear_model.predict(X = pandas.get_dummies(test[['SqFeet', 'Age', 'Baths', 'TotRmsAbvGrd', 'BldgType']],
                                            drop_first=True))

array([[226.92645952],
       [160.37546013],
       [281.96282159],
       ...,
       [259.47122497],
       [243.22796703],
       [122.58941395]])
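
Note that `predict` expects the same columns, in the same order, as the data used in `fit`. If a level of a categorical predictor is absent from the new data, `get_dummies` will produce fewer columns. One possible way to handle this, sketched below on hypothetical toy frames, is to encode the new data without `drop_first` and then reindex it to the training columns.

```python
import pandas

# Hypothetical toy frames; the test frame lacks the "1Fam" and "Duplex" levels
train_toy = pandas.DataFrame({
    "SqFeet": [1000, 1500, 2000],
    "BldgType": ["1Fam", "Duplex", "Twnhs"],
})
test_toy = pandas.DataFrame({
    "SqFeet": [1200, 1800],
    "BldgType": ["Twnhs", "Twnhs"],
})

X_train = pandas.get_dummies(train_toy, drop_first=True)

# Encode every level present in the test data, then keep exactly the
# training columns; columns missing from the test data are filled with 0
X_test = pandas.get_dummies(test_toy).reindex(columns=X_train.columns, fill_value=0)
print(X_test.columns.tolist())
```

Encoding the test data without `drop_first` avoids dropping a different baseline level than the one dropped during training.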