# Linear Regression

In this exercise we use the classical [Boston house-price data](http://archive.ics.uci.edu/ml/datasets/Housing) to construct a linear regression model that predicts house prices from a set of 13 attributes. This data set has been used in various machine learning papers and is available in the sklearn Python module (in versions before 1.2).

In [None]:
import numpy as np 
import pandas as pd
import scipy.stats as stats
import sklearn

In [None]:
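# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn release.
from sklearn.datasets import load_boston
boston = load_boston()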

##### Understanding the data set

In [None]:
print(boston.DESCR)

In [None]:
boston.data.shape

In [None]:
print(boston.feature_names)

##### Convert the data into a suitable pandas data frame

In [None]:
bos = pd.DataFrame(boston.data)
bos.columns = boston.feature_names
bos["PRICE"] = boston.target
bos.head()


### Q: What do we expect from our data?

##### Investigate the data set

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(bos, x_vars=['RM', 'NOX', 'INDUS'], y_vars='PRICE', height=7, aspect=0.7, kind='reg')
plt.show()

### Q: What are we looking for?


## A first naive model 

In [None]:
from sklearn.linear_model import LinearRegression
X = bos.drop("PRICE", axis = 1)

linReg = LinearRegression()

In [None]:
linReg.fit(X, bos.PRICE)
print("Estimated intercept coefficient: ", linReg.intercept_)
print("Number of coefficients: ", len(linReg.coef_))

##### Investigate Results

In [None]:
pd.DataFrame(list(zip(X.columns, linReg.coef_)), 
             columns = ["features", "estimatedCoefficients"])

### Q: How do we interpret these coefficients?
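One common reading of a linear-model coefficient is the change in the prediction when that feature rises by one unit while all other features stay fixed. The following is a minimal sketch of that reading on synthetic data (the data, the helper `predict`, and all variable names are assumptions for illustration, not part of the Boston exercise):

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Boston data):
# y depends linearly on two features plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Ordinary least squares via the design matrix [1, x1, x2].
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
intercept, coef = beta[0], beta[1:]

def predict(x):
    """Prediction of the fitted linear model for one feature vector x."""
    return intercept + x @ coef

# Raising feature 0 by one unit, with feature 1 held fixed,
# changes the prediction by coef[0].
x_before = np.array([0.5, -1.0])
x_after = x_before + np.array([1.0, 0.0])
delta = predict(x_after) - predict(x_before)
print("coef[0]:", coef[0], "change in prediction:", delta)
```

Note that this "all else fixed" reading becomes harder to trust when features are strongly correlated with each other, and the size of a coefficient depends on the units of its feature.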

##### Predict prices with the naive model

In [None]:
predNaive = linReg.predict(X)
plt.scatter(bos.PRICE, predNaive)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.show()

### Q: What do we observe?

In [None]:
from sklearn import metrics

mse_naiveModell = metrics.mean_squared_error(bos.PRICE, predNaive)
print("MSE_naiveModell: ", mse_naiveModell)

### Q: Is this a valid approach? Where are the problems?
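One way to explore this question is to compare the error on the rows used for fitting with the error on rows the model has never seen. The sketch below does this with plain NumPy on synthetic data (the data, the helpers `fit_ols` and `mse`, and the fixed split are assumptions for illustration); with many features and few samples the training error tends to look far better than the held-out error:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 30                                # few samples, many features
X_demo = rng.normal(size=(n, p))
y_demo = X_demo[:, 0] + rng.normal(size=n)   # only the first feature matters

def fit_ols(X, y):
    """Least-squares coefficients for a model with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def mse(beta, X, y):
    """Mean squared error of the fitted model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.mean((y - A @ beta) ** 2)

# Hold out the last third of the rows (they are already in random order).
X_tr, X_te = X_demo[:40], X_demo[40:]
y_tr, y_te = y_demo[:40], y_demo[40:]

beta = fit_ols(X_tr, y_tr)
mse_train = mse(beta, X_tr, y_tr)
mse_test = mse(beta, X_te, y_te)
print("train MSE:", mse_train, "test MSE:", mse_test)
```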

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    X, bos.PRICE, test_size=0.33)
print(X_train.shape)
print(X_test.shape)

## Build the linear regression model using only the training data

In [None]:
linReg = LinearRegression()
linReg.fit(X_train, Y_train)
pred_linReg = linReg.predict(X_test)

mse = metrics.mean_squared_error(Y_test, pred_linReg)
print("MSE: ", mse)

##### Consider Residual Plots

In [None]:
pred_train = linReg.predict(X_train)

# Residuals here are predicted minus observed prices.
plt.scatter(pred_train, pred_train - Y_train)
plt.hlines(y=0, xmin=-10, xmax=50)
plt.xlabel("Predicted Prices")
plt.ylabel("Residuals")
plt.show()

### Q: What do we expect from the residuals?
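For a least-squares fit with an intercept, the training residuals average to zero and are uncorrelated with the fitted values by construction. What a residual plot can still reveal is structure such as curvature or a spread that grows with the prediction. A small sketch on synthetic data (data and variable names are assumptions for illustration) checks the two built-in properties:

```python
import numpy as np

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(150, 3))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=150)

# Least-squares fit with an intercept column.
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
fitted = A @ beta
residuals = fitted - y_demo   # same sign convention as the residual plot cell

mean_residual = residuals.mean()
corr_fitted_residual = np.corrcoef(fitted, residuals)[0, 1]
print("mean residual:", mean_residual)
print("corr(fitted, residuals):", corr_fitted_residual)
```

Both numbers are zero up to floating-point error on the training data; on held-out data neither property is guaranteed.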

### Q: How to proceed?

## Excursion: Other Linear Models
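Ridge regression adds an L2 penalty `alpha * ||w||^2` to the least-squares objective, while Lasso adds an L1 penalty that can set some coefficients exactly to zero. The sketch below shows the ridge closed-form solution in plain NumPy on synthetic data. It omits the intercept for brevity, which is a simplification (sklearn's `Ridge` fits an unpenalised intercept by default), and the helper `ridge_coefficients` is a hypothetical name:

```python
import numpy as np

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

def ridge_coefficients(X, y, alpha):
    """Minimise ||y - X w||^2 + alpha * ||w||^2 (no intercept, for brevity)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

w_ols = ridge_coefficients(X_demo, y_demo, alpha=0.0)    # alpha = 0: ordinary least squares
w_ridge = ridge_coefficients(X_demo, y_demo, alpha=50.0)

print("||w|| with alpha=0: ", np.linalg.norm(w_ols))
print("||w|| with alpha=50:", np.linalg.norm(w_ridge))
```

A larger `alpha` shrinks the coefficient vector towards zero, trading some bias for lower variance.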

In [None]:
from sklearn.linear_model import Ridge, Lasso

ridgeReg = Ridge(alpha=1)
ridgeReg.fit(X_train,Y_train)
pred_ridgeReg = ridgeReg.predict(X_test)

lassoReg = Lasso(alpha=1)
lassoReg.fit(X_train, Y_train)
pred_lassoReg = lassoReg.predict(X_test)

print("MSE LinReg: ", metrics.mean_squared_error(Y_test, pred_linReg))
print("MSE RidgeReg: ", metrics.mean_squared_error(Y_test, pred_ridgeReg))
print("MSE LassoReg: ", metrics.mean_squared_error(Y_test, pred_lassoReg))

In [None]:
print("R-Squared LinReg: ", linReg.score(X_test, Y_test))
print("R-Squared RidgeReg: ", ridgeReg.score(X_test, Y_test))
print("R-Squared LassoReg: ", lassoReg.score(X_test, Y_test))
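
The two metrics used above can be written out by hand. The `score` method of sklearn regressors returns the coefficient of determination R², which is 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with toy numbers (the helpers `mse` and `r_squared` and the values are assumptions for illustration, not results from the Boston data):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared prediction errors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]   # toy values, not from the Boston data
y_pred = [2.5, 5.5, 7.0, 9.0]

print("MSE:", mse(y_true, y_pred))        # 0.125
print("R^2:", r_squared(y_true, y_pred))  # 0.975
```

A model that always predicts the mean of `y_true` scores R² = 0, and on test data R² can become negative when the model does worse than that baseline.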