### Linear Regression as a Building Block for Supervised Learning

We're going to start the Supervised Learning session with Linear Regression (using scikit-learn).

Refresh your knowledge of linear regression methods:

1) Notes from Econometrics courses taken here at CEU (Econometrics 1 and 2)

2) Check out the Wikipedia page and Andrew Ng's lectures, available for free on YouTube

3) Read chapter 3 in the ISLR Book

<img src="../slides/Figures/linear-regression.png">


The figure above reproduces an example from the ISLR book: "For the Advertising data, the least squares fit for the regression of sales onto TV is shown. The fit is found by minimizing the sum of squared errors. Each grey line segment represents an error, and the fit makes a compromise by averaging their squares. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot."
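The idea of minimizing the sum of squared errors can be sketched in a few lines of NumPy. This is only an illustration on synthetic data (the numbers below are made up and are not the Advertising data); the helper `sse` and the variable names are assumptions introduced for this example.

```python
import numpy as np

# Synthetic data, assumed for illustration only (not the Advertising data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)


def sse(intercept, slope):
    """Sum of squared errors of the line intercept + slope * x."""
    return np.sum((y - (intercept + slope * x)) ** 2)


# Closed-form least squares estimates for a single feature
slope_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept_hat = y.mean() - slope_hat * x.mean()

print("intercept:", intercept_hat, "slope:", slope_hat)
print("SSE at the least squares fit:", sse(intercept_hat, slope_hat))

# Moving away from the least squares estimates increases the SSE
print("SSE with shifted intercept:", sse(intercept_hat + 1.0, slope_hat))
print("SSE with shifted slope:", sse(intercept_hat, slope_hat + 0.01))
```

Any line other than the least squares line gives a larger sum of squared errors on the same data, which is what "the fit is found by minimizing the sum of squared errors" means.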

### Step 1: Load the data

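We'll be analyzing a data set of house prices in Boston in the 1970s, the same data used in the Mueller-Guido book and in many other Python ML tutorials. The data are available in the sklearn library, so we are going to load them from there and do some basic analysis. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2, so the cells below require an older scikit-learn release.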

In [None]:
import pandas as pd

In [None]:
from sklearn.datasets import load_boston

In [2]:
# Load the housing dataset
boston = load_boston()


In [None]:
print(boston.DESCR)

In [None]:
boston_data = pd.DataFrame(boston.data, columns = boston.feature_names)

In [None]:
boston_data['Price'] = boston.target

In [None]:
boston_data.head()

### Step 2: Visualizing current data

In [None]:
# Let's plot using the pandas built-in visualization tools
boston_data.Price.hist(bins=50)

In [None]:
#Alternatively we can do this with Matplotlib
import matplotlib.pyplot as plt
plt.hist(boston_data['Price'], bins=50)
plt.xlabel('Price in $1000s')
plt.ylabel('Number of houses for each price bin')
#plt.savefig('Outputs/Histogram_Price.png')

In [None]:
# Correlations between number of rooms and prices
plt.scatter(boston_data['RM'], boston_data['Price'])
plt.ylabel('Price in $1000s')
plt.xlabel('Number of rooms')
#plt.savefig('Outputs/Corr_Price_Rooms.png')

### Challenge: Describe the type of relationship you observe in the plot above.

In [None]:
import seaborn as sns
sns.pairplot(boston_data[['RM', 'Price', 'CRIM']])

In [None]:
boston_data[['RM', 'Price', 'CRIM']].describe()

### Challenge: Use the pairplot method to plot three variables (features) of your own choice.

In scikit-learn, all estimators implement the fit() method, and estimators for supervised learning also implement predict(). The former learns the parameters of a model from data; the latter uses the learned parameters to predict the value of the response variable for given explanatory variables. Because this interface is shared, it is easy to experiment with different models.

In [None]:
from sklearn import linear_model
reg = linear_model.LinearRegression()
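As a rough illustration of this shared interface, the sketch below fits two different estimators with identical code. The data are synthetic, and `Ridge` is brought in here only as a second example estimator (it is not used elsewhere in this notebook); the names `X_demo`, `y_demo`, and `fitted` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 4.0 + X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

fitted = {}
for model in (LinearRegression(), Ridge(alpha=1.0)):
    model.fit(X_demo, y_demo)            # learn the parameters
    preds = model.predict(X_demo[:5])    # predict with the learned parameters
    fitted[type(model).__name__] = model
    print(type(model).__name__, model.coef_.round(2), preds.shape)
```

Swapping one model for another only changes the constructor call; the fit and predict steps stay the same.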

We are going to evaluate the model with a simple hold-out split, the most basic form of cross-validation. The training set is used to fit the model; the validation set (also called the test set) is used to check how well the fitted model predicts observations it has not seen. Observations in the training set are excluded from the validation set. Observations should be assigned to the training or the validation set at random.

scikit-learn helps us with `train_test_split`.

We pass it the features and the outcome, and we can also specify the fraction of observations to leave out as a test set. Remember to set the seed with `random_state` so that you can replicate your own results.

<img src="../slides/Figures/train-test-split.png">

<img src="../slides/Figures/overfitting.png">

In [None]:
# Data Columns
X = boston_data.drop(columns='Price')

# Targets
y = boston_data.Price

In [None]:
X.shape

In [None]:
X.columns

In [None]:
y.shape

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
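To see what `random_state` buys us, here is a minimal sketch on a tiny made-up array (the `X_demo` and `y_demo` arrays are assumptions for illustration, not the Boston data). Calling `train_test_split` twice with the same seed should give the same partition, and the training and test rows should not overlap.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny made-up data set: 10 observations, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows

# Same seed, same split
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same)

# No observation appears in both sets
overlap = set(y_tr.tolist()) & set(y_te.tolist())
print("overlap:", overlap)
```

Without `random_state`, each call draws a new random partition, so scores computed on the test set would change from run to run.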

In [None]:
# Fit the linear regression on the training set only,
# so that the test set stays unseen for evaluation
reg.fit(X_train, y_train)

In [None]:
#Get the intercept
reg.intercept_

In [None]:
#Get the coefficients
reg.coef_

In [None]:
# Get the coefficients and store them in a DataFrame indexed by feature name
estimated_coeff = pd.DataFrame(reg.coef_, index=X.columns, columns=['Coefficient'])

In [None]:
estimated_coeff

We want to check how well our model fits the data. We do so by using the predict method, starting with the training data; the test data come into play in the challenge and in the model evaluation below.

In [None]:
yhat_train = reg.predict(X_train)

In [None]:
#Plot true outcomes (target) and predicted values from our model

plt.scatter(y_train, yhat_train)
plt.xlabel("Prices: $Y_i$")
plt.ylabel(r"Predicted prices: $\hat{Y}_i$")
plt.title(r"Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$")

### Residual Plot

Let's check the residual plot to see if we are doing a good job. If the model is adequate, we don't expect the residual plot to display patterns. If we instead observe a pattern, e.g. residuals that are larger for higher prices, we have heteroskedasticity in the data and have to take care of it. When dealing with prices, consider a log transformation of the data: prices are often approximately log-normally distributed, with a long right tail.

In [None]:
residuals_train = y_train - yhat_train

In [None]:
plt.scatter(yhat_train, residuals_train, c='b', alpha=0.5)
plt.xlabel(r"Predicted prices: $\hat{Y}_i$")
plt.ylabel(r"Residuals: $\hat{\epsilon}_i$")
plt.title("Residuals training set")
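The effect of a log transformation on heteroskedasticity can be illustrated with a small sketch. It uses synthetic data with multiplicative noise (an assumption chosen to mimic log-normal prices, not the Boston data), and the helper `spread_trend` is a hypothetical diagnostic introduced here: the correlation between fitted values and absolute residuals, which should be near zero when the residual spread is constant.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic "prices" with multiplicative noise, assumed for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=(500, 1))
y = np.exp(1.0 + 1.0 * x[:, 0] + rng.normal(scale=0.3, size=500))


def spread_trend(target):
    """Correlation between fitted values and absolute residuals."""
    model = LinearRegression().fit(x, target)
    fitted = model.predict(x)
    return np.corrcoef(fitted, np.abs(target - fitted))[0, 1]


raw_trend = spread_trend(y)          # regression on the raw target
log_trend = spread_trend(np.log(y))  # regression on the log of the target
print("raw target:", raw_trend)
print("log target:", log_trend)
```

On data like these, the residual spread grows with the fitted value for the raw target and is roughly constant after taking logs. Whether the same holds for a real data set has to be checked on its own residual plot.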

### Challenge: Plot the residuals for the test set against y_test as above. What other type of plot can we use?

### Model evaluation

We want to quantify the extent to which the model fits the data. The quality of a linear regression fit is typically assessed using two related quantities: the root mean squared error (RMSE) and the $R^2$ statistic. scikit-learn helps us again.

In [None]:
# R2
print("Training set score: {:.2f}".format(reg.score(X_train, y_train)))
print("Test set score: {:.2f}".format(reg.score(X_test, y_test)))
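How the two quantities are related can be sketched on synthetic data (the arrays and names below are assumptions for illustration, not the Boston data). The RMSE is the square root of the mean squared error, and the $R^2$ compares that mean squared error with the variance of the outcome: $R^2 = 1 - \mathrm{MSE} / \mathrm{Var}(y)$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X_demo, y_demo)
y_pred = model.predict(X_demo)

mse = mean_squared_error(y_demo, y_pred)
rmse = np.sqrt(mse)                        # same units as the outcome
r2 = r2_score(y_demo, y_pred)              # unit-free share of variance explained
r2_from_mse = 1.0 - mse / np.var(y_demo)   # R2 rebuilt from the MSE

print("RMSE:", rmse)
print("R2:", r2, "R2 from MSE:", r2_from_mse)
```

The RMSE is expressed in the units of the outcome, while the $R^2$ is scaled by how much the outcome varies, so the same RMSE can correspond to very different $R^2$ values on different data sets.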

### Challenge: What does the $R^2$ tell us? Do you think the RMSE will provide a different answer than the $R^2$? Use <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html">MSE from scikit-learn</a> to answer this question.

### References
https://scikit-learn.org/stable/modules/linear_model.html

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

https://medium.com/@haydar_ai/learning-data-science-day-9-linear-regression-on-boston-housing-dataset-cd62a80775ef

https://towardsdatascience.com/heteroscedasticity-is-nothing-to-be-afraid-of-730dd3f7ca1f

https://nbviewer.jupyter.org/github/jmportilla/Udemy---Machine-Learning/blob/master/Supervised%20Learning%20-%20%20Linear%20Regression.ipynb