In [1]:
import numpy as np
import pandas as pd

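# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import only works with scikit-learn < 1.2.
from sklearn.datasets import load_boston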
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import Ridge

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.metrics import mean_squared_error

from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyRegressor

# Boston Housing with Linear Regression

In this notebook we will work with the classic [Boston Housing dataset](https://scikit-learn.org/stable/datasets/index.html#boston-house-prices-dataset) again!

This time we will fix up some of our previous mistakes.

In [2]:
data = load_boston()

In [3]:
print(data.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [4]:
X = data.data
y = data.target

In [5]:
X = pd.DataFrame(X, columns=data.feature_names) # little trick to get the column names in correctly
y = pd.Series(y)

# Train Test Split

OK, now we're going to divide up our dataset so that we have an unseen test set to evaluate against. This way we can **simulate** tomorrow's world and figure out whether our trained model can generalize to unseen data at all.

Here is the code to split up our data. It's important to remember the order in which the arrays are passed back: it's always `X_train, X_test, y_train, y_test`, so don't forget! If you do forget, you can always look it up [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) (scroll down for the code example). I probably looked it up 100 times before I finally remembered it while teaching this course.

In [None]:
## you need to set the test size!
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = ) 

In [None]:
# let's take a look at how many samples we got, just to check.
print(X_train.shape)
print(X_test.shape)
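
As a quick illustration of the return order, here is a minimal sketch on made-up data (not the Boston data). The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te` and the values `test_size=0.2` and `random_state=42` are just assumptions for this example; pick whatever makes sense for your own split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 3 features (NOT the Boston data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# The return order is always: X_train, X_test, y_train, y_test.
# test_size=0.2 and random_state=42 are illustrative choices only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (80, 3) (20, 3)
print(y_tr.shape, y_te.shape)  # (80,) (20,)
```

With 100 samples and `test_size=0.2`, 20 samples land in the test set and the remaining 80 in the training set.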

## Scaling our dataset

OK, we need to scale our dataset. We learned about two scalers, so you can pick either one:

* StandardScaler
* MinMaxScaler

Now there is a small wrinkle. We `fit` and `transform` on our `X_train` data, but we __only__ `transform` the test data. This is because the transformation requires __knowing__ something about the dataset (the mean and standard deviation for `StandardScaler`, the minimum and maximum for `MinMaxScaler`). Our scaler learns those statistics from the data with `fit`, but since we are simulating tomorrow's world, we absolutely do not know the statistics of the test data! Therefore we scale the test data with whatever statistics we learned from the training data and hope they are close enough. This is the messy reality of machine learning.

In [None]:
## initialize a scaler here

scaler = # your choice of scaler here (check the imports to remember their names)

In [None]:
## first we need to `.fit()` our scaler on the training data
## this teaches the scaler the per-feature statistics it needs:
## the mean and standard deviation for StandardScaler, or the min and max for MinMaxScaler.
## these values are used in the transformation process.
## You should only call `.fit` on training data! Never on the testing data!
scaler.fit(X_train)
# now we transform the data
X_train_scaled = scaler.transform(X_train)

In [None]:
## Now we need to transform our testing set here

X_test_scaled = #use the scaler transform function on our dataset to return a new dataset
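
To see why this matters, here is a small sketch on synthetic data, assuming `StandardScaler` (the same pattern applies to `MinMaxScaler`). The names `scaler_demo`, `X_tr`, `X_te` and the random data are made up for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic "train" and "test" features drawn from the same distribution.
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_te = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Fit on the training data only: this stores the training mean and std.
scaler_demo = StandardScaler()
scaler_demo.fit(X_tr)

# Transform both sets with the statistics learned from the training data.
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(scaler_demo.mean_)         # per-feature means learned from X_tr
print(X_tr_scaled.mean(axis=0))  # essentially 0 for every feature
print(X_te_scaled.mean(axis=0))  # near 0, but not exactly 0
```

The scaled test set is only approximately centered, because it was scaled with the training statistics rather than its own. That is exactly the "hope it's close enough" situation described above.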

# Train a model

OK, it's time to train our model. Choose one of the three options we imported earlier. The only difference this time is that we will train __only__ on the (scaled) training data.

* LinearRegression
* SGDRegressor
* Ridge

We are also going to train a [dummy regressor](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html). This regressor uses a simple strategy like 'mean' or 'median'. That means it will always guess the mean (or median) of the target values it saw during training and ignore the features entirely. That is pretty dumb, but we need to prove to ourselves that our model learned something, and beating this baseline is a good way to do it.

In [None]:
# initialize a model here

reg = # initialize your model here
reg_dummy = DummyRegressor(strategy='') # put a strategy into the keyword argument there!

In [None]:
# Train your model here with .fit

reg.fit(, ) # only use the training data for fitting (the scaled training features and the training targets)
reg_dummy.fit(,) # we gotta fit our dummy too
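
To make the dummy's behaviour concrete, here is a tiny sketch with hand-picked numbers. The arrays and the names `dummy_mean` and `dummy_median` are made up for this illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Four training samples; the dummy ignores the feature values entirely.
X_tr = np.zeros((4, 2))
y_tr = np.array([10.0, 20.0, 30.0, 100.0])

dummy_mean = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
dummy_median = DummyRegressor(strategy="median").fit(X_tr, y_tr)

# Whatever we pass in, every prediction is the same constant.
X_new = np.ones((3, 2))
print(dummy_mean.predict(X_new))    # [40. 40. 40.]  (mean of y_tr)
print(dummy_median.predict(X_new))  # [25. 25. 25.]  (median of y_tr)
```

Note how the outlier (100) pulls the mean up to 40 while the median stays at 25, which is one reason the two strategies can give different baseline scores.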

# Evaluate our model

OK, let's use our trained model to make predictions, then evaluate those predictions against the real, known `y` values. What should we evaluate against, though? The training data, or just the test data?
Evaluating on the training data is optional, but we should definitely evaluate on the test data, since that is our stand-in for unseen data.

In [None]:
## use your model to make predictions on the data

y_pred = # your model.predict here -- now you can use the test features (scaled the same way as the training features)
dumb_y_preds = reg_dummy.predict()  # put in the same scaled test features to make some predictions

OK, now we need to evaluate our predictions. We will use scikit-learn's built-in mean squared error metric for this. It's very important to pass your arguments in the right order: scikit-learn metrics use the `y_true, y_pred` format, which means you pass the ground truth first, followed by the predictions.
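
Here is a small worked example of the metric on made-up numbers. MSE itself happens to be symmetric in its two arguments, so the sketch also calls `r2_score` (not used elsewhere in this notebook, shown only as an illustration) to demonstrate that argument order can change the result for other metrics.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up ground truth and predictions.
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

# MSE is the average of the squared errors:
# (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375
mse = mean_squared_error(y_true_demo, y_pred_demo)
mse_by_hand = np.mean((y_true_demo - y_pred_demo) ** 2)
print(mse, mse_by_hand)  # 0.375 0.375

# For a metric like R^2, swapping the arguments gives a different number.
r2_correct = r2_score(y_true_demo, y_pred_demo)
r2_swapped = r2_score(y_pred_demo, y_true_demo)
print(r2_correct, r2_swapped)
```

Sticking to the `y_true, y_pred` order every time means you never have to remember which metrics are symmetric and which are not.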

OK, let's compare our results to our dummy!

In [None]:
reg_score = mean_squared_error(y_test, y_pred)
dummy_score = mean_squared_error(y_test, dumb_y_preds)
print(f"the MSE for our regressor is: {reg_score}")
print(f"the MSE for our dummy is: {dummy_score}")
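
For reference, the whole workflow (split, scale, fit, predict, compare to a baseline) can be sketched end to end on synthetic data. Everything here is an assumption made for illustration: the data is randomly generated rather than the Boston data, and the choices of `LinearRegression`, `StandardScaler`, `strategy="mean"`, `test_size=0.25` and the variable names are just one possible set of answers, not the "right" ones.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: y is a linear function of X plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit the scaler on the training data only, then transform both sets.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

# Fit the real model and the baseline on the training data only.
model = LinearRegression().fit(X_tr_s, y_tr)
baseline = DummyRegressor(strategy="mean").fit(X_tr_s, y_tr)

# Metrics take the ground truth first, then the predictions.
train_mse = mean_squared_error(y_tr, model.predict(X_tr_s))
test_mse = mean_squared_error(y_te, model.predict(X_te_s))
baseline_mse = mean_squared_error(y_te, baseline.predict(X_te_s))

print(f"train MSE:    {train_mse:.3f}")
print(f"test MSE:     {test_mse:.3f}")
print(f"baseline MSE: {baseline_mse:.3f}")
```

On data like this, the model's test MSE should be far below the baseline's, and comparing the train and test MSE gives a rough sense of how well the model generalizes.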

## That's a wrap!

Any issues? Are we happy with everything?