In this lab you are going to predict house values with another model: a linear regressor trained with stochastic gradient descent (`SGDRegressor`).

Go through the following steps. Your solution should be a Python notebook that includes all the steps described below.

### 1. Load Dataset

Read the data as a pandas dataframe.

In [None]:
import pandas as pd
import numpy as np

In [None]:
housing = pd.read_csv('housing.csv')

housing.head()

### 2. Split Data Into Train and Test Sets

Split the dataset into train and test sets so that we can evaluate our model on data it has not seen. Splitting before any preprocessing also helps prevent data leakage.

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(housing, test_size=0.2, random_state=42)

### 3. Feature Engineering

Add ratio features that should help improve the model's performance.

In [None]:
train["rooms_per_household"] = train["total_rooms"]/train["households"]
train["bedrooms_per_room"] = train["total_bedrooms"]/train["total_rooms"]
train["population_per_household"]=train["population"]/train["households"]
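The same three ratios are needed again later for the test set, so one option is to wrap them in a small helper. The sketch below is only an illustration: the function name `add_ratio_features` and the two-row demo frame are assumptions, not part of the lab's data or code.

```python
import pandas as pd


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three ratio features added."""
    out = df.copy()  # work on a copy so the caller's dataframe is not modified
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    out["population_per_household"] = out["population"] / out["households"]
    return out


# Tiny illustrative frame standing in for the housing data
demo = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "total_bedrooms": [129.0, 1106.0],
    "population": [322.0, 2401.0],
    "households": [126.0, 1138.0],
})

demo_features = add_ratio_features(demo)
print(demo_features[["rooms_per_household",
                     "bedrooms_per_room",
                     "population_per_household"]])
```

With a helper like this, the train and test sets would each be passed through the same function, which keeps the two feature definitions from drifting apart.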

### 4. Feature Scaling

Handle missing values and re-scale values.

In [None]:
# Save the label values in a new series
train_labels = train['median_house_value'].copy()

# Drop the label column from the rest of the dataframe
train =  train.drop('median_house_value', axis=1)

# Separate numerical fields in the training dataset
train_num = train.drop("ocean_proximity", axis=1) 

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Pipeline to impute missing values with the median value and scale the data using the standard scaler
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy="median")), ('std_scaler', StandardScaler())])

num_attribs = list(train_num)
cat_attribs = ["ocean_proximity"]

# Full pipeline: applies the numerical pipeline above and one-hot encodes the categorical values
full_pipeline = ColumnTransformer([("num", num_pipeline, num_attribs), ("cat", OneHotEncoder(), cat_attribs)])

# Run the final pipeline
train_prepared = full_pipeline.fit_transform(train)

### 5. Modelling

In [None]:
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor()
sgd_reg.fit(train_prepared, train_labels)

### 5.1 Cross-Validation

Now that the model is trained, let's evaluate it with 10-fold cross-validation on the training set.

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(sgd_reg, train_prepared, train_labels, scoring="neg_mean_squared_error", cv=10)

sgd_rmse_scores = np.sqrt(-scores)

In [None]:
sgd_rmse_scores

Let’s look at the results:

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())


In [None]:
display_scores(sgd_rmse_scores)
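The negation in `np.sqrt(-scores)` is needed because scikit-learn scorers follow a "higher is better" convention, so `neg_mean_squared_error` returns the MSE with its sign flipped. A minimal sketch on synthetic data (the `make_regression` dataset and the variable names are assumptions used only for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the prepared housing features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Each fold's score is the negated MSE, so every value is <= 0
neg_mse = cross_val_score(SGDRegressor(random_state=0), X, y,
                          scoring="neg_mean_squared_error", cv=5)

# Flip the sign back, then take the square root to get one RMSE per fold
rmse = np.sqrt(-neg_mse)

print("Negated MSE per fold:", neg_mse)
print("RMSE per fold:", rmse)
```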

### 5.2 Model Tuning

Now let's tune the model over a grid of hyperparameters to see if we can improve its performance.

In [None]:
from sklearn.model_selection import GridSearchCV


param_grid = [
    {'eta0': [0.0001, 0.005, 0.001, 0.01], 'power_t': [0.35, 0.3, 0.25], 'max_iter': [2000, 4000, 6000]}, 
    ]

sgd_reg = SGDRegressor()

grid_search = GridSearchCV(sgd_reg, param_grid, cv=5,scoring='neg_mean_squared_error',return_train_score=True)

grid_search.fit(train_prepared, train_labels)

In [None]:
grid_search.best_params_ 

# Our result {'eta0': 0.0001, 'max_iter': 6000, 'power_t': 0.25}

In [None]:
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
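Printed line by line, the grid results can be hard to compare. One alternative is to load `cv_results_` into a dataframe and sort by RMSE. This is a sketch on synthetic data with a deliberately small grid; the dataset, the grid values, and the names `search` and `ranked` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the prepared housing features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

small_grid = {"eta0": [0.001, 0.01], "power_t": [0.25, 0.35]}
search = GridSearchCV(SGDRegressor(random_state=0), small_grid, cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

# cv_results_ is a dict of arrays, so it converts directly to a dataframe
results = pd.DataFrame(search.cv_results_)
results["rmse"] = np.sqrt(-results["mean_test_score"])

# Lowest RMSE first; the top row corresponds to search.best_params_
ranked = results.sort_values("rmse")[["param_eta0", "param_power_t", "rmse"]]
print(ranked)
```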

### 5.3 Model Testing

Let's evaluate our model against our test set so we can compare its overall performance to our previous models.

First we prepare the data. Then we use our model to predict and finally evaluate the model.

In [None]:
test["rooms_per_household"] = test["total_rooms"]/test["households"]
test["bedrooms_per_room"] = test["total_bedrooms"]/test["total_rooms"]
test["population_per_household"]=test["population"]/test["households"]

In [None]:
# Save the label values in a new series
test_labels = test['median_house_value'].copy()

# Drop the label column from the rest of the dataframe
test =  test.drop('median_house_value', axis=1)

# Separate numerical fields in the test dataset
test_num = test.drop("ocean_proximity", axis=1) 

In [None]:
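# Use transform (not fit_transform) so the test set is prepared with the
# imputer, scaler, and encoder statistics learned from the training set
test_prepared = full_pipeline.transform(test)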
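Test data should be transformed with the statistics learned from the training data, because the model's weights were fitted on features scaled that way. The toy sketch below (made-up numbers, a single column, and illustrative variable names) shows how refitting a scaler on the test set produces different values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One made-up feature column for each split
train_col = np.array([[1.0], [2.0], [3.0], [4.0]])  # mean 2.5
test_col = np.array([[10.0], [12.0]])               # mean 11.0

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(train_col)
test_scaled = scaler.transform(test_col)

# Refitting on the test data re-centres it around its own mean
refit_scaled = StandardScaler().fit_transform(test_col)

print("Scaled with training statistics:", test_scaled.ravel())
print("Scaled with refitted statistics:", refit_scaled.ravel())
```

In the first case the test values stay far above zero, which reflects how different they are from the training data. In the second case that information is lost, and the model receives inputs on a scale it was not trained on.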

In [None]:
best_sgd_model = grid_search.best_estimator_

In [None]:
test_predictions = best_sgd_model.predict(test_prepared)

test_predictions

In [None]:
from sklearn.metrics import mean_squared_error

model_mse = mean_squared_error(test_labels, test_predictions)
model_rmse = np.sqrt(model_mse)

model_rmse

We have successfully built our model and can now compare its test result with our previous model's to estimate which would be better for production.

Our previous model (a random forest) had a lower RMSE (66940.81) than this model (69250.28), so we would pick the random forest over this one. Although we would not be using this model, this has not been a waste of time. The errors from both models are still too high, and we can spend more time tuning the hyperparameters of both models to reduce them.

We can also try more complex models to help achieve this goal. For example, you could try a more complex tree-based algorithm to improve our error scores.


### 5.4 Production

The next step after obtaining an optimal model is using it in a production environment. To do that, we would have to save our preferred model and serve it, usually through a website where it can take input and make predictions.

Remember that we performed data preprocessing, feature engineering, and scaling on our data, so we would have to replicate these steps for any input we intend to feed to our model in the future.

In [None]:
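import pickle

# Save the best estimator found by the grid search.
# Note: this stores only the model, not the preprocessing pipeline.
# (sklearn.externals.joblib has been removed from scikit-learn; the
# standalone joblib package's joblib.dump is an alternative to pickle.)
filename = 'sgd_housing_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(grid_search.best_estimator_, f)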

In [None]:
# If you want to reload the model, you can use the following code:

with open(filename, 'rb') as f:
    model = pickle.load(f)
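Because new inputs must go through the same preprocessing as the training data, one option is to bundle the preprocessing and the regressor into a single `Pipeline` and pickle that object instead of the bare model. The sketch below uses a small synthetic dataframe, and the names `toy`, `preprocess`, and `model_with_prep` are assumptions. It pickles to an in-memory bytes object rather than a file, and adds `handle_unknown="ignore"` to the encoder so that unseen categories do not raise at prediction time.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data (values are made up)
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    "median_income": rng.uniform(1, 10, n),
    "housing_median_age": rng.uniform(1, 50, n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY"], n),
})
toy_labels = 40000 * toy["median_income"] + rng.normal(0, 5000, n)

num_pipeline = Pipeline([("imputer", SimpleImputer(strategy="median")),
                         ("std_scaler", StandardScaler())])
preprocess = ColumnTransformer([
    ("num", num_pipeline, ["median_income", "housing_median_age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

# Preprocessing and model travel together as one estimator
model_with_prep = Pipeline([("prep", preprocess),
                            ("sgd", SGDRegressor(random_state=0))])
model_with_prep.fit(toy, toy_labels)

# Round-trip through pickle; the restored object accepts raw dataframes
blob = pickle.dumps(model_with_prep)
restored = pickle.loads(blob)

new_rows = toy.head(3)
print(restored.predict(new_rows))
```

As with any pickle file, a saved model should only be loaded from a trusted source and with a compatible scikit-learn version.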