For today's group work, you will be evaluating your model on the reminder of the data from our split.

Go through the following steps with your teammates. Your solution should be a Python notebook, including all the steps described here in the following sections:

### 1. Evaluating Your Model On Test Data

You have trained and evaluated your model using your training data. Now we will evaluate the model again using only the test data to check if the model is really performing as good as our training metrics have us believing.

### 1.1 Data Preparation

### 1.1.1 Read Data

To test the model, we first have to load the entire data.



In [1]:
import pandas as pd
import numpy as np

In [2]:
housing = pd.read_csv('housing.csv')

housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


Now we have to split our dataset exactly how we did during our training.

<b> Hint: </b> Remember the random_state and test_size parameters we used previously.

In [3]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(housing, test_size=0.2, random_state=42)

### 1.1.2 Clean Data

Clean the data using the same processes you used before training your model.

Check the test set for any missing values and replace or drop them like you did for your training dataset.

# Missing values will be imputed in the feature scaling stage below which contains a pipeline to handle this


### 1.1.3 Feature Engineering

Now create the same features from the training set to ensure that our model will work with the same set of features.

In [5]:
test["rooms_per_household"] = test["total_rooms"]/test["households"]
test["bedrooms_per_room"] = test["total_bedrooms"]/test["total_rooms"]
test["population_per_household"]=test["population"]/test["households"]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test["rooms_per_household"] = test["total_rooms"]/test["households"]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test["bedrooms_per_room"] = test["total_bedrooms"]/test["total_rooms"]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test["population_per_household"]=test["population"]/test["househo



### 1.1.4 Feature Scaling

Since our models were created using data from a specific range, we have to ensure that our test data is in the same range. Scale the numerical and categorical data to their right scales using the pipelines you used earlier.

To do this we must first prepare the data for preocessing in the pipeline.

In [6]:
# Save the label values in a new series
test_labels = test['median_house_value'].copy()

# Drop the label column from the rest of the dataframe
test =  test.drop('median_house_value', axis=1)

# Separate numerical fields in the test dataset
test_num = test.drop("ocean_proximity", axis=1) 

In [7]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

num_pipeline = Pipeline([('imputer', SimpleImputer(strategy="median")), ('std_scaler', StandardScaler())])

num_attribs = list(test_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([("num", num_pipeline, num_attribs), ("cat", OneHotEncoder(), cat_attribs)])

test_prepared = full_pipeline.fit_transform(test)


### 1.2 Modeling

At last! You are now ready to test your model.

<b>Note:</b> Before you use your model ensure that your data is in the right shape (i.e. Properly split into features and labels with the same number of feature columns)


### 1.2.1 Load Model

Re-load your model using the code from the _week11_housing.ipynb_ file.

In [8]:
import pickle

filename = 'forest_housing_model.sav'

model = pickle.load(open(filename, 'rb'))

### 1.2.2 Predict Values

Make predictions with your model based on your test dataset.


In [9]:
test_predictions = model.predict(test_prepared)

test_predictions

array([ 58436.66666667,  96350.        , 363483.56666667, ...,
       486394.2       , 107323.33333333, 157060.        ])

### 1.2.3 Evaluate Model

Check the errors based on the values your model predicted and the actual values from the test dataset.

In [10]:
from sklearn.metrics import mean_squared_error

model_mse = mean_squared_error(test_labels, test_predictions)
model_rmse = np.sqrt(model_mse)

model_rmse

66940.80706909164