# Predicting House Prices

In this example, you will learn how to:
1. Run MLflow locally
2. Implement a RandomForestRegressor
3. Perform hyperparameter tuning
4. Identify the accuracy of the newly created model

In [1]:
%pip install -r requirements.txt 

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


For the next step, we will need to pull the data from [here](https://www.kaggle.com/datasets/harlfoxem/housesalesprediction?resource=download).


## Get Data

Once you have downloaded the dataset (or are using the copy provided with this example), we can define its columns. We will use the House Sales in King County, USA dataset, which has 21 columns, with price as the target column.

I will define the columns for you here.  <br>  <br>
**id** - Unique ID for each home sold <br>
**date** - Date of the home sale <br>
**price** - Price of each home sold <br>
**bedrooms** - Number of bedrooms <br>
**bathrooms** - Number of bathrooms, where .5 accounts for a room with a toilet but no shower <br>
**sqft_living** - Square footage of the apartment's interior living space <br>
**sqft_lot** - Square footage of the land space  <br>
**floors** - Number of floors <br>
**waterfront** - A dummy variable for whether the apartment was overlooking the waterfront or not <br>
**view** - An index from 0 to 4 of how good the view of the property was <br>
**condition** - An index from 1 to 5 on the condition of the apartment  <br>
**grade** - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.  <br>
**sqft_above** - The square footage of the interior housing space that is above ground level  <br>
**sqft_basement** - The square footage of the interior housing space that is below ground level  <br>
**yr_built** - The year the house was initially built <br>
**yr_renovated** - The year of the house’s last renovation <br>
**zipcode** - What zipcode area the house is in <br>
**lat** - Latitude <br>
**long** - Longitude <br>
**sqft_living15** - The square footage of interior housing living space for the nearest 15 neighbors <br>
**sqft_lot15** - The square footage of the land lots of the nearest 15 neighbors <br>


In [2]:
import pandas as pd
df = pd.read_csv('kc_house_data.csv')
df = df.drop(['id', 'date'], axis=1)
df = df.dropna()

In [3]:
from sklearn.model_selection import train_test_split
# Split into input features (X) and the target column (y)
X = df.loc[:, df.columns != 'price']
y = df.loc[:, 'price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

Now, we are ready to train our model. 

## Train Model

We will use GridSearchCV in this example. GridSearchCV lets us tune the hyperparameters of our RandomForestRegressor model, and it is a great way to showcase MLflow. MLflow makes it easy to identify the best model quickly by recording all of the parameters and results.

In [4]:
from helper import fetch_logged_data, yield_artifacts
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

model = RandomForestRegressor(random_state=42)
param_grid = {
    'n_estimators': [100, 200],
    'max_features': [1.0],
    'max_depth': [4, 6, 8],
    'criterion': ['squared_error']
}
# define search
search = GridSearchCV(
    estimator=model, param_grid=param_grid, n_jobs=-1)
result = search.fit(X_train, y_train)
# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

Best Score: 0.8438245555164338
Best Hyperparameters: {'criterion': 'squared_error', 'max_depth': 8, 'max_features': 1.0, 'n_estimators': 200}


Our best model is a RandomForestRegressor with {'criterion': 'squared_error', 'max_depth': 8, 'max_features': 1.0, 'n_estimators': 200} as hyperparameters, for a best mean cross-validated score of 0.843825 (R², the default scoring for regressors in GridSearchCV). The mean absolute error is within the standard deviation of 367127.
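The notebook does not show how the mean absolute error was obtained. One way to check it on held-out data is sketched below, assuming a synthetic dataset and a single fitted forest (demo_model) in place of the tuned model; all names here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Fit one forest; in the notebook this role is played by the tuned model
demo_model = RandomForestRegressor(
    n_estimators=50, max_depth=8, random_state=42).fit(X_tr, y_tr)
y_pred = demo_model.predict(X_te)

# Compare the held-out error with the spread of the target itself
mae = mean_absolute_error(y_te, y_pred)
r2 = r2_score(y_te, y_pred)
target_std = np.std(y_te)

print('Test MAE: %s' % mae)
print('Test R^2: %s' % r2)
print('Target standard deviation: %s' % target_std)
```

In the notebook itself, the same three metrics could be computed by passing result.best_estimator_.predict(X_test) and y_test to these functions.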

A neat thing about RandomForestRegressor is that it lets you inspect feature importances through its feature_importances_ attribute.

In [5]:
importances = result.best_estimator_.feature_importances_

In [6]:
forest_importances = pd.DataFrame(importances, index=X.columns)


In [7]:
# Color each feature red if its importance exceeds 0.1, otherwise blue
forest_importances['colors'] = [
    'red' if x > 0.1 else 'blue' for x in forest_importances[0]]

In [8]:
forest_importances['colors']

bedrooms         blue
bathrooms        blue
sqft_living       red
sqft_lot         blue
floors           blue
waterfront       blue
view             blue
condition        blue
grade             red
sqft_above       blue
sqft_basement    blue
yr_built         blue
yr_renovated     blue
zipcode          blue
lat               red
long             blue
sqft_living15    blue
sqft_lot15       blue
Name: colors, dtype: object
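The importances can also be ranked numerically before plotting. The sketch below assumes a synthetic dataset with made-up column names (f0 to f3) and a freshly fitted forest in place of the tuned one; it only illustrates sorting a labeled Series of importances.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with hypothetical column names (assumption)
X_arr, y_demo = make_regression(
    n_samples=200, n_features=4, n_informative=2, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=['f0', 'f1', 'f2', 'f3'])

demo_model = RandomForestRegressor(
    n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Label each importance with its column name and sort, largest first
ranked = pd.Series(
    demo_model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(ranked)
```

The importances of a fitted forest sum to 1, so each value can be read as that feature's share of the total.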

In [9]:
import plotly.express as px
fig = px.bar(forest_importances, orientation='h', color='colors')
fig.update_layout(
    plot_bgcolor='rgba(0, 0, 0, 0)',
    yaxis={'categoryorder': 'total ascending', 'showgrid': False, 'title': 'Feature'},
    xaxis={'showgrid': False, 'title': 'Importance Value'},
    showlegend=False,
)

This suggests that the most important features in this model are grade (how nice the house is), sqft_living (how big the house is), and lat (where it is located). I am glad that this data confirms what we already see in real life.

*Now that is something you can show your stakeholders!*