## **Modelling our data with machine learning**

Once our data is cleaned, we can move on to preprocessing it and building a model for predictions. We want a supervised machine learning model that solves a regression task on tabular data: predicting a home's price from its features.

To start off, we import the required modules and create a preprocessing pipeline that transforms our clean dataset into a model-friendly representation. We separate our data into numerical and categorical features, which are treated differently by the preprocessing pipeline with the help of a scikit-learn ColumnTransformer. The numerical features are passed through unchanged, since we will use a tree-based model (XGBoost) that does not require feature scaling (standardization). The only remaining step is to encode the categorical features, home type and city, with one-hot encoding, which completes our preprocessor.

In [130]:
# Import required modules
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn import set_config
set_config(transform_output = "pandas")

# Create preprocessing pipeline
numeric_features = ['bedrooms', 'bathrooms', 'livingArea', 'longitude', 'latitude']
categorical_features = ['homeType', 'city']
preprocessor = ColumnTransformer([('ohe', OneHotEncoder(sparse_output=False), categorical_features),
                                  ('passthrough', 'passthrough', numeric_features)],
                                 remainder='drop')

Now we can build our final pipeline, which connects our preprocessor to an estimator that learns the underlying patterns in the data and outputs a price prediction. We will use an extreme gradient boosting (XGBoost) tree model, since it is one of the best-performing models on tabular data. Its hyperparameters have been tuned using cross-validation.

In [None]:
# Import ML libraries
from xgboost import XGBRegressor
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Create pipeline of preprocessing and model (XGBoost with tuned hyperparameters)
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('estimator', XGBRegressor(learning_rate=0.1, max_depth=7, n_estimators=200,
                               reg_lambda=0.1, reg_alpha=0.1)),
])
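
The tuning step itself is not shown in this notebook. As a rough sketch of how such a search could look, the block below runs a small grid search over a pipeline with scikit-learn's GridSearchCV. Everything in it is an assumption for illustration: the data comes from `make_regression`, scikit-learn's `GradientBoostingRegressor` stands in for `XGBRegressor` so the example runs without extra dependencies, and the grid values are not the ones used to obtain the hyperparameters above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Stand-in data and estimator (assumptions): swap in the real X, y and
# XGBRegressor to search over the actual model
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
demo_pipeline = Pipeline([('estimator', GradientBoostingRegressor(random_state=0))])

# The "estimator__" prefix routes each value to the pipeline step with that name
param_grid = {
    'estimator__learning_rate': [0.05, 0.1],
    'estimator__max_depth': [3, 5],
}

search = GridSearchCV(
    demo_pipeline,
    param_grid,
    scoring='neg_mean_absolute_error',  # scores are maximized, so MAE is negated
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print('Best cross-validated MAE:', -search.best_score_)
```

Searching over the whole pipeline, instead of the bare estimator, means the preprocessing is refit inside every fold, so no information from the validation fold leaks into the encoder.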

Now that we have defined the complete model pipeline, we can train and test the model using cross-validation and assess the quality of its predictions. We will use the mean absolute error (MAE), root mean squared error (RMSE), and R² value to rate our model's performance.

In [187]:
# Split data into features and target
X = df.drop('price', axis=1)
y = df['price']

# Test model performance with 5-fold cross validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
mae = []
rmse = []
r2 = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    mae.append(mean_absolute_error(y_test, y_pred))
    rmse.append(np.sqrt(mean_squared_error(y_test, y_pred)))
    r2.append(r2_score(y_test, y_pred))

print('Mean Absolute Error: ', np.mean(mae))
print('Root Mean Squared Error: ', np.mean(rmse))
print('R2 Score: ', np.mean(r2))

Mean Absolute Error:  146246.19434381404
Root Mean Squared Error:  247346.50659278594
R2 Score:  0.8641549692006647


The mean absolute error is the average absolute difference between actual and predicted price, while the root mean squared error is the square root of the average squared difference, which penalizes large errors more heavily. Averaged over the five cross-validation folds, the mean absolute error is about \$146,246 and the root mean squared error is about \$247,347. These errors are calculated across all price ranges, so they are most notable in high-value homes, as we will see further on. The R² value is one of the most important metrics for quantifying our model's performance: it describes the share of the variance in price that the model's predictions account for. In this case, R² is 0.864, which tells us that our model explains 86.4% of the variance in house price on held-out data, which is fairly high.
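
To make the three definitions concrete, here is a small sketch that computes them by hand with NumPy. The four prices are made up for illustration and do not come from the dataset.

```python
import numpy as np

# Made-up prices for illustration (not from the dataset)
y_true = np.array([500_000, 750_000, 1_000_000, 2_000_000], dtype=float)
y_hat = np.array([520_000, 700_000, 1_050_000, 1_700_000], dtype=float)

errors = y_true - y_hat

# MAE: average absolute difference
mae_demo = np.mean(np.abs(errors))

# RMSE: square root of the average squared difference
rmse_demo = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (unexplained variation / total variation around the mean price)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print('MAE: ', mae_demo)
print('RMSE:', rmse_demo)
print('R2:  ', r2_demo)
```

In this toy example the single 300,000 miss on the most expensive home pulls the RMSE well above the MAE, which is the same effect that makes the RMSE on the real data so much larger than the MAE.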

Next, we split our data into training (80%) and testing (20%) sets, train the model on the training set, and predict the prices on the testing set. In the sample shown here, the predicted price is close to the actual price for most examples, and the deviations are small relative to the price across the range shown.

In [190]:
# Predict on test set and compare to actual values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
predictions = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred}).astype(int)
predictions.head(10)

| | Actual | Predicted |
|---|---|---|
| 751 | 558500 | 546432 |
| 7094 | 631000 | 548060 |
| 7153 | 795000 | 770689 |
| 17846 | 590000 | 511749 |
| 9375 | 740105 | 669186 |
| 5126 | 759000 | 748256 |
| 11709 | 1000000 | 1025955 |
| 18650 | 750000 | 654058 |
| 1242 | 1039000 | 922971 |
| 12811 | 1530000 | 1355266 |
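
One way to quantify how small the deviations are in proportion to the price is to express each error as a fraction of the actual price. The sketch below does this for the ten rows displayed above, copied by hand from the output. It covers only those ten rows, so it is not a summary of the whole test set.

```python
import pandas as pd

# The ten rows displayed above, copied from the notebook output
shown = pd.DataFrame({
    'Actual': [558500, 631000, 795000, 590000, 740105,
               759000, 1000000, 750000, 1039000, 1530000],
    'Predicted': [546432, 548060, 770689, 511749, 669186,
                  748256, 1025955, 654058, 922971, 1355266],
}, index=[751, 7094, 7153, 17846, 9375, 5126, 11709, 18650, 1242, 12811])

# Absolute error as a fraction of the actual price
shown['abs_pct_error'] = (shown['Predicted'] - shown['Actual']).abs() / shown['Actual']

print(shown['abs_pct_error'].round(3))
print('Mean absolute percentage error:', shown['abs_pct_error'].mean())
```

For these ten rows, the errors range from roughly 1% to roughly 13% of the actual price. Applying the same column to the full `predictions` frame would give the figure for the entire test set.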


Because our model is tree-based, we can also extract how important each feature is in determining a prediction. The most important factor in determining house price is whether or not the house is a manufactured (mobile) home. Living area is the second most important factor, followed by the city or location of the house. Some areas, such as Coronado, Clairemont and La Jolla, among others, have a more distinctive price range than other areas.

In [189]:
# Get feature importances from model
feature_importances = pd.DataFrame(
    pipeline['estimator'].feature_importances_,
    index=pipeline['preprocessor'].get_feature_names_out(),
    columns=['importance'],
).sort_values('importance', ascending=False)
feature_importances.head(10)

| | importance |
|---|---|
| ohe__homeType_MANUFACTURED | 0.190955 |
| passthrough__livingArea | 0.125039 |
| ohe__city_Coronado | 0.115405 |
| ohe__city_Clairemont | 0.087133 |
| ohe__city_Carlsbad | 0.072721 |
| ohe__city_Poway | 0.047217 |
| passthrough__latitude | 0.045832 |
| passthrough__longitude | 0.045093 |
| ohe__city_La Jolla | 0.037546 |
| ohe__city_Rancho Santa Fe | 0.026697 |
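
Because one-hot encoding spreads a single categorical variable over many columns, it can help to add the importances back up per original feature. The sketch below does this for the ten values shown above, copied by hand from the output. The helper `source_feature` is a hypothetical name introduced here, and it assumes the original column names contain no underscores, which holds for `homeType` and `city`.

```python
import pandas as pd

# Top-10 importances copied from the output above (all other features are omitted)
top10 = pd.Series({
    'ohe__homeType_MANUFACTURED': 0.190955,
    'passthrough__livingArea': 0.125039,
    'ohe__city_Coronado': 0.115405,
    'ohe__city_Clairemont': 0.087133,
    'ohe__city_Carlsbad': 0.072721,
    'ohe__city_Poway': 0.047217,
    'passthrough__latitude': 0.045832,
    'passthrough__longitude': 0.045093,
    'ohe__city_La Jolla': 0.037546,
    'ohe__city_Rancho Santa Fe': 0.026697,
}, name='importance')

def source_feature(name):
    """Map a transformed column name back to the original column (hypothetical helper)."""
    transformer, column = name.split('__', 1)
    # One-hot columns look like "<column>_<category>"; passthrough columns keep their name
    return column.split('_', 1)[0] if transformer == 'ohe' else column

grouped = top10.groupby(top10.index.map(source_feature)).sum().sort_values(ascending=False)
print(grouped)
```

Within these ten rows, the city indicators together carry more importance than any single column, although the columns outside the top ten are not included in this sum.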


Finally, we can use our model to predict house prices for synthetic (made-up) data. We define three homes: two regular single-family homes and one mobile home, located in Chula Vista, La Jolla and Poway respectively. The living area is chosen to match the selected number of bedrooms and bathrooms, and the coordinates are the mean values of the chosen zone. These three homes are expected to fall in the medium, high and low price ranges respectively.

In [129]:
# Predict house prices on synthetic data
synthetic_data = pd.DataFrame({'bedrooms': [3, 5, 2],
                                'bathrooms': [2, 3, 2],
                                'livingArea': [1750, 3500, 1500],
                                'homeType': ['SINGLE_FAMILY', 'SINGLE_FAMILY', 'MANUFACTURED'],
                                'city': ['Chula Vista', 'La Jolla', 'Poway'],
                                'longitude': [-117.0032, -117.1692, -117.0407],
                                'latitude': [32.6277, 32.6769, 32.9766]})

synthetic_data['predictedPrice'] = pipeline.predict(synthetic_data).astype(int)
synthetic_data

| | bedrooms | bathrooms | livingArea | homeType | city | longitude | latitude | predictedPrice |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2 | 1750 | SINGLE_FAMILY | Chula Vista | -117.0032 | 32.6277 | 824026 |
| 1 | 5 | 3 | 3500 | SINGLE_FAMILY | La Jolla | -117.1692 | 32.6769 | 3333409 |
| 2 | 2 | 2 | 1500 | MANUFACTURED | Poway | -117.0407 | 32.9766 | 293514 |


The model predicted a price of \$824,026 for the medium-priced home, \$3,333,409 for the expensive home, and \$293,514 for the cheap home. These results are comparable to the actual prices of similar real homes. The final step is to save our model so that we can create a simple web app that lets a user input the required data and returns a price estimate for that home.

In [None]:
# Save model as pickle file
import pickle
file_name = 'sd_pipeline.pkl'
# Use a context manager so the file is flushed and closed after writing
with open(file_name, 'wb') as f:
    pickle.dump(pipeline, f)

Now we can load the saved pipeline in a Streamlit app and make predictions on data that has the same format as the data used to train it.