<img alt="Colaboratory logo" width="15%" src="https://raw.githubusercontent.com/carlosfab/escola-data-science/master/img/novo_logo_bg_claro.png">

#### **Data Science na Prática 2.0**
*by [sigmoidal.ai](https://sigmoidal.ai)*

---

## Property Prices in São Paulo

In this project we will train a model to predict property prices from an apartment sale database of the city of São Paulo (Brazil). The model will then be deployed in a web application that estimates prices on the go.

The data was pre-analysed and pre-processed before it was made available. The original dataset can be found on [Kaggle](https://www.kaggle.com/datasets/argonalyst/sao-paulo-real-estate-sale-rent-april-2019), where it was published by the startup OpenImob.

## Property Data

We will do only the basic data preparation needed before training the Machine Learning model.

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Importing dataset
df = pd.read_csv("data/sao-paulo-properties-april-2019.csv")

# Checking first entries
df.head()

Unnamed: 0,Price,Condo,Size,Rooms,Toilets,Suites,Parking,Elevator,Furnished,Swimming Pool,New,District,Negotiation Type,Property Type,Latitude,Longitude
0,930,220,47,2,2,1,1,0,0,0,0,Artur Alvim/São Paulo,rent,apartment,-23.543138,-46.479486
1,1000,148,45,2,2,1,1,0,0,0,0,Artur Alvim/São Paulo,rent,apartment,-23.550239,-46.480718
2,1000,100,48,2,2,1,1,0,0,0,0,Artur Alvim/São Paulo,rent,apartment,-23.542818,-46.485665
3,1000,200,48,2,2,1,1,0,0,0,0,Artur Alvim/São Paulo,rent,apartment,-23.547171,-46.483014
4,1300,410,55,2,2,1,1,1,0,0,0,Artur Alvim/São Paulo,rent,apartment,-23.525025,-46.482436


We will clean the `District` names by keeping only the part before the `/`, which drops the repeated `/São Paulo` suffix.

In [9]:
df_clean = df.copy()

# Cleaning dataframe
df_clean['District'] = df_clean['District'].apply(lambda x: x.split('/')[0])
df_clean.head()

Unnamed: 0,Price,Condo,Size,Rooms,Toilets,Suites,Parking,Elevator,Furnished,Swimming Pool,New,District,Negotiation Type,Property Type,Latitude,Longitude
0,930,220,47,2,2,1,1,0,0,0,0,Artur Alvim,rent,apartment,-23.543138,-46.479486
1,1000,148,45,2,2,1,1,0,0,0,0,Artur Alvim,rent,apartment,-23.550239,-46.480718
2,1000,100,48,2,2,1,1,0,0,0,0,Artur Alvim,rent,apartment,-23.542818,-46.485665
3,1000,200,48,2,2,1,1,0,0,0,0,Artur Alvim,rent,apartment,-23.547171,-46.483014
4,1300,410,55,2,2,1,1,1,0,0,0,Artur Alvim,rent,apartment,-23.525025,-46.482436
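
The same cleaning step can also be written with pandas' vectorized string methods. The sketch below uses a small toy frame (an assumption, not the real dataset) to show the equivalent call. One practical difference is that `.str` methods pass missing values through, whereas the `lambda` version would raise on a `NaN` entry.

```python
import pandas as pd

# Toy frame mimicking the District format shown above (illustrative data only)
toy = pd.DataFrame({"District": ["Artur Alvim/São Paulo", "Brooklin/São Paulo"]})

# Vectorized equivalent of .apply(lambda x: x.split('/')[0])
toy["District"] = toy["District"].str.split("/").str[0]

districts = toy["District"].tolist()
print(districts)
```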


Now let's train our model using a Random Forest Regressor.

In [10]:
# Converting dummy variables
df_clean = pd.get_dummies(df_clean)

# Getting X and y
X = df_clean.drop('Price', axis=1)
y = df_clean['Price']

# Splitting data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Starting model
model = RandomForestRegressor(random_state=6327)
model.fit(X_train, y_train)

# Predicting
y_pred = model.predict(X_test)

# Evaluating model
print("r2: \t{:.4f}".format(r2_score(y_test, y_pred)))
print("MAE: \t{:.4f}".format(mean_absolute_error(y_test, y_pred)))
print("MSE: \t{:.4f}".format(mean_squared_error(y_test, y_pred)))

r2: 	0.9238
MAE: 	46695.5149
MSE: 	28878780934.7046
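
The MSE is expressed in squared price units, which makes it hard to compare with the MAE. Taking its square root gives the RMSE, which is in the same units as `Price`. A minimal sketch using the MSE value printed above:

```python
import math

mse = 28878780934.7046   # MSE value printed by the evaluation cell above
rmse = math.sqrt(mse)    # back in the same units as Price

print(f"RMSE: {rmse:,.2f}")
```

An RMSE well above the MAE usually suggests that a small number of large errors dominate the squared-error metric, so it may be worth inspecting the most expensive properties separately.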


Now let's save our model for deployment.

In [11]:
from joblib import dump, load

# Saving model
dump(model, "data/model.joblib")

['data/model.joblib']

It is also vital to save the feature names, so that we know which features the model expects to receive in the future. They must be passed in the same order used during training.

In [12]:
# Saving features
features = X_train.columns.values
dump(features, "data/features.names")

['data/features.names']
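
To illustrate why the saved feature list matters, here is a minimal sketch of how an incoming record could be aligned to it at prediction time. The `features` array and the `incoming` record are hypothetical stand-ins for the array saved above and for whatever the web application receives. `DataFrame.reindex` puts the columns in the saved order and fills any absent dummy column with 0.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the array loaded from "data/features.names"
features = np.array(
    ["Condo", "Size", "Rooms", "District_Artur Alvim", "District_Brooklin"]
)

# Hypothetical incoming record: columns out of order, one dummy column absent
incoming = pd.DataFrame(
    [{"Size": 47, "District_Brooklin": 1, "Condo": 220, "Rooms": 2}]
)

# Reorder to the saved feature order; absent columns are created and set to 0
aligned = incoming.reindex(columns=features, fill_value=0)

print(aligned.columns.tolist())
print(aligned.iloc[0].tolist())
```

Note that `reindex` also silently drops columns that are not in `features`, so a misspelled field name would go unnoticed unless it is validated separately.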

### Testing the model

Now that we have successfully saved our model, let's load it back to check that it works.

In [13]:
new_model = load("data/model.joblib")
features = load("data/features.names")

type(new_model)

sklearn.ensemble._forest.RandomForestRegressor
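
Checking the type only confirms that the file loads. A slightly stronger check is to confirm that the reloaded model returns the same predictions as the original. The sketch below does this on synthetic data with a small forest written to a temporary directory; the data, the `toy_model` name and the file path are assumptions made for illustration, not the notebook's real model.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for X_train / y_train (not the real dataset)
rng = np.random.default_rng(0)
X_toy = rng.random((50, 3))
y_toy = X_toy @ np.array([1.0, 2.0, 3.0])

toy_model = RandomForestRegressor(n_estimators=10, random_state=0)
toy_model.fit(X_toy, y_toy)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    dump(toy_model, path)
    reloaded = load(path)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(toy_model.predict(X_toy), reloaded.predict(X_toy))
print(same_predictions)
```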

In [14]:
# Important: check sklearn version
import sklearn
sklearn.__version__

'0.23.2'
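
The version matters because a model serialized with one scikit-learn release is not guaranteed to load or behave identically under another. One possible safeguard, sketched below, is to record the training version and compare it when the model is loaded. The `same_sklearn_version` helper is hypothetical, and `trained_with` simply repeats the version printed above.

```python
import sklearn

# Version recorded at training time (the value printed above)
trained_with = "0.23.2"

def same_sklearn_version(expected, current=sklearn.__version__):
    """Return True when the runtime scikit-learn matches the training version."""
    return expected == current

if not same_sklearn_version(trained_with):
    print(
        f"Warning: model trained with scikit-learn {trained_with}, "
        f"but running {sklearn.__version__}"
    )
```

In a deployment, the recorded version could be stored next to the model file and the same version pinned in the application's requirements.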

In [17]:
# Getting columns for testing
import numpy as np
import json

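# One zero per feature: use shape[1] (number of columns), not shape[0] (number of rows)
zero_row = dict(zip(X_test.columns.values, np.zeros(X_test.shape[1]).astype(int).tolist()))
json_object = json.dumps(zero_row, indent=4)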
print(json_object)

{
    "Condo": 0,
    "Size": 0,
    "Rooms": 0,
    "Toilets": 0,
    "Suites": 0,
    "Parking": 0,
    "Elevator": 0,
    "Furnished": 0,
    "Swimming Pool": 0,
    "New": 0,
    "Latitude": 0,
    "Longitude": 0,
    "District_Alto de Pinheiros": 0,
    "District_Anhanguera": 0,
    "District_Aricanduva": 0,
    "District_Artur Alvim": 0,
    "District_Barra Funda": 0,
    "District_Bela Vista": 0,
    "District_Bel\u00e9m": 0,
    "District_Bom Retiro": 0,
    "District_Brasil\u00e2ndia": 0,
    "District_Brooklin": 0,
    "District_Br\u00e1s": 0,
    "District_Butant\u00e3": 0,
    "District_Cachoeirinha": 0,
    "District_Cambuci": 0,
    "District_Campo Belo": 0,
    "District_Campo Grande": 0,
    "District_Campo Limpo": 0,
    "District_Canga\u00edba": 0,
    "District_Cap\u00e3o Redondo": 0,
    "District_Carr\u00e3o": 0,
    "District_Casa Verde": 0,
    "District_Cidade Ademar": 0,
    "District_Cidade Dutra": 0,
    "District_Cidade L\u00edder": 0,
    "District_Cidade T