# Predicting Real Estate Data in St. Petersburg

This project was completed as part of the Business Analytics and Big Data class at the Graduate School of Management (SPBU). The class was led by a former head of product at Yandex.

The data comes from the Yandex.Realty classifieds site (https://realty.yandex.ru) and contains real estate listings for apartments in St. Petersburg from 2016 to mid-August 2018.

The aim of this project was to apply machine learning algorithms to solve business problems. Accurate price prediction helps to find fraud automatically: for example, a listing priced far from the predicted value can be flagged for review.

In this project, I accomplished the following tasks:

✅ Data science: use ML algorithms (CatBoost) to predict listing prices from historical data

✅ Data science: calculate metrics to find out whether our ML model is ready for production

✅ Data analysis: perform statistical calculations (MSE and MAPE) and exploratory data analysis

✅ Data engineering: prepare datasets for machine learning algorithms


## Steps to accomplish
1) Clean the dataset
2) Split the dataset into train, test and holdout (validation) datasets
3) Apply a gradient-boosted decision tree algorithm (CatBoost) to build an ML model for price prediction
4) Calculate business metrics

In [3]:
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 1000)
import math
from sklearn.metrics import mean_squared_error

In [8]:
spb_df = pd.read_csv(r'./data/spb.real.estate.archive.2018.csv')
spb_df.sample(5)

We use the results of the real estate analysis notebook.

Reminder: the offer_type column distinguishes rent offers from sale offers: 2 stands for RENT, 1 for SELL.

In [11]:
rent_df = spb_df[spb_df.offer_type == 2]
print("Total rent data size: {}".format(len(rent_df)))
# keep only offers inside the city ('Россия, Санкт-Петербург' = 'Russia, Saint Petersburg')
# .copy() makes an independent frame, so adding a column below does not raise SettingWithCopyWarning
rent_df_spb = rent_df[rent_df.unified_address.str.contains('Россия, Санкт-Петербург')].copy()
print("Rent data size in city limits: {}".format(len(rent_df_spb)))
# calculate price per sq m
rent_df_spb['price_per_sq_m'] = rent_df_spb.last_price / rent_df_spb.area
# median price per sq m for each building (address)
house_rent_df = rent_df_spb.groupby('unified_address').price_per_sq_m.median().reset_index()
house_rent_df.rename(columns = {'price_per_sq_m': 'house_price_sqm_median'}, inplace = True)
# attach the building median to every offer
rent_df_spb = rent_df_spb.merge(house_rent_df)

Total rent data size: 171186
Rent data size in city limits: 156054


In [12]:
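# drop offers priced at more than 5x the median price per sq m of their building
rent_df_cleaned = rent_df_spb[~((rent_df_spb.price_per_sq_m/rent_df_spb.house_price_sqm_median) > 5)]
# keep only offers with a price below 1,000,000
rent_df_cleaned = rent_df_cleaned[rent_df_cleaned.last_price < 1000000]
# drop expensive offers (> 3000 per sq m) in cheap buildings or where the offer itself defines the median
rent_df_cleaned = rent_df_cleaned[~((rent_df_cleaned.price_per_sq_m > 3000) 
                                     & ((rent_df_cleaned.house_price_sqm_median < 1000) 
                                        | (rent_df_cleaned.house_price_sqm_median == rent_df_cleaned.price_per_sq_m)))]
# drop cheap offers (< 250 per sq m) priced at half the building median or less
rent_df_cleaned = rent_df_cleaned[~((rent_df_cleaned.price_per_sq_m < 250) 
                               & (rent_df_cleaned.house_price_sqm_median/rent_df_cleaned.price_per_sq_m >= 2))]
# drop very cheap offers (< 200 per sq m) where the offer itself defines the building median
rent_df_cleaned = rent_df_cleaned[~((rent_df_cleaned.price_per_sq_m < 200) 
                                          & (rent_df_cleaned.price_per_sq_m == rent_df_cleaned.house_price_sqm_median))]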
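The filters above compare each offer's price per square metre with the median for its building. The following is a minimal sketch of the first rule on made-up data. The toy frame and its values are hypothetical (only the column names follow the notebook), and it uses `groupby(...).transform` in place of the groupby-and-merge used in the notebook.

```python
import pandas as pd

# Hypothetical toy offers; only the column names mirror the notebook.
toy = pd.DataFrame({
    "unified_address": ["A", "A", "A", "B", "B"],
    "price_per_sq_m": [500.0, 550.0, 4000.0, 700.0, 720.0],
})

# Median price per sq m for each building, broadcast back to every row.
toy["house_price_sqm_median"] = (
    toy.groupby("unified_address")["price_per_sq_m"].transform("median")
)

# Drop offers priced at more than 5x their building's median.
ratio = toy["price_per_sq_m"] / toy["house_price_sqm_median"]
toy_cleaned = toy[~(ratio > 5)]
print(toy_cleaned)
```

In this toy frame the 4000 offer in building A is the only row removed, because the median of its building is 550.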

## Train, test and holdout datasets
We need a holdout dataset to assess the final quality of the algorithm.

The test dataset is used to compare models and tune hyperparameters.

Since our model will predict prices for new offers based on older data, we split the data by time instead of using a random split.
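A time-based split can be sketched on a few made-up offers as follows. The toy frame, its values and the explicit conversion to timestamps are assumptions for illustration; only the column names and the date boundaries follow the notebook.

```python
import pandas as pd

# Hypothetical offers; only the column names mirror the notebook.
toy = pd.DataFrame({
    "first_day_exposition": ["2018-01-15", "2018-03-31", "2018-04-01",
                             "2018-05-20", "2018-06-01", "2018-07-10"],
    "last_price": [20000, 25000, 18000, 30000, 22000, 27000],
})
dates = pd.to_datetime(toy["first_day_exposition"])

train_end = pd.Timestamp("2018-04-01")
test_end = pd.Timestamp("2018-06-01")

# Each offer lands in exactly one split, ordered by time.
train = toy[(dates >= pd.Timestamp("2018-01-01")) & (dates < train_end)]
test = toy[(dates >= train_end) & (dates < test_end)]
holdout = toy[dates >= test_end]

print(len(train), len(test), len(holdout))
```

Because the boundaries are half-open intervals, no offer can appear in two splits, and the model never sees offers that are newer than the ones it is evaluated on.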

In [13]:
# select all offers added in the first 3 months of 2018 as the train dataset
# parentheses are required because & binds more tightly than comparison operators such as >=
# the mask is built from rent_df_cleaned itself so that it matches the frame being filtered
train_df = rent_df_cleaned[(rent_df_cleaned.first_day_exposition >= '2018-01-01') 
                          & (rent_df_cleaned.first_day_exposition < '2018-04-01')]
len(train_df)


17007

In [14]:
# all offers added in April and May 2018 as the test dataset
test_df = rent_df_cleaned[(rent_df_cleaned.first_day_exposition >= '2018-04-01') 
                          & (rent_df_cleaned.first_day_exposition < '2018-06-01')]
len(test_df)


In [15]:
# data from 2018-06-01 onwards as a holdout dataset to simulate how the algorithm would work in production
holdout_df = rent_df_cleaned[rent_df_cleaned.first_day_exposition >= '2018-06-01']
len(holdout_df)


In [17]:
# NOTE: this reassigns test_df to all offers added before 2018-04-01,
# replacing the April-May split from In [14]; this range also contains the train_df period
test_df = rent_df_cleaned[rent_df_cleaned.first_day_exposition < '2018-04-01']
len(test_df)


119694

## ML model (CatBoost) and business metrics

Creating functions to test our model using appropriate business metrics.

In [32]:
import numpy as np

In [18]:
# MAPE - mean absolute percentage error
# (scikit-learn 0.24+ also ships sklearn.metrics.mean_absolute_percentage_error,
# which returns a fraction rather than a percentage)
def mean_absolute_percentage_error(y_true, y_pred): 
    # convert the passed pd.Series objects (or lists) to numpy arrays
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    
    # calculate how much each predicted price differs from the real price, relative to the real price
    # np.abs = absolute value element-wise
    diff_true_pred_ratio = np.abs((y_true - y_pred) / y_true)
    # calculate the mean of the difference ratios across all items, as a percentage
    return np.mean(diff_true_pred_ratio) * 100
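To make the MAPE calculation concrete, here is a small worked example with made-up prices (the three values are hypothetical). It repeats the same arithmetic as the function above so that it runs on its own.

```python
import numpy as np

# Hypothetical real and predicted prices for three offers.
y_true = np.array([20000, 25000, 40000])
y_pred = np.array([22000, 20000, 40000])

# Per-offer relative errors: 2000/20000 = 0.10, 5000/25000 = 0.20, 0/40000 = 0.00
relative_errors = np.abs((y_true - y_pred) / y_true)

# Mean of the relative errors, expressed as a percentage.
mape = np.mean(relative_errors) * 100
print(round(mape, 2))  # 10.0
```

Note that the metric divides by the real price, so it is undefined for offers with a real price of zero.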

In [19]:
# import the math library, which we'll need later for calculating metrics
import math
# import functions for calculating metrics from the sklearn library
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

In [21]:
# create a utility function to round prices to the nearest 1000 rubles (halves round up)
def round_price_to_1000_rubles(price):
    return int((price + 500) / 1000) * 1000
# test whether this function works correctly
print(round_price_to_1000_rubles(22000))
print(round_price_to_1000_rubles(22300))
print(round_price_to_1000_rubles(22500))
print(round_price_to_1000_rubles(22600))


22000
22000
23000
23000
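The helper above rounds halves upward, which differs from Python's built-in `round`, which rounds halves to the nearest even value. A minimal comparison (the function name `round_half_up_to_1000` is a hypothetical stand-in that repeats the notebook's arithmetic so the snippet is self-contained):

```python
def round_half_up_to_1000(price):
    # Same arithmetic as the notebook's helper: add 500, then truncate to whole thousands.
    return int((price + 500) / 1000) * 1000

for price in (22500, 23500):
    # Columns: original price, half-up rounding, built-in round to thousands.
    print(price, round_half_up_to_1000(price), round(price, -3))
```

For 22500 the helper returns 23000 while `round(22500, -3)` returns 22000. For 23500 both return 24000.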


In [22]:
# evaluate a model on the passed dataset and print the key business metrics
# analysts might also look at RMSE (root mean squared error), R2 score and MAE (mean absolute error) to compare models
def test_model(model, X_test, y_test):
    # use the model to get predictions for the passed dataset
    y_pred = model.predict(X_test)
    
    # round predicted prices to 1000 rubles
    # map applies the rounding function to each prediction; list materializes the result
    y_pred = list(map(round_price_to_1000_rubles, y_pred))
    
    # relative error between predicted and real price for each offer
    # zip pairs up the elements that share the same position in the two sequences
    error_percents = [math.fabs(pred - test) / test for pred, test in zip(y_pred, y_test.values)]
    
    # print out all metrics for analysis and model comparisons
    print(" ")
    print("rmse: " + str(math.sqrt(mean_squared_error(y_test, y_pred))) + "  ")
    print("r2_score: " + str(r2_score(y_test, y_pred)) + "  ")
    print("mae: " + str(mean_absolute_error(y_test, y_pred)) + "  ")
    print("mape: " + str(mean_absolute_percentage_error(y_test, y_pred)) + "  ")
    
    # print the error level that each percentile of offers stays within
    for percent in [50, 83, 90, 95, 99]:
        print(str(percent) + " percentile: %.1f%%" % (np.percentile(error_percents, percent) * 100.0))
    return y_pred, error_percents
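The percentile lines printed by the function above can be read as "this share of offers has a relative error at or below this level". A small sketch with made-up error values (the array is hypothetical) shows how `np.percentile` produces those numbers:

```python
import numpy as np

# Hypothetical relative errors for ten offers (0.05 means the prediction is off by 5%).
error_shares = np.array([0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30, 0.45, 0.90])

percentile_errors = {}
for percent in [50, 90]:
    # np.percentile interpolates linearly between neighbouring sorted values by default.
    percentile_errors[percent] = np.percentile(error_shares, percent) * 100.0
    print("%d percentile: %.1f%%" % (percent, percentile_errors[percent]))
```

Here half of the offers are predicted within 13.5% of the real price, while the 90th percentile is dominated by the few large misses.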


#### Define a function that builds and trains the CatBoost model

In [24]:
# import ML method for regression from catboost library
from catboost import CatBoostRegressor
# train catboost regression model on the passed training data and return final trained model
def train_catboost_model(X_train, y_train, 
                         learning_rate=0.08,
                         n_estimators=1500,
                         max_depth=7,
                         nthread=10,
                         seed=27):
    # create the catBoost machine learning model
    model = CatBoostRegressor(iterations=n_estimators, 
                                 depth=max_depth,
                                 learning_rate=learning_rate,
                                 logging_level='Silent',
                                 thread_count=nthread,
                                 random_seed=seed)
    # train the model on training dataset
    model.fit(X_train, y_train)
    return model
    

In [27]:
# list all columns available in the cleaned dataset
list(rent_df_cleaned)
# features we use to predict apartment prices
factors = ['floor', 'open_plan', 'rooms', 'studio', 
         'area', 'kitchen_area', 'living_area', 'renovation' ]

In [28]:
X_train = train_df[factors]
X_train.head()

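| | floor | open_plan | rooms | studio | area | kitchen_area | living_area | renovation |
|---|---|---|---|---|---|---|---|---|
| 8 | 12 | False | 1 | False | 36.0 | NaN | NaN | NaN |
| 24 | 9 | False | 1 | False | 32.0 | 7.0 | 18.0 | 1.0 |
| 25 | 4 | False | 1 | False | 38.0 | 8.0 | 18.0 | NaN |
| 26 | 12 | False | 1 | False | 32.0 | NaN | NaN | NaN |
| 27 | 5 | False | 1 | False | 32.0 | 7.0 | 20.0 | NaN |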


In [29]:
y_train = train_df['last_price']
y_train.head()

8     26000
24    17500
25    16000
26    22000
27    20000
Name: last_price, dtype: int64

In [30]:
# train catboost regression model
model = train_catboost_model(X_train, y_train)

In [33]:
# performance on train data
y_pred_train, error_percents_train = test_model(model, X_train, y_train)

 
rmse: 6488.727824500219  
r2_score: 0.890119967612922  
mae: 4323.464044217087  
mape: 16.16723662922012  
50 percentile: 12.5%
83 percentile: 28.0%
90 percentile: 35.3%
95 percentile: 44.0%
99 percentile: 68.8%


In [34]:
# performance on testing data
X_test = test_df[factors]
y_test = test_df.last_price
y_pred_test, error_percents_test = test_model(model, X_test, y_test)

 
rmse: 16806.90198477168  
r2_score: 0.5260486690486061  
mae: 9189.6100807058  
mape: 26.62274940089377  
50 percentile: 18.5%
83 percentile: 45.0%
90 percentile: 60.0%
95 percentile: 80.0%
99 percentile: 132.4%
