## Lab 5 - Predicting Real Estate Data in St. Petersburg
We have data from the Yandex.Realty classifieds site https://realty.yandex.ru containing real estate listings for apartments in St. Petersburg and Leningrad Oblast from 2016 until mid-August 2018. In this lab you'll learn how to apply machine learning algorithms to solve business problems. Accurate price prediction can help detect fraudulent listings automatically and help Yandex.Realty users make better decisions when buying and selling real estate.

Python with machine learning libraries is a popular choice among data scientists for prototyping solutions. We'll take a look at it in this lab.

### Main objectives
After successful completion of the lab work students will be able to:
-	Prepare datasets for machine learning algorithms
-	Apply machine learning for solving price prediction problem
-   Calculate metrics which can help us find out whether our machine learning model is ready for production

### Tasks
- Clean the dataset
- Split the dataset into train, test and holdout datasets
- Apply a decision-tree-based algorithm (gradient boosting with CatBoost) to build an ML (machine learning) model for price prediction
- Calculate business metrics
- Try other algorithms and factors to get a better solution


### 1. Load data with real estate prices

In [None]:
# let's import pandas library and set options to be able to view data right in the browser
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 1000)

In [None]:
# import math library which we'll need later for calculating metrics
import math
from sklearn.metrics import mean_squared_error

In [None]:
# load our dataset and see which data it contains.
spb_df = pd.read_table('../data/spb.real.estate.archive.2018.tsv')

In [None]:
# let's look at random sample of the loaded dataset to understand what's inside
spb_df.sample(5)

In [None]:
# let's check how much data we have
len(spb_df)

### 2. Prepare cleaned dataset with RENT data in St. Petersburg without Oblast
<p>Use the results of our analysis from the previous lab works to clean the dataset
<p>Reminder: the offer_type column distinguishes rent offers from sell offers: 2 stands for RENT, 1 for SELL


#### Prepare dataframe with rent data in city limits

In [None]:
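# keep only rent offers (offer_type == 2)
rent_df = spb_df[spb_df.offer_type == 2]
print("Total rent data size: {}".format(len(rent_df)))
# keep only offers inside the city limits; .copy() gives an independent dataframe,
# so adding columns to it later does not trigger SettingWithCopyWarning
rent_df_spb = rent_df[rent_df.unified_address.str.contains('Россия, Санкт-Петербург')].copy()
print("Rent data size in city limits: {}".format(len(rent_df_spb)))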

#### Calculate price per square meter, get median prices per house and use them to find outliers

In [None]:
# calculate price per sq m, taking both columns from the same dataframe
rent_df_spb['price_per_sq_m'] = rent_df_spb.last_price/rent_df_spb.area

##### Find median price per sq m per house

In [None]:
house_rent_df = rent_df_spb.groupby('unified_address').price_per_sq_m.median().reset_index()

In [None]:
house_rent_df.rename(columns = {'price_per_sq_m': 'house_price_sqm_median'}, inplace = True)

##### Merge rent data with house median prices and inspect outliers

In [None]:
# join the median price of the house to every offer by address
rent_df_spb = rent_df_spb.merge(house_rent_df, on='unified_address')

##### Clean data from the outliers - use results from Lab 4

In [None]:
rent_df_cleaned = rent_df_spb[~((rent_df_spb.price_per_sq_m/rent_df_spb.house_price_sqm_median) > 5)]
rent_df_cleaned = rent_df_cleaned[rent_df_cleaned.last_price < 1000000]
rent_df_cleaned = rent_df_cleaned[~((rent_df_cleaned.price_per_sq_m > 3000) 
                                     & ((rent_df_cleaned.house_price_sqm_median < 1000) 
                                        | (rent_df_cleaned.house_price_sqm_median == rent_df_cleaned.price_per_sq_m)))]
rent_df_cleaned = rent_df_cleaned[~((rent_df_cleaned.price_per_sq_m < 250) 
                               & (rent_df_cleaned.house_price_sqm_median/rent_df_cleaned.price_per_sq_m >= 2))]
rent_df_cleaned = rent_df_cleaned[~((rent_df_cleaned.price_per_sq_m < 200) 
                                          & (rent_df_cleaned.price_per_sq_m == rent_df_cleaned.house_price_sqm_median))]
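The median-based cleaning above can be illustrated on a toy example. This is only a sketch: the addresses, prices and areas below are made up, and only the first rule (price per sq m more than 5 times the house median) is shown.

```python
import pandas as pd

# hypothetical listings: house 'A' has one offer priced far above its neighbours
toy_df = pd.DataFrame({
    'unified_address': ['A', 'A', 'A', 'A', 'B', 'B'],
    'last_price': [25000, 26000, 24000, 250000, 30000, 31000],
    'area': [50.0, 50.0, 50.0, 50.0, 50.0, 50.0],
})
toy_df['price_per_sq_m'] = toy_df.last_price / toy_df.area

# median price per sq m for each house
house_median_df = (toy_df.groupby('unified_address').price_per_sq_m.median()
                   .reset_index()
                   .rename(columns={'price_per_sq_m': 'house_price_sqm_median'}))

# attach the house median to every offer
toy_df = toy_df.merge(house_median_df, on='unified_address')

# drop offers which are more than 5 times more expensive than the house median
ratio = toy_df.price_per_sq_m / toy_df.house_price_sqm_median
toy_cleaned = toy_df[~(ratio > 5)]
print(len(toy_df), len(toy_cleaned))
```

The median is used instead of the mean because a single extreme offer barely moves it, so the outlier still stands out against the typical price of its house.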

### Create training, testing and holdout datasets
We need a holdout dataset to assess the final quality of the algorithm. When several teams build models based on different algorithms and factors, the holdout dataset is used to compare their results.
The testing dataset can be used to evaluate models and tune hyperparameters.
Since our model will be used to predict prices for new offers based on older data, it is better to split by time than to split randomly.
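A time-based split can be sketched on a small made-up dataframe. The assumption here is that dates are stored as ISO-formatted strings, so comparing them as strings orders them chronologically; the toy rows are hypothetical.

```python
import pandas as pd

# hypothetical offers with ISO-formatted publication dates
toy_offers = pd.DataFrame({
    'first_day_exposition': ['2018-01-15', '2018-03-31', '2018-04-01',
                             '2018-05-20', '2018-06-01', '2018-07-10'],
    'last_price': [20000, 25000, 22000, 30000, 27000, 24000],
})
dates = toy_offers.first_day_exposition

# older offers go to train, newer to test, the newest to holdout
toy_train = toy_offers[(dates >= '2018-01-01') & (dates < '2018-04-01')]
toy_test = toy_offers[(dates >= '2018-04-01') & (dates < '2018-06-01')]
toy_holdout = toy_offers[dates >= '2018-06-01']

# the three parts must not share any rows
shared_rows = (set(toy_train.index) & set(toy_test.index)) | (set(toy_test.index) & set(toy_holdout.index))
print(len(toy_train), len(toy_test), len(toy_holdout), len(shared_rows))
```

Using half-open intervals (`>=` on the left, `<` on the right) guarantees that an offer published exactly on a boundary date lands in one part only.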

In [None]:
# select all offers added in the first 3 months of 2018 as the train dataset.
# '&' is an element-wise 'and': a row is selected only when both conditions are satisfied
# always put each condition in parentheses, because '&' binds tighter than comparison operators
# the conditions must be built from the same dataframe that we are filtering
train_df = rent_df_cleaned[(rent_df_cleaned.first_day_exposition >= '2018-01-01') 
                          & (rent_df_cleaned.first_day_exposition < '2018-04-01')]

In [None]:
len(train_df)

In [None]:
# select all offers added in April and May 2018 as the test dataset.
test_df = rent_df_cleaned[(rent_df_cleaned.first_day_exposition >= '2018-04-01') 
                          & (rent_df_cleaned.first_day_exposition < '2018-06-01')]

In [None]:
len(test_df)

In [None]:
# let's use the latest data, from 2018-06-01 onwards, as a holdout dataset to simulate how algorithms would
# behave in production
holdout_df = rent_df_cleaned[rent_df_cleaned.first_day_exposition >= '2018-06-01']

In [None]:
len(holdout_df)

In [None]:
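# NOTE: this cell reassigns test_df to all offers added before 2018-04-01, which overlaps
# the training period; skip it to keep the April-May test split defined above
test_df = rent_df_cleaned[rent_df_cleaned.first_day_exposition < '2018-04-01']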

In [None]:
len(test_df)

### Build ML model with catboost library for predicting real estate prices and test it using business metrics
#### Create functions to test our model using appropriate business metrics

In [None]:
# import numpy library for fast mathematical operations over arrays
# use 'np'  as a short alias
import numpy as np

In [None]:
np.abs?

In [None]:
# create a utility function which takes real and predicted Series and calculates
# MAPE - mean absolute percentage error
# we implement this function ourselves, because older versions of sklearn have no
# built-in implementation (newer versions provide sklearn.metrics.mean_absolute_percentage_error,
# which returns a fraction rather than a percentage)
def mean_absolute_percentage_error(y_true, y_pred): 
    # use np.array function to make arrays out of the passed pd.Series objects
    # for further processing with numpy
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    
    # for each row let's calculate how much our predicted prices differ from real prices
    # np.abs calculates the absolute value element-wise
    diff_true_pred_ratio = np.abs((y_true - y_pred) / y_true)
    # calculate the mean value of the difference ratios across all items
    # and multiply by 100 to get percentages
    return np.mean(diff_true_pred_ratio) * 100
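To see what MAPE measures, here is a small worked example with made-up prices. The function is repeated in compact form so the snippet runs on its own; the two price lists are hypothetical.

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    # same calculation as in the lab: mean of |real - predicted| / real, in percent
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# hypothetical real and predicted monthly rent prices in rubles
real_prices = [20000, 40000]
predicted_prices = [22000, 36000]

# the first prediction is 2000 off out of 20000, the second is 4000 off out of 40000,
# so both are wrong by 10% of the real price
mape = mean_absolute_percentage_error(real_prices, predicted_prices)
print(mape)
```

Both predictions contribute equally although the second one is wrong by twice as many rubles: MAPE weighs errors relative to the real price. It also means the metric is undefined when a real price equals zero.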

In [None]:
# import math library which we'll need later for calculating metrics
import math
# import functions for calculating metrics from sklearn library
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

In [None]:
# let's remind ourselves what these metrics actually mean and how they are calculated
r2_score?

In [None]:
# uncomment the following line to see the documentation on mean_absolute_error
# mean_absolute_error?

In [None]:
# uncomment the following line to see the documentation on mean_squared_error
# mean_squared_error?

In [None]:
# create a utility function to round prices to the nearest 1000 rubles
def round_price_to_1000_rubles(price):
    return int((price + 500) / 1000) * 1000
# let's test, whether this function works correctly
print(round_price_to_1000_rubles(22000))
print(round_price_to_1000_rubles(22300))
print(round_price_to_1000_rubles(22500))
print(round_price_to_1000_rubles(22600))
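A short comparison shows why a hand-written helper may be preferable to Python's built-in `round` here. This is a sketch with arbitrarily chosen prices; the helper is copied so the snippet runs on its own.

```python
def round_price_to_1000_rubles(price):
    # add half of the step, then drop the remainder: halves are always rounded up
    return int((price + 500) / 1000) * 1000

# prices which lie exactly between two thousands
for price in [21500, 22500, 23500]:
    # built-in round() sends halves to the nearest even thousand instead
    print(price, round_price_to_1000_rubles(price), round(price, -3))
```

The built-in `round` uses "round half to even", so 22500 becomes 22000 while 23500 becomes 24000, whereas the helper always rounds halves up. Note that `int()` truncates toward zero, so the helper behaves this way only for non-negative prices.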


In [None]:
# create a utility function which tests the model on the passed dataset
# and prints all important business metrics

# in our case the business is most interested in MAPE (mean absolute percentage error) 
# and in percentiles of the error, to understand for which share of offers we would have
# a certain level of model quality

# analysts might also look at RMSE (root mean squared error), R2 score and 
# MAE (mean absolute error) to compare models
# returns predicted prices and the relative error of each prediction
def test_model(model, X_test, y_test):
    # use model to get predicted results on the passed dataset
    y_pred = model.predict(X_test)
    
    # round predicted prices to 1000 rubles
    # map applies the passed function to each element of y_pred lazily: nothing is computed
    # until we iterate over the result, so we wrap it in list() to get a list
    # of rounded predicted prices
    y_pred = list(map(round_price_to_1000_rubles, y_pred))
    
    # let's calculate the relative error between predicted and real prices
    # zip pairs up the elements with the same index from two sequences of the same size
    error_percents = list(((math.fabs(pred - test) / test) for (pred, test) in zip(y_pred, y_test.values)))
    
    # print out all metrics we need for analysis and model comparisons
    print(" ")
    print("rmse: " + str(math.sqrt(mean_squared_error(y_test, y_pred))) + "  ")
    print("r2_score: " + str(r2_score(y_test, y_pred)) + "  ")
    print("mae: " + str(mean_absolute_error(y_test, y_pred)) + "  ")
    print("mape: " + str(mean_absolute_percentage_error(y_test, y_pred)) + "  ")
    
    # print out which maximum error we have for each percentile
    for percent in [50, 83, 90, 95, 99]:
        print(str(percent) + " percentile: %.1f%%" % (np.percentile(error_percents, percent) * 100.0))
    return y_pred, error_percents
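The percentile lines printed by `test_model` can be read with the help of a toy example. The error shares below are made up; the point is only to show how percentiles complement the mean error.

```python
import numpy as np

# hypothetical relative errors of 10 predictions, the last one is a heavy miss
error_shares = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.25, 0.40, 1.50]

# mean error in percent: a single bad prediction pulls it up noticeably
mean_error = np.mean(error_shares) * 100.0
print("mean error: %.1f%%" % mean_error)

# percentiles in percent: the error which the given share of offers does not exceed
p50 = np.percentile(error_shares, 50) * 100.0
p90 = np.percentile(error_shares, 90) * 100.0
print("50 percentile: %.1f%%" % p50)
print("90 percentile: %.1f%%" % p90)
```

Here half of the offers are predicted with an error of at most 13.5%, although the mean error is 28.7% because of one extreme miss. This is why the business looks at percentiles in addition to MAPE.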


#### Define function which will build catboost model and calculate quality metrics on test dataset

In [None]:
# import ML method for regression from catboost library
from catboost import CatBoostRegressor
# train catboost regression model on the passed training data and return final trained model
def train_catboost_model(X_train, y_train, 
                         # set default hyperparameters, you can read about them 
                         # in the documentation
                         # feel free to play with them and compare results on train and test sets
                         learning_rate=0.08,
                         n_estimators=1500,
                         max_depth=7,
                         nthread=10,
                         seed=27):
    # create the catBoost machine learning model
    model = CatBoostRegressor(iterations=n_estimators, 
                                 depth=max_depth,
                                 learning_rate=learning_rate,
                                 logging_level='Silent',
                                 thread_count=nthread,
                                 random_seed=seed)
    # train the model on training dataset
    model.fit(X_train, y_train)
    return model
    
    

In [None]:
# let's look at which columns we have that can be used to predict apartment prices
list(rent_df_cleaned)

In [None]:
# let's try these factors first
factors = ['floor', 'open_plan', 'rooms', 'studio', 
         'area', 'kitchen_area', 'living_area', 'renovation' ]

In [None]:
X_train = train_df[factors]
X_train.head()

In [None]:
y_train = train_df['last_price']

In [None]:
y_train.head()

In [None]:
# train catboost regression model, it will take some time
model = train_catboost_model(X_train, y_train)

In [None]:
# let's have a look at how it performs on our training data
y_pred_train, error_percents_train = test_model(model, X_train, y_train)

In [None]:
# let's have a look at how it performs on testing data
X_test = test_df[factors]
y_test = test_df.last_price
y_pred_test, error_percents_test = test_model(model, X_test, y_test)

#### Pause and think
We see that the results on the test set are worse. Why do you think this happens?

### Self-check questions
1. What other factors might influence price? Think of the factors which can be actually calculated and included in the model.
2. Compete with other teams to create the best solution. You can play with factors and algorithm parameters to come up with it.