## Lab 1.3 - Predicting Real Estate Data in St. Petersburg
We have data from the Yandex.Realty classifieds service (https://realty.yandex.ru) containing real estate listings for apartments in St. Petersburg and Leningrad Oblast from 2016 to mid-August 2018. In this lab you'll learn how to apply machine learning algorithms to solve business problems. Accurate price prediction can help detect fraudulent listings automatically and help Yandex.Realty users make better decisions when buying and selling real estate.

Python with machine learning libraries is a popular choice among data scientists for prototyping solutions. We'll take a look at it in this lab.

### Main objectives
After successful completion of the lab work students will be able to:
- Apply machine learning to solve a price prediction problem
- Calculate metrics that help us judge whether a machine learning model is ready for production

### Tasks
- Encode the dataset
- Split the dataset into training and validation sets
- Apply a decision tree algorithm to build an ML (machine learning) model for price prediction
- Calculate metrics
- Try other algorithms and factors to get a better solution


### 1. Load data with real estate prices

In [1]:
!python -m pip install --upgrade pip
!python -m pip install --upgrade scikit-learn
!pip install sklearn_pandas


In [2]:
# let's import pandas library and set options to be able to view data right in the browser
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.style as style
from matplotlib import pyplot as plt
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 1000)
style.use('fivethirtyeight')

from sklearn.model_selection import train_test_split
from sklearn.model_selection import ParameterGrid

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor

from sklearn_pandas import DataFrameMapper

from sklearn.impute import SimpleImputer

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.pipeline import make_pipeline, Pipeline

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import PredictionErrorDisplay

from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RandomizedSearchCV

import xgboost as xgb

#import lightgbm as ltb


In [4]:
rent_df_cleaned = pd.read_csv('cleaned_dataset.csv')

In [5]:
rent_df_cleaned.head()

Unnamed: 0,first_day_exposition,last_day_exposition,last_price,open_plan,rooms,area,renovation,last_price_log
0,2015-01-24T00:00:00+03:00,2016-01-19T00:00:00+03:00,20000.0,0,1,28.0,3.0,9.903488
1,2015-11-17T00:00:00+03:00,2016-03-04T00:00:00+03:00,24000.0,0,2,59.0,3.0,10.085809
2,2015-11-17T00:00:00+03:00,2016-04-24T00:00:00+03:00,18000.0,0,1,36.0,3.0,9.798127
3,2016-02-04T00:00:00+03:00,2016-02-28T00:00:00+03:00,18000.0,0,1,39.0,0.0,9.798127
4,2016-02-28T00:00:00+03:00,2016-04-02T00:00:00+03:00,19000.0,0,1,36.0,11.0,9.852194


In [6]:
rent_df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155391 entries, 0 to 155390
Data columns (total 8 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   first_day_exposition  155391 non-null  object 
 1   last_day_exposition   155391 non-null  object 
 2   last_price            155391 non-null  float64
 3   open_plan             155391 non-null  int64  
 4   rooms                 155391 non-null  int64  
 5   area                  155391 non-null  float64
 6   renovation            155391 non-null  float64
 7   last_price_log        155391 non-null  float64
dtypes: float64(4), int64(2), object(2)
memory usage: 9.5+ MB


In [6]:
##renovation_encoded = pd.get_dummies(rent_df_cleaned['renovation'])

### Self-control stops
1. Compete with other teams to create the best solution. You can play with factors and algorithm parameters to come up with it.
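
One systematic way to "play with algorithm parameters" is to enumerate a small grid of candidate settings and compare validation scores. The sketch below uses `ParameterGrid` (already imported above) with a `DecisionTreeRegressor`. The synthetic data, the grid values, and the variable names are assumptions made for illustration and are not taken from the lab dataset.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the listings: price grows with area and room count (assumed relation)
rng = np.random.default_rng(0)
area = rng.uniform(20, 120, size=400)
rooms = rng.integers(1, 5, size=400)
price = 500 * area + 2000 * rooms + rng.normal(0, 3000, size=400)

X = np.column_stack([area, rooms])
X_tr, X_val, y_tr, y_val = train_test_split(X, price, test_size=0.2, random_state=0)

# Every combination of these hypothetical settings is tried once
grid = ParameterGrid({"max_depth": [3, 6, 10], "min_samples_leaf": [1, 8, 32]})

results = []
for params in grid:
    model = DecisionTreeRegressor(random_state=0, **params).fit(X_tr, y_tr)
    # score() returns R^2 on the held-out validation rows
    results.append((model.score(X_val, y_val), params))

best_score, best_params = max(results, key=lambda item: item[0])
print(best_params, round(best_score, 3))
```

Scoring on held-out rows rather than on the training rows keeps the comparison between settings honest. `GridSearchCV`, also imported above, wraps the same idea with cross-validation.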

In [7]:
rent_df_cleaned['f_day_exposition'] = pd.to_datetime(rent_df_cleaned.first_day_exposition)

In [8]:
rent_df_cleaned['l_day_exposition'] = pd.to_datetime(rent_df_cleaned.last_day_exposition)

In [9]:
rent_df_cleaned['Difference'] = (rent_df_cleaned['l_day_exposition'] - rent_df_cleaned['f_day_exposition']).dt.days
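
The `Difference` column is the number of whole days a listing stayed on the site. As a quick sanity check, the same arithmetic can be reproduced with the standard library on the first two rows shown by `head()` above. The helper name `exposure_days` is hypothetical.

```python
from datetime import datetime

def exposure_days(first_day: str, last_day: str) -> int:
    """Return the whole days between two ISO 8601 timestamps."""
    start = datetime.fromisoformat(first_day)
    end = datetime.fromisoformat(last_day)
    return (end - start).days

# Timestamps copied from rows 0 and 1 of the dataset preview
row0 = exposure_days("2015-01-24T00:00:00+03:00", "2016-01-19T00:00:00+03:00")
row1 = exposure_days("2015-11-17T00:00:00+03:00", "2016-03-04T00:00:00+03:00")
print(row0, row1)  # 360 108
```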

In [11]:
# Drop the log-price column and the raw and parsed date columns; the listing duration stays in 'Difference'
rent_df_cleaned.drop(
    columns=['last_price_log', 'first_day_exposition', 'last_day_exposition',
             'f_day_exposition', 'l_day_exposition'],
    inplace=True,
)

In [12]:
rent_df_cleaned.head()

Unnamed: 0,last_price,open_plan,rooms,area,renovation,Difference
0,20000.0,0,1,28.0,3.0,360
1,24000.0,0,2,59.0,3.0,108
2,18000.0,0,1,36.0,3.0,159
3,18000.0,0,1,39.0,0.0,24
4,19000.0,0,1,36.0,11.0,34


In [13]:
#renovation_encoded = pd.get_dummies(rent_df_cleaned, columns=['renovation','open_plan','rooms'])

In [13]:
# Separate the features (X) from the target (y), then hold out 20% of the rows for testing
X, y = rent_df_cleaned.drop(columns=['last_price']), rent_df_cleaned.last_price
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
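
Without a fixed seed, `train_test_split` shuffles the rows differently on every run, so metrics can change between runs. A minimal sketch of a reproducible split, assuming a seed of 42 (an arbitrary choice) and a toy array in place of the real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 feature columns (illustrative only)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Passing the same random_state yields the same partition each time
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = split_a
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(X_tr_toy.shape, X_te_toy.shape, same_split)  # (8, 2) (2, 2) True
```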

In [14]:
numeric_features = ['area', 'Difference']
nominal_features = ['renovation','open_plan','rooms']

In [15]:
# Impute missing numeric values and one-hot encode the nominal features
mapper = DataFrameMapper(
    [([feature], SimpleImputer()) for feature in numeric_features]
    + [([feature], OneHotEncoder(handle_unknown='ignore')) for feature in nominal_features],
    df_out=True)

# Other final estimators to try here: ElasticNet, RandomForestRegressor,
# GradientBoostingRegressor, LinearRegression, or RandomizedSearchCV over XGBRegressor
pipeline = Pipeline(steps=[('preprocessing', mapper),
                           ('scaler', StandardScaler()),
                           # "reg:squarederror" replaces the deprecated objective name "reg:linear"
                           ('xgb', xgb.XGBRegressor(objective="reg:squarederror", random_state=42))])
pipeline

In [28]:
result = mapper.fit_transform(rent_df_cleaned)
result.head()

Unnamed: 0,renovation_0,renovation_1,renovation_2,renovation_3,renovation_4,renovation_5,renovation_6,renovation_7,renovation_8,renovation_9,renovation_10,open_plan_0,open_plan_1,rooms_0,rooms_1,rooms_2,rooms_3,rooms_4,rooms_5
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
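
Each nominal value becomes a row of zeros with a single 1.0 in the column of its category. With `handle_unknown='ignore'`, a value that was not seen during fitting is encoded as all zeros instead of raising an error. The following is a hand-rolled illustration of that behaviour in plain Python, not scikit-learn's implementation; the category values and helper names are made up.

```python
def fit_categories(values):
    """Collect the distinct categories seen during fitting, in sorted order."""
    return sorted(set(values))

def one_hot(value, categories):
    """Encode one value; an unseen value yields an all-zero row."""
    return [1.0 if value == category else 0.0 for category in categories]

# Hypothetical renovation codes observed in the training data
categories = fit_categories([0.0, 3.0, 11.0, 3.0])

seen = one_hot(3.0, categories)
unseen = one_hot(7.0, categories)
print(seen)    # [0.0, 1.0, 0.0]
print(unseen)  # [0.0, 0.0, 0.0]
```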


In [18]:
mapper = DataFrameMapper(
    [([feature], SimpleImputer()) for feature in numeric_features]
    + [([feature], OneHotEncoder(handle_unknown='ignore')) for feature in nominal_features],
    df_out=True)

# Same preprocessing, with a decision tree as the final estimator
pipeline = Pipeline(steps=[('preprocessing', mapper),
                           ('scaler', StandardScaler()),
                           ('tree', DecisionTreeRegressor(max_depth=10, min_samples_leaf=8, max_features=4))])
pipeline

In [21]:
mapper = DataFrameMapper(
    [([feature], SimpleImputer()) for feature in numeric_features]
    + [([feature], OneHotEncoder(handle_unknown='ignore')) for feature in nominal_features],
    df_out=True)

# Same preprocessing, with a random forest as the final estimator
pipeline = Pipeline(steps=[('preprocessing', mapper),
                           ('scaler', StandardScaler()),
                           ('forest', RandomForestRegressor(random_state=0))])
pipeline

In [16]:
pipeline.fit(X_train, y_train)



In [17]:
# Evaluate the fitted pipeline on the training set
train_pred = pipeline.predict(X_train)
# Take the square root of the MSE directly; the squared=False argument is deprecated in newer scikit-learn
train_rmse = mean_squared_error(y_true=y_train, y_pred=train_pred) ** 0.5
train_mape = mean_absolute_percentage_error(y_true=y_train, y_pred=train_pred)
# For a regressor, score() returns the coefficient of determination R^2, not accuracy
train_r2 = pipeline.score(X_train, y_train)

print(f'RMSE train = {round(train_rmse, 3)}')
print(f'MAPE train = {round(train_mape, 3)}')
print(f'R^2 train = {round(train_r2, 3)}')

RMSE train = 10995.539
MAPE train = 0.206
R^2 train = 0.709


In [24]:
# Evaluate the fitted pipeline on the held-out test set
test_pred = pipeline.predict(X_test)
# Take the square root of the MSE directly; the squared=False argument is deprecated in newer scikit-learn
test_rmse = mean_squared_error(y_true=y_test, y_pred=test_pred) ** 0.5
test_mape = mean_absolute_percentage_error(y_true=y_test, y_pred=test_pred)
# For a regressor, score() returns the coefficient of determination R^2, not accuracy
test_r2 = pipeline.score(X_test, y_test)

print(f'RMSE test = {round(test_rmse, 3)}')
print(f'MAPE test = {round(test_mape, 3)}')
print(f'R^2 test = {round(test_r2, 3)}')

RMSE test = 13022.87
MAPE test = 0.231
R^2 test = 0.584
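
To make the three reported numbers concrete, here is a minimal plain-Python sketch of how RMSE, MAPE, and the regressor score (R^2) are defined. The toy values are invented for illustration. scikit-learn's own MAPE additionally guards against division by zero, which this simplified version does not.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.2 means 20%)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy targets and predictions (made up)
y_true_toy = [100.0, 200.0, 300.0]
y_pred_toy = [110.0, 190.0, 330.0]

print(round(rmse(y_true_toy, y_pred_toy), 3))
print(round(mape(y_true_toy, y_pred_toy), 3))
print(round(r2(y_true_toy, y_pred_toy), 3))
```

RMSE is expressed in price units, MAPE is relative to each true price, and R^2 compares the model against always predicting the mean, so the three are best read together.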
