# ML part of the project #
## We considered four approaches to model training ##

In [71]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from catboost import CatBoostRegressor

In [None]:
df = pd.read_csv('../data/processed/scraped_cian_processed.csv')

The first approach was to fit a linear regression on all categorical features, encoded as one-hot labels.
This gave the following metrics on the test set:
$$
R^2 = 0.70
$$
$$
\mathrm{MSE} = 25.55
$$
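
For reference, the two metrics reported throughout this notebook are assumed to follow the standard definitions used by scikit-learn's `mean_squared_error` and `r2_score`. Here $y_i$ is the true price, $\hat{y}_i$ the predicted price, $\bar{y}$ the mean true price on the test set, and $n$ the number of test examples (this notation is ours, not from the original notebook):
$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
$$
$$
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$
A lower MSE and an $R^2$ closer to 1 both indicate a better fit.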

In [40]:
# One-hot encode every categorical column, including the high-cardinality 'street'.
df_processed = pd.get_dummies(df, columns=['author_type', 'underground', 'district', 'street'])
# 'price' is the target; the coordinates are not used as features.
X = df_processed.drop(columns=['price', 'lats', 'lons'])
y = df_processed['price']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
mse = mean_squared_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)
print(r2)
mse

0.6970619990056001


25.553354461661847
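
To make the one-hot step of the first approach concrete, here is a minimal sketch on a toy frame. The column `total_meters` and all values are hypothetical and only illustrate how `pd.get_dummies` turns each distinct category into its own indicator column:

```python
import pandas as pd

# Toy listings (hypothetical values, not taken from the CIAN data).
toy = pd.DataFrame({
    "total_meters": [30.0, 45.0, 60.0],
    "district": ["A", "B", "A"],
})

# Each distinct category becomes its own indicator column;
# the original 'district' column is replaced by them.
encoded = pd.get_dummies(toy, columns=["district"])
print(list(encoded.columns))
# ['total_meters', 'district_A', 'district_B']
```

With a column that has hundreds of distinct values, the same call adds hundreds of mostly-zero columns, which is the situation a feature like `street` creates.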

Next we removed the 'street' feature, because the number of listings on each street rarely reached 5. The test metrics improved:
$$
R^2 = 0.79
$$
$$
\mathrm{MSE} = 18.07
$$
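
The sparsity argument for a column like `street` can be checked with a count per category. The sketch below uses toy data with hypothetical street names and a hypothetical threshold `min_listings`. On the real frame the same calls would be applied to `df['street']`:

```python
import pandas as pd

# Toy listings (hypothetical street names, not taken from the CIAN data).
listings = pd.DataFrame({
    "street": ["Lenina", "Lenina", "Mira", "Tverskaya", "Arbat", "Arbat", "Arbat"],
})

# Number of listings per street, most frequent first.
per_street = listings["street"].value_counts()

# Share of streets that have fewer than `min_listings` examples.
min_listings = 3
sparse_share = (per_street < min_listings).mean()
print(per_street.max(), sparse_share)
```

A high share of sparse categories suggests that the one-hot columns for that feature carry too few examples each to be estimated reliably.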

In [47]:
# One-hot encode the remaining categorical columns; 'street' is left out and dropped below.
df_processed_without_streets = pd.get_dummies(df, columns=['author_type', 'district', 'underground'])
X = df_processed_without_streets.drop(columns=['price', 'street', 'lats', 'lons'])
Y = df_processed_without_streets['price']
# Same seed and test size as in the first approach, so the metrics are comparable.
X_train_streets, X_test_streets, Y_train_streets, Y_test_streets = train_test_split(X, Y, random_state=42, test_size=0.2)
model_streets = LinearRegression()
model_streets.fit(X_train_streets, Y_train_streets)
Y_pred_streets = model_streets.predict(X_test_streets)
mse_streets = mean_squared_error(Y_test_streets, Y_pred_streets)
r2_score_streets = r2_score(Y_test_streets, Y_pred_streets)
print(r2_score_streets)
mse_streets

0.7858213233847932


18.066348968148098

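Then we preprocessed the categorical features with mean target encoding, which replaces each category with the mean price of its listings.
The results stayed at about the same level (slightly lower):
$$
R^2 = 0.78
$$
$$
\mathrm{MSE} = 19.46
$$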

In [48]:
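categorical = ['author_type', 'underground', 'district']
df_mean_target = df.drop(columns=['street', 'lats', 'lons'])
# Replace each category with the mean price of its listings.
# Note: the means are computed on the full dataset, before the train/test split.
for feature in categorical:
    mean_prices = df_mean_target.groupby(feature)['price'].mean()
    df_mean_target[feature] = df_mean_target[feature].map(mean_prices)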

In [69]:
X_mean_target = df_mean_target.drop(columns='price')
Y_mean_target = df_mean_target['price']
# No random_state is set here, so the split (and the metrics below) vary between runs.
X_train_mean_target, X_test_mean_target, Y_train_mean_target, Y_test_mean_target = train_test_split(X_mean_target, Y_mean_target, test_size=0.2, shuffle=True)
model_mean_target = LinearRegression()
model_mean_target.fit(X_train_mean_target, Y_train_mean_target)
Y_pred_mean_target = model_mean_target.predict(X_test_mean_target)
mse_mean_target = mean_squared_error(Y_test_mean_target, Y_pred_mean_target)
r2_score_mean_target = r2_score(Y_test_mean_target, Y_pred_mean_target)
print(r2_score_mean_target)
mse_mean_target

0.7777544013160519


19.464742875354798
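
The encoding above computes category means on the whole dataset, so the test rows contribute to their own features. A common alternative is to compute the means on the training part only. The sketch below is one possible way to do that, not the method used in this notebook. The helper `target_encode` and the toy values are hypothetical:

```python
import pandas as pd

def target_encode(train, test, column, target):
    """Encode `column` with per-category means of `target` computed on `train` only."""
    means = train.groupby(column)[target].mean()
    global_mean = train[target].mean()
    train_encoded = train[column].map(means)
    # Categories unseen in training fall back to the global training mean.
    test_encoded = test[column].map(means).fillna(global_mean)
    return train_encoded, test_encoded

# Toy data (hypothetical values, not taken from the CIAN data).
train = pd.DataFrame({"district": ["A", "A", "B", "B"],
                      "price": [10.0, 14.0, 20.0, 24.0]})
test = pd.DataFrame({"district": ["A", "C"],
                     "price": [11.0, 30.0]})

train_enc, test_enc = target_encode(train, test, "district", "price")
print(test_enc.tolist())
# [12.0, 17.0]
```

District `A` maps to its training mean (12.0), while the unseen district `C` falls back to the global training mean (17.0). The test prices are never used to build the encoding.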

Finally, we wanted to try a different regressor.
We chose Yandex's CatBoostRegressor, which can preprocess categorical features by itself.
The results exceeded our expectations:
$$
R^2 = 0.85
$$
$$
\mathrm{MSE} = 12.74
$$


In [79]:
# Categorical columns stay as raw values; CatBoost encodes them internally.
X_catboost = df.drop(columns=['price', 'lats', 'lons', 'street'])
Y_catboost = df['price']
X_train_catboost, X_test_catboost, Y_train_catboost, Y_test_catboost = train_test_split(X_catboost, Y_catboost, random_state=42, test_size=0.2)
catboost_model = CatBoostRegressor()
# `categorical` is the list of column names defined in cell In [48].
catboost_model.fit(X_train_catboost, Y_train_catboost, cat_features=categorical)

Learning rate set to 0.047696
0:	learn: 8.8378131	total: 2.27ms	remaining: 2.27s
1:	learn: 8.6038519	total: 4.94ms	remaining: 2.47s
2:	learn: 8.3777559	total: 6.34ms	remaining: 2.11s
3:	learn: 8.1628736	total: 7.71ms	remaining: 1.92s
4:	learn: 7.9630356	total: 8.95ms	remaining: 1.78s
5:	learn: 7.7738507	total: 10.3ms	remaining: 1.71s
6:	learn: 7.5866327	total: 11.5ms	remaining: 1.64s
7:	learn: 7.4098670	total: 12.6ms	remaining: 1.57s
8:	learn: 7.2463613	total: 14ms	remaining: 1.54s
9:	learn: 7.1087736	total: 15.6ms	remaining: 1.55s
10:	learn: 6.9627025	total: 17.2ms	remaining: 1.55s
11:	learn: 6.8356517	total: 18.9ms	remaining: 1.55s
12:	learn: 6.7079966	total: 20.8ms	remaining: 1.58s
13:	learn: 6.5921445	total: 22.2ms	remaining: 1.56s
14:	learn: 6.4818028	total: 23.5ms	remaining: 1.54s
15:	learn: 6.3753156	total: 24.9ms	remaining: 1.53s
16:	learn: 6.2724067	total: 26.3ms	remaining: 1.52s
17:	learn: 6.1826425	total: 27.7ms	remaining: 1.51s
18:	learn: 6.0993605	total: 28.9ms	remaining: 

<catboost.core.CatBoostRegressor at 0x15b4edf70>

In [80]:
Y_predict_catboost = catboost_model.predict(X_test_catboost)
mse_catboost = mean_squared_error(Y_test_catboost, Y_predict_catboost)
r2_catboost = r2_score(Y_test_catboost, Y_predict_catboost)
print(r2_catboost)
mse_catboost

0.8490093992614303


12.736323367774704

In [1]:
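# Run this in the same kernel session as the training cell; after a restart
# `catboost_model` is no longer defined.
# CatBoost writes its native binary format ('cbm') by default, whatever the file extension.
catboost_model.save_model('../models/catboost_model.pth')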

NameError: name 'catboost_model' is not defined

In [1]:
from src import CianPredictionModel
model = CianPredictionModel()
model.fit_estimate("../data/processed/scraped_cian_processed.csv")

Learning rate set to 0.047696
0:	learn: 8.8527585	total: 63.1ms	remaining: 1m 3s
1:	learn: 8.6129873	total: 64.4ms	remaining: 32.1s
2:	learn: 8.3975083	total: 66.1ms	remaining: 22s
3:	learn: 8.1907062	total: 67.4ms	remaining: 16.8s
4:	learn: 7.9960353	total: 68.6ms	remaining: 13.6s
5:	learn: 7.8063894	total: 70.1ms	remaining: 11.6s
6:	learn: 7.6299614	total: 71.9ms	remaining: 10.2s
7:	learn: 7.4658808	total: 73ms	remaining: 9.05s
8:	learn: 7.3137659	total: 74.3ms	remaining: 8.18s
9:	learn: 7.1855122	total: 75.2ms	remaining: 7.44s
10:	learn: 7.0458403	total: 76.4ms	remaining: 6.87s
11:	learn: 6.9086413	total: 77.7ms	remaining: 6.39s
12:	learn: 6.7869003	total: 79.7ms	remaining: 6.05s
13:	learn: 6.6673815	total: 81ms	remaining: 5.71s
14:	learn: 6.5520853	total: 82.5ms	remaining: 5.42s
15:	learn: 6.4533907	total: 83.7ms	remaining: 5.15s
16:	learn: 6.3515019	total: 85ms	remaining: 4.92s
17:	learn: 6.2572181	total: 86ms	remaining: 4.69s
18:	learn: 6.1696235	total: 87.8ms	remaining: 4.54s
19