# Regression: Sberbank Russian Housing Market

## Dataset Source
The data come from the Kaggle competition "Sberbank Russian Housing Market", available at https://www.kaggle.com/competitions/sberbank-russian-housing-market/data.

In [1]:
with open("data_dictionary.txt", 'r') as f:
    print(f.read())

# train.csv and test.csv

price_doc: sale price (this is the target variable)
id: transaction id
timestamp: date of transaction
full_sq: total area in square meters, including loggias, balconies and other non-residential areas
life_sq: living area in square meters, excluding loggias, balconies and other non-residential areas
floor: for apartments, floor of the building
max_floor: number of floors in the building
material: wall material
build_year: year built
num_room: number of living rooms
kitch_sq: kitchen area
state: apartment condition
product_type: owner-occupier purchase or investment
sub_area: name of the district

The dataset also includes a collection of features about each property's surrounding neighbourhood, and some features that are constant across each sub area (known as a Raion). Most of the feature names are self explanatory, with the following notes. See below for a complete list.

full_all: subarea population
male_f, female_f: subarea population by gender
young_*: po

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
df = pd.read_csv("train.csv")

In [4]:
df.head()

Unnamed: 0,id,timestamp,full_sq,life_sq,floor,max_floor,material,build_year,num_room,kitch_sq,...,cafe_count_5000_price_2500,cafe_count_5000_price_4000,cafe_count_5000_price_high,big_church_count_5000,church_count_5000,mosque_count_5000,leisure_count_5000,sport_count_5000,market_count_5000,price_doc
0,1,2011-08-20,43,27.0,4.0,,,,,,...,9,4,0,13,22,1,0,52,4,5850000
1,2,2011-08-23,34,19.0,3.0,,,,,,...,15,3,0,15,29,1,10,66,14,6000000
2,3,2011-08-27,43,29.0,2.0,,,,,,...,10,3,0,11,27,0,4,67,10,5700000
3,4,2011-09-01,89,50.0,9.0,,,,,,...,11,2,1,4,4,0,0,26,3,13100000
4,5,2011-09-05,77,77.0,4.0,,,,,,...,319,108,17,135,236,2,91,195,14,16331452


In [5]:
df.shape

(30471, 292)

In [6]:
df.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30471 entries, 0 to 30470
Data columns (total 292 columns):
 #    Column                                 Dtype  
---   ------                                 -----  
 0    id                                     int64  
 1    timestamp                              object 
 2    full_sq                                int64  
 3    life_sq                                float64
 4    floor                                  float64
 5    max_floor                              float64
 6    material                               float64
 7    build_year                             float64
 8    num_room                               float64
 9    kitch_sq                               float64
 10   state                                  float64
 11   product_type                           object 
 12   sub_area                               object 
 13   area_m                                 float64
 14   raion_popul                         

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30471 entries, 0 to 30470
Columns: 292 entries, id to price_doc
dtypes: float64(119), int64(157), object(16)
memory usage: 67.9+ MB


## Exploratory Data Analysis

### Statistical Data Analysis
#### Descriptive Analysis

In [8]:
def missing_val_dt(df):
    data = {
    "Pandas Dtype": [],
    "Missing Values": [],
    "%Missing Values": []
    }
    index = []
    for column, dtype in df.dtypes.items():
        if df[column].isnull().sum() != 0:
            data["Pandas Dtype"].append(dtype)
            data["Missing Values"].append(df[column].isnull().sum())
            data["%Missing Values"].append(df[column].isnull().sum() / df.shape[0] * 100)
            index.append(column)
    desc_df = pd.DataFrame(data, index = index)
    return desc_df.sort_values(by = "Missing Values", ascending = False)

missing_val_dt(df)

Unnamed: 0,Pandas Dtype,Missing Values,%Missing Values
hospital_beds_raion,float64,14441,47.392603
build_year,float64,13605,44.649011
state,float64,13559,44.498047
cafe_sum_500_max_price_avg,float64,13281,43.585704
cafe_sum_500_min_price_avg,float64,13281,43.585704
cafe_avg_price_500,float64,13281,43.585704
max_floor,float64,9572,31.413475
material,float64,9572,31.413475
num_room,float64,9572,31.413475
kitch_sq,float64,9572,31.413475
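
The per-column loop in `missing_val_dt` can also be written with vectorized pandas operations. The sketch below is one possible equivalent, not the notebook's own code; `missing_summary` and the toy frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Vectorized per-column missing-value summary (hypothetical helper)."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "Pandas Dtype": frame.dtypes,
        "Missing Values": counts,
        "%Missing Values": counts / len(frame) * 100,
    })
    # Keep only the columns that have at least one missing value
    summary = summary[summary["Missing Values"] > 0]
    return summary.sort_values(by="Missing Values", ascending=False)

# Toy frame standing in for train.csv
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1, 2, 3, 4],
    "c": ["x", None, "z", "w"],
})
summary = missing_summary(toy)
print(summary)
```

Calling `isnull().sum()` once for the whole frame avoids recomputing the mask three times per column, which matters little here but keeps the helper short.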


In [9]:
df.describe()

Unnamed: 0,id,full_sq,life_sq,floor,max_floor,material,build_year,num_room,kitch_sq,state,...,cafe_count_5000_price_2500,cafe_count_5000_price_4000,cafe_count_5000_price_high,big_church_count_5000,church_count_5000,mosque_count_5000,leisure_count_5000,sport_count_5000,market_count_5000,price_doc
count,30471.0,30471.0,24088.0,30304.0,20899.0,20899.0,16866.0,20899.0,20899.0,16912.0,...,30471.0,30471.0,30471.0,30471.0,30471.0,30471.0,30471.0,30471.0,30471.0,30471.0
mean,15237.917397,54.214269,34.403271,7.670803,12.558974,1.827121,3068.057,1.909804,6.399301,2.107025,...,32.058318,10.78386,1.771783,15.045552,30.251518,0.442421,8.648814,52.796593,5.98707,7123035.0
std,8796.501536,38.031487,52.285733,5.319989,6.75655,1.481154,154387.8,0.851805,28.265979,0.880148,...,73.465611,28.385679,5.418807,29.118668,47.347938,0.609269,20.580741,46.29266,4.889219,4780111.0
min,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100000.0
25%,7620.5,38.0,20.0,3.0,9.0,1.0,1967.0,1.0,1.0,1.0,...,2.0,1.0,0.0,2.0,9.0,0.0,0.0,11.0,1.0,4740002.0
50%,15238.0,49.0,30.0,6.5,12.0,1.0,1979.0,2.0,6.0,2.0,...,8.0,2.0,0.0,7.0,16.0,0.0,2.0,48.0,5.0,6274411.0
75%,22855.5,63.0,43.0,11.0,17.0,2.0,2005.0,2.0,9.0,3.0,...,21.0,5.0,1.0,12.0,28.0,1.0,7.0,76.0,10.0,8300000.0
max,30473.0,5326.0,7478.0,77.0,117.0,6.0,20052010.0,19.0,2014.0,33.0,...,377.0,147.0,30.0,151.0,250.0,2.0,106.0,218.0,21.0,111111100.0


In [10]:
def imbalance_level(value_counts):
    dominant_value_count = max(value_counts)
    total_counts = sum(value_counts)
    
    if dominant_value_count / total_counts > 0.9:
        return 'Severely Imbalanced'
    elif dominant_value_count / total_counts > 0.7:
        return 'Highly Imbalanced'
    elif dominant_value_count / total_counts > 0.3:
        return 'Moderately Imbalanced'
    else:
        return 'Balanced'

def few_valued(df):
    numeric_columns = [col for col in df.columns if df[col].dtype != 'object']
    data = {
        'column': [],
        'number_of_unique_values': [],
        '%unique_values': [],
        'correlation_with_target': [],
        'imbalance_level': []
    }
    correlation_matrix = df[numeric_columns].corr().abs()

    for col in numeric_columns:
        unique_values = df[col].unique()
        value_counts = df[col].value_counts()
        nunique = len(unique_values)
        percentage = float(nunique) / df.shape[0] * 100
        correlation = correlation_matrix.loc[col, 'price_doc']
        imbalance = imbalance_level(value_counts)

        data['column'].append(col)
        data['number_of_unique_values'].append(nunique)
        data['%unique_values'].append(percentage)
        data['correlation_with_target'].append(correlation)
        data['imbalance_level'].append(imbalance)

    result_df = pd.DataFrame(data = data).set_index('column').sort_values(by = ['%unique_values', 'correlation_with_target'], ascending = [True, True])
    return result_df

In [11]:
few_val_df = few_valued(df)
few_val_df = few_val_df.drop('price_doc', axis=0)
few_val_df

column,number_of_unique_values,%unique_values,correlation_with_target,imbalance_level
mosque_count_500,2,0.006564,0.018474,Severely Imbalanced
mosque_count_1000,2,0.006564,0.089308,Severely Imbalanced
mosque_count_2000,2,0.006564,0.100228,Severely Imbalanced
mosque_count_1500,2,0.006564,0.111034,Severely Imbalanced
mosque_count_3000,3,0.009845,0.096199,Highly Imbalanced
...,...,...,...,...
ttk_km,11852,38.895999,0.272620,Balanced
bulvar_ring_km,11852,38.895999,0.279158,Balanced
kremlin_km,11852,38.895999,0.279249,Balanced
sadovoe_km,11852,38.895999,0.283622,Balanced


In [12]:
object_col = [column for column in df.columns if df[column].dtype == 'object']
df[object_col].nunique()

timestamp                    1161
product_type                    2
sub_area                      146
culture_objects_top_25          2
thermal_power_plant_raion       2
incineration_raion              2
oil_chemistry_raion             2
radiation_raion                 2
railroad_terminal_raion         2
big_market_raion                2
nuclear_reactor_raion           2
detention_facility_raion        2
water_1line                     2
big_road1_1line                 2
railroad_1line                  2
ecology                         5
dtype: int64

In [13]:
del object_col[0]

In [14]:
for col in object_col:
    print(df[col].value_counts())

product_type
Investment       19448
OwnerOccupier    11023
Name: count, dtype: int64
sub_area
Poselenie Sosenskoe               1776
Nekrasovka                        1611
Poselenie Vnukovskoe              1372
Poselenie Moskovskij               925
Poselenie Voskresenskoe            713
                                  ... 
Molzhaninovskoe                      3
Poselenie Shhapovskoe                2
Poselenie Kievskij                   2
Poselenie Klenovskoe                 1
Poselenie Mihajlovo-Jarcevskoe       1
Name: count, Length: 146, dtype: int64
culture_objects_top_25
no     28543
yes     1928
Name: count, dtype: int64
thermal_power_plant_raion
no     28817
yes     1654
Name: count, dtype: int64
incineration_raion
no     28155
yes     2316
Name: count, dtype: int64
oil_chemistry_raion
no     30175
yes      296
Name: count, dtype: int64
radiation_raion
no     19600
yes    10871
Name: count, dtype: int64
railroad_terminal_raion
no     29335
yes     1136
Name: count, dtype: int6

#### Correlation Analysis

##### Correlation between numerical features

In [15]:
import itertools

def get_top_abs_correlations_df(df, target):
    data = {
        'column_1': [],
        'column_2': [],
        'correlation': []
    }
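    # Exclude the target itself (the original condition referenced the undefined loop variable `col`)
    numeric_columns = [column for column in df.columns if df[column].dtype != 'object' and column != target]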
    corr_matrix = df[numeric_columns].corr().abs()
    column_pairs = list(itertools.combinations(numeric_columns, 2))
    
    for col1, col2 in column_pairs:
        data['column_1'].append(col1)
        data['column_2'].append(col2)
        data['correlation'].append(corr_matrix.loc[col1, col2])

    df_correlations = pd.DataFrame(data)
    return df_correlations.sort_values(by = 'correlation', ascending = False).reset_index(drop = True)

multicollin = get_top_abs_correlations_df(df, 'price_doc')
multicollin.head(500)

Unnamed: 0,column_1,column_2,correlation
0,public_transport_station_km,public_transport_station_min_walk,1.000000
1,metro_min_walk,metro_km_walk,1.000000
2,children_preschool,0_6_all,1.000000
3,children_school,7_14_all,1.000000
4,railroad_station_walk_km,railroad_station_walk_min,1.000000
...,...,...,...
495,cafe_count_1500_price_500,cafe_count_3000_price_1500,0.972713
496,cafe_count_2000_price_500,leisure_count_2000,0.972696
497,cafe_count_5000_price_2500,cafe_count_5000_price_high,0.972604
498,big_church_count_2000,big_church_count_3000,0.972478
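
Building the pair list with `itertools.combinations` and one `.loc` lookup per pair is slow for several hundred columns. A minimal sketch of the same idea using the upper triangle of the correlation matrix follows; `top_abs_correlations` and the toy data are hypothetical, and the result should match the function above only up to the ordering of tied values.

```python
import numpy as np
import pandas as pd

def top_abs_correlations(frame: pd.DataFrame, target: str) -> pd.DataFrame:
    """Absolute pairwise correlations of numeric features, each pair listed once."""
    numeric = frame.select_dtypes(include="number").drop(columns=[target])
    corr = numeric.corr().abs()
    cols = corr.columns.to_numpy()
    # Index pairs (i, j) with i < j: the strict upper triangle, no diagonal
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "column_1": cols[i],
        "column_2": cols[j],
        "correlation": corr.to_numpy()[i, j],
    })
    return pairs.sort_values(by="correlation", ascending=False).reset_index(drop=True)

# Toy data: y is an exact multiple of x, so that pair should rank first
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
    "price": [10, 12, 9, 15, 11],
})
pairs = top_abs_correlations(toy, "price")
print(pairs)
```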


##### Correlation between numerical features and target

In [16]:
numeric_columns = [col for col in df.columns if df[col].dtype != 'object']
corr_nft = df[numeric_columns].corr().abs().loc[:, 'price_doc']

In [17]:
corr_nft.sort_values()

trc_sqm_500                    0.000374
build_year                     0.002161
cafe_sum_3000_max_price_avg    0.002200
cafe_avg_price_3000            0.003339
cafe_sum_3000_min_price_avg    0.005119
                                 ...   
sport_count_3000               0.290651
sport_count_5000               0.294864
full_sq                        0.341840
num_room                       0.476337
price_doc                      1.000000
Name: price_doc, Length: 276, dtype: float64

##### Correlation between categorical features

In [18]:
from scipy.stats import chi2_contingency

def get_top_chi_squared_tests_df(df):
    data = {
        'column_1': [],
        'column_2': [],
        'chi_squared': [],
        'p_value': []
    }
    object_columns = [column for column in df.columns if df[column].dtype == 'object' and column != 'timestamp']
    column_pairs = list(itertools.combinations(object_columns, 2))
    
    for col1, col2 in column_pairs:
        contingency_table = pd.crosstab(df[col1], df[col2])
        chi2, p_value, _, _ = chi2_contingency(contingency_table)
        data['column_1'].append(col1)
        data['column_2'].append(col2)
        data['chi_squared'].append(chi2)
        data['p_value'].append(p_value)

    df_chi_squared_tests = pd.DataFrame(data)
    return df_chi_squared_tests.sort_values(by='p_value', ascending=True).reset_index(drop=True)


chi_squared_tests = get_top_chi_squared_tests_df(df)
chi_squared_tests


Unnamed: 0,column_1,column_2,chi_squared,p_value
0,product_type,sub_area,20556.069224,0.000000
1,culture_objects_top_25,detention_facility_raion,1447.864722,0.000000
2,culture_objects_top_25,ecology,6457.562959,0.000000
3,sub_area,ecology,102073.604482,0.000000
4,sub_area,railroad_1line,8593.594347,0.000000
...,...,...,...,...
100,thermal_power_plant_raion,incineration_raion,0.949871,0.329752
101,big_market_raion,detention_facility_raion,0.924142,0.336390
102,thermal_power_plant_raion,railroad_terminal_raion,0.474870,0.490755
103,culture_objects_top_25,big_road1_1line,0.369720,0.543157


In [19]:
chi_squared_tests[chi_squared_tests['p_value'] == 0]

Unnamed: 0,column_1,column_2,chi_squared,p_value
0,product_type,sub_area,20556.069224,0.0
1,culture_objects_top_25,detention_facility_raion,1447.864722,0.0
2,culture_objects_top_25,ecology,6457.562959,0.0
3,sub_area,ecology,102073.604482,0.0
4,sub_area,railroad_1line,8593.594347,0.0
5,sub_area,big_road1_1line,4652.294586,0.0
6,sub_area,water_1line,9707.672544,0.0
7,sub_area,detention_facility_raion,30471.0,0.0
8,sub_area,nuclear_reactor_raion,30471.0,0.0
9,sub_area,big_market_raion,30471.0,0.0
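
With about 30,000 rows, the chi-squared p-value underflows to 0 for many pairs, so it says little about how strong an association is. One common effect-size measure is Cramér's V, $V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$, which ranges from 0 to 1. The sketch below is an illustration, not part of the original analysis; `cramers_v` and the toy series are hypothetical. For the pairs above where the statistic equals the row count (30471.0) and one variable is binary, this formula would give V = 1, which is consistent with the raion-level flags being constant within each sub-area.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V effect size for two categorical series (hypothetical helper)."""
    table = pd.crosstab(a, b)
    # correction=False disables the Yates correction, which only applies to 2x2 tables
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# The flag is fully determined by the area, so V should be 1
area = pd.Series(["A", "A", "B", "B", "C", "C"])
flag = pd.Series(["yes", "yes", "no", "no", "no", "no"])
v_dependent = cramers_v(area, flag)

# Two independent binary variables, so V should be 0
left = pd.Series(["x", "x", "y", "y"])
right = pd.Series(["p", "q", "p", "q"])
v_independent = cramers_v(left, right)

print(v_dependent, v_independent)
```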


##### Correlation between categorical features and target

In [20]:
from scipy.stats import f_oneway

def calculate_anova(df, categorical_columns, target_variable):
    anova_results = []
    for col in categorical_columns:
        category_groups = [df[target_variable][df[col] == category] for category in df[col].unique()]
        f_statistic, p_value = f_oneway(*category_groups)
        anova_results.append({'Categorical_Variable': col, 'F-statistic': f_statistic, 'p-value': p_value})
    return pd.DataFrame(anova_results).sort_values(by = "p-value", ascending = False)

In [21]:
categorical_columns = [col for col in df.columns if df[col].dtype == "object" and col != "timestamp"]
anova_df = calculate_anova(df, categorical_columns, 'price_doc')
anova_df

Unnamed: 0,Categorical_Variable,F-statistic,p-value
5,oil_chemistry_raion,18.286085,1.906688e-05
13,railroad_1line,21.123049,4.324426e-06
9,nuclear_reactor_raion,23.93301,1.002551e-06
3,thermal_power_plant_raion,36.138663,1.858515e-09
10,detention_facility_raion,42.898606,5.85688e-11
11,water_1line,45.292765,1.726791e-11
7,railroad_terminal_raion,100.267735,1.4478580000000002e-23
12,big_road1_1line,114.377334,1.2012719999999999e-26
8,big_market_raion,187.263185,1.680809e-42
4,incineration_raion,192.678478,1.124088e-43


#### Feature Selection

In [22]:
chi_squared_tests.head()

Unnamed: 0,column_1,column_2,chi_squared,p_value
0,product_type,sub_area,20556.069224,0.0
1,culture_objects_top_25,detention_facility_raion,1447.864722,0.0
2,culture_objects_top_25,ecology,6457.562959,0.0
3,sub_area,ecology,102073.604482,0.0
4,sub_area,railroad_1line,8593.594347,0.0


In [23]:
chi_squared_tests[chi_squared_tests['p_value'] == 0]

Unnamed: 0,column_1,column_2,chi_squared,p_value
0,product_type,sub_area,20556.069224,0.0
1,culture_objects_top_25,detention_facility_raion,1447.864722,0.0
2,culture_objects_top_25,ecology,6457.562959,0.0
3,sub_area,ecology,102073.604482,0.0
4,sub_area,railroad_1line,8593.594347,0.0
5,sub_area,big_road1_1line,4652.294586,0.0
6,sub_area,water_1line,9707.672544,0.0
7,sub_area,detention_facility_raion,30471.0,0.0
8,sub_area,nuclear_reactor_raion,30471.0,0.0
9,sub_area,big_market_raion,30471.0,0.0


In [24]:
features_to_delete = set(chi_squared_tests[(chi_squared_tests['p_value'] == 0) & 
                                        (chi_squared_tests['column_1'] == 'sub_area')]['column_2'].tolist())


In [25]:
# Note: the threshold used here is 0.9, despite the "99" in the variable name
multcol99 = multicollin[multicollin['correlation'] > 0.9]
features_to_delete.update(multcol99['column_2'].unique())

In [26]:
few_val_df

column,number_of_unique_values,%unique_values,correlation_with_target,imbalance_level
mosque_count_500,2,0.006564,0.018474,Severely Imbalanced
mosque_count_1000,2,0.006564,0.089308,Severely Imbalanced
mosque_count_2000,2,0.006564,0.100228,Severely Imbalanced
mosque_count_1500,2,0.006564,0.111034,Severely Imbalanced
mosque_count_3000,3,0.009845,0.096199,Highly Imbalanced
...,...,...,...,...
ttk_km,11852,38.895999,0.272620,Balanced
bulvar_ring_km,11852,38.895999,0.279158,Balanced
kremlin_km,11852,38.895999,0.279249,Balanced
sadovoe_km,11852,38.895999,0.283622,Balanced


In [27]:
features_to_delete.update(few_val_df[few_val_df['imbalance_level'] == 'Severely Imbalanced'].index.tolist())

In [28]:
miss_val = missing_val_dt(df)
miss_val.head()

Unnamed: 0,Pandas Dtype,Missing Values,%Missing Values
hospital_beds_raion,float64,14441,47.392603
build_year,float64,13605,44.649011
state,float64,13559,44.498047
cafe_sum_500_max_price_avg,float64,13281,43.585704
cafe_sum_500_min_price_avg,float64,13281,43.585704


In [29]:
features_to_delete.update(miss_val[miss_val['%Missing Values'] >= 40].index)

In [30]:
features_to_delete.add('id')

## Data Preprocessing

In [31]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.base import BaseEstimator, TransformerMixin


In [32]:
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [33]:
class DateTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X=None, y=None):
        return self
    
    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()
        X['timestamp'] = pd.to_datetime(X['timestamp'])
        X['year'] = X['timestamp'].dt.year
        X['month'] = X['timestamp'].dt.month
        X['day'] = X['timestamp'].dt.day
        X = X.drop('timestamp', axis = 1)
        return X

In [34]:
numeric_columns = [col for col in df.columns if df[col].dtype != "object" and col != 'price_doc' and col not in features_to_delete]
categorical_columns = [col for col in df.columns if df[col].dtype == "object" and col != 'timestamp' and col not in features_to_delete]

In [35]:
df1 = df.drop(columns = list(features_to_delete))

In [36]:
df1 = DateTransformer().fit_transform(df1)
df1.head()

Unnamed: 0,full_sq,life_sq,floor,max_floor,material,num_room,kitch_sq,product_type,sub_area,area_m,...,green_part_5000,prom_part_5000,trc_sqm_5000,cafe_sum_5000_min_price_avg,mosque_count_5000,market_count_5000,price_doc,year,month,day
0,43,27.0,4.0,,,,,Investment,Bibirevo,6407578.0,...,13.09,13.31,4036616,708.57,1,4,5850000,2011,8,20
1,34,19.0,3.0,,,,,Investment,Nagatinskij Zaton,9589337.0,...,10.26,27.47,2034942,673.81,1,14,6000000,2011,8,23
2,43,29.0,2.0,,,,,Investment,Tekstil'shhiki,4808270.0,...,13.69,21.58,1572990,702.68,0,10,5700000,2011,8,27
3,89,50.0,9.0,,,,,Investment,Mitino,12583540.0,...,14.18,3.89,942180,931.58,0,3,13100000,2011,9,1
4,77,77.0,4.0,,,,,Investment,Basmannoe,8398461.0,...,8.38,10.92,3503058,853.88,2,14,16331452,2011,9,5


In [37]:
numeric_imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()
categ_imputer = SimpleImputer(strategy='most_frequent')

In [38]:
numeric_imputer.fit(df1[numeric_columns])
df1[numeric_columns] = numeric_imputer.transform(df1[numeric_columns])

In [39]:
categ_imputer.fit(df1[categorical_columns])
df1[categorical_columns] = categ_imputer.transform(df1[categorical_columns])

In [40]:
df1['product_type'] = df1['product_type'].map({'Investment': 1, 'OwnerOccupier': 0})

In [41]:
mean_target = df1.groupby('sub_area')['price_doc'].mean()
df1['sub_area'] = df1['sub_area'].map(mean_target)
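
The mean encoding above is computed over every row of `df1`, including rows that are later held out, so the encoded feature carries target information into the evaluation. A minimal sketch of a training-only variant is shown below. The helper names and the toy data are hypothetical, and the fallback to the global training mean for unseen districts is an assumption rather than something the notebook does.

```python
import pandas as pd

def fit_mean_encoding(train: pd.DataFrame, column: str, target: str):
    """Learn per-category target means from training rows only."""
    means = train.groupby(column)[target].mean()
    fallback = train[target].mean()  # used for categories unseen in training
    return means, fallback

def apply_mean_encoding(frame: pd.DataFrame, column: str, means: pd.Series, fallback: float) -> pd.Series:
    """Map categories to their learned means, filling unseen ones with the fallback."""
    return frame[column].map(means).fillna(fallback)

# Toy training data: district A averages 15, district B averages 30
train_toy = pd.DataFrame({"sub_area": ["A", "A", "B"], "price_doc": [10.0, 20.0, 30.0]})
# District C never appears in training
new_toy = pd.DataFrame({"sub_area": ["A", "B", "C"]})

means, fallback = fit_mean_encoding(train_toy, "sub_area", "price_doc")
encoded = apply_mean_encoding(new_toy, "sub_area", means, fallback)
print(encoded.tolist())
```

Inside cross-validation the same fit step would have to be repeated on the training part of each fold to keep the validation scores free of leakage.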

## Algorithms Evaluation

In [42]:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_validate

In [43]:
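# Chronological 90/10 split; positional slicing avoids calling np.split on a DataFrame, which newer pandas versions deprecate
split_idx = int(.9 * len(df1))
train_set, test_set = df1.iloc[:split_idx], df1.iloc[split_idx:]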

In [44]:
X_train = train_set.drop(columns='price_doc')
X_test = test_set.drop(columns='price_doc')
y_train = train_set['price_doc']
y_test = test_set['price_doc']

In [45]:
splitter = TimeSeriesSplit(n_splits=4)
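
`TimeSeriesSplit` produces expanding training windows in which every validation fold lies strictly after its training rows. This only respects time order if the rows are already sorted chronologically, which the `df.head()` output suggests but the notebook does not verify. The toy example below, with hypothetical names, prints the index layout of the four folds.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten placeholder rows, assumed to be sorted from oldest to newest
rows = np.arange(10).reshape(-1, 1)
demo_splitter = TimeSeriesSplit(n_splits=4)

folds = [(train_idx.tolist(), val_idx.tolist())
         for train_idx, val_idx in demo_splitter.split(rows)]

for number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {number}: train={train_idx} validate={val_idx}")
```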

In [46]:
import warnings
warnings.filterwarnings('ignore')

In [47]:
models = []
models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('SVR', SVR()))

results = []
names = []
scoring = 'neg_mean_squared_error'

for name, model in models:
    cv_results = cross_validate(model, X_train, y_train, cv=splitter, scoring=scoring, return_train_score=True)
    train_rmse = np.sqrt(-np.mean(cv_results['train_score']))
    test_rmse = np.sqrt(-np.mean(cv_results['test_score']))
    results.append(cv_results)
    names.append(name)
    print(f"{name}:")
    print(f"Mean MSE on training folds: {-np.mean(cv_results['train_score']).round(3)}")
    print(f"Mean MSE on test folds: {-np.mean(cv_results['test_score']).round(3)}")
    print(f"Mean RMSE on training folds: {train_rmse.round(3)}")
    print(f"Mean RMSE on test folds: {test_rmse.round(3)}")
    print()


LR:
Mean MSE on training folds: 12453790943251.41
Mean MSE on test folds: 23583398534477.137
Mean RMSE on training folds: 3528992.908
Mean RMSE on test folds: 4856274.141

LASSO:
Mean MSE on training folds: 12459344360865.09
Mean MSE on test folds: 23567104295733.242
Mean RMSE on training folds: 3529779.648
Mean RMSE on test folds: 4854596.203

EN:
Mean MSE on training folds: 12822988452290.92
Mean MSE on test folds: 25694078359736.625
Mean RMSE on training folds: 3580920.057
Mean RMSE on test folds: 5068932.665

KNN:
Mean MSE on training folds: 9718483956221.506
Mean MSE on test folds: 16447056625305.898
Mean RMSE on training folds: 3117448.309
Mean RMSE on test folds: 4055497.087

CART:
Mean MSE on training folds: 22429567.448
Mean MSE on test folds: 17742279018769.04
Mean RMSE on training folds: 4735.986
Mean RMSE on tes

In [48]:
pipelines = []
pipelines.append(('ScaledLR', Pipeline([('Scaler', StandardScaler()),('LR', LinearRegression())])))
pipelines.append(('ScaledLASSO', Pipeline([('Scaler', StandardScaler()),('LASSO', Lasso())])))
pipelines.append(('ScaledEN', Pipeline([('Scaler', StandardScaler()),('EN', ElasticNet())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler', StandardScaler()),('KNN', KNeighborsRegressor())])))
pipelines.append(('ScaledCART', Pipeline([('Scaler', StandardScaler()),('CART', DecisionTreeRegressor())])))
pipelines.append(('ScaledSVR', Pipeline([('Scaler', StandardScaler()),('SVR', SVR())])))
resultsp = []
namesp = []
for name, model in pipelines:
    cv_results = cross_validate(model, X_train, y_train, cv=splitter, scoring=scoring, return_train_score=True)
    train_rmse = np.sqrt(-np.mean(cv_results['train_score']))
    test_rmse = np.sqrt(-np.mean(cv_results['test_score']))
    resultsp.append(cv_results)
    namesp.append(name)
    print(f"{name}:")
    print(f"Mean MSE on training folds: {-np.mean(cv_results['train_score']).round(3)}")
    print(f"Mean MSE on test folds: {-np.mean(cv_results['test_score']).round(3)}")
    print(f"Mean RMSE on training folds: {train_rmse.round(3)}")
    print(f"Mean RMSE on test folds: {test_rmse.round(3)}")
    print()


ScaledLR:
Mean MSE on training folds: 12469828933956.71
Mean MSE on test folds: 2.837423023949027e+39
Mean RMSE on training folds: 3531264.495
Mean RMSE on test folds: 5.3267466843740815e+19

ScaledLASSO:
Mean MSE on training folds: 12459345349900.285
Mean MSE on test folds: 23567072150609.996
Mean RMSE on training folds: 3529779.788
Mean RMSE on test folds: 4854592.892

ScaledEN:
Mean MSE on training folds: 13124123203024.914
Mean MSE on test folds: 18612120047055.676
Mean RMSE on training folds: 3622723.175
Mean RMSE on test folds: 4314176.636

ScaledKNN:
Mean MSE on training folds: 8709856328773.423
Mean MSE on test folds: 14259713778525.574
Mean RMSE on training folds: 2951246.572
Mean RMSE on test folds: 3776203.62

ScaledCART:
Mean MSE on training folds: 22429567.448
Mean MSE on test folds: 17577449599078.951
Mean RMSE on train

## Algorithm Tuning

In [49]:
from sklearn.model_selection import GridSearchCV

In [50]:
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)

In [51]:
k_values = np.array([1,3,5,7,9,11,13,15,17,19,21])
weights_values = ['uniform', 'distance']
metric_values = ['euclidean', 'manhattan']

param_grid = {
    'n_neighbors': k_values,
    'weights': weights_values,
    'metric': metric_values
}

model = KNeighborsRegressor()

In [52]:
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=splitter)
grid_result = grid.fit(rescaledX, y_train)

In [53]:
print("Best: %f using %s" % (np.sqrt(-grid_result.best_score_), grid_result.best_params_))
means = np.sqrt(-grid_result.cv_results_['mean_test_score'])
# Note: std_test_score is the standard deviation of the negated MSE across folds, not of the RMSE
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']

for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 3729306.031934 using {'metric': 'euclidean', 'n_neighbors': 7, 'weights': 'distance'}
4272728.085124 (2459249539148.897949) with: {'metric': 'euclidean', 'n_neighbors': 1, 'weights': 'uniform'}
4272728.085124 (2459249539148.897949) with: {'metric': 'euclidean', 'n_neighbors': 1, 'weights': 'distance'}
3814185.879729 (1312285561143.799316) with: {'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'uniform'}
3784278.440560 (1324747103942.364746) with: {'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'distance'}
3781845.986385 (1303215040793.145752) with: {'metric': 'euclidean', 'n_neighbors': 5, 'weights': 'uniform'}
3738996.410139 (1257034963312.378418) with: {'metric': 'euclidean', 'n_neighbors': 5, 'weights': 'distance'}
3776305.945636 (1333010156576.655518) with: {'metric': 'euclidean', 'n_neighbors': 7, 'weights': 'uniform'}
3729306.031934 (1245324039903.057373) with: {'metric': 'euclidean', 'n_neighbors': 7, 'weights': 'distance'}
3787821.352374 (1463347500408.914062) wi
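
In the listing above, the first number on each line is an RMSE while the number in parentheses is the standard deviation of the fold MSEs, so the two are in different units. One way to get both in price units is scikit-learn's `neg_root_mean_squared_error` scorer. The sketch below uses synthetic data from `make_regression` and hypothetical variable names. It also puts the scaler inside a `Pipeline` so that it is refit on the training part of each fold, which is a design choice that differs from the pre-scaled `rescaledX` used in the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and prices
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsRegressor())])
search = GridSearchCV(
    estimator=pipe,
    param_grid={"knn__n_neighbors": [3, 5, 7]},
    scoring="neg_root_mean_squared_error",  # fold scores are negated RMSE values
    cv=TimeSeriesSplit(n_splits=4),
)
search.fit(X_demo, y_demo)

rmse_means = -search.cv_results_["mean_test_score"]  # mean of the per-fold RMSE
rmse_stds = search.cv_results_["std_test_score"]     # std of the per-fold RMSE

for mean, std, params in zip(rmse_means, rmse_stds, search.cv_results_["params"]):
    print(f"{mean:.3f} ({std:.3f}) with: {params}")
```

The mean of per-fold RMSE values is generally not equal to the square root of the mean MSE, so numbers from this scorer are not directly comparable with the ones printed above.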

## Ensemble Methods

In [54]:
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor

In [55]:
ensembles = []
ensembles.append(('ScaledAB', Pipeline([('Scaler', StandardScaler()),('AB', AdaBoostRegressor())])))
ensembles.append(('ScaledGBM', Pipeline([('Scaler', StandardScaler()),('GBM', GradientBoostingRegressor())])))
ensembles.append(('ScaledRF', Pipeline([('Scaler', StandardScaler()),('RF', RandomForestRegressor())])))

e_results = []
e_names = []

for name, ensemble in ensembles:
    cv_results = cross_validate(ensemble, X_train, y_train, cv=splitter, scoring=scoring, return_train_score=True)
    train_rmse = np.sqrt(-np.mean(cv_results['train_score']))
    test_rmse = np.sqrt(-np.mean(cv_results['test_score']))
    e_results.append(cv_results)
    e_names.append(name)
    print(f"{name}:")
    print(f"Mean MSE on training folds: {-np.mean(cv_results['train_score']).round(3)}")
    print(f"Mean MSE on test folds: {-np.mean(cv_results['test_score']).round(3)}")
    print(f"Mean RMSE on training folds: {train_rmse.round(3)}")
    print(f"Mean RMSE on test folds: {test_rmse.round(3)}")
    print()

ScaledAB:
Mean MSE on training folds: 17572072165888.328
Mean MSE on test folds: 17202608533639.742
Mean RMSE on training folds: 4191905.553
Mean RMSE on test folds: 4147602.745

ScaledGBM:
Mean MSE on training folds: 5579051881256.122
Mean MSE on test folds: 7891611992740.809
Mean RMSE on training folds: 2362001.668
Mean RMSE on test folds: 2809201.309

ScaledRF:
Mean MSE on training folds: 1268053374054.779
Mean MSE on test folds: 8192767458723.376
Mean RMSE on training folds: 1126078.76
Mean RMSE on test folds: 2862301.078



In [56]:
seed = 7
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
param_grid = dict(n_estimators=np.array([50,100,150,200,250,300,350,400]))
model = GradientBoostingRegressor(random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=splitter)
grid_result = grid.fit(rescaledX, y_train)

In [57]:
print("Best: %f using %s" % (np.sqrt(-grid_result.best_score_), grid_result.best_params_))
means = np.sqrt(-grid_result.cv_results_['mean_test_score'])
# Note: std_test_score is the standard deviation of the negated MSE across folds, not of the RMSE
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']

for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 2754951.011548 using {'n_estimators': 300}
2890160.187359 (618836398950.942749) with: {'n_estimators': 50}
2796703.598584 (501091193560.160095) with: {'n_estimators': 100}
2777979.185517 (485409322100.652344) with: {'n_estimators': 150}
2767079.579577 (505908619900.953491) with: {'n_estimators': 200}
2757968.157194 (538386588379.710510) with: {'n_estimators': 250}
2754951.011548 (577445110417.141479) with: {'n_estimators': 300}
2759902.824703 (606320300569.366211) with: {'n_estimators': 350}
2760922.089000 (647099774806.185425) with: {'n_estimators': 400}
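
The 10% hold-out (`X_test`, `y_test`) created earlier is never scored in this notebook. The sketch below shows two metric helpers that could be applied to it. The function names are hypothetical, and the use of RMSLE rests on the assumption that the competition leaderboard is scored with root mean squared logarithmic error, which is worth checking on the competition page.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

def rmse(y_true, y_pred) -> float:
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def rmsle(y_true, y_pred) -> float:
    """Root mean squared logarithmic error; negative predictions are clipped to 0."""
    clipped = np.maximum(np.asarray(y_pred, dtype=float), 0)
    return float(np.sqrt(mean_squared_log_error(y_true, clipped)))

# Small hand-checkable cases
rmse_value = rmse([0.0, 0.0], [3.0, 4.0])     # sqrt((9 + 16) / 2)
rmsle_value = rmsle([np.e - 1], [0.0])        # log1p(e - 1) - log1p(0) = 1
rmsle_clipped = rmsle([np.e - 1], [-5.0])     # the negative prediction is treated as 0
print(rmse_value, rmsle_value, rmsle_clipped)
```

With the notebook's objects, a hold-out score could be obtained with something like `rmse(y_test, grid.predict(scaler.transform(X_test)))`, given that `grid` and `scaler` are the fitted objects from the cell above.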


## Submission

In [93]:
test = pd.read_csv('test.csv')
test.drop(columns = list(features_to_delete), inplace = True)
test[numeric_columns] = numeric_imputer.transform(test[numeric_columns])
test[categorical_columns] = categ_imputer.transform(test[categorical_columns])
test = DateTransformer().fit_transform(test)
test['product_type'] = test['product_type'].map({'Investment': 1, 'OwnerOccupier': 0})
test['sub_area'] = test['sub_area'].map(mean_target)
scaled_test = scaler.transform(test)

In [96]:
y_pred = grid.predict(scaled_test)
y_pred_no_neg = np.maximum(y_pred, 0)

In [97]:
test = pd.read_csv('test.csv')
sub = pd.DataFrame({'id' : test['id'], 'price_doc' : y_pred_no_neg})

In [99]:
sub.to_csv('sub.csv', index=False)