# Workflow template
This notebook is a template for a more structured workflow.

The Sberbank dataset contains a lot of variables (291 in the regular train and test files, 100 in the macro file), and working through all of them can become tedious even on a 1080p or larger screen. Newly created columns are appended to the end of the dataframe, so comparing new and old columns is easier with topic-related column collections.

## Column collections

Under the Features section of this notebook, all variables that belong together are grouped into lists.

Most of the default variables are positioned next to each other in the source-dataframe but some are not sorted in an ideal way (e.g. the square-meters columns are not next to each other, kitch_sq is a few columns to the right, km-distance-columns are all over the place).

With the column collections you can use **`train_df.loc[:3,sub_area_columns]`**, where **`sub_area_columns`** is the list containing all area-related columns.

After you add new features, just append them to the list and display all topic-related columns with the same line.
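
As a minimal sketch of the idea, here is the same pattern on a small made-up frame instead of the Sberbank data. The frame `toy_df`, its values and the feature `full_sq_per_room` are assumptions for illustration only.

```python
import pandas as pd

# toy stand-in for train_df; the values are made up
toy_df = pd.DataFrame({
    "full_sq": [43, 34, 43, 89],
    "price": [1, 2, 3, 4],
    "kitch_sq": [6.0, 5.0, 7.0, 10.0],
    "num_room": [2, 1, 2, 3],
})

# collection of topic-related columns, even though they are not adjacent in the frame
sqm_columns = ["full_sq", "kitch_sq", "num_room"]

# a new feature is appended to the end of the frame ...
toy_df["full_sq_per_room"] = toy_df["full_sq"] / toy_df["num_room"]
# ... so add it to the collection to display it next to its source columns
sqm_columns.append("full_sq_per_room")

# .loc slices are label-based and include the end label, so :2 returns rows 0, 1 and 2
subset = toy_df.loc[:2, sqm_columns]
print(subset)
```

Because `.loc` includes the end label, the `train_df.loc[:3, ...]` calls in this notebook show four rows, not three.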



## Jupyter Notebook Extensions
The regular jupyter notebook can be enhanced with a number of easy-to-install extensions.

If you download this kernel as an ipynb file and activate the extensions, it will be a lot easier to work with almost 400 variables.


You can find jupyter_contrib_nbextensions on GitHub: https://github.com/ipython-contrib/jupyter_contrib_nbextensions

Installation with conda: **`conda install -c conda-forge jupyter_contrib_nbextensions`**

Installation with pip: **`pip install jupyter_contrib_nbextensions`**

### Helpful extensions
You need to start a new jupyter notebook server in order to activate the extensions.

After this, you can access the extensions menu on the main page that opens when you start a jupyter notebook server:
![title](https://www.dropbox.com/s/9ue80c7e1tz6kd8/Nbextensions.png?dl=1)

You can uncheck the "disable configuration for nbextensions without explicit compatibility" option. I am using the latest version of jupyter and haven't experienced any problems.

Then choose the extensions you want to use below and maybe change a few of their options.
![title](https://www.dropbox.com/s/dz7jbeyx5pe3zfc/extension_selection.png?dl=1)

#### Collapsible Headers
To make the whole workflow easier, this notebook is separated into headers and sub-headers for each topic. With the help of the "collapsible headers" extension, working with them becomes much easier.

This extension allows you to collapse all headers in a hierarchical way.
![title](https://www.dropbox.com/s/0zzmi5ttk5ic39p/collapsible_headers.png?dl=1)

#### Codefolding

As with headers, you can fold long sections of code.
![title](http://jupyter-contrib-nbextensions.readthedocs.io/en/latest/_images/codefolding_indent_unfolded.png)

#### Table of Contents (2)
I'm not sure where Table of Contents (1) is, but this extension is great for quickly moving between sections once your notebook has become very long.

You can move the ToC around and let it automatically generate numbered headings.
![title](http://jupyter-contrib-nbextensions.readthedocs.io/en/latest/_images/demo3.gif)

# General

#### Todos and Ideas to try

Place your general ideas, links and todos here


#### Imports

In [1]:
import numpy as np
np.set_printoptions(linewidth=140) # numpy wraps output at 75 characters per line by default, which is narrow on a larger screen
import pandas as pd
pd.set_option('display.max_columns', 300) # increase number of columns and rows to print out
pd.set_option('display.max_rows', 300)

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
%matplotlib inline

from datetime import datetime

# the following 3 lines let your notebook take up more space on the screen by removing the unused width on the sides
from IPython.display import display, HTML
display(HTML("<style>.container { width:99% !important; }</style>"))
display(HTML("<style>table {float:left}</style>")) # makes the changelog table nicer
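
The display options above are set globally for the whole session. If a wide limit is only needed for a single cell, `pd.option_context` can scope it. A minimal sketch, with the toy frame `wide_df` as an assumption:

```python
import numpy as np
import pandas as pd

wide_df = pd.DataFrame(np.zeros((1, 400)))  # toy frame: one row, 400 columns

default_max_columns = pd.get_option("display.max_columns")

# raise the column limit only inside the with-block
with pd.option_context("display.max_columns", 400):
    inside = pd.get_option("display.max_columns")
    text = repr(wide_df)  # rendered with the temporary limit

# the previous value is restored automatically
after = pd.get_option("display.max_columns")
print(inside, after)
```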

In [2]:
from sklearn import model_selection
from sklearn import preprocessing
from sklearn import metrics

#### Helper functions

In [3]:
def show_dtypes(df):
    for dtype in df.dtypes.unique():
        print(str(dtype).ljust(14), ":",list(df.dtypes[df.dtypes == dtype].index),"\n")

# returns all columns of a DataFrame with type float64, int64, uint8
def get_numeric_columns(df):
    return list(df.dtypes[(df.dtypes == "float64") | (df.dtypes == "int64") | (df.dtypes == "uint8")].index)


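# creates dummy columns with prefix and appends them to the source dataframe
# reference_df supplies the categories to keep (e.g. a combined train/test frame); it defaults to df itself
def create_dummy_columns(df, columns, reference_df=None):
    reference_df = df if reference_df is None else reference_df
    for column in columns:
        df[column] = df[column].astype(str) # convert to str just in case
        # only keep the categories that appear in reference_df and add the prefix like in get_dummies
        new_columns = [column + "_" + str(i) for i in reference_df[column].unique()]
        # reindex adds all-zero columns for categories that are missing in df instead of raising a KeyError
        dummies = pd.get_dummies(df[column], prefix=column).reindex(columns=new_columns, fill_value=0)
        df = pd.concat((df, dummies), axis=1)
    return df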


def del_columns(df, columns):
    for column in columns:
        if column in df.columns:
            del df[column]
            print("Deleted: ", column)
        else:
            print("Not in DataFrame: ",column)
    return df
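
The reason for restricting dummy columns to a fixed set of categories is that train and test can contain different values, which would otherwise lead to frames with different columns. A minimal sketch of that alignment on toy data (the frames and their values are made up for illustration):

```python
import pandas as pd

train_toy = pd.DataFrame({"ecology": ["good", "poor", "excellent"]})
test_toy = pd.DataFrame({"ecology": ["good", "good", "satisfactory"]})

train_dummies = pd.get_dummies(train_toy["ecology"], prefix="ecology")
test_dummies = pd.get_dummies(test_toy["ecology"], prefix="ecology")

# use the train columns as the reference:
# categories missing in test become all-zero columns, categories only present in test are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```

Which frame serves as the reference is a modelling choice; using the train columns means a model fitted on train never sees an unknown column at prediction time.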

# Loading Data
[back to top](#Table-of-Contents)

## Macro

In [4]:
macro_df = pd.read_csv("input/macro.csv", parse_dates=["timestamp"])
macro_df.shape

(2484, 100)

## Train

In [5]:
train_df = pd.read_csv("input/train.csv", parse_dates=["timestamp"])
train_df["train_test"] = "train"

train_df = pd.merge(train_df, macro_df, how="left", on="timestamp")

print("Shape: ", train_df.shape,"\n")
train_df.info(memory_usage="deep")

Shape:  (30471, 392) 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30471 entries, 0 to 30470
Columns: 392 entries, id to apartment_fund_sqm
dtypes: datetime64[ns](1), float64(213), int64(159), object(19)
memory usage: 120.2 MB
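
The left merge keeps the 30471 rows only if every timestamp occurs at most once in the macro file. A hedged sketch of how that assumption could be checked with the `validate` argument of `pd.merge`, on made-up toy frames:

```python
import pandas as pd

houses = pd.DataFrame({
    "id": [1, 2, 3],
    "timestamp": pd.to_datetime(["2011-08-20", "2011-08-20", "2011-08-23"]),
})
macro = pd.DataFrame({
    "timestamp": pd.to_datetime(["2011-08-20", "2011-08-23"]),
    "oil_urals": [109.31, 109.31],
})

# validate="many_to_one" raises pandas.errors.MergeError if a timestamp is duplicated in macro
merged = pd.merge(houses, macro, how="left", on="timestamp", validate="many_to_one")

# a left join on a unique right-hand key keeps the row count of the left frame
print(len(houses), len(merged))
```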


## Test

In [6]:
test_df = pd.read_csv("input/test.csv", parse_dates=["timestamp"])
test_df["train_test"] = "test"

test_df = pd.merge(test_df, macro_df, how="left", on="timestamp")

print("Shape: ",test_df.shape, "\n")
test_df.info(memory_usage="deep")

Shape:  (7662, 391) 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7662 entries, 0 to 7661
Columns: 391 entries, id to apartment_fund_sqm
dtypes: datetime64[ns](1), float64(210), int64(161), object(19)
memory usage: 29.8 MB


## Target

In [7]:
train_df["price_doc"].describe().apply(lambda x: '%.f' % x)

count        30471
mean       7123035
std        4780111
min         100000
25%        4740002
50%        6274411
75%        8300000
max      111111112
Name: price_doc, dtype: object

In [8]:
train_df.price_doc.sort_values(ascending=True).head(5)

20244    100000
1167     190000
1169     200000
9221     260000
3258     300000
Name: price_doc, dtype: int64

In [9]:
train_df.price_doc.sort_values(ascending=False).head(5)

2118     111111112
28326     95122496
7457      91066096
19095     80777440
6319      78802248
Name: price_doc, dtype: int64

In [10]:
ulimit = np.percentile(train_df.price_doc.values, 99.9)
llimit = np.percentile(train_df.price_doc.values, .1)
train_df.loc[train_df['price_doc']>ulimit, ["price_doc"]] = ulimit
train_df.loc[train_df['price_doc']<llimit, ["price_doc"]] = llimit
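
The cell above caps the target at the 0.1th and 99.9th percentiles (winsorizing). As a sketch, the same result can be obtained with `Series.clip`; the series `prices` below is toy data, not the real target.

```python
import numpy as np
import pandas as pd

prices = pd.Series(np.arange(1, 1002, dtype=float))  # toy values 1..1001

upper = np.percentile(prices.values, 99.9)
lower = np.percentile(prices.values, 0.1)

# mask-based assignment, as in the cell above
masked = prices.copy()
masked[masked > upper] = upper
masked[masked < lower] = lower

# the same capping in one call
clipped = prices.clip(lower=lower, upper=upper)

print(masked.equals(clipped))
```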

In [11]:
train_df["price_doc"].describe().apply(lambda x: '%.f' % x)

count       30471
mean      7110463
std       4615500
min        712477
25%       4740002
50%       6274411
75%       8300000
max      53899956
Name: price_doc, dtype: object

In [12]:
train_y = np.array(train_df["price_doc"])
print(train_y[0:10])

# Since our metric is "RMSLE", let us use log of the target variable for model building rather than using the actual target variable.
train_y = np.log1p(train_y)
print(train_y[0:10])

[  5850000.   6000000.   5700000.  13100000.  16331452.   9100000.   5500000.   2000000.   5300000.   2000000.]
[ 15.58195239  15.60727019  15.55597691  16.38812286  16.60860344  16.02378508  15.52025883  14.50865824  15.48321757  14.50865824]
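
To make the reasoning behind the log transform concrete: RMSE computed on `log1p`-transformed values is the same number as RMSLE on the raw values, and `np.expm1` maps log-scale predictions back to Rubles. A minimal sketch; the `predicted` values are made up and the helper names `rmsle` and `rmse` are not part of the notebook.

```python
import numpy as np

actual = np.array([5850000.0, 6000000.0, 5700000.0])     # first three prices from the output above
predicted = np.array([6000000.0, 5500000.0, 5700000.0])  # made-up predictions

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on raw values."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# RMSE on the log1p scale equals RMSLE on the raw scale
score_raw = rmsle(actual, predicted)
score_log = rmse(np.log1p(actual), np.log1p(predicted))

# expm1 is the inverse of log1p
restored = np.expm1(np.log1p(actual))

print(score_raw, score_log)
```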


# Exploratory Data Analysis
[back to top](#Table-of-Contents)

In [13]:
# put in your code here

# Features
[back to top](#Table-of-Contents)

In [14]:
# show_dtypes(train_df)

## Feature Dictionary

In [15]:
# price_doc: sale price (this is the target variable)
# id: transaction id
# timestamp: date of transaction
# full_sq: total area in square meters, including loggias, balconies and other non-residential areas
# life_sq: living area in square meters, excluding loggias, balconies and other non-residential areas
# floor: for apartments, floor of the building
# max_floor: number of floors in the building
# material: wall material
# build_year: year built
# num_room: number of living rooms
# kitch_sq: kitchen area
# state: apartment condition
# product_type: owner-occupier purchase or investment
# sub_area: name of the district

# The dataset also includes a collection of features about each property's surrounding neighbourhood, and some features that are constant across each sub area (known as a Raion). Most of the feature names are self explanatory, with the following notes.

# full_all: subarea population
# male_f, female_f: subarea population by gender
# young_*: population younger than working age
# work_*: working-age population
# ekder_*: retirement-age population
# n_m_{all|male|female}: population between n and m years old
# build_count_*: buildings in the subarea by construction type or year
# x_count_500: the number of x within 500m of the property
# x_part_500: the share of x within 500m of the property
# _sqm_: square meters
# cafe_count_d_price_p: number of cafes within d meters of the property that have an average bill under p RUB
# trc_: shopping malls
# prom_: industrial zones
# green_: green zones
# metro_: subway
# _avto_: distances by car
# mkad_: Moscow Circle Auto Road
# ttk_: Third Transport Ring
# sadovoe_: Garden Ring
# bulvar_ring_: Boulevard Ring
# kremlin_: City center
# zd_vokzaly_: Train station
# oil_chemistry_: Dirty industry
# ts_: Power plant

## General features

In [16]:
train_df["null_count"] = train_df.isnull().sum(axis=1)
test_df["null_count"] = test_df.isnull().sum(axis=1)

In [17]:
# these make it easier to see if a property has an unusually high or low price if you are not familiar with Rubles
train_df["price_doc_euro"] = np.round(train_df["price_doc"]*0.016,0)
train_df["price_doc_dollars"] = np.round(train_df["price_doc"]*0.02,0)

## Building info

### timestamp

In [18]:
time_columns = ["timestamp"]
train_df.loc[:3,time_columns]

Unnamed: 0,timestamp
0,2011-08-20
1,2011-08-23
2,2011-08-27
3,2011-09-01


In [19]:
train_df["time_year"] = train_df["timestamp"].dt.year
test_df["time_year"] = test_df["timestamp"].dt.year

train_df["time_month_of_year"] = train_df["timestamp"].dt.month
test_df["time_month_of_year"] = test_df["timestamp"].dt.month

train_df["time_yearmonth"] = train_df["timestamp"].dt.year*100 + train_df["timestamp"].dt.month-200000
test_df["time_yearmonth"] = test_df["timestamp"].dt.year*100 + test_df["timestamp"].dt.month-200000

# .dt.weekofyear was removed in pandas 2.0; .dt.isocalendar().week (pandas >= 1.1) returns the same ISO week number
train_df["time_week_of_year"] = train_df["timestamp"].dt.isocalendar().week.astype(int)
test_df["time_week_of_year"] = test_df["timestamp"].dt.isocalendar().week.astype(int)

train_df["time_yearweek"] = train_df["timestamp"].dt.year*100 + train_df["time_week_of_year"]-200000
test_df["time_yearweek"] = test_df["timestamp"].dt.year*100 + test_df["time_week_of_year"]-200000

train_df["time_day_of_year"] = train_df["timestamp"].dt.dayofyear
test_df["time_day_of_year"] = test_df["timestamp"].dt.dayofyear

train_df["time_day_of_month"] = train_df["timestamp"].dt.day
test_df["time_day_of_month"] = test_df["timestamp"].dt.day

train_df["time_day_of_week"] = train_df["timestamp"].dt.weekday
test_df["time_day_of_week"] = test_df["timestamp"].dt.weekday

In [20]:
time_columns = ["timestamp", "time_year", "time_month_of_year", "time_yearmonth", "time_week_of_year", "time_yearweek",
                "time_day_of_year", "time_day_of_month", "time_day_of_week"]
train_df.loc[:3,time_columns]

Unnamed: 0,timestamp,time_year,time_month_of_year,time_yearmonth,time_week_of_year,time_yearweek,time_day_of_year,time_day_of_month,time_day_of_week
0,2011-08-20,2011,8,1108,33,1133,232,20,5
1,2011-08-23,2011,8,1108,34,1134,235,23,1
2,2011-08-27,2011,8,1108,34,1134,239,27,5
3,2011-09-01,2011,9,1109,35,1135,244,1,3
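
One thing to keep in mind with `time_yearweek`: the week number is an ISO week, while the year is the calendar year, and the two can disagree around New Year. The sketch below compares both variants on three hand-picked dates; `iso_yearweek` is a hypothetical alternative, not a column of this notebook, and `.dt.isocalendar()` assumes pandas 1.1 or newer.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2011-08-20", "2012-01-01", "2014-12-29"]))
iso = dates.dt.isocalendar()  # DataFrame with the columns year, week, day

week = iso["week"].astype(int)

# calendar year combined with ISO week, as in the cell above
calendar_yearweek = dates.dt.year * 100 + week - 200000

# ISO year combined with ISO week
iso_yearweek = iso["year"].astype(int) * 100 + week - 200000

print(calendar_yearweek.tolist())  # [1133, 1252, 1401]
print(iso_yearweek.tolist())       # [1133, 1152, 1501]
```

2012-01-01 belongs to ISO week 52 of 2011 and 2014-12-29 to ISO week 1 of 2015, so the calendar-year variant places them at the wrong end of the year.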


### square meters, floors, rooms

In [21]:
sqm_floors_columns = ['full_sq', 'life_sq', 'kitch_sq', 'floor', 'max_floor', 'num_room']

In [22]:
train_df.loc[10005:10008,sqm_floors_columns]

Unnamed: 0,full_sq,life_sq,kitch_sq,floor,max_floor,num_room
10005,62,,10.0,9.0,17.0,2.0
10006,71,48.0,8.0,7.0,8.0,3.0
10007,28,16.0,5.0,3.0,8.0,1.0
10008,60,,1.0,16.0,22.0,2.0


### material

In [23]:
material_columns = ["material"]
train_df.loc[:3,material_columns]

Unnamed: 0,material
0,
1,
2,
3,


### build year

In [24]:
build_year_columns = ["build_year"]
train_df.loc[:3,build_year_columns]

Unnamed: 0,build_year
0,
1,
2,
3,


### state

In [25]:
state_columns = ["state"]
train_df.loc[:3,state_columns]

Unnamed: 0,state
0,
1,
2,
3,


### product type

In [26]:
product_type_columns = ["product_type"]
train_df.loc[:3,product_type_columns]

Unnamed: 0,product_type
0,Investment
1,Investment
2,Investment
3,Investment


## Area / Raion info

### sub_area : until indust_part

In [27]:
sub_area_columns = ['sub_area', 'area_m', 'raion_popul', 'green_zone_part', 'indust_part']
train_df.loc[:3,sub_area_columns]

Unnamed: 0,sub_area,area_m,raion_popul,green_zone_part,indust_part
0,Bibirevo,6407578.0,155572,0.189727,7e-05
1,Nagatinskij Zaton,9589337.0,115352,0.372602,0.049637
2,Tekstil'shhiki,4808270.0,101708,0.11256,0.118537
3,Mitino,12583540.0,178473,0.194703,0.069753


### children, education, university, additional education

In [28]:
education_columns = ['children_preschool', 'preschool_quota', 'preschool_education_centers_raion', 'children_school', 'school_quota', 'school_education_centers_raion', 
                     'school_education_centers_top_20_raion', 'university_top_20_raion', 'additional_education_raion',]
train_df.loc[:3,education_columns]

Unnamed: 0,children_preschool,preschool_quota,preschool_education_centers_raion,children_school,school_quota,school_education_centers_raion,school_education_centers_top_20_raion,university_top_20_raion,additional_education_raion
0,9576,5001.0,5,10309,11065.0,5,0,0,3
1,6880,3119.0,5,7759,6237.0,8,0,0,1
2,5879,1463.0,4,6207,5580.0,7,0,0,1
3,13087,6839.0,9,13670,17063.0,10,0,0,6


### healthcare

In [29]:
healthcare_columns = ['hospital_beds_raion', 'healthcare_centers_raion']
train_df.loc[:3,healthcare_columns]

Unnamed: 0,hospital_beds_raion,healthcare_centers_raion
0,240.0,1
1,229.0,1
2,1183.0,1
3,,1


### culture

In [30]:
culture_columns = ['sport_objects_raion', 'culture_objects_top_25', 'culture_objects_top_25_raion', 'shopping_centers_raion', 'office_raion']
train_df.loc[:3,culture_columns]

Unnamed: 0,sport_objects_raion,culture_objects_top_25,culture_objects_top_25_raion,shopping_centers_raion,office_raion
0,7,no,0,16,1
1,6,yes,1,3,0
2,5,no,0,0,1
3,17,no,0,11,4


### power and industry yes/no in area

In [31]:
industry_columns = ['thermal_power_plant_raion', 'incineration_raion', 'oil_chemistry_raion', 'radiation_raion', 'railroad_terminal_raion', 
                    'big_market_raion', 'nuclear_reactor_raion', 'detention_facility_raion']

train_df.loc[:3,industry_columns]

Unnamed: 0,thermal_power_plant_raion,incineration_raion,oil_chemistry_raion,radiation_raion,railroad_terminal_raion,big_market_raion,nuclear_reactor_raion,detention_facility_raion
0,no,no,no,no,no,no,no,no
1,no,no,no,no,no,no,no,no
2,no,no,no,yes,no,no,no,no
3,no,no,no,no,no,no,no,no


## People counts

### people count all

In [32]:
ppl_count_all = ['full_all', 'male_f', 'female_f', 'young_all', 'young_male', 'young_female', 'work_all', 'work_male', 'work_female',
                 'ekder_all', 'ekder_male', 'ekder_female']
train_df.loc[:3,ppl_count_all]

Unnamed: 0,full_all,male_f,female_f,young_all,young_male,young_female,work_all,work_male,work_female,ekder_all,ekder_male,ekder_female
0,86206,40477,45729,21154,11007,10147,98207,52277,45930,36211,10580,25631
1,76284,34200,42084,15727,7925,7802,70194,35622,34572,29431,9266,20165
2,101982,46076,55906,13028,6835,6193,63388,31813,31575,25292,7609,17683
3,21155,9828,11327,28563,14680,13883,120381,60040,60341,29529,9083,20446


### people count 0-6

In [33]:
ppl_count_0_6 = ['0_6_all', '0_6_male', '0_6_female']
train_df.loc[:2,ppl_count_0_6]

Unnamed: 0,0_6_all,0_6_male,0_6_female
0,9576,4899,4677
1,6880,3466,3414
2,5879,3095,2784


### people count 7-14

In [34]:
ppl_count_7_14 = ['7_14_all', '7_14_male', '7_14_female']
train_df.loc[:2, ppl_count_7_14]

Unnamed: 0,7_14_all,7_14_male,7_14_female
0,10309,5463,4846
1,7759,3909,3850
2,6207,3269,2938


### people count 0-17

In [35]:
ppl_count_0_17 = ['0_17_all', '0_17_male', '0_17_female']
train_df.loc[:2, ppl_count_0_17]

Unnamed: 0,0_17_all,0_17_male,0_17_female
0,23603,12286,11317
1,17700,8998,8702
2,14884,7821,7063


### people count 16-29

In [36]:
ppl_count_16_29 = ['16_29_all', '16_29_male', '16_29_female']
train_df.loc[:2, ppl_count_16_29]

Unnamed: 0,16_29_all,16_29_male,16_29_female
0,17508,9425,8083
1,15164,7571,7593
2,19401,9045,10356


### people count 0-13

In [37]:
ppl_count_0_13 = ['0_13_all', '0_13_male', '0_13_female']
train_df.loc[:2, ppl_count_0_13]

Unnamed: 0,0_13_all,0_13_male,0_13_female
0,18654,9709,8945
1,13729,6929,6800
2,11252,5916,5336


## Build Count

### build count material

In [38]:
build_count_mat_columns = ['raion_build_count_with_material_info', 'build_count_block', 'build_count_wood', 'build_count_frame', 
                           'build_count_brick', 'build_count_monolith', 'build_count_panel', 'build_count_foam', 'build_count_slag', 'build_count_mix']
train_df.loc[:2, build_count_mat_columns]

Unnamed: 0,raion_build_count_with_material_info,build_count_block,build_count_wood,build_count_frame,build_count_brick,build_count_monolith,build_count_panel,build_count_foam,build_count_slag,build_count_mix
0,211.0,25.0,0.0,0.0,0.0,2.0,184.0,0.0,0.0,0.0
1,245.0,83.0,1.0,0.0,67.0,4.0,90.0,0.0,0.0,0.0
2,330.0,59.0,0.0,0.0,206.0,4.0,60.0,0.0,1.0,0.0


### build count date

In [39]:
build_count_date_columns = ['raion_build_count_with_builddate_info', 'build_count_before_1920', 'build_count_1921-1945', 
                           'build_count_1946-1970', 'build_count_1971-1995', 'build_count_after_1995']
train_df.loc[:2, build_count_date_columns]

Unnamed: 0,raion_build_count_with_builddate_info,build_count_before_1920,build_count_1921-1945,build_count_1946-1970,build_count_1971-1995,build_count_after_1995
0,211.0,0.0,0.0,0.0,206.0,5.0
1,244.0,1.0,1.0,143.0,84.0,15.0
2,330.0,1.0,0.0,246.0,63.0,20.0


## Nearby Locations

### 1line features

In [40]:
columns = np.array(train_df.columns)
[item for item in columns if "1line" in item]

['water_1line', 'big_road1_1line', 'railroad_1line']

In [41]:
one_line_columns = ['water_1line', 'big_road1_1line', 'railroad_1line']
train_df.loc[:2,one_line_columns]

Unnamed: 0,water_1line,big_road1_1line,railroad_1line
0,no,no,no
1,no,no,no
2,no,no,no


### km distances

In [42]:
loc_km_dist_columns = ['kindergarten_km', 'school_km', 'park_km', 'green_zone_km', 'industrial_km', 'water_treatment_km', 'cemetery_km', 
                       'incineration_km', 'water_km', 'oil_chemistry_km', 'nuclear_reactor_km', 'radiation_km', 
                       'power_transmission_line_km', 'thermal_power_plant_km', 'ts_km', 'big_market_km', 'market_shop_km', 'fitness_km',
                       'swim_pool_km', 'ice_rink_km', 'stadium_km', 'basketball_km', 'hospice_morgue_km', 'detention_facility_km', 
                       'public_healthcare_km', 'university_km', 'workplaces_km', 'shopping_centers_km', 'office_km', 'additional_education_km',
                       'preschool_km', 'big_church_km', 'church_synagogue_km', 'mosque_km', 'theater_km', 'museum_km', 'exhibition_km', 'catering_km']
train_df.loc[:2,loc_km_dist_columns]

Unnamed: 0,kindergarten_km,school_km,park_km,green_zone_km,industrial_km,water_treatment_km,cemetery_km,incineration_km,water_km,oil_chemistry_km,nuclear_reactor_km,radiation_km,power_transmission_line_km,thermal_power_plant_km,ts_km,big_market_km,market_shop_km,fitness_km,swim_pool_km,ice_rink_km,stadium_km,basketball_km,hospice_morgue_km,detention_facility_km,public_healthcare_km,university_km,workplaces_km,shopping_centers_km,office_km,additional_education_km,preschool_km,big_church_km,church_synagogue_km,mosque_km,theater_km,museum_km,exhibition_km,catering_km
0,0.1457,0.177975,2.158587,0.600973,1.080934,23.68346,1.804127,3.633334,0.992631,18.152338,5.718519,1.210027,1.062513,5.814135,4.308127,10.814172,1.676258,0.485841,3.065047,1.107594,8.148591,3.516513,2.392353,4.248036,0.974743,6.715026,0.88435,0.648488,0.637189,0.947962,0.177975,0.625783,0.628187,3.93204,14.053047,7.389498,7.023705,0.516838
1,0.147754,0.273345,0.55069,0.065321,0.966479,1.317476,4.655004,8.648587,0.698081,9.034642,3.489954,2.724295,1.246149,3.419574,0.72556,6.910568,3.424716,0.668364,2.000154,8.972823,6.127073,1.161579,2.543747,12.649879,1.477723,1.85256,0.686252,0.519311,0.688796,1.072315,0.273345,0.967821,0.471447,4.841544,6.829889,0.70926,2.35884,0.230287
2,0.049102,0.158072,0.374848,0.453172,0.939275,4.91266,3.381083,11.99648,0.468265,5.777394,7.506612,0.772216,1.602183,3.682455,3.562188,5.752368,1.375443,0.733101,1.239304,1.978517,0.767569,1.952771,0.621357,7.682303,0.097144,0.841254,1.510089,1.486533,1.543049,0.391957,0.158072,3.178751,0.755946,7.922152,4.2732,3.156423,4.958214,0.190462


### Metro

In [43]:
loc_metro_columns = ['ID_metro', 'metro_min_avto', 'metro_km_avto', 'metro_min_walk', 'metro_km_walk']
train_df.loc[:2,loc_metro_columns]

Unnamed: 0,ID_metro,metro_min_avto,metro_km_avto,metro_min_walk,metro_km_walk
0,1,2.590241,1.13126,13.575119,1.13126
1,2,0.9367,0.647337,7.62063,0.635053
2,3,2.120999,1.637996,17.351515,1.44596


### railroad and public transport

In [44]:
loc_railroad_pubtrans_columns = ['ID_railroad_station_walk', 'railroad_station_walk_km', 'railroad_station_walk_min',
                                 'ID_railroad_station_avto', 'railroad_station_avto_km', 'railroad_station_avto_min',
                                 'public_transport_station_km', 'public_transport_station_min_walk', 
                                 'railroad_km', 'zd_vokzaly_avto_km', 'ID_railroad_terminal', 
                                 'ID_bus_terminal', 'bus_terminal_avto_km']
train_df.loc[:2,loc_railroad_pubtrans_columns]

Unnamed: 0,ID_railroad_station_walk,railroad_station_walk_km,railroad_station_walk_min,ID_railroad_station_avto,railroad_station_avto_km,railroad_station_avto_min,public_transport_station_km,public_transport_station_min_walk,railroad_km,zd_vokzaly_avto_km,ID_railroad_terminal,ID_bus_terminal,bus_terminal_avto_km
0,1.0,5.419893,65.038716,1,5.419893,6.905893,0.274985,3.299822,1.305159,14.231961,101,1,24.292406
1,2.0,3.411993,40.943917,2,3.641773,4.679745,0.065263,0.78316,0.694536,9.242586,32,2,5.706113
2,3.0,1.277658,15.331896,3,1.277658,1.70142,0.328756,3.945073,0.700691,9.540544,5,3,6.710302


### Major roads

* mkad = Moscow Circle Auto Road (ring road)
* ttk = Third Transport Ring
* sadovoe = Garden Ring

In [45]:
road_columns = ['mkad_km', 'ttk_km', 'sadovoe_km', 'bulvar_ring_km', 'kremlin_km',
                'ID_big_road1', 'big_road1_km',
                'ID_big_road2', 'big_road2_km']
train_df.loc[:2,road_columns]

Unnamed: 0,mkad_km,ttk_km,sadovoe_km,bulvar_ring_km,kremlin_km,ID_big_road1,big_road1_km,ID_big_road2,big_road2_km
0,1.422391,10.918587,13.100618,13.675657,15.156211,1,1.422391,5,3.830951
1,9.503405,3.103996,6.444333,8.13264,8.698054,2,2.887377,4,3.103996
2,5.6048,2.927487,6.963403,8.054252,9.067885,3,0.64725,4,2.927487


### ecology

In [46]:
ecology_columns = ['ecology']
train_df.loc[:3, ecology_columns]

Unnamed: 0,ecology
0,good
1,excellent
2,poor
3,good


## Series 500, 1000, 1500, 2000, 3000, 5000

### 500

In [47]:
count_500_columns = ['green_part_500', 'prom_part_500', 'office_count_500', 'office_sqm_500', 'trc_count_500', 'trc_sqm_500', 
                     'cafe_count_500', 'cafe_sum_500_min_price_avg', 'cafe_sum_500_max_price_avg', 'cafe_avg_price_500', 
                     'cafe_count_500_na_price', 'cafe_count_500_price_500', 'cafe_count_500_price_1000', 'cafe_count_500_price_1500',
                     'cafe_count_500_price_2500', 'cafe_count_500_price_4000', 'cafe_count_500_price_high',
                     'big_church_count_500', 'church_count_500', 'mosque_count_500',
                     'leisure_count_500', 'sport_count_500', 'market_count_500']
train_df.loc[:3,count_500_columns]

Unnamed: 0,green_part_500,prom_part_500,office_count_500,office_sqm_500,trc_count_500,trc_sqm_500,cafe_count_500,cafe_sum_500_min_price_avg,cafe_sum_500_max_price_avg,cafe_avg_price_500,cafe_count_500_na_price,cafe_count_500_price_500,cafe_count_500_price_1000,cafe_count_500_price_1500,cafe_count_500_price_2500,cafe_count_500_price_4000,cafe_count_500_price_high,big_church_count_500,church_count_500,mosque_count_500,leisure_count_500,sport_count_500,market_count_500
0,0.0,0.0,0,0,0,0,0,,,,0,0,0,0,0,0,0,0,0,0,0,1,0
1,25.14,0.0,0,0,0,0,5,860.0,1500.0,1180.0,0,1,3,0,0,1,0,0,1,0,0,0,0
2,1.67,0.0,0,0,0,0,3,666.67,1166.67,916.67,0,0,2,1,0,0,0,0,0,0,0,0,0
3,17.36,0.57,0,0,0,0,2,1000.0,1500.0,1250.0,0,0,0,2,0,0,0,0,0,0,0,0,0


### 1000

In [48]:
count_1000_columns = ['green_part_1000', 'prom_part_1000', 'office_count_1000', 'office_sqm_1000', 'trc_count_1000',
                     'trc_sqm_1000', 'cafe_count_1000', 'cafe_sum_1000_min_price_avg', 'cafe_sum_1000_max_price_avg', 'cafe_avg_price_1000',
                     'cafe_count_1000_na_price', 'cafe_count_1000_price_500', 'cafe_count_1000_price_1000', 'cafe_count_1000_price_1500',
                     'cafe_count_1000_price_2500', 'cafe_count_1000_price_4000', 'cafe_count_1000_price_high', 
                     'big_church_count_1000', 'church_count_1000', 'mosque_count_1000', 
                     'leisure_count_1000', 'sport_count_1000', 'market_count_1000']
train_df.loc[:3,count_1000_columns]

Unnamed: 0,green_part_1000,prom_part_1000,office_count_1000,office_sqm_1000,trc_count_1000,trc_sqm_1000,cafe_count_1000,cafe_sum_1000_min_price_avg,cafe_sum_1000_max_price_avg,cafe_avg_price_1000,cafe_count_1000_na_price,cafe_count_1000_price_500,cafe_count_1000_price_1000,cafe_count_1000_price_1500,cafe_count_1000_price_2500,cafe_count_1000_price_4000,cafe_count_1000_price_high,big_church_count_1000,church_count_1000,mosque_count_1000,leisure_count_1000,sport_count_1000,market_count_1000
0,7.36,0.0,1,30500,3,55600,19,527.78,888.89,708.33,1,10,4,3,1,0,0,1,2,0,0,6,1
1,26.66,0.07,2,86600,5,94065,13,615.38,1076.92,846.15,0,5,6,1,0,1,0,1,2,0,4,2,0
2,4.99,0.29,0,0,0,0,9,642.86,1142.86,892.86,2,0,5,2,0,0,0,0,1,0,0,5,3
3,19.25,10.35,1,11000,6,80780,12,658.33,1083.33,870.83,0,3,4,5,0,0,0,0,0,0,0,3,1


### 1500

In [49]:
count_1500_columns = ['green_part_1500', 'prom_part_1500', 'office_count_1500', 'office_sqm_1500', 'trc_count_1500', 'trc_sqm_1500', 
                      'cafe_count_1500', 'cafe_sum_1500_min_price_avg', 'cafe_sum_1500_max_price_avg', 'cafe_avg_price_1500', 'cafe_count_1500_na_price',
                      'cafe_count_1500_price_500', 'cafe_count_1500_price_1000', 'cafe_count_1500_price_1500', 'cafe_count_1500_price_2500',
                      'cafe_count_1500_price_4000', 'cafe_count_1500_price_high', 
                      'big_church_count_1500', 'church_count_1500', 'mosque_count_1500',
                      'leisure_count_1500', 'sport_count_1500', 'market_count_1500',]
train_df.loc[:3,count_1500_columns]

Unnamed: 0,green_part_1500,prom_part_1500,office_count_1500,office_sqm_1500,trc_count_1500,trc_sqm_1500,cafe_count_1500,cafe_sum_1500_min_price_avg,cafe_sum_1500_max_price_avg,cafe_avg_price_1500,cafe_count_1500_na_price,cafe_count_1500_price_500,cafe_count_1500_price_1000,cafe_count_1500_price_1500,cafe_count_1500_price_2500,cafe_count_1500_price_4000,cafe_count_1500_price_high,big_church_count_1500,church_count_1500,mosque_count_1500,leisure_count_1500,sport_count_1500,market_count_1500
0,14.27,6.92,3,39554,9,171420,34,566.67,969.7,768.18,1,14,11,6,2,0,0,1,2,0,0,7,1
1,21.53,7.71,3,102910,7,127065,17,694.12,1205.88,950.0,0,6,7,1,2,1,0,1,5,0,4,9,0
2,9.92,6.73,0,0,1,2600,14,516.67,916.67,716.67,2,4,6,2,0,0,0,0,4,0,0,6,5
3,28.38,6.57,2,11000,7,89492,23,673.91,1130.43,902.17,0,5,9,8,1,0,0,1,0,0,0,9,2


### 2000

In [50]:
count_2000_columns = ['green_part_2000', 'prom_part_2000', 'office_count_2000', 'office_sqm_2000', 'trc_count_2000', 'trc_sqm_2000', 
                      'cafe_count_2000', 'cafe_sum_2000_min_price_avg', 'cafe_sum_2000_max_price_avg', 'cafe_avg_price_2000', 'cafe_count_2000_na_price',
                      'cafe_count_2000_price_500', 'cafe_count_2000_price_1000', 'cafe_count_2000_price_1500', 'cafe_count_2000_price_2500',
                      'cafe_count_2000_price_4000', 'cafe_count_2000_price_high',
                      'big_church_count_2000', 'church_count_2000', 'mosque_count_2000',
                      'leisure_count_2000', 'sport_count_2000', 'market_count_2000']
train_df.loc[:3,count_2000_columns]

Unnamed: 0,green_part_2000,prom_part_2000,office_count_2000,office_sqm_2000,trc_count_2000,trc_sqm_2000,cafe_count_2000,cafe_sum_2000_min_price_avg,cafe_sum_2000_max_price_avg,cafe_avg_price_2000,cafe_count_2000_na_price,cafe_count_2000_price_500,cafe_count_2000_price_1000,cafe_count_2000_price_1500,cafe_count_2000_price_2500,cafe_count_2000_price_4000,cafe_count_2000_price_high,big_church_count_2000,church_count_2000,mosque_count_2000,leisure_count_2000,sport_count_2000,market_count_2000
0,11.77,15.97,9,188854,19,1244891,36,614.29,1042.86,828.57,1,15,11,6,2,1,0,1,2,0,0,10,1
1,22.37,19.25,4,165510,8,179065,21,695.24,1190.48,942.86,0,7,8,3,2,1,0,1,5,0,4,11,0
2,12.99,12.75,4,100200,7,52550,24,563.64,977.27,770.45,2,8,9,4,1,0,0,0,4,0,0,8,5
3,32.29,5.73,2,11000,7,89492,25,660.0,1120.0,890.0,0,5,11,8,1,0,0,1,1,0,0,13,2


### 3000

In [51]:
count_3000_columns = ['green_part_3000', 'prom_part_3000', 'office_count_3000', 'office_sqm_3000', 'trc_count_3000', 'trc_sqm_3000', 
                      'cafe_count_3000', 'cafe_sum_3000_min_price_avg', 'cafe_sum_3000_max_price_avg', 'cafe_avg_price_3000', 'cafe_count_3000_na_price',
                      'cafe_count_3000_price_500', 'cafe_count_3000_price_1000', 'cafe_count_3000_price_1500', 'cafe_count_3000_price_2500',
                      'cafe_count_3000_price_4000', 'cafe_count_3000_price_high',
                      'big_church_count_3000', 'church_count_3000', 'mosque_count_3000',
                      'leisure_count_3000', 'sport_count_3000', 'market_count_3000']
train_df.loc[:3,count_3000_columns]

Unnamed: 0,green_part_3000,prom_part_3000,office_count_3000,office_sqm_3000,trc_count_3000,trc_sqm_3000,cafe_count_3000,cafe_sum_3000_min_price_avg,cafe_sum_3000_max_price_avg,cafe_avg_price_3000,cafe_count_3000_na_price,cafe_count_3000_price_500,cafe_count_3000_price_1000,cafe_count_3000_price_1500,cafe_count_3000_price_2500,cafe_count_3000_price_4000,cafe_count_3000_price_high,big_church_count_3000,church_count_3000,mosque_count_3000,leisure_count_3000,sport_count_3000,market_count_3000
0,11.98,13.55,12,251554,23,1419204,68,639.68,1079.37,859.52,5,21,22,16,3,1,0,2,4,0,0,21,1
1,18.07,27.32,12,821986,14,491565,30,631.03,1086.21,858.62,1,11,11,4,2,1,0,1,7,0,6,19,1
2,12.14,26.46,8,110856,7,52550,41,697.44,1192.31,944.87,2,9,17,9,3,1,0,0,11,0,0,20,6
3,20.79,3.57,4,167000,12,205756,32,718.75,1218.75,968.75,0,5,14,10,3,0,0,1,2,0,0,18,3


### 5000

In [52]:
count_5000_columns = ['green_part_5000', 'prom_part_5000', 'office_count_5000', 'office_sqm_5000', 'trc_count_5000', 'trc_sqm_5000', 
                      'cafe_count_5000', 'cafe_sum_5000_min_price_avg', 'cafe_sum_5000_max_price_avg', 'cafe_avg_price_5000', 'cafe_count_5000_na_price',
                      'cafe_count_5000_price_500', 'cafe_count_5000_price_1000', 'cafe_count_5000_price_1500', 'cafe_count_5000_price_2500',
                      'cafe_count_5000_price_4000', 'cafe_count_5000_price_high',
                      'big_church_count_5000', 'church_count_5000', 'mosque_count_5000',
                      'leisure_count_5000', 'sport_count_5000', 'market_count_5000']
train_df.loc[:3,count_5000_columns]

Unnamed: 0,green_part_5000,prom_part_5000,office_count_5000,office_sqm_5000,trc_count_5000,trc_sqm_5000,cafe_count_5000,cafe_sum_5000_min_price_avg,cafe_sum_5000_max_price_avg,cafe_avg_price_5000,cafe_count_5000_na_price,cafe_count_5000_price_500,cafe_count_5000_price_1000,cafe_count_5000_price_1500,cafe_count_5000_price_2500,cafe_count_5000_price_4000,cafe_count_5000_price_high,big_church_count_5000,church_count_5000,mosque_count_5000,leisure_count_5000,sport_count_5000,market_count_5000
0,13.09,13.31,29,807385,52,4036616,152,708.57,1185.71,947.14,12,39,48,40,9,4,0,13,22,1,0,52,4
1,10.26,27.47,66,2690465,40,2034942,177,673.81,1148.81,911.31,9,49,65,36,15,3,0,15,29,1,10,66,14
2,13.69,21.58,43,1478160,35,1572990,122,702.68,1196.43,949.55,10,29,45,25,10,3,0,11,27,0,4,67,10
3,14.18,3.89,8,244166,22,942180,61,931.58,1552.63,1242.11,4,7,21,15,11,2,1,4,4,0,0,26,3
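
The 3000 and 5000 lists above differ only in the radius suffix, so they could also be generated instead of typed out. The following is a minimal sketch: `radius_columns` is a hypothetical helper name that is not part of the original notebook, and the name patterns are copied from the lists above.

```python
def radius_columns(radius):
    """Build the list of neighbourhood columns for one radius (e.g. 3000)."""
    r = str(radius)
    base = ['green_part_' + r, 'prom_part_' + r,
            'office_count_' + r, 'office_sqm_' + r,
            'trc_count_' + r, 'trc_sqm_' + r]
    cafe = ['cafe_count_' + r,
            'cafe_sum_' + r + '_min_price_avg',
            'cafe_sum_' + r + '_max_price_avg',
            'cafe_avg_price_' + r,
            'cafe_count_' + r + '_na_price']
    # The price buckets follow the same naming scheme for every radius
    cafe += ['cafe_count_{}_price_{}'.format(r, bucket)
             for bucket in (500, 1000, 1500, 2500, 4000, 'high')]
    other = [name + '_count_' + r
             for name in ('big_church', 'church', 'mosque',
                          'leisure', 'sport', 'market')]
    return base + cafe + other

count_5000_generated = radius_columns(5000)
```

With this helper, `train_df.loc[:3, radius_columns(3000)]` should select the same columns as `count_3000_columns`.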


## Additional macro economic features

### Oil

In [53]:
macro_oil_columns = ['oil_urals', 'brent']
train_df.loc[:2, macro_oil_columns]

Unnamed: 0,oil_urals,brent
0,109.31,108.62
1,109.31,109.31
2,109.31,111.36


### GDP and GRP (domestic and regional)

In [54]:
macro_gdp_columns = ['gdp_quart', 'gdp_quart_growth', 'gdp_deflator', 'gdp_annual', 'gdp_annual_growth', 'grp', 'grp_growth', 'income_per_cap',
       'real_dispos_income_per_cap_growth', 'salary', 'salary_growth']
train_df.loc[:2, macro_gdp_columns]

Unnamed: 0,gdp_quart,gdp_quart_growth,gdp_deflator,gdp_annual,gdp_annual_growth,grp,grp_growth,income_per_cap,real_dispos_income_per_cap_growth,salary,salary_growth
0,14313.7,3.3,86.721,46308.5,0.045037,9948.7728,0.187791,42688.6,-0.005,44898.7,0.168917
1,14313.7,3.3,86.721,46308.5,0.045037,9948.7728,0.187791,42688.6,-0.005,44898.7,0.168917
2,14313.7,3.3,86.721,46308.5,0.045037,9948.7728,0.187791,42688.6,-0.005,44898.7,0.168917


### General Indexes (Consumer & Producer Prices, Russian Trading System, MICEX Moscow Exchange)

In [55]:
macro_price_indexes_columns = ['cpi', 'ppi', 'rts', 'micex', 'micex_rgbi_tr', 'micex_cbi_tr', 'fixed_basket']
train_df.loc[:2, macro_price_indexes_columns]

Unnamed: 0,cpi,ppi,rts,micex,micex_rgbi_tr,micex_cbi_tr,fixed_basket
0,354.0,420.7,1575.33,1438.74,131.16,204.78,12838.36
1,354.0,420.7,1578.91,1444.11,131.45,204.92,12838.36
2,354.0,420.7,1596.17,1458.84,131.08,204.84,12838.36


### Trade

In [56]:
macro_trade_columns = ['balance_trade', 'balance_trade_growth', 'net_capital_export', 'retail_trade_turnover',
                       'retail_trade_turnover_per_cap', 'retail_trade_turnover_growth']
train_df.loc[:2, macro_trade_columns]

Unnamed: 0,balance_trade,balance_trade_growth,net_capital_export,retail_trade_turnover,retail_trade_turnover_per_cap,retail_trade_turnover_growth
0,15.459,10.1,0.301811,3322.047,286.952,106.6
1,15.459,10.1,0.301811,3322.047,286.952,106.6
2,15.459,10.1,0.301811,3322.047,286.952,106.6


### Exchange rates

In [57]:
macro_exchange_rates_columns = ['usdrub', 'eurrub']
train_df.loc[:2, macro_exchange_rates_columns]

Unnamed: 0,usdrub,eurrub
0,29.0048,41.7681
1,28.9525,41.7537
2,28.8082,41.7114


### Build Contract

In [58]:
macro_build_contract_columns = ['average_provision_of_build_contract', 'average_provision_of_build_contract_moscow']
train_df.loc[:2, macro_build_contract_columns]


### Deposits and Mortgages

In [59]:
macro_deposits_mortgages_columns = ['deposits_value', 'deposits_growth', 'deposits_rate', 'mortgage_value', 'mortgage_growth', 'mortgage_rate']
train_df.loc[:2, macro_deposits_mortgages_columns]

Unnamed: 0,deposits_value,deposits_growth,deposits_rate,mortgage_value,mortgage_growth,mortgage_rate
0,10618898,0.00974,4.1,323275,1.051914,11.84
1,10618898,0.00974,4.1,323275,1.051914,11.84
2,10618898,0.00974,4.1,323275,1.051914,11.84


### Demographics

In [60]:
macro_demographics_columns = ['labor_force', 'unemployment', 'employment', 'marriages_per_1000_cap', 'divorce_rate', 'pop_natural_increase', 
                              'pop_migration', 'pop_total_inc', 'childbirth', 'mortality', 'average_life_exp', 'infant_mortarity_per_1000_cap',
                              'perinatal_mort_per_1000_cap', 'incidence_population']
train_df.loc[:2, macro_demographics_columns]

Unnamed: 0,labor_force,unemployment,employment,marriages_per_1000_cap,divorce_rate,pop_natural_increase,pop_migration,pop_total_inc,childbirth,mortality,average_life_exp,infant_mortarity_per_1000_cap,perinatal_mort_per_1000_cap,incidence_population
0,6643.626,0.014,0.708,8.5,3.8,1.1,5.1,6.2,10.8,9.7,75.79,6.2,5.53,715.1
1,6643.626,0.014,0.708,8.5,3.8,1.1,5.1,6.2,10.8,9.7,75.79,6.2,5.53,715.1
2,6643.626,0.014,0.708,8.5,3.8,1.1,5.1,6.2,10.8,9.7,75.79,6.2,5.53,715.1


### Investments and Enterprises

In [61]:
macro_invest_enterprises_columns = ['invest_fixed_capital_per_cap', 'invest_fixed_assets', 'invest_fixed_assets_phys', 'profitable_enterpr_share', 'unprofitable_enterpr_share', 
                                    'share_own_revenues', 'overdue_wages_per_cap', 'fin_res_per_cap']
train_df.loc[:2, macro_invest_enterprises_columns]

Unnamed: 0,invest_fixed_capital_per_cap,invest_fixed_assets,invest_fixed_assets_phys,profitable_enterpr_share,unprofitable_enterpr_share,share_own_revenues,overdue_wages_per_cap,fin_res_per_cap
0,73976.19863,856.424079,106.6,0.708,0.292,0.891478,53636.0,226.214157
1,73976.19863,856.424079,106.6,0.708,0.292,0.891478,53636.0,226.214157
2,73976.19863,856.424079,106.6,0.708,0.292,0.891478,53636.0,226.214157


### Housing

In [62]:
macro_housing_columns = ['housing_fund_sqm', 'lodging_sqm_per_cap', 'water_pipes_share', 'baths_share', 'sewerage_share', 'gas_share',
                         'hot_water_share', 'electric_stove_share', 'heating_share', 'old_house_share', 
                         'rent_price_4+room_bus', 'rent_price_3room_bus', 'rent_price_2room_bus', 'rent_price_1room_bus', 
                         'rent_price_3room_eco',  'rent_price_2room_eco', 'rent_price_1room_eco',
                         'apartment_build', 'apartment_fund_sqm']
train_df.loc[:2, macro_housing_columns]

Unnamed: 0,housing_fund_sqm,lodging_sqm_per_cap,water_pipes_share,baths_share,sewerage_share,gas_share,hot_water_share,electric_stove_share,heating_share,old_house_share,rent_price_4+room_bus,rent_price_3room_bus,rent_price_2room_bus,rent_price_1room_bus,rent_price_3room_eco,rent_price_2room_eco,rent_price_1room_eco,apartment_build,apartment_fund_sqm
0,218.0,18.772066,99.9,99.8,99.5,43.9,95.7,55.3,99.9,0.3,136.11,77.93,62.89,47.85,41.8,36.77,29.07,23587.0,230310.0
1,218.0,18.772066,99.9,99.8,99.5,43.9,95.7,55.3,99.9,0.3,136.11,77.93,62.89,47.85,41.8,36.77,29.07,23587.0,230310.0
2,218.0,18.772066,99.9,99.8,99.5,43.9,95.7,55.3,99.9,0.3,136.11,77.93,62.89,47.85,41.8,36.77,29.07,23587.0,230310.0


### Education

In [63]:
macro_education_columns = ['load_of_teachers_preschool_per_teacher', 'child_on_acc_pre_school', 'load_of_teachers_school_per_teacher',
                           'students_state_oneshift', 'modern_education_share', 'old_education_build_share']
train_df.loc[:2, macro_education_columns]

Unnamed: 0,load_of_teachers_preschool_per_teacher,child_on_acc_pre_school,load_of_teachers_school_per_teacher,students_state_oneshift,modern_education_share,old_education_build_share
0,793.319561,#!,1391.710938,89.0495,,
1,793.319561,#!,1391.710938,89.0495,,
2,793.319561,#!,1391.710938,89.0495,,


### Healthcare

In [64]:
macro_healthcare_columns = ['provision_doctors', 'provision_nurse', 'load_on_doctors', 'power_clinics', 
                            'hospital_beds_available_per_cap', 'hospital_bed_occupancy_per_year']
train_df.loc[:2, macro_healthcare_columns]

Unnamed: 0,provision_doctors,provision_nurse,load_on_doctors,power_clinics,hospital_beds_available_per_cap,hospital_bed_occupancy_per_year
0,65.9,99.6,8180.755454,375.8,846.0,302.0
1,65.9,99.6,8180.755454,375.8,846.0,302.0
2,65.9,99.6,8180.755454,375.8,846.0,302.0


### Retail

In [65]:
macro_retail_columns = ['provision_retail_space_sqm', 'provision_retail_space_modern_sqm']
train_df.loc[:2, macro_retail_columns]

Unnamed: 0,provision_retail_space_sqm,provision_retail_space_modern_sqm
0,741.0,271.0
1,741.0,271.0
2,741.0,271.0


### Food and Culture

In [66]:
macro_food_culture_columns = ['turnover_catering_per_cap', 'theaters_viewers_per_1000_cap', 'seats_theather_rfmin_per_100000_cap', 'museum_visitis_per_100_cap',
                              'bandwidth_sports', 'population_reg_sports_share', 'students_reg_sports_share']
train_df.loc[:2, macro_food_culture_columns]

Unnamed: 0,turnover_catering_per_cap,theaters_viewers_per_1000_cap,seats_theather_rfmin_per_100000_cap,museum_visitis_per_100_cap,bandwidth_sports,population_reg_sports_share,students_reg_sports_share
0,6943.0,565.0,0.45356,1240.0,269768.0,22.37,64.12
1,6943.0,565.0,0.45356,1240.0,269768.0,22.37,64.12
2,6943.0,565.0,0.45356,1240.0,269768.0,22.37,64.12
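
Because the collections are maintained by hand, a misspelled or missing column name only surfaces when the `.loc` lookup fails. A small check along the following lines can report such names up front. This is only a sketch: `missing_columns` is a hypothetical helper, and the toy DataFrame merely stands in for `train_df`.

```python
import pandas as pd

def missing_columns(df, columns):
    """Return the names in `columns` that are not present in `df`."""
    present = set(df.columns)
    return [name for name in columns if name not in present]

# Toy frame standing in for train_df (first row of the Oil table above)
toy_df = pd.DataFrame({'oil_urals': [109.31], 'brent': [108.62]})

print(missing_columns(toy_df, ['oil_urals', 'brent']))   # nothing is missing
print(missing_columns(toy_df, ['oil_urals', 'usdrub']))  # 'usdrub' is absent from the toy frame
```

Running the check once per collection after adding new features keeps the lists and the DataFrame in sync.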


# Modeling
[back to top](#Table-of-Contents)

### Deleting unwanted columns

In [67]:
train_df = del_columns(train_df, ["id", "price_doc", "price_doc_euro", "price_doc_dollars"])
test_df = del_columns(test_df, ["id"])
print("\n", test_df.shape)

Deleted:  id
Deleted:  price_doc
Deleted:  price_doc_euro
Deleted:  price_doc_dollars
Deleted:  id

 (7662, 399)


In [68]:
numeric_columns = get_numeric_columns(train_df)
print(train_df.shape, "to numeric only:", train_df[numeric_columns].shape)

(30471, 399) to numeric only: (30471, 379)


## Train Test Split with last year of training set

In [69]:
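# Hold out all rows with time_yearmonth >= 1407 as the validation set
val_time = 1407
# np.where returns a tuple of arrays; [0] extracts the positional row indices
dev_indices = np.where(train_df["time_yearmonth"] < val_time)[0]
val_indices = np.where(train_df["time_yearmonth"] >= val_time)[0]
# .ix was removed in pandas 1.0; .iloc selects rows by position
dev_X = train_df[numeric_columns].iloc[dev_indices]
val_X = train_df[numeric_columns].iloc[val_indices]
dev_y = train_y[dev_indices]
val_y = train_y[val_indices]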
print(dev_X.shape, val_X.shape)

(20483, 379) (9988, 379)
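
The same time-based split can also be written with boolean masks, which avoids positional indices altogether. The following is a sketch on a toy frame: the values are made up, and reading `time_yearmonth` as an integer in YYMM form is an assumption rather than something the notebook states.

```python
import pandas as pd

# Toy data; time_yearmonth is assumed to be an integer in YYMM form
toy = pd.DataFrame({
    'time_yearmonth': [1401, 1406, 1407, 1412, 1501],
    'feature': [1.0, 2.0, 3.0, 4.0, 5.0],
})
toy_y = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

val_time = 1407
is_val = toy['time_yearmonth'] >= val_time  # boolean mask aligned with toy's index

dev_X, val_X = toy.loc[~is_val], toy.loc[is_val]
dev_y, val_y = toy_y[~is_val], toy_y[is_val]
print(dev_X.shape, val_X.shape)
```

Splitting by time rather than at random keeps the validation rows strictly later than the development rows, which mirrors how the test set follows the training period.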


## Linear Benchmark

## XGBoost

In [70]:
import xgboost as xgb



## LightGBM

## Neural Network

## More models

# Ensembles
[back to top](#Table-of-Contents)

### Mean Ensemble
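
A mean ensemble averages the predictions of several models. The sketch below uses made-up prediction arrays in place of real model outputs; `mean_ensemble` is a hypothetical helper, and the optional weights are an addition for illustration only.

```python
import numpy as np

def mean_ensemble(predictions, weights=None):
    """Average a list of equally long prediction arrays, optionally weighted."""
    stacked = np.vstack(predictions)  # shape: (n_models, n_samples)
    return np.average(stacked, axis=0, weights=weights)

# Made-up predictions from two hypothetical models for three samples
pred_a = np.array([1.0, 2.0, 3.0])
pred_b = np.array([3.0, 2.0, 1.0])

blend = mean_ensemble([pred_a, pred_b])                      # plain mean
weighted = mean_ensemble([pred_a, pred_b], weights=[3, 1])   # pred_a counts three times as much
```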

### Stacking