# Sberbank: Russian Housing Market - Part 2

url = https://www.kaggle.com/c/sberbank-russian-housing-market

### Improving the Model

In this notebook I build on the work from part 1. I load my data, reuse my functions from part 1, and spend most of the notebook on feature engineering to improve my model.

In [2]:
import pandas as pd
import numpy as np

Below is one idea I tried. The notebook linked in the next cell makes the case for using only certain macro columns, because the rest do not seem to provide any new information.

In [3]:
# From here: https://www.kaggle.com/robertoruiz/sberbank-russian-housing-market/dealing-with-multicollinearity/notebook
macro_cols = ["balance_trade", "balance_trade_growth", "eurrub", "average_provision_of_build_contract",
"micex_rgbi_tr", "micex_cbi_tr", "deposits_rate", "mortgage_value", "mortgage_rate",
"income_per_cap", "rent_price_4+room_bus", "museum_visitis_per_100_cap", "apartment_build"]

In [4]:
#Load Data
macro_raw = pd.read_csv('data/macro.csv', parse_dates=['timestamp']) #Load all macro data
#macro_raw = pd.read_csv('data/macro.csv', parse_dates=['timestamp'], usecols=['timestamp']+macro_cols) #Load only macro_cols
train_raw = pd.read_csv('data/train.csv', parse_dates=['timestamp'])
test_raw = pd.read_csv('data/test.csv', parse_dates=['timestamp'])
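
One way to judge whether a macro column adds new information is to look at how strongly it correlates with columns already kept. The sketch below is a simple pairwise-correlation filter on toy data, not necessarily the method used in the linked notebook; the function name `drop_highly_correlated`, the 0.95 threshold, and the toy columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.95):
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data (hypothetical): 'b' is nearly a rescaled copy of 'a', 'c' is independent
rng = np.random.RandomState(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a * 2 + rng.normal(scale=0.01, size=200),
    'c': rng.normal(size=200),
})
reduced, dropped = drop_highly_correlated(toy)
print(dropped)
```

A filter like this only catches pairwise redundancy; a column that is a combination of several others would pass it.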

In [5]:
#Join macro-economic data
train_full = pd.merge(train_raw, macro_raw, how='left', on='timestamp')
test_full = pd.merge(test_raw, macro_raw, how='left', on='timestamp')
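
The left join above attaches one row of macro data to every sale with the same timestamp. A minimal sketch of the same join on toy frames follows; the ids, dates, and `eurrub` values are made up, and `validate='m:1'` is an optional extra check (not used in the cell above) that raises if the macro side has duplicate timestamps.

```python
import pandas as pd

# Toy stand-ins for train.csv and macro.csv (values are hypothetical)
sales = pd.DataFrame({
    'id': [1, 2, 3],
    'timestamp': pd.to_datetime(['2014-01-01', '2014-01-01', '2014-01-02']),
})
macro = pd.DataFrame({
    'timestamp': pd.to_datetime(['2014-01-01', '2014-01-02']),
    'eurrub': [45.0, 45.5],
})

# Many sales rows map to one macro row per day; the row count must not change
merged = pd.merge(sales, macro, how='left', on='timestamp', validate='m:1')
print(merged)
```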

### Feature Engineering

#### First encode categorical features

In [6]:
from sklearn.preprocessing import LabelEncoder

def encode_object_features(train, test):
    '''(DataFrame, DataFrame) -> DataFrame, DataFrame
    
    Label-encode each non-numerical (object) column, using one
    encoder per column fit on the combined train and test values.
    '''
    train = pd.DataFrame(train)
    test = pd.DataFrame(test)
    cols_to_encode = train.select_dtypes(include=['object']).columns
    for col in cols_to_encode:
        le = LabelEncoder()
        #Fit on both sets of data so labels that appear only in test are known
        le.fit(pd.concat([train[col], test[col]]))
        #Transform each
        train[col] = le.transform(train[col])
        test[col] = le.transform(test[col])
    
    return train, test
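
The reason for fitting on both sets is that `LabelEncoder.transform` raises a `ValueError` on any label it did not see during `fit`. The toy example below illustrates this; the category `'Unknown'` is hypothetical and only there to stand in for a value that appears in test but not in train.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_col = pd.Series(['Investment', 'OwnerOccupier', 'Investment'])
test_col = pd.Series(['OwnerOccupier', 'Unknown'])  # 'Unknown' never appears in train

# Fitting on train only fails as soon as test contains a new label
train_only_failed = False
try:
    LabelEncoder().fit(train_col).transform(test_col)
except ValueError:
    train_only_failed = True

# Fitting on the combined values gives every label a code
le = LabelEncoder()
le.fit(pd.concat([train_col, test_col]))
train_codes = le.transform(train_col)
test_codes = le.transform(test_col)
print(list(le.classes_), train_codes, test_codes)
```

The codes are just positions in the sorted list of labels, so they carry no real ordering; tree models such as XGBoost usually cope with that, but linear models would not.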

In [8]:
train_df, test_df = encode_object_features(train_full, test_full)
#train_df, test_df = encode_object_features(train_raw, test_raw)  #No Macro Data - tried once not using any macro data

#### Add new features

In [10]:
def add_date_features(df):
    '''(DataFrame) -> DataFrame
    
    Will add some specific columns based on the date
    of the sale.
    '''
    #Convert to datetime to make extraction easier
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    #Extract features
    df['month'] = df['timestamp'].dt.month
    df['day'] = df['timestamp'].dt.day
    df['year'] = df['timestamp'].dt.year
    
    #These features inspired by Bruno's Notebook at https://www.kaggle.com/bguberfain/naive-xgb-lb-0-317
    #Month-Year: number of sales recorded in the same month of the same year
    month_year = df['timestamp'].dt.month + df['timestamp'].dt.year * 100
    month_year_map = month_year.value_counts().to_dict()
    df['month_year'] = month_year.map(month_year_map)
    #Week-Year: number of sales recorded in the same ISO week of the same year
    week_year = df['timestamp'].dt.isocalendar().week.astype(int) + df['timestamp'].dt.year * 100
    week_year_map = week_year.value_counts().to_dict()
    df['week_year'] = week_year.map(week_year_map)
    df.drop('timestamp', axis=1, inplace=True)
    return df
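
Note that `month_year` and `week_year` end up holding the number of sales in that period, not the period itself. A small sketch with three made-up dates shows the two steps, building the period key and then mapping each key to its count:

```python
import pandas as pd

# Three hypothetical sale dates: two in January 2014, one in February 2014
ts = pd.Series(pd.to_datetime(['2014-01-05', '2014-01-20', '2014-02-03']))

# Step 1: a sortable integer key per period, e.g. 201401 for January 2014
month_year_key = ts.dt.month + ts.dt.year * 100

# Step 2: replace each key by how many rows share it
month_year_count = month_year_key.map(month_year_key.value_counts())
print(month_year_key.tolist(), month_year_count.tolist())
```

Because the counts are computed separately on whichever frame is passed in, the same month would get different values in train and test if their sizes differ.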

In [11]:
def add_state_features(df):
    '''(DataFrame) -> DataFrame
    
    Adds features built from the median full_sq of each state group.
    Meant to be used for both train and test df's.
    '''
    #Get median of full sq by state
    df['state_median_full_sq'] = df['full_sq'].groupby(df['state']).transform('median')
    #Build features from full sq median by state
    df['full_sq_state_median_diff'] = df['full_sq'] - df['state_median_full_sq']
    df['life_sq_state_median_full_diff'] = df['life_sq'] - df['state_median_full_sq']
    #Drop helper columns
    df.drop('state_median_full_sq', axis=1, inplace=True)
    
    return df
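
The key step here is `groupby(...).transform('median')`, which returns one value per row (the median of that row's group) so it can be subtracted directly from the original column. A toy illustration with made-up areas:

```python
import pandas as pd

# Hypothetical apartments: three in state 1, two in state 2
toy = pd.DataFrame({
    'state': [1, 1, 1, 2, 2],
    'full_sq': [30.0, 40.0, 80.0, 50.0, 70.0],
})

# transform keeps the original row order and length, unlike agg
state_median = toy.groupby('state')['full_sq'].transform('median')
toy['full_sq_state_median_diff'] = toy['full_sq'] - state_median
print(toy)
```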

In [12]:
def add_features(df):
    '''(DataFrame) -> DataFrame
    
    Adds features, meant to be used for both train and test df's.
    '''
    #Floor
    df['floor_ratio'] = df['floor'] / df['max_floor'].astype(float)
    df['floor_from_top'] = df['max_floor'] - df['floor']
    #Sq areas
    df['kitch_sq_ratio'] = df['kitch_sq'] / df['full_sq'].astype(float)
    df['life_sq_ratio'] = df['life_sq'] / df['full_sq'].astype(float)
    df['full_sq_per_room'] = df['full_sq'] / df['num_room'].astype(float)
    df['life_sq_per_room'] = df['life_sq'] / df['num_room'].astype(float)
    df['full_living_sq_diff'] = df['full_sq'] - df['life_sq']
    #df['full_sq_per_floor'] = df['full_sq'] / df['max_floor'].astype(float) #No value added
    df = add_date_features(df)
    df = add_state_features(df)
    df['build_year_vs_year_diff'] = df['build_year'] - df['year']  #no change
    
    #Drop Id -> Made it worse
    #df.drop('id', axis=1, inplace=True)
    
    #School Variables -> Made it worse
    #df['preschool_quota_ratio'] = df["children_preschool"] / df["preschool_quota"].astype("float")
    #df['school_quota_ratio'] = df["children_school"] / df["school_quota"].astype("float")
    return df
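
The ratio features divide by `max_floor`, `full_sq`, and `num_room`. If any of these is 0 in the data (an assumption, it is not checked in this notebook), the plain division produces `inf`. The sketch below shows one possible guard; `safe_ratio` is a hypothetical helper, not part of the function above, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Divide two Series, returning NaN wherever the denominator is 0."""
    return numerator / denominator.replace(0, np.nan)

# Second row has max_floor == 0, as a data-entry error might
toy = pd.DataFrame({'floor': [3.0, 5.0], 'max_floor': [9.0, 0.0]})

naive = toy['floor'] / toy['max_floor']              # second value is inf
guarded = safe_ratio(toy['floor'], toy['max_floor'])  # second value is NaN
print(naive.tolist(), guarded.tolist())
```

Turning the bad rows into NaN lets them be handled the same way as the rest of the missing data.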

In [13]:
train_df = add_features(train_df)
test_df = add_features(test_df)

In [14]:
train_df.shape

(30471, 405)

### Cross-Validate

Here I use cross-validation to test my new features. Afterwards, I also train a model to look at the feature_importances_ determined by the XGB algorithm. These importances suggest which features to focus on for further feature engineering.

In [15]:
from xgboost import XGBRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

#Get Data
#Y_train = train_df['price_doc'].values
Y_train = np.log1p(train_df['price_doc'].values)
X_train = train_df.loc[:, train_df.columns != 'price_doc'].values
X_test = test_df.values

#Initialize Model
xgb = XGBRegressor()
#Create cross-validation
cv = TimeSeriesSplit(n_splits=5)
#Train & Test Model
cross_val_results = cross_val_score(xgb, X_train, Y_train, cv=cv, scoring='neg_mean_squared_error')
print(cross_val_results.mean())

-0.20403424895
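
`TimeSeriesSplit` never trains on rows that come after the rows it tests on, which only matches real time order if the rows are sorted chronologically (assumed here for train.csv, not verified in this notebook). A small sketch on 12 dummy rows makes the fold layout visible:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 dummy rows standing in for sales sorted by date
X = np.arange(12).reshape(-1, 1)

folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X)
]
for train_idx, test_idx in folds:
    # The training window grows each fold; the test rows always follow it
    print(train_idx, '->', test_idx)
```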


In [16]:
model = xgb.fit(X_train, Y_train)
feature_names = train_df.loc[:, train_df.columns != 'price_doc'].columns
importances = list(zip(model.feature_importances_, feature_names))
importances = pd.DataFrame(importances, columns=['importance', 'feature'])
importances.sort_values('importance', ascending=False).head(30)

| | importance | feature |
|---|---|---|
| 1 | 0.144118 | full_sq |
| 0 | 0.023529 | id |
| 9 | 0.023529 | state |
| 3 | 0.023529 | floor |
| 116 | 0.019118 | railroad_km |
| 91 | 0.019118 | green_zone_km |
| 107 | 0.017647 | ttk_km |
| 10 | 0.016176 | product_type |
| 403 | 0.016176 | build_year_vs_year_diff |
| 123 | 0.014706 | nuclear_reactor_km |


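#### CV Results

Below are notes to myself on the CV scores from trying different feature combinations. All scores are negative mean squared error: the values on the order of -8 x10^12 were computed on the raw price_doc, and the values near -0.2 on log1p(price_doc).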

Top 20 +18 ratios & Default = -8.652 x10^12  
Top 20 only & Default = -8.652 x10^12  
27 Features (all ratios) = -8.59286663107e+12    
22 Features (removed some ratios) = -8.59753424352e+12  
28 (full sq per floor) = -8.64215439693e+12  
All Base = -7.98651473264e+12  
Base + date, ratios = -7.85642839197e+12  
Base + best + ratio full to state median = -7.87055210987e+12  
Base + best + difference full to state median = -7.85431265712e+12    
Base + best + differences to state median = -7.83445282066e+12  
Base + all so far + NO MACRO = -7.83445282066e+12  
@Base + All (broken date here and up) = -7.809975808e+12 == -0.20160089611 log  
Base + All -Id = -7.94649835405e+12  
Base + All + fixed date = -0.20403424895  == -7.84  
*Base + All + fixed date -Id = -0.201769935446  
Base + All fixed-id w/Few Macro = -0.203388567584  
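
The log-scale CV scores above are negative mean squared errors, while the competition is scored on RMSLE. Taking the square root of the negated score puts the two in the same units. This is only a rough conversion: the square root of the mean fold MSE is not exactly the mean of the fold RMSEs.

```python
import math

neg_mse_log = -0.20403424895  # "Base + All + fixed date" CV score from the notes above

# neg_mean_squared_error is -MSE, so negate before taking the root
rmsle = math.sqrt(-neg_mse_log)
print(round(rmsle, 4))
```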

### Train Model & Submit

Here I train my final model on the features chosen above, then create a submission csv.

In [34]:
from xgboost import XGBRegressor

#Get Data
Y_train = train_df['price_doc'].values
X_train = train_df.loc[:, train_df.columns != 'price_doc'].values
X_test = test_df.values
#Init Model
xgb = XGBRegressor(learning_rate=0.05, max_depth=6, subsample=0.8, colsample_bytree=0.7)
#Train Model
model = xgb.fit(X_train, Y_train)
#Make Predictions
predictions = xgb.predict(X_test)

In [35]:
#Make Submission File
submission_df = pd.DataFrame({'id':test_full['id'], 'price_doc':predictions})
submission_df.to_csv('xgb-added_features.csv', index=False)
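
The cross-validation cell scores on `log1p(price_doc)`, while the final model above is trained on the raw `price_doc`. If the final model were trained on the log target instead, its predictions would have to be mapped back with `np.expm1` before being written to the submission file. The sketch below shows only that round trip; the prices are made up and `log_predictions` is a stand-in for real model output.

```python
import numpy as np

# Hypothetical sale prices in rubles
prices = np.array([1000000.0, 5500000.0, 12000000.0])

# Target as used in the cross-validation cell
log_target = np.log1p(prices)

# Stand-in for model output on the log scale (not produced by a real model)
log_predictions = log_target + 0.01

# Back to rubles before building the submission frame
predictions = np.expm1(log_predictions)
print(predictions)
```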

#### Results

Base + Date & Default XGBRegressor = 0.33413  
Base Top 20 & Default XGBRegressor = 0.34833  
Base Top 20 + 18 ratios & Default  = 0.34833   
Base + All Features & Default     = 0.33386  
Base + All Features & Tuned 317    = 0.32671  
Base + All + No Macro & Tuned 317  = 0.32832  
Base + All + fixed date -Id (-0.20177 cv) = 0.3260  
*Base + All + fixed date +Id (-0.20160 cv) = 0.32552

### Next Steps

Currently my best submission has me at 53% on the leaderboard.  

Some next steps I want to take are:

- Engineer more features: play with groupby (sub_area, more state features, etc.)
- Remove some features: try PCA? Learn more about how to tell when a feature isn't adding new info and should be removed
- NaNs: there is a lot of missing and incorrect data. Should I remove some or all of it? Correct some (e.g. incorrect years)?
- The ID field seems to improve results. Why is this? Should I remove it anyway?
- Optimize XGB using GridSearch and build an ensemble learner
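
As a starting point for the grid search idea, here is a minimal sketch of `GridSearchCV` combined with the same `TimeSeriesSplit` scheme used earlier. Everything in it is an assumption for illustration: the data is synthetic, the parameter grid is arbitrary, and `GradientBoostingRegressor` stands in for `XGBRegressor` so the sketch runs without xgboost installed (`XGBRegressor` follows the same scikit-learn estimator interface).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic regression data: the target depends mostly on the first feature
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=120)

# Arbitrary small grid; a real search would cover more values
param_grid = {'max_depth': [2, 4], 'learning_rate': [0.05, 0.1]}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=20, random_state=0),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),
    scoring='neg_mean_squared_error',
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```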