Housing costs demand a significant investment from both consumers and developers. And when it comes to planning a 
budget, whether personal or corporate, the last thing anyone needs is uncertainty about one of their biggest expenses. 
**Sberbank**, Russia’s oldest and largest bank, helps its customers by making predictions about realty prices so renters, 
developers, and lenders are more confident when they sign a lease or purchase a building.

Although the housing market is relatively stable in Russia, the country’s volatile economy makes forecasting prices 
as a function of apartment characteristics a unique challenge. Complex interactions between housing features such 
as number of bedrooms and location are enough to make pricing predictions complicated. Adding an unstable economy 
to the mix means Sberbank and its customers need more than simple regression models in their arsenal.

In this competition, Sberbank is challenging Kagglers to develop algorithms that use a broad spectrum of features 
to predict realty prices. Competitors will rely on a rich dataset that includes housing data and macroeconomic 
patterns. An accurate forecasting model will allow Sberbank to provide more certainty to its customers in 
an uncertain economy.

In [7]:
import pandas as pd
import os 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import defaultdict
%matplotlib inline 
os.getcwd()

'/Users/chadleonard/Springboard/work/springboard'

In [3]:
dir_path = os.getcwd() + '/data/all/'
print(dir_path)
df = pd.read_csv(dir_path + 'train.csv')
df_macro = pd.read_csv(dir_path + 'macro.csv')

/Users/chadleonard/Springboard/work/springboard/data/all/


In [4]:
print(df.columns[:10])

Index([u'id', u'timestamp', u'full_sq', u'life_sq', u'floor', u'max_floor',
       u'material', u'build_year', u'num_room', u'kitch_sq'],
      dtype='object')


In [6]:
df.shape

(30471, 292)

In [13]:
df_macro.columns
df_macro.shape

(2485, 100)
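
The competition text describes combining housing data with macroeconomic patterns. A minimal sketch of one way to attach macro indicators to each transaction follows. It assumes, which the cells above do not show, that `macro.csv` also carries a `timestamp` column with one row per date. The helper name `attach_macro`, the toy frames, and the `macro_indicator` column are hypothetical stand-ins, not values from the dataset.

```python
import pandas as pd

def attach_macro(train, macro, on='timestamp'):
    """Left-join per-date macro indicators onto transactions (hypothetical helper)."""
    # validate='m:1' raises a MergeError if the macro table repeats a date
    return train.merge(macro, on=on, how='left', validate='m:1')

# Toy stand-ins for df and df_macro; all values are made up for illustration
toy_train = pd.DataFrame({
    'id': [1, 2, 3],
    'timestamp': ['2011-08-20', '2011-08-20', '2011-08-23'],
    'price_doc': [5850000, 6000000, 5700000],
})
toy_macro = pd.DataFrame({
    'timestamp': ['2011-08-20', '2011-08-21', '2011-08-23'],
    'macro_indicator': [1.0, 1.1, 1.2],
})

merged = attach_macro(toy_train, toy_macro)
print(merged.shape)
```

A left join keeps every transaction even when no macro row exists for its date; such rows would get `NaN` in the macro columns.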

In [58]:
# Count prices by number of digits; 7-digit prices above 5e6 get their own 7.5 bucket
d = defaultdict(int)

for k in df['price_doc']:
    cat = len(str(k))
    if cat == 7 and k > 5e6:
        d[7.5] += 1
    else:
        d[cat] += 1

# Apply the same bucketing per row, then encode the sorted bucket labels as integer codes
df['price_cat'] = pd.Series([7.5 if 5e6 < price < 1e7 else len(str(price))
                             for price in df['price_doc']]).astype('category').cat.codes
d

defaultdict(int, {6: 233, 7: 8619, 7.5: 16809, 8: 4809, 9: 1})

In [59]:
set(df['price_cat'])

{0, 1, 2, 3, 4}
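
The digit-count bucketing above can also be written with explicit bin edges, which makes the boundaries visible and avoids the string conversion. The sketch below assumes integer prices. The function name `price_bucket_codes` and the edge list are illustrative choices, picked so the codes line up with the five buckets above.

```python
import pandas as pd

# Left-closed bin edges for integer prices:
# [0, 1e6) -> 0, [1e6, 5e6] -> 1, (5e6, 1e7) -> 2, [1e7, 1e8) -> 3, [1e8, 1e9) -> 4
EDGES = [0, 1000000, 5000001, 10000000, 100000000, 1000000000]

def price_bucket_codes(prices):
    """Return one integer bucket code per price using explicit edges (hypothetical helper)."""
    return pd.cut(prices, bins=EDGES, right=False, labels=False)

# Toy prices chosen to sit on or near each bucket boundary
toy_prices = pd.Series([100000, 1000000, 5000000, 5000050,
                        9991069, 10000000, 111111112])
codes = price_bucket_codes(toy_prices)
print(codes.tolist())

# Per-bucket counts, the analogue of the defaultdict tally
counts = codes.value_counts().sort_index()
```

One difference to keep in mind: `cat.codes` numbers only the categories that actually occur, while fixed edges give the same code to the same price range no matter which buckets are present. Bucket 0 here also absorbs any price below 1e6, whereas the digit count would split those further.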

In [64]:
x_vars = ['full_sq', 'life_sq', 'floor', 'max_floor', 'material',
          'build_year', 'num_room', 'kitch_sq', 'state', 'product_type', 'sub_area']
cols = ['price_doc', 'id', 'timestamp', 'full_sq', 'life_sq', 'floor', 'max_floor', 'material',
        'build_year', 'num_room', 'kitch_sq', 'state', 'product_type', 'sub_area']
print(x_vars)

['full_sq', 'life_sq', 'floor', 'max_floor', 'material', 'build_year', 'num_room', 'kitch_sq', 'state', 'product_type', 'sub_area']


In [61]:
# Price range covered by category codes 1 and 2
print(df[df['price_cat'] == 1]['price_doc'].min())
print(df[df['price_cat'] == 1]['price_doc'].max())
print(df[df['price_cat'] == 2]['price_doc'].min())
print(df[df['price_cat'] == 2]['price_doc'].max())

1000000
5000000
5000050
9991069


In [41]:
df_price_cat_one = df[df['price_cat'] == 1]['price_doc']

In [47]:
dd = defaultdict(int)
for k in df_price_cat_one:
    if k <= 5e6:
        dd['1_to_5'] += 1
    else:
        dd['5_to_9'] += 1
        
dd

defaultdict(int, {'1_to_5': 8619, '5_to_9': 16809})

In [37]:
df['price_cat'].head()

0    1
1    1
2    1
3    2
4    2
Name: price_cat, dtype: int8

In [6]:
# Overall price range and the date range of the training data
print(df['price_doc'].min())
print(df['price_doc'].max())
print(df['timestamp'].min())
print(df['timestamp'].max())

100000
111111112
2011-08-20
2015-06-30


In [5]:
#sns.pairplot(df[:5])

In [None]:
plt.plot(df['market_count_5000'],df['price_doc'], 'ro' )
plt.ylabel('Price')
plt.xlabel('Market Count 5000')
plt.show()

In [None]:
plt.hist(df['price_doc'],bins=60, range=[0.0,1e7])
plt.show()
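
Prices in this dataset span roughly three orders of magnitude (100000 to 111111112 in the output above), which is why the histogram is clipped to 1e7. Another option, sketched here with NumPy only, is to bin `log10` of the price so the full range fits in one view. The toy values and the helper name `log_price_histogram` are illustrative assumptions.

```python
import numpy as np

def log_price_histogram(prices, bins=60):
    """Histogram of log10(price); returns (counts, bin_edges). Hypothetical helper."""
    prices = np.asarray(prices, dtype=float)
    if (prices <= 0).any():
        raise ValueError("log10 needs strictly positive prices")
    return np.histogram(np.log10(prices), bins=bins)

# Toy prices for illustration only
toy_prices = [100000, 1000000, 5000000, 5000050, 9991069, 111111112]
counts, edges = log_price_histogram(toy_prices, bins=4)
print(counts.sum())
```

In the notebook itself the same idea could be plotted with `plt.hist(np.log10(df['price_doc']), bins=60)`.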

In [None]:
#df.columns[:100]