EDA, Exploratory Data Analysis
===
The latitude/longitude values have been transformed by an unknown conversion. Assuming this conversion preserves the original spatial information, how can we extract something useful for prediction? House prices are well known to depend on the region in which the house is located, which is why the lat/lon information deserves serious attention.
1. NaN conversion
2. Target: $\mathbf{y\Rightarrow\log(1+y)}$ (`np.log1p`) to normalize the target for fitting; predictions are later mapped back by $\mathbf{y_p \Rightarrow \exp(y_p)-1}$ (`np.expm1`)
- Latitude/longitude conversion: a) k-means, b) DBSCAN, followed by one-hot conversion
  `Lasso, 0.7012 ➡︎ 0.6893`; the last one cannot assign a fixed value to fix the data.
- Different models: xgb, lgb, ...; here we try `lightgbm`.
- Stacked models, blended models, ...; install `mlxtend` with pip.
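
The latitude/longitude step in the list above can be sketched as follows. This is only an illustration on synthetic coordinates, not the notebook's actual pipeline: it assumes the k-means variant (via `scipy.cluster.vq.kmeans2`), and the helper name `geo_onehot`, the cluster count `k`, and the `geo_` column prefix are all hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans2

def geo_onehot(lon, lat, k=3, seed=0):
    """Cluster (lon, lat) points with k-means and one-hot encode the cluster label."""
    coords = np.column_stack([lon, lat]).astype(float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points so the result is reproducible.
    init = coords[rng.choice(len(coords), size=k, replace=False)]
    _, labels = kmeans2(coords, init, minit='matrix')
    return pd.get_dummies(pd.Series(labels, name='geo'), prefix='geo')

# Synthetic stand-in for the converted coordinates: three well-separated blobs.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

geo_features = geo_onehot(points[:, 0], points[:, 1], k=3)
print(geo_features.sum())  # number of points assigned to each cluster
```

Because clustering only uses relative distances between points, it can still recover regional structure as long as the unknown conversion keeps nearby houses close together. The resulting indicator columns can be concatenated to the feature matrix of a linear model such as Lasso.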


In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tqdm import tqdm,tqdm_notebook
import folium

import seaborn as sns
%matplotlib inline

In [None]:
train_df = pd.read_csv('../dataset/train.csv')
test_df = pd.read_csv('../dataset/test.csv')

In [None]:
print ("Train: ",train_df.shape[0],"sales, and ",train_df.shape[1],"features")
print ("Test:  ",test_df.shape[0],"sales, and ",test_df.shape[1],"features")

Data
---
1. Quantitative
   - time-state: 'txn_dt', 'building_complete_dt'
   - non-time,
           'parking_price','building_area','village_income_median','town_population','town_area',
           'town_population_density','I_MIN','II_MIN','III_MIN','IV_MIN','V_MIN','VI_MIN',
           'VII_MIN','VIII_MIN','IX_MIN','X_MIN','XI_MIN','XII_MIN','XIII_MIN','XIV_MIN'
   - location: 'lon','lat'
2. Qualitative
   - building_material(9),city(11),total_floor(29),building_type(5),building_use(10),parking_way,
     parking_area, txn_floor,'doc_rate', 'master_rate', 'bachelor_rate', 'jobschool_rate',
     'highschool_rate', 'junior_rate', 'elementary_rate', 'born_rate',
     'death_rate', 'marriage_rate', 'divorce_rate'
   - village(2899)

In [None]:
quantitative = ['txn_dt', 'building_complete_dt','parking_price','building_area','village_income_median','town_population','town_area',
           'town_population_density','I_MIN','II_MIN','III_MIN','IV_MIN','V_MIN','VI_MIN',
           'VII_MIN','VIII_MIN','IX_MIN','X_MIN','XI_MIN','XII_MIN','XIII_MIN','XIV_MIN',
           'lon','lat']
qualitative=['building_material','city','total_floor','building_type','building_use',
             'parking_way','parking_area','txn_floor','doc_rate', 'master_rate', 
             'bachelor_rate', 'jobschool_rate','highschool_rate', 'junior_rate', 
             'elementary_rate', 'born_rate','death_rate', 'marriage_rate', 'divorce_rate',
             'village']

In [None]:
train_df.head(2)

Normality test
---
Do the quantitative features follow a normal distribution? The Shapiro-Wilk test, `scipy.stats.shapiro`, helps to flag the features that deviate from normality.
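
As a small illustration of how the test behaves, the sketch below applies `stats.shapiro` to a synthetic right-skewed sample and to its logarithm. The data are generated here purely for demonstration and are not taken from the competition files.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Right-skewed synthetic sample (log-normal), standing in for a price-like feature.
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)

w_raw, p_raw = stats.shapiro(skewed)          # test on the raw values
w_log, p_log = stats.shapiro(np.log(skewed))  # test after a log transform

print("raw: W=%.3f, p=%.3g" % (w_raw, p_raw))
print("log: W=%.3f, p=%.3g" % (w_log, p_log))
```

A small p-value rejects normality; the W statistic moves closer to 1 as the sample looks more normal. The SciPy documentation notes that the p-value may not be accurate for more than 5000 observations, which is a good reason to test on a sample of at most 5000 rows.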

In [None]:
from scipy import stats
from scipy.stats import norm, skew

In [None]:
stats.shapiro?

In [None]:
train_df['total_price'].sample(n=5000, random_state=100).values

In [None]:
# p-value <0.01
#test_normality = lambda x: stats.shapiro(x.fillna(0))[1] < 0.01
#normal = pd.DataFrame(train_df['total_price'])
stats.shapiro(np.log(train_df['total_price'].sample(n=5000, random_state=100).values))
#print(not normal.any())

In [None]:
import scipy.stats as stats

# A feature is flagged True when Shapiro-Wilk rejects normality (p-value < 0.01)
test_normality = lambda x: stats.shapiro(x.fillna(0))[1] < 0.01
normal = train_df[quantitative].sample(n=5000, random_state=100)
normal = normal.apply(test_normality)
# Prints True only if no quantitative feature rejects normality
print(not normal.any())

In [None]:
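def dist_check(y, kind='log'):
    """Plot the distribution of y (optionally log-transformed) next to a Q-Q plot."""
    y_c = np.log(y) if kind == 'log' else y
    (mu, sigma) = norm.fit(y_c)  # maximum-likelihood fit of a normal distribution

    plt.figure(figsize=(12, 6))
    plt.subplot(121)
    # sns.distplot is deprecated: draw the density histogram and the fitted normal explicitly
    sns.histplot(y_c, stat='density', kde=True)
    xs = np.linspace(y_c.min(), y_c.max(), 200)
    plt.plot(xs, norm.pdf(xs, mu, sigma), 'k',
             label=r'Normal dist. ($\mu:$ {:.2f}, $\sigma:$ {:.2f})'.format(mu, sigma))
    plt.legend(fontsize=14, loc='best')
    plt.title('Convert by %s' % kind)
    plt.ylabel('Density')

    plt.subplot(122)
    # Q-Q plot of the (transformed) values against the normal distribution
    stats.probplot(y_c, plot=plt)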

In [None]:
y=(train_df['total_price'])

dist_check(y,kind='log')

In [None]:
# skewness and kurtosis of the raw target are far too large
print("Skewness: %f" % train_df['total_price'].skew()) 
print("Kurtosis: %f" % train_df['total_price'].kurt())

In [None]:
# the same statistics on the log scale
print("Skewness: %f" % np.log(train_df['total_price']).skew()) 
print("Kurtosis: %f" % np.log(train_df['total_price']).kurt())
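
The effect of the log transform on skewness and kurtosis, and the round trip through `np.log1p` / `np.expm1` mentioned at the top of the notebook, can be checked on synthetic data. The sketch below assumes a log-normal sample as a stand-in for `total_price`; the numbers it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic right-skewed "prices" (log-normal); not the competition data.
price = pd.Series(rng.lognormal(mean=15.0, sigma=1.0, size=5000))

log_price = np.log1p(price)     # forward transform used for fitting
restored = np.expm1(log_price)  # inverse transform applied to predictions

print("raw   skew: %.3f  kurt: %.3f" % (price.skew(), price.kurt()))
print("log1p skew: %.3f  kurt: %.3f" % (log_price.skew(), log_price.kurt()))
print("round trip ok:", np.allclose(restored, price))
```

For large values `log1p(y)` and `log(y)` are nearly identical; `log1p` is the safer choice when the target can be zero.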

**Linear dependence** between the quantitative features (and others) and the log(target) variable

In [None]:
plt.figure(figsize=(20, 5))
log_target=np.log(train_df['total_price'])

for num, var in enumerate(quantitative[-2:]):
    plt.subplot(1, len(quantitative[-2:]), num + 1)
    sns.regplot(x=train_df[var], y = log_target);

In [None]:
plt.figure(figsize=(20, 5))
log_target=np.log(train_df['total_price'])
n=8
for num, var in enumerate(quantitative[3*(n-1):3*n]):
    plt.subplot(1, len(quantitative[:3]), num + 1)
    sns.regplot(x=train_df[var], y = log_target);

In [None]:
plt.figure(figsize=(20, 5))
log_target=np.log(train_df['total_price'])
n=7

for num, var in enumerate(qualitative[3*(n-1):3*n]):
    plt.subplot(1, len(qualitative[:3]), num + 1)
    sns.regplot(x=train_df[var], y = log_target);

In [None]:
# only one class in this feature, remove it
train_df['I_index_5000'].value_counts()

In the latter half of the feature list, features should be considered in groups; for instance, the features carrying neighborhood information:
      
        'N_50','N_500','N_1000','N_10000' 
Group them together under the name `N_arr`.
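
Since the group names follow a regular pattern, the lists can also be generated rather than typed by hand. The sketch below is an assumption-laden shortcut: it supposes that every category from I to XIV uses the same radii as the I, II and XIV groups written out in this notebook, and it collects the lists in a hypothetical dictionary `groups` instead of separate variables. Columns that do not exist in the data would still need to be filtered out.

```python
# Roman-numeral category prefixes I..XIV, matching the *_MIN feature names.
numerals = ['I', 'II', 'III', 'IV', 'V', 'VI', 'VII',
            'VIII', 'IX', 'X', 'XI', 'XII', 'XIII', 'XIV']
# Assumed radii, taken from the I/II/XIV groups; other categories may differ.
radii = [10, 50, 100, 250, 500, 1000, 5000, 10000]
index_radii = [50, 500, 1000]

groups = {'N_arr': ['N_%d' % r for r in [50, 500, 1000, 10000]]}
for p in numerals:
    groups[p + '_arr'] = ['%s_%d' % (p, r) for r in radii]
    groups[p + '_ind_arr'] = ['%s_index_%d' % (p, r) for r in index_radii]

print(groups['XIV_ind_arr'])
```

A lookup such as `groups['II_arr']` then plays the role of the hand-written `II_arr` list.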


In [None]:
removale=['I_index_5000','I_index_10000', ...
         ]
N_arr=['N_50','N_500','N_1000','N_10000']
I_arr=['I_10','I_50','I_100','I_250','I_500','I_1000','I_5000','I_10000']
I_ind_arr=['I_index_50','I_index_500','I_index_1000']
II_arr=['II_10','II_50','II_100','II_250','II_500','II_1000','II_5000','II_10000']
II_ind_arr=['II_index_50','II_index_500','II_index_1000']
III_arr=[...]
III_ind_arr=[...]

...

XIV_arr=['XIV_10','XIV_50','XIV_100','XIV_250','XIV_500','XIV_1000','XIV_5000','XIV_10000']
XIV_ind_arr=['XIV_index_50','XIV_index_500','XIV_index_1000']
target=['total_price']


In [None]:
plt.figure(figsize=(20, 5))
vars=I_ind_arr
for num, var in enumerate(vars):
    plt.subplot(1, len(vars), num + 1)
    sns.regplot(x=train_df[var], y = log_target);

In [None]:
def spearman(frame, features,target='total_price'):
    spr = pd.DataFrame()
    spr['feature'] = features
    spr['spearman'] = [frame[f].corr(frame[target], 'spearman') for f in features]
    spr = spr.sort_values('spearman')
    plt.figure(figsize=(6, 0.25*len(features)))
    sns.barplot(data=spr, y='feature', x='spearman', orient='h')
    
features = quantitative 

In [None]:
spearman(train_df, features)
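
One reason to rank features by Spearman rather than Pearson correlation is that Spearman only measures monotonic association, so it is not penalized when the relation to the price is non-linear. A minimal sketch on made-up data (the variables `x` and `y` are illustrative only):

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(0.0, 10.0, 50))
y = np.exp(x)  # monotonic in x, but strongly non-linear

pearson = x.corr(y)                          # linear association
spearman_rho = x.corr(y, method='spearman')  # rank (monotonic) association
print("Pearson: %.3f, Spearman: %.3f" % (pearson, spearman_rho))
```

The rank correlation is also unchanged by a log transform of the target, so it gives the same ordering for `total_price` and its logarithm.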

In [None]:
target=['total_price']
# NOTE: `qual_encoded` (the list of encoded qualitative columns) is not defined
# in this section and must be created before this cell can run.
plt.figure(1)
corr = train_df[quantitative+target].corr()
sns.heatmap(corr)
plt.figure(2)
corr = train_df[qual_encoded+target].corr()
sns.heatmap(corr)
plt.figure(3)
# cross-correlation between quantitative and encoded qualitative features
corr = pd.DataFrame(np.zeros([len(quantitative)+1, len(qual_encoded)+1]), index=quantitative+target, columns=qual_encoded+target)
for q1 in quantitative+target:
    for q2 in qual_encoded+target:
        corr.loc[q1, q2] = train_df[q1].corr(train_df[q2])
sns.heatmap(corr)

In [None]:
# index with a list: recent pandas versions no longer accept a set as an indexer
train_df[XII_ind_arr + target].corr()