### Resources

https://www.kaggle.com/sudalairajkumar/simple-exploration-baseline-santander-value - Basic EDA  
https://www.kaggle.com/nanomathias/distribution-of-test-vs-training-data - Exploring differences between Test & Train (tSNE)  
https://www.kaggle.com/the1owl/love-is-the-answer/notebook - Dimensionality reduction & blending  
https://www.kaggle.com/ogrellier/santander-46-features/code - Feature transformation  

https://lightgbm.readthedocs.io/en/latest/Python-API.html - LightGBM docs  
https://xgboost.readthedocs.io/en/latest/python/python_api.html - XGBoost docs  

### Log

*June*  
> RMSLE: -  
Commentary: read through exploratory kernels and Kaggle discussions, performed basic EDA, and familiarised myself with the problems in the data.

*02/07/18*  
> RMSLE: ```LightGBM = 1.438633```    
LB RMSLE: **1.47**  
Wall time: 3min 17s  
Commentary: reduced features by removing duplicated columns and columns with stdev=0, transformed skewed columns, implemented a 5-fold KFold model with LightGBM using the hyperparameters from sudalairajkumar's kernel.
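
The column reduction mentioned in this entry is not shown in this notebook, so the following is only a minimal sketch of one way it could be done with pandas. The helper name `drop_uninformative_columns` and the toy frame are hypothetical.

```python
import pandas as pd


def drop_uninformative_columns(df):
    """Drop constant columns, then columns that duplicate an earlier column."""
    # A column with a single distinct value has a standard deviation of 0.
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    df = df.drop(columns=constant)
    # Transposing turns columns into rows, so duplicated() can compare them.
    # duplicated() keeps the first occurrence and flags the later copies.
    duplicated = df.columns[df.T.duplicated().values].tolist()
    return df.drop(columns=duplicated), constant, duplicated


# Hypothetical toy data: "b" is constant, "c" is a copy of "a".
toy = pd.DataFrame({"a": [1, 2, 3], "b": [0, 0, 0], "c": [1, 2, 3], "d": [3, 1, 2]})
reduced, constant, duplicated = drop_uninformative_columns(toy)
print(list(reduced.columns), constant, duplicated)
```

Transposing a frame with several thousand columns is memory-hungry, so on the full dataset a hash-based comparison of columns may be preferable.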

*03/07/18*  
> RMSLE: ```LightGBM = mean: 1.43222, std: 0.02606  ``` & ``` XGBoost = mean: 1.42559, std: 0.02309  ```  
LB RMSLE: **1.46**  
Wall time: 6min 10s  
Commentary: based on the1owl's kernel, reduced features by using the top features from a basic TreeRegressor model, streamlined code, blended LightGBM and XGBoost models

*04/07/18*  
> RMSLE: ```XGBoost = mean: 1.37952, std: 0.01519  ```  
LB RMSLE: **1.44**  
Wall time: 1min 16s  
Commentary: built on the previous notebook, appended the top 20 dimensionality-reduction components, tuned hyperparameters, dropped LightGBM.
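
The entry does not say which decomposition produced the appended components. As an illustration only, the sketch below assumes `TruncatedSVD` (one of the classes imported further down) and uses random stand-in data with 5 components instead of 20. The `svd_{i}` column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Random stand-in data; the real frames have thousands of columns.
rng = np.random.RandomState(0)
train_demo = pd.DataFrame(rng.rand(50, 30))
test_demo = pd.DataFrame(rng.rand(20, 30))

n_comp = 5  # the log entry used 20 components
svd = TruncatedSVD(n_components=n_comp, random_state=0)

# Fit on train only, then apply the same projection to test.
train_svd = svd.fit_transform(train_demo)
test_svd = svd.transform(test_demo)

# Append the components as extra feature columns.
for i in range(n_comp):
    train_demo["svd_{}".format(i)] = train_svd[:, i]
    test_demo["svd_{}".format(i)] = test_svd[:, i]

print(train_demo.shape, test_demo.shape)
```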

*15/07/18*  
> RMSLE: ```LightGBM = mean: 1.34165, std: 0.02615```  
LB RMSLE: **1.40**   
Wall time: 3min 13s  
Commentary: removed the dimensionality-reduction aggregate features, added new statistical aggregate features, switched from the 400 to the 1200 most important random forest features.

In [108]:
from time import time
from datetime import timedelta
from contextlib import contextmanager
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

import lightgbm as lgb
import xgboost as xgb
from lightgbm import LGBMRegressor, Dataset
from xgboost import XGBRegressor, DMatrix
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, train_test_split
from sklearn.decomposition import PCA, TruncatedSVD, FastICA
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


def rmsle(y, predictions):
    return np.sqrt(np.mean(np.power(np.log1p(y) - np.log1p(predictions), 2)))

@contextmanager
def timer(title=""):
    start = time()
    yield
    # timedelta is rendered as a string, so it takes no numeric format spec
    print("{} done in {}".format(title, timedelta(seconds=round(time() - start))))
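
A small worked example of the metric may help. RMSLE on the raw target is the same as RMSE on `log1p(target)`, which is presumably why the later cells train on `np.log1p(train["target"])` with `metric: rmse` and invert with `np.expm1`. The toy arrays below are made up for illustration.

```python
import numpy as np


def rmsle(y, predictions):
    return np.sqrt(np.mean(np.power(np.log1p(y) - np.log1p(predictions), 2)))


y_true = np.array([100.0, 1000.0, 10000.0])

# Perfect predictions give an error of zero.
perfect = rmsle(y_true, y_true)

# Scaling (1 + y) by e shifts every log1p value by exactly 1,
# so the RMSLE is 1 regardless of the magnitude of the targets.
y_off = np.e * (1 + y_true) - 1
shifted = rmsle(y_true, y_off)

# The same number falls out of a plain RMSE computed in log1p space.
rmse_in_log_space = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_off)) ** 2))

print(perfect, shifted, rmse_in_log_space)
```

An RMSLE of about 1.4, as in the log above, therefore means predictions are off by a factor of roughly e^1.4 ≈ 4 in a root-mean-square sense.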

In [130]:
train = pd.read_csv("C:\\Users\\heret\\Downloads\\Santander\\train.csv")
test = pd.read_csv("C:\\Users\\heret\\Downloads\\Santander\\test.csv")

print("Train: {} rows, {} columns \nTest: {} rows, {} columns".format(train.shape[0], train.shape[1], test.shape[0], test.shape[1]))

Train: 4459 rows, 4993 columns 
Test: 49342 rows, 4992 columns


In [131]:
def feat_selection(X, n_feats=1200):
    col = [c for c in X.columns if c not in ['ID', 'target']]

    scl = StandardScaler()
    x1, x2, y1, y2 = train_test_split(X[col], X["target"].values, test_size=0.20, random_state=5)
    model = RandomForestRegressor(n_jobs = -1, random_state = 7)
    model.fit(scl.fit_transform(x1), y1)
    print(rmsle(y2, model.predict(scl.transform(x2))))

    # Rank features by importance and keep the names of the top n_feats
    ranking = pd.DataFrame({'importance': model.feature_importances_, 'feature': col})
    col = ranking.sort_values(by='importance', ascending=False)['feature'].values[:n_feats]
    print("Selected {} most important features".format(col.size))
    importances = model.feature_importances_
    # indices = np.argsort(importances)
    # plt.figure(1)
    # plt.title('Feature Importances')
    # plt.barh(range(len(indices)), importances[indices], color='r', align='center')
    # plt.xlabel('Relative Importance')
    # plt.xlim(0,0.012)
    return col
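
To show what importance-based selection does, here is a minimal sketch on synthetic data where only two columns drive the target. The data, coefficients, and `n_estimators=50` are assumptions for the demo, not values from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 10)
# Hypothetical target: only columns 0 and 3 carry signal.
y = 5 * X[:, 0] + 3 * X[:, 3] + 0.01 * rng.rand(200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# argsort is ascending, so reverse it and keep the first n indices.
n_keep = 2
top = np.argsort(model.feature_importances_)[::-1][:n_keep]
print(sorted(top.tolist()))
```

On the real data the signal is spread far more thinly, so the cut-off (400 or 1200 features in the log above) is a tuning choice rather than an obvious break in the ranking.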

In [132]:
col = feat_selection(train)
ids = test["ID"]
train_y = np.log1p(train["target"])
train_X, test_X = train[col], test[col]

print("train_X: {}, test_X: {}, train_y: {}".format(train_X.shape, test_X.shape, train_y.shape))
del train, test, col

1.7700695707637661
Selected 1200 most important features
train_X: (4459, 1200), test_X: (49342, 1200), train_y: (4459,)


In [133]:
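# Replace 0 with NaN, then add row-wise aggregate features for each observation.
# Note: each aggregate is computed over every column present at that point,
# including the aggregates added on the lines above it.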
for dataset in [train_X, test_X]:
    dataset.replace(0, np.nan, inplace=True)

    dataset['nans'] = dataset.isnull().sum(axis=1)
    dataset['median'] = dataset.median(axis=1)
    dataset['mean'] = dataset.mean(axis=1)
    dataset['sum'] = dataset.sum(axis=1)
    dataset['std'] = dataset.std(axis=1)
    dataset['kur'] = dataset.kurtosis(axis=1)
    dataset['max'] = dataset.max(axis=1)
    dataset['min'] = dataset.min(axis=1)
    dataset['skew'] = dataset.skew(axis=1)
    dataset['sum'] = dataset.sum(axis=1)  # overwrites the earlier 'sum', now summing the aggregates as well
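
In the cell above each statistic is computed over all columns present at that moment, so later aggregates include earlier ones. If that is not intended, one possible variant reads every statistic from the original feature columns only. This is a sketch with a hypothetical helper name, and it would change the scores recorded in this notebook.

```python
import numpy as np
import pandas as pd


def add_row_aggregates(df):
    """Return a copy of df with row-wise statistics of the original columns."""
    feats = df.replace(0, np.nan)  # zeros are treated as missing values
    out = feats.copy()
    # Every statistic reads `feats`, never `out`, so no aggregate feeds another.
    out["nans"] = feats.isnull().sum(axis=1)
    out["median"] = feats.median(axis=1)
    out["mean"] = feats.mean(axis=1)
    out["sum"] = feats.sum(axis=1)
    out["std"] = feats.std(axis=1)
    out["kur"] = feats.kurtosis(axis=1)
    out["max"] = feats.max(axis=1)
    out["min"] = feats.min(axis=1)
    out["skew"] = feats.skew(axis=1)
    return out


# Hypothetical toy frame with one zero per row.
toy = pd.DataFrame({"f1": [0.0, 2.0], "f2": [4.0, 0.0], "f3": [6.0, 8.0]})
agg = add_row_aggregates(toy)
print(agg[["nans", "mean", "sum", "max", "min"]])
```

Returning a new frame also avoids modifying a slice of `train` in place.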

In [137]:
def fold_boost(X,y,T):
    
    folds = KFold(n_splits=5, shuffle=True, random_state=23)
    folds_scores_lgbm=[]
    folds_scores_xgb=[]
    valid_preds_lgbm = np.zeros(X.shape[0])
    valid_preds_xgb = np.zeros(X.shape[0])
    test_preds_lgbm = np.zeros(T.shape[0])
    test_preds_xgb = np.zeros(T.shape[0])

    for fold_no, (train_idx, valid_idx) in enumerate(folds.split(X,y)):

        X_train = X.iloc[train_idx]
        y_train = y.iloc[train_idx]
        X_valid = X.iloc[valid_idx]
        y_valid = y.iloc[valid_idx]        
        print("X_train: {} y_train: {} X_valid: {} y_valid: {}".format(X_train.shape, y_train.shape, X_valid.shape, y_valid.shape))

        #LightGBM
        params = {"objective":'regression',"num_leaves":144, "learning_rate":0.005, "max_depth":13, "metric":'rmse',"is_training_metric":True, "max_bin" : 55, "bagging_fraction" : 0.8, "bagging_freq" : 5, "feature_fraction" : 0.9}
        model = lgb.train(params=params, train_set=Dataset(X_train, label=y_train),
                          num_boost_round=10000, valid_sets=Dataset(X_valid, label=y_valid),
                          verbose_eval=False, early_stopping_rounds=100)
        test_preds_lgbm += np.expm1(model.predict(T, num_iteration=model.best_iteration)) / folds.n_splits
        valid_preds_lgbm[valid_idx] = model.predict(X_valid, num_iteration=model.best_iteration)
        score = rmsle(np.expm1(y_valid), np.expm1(valid_preds_lgbm[valid_idx]))
        folds_scores_lgbm.append(score)
        print("\nFold %2d RMSLE : %.6f\n" % (fold_no + 1, score))

        #XGB
#         watchlist = [(DMatrix(X_train, y_train), 'train'), (DMatrix(X_valid, y_valid), 'valid')]
#         params = {'objective': 'reg:linear', 'booster': 'gbtree', "learning_rate":0.01, "max_depth":30, "min_child_weight":30, "gamma":0, "subsample": 0.75, "colsample_bytree": 0.05,"colsample_bylevel":0.7, "n_jobs": -1, "reg_lambda": 0.1}
#         model = xgb.train(params, DMatrix(X_train, y_train), 5000,  watchlist, maximize=False, verbose_eval=False, early_stopping_rounds=100)
#         test_preds_xgb += np.expm1(model.predict(DMatrix(T), ntree_limit=model.best_ntree_limit)) / folds.n_splits
#         valid_preds_xgb[valid_idx] = model.predict(DMatrix(X_valid), ntree_limit=model.best_ntree_limit)
#         score = rmsle(np.expm1(y_valid), np.expm1(valid_preds_xgb[valid_idx]))
#         folds_scores_xgb.append(score)
#         print("\nFold %2d RMSLE : %.6f\n" % (fold_no + 1, score))
    
    print("LightGBM RMSLE mean: {}, std: {}".format(np.mean(folds_scores_lgbm).round(5), np.std(folds_scores_lgbm).round(5)))
    print("XGBoost RMSLE mean: {}, std: {}".format(np.mean(folds_scores_xgb).round(5), np.std(folds_scores_xgb).round(5)))
    return (valid_preds_lgbm, test_preds_lgbm)
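
The out-of-fold pattern in `fold_boost` can be seen in isolation in the sketch below. It swaps LightGBM for scikit-learn's `LinearRegression` and uses synthetic data, purely to keep the example small and fast; only the fold bookkeeping mirrors the function above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-ins for train_X, train_y and test_X.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(100)
T = rng.rand(40, 3)

folds = KFold(n_splits=5, shuffle=True, random_state=23)
oof = np.zeros(len(X))        # one out-of-fold prediction per training row
test_pred = np.zeros(len(T))  # average of the five fold models
scores = []

for train_idx, valid_idx in folds.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Each row is predicted by the one model that never saw it.
    oof[valid_idx] = model.predict(X[valid_idx])
    test_pred += model.predict(T) / folds.n_splits
    scores.append(np.sqrt(np.mean((y[valid_idx] - oof[valid_idx]) ** 2)))

print("RMSE mean: {:.5f}, std: {:.5f}".format(np.mean(scores), np.std(scores)))
```

Note that `fold_boost` averages the test predictions after `np.expm1`, so the fold models are averaged on the original target scale rather than in log space.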

In [135]:
%%time
valid_preds_lgbm, test_preds_lgbm = fold_boost(train_X, train_y, test_X)

X_train: (3567, 1209) y_train: (3567,) X_valid: (892, 1209) y_valid: (892,)

Fold  1 RMSLE : 1.316284

X_train: (3567, 1209) y_train: (3567,) X_valid: (892, 1209) y_valid: (892,)

Fold  2 RMSLE : 1.349357

X_train: (3567, 1209) y_train: (3567,) X_valid: (892, 1209) y_valid: (892,)

Fold  3 RMSLE : 1.379036

X_train: (3567, 1209) y_train: (3567,) X_valid: (892, 1209) y_valid: (892,)

Fold  4 RMSLE : 1.355482

X_train: (3568, 1209) y_train: (3568,) X_valid: (891, 1209) y_valid: (891,)

Fold  5 RMSLE : 1.308081

LightGBM RMSLE mean: 1.34165, std: 0.02615
XGBoost RMSLE mean: nan, std: nan
Wall time: 3min 13s
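
The summary lines can be checked from the five fold scores printed above. The sketch also shows why the XGBoost line reports `nan`: its block is commented out, so the score list stays empty.

```python
import warnings

import numpy as np

# Fold scores copied from the output above.
fold_scores = [1.316284, 1.349357, 1.379036, 1.355482, 1.308081]

# np.std defaults to ddof=0, the population standard deviation.
mean_score = np.mean(fold_scores).round(5)
std_score = np.std(fold_scores).round(5)
print("LightGBM RMSLE mean: {}, std: {}".format(mean_score, std_score))

# The mean of an empty list is nan and raises a RuntimeWarning.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    empty_mean = np.mean([])
print("Empty list mean:", empty_mean)
```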




In [136]:
predictions = test_preds_lgbm
submissions = pd.DataFrame({"ID":ids, "target":predictions})
submissions.to_csv("santanderv4.csv", index=False)
submissions.head()

|   | ID | target |
|---|---|---|
| 0 | 000137c73 | 7636659.0 |
| 1 | 00021489f | 1773018.0 |
| 2 | 0004d7953 | 2909381.0 |
| 3 | 00056a333 | 5305285.0 |
| 4 | 00056d8eb | 1193637.0 |
