<img src="header.png"></img>

<div align="center">
    <h1>Report on Participation in Kaggle Competition</h1>
    <h1>Part III: Model Training</h1>
</div>

<div align="center"><i>
In this Jupyter notebook I am engineering features from 4 data sheets provided by competition originator.<br/>
<br/>
Prepared by Artem Drofa.
</i></div>

<a id='main_takeaways'></a>
## Main Takeaways
* Prepared features allowed to reach RMSE 3.685 (on validation dataset).
* Model's parameters tuning allowed to improve RMSE slightly and reach 3.681.
* Much better results were reached with additional features engineering. After new features were added to train (and valid) dataset, RMSE 3.674 level was touched.

## Model Training

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import SGDRegressor

import lightgbm as lgb

from scipy.sparse import csr_matrix, hstack
from scipy.spatial import distance

from IPython.core import display as ICD
from itertools import combinations
import pickle
import copy
import os

import gc
from tqdm import tqdm
from sys import getsizeof
import warnings
warnings.simplefilter(action='ignore')

In [2]:
PATH_TO_DATA = '.../data'

In [3]:
#function returns memory usage by object
def mem_usage(obj):
    usage_b = getsizeof(obj)
    usage_mb = usage_b / 1024 ** 2 # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

In [4]:
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

In [5]:
#transforms train_df and test_df to sparse matrices and saves them,
#columns (features') names and index column in the same folder
def sparse_matrix_prep(source_df_name):

    #opening df
    with open(os.path.join(PATH_TO_DATA, source_df_name+'.pkl'), 'rb') as pkl_f:
        df = pickle.load(pkl_f)

    #transforming to csr
    df_csr = csr_matrix(df)

    #saving csr matrix, features names and card_id (observations' indeces)
    with open(os.path.join(PATH_TO_DATA, source_df_name+'_csr.pkl'), 'wb') as pkl_f:
        pickle.dump(df_csr, pkl_f)
    with open(os.path.join(PATH_TO_DATA, source_df_name+'_columns.pkl'), 'wb') as pkl_f:
        pickle.dump(df.columns, pkl_f)
    with open(os.path.join(PATH_TO_DATA, source_df_name+'_card_id.pkl'), 'wb') as pkl_f:
        pickle.dump(df.index, pkl_f)

    #memory clearing
    del df, df_csr
    gc.collect()

In [6]:
%%time
sparse_matrix_prep('train_df')
sparse_matrix_prep('test_df')

CPU times: user 1min 7s, sys: 22.2 s, total: 1min 29s
Wall time: 4min 43s


### LGB
For evaluating of models' results I will split train dataset on 2 samples: sample for train (77%) and sample for validation (33%).

In [7]:
#Model
def kfold_lightgbm(X_train, y_train, X_test, num_folds, file=False, params='default'):
    
    #storing results
    evals_results = []
    
    # Create arrays and dataframes to store results
    oof_preds = np.zeros(X_train.shape[0])
    sub_preds = np.zeros(X_test.shape[0])

    #Cross-Validation
    folds = KFold(n_splits= num_folds, shuffle=True, random_state=17)
    
    for n_fold, (train_idx, valid_idx) in tqdm(enumerate(folds.split(X_train))):
        train_x, train_y = X_train[train_idx], y_train[train_idx]
        valid_x, valid_y = X_train[valid_idx], y_train[valid_idx]

        #transform data to lgb format
        lgb_train = lgb.Dataset(train_x, label=train_y, free_raw_data=False)
        lgb_test = lgb.Dataset(valid_x, label=valid_y,free_raw_data=False)
        
        #memory clearing
        del train_x, train_y
        gc.collect()
        
        #params
        if params == 'default':
            params = {'objective': 'regression',
                      'metric': 'rmse',
                      'verbose': 1}
        else:
            params = params
                      
        params['seed'] = int(17**n_fold),
        params['bagging_seed'] = int(17**n_fold),
        params['drop_seed'] = int(17**n_fold)
        
        #dict for storing model's result
        evals_result = {}
        
        #model
        reg = lgb.train(params,
                        lgb_train,
                        valid_sets=[lgb_train, lgb_test],
                        valid_names=['train', 'test'],
                        evals_result = evals_result,
                        verbose_eval= 10)
        
        #adding fold's result
        evals_results.append(evals_result)
        
        #predictions
        oof_preds[valid_idx] = reg.predict(valid_x, num_iteration=reg.best_iteration)
        sub_preds += reg.predict(X_test, num_iteration=reg.best_iteration) / folds.n_splits
        
        #displaying results
        print('Fold {} RMSE : {}'.format(n_fold + 1, rmse(valid_y, oof_preds[valid_idx])))
        
        #clearing memory
        del reg, valid_x, valid_y
        gc.collect()

    #creating submission file
    if file is not False:
        X_test.loc[:,'target'] = sub_preds
        X_test = X_test.reset_index()
        X_test[['card_id', 'target']].to_csv(file, index=False)
    
    return evals_results, sub_preds

In [8]:
#function returns mean of folds' RMSE
def rmse_avg(results):
    rmse_avg = 0
    for fold in results:
        rmse_avg += np.array(fold['test']['rmse']).min() / len(results)
    return rmse_avg

#### Base* Model:
<i>*with default parameters</i>

CV-average RMSE: 3.686 | Valid RMSE: 3.685

In [9]:
def base_model(return_preds=False):
    #open data
    with open(os.path.join(PATH_TO_DATA, 'train_df_csr.pkl'), 'rb') as pkl_f:
                train_df_csr = pickle.load(pkl_f)
    with open(os.path.join(PATH_TO_DATA, 'target_df.pkl'), 'rb') as pkl_f:
                target_df = pickle.load(pkl_f)

    #data train/valid split
    X_train, X_valid, y_train, y_valid = train_test_split(train_df_csr, target_df,
                                                          test_size=0.33, shuffle=True, random_state=17)

    #model training
    res, preds = kfold_lightgbm(X_train, y_train, X_valid, 3, file=False, params='default')
    
    #printing results
    print('-' * 70)
    print('RESULTS:')
    print('CV-average RMSE: {} | Valid RMSE: {}'.format(round(rmse_avg(res), 3),
                                                        round(rmse(preds, y_valid), 3)))
    print('-' * 70)
    
    #clearing memory
    del train_df_csr, target_df
    del X_train, X_valid, y_train, y_valid
    gc.collect()
    
    if return_preds == True:
        return preds
    
    del res, preds
    gc.collect()

In [10]:
%%time
base_model()

0it [00:00, ?it/s]

[10]	train's rmse: 3.63312	test's rmse: 3.65784
[20]	train's rmse: 3.51413	test's rmse: 3.64495
[30]	train's rmse: 3.43974	test's rmse: 3.63846
[40]	train's rmse: 3.38668	test's rmse: 3.63571
[50]	train's rmse: 3.34089	test's rmse: 3.6357
[60]	train's rmse: 3.30618	test's rmse: 3.6361
[70]	train's rmse: 3.27487	test's rmse: 3.63758
[80]	train's rmse: 3.248	test's rmse: 3.63711
[90]	train's rmse: 3.22046	test's rmse: 3.6387
[100]	train's rmse: 3.19763	test's rmse: 3.64084


1it [01:06, 66.89s/it]

Fold 1 RMSE : 3.6408424506480803
[10]	train's rmse: 3.57582	test's rmse: 3.80406
[20]	train's rmse: 3.46598	test's rmse: 3.76526
[30]	train's rmse: 3.39387	test's rmse: 3.75496
[40]	train's rmse: 3.34546	test's rmse: 3.74894
[50]	train's rmse: 3.30363	test's rmse: 3.74663
[60]	train's rmse: 3.26578	test's rmse: 3.74546
[70]	train's rmse: 3.23754	test's rmse: 3.74532
[80]	train's rmse: 3.20893	test's rmse: 3.7463
[90]	train's rmse: 3.17804	test's rmse: 3.74768
[100]	train's rmse: 3.15313	test's rmse: 3.75131


2it [02:20, 68.82s/it]

Fold 2 RMSE : 3.7513132952125106
[10]	train's rmse: 3.6076	test's rmse: 3.72201
[20]	train's rmse: 3.49869	test's rmse: 3.69437
[30]	train's rmse: 3.42425	test's rmse: 3.68487
[40]	train's rmse: 3.36768	test's rmse: 3.68006
[50]	train's rmse: 3.32849	test's rmse: 3.67825
[60]	train's rmse: 3.29475	test's rmse: 3.67877
[70]	train's rmse: 3.26227	test's rmse: 3.67937
[80]	train's rmse: 3.23434	test's rmse: 3.68083
[90]	train's rmse: 3.20514	test's rmse: 3.68231
[100]	train's rmse: 3.17717	test's rmse: 3.68326


3it [03:31, 69.51s/it]

Fold 3 RMSE : 3.683256602662884
----------------------------------------------------------------------
RESULTS:
CV-average RMSE: 3.686 | Valid RMSE: 3.685
----------------------------------------------------------------------





CPU times: user 13min 15s, sys: 3.15 s, total: 13min 18s
Wall time: 3min 39s


#### Base Model Improvements
After several experiments I stopped on 2 following parameters packs for LGB model (which predictions could be blended on final stage):
* model with deep trees;
* model with shallow trees.

Also I leave k-fold at 3 and learning_rate at 0.025 for now for learning speed increase. After further steps I plan, to increase k-fold up yo 10 and decrease learning_rate approzimately to 0.001.

##### Model With Deep Trees

CV-average RMSE: 3.67 | Valid RMSE: 3.681

In [11]:
def deep_trees(return_preds=False):
    #open data
    with open(os.path.join(PATH_TO_DATA, 'train_df_csr.pkl'), 'rb') as pkl_f:
                train_df_csr = pickle.load(pkl_f)
    with open(os.path.join(PATH_TO_DATA, 'target_df.pkl'), 'rb') as pkl_f:
                target_df = pickle.load(pkl_f)

    #data train/valid split
    X_train, X_valid, y_train, y_valid = train_test_split(train_df_csr, target_df,
                                                          test_size=0.33, shuffle=True, random_state=17)
    
    #parameters for lgb
    params ={'task': 'train',
             'objective': 'regression',
             'metric': 'rmse',

             'feature_fraction': 0.75, #1
             'subsample': 0.75, # 1

             'learning_rate': 0.025, # 0.1
             'max_depth': 12, # -1
             'num_leaves': 1000, # 31

             'reg_alpha': 0, # 0
             'reg_lambda': 0, # 0

             'min_split_gain': 0, #0
             'min_data_in_leaf': 100, #20
             'min_sum_hessian_in_leaf': 1e-3, # 1e-3

             'verbose': -1,

             'num_iterations':10000,
             'early_stopping_round':100}

    #model training
    res, preds = kfold_lightgbm(X_train, y_train, X_valid, 3, file=False, params=params)
    
    #printing results
    print('-' * 70)
    print('RESULTS:')
    print('CV-average RMSE: {} | Valid RMSE: {}'.format(round(rmse_avg(res), 3),
                                                        round(rmse(preds, y_valid), 3)))
    print('-' * 70)
    
    #clearing memory
    del train_df_csr, target_df
    del X_train, X_valid, y_train, y_valid
    gc.collect()
    
    if return_preds == True:
        return preds
    
    del res, preds
    gc.collect()

In [12]:
%%time
deep_trees()

0it [00:00, ?it/s]

Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.78077	test's rmse: 3.70739
[20]	train's rmse: 3.7013	test's rmse: 3.67138
[30]	train's rmse: 3.64093	test's rmse: 3.64907
[40]	train's rmse: 3.59205	test's rmse: 3.63619
[50]	train's rmse: 3.55116	test's rmse: 3.62852
[60]	train's rmse: 3.51662	test's rmse: 3.62436
[70]	train's rmse: 3.48697	test's rmse: 3.62178
[80]	train's rmse: 3.46357	test's rmse: 3.62027
[90]	train's rmse: 3.44329	test's rmse: 3.61938
[100]	train's rmse: 3.42653	test's rmse: 3.61849
[110]	train's rmse: 3.41128	test's rmse: 3.61858
[120]	train's rmse: 3.3971	test's rmse: 3.6183
[130]	train's rmse: 3.38664	test's rmse: 3.61833
[140]	train's rmse: 3.37499	test's rmse: 3.61879
[150]	train's rmse: 3.36542	test's rmse: 3.6185
[160]	train's rmse: 3.35541	test's rmse: 3.61857
[170]	train's rmse: 3.34575	test's rmse: 3.61878
[180]	train's rmse: 3.33961	test's rmse: 3.61906
[190]	train's rmse: 3.33306	test's rmse: 3.61943
[200]	train's rmse:

1it [05:01, 301.04s/it]

Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.70675	test's rmse: 3.87073
[20]	train's rmse: 3.6333	test's rmse: 3.82818
[30]	train's rmse: 3.57643	test's rmse: 3.79983
[40]	train's rmse: 3.52953	test's rmse: 3.78231
[50]	train's rmse: 3.49261	test's rmse: 3.76984
[60]	train's rmse: 3.45924	test's rmse: 3.76054
[70]	train's rmse: 3.43158	test's rmse: 3.75404
[80]	train's rmse: 3.40639	test's rmse: 3.74899
[90]	train's rmse: 3.38669	test's rmse: 3.74557
[100]	train's rmse: 3.36626	test's rmse: 3.74316
[110]	train's rmse: 3.35289	test's rmse: 3.74104
[120]	train's rmse: 3.33996	test's rmse: 3.73957
[130]	train's rmse: 3.32818	test's rmse: 3.73796
[140]	train's rmse: 3.31667	test's rmse: 3.73689
[150]	train's rmse: 3.30612	test's rmse: 3.73599
[160]	train's rmse: 3.29584	test's rmse: 3.73569
[170]	train's rmse: 3.28922	test's rmse: 3.73549
[180]	train's rmse: 3.28239	test's rmse: 3.7352
[190]	train's rmse: 3.27542	test's rmse: 3.73502
[200]	train's rms

2it [13:11, 357.98s/it]

Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.74919	test's rmse: 3.77633
[20]	train's rmse: 3.67371	test's rmse: 3.73776
[30]	train's rmse: 3.6155	test's rmse: 3.7132
[40]	train's rmse: 3.56652	test's rmse: 3.69603
[50]	train's rmse: 3.52901	test's rmse: 3.68507
[60]	train's rmse: 3.49812	test's rmse: 3.67775
[70]	train's rmse: 3.46691	test's rmse: 3.67213
[80]	train's rmse: 3.44211	test's rmse: 3.66871
[90]	train's rmse: 3.42274	test's rmse: 3.66689
[100]	train's rmse: 3.40582	test's rmse: 3.66479
[110]	train's rmse: 3.38984	test's rmse: 3.66355
[120]	train's rmse: 3.37868	test's rmse: 3.66239
[130]	train's rmse: 3.36627	test's rmse: 3.66137
[140]	train's rmse: 3.35705	test's rmse: 3.6602
[150]	train's rmse: 3.34836	test's rmse: 3.65971
[160]	train's rmse: 3.33929	test's rmse: 3.65919
[170]	train's rmse: 3.33244	test's rmse: 3.65904
[180]	train's rmse: 3.32479	test's rmse: 3.65882
[190]	train's rmse: 3.31714	test's rmse: 3.65893
[200]	train's rmse

3it [20:11, 376.33s/it]


----------------------------------------------------------------------
RESULTS:
CV-average RMSE: 3.67 | Valid RMSE: 3.681
----------------------------------------------------------------------
CPU times: user 1h 35s, sys: 45.2 s, total: 1h 1min 20s
Wall time: 20min 14s


##### Model With Shallow Trees

CV-average RMSE: 3.67 | Valid RMSE: 3.682

In [13]:
def shallow_trees(return_preds=False):
    #open data
    with open(os.path.join(PATH_TO_DATA, 'train_df_csr.pkl'), 'rb') as pkl_f:
                train_df_csr = pickle.load(pkl_f)
    with open(os.path.join(PATH_TO_DATA, 'target_df.pkl'), 'rb') as pkl_f:
                target_df = pickle.load(pkl_f)

    #data train/valid split
    X_train, X_valid, y_train, y_valid = train_test_split(train_df_csr, target_df,
                                                          test_size=0.33, shuffle=True, random_state=17)
    
    #parameters for lgb
    params ={'task': 'train',
             'objective': 'regression',
             'metric': 'rmse',

             'feature_fraction': 0.75, #1
             'subsample': 0.75, # 1

             'learning_rate': 0.025, # 0.1
             'max_depth': 7, # -1
             'num_leaves': 100,# 31

             'reg_alpha': 0, # 0
             'reg_lambda': 0, # 0

             'min_split_gain': 0, #0
             'min_data_in_leaf': 100, #20
             'min_sum_hessian_in_leaf': 1e-3, # 1e-3

             'verbose': -1,

             'num_iterations':10000,
             'early_stopping_round':100}

    #model training
    res, preds = kfold_lightgbm(X_train, y_train, X_valid, 3, file=False, params=params)
    
    #printing results
    print('-' * 70)
    print('RESULTS:')
    print('CV-average RMSE: {} | Valid RMSE: {}'.format(round(rmse_avg(res), 3),
                                                        round(rmse(preds, y_valid), 3)))
    print('-' * 70)
    
    #clearing memory
    del train_df_csr, target_df
    del X_train, X_valid, y_train, y_valid
    gc.collect()
    
    if return_preds == True:
        return preds
    
    del res, preds
    gc.collect()

In [14]:
%%time
shallow_trees()

0it [00:00, ?it/s]

Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.80168	test's rmse: 3.71059
[20]	train's rmse: 3.74044	test's rmse: 3.67591
[30]	train's rmse: 3.69535	test's rmse: 3.65449
[40]	train's rmse: 3.66314	test's rmse: 3.64112
[50]	train's rmse: 3.63651	test's rmse: 3.6322
[60]	train's rmse: 3.61591	test's rmse: 3.62714
[70]	train's rmse: 3.59751	test's rmse: 3.62362
[80]	train's rmse: 3.58375	test's rmse: 3.62186
[90]	train's rmse: 3.57116	test's rmse: 3.62082
[100]	train's rmse: 3.55953	test's rmse: 3.62004
[110]	train's rmse: 3.54946	test's rmse: 3.62
[120]	train's rmse: 3.54232	test's rmse: 3.61952
[130]	train's rmse: 3.53638	test's rmse: 3.61953
[140]	train's rmse: 3.53013	test's rmse: 3.61907
[150]	train's rmse: 3.52366	test's rmse: 3.61929
[160]	train's rmse: 3.51657	test's rmse: 3.61918
[170]	train's rmse: 3.51236	test's rmse: 3.6191
[180]	train's rmse: 3.50755	test's rmse: 3.61888
[190]	train's rmse: 3.50234	test's rmse: 3.61899
[200]	train's rmse: 

1it [03:38, 218.95s/it]

Fold 1 RMSE : 3.6184569440979284
Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.72763	test's rmse: 3.87315
[20]	train's rmse: 3.67339	test's rmse: 3.8319
[30]	train's rmse: 3.63239	test's rmse: 3.80402
[40]	train's rmse: 3.60175	test's rmse: 3.78604
[50]	train's rmse: 3.5773	test's rmse: 3.77279
[60]	train's rmse: 3.55894	test's rmse: 3.76318
[70]	train's rmse: 3.54153	test's rmse: 3.75564
[80]	train's rmse: 3.52382	test's rmse: 3.74932
[90]	train's rmse: 3.50963	test's rmse: 3.74385
[100]	train's rmse: 3.49702	test's rmse: 3.73981
[110]	train's rmse: 3.48941	test's rmse: 3.73828
[120]	train's rmse: 3.48094	test's rmse: 3.73582
[130]	train's rmse: 3.47526	test's rmse: 3.73439
[140]	train's rmse: 3.46995	test's rmse: 3.73313
[150]	train's rmse: 3.46537	test's rmse: 3.73234
[160]	train's rmse: 3.461	test's rmse: 3.73181
[170]	train's rmse: 3.45641	test's rmse: 3.7311
[180]	train's rmse: 3.45221	test's rmse: 3.73063
[190]	train's rmse: 3.4478	test's rm

2it [08:17, 236.87s/it]

Fold 2 RMSE : 3.728244417189211
Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.77016	test's rmse: 3.77991
[20]	train's rmse: 3.71309	test's rmse: 3.74304
[30]	train's rmse: 3.67187	test's rmse: 3.71837
[40]	train's rmse: 3.64056	test's rmse: 3.70291
[50]	train's rmse: 3.61362	test's rmse: 3.69241
[60]	train's rmse: 3.5929	test's rmse: 3.68526
[70]	train's rmse: 3.5744	test's rmse: 3.67896
[80]	train's rmse: 3.56015	test's rmse: 3.67516
[90]	train's rmse: 3.5474	test's rmse: 3.67242
[100]	train's rmse: 3.53463	test's rmse: 3.6694
[110]	train's rmse: 3.52571	test's rmse: 3.66824
[120]	train's rmse: 3.51846	test's rmse: 3.66625
[130]	train's rmse: 3.51307	test's rmse: 3.66558
[140]	train's rmse: 3.5067	test's rmse: 3.66467
[150]	train's rmse: 3.50071	test's rmse: 3.66387
[160]	train's rmse: 3.49499	test's rmse: 3.66321
[170]	train's rmse: 3.49038	test's rmse: 3.66288
[180]	train's rmse: 3.48503	test's rmse: 3.66259
[190]	train's rmse: 3.48221	test's rm

3it [12:04, 233.83s/it]

Fold 3 RMSE : 3.6618577612555807
----------------------------------------------------------------------
RESULTS:
CV-average RMSE: 3.67 | Valid RMSE: 3.682
----------------------------------------------------------------------
CPU times: user 45min 22s, sys: 11.6 s, total: 45min 33s
Wall time: 13min 12s





#### Additional Feature Engeneering
I think that furher parameters tuning could result in slight RMSE improvement, but I beleive, that additional feature engeneering could provide more feasible results.

##### Adding Transactions Num
<i>dump: updating existing train_df & test_df</i>

In [15]:
def trans_num_add():
    #train and test DataFrames names
    dfns = ['train_df', 'test_df']

    #transactions files' names
    tfns = ['ht_auth', 'ht_den', 'nt']

    for dfn in dfns:

        #open DataFrame
        with open(os.path.join(PATH_TO_DATA, dfn+'.pkl'), 'rb') as pkl_f:
            df = pickle.load(pkl_f)

        for tfn in tfns:

            #open transactions file
            with open(os.path.join(PATH_TO_DATA, tfn+'.pkl'), 'rb') as pkl_f:
                tf = pickle.load(pkl_f)

            #transactions count
            f_name = tfn+'_count'
            tf = tf.groupby(['card_id'])['card_id'].count().to_frame(name=f_name)

            #adding feature to DataFrame
            df = df.merge(tf, how='left', left_index=True, right_index=True)
            df[f_name] = df[f_name].fillna(0)

            #clearing memory
            del tf
            gc.collect()

        #saving DataFrame
        with open(os.path.join(PATH_TO_DATA, dfn+'.pkl'), 'wb') as pkl_f:
            pickle.dump(df, pkl_f)

        #clearing memory
        del df
        gc.collect()
        pass

In [16]:
%%time
trans_num_add()

CPU times: user 38.9 s, sys: 19.1 s, total: 58 s
Wall time: 2min 35s


##### Replacing Purchase_Amount with 'Raw'  Values
<i>dump: updating existing train_df & test_df</i>

In [17]:
def purchase_amount_raw_values():
    
    for tfn in ['hist_trans', 'new_trans']:
    #loading (raw) transaction files
        with open(os.path.join(PATH_TO_DATA, tfn+'_reduced_mem_usage.pkl'), 'rb') as pkl_f:
            trans_df = pickle.load(pkl_f)

        #restoring raw purchase amount
        #consider formula here: https://www.kaggle.com/raddar/target-true-meaning-revealed/
        trans_df['purchase_amount_raw'] = trans_df['purchase_amount'] / 0.00150265118 + 497.06

        #estimating purchase_amount per installment ratios
        trans_df['purchase_amount_raw_per_installments']=\
        trans_df['purchase_amount_raw'] / trans_df['installments'].replace([0, 999], np.nan)

        if tfn == 'hist_trans':
            #splitting hist_trans on authorized and denied transactions
            trans_df_auth = trans_df[trans_df['authorized_flag'] == 'Y']
            trans_df_den = trans_df[trans_df['authorized_flag'] == 'N']

            #clearing memory
            del trans_df
            gc.collect()
            pass

            trans_df_list = [trans_df_auth, trans_df_den]
            trans_names = ['ht_auth', 'ht_den']

        else:
            trans_df_list = [trans_df]
            trans_names = ['nt']

        #aggregation parameters
        aggs = {}
        for col in ['purchase_amount_raw', 'purchase_amount_raw_per_installments']:
            aggs[col] = ['sum', 'max', 'min', 'mean', 'var']
    
        
        for t_df, trans_name in zip(trans_df_list, trans_names):
            
            #aggregating
            df_agg = t_df.reset_index(drop=True).groupby('card_id').agg(aggs)
            
            #renaming columns
            df_agg.columns = [trans_name+'_'+i+'_'+j for i, j in zip(df_agg.columns.get_level_values(0),
                                                                     df_agg.columns.get_level_values(1))]
            
            for dfn in ['train_df', 'test_df']:
                
                #open train/test DataFrames
                with open(os.path.join(PATH_TO_DATA, dfn+'.pkl'), 'rb') as pkl_f:
                    df = pickle.load(pkl_f)
                
                #merging
                df = df.merge(df_agg, how='left', left_index=True, right_index=True)
                
                with open(os.path.join(PATH_TO_DATA, dfn+'.pkl'), 'wb') as pkl_f:
                    pickle.dump(df, pkl_f)
                
                #memory clearing
                del df
                gc.collect()


        #memory clearing
        del trans_df_list
        gc.collect()

In [18]:
%%time
purchase_amount_raw_values()

CPU times: user 1min, sys: 51.8 s, total: 1min 52s
Wall time: 6min 39s


##### Adding Ratios: (1) New Trans. Num. / Hist Trans. Num.; (2) New Trans. Purch. Amount Sum / Hist Trans. Purch. Amount Sum
<i>dump: updating existing train_df & test_df</i>

In [19]:
def trans_num_and_purch_am_ratios():
    #open train and test DataFrames
    with open(os.path.join(PATH_TO_DATA, 'train_df.pkl'), 'rb') as pkl_f:
        train_df = pickle.load(pkl_f)
    with open(os.path.join(PATH_TO_DATA, 'test_df.pkl'), 'rb') as pkl_f:
        test_df = pickle.load(pkl_f)

    for df in [train_df, test_df]:

        #counts
        df['nt_count_2_ht_auth_count'] = df['nt_count'] / df['ht_auth_count']
        df['ht_den_count_2_ht_auth_count'] = df['ht_den_count'] / df['ht_auth_count']

        #sums
        df['nt_purchase_amount_raw_sum_2_ht_auth_purchase_amount_raw_sum'] =\
        df['nt_purchase_amount_raw_sum'] / df['ht_auth_purchase_amount_raw_sum']

        df['ht_den_purchase_amount_raw_sum_2_ht_auth_purchase_amount_raw_sum'] =\
        df['ht_den_purchase_amount_raw_sum'] / df['ht_auth_purchase_amount_raw_sum']

    #dump updated train and test DataFrames
    with open(os.path.join(PATH_TO_DATA, 'train_df.pkl'), 'wb') as pkl_f:
        pickle.dump(train_df, pkl_f)
    with open(os.path.join(PATH_TO_DATA, 'test_df.pkl'), 'wb') as pkl_f:
        pickle.dump(test_df, pkl_f)
        
    #memory clearing
    del train_df, test_df
    gc.collect()

In [20]:
%%time
trans_num_and_purch_am_ratios()

CPU times: user 2.95 s, sys: 5.59 s, total: 8.54 s
Wall time: 37.4 s


##### Feature: Spending per Month Dynamics
Let's count spending of each card_id in each month and 'look' at it dynamics in history. For this I'll estimate \[M+1 purchase amount / M+0 purchase amount\] ratios, where M - month.

Also I'll try to compare dynamics in 2 ways:
* by months chronologically;
* by month lag (from reference month, which is suposingly a month of Elo's reccomendation algorithm activation).

I'd like to remind, that in new_merchant_transactions file are only purchases at merchants, which haven't met historical transactions. Hnce, to make data from historical transactions and new merchant transactions (more or less) consistent, I estimate average spending per merchant.

###### Preparation of Joint (hist authorized and new) Transactions File
<i>hant.pkl</i>
* Concatenating historical and new merchant transactions;
* Restoring 'raw' purchase amount;
* Creating purchase date in MM-YYYY format;
* Fullfilling mechant_id NaNs with concatenated merchant_category_id & subsector_id (cause several NaNs per 1 card_id will cause in inproper spending per merchant_id estimation).

In [21]:
def prepare_hant():
    #loading transactions files
    with open(os.path.join(PATH_TO_DATA, 'hist_trans_reduced_mem_usage.pkl'), 'rb') as pkl_f:
        hist_trans = pickle.load(pkl_f)
    with open(os.path.join(PATH_TO_DATA, 'new_trans_reduced_mem_usage.pkl'), 'rb') as pkl_f:
        new_trans = pickle.load(pkl_f)

    #restoring raw purchase amount
    hist_trans['purchase_amount_raw'] = hist_trans['purchase_amount'] / 0.00150265118 + 497.06
    new_trans['purchase_amount_raw'] = new_trans['purchase_amount'] / 0.00150265118 + 497.06

    #concatenating Historical_Authorized transactions with New_Transactions
    hant = hist_trans[hist_trans['authorized_flag'] == 'Y'].append(new_trans)

    #clearing memory
    del hist_trans, new_trans
    gc.collect()
    pass

    #purchase_date to_datetime
    hant['purchase_date'] = pd.to_datetime(hant['purchase_date'])

    #creating purchase_data in yyyymm format
    hant['purchase_date_month_year'] =\
    hant['purchase_date'].dt.year * 100 + hant['purchase_date'].dt.month

    #replacing NaNs in merchant_id column
    nan_idx = hant['merchant_id'].isna()
    hant.loc[nan_idx, 'merchant_id'] = 'unknown'
    for col in tqdm(['merchant_category_id', 'subsector_id']):
        hant.loc[nan_idx, 'merchant_id'] += '_'+hant.loc[nan_idx, col].astype(str)

    #saving concatenated transactions file
    with open(os.path.join(PATH_TO_DATA, 'hant.pkl'), 'wb') as pkl_f:
        pickle.dump(hant, pkl_f)

In [22]:
%%time
prepare_hant()

100%|██████████| 2/2 [00:06<00:00,  3.63s/it]


CPU times: user 1min 1s, sys: 29.6 s, total: 1min 31s
Wall time: 4min 11s


###### Dynamics by Month
<i>dump: new files - train_df_1 & test_df_1</i>

CV-average RMSE: 3.662 | Valid RMSE: 3.674

In [23]:
def dynamics_by_month():
    #loading joint transactions file
    with open(os.path.join(PATH_TO_DATA, 'hant.pkl'), 'rb') as pkl_f:
        hant = pickle.load(pkl_f)

    #estimating Spending per Month
    sm = hant.groupby(['card_id', 'purchase_date_month_year'])['purchase_amount_raw'].sum().unstack(-1)

    #estimating number of Unique Merchant_id per Month
    umm = hant.groupby(['card_id', 'purchase_date_month_year'])['merchant_id'].nunique().unstack(-1)

    #estimating Average Spending per Merchant_id per Month
    asmm = sm / umm

    #estimating M+1 / M+0 ratios

    #initializing asmm Ratios
    asmm_r = pd.DataFrame(index=asmm.index)

    for i in range(len(asmm.columns)-1):
        M_1 = asmm.columns[i+1]
        M_0 = asmm.columns[i]
        asmm_r['{}_/_{}'.format(M_1, M_0)] = asmm[M_1] / asmm[M_0]

    #initializing asmm Features
    asmm_f = pd.DataFrame(index=asmm.index)

    #generating features
    asmm_f['1M_ratios_mean'] = asmm_r.mean(axis=1)
    asmm_f['1M_ratios_min'] = asmm_r.min(axis=1)
    asmm_f['1M_ratios_max'] = asmm_r.max(axis=1)
    asmm_f['1M_ratios_sum'] = asmm_r.sum(axis=1)
    asmm_f['1M_ratios_count'] = asmm_r.count(axis=1)

    #adding features to train_df and test_df
    for dfn in ['train_df', 'test_df']:
        with open(os.path.join(PATH_TO_DATA, dfn+'.pkl'), 'rb') as pkl_f:
            df = pickle.load(pkl_f)
        df = df.merge(asmm_r, how='left', left_index=True, right_index=True)
        df = df.merge(asmm_f, how='left', left_index=True, right_index=True)
        with open(os.path.join(PATH_TO_DATA, dfn+'_1.pkl'), 'wb') as pkl_f:
            pickle.dump(df, pkl_f)

        #clearing memory
        del df
        gc.collect()
        
    #clearing_memory
    del hant, asmm_r, asmm_f
    gc.collect()

In [24]:
%%time
dynamics_by_month()

CPU times: user 1min 13s, sys: 13.8 s, total: 1min 27s
Wall time: 2min 20s


In [25]:
%%time
sparse_matrix_prep('train_df_1')
sparse_matrix_prep('test_df_1')

CPU times: user 1min 33s, sys: 1min, total: 2min 33s
Wall time: 10min 46s


In [26]:
def shallow_trees_with_dynamics_by_month(return_preds=False):
    #open data
    with open(os.path.join(PATH_TO_DATA, 'train_df_1_csr.pkl'), 'rb') as pkl_f:
                train_df_csr = pickle.load(pkl_f)
    with open(os.path.join(PATH_TO_DATA, 'target_df.pkl'), 'rb') as pkl_f:
                target_df = pickle.load(pkl_f)

    #data train/valid split
    X_train, X_valid, y_train, y_valid = train_test_split(train_df_csr, target_df,
                                                          test_size=0.33, shuffle=True, random_state=17)
    
    #parameters for lgb
    params ={'task': 'train',
             'objective': 'regression',
             'metric': 'rmse',

             'feature_fraction': 0.75, #1
             'subsample': 0.75, # 1

             'learning_rate': 0.025, # 0.1
             'max_depth': 7, # -1
             'num_leaves': 100,# 31

             'reg_alpha': 0, # 0
             'reg_lambda': 0, # 0

             'min_split_gain': 0, #0
             'min_data_in_leaf': 100, #20
             'min_sum_hessian_in_leaf': 1e-3, # 1e-3

             'verbose': -1,

             'num_iterations':10000,
             'early_stopping_round':100}

    #model training
    res, preds = kfold_lightgbm(X_train, y_train, X_valid, 3, file=False, params=params)
    
    #printing results
    print('-' * 70)
    print('RESULTS:')
    print('CV-average RMSE: {} | Valid RMSE: {}'.format(round(rmse_avg(res), 3),
                                                        round(rmse(preds, y_valid), 3)))
    print('-' * 70)
    
    #clearing memory
    del train_df_csr, target_df
    del X_train, X_valid, y_train, y_valid
    gc.collect()
    
    if return_preds == True:
        return preds
    
    del res, preds
    gc.collect()

In [27]:
%%time
preds_1 = shallow_trees_with_dynamics_by_month(return_preds=True)

0it [00:00, ?it/s]

Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.7967	test's rmse: 3.7074
[20]	train's rmse: 3.73338	test's rmse: 3.67097
[30]	train's rmse: 3.68638	test's rmse: 3.64905
[40]	train's rmse: 3.64984	test's rmse: 3.63438
[50]	train's rmse: 3.62176	test's rmse: 3.62538
[60]	train's rmse: 3.59745	test's rmse: 3.62056
[70]	train's rmse: 3.57848	test's rmse: 3.61682
[80]	train's rmse: 3.56177	test's rmse: 3.61453
[90]	train's rmse: 3.54827	test's rmse: 3.61345
[100]	train's rmse: 3.53621	test's rmse: 3.61218
[110]	train's rmse: 3.52576	test's rmse: 3.61122
[120]	train's rmse: 3.51764	test's rmse: 3.61039
[130]	train's rmse: 3.50854	test's rmse: 3.61015
[140]	train's rmse: 3.50192	test's rmse: 3.61002
[150]	train's rmse: 3.49693	test's rmse: 3.60999
[160]	train's rmse: 3.49125	test's rmse: 3.60998
[170]	train's rmse: 3.4861	test's rmse: 3.61012
[180]	train's rmse: 3.48143	test's rmse: 3.61032
[190]	train's rmse: 3.47544	test's rmse: 3.61069
[200]	train's rmse

1it [02:42, 162.61s/it]

Fold 1 RMSE : 3.609895823344217
Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.72302	test's rmse: 3.86985
[20]	train's rmse: 3.66611	test's rmse: 3.82837
[30]	train's rmse: 3.62135	test's rmse: 3.80012
[40]	train's rmse: 3.5863	test's rmse: 3.78148
[50]	train's rmse: 3.55896	test's rmse: 3.76848
[60]	train's rmse: 3.53644	test's rmse: 3.75841
[70]	train's rmse: 3.51683	test's rmse: 3.75073
[80]	train's rmse: 3.49906	test's rmse: 3.74504
[90]	train's rmse: 3.48435	test's rmse: 3.73981
[100]	train's rmse: 3.47127	test's rmse: 3.73689
[110]	train's rmse: 3.46259	test's rmse: 3.73422
[120]	train's rmse: 3.45564	test's rmse: 3.73261
[130]	train's rmse: 3.44911	test's rmse: 3.73114
[140]	train's rmse: 3.44401	test's rmse: 3.73013
[150]	train's rmse: 3.43801	test's rmse: 3.72913
[160]	train's rmse: 3.43247	test's rmse: 3.72847
[170]	train's rmse: 3.42777	test's rmse: 3.72793
[180]	train's rmse: 3.42267	test's rmse: 3.72767
[190]	train's rmse: 3.41877	test'

2it [06:52, 188.71s/it]

Fold 2 RMSE : 3.7254277208488173
Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.76693	test's rmse: 3.77789
[20]	train's rmse: 3.70431	test's rmse: 3.74032
[30]	train's rmse: 3.65941	test's rmse: 3.71508
[40]	train's rmse: 3.62222	test's rmse: 3.69731
[50]	train's rmse: 3.5937	test's rmse: 3.68418
[60]	train's rmse: 3.57068	test's rmse: 3.67517
[70]	train's rmse: 3.55197	test's rmse: 3.66887
[80]	train's rmse: 3.53581	test's rmse: 3.66515
[90]	train's rmse: 3.52205	test's rmse: 3.66169
[100]	train's rmse: 3.50655	test's rmse: 3.65841
[110]	train's rmse: 3.49625	test's rmse: 3.65634
[120]	train's rmse: 3.48928	test's rmse: 3.65512
[130]	train's rmse: 3.48339	test's rmse: 3.65403
[140]	train's rmse: 3.47776	test's rmse: 3.65303
[150]	train's rmse: 3.47102	test's rmse: 3.65217
[160]	train's rmse: 3.4654	test's rmse: 3.65151
[170]	train's rmse: 3.4608	test's rmse: 3.65124
[180]	train's rmse: 3.45597	test's rmse: 3.65086
[190]	train's rmse: 3.45239	test's

3it [10:42, 201.25s/it]

Fold 3 RMSE : 3.6499113168451673
----------------------------------------------------------------------
RESULTS:
CV-average RMSE: 3.662 | Valid RMSE: 3.674
----------------------------------------------------------------------
CPU times: user 38min 49s, sys: 1min 42s, total: 40min 31s
Wall time: 11min 18s





###### Dynamics by month_lag
dump train_df_2 & test_df_2

CV-average RMSE: 3.662 | Valid RMSE: 3.674

In [28]:
def dynamics_by_month_lag():
    #loading joint transactions file
    with open(os.path.join(PATH_TO_DATA, 'hant.pkl'), 'rb') as pkl_f:
        hant = pickle.load(pkl_f)

    #estimating Spending per Month
    sm = hant.groupby(['card_id', 'month_lag'])['purchase_amount_raw'].sum().unstack(-1)

    #estimating number of Unique Merchant_id per Month
    umm = hant.groupby(['card_id', 'month_lag'])['merchant_id'].nunique().unstack(-1)

    #estimating Average Spending per Merchant_id per Month
    asmm = sm / umm

    #estimating M+1 / M+0 ratios

    #initializing asmm Ratios
    asmm_r = pd.DataFrame(index=asmm.index)

    for i in range(len(asmm.columns)-1):
        M_1 = asmm.columns[i+1]
        M_0 = asmm.columns[i]
        asmm_r['{}_/_{}'.format(M_1, M_0)] = asmm[M_1] / asmm[M_0]

    #initializing asmm Features
    asmm_f = pd.DataFrame(index=asmm.index)

    #generating features
    asmm_f['1M_ratios_mean'] = asmm_r.mean(axis=1)
    asmm_f['1M_ratios_min'] = asmm_r.min(axis=1)
    asmm_f['1M_ratios_max'] = asmm_r.max(axis=1)
    asmm_f['1M_ratios_sum'] = asmm_r.sum(axis=1)
    asmm_f['1M_ratios_count'] = asmm_r.count(axis=1)

    #adding features to train_df and test_df
    for dfn in ['train_df', 'test_df']:
        with open(os.path.join(PATH_TO_DATA, dfn+'.pkl'), 'rb') as pkl_f:
            df = pickle.load(pkl_f)
        df = df.merge(asmm_r, how='left', left_index=True, right_index=True)
        df = df.merge(asmm_f, how='left', left_index=True, right_index=True)
        with open(os.path.join(PATH_TO_DATA, dfn+'_2.pkl'), 'wb') as pkl_f:
            pickle.dump(df, pkl_f)

        #clearing memory
        del df
        gc.collect()
        
    #clearing_memory
    del hant, asmm_r, asmm_f
    gc.collect()

In [29]:
%%time
dynamics_by_month_lag()

CPU times: user 1min 13s, sys: 14.7 s, total: 1min 28s
Wall time: 2min 34s


In [30]:
%%time
sparse_matrix_prep('train_df_2')
sparse_matrix_prep('test_df_2')

CPU times: user 1min 25s, sys: 57.9 s, total: 2min 23s
Wall time: 9min 39s


In [31]:
def shallow_trees_with_dynamics_by_month_lag(return_preds=False):
    #open data
    with open(os.path.join(PATH_TO_DATA, 'train_df_1_csr.pkl'), 'rb') as pkl_f:
                train_df_csr = pickle.load(pkl_f)
    with open(os.path.join(PATH_TO_DATA, 'target_df.pkl'), 'rb') as pkl_f:
                target_df = pickle.load(pkl_f)

    #data train/valid split
    X_train, X_valid, y_train, y_valid = train_test_split(train_df_csr, target_df,
                                                          test_size=0.33, shuffle=True, random_state=17)
    
    #parameters for lgb
    params ={'task': 'train',
             'objective': 'regression',
             'metric': 'rmse',

             'feature_fraction': 0.75, #1
             'subsample': 0.75, # 1

             'learning_rate': 0.025, # 0.1
             'max_depth': 7, # -1
             'num_leaves': 100,# 31

             'reg_alpha': 0, # 0
             'reg_lambda': 0, # 0

             'min_split_gain': 0, #0
             'min_data_in_leaf': 100, #20
             'min_sum_hessian_in_leaf': 1e-3, # 1e-3

             'verbose': -1,

             'num_iterations':10000,
             'early_stopping_round':100}

    #model training
    res, preds = kfold_lightgbm(X_train, y_train, X_valid, 3, file=False, params=params)
    
    #printing results
    print('-' * 70)
    print('RESULTS:')
    print('CV-average RMSE: {} | Valid RMSE: {}'.format(round(rmse_avg(res), 3),
                                                        round(rmse(preds, y_valid), 3)))
    print('-' * 70)
    
    #clearing memory
    del train_df_csr, target_df
    del X_train, X_valid, y_train, y_valid
    gc.collect()
    
    if return_preds == True:
        return preds
    
    del res, preds
    gc.collect()

In [32]:
%%time
preds_2 = shallow_trees_with_dynamics_by_month_lag(return_preds=True)

0it [00:00, ?it/s]

Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.7967	test's rmse: 3.7074
[20]	train's rmse: 3.73338	test's rmse: 3.67097
[30]	train's rmse: 3.68638	test's rmse: 3.64905
[40]	train's rmse: 3.64984	test's rmse: 3.63438
[50]	train's rmse: 3.62176	test's rmse: 3.62538
[60]	train's rmse: 3.59745	test's rmse: 3.62056
[70]	train's rmse: 3.57848	test's rmse: 3.61682
[80]	train's rmse: 3.56177	test's rmse: 3.61453
[90]	train's rmse: 3.54827	test's rmse: 3.61345
[100]	train's rmse: 3.53621	test's rmse: 3.61218
[110]	train's rmse: 3.52576	test's rmse: 3.61122
[120]	train's rmse: 3.51764	test's rmse: 3.61039
[130]	train's rmse: 3.50854	test's rmse: 3.61015
[140]	train's rmse: 3.50192	test's rmse: 3.61002
[150]	train's rmse: 3.49693	test's rmse: 3.60999
[160]	train's rmse: 3.49125	test's rmse: 3.60998
[170]	train's rmse: 3.4861	test's rmse: 3.61012
[180]	train's rmse: 3.48143	test's rmse: 3.61032
[190]	train's rmse: 3.47544	test's rmse: 3.61069
[200]	train's rmse

1it [02:48, 168.05s/it]

Fold 1 RMSE : 3.609895823344217
Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.72302	test's rmse: 3.86985
[20]	train's rmse: 3.66611	test's rmse: 3.82837
[30]	train's rmse: 3.62135	test's rmse: 3.80012
[40]	train's rmse: 3.5863	test's rmse: 3.78148
[50]	train's rmse: 3.55896	test's rmse: 3.76848
[60]	train's rmse: 3.53644	test's rmse: 3.75841
[70]	train's rmse: 3.51683	test's rmse: 3.75073
[80]	train's rmse: 3.49906	test's rmse: 3.74504
[90]	train's rmse: 3.48435	test's rmse: 3.73981
[100]	train's rmse: 3.47127	test's rmse: 3.73689
[110]	train's rmse: 3.46259	test's rmse: 3.73422
[120]	train's rmse: 3.45564	test's rmse: 3.73261
[130]	train's rmse: 3.44911	test's rmse: 3.73114
[140]	train's rmse: 3.44401	test's rmse: 3.73013
[150]	train's rmse: 3.43801	test's rmse: 3.72913
[160]	train's rmse: 3.43247	test's rmse: 3.72847
[170]	train's rmse: 3.42777	test's rmse: 3.72793
[180]	train's rmse: 3.42267	test's rmse: 3.72767
[190]	train's rmse: 3.41877	test'

2it [06:57, 192.52s/it]

Fold 2 RMSE : 3.7254277208488173
Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.76693	test's rmse: 3.77789
[20]	train's rmse: 3.70431	test's rmse: 3.74032
[30]	train's rmse: 3.65941	test's rmse: 3.71508
[40]	train's rmse: 3.62222	test's rmse: 3.69731
[50]	train's rmse: 3.5937	test's rmse: 3.68418
[60]	train's rmse: 3.57068	test's rmse: 3.67517
[70]	train's rmse: 3.55197	test's rmse: 3.66887
[80]	train's rmse: 3.53581	test's rmse: 3.66515
[90]	train's rmse: 3.52205	test's rmse: 3.66169
[100]	train's rmse: 3.50655	test's rmse: 3.65841
[110]	train's rmse: 3.49625	test's rmse: 3.65634
[120]	train's rmse: 3.48928	test's rmse: 3.65512
[130]	train's rmse: 3.48339	test's rmse: 3.65403
[140]	train's rmse: 3.47776	test's rmse: 3.65303
[150]	train's rmse: 3.47102	test's rmse: 3.65217
[160]	train's rmse: 3.4654	test's rmse: 3.65151
[170]	train's rmse: 3.4608	test's rmse: 3.65124
[180]	train's rmse: 3.45597	test's rmse: 3.65086
[190]	train's rmse: 3.45239	test's

3it [10:58, 206.97s/it]

Fold 3 RMSE : 3.6499113168451673
----------------------------------------------------------------------
RESULTS:
CV-average RMSE: 3.662 | Valid RMSE: 3.674
----------------------------------------------------------------------
CPU times: user 38min 57s, sys: 1min 55s, total: 40min 53s
Wall time: 11min 4s





#### 2 Models Blending

Models trained on both features sets(dynamics_by_month, dynamics_by_month_lag) provided similar results improvement, hence I try to blend its results.

In [33]:
with open(os.path.join(PATH_TO_DATA, 'target_df.pkl'), 'rb') as pkl_f:
    target_df = pickle.load(pkl_f)

_, y_valid = train_test_split(target_df, test_size=0.33, shuffle=True, random_state=17)
    
preds = preds_1 * 0.5 + preds_2 * 0.5

print('Valid RMSE: {}'.format(round(rmse(preds, y_valid), 3)))

Valid RMSE: 3.674


### Training Models on Full Datasets and Submission

In [34]:
def shallow_trees_with_dynamics_by_month_full(return_preds=False):
    #open data
    with open(os.path.join(PATH_TO_DATA, 'train_df_1_csr.pkl'), 'rb') as pkl_f:
                X_train = pickle.load(pkl_f)
    with open(os.path.join(PATH_TO_DATA, 'test_df_1_csr.pkl'), 'rb') as pkl_f:
                X_test = pickle.load(pkl_f)
    with open(os.path.join(PATH_TO_DATA, 'target_df.pkl'), 'rb') as pkl_f:
                y_train = pickle.load(pkl_f)

    #parameters for lgb
    params ={'task': 'train',
             'objective': 'regression',
             'metric': 'rmse',

             'feature_fraction': 0.75, #1
             'subsample': 0.75, # 1

             'learning_rate': 0.025, # 0.1
             'max_depth': 7, # -1
             'num_leaves': 100,# 31

             'reg_alpha': 0, # 0
             'reg_lambda': 0, # 0

             'min_split_gain': 0, #0
             'min_data_in_leaf': 100, #20
             'min_sum_hessian_in_leaf': 1e-3, # 1e-3

             'verbose': -1,

             'num_iterations':10000,
             'early_stopping_round':100}

    #model training
    res, preds = kfold_lightgbm(X_train, y_train, X_test, 3, file=False, params=params)
    
    #printing results
    print('-' * 70)
    print('RESULTS:')
    print('CV-average RMSE: {}'.format(round(rmse_avg(res), 3)))
    print('-' * 70)
    
    #clearing memory
    del X_train, X_test, y_train
    gc.collect()
    
    if return_preds == True:
        return preds
    
    del res, preds
    gc.collect()

In [35]:
%%time
preds_1 = shallow_trees_with_dynamics_by_month_full(return_preds=True)

0it [00:00, ?it/s]

Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.76625	test's rmse: 3.79299
[20]	train's rmse: 3.70715	test's rmse: 3.75386
[30]	train's rmse: 3.66257	test's rmse: 3.72711
[40]	train's rmse: 3.62874	test's rmse: 3.70982
[50]	train's rmse: 3.60199	test's rmse: 3.69938
[60]	train's rmse: 3.57922	test's rmse: 3.69115
[70]	train's rmse: 3.56123	test's rmse: 3.68635
[80]	train's rmse: 3.54455	test's rmse: 3.68337
[90]	train's rmse: 3.52965	test's rmse: 3.68035
[100]	train's rmse: 3.51683	test's rmse: 3.67839
[110]	train's rmse: 3.50586	test's rmse: 3.67656
[120]	train's rmse: 3.4969	test's rmse: 3.67524
[130]	train's rmse: 3.48929	test's rmse: 3.67407
[140]	train's rmse: 3.48318	test's rmse: 3.67337
[150]	train's rmse: 3.47925	test's rmse: 3.67267
[160]	train's rmse: 3.47609	test's rmse: 3.67218
[170]	train's rmse: 3.4728	test's rmse: 3.67177
[180]	train's rmse: 3.46994	test's rmse: 3.67162
[190]	train's rmse: 3.46699	test's rmse: 3.67144
[200]	train's rms

1it [05:58, 358.82s/it]

Fold 1 RMSE : 3.6708927292313662
Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.75726	test's rmse: 3.81156
[20]	train's rmse: 3.7	test's rmse: 3.77218
[30]	train's rmse: 3.65676	test's rmse: 3.74695
[40]	train's rmse: 3.62393	test's rmse: 3.73021
[50]	train's rmse: 3.59832	test's rmse: 3.71902
[60]	train's rmse: 3.57621	test's rmse: 3.70998
[70]	train's rmse: 3.55809	test's rmse: 3.70431
[80]	train's rmse: 3.54406	test's rmse: 3.70053
[90]	train's rmse: 3.53161	test's rmse: 3.69768
[100]	train's rmse: 3.5202	test's rmse: 3.69443
[110]	train's rmse: 3.51037	test's rmse: 3.69242
[120]	train's rmse: 3.50232	test's rmse: 3.69076
[130]	train's rmse: 3.4963	test's rmse: 3.68985
[140]	train's rmse: 3.49006	test's rmse: 3.68913
[150]	train's rmse: 3.48487	test's rmse: 3.68875
[160]	train's rmse: 3.48058	test's rmse: 3.68823
[170]	train's rmse: 3.47682	test's rmse: 3.68798
[180]	train's rmse: 3.47335	test's rmse: 3.68783
[190]	train's rmse: 3.46989	test's rm

2it [11:07, 343.91s/it]

Fold 2 RMSE : 3.687462014830268
Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.78498	test's rmse: 3.76116
[20]	train's rmse: 3.72843	test's rmse: 3.71982
[30]	train's rmse: 3.68461	test's rmse: 3.69104
[40]	train's rmse: 3.65066	test's rmse: 3.67243
[50]	train's rmse: 3.62491	test's rmse: 3.65964
[60]	train's rmse: 3.6039	test's rmse: 3.65103
[70]	train's rmse: 3.58561	test's rmse: 3.64435
[80]	train's rmse: 3.57114	test's rmse: 3.63982
[90]	train's rmse: 3.55783	test's rmse: 3.63653
[100]	train's rmse: 3.54722	test's rmse: 3.6341
[110]	train's rmse: 3.53792	test's rmse: 3.63187
[120]	train's rmse: 3.53177	test's rmse: 3.63035
[130]	train's rmse: 3.52599	test's rmse: 3.62903
[140]	train's rmse: 3.52091	test's rmse: 3.62805
[150]	train's rmse: 3.51593	test's rmse: 3.62708
[160]	train's rmse: 3.51102	test's rmse: 3.62661
[170]	train's rmse: 3.50831	test's rmse: 3.62636
[180]	train's rmse: 3.50527	test's rmse: 3.6263
[190]	train's rmse: 3.50118	test's 

3it [17:28, 355.00s/it]

Fold 3 RMSE : 3.6249218036682875
----------------------------------------------------------------------
RESULTS:
CV-average RMSE: 3.661
----------------------------------------------------------------------
CPU times: user 1h 3min 3s, sys: 2min 17s, total: 1h 5min 21s
Wall time: 17min 35s





In [36]:
def shallow_trees_with_dynamics_by_month_lag_full(return_preds=False):
    #open data
    with open(os.path.join(PATH_TO_DATA, 'train_df_2_csr.pkl'), 'rb') as pkl_f:
                X_train = pickle.load(pkl_f)
    with open(os.path.join(PATH_TO_DATA, 'test_df_2_csr.pkl'), 'rb') as pkl_f:
                X_test = pickle.load(pkl_f)
    with open(os.path.join(PATH_TO_DATA, 'target_df.pkl'), 'rb') as pkl_f:
                y_train = pickle.load(pkl_f)

    #parameters for lgb
    params ={'task': 'train',
             'objective': 'regression',
             'metric': 'rmse',

             'feature_fraction': 0.75, #1
             'subsample': 0.75, # 1

             'learning_rate': 0.025, # 0.1
             'max_depth': 7, # -1
             'num_leaves': 100,# 31

             'reg_alpha': 0, # 0
             'reg_lambda': 0, # 0

             'min_split_gain': 0, #0
             'min_data_in_leaf': 100, #20
             'min_sum_hessian_in_leaf': 1e-3, # 1e-3

             'verbose': -1,

             'num_iterations':10000,
             'early_stopping_round':100}

    #model training
    res, preds = kfold_lightgbm(X_train, y_train, X_test, 3, file=False, params=params)
    
    #printing results
    print('-' * 70)
    print('RESULTS:')
    print('CV-average RMSE: {}'.format(round(rmse_avg(res), 3)))
    print('-' * 70)
    
    #clearing memory
    del X_train, X_test, y_train
    gc.collect()
    
    if return_preds == True:
        return preds
    
    del res, preds
    gc.collect()

In [37]:
%%time
preds_2 = shallow_trees_with_dynamics_by_month_lag_full(return_preds=True)

0it [00:00, ?it/s]

Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.76609	test's rmse: 3.79301
[20]	train's rmse: 3.70621	test's rmse: 3.75358
[30]	train's rmse: 3.66153	test's rmse: 3.72668
[40]	train's rmse: 3.62769	test's rmse: 3.70882
[50]	train's rmse: 3.60109	test's rmse: 3.69722
[60]	train's rmse: 3.57824	test's rmse: 3.68933
[70]	train's rmse: 3.56032	test's rmse: 3.68419
[80]	train's rmse: 3.54384	test's rmse: 3.6808
[90]	train's rmse: 3.52936	test's rmse: 3.67741
[100]	train's rmse: 3.51589	test's rmse: 3.67628
[110]	train's rmse: 3.50521	test's rmse: 3.67464
[120]	train's rmse: 3.49649	test's rmse: 3.67315
[130]	train's rmse: 3.48829	test's rmse: 3.67214
[140]	train's rmse: 3.48298	test's rmse: 3.67123
[150]	train's rmse: 3.47904	test's rmse: 3.67056
[160]	train's rmse: 3.47534	test's rmse: 3.66993
[170]	train's rmse: 3.47186	test's rmse: 3.66956
[180]	train's rmse: 3.4686	test's rmse: 3.66925
[190]	train's rmse: 3.46555	test's rmse: 3.66901
[200]	train's rms

1it [07:05, 425.20s/it]

Fold 1 RMSE : 3.667999918916529
Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.75694	test's rmse: 3.81131
[20]	train's rmse: 3.69931	test's rmse: 3.77139
[30]	train's rmse: 3.65572	test's rmse: 3.7453
[40]	train's rmse: 3.62242	test's rmse: 3.72837
[50]	train's rmse: 3.59608	test's rmse: 3.71714
[60]	train's rmse: 3.57349	test's rmse: 3.70788
[70]	train's rmse: 3.55726	test's rmse: 3.70213
[80]	train's rmse: 3.5414	test's rmse: 3.69735
[90]	train's rmse: 3.52966	test's rmse: 3.69421
[100]	train's rmse: 3.5175	test's rmse: 3.6913
[110]	train's rmse: 3.50805	test's rmse: 3.68926
[120]	train's rmse: 3.50022	test's rmse: 3.68791
[130]	train's rmse: 3.49336	test's rmse: 3.68666
[140]	train's rmse: 3.48747	test's rmse: 3.68618
[150]	train's rmse: 3.48208	test's rmse: 3.68561
[160]	train's rmse: 3.47908	test's rmse: 3.68537
[170]	train's rmse: 3.47434	test's rmse: 3.68489
[180]	train's rmse: 3.47116	test's rmse: 3.68478
[190]	train's rmse: 3.46832	test's r

2it [13:43, 417.09s/it]

Fold 2 RMSE : 3.6836156372524846
Training until validation scores don't improve for 100 rounds.
[10]	train's rmse: 3.78488	test's rmse: 3.76077
[20]	train's rmse: 3.72845	test's rmse: 3.71901
[30]	train's rmse: 3.68496	test's rmse: 3.69069
[40]	train's rmse: 3.65111	test's rmse: 3.67166
[50]	train's rmse: 3.62498	test's rmse: 3.65891
[60]	train's rmse: 3.60332	test's rmse: 3.64973
[70]	train's rmse: 3.5843	test's rmse: 3.6433
[80]	train's rmse: 3.57031	test's rmse: 3.63845
[90]	train's rmse: 3.55711	test's rmse: 3.6344
[100]	train's rmse: 3.54606	test's rmse: 3.63158
[110]	train's rmse: 3.53799	test's rmse: 3.62944
[120]	train's rmse: 3.53052	test's rmse: 3.62785
[130]	train's rmse: 3.52449	test's rmse: 3.62636
[140]	train's rmse: 3.51972	test's rmse: 3.62554
[150]	train's rmse: 3.51415	test's rmse: 3.62484
[160]	train's rmse: 3.50913	test's rmse: 3.62425
[170]	train's rmse: 3.50517	test's rmse: 3.62372
[180]	train's rmse: 3.50124	test's rmse: 3.62338
[190]	train's rmse: 3.49796	test's

3it [19:07, 389.18s/it]

Fold 3 RMSE : 3.622573487255005
----------------------------------------------------------------------
RESULTS:
CV-average RMSE: 3.658
----------------------------------------------------------------------
CPU times: user 1h 8min 19s, sys: 2min 28s, total: 1h 10min 48s
Wall time: 19min 13s





In [38]:
preds = preds_1 * 0.5 + preds_2 * 0.5

In [39]:
submission = pd.read_csv(os.path.join(PATH_TO_DATA, 'sample_submission.csv'))
submission['target'] = preds
submission.to_csv(os.path.join(PATH_TO_DATA, 'submission.csv'), index=False)

## Conclusion
Please, consider ['Main Takeaways'](#main_takeaways) at the top of the page.