# Elo Merchant Category Recommendation - LynxKite feature boosting and LightGBM with StratifiedKFold
End date: _February 19, 2019._<br/>

This tutorial notebook is part of a series for the [Elo Merchant Category Recommendation](https://www.kaggle.com/c/elo-merchant-category-recommendation) contest organized by Elo, one of the largest payment brands in Brazil. Elo has built partnerships with merchants in order to offer promotions or discounts to cardholders. The objective of the competition is to identify and serve the most relevant opportunities to individuals by uncovering signals in customer loyalty. LynxKite does not yet support some of the data preprocessing steps, so they need to be done in Python. The input files are available from the [download](https://www.kaggle.com/c/elo-merchant-category-recommendation/data) section of the contest:

- **train.csv**,  **test.csv**: list of `card_ids` that can be used for training and testing
- **historical_transactions.csv**: contains up to 3 months' worth of transactions for every card at any of the provided `merchant_ids`
- **new_merchant_transactions.csv**: contains the transactions at new merchants (`merchant_ids` that this particular `card_id` has not yet visited) over a period of two months
- **merchants.csv**: contains aggregate information for each `merchant_id` represented in the data set

Inspired by [Feature engineering](https://github.com/zzsza/Play-Kaggle/blob/master/Elo-Merchant-Category-Recommendation/notebooks/03.Feature-Engineering.ipynb)

In [1]:
import gc
import time
import random
import warnings
import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, StratifiedKFold

warnings.filterwarnings("ignore")

random.seed(402)

In [2]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Starting memory usage: {:5.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Reduced memory usage: {:5.2f} MB ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
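
The range checks in `reduce_mem_usage` can be tried in isolation. The sketch below uses a hypothetical helper, `smallest_int_dtype`, which is not part of the notebook, to pick the narrowest integer type that covers a column's value range. The last line illustrates a caveat of the float branch: a value can pass the `float16` range check and still be rounded, because `float16` keeps only about three significant decimal digits.

```python
import numpy as np

def smallest_int_dtype(c_min, c_max):
    """Return the narrowest signed integer dtype that can hold [c_min, c_max].

    Hypothetical helper for illustration; it mirrors the integer branch
    of reduce_mem_usage.
    """
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= c_min and c_max <= info.max:
            return dtype
    raise OverflowError("value range does not fit into int64")

print(smallest_int_dtype(0, 100).__name__)      # fits into int8
print(smallest_int_dtype(-1, 40_000).__name__)  # 40,000 exceeds the int16 maximum of 32,767

# 1234.5678 is well inside the float16 range, but the stored value is rounded.
print(float(np.float16(1234.5678)))
```

If exact values matter for a float column, for example monetary amounts, it may be safer to stop downcasting at `float32`.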

## Feature engineering
Inspired by [Feature Engineering](https://github.com/zzsza/Play-Kaggle/blob/master/Elo-Merchant-Category-Recommendation/notebooks/04.Feature-Engineering_2nd.ipynb)

### Train & test data

In [None]:
df_train = pd.read_csv("input/train.csv", parse_dates=["first_active_month"], index_col="card_id")
df_train = reduce_mem_usage(df_train)
print("{:,} observations and {} features in train set.".format(df_train.shape[0], df_train.shape[1]))

In [None]:
df_test = pd.read_csv("input/test.csv", parse_dates=["first_active_month"], index_col="card_id")
df_test = reduce_mem_usage(df_test)
print("{:,} observations and {} features in test set.".format(df_test.shape[0], df_test.shape[1]))

In [None]:
df_train['first_active_month'] = pd.to_datetime(df_train['first_active_month'])
# Subtracting timestamps directly yields a timedelta Series, which supports .dt.days.
df_train['elapsed_days'] = (pd.Timestamp(2018, 2, 1) - df_train['first_active_month']).dt.days

In [None]:
df_test['first_active_month'] = pd.to_datetime(df_test['first_active_month'])
# Subtracting timestamps directly yields a timedelta Series, which supports .dt.days.
df_test['elapsed_days'] = (pd.Timestamp(2018, 2, 1) - df_test['first_active_month']).dt.days

In [None]:
target = df_train['target']
del df_train['target']

In [None]:
df_train[:3]

In [None]:
df_test[:3]

In [None]:
target[:3]

### Transactions

In [None]:
%%time
df_hist_trans = pd.read_csv('input/historical_transactions.csv', index_col="card_id", parse_dates=['purchase_date'])
df_hist_trans = reduce_mem_usage(df_hist_trans)
print('Number of historical transactions: {:,}'.format(len(df_hist_trans)))

In [None]:
%%time
df_new_trans = pd.read_csv('input/new_merchant_transactions.csv', index_col="card_id", parse_dates=['purchase_date'])
df_new_trans = reduce_mem_usage(df_new_trans)
print('Number of new transactions: {:,}'.format(len(df_new_trans)))

In [None]:
def create_date_features(df, source_column, prefix):
    # Derive calendar features from a datetime column; new columns are named '<prefix>_<part>'.
    df[prefix + '_year'] = df[source_column].dt.year
    df[prefix + '_month'] = df[source_column].dt.month
    df[prefix + '_day'] = df[source_column].dt.day
    df[prefix + '_hour'] = df[source_column].dt.hour
    # Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week is the replacement.
    df[prefix + '_weekofyear'] = df[source_column].dt.isocalendar().week.astype(int)
    df[prefix + '_dayofweek'] = df[source_column].dt.dayofweek
    df[prefix + '_quarter'] = df[source_column].dt.quarter
    
    return df

In [None]:
df_hist_trans['authorized_flag'] = df_hist_trans['authorized_flag'].map({'Y': 1, 'N': 0})
df_hist_trans['category_1'] = df_hist_trans['category_1'].map({'N': 0, 'Y': 1})
df_hist_trans = pd.get_dummies(df_hist_trans, columns=['category_2', 'category_3'])

df_hist_trans['purchase_date'] = pd.to_datetime(df_hist_trans['purchase_date'])
df_hist_trans = create_date_features(df_hist_trans, 'purchase_date', 'purchase')

In [None]:
df_new_trans['authorized_flag'] = df_new_trans['authorized_flag'].map({'Y': 1, 'N': 0})
df_new_trans['category_1'] = df_new_trans['category_1'].map({'N': 0, 'Y': 1})
df_new_trans = pd.get_dummies(df_new_trans, columns=['category_2', 'category_3'])

df_new_trans['purchase_date'] = pd.to_datetime(df_new_trans['purchase_date'])
df_new_trans = create_date_features(df_new_trans, 'purchase_date', 'purchase')

In [None]:
def get_time_of_month(day_of_month):
    if day_of_month < 10:
        time_of_month = 0 # Beginning
    elif day_of_month >= 10 and day_of_month < 20:
        time_of_month = 1 # Middle
    else:
        time_of_month = 2 # End
    return time_of_month

In [None]:
df_hist_trans['purchase_part_of_month'] = df_hist_trans['purchase_day'].apply(lambda x: get_time_of_month(x))

In [None]:
df_new_trans['purchase_part_of_month'] = df_new_trans['purchase_day'].apply(lambda x: get_time_of_month(x))
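
On tens of millions of transaction rows, a row-wise `apply` can be slow. As a sketch of an alternative, the same three buckets can be computed in one vectorized call with `np.digitize`; the toy `days` series below is made up for illustration.

```python
import numpy as np
import pandas as pd

days = pd.Series([1, 9, 10, 19, 20, 31])

# np.digitize returns the index of the bin each day falls into:
# 0 for day < 10 (beginning), 1 for 10 <= day < 20 (middle), 2 for day >= 20 (end).
part_of_month = pd.Series(np.digitize(days, bins=[10, 20]), index=days.index)

print(part_of_month.tolist())  # [0, 0, 1, 1, 2, 2]
```

The bin edges `[10, 20]` reproduce the thresholds used in `get_time_of_month`.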

In [None]:
df_hist_trans['month_diff'] = (pd.Timestamp(2018, 12, 1) - df_hist_trans['purchase_date'].dt.normalize()).dt.days // 30
df_hist_trans['month_diff'] += df_hist_trans['month_lag']

In [None]:
df_new_trans['month_diff'] = (pd.Timestamp(2018, 12, 1) - df_new_trans['purchase_date'].dt.normalize()).dt.days // 30
df_new_trans['month_diff'] += df_new_trans['month_lag']
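
A small worked example may make the `month_diff` arithmetic easier to follow. The two rows below are made up; the reference date and the formula are the ones used in the cells above.

```python
import pandas as pd

reference_date = pd.Timestamp(2018, 12, 1)  # same fixed date as in the cells above

toy = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2018-02-15", "2017-11-20"]),
    "month_lag": [-1, -4],
})

# Number of whole 30-day periods between the purchase and the fixed date:
# 289 days -> 9 periods, 376 days -> 12 periods.
toy["month_diff"] = (reference_date - toy["purchase_date"]).dt.days // 30
# Shift by month_lag: 9 + (-1) = 8 and 12 + (-4) = 8.
toy["month_diff"] += toy["month_lag"]

print(toy)
```

Both toy purchases end up with the same `month_diff` although they lie three months apart, because the difference in `month_lag` offsets the difference in purchase date.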

In [None]:
df_hist_trans = reduce_mem_usage(df_hist_trans)

In [None]:
df_new_trans = reduce_mem_usage(df_new_trans)

In [None]:
df_hist_trans[:3]

In [None]:
df_new_trans[:3]

In [None]:
%%time
df_train = pd.read_csv('input/train.csv')
df_test = pd.read_csv('input/test.csv')

df_train = reduce_mem_usage(df_train)
df_test = reduce_mem_usage(df_test)

In [None]:
%%time
df_hist_trans = pd.read_csv('input/historical_transactions.csv')
df_new_merchant_trans = pd.read_csv('input/new_merchant_transactions.csv')

df_hist_trans = reduce_mem_usage(df_hist_trans)
df_new_merchant_trans = reduce_mem_usage(df_new_merchant_trans)

In [None]:
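for df in [df_hist_trans, df_new_merchant_trans]:
    # Assign the result back; calling fillna(..., inplace=True) on a selected column
    # is not guaranteed to update the parent frame in recent pandas versions.
    df['category_2'] = df['category_2'].fillna(1.0)
    df['category_3'] = df['category_3'].fillna('A')
    df['merchant_id'] = df['merchant_id'].fillna('M_ID_00a6ca8a8a')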

What happens if we remove the missing data?

In [None]:
def get_new_columns(name,aggs):
    return [name + '_' + k + '_' + agg for k in aggs.keys() for agg in aggs[k]]

In [None]:
%%time
for df in [df_hist_trans, df_new_merchant_trans]:
    df['purchase_date'] = pd.to_datetime(df['purchase_date'])
    df['year'] = df['purchase_date'].dt.year
    df['weekofyear'] = df['purchase_date'].dt.isocalendar().week.astype(int)  # Series.dt.weekofyear was removed in pandas 2.0
    df['month'] = df['purchase_date'].dt.month
    df['dayofweek'] = df['purchase_date'].dt.dayofweek
    df['weekend'] = (df.purchase_date.dt.weekday >=5).astype(int)
    df['hour'] = df['purchase_date'].dt.hour
    df['authorized_flag'] = df['authorized_flag'].map({'Y':1, 'N':0})
    df['category_1'] = df['category_1'].map({'Y':1, 'N':0}) 
    #https://www.kaggle.com/c/elo-merchant-category-recommendation/discussion/73244
    df['month_diff'] = ((datetime.datetime.today() - df['purchase_date']).dt.days)//30
    df['month_diff'] += df['month_lag']

In [None]:
%%time
aggs = {}
for col in ['month','hour','weekofyear','dayofweek','year','subsector_id','merchant_id','merchant_category_id']:
    aggs[col] = ['nunique']

aggs['purchase_amount'] = ['sum','max','min','mean','var']
aggs['installments'] = ['sum','max','min','mean','var']
aggs['purchase_date'] = ['max','min']
aggs['month_lag'] = ['max','min','mean','var']
aggs['month_diff'] = ['mean']
aggs['authorized_flag'] = ['sum', 'mean']
aggs['weekend'] = ['sum', 'mean']
aggs['category_1'] = ['sum', 'mean']
aggs['card_id'] = ['size']

for col in ['category_2','category_3']:
    df_hist_trans[col+'_mean'] = df_hist_trans.groupby([col])['purchase_amount'].transform('mean')
    aggs[col+'_mean'] = ['mean']    

new_columns = get_new_columns('hist', aggs)
df_hist_trans_group = df_hist_trans.groupby('card_id').agg(aggs)
df_hist_trans_group.columns = new_columns
df_hist_trans_group.reset_index(drop=False,inplace=True)
df_hist_trans_group['hist_purchase_date_diff'] = (df_hist_trans_group['hist_purchase_date_max'] - df_hist_trans_group['hist_purchase_date_min']).dt.days
df_hist_trans_group['hist_purchase_date_average'] = df_hist_trans_group['hist_purchase_date_diff']/df_hist_trans_group['hist_card_id_size']
df_hist_trans_group['hist_purchase_date_uptonow'] = (datetime.datetime.today() - df_hist_trans_group['hist_purchase_date_max']).dt.days
df_train = df_train.merge(df_hist_trans_group,on='card_id', how='left')
df_test = df_test.merge(df_hist_trans_group,on='card_id', how='left')
del df_hist_trans_group;gc.collect()

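TODO: move the aggregation into a separate function and pass the parameters in as a collection, since the next cell repeats the previous one for the new-merchant transactions.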

In [None]:
%%time
aggs = {}
for col in ['month','hour','weekofyear','dayofweek','year','subsector_id','merchant_id','merchant_category_id']:
    aggs[col] = ['nunique']
aggs['purchase_amount'] = ['sum','max','min','mean','var']
aggs['installments'] = ['sum','max','min','mean','var']
aggs['purchase_date'] = ['max','min']
aggs['month_lag'] = ['max','min','mean','var']
aggs['month_diff'] = ['mean']
aggs['weekend'] = ['sum', 'mean']
aggs['category_1'] = ['sum', 'mean']
aggs['card_id'] = ['size']

for col in ['category_2', 'category_3']:
    df_new_merchant_trans[col+'_mean'] = df_new_merchant_trans.groupby([col])['purchase_amount'].transform('mean')
    aggs[col+'_mean'] = ['mean']
    
new_columns = get_new_columns('new_hist',aggs)
df_hist_trans_group = df_new_merchant_trans.groupby('card_id').agg(aggs)
df_hist_trans_group.columns = new_columns
df_hist_trans_group.reset_index(drop=False,inplace=True)
df_hist_trans_group['new_hist_purchase_date_diff'] = (df_hist_trans_group['new_hist_purchase_date_max'] - df_hist_trans_group['new_hist_purchase_date_min']).dt.days
df_hist_trans_group['new_hist_purchase_date_average'] = df_hist_trans_group['new_hist_purchase_date_diff']/df_hist_trans_group['new_hist_card_id_size']
df_hist_trans_group['new_hist_purchase_date_uptonow'] = (datetime.datetime.today() - df_hist_trans_group['new_hist_purchase_date_max']).dt.days
df_train = df_train.merge(df_hist_trans_group,on='card_id',how='left')
df_test = df_test.merge(df_hist_trans_group,on='card_id',how='left')
del df_hist_trans_group;gc.collect()
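
The two aggregation cells above differ only in the source frame, the column prefix, and one entry of `aggs`, so they could share a helper. The sketch below is one possible shape for it, run on made-up toy data; `aggregate_by_card` is a hypothetical name and not part of the original notebook.

```python
import pandas as pd

def aggregate_by_card(df, prefix, aggs):
    """Group df by card_id, apply aggs, and name the columns '<prefix>_<column>_<agg>'.

    Hypothetical helper for illustration only.
    """
    grouped = df.groupby("card_id").agg(aggs)
    # grouped.columns is a MultiIndex of (column, aggregation) pairs; flatten it.
    grouped.columns = [f"{prefix}_{col}_{agg}" for col, agg in grouped.columns]
    return grouped.reset_index()

toy = pd.DataFrame({
    "card_id": ["C_1", "C_1", "C_2"],
    "purchase_amount": [1.0, 3.0, 5.0],
    "merchant_id": ["M_1", "M_2", "M_1"],
})
aggs = {"purchase_amount": ["sum", "mean"], "merchant_id": ["nunique"]}

hist_features = aggregate_by_card(toy, "hist", aggs)
print(hist_features)
```

Building the names from the columns of the aggregated frame, rather than from the `aggs` dictionary, keeps the labels aligned with the data even if the column order ever differs from the dictionary order.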

In [None]:
del df_hist_trans
gc.collect()

del df_new_merchant_trans
gc.collect()

df_train[:5]

In [None]:
df_train['outliers'] = 0
df_train.loc[df_train['target'] < -30, 'outliers'] = 1
df_train['outliers'].value_counts()

In [None]:
%%time
for df in [df_train, df_test]:
    df['first_active_month'] = pd.to_datetime(df['first_active_month'])
    df['dayofweek'] = df['first_active_month'].dt.dayofweek
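    # Series.dt.weekofyear was removed in pandas 2.0; cast to float so a missing date becomes NaN instead of raising.
    df['weekofyear'] = df['first_active_month'].dt.isocalendar().week.astype('float')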
    df['month'] = df['first_active_month'].dt.month
    df['elapsed_time'] = (datetime.datetime.today() - df['first_active_month']).dt.days
    df['hist_first_buy'] = (df['hist_purchase_date_min'] - df['first_active_month']).dt.days
    df['new_hist_first_buy'] = (df['new_hist_purchase_date_min'] - df['first_active_month']).dt.days
    for f in ['hist_purchase_date_max','hist_purchase_date_min','new_hist_purchase_date_max', 'new_hist_purchase_date_min']:
        df[f] = df[f].astype(np.int64) * 1e-9
    df['card_id_total'] = df['new_hist_card_id_size']+df['hist_card_id_size']
    df['purchase_amount_total'] = df['new_hist_purchase_amount_sum']+df['hist_purchase_amount_sum']

for f in ['feature_1', 'feature_2', 'feature_3']:
    order_label = df_train.groupby([f])['outliers'].mean()
    df_train[f] = df_train[f].map(order_label)
    df_test[f] = df_test[f].map(order_label)
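
The last loop replaces each category of `feature_1` to `feature_3` with the share of outliers observed for that category in the training set. A minimal sketch on made-up data shows what the mapping produces; the `toy_*` frames and the `feature_1_encoded` column are assumptions for illustration.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "feature_1": [1, 1, 2, 2, 2, 3],
    "outliers":  [0, 1, 0, 0, 0, 1],
})
toy_test = pd.DataFrame({"feature_1": [1, 2, 3, 4]})

# Share of outliers per category, learned from the training rows only.
outlier_rate = toy_train.groupby("feature_1")["outliers"].mean()

toy_train["feature_1_encoded"] = toy_train["feature_1"].map(outlier_rate)
# A category that never occurs in the training data (here: 4) maps to NaN.
toy_test["feature_1_encoded"] = toy_test["feature_1"].map(outlier_rate)

print(toy_test)
```

Because the rates are computed on the full training set, each training row contributes to its own encoding; computing them inside each cross-validation fold would be a stricter variant.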

In [None]:
df_train_columns = [c for c in df_train.columns if c not in ['card_id', 'first_active_month','target','outliers']]
target = df_train['target']
del df_train['target']

In [None]:
%%time
param = {
    'num_leaves': 31,
    'min_data_in_leaf': 30,  # "min_child_samples" is an alias of this parameter, so it is set only once
    'objective':'regression',
    'max_depth': -1,
    'learning_rate': 0.01,
    "boosting": "gbdt",
    "feature_fraction": 0.9,
    "bagging_freq": 1,
    "bagging_fraction": 0.9 ,
    "bagging_seed": 11,
    "metric": 'rmse',
    "lambda_l1": 0.1, 
    "verbosity": -1,
    "nthread": -1,
    "random_state": 402
}
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=402)
oof = np.zeros(len(df_train))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train, df_train['outliers'].values)):
    print("\nFold {}.".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train.iloc[val_idx][df_train_columns], label=target.iloc[val_idx])#, categorical_feature=categorical_feats)

    num_round = 10000
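    # LightGBM 4.x removed the verbose_eval and early_stopping_rounds arguments of lgb.train; callbacks replace them.
    clf = lgb.train(param, trn_data, num_round, valid_sets=[trn_data, val_data],
                    callbacks=[lgb.early_stopping(stopping_rounds=200), lgb.log_evaluation(period=100)])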
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = df_train_columns
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[df_train_columns], num_iteration=clf.best_iteration) / folds.n_splits
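
`StratifiedKFold` needs discrete labels, so the loop above stratifies on the binary `outliers` flag rather than on the continuous target. The sketch below, on synthetic data, shows the effect: every validation fold receives the same number of rare outlier rows. The array sizes and the outlier count are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(402)
n_rows = 1000
X = rng.normal(size=(n_rows, 3))

# Toy outlier flag: exactly 10 of the 1,000 rows are marked as outliers.
outliers = np.zeros(n_rows, dtype=int)
outliers[rng.choice(n_rows, size=10, replace=False)] = 1

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=402)
outliers_per_fold = [int(outliers[val_idx].sum())
                     for _, val_idx in folds.split(X, outliers)]

print(outliers_per_fold)  # [2, 2, 2, 2, 2]
```

With a plain `KFold`, the ten outliers could land unevenly, which would make the per-fold RMSE harder to compare.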

In [None]:
cv_score = np.sqrt(mean_squared_error(oof, target))
print("CV score: {:.6f}".format(cv_score))

In [None]:
cols = (feature_importance_df[["Feature", "importance"]]
        .groupby("Feature")
        .mean()
        .sort_values(by="importance", ascending=False)[:1000].index)

best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]

plt.figure(figsize=(14,25))
sns.barplot(x="importance", y="Feature",
            data=best_features.sort_values(by="importance", ascending=False))
plt.title('LightGBM Features (avg over folds)')
plt.tight_layout()
plt.savefig('lgbm_importances.png')

In [None]:
sub_df = pd.DataFrame({
    "card_id": df_test["card_id"].values
})

sub_df["target"] = predictions
sub_df.to_csv("output/regression_{:.6f}.csv".format(cv_score), index=False)

#### Aggregation
##### Authorized flag

In [None]:
agg_func = {
    'authorized_flag': ['mean']
}

df_af_mean = df_hist_trans.groupby(['card_id']).agg(agg_func)
df_af_mean.columns = ['_'.join(col).strip() for col in df_af_mean.columns.values]

In [None]:
df_af_mean[:3]

In [None]:
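# .copy() gives each subset its own data, so adding columns later does not trigger SettingWithCopyWarning.
df_auth_trans = df_hist_trans[df_hist_trans['authorized_flag'] == 1].copy()
df_hist_trans = df_hist_trans[df_hist_trans['authorized_flag'] == 0].copy()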

In [None]:
print('There are {:,} authorized and {:,} denied records in the historical set.'.format(
    df_auth_trans.shape[0],
    df_hist_trans.shape[0]
))

In [None]:
df_hist_trans['purchase_month'] = df_hist_trans['purchase_date'].dt.month
df_auth_trans['purchase_month'] = df_auth_trans['purchase_date'].dt.month
df_new_trans['purchase_month'] = df_new_trans['purchase_date'].dt.month

##### Other fields

In [None]:
def aggregate_transactions(df):
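    # Convert timestamps to seconds since the epoch so they can be aggregated numerically.
    # Note that this overwrites the column in the caller's DataFrame.
    df['purchase_date'] = df['purchase_date'].astype('int64') * 1e-9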
    
    agg_func = {
        'category_1': ['sum', 'mean'],
        'category_2_1.0': ['mean'],
        'category_2_2.0': ['mean'],
        'category_2_3.0': ['mean'],
        'category_2_4.0': ['mean'],
        'category_2_5.0': ['mean'],
        'category_3_A': ['mean'],
        'category_3_B': ['mean'],
        'category_3_C': ['mean'],
        'city_id': ['nunique'],

        'installments': ['sum', 'mean', 'max', 'min', 'std'],
        
        'merchant_id': ['nunique'],
        'merchant_category_id': ['nunique'],
        'month_lag': ['mean', 'max', 'min', 'std'],
        'month_diff': ['mean'],
        
        'purchase_amount': ['sum', 'mean', 'max', 'min', 'std'],
        'purchase_date': [np.ptp, 'min', 'max'],
        'purchase_month': ['mean', 'max', 'min', 'std'],

        'state_id': ['nunique'],
        'subsector_id': ['nunique']
    }

    df_agg = df.groupby(['card_id']).agg(agg_func)
    df_agg.columns = ['_'.join(col).strip() for col in df_agg.columns.values]
    df_agg.reset_index(inplace=True)

    df_t = (df.groupby('card_id').size().reset_index(name='transactions_count'))

    df_agg = pd.merge(df_t, df_agg, on='card_id', how='left')

    return df_agg

In [None]:
%%time
df_hist_trans_agg = aggregate_transactions(df_hist_trans)
df_hist_trans_agg.columns = ['hist_' + c if c != 'card_id' else c for c in df_hist_trans_agg.columns]

In [None]:
df_hist_trans_agg[:3]

In [None]:
%%time
df_auth_trans_agg = aggregate_transactions(df_auth_trans)
df_auth_trans_agg.columns = ['auth_' + c if c != 'card_id' else c for c in df_auth_trans_agg.columns]

In [None]:
df_auth_trans_agg[:3]

In [None]:
%%time
df_new_trans_agg = aggregate_transactions(df_new_trans)
df_new_trans_agg.columns = ['new_' + c if c != 'card_id' else c for c in df_new_trans_agg.columns]

In [None]:
df_new_trans_agg[:3]

##### Monthly aggregation

In [None]:
def aggregate_per_month(df):
    grouped = df.groupby(['card_id', 'month_lag'])

    agg_func = {
        'purchase_amount': ['count', 'sum', 'mean', 'min', 'max', 'std'],
        'installments': ['count', 'sum', 'mean', 'min', 'max', 'std'],
    }

    df_agg = grouped.agg(agg_func)
    df_agg.columns = ['_'.join(col).strip() for col in df_agg.columns.values]
    df_agg.reset_index(inplace=True)

    df_out = df_agg.groupby('card_id').agg(['mean', 'std'])
    df_out.columns = ['_'.join(col).strip() for col in df_out.columns.values]
    df_out.reset_index(inplace=True)
    
    return df_out

In [None]:
df_auth_trans_monthly_agg = aggregate_per_month(df_auth_trans) 
df_auth_trans_monthly_agg[:5]

In [None]:
def successive_aggregates(df, field1, field2):
    t = df.groupby(['card_id', field1])[field2].mean()
    u = pd.DataFrame(t).reset_index().groupby('card_id')[field2].agg(['mean', 'min', 'max', 'std'])
    u.columns = [field1 + '_' + field2 + '_' + col for col in u.columns.values]
    u.reset_index(inplace=True)
    return u
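
`successive_aggregates` works in two stages: it first averages `field2` within each (`card_id`, `field1`) group and then summarises those group means per card. A sketch on made-up data, with the two stages written out separately:

```python
import pandas as pd

toy = pd.DataFrame({
    "card_id":         ["C_1", "C_1", "C_1", "C_2"],
    "category_1":      [0, 0, 1, 0],
    "purchase_amount": [1.0, 3.0, 6.0, 5.0],
})

# Stage 1: mean purchase amount per (card, category_1) pair.
stage_1 = (toy.groupby(["card_id", "category_1"])["purchase_amount"]
              .mean()
              .reset_index())

# Stage 2: summarise those per-category means for each card.
stage_2 = stage_1.groupby("card_id")["purchase_amount"].agg(["mean", "min", "max", "std"])

print(stage_2)
```

Card `C_1` has group means 2.0 and 6.0, so its second-stage mean is 4.0. Card `C_2` has a single group, so its `std` is NaN; such gaps are left for the model to handle as missing values.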

In [None]:
df_extra = successive_aggregates(df_new_trans, 'category_1', 'purchase_amount')
df_extra = df_extra.merge(successive_aggregates(df_new_trans, 'installments', 'purchase_amount'), on='card_id', how='left')
df_extra = df_extra.merge(successive_aggregates(df_new_trans, 'city_id', 'purchase_amount'), on='card_id', how='left')
df_extra = df_extra.merge(successive_aggregates(df_new_trans, 'category_1', 'installments'), on='card_id', how='left')

#### Merging

In [None]:
df_train = pd.merge(df_train, df_hist_trans_agg, on='card_id', how='left')
df_test = pd.merge(df_test, df_hist_trans_agg, on='card_id', how='left')

In [None]:
df_train = pd.merge(df_train, df_auth_trans_agg, on='card_id', how='left')
df_test = pd.merge(df_test, df_auth_trans_agg, on='card_id', how='left')

In [None]:
df_train = pd.merge(df_train, df_new_trans_agg, on='card_id', how='left')
df_test = pd.merge(df_test, df_new_trans_agg, on='card_id', how='left')

In [None]:
df_train = pd.merge(df_train, df_auth_trans_monthly_agg, on='card_id', how='left')
df_test = pd.merge(df_test, df_auth_trans_monthly_agg, on='card_id', how='left')

In [None]:
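# Merge the per-card mean of authorized_flag; merging the raw df_auth_trans rows would duplicate every card.
df_train = pd.merge(df_train, df_af_mean, on='card_id', how='left')
df_test = pd.merge(df_test, df_af_mean, on='card_id', how='left')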

In [None]:
df_train = pd.merge(df_train, df_extra, on='card_id', how='left')
df_test = pd.merge(df_test, df_extra, on='card_id', how='left')

## Training

In [None]:
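# Exclude identifiers and anything derived from the target, so the label does not leak into the features.
features = [c for c in df_train.columns if c not in ['card_id', 'first_active_month', 'target', 'outliers']]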
categorical_feats = ['feature_2', 'feature_3']

In [None]:
param = {
    "bagging_fraction": 0.7083,
    "bagging_freq": 1,
    "bagging_seed": 11,
    "boosting": "gbdt",
    "feature_fraction": 0.7522,
    "lambda_l1": 0.2634,
    'learning_rate': 0.005,
    'max_depth': 9,
    "metric": 'rmse',
    'min_data_in_leaf': 149, 
    'num_leaves': 111,
    'objective':'regression',
    "random_state": 133,
    "verbosity": -1
}

In [None]:
%%time
folds = KFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(df_train))
predictions = np.zeros(len(df_test))
start = time.time()
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train.values, target.values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][features],
                           label=target.iloc[trn_idx],
                           categorical_feature=categorical_feats
                          )
    val_data = lgb.Dataset(df_train.iloc[val_idx][features],
                           label=target.iloc[val_idx],
                           categorical_feature=categorical_feats
                          )

    num_round = 10000
    # LightGBM 4.x removed the verbose_eval and early_stopping_rounds arguments of lgb.train; callbacks replace them.
    clf = lgb.train(param,
                    trn_data,
                    num_round,
                    valid_sets = [trn_data, val_data],
                    callbacks=[lgb.early_stopping(stopping_rounds=200),
                               lgb.log_evaluation(period=100)])
    
    # Out-of-fold predictions for the validation rows of this fold.
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    # Average the test predictions over all folds.
    predictions += clf.predict(df_test[features], num_iteration=clf.best_iteration) / folds.n_splits

[Just Train Data - LGB & XGB & CatBoost w/ Blending](https://www.kaggle.com/silverstone1903/just-train-data-lgb-xgb-catboost-w-blending/data)<br/>
[MultiModel + RIDGE + STACKING](https://www.kaggle.com/ashishpatel26/rmse-3-66-multimodel-ridge-stacking)