# Elo Merchant Category Recommendation - Outlier detection, ensembling
End date: _February 19, 2019._<br/>

This tutorial notebook is part of a series for the [Elo Merchant Category Recommendation](https://www.kaggle.com/c/elo-merchant-category-recommendation) contest organized by Elo, one of the largest payment brands in Brazil. Elo has built partnerships with merchants in order to offer promotions or discounts to cardholders. The objective of the competition is to identify and serve the most relevant opportunities to individuals by uncovering signals in customer loyalty. LynxKite does not yet support some of the data preprocessing steps, so they are done in Python. The input files are available from the [download](https://www.kaggle.com/c/elo-merchant-category-recommendation/data) section of the contest:

- **train.csv**, **test.csv**: list of `card_ids` that can be used for training and testing
- **historical_transactions.csv**: contains up to 3 months' worth of transactions for every card at any of the provided `merchant_ids`
- **new_merchant_transactions.csv**: contains the transactions at new merchants (`merchant_ids` that this particular `card_id` has not yet visited) over a period of two months
- **merchants.csv**: contains aggregate information for each `merchant_id` represented in the data set

In [1]:
import gc
import time
import warnings
import datetime
import calendar
import statistics
import numpy as np
import pandas as pd
import seaborn as sns
import lightgbm as lgb
from datetime import date
import matplotlib.pyplot as plt
from lightgbm import LGBMClassifier
import matplotlib.gridspec as gridspec
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

%matplotlib inline
warnings.simplefilter(action='ignore', category=FutureWarning)
gc.enable()

In [2]:
def reduce_mem_usage(df, verbose=True):
    """Downcast each numeric column to the smallest dtype that holds its value range."""
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Starting memory usage: {:5.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Reduced memory usage: {:5.2f} MB ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
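
The downcasting idea behind `reduce_mem_usage` can also be sketched with pandas' built-in `pd.to_numeric(..., downcast=...)`. This is a minimal illustration on a made-up frame (`df_demo` and its column names are assumptions, not part of the contest data). Unlike the function above, this variant never goes below `float32` for floating-point columns.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names are made up for illustration.
df_demo = pd.DataFrame({
    "small_int": np.arange(100, dtype=np.int64),             # values fit in int8
    "big_int": np.arange(100, dtype=np.int64) * 1_000_000,   # values need int32
    "ratio": np.arange(100) * 0.25,                          # float64
})

before = df_demo.memory_usage(deep=True).sum()

# downcast="integer" picks the smallest integer dtype that holds every value;
# downcast="float" goes down to float32 at most.
df_demo["small_int"] = pd.to_numeric(df_demo["small_int"], downcast="integer")
df_demo["big_int"] = pd.to_numeric(df_demo["big_int"], downcast="integer")
df_demo["ratio"] = pd.to_numeric(df_demo["ratio"], downcast="float")

after = df_demo.memory_usage(deep=True).sum()

print(df_demo.dtypes)
print("bytes before: {}, after: {}".format(before, after))
```

Keeping floats at `float32` or wider trades some memory for precision, which can matter for columns such as `target`.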

## Input data preparation
### Transactions

In [3]:
df_new_trans = pd.read_csv("preprocessed/trans_merch_new_agg.csv")
df_new_trans = reduce_mem_usage(df_new_trans)

df_hist_trans = pd.read_csv("preprocessed/trans_merch_hist_agg.csv")
df_hist_trans = reduce_mem_usage(df_hist_trans)

Starting memory usage: 289.84 MB
Reduced memory usage: 70.52 MB (75.7% reduction)
Starting memory usage: 325.36 MB
Reduced memory usage: 96.86 MB (70.2% reduction)


In [4]:
df_hist_trans.drop(['Unnamed: 0'], inplace=True, axis=1)
df_new_trans.drop(['Unnamed: 0'], inplace=True, axis=1)

### Train and test data

In [5]:
df_train = pd.read_csv("preprocessed/train_parsed_outlier_marked.csv", index_col="card_id")
df_test = pd.read_csv("preprocessed/test_parsed.csv", index_col="card_id")

### LynxKite export

In [6]:
df_lk_train = pd.read_csv("LynxKite_export/LynxKite_outlier_viral_modeling_train.csv", index_col="card_id")
df_lk_test = pd.read_csv("LynxKite_export/LynxKite_outlier_viral_modeling_test.csv", index_col="card_id")

In [7]:
df_lk_train.drop(['new_id', 'outlier', 'target', 'type'], inplace=True, axis=1)
df_lk_test.drop(['new_id', 'outlier', 'target', 'type'], inplace=True, axis=1)

### Merging

In [8]:
df_train = pd.merge(df_train, df_lk_train, on='card_id', how='left')
df_test = pd.merge(df_test, df_lk_test, on='card_id', how='left')

In [9]:
df_train = pd.merge(df_train, df_hist_trans, on='card_id', how='left')
df_test = pd.merge(df_test, df_hist_trans, on='card_id', how='left')

df_train = pd.merge(df_train, df_new_trans, on='card_id', how='left')
df_test = pd.merge(df_test, df_new_trans, on='card_id', how='left')

In [10]:
df_train = reduce_mem_usage(df_train)
df_test = reduce_mem_usage(df_test)

Starting memory usage: 212.97 MB
Reduced memory usage: 127.67 MB (40.1% reduction)
Starting memory usage: 128.51 MB
Reduced memory usage: 81.70 MB (36.4% reduction)


## Outlier selection
Reference for the feature selection methods used below: https://www.kaggle.com/sz8416/6-ways-for-feature-selection

### Marking the outliers

In [11]:
df_train['outlier'] = np.where(df_train['target']<-30, 1, 0)

In [12]:
print('There are {:,} marked outliers in the training set.'.format(len(df_train[df_train['outlier'] == 1]['outlier'])))

There are 2,207 marked outliers in the training set.
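
The thresholding step can be illustrated on a toy frame. The values below are made up; only the `target < -30` rule comes from the cell above. Counting the flagged rows also shows how rare the positive class is: in this notebook 2,207 of the 201,917 training rows are marked, which is roughly 1.1%.

```python
import pandas as pd

# Made-up target values; two of them fall below the -30 threshold.
df_demo = pd.DataFrame({"target": [0.39, -33.22, 1.25, -0.82, -33.22, 0.05]})

# Same rule as above: 1 for an outlier, 0 otherwise.
df_demo["outlier"] = (df_demo["target"] < -30).astype(int)

n_outliers = int(df_demo["outlier"].sum())
share = n_outliers / len(df_demo)
print("{} outliers, {:.1%} of the rows".format(n_outliers, share))
```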


In [13]:
columns2drop = ['card_id', 'target', 'first_active_month', 'viral_outlier_test', 'viral_outlier_train', 'viral_roles',
                'hist_avg_purchases_lag3_sum', 'hist_avg_purchases_lag3_mean', 'hist_avg_purchases_lag6_sum',
                'hist_avg_purchases_lag6_mean', 'hist_avg_purchases_lag12_sum', 'hist_avg_purchases_lag12_mean',
                'viral_outlier_spread_over_iterations', 'hist_transactions_count', 'hist_purchase_year_min',
                'new_transactions_count', 'new_authorized_flag_mean']

features_train = [c for c in df_train.columns if c not in columns2drop]

### Normalization

In [14]:
df_train_clean = df_train.dropna(how='any', axis=0, subset=features_train)[features_train]

In [15]:
def normalize(df):
    """Standardize every column in place to zero mean and unit (sample) standard deviation."""
    for c in df.columns:
        mean = statistics.mean(df[c])
        std = statistics.stdev(df[c])

        df.loc[:, c] = (df[c] - mean)/std
        print('{}: {:.4f} ({:.4f})'.format(c, mean, std))

    return df
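
A vectorized sketch of the same z-score standardization follows. It also shows one common way to apply the training statistics to a second data set. The frames `train_demo` and `test_demo` are hypothetical, and reusing the training mean and standard deviation for test data is an assumption about good practice, not something the notebook does.

```python
import pandas as pd

# Toy data; column names are hypothetical.
train_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})
test_demo = pd.DataFrame({"a": [2.5, 5.0], "b": [25.0, 0.0]})

# Statistics come from the training data only.
mean = train_demo.mean()
std = train_demo.std(ddof=1)  # sample standard deviation, like statistics.stdev

train_scaled = (train_demo - mean) / std
test_scaled = (test_demo - mean) / std  # reuse the training mean and std

print(train_scaled.round(4))
print(test_scaled.round(4))
```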

In [16]:
y = df_train_clean['outlier']
print('There are {:,} records in the outlier set.'.format(len(y)))

df_train_clean.drop(['outlier'], axis=1, inplace=True)
X = normalize(df_train_clean)
print('There are {:,} features and {:,} items in the training set.'.format(X.shape[1], X.shape[0]))

There are 146,616 records in the outlier set.
feature_1: 3.0871 (1.1958)
feature_2: 1.7319 (0.7497)
feature_3: 0.5536 (0.4971)
elapsed_days: 361.0683 (285.5164)
year: 2016.5552 (0.7572)
month: 7.5153 (3.3234)
days_feature1: 1152.3708 (1082.4652)
days_feature2: 657.3871 (707.0489)
days_feature3: 226.1918 (318.1874)
number_of_transactions_x: 100.9815 (113.0187)
merch_seg_viral_outlier_average_after_iteration_1_most_common: 0.0028 (0.0324)
merch_seg_viral_outlier_average_after_iteration_2_most_common: 0.0018 (0.0244)
merch_seg_viral_outlier_average_after_iteration_3_most_common: 0.0018 (0.0244)
merch_seg_viral_outlier_average_after_iteration_4_most_common: 0.0018 (0.0244)
merch_seg_viral_outlier_average_after_iteration_5_most_common: 0.0018 (0.0244)
merch_seg_viral_outlier_standard_deviation_after_iteration_1_most_common: 0.0102 (0.0456)
merch_seg_viral_outlier_standard_deviation_after_iteration_2_most_common: 0.0075 (0.0353)
merch_seg_viral_outlier_standard_deviation_after_iteration_3_mo

new_category_2_merch_mean: 2.2377 (1.4758)
new_category_3_sum: 4.5775 (6.9506)
new_category_3_mean: 0.6156 (0.6155)
new_category_4_sum: 3.9364 (5.1460)
new_category_4_mean: 0.4770 (0.3884)
new_city_id_trans_nunique: 2.5839 (1.7275)
new_city_id_merch_nunique: 2.3113 (1.3520)
new_installments_sum: 5.3500 (8.7879)
new_installments_median: 0.5993 (0.7816)
new_installments_mean: 0.6973 (0.9062)
new_installments_max: 1.5637 (3.5770)
new_installments_min: 0.2152 (0.7373)
new_installments_std: 0.5075 (1.0764)
new_merchant_id_nunique: 7.8677 (6.7754)
new_merchant_category_id_trans_nunique: 6.2233 (4.1856)
new_merchant_category_id_merch_nunique: 6.1393 (4.1487)
new_merchant_group_id_nunique: 6.2793 (5.2163)
new_month_lag_max: 1.8792 (0.3258)
new_month_lag_min: 1.0982 (0.2976)
new_month_lag_mean: 1.4756 (0.2884)
new_month_lag_var: 0.2096 (0.1335)
new_most_recent_sales_range_sum: 15.9722 (13.9074)
new_most_recent_sales_range_mean: 2.0686 (0.6674)
new_most_recent_sales_range_max: 3.4746 (0.8204)
...

In [17]:
X.shape, y.shape

((146616, 274), (146616,))

In [18]:
X.isnull().sum()

feature_1                                                                   0
feature_2                                                                   0
feature_3                                                                   0
elapsed_days                                                                0
year                                                                        0
month                                                                       0
days_feature1                                                               0
days_feature2                                                               0
days_feature3                                                               0
number_of_transactions_x                                                    0
merch_seg_viral_outlier_average_after_iteration_1_most_common               0
merch_seg_viral_outlier_average_after_iteration_2_most_common               0
merch_seg_viral_outlier_average_after_iteration_3_most_common   

### Feature selection

#### Pearson correlation

In [19]:
def cor_selector(X, y, limit=100):
    # Pearson correlation between each feature and the target
    cor_list = [np.corrcoef(X[col], y)[0, 1] for col in X.columns]
    # Constant columns yield NaN; treat them as uncorrelated
    cor_list = [0 if np.isnan(cor) else cor for cor in cor_list]
    # Keep the `limit` features with the largest absolute correlation
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-limit:]].columns.tolist()
    cor_support = [col in cor_feature for col in X.columns]
    return cor_support, cor_feature, cor_list

In [20]:
cor_support, cor_feature, cor_value = cor_selector(X, y)
print(str(len(cor_feature)), 'selected features')

100 selected features
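
To see what ranking by absolute Pearson correlation does, here is a small self-contained sketch on synthetic data. The columns `signal`, `noise` and `constant` are invented for the illustration. The constant column produces a NaN correlation, which is replaced by 0 as in `cor_selector`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
y_demo = rng.integers(0, 2, size=n)

# Synthetic features: one informative, one random, one constant.
X_demo = pd.DataFrame({
    "signal": y_demo + rng.normal(0, 0.1, n),
    "noise": rng.normal(0, 1, n),
    "constant": np.ones(n),
})

# A constant column has zero variance, so its correlation is NaN (0/0).
with np.errstate(invalid="ignore", divide="ignore"):
    corr = pd.Series(
        [np.corrcoef(X_demo[col], y_demo)[0, 1] for col in X_demo.columns],
        index=X_demo.columns,
    )
corr = corr.fillna(0.0)

# Rank by absolute correlation and keep the two strongest features.
top = corr.abs().sort_values(ascending=False).index[:2].tolist()
print(corr.round(3))
print(top)
```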


#### Chi-2

In [21]:
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest

X_norm = MinMaxScaler().fit_transform(X)
chi_selector = SelectKBest(chi2, k=100)
chi_selector.fit(X_norm, y)



SelectKBest(k=100, score_func=<function chi2 at 0x7fc17da049d8>)

In [22]:
chi_support = chi_selector.get_support()
chi_feature = X.loc[:,chi_support].columns.tolist()
print(str(len(chi_feature)), 'selected features')

100 selected features
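
The `MinMaxScaler` step matters because scikit-learn's `chi2` score only accepts non-negative feature values. A minimal sketch on synthetic data (the two columns are invented for the illustration) shows the scaler mapping a feature with negative values into [0, 1] before `SelectKBest` picks the informative column.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
y_demo = np.repeat([0, 1], 100)

X_demo = np.column_stack([
    y_demo * 2.0 - 1.0 + rng.normal(0, 0.3, 200),  # informative, contains negatives
    rng.normal(0, 1, 200),                         # pure noise
])

# Rescale every column to the [0, 1] range so chi2 can be applied.
X_scaled = MinMaxScaler().fit_transform(X_demo)

selector = SelectKBest(chi2, k=1).fit(X_scaled, y_demo)
mask = selector.get_support()
print(mask)
```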


#### RFE

In [23]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=5, step=10, verbose=5)
rfe_selector.fit(X_norm, y)

Fitting estimator with 274 features.
Fitting estimator with 264 features.
Fitting estimator with 254 features.
Fitting estimator with 244 features.
Fitting estimator with 234 features.
Fitting estimator with 224 features.
Fitting estimator with 214 features.
Fitting estimator with 204 features.
Fitting estimator with 194 features.
Fitting estimator with 184 features.
Fitting estimator with 174 features.
Fitting estimator with 164 features.
Fitting estimator with 154 features.
Fitting estimator with 144 features.
Fitting estimator with 134 features.
Fitting estimator with 124 features.
Fitting estimator with 114 features.
Fitting estimator with 104 features.
Fitting estimator with 94 features.
Fitting estimator with 84 features.
Fitting estimator with 74 features.
Fitting estimator with 64 features.
Fitting estimator with 54 features.
Fitting estimator with 44 features.
Fitting estimator with 34 features.
Fitting estimator with 24 features.
Fitting estimator with 14 features.


RFE(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
  n_features_to_select=5, step=10, verbose=5)

In [24]:
rfe_support = rfe_selector.get_support()
rfe_feature = X.loc[:,rfe_support].columns.tolist()
print(str(len(rfe_feature)), 'selected features')

5 selected features
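
The log above follows from the `step=10` and `n_features_to_select=5` settings: each round drops up to 10 features until 5 remain. The helper below, `rfe_schedule`, is a hypothetical illustration written for this note (it is not part of scikit-learn) that reproduces the sequence of feature counts at which the estimator is fitted.

```python
def rfe_schedule(n_features, n_select, step):
    """Return the feature counts at which the estimator is refit during elimination."""
    sizes = []
    remaining = n_features
    while remaining > n_select:
        sizes.append(remaining)
        # Never remove more features than needed to reach the target count.
        remaining -= min(step, remaining - n_select)
    return sizes


sizes = rfe_schedule(n_features=274, n_select=5, step=10)
print(len(sizes), sizes[0], sizes[-1])
```

The last elimination round starts from 14 features and removes only 9 of them, which leaves the 5 requested features.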


#### Embedded

In [25]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

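# penalty="l1" requires a solver that supports it (liblinear, the default in the scikit-learn version used here)
embeded_lr_selector = SelectFromModel(LogisticRegression(penalty="l1"), threshold='1.25*median')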
embeded_lr_selector.fit(X_norm, y)

SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
        max_features=None, norm_order=1, prefit=False,
        threshold='1.25*median')

In [26]:
embeded_lr_support = embeded_lr_selector.get_support()
embeded_lr_feature = X.loc[:,embeded_lr_support].columns.tolist()
print(str(len(embeded_lr_feature)), 'selected features')

274 selected features


#### Random Forest

In [27]:
embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=5), threshold='1.25*median')
embeded_rf_selector.fit(X, y)

SelectFromModel(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
        max_features=None, norm_order=1, prefit=False,
        threshold='1.25*median')

In [28]:
embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = X.loc[:,embeded_rf_support].columns.tolist()
print(str(len(embeded_rf_feature)), 'selected features')

100 selected features


#### LightGBM

In [29]:
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMClassifier

lgbc=LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=32, colsample_bytree=0.2,
            reg_alpha=3, reg_lambda=1, min_split_gain=0.01, min_child_weight=40)

embeded_lgb_selector = SelectFromModel(lgbc, threshold='1.25*median')
embeded_lgb_selector.fit(X, y)

SelectFromModel(estimator=LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=0.2,
        importance_type='split', learning_rate=0.05, max_depth=-1,
        min_child_samples=20, min_child_weight=40, min_split_gain=0.01,
        n_estimators=500, n_jobs=-1, num_leaves=32, objective=None,
        random_state=None, reg_alpha=3, reg_lambda=1, silent=True,
        subsample=1.0, subsample_for_bin=200000, subsample_freq=0),
        max_features=None, norm_order=1, prefit=False,
        threshold='1.25*median')

In [30]:
embeded_lgb_support = embeded_lgb_selector.get_support()
embeded_lgb_feature = X.loc[:,embeded_lgb_support].columns.tolist()
print(str(len(embeded_lgb_feature)), 'selected features')

109 selected features


In [31]:
pd.set_option('display.max_rows', None)

feature_selection_df = pd.DataFrame({
    'feature': X.columns.tolist(),
    'Pearson': cor_support,
    'Chi-2': chi_support,
    'RFE': rfe_support,
    'Logistics': embeded_lr_support,
    'Random Forest': embeded_rf_support,
    'LightGBM': embeded_lgb_support
})

# Count the votes per feature (sum only the boolean selector columns)
feature_selection_df['total'] = feature_selection_df.drop(columns='feature').sum(axis=1)

feature_selection_df = feature_selection_df.sort_values(['total', 'feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df[:50]

| | feature | Pearson | Chi-2 | RFE | Logistics | Random Forest | LightGBM | total |
|---|---|---|---|---|---|---|---|---|
| 1 | viral_outlier_after_iteration_5 | True | True | True | True | True | True | 6 |
| 2 | viral_outlier_after_iteration_4 | True | True | True | True | True | True | 6 |
| 3 | viral_outlier_after_iteration_3 | True | True | False | True | True | True | 5 |
| 4 | viral_outlier_after_iteration_2 | True | True | False | True | True | True | 5 |
| 5 | viral_outlier_after_iteration_1 | True | True | False | True | True | True | 5 |
| 6 | new_purchase_weekofyear_nunique | True | True | True | True | False | True | 5 |
| 7 | new_purchase_weekofyear_median | True | True | False | True | True | True | 5 |
| 8 | new_purchase_weekofyear_mean | True | True | False | True | True | True | 5 |
| 9 | new_purchase_weekofyear_max | True | True | False | True | True | True | 5 |
| 10 | new_purchase_month_mean | True | True | False | True | True | True | 5 |


In [86]:
len(feature_selection_df[feature_selection_df['total'] > 3]['feature'])

58
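
The voting rule used here (keep a feature when more than 3 of the 6 selectors chose it) can be shown on a tiny made-up table. The feature names `f_a`, `f_b` and `f_c` and their votes are invented for the illustration.

```python
import pandas as pd

# One boolean column per selection method, one row per feature.
votes = pd.DataFrame({
    "feature": ["f_a", "f_b", "f_c"],
    "Pearson": [True, True, False],
    "Chi-2": [True, True, False],
    "RFE": [True, False, False],
    "Logistics": [True, True, True],
    "Random Forest": [True, False, False],
    "LightGBM": [True, True, True],
})

# Sum the boolean columns only; the feature name is not a vote.
method_cols = [c for c in votes.columns if c != "feature"]
votes["total"] = votes[method_cols].sum(axis=1)

# Keep features chosen by more than 3 methods.
selected = votes.loc[votes["total"] > 3, "feature"].tolist()
print(votes[["feature", "total"]])
print(selected)
```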

In [87]:
features = list(feature_selection_df[feature_selection_df['total'] > 3]['feature'])
features

['viral_outlier_after_iteration_5',
 'viral_outlier_after_iteration_4',
 'viral_outlier_after_iteration_3',
 'viral_outlier_after_iteration_2',
 'viral_outlier_after_iteration_1',
 'new_purchase_weekofyear_nunique',
 'new_purchase_weekofyear_median',
 'new_purchase_weekofyear_mean',
 'new_purchase_weekofyear_max',
 'new_purchase_month_mean',
 'new_purchase_month_max',
 'new_purchase_day_std',
 'merch_seg_viral_outlier_standard_deviation_after_iteration_5_most_common',
 'merch_seg_viral_outlier_standard_deviation_after_iteration_3_most_common',
 'merch_seg_viral_outlier_standard_deviation_after_iteration_2_most_common',
 'merch_seg_viral_outlier_average_after_iteration_5_most_common',
 'merch_seg_viral_outlier_average_after_iteration_4_most_common',
 'merch_seg_viral_outlier_average_after_iteration_3_most_common',
 'merch_seg_viral_outlier_average_after_iteration_1_most_common',
 'hist_purchase_year_std',
 'hist_purchase_weekofyear_std',
 'hist_purchase_month_std',
 'hist_category_1_tra

### Training

In [88]:
# all features
#features = [c for c in df_train.columns if c not in columns2drop]

features.append('outlier')

df_train_clean = df_train.dropna(how='any', axis=0, subset=features)[features]
y = df_train_clean.outlier

df_train_clean.drop(['outlier'], axis=1, inplace=True)
X = normalize(df_train_clean)

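# features_test is an alias of features (same list object), so the edits below
# also remove 'outlier' from features and add 'card_id' to it
features_test = features
features_test.remove('outlier')
features_test.append('card_id')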
df_test_clean = df_test.dropna(how='any', axis=0, subset=features_test)[features_test]

viral_outlier_after_iteration_5: 0.0065 (0.0789)
viral_outlier_after_iteration_4: 0.0065 (0.0789)
viral_outlier_after_iteration_3: 0.0065 (0.0789)
viral_outlier_after_iteration_2: 0.0065 (0.0789)
viral_outlier_after_iteration_1: 0.0065 (0.0789)
new_purchase_weekofyear_nunique: 3.8907 (1.9786)
new_purchase_weekofyear_median: 15.9449 (10.3316)
new_purchase_weekofyear_mean: 15.9085 (9.7555)
new_purchase_weekofyear_max: 18.9045 (10.7761)
new_purchase_month_mean: 4.0957 (2.2426)
new_purchase_month_max: 4.6073 (2.4666)
new_purchase_day_std: 7.2044 (3.5663)
merch_seg_viral_outlier_standard_deviation_after_iteration_5_most_common: 0.0083 (0.0364)
merch_seg_viral_outlier_standard_deviation_after_iteration_3_most_common: 0.0083 (0.0364)
merch_seg_viral_outlier_standard_deviation_after_iteration_2_most_common: 0.0083 (0.0363)
merch_seg_viral_outlier_average_after_iteration_5_most_common: 0.0019 (0.0243)
merch_seg_viral_outlier_average_after_iteration_4_most_common: 0.0019 (0.0243)
merch_seg_viral

In [89]:
X.shape, y.shape, df_test_clean.shape

((152461, 58), (152461,), (91633, 59))

In [90]:
features_test.remove('card_id')

#### Random Forest

In [91]:
clf = RandomForestClassifier(verbose=1)
clf.fit(X, y)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    3.3s finished


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=1,
            warm_start=False)

In [92]:
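# NOTE: the classifier was fit on normalized features (X), but the test features
# are passed here unscaled; the same holds for the logistic regression and
# AdaBoost predictions below
randomforest_outlier_pred = clf.predict(df_test_clean[features_test])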

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.1s finished


In [93]:
randomforest_outlier_pred.sum()

1510

In [94]:
rf_outlier_card_ids = []
for i in range(len(randomforest_outlier_pred)):
    if randomforest_outlier_pred[i] == 1:
        print('{:,}. card_id: {}'.format(i, df_test_clean['card_id'].iloc[i]))
        rf_outlier_card_ids.append(df_test_clean['card_id'].iloc[i])

95. card_id: C_ID_ce489f86d1
157. card_id: C_ID_1d5864081b
223. card_id: C_ID_152f65c13d
227. card_id: C_ID_69bcef0917
229. card_id: C_ID_fe577884d4
304. card_id: C_ID_b3381af031
319. card_id: C_ID_0e6d55e198
358. card_id: C_ID_f3a6263458
383. card_id: C_ID_d2d79eae51
532. card_id: C_ID_3f12f3e7af
614. card_id: C_ID_327fe09457
640. card_id: C_ID_3ad7a0afb2
750. card_id: C_ID_f60d9fee4f
756. card_id: C_ID_b32a177495
833. card_id: C_ID_a16fb5d334
1,102. card_id: C_ID_328b87458e
1,121. card_id: C_ID_64cadb949b
1,125. card_id: C_ID_d9207b63e9
1,147. card_id: C_ID_be12bfd430
1,176. card_id: C_ID_be805a8bb4
1,200. card_id: C_ID_35dad8c5f3
1,218. card_id: C_ID_db86c750d3
1,222. card_id: C_ID_8189b80a88
1,223. card_id: C_ID_7ba3e6a065
1,257. card_id: C_ID_b49c76e624
1,274. card_id: C_ID_4c822955ab
1,350. card_id: C_ID_6bc7a8f802
1,408. card_id: C_ID_82b4f7781b
1,422. card_id: C_ID_270ae594f1
1,427. card_id: C_ID_866d5678d7
1,476. card_id: C_ID_bda15a62e5
1,501. card_id: C_ID_f35af12a23
...

69,355. card_id: C_ID_9b696f9294
69,372. card_id: C_ID_8a2567e4b2
69,404. card_id: C_ID_6ccdf07033
69,419. card_id: C_ID_213bd86a11
69,430. card_id: C_ID_7bc8c98949
69,596. card_id: C_ID_395faeb385
69,664. card_id: C_ID_a840c1d122
69,710. card_id: C_ID_e379b4ed94
69,733. card_id: C_ID_70417ba281
69,740. card_id: C_ID_a2eea4b5c0
69,790. card_id: C_ID_0e00eb9ad1
69,849. card_id: C_ID_08c825257c
69,851. card_id: C_ID_a690d80cc0
69,918. card_id: C_ID_3d32c5d0e7
69,967. card_id: C_ID_7723ee6343
70,093. card_id: C_ID_35a2b3d3ad
70,186. card_id: C_ID_17d26eab4b
70,206. card_id: C_ID_6e9a4a6008
70,257. card_id: C_ID_0b4d785df6
70,322. card_id: C_ID_70db8a6198
70,380. card_id: C_ID_e35f12cd3d
70,465. card_id: C_ID_04a0a32569
70,564. card_id: C_ID_cbb8b3343e
70,589. card_id: C_ID_13023e00bd
70,596. card_id: C_ID_c9294883ff
70,670. card_id: C_ID_f602e30ba1
70,733. card_id: C_ID_b43a254525
70,736. card_id: C_ID_fa0b9bdda2
70,763. card_id: C_ID_b349a757dd
70,794. card_id: C_ID_d8b698b1e2
...

In [95]:
rf_outlier_card_ids

['C_ID_ce489f86d1',
 'C_ID_1d5864081b',
 'C_ID_152f65c13d',
 'C_ID_69bcef0917',
 'C_ID_fe577884d4',
 'C_ID_b3381af031',
 'C_ID_0e6d55e198',
 'C_ID_f3a6263458',
 'C_ID_d2d79eae51',
 'C_ID_3f12f3e7af',
 'C_ID_327fe09457',
 'C_ID_3ad7a0afb2',
 'C_ID_f60d9fee4f',
 'C_ID_b32a177495',
 'C_ID_a16fb5d334',
 'C_ID_328b87458e',
 'C_ID_64cadb949b',
 'C_ID_d9207b63e9',
 'C_ID_be12bfd430',
 'C_ID_be805a8bb4',
 'C_ID_35dad8c5f3',
 'C_ID_db86c750d3',
 'C_ID_8189b80a88',
 'C_ID_7ba3e6a065',
 'C_ID_b49c76e624',
 'C_ID_4c822955ab',
 'C_ID_6bc7a8f802',
 'C_ID_82b4f7781b',
 'C_ID_270ae594f1',
 'C_ID_866d5678d7',
 'C_ID_bda15a62e5',
 'C_ID_f35af12a23',
 'C_ID_ff19179428',
 'C_ID_86f01f19a2',
 'C_ID_25f1288e21',
 'C_ID_0134f3f984',
 'C_ID_649960b7d7',
 'C_ID_4ad42a7679',
 'C_ID_f3bfdde7b5',
 'C_ID_6117b5fcd6',
 'C_ID_dc871f00c4',
 'C_ID_4feb638ec9',
 'C_ID_8ef1d492db',
 'C_ID_bdb2eda16e',
 'C_ID_4a42fdcd4e',
 'C_ID_4feacff118',
 'C_ID_573c5399ae',
 'C_ID_6224f9495a',
 'C_ID_5503b28e52',
 'C_ID_b8c6631cd9',


#### Logistic regression

In [96]:
clf = LogisticRegression(verbose=1)
clf.fit(X, y)

[LibLinear]

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=1, warm_start=False)

In [97]:
logistic_regression_outlier_pred = clf.predict(df_test_clean[features_test])

In [98]:
logistic_regression_outlier_pred.sum()

71

In [99]:
lr_outlier_card_ids = []
for i in range(len(logistic_regression_outlier_pred)):
    if logistic_regression_outlier_pred[i] == 1:
        print('{:,}. card_id: {}'.format(i, df_test_clean['card_id'].iloc[i]))
        lr_outlier_card_ids.append(df_test_clean['card_id'].iloc[i])

2,685. card_id: C_ID_289b0daeb5
3,487. card_id: C_ID_c0b4e08398
3,618. card_id: C_ID_d62f4685ca
4,257. card_id: C_ID_a77600ea88
5,309. card_id: C_ID_49cc84c9de
7,825. card_id: C_ID_fd158a8b90
9,517. card_id: C_ID_f05440d321
9,662. card_id: C_ID_50061c8d78
10,019. card_id: C_ID_ea24c809b8
22,311. card_id: C_ID_038293286b
22,555. card_id: C_ID_64154f656b
22,857. card_id: C_ID_49e1ed0c10
26,598. card_id: C_ID_fff47dea33
26,981. card_id: C_ID_6d7769f9a9
27,471. card_id: C_ID_7897759c09
28,388. card_id: C_ID_97c7302133
28,701. card_id: C_ID_c5097fa8c9
29,309. card_id: C_ID_af98b072c8
30,132. card_id: C_ID_013c95d05b
31,179. card_id: C_ID_30ef6ad286
33,619. card_id: C_ID_77edf0ef0a
34,080. card_id: C_ID_5f77b4d185
35,406. card_id: C_ID_afc69a9c4b
36,010. card_id: C_ID_34aa1d9963
36,463. card_id: C_ID_8727ddbf6d
37,598. card_id: C_ID_95b828199e
39,793. card_id: C_ID_4e0c06e912
42,025. card_id: C_ID_1215b024d3
45,790. card_id: C_ID_a777236765
49,032. card_id: C_ID_238e35c5d7
49,412. card_id: C

#### AdaBoost

In [100]:
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier()
clf.fit(X, y)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)

In [101]:
adaboost_outlier_pred = clf.predict(df_test_clean[features_test])

In [102]:
adaboost_outlier_pred.sum()

45

In [103]:
adaboost_outlier_card_ids = []
for i in range(len(adaboost_outlier_pred)):
    if adaboost_outlier_pred[i] == 1:
        print('{:,}. card_id: {}'.format(i, df_test_clean['card_id'].iloc[i]))
        adaboost_outlier_card_ids.append(df_test_clean['card_id'].iloc[i])

901. card_id: C_ID_c25dc96879
3,464. card_id: C_ID_1a2a6cea39
4,789. card_id: C_ID_6237a7c2b7
5,309. card_id: C_ID_49cc84c9de
5,442. card_id: C_ID_bab0e3ed78
9,339. card_id: C_ID_912688a50a
10,675. card_id: C_ID_4347549f95
10,688. card_id: C_ID_3a3f7b4e0e
11,555. card_id: C_ID_162160c811
14,858. card_id: C_ID_1caad84982
15,682. card_id: C_ID_c2ca440d0c
16,370. card_id: C_ID_d247290e3d
19,462. card_id: C_ID_3f0de1cefb
27,513. card_id: C_ID_bb7a512b02
28,317. card_id: C_ID_774ca21a83
28,715. card_id: C_ID_b1061f8699
29,166. card_id: C_ID_8a78d84fac
30,403. card_id: C_ID_b62447ef8c
31,203. card_id: C_ID_5b703405d1
31,600. card_id: C_ID_aa5ddeb020
33,521. card_id: C_ID_9bae150f94
34,763. card_id: C_ID_e236b32ea0
36,281. card_id: C_ID_b619802595
38,381. card_id: C_ID_a82d1d6b9b
40,869. card_id: C_ID_f40731ceba
43,640. card_id: C_ID_3b397f83f7
47,674. card_id: C_ID_0fa8892abf
49,228. card_id: C_ID_a08f90b0ac
55,225. card_id: C_ID_a2f2be2611
57,240. card_id: C_ID_b470a6725b
59,038. card_id: C

### Ensembling

In [104]:
def intersection(lst1, lst2): 
    return list(set(lst1) & set(lst2))

In [105]:
print('Size of the output of the logistic regression: {}\nSize of the output of the random forest: {}\nSize of the output of the AdaBoost: {}'.format(len(lr_outlier_card_ids), len(rf_outlier_card_ids), len(adaboost_outlier_card_ids)))

Size of the output of the logistic regression: 71
Size of the output of the random forest: 1510
Size of the output of the AdaBoost: 45


In [106]:
intersection(lr_outlier_card_ids, rf_outlier_card_ids)

['C_ID_4acccd54ba', 'C_ID_95b828199e', 'C_ID_34aa1d9963', 'C_ID_c3ed09b27d']

In [107]:
intersection(lr_outlier_card_ids, adaboost_outlier_card_ids)

['C_ID_a2f2be2611', 'C_ID_49cc84c9de']

In [108]:
intersection(rf_outlier_card_ids, adaboost_outlier_card_ids)

['C_ID_06448cc327', 'C_ID_bab0e3ed78']
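
Pairwise intersections generalize to a vote count across all models. The sketch below uses `collections.Counter` on made-up card identifiers (`card_a` to `card_f` are placeholders, not real `card_id` values) to list the cards flagged by at least two models and by all three.

```python
from collections import Counter

# Hypothetical model outputs: each list holds the cards one model flagged.
flagged = {
    "logistic_regression": ["card_a", "card_b", "card_c"],
    "random_forest": ["card_b", "card_c", "card_d", "card_e"],
    "adaboost": ["card_c", "card_e", "card_f"],
}

# Count how many models flagged each card (set() guards against duplicates).
votes = Counter(card for ids in flagged.values() for card in set(ids))

at_least_two = sorted(card for card, n in votes.items() if n >= 2)
all_three = sorted(card for card, n in votes.items() if n == len(flagged))

print(at_least_two)
print(all_three)
```

Requiring agreement between models trades recall for precision, which fits the very different output sizes reported above.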

## Training
LightGBM training without outliers

In [118]:
df_train.drop(['outlier'], axis=1, inplace=True)

In [120]:
df_train.shape, df_test.shape

((201917, 291), (123623, 290))

[LightGBM parameters](https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst)<br/>
[Comparison between LGB boosting methods (goss, gbdt and dart)](https://www.kaggle.com/c/home-credit-default-risk/discussion/60921)

In [121]:
target = df_train['target']
df_train_clean = df_train.drop(['target'], axis=1)

In [123]:
df_train_clean.shape, target.shape, df_test.shape

((201917, 290), (201917,), (123623, 290))

In [None]:
param = {
    'num_leaves': 127,
    'min_data_in_leaf': 20, 
    'objective': 'regression_l2',
    'max_depth': -1,
    'learning_rate': 0.0025,
    "boosting": "dart",
    "feature_fraction": 0.9,
    "bagging_freq": 1,
    "bagging_fraction": 0.9,
    "bagging_seed": 11,
    "metric": 'rmse',
    "lambda_l1": 0.1,
    "verbosity": -1
}

folds = KFold(n_splits=5, shuffle=True, random_state=15)

oof = np.zeros(len(df_train_clean))
predictions = np.zeros(len(df_test))

feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train_clean.values, target.values)):
    print("Fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(df_train_clean.iloc[trn_idx][features], label=target.iloc[trn_idx])
    val_data = lgb.Dataset(df_train_clean.iloc[val_idx][features], label=target.iloc[val_idx])

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds=100)
    oof[val_idx] = clf.predict(df_train_clean.iloc[val_idx][features], num_iteration=clf.best_iteration)
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    predictions += clf.predict(df_test[features], num_iteration=clf.best_iteration) / folds.n_splits

Fold 1
Training until validation scores don't improve for 100 rounds.
[100]	training's rmse: 3.58264	valid_1's rmse: 3.62198
[200]	training's rmse: 3.44758	valid_1's rmse: 3.48605
[300]	training's rmse: 3.29571	valid_1's rmse: 3.33343
[400]	training's rmse: 3.15664	valid_1's rmse: 3.19443
[500]	training's rmse: 3.01756	valid_1's rmse: 3.05614
[600]	training's rmse: 2.96524	valid_1's rmse: 3.00494
[700]	training's rmse: 2.87007	valid_1's rmse: 2.91169
[800]	training's rmse: 2.81081	valid_1's rmse: 2.85417
[900]	training's rmse: 2.73013	valid_1's rmse: 2.77614
[1000]	training's rmse: 2.68037	valid_1's rmse: 2.72866
[1100]	training's rmse: 2.62318	valid_1's rmse: 2.67467
[1200]	training's rmse: 2.59543	valid_1's rmse: 2.64925
[1300]	training's rmse: 2.55534	valid_1's rmse: 2.61212
[1400]	training's rmse: 2.49861	valid_1's rmse: 2.55971
[1500]	training's rmse: 2.48178	valid_1's rmse: 2.54535
[1600]	training's rmse: 2.43989	valid_1's rmse: 2.50745
[1700]	training's rmse: 2.40111	valid_1's

In [None]:
cross_validation_lgb = np.sqrt(mean_squared_error(target, oof))
print('Cross-validation score: ' + str(cross_validation_lgb))
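
The out-of-fold (OOF) scheme used in the training loop can be sketched without LightGBM. In the example below `LinearRegression` is swapped in as a stand-in model and the data is synthetic, so only the structure carries over: every training row is predicted by a model that never saw it, the test predictions are averaged over the folds, and the cross-validation score is the RMSE of the OOF predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)

# Synthetic regression data (stand-in for the real features and target).
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 200)
X_new = rng.normal(size=(50, 3))

folds = KFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(X_demo))       # one out-of-fold prediction per training row
test_pred = np.zeros(len(X_new))  # average of the per-fold test predictions

for trn_idx, val_idx in folds.split(X_demo):
    model = LinearRegression().fit(X_demo[trn_idx], y_demo[trn_idx])
    oof[val_idx] = model.predict(X_demo[val_idx])
    test_pred += model.predict(X_new) / folds.n_splits

rmse = np.sqrt(mean_squared_error(y_demo, oof))
print("Cross-validation RMSE: {:.4f}".format(rmse))
```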

In [None]:
cols = (feature_importance_df[["feature", "importance"]]
        .groupby("feature")
        .mean()
        .sort_values(by="importance", ascending=False)[:1000].index)

best_features = feature_importance_df.loc[feature_importance_df['feature'].isin(cols)]

plt.figure(figsize=(14, 40))
sns.barplot(x="importance",
            y="feature",
            data=best_features.sort_values(by="importance", ascending=False))
plt.title('LightGBM Features (avg over folds)')
plt.tight_layout()
plt.savefig('lgbm_importances.png')

In [None]:
df_sub = pd.DataFrame({"card_id":df_test["card_id"].values})
df_sub["target"] = predictions
df_sub.to_csv("output/lgbm_{}.csv".format(cross_validation_lgb), index=False)

## Prediction
Load the model that was trained on the training set without outliers.

In [None]:
clf = lgb.Booster(model_file='models/lightgbm_all.txt')

In [None]:
drops = ['card_id', 'first_active_month', 'target', 'outlier']
use_cols = [c for c in df_train.columns if c not in drops]
features = list(df_train[use_cols].columns)

In [None]:
predictions = clf.predict(df_test[features])

In [None]:
df_sub = pd.DataFrame({
    "card_id": df_test["card_id"].values
})
df_sub["target"] = predictions

In [None]:
# Updating the intersection of Logistic regression & Random Forest
df_sub.loc[df_sub['card_id'] == 'C_ID_aae50409e7', 'target'] = -33.218750

In [None]:
for card_id in lr_outlier_card_ids:
    is_card = df_sub['card_id'] == card_id
    print('The value of {} is {}'.format(card_id, df_sub.loc[is_card, 'target'].values[0]))
    df_sub.loc[is_card, 'target'] = -33.218750
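
The same override can be written without a loop by building one boolean mask with `isin`. This is a sketch on a made-up submission frame; `sub_demo`, `flagged_ids` and the card names are hypothetical, and identifiers that are not present in the frame are simply ignored.

```python
import pandas as pd

OUTLIER_VALUE = -33.218750  # the constant used in the cells above

# Hypothetical submission frame and list of flagged cards.
sub_demo = pd.DataFrame({
    "card_id": ["card_a", "card_b", "card_c", "card_d"],
    "target": [-0.4, 0.1, -1.2, 0.7],
})
flagged_ids = ["card_b", "card_d", "card_zzz"]  # card_zzz is not in the frame

# Overwrite the prediction of every flagged card in one step.
mask = sub_demo["card_id"].isin(flagged_ids)
sub_demo.loc[mask, "target"] = OUTLIER_VALUE

n_overridden = int((sub_demo["target"] < -30).sum())
print(sub_demo)
print(n_overridden)
```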

In [None]:
len(df_sub[df_sub['target'] < -30])

In [None]:
df_sub.to_csv("output/lgbm_rf_and_adaboost_outliers.csv", index=False)

Leaderboard (LB) scores by the model(s) used to flag outliers:

* Random Forest (LB score: 5.994)
* Logistic Regression (LB score: 5.990)
* Random Forest & AdaBoost (LB score: 5.986)