# Elo Merchant Category Recommendation - Outlier detection, ensembling
End date: _February 19, 2019_

This tutorial notebook is part of a series for the [Elo Merchant Category Recommendation](https://www.kaggle.com/c/elo-merchant-category-recommendation) contest organized by Elo, one of the largest payment brands in Brazil. Elo has built partnerships with merchants in order to offer promotions or discounts to cardholders. The objective of the competition is to identify and serve the most relevant opportunities to individuals by uncovering signals in customer loyalty. LynxKite does not yet support some of the data preprocessing steps, so they are done in Python. The input files are available from the [download](https://www.kaggle.com/c/elo-merchant-category-recommendation/data) section of the contest:

- **train.csv**, **test.csv**: list of `card_ids` that can be used for training and testing
- **historical_transactions.csv**: contains up to 3 months' worth of transactions for every card at any of the provided `merchant_ids`
- **new_merchant_transactions.csv**: contains the transactions at new merchants (`merchant_ids` that this particular `card_id` has not yet visited) over a period of two months
- **merchants.csv**: contains aggregate information for each `merchant_id` represented in the data set

In [1]:
import gc
import time
import warnings
import datetime
import calendar
import statistics
import numpy as np
import pandas as pd
import seaborn as sns
import lightgbm as lgb
from datetime import date
import matplotlib.pyplot as plt
from lightgbm import LGBMClassifier
import matplotlib.gridspec as gridspec
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

%matplotlib inline
warnings.simplefilter(action='ignore', category=FutureWarning)
gc.enable()

In [2]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Starting memory usage: {:5.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Reduced memory usage: {:5.2f} MB ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
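
One side effect of `reduce_mem_usage` is worth keeping in mind: a float column is downcast to `float16` whenever its range fits, and `float16` keeps only about three significant decimal digits and cannot represent anything above 65504. The sketch below uses made-up values (they are not read from the data set) to illustrate the rounding, and to show that arithmetic on downcast values can overflow to infinity even though each input fits.

```python
import numpy as np

# Hypothetical raw value, chosen close to the outlier targets of this competition.
raw_value = -33.2193
rounded = float(np.float16(raw_value))
print(rounded)  # -33.21875

# Largest finite value that float16 can hold.
float16_max = float(np.finfo(np.float16).max)
print(float16_max)  # 65504.0

# Both inputs fit into float16, but their product does not.
big = np.array([300.0], dtype=np.float16)
with np.errstate(over='ignore'):
    product = big * big
print(np.isinf(product[0]))  # True
```

Casting a column back to `float64` before computing sums, means or squares is one way to sidestep the overflow.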

## Input data preparation
### Transactions

In [3]:
df_new_trans = pd.read_csv("preprocessed/trans_merch_new_agg.csv")
df_new_trans = reduce_mem_usage(df_new_trans)

df_hist_trans = pd.read_csv("preprocessed/trans_merch_hist_agg.csv")
df_hist_trans = reduce_mem_usage(df_hist_trans)

Starting memory usage: 289.84 MB
Reduced memory usage: 70.52 MB (75.7% reduction)
Starting memory usage: 325.36 MB
Reduced memory usage: 96.86 MB (70.2% reduction)


In [4]:
df_hist_trans.drop(['Unnamed: 0'], inplace=True, axis=1)
df_new_trans.drop(['Unnamed: 0'], inplace=True, axis=1)

### Train and test data

In [5]:
df_train = pd.read_csv("preprocessed/train_parsed_outlier_marked.csv", index_col="card_id")
df_test = pd.read_csv("preprocessed/test_parsed.csv", index_col="card_id")

### LynxKite export

In [6]:
df_lk_train = pd.read_csv("LynxKite_export/LynxKite_outlier_viral_modeling_train.csv", index_col="card_id")
df_lk_test = pd.read_csv("LynxKite_export/LynxKite_outlier_viral_modeling_test.csv", index_col="card_id")

In [7]:
df_lk_train.drop(['new_id', 'outlier', 'target', 'type'], inplace=True, axis=1)
df_lk_test.drop(['new_id', 'outlier', 'target', 'type'], inplace=True, axis=1)

### Merging

In [8]:
df_train = pd.merge(df_train, df_lk_train, on='card_id', how='left')
df_test = pd.merge(df_test, df_lk_test, on='card_id', how='left')

In [9]:
df_train = pd.merge(df_train, df_hist_trans, on='card_id', how='left')
df_test = pd.merge(df_test, df_hist_trans, on='card_id', how='left')

df_train = pd.merge(df_train, df_new_trans, on='card_id', how='left')
df_test = pd.merge(df_test, df_new_trans, on='card_id', how='left')

In [10]:
df_train = reduce_mem_usage(df_train)
df_test = reduce_mem_usage(df_test)

Starting memory usage: 212.97 MB
Reduced memory usage: 127.67 MB (40.1% reduction)
Starting memory usage: 128.51 MB
Reduced memory usage: 81.70 MB (36.4% reduction)


## Outlier selection
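The feature selection methods in this section follow the Kaggle kernel [6 ways for feature selection](https://www.kaggle.com/sz8416/6-ways-for-feature-selection).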

### Marking the outliers

In [25]:
df_train['outlier'] = np.where(df_train['target']<-30, 1, 0)

In [41]:
print('There are {:,} marked outliers in the training set.'.format(len(df_train[df_train['outlier'] == 1]['outlier'])))

There are 2,207 marked outliers in the training set.


In [42]:
df_target = df_train['outlier']

# Candidate features: every column except identifiers, targets and LynxKite bookkeeping columns
use_cols = [c for c in df_train.columns if c not in ['card_id', 'first_active_month', 'target', 'outlier', 'viral_outlier_test', 'viral_roles']]
features = list(df_train[use_cols].columns)

### Normalization

In [55]:
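def normalize(df, used_cols):
    # Normalize a copy so that the caller's DataFrame (possibly a slice) is left
    # untouched; this also avoids pandas' SettingWithCopyWarning.
    df_norm = df[used_cols].copy()
    for c in used_cols:
        # Compute the statistics in float64: downcast float16 columns can overflow.
        col = df_norm[c].astype(np.float64)
        mean = col.mean()
        std = col.std()  # sample standard deviation, same as statistics.stdev

        df_norm[c] = (col - mean) / std
        print('{}: {:.4f} ({:.4f})'.format(c, mean, std))

    return df_norm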
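
For reference, the transformation that `normalize` applies to each column is the usual z-score. The mean $\bar{x}$ and the sample standard deviation $s$ (denominator $n-1$) are the two numbers printed per feature as `mean (std)`. The symbols are introduced here only for illustration.

$$
z_i = \frac{x_i - \bar{x}}{s}, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
$$

A constant column has $s = 0$, so its z-scores are undefined.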

In [48]:
df_train_clean = df_train.dropna(how='any', axis=0, subset=features)

In [None]:
X = normalize(df_train_clean, features)
#X = df_train_clean[features]
y = df_train_clean['outlier']


feature_1: 3.0869 (1.1955)
feature_2: 1.7331 (0.7495)
feature_3: 0.5537 (0.4971)
elapsed_days: 361.6757 (286.1876)
year: 2016.5540 (0.7591)
month: 7.5097 (3.3249)
days_feature1: 1154.0387 (1084.1325)
days_feature2: 659.3918 (709.8126)
days_feature3: 226.5280 (318.7878)
number_of_transactions_x: 101.1063 (113.3242)
merch_seg_viral_outlier_average_after_iteration_1_most_common: 0.0031 (0.0352)
merch_seg_viral_outlier_average_after_iteration_2_most_common: 0.0020 (0.0270)
merch_seg_viral_outlier_average_after_iteration_3_most_common: 0.0020 (0.0270)
merch_seg_viral_outlier_average_after_iteration_4_most_common: 0.0020 (0.0270)
merch_seg_viral_outlier_average_after_iteration_5_most_common: 0.0020 (0.0270)
merch_seg_viral_outlier_standard_deviation_after_iteration_1_most_common: 0.0101 (0.0470)
merch_seg_viral_outlier_standard_deviation_after_iteration_2_most_common: 0.0077 (0.0370)
merch_seg_viral_outlier_standard_deviation_after_iteration_3_most_common: 0.0077 (0.0371)
merch_seg_viral_out

new_avg_purchases_lag12_sum: 67.6129 (1005.4959)
new_avg_purchases_lag12_mean: 8.9693 (148.3317)
new_card_id_size: 7.9527 (6.8033)
new_category_1_trans_sum: 0.2218 (0.6011)
new_category_1_trans_mean: 0.0323 (0.0950)
new_category_1_merch_sum: 0.6690 (0.9932)
new_category_1_merch_mean: 0.0963 (0.1479)
new_category_2_trans_sum: 16.6452 (20.1448)
new_category_2_trans_mean: 2.1821 (1.3982)
new_category_2_merch_sum: 15.9578 (19.7995)
new_category_2_merch_mean: 2.2387 (1.4763)


In [None]:
X.shape, y.shape

### Feature selection

#### Pearson correlation

In [51]:
def cor_selector(X, y, limit=100):
    cor_list = []
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)

    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    cor_feature = X.iloc[:,np.argsort(np.abs(cor_list))[-limit:]].columns.tolist()
    cor_support = [True if i in cor_feature else False for i in X.columns.tolist()]
    return cor_support, cor_feature, cor_list

In [52]:
cor_support, cor_feature, cor_value = cor_selector(X, y)
print(str(len(cor_feature)), 'selected features')

  c /= stddev[:, None]
  c /= stddev[None, :]
  X -= avg[:, None]


100 selected features
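
The same ranking by absolute Pearson correlation can also be written with pandas' `DataFrame.corrwith`, which avoids the explicit Python loop. This is only a sketch on synthetic data: the helper name `top_correlated` and the toy columns are assumptions made for the illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def top_correlated(X, y, limit=100):
    """Names of the `limit` columns with the largest |Pearson r| against y."""
    # Constant columns have an undefined correlation (NaN); treat it as 0.
    corr = X.corrwith(y).fillna(0.0)
    return corr.abs().sort_values(ascending=False).head(limit).index.tolist()

# Synthetic example: one informative column, one noise column, one constant.
rng = np.random.default_rng(0)
y_toy = pd.Series(rng.integers(0, 2, size=200))
X_toy = pd.DataFrame({
    'signal': y_toy + rng.normal(scale=0.1, size=200),
    'noise': rng.normal(size=200),
    'constant': np.ones(200),
})

top_two = top_correlated(X_toy, y_toy, limit=2)
print(top_two)
```

The informative column comes first, and the constant column is ranked last because its correlation is set to zero.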


#### Chi-2

In [53]:
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest

X_norm = MinMaxScaler().fit_transform(X)
chi_selector = SelectKBest(chi2, k=100)
chi_selector.fit(X_norm, y)

ValueError: Input contains infinity or a value too large for dtype('float64').

In [54]:
chi_support = chi_selector.get_support()
chi_feature = X.loc[:,chi_support].columns.tolist()
print(str(len(chi_feature)), 'selected features')

NameError: name 'chi_selector' is not defined
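
The `ValueError` above means that `X` holds non-finite entries, and the follow-up `NameError` suggests that the exception was already raised by `MinMaxScaler().fit_transform(X)`, before `chi_selector` was created. The notebook does not show where the non-finite values come from; a zero standard deviation during normalization or an overflow in a downcast column are two plausible sources, but that is an assumption. The sketch below, on a toy frame with a hypothetical helper `drop_non_finite_columns`, shows one way to find and drop the offending columns before scaling.

```python
import numpy as np
import pandas as pd

def drop_non_finite_columns(X):
    """Return X as float64 without the columns that hold inf or NaN values."""
    X = X.astype('float64').replace([np.inf, -np.inf], np.nan)
    bad_cols = X.columns[X.isna().any()].tolist()
    return X.drop(columns=bad_cols), bad_cols

# Toy data; the column names are made up for the illustration.
toy = pd.DataFrame({
    'ok': [0.1, 0.5, 0.9],
    'overflowed': [1.0, np.inf, 2.0],
    'undefined': [np.nan, 0.0, 1.0],
})

toy_clean, dropped = drop_non_finite_columns(toy)
print(dropped)  # ['overflowed', 'undefined']
```

Dropping whole columns is the bluntest option; imputing the missing values instead would keep more features at the cost of an extra modeling decision.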

#### RFE

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=5, step=10, verbose=5)
rfe_selector.fit(X_norm, y)

In [None]:
rfe_support = rfe_selector.get_support()
rfe_feature = X.loc[:,rfe_support].columns.tolist()
print(str(len(rfe_feature)), 'selected features')

#### Embedded

In [None]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

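# The L1 penalty needs a solver that supports it (liblinear here); the threshold is passed by keyword
embeded_lr_selector = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"), threshold='1.25*median')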
embeded_lr_selector.fit(X_norm, y)

In [None]:
embeded_lr_support = embeded_lr_selector.get_support()
embeded_lr_feature = X.loc[:,embeded_lr_support].columns.tolist()
print(str(len(embeded_lr_feature)), 'selected features')

#### Random Forest

In [None]:
embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=5), threshold='1.25*median')
embeded_rf_selector.fit(X, y)

In [None]:
embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = X.loc[:,embeded_rf_support].columns.tolist()
print(str(len(embeded_rf_feature)), 'selected features')

#### LightGBM

In [None]:
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMClassifier

lgbc=LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=32, colsample_bytree=0.2,
            reg_alpha=3, reg_lambda=1, min_split_gain=0.01, min_child_weight=40)

embeded_lgb_selector = SelectFromModel(lgbc, threshold='1.25*median')
embeded_lgb_selector.fit(X, y)

In [None]:
embeded_lgb_support = embeded_lgb_selector.get_support()
embeded_lgb_feature = X.loc[:,embeded_lgb_support].columns.tolist()
print(str(len(embeded_lgb_feature)), 'selected features')

In [None]:
pd.set_option('display.max_rows', None)

feature_selection_df = pd.DataFrame({
    'feature': X.columns.tolist(),
    'Pearson': cor_support,
    'Chi-2': chi_support,
    'RFE': rfe_support,
    'Logistics': embeded_lr_support,
    'Random Forest': embeded_rf_support,
    'LightGBM': embeded_lgb_support
})

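# Count the votes over the boolean selector columns only (the 'feature' column holds names)
feature_selection_df['total'] = feature_selection_df.drop(columns='feature').sum(axis=1)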

feature_selection_df = feature_selection_df.sort_values(['total', 'feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df[:50]

In [None]:
len(feature_selection_df[feature_selection_df['total'] > 3]['feature'])

In [None]:
list(feature_selection_df[feature_selection_df['total'] > 3]['feature'])

### Training

In [None]:
# features used for training the outlier detector
features = ['new_purchase_weekofyear_max', 'hist_category_3_sum', 'new_purchase_weekofyear_std', 'new_purchase_weekofyear_mean', 'new_purchase_day_std', 'new_purchase_day_median', 'new_purchase_day_mean', 'new_purchase_day_max', 'new_number_of_transactions_median', 'new_number_of_transactions_mean', 'new_month_lag_mean', 'hist_purchase_weekofyear_std', 'hist_month_lag_min', 'hist_installments_sum', 'hist_category_1_trans_sum', 'hist_category_1_trans_mean', 'hist_category_1_merch_sum', 'hist_category_1_merch_mean', 'hist_authorized_flag_mean', 'new_purchase_year_min', 'new_purchase_weekofyear_min', 'new_purchase_weekofyear_median', 'new_purchase_part_of_month_mean', 'new_purchase_month_std', 'new_purchase_month_mean', 'new_purchase_hour_mean', 'new_purchase_dayofweek_std', 'new_purchase_day_min', 'new_purchase_amount_sum', 'new_most_recent_purchases_range_sum', 'new_category_2_merch_sum', 'new_category_1_trans_sum', 'new_category_1_trans_mean', 'month', 'hist_purchase_weekofyear_min', 'hist_purchase_quarter_std', 'hist_purchase_month_std', 'hist_purchase_month_max', 'hist_number_of_transactions_median', 'hist_category_4_mean', 'hist_category_3_mean']
X = df_train[features]
y = df_train.outlier

#### Random Forest

In [15]:
clf = RandomForestClassifier(verbose=1)
clf.fit(X, y)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   11.4s finished


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=1,
            warm_start=False)

In [16]:
df_test_clean = df_test.dropna(axis=0, how='any')

In [17]:
# Predict on the same feature columns the classifier was trained on
randomforest_outlier_pred = clf.predict(df_test_clean[features])

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.2s finished


In [18]:
randomforest_outlier_pred.sum()

17

In [19]:
rf_outlier_card_ids = []
for i in range(len(randomforest_outlier_pred)):
    if randomforest_outlier_pred[i] == 1:
        print('{:,}. card_id: {}'.format(i, df_test_clean['card_id'].iloc[i]))
        rf_outlier_card_ids.append(df_test_clean['card_id'].iloc[i])

1,136. card_id: C_ID_a8776b3483
4,729. card_id: C_ID_6237a7c2b7
5,160. card_id: C_ID_fc54efbb79
9,158. card_id: C_ID_912688a50a
12,343. card_id: C_ID_46eac83bde
13,481. card_id: C_ID_1076351773
14,872. card_id: C_ID_aae50409e7
23,526. card_id: C_ID_ac114ef831
32,867. card_id: C_ID_93c6231450
36,467. card_id: C_ID_84e2faf848
40,482. card_id: C_ID_e3c0c1325b
41,393. card_id: C_ID_9771b57a38
46,329. card_id: C_ID_0a21dd3613
63,803. card_id: C_ID_fdf63e6d6e
80,056. card_id: C_ID_aa32aa22e5
86,906. card_id: C_ID_cab853c010
87,384. card_id: C_ID_4cdcd7bbb1


In [20]:
rf_outlier_card_ids

['C_ID_a8776b3483',
 'C_ID_6237a7c2b7',
 'C_ID_fc54efbb79',
 'C_ID_912688a50a',
 'C_ID_46eac83bde',
 'C_ID_1076351773',
 'C_ID_aae50409e7',
 'C_ID_ac114ef831',
 'C_ID_93c6231450',
 'C_ID_84e2faf848',
 'C_ID_e3c0c1325b',
 'C_ID_9771b57a38',
 'C_ID_0a21dd3613',
 'C_ID_fdf63e6d6e',
 'C_ID_aa32aa22e5',
 'C_ID_cab853c010',
 'C_ID_4cdcd7bbb1']

#### Logistic regression

In [21]:
clf = LogisticRegression(verbose=1)
clf.fit(X, y)

[LibLinear]

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=1, warm_start=False)

In [22]:
# df_test_clean has no missing values left, so no further dropna is needed
logistic_regression_outlier_pred = clf.predict(df_test_clean[features])

In [23]:
logistic_regression_outlier_pred.sum()

4

In [25]:
lr_outlier_card_ids = []
for i in range(len(logistic_regression_outlier_pred)):
    if logistic_regression_outlier_pred[i] == 1:
        print('{:,}. card_id: {}'.format(i, df_test_clean['card_id'].iloc[i]))
        lr_outlier_card_ids.append(df_test_clean['card_id'].iloc[i])

34,464. card_id: C_ID_1fa5e84ac3
47,151. card_id: C_ID_a5abf9fbc6
61,984. card_id: C_ID_2df75addf1
70,178. card_id: C_ID_2980922f6e


#### AdaBoost

In [41]:
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier()
clf.fit(X, y)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=100, random_state=None)

In [42]:
# df_test_clean has no missing values left, so no further dropna is needed
adaboost_outlier_pred = clf.predict(df_test_clean[features])

In [43]:
adaboost_outlier_pred.sum()

27

In [44]:
adaboost_outlier_card_ids = []
for i in range(len(adaboost_outlier_pred)):
    if adaboost_outlier_pred[i] == 1:
        print('{:,}. card_id: {}'.format(i, df_test_clean['card_id'].iloc[i]))
        adaboost_outlier_card_ids.append(df_test_clean['card_id'].iloc[i])

1,801. card_id: C_ID_489ac69e52
4,482. card_id: C_ID_4f5877bc71
4,974. card_id: C_ID_b7e0a64084
7,861. card_id: C_ID_4e8f2b5d1f
8,100. card_id: C_ID_25df4fa561
11,659. card_id: C_ID_9c760806b5
14,872. card_id: C_ID_aae50409e7
18,347. card_id: C_ID_159d993d5e
19,834. card_id: C_ID_e4a5ea82d4
28,012. card_id: C_ID_a8bc17bac9
30,817. card_id: C_ID_055de6e0ad
32,593. card_id: C_ID_bc1b6ba5c2
39,928. card_id: C_ID_761c27a0f2
40,296. card_id: C_ID_48f81f1eda
42,657. card_id: C_ID_e94aaeede0
44,918. card_id: C_ID_f2087398f1
53,081. card_id: C_ID_ed948f3ebc
53,321. card_id: C_ID_e895e3aaf4
60,393. card_id: C_ID_01103857f9
62,456. card_id: C_ID_ad43f734ab
67,919. card_id: C_ID_efbf650295
68,911. card_id: C_ID_8aa970734d
76,734. card_id: C_ID_4f47eac613
77,408. card_id: C_ID_ff19b99804
79,336. card_id: C_ID_bc6e652855
79,996. card_id: C_ID_d596794318
89,153. card_id: C_ID_5ddfd1c1c4


#### Gradient Boosting

In [None]:
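from sklearn.ensemble import GradientBoostingRegressor

# The hyperparameters have to be unpacked into keyword arguments;
# passing the dict itself as a positional argument does not set them.
params = {
    'n_estimators': 500,
    'max_depth': 4,
    'min_samples_split': 2,
    'learning_rate': 0.01,
    'loss': 'ls'  # least squares; named 'squared_error' in recent scikit-learn releases
}
clf = GradientBoostingRegressor(**params)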

clf.fit(X, y)

In [None]:
# A regressor returns continuous scores rather than 0/1 labels
gradient_boosting_outlier_pred = clf.predict(df_test_clean[features])

### Ensembling

In [34]:
def intersection(lst1, lst2): 
    return list(set(lst1) & set(lst2))

In [48]:
print('Size of the output of the logistic regression: {}\nSize of the output of the random forest: {}\nSize of the output of AdaBoost: {}'.format(len(lr_outlier_card_ids), len(rf_outlier_card_ids), len(adaboost_outlier_card_ids)))

Size of the output of the logistic regression: 4
Size of the output of the random forest: 17
Size of the output of AdaBoost: 27


In [49]:
intersection(lr_outlier_card_ids, rf_outlier_card_ids)

[]

In [46]:
intersection(lr_outlier_card_ids, adaboost_outlier_card_ids)

[]

In [47]:
intersection(rf_outlier_card_ids, adaboost_outlier_card_ids)

['C_ID_aae50409e7']
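
With more than two detectors, the pairwise intersections generalize to a simple vote count: keep the cards flagged by at least `min_votes` models. The sketch below is one possible way to write this; `flagged_by_at_least` is a hypothetical helper, and the lists contain only a few of the `card_id`s printed above, not the full model outputs.

```python
from collections import Counter

def flagged_by_at_least(min_votes, *id_lists):
    """card_ids that occur in at least `min_votes` of the given lists."""
    # set() makes sure that a model cannot vote twice for the same card.
    votes = Counter(card_id for ids in id_lists for card_id in set(ids))
    return sorted(card_id for card_id, n in votes.items() if n >= min_votes)

# Shortened lists for the illustration.
rf_ids = ['C_ID_a8776b3483', 'C_ID_aae50409e7']
ada_ids = ['C_ID_489ac69e52', 'C_ID_aae50409e7']
lr_ids = ['C_ID_1fa5e84ac3']

print(flagged_by_at_least(2, rf_ids, ada_ids, lr_ids))       # ['C_ID_aae50409e7']
print(len(flagged_by_at_least(1, rf_ids, ada_ids, lr_ids)))  # 4
```

Setting `min_votes` to the number of lists gives the intersection of all models, and `min_votes=1` gives their union.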

## LightGBM

In [54]:
clf = lgb.Booster(model_file='models/lightgbm_all.txt')

In [55]:
drops = ['card_id', 'first_active_month', 'target', 'outlier']
use_cols = [c for c in df_train.columns if c not in drops]
features = list(df_train[use_cols].columns)

In [56]:
predictions = clf.predict(df_test[features])

In [57]:
df_sub = pd.DataFrame({
    "card_id": df_test["card_id"].values
})
df_sub["target"] = predictions

In [58]:
# Updating the intersection of RF & AdaBoost
df_sub.loc[df_sub['card_id'] == 'C_ID_aae50409e7', 'target'] = -33.218750

In [31]:
for i in range(len(lr_outlier_card_ids)):
    print('The value of {} is {}'.format(lr_outlier_card_ids[i], df_sub.loc[df_sub['card_id'] == lr_outlier_card_ids[i], 'target'].values[0]))
    df_sub.loc[df_sub['card_id'] == lr_outlier_card_ids[i], 'target'] = -33.218750

The value of C_ID_1fa5e84ac3 is -5.515564316338457
The value of C_ID_a5abf9fbc6 is -3.9438269947352578
The value of C_ID_2df75addf1 is -5.361454488823543
The value of C_ID_2980922f6e is -6.559434848065991
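
The per-card loop above can also be expressed as a single boolean mask with `Series.isin`. The following is a minimal sketch on a toy submission frame: the frame, its `card_id`s and the name `OUTLIER_TARGET` are made up for the illustration, and only the constant itself is taken from the notebook.

```python
import pandas as pd

OUTLIER_TARGET = -33.21875  # the constant written into the submission in this notebook

# Toy submission frame; card_ids and targets are invented.
df_toy_sub = pd.DataFrame({
    'card_id': ['C_ID_a', 'C_ID_b', 'C_ID_c'],
    'target': [-0.5, 1.2, -5.5],
})
flagged_ids = ['C_ID_c', 'C_ID_missing']

# Select all flagged rows at once and overwrite their target.
mask = df_toy_sub['card_id'].isin(flagged_ids)
df_toy_sub.loc[mask, 'target'] = OUTLIER_TARGET
n_overridden = int(mask.sum())
print(n_overridden)  # 1
```

Unlike the loop, which indexes `.values[0]` and would fail for a `card_id` that is absent from the frame, the mask simply ignores ids it cannot find.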


In [59]:
len(df_sub[df_sub['target'] < -30])

1

In [60]:
df_sub.to_csv("output/lgbm_rf_and_adaboost_outliers.csv", index=False)

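Leaderboard (LB) scores of the submissions, listed by the model(s) whose predicted outliers were written into the LightGBM predictions (RMSE, lower is better):

* Random Forest (LB score: 5.994)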
* Logistic Regression (LB score: 5.990)
* Random Forest & AdaBoost (LB score: 5.986)