# Elo Merchant Category Recommendation - Autoencoder
End date: _2019. february 19._<br/>

This tutorial notebook is the second part of a seriers for [Elo Mechant Category Recommendation](https://www.kaggle.com/c/elo-merchant-category-recommendation) contest organized by Elo, one of the largest payment brands in Brazil. It has built partnerships with merchants in order to offer promotions or discounts to cardholders. The objective of the competition is to identify and serve the most relevant opportunities to individuals, by uncovering signals in customer loyalty. The input files are available from the [download](https://www.kaggle.com/c/elo-merchant-category-recommendation/data) section of the contest:

- **train.csv**,  **test.csv**: list of `card_ids` that can be used for training and testing
- **historical_transactions.csv**: contains up to 3 months' worth of transactions for every card at any of the provided `merchant_ids`
- **new_merchant_transactions.csv**: contains the transactions at new merchants (`merchant_ids` that this particular `card_id` 
has not yet visited) over a period of two months
- **merchants.csv**: contains aggregate information for each `merchant_id` represented in the data set

In [16]:
import gc
import math
import time
import statistics
import numpy as np
import pandas as pd
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.metrics import log_loss
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold, KFold

%matplotlib inline

In [2]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Starting memory usage: {:5.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Reduced memory usage: {:5.2f} MB ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

## Loading the input data

In [None]:
df_merch = pd.read_csv("input/merchants.csv")
print("{:,} records and {} features in merchant set.".format(df_merch.shape[0], df_merch.shape[1]))

In [None]:
df_merch['category_1'] = df_merch['category_1'].map({'N': 0, 'Y': 1})
df_merch['category_2'] = pd.to_numeric(df_merch['category_2'])
df_merch['category_4'] = df_merch['category_4'].map({'N': 0, 'Y': 1})
df_merch['most_recent_sales_range'] = df_merch['most_recent_sales_range'].map({'E': 0, 'D': 1, 'C': 2, 'B': 3, 'A': 4})
df_merch['most_recent_purchases_range'] = df_merch['most_recent_purchases_range'].map({'E': 0, 'D': 1, 'C': 2, 'B': 3, 'A': 4})

In [None]:
dropping = ['city_id', 'state_id']
for var in dropping:
    df_merch = df_merch.drop(var, axis=1)

In [None]:
df_merch = reduce_mem_usage(df_merch)

In [None]:
df_new_trans = pd.read_csv("input/trans_merch_new_agg.csv", index_col=0)
df_new_trans = reduce_mem_usage(df_new_trans)

df_hist_trans = pd.read_csv("input/trans_merch_hist_agg.csv", index_col=0)
df_hist_trans = reduce_mem_usage(df_hist_trans)

In [None]:
df_train = pd.read_csv("input/train.csv")
df_train = reduce_mem_usage(df_train)

df_test = pd.read_csv("input/test.csv")
df_test = reduce_mem_usage(df_test)

print("{:,} records and {} features in train set.".format(df_train.shape[0], df_train.shape[1]))
print("{:,} records and {} features in test set.".format(df_test.shape[0], df_test.shape[1]))

In [None]:
df_train = pd.merge(df_train, df_hist_trans, on='card_id',how='left')
df_test = pd.merge(df_test, df_hist_trans, on='card_id',how='left')

df_train = pd.merge(df_train, df_new_trans, on='card_id', how='left')
df_test = pd.merge(df_test, df_new_trans, on='card_id', how='left')

In [None]:
del df_hist_trans
del df_new_trans
gc.collect()

In [None]:
df_train[:3]

In [None]:
df_test[:3]

In [None]:
print('According to Rice\'s rule, the number of bins is {:.0f} (for the whole set)'.format(math.sqrt(df_train.shape[0])*2))

In [None]:
plt.style.use("seaborn")
plt.figure(figsize=(15, 6))
plt.hist(df_train['target'].values, bins=899)
plt.title('Histogram target counts')
plt.xlabel('Count')
plt.xticks(rotation=60)
plt.ylabel('Target')
plt.show()

As you can see, there is a peak around -33 and also the number of 0's are extremely high. It might be a good idea to handle them later.

### Marking the outliers

In [None]:
df_train['outlier'] = np.where(df_train['target']<-30, 1, 0)

In [None]:
print('There are {:,} marked outliers in the training set.'.format(len(df_train[df_train['outlier'] == 1]['outlier'])))

In [None]:
count_classes = pd.value_counts(df_train['outlier'], sort = True)
count_classes.plot(kind = 'bar', rot=0)
plt.title("Transaction class distribution")
plt.xticks(range(2), ["Normal", "Outlier"])
plt.ylabel("Frequency")

### Filtering

In [None]:
cols = ['first_active_month', 'hist_merchant_id_mode', 'new_merchant_id_mode', 'hist_avg_purchases_lag3_sum', 'hist_avg_purchases_lag3_mean', 'hist_avg_purchases_lag6_sum', 'hist_avg_purchases_lag6_mean', 'hist_avg_purchases_lag12_sum', 'hist_avg_purchases_lag12_mean']

df_test.drop(columns=cols, axis=1, inplace=True)
cols.append('card_id')
df_train.drop(columns=cols, axis=1, inplace=True)

In [None]:
print('Number of records in the training set: {:,}, in the test set: {:,}'.format(len(df_train), len(df_test)))

In [None]:
df_train.dropna(how='any', axis=0, inplace=True)

In [None]:
print('Number of records in the training set: {:,}, in the test set: {:,}'.format(len(df_train), len(df_test)))

### Normalization

In [None]:
for f in df_train.columns:
    if f != 'outlier':
        mean = statistics.mean(df_train[f])
        std = statistics.stdev(df_train[f])
    
        df_train[f] = (df_train[f] - mean)/std
        print('{}: {:.4f} ({:.4f})'.format(f, mean, std))

In [None]:
df_train[:5]

In [None]:
df_train.to_csv('input/train_preprocessed.csv')
df_test.to_csv('input/test_preprocessed.csv')

### Loading normalized input data

In [22]:
%%time
df_train = pd.read_csv("input/train_preprocessed.csv")
df_test = pd.read_csv("input/test_preprocessed.csv")

CPU times: user 12.9 s, sys: 1.34 s, total: 14.3 s
Wall time: 14.3 s


## Outlier detection
[Combining your model with a model without outlier](https://www.kaggle.com/waitingli/combining-your-model-with-a-model-without-outlier)
### Filtering out the outliers

In [4]:
df_train = df_train[df_train['outlier'] == 0]
target = df_train['target']
del df_train['target']
features = [c for c in df_train.columns if c not in ['card_id', 'first_active_month','outlier']]
categorical_feats = [c for c in features if 'feature_' in c]

### Training with LightGBM on training set without outliers

In [5]:
param = {
    'objective': 'regression',
    'num_leaves': 31,
    'min_data_in_leaf': 25,
    'max_depth': 7,
    'learning_rate': 0.01,
    'lambda_l1': 0.13,
    'boosting': 'gbdt',
    'feature_fraction': 0.85,
    'bagging_freq':8,
    'bagging_fraction': 0.9,
    'metric': 'rmse',
    'verbosity': -1,
    'random_state': 2333
}

In [6]:
%%time
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2333)
oof = np.zeros(len(df_train))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train, df_train['outlier'].values)):
    print("Fold {}.".format(fold_+1))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][features], label=target.iloc[trn_idx])
    val_data = lgb.Dataset(df_train.iloc[val_idx][features], label=target.iloc[val_idx])

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds=200)
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[features], num_iteration=clf.best_iteration) / folds.n_splits

Fold 1.
Training until validation scores don't improve for 200 rounds.
[100]	training's rmse: 0.439947	valid_1's rmse: 0.441321
[200]	training's rmse: 0.432385	valid_1's rmse: 0.435234
[300]	training's rmse: 0.427882	valid_1's rmse: 0.432426
[400]	training's rmse: 0.424587	valid_1's rmse: 0.430874
[500]	training's rmse: 0.421963	valid_1's rmse: 0.429922
[600]	training's rmse: 0.419742	valid_1's rmse: 0.429409
[700]	training's rmse: 0.417719	valid_1's rmse: 0.429075
[800]	training's rmse: 0.41592	valid_1's rmse: 0.428848
[900]	training's rmse: 0.414225	valid_1's rmse: 0.42871
[1000]	training's rmse: 0.412562	valid_1's rmse: 0.428595
[1100]	training's rmse: 0.411094	valid_1's rmse: 0.428542
[1200]	training's rmse: 0.409636	valid_1's rmse: 0.428484
[1300]	training's rmse: 0.408251	valid_1's rmse: 0.428467
[1400]	training's rmse: 0.407019	valid_1's rmse: 0.428449
[1500]	training's rmse: 0.405691	valid_1's rmse: 0.428428
[1600]	training's rmse: 0.404366	valid_1's rmse: 0.428403
[1700]	train

In [9]:
print("CV score: {:<8.5f}".format(mean_squared_error(oof, target)**0.5))

CV score: 0.42831 


In [64]:
model_without_outliers = pd.DataFrame({"card_id": df_test["card_id"].values})
model_without_outliers.set_index("card_id", inplace=True)
model_without_outliers["target"] = predictions

In [66]:
model_without_outliers.to_csv('output/lightgbm_wo_outliers_normalized.csv')

### Training Model For Outliers Classification

In [12]:
target = df_train['outlier']
del df_train['outlier']
del df_train['target']

In [13]:
features = [c for c in df_train.columns if c not in ['card_id', 'first_active_month']]
categorical_feats = [c for c in features if 'feature_' in c]

In [14]:
param = {
    'num_leaves': 31,
    'min_data_in_leaf': 30, 
    'objective':'binary',
    'max_depth': 6,
    'learning_rate': 0.01,
    'boosting': 'rf',
    'feature_fraction': 0.9,
    'bagging_freq': 1,
    'bagging_fraction': 0.9 ,
    'bagging_seed': 11,
    'metric': 'binary_logloss',
    'lambda_l1': 0.1,
    'verbosity': -1,
    'random_state': 2333
}

In [17]:
%%time
folds = KFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(df_train))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

start = time.time()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train.values, target.values)):
    print("Fold {}.".format(fold_+1))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][features], label=target.iloc[trn_idx], categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train.iloc[val_idx][features], label=target.iloc[val_idx], categorical_feature=categorical_feats)

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 200)
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][features], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[features], num_iteration=clf.best_iteration) / folds.n_splits

Fold 1.




Training until validation scores don't improve for 200 rounds.
[100]	training's binary_logloss: 0.0318888	valid_1's binary_logloss: 0.0361328
[200]	training's binary_logloss: 0.0318484	valid_1's binary_logloss: 0.0360422
[300]	training's binary_logloss: 0.0318666	valid_1's binary_logloss: 0.0360633
Early stopping, best iteration is:
[136]	training's binary_logloss: 0.0318359	valid_1's binary_logloss: 0.0360665
Fold 2.
Training until validation scores don't improve for 200 rounds.
[100]	training's binary_logloss: 0.0328958	valid_1's binary_logloss: 0.0322714
[200]	training's binary_logloss: 0.0328354	valid_1's binary_logloss: 0.0322384
[300]	training's binary_logloss: 0.0328441	valid_1's binary_logloss: 0.0322514
Early stopping, best iteration is:
[173]	training's binary_logloss: 0.0328316	valid_1's binary_logloss: 0.0322262
Fold 3.
Training until validation scores don't improve for 200 rounds.
[100]	training's binary_logloss: 0.0325969	valid_1's binary_logloss: 0.0346276
[200]	training

In [18]:
print("CV score: {:<8.5f}".format(log_loss(target, oof)))

CV score: 0.03469 


In [19]:
df_outlier_prob = pd.DataFrame({"card_id": df_test["card_id"].values})
df_outlier_prob["target"] = predictions
df_outlier_prob.head()

Unnamed: 0,card_id,target
0,C_ID_0ab67a22ab,0.002081
1,C_ID_130fd0cbdd,0.002417
2,C_ID_b709037bc5,0.002081
3,C_ID_d27d835a9f,0.001607
4,C_ID_2b5e3df5c2,0.001583


In [20]:
df_outlier_prob.to_csv('output/outlier_test_normalized.csv')

### Combining solutions

In [51]:
op = len(df_train[df_train['outlier']==1])/len(df_train)
nr = int(op*len(df_test))
print('The percentage of outliers in the training set is {:.4f}% ({:,} records)'.format(op, len(df_train[df_train['outlier']==1])))
print('If the test set has the same ration of outliers as training set, then {:.4f}% of the test set is {:,} records.'.format(op, nr))

The percentage of outliers in the training set is 0.0076% (1,125 records)
If the test set has the same ration of outliers as training set, then 0.0076% of the test set is 941 records.


In [52]:
df_outlier_prob.sort_values(by='target', axis=0, ascending=False, inplace=True)

In [53]:
df_outlier_prob['target'].max(), df_outlier_prob['target'].min()

(0.034976826592427804, 0.0013762192493320057)

In [54]:
df_outlier_prob[:5]

Unnamed: 0,card_id,target
3709,C_ID_fe89c1890a,0.034977
91167,C_ID_7a451830de,0.034548
85220,C_ID_6cca036355,0.030155
50372,C_ID_65e5c44c3e,0.030047
102165,C_ID_f4225643ec,0.029902


In [82]:
for i in df_outlier_prob['card_id'][:nr]:
    model_without_outliers.loc[i]['target'] = -33.218750

In [85]:
len(model_without_outliers[model_without_outliers['target'] == -33.218750])

941

In [86]:
model_without_outliers.to_csv('output/lgbm_0.42831_normalized_added_outliers.csv')

In [87]:
model_without_outliers.shape

(123623, 1)