# Elo Merchant Category Recommendation
In this tutorial you will work on the [Elo Merchant Category Recommendation](https://www.kaggle.com/c/elo-merchant-category-recommendation) contest with the help of LynxKite. LynxKite does not yet support some of the data preprocessing steps, so they are done in Python here.

First download the input files from the [competition data page](https://www.kaggle.com/c/elo-merchant-category-recommendation/data), unzip them, and copy the extracted files to the `input` folder. These files are:

- **train.csv**, **test.csv**: list of `card_ids` that can be used for training and prediction
- **historical_transactions.csv**: contains up to 3 months' worth of transactions for every card at any of the provided `merchant_ids`
- **new_merchant_transactions.csv**: contains the transactions at new merchants (`merchant_ids` that this particular `card_id` has not yet visited) over a period of two months
- **merchants.csv**: contains aggregate information for each `merchant_id` represented in the data set

### Preprocessing the data
First we need to import several libraries, then load the test and train data.

In [1]:
import gc # garbage collector
import warnings
import datetime
import numpy as np
import pandas as pd

warnings.filterwarnings("ignore")

In [2]:
df_train = pd.read_csv("input/train.csv", parse_dates=["first_active_month"])
df_test = pd.read_csv("input/test.csv", parse_dates=["first_active_month"])
print("{:,} observations and {} features in train set.".format(df_train.shape[0], df_train.shape[1]))
print("{:,} observations and {} features in test set.".format(df_test.shape[0], df_test.shape[1]))

201,917 observations and 6 features in train set.
123,623 observations and 5 features in test set.


In [3]:
df_train[:3]

Unnamed: 0,first_active_month,card_id,feature_1,feature_2,feature_3,target
0,2017-06-01,C_ID_92a2005557,5,2,1,-0.820283
1,2017-01-01,C_ID_3d0044924f,4,1,0,0.392913
2,2016-08-01,C_ID_d639edf6cd,2,2,0,0.688056


In [4]:
df_test[:3]

Unnamed: 0,first_active_month,card_id,feature_1,feature_2,feature_3
0,2017-04-01,C_ID_0ab67a22ab,3,3,1
1,2017-01-01,C_ID_130fd0cbdd,2,3,0
2,2017-08-01,C_ID_b709037bc5,5,1,1


As you can see, the test set does not have the target variable.
Next, check the value ranges of the `feature_1`, `feature_2` and `feature_3` features.

In [5]:
print('Feature_1: ' + str(df_train['feature_1'].min()) + '-' + str(df_train['feature_1'].max()))
print('Feature_2: ' + str(df_train['feature_2'].min()) + '-' + str(df_train['feature_2'].max()))
print('Feature_3: ' + str(df_train['feature_3'].min()) + '-' + str(df_train['feature_3'].max()))

Feature_1: 1-5
Feature_2: 1-3
Feature_3: 0-1
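
The minimum and maximum alone do not show whether every value in between actually occurs. A minimal sketch of a complementary check, run on a hypothetical toy column rather than the real `df_train`:

```python
import pandas as pd

# Hypothetical toy column; in the notebook this would be df_train['feature_1'].
toy = pd.DataFrame({"feature_1": [5, 4, 2, 5, 1, 3]})

# List the distinct values instead of only the two extremes.
distinct_values = sorted(toy["feature_1"].unique().tolist())
print(distinct_values)  # [1, 2, 3, 4, 5]

# Count how often each value occurs, ordered by value.
counts = toy["feature_1"].value_counts().sort_index()
print(counts)
```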


`feature_1` and `feature_2` need to be converted to **one-hot vectors** (more info on [one-hot vectors](https://en.wikipedia.org/wiki/One-hot)). `feature_3` takes only the values 0 and 1, so it can be used as it is.

In [6]:
df_train = pd.get_dummies(df_train, columns=['feature_1', 'feature_2'])
df_test = pd.get_dummies(df_test, columns=['feature_1', 'feature_2'])
df_train[:3]

Unnamed: 0,first_active_month,card_id,feature_3,target,feature_1_1,feature_1_2,feature_1_3,feature_1_4,feature_1_5,feature_2_1,feature_2_2,feature_2_3
0,2017-06-01,C_ID_92a2005557,1,-0.820283,0,0,0,0,1,0,1,0
1,2017-01-01,C_ID_3d0044924f,0,0.392913,0,0,0,1,0,1,0,0
2,2016-08-01,C_ID_d639edf6cd,0,0.688056,0,1,0,0,0,0,1,0
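
Because the training and test sets are encoded separately, their indicator columns only line up if both contain the same feature values. That appears to hold for this data, but it is not guaranteed in general. The sketch below uses hypothetical toy frames (not the competition data) to show one way to force the test columns into the training layout:

```python
import pandas as pd

# Hypothetical toy data: the value 2 never occurs in the test frame.
toy_train = pd.DataFrame({"feature_1": [1, 2, 3]})
toy_test = pd.DataFrame({"feature_1": [1, 3]})

train_encoded = pd.get_dummies(toy_train, columns=["feature_1"])
test_encoded = pd.get_dummies(toy_test, columns=["feature_1"])

# Reindex the test columns to the training layout;
# indicators that are missing from the test frame are filled with 0.
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(list(test_encoded.columns))
# ['feature_1_1', 'feature_1_2', 'feature_1_3']
```

With the real data the training frame also carries the `target` column, which would have to be left out of the column list used for the test frame.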


Extract the year and the month from the `first_active_month` attribute.

In [7]:
df_train["year"] = df_train["first_active_month"].dt.year
df_test["year"] = df_test["first_active_month"].dt.year

df_train["month"] = df_train["first_active_month"].dt.month
df_test["month"] = df_test["first_active_month"].dt.month

The `first_active_month` attribute contains the month when the customer used the bank card for the first time. It might be a good idea to convert it to the number of days until the last day of the sample. Let's check the first and last month in both the train and the test set.

In [8]:
test_first = df_test['first_active_month'].min()
test_last = df_test['first_active_month'].max()
train_first = df_train['first_active_month'].min()
train_last = df_train['first_active_month'].max()

print("The first_active_month attribute in the test set is ranging between {0}-{1:02}-{2:02} and {3}-{4:02}-{5:02}.".format(test_first.year, test_first.month, test_first.day, test_last.year, test_last.month, test_last.day))
print("The first_active_month attribute in the train set is ranging between {0}-{1:02}-{2:02} and {3}-{4:02}-{5:02}.".format(train_first.year, train_first.month, train_first.day, train_last.year, train_last.month, train_last.day))

The first_active_month attribute in the test set is ranging between 2011-11-01 and 2018-01-01.
The first_active_month attribute in the train set is ranging between 2011-11-01 and 2018-02-01.


The latest month across the training and test sets is _2018-02-01_, so we will calculate the difference in days relative to this date.

In [9]:
# Subtract the timestamp column directly; the result is a timedelta column with a .dt.days accessor.
df_train['elapsed_days'] = (pd.Timestamp(2018, 2, 1) - df_train['first_active_month']).dt.days
df_test['elapsed_days'] = (pd.Timestamp(2018, 2, 1) - df_test['first_active_month']).dt.days

In [10]:
df_train[:3]

Unnamed: 0,first_active_month,card_id,feature_3,target,feature_1_1,feature_1_2,feature_1_3,feature_1_4,feature_1_5,feature_2_1,feature_2_2,feature_2_3,year,month,elapsed_days
0,2017-06-01,C_ID_92a2005557,1,-0.820283,0,0,0,0,1,0,1,0,2017,6,245
1,2017-01-01,C_ID_3d0044924f,0,0.392913,0,0,0,1,0,1,0,0,2017,1,396
2,2016-08-01,C_ID_d639edf6cd,0,0.688056,0,1,0,0,0,0,1,0,2016,8,549
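
The `elapsed_days` values in the table above can be checked on their own. A small sketch that recomputes them for the three months shown, using a standalone series instead of the full dataframe:

```python
import pandas as pd

# Reference date used in the tutorial.
reference_date = pd.Timestamp("2018-02-01")

# The three first_active_month values from the table above.
first_active = pd.Series(pd.to_datetime(["2017-06-01", "2017-01-01", "2016-08-01"]))

# Timestamp minus a datetime column gives a timedelta column.
elapsed_days = (reference_date - first_active).dt.days
print(elapsed_days.tolist())  # [245, 396, 549]
```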


As a next step, we will join both the _historical_ and the _new merchant transactions_ to the train and test data. First we'll load them into memory.

In [11]:
df_hist_trans = pd.read_csv("input/historical_transactions.csv")
df_new_trans = pd.read_csv("input/new_merchant_transactions.csv")

Since the `historical_transactions.csv` file is 2.65 GB on disk, holding it in memory as a dataframe takes a lot of RAM. The following function reduces the memory usage of a dataframe by storing each numeric column in the smallest numeric type that can hold its values.

In [12]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Starting memory usage: {:5.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Reduced memory usage: {:5.2f} MB ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [13]:
df_hist_trans = reduce_mem_usage(df_hist_trans)
df_new_trans = reduce_mem_usage(df_new_trans)

Starting memory usage: 3109.54 MB
Reduced memory usage: 1749.11 MB (43.7% reduction)
Starting memory usage: 209.67 MB
Reduced memory usage: 114.20 MB (45.5% reduction)
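
To illustrate where the savings come from, here is a minimal sketch on a hypothetical toy column. It also shows the price of downcasting floats: `float16` keeps only about three significant decimal digits, so values stored that way are rounded.

```python
import numpy as np
import pandas as pd

# Toy column holding small integers, similar in range to month_lag.
toy = pd.DataFrame({"month_lag": np.arange(-13, 1, dtype=np.int64)})
bytes_before = toy["month_lag"].memory_usage(index=False)

# All values fit into the int8 range [-128, 127], so one byte per value is enough.
toy["month_lag"] = toy["month_lag"].astype(np.int8)
bytes_after = toy["month_lag"].memory_usage(index=False)
print(bytes_before, bytes_after)  # 112 14

# Downcasting floats trades precision for memory.
rounded = float(np.float16(0.7033312))
print(rounded)  # 0.703125
```

If the rounding matters for a column such as `purchase_amount`, one option is to skip the `float16` branch and stop at `float32`.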


In [14]:
df_hist_trans[:3]

Unnamed: 0,authorized_flag,card_id,city_id,category_1,installments,category_3,merchant_category_id,merchant_id,month_lag,purchase_amount,purchase_date,category_2,state_id,subsector_id
0,Y,C_ID_4e6213e9bc,88,N,0,A,80,M_ID_e020e9b302,-8,-0.703331,2017-06-25 15:33:07,1.0,16,37
1,Y,C_ID_4e6213e9bc,88,N,0,A,367,M_ID_86ec983688,-7,-0.733128,2017-07-15 12:10:45,1.0,16,16
2,Y,C_ID_4e6213e9bc,88,N,0,A,80,M_ID_979ed661fc,-6,-0.720386,2017-08-09 22:04:29,1.0,16,37


In [15]:
df_new_trans[:3]

Unnamed: 0,authorized_flag,card_id,city_id,category_1,installments,category_3,merchant_category_id,merchant_id,month_lag,purchase_amount,purchase_date,category_2,state_id,subsector_id
0,Y,C_ID_415bb3a509,107,N,1,B,307,M_ID_b0c793002c,1,-0.557617,2018-03-11 14:57:36,1.0,9,19
1,Y,C_ID_415bb3a509,140,N,1,B,307,M_ID_88920c89e8,1,-0.569336,2018-03-19 18:53:37,1.0,9,19
2,Y,C_ID_415bb3a509,330,N,1,B,507,M_ID_ad5237ef6b,2,-0.55127,2018-04-26 14:08:44,1.0,9,14


`category_1` and `authorized_flag` are stored as strings, so we need to map them to integers, then convert `category_2` and `category_3` to one-hot vectors.

In [16]:
df_hist_trans['authorized_flag'] = df_hist_trans['authorized_flag'].map({'Y': 1, 'N': 0})
df_hist_trans['category_1'] = df_hist_trans['category_1'].map({'Y': 1, 'N': 0})

df_new_trans['authorized_flag'] = df_new_trans['authorized_flag'].map({'Y': 1, 'N': 0})
df_new_trans['category_1'] = df_new_trans['category_1'].map({'Y': 1, 'N': 0})

df_hist_trans = pd.get_dummies(df_hist_trans, columns=['category_2', 'category_3'])
df_new_trans = pd.get_dummies(df_new_trans, columns=['category_2', 'category_3'])
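
A small sketch of what these two steps do, on a hypothetical toy frame. It also shows that, with the default settings of `pd.get_dummies`, a missing `category_2` value produces a row whose indicator columns are all zero:

```python
import numpy as np
import pandas as pd

# Hypothetical toy transactions; the second row has no category_2 value.
toy = pd.DataFrame({
    "authorized_flag": ["Y", "N", "Y"],
    "category_2": [1.0, np.nan, 3.0],
})

# Map the Y/N strings to 1/0.
toy["authorized_flag"] = toy["authorized_flag"].map({"Y": 1, "N": 0})

# One indicator column per distinct non-missing value.
encoded = pd.get_dummies(toy, columns=["category_2"])
print(list(encoded.columns))
# ['authorized_flag', 'category_2_1.0', 'category_2_3.0']

# The row with the missing value has no indicator set.
indicator_columns = ["category_2_1.0", "category_2_3.0"]
missing_row_total = int(encoded.loc[1, indicator_columns].sum())
print(missing_row_total)  # 0
```

Passing `dummy_na=True` to `pd.get_dummies` would add a separate indicator column for missing values instead.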

Then the historical and new transactions can be joined with the training and test sets.

In [17]:
df_train = pd.merge(df_train, df_new_trans, on='card_id', how='left')
df_test = pd.merge(df_test, df_new_trans, on='card_id', how='left')
del df_new_trans
gc.collect()

49

In [18]:
df_train = pd.merge(df_train, df_hist_trans, on='card_id', how='left')
df_test = pd.merge(df_test, df_hist_trans, on='card_id', how='left')
del df_hist_trans
gc.collect()

MemoryError: 
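
The `MemoryError` above suggests that the row-level join of the historical transactions did not fit into memory on the machine used. A left join on `card_id` repeats every card row once per transaction, so the result grows with the number of transactions. One possible way to keep the result at one row per card is to aggregate the transactions before merging. This is only a sketch on hypothetical toy data, and the aggregate column names are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data: one card with three transactions, one card with none.
toy_cards = pd.DataFrame({"card_id": ["C_ID_a", "C_ID_b"]})
toy_trans = pd.DataFrame({
    "card_id": ["C_ID_a", "C_ID_a", "C_ID_a"],
    "purchase_amount": [-0.7, -0.5, -0.6],
    "authorized_flag": [1, 0, 1],
})

# Row-level join: the card row is repeated once per transaction.
row_level = pd.merge(toy_cards, toy_trans, on="card_id", how="left")
print(len(row_level))  # 4

# Aggregate to one row per card first (column names are illustrative).
per_card = toy_trans.groupby("card_id").agg(
    hist_purchase_sum=("purchase_amount", "sum"),
    hist_purchase_mean=("purchase_amount", "mean"),
    hist_authorized_share=("authorized_flag", "mean"),
    hist_transaction_count=("purchase_amount", "count"),
).reset_index()

# Joining the aggregates keeps one row per card.
aggregated = pd.merge(toy_cards, per_card, on="card_id", how="left")
print(len(aggregated))  # 2
```

Which aggregates are useful depends on the model. Aggregating also discards the per-transaction detail that the row-level join would keep.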

In [19]:
df_train[:3]

Unnamed: 0,first_active_month,card_id,feature_3,target,feature_1_1,feature_1_2,feature_1_3,feature_1_4,feature_1_5,feature_2_1,...,state_id,subsector_id,category_2_1.0,category_2_2.0,category_2_3.0,category_2_4.0,category_2_5.0,category_3_A,category_3_B,category_3_C
0,2017-06-01,C_ID_92a2005557,1,-0.820283,0,0,0,0,1,0,...,9.0,37.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,2017-06-01,C_ID_92a2005557,1,-0.820283,0,0,0,0,1,0,...,9.0,37.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,2017-06-01,C_ID_92a2005557,1,-0.820283,0,0,0,0,1,0,...,9.0,37.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [20]:
df_test[:3]

Unnamed: 0,first_active_month,card_id,feature_3,feature_1_1,feature_1_2,feature_1_3,feature_1_4,feature_1_5,feature_2_1,feature_2_2,...,state_id,subsector_id,category_2_1.0,category_2_2.0,category_2_3.0,category_2_4.0,category_2_5.0,category_3_A,category_3_B,category_3_C
0,2017-04-01,C_ID_0ab67a22ab,1,0,0,1,0,0,0,0,...,12.0,21.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,2017-04-01,C_ID_0ab67a22ab,1,0,0,1,0,0,0,0,...,12.0,19.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,2017-04-01,C_ID_0ab67a22ab,1,0,0,1,0,0,0,0,...,12.0,37.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


LynxKite does not support dots in column names, so the affected columns need to be renamed before saving.

In [21]:
# Mapping from the dotted column names to LynxKite-compatible names.
dotted_to_plain = {
    "category_2_1.0": "category_2_1_0",
    "category_2_2.0": "category_2_2_0",
    "category_2_3.0": "category_2_3_0",
    "category_2_4.0": "category_2_4_0",
    "category_2_5.0": "category_2_5_0",
}
df_train.rename(index=str, columns=dotted_to_plain, inplace=True)
df_test.rename(index=str, columns=dotted_to_plain, inplace=True)
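
Listing every column by hand works for five names. As an alternative sketch, the dots can be replaced in all column names at once, shown here on a hypothetical empty toy frame:

```python
import pandas as pd

# Hypothetical toy frame with one dotted column name.
toy = pd.DataFrame(columns=["card_id", "category_2_1.0", "category_3_A"])

# Replace every literal dot; regex=False keeps "." from matching any character.
toy.columns = toy.columns.str.replace(".", "_", regex=False)
print(list(toy.columns))  # ['card_id', 'category_2_1_0', 'category_3_A']
```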

In [None]:
list(df_train.columns.values), list(df_test.columns.values) 

The preprocessed training and test sets can now be saved to disk as CSV.

In [22]:
df_train.to_csv('output/train_preprocessed.csv')
df_test.to_csv('output/test_preprocessed.csv')
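
By default `to_csv` also writes the dataframe index as an unnamed first column. Whether LynxKite's CSV import expects that column is not stated here, so treat the following as an optional sketch. It compares the two header lines using an in-memory buffer and hypothetical toy data:

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for the preprocessed data.
toy = pd.DataFrame({"card_id": ["C_ID_a"], "target": [0.5]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
header_with_index = with_index.getvalue().splitlines()[0]
print(header_with_index)  # ,card_id,target

# index=False leaves the index out.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
header_without_index = without_index.getvalue().splitlines()[0]
print(header_without_index)  # card_id,target
```

Note that the `output` folder has to exist before the files are written.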