# Elo Merchant Category Recommendation
In this tutorial you will work on the [Elo Merchant Category Recommendation](https://www.kaggle.com/c/elo-merchant-category-recommendation) contest with the help of LynxKite. LynxKite does not yet support some of the data preprocessing, so that part is done in Python.

First, download the input files from the [competition data page](https://www.kaggle.com/c/elo-merchant-category-recommendation/data), unzip them, and copy the extracted files to the `input` folder. These files are:

- **train.csv**, **test.csv**: list of `card_ids` that can be used for training and prediction
- **historical_transactions.csv**: contains up to 3 months' worth of transactions for every card at any of the provided `merchant_ids`
- **new_merchant_transactions.csv**: contains the transactions at new merchants (`merchant_ids` that this particular `card_id` has not yet visited) over a period of two months
- **merchants.csv**: contains aggregate information for each `merchant_id` represented in the data set
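
Before loading anything, it can help to confirm that all five files really are in the `input` folder. The sketch below is only an illustration: the helper name `missing_input_files` is hypothetical, and it assumes the notebook runs from the directory that contains `input`.

```python
from pathlib import Path

# File names as listed above; the folder is relative to the working directory.
EXPECTED_FILES = [
    "train.csv",
    "test.csv",
    "historical_transactions.csv",
    "new_merchant_transactions.csv",
    "merchants.csv",
]


def missing_input_files(input_dir="input"):
    """Return the expected file names that are not present in input_dir."""
    folder = Path(input_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


missing = missing_input_files()
if missing:
    print("Missing from the input folder:", ", ".join(missing))
else:
    print("All input files found.")
```

If the list is not empty, the later `pd.read_csv` calls on those files will fail.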

### Descriptives

In [1]:
import gc # garbage collector
import warnings
import datetime
import numpy as np
import pandas as pd

warnings.filterwarnings("ignore")

In [2]:
df_train = pd.read_csv("input/train.csv", parse_dates=["first_active_month"])
df_test = pd.read_csv("input/test.csv", parse_dates=["first_active_month"])
print("{:,} observations and {} features in train set.".format(df_train.shape[0], df_train.shape[1]))
print("{:,} observations and {} features in test set.".format(df_test.shape[0], df_test.shape[1]))

201,917 observations and 6 features in train set.
123,623 observations and 5 features in test set.


In [3]:
df_train[:3]

|   | first_active_month | card_id | feature_1 | feature_2 | feature_3 | target |
|---|---|---|---|---|---|---|
| 0 | 2017-06-01 | C_ID_92a2005557 | 5 | 2 | 1 | -0.820283 |
| 1 | 2017-01-01 | C_ID_3d0044924f | 4 | 1 | 0 | 0.392913 |
| 2 | 2016-08-01 | C_ID_d639edf6cd | 2 | 2 | 0 | 0.688056 |


In [4]:
df_test[:3]

|   | first_active_month | card_id | feature_1 | feature_2 | feature_3 |
|---|---|---|---|---|---|
| 0 | 2017-04-01 | C_ID_0ab67a22ab | 3 | 3 | 1 |
| 1 | 2017-01-01 | C_ID_130fd0cbdd | 2 | 3 | 0 |
| 2 | 2017-08-01 | C_ID_b709037bc5 | 5 | 1 | 1 |


As you can see, the test set does not have the target variable.
Next, check the range of values taken by `feature_1`, `feature_2` and `feature_3`:

In [5]:
# Print the minimum and maximum of each anonymized feature column.
for col in ['feature_1', 'feature_2', 'feature_3']:
    print('{}: {}-{}'.format(col.capitalize(), df_train[col].min(), df_train[col].max()))

Feature_1: 1-5
Feature_2: 1-3
Feature_3: 0-1
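
The minimum and maximum only show the range. To see the actual set of distinct values, something like the following sketch could be used. It runs on a small made-up frame rather than on `df_train`, and the helper name `value_sets` is hypothetical.

```python
import pandas as pd

# Toy stand-in for df_train; the values are made up for illustration.
toy = pd.DataFrame({
    "feature_1": [5, 4, 2, 5, 1, 3],
    "feature_2": [2, 1, 2, 3, 1, 3],
    "feature_3": [1, 0, 0, 1, 0, 1],
})


def value_sets(df, columns):
    """Map each column name to the sorted list of distinct values it contains."""
    return {col: sorted(df[col].unique().tolist()) for col in columns}


sets = value_sets(toy, ["feature_1", "feature_2", "feature_3"])
for col, values in sets.items():
    print(col, values)
```

Applied to the real data, the call would be `value_sets(df_train, ['feature_1', 'feature_2', 'feature_3'])`.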


As a next step, we will join both the _historical_ and the _new merchant transactions_ to the train and test data. First we'll load them into memory.

In [6]:
df_hist_trans = pd.read_csv("input/historical_transactions.csv")
df_new_trans = pd.read_csv("input/new_merchant_transactions.csv")

Note: if the files have not been copied to the `input` folder, this cell fails with `FileNotFoundError: File b'input/historical_transactions.csv' does not exist`.

Since the `historical_transactions.csv` file is 2.65 GB on disk, loading it into a dataframe takes a lot of RAM. The following function reduces the memory usage of a dataframe by casting each numeric column to the smallest numeric type that can hold its values. Keep in mind that `float16` stores only about three significant decimal digits, so float columns may be rounded.

In [None]:
def reduce_mem_usage(df, verbose=True):
    """Downcast numeric columns to the smallest dtype that can hold their values."""
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Starting memory usage: {:5.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Try integer types from narrowest to widest; the bounds are inclusive.
                for int_type in (np.int8, np.int16, np.int32, np.int64):
                    if c_min >= np.iinfo(int_type).min and c_max <= np.iinfo(int_type).max:
                        df[col] = df[col].astype(int_type)
                        break
            else:
                # Same idea for floats; narrower float types also lose precision.
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Reduced memory usage: {:5.2f} MB ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
df_hist_trans = reduce_mem_usage(df_hist_trans)
df_new_trans = reduce_mem_usage(df_new_trans)
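
As a possible alternative to the hand-written function, pandas has a built-in downcasting option in `pd.to_numeric`. The sketch below is not the method used in this tutorial. It runs on a made-up frame, and `downcast_numeric` is a hypothetical helper name. Unlike `reduce_mem_usage`, it never goes below `float32` for float columns.

```python
import numpy as np
import pandas as pd

# Toy frame with 64-bit columns; the values are made up for illustration.
toy = pd.DataFrame({
    "small_int": np.array([1, 2, 3], dtype=np.int64),
    "large_int": np.array([1, 2, 100_000], dtype=np.int64),
    "amount": np.array([0.5, 1.25, -3.0], dtype=np.float64),
})


def downcast_numeric(df):
    """Return a copy of df with numeric columns downcast via pd.to_numeric."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


small = downcast_numeric(toy)
print(small.dtypes)
print("bytes before:", toy.memory_usage().sum(), "after:", small.memory_usage().sum())
```

Because the helper works on a copy, the original frame keeps its 64-bit columns, which makes it easy to compare memory usage before and after.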

In [None]:
# Count the transactions of each card in both transaction tables.
df_h = df_hist_trans.groupby("card_id").size().reset_index(name='transactions')
df_n = df_new_trans.groupby("card_id").size().reset_index(name='transactions')

In [None]:
template = (
    "Historic Transactions ---->   Average transactions per card : {:.0f}, Maximum transactions : {}.\n"
    "New Transactions ---->   Average transactions per card : {:.0f}, Maximum transactions : {}. "
)
print(template.format(
    df_h['transactions'].mean(), df_h['transactions'].max(),
    df_n['transactions'].mean(), df_n['transactions'].max(),
))
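
The per-card counts can then be attached to the train and test tables with a left join on `card_id`. The following is a minimal sketch of that idea on made-up data, not necessarily how the rest of the tutorial performs the join. The helper name `add_transaction_counts`, the card IDs, and the output column name are all hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_train and df_hist_trans; IDs and values are made up.
toy_train = pd.DataFrame({
    "card_id": ["C_ID_a", "C_ID_b", "C_ID_c"],
    "target": [0.1, -0.5, 0.3],
})
toy_hist = pd.DataFrame({"card_id": ["C_ID_a", "C_ID_a", "C_ID_b"]})


def add_transaction_counts(df_cards, df_trans, name):
    """Left-join the number of transactions per card onto df_cards."""
    counts = df_trans.groupby("card_id").size().reset_index(name=name)
    merged = df_cards.merge(counts, on="card_id", how="left")
    # Cards without any transaction get NaN from the left join; store 0 instead.
    merged[name] = merged[name].fillna(0).astype(int)
    return merged


result = add_transaction_counts(toy_train, toy_hist, "hist_transactions")
print(result)
```

A left join keeps every card from the train or test table, including cards that have no rows in the transaction table.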