# Elo Merchant Category Recommendation
Ridge regression predictions, made with LynxKite, for the [Elo Merchant Category Recommendation](https://www.kaggle.com/c/elo-merchant-category-recommendation) Kaggle contest. The shared files contain the following:

- **train.csv**, **test.csv**: lists of `card_ids` that can be used for training and prediction
- **historical_transactions.csv**: contains up to 3 months' worth of transactions for every card at any of the provided `merchant_ids`
- **new_merchant_transactions.csv**: contains the transactions at new merchants (`merchant_ids` that this particular `card_id` has not yet visited) over a period of two months
- **merchants.csv**: contains aggregate information for each `merchant_id` represented in the data set

Unfortunately, LynxKite does not support some of the required data preprocessing steps, so they need to be done in Python.

### Preprocessing the data
First, we import the required libraries:

In [1]:
import os
import gc
import warnings
import datetime
import numpy as np
import pandas as pd

warnings.filterwarnings("ignore")

In [3]:
df_train = pd.read_csv("input/train.csv", parse_dates=["first_active_month"])
df_test = pd.read_csv("input/test.csv", parse_dates=["first_active_month"])
print("{:,} observations and {} features in train set.".format(df_train.shape[0], df_train.shape[1]))
print("{:,} observations and {} features in test set.".format(df_test.shape[0], df_test.shape[1]))

201,917 observations and 6 features in train set.
123,623 observations and 5 features in test set.


In [4]:
df_train[:3]

| | first_active_month | card_id | feature_1 | feature_2 | feature_3 | target |
|---|---|---|---|---|---|---|
| 0 | 2017-06-01 | C_ID_92a2005557 | 5 | 2 | 1 | -0.820283 |
| 1 | 2017-01-01 | C_ID_3d0044924f | 4 | 1 | 0 | 0.392913 |
| 2 | 2016-08-01 | C_ID_d639edf6cd | 2 | 2 | 0 | 0.688056 |


Next, we check the value ranges of the `feature_1`, `feature_2` and `feature_3` features:

In [5]:
print('Feature_1: ' + str(df_train['feature_1'].min()) + '-' + str(df_train['feature_1'].max()))
print('Feature_2: ' + str(df_train['feature_2'].min()) + '-' + str(df_train['feature_2'].max()))
print('Feature_3: ' + str(df_train['feature_3'].min()) + '-' + str(df_train['feature_3'].max()))

Feature_1: 1-5
Feature_2: 1-3
Feature_3: 0-1
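
The same range check can be written more compactly with a single aggregation. The following is a minimal sketch, not part of the original notebook: it uses a small toy frame built from the three sample rows shown earlier (so its ranges are narrower than the real ones), and the names `feature_cols`, `value_ranges` and `distinct_values` are illustrative assumptions.

```python
import pandas as pd

# Toy frame built from the three sample rows above (an assumption for
# illustration; the real df_train has 201,917 rows).
df = pd.DataFrame({
    "feature_1": [5, 4, 2],
    "feature_2": [2, 1, 2],
    "feature_3": [1, 0, 0],
})

feature_cols = ["feature_1", "feature_2", "feature_3"]

# One call returns a small table with a "min" row and a "max" row per feature.
value_ranges = df[feature_cols].agg(["min", "max"])
print(value_ranges)

# Listing the distinct values shows whether a feature is categorical
# rather than continuous.
distinct_values = {col: sorted(df[col].unique().tolist()) for col in feature_cols}
print(distinct_values)
```

On the real data, replacing `df` with `df_train` should report the same ranges as the output above.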


`feature_1` and `feature_2` need to be converted to **one-hot vectors** (more info on [one-hot vectors](https://en.wikipedia.org/wiki/One-hot)). `feature_3` only takes the values 0 and 1, so it can be used as is.
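
One way to do this conversion in pandas is `pd.get_dummies`. The sketch below is an assumption about how the step could look, not necessarily the method used in this notebook. It runs on a toy frame built from the three sample rows, and the category lists are taken from the value ranges printed above (1-5 and 1-3), assuming every integer in those ranges is a valid category.

```python
import pandas as pd

# Toy frame mirroring the three sample rows shown earlier (not the real data).
df = pd.DataFrame({
    "card_id": ["C_ID_92a2005557", "C_ID_3d0044924f", "C_ID_d639edf6cd"],
    "feature_1": [5, 4, 2],
    "feature_2": [2, 1, 2],
    "feature_3": [1, 0, 0],
})

# Declare the full category sets so every value gets its own column,
# even when it does not occur in this small sample.
df["feature_1"] = pd.Categorical(df["feature_1"], categories=[1, 2, 3, 4, 5])
df["feature_2"] = pd.Categorical(df["feature_2"], categories=[1, 2, 3])

# Replace each listed column with one 0/1 indicator column per category.
df_encoded = pd.get_dummies(df, columns=["feature_1", "feature_2"], dtype=int)
print(df_encoded.columns.tolist())
```

Each row ends up with exactly one `1` among the `feature_1_*` columns and exactly one among the `feature_2_*` columns. Declaring the categories explicitly keeps the train and test sets aligned on the same columns even if a value is missing from one of them.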