# Analyzing Training Data

_Note! If you want to commit any changes to this document, please strip all output (Cell > Current Outputs > Clear, or set up [nbstripout](https://github.com/kynan/nbstripout) as a git filter) from this notebook before doing so. Thanks!_


## Import Libraries

First we import the Python libraries we'll need. If any of these are missing for you, you can install them with e.g. `pip3 install pandas` on the command line.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from fastai import *
from fastai.tabular import *
from fastai.metrics import *
from feature_engineering import *

## Load Data

Load the data into Pandas data frames and look at their structure.

Once we have created our features, we'll split the training data into a train and a validation set. (The given test set is what we'll later make our predictions on and upload, but only after we are fully satisfied with our model.)

In [None]:
hist_trans_df = pd.read_csv('data/unzipped/historical_transactions.csv',
                            parse_dates=['purchase_date'])
merchants_df = pd.read_csv('data/unzipped/merchants.csv',
                           index_col='merchant_id')
merch_trans_df = pd.read_csv('data/unzipped/new_merchant_transactions.csv',
                             parse_dates=['purchase_date'])
train_and_validation_df = pd.read_csv('data/unzipped/train.csv',
                                      index_col='card_id',
                                      parse_dates=['first_active_month'])
test_df = pd.read_csv('data/unzipped/test.csv',
                      index_col='card_id',
                      parse_dates=['first_active_month'])

In [None]:
for v in ['feature_1', 'feature_2', 'feature_3']:
    train_and_validation_df[v] = train_and_validation_df[v].astype('category').cat.as_ordered()

In [None]:
for v in ['authorized_flag', 'category_1', 'category_2', 'category_3', 'merchant_id', 'merchant_category_id',
          'subsector_id', 'city_id', 'state_id']:
    hist_trans_df[v] = hist_trans_df[v].astype('category').cat.as_ordered()
    merch_trans_df[v] = merch_trans_df[v].astype('category').cat.as_ordered()

## Create Features

Next we want to combine and shape all of our raw data to create useful features in the train (and validation and test) data set.

Fastai has a useful helper function called `add_datepart()`, which takes a date field and turns it into a bunch of useful columns, such as "day of week", "is month end", etc.

In [None]:
add_datepart(hist_trans_df, 'purchase_date')
add_datepart(merch_trans_df, 'purchase_date')
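
To get a feel for the kind of columns this produces, here is a minimal pandas-only sketch on a toy frame. It does not call fastai; the toy dates and the column names (`purchase_Year`, etc.) are assumptions made for illustration, and the real helper may name things differently and may add more columns.

```python
import pandas as pd

# Toy transactions; the dates are made up for illustration.
toy_df = pd.DataFrame({
    'purchase_date': pd.to_datetime(['2018-01-31 13:05:00', '2018-02-03 09:30:00']),
})

# The .dt accessor exposes the individual components of each timestamp.
dates = toy_df['purchase_date'].dt
date_parts = pd.DataFrame({
    'purchase_Year': dates.year,
    'purchase_Month': dates.month,
    'purchase_Day': dates.day,
    'purchase_Dayofweek': dates.dayofweek,        # Monday=0 ... Sunday=6
    'purchase_Is_month_end': dates.is_month_end,
    'purchase_Is_month_start': dates.is_month_start,
})
print(date_parts)
```

Each of these derived columns can then be aggregated per card just like any other numerical or categorical field.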

### Aggregate Transaction Data

Next we'll use the functions defined in `feature_engineering.py` to aggregate the historical transactions for each card into single values for that card, for instance the mean of all purchase amounts, etc.

_Note: these functions can take quite a long time to complete._

In [None]:
hist_trans_df.columns

In [None]:
aggregators = {
    'purchase_amount': ['sum', 'mean', 'min', 'max', 'std'],
    'installments': ['sum', 'mean', 'min', 'max', 'std'],
    'month_lag': ['mean', 'min', 'max'],
    'merchant_id': ['nunique'],
    'merchant_category_id': ['nunique'],
    'state_id': ['nunique'],
    'city_id': ['nunique'],
    'subsector_id': ['nunique'],
}

In [None]:
add_aggregated_numerical_fields(train_and_validation_df, hist_trans_df, aggregators=aggregators)
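
The contents of `feature_engineering.py` aren't shown in this notebook, so the following is only a sketch of what such an aggregation could look like, assuming the transactions carry a `card_id` column and that the new columns are named `<column>_<function>` (as in `purchase_amount_mean`, which we plot further down). The toy frames and the `toy_` names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: three transactions spread over two cards.
toy_trans_df = pd.DataFrame({
    'card_id': ['C_1', 'C_1', 'C_2'],
    'purchase_amount': [10.0, 30.0, 5.0],
    'merchant_id': ['M_1', 'M_2', 'M_1'],
})
toy_cards_df = pd.DataFrame(index=pd.Index(['C_1', 'C_2'], name='card_id'))

toy_aggregators = {
    'purchase_amount': ['sum', 'mean'],
    'merchant_id': ['nunique'],
}

# One row per card, one column per (field, function) pair.
agg_df = toy_trans_df.groupby('card_id').agg(toy_aggregators)
# Flatten the (column, function) MultiIndex into names like 'purchase_amount_mean'.
agg_df.columns = ['_'.join(col) for col in agg_df.columns]

# Attach the aggregated values to the per-card frame via its card_id index.
toy_cards_df = toy_cards_df.join(agg_df)
print(toy_cards_df)
```

Note that this sketch returns a new frame from `join()`, whereas the call above appears to modify `train_and_validation_df` in place.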

For the categorical fields, we can't aggregate by taking the mean or sum values, so let's count the occurrences of each possible categorical value instead. _(In other words, for a category that can be either YES or NO, we count the number of YESes and the number of NOs and use those values.)_

In [None]:
add_aggregated_categorical_fields(train_and_validation_df,
                                  hist_trans_df,
                                  column_names=['authorized_flag', 'category_1', 'category_2', 'category_3'])
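
As an illustration of the counting idea, here is a small sketch using `pd.crosstab` on toy data. It is not the implementation from `feature_engineering.py`; the `card_id` column and the `<column>_<value>_count` naming are assumptions.

```python
import pandas as pd

# Hypothetical toy data: card C_1 has two authorized and one declined purchase.
toy_trans_df = pd.DataFrame({
    'card_id': ['C_1', 'C_1', 'C_1', 'C_2'],
    'authorized_flag': ['Y', 'Y', 'N', 'Y'],
})

# Rows are cards, columns are the possible category values, cells are counts.
counts_df = pd.crosstab(toy_trans_df['card_id'], toy_trans_df['authorized_flag'])
counts_df.columns = [f'authorized_flag_{value}_count' for value in counts_df.columns]
print(counts_df)
```

A card that never shows a given value simply gets a count of 0 for it.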

In [None]:
# category_2 and category_3 contain nan values, so let's skip those for now.
add_top_categories(train_and_validation_df,
                   hist_trans_df,
                   column_names=['authorized_flag', 'category_1', 'subsector_id', 'city_id', 'state_id'])
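
The resulting columns are named `<column>_top` (we plot e.g. `city_id_top` below). Assuming "top" means the most frequent value per card, which this notebook doesn't spell out, the idea could be sketched like this on toy data. The helper `most_frequent` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: card C_1 mostly shops in city 69.
toy_trans_df = pd.DataFrame({
    'card_id': ['C_1', 'C_1', 'C_1', 'C_2', 'C_2'],
    'city_id': [69, 69, 17, 88, 88],
})

def most_frequent(values):
    # value_counts() sorts by frequency, so the first index entry is the mode.
    # Assumes every group has at least one non-NaN value.
    return values.value_counts().index[0]

top_df = (toy_trans_df.groupby('card_id')['city_id']
          .agg(most_frequent)
          .rename('city_id_top'))
print(top_df)
```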

In [None]:
train_and_validation_df.head()

## Split Into Train and Validation Sets

Split our data into a training set (80%) and a validation set (20%).

In [None]:
from sklearn.model_selection import train_test_split
train_df, validate_df = train_test_split(train_and_validation_df, test_size=0.2, random_state=238923)

In [None]:
train_df.shape

In [None]:
validate_df.shape

In [None]:
train_df.head()

## Explore Data

### Correlations

In [None]:
train_df.corr().target.sort_values(ascending=False)

In [None]:
sns.heatmap(train_df.corr(), vmin=-1, vmax=1, cmap='PiYG', xticklabels=True, yticklabels=True)

### Distributions

In [None]:
sns.distplot(train_df.target)

In [None]:
sns.countplot(x='feature_1', palette='Set2', data=train_df)

In [None]:
sns.countplot(x='feature_2', palette='Set2', data=train_df)

In [None]:
sns.countplot(x='feature_3', palette='Set2', data=train_df)

In [None]:
sns.countplot(x='authorized_flag_top', palette='Set2', data=train_df)

In [None]:
sns.countplot(x='category_1_top', palette='Set2', data=train_df)

In [None]:
sns.countplot(x='subsector_id_top', palette='Set2', data=train_df)

In [None]:
sns.countplot(x='city_id_top', palette='Set2', data=train_df)

In [None]:
sns.countplot(x='state_id_top', palette='Set2', data=train_df)

In [None]:
sns.distplot(train_df.purchase_amount_mean)

In [None]:
sns.distplot(train_df.installments_mean)

In [None]:
sns.distplot(train_df.month_lag_mean)