## Compute Requirements
Make sure you use an instance with at least 32 GB of memory and 100 GB of storage.

To run the evaluations, we used an `ml.r5.12xlarge` instance with 48 CPUs and 384 GB of memory.
A smaller instance can run the same evaluations, for example an `ml.m5.4xlarge` with 16 CPUs and 64 GB of memory.
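
The requirements above can be checked from inside the notebook before downloading any data. The following is a minimal sketch, not part of the original notebook: the helper name `check_resources` and its parameters are hypothetical, and the memory query via `os.sysconf` assumes a Linux host. The operating system usually reports slightly less memory than the nominal instance size, so treat the result as a rough guide.

```python
import os
import shutil

GIB = 1024 ** 3


def check_resources(min_mem_gib=32, min_disk_gib=100, path="."):
    """Return (total_mem_gib, free_disk_gib, ok) for the current machine.

    Hypothetical helper for illustration; the thresholds mirror the
    requirements stated above.
    """
    # Total physical memory. These sysconf names are available on Linux
    # (assumption: the notebook runs on a Linux instance).
    total_mem_gib = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB
    # Free space on the filesystem that contains `path`.
    free_disk_gib = shutil.disk_usage(path).free / GIB
    ok = total_mem_gib >= min_mem_gib and free_disk_gib >= min_disk_gib
    return total_mem_gib, free_disk_gib, ok


mem, disk, ok = check_resources()
print(f"Memory: {mem:.1f} GiB, free disk: {disk:.1f} GiB, sufficient: {ok}")
```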


# Install dependencies

In [1]:
%pip install -qU -r requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m22.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


# Download and unzip Kaggle dataset
We use the [IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection/data) dataset.

In [5]:
!kaggle competitions download -c ieee-fraud-detection -p ./data/ieee-fraud-detection/

Downloading ieee-fraud-detection.zip to ./data/ieee-fraud-detection
100%|█████████████████████████████████████████| 118M/118M [00:00<00:00, 137MB/s]


In [6]:
!unzip ./data/ieee-fraud-detection/ieee-fraud-detection.zip -d ./data/ieee-fraud-detection/

Archive:  ./data/ieee-fraud-detection/ieee-fraud-detection.zip
  inflating: ./data/ieee-fraud-detection/sample_submission.csv  
  inflating: ./data/ieee-fraud-detection/test_identity.csv  
  inflating: ./data/ieee-fraud-detection/test_transaction.csv  
  inflating: ./data/ieee-fraud-detection/train_identity.csv  
  inflating: ./data/ieee-fraud-detection/train_transaction.csv  


# Create training and test splits
Fraud labels are available only for the competition's training data. We join the transaction and identity tables into a single dataframe on the `TransactionID` column. Not all transactions have identity information, so the inner join leaves a total of 144,233 transactions. We then sort the transactions by timestamp (the `TransactionDT` column), use the first 80% (115,386 transactions) to train our models, and retain the last 20% (28,847 transactions) for testing.

In [8]:
import numpy as np
import pandas as pd

In [9]:
df_identity = pd.read_csv('./data/ieee-fraud-detection/train_identity.csv')

In [10]:
df_transaction = pd.read_csv('./data/ieee-fraud-detection/train_transaction.csv')

In [11]:
df = pd.merge(df_identity, df_transaction, on='TransactionID', how='inner')
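
To see why the inner join reduces the row count, it can help to look at how many transactions have a matching identity record. The sketch below is an illustration only: the two toy dataframes and their values are made up, and only the `pd.merge` call pattern matches the cell above. An outer merge with `indicator=True` adds a `_merge` column that labels each row as `both`, `left_only`, or `right_only`.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are made up for illustration).
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "TransactionAmt": [25.0, 90.48, 50.0, 100.0],
})
identity = pd.DataFrame({
    "TransactionID": [2, 4],
    "DeviceType": ["mobile", "desktop"],
})

# An outer merge with indicator=True records where each row came from.
coverage = pd.merge(identity, transactions, on="TransactionID",
                    how="outer", indicator=True)
counts = coverage["_merge"].value_counts()

# how="inner" keeps only the rows labelled "both" above.
inner = pd.merge(identity, transactions, on="TransactionID", how="inner")

print(counts)
print(f"Rows kept by the inner join: {len(inner)}")
```

In this toy data, transactions 1 and 3 have no identity record, so they appear as `right_only` in the outer merge and are absent from the inner join.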

In [12]:
df.sort_values(by='TransactionDT', ascending=True, inplace=True)

In [13]:
n_total = len(df)
n_train = int(n_total*0.8)
n_test  = n_total - n_train

In [14]:
print(f"Total transactions: {n_total}, training transactions: {n_train}, testing transaction: {n_test}")

Total transactions: 144233, training transactions: 115386, testing transaction: 28847


In [15]:
df_train = df.head(n_train)
df_test  = df.tail(n_test)
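
Because the split is chronological, no test transaction should precede a training transaction. The following sketch illustrates the same 80/20 split on a small made-up dataframe and checks that property. The function name `chronological_split` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd


def chronological_split(frame, time_col, train_frac=0.8):
    """Sort by `time_col` and split into earliest / latest portions."""
    ordered = frame.sort_values(by=time_col, kind="stable")
    n_train = int(len(ordered) * train_frac)
    return ordered.iloc[:n_train], ordered.iloc[n_train:]


# Made-up timestamps in shuffled order.
toy = pd.DataFrame({
    "TransactionID": range(10),
    "TransactionDT": [50, 10, 90, 30, 70, 20, 100, 40, 80, 60],
})

train, test = chronological_split(toy, "TransactionDT")

# Every training timestamp is no later than every test timestamp.
assert train["TransactionDT"].max() <= test["TransactionDT"].min()
print(len(train), len(test))  # 8 2
```

The same assertion could be applied to `df_train` and `df_test` in the notebook as a guard against accidental leakage of future transactions into the training set.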

In [16]:
df_train.to_parquet("./data/train.parquet", index=False)
df_test.to_parquet("./data/test.parquet", index=False)

In [88]:
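# Sample a few transactions to be used in a figure.
# A separate name keeps the merged `df` from being overwritten.
df_sample = df_train[['TransactionID', 'isFraud', 'TransactionDT', 'ProductCD', 'P_emaildomain', 'TransactionAmt', 'DeviceType']]

# Draw 5 legitimate and 3 fraudulent rows at random and drop any with missing values;
# no random_state is set, so the sampled rows differ between runs.
pd.concat([df_sample.query('isFraud == 0').sample(5), df_sample.query('isFraud == 1').sample(3)]).dropna().sort_values('TransactionDT').style.format(precision=2).hide(axis="index")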

| TransactionID | isFraud | TransactionDT | ProductCD | P_emaildomain | TransactionAmt | DeviceType |
|---|---|---|---|---|---|---|
| 3051657 | 0 | 1458799 | H | msn.com | 25.0 | mobile |
| 3057708 | 1 | 1581324 | C | yahoo.com | 90.48 | mobile |
| 3090795 | 0 | 2067466 | H | gmail.com | 50.0 | desktop |
| 3094044 | 0 | 2132259 | R | gmail.com | 100.0 | desktop |
| 3185488 | 1 | 4486165 | H | gmail.com | 150.0 | desktop |
| 3253881 | 0 | 6464196 | R | gmail.com | 100.0 | desktop |
| 3288988 | 1 | 7459023 | C | hotmail.com | 20.84 | mobile |
| 3292756 | 0 | 7593027 | R | gmail.com | 75.0 | mobile |
