<a href="https://colab.research.google.com/github/ekaratnida/Applied-machine-learning/blob/master/sna/fraud/final/10_data_loader_for_students.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preparation

In this notebook, we will re-construct the dataset.

Graph neural networks (GNNs) learn representations of the nodes or edges of a graph that are well suited to a downstream task. We model fraud detection as a node classification task: the goal of the GNN is to learn how to use the topology of the sub-graph around each transaction node to transform that node's features into a representation space in which the node can easily be classified as fraudulent or not.

Specifically, we will be using a relational graph convolutional neural network model (R-GCN) on a heterogeneous graph since we have nodes and edges of different types.
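To make the idea of relation-specific message passing concrete, here is a minimal NumPy sketch of one simplified R-GCN propagation step: each node keeps a transformed copy of its own features and adds, for every relation type, the average of its neighbours' features transformed by a weight matrix specific to that relation. The function name `rgcn_layer`, the toy graph, and the identity weights are illustrative assumptions; in a real model the weights are learned and a graph library handles the different node types.

```python
import numpy as np

def rgcn_layer(h, adj_by_relation, w_self, w_by_relation):
    """One simplified R-GCN step: ReLU(h @ W_self + sum_r D_r^-1 @ A_r @ h @ W_r)."""
    out = h @ w_self  # self-connection term
    for relation, adj in adj_by_relation.items():
        degree = adj.sum(axis=1, keepdims=True)
        # Row-normalise so each node averages over its neighbours under this relation.
        norm_adj = np.divide(adj, degree, out=np.zeros_like(adj), where=degree > 0)
        out = out + norm_adj @ h @ w_by_relation[relation]
    return np.maximum(out, 0.0)  # ReLU

# Toy graph (hypothetical): nodes 0 and 1 are transactions, node 2 is a card they share.
h = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
adj_by_relation = {
    "card1": np.array([[0.0, 0.0, 1.0],
                       [0.0, 0.0, 1.0],
                       [1.0, 1.0, 0.0]]),
}
identity = np.eye(2)  # stand-in for learned weight matrices
h_next = rgcn_layer(h, adj_by_relation, w_self=identity, w_by_relation={"card1": identity})
print(h_next)
```

After one step the card node carries the average of the two transactions that used it, which is how information can flow between transactions that share an entity once a second layer is applied.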

## Set up Colab environment

In [134]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [135]:
import os
cur_path = "/content/drive/MyDrive/graph-fraud-detection/"
os.chdir(cur_path)
!pwd

/content/drive/MyDrive/graph-fraud-detection


## Data Overview

Import the numpy and pandas modules.

In [136]:
import numpy as np
import pandas as pd

## Data Description

### Transaction Table
- TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
- TransactionAmt: transaction payment amount in USD
- ProductCD: product code, the product for each transaction
- card1 - card6: payment card information, such as card type, card category, issuing bank, country, etc.
- addr: address
- dist: distance
- P_emaildomain and R_emaildomain: purchaser and recipient email domain
- C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
- D1-D15: timedelta, such as days between previous transaction, etc.
- M1-M9: match, such as names on card and address, etc.
- Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.

Categorical Features:
- ProductCD
- card1 - card6
- addr1, addr2
- P_emaildomain
- R_emaildomain
- M1 - M9

### Identity Table
Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions.
They're collected by Vesta’s fraud protection system and digital security partners.
(The field names are masked, and a pairwise dictionary will not be provided, for privacy protection and contract agreement.)

Categorical Features:
- DeviceType
- DeviceInfo
- id_12 - id_38

In [137]:
transaction_df = pd.read_csv('./ieee-data/train_transaction_final.csv')
identity_df = pd.read_csv('./ieee-data/train_identity.csv')

We provide a general processing framework to convert a relational table to heterogeneous graph edgelists based on the column types of the relational table. Some of the data transformation and feature engineering techniques include:

- Performing numerical encoding for categorical variables and logarithmic transformation for transaction amount
- Constructing graph edgelists between transactions and other entities for the various relation types

The inputs to the data preprocessing script are passed in as Python command-line arguments. All columns in the relational table are classified into one of three types for the purposes of data transformation:

- Identity columns `--id-cols`: columns that contain identity information related to a user or transaction, for example IP address, phone number, or device identifiers. These column types become node types in the heterogeneous graph, and the entries in these columns become the nodes. The names of these columns need to be passed in to the script.

- Categorical columns `--cat-cols`: columns that correspond to categorical features, such as a user's age group or whether a provided address matches an address on file. The entries in these columns undergo numerical feature transformation and are used as node attributes in the heterogeneous graph. The names of these columns also need to be passed in to the script.

- Numerical columns: columns that correspond to numerical features, such as how many times a user has tried a transaction. The entries here are also used as node attributes in the heterogeneous graph. The script assumes that all columns in the tables that are not identity columns or categorical columns are numerical columns.
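The three-way classification can be sketched as a small helper. The function name `split_column_types`, its `excluded` parameter, and the toy column list are assumptions made for illustration; they are not part of the preprocessing script.

```python
def split_column_types(all_columns, id_cols, cat_cols, excluded=()):
    """Classify columns as identity, categorical, or numerical (the default)."""
    id_set, cat_set, excluded_set = set(id_cols), set(cat_cols), set(excluded)
    numerical = [
        col for col in all_columns
        if col not in id_set and col not in cat_set and col not in excluded_set
    ]
    return {
        "identity": [col for col in all_columns if col in id_set],
        "categorical": [col for col in all_columns if col in cat_set],
        "numerical": numerical,
    }

# Hypothetical miniature of the transaction table's columns.
toy_columns = ["TransactionID", "isFraud", "TransactionAmt", "card1", "M1", "C1"]
column_types = split_column_types(
    toy_columns, id_cols=["card1"], cat_cols=["M1"], excluded=["TransactionID", "isFraud"]
)
print(column_types)
```

Anything not explicitly named falls through to the numerical group, which mirrors the default described in the last bullet.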

In [138]:
id_cols = ['card1','card2','card3','card4','card5','card6','ProductCD','addr1','addr2','P_emaildomain','R_emaildomain']
cat_cols = ['M1','M2','M3','M4','M5','M6','M7','M8','M9']
train_data_ratio = 0.8

Based on the train/test ratio assigned above, extract the IDs of the test data.

In [139]:
n_train = int(transaction_df.shape[0]*train_data_ratio)
test_ids = transaction_df.TransactionID.values[n_train:]
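Note that this split is positional: nothing is shuffled, so the test set is simply the last 20% of rows in file order. A small sketch with made-up IDs (all `toy_` names are illustrative) shows the slicing:

```python
import numpy as np

toy_ids = np.arange(1000, 1010)               # 10 hypothetical transaction IDs, in file order
toy_ratio = 0.8
toy_n_train = int(len(toy_ids) * toy_ratio)   # int() truncates any fractional part
toy_train_ids = toy_ids[:toy_n_train]         # first 80% of rows
toy_test_ids = toy_ids[toy_n_train:]          # remaining rows
print(toy_train_ids, toy_test_ids)
```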

In [140]:
def get_fraud_frac(series):
    """Return the percentage of fraudulent (isFraud == 1) entries in a series."""
    return 100 * sum(series) / len(series)

print("Percent fraud for train transactions: {}".format(get_fraud_frac(transaction_df.isFraud[:n_train])))
print("Percent fraud for test transactions: {}".format(get_fraud_frac(transaction_df.isFraud[n_train:])))
print("Percent fraud for all transactions: {}".format(get_fraud_frac(transaction_df.isFraud)))

Percent fraud for train transactions: 3.4798806172342993
Percent fraud for test transactions: 3.7592075184150366
Percent fraud for all transactions: 3.535746943475463


Save the test IDs to the `test.csv` file.

In [141]:
with open('data/test.csv', 'w') as f:
    f.writelines(map(lambda x: str(x) + "\n", test_ids))

Following the column types described above, define the non-feature columns (`isFraud`, `TransactionDT`, and the identity columns) and the feature columns used to build the graph.

In [142]:
non_feature_cols = ['isFraud', 'TransactionDT'] + id_cols
print(non_feature_cols)

['isFraud', 'TransactionDT', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6', 'ProductCD', 'addr1', 'addr2', 'P_emaildomain', 'R_emaildomain']


In [143]:
feature_cols = [col for col in transaction_df.columns if col not in non_feature_cols]
print(feature_cols)

['TransactionID', 'TransactionAmt', 'dist1', 'dist2', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11', 'C12', 'C13', 'C14', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'D10', 'D11', 'D12', 'D13', 'D14', 'D15', 'M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40', 'V41', 'V42', 'V43', 'V44', 'V45', 'V46', 'V47', 'V48', 'V49', 'V50', 'V51', 'V52', 'V53', 'V54', 'V55', 'V56', 'V57', 'V58', 'V59', 'V60', 'V61', 'V62', 'V63', 'V64', 'V65', 'V66', 'V67', 'V68', 'V69', 'V70', 'V71', 'V72', 'V73', 'V74', 'V75', 'V76', 'V77', 'V78', 'V79', 'V80', 'V81', 'V82', 'V83', 'V84', 'V85', 'V86', 'V87', 'V88', 'V89', 'V90', 'V91', 'V92', 'V93', 'V94', 'V95', 'V96', 'V97', 'V98', 'V99', 'V100', 'V101', 'V102',

Convert the categorical features to dummy (one-hot) variables and scale the `TransactionAmt` feature with log10.

In [144]:
features = pd.get_dummies(transaction_df[feature_cols], columns=cat_cols, dtype=float).fillna(0)
features['TransactionAmt'] = features['TransactionAmt'].apply(np.log10)
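It helps to see how missing values behave in this step. In the toy sketch below (the frame `toy` and its values are made up), a missing categorical entry yields zeros in every dummy column, and a missing numerical entry is filled with 0:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "TransactionAmt": [10.0, 100.0, 1000.0],
    "dist1": [5.0, np.nan, 7.0],
    "M1": ["T", np.nan, "F"],
})

# One-hot encode M1, then fill the remaining missing numerical values with 0.
toy_features = pd.get_dummies(toy, columns=["M1"], dtype=float).fillna(0)
# Compress the range of the amounts with a base-10 logarithm.
toy_features["TransactionAmt"] = toy_features["TransactionAmt"].apply(np.log10)
print(toy_features)
```

One caveat worth checking on real data: `np.log10` returns `-inf` for an amount of exactly 0, so the transform assumes strictly positive amounts.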

In [145]:
print(list(features.columns))

['TransactionID', 'TransactionAmt', 'dist1', 'dist2', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11', 'C12', 'C13', 'C14', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'D10', 'D11', 'D12', 'D13', 'D14', 'D15', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40', 'V41', 'V42', 'V43', 'V44', 'V45', 'V46', 'V47', 'V48', 'V49', 'V50', 'V51', 'V52', 'V53', 'V54', 'V55', 'V56', 'V57', 'V58', 'V59', 'V60', 'V61', 'V62', 'V63', 'V64', 'V65', 'V66', 'V67', 'V68', 'V69', 'V70', 'V71', 'V72', 'V73', 'V74', 'V75', 'V76', 'V77', 'V78', 'V79', 'V80', 'V81', 'V82', 'V83', 'V84', 'V85', 'V86', 'V87', 'V88', 'V89', 'V90', 'V91', 'V92', 'V93', 'V94', 'V95', 'V96', 'V97', 'V98', 'V99', 'V100', 'V101', 'V102', 'V103', 'V104', 'V105', 'V106', 'V107', 'V108', 'V109

Save the features into `features.csv` for future training.

p.s. We don't need the index column and header.

In [146]:
features.to_csv('data/features.csv', index=False, header=False)
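Because `features.csv` is written without a header or index, a later loader has to rely on column position. Here is a minimal sketch of reading such a file back, assuming the first column is `TransactionID` (as in the `feature_cols` list) and using an in-memory buffer with toy values instead of the real file:

```python
import io

import pandas as pd

toy_features = pd.DataFrame({"TransactionID": [1, 2], "TransactionAmt": [1.0, 2.0]})

# Write in the same headerless, index-free layout as features.csv.
buffer = io.StringIO()
toy_features.to_csv(buffer, index=False, header=False)
buffer.seek(0)

# header=None tells pandas that the first line is data, not column names.
loaded = pd.read_csv(buffer, header=None)
node_ids = loaded.iloc[:, 0].to_numpy()      # first column: transaction IDs
node_feats = loaded.iloc[:, 1:].to_numpy()   # remaining columns: feature values
print(node_ids, node_feats.shape)
```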

Save the IDs and labels into `tags.csv`.

In [147]:
transaction_df[['TransactionID', 'isFraud']].to_csv('data/tags.csv', index=False)

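Select the columns that define the edges: the identity columns from the transaction table plus all columns of the identity table. Because `identity_df.columns` includes `TransactionID`, that column also appears in the list.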

In [148]:
edge_types = id_cols + list(identity_df.columns)
print(edge_types)

['card1', 'card2', 'card3', 'card4', 'card5', 'card6', 'ProductCD', 'addr1', 'addr2', 'P_emaildomain', 'R_emaildomain', 'TransactionID', 'id_01', 'id_02', 'id_03', 'id_04', 'id_05', 'id_06', 'id_07', 'id_08', 'id_09', 'id_10', 'id_11', 'id_12', 'id_13', 'id_14', 'id_15', 'id_16', 'id_17', 'id_18', 'id_19', 'id_20', 'id_21', 'id_22', 'id_23', 'id_24', 'id_25', 'id_26', 'id_27', 'id_28', 'id_29', 'id_30', 'id_31', 'id_32', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38', 'DeviceType', 'DeviceInfo']


In [149]:
all_id_cols = ['TransactionID'] + id_cols
full_identity_df = transaction_df[all_id_cols].merge(identity_df, on='TransactionID', how='left')
full_identity_df.head(5)

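| | TransactionID | card1 | card2 | card3 | card4 | card5 | card6 | ProductCD | addr1 | addr2 | ... | id_31 | id_32 | id_33 | id_34 | id_35 | id_36 | id_37 | id_38 | DeviceType | DeviceInfo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3054316 | 7664 | 490.0 | 150.0 | visa | 226.0 | debit | R | 264.0 | 87.0 | ... | chrome 63.0 | 24.0 | 2048x1152 | match_status:2 | T | F | T | F | desktop | Windows |
| 1 | 3169840 | 14426 | 111.0 | 150.0 | mastercard | 224.0 | debit | W | 272.0 | 87.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 3440616 | 12544 | 321.0 | 150.0 | visa | 226.0 | debit | W | 184.0 | 87.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 3468992 | 8695 | 170.0 | 150.0 | visa | 226.0 | credit | W | 184.0 | 87.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 3369076 | 3277 | 111.0 | 150.0 | visa | 226.0 | debit | W | 231.0 | 87.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |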


In [150]:
full_identity_df.shape

(59054, 52)
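The left merge keeps every transaction, including those with no matching row in the identity table; their identity columns are filled with missing values. A toy sketch with made-up frames (`toy_tx` and `toy_ident` are illustrative names):

```python
import pandas as pd

toy_tx = pd.DataFrame({"TransactionID": [1, 2, 3], "card1": [10, 11, 10]})
toy_ident = pd.DataFrame({"TransactionID": [1], "DeviceType": ["desktop"]})

# how='left' keeps all rows of toy_tx; unmatched rows get NaN in DeviceType.
toy_merged = toy_tx.merge(toy_ident, on="TransactionID", how="left")
print(toy_merged)
```

Only transaction 1 has identity information here, so the other two rows have `NaN` in `DeviceType`, just as most rows in the preview have no identity values.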

For each identity feature, save the data into the corresponding `relation_{FEATURE NAME}_edgelist.csv`. Each csv file represents one kind of edge.

In [151]:
edges = {}
for etype in edge_types:
    # Keep only transactions that have a value for this entity; each remaining row is one edge.
    edgelist = full_identity_df[['TransactionID', etype]].dropna()
    edgelist.to_csv('data/relation_{}_edgelist.csv'.format(etype), index=False, header=True)
    edges[etype] = edgelist

#print(edges)
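Each row of an edge list links one transaction node to one entity node, so two transactions become connected (two hops apart) whenever they share an entity value. A small sketch with a made-up `card1` edge list shows how the shared entities can be recovered:

```python
import pandas as pd

# Hypothetical edge list in the same two-column layout as relation_card1_edgelist.csv.
toy_edgelist = pd.DataFrame({"TransactionID": [1, 2, 3], "card1": [10, 10, 11]})

# Group transactions by the entity they point to.
transactions_by_card = toy_edgelist.groupby("card1")["TransactionID"].apply(list).to_dict()
print(transactions_by_card)
```

Transactions 1 and 2 both point at card 10, so a message-passing model can propagate information between them through that card node.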

Let's re-check the edges we defined.

In [152]:

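import glob

# Collect the edge-list files written above and keep only their file names.
# Note: this rebinds `edges` from the dict built earlier to a comma-separated string.
file_list = glob.glob('./data/*edgelist.csv')
edges = ",".join(os.path.basename(path) for path in file_list if "relation" in path)

# Comma-separated relative paths in the order of edge_types (no leading comma).
edges_full = ",".join('data/relation_{}_edgelist.csv'.format(etype) for etype in edge_types)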



In [153]:
edges

'relation_card1_edgelist.csv,relation_card2_edgelist.csv,relation_card3_edgelist.csv,relation_card4_edgelist.csv,relation_card5_edgelist.csv,relation_card6_edgelist.csv,relation_ProductCD_edgelist.csv,relation_addr1_edgelist.csv,relation_addr2_edgelist.csv,relation_P_emaildomain_edgelist.csv,relation_R_emaildomain_edgelist.csv,relation_TransactionID_edgelist.csv,relation_id_01_edgelist.csv,relation_id_02_edgelist.csv,relation_id_03_edgelist.csv,relation_id_04_edgelist.csv,relation_id_05_edgelist.csv,relation_id_06_edgelist.csv,relation_id_07_edgelist.csv,relation_id_08_edgelist.csv,relation_id_09_edgelist.csv,relation_id_10_edgelist.csv,relation_id_11_edgelist.csv,relation_id_12_edgelist.csv,relation_id_13_edgelist.csv,relation_id_14_edgelist.csv,relation_id_15_edgelist.csv,relation_id_16_edgelist.csv,relation_id_17_edgelist.csv,relation_id_18_edgelist.csv,relation_id_19_edgelist.csv,relation_id_20_edgelist.csv,relation_id_21_edgelist.csv,relation_id_22_edgelist.csv,relation_id_23_edge