# Data Preprocessing and Feature Engineering

## Upload raw data to S3
This notebook uses the IEEE-CIS Fraud Detection dataset, a typical example of the financial transaction data that many companies hold. The dataset consists of two tables:

- Transactions: Records transactions between two users, along with metadata about each transaction. Example columns include the product code for the transaction and features of the card used.
- Identity: Contains information about the identity of the users performing transactions. Example columns include the device type and device IDs used.

We will go over the specific data schema in subsequent cells. For now, let's move the raw data to a convenient location in the S3 bucket for this project, where it will be picked up by the preprocessing job and the training job.
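
To make the relationship between the two tables concrete, here is a minimal sketch that joins them on `TransactionID`, the key both files share. The rows below are made-up samples, not real data, and the helper name `left_join_on_transaction_id` is hypothetical. The sketch assumes that not every transaction has a matching identity record, so it keeps all transactions and leaves the identity columns empty where there is no match. It uses only the standard library to stay self-contained; in practice a pandas left merge would do the same job.

```python
import csv
import io

# Made-up sample rows for illustration; the real files have many more columns.
transaction_csv = """TransactionID,isFraud,TransactionAmt,ProductCD,card1
1001,0,59.0,W,13926
1002,1,120.5,C,2755
1003,0,15.0,W,4663
"""
identity_csv = """TransactionID,DeviceType,DeviceInfo
1002,mobile,SAMSUNG
"""


def left_join_on_transaction_id(transactions, identities):
    """Attach identity columns to each transaction (hypothetical helper).

    Transactions without an identity record keep None in the identity columns.
    """
    identity_by_id = {row["TransactionID"]: row for row in identities}
    identity_cols = []
    if identities:
        identity_cols = [c for c in identities[0] if c != "TransactionID"]

    joined = []
    for tx in transactions:
        match = identity_by_id.get(tx["TransactionID"], {})
        merged = dict(tx)
        for col in identity_cols:
            merged[col] = match.get(col)  # None when no identity record exists
        joined.append(merged)
    return joined


transactions = list(csv.DictReader(io.StringIO(transaction_csv)))
identities = list(csv.DictReader(io.StringIO(identity_csv)))
joined = left_join_on_transaction_id(transactions, identities)

for row in joined:
    print(row["TransactionID"], row["ProductCD"], row["DeviceType"])
```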

If you would like to use your own dataset for this demonstration, replace raw_data_location with the S3 path or local path of your dataset, and modify the data preprocessing step as needed.
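
When pointing raw_data_location at your own data, note that the cells below build file paths by plain string concatenation, which works only if the location ends with a slash. The following sketch shows one way to build the paths so that a missing trailing slash does not matter. The helper names `is_s3_uri` and `build_source_path`, and the bucket and directory names in the usage lines, are hypothetical and not part of this project.

```python
import os


def is_s3_uri(location: str) -> bool:
    """Return True if the location looks like an S3 URI."""
    return location.startswith("s3://")


def build_source_path(raw_data_location: str, filename: str) -> str:
    """Join a base location and a file name, with or without a trailing slash.

    S3 URIs always use forward slashes; local paths use the OS separator.
    """
    if is_s3_uri(raw_data_location):
        return raw_data_location.rstrip("/") + "/" + filename
    return os.path.join(raw_data_location, filename)


# Hypothetical locations, for illustration only.
print(build_source_path("s3://my-bucket/my-dataset", "train_transaction.csv"))
print(build_source_path("s3://my-bucket/my-dataset/", "train_identity.csv"))
print(build_source_path(os.path.join("data", "fraud"), "train_transaction.csv"))
```

If the location is a local directory, the `aws s3 cp` steps below would likely need to be replaced with an ordinary file copy, since that command expects at least one side of the copy to be an S3 path.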

### Prerequisites

- An AWS account
- Python 3.6+ with boto3, sagemaker, and pandas installed
- AWS CLI credentials configured with S3 and SageMaker permissions
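
As an optional sanity check, the sketch below verifies the Python version and reports which of the required packages cannot be found in the current environment. The helper name `missing_packages` is hypothetical. The check covers only the Python side of the list above and does not validate AWS credentials or permissions.

```python
import importlib.util
import sys

# Packages listed in the prerequisites above.
REQUIRED_PACKAGES = ["boto3", "sagemaker", "pandas"]


def missing_packages(names):
    """Return the subset of top-level package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


python_ok = sys.version_info >= (3, 6)
missing = missing_packages(REQUIRED_PACKAGES)

print("Python 3.6+ available:", python_ok)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages found.")
```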

In [None]:
import json
import os
import boto3
import sagemaker
import tempfile

In [None]:
raw_data_location = 's3://aws-gcr-solutions-assets/open-dataset/ieee-fraud-detection/'

session_prefix = 'realtime-fraud-detection-on-dgl'

dest_dir = tempfile.mkdtemp()

transaction_source = f'{raw_data_location}train_transaction.csv'
transaction_dest = f'{dest_dir}/transaction.csv'

!aws s3 cp $transaction_source $transaction_dest

identity_source = f'{raw_data_location}train_identity.csv'
identity_dest = f'{dest_dir}/identity.csv'

!aws s3 cp $identity_source $identity_dest

In [None]:
output_dir = tempfile.mkdtemp()

! python ./data-preprocessing/graph_data_preprocessor.py --data-dir $dest_dir --output-dir $output_dir --id-cols 'card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain' --cat-cols 'M1,M2,M3,M4,M5,M6,M7,M8,M9'

In [None]:
region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')
default_bucket = sagemaker.session.Session(boto3.session.Session()).default_bucket()

processed_data = f's3://{default_bucket}/{session_prefix}/processed-data'

! aws s3 sync $output_dir $processed_data

In [None]:
%store processed_data
%store default_bucket