# Mobile Money Transactions Fraud Detection
**Acknowledgements: This dataset is from Kaggle.
For details, see https://www.kaggle.com/ntnu-testimon/paysim1/home**

- [About this dataset](#About-this-dataset)
- [Before you start](#Before-you-start)
- [Setup](#Setup)
- [Data Processing](#Data-Processing)
  * [Load](#Load-data)
  * [Explore](#Explore-the-dataset)
  * [Feature Engineering](#Feature-Engineering)
  * [Prepare dataset for SageMaker XGBoost](#Prepare-dataset-for-SageMaker-XGBoost)
- [Training the XGBoost model](#Training-the-XGBoost-model)
- [Predict Using Batch transform](#Predict-Using-Batch-transform)
- [Predict using API inference endpoint](#Predict-using-API-inference-endpoint)



  


## About this dataset


There is a lack of publicly available datasets on financial services, especially in the emerging mobile money transactions domain. Financial datasets are important to many researchers, and in particular to us, performing research in the domain of fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which leads to no publicly available datasets.

We present a synthetic dataset generated using the simulator called PaySim as an approach to such a problem. PaySim uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.

### Content

PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company, who is the provider of the mobile financial service which is currently running in more than 14 countries all around the world.

This synthetic dataset is scaled down to 1/4 of the original dataset and was created just for Kaggle.

### Headers

This is a sample row, followed by an explanation of each header:

1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0

step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps: 744 (744 hours, i.e. a 31-day simulation).

type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

amount - amount of the transaction in local currency.

nameOrig - customer who started the transaction

oldbalanceOrg - initial balance before the transaction

newbalanceOrig - new balance after the transaction

nameDest - customer who is the recipient of the transaction

oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers whose IDs start with M (Merchants).

newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers whose IDs start with M (Merchants).

isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.

isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single transaction.
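
As a quick illustration of the column layout, the sketch below pairs the sample row above with the header names, using only the Python standard library. The column order is assumed to follow the order of the descriptions above; `HEADERS`, `SAMPLE_ROW` and `record` are illustrative names, not part of the dataset tooling.

```python
import csv
import io

# Column names, assumed to be in the order they are described above.
HEADERS = [
    "step", "type", "amount", "nameOrig", "oldbalanceOrg", "newbalanceOrig",
    "nameDest", "oldbalanceDest", "newbalanceDest", "isFraud", "isFlaggedFraud",
]

SAMPLE_ROW = "1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0"

# Parse the single sample row and pair each value with its header.
values = next(csv.reader(io.StringIO(SAMPLE_ROW)))
record = dict(zip(HEADERS, values))

# One step is one hour, so 744 hourly steps span 744 / 24 days.
total_days = 744 / 24

for name in HEADERS:
    print("{:>15}: {}".format(name, record[name]))
print("simulated days:", total_days)
```

Every value is read as a string here; numeric columns need an explicit `float()` or `int()` conversion before any arithmetic.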

### Past Research

There are 5 similar files that contain the runs of 5 different scenarios. These files are explained in more detail in chapter 7 of my PhD thesis (available at http://urn.kb.se/resolve?urn=urn:nbn:se:bth-12932).

We ran PaySim several times using random seeds for 744 steps, representing each hour of one month of real time, which matches the original logs. Each run took around 45 minutes on an Intel i7 processor with 16GB of RAM. The final result of a run contains approximately 24 million financial records divided into the 5 categories: CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

### Acknowledgements

This work is part of the research project "Scalable resource-efficient systems for big data analytics" funded by the Knowledge Foundation (grant: 20140032) in Sweden.

Please refer to this dataset using the following citations:

PaySim first paper of the simulator:

E. A. Lopez-Rojas, A. Elmir, and S. Axelsson. "PaySim: A financial mobile money simulator for fraud detection". In: The 28th European Modeling and Simulation Symposium-EMSS, Larnaca, Cyprus. 2016



## Before you start
1. Manually download the dataset from Kaggle (https://www.kaggle.com/ntnu-testimon/paysim1/downloads/PS_20174392719_1491204439457_log.csv/2) and upload it to S3.

## Setup

In [None]:
### Download data from S3

In [None]:
# Enter the S3 location of the input dataset (downloaded from Kaggle and uploaded to S3),
# e.g. s3://mybucket/PaySimFraudDetection/PS_20174392719_1491204439457_log.csv.zip
s3_source_data="s3://<s3 path>"

##Results
bucket="<enter your bucket to hold results>"
prefix="DemoPaySimFraudDetection"
bucket_prefix="{}/{}".format(bucket,prefix)
tmpdir="./tmpDemoPaySimFraudDetection"



In [None]:
# Ensure that the right libraries are installed
!pip install -r requirements.txt

In [None]:
!mkdir -p $tmpdir

In [None]:
import os
local_input_zip=os.path.join(tmpdir, "paysim.zip")

!aws s3 cp $s3_source_data $local_input_zip

In [None]:
!unzip -o $local_input_zip -d $tmpdir

## Data Processing

### Load data
Import pandas and the plotting libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Load the data into pandas

In [None]:
data = pd.read_csv(os.path.join(tmpdir,'PS_20174392719_1491204439457_log.csv'))

### Explore the dataset 

In [None]:
data.head(n=10)

In [None]:
data.describe()

In [None]:
data.hist (bins=50, figsize=(15,15), color = 'green')
plt.show()

Distribution of transactions with respect to the source account

In [None]:
data['nameOrig'].value_counts().hist (bins=500, figsize=(15,5), color = 'red')
plt.show()

In [None]:
data['nameOrig'].value_counts().describe()

Distribution of transactions with respect to the destination account

In [None]:
data['nameDest'].value_counts().hist (bins=500, figsize=(15,5), color = 'green')
plt.show()

In [None]:
data.sort_values(by=['step', 'nameOrig']).head(n=20)

Explore the class distribution

In [None]:
data.isFraud.value_counts().plot.pie(autopct='%.2f',figsize=(5, 5), colors=["green","cyan"], explode=[0,.1])
plt.title('Class Distribution')
plt.tight_layout()

#### Highly imbalanced dataset
Because this is a highly imbalanced dataset, use the area under the precision-recall curve (AUCPR) rather than the area under the ROC curve as the evaluation metric.
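
To see why imbalance makes headline metrics misleading, consider the toy sketch below. The counts are made up for illustration and are not taken from this dataset: a "model" that never predicts fraud still scores a very high accuracy while catching no fraud at all.

```python
# Hypothetical toy counts, chosen only to illustrate the effect of imbalance.
n_legit, n_fraud = 9990, 10
labels = [0] * n_legit + [1] * n_fraud

# A useless "model" that labels every transaction as non-fraud.
predictions = [0] * len(labels)

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
recall = true_positives / n_fraud

print("accuracy={:.3f} recall={:.3f}".format(accuracy, recall))
```

Metrics built on precision and recall expose this failure immediately, because they only reward correctly identified fraud cases.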

View Correlation heatmap

In [None]:
fig, ax = plt.subplots( 1,2, figsize=(15,5))

# Correlate numeric columns only; the account-name and type columns are strings.
ax[0].set_title("Fraudulent records correlation")
sns.heatmap(data.query('isFraud == 1').drop(['isFraud', 'isFlaggedFraud'], axis=1).select_dtypes('number').corr(), cmap="OrRd", ax=ax[0])

ax[1].set_title("Non-fraudulent records correlation")
sns.heatmap(data.query('isFraud == 0').drop(['isFraud', 'isFlaggedFraud'], axis=1).select_dtypes('number').corr(), cmap="OrRd", ax=ax[1])
plt.show()

#### Source amount and destination balance difference do not match
When a record has isFlaggedFraud = 1, the transaction was detected and stopped before being processed, which is why it did not change the origin or destination account balances (they keep their previous values).

**Note:** there is no record of balances for clients whose IDs start with M (Merchants).
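
The mismatch can be checked row by row. The sketch below uses the values of the sample PAYMENT row from the Headers section: the origin balance drops by exactly the transaction amount, while the merchant destination shows no balance change at all. The helper `balances_reconcile` and its `tol` parameter are illustrative names introduced here, not part of the notebook.

```python
import math

# Values copied from the sample PAYMENT row in the "Headers" section.
row = {
    "amount": 1060.31,
    "nameDest": "M1591654462",
    "oldbalanceOrg": 1089.0,
    "newbalanceOrig": 28.69,
    "oldbalanceDest": 0.0,
    "newbalanceDest": 0.0,
}

def balances_reconcile(row, tol=0.01):
    """Return (origin_ok, dest_ok): whether each side's balance change equals the amount."""
    origin_change = row["oldbalanceOrg"] - row["newbalanceOrig"]
    dest_change = row["newbalanceDest"] - row["oldbalanceDest"]
    origin_ok = math.isclose(origin_change, row["amount"], abs_tol=tol)
    dest_ok = math.isclose(dest_change, row["amount"], abs_tol=tol)
    return origin_ok, dest_ok

origin_ok, dest_ok = balances_reconcile(row)
is_merchant_dest = row["nameDest"].startswith("M")
print("origin reconciles:", origin_ok)
print("destination reconciles:", dest_ok)
print("destination is a merchant:", is_merchant_dest)
```

For this row the destination side fails to reconcile simply because merchant balances are not recorded, so a mismatch on its own should not be read as a sign of fraud.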



### Feature Engineering

#### Drop Correlated features

In [None]:
# drop() returns a new DataFrame, so the original `data` is left untouched.
data_clean = data.drop(["newbalanceOrig", "newbalanceDest", "isFlaggedFraud"], axis=1)

**Note:** If you do not remove newbalanceDest, you will get better results (> 90% AUCPR). However, that would not be entirely fair: in this dataset, the amount transferred from the source does not add up to the increase in the destination balance when the transfer is stopped, which happens when the bank's modelling system detects a potentially fraudulent transaction.

#### Add new features

In [None]:
data_clean["isMerchantTransOrig"] = data_clean["nameOrig"].str.startswith('M').astype(int) 
data_clean["isMerchantTransDest"] = data_clean["nameDest"].str.startswith('M').astype(int) 

data_clean["isMerchantTrans"] = data_clean["isMerchantTransOrig"] |  data_clean["isMerchantTransDest"]

In [None]:
fig, ax = plt.subplots( 1,2, figsize=(15,5))


# Correlate numeric columns only; the account-name and type columns are strings.
ax[0].set_title("Fraudulent records correlation after clean")
sns.heatmap(data_clean.query('isFraud == 1').drop(['isFraud'], axis=1).select_dtypes('number').corr(), cmap="OrRd", ax=ax[0])

ax[1].set_title("Non-fraudulent records correlation after clean")
sns.heatmap(data_clean.query('isFraud == 0').drop(['isFraud'], axis=1).select_dtypes('number').corr(), cmap="OrRd", ax=ax[1])
plt.show()

### Prepare dataset for SageMaker XGBoost

#### Column order - Labels in first column
Reorder the columns so that the label is the first column. This is the format expected by the SageMaker XGBoost implementation; for more details see https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html

In [None]:
cols = data_clean.drop(["isFraud"], axis=1).columns.tolist()
cols.insert(0, "isFraud")
data_clean = data_clean[cols]
data_clean.head(n=5)

#### Unimportant non numerical column drop
XGBoost only works with numerical values, so drop the non-numerical source and destination account columns.

In [None]:
data_clean = data_clean.drop(['nameOrig', 'nameDest'], axis=1)

#### Onehot encode categorical columns 
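XGBoost only works with numerical values, so translate the categorical column into one-hot vectors.

In [None]:
# dtype=int keeps the indicator columns as 0/1 rather than True/False in the CSV output.
data_clean = pd.get_dummies(data_clean, prefix=['transaction_type'], columns=['type'], dtype=int)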
data_clean.head(n=5)
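
For readers unfamiliar with one-hot encoding, here is a minimal pure-Python sketch of what the transformation does to a single value. It is not how pandas implements `get_dummies`; the helper `one_hot` is hypothetical, and the category spellings are taken from the dataset description above, so they may differ from the values that actually appear in the `type` column.

```python
# Transaction types as spelled in the dataset description (assumption: the
# spellings in the actual `type` column may differ).
TYPES = ["CASH-IN", "CASH-OUT", "DEBIT", "PAYMENT", "TRANSFER"]

def one_hot(value, categories=TYPES, prefix="transaction_type"):
    """Return a dict with one 0/1 column per category."""
    if value not in categories:
        raise ValueError("unknown category: {!r}".format(value))
    return {"{}_{}".format(prefix, c): int(c == value) for c in categories}

encoded = one_hot("PAYMENT")
for column, flag in encoded.items():
    print(column, flag)
```

Exactly one of the generated columns is 1 for any given row, which lets a numeric-only learner such as XGBoost use the category without implying any ordering between types.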

#### Train test set split
Split the dataset into train, validation and test sets.

In [None]:
from sklearn.model_selection import train_test_split

train_val, test = train_test_split(data_clean, test_size = 0.2, random_state = 777)
train, validation = train_test_split(train_val, test_size = 0.2, random_state = 777)
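
Because the split is done in two stages, the final proportions are not 60/20/20. The short calculation below works them out from the two `test_size` values used above; the variable names are illustrative.

```python
from fractions import Fraction

test_size = Fraction(2, 10)        # first split: 20% held out for testing
validation_size = Fraction(2, 10)  # second split: 20% of the remaining 80%

test_share = test_size
train_val_share = 1 - test_size
validation_share = train_val_share * validation_size
train_share = train_val_share - validation_share

print("train={} validation={} test={}".format(
    float(train_share), float(validation_share), float(test_share)))
```

So the model trains on 64% of the records, validates on 16% and is tested on the remaining 20%.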

In [None]:
fig, ax = plt.subplots( 1,3, figsize=(15,5))

train.isFraud.value_counts().plot.pie(autopct='%.2f', ax = ax[0], colors=["green","cyan"], explode=[0,.1])
ax[0].set_title('Train fraud distribution ({} records)'.format(train.shape[0]))

test.isFraud.value_counts().plot.pie(autopct='%.2f', ax = ax[1], colors=["green","cyan"], explode=[0,.1])
ax[1].set_title('Test fraud distribution ({} records)'.format(test.shape[0]))

validation.isFraud.value_counts().plot.pie(autopct='%.2f', ax = ax[2], colors=["green","cyan"], explode=[0,.1])
ax[2].set_title('Validation fraud distribution ({} records)'.format(validation.shape[0]))

plt.show() 

In [None]:
import os

trainfile=os.path.join(tmpdir, "train_paysim.csv")
testfile=os.path.join(tmpdir,"test_paysim.csv")
validationfile=os.path.join(tmpdir,"validation_paysim.csv")

Write the records to file

In [None]:
# Write label-first CSV files without header or index; the remaining to_csv defaults
# (comma separator, UTF-8 encoding, '.' decimal) already match what is needed here.
train.to_csv(trainfile, header=False, index=False)
test.to_csv(testfile, header=False, index=False)
validation.to_csv(validationfile, header=False, index=False)

In [None]:
!head $trainfile
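
Beyond eyeballing the first lines, a small check can confirm that a file follows the layout described above: no header, numeric values only, and a 0/1 label in the first column. This is a sketch using only the standard library; `check_xgboost_csv` is a hypothetical helper, and the two sample rows are made up.

```python
import csv
import io

def check_xgboost_csv(lines):
    """Check rows are header-less, all-numeric, with a 0/1 label in column 0.

    Returns the number of feature columns (excluding the label).
    """
    n_features = None
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if not row:
            continue  # skip blank lines
        try:
            values = [float(v) for v in row]
        except ValueError:
            raise ValueError(
                "row {}: non-numeric value (header or text column?)".format(row_number)
            ) from None
        if values[0] not in (0.0, 1.0):
            raise ValueError("row {}: label must be 0 or 1".format(row_number))
        if n_features is None:
            n_features = len(values) - 1
        elif len(values) - 1 != n_features:
            raise ValueError("row {}: inconsistent column count".format(row_number))
    return n_features

# Hypothetical two-row sample in the label-first layout (values are made up).
sample = io.StringIO("0,1,1060.31,1089.0\n1,1,181.0,181.0\n")
print("feature columns:", check_xgboost_csv(sample))
```

In the notebook, the same function could be applied to an open file handle for `trainfile`; for a file with millions of rows it may be enough to check only the first few thousand lines.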

Copy the data to S3 into the train and validation channels

In [None]:
s3train="s3://{}/train/{}".format(bucket_prefix, "train.txt")
s3validation="s3://{}/validation/{}".format(bucket_prefix, "validation.txt")

print(trainfile)
!aws s3 cp $trainfile $s3train
!aws s3 cp $validationfile $s3validation



## Training the XGBoost model

After setting the training parameters, we kick off training and poll for status until training completes, which in this example takes between 5 and 6 minutes.


In [None]:
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')
from sagemaker import get_execution_role
role = get_execution_role()


In [None]:
%%time
import boto3
from time import gmtime, strftime

job_name = 'Fraud-xgboost-classification-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Training job", job_name)

#Ensure that the training and validation data folders generated above are reflected in the "InputDataConfig" parameter below.

create_training_params = \
{
    "AlgorithmSpecification": {
        "TrainingImage": container,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": "s3://{}/single-xgboost".format(bucket_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.m4.4xlarge",
        "VolumeSizeInGB": 5
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "silent":"0",
        "objective":"reg:logistic",
        "num_round":"50",
        "eval_metric":"auc"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3train,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "csv",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3validation,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "csv",
            "CompressionType": "None"
        }
    ]
}


client = boto3.client('sagemaker')
client.create_training_job(**create_training_params)

import time

status = client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print(status)
# 'Stopped' is also a terminal state; without it a stopped job would poll forever.
while status not in ('Completed', 'Failed', 'Stopped'):
    time.sleep(60)
    status = client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
    print(status)
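
The polling pattern above can be factored into a reusable helper with a timeout. The sketch below is one possible shape, not the notebook's own code: `wait_for_status`, its parameters and the fake status sequence are all illustrative, and the real API call is replaced by a stand-in so the example runs without AWS access.

```python
import time

# Terminal values of TrainingJobStatus reported by describe_training_job.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}

def wait_for_status(get_status, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout elapses.

    get_status is any zero-argument callable returning the current status string.
    """
    waited = 0
    status = get_status()
    while status not in TERMINAL_STATES:
        if waited >= timeout_seconds:
            raise TimeoutError("still {!r} after {} seconds".format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status

# Stand-in for the real API call so the sketch runs without AWS access.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final)
```

With the real client, `get_status` would be a lambda wrapping `client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']`.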

### Create Model

In [None]:
%%time
import boto3
from time import gmtime, strftime

model_name=job_name + '-model'
print(model_name)

info = client.describe_training_job(TrainingJobName=job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print(model_data)

primary_container = {
    'Image': container,
    'ModelDataUrl': model_data
}

create_model_response = client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

## Predict Using Batch transform

This is a good use of batch transform: you simply evaluate the model before deploying it as an API.

In [None]:
import os

batchfileinput=os.path.join(tmpdir, "batchvalidation.csv")
batchfileresults=os.path.join(tmpdir, "batchvalidation_results.csv")

In [None]:
%%time
import json
from itertools import islice
import math
import struct

file_name = testfile 
with open(file_name, 'r') as f:
    lines = f.readlines()
    
input_records = [",".join(l.strip().split(",")[1:]) for l in lines]
labels = [int(l.split(",")[0]) for l in lines]


with open(batchfileinput , "w") as f:
    f.writelines(["{}\n".format(item) for item in input_records])
                                          

In [None]:
%%time
import boto3
import sagemaker
import json

fmttime= strftime("%Y-%m-%d-%H-%M-%S", gmtime())
input_key_file="batchvalidation.csv"
input_batch_key="{}/batchTransform/{}_input/{}".format(prefix, fmttime, input_key_file)
input_location = 's3://{}/{}'.format(bucket, input_batch_key)
output_batch_key = "{}/batchTransform/{}_output".format(prefix,fmttime)
output_location = 's3://{}/{}'.format(bucket, output_batch_key)


s3_client = boto3.client('s3')
s3_client.upload_file(batchfileinput, bucket, input_batch_key)

# Initialize the transformer object
transformer =sagemaker.transformer.Transformer(
    base_transform_job_name='Batch-Transform',
    model_name=model_name,
    instance_count=1,
    instance_type='ml.c4.xlarge',
    output_path=output_location
    )
# To start a transform job:
transformer.transform(input_location, content_type='text/csv', split_type='Line')
# Then wait until transform job is completed
transformer.wait()

# To fetch validation result 
outputkey ='{}/{}.out'.format(output_batch_key, input_key_file)
print(outputkey)
s3_client.download_file(bucket, outputkey, batchfileresults)
with open(batchfileresults) as f:
    results = f.readlines()   
    predicted = [float(r) for r in results]
print("Sample transform result: {}".format(results[0]))

### Measurement using - AUCPR
Because the positive samples are underrepresented, measures such as the area under the ROC curve or accuracy inflate the numbers. Use the area under the precision-recall curve instead, since it does not take true negatives into account.

In [None]:
import sklearn.metrics
micro_score = sklearn.metrics.average_precision_score(labels, predicted, average='micro', sample_weight=None)
print("AUC under precision recall curve is {}".format(micro_score))
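
The gap between the two metrics can be reproduced with a few lines of plain Python. The sketch below hand-rolls both measures on made-up scores (none of the numbers come from this dataset) and assumes all scores are distinct so that ties need no special handling; in the notebook itself, scikit-learn's implementations remain the ones to use.

```python
def average_precision(labels, scores):
    """Average precision: sum over ranks of (recall change) * precision.

    Assumes all scores are distinct, so ties need no special handling.
    """
    ranked = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    true_pos = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            ap += (true_pos / rank) / total_pos
    return ap

def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1 for p in pos for n in neg if p > n)
    return wins / (len(pos) * len(neg))

# Made-up scores: 10 frauds ranked just below 10 confident false alarms,
# followed by 990 clearly legitimate transactions.
scores = [0.90 + 0.005 * j for j in range(10)]   # false alarms
labels = [0] * 10
scores += [0.50 + 0.01 * k for k in range(10)]   # actual frauds
labels += [1] * 10
scores += [0.0001 * i for i in range(1, 991)]    # easy negatives
labels += [0] * 990

auc_value = roc_auc(labels, scores)
ap_value = average_precision(labels, scores)
print("ROC AUC = {:.3f}".format(auc_value))
print("Average precision = {:.3f}".format(ap_value))
```

The large pool of easy negatives pushes the ROC AUC close to 1, while the average precision stays low because every fraud is outranked by the false alarms.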

## Predict using API inference endpoint
Now you are ready to deploy your model as an API.

#### Deploy Endpoint

In [None]:
from time import gmtime, strftime

endpoint_config_name = 'DEMO-XGBoostEndpointConfig-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_config_name)
create_endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m4.xlarge',
        'InitialVariantWeight':1,
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

In [None]:
%%time
import time

endpoint_name = 'DEMO-XGBoostEndpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_name)
create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print(create_endpoint_response['EndpointArn'])

resp = client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

while status=='Creating':
    time.sleep(60)
    resp = client.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Status: " + status)

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

#### Invoke your api to run inference

In [None]:
runtime_client = boto3.client('runtime.sagemaker')

In [None]:
%%time
import json
from itertools import islice
import math
import struct

file_name = testfile 
with open(file_name, 'r') as f:
    lines = f.readlines()
    
input_records = [",".join(l.strip().split(",")[1:]) for l in lines]
labels = [int(l.split(",")[0]) for l in lines]



In [None]:
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

In [None]:
%%time
predicted = []
for record_chunks in chunks(input_records, 10000):
    formatted = "\n".join(record_chunks)
    response = runtime_client.invoke_endpoint(EndpointName=endpoint_name, 
                                   ContentType='text/csv', 
                                   Body=formatted.encode('utf-8'))
    result = response['Body'].read()
    result = result.decode("utf-8")
    predicted.extend([float(r) for r in result.split(',')])
    
    print("Predicted {} out of {} so far ..".format(len(predicted), len(input_records)))
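
The chunk, join, invoke and parse cycle above can be exercised offline. In the sketch below, `fake_invoke` is a hypothetical stand-in for the endpoint call that returns one comma-separated score per input line, mimicking the response format parsed above; the records are made up.

```python
def chunks(items, size):
    """Yield successive size-sized chunks from items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fake_invoke(body):
    """Hypothetical stand-in for the endpoint: one score per input line,
    comma-separated, mimicking the response format parsed above."""
    n_records = len(body.decode("utf-8").split("\n"))
    return ",".join(["0.5"] * n_records).encode("utf-8")

records = ["1,2,3", "4,5,6", "7,8,9", "10,11,12", "13,14,15"]
predicted = []
for batch in chunks(records, 2):
    payload = "\n".join(batch).encode("utf-8")
    response = fake_invoke(payload).decode("utf-8")
    predicted.extend(float(score) for score in response.split(","))

# Every input record should end up with exactly one prediction, in order.
print("{} predictions for {} records".format(len(predicted), len(records)))
```

Checking that the number of predictions equals the number of input records is a cheap guard against a chunk being dropped or a response being parsed incorrectly.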


In [None]:
import sklearn.metrics

macro_score = sklearn.metrics.average_precision_score(labels, predicted, average='macro', sample_weight=None)

# Report the score computed from the endpoint predictions in this cell.
print("The AUC under precision recall curve is {}".format(macro_score))

In [None]:
confidence_threshold=.5

In [None]:
confusion_matrix = sklearn.metrics.confusion_matrix(labels, pd.DataFrame(predicted) > confidence_threshold, labels=[1,0], sample_weight=None)
confusion_matrix

In [None]:
import seaborn as sn
df_cm = pd.DataFrame(
        confusion_matrix, index=["Fraud", "Non-Fraud"], columns=["Fraud", "Non-Fraud"], 
)
sn.set(font_scale=1.4)#for label size
sn.heatmap(df_cm, annot=True,annot_kws={"size": 16},fmt="d", cmap="tab10" )
plt.show()

In [None]:
from sklearn.metrics import precision_recall_fscore_support
precision_recall_fscore_support(labels, pd.DataFrame(predicted) > confidence_threshold, average=None)
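
The 0.5 threshold used above is only one possible operating point. The sketch below, on made-up labels and scores, shows how moving the threshold trades precision against recall; `confusion_counts` and `precision_recall` are illustrative helpers, not part of scikit-learn.

```python
def confusion_counts(labels, scores, threshold):
    """Return (tp, fp, fn, tn) when scores above the threshold are flagged as fraud."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        flagged = s > threshold
        if flagged and y == 1:
            tp += 1
        elif flagged and y == 0:
            fp += 1
        elif not flagged and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(labels, scores, threshold):
    """Precision and recall for the fraud class at the given threshold."""
    tp, fp, fn, _ = confusion_counts(labels, scores, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and scores, for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1, 0.05, 0.4]

for threshold in (0.25, 0.5, 0.75):
    p, r = precision_recall(labels, scores, threshold)
    print("threshold={:.2f} precision={:.2f} recall={:.2f}".format(threshold, p, r))
```

A lower threshold catches more fraud at the cost of more false alarms, so the right value depends on how expensive a missed fraud is compared with a blocked legitimate transaction.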

#### Delete the endpoint, as this is just a demo

In [None]:
client.delete_endpoint(EndpointName=endpoint_name)

### Clean up local tmp directory

In [None]:
!rm -rf $tmpdir

### Next Steps
This XGBoost model does not take the time series (the step sequence) into account. To improve the model, we will look at using time series classification techniques.

**Coming soon.....**