# EvalML Fraud Detection Demo:
This demo showcases using EvalMl to optimize models using a custom objective to predict realized business value. The goal of the model would be to take in credit card transaction data and decide whether the transaction is fraudulent. 

Data: https://www.kaggle.com/c/ieee-fraud-detection/

In [1]:
import os

import evalml
import featuretools as ft
import numpy as np
import pandas as pd

In [2]:
train_identity = pd.read_csv('https://featuretools-static.s3.amazonaws.com/evalml/IEEE-CIS+Fraud+Detection/train_identity.csv')
train_transaction = pd.read_csv('https://featuretools-static.s3.amazonaws.com/evalml/IEEE-CIS+Fraud+Detection/train_transaction.csv')

In [3]:
display(train_identity.head())
display(train_transaction.head())

Unnamed: 0,TransactionID,id_01,id_02,id_03,id_04,id_05,id_06,id_07,id_08,id_09,...,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987004,0.0,70787.0,,,,,,,,...,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M
1,2987008,-5.0,98945.0,,,0.0,-5.0,,,,...,mobile safari 11.0,32.0,1334x750,match_status:1,T,F,F,T,mobile,iOS Device
2,2987010,-5.0,191631.0,0.0,0.0,0.0,0.0,,,0.0,...,chrome 62.0,,,,F,F,T,T,desktop,Windows
3,2987011,-5.0,221832.0,,,0.0,-6.0,,,,...,chrome 62.0,,,,F,F,T,T,desktop,
4,2987016,0.0,7460.0,0.0,0.0,1.0,0.0,,,0.0,...,chrome 62.0,24.0,1280x800,match_status:2,T,F,T,T,desktop,MacOS


Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Merge dataframes:

Since the data given is a one-to-one relationship between the identity and transaction data, we can merge the two dataframes on the `TransactionID` column.

In [4]:
train_df = train_transaction.merge(train_identity)

train_sample = train_df.sample(frac=0.1, random_state=1)
X_train = train_sample.drop('isFraud', axis=1)
y_train = train_sample['isFraud']

## Encode Categorical Variables:
As categorical variables are not compatible with some machine learning models, here we encode them into numerical variables by creating dummy variables.

In [5]:
cat_cols = X_train.select_dtypes(include=['object']).columns

In [6]:
# encode categorical features
X_train = pd.get_dummies(X_train, columns=cat_cols)

In [12]:
X_train, X_holdout, y_train, y_holdout = split_data(X_train, y_train, test_size=.8, random_state=0)

NameError: name 'split_data' is not defined

## Model Training With AUC
Here we utilize a traditional classification objective function to automatically learn the best model. Further down, 

In [8]:
clf = evalml.AutoClassifier(objective="AUC",
                            max_pipelines=5)

### After fitting our models, we can display the rankings of all the models and also score the holdout data with the best model

In [9]:
# fit using autoclassfier
clf.fit(X_train, y_train)

[1m*****************************[0m
[1m* Beginning pipeline search *[0m
[1m*****************************[0m

Optimizing for AUC. Greater score is better.

Searching up to 5 pipelines. No time limit is set. Set one using max_time parameter.

Possible model types: random_forest, linear_model, xgboost

Testing XGBoost w/ imputation: 100%|██████████| 5/5 [29:48<00:00, 357.64s/it]                     

✔ Optimization finished


In [10]:
clf.rankings

Unnamed: 0,id,pipeline_name,score,high_variance_cv,parameters
0,4,XGBoostPipeline,0.784822,False,"{'eta': 0.38438170729269994, 'min_child_weight..."
1,1,XGBoostPipeline,0.75878,False,"{'eta': 0.5928446182250184, 'min_child_weight'..."
2,3,LogisticRegressionPipeline,0.743911,False,"{'penalty': 'l2', 'C': 6.239401330891865, 'imp..."
3,0,LogisticRegressionPipeline,0.742609,False,"{'penalty': 'l2', 'C': 8.444214828324364, 'imp..."
4,2,RFClassificationPipeline,0.731483,False,"{'n_estimators': 569, 'max_depth': 22, 'impute..."


In [11]:
pipeline = clf.best_pipeline
print("Model Score: {}".format(pipeline.score(X_holdout, y_holdout)))

NameError: name 'X_holdout' is not defined

## Custom Objective:

Here we utilize a custom objective function built within EvalML for fraud detection. Using it we can define how the model will train to provide the most realized business value. We define below that `50%` of our customers will retry a declined transaction, we earn `2%` of each transaction and we will not be able to colelct `75%` of all fraudulent transactions. Thus, the model chosen will best fit our business needs.

In [None]:
fraud_objective = evalml.objectives.FraudDetection(
    retry_percentage=.5,
    interchange_fee=.02,
    fraud_payout_percentage=.75,
    amount_col='TransactionAmt'  # column in data that contains the amount of the transaction
)

clf_fraud = evalml.AutoClassifier(objective=fraud_objective,
                            max_pipelines=5)

In [None]:
%%time
# fit using autoclassfier
clf_fraud.fit(X_train, y_train)

### Again we can rank our models and see the performance on our holdout sets. However, this time we will see the predicted amount of dollars lost due to fraudulent transactions!

In [None]:
clf_fraud.rankings

In [None]:
pipeline = clf_fraud.best_pipeline
print("Best Model Dollars Lost: {}".format(pipeline.score(X_holdout, y_holdout)))

### In comparison, the model that optimized for AUC performed TODO: HOW IT PERFORMED

In [None]:
pipeline = best_pipeline
print("AUC Model Dollars Lost: {}".format(pipeline.score(X_holdout, y_holdout, other_objectives=fraud_objective)[1]))