# IEEE-CIS Fraud Detection
Can you detect fraud from customer transactions?

Imagine standing at the check-out counter at the grocery store with a long line behind you and the cashier not-so-quietly announces that your card has been declined. In this moment, you probably aren’t thinking about the data science that determined your fate.

Embarrassed, and certain you have the funds to cover everything needed for an epic nacho party for 50 of your closest friends, you try your card again. Same result. As you step aside and allow the cashier to tend to the next customer, you receive a text message from your bank. “Press 1 if you really tried to spend $500 on cheddar cheese.”

While perhaps cumbersome (and often embarrassing) in the moment, this fraud prevention system is actually saving consumers millions of dollars per year. Researchers from the [IEEE Computational Intelligence Society](https://cis.ieee.org/) (IEEE-CIS) want to improve this figure, while also improving the customer experience. With more accurate fraud detection, you can get on with your chips without the hassle.

IEEE-CIS works across a variety of AI and machine learning areas, including deep neural networks, fuzzy systems, evolutionary computation, and swarm intelligence. Today they’re partnering with the world’s leading payment service company, [Vesta Corporation](https://trustvesta.com/), to seek the best solutions for the fraud prevention industry, and now you are invited to join the challenge.

In this competition, you’ll benchmark machine learning models on a challenging large-scale dataset. The data comes from Vesta's real-world e-commerce transactions and contains a wide range of features from device type to product features. You also have the opportunity to create new features to improve your results.

If successful, you’ll improve the efficacy of fraudulent transaction alerts for millions of people around the world, helping hundreds of thousands of businesses reduce their fraud loss and increase their revenue. And of course, you will save party people just like you the hassle of false positives.

_Acknowledgements_:

![](https://storage.googleapis.com/kaggle-media/competitions/IEEE/Vesta-logo_200x.png)

Vesta Corporation provided the dataset for this competition. Vesta Corporation is the forerunner in guaranteed e-commerce payment solutions. Founded in 1995, Vesta pioneered the process of fully guaranteed card-not-present (CNP) payment transactions for the telecommunications industry. Since then, Vesta has firmly expanded its data science and machine learning capabilities across the globe and solidified its position as the leader in guaranteed e-commerce payments. Today, Vesta guarantees more than $18B in transactions annually.

Header Photo by Tim Evans on Unsplash

Dataset Description
-------------------

In this competition you are predicting the probability that an online transaction is fraudulent, as denoted by the binary target `isFraud`.

The data is split into two files, `identity` and `transaction`, which are joined on `TransactionID`. Not all transactions have corresponding identity information.
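
Because some transactions have no identity record, a left join from `transaction` onto `identity` keeps every transaction row. The sketch below uses made-up toy tables; `has_identity` is a hypothetical helper column, not part of the dataset.

```python
import pandas as pd

# Toy stand-ins for the two tables; all values here are made up.
transaction = pd.DataFrame(
    {"TransactionID": [1, 2, 3], "TransactionAmt": [50.0, 20.0, 75.5]}
)
identity = pd.DataFrame(
    {"TransactionID": [1, 3], "DeviceType": ["mobile", "desktop"]}
)

# A left join keeps every transaction; identity columns become NaN
# for transactions that have no identity record.
merged = transaction.merge(
    identity, on="TransactionID", how="left", indicator=True
)

# Hypothetical flag so a model can tell "no identity record" apart
# from a genuinely missing value inside an existing record.
merged["has_identity"] = merged["_merge"] == "both"
merged = merged.drop(columns="_merge")
print(merged)
```

An inner join would silently drop the transactions without identity data, which matters because predictions are required for every test transaction.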

### Categorical Features - Transaction

*   `ProductCD`
*   `card1` - `card6`
*   `addr1`, `addr2`
*   `P_emaildomain`
*   `R_emaildomain`
*   `M1` - `M9`

### Categorical Features - Identity

*   `DeviceType`
*   `DeviceInfo`
*   `id_12` - `id_38`

The `TransactionDT` feature is a timedelta from a given reference datetime (not an actual timestamp).
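
Since `TransactionDT` is an offset rather than a timestamp, calendar-style features have to be derived from it. The sketch below assumes the raw offsets are in seconds and uses an arbitrary placeholder reference date; neither is stated in this description. It applies to the raw CSV column, not to values that have already been rescaled.

```python
import pandas as pd

# Assumptions (not stated in the data description): offsets are in seconds,
# and the reference date below is an arbitrary placeholder.
REFERENCE = pd.Timestamp("2000-01-01")
transaction_dt = pd.Series([86_400, 90_000, 172_800], name="TransactionDT")

features = pd.DataFrame(
    {
        # Whole days elapsed since the reference point.
        "day": transaction_dt // 86_400,
        # Hour of day, which does not depend on the choice of reference date
        # as long as the reference falls on midnight.
        "hour": (transaction_dt // 3_600) % 24,
        # Only meaningful up to the unknown true reference date.
        "pseudo_timestamp": REFERENCE
        + pd.to_timedelta(transaction_dt, unit="s"),
    }
)
print(features)
```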

You can read more about the data from [this post by the competition host](https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203).

Files
-----

*   **train\_{transaction, identity}.csv** - the training set
*   **test\_{transaction, identity}.csv** - the test set (you must predict the `isFraud` value for these observations)
*   **sample\_submission.csv** - a sample submission file in the correct format

Link: https://www.kaggle.com/competitions/ieee-fraud-detection

In [1]:
import numpy as np
import pandas as pd
from catboost import (
    CatBoostClassifier,
    EFeaturesSelectionAlgorithm,
    EShapCalcType,
    Pool,
    sum_models,
    to_classifier,
)
from sklearn.model_selection import StratifiedKFold, train_test_split

In [2]:
%load_ext nb_black

In [3]:
!ls ../../data/ieee-fraud-detection

ieee_identity.orc      submission.csv	     train_identity.csv
ieee_transaction.orc   test_identity.csv     train_transaction.csv
sample_submission.csv  test_transaction.csv


In [4]:
sample_submission_df = pd.read_csv(
    "../../data/ieee-fraud-detection/sample_submission.csv"
)
sample_submission_df

,TransactionID,isFraud
0,3663549,0.5
1,3663550,0.5
2,3663551,0.5
3,3663552,0.5
4,3663553,0.5
...,...,...
506686,4170235,0.5
506687,4170236,0.5
506688,4170237,0.5
506689,4170238,0.5


In [5]:
identity_df = pd.read_orc(
    "../../data/ieee-fraud-detection/ieee_identity.orc"
).set_index("TransactionID")
identity_df

TransactionID,id-01,id-02,id-03,id-04,id-05,id-06,id-07,id-08,id-09,id-10,...,DeviceInfo,id-27,id-34,id-37,id-29,id-23,id-15,id-12,id-35,isTest
2987004,0.744040,-0.625225,-0.059848,0.064284,-0.269624,0.403955,-0.142214,0.157637,-0.060508,0.079481,...,1565,2,3,1,1,3,1,1,1,0
2987008,0.397763,-0.461389,-0.059848,0.064284,-0.269624,0.088768,-0.142214,0.157637,-0.060508,0.079481,...,2693,2,2,0,1,3,1,1,1,0
2987010,0.397763,0.077901,-0.059848,0.064284,-0.269624,0.403955,-0.142214,0.157637,-0.060508,0.079481,...,2526,2,4,1,0,3,0,1,0,0
2987011,0.397763,0.253624,-0.059848,0.064284,-0.269624,0.025730,-0.142214,0.157637,-0.060508,0.079481,...,2799,2,4,1,1,3,1,1,0,0
2987016,0.744040,-0.993690,-0.059848,0.064284,-0.071304,0.403955,-0.142214,0.157637,-0.060508,0.079481,...,1170,2,3,1,0,3,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4170230,-0.641068,1.717158,-0.059848,0.064284,-0.269624,0.403955,-0.142214,0.157637,-0.060508,0.079481,...,2165,2,4,1,1,3,1,1,0,1
4170233,0.397763,1.813465,-0.059848,0.064284,-1.062900,-1.613244,-0.142214,0.157637,-0.060508,0.079481,...,2106,2,4,1,0,3,0,1,0,1
4170234,0.397763,-0.396595,-0.059848,0.064284,4.093396,-1.550206,-0.142214,0.157637,-0.060508,0.079481,...,2693,2,3,0,1,3,1,1,1,1
4170236,-2.372452,0.514710,-0.059848,0.064284,-0.864581,-0.226420,-0.142214,0.157637,-0.060508,0.079481,...,141,2,4,1,1,3,1,1,0,1


In [6]:
transaction_df = pd.read_orc(
    "../../data/ieee-fraud-detection/ieee_transaction.orc"
).set_index("TransactionID")
transaction_df

TransactionID,TransactionDT,TransactionAmt,card1,card2,card3,card5,addr1,addr2,dist1,dist2,...,M4,ProductCD,R_emaildomain,card6,card4,M3,M7,M5,isFraud,isTest
2987000,-1.508623,-0.274058,0.817417,-2.186194,-0.176293,-1.260537,0.435966,0.375378,-0.104394,-0.103594,...,2,4,60,1,1,1,2,0,0.0,0
2987001,-1.508623,-0.437119,-1.465279,0.285881,-0.176293,-2.159569,0.510346,0.375378,-0.187625,-0.103594,...,0,4,60,1,2,2,2,1,0.0,0
2987002,-1.508616,-0.313275,-1.075396,0.812114,-0.176293,-0.721117,0.547536,0.375378,1.069600,-0.103594,...,0,4,60,2,3,1,0,0,0.0,0
2987003,-1.508614,-0.350428,1.676877,1.283277,-0.176293,-1.822432,1.633481,0.375378,-0.187625,-0.103594,...,0,4,60,2,2,2,2,1,0.0,0
2987004,-1.508613,-0.350428,-1.109316,0.958970,-0.176293,-2.159569,1.216954,0.375378,-0.187625,-0.103594,...,3,1,60,1,2,2,2,2,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4170235,1.646657,-0.165987,0.798209,0.108430,2.094471,0.582480,0.205389,-0.579101,-0.187625,-0.103594,...,2,0,16,2,2,2,2,2,,1
4170236,1.646658,-0.506583,-1.383747,0.310357,2.094471,0.582480,-1.906998,-2.700166,-0.187625,0.924042,...,2,0,19,2,2,2,2,2,,1
4170237,1.646662,-0.354556,1.376291,0.812114,-0.176293,0.627431,0.525222,0.375378,-0.187625,-0.103594,...,0,4,60,2,3,1,0,0,,1
4170238,1.646663,0.277047,1.368117,0.971208,-0.176293,0.582480,-0.590475,0.375378,-0.187625,-0.103594,...,0,4,60,2,2,1,0,0,,1


# Prepare

In [7]:
X_test = (
    transaction_df[transaction_df["isTest"] == 1]
    .drop(["isFraud", "isTest"], axis=1)
    .copy()
)
X_test

TransactionID,TransactionDT,TransactionAmt,card1,card2,card3,card5,addr1,addr2,dist1,dist2,...,M1,M9,M4,ProductCD,R_emaildomain,card6,card4,M3,M7,M5
3663549,0.184852,-0.424941,0.098749,-1.506985,-0.176293,0.627431,-0.642541,0.375378,-0.183245,-0.103594,...,1,1,3,4,60,2,3,0,1,2
3663550,0.184856,-0.354556,-1.155293,-1.506985,-0.176293,0.627431,0.316958,0.375378,-0.170103,-0.103594,...,1,2,0,4,60,2,3,0,2,2
3663551,0.184860,0.149075,-1.113608,1.326110,-0.176293,0.627431,1.603729,0.375378,11.355196,-0.103594,...,1,0,0,4,60,2,3,0,0,0
3663552,0.184860,0.619475,0.217267,0.016645,-0.176293,-0.721117,-0.382212,0.375378,-0.113155,-0.103594,...,1,2,3,4,60,2,3,1,2,2
3663553,0.184861,-0.276328,1.653582,0.579592,-0.176293,-1.822432,0.056629,0.375378,-0.161342,-0.103594,...,1,1,3,4,60,2,2,1,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4170235,1.646657,-0.165987,0.798209,0.108430,2.094471,0.582480,0.205389,-0.579101,-0.187625,-0.103594,...,2,2,2,0,16,2,2,2,2,2
4170236,1.646658,-0.506583,-1.383747,0.310357,2.094471,0.582480,-1.906998,-2.700166,-0.187625,0.924042,...,2,2,2,0,19,2,2,2,2,2
4170237,1.646662,-0.354556,1.376291,0.812114,-0.176293,0.627431,0.525222,0.375378,-0.187625,-0.103594,...,1,1,0,4,60,2,3,1,0,0
4170238,1.646663,0.277047,1.368117,0.971208,-0.176293,0.582480,-0.590475,0.375378,-0.187625,-0.103594,...,1,0,0,4,60,2,2,1,0,0


In [8]:
X_train = (
    transaction_df[transaction_df["isTest"] == 0]
    .drop(["isFraud", "isTest"], axis=1)
    .copy()
)
X_train

TransactionID,TransactionDT,TransactionAmt,card1,card2,card3,card5,addr1,addr2,dist1,dist2,...,M1,M9,M4,ProductCD,R_emaildomain,card6,card4,M3,M7,M5
2987000,-1.508623,-0.274058,0.817417,-2.186194,-0.176293,-1.260537,0.435966,0.375378,-0.104394,-0.103594,...,1,2,2,4,60,1,1,1,2,0
2987001,-1.508623,-0.437119,-1.465279,0.285881,-0.176293,-2.159569,0.510346,0.375378,-0.187625,-0.103594,...,2,2,0,4,60,1,2,2,2,1
2987002,-1.508616,-0.313275,-1.075396,0.812114,-0.176293,-0.721117,0.547536,0.375378,1.069600,-0.103594,...,1,0,0,4,60,2,3,1,0,0
2987003,-1.508614,-0.350428,1.676877,1.283277,-0.176293,-1.822432,1.633481,0.375378,-0.187625,-0.103594,...,2,2,0,4,60,2,2,2,2,1
2987004,-1.508613,-0.350428,-1.109316,0.958970,-0.176293,-2.159569,1.216954,0.375378,-0.187625,-0.103594,...,2,2,3,1,60,1,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3577535,-0.054806,-0.354556,-0.689804,-2.186194,-0.176293,0.627431,0.116133,0.375378,0.022642,-0.103594,...,1,1,0,4,60,2,3,1,0,1
3577536,-0.054806,-0.393773,0.105901,-0.809420,-0.176293,0.582480,-0.389650,0.375378,-0.187625,-0.103594,...,1,0,0,4,60,2,2,0,0,0
3577537,-0.054803,-0.429069,0.431417,1.454609,-0.176293,0.582480,-0.188824,0.375378,-0.187625,-0.103594,...,1,2,3,4,60,2,2,0,2,2
3577538,-0.054803,-0.073844,-0.429064,0.757043,-0.176293,0.582480,0.971501,0.375378,-0.174484,-0.103594,...,1,2,0,4,60,2,2,1,2,0


In [9]:
# Labels for the training rows, aligned to X_train by TransactionID.
y_train = transaction_df.loc[X_train.index, ["isFraud"]].astype(int)
y_train

TransactionID,isFraud
2987000,0
2987001,0
2987002,0
2987003,0
2987004,0
...,...
3577535,0
3577536,0
3577537,0
3577538,0


In [10]:
y_train.value_counts(normalize=True)

isFraud
0          0.96501
1          0.03499
dtype: float64

In [11]:
X_train, X_true, y_train, y_true = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42
)
X_train.shape, X_true.shape, y_train.shape, y_true.shape

((531486, 392), (59054, 392), (531486, 1), (59054, 1))
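
With only about 3.5% positives, a purely random split can leave the holdout with a slightly different fraud rate than the training part. A minimal sketch of the alternative, on synthetic data, is to pass `stratify=` to `train_test_split`. The split above does not do this, so treat it as an option to consider rather than what this notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20,000 rows with roughly a 3.5% positive rate.
rng = np.random.default_rng(42)
y = (rng.random(20_000) < 0.035).astype(int)
X = rng.normal(size=(20_000, 3))

# stratify=y keeps the class ratio (almost) identical in both parts.
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)
print(f"overall: {y.mean():.4f}  train: {y_tr.mean():.4f}  holdout: {y_ho.mean():.4f}")
```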

# Train

## Hyperparameter tuning

In [12]:
model = CatBoostClassifier(logging_level="Silent")

# https://docs.aws.amazon.com/sagemaker/latest/dg/catboost-tuning.html
tuned_params = {
    "learning_rate": [
        0.001,
        0.002,
        0.003,
        0.004,
        0.005,
        0.006,
        0.007,
        0.008,
        0.009,
        0.01,
    ],
    "depth": [4, 5, 6, 7, 8, 9, 10],
    "l2_leaf_reg": [2, 3, 4, 5, 6, 7, 8, 9, 10],
    "random_strength": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "iterations": [500, 600, 700, 800, 900, 1000],
}

grid_search_result = model.randomized_search(
    tuned_params, Pool(X_train, y_train), verbose=False, plot=True
)

In [13]:
best_model_params = grid_search_result["params"]
best_model_params

{'depth': 10,
 'l2_leaf_reg': 8,
 'iterations': 1000,
 'random_strength': 3.0,
 'learning_rate': 0.008}
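
To put the search in perspective, it helps to count how many combinations the grid contains. The sketch below rebuilds the same grid and assumes the search evaluated 10 candidates, which I believe is the default `n_iter` for CatBoost's `randomized_search`; check this against your installed version.

```python
from math import prod
import random

# Same grid as in the tuning cell above.
tuned_params = {
    "learning_rate": [round(0.001 * k, 3) for k in range(1, 11)],
    "depth": list(range(4, 11)),
    "l2_leaf_reg": list(range(2, 11)),
    "random_strength": [float(k) for k in range(11)],
    "iterations": list(range(500, 1001, 100)),
}

n_combinations = prod(len(values) for values in tuned_params.values())
n_iter = 10  # assumption: number of candidates the search sampled
coverage = n_iter / n_combinations
print(f"{n_combinations} combinations, {coverage:.4%} of them sampled")

# One random draw from the grid, as a randomized search would make it.
rng = random.Random(0)
candidate = {name: rng.choice(values) for name, values in tuned_params.items()}
print(candidate)
```

Note that the selected `depth` (10) and `iterations` (1000) both sit at the upper edge of their ranges, which hints that widening the grid could be worthwhile.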

# Cross-validation loop

In [14]:
skf = StratifiedKFold(n_splits=5)

In [15]:
ensemble = []

for i, (train_index, val_index) in enumerate(skf.split(X_train, y_train)):
    X_sub_train, X_sub_val = X_train.iloc[train_index], X_train.iloc[val_index]
    y_sub_train, y_sub_val = y_train.iloc[train_index], y_train.iloc[val_index]

    model = CatBoostClassifier(**best_model_params)

    model.fit(
        Pool(X_sub_train, y_sub_train),
        eval_set=Pool(X_sub_val, y_sub_val),
        verbose=False,
    )

    ensemble.append(model)
    print(model.best_score_)

{'learn': {'Logloss': 0.08558366067496066}, 'validation': {'Logloss': 0.08665814240520439}}
{'learn': {'Logloss': 0.08535027651338876}, 'validation': {'Logloss': 0.08777876040572438}}
{'learn': {'Logloss': 0.08502091119570183}, 'validation': {'Logloss': 0.08765053181936175}}
{'learn': {'Logloss': 0.0851887496855096}, 'validation': {'Logloss': 0.08761280049084375}}
{'learn': {'Logloss': 0.0847095864431656}, 'validation': {'Logloss': 0.08934629581286058}}
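
The five validation scores are easier to judge once summarized and set against a trivial reference. The sketch below copies the printed values and compares their mean with the log loss of always predicting the class prior. The prior is taken from the `value_counts` output earlier, so it is only approximately the rate within each fold.

```python
from math import log
from statistics import mean, stdev

# Validation Logloss values copied from the fold output above.
validation_logloss = [
    0.08665814240520439,
    0.08777876040572438,
    0.08765053181936175,
    0.08761280049084375,
    0.08934629581286058,
]

# Reference point: predict the fraud rate for every row.
prior = 0.03499
baseline_logloss = -(prior * log(prior) + (1 - prior) * log(1 - prior))

print(f"mean: {mean(validation_logloss):.5f}  std: {stdev(validation_logloss):.5f}")
print(f"prior-only baseline: {baseline_logloss:.5f}")
```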


In [16]:
models_avrg = to_classifier(
    sum_models(ensemble, weights=[1.0 / len(ensemble)] * len(ensemble))
)
models_avrg

<catboost.core.CatBoostClassifier at 0x7f95fa59b850>
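
As far as I understand it, `sum_models` merges the trees of the fold models, so the blended model's raw score is the weighted sum of the members' raw (log-odds) scores. With equal weights that is an average in log-odds space, which is not the same as averaging predicted probabilities. The NumPy sketch below illustrates the difference with made-up scores; it does not call CatBoost.

```python
import numpy as np


def sigmoid(z):
    """Map raw log-odds scores to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical raw scores: 3 fold models (rows) x 2 transactions (columns).
raw_scores = np.array(
    [
        [-4.0, 0.5],
        [-3.0, 1.5],
        [-5.0, -0.5],
    ]
)
weights = np.full(3, 1.0 / 3.0)

# Average the raw scores first, then convert to a probability.
blended_raw = weights @ raw_scores
prob_from_raw_average = sigmoid(blended_raw)

# Alternative: convert each model's score first, then average.
prob_average = weights @ sigmoid(raw_scores)

print(prob_from_raw_average, prob_average)
```

For low-probability rows the two blends diverge noticeably, because the sigmoid is convex there and averaging probabilities gives the larger value.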

In [17]:
pd.DataFrame(
    {
        "Column": X_train.columns,
        "Score": models_avrg.get_feature_importance(),
    }
).sort_values(by="Score", ascending=False)

,Column,Score
23,C14,4.545559
387,card6,4.395237
22,C13,4.345609
10,C1,3.560260
384,M4,3.061948
...,...,...
65,V27,0.000281
343,V305,0.000061
145,V107,0.000060
127,V89,0.000005


# Validate

In [18]:
y_preds_1 = models_avrg.predict(X_true)
y_preds_1

array([0, 0, 0, ..., 0, 0, 0])

In [19]:
(y_true["isFraud"] == y_preds_1).sum() / len(y_true)

0.9756324719748027
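
An accuracy of about 0.976 has to be read against the class balance: predicting "not fraud" for every row would already score about 0.965 on data with a 3.5% fraud rate. The competition scores submissions by area under the ROC curve (per its Kaggle evaluation page), which depends on how well the probabilities rank fraud above non-fraud. The sketch below contrasts the two metrics on synthetic labels and scores; none of the numbers come from this notebook's model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels with roughly a 3.5% positive rate.
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.035).astype(int)

# A "model" that never flags fraud: high accuracy, no ranking ability.
always_legit = np.zeros_like(y)
baseline_acc = accuracy_score(y, always_legit)
baseline_auc = roc_auc_score(y, np.zeros(len(y)))

# Made-up scores that are informative: positives are shifted upwards.
scores = rng.normal(size=len(y)) + 2.0 * y
auc = roc_auc_score(y, scores)

print(f"baseline accuracy: {baseline_acc:.4f}")
print(f"baseline ROC AUC:  {baseline_auc:.4f}")
print(f"informative ROC AUC: {auc:.4f}")
```

On the holdout split, the analogous call would pass the `isFraud` labels together with the positive-class column of `predict_proba`, not the hard 0/1 predictions.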

# Submission

In [20]:
y_preds_avrg = models_avrg.predict_proba(X_test)[:, 1]
y_preds_avrg

array([0.00655312, 0.01041727, 0.01702622, ..., 0.01152193, 0.01419534,
       0.02982722])

In [21]:
submission = pd.DataFrame(
    {"TransactionID": X_test.index, "isFraud": y_preds_avrg}
).set_index("TransactionID")
submission

TransactionID,isFraud
3663549,0.006553
3663550,0.010417
3663551,0.017026
3663552,0.006850
3663553,0.010826
...,...
4170235,0.018325
4170236,0.022508
4170237,0.011522
4170238,0.014195


In [22]:
submission.to_csv("../../data/ieee-fraud-detection/submission.csv")
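
Before uploading, it can be worth checking the written file against the sample submission. The sketch below uses a hypothetical helper, `check_submission`, on toy frames shaped like the CSVs read back with `pd.read_csv`, that is, with `TransactionID` as a regular column.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("columns differ from the sample file")
    elif not submission["TransactionID"].equals(sample["TransactionID"]):
        problems.append("TransactionID values or order differ")
    # NaN fails the between() check, so missing predictions are flagged too.
    if "isFraud" in submission and not submission["isFraud"].between(0, 1).all():
        problems.append("isFraud contains values outside [0, 1]")
    return problems


# Toy frames; the probabilities are made up.
sample = pd.DataFrame(
    {"TransactionID": [3663549, 3663550, 3663551], "isFraud": [0.5, 0.5, 0.5]}
)
good = pd.DataFrame(
    {"TransactionID": [3663549, 3663550, 3663551], "isFraud": [0.01, 0.02, 0.03]}
)
bad = good.assign(isFraud=[0.1, 1.2, 0.3])

print(check_submission(good, sample))
print(check_submission(bad, sample))
```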
