# [Paris Saclay Center for Data Science](http://www.datascience-paris-saclay.fr)


## Fraud detection


## Introduction


### Requirements

* numpy>=1.10.0  
* matplotlib>=1.5.0 
* pandas>=0.19.0  
* scikit-learn>=0.19   

In [12]:
%matplotlib inline
import os
import glob
import numpy as np
from scipy import io
import matplotlib.pyplot as plt
import pandas as pd
pd.options.display.max_columns = 999

## Exploratory data analysis

### Loading the data



In [33]:
train_filename = 'original_data/train.csv'

In [5]:
data = pd.read_csv(train_filename)

In [6]:
data.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


In [7]:
data.describe()

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| count | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 |
| mean | 243.3972 | 179861.9 | 833883.1 | 855113.7 | 1100702.0 | 1224996.0 | 0.00129082 | 2.514687e-06 |
| std | 142.332 | 603858.2 | 2888243.0 | 2924049.0 | 3399180.0 | 3674129.0 | 0.0359048 | 0.001585775 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 156.0 | 13389.57 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 239.0 | 74871.94 | 14208.0 | 0.0 | 132705.7 | 214661.4 | 0.0 | 0.0 |
| 75% | 335.0 | 208721.5 | 107315.2 | 144258.4 | 943036.7 | 1111909.0 | 0.0 | 0.0 |
| max | 743.0 | 92445520.0 | 59585040.0 | 49585040.0 | 356015900.0 | 356179300.0 | 1.0 | 1.0 |


In [29]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5090096 entries, 0 to 5090095
Data columns (total 13 columns):
Unnamed: 0        int64
step              int64
type              object
amount            float64
nameOrig          object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest          object
oldbalanceDest    float64
newbalanceDest    float64
isFraud           int64
isFlaggedFraud    int64
id                int64
dtypes: float64(5), int64(5), object(3)
memory usage: 504.8+ MB


In [30]:
data.count()

Unnamed: 0        5090096
step              5090096
type              5090096
amount            5090096
nameOrig          5090096
oldbalanceOrg     5090096
newbalanceOrig    5090096
nameDest          5090096
oldbalanceDest    5090096
newbalanceDest    5090096
isFraud           5090096
isFlaggedFraud    5090096
id                5090096
dtype: int64

In [31]:
np.unique(data['isFraud'])

array([0, 1])

In [32]:
data.groupby('isFraud').count()[['id']]

| isFraud | id |
|---|---|
| 0 | 5083514 |
| 1 | 6582 |
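
The counts above show that fraud is a very rare class. As a rough illustration, the following sketch uses the two counts copied from that output (the variable names are illustrative) to compute the fraud rate and the accuracy of a trivial model that always predicts "not fraud". This is worth keeping in mind when reading accuracy scores later on.

```python
# Class counts copied from the groupby output above.
n_legit = 5083514
n_fraud = 6582

n_total = n_legit + n_fraud
fraud_rate = n_fraud / n_total

# Accuracy reached by a trivial model that always predicts "not fraud".
majority_accuracy = n_legit / n_total

print(f"fraud rate: {fraud_rate:.6f}")
print(f"majority-class accuracy: {majority_accuracy:.4f}")
```

An accuracy close to 0.999 can therefore be reached without detecting a single fraud, so accuracy alone says little about a submission.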


## The pipeline

To submit to the [RAMP site](http://ramp.studio), you have to write two classes, saved in two different files:
* the class `FeatureExtractor`, which will be used to extract features for classification from the dataset and produce a numpy array of size (number of samples $\times$ number of features), and  
* the class `Classifier` to predict the target.
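
To make the division of labour concrete, here is a minimal sketch of how the two classes could be chained. It assumes a simple call order (fit the extractor, transform, fit the classifier, predict probabilities); the actual ramp-workflow code may differ in its details. The toy data frame and the class bodies are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


class FeatureExtractor:
    def fit(self, X_df, y):
        return self

    def transform(self, X_df):
        # Keep only the numeric columns in this toy example.
        return X_df.select_dtypes(include="number").values


class Classifier:
    def fit(self, X, y):
        self.clf = RandomForestClassifier(n_estimators=5, random_state=0)
        self.clf.fit(X, y)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)


# Tiny synthetic frame: column names mimic the real data, values are made up.
rng = np.random.RandomState(0)
X_df = pd.DataFrame({
    "amount": rng.uniform(0, 1000, size=40),
    "oldbalanceOrg": rng.uniform(0, 5000, size=40),
    "type": ["PAYMENT", "TRANSFER"] * 20,
})
y = np.array([0, 1] * 20)

# Assumed call order: extractor first, then classifier.
fe = FeatureExtractor()
fe.fit(X_df, y)
X = fe.transform(X_df)

clf = Classifier()
clf.fit(X, y)
proba = clf.predict_proba(X)

# X has one row per sample and one column per feature;
# proba has one column per class.
print(X.shape, proba.shape)
```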

### Feature extractor

The feature extractor implements a `transform` member function. It is saved in the file [`submissions/starting_kit/feature_extractor.py`](/edit/submissions/starting_kit/feature_extractor.py). It receives the pandas dataframe `X_df` defined at the beginning of the notebook. It should produce a numpy array representing the extracted features, which will then be used for the classification.  

Note that the following code cells are *not* executed in the notebook. The notebook saves their contents in the file specified in the first line of the cell, so you can edit your submission before running the local test below and submitting it at the RAMP site.

In [34]:
%%file submissions/starting_kit/feature_extractor.py
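class FeatureExtractor():
    def __init__(self):
        pass

    def fit(self, X_df, y):
        # Nothing to learn: the transformation below is stateless.
        pass

    def transform(self, X_df):
        # Drop the account identifiers and the categorical transaction type,
        # leaving the remaining (numeric) columns.
        X_df = X_df.drop(['nameDest', 'nameOrig', 'type'], axis=1)
        return X_df.values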



Overwriting submissions/starting_kit/feature_extractor.py
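
The starting-kit extractor simply drops the `type` column. As one possible variation (not part of the starting kit), the sketch below one-hot encodes `type` instead. It assumes that `X_df` contains the `type`, `nameOrig` and `nameDest` columns shown earlier; the `types_` attribute and the toy data are illustrative. The cell is not written to a file, so it does not overwrite your submission.

```python
import numpy as np
import pandas as pd


class FeatureExtractor:
    def fit(self, X_df, y):
        # Remember the transaction types seen at training time so that
        # transform always produces the same columns in the same order.
        self.types_ = sorted(X_df["type"].unique())
        return self

    def transform(self, X_df):
        numeric = X_df.drop(["nameDest", "nameOrig", "type"], axis=1)
        # Types unseen during fit become an all-zero row.
        type_cat = pd.Series(
            pd.Categorical(X_df["type"], categories=self.types_))
        one_hot = pd.get_dummies(type_cat).values.astype(float)
        return np.hstack([numeric.values.astype(float), one_hot])


# Toy data with made-up values.
X_df = pd.DataFrame({
    "step": [1, 1, 2],
    "type": ["PAYMENT", "TRANSFER", "PAYMENT"],
    "amount": [10.0, 20.0, 30.0],
    "nameOrig": ["C1", "C2", "C3"],
    "nameDest": ["M1", "C4", "M2"],
})

fe = FeatureExtractor().fit(X_df, None)
X = fe.transform(X_df)
print(X.shape)  # two numeric columns plus two one-hot columns
```

Storing the categories in `fit` keeps the number of output columns identical between training and test data, even if a transaction type is missing from one of them.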


### Classifier

The classifier follows a classical scikit-learn classifier template. It should be saved in the file [`submissions/starting_kit/classifier.py`](/edit/submissions/starting_kit/classifier.py). In its simplest form it takes a scikit-learn pipeline, assigns it to `self.clf` in `__init__`, then calls its `fit` and `predict_proba` functions in the corresponding member functions.

In [35]:
%%file submissions/starting_kit/classifier.py
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier


class Classifier(BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y):
        self.clf = RandomForestClassifier(
            n_estimators=2, max_leaf_nodes=2, random_state=61)
        self.clf.fit(X, y)

    def predict(self, X):
        return self.clf.predict(X)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)



Overwriting submissions/starting_kit/classifier.py
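
The "simplest form" described above, with a scikit-learn pipeline assigned to `self.clf` in `__init__`, could look like the following sketch. The choice of `StandardScaler` and `LogisticRegression` is only an illustration, not the starting-kit model, and the synthetic data is made up. The cell is not written to a file.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class Classifier(BaseEstimator):
    def __init__(self):
        # Scaling followed by a linear model; both choices are illustrative.
        self.clf = make_pipeline(StandardScaler(), LogisticRegression())

    def fit(self, X, y):
        self.clf.fit(X, y)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)


# Synthetic data: the label depends only on the sign of the first feature.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
y = (X[:, 0] > 0).astype(int)

clf = Classifier()
clf.fit(X, y)
proba = clf.predict_proba(X)
print(proba.shape)  # one row per sample, one column per class
```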


## Local testing (before submission)

It is <b><span style="color:red">important that you test your submission files before submitting them</span></b>. For this we provide a unit test. Note that the test runs on your files in [`submissions/starting_kit`](/tree/submissions/starting_kit), not on the classes defined in the cells of this notebook.

First `pip install ramp-workflow` or install it from the [github repo](https://github.com/paris-saclay-cds/ramp-workflow). Make sure that the python files `classifier.py` and `feature_extractor.py` are in the  [`submissions/starting_kit`](/tree/submissions/starting_kit) folder, and the data `train.csv` and `test.csv` are in [`data`](/tree/data). Then run

```
ramp_test_submission
```

If it runs and prints training and test errors on each fold, then you can submit the code.


In [55]:
!ramp_test_submission

[38;5;178m[1mTesting Fraud detection[0m
[38;5;178m[1mReading train and test files from ./data ...[0m
[38;5;178m[1mReading cv ...[0m
[38;5;178m[1mTraining ./submissions/starting_kit ...[0m
[38;5;178m[1mCV fold 0[0m
	[38;5;178m[1mscore    auc    acc    nll[0m
	[38;5;10m[1mtrain[0m  [38;5;10m[1m0.510[0m  [38;5;150m0.999[0m  [38;5;150m0.010[0m
	[38;5;12m[1mvalid[0m  [38;5;12m[1m0.496[0m  [38;5;105m0.999[0m  [38;5;105m0.010[0m
	[38;5;1m[1mtest[0m   [38;5;1m[1m0.497[0m  [38;5;218m1.000[0m  [38;5;218m0.001[0m
[38;5;178m[1mCV fold 1[0m
	[38;5;178m[1mscore    auc    acc    nll[0m
	[38;5;10m[1mtrain[0m  [38;5;10m[1m0.504[0m  [38;5;150m0.999[0m  [38;5;150m0.010[0m
	[38;5;12m[1mvalid[0m  [38;5;12m[1m0.500[0m  [38;5;105m0.999[0m  [38;5;105m0.010[0m
	[38;5;1m[1mtest[0m   [38;5;1m[1m0.499[0m  [38;5;218m1.000[0m  [38;5;218m0.001[0m
[38;5;178m[1mCV fold 2[0m
	[38;5;178m[1mscore    auc    acc    nll[0m
	[38;5;10m

You can use the `--quick-test` switch to test the notebook on the mock data sets in `data/`. Since the data is random, the scores will not be meaningful, but it can be useful to run this first on your submissions to make sure they run without errors.

In [56]:
!ramp_test_submission --quick-test

[38;5;178m[1mTesting Fraud detection[0m
[38;5;178m[1mReading train and test files from ./data ...[0m
[38;5;178m[1mReading cv ...[0m
[38;5;178m[1mTraining ./submissions/starting_kit ...[0m
[38;5;178m[1mCV fold 0[0m
	[38;5;178m[1mscore  Kappa    auc    acc[0m
	[38;5;10m[1mtrain[0m    [38;5;10m[1m0.0[0m  [38;5;150m0.517[0m  [38;5;150m0.999[0m
	[38;5;12m[1mvalid[0m    [38;5;12m[1m0.0[0m  [38;5;105m0.498[0m  [38;5;105m0.999[0m
	[38;5;1m[1mtest[0m     [38;5;1m[1m0.0[0m  [38;5;218m0.498[0m  [38;5;218m1.000[0m
[38;5;178m[1mCV fold 1[0m
	[38;5;178m[1mscore  Kappa    auc    acc[0m
	[38;5;10m[1mtrain[0m    [38;5;10m[1m0.0[0m  [38;5;150m0.598[0m  [38;5;150m0.999[0m
	[38;5;12m[1mvalid[0m    [38;5;12m[1m0.0[0m  [38;5;105m0.497[0m  [38;5;105m0.999[0m
	[38;5;1m[1mtest[0m     [38;5;1m[1m0.0[0m  [38;5;218m0.476[0m  [38;5;218m1.000[0m
[38;5;178m[1mCV fold 2[0m
	[38;5;178m[1mscore  Kappa    auc    acc[0m
	[38;5;10m

## Other models in the starting kit

You can also keep several other submissions in your work directory [`submissions`](/tree/submissions) and test them using
```
ramp_test_submission --submission <submission_name>
```
where `<submission_name>` is the name of the folder in `submissions/`.

## Submitting to [ramp.studio](http://ramp.studio)

If you are eligible, you can join the team at [ramp.studio](http://www.ramp.studio). If it is your first time using RAMP, [sign up](http://www.ramp.studio/sign_up), otherwise [log in](http://www.ramp.studio/login). 

Once your signup request is accepted, you can go to your [sandbox](http://www.ramp.studio/events/kaggle_seguro/sandbox) and copy-paste (or upload) [`feature_extractor.py`](/edit/submissions/starting_kit/feature_extractor.py) and [`classifier.py`](/edit/submissions/starting_kit/classifier.py) from `submissions/starting_kit`. Save the submission, rename it, then submit it. The submission is trained and tested on our backend in the same way as `ramp_test_submission` does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in [my submissions](http://www.ramp.studio/events/kaggle_seguro/my_submissions). Once it is trained, you receive an email, and your submission shows up on the [public leaderboard](http://www.ramp.studio/events/kaggle_seguro/leaderboard). 
If there is an error (despite having tested your submission locally with `ramp_test_submission`), it will show up in the "Failed submissions" table in [my submissions](http://www.ramp.studio/events/kaggle_seguro/my_submissions). You can click on the error to see part of the trace.

After submission, do not forget to give credit to the previous submissions you reused or integrated into your submission.

The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.

The usual way to work with RAMP is to explore solutions, add feature transformations, select models, perhaps do some AutoML/hyperopt, etc., _locally_, and check them with `ramp_test_submission`. The script prints mean cross-validation and test scores:
```
----------------------------
train ngini = 0.119 ± 0.007
train auc = 0.559 ± 0.003
train acc = 0.964 ± 0.0
train nll = 0.156 ± 0.0
valid ngini = 0.114 ± 0.005
valid auc = 0.558 ± 0.002
valid acc = 0.964 ± 0.0
valid nll = 0.156 ± 0.0
test ngini = 0.229 ± 0.256
test auc = 0.307 ± 0.064
test acc = 1.0 ± 0.0
test nll = 0.037 ± 0.0
```
and bagged cross-validation and test scores
```
valid ngini = 0.167
test ngini = -0.324
```
The latter combines the cross-validation models pointwise on the validation and test sets, and usually leads to a better score than the mean CV score. The RAMP [leaderboard](http://www.ramp.studio/events/kaggle_seguro/leaderboard) displays this score.
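
As a rough illustration of what combining models pointwise can mean, the sketch below averages, for each test point, the fraud probabilities predicted by the models of three CV folds. This assumes that the combination is a plain average of predicted probabilities; the exact rule used by ramp-workflow may differ. All numbers are made up.

```python
import numpy as np

# Hypothetical fraud probabilities predicted on the same five test points
# by the models trained on three different CV folds (one row per fold).
fold_probas = np.array([
    [0.10, 0.80, 0.30, 0.05, 0.60],
    [0.20, 0.70, 0.40, 0.10, 0.50],
    [0.15, 0.90, 0.20, 0.00, 0.70],
])

# Pointwise combination: average over the folds for each test point.
bagged = fold_probas.mean(axis=0)
print(bagged)
```

The bagged prediction is then scored once, instead of scoring each fold separately and averaging the scores.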

The official score in this RAMP (the first score column after "historical contributivity" on the [leaderboard](http://www.ramp.studio/events/kaggle_seguro/leaderboard)) is normalized Gini ("ngini"), so the line that is relevant in the output of `ramp_test_submission` is `valid ngini = 0.167`. When the score is good enough, you can submit it at the RAMP.
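
For binary classification, the normalized Gini coefficient is commonly defined as `2 * AUC - 1`. The sketch below computes it that way with scikit-learn. It assumes this common definition and is not taken from the RAMP scoring code; the helper name `normalized_gini` and the toy labels and scores are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def normalized_gini(y_true, y_score):
    # Common definition for binary targets: a linear rescaling of the AUC,
    # so that a random ranking scores about 0 and a perfect ranking scores 1.
    return 2.0 * roc_auc_score(y_true, y_score) - 1.0


# Made-up labels and predicted fraud scores.
y_true = np.array([0, 0, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.20, 0.80])

print(normalized_gini(y_true, y_score))
```

Under this definition, an AUC near 0.5 corresponds to a normalized Gini near 0.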

## More information

You can find more information in the [README](https://github.com/paris-saclay-cds/ramp-workflow/blob/master/README.md) of the [ramp-workflow library](https://github.com/paris-saclay-cds/ramp-workflow).

## Contact

Don't hesitate to [contact us](mailto:admin@ramp.studio?subject=kaggle seguro notebook).