In [1]:
%load_ext autoreload
%autoreload 2

## Goals
* Compare the predictive performance and computation time of several **unsupervised anomaly detection** algorithms: **Isolation Forest**, **COPOD**, and **Random Cut Forest**.
* (Optional) Use `Altair` to draw interactive plots during EDA.

## Requirements
* The dataset can be downloaded from [this Kaggle competition](https://www.kaggle.com/c/ieee-fraud-detection/).
* The `Altair`, `vega-datasets`, `PyOD`, and `scikit-learn` (version 0.24) libraries are required.
* Some experiments need `Amazon SageMaker` and its related SDKs. If you skip those experiments, you can skip this installation.

In [2]:
import gc
import multiprocessing
import os
import pickle
import numpy as np
import pandas as pd
import altair as alt
from scipy.interpolate import interp1d
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.manifold import TSNE
from sklearn.metrics import (average_precision_score, precision_recall_curve, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from pyod.models.copod import COPOD
from pyod.models.iforest import IForest

In [3]:
def dump_pickle(file_path, obj):
    with open(file_path, 'wb') as f:
        pickle.dump(obj, f)
        
        
def load_pickle(file_path):
    with open(file_path, 'rb') as f:
        obj = pickle.load(f)
    return obj


def str_to_int(x):
    return x if pd.isnull(x) else str(int(x))

## Data Loading
The Kaggle dataset has been saved in the local directory `~/Data/ieee-fraud-detection` in advance.

In [4]:
DATA_DIR = '../../Data/ieee-fraud-detection'
RANDOM_STATE = 42

In [5]:
train_identity = pd.read_csv(os.path.join(DATA_DIR, 'train_identity.csv'))
train_transaction = pd.read_csv(os.path.join(DATA_DIR, 'train_transaction.csv')) 
test_identity = pd.read_csv(os.path.join(DATA_DIR, 'test_identity.csv')) 
test_transaction = pd.read_csv(os.path.join(DATA_DIR, 'test_transaction.csv'))

df_train = pd.merge(train_transaction, train_identity, on='TransactionID', how='left')
df_test = pd.merge(test_transaction, test_identity, on='TransactionID', how='left')
df_test = df_test.rename(columns={'id-{:02d}'.format(i): 'id_{:02d}'.format(i) for i in range(1, 39)})

In [6]:
del train_identity, train_transaction, test_identity, test_transaction
_ = gc.collect()

In [7]:
print(f'Train dataset has {df_train.shape[0]} rows and {df_train.shape[1]} columns.')
print(f'Test dataset has {df_test.shape[0]} rows and {df_test.shape[1]} columns.')

Train dataset has 590540 rows and 434 columns.
Test dataset has 506691 rows and 433 columns.


In [8]:
print('The fraud rate is {:.2%}.'.format(df_train['isFraud'].mean()))

The fraud rate is 3.50%.


## EDA
Before preprocessing the data for modeling, I briefly explored the proportions of missing values, the cardinalities of the categorical features, the distributions of the numerical features, and the correlation matrix. To keep the computation light, I randomly sampled 100 features and plotted the proportion of missing values in each. Most of the features have a fairly high percentage of missing values.

In [9]:
prop_of_missing_values = (df_train[df_train.columns.difference(['TransactionID', 'isFraud'])].isnull().sum() / df_train.shape[0]).reset_index()
prop_of_missing_values.columns = ['feature', 'prop_of_missing_values']
source = prop_of_missing_values.sample(100, random_state=RANDOM_STATE)

highlight = alt.selection(type='single', on='mouseover', fields=['feature'], nearest=True)
bar = alt.Chart(source).mark_bar().encode(
    x=alt.X('feature:N', axis=alt.Axis(title='Feature'), sort='-y'),
    y=alt.Y('prop_of_missing_values:Q', axis=alt.Axis(title='Percentage', format='.0%')),
    color=alt.Color('prop_of_missing_values:Q', legend=None),
    opacity=alt.condition(~highlight, alt.value(1.0), alt.value(0.5)),
    tooltip=['feature:N', alt.Tooltip('prop_of_missing_values:Q', format='.2%')]
).add_selection(highlight)
bar.properties(title='Proportions of Missing Values', width=1200, height=200).configure_axisX(labelAngle=-45, labelFontSize=8)

A list and description of categorical features can be found on [this Kaggle page](https://www.kaggle.com/c/ieee-fraud-detection/data).

In [10]:
cat_features = pd.Index(['ProductCD', 'addr1', 'addr2', 'P_emaildomain', 'R_emaildomain', 'DeviceType', 'DeviceInfo'] + \
[f'card{i}' for i in range(1, 7)] + [f'M{i}' for i in range(1, 10)] + [f'id_{i}' for i in range(12, 39)])
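# Use Index.union: recent pandas versions no longer treat `|` as a set union on Index objects
num_features = df_train.columns.difference(pd.Index(['TransactionID', 'TransactionDT', 'isFraud']).union(cat_features))
all_features = cat_features.union(num_features)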

In [11]:
print(f'There are {len(cat_features)} categorical features and {len(num_features)} numeric features.')

There are 49 categorical features and 382 numeric features.


In [12]:
int_cat_features =  df_train[cat_features].select_dtypes('number').columns
df_train[int_cat_features] = df_train[int_cat_features].applymap(str_to_int)

int_cat_features =  df_test[cat_features].select_dtypes('number').columns
df_test[int_cat_features] = df_test[int_cat_features].applymap(str_to_int)

Some categorical features have several hundred categories, and some even exceed 10,000.

In [13]:
source = df_train[cat_features].nunique().reset_index()
source.columns = ['feature', 'cardinality']

bar = alt.Chart(source).mark_bar().encode(
    x=alt.X('feature:N', axis=alt.Axis(title='Feature'), sort='-y'),
    y=alt.Y('cardinality:Q', axis=alt.Axis(title='Count')),
    tooltip=['feature:N', 'cardinality:Q']
)
bar.properties(title='Cardinalities of Categorical Features', width=1000, height=200).configure_axisX(labelAngle=-45)

In [14]:
def plot_histogram(values, index, bins=20, bar_size=15, height=200, n_digits=0):
    hist, bin_edges = np.histogram(values, bins=bins)
    source = pd.DataFrame(hist, index=bin_edges[:-1], columns=['count']).reset_index().rename(columns={index: 'count', 'index': index})
    stats = pd.DataFrame({'mean': [round(values.mean(), 2)], 'median': [round(np.quantile(values, 0.5), 2)]})
    eps = 0.02
    
    bar = alt.Chart(source).transform_joinaggregate(
        total_count='sum(count)'
    ).transform_calculate(
        percent_of_total='datum.count / datum.total_count'
    ).mark_bar(size=bar_size).encode(
        x=alt.X(f'{index}:Q', axis=alt.Axis(title=None, format=f'.{n_digits}f'), scale=alt.Scale(domain=[bin_edges[0] - eps, bin_edges[-1] + eps], nice=False)),
        y=alt.Y('percent_of_total:Q', axis=alt.Axis(title=None, format='.0%')),
        tooltip=[index, alt.Tooltip('count:Q', format='.0f')]
    )  
    rule1 = alt.Chart(stats).mark_rule(color='#ff7f0e', size=1.5, strokeDash=[3, 2]).encode(
        x='mean:Q', 
        tooltip=['mean', 'mean:Q']
    )
    rule2 = alt.Chart(stats).mark_rule(color='#2ca02c', size=1.5, strokeDash=[3, 2]).encode(
        x='median:Q',
        tooltip=['median', 'median:Q']
    )
    return (bar + rule1 + rule2).properties(title=f"Histogram of '{index}'", height=height)

To examine the distributions of the numerical features, I randomly selected 20 of them with less than 50% missing values and at least 100 distinct values. The histograms show that most of these features have a long tail.

In [15]:
np.random.seed(RANDOM_STATE)
missing_value_ratios = df_train[num_features].isnull().sum() / df_train.shape[0]
value_counts = df_train[num_features].nunique()
selected_features = num_features[(missing_value_ratios < 0.5) & (value_counts > 99)]
selected_features = np.random.permutation(selected_features)[:20]

In [16]:
charts = []
for feature in selected_features:
    charts.append(plot_histogram(df_train[feature].dropna(), feature, bins=20, bar_size=12.5, height=100, n_digits=0))
    
rows = []
for i, chart in enumerate(charts):
    if (i % 3 == 2) or (i == len(charts) - 1):
        rows.append(alt.HConcatChart(hconcat=charts[i - (i % 3):i + 1]))
alt.VConcatChart(vconcat=rows[:3]).configure_axisY(labelAlign='left', labelLimit=30, labelPadding=30).configure_axisX(labelAngle=-45)

Finally, I computed the correlation matrix of the selected features. Most of the features are uncorrelated, but some pairs have very high positive correlations.

In [17]:
corr_matrix = df_train[selected_features].corr()
source = corr_matrix.stack().reset_index()
source.columns = ['feature_x', 'feature_y', 'correlation']

base = alt.Chart(source).encode(
    x=alt.X('feature_x:N', axis=alt.Axis(ticks=False, title='Feature')),
    y=alt.Y('feature_y:N', axis=alt.Axis(ticks=False, title='Feature'))
)
text = base.mark_text(size=10).encode(
    text=alt.Text('correlation', format='.0%'),
    color=alt.condition(alt.datum.correlation > 0.5, alt.value('white'), alt.value('black'))
)
heatmap = base.mark_rect().encode(color=alt.Color('correlation:Q', legend=alt.Legend(title='Correlation', titleFontSize=9)))
(heatmap + text).properties(title='Correlation Matrix', width=500, height=500).configure_axisX(labelAngle=-45)

## Data Splitting & Preprocessing
Unsupervised learning normally offers no way to evaluate predictive performance, but because this dataset has labels, I held out 20% of the training data as a stratified validation set. Categorical features were ordinal-encoded and then imputed; numeric features were standardized and then imputed with the median.
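
To make the order of operations concrete, here is a minimal NumPy sketch of the two transformations on toy data. It is not the scikit-learn pipeline used in this notebook; the helper names `scale_then_impute` and `ordinal_encode` are hypothetical, and the sketch assumes that standardization ignores missing values (as `StandardScaler` does when fitting) and that unseen categories are mapped to -1.

```python
import numpy as np


def scale_then_impute(column):
    """Standardize a 1-D array while ignoring NaNs, then fill NaNs with the median."""
    column = np.asarray(column, dtype=float)
    mean = np.nanmean(column)
    std = np.nanstd(column)  # population standard deviation (ddof=0)
    if std == 0:
        std = 1.0  # avoid division by zero for constant columns
    scaled = (column - mean) / std  # missing values are still NaN here
    median = np.nanmedian(scaled)
    return np.where(np.isnan(scaled), median, scaled)


def ordinal_encode(values, categories, missing_code=-1):
    """Map each value to its position in `categories`; unseen values get `missing_code`."""
    lookup = {category: code for code, category in enumerate(categories)}
    return [lookup.get(value, missing_code) for value in values]


scaled_column = scale_then_impute([1.0, 2.0, np.nan, 3.0])
encoded_column = ordinal_encode(['a', 'z', 'b'], categories=['a', 'b'])
```

Because the median is taken after scaling, the imputed value sits on the same scale as the observed values of that column.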

In [18]:
df_train[cat_features] = df_train[cat_features].astype('str')
df_test[cat_features] = df_test[cat_features].astype('str')

df_X_train, df_X_valid, df_y_train, df_y_valid = train_test_split(
    df_train[all_features], df_train['isFraud'], test_size=0.2, random_state=RANDOM_STATE, stratify=df_train['isFraud'])

In [19]:
n_jobs = int(0.75 * multiprocessing.cpu_count())
cat_pipeline = make_pipeline(OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan), SimpleImputer(strategy='constant', fill_value=-1))
num_pipeline = make_pipeline(StandardScaler(), SimpleImputer(strategy='median'))
transformer = make_column_transformer((cat_pipeline, cat_features), (num_pipeline, num_features))

X_train = transformer.fit_transform(df_X_train)
X_valid = transformer.transform(df_X_valid)

To visualize the transformed validation set, I reduced its first 5,000 rows to two dimensions with **t-SNE**. The manifold looks like a twisted band, and the fraudulent transactions tend to lie outside the clusters, which suggests that even unsupervised learning can separate them reasonably well.

In [20]:
tsne = TSNE(n_components=2, perplexity=50.0, early_exaggeration=12.0, learning_rate=200.0, random_state=RANDOM_STATE, n_jobs=n_jobs)
manifold = tsne.fit_transform(X_valid[:5000])

In [21]:
source = pd.DataFrame(np.c_[manifold, df_y_valid.iloc[:5000].values], columns=['feature_x', 'feature_y', 'is_fraud'])
source['is_fraud'] = source['is_fraud'].map(lambda x: str(int(x)))

brush = alt.selection(type='interval')
base = alt.Chart(source).add_selection(brush)
points = base.mark_point(size=5).encode(
    x=alt.X('feature_x', title=None),
    y=alt.Y('feature_y', title=None),
    color=alt.condition(brush, alt.Color('is_fraud:N', legend=alt.Legend(title='Fraudulent', titleFontSize=9)), alt.value('grey'))
)
tick_axis = alt.Axis(labels=False, domain=False, ticks=False)
x_ticks = base.mark_tick().encode(
    alt.X('feature_x', title='Feature', axis=tick_axis),
    alt.Y('is_fraud', title=None, axis=tick_axis),
    color=alt.condition(brush, 'is_fraud', alt.value('lightgrey'))
)
y_ticks = base.mark_tick().encode(
    alt.X('is_fraud', title=None, axis=tick_axis),
    alt.Y('feature_y', title='Feature', axis=tick_axis),
    color=alt.condition(brush, 'is_fraud', alt.value('lightgrey'))
)
(y_ticks | (points & x_ticks)).properties(title='Scatter Plot of Manifold with t-SNE').configure_title(anchor='middle')

## Modeling with Isolation Forest, Random Cut Forest and COPOD
I used two popular tree ensemble models, **Isolation Forest** and **Random Cut Forest**, and a more recent algorithm, **Copula-Based Outlier Detection** (COPOD). For detailed descriptions of the algorithms, see the following references.
* [Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. “Isolation forest.” Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on.](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest)
* [Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. “Isolation-based anomaly detection.” ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/tkdd11.pdf)
* [Sudipto Guha, Nina Mishra, Gourav Roy, and Okke Schrijvers. "Robust random cut forest based anomaly detection on streams." In International Conference on Machine Learning, pp. 2712-2721. 2016. ](http://proceedings.mlr.press/v48/guha16.pdf)
* [Li, Z., Zhao, Y., Botta, N., Ionescu, C. and Hu, X. COPOD: Copula-Based Outlier Detection. IEEE International Conference on Data Mining (ICDM), 2020.](https://arxiv.org/pdf/2009.09463.pdf)
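
As a rough intuition for the tree-based detectors, the sketch below isolates a point with random axis-aligned splits and counts how many splits it takes. It is a toy illustration of the isolation idea only, not the PyOD or SageMaker implementation: the functions `path_length` and `mean_path_length` are hypothetical, the branch is grown lazily for one query point instead of storing trees, and the split dimension is chosen uniformly at random.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=10):
    """Number of random splits needed before `point` shares its region with at most one row."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth  # no split is possible along this dimension
    split = rng.uniform(lo, hi)
    # Keep only the rows that fall on the same side of the split as `point`.
    if point[dim] < split:
        side = [row for row in data if row[dim] < split]
    else:
        side = [row for row in data if row[dim] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, sample_size=64, seed=0):
    """Average isolation depth of `point` over many random subsamples of `data`."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += path_length(point, sample, rng)
    return total / n_trees


data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(500)]
inlier, outlier = (0.0, 0.0), (8.0, 8.0)
data = cluster + [outlier]

inlier_depth = mean_path_length(inlier, data)
outlier_depth = mean_path_length(outlier, data)
```

A point far from the bulk of the data is separated after only a few splits, so a short average path corresponds to a high anomaly score.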

### PyOD Isolation Forest and COPOD
Isolation Forest was fitted in parallel on 12 cores, whereas COPOD was fitted with a single thread.

In [22]:
%%time
if_detector = IForest(n_estimators=100, behaviour='new', n_jobs=n_jobs, random_state=RANDOM_STATE) 
_ = if_detector.fit(X_train)
if_scores = if_detector.predict_proba(X_valid)

CPU times: user 13min 29s, sys: 7min 39s, total: 21min 9s
Wall time: 5min 23s


In [23]:
%%time
cop_detector = COPOD() 
_ = cop_detector.fit(X_train)
cop_scores = cop_detector.predict_proba(X_valid)

CPU times: user 4min 29s, sys: 50.4 s, total: 5min 19s
Wall time: 4min 48s
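
For intuition about what a copula-based detector measures, the sketch below scores each row by how far it sits in the empirical tails of every feature. It is a simplified illustration, not PyOD's `COPOD`: the functions `ecdf` and `tail_scores` are hypothetical, and the skewness correction described in the COPOD paper is left out.

```python
import numpy as np


def ecdf(column):
    """Empirical CDF at each entry: the fraction of entries less than or equal to it."""
    sorted_column = np.sort(column)
    return np.searchsorted(sorted_column, column, side='right') / len(column)


def tail_scores(X):
    """Larger of the summed negative log left-tail and right-tail probabilities per row."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    left = np.column_stack([-np.log(ecdf(X[:, j])) for j in range(n_features)])
    right = np.column_stack([-np.log(ecdf(-X[:, j])) for j in range(n_features)])
    return np.maximum(left.sum(axis=1), right.sum(axis=1))


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
X_toy[0] = [6.0, 6.0, 6.0]  # an obvious outlier in every dimension
toy_scores = tail_scores(X_toy)
most_anomalous = int(toy_scores.argmax())
```

The score needs only sorting and logarithms per feature, with no trees and no distance computations.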


### Amazon SageMaker Random Cut Forest
Random Cut Forest was trained on an Amazon SageMaker `ml.m4.xlarge` instance. If you are not going to use Amazon SageMaker, skip this section.

In [24]:
import boto3
import botocore
import sagemaker
from sagemaker import RandomCutForest

In [25]:
def check_bucket_permission(bucket):
    permission = False
    try:
        boto3.Session().client('s3').head_bucket(Bucket=bucket)
    except botocore.exceptions.ParamValidationError as e:
        print(
            'Hey! You either forgot to specify your S3 bucket'
            ' or you gave your bucket an invalid name!'
        )
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == '403':
            print(f"Hey! You don't have permission to access the bucket, {bucket}.")
        elif e.response['Error']['Code'] == '404':
            print(f"Hey! Your bucket, {bucket}, doesn't exist!")
        else:
            raise
    else:
        permission = True
    return permission

In [26]:
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'ieee-fraud-detection'
execution_role = sagemaker.get_execution_role()
region = boto3.Session().region_name

if check_bucket_permission(bucket):
    print('Input/output will be stored in: s3://{}_.../{}'.format('_'.join(bucket.split('-')[:-1]), prefix))

Input/output will be stored in: s3://sagemaker_us_east_1_.../ieee-fraud-detection


In [27]:
%%time
rcf_detector = RandomCutForest(
    role=execution_role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    data_location=f's3://{bucket}/{prefix}/data/train/',
    output_path=f's3://{bucket}/{prefix}/model',
    num_samples_per_tree=512,
    num_trees=100
)
_ = rcf_detector.fit(rcf_detector.record_set(X_train))

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


2021-04-26 04:22:02 Starting - Starting the training job...
2021-04-26 04:22:26 Starting - Launching requested ML instancesProfilerReport-1619410920: InProgress
......
2021-04-26 04:23:35 Starting - Preparing the instances for training.........
2021-04-26 04:25:21 Downloading - Downloading input data......
2021-04-26 04:26:27 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[04/26/2021 04:26:43 INFO 139724981012288] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-conf.json: {'num_samples_per_tree': 256, 'num_trees': 100, 'force_dense': 'true', 'eval_metrics': ['accuracy', 'precision_recall_fscore'], 'epochs': 1, 'mini_batch_size': 1000, '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': '', '_ftp_port': 8999}
[04/26/2021 04:26:43 INFO 139

In [28]:
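# Note: fmt='%i' writes every value as an integer, so the standardized numeric features are truncated
np.savetxt(os.path.join(DATA_DIR, 'X_valid.csv'), X_valid, delimiter=',', fmt='%i')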
s3_uri = sess.upload_data(os.path.join(DATA_DIR, 'X_valid.csv'), bucket=bucket, key_prefix=f'{prefix}/data/valid')

In [29]:
%%time
rcf_transformer = rcf_detector.transformer(
    instance_count=1, 
    instance_type='ml.m4.xlarge', 
    output_path=f's3://{bucket}/{prefix}/prediction'
)
_ = rcf_transformer.transform(
    data=s3_uri, 
    content_type='text/csv', 
    split_type='Line'
)
rcf_transformer.wait()

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


....................................
Docker entrypoint called with argument(s): serve
Running default environment configuration script
[04/26/2021 04:36:35 INFO 140210463557440] loaded entry point class algorithm.serve.server_config:config_api
[04/26/2021 04:36:35 INFO 140210463557440] loading entry points
[04/26/2021 04:36:35 INFO 140210463557440] Loaded iterator creator application/x-recordio-protobuf for content type ('application/x-recordio-protobuf', '1.0')
[04/26/2021 04:36:35 INFO 140210463557440] loaded request iterator application/json
[04/26/2021 04:36:35 INFO 140210463557440] loaded request iterator application/jsonlines
[04/26/2021 04:36:35 INFO 140210463557440] loaded request iterator application/x-recordio-protobuf
[04/26/2021 04:36:35 INFO 140210463557440] loaded request iterator text/csv
[04/26/2021 04:36:35 INFO 140210463557440] loaded response encoder application/json
[04/26/

In [30]:
import ast

boto3.resource('s3').meta.client.download_file(bucket, f'{prefix}/prediction/X_valid.csv.out', os.path.join(DATA_DIR, 'X_valid.csv.out'))
# Parse each output line with ast.literal_eval, which only accepts literals, instead of eval
rcf_scores = pd.read_csv(os.path.join(DATA_DIR, 'X_valid.csv.out'), header=None)[0].map(lambda x: ast.literal_eval(x)['score']).values

In [43]:
dump_pickle('rcf_scores.pkl', rcf_scores)
dump_pickle('if_scores.pkl', if_scores)
dump_pickle('cop_scores.pkl', cop_scores)

## Model Evaluation
As expected, the anomaly scores produced by the models are right-skewed with long tails, resembling log-normal distributions.
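
The Isolation Forest and COPOD scores used here are the second column of `predict_proba`. To the best of my understanding, PyOD's default conversion (`method='linear'`) min-max scales the raw decision scores with the range seen during training and clips the result to [0, 1]. The sketch below reproduces that idea under this assumption; `linear_outlier_probability` is a hypothetical helper, not a PyOD function.

```python
import numpy as np


def linear_outlier_probability(train_scores, test_scores):
    """Min-max scale test scores by the training score range and clip to [0, 1]."""
    train_scores = np.asarray(train_scores, dtype=float)
    test_scores = np.asarray(test_scores, dtype=float)
    lo, hi = train_scores.min(), train_scores.max()
    span = hi - lo if hi > lo else 1.0  # guard against a constant training score
    prob_outlier = np.clip((test_scores - lo) / span, 0.0, 1.0)
    # Column 0: probability of being normal; column 1: probability of being an outlier.
    return np.column_stack([1.0 - prob_outlier, prob_outlier])


toy_probs = linear_outlier_probability([0.0, 2.0, 4.0], [-1.0, 1.0, 3.0, 9.0])
```

Apart from the ties introduced by clipping, this scaling preserves the ordering of the scores, so it should have little effect on rank-based metrics such as AUROC.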

In [31]:
charts = []
for model, score in zip(['Isolation Forest', 'COPOD', 'Random Cut Forest'], [if_scores[:, 1], cop_scores[:, 1], rcf_scores]):
    charts.append(plot_histogram(score, model, bins=50, bar_size=15, height=150, n_digits=2))
    
alt.VConcatChart(vconcat=charts).configure_axisX(labelAngle=-45)

COPOD scores highest on both AUROC and AUPRC, followed by Isolation Forest and then Random Cut Forest.
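
To clarify what the two metrics measure, here is a small pure-Python sketch. AUROC is the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one, and average precision (used here as AUPRC) averages the precision at the rank of each fraudulent transaction. The functions `auroc` and `average_precision` are illustrative stand-ins for `roc_auc_score` and `average_precision_score`. They are meant for toy inputs only, and `average_precision` assumes there are no tied scores.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


toy_labels = [0, 0, 1, 0, 1, 0, 0, 1]
toy_scores = [0.1, 0.2, 0.9, 0.3, 0.4, 0.5, 0.05, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
```

A detector that scores at random has an expected AUROC of 50% and an expected average precision close to the positive rate, which is 3.50% in this dataset, so the AUPRC values should be read against that baseline.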

In [32]:
roc_curves = [x for scores in [if_scores[:, 1], cop_scores[:, 1], rcf_scores] for x in roc_curve(df_y_valid, scores)]
aurocs = [roc_auc_score(df_y_valid, scores) for scores in [if_scores[:, 1], cop_scores[:, 1], rcf_scores]]

pr_curves = [x for scores in [if_scores[:, 1], cop_scores[:, 1], rcf_scores] for x in precision_recall_curve(df_y_valid, scores)]
auprcs = [average_precision_score(df_y_valid, scores) for scores in [if_scores[:, 1], cop_scores[:, 1], rcf_scores]]

In [33]:
x = np.linspace(0, 1, int(5000 / 3))
source = np.c_[x, interp1d(roc_curves[0], roc_curves[1])(x), interp1d(roc_curves[3], roc_curves[4])(x), interp1d(roc_curves[6], roc_curves[7])(x)]
columns = ['IF (AUROC:{:0.2%})'.format(aurocs[0]), 'COPOD (AUROC:{:0.2%})'.format(aurocs[1]), 'RCF (AUROC:{:0.2%})'.format(aurocs[2])]
source = pd.DataFrame(source, columns=['x'] + columns)
source = pd.melt(source, id_vars=['x'], value_vars=columns)

highlight = alt.selection(type='single', on='mouseover', fields=['variable'], nearest=True)
base = alt.Chart(source).encode(
    x=alt.X('x:Q', title='False Positive Rate'),
    y=alt.Y('value:Q', title='True Positive Rate'),
    color=alt.Color('variable:N', legend=alt.Legend(title='Detector', orient='bottom-right'))
)
points = base.mark_circle().encode(
    opacity=alt.value(0)).add_selection(
    highlight
).properties()
line = base.mark_line().encode(
    size=alt.condition(~highlight, alt.value(1.5), alt.value(3))
)
(points + line).properties(title='Receiver Operating Characteristic Curves')

In [34]:
source = np.c_[x, interp1d(pr_curves[1], pr_curves[0])(x), interp1d(pr_curves[4], pr_curves[3])(x), interp1d(pr_curves[7], pr_curves[6])(x)]
columns = ['IF (AUPRC:{:0.2%})'.format(auprcs[0]), 'COPOD (AUPRC:{:0.2%})'.format(auprcs[1]), 'RCF (AUPRC:{:0.2%})'.format(auprcs[2])]
source = pd.DataFrame(source, columns=['x'] + columns)
source = pd.melt(source, id_vars=['x'], value_vars=columns)

highlight = alt.selection(type='single', on='mouseover', fields=['variable'], nearest=True)
base = alt.Chart(source).encode(
    x=alt.X('x:Q', title='Recall'),
    y=alt.Y('value:Q', title='Precision', scale=alt.Scale(domain=[0.0, 1.0])),
    color=alt.Color('variable:N', legend=alt.Legend(title='Detector', orient='top-right'))
)
points = base.mark_circle().encode(
    opacity=alt.value(0)).add_selection(
    highlight
).properties()
lines = base.mark_line().encode(
    size=alt.condition(~highlight, alt.value(1.5), alt.value(3))
)
(points + lines).properties(title='Precision - Recall Curves')

The results are summarized in a table as follows.

|Detector|AUROC|AUPRC|Training Time|Hardware|
|:------:|:---:|:---:|:---:|:---:|
|COPOD|77.55%|15.33%|4min 48s|MacBook Pro (16 CPUs, 32 GB memory)|
|Isolation Forest|76.33%|13.74%|5min 23s|MacBook Pro (16 CPUs, 32 GB memory)|
|Random Cut Forest|68.84%|7.96%|3min 10s|ml.m4.xlarge (4 vCPUs, 16 GiB memory)|

## Submission
COPOD achieved the best AUROC, which is the competition metric, so I refitted it on the entire training set and submitted its predictions. The final result is an AUROC of 82.03%. That is far below the 94.59% of the first-place entry on the private leaderboard, but it shows the potential of unsupervised learning.

In [35]:
X_train = transformer.fit_transform(df_train[all_features])
X_test = transformer.transform(df_test[all_features])

In [36]:
%%time
_ = cop_detector.fit(X_train)
cop_scores = cop_detector.predict_proba(X_test)

CPU times: user 7min 7s, sys: 2min 20s, total: 9min 27s
Wall time: 8min 15s


In [37]:
submission = pd.DataFrame({'TransactionID': df_test['TransactionID'].values, 'isFraud': cop_scores[:, 1]})
submission.to_csv('./submission.csv', index=False)