<a href="https://colab.research.google.com/github/akshatamadavi/data_mining/blob/main/autogluon/ieee_fraud_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##IEEE-CIS Fraud Detection
Can you detect fraud from customer transactions?

Description
Imagine standing at the check-out counter at the grocery store with a long line behind you and the cashier not-so-quietly announces that your card has been declined. In this moment, you probably aren’t thinking about the data science that determined your fate.

Embarrassed, and certain you have the funds to cover everything needed for an epic nacho party for 50 of your closest friends, you try your card again. Same result. As you step aside and allow the cashier to tend to the next customer, you receive a text message from your bank. “Press 1 if you really tried to spend $500 on cheddar cheese.”

While perhaps cumbersome (and often embarrassing) in the moment, this fraud prevention system is actually saving consumers millions of dollars per year. Researchers from the IEEE Computational Intelligence Society (IEEE-CIS) want to improve this figure, while also improving the customer experience. With higher accuracy fraud detection, you can get on with your chips without the hassle.

IEEE-CIS works across a variety of AI and machine learning areas, including deep neural networks, fuzzy systems, evolutionary computation, and swarm intelligence. Today they’re partnering with the world’s leading payment service company, Vesta Corporation, seeking the best solutions for the fraud prevention industry, and now you are invited to join the challenge.

In this competition, you’ll benchmark machine learning models on a challenging large-scale dataset. The data comes from Vesta's real-world e-commerce transactions and contains a wide range of features from device type to product features. You also have the opportunity to create new features to improve your results.

If successful, you’ll improve the efficacy of fraudulent transaction alerts for millions of people around the world, helping hundreds of thousands of businesses reduce their fraud loss and increase their revenue. And of course, you will save party people just like you the hassle of false positives.

Acknowledgements:



Vesta Corporation provided the dataset for this competition. Vesta Corporation is the forerunner in guaranteed e-commerce payment solutions. Founded in 1995, Vesta pioneered the process of fully guaranteed card-not-present (CNP) payment transactions for the telecommunications industry. Since then, Vesta has firmly expanded data science and machine learning capabilities across the globe and solidified its position as the leader in guaranteed ecommerce payments. Today, Vesta guarantees more than $18B in transactions annually.


##This tutorial will teach you how to use AutoGluon to become a serious Kaggle competitor without writing lots of code. We first outline the general steps to use AutoGluon in Kaggle contests. Here, we assume the competition involves tabular data stored in one or more CSV files.

##1. Run the Bash command: pip install kaggle

In [None]:
!pip -q install -U kaggle

##2. Navigate to https://www.kaggle.com/account and create an account (if necessary). Then click “Create New API Token” and move the downloaded file to this location on your machine: ~/.kaggle/kaggle.json. For troubleshooting, see the Kaggle API instructions.

In [None]:
from google.colab import files
files.upload()  # choose kaggle.json you downloaded from Kaggle > Account > Create New API Token

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json


Saving kaggle.json to kaggle.json
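
Before moving on, it can help to confirm that the token file is well-formed and not readable by other users. The following is a minimal sketch using only the standard library. The helper name `check_kaggle_token` is hypothetical, and it assumes the token is a JSON object with `username` and `key` fields, as in the file downloaded via “Create New API Token”.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path):
    """Return the username stored in a kaggle.json file after basic checks.

    Hypothetical helper: assumes the token is a JSON object with
    'username' and 'key' fields.
    """
    path = Path(path).expanduser()
    with open(path) as f:
        creds = json.load(f)
    # Both fields must be present and non-empty.
    missing = [k for k in ("username", "key") if not creds.get(k)]
    if missing:
        raise ValueError(f"{path} is missing fields: {missing}")
    # Warn if group/others have any access (chmod 600 clears these bits).
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        print(f"Warning: {path} has permissions {oct(mode)}; consider chmod 600")
    return creds["username"]
```

For example, `check_kaggle_token("~/.kaggle/kaggle.json")` would return the account name if the file passes the checks.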


##3. To download data programmatically, execute this Bash command in your terminal:

kaggle competitions download -c [COMPETITION]

Here, [COMPETITION] should be replaced by the name of the competition you wish to enter. Alternatively, you can download the data manually: navigate to the website of the Kaggle competition you wish to enter, click “Download All”, and accept the competition’s terms.
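
If you prefer to drive the download from Python rather than a shell, one option is to wrap the same CLI call with `subprocess`. This is only a sketch: the function name `download_competition` and its `dry_run` parameter are hypothetical, and it assumes the `kaggle` executable is on your PATH and that your API token is already in place.

```python
import shutil
import subprocess


def download_competition(competition, dest=".", dry_run=False):
    """Download a competition's data with the Kaggle CLI.

    Hypothetical wrapper; returns the argument list it ran (or would run).
    """
    cmd = ["kaggle", "competitions", "download", "-c", competition, "-p", str(dest)]
    if dry_run:
        return cmd  # only show the command, do not execute it
    if shutil.which("kaggle") is None:
        raise RuntimeError("kaggle CLI not found; install it with `pip install kaggle`")
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


print(download_competition("ieee-fraud-detection", "/content", dry_run=True))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems if the destination path contains spaces.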

In [None]:
import json, os
p = os.path.expanduser('~/.kaggle/kaggle.json')
with open(p) as f:  # close the file once the credentials are read
    creds = json.load(f)
print("Using Kaggle account:", creds["username"])


Using Kaggle account: akshatamadavi


In [None]:
!kaggle competitions files -c ieee-fraud-detection


name                         size  creationDate                
---------------------  ----------  --------------------------  
sample_submission.csv     6080314  2019-07-15 00:19:01.536000  
test_identity.csv        25797161  2019-07-15 00:19:01.536000  
test_transaction.csv    613194934  2019-07-15 00:19:01.536000  
train_identity.csv       26529680  2019-07-15 00:19:01.536000  
train_transaction.csv   683351067  2019-07-15 00:19:01.536000  


In [None]:
!kaggle competitions download -c ieee-fraud-detection -p /content -w
# (The competition slug is all lowercase: ieee-fraud-detection)


Downloading ieee-fraud-detection.zip to .
  0% 0.00/118M [00:00<?, ?B/s]
100% 118M/118M [00:00<00:00, 3.23GB/s]


##4. If the competition’s training data consists of multiple CSV files, use pandas to merge/join them into a single data table where rows = training examples, columns = features.

In [None]:
import zipfile
import os

zip_file_path = '/content/ieee-fraud-detection.zip'
destination_path = '/content/'

if os.path.exists(zip_file_path):
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(destination_path)
    print(f"Extracted {zip_file_path} to {destination_path}")
else:
    print(f"Error: {zip_file_path} not found. Please ensure the competition data is downloaded.")

Extracted /content/ieee-fraud-detection.zip to /content/
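
To see which CSV files an archive contains (and how large they are once uncompressed) without extracting anything, the `zipfile` module can list the members directly. A small sketch, with a hypothetical helper name:

```python
import zipfile


def list_csv_members(zip_path):
    """Return (name, uncompressed_size_in_bytes) for each CSV in the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return [
            (info.filename, info.file_size)
            for info in zf.infolist()
            if info.filename.lower().endswith(".csv")
        ]
```

Calling `list_csv_members('/content/ieee-fraud-detection.zip')` before extraction is a cheap way to check that the download completed and that there is enough disk space for the extracted files.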


4(a): We first load the competition’s training data into Python:

In [None]:
!pip -q install autogluon


In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
from autogluon.tabular import TabularPredictor

directory = Path("/content")  # directory where you have downloaded the data CSV files from the competition
label = 'isFraud'  # name of target variable to predict in this competition
eval_metric = 'roc_auc'  # Optional: specify that competition evaluation metric is AUC
save_path = directory/'AutoGluonModels/'  # where to store trained models

train_identity = pd.read_csv(directory/'train_identity.csv')
train_transaction = pd.read_csv(directory/'train_transaction.csv')

4(b): Since the training data for this competition consists of multiple CSV files, we first join them into a single large table (with rows = examples, columns = features) before applying AutoGluon:

In [None]:
train_data = pd.merge(train_transaction, train_identity, on='TransactionID', how='left')
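
The effect of the left join is easier to see on a toy example. The sketch below uses made-up rows rather than competition data: every transaction is kept, and transactions with no matching identity record simply get missing values in the identity columns. The `validate="one_to_one"` argument is an optional safeguard added here for illustration; it makes pandas raise an error if `TransactionID` is duplicated in either table.

```python
import pandas as pd

# Toy stand-ins for the two training tables (made-up values).
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "TransactionAmt": [50.0, 20.0, 75.5],
})
identity = pd.DataFrame({
    "TransactionID": [1, 3],
    "DeviceType": ["mobile", "desktop"],
})

# Left join: keep all transactions, attach identity columns where available.
merged = pd.merge(transactions, identity, on="TransactionID",
                  how="left", validate="one_to_one")

# The row count should match the transaction table exactly.
print(len(merged) == len(transactions))
# Number of transactions that had no identity record.
print(merged["DeviceType"].isna().sum())
```

Checking the row count after the merge is a quick way to catch accidental row duplication before training.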

4(c): We specify the presets argument to maximize AutoGluon’s predictive accuracy, which usually requires running fit() with a longer time limit (here, time_limit=3600 seconds):

In [None]:
predictor = TabularPredictor(label=label, eval_metric=eval_metric, path=save_path, verbosity=3).fit(
    train_data, presets='best_quality', time_limit=3600
)

results = predictor.fit_summary()

Verbosity: 3 (Detailed Logging)
AutoGluon Version:  1.4.0
Python Version:     3.12.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Oct  2 10:42:05 UTC 2025
CPU Count:          8
GPU Count:          0
Memory Avail:       43.83 GB / 50.99 GB (86.0%)
Disk Space Avail:   178.86 GB / 225.83 GB (79.2%)
Presets specified: ['best_quality']
User Specified kwargs:
{'auto_stack': True, 'num_bag_sets': 1}
Full kwargs:
{'_experimental_dynamic_hyperparameters': False,
 '_feature_generator_kwargs': None,
 '_save_bag_folds': None,
 'ag_args': None,
 'ag_args_ensemble': None,
 'ag_args_fit': None,
 'auto_stack': True,
 'calibrate': 'auto',
 'delay_bag_sets': False,
 'ds_args': {'clean_up_fits': True,
             'detection_time_frac': 0.25,
             'enable_callbacks': False,
             'enable_ray_logging': True,
             'holdout_data': None,
             'holdout_frac': 0.1111111111111111,
             'memory_safe_fits': True,
             'n_folds'

(_dystack pid=3891) 	0.9514	 = Validation score   (roc_auc)
(_dystack pid=3891) 	483.16s	 = Training   runtime
(_dystack pid=3891) 	23.87s	 = Validation runtime
(_dystack pid=3891) 	2749.2	 = Inference  throughput (rows/s | 65616 batch size)
(_dystack pid=3891) Saving /content/AutoGluonModels/ds_sub_fit/sub_fit_ho/models/trainer.pkl
(_dystack pid=3891) Fitting model: LightGBM_BAG_L1 ... Training model for up to 80.90s of the 368.84s of remaining time.
(_dystack pid=3891) 	Fitting LightGBM_BAG_L1 with 'num_gpus': 0, 'num

(_dystack pid=3891) 	0.9566	 = Validation score   (roc_auc)
(_dystack pid=3891) 	237.09s	 = Training   runtime
(_dystack pid=3891) 	9.96s	 = Validation runtime
(_dystack pid=3891) 	1876.8	 = Inference  throughput (rows/s | 65616 batch size)
(_dystack pid=3891) Saving /content/AutoGluonModels/ds_sub_fit/sub_fit_ho/models/trainer.pkl
(_dystack pid=3891) Fitting model: LightGBM_BAG_L2 ... Training model for up to 30.13s of the 29.53s of remaining time.
(_dystack pid=3891) 	Fitting LightGBM_BAG_L2 with 'num_gpus': 0, 'num_c

*** Summary of fit() ***
Estimated performance of each model:
                 model  score_val eval_metric  pred_time_val     fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0  WeightedEnsemble_L3   0.970703     roc_auc     129.732271  2493.838254                0.098900          10.238527            3       True          6
1    LightGBMXT_BAG_L2   0.970024     roc_auc     127.582366  2364.513823               32.236678         707.670918            2       True          4
2      LightGBM_BAG_L2   0.969456     roc_auc      97.396693  1775.928809                2.051005         119.085905            2       True          5
3    LightGBMXT_BAG_L1   0.966753     roc_auc      84.326269  1418.641238               84.326269        1418.641238            1       True          1
4  WeightedEnsemble_L2   0.966753     roc_auc      84.429683  1424.069423                0.103414           5.428186            2       True          3
5      LightGBM_BAG_L1   0
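
The `score_val` column above reports ROC AUC on validation data. As a reminder of what that number measures, the sketch below computes AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as one half. It is a plain-Python illustration with a hypothetical function name, not the routine AutoGluon uses internally, and its quadratic loop is only practical for small inputs.

```python
def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
print(roc_auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because AUC depends only on how examples are ranked, it is not affected by the heavy class imbalance in the same way plain accuracy is, which is one reason it suits fraud detection.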

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

directory = Path("/content")
ID_COL = "TransactionID"
LABEL  = "isFraud"  # your training label

# --- load & merge test ---
test_identity    = pd.read_csv(directory/"test_identity.csv")
test_transaction = pd.read_csv(directory/"test_transaction.csv")
test_data = test_transaction.merge(test_identity, on=ID_COL, how="left")

# drop accidental index columns
test_data = test_data.loc[:, ~test_data.columns.astype(str).str.startswith("Unnamed")]

# don't pass target
if LABEL in test_data.columns:
    test_data = test_data.drop(columns=[LABEL])

# --- align to exactly the features used in training ---
feat = predictor.features()            # the columns AutoGluon expects
test_data = test_data.reindex(columns=feat)   # no fill_value; new cols will be NaN

# --- make dtypes NumPy-friendly; replace pd.NA with np.nan ---
test_data = test_data.replace({pd.NA: np.nan})

for col in test_data.columns:
    dt = test_data[col].dtype
    # pandas nullable integers (Int8..Int64, UInt8..UInt64) -> float64
    # (allows np.nan; float32 would round integers above 2**24)
    if str(dt) in {"Int64", "Int32", "Int16", "Int8", "UInt64", "UInt32", "UInt16", "UInt8"}:
        test_data[col] = test_data[col].astype("float64")
    # pandas nullable boolean -> float (0.0/1.0 + nan)
    elif str(dt) == "boolean":
        test_data[col] = test_data[col].astype("float32")
    # pandas string dtype -> plain object (np.nan-compatible)
    elif pd.api.types.is_string_dtype(dt):
        test_data[col] = test_data[col].astype("object")

# --- predict ---
y_predproba = predictor.predict_proba(test_data)

Loading: /content/AutoGluonModels/models/LightGBMXT_BAG_L1/model.pkl
Loading: /content/AutoGluonModels/models/LightGBM_BAG_L1/model.pkl
Loading: /content/AutoGluonModels/models/LightGBMXT_BAG_L2/model.pkl
Loading: /content/AutoGluonModels/models/LightGBM_BAG_L2/model.pkl
Loading: /content/AutoGluonModels/models/WeightedEnsemble_L3/model.pkl


In [None]:
y_predproba.head(5)

          0         1
0  0.998561  0.001439
1  0.998815  0.001185
2  0.997428  0.002572
3  0.998495  0.001505
4  0.998787  0.001213


For binary classification tasks, you can see which class AutoGluon’s predicted probabilities correspond to via:

In [None]:
predictor.positive_class

1

For multiclass classification tasks, you can see which classes AutoGluon’s predicted probabilities correspond to via:

In [None]:
predictor.class_labels  # classes in this list correspond to columns of predict_proba() output

[0, 1]
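
Since the columns of the predict_proba() output are labeled by class, the positive-class probability can also be picked out by column name. The sketch below illustrates this on a small hand-made DataFrame; `proba` and `positive_class` are stand-ins for the real `predict_proba()` output and `predictor.positive_class`, and the numbers are made up.

```python
import pandas as pd

# Stand-in for predictor.predict_proba(test_data): one column per class label.
proba = pd.DataFrame({0: [0.9, 0.2, 0.6], 1: [0.1, 0.8, 0.4]})
positive_class = 1  # stand-in for predictor.positive_class

# Selecting the positive-class column gives one probability per row.
p_fraud = proba[positive_class]

# Sanity check: the two class probabilities in each row should sum to 1.
assert ((proba.sum(axis=1) - 1.0).abs() < 1e-9).all()

print(p_fraud.tolist())
```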

Now, let’s get predicted probabilities for the entire test set, keeping only the positive-class probability by passing as_multiclass=False:

In [None]:
y_predproba = predictor.predict_proba(test_data, as_multiclass=False)

Loading: /content/AutoGluonModels/models/LightGBMXT_BAG_L1/model.pkl
Loading: /content/AutoGluonModels/models/LightGBM_BAG_L1/model.pkl
Loading: /content/AutoGluonModels/models/LightGBMXT_BAG_L2/model.pkl
Loading: /content/AutoGluonModels/models/LightGBM_BAG_L2/model.pkl
Loading: /content/AutoGluonModels/models/WeightedEnsemble_L3/model.pkl


Now that we have made a prediction for each row in the test dataset, we can submit these predictions to Kaggle. Most Kaggle competitions provide a sample submission file, in which you can simply overwrite the sample predictions with your own as we do below:

In [None]:
submission = pd.read_csv(directory/'sample_submission.csv')
submission['isFraud'] = y_predproba
submission.head()
submission.to_csv(directory/'archie_submission.csv', index=False)
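
Before uploading, it can be worth checking that the submission still matches the layout of the sample file. The helper below is a hypothetical sketch, not part of AutoGluon or the Kaggle API. It assumes the submission should have the same columns and row IDs as the sample and that predictions are probabilities in [0, 1] with no missing values.

```python
import pandas as pd


def check_submission(submission, sample, id_col="TransactionID", target="isFraud"):
    """Raise ValueError if `submission` does not match the sample file's layout."""
    if list(submission.columns) != list(sample.columns):
        raise ValueError("columns differ from the sample submission")
    if not submission[id_col].equals(sample[id_col]):
        raise ValueError("row IDs differ from the sample submission")
    if submission[target].isna().any():
        raise ValueError("predictions contain missing values")
    if not submission[target].between(0.0, 1.0).all():
        raise ValueError("predictions fall outside [0, 1]")
    return True


# Toy example with made-up IDs and predictions.
sample = pd.DataFrame({"TransactionID": [10, 11, 12], "isFraud": [0.5, 0.5, 0.5]})
mine = sample.copy()
mine["isFraud"] = [0.01, 0.90, 0.20]
print(check_submission(mine, sample))
```

Missing values are a useful thing to test for here, because assigning a Series to a DataFrame column aligns on the index and silently produces NaN wherever the indexes do not match.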

##To submit your predictions to Kaggle, you can run the following command in your terminal:

In [None]:
!kaggle competitions submit -c ieee-fraud-detection -f archie_submission.csv -m "my first submission"

401 Client Error: Unauthorized for url: https://www.kaggle.com/api/v1/competitions/submission-url
