##Steps to use AutoGluon to become a serious Kaggle competitor without writing lots of code. This example uses IEEE fraud detection data

*Run Bash command:*

In [1]:
!pip install kaggle




*Creating a kaggle folder*

In [2]:
!mkdir -p ~/.kaggle


*After creating a new API from kaggle and downloading the kaggle.json file, move downloaded file to this location on machine*

In [3]:
!cp kaggle.json ~/.kaggle/


*Granting read and write permissions to the file's owner and for maintaining the security and privacy of the credentials when using Kaggle’s API*

In [4]:
!chmod 600 ~/.kaggle/kaggle.json


*Downloads the IEEE fraud detection data*

In [5]:
!kaggle competitions download -c ieee-fraud-detection

Downloading ieee-fraud-detection.zip to /content
 92% 109M/118M [00:00<00:00, 107MB/s] 
100% 118M/118M [00:00<00:00, 126MB/s]


*Unzips the downloaded IEEE fraud detection folder*



In [6]:
!unzip ieee-fraud-detection.zip


Archive:  ieee-fraud-detection.zip
  inflating: sample_submission.csv   
  inflating: test_identity.csv       
  inflating: test_transaction.csv    
  inflating: train_identity.csv      
  inflating: train_transaction.csv   


*Install the autogluon.tabular module, which is specifically designed for tabular data, including tasks like classification, regression, and time-series forecasting. It also handles data preprocessing, model training, and hyperparameter tuning automatically.*

In [7]:
!pip install autogluon.tabular

Collecting autogluon.tabular
  Downloading autogluon.tabular-1.1.1-py3-none-any.whl.metadata (13 kB)
Collecting scipy<1.13,>=1.5.4 (from autogluon.tabular)
  Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.4/60.4 kB[0m [31m1.1 MB/s[0m eta [36m0:00:00[0m
Collecting autogluon.core==1.1.1 (from autogluon.tabular)
  Downloading autogluon.core-1.1.1-py3-none-any.whl.metadata (11 kB)
Collecting autogluon.features==1.1.1 (from autogluon.tabular)
  Downloading autogluon.features-1.1.1-py3-none-any.whl.metadata (11 kB)
Collecting boto3<2,>=1.10 (from autogluon.core==1.1.1->autogluon.tabular)
  Downloading boto3-1.35.19-py3-none-any.whl.metadata (6.6 kB)
Collecting autogluon.common==1.1.1 (from autogluon.core==1.1.1->autogluon.tabular)
  Downloading autogluon.common-1.1.1-py3-none-any.whl.metadata (11 kB)
Collecting botocore<1.36.0,>=1.35.19 (from boto3<2,>=1.10->autogluon

In [8]:
import pandas as pd
import numpy as np
from autogluon.tabular import TabularPredictor
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

# Loading data
directory = '/content/'
label = 'isFraud'
eval_metric = 'roc_auc'
save_path = directory + 'AutoGluonModels/'

# Load data in chunks to reduce memory usage
chunk_size = 100000
train_identity = pd.read_csv(directory + 'train_identity.csv', chunksize=chunk_size)
train_transaction = pd.read_csv(directory + 'train_transaction.csv', chunksize=chunk_size)

# Process data in chunks
def process_chunk(identity_chunk, transaction_chunk):
    # Merge chunks
    merged_chunk = pd.merge(transaction_chunk, identity_chunk, on='TransactionID', how='left')

    # Impute missing values
    imputer = SimpleImputer(strategy='mean')
    imputed_chunk = imputer.fit_transform(merged_chunk.select_dtypes(include=[np.number]))

    # Apply PCA
    pca = PCA(n_components=0.95)
    pca_chunk = pca.fit_transform(imputed_chunk)

    # Convert back to DataFrame
    processed_chunk = pd.DataFrame(pca_chunk)
    processed_chunk['isFraud'] = merged_chunk['isFraud'].values

    return processed_chunk

# Process chunks and concatenate results
processed_chunks = []
for identity_chunk, transaction_chunk in zip(train_identity, train_transaction):
    processed_chunk = process_chunk(identity_chunk, transaction_chunk)
    processed_chunks.append(processed_chunk)

train_data = pd.concat(processed_chunks, ignore_index=True)

# Data sampling (10% of the data for experimentation)
train_data = train_data.sample(frac=0.1, random_state=42)

# AutoGluon settings for less resource-intensive training
predictor = TabularPredictor(label='isFraud', eval_metric=eval_metric, path=save_path, verbosity=3)
predictor.fit(
    train_data,
    presets='medium_quality',
    time_limit=1800,  # Reduced time limit
    ag_args_fit={'num_bag_folds': 2, 'num_stack_levels': 0},
    num_bag_sets=1,
    keep_only_best=True,
    refit_full=False,
    set_best_to_refit_full=False
)

# Print summary of fit results
results = predictor.fit_summary()
print(results)

 'id_10' 'id_11' 'id_13' 'id_14' 'id_17' 'id_18' 'id_19' 'id_20' 'id_21'
 'id_22' 'id_24' 'id_25' 'id_26' 'id_32']. At least one non-missing value is needed for imputation with strategy='mean'.
Verbosity: 3 (Detailed Logging)
AutoGluon Version:  1.1.1
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
CPU Count:          2
GPU Count:          0
Memory Avail:       9.81 GB / 12.67 GB (77.4%)
Disk Space Avail:   73.28 GB / 107.72 GB (68.0%)
Presets specified: ['medium_quality']
User Specified kwargs:
{'ag_args_fit': {'num_bag_folds': 2, 'num_stack_levels': 0},
 'auto_stack': False,
 'keep_only_best': True,
 'num_bag_sets': 1,
 'refit_full': False,
 'set_best_to_refit_full': False}
Full kwargs:
{'_feature_generator_kwargs': None,
 '_save_bag_folds': None,
 'ag_args': None,
 'ag_args_ensemble': None,
 'ag_args_fit': {'num_bag_folds': 2, 'num_stack_levels': 0},
 'auto_stack': False,
 'calib

[50]	valid_set's binary_logloss: 0.130373
[100]	valid_set's binary_logloss: 0.129903
[150]	valid_set's binary_logloss: 0.129056
[200]	valid_set's binary_logloss: 0.1288
[250]	valid_set's binary_logloss: 0.128521
[300]	valid_set's binary_logloss: 0.128334
[350]	valid_set's binary_logloss: 0.128343
[400]	valid_set's binary_logloss: 0.128134
[450]	valid_set's binary_logloss: 0.127927
[500]	valid_set's binary_logloss: 0.127828
[550]	valid_set's binary_logloss: 0.127748
[600]	valid_set's binary_logloss: 0.127802
[650]	valid_set's binary_logloss: 0.127747
[700]	valid_set's binary_logloss: 0.12776
[750]	valid_set's binary_logloss: 0.12801
[800]	valid_set's binary_logloss: 0.128024
[850]	valid_set's binary_logloss: 0.12814


Saving /content/AutoGluonModels/models/LightGBMXT/model.pkl
Saving /content/AutoGluonModels/utils/attr/LightGBMXT/y_pred_proba_val.pkl
	0.6391	 = Validation score   (roc_auc)
	3.18s	 = Training   runtime
	0.11s	 = Validation runtime
	18705.3	 = Inference  throughput (rows/s | 2000 batch size)
Saving /content/AutoGluonModels/models/trainer.pkl
Fitting model: LightGBM ... Training model for up to 1796.18s of the 1796.14s of remaining time.
	Fitting LightGBM with 'num_gpus': 0, 'num_cpus': 1
	Fitting 10000 rounds... Hyperparameters: {'learning_rate': 0.05}


[50]	valid_set's binary_logloss: 0.128812
[100]	valid_set's binary_logloss: 0.131344
[150]	valid_set's binary_logloss: 0.134381


Saving /content/AutoGluonModels/models/LightGBM/model.pkl
Saving /content/AutoGluonModels/utils/attr/LightGBM/y_pred_proba_val.pkl
	0.6172	 = Validation score   (roc_auc)
	0.55s	 = Training   runtime
	0.0s	 = Validation runtime
	1258228.3	 = Inference  throughput (rows/s | 2000 batch size)
Saving /content/AutoGluonModels/models/trainer.pkl
Fitting model: RandomForestGini ... Training model for up to 1795.61s of the 1795.58s of remaining time.
	Fitting RandomForestGini with 'num_gpus': 0, 'num_cpus': 2
Saving /content/AutoGluonModels/models/RandomForestGini/model.pkl
Saving /content/AutoGluonModels/utils/attr/RandomForestGini/y_pred_proba_val.pkl
	0.5645	 = Validation score   (roc_auc)
	14.0s	 = Training   runtime
	0.16s	 = Validation runtime
	12739.3	 = Inference  throughput (rows/s | 2000 batch size)
Saving /content/AutoGluonModels/models/trainer.pkl
Fitting model: RandomForestEntr ... Training model for up to 1781.39s of the 1781.36s of remaining time.
	Fitting RandomForestEntr with 

[0]	validation_0-logloss:0.18719
[50]	validation_0-logloss:0.12895
[100]	validation_0-logloss:0.12999
[150]	validation_0-logloss:0.13101
[200]	validation_0-logloss:0.13241
[224]	validation_0-logloss:0.13269


Saving /content/AutoGluonModels/models/XGBoost/model.pkl
Saving /content/AutoGluonModels/utils/attr/XGBoost/y_pred_proba_val.pkl
	0.6088	 = Validation score   (roc_auc)
	1.29s	 = Training   runtime
	0.02s	 = Validation runtime
	127857.6	 = Inference  throughput (rows/s | 2000 batch size)
Saving /content/AutoGluonModels/models/trainer.pkl
Fitting model: NeuralNetTorch ... Training model for up to 1729.54s of the 1729.51s of remaining time.
	Fitting NeuralNetTorch with 'num_gpus': 0, 'num_cpus': 1
Tabular Neural Network treats features as the following types:
{
    "continuous": [
        "0"
    ],
    "skewed": [
        "1"
    ],
    "onehot": [],
    "embed": [],
    "language": [],
    "bool": []
}


Training data for TabularNeuralNetTorchModel has: 18000 examples, 2 features (2 vector, 0 embedding)
Training on CPU
Neural network architecture:
EmbedNet(
  (main_block): Sequential(
    (0): Linear(in_features=2, out_features=128, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.1, in

[50]	valid_set's binary_logloss: 0.134198
[100]	valid_set's binary_logloss: 0.140116
[150]	valid_set's binary_logloss: 0.146112


Saving /content/AutoGluonModels/models/LightGBMLarge/model.pkl
Saving /content/AutoGluonModels/utils/attr/LightGBMLarge/y_pred_proba_val.pkl
	0.6036	 = Validation score   (roc_auc)
	1.45s	 = Training   runtime
	0.01s	 = Validation runtime
	362719.2	 = Inference  throughput (rows/s | 2000 batch size)
Saving /content/AutoGluonModels/models/trainer.pkl
Loading: /content/AutoGluonModels/utils/attr/NeuralNetFastAI/y_pred_proba_val.pkl
Loading: /content/AutoGluonModels/utils/attr/LightGBMLarge/y_pred_proba_val.pkl
Loading: /content/AutoGluonModels/utils/attr/RandomForestGini/y_pred_proba_val.pkl
Loading: /content/AutoGluonModels/utils/attr/NeuralNetTorch/y_pred_proba_val.pkl
Loading: /content/AutoGluonModels/utils/attr/LightGBMXT/y_pred_proba_val.pkl
Loading: /content/AutoGluonModels/utils/attr/ExtraTreesEntr/y_pred_proba_val.pkl
Loading: /content/AutoGluonModels/utils/attr/RandomForestEntr/y_pred_proba_val.pkl
Loading: /content/AutoGluonModels/utils/attr/ExtraTreesGini/y_pred_proba_val.pkl


*** Summary of fit() ***
Estimated performance of each model:
                 model  score_val eval_metric  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0  WeightedEnsemble_L2   0.691671     roc_auc       0.362314  33.405865                0.001741           0.272109            2       True          5
1           LightGBMXT   0.639147     roc_auc       0.106921   3.183570                0.106921           3.183570            1       True          2
2      NeuralNetFastAI   0.601886     roc_auc       0.081158  27.722270                0.081158          27.722270            1       True          4
3       ExtraTreesGini   0.565143     roc_auc       0.156813   2.178463                0.156813           2.178463            1       True          3
4       KNeighborsDist   0.548430     roc_auc       0.015682   0.049454                0.015682           0.049454            1       True          1
Number of models trained: 5
Types of m

*We ask AutoGluon for predicted class probabilities*

In [9]:
import pandas as pd
from autogluon.tabular import TabularPredictor
import gc

# Load data in chunks to reduce memory usage
chunk_size = 100000
directory = '/content/'

# Function to process chunks
def process_chunks(identity_chunks, transaction_chunks):
    processed_data = []
    for identity_chunk, transaction_chunk in zip(identity_chunks, transaction_chunks):
        # Merge the chunks
        merged_chunk = pd.merge(transaction_chunk, identity_chunk, on='TransactionID', how='left')
        processed_data.append(merged_chunk)

        # Free up memory
        del identity_chunk, transaction_chunk
        gc.collect()

    return pd.concat(processed_data, ignore_index=True)

# Load and process data in chunks
test_identity = pd.read_csv(directory + 'test_identity.csv', chunksize=chunk_size)
test_transaction = pd.read_csv(directory + 'test_transaction.csv', chunksize=chunk_size)
test_data = process_chunks(test_identity, test_transaction)

# Optional: Sample the data if it's too large
test_data_sample = test_data.sample(frac=0.3, random_state=42)

# Downcast columns to save memory
test_data_sample['card1'] = test_data_sample['card1'].astype('int32')
test_data_sample['TransactionAmt'] = test_data_sample['TransactionAmt'].astype('float32')

# Free up memory
del test_data
gc.collect()

# Load the trained model
predictor = TabularPredictor.load('/content/AutoGluonModels/')

# Check the required columns in the model
required_columns = predictor.feature_metadata.get_features()
print("Required columns:", required_columns)

# Check if the test data contains all the required columns
missing_columns = [col for col in required_columns if col not in test_data_sample.columns]
print("Missing columns:", missing_columns)

# Ensure test data matches the model's required columns
for col in missing_columns:
    test_data_sample[col] = 0  # Fill missing columns with a default value (e.g., 0)

# Predict in chunks to avoid memory overload
chunk_size = 10000
y_predproba = []

for i in range(0, len(test_data_sample), chunk_size):
    chunk = test_data_sample.iloc[i:i + chunk_size]
    y_pred_chunk = predictor.predict_proba(chunk)
    y_predproba.append(y_pred_chunk)

# Concatenate predictions
y_predproba = pd.concat(y_predproba)

# Display some predictions
print(y_predproba.head())



Loading: /content/AutoGluonModels/predictor.pkl
Loading: /content/AutoGluonModels/learner.pkl
Loading: /content/AutoGluonModels/models/trainer.pkl
Loading: /content/AutoGluonModels/models/ExtraTreesGini/model.pkl


Required columns: ['0', '1']
Missing columns: ['0', '1']


Loading: /content/AutoGluonModels/models/KNeighborsDist/model.pkl
Loading: /content/AutoGluonModels/models/LightGBMXT/model.pkl
Loading: /content/AutoGluonModels/models/NeuralNetFastAI/model.pkl
Loading: /content/AutoGluonModels/models/NeuralNetFastAI/model-internals.pkl
Loading: /content/AutoGluonModels/models/WeightedEnsemble_L2/model.pkl
Loading: /content/AutoGluonModels/models/ExtraTreesGini/model.pkl
Loading: /content/AutoGluonModels/models/KNeighborsDist/model.pkl
Loading: /content/AutoGluonModels/models/LightGBMXT/model.pkl
Loading: /content/AutoGluonModels/models/NeuralNetFastAI/model.pkl
Loading: /content/AutoGluonModels/models/NeuralNetFastAI/model-internals.pkl
Loading: /content/AutoGluonModels/models/WeightedEnsemble_L2/model.pkl
Loading: /content/AutoGluonModels/models/ExtraTreesGini/model.pkl
Loading: /content/AutoGluonModels/models/KNeighborsDist/model.pkl
Loading: /content/AutoGluonModels/models/LightGBMXT/model.pkl
Loading: /content/AutoGluonModels/models/NeuralNetFast

               0         1
119737  0.977992  0.022008
72272   0.977992  0.022008
158154  0.977992  0.022008
65426   0.977992  0.022008
30074   0.977992  0.022008


In [10]:
predictor.positive_class


1

In [11]:
predictor.class_labels

[0, 1]

*Prediction probabilities for the entire test data*


In [12]:
y_predproba = predictor.predict_proba(test_data_sample, as_multiclass=False)


Loading: /content/AutoGluonModels/models/ExtraTreesGini/model.pkl
Loading: /content/AutoGluonModels/models/KNeighborsDist/model.pkl
Loading: /content/AutoGluonModels/models/LightGBMXT/model.pkl
Loading: /content/AutoGluonModels/models/NeuralNetFastAI/model.pkl
Loading: /content/AutoGluonModels/models/NeuralNetFastAI/model-internals.pkl
Loading: /content/AutoGluonModels/models/WeightedEnsemble_L2/model.pkl


In [13]:
submission = pd.read_csv(directory+'sample_submission.csv')
submission['isFraud'] = y_predproba
submission.head()
submission.to_csv(directory+'my_submission.csv', index=False)

In [14]:
!kaggle competitions submit -c ieee-fraud-detection -f sample_submission.csv -m "my first submission"



100% 5.80M/5.80M [00:00<00:00, 14.3MB/s]
Successfully submitted to IEEE-CIS Fraud Detection