# Colab-ready: AutoGluon + IEEE-CIS Fraud Detection

This notebook combines the steps from the project's `tabular-kaggle.ipynb` and the existing Colab helper into a single, runnable Colab notebook. It will:
- install packages,
- show two ways to provide your `kaggle.json` (upload or Drive),
- download the dataset via the Kaggle API,
- merge CSVs, reduce memory, and optionally sample for quick runs,
- train AutoGluon,
- save predictions and optionally submit to Kaggle via the CLI.

Run the cells in order.

## 1 — Install dependencies
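Run this cell first in Colab. It installs `kaggle` and `autogluon.tabular` plus common data packages. If an install fails, pin an AutoGluon version that supports Colab's Python runtime. The base `autogluon.tabular` package does not include every optional model backend: the training log in section 6 shows CatBoost being skipped because `import catboost` failed, and suggests installing `autogluon.tabular[catboost]` to enable it.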

In [1]:
# Install required packages (run in Colab)
!pip -q install --upgrade pip
!pip -q install kaggle autogluon.tabular pandas numpy scikit-learn
import sys, pandas as pd, numpy as np
print('Python:', sys.version)

Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
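
If an install fails or a later import breaks, it can help to see which versions actually landed in the runtime. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is an illustrative assumption and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as passed to pip in the install cell.
report = installed_versions(
    ['kaggle', 'autogluon.tabular', 'pandas', 'numpy', 'scikit-learn']
)
for name, version in report.items():
    print(f'{name}: {version or "not installed"}')
```

A `None` entry means the distribution is missing from the environment, which usually points to a failed `pip install` line.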


## 2 — Provide your Kaggle API key (`kaggle.json`)

Two options:
- A) Upload interactively (one-off): use the upload cell below.
- B) Mount Google Drive (recommended for repeat runs): put `kaggle.json` in `MyDrive/.kaggle/kaggle.json`, then run the Drive cell to copy it to `~/.kaggle/kaggle.json`.

Important: never commit `kaggle.json` to source control. It contains your API key.

In [3]:
# Option A: Upload kaggle.json (interactive)
from google.colab import files
import os
uploaded = files.upload()
if 'kaggle.json' in uploaded:
    os.makedirs(os.path.expanduser('~/.kaggle'), exist_ok=True)
    # Use a context manager so the file is flushed and closed before chmod.
    with open(os.path.expanduser('~/.kaggle/kaggle.json'), 'wb') as f:
        f.write(uploaded['kaggle.json'])
    os.chmod(os.path.expanduser('~/.kaggle/kaggle.json'), 0o600)
    print('Saved to ~/.kaggle/kaggle.json')
else:
    print('No kaggle.json uploaded. Use Drive option if preferred.')

Saving kaggle.json to kaggle.json
Saved to ~/.kaggle/kaggle.json


In [None]:
# Option B: Mount Google Drive and copy kaggle.json from MyDrive/.kaggle/
from google.colab import drive
import shutil, os
drive.mount('/content/drive')
drive_path = '/content/drive/MyDrive/.kaggle/kaggle.json'
if os.path.exists(drive_path):
    os.makedirs(os.path.expanduser('~/.kaggle'), exist_ok=True)
    shutil.copy(drive_path, os.path.expanduser('~/.kaggle/kaggle.json'))
    os.chmod(os.path.expanduser('~/.kaggle/kaggle.json'), 0o600)
    print('Copied kaggle.json from Drive to ~/.kaggle/kaggle.json')
else:
    print('No kaggle.json found in Drive at', drive_path)
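
Whichever option is used, a quick sanity check of the credentials file can catch problems before the download step. This is a minimal sketch, assuming the file is a JSON object with `username` and `key` fields; the helper name `check_kaggle_credentials` is hypothetical.

```python
import json
import os
import stat


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    if not os.path.isfile(path):
        return [f'{path} does not exist']
    problems = []
    # Group/other permission bits should be cleared, matching the chmod 0o600 above.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f'permissions are {oct(mode)}, expected 0o600')
    try:
        with open(path) as fh:
            creds = json.load(fh)
    except (OSError, json.JSONDecodeError) as exc:
        return problems + [f'could not parse JSON: {exc}']
    if not isinstance(creds, dict):
        return problems + ['top-level JSON value is not an object']
    for field in ('username', 'key'):
        if not creds.get(field):
            problems.append(f'missing field: {field}')
    return problems


print(check_kaggle_credentials(os.path.expanduser('~/.kaggle/kaggle.json')))
```

An empty list means the file exists, is private to the current user, and contains both fields. The check never prints the key itself.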

## 3 — Download dataset via Kaggle API
This will download and unzip the competition data into `/content/data`. Ensure you have accepted the competition rules on the Kaggle website first.

In [5]:
# Download and unzip
competition = 'ieee-fraud-detection'
import os
os.makedirs('/content/data', exist_ok=True)
!kaggle competitions download -c $competition -p /content/data --quiet
!unzip -oq /content/data/{competition}.zip -d /content/data || true
print('Done. Files:')
!ls -lh /content/data | sed -n '1,120p'

Done. Files:
total 1.5G
-rw-r--r-- 1 root root  30M Mar 19  2021 california-house-prices.zip
-rw-r--r-- 1 root root 119M Dec 11  2019 ieee-fraud-detection.zip
-rw-r--r-- 1 root root 248K Mar 19  2021 sample_submission.csv
-rw-r--r-- 1 root root  35M Mar 19  2021 test.csv
-rw-r--r-- 1 root root  25M Dec 11  2019 test_identity.csv
-rw-r--r-- 1 root root 585M Dec 11  2019 test_transaction.csv
-rw-r--r-- 1 root root  51M Mar 19  2021 train.csv
-rw-r--r-- 1 root root  26M Dec 11  2019 train_identity.csv
-rw-r--r-- 1 root root 652M Dec 11  2019 train_transaction.csv


## 4 — Load, merge, and reduce memory
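We left-join each transaction table with its identity table on `TransactionID`, so transactions without an identity record are kept and their identity columns are filled with NaN. A utility function downcasts numeric columns to reduce memory usage.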

In [6]:
import pandas as pd, numpy as np, os
DATA_DIR = '/content/data'
def reduce_mem_usage(df, verbose=True):
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtype
        if pd.api.types.is_numeric_dtype(col_type):
            if pd.api.types.is_integer_dtype(col_type):
                # Let pandas pick the smallest integer type that holds every value;
                # unsigned types are used when the column has no negative values.
                kind = 'unsigned' if df[col].min() >= 0 else 'integer'
                df[col] = pd.to_numeric(df[col], downcast=kind)
            else:
                # Floats go to float32 where possible, which trades precision for memory.
                df[col] = pd.to_numeric(df[col], downcast='float')
    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    if verbose:
        print(f'Memory: {start_mem:,.2f} MB -> {end_mem:,.2f} MB')
    return df
# Read files
paths = {
    'train_transaction': os.path.join(DATA_DIR, 'train_transaction.csv'),
    'train_identity': os.path.join(DATA_DIR, 'train_identity.csv'),
    'test_transaction': os.path.join(DATA_DIR, 'test_transaction.csv'),
    'test_identity': os.path.join(DATA_DIR, 'test_identity.csv'),
    'sample_submission': os.path.join(DATA_DIR, 'sample_submission.csv'),
}
for k,p in paths.items():
    print(k, '->', os.path.exists(p))
train_tr = pd.read_csv(paths['train_transaction'], low_memory=False)
train_id = pd.read_csv(paths['train_identity'], low_memory=False)
test_tr  = pd.read_csv(paths['test_transaction'], low_memory=False)
test_id  = pd.read_csv(paths['test_identity'], low_memory=False)
train = train_tr.merge(train_id, on='TransactionID', how='left')
test  = test_tr.merge(test_id, on='TransactionID', how='left')
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)
print('train shape', train.shape, 'test shape', test.shape)

train_transaction -> True
train_identity -> True
test_transaction -> True
test_identity -> True
sample_submission -> True
Memory: 2,513.97 MB -> 1,603.31 MB
Memory: 2,164.10 MB -> 1,386.12 MB
train shape (590540, 434) test shape (506691, 433)
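
The left join used in this section can be illustrated on a toy pair of tables. This is a sketch with made-up values; `validate='one_to_one'` assumes `TransactionID` is unique in both tables, which is worth confirming on the real data rather than taking for granted.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (values are made up).
transactions = pd.DataFrame(
    {'TransactionID': [1, 2, 3], 'TransactionAmt': [10.0, 25.5, 7.0]}
)
identity = pd.DataFrame(
    {'TransactionID': [1, 3], 'DeviceType': ['mobile', 'desktop']}
)

merged = transactions.merge(
    identity,
    on='TransactionID',
    how='left',             # keep every transaction, matched or not
    validate='one_to_one',  # raise if TransactionID repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)
identity_share = (merged['_merge'] == 'both').mean()
print(merged)
print('Share of transactions with identity data:', identity_share)
```

The `_merge` column makes it easy to see how many transactions have no identity record, which explains the large number of NaN values in the `id_*` columns after the real merge.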


## 5 — Optional: sample for quicker experiments
Set `USE_SAMPLE=True` to run a quick baseline. This helps when Colab runs out of RAM.

In [None]:
USE_SAMPLE = False
SAMPLE_FRAC = 0.1
if USE_SAMPLE:
    train = train.sample(frac=SAMPLE_FRAC, random_state=42)
    print('Sampled train shape:', train.shape)
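
Because fraud labels are heavily imbalanced, a plain random sample can shift the share of positive rows. One alternative, sketched here on synthetic data, is to sample the same fraction within each label value; the helper name `stratified_sample` is hypothetical and this is not what the cell above does.

```python
import pandas as pd


def stratified_sample(df, label, frac, random_state=42):
    """Sample the same fraction from each label value so class ratios are kept."""
    return df.groupby(label, group_keys=False).sample(
        frac=frac, random_state=random_state
    )


# Synthetic frame with 4% positives (made-up data, for illustration only).
toy = pd.DataFrame({'isFraud': [0] * 960 + [1] * 40, 'x': range(1000)})
sampled = stratified_sample(toy, 'isFraud', frac=0.1)
print(sampled['isFraud'].value_counts())
print('Positive rate before:', toy['isFraud'].mean(), 'after:', sampled['isFraud'].mean())
```

In the notebook this could stand in for `train.sample(...)` as `train = stratified_sample(train, 'isFraud', SAMPLE_FRAC)`.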

## 6 — Train AutoGluon TabularPredictor
We use `isFraud` as the label. Adjust `time_limit` and `presets` as needed. For large jobs, save the model directory to Drive after training.

In [5]:
from autogluon.tabular import TabularPredictor
label = 'isFraud'
if label not in train.columns:
    raise KeyError(f'Label column {label!r} not found in train')
cols_to_drop = ['TransactionID']
train_features = train.drop(columns=[c for c in cols_to_drop if c in train.columns])
test_features = test.drop(columns=[c for c in cols_to_drop if c in test.columns])
save_path = '/content/ag_models'
predictor = TabularPredictor(label=label, path=save_path, eval_metric='roc_auc').fit(
    train_features, presets='medium_quality', time_limit=1800
)
print(predictor.fit_summary())

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.12.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Oct  2 10:42:05 UTC 2025
CPU Count:          8
Memory Avail:       42.20 GB / 50.99 GB (82.8%)
Disk Space Avail:   184.93 GB / 225.83 GB (81.9%)
Presets specified: ['medium_quality']
Using hyperparameters preset: hyperparameters='default'
Beginning AutoGluon training ... Time limit = 1800s
AutoGluon will save models to "/content/ag_models"
Train Data Rows:    590540
Train Data Columns: 432
Label Column:       isFraud
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [np.uint8(0), np.uint8(1)]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type:       binary
Preprocessing d

[1000]	valid_set's binary_logloss: 0.070524
[2000]	valid_set's binary_logloss: 0.0621083
[3000]	valid_set's binary_logloss: 0.0567775
[4000]	valid_set's binary_logloss: 0.0541112
[5000]	valid_set's binary_logloss: 0.0516139
[6000]	valid_set's binary_logloss: 0.0503407
[7000]	valid_set's binary_logloss: 0.0496697
[8000]	valid_set's binary_logloss: 0.0496936
[9000]	valid_set's binary_logloss: 0.0495574
[10000]	valid_set's binary_logloss: 0.0498254


	0.9691	 = Validation score   (roc_auc)
	477.4s	 = Training   runtime
	1.83s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 1305.21s of the 1305.21s of remaining time.
	Fitting with cpus=4, gpus=0, mem=5.0/39.0 GB


[1000]	valid_set's binary_logloss: 0.0636254
[2000]	valid_set's binary_logloss: 0.0554408
[3000]	valid_set's binary_logloss: 0.0506679
[4000]	valid_set's binary_logloss: 0.048336
[5000]	valid_set's binary_logloss: 0.0471347
[6000]	valid_set's binary_logloss: 0.0468823
[7000]	valid_set's binary_logloss: 0.0473612


	0.9707	 = Validation score   (roc_auc)
	361.41s	 = Training   runtime
	0.76s	 = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 942.81s of the 942.81s of remaining time.
	Fitting with cpus=8, gpus=0, mem=0.4/38.9 GB
	0.9344	 = Validation score   (roc_auc)
	237.0s	 = Training   runtime
	0.15s	 = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 705.17s of the 705.17s of remaining time.
	Fitting with cpus=8, gpus=0, mem=0.4/38.8 GB
	0.9365	 = Validation score   (roc_auc)
	185.79s	 = Training   runtime
	0.17s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 518.76s of the 518.75s of remaining time.
	Fitting with cpus=4, gpus=0, mem=5.6/38.7 GB
		`import catboost` failed. A quick tip is to install via `pip install autogluon.tabular[catboost]==1.4.0`.
Fitting model: ExtraTreesGini ... Training model for up to 516.77s of the 516.77s of remaining time.
	Fitting with cpus=8, gpus=0, mem=0.4/38.7 GB
	0.9041

*** Summary of fit() ***
Estimated performance of each model:
                 model  score_val eval_metric  pred_time_val     fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0  WeightedEnsemble_L2   0.971892     roc_auc       2.807516  1005.767153                0.001033           0.160867            2       True          9
1             LightGBM   0.970669     roc_auc       0.764521   361.411730                0.764521         361.411730            1       True          2
2           LightGBMXT   0.969097     roc_auc       1.827361   477.403704                1.827361         477.403704            1       True          1
3     RandomForestEntr   0.936494     roc_auc       0.172647   185.789573                0.172647         185.789573            1       True          4
4     RandomForestGini   0.934371     roc_auc       0.151000   236.999153                0.151000         236.999153            1       True          3
5       ExtraTreesEntr   0

## 7 — Predict & prepare submission
We predict class probabilities and write the `submission.csv` expected by the competition.
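
Prediction fails if the test frame lacks columns the model was trained on. In this dataset the identity columns are named `id_01` to `id_38` in the train files but `id-01` to `id-38` in the test files, so a check before predicting is useful. The sketch below uses empty toy frames, and the helper name `column_mismatch` is hypothetical.

```python
import pandas as pd


def column_mismatch(train_df, test_df, ignore=()):
    """Return (columns missing from test, extra columns in test) as two lists."""
    train_cols = [c for c in train_df.columns if c not in ignore]
    train_set = set(train_cols)
    test_set = set(test_df.columns)
    missing = [c for c in train_cols if c not in test_set]
    extra = [c for c in test_df.columns if c not in train_set]
    return missing, extra


# Empty toy frames that mimic the naming difference between train and test.
toy_train = pd.DataFrame(columns=['TransactionAmt', 'id_01', 'id_02', 'isFraud'])
toy_test = pd.DataFrame(columns=['TransactionAmt', 'id-01', 'id-02'])

before = column_mismatch(toy_train, toy_test, ignore=['isFraud'])
print('Before rename:', before)

# Harmonise the test names with the train names.
fixed_test = toy_test.rename(columns=lambda c: c.replace('id-', 'id_'))
after = column_mismatch(toy_train, fixed_test, ignore=['isFraud'])
print('After rename:', after)
```

On the real data, the same call would be `column_mismatch(train_features, test_features, ignore=['isFraud'])`; both lists should be empty before `predict_proba` is called.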

In [7]:
import pandas as pd
# The test identity columns are named 'id-01'..'id-38', while train uses 'id_01'..'id_38'.
# Without this rename, predict_proba raises a KeyError about 38 missing columns.
test_features = test_features.rename(columns=lambda c: c.replace('id-', 'id_'))
proba = predictor.predict_proba(test_features)
if isinstance(proba, pd.DataFrame):
    # pick positive class probability
    if 1 in proba.columns:
        preds = proba[1]
    else:
        preds = proba.iloc[:, -1]
else:
    preds = proba
submission = pd.read_csv(paths['sample_submission'])
submission['isFraud'] = preds
submission.to_csv('/content/submission.csv', index=False)
print('Saved /content/submission.csv')

## 8 — Submit to Kaggle (optional)
You can submit via the Kaggle CLI. If you prefer, download `/content/submission.csv` and upload manually on the competition page.

In [None]:
# Kaggle submit (uncomment the last line of this block to run)
competition = 'ieee-fraud-detection'
# !kaggle competitions submit -c $competition -f /content/submission.csv -m 'AutoGluon colab submission'
# print('To submit: run kaggle competitions submit -c ieee-fraud-detection -f /content/submission.csv -m ')

---
### Notes:
- Keep `kaggle.json` private.
- For reproducibility, save `/content/ag_models` to Drive after training.
- If Colab runs out of RAM, try a smaller sample or use a larger VM.
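
Saving the model directory to Drive can be done with a plain directory copy. This is a minimal sketch: the helper name `backup_dir` and the Drive target folder are assumptions, and Drive must already be mounted as in Option B.

```python
import os
import shutil


def backup_dir(src, dst):
    """Copy a directory tree to dst, overwriting files that already exist there."""
    if not os.path.isdir(src):
        raise FileNotFoundError(f'source directory not found: {src}')
    # dirs_exist_ok (Python 3.8+) lets repeated backups reuse the same target.
    return shutil.copytree(src, dst, dirs_exist_ok=True)


# In Colab, after drive.mount('/content/drive'), a call might look like this
# (the target folder name is a hypothetical choice):
# backup_dir('/content/ag_models', '/content/drive/MyDrive/ag_models')
```

In a later session, `TabularPredictor.load(path)` can reload the predictor from the copied directory once it has been copied back or Drive is mounted again.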