# Colab-ready: AutoGluon + IEEE-CIS Fraud Detection

This notebook combines the steps from the project's `tabular-kaggle.ipynb` and the existing Colab helper into a single, runnable Colab notebook. It will:
- install packages,
- show two ways to provide your `kaggle.json` (upload or Drive),
- download the dataset via the Kaggle API,
- merge CSVs, reduce memory, and optionally sample for quick runs,
- train AutoGluon,
- save predictions and optionally submit to Kaggle via the CLI.

Run the cells in order.

## 1 — Install dependencies
Run this cell first in Colab. It installs `kaggle` and `autogluon.tabular` plus common data packages. If any install fails, try a different AutoGluon version that matches Colab's Python runtime.

In [None]:
# Install required packages (run in Colab)
!pip -q install --upgrade pip
!pip -q install kaggle autogluon.tabular pandas numpy scikit-learn
import sys, pandas as pd, numpy as np
print('Python:', sys.version)

## 2 — Provide your Kaggle API key (`kaggle.json`)

Two recommended options:

- A) Upload interactively (one-off): use the upload cell below.
- B) Mount Google Drive (recommended for repeat runs): put `kaggle.json` at `MyDrive/.kaggle/kaggle.json`, then run the Drive cell to copy it to `~/.kaggle/kaggle.json`.

Important: never commit `kaggle.json` to source control. It contains your API key.

In [None]:
# Option A: Upload kaggle.json (interactive)
from google.colab import files
import os
uploaded = files.upload()
if 'kaggle.json' in uploaded:
    os.makedirs(os.path.expanduser('~/.kaggle'), exist_ok=True)
    # Use a context manager so the file is flushed and closed before chmod
    with open(os.path.expanduser('~/.kaggle/kaggle.json'), 'wb') as f:
        f.write(uploaded['kaggle.json'])
    os.chmod(os.path.expanduser('~/.kaggle/kaggle.json'), 0o600)
    print('Saved to ~/.kaggle/kaggle.json')
else:
    print('No kaggle.json uploaded. Use Drive option if preferred.')

In [None]:
# Option B: Mount Google Drive and copy kaggle.json from MyDrive/.kaggle/
from google.colab import drive
import shutil, os
drive.mount('/content/drive')
drive_path = '/content/drive/MyDrive/.kaggle/kaggle.json'
if os.path.exists(drive_path):
    os.makedirs(os.path.expanduser('~/.kaggle'), exist_ok=True)
    shutil.copy(drive_path, os.path.expanduser('~/.kaggle/kaggle.json'))
    os.chmod(os.path.expanduser('~/.kaggle/kaggle.json'), 0o600)
    print('Copied kaggle.json from Drive to ~/.kaggle/kaggle.json')
else:
    print('No kaggle.json found in Drive at', drive_path)
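
Before calling the Kaggle CLI, it can help to confirm that the credentials file is usable. The helper below is only a sketch: `check_kaggle_json` is a hypothetical name, and it checks only that the file parses as JSON, contains non-empty `username` and `key` fields, and is not accessible to group or others.

```python
import json
import os
import stat


def check_kaggle_json(path="~/.kaggle/kaggle.json"):
    """Return a list of problems found with the credentials file (empty if none)."""
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        return [f"missing file: {path}"]
    try:
        with open(path, "r", encoding="utf-8") as fh:
            creds = json.load(fh)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"unreadable or invalid JSON: {exc}"]
    if not isinstance(creds, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing or empty field: {field}")
    # Group/other permission bits should be cleared (equivalent to chmod 600).
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions too open: {oct(mode)}")
    return problems


if __name__ == "__main__":
    print(check_kaggle_json() or "kaggle.json looks OK")
```

An empty list means the file passed these basic checks. It does not prove the key is still valid on Kaggle's side.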

## 3 — Download dataset via Kaggle API
This will download and unzip the competition data into `/content/data`. Ensure you have accepted the competition rules on the Kaggle website first.

In [None]:
# Download and unzip
competition = 'ieee-fraud-detection'
import os
os.makedirs('/content/data', exist_ok=True)
!kaggle competitions download -c $competition -p /content/data --quiet
!unzip -oq /content/data/{competition}.zip -d /content/data || true
print('Done. Files:')
!ls -lh /content/data | sed -n '1,120p'

## 4 — Load, merge, and reduce memory
We left-join the transaction table with the identity table on `TransactionID`, so transactions without identity data are kept. A helper function downcasts numeric columns to reduce memory usage.

In [None]:
import pandas as pd, numpy as np, os
DATA_DIR = '/content/data'
def reduce_mem_usage(df, verbose=True):
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtype
        if pd.api.types.is_numeric_dtype(col_type):
            if pd.api.types.is_integer_dtype(col_type):
                # Let pandas pick the smallest integer dtype that holds every value;
                # unsigned when there are no negatives (avoids hand-written bounds
                # and the overflow a fixed uint32 fallback could cause).
                kind = 'unsigned' if df[col].min() >= 0 else 'integer'
                df[col] = pd.to_numeric(df[col], downcast=kind)
            else:
                # Floats (including columns with NaN) go to float32 where pandas allows it
                df[col] = pd.to_numeric(df[col], downcast='float')
    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    if verbose:
        print(f'Memory: {start_mem:,.2f} MB -> {end_mem:,.2f} MB')
    return df
# Read files
paths = {
    'train_transaction': os.path.join(DATA_DIR, 'train_transaction.csv'),
    'train_identity': os.path.join(DATA_DIR, 'train_identity.csv'),
    'test_transaction': os.path.join(DATA_DIR, 'test_transaction.csv'),
    'test_identity': os.path.join(DATA_DIR, 'test_identity.csv'),
    'sample_submission': os.path.join(DATA_DIR, 'sample_submission.csv'),
}
for k,p in paths.items():
    print(k, '->', os.path.exists(p))
train_tr = pd.read_csv(paths['train_transaction'], low_memory=False)
train_id = pd.read_csv(paths['train_identity'], low_memory=False)
test_tr  = pd.read_csv(paths['test_transaction'], low_memory=False)
test_id  = pd.read_csv(paths['test_identity'], low_memory=False)
train = train_tr.merge(train_id, on='TransactionID', how='left')
test  = test_tr.merge(test_id, on='TransactionID', how='left')
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)
print('train shape', train.shape, 'test shape', test.shape)
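
To see what the left join and the downcast do, here is a toy example. The rows are made up; only the column names (`TransactionID`, `TransactionAmt`, `DeviceType`) come from the dataset. The `validate="one_to_one"` argument is an optional extra check, not something the cell above uses.

```python
import pandas as pd

# Made-up rows: three transactions, only two of which have identity info.
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "TransactionAmt": [10.5, 20.0, 7.25],
})
identity = pd.DataFrame({
    "TransactionID": [1, 3],
    "DeviceType": ["mobile", "desktop"],
})

# how='left' keeps every transaction; validate= raises if an ID is duplicated
# on either side, which would otherwise silently multiply rows.
merged = transactions.merge(identity, on="TransactionID", how="left",
                            validate="one_to_one")
print(merged)

# Downcasting picks the smallest dtype that can hold the values.
ids_small = pd.to_numeric(merged["TransactionID"], downcast="unsigned")
amt_small = pd.to_numeric(merged["TransactionAmt"], downcast="float")
print(ids_small.dtype, amt_small.dtype)
```

Transaction 2 has no identity row, so its `DeviceType` is missing after the merge. This is why many identity columns in the merged frame contain NaN.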

## 5 — Optional: sample for quicker experiments
Set `USE_SAMPLE=True` to run a quick baseline. This helps when Colab runs out of RAM.

In [None]:
USE_SAMPLE = False
SAMPLE_FRAC = 0.1
if USE_SAMPLE:
    train = train.sample(frac=SAMPLE_FRAC, random_state=42)
    print('Sampled train shape:', train.shape)
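
When the positive class is rare, a plain random sample can shift the fraud rate slightly. One alternative, different from what the cell above does, is to sample the same fraction within each class. `stratified_sample` is a hypothetical helper name, and the toy data is made up.

```python
import pandas as pd


def stratified_sample(df, label, frac, random_state=42):
    """Sample the same fraction from each class so the label ratio is preserved."""
    return df.groupby(label, group_keys=False).sample(frac=frac, random_state=random_state)


# Toy data (made up): 1000 rows with 5% positives.
toy = pd.DataFrame({"x": range(1000), "isFraud": [1] * 50 + [0] * 950})
small = stratified_sample(toy, "isFraud", frac=0.1)
print(len(small), small["isFraud"].mean())
```

In this notebook the equivalent call would be `train = stratified_sample(train, 'isFraud', SAMPLE_FRAC)`.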

## 6 — Train AutoGluon TabularPredictor
We use `isFraud` as the label. Adjust `time_limit` and `presets` as needed. For large jobs, save the model directory to Drive after training.

In [None]:
from autogluon.tabular import TabularPredictor
label = 'isFraud'
if label not in train.columns:
    raise ValueError(f'Label {label} not in train')
cols_to_drop = ['TransactionID']
train_features = train.drop(columns=[c for c in cols_to_drop if c in train.columns])
test_features = test.drop(columns=[c for c in cols_to_drop if c in test.columns])
save_path = '/content/ag_models'
predictor = TabularPredictor(label=label, path=save_path, eval_metric='roc_auc').fit(
    train_features,
    presets='medium_quality',
    time_limit=1800,  # seconds
)
print(predictor.fit_summary())
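
AutoGluon sets aside validation data internally. For transactions ordered in time, it can still be useful to keep a later slice of the training data for a final check. The sketch below assumes `TransactionDT` is a time offset that can be sorted on. `time_holdout` is a hypothetical helper, and the toy values are made up.

```python
import pandas as pd


def time_holdout(df, time_col="TransactionDT", holdout_frac=0.2):
    """Split df into (earlier, later) parts after sorting on time_col."""
    if not 0 < holdout_frac < 1:
        raise ValueError("holdout_frac must be between 0 and 1")
    ordered = df.sort_values(time_col, kind="stable")
    # Keep at least one row in the holdout part.
    n_holdout = max(1, round(len(ordered) * holdout_frac))
    cut = len(ordered) - n_holdout
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Toy data (made up) to show the split.
toy = pd.DataFrame({"TransactionDT": [50, 10, 40, 20, 30], "isFraud": [0, 0, 1, 0, 1]})
fit_part, holdout_part = time_holdout(toy, holdout_frac=0.4)
print(fit_part["TransactionDT"].tolist(), holdout_part["TransactionDT"].tolist())
```

With the real data, one could fit on the earlier part and score the later part with `predictor.evaluate(...)`. Whether that is worth the lost training rows depends on the time budget.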

## 7 — Predict & prepare submission
We predict class probabilities and write the `submission.csv` expected by the competition.

In [None]:
proba = predictor.predict_proba(test_features)
import pandas as pd
if isinstance(proba, pd.DataFrame):
    # pick positive class probability
    if 1 in proba.columns:
        preds = proba[1]
    else:
        preds = proba.iloc[:, -1]
else:
    preds = proba
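submission = pd.read_csv(paths['sample_submission'])
# Assign by position rather than by index, after checking the row order matches the test set
if not np.array_equal(submission['TransactionID'].to_numpy(), test['TransactionID'].to_numpy()):
    raise ValueError('sample_submission rows are not in the same order as test')
submission['isFraud'] = np.asarray(preds)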
submission.to_csv('/content/submission.csv', index=False)
print('Saved /content/submission.csv')
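
A quick sanity check before submitting can catch a malformed file early. The function below is a sketch under stated assumptions: `validate_submission` is a hypothetical helper, and it assumes the expected layout is exactly two columns, `TransactionID` and `isFraud`, with probabilities in [0, 1].

```python
import pandas as pd


def validate_submission(sub, expected_ids, id_col="TransactionID", target_col="isFraud"):
    """Return a list of problems with a submission frame (empty if none)."""
    if list(sub.columns) != [id_col, target_col]:
        return [f"unexpected columns: {list(sub.columns)}"]
    problems = []
    expected = pd.Series(list(expected_ids))
    if len(sub) != len(expected):
        problems.append(f"row count {len(sub)} != expected {len(expected)}")
    elif not (sub[id_col].to_numpy() == expected.to_numpy()).all():
        problems.append("IDs differ from the expected order")
    if sub[target_col].isna().any():
        problems.append("missing predictions")
    elif not sub[target_col].between(0, 1).all():
        problems.append("predictions outside [0, 1]")
    return problems


# Toy check with made-up values.
sub = pd.DataFrame({"TransactionID": [1, 2, 3], "isFraud": [0.1, 0.9, 0.5]})
print(validate_submission(sub, expected_ids=[1, 2, 3]))
```

In this notebook the call might be `validate_submission(submission, test['TransactionID'])`.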

## 8 — Submit to Kaggle (optional)
You can submit via the Kaggle CLI. If you prefer, download `/content/submission.csv` and upload manually on the competition page.

In [None]:
# Kaggle submit (uncomment to run)
# competition = 'ieee-fraud-detection'
# !kaggle competitions submit -c $competition -f /content/submission.csv -m 'AutoGluon colab submission'
print('To submit, run: kaggle competitions submit -c ieee-fraud-detection -f /content/submission.csv -m "<message>"')

---
### Notes:
- Keep `kaggle.json` private.
- For reproducibility, save `/content/ag_models` to Drive after training.
- If Colab runs out of RAM, try a smaller sample or use a larger VM.
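
Saving the model directory to Drive can be done with a plain directory copy. This is a minimal sketch: `backup_dir` is a hypothetical helper, and the Drive destination in the commented call is an assumed folder name.

```python
import os
import shutil


def backup_dir(src, dst):
    """Copy src into dst (merging if dst exists) and return the file count under dst."""
    if not os.path.isdir(src):
        raise FileNotFoundError(f"source directory not found: {src}")
    shutil.copytree(src, dst, dirs_exist_ok=True)  # dirs_exist_ok needs Python 3.8+
    return sum(len(files) for _, _, files in os.walk(dst))


# In Colab, after mounting Drive, a call might look like this
# (the destination folder name is an assumption):
# backup_dir('/content/ag_models', '/content/drive/MyDrive/ag_models')
```

In a later session, the copied directory could be passed to `TabularPredictor.load(...)`, provided the same AutoGluon version is installed.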