
# Kaggle Cloud Training Workflow

This notebook orchestrates remote training on Kaggle's GPU or TPU runtimes. It
assumes that the repository contents (including the `src/` modules) are available
inside the Kaggle workspace – either by uploading the project as a dataset or by
linking a GitHub repository through Kaggle Notebooks.



## 1. Environment setup

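Run the cell below to install the project dependencies and add the `src/`
directory to the Python path. The cell looks for `src/` in the current working
directory and falls back to the parent directory. Dependencies are installed
only when a `requirements.txt` exists at the project root; on Kaggle they are
installed into the current session only.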


In [None]:
import sys
from pathlib import Path
import subprocess

PROJECT_ROOT = Path.cwd()
if not (PROJECT_ROOT / 'src').exists():
    PROJECT_ROOT = PROJECT_ROOT.parent

REQUIREMENTS = PROJECT_ROOT / 'requirements.txt'
if REQUIREMENTS.exists():
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', '-r', str(REQUIREMENTS)])

if str(PROJECT_ROOT / 'src') not in sys.path:
    sys.path.append(str(PROJECT_ROOT / 'src'))

print(f"Project root: {PROJECT_ROOT}")
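
Before moving on, it can help to confirm that the path registration worked. The
sketch below checks whether the modules imported later in this notebook (`DEC`,
`IDEC`, `DBSCANModel`) can be located. The helper name `missing_modules` is
illustrative and not part of the project.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that Python cannot locate on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names taken from the import statements used later in this notebook.
missing = missing_modules(["DEC", "IDEC", "DBSCANModel"])
if missing:
    print("Not importable yet:", ", ".join(missing))
else:
    print("All project modules are importable.")
```

If any module is reported as missing, check that `PROJECT_ROOT` points at the
folder that contains `src/`.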


## 2. Configure data and output directories

Update the variables in the next cell to point at your dataset (typically
mounted beneath `/kaggle/input`) and to define where training artifacts should
be written (commonly `/kaggle/working/results`).


In [None]:

from pathlib import Path

# Path to the feature matrix. Replace the placeholder with your dataset name.
# Example: Path('/kaggle/input/vpcf-dataset/x.npy')
data_path = Path('/kaggle/input/YOUR_DATASET_NAME/x.npy')

# Optional: provide the path to ground-truth labels if available.
# Example: Path('/kaggle/input/vpcf-dataset/y.npy')
labels_path = None  # set to Path('...') when labels are present.

# Where to store model checkpoints and logs.
results_root = Path('/kaggle/working/results')
results_root.mkdir(parents=True, exist_ok=True)

print('Feature matrix:', data_path)
print('Labels path   :', labels_path)
print('Results folder:', results_root)
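
If you are unsure of the exact dataset folder name, a short search can list the
candidate files. This is a minimal sketch that assumes the arrays are stored as
`.npy` files somewhere beneath `/kaggle/input`; `find_npy_files` is a
hypothetical helper, not part of the project.

```python
from pathlib import Path


def find_npy_files(root):
    """Return all .npy files beneath root, sorted; empty if root is missing."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*.npy") if p.is_file())


# On Kaggle this lists every candidate array; elsewhere it prints nothing.
for candidate in find_npy_files("/kaggle/input"):
    print(candidate)
```

Copy the matching path into `data_path` (and `labels_path`, if applicable).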



## 3. Load the dataset

The cell below loads the feature matrix as `float32` and, when `labels_path` is
set, the label array. Adjust the `astype` target if you require a different
dtype.


In [None]:

import numpy as np

if not data_path.exists():
    raise FileNotFoundError(f'Update data_path to match your Kaggle dataset. {data_path} not found.')

x = np.load(data_path).astype('float32')
print('Loaded features with shape:', x.shape)

if labels_path is not None:
    if not labels_path.exists():
        raise FileNotFoundError(f'Labels path {labels_path} does not exist.')
    y = np.load(labels_path)
    print('Loaded labels with shape:', y.shape)
else:
    y = None
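
A few sanity checks right after loading can save a long training run from
failing late. The sketch below assumes what the later cells imply: `x` is a 2-D
array (the network dimensions use `x.shape[1]`) and `y`, when present, has one
entry per row of `x`. The helper name `check_arrays` is illustrative.

```python
import numpy as np


def check_arrays(x, y=None):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if x.ndim != 2:
        problems.append(f"expected a 2-D feature matrix, got shape {x.shape}")
    if not np.isfinite(x).all():
        problems.append("features contain NaN or infinite values")
    if y is not None and len(y) != len(x):
        problems.append(f"{len(y)} labels for {len(x)} samples")
    return problems


# Toy demonstration: three samples but only two labels.
demo_x = np.zeros((3, 4), dtype="float32")
demo_y = np.array([0, 1])
print(check_arrays(demo_x, demo_y))
```

In the notebook you would call `check_arrays(x, y)` and stop if the returned
list is not empty.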



## 4. Train DEC

This block performs autoencoder pretraining followed by DEC clustering. Adjust
the hyperparameters (e.g., `dims`, `n_clusters`, `maxiter`) to suit your data.
Results and logs are saved beneath the configured `results_root`.


In [None]:

from DEC import DEC

dims = [x.shape[1], 500, 500, 2000, 10]
dec_save_dir = results_root / 'dec'
dec = DEC(dims=dims, n_clusters=10, save_dir=str(dec_save_dir))

dec.pretrain(x, epochs=50, batch_size=256)
dec.compile(optimizer='sgd')
dec_labels = dec.fit(x, y=y, maxiter=8000, update_interval=200, batch_size=256)
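
For intuition about the clustering phase, here is a NumPy sketch of the two
quantities at the core of the published DEC algorithm: the Student's-t soft
assignment `q` between embedded points and cluster centroids, and the sharpened
target distribution `p` that the model is trained to match. The repository's
`DEC` class is not shown in this notebook and may differ in detail; the function
names and the `alpha` default are assumptions made for illustration.

```python
import numpy as np


def soft_assignment(z, centroids, alpha=1.0):
    """Student's-t similarity between embeddings z (n, d) and centroids (k, d)."""
    # Squared Euclidean distance between every point and every centroid: (n, k).
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalize each row so it is a probability distribution over clusters.
    return q / q.sum(axis=1, keepdims=True)


def target_distribution(q):
    """Sharpen q by squaring it and dividing by the soft cluster frequencies."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)


# Toy embedding: two tight groups around two centroids.
z = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
q = soft_assignment(z, centroids)
p = target_distribution(q)
print("hard labels:", q.argmax(axis=1))
```

In the published algorithm the target distribution is refreshed only
periodically rather than at every step; `update_interval` presumably plays that
role here, although that is an assumption about the project's implementation.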



## 5. Train IDEC

The IDEC workflow reuses existing autoencoder weights when `idec.pretrained` is
set, pretrains otherwise, and then optimizes an augmented objective. Reduce
`epochs` or `maxiter` if you are exploring interactively.


In [None]:

from IDEC import IDEC

dims = [x.shape[1], 500, 500, 2000, 10]
idec_save_dir = results_root / 'idec'
idec = IDEC(dims=dims, n_clusters=10, save_dir=str(idec_save_dir))

if not idec.pretrained:
    idec.pretrain(x, epochs=50, batch_size=256)

idec.compile(optimizer='sgd')
idec_labels = idec.fit(x, y=y, maxiter=8000, update_interval=200, batch_size=256)
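
For reference, the augmented objective in the published IDEC formulation keeps
the autoencoder's reconstruction loss alongside the clustering loss. The
notation below follows that formulation, not necessarily the project's code:
$f$ is the encoder, $g$ the decoder, $q_{ij}$ the soft assignment of sample $i$
to cluster $j$, $p_{ij}$ the target distribution, and $\gamma > 0$ a weighting
coefficient.

$$
L = L_r + \gamma L_c, \qquad
L_r = \sum_{i=1}^{n} \lVert x_i - g(f(x_i)) \rVert_2^2, \qquad
L_c = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

Keeping $L_r$ in the objective is what distinguishes IDEC from DEC, which drops
the decoder after pretraining and optimizes $L_c$ alone.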



## 6. Run DBSCAN baseline

DBSCAN provides a density-based baseline that needs neither neural network
training nor a preset number of clusters. Points that do not belong to any
cluster are labeled `-1` (noise). Tune `eps` and `min_samples` for your data.


In [None]:

from DBSCANModel import DBSCANClustering

# Hyperparameters are sensitive to feature scaling; adjust as necessary.
dbscan = DBSCANClustering(eps=0.5, min_samples=5, scale=True)

dbscan_labels = dbscan.fit(x)
print('Unique cluster labels (including noise = -1):', sorted(set(dbscan_labels.tolist())))

if y is not None:
    metrics = dbscan.evaluate(y)
    print('DBSCAN metrics (noise excluded):', metrics)
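
When tuning `eps` and `min_samples`, the share of points marked as noise and the
cluster sizes are often more informative than the raw label list. The sketch
below summarizes any label sequence that uses `-1` for noise; `summarize_labels`
is a hypothetical helper, not part of `DBSCANClustering`.

```python
from collections import Counter


def summarize_labels(labels, noise_label=-1):
    """Count clusters, cluster sizes, and the fraction of noise points."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(noise_label, 0)
    total = n_noise + sum(counts.values())
    return {
        "n_clusters": len(counts),
        "cluster_sizes": dict(sorted(counts.items())),
        "noise_fraction": n_noise / total if total else 0.0,
    }


# Toy labels: two clusters plus two noise points.
print(summarize_labels([0, 0, 1, -1, -1, 1, 0]))
```

In the notebook, `summarize_labels(dbscan_labels)` gives the same summary for
the fitted model. A noise fraction close to 1 usually suggests that `eps` is too
small for the scaled features.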



## 7. Inspect generated artifacts

Use the helper below to list the files produced during DEC and IDEC training,
with their sizes in KB. The paths are relative to `results_root` inside Kaggle's
working directory, so you can add selected outputs (e.g., final weights or CSV
logs) to the notebook results section for download.


In [None]:
for folder in ['dec', 'idec']:
    target = results_root / folder
    if not target.exists():
        continue
    print(f'\nArtifacts for {folder.upper()}:')
    for path in sorted(target.glob('**/*')):
        if path.is_file():
            size_kb = path.stat().st_size / 1024
            print(f'  {path.relative_to(results_root)} ({size_kb:.1f} KB)')
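
If you want to download everything at once, one option is to bundle the results
folder into a single zip file. This is a sketch using the standard library; the
archive name `results_archive` is an arbitrary choice, and the archive is
written next to the results folder (not inside it) so that it does not include
itself.

```python
import shutil
from pathlib import Path


def archive_results(results_dir, archive_stem):
    """Zip results_dir into '<archive_stem>.zip' and return the archive path."""
    archive = shutil.make_archive(str(archive_stem), "zip", root_dir=str(results_dir))
    return Path(archive)


# Only runs when the default results folder from this notebook exists.
default_results = Path("/kaggle/working/results")
if default_results.is_dir():
    print("Wrote", archive_results(default_results, "/kaggle/working/results_archive"))
```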