Code for Kaggle Jigsaw challenge
scripts
templates
.gitignore
.helmignore
Chart.yaml
README.md
infrastructure.tf
model.ipynb
model_local.py
model_param.yaml
requirements.txt
values.yaml


I modified the Helm chart that installs Dask on a Kubernetes cluster to add support for nodeSelectors. This lets you place Jupyter and the Dask scheduler in a separate node pool from the Dask workers, so you can scale the workers up or down without inadvertently killing your Jupyter instance.

Use kubectl cp to copy files to/from the Jupyter instance.

I used Dask specifically to parallelize hyperparameter tuning. The dask-searchcv package provides Dask-backed drop-in implementations of sklearn’s GridSearchCV and RandomizedSearchCV classes.
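
As a rough illustration of that workflow, here is a minimal sketch of a grid search run through dask-searchcv. The estimator, the parameter grid, and the names `PARAM_GRID`, `count_candidates`, and `run_search` are assumptions made up for this example; they are not taken from model.ipynb or model_param.yaml.

```python
from itertools import product

# Hypothetical search space for illustration only.
PARAM_GRID = {
    "C": [0.1, 1.0, 10.0],
    "penalty": ["l1", "l2"],
}


def count_candidates(param_grid):
    """Return how many parameter combinations a full grid search will try."""
    return len(list(product(*param_grid.values())))


def run_search(X, y, scheduler_address=None):
    """Fit a grid search through Dask and return the best parameters found.

    Third-party imports live inside the function so this module can be
    loaded even where dask, dask-searchcv, and scikit-learn are missing.
    """
    from dask.distributed import Client
    from dask_searchcv import GridSearchCV  # same interface as sklearn's class
    from sklearn.linear_model import LogisticRegression

    # Connect to the Dask scheduler; the address is supplied by the caller.
    client = Client(scheduler_address)

    # liblinear supports both the l1 and l2 penalties listed in the grid.
    estimator = LogisticRegression(solver="liblinear")

    # Passing the client as the scheduler sends the fits to the cluster.
    search = GridSearchCV(estimator, PARAM_GRID, cv=3, scheduler=client)
    search.fit(X, y)
    return search.best_params_
```

With this grid, `count_candidates(PARAM_GRID)` gives 6 combinations, each fitted once per cross-validation fold. Calling `run_search(X, y, address)` from the Jupyter instance, with `address` set to your scheduler's address, would spread those fits across the Dask workers.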

Initialize cluster, scale up/down, and destroy:

time source scripts/initialize.sh
time source scripts/scale.sh <num_nodes> <num_workers>
time source scripts/destroy.sh

Copy models/params to/from Jupyter:

export JUPYTER_POD=$(kubectl get pods --selector=component=jupyter -o jsonpath='{.items[0].metadata.name}')
kubectl cp model.ipynb $JUPYTER_POD:model.ipynb
kubectl cp $JUPYTER_POD:model.ipynb model.ipynb
kubectl cp $JUPYTER_POD:model_param.yaml model_param.yaml