The quickstart assumes you have access to a Kubeflow Pipelines deployment. Pipelines can be deployed on any Kubernetes cluster, including a local cluster.
It is good practice to create a new virtual environment before installing new packages. Use the virtualenv command to create a new environment and activate it:
$ virtualenv venv-demo
created virtual environment CPython3.8.5.final.0-64 in 145ms
creator CPython3Posix(dest=/home/mario/kedro/venv-demo, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/mario/.local/share/virtualenv)
added seed packages: pip==20.3.1, setuptools==51.0.0, wheel==0.36.2
activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
$ source venv-demo/bin/activate
Then install kedro (required to clone the starter project), along with the latest versions of the kedro-kubeflow plugin and kedro-docker (required to build Docker images with the Kedro pipeline nodes):
$ pip install 'kedro<0.17' kedro-kubeflow kedro-docker
With the dependencies in place, let's create a new project:
$ kedro new --starter=git+https://github.com/getindata/kedro-starter-spaceflights.git --checkout allow_nodes_with_commas
Project Name:
=============
Please enter a human readable name for your new project.
Spaces and punctuation are allowed.
[New Kedro Project]: Kubeflow Plugin Demo
Repository Name:
================
Please enter a directory name for your new project repository.
Alphanumeric characters, hyphens and underscores are allowed.
Lowercase is recommended.
[kubeflow-plugin-demo]:
Python Package Name:
====================
Please enter a valid Python package name for your project package.
Alphanumeric characters and underscores are allowed.
Lowercase is recommended. Package name must start with a letter or underscore.
[kubeflow_plugin_demo]:
Change directory to the project generated in /home/mario/kedro/kubeflow-plugin-demo
A best-practice setup includes initialising git and creating a virtual environment before running `kedro install` to install project-specific dependencies. Refer to the Kedro documentation: https://kedro.readthedocs.io/
TODO: switch to the official spaceflights starter after https://github.com/quantumblacklabs/kedro-starter-spaceflights/pull/10 is merged
Finally, go to the demo project directory and ensure that the kedro-kubeflow plugin is activated:
$ cd kubeflow-plugin-demo/
$ kedro install
(...)
Requirements installed!
$ kedro kubeflow --help
Usage: kedro kubeflow [OPTIONS] COMMAND [ARGS]...
Interact with Kubeflow Pipelines
Options:
-e, --env TEXT Environment to use.
-h, --help Show this message and exit.
Commands:
compile Translates Kedro pipeline into YAML file with Kubeflow...
init Initializes configuration for the plugin
list-pipelines List deployed pipeline definitions
mlflow-start
run-once Deploy pipeline as a single run within given experiment.
schedule Schedules recurring execution of latest version of the...
ui Open Kubeflow Pipelines UI in new browser tab
upload-pipeline Uploads pipeline to Kubeflow server
First, initialize the project with the kedro-docker configuration by running:
$ kedro docker init
This command creates several files, including .dockerignore. This file ensures that transient files are not included in the Docker image, and it requires a small adjustment. Open it in your favourite text editor and extend the section # except the following by adding:
!data/01_raw
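After the edit, the relevant part of .dockerignore should look roughly like this (the other ! entries are generated by kedro-docker and may vary between versions, so keep whatever is already there):

# except the following
!data/01_raw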
This change ensures that the raw data is included in the image. Also, one of the limitations of running a Kedro pipeline on Kubeflow (as opposed to a local environment) is the inability to use MemoryDataSets, as the pipeline nodes do not share memory, so every artifact must be stored as a file. The spaceflights demo configures four datasets as in-memory, so let's change this behaviour by adding these lines to conf/base/catalog.yml:
X_train:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/X_train.pickle
  layer: model_input

y_train:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/y_train.pickle
  layer: model_input

X_test:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/X_test.pickle
  layer: model_input

y_test:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/y_test.pickle
  layer: model_input
Finally, build the image:
$ kedro docker build
When the execution finishes, your Docker image is ready. If you don't use a local cluster, you should push the image to a remote repository:
$ docker tag kubeflow_plugin_demo:latest remote.repo.url.com/kubeflow_plugin_demo:latest
$ docker push remote.repo.url.com/kubeflow_plugin_demo:latest
First, run the init command to create a sample configuration. The parameter value should reflect the Kubeflow base path as seen from your system (so no internal Kubernetes IP, unless you run a local cluster):
$ kedro kubeflow init https://kubeflow.cluster.com
(...)
Configuration generated in /home/mario/kedro/kubeflow-plugin-demo/conf/base/kubeflow.yaml
Then, if needed, adjust conf/base/kubeflow.yaml. For example, the image: key should point to the full image name (like remote.repo.url.com/kubeflow_plugin_demo:latest if you pushed the image under this name). Depending on the storage classes available in your Kubernetes cluster, you may want to modify volume.storageclass and volume.access_modes (consult your Kubernetes admin about which values to use).
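For orientation, a trimmed kubeflow.yaml with these keys adjusted could look like the sketch below (the exact set of generated keys depends on the plugin version, and every value here is a placeholder for your own setup):

host: https://kubeflow.cluster.com

run_config:
  # full image name; include the remote repository if you pushed the image there
  image: remote.repo.url.com/kubeflow_plugin_demo:latest
  experiment_name: Kubeflow Plugin Demo
  volume:
    # ask your Kubernetes admin which storage class and access modes the cluster offers
    storageclass: standard
    access_modes: [ReadWriteOnce]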
Finally, everything is set to run the pipeline on Kubeflow. Run upload-pipeline:
$ kedro kubeflow upload-pipeline
2021-01-12 09:47:35,132 - kedro_kubeflow.kfpclient - INFO - No IAP_CLIENT_ID provided, skipping custom IAP authentication
2021-01-12 09:47:35,209 - kedro_kubeflow.kfpclient - INFO - Pipeline created
2021-01-12 09:47:35,209 - kedro_kubeflow.kfpclient - INFO - Pipeline link: https://kubeflow.cluster.com/#/pipelines/details/9a3e4e16-1897-48b5-9752-d350b1d1faac/version/9a3e4e16-1897-48b5-9752-d350b1d1faac
As you can see, the pipeline was compiled and uploaded to Kubeflow. Let's visit the link:
The Kubeflow pipeline reflects the Kedro pipeline, with two extra steps:
- data-volume-create - creates an empty volume in the Kubernetes cluster as a persistence layer for inter-step data access
- data-volume-init - initializes the volume with 01_raw data when the pipeline starts
Using the Create run button, you can start a run of the pipeline on the cluster. A run behaves like the kedro run command, but the steps are executed on the remote cluster. The outputs are stored on the persistent volume and passed as inputs according to how the Kedro nodes need them.
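If you prefer the terminal, the run-once command listed in the plugin help above does the same thing, deploying the pipeline as a single run within the given experiment:

$ kedro kubeflow run-once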
From the UI you can access the logs of the execution. If everything looks fine, use schedule to create a recurring run:
$ kedro kubeflow schedule --cron-expression '0 0 4 * * *'
(...)
2021-01-12 12:37:23,086 - kedro_kubeflow.kfpclient - INFO - No IAP_CLIENT_ID provided, skipping custom IAP authentication
2021-01-12 12:37:23,096 - root - INFO - Creating experiment Kubeflow Plugin Demo.
2021-01-12 12:37:23,103 - kedro_kubeflow.kfpclient - INFO - New experiment created: 2123c082-b336-4093-bf3f-ce73f68b66b4
2021-01-12 12:37:23,147 - kedro_kubeflow.kfpclient - INFO - Pipeline scheduled to 0 0 4 * * *
You can see that a new experiment was created (it will group the runs) and the pipeline was scheduled. Please note that Kubeflow uses a 6-field cron expression (as opposed to Linux's 5-field cron), where the first field is the seconds indicator.
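For illustration, this is how the fields of the expression used above decompose (a standard cron field diagram, extended with the leading seconds field):

# ┌───────────── second (0-59)
# │ ┌─────────── minute (0-59)
# │ │ ┌───────── hour (0-23)
# │ │ │ ┌─────── day of the month (1-31)
# │ │ │ │ ┌───── month (1-12)
# │ │ │ │ │ ┌─── day of the week (0-6)
# │ │ │ │ │ │
  0 0 4 * * *

so '0 0 4 * * *' triggers the pipeline every day at 04:00:00.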