# Intro to MLOps using ZenML

## 🌍 Overview

This repository is a minimalistic MLOps project intended as a starting point for learning how to put ML workflows into production. It features:

- A very simple training pipeline that loads a dataset and trains a model.

Within this notebook we will show you how simple it is to switch where your code runs and where your data is stored. You will also learn how all the metadata of your run is stored and accessible through ZenML.

Follow along with this notebook to understand how you can use ZenML to productionize your ML workflows!

<img src=".assets/pipeline_overview.png" width="50%" alt="Pipelines Overview">

## Run on Colab

You can use Google Colab to see ZenML in action, no installation
required!

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](
https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/quickstart/quickstart.ipynb)

## 👶 Step 0: Install Requirements

Let's install ZenML to get started.

In [1]:
!pip install "zenml[server]" pyarrow
!zenml integration install sklearn -y

In [1]:
from zenml.environment import Environment

# In case we are in a google colab, clone all additional relevant files
if Environment.in_google_colab():
    # Pull required modules from this example
    !git clone -b main https://github.com/zenml-io/zenml
    !cp -r zenml/examples/quickstart/* .
    !rm -rf zenml

In [7]:
# Restart Kernel to ensure all libraries are properly loaded
import IPython
IPython.Application.instance().kernel.do_shutdown(restart=True)

{'status': 'ok', 'restart': True}


Please wait for the installation to complete before running subsequent cells. At
the end of the installation, the notebook kernel will automatically restart.

## ☁️ Step 1: Connect to your ZenML Server
To run this quickstart you need to connect to a ZenML Server. You can deploy it [yourself](https://docs.zenml.io/getting-started/deploying-zenml) or try our [ZenML Pro managed service](https://zenml.io/pro) for free, no credit card required.

In [1]:
zenml_server_url = "https://1cf18d95-zenml.cloudinfra.zenml.io"  # in the form "https://URL_TO_SERVER"

!zenml connect --url $zenml_server_url

Error: You're trying to connect to a remote ZenML server but already have a local server running. This can lead to unexpected behavior. Please shut down the local server by running `zenml down` before connecting to a remote server.


In [1]:
# Initialize ZenML and set the default stack
!zenml init

!zenml stack set default

Found existing ZenML repository at path '/home/alexej/PycharmProjects/zenml/examples/quickstart'.
Active repository stack set to: 'default'

Here, the default stack means that the code runs on the machine that is running this notebook and that all output data is stored there as well.

In [1]:
# Do the imports at the top
import random
from typing import List, Optional
from uuid import UUID

import pandas as pd
from sklearn.datasets import load_breast_cancer
from typing_extensions import Annotated
from zenml import Model, get_step_context, pipeline, step
from zenml.client import Client
from zenml.logger import get_logger

# Steps defined in the local `steps` module of this example
from steps import (
    data_loader,
    data_preprocessor,
    data_splitter,
    inference_preprocessor,
    model_evaluator,
    model_trainer,
)

logger = get_logger(__name__)

# Initialize the ZenML client to fetch objects from the ZenML Server
client = Client()

## 🥇 Step 2: Run your first pipeline

We'll start off by importing our data and training a simple ML model. In this quickstart we'll be working with
[the Breast Cancer](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic) dataset,
which is publicly available on the UCI Machine Learning Repository. The task is a classification
problem: predict whether a patient is diagnosed with breast cancer or not.

When you're getting started with a machine learning problem, you'll want to break down your code into distinct functions that load your data, bring it into the correct shape, and finally produce a model. Here is our first function.

In [2]:
@step
def data_loader_simplified(
    random_state: int, is_inference: bool = False, target: str = "target"
) -> Annotated[pd.DataFrame, "dataset"]:  # We name the dataset 
    """Dataset reader step."""
    dataset = load_breast_cancer(as_frame=True)
    inference_size = int(len(dataset.target) * 0.05)
    dataset: pd.DataFrame = dataset.frame
    inference_subset = dataset.sample(inference_size, random_state=random_state)
    if is_inference:
        dataset = inference_subset
        dataset.drop(columns=target, inplace=True)
    else:
        dataset.drop(inference_subset.index, inplace=True)
    dataset.reset_index(drop=True, inplace=True)
    logger.info(f"Dataset with {len(dataset)} records loaded!")
    return dataset


The whole function is decorated with ZenML's `@step` decorator. Once this step is added to a pipeline, ZenML will automatically version, track, and cache the data that is produced by this function as an `artifact`. This enables you to
reproduce your data at any point in the future, even if the original data source
changes or disappears.
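
To make the caching idea more concrete, here is a toy sketch in plain Python. It is not ZenML's implementation: the `cached_step` decorator and the `_cache` dictionary are hypothetical names, and the sketch only assumes that an output can be reused when the same function is called again with the same inputs.

```python
import hashlib
import json
from functools import wraps

# Hypothetical in-memory cache: maps a hash of (function name, inputs) to an output.
_cache = {}


def cached_step(func):
    """Toy decorator that reuses a result when the inputs are unchanged."""

    @wraps(func)
    def wrapper(**kwargs):
        # Build a stable key from the function name and its keyword arguments.
        payload = json.dumps(
            {"step": func.__name__, "inputs": kwargs}, sort_keys=True
        )
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in _cache:
            _cache[key] = func(**kwargs)
        return _cache[key]

    return wrapper


calls = []  # records every real execution of the function below


@cached_step
def load_numbers(n: int) -> list:
    calls.append(n)
    return list(range(n))


first = load_numbers(n=3)
second = load_numbers(n=3)  # same inputs: served from the cache
third = load_numbers(n=4)   # new inputs: executed again
```

In this sketch the second call never reaches the function body, which is the behavior that makes re-running an unchanged step cheap.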

Note the type hints on the function's output. They are not only good practice but also
help ZenML store and load your data appropriately. By using the `Annotated` type hint in the output of the
step, we are also naming the output, which makes
it possible to access it by name later on.
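
If you are curious how a name can travel inside a type hint, the following sketch reads it back with the standard library. It is an illustration of the general mechanism, not of ZenML's internals: `load_values` and `output_name` are hypothetical helpers, and the sketch uses `typing.Annotated` (Python 3.9+) instead of the `typing_extensions` import used in this notebook.

```python
from typing import Annotated, get_args, get_origin, get_type_hints


def load_values() -> Annotated[list, "dataset"]:
    """Stand-in for a step; the string "dataset" names the output."""
    return [1, 2, 3]


def output_name(func) -> str:
    """Return the name attached to a function's return annotation, if any."""
    hints = get_type_hints(func, include_extras=True)
    return_hint = hints.get("return")
    if get_origin(return_hint) is Annotated:
        # get_args yields (underlying_type, *metadata) for Annotated hints.
        _, *metadata = get_args(return_hint)
        return metadata[0]
    return "output"  # hypothetical fallback when no name is given


name = output_name(load_values)
print(name)
```

A framework can inspect a decorated function in this way to learn both the output type and the label you chose for it.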

ZenML is built in a way that allows you to experiment with your data and build
your pipelines one step at a time. If you want to see how this function
works, you can just call it directly. Here we take a look at the first few rows
of the training dataset.

In [3]:
df = data_loader_simplified(random_state=42)
df.head()

Dataset with 541 records loaded!


| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


Everything looks as we'd expect and the values are all in the right format 🥳.
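
The record count in the log can be checked by hand. The sketch below assumes that scikit-learn's breast cancer dataset contains 569 rows and mirrors the 5% hold-out computed in `data_loader_simplified`.

```python
total_rows = 569  # assumption: size of scikit-learn's breast cancer dataset

# Same rule as in the step: 5% of the rows are set aside for inference.
inference_size = int(total_rows * 0.05)

# With is_inference=False, the held-out rows are dropped from the dataset.
training_rows = total_rows - inference_size

print(inference_size, training_rows)
```

Under that assumption, 28 rows are held out for inference and the remaining 541 rows match the "Dataset with 541 records loaded!" message.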

We're now at the point where we can bring this step (and some others) together into a single
pipeline. To do this, simply plug multiple steps together through their inputs and outputs.
Then just add the `@pipeline` decorator to the function that connects the steps.

In [4]:
@pipeline(model=Model(name="demo", description="Show case Model Control Plane."), enable_cache=False)
def training_pipeline(
    test_size: float = 0.3,
    drop_na: Optional[bool] = None,
    normalize: Optional[bool] = None,
    drop_columns: Optional[List[str]] = None,
    target: Optional[str] = "target",
    random_state: int = 17,
    model_type: Optional[str] = "sgd"
):
    """Training pipeline: load, split, preprocess, train, and evaluate."""
    # Link all the steps together by calling them and passing the output
    # of one step as the input of the next step.
    raw_data = data_loader(random_state=random_state, target=target)
    dataset_trn, dataset_tst = data_splitter(
        dataset=raw_data,
        test_size=test_size,
    )
    dataset_trn, dataset_tst, _ = data_preprocessor(
        dataset_trn=dataset_trn,
        dataset_tst=dataset_tst,
        drop_na=drop_na,
        normalize=normalize,
        drop_columns=drop_columns,
        target=target,
        random_state=random_state,
    )
    trained_model = model_trainer(
        dataset_trn=dataset_trn,
        model_type=model_type,
    )

    acc = model_evaluator(
        model=trained_model,
        dataset_trn=dataset_trn,
        dataset_tst=dataset_tst,
        target=target,
    )

    return acc

We're ready to run the pipeline now, which we can do just as with the step - by calling the
pipeline function itself:

In [5]:
pipeline_obj = training_pipeline(model_type="sgd")

Initiating a new run for the pipeline: training_pipeline.
New model version 4 was created.
Models can be viewed in the dashboard using ZenML Pro. Sign up for a free trial at https://www.zenml.io/pro/
Executing a new run.
Caching is disabled by default for training_pipeline.
Using user: default
Using stack: default
  orchestrator: default
  artifact_store: default
Dashboard URL: http://127.0.0.1:8237/runs/85e35890-16f8-4f96-ad88-b257d88b9ce9
Step data_loader has started.
Dataset with 541 records loaded!
Step data_loader has finished in 0.520s.
Step data_loader completed successfully.
Step dat

As you can see, the pipeline has run successfully. Let's check this out by following the Dashboard URL that you can find in the logs above.

We can also fetch the pipeline from the server and view the results directly in the notebook:

In [7]:
client = Client()
run = client.get_pipeline("training_pipeline").last_run
print(run.name)

training_pipeline-2024_07_30-12_38_42_355398
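
The run name printed above appears to combine the pipeline name with a timestamp. The sketch below parses it under that assumption; the naming pattern is inferred from this single output rather than from a documented guarantee, and the split would need adjusting for pipeline names that contain a hyphen.

```python
from datetime import datetime

run_name = "training_pipeline-2024_07_30-12_38_42_355398"

# Split off the pipeline name; the remainder is assumed to be a timestamp
# in the form year_month_day-hour_minute_second_microsecond.
pipeline_name, timestamp = run_name.split("-", 1)
started_at = datetime.strptime(timestamp, "%Y_%m_%d-%H_%M_%S_%f")

print(pipeline_name, started_at.isoformat())
```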


We can also see the data artifacts that were produced by the `data_preprocessor` step of the pipeline:

In [None]:
run.steps["data_preprocessor"].outputs

In [None]:
# Read one of the training datasets
run.steps["data_preprocessor"].outputs["dataset_trn"].load()

## ⌚ Step 3: Run the same pipeline on your cloud

## Congratulations!

You're a legit MLOps engineer now! You have created a training pipeline and you
have deployed it into a production-ready environment with the compute of your
choice. You have also gotten the hang of the ZenML Dashboard.

## Further exploration

This was just the tip of the iceberg of what ZenML can do; check out the [**docs**](https://docs.zenml.io/) to learn more
about the capabilities of ZenML. For example, you might want to:

- [Deploy ZenML](https://docs.zenml.io/user-guide/production-guide/connect-deployed-zenml) to collaborate with your colleagues.
- Run the same pipeline on a [cloud MLOps stack in production](https://docs.zenml.io/user-guide/production-guide/cloud-stack).
- Track your metrics in an experiment tracker like [MLflow](https://docs.zenml.io/stacks-and-components/component-guide/experiment-trackers/mlflow).

## What next?

* If you have questions or feedback... join our [**Slack Community**](https://zenml.io/slack) and become part of the ZenML family!
* If you want to quickly get started with ZenML, check out [ZenML Pro](https://zenml.io/pro).