# Lesson 1.2: Artifact Lineage

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zenml-io/zenbytes/blob/main/1-2_Artifact_Lineage.ipynb)

***Key Concepts:*** *Artifacts, Artifact Stores, Metadata, Versioning, Caching*

In this lesson we will learn about one of the coolest features of ML pipelines: automated artifact versioning and tracking. This will give us tremendous insights into how exactly each of our models was created. Furthermore, it enables artifact caching, allowing us to switch out parts of our ML pipelines without having to rerun any previous steps.

This notebook requires the ZenML [Dash](https://dash.plotly.com/introduction) integration. If you have not installed it yet, run the following cell, which installs it and then restarts the kernel of your notebook.

In [None]:
%pip install zenml
!zenml integration install sklearn dash -y
%pip install pyparsing==2.4.2  # required for Colab

import IPython

# automatically restart kernel
IPython.Application.instance().kernel.do_shutdown(restart=True)

**Colab Note:** On Colab, you need an [ngrok account](https://dashboard.ngrok.com/signup) to view some of the visualizations later. Please set up an account, then set your user token below:

In [None]:
NGROK_TOKEN = ""  # TODO: set your ngrok token if you are working on Colab

In [None]:
# COLAB ONLY setup
try:
    import google.colab

    IN_COLAB = True

    # clone zenbytes repo to get source code of previous lessons
    !git clone https://github.com/zenml-io/zenbytes.git  # noqa
    !mv zenbytes/steps .
    !mv zenbytes/pipelines .

    # install pyngrok and register the ngrok token (port 8050 is exposed later)
    !pip install pyngrok
    !ngrok authtoken {NGROK_TOKEN}

except ModuleNotFoundError:
    IN_COLAB = False

Before we dive into any versioning and caching, let's clarify what exactly **[Artifacts](https://docs.zenml.io/mlops-stacks/artifact-stores)** are. To illustrate, let us first rebuild our digits pipeline from the previous chapter:

In [None]:
from zenml.pipelines import pipeline

from steps.evaluator import evaluator
from steps.importer import importer
from steps.sklearn_trainer import svc_trainer


@pipeline
def digits_pipeline(importer, trainer, evaluator):
    """Links all the steps together in a pipeline"""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)

The artifacts of this pipeline are simply the local variables we defined: `X_train`, `X_test`, `y_train`, `y_test`, and `model`. These make up the data that flows in and out of our steps. Artifacts are at the core of our pipelines, and the pipeline definition above just defines which artifact is the input or output of which step.
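To make the idea of lineage concrete, here is a minimal plain-Python sketch that writes down the step/artifact relationships of `digits_pipeline` by hand and then asks which artifacts `model` depends on. The `PIPELINE_STEPS` mapping and both helper functions are hypothetical names made up for this illustration; this is not how ZenML represents pipelines internally.

```python
# Hypothetical, hand-written description of digits_pipeline:
# step name -> (input artifacts, output artifacts).
# Only artifacts that appear in the pipeline definition are listed.
PIPELINE_STEPS = {
    "importer": ((), ("X_train", "X_test", "y_train", "y_test")),
    "trainer": (("X_train", "y_train"), ("model",)),
    "evaluator": (("X_test", "y_test", "model"), ()),
}


def producer_of(artifact):
    """Return the name of the step that outputs the given artifact."""
    for step_name, (_, outputs) in PIPELINE_STEPS.items():
        if artifact in outputs:
            return step_name
    raise KeyError(f"No step produces {artifact!r}")


def upstream_artifacts(artifact):
    """Collect every artifact that the given artifact depends on."""
    inputs, _ = PIPELINE_STEPS[producer_of(artifact)]
    lineage = set(inputs)
    for parent in inputs:
        lineage |= upstream_artifacts(parent)
    return lineage


print(producer_of("model"))                 # trainer
print(sorted(upstream_artifacts("model")))  # ['X_train', 'y_train']
```

Answering "which data and which step produced this model?" is exactly the kind of question that artifact lineage is meant to make easy.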

## Pipeline Visualization with Dash

To visualize how the steps connect the different artifacts, we can view our pipeline with ZenML's [Dash](https://dash.plotly.com/introduction) integration. 

Run the following code, then open http://127.0.0.1:8050 in your browser.

In [None]:
digits_svc_pipeline = digits_pipeline(
    importer=importer(), trainer=svc_trainer(), evaluator=evaluator()
)
digits_svc_pipeline.run(unlisted=True)

In [None]:
def start_pipeline_visualizer():
    if IN_COLAB:
        from pyngrok import ngrok

        public_url = ngrok.connect(8050)
        print(f"\x1b[31mIn Colab, use this URL instead: {public_url}!\x1b[0m")

    from zenml.integrations.dash.visualizers.pipeline_run_lineage_visualizer import (
        PipelineRunLineageVisualizer,
    )
    from zenml.post_execution import get_unlisted_runs

    latest_run = get_unlisted_runs()[-1]
    PipelineRunLineageVisualizer().visualize(latest_run)


start_pipeline_visualizer()

**Note:** If you're running on Colab, you will not be able to access the regular Dash link. Instead, use the `ngrok.io` link printed above!

You should now see an interactive visualization in your browser, as shown below. The squares represent your artifacts and the circles your pipeline steps. Also, note that the different nodes are color-coded, so if your pipeline ever fails or runs for too long, you can find the responsible step at a glance!

![Dash Visualization](_assets/1-2/dash_initial.png)

## Artifact Caching

As mentioned in the beginning, tracking exactly which artifacts went into which steps allows us to cache and reuse artifacts. Let's see this in action: First, stop the execution of the last notebook cell if it is still running. Then, execute the next cell to rerun our pipeline and visualize it with Dash again.

In [None]:
digits_svc_pipeline.run(unlisted=True)

In [None]:
start_pipeline_visualizer()

You should now see a visualization as shown below. Note that all nodes in the graph are now green. This means that no step was rerun: all outputs were reused from the cache of our previous run.

![Dash Visualization Cached](_assets/1-2/dash_cached.png)

Let's now replace the SVC model in our ML pipeline with a decision tree and see what happens.

In [None]:
import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.tree import DecisionTreeClassifier
from zenml.steps import step


@step()
def tree_trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train a sklearn decision tree classifier."""
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)
    return model


# redefine and rerun our pipeline, this time with tree_trainer()
digits_tree_pipeline = digits_pipeline(
    importer=importer(), trainer=tree_trainer(), evaluator=evaluator()
)
digits_tree_pipeline.run(unlisted=True)

In [None]:
start_pipeline_visualizer()

The visualization should now look as shown below. Since we changed the trainer, the corresponding node and all subsequent nodes are now blue again, meaning they were rerun and the artifacts were freshly created. However, note how the input data artifacts are still green. They did not have to be recreated. In an actual production setting, this might save us a tremendous amount of time and resources as those data artifacts might have resulted from some complex, expensive preprocessing job.

![Dash Visualization Partly Cached](_assets/1-2/dash_partly_cached.png)
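The behavior we just observed can be imitated with a small toy cache. The sketch below is an assumption-laden illustration, not ZenML's actual caching logic: it keys each step execution on a fingerprint of the step's code and of its input artifacts, so a step is only executed when one of those changes. All names (`run_step`, `artifact_id`, the `toy_*` steps) are hypothetical.

```python
import hashlib

CACHE = {}  # cache key -> output artifact
LOG = []    # (step name, "cached" or "executed") for every step call


def artifact_id(value):
    """Fingerprint an artifact by hashing its repr (toy stand-in)."""
    return hashlib.sha256(repr(value).encode()).hexdigest()


def run_step(step_fn, *inputs):
    """Execute step_fn unless the same code already ran on the same inputs."""
    # Name plus compiled bytecode identifies the step's code
    # (constants inside the function are ignored in this toy).
    code_id = step_fn.__qualname__ + step_fn.__code__.co_code.hex()
    key = hashlib.sha256(
        "|".join([code_id, *map(artifact_id, inputs)]).encode()
    ).hexdigest()
    if key in CACHE:
        LOG.append((step_fn.__name__, "cached"))
    else:
        CACHE[key] = step_fn(*inputs)
        LOG.append((step_fn.__name__, "executed"))
    return CACHE[key]


def toy_importer():
    return [1, 2, 3, 4]


def toy_mean_trainer(data):
    return sum(data) / len(data)


def toy_max_trainer(data):
    return max(data)


def toy_evaluator(data, model):
    return sum(abs(x - model) for x in data)


def run_toy_pipeline(trainer):
    data = run_step(toy_importer)
    model = run_step(trainer, data)
    return run_step(toy_evaluator, data, model)


run_toy_pipeline(toy_mean_trainer)  # first run: every step executes
run_toy_pipeline(toy_mean_trainer)  # identical rerun: every step is cached
run_toy_pipeline(toy_max_trainer)   # new trainer: only the importer is cached
print(LOG)
```

In the third run the evaluator is executed again even though its code did not change, because one of its inputs (the model) is a new artifact. This mirrors the blue trainer and evaluator nodes in the visualization above.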


## Artifact Storage

You might now wonder how our ML pipelines can keep track of which artifacts changed and which did not. This requires several additional MLOps components that you would typically have to set up and configure yourself. Luckily, ZenML automatically set this up for us.

Under the hood, all the artifacts in our ML pipeline are automatically stored in an [Artifact Store](https://docs.zenml.io/mlops-stacks/artifact-stores). By default, this is simply a place in your local file system, but we could also configure ZenML to store this data in a cloud bucket like [Amazon S3](https://aws.amazon.com/s3/) or any other place instead. We will see this in more detail when we migrate our MLOps stack to the cloud in a later chapter.
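To get a feeling for what "a place in your local file system" means, here is a minimal sketch of a toy artifact store that pickles each artifact into its own uniquely named file. The class `ToyArtifactStore`, its directory layout, and the example model dictionary are all assumptions made for illustration; ZenML's real artifact stores use their own layout and serialization.

```python
import pickle
import tempfile
import uuid
from pathlib import Path


class ToyArtifactStore:
    """Persist Python objects as pickle files under a root directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, step_name, output_name, value):
        """Write one artifact and return the path that identifies it."""
        path = self.root / step_name / output_name / f"{uuid.uuid4().hex}.pkl"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(pickle.dumps(value))
        return path

    def load(self, path):
        """Read an artifact back from the path returned by save()."""
        return pickle.loads(Path(path).read_bytes())


# Use a temporary directory as the (hypothetical) store location.
store = ToyArtifactStore(tempfile.mkdtemp(prefix="toy_artifact_store_"))
model_path = store.save("trainer", "model", {"kernel": "rbf", "gamma": 0.001})
print(model_path)
print(store.load(model_path))
```

Because every saved artifact gets a fresh file name, older versions are never overwritten, which is the property that makes versioning and lineage tracking possible. Swapping the local directory for a cloud bucket would only change where `save` and `load` read and write.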

You can run the following command to find out where exactly your artifacts are currently stored:

In [None]:
!zenml artifact-store describe

### ZenML MLOps Stacks

Artifact stores, together with orchestrators, are the backbone of any ZenML MLOps stack. You can see a list of all components in your current MLOps stack using the following command:

In [None]:
!zenml stack describe

## Orchestrators

The [Orchestrator](https://docs.zenml.io/mlops-stacks/orchestrators) is the component that defines how and where each pipeline step is executed when calling `pipeline.run()`. This component is not of much interest to us right now, but we will learn more about it in later chapters, e.g., to run our pipelines on a [Kubernetes](https://kubernetes.io/) cluster with a [Kubeflow](https://www.kubeflow.org/) orchestrator.

![Local MLOps Stack](_assets/1-2/localstack.png)

We will add several more components to our MLOps stack throughout the subsequent chapters, including model deployment tools, experiment trackers, data and model monitoring tools, and more. Let's start with experiment tracking in the [next lesson](2-1_Experiment_Tracking.ipynb).