# SciPy 2023: Production-grade Machine Learning with Flyte

This workshop will focus on five facets of production-grade data science:

- ⛰️ Scalability
- ✅ Data Quality
- 🔄 Reproducibility
- 🔂 Recoverability
- 🔎 Auditability

### Learning Objectives

- Learn the basic constructs of Flyte: tasks, workflows, and launchplans
- Understand how Flyte orchestrates execution graphs, data, and compute infrastructure
- Work with the building blocks for productionizing data science workloads
- Learn how to test Flyte code, use CI/CD, and extend Flyte

## Introduction to Flyte

### Environment Setup

Follow the setup instructions in the [README](./README.md).

### Flyte Basics

`flytekit` is the Python SDK for Flyte. It's the way data scientists, ML engineers, data engineers, and data analysts write code that will eventually run on a Flyte cluster.

Let's take a look at the [workflows/example_00_intro.py](./workflows/example_00_intro.py) script.

In it, you'll see a simple pipeline that uses the penguins dataset to train a
penguin species classifier. This script introduces three core concepts in Flyte:

- `tasks`: the basic unit of compute in Flyte.
- `workflows`: an execution graph of tasks.
- `launchplans`: a mechanism for executing and reusing workflows.

You can run this workflow locally with:

```bash
python workflows/example_00_intro.py
```

### `pyflyte run`

Run the workflow locally from your terminal with `pyflyte run`:

```bash
pyflyte run \
    workflows/example_00_intro.py training_workflow \
    --hyperparameters '{"C": 0.01}'
```

This is great for the local debugging experience, but what if we want to run this
workflow on an actual Flyte cluster?

`pyflyte run` also supports this use case through the `--remote` flag.

```bash
pyflyte --config ~/.flyte/config-sandbox.yaml \
    run --remote \
    --image ghcr.io/flyteorg/flyte-conference-talks:scipy-2023-latest \
    workflows/example_00_intro.py training_workflow \
    --hyperparameters '{"C": 0.01}'
```

Notice how we're providing three extra flags:
- `--config`: the path to the Flyte config file, which points `pyflyte run`
  to the Flyte cluster endpoint.
- `--remote`: tells `pyflyte run` that we want to run the workflow on
  a Flyte cluster.
- `--image`: the container image that the tasks run in on the cluster.

Once you execute this command, you should see a message that looks like this:

```
Go to http://localhost:30080/console/projects/flytesnacks/domains/development/executions/ff733ed1039b64067a89 to see execution in the console.
```

Here, `ff733ed1039b64067a89` is the ID of the workflow execution.

### Flyte Console


The `flyteconsole` is the UI component of the Flyte stack. It provides a way to visualize workflows, launch them from the browser, and obtain useful metadata about Flyte entities and their corresponding executions.

![Flyte Console](https://raw.githubusercontent.com/flyteorg/static-resources/main/flytesnacks/getting_started/getting_started_console.gif)

Go to the link provided by the `pyflyte run` command to see the execution in the console.

### `FlyteRemote`

```python
from workflows import example_00_intro
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_00_intro.training_workflow,
    inputs={
        "hyperparameters": example_00_intro.Hyperparameters(C=0.1, max_iter=5000),
        "test_size": 0.2,
        "random_state": 11,
    }
)
remote.generate_console_url(execution)
```

```
'http://localhost:30080/console/projects/flytesnacks/domains/development/executions/fda494308661e40fba53'
```

```python
execution = remote.wait(execution)
```

```python
from sklearn.linear_model import LogisticRegression

clf = execution.outputs.get("o0", LogisticRegression)
clf
```

### Scheduling Launchplans

You can activate schedules from the CLI or in a Python runtime:

Using the `flytectl` CLI:

```bash
flytectl update launchplan \
    -p flytesnacks -d development \
    scheduled_training_workflow --version '<version>' --activate
```

Make sure it's activated:

```bash
flytectl get launchplan \
    -p flytesnacks -d development \
    scheduled_training_workflow --output yaml --latest \
    | grep ' state:'
```

Expected output:
```
state: ACTIVE
```

Using `FlyteRemote`:

```python
from workflows.utils import get_remote

remote = get_remote()
lp_id = remote.fetch_launch_plan(name="scheduled_training_workflow").id
remote.client.update_launch_plan(lp_id, "ACTIVE")
print("activated scheduled_training_workflow")
```

Get the execution for the most recent scheduled run:

```python
recent_executions = [
    execution
    for execution in remote.recent_executions()
    if execution.spec.launch_plan.name == "scheduled_training_workflow"
]

scheduled_execution = None
model = None
if recent_executions:
    scheduled_execution = recent_executions[0]

print(scheduled_execution)
```

Deactivate the schedule with the `flytectl` CLI:

```bash
flytectl update launchplan \
    -p flytesnacks -d development \
    scheduled_training_workflow --version '<version>' --archive
```

Or with `FlyteRemote`:

```python
remote.client.update_launch_plan(lp_id, "INACTIVE")
print("deactivated scheduled_training_workflow")
```

## Flyte Programming Model

### Tasks as Containerized Functions

#### Reproducibility

Next, we'll learn about multiple levels of reproducibility:

- **Environment-level reproducibility**: As you can see in the
  [Dockerfile](./Dockerfile), we're containerizing our Flyte application to
  capture a snapshot of all the dependencies that your tasks and workflows rely on.
- **Code-level reproducibility**: In [example_06_reproducibility.py](./workflows/example_06_reproducibility.py)
  we take care of setting a random seed for our model. This is a common practice 
  but an important one to remember!
- **Resource-level reproducibility**: Finally, as you've seen previously we can
  declare the compute and memory requirements of our pipeline at the task-level.

Combined with built-in versioning for all tasks, workflows, launchplans, and
executions, Flyte gives you the ability to roll back/forward to previous versions
of any of these entities. Flyte tasks and workflows behave like hermetically sealed
containers: given the same input, they are designed to produce the same output
(error or not).
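To make code-level reproducibility concrete, here is a minimal plain-Python sketch of seeding. It is not taken from `example_06_reproducibility.py`; the `noisy_scores` function is hypothetical, and the seed value `11` simply mirrors the `random_state` used earlier in this notebook.

```python
import random


def noisy_scores(seed: int, n: int = 5) -> list:
    """Draw n pseudo-random numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    return [rng.random() for _ in range(n)]


first_run = noisy_scores(11)
second_run = noisy_scores(11)
other_seed = noisy_scores(12)

# Same seed, same sequence; a different seed gives a different sequence.
print(first_run == second_run, first_run == other_seed)
```

Passing the seed in as an explicit input, rather than hard-coding it, keeps it visible as part of what determines the output.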

```bash
AWS_ACCESS_KEY_ID=minio AWS_SECRET_ACCESS_KEY=miniostorage aws --endpoint-url http://localhost:30002 s3 ls s3://my-s3-bucket
```

```bash
docker run --network="host" -it ghcr.io/flyteorg/flyte-conference-talks:scipy-2023-latest /bin/bash
```

```bash
FLYTE_SDK_LOGGING_LEVEL=20 AWS_ACCESS_KEY_ID=minio AWS_SECRET_ACCESS_KEY=miniostorage FLYTE_AWS_ENDPOINT=http://localhost:30002 \
pyflyte-fast-execute \
--additional-distribution s3://my-s3-bucket/flytesnacks/development/NS35CUJVLK7GDKD5YPU4VQMV44======/fast156af83e41196d797f4f45c29ca4e221.tar.gz \
--dest-dir /root \
-- \
pyflyte-execute \
--inputs s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f3c370addb6ae3b57000/n0/data/inputs.pb \
--output-prefix s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f3c370addb6ae3b57000/n0/data/0 \
--raw-output-data-prefix s3://my-s3-bucket/data/yb/f3c370addb6ae3b57000-n0-0 \
--checkpoint-path s3://my-s3-bucket/data/yb/f3c370addb6ae3b57000-n0-0/_flytecheckpoints \
--prev-checkpoint "" \
--resolver flytekit.core.python_auto_container.default_task_resolver \
-- \
task-module \
workflows.example_00_intro \
task-name \
get_data
```

### Workflows and Promises

#### Exercise: Understanding Workflows

Workflows are written in a domain-specific language (DSL) that builds an
execution graph, using tasks as the building blocks for more complex pipelines.
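One way to picture a graph-building DSL is the toy sketch below. It is plain Python, not flytekit's implementation: the `Promise` and `Graph` classes are hypothetical stand-ins, and `train_model` is an assumed task name. The point is that "calling" a task while the graph is being built records a node and returns a placeholder rather than a computed value.

```python
from dataclasses import dataclass, field


@dataclass
class Promise:
    """Placeholder for a value that a node will produce at execution time."""
    node_id: str


@dataclass
class Graph:
    # Each entry is (node id, task name, upstream node ids).
    nodes: list = field(default_factory=list)

    def add(self, name, *inputs):
        """Record a node and return a Promise for its future output."""
        node_id = f"n{len(self.nodes)}"
        upstream = [p.node_id for p in inputs if isinstance(p, Promise)]
        self.nodes.append((node_id, name, upstream))
        return Promise(node_id)


graph = Graph()
data = graph.add("get_data")            # no computation happens here
model = graph.add("train_model", data)  # `data` is a placeholder, not a DataFrame

print(type(data).__name__, graph.nodes)
```

Because the dependencies are captured as data, an engine can inspect, validate, and schedule the whole graph before any task runs.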

Insert a breakpoint on line 80 of the `example_00_intro.py` script and rerun 
it. Take a look at all the variables in the `training_workflow` like `data` 
and `model`. What data type are they?

### Type System

The Flyte type system is responsible for a lot of Flyte's magic: Flyte uses
the regular Python type hints to automatically serialize outputs of tasks
and deserialize inputs of tasks from Flyte's native serialization format,
including handling the off-loading of tabular data like `pandas.DataFrame`
objects.

A nice consequence of this is that Flyte can also analyze the execution graph
that's built at compile-time and raise errors.
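As a rough illustration of checking a graph before running it, the sketch below compares the type hints of two connected functions without calling either of them. This is a toy in plain Python, not Flyte's compiler; `check_edge` is a hypothetical helper, the function bodies are placeholders, and the error message is not the one Flyte produces.

```python
from typing import get_type_hints


def get_data() -> dict:
    return {"species": ["Adelie"]}


def split_data(data: list) -> tuple:
    return data[:1], data[1:]


def check_edge(upstream, downstream, arg_name: str) -> None:
    """Raise TypeError if upstream's return type differs from downstream's input type."""
    produced = get_type_hints(upstream)["return"]
    expected = get_type_hints(downstream)[arg_name]
    if produced is not expected:
        raise TypeError(
            f"{upstream.__name__} returns {produced.__name__}, but "
            f"{downstream.__name__}({arg_name}=...) expects {expected.__name__}"
        )


try:
    check_edge(get_data, split_data, "data")  # neither function is ever called
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

The mismatch surfaces from the signatures alone, which is why this kind of error can be reported at registration time rather than partway through an execution.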

Take a look at [example_04_type_system.py](./workflows/example_04_type_system.py).
Try changing the output signature of `get_data` from `pd.DataFrame` to `dict`
and then fast register it:

```bash
pyflyte register --project flytesnacks --domain development --image $IMAGE workflows
```

What error do you see?

#### Data Quality: DataFrame Types

Pandera is a data validation tool for dataframe-like objects. In
[example_05_pandera_types.py](./workflows/example_05_pandera_types.py), we define
a pandera schema that validates the output of `get_data` as well as the DataFrame
input of `split_data` at runtime.

#### Exercise

- Uncomment line 49 in `example_05_pandera_types.py`.
- Fast register your workflows, then run the cell below. What error do you see?
- Bonus: comment out the offending line and fast register the workflows again.
  Run the cell again. What do you see?

```python
from workflows import example_05_pandera_types
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_05_pandera_types.get_splits,
    inputs={"test_size": 0.2}
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

```
http://localhost:30080/console/projects/flytesnacks/domains/development/executions/fef227229689147d9b5e
```


### How Data Flows in Flyte

If tasks run in their own containers inside the Flyte cluster, how is data passed between them?
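The `pyflyte-execute` command shown earlier hints at the answer: it receives an `--inputs` location and an `--output-prefix` in the sandbox's S3 bucket. The toy sketch below imitates that pattern under loud assumptions: a temporary directory stands in for the bucket, JSON stands in for Flyte's serialization format, and both functions are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_get_data(output_prefix: Path) -> Path:
    """'Container' 1: compute a result and write it under its output prefix."""
    output_prefix.mkdir(parents=True, exist_ok=True)
    out = output_prefix / "outputs.json"
    out.write_text(json.dumps({"rows": [1, 2, 3]}))
    return out  # only this reference travels between tasks


def run_count_rows(input_ref: Path) -> int:
    """'Container' 2: load its input from the reference, then compute."""
    data = json.loads(input_ref.read_text())
    return len(data["rows"])


with tempfile.TemporaryDirectory() as store:  # stands in for the blob store
    ref = run_get_data(Path(store) / "n0" / "data")
    n_rows = run_count_rows(ref)

print(n_rows)
```

The two functions never share memory; they only agree on where the data lives and how it is encoded.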

### Lifecycle of a Workflow

When you run a workflow locally, flytekit just runs the tasks in a Python runtime. However, when you run the workflow on a Flyte cluster, a lot of things are happening under the hood.

`flytepropeller` is the core engine in the Flyte stack that orchestrates:
- the compute infrastructure needed to run a task.
- the execution of tasks in a particular sequence.
- the management of data dependencies between tasks.
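The sequencing part of that job can be sketched with Python's standard `graphlib` module (Python 3.9+). This only illustrates the ordering problem, not how `flytepropeller` is implemented, and the pipeline below is a hypothetical one.

```python
from graphlib import TopologicalSorter

# Map each node to the set of upstream nodes it depends on (hypothetical pipeline).
dependencies = {
    "get_data": set(),
    "split_data": {"get_data"},
    "train_model": {"split_data"},
    "evaluate": {"split_data", "train_model"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

A node becomes runnable only once everything it depends on has finished, which is also the moment its input data is available.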

### Development Lifecycle Overview

#### `pyflyte register`

Flyte supports rapid iteration during development through "fast registration"
with `pyflyte register`. This zips up all of the source code of your Flyte
application and bypasses the need to rebuild a Docker image.

```bash
pyflyte register --project flytesnacks --domain development --image $IMAGE workflows
```

Now go back to the Flyte console and take a look at one of the workflows. You'll
see our fast-registered version under the **Recent Workflow Versions** panel.

## Productionizing Data Science Workloads

### Parallelism

#### Example 1: Dynamic Workflows

Dynamic workflows allow you to create execution graphs on the fly. For example,
you can write for loops over inputs to implement a grid-search model-tuning
workflow.

```python
from workflows import example_00_intro, example_01_dynamic
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_01_dynamic.tuning_workflow,
    inputs={
        "hyperparam_grid": [
            example_00_intro.Hyperparameters(C=0.1, max_iter=5000),
            example_00_intro.Hyperparameters(C=0.01, max_iter=5000),
            example_00_intro.Hyperparameters(C=0.001, max_iter=5000),
        ],
    }
)
remote.generate_console_url(execution)
```

```
'http://localhost:30080/console/projects/flytesnacks/domains/development/executions/f7483f3766a8044cc86f'
```

#### Example 2: Map Tasks

Map tasks enable larger fan-outs of embarrassingly parallel computations compared
to dynamic workflows.
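"Embarrassingly parallel" means each input can be processed without knowing about the others. The sketch below shows that shape using a thread pool from the standard library; it is an analogy, not a Flyte map task. The `train_and_score` function and its scoring formula are made up, while the grid values match the `C` values used in the cell that follows.

```python
from concurrent.futures import ThreadPoolExecutor


def train_and_score(C: float) -> float:
    """Stand-in for a training task; returns a made-up score for C."""
    return round(1.0 / (1.0 + C), 4)


grid = [0.1, 0.01, 0.001]

# Each element is processed independently, so the calls can run in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    scores = list(pool.map(train_and_score, grid))  # results keep input order

best_C = grid[scores.index(max(scores))]
print(best_C)
```

In Flyte, the same one-function-over-many-inputs pattern runs each element as its own task instance on the cluster rather than as a local thread.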

```python
from workflows import example_00_intro, example_02_map_task
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_02_map_task.tuning_workflow,
    inputs={
        "hyperparam_grid": [
            example_00_intro.Hyperparameters(C=0.1, max_iter=5000),
            example_00_intro.Hyperparameters(C=0.01, max_iter=5000),
            example_00_intro.Hyperparameters(C=0.001, max_iter=5000),
        ],
    }
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

```
http://localhost:30080/console/projects/flytesnacks/domains/development/executions/fca97404075754c2fa4c
```


### Horizontal Scaling

#### Example 3: Plugins

Flyte has a plugin system that lets you integrate with a wide variety of
data and machine learning tools that help you to scale, like BigQuery,
PySpark, and Ray.

```python
from workflows import example_03_plugins
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_03_plugins.training_workflow,
    inputs={
        "n_epochs": 50,
        "hyperparameters": example_03_plugins.Hyperparameters(
            in_dim=4, hidden_dim=100, out_dim=3, learning_rate=0.03
        ),
    }
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

```
http://localhost:30080/console/projects/flytesnacks/domains/development/executions/fa4de104a7afd4d1d857
```


### Production Notebooks

### Container Tasks

### ImageSpec

### Recovering from Failure

#### Caching

In [example_07_caching.py](./workflows/example_07_caching.py), we revisit the model-tuning use case using `@dynamic` workflows,
showing how caching can help reduce wasted compute.
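The idea behind task-level caching can be sketched in plain Python. This toy assumes a cache keyed on the task name, a user-chosen version string, and the input values; it approximates the concept and is not Flyte's actual cache key or storage. All names below are hypothetical.

```python
import hashlib
import json

cache = {}
calls = []


def cache_key(task_name: str, cache_version: str, inputs: dict) -> str:
    """Hash the task name, a version string, and the inputs into one key."""
    payload = json.dumps([task_name, cache_version, inputs], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def run_cached(task_name, cache_version, fn, **inputs):
    """Run fn only if this (task, version, inputs) combination is new."""
    key = cache_key(task_name, cache_version, inputs)
    if key not in cache:
        calls.append(inputs)  # record real executions only
        cache[key] = fn(**inputs)
    return cache[key]


def train(alpha: float) -> float:
    return alpha * 2  # placeholder for an expensive training run


for alpha in [1.0, 0.1, 1.0]:  # 1.0 appears twice
    run_cached("train", "1", train, alpha=alpha)

print(len(calls))
```

Changing the version string produces a different key, which is how a cache like this can be invalidated after the task's logic changes.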

```python
from workflows import example_07_caching
from workflows.example_06_reproducibility import Hyperparameters
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_07_caching.tuning_workflow,
    inputs={
        "hyperparam_grid": [
            Hyperparameters(alpha=alpha)
            for alpha in [10.0, 1.0, 0.1, 0.01, 0.001, 0.0001]
        ],
    }
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

```
http://localhost:30080/console/projects/flytesnacks/domains/development/executions/fca9432093c79475888c
```


#### Recovering Failed Executions

In [example_08_recover_executions.py](./workflows/example_08_recover_executions.py), we see how Flyte
provides a mechanism by which you can automatically recover from unexpected failures.
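Conceptually, recovering an execution means re-running only what did not succeed. The plain-Python sketch below illustrates that idea with a simulated transient failure; it is not how Flyte implements recovery, and `run_grid` and `flaky_train` are hypothetical.

```python
def run_grid(alpha_grid, train, previous=None):
    """Run train() per alpha, reusing any successful results in `previous`."""
    results, failed = dict(previous or {}), []
    for alpha in alpha_grid:
        if alpha in results:
            continue  # succeeded in the earlier attempt; don't recompute
        try:
            results[alpha] = train(alpha)
        except RuntimeError:
            failed.append(alpha)
    return results, failed


attempts = {"n": 0}


def flaky_train(alpha: float) -> float:
    """Fails for alpha=0.1 during the first three calls, then succeeds."""
    attempts["n"] += 1
    if alpha == 0.1 and attempts["n"] <= 3:
        raise RuntimeError("simulated transient failure")
    return alpha * 2


grid = [10.0, 1.0, 0.1, 0.01]
first, failed = run_grid(grid, flaky_train)
recovered, still_failed = run_grid(grid, flaky_train, previous=first)

print(failed, still_failed, attempts["n"])
```

The second pass calls `flaky_train` only once, for the input that failed, and leaves the three earlier results untouched.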

```python
from workflows import example_08_recover_executions
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_08_recover_executions.tuning_workflow,
    inputs={"alpha_grid": [100.0, 10.0, 1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001]}
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

```
http://localhost:30080/console/projects/flytesnacks/domains/development/executions/f30fc95e6cfa84af19bb
```


#### Checkpointing

In [example_09_checkpointing.py](./workflows/example_09_checkpointing.py), we
learn about how you can do intra-task checkpoints natively in Flyte to pick
up from where you left off in, e.g., a model training task.
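The sketch below shows the general shape of intra-task checkpointing in plain Python: save progress after each epoch, and on a retry start from the last saved epoch. It does not use Flyte's checkpoint API; the file layout, the `train` function, and the simulated crash are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def train(n_epochs: int, ckpt: Path, fail_at=None) -> int:
    """Run epochs, saving progress after each; resume from ckpt if present."""
    start = json.loads(ckpt.read_text())["epoch"] if ckpt.exists() else 0
    for epoch in range(start, n_epochs):
        if epoch == fail_at:
            raise RuntimeError(f"simulated crash in epoch {epoch}")
        # ... one epoch of real training would go here ...
        ckpt.write_text(json.dumps({"epoch": epoch + 1}))
    return start  # the epoch this call resumed from


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    try:
        train(30, ckpt, fail_at=12)
    except RuntimeError:
        pass
    resumed_from = train(30, ckpt)  # the retry skips the completed epochs

print(resumed_from)
```

A real training task would store model weights and optimizer state in the checkpoint as well, not just the epoch counter.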

```python
from workflows import example_09_checkpointing
from workflows.example_06_reproducibility import Hyperparameters
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_09_checkpointing.training_workflow,
    inputs={
        "n_epochs": 30,
        "hyperparameters": Hyperparameters(penalty="l1", random_state=42),
    }
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

```
http://localhost:30080/console/projects/flytesnacks/domains/development/executions/f0f29542fe51740e9af4
```


### Auditing Workflows

#### Visualization with Flyte Decks

In [example_10_flyte_decks.py](./workflows/example_10_flyte_decks.py) we
create tasks that produce static HTML reports that help you understand the
inputs/outputs of your tasks.

```python
from workflows import example_10_flyte_decks
from workflows.utils import download_deck, get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_10_flyte_decks.penguins_data_workflow,
    inputs={},
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

```
http://localhost:30080/console/projects/flytesnacks/domains/development/executions/f03a65384d3814414836
```


```python
download_deck(remote, execution, "n0", "decks/example_10_decks.html")
```

```
Flyte decks for execution f9b47ecb975ad4829a95 downloaded to decks/example_10_decks.html
```


## Testing, CI/CD, Extending Flyte

### Writing Unit Tests

### Writing Integration Tests

### Using GitHub Actions

### Extending Flyte

#### Decorators

#### Extending Flyte Decks

Flyte decks can be extended to support arbitrary visualizations, as
we can see in [example_11_extend_flyte_decks.py](./workflows/example_11_extend_flyte_decks.py).

**Exercise**

Come up with a visualization for one of the inputs or outputs of any of the tasks
in `example_11_extend_flyte_decks.py`, and create a custom Flyte deck for it.

```python
from workflows import example_11_extend_flyte_decks
from workflows.example_06_reproducibility import Hyperparameters
from workflows.utils import download_deck, get_remote
from IPython.display import HTML, display

remote = get_remote()
execution = remote.execute_local_workflow(
    example_11_extend_flyte_decks.training_workflow,
    inputs={
        "hyperparameters": Hyperparameters(
            penalty="l1", alpha=0.03, random_state=12345
        )
    },
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

```
http://localhost:30080/console/projects/flytesnacks/domains/development/executions/f9b47ecb975ad4829a95
```


```python
download_deck(remote, execution, "n2", "decks/example_11_decks_n2.html")
```

```
Flyte decks for execution f9b47ecb975ad4829a95 downloaded to decks/example_11_decks_n2.html
```

```python
download_deck(remote, execution, "n3", "decks/example_11_decks_n3.html")
```

```
Flyte decks for execution f9b47ecb975ad4829a95 downloaded to decks/example_11_decks_n3.html
```


#### Type Plugins

#### Task Plugins