# PyData Global 2022: Production-grade Machine Learning with Flyte

In this tutorial, you'll learn about some of the key challenges in building and deploying reliable machine learning systems. At a high level, these challenges are:

- Scalability
- Data Quality
- Reproducibility
- Recoverability
- Auditability

## Introduction

### Environment Setup

Follow the [setup instructions](./README.md#setup) in the README.

### Example 0: Flyte Basics

Let's take a look at the [first example](./workflows/example_00_intro.py).

In it, you'll see a simple pipeline that uses the penguins dataset to train a
penguin species classifier. You can run this workflow locally with:

```
python workflows/example_00_intro.py
```

#### Exercise: Understanding Workflows

Workflows are written in a domain-specific language (DSL) that builds an
execution graph, using tasks as the building blocks for more complex pipelines.

Insert a debugging breakpoint `import pdb; pdb.set_trace()` on line 80 of the
`example_00_intro.py` script and rerun it. Take a look at the variables
in `training_workflow`, such as `data` and `model`. What data type are they?
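
To picture what "building an execution graph" means, here is a minimal plain-Python analogy. It is not flytekit's implementation: the names `Node`, `GRAPH`, and the `task` decorator are hypothetical and exist only for this sketch. Calling a decorated function inside the workflow records a node and returns a placeholder instead of the real value.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Node:
    """Placeholder for a task output that has not been computed yet."""
    task_name: str
    inputs: tuple


@dataclass
class Graph:
    nodes: List[Node] = field(default_factory=list)


GRAPH = Graph()  # hypothetical global graph, for illustration only


def task(fn: Callable[..., Any]) -> Callable[..., Node]:
    """Hypothetical decorator: calling the task records a node instead of running fn."""
    def record(*args: Any) -> Node:
        node = Node(task_name=fn.__name__, inputs=args)
        GRAPH.nodes.append(node)
        return node
    return record


@task
def get_data() -> list:
    return [1.0, 2.0, 3.0]


@task
def train_model(data: list) -> float:
    return sum(data) / len(data)


def training_workflow() -> Node:
    data = get_data()          # a Node, not a list
    model = train_model(data)  # a Node whose input is the previous Node
    return model


result = training_workflow()
print(type(result).__name__)
print([node.task_name for node in GRAPH.nodes])
```

The function bodies never run here; only the wiring between them is captured, which is the part a workflow engine needs in order to plan execution.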

#### Registering Your Workflow

Once you're happy with the state of your tasks and workflows, you can register
them by first packaging them up into a portable Flyte archive:

```
export IMAGE='ghcr.io/flyteorg/flyte-conference-talks:pydata-global-2022-latest'
pyflyte --pkgs workflows package --image $IMAGE -f
```

This will create a `flyte-package.tgz` archive file that contains the serialized
tasks and workflows in this project. Then, you can register it with:

```
flytectl register files --project flytesnacks --domain development --archive flyte-package.tgz --version v0
```

Now we can go over to https://sandbox.union.ai/console
(or http://localhost:30080/console if you're using a local Flyte cluster) to
check out the tasks and workflows we just registered.

```
from workflows import example_00_intro
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_00_intro.training_workflow,
    inputs={
        "hyperparameters": {"C": 0.1, "max_iter": 5000},
        "test_size": 0.2,
        "random_state": 11,
    }
)
remote.generate_console_url(execution)
```

'http://localhost:30080/console/projects/flytesnacks/domains/development/executions/fdfdd3a6aa1bd4202825'

```
execution = remote.wait(execution)
```

```
from sklearn.linear_model import LogisticRegression

clf = execution.outputs.get("o0", LogisticRegression)
clf
```

#### Scheduling Launch Plans

Activate the schedule:

```
from workflows.utils import get_remote

remote = get_remote()
lp_id = remote.fetch_launch_plan(name="scheduled_training_workflow").id
remote.client.update_launch_plan(lp_id, "ACTIVE")
print("activated scheduled_training_workflow")
```

activated scheduled_training_workflow


Get the execution for the most recent scheduled run:

```
# LogisticRegression is imported in an earlier cell.
recent_executions = [
    execution
    for execution in remote.recent_executions()
    if execution.spec.launch_plan.name == "scheduled_training_workflow"
]

scheduled_execution = None
model = None
if recent_executions:
    # Wait for the first matching execution and fetch its model output.
    scheduled_execution = recent_executions[0]
    scheduled_execution = remote.wait(scheduled_execution)
    model = scheduled_execution.outputs.get("o0", LogisticRegression)

print(model)
```

LogisticRegression(C=0.1, max_iter=1000)


Now deactivate the schedule:

```
remote.client.update_launch_plan(lp_id, "INACTIVE")
print("deactivated scheduled_training_workflow")
```

deactivated scheduled_training_workflow


#### `pyflyte register`

Flyte supports rapid iteration during development through "fast registration"
with `pyflyte register`. This zips up the source code of your Flyte
application and bypasses the need to rebuild a Docker image.

```
pyflyte register --project flytesnacks --domain development --image $IMAGE workflows
```

Now go back to the Flyte console and take a look at one of the workflows. You'll
see our fast-registered version under the **Recent Workflow Versions** panel.

## Scalability

### Example 1: Dynamic Workflows

Dynamic workflows allow you to create execution graphs on the fly. For example,
you can write `for` loops over inputs to implement a grid-search model-tuning
workflow.

```
from workflows import example_01_dynamic
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_01_dynamic.tuning_workflow,
    inputs={
        "hyperparam_grid": [
            {"C": 0.1, "max_iter": 5000},
            {"C": 0.01, "max_iter": 5000},
            {"C": 0.001, "max_iter": 5000},
        ],
    }
)
remote.generate_console_url(execution)
```

'http://localhost:30080/console/projects/flytesnacks/domains/development/executions/f6d26f4dbcef4410a9c6'
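
The `hyperparam_grid` input above is a hand-written list of dicts. For larger searches, one option is to generate that list from a dictionary of candidate values. The sketch below uses only the standard library; `make_grid` is a hypothetical helper, not part of the tutorial code.

```python
from itertools import product
from typing import Any, Dict, List


def make_grid(space: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Expand a dict of candidate values into a list of hyperparameter dicts."""
    keys = list(space)
    # product() yields one tuple per combination, in the order the keys are given.
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


hyperparam_grid = make_grid({"C": [0.1, 0.01, 0.001], "max_iter": [5000]})
print(hyperparam_grid)
```

The resulting list has the same shape as the `hyperparam_grid` value passed to `tuning_workflow`, so its length determines how many iterations the loop in a dynamic workflow would run.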

### Example 2: Map Tasks

Map tasks enable larger fan-outs of embarrassingly parallel computations compared
to dynamic workflows.

```
from workflows import example_02_map_task
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_02_map_task.tuning_workflow,
    inputs={
        "hyperparam_grid": [
            {"C": 0.1},
            {"C": 0.01},
            {"C": 0.001},
        ],
    }
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

http://localhost:30080/console/projects/flytesnacks/domains/development/executions/fafbb6dc220664e5b93e
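
As a rough local analogy for the map pattern (one function applied independently to every element of a list), the sketch below uses the standard library's thread pool. It only illustrates the shape of the computation: `score` is a hypothetical stand-in that returns a made-up number, and nothing here reflects how Flyte schedules map tasks on a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def score(hyperparameters: Dict[str, float]) -> float:
    """Stand-in for a training task; returns a made-up score, not a real metric."""
    return round(1.0 - hyperparameters["C"], 3)


hyperparam_grid: List[Dict[str, float]] = [{"C": 0.1}, {"C": 0.01}, {"C": 0.001}]

# Executor.map applies one function to every element and preserves input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    scores = list(pool.map(score, hyperparam_grid))

# Because order is preserved, scores[i] belongs to hyperparam_grid[i].
best = hyperparam_grid[scores.index(max(scores))]
print(scores, best)
```

Because each call depends only on its own element, the work is embarrassingly parallel, which is the property that makes large fan-outs practical.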


### Example 3: Plugins

Flyte has a plugin system that lets you integrate with a wide variety of
data and machine learning tools that help you scale, like BigQuery,
PySpark, and Ray.

```
from workflows import example_03_plugins
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_03_plugins.training_workflow,
    inputs={
        "n_epochs": 50,
        "hyperparameters": example_03_plugins.Hyperparameters(
            in_dim=4, hidden_dim=100, out_dim=3, learning_rate=0.03
        ),
    }
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

http://localhost:30080/console/projects/flytesnacks/domains/development/executions/ffad2720c9634405487d


## Data Quality

### Example 4: Type System

The Flyte type system is responsible for a lot of Flyte's magic: Flyte uses
the regular Python type hints to automatically serialize outputs of tasks
and deserialize inputs of tasks from Flyte's native serialization format,
including handling the off-loading of tabular data like `pandas.DataFrame`
objects.

A nice consequence of this is that Flyte can analyze the execution graph
when it's built at compile time and raise errors, such as mismatched types
between connected tasks, before any task runs.
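
To make the idea of checking a graph before running it concrete, here is a minimal standard-library sketch. It compares the return annotation of an upstream function with the parameter annotation of a downstream one. The `check_edge` helper is hypothetical, `list` stands in for a dataframe type, and Flyte's own type checking is considerably richer than an identity comparison.

```python
from typing import Callable, get_type_hints


def check_edge(upstream: Callable, downstream: Callable, param: str) -> None:
    """Raise TypeError if upstream's return annotation differs from downstream's parameter annotation."""
    produced = get_type_hints(upstream).get("return")
    expected = get_type_hints(downstream).get(param)
    if produced is not expected:
        raise TypeError(
            f"{upstream.__name__} returns {produced!r}, "
            f"but {downstream.__name__}({param}) expects {expected!r}"
        )


def get_data() -> dict:  # deliberately the wrong annotation
    return {"bill_length_mm": [39.1]}


def split_data(data: list) -> tuple:
    return data[:1], data[1:]


# Neither function body runs; only the annotations are inspected.
try:
    check_edge(get_data, split_data, "data")
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

The mismatch is reported from the signatures alone, so the error surfaces when the graph is assembled rather than partway through an expensive run.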

Take a look at [example_04_type_system.py](./workflows/example_04_type_system.py).
Try changing the output signature of `get_data` from `pd.DataFrame` to `dict`,
then fast register it:

```
pyflyte register --project flytesnacks --domain development --image $IMAGE workflows
```

What error do you see?

### Example 5: DataFrame Types

Pandera is a data validation tool for dataframe-like objects. In
[example_05_pandera_types.py](./workflows/example_05_pandera_types.py), we define
a pandera schema that validates the output of `get_data` as well as the DataFrame
input of `split_data` at runtime.

#### Exercise

- Uncomment line 49 in `example_05_pandera_types.py`.
- Fast register your workflows, then run the cell below. What error do you see?
- Bonus: comment out the offending line and fast register the workflows again.
  Re-run the cell. What do you see?

```
from workflows import example_05_pandera_types
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_05_pandera_types.get_splits,
    inputs={"test_size": 0.2}
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

http://localhost:30080/console/projects/flytesnacks/domains/development/executions/fde41b392cafc40fba5d


## Reproducibility

### Example 6: Reproducibility

Next, we'll learn about multiple levels of reproducibility:

- **Environment-level reproducibility**: As you can see in the
  [Dockerfile](./Dockerfile), we're containerizing our Flyte application to
  capture a snapshot of all the dependencies that your tasks and workflows rely on.
- **Code-level reproducibility**: In [example_06_reproducibility.py](./workflows/example_06_reproducibility.py)
  we take care of setting a random seed for our model. This is a common practice 
  but an important one to remember!
- **Resource-level reproducibility**: Finally, as you've seen previously, we can
  declare the compute and memory requirements of our pipeline at the task level.

Combined with built-in versioning for all tasks, workflows, launch plans, and
executions, Flyte gives you the ability to roll back or forward to other versions
of any of these entities. Flyte tasks and workflows are designed to behave like
hermetically sealed containers that produce the same output (error or not) given
the same input.
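
Code-level reproducibility is easy to demonstrate in isolation. The following standard-library sketch uses a private, seeded random number generator so that the same `random_state` always selects the same test rows. The `shuffle_split` function is a hypothetical illustration, not the splitting logic used in the example workflows.

```python
import random
from typing import List


def shuffle_split(n_rows: int, test_size: float, random_state: int) -> List[int]:
    """Return the row indices chosen for the test split, using a private seeded RNG."""
    rng = random.Random(random_state)  # local RNG: no hidden global state
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return sorted(indices[: int(n_rows * test_size)])


first = shuffle_split(n_rows=344, test_size=0.2, random_state=11)
second = shuffle_split(n_rows=344, test_size=0.2, random_state=11)
print(first == second)
```

Passing the seed as an explicit input, rather than seeding a global generator, keeps the function's output determined entirely by its arguments.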

## Recoverability

### Example 7: Caching

In [example_07_caching.py](./workflows/example_07_caching.py), we revisit the model-tuning use case using `@dynamic` workflows,
showing how caching can help reduce wasted compute.

```
from workflows import example_07_caching
from workflows.example_06_reproducibility import Hyperparameters
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_07_caching.tuning_workflow,
    inputs={
        "hyperparam_grid": [
            Hyperparameters(alpha=alpha)
            for alpha in [10.0, 1.0, 0.1, 0.01, 0.001, 0.0001]
        ],
    }
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

http://localhost:30080/console/projects/flytesnacks/domains/development/executions/f0009904de4a84079ae1
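
The intuition behind caching can be sketched in a few lines of plain Python: derive a key from the task name, a cache version, and the inputs, and skip the computation when that key has been seen before. This is an illustrative analogy with hypothetical names (`cached_call`, `_cache`, `calls`), not a description of how Flyte stores or looks up cached outputs.

```python
import hashlib
import json
from typing import Any, Callable, Dict

_cache: Dict[str, Any] = {}
calls = {"count": 0}  # counts how often the expensive function really runs


def cached_call(fn: Callable[..., Any], cache_version: str, **inputs: Any) -> Any:
    """Run fn(**inputs) once per (name, version, inputs) and reuse the stored result."""
    payload = json.dumps([fn.__name__, cache_version, inputs], sort_keys=True)
    key = hashlib.sha256(payload.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = fn(**inputs)
    return _cache[key]


def train(alpha: float) -> float:
    calls["count"] += 1
    return alpha * 2  # stand-in for an expensive training run


for alpha in [10.0, 1.0, 0.1, 1.0, 10.0]:  # two repeated values
    cached_call(train, "1.0", alpha=alpha)

print(calls["count"])
```

Including a version string in the key gives a simple way to invalidate old results when the task's logic changes: bumping the version produces new keys for the same inputs.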


### Example 8: Recovering Failed Executions

In [example_08_recover_executions.py](./workflows/example_08_recover_executions.py), we see how Flyte
provides a mechanism by which you can automatically recover from unexpected failures.

```
from workflows import example_08_recover_executions
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_08_recover_executions.tuning_workflow,
    inputs={"alpha_grid": [100.0, 10.0, 1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001]}
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

http://localhost:30080/console/projects/flytesnacks/domains/development/executions/f8b22ae867b2c4fc58ba


### Example 9: Checkpointing

In [example_09_checkpointing.py](./workflows/example_09_checkpointing.py), we
learn about how you can do intra-task checkpoints natively in Flyte to pick
up from where you left off in, e.g., a model training task.

```
from workflows import example_09_checkpointing
from workflows.example_06_reproducibility import Hyperparameters
from workflows.utils import get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_09_checkpointing.training_workflow,
    inputs={
        "n_epochs": 30,
        "hyperparameters": Hyperparameters(penalty="l1", random_state=42),
    }
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

http://localhost:30080/console/projects/flytesnacks/domains/development/executions/fcd7819a261a6461298b
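
The general pattern behind intra-task checkpointing can be shown with a file on disk: persist progress after each epoch, and on restart read it back to decide where to resume. The sketch below is a standard-library illustration under that assumption. The `train` function, its `fail_at` parameter, and the JSON file layout are hypothetical and do not use Flyte's checkpoint API.

```python
import json
import tempfile
from pathlib import Path


def train(n_epochs: int, checkpoint: Path, fail_at: int = -1) -> int:
    """Run epochs, persisting progress after each one; resume from the checkpoint if present."""
    start = json.loads(checkpoint.read_text())["epoch"] if checkpoint.exists() else 0
    epochs_run = 0
    for epoch in range(start, n_epochs):
        if epoch == fail_at:
            raise RuntimeError(f"simulated failure at epoch {epoch}")
        epochs_run += 1  # stand-in for one epoch of real training
        checkpoint.write_text(json.dumps({"epoch": epoch + 1}))
    return epochs_run


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    try:
        train(n_epochs=30, checkpoint=ckpt, fail_at=20)  # first attempt fails partway
    except RuntimeError:
        pass
    resumed_epochs = train(n_epochs=30, checkpoint=ckpt)  # retry resumes from the file

print(resumed_epochs)
```

The retry only runs the epochs that were not completed before the failure, which is the saving that checkpointing is meant to provide for long training tasks.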


## Auditability

### Example 10: Visualization with Flyte Decks

In [example_10_flyte_decks.py](./workflows/example_10_flyte_decks.py), we
create tasks that produce static HTML reports that help you understand the
inputs and outputs of your tasks.

```
from workflows import example_10_flyte_decks
from workflows.utils import download_deck, get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_10_flyte_decks.penguins_data_workflow,
    inputs={},
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

http://localhost:30080/console/projects/flytesnacks/domains/development/executions/f8c019b475b7c4cbea24


```
download_deck(remote, execution, "n0", "decks/example_10_decks.html")

from IPython.display import HTML
HTML(filename="decks/example_10_decks.html")
```

Flyte decks for execution f8c019b475b7c4cbea24 downloaded to decks/example_10_decks.html


*Rendered deck output: a ydata-profiling (v4.1.2) report on the penguins dataset. It summarizes 344 observations of 5 variables (1 categorical, 4 numeric) with 8 missing cells (0.5%) and no duplicate rows, flags high correlations among all five variables, and gives per-variable statistics, a correlation matrix, and first/last-row samples for `species`, `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g`.*


### Example 11: Extending Flyte Decks

Flyte decks can be extended to support arbitrary visualizations, as
we can see in [example_11_extend_flyte_decks.py](./workflows/example_11_extend_flyte_decks.py).

#### Exercise

Come up with a visualization for one of the inputs or outputs of any of the tasks
in `example_11_extend_flyte_decks.py`, and create a custom Flyte deck for it.
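
As a starting point for the exercise, a custom visualization ultimately has to produce an HTML string. The sketch below is a standalone, standard-library renderer that turns rows into an HTML table. The `TableRenderer` class is hypothetical, and the `to_html` method name is an assumption about the shape a deck renderer takes; check `example_11_extend_flyte_decks.py` for how the tutorial's own renderers are defined and attached to a deck.

```python
from html import escape
from typing import Dict, List


class TableRenderer:
    """Hypothetical renderer: turns a list of row dicts into an HTML table string."""

    def to_html(self, rows: List[Dict[str, object]]) -> str:
        if not rows:
            return "<p>No rows</p>"
        columns = list(rows[0])
        head = "".join(f"<th>{escape(str(c))}</th>" for c in columns)
        body = "".join(
            "<tr>"
            + "".join(f"<td>{escape(str(row.get(c, '')))}</td>" for c in columns)
            + "</tr>"
            for row in rows
        )
        return f"<table><tr>{head}</tr>{body}</table>"


# Species counts taken from the penguins dataset profile in Example 10.
html_report = TableRenderer().to_html(
    [
        {"species": "Adelie", "count": 152},
        {"species": "Gentoo", "count": 124},
        {"species": "Chinstrap", "count": 68},
    ]
)
print(html_report)
```

Escaping every cell with `html.escape` keeps arbitrary data values from being interpreted as markup in the rendered report.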

```
from workflows import example_11_extend_flyte_decks
from workflows.example_06_reproducibility import Hyperparameters
from workflows.utils import download_deck, get_remote

remote = get_remote()
execution = remote.execute_local_workflow(
    example_11_extend_flyte_decks.training_workflow,
    inputs={
        "hyperparameters": Hyperparameters(
            penalty="l1", alpha=0.03, random_state=12345
        )
    },
)
print(remote.generate_console_url(execution))
execution = remote.wait(execution)
```

```
# Download the decks for nodes n2 and n3 into matching files.
download_deck(remote, execution, "n2", "decks/example_11_decks_n2.html")
download_deck(remote, execution, "n3", "decks/example_11_decks_n3.html")
```