# Building graphs with MetaFlow

MetaFlow is a framework designed for managing **data-related workflows**. It runs tasks in separate processes, each with its own Python interpreter, and so sidesteps the restrictions of Python's GIL while keeping a very Pythonic development style. It tends to be CPU-intensive, but on a dedicated host it's a handy tool.
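
To make the "separate Python interpreter" idea concrete, here is a minimal sketch that uses only the standard library. It is not how MetaFlow is implemented; `WORKER_CODE` and `run_in_subprocesses` are hypothetical names chosen for illustration. Each job runs in its own interpreter process, so no job holds another job's GIL.

```python
import subprocess
import sys

# Hypothetical CPU-bound job, executed in a fresh interpreter via `python -c`.
WORKER_CODE = "import sys; n = int(sys.argv[1]); print(sum(i * i for i in range(n)))"


def run_in_subprocesses(inputs):
    """Run one interpreter per input and collect the printed results in order."""
    # Start every process before waiting on any of them, so they run concurrently.
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_CODE, str(n)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for n in inputs
    ]
    results = []
    for proc in procs:
        out, _ = proc.communicate()  # waits for this process to exit
        if proc.returncode != 0:
            raise RuntimeError(f"worker exited with code {proc.returncode}")
        results.append(int(out))
    return results


results = run_in_subprocesses([10, 100, 1000])
print(results)
```

The price of this isolation is that every process loads its own interpreter and its own copy of the data, which is one reason such workloads weigh on the CPU and on memory.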

In [1]:
!cat build_graphs.sh

#!/usr/bin/bash 

export MAX_WORKERS=$(python -c "import psutil; print(psutil.cpu_count(logical=False))")

USERNAME='mluser' python flows.py \
    run \
        --max-num-splits 6000 \
        --max-workers ${MAX_WORKERS}
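
One caveat about the `MAX_WORKERS` line in the script: `psutil.cpu_count(logical=False)` can return `None` when the physical core count cannot be determined, in which case the shell variable would hold the string `None`. The sketch below shows one possible fallback; `physical_core_count` is a hypothetical helper, not part of the original script.

```python
import os


def physical_core_count() -> int:
    """Best-effort number of physical cores, never less than 1."""
    try:
        import psutil  # third-party; treated as optional in this sketch

        count = psutil.cpu_count(logical=False)
    except ImportError:
        count = None
    # psutil returns None when the physical count is undetermined,
    # so fall back to the logical count, then to 1.
    if not count:
        count = os.cpu_count() or 1
    return count


print(physical_core_count())
```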

In [2]:
!bash build_graphs.sh

[35m[1mMetaflow 2.5.0[0m[35m[22m executing [0m[31m[1mBuildGraphsFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:mluser[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-02-17 14:24:42.839 [0m[1mWorkflow starting (run-id 1645107882835214):[0m
[35m2022-02-17 14:24:42.847 [0m[32m[1645107882835214/start/1 (pid 13880)] [0m[1mTask is starting.[0m
[35m2022-02-17 14:24:43.606 [0m[32m[1645107882835214/start/1 (pid 13880)] [0m[1mForeach yields 848 child steps.[0m
[35m2022-02-17 14:24:43.606 [0m[32m[1645107882835214/start/1 (pid 13880)] [0m[1mTask finished successfully.[0m
[35m2022-02-17 14:24:43.615 [0m[32m[1645107882835214/build_graphs/2 (pid 13916)] [0m[1mTask is starting.[0m
[35m2022-02-17 14:24:43.621 [0m[32m[1645107882835214/build_graphs/3 (pid 1

The neat thing with MetaFlow is that it registers everything in a namespace, and centralizes the logs and artifacts produced for each run. This data is then viewable with the commands below. Everything is Python-scriptable, which is a huge advantage.

We launched the run with `USERNAME` set to `'mluser'`, so everything is stored under the `user:mluser` namespace.

In [13]:
from metaflow import Flow, namespace
from pprint import pprint

namespace('user:mluser')
flow = Flow('BuildGraphsFlow')
runs = list(flow)
run0 = runs[0]
run0.data.name

pprint(runs)

[Run('BuildGraphsFlow/1644939228924458'),
 Run('BuildGraphsFlow/1644939170651991'),
 Run('BuildGraphsFlow/1644939034351635'),
 Run('BuildGraphsFlow/1644939000054534'),
 Run('BuildGraphsFlow/1644938768013885'),
 Run('BuildGraphsFlow/1644938649267334')]
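
The run IDs in these outputs look like Unix timestamps in microseconds: the run logged at `2022-02-17 14:24:42.839` has the ID `1645107882835214`. That is an observation from this notebook, not a documented guarantee, and it may not hold for other metadata backends. Under that assumption, a small hypothetical helper can turn an ID into a readable date:

```python
from datetime import datetime, timezone


def run_id_to_datetime(run_id: str) -> datetime:
    """Interpret a run ID as microseconds since the Unix epoch (assumption)."""
    # Integer arithmetic keeps the microseconds exact.
    seconds, micros = divmod(int(run_id), 1_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)


started = run_id_to_datetime("1645107882835214")
print(started.isoformat())
```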


In [25]:
# Isolate the latest Run
run = Flow('BuildGraphsFlow').latest_run

# Get Steps from that Run
steps = list(run.steps())
pprint(steps)

# Isolate Tasks from the Start Step
start_tasks = list(steps[-1].tasks())

[Step('BuildGraphsFlow/1644939228924458/end'),
 Step('BuildGraphsFlow/1644939228924458/join'),
 Step('BuildGraphsFlow/1644939228924458/build_graphs'),
 Step('BuildGraphsFlow/1644939228924458/start')]
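
The steps come back in reverse order, so `steps[-1]` happens to be `start` here. Selecting by name is less fragile than relying on position. The sketch below uses stand-in objects with an `id` attribute in place of real `Step` objects so that it runs without a MetaFlow run on disk; `get_by_id` is a hypothetical helper.

```python
from types import SimpleNamespace


def get_by_id(objects, wanted_id):
    """Return the first object whose `id` attribute equals `wanted_id`."""
    for obj in objects:
        if obj.id == wanted_id:
            return obj
    raise KeyError(wanted_id)


# Stand-ins mirroring the order of the output above (assumption: real Step
# objects expose the step name as `.id`).
steps = [SimpleNamespace(id=name) for name in ("end", "join", "build_graphs", "start")]

start_step = get_by_id(steps, "start")
print(start_step.id)
```

With real MetaFlow objects, indexing the run by step name, as in `run['start']`, should give the same step without any helper.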


In [32]:
# Retrieve the list of artifacts registered at the Start Step
start_artifacts = start_tasks[0].artifacts
list(start_artifacts)

[DataArtifact('BuildGraphsFlow/1644939228924458/start/1/out_dir'),
 DataArtifact('BuildGraphsFlow/1644939228924458/start/1/timestep'),
 DataArtifact('BuildGraphsFlow/1644939228924458/start/1/params'),
 DataArtifact('BuildGraphsFlow/1644939228924458/start/1/shard'),
 DataArtifact('BuildGraphsFlow/1644939228924458/start/1/name')]

In [35]:
start_artifacts.params.data

{'dtype': 'float32',
 'timestep': 1000,
 'dataset_len': 1085440,
 'num_shards': 848,
 'x_shape': [1280, 136, 17],
 'y_shape': [1280, 138, 1],
 'edge_shape': [1280, 137, 27]}
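
The numbers in `params` are consistent with each other: `dataset_len / num_shards` equals the leading dimension of each shape, and `num_shards` matches the 848 foreach children reported in the run log. The sketch below checks this and estimates the size of one shard. The dictionary is copied from the output above; the `DTYPE_BYTES` lookup and the helper `shard_nbytes` are assumptions made here to avoid a NumPy dependency.

```python
from math import prod

# Copied from the `params` artifact shown above.
params = {
    "dtype": "float32",
    "timestep": 1000,
    "dataset_len": 1085440,
    "num_shards": 848,
    "x_shape": [1280, 136, 17],
    "y_shape": [1280, 138, 1],
    "edge_shape": [1280, 137, 27],
}

# Assumption: bytes per element for the dtypes we care about.
DTYPE_BYTES = {"float32": 4, "float64": 8}


def shard_nbytes(shape, dtype):
    """Size in bytes of one dense array with this shape and dtype."""
    return prod(shape) * DTYPE_BYTES[dtype]


samples_per_shard, remainder = divmod(params["dataset_len"], params["num_shards"])
shard_bytes = {
    key: shard_nbytes(params[key], params["dtype"])
    for key in ("x_shape", "y_shape", "edge_shape")
}

print(samples_per_shard, remainder)
print(shard_bytes)
```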