# Building graphs with Metaflow

Metaflow is a framework designed for managing **data-related workflows**.

It runs tasks in separate subprocesses, each with its own Python interpreter, and therefore sidesteps the restrictions of Python's GIL while keeping a very Pythonic development experience. It tends to be CPU-intensive, but in a host-dedicated environment it's a handy tool.
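To make the "separate Python interpreter" idea concrete, here is a minimal standard-library sketch. It is not how Metaflow is implemented internally; `spawn_workers` and `CHILD_CODE` are hypothetical names used only for illustration. Each child is a fresh interpreter with its own GIL, which is why CPU-bound work can run in parallel.

```python
import os
import subprocess
import sys
from typing import List

# Code executed by each child; it only reports the child's process ID.
CHILD_CODE = "import os; print(os.getpid())"


def spawn_workers(n: int) -> List[int]:
    """Start n fresh Python interpreters at once and collect their PIDs."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(n)
    ]
    # communicate() waits for each child and returns (stdout, stderr).
    return [int(proc.communicate()[0]) for proc in procs]


if __name__ == "__main__":
    pids = spawn_workers(3)
    print("parent:", os.getpid(), "children:", pids)
```

Every PID differs from the parent's, which shows that the work happens outside the calling interpreter.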

We will demonstrate its ease of use on a simple example: building the graphs from the previously processed feature data.

In [1]:
!cat build_graphs.sh

#!/usr/bin/bash 

export MAX_WORKERS=$(python -c "import psutil; print(psutil.cpu_count(logical=False))")

USERNAME='mluser' python flows.py \
    run \
        --max-num-splits 7000 \
        --max-workers ${MAX_WORKERS} >> ${HOME}/.kosmoss/logs/build_graphs.stdout
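One caveat with the script above: `psutil.cpu_count(logical=False)` can return `None` when the number of physical cores cannot be determined, in which case `MAX_WORKERS` would be set to the string `None`. A hedged sketch of a more defensive helper follows; `resolve_max_workers` is a hypothetical name, not part of the project.

```python
import os


def resolve_max_workers(default: int = 1) -> int:
    """Return the physical core count, with fallbacks when it is unknown."""
    try:
        import psutil  # optional third-party dependency

        count = psutil.cpu_count(logical=False)
    except ImportError:
        count = None
    # Fall back to the logical core count, then to a fixed default.
    return count or os.cpu_count() or default


if __name__ == "__main__":
    print(resolve_max_workers())
```

Printing the value from such a helper keeps the shell one-liner short while always yielding a usable integer.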

Open the `flows.py` file and debug it.

In [2]:
!bash build_graphs.sh

[35m[1mMetaflow 2.5.0[0m[35m[22m executing [0m[31m[1mBuildGraphsFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:mluser[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m


The neat thing about Metaflow is that it registers everything under a namespace and centralizes the logs and artifacts produced by each run.

This data is then viewable with the commands below. Everything is Python-scriptable, which is a huge advantage.

We launched the run with `USERNAME` set to `'mluser'`, so everything is stored under that namespace.

In [2]:
from kosmoss import CONFIG, PROCESSED_DATA_PATH
import os.path as osp
import shutil

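# Note: disk_usage reports usage for the whole filesystem holding this path,
# not the size of the directory itself.
_, used, _ = shutil.disk_usage(osp.join(PROCESSED_DATA_PATH, f"graphs-{CONFIG['timestep']}"))
used // 2 ** 30  # used space in GiB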

187
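Because `shutil.disk_usage` reports totals for the whole filesystem that contains the path, the figure above may cover more than the graph files alone. If the size of the directory itself is wanted, a minimal sketch is to walk the tree and sum the file sizes; `directory_size_bytes` is a hypothetical helper, not part of `kosmoss`.

```python
import os


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files found under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            # Skip symbolic links so targets are not counted by accident.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    # Report the size of the current directory in GiB.
    print(directory_size_bytes(".") / 2 ** 30)
```

On a directory holding thousands of shards the walk can take a while, so the filesystem-level call remains the quicker check.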

In [3]:
from metaflow import Flow, namespace
from pprint import pprint

namespace('user:mluser')
flow = Flow('BuildGraphsFlow')
runs = list(flow)
run0 = runs[0]
run0.data.name

pprint(runs)

[Run('BuildGraphsFlow/1645199395087434'),
 Run('BuildGraphsFlow/1645198826240123'),
 Run('BuildGraphsFlow/1645196218706754'),
 Run('BuildGraphsFlow/1645196181584531'),
 Run('BuildGraphsFlow/1645120758149454'),
 Run('BuildGraphsFlow/1645119463179640'),
 Run('BuildGraphsFlow/1645118944827984'),
 Run('BuildGraphsFlow/1645113974841657'),
 Run('BuildGraphsFlow/1645113549978964'),
 Run('BuildGraphsFlow/1645113475192636'),
 Run('BuildGraphsFlow/1645113414895923'),
 Run('BuildGraphsFlow/1645113354496294'),
 Run('BuildGraphsFlow/1645113295883944'),
 Run('BuildGraphsFlow/1645112614290339'),
 Run('BuildGraphsFlow/1645112032400082'),
 Run('BuildGraphsFlow/1645111655483619'),
 Run('BuildGraphsFlow/1645107882835214'),
 Run('BuildGraphsFlow/1645107534723044'),
 Run('BuildGraphsFlow/1645106947880002'),
 Run('BuildGraphsFlow/1645106880212705'),
 Run('BuildGraphsFlow/1645106069216544'),
 Run('BuildGraphsFlow/1645105603685647'),
 Run('BuildGraphsFlow/1645104292879028'),
 Run('BuildGraphsFlow/164510417555

In [4]:
# Isolate the latest Run
run = Flow('BuildGraphsFlow').latest_run

# Get Steps from that Run
steps = list(run.steps())
pprint(steps)

# Isolate Tasks from the Start Step
# (steps are listed most recent first, so the last one is `start`)
start_tasks = list(steps[-1].tasks())

[Step('BuildGraphsFlow/1645199395087434/end'),
 Step('BuildGraphsFlow/1645199395087434/join'),
 Step('BuildGraphsFlow/1645199395087434/build_graphs'),
 Step('BuildGraphsFlow/1645199395087434/start')]


In [6]:
# Retrieve the list of artifacts registered at the Start Step
start_artifacts = start_tasks[0].artifacts
list(start_artifacts)

[DataArtifact('BuildGraphsFlow/1645199395087434/start/1/PROCESSED_DATA_PATH'),
 DataArtifact('BuildGraphsFlow/1645199395087434/start/1/num_shards'),
 DataArtifact('BuildGraphsFlow/1645199395087434/start/1/y_shape'),
 DataArtifact('BuildGraphsFlow/1645199395087434/start/1/edge_shape'),
 DataArtifact('BuildGraphsFlow/1645199395087434/start/1/out_dir'),
 DataArtifact('BuildGraphsFlow/1645199395087434/start/1/timestep'),
 DataArtifact('BuildGraphsFlow/1645199395087434/start/1/dtype'),
 DataArtifact('BuildGraphsFlow/1645199395087434/start/1/x_shape'),
 DataArtifact('BuildGraphsFlow/1645199395087434/start/1/shard'),
 DataArtifact('BuildGraphsFlow/1645199395087434/start/1/name')]

In [8]:
start_artifacts.num_shards.data

6784