# Building graphs with Metaflow

[Metaflow](https://docs.metaflow.org/metaflow/basics) is a great framework designed by Netflix for managing **data-related workflows**.

It can run tasks in parallel processes, thereby bypassing the restrictions of [Python's GIL](https://docs.python.org/3/glossary.html#term-global-interpreter-lock): each task runs in its own subprocess (a separate Python interpreter), while the code you write stays very Pythonic. Unlike in MPI-based programs, steps can also share data through attributes of the flow class, which Metaflow persists as artifacts.
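To make the "separate interpreter" idea concrete, here is a minimal standard-library sketch. It is not how Metaflow is implemented internally; it only illustrates fanning work out to child interpreters, each with its own GIL, and joining the results. The names `CHILD_CODE` and `fan_out` are hypothetical.

```python
import json
import subprocess
import sys

# Code run by each child interpreter: sum the squares of a range passed on argv.
CHILD_CODE = (
    "import json, sys; "
    "lo, hi = int(sys.argv[1]), int(sys.argv[2]); "
    "print(json.dumps(sum(i * i for i in range(lo, hi))))"
)


def fan_out(bounds):
    """Start one Python interpreter per (lo, hi) pair and collect the results."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE, str(lo), str(hi)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for lo, hi in bounds
    ]
    # All children are already running at this point, each with its own GIL.
    results = []
    for proc in procs:
        out, _ = proc.communicate()
        results.append(json.loads(out))
    return results


# "Join": combine the partial results returned by the children.
partials = fan_out([(0, 1000), (1000, 2000), (2000, 3000)])
total = sum(partials)
print(total)
```

Here the children hand results back over stdout; Metaflow instead stores the attributes set in each step as artifacts, which is what makes sharing data between steps feel like plain attribute access.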

It tends to be CPU-intensive, but on a dedicated host it is still a handy tool.

We will demonstrate its ease of use with a simple example: building the graphs from the previously processed feature data.

In [1]:
!cat build_graphs.sh

#!/usr/bin/bash 

export MAX_WORKERS=$(python -c "import psutil; print(psutil.cpu_count(logical=False))")

# Usually, you should enable pylint, really
# But because PyTorch generates errors on its own, we'll simplify by just disabling it
# Our code is clean though ;)
USERNAME='mluser' python flows.py --no-pylint \
    run \
        --max-num-splits 7000 \
        --max-workers ${MAX_WORKERS} >> ${HOME}/.kosmoss/logs/build_graphs.stdout
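
The script sizes `MAX_WORKERS` from the number of physical cores. As a hedged aside, `psutil.cpu_count(logical=False)` can return `None` when the count cannot be determined, so a more defensive version of that one-liner might look like the sketch below. The helper name `physical_core_count` is hypothetical, and the fallback to logical cores is an assumption about what you would want in that case.

```python
import os


def physical_core_count():
    """Return the number of physical cores, falling back to logical cores."""
    try:
        import psutil  # third-party; same call as in build_graphs.sh

        count = psutil.cpu_count(logical=False)
    except ImportError:
        count = None
    # psutil may return None when it cannot determine the count, and
    # os.cpu_count() (logical cores) can be None as well, hence the final 1.
    return count or os.cpu_count() or 1


max_workers = physical_core_count()
print(max_workers)
```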

Open the `flows.py` file and debug it.

In [2]:
!cat flows.py

from metaflow import FlowSpec, Parameter, step
import os
import os.path as osp
import shutil

class BuildGraphsFlow(FlowSpec):
    
    # In addition to the standard class properties...
    PROCESSED_DATA_PATH = osp.join(os.environ['HOME'], ".kosmoss", "data", "processed")

    # ...you can just add parameters to be read from the command line
    timestep = Parameter('timestep', help='Temporal sampling step', default=1000)
    num_shards = Parameter('num_shards', help='Number of shards', default=3392)
    dtype = Parameter('dtype', help="NumPy's dtype", default='float32')
    x_shape = Parameter('x_shape', help='Shape for x', default=(160, 136, 20))
    y_shape = Parameter('y_shape', help='Shape for y', default=(160, 138, 4))
    edge_shape = Parameter('edge_shape', help='Shape for edge', default=(160, 137, 27))
        
    @step
    def start(self):
        """
        Create the constants for the rest of the Flow.
        """
        
        import numpy as np
        
        # Ea

In [3]:
!bash build_graphs.sh

[35m[1mMetaflow 2.5.3[0m[35m[22m executing [0m[31m[1mBuildGraphsFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:mluser[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m


The neat thing with Metaflow is that it registers everything in a namespace, and centralizes the logs and artifacts produced for each run. 

This data is then viewable with the commands below. Everything is Python-scriptable, which is a huge advantage.

We launched the run with `USERNAME` set to `'mluser'`, so everything is stored under that namespace.

In [4]:
from kosmoss import CONFIG, PROCESSED_DATA_PATH
import os.path as osp
import shutil

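# Note: shutil.disk_usage reports usage for the whole filesystem that contains
# this path, not the size of the directory itself.
_, used, _ = shutil.disk_usage(osp.join(PROCESSED_DATA_PATH, f"graphs-{CONFIG['timestep']}"))
used // 2 ** 30  # in GiB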

594
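
Keep in mind that `shutil.disk_usage` returns totals for the filesystem holding the given path, so the figure above covers everything stored on that disk. If you want the size of the graphs directory alone, one option is to walk the tree and add up file sizes. This is a minimal sketch; `dir_size_bytes` is a hypothetical helper, and the demo runs on a throwaway directory rather than on the real data.

```python
import os
import tempfile


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


# Small self-check on a temporary directory with two files (1 KiB + 2 KiB).
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "shard-0"))
    with open(os.path.join(tmp, "shard-0", "a.bin"), "wb") as f:
        f.write(b"\0" * 1024)
    with open(os.path.join(tmp, "b.bin"), "wb") as f:
        f.write(b"\0" * 2048)
    demo_size = dir_size_bytes(tmp)

print(demo_size)
```

On the real data you would call `dir_size_bytes` on the same path passed to `shutil.disk_usage` and divide by `2 ** 30` to get GiB.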

In [5]:
from metaflow import Flow, namespace
from pprint import pprint

namespace('user:mluser')
flow = Flow('BuildGraphsFlow')
runs = list(flow)
run0 = runs[0]
run0.data.name

pprint(runs)

[Run('BuildGraphsFlow/1646758431702086'),
 Run('BuildGraphsFlow/1646420527721542'),
 Run('BuildGraphsFlow/1646420491932514'),
 Run('BuildGraphsFlow/1646420303736854')]
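
The run IDs in this listing look like Unix timestamps in microseconds, which seems to be how Metaflow names runs in local mode. Treat that as an assumption rather than a documented guarantee. Under that assumption, a short sketch to turn an ID into a readable date (the helper name `run_id_to_datetime` is hypothetical):

```python
from datetime import datetime, timezone


def run_id_to_datetime(run_id):
    """Interpret a run ID as microseconds since the Unix epoch (assumption)."""
    seconds, micros = divmod(int(run_id), 1_000_000)
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.replace(microsecond=micros)


# Run IDs copied from the listing above.
run_ids = [
    "1646758431702086",
    "1646420527721542",
    "1646420491932514",
    "1646420303736854",
]
decoded = [run_id_to_datetime(rid) for rid in run_ids]
for rid, stamp in zip(run_ids, decoded):
    print(rid, stamp.isoformat())
```

For anything beyond a quick look, the `created_at` attribute exposed by the client objects is the safer source, since it does not depend on how IDs are generated.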


In [6]:
# Isolate the latest Run
run = Flow('BuildGraphsFlow').latest_run

# Get Steps from that Run
steps = list(run.steps())
pprint(steps)

# Isolate Tasks from the Start Step (listed last in the output below)
start_tasks = list(steps[-1].tasks())

[Step('BuildGraphsFlow/1646758431702086/end'),
 Step('BuildGraphsFlow/1646758431702086/join'),
 Step('BuildGraphsFlow/1646758431702086/build_graphs'),
 Step('BuildGraphsFlow/1646758431702086/start')]


In [7]:
# Retrieve the list of artifacts registered at the Start Step
start_artifacts = start_tasks[0].artifacts
list(start_artifacts)

[DataArtifact('BuildGraphsFlow/1646758431702086/start/1/y_shape'),
 DataArtifact('BuildGraphsFlow/1646758431702086/start/1/x_shape'),
 DataArtifact('BuildGraphsFlow/1646758431702086/start/1/timestep'),
 DataArtifact('BuildGraphsFlow/1646758431702086/start/1/num_shards'),
 DataArtifact('BuildGraphsFlow/1646758431702086/start/1/name'),
 DataArtifact('BuildGraphsFlow/1646758431702086/start/1/edge_shape'),
 DataArtifact('BuildGraphsFlow/1646758431702086/start/1/dtype'),
 DataArtifact('BuildGraphsFlow/1646758431702086/start/1/PROCESSED_DATA_PATH'),
 DataArtifact('BuildGraphsFlow/1646758431702086/start/1/shard'),
 DataArtifact('BuildGraphsFlow/1646758431702086/start/1/out_dir')]

In [8]:
start_artifacts.num_shards.data

3392