Skip to content

visualize flow execution

Chris Lu edited this page Jan 2, 2016 · 5 revisions

Generate flow execution DAG

A flow usually consists of many tasks. Glow optimizes the task execution by grouping the tasks, avoiding disk IO as much as possible.

To see the execution graph, run this command, assuming the driver program is called "word_count":

 // generate a x.dot file, which describes a graph
 $ ./word_count --glow --glow.flow.plot > x.dot
 // generate a x.svg file
 $ dot -Tsvg -ox.svg x.dot

You can also generate the graph into png format, etc.

Understand the graph

Take this graph as an exemple:

Glow Hello World Execution Plan

The Source0, Filter1, Map2, Map3, LocalReduce4, MergeReduce5 are actual computation steps. If a step is split into multiple shards, each task is named as [stepName][stepNumber][shardId][totalShards], for example: LocalReduce4_1_3 means the step is called LocalReduce4, shardId is 1, and total number of shards is 3.

Each dataset shard is named as [datasetName][datasetNumber]_[shardId], for example: d4_1 means the dataset is output of step 4(LocalReduce4), shardId is 1.

The shown datasets are actually written to disks. The datasets not written to disks are skipped.

How does Glow optimize flow execution DAG?

Glow only writes dataset to disk in 2 cases:

  1. One task produces output for multiple tasks.
  2. One task consumes outputs from multiple tasks.

This would avoid any unnecessary delay or deadlock caused by waiting. For example, if one task produces outputs for task A and B. If task A runs fast, but task B runs slow, task A should just proceed. Task B can read the dataset later. If the dataset is kept in memory, at some point task A would need to slow down and wait for B, which is not ideal at all. In special cases, task B may be waiting for resources taken by task A. If task A waits for B, deadlock would happen.

Clone this wiki locally