# Workflows

Have questions? Chat with us on GitHub or Slack:

[![Homepage](https://img.shields.io/badge/fugue-source--code-red?logo=github)](https://github.com/fugue-project/fugue)
[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](http://slack.fugue.ai)

So far we've used Fugue's `transform()` function to port Pandas code to Spark without any rewrites. The `transform()` function is very flexible so it can handle functions with varying input and output types.

Decoupling logic and execution is one of the primary motivations of Fugue. This is meant to solve the following problems:

1. Users have to learn an entirely new framework to work with distributed computing problems
2. Logic written for a *small data* project is not reusable for a *big data* project
3. Testing becomes a heavyweight process for distributed computing, especially Spark
4. As a consequence of number 3, iterating on distributed computing problems becomes slower and more expensive

Fugue's core principle is to minimize code dependency on frameworks as much as possible, which provides flexibility and portability. **By decoupling logic and execution, we can focus on our logic in a scale-agnostic way and then choose which execution engine to use when the need arises.** In this section, we look at how to move from the `transform()` function to end-to-end workflows with `FugueWorkflow()`.

## `transform` versus `FugueWorkflow`

While the `transform()` function is good for running a single function across multiple execution engines, Fugue also has `FugueWorkflow`, which can be used to build engine-agnostic end-to-end workflows. `FugueWorkflow()` constructs a directed acyclic graph (DAG) where the inputs and outputs are DataFrames. Because no engine is passed in, the code block below runs on the Pandas-based engine.

In [None]:
from fugue import FugueWorkflow

with FugueWorkflow() as dag:
    df = dag.df(data.copy())
    df = df.transform(map_phone_to_location, schema="*, location:str")
    df.show()

PandasDataFrame
phone:str                                                      |location:str             
---------------------------------------------------------------+-------------------------
(217)-123-4567                                                 |Champaign, IL            
(217)-234-5678                                                 |Champaign, IL            
(407)-123-4567                                                 |Orlando, FL              
(407)-234-5678                                                 |Orlando, FL              
(510)-123-4567                                                 |Fremont, CA              
Total count: 5



To bring it to Spark, all we need to do is pass the engine into `FugueWorkflow`, similar to how we used the `transform()` function in the last sections. All the code underneath the `with` statement will run on Spark. We did not make any modifications to `map_phone_to_location` to bring it to Spark. The `df.transform()` call below converts it to a Fugue `Transformer` at runtime by using the type annotations and schema provided. We can use the same function in Spark, Dask, or Ray without making modifications. Any function compatible with the `transform()` function will work in the `FugueWorkflow` transform call.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [None]:
with FugueWorkflow(spark) as dag:
    df = dag.df(data.copy())  # Still the original Pandas DataFrame
    df = df.transform(map_phone_to_location, schema="*, location:str")
    df.show()

SparkDataFrame
phone:str                                                      |location:str             
---------------------------------------------------------------+-------------------------
(217)-123-4567                                                 |Champaign, IL            
(217)-234-5678                                                 |Champaign, IL            
(407)-123-4567                                                 |Orlando, FL              
(407)-234-5678                                                 |Orlando, FL              
(510)-123-4567                                                 |Fremont, CA              
Total count: 5



If we called `transform()` on five different functions to bring them to Spark, we would need to specify the Spark engine five times. `FugueWorkflow` lets us specify the engine once and run the entire computation on Pandas, Spark, Dask, or Ray. The two approaches are similar in principle: both leave the original functions decoupled from the execution environment.
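
To make the decoupling concrete, here is a minimal sketch of what a function like `map_phone_to_location` could look like when written against native Python types. This is a hypothetical variant, not the definition from the earlier section: the lookup table `_AREA_CODE_TO_LOCATION` is an assumption inferred from the outputs shown above, and the `List[Dict[str, Any]]` annotation is used here because Fugue's documentation lists it among the supported input and output types.

```python
from typing import Any, Dict, List

# Assumed lookup table, inferred from the outputs shown above.
_AREA_CODE_TO_LOCATION = {
    "217": "Champaign, IL",
    "407": "Orlando, FL",
    "510": "Fremont, CA",
}


def map_phone_to_location(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return new rows with a `location` field derived from the area code."""
    result = []
    for row in rows:
        area_code = row["phone"][1:4]  # "(217)-123-4567" -> "217"
        # Unknown area codes map to None instead of raising.
        result.append({**row, "location": _AREA_CODE_TO_LOCATION.get(area_code)})
    return result


# The function is plain Python, so it can be unit tested without any engine.
sample = [{"phone": "(217)-123-4567"}, {"phone": "(510)-123-4567"}]
mapped = map_phone_to_location(sample)
```

Nothing in the function refers to Pandas, Spark, Dask, or Ray. The engine is chosen only where the function is handed to `transform()` or `FugueWorkflow`, which is what keeps the logic testable on a laptop.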

## The Directed Acyclic Graph (DAG)

The `FugueWorkflow` is responsible for constructing a Directed Acyclic Graph, also called a DAG. A lot of people associate the DAG concept with workflow orchestration tools like Airflow, Prefect, or Dagster. While these tools also use DAGs, they use them differently from distributed computing frameworks such as Spark and Dask. For orchestration frameworks, the DAG is used to manage dependencies of scheduled tasks. For computing frameworks, the DAG represents a computation graph that is built, validated, and then executed. DAGs are used because distributed computing operations are very expensive and leave a lot of room for optimization, and because mistakes in a distributed setting are costly.

Fugue follows these distributed computing frameworks in using the DAG for validation before execution. Because the whole graph is known before anything runs, DAGs can catch errors significantly earlier, in a way similar to compiling the computing job. For Fugue specifically, the built DAG validates schema and provides the basis for further optimizations. For example, Fugue can detect which DataFrames are re-used in the computation graph and then persist them automatically to avoid recomputation. The DAG is a graph where the nodes are DataFrames connected by Fugue extensions, and schema is tracked throughout. We already introduced the most common extension, which is the `transformer`. More extensions will be introduced later.
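
The build, validate, then execute idea can be illustrated with a toy example. The sketch below is not Fugue's implementation: `ToyWorkflow` and its methods are hypothetical names, and it models only a linear chain of steps (the simplest DAG), where each step declares the columns it requires and the columns it adds.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
RowFunc = Callable[[List[Row]], List[Row]]
Step = Tuple[str, RowFunc, List[str], List[str]]


class ToyWorkflow:
    """Records steps first, validates column flow, and only then runs them."""

    def __init__(self, input_columns: List[str]) -> None:
        self.input_columns = list(input_columns)
        self.steps: List[Step] = []

    def add(self, name: str, func: RowFunc,
            requires: List[str], adds: List[str]) -> "ToyWorkflow":
        # Building the graph runs nothing; it only records the step.
        self.steps.append((name, func, list(requires), list(adds)))
        return self

    def validate(self) -> List[str]:
        """Return the final column list, or raise before any data is touched."""
        columns = list(self.input_columns)
        for name, _, requires, adds in self.steps:
            missing = [c for c in requires if c not in columns]
            if missing:
                raise ValueError(f"step '{name}' needs missing columns {missing}")
            columns += [c for c in adds if c not in columns]
        return columns

    def run(self, rows: List[Row]) -> List[Row]:
        self.validate()  # fail fast, before the expensive part
        for _, func, _, _ in self.steps:
            rows = func(rows)
        return rows


calls: List[str] = []  # records which step functions actually executed


def add_area_code(rows: List[Row]) -> List[Row]:
    calls.append("add_area_code")
    return [{**r, "area_code": str(r["phone"])[1:4]} for r in rows]


good = ToyWorkflow(["phone"]).add("area", add_area_code, ["phone"], ["area_code"])
# This workflow asks for a column that does not exist in the input.
bad = ToyWorkflow(["phone"]).add("area", add_area_code, ["phone_number"], ["area_code"])
```

Calling `bad.run(...)` raises a `ValueError` during validation, so `add_area_code` is never invoked. In a distributed setting, that is the difference between failing immediately and failing after a cluster has already done part of the work.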

## Loading and Saving Data

Load and save operations are done inside the `FugueWorkflow` and use the appropriate saver/loader for the file extension (.csv, .json, .parquet, .avro) and ExecutionEngine (Pandas, Spark, or Dask). For distributed computing, Parquet and Avro tend to be the most used because of their compression.

In [6]:
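import pandas as pd

# Assumed definition of data2, reconstructed from the output shown below
data2 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})

with FugueWorkflow() as dag:
    df = dag.df(data2)
    # The file extension selects the saver and loader
    df.save("/tmp/data.parquet", mode="overwrite", single=True)
    df.save("/tmp/data.csv", mode="overwrite", header=True)
    df2 = dag.load("/tmp/data.parquet")
    # The schema for the CSV file is passed explicitly through `columns`
    df3 = dag.load("/tmp/data.csv", header=True, columns="col1:int,col2:int")
    df3.show()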

PandasDataFrame
col1:int|col2:int
--------+--------
1       |2       
2       |3       
3       |4       
Total count: 3
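
The extension-based dispatch described in this section can be pictured with a small standard-library sketch. It is only an illustration of the pattern, not how Fugue implements it: the `save` and `load` helpers and the handler tables are hypothetical, and only `.csv` and `.json` are covered.

```python
import csv
import json
import os
import tempfile
from typing import Callable, Dict, List

Rows = List[Dict[str, str]]


def _save_csv(rows: Rows, path: str) -> None:
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def _load_csv(path: str) -> Rows:
    with open(path, newline="") as f:
        return [dict(r) for r in csv.DictReader(f)]


def _save_json(rows: Rows, path: str) -> None:
    with open(path, "w") as f:
        json.dump(rows, f)


def _load_json(path: str) -> Rows:
    with open(path) as f:
        return json.load(f)


# One handler per extension; supporting a new format means adding an entry.
_SAVERS: Dict[str, Callable[[Rows, str], None]] = {".csv": _save_csv, ".json": _save_json}
_LOADERS: Dict[str, Callable[[str], Rows]] = {".csv": _load_csv, ".json": _load_json}


def _handler_for(table: dict, path: str):
    ext = os.path.splitext(path)[1].lower()
    if ext not in table:
        raise NotImplementedError(f"no handler for extension '{ext}'")
    return table[ext]


def save(rows: Rows, path: str) -> None:
    _handler_for(_SAVERS, path)(rows, path)


def load(path: str) -> Rows:
    return _handler_for(_LOADERS, path)(path)


rows = [{"col1": "1", "col2": "2"}, {"col1": "2", "col2": "3"}]
round_trips = {}
with tempfile.TemporaryDirectory() as tmp:
    for ext in (".csv", ".json"):
        path = os.path.join(tmp, "data" + ext)
        save(rows, path)
        round_trips[ext] = load(path)
```

The values in `rows` are strings on purpose: a CSV file carries no type information, so everything read back from it is text. That is presumably why the Fugue example passes `columns="col1:int,col2:int"` when loading the CSV file but not when loading the Parquet file.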



## Summary

In this section we covered the DAG concept, which can be used to define full end-to-end framework-agnostic workloads. We also covered how to define schema and pass in parameters. Combined with loading and saving of files, users can already start using Fugue for working with data.