# Introduction

Fugue is an abstraction layer for distributed computing frameworks such as Spark or Dask. The purpose of the abstraction layer is that users should not need to worry about the tool they are using when working with data. Often we'll find that logic written to solve problems is highly tied to a specific compute framework (Pandas, Spark or Dask). There are many problems that come out of this:

1. Users have to learn an entirely new framework to work with distributed compute problems
2. Logic written for a *small data* project does not become reusable for a *big data* project
3. Testing becomes a heavyweight process for distributed compute, especially Spark
4. Along with number 3, iterations for distributed compute problems become slower and more expensive

## Decoupling Logic and Execution

To illustrate the first two pain points above, we'll use a simple code example on a DataFrame with two string columns (*col1* and *col2*). The following operation is contrived, but it shows how Pandas and Spark differ syntactically. We will create a new column (*col3*) by concatenating the first column with the first three letters of the second column, joined by an underscore, and then reversing the whole value. We will execute this logic on Pandas and Spark.


In [1]:
import pandas as pd

data = pd.DataFrame({"col1": ["hello", "apple", "yellow"], "col2": ["world", "banana", "orange"]})
data.head()

,col1,col2
0,hello,world
1,apple,banana
2,yellow,orange


First, we'll perform the operation in Pandas.

In [2]:
def concat_and_reverse(df: pd.DataFrame) -> pd.DataFrame:
    df['col3'] = df['col1'] + '_' + df['col2'].str.slice(0,3)  # concat
    df['col3'] = df['col3'].apply(lambda x: x[::-1])           # reverse
    return df

p_data = data.copy()
pandas_data = concat_and_reverse(p_data)
pandas_data.head()

,col1,col2,col3
0,hello,world,row_olleh
1,apple,banana,nab_elppa
2,yellow,orange,aro_wolley


Next, we'll perform the same operation in Spark.

In [3]:
# Setting up Spark session
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()

In [4]:
from pyspark.sql.functions import concat_ws, col, reverse, substring

s_data = spark.createDataFrame(data)  # this is the previous Pandas DataFrame

def concat_and_reverse_spark(df: DataFrame) -> DataFrame:
    # Spark's substring is 1-based, so position 1 is the first character
    df = df.withColumn('col3', concat_ws('_', col('col1'), substring(col('col2'), 1, 3)))\
           .withColumn('col3', reverse(col('col3')))
    return df

spark_data = concat_and_reverse_spark(s_data)
spark_data.show()

+------+------+----------+
|  col1|  col2|      col3|
+------+------+----------+
| hello| world| row_olleh|
| apple|banana| nab_elppa|
|yellow|orange|aro_wolley|
+------+------+----------+



Looking at the two code examples, we had to reimplement the exact same functionality with completely different syntax. This isn't a cherry-picked example. Data practitioners often have to write two implementations of the same logic, one for each framework, especially as the logic gets more specific. This is where Fugue comes in. With the abstraction layer, users write a single implementation of the function, which can then be applied to Pandas, Spark, and Dask. All we need to do is apply the `transformer` decorator to the Pandas implementation of the function. The decorator takes a string that specifies the output schema.

In [5]:
from fugue import transformer, FugueWorkflow
from fugue_spark import SparkExecutionEngine

@transformer("*, col3:str")
def concat_and_reverse(df: pd.DataFrame) -> pd.DataFrame:
    df['col3'] = df['col1'] + '_' + df['col2'].str.slice(0,3)  # concat
    df['col3'] = df['col3'].apply(lambda x: x[::-1])           # reverse
    return df
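
To make the schema hint concrete: `"*, col3:str"` reads as "all input columns, plus a new string column `col3`". The following is a minimal sketch of that expansion in plain Python. `expand_schema_hint` is a hypothetical helper written for illustration; it is not part of Fugue and does not reflect how Fugue parses schemas internally.

```python
from typing import Dict


def expand_schema_hint(hint: str, input_schema: Dict[str, str]) -> Dict[str, str]:
    """Expand a hint such as "*, col3:str" against an input schema.

    Hypothetical helper for illustration only; not Fugue's parser.
    """
    output: Dict[str, str] = {}
    for part in (p.strip() for p in hint.split(",")):
        if part == "*":
            # The wildcard keeps every input column, in its original order.
            output.update(input_schema)
        else:
            # Any other entry is a "name:type" pair describing a column.
            name, dtype = (s.strip() for s in part.split(":"))
            output[name] = dtype
    return output


input_schema = {"col1": "str", "col2": "str"}
output_schema = expand_schema_hint("*, col3:str", input_schema)
print(output_schema)
```

The expanded schema lists `col1`, `col2`, and `col3`, all as strings, which matches the header Fugue prints when the workflow output is shown.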

By wrapping the function with the decorator, we can then use it inside a `FugueWorkflow`. The `FugueWorkflow` constructs a directed acyclic graph (DAG) where the inputs and outputs are DataFrames. More details will follow, but the important thing for now is to show how it's used. The code block below is still running in Pandas.

In [6]:
f_data = data.copy()
with FugueWorkflow() as dag:
    df = dag.df(f_data)  # Still the original Pandas DataFrame
    df = df.transform(concat_and_reverse)
    df.show()

PandasDataFrame
col1:str|col2:str|col3:str                                                                          
--------+--------+----------------------------------------------------------------------------------
hello   |world   |row_olleh                                                                         
apple   |banana  |nab_elppa                                                                         
yellow  |orange  |aro_wolley                                                                        
Total count: 3



To bring it to Spark, all we need to do is pass `SparkExecutionEngine` into `FugueWorkflow`. Now all the code underneath the `with` statement will run on Spark. We did not make any modifications to `concat_and_reverse` to bring it to Spark. By wrapping the function with `transformer`, it became agnostic to the `ExecutionEngine` it operates on.

In [7]:
f_data = data.copy()

with FugueWorkflow(SparkExecutionEngine) as dag:
    df = dag.df(f_data)  # Still the original Pandas DataFrame
    df = df.transform(concat_and_reverse)
    df.show()

SparkDataFrame
col1:str|col2:str|col3:str                                                                          
--------+--------+----------------------------------------------------------------------------------
hello   |world   |row_olleh                                                                         
apple   |banana  |nab_elppa                                                                         
yellow  |orange  |aro_wolley                                                                        
Total count: 3



## Independence from Frameworks

We said earlier that the abstraction layer Fugue provides makes code independent of any framework. Some users may have realized that `concat_and_reverse` was still written in Pandas, and they would be right. However, we can rewrite this function in base Python and apply it on Pandas and Spark. Below is the implementation in native Python. As before, we run it on Spark by passing in the `SparkExecutionEngine`.

In [8]:
from typing import List, Dict, Any

f_data = data.copy()

# schema: *, col3:str
def concat_and_reverse(df: List[Dict[str,Any]]) -> List[Dict[str,Any]]:
    for row in df:
        row['col3'] = row['col1'] + '_' + row['col2'][0:3]     # concat
        row['col3'] = ''.join(reversed(row['col3']))           # reverse
    return df

with FugueWorkflow(SparkExecutionEngine) as dag:
    df = dag.df(f_data)  # Still the original Pandas DataFrame
    df = df.transform(concat_and_reverse)
    df.show()

SparkDataFrame
col1:str|col2:str|col3:str                                                                          
--------+--------+----------------------------------------------------------------------------------
hello   |world   |row_olleh                                                                         
apple   |banana  |nab_elppa                                                                         
yellow  |orange  |aro_wolley                                                                        
Total count: 3



Notice the `transformer` decorator was removed from `concat_and_reverse`. Instead, it was replaced with a comment that specifies the schema, which Fugue reads. Now this function is truly independent of any framework and written in native Python. **It is even independent of Fugue itself.** Fugue only appears when we reach the execution part of the code; the logic is not coupled to any framework. Keen users may also notice that the type annotations on `concat_and_reverse` caused the DataFrame to be converted before it was passed to the function. If users want to offboard from Fugue, they can use their function with Pandas `apply()` or Spark user-defined functions (UDFs).
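
The conversion driven by type annotations can be pictured with a small sketch. The code below inspects the annotation of the first parameter and converts column-oriented data into the list of dictionaries the function asks for. `run_with_conversion` and the column-oriented input format are assumptions made for illustration; Fugue's actual conversion logic is not shown in this tutorial.

```python
from typing import Any, Callable, Dict, List, get_type_hints

Rows = List[Dict[str, Any]]


def concat_and_reverse(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for row in df:
        row["col3"] = row["col1"] + "_" + row["col2"][0:3]  # concat
        row["col3"] = row["col3"][::-1]                      # reverse
    return df


def run_with_conversion(func: Callable, columns: Dict[str, List[Any]]) -> Any:
    """Hypothetical dispatcher: convert the input to match func's annotation."""
    hints = get_type_hints(func)
    # The first annotated parameter decides which input format is expected.
    first_param = next(name for name in hints if name != "return")
    if hints[first_param] == Rows:
        names = list(columns)
        # Turn {"col": [values...]} into one dictionary per row.
        rows = [dict(zip(names, values)) for values in zip(*columns.values())]
        return func(rows)
    raise TypeError(f"unsupported input annotation: {hints[first_param]}")


columns = {"col1": ["hello", "apple"], "col2": ["world", "banana"]}
result = run_with_conversion(concat_and_reverse, columns)
print(result)
```

Because the function only declares what it wants to receive, the caller is free to hold the data in any format and convert it at the boundary.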

Is the native Python implementation or the Pandas implementation of `concat_and_reverse` better? Is the native Spark implementation better? The main concern of Fugue is clear, readable code. **Users can write code in whatever form expresses their logic best**. There will be cases where the Pandas implementation is faster than native Python, especially because of vectorized operations and the use of C. However, the philosophy of Fugue is clean and reusable logic that is portable across frameworks. We demonstrated that Fugue can be used on Pandas and Spark. To use it on Dask, we simply use the `DaskExecutionEngine`. The compute efficiency lost by using Fugue is unlikely to be significant, especially in comparison to the developer efficiency gained through more rapid iterations and easier maintenance. In fact, Fugue is designed in a way that often yields speedups compared to native Spark code written by inexperienced users. Fugue handles a lot of the tricks necessary to use Spark effectively.

Fugue also future-proofs the code. If one day Spark and Dask are replaced by a more efficient framework, a new ExecutionEngine can be added to Fugue to support that new framework.
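
The separation between logic and execution can be sketched without any framework at all. In the toy example below, `SerialEngine` and `ThreadedEngine` are hypothetical stand-ins invented for illustration; they are not Fugue classes and do not mirror the real `ExecutionEngine` interface. The point is only that the logic function stays untouched when the engine changes.

```python
from concurrent.futures import ThreadPoolExecutor
from copy import deepcopy
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


def concat_and_reverse(df: Rows) -> Rows:
    for row in df:
        row["col3"] = (row["col1"] + "_" + row["col2"][0:3])[::-1]
    return df


class SerialEngine:
    """Hypothetical engine: runs the logic on all rows in one call."""

    def transform(self, rows: Rows, func: Callable[[Rows], Rows]) -> Rows:
        return func(deepcopy(rows))  # copy so the caller's data is untouched


class ThreadedEngine:
    """Hypothetical engine: splits rows into partitions and runs them in threads."""

    def __init__(self, partitions: int = 2) -> None:
        self.partitions = partitions

    def transform(self, rows: Rows, func: Callable[[Rows], Rows]) -> Rows:
        data = deepcopy(rows)
        size = max(1, -(-len(data) // self.partitions))  # ceiling division
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        with ThreadPoolExecutor(max_workers=self.partitions) as pool:
            results = list(pool.map(func, chunks))  # map keeps chunk order
        return [row for chunk in results for row in chunk]


rows = [
    {"col1": "hello", "col2": "world"},
    {"col1": "apple", "col2": "banana"},
    {"col1": "yellow", "col2": "orange"},
]
serial_result = SerialEngine().transform(rows, concat_and_reverse)
threaded_result = ThreadedEngine(partitions=2).transform(rows, concat_and_reverse)
print(serial_result == threaded_result)
```

Supporting a new backend in this toy setup means writing one more class with a `transform` method, while `concat_and_reverse` does not change.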

## Testability and Maintainability

Fugue code becomes easily testable because the functions contain logic that is portable across Pandas, Spark, and Dask. All we have to do is run some values through the defined function. We can test code without spinning up compute resources (such as Spark or Dask clusters), which often take time to start just for a simple test. Now we can test quickly with native Python or Pandas, and then execute on Spark when needed. Developers who use Fugue benefit from more rapid iterations in their data projects.

In [9]:
# Remember the input was List[Dict[str,Any]]
concat_and_reverse([{'col1': 'hello', 'col2': 'world'}, {'col1': 'apple', 'col2': 'banana'}])

[{'col1': 'hello', 'col2': 'world', 'col3': 'row_olleh'},
 {'col1': 'apple', 'col2': 'banana', 'col3': 'nab_elppa'}]
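
The same check can be turned into an automated test. The sketch below uses the standard library's `unittest` module; the test class and method names are illustrative and are not taken from the Fugue documentation. No Spark or Dask cluster is involved.

```python
import unittest
from typing import Any, Dict, List


def concat_and_reverse(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for row in df:
        row["col3"] = row["col1"] + "_" + row["col2"][0:3]  # concat
        row["col3"] = "".join(reversed(row["col3"]))         # reverse
    return df


class ConcatAndReverseTest(unittest.TestCase):
    def test_adds_reversed_column(self):
        rows = [{"col1": "hello", "col2": "world"}]
        expected = [{"col1": "hello", "col2": "world", "col3": "row_olleh"}]
        self.assertEqual(concat_and_reverse(rows), expected)

    def test_short_second_column(self):
        # Slicing tolerates values shorter than three characters.
        rows = [{"col1": "a", "col2": "b"}]
        self.assertEqual(concat_and_reverse(rows)[0]["col3"], "b_a")


# Run the tests explicitly so the snippet also works inside a notebook cell.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ConcatAndReverseTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests like these run in milliseconds on a laptop, and the function they cover is the same one that is later handed to the Spark engine.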

## Fugue as a Mindset

Fugue is a framework, but more importantly, it is a mindset. 

1. Fugue believes that the framework should adapt to the user, not the other way around
2. Fugue lets users express logic in a scale-agnostic way, with the tools they prefer
3. Fugue values readability and maintainability of code over deep framework-specific optimizations

Distributed computing is currently harder than it needs to be. However, these systems often follow similar patterns, which have been abstracted to create a framework that lets users focus on defining their logic. We cover these concepts in the following tutorials. If you're new to distributed computing, Fugue is the perfect place to get started.

## Comparison to Modin and Koalas

Fugue is often compared to Modin and Koalas. Modin is a Pandas interface for execution on Dask, and Koalas is a Pandas interface for execution on Spark. Fugue, Modin, and Koalas have very similar goals in making distributed computing easier. The main difference is that Modin and Koalas use Pandas as the grammar for distributed compute. Fugue, on the other hand, uses native Python and SQL as the grammar for distributed compute.

The clearest example of Pandas not being compatible with Spark is its acceptance of mixed-type columns: a single column can hold both numeric and string values. Spark, on the other hand, is strongly typed. Beyond that, Pandas relies heavily on the index for operations. As users transition to Spark, the index mindset does not hold as well. Order is not always guaranteed in a distributed system, and maintaining a global index adds overhead even when it is not necessary.
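
As a rough picture of why mixed-type columns clash with a strongly typed engine, the sketch below enforces a single type per column on plain Python values. `single_column_type` is a hypothetical check written for illustration; it is not how Spark infers or validates schemas.

```python
from typing import Any, List


def single_column_type(values: List[Any]) -> type:
    """Hypothetical check: return the one type shared by all values, else fail."""
    if not values:
        raise ValueError("cannot determine the type of an empty column")
    types = {type(v) for v in values}
    if len(types) != 1:
        names = sorted(t.__name__ for t in types)
        raise TypeError(f"column mixes types: {names}")
    return types.pop()


uniform_type = single_column_type(["a", "b"])  # a uniform column passes

mixed = [1, "two", 3]  # values like these can coexist in one Pandas column
try:
    single_column_type(mixed)
    mixed_ok = True
except TypeError as err:
    mixed_ok = False
    print(err)
```

Declaring column types up front, as the schema hints in the earlier examples do, avoids this ambiguity before the data reaches a distributed engine.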