# Fugue in 10 minutes

All questions are welcome in the Slack channel.

[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](http://slack.fugue.ai)

This is a short introduction to [Fugue](https://github.com/fugue-project/fugue) geared towards new users. The Fugue project aims to make big data effortless by accelerating iteration speed and providing a simpler interface for users to utilize distributed computing engines.

This tutorial covers the Python interface only. For SQL, check the FugueSQL in 10 minutes section.

Fugue is meant for:
1. Data scientists who need to bring business logic written in Python or Pandas to bigger datasets
2. Data practitioners looking to parallelize existing code with distributed computing
3. Data teams that want to reduce the maintenance and testing of boilerplate Spark code

## Setup

For this tutorial, we first need to run through a few quick setup steps to instantiate a Spark session.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

## Bringing a function to Spark or Dask

The simplest way to scale pandas-based code to Spark or Dask is with the `transform()` function. With this thin wrapper, we can bring existing Pandas and Python code to distributed execution with minimal refactoring. The `transform()` function also provides quality-of-life enhancements that can eliminate boilerplate code for users.



Let's demonstrate how this applies to a typical use case. In the code snippets below, we train a model using scikit-learn and pandas. We then run predictions with this model in parallel by supplying a Spark execution engine as an argument to Fugue's `transform()` function.

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

X = pd.DataFrame({"x_1": [1, 1, 2, 2], "x_2":[1, 2, 2, 3]})
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)

After training our model, we wrap it in a `predict()` function. Bear in mind that this function is still written in pandas, which makes it easy to test on the `input_df` that we create. Wrapping our model in `predict()` will allow us to bridge execution to Spark or Dask.

In [None]:
# define our predict function
def predict(df: pd.DataFrame, model: LinearRegression) -> pd.DataFrame:
    """
    Function to predict results using a pre-built model
    """
    return df.assign(predicted=model.predict(df))

# create test data
input_df = pd.DataFrame({"x_1": [3, 4, 6, 6], "x_2":[3, 3, 6, 6]})

# test the predict function
predict(input_df.copy(), reg)

Now this is where it starts to get interesting. Let's bring the same code to Spark using Fugue's `transform()`. We take our DataFrame and apply the `predict()` function to it using any one of the Pandas, Spark, or Dask engines. The `transform()` parameters will be explained in detail later on, but for now, notice that we made no modifications to the `predict()` function in order to switch the execution from Pandas to Spark. All we have to do is pass in the `SparkSession` as the engine.

In [None]:
# import Fugue
from fugue import transform

# create a spark dataframe
sdf = spark.createDataFrame(input_df)

# use Fugue transform to switch execution to spark
result = transform(
    df=sdf,
    using=predict,
    schema="*,predicted:double",
    params=dict(model=reg),
    engine=spark
)

# display results
print(type(result))
result.show()

The `transform()` function provides much more flexibility for users than what we have just described above. This is just a simple use case designed to give you a flavour of what Fugue has to offer. For this example, the `transform()` function took in the following arguments:

* df     - input DataFrame (can be a pandas, Spark, or Dask DataFrame)
* using  - a Python function with valid input and output types
* schema - output schema of the operation
* params - a dictionary of parameters to pass in the function
* engine - the execution engine to run the operation on (pandas, Spark, or Dask)

We will delve into these arguments in more detail later on, including the roles that type annotations and `schema` play. For now, the most important thing to discuss is the engine.

## Execution Engines

Because we supplied the `spark` variable as the engine, the `predict()` function will be applied on the Spark DataFrame `sdf` using Spark. Similarly, passing a Dask Client as the engine will run the operation on Dask. If users just want to use Spark and Dask default configurations (normally local versions of these libraries), they can also pass a string.

```python
transform(df, fn, ..., engine="spark")
transform(df, fn, ..., engine="dask")
```

If a pandas DataFrame is supplied as the input, it will be converted to that engine's DataFrame type before the operation is applied. Below we see an example of pandas DataFrame input and Dask DataFrame output. This is the same operation as above, except that we pass the `"dask"` string to spin up a Dask client.

In [None]:
# using transform to bring predict to dask execution
result = transform(
    df=input_df.copy(),
    using=predict,
    schema="*,predicted:double",
    params=dict(model=reg),
    engine="dask"
)

# display results
print(type(result))
result.compute().head()

If no `engine` is supplied to transform, the function will be executed on pandas.

While Fugue converts pandas DataFrames to Spark or Dask DataFrames, it will not perform the conversion the other way around. It normally makes more sense to save the output data as a parquet file after transformations. Users can also call Spark's `toPandas()` or Dask's `compute()` if they want to bring the data to Pandas, but Fugue prefers that the user does this explicitly.

## Type Hint Conversion

In the earlier section, we mentioned that Fugue can bring a function with valid input and output types to Spark or Dask. The previous `predict()` function had `pd.DataFrame` in and `pd.DataFrame` out. These type annotations are read in by Fugue to understand how to convert the data before the function is applied. For example, a Spark DataFrame will be converted to multiple pandas DataFrames.
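To make the idea concrete, here is a minimal pure-Python sketch of annotation-driven conversion. It is not Fugue's implementation: `call_with_conversion`, `sum_a`, and `sum_first` are hypothetical names, and only two annotation types are handled. The sketch reads the input annotation with the standard library's `typing.get_type_hints` and reshapes the same rows to match.

```python
from typing import Any, Callable, Dict, List, get_type_hints


def call_with_conversion(fn: Callable, columns: List[str], rows: List[List[Any]]) -> Any:
    """Hypothetical helper: reshape `rows` to match fn's input annotation."""
    hints = get_type_hints(fn)
    hints.pop("return", None)  # keep only the parameter annotations
    input_type = next(iter(hints.values()))  # first annotated parameter
    if input_type == List[Dict[str, Any]]:
        data = [dict(zip(columns, row)) for row in rows]
    elif input_type == List[List[Any]]:
        data = [list(row) for row in rows]
    else:
        raise TypeError(f"unsupported input annotation: {input_type}")
    return fn(data)


def sum_a(df: List[Dict[str, Any]]) -> int:
    # receives one dictionary per row
    return sum(row["a"] for row in df)


def sum_first(df: List[List[Any]]) -> int:
    # receives one list per row
    return sum(row[0] for row in df)


columns = ["a", "b"]
rows = [[1, 2], [3, 4]]
total_from_dicts = call_with_conversion(sum_a, columns, rows)
total_from_lists = call_with_conversion(sum_first, columns, rows)
```

Both calls start from the same data; only the annotation on the function decides which shape it receives.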

Fugue can also take in other DataFrame-like input and output types. Take the following function, which sums each row into a new `total` column and drops any row whose total is 10 or greater.

In [None]:
df = pd.DataFrame({"a": [1,2,3,4], "b": [1,2,3,4], "c": [1,2,3,4]})

def add_row(df: pd.DataFrame) -> pd.DataFrame:
    df = df.assign(total=df.sum(axis=1))
    df = df.loc[df["total"] < 10]
    return df

This can be run using `transform()` without passing an `engine`.

In [None]:
transform(df, add_row, schema="*,total:int")

This same logic can be represented in multiple ways:

In [None]:
from typing import List, Iterable, Any, Dict

def add_row2(df: List[Dict[str,Any]]) -> List[Dict[str,Any]]:
    result = []
    for row in df:
        row["total"] = row["a"] + row["b"] + row["c"]
        if row["total"] < 10:
            result.append(row)
    return result

def add_row3(df: List[List[Any]]) -> Iterable[List[Any]]:
    for row in df:
        row.append(sum(row))
        if row[-1] < 10:
            yield row

The input type annotation tells Fugue what to convert the input data to before the function is run. The output type annotation informs Fugue how to convert it back to a Pandas, Spark, or Dask DataFrame. Notice that these functions do not even depend on pandas and can be tested easily. For example:

In [None]:
print(add_row2([{"a": 1, "b": 2, "c": 3}]))
print(list(add_row3([[1,2,3]])))

This is one of the core offerings of Fugue. **Testing code that uses Spark or Dask is hard because of the dependency on the hardware.** Even if running the tests locally, iteration speed is significantly slower than using pandas. This setup allows developers to unit test Python or pandas code, and bring it to Spark or Dask when ready.
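As a sketch of what such a unit test could look like, the block below checks `add_row2` with plain assertions and no Spark or Dask dependency. The test function names are illustrative, and the `test_` prefix assumes a pytest-style runner; the function body is repeated from above only so the block runs on its own.

```python
from typing import Any, Dict, List


def add_row2(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # same logic as add_row2 above, repeated so this block is self-contained
    result = []
    for row in df:
        row["total"] = row["a"] + row["b"] + row["c"]
        if row["total"] < 10:
            result.append(row)
    return result


def test_keeps_rows_below_threshold() -> None:
    rows = [{"a": 1, "b": 2, "c": 3}]
    assert add_row2(rows) == [{"a": 1, "b": 2, "c": 3, "total": 6}]


def test_drops_rows_at_or_above_threshold() -> None:
    rows = [{"a": 4, "b": 4, "c": 4}]  # total is 12
    assert add_row2(rows) == []


# a pytest-style runner would discover these; calling them directly also works
test_keeps_rows_below_threshold()
test_drops_rows_at_or_above_threshold()
```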

These definitions are compatible with `transform()` across all execution engines. For example, we can use `add_row2` with the Spark engine.

In [None]:
transform(df, add_row2, schema="*,total:int", engine=spark).show()

This is a list of acceptable input and output types. They can be mixed and matched. There is no need to memorize them; in fact, a lot of users tend to use only a few of them. The point is that users can write their functions in the grammar of their choice.

**Acceptable input DataFrame types**

* `LocalDataFrame`, `pd.DataFrame`, `List[List[Any]]`, `Iterable[List[Any]]`, `EmptyAwareIterable[List[Any]]`, `List[Dict[str, Any]]`, `Iterable[Dict[str, Any]]`, `EmptyAwareIterable[Dict[str, Any]]`

**Acceptable output DataFrame types**

* `LocalDataFrame`, `pd.DataFrame`, `List[List[Any]]`, `Iterable[List[Any]]`, `EmptyAwareIterable[List[Any]]`, `List[Dict[str, Any]]`, `Iterable[Dict[str, Any]]`, `EmptyAwareIterable[Dict[str, Any]]`

## Partitioning

The input type conversion happens at the partition level, not on the entire DataFrame. If no partitions are supplied, the default engine partitions are used. To get a better sense of partitions, look at the following function that gets the size of the DataFrame.

In [None]:
df = pd.DataFrame({"a": [1,2,3,4], "b": [1,2,3,4], "c": [1,2,3,4]})

def size(df: pd.DataFrame) -> Iterable[Dict[str,Any]]:
    yield {"size":df.shape[0]}

Then we run it on Dask using the `transform()` function.

In [None]:
transform(df, size, schema="size:int", engine="dask").compute().head()

The output has 4 rows, each with a value of 1. This tells us the `size()` function ran on 4 partitions of data, each with one row. The type hint conversion happens on each partition. The concept of partitioning is important for distributed computing. When dealing with big data, it's more effective to find logically independent groups of data that can serve as the partitioning strategy. Take the following DataFrame:

In [None]:
df = pd.DataFrame({"col1": ["a","a","a","b","b","b"], 
                   "col2": [1,2,3,4,5,6]})

Suppose we want to get the min and max values of `col2` for each group in `col1`. First, we define the logic for one group of data. Again, we take advantage of the type hint conversions.

In [None]:
def min_max(df:pd.DataFrame) -> List[Dict[str,Any]]:
    return [{"group": df.iloc[0]["col1"], 
             "max": df['col2'].max(), 
             "min": df['col2'].min()}]

We can specify the partitioning strategy on the `transform()` function by doing:

In [None]:
transform(df, 
          min_max, 
          schema="group:str, max:int, min:int",
          partition={"by": "col1"})

On pandas, the `partition-transform` semantic is close to a `groupby-apply`. The difference is that the `partition-transform` paradigm also extends to distributed computing where we control the movement of the physical location of the data. Again, the expression above will also work on Spark and Dask by supplying the engine.
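The partition-transform semantic can be emulated on a single machine in a few lines of plain Python. This is only a sketch of the idea, not how Fugue executes it: `partition_transform` is a hypothetical helper, and the rows are dictionaries mirroring the `col1`/`col2` DataFrame above.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def partition_transform(rows: List[Row], fn: Callable[[List[Row]], List[Row]], by: str) -> List[Row]:
    """Hypothetical helper: split rows by a key column, apply fn per group."""
    groups: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    output: List[Row] = []
    for partition in groups.values():
        # fn only ever sees one logical group at a time
        output.extend(fn(partition))
    return output


def min_max(partition: List[Row]) -> List[Row]:
    values = [row["col2"] for row in partition]
    return [{"group": partition[0]["col1"], "max": max(values), "min": min(values)}]


rows = [{"col1": key, "col2": value}
        for key, value in zip(["a", "a", "a", "b", "b", "b"], [1, 2, 3, 4, 5, 6])]
summary = partition_transform(rows, min_max, by="col1")
```

Because the function only depends on the rows of its own group, each group could be processed on a different machine, which is what a distributed engine does.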

In [None]:
transform(df, 
          min_max, 
          schema="group:str,max:int,min:int",
          partition={"by": "col1"},
          engine="dask").compute()

### Presort

During the partition operation, we can specify a `presort` so that the data comes in sorted before the function is applied. For example, we can get the top 2 rows of each group using:

In [None]:
def top_two(df:List[Dict[str,Any]]) -> Iterable[Dict[str,Any]]:
    n = 0
    while n < 2:
        yield df[n]
        n = n + 1

transform(df, top_two, schema="*", partition={"by":"col1", "presort": "col2 desc"})
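The effect of `presort` can be pictured as sorting each group before the function sees it. The sketch below emulates that in plain Python under the assumption of a single sort column; `presorted_transform` is a hypothetical helper and not part of Fugue.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def presorted_transform(
    rows: List[Row],
    fn: Callable[[List[Row]], List[Row]],
    by: str,
    sort_key: str,
    descending: bool = False,
) -> List[Row]:
    """Hypothetical helper: group rows, sort each group, then apply fn."""
    groups: Dict[Any, List[Row]] = {}
    for row in rows:
        groups.setdefault(row[by], []).append(row)
    output: List[Row] = []
    for partition in groups.values():
        ordered = sorted(partition, key=lambda r: r[sort_key], reverse=descending)
        output.extend(fn(ordered))
    return output


def first_two(partition: List[Row]) -> List[Row]:
    # relies on the partition arriving already sorted
    return partition[:2]


rows = [{"col1": key, "col2": value}
        for key, value in zip(["a", "a", "a", "b", "b", "b"], [1, 2, 3, 4, 5, 6])]
top_rows = presorted_transform(rows, first_two, by="col1", sort_key="col2", descending=True)
```

Slicing with `[:2]` also avoids an index error when a group has fewer than two rows.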
        

### Partition Validation

Some functions, such as `top_two()` above, require the DataFrame to be partitioned and sorted in a specific way to return correct results. We can define a partition validation check to make sure that the DataFrame is partitioned correctly before the function is applied. The comments above the function will be read and applied. Even if Fugue is not used, they serve as helpful comments.

In [None]:
# partitionby_has: col1
# presort_is: col2 desc
def top_two(df:List[Dict[str,Any]]) -> Iterable[Dict[str,Any]]:
    n = 0
    while n < 2:
        yield df[n]
        n = n + 1

transform(df, top_two, schema="*", partition={"by":"col1", "presort": "col2 desc"})

To see the error that is raised if we don't partition correctly:

In [None]:
try:
    transform(df, top_two, schema="*", partition={"by":"col1"})
except Exception as e:
    print(e)
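The validation idea can be sketched without Fugue: parse the rule comments into key/value pairs and compare them with the partition specification. Both helpers below are hypothetical and cover only the two rules used in this section; the comment text is passed in as a string to keep the sketch simple.

```python
from typing import Any, Dict


def parse_rules(comment_block: str) -> Dict[str, str]:
    """Hypothetical helper: turn '# key: value' lines into a dictionary."""
    rules: Dict[str, str] = {}
    for line in comment_block.splitlines():
        line = line.strip()
        if line.startswith("#") and ":" in line:
            key, value = line.lstrip("#").split(":", 1)
            rules[key.strip()] = " ".join(value.split())
    return rules


def check_partition(rules: Dict[str, str], partition: Dict[str, Any]) -> None:
    """Hypothetical helper: raise ValueError if the partition breaks a rule."""
    by = partition.get("by", [])
    by = [by] if isinstance(by, str) else list(by)
    required = [c.strip() for c in rules.get("partitionby_has", "").split(",") if c.strip()]
    missing = [c for c in required if c not in by]
    if missing:
        raise ValueError(f"partition keys must include {missing}")
    expected_sort = rules.get("presort_is")
    actual_sort = " ".join(str(partition.get("presort", "")).split())
    if expected_sort is not None and actual_sort != expected_sort:
        raise ValueError(f"presort must be {expected_sort!r}, got {actual_sort!r}")


rules = parse_rules("# partitionby_has: col1\n# presort_is: col2 desc")
check_partition(rules, {"by": "col1", "presort": "col2 desc"})  # passes silently

try:
    check_partition(rules, {"by": "col1"})  # missing presort
    validation_error = None
except ValueError as err:
    validation_error = str(err)
```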

### Partitioning Strategies (Advanced)

There are other partitioning strategies that can be used aside from passing in columns to divide the data by. For more information, see [partitioning](advanced/partition.ipynb).

**Algo**
* even - enforces an even number of items per partition
* rand - randomly shuffles data
* hash - uses hashing to partition the data (similar to Spark default)

**Num**
* A number of partitions can be supplied as the partitioning strategy. This does not work for the pandas-based engine, where the whole DataFrame is treated as a single partition and a partition count has no meaning.

These strategies can also be used with `presort`.

In [None]:
def no_op(df:List[Dict[str,Any]]) -> Iterable[Dict[str,Any]]:
    yield df[0]

# by number
transform(df, no_op, schema="*", partition={"num":4}, engine="dask").compute()

# by algorithm
transform(df, no_op, schema="*", partition={"algo":"even"}, engine="dask").compute()

Remember that `transform()` runs on each partition of data. When using the pandas-based engine, the whole DataFrame is treated as one partition. For example, if you are normalizing a column of data, it will be applied on a per partition basis.

If you need to perform an operation that requires global max/mean/min, then Fugue has a more advanced interface for that called `FugueWorkflow`, or you can use the native Spark/Dask operations after a `transform()` call.
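A small pure-Python sketch shows why this matters. Scaling values by the maximum gives different answers depending on whether the maximum is taken per partition or over all the data; the partition contents here are made-up illustrative numbers.

```python
from typing import Dict, List


def scale_by_max(values: List[float]) -> List[float]:
    # divide every value by the largest value in the list it was given
    peak = max(values)
    return [value / peak for value in values]


# two partitions of the same column (illustrative data)
partitions: Dict[str, List[float]] = {"a": [1.0, 2.0, 4.0], "b": [5.0, 10.0]}

# what a per-partition function computes: each partition uses its own max
per_partition = {key: scale_by_max(values) for key, values in partitions.items()}

# what a global normalization would compute: one max over all rows
all_values = [value for values in partitions.values() for value in values]
global_scaled = scale_by_max(all_values)
```

In partition `a` the value 4.0 scales to 1.0 because it is the local maximum, while globally it scales to 0.4.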

## Schema

We have seen a couple of `transform()` calls by now and each of them has had the `schema` passed in. The schema is a requirement for Spark and heavily recommended for Dask. When data lives across multiple machines, schema inference can be very expensive and slow. Without an explicit schema, operations can also be silently inconsistent.

Fugue enforces best practices so that code can run effectively at scale. Here we see how to use Fugue's representation, which is minimal compared to Spark's. 

The Fugue project has a utility called `triad`, which contains the `Schema` class. In practice, you will just need to interact with the string representation.

In [None]:
from triad.collections.schema import Schema

s = Schema("a:int, b:str")
s == "a:int,b:str"

For this section we create a DataFrame to use throughout the examples:

In [None]:
df = pd.DataFrame({"a": [1,2,3], "b": [1,2,3], "c": [1,2,3]})

**Adding a column**

When using `transform()`, the `*` in a schema expression means all existing columns. From there we can add new columns by adding `",col:type"`.

In [None]:
def add_col(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(new_col=df["a"] + 1)

transform(df, add_col, schema="*,new_col:int")

**Entirely new schema**

There is no need to use the `*` operation. We can just specify all columns.

In [None]:
def new_df(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"x": [1,2,3]})

transform(df, new_df, schema="x:int")

**Dropping Columns**

To drop a column, use `-col` without `","`.

In [None]:
def drop_col(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop("b", axis=1)

transform(df, drop_col, schema="*-b")

**Altering Types**

If a column remains but its type is altered, use `+col:type`.

In [None]:
def alter_col(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(a=df['a'].astype("str")+"a")

transform(df, alter_col, schema="*+a:str")

**Drop if Present**

Use `~` to drop a column from the result if it is present.

In [None]:
def no_op(df: pd.DataFrame) -> pd.DataFrame:
    return df

transform(df, no_op, schema="*~b")
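The operators covered so far can be summarized with a toy expander that applies a schema expression to an input schema. This is a hypothetical illustration, not Fugue's or `triad`'s parser: `expand_schema` is an invented name, it handles only the simple forms shown in this section, and treating `-col` on a missing column as an error is an assumption.

```python
import re
from typing import Dict


def expand_schema(expr: str, input_schema: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical helper: resolve a schema expression to column -> type."""
    result: Dict[str, str] = {}
    for part in (piece.strip() for piece in expr.split(",")):
        if part.startswith("*"):
            result.update(input_schema)  # "*" keeps all existing columns
            # operators written directly after "*": -col, ~col, +col:type
            for op, name, col_type in re.findall(r"([-~+])(\w+)(?::(\w+))?", part[1:]):
                if op == "-":
                    if name not in result:
                        raise KeyError(f"cannot drop missing column {name!r}")
                    del result[name]
                elif op == "~":
                    result.pop(name, None)  # drop only if present
                else:
                    result[name] = col_type  # alter the type in place
        else:
            name, col_type = part.split(":")
            result[name.strip()] = col_type.strip()
    return result


input_schema = {"a": "int", "b": "int", "c": "int"}
added = expand_schema("*,new_col:int", input_schema)
dropped = expand_schema("*-b", input_schema)
altered = expand_schema("*+a:str", input_schema)
tolerant = expand_schema("*~z", input_schema)
```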

**Schema Result Mismatch**

If the `transform()` output has columns not in the defined schema, they will not be returned.

If the `transform()` output has an inconsistent type with the defined schema, it will be coerced.

In [None]:
def no_op(df: pd.DataFrame) -> pd.DataFrame:
    return df

transform(df, no_op, schema="a:float")

**Defining Schema**

We only showed one way to define the `schema` inside the `transform()` call. It can also be applied on the function as a comment similar to the partition validation. If done this way, it should not be supplied to the `transform()` function anymore. 

Fugue has no preference. The advantage of passing the schema as a string is that it can be manipulated programmatically, while a comment can't.

In [None]:
# schema: a:int
def no_op(df: pd.DataFrame) -> pd.DataFrame:
    return df

transform(df, no_op)

## File IO with `transform()`

**Input file**

The `transform()` function also supports loading parquet files directly from a file path before the function is applied. With this, users don't need to worry about passing a Pandas or Spark DataFrame to `transform()`. The file will be loaded by the `engine`.

**Output file**

Users can also write out the result of the `transform()` operation as a parquet file. This can be done by providing a `save_path`.

Only parquet files are supported because CSVs often need additional configuration that makes them harder to deal with. Parquet files are also a best practice for big data because they are partitioned and hold schema information.

Absolute paths are also a best practice for big data.

In [None]:
df = pd.DataFrame({"a": [1,2,3], "b": [1,2,3], "c": [1,2,3]})
df.to_parquet("/tmp/df.parquet")

In [None]:
def drop_col(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop("b", axis=1)

transform("/tmp/df.parquet",
          drop_col,
          schema="*-b",
          engine=spark,
          save_path="/tmp/processed.parquet")

pd.read_parquet("/tmp/processed.parquet/").head()

This expression makes it easy for users to toggle between running pandas with sampled data, and Spark or Dask on the full dataset.

## Conclusion

The Fugue `transform()` function is the simplest interface for Fugue. It handles the execution of one function across Pandas, Spark, and Dask. Most users can easily adopt this minimal wrapper to parallelize code that they have already written. For end-to-end workflows, see [FugueWorkflow](/beginner/decoupling_logic_and_execution.html#transform-versus-fugueworkflow).

For any questions, feel free to reach out on [Slack](http://slack.fugue.ai).