# Motivation for Fugue

Before we talk about Fugue, we first need to go through the scenarios that lead data practitioners to use distributed compute frameworks such as Spark and Dask. This will help us understand the "why" behind Fugue. Skip to the "Fugue Transform" section if you are already familiar with Spark or Dask.

When data is small enough to fit on a laptop, it's very easy for data practitioners to iterate on data projects. Most commonly, data practitioners use pandas and numpy for their data analysis and feature engineering needs. These tools go well with scikit-learn to provide a stack capable of handling the end-to-end machine learning pipeline. This works great until the data becomes too big to fit on a single machine. At that point, there are three common options to scale the solution:

1. Sampling
2. Increasing Hardware Vertically
3. Utilizing Distributed Compute Frameworks (Spark and Dask)

## Sampling

Sampling is the most common way of dealing with bigger data. Practitioners can sample their datasets and iterate on a data project with smaller data. Assuming that the distribution of the sample is representative of the whole dataset, sampling allows users to use the exact same codebase without modifications. This method stops working when the solution has to be applied to the full dataset. For example, we can train a machine learning model on a subset of data, but we may still need to generate predictions for all the data coming in, and that dataset may be too big to fit in memory.

At this point, we would need to write extra code to handle splitting files into batches so that they fit in memory. After that, we could run the model to get predictions one batch at a time. Ultimately, this solution does not scale well and requires a lot of maintenance, which leads us to the next solution.
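As a rough illustration of the batching approach described above, the sketch below reads a CSV in chunks with `pd.read_csv(chunksize=...)` and predicts one batch at a time. The `predict_batch` function is a hypothetical stand-in for a trained model (it hard-codes y = x_1 + 2 * x_2 + 3), and the small in-memory CSV stands in for a file that would not fit in memory.

```python
import io

import pandas as pd


# Hypothetical stand-in for a trained model: y = 1 * x_1 + 2 * x_2 + 3
def predict_batch(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(predicted=df["x_1"] * 1.0 + df["x_2"] * 2.0 + 3.0)


# A tiny in-memory CSV stands in for a file too large to load at once
csv_data = io.StringIO("x_1,x_2\n3,3\n4,3\n6,6\n6,6\n")

results = []
# chunksize=2 makes read_csv yield DataFrames of at most two rows each
for chunk in pd.read_csv(csv_data, chunksize=2):
    results.append(predict_batch(chunk))

# Collected here only for display; a real job would write each batch to storage
predictions = pd.concat(results, ignore_index=True)
print(predictions)
```

Even in this small form, the batching loop, the choice of chunk size, and the handling of results are all extra code that has nothing to do with the model itself.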

## Increasing Hardware

Pandas is great for small datasets, but unfortunately it does not scale well to large datasets. The primary reason is that Pandas runs on a single core, so it often leaves available compute resources unused. A lot of operations also generate [intermediate copies](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html#scaling-to-large-datasets) of data, using more memory than necessary.

If the dataset becomes too big for Pandas to handle, some practitioners work around this by vertically scaling the hardware. This can take the form of spinning up a more powerful virtual machine on a cloud provider like AWS and then running the Pandas code there. This works, but there are limits to vertical scaling. For one, it is more expensive to scale vertically than horizontally.

Vertical scaling involves keeping a powerful machine on at all times, leading to underutilization during less intensive operations. Horizontal scaling, on the other hand, involves spinning up multiple machines and distributing the compute job across them. This is what is called distributed computing.

## Utilizing Distributed Compute

This leads us to frameworks such as Spark and Dask. These frameworks allow us to split compute jobs across multiple machines. Dask is the easier transition from Pandas because its DataFrame is built on top of Pandas DataFrames.

## Fugue Transform

The question then becomes: what is the easiest way to take advantage of Spark and Dask? Fugue is an abstraction layer designed to provide a seamless transition from local compute to distributed compute. Fugue allows users to take advantage of the Spark and Dask computation engines while writing Python, Pandas, and SQL code. This allows users to focus on the problems they are trying to solve, rather than learning a new framework for the job.

The most basic way to use Fugue to scale Pandas-based code to Spark is the `transform` function. To demonstrate, we'll adapt the code from the [Sklearn Linear Regression example](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

X = pd.DataFrame({"x_1": [1, 1, 2, 2], "x_2":[1, 2, 2, 3]})
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)

After training our model, we then wrap it in a function to be used for production. This function is still written in Pandas. We can easily test it on the `input_df` that we create.

In [2]:
def predict(df: pd.DataFrame, model: LinearRegression) -> pd.DataFrame:
    return df.assign(predicted=model.predict(df))

input_df = pd.DataFrame({"x_1": [3, 4, 6, 6], "x_2":[3, 3, 6, 6]})

# test the function
predict(input_df.copy(), reg)

|   | x_1 | x_2 | predicted |
|---|-----|-----|-----------|
| 0 | 3 | 3 | 12.0 |
| 1 | 4 | 3 | 13.0 |
| 2 | 6 | 6 | 21.0 |
| 3 | 6 | 6 | 21.0 |


And now we bring it to Spark using Fugue. Fugue has a function called `transform` that takes in a DataFrame and applies a function to it. We'll explain the inputs that go into this function in a bit (though you probably already have a good clue). The important thing to notice is that we did not modify the Pandas-based function in order to use it on Spark. This function can now scale to big datasets through the Spark engine.

Even if there is no cluster available, the SparkExecutionEngine will start a local Spark instance and parallelize the jobs with all cores of the machine.

In [3]:
# create Spark session for next cells
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.getOrCreate()

In [4]:
from fugue import transform
from fugue_spark import SparkExecutionEngine

result = transform(
    input_df,
    predict,
    schema="*,predicted:double",
    params=dict(model=reg),
    engine=SparkExecutionEngine()
)
result.show()

+---+---+---------+
|x_1|x_2|predicted|
+---+---+---------+
|  3|  3|     12.0|
|  4|  3|     13.0|
|  6|  6|     21.0|
|  6|  6|     21.0|
+---+---+---------+



The first two arguments of `transform` are the DataFrame to operate on and the function to apply. The `input_df` can be either a Pandas DataFrame or a Spark DataFrame. The `engine` argument then dictates which engine is used for the computation. Because we supplied a Pandas DataFrame together with the `SparkExecutionEngine`, that DataFrame was converted for use in Spark. The output is a Spark DataFrame because the `engine` used was the `SparkExecutionEngine`. Supplying no engine uses the pandas-based `NativeExecutionEngine`. Fugue also has a `DaskExecutionEngine` available.

The other two arguments are `schema` and `params`. A schema is a hard requirement in distributed computing frameworks, so we need to supply the output schema of the operation. Here, `"*,predicted:double"` means all of the input columns plus a new `predicted` column of type double. `params` is a dictionary that contains the other inputs to the function. In this case, we passed in the regression model to be used.

## Conclusion

With that, we have shown the use case of Fugue in scaling code written in Python and Pandas to Spark. It can be done in very few lines of code, without compromising the existing codebase. In the next sections, we'll see other features Fugue has to offer, and the other ways it simplifies using distributed compute.

## Spark Equivalent of Transform

If you are wondering how `transform` compares to implementing the same logic in Spark, below is an example of how the Pandas function would be brought to Spark if you did it yourself. This implementation uses Spark's `mapInPandas` method. Note how the schema has to be handled inside `run_predict`. This is the schema requirement we mentioned earlier.

In [5]:
from typing import Iterator
from pyspark.sql.types import StructType, StructField, DoubleType

def predict_wrapper(dfs: Iterator[pd.DataFrame], model):
    for df in dfs:
        yield predict(df, model)

def run_predict(sdf, model):
    schema = StructType(list(sdf.schema.fields))
    schema.add(StructField("predicted", DoubleType()))
    return sdf.mapInPandas(lambda dfs: predict_wrapper(dfs, model), 
                           schema=schema)

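# conversion: mapInPandas needs a Spark DataFrame, so convert Pandas input first
if isinstance(input_df, pd.DataFrame):
    sdf = spark_session.createDataFrame(input_df.copy())
else:
    # Spark DataFrames are immutable and have no .copy() method
    sdf = input_df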

result = run_predict(sdf, reg)
result.show()

+---+---+---------+
|x_1|x_2|predicted|
+---+---+---------+
|  3|  3|     12.0|
|  4|  3|     13.0|
|  6|  6|     21.0|
|  6|  6|     21.0|
+---+---+---------+



It's easy to see why bringing a Pandas codebase to Spark becomes difficult with this approach. We had to define two additional functions, `predict_wrapper` and `run_predict`, just to run one Pandas function on Spark. If this had to be done for tens of functions, the codebase would quickly fill up with boilerplate code rather than logic.