# Why Fugue Does NOT Want To Be Another Pandas-Like Framework

Fugue fully utilizes Pandas for computing tasks, but **Fugue is NOT a Pandas-like computing framework, and it never wants to be.** In this article we are going to explain the reason for this critical design decision.

## A Simple Example

This is an example modified from a real piece of Pandas user code. Assume we have Pandas dataframes generated by this code:

```python
import numpy as np
import pandas as pd


def gen(n):
    # fixed seed so every call generates the same data
    np.random.seed(0)
    return pd.DataFrame(dict(
        a=np.random.choice(["aa","abcd","xyzzzz","tttfs"],n),
        b=np.random.randint(0,100,n),
        c=np.random.choice(["aa","abcd","xyzzzz","tttfs"],n),
        d=np.random.randint(0,10000,n),
    ))
```

The output has four columns with string and integer types. Here is the user's code:

```python
df.sort_values(["a", "b", "c", "d"]).drop_duplicates(subset=["a", "b"], keep="last")
```

Based on the code, the user wants to partition the dataframe by `a` and `b`,
sort each group by `c` and `d`, and then keep the last record
of each group.
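
To make that reading concrete, here is a minimal sketch on a tiny hand-made dataframe (the values are illustrative and not part of the benchmark data). Note that within the `("x", 1)` group the kept record is the one with the largest `(c, d)` pair, not the one with the largest `d` alone:

```python
import pandas as pd

# Illustrative data only: two (a, b) groups with several records each
df = pd.DataFrame({
    "a": ["x", "x", "x", "y", "y"],
    "b": [1, 1, 1, 2, 2],
    "c": ["p", "q", "q", "r", "r"],
    "d": [30, 10, 20, 5, 7],
})

# Global sort, then keep the last row of every (a, b) combination
result = (
    df.sort_values(["a", "b", "c", "d"])
    .drop_duplicates(subset=["a", "b"], keep="last")
    .reset_index(drop=True)
)

# keeps ("x", 1, "q", 20) and ("y", 2, "r", 7)
print(result)
```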


### Configuration and Datasets

* **Databricks runtime version:** 10.1 (Scala 2.12, Spark 3.2.0)
* **Cluster:** 1 i3.xlarge driver instance and 8 i3.xlarge worker instances

We will use 4 datasets of different sizes: 1 million, 10 million, 20 million, and 30 million rows.

```python
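g1 = gen(1 * 1000 * 1000)  # pandas dataframe
df1 = spark.createDataFrame(g1).cache()  # Spark dataframe
df1.count()  # trigger an action so the cache is materialized
pdf1 = df1.to_pandas_on_spark()  # Spark Pandas dataframe over the same data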

g10 = gen(10 * 1000 * 1000)
df10 = spark.createDataFrame(g10).cache()
df10.count()
pdf10 = df10.to_pandas_on_spark()

g20 = gen(20 * 1000 * 1000)
df20 = spark.createDataFrame(g20).cache()
df20.count()
pdf20 = df20.to_pandas_on_spark()

g30 = gen(30 * 1000 * 1000)
df30 = spark.createDataFrame(g30).cache()
df30.count()
pdf30 = df30.to_pandas_on_spark()
```

### Comparison 1

Let's first follow the user's original logic; we will discuss alternative solutions later.

In this [Databricks article](https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html),
the author claimed that Pandas users will be able to scale their workloads with one simple line change in the Spark 3.2 release.
So we will first convert the Pandas dataframe to a Spark Pandas dataframe, without any other change, and verify the result.

On the other hand, in traditional Spark, a [window function solution](https://stackoverflow.com/a/33878701) is typical.
So we will also add the window function solution to the comparison.

To force the full execution of the statement and also to verify result consistency, at the end of each execution
we will compute the sum of column `d` and print it.
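
The benchmark code itself is not shown here, so the following is only a sketch of what the compared solutions could look like. The helper names `sort_dedup`, `window_dedup`, and `checksum` are hypothetical. The window version follows the common `row_number` pattern and is an assumption about the approach, not the exact code that was timed. `pyspark` is imported lazily so that the pandas part runs without a cluster:

```python
import pandas as pd


def sort_dedup(df):
    """The user's original logic, written for a pandas-style dataframe."""
    return df.sort_values(["a", "b", "c", "d"]).drop_duplicates(
        subset=["a", "b"], keep="last"
    )


def window_dedup(sdf):
    """Assumed window-function equivalent for a pyspark.sql.DataFrame."""
    # Lazy import: only needed when a Spark dataframe is actually passed in
    from pyspark.sql import Window, functions as F

    # Sort only inside each (a, b) partition, largest (c, d) first
    w = Window.partitionBy("a", "b").orderBy(F.col("c").desc(), F.col("d").desc())
    return sdf.withColumn("rn", F.row_number().over(w)).filter(F.col("rn") == 1).drop("rn")


def checksum(df):
    """Sum of column d for a pandas-style dataframe; this forces execution."""
    return int(df["d"].sum())


# Illustrative data only
sample = pd.DataFrame(
    {"a": ["x", "x", "y"], "b": [1, 1, 2], "c": ["p", "q", "r"], "d": [30, 20, 7]}
)
print(checksum(sort_dedup(sample)))
```

For a native Spark dataframe, the checksum would need a Spark aggregation such as `sdf.agg(F.sum("d"))` instead. Rows that tie on `(c, d)` within a group share the same `d`, so the checksum stays consistent whichever of them `row_number` picks.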

Based on the output, all 3 solutions produce consistent results, so none of them has a correctness issue. Now let's
take a look at their speed:

![Sort Dedup vs Window](../../images/pandas_like_1.png)

* With a 32-core Spark cluster, both Spark solutions are significantly faster than
  the single-core Pandas solution
* The window function solution is 30% to 50% faster than the Spark Pandas solution

On a local machine, a global sort is a popular technique that is often seen in Pandas code, and in certain
scenarios it outperforms other methods. However,
a global sort in distributed computing is difficult and expensive, and its performance depends on each
computing framework's implementation. Spark Pandas has done an amazing job, but even so,
it is still significantly slower than a window function.

If we rethink the problem we want to solve, a global sort on the entire dataset is not necessary.
If convenience is the only thing that matters, then switching the Pandas backend to Spark Pandas may make sense.
However, the whole point of moving to Spark is to be more scalable and performant. Moving to a window function that
sorts inside each partition isn't overly complicated, and the performance advantage is significant.

### Comparison 2

In the second comparison, we simplify the original problem by ignoring column `c`. We only need to remove
`c` from `sort_values` to accommodate the change:

```python
df.sort_values(["a", "b", "d"]).drop_duplicates(subset=["a", "b"], keep="last")
```

Again, it's intuitive and convenient, and Spark Pandas can inherit this change too. However, this new problem
actually means we want to group by `a` and `b` and get the max value of `d`, which is a simple aggregation
in big data. So in this comparison, we add the simple Spark aggregation approach.
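
The equivalence can be checked in plain Pandas. The sketch below uses a small random dataframe (the size and the variable names `dedup` and `agg` are illustrative assumptions) and confirms that sort-dedup without `c` yields the same `d` values as a group-by max. On Spark, the corresponding aggregation would presumably be along the lines of `df.groupBy("a", "b").agg(F.max("d"))`; the exact benchmark code is not shown in this article.

```python
import numpy as np
import pandas as pd

# Small illustrative dataset with the same column layout, minus column c
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "a": rng.choice(["aa", "abcd", "xyzzzz", "tttfs"], n),
    "b": rng.integers(0, 100, n),
    "d": rng.integers(0, 10000, n),
})

# Sort-dedup: global sort, then keep the last row per (a, b)
dedup = df.sort_values(["a", "b", "d"]).drop_duplicates(subset=["a", "b"], keep="last")

# Aggregation: max of d per (a, b), no sort of the full dataset required
agg = df.groupby(["a", "b"], as_index=False)["d"].max()

# Both results are ordered by (a, b), so the d columns can be compared directly
print(dedup["d"].tolist() == agg["d"].tolist())
```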

![Sort Dedup vs Window vs Aggregation](../../images/pandas_like_2.png)

* The previous performance pattern stays the same
* Spark aggregation takes ~1 sec regardless of data size

So now, do you want to just remove column `c` for simplicity or do you want to rewrite the logic for performance?

### Comparison 3

Let's go back to the original logic where we still have 4 columns. By understanding the intention, we can have an alternative Pandas solution:

```python
df.groupby(["a", "b"]).apply(lambda df: df.sort_values(["c", "d"], ascending=[False, False]).head(1))
```

When testing on the 1 million dataset, the original logic takes `1.43 sec` while this new logic takes `2.2 sec`. This is probably
one of the reasons the user chose the sort & dedup approach. On a small local dataset, a global sort seems to be faster.
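
As a quick sanity check that the two formulations select the same records, here is a minimal Pandas-only sketch on a small random dataframe. The size `n` and the variable names are illustrative assumptions. The group-by selects `["c", "d"]` explicitly before `apply`, a small deviation from the expression above that keeps the behavior the same across Pandas versions.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 2000  # illustrative size, far smaller than the benchmark datasets
df = pd.DataFrame(dict(
    a=np.random.choice(["aa", "abcd", "xyzzzz", "tttfs"], n),
    b=np.random.randint(0, 100, n),
    c=np.random.choice(["aa", "abcd", "xyzzzz", "tttfs"], n),
    d=np.random.randint(0, 10000, n),
))


def top_record(group):
    # Largest (c, d) pair inside one (a, b) group
    return group.sort_values(["c", "d"], ascending=[False, False]).head(1)


sort_dedup = df.sort_values(["a", "b", "c", "d"]).drop_duplicates(
    subset=["a", "b"], keep="last"
)
group_apply = df.groupby(["a", "b"])[["c", "d"]].apply(top_record)

# Same number of groups and the same checksum on column d
print(len(sort_dedup) == len(group_apply))
print(sort_dedup["d"].sum() == group_apply["d"].sum())
```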

In this section, we are going to compare groupby-apply with sort-dedup on all datasets. In addition, this fits nicely
with [Pandas UDF](https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html) scenarios, so we will also
compare with the Pandas UDF approach.

To avoid duplication, we extract the lambda function:

```python
def largest(df: pd.DataFrame) -> pd.DataFrame:
    return df.sort_values(["c","d"], ascending=[False, False]).head(1)
```

Unfortunately, the first issue we encounter is that Spark Pandas can't take this function:

```python
g1.groupby(["a", "b"]).apply(largest)  # g1 is a pandas dataframe, it works
df1.groupby(["a", "b"]).applyInPandas(largest, schema="a string,b long,c string,d long")  # Pandas UDF works
pdf1.groupby(["a", "b"]).apply(largest)  # pdf1 is a spark pandas dataframe, it doesn't work
```

So for Spark Pandas we will need to use:

```python
pdf1.groupby(["a", "b"]).apply(lambda df: largest(df))
```

This contradicts the claim that everything works out of the box with just an import change.

Now let's see the performance chart:

![Sort Dedup vs Group Apply vs Pandas UDF](../../images/pandas_like_3.png)

* For Pandas, as the data size increases, groupby-apply gains more of a performance advantage over sort-dedup
* For Spark Pandas, groupby-apply is even slower than Pandas
* Pandas UDF is the fastest Spark solution for this problem

### Summary of Comparisons

From the 3 comparisons, we find that:

* Convenience comes at the cost of performance
* Simply switching the backend doesn't always work (it is not 100% consistent)
* Simply switching the backend can cause unexpected performance issues
* Big data problems require different ways of thinking; users must learn and change their mindset






