# Partitioning

In the last section, we showed how the `transform()` function can be applied to functions whose inputs are `List[List[Any]]` or `List[Dict[str,Any]]` in addition to `pd.DataFrame`. In this section, we'll look at a feature of `transform()` we have not touched on yet: the `partition` argument.

The `partition` argument lets us control how the data is partitioned before the `transform()` operation is applied.

## Simple Partitioning Example

In the DataFrame below, we want to take the difference of the value per day. Because there are three different ids, we want to make sure that we don't get the difference across ids.

In [8]:
import pandas as pd 

# The column is named "val" to match the diff() function and the outputs shown below
data = pd.DataFrame({"date": ["2021-01-01", "2021-01-02", "2021-01-03"] * 3,
                     "id": (["A"]*3 + ["B"]*3 + ["C"]*3),
                     "val": [3, 4, 2, 1, 2, 5, 3, 2, 3]})
data.head()

|   | date       | id | val |
|---|------------|----|-----|
| 0 | 2021-01-01 | A  | 3   |
| 1 | 2021-01-02 | A  | 4   |
| 2 | 2021-01-03 | A  | 2   |
| 3 | 2021-01-01 | B  | 1   |
| 4 | 2021-01-02 | B  | 2   |


Now we create a function that takes in a `pd.DataFrame` and outputs a `pd.DataFrame`. This will allow us to bring the logic to Spark and Dask as we've seen before.

In [9]:
def diff(df: pd.DataFrame) -> pd.DataFrame:
    df['diff'] = df['val'].diff()
    return df

But if we use this function directly, the row at index 3 contains the difference between the first value of B and the last value of A, which is invalid. This happens because we did not supply a partition.

In [13]:
from fugue import transform
transform(data.copy(), 
          diff, 
          schema="*, diff:int").head()

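|   | date       | id | val | diff |
|---|------------|----|-----|------|
| 0 | 2021-01-01 | A  | 3   | NaN  |
| 1 | 2021-01-02 | A  | 4   | 1.0  |
| 2 | 2021-01-03 | A  | 2   | -2.0 |
| 3 | 2021-01-01 | B  | 1   | -1.0 |
| 4 | 2021-01-02 | B  | 2   | 1.0  |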


This can be solved by passing a `partition` to Fugue's `transform()`. With the data partitioned by `id`, the first value for B is now correctly NaN.

In [14]:
transform(data.copy(), 
          diff, 
          schema="*, diff:int",
          partition={"by": "id"}).head()

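|   | date       | id | val | diff |
|---|------------|----|-----|------|
| 0 | 2021-01-01 | A  | 3   | NaN  |
| 1 | 2021-01-02 | A  | 4   | 1.0  |
| 2 | 2021-01-03 | A  | 2   | -2.0 |
| 3 | 2021-01-01 | B  | 1   | NaN  |
| 4 | 2021-01-02 | B  | 2   | 1.0  |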
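
To build intuition for what partitioning by `id` does, here is a minimal pandas-only sketch. It is not Fugue's actual implementation: it simply splits the DataFrame by `id`, applies the same logic to each group independently, and concatenates the pieces. The helper `apply_per_partition` is a hypothetical name, and `diff_copy` is a copy-returning variant of `diff()` introduced only for this illustration.

```python
import pandas as pd

data = pd.DataFrame({"date": ["2021-01-01", "2021-01-02", "2021-01-03"] * 3,
                     "id": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
                     "val": [3, 4, 2, 1, 2, 5, 3, 2, 3]})

def diff_copy(df: pd.DataFrame) -> pd.DataFrame:
    # Same logic as diff(), but returns a new DataFrame instead of mutating the input
    return df.assign(diff=df["val"].diff())

def apply_per_partition(df: pd.DataFrame, by: str, func) -> pd.DataFrame:
    # Hypothetical helper: run func on each group of rows sharing the same key,
    # then stitch the results back together (original index is preserved)
    parts = [func(group) for _, group in df.groupby(by, sort=False)]
    return pd.concat(parts)

naive = diff_copy(data)                                   # diff runs across id boundaries
partitioned = apply_per_partition(data, "id", diff_copy)  # diff restarts for every id

print(naive.loc[3, "diff"])        # -1.0 (1 - 2, mixing B with A)
print(partitioned.loc[3, "diff"])  # nan
```

Because each group is processed in isolation, the first row of every `id` has no previous value to subtract, so its `diff` is NaN.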


## Default Partitions

So what happens if we don't supply partitions when we call `transform()`? To find out, we can apply a function that returns the number of rows in each partition it receives.

In [23]:
from typing import List, Dict, Any
import fugue_spark

def count(df: pd.DataFrame) -> List[Dict[str,Any]]:
    return [{"count": df.shape[0]}]

transform(data.copy(),
          count,
          schema="count:int",
          engine="spark").show()

+-----+
|count|
+-----+
|    1|
|    1|
|    1|
|    1|
|    1|
|    1|
|    1|
|    2|
+-----+
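
The counts in this output add up to the 9 rows of `data`, spread over 8 groups that the engine chose on its own. As a rough pandas-only sketch (an illustration, not how Spark actually assigns rows), the snippet below contrasts counting rows in arbitrary chunks with counting rows per `id`. The fixed chunk size of 2 is an assumption made purely for the example.

```python
import pandas as pd

data = pd.DataFrame({"id": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
                     "val": [3, 4, 2, 1, 2, 5, 3, 2, 3]})

def count(df: pd.DataFrame) -> int:
    # Number of rows in whatever chunk the function receives
    return df.shape[0]

# Arbitrary chunks of up to 2 rows: the boundaries ignore the id column entirely
arbitrary_chunks = [data.iloc[i:i + 2] for i in range(0, len(data), 2)]
arbitrary_counts = [count(chunk) for chunk in arbitrary_chunks]

# Key-based chunks: exactly one chunk per id
keyed_counts = [count(group) for _, group in data.groupby("id")]

print(arbitrary_counts)  # [2, 2, 2, 2, 1]
print(keyed_counts)      # [3, 3, 3]
```

The takeaway is that without a partition key, the function may see any slice of the data, so logic that depends on grouping (such as the per-`id` difference earlier) needs an explicit `partition`.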

