# Case Study 2: Filter Pushdown

> `Filter pushdown` improves performance by reducing the amount of data shuffled during any dataframes transformations.

### Library Imports

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

Create a `SparkSession`. No need to create `SparkContext` as you automatically get it as part of the `SparkSession`.

In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Exploring Joins") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sc = spark.sparkContext

### Initial Datasets

In [9]:
df_1 = spark.createDataFrame(
    [
        (1, 1, 'a'), 
        (2, 1, 'b'), 
        (2, 2, 'c'), 
    ], ['shop_id', 'data_id', 'val_1']
)

df_1.toPandas()

Unnamed: 0,shop_id,data_id,val_1
0,1,1,a
1,2,1,b
2,2,2,c


In [10]:
df_2 = spark.createDataFrame(
    [
        (1, 1, 10), 
        (2, 2, 20), 
    ], ['shop_id', 'data_id', 'val_2']
)

df_2.toPandas()

Unnamed: 0,shop_id,data_id,val_2
0,1,1,10
1,2,2,20


## Option #1: Join the data, then perform Filter

In [11]:
df = df_1 \
    .join(df_2.drop('shop_id'), 'data_id') \
    .filter(F.col('shop_id') == 1)

df.toPandas()

Unnamed: 0,data_id,shop_id,val_1,val_2
0,1,1,a,10


In [12]:
df.explain()

== Physical Plan ==
*(5) Project [data_id#45L, shop_id#44L, val_1#46, val_2#52L]
+- *(5) SortMergeJoin [data_id#45L], [data_id#51L], Inner
   :- *(2) Sort [data_id#45L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(data_id#45L, 200)
   :     +- *(1) Filter ((isnotnull(shop_id#44L) && (shop_id#44L = 1)) && isnotnull(data_id#45L))
   :        +- Scan ExistingRDD[shop_id#44L,data_id#45L,val_1#46]
   +- *(4) Sort [data_id#51L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(data_id#51L, 200)
         +- *(3) Project [data_id#51L, val_2#52L]
            +- *(3) Filter isnotnull(data_id#51L)
               +- Scan ExistingRDD[shop_id#50L,data_id#51L,val_2#52L]


**What Happened:**

* We can see that the filter is after the join and not pushed down. 
* This means all of the data is brough to the join.
* Then the filter is done.

**Results:**

We bring more data to the join and shuffle, **this is bad**.

## Option #2: Join on Filter Key, then Filter

In [13]:
df = df_1 \
    .join(df_2, ['shop_id', 'data_id']) \
    .filter(F.col('shop_id') == 1)

df.toPandas()

Unnamed: 0,shop_id,data_id,val_1,val_2
0,1,1,a,10


In [14]:
df.explain()

== Physical Plan ==
*(5) Project [shop_id#44L, data_id#45L, val_1#46, val_2#52L]
+- *(5) SortMergeJoin [shop_id#44L, data_id#45L], [shop_id#50L, data_id#51L], Inner
   :- *(2) Sort [shop_id#44L ASC NULLS FIRST, data_id#45L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(shop_id#44L, data_id#45L, 200)
   :     +- *(1) Filter ((isnotnull(shop_id#44L) && (shop_id#44L = 1)) && isnotnull(data_id#45L))
   :        +- Scan ExistingRDD[shop_id#44L,data_id#45L,val_1#46]
   +- *(4) Sort [shop_id#50L ASC NULLS FIRST, data_id#51L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(shop_id#50L, data_id#51L, 200)
         +- *(3) Filter (((shop_id#50L = 1) && isnotnull(data_id#51L)) && isnotnull(shop_id#50L))
            +- Scan ExistingRDD[shop_id#50L,data_id#51L,val_2#52L]


**What Happened:**
* The filter got pushed down.
* Less data is brought to the join and shuffle.

**Results:**

We bring less data to the join and shuffle, **this is good**.

## Option #3: Filter Left, then Join

In [15]:
df = df_1 \
    .filter(F.col('shop_id') == 1) \
    .join(df_2.drop('shop_id'), 'data_id')

df.toPandas()

Unnamed: 0,data_id,shop_id,val_1,val_2
0,1,1,a,10


In [16]:
df.explain()

== Physical Plan ==
*(5) Project [data_id#45L, shop_id#44L, val_1#46, val_2#52L]
+- *(5) SortMergeJoin [data_id#45L], [data_id#51L], Inner
   :- *(2) Sort [data_id#45L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(data_id#45L, 200)
   :     +- *(1) Filter ((isnotnull(shop_id#44L) && (shop_id#44L = 1)) && isnotnull(data_id#45L))
   :        +- Scan ExistingRDD[shop_id#44L,data_id#45L,val_1#46]
   +- *(4) Sort [data_id#51L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(data_id#51L, 200)
         +- *(3) Project [data_id#51L, val_2#52L]
            +- *(3) Filter isnotnull(data_id#51L)
               +- Scan ExistingRDD[shop_id#50L,data_id#51L,val_2#52L]


**What Happened:**
* This is exactly the same as case 1.

**Results:**

We bring less data to the join and shuffle, **this is bad**.

## Option #4: Filter Both, then Join

In [17]:
df = df_2 \
    .filter(F.col('shop_id') == 1) \
    .drop('shop_id')

df = df_1 \
    .filter(F.col('shop_id') == 1) \
    .join(df_3, 'data_id')

df.toPandas()

Unnamed: 0,data_id,shop_id,val_1,shop_id.1,val_1.1
0,1,1,a,1,a
1,1,1,a,2,b


In [18]:
df.explain()

== Physical Plan ==
*(5) Project [data_id#45L, shop_id#44L, val_1#46, shop_id#12L, val_1#2]
+- *(5) SortMergeJoin [data_id#45L], [data_id#1L], Inner
   :- *(2) Sort [data_id#45L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(data_id#45L, 200)
   :     +- *(1) Filter ((isnotnull(shop_id#44L) && (shop_id#44L = 1)) && isnotnull(data_id#45L))
   :        +- Scan ExistingRDD[shop_id#44L,data_id#45L,val_1#46]
   +- *(4) Sort [data_id#1L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(data_id#1L, 200)
         +- *(3) Project [id#0L AS shop_id#12L, data_id#1L, val_1#2]
            +- *(3) Filter isnotnull(data_id#1L)
               +- Scan ExistingRDD[id#0L,data_id#1L,val_1#2]


## TL;DR

* We should always try to push the filter down as much as possible. 
* This means that there will be less data being shuffled and joined during the join. 
* This can be achieved with join in case #2 or #4.

**Option #2** (Good)
* When we `join`ed on `filter`ed on the key `shop_id` this caused a `filter-pushdown` which is good.
* But this made us `sort` on 2 keys.

**Option #4** (Better)
* When we pre `filter` the `join`ing datasets, this caused a `filter-pushdown` which is good.
* We only `join` on one key as well, which is good as we only sort on 1 key.