### Library Imports

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

Create a `SparkSession`. There is no need to create a `SparkContext` separately, because the session exposes one as `spark.sparkContext`.

In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Exploring Joins") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sc = spark.sparkContext

## Use Case

When working with data, you sometimes want to filter out rows based on a boolean column. You might expect a simple `filter`/`where` clause to be enough, right?

It turns out it is not. When the column also contains null values, the results can look odd, because Spark follows standard SQL null semantics, which may be unfamiliar to some.

In [3]:
pets_df = spark.createDataFrame(
    [
        (1, 'cat', 'charlie'),
        (2, 'cat', 'fluffy'),
        (3, 'dog', 'bean'),
        (4, 'dog', None),
    ], ['id','animal', 'name']
)

pets_df.toPandas()

| | id | animal | name |
|---|---|---|---|
| 0 | 1 | cat | charlie |
| 1 | 2 | cat | fluffy |
| 2 | 3 | dog | bean |
| 3 | 4 | dog | None |


## Option 1: Simple Filter Clause

In [4]:
pets_df.withColumn(
    'condition_result', 
    F.col('name') != 'charlie'
).toPandas()

| | id | animal | name | condition_result |
|---|---|---|---|---|
| 0 | 1 | cat | charlie | False |
| 1 | 2 | cat | fluffy | True |
| 2 | 3 | dog | bean | True |
| 3 | 4 | dog | None | None |


**What Happened:**

Notice how the row whose `name` is `None` gives a `None` result after the comparison, rather than `True` or `False`.

**This is the default behavior in SQL when a comparison involves a `Null/None` value.**
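
To see that this rule comes from SQL in general and not from Spark specifically, here is a minimal sketch that uses Python's built-in `sqlite3` module in place of Spark. The in-memory `pets` table is an assumption that mirrors `pets_df`; SQLite reports booleans as `1`/`0` and `NULL` as `None`.

```python
import sqlite3

# In-memory table mirroring pets_df (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (id INTEGER, animal TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [(1, "cat", "charlie"), (2, "cat", "fluffy"), (3, "dog", "bean"), (4, "dog", None)],
)

# The comparison itself: the row with a NULL name evaluates to NULL (None).
comparison = conn.execute(
    "SELECT id, name != 'charlie' FROM pets ORDER BY id"
).fetchall()
print(comparison)  # [(1, 0), (2, 1), (3, 1), (4, None)]

# A WHERE clause keeps only rows whose predicate is true,
# so the NULL row is silently dropped along with the false one.
kept_ids = [
    row[0]
    for row in conn.execute("SELECT id FROM pets WHERE name != 'charlie' ORDER BY id")
]
print(kept_ids)  # [2, 3]

conn.close()
```

The second query is the part that tends to surprise people: row 4 is not `charlie`, yet it does not survive the filter, because its predicate is `NULL` and not true.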

## Option 2: Let's Try an `isin` Clause

In [5]:
pets_df.withColumn(
    'condition_result', 
    F.col('name').isin('charlie')
).toPandas()

| | id | animal | name | condition_result |
|---|---|---|---|---|
| 0 | 1 | cat | charlie | True |
| 1 | 2 | cat | fluffy | False |
| 2 | 3 | dog | bean | False |
| 3 | 4 | dog | None | None |


In [6]:
pets_df.withColumn(
    'condition_result', 
    F.col('name').isin('charlie', None)
).toPandas()

| | id | animal | name | condition_result |
|---|---|---|---|---|
| 0 | 1 | cat | charlie | True |
| 1 | 2 | cat | fluffy | None |
| 2 | 3 | dog | bean | None |
| 3 | 4 | dog | None | None |


**What Happened:**

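When we added `None` to the list, the same behavior occurred, and it spread to more rows: every row that does not match `charlie` now comes back as `None` instead of `False`. `isin('charlie', None)` is evaluated like `name == 'charlie' OR name == None`, and a comparison to `None` is never `True` or `False`, so only a definite match can still produce `True`.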
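
The two `isin` results can be reproduced with a small pure-Python model of SQL's three-valued logic. This is only a sketch: `sql_eq`, `sql_or`, and `sql_in` are hypothetical helper names introduced here, not Spark functions, and `None` stands in for `Null`.

```python
from typing import Optional


def sql_eq(left, right) -> Optional[bool]:
    """Equality under SQL rules: any NULL (None) operand makes the result NULL."""
    if left is None or right is None:
        return None
    return left == right


def sql_or(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued OR: True wins, otherwise NULL wins, otherwise False."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def sql_in(value, candidates) -> Optional[bool]:
    """Model `value IN (c1, c2, ...)` as `value = c1 OR value = c2 OR ...`."""
    result: Optional[bool] = False
    for candidate in candidates:
        result = sql_or(result, sql_eq(value, candidate))
    return result


names = ["charlie", "fluffy", "bean", None]
without_null = [sql_in(n, ["charlie"]) for n in names]
with_null = [sql_in(n, ["charlie", None]) for n in names]

print(without_null)  # [True, False, False, None]
print(with_null)     # [True, None, None, None]
```

Both lists line up with the `condition_result` columns above: a `None` in the candidate list turns every non-match from `False` into `None`.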

## Correct Way of Comparing to Null

In [7]:
default_value = 'default name'

pets_df.withColumn(
    'condition_result', 
    F.coalesce(F.col('name'), F.lit(default_value)).isin('charlie', default_value)
).toPandas()

| | id | animal | name | condition_result |
|---|---|---|---|---|
| 0 | 1 | cat | charlie | True |
| 1 | 2 | cat | fluffy | False |
| 2 | 3 | dog | bean | False |
| 3 | 4 | dog | None | True |


In [8]:
pets_df.withColumn(
    'condition_result', 
    (F.col('name').isNull() | (F.col('name') == 'charlie'))
).toPandas()

| | id | animal | name | condition_result |
|---|---|---|---|---|
| 0 | 1 | cat | charlie | True |
| 1 | 2 | cat | fluffy | False |
| 2 | 3 | dog | bean | False |
| 3 | 4 | dog | None | True |


**What Happened:**

With these two methods we can handle `Null` values properly in a comparison.

The first method replaces every `Null` with a default value and then checks for that default value alongside `charlie`.

The second method checks explicitly whether the column is `Null` or equal to `charlie`.
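
Both approaches map directly onto plain SQL, which may help if you write Spark SQL strings rather than column expressions. The sketch below again uses Python's built-in `sqlite3` instead of Spark, with an assumed in-memory `pets` table; `1`/`0` stand for true/false.

```python
import sqlite3

default_value = "default name"

# In-memory table mirroring the id and name columns of pets_df (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [(1, "charlie"), (2, "fluffy"), (3, "bean"), (4, None)],
)

# Method 1: fill NULLs with a default, then test for that default as well.
coalesce_result = [
    row[0]
    for row in conn.execute(
        "SELECT COALESCE(name, ?) IN ('charlie', ?) FROM pets ORDER BY id",
        (default_value, default_value),
    )
]

# Method 2: test for NULL explicitly.
is_null_result = [
    row[0]
    for row in conn.execute(
        "SELECT name IS NULL OR name = 'charlie' FROM pets ORDER BY id"
    )
]

print(coalesce_result)  # [1, 0, 0, 1]
print(is_null_result)   # [1, 0, 0, 1]

conn.close()
```

One design consideration with the first method: the default value must be something that can never appear as a real name, otherwise genuine rows would be matched by accident. The explicit `IS NULL` check has no such constraint.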

## TL;DR

In SQL, comparing anything to `Null` returns `Null`, not `True` or `False`. Spark follows the same behavior.

If we want comparisons to handle `Null`s, we should either 1) fill them with a default value and check for that value as well, or 2) check for `Null` explicitly.