
# Common null-related problems

1. Null in join column

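Rows whose join key is null never match in an equi-join, so they can be removed before the join when they are not needed in the result. This optimizes the join operation because it reduces the shuffle size.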
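As a rough sketch of this filtering step, the hypothetical helper below drops rows whose join key is null before the join runs. The name `drop_null_keys` and its arguments are assumptions made for illustration. It relies only on `DataFrame.where` with a SQL expression string, and assumes a simple column name that needs no quoting.

```python
def drop_null_keys(df, key_col):
    """Return df without the rows whose join key column is null.

    Hypothetical helper: `df` is assumed to be a Spark DataFrame and
    `key_col` the name of the join column.
    """
    # DataFrame.where accepts a SQL expression string as its condition.
    return df.where(f"{key_col} IS NOT NULL")


# Intended use in a notebook (assumes existing DataFrames left_df and right_df):
# joined = drop_null_keys(left_df, "department").join(right_df, on="department")
```

For an inner join the result is unchanged, because rows with a null key would not have matched anyway.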

2. Null value data skew

For a shuffle join, Spark splits both tables into partitions based on the join key.
All rows with a null key end up in the same partition, so too many null records lead to a very big partition.
Data skew is the situation where one or more partitions are very big while the remaining partitions are very small.

Due to these skewed partitions, your application may hang or even fail with an out-of-memory error.
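To illustrate the skew described above, here is a small plain-Python simulation. The function `partition_for` is a simplified stand-in for a hash partitioner, not Spark's actual implementation, and the key counts are made-up numbers. The only property it tries to mimic is that every null key is routed to the same partition.

```python
from collections import Counter


def partition_for(key, num_partitions):
    """Simplified stand-in for a hash partitioner (not Spark's real one)."""
    if key is None:
        # Every null key maps to the same partition.
        return 0
    return key % num_partitions


# Made-up workload: 90 rows with a null key and 10 rows with distinct keys.
keys = [None] * 90 + list(range(1, 11))

sizes = Counter(partition_for(k, 4) for k in keys)
print(sorted(sizes.items()))  # [(0, 92), (1, 3), (2, 3), (3, 2)]
```

One partition holds 92 of the 100 rows while the others hold two or three each, which is the skewed shape that can slow down or break a job.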

3. Null in partition column

Rows with a null value in the partition column are all grouped into the same partition, which can lead to skew.


4. Null in Spark functions
5. Null in UDFs
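
For item 5, one point worth sketching: Spark passes a SQL null to a Python UDF as `None`, so the function body has to handle it explicitly or it may raise an exception. The function `name_length` below is a hypothetical example. The commented lines show how it could be registered with `pyspark.sql.functions.udf`, assuming an existing DataFrame `df` with a `name` column.

```python
def name_length(name):
    """Return the length of a name, or None when the input is null."""
    if name is None:
        # Propagate null instead of failing on len(None).
        return None
    return len(name)


# Possible registration in a Spark session (not executed here):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import IntegerType
# name_length_udf = udf(name_length, IntegerType())
# df.withColumn("name_len", name_length_udf("name"))
```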

In [0]:
data_list1 = [
    (100, "Prashant", "Software"),
    (101, "David", None),
    (102, "Sushant", None),
    (103, "Abdul", "Account"),
    (104, "Shruti", "Software")
]

data_list2 = [(501, "Software"), (502, "Account")]


employee_df = spark.createDataFrame(data_list1).toDF("id", "name", "department")
department_df = spark.createDataFrame(data_list2).toDF("id", "department")

In [0]:
employee_df.join(department_df, on="department").show()

+----------+---+--------+---+
|department| id|    name| id|
+----------+---+--------+---+
|   Account|103|   Abdul|502|
|  Software|100|Prashant|501|
|  Software|104|  Shruti|501|
+----------+---+--------+---+



In [0]:
person_list = [(100, "Prashant", 30),
               (101, "David", None),
               (102, "Sushant", None),
               (103, "Abdul", 45),
               (104, "Shruti", 28)]

person_df = spark.createDataFrame(person_list).toDF("id", "name", "age")

In [0]:
person_df.where("age is null").show()

+---+-------+----+
| id|   name| age|
+---+-------+----+
|101|  David|null|
|102|Sushant|null|
+---+-------+----+



In [0]:
person_df.selectExpr("avg(age)").show()

+------------------+
|          avg(age)|
+------------------+
|34.333333333333336|
+------------------+



In [0]:
person_df.filter("age is not null").selectExpr("avg(age)").show()

+------------------+
|          avg(age)|
+------------------+
|34.333333333333336|
+------------------+
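
The two averages above match because `avg` skips null inputs: it divides the sum of the non-null ages by the number of non-null rows, not by the total row count. The plain-Python sketch below reproduces that arithmetic on the same ages, with `None` standing in for null, and contrasts it with treating nulls as 0. It illustrates the semantics only and is not how Spark computes the aggregate internally.

```python
ages = [30, None, None, 45, 28]

# Average over non-null values only, as avg(age) does.
non_null = [a for a in ages if a is not None]
avg_ignoring_nulls = sum(non_null) / len(non_null)

# For contrast: the result if each null were counted as 0.
avg_nulls_as_zero = sum(a or 0 for a in ages) / len(ages)

print(avg_ignoring_nulls)  # 34.333333333333336
print(avg_nulls_as_zero)   # 20.6
```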



In [0]:
person_df