# How Spark Handles null and NaN

In [2]:
import sys
sys.path.append("/usr/local/spark/python")
sys.path.append("/usr/local/spark/python/lib/py4j-0.10.7-src.zip")

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

spark = SparkSession.builder \
        .master('local[4]') \
        .appName('spark_null_nan') \
        .enableHiveSupport() \
        .getOrCreate()

In [6]:
# The Parquet data was generated by spark_create_dateframe.ipynb
df = spark.read.parquet("sample_parquet")
df.show()
df.printSchema()

+----------+------+---+-----------+------------+---------+-----------+-----+------------+-------------------+
|      Name|    id|Age|entry_score|update_score|     Food|    Balance|  VIP|sign_up_date|     last_check_out|
+----------+------+---+-----------+------------+---------+-----------+-----+------------+-------------------+
|Han Meimei|342887| 33|   443.9234|        null|Ice Cream|  111246.87| true|  2010-12-10|2018-09-30 10:34:16|
|    Li Lei|278584| 35|        NaN|    400.2312|Chocolate|       null| true|  2005-06-23|2018-12-23 22:10:24|
|   Niu Ren|588269| 28|       null|   995.36255|     null|65897412.57|false|  2006-01-01|2019-01-04 12:56:45|
|  Jay Chou|785445| 45|        NaN|        null|    Donut|       null| true|  2001-05-05|2017-08-04 06:33:43|
+----------+------+---+-----------+------------+---------+-----------+-----+------------+-------------------+

root
 |-- Name: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- Age: short (nullable = true)
 |-- entr

## Column-wise Mathematical Operations with null and NaN

**General rule**: for arithmetic between columns, if any operand in a row is `null`, the result for that row is `null`;
                  if no operand is `null` but at least one is `NaN`, the result is `NaN`. In other words, `null` takes precedence over `NaN`.

**Example:**

In [8]:
# Combine columns that contain null and/or NaN to see which value wins in each row.
df = df.withColumn("total_score", df.entry_score + df.update_score) \
       .withColumn("score_ratio", df.entry_score / df.Balance) \
       .select("Name", "entry_score", "update_score", "Balance", "total_score", "score_ratio")
df.show()

+----------+-----------+------------+-----------+-----------+--------------------+
|      Name|entry_score|update_score|    Balance|total_score|         score_ratio|
+----------+-----------+------------+-----------+-----------+--------------------+
|Han Meimei|   443.9234|        null|  111246.87|       null|0.003990434974744964|
|    Li Lei|        NaN|    400.2312|       null|        NaN|                null|
|   Niu Ren|       null|   995.36255|65897412.57|       null|                null|
|  Jay Chou|        NaN|        null|       null|       null|                null|
+----------+-----------+------------+-----------+-----------+--------------------+
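
The precedence in the table above can be reproduced without a Spark session. The sketch below is a plain-Python model of the rule, not Spark's implementation: `null` is represented by `None`, `NaN` by `float("nan")`, and `spark_like_binary_op` is a hypothetical helper introduced only for illustration. It assumes ordinary floating-point arithmetic and does not model division by zero.

```python
import math
import operator


def spark_like_binary_op(op, left, right):
    """Hypothetical helper that mimics the rule above; not part of PySpark."""
    # null (modeled as None) on either side wins, even if the other side is NaN.
    if left is None or right is None:
        return None
    # With no null involved, ordinary float arithmetic applies, so NaN propagates.
    return op(left, right)


nan = float("nan")

# (Name, entry_score, update_score, Balance), copied from the table above.
rows = [
    ("Han Meimei", 443.9234, None, 111246.87),
    ("Li Lei", nan, 400.2312, None),
    ("Niu Ren", None, 995.36255, 65897412.57),
    ("Jay Chou", nan, None, None),
]

# Each result is (Name, total_score, score_ratio).
results = [
    (
        name,
        spark_like_binary_op(operator.add, entry, update),
        spark_like_binary_op(operator.truediv, entry, balance),
    )
    for name, entry, update, balance in rows
]

for name, total_score, score_ratio in results:
    print(name, total_score, score_ratio)
```

Only the Li Lei row yields `NaN` for `total_score`, because it is the only row where `entry_score` is `NaN` and `update_score` is not `null`. Every other combination involves at least one `null` and therefore collapses to `null`.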

