# Missing Data

### Start a simple Spark Session

In [1]:
import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.SparkSession


In [2]:
val spark = SparkSession.builder().getOrCreate()

spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@b8e53f6


### Grab small dataset with some missing data

In [4]:
val df = spark.read.options(Map(("header","true"),("inferSchema","true"))).csv("ContainsNull.csv")

df: org.apache.spark.sql.DataFrame = [Id: string, Name: string ... 1 more field]


### Show schema

In [5]:
df.printSchema

root
 |-- Id: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sales: double (nullable = true)



### Notice the missing values!

In [6]:
df.show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp2| null| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



**There are three basic options for handling null values:**

- Keep them, perhaps only while the share of nulls stays below some threshold
- Drop them
- Fill them in with some other value
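
For the first option, it helps to know how many nulls each column actually holds. The following is a minimal sketch, not part of the original notebook: it rebuilds the four rows shown above in memory (an assumption, standing in for `ContainsNull.csv`), runs on a `local[*]` master, and uses a hypothetical threshold named `maxNullFraction`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, when}

val spark = SparkSession.builder().master("local[*]").appName("null-fraction-sketch").getOrCreate()
import spark.implicits._

// In-memory stand-in for ContainsNull.csv (assumption: same four rows as shown above).
val demo = Seq[(String, Option[String], Option[Double])](
  ("emp1", Some("John"), None),
  ("emp2", None, None),
  ("emp3", None, Some(345.0)),
  ("emp4", Some("Cindy"), Some(456.0))
).toDF("Id", "Name", "Sales")

val total = demo.count()

// when(cond, 1) yields null where cond is false, and count() skips nulls,
// so each aggregate counts only the rows where that column is null.
val nullCounts = demo
  .select(demo.columns.map(c => count(when(col(c).isNull, 1)).alias(c)): _*)
  .first()

val nullFraction: Map[String, Double] =
  demo.columns.map(c => c -> nullCounts.getAs[Long](c).toDouble / total).toMap

// Hypothetical policy: keep only columns whose null fraction is at most the threshold.
val maxNullFraction = 0.5
val keptColumns = demo.columns.filter(c => nullFraction(c) <= maxNullFraction)

demo.select(keptColumns.map(c => col(c)): _*).show()
```

With this data, `Name` and `Sales` are each half null, so a threshold of 0.5 keeps every column; a stricter threshold would keep only `Id`.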

### Dropping values

#### Drop every row that contains at least one null value

In [7]:
df.na.drop().show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+



#### Drop rows that have fewer than a minimum number of non-null values (`minNonNulls: Int`)

In [8]:
df.na.drop(2).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
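
`na.drop` also accepts a list of columns and a `how` argument (`"any"` or `"all"`), which narrows the check to the columns that matter. The sketch below is illustrative only: the data is rebuilt in memory as a stand-in for `ContainsNull.csv`, and the names `demo`, `withSales`, `notAllNull`, and `atLeastOne` are made up for this example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("na-drop-sketch").getOrCreate()
import spark.implicits._

// In-memory stand-in for ContainsNull.csv (assumption: same four rows as shown above).
val demo = Seq[(String, Option[String], Option[Double])](
  ("emp1", Some("John"), None),
  ("emp2", None, None),
  ("emp3", None, Some(345.0)),
  ("emp4", Some("Cindy"), Some(456.0))
).toDF("Id", "Name", "Sales")

// Drop rows where Sales is null, regardless of the other columns (keeps emp3 and emp4).
val withSales = demo.na.drop(Seq("Sales"))

// "all": drop a row only when every listed column is null (drops emp2 only).
val notAllNull = demo.na.drop("all", Seq("Name", "Sales"))

// Require at least 1 non-null value among the listed columns (same result as "all" here).
val atLeastOne = demo.na.drop(1, Seq("Name", "Sales"))

withSales.show()
notAllNull.show()
atLeastOne.show()
```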



## **Interesting behavior!**
**What happens when filling with a double/int versus a string**

### Fill null values with an Int (only numeric columns are filled)

In [9]:
df.na.fill(100).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|100.0|
|emp2| null|100.0|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



### Filling with a String applies only to string columns

In [10]:
df.na.fill("Emp Name Missing").show()

+----+----------------+-----+
|  Id|            Name|Sales|
+----+----------------+-----+
|emp1|            John| null|
|emp2|Emp Name Missing| null|
|emp3|Emp Name Missing|345.0|
|emp4|           Cindy|456.0|
+----+----------------+-----+



### Be more specific: pass an array of column names

In [11]:
df.na.fill("Specific",Array("Name")).show()

+----+--------+-----+
|  Id|    Name|Sales|
+----+--------+-----+
|emp1|    John| null|
|emp2|Specific| null|
|emp3|Specific|345.0|
|emp4|   Cindy|456.0|
+----+--------+-----+
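
When different columns need different replacement values, `na.fill` can also take a `Map` from column name to value, which handles the string and numeric columns in one call. This is a small sketch under assumptions: the data is rebuilt in memory rather than read from `ContainsNull.csv`, and the replacement values `"Unknown"` and `0.0` are arbitrary choices for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("na-fill-map-sketch").getOrCreate()
import spark.implicits._

// In-memory stand-in for ContainsNull.csv (assumption: same four rows as shown above).
val demo = Seq[(String, Option[String], Option[Double])](
  ("emp1", Some("John"), None),
  ("emp2", None, None),
  ("emp3", None, Some(345.0)),
  ("emp4", Some("Cindy"), Some(456.0))
).toDF("Id", "Name", "Sales")

// One replacement value per column; each value's type should match its column's type.
val filledByColumn = demo.na.fill(Map("Name" -> "Unknown", "Sales" -> 0.0))

filledByColumn.show()
```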



### Fill in Sales with the average sales

#### Get the average sales using describe()

In [14]:
df.describe().show()

+-------+----+-----+-----------------+
|summary|  Id| Name|            Sales|
+-------+----+-----+-----------------+
|  count|   4|    2|                2|
|   mean|null| null|            400.5|
| stddev|null| null|78.48885271170677|
|    min|emp1|Cindy|            345.0|
|    max|emp4| John|            456.0|
+-------+----+-----+-----------------+



#### Now fill the missing values with the mean

In [15]:
df.na.fill(400.5).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| null|400.5|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
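
The mean can also be computed programmatically instead of being copied by hand from the `describe()` output. The sketch below is one possible approach, not the notebook's original code: it rebuilds the data in memory (an assumption), uses `avg`, which ignores nulls, and restricts the fill to the `Sales` column so that other numeric columns would be left alone.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().master("local[*]").appName("na-fill-mean-sketch").getOrCreate()
import spark.implicits._

// In-memory stand-in for ContainsNull.csv (assumption: same four rows as shown above).
val demo = Seq[(String, Option[String], Option[Double])](
  ("emp1", Some("John"), None),
  ("emp2", None, None),
  ("emp3", None, Some(345.0)),
  ("emp4", Some("Cindy"), Some(456.0))
).toDF("Id", "Name", "Sales")

// avg skips null rows, so this is the mean of the two known Sales values.
// Note: getDouble would fail if every Sales value were null.
val meanSales = demo.select(avg("Sales")).first().getDouble(0)

// Fill only the Sales column with the computed mean.
val filledWithMean = demo.na.fill(meanSales, Seq("Sales"))

filledWithMean.show()
```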




### Closing Spark Session

In [23]:
spark.stop()

## Thank You!