### Handling Missing Values in a Dataset

PySpark exposes its missing-data helpers through a DataFrame's `.na` attribute (`DataFrameNaFunctions`). Its `drop` method takes the following parameters:

df.na.drop(how='any', thresh=None, subset=None)

* param how: 'any' or 'all'.

    If 'any', drop a row if it contains any nulls.
    If 'all', drop a row only if all its values are null.

* param thresh: int, default None

    If specified, drop rows that have fewer than `thresh` non-null values.
    This overrides the `how` parameter.

* param subset: 
    optional list of column names to consider.
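
How `how`, `thresh`, and `subset` interact can be modelled in plain Python. The sketch below illustrates the rules listed above; it is not PySpark's implementation. The helper name `drop_rows` and the list-of-dicts row format are assumptions made for the example, with `None` standing in for a null. The four rows mirror the shape of the `ContainsNull.csv` data used in this notebook.

```python
def drop_rows(rows, how="any", thresh=None, subset=None):
    """Return the rows that survive, following the documented drop rules."""
    kept = []
    for row in rows:
        # Only the columns named in `subset` are inspected (all columns by default).
        cols = subset if subset is not None else list(row)
        non_null = sum(row[c] is not None for c in cols)
        if thresh is not None:
            # `thresh` takes precedence over `how`.
            keep = non_null >= thresh
        elif how == "any":
            keep = non_null == len(cols)
        elif how == "all":
            keep = non_null > 0
        else:
            raise ValueError("how must be 'any' or 'all'")
        if keep:
            kept.append(row)
    return kept


rows = [
    {"Id": "emp1", "Name": "John", "Sales": None},
    {"Id": "emp2", "Name": None, "Sales": None},
    {"Id": "emp3", "Name": None, "Sales": 345.0},
    {"Id": "emp4", "Name": "Cindy", "Sales": 456.0},
]

ids_any = [r["Id"] for r in drop_rows(rows)]
ids_thresh2 = [r["Id"] for r in drop_rows(rows, thresh=2)]
print(ids_any)      # ['emp4']
print(ids_thresh2)  # ['emp1', 'emp3', 'emp4']
```

The model takes `subset` as a list only, which keeps the sketch short.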

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("missingdata").getOrCreate()

24/04/28 16:58:04 WARN Utils: Your hostname, myspark resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
24/04/28 16:58:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/28 16:58:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
df = spark.read.csv("/home/vboxuser/jupyter_notebooks/Python-and-Spark-for-Big-Data-master/Spark_DataFrames/ContainsNull.csv",
                     header = True,
                     inferSchema = True)

                                                                                

In [4]:
df.show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp2| NULL| NULL|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [6]:
df.na.drop().show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+



The `thresh` parameter sets the minimum number of non-null values a row must contain to be kept.
Rows with fewer than `thresh` non-null values are dropped.
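
To see which rows a given threshold keeps, count the non-null values in each row. The plain-Python sketch below does this for the four sample rows. It does not call Spark, and the names `sample` and `non_null_counts` are illustrative.

```python
# Each tuple is (Id, Name, Sales); None stands in for a null.
sample = {
    "emp1": ("emp1", "John", None),
    "emp2": ("emp2", None, None),
    "emp3": ("emp3", None, 345.0),
    "emp4": ("emp4", "Cindy", 456.0),
}

# Number of populated columns per row.
non_null_counts = {
    key: sum(value is not None for value in values)
    for key, values in sample.items()
}
print(non_null_counts)  # {'emp1': 2, 'emp2': 1, 'emp3': 2, 'emp4': 3}

# A row is kept when its count is at least `thresh`.
kept_by_thresh = {
    thresh: [key for key, n in non_null_counts.items() if n >= thresh]
    for thresh in (2, 3)
}
print(kept_by_thresh)  # {2: ['emp1', 'emp3', 'emp4'], 3: ['emp4']}
```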

In [7]:
df.na.drop(thresh=2).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [12]:
# thresh = 3 means at least 3 columns should be populated for a row in the data frame.

df.na.drop(thresh=3).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+



In [14]:
# how = 'any' means drop a row if any of the column values is null

df.na.drop(how = 'any').show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+



In [5]:
# how = 'all' means drop a row only if all its columns are null.
# Every row here has an Id, so no row is dropped.

df.na.drop(how = 'all').show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| NULL|
|emp2| NULL| NULL|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [6]:
df.na.drop(subset = "Sales").show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [7]:
df.na.drop(subset = ["Sales","Name"]).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+



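`fill` matches the type of the value against the type of each column: a string value fills nulls only in string columns, so `Sales` keeps its nulls in the output below.

In [8]: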
df.na.fill("No Name").show()

+----+-------+-----+
|  Id|   Name|Sales|
+----+-------+-----+
|emp1|   John| NULL|
|emp2|No Name| NULL|
|emp3|No Name|345.0|
|emp4|  Cindy|456.0|
+----+-------+-----+



In [9]:
df.na.fill("No Name","Name").show()

+----+-------+-----+
|  Id|   Name|Sales|
+----+-------+-----+
|emp1|   John| NULL|
|emp2|No Name| NULL|
|emp3|No Name|345.0|
|emp4|  Cindy|456.0|
+----+-------+-----+



In [11]:
df.na.fill(0,"Sales").show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|  0.0|
|emp2| NULL|  0.0|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [28]:
from pyspark.sql.functions import mean

In [29]:
meanval = df.select(mean(df["Sales"])).collect()

In [30]:
# Use a separate name so the imported mean() function is not shadowed
mean_sales = meanval[0][0]

In [31]:
print(mean_sales)

400.5


In [32]:
df.na.fill(mean_sales,"Sales").show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| NULL|400.5|
|emp3| NULL|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
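
The fill value 400.5 is the mean of the non-null `Sales` entries only. Nulls are skipped, not counted as zero. The plain-Python sketch below checks that arithmetic without using Spark; the variable names are illustrative.

```python
# Sales column from the sample data; None stands in for a null.
sales = [None, None, 345.0, 456.0]

# Average only the values that are present.
present = [s for s in sales if s is not None]
mean_sales = sum(present) / len(present)
print(mean_sales)  # 400.5

# Replace each null with the mean, leaving existing values untouched.
filled = [mean_sales if s is None else s for s in sales]
print(filled)  # [400.5, 400.5, 345.0, 456.0]
```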

