# Spark DataFrame basics IV - Missing data

<p>Note: When this Databricks notebook is downloaded as an .ipynb file, the cell output loses its table formatting. If you run the notebook on a Databricks cluster, the output is displayed as a properly formatted table.</p>

<p>For example, the following output:</p>
<p>+----+-------+ age| name| +----+-------+ null|Michael| 30| Andy| 19| Justin| +----+-------+</p>
<p>should actually look like this:</p>
<pre>+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+  </pre>

### Create session and load data

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('miss').getOrCreate()

In [3]:
df = spark.read.csv('/FileStore/tables/ContainsNull.csv', header=True, inferSchema=True)

In [4]:
df.show()

### Drop rows with missing data in a subset of columns

In [5]:
df.na.drop(subset=['Sales']).show()

### Drop rows with missing data based on a threshold

In [6]:
df.na.drop(thresh=2).show()
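
With `thresh=2`, a row is kept only if it has at least two non-null values. The rule can be sketched in plain Python, without Spark. The helper `drop_by_thresh` and the sample rows below are hypothetical and made up for illustration; they are not the contents of `ContainsNull.csv`.

```python
# Plain-Python illustration of the rule behind df.na.drop(thresh=n):
# a row is kept only if it has at least n non-null values.
# Note: Spark also treats NaN as missing; this sketch only checks for None.

def drop_by_thresh(rows, thresh):
    """Return the rows that have at least `thresh` non-None values."""
    return [row for row in rows if sum(v is not None for v in row) >= thresh]

# Made-up sample rows (hypothetical data, for illustration only)
rows = [
    ("emp1", "John", None),    # 2 non-null values
    ("emp2", None, None),      # 1 non-null value
    ("emp3", None, 345.0),     # 2 non-null values
    ("emp4", "Cindy", 456.0),  # 3 non-null values
]

kept = drop_by_thresh(rows, thresh=2)
print(kept)
```

In this sketch only the second row is dropped, because it is the only one with fewer than two non-null values.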

In [7]:
df.printSchema()

### Fill missing data

In [8]:
df.na.fill('No name', subset=['Name']).show()

In [9]:
from pyspark.sql.functions import mean
# collect() returns a list of Row objects; the mean itself is at [0][0]
mean_value = df.select(mean(df['Sales'])).collect()
mean_value

In [10]:
mean_value[0][0]

In [11]:
df.na.fill(mean_value[0][0], ['Sales']).show()
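
Spark's `mean` aggregate skips null values, so the fill value used above is the average of the known sales only. A plain-Python sketch of the same idea follows. The `fill_with_mean` helper and the sample list are hypothetical and not part of PySpark.

```python
# Plain-Python illustration of mean imputation:
# compute the mean of the known values, then use it for the missing ones.

def fill_with_mean(values):
    """Replace None entries with the mean of the non-None entries."""
    present = [v for v in values if v is not None]
    if not present:
        # Nothing to average, so leave the values unchanged
        return list(values)
    mean_value = sum(present) / len(present)
    return [mean_value if v is None else v for v in values]

# Made-up sales figures (hypothetical data, for illustration only)
sales = [None, 200.0, None, 400.0]
filled = fill_with_mean(sales)
print(filled)
```

The missing entries take the mean of the two known values, while the known values stay as they are.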