**_pySpark Basics: Missing Data_**

_by Jeff Levy (jlevy@urban.org)_

_Last Updated: 21 June 2016, Spark v1.6.1_

_Abstract: In this guide we'll look at how to handle null and missing values in pySpark._

***

We'll begin by verifying the Spark Context and loading the SQL Context necessary to work with a dataframe:

In [1]:
try:
    sc
except NameError:
    raise Exception('Spark context not created.')

In [2]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

We'll load some real data from CSV to work with.  It helps to know in advance how the dataset denotes missing values - is it an empty string, or something else?  Most CSVs use empty strings, but we can't compute anything on a column that mixes strings and numbers.  What we want instead is the `null` object in pySpark, and when we import the data we can tell the reader to substitute `null` for whatever value the dataset uses to denote missing data.

In [3]:
df = sqlContext.read.load('s3://ui-hfpc/Performance_2015Q1.txt',
                          format='com.databricks.spark.csv',
                          header='false',
                          inferSchema='true',
                          delimiter='|',
                          nullValue=''
                          )

Note that on the `nullValue=''` line, the empty string can be replaced by whatever your dataset uses.
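To make the idea behind `nullValue` concrete, here is a rough sketch in plain Python using the standard `csv` module. It only mimics the substitution on a tiny made-up sample; it is not how the Spark CSV reader is implemented, and the helper name `read_with_nulls` and the sample text are hypothetical.

```python
import csv
import io

def read_with_nulls(text, delimiter='|', null_value=''):
    """Parse delimited text, replacing the missing-value marker with None."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [[None if field == null_value else field for field in row]
            for row in reader]

# Made-up sample: two pipe-delimited rows, each with one empty field.
sample = u"100|A||3.5\n101|B|7|\n"
rows = read_with_nulls(sample)
for row in rows:
    print(row)
```

If a dataset marked missing values with, say, `NA`, the same sketch would be called with `null_value='NA'`.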

First let's see how many rows the entire dataframe has:

In [12]:
df.count()

3526154

To explore missing data in pySpark, we need to make sure we're looking at a numeric column - the system does not insert `null` into a column that has a string datatype.  The general point of `null` is to tell the system which rows to skip when doing calculations down a column.

For example, the mean of the series [3, 4, 2, null, 5] is: 

14 / 4 = 3.5 

not: 

14 / 5 = 2.8
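The same arithmetic can be reproduced in plain Python, using `None` as a stand-in for `null`. This is only an illustrative sketch; the helper name `mean_skipping_nulls` is hypothetical and is not part of pySpark.

```python
def mean_skipping_nulls(values):
    """Average the non-missing entries, treating None as a stand-in for null."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # nothing to average
    return sum(present) / float(len(present))

series = [3, 4, 2, None, 5]

# Skipping the missing entry divides by 4.
skipped = mean_skipping_nulls(series)

# Counting the missing entry as a zero divides by 5 instead.
counted_as_zero = sum(v or 0 for v in series) / float(len(series))

print(skipped)
print(counted_as_zero)
```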

In [4]:
df.dtypes

[('C0', 'bigint'),
 ('C1', 'string'),
 ('C2', 'string'),
 ('C3', 'double'),
 ('C4', 'double'),
 ('C5', 'int'),
 ('C6', 'int'),
 ('C7', 'int'),
 ('C8', 'string'),
 ('C9', 'int'),
 ('C10', 'string'),
 ('C11', 'string'),
 ('C12', 'int'),
 ('C13', 'string'),
 ('C14', 'string'),
 ('C15', 'string'),
 ('C16', 'string'),
 ('C17', 'string'),
 ('C18', 'string'),
 ('C19', 'string'),
 ('C20', 'string'),
 ('C21', 'string'),
 ('C22', 'string'),
 ('C23', 'string'),
 ('C24', 'string'),
 ('C25', 'string'),
 ('C26', 'int'),
 ('C27', 'string')]

For our practice purposes it doesn't matter what this data is.  

In [10]:
df.where(df['C12'].isNull()).count()

3510294

Comparing this with the result of our earlier command, `df.count()`, we can see that column `C12` is mostly null values - only 15,860 of its 3,526,154 rows hold a value.
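The figure of 15,860 comes from subtracting the null count from the total row count. As a quick check in plain Python, with the two counts copied from the outputs above (the variable names are only illustrative):

```python
total_rows = 3526154  # result of df.count()
null_rows = 3510294   # result of df.where(df['C12'].isNull()).count()

non_null_rows = total_rows - null_rows
null_share = null_rows / float(total_rows)

print(non_null_rows)
print('{0:.2%} of rows are null'.format(null_share))
```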

When exploring a dataset we might want to check all our numeric columns for null values.  However, the `isNull()` method can only be called on a column, not an entire dataframe, so I'll write a short Python loop to do this for us:

In [15]:
null_counts = []       # make an empty list to hold our results
for col in df.dtypes:  # iterate through the (name, type) pairs we saw above
    cname = col[0]
    ctype = col[1]
    if ctype != 'string':  # the null count for string columns would just be 0 here, so we skip them for efficiency
        nulls = df.where(df[cname].isNull()).count()
        result = (cname, nulls)
        null_counts.append(result)

In [16]:
null_counts

[('C0', 0),
 ('C3', 0),
 ('C4', 1945752),
 ('C5', 0),
 ('C6', 0),
 ('C7', 1),
 ('C9', 0),
 ('C12', 3510294),
 ('C26', 3526153)]
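Raw counts are easier to compare as a share of the total. A small sketch in plain Python, with the counts and the row total copied from the outputs above (the names `observed_counts` and `null_shares` are illustrative and not part of the original notebook):

```python
total_rows = 3526154  # from df.count()
observed_counts = [('C0', 0), ('C3', 0), ('C4', 1945752), ('C5', 0), ('C6', 0),
                   ('C7', 1), ('C9', 0), ('C12', 3510294), ('C26', 3526153)]

# Keep only the columns that actually contain nulls, most-null first.
null_shares = sorted(
    ((cname, nulls / float(total_rows))
     for cname, nulls in observed_counts if nulls > 0),
    key=lambda pair: pair[1],
    reverse=True,
)

for cname, share in null_shares:
    print('{0}: {1:.2%} null'.format(cname, share))
```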

A quick note about Python programming in general, for those who may be new(er) to the language: one of the core precepts of Python is that code should be as easy to read as possible.  For the purpose of clarity I spread that last code block out vertically far more than was strictly necessary.  This bit of code would do the exact same thing:

In [18]:
"""
null_counts = []
for col in df.dtypes:
    if col[1] != 'string':
        null_counts.append(tuple([col[0], df.where(df[col[0]].isNull()).count()]))
""";

But although it accomplishes the same thing in 4 lines instead of 8, it goes against Python style conventions: the last line reads like an unreadable jumble.  Much more on this can be found in the official Python PEP8 style guide, located at:

https://www.python.org/dev/peps/pep-0008/
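One possible middle ground between the two versions, offered as a sketch rather than as the style PEP8 mandates: unpack each `(name, type)` pair directly in the loop and pass the counting step in as a function. The helper `null_counts_for` and its `count_nulls` argument are hypothetical names introduced for illustration, and the sample values are stand-ins so that the block runs without Spark.

```python
def null_counts_for(dtypes, count_nulls):
    """Return (column name, null count) pairs for every non-string column.

    dtypes      -- list of (name, type) pairs, shaped like df.dtypes
    count_nulls -- function that takes a column name and returns its null count
    """
    return [(cname, count_nulls(cname))
            for cname, ctype in dtypes
            if ctype != 'string']

# Stand-in data shaped like the outputs shown earlier in this guide.
sample_dtypes = [('C0', 'bigint'), ('C1', 'string'), ('C12', 'int')]
sample_nulls = {'C0': 0, 'C12': 3510294}

result = null_counts_for(sample_dtypes, sample_nulls.get)
print(result)
```

With the dataframe from this guide, the second argument could be `lambda cname: df.where(df[cname].isNull()).count()`.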