# Handling Missing Data in PySpark HW Solutions

In this homework assignment you will practice handling missing data.

**Review:** you have two basic options for dealing with missing data (you will have to decide which approach is right for your situation):

1. Drop the missing data points (including the entire row)
2. Fill them in with some other value.

Let's practice some examples of each of these methods!


#### But first!

Start your Spark session

In [1]:
import pandas as pd
import numpy as np
import datetime as dt

import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col  # explicit names avoid shadowing builtins such as sum and max
from pyspark.sql.types import *

In [2]:
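# initialize; the app name has to be set before the context is created
spark = SparkSession.builder.appName('handlingMissing').getOrCreate()
sc = spark.sparkContext
# number of block managers (driver plus executors), reported as "cores" below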
print('%d cores'%spark._jsc.sc().getExecutorMemoryStatus().keySet().size())
spark

1 cores


## Read in the dataset for this Notebook

Use the Weather.csv file attached to this lecture.

In [13]:
fil = '../../data/Weather.csv'
schem = StructType([StructField('pickup_datetime', TimestampType()), StructField('tempm', FloatType()),
                    StructField('tempi', FloatType()), StructField('dewptm', FloatType()),
                    StructField('dewpti', FloatType()), StructField('hum', FloatType()),
                    StructField('wspdm', StringType()), StructField('wspdi', FloatType()),
                    StructField('wgustm', FloatType()), StructField('wgusti', FloatType()),
                    StructField('wdird', FloatType()), StructField('wdire', FloatType()),
                    StructField('vism', FloatType()), StructField('visi', FloatType()),
                    StructField('pressurem', FloatType()), StructField('pressurei', FloatType()),
                    StructField('windchillm', FloatType()), StructField('windchilli', FloatType()),
                    StructField('heatindexm', FloatType()), StructField('heatindexi', FloatType()),
                    StructField('precipm', FloatType()), StructField('precipi', FloatType()),
                    StructField('conds', StringType()), StructField('icon', StringType()),
                    StructField('fog', IntegerType()), StructField('rain', IntegerType()),
                    StructField('snow', IntegerType()), StructField('hail', IntegerType()),
                    StructField('thunder', IntegerType()), StructField('tornado', IntegerType())])
# Spark datetime patterns: yyyy = year, MM = month, dd = day, HH = hour, mm = minute, ss = second
weath = spark.read.format('csv').options(header='True', timestampFormat='yyyy-MM-dd HH:mm:ss').schema(schem).load(fil)

In [14]:
print('%d rows'%weath.count())
display(weath.limit(50).toPandas().head())

10481 rows


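| | pickup_datetime | tempm | tempi | dewptm | dewpti | hum | wspdm | wspdi | wgustm | wgusti | ... | precipm | precipi | conds | icon | fog | rain | snow | hail | thunder | tornado |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-12-31 00:15:00 | 7.8 | 46.0 | 6.1 | 43.0 | 89.0 | 7.4 | 4.6 | | | ... | 0.5 | 0.02 | Light Rain | rain | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 2015-12-31 00:42:00 | 7.8 | 46.0 | 6.1 | 43.0 | 89.0 | 7.4 | 4.6 | | | ... | 0.8 | 0.03 | Overcast | cloudy | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2015-12-31 00:51:00 | 7.8 | 46.0 | 6.1 | 43.0 | 89.0 | 5.6 | 3.5 | | | ... | 0.8 | 0.03 | Overcast | cloudy | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2015-12-31 01:51:00 | 7.2 | 45.0 | 5.6 | 42.099998 | 90.0 | 7.4 | 4.6 | | | ... | 0.3 | 0.01 | Overcast | cloudy | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2015-12-31 02:51:00 | 7.2 | 45.0 | 5.6 | 42.099998 | 90.0 | 0.0 | 0.0 | | | ... | | | Overcast | cloudy | 0 | 0 | 0 | 0 | 0 | 0 |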


## About this dataset

**New York City Taxi Trip - Hourly Weather Data**

Here is some detailed weather data for the New York City Taxi Trips.

**Source:** https://www.kaggle.com/meinertsen/new-york-city-taxi-trip-hourly-weather-data

### Print a view of the first several lines of the dataframe to see what our data looks like

In [None]:
# done

### Print the schema 

So that we can see if we need to make any corrections to the data types.

In [None]:
# not needed

## 1. How much missing data are we working with?

Get a count and percentage of each variable in the dataset to answer this question.

In [15]:
# count nulls per column
cnt = weath.count()
nullCounts = {colm:weath.select(colm).where(col(colm).isNull()).count() for colm in weath.columns}
nullCounts = {colm:(ncnt, ncnt/cnt) for (colm, ncnt) in nullCounts.items()}
print(nullCounts)

{'pickup_datetime': (0, 0.0), 'tempm': (5, 0.00047705371624844956), 'tempi': (5, 0.00047705371624844956), 'dewptm': (5, 0.00047705371624844956), 'dewpti': (5, 0.00047705371624844956), 'hum': (5, 0.00047705371624844956), 'wspdm': (737, 0.07031771777502147), 'wspdi': (737, 0.07031771777502147), 'wgustm': (8605, 0.8210094456635817), 'wgusti': (8605, 0.8210094456635817), 'wdird': (0, 0.0), 'wdire': (10481, 1.0), 'vism': (245, 0.02337563209617403), 'visi': (245, 0.02337563209617403), 'pressurem': (239, 0.02280316763667589), 'pressurei': (239, 0.02280316763667589), 'windchillm': (7775, 0.7418185287663391), 'windchilli': (7775, 0.7418185287663391), 'heatindexm': (9644, 0.9201412079000095), 'heatindexi': (9644, 0.9201412079000095), 'precipm': (8775, 0.837229272016029), 'precipi': (8775, 0.837229272016029), 'conds': (0, 0.0), 'icon': (0, 0.0), 'fog': (0, 0.0), 'rain': (0, 0.0), 'snow': (0, 0.0), 'hail': (0, 0.0), 'thunder': (0, 0.0), 'tornado': (0, 0.0)}


In [16]:
# pretty print
nullCountsDF = pd.DataFrame(nullCounts).T.reset_index(drop=False).sort_values(1, ascending=False)
nullCountsDF.columns = ['Column', 'Freq.', 'Rel. Freq.']
nullCountsDF = nullCountsDF.merge(pd.DataFrame([[colm.name, colm.dataType] for colm in weath.schema], columns=['Column', 'Type']),
                                how='inner', on=['Column'])
display(nullCountsDF)

| | Column | Freq. | Rel. Freq. | Type |
|---|---|---|---|---|
| 0 | wdire | 10481.0 | 1.0 | FloatType |
| 1 | heatindexi | 9644.0 | 0.920141 | FloatType |
| 2 | heatindexm | 9644.0 | 0.920141 | FloatType |
| 3 | precipi | 8775.0 | 0.837229 | FloatType |
| 4 | precipm | 8775.0 | 0.837229 | FloatType |
| 5 | wgustm | 8605.0 | 0.821009 | FloatType |
| 6 | wgusti | 8605.0 | 0.821009 | FloatType |
| 7 | windchilli | 7775.0 | 0.741819 | FloatType |
| 8 | windchillm | 7775.0 | 0.741819 | FloatType |
| 9 | wspdm | 737.0 | 0.070318 | StringType |
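
The count-and-share arithmetic behind this summary can be checked on a toy example. The sketch below is plain Python rather than Spark: `null_summary` and the sample rows are hypothetical, and `None` stands in for a Spark null. It tallies the nulls of every column in a single pass over the rows instead of scanning once per column.

```python
def null_summary(rows, columns):
    """Return {column: (null_count, null_fraction)} from one pass over rows."""
    counts = {c: 0 for c in columns}
    total = 0
    for row in rows:
        total += 1
        for c in columns:
            if row.get(c) is None:
                counts[c] += 1
    # Guard against an empty input so the fraction never divides by zero.
    return {c: (n, n / total if total else 0.0) for c, n in counts.items()}


# Hypothetical sample: 4 rows, 'wdire' is null everywhere, 'tempm' once.
sample = [
    {'tempm': 7.8, 'wdire': None, 'conds': 'Overcast'},
    {'tempm': None, 'wdire': None, 'conds': 'Light Rain'},
    {'tempm': 7.2, 'wdire': None, 'conds': 'Overcast'},
    {'tempm': 6.7, 'wdire': None, 'conds': 'Clear'},
]
summary = null_summary(sample, ['tempm', 'wdire', 'conds'])
print(summary)
```

In Spark, the same single-pass idea is commonly written as one `select` that contains `count(when(col(c).isNull(), c))` for each column, which avoids launching a separate job per column.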


## 2. How many rows contain at least one null value?

We want to know how many rows we will lose if we use df.na.drop().

In [19]:
dcnt = weath.dropna().count()
print('%d rows reduced to %d (%0.2f%% lost)'%(cnt, dcnt, 100*(cnt - dcnt)/cnt))

10481 rows reduced to 0 (100.00% lost)
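
Losing every row is consistent with the null counts from question 1: `wdire` is null in all 10481 rows, and `dropna()` defaults to `how='any'`, so a single null is enough to drop a row. A minimal plain-Python model of that rule follows; the helper name and the sample rows are hypothetical, and `None` stands in for null.

```python
def drop_any(rows):
    """Keep only rows in which no value is None (models how='any')."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Hypothetical rows: 'wdire' is missing everywhere, the rest is complete.
rows = [
    {'tempm': 7.8, 'wdire': None},
    {'tempm': 7.2, 'wdire': None},
    {'tempm': 6.7, 'wdire': None},
]
kept = drop_any(rows)
print('%d rows reduced to %d' % (len(rows), len(kept)))

# Ignoring the always-null column first leaves every row intact.
trimmed = [{k: v for k, v in row.items() if k != 'wdire'} for row in rows]
kept_trimmed = drop_any(trimmed)
print('%d rows reduced to %d' % (len(trimmed), len(kept_trimmed)))
```

One practical reading: a column that is entirely null is probably better dropped (or re-read with a different type) before any row-wise `dropna` is attempted.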


## 3. Drop the missing data

Drop any row that contains missing data across the whole dataset

In [20]:
dcnt = weath.dropna(how='any').count()
print('%d rows reduced to %d (%0.2f%% lost)'%(cnt, dcnt, 100*(cnt - dcnt)/cnt))

10481 rows reduced to 0 (100.00% lost)


## 4. Drop with a threshold

Count how many rows would be dropped if we kept only the rows that have at least 12 non-null values.

In [21]:
dcnt = weath.dropna(thresh=12).count()
print('%d rows reduced to %d (%0.2f%% lost)'%(cnt, dcnt, 100*(cnt - dcnt)/cnt))

10481 rows reduced to 10476 (0.05% lost)
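
With `thresh`, a row survives when it has at least that many non-null values, and PySpark documents that `thresh` overrides `how`. A rough plain-Python model of the rule, using a hypothetical `drop_below_thresh` helper and made-up rows:

```python
def drop_below_thresh(rows, thresh):
    """Keep rows with at least `thresh` non-None values (models dropna(thresh=...))."""
    return [
        row for row in rows
        if sum(v is not None for v in row.values()) >= thresh
    ]


rows = [
    {'a': 1, 'b': 2, 'c': 3},           # 3 non-null values
    {'a': 1, 'b': None, 'c': 3},        # 2 non-null values
    {'a': None, 'b': None, 'c': 3},     # 1 non-null value
]
kept_2 = drop_below_thresh(rows, thresh=2)
kept_3 = drop_below_thresh(rows, thresh=3)
print(len(kept_2), len(kept_3))
```

Raising the threshold can only remove more rows, which is a quick sanity check when comparing results for different `thresh` values.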


## 5. Drop rows according to specific column value

Now count how many rows would be dropped if you only drop rows whose values in the tempm column are null/NaN

In [22]:
dcnt = weath.dropna(subset='tempm').count()
print('%d rows reduced to %d (%0.2f%% lost)'%(cnt, dcnt, 100*(cnt - dcnt)/cnt))

10481 rows reduced to 10476 (0.05% lost)


## 6. Drop rows that are null across all columns

Count how many rows would be dropped if you only dropped rows where ALL the values are null

In [None]:
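dcnt = weath.dropna(how='all').count()  # how='all' drops a row only if every column is null
print('%d rows reduced to %d (%0.2f%% lost)'%(cnt, dcnt, 100*(cnt - dcnt)/cnt))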

## 7. Fill in the missing values of all string columns with "N/A"

Make sure you don't edit the df dataframe itself. Create a copy of the df then edit that one.

In [24]:
# isinstance is more robust than an identity comparison against a fresh StringType()
filled = weath.fillna('N/A', subset=[colm.name for colm in weath.schema if isinstance(colm.dataType, StringType)])
filled.show()

+-------------------+-----+-----+------+------+----+-----+-----+------+------+-----+-----+----+----+---------+---------+----------+----------+----------+----------+-------+-------+----------------+------------+---+----+----+----+-------+-------+
|    pickup_datetime|tempm|tempi|dewptm|dewpti| hum|wspdm|wspdi|wgustm|wgusti|wdird|wdire|vism|visi|pressurem|pressurei|windchillm|windchilli|heatindexm|heatindexi|precipm|precipi|           conds|        icon|fog|rain|snow|hail|thunder|tornado|
+-------------------+-----+-----+------+------+----+-----+-----+------+------+-----+-----+----+----+---------+---------+----------+----------+----------+----------+-------+-------+----------------+------------+---+----+----+----+-------+-------+
|2015-12-31 00:15:00|  7.8| 46.0|   6.1|  43.0|89.0|  7.4|  4.6|  null|  null| 40.0| null| 4.0| 2.5|   1018.2|    30.07|       6.6|      43.9|      null|      null|    0.5|   0.02|      Light Rain|        rain|  0|   1|   0|   0|      0|      0|
|2015-12-31 00:4
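
Selecting the string columns by type is the key step in this question. The sketch below models it in plain Python with a hypothetical `schema` dict that maps column names to Python types; it is an illustration of the idea, not Spark code. In PySpark itself, `fillna` with a string value is documented to ignore non-string columns listed in `subset`, so the explicit filter mainly makes the intent visible.

```python
def fill_strings(rows, schema, value='N/A'):
    """Return new rows with None replaced by `value` in str-typed columns only."""
    string_cols = [name for name, typ in schema.items() if issubclass(typ, str)]
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay unchanged
        for name in string_cols:
            if new_row.get(name) is None:
                new_row[name] = value
        filled.append(new_row)
    return filled


# Hypothetical schema and rows; None stands in for a Spark null.
schema = {'tempm': float, 'conds': str, 'icon': str}
rows = [
    {'tempm': None, 'conds': None, 'icon': 'rain'},
    {'tempm': 7.2, 'conds': 'Overcast', 'icon': None},
]
filled = fill_strings(rows, schema)
print(filled)
```

The numeric `tempm` null is left alone, and the original `rows` list is not modified, which mirrors the instruction to work on a copy.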

## 8. Fill in NaN values with averages for the tempm and tempi columns

*Note: you will first need to compute the averages for each column and then fill in with the corresponding value.*

In [27]:
# fill nulls with the non-null mean of each column (avg ignores nulls)
means = weath.agg(*(avg(colm).alias(colm) for colm in ['tempm', 'tempi'])).first().asDict()
noNull = weath.fillna(means)
noNull.select('tempm', 'tempi').show()

+-----+-----+
|tempm|tempi|
+-----+-----+
|  7.8| 46.0|
|  7.8| 46.0|
|  7.8| 46.0|
|  7.2| 45.0|
|  7.2| 45.0|
|  6.7| 44.1|
|  7.2| 45.0|
|  7.2| 45.0|
|  7.2| 45.0|
|  7.2| 45.0|
|  7.2| 45.0|
|  7.8| 46.0|
|  7.8| 46.0|
|  7.8| 46.0|
|  8.3| 46.9|
|  8.3| 46.9|
|  8.3| 46.9|
|  8.3| 46.9|
|  8.3| 46.9|
|  8.3| 46.9|
+-----+-----+
only showing top 20 rows
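
The averaging step works because the mean is taken over the non-null values only (Spark's `avg` skips nulls), and the result then replaces each null in the same column. A small plain-Python check of that arithmetic, with a hypothetical helper and made-up readings:

```python
def fill_with_mean(values):
    """Replace None entries with the mean of the non-None entries."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to average, leave the column as-is
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


# Made-up temperature readings; None stands in for a Spark null.
tempm = [8.0, None, 6.0, 7.0, None]
filled_tempm = fill_with_mean(tempm)
print(filled_tempm)
```

Here the three known readings average to 7.0, so both gaps receive 7.0 and the column mean is unchanged by the fill.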



### That's it! Great Job!

In [28]:
sc.stop()