# Handling Missing Data in PySpark HW Solutions

In this HW assignment you will strengthen your skills in dealing with missing data.
 
**Review:** you have 2 basic options for handling missing data (you will have to decide which approach is right for your data):

1. Drop the missing data points (including the entire row)
2. Fill them in with some other value.

Let's practice some examples of each of these methods!


#### But first!

Start your Spark session

In [22]:

import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
# May take a while locally
spark = SparkSession.builder.appName("nulls").getOrCreate()

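# Note: this counts the block managers (driver and executors) known to Spark,
# which is 1 in local mode; it is not a count of CPU cores
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()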
print("You are working with", cores, "core(s)")
spark

You are working with 1 core(s)


## Read in the dataset for this Notebook

The file weather.csv is attached to this lecture.

In [23]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# Read the CSV, using the first line as the header and inferring column types
weather = spark.read.csv('Datasets/weather.csv',inferSchema=True,header=True)


## About this dataset

**New York City Taxi Trip - Hourly Weather Data**

Here is some detailed weather data for the New York City Taxi Trips.

**Source:** https://www.kaggle.com/meinertsen/new-york-city-taxi-trip-hourly-weather-data

### Print a view of the first several lines of the dataframe to see what our data looks like

In [24]:
weather.limit(4).toPandas()


Unnamed: 0,pickup_datetime,tempm,tempi,dewptm,dewpti,hum,wspdm,wspdi,wgustm,wgusti,wdird,wdire,vism,visi,pressurem,pressurei,windchillm,windchilli,heatindexm,heatindexi,precipm,precipi,conds,icon,fog,rain,snow,hail,thunder,tornado
0,2015-12-31 00:15:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,40,NE,4.0,2.5,1018.2,30.07,6.6,43.9,,,0.5,0.02,Light Rain,rain,0,1,0,0,0,0
1,2015-12-31 00:42:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,0,Variable,6.4,4.0,1017.8,30.06,6.6,43.9,,,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
2,2015-12-31 00:51:00,7.8,46.0,6.1,43.0,89.0,5.6,3.5,,,20,NNE,8.0,5.0,1017.0,30.04,7.1,44.8,,,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
3,2015-12-31 01:51:00,7.2,45.0,5.6,42.1,90.0,7.4,4.6,,,0,Variable,12.9,8.0,1016.5,30.02,5.9,42.6,,,0.3,0.01,Overcast,cloudy,0,0,0,0,0,0


### Print the schema 

So that we can see if we need to make any corrections to the data types.

In [25]:
weather.printSchema()

root
 |-- pickup_datetime: timestamp (nullable = true)
 |-- tempm: double (nullable = true)
 |-- tempi: double (nullable = true)
 |-- dewptm: double (nullable = true)
 |-- dewpti: double (nullable = true)
 |-- hum: double (nullable = true)
 |-- wspdm: double (nullable = true)
 |-- wspdi: double (nullable = true)
 |-- wgustm: double (nullable = true)
 |-- wgusti: double (nullable = true)
 |-- wdird: integer (nullable = true)
 |-- wdire: string (nullable = true)
 |-- vism: double (nullable = true)
 |-- visi: double (nullable = true)
 |-- pressurem: double (nullable = true)
 |-- pressurei: double (nullable = true)
 |-- windchillm: double (nullable = true)
 |-- windchilli: double (nullable = true)
 |-- heatindexm: double (nullable = true)
 |-- heatindexi: double (nullable = true)
 |-- precipm: double (nullable = true)
 |-- precipi: double (nullable = true)
 |-- conds: string (nullable = true)
 |-- icon: string (nullable = true)
 |-- fog: integer (nullable = true)
 |-- rain: integer (nullab

## 1. How much missing data are we working with?

To answer this question, get the null count and the null percentage for each column in the dataset.

In [5]:
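# Explicit imports avoid shadowing Python built-ins such as sum and max;
# avg is used again in question 8
from pyspark.sql.functions import col, avg

def null_value_calc(df):
    """Return (column, null count, null percent) for every column that has nulls."""
    null_columns_counts = []
    numRows = df.count()
    for k in df.columns:
        # Count the rows where this column is null (one Spark job per column)
        nullRows = df.where(col(k).isNull()).count()
        if nullRows > 0:
            temp = k, nullRows, (nullRows / numRows) * 100
            null_columns_counts.append(temp)
    return null_columns_counts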

null_columns_calc_list = null_value_calc(weather)
spark.createDataFrame(null_columns_calc_list, ['Column_Name', 'Null_Values_Count','Null_Value_Percent']).show()


+-----------+-----------------+-------------------+
|Column_Name|Null_Values_Count| Null_Value_Percent|
+-----------+-----------------+-------------------+
|      tempm|                5|0.04770537162484496|
|      tempi|                5|0.04770537162484496|
|     dewptm|                5|0.04770537162484496|
|     dewpti|                5|0.04770537162484496|
|        hum|                5|0.04770537162484496|
|      wspdm|              737|  7.031771777502146|
|      wspdi|              737|  7.031771777502146|
|     wgustm|             8605|  82.10094456635817|
|     wgusti|             8605|  82.10094456635817|
|       vism|              245| 2.3375632096174033|
|       visi|              245| 2.3375632096174033|
|  pressurem|              239| 2.2803167636675887|
|  pressurei|              239| 2.2803167636675887|
| windchillm|             7775|  74.18185287663391|
| windchilli|             7775|  74.18185287663391|
| heatindexm|             9644|  92.01412079000096|
| heatindexi
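
As a sanity check on the table above, each percentage is the null count divided by the total row count, times 100. The sketch below recomputes three of the printed values in plain Python. The total of 10,481 rows is an assumption inferred from the printed figures; the notebook output does not show it directly.

```python
# Null counts copied from the table above.
null_counts = {"tempm": 5, "wspdm": 737, "wgustm": 8605}

# Assumption: total row count inferred from the printed percentages.
total_rows = 10481

# Same arithmetic as null_value_calc: (nulls / rows) * 100
null_percent = {name: (n / total_rows) * 100 for name, n in null_counts.items()}

for name, pct in null_percent.items():
    print(f"{name}: {pct:.4f}%")
```

If the recomputed values agree with the table, the assumed row count is consistent with the output.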

## 2. How many rows contain at least one null value?

We want to know how many rows we would lose if we called `df.na.drop()` with its default settings.

In [7]:
og_len = weather.count()
drop_len = weather.na.drop().count()
print("Total Rows Dropped:",og_len-drop_len)

Total Rows Dropped: 10481


## 3. Drop the missing data

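Drop the rows whose data is missing across the whole row, that is, rows in which every column is null (`how='all'`). The default `how='any'` removes 10,481 rows here (see question 2), which is why this solution does not use it.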

In [10]:
df_drop_all = weather.na.drop(how='all')
df_drop_all.toPandas()


Unnamed: 0,pickup_datetime,tempm,tempi,dewptm,dewpti,hum,wspdm,wspdi,wgustm,wgusti,...,precipm,precipi,conds,icon,fog,rain,snow,hail,thunder,tornado
0,2015-12-31 00:15:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,...,0.5,0.02,Light Rain,rain,0,1,0,0,0,0
1,2015-12-31 00:42:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,...,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
2,2015-12-31 00:51:00,7.8,46.0,6.1,43.0,89.0,5.6,3.5,,,...,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
3,2015-12-31 01:51:00,7.2,45.0,5.6,42.1,90.0,7.4,4.6,,,...,0.3,0.01,Overcast,cloudy,0,0,0,0,0,0
4,2015-12-31 02:51:00,7.2,45.0,5.6,42.1,90.0,0.0,0.0,,,...,,,Overcast,cloudy,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10476,2016-12-31 19:51:00,6.1,43.0,-4.4,24.1,47.0,7.4,4.6,,,...,,,Overcast,cloudy,0,0,0,0,0,0
10477,2016-12-31 20:51:00,6.1,43.0,-4.4,24.1,47.0,13.0,8.1,38.9,24.2,...,,,Overcast,cloudy,0,0,0,0,0,0
10478,2016-12-31 21:51:00,6.1,43.0,-5.0,23.0,45.0,9.3,5.8,29.6,18.4,...,,,Overcast,cloudy,0,0,0,0,0,0
10479,2016-12-31 22:51:00,6.7,44.1,-5.0,23.0,43.0,14.8,9.2,,,...,,,Overcast,cloudy,0,0,0,0,0,0


## 4. Drop with a threshold

Count how many rows would be dropped if we kept only the rows that have at least 12 non-null values (that is, dropped every row with fewer than 12).

In [11]:
og_len = weather.count()
drop_len = weather.na.drop(thresh=12).count()
print("Total Rows Dropped:",og_len-drop_len)

Total Rows Dropped: 5
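
The `thresh` rule is easy to misread, so here is a minimal plain-Python model of how `na.drop` decides whether to keep a row. It is a sketch of the documented behaviour, not Spark's implementation: `keep_row` and `is_missing` are hypothetical helpers, the rows are made up, and only `None` is treated as missing.

```python
def is_missing(value):
    """Model assumption: only None counts as missing."""
    return value is None


def keep_row(row, how="any", thresh=None):
    """Return True if a na.drop-style rule would keep this row."""
    non_null = sum(not is_missing(v) for v in row)
    if thresh is not None:
        # thresh takes precedence over how: keep rows with >= thresh non-nulls
        return non_null >= thresh
    if how == "any":
        # drop the row if any value is missing
        return non_null == len(row)
    if how == "all":
        # drop the row only if every value is missing
        return non_null > 0
    raise ValueError("how must be 'any' or 'all'")


# Made-up rows: (tempm, tempi, conds)
rows = [
    (7.8, 46.0, "Overcast"),   # 3 non-null values
    (None, None, "Unknown"),   # 1 non-null value
    (None, None, None),        # 0 non-null values
]

kept_any = [r for r in rows if keep_row(r, how="any")]
kept_all = [r for r in rows if keep_row(r, how="all")]
kept_thresh = [r for r in rows if keep_row(r, thresh=2)]

print(len(kept_any), len(kept_all), len(kept_thresh))
```

In this model `thresh=2` keeps only the first row, because it is the only one with at least two non-null values.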


## 5. Drop rows according to specific column value

Now count how many rows would be dropped if you only drop rows whose values in the tempm column are null/NaN

In [15]:
og_len = weather.count()
drop_len = weather.na.drop(subset=["tempm"]).count()
print("Total Rows Dropped:",og_len-drop_len)

Total Rows Dropped: 5


## 6. Drop rows that are null across all columns

Count how many rows would be dropped if you only dropped rows where ALL the values are null

In [17]:
og_len = weather.count()
drop_len = weather.na.drop(how='all').count()
print("Total Rows Dropped:",og_len-drop_len)

Total Rows Dropped: 0


## 7. Fill in all the string columns missing values with the word "N/A"

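Make sure you don't modify the `weather` dataframe itself; assign the result to a new variable instead. In the rows inspected in this notebook, the string columns `conds` and `icon` mark missing observations with the placeholder strings `Unknown` and `unknown`, so this solution replaces those placeholders with "N/A".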

In [59]:
from pyspark.sql.functions import regexp_replace
# Replace the placeholder strings with 'N/A'; withColumn returns a new DataFrame,
# so weather itself is left unchanged
filled = weather.withColumn('conds', regexp_replace('conds', 'Unknown', 'N/A'))\
                .withColumn('icon', regexp_replace('icon', 'unknown', 'N/A'))
filled.filter(filled.tempm.isNull()).limit(5).toPandas()



Unnamed: 0,pickup_datetime,tempm,tempi,dewptm,dewpti,hum,wspdm,wspdi,wgustm,wgusti,wdird,wdire,vism,visi,pressurem,pressurei,windchillm,windchilli,heatindexm,heatindexi,precipm,precipi,conds,icon,fog,rain,snow,hail,thunder,tornado
0,2016-04-04 09:00:00,,,,,,,,,,0,North,,,,,,,,,,,,,0,0,0,0,0,0
1,2016-04-19 07:00:00,,,,,,,,,,0,North,,,,,,,,,,,,,0,0,0,0,0,0
2,2016-05-04 06:00:00,,,,,,,,,,0,North,,,,,,,,,,,,,0,0,0,0,0,0
3,2016-07-11 18:00:00,,,,,,,,,,0,North,,,,,,,,,,,,,0,0,0,0,0,0
4,2016-11-04 07:00:00,,,,,,,,,,0,North,,,,,,,,,,,,,0,0,0,0,0,0
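
One subtlety of the approach above: `regexp_replace` replaces every match of the pattern, including matches inside longer strings. The sketch below shows the difference between an unanchored and an anchored pattern using Python's `re.sub`, on the assumption that these simple patterns behave the same way in Spark's regex engine. The value `"Unknown Precipitation"` is made up for illustration and may not occur in the dataset.

```python
import re

# Made-up sample values; only "Unknown" is known to occur in the dataset.
samples = ["Unknown", "Overcast", "Unknown Precipitation"]

# Unanchored pattern: replaces the match wherever it occurs in the string.
loose = [re.sub("Unknown", "N/A", s) for s in samples]

# Anchored pattern: replaces only values that are exactly "Unknown".
strict = [re.sub(r"^Unknown$", "N/A", s) for s in samples]

print(loose)
print(strict)
```

If partial replacements are not wanted, the same anchored pattern could be passed to `regexp_replace`, for example `regexp_replace('conds', '^Unknown$', 'N/A')`.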


## 8. Fill in NaN values with averages for the tempm and tempi columns

*Note: you will first need to compute the averages for each column and then fill in with the corresponding value.*

In [39]:
df=weather.filter(weather.tempm.isNull())
df.limit(4).toPandas()


Unnamed: 0,pickup_datetime,tempm,tempi,dewptm,dewpti,hum,wspdm,wspdi,wgustm,wgusti,wdird,wdire,vism,visi,pressurem,pressurei,windchillm,windchilli,heatindexm,heatindexi,precipm,precipi,conds,icon,fog,rain,snow,hail,thunder,tornado
0,2016-04-04 09:00:00,,,,,,,,,,0,North,,,,,,,,,,,Unknown,unknown,0,0,0,0,0,0
1,2016-04-19 07:00:00,,,,,,,,,,0,North,,,,,,,,,,,Unknown,unknown,0,0,0,0,0,0
2,2016-05-04 06:00:00,,,,,,,,,,0,North,,,,,,,,,,,Unknown,unknown,0,0,0,0,0,0
3,2016-07-11 18:00:00,,,,,,,,,,0,North,,,,,,,,,,,Unknown,unknown,0,0,0,0,0,0


In [57]:
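from pyspark.sql.functions import avg

def fill_with_mean(df, include=()):
    # Compute the mean of each requested column in a single aggregation
    stats = df.agg(*(avg(c).alias(c) for c in df.columns if c in include))
    # first() returns one Row of means; na.fill accepts a {column: value} dict
    return df.na.fill(stats.first().asDict())

filled_values = fill_with_mean(weather, ["tempm","tempi"])
# Inspect the newly filled dataframe (not the `filled` dataframe from question 7)
filled_values.limit(4).toPandas()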


Unnamed: 0,pickup_datetime,tempm,tempi,dewptm,dewpti,hum,wspdm,wspdi,wgustm,wgusti,wdird,wdire,vism,visi,pressurem,pressurei,windchillm,windchilli,heatindexm,heatindexi,precipm,precipi,conds,icon,fog,rain,snow,hail,thunder,tornado
0,2015-12-31 00:15:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,40,NE,4.0,2.5,1018.2,30.07,6.6,43.9,,,0.5,0.02,Light Rain,rain,0,1,0,0,0,0
1,2015-12-31 00:42:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,0,Variable,6.4,4.0,1017.8,30.06,6.6,43.9,,,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
2,2015-12-31 00:51:00,7.8,46.0,6.1,43.0,89.0,5.6,3.5,,,20,NNE,8.0,5.0,1017.0,30.04,7.1,44.8,,,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
3,2015-12-31 01:51:00,7.2,45.0,5.6,42.1,90.0,7.4,4.6,,,0,Variable,12.9,8.0,1016.5,30.02,5.9,42.6,,,0.3,0.01,Overcast,cloudy,0,0,0,0,0,0
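
The first four rows have no missing temperatures, so the output above cannot show the fill itself. As a small illustration of what `fill_with_mean` does, here is a plain-Python model of the same two steps: average the non-missing values of each column, then substitute that average for the missing ones. The function `fill_missing_with_mean` and the sample rows are hypothetical and use made-up numbers, not values from the dataset.

```python
def fill_missing_with_mean(rows, columns):
    """Fill None in the given columns with that column's mean of non-missing values."""
    means = {}
    for c in columns:
        present = [r[c] for r in rows if r[c] is not None]
        if present:  # skip columns that are entirely missing
            means[c] = sum(present) / len(present)
    # Build new dicts so the input rows are left unchanged
    return [
        {k: (means[k] if v is None and k in means else v) for k, v in r.items()}
        for r in rows
    ]


# Made-up rows; the last one is missing both temperature readings.
rows = [
    {"tempm": 7.0, "tempi": 44.5, "conds": "Overcast"},
    {"tempm": 9.0, "tempi": 48.5, "conds": "Clear"},
    {"tempm": None, "tempi": None, "conds": "Unknown"},
]

filled_rows = fill_missing_with_mean(rows, ["tempm", "tempi"])
print(filled_rows[2])
```

As with Spark's `avg`, the missing values are ignored when the mean is computed, so they do not pull the average toward zero.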


### That's it! Great Job!