# Handling Missing Data in PySpark HW Solutions

In this homework assignment, you will strengthen your skills in dealing with missing data.
 
**Review:** you have two basic options for handling missing data (you will have to decide which approach is right for your data):

1. Drop the missing data points (which removes the entire row).
2. Fill them in with some other value.

Let's practice some examples of each of these methods!


#### But first!

Start your Spark session

In [1]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName('HandlingMissingData').getOrCreate()

In [3]:
spark

## Read in the dataset for this Notebook

Weather.csv attached to this lecture. 

In [4]:
df = spark.read.csv('./data/Weather.csv', inferSchema=True, header=True)

## About this dataset

**New York City Taxi Trip - Hourly Weather Data**

Here is some detailed weather data for the New York City Taxi Trips.

**Source:** https://www.kaggle.com/meinertsen/new-york-city-taxi-trip-hourly-weather-data

### Print a view of the first several lines of the dataframe to see what our data looks like

In [6]:
df.limit(5).toPandas()

Unnamed: 0,pickup_datetime,tempm,tempi,dewptm,dewpti,hum,wspdm,wspdi,wgustm,wgusti,...,precipm,precipi,conds,icon,fog,rain,snow,hail,thunder,tornado
0,2015-12-31 00:15:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,...,0.5,0.02,Light Rain,rain,0,1,0,0,0,0
1,2015-12-31 00:42:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,...,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
2,2015-12-31 00:51:00,7.8,46.0,6.1,43.0,89.0,5.6,3.5,,,...,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
3,2015-12-31 01:51:00,7.2,45.0,5.6,42.1,90.0,7.4,4.6,,,...,0.3,0.01,Overcast,cloudy,0,0,0,0,0,0
4,2015-12-31 02:51:00,7.2,45.0,5.6,42.1,90.0,0.0,0.0,,,...,,,Overcast,cloudy,0,0,0,0,0,0


### Print the schema 

So that we can see if we need to make any corrections to the data types.

In [7]:
df.printSchema()

root
 |-- pickup_datetime: string (nullable = true)
 |-- tempm: double (nullable = true)
 |-- tempi: double (nullable = true)
 |-- dewptm: double (nullable = true)
 |-- dewpti: double (nullable = true)
 |-- hum: double (nullable = true)
 |-- wspdm: double (nullable = true)
 |-- wspdi: double (nullable = true)
 |-- wgustm: double (nullable = true)
 |-- wgusti: double (nullable = true)
 |-- wdird: integer (nullable = true)
 |-- wdire: string (nullable = true)
 |-- vism: double (nullable = true)
 |-- visi: double (nullable = true)
 |-- pressurem: double (nullable = true)
 |-- pressurei: double (nullable = true)
 |-- windchillm: double (nullable = true)
 |-- windchilli: double (nullable = true)
 |-- heatindexm: double (nullable = true)
 |-- heatindexi: double (nullable = true)
 |-- precipm: double (nullable = true)
 |-- precipi: double (nullable = true)
 |-- conds: string (nullable = true)
 |-- icon: string (nullable = true)
 |-- fog: integer (nullable = true)
 |-- rain: integer (nullable = true)

## 1. How much missing data are we working with?

To answer this question, get the count and percentage of null values for each column in the dataset.

In [10]:
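# Explicit imports avoid shadowing Python built-ins such as sum and max (avg is used in section 8)
from pyspark.sql.functions import avg, col

def null_value_calc(df):
    """Return (column, null_count, null_percentage) for every column that has nulls."""
    null_columns_counts = []
    numRows = df.count()
    for k in df.columns:
        # Each count() below runs as its own Spark job, one per column
        nullRows = df.where(col(k).isNull()).count()
        if nullRows > 0:
            temp = (k, nullRows, nullRows * 100/numRows)
            null_columns_counts.append(temp)
    return null_columns_counts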

In [11]:
df_nulls = spark.createDataFrame(null_value_calc(df), ['column', 'row_null_count', 'null_count_percentage'])
df_nulls.show()

+----------+--------------+---------------------+
|    column|row_null_count|null_count_percentage|
+----------+--------------+---------------------+
|     tempm|             5|  0.04770537162484496|
|     tempi|             5|  0.04770537162484496|
|    dewptm|             5|  0.04770537162484496|
|    dewpti|             5|  0.04770537162484496|
|       hum|             5|  0.04770537162484496|
|     wspdm|           737|    7.031771777502147|
|     wspdi|           737|    7.031771777502147|
|    wgustm|          8605|    82.10094456635817|
|    wgusti|          8605|    82.10094456635817|
|      vism|           245|    2.337563209617403|
|      visi|           245|    2.337563209617403|
| pressurem|           239|    2.280316763667589|
| pressurei|           239|    2.280316763667589|
|windchillm|          7775|    74.18185287663391|
|windchilli|          7775|    74.18185287663391|
|heatindexm|          9644|    92.01412079000096|
|heatindexi|          9644|    92.01412079000096|
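
As a quick sanity check on the percentages, the arithmetic from null_value_calc can be reproduced in plain Python. This is only a sketch: it copies a few counts from the table above and assumes the dataset has 10481 rows, the figure implied by the output of the next section. The helper name null_percentage is made up for this illustration.

```python
# Counts copied from the table above. TOTAL_ROWS is an assumption taken from
# the next section's output (10481 rows dropped at a fraction of 1.0).
TOTAL_ROWS = 10481
null_counts = {"tempm": 5, "wspdm": 737, "wgustm": 8605, "heatindexm": 9644}

def null_percentage(null_rows, total_rows):
    """Same arithmetic as null_value_calc: nullRows * 100 / numRows."""
    return null_rows * 100 / total_rows

percentages = {
    name: null_percentage(count, TOTAL_ROWS) for name, count in null_counts.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}% null")
```

Columns such as wgustm and heatindexm are mostly empty, which is worth keeping in mind when deciding between dropping rows and filling values.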


## 2. How many rows contain at least one null value?

We want to know how many rows we will lose if we call df.na.drop() with its default settings.

In [12]:
og_len = df.count()
drop_len = df.na.drop().count()

print(f'Total rows dropped {og_len - drop_len}')
print(f'Fraction of rows dropped {(og_len-drop_len)/og_len}')

Total rows dropped 10481
Fraction of rows dropped 1.0


## 3. Drop the missing data

Drop any row that contains missing data across the whole dataset

In [13]:
df_dropped = df.na.drop()
df_dropped.limit(5).toPandas()

Unnamed: 0,pickup_datetime,tempm,tempi,dewptm,dewpti,hum,wspdm,wspdi,wgustm,wgusti,...,precipm,precipi,conds,icon,fog,rain,snow,hail,thunder,tornado


## 4. Drop with a threshold

Count how many rows would be dropped if we kept only the rows that have at least 12 non-null values. In other words, thresh=12 drops every row with fewer than 12 non-null values.

In [14]:
og_len = df.count()
drop_len = df.na.drop(thresh=12).count()

print(f'Total rows dropped {og_len - drop_len}')
print(f'Fraction of rows dropped {(og_len-drop_len)/og_len}')

Total rows dropped 5
Fraction of rows dropped 0.00047705371624844956
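
The thresh rule is easy to read backwards, so here is a plain-Python model of it. It is a sketch of the documented behavior (rows with fewer than thresh non-null values are dropped), not Spark code; the function name and the three sample rows are invented for illustration.

```python
def drop_below_threshold(rows, thresh):
    """Keep rows with at least `thresh` non-None values; drop the rest."""
    return [
        row for row in rows
        if sum(value is not None for value in row.values()) >= thresh
    ]

# Invented sample rows, not taken from Weather.csv
rows = [
    {"tempm": 7.8, "hum": 89.0, "wgustm": None},   # 2 non-null values
    {"tempm": None, "hum": None, "wgustm": None},  # 0 non-null values
    {"tempm": 7.2, "hum": None, "wgustm": None},   # 1 non-null value
]

kept = drop_below_threshold(rows, thresh=2)
print(f"kept {len(kept)} of {len(rows)} rows")
```

A higher thresh is stricter: it demands more populated columns before a row is allowed to stay.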


## 5. Drop rows according to specific column value

Now count how many rows would be dropped if you drop only the rows whose tempm value is null or NaN.

In [15]:
og_len = df.count()
drop_len = df.na.drop(subset=['tempm']).count()

print(f'Total rows dropped {og_len - drop_len}')
print(f'Fraction of rows dropped {(og_len-drop_len)/og_len}')

Total rows dropped 5
Fraction of rows dropped 0.00047705371624844956
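
To make the effect of subset concrete, the following plain-Python sketch models the rule: only the listed columns are inspected, and nulls elsewhere are ignored. The function name and sample rows are hypothetical; this is not the Spark implementation.

```python
def drop_null_in_subset(rows, subset):
    """Drop a row only when one of the columns named in `subset` is None."""
    return [row for row in rows if all(row[c] is not None for c in subset)]

# Invented sample rows, not taken from Weather.csv
rows = [
    {"tempm": 7.8, "wgustm": None},   # kept: the null is outside the subset
    {"tempm": None, "wgustm": 25.9},  # dropped: tempm is null
    {"tempm": 7.2, "wgustm": 31.5},   # kept: no nulls at all
]

kept = drop_null_in_subset(rows, subset=["tempm"])
print(f"dropped {len(rows) - len(kept)} of {len(rows)} rows")
```

This is why restricting the drop to tempm removes far fewer rows than dropping on every column.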


## 6. Drop rows that are null across all columns

Count how many rows would be dropped if you only dropped rows where ALL the values are null

In [16]:
og_len = df.count()
drop_len = df.na.drop(how='all').count()

print(f'Total rows dropped {og_len - drop_len}')
print(f'Fraction of rows dropped {(og_len-drop_len)/og_len}')

Total rows dropped 0
Fraction of rows dropped 0.0
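
The contrast between how='any' (the default used in section 2) and how='all' can be modeled in a few lines of plain Python. This is a sketch under the assumption that the two modes mean "at least one null" and "every value null"; the function name and sample rows are invented.

```python
def drop_rows(rows, how="any"):
    """Plain-Python model of dropping rows by null content.

    how="any": drop a row if at least one value is None.
    how="all": drop a row only if every value is None.
    """
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    check = any if how == "any" else all
    return [row for row in rows if not check(v is None for v in row.values())]

# Invented rows: each one has a gap somewhere, but none is entirely empty
rows = [
    {"tempm": 7.8, "windchillm": None, "heatindexm": None},
    {"tempm": -2.0, "windchillm": -6.5, "heatindexm": None},
    {"tempm": 31.0, "windchillm": None, "heatindexm": 33.4},
]

print(len(drop_rows(rows, how="any")), "rows survive how='any'")
print(len(drop_rows(rows, how="all")), "rows survive how='all'")
```

The sample mirrors the pattern seen in this dataset: every row is missing something, so 'any' removes everything, while no row is completely empty, so 'all' removes nothing.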


## 7. Fill in the missing values of all string columns with the word "N/A"

Make sure you don't modify the df dataframe itself. Spark dataframes are immutable, so df.na.fill returns a new dataframe: assign the result to a new variable and work with that one.

In [28]:
df_filled = df.na.fill('N/A')
df_filled.limit(10).toPandas()

Unnamed: 0,pickup_datetime,tempm,tempi,dewptm,dewpti,hum,wspdm,wspdi,wgustm,wgusti,...,precipm,precipi,conds,icon,fog,rain,snow,hail,thunder,tornado
0,2015-12-31 00:15:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,...,0.5,0.02,Light Rain,rain,0,1,0,0,0,0
1,2015-12-31 00:42:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,...,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
2,2015-12-31 00:51:00,7.8,46.0,6.1,43.0,89.0,5.6,3.5,,,...,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
3,2015-12-31 01:51:00,7.2,45.0,5.6,42.1,90.0,7.4,4.6,,,...,0.3,0.01,Overcast,cloudy,0,0,0,0,0,0
4,2015-12-31 02:51:00,7.2,45.0,5.6,42.1,90.0,0.0,0.0,,,...,,,Overcast,cloudy,0,0,0,0,0,0
5,2015-12-31 03:28:00,6.7,44.1,5.0,41.0,89.0,7.4,4.6,,,...,,,Overcast,cloudy,0,0,0,0,0,0
6,2015-12-31 03:40:00,7.2,45.0,5.0,41.0,86.0,0.0,0.0,,,...,,,Overcast,cloudy,0,0,0,0,0,0
7,2015-12-31 03:51:00,7.2,45.0,5.0,41.0,86.0,7.4,4.6,,,...,,,Overcast,cloudy,0,0,0,0,0,0
8,2015-12-31 04:22:00,7.2,45.0,5.0,41.0,86.0,5.6,3.5,,,...,,,Overcast,cloudy,0,0,0,0,0,0
9,2015-12-31 04:51:00,7.2,45.0,5.6,42.1,90.0,5.6,3.5,,,...,,,Overcast,cloudy,0,0,0,0,0,0
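
The numeric columns above still show gaps because a string fill value only applies to string columns; columns of other types are ignored. The plain-Python sketch below models that type matching in a simplified way. The schema dictionary, function name, and sample rows are assumptions made for illustration, not part of the Spark API.

```python
def fill_by_type(rows, schema, value):
    """Return new rows with None replaced by `value` in type-matching columns only."""
    return [
        {
            c: (value if v is None and schema[c] is type(value) else v)
            for c, v in row.items()
        }
        for row in rows
    ]

# Hypothetical two-column schema and sample rows
schema = {"conds": str, "tempm": float}
rows = [
    {"conds": None, "tempm": 7.8},
    {"conds": "Overcast", "tempm": None},
]

filled = fill_by_type(rows, schema, "N/A")
print(filled)
```

The function builds new dictionaries rather than editing rows in place, which parallels how the original dataframe stays unchanged after a fill.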


## 8. Fill in missing values with averages for the tempm and tempi columns

*Note: you will first need to compute the averages for each column and then fill in with the corresponding value.*

In [30]:
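def fill_with_mean(df, include=()):
    """Fill nulls in the `include` columns with each column's mean."""
    # A single aggregation yields one row that maps each column to its mean
    stats = df.agg(*(avg(c).alias(c) for c in df.columns if c in include))
    return df.na.fill(stats.first().asDict())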

In [32]:
updated_df = fill_with_mean(df, ['tempm', 'tempi'])
updated_df.limit(5).toPandas()

Unnamed: 0,pickup_datetime,tempm,tempi,dewptm,dewpti,hum,wspdm,wspdi,wgustm,wgusti,...,precipm,precipi,conds,icon,fog,rain,snow,hail,thunder,tornado
0,2015-12-31 00:15:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,...,0.5,0.02,Light Rain,rain,0,1,0,0,0,0
1,2015-12-31 00:42:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,...,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
2,2015-12-31 00:51:00,7.8,46.0,6.1,43.0,89.0,5.6,3.5,,,...,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
3,2015-12-31 01:51:00,7.2,45.0,5.6,42.1,90.0,7.4,4.6,,,...,0.3,0.01,Overcast,cloudy,0,0,0,0,0,0
4,2015-12-31 02:51:00,7.2,45.0,5.6,42.1,90.0,0.0,0.0,,,...,,,Overcast,cloudy,0,0,0,0,0,0
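
The first rows shown above have no missing tempm or tempi values, so the effect of the fill is not visible there. The plain-Python sketch below illustrates the same two-step idea on invented data: compute each column's mean over the non-null values only, then substitute it for the gaps. The function name and rows are hypothetical, and this is a model of the approach rather than Spark code.

```python
def fill_with_column_mean(rows, columns):
    """Return new rows where None in `columns` is replaced by that column's mean."""
    means = {}
    for c in columns:
        present = [row[c] for row in rows if row[c] is not None]
        if present:  # a column with no values at all is left untouched
            means[c] = sum(present) / len(present)
    return [
        {c: (means[c] if v is None and c in means else v) for c, v in row.items()}
        for row in rows
    ]

# Invented sample rows, not taken from Weather.csv
rows = [
    {"tempm": 6.0, "tempi": 42.8, "hum": None},
    {"tempm": None, "tempi": None, "hum": 89.0},
    {"tempm": 8.0, "tempi": 46.4, "hum": 90.0},
]

filled = fill_with_column_mean(rows, ["tempm", "tempi"])
print(filled[1])
```

Columns that are not listed, such as hum here, keep their gaps.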


### That's it! Great Job!