Goal:

Analyse the missing values: what caused them, how many are present in the data set, and how to handle them.

In [1]:
from os.path import join, basename, splitext
from glob import glob
from dask import dataframe as dd
from matplotlib import rcParams
import pandas as pd
import dask
from collections import Counter
import pickle


from deep_aqi import ROOT


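# Use the full option names; the short forms are ambiguous in recent pandas versions.
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 25)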

In [2]:
PROCESSED_DATA = join(ROOT, 'data', 'processed')
INTERIM_DATA = join(ROOT, 'data', 'interim')
RAW_DATA = join(ROOT, 'data', 'raw')

In [3]:
weather_path = join(INTERIM_DATA, 'combined-WEATHER.parquet')
weather = dd.read_parquet(weather_path)

In [4]:
weather.isnull().sum().compute()

SiteCode                            0
LocalDate                           0
Wind Direction - Resultant      14326
Wind Speed - Resultant          65437
Outdoor Temperature                 0
Barometric pressure                 0
Dew Point                     4233851
Relative Humidity                 299
dtype: int64

In [5]:
len(weather)

4969212
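
The raw counts are easier to judge as a share of all rows. A minimal sketch in plain pandas, using a small made-up frame in place of `weather` (the values are illustrative only, not taken from the data set):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `weather`; values are made up for illustration only.
toy = pd.DataFrame({
    'Outdoor Temperature': [10.0, 12.5, 11.0, 9.5],
    'Wind Speed - Resultant': [3.2, np.nan, 4.1, 2.0],
    'Dew Point': [np.nan, np.nan, np.nan, 5.0],
})

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

For the dask frame, dividing the counts above by `len(weather)` gives the same kind of figure: Dew Point is missing in roughly 85% of rows (4,233,851 of 4,969,212), while every other column is missing in under 2%.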

Dew Point - f(p, T): a derived quantity that can be recomputed from other measurements.

It is safe to drop, but first check whether the data were missing from the start or lost in processing.
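
If Dew Point is needed later, one common way to approximate it is the Magnus formula, which uses air temperature and relative humidity. A minimal sketch, not part of the original notebook: the function name is hypothetical, the coefficients (17.62 and 243.12 °C) are one commonly quoted parameter set, and the input temperature is assumed to be in degrees Celsius, which may not match the units of `Outdoor Temperature` in this data set.

```python
import math

# Magnus coefficients (one commonly quoted set; an assumption, not from the notebook).
MAGNUS_A = 17.62
MAGNUS_B = 243.12  # degrees Celsius


def dew_point_celsius(temp_c, rel_humidity_pct):
    """Approximate dew point (deg C) from temperature (deg C) and RH (%)."""
    if not 0 < rel_humidity_pct <= 100:
        raise ValueError('relative humidity must be in (0, 100]')
    gamma = (math.log(rel_humidity_pct / 100.0)
             + MAGNUS_A * temp_c / (MAGNUS_B + temp_c))
    return MAGNUS_B * gamma / (MAGNUS_A - gamma)


print(round(dew_point_celsius(20.0, 50.0), 1))
```

At 100% relative humidity the formula returns the air temperature itself, which is a quick sanity check on the implementation.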

In [9]:
rhdp2010_path = join(RAW_DATA, 'hourly_RH_DP_2010.csv')
rhdp2010 = dd.read_csv(rhdp2010_path, 
                       dtype={'MDL': 'float64',
                              'Sample Measurement': 'float64',
                              'Qualifier': 'object'},
                       assume_missing=True)

In [10]:
len(rhdp2010)

3357945

In [17]:
rhdp2010['Parameter Name'].value_counts().compute()

Relative Humidity     3011619
Dew Point              346326
Name: Parameter Name, dtype: int64

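Dew Point makes up only about 10% of the rows in the raw 2010 file (346,326 of 3,357,945), so the gaps come from the source data rather than from processing. Drop the column.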

After dropping Dew Point:

In [18]:
weather_path = join(INTERIM_DATA, 'combined-WEATHER.parquet')
weather = dd.read_parquet(weather_path)

In [19]:
weather.isnull().sum().compute()

SiteCode                          0
LocalDate                         0
Wind Direction - Resultant    14326
Wind Speed - Resultant        65437
Outdoor Temperature               0
Barometric pressure               0
Relative Humidity               299
dtype: int64

In [31]:
# Sites with at least one missing Wind Speed value; compute once and reuse.
missing_sites = weather.loc[weather['Wind Speed - Resultant'].isnull(), 'SiteCode'].unique().compute()
len(missing_sites)

32

In [34]:
weather.loc[weather['Wind Speed - Resultant'].isnull(), 'SiteCode'].value_counts().compute()

Idaho_Benewah_11.0                                35023
Wyoming_Sweetwater_200.0                          11427
Massachusetts_Suffolk_42.0                         8514
Wyoming_Fremont_99.0                               7587
District Of Columbia_District of Columbia_43.0     1417
Texas_Harris_1035.0                                 718
New Hampshire_Rockingham_18.0                       284
Wisconsin_Dodge_1.0                                 119
Oregon_Klamath_4.0                                   93
Kentucky_Jefferson_43.0                              76
Maryland_Dorchester_4.0                              41
New York_Monroe_1007.0                               36
                                                  ...  
Pennsylvania_Allegheny_8.0                            3
Maryland_Washington_9.0                               3
Maryland_Cecil_3.0                                    3
Oregon_Union_119.0                                    3
California_Madera_2010.0                        

In [25]:
weather.loc[:, 'SiteCode'].nunique().compute()

100

About a third of the sites (32 of 100) have NaN in the Wind Speed field.

In [26]:
weather.loc[weather['Wind Speed - Resultant'].isnull(), 'LocalDate'].dt.year.nunique().compute()

8

NaNs are present in all 8 years of the data.

In [27]:
weather.loc[weather['Wind Speed - Resultant'].isnull(), 'LocalDate'].dt.year.value_counts().compute()

2010    25105
2011    20699
2013     9005
2012     8764
2017     1537
2016      157
2014      100
2015       70
Name: LocalDate, dtype: int64

Most of the NaNs fall in 2010 and 2011. Let's check whether the values are already missing in the raw file or were lost during processing.
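
To put a number on "mostly", the yearly counts can be turned into shares. A small sketch with the counts copied by hand from the output above:

```python
# Missing Wind Speed rows per year, copied from the value_counts() output.
missing_by_year = {2010: 25105, 2011: 20699, 2013: 9005, 2012: 8764,
                   2017: 1537, 2016: 157, 2014: 100, 2015: 70}

total = sum(missing_by_year.values())
share_2010_2011 = (missing_by_year[2010] + missing_by_year[2011]) / total
print(total, round(share_2010_2011, 3))
```

The total matches the 65,437 missing Wind Speed values reported earlier, and 2010 and 2011 together account for about 70% of them.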

In [28]:
wind2010_path = join(RAW_DATA, 'hourly_WIND_2010.csv')
wind2010 = dd.read_csv(wind2010_path)

In [49]:
cond = (wind2010['State Name'] == 'Idaho') & (wind2010['County Name'] == 'Benewah') & (wind2010['Site Num'] == 11)
wind2010.loc[cond, 'Parameter Name'].value_counts().compute()

Wind Direction - Resultant    8760
Name: Parameter Name, dtype: int64

There were no Wind Speed readings from this station in 2010; only Wind Direction was reported, so the values are already missing in the raw data.
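
The same check can be run for every site at once instead of one station at a time. A minimal sketch in plain pandas on a made-up frame that only mimics the column names of the raw wind file (the rows are illustrative, not real readings):

```python
import pandas as pd

# Toy stand-in for the raw wind file; rows are made up for illustration.
raw = pd.DataFrame({
    'State Name': ['Idaho', 'Idaho', 'Texas', 'Texas', 'Texas'],
    'County Name': ['Benewah', 'Benewah', 'Harris', 'Harris', 'Harris'],
    'Site Num': [11, 11, 1035, 1035, 1035],
    'Parameter Name': ['Wind Direction - Resultant',
                       'Wind Direction - Resultant',
                       'Wind Direction - Resultant',
                       'Wind Speed - Resultant',
                       'Wind Speed - Resultant'],
})

# Count readings per site and parameter; a zero means the parameter
# was never reported by that site in this file.
counts = pd.crosstab(
    index=[raw['State Name'], raw['County Name'], raw['Site Num']],
    columns=raw['Parameter Name'],
)
no_speed = counts['Wind Speed - Resultant'] == 0
sites_without_speed = counts.index[no_speed].tolist()
print(sites_without_speed)
```

Sites listed in `sites_without_speed` never reported Wind Speed in the file, so their NaNs cannot have been introduced by processing.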

Drop the rows with missing values without attempting to estimate them.

In [50]:
weather.isnull().sum().compute()

SiteCode                          0
LocalDate                         0
Wind Direction - Resultant    14326
Wind Speed - Resultant        65437
Outdoor Temperature               0
Barometric pressure               0
Relative Humidity               299
dtype: int64

In [51]:
len(weather.dropna()) / len(weather)

0.9838883911573908

Dropping these rows will result in a loss of about 1.6% of the data.
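
The retained fraction converts to a loss of about 1.6%. A short sketch of the arithmetic, using only numbers printed in the cells above; the upper bound assumes nothing beyond the fact that a dropped row has at least one NaN:

```python
# len(weather.dropna()) / len(weather), copied from the output above.
retained = 0.9838883911573908
lost = 1 - retained
print(f"{lost:.2%}")

# Upper bound on the loss: the per-column NaN counts summed, as if no row
# had more than one missing value.
n_rows = 4969212
nan_counts = {'Wind Direction - Resultant': 14326,
              'Wind Speed - Resultant': 65437,
              'Relative Humidity': 299}
upper_bound = sum(nan_counts.values()) / n_rows
print(f"{upper_bound:.2%}")
```

If the two figures are close, the missing values in the three columns mostly sit in different rows.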