### Outlier Analysis

Please work in your own section and address the following:
- Why is the data missing?
- Is there a pattern in the missing data (by ZIP, day, year, etc.)?
- Is it important?
- What are the next steps?
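
One way to approach the pattern question is to compute the share of missing values within each group of a key column. The sketch below is only an illustration: the `toy` frame and its values are made up, and `missing_rate_by` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged data; the values are invented for illustration.
toy = pd.DataFrame({
    "zip5": [2722, 2722, 7008, 7008, 8085, 8085],
    "date_key": pd.to_datetime(["2017-01-01", "2017-02-01", "2017-01-01",
                                "2017-02-01", "2017-01-01", "2017-02-01"]),
    "Haines_Index_surface": [4.0, np.nan, np.nan, np.nan, 3.0, 5.0],
})

def missing_rate_by(df, key, cols):
    """Fraction of missing values in `cols` within each group of `key`."""
    return df[cols].isna().groupby(key).mean()

# Missing rate per ZIP code and per calendar month.
by_zip = missing_rate_by(toy, toy["zip5"], ["Haines_Index_surface"])
by_month = missing_rate_by(
    toy, toy["date_key"].dt.month.rename("month"), ["Haines_Index_surface"]
)
print(by_zip)
print(by_month)
```

The same helper could be pointed at the year or at `Time` by passing a different grouping series.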

## Wendy

- Pressure_potential_vorticity_surface has 385709 (99.1%) missing values
- Soil_temperature_depth_below_surface_layer has 5544 (1.4%) missing values
- Temperature_altitude_above_msl has 4321 (1.1%) missing values
- Temperature_potential_vorticity_surface has 385709 (99.1%) missing values
- Vertical_Speed_Shear_potential_vorticity_surface has 385709 (99.1%) missing values
- Visibility_surface has 69098 (17.7%) missing values
- Volumetric_Soil_Moisture_Content_depth_below_surface_layer has 5544 (1.4%) missing values
- Categorical_Freezing_Rain_surface has 316016 (81.2%) missing values
- Categorical_Rain_surface has 316016 (81.2%) missing values
- Cloud_mixing_ratio_hybrid has 316016 (81.2%) missing values
- Cloud_mixing_ratio_hybrid is highly skewed (γ1 = 27.69373963)
- Cloud_mixing_ratio_hybrid has 72199 (18.5%) zeros
- Rain_mixing_ratio_hybrid has 316016 (81.2%) missing values
- Rain_mixing_ratio_hybrid has 70260 (18.0%) zeros
- Rain_mixing_ratio_isobaric has 316016 (81.2%) missing values
- Total_cloud_cover_isobaric has 316016 (81.2%) missing values
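
The warnings above (missing counts, zero counts, skewness) can be recomputed directly with pandas. This is a rough sketch on made-up values; `profile_columns` is a hypothetical helper. Note that `DataFrame.skew` uses a bias-corrected estimator, so its values may differ slightly from the γ1 figures listed above.

```python
import numpy as np
import pandas as pd

def profile_columns(df):
    """Per-column missing count/share, zero count/share, and sample skewness."""
    numeric = df.select_dtypes(include="number")
    n = len(df)
    summary = pd.DataFrame({
        "n_missing": numeric.isna().sum(),
        "pct_missing": 100 * numeric.isna().sum() / n,
        "n_zeros": (numeric == 0).sum(),
        "pct_zeros": 100 * (numeric == 0).sum() / n,
        "skewness": numeric.skew(),  # NaNs are skipped by default
    })
    return summary.sort_values("pct_missing", ascending=False)

# Invented values, only to show the shape of the report.
toy = pd.DataFrame({
    "Visibility_surface": [24000.0, np.nan, 16000.0, 24000.0, 800.0],
    "Rain_mixing_ratio_hybrid": [0.0, 0.0, np.nan, np.nan, 1e-4],
})
report = profile_columns(toy)
print(report)
```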

## Ngan
- Geopotential_height_potential_vorticity_surface has 385709 (99.1%) missing values
- Haines_Index_surface has 6726 (1.7%) missing values
- u_component_of_wind_altitude_above_msl has 4321 (1.1%) missing values
- u_component_of_wind_potential_vorticity_surface has 385709 (99.1%) missing values
- v_component_of_wind_altitude_above_msl has 4321 (1.1%) missing values
- v_component_of_wind_potential_vorticity_surface has 385709 (99.1%) missing values
- Categorical_Ice_Pellets_surface has 316016 (81.2%) missing values
- Ice_water_mixing_ratio_hybrid has 316016 (81.2%) missing values
- Ice_water_mixing_ratio_hybrid is highly skewed (γ1 = 21.89943017)
- Ice_water_mixing_ratio_hybrid has 72756 (18.7%) zeros
- Ice_water_mixing_ratio_isobaric has 316016 (81.2%) missing values
- Precipitation_rate_surface has 316016 (81.2%) missing values
- Precipitation_rate_surface has 61001 (15.7%) zeros
- Vertical_velocity_geometric_isobaric has 316016 (81.2%) missing values
- Ice_growth_rate_altitude_above_msl has 369711 (95.0%) missing values
- Ice_growth_rate_altitude_above_msl has 19572 (5.0%) zeros

We will drop variables with more than 80% missing data. Nine variables in this section meet that criterion:
- Geopotential_height_potential_vorticity_surface has 385709 (99.1%) missing values
- u_component_of_wind_potential_vorticity_surface has 385709 (99.1%) missing values
- v_component_of_wind_potential_vorticity_surface has 385709 (99.1%) missing values
- Categorical_Ice_Pellets_surface has 316016 (81.2%) missing values
- Ice_water_mixing_ratio_hybrid has 316016 (81.2%) missing values
- Ice_water_mixing_ratio_isobaric has 316016 (81.2%) missing values
- Precipitation_rate_surface has 316016 (81.2%) missing values
- Vertical_velocity_geometric_isobaric has 316016 (81.2%) missing values
- Ice_growth_rate_altitude_above_msl has 369711 (95.0%) missing values
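
The 80% rule can also be applied programmatically rather than by listing names by hand. The sketch below assumes a strict "greater than" threshold; `high_missing_columns` and the `toy` frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def high_missing_columns(df, threshold=0.80):
    """Names of columns whose missing fraction is strictly above `threshold`."""
    missing_share = df.isna().mean()
    return missing_share[missing_share > threshold].index.tolist()

# Invented data: one column 90% missing, one 10% missing, one complete.
toy = pd.DataFrame({
    "Ice_growth_rate_altitude_above_msl": [np.nan] * 9 + [0.0],
    "Haines_Index_surface": [4.0] * 9 + [np.nan],
    "zip5": [2722] * 10,
})
to_drop = high_missing_columns(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop)
print(reduced.shape)
```

Applied to the full table, a rule like this would also pick up high-missing variables listed in the other sections (for example `Pressure_potential_vorticity_surface` at 99.1%), so the result would be longer than the nine names above.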

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [27]:
# Import the merged data:
data = pd.read_hdf('data.h5')

In [4]:
data.head()

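| | date_key | zip5 | impact_score | grid_lat | grid_lon | Date | Time | ForecastRange | x | y | ... | Ice_water_mixing_ratio_hybrid | Ice_water_mixing_ratio_isobaric | Precipitation_rate_surface | Rain_mixing_ratio_hybrid | Rain_mixing_ratio_isobaric | Snow_mixing_ratio_hybrid | Snow_mixing_ratio_isobaric | Total_cloud_cover_isobaric | Vertical_velocity_geometric_isobaric | Ice_growth_rate_altitude_above_msl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-01-01 | 2722 | 20.268081 | 41.5 | -71.0 | 2017-01-01 | 0 | 0 | 578 | 97 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 211 | 2017-01-01 | 66219 | 19.260728 | 39.0 | -95.0 | 2017-01-01 | 18 | 0 | 530 | 102 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 210 | 2017-01-01 | 66219 | 19.260728 | 39.0 | -95.0 | 2017-01-01 | 12 | 0 | 530 | 102 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 209 | 2017-01-01 | 66219 | 19.260728 | 39.0 | -95.0 | 2017-01-01 | 6 | 0 | 530 | 102 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 208 | 2017-01-01 | 66219 | 19.260728 | 39.0 | -95.0 | 2017-01-01 | 0 | 0 | 530 | 102 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |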


In [10]:
# Drop the 9 variables with more than 80% missing data
highMissingList = ['Geopotential_height_potential_vorticity_surface',
                   'u_component_of_wind_potential_vorticity_surface',
                   'v_component_of_wind_potential_vorticity_surface',
                   'Categorical_Ice_Pellets_surface',
                   'Ice_water_mixing_ratio_hybrid',
                   'Ice_water_mixing_ratio_isobaric',
                   'Precipitation_rate_surface',
                   'Vertical_velocity_geometric_isobaric',
                   'Ice_growth_rate_altitude_above_msl']
print('Data size before dropping high-missing variables:', data.shape)
data = data.drop(columns=highMissingList)
print('Data size after dropping high-missing variables:', data.shape)

Data size before dropping high-missing variables: (389331, 123)
Data size after dropping high-missing variables: (389331, 114)


In [31]:
# Features of interest (those in this section that still have missing values)
# nanList = [line.rstrip('\n') for line in open('nan_features_Ngan.txt')]
nanList = ['Haines_Index_surface',
           'u_component_of_wind_altitude_above_msl',
           'v_component_of_wind_altitude_above_msl']

# Append the key columns and impact_score to nanList
nanList = nanList + ['date_key', 'Time', 'zip5', 'impact_score']

In [32]:
nanData = data[nanList]
nanData.head()

| | Haines_Index_surface | u_component_of_wind_altitude_above_msl | v_component_of_wind_altitude_above_msl | date_key | Time | zip5 | impact_score |
|---|---|---|---|---|---|---|---|
| 0 | 4.0 | 25.209999 | 14.56 | 2017-01-01 | 0 | 2722 | 20.268081 |
| 211 | 3.0 | -0.2 | 6.64 | 2017-01-01 | 18 | 66219 | 19.260728 |
| 210 | 4.0 | 4.66 | 0.66 | 2017-01-01 | 12 | 66219 | 19.260728 |
| 209 | 3.0 | 4.94 | -1.48 | 2017-01-01 | 6 | 66219 | 19.260728 |
| 208 | 3.0 | 6.31 | -7.36 | 2017-01-01 | 0 | 66219 | 19.260728 |


In [33]:
nanData.to_csv('nanData.csv')

In [35]:
# Fraction of missing values in each column, per ZIP code
missingPerZip = nanData.set_index("zip5").isnull().groupby(level=0).mean().reset_index()

In [37]:
missingPerZip.head(20)

| | zip5 | Haines_Index_surface | u_component_of_wind_altitude_above_msl | v_component_of_wind_altitude_above_msl | date_key | Time | impact_score |
|---|---|---|---|---|---|---|---|
| 0 | 2722 | 0.187225 | 0.0 | 0.0 | 0.0 | 0.0 | 0.337653 |
| 1 | 3063 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.286818 |
| 2 | 7008 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.337653 |
| 3 | 8085 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.363568 |
| 4 | 8512 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.363568 |
| 5 | 8518 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.337653 |
| 6 | 8691 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.337653 |
| 7 | 8817 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.363568 |
| 8 | 17013 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.337653 |
| 9 | 17015 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.337653 |


In [45]:
# ZIP codes with the highest share of missing Haines_Index_surface first
missingPerZip.sort_values('Haines_Index_surface', ascending=False)

| | zip5 | Haines_Index_surface | u_component_of_wind_altitude_above_msl | v_component_of_wind_altitude_above_msl | date_key | Time | impact_score |
|---|---|---|---|---|---|---|---|
| 2 | 7008 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.337653 |
| 0 | 2722 | 0.187225 | 0.0 | 0.0 | 0.0 | 0.0 | 0.337653 |
| 90 | 98421 | 0.184851 | 0.0 | 0.0 | 0.0 | 0.0 | 0.337040 |
| 87 | 98032 | 0.184679 | 0.0 | 0.0 | 0.0 | 0.0 | 0.337653 |
| 3 | 8085 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.363568 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 30 | 33182 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.358286 |
| 29 | 32221 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.358565 |
| 28 | 32218 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.358565 |
| 27 | 30517 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.337040 |
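
In the sorted output, ZIP 7008 is missing `Haines_Index_surface` in every row, while 2722, 98421, and 98032 are missing it in roughly 18% of rows. A possible next step is to separate ZIPs that never report the feature from those that report it only part of the time, and then check whether the partial gaps cluster in particular months. The sketch below uses invented rows and is only one way to set this up.

```python
import numpy as np
import pandas as pd

# Made-up rows: one ZIP never reports the feature, one misses it only in January.
toy = pd.DataFrame({
    "zip5": [7008] * 4 + [2722] * 4,
    "date_key": pd.to_datetime(
        ["2017-01-01", "2017-01-02", "2017-02-01", "2017-02-02"] * 2
    ),
    "Haines_Index_surface": [np.nan] * 4 + [np.nan, np.nan, 4.0, 3.0],
})

is_missing = toy["Haines_Index_surface"].isna()
rate_per_zip = is_missing.groupby(toy["zip5"]).mean()

# ZIPs with no values at all versus ZIPs with some gaps.
always_missing = rate_per_zip[rate_per_zip == 1.0].index.tolist()
partly_missing = rate_per_zip[(rate_per_zip > 0) & (rate_per_zip < 1)].index.tolist()

# For partly missing ZIPs, tabulate the missing rate per ZIP and calendar month.
partial = toy[toy["zip5"].isin(partly_missing)]
by_zip_month = (
    partial["Haines_Index_surface"].isna()
    .groupby([partial["zip5"], partial["date_key"].dt.month.rename("month")])
    .mean()
    .unstack("month")
)
print(always_missing, partly_missing)
print(by_zip_month)
```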


## Chitwan
- impact_score has 132409 (34.0%) missing values
- Land_sea_coverage_nearest_neighbor_land1sea0_surface has 122793 (31.5%) missing values
- Snow_depth_surface has 5544 (1.4%) missing values
- Snow_depth_surface has 356611 (91.6%) zeros
- Water_equivalent_of_accumulated_snow_depth_surface has 356781 (91.6%) zeros
- Categorical_Snow_surface has 316016 (81.2%) missing values
- Composite_reflectivity_entire_atmosphere has 316016 (81.2%) missing values
- Graupel_snow_pellets_hybrid has 316016 (81.2%) missing values
- Graupel_snow_pellets_hybrid is highly skewed (γ1 = 73.26876449)
- Graupel_snow_pellets_hybrid has 73043 (18.8%) zeros
- Graupel_snow_pellets_isobaric has 316016 (81.2%) missing values
- Snow_mixing_ratio_hybrid has 316016 (81.2%) missing values
- Snow_mixing_ratio_hybrid is highly skewed (γ1 = 31.41715949)
- Snow_mixing_ratio_hybrid has 72934 (18.7%) zeros
- Snow_mixing_ratio_isobaric has 316016 (81.2%) missing values
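
Several variables report exactly 316016 missing values, which suggests (but does not prove) that they are missing on the same rows, for example because they come from one source that covers only part of the period. A minimal check, with a hypothetical helper `share_missing_rows` and invented values:

```python
import numpy as np
import pandas as pd

def share_missing_rows(df, cols):
    """True if every column in `cols` is missing on exactly the same rows."""
    mask = df[cols].isna()
    # Compare each column's missing mask with the first column's mask, row by row.
    return bool(mask.eq(mask.iloc[:, 0], axis=0).all().all())

# Invented values: the first two columns are missing together, the third is not.
toy = pd.DataFrame({
    "Categorical_Snow_surface": [np.nan, np.nan, 0.0, 1.0],
    "Snow_mixing_ratio_hybrid": [np.nan, np.nan, 0.0, 2e-5],
    "Snow_depth_surface": [0.0, np.nan, 0.0, 0.1],
})
same_block = share_missing_rows(
    toy, ["Categorical_Snow_surface", "Snow_mixing_ratio_hybrid"]
)
different = share_missing_rows(
    toy, ["Categorical_Snow_surface", "Snow_depth_surface"]
)
print(same_block, different)
```

If the check holds on the real data, the 81.2% group could be handled as a single block (kept or dropped together) rather than variable by variable.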