# CSO QaQc: Zero Depth Value Exploration

We know that `Zero` depth values are valuable pieces of data and worth keeping. However, there seem to be some inconsistencies in how these values are filtered, which we will explore in this notebook.

In [1]:
import requests
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.io.img_tiles as cimgt

# Import necessary packages, may need more or less as I go.

In [None]:
CSO_gdf = gpd.read_file('CSOgeodata.geojson')
CSO_gdf['timestamp'] = pd.to_datetime(CSO_gdf.timestamp)

In [None]:
CSO_gdf['flags'] = False
CSO_gdf

In [None]:
ZERO = 0
CSO_gdf.loc[CSO_gdf['depth'] <= ZERO, 'flags'] = True
CSO_noZeros = (CSO_gdf.loc[CSO_gdf['flags'] == False])
CSO_zeros = CSO_gdf.loc[CSO_gdf['flags'] == True]
CSO_zeros

In [None]:
CSO_zeros[['depth','source']].groupby(['source']).agg(['count'])

### SnowPilot Zeros

* These have been discussed as mostly unintentional, likely due to some problem when compiling the data from SnowPilot, which has many more fields to fill in than something like MountainHub. We need to reach out to figure out the procedure and see whether these are intentional.

### MountainHub Zeros

* These are valuable pieces of data. However, there are only 24 of them, which seems rather low considering the number of `Zeros` that people have said they submitted.

In [None]:
CSO_zeros.loc[CSO_zeros['source'] == "MountainHub"]

### Interesting note

Dave mentioned recording `Zero` values from his office on MountainHub, but none of them are present?

In [None]:
CSO_DAVE = CSO_gdf.loc[CSO_gdf['author'] == "David Hill"]
CSO_DAVE

In [None]:
CSO_DAVE_zeros = CSO_DAVE.loc[CSO_DAVE['flags'] == True]
CSO_DAVE_zeros

We can see that there are still no `Zero` depth values for Dave, even though he said he had recorded them. This will take some asking around, as I could be wrong in this assumption.
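One thing worth ruling out before asking around is that the author name is simply stored differently than the exact string used in the filter. The sketch below is only an illustration on made-up rows (the name variants and depths are hypothetical, not taken from the CSO data); it matches on a case-insensitive substring instead of an exact name.

```python
import pandas as pd

# Hypothetical rows for illustration only; not real CSO records
toy = pd.DataFrame({
    "author": ["David Hill", "dave hill", "D. Hill", None, "Jane Doe"],
    "depth": [35, 0, 0, 10, 0],
})

# Case-insensitive substring match; missing author names count as no match
hill_mask = toy["author"].str.contains("hill", case=False, na=False)

# Zero (or negative) depths among the matched authors
hill_zeros = toy.loc[hill_mask & (toy["depth"] <= 0)]
print(hill_zeros)
```

If a looser match like this still returns no zeros on the real data, the missing observations are more likely a filtering or ingestion issue than a naming issue.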

### Histogram for Elevation of CSO dataset

In [None]:
histogram_gdf = CSO_gdf['elevation'].hist(bins = 25)

### Histogram for Elevation of CSO zero values

In [None]:
histogram_flag = CSO_zeros['elevation'].hist(bins = 25)

### Histogram for CSO data depth values

In [None]:
histogram_depth = CSO_gdf['depth'].hist(bins = 25)

Looking at the depth values on the histogram, we can see that most depth observations are under 200 cm. With this in mind, we can possibly remove bad depth observations by flagging values that are unreasonably low.

## Flagging values for depth observations that are unreasonably low

This is a somewhat arbitrary test and may need some fleshing out in terms of numbers.

In [None]:
LOW = 3
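CSO_gdf_LOW = gpd.read_file('CSOgeodata.geojson')
# Start with every row unflagged so 'flags' is a clean boolean column (no NaN)
CSO_gdf_LOW['flags'] = False
# Flag depths at or below LOW, then clear the flag on exact zeros
CSO_gdf_LOW.loc[CSO_gdf_LOW['depth'] <= LOW, 'flags'] = True
CSO_gdf_LOW.loc[CSO_gdf_LOW['depth'] == ZERO, 'flags'] = False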
CSO_LOW = CSO_gdf_LOW.loc[CSO_gdf_LOW['flags'] == True]
CSO_LOW

In [None]:
CSO_LOW[['depth','source']].groupby(['source']).agg(['count'])

In [None]:
CSO_LOW.loc[CSO_LOW['source'] == "MountainHub"]

I think an unreasonably-low check is a bit unnecessary, since many of our own team members are recording low depth values, so the authenticity of low values seems fairly strong.
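For reference, the low-but-nonzero selection can also be written as a single mask and then counted per author, which makes it easy to see who is submitting the low values. This is a minimal sketch on made-up rows (the author labels and depths are hypothetical); it assumes `LOW` is the same threshold as in the cell above.

```python
import pandas as pd

LOW = 3  # same threshold as the cell above

# Hypothetical rows for illustration only; not real CSO records
toy = pd.DataFrame({
    "author": ["A", "A", "B", "C", "C"],
    "depth": [0, 2, 3, 1, 80],
})

# Low but strictly positive depths: zeros are deliberately excluded
low_mask = (toy["depth"] > 0) & (toy["depth"] <= LOW)

# Number of low observations per author
low_by_author = toy.loc[low_mask].groupby("author").size()
print(low_by_author)
```

Unlike the flag-then-unflag approach, this mask also leaves out negative depths, which may or may not be what is wanted.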

## How Impactful Are Zero Values?

Here I'll explore how much impact `Zero` depth values have on things such as descriptive statistics.

In [None]:
# Here is the mean depth value of the CSO data.
CSO_mean = CSO_gdf['depth'].mean()
CSO_mean

In [None]:
# Here is the mean now with zero values excluded from the data.
CSO_meanNoZero = CSO_noZeros['depth'].mean()
CSO_meanNoZero

Interestingly, they don't actually have that much of an effect on the data set when looked at as a whole. This makes sense, as they make up a small share of the data.
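The link between the share of zeros and the shift in the mean can be made explicit. When the excluded depths are exactly zero, they add nothing to the sum, so the overall mean equals `(1 - f)` times the zero-free mean, where `f` is the fraction of zeros. The sketch below checks this on made-up depths; `zero_impact` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

def zero_impact(depths: pd.Series) -> dict:
    """Summarize how much zero-or-negative depths shift the mean."""
    is_zero = depths <= 0
    mean_all = depths.mean()
    mean_nonzero = depths[~is_zero].mean()
    return {
        "zero_fraction": is_zero.mean(),
        "mean_all": mean_all,
        "mean_nonzero": mean_nonzero,
        # Relative drop in the mean caused by including the zeros
        "relative_change": (mean_nonzero - mean_all) / mean_nonzero,
    }

# Made-up depths in cm, for illustration only
toy_depths = pd.Series([0, 50, 100, 150, 200])
summary = zero_impact(toy_depths)
print(summary)
```

In this toy case one value in five is zero, and the mean drops by the same one fifth. If negative depths are present, the relation is no longer exact, because the `<= 0` filter removes values that do contribute to the sum.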

## Domain Specific Zero Depth Exploration

Now we will be looking at a region where `Zero` depth values actually have a noticeable impact on the data. This region will be California as defined by `CSO_CA`.

In [None]:
CSO_CA = gpd.read_file('CSO_CA.geojson')
CSO_CA['timestamp'] = pd.to_datetime(CSO_CA.timestamp)

In [None]:
CSO_CA['flags'] = False
CSO_CA

In [None]:
CSO_CA.loc[CSO_CA['depth'] <= ZERO, 'flags'] = True
CSO_noCA = (CSO_CA.loc[CSO_CA['flags'] == False])
CSO_CAzeros = CSO_CA.loc[CSO_CA['flags'] == True]
CSO_CAzeros

In [None]:
CSO_CAzeros[['depth','source']].groupby(['source']).agg(['count'])

In [None]:
CSO_meanCA = CSO_CA['depth'].mean()
CSO_meanCA

In [None]:
CSO_noZeroMean = CSO_noCA['depth'].mean()
CSO_noZeroMean

In [None]:
histogram_CA = CSO_CA['elevation'].hist(bins = 50)
histogram_CA

In [None]:
histogram_CAnoZeros = CSO_noCA['elevation'].hist(bins = 50)
histogram_CAnoZeros

The zeros are actually very impactful in this region, as the share of `Zero` depth values is higher than in most regions.

This exploration highlights an important aspect that I have not mentioned yet: the effect of `Zero` depth values depends heavily on the amount of data in a given region. If we look at California as defined by `CSO_CA`, we can see that zeros are very impactful, as that region has a rather large number of these values compared to most. They also all seem a bit unintentional, as they come from SnowPilot. Another interesting fact about those SnowPilot `Zero` depth values is that they ALL come from 2017, so I wonder whether there was some sort of miscommunication, or a specific part of 2017, that led to this possible error.
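A quick way to check where and when the zeros cluster is to cross-tabulate them by source and year. The sketch below uses made-up rows (the timestamps, sources, and depths are hypothetical); it assumes the `timestamp` column has already been parsed with `pd.to_datetime`.

```python
import pandas as pd

# Hypothetical rows for illustration only; not real CSO records
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2017-01-05", "2017-03-12", "2018-02-01", "2019-12-24"]
    ),
    "source": ["SnowPilot", "SnowPilot", "MountainHub", "SnowPilot"],
    "depth": [0, 0, 0, 120],
})

# Keep only zero (or negative) depths
zeros = toy.loc[toy["depth"] <= 0]

# Rows are sources, columns are years, cells are counts of zero observations
zeros_by_year = pd.crosstab(
    zeros["source"], zeros["timestamp"].dt.year.rename("year")
)
print(zeros_by_year)
```

Run against `CSO_CAzeros`, a table like this would show at a glance whether the SnowPilot zeros really are confined to a single year.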