# Individual glacier data inspection

This notebook will walk through steps to read in and organize velocity data and clip it to the extent of a single glacier. The tools we will use include **xarray**, **rioxarray**, **geopandas**, and **flox**. 

To clip ITS_LIVE data to the extent of a single glacier we will use a vector dataset of glacier outlines, the [Randolph Glacier Inventory](https://nsidc.org/data/nsidc-0770). These outlines aren't currently cloud-hosted, so you will need to download the data to your local machine. 

*Learning goals*

- subsetting a large raster to a spatial area of interest
- exploring data with **dask** and **xarray**
- inspecting a dataset using
    - xarray label- and index-based selections
    - grouped computations and reductions
    - visualization

First, let's import the Python libraries that we'll need for this notebook:

In [None]:
import geopandas as gpd
import os
import numpy as np
import xarray as xr
import rioxarray as rxr
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from shapely.geometry import Polygon
from shapely.geometry import Point
import cartopy.crs as ccrs
from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER
import cartopy
import cartopy.feature as cfeature
import json
import urllib.request
import pandas as pd
import flox
%config InlineBackend.figure_format='retina'


## Reading in ITS_LIVE data

In this notebook and others in this tutorial, we will use some of the functions we defined in the data access notebook. They are all available in the `itslivetools` package.

In [None]:
import itslivetools

First, let's read in the catalog again:  

In [None]:
with urllib.request.urlopen('https://its-live-data.s3.amazonaws.com/datacubes/catalog_v02.json') as url_catalog:
    itslive_catalog = json.loads(url_catalog.read().decode())
itslive_catalog.keys()

The `read_in_s3()` function reads a Zarr data cube from a URL into an xarray dataset. We will call it below once we have the URL.

I started with `chunk_size='auto'`, which chooses chunk sizes that match the underlying data structure (this is generally ideal). There is more about choosing good chunk sizes [here](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes). If you want to use a different chunk size, specify it when you call the `read_in_s3()` function.

In [None]:
url = itslivetools.find_granule_by_point(itslive_catalog, [84.56, 28.54])
url

In [None]:
dc = itslivetools.read_in_s3(url[0])
dc

We are reading this in as a dask array. Let's take a look at the chunk sizes:

```{note} 
For an `xr.Dataset`, both `chunksizes` and `chunks` return a mapping from each dimension name to the lengths of all chunks along that dimension, so irregular chunks are visible in either one.
```

In [None]:
dc.chunksizes

In [None]:
dc.chunks

```{note} 
Setting the dask chunk size to `auto` at the `xr.open_dataset()` step will use chunk sizes that most closely resemble the structure of the underlying data. To avoid imposing a chunk size that isn't a good fit for the data, hold off on re-chunking until we have selected a subset covering our area of interest from the larger dataset.
```
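
As a rough aid when judging a chunk size, the memory footprint of one chunk can be estimated from its shape and data type. The sketch below uses a made-up chunk shape and `float32` values; neither is read from the ITS_LIVE cube, so substitute the values reported by `dc.chunks` for a real estimate.

```python
import numpy as np

# Hypothetical chunk shape (mid_date, y, x) and dtype, chosen only for illustration
chunk_shape = (20000, 10, 10)
dtype = np.dtype("float32")

# Bytes in one chunk = number of elements * bytes per element
chunk_bytes = int(np.prod(chunk_shape)) * dtype.itemsize
chunk_mib = chunk_bytes / 2**20
print(f"{chunk_mib:.1f} MiB per chunk")
```

Comparing this number against the target sizes discussed in the dask post linked above is one way to decide whether re-chunking is worthwhile.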

Check the CRS of the xarray object: 

In [None]:
dc.mapping
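
The projection information lives in the attributes of the `mapping` variable. As a small sketch of how the EPSG code can be pulled out and turned into a CRS string, here is a synthetic stand-in dataset; the `spatial_epsg` attribute name is the one used later in this notebook, and the value is simply set by hand to match this cube.

```python
import xarray as xr

# Synthetic stand-in for the data cube: a scalar `mapping` variable whose
# attributes carry the projection information (value set by hand here)
ds = xr.Dataset({"mapping": xr.DataArray(0, attrs={"spatial_epsg": 32645})})

# Build a CRS string that geopandas / rioxarray style tools accept
epsg_code = ds["mapping"].attrs["spatial_epsg"]
crs_string = f"EPSG:{epsg_code}"
print(crs_string)
```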

Let's take a look at the time dimension (`mid_date` here). To start with we'll just print the first 10 values:

In [None]:
for element in range(10):
    
    print(dc.mid_date[element].data)

Weird, it doesn't look like the time dimension is in chronological order. Let's fix that: 

In [None]:
dc_timesorted = dc.sortby(dc['mid_date'])
dc_timesorted

When we read in the Zarr data cube as an `xr.Dataset` we set the chunk sizes to `auto`. Sorting along the `mid_date` dimension then becomes a problem, and dask raises a warning about creating large chunks. 

At first it makes sense to follow the instructions in the warning message to avoid creating large chunks, but this creates some issues. If you want, turn the cell below into a `code` cell and run it; you will see that it re-chunks the time dimension, which isn't something that we want. 

import dask
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    dc_timesorted_false = dc.sortby(dc['mid_date'])
    dc_timesorted_false

In [None]:
dc_timesorted

In [None]:
for element in range(10):
    
    print(dc_timesorted.mid_date[element].data)
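
Instead of reading through printed values, the ordering can also be checked programmatically. This is a minimal sketch on a tiny synthetic dataset (the dates and values are made up), using the pandas index that xarray keeps for the `mid_date` coordinate.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic, deliberately shuffled time coordinate (not ITS_LIVE data)
dates = pd.to_datetime(["2015-03-01", "2013-06-15", "2014-01-10", "2013-01-05"])
ds = xr.Dataset(
    {"v": ("mid_date", np.array([4.0, 2.0, 3.0, 1.0]))},
    coords={"mid_date": dates},
)

# The pandas index reports whether the coordinate is already in order
print(ds.indexes["mid_date"].is_monotonic_increasing)

# sortby reorders every variable along mid_date
ds_sorted = ds.sortby("mid_date")
print(ds_sorted.indexes["mid_date"].is_monotonic_increasing)
print(ds_sorted["v"].values)
```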

## Read in vector data 

We are going to read in RGI region **15 (SouthAsiaEast)**. RGI data is downloaded in lat/lon coordinates. We will project it to match the CRS of the ITS_LIVE dataset and then select an individual glacier to begin our analysis.

In [None]:
se_asia = gpd.read_file('https://github.com/scottyhq/rgi/raw/main/15_rgi60_SouthAsiaEast.gpkg')
se_asia.head(3)

In [None]:
#project rgi data to match itslive
#we know the epsg from the 'spatial_epsg' attr of the mapping var of the dc object
se_asia_prj = se_asia.to_crs('EPSG:32645') 
se_asia_prj.head(3)

## Crop RGI to ITS_LIVE extent
- is there a way to call `get_bbox_single()` without the plot output?

In [None]:
#first, get vector bbox of itslive

bbox_dc = itslivetools.get_bbox_single(dc)
bbox_dc['geometry']

In [None]:
#project from latlon to local utm 
bbox_dc = bbox_dc.to_crs('EPSG:32645')
bbox_dc

In [None]:
#subset rgi to bounds 
se_asia_subset = gpd.clip(se_asia_prj, bbox_dc)
se_asia_subset
se_asia_subset.explore()

In [None]:
sample_glacier_vec = se_asia_subset.loc[se_asia_subset['RGIId'] == 'RGI60-15.04714']
sample_glacier_vec

### Clip ITS_LIVE dataset to individual glacier extent

First, we need to use `rio.write_crs()` to assign a CRS to the ITS_LIVE object. If we don't do that first, the `rio.clip()` command will produce an error.

*Note*: it looks like you can only run `write_crs()` once, because it switches `mapping` from being a `data_var` to a `coord`, so if you run it again it will produce a key error looking for a variable that doesn't exist.

In [None]:
dc_timesorted = dc_timesorted.rio.write_crs(f"epsg:{dc_timesorted.mapping.attrs['spatial_epsg']}", inplace=True)

In [None]:
%%time

sample_glacier_raster = dc_timesorted.rio.clip(sample_glacier_vec.geometry, sample_glacier_vec.crs)

Take a look at the clipped object:

In [None]:
sample_glacier_raster
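
To build intuition for what the clipped object looks like, here is a simplified stand-in that uses plain xarray rather than rioxarray. It is not how `rio.clip()` is implemented: the "outline" is a made-up circle on a synthetic grid, and the mask is applied with `where(..., drop=True)`. The result shows the same two effects you should see above: pixels outside the outline become NaN, and the grid shrinks to the bounding box of the outline.

```python
import numpy as np
import xarray as xr

# Synthetic 6 x 6 raster on a regular grid (not real velocity data)
x = np.arange(6.0)
y = np.arange(6.0)
raster = xr.DataArray(
    np.ones((6, 6)), coords={"y": y, "x": x}, dims=("y", "x"), name="v"
)

# Hypothetical "outline": pixel centers within 1.2 units of the point (2, 2)
inside = ((raster.x - 2.0) ** 2 + (raster.y - 2.0) ** 2) <= 1.2**2

# Pixels outside become NaN; drop=True trims rows/columns that are entirely outside
clipped = raster.where(inside, drop=True)
print(clipped.shape)
print(int(clipped.notnull().sum()))
```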

Let's take a look at the clipped raster alongside the vector outline. To start with and for the sake of easy visualizing we will take the mean of the magnitude of velocity variable along the `mid_date` dimension:

In [None]:
fig, ax = plt.subplots(figsize = (15,9))
sample_glacier_vec.plot(ax=ax, facecolor='none', edgecolor='red');
sample_glacier_raster.v.mean(dim=['mid_date']).plot(ax=ax);
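
A note on what the temporal mean does with gaps: for floating-point data, xarray's `mean` skips NaN values by default, so a pixel only comes out as NaN if it has no data at any time step. A minimal sketch on made-up values:

```python
import numpy as np
import xarray as xr

# Synthetic cube: 3 time steps, 1 row, 2 columns (NaN marks missing data)
v = xr.DataArray(
    np.array([[[10.0, np.nan]], [[20.0, np.nan]], [[np.nan, np.nan]]]),
    dims=("mid_date", "y", "x"),
)

# NaNs are skipped, so the first pixel averages its two valid values
mean_v = v.mean(dim="mid_date")
print(mean_v.values)
```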



Now let's take a look at the x and y components of velocity, again averaging over time:

In [None]:
fig, axs = plt.subplots(ncols =2, figsize=(17,7))

sample_glacier_raster.vx.mean(dim='mid_date').plot(ax=axs[0]);
sample_glacier_raster.vy.mean(dim='mid_date').plot(ax=axs[1]);
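
When comparing these component plots with the speed plot above, keep in mind that averaging speeds is not the same as taking the speed of the averaged components. The sketch below assumes `v` is the magnitude of the (`vx`, `vy`) vector and uses two made-up time steps with opposite flow directions to make the difference obvious.

```python
import numpy as np
import xarray as xr

# Two synthetic time steps at one pixel with opposite flow directions
vx = xr.DataArray([3.0, -3.0], dims="mid_date")
vy = xr.DataArray([4.0, -4.0], dims="mid_date")

# Speed at each time step, assuming v is the magnitude of (vx, vy)
v = np.sqrt(vx**2 + vy**2)

# Mean of the speeds vs. speed of the mean components
mean_speed = float(v.mean(dim="mid_date"))
speed_of_mean = float(
    np.sqrt(vx.mean(dim="mid_date") ** 2 + vy.mean(dim="mid_date") ** 2)
)
print(mean_speed, speed_of_mean)
```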


In [None]:
sample_glacier_raster.v_error.mean(dim=['mid_date']).plot();

## Exploring ITS_LIVE data

ITS_LIVE data cubes come with many (53!) variables that carry information about the estimated surface velocities and the satellite images that were used to generate the surface velocity estimates. We won't examine all of this information here but let's look at a little bit.

To start with, let's look at the satellite imagery used to generate the velocity data.

We see that we have two `data_vars` that indicate which sensor each image in the image pair at a given time step comes from. We will "load" these values into memory since we will use them later.

In [None]:
sample_glacier_raster.satellite_img1.load()

In [None]:
sample_glacier_raster.satellite_img2.load()

The `satellite_img1` and `satellite_img2` variables are 1-dimensional numpy arrays whose length matches the `mid_date` dimension of the data cube. Each element of the array is a string corresponding to a satellite:

- `1A` = Sentinel 1A
- `1B` = Sentinel 1B
- `2A` = Sentinel 2A
- `2B` = Sentinel 2B
- `8.` = Landsat 8
- `9.` = Landsat 9

Let's re-arrange these string arrays into a format that is easier to work with.

First, we'll make a set of all the different string values in the satellite image variables:
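
One way to build such a set is sketched here on made-up labels rather than the real cube; with the real data you would pass `sample_glacier_raster.satellite_img1` and `sample_glacier_raster.satellite_img2` instead of the synthetic arrays.

```python
import numpy as np
import xarray as xr

# Synthetic sensor labels for six time steps (codes follow the key above)
satellite_img1 = xr.DataArray(
    np.array(["8.", "1A", "2A", "8.", "9.", "1A"]), dims="mid_date"
)
satellite_img2 = xr.DataArray(
    np.array(["8.", "1B", "2B", "8.", "9.", "1A"]), dims="mid_date"
)

# Union of the labels found in either variable
sensors = set(satellite_img1.values.tolist()) | set(satellite_img2.values.tolist())
print(sorted(sensors))
```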

## Examining velocity data from each satellite in `ITS_LIVE` dataset

What if we only wanted to look at the velocity estimates from Landsat 8?

In [None]:
l8_data = sample_glacier_raster.where(sample_glacier_raster['satellite_img1'] == '8.', drop=True)
l8_data

`dataset.where()` at first seems appropriate for this kind of operation, but there's actually an easier way. Because we are selecting along a single dimension (`mid_date`), we can use xarray's `.sel()` method instead. This is more efficient and integrates with `dask` arrays more smoothly.

In [None]:
l8_condition = sample_glacier_raster.satellite_img1 == '8.'
l8_subset = sample_glacier_raster.sel(mid_date=l8_condition)
l8_subset
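
To see that the two approaches pick out the same time steps, here is a small comparison on a synthetic cube (four made-up time steps on a 2 x 2 grid). The variable names mirror the ones in this notebook, but the data are invented.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic cube: 4 time steps on a 2 x 2 grid (not ITS_LIVE data)
ds = xr.Dataset(
    {
        "v": (("mid_date", "y", "x"), np.arange(16.0).reshape(4, 2, 2)),
        "satellite_img1": ("mid_date", np.array(["8.", "1A", "8.", "2A"])),
    },
    coords={"mid_date": pd.date_range("2020-01-01", periods=4, freq="D")},
)

condition = ds["satellite_img1"] == "8."

# Boolean indexer along mid_date: keeps whole time steps
subset_sel = ds.sel(mid_date=condition)

# where(..., drop=True): masks first, then drops time steps that are all False
subset_where = ds["v"].where(condition, drop=True)

print(subset_sel.sizes["mid_date"], subset_where.sizes["mid_date"])
```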

We can see that we are looking at roughly a third of the original time steps. Let's take a look at the average speeds of the Landsat8-derived velocities:

In [None]:
l8_subset.v.mean(dim='mid_date').plot();

What about Landsat9?

In [None]:
l9_condition = sample_glacier_raster.satellite_img1 == '9.'

l9_subset = sample_glacier_raster.sel(mid_date=l9_condition)
l9_subset

Only 45 time steps have data from Landsat 9. This makes sense because Landsat 9 was launched recently (September 2021).

In [None]:
l9_subset.v.mean(dim='mid_date').plot();

Let's look at Sentinel 1 data. Note that here we are selecting for two values instead of one, using [DataArray.isin](https://xarray.pydata.org/en/stable/generated/xarray.DataArray.isin.html).

In [None]:
s1_condition = sample_glacier_raster.satellite_img1.isin(['1A','1B'])
s1_subset = sample_glacier_raster.sel(mid_date = s1_condition)
s1_subset
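
For reference, `isin` is shorthand for joining several equality tests with "or". A minimal sketch on made-up labels shows that both forms produce the same boolean mask:

```python
import numpy as np
import xarray as xr

# Synthetic sensor labels for five time steps
satellite_img1 = xr.DataArray(
    np.array(["1A", "8.", "1B", "2A", "1A"]), dims="mid_date"
)

# isin tests membership in a list of values
s1_isin = satellite_img1.isin(["1A", "1B"])

# Equivalent condition written as two comparisons joined with "or"
s1_or = (satellite_img1 == "1A") | (satellite_img1 == "1B")

print(s1_isin.values)
```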

In [None]:
s1_subset.v.mean(dim='mid_date').plot();

In [None]:
s2_condition = sample_glacier_raster.satellite_img1.isin(['2A','2B'])
s2_subset = sample_glacier_raster.sel(mid_date=s2_condition)
s2_subset

In [None]:
s2_subset.v.mean(dim='mid_date').plot();

ITS_LIVE is exciting because it combines velocity data from a number of satellites into one accessible and efficient dataset. From this brief look, you can see snapshot overviews of the different data within the dataset and begin to think about processing steps you might take to work with the data further.

## Checking coverage along a dimension
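It would be nice to be able to scan and visualize the coverage of a variable along a dimension.

First, we need to make a mask that tells us all of the possible 'valid' pixels, i.e. pixels over ice rather than rock. Here a pixel counts as valid if it holds data for at least one time step, and the coverage at each time step is the fraction of those pixels that have data.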

In [None]:
valid_pixels = sample_glacier_raster.v.count(dim=['x','y'])
valid_pix_max = sample_glacier_raster.v.notnull().any('mid_date').sum(['x','y'])

sample_glacier_raster['cov'] = valid_pixels/valid_pix_max
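
To make the coverage calculation concrete, here is the same computation on a tiny synthetic cube (three time steps on a 2 x 2 grid, values made up). One pixel is never observed, so it is left out of the denominator.

```python
import numpy as np
import xarray as xr

# Synthetic cube: 3 time steps on a 2 x 2 grid; NaN marks missing data.
# The pixel at (y=1, x=1) is never observed.
v = xr.DataArray(
    np.array(
        [
            [[1.0, 2.0], [3.0, np.nan]],
            [[1.0, np.nan], [np.nan, np.nan]],
            [[np.nan, np.nan], [np.nan, np.nan]],
        ]
    ),
    dims=("mid_date", "y", "x"),
)

# Number of pixels with data at each time step
valid_pixels = v.count(dim=["x", "y"])
# Number of pixels that have data at least once in the time series
valid_pix_max = v.notnull().any("mid_date").sum(["x", "y"])

cov = valid_pixels / valid_pix_max
print(cov.values)
```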



#### Looking into the `mid_date` dimension...

- how many duplicate time steps are there? 

My troubleshooting steps... keeping in for now but can delete when ready

In [None]:
#how many time steps are duplicates?, there are 16872 unique vals in mid_dates
np.unique(sample_glacier_raster['mid_date'].data).shape

Start by grouping over `mid_date`. We would expect 16,872 groups (the number of unique time steps), most of them of size 1, with larger groups wherever a coordinate value is duplicated.

In [None]:
test_gb = sample_glacier_raster.groupby(sample_glacier_raster.mid_date)
type(test_gb.groups)

`test_gb.groups` is a [dict](https://xarray.pydata.org/en/stable/generated/xarray.core.groupby.DatasetGroupBy.groups.html), so let's explore that object. The keys correspond to `mid_date` coords, so the values should be the entries at that coordinate. We want to find the dict entries with more than one value...

In [None]:

val_ls = [len(val) for val in test_gb.groups.values()] #this is hopefully a list of the number of vals in each dict key

val_df = pd.DataFrame({'num_vals': val_ls})
val_df.head(3)

Subset the values dataframe to only keep rows with more than one value per key

In [None]:
val_df_sub = val_df.loc[val_df['num_vals'] > 1]
val_df_sub.head(3)

Interesting, the dataframe has 2602 rows (the number of time steps with more than one entry). How many have more than 2? 

In [None]:
val_df_sub.plot.hist()
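
As an alternative to walking through the groupby dictionary, the same counts can be sketched directly with pandas. The dates below are made up; with the real cube you would pass `sample_glacier_raster['mid_date'].data` to `pd.Series` instead.

```python
import pandas as pd

# Synthetic time coordinate with repeated values (not ITS_LIVE data)
mid_dates = pd.to_datetime(
    ["2013-09-30", "2013-09-30", "2013-10-16", "2013-11-01", "2013-11-01", "2013-11-01"]
)

# Number of entries for each unique time step
counts = pd.Series(mid_dates).value_counts()

# Keep only the time steps that occur more than once
duplicated = counts[counts > 1]
print(len(counts), len(duplicated), int(duplicated.max()))
```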

Let's look at one of the time steps with multiple entries. I found this one by running a for-loop function over the first year of the dataset (defined further below).

In [None]:
duplicate_middate = sample_glacier_raster.sel(mid_date = '2013-09-30T04:56:01.528083968').compute()
duplicate_middate



In [None]:
print('mid date: ', duplicate_middate.mid_date.data)
print('image 1 date for entries 1 and 2: ', duplicate_middate.acquisition_date_img1.data)
print('image 2 date for entries 1 and 2: ', duplicate_middate.acquisition_date_img2.data)

time_diff1 = duplicate_middate.acquisition_date_img1.data[0] - duplicate_middate.acquisition_date_img2.data[0]

diff_days1 = time_diff1.astype('timedelta64[D]')
print(diff_days1/np.timedelta64(1,'D'), ' days')

time_diff2 = duplicate_middate.acquisition_date_img1.data[1] - duplicate_middate.acquisition_date_img2.data[1]
diff_days2 = time_diff2.astype('timedelta64[D]')
print(diff_days2/np.timedelta64(1,'D'), ' days')

In this case, it looks like both entries are Landsat 8 velocity data, one generated from an image pair 16 days apart and one from an image pair 48 days apart.
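
This is how two different image pairs can end up with the same `mid_date`. The sketch below uses made-up acquisition dates and assumes `mid_date` is the midpoint of the two acquisition times: a 16-day pair and a 48-day pair centered on the same day share a mid date.

```python
import numpy as np

# Made-up acquisition dates for two image pairs
img1 = np.array(["2013-09-22T04:56", "2013-09-06T04:56"], dtype="datetime64[ns]")
img2 = np.array(["2013-10-08T04:56", "2013-10-24T04:56"], dtype="datetime64[ns]")

# Dividing a timedelta64 by a one-day timedelta gives a float number of days
separation_days = (img2 - img1) / np.timedelta64(1, "D")

# Midpoint of each pair (assumed definition of mid_date)
mid = img1 + (img2 - img1) / 2

print(separation_days)
print(mid)
```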

In [None]:
def find_dim_duplicates(input_xr):
    """Print each mid_date value that equals the value that follows it."""
    mid_dates = input_xr.mid_date.data
    # stop one short of the end so that element + 1 is always a valid index
    for element in range(len(mid_dates) - 1):
        if mid_dates[element] == mid_dates[element + 1]:
            print(mid_dates[element])

In [None]:
find_dim_duplicates(sample_glacier_raster.sel(mid_date = slice('2013-01-01','2014-01-01')))

## Exploring data coverage over time series

Let's take a look at the data coverage over this glacier across the time series

In [None]:
fig, ax = plt.subplots(figsize=(30,3))
sample_glacier_raster.cov.plot(ax=ax, linestyle='None',marker = 'x')

But what if we wanted to explore the relative coverage of the different sensors that make up the ITS_LIVE dataset as a whole?
We can use `groupby` to group the data based on a single condition such as `satellite_img1` or `mid_date`.

In [None]:
sample_glacier_raster.cov.groupby(sample_glacier_raster.satellite_img1)

In [None]:
sample_glacier_raster.groupby('mid_date')
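
The cells above only create the groupby objects; nothing is computed until a reduction is applied. As a minimal sketch on made-up coverage values, here is a mean taken per sensor:

```python
import numpy as np
import xarray as xr

# Synthetic coverage values with a sensor label attached to each time step
cov = xr.DataArray(
    np.array([1.0, 0.5, 0.5, 0.25, 1.0]),
    dims="mid_date",
    coords={"satellite_img1": ("mid_date", np.array(["8.", "1A", "8.", "1A", "2A"]))},
    name="cov",
)

# Group by the sensor label, then average within each group
mean_cov = cov.groupby("satellite_img1").mean()
print(mean_cov.satellite_img1.values)
print(mean_cov.values)
```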

However, if we want to examine the coverage of data from different sensor groups over time, we essentially want to group by two variables at once. To do this, we use [flox](https://flox.readthedocs.io/en/latest/).

In [None]:
import flox.xarray

This is the `xr.DataArray` on which we will perform the grouping operation using `flox`

In [None]:
sample_glacier_raster.cov

Using `flox`, we will define a coverage object that takes as inputs the data we want to reduce, the groups we want to use to group the data and the reduction we want to perform. 

In [None]:
coverage = flox.xarray.xarray_reduce(
    # array to reduce
    sample_glacier_raster.cov,
    # Grouping by two variables
    sample_glacier_raster.satellite_img1.compute(),
    sample_glacier_raster.mid_date,
    # reduction to apply in each group
    func="mean",
    # for when no elements exist in a group
    fill_value=0,
)
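
If it helps to see what a two-variable grouping produces, here is a small cross-check written with pandas instead of flox. It is not the flox implementation, and the values are made up; it just shows the shape of the result: one row per sensor, one column per time step, the mean within each (sensor, time) group, and 0 where a group has no entries.

```python
import pandas as pd

# Synthetic coverage records: two entries share both sensor and date
df = pd.DataFrame(
    {
        "mid_date": pd.to_datetime(
            ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-03"]
        ),
        "satellite_img1": ["8.", "8.", "1A", "8.", "1A"],
        "cov": [1.0, 0.5, 0.8, 0.2, 0.6],
    }
)

# Mean coverage for every (sensor, date) pair; empty groups are filled with 0
coverage_table = df.pivot_table(
    index="satellite_img1",
    columns="mid_date",
    values="cov",
    aggfunc="mean",
    fill_value=0,
)
print(coverage_table)
```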

Now we can visualize the coverage over time for each sensor in the ITS_LIVE dataset. Cool!

In [None]:
plt.pcolormesh(coverage.mid_date, coverage.satellite_img1, coverage, cmap='viridis')
plt.colorbar()

This notebook displayed basic data inspection steps that you can take when working with a new dataset. The following notebooks will demonstrate further processing, analytical and visualization steps you can take. 