# Individual glacier analysis 1

This notebook will walk you through steps to read in and organize velocity data and clip it to the extent of a single glacier. The tools we will use include **xarray**, **rioxarray** and **geopandas**. 

To clip ITS_LIVE data to the extent of a single glacier we will use a vector dataset of glacier outlines, the [Randolph Glacier Inventory](https://nsidc.org/data/nsidc-0770). These data aren't currently cloud-hosted, so you will need to download them to your local machine. 

**Learning goals**

- using xarray to read zarr data from an S3 bucket
- **`rio.clip()`** to clip a raster by a vector
- viewing, reprojecting and writing CRS information for various objects
- `dataset.where()`
- `dataset.sel()` using multiple conditions
- `groupby`



First, let's import the Python libraries that were listed on the [Software](software.ipynb) page:

In [None]:
import geopandas as gpd
import os
import numpy as np
import xarray as xr
import rioxarray as rxr
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from shapely.geometry import Polygon
from shapely.geometry import Point
import cartopy.crs as ccrs
from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER
import cartopy
import cartopy.feature as cfeature
import json
import urllib.request
from skimage.morphology import skeletonize
import pandas as pd
import seaborn as sns
%config InlineBackend.figure_format='retina'


## Reading in ITS_LIVE data

We will use some of the functions we defined in the data access notebook to read in data here. First, let's read in the catalog again:  

In [None]:
#import itslivetools

In [None]:
with urllib.request.urlopen('https://its-live-data.s3.amazonaws.com/datacubes/catalog_v02.json') as url_catalog:
    itslive_catalog = json.loads(url_catalog.read().decode())
itslive_catalog.keys()

Take a look at a single catalog entry:
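
The live catalog is not reproduced here. As a rough sketch, the snippet below builds a small, hypothetical GeoJSON-style dictionary that uses the same keys `find_granule_by_point()` relies on (`features`, `geometry`, `coordinates`, `properties`, `zarr_url`) and inspects its first entry. The coordinates and the URL are invented for illustration only.

```python
# Hypothetical stand-in for the catalog; all values are invented for illustration.
mock_catalog = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "Polygon",
                # one closed ring of [lon, lat] pairs
                "coordinates": [
                    [[84.0, 28.0], [85.0, 28.0], [85.0, 29.0], [84.0, 29.0], [84.0, 28.0]]
                ],
            },
            "properties": {
                "zarr_url": "http://example-bucket.s3.amazonaws.com/cubes/example.zarr"
            },
        }
    ],
}

# A single catalog entry is one element of the 'features' list
first_entry = mock_catalog["features"][0]
entry_keys = sorted(first_entry.keys())
footprint = first_entry["geometry"]["coordinates"][0]
zarr_url = first_entry["properties"]["zarr_url"]

print(entry_keys)
print(footprint)
print(zarr_url)
```

With the real catalog loaded above, the equivalent first step would be `itslive_catalog['features'][0]`.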

Use the function below to find the url that corresponds to the zarr datacube for a specific point:

In [None]:
def find_granule_by_point(input_dict, input_point):  # input_point is [lon, lat]
    '''Takes an input dictionary (a geojson catalog) and a point to represent the AOI.
    Returns a list of the urls of the zarr datacubes (hosted on s3) whose footprint covers the AOI.'''

    target_granule_urls = []
    point_geom = Point(input_point[0], input_point[1])
    point_gdf = gpd.GeoDataFrame(crs='epsg:4326', geometry=[point_geom])
    for granule in range(len(input_dict['features'])):

        # footprint of this datacube as a polygon in lat/lon
        bbox_ls = input_dict['features'][granule]['geometry']['coordinates'][0]
        bbox_geom = Polygon(bbox_ls)
        bbox_gdf = gpd.GeoDataFrame(index=[0], crs='epsg:4326', geometry=[bbox_geom])

        # keep the datacube url if its footprint contains the point
        if bbox_gdf.contains(point_gdf).all():
            target_granule_urls.append(input_dict['features'][granule]['properties']['zarr_url'])
    return target_granule_urls
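
The core of the function above is a point-in-polygon test. As a minimal sketch of that test on its own, the snippet below calls `shapely` directly instead of wrapping each geometry in a `GeoDataFrame`. The footprint and the second point are hypothetical values chosen for illustration, and both geometries are assumed to be in the same lon/lat coordinates.

```python
from shapely.geometry import Point, Polygon

# Hypothetical datacube footprint as (lon, lat) corners
footprint = Polygon([(84.0, 28.0), (85.0, 28.0), (85.0, 29.0), (84.0, 29.0)])

# One point inside the footprint, one well outside of it
inside_point = Point(84.56, 28.54)
outside_point = Point(90.0, 28.54)

# contains() is True only when the point lies in the polygon's interior
inside_hit = bool(footprint.contains(inside_point))
outside_hit = bool(footprint.contains(outside_point))

print(inside_hit, outside_hit)
```

Because no reprojection is needed when both geometries share a CRS, plain `shapely` objects are enough for this check; the `GeoDataFrame` version becomes useful once CRS handling is involved.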

This function will read in an xarray dataset from a url to a zarr datacube when we're ready:

I started with `chunks='auto'`, which leads to a warning about large chunks when we sort along the time dimension later in this notebook. More about choosing good chunk sizes [here](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes). 

In [None]:
def read_in_s3(http_url):
    s3_url = http_url.replace('http','s3')
    s3_url = s3_url.replace('.s3.amazonaws.com','')

    datacube = xr.open_dataset(s3_url, engine = 'zarr',
                                storage_options={'anon':True},
                                chunks = 'auto')

    return datacube
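
The two `replace()` calls above assume the url starts with `http://`; with an `https://` url, `replace('http','s3')` would produce `s3s://`. The sketch below shows one possible, more defensive conversion using only the standard library. The helper name `http_to_s3` and the example object path are hypothetical, and the url is assumed to follow the `<bucket>.s3.amazonaws.com` pattern.

```python
from urllib.parse import urlparse


def http_to_s3(http_url):
    """Convert 'http(s)://<bucket>.s3.amazonaws.com/<key>' to 's3://<bucket>/<key>'."""
    parsed = urlparse(http_url)
    suffix = ".s3.amazonaws.com"
    if not parsed.netloc.endswith(suffix):
        raise ValueError(f"not a '<bucket>{suffix}' style url: {http_url}")
    # Everything before the suffix in the host name is the bucket name
    bucket = parsed.netloc[: -len(suffix)]
    return f"s3://{bucket}{parsed.path}"


# The object path below is made up for illustration
from_http = http_to_s3("http://its-live-data.s3.amazonaws.com/datacubes/example.zarr")
from_https = http_to_s3("https://its-live-data.s3.amazonaws.com/datacubes/example.zarr")
print(from_http)
print(from_https)
```

Parsing the url rather than replacing substrings means only the scheme and host are rewritten, so a key that happens to contain `http` is left alone.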

In [None]:
def get_bbox_single(input_xr):

    '''Takes an input xr object (from an itslive data cube) and returns its footprint
    as a GeoDataFrame holding a single polygon, in the CRS of the data cube.
    Currently only working for granules in crs epsg 32645.'''

    xmin = input_xr.coords['x'].data.min()
    xmax = input_xr.coords['x'].data.max()

    ymin = input_xr.coords['y'].data.min()
    ymax = input_xr.coords['y'].data.max()

    # corners of the footprint, closed by repeating the first point
    pts_ls = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)]

    # build the CRS string from the EPSG code stored on the 'mapping' variable
    crs = f"epsg:{input_xr.mapping.spatial_epsg}"

    polygon_geom = Polygon(pts_ls)
    polygon = gpd.GeoDataFrame(index=[0], crs=crs, geometry=[polygon_geom])

    return polygon

In [None]:
url = find_granule_by_point(itslive_catalog, [84.56, 28.54])
url

In [None]:
dc = read_in_s3(url[0])
dc

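The data variables are read in as dask arrays. Let's take a look at the chunk sizes:

**NOTE**: for an `xr.Dataset`, `chunksizes` and `chunks` both return a mapping from each dimension name to the tuple of chunk lengths along that dimension (`chunks` is kept for backwards compatibility). Seeing every chunk length is useful if you have irregular chunks.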

In [None]:
dc.chunksizes

In [None]:
dc.chunks

Chunk sizes matter a few steps down, where sorting along `mid_date` triggers a dask warning about large chunks. The fix used there follows the warning message; choosing explicit chunk sizes and rechunking are not covered in this notebook.

Check CRS of xr object: 

In [None]:
dc.mapping

Let's take a look at the time dimension (`mid_date` here). To start with we'll just print the first 10 values:

In [None]:
for element in range(10):
    
    print(dc.mid_date[element].data)

Weird, it doesn't look like the time dimension is in chronological order. Let's fix that: 

In [None]:
dc_timesorted = dc.sortby(dc['mid_date'])
dc_timesorted

When we read in the zarr datacube as an `xr.Dataset` we set the chunk sizes to `auto`. When we try to sort along the `mid_date` dimension this seems to become a problem and we get the warning above. 

Let's follow the instructions in the warning message to avoid creating large chunks: 

In [None]:
import dask
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    dc_timesorted = dc.sortby(dc['mid_date'])
    dc_timesorted

In [None]:
dc_timesorted

In [None]:
for element in range(10):
    
    print(dc_timesorted.mid_date[element].data)
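
To see what `sortby()` does without touching the full data cube, here is a minimal sketch on a toy dataset. The dates and velocity values are invented, and the toy data lives in memory rather than in dask chunks, so no chunking warning is involved.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy dataset with an out-of-order time coordinate (values invented for illustration)
dates = pd.to_datetime(["2020-03-01", "2018-07-15", "2019-01-10"])
toy = xr.Dataset(
    {"v": ("mid_date", np.array([30.0, 10.0, 20.0]))},
    coords={"mid_date": dates},
)

# sortby reorders every variable that shares the mid_date dimension
toy_sorted = toy.sortby("mid_date")
is_sorted = bool(toy_sorted.indexes["mid_date"].is_monotonic_increasing)

print(toy_sorted.mid_date.values)
print(toy_sorted.v.values)
print(is_sorted)
```

Checking `is_monotonic_increasing` on the index is a quicker test than printing the first few dates by eye.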

## Read in vector data 

We are going to read in RGI region **15 (SouthAsiaEast)**. RGI data is downloaded in lat/lon coordinates. We will project it to match the CRS of the ITS_LIVE dataset and then select an individual glacier to begin our analysis.

In [None]:
se_asia = gpd.read_file('/Users/emarshall/Desktop/siparcs/data/nsidc0770_15.rgi60.SouthAsiaEast/15_rgi60_SouthAsiaEast.shp')
se_asia.head(3)

In [None]:
#project rgi data to match itslive
se_asia_prj = se_asia.to_crs('EPSG:32645') #we know the epsg from looking at the 'spatial epsg' attr of the mapping var of the dc object
se_asia_prj.head(3)

## Crop RGI to ITS_LIVE extent
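We'll use the `get_bbox_single()` function defined above (the version from the data access notebook, without the plotting) to get the footprint of the ITS_LIVE data cube as a vector object, then clip the RGI outlines to it.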

In [None]:
#first, get vector bbox of itslive

bbox_dc = get_bbox_single(dc)
bbox_dc['geometry']

#subset rgi to bounds 
se_asia_subset = gpd.clip(se_asia_prj, bbox_dc)
se_asia_subset
se_asia_subset.explore()

In [None]:
sample_glacier_vec = se_asia_subset.loc[se_asia_subset['RGIId'] == 'RGI60-15.04714']
sample_glacier_vec

### Clip ITS_LIVE dataset to individual glacier extent

First, we need to use `rio.write_crs()` to assign a CRS to the ITS_LIVE object. If we don't do that first, the `rio.clip()` command will produce an error.

*Note*: it looks like you can only run `write_crs()` once, because it switches `mapping` from being a `data_var` to a `coord`, so if you run it again it will produce a key error looking for a variable that doesn't exist.

In [None]:
dc_timesorted = dc_timesorted.rio.write_crs(f"epsg:{dc_timesorted.mapping.attrs['spatial_epsg']}", inplace=True)

In [None]:
%%time

sample_glacier_raster = dc_timesorted.rio.clip(sample_glacier_vec.geometry, sample_glacier_vec.crs)

In [None]:
sample_glacier_raster

Let's take a look at the clipped raster alongside the vector outline. To keep the visualization simple, we will start by taking the mean of the velocity magnitude variable (`v`) along the `mid_date` dimension:

In [None]:
fig, ax = plt.subplots(figsize = (15,9))
sample_glacier_vec.plot(ax=ax, facecolor='none', edgecolor='red');
sample_glacier_raster.v.mean(dim=['mid_date']).plot(ax=ax);



Now let's take a look at the x and y components of velocity, again averaging over time:

In [None]:
fig, axs = plt.subplots(ncols =2, figsize=(17,7))

sample_glacier_raster.vx.mean(dim='mid_date').plot(ax=axs[0]);
sample_glacier_raster.vy.mean(dim='mid_date').plot(ax=axs[1]);


In [None]:
sample_glacier_raster.v_error.mean(dim=['mid_date']).plot();

## Exploring ITS_LIVE data

ITS_LIVE data cubes come with many (53!) variables that carry information about the estimated surface velocities and the satellite images that were used to generate the surface velocity estimates. We won't examine all of this information here but let's look at a little bit.

To start with, let's look at the satellite imagery used to generate the velocity data.

We have two `data_vars` that indicate which sensor each image in the image pair at a given time step comes from:

In [None]:
sample_glacier_raster.satellite_img1.data.compute()

In [None]:
sample_glacier_raster.satellite_img2

The `satellite_img1` and `satellite_img2` variables are 1-dimensional numpy arrays corresponding to the length of the `mid_date` dimension of the data cube. You can see that each element of the array is a string corresponding to a different satellite:

- `1A` = Sentinel 1A
- `1B` = Sentinel 1B
- `2A` = Sentinel 2A
- `2B` = Sentinel 2B
- `8.` = Landsat8
- `9.` = Landsat9

Let's re-arrange these string arrays into a format that is easier to work with.

First, we'll make a set of all the different string values in the satellite image variables:

In [None]:
# unique sensor strings in each variable. These should be the same; img1 and img2 should only
# differ within a pair like 1A/1B or 2A/2B, so we probably don't need to treat the two variables separately
sat_ls1 = list(set(sample_glacier_raster.satellite_img1.compute().data))
sat_ls2 = set(sample_glacier_raster.satellite_img2.compute().data)

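Next, we'll assign an integer to each element in the set. Python sets are unordered, so the integer each sensor receives can differ between sessions; check the printed mapping before using the integers further down: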

In [None]:
mapping = {}

for x in range(len(sat_ls1)):
    mapping[sat_ls1[x]] = x
print('mapping: ', mapping)
print('')

We'll then convert each element of the satellite image variable to a one-hot (binary) array, where the position of the 1 is the integer associated with that sensor:

In [None]:
#convert each satellite_img1 value to a one-hot array; the position of the 1 is the sensor int
one_hot_encode = []
for c in sample_glacier_raster.satellite_img1.compute().data:
    arr = list(np.zeros(len(sat_ls1), dtype=int))
    arr[mapping[c]]= 1
    one_hot_encode.append(arr)

Back out the sensor integer from the binary array:

In [None]:
sensor_ints = [int(one_hot_encode[x].index(1)) for x in range(len(one_hot_encode))]
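
The one-hot step followed by `.index(1)` ends up with one integer per time step. As a sketch of a shorter route to the same kind of result, `np.unique` with `return_inverse=True` returns the sorted unique labels and an integer code for each element in one call. This is a swapped-in alternative rather than the method used in this notebook, and the sample sensor strings are invented. Because the labels come back sorted, the codes are the same in every session, unlike codes built from a `set`.

```python
import numpy as np

# Invented sample of sensor strings, one per time step
sensors = np.array(["8.", "1A", "2B", "8.", "9.", "1A"])

# labels: sorted unique strings; codes: position of each element's label in `labels`
labels, codes = np.unique(sensors, return_inverse=True)
mapping = {str(label): i for i, label in enumerate(labels)}

print("mapping:", mapping)
print("codes:", codes.tolist())
```

Note that the integers produced this way follow alphabetical order, so they will generally not match the integers printed by the set-based mapping in this notebook.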


Then make a **pandas dataframe** with each `mid_date` of the data cube and the sensor integer:

In [None]:
dates_ls = list(sample_glacier_raster.mid_date.data)
#make dataframe of sensor ints and associated img date
sat_df = pd.DataFrame({'mid_date1':dates_ls, 'sensor': sensor_ints})
sat_df['mid_date'] = sat_df['mid_date1'].dt.date
sat_df = sat_df.drop('mid_date1', axis=1)
sat_df = sat_df.sort_values(by='mid_date')
sat_df = sat_df.set_index('mid_date')

As a first step, let's visualize the time series of different sensors as a heat map:

In [None]:
#make heatmap
pal = sns.color_palette('Paired',6)

fig, ax = plt.subplots(figsize=(20,4))
sns.heatmap(sat_df.T, cmap=pal, ax=ax);

We can wrap those steps into a function to use them more easily:

In [None]:
#help from: https://www.educative.io/answers/one-hot-encoding-in-python
#help from: https://datascienceparichay.com/article/remove-time-from-date-pandas/

def get_satellite_as_int(input_da):
    
    ''' Function that takes a dask xr.DataArray holding the sensor that the velocity data at each time step was collected from.
    Returns an xr.DataArray of the sensor coded as an integer key as well as a pandas df with the date (mid_date1) as index and 
    the sensor integer as a column. Note: the mapping of sensor str to sensor integer is printed but not yet returned.'''
    
    #make list of satellite strs
    sat_ls = list(set(input_da.compute().data))
    #map strs to ints
    mapping = {}
    for x in range(len(sat_ls)):
        mapping[sat_ls[x]] = x
    print('mapping: ', mapping)
    #convert each sensor value to a one-hot array; the position of the 1 is the sensor int
    one_hot_encode = []
    for c in input_da.compute().data:
        arr = list(np.zeros(len(sat_ls), dtype=int))
        arr[mapping[c]]= 1
        one_hot_encode.append(arr)
    sensor_ints = [one_hot_encode[x].index(1) for x in range(len(one_hot_encode))]

    dates_ls = list(input_da.mid_date.data)
    #make dataframe of sensor ints and associated img date
    sat_df = pd.DataFrame({'mid_date1':dates_ls, 'sensor': sensor_ints})
    #sat_df['mid_date'] = sat_df['mid_date1'].dt.date
    #sat_df = sat_df.drop('mid_date1', axis=1)
    sat_df = sat_df.sort_values(by='mid_date1')
    sat_df = sat_df.set_index('mid_date1')
    
    sat_xr = sat_df['sensor'].to_xarray()
    
    return sat_xr, sat_df


In [None]:
a = get_satellite_as_int(sample_glacier_raster.satellite_img1.compute())[0]

And if we want we can add this new `xarray.DataArray` back as a `data_var` in the original `xarray.Dataset`:

In [None]:
sample_glacier_raster['satellite_img_int'] = ('mid_date', a.data)
sample_glacier_raster

## Examining velocity data from each satellite in `ITS_LIVE` dataset

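What if we only wanted to look at the velocity estimates from Landsat8? In the mapping printed above, `8.` was assigned the integer 5 in this run; use whichever integer your own mapping shows: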

In [None]:
l8_data = sample_glacier_raster.where(sample_glacier_raster['satellite_img_int'] == 5., drop=True)
l8_data

`dataset.where()` at first seems appropriate for this kind of operation, but there's actually an easier way. Because we are selecting along a single dimension (`mid_date`), we can use xarray's `.sel()` method instead. This is more efficient and integrates with `dask` arrays more smoothly.

In [None]:
l8_condition = sample_glacier_raster.satellite_img_int.isin(5.)
l8_subset = sample_glacier_raster.sel(mid_date=l8_condition)
l8_subset

We can see that we are looking at roughly a third of the original time steps. Let's take a look at the average speeds of the Landsat8-derived velocities:

In [None]:
l8_subset.v.mean(dim='mid_date').plot();

What about Landsat9?

In [None]:
l9_condition = sample_glacier_raster.satellite_img_int.isin(1.)

l9_subset = sample_glacier_raster.sel(mid_date=l9_condition)
l9_subset

Only 45 time steps have data from Landsat9. This makes sense because Landsat9 was launched only recently (September 2021).

In [None]:
l9_subset.v.mean(dim='mid_date').plot();

Let's look at Sentinel 1 data. Note here we are selecting for 2 values instead of 1: 

In [None]:
s1_condition = sample_glacier_raster.satellite_img_int.isin([0,2])
s1_subset = sample_glacier_raster.sel(mid_date = s1_condition)
s1_subset

In [None]:
s1_subset.v.mean(dim='mid_date').plot();

In [None]:
s2_condition = sample_glacier_raster.satellite_img_int.isin([3,4])
s2_subset = sample_glacier_raster.sel(mid_date=s2_condition)
s2_subset

In [None]:
s2_subset.v.mean(dim='mid_date').plot();
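
The difference between `.where(..., drop=True)` and a boolean `.sel()` used in this section can be checked on a toy dataset. The sketch below is a minimal illustration; the variable name `sensor_int` and all values are invented, and the data is held in memory rather than in dask arrays.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy dataset: a velocity value and an integer sensor code per time step (values invented)
toy = xr.Dataset(
    {
        "v": ("mid_date", np.array([5.0, 7.0, 9.0, 11.0])),
        "sensor_int": ("mid_date", np.array([0, 2, 0, 1])),
    },
    coords={"mid_date": pd.date_range("2020-01-01", periods=4, freq="D")},
)

# Boolean mask along mid_date, then select with it
mask = toy.sensor_int.isin([0])
subset_sel = toy.sel(mid_date=mask)

# Same selection with where(); values that fail the condition are masked, then dropped
subset_where = toy.where(toy.sensor_int == 0, drop=True)

print(subset_sel.v.values, subset_sel.sensor_int.dtype)
print(subset_where.v.values, subset_where.sensor_int.dtype)
```

Both routes keep the same two time steps here. One visible difference is that `.where()` masks with NaN, which promotes the integer variable to float, while the boolean `.sel()` only picks positions along `mid_date` and leaves the dtypes unchanged.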

In [None]:
#attempt at ufunc, didn't work

#def xr_satellite_coding(a):
#    return xr.apply_ufunc(get_satellite_as_int, a)
#sample_glacier_raster
#test_ds = xr_satellite_coding(sample_glacier_raster.satellite_img1)

In [None]:
pal = sns.color_palette('Paired',6)
pal

### Seasonal mean velocities with groupby

In [None]:
#first define the function we'll apply to each group
def middate_mean(a):
    return a.mean(dim='mid_date')


In [None]:
seasons_gb = sample_glacier_raster.groupby(sample_glacier_raster.mid_date.dt.season).map(middate_mean)
#add attrs to gb object
seasons_gb.attrs = sample_glacier_raster.attrs #why didn't that work?
seasons_gb

In [None]:
fg = seasons_gb.v.plot(
    col='season',
    vmax = 150
);
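
As a small, self-contained check of how the seasonal grouping behaves, the sketch below groups a toy dataset by `mid_date.season` and averages each group. The dates and values are invented; the string form `'mid_date.season'` and the built-in `.mean()` are used here as a shorthand that should give the same kind of result as `.map(middate_mean)` above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy time series with at least one date in each season (values invented)
dates = pd.to_datetime(
    ["2020-01-15", "2020-02-15", "2020-04-15", "2020-07-15", "2020-10-15"]
)
toy = xr.Dataset(
    {"v": ("mid_date", np.array([10.0, 20.0, 30.0, 40.0, 50.0]))},
    coords={"mid_date": dates},
)

# Group time steps by season label (DJF, MAM, JJA, SON) and average each group
seasonal = toy.groupby("mid_date.season").mean(dim="mid_date")
season_labels = sorted(str(s) for s in seasonal.season.values)
djf_mean = float(seasonal.v.sel(season="DJF"))

print(season_labels)
print(djf_mean)
```

The result has a new `season` dimension in place of `mid_date`, which is what allows the faceted plot with `col='season'`.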