# Accessing ITS_LIVE data hosted by AWS
- This is a notebook from Scott H.

## Which AWS data center hosts the data?

**AWS us-east-1**

Check this with `nslookup`, which takes a hostname rather than a full URL:

In [None]:
!nslookup its-live-data.s3.amazonaws.com

## Checking whether data are available over certain regions

In [None]:
import geopandas as gpd

In [None]:
gf_all = gpd.read_file('https://its-live-data.s3.amazonaws.com/datacubes/catalog_v02.json')


In [None]:
gf_all.head(4)

In [None]:
type(gf_all)

In [None]:
gf_all.explore()

In [None]:


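# Crop the catalog to the HiMAT bounding box (lon/lat) with geopandas' .cx indexer
xmin = 70
xmax = 110
ymin = 25
ymax = 50
# .copy() so that columns can be added to gf later without a SettingWithCopyWarning
gf = gf_all.cx[xmin:xmax, ymin:ymax].copy()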
len(gf)
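
The `.cx` indexer used above selects features that intersect a bounding box. As a rough illustration of that test, here is a minimal sketch in plain Python that compares rectangular bounds only (geopandas tests the actual geometries). The tile footprints and the `bounds_intersect` helper are hypothetical and exist only for this example.

```python
def bounds_intersect(a, b):
    """Return True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Same limits as the crop above: xmin, ymin, xmax, ymax
region = (70, 25, 110, 50)

# Hypothetical tile footprints, for illustration only
tiles = {
    "inside": (75, 35, 76, 36),
    "straddles_edge": (69, 30, 71, 31),
    "outside": (10, 60, 11, 61),
}

selected = [name for name, box in tiles.items() if bounds_intersect(box, region)]
print(selected)  # ['inside', 'straddles_edge']
```

Note that a tile straddling the edge of the region is kept, so the cropped catalog can extend slightly beyond the requested limits.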

In [None]:
gf.explore()

In [None]:
url = gf.iloc[0].zarr_url
url

In [None]:
s3obj = url.replace('.s3.amazonaws.com', '').replace('http://', 's3://')
s3obj
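
The chained `replace` calls above assume the URL starts with `http://` and uses the `<bucket>.s3.amazonaws.com` host form. A slightly more defensive sketch, using only the standard library, could look like the following. The `http_to_s3` helper and the example bucket name are hypothetical, not part of the ITS_LIVE catalog or any library.

```python
from urllib.parse import urlparse


def http_to_s3(url: str) -> str:
    """Convert a virtual-hosted-style S3 HTTP(S) URL to an s3:// URI."""
    parsed = urlparse(url)
    suffix = ".s3.amazonaws.com"
    if not parsed.netloc.endswith(suffix):
        raise ValueError(f"not a virtual-hosted S3 URL: {url}")
    bucket = parsed.netloc[: -len(suffix)]
    return f"s3://{bucket}{parsed.path}"


# Hypothetical URL, for illustration only
example = "http://example-bucket.s3.amazonaws.com/path/to/cube.zarr"
print(http_to_s3(example))  # s3://example-bucket/path/to/cube.zarr
```

Parsing the URL handles both `http://` and `https://`, and raises an error instead of silently returning an unchanged string when the host does not match.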

## Do all of these s3 zarr objects exist? How much space do they take up? 

**Having trouble installing the AWS CLI**: I sent an email to the help desk about getting permissions. The next cell needs the CLI to run.

In [None]:
!aws --no-sign-request s3 ls {s3obj}/

In [None]:
import s3fs
fs  = s3fs.S3FileSystem(anon=True)

In [None]:
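# Add the s3:// URL as a column in the GeoDataFrame
# regex=False so that '.' is matched literally rather than as a regex wildcard
gf['zarr_s3'] = gf.zarr_url.str.replace('.s3.amazonaws.com', '', regex=False).str.replace('http://', 's3://', regex=False)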


In [None]:
#also save the index as a column (tile id)
gf['tile_id'] = gf.index

In [None]:
gf.iloc[0]

In [None]:
# 'datacube_exist' is a col in gf
# unique() returns unique values of series object
gf.datacube_exist.unique()

`lexists` checks whether a given path exists. It is modelled on `os.path.lexists(path)` and is available as a method on the `fs` object (type `s3fs.core.S3FileSystem`), so it can be applied to each S3 path:

In [None]:

gf['path_exists'] = gf.zarr_s3.apply(lambda x: fs.lexists(x))

In [None]:
gf.path_exists.unique()


Use `groupby` to count how many of the Zarr URLs listed for HMA exist and how many do not:

In [None]:
gf.groupby('path_exists')['zarr_url'].count()

### Which s3 paths don't exist?

In [None]:
gf.head(2)


In [None]:
# Cast the booleans to strings so that explore() colors them as categories
gf['path_exists'] = gf.path_exists.astype(str)
gf.explore(column='path_exists',
           cmap='Set1',
           #tiles = 'https://glaceirflow.nyc3.digitaloceanspaces.com/webmaps/vel_map/{z}/{x}/{y}.png',
           tiles='OpenStreetMap',
           attr='ITS_LIVE Velocity Mosaic')

In [None]:
import xarray as xr

In [None]:
# Select a Zarr S3 URL by tile index
# Candidate tiles, listed as index (lon, lat):
# 225 (75.61, 35.96)
# 226 (75.53, 36.17)

TILEID = 226
LON, LAT = (75.53, 36.17)

s3obj = gf.loc[TILEID].zarr_s3
s3obj

In [None]:
%%time

ds = xr.open_dataset(s3obj, 
                    storage_options = {'anon':True},
                    chunks = 'auto',
                    engine = 'zarr')

In [None]:
ds.vx

In [None]:
ds.vx.encoding

In [None]:
import hvplot.xarray

In [None]:
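import pyproj
# The datacube stores the EPSG code of its projected CRS in the 'projection' attribute
UTM = pyproj.CRS.from_epsg(ds.attrs['projection'])
LONLAT = pyproj.CRS.from_epsg(4326)
# always_xy=True keeps the axis order as (lon, lat) in and (x, y) out
proj = pyproj.Transformer.from_crs(LONLAT, UTM, always_xy=True)

# Convert the point of interest from lon/lat to the datacube's coordinates
utmX, utmY = proj.transform(LON, LAT)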


In [None]:
%%time 

# Time series plots should be fast, but where to select? Lots of NaNs...

ds.v.sel(x=utmX, y=utmY, method='nearest').hvplot.scatter()
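
The `method='nearest'` argument above asks xarray to pick the grid cell whose coordinate is closest to the requested value, since `utmX` and `utmY` will rarely fall exactly on a grid coordinate. A minimal sketch of that lookup in plain Python, along one axis, is shown below. The `nearest_index` helper and the coordinate values are hypothetical and only illustrate the idea; xarray's own implementation differs.

```python
def nearest_index(coords, target):
    """Return the index of the coordinate value closest to target."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - target))


# Hypothetical grid of x coordinates in metres, for illustration only
x_coords = [500000.0 + 120.0 * i for i in range(5)]

i = nearest_index(x_coords, 500250.0)
print(i, x_coords[i])  # 2 500240.0
```

The selected cell may still contain only NaNs, so a nearest-neighbour match says nothing about whether valid velocities exist at that location.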