# Simple Cloud storage operations with gcsfs

So somebody (it was me, wasn't it?) told you to `put your data on a LEAP cloud bucket`, but it is not entirely clear how to do that.

Let's explore how we would go about it and what the benefits are.

Cloud storage works quite differently from a traditional filesystem (e.g. on an HPC or the hard drive on your laptop).
> Cloud object storage is essentially a key/value storage system. The keys are strings, and the values are bytes of data. Data is read and written using HTTP calls.

[2i2c docs](https://docs.2i2c.org/user/topics/data/cloud/#cloud-object-storage)

This means that we need to tweak the way we read and write data accordingly. But as you will see, the changes are fairly small when working with xarray datasets.
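
To make the key/value idea a bit more concrete, here is a toy sketch in plain Python. It only models the concept with a dictionary; the keys and byte values are made up for illustration, and this is not how gcsfs or the actual object store are implemented.

```python
# Toy model of an object store: a flat mapping from string keys to bytes.
# Illustration only; real object stores are accessed via HTTP calls.
bucket = {}

# "Writing a file" just stores bytes under a key. No directories are created.
bucket["jbusecke/demo/first.nc"] = b"netcdf bytes"
bucket["jbusecke/demo/second.nc"] = b"more bytes"
bucket["someone_else/data.nc"] = b"other bytes"


def list_prefix(store: dict, prefix: str) -> list:
    """Return all keys that start with `prefix`, sorted alphabetically."""
    return sorted(key for key in store if key.startswith(prefix))


# A "folder listing" is really just a filter on the key prefix.
print(list_prefix(bucket, "jbusecke/demo/"))
# ['jbusecke/demo/first.nc', 'jbusecke/demo/second.nc']
```

The takeaway: "folders" in a bucket are only a naming convention on the keys, which is why we need a library to give us filesystem-like behavior on top.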

## Creating a small test dataset 

In [1]:
import xarray as xr
import numpy as np

# let's make a test dataset
ds = xr.DataArray(
    np.random.rand(3,4,5),
    dims=['x', 'y', 'z'],
    attrs={'this':'is our first dataarray'}
).to_dataset(name='data')
ds

## Save a netcdf to your user directory (Only for small test files!)

So we can naively start by saving our data as we would, e.g., on our laptop.

In [2]:
ds.to_netcdf('first.nc')

# we can reload the file with

ds_reloaded = xr.open_dataset('first.nc')
ds_reloaded

This works fine for small test datasets like the one above, but has several downsides:

❌ Nobody but you can read this file

❌ The user directory cannot be used for large files!

## Now let's move the file to a cloud bucket
Ok so the next best thing is to create a small file locally and then put it into a bucket. 

We can use the gcsfs library to get some 'filesystem-like' convenience on top of our cloud object store.

In [3]:
import gcsfs
fs = gcsfs.GCSFileSystem()
fs.ls('gs://leap-scratch') # methods are similar to UNIX shell commands

['leap-scratch/data-library']

In [4]:
# 🚨 Always work in a subfolder with your username, to avoid messing with other folks' data
import os
user_path = f"gs://leap-scratch/{os.environ['JUPYTERHUB_USER']}/annual_meeting_demo"
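
If you build these paths in several places, a small helper can fail loudly when the username is missing. This is only a sketch: `user_bucket_path` is a hypothetical name, not part of gcsfs or any LEAP tooling, and it assumes the hub exposes the username via `JUPYTERHUB_USER` as in the cell above.

```python
import os


def user_bucket_path(bucket: str, *parts: str, env_var: str = "JUPYTERHUB_USER") -> str:
    """Build gs://<bucket>/<username>/<parts...> from the hub username.

    Hypothetical helper for illustration; raises if the username is not set.
    """
    username = os.environ.get(env_var)
    if not username:
        raise RuntimeError(f"{env_var} is not set; are you running on the JupyterHub?")
    return "/".join([f"gs://{bucket}", username, *parts])


# Example usage (on the hub, where JUPYTERHUB_USER is defined):
# user_path = user_bucket_path("leap-scratch", "annual_meeting_demo")
```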

In [5]:
# we can put our written netcdf on the cloud bucket with the .put method
cloud_path = user_path+'/netcdf_upload/first.nc'
fs.put('first.nc', cloud_path)
fs.ls(user_path+'/netcdf_upload')

['leap-scratch/jbusecke/annual_meeting_demo/netcdf_upload/first.nc']
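
Since `.put` will happily write to whatever path you give it, you might want to check first whether something already lives there. Here is a minimal sketch: `upload_if_missing` is a hypothetical helper name, and `fs` is assumed to be an fsspec-style filesystem like the `gcsfs.GCSFileSystem()` created above (it only needs `.exists` and `.put`).

```python
def upload_if_missing(fs, local_path: str, remote_path: str) -> bool:
    """Upload `local_path` to `remote_path` unless an object already exists there.

    `fs` is assumed to be an fsspec-style filesystem, e.g. gcsfs.GCSFileSystem().
    Returns True if an upload happened, False if the remote path already existed.
    """
    if fs.exists(remote_path):
        return False
    fs.put(local_path, remote_path)
    return True


# Example usage with the objects defined in the cells above:
# upload_if_missing(fs, 'first.nc', cloud_path)
```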

In [6]:
# we can now load the netcdf file from the cloud bucket using xarray
with fs.open(cloud_path) as f:
    ds_reloaded_cloud = xr.open_dataset(f).load()  # load into memory before the file is closed
ds_reloaded_cloud

So big deal, what is the advantage of this? 

You can now share the following snippet with everyone in LEAP and they can access the file! Let's try it with my version.

```python
import xarray as xr
import gcsfs 
fs = gcsfs.GCSFileSystem()
cloud_path = 'gs://leap-scratch/jbusecke/annual_meeting_demo/netcdf_upload/first.nc'
with fs.open(cloud_path) as f:
    ds_julius = xr.open_dataset(f).load()  # load into memory before the file is closed
ds_julius
```
👆 copy this into a new cell and execute it!

----

Ok so we just showed that with this simple change we can keep working as before, but we also gained the ability to easily share data with other LEAP members!

However, this approach is still not optimal. Whenever you have array data as an xarray Dataset, we strongly recommend using [zarr](https://zarr.dev), which is optimized for cloud object storage. It lets you write directly to cloud storage in a streaming fashion (eliminating the need for intermediate copies of your files) and, under the right conditions, can enable much better performance for distributed data analytics in the cloud.
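
Part of why zarr fits object storage so well is that each chunk of an array becomes its own object in the bucket, so chunks can be read and written independently. The sketch below lists the chunk keys you would get for one array. It assumes the Zarr v2 default layout with dot-separated chunk indices (newer format versions name chunks differently), it leaves out the metadata objects, and the chunk sizes are made up for illustration.

```python
import itertools
import math


def chunk_keys(name, shape, chunks):
    """List one key per chunk of an array, Zarr v2 style (e.g. 'data/0.1.0').

    Illustration only: metadata objects stored alongside the chunks are omitted.
    """
    # Number of chunks along each dimension (the last chunk may be partial).
    n_chunks_per_dim = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    return [
        f"{name}/" + ".".join(str(i) for i in index)
        for index in itertools.product(*(range(n) for n in n_chunks_per_dim))
    ]


# Our (3, 4, 5) test array, split into two hypothetical chunks along 'y'.
print(chunk_keys("data", shape=(3, 4, 5), chunks=(3, 2, 5)))
# ['data/0.0.0', 'data/0.1.0']
```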

## Writing to zarr instead of netcdf

You will see that when you use zarr:
- The code becomes even cleaner
- There might be big performance gains over netcdf files when you work with large datasets

So let's redo the whole thing with zarr!

In [7]:
cloud_path_zarr = user_path+'/zarr_write/first.zarr'
ds.attrs['zarr']='FTW' # let's give this dataset a unique attribute
ds.to_zarr(cloud_path_zarr) # if you want to overwrite an existing store add `mode='w'`

<xarray.backends.zarr.ZarrStore at 0x7f716e48dbd0>

That's it!

By giving a URL that starts with `gs://...`, xarray automatically invokes `gcsfs` under the hood!
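
As a rough mental model of that dispatch: the part of the URL before `://` selects the storage backend. The real lookup is handled by the fsspec library that gcsfs builds on; the table and the `backend_for` function below are simplified stand-ins written for this sketch, not fsspec's actual registry.

```python
from urllib.parse import urlsplit

# Simplified, hand-written lookup table (an assumption for this sketch).
PROTOCOL_TO_PACKAGE = {
    "gs": "gcsfs",
    "gcs": "gcsfs",
    "s3": "s3fs",
    "": "local filesystem",  # plain paths like 'first.nc' have no scheme
}


def backend_for(url: str) -> str:
    """Return the name of the backend that would handle `url` in this toy model."""
    scheme = urlsplit(url).scheme
    return PROTOCOL_TO_PACKAGE.get(scheme, "unknown")


print(backend_for("gs://leap-scratch/jbusecke/annual_meeting_demo/zarr_write/first.zarr"))
# gcsfs
print(backend_for("first.nc"))
# local filesystem
```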

You can now give your collaborator an even more concise snippet:
```python
import xarray as xr
path = 'gs://leap-scratch/jbusecke/annual_meeting_demo/zarr_write/first.zarr'
ds_julius_zarr = xr.open_dataset(path, engine='zarr')
ds_julius_zarr
```

Feel free to try this out again! Check the dataset's attributes.

## Take Home Points
✅ Moving data to the cloud buckets enables you to share data with other LEAP members super easily.

✅ Whenever you are able to load data into an xarray dataset, try to use `.to_zarr()` to write a cloud-optimized zarr store to a cloud bucket!

⚠️ All data on the buckets are visible to all members, but do not just use data from other users without contacting them. 

⚠️ The LEAP cloud buckets are **not meant as long-term archival storage**. Always have a backup copy of valuable data on another resource!