# Data storage and locality <img align="right" src="../resources/csiro_easi_logo.png">

This notebook introduces users to working with data in EASI and in the cloud.

Understanding the location of data assets is important for efficiency in your code and reducing cloud-computing costs. In this notebook you will explore the different data read/write options available in EASI. 

By the end of this module, you will be able to: 

- Identify the different available storage options in EASI and recognize
    - When to use one or more storage options
    - The different factors associated with these storage options (cost, security, latency etc) 
- Upload data for use in JupyterLab and with Dask workers
- Access resources to help build skills and knowledge to further your learning around storage. 

### Summary of data storage types

| Type | User r/w | Dask workers r/w | Collaborators r/w |
|--|--|--|--|
| [Home directory](#Home-directory) | Yes | No | No |
| [User scratch](#User-scratch) | Yes | Yes | No |
| [Project data](#Project-data) | Yes | Yes | Yes |
| [Data cube](#Data-cube) | Read | Read | Read |
| [Cloud data](#Cloud-computing-data) | Read | Read | Read |

## Home directory

Your EASI home directory is available in JupyterLab, both in the file browser (the left panel) and from the terminal.

TODO: Add pic

**What is the Home Directory?**
- The Home Directory uses the Amazon Web Services (AWS) Elastic File System (EFS) storage architecture and behaves like a normal network file system.

**What are the advantages of the Home Directory?**
- It is automatically backed up nightly, and backups are retained for 90 days.
- Contact the EASI Admins if you need a file restored (but you should not rely on us).

**When would I use the Home Directory to store data?**
- Small files, source code, notebooks
- File formats that do not work from the other storage options because they require a normal file system (if these files are large, we recommend finding an alternative)

**What are the limitations of the Home Directory?**
- This is the most expensive form of storage available. Other choices are better for large data.
- Your home directory is not visible from Dask workers (the S3 storage options are).

**How do I upload and use data?**

TODO: link to a video of this and manual in standard jupyter labs documentation 

For small files it is simplest to drag-and-drop your files into the JupyterLab file interface directly from your desktop. This will upload the file via your browser into your home directory. 

For larger files it's best to use the AWS CLI running locally along with your EASI credentials and upload directly to the **User scratch** storage.

**Exercise / Simulation?**: Upload a file into JupyterLab 

## User scratch

EASI has a "User scratch" S3 storage bucket that users can write to.
- You can use the AWS CLI and library functions to write larger files to this bucket.
- Objects in the bucket have a 30-day lifecycle rule (counted from creation), so it is appropriate for saving intermediate results and distributing data to your Dask workers.
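As a minimal sketch of the "library functions" route, the helper below uploads one local file with the boto3 `upload_file` call. The bucket name `example-easi-user-scratch` and the per-user key layout are assumptions made for illustration, not the real EASI values; check your deployment for the actual bucket and prefix.

```python
from pathlib import Path

# Hypothetical bucket name: replace with the User scratch bucket of your EASI deployment.
SCRATCH_BUCKET = "example-easi-user-scratch"


def scratch_key(user_id, local_path, subdir=""):
    """Build an object key under a per-user prefix (an assumed layout)."""
    parts = [user_id, subdir.strip("/"), Path(local_path).name]
    return "/".join(part for part in parts if part)


def upload_to_scratch(s3_client, local_path, user_id, subdir=""):
    """Upload one local file and return its s3:// URI."""
    key = scratch_key(user_id, local_path, subdir)
    # boto3's upload_file switches to multipart uploads for large files.
    s3_client.upload_file(local_path, SCRATCH_BUCKET, key)
    return f"s3://{SCRATCH_BUCKET}/{key}"


def make_s3_client():
    """Create a boto3 S3 client from the credentials available in the session."""
    import boto3  # imported lazily so the helpers above can be used without boto3

    return boto3.client("s3")
```

A call such as `upload_to_scratch(make_s3_client(), "results/ndvi.nc", "my-user-id", "run-01")` would then return the object's URI, which can be passed to Dask workers instead of a home-directory path.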

**When would I use User scratch?**
- When your workflow requires a large, temporary storage location
- For use with Dask workers

TODO: Migrate notebook to easi-notebooks, https://dev.azure.com/csiro-easi/easi-hub-public/_git/hub-notebooks?path=/EASI%20Training/A4%20-%20User%20Scratch%20Usage.ipynb&_a=preview 

**What are the advantages of User scratch?**
- Lots of storage and scale
- Works with Dask workers
- Has a managed lifetime in case you forget

**What are the limitations of User scratch?**

It is object storage, which differs from a POSIX file system, so it requires a different method to access and manage the data. Most current tooling supports object storage directly, but older tools that haven't been updated may not.
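To illustrate the difference, an object store has no real directories: a "folder" is just a shared key prefix, and listing is a paginated API call rather than a directory read. The sketch below uses the boto3 `list_objects_v2` paginator; the function name `list_keys` is illustrative, and the bucket and prefix you pass in depend on your deployment.

```python
from typing import List


def list_keys(s3_client, bucket: str, prefix: str) -> List[str]:
    """Return every object key under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page that has no matching objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

With a real client (`boto3.client("s3")`), `list_keys(client, bucket, "my-user-id/run-01/")` plays the role that `os.listdir` would play on a POSIX file system.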

**Exercise / Simulation?:** Save xarray masked result (from previous sim) to user scratch 

## Project data

**What is Project Data?**
- Similar to the User scratch bucket, projects can make additional storage resources available to their users in EASI. A project bucket can be shared across all project members, with different levels of membership (read/write or read-only).
- EASI hosts a project bucket in the EASI account. You can request a project share by contacting your EASI Administrators.
- EASI can also set up cross-account authorization to project-managed resources. This is a much better option because the project remains in full control of its resources.
- Contact EASI for more information.

**When would I use project data?**
- You need to share project data between multiple people
- You want to manage your own large scale data in your own project account

**What are the advantages?**
- Can be shared to others
- Custom lifecycle rules
- Can be completely managed by the project team in their own account.

**What are the limitations?**
- Only selected users will be able to read from, or read from and write to, the project bucket.
- The EASI Admins can add or update a user's access to project buckets.

**How do I upload and use data?**
- Similar to User scratch but change the bucket name.

**Exercise / Simulation?:** None

## Data cube

**What is the DataCube?**
- Data cube products are EASI-managed datasets that are indexed in the datacube database and available through the datacube API.
- The products currently available are listed at https://explorer.csiro.easi-eo.solutions/

**When would I use the DataCube?**
- If the data you need is available as EASI-managed data, it is much easier to use it via the ODC API, and you don't need to keep your own copy.

**What are the advantages of the DataCube?**
- Managed for you
- Shared
- Easier access via Open Data Cube library
- You can build and manage your own (contact easi-help@csiro.au for admin assistance)

**What are the limitations of the DataCube?**
- It may not have your data, but you can build your own.
- The ODC API may not suit your overall workflow, but you can still get the underlying list of files and either exploit some aspects of the API or access the files directly.
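One way to get the underlying list of files is to search the index without loading any pixels. The sketch below assumes the Open Data Cube `Datacube.find_datasets` method and the 1.8-style `Dataset.uris` attribute; the helper name `dataset_uris` and the product name in the usage note are hypothetical.

```python
def dataset_uris(dc, **query):
    """Return the storage URIs of every dataset that matches an ODC query."""
    uris = []
    for dataset in dc.find_datasets(**query):
        # Each dataset records where its metadata document (and data) is stored.
        uris.extend(dataset.uris)
    return uris


def make_datacube():
    """Connect to the datacube index configured for the current environment."""
    import datacube  # imported lazily; requires a configured datacube environment

    return datacube.Datacube(app="list-dataset-uris")
```

For example, `dataset_uris(make_datacube(), product="example_product", time=("2020-01", "2020-02"))` would return a plain list of URIs that other tools can open directly.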

**How do I upload and use data?**
- Data preparation and indexing to the datacube is managed by the EASI admins.
- We intend to make it easier for users to contribute their own products.

Datacube data can be used through:
- the datacube API
- the Explorer and OWS web services
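A minimal sketch of the datacube API route is shown below, assuming the Open Data Cube `Datacube.load` call. The helper names, chunk sizes, product name and measurement names are placeholders for illustration; use the Explorer to find the real ones. Some products also need `output_crs` and `resolution` to be added to the query.

```python
def build_query(product, lon_range, lat_range, time_range, measurements):
    """Assemble keyword arguments for Datacube.load (illustrative chunk sizes)."""
    return {
        "product": product,
        "x": lon_range,  # longitude bounds, interpreted as EPSG:4326 by default
        "y": lat_range,  # latitude bounds
        "time": time_range,
        "measurements": list(measurements),
        # dask_chunks makes the load lazy, so Dask workers read the data on demand.
        "dask_chunks": {"time": 1, "x": 2048, "y": 2048},
    }


def load_lazy(query):
    """Run the query and return an xarray.Dataset backed by Dask arrays."""
    import datacube  # imported lazily; requires a configured datacube environment

    dc = datacube.Datacube(app="storage-locality-example")
    return dc.load(**query)
```

A call might look like `load_lazy(build_query("example_product", (149.0, 149.2), (-35.4, -35.2), ("2020-01", "2020-03"), ["red", "nir"]))`. No copy of the data is made in your home directory or scratch bucket.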

TODO: Screenshots

TODO: Migrate https://dev.azure.com/csiro-easi/easi-hub-public/_git/hub-notebooks?path=/EASI%20Training/03%20-%20Loading%20Data.ipynb&_a=preview

**Exercise / Simulation?:** 
- upload a shapefile (copy from some resource or choose your own)
- load EO datacube into an xarray
- mask with an uploaded shapefile

## Cloud-computing data

**What is Cloud Data and what cloud data services / technologies does EASI support?**
- Many data providers are making their data available directly in the Cloud.
- The amount and variety of data and providers is vast. We can't cover them all here, but in most cases the data can be accessed directly from EASI.

**When would I use cloud-based data assets?**
- If a provider has data you want and it's just as efficient to access it directly as it is to download and store it (why bother curating it yourself?)

**What are the advantages of cloud-based data assets?**
- You don’t pay for the long term storage
- Maintained by the provider
- There’s a lot of it

**What are the limitations of cloud-based data assets?**
- Mostly cost and efficiency. Data access isn't free: there are network traffic costs, API request costs and so on, and these still apply even if the provider makes the data available for "free". Accessing data over a large network can also slow your calculations, and repeated use of the same data adds further network costs (possibly more than the cost of downloading the data once and storing it temporarily). All of these need to be managed.
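The trade-off between repeated direct reads and a temporary copy can be sketched with a very simple cost model. Everything here is an assumption for illustration: the function names are hypothetical, the prices you pass in must come from your provider's current price list, and the model ignores request fees and compute time.

```python
def direct_access_cost(size_gb, n_reads, transfer_price_per_gb):
    """Cost of reading the same data over the network n_reads times."""
    return size_gb * n_reads * transfer_price_per_gb


def cached_access_cost(size_gb, transfer_price_per_gb, storage_price_per_gb_month, months):
    """Cost of transferring the data once and keeping a temporary copy."""
    transfer = size_gb * transfer_price_per_gb
    storage = size_gb * storage_price_per_gb_month * months
    return transfer + storage


def break_even_reads(transfer_price_per_gb, storage_price_per_gb_month, months):
    """Number of reads above which the temporary copy becomes cheaper."""
    return 1 + (storage_price_per_gb_month * months) / transfer_price_per_gb
```

If the number of times your workflow re-reads the data exceeds `break_even_reads(...)` for your prices, a temporary copy (for example in User scratch) is likely the cheaper option under this model.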

**How do I upload, access and use cloud-based data?**
- It varies depending on the data source and we can't cover them all, but here are a few examples of common approaches. If you get stuck, ask on the EASI community channel for a pointer in the right direction.

Egress and costs for cross-region/cloud transfers
- Be aware of these costs. EASI will help find ways to make this efficient for you.

Comments on authorisation for other data sources and Python libraries
- You often need to authorise to the source bucket for read access
- AWS: "requester pays" (the `request-payer` option)
- Google Earth Engine (GEE): token-based authentication
- Planetary Computer (PC): asset URLs must be signed
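For the AWS case, a requester-pays read can be sketched with the boto3 `get_object` call and its `RequestPayer` parameter. The helper name and the bucket and key you pass in are placeholders; the request must be signed with your own credentials, and the access charges go to your account.

```python
def read_requester_pays(s3_client, bucket, key):
    """Read one object from a requester-pays bucket and return its bytes."""
    response = s3_client.get_object(
        Bucket=bucket,
        Key=key,
        RequestPayer="requester",  # acknowledge that the caller pays for access
    )
    return response["Body"].read()
```

With a real client this would be called as `read_requester_pays(boto3.client("s3"), "some-requester-pays-bucket", "path/to/object")`.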

Underlying tools need access to the necessary credentials and directives
- GDAL and rasterio
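As a sketch of what those directives look like, the helper below builds GDAL configuration options for S3 access and applies them with `rasterio.Env`. `AWS_REQUEST_PAYER` and `AWS_NO_SIGN_REQUEST` are GDAL configuration options; the helper names and the URL you pass in are illustrative.

```python
def gdal_s3_options(requester_pays=False, unsigned=False):
    """Build GDAL configuration options for reading from S3."""
    if requester_pays and unsigned:
        # Requester-pays requests must be signed so the payer can be identified.
        raise ValueError("requester_pays requires signed requests")
    options = {}
    if unsigned:
        options["AWS_NO_SIGN_REQUEST"] = "YES"  # public buckets, no credentials sent
    if requester_pays:
        options["AWS_REQUEST_PAYER"] = "requester"
    return options


def read_first_block(url, options, size=256):
    """Read a small window of band 1 with the given GDAL options applied."""
    import rasterio  # imported lazily; requires rasterio with network access
    from rasterio.windows import Window

    with rasterio.Env(**options):
        with rasterio.open(url) as src:
            return src.read(1, window=Window(0, 0, size, size))
```

The same options can also be set as environment variables, which is one way to make them visible to Dask workers.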

ODC `configure_s3_access` (here `client` is your Dask distributed client, so the workers receive the same settings)
- `from datacube.utils.aws import configure_s3_access`
- `configure_s3_access(aws_unsigned=False, requester_pays=True, client=client)`

Environment variables
- TBA

Caching-proxy
- TBA

**Exercise / Simulation?:**
- Try odc-stac

## Summary

Knowledge check:

What are the factors associated with choosing storage types in EASI?

Scenario 1:

Scenario 2: