# Data storage and locality <img align="right" src="../resources/csiro_easi_logo.png">

We have the potential to use and generate huge volumes of data, in different formats, for different purposes and across multiple projects and collaborations.  

How we store this data can have significant impacts on: 

- Our compute efficiency and science productivity 
- Our ability to connect with other researchers 
- Our budget 
- Our cybersecurity 
- Our commitment to Research Data Management policies, procedures, and practices 

So, how do we choose the right storage solution?  

This module will help you explore the different storage options on EASI, their use-cases, their limitations and how you can get started using them. By the end of this module, you will be able to: 

- Identify the different available storage options in EASI and recognize: 
    - When to use one or more storage options  
    - The different factors associated with these storage options (cost, security, latency, etc.) 
- Upload data to use with your Jupyter Notebook and decide where to store data outputs for download or re-use by yourself or others. 
- Access resources to help build skills and knowledge to further your learning around data storage options 

## How do we normally store data? 

If you have been using a computer, chances are that you are familiar with storing data on your home drive, or on an external hard drive, or maybe even on the cloud. This is typically referred to as **File Storage**.

With file storage, all data is stored together (usually in one file) and the file extension type helps determine the applications that can read or open the file and access the data (e.g., .jpg, .docx, or .txt).

File storage systems also make it easier for users to find and manage files using a hierarchical structure that organizes files into folders and subfolders. To access a file, select or enter its path, including subdirectories and file name. Most users manage their file storage using a simple tool such as a file manager.

There are alternatives to everyday local file storage, such as network file storage (connect to the network server first, then access by path/file) and *Object Storage*, which is the most relevant to cloud computing.

**Object storage** is like a giant virtual bucket where you can store your belongings in labelled boxes. Each box has a unique ID and can hold different types of objects. You can easily add or remove boxes as needed, and the bucket can expand to accommodate more and more boxes. Plus, you can access any box from anywhere in the world, if you have the ID and access permission. Object storage is ideal for storing vast amounts of unstructured data, such as videos, images, and documents, and is commonly used in cloud storage services. You can have many Buckets where you can store vast amounts of your data and they can all have different access permissions and life cycles. 

EASI supports many types of data storage, local and network file storage and Object Storage, and each method has certain advantages, limitations and requirements. EASI also manages data for all users in the Open Data Cube, and there are third-party datasets accessible directly in the Cloud. 

Some additional terms that may be helpful when talking about Data Storage:

**Latency**

Data latency is the time it takes for data packets to be stored or retrieved.  
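
As a rough illustration of latency, the sketch below times one write and one read of a small temporary file. It is a minimal example rather than a benchmark: the 1 MB payload and the `time_call` helper are hypothetical choices, and a single measurement on local disk will differ from network or object storage.

```python
import tempfile
import time
from pathlib import Path


def time_call(func, *args):
    """Return (result, elapsed_seconds) for a single call to func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


payload = b"x" * 1_000_000  # 1 MB of dummy data (hypothetical size)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.bin"
    _, write_seconds = time_call(path.write_bytes, payload)
    data, read_seconds = time_call(path.read_bytes)

print(f"write: {write_seconds * 1000:.2f} ms, read: {read_seconds * 1000:.2f} ms")
```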

**Data Retrieval**

The process of identifying and extracting data from a database or storage system, based on a query provided by the user or application. It enables the fetching of data from storage in order to display it on a monitor and/or use within an application.  

**Cache** 

A high-speed memory or storage device that helps reduce the time required to read and write data to a slower device, such as a hard drive or a remote server (or bucket).
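
The idea of a cache can be sketched with Python's standard `functools.lru_cache`. Here `fetch` is a hypothetical stand-in for a slow read from a remote server or bucket, and the 0.05 second delay is an invented value used only to simulate latency.

```python
import time
from functools import lru_cache

fetch_count = 0  # how many times the slow source was actually read


@lru_cache(maxsize=128)
def fetch(key: str) -> bytes:
    """Stand-in for a slow read from a remote server or bucket."""
    global fetch_count
    fetch_count += 1
    time.sleep(0.05)  # simulated latency (hypothetical value)
    return f"contents of {key}".encode()


first = fetch("tiles/a.tif")   # slow: goes to the simulated remote source
second = fetch("tiles/a.tif")  # fast: served from the in-memory cache
print(fetch_count, fetch.cache_info().hits)
```

The second call returns the same bytes without touching the slow source, which is the trade a cache makes: extra memory in exchange for fewer slow reads.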

## Different Storage Types in EASI 

- [Type 1: Home Directory](#Home-Directory)
- [Type 2: DataCube](#DataCube)
- [Type 3: User Scratch](#User-Scratch)
- [Type 4: Project Based Data](#Project-Based-Data)
- [Type 5: Cloud Data](#Cloud-Data)

### Home Directory

**What is the Home Directory?**

The Home Directory uses the Amazon Web Services (AWS) Elastic File System (EFS) storage architecture. This is similar to a normal network file system in that it has folders, subfolders and files, and it is very easy and familiar to navigate and use.

**When would I use the Home Directory to store data?**

The Home Directory is best used for small files including source code and notebooks and smaller sets of data outputs.

You can also use the Home Directory to store files that cannot readily work from other storage options (i.e., files or programs that require a normal file system). Again, if these are large, then you may wish to find an alternative and your EASI Administrator will be happy to advise.

**What are the advantages of the Home Directory?**

The data housed in your Home Directory is persistent through log-in/log-out cycles (you will see your home directory in the state you left it). The home directory is also automatically backed up nightly, and backups are retained for 90 days.  

The EASI Admin team can restore a file on request. However, this is not guaranteed. If you are looking for more stable and long-term storage, there are better solutions out there! 

**What are the limitations of the Home Directory?**

This is the most expensive form of storage available on EASI. It is ok to store a Gigabyte or so of data, code and outputs. For larger datasets and outputs there are other EASI storage types available. 
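
To check how close you are to the "Gigabyte or so" guideline, a small helper like the one below can total the file sizes under a directory. This is a minimal sketch using only the Python standard library; the function name `directory_size_bytes` is hypothetical and not part of any EASI tooling.

```python
import os
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = Path(dirpath) / name
            try:
                if not file_path.is_symlink():
                    total += file_path.stat().st_size
            except OSError:
                continue  # file vanished or is unreadable; ignore it
    return total
```

In a notebook, `directory_size_bytes(Path.home()) / 1e9` would report your home directory usage in gigabytes.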

Note that your home directory is not visible to *Dask* workers, as they run in a different part of EASI. We will learn about the Python *Dask* library later in the module.

**How do I upload and use data?**

There’s a video of this and more information in the [JupyterLab documentation](https://jupyterlab.readthedocs.io/en/stable/user/files.html).

- For small files, it is simplest to drag-and-drop your files into the JupyterLab file interface directly from your desktop. This will upload the file via your browser into your home directory.
- For larger files, it's best to use the AWS CLI running locally along with your EASI credentials and upload directly to your [EASI S3 User Scratch space](#User-Scratch).

Example usage and EASI documentation: https://docs.csiro.easi-eo.solutions/user-guide/users-guide/07-uploading-data/ 

**Tasks**

Jump into the training notebooks to explore some of the below activities you can complete. 

- [ ] Familiarise yourself with the JupyterLab documentation.
- [ ] Upload some of your own data or files to your JupyterLab home directory.

### DataCube

**What is DataCube?**

The [Open Data Cube (ODC)](https://github.com/opendatacube) is an open-source geospatial data management and analysis software project that helps you harness the power of satellite data. At its core, the ODC is a set of Python libraries and a PostgreSQL database that help you work with geospatial raster data.

The broad goal of the ODC is to make it easier to access and use large data holdings, without requiring data to be stored in a specific way or in a specific place. What this means is that you can point it at your data repository and index the data where it sits, abstracting the complexity of managing large, distributed data holdings.

**EASI managed data**

EASI manages a range of datasets in its datacube databases. These can be viewed on each EASI deployment's Explorer website and with the *Datacube()* API.

- This is the CSIRO Explorer site: https://explorer.csiro.easi-eo.solutions/
- The EASI documentation [lists the URLs for other EASIs](https://docs.csiro.easi-eo.solutions/user-guide/developers/easi-platform-overview/#easi-deployments-services-and-support).
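
As a minimal sketch of the *Datacube()* API route, the function below lists the products indexed in the datacube database. It assumes the `datacube` package is installed and that a database connection is already configured, which an EASI JupyterLab session is assumed to provide. The function name and the `app` label are hypothetical.

```python
def list_easi_products():
    """Return a table of the products indexed in the datacube database."""
    # Imported inside the function so the sketch can be defined anywhere;
    # calling it needs the datacube package and a configured database
    # (assumed to be available in an EASI JupyterLab session).
    import datacube

    dc = datacube.Datacube(app="list-products-example")  # app is a free-form label
    return dc.list_products()
```

Calling `list_easi_products()` in a notebook should show similar product information to the Explorer website, in tabular form.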

There are many commonalities across the ODC community in the data preparation pipelines that feed a datacube database. We do this so that it's easier to move your work between ODC systems (including third-party ones). The EASI team is working to make our data processing pipelines transparent and open to contributions.

**When would I use the DataCube?**

If you’re using data that’s available as EASI managed data, then it’s much easier to use via the ODC API and you don’t have to have your own copy! See the [easi-notebooks](https://github.com/csiro-easi/easi-notebooks) for getting started tutorials.

**What are the advantages of the DataCube?**

DataCube is managed for you! So, you don’t need to worry about updates, maintenance, storage costs etc. Many EO and large public data collections are available, or becoming available, via the cloud. EASI and ODC connect directly to the current best collection endpoints and API services, and utilise cloud efficiencies and capabilities where possible.

Many of the data collections and processing routines are open source, and are contributed to by the ODC, research and domain-coordination communities. EASI leverages and contributes to these open resources to ensure our users stay up to date.


For development work, the Open Data Cube library also gives you easier access to the data.

You can build and manage your own library and upload your own data into a datacube! Contact easi-help@csiro.au for admin assistance.

**What are the limitations of the DataCube?**

- It may not have the specific dataset you are looking for, but you can build your own and share it!
- The ODC API may not suit your overall workflow, but there are ways to get the underlying list of files so you can still exploit some aspects or access the storage directly.

**How do I upload and use data?**

Getting started with using data from, and uploading to, the Open Data Cube is straightforward. You can explore the training notebooks in the git repositories:

- https://dev.azure.com/csiro-easi/easi-hub-public/_git/hub-notebooks?path=/EASI%20Training/03%20-%20Loading%20Data.ipynb&_a=preview 
- https://github.com/csiro-easi/easi-notebooks (.. TBA)

**Tasks**

Jump into the training notebooks to explore some of the below activities you can complete.

- [ ] Upload a shapefile (copy from some resource or choose your own).
- [ ] Load EO datacube data into an xarray.
- [ ] Mask with an uploaded shapefile.

### User Scratch

### Project Based Data

### Cloud Data

### Scenarios and summary of data storage types

| Type | User r/w | Dask workers r/w | Collaborators r/w |
|--|--|--|--|
| [Home directory](#Home-directory) | Yes | No | No |
| [User scratch](#User-scratch) | Yes | Yes | No |
| [Project data](#Project-data) | Yes | Yes | Yes |
| [Data cube](#Data-cube) | Read | Read | Read |
| [Cloud data](#Cloud-data) | Read | Read | Read |
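
The table above can also be expressed as a small lookup, which may help when reasoning through the scenarios. This is only an illustrative sketch: the `STORAGE_ACCESS` dictionary and the `storage_for` helper are hypothetical names, and the values are transcribed from the table ("Yes" as read and write, "Read" as read-only, "No" as no access).

```python
# Access summary transcribed from the table above.
# "rw" = read and write, "r" = read-only, "" = no access.
STORAGE_ACCESS = {
    "Home directory": {"user": "rw", "dask": "", "collaborators": ""},
    "User scratch": {"user": "rw", "dask": "rw", "collaborators": ""},
    "Project data": {"user": "rw", "dask": "rw", "collaborators": "rw"},
    "Data cube": {"user": "r", "dask": "r", "collaborators": "r"},
    "Cloud data": {"user": "r", "dask": "r", "collaborators": "r"},
}


def storage_for(who: str, mode: str) -> list[str]:
    """List storage types where `who` has at least `mode` ("r" or "rw") access."""
    return [name for name, access in STORAGE_ACCESS.items() if mode in access[who]]


print(storage_for("dask", "rw"))           # where Dask workers can write
print(storage_for("collaborators", "rw"))  # where collaborators can write
```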

Scenario 1
- I want to upload some small data files and use these with datacube data
- > home directory

Scenario 2
- I want to upload a large volume of data from my organisation to use in EASI
- > get EASI AWS credentials (from EASI JupyterLab)
- > upload to user-scratch or project buckets

Scenario 3
- I am generating a large volume of output files and I'd like to store them somewhere. These may be coming from my notebook and/or Dask workers
- Are they temporary or to be shared with collaborators?
- > temporary > user scratch
- > shared > project bucket

Scenario 4
- I want to add my own product to the datacube
- > Very good! Talk to the EASI admins; we're making this easier.

Scenario 5
- I want to access other data from the cloud
- > consider region transfer costs
- > consider cloud authentication
- > consider caching-proxy
- > try odc-stac

## Home directory

Your EASI home directory is available in JupyterLab. It appears in JupyterLab as the left panel and via the terminal.

*POSIX file system*

TODO: Add pic

**What is the Home Directory?**
- The Home Directory uses the Amazon Web Services (AWS) Elastic File System (EFS) storage architecture and is similar to a normal network file system.

**What are the advantages of the Home Directory?**
- Automatically backed-up nightly and retained for 90 days.
- Contact EASI Admins to restore a file (but you should not rely on us)

**When would I use the Home Directory to store data?**
- Small files, source code, notebooks
- File formats that cannot work from other storage options (they require a normal file system, though if these are large we would recommend finding an alternative)

**What are the limitations of the Home Directory?**
- This is the most expensive form of storage available; other choices are better for large data.
- Your home directory is not visible from Dask workers (the S3 storage options are)

**How do I upload and use data?**

TODO: link to a video of this and manual in standard jupyter labs documentation 

For small files it is simplest to drag-and-drop your files into the JupyterLab file interface directly from your desktop. This will upload the file via your browser into your home directory. 

For larger files it's best to use the AWS CLI running locally along with your EASI credentials and upload directly to the **User scratch** storage.

**Exercise / Simulation?**: Upload a file into JupyterLab 

## User scratch

*Object store*
- Each file is a generic *object* with a *key*. A *key* is a string. There is no directory structure; however, we humans often use '/' in keys to create a folder-like view.
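
The folder-like view over flat keys can be sketched in a few lines. The helper below groups keys under a prefix into "files" and "folders" by splitting on a delimiter, which is similar in spirit to listing an S3 bucket with a prefix and delimiter. The function name and the example keys are hypothetical.

```python
def list_prefix(keys, prefix="", delimiter="/"):
    """Group flat object keys into "files" and "folders" under a prefix."""
    files, folders = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter looks like a sub-folder.
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            files.append(key)
    return files, sorted(folders)


# Hypothetical keys; the bucket itself stores them as one flat list of strings.
keys = [
    "run1/ndvi/2020.tif",
    "run1/ndvi/2021.tif",
    "run1/summary.csv",
    "readme.txt",
]

print(list_prefix(keys))           # top-level view
print(list_prefix(keys, "run1/"))  # view "inside" run1/
```

No folder is ever created or deleted here; the "folders" exist only because several keys share the same leading text.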

EASI has a "User scratch" S3 storage bucket that users can write to.
- You can use the AWS CLI and library functions to write larger files to this bucket.
- Objects in the bucket are subject to a 30-day lifecycle rule (counted from creation), so it’s appropriate for saving intermediate results and distributing data to your Dask workers
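
To plan around the 30-day rule, you can estimate when an object will reach the end of its lifetime. The sketch below assumes the rule removes objects 30 days after creation; the exact removal time is controlled by the storage service and may be later, so treat the result as approximate. The function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

SCRATCH_LIFETIME = timedelta(days=30)  # from the 30-day rule described above


def expires_at(created: datetime) -> datetime:
    """Earliest time an object becomes eligible for removal (assumed behaviour)."""
    return created + SCRATCH_LIFETIME


def days_remaining(created: datetime, now: datetime) -> float:
    """Days left before the object is eligible for removal (negative if overdue)."""
    return (expires_at(created) - now) / timedelta(days=1)


created = datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc)  # example timestamp
now = datetime(2024, 3, 21, 9, 0, tzinfo=timezone.utc)
print(expires_at(created).isoformat(), days_remaining(created, now))
```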

**When would I use the UserScratch?**
- When your workflow requires a large, temporary storage location
- For use with Dask workers

TODO: Migrate notebook to easi-notebooks, https://dev.azure.com/csiro-easi/easi-hub-public/_git/hub-notebooks?path=/EASI%20Training/A4%20-%20User%20Scratch%20Usage.ipynb&_a=preview 

**What are the advantages of the UserScratch?**
- Lots of storage and scale
- Works with Dask workers
- Has a managed lifetime in case you forget

**What are the limitations of the UserScratch?**

It’s Object Storage, which is a bit different to POSIX, so it requires a different method to access and manage the data. Most tooling these days supports object storage directly, but older tools that haven’t been updated may not.
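
As one example of that different access method, the sketch below uploads a local file as a single object using the `boto3` library rather than a file copy. It assumes `boto3` is installed and that AWS credentials are available in the environment. The bucket name, user id and per-user key layout are placeholders, not EASI's actual configuration.

```python
def scratch_key(user_id: str, filename: str) -> str:
    """Build an object key; the per-user prefix layout is an assumption."""
    return f"{user_id}/{filename}"


def upload_to_scratch(local_path: str, bucket: str, key: str) -> None:
    """Upload one local file to an S3 bucket as a single object."""
    # boto3 is imported here so the sketch can be defined without it installed;
    # credentials are picked up from the environment.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(Filename=local_path, Bucket=bucket, Key=key)


# Hypothetical usage: the bucket name and user id are placeholders.
# upload_to_scratch("result.nc", "example-user-scratch",
#                   scratch_key("my-user-id", "result.nc"))
```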

**Exercise / Simulation?:** Specific task demo (e.g., save the xarray masked result (from the previous simulation) to user scratch)

## Project data

Similar to user scratch

**What is Project Data?**
- Similar to the “user scratch bucket”, projects can also make additional storage resources available for their users in EASI. That is, the project bucket can be shared across all project members, with different levels of membership (read/write, read-only)
- There is an EASI-account-hosted project bucket; you can request a project share by contacting your EASI Administrators
- EASI can also use cross-account authorisation to access project-managed resources, which is a much better option as the project then remains in full control of its resources
- Contact EASI for more information

**When would I use project data?**
- You need to share project data between multiple people
- You want to manage your own large scale data in your own project account

**What are the advantages?**
- Can be shared to others
- Custom lifecycle rules
- Can be completely managed by the project team in their own account.

**What are the limitations?**
- Only selected users will be able to read or read/write to a project bucket.
- The EASI Admins can add/update users' access to project buckets.

**How do I upload and use data?**
- Similar to User scratch but change the bucket name.

**Exercise / Simulation?:** None

## Data cube

Info only. Examples are everywhere else

**What is DataCube?**
- These are indexed in the datacube database and available from the datacube API
- EASI managed data (what we do)
- https://explorer.csiro.easi-eo.solutions/ (what we have)

**When would I use the DataCube?**
- If you’re using data that’s available as EASI managed data, then it’s much easier to use via the ODC API and you don’t have to have your own copy!

**What are the advantages of the DataCube?**
- Managed for you
- Shared
- Easier access via Open Data Cube library
- You can build and manage your own (contact easi-help@csiro.au for admin assistance)

**What are the limitations of the DataCube?**
- May not have your data – but you can build your own
- ODC API may not suit your overall workflow – but there are ways to get the underlying list of files so you can still exploit some aspects or access it directly

**How do I upload and use data?**
- Data preparation and indexing to the datacube is managed by the EASI admins
- We intend to make this easier for users to contribute their own products

Ways to use datacube data:
- datacube API
- explorer and ows web services

TODO: Screenshots

TODO: Migrate https://dev.azure.com/csiro-easi/easi-hub-public/_git/hub-notebooks?path=/EASI%20Training/03%20-%20Loading%20Data.ipynb&_a=preview

**Exercise / Simulation?:** 
- upload a shapefile (copy from some resource or choose your own)
- load EO datacube into an xarray
- mask with an uploaded shapefile

## Cloud-computing data

Links to other resources. Intro to odc-stac.

**What is Cloud Data and what cloud data services / technologies does EASI support?**
- Many data providers are making their data available directly in the Cloud.
- The amount and variety of data and providers is vast and we can’t cover them all here but will highlight that they can in most cases be accessed directly from EASI.

**When would I use cloud-based data assets?**
- If a provider has data you want and it's just as efficient to access it directly as it is to download and store it (why bother curating it yourself!)

**What are the advantages of cloud-based data assets?**
- You don’t pay for the long term storage
- Maintained by the provider
- There’s a lot of it

**What are the limitations of cloud-based data assets?**
- Cost and efficiency, mostly. Data access isn’t free: there are network traffic costs, API access costs, and so on. Even if the provider makes the data available for “free”, these other costs still occur. Additionally, accessing data over a large network can slow calculations, and repeated use in calculations will result in more network costs (possibly more than the cost of downloading the data and temporarily storing it). All of these need to be managed.
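
The trade-off between reading remote data repeatedly and copying it once can be made concrete with a little arithmetic. The sketch below compares the two under a simple model with a per-GB transfer price and a per-GB monthly storage price. All prices and sizes are invented for illustration and are not EASI or cloud provider rates.

```python
def repeated_access_cost(size_gb: float, reads: int, transfer_per_gb: float) -> float:
    """Cost of reading the remote data directly on every run."""
    return size_gb * reads * transfer_per_gb


def copy_once_cost(size_gb: float, transfer_per_gb: float,
                   storage_per_gb_month: float, months: float) -> float:
    """Cost of transferring the data once and storing a temporary copy."""
    return size_gb * transfer_per_gb + size_gb * storage_per_gb_month * months


# Hypothetical prices in dollars, for illustration only.
size_gb, transfer, storage = 500.0, 0.02, 0.025
direct = repeated_access_cost(size_gb, reads=10, transfer_per_gb=transfer)
copied = copy_once_cost(size_gb, transfer, storage, months=1)
print(f"direct: ${direct:.2f}, copy once: ${copied:.2f}")
```

With these made-up numbers, ten direct reads cost more than one transfer plus a month of temporary storage; with only one or two reads the comparison reverses, which is why the access pattern matters.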

**How do I upload, access and use cloud-based data?**
- It varies depending on the data source and we can’t cover them all. Here are a few examples of common approaches. If you get stuck, ask the EASI community channel for a pointer in the right direction.

Egress and costs for cross-region/cloud transfers
- Be aware of these. EASI will help find ways to make this efficient for you

Comments on authorisation for other data sources and Python libraries
- You often need to authorise to the source bucket for read access
- AWS (“requester pays” or “request-payer”)
- Google Earth Engine (GEE): token-based authentication
- Planetary Computer (PC): sign asset URLs

Underlying tools need access to the necessary credentials and directives
- GDAL and rasterio

ODC `configure_s3_access`
- `from datacube.utils.aws import configure_s3_access`
- `configure_s3_access(aws_unsigned=False, requester_pays=True, client=client)`
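
The call above can be wrapped in a small function, as a sketch. It assumes the `datacube` package is installed and that `client` is an optional `dask.distributed.Client`, so that Dask workers receive the same settings. The wrapper name `configure_requester_pays` is hypothetical.

```python
def configure_requester_pays(client=None):
    """Configure S3 access for GDAL/rasterio reads, locally and on Dask workers."""
    # Imported inside the function so the sketch can be defined without datacube
    # installed; an EASI JupyterLab session is assumed to provide it.
    from datacube.utils.aws import configure_s3_access

    # aws_unsigned=False: sign requests with your credentials.
    # requester_pays=True: accept transfer charges for requester-pays buckets.
    # client: optional dask.distributed.Client so workers get the same settings.
    return configure_s3_access(aws_unsigned=False, requester_pays=True, client=client)
```

Calling `configure_requester_pays()` without a client configures only the local process; pass your Dask client when the workers will be doing the reading.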

Environment variables
- TBA

Caching-proxy
- TBA

**Exercise / Simulation?:**
- Try odc-stac

## Summary

Knowledge check:

What are the factors associated with choosing storage types in EASI?

Scenario 1:

Scenario 2: