# Data storage and locality <img align="right" src="../resources/csiro_easi_logo.png">

We have the potential to use and generate huge volumes of data, in different formats, for different purposes and across multiple projects and collaborations.

How we store this data can have significant impacts on:

- Our compute efficiency and science productivity
- Our ability to connect with other researchers
- Our budget
- Our cybersecurity
- Our commitment to Research Data Management policies, procedures, and practices

So, how do we choose the right storage solution?

This module will help you explore the different storage options on EASI, their use-cases, their limitations and how you can get started using them. By the end of this module, you will be able to:

- Identify the different available storage options in EASI and recognize:
    - When to use one or more storage options
    - Considerations associated with these storage options (cost, security, latency, etc.)
- Upload data to use with your Jupyter Notebook, and know where to store data outputs for download or re-use by yourself or others.
- Access resources to help build skills and knowledge to further your learning around data storage options

**Contents**:
- [How do we normally store data?](#How-do-we-normally-store-data?)
- [Different storage types in EASI](#Different-storage-types-in-EASI)
- [Decision making matrix](#Decision-making-matrix)
- [Scenarios](#Scenarios)

## How do we normally store data? 

If you have been using a computer, chances are that you are familiar with storing data on your home drive, or on an external hard drive, or maybe even on the cloud. This is typically referred to as **File Storage**.

With file storage, all data is stored together (usually in one file) and the file extension type helps determine the applications that can read or open the file and access the data (e.g., .jpg, .docx, or .txt).

File storage systems also make it easier for users to find and manage files using a hierarchical structure that organizes files into folders and subfolders. To access a file, select or enter its path, including subdirectories and file name. Most users manage their file storage with a simple tool such as a file manager.

There are alternatives to everyday local file storage, such as network file storage (connect to the network server first, then access files by path) and *Object Storage*, which is the most relevant to cloud computing.

**Object storage** is like a giant virtual bucket where you can store your belongings in labelled boxes. Each box has a unique ID and can hold different types of objects. You can easily add or remove boxes as needed, and the bucket can expand to accommodate more and more boxes. Plus, you can access any box from anywhere in the world, if you have the ID and access permission. Object storage is ideal for storing vast amounts of unstructured data, such as videos, images, and documents, and is commonly used in cloud storage services. You can have many Buckets where you can store vast amounts of your data and they can all have different access permissions and life cycles. 
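The bucket-and-boxes idea can be sketched in a few lines of Python. This is a toy, in-memory stand-in written for this module (the `ToyBucket` class and its method names are hypothetical, not a real storage API). It shows that a bucket is a flat collection of objects looked up by a unique key, and that "folders" are really just shared key prefixes.

```python
class ToyBucket:
    """In-memory stand-in for an object storage bucket (illustration only)."""

    def __init__(self, name):
        self.name = name
        self._objects = {}  # flat mapping: unique key -> stored bytes

    def put(self, key, data):
        """Store (or overwrite) an object under its key."""
        self._objects[key] = data

    def get(self, key):
        """Retrieve an object by its key."""
        return self._objects[key]

    def list(self, prefix=""):
        """List keys that start with a prefix; this is what mimics folders."""
        return sorted(k for k in self._objects if k.startswith(prefix))


bucket = ToyBucket("my-toy-bucket")
bucket.put("results/2024/ndvi.tif", b"raster bytes")
bucket.put("results/2024/notes.txt", b"some notes")
bucket.put("inputs/sites.geojson", b"{}")

print(bucket.list(prefix="results/"))
```

Real object storage adds access permissions, lifecycles and network access on top of this basic key-to-object model.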

EASI supports several types of data storage (local file storage, network file storage and Object Storage), and each has its own advantages, limitations and requirements. EASI also manages data for all users in the Open Data Cube, and there are third-party datasets accessible directly in the cloud.

Some additional terms that may be helpful when talking about Data Storage:

**Latency**

Data latency is the time it takes for data packets to be stored or retrieved.  

**Data Retrieval**

The process of identifying and extracting data from a database or storage system, based on a query provided by the user or application. It enables the fetching of data from storage in order to display it on a monitor and/or use within an application.  

**Cache** 

A high-speed memory or storage device that helps reduce the time required to read and write data to a slower device, such as a hard drive or a remote server (or bucket).
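As a small illustration of caching, the sketch below uses Python's standard `functools.lru_cache` to keep recent results in memory. The `read_object` function is a made-up placeholder for a slow read from a hard drive or remote bucket, and the counter only exists to show how often the "slow" path is taken.

```python
from functools import lru_cache

stats = {"slow_reads": 0}  # counts how often we go to the "slow" storage


@lru_cache(maxsize=32)
def read_object(key):
    """Pretend to fetch an object from slow or remote storage."""
    stats["slow_reads"] += 1
    return f"contents of {key}"


first = read_object("scene-001")   # goes to the slow storage
second = read_object("scene-001")  # answered from the in-memory cache

print(stats["slow_reads"])
```

The second call returns the same contents without touching the slow storage again, which is the time saving a cache provides.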

## Different storage types in EASI

- [Type 1: Home Directory](#Home-Directory)
- [Type 2: DataCube](#DataCube)
- [Type 3: User Scratch](#User-Scratch)
- [Type 4: Project Data](#Project-Data)
- [Type 5: Cloud Data](#Cloud-Data)

### Home Directory

**What is the Home Directory?**

The Home Directory uses the Amazon Web Services (AWS) Elastic File System (EFS) storage architecture. This is similar to a normal network file system in that it has folders, subfolders and files, and it is easy and familiar to navigate and use.

**When would I use the Home Directory to store data?**

The Home Directory is best used for small files including source code and notebooks and smaller sets of data outputs.

You can also use the Home Directory to store files that cannot readily work from other storage options (i.e., files or programs that require a normal file system). Again, if these are large, then you may wish to find an alternative and your EASI Administrator will be happy to advise.

**What are the advantages of the Home Directory?**

The data housed in your Home Directory is persistent through log-in/log-out cycles (your home directory will be as you left it). The home directory is also automatically backed-up nightly and retained for 90 days.  

The EASI Admin team can usually restore a file from these backups, but this is not guaranteed. If you are looking for more stable and long-term storage, there are better solutions out there!

**What are the limitations of the Home Directory?**

This is the most expensive form of storage available on EASI. It is ok to store a Gigabyte or so of data, code and outputs. For larger datasets and outputs there are other EASI storage types available. 

Note that your home directory is not visible to *Dask* workers, as they run in a different part of EASI. We will learn about the Python *Dask* library later in the module.

**How do I upload and use data?**

The [JupyterLab documentation](https://jupyterlab.readthedocs.io/en/stable/user/files.html) includes a video and further information on this.

- For small files, it is simplest to drag-and-drop your files into the JupyterLab file interface directly from your desktop. This will upload the file via your browser into your home directory.
- For larger files, it's best to run the AWS CLI locally with your EASI credentials and upload directly to your [EASI S3 User Scratch space](#User-Scratch).

Example usage and EASI documentation: https://docs.csiro.easi-eo.solutions/user-guide/users-guide/07-uploading-data/ 

**Tasks**

Open an EASI JupyterLab session and a separate browser tab to: 

- [ ] Familiarise yourself with the JupyterLab documentation.
- [ ] Upload some of your own data or file to your JupyterLab home directory.

### DataCube

**What is DataCube?**

The [Open Data Cube (ODC)](https://github.com/opendatacube) is an open-source geospatial data management and analysis software project that helps you harness the power of satellite data. At its core, the ODC is a set of Python libraries and a PostgreSQL database that help you work with geospatial raster data.

The broad goal of the ODC is to make it easier to access and use large data holdings, without requiring data to be stored in a specific way or in a specific place. What this means is that you can point it at your data repository and index the data where it sits, abstracting the complexity of managing large, distributed data holdings.

**EASI managed data**

EASI manages a range of data in its datacube databases. These can be viewed in each EASI's Explorer website and with the *Datacube()* API.

- This is the CSIRO Explorer site: https://explorer.csiro.easi-eo.solutions/
- The EASI documentation [lists the URLs for other EASIs](https://docs.csiro.easi-eo.solutions/user-guide/developers/easi-platform-overview/#easi-deployments-services-and-support).

There are many commonalities in the data processing workflows (into datacube databases) across the ODC community. We follow these common practices so that it's easier to move your work between ODC systems (including third-party ones). The EASI team is working to make our data processing workflows transparent and open to contributions.

**When would I use the DataCube?**

If you’re using data that’s available as EASI managed data, then it’s much easier to use via the ODC API and you don’t have to have your own copy! See the [easi-notebooks](https://github.com/csiro-easi/easi-notebooks) for getting started tutorials.

**What are the advantages of the DataCube?**

DataCube is managed for you! So, you don’t need to worry about updates, maintenance, storage costs etc. Many EO and large public data collections are available, or becoming available, via the cloud. EASI and ODC connect directly to the current best collection endpoints and API services, and utilise cloud efficiencies and capabilities where possible.

Many of the data collections and processing routines are open source, and are contributed to by the ODC, research and domain-coordination communities. EASI leverages and contributes to these open resources to ensure our users stay up to date. The EASI team welcomes suggestions and contributions to the development of current or new data processing workflows.

**What are the limitations of the DataCube?**

It may not have the specific dataset you are looking for. However, datasets are always being added and updated, and you can build your own and share it!

The Datacube API may not suit your overall workflow. Just remember that the datacube shows “one way to do it”. Talk to the EASI team about your workflow and we can advise what other options are available or could be combined. For example, you can access the list of files (rather than the data) that the datacube API would return for your query, and you can likely access the same sources that EASI’s workflows do.

**How do I upload and use data?**

Getting started with using data from, and uploading to, the Open Data Cube is straightforward. You can explore the tutorial notebooks in the git repository: https://github.com/csiro-easi/easi-notebooks.
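To give a feel for the ODC API before you open the tutorials, here is a minimal sketch of a datacube query. The product name, coordinates, dates and measurement names are placeholders (assumptions for illustration, not a real EASI product), and the `build_query` and `load_from_datacube` helpers are hypothetical names. The `datacube` import sits inside the function because the library is only expected to be available in an environment such as an EASI JupyterLab session.

```python
def build_query(product, lon_range, lat_range, time_range, measurements):
    """Collect the search terms passed to Datacube.load()."""
    return {
        "product": product,
        "x": lon_range,       # (min longitude, max longitude)
        "y": lat_range,       # (min latitude, max latitude)
        "time": time_range,   # (start date, end date)
        "measurements": measurements,
    }


def load_from_datacube(query, **load_options):
    """Run the query and return an xarray.Dataset."""
    import datacube  # imported here; needs a configured datacube environment

    dc = datacube.Datacube(app="data-storage-module")
    return dc.load(**query, **load_options)


# Placeholder values: replace with a product listed in your EASI Explorer.
query = build_query(
    product="example_product",
    lon_range=(149.0, 149.2),
    lat_range=(-35.4, -35.2),
    time_range=("2022-01-01", "2022-01-31"),
    measurements=["red", "nir"],
)
print(query)

# In an EASI session you might then call, for example:
# data = load_from_datacube(query, dask_chunks={"time": 1})
```

Some products also need load options such as `output_crs` and `resolution`; the easi-notebooks tutorials show suitable values for the EASI-managed products.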

**Tasks**

Open an EASI JupyterLab session to:

- [ ] Upload a shapefile (copy from some resource or choose your own)
- [ ] Load EO data from the datacube into an xarray object
- [ ] Mask it with the uploaded shapefile

### User Scratch

**What is User Scratch?**

EASI has a "scratch" bucket available for all EASI users to write to. Scratch storage serves as a temporary location for large data sets that are more appropriate to save and access from AWS S3 storage. These could be intermediate datasets in your workflow or interim results for testing and exploration.

**When would I use the User Scratch?**

The User Scratch bucket is helpful to save files between processing runs or share files between your projects. You can use Scratch as large, efficient temporary storage for your workflows and projects, e.g. for saving intermediate results from your workflows.

Scratch can be accessed by *Dask* workers for passing data and saving snapshots of results, which is an advantage over Home Directory storage. Don’t know what *Dask* is? Don’t worry, we’ll cover Dask in detail.

**What are the advantages of the User Scratch?**

It’s a dedicated S3 Bucket in EASI that provides:

- Lots of storage and scale (efficient)
- Works with Dask workers (fast)
- Has a managed lifetime, so if you forget, it will (eventually) clean up after you
- You can use the AWS CLI and library functions to read and write files to this bucket

**What are the limitations of the User Scratch?**

User Scratch is *Object Storage*, which is a bit different to file storage, so it requires a different method to access and manage the data. Many tools support efficient read/write to cloud object storage directly but some older tools that you use may not. If so, use the AWS CLI to make a temporary copy of the files to your JupyterLab home directory or to dask workers.

Secondly, objects in the user scratch bucket have a 30-day (since date of creation) lifecycle rule. It is not intended for long-term data storage. Similarly, it is also not a practical solution if you need to share files and resources between projects, particularly beyond 30 days. If sharing data or a different lifecycle is a project requirement, then consider provisioning a "Project" bucket tailored to your needs.
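As a quick worked example of the 30-day rule, the sketch below estimates when an object written on a given date would become eligible for removal. The function name is hypothetical, and the exact clean-up time is decided by the storage service, so treat the result as approximate.

```python
from datetime import date, timedelta

SCRATCH_LIFETIME_DAYS = 30  # lifecycle rule described above


def approximate_expiry(created):
    """Estimate the date after which a scratch object may be removed."""
    return created + timedelta(days=SCRATCH_LIFETIME_DAYS)


expiry = approximate_expiry(date(2024, 3, 1))
print(expiry)  # 2024-03-31
```

Note that the clock starts at creation, not at last access, so reading a file does not extend its life.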

**How do I upload and use data?**

User guide / additional information:

- https://docs.csiro.easi-eo.solutions/user-guide/users-guide/07-uploading-data/
- https://github.com/csiro-easi/easi-notebooks
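As a minimal sketch of the "library functions" route, the example below uses the `boto3` library to copy a file to and from an S3 bucket. The bucket name, the per-user prefix and the helper names are illustrative assumptions, not values taken from the EASI documentation; check the user guide above for the bucket and prefix that apply to your EASI deployment.

```python
from pathlib import Path


def object_key(prefix, local_path):
    """Build an object key such as 'my-prefix/result.nc' from a local path."""
    return f"{prefix.strip('/')}/{Path(local_path).name}"


def upload_file(local_path, bucket, prefix):
    """Upload a local file to s3://<bucket>/<prefix>/<file name>."""
    import boto3  # imported here so object_key() works without boto3

    key = object_key(prefix, local_path)
    boto3.client("s3").upload_file(str(local_path), bucket, key)
    return f"s3://{bucket}/{key}"


def download_file(bucket, key, local_dir="."):
    """Copy an object to a local folder, e.g. for tools that need a file system."""
    import boto3

    target = Path(local_dir) / Path(key).name
    boto3.client("s3").download_file(bucket, key, str(target))
    return target


# Hypothetical usage (bucket and prefix are placeholders):
# upload_file("outputs/result.nc", "example-user-scratch", "my-user-id")
# download_file("example-user-scratch", "my-user-id/result.nc")
print(object_key("my-user-id/", "outputs/result.nc"))
```

The same functions work for any bucket you are permitted to use; only the bucket name and prefix change.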

**Tasks**

- [ ] Save your xarray masked result (from previous step) to User Scratch

### Project Data

**What is Project Data?**

Like the “User Scratch bucket”, which all EASI users can access, each project can also make additional storage resources available for their users in EASI. This is known as a **Project** bucket, and it can be shared across all project members and have different levels of membership (read/write, read-only).

Create and manage your own AWS account, and then create a bucket that can be shared with EASI. The EASI team will coordinate with you to enable cross-account authorisation between EASI and your shared project bucket. You retain full control of your AWS resources. Contact EASI for more information.

**When would I use Project Data?**

Project Data is best used when you need to share data and results with collaborators in a project, or when you want to manage your own large data in your own project account with your own lifecycle rules.

**What are the advantages?**

From EASI, a Project bucket is only accessible by users who have been nominated as a member of the *project*. This allows sharing with colleagues who are part of the same project.

The project's AWS account administrator can create custom lifecycle rules and can enable other services that might be appropriate for the project.

**What are the limitations?**

Project Data is only accessible to people who have been given access to this storage (i.e. colleagues in the same project), which is the intent.

**How do I upload and use data?**

The methods for uploading or reading data from a project bucket are the same as for the User Scratch bucket. Just change the bucket name!

**User guide / additional information:**

In CSIRO, contact the Cloud Platforms Team to set up your project AWS account. Then contact the EASI team to enable cross-account authorisation and to nominate users and their read/write permissions.

**Tasks**

- [ ] If you have a project bucket then use the same approach as for User Scratch to upload your xarray result to the project bucket.
- [ ] If not, then the only task is to remember that this option exists!

### Cloud Data

**What is cloud-available data and what cloud data services / technologies does EASI support?**

The “cloud” in this context refers to the main global cloud data companies – Amazon Web Services, Google Earth Engine and Microsoft Planetary Computer. However, the ideas mentioned here are also reasonably applicable to many internet-available datasets, although the specific way to access these may differ.

Many Earth observation and other public data collection providers are making their data available directly in the cloud. In many cases the cloud copy is their ‘authoritative’ public copy of the data. The same data collection may also be available in more than one cloud system.

The amount and variety of data and providers is vast, and we can’t cover them all here but will highlight that they can in most cases be accessed directly from EASI.

**When would I use cloud-based data assets?**

When a provider has data you want and it's just as efficient to access it directly as it is to download and store it (why bother curating it yourself!).

**What are the advantages of cloud-based data assets?**

- You don’t pay for the long-term storage
- Maintained by the data provider
- There’s a lot of it

Many of the EASI-managed datacube products are indexed directly or indirectly from the cloud.

**What are the limitations of cloud-based data assets?**

Cost and efficiency, mostly. While these cloud data collections are publicly available, they are not necessarily free to access. The cloud companies typically charge network data transfer and related costs for moving data within, into and out of their systems.

Additionally, accessing data over the internet can slow workflow speed and repeated use in workflows will result in more network costs (possibly more than the cost of downloading the data and temporarily storing it). All these need to be managed.

For small, infrequent access (like one-off grabs of data) you can safely just use them. For large data use or frequent re-use, contact your EASI Admin if you are unsure and we’ll help you navigate. It’s not difficult, just awkward at the start to understand what to watch for.
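The trade-off between repeated remote reads and a one-time copy can be sketched as simple arithmetic. Everything here is an assumption for illustration: the function names are hypothetical, the rates are made-up numbers rather than real provider prices, and the model ignores details such as request charges or free transfers within a region.

```python
def repeated_read_cost(size_gb, reads, transfer_per_gb):
    """Network cost of reading the same remote data `reads` times."""
    return size_gb * reads * transfer_per_gb


def copy_once_cost(size_gb, months, transfer_per_gb, storage_per_gb_month):
    """Cost of one transfer plus temporary storage of the copy."""
    return size_gb * transfer_per_gb + size_gb * months * storage_per_gb_month


# Made-up numbers purely for illustration; substitute your provider's rates.
size_gb, reads, months = 100, 5, 1
transfer_per_gb, storage_per_gb_month = 0.02, 0.025

remote = repeated_read_cost(size_gb, reads, transfer_per_gb)
local = copy_once_cost(size_gb, months, transfer_per_gb, storage_per_gb_month)
print("copy once" if local < remote else "read remotely")
```

With these placeholder rates the one-time copy is cheaper because the data is read several times; with a single read the comparison reverses.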

**How do I access and use cloud-based data?**

This varies depending on the data source and we can’t cover them all. However, there are common considerations highlighted below to help you think about how to use cloud-based data. Each cloud data company provides a list or summary of their available public data collections. It is important to be able to locate these and to identify and interpret some of the key points.

- [Registry of Open Data on AWS](https://registry.opendata.aws/)
- [Earth Engine Data Catalog](https://developers.google.com/earth-engine/datasets)
- [Planetary Computer Data Catalog](https://planetarycomputer.microsoft.com/catalog)

The OpenDataCube provides a tool that can read from the cloud data collections and return an xarray object as if the data had been read from a datacube database. Examples are provided here: https://odc-stac.readthedocs.io/en/latest/examples.html
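As a hedged sketch of that approach, the example below searches a STAC API with `pystac_client` and loads the matching items with `odc.stac.load`. The helper names are hypothetical, and the catalogue URL, collection name and band names in the usage comment are assumptions for illustration (they are not taken from this module), so check the provider's catalogue for the values that apply to your dataset.

```python
def bbox_around(lon, lat, half_width_deg):
    """Return a STAC-style bounding box: (west, south, east, north)."""
    return (
        lon - half_width_deg,
        lat - half_width_deg,
        lon + half_width_deg,
        lat + half_width_deg,
    )


def search_and_load(stac_url, collection, bbox, datetime_range, bands):
    """Search a STAC API and lazily load the results as an xarray.Dataset."""
    import odc.stac        # imported here; requires the odc-stac package
    import pystac_client   # imported here; requires the pystac-client package

    catalog = pystac_client.Client.open(stac_url)
    search = catalog.search(
        collections=[collection], bbox=bbox, datetime=datetime_range
    )
    items = list(search.items())
    # chunks={} asks for Dask-backed (lazy) arrays rather than an eager read
    return odc.stac.load(items, bands=bands, bbox=bbox, chunks={})


bbox = bbox_around(149.0, -35.0, 0.5)
print(bbox)

# Hypothetical usage (URL, collection and bands are assumptions):
# data = search_and_load(
#     "https://earth-search.aws.element84.com/v1",
#     "sentinel-2-l2a",
#     bbox,
#     "2022-01-01/2022-01-31",
#     ["red", "nir"],
# )
```

Because the load is lazy, no pixel data is transferred until you compute or plot a result, which helps keep network use under control.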

**Tasks**

- [ ] Let's look at a data catalog example from AWS: https://registry.opendata.aws/sentinel-2-l2a-cogs/

<img align="center" src="../resources/aws-opendata-markup.png">

## Decision making matrix

This table summarises the key attributes of the different storage types.

<table>
<tr align="center">
  <th>&nbsp;
  <th colspan=2>User access rights
  <th colspan=2>Dask workers access rights
  <th colspan=2>Collaborator access rights
  <th colspan=2>Suitable file size
  <th colspan=2>Network costs
  <th colspan=2>Storage costs
  <th colspan=2>Shelf life
<tr align="center">
  <td>&nbsp;
  <td><em>Read<td><em>Write
  <td><em>Read<td><em>Write
  <td><em>Read<td><em>Write
  <td><em>Large<td><em>Small
  <td><em>High<td><em>Low
  <td><em>High<td><em>Low
  <td><em>Short term<td><em>Long term
<tr align="center">
  <td><a href="#Home-Directory">Home directory</a><td>Yes<td>Yes<td>No<td>No<td>No<td>No<td>&nbsp;<td>X<td>&nbsp;<td>X<td>X<td>&nbsp;<td>X<td>X
<tr align="center">
  <td><a href="#User-Scratch">User scratch</a><td>Yes<td>Yes<td>Yes<td>Yes<td>No<td>No<td>X<td>X<td>&nbsp;<td>X<td>&nbsp;<td>X<td>X<td>&nbsp;
<tr align="center">
  <td><a href="#Project-Data">Project data</a><td>Yes<td>Yes<td>Yes<td>Yes<td>Yes<td>Yes<td>X<td>X<td>&nbsp;<td>X<td>&nbsp;<td>X<td>X<td>X
<tr align="center">
  <td><a href="#DataCube">Data cube</a><td>Yes<td>No<td>Yes<td>No<td>Yes<td>No<td>X<td>X<td>&nbsp;<td>X<td>&nbsp;<td>X<td>&nbsp;<td>X
<tr align="center">
  <td><a href="#Cloud-Data">Cloud data</a><td>Yes<td>No<td>Yes<td>No<td>Yes<td>No<td>X<td>X<td>X<td>X<td>&nbsp;<td>X<td>X<td>X
</table>

Note that the specifics for any cloud data sources can vary depending on the provider, where the data is stored, where it is being accessed and more. For an in-depth conversation, reach out to the EASI team.
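One way to put the table to work is to encode a few of its columns and filter on what you need. The sketch below is a simplified transcription of the matrix above (only write access, Dask read access, collaborator read access, large-file suitability and long-term shelf life are included), and the attribute and function names are made up for this example.

```python
# Simplified transcription of the decision matrix above.
STORAGE = {
    "Home directory": {"user_write": True, "dask_read": False,
                       "collaborator_read": False, "large_files": False,
                       "long_term": True},
    "User scratch": {"user_write": True, "dask_read": True,
                     "collaborator_read": False, "large_files": True,
                     "long_term": False},
    "Project data": {"user_write": True, "dask_read": True,
                     "collaborator_read": True, "large_files": True,
                     "long_term": True},
    "Data cube": {"user_write": False, "dask_read": True,
                  "collaborator_read": True, "large_files": True,
                  "long_term": True},
    "Cloud data": {"user_write": False, "dask_read": True,
                   "collaborator_read": True, "large_files": True,
                   "long_term": True},
}


def candidates(*needs):
    """Return the storage types that satisfy every named requirement."""
    return [name for name, attrs in STORAGE.items()
            if all(attrs[need] for need in needs)]


# Where can I write large outputs that Dask workers can also read?
print(candidates("user_write", "dask_read", "large_files"))
```

This leaves out cost, which the table also covers, so use it as a first filter rather than a final answer.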

<!--
Markup version - could not find a way to do colspan
| | User access rights | | Dask workers access rights | | Collaborator access rights | | Suitable file size | | Network costs | | Storage cost | | Shelf life | |
|--|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| | *Read* | *Write* | *Read* | *Write* | *Read* | *Write* | *Large* | *Small* | *High* | *Low* | *High* | *Low* | *Short term* | *Long term* |
| [User scratch](#User-scratch) | Yes | Yes | No | No | No | No | | X | | X | X | | X | X |
-->

## Scenarios

### Scenario 1 - Home directory and ODC

Michael has some small data files that he wants to upload and use in conjunction with datacube data. Where is the best place for him to upload this data?

*Answer*: Home directory

The Home Directory is great for smaller data files that you need for your work. It is easy to upload small files from your desktop with drag-and-drop in JupyterLab. There are many python notebook examples (to borrow or adapt from) that query the datacube API (which returns an xarray object) then combine with other data (such as shapefiles, polygons or non-datacube data) with standard python tools.

For larger files and datasets you may consider uploading to User Scratch or a Project Data bucket.

### Scenario 2 - User scratch and Project data buckets

Jenny is generating a set of intermediate or final data files from her workflow. The total size is in the Gigabytes. Where should Jenny write the data files to?

*Answer*: User scratch or Project data (if applicable)

*Extended response*:

S3 bucket storage is preferred for data sets larger than a few GBs. Workflows (notebooks and dask workers) can read and write to S3 buckets if permitted. All EASI users have access to the User Scratch bucket. Some users may also have access to a Project data bucket, in which case the data can also be used by project colleagues.

S3 bucket storage is preferred for larger datasets because it is cheaper, efficient and programmable from notebooks and dask workers. Your home directory is more expensive and not accessible from dask workers, so is less suitable for workflows generating or using large datasets.

### Scenario 3 - Cloud or external online data

Catherine would like to access a dataset that is available in the cloud or from an external online service. She checks Explorer and notes that the dataset is not available in the datacube database. What choices does Catherine have?

*Answer*: Consider the details for accessing these Cloud data in EASI and seek advice if required.

*Extended response*:

Typical external data access considerations include:

- Whether there is a programmable request interface, e.g. a STAC API, an OGC web service, or a custom API. For STAC APIs, you can use the odc-stac tool to query and read the data into an xarray object. Similarly, there are common Python tools for reading from OGC web services. A custom API will likely need specific code to search, download and prepare the data for use.
- Where the data are located. Cloud data services (including EASI’s host) often charge for each data transaction into and out of their data centres, and these charges add up. Moving data across the internet can be slow as well. Ask the EASI team for advice.
- Is this a candidate dataset for including in the datacube database? If yes, then a great start is to document how this dataset can be used by you and others, the dataset source and metadata, and some example code for accessing and using the dataset.