# Storage Spaces
There are different types of storage spaces on national clusters:
* Personal (`/home`)
* Temporary local (`$SLURM_TMPDIR`)
* Temporary network (`/scratch`)
* Shared project (`/project`)
* *Nearline* - for long term storage (`/nearline`)

Your data have the following attributes:
* **Size**: small, large, very large files
* **Count**: few files or a lot of files
* **Transferable**: grouped and/or compressed data
* **Life cycle**: in a job, between jobs,
  from one project to the next, to be archived
* **Access levels**: confidential, shared, published

The goal of this chapter is to give an overview of the basics of active
data management on the storage spaces available on compute clusters.

## Different Types of Storage
The following table shows which types of storage space are accessible
from **login nodes** and from **compute nodes** (CPU or GPU):

| Available Space | Login Nodes | Compute Nodes |
|-----------------|:-----------:|:-------------:|
|         `/home` |     Yes     |      Yes      |
| `$SLURM_TMPDIR` |    **No**   |      Yes      |
|      `/scratch` |     Yes     |      Yes      |
|      `/project` |     Yes     |      Yes      |
|     `/nearline` |     Yes     |     **No**    |

Learn more about the different [types of storage](https://docs.alliancecan.ca/wiki/Storage_and_file_management#Storage_types).

### `$HOME` - Your Personal Space
```Bash
cd  # $HOME by default
ls -a
ls -la
ls -ld $HOME
```

* **Entry point** by default when you connect to a compute cluster
* [Relatively small quota limit](https://docs.alliancecan.ca/wiki/Storage_and_file_management#Filesystem_quotas_and_policies), but accepts a relatively large number of files
  * Ideal for [installing software in your home directory](https://docs.alliancecan.ca/wiki/Installing_software_in_your_home_directory)

### `$SLURM_TMPDIR` - Temporary Local Space
```Bash
ls -ld $SLURM_TMPDIR
salloc  # From login1

ls -ld $SLURM_TMPDIR
df -h $SLURM_TMPDIR
exit
```

* [Very fast local storage](https://docs.alliancecan.ca/wiki/Using_node-local_storage),
  but limited in size and to the duration of the compute job
  * **Low latency** compared to *Lustre* (the network filesystem)
  * Great bandwidth, even for small files
  * **Data deleted at the end** of the compute job
  * If multiple nodes are reserved for a single parallel job,
    **each node has its own directory** `$SLURM_TMPDIR`
* Use cases:
  * **Importing** multiple **small files** which
    will be used repeatedly during the calculation
  * **Saving** files which will be **constantly modified** -
    these files will have to be copied to `/project`
    or `/scratch` before the end of the job
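
The use cases above follow a stage-in / compute / stage-out pattern, which could look like the following minimal sketch. The function names `stage_in` and `stage_out`, the `results` directory and the use of `tar` are assumptions made for illustration, not part of any cluster tooling. Outside a job, where `$SLURM_TMPDIR` is unset, the sketch falls back to a temporary directory so that it stays runnable anywhere.

```Bash
#!/bin/bash
# Sketch of a stage-in / stage-out pattern (all names are hypothetical).
# Inside a Slurm job, SLURM_TMPDIR points to node-local storage;
# elsewhere, fall back to a fresh temporary directory.
workdir="${SLURM_TMPDIR:-$(mktemp -d)}"

stage_in() {
    # Extract an archive of small input files onto the node-local disk
    local archive="$1"
    mkdir -p "$workdir/input"
    tar -xf "$archive" -C "$workdir/input"
}

stage_out() {
    # Bundle the local results directory into a single archive
    # written to network storage (one large file, not many small ones)
    local dest_dir="$1"
    mkdir -p "$dest_dir"
    tar -cf "$dest_dir/results.tar" -C "$workdir" results
}
```

In a job script, one would call `stage_in "$SCRATCH/inputs.tar"`, run the computation so that it writes into `$workdir/results`, then call `stage_out "$SCRATCH/my_run"` before the job ends. These paths are placeholders.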

### `$SCRATCH` - Temporary Network Space
```Bash
df -h /scratch
ls -ld $SCRATCH
```

* Network storage space of
  [great capacity](https://docs.alliancecan.ca/wiki/Storage_and_file_management#Filesystem_quotas_and_policies)
  for **temporary data**
  * **Not** backed up
  * [Monthly purge](https://docs.alliancecan.ca/wiki/Scratch_purging_policy)
    of data older than 60 days
* Performance varies with the combined load from all users
* Use cases:
  * Using a dataset **for only a few days**
  * Storing **results temporarily** when they consist of hundreds of files
  * Storing **intermediate** results which
    would be **too big** for `/project`

### `/project` - Shared Project Space
```Bash
ls -ld /project
ls -ld /project/def-sponsor00
ls -l /project/def-sponsor00
```

* Network storage space of
  [small or large capacity](https://docs.alliancecan.ca/wiki/Storage_and_file_management#Filesystem_quotas_and_policies)
  for **project data**
  * A default project space per research group (except on Niagara)
    * The quota can be increased by a few TB **on request by email**
    * For a much larger project space, the PI needs to submit a
      [RAC request](https://alliancecan.ca/en/services/advanced-research-computing/accessing-resources/resource-allocation-competition)
  * Backed up **daily**
    * The quota on the number of files is limited (500k by default)
* Project data:
  * Potentially **shared** -
    [configuration of ACLs](https://docs.alliancecan.ca/wiki/Sharing_data)
  * Last as long as the project lasts
  * Typically more important than temporary data
* Use cases:
  * Storing datasets that are **reused over
    multiple months or shared by many group members**
  * Storing **final results** which would be too expensive to recreate

### `/nearline` - Long Term Storage
Storage interface on disk:
* Files are visible with the `ls` command
* The oldest data in
  [`/nearline` is most likely moved to tape](https://docs.alliancecan.ca/wiki/Using_nearline_storage)
  * There are commands to
    [check the status of your files](https://docs.alliancecan.ca/wiki/Using_nearline_storage#Transferring_data_from_Nearline)
    in `/nearline`

To consider:
* Migrating data to tape reduces the space used on
  disk, which saves some money when buying the storage system
* Each read of a file that has been migrated to tape creates
  **a blocking request**, with a response time of a few minutes
  to hours (when the tape system is overloaded with requests)
  * That is why it is necessary to save
    a **small number of large files**
  * **To avoid**: copying numerous small files to
    Nearline without first grouping them in archive files

Use cases:
* Grouping files from `/project` or `/scratch`
* Storing important data **that will not be used for months**
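
Grouping a directory of small files into one archive before it goes to Nearline could look like the sketch below. It uses `tar` with gzip compression as one possible choice, and the function name `archive_to` is hypothetical. The archive is listed once after creation as a basic readability check.

```Bash
#!/bin/bash
# Pack one directory into a single compressed archive in a
# destination directory, then check that the archive can be read.
archive_to() {
    local src_dir="$1" dest_dir="$2"
    local name
    name="$(basename "$src_dir")"
    mkdir -p "$dest_dir"
    # -C keeps only the directory name (not its full path) in the archive
    tar -czf "$dest_dir/$name.tar.gz" -C "$(dirname "$src_dir")" "$name"
    # Listing the archive fails if it is truncated or corrupted
    tar -tzf "$dest_dir/$name.tar.gz" > /dev/null
}
```

A call such as `archive_to "$SCRATCH/run1" /nearline/def-sponsor00/$USER` would then write a single `run1.tar.gz` file. The destination path is a placeholder to adapt to your own allocation.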

## Storage Management
### Life Cycle of Active Data
As time passes, the data tend to accumulate. It eventually becomes
necessary to monitor the used space, as well as the number of files.
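```Bash
du -s ~          # Total space used by your home directory
find . | wc -l   # Number of files and directories under the current directory

df -h /project   # Space used and available on the filesystem
df -hi /project  # Inodes (number of files) used and available
```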

* **[The `diskusage_report` command](https://docs.alliancecan.ca/wiki/Storage_and_file_management#Overview)**
  generates a short report of the space used and the number
  of files inside each storage space you have access to
* Every day, a per-user storage usage report is created in `/project`:
  * On Béluga: in `/project/.stats/<allocation-name>`
  * On Cedar and Narval: in `/project/.stats/<allocation-name>.json`
  * **[The `diskusage_explorer` command](https://docs.alliancecan.ca/wiki/Diskusage_Explorer)**
    shows a storage space usage summary and lets you
    navigate to sub-directories for further analysis
  * Detailed information is available on demand for Graham and Niagara

Having a good active data management plan makes it easier
to delete or archive specific files in the long term.
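
To see where files accumulate, a small helper can count the files under each first-level sub-directory. This is a minimal sketch with a hypothetical function name, `count_files_per_dir`. It walks the whole tree, so on a network filesystem it is best run on a limited sub-tree.

```Bash
#!/bin/bash
# Print "<file count><TAB><directory>" for each first-level
# sub-directory of the given directory, largest count first.
count_files_per_dir() {
    local top="${1:-.}"
    local d
    for d in "$top"/*/; do
        [ -d "$d" ] || continue   # Skip when there is no sub-directory
        printf '%s\t%s\n' "$(find "$d" -type f | wc -l | tr -d ' ')" "${d%/}"
    done | sort -rn
}
```

For example, `count_files_per_dir ~ | head` lists the ten sub-directories of your home directory that hold the most files.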

#### Example of the Life Cycle of Active Data
![Pipeline 1](images/data-flow-1.svg)

Description of each step:
* A dataset is downloaded in `/scratch`
  * To be used between a few days and a few weeks
  * No need to back up the data (it is easily recoverable)
* Submission of multiple compute jobs
  * One job per file in `data` in the `/scratch` partition
* The job script is located in the `/home` partition
  * It uses the variables `$FIC` and `$SLURM_TMPDIR` to copy the
    data file to be processed onto the compute node's local disk
  * Change to the local directory
  * Configure a Python environment
  * Execute the Python code saved in `/home`, provide the name of
    the file to process and redirect the output to a local file
  * Copy the results file to a directory in `/scratch`
* Post-process - process all results files and
  keep only the necessary information in `/project`
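
The submission step described above (one job per data file) could be sketched as follows. This is not the content of the exercise scripts: the function name `submit_all` and its arguments are hypothetical. It relies on the `--export` option of `sbatch` to pass the file path to the job as `$FIC`, and takes an optional third argument so that the commands can be previewed with `echo` instead of being submitted.

```Bash
#!/bin/bash
# Submit one job per regular file found in a data directory.
# The path of each file is passed to the job script as $FIC.
# Note: file names containing commas would break --export.
submit_all() {
    local data_dir="$1" job_script="$2" submit="${3:-sbatch}"
    local f
    for f in "$data_dir"/*; do
        [ -f "$f" ] || continue   # Skip sub-directories and empty globs
        $submit --export=ALL,FIC="$f" "$job_script"
    done
}
```

Calling `submit_all "$SCRATCH/data" job.sh echo` prints the `sbatch` arguments for each file without submitting anything; dropping the last argument submits the jobs. Here `job.sh` stands for your own job script.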

#### **Exercise** - Running a Small Pipeline
* Check the code of these scripts:
  * `scripts/blastn-pipeline.sh`
  * `scripts/blastn-postprocess.sh`
* Run the pipeline with the following command:

```Bash
bash scripts/blastn-pipeline.sh
```

* Monitor both jobs with `squeue -u $USER`
* Find created files in:
  * `$SCRATCH/data` and
  * `$SCRATCH/data/res_prll`
* Run the post-processing with the following command:

```Bash
bash scripts/blastn-postprocess.sh
```
* Find the TSV file in the project space

### In Case of Inaccessible Data
* A professor can request the deletion of inaccessible data
* To get access to the data, we need the consent of the user
  who has blocked the access (most of the time unintentionally)
  * If there is no response, the policy of the research group's
    institution determines whether access to the data can be granted
  
In all cases, it is better to plan the data management
in advance, even when importing data on compute clusters.

## Key Points
* The *Lustre* file system is optimized for large files (10 MB or more)
  * Avoid saving too many files and directories in a directory
    (maximum 1000 items)
* For data transfers and the use of *Nearline*, it is
  better to group data in archive files (like Zip, DAR, etc.)
* In the project space, the group must
  plan who should have access to what and when
* To optimize jobs, use `$SLURM_TMPDIR`
* The `diskusage_report` command returns an overview of used space
* For critical data and codes:
  * have a copy elsewhere, and
  * use a version control system