Commit 2e7f7b8
Update data-curation.rst [skip ci]
danieljprice committed Jan 31, 2024 (1 parent: fb2df88)
Showing 1 changed file: docs/data-curation.rst (32 additions, 6 deletions)
Publishing the data from your phantom calculations
==================================================================
Recommended best practice for open science is that parameter files, initial conditions
and snapshots from calculations with phantom should be made publicly available on publication.

FAIR Principles
----------------
According to the `FAIR principles for scientific data management <https://ardc.edu.au/resource/fair-data/>`__, your data should be:

- Findable: e.g. with links to and from the paper publishing the simulations
- Accessible: available for free in a publicly accessible repository
- Interoperable: data are labelled and can be reused or converted
- Reusable: enough information is included to reproduce your simulations

Data curation
-------------
For calculations with phantom that have been published in a paper,
ideal practice is to upload the **entire calculation including .in and
.setup files, .ev files and all dump files in a public repository**.

See for example a dataset from Mentiplay et al. (2020) using figshare: `<https://doi.org/10.6084/m9.figshare.11595369.v1>`_

Or this example from Wurster, Bate & Price (2018) in the University of Exeter repository: `<https://doi.org/10.24378/exe.607>`_

However, repository size limits may prevent preserving the full dataset, in which case we recommend saving:

- .in files
- .setup files
- .ev files
- dump files used to create figures in your paper, with a link to splash or sarracen in the metadata for how to read/convert these files
- dump files containing initial conditions, if these are non-trivial
- metadata including link to your publication or arXiv preprint, link to the phantom code, code version information and labelling of data corresponding to simulations listed in your paper
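For the metadata item above, a minimal sketch of what to record might look like the following. The field names are illustrative rather than a required schema, and the placeholder values in angle brackets should be replaced with your own details; the code links are the current public repositories for phantom and sarracen::

    title:        Simulation data from <paper title>
    publication:  <DOI or arXiv link to your paper>
    code:         https://github.com/danieljprice/phantom
    code_version: <git tag or commit hash used for the runs>
    reader:       https://github.com/ttricco/sarracen
    contents:     <e.g. sim1/ corresponds to Model A in Table 1;
                   dump_00000 contains the initial conditions>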

Zenodo community
----------------
To facilitate better data sharing between phantom users, we have set up a Zenodo community:

https://zenodo.org/communities/phantom

Please join this community so that we can learn from each other and establish best-practice data curation.
Zenodo currently has a 50 GB limit on data size, which is sufficient for the recommended list of files to save above.

Archiving your data to Google Drive using rclone
------------------------------------------------
You can use rclone to copy data from a remote cluster or supercomputing facility to Google Drive. This is not recommended as a long term storage solution but can facilitate short-term data sharing between users.

Set this up by logging into your supercomputer and typing::

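The detailed setup steps are collapsed in the view above. As a rough sketch only (the remote name ``gdrive`` and the paths are placeholders, not from the original, and the interactive ``rclone config`` prompts vary by site), a typical workflow is::

    # one-off setup: create a Google Drive remote (interactive prompts)
    rclone config

    # verify the remote is reachable by listing top-level directories
    rclone lsd gdrive:

    # copy a simulation directory to Drive, showing transfer progress
    rclone copy /scratch/myproject/sim1 gdrive:phantom/sim1 --progress

``rclone copy`` skips files that are already present and unchanged at the destination, so it is safe to re-run after an interrupted transfer.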
