Document private/datalad-annex/QC/huge data use case #535

Open
mih opened this issue Nov 7, 2023 · 1 comment


mih commented Nov 7, 2023

The script below is an executable example of a workflow that addresses a use case for huge microscopy data processing. Each file is >4GB, and the joint datasets track about 2PB (yes, not TB). The workflow aims to minimize the invasiveness of datalad use, while maximizing the utility of having datalad available as an option. The resulting deposit is minimized with respect to the number of required inodes.

The script also demos the interventions necessary after a storage reorganization and a change of data access method.

#
# This script demos the use of datalad-next features to
#
# - dataladify data in a temporary workspace
#  (that leaves no trace in the final dataset)
# - run a mock QC analysis with provenance capture
# - deposit all data in a "domain-native" organization with rsync
# - annotate data availability via access URLs
# - deposit datalad dataset as a strict add-on (no RIA store, just two more files)
# - clean up the workspace with availability verification for all components
#   (data files and repositories)
#
# At the end of the script, additional commands are shown that:
#
# - demo datalad-based dataset access (in addition to the plain directory tree
#   that rsync made)
# - demo a simulation of a storage reorganization that changes both access
#   protocol and path
#
# The script is extensively annotated.

set -eu

# create a tmp workspace for this script and enter it
mkdir -p demo_micro_datalad
cd demo_micro_datalad

# this segment only cleans the workspace, so that this script can run
# multiple times
rm -rf sshtarget newstore
for cleands in tmp_rawdata tmp_qc; do
	[ -d "$cleands" ] && datalad drop -r --what all --reckless kill -d "$cleands"
done


#
# RAW data ingestion
#
# We create a dataset with two files that match some real-world dataset.
# annex.private is set to hide this dataset location from any future dataset
# users -- this step uses a temporary workspace, with a throwaway clone

# TODO should likely run a dataset procedure to apply a uniform dataset setup
# for section datasets
datalad -c annex.private=true create tmp_rawdata
echo "b31 se4318 sl01" > tmp_rawdata/B31_4318_LE01_Slice01.tif
echo "b31 se4318 sl02" > tmp_rawdata/B31_4318_LE01_Slice02.tif
# we leave out annex.private here to enable datalad-based retrieval for the
# QC processing that happens below. This is just a shortcut. We could also
# move the URL registration up here. However, I am leaving it like this,
# because it better demos what information is critically needed.
datalad -C tmp_rawdata save -m "data from scanner"


#
# QC run
#
# YODA-style setup. Data linked as a subdataset (from TMP location for now)
# Mock QC command, prov-captured.

# TODO should likely run a dataset procedure to apply a uniform dataset setup
# for qc datasets
datalad -c annex.private=true create tmp_qc
datalad -C tmp_qc -c annex.private=true clone -d . ../tmp_rawdata inputs/se4318
# run the "QC" workflow
datalad -C tmp_qc -c annex.private=true run -i inputs/se4318/B31_4318_LE01_Slice01.tif -o qc.txt 'echo "QC on {inputs}" > {outputs}'
# remove the inputs (drop performs sanity checks); QC should not have modified them
datalad -C tmp_qc -c annex.private=true drop -r --what all inputs/

#
# Data(set) deposition
#

# non-datalad deposition, we use different locations for raw data and qc results
TARGETRSYNCHOST=
TARGETPATH=$(pwd)/sshtarget
RAWPATH=B31/4318/zscans
QCPATH=QC/B31_4318
RSYNC_OPTS='--recursive --mkpath --copy-links --exclude .datalad --exclude .git*'
# rsync for the payload files
rsync ${RSYNC_OPTS} tmp_rawdata/ ${TARGETRSYNCHOST}${TARGETPATH}/${RAWPATH}/
rsync ${RSYNC_OPTS} --exclude inputs tmp_qc/ ${TARGETRSYNCHOST}${TARGETPATH}/${QCPATH}/
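
# optional sanity check (an illustrative addition, not part of the original
# run): a checksum-based dry-run should report no remaining differences
#rsync ${RSYNC_OPTS} --dry-run --checksum --itemize-changes \
#	tmp_rawdata/ ${TARGETRSYNCHOST}${TARGETPATH}/${RAWPATH}/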

#
# Annotate file availability
#
TARGETPROTO='file:'
TARGETHOST=
TARGETURL="${TARGETPROTO}//${TARGETHOST}${TARGETPATH}"
TARGETURLRAW="${TARGETURL}/${RAWPATH}"
TARGETURLQC="${TARGETURL}/${QCPATH}"

# for the raw data dataset

# enable UNCURL special remote, utility is explained below
git -C tmp_rawdata annex initremote BIGSTORE \
	type=external externaltype=uncurl encryption=none autoenable=true

# register a URL for the annex key of each file in the dataset.
git -C tmp_rawdata annex find . \
	--format="\${key} ${TARGETURLRAW}/\${file}\n" \
	| sort | uniq | git -C tmp_rawdata annex registerurl --batch
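
# optional spot check (illustrative, not part of the original run):
# 'whereis' should now list the registered URL for each file
#git -C tmp_rawdata annex whereis B31_4318_LE01_Slice01.tif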

# announce the keys as available from the remote.
# this is needed, because we copied the data without datalad, and we
# did not re-download it for git-annex to verify its presence
git -C tmp_rawdata annex find . \
	--format="\${key} $(git -C tmp_rawdata annex info BIGSTORE | grep uuid: | cut -d: -f2) 1\n" \
	| sort | uniq | git -C tmp_rawdata annex setpresentkey --batch
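
# optional verification (illustrative, not part of the original run): a
# presence-only fsck against BIGSTORE confirms git-annex can locate each key
# at the registered URLs, without re-downloading any content
#git -C tmp_rawdata annex fsck --from BIGSTORE --fast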

# now exactly the same again for the QC result dataset, only the URL is different
git -C tmp_qc annex initremote BIGSTORE \
	type=external externaltype=uncurl encryption=none autoenable=true
git -C tmp_qc annex find . \
	--format="\${key} ${TARGETURLQC}/\${file}\n" \
	| sort | uniq | git -C tmp_qc annex registerurl --batch
git -C tmp_qc annex find . \
	--format="\${key} $(git -C tmp_qc annex info BIGSTORE | grep uuid: | cut -d: -f2) 1\n" \
	| sort | uniq | git -C tmp_qc annex setpresentkey --batch

#
# Deposit datalad datasets
#
# the following uses the datalad-annex git remote helper to deposit
# a repository at the target location (made of two files)
DATALADANNEX_PREFIX="datalad-annex::${TARGETURL}"
UNCURL_OPTS='type=external&externaltype=uncurl&encryption=none&url={noquery}/{{annex_key}}'
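
# for illustration: the query string above becomes the configuration of the
# special remote backing the repository deposit. '{noquery}' expands to the
# deposit URL without the query part, and the double-braced '{{annex_key}}'
# survives that expansion, to be filled in per key later, yielding access
# URLs like (hypothetical example)
#   file://${TARGETPATH}/B31/4318/zscans-datalad/XDLRA--repo-export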

DLANNEXRAWURL="${DATALADANNEX_PREFIX}/${RAWPATH}-datalad"
DLANNEXQCURL="${DATALADANNEX_PREFIX}/${QCPATH}-datalad"

# the deposition is a standard git-remote-add + git-push
#
# BTW there is no need to push to the actual target. We could also create the
# deposit locally, and include it in the rsync transfer above
git -C tmp_rawdata remote add TARGET "${DLANNEXRAWURL}?${UNCURL_OPTS}"
# we push ALL, because we know that we need all and we will wipe this clone
# out in a few moments
git -C tmp_rawdata push TARGET --all
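
# optional check (illustrative, not part of the original run): listing the
# refs at the deposit confirms the push arrived
#git -C tmp_rawdata ls-remote TARGET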

# now the same for the QC result dataset, with one addition
#
# we need to record the raw data subdataset with its persistent
# location. This uses the same datalad-annex:: URL as before.
#
# we only have a single subdataset, so there is no need for filtering
datalad -C tmp_qc subdatasets --set-property url "${DLANNEXRAWURL}?${UNCURL_OPTS}"
git -C tmp_qc remote add TARGET "${DLANNEXQCURL}/?${UNCURL_OPTS}"
git -C tmp_qc push TARGET --all

# this shows the final deposit at the target
tree -a sshtarget

#
# Clean desk. Remove all local dataset clones. The deposit is now the single
# source.
#
for cleands in tmp_rawdata tmp_qc; do
	datalad drop -r --what all -d "$cleands"
done

#
# Test retrieve
#
echo "Run the following to verify (remote) availability of all components"
echo "============="
echo datalad clone "'${DLANNEXQCURL}?${UNCURL_OPTS}'" b31_4318_qc
echo datalad -C b31_4318_qc get --recursive .
echo ""

# simulate that the store has moved and is now accessible via SSH
echo "Run the following to simulate a move of the storage and a protocol change"
echo "============="
echo mv $(pwd)/sshtarget $(pwd)/newstore
NEWURL="ssh://localhost/$(pwd)/newstore"

# for cloning we need to give the new URL
echo datalad clone "'datalad-annex::${NEWURL}/${QCPATH}-datalad?${UNCURL_OPTS}'" b31_4318_qc_fromnew

# for retrieval of files, we can recycle the (now outdated) recorded URL,
# segment it via a match expression, and reassemble a new URL via a template
# declaration that uses the segmentation results.
#
# this configuration need not be supplied ad-hoc like this. It can also be
# system-wide (BIGSTORE should be a unique-enough label), or can be
# committed to the (updated) dataset (a sketch follows after this command).
echo datalad -C b31_4318_qc_fromnew -c "'remote.BIGSTORE.uncurl-match=(file:///.*/sshtarget/)(?P<path>.*)$'" -c "'remote.BIGSTORE.uncurl-url=${NEWURL}/{path}'" get .
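
# a sketch of the "committed to the dataset" variant mentioned above (an
# assumption of this demo, not executed here): record the uncurl
# configuration in the branch-committed .datalad/config, and save it, so
# future users of the dataset pick up the relocated store automatically
#git -C b31_4318_qc_fromnew config --file .datalad/config \
#	remote.BIGSTORE.uncurl-match '(file:///.*/sshtarget/)(?P<path>.*)$'
#git -C b31_4318_qc_fromnew config --file .datalad/config \
#	remote.BIGSTORE.uncurl-url "${NEWURL}/{path}"
#datalad -C b31_4318_qc_fromnew save -m 'record relocated store' .datalad/config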

mih commented Nov 7, 2023

This is what it looks like when the above script runs...

❯ bash micro_datalad.sh
create(ok): /tmp/demo_micro_datalad/tmp_rawdata (dataset)
add(ok): B31_4318_LE01_Slice01.tif (file)                                                                                                                                                            
add(ok): B31_4318_LE01_Slice02.tif (file)                                                                                                                                                            
save(ok): . (dataset)                                                                                                                                                                                
action summary:                                                                                                                                                                                      
  add (ok: 2)
  save (ok: 1)
create(ok): /tmp/demo_micro_datalad/tmp_qc (dataset)
install(ok): inputs/se4318 (dataset)                                                                                                                                                                 
add(ok): inputs/se4318 (dataset)                                                                                                                                                                     
add(ok): .gitmodules (file)                                                                                                                                                                          
save(ok): . (dataset)                                                                                                                                                                                
add(ok): .gitmodules (file)                                                                                                                                                                          
save(ok): . (dataset)                                                                                                                                                                                
action summary:                                                                                                                                                                                      
  add (ok: 3)
  install (ok: 1)
  save (ok: 2)
[INFO   ] Making sure inputs are available (this may take some time) 
get(ok): inputs/se4318/B31_4318_LE01_Slice01.tif (file) [from origin...]                                                                                                                             
[INFO   ] == Command start (output follows) =====                                                                                                                                                    
[INFO   ] == Command exit (modification check follows) ===== 
run(ok): /tmp/demo_micro_datalad/tmp_qc (dataset) [echo "QC on inputs/se4318/B31_4318_LE01_...]
add(ok): qc.txt (file)                                                                                                                                                                               
save(ok): . (dataset)                                                                                                                                                                                
drop(ok): inputs/se4318 (key)                                                                                                                                                                        
uninstall(ok): inputs/se4318 (dataset)
action summary:
  drop (notneeded: 1, ok: 1)
  uninstall (ok: 1)
initremote BIGSTORE ok
(recording state in git...)
registerurl file:///tmp/demo_micro_datalad/sshtarget/B31/4318/zscans/B31_4318_LE01_Slice01.tif ok
registerurl file:///tmp/demo_micro_datalad/sshtarget/B31/4318/zscans/B31_4318_LE01_Slice02.tif ok
(recording state in git...)
setpresentkey MD5E-s16--2efe667f3d1210039c94854ac65f2667.tif ok
setpresentkey MD5E-s16--cec75369ec85271b432135cb9aa1f0e5.tif ok
initremote BIGSTORE ok
(recording state in git...)
registerurl file:///tmp/demo_micro_datalad/sshtarget/QC/B31_4318/qc.txt ok
(recording state in git...)
setpresentkey MD5E-s46--4f79a9f83fb82bb478959721f2f3cb90.txt ok
Enumerating objects: 32, done.
Counting objects: 100% (32/32), done.
Delta compression using up to 8 threads
Compressing objects: 100% (26/26), done.
Writing objects: 100% (32/32), 2.65 KiB | 2.65 MiB/s, done.
Total 32 (delta 6), reused 0 (delta 0), pack-reused 0
To datalad-annex::file:///tmp/demo_micro_datalad/sshtarget/B31/4318/zscans-datalad?type=external&externaltype=uncurl&encryption=none&url={noquery}/{{annex_key}}
 * [new branch]      git-annex -> git-annex
 * [new branch]      main -> main
add(ok): .gitmodules (file)                                                                                                                                                                          
save(ok): . (dataset)                                                                                                                                                                                
subdataset(ok): inputs/se4318 (dataset)                                                                                                                                                              
Enumerating objects: 32, done.
Counting objects: 100% (32/32), done.
Delta compression using up to 8 threads
Compressing objects: 100% (27/27), done.
Writing objects: 100% (32/32), 3.10 KiB | 3.10 MiB/s, done.
Total 32 (delta 7), reused 0 (delta 0), pack-reused 0
To datalad-annex::file:///tmp/demo_micro_datalad/sshtarget/QC/B31_4318-datalad/?type=external&externaltype=uncurl&encryption=none&url={noquery}/{{annex_key}}
 * [new branch]      git-annex -> git-annex
 * [new branch]      main -> main


sshtarget
├── B31
│   └── 4318
│       ├── zscans
│       │   ├── B31_4318_LE01_Slice01.tif
│       │   └── B31_4318_LE01_Slice02.tif
│       └── zscans-datalad
│           ├── XDLRA--refs
│           └── XDLRA--repo-export
└── QC
    ├── B31_4318
    │   └── qc.txt
    └── B31_4318-datalad
        ├── XDLRA--refs
        └── XDLRA--repo-export

8 directories, 7 files



drop(ok): . (key)
drop(ok): . (key)
uninstall(ok): . (dataset)
action summary:
  drop (ok: 2)
  uninstall (ok: 1)
drop(ok): . (key)
uninstall(ok): . (dataset)
action summary:
  drop (ok: 1)
  uninstall (ok: 1)

Run the following to verify (remote) availability of all components
=============
datalad clone 'datalad-annex::file:///tmp/demo_micro_datalad/sshtarget/QC/B31_4318-datalad?type=external&externaltype=uncurl&encryption=none&url={noquery}/{{annex_key}}' b31_4318_qc
datalad -C b31_4318_qc get --recursive .

Run the following to simulate a move of the storage and a protocol change
=============
mv /tmp/demo_micro_datalad/sshtarget /tmp/demo_micro_datalad/newstore
datalad clone 'datalad-annex::ssh://localhost//tmp/demo_micro_datalad/newstore/QC/B31_4318-datalad?type=external&externaltype=uncurl&encryption=none&url={noquery}/{{annex_key}}' b31_4318_qc_fromnew
datalad -C b31_4318_qc_fromnew -c 'remote.BIGSTORE.uncurl-match=(file:///.*/sshtarget/)(?P<path>.*)$' -c 'remote.BIGSTORE.uncurl-url=ssh://localhost//tmp/demo_micro_datalad/newstore/{path}' get .


bash micro_datalad.sh  5.55s user 1.46s system 62% cpu 11.236 total
