zarr backup might need "optimization" #363

yarikoptic · 2023-11-01T16:42:42Z

I found a 4-day-old process still running for a non-000108 dandiset. The process tree:

dandi      27781 50.7  1.5 3886708 1025188 ?     Rl   Oct27 3546:00                 python -m tools.backups2datalad -l WARNING --backup-root /mnt/backup/dandi --config tools/backups2datalad.cfg.yaml update-from-backup --workers 5 -e 000108$
dandi      90653  0.0  0.0  10820  2868 ?        S    Oct27   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex examinekey --batch --migrate-to-backend=MD5E
dandi      90655  0.0  0.0 1074053100 11264 ?    Sl   Oct27   4:49                     /home/dandi/miniconda3/envs/dandisets-2/bin/git-annex examinekey --batch --migrate-to-backend=MD5E
dandi      91009  0.0  0.0  10820  2856 ?        S    Oct27   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex whereis --batch-keys --json --json-error-messages
dandi      91011  6.1  0.8 1074060012 545916 ?   Sl   Oct27 426:27                     /home/dandi/miniconda3/envs/dandisets-2/bin/git-annex whereis --batch-keys --json --json-error-messages
dandi      91021  0.0  0.0  14788  5308 ?        S    Oct27   4:55                       git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
dandi      91432  0.0  0.0  10820  3008 ?        S    Oct27   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex fromkey --force --batch --json --json-error-messages
dandi      91434  0.1  0.0 1074053256 33024 ?    Sl   Oct27  13:16                     /home/dandi/miniconda3/envs/dandisets-2/bin/git-annex fromkey --force --batch --json --json-error-messages
dandi      91443  0.0  0.0  14748  2236 ?        S    Oct27   0:00                       git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
dandi      91499  0.0  0.0  11736  3136 ?        S    Oct27   2:54                       git --git-dir=.git --work-tree=. --literal-pathspecs hash-object -w --stdin-paths --no-filters
dandi      91566  0.0  0.0  10820  2944 ?        S    Oct27   0:00                   git -c receive.autogc=0 -c gc.auto=0 annex registerurl -c annex.alwayscompact=false --batch --json --json-error-messages
dandi      91569  2.1  0.1 1074126968 71816 ?    Sl   Oct27 149:30                     /home/dandi/miniconda3/envs/dandisets-2/bin/git-annex registerurl -c annex.alwayscompact=false --batch --json --json-error-messages
dandi      91598  0.1  0.0  14852  4868 ?        S    Oct27   9:32                       git --git-dir=.git --work-tree=. --literal-pathspecs -c annex.alwayscompact=false cat-file --batch
dandi      91599  0.0  0.0   6952  2472 ?        S    Oct27   3:13                       /bin/bash /usr/bin/git-annex-remote-rclone
dandi      27782  0.0  0.0   6384  2084 ?        S    Oct27   0:00                 grep -v nothing to save, working tree clean

and looking at that zarr

dandi@drogon:/mnt/backup/dandi/dandisets/000108$ ls -l /proc/91011/cwd
lrwxrwxrwx 1 dandi dandi 0 Nov  1 12:32 /proc/91011/cwd -> /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61
dandi@drogon:/mnt/backup/dandi/dandisets/000108$ ls /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61
0  1  2  3  4
dandi@drogon:/mnt/backup/dandi/dandisets/000108$ find /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/* | nl | tail
494148  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/1/26/35/29
494149  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/1/26/35/3
494150  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/1/26/35/30
494151  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/1/26/35/31
494152  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/2
494153  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/2/.zarray
494154  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/3
494155  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/3/.zarray
494156  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/4
494157  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/4/.zarray

so it is a "hefty" zarr -- half a million files. I wonder if we could make that process any faster. There was some splitindex etc.

FWIW -- above count is with folders. Without folders:

dandi@drogon:/mnt/backup/dandi/dandisets/000108$ find /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/* \! -type d | nl | tail -n 1
487185  /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/4/.zarray
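For reference, the two find counts above (with and without folders) can be reproduced in one pass. This is only a sketch: the function name `count_entries` is made up for illustration, and it assumes that skipping dot-entries at the top level is the wanted behavior, since the shell glob `.../*` does not match `.git`. One known difference: os.walk lists a symlink that points to a directory as a directory, while `find \! -type d` would count it as a non-directory.

```python
import os


def count_entries(root, skip_hidden_top_level=True):
    """Return (n_files, n_dirs) found under root, not counting root itself.

    Hypothetical helper, not part of backups2datalad.
    """
    n_files = n_dirs = 0
    for dirpath, dirnames, filenames in os.walk(root):
        if skip_hidden_top_level and dirpath == root:
            # Mimic the shell glob `root/*`, which skips dotfiles such as .git.
            # Editing dirnames in place also stops os.walk from descending.
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
            filenames = [f for f in filenames if not f.startswith(".")]
        n_dirs += len(dirnames)
        n_files += len(filenames)
    return n_files, n_dirs
```

The sum of the two numbers should correspond to the first find count (with folders) and `n_files` alone to the second one.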

and that particular zarr is almost done so I will keep it going for now

❯ curl --silent -X 'GET' 'https://api.dandiarchive.org/api/zarr/5c37c233-222f-4e60-96e7-a7536e08ef61/' -H 'accept: application/json' | jq .file_count
526320

edits:

It seems to have a lot of unpacked git objects; I am getting stats using ncdu ATM. Since we are running with receive.autogc=0 and gc.auto=0, should we trigger gc "manually"? But wouldn't it then interfere with the running batched processes? We might need to stop and redo. It might be worth simulating all of that with a dedicated script to time it. It might also be worth moving all the dandizarrs to some faster / dedicated medium (SSDs?).

It is the top-level python process (27781) which is relatively CPU busy, at 60-100% CPU, so looking at what it is doing might be relevant. Some py-spy top sampling gives the top of:

Total Samples 3277
GIL: 1.00%, Active: 17.00%, Threads: 15

  %Own   %Total  OwnTime  TotalTime  Function (filename)                                                                                                                                                                                                           
 11.00%  11.00%   27.13s    27.13s   _worker (concurrent/futures/thread.py)
  5.00%   5.00%   15.57s    15.57s   _do_waitpid (asyncio/unix_events.py)
  0.00%   1.00%   0.080s     1.03s   _run_once (asyncio/base_events.py)
  0.00%   0.00%   0.070s    0.090s   _execute_child (subprocess.py)
  0.00%   0.00%   0.070s    0.070s   _add_callback (asyncio/base_events.py)
  0.00%   1.00%   0.050s    0.880s   _run (asyncio/events.py)
  0.00%   0.00%   0.040s    0.040s   register (selectors.py)
  0.00%   0.00%   0.030s    0.030s   raw_decode (json/decoder.py)

so is it just jumping between different async items or really doing some useful work???

edit: some stats from ncdu. A LOT of files during the backup, then just a few

at some point there were over 900,000 files in .git/annex/journal !

--- /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/.git/annex ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                /..
    3.5 GiB [##########] 904.4k /journal                                                                                                                                                                                                                           
    2.0 MiB [          ]         index
    1.5 MiB [          ]      1 /keysdb
   12.0 KiB [          ]      3 /fsck
    4.0 KiB [          ]         index.lck

and separate objects (no packing performed) for each tiny file

--- /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/.git ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    3.5 GiB [##########] 904.4k /annex
.   1.7 GiB [####      ] 451.7k /objects

which then all get handled eventually and .git/objects packed too:

--- /mnt/backup/dandi/dandizarrs/5c37c233-222f-4e60-96e7-a7536e08ef61/.git ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  254.8 MiB [##########]     16 /objects                                                                                                                                                                                                                           
  241.3 MiB [######### ]     18 /annex
   38.9 MiB [#         ]         index
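The loose-vs-packed picture that ncdu shows above could also be read directly from git with `git count-objects -v`, which is a read-only query. Below is a minimal sketch; the helper names `parse_count_objects` and `loose_and_packed` are made up for illustration, and nothing here says whether a manual gc is safe to run next to the batched git-annex processes.

```python
import subprocess


def parse_count_objects(text):
    """Parse `git count-objects -v` output ("key: value" lines) into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # ignore anything that is not a "key: value" line
        value = value.strip()
        stats[key.strip()] = int(value) if value.isdigit() else value
    return stats


def loose_and_packed(repo):
    """Return (loose_objects, packed_objects) for the git repository at repo."""
    out = subprocess.run(
        ["git", "-C", repo, "count-objects", "-v"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    stats = parse_count_objects(out)
    # "count" is the number of loose objects, "in-pack" the number in packs.
    return stats["count"], stats["in-pack"]
```

Polling `loose_and_packed()` on the zarr repository during a backup would show when the loose object count grows and when it collapses into packs.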


satra · 2023-11-01T17:11:59Z

which asset is this? i want to check that the shape/compression characteristics did not change in the process and this is indeed a hefty zarr (i.e. could be one of the 4mm slices).

also i'm going to start rolling out not storing rawest data but stitched data.

yarikoptic · 2023-11-01T17:21:55Z

❯ curl --silent -X 'GET' 'https://api.dandiarchive.org/api/zarr/5c37c233-222f-4e60-96e7-a7536e08ef61/' -H 'accept: application/json' | jq .
{
  "name": "sub-I58/ses-Hip-CT/micr/sub-I58_sample-01_chunk-01_hipCT.ome.zarr",
  "dandiset": "000026",
  "zarr_id": "5c37c233-222f-4e60-96e7-a7536e08ef61",
  "status": "Complete",
  "checksum": "4cb549b2e2346bb1a30f493b50fb6a2e-526320--1023396474554",
  "file_count": 526320,
  "size": 1023396474554
}
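Judging from the JSON above, the checksum string appears to have the form `<digest>-<file_count>--<size>`: its last two fields match `file_count` and `size`. Assuming that layout holds, a rough progress estimate for the backup can be computed from the local file count. The function names below are hypothetical.

```python
def parse_zarr_checksum(checksum):
    """Split '<digest>-<file_count>--<size>' into (digest, file_count, size).

    The layout is inferred from the API response, not from a specification.
    """
    head, size = checksum.rsplit("--", 1)
    digest, file_count = head.rsplit("-", 1)
    return digest, int(file_count), int(size)


def fraction_backed_up(local_files, checksum):
    """Fraction of the zarr's files that are already present locally."""
    _, expected_files, _ = parse_zarr_checksum(checksum)
    return local_files / expected_files


checksum = "4cb549b2e2346bb1a30f493b50fb6a2e-526320--1023396474554"
# 487185 is the local count of non-directory entries reported by find earlier.
print(f"{fraction_backed_up(487185, checksum):.1%}")
```

With the numbers from this thread that gives roughly 93%, which is consistent with the zarr being "almost done".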

satra · 2023-11-01T17:28:26Z

this is dandiset 26, not 108. it's probably the TB one. it's an entire hemisphere and more at 15um resolution.

satra · 2023-11-01T17:30:15Z

i didn't read "non" 000108 dandiset - i thought it was in 108. but this one is beautiful. yael posted the neuroglancer rendering in the bids spec addition of HiPCT.

yarikoptic · 2023-11-02T21:13:47Z

> yael posted the neuroglancer rendering in the bids spec addition of HiPCT.

is there a link?

satra · 2023-11-03T00:12:52Z

bids-standard/bids-specification#1646

yarikoptic · 2024-02-16T23:37:04Z

Let's consider this migrated to

Optimize creation/update of zarrs #1: do all removals at once backups2datalad#31

yarikoptic added the performance label Nov 1, 2023

yarikoptic mentioned this issue Nov 2, 2023

make updates more efficient regardless either dandiset with zarrs or not #364

Open

yarikoptic mentioned this issue Feb 16, 2024

Optimize creation/update of zarrs #1: do all removals at once dandi/backups2datalad#31

Closed

yarikoptic closed this as completed Feb 16, 2024
