Changes to get Azure Blob Support work with large GenomicsDB datasets #107

nalinigans · 2021-08-23T21:17:16Z

The significant codec changes register names for each supported algorithm, so that fragment reads/writes can report the name of the codec when a compression error occurs. Also changed all tabs to spaces in the codec code for readability and consistency.
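To illustrate why a registered name helps, here is a minimal sketch of a codec base class that carries a name and a fragment-write helper that includes it in the error message. The names `Codec`, `FailingCodec`, and `write_tile`, and the wording of the message, are hypothetical and are not taken from the TileDB sources.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical base class: every codec registers a human-readable name.
class Codec {
 public:
  explicit Codec(std::string name) : name_(std::move(name)) {
    assert(!name_.empty() && "every codec must register a name");
  }
  virtual ~Codec() = default;

  const std::string& name() const { return name_; }

  // Illustrative convention: returns 0 on success, non-zero on failure.
  virtual int compress(const unsigned char* in, std::size_t in_size) = 0;

 private:
  std::string name_;
};

// Stand-in codec that always fails, used only to exercise the error path.
class FailingCodec : public Codec {
 public:
  FailingCodec() : Codec("FAILING") {}
  int compress(const unsigned char*, std::size_t) override { return -1; }
};

// Hypothetical fragment-write helper: on failure, the returned message
// names the codec, so the log shows which algorithm was involved.
// Returns an empty string on success.
std::string write_tile(Codec& codec, const unsigned char* data,
                       std::size_t size) {
  if (codec.compress(data, size) != 0) {
    return "Could not compress tile with codec " + codec.name();
  }
  return "";
}
```

Calling `write_tile` with a `FailingCodec` yields a message ending in the codec name, which is the kind of output the registered names are meant to enable.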

For Azure Blob Support to handle large GenomicsDB datasets:

Implemented write-once semantics, in line with what we do for s3 and gcs: data is written only once, in close_file, and sync paths are no-ops in this scenario.
Added file size checking after commit, in debug mode only: we stash the file sizes while writing and then confirm them against cloud storage after committing the files.
Stopped using directory markers in AzureBlob to maintain folders/directories. AzureBlob::create_dir() is now a no-op, since directories don't need to exist for files to be created. As a result, create_file(), write_to_file(), etc. no longer perform is_dir() checks.
Moved all common cloud functionality from the individual implementations into StorageCloudFS in storage_fs.h.
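The write-once behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this pull request: `FakeBlobStore` and `WriteOnceFS` are hypothetical, the "cloud" is an in-memory map, and the debug-only size check is modelled with `assert`, which is compiled out when `NDEBUG` is defined.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for a blob container.
struct FakeBlobStore {
  std::map<std::string, std::string> blobs;
  int upload_count = 0;

  void upload(const std::string& path, const std::string& data) {
    blobs[path] = data;
    ++upload_count;
  }

  std::size_t size_of(const std::string& path) const {
    auto it = blobs.find(path);
    return it == blobs.end() ? 0 : it->second.size();
  }
};

// Hypothetical filesystem wrapper with write-once semantics.
class WriteOnceFS {
 public:
  explicit WriteOnceFS(FakeBlobStore& store) : store_(store) {}

  // Directories need not exist before blobs are created, so this is a no-op.
  int create_dir(const std::string&) { return 0; }

  // Writes only append to a local buffer; nothing is uploaded yet.
  int write_to_file(const std::string& path, const void* buf, std::size_t n) {
    pending_[path].append(static_cast<const char*>(buf), n);
    return 0;
  }

  // Sync is a no-op because the only upload happens in close_file().
  int sync_path(const std::string&) { return 0; }

  // The single upload for a file happens here.
  int close_file(const std::string& path) {
    auto it = pending_.find(path);
    if (it == pending_.end()) return 0;  // nothing was buffered
    const std::size_t expected = it->second.size();  // size stashed at write time
    store_.upload(path, it->second);
    pending_.erase(it);
    // Debug-only check: the committed size must match what was written.
    assert(store_.size_of(path) == expected);
    (void)expected;  // avoids an unused-variable warning in release builds
    return 0;
  }

 private:
  FakeBlobStore& store_;
  std::map<std::string, std::string> pending_;
};
```

With this shape, any number of `write_to_file` and `sync_path` calls leaves `upload_count` at zero, and `close_file` raises it to exactly one per file.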

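One of the commits in this pull request replaces hand-maintained mutexes with std::once_flag/call_once for the dynamic loading of codecs. A minimal sketch of that pattern is shown here; the class `LazyCodecLibrary` and its `load()` placeholder are hypothetical, and the real loading work (for example, opening a shared library) is omitted.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical lazy loader: the expensive load step runs at most once,
// no matter how many callers ask for it.
class LazyCodecLibrary {
 public:
  // Safe to call from multiple threads: std::call_once guarantees that
  // exactly one caller runs load() and the others wait until it finishes.
  void ensure_loaded() {
    std::call_once(loaded_flag_, [this] { load(); });
  }

  int load_count() const { return load_count_; }

 private:
  // Placeholder for the real work of locating and loading codec symbols.
  void load() {
    assert(load_count_ == 0 && "load() must run at most once");
    ++load_count_;
  }

  std::once_flag loaded_flag_;
  int load_count_ = 0;
};
```

Repeated calls to `ensure_loaded()` leave `load_count()` at 1, without an explicit mutex or a separate "already loaded" flag to keep in sync.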

codecov-commenter · 2021-08-23T21:17:37Z

Codecov Report

Merging #107 (c576cbb) into develop (228be6a) will decrease coverage by 0.03%.
The diff coverage is 71.55%.

@@             Coverage Diff             @@
##           develop     #107      +/-   ##
===========================================
- Coverage    62.65%   62.61%   -0.04%     
===========================================
  Files           59       60       +1     
  Lines        17697    17703       +6     
===========================================
- Hits         11088    11085       -3     
- Misses        6609     6618       +9

Impacted Files	Coverage Δ
core/include/codec/codec_rle.h	`0.00% <0.00%> (ø)`
core/include/storage_manager/storage_gcs.h	`40.00% <ø> (ø)`
core/include/storage_manager/storage_s3.h	`70.00% <ø> (ø)`
core/src/fragment/read_state.cc	`69.63% <0.00%> (ø)`
core/src/storage_manager/storage_fs.cc	`82.14% <ø> (+5.47%)`	⬆️
core/src/storage_manager/storage_gcs.cc	`69.84% <ø> (+0.22%)`	⬆️
core/src/storage_manager/storage_s3.cc	`80.71% <ø> (+0.18%)`	⬆️
core/src/fragment/write_state.cc	`67.44% <40.00%> (-0.06%)`	⬇️
core/include/storage_manager/storage_azure_blob.h	`72.13% <62.50%> (-3.43%)`	⬇️
core/src/storage_manager/storage_azure_blob.cc	`74.61% <67.92%> (-2.64%)`	⬇️
... and 9 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 228be6a...c576cbb.


mlathara · 2021-08-26T04:31:16Z

These changes look good to me, but I wonder if we should give the azure benchmarks a whirl with these changes. Not that I expect any perf changes necessarily, but given that we didn't get a smoking gun for the issues on there that might serve as a good stress test to make sure the refactoring doesn't uncover any issues with the azure client. What do you think @nalinigans?

nalinigans · 2021-08-26T16:31:55Z

Agreed. I have pulled these changes into GenomicsDB; see this branch: https://github.com/GenomicsDB/GenomicsDB/tree/ng_debug_azure_0822. @aoblebea and I chatted yesterday about testing these changes. In the meantime, I am testing the import of the tcga dataset and the workspace from Azure Blob on my laptop at home, to take advantage of the faster network.

aoblebea · 2021-09-03T20:14:41Z

The GenomicsDB branch (ng_debug_azure_0822) which uses this pull request ran fine on Azure (except for the no compression case).

mlathara

looks good

nalinigans added 4 commits August 2, 2021 15:51

Use std::once_flag/call_once instead of maintaining mutexes for dynamic loading of codecs and allow zstd to use compress/decompress with context

40fb8b8

Debug azure failures with importing large GenomicsDB datasets

05378e8

Merge with develop

b870671

Register names for all codecs and cleanup debug logging for azure blobs

56af96b

nalinigans requested review from aoblebea and mlathara August 23, 2021 21:47

nalinigans added 5 commits August 24, 2021 21:50

Refactor out cloud common functionality to get azure support to be on par with s3/gcs

176bbda

Include string.h for memset

961f646

Fix failures on azurite, azurite seems to be behaving a little differently than az blob storage on empty folders

4ff8891

Fix failures on azurite, azurite seems to be behaving a little differently than az blob storage on empty folders

90346e1

Add to azure blob storage unit tests

c576cbb

mlathara approved these changes Sep 3, 2021

nalinigans mentioned this pull request Sep 3, 2021

No compression scenarios with TileDB_IO_READ with tiledb context do not work #108

Open

nalinigans merged commit da33b67 into develop Sep 3, 2021

nalinigans deleted the ng_debug_azure_build_0822 branch September 3, 2021 23:55
