Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to get Azure Blob Support work with large GenomicsDB datasets #107

Merged
merged 9 commits into from
Sep 3, 2021

Conversation

nalinigans
Copy link
Collaborator

@nalinigans nalinigans commented Aug 23, 2021

The significant codec changes are to register names specifically for the supported algorithms. This allows fragment reads/writes to output the name of the codec in case of compression errors. Also, changed all tabs to spaces in the codec code for readability for consistency.

For Azure Blob Support to handle large GenomicsDB datasets:

  • Implemented write-once semantics to bring it in line with what we do for s3 and gcs. Basically, we only write once in close_file. Sync paths are no-ops in this scenario.
  • Added file size checking after commit only in debug mode. Basically, we stash away the file sizes while writing and then confirm the file sizes from cloud storage after committing the files.
  • Removed using directory markers in AzureBlob for maintaining folders/directories. AzureBlob::create_dir() is now a no-op as directories don't need to exist to create files, etc. As a result, we don't perform any is_dir() checks in all create_file(), write_to_file(), etc.
  • Moved all common cloud functionality from individual implementations into StorageCloudFS in storage_fs.h.

@codecov-commenter
Copy link

codecov-commenter commented Aug 23, 2021

Codecov Report

Merging #107 (c576cbb) into develop (228be6a) will decrease coverage by 0.03%.
The diff coverage is 71.55%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #107      +/-   ##
===========================================
- Coverage    62.65%   62.61%   -0.04%     
===========================================
  Files           59       60       +1     
  Lines        17697    17703       +6     
===========================================
- Hits         11088    11085       -3     
- Misses        6609     6618       +9     
Impacted Files Coverage Δ
core/include/codec/codec_rle.h 0.00% <0.00%> (ø)
core/include/storage_manager/storage_gcs.h 40.00% <ø> (ø)
core/include/storage_manager/storage_s3.h 70.00% <ø> (ø)
core/src/fragment/read_state.cc 69.63% <0.00%> (ø)
core/src/storage_manager/storage_fs.cc 82.14% <ø> (+5.47%) ⬆️
core/src/storage_manager/storage_gcs.cc 69.84% <ø> (+0.22%) ⬆️
core/src/storage_manager/storage_s3.cc 80.71% <ø> (+0.18%) ⬆️
core/src/fragment/write_state.cc 67.44% <40.00%> (-0.06%) ⬇️
core/include/storage_manager/storage_azure_blob.h 72.13% <62.50%> (-3.43%) ⬇️
core/src/storage_manager/storage_azure_blob.cc 74.61% <67.92%> (-2.64%) ⬇️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 228be6a...c576cbb. Read the comment docs.

@mlathara
Copy link
Contributor

These changes look good to me, but I wonder if we should give the azure benchmarks a whirl with these changes. Not that I expect any perf changes necessarily, but given that we didn't get a smoking gun for the issues on there that might serve as a good stress test to make sure the refactoring doesn't uncover any issues with the azure client. What do you think @nalinigans?

@nalinigans
Copy link
Collaborator Author

These changes look good to me, but I wonder if we should give the azure benchmarks a whirl with these changes. Not that I expect any perf changes necessarily, but given that we didn't get a smoking gun for the issues on there that might serve as a good stress test to make sure the refactoring doesn't uncover any issues with the azure client. What do you think @nalinigans?

Agreed. Have pulled in these changes into GenomicsDB - check this branch https://github.com/GenomicsDB/GenomicsDB/tree/ng_debug_azure_0822. @aoblebea and I had chatted yesterday about testing these changes. I am testing importing the tcga dataset and the workspace from Azure Blob from my laptop from home to use the faster network meanwhile.

@aoblebea
Copy link

aoblebea commented Sep 3, 2021

These changes look good to me, but I wonder if we should give the azure benchmarks a whirl with these changes. Not that I expect any perf changes necessarily, but given that we didn't get a smoking gun for the issues on there that might serve as a good stress test to make sure the refactoring doesn't uncover any issues with the azure client. What do you think @nalinigans?

The GenomicsDB branch (ng_debug_azure_0822) which uses this pull request ran fine on Azure (except for the no compression case).

Copy link
Contributor

@mlathara mlathara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants