Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(1/7) RFC 79: Implement incremental upload of tab delimited data. #45

Merged
merged 27 commits into from
Jun 19, 2024

Conversation

forus
Copy link
Contributor

@forus forus commented Jun 4, 2024

This loader covers the following data types:

  • EXPRESSION
  • CNA_DISCRETE
  • CNA_CONTINUOUS
  • CNA_LOG2
  • METHYLATION
  • PROTEIN
  • GENERIC_ASSAY_CONTINUOUS (sample level)
  • GENERIC_ASSAY_BINARY (sample level)
  • GENERIC_ASSAY_CATEGORICAL (sample level)

Note: We do not support incremental upload of GSVA data although tab-delimited uploader covers this type with non-interactive upload. The incremental upload for GSVA data will throw an error.
We decided to exclude GSVA as there is unclarity from our part on whether GSVA scores of existing samples need to be recalculated when a new sample is added to the data.

This PR makes the tab. delimited data loader run in a single transaction to escape corruption of tsv rows (e.g. sample has been added to the header, but not to the info rows).

The first commit (b2c1c21) contains the code improvements. It could be handy to review it first.

=======
Please review #44 next.

Please note, to avoid the costly process of rebasing all pull requests (PRs), all feedback will be incorporated into the final PR.

forus added 27 commits May 15, 2024 11:27
- Calculate number of lines in the file in the loader
- Remove unused imports and fields
- Reuse constructors
- Reuse common parsing logic in tab delimiter importer
- Show full stacktrace which helps in dinding where tests errored out
It's dangerous as we would further mess up the data in the row
Do it for all data types, not only MAF
- Split big tab. delim test to multiple tests based on data type.

- Use ImportProfileData instead of ImportTabDelimData for testing.
  - We cover more logic with such tests.
  - This is more stable interface. ImportTabDelimData can be refactored.
args variable was not declared
Copy link
Member

@pieterlukasse pieterlukasse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @forus , amazing work 👍
I do have some questions on sections that I would like to double-check with you, and some naming improvements. Please take a look.

@forus forus requested a review from pieterlukasse June 11, 2024 19:14
Copy link
Collaborator

@haynescd haynescd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@forus forus merged commit 52714d6 into rfc79 Jun 19, 2024
5 of 6 checks passed
@forus forus deleted the inc-tab-delimited-uploader branch June 19, 2024 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants