Handle very large group-based metadata annotations in ingest (SCP-3761) #221
eweitz
left a comment
Good stuff! I note a trivial blocker regarding maintainability, and a few nits.
ingest/cell_metadata.py
Outdated
    group = True if annot_type == "group" else False
    # should not store annotations with >200 unique values for viz
    # annot_header is the column of data, which includes name and type
    # large is any annotation with more than 200 + 2 unique values
    large = True if len(list(self.file[annot_header].unique())) > 202 else False
The True if... else False construct is redundant -- the result of the inner expression is itself a Boolean, and should simply be assigned as such.
Also, unpacking this to have a unique_values variable would help readability farther down.
Suggested change:
-   group = True if annot_type == "group" else False
-   # should not store annotations with >200 unique values for viz
-   # annot_header is the column of data, which includes name and type
-   # large is any annotation with more than 200 + 2 unique values
-   large = True if len(list(self.file[annot_header].unique())) > 202 else False
+   group = annot_type == "group"
+   # should not store annotations with >200 unique values for viz
+   # annot_header is the column of data, which includes name and type
+   # large is any annotation with more than 200 + 2 unique values
+   unique_values = list(self.file[annot_header].unique())
+   large = len(unique_values) > 202
Thanks for the readability improvement with unique_values, very helpful.
The True if... else False is necessary for the if large and group conditional further down. Without else False, the variable may be unassigned and the conditional fails to evaluate successfully. Happy to discuss if I've misinterpreted the comment.
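For reference on the point under discussion, here is a minimal sketch in plain Python. The `classify` helper and its parameters are hypothetical and not part of the ingest code. It compares the ternary form with the direct assignment and checks that both always produce a `bool`, since a comparison expression evaluates to `True` or `False` in every case and so the name is always bound:

```python
def classify(annot_type, n_unique, threshold=202):
    """Hypothetical helper comparing the two assignment styles from the review."""
    # Ternary form, as in the original diff
    group_ternary = True if annot_type == "group" else False
    large_ternary = True if n_unique > threshold else False
    # Direct form, as in the suggested change
    group = annot_type == "group"
    large = n_unique > threshold
    return (group_ternary, large_ternary), (group, large)


for annot_type, n_unique in [("group", 250), ("group", 3), ("numeric", 250)]:
    ternary, direct = classify(annot_type, n_unique)
    # Both styles agree, and the direct form is already a bool
    assert ternary == direct
    assert all(isinstance(flag, bool) for flag in direct)
```

Under these assumptions, a later `if large and group` conditional behaves identically with either style.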
ingest/cell_metadata.py
Outdated
    "values": list(self.file[annot_header].unique())
    if annot_type == "group"
    else [],
Reuse variables from above for readability, and to avoid the dreaded multi-line ternary.
Suggested change:
-   "values": list(self.file[annot_header].unique())
-   if annot_type == "group"
-   else [],
+   "values": unique_values if group else [],
Adjusting here to fit Jon's original conception - ingest all metadata, just don't store values for large group.
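One way to read this adjustment, as a sketch only: every annotation still gets an entry, but its `values` list is left empty when a group annotation is large. The `build_annotation_entry` function, its parameters, the dictionary keys other than `values`, and the plain list standing in for the pandas column are all assumptions for illustration, not the code merged in this PR. The 202 threshold mirrors the diff above.

```python
def build_annotation_entry(name, annot_type, column, threshold=202):
    """Hypothetical sketch; `column` stands in for self.file[annot_header]."""
    # Order-preserving de-duplication, a plain-Python stand-in for pandas' unique()
    unique_values = list(dict.fromkeys(column))
    group = annot_type == "group"
    large = len(unique_values) > threshold
    return {
        "name": name,
        "annotation_type": annot_type,
        # Keep the entry, but store values only for group annotations that are not large
        "values": unique_values if group and not large else [],
    }


small = build_annotation_entry("cluster", "group", ["a", "b", "a"])
big = build_annotation_entry("barcodekey", "group", [f"bc{i}" for i in range(300)])
```

In this sketch `small["values"]` keeps its two unique labels, while `big` is still returned as an entry with an empty `values` list.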
Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>
Codecov Report
@@               Coverage Diff               @@
##           development     #221      +/-   ##
===============================================
+ Coverage      71.57%   71.64%   +0.06%
===============================================
  Files             26       26
  Lines           3092     3103      +11
===============================================
+ Hits            2213     2223      +10
- Misses           879      880       +1
bistline
left a comment
Looks good - we can merge in this release as a part of SCP-3889
Metadata files may have group annotations with many unique values. These may be useful for analysis, or it may simply be easier for the study owner not to have to remove them from the analysis file. However, annotations with >200 unique values have not proven useful for visualization, so their values do not need to be stored in Mongo. This PR avoids loading to Mongo the values of any group annotation in the metadata file that has >200 unique values.
To test, set your local instance "Ingest Pipeline Docker Image" configuration to use the following image:
gcr.io/broad-singlecellportal-staging/scp-ingest-jlc_skip_lg_grp_metadata:a72f76b
and upload the metadata file found at:
https://github.com/broadinstitute/scp-ingest-pipeline/blob/jlc_skip_lg_grp_metadata/tests/data/annotation/metadata/convention/lg_grp_metadata_to_skip.txt
On the Study details page, the metadata annotation 'barcodekey' is listed and has no values.
In the ingest email, barcodekey is listed as an entry but has no associated values in the listing:
barcodekey: group
This addresses SCP-3761