Skip to content

Conversation

@jlchang
Copy link
Contributor

@jlchang jlchang commented Nov 10, 2021

Metadata files may have group annotations with many unique values - they may have worth for analysis (or it's just easier for the study owner if they do not need to be removed from the analysis file). but >200 values has not proven to be useful for visualization, therefore, the values do not need to be stored in Mongo. This PR avoids loading to Mongo any group annotations in the metadata file that have >200 unique values.

To test, set your local instance "Ingest Pipeline Docker Image" configuration to use the following image:
gcr.io/broad-singlecellportal-staging/scp-ingest-jlc_skip_lg_grp_metadata:a72f76b

and upload the metadata file found at:
https://github.com/broadinstitute/scp-ingest-pipeline/blob/jlc_skip_lg_grp_metadata/tests/data/annotation/metadata/convention/lg_grp_metadata_to_skip.txt

In the Study details page the metadata 'barcodekey' is listed and has no values.

In the ingest email, barcodekey is listed as a entry but has no associated values in the listing:
barcodekey: group

This addresses SCP-3761

Copy link
Member

@eweitz eweitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff! I note a trivial blocker regarding maintainability, and a few nits.

Comment on lines 154 to 158
group = True if annot_type == "group" else False
# should not store annotations with >200 unique values for viz
# annot_header is the column of data, which includes name and type
# large is any annotation with more than 200 + 2 unique values
large = True if len(list(self.file[annot_header].unique())) > 202 else False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The True if... else False construct is redundant -- the result of the inner expression is itself a Boolean, and should simply be assigned as such.

Also, unpacking this to have a unique_values variable would help readability farther down.

Suggested change
group = True if annot_type == "group" else False
# should not store annotations with >200 unique values for viz
# annot_header is the column of data, which includes name and type
# large is any annotation with more than 200 + 2 unique values
large = True if len(list(self.file[annot_header].unique())) > 202 else False
group = annot_type == "group"
# should not store annotations with >200 unique values for viz
# annot_header is the column of data, which includes name and type
# large is any annotation with more than 200 + 2 unique values
unique_values = list(self.file[annot_header].unique())
large = len(unique_values) > 202

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the readability improvement with unique_values, very helpful.

The True if... else False is necessary for the if large and group conditional further down. Without else False, the variable may be unassigned and the conditional fails to evaluate successfully. Happy to discuss if I've mis-interpreted the comment.

Comment on lines 173 to 175
"values": list(self.file[annot_header].unique())
if annot_type == "group"
else [],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reuse variables from above for readability, and to avoid the dreaded multi-line ternary.

Suggested change
"values": list(self.file[annot_header].unique())
if annot_type == "group"
else [],
"values": unique_values if group else [],

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusting here to fit Jon's original conception - ingest all metadata, just don't store values for large group.

jlchang and others added 3 commits November 15, 2021 08:25
Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>
Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>
@codecov
Copy link

codecov bot commented Nov 15, 2021

Codecov Report

Merging #221 (a72f76b) into development (51cbd58) will increase coverage by 0.06%.
The diff coverage is 91.66%.

Impacted file tree graph

@@               Coverage Diff               @@
##           development     #221      +/-   ##
===============================================
+ Coverage        71.57%   71.64%   +0.06%     
===============================================
  Files               26       26              
  Lines             3092     3103      +11     
===============================================
+ Hits              2213     2223      +10     
- Misses             879      880       +1     
Impacted Files Coverage Δ
ingest/cell_metadata.py 76.84% <87.50%> (+0.98%) ⬆️
ingest/validation/validate_metadata.py 77.63% <100.00%> (+0.10%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 51cbd58...a72f76b. Read the comment docs.

@jlchang jlchang requested a review from eweitz November 15, 2021 16:02
Copy link
Contributor

@bistline bistline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - we can merge in this release as a part of SCP-3889

@jlchang jlchang merged commit fb8cc0a into development Nov 15, 2021
@jlchang jlchang deleted the jlc_skip_lg_grp_metadata branch November 15, 2021 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants