Handle very large group-based metadata annotations in ingest (SCP-3761) #221

jlchang · 2021-11-10T18:58:15Z

Metadata files may have group annotations with many unique values. These may be worth keeping for analysis (or it may simply be easier for the study owner if they do not need to be removed from the analysis file), but annotations with >200 unique values have not proven useful for visualization, so their values do not need to be stored in Mongo. This PR avoids loading into Mongo the values of any group annotation in the metadata file that has >200 unique values.
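
The threshold rule described above can be sketched in a few lines. This is a minimal illustration rather than the code in this PR: the function name `is_large_group` and the constant `MAX_UNIQUE_VALUES` are hypothetical, and plain Python lists stand in for the metadata file's columns.

```python
MAX_UNIQUE_VALUES = 200  # threshold stated in the PR description


def is_large_group(annot_type, values, limit=MAX_UNIQUE_VALUES):
    """Return True for a group annotation with more than `limit` unique values."""
    return annot_type == "group" and len(set(values)) > limit


# Exactly 200 unique values is still small enough to store; 201 is not.
at_limit = [f"cell_{i}" for i in range(200)]
over_limit = [f"cell_{i}" for i in range(201)]

print(is_large_group("group", at_limit))             # False
print(is_large_group("group", over_limit))           # True
print(is_large_group("numeric", list(range(1000))))  # False: only group annotations qualify
```

The comparison is strict (`>`), which matches the stated boundary: an annotation with exactly 200 unique values is stored.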

To test, set your local instance "Ingest Pipeline Docker Image" configuration to use the following image:
gcr.io/broad-singlecellportal-staging/scp-ingest-jlc_skip_lg_grp_metadata:a72f76b

and upload the metadata file found at:
https://github.com/broadinstitute/scp-ingest-pipeline/blob/jlc_skip_lg_grp_metadata/tests/data/annotation/metadata/convention/lg_grp_metadata_to_skip.txt

On the Study details page, the metadata annotation 'barcodekey' is listed and has no values.

In the ingest email, barcodekey is listed as an entry but has no associated values in the listing:
barcodekey: group

This addresses SCP-3761

exactly 200 unique _is_ stored

eweitz

Good stuff! I note a trivial blocker regarding maintainability, and a few nits.

eweitz · 2021-11-12T21:11:18Z

ingest/cell_metadata.py

+            group = True if annot_type == "group" else False
+            # should not store annotations with >200 unique values for viz
+            # annot_header is the column of data, which includes name and type
+            # large is any annotation with more than 200 + 2 unique values
+            large = True if len(list(self.file[annot_header].unique())) > 202 else False


The True if... else False construct is redundant -- the result of the inner expression is itself a Boolean, and should simply be assigned as such.

Also, unpacking this to have a unique_values variable would help readability farther down.

Suggested change

-            group = True if annot_type == "group" else False
-            # should not store annotations with >200 unique values for viz
-            # annot_header is the column of data, which includes name and type
-            # large is any annotation with more than 200 + 2 unique values
-            large = True if len(list(self.file[annot_header].unique())) > 202 else False
+            group = annot_type == "group"
+            # should not store annotations with >200 unique values for viz
+            # annot_header is the column of data, which includes name and type
+            # large is any annotation with more than 200 + 2 unique values
+            unique_values = list(self.file[annot_header].unique())
+            large = len(unique_values) > 202

jlchang · 2021-11-15

Thanks for the readability improvement with unique_values, very helpful.

The True if... else False is necessary for the if large and group conditional further down. Without else False, the variable may be unassigned and the conditional fails to evaluate successfully. Happy to discuss if I've mis-interpreted the comment.
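
For readers following this exchange, here is a small sketch of how the two forms behave in Python. A comparison such as `==` or `>` already evaluates to a `bool`, so the name is bound in either form. The list below is a hypothetical stand-in for the pandas column, and it assumes, per the code comment in the diff, that the column's first two entries are the annotation name and type (hence the `+ 2`).

```python
# Hypothetical stand-in for self.file[annot_header]: name row, type row, then values.
column = ["barcodekey", "group"] + [f"bc_{i}" for i in range(201)]
annot_type = "group"

# Order-preserving de-duplication, standing in for pandas' .unique().
unique_values = list(dict.fromkeys(column))

# Form from the original diff.
large_ternary = True if len(unique_values) > 202 else False
group_ternary = True if annot_type == "group" else False

# Form from the suggested change.
large = len(unique_values) > 202
group = annot_type == "group"

# Both forms yield the same bool objects, and both are always assigned.
assert large is large_ternary and group is group_ternary
assert isinstance(large, bool) and isinstance(group, bool)

if large and group:
    print("large group annotation: values would not be stored")
```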

eweitz · 2021-11-12T21:23:13Z

ingest/cell_metadata.py

+                            "values": list(self.file[annot_header].unique())
+                            if annot_type == "group"
+                            else [],


Reuse variables from above for readability, and to avoid the dreaded multi-line ternary.

Suggested change

-                            "values": list(self.file[annot_header].unique())
-                            if annot_type == "group"
-                            else [],
+                            "values": unique_values if group else [],

jlchang · 2021-11-15

Adjusting here to fit Jon's original conception: ingest all metadata, just don't store values for large group annotations.
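
A minimal sketch of the behavior described here: every annotation still gets a metadata entry, but `values` is left empty for numeric annotations and for large group annotations. The function name `build_entry` and the `name` and `annotation_type` keys are hypothetical; only the `values` key appears in the diff above. For simplicity the sketch takes the unique values without the two header rows, so it compares against 200 rather than 202.

```python
def build_entry(name, annot_type, unique_values, limit=200):
    """Build one metadata entry; leave values empty when they are not useful for viz."""
    group = annot_type == "group"
    large = len(unique_values) > limit
    return {
        "name": name,
        "annotation_type": annot_type,
        # Keep values only for group annotations at or under the limit.
        "values": unique_values if group and not large else [],
    }


small = build_entry("cluster", "group", ["A", "B", "C"])
big = build_entry("barcodekey", "group", [f"bc_{i}" for i in range(500)])

print(small["values"])  # ['A', 'B', 'C']
print(big["values"])    # []
```

The large annotation is still present as an entry, which matches the test note in the description that 'barcodekey' is listed with no values.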

tests/test_cell_metadata.py

Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>

codecov · 2021-11-15T14:08:31Z

Codecov Report

Merging #221 (a72f76b) into development (51cbd58) will increase coverage by 0.06%.
The diff coverage is 91.66%.

@@               Coverage Diff               @@
##           development     #221      +/-   ##
===============================================
+ Coverage        71.57%   71.64%   +0.06%     
===============================================
  Files               26       26              
  Lines             3092     3103      +11     
===============================================
+ Hits              2213     2223      +10     
- Misses             879      880       +1

Impacted Files                            Coverage Δ
ingest/cell_metadata.py                   76.84% <87.50%> (+0.98%) ⬆️
ingest/validation/validate_metadata.py    77.63% <100.00%> (+0.10%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 51cbd58...a72f76b.

bistline

Looks good - we can merge in this release as a part of SCP-3889

jlchang added 5 commits November 8, 2021 13:17

exclude large group annotations in metadata ingest

b1f3a4b

Use appropriate cell count descriptor

a341633

add test, >200 uniq not stored

00047c6

exactly 200 unique _is_ stored

add data for lg group skip test

953e254

fix file for lg group skip test

7a71ba3

jlchang requested review from bistline, devonbush, ehanna4 and eweitz November 10, 2021 18:58

eweitz requested changes Nov 12, 2021

jlchang and others added 3 commits November 15, 2021 08:25

Update ingest/cell_metadata.py

c72f561

Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>

Update tests/test_cell_metadata.py

fdd758f

Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>

update test file name to match improved test name

a26d891

eweitz approved these changes Nov 15, 2021

adjust to store metadata, just not values if large

a72f76b

jlchang requested a review from eweitz November 15, 2021 16:02

bistline approved these changes Nov 15, 2021

jlchang merged commit fb8cc0a into development Nov 15, 2021

jlchang deleted the jlc_skip_lg_grp_metadata branch November 15, 2021 18:48
