# SCP-2635 Ensure cell names from expression matrices are unique across files #124
778e43f to
fb81c6d
Compare
Codecov Report
@@ Coverage Diff @@
## master #124 +/- ##
==========================================
+ Coverage 51.82% 52.31% +0.49%
==========================================
Files 22 22
Lines 2580 2609 +29
==========================================
+ Hits 1337 1365 +28
- Misses 1243 1244 +1
| row, | ||
| query_params=(self.study_id, self.mongo_connection._client), | ||
| ): | ||
| raise ValueError("Dense matrix has invalid format") |
Is there any way for the user to know what the specific issue with the file was? Or will they get the same error message for both a malformed header and a duplicate gene name?
Tracked here
| DenseIngestor.has_unique_header(header), | ||
| DenseIngestor.has_gene_keyword(header, row), | ||
| DenseIngestor.header_has_valid_values(header), | ||
| GeneExpression.has_unique_cells(header, *query_params), |
This method needs a way to return the specific problem. I recommend changing it so it returns a tuple of (boolean, string), where the boolean is true/false for valid/invalid, and the string is the error message if invalid.
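A minimal sketch of that return shape, for illustration only. The standalone `has_unique_header` and `validate` functions below are hypothetical stand-ins, not the `DenseIngestor` methods in this PR, and the message wording is an assumption.

```python
from typing import List, Tuple


def has_unique_header(header: List[str]) -> Tuple[bool, str]:
    """Hypothetical check: return (is_valid, error_message).

    The message is an empty string when the header is valid.
    """
    seen = set()
    duplicates = []
    for name in header:
        # Record each repeated value once, in order of first repetition.
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    if duplicates:
        return False, f"Duplicate header values: {', '.join(duplicates)}"
    return True, ""


def validate(header: List[str]) -> List[str]:
    """Run each check and collect the messages of the ones that fail."""
    checks = [has_unique_header(header)]
    return [message for is_valid, message in checks if not is_valid]
```

With this shape the caller can raise a `ValueError` that carries the collected messages instead of one generic "invalid format" string.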
Tracked here
| for cell_values in query_results | ||
| for values in cell_values.get("values") | ||
| ] | ||
| return not any(name in existing_cells for name in cell_names) |
I would use set intersection here so you can identify the total number and first few duplicates. set(lst1) & set(lst2)
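A rough sketch of the set-intersection idea, assuming both inputs are plain collections of cell-name strings. `find_duplicate_cells` and its `max_shown` parameter are hypothetical names used only for illustration.

```python
from typing import Iterable, List, Tuple


def find_duplicate_cells(
    cell_names: Iterable[str],
    existing_cells: Iterable[str],
    max_shown: int = 3,
) -> Tuple[int, List[str]]:
    """Return the number of shared cell names and the first few of them."""
    # Set intersection keeps only names present in both collections.
    duplicates = set(cell_names) & set(existing_cells)
    # Sort so the reported sample is stable from run to run.
    return len(duplicates), sorted(duplicates)[:max_shown]
```

The count and sample can then be formatted into the error message, so the user sees which cells collide rather than only that the check failed.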
I believe you took care of this
bistline
left a comment
Looks good - I do agree that we should return the list of repeated cells as the error message, but nothing outside that comment.
| COLLECTION_NAME = "data_arrays" | ||
| query = { | ||
| "$and": [ | ||
| {"linear_data_type": "Study"}, |
Not strictly necessary, but you could add {"linear_data_id": study_id}.
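As a sketch, the filter with that extra condition might look like the following. Only the two conditions quoted in this thread are included; any other conditions in the real query are not shown in the diff and are left out here. `build_existing_cells_query` is a hypothetical helper name.

```python
from typing import Any, Dict


def build_existing_cells_query(study_id: Any) -> Dict[str, Any]:
    """Build a MongoDB filter that is scoped to a single study."""
    return {
        "$and": [
            {"linear_data_type": "Study"},
            # Suggested addition: restrict matches to this study's documents.
            {"linear_data_id": study_id},
        ]
    }
```

The resulting dict could be passed to a PyMongo collection's `find()` call on the `data_arrays` collection.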
I'm adding this suggestion in another PR.
| for cell_values in query_results | ||
| for values in cell_values.get("values") | ||
| ] | ||
| return not any(name in existing_cells for name in cell_names) |
Nice!
eweitz
left a comment
Looks good, assuming others' concerns are resolved.
| from gcp_mocks import mock_storage_client, mock_storage_blob | ||
|
|
||
| sys.path.append('../ingest') | ||
| sys.path.append("../ingest") |
Quote noise like this makes it hard to see relevant differences.
It could be avoided by using --skip-string-normalization in one's IDE, as in our Git hook.
Or we could remove that Black argument from our hook and our IDE configurations, and double-quote all strings in Python. This would increase our Python's difference from our JavaScript, but ensuring readable diffs with minimal IDE configuration seems more important.
Eno, Jean, and I chatted and we lean toward removing --skip-string-normalization.
Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>
Enabling granular error reporting for expression matrix validation (SCP-2660)
devonbush
left a comment
looks good, pending the small cleanup Eric suggested
Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>
This PR adds a validation that checks whether the cell names in an expression file are unique across files by querying MongoDB.