Skip to content

Conversation

@knapii-developments
Copy link
Contributor

@knapii-developments knapii-developments commented Aug 10, 2020

This PR adds a validation that checks if an expression file has unique cells by querying MongoDB.

@knapii-developments knapii-developments force-pushed the ea-dense-cross-file-validation branch from 778e43f to fb81c6d Compare August 10, 2020 18:00
@codecov
Copy link

codecov bot commented Aug 10, 2020

Codecov Report

Merging #124 into master will increase coverage by 0.49%.
The diff coverage is 96.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #124      +/-   ##
==========================================
+ Coverage   51.82%   52.31%   +0.49%     
==========================================
  Files          22       22              
  Lines        2580     2609      +29     
==========================================
+ Hits         1337     1365      +28     
- Misses       1243     1244       +1     
Impacted Files Coverage Δ
ingest/expression_files/dense_ingestor.py 96.09% <94.11%> (-0.40%) ⬇️
ingest/expression_files/expression_files.py 81.33% <100.00%> (+4.66%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d8454bb...c8804d7. Read the comment docs.

@knapii-developments knapii-developments marked this pull request as ready for review August 10, 2020 18:06
row,
query_params=(self.study_id, self.mongo_connection._client),
):
raise ValueError("Dense matrix has invalid format")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way for the user to know what the specific issue with the file was? Or will they get the same error message for both a malformed header and a duplicate gene name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked here

DenseIngestor.has_unique_header(header),
DenseIngestor.has_gene_keyword(header, row),
DenseIngestor.header_has_valid_values(header),
GeneExpression.has_unique_cells(header, *query_params),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method needs a way to return the specific problem -- I recommend changing the method so it returns a tuple of [boolean, string], where the boolean is true/false for valid /invalid, and the string is the error message if invalid

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked here

for cell_values in query_results
for values in cell_values.get("values")
]
return not any(name in existing_cells for name in cell_names)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use set intersection here so you can identify the total number and first few duplicates. set(lst1) & set(lst2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you took care of this

Copy link
Contributor

@bistline bistline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - I do agree that we should return the list of repeated cells as the error message, but nothing outside that comment.

COLLECTION_NAME = "data_arrays"
query = {
"$and": [
{"linear_data_type": "Study"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly necessary, but you could add {"linear_data_id": study_id}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im adding this suggestion in another PR.

for cell_values in query_results
for values in cell_values.get("values")
]
return not any(name in existing_cells for name in cell_names)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Copy link
Member

@eweitz eweitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, assuming others' concerns are resolved.

from gcp_mocks import mock_storage_client, mock_storage_blob

sys.path.append('../ingest')
sys.path.append("../ingest")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quote noise like this makes it hard to see relevant differences.

It could be avoided by using --skip-string-normalization in one's IDE, as in our Git hook.

Or we could remove that Black argument from our hook and our IDE configurations, and double-quote all strings in Python. This would increase our Python's difference from our JavaScript, but ensuring readable diffs with minimal IDE configuration seems more important.

Eno, Jean, and I chatted and we lean toward removing --skip-string-normalization.

devonbush and others added 2 commits August 11, 2020 10:33
Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>
Enabling granular error reporting for expression matrix validation (SCP-2660)
Copy link
Contributor

@devonbush devonbush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, pending the small cleanup Eric suggested

@knapii-developments knapii-developments merged commit 4bc936b into master Aug 12, 2020
@knapii-developments knapii-developments deleted the ea-dense-cross-file-validation branch August 12, 2020 12:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants