-
Notifications
You must be signed in to change notification settings - Fork 0
* SCP-2567 Refactor and add tests for dense matrices & expression files #134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## development #134 +/- ##
===============================================
+ Coverage 64.26% 68.04% +3.77%
===============================================
Files 22 22
Lines 2622 2688 +66
===============================================
+ Hits 1685 1829 +144
+ Misses 937 859 -78
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really like the refactor! The only changes I would request is to be consistent in the placement of da_kwargs when calling GeneExpression.create_data_array, and then to change the type for gene_id to a string in the Gene model.
|
|
||
| @staticmethod | ||
| def create_gene_model( | ||
| *ignore, name: str, study_file_id, study_id, _id: int, gene_id: int = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gene_id should be a string, not an integer. In this context, we're referring to an ID from a reference collection, such as Ensembl, or Entrez. While Entrez gene IDs are ints, in Ensembl they're strings like ENSG00000139618.
| data_arrays = [] | ||
| for all_cell_model in self.set_data_array_cells(self.header, ObjectId()): | ||
| for all_cell_model in GeneExpression.create_data_array( | ||
| **self.da_kwargs, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why the keyword arguments are passed first here, but last in other calls to create_data_array? We should be consistent for readability even though they are not positionally dependent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also rename that to data_array_kwargs
eweitz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I made some suggestions and asked a question or two, but I see no blockers.
ingest/expression_files/mtx.py
Outdated
| sys.path.append("../ingest") | ||
| from ingest_files import IngestFiles | ||
| from monitor import trace | ||
| except ImportError: | ||
| # Used when importing as external package, e.g. imports in single_cell_portal code | ||
| # Used when importing as external package, e.g. imports in | ||
| # single_cell_portal code | ||
| from .expression_files import GeneExpression | ||
|
|
||
| sys.path.append("../ingest") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of doing sys.path.append("../ingest") twice -- once in the try and once in the except, a cleaner equivalent would be to do that once outside the try (i.e. the line immediately above try).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will make this suggestion in this PR for mtx.
| f"Expected {expected_barcodes} cells and {expected_genes} genes. " | ||
| f"Got {actual_barcodes} cells and {actual_genes} genes." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great error message.
| """ | ||
| self.genes = [g.strip().strip('"') for g in self.genes_file.readlines()] | ||
| self.cells = [c.strip().strip('"') for c in self.barcodes_file.readlines()] | ||
| self.genes: List[str] = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, what benefit is List[str] here? I see the value of (optionally) using type hints to annotate function signatures, but the value in cases like this is less clear to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just to be more concise. It lets a developer know that it's a list of strings.
| return list(map(int, mtx_dimensions)) | ||
|
|
||
| @staticmethod | ||
| def is_sorted(idx: int, visited_expression_idx: List[int]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed all cases of is_foo() renamed to check_foo() in a recent PR. For consistency, we should use one naming pattern or the other, but not both.
This would be consistent those recent changes:
| def is_sorted(idx: int, visited_expression_idx: List[int]): | |
| def check_sorted(idx: int, visited_expression_idx: List[int]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a check that happens in transform and not in check_valid(). For formatting issues, the rational was that we wanted to know all validation errors prior to ingest which is why "is" changed to "check". If is_sorted() returns false parsing stops and immediately exits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, helpful to know. That rationale seems reasonable.
| current_idx = int(raw_gene_idx) | ||
| gene_id, gene = self.genes[current_idx - 1].split("\t") | ||
| if current_idx != last_idx: | ||
| is_sorted = MTXIngestor.is_sorted(current_idx, visited_expression_idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See other comment re is_foo vs. check_foo.
| is_sorted = MTXIngestor.is_sorted(current_idx, visited_expression_idx) | |
| is_sorted = MTXIngestor.check_sorted(current_idx, visited_expression_idx) |
Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>
…scp-ingest-pipeline into ea-improve-dense-testing
| # Represents row as a list | ||
| for row in self.csv_file_handler: | ||
| valid_expression_scores, cells = DenseIngestor.filter_expression_scores( | ||
| valid_expression_scores, exp_cells = DenseIngestor.filter_expression_scores( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might just write out 'expression' for 'expression_cells'
| for gene_cell_model in self.set_data_array_gene_cell_names( | ||
| gene, id, cells | ||
| # Data array for cell names | ||
| for da in GeneExpression.create_data_array( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'da' is too short -- write out data_array
| data_arrays = [] | ||
| for all_cell_model in self.set_data_array_cells(self.header, ObjectId()): | ||
| for all_cell_model in GeneExpression.create_data_array( | ||
| **self.da_kwargs, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also rename that to data_array_kwargs
| # load any remaining models (this is necessary here since there isn't an easy way to detect the | ||
| # last line of the file in the iteration above | ||
| data_arrays.append(da) | ||
| if len(data_arrays) >= GeneExpression.DATA_ARRAY_BATCH_SIZE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert to previous indentation
| self.extra_log_params = {"study_id": self.study_id, "duration": None} | ||
| self.mongo_connection = MongoConnection() | ||
| # Common data array kwargs | ||
| self.da_kwargs = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, use data_array_kwargs
| """Abstract method for validating expression matrices""" | ||
|
|
||
| @staticmethod | ||
| def create_gene_model( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs a method comment
| self, name: str, linear_data_id: str, values: List | ||
| ): | ||
| @staticmethod | ||
| def create_data_array( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to create_data_arrays
| if ignore: | ||
| raise TypeError("Positional arguments are not accepted.") | ||
| del fn_kwargs["ignore"] | ||
| for model in DataArray(**fn_kwargs).get_data_array(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to get_data_arrays
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to touch get_data_array() right now because all file types/metadata convention use this method. Testing for other file types/metadata convention aren't as robust as for expression files. I have no doubt that I'll miss something and cause an error in prod.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's we aren't merging/releasing this change for the dense parsing that's going out today. I only see 3 total mentions of this function in the current codebase--that's not a risky change--and it's also something that any smoke test would catch b/c the ingest for that type would fail altogether if you missed a type. Use your editor's "find/replace all" to look for all instances and change them.
tests/test_expression_files.py
Outdated
| self.assertRaises( | ||
| TypeError, [GeneExpression.create_data_array], hi="bad_kwarg", **kwargs | ||
| ) | ||
| actual_da: Dict = next(GeneExpression.create_data_array(**kwargs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to 'data_array'
| } | ||
| ) | ||
| return DataArray(**base_data_array_model, **data_array_attrs).get_data_array() | ||
| return DataArray(**base_data_array_model, **data_array_attrs).get_data_arrays() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requested change from @devonbush
| self.linear_data_id = self.study_id | ||
|
|
||
| def get_data_array(self): | ||
| def get_data_arrays(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devonbush requested change
Refactoring DenseIngestor for conciseness and to improve testability. Adding tests for expression_filles.py
MTX is included so that tests past. This is not a complete PR for MTX. Suggestions are welcomed but not required.