-
Notifications
You must be signed in to change notification settings - Fork 0
* SCP-2611 Port parsing of mtx from rails to ingest pipeline #132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## development #132 +/- ##
===============================================
+ Coverage 68.04% 68.41% +0.37%
===============================================
Files 22 22
Lines 2688 2682 -6
===============================================
+ Hits 1829 1835 +6
+ Misses 859 847 -12
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I noted a possible bug (but I'd be a bit surprised if it manifested) and some readability and maintainability improvements, but I see no functional problems.
|
|
||
| try: | ||
| from expression_files import GeneExpression | ||
| from ingest_files import DataArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the imports in the except block to match what we import in the try block. E.g. it seems DataArray should be imported instead of IngestFiles on line 25.
More generally:
Without testing it myself, I don't know if this try/except is needed -- because manage-study doesn't directly use Ingest Pipeline code for expression file parsing. Keeping it here and updating it per above seems safest, but perhaps making this aspect of Ingest-to-public-CLI integration more easily testable is worth discussing at our process meeting today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe manage-study doesn't use ingest pipeline to ingest expression matrices due to prior implementation and performance issues. I'm not sure if it will in the future. I've gone ahead and modeled it after the other file types just for consistency. d1b33d4
| next(self.csv_file_handler) | ||
| for gene_docs, data_array_documents in self.transform(): | ||
| self.load(gene_docs, data_array_documents) | ||
| for documents, collection_name in self.transform(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a """-formatted docstring for the containing execute_ingest method here would be a nice touch.
| # An "R formatted" file has one less entry in the header | ||
| # row than each successive row. Also, "GENE" will not appear in header | ||
| # An "R formatted" file can: | ||
| # Not have gene in the header or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # Not have gene in the header or | |
| # Not have GENE in the header or |
| # Determine if models should be batched | ||
| if ( | ||
| len(data_arrays) + len(current_data_arrays) | ||
| > GeneExpression.DATA_ARRAY_BATCH_SIZE | ||
| ): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Easier reading:
| # Determine if models should be batched | |
| if ( | |
| len(data_arrays) + len(current_data_arrays) | |
| > GeneExpression.DATA_ARRAY_BATCH_SIZE | |
| ): | |
| this_batch_size = len(data_arrays) + len(current_data_arrays) | |
| # Determine if models should be batched | |
| if (this_batch_size > GeneExpression.DATA_ARRAY_BATCH_SIZE): |
ingest/expression_files/mtx.py
Outdated
| except ValueError as v: | ||
| error_messages.append(str(v)) | ||
| try: | ||
| MTXIngestor.check_duplicates(barcodes, "barcodes") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Singular, as done for "gene":
| MTXIngestor.check_duplicates(barcodes, "barcodes") | |
| MTXIngestor.check_duplicates(barcodes, "barcode") |
ingest/expression_files/mtx.py
Outdated
| if len(unique_names) != len(names): | ||
| amount_of_duplicates = abs(len(unique_names) - len(names)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clearer and shorter, given that unique_names will never be longer than names:
| if len(unique_names) != len(names): | |
| amount_of_duplicates = abs(len(unique_names) - len(names)) | |
| if len(names) > len(unique_names): | |
| amount_of_duplicates = len(names) - len(unique_names) |
| else: | ||
| raise ValueError("MTX file must be sorted") | ||
|
|
||
| def execute_ingest(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A """-formatted docstring would be nice here, so maintainers can hover over calls to this and see a summary.
ingest/expression_files/mtx.py
Outdated
| # Determine if models should be batched | ||
| if ( | ||
| len(data_arrays) + len(current_data_arrays) | ||
| >= GeneExpression.DATA_ARRAY_BATCH_SIZE | ||
| ): | ||
| yield gene_models, GeneExpression.COLLECTION_NAME | ||
| yield data_arrays, DataArray.COLLECTION_NAME | ||
| num_processed += len(gene_models) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code seems to duplicate code around line 282 in dense_ingestor.py.
If the duplication is feasible to abstract, could you do so?
If not, could you apply my refinement for "Easier reading" from there to here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This piece of code has been refactored
ingest/expression_files/mtx.py
Outdated
| # Data array for expression values | ||
| for data_array in GeneExpression.create_data_arrays( | ||
| name=f"{gene} Expression", | ||
| array_type=f"expression", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No concatenation, so no f-string:
| array_type=f"expression", | |
| array_type="expression", |
| # Have one less entry in the header than each successive row or | ||
| # Have "" as the last value in header. | ||
| if header[0].upper() != "GENE": | ||
| length_of_next_line = len(row) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 'of' is typically disfavored in variable names. I would rename to next_line_length
| Parameters: | ||
| header (List[str]): Header of the dense matrix | ||
| row (List): A single row from the dense matrix | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment should be extended to mention the return value is a tuple, not just a boolean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I don't see anywhere in the code that uses the second half of the tuple, should it be omitted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used in set_header()
| ) | ||
|
|
||
| def load(self, gene_docs: List, data_array_docs: List): | ||
| def load(self, docs: List, collection_name: List): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a nice refactor
| linear_data_id=self.study_file_id, | ||
| **self.data_array_kwargs, | ||
| ): | ||
| data_arrays.append(data_array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this section should get method extracted into a "create_all_cells_data_array()" method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would it be different? I believe the function signature would look the same as create_data_arrays.
ingest/expression_files/mtx.py
Outdated
| array_type=f"{gene} Cells", | ||
| last_gene_id, last_gene = self.genes[current_idx - 2].split( | ||
| "\t" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been reading this code for the past 5 minutes and can't figure it out. Let's find a time to chat--I assume it's doing something very smart that I just can't figure out.
| def set_header(self): | ||
| csv_file_handler = self.open_file(self.file_path)[0] | ||
| @staticmethod | ||
| def set_header(csv_file_handler) -> List[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bistline will you double check this please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good now
bistline
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just a couple of questions that are not blocking concerns. The only possible change I might request is to be consistent with the naming convention for expression file ingestors. You have dense_ingestor and mtx. I would say either suffix both classes with _ingestor or just have dense and mtx.
ingest/expression_files/mtx.py
Outdated
| da_cells = GeneExpression.create_data_arrays( | ||
| name=gene, | ||
| array_type=f"{gene} Cells", | ||
| last_gene_id, last_gene = self.genes[current_idx - 2].split( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we certain that we should be offsetting the index by 2 here? Usually MTX files indexes are 1-based rather than 0-based, So I'm unclear how we would be off by 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. But the new implementation clears this up.
| return True | ||
| if len(header) == length_of_next_line: | ||
| last_value = header[-1] | ||
| if last_value.isspace() or last_value == "": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get some actual matrices generate from R to check this because I'm worried we're not checking the correct things. The 3 header validations we checked for in the old Rails parsers were:
- GENE (case insensitive) in first cell
- Blank space in first cell
- Header row is one entry shorter than all other rows
| self.execute_ingest(args) | ||
| not self.assertEqual(cm.exception.code, 0) | ||
| @patch( | ||
| "expression_files.expression_files.GeneExpression.check_unique_cells", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why do we need to patch this? Shouldn't it return true if nothing is found, or is this because there is no MongoDB question?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to patch it because check_unique_cells() makes a query to mongo to check for uniqueness.
| unique_gene_names: List[str] = set(gene_names) | ||
| if len(unique_gene_names) != len(gene_names): | ||
| amount_of_duplicates = len(unique_gene_names) - len(gene_names) | ||
| def check_duplicates(names: List, file_type: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice refactoring.
jlchang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_mtx_dimensions in mtx.py does not have error handling if the file is not actually an mtx file (if the first line of the file that doesn't start with % has non-numeric values, trying to convert to int will result in an error). Please add error handling so user will know we expected to find mtx dimensions and failed and allow ingest_pipeline to fail gracefully when a bad matrix file is uploaded.
Is there value in checking that the first line of the file leads with two percent signs (because if it doesn't, it isn't valid format)?
|
bistline
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see two blocking issues:
- It appears you've accidentally commented out the call to
load()inexpression_files.py, which would mean nothing is getting persisted to Mongo - The sort checking logic is
mtxis not quite right
Otherwise, things look good!
| def set_header(self): | ||
| csv_file_handler = self.open_file(self.file_path)[0] | ||
| @staticmethod | ||
| def set_header(csv_file_handler) -> List[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good now
| num_processed: int, | ||
| force=False, | ||
| ): | ||
| """Creates models are a given gene and batches them for loading if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| """Creates models are a given gene and batches them for loading if | |
| """Creates models for a given gene and batches them for loading if |
| this_batch_size = len(data_arrays) + len(current_data_arrays) | ||
| # Determine if models should be batched | ||
| if this_batch_size >= GeneExpression.DATA_ARRAY_BATCH_SIZE or force: | ||
| # self.load(gene_models, GeneExpression.COLLECTION_NAME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason these are commented out? This seems like a regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ingest/expression_files/mtx.py
Outdated
| is_sorted = MTXIngestor.is_sorted(current_idx, visited_expression_idx) | ||
| if not is_sorted: | ||
| if current_idx != prev_idx: | ||
| if not current_idx == prev_idx + 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no guarantee that the index will increment sequentially like this. There may not be any observed expression for a given gene. So if we are still enforcing the sorting of mtx files (I thought we had a way to not do this?) then you just want to make sure the index isn't less than the current index. The values should only go up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| f"Time to load {len(docs)} models: {str(datetime.datetime.now() - start_time)}" | ||
| ) | ||
|
|
||
| def create_models( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eweitz This is the function that @devonbush was referring to and you had touched on in this comment.
eweitz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greatly improved. Filling in for Devon, since the original blocker involved readability, I think one more refinement is worthwhile.
Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I approve. The build failure is a false positive.
This PR: