Skip to content

Conversation

@knapii-developments
Copy link
Contributor

@knapii-developments knapii-developments commented Aug 17, 2020

This PR:

  • Adds functionality to ingest pipeline to parse mtx files as well as additional format validations
  • Adds tests for empty files
  • Adds tests for r formatted files.

@knapii-developments knapii-developments changed the base branch from master to development August 21, 2020 15:33
@knapii-developments knapii-developments marked this pull request as ready for review August 21, 2020 15:46
@knapii-developments knapii-developments changed the title Ea mtx ingest 1 m * SCP-2611 Port parsing of mtx from rails to ingest pipeline Aug 21, 2020
@codecov
Copy link

codecov bot commented Aug 21, 2020

Codecov Report

Merging #132 into development will increase coverage by 0.37%.
The diff coverage is 95.34%.

Impacted file tree graph

@@               Coverage Diff               @@
##           development     #132      +/-   ##
===============================================
+ Coverage        68.04%   68.41%   +0.37%     
===============================================
  Files               22       22              
  Lines             2688     2682       -6     
===============================================
+ Hits              1829     1835       +6     
+ Misses             859      847      -12     
Impacted Files Coverage Δ
ingest/expression_files/mtx.py 94.06% <92.85%> (+8.57%) ⬆️
ingest/expression_files/expression_files.py 90.19% <95.65%> (+3.13%) ⬆️
ingest/expression_files/dense_ingestor.py 94.02% <100.00%> (-1.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d251216...5992156. Read the comment docs.

Copy link
Member

@eweitz eweitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I noted a possible bug (but I'd be a bit surprised if it manifested) and some readability and maintainability improvements, but I see no functional problems.


try:
from expression_files import GeneExpression
from ingest_files import DataArray
Copy link
Member

@eweitz eweitz Aug 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the imports in the except block to match what we import in the try block. E.g. it seems DataArray should be imported instead of IngestFiles on line 25.

More generally:

Without testing it myself, I don't know if this try/except is needed -- because manage-study doesn't directly use Ingest Pipeline code for expression file parsing. Keeping it here and updating it per above seems safest, but perhaps making this aspect of Ingest-to-public-CLI integration more easily testable is worth discussing at our process meeting today.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe manage-study doesn't use ingest pipeline to ingest expression matrices due to prior implementation and performance issues. I'm not sure if it will in the future. I've gone ahead and modeled it after the other file types just for consistency. d1b33d4

next(self.csv_file_handler)
for gene_docs, data_array_documents in self.transform():
self.load(gene_docs, data_array_documents)
for documents, collection_name in self.transform():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a """-formatted docstring for the containing execute_ingest method here would be a nice touch.

# An "R formatted" file has one less entry in the header
# row than each successive row. Also, "GENE" will not appear in header
# An "R formatted" file can:
# Not have gene in the header or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Not have gene in the header or
# Not have GENE in the header or

Comment on lines 282 to 286
# Determine if models should be batched
if (
len(data_arrays) + len(current_data_arrays)
> GeneExpression.DATA_ARRAY_BATCH_SIZE
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easier reading:

Suggested change
# Determine if models should be batched
if (
len(data_arrays) + len(current_data_arrays)
> GeneExpression.DATA_ARRAY_BATCH_SIZE
):
this_batch_size = len(data_arrays) + len(current_data_arrays)
# Determine if models should be batched
if (this_batch_size > GeneExpression.DATA_ARRAY_BATCH_SIZE):

except ValueError as v:
error_messages.append(str(v))
try:
MTXIngestor.check_duplicates(barcodes, "barcodes")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Singular, as done for "gene":

Suggested change
MTXIngestor.check_duplicates(barcodes, "barcodes")
MTXIngestor.check_duplicates(barcodes, "barcode")

Comment on lines 113 to 114
if len(unique_names) != len(names):
amount_of_duplicates = abs(len(unique_names) - len(names))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearer and shorter, given that unique_names will never be longer than names:

Suggested change
if len(unique_names) != len(names):
amount_of_duplicates = abs(len(unique_names) - len(names))
if len(names) > len(unique_names):
amount_of_duplicates = len(names) - len(unique_names)

else:
raise ValueError("MTX file must be sorted")

def execute_ingest(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A """-formatted docstring would be nice here, so maintainers can hover over calls to this and see a summary.

Comment on lines 226 to 233
# Determine if models should be batched
if (
len(data_arrays) + len(current_data_arrays)
>= GeneExpression.DATA_ARRAY_BATCH_SIZE
):
yield gene_models, GeneExpression.COLLECTION_NAME
yield data_arrays, DataArray.COLLECTION_NAME
num_processed += len(gene_models)
Copy link
Member

@eweitz eweitz Aug 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code seems to duplicate code around line 282 in dense_ingestor.py.

If the duplication is feasible to abstract, could you do so?

If not, could you apply my refinement for "Easier reading" from there to here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This piece of code has been refactored

# Data array for expression values
for data_array in GeneExpression.create_data_arrays(
name=f"{gene} Expression",
array_type=f"expression",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No concatenation, so no f-string:

Suggested change
array_type=f"expression",
array_type="expression",

# Have one less entry in the header than each successive row or
# Have "" as the last value in header.
if header[0].upper() != "GENE":
length_of_next_line = len(row)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 'of' is typically disfavored in variable names. I would rename to next_line_length

Parameters:
header (List[str]): Header of the dense matrix
row (List): A single row from the dense matrix
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment should be extended to mention the return value is a tuple, not just a boolean

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I don't see anywhere in the code that uses the second half of the tuple, should it be omitted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used in set_header()

)

def load(self, gene_docs: List, data_array_docs: List):
def load(self, docs: List, collection_name: List):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a nice refactor

linear_data_id=self.study_file_id,
**self.data_array_kwargs,
):
data_arrays.append(data_array)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this section should get method extracted into a "create_all_cells_data_array()" method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would it be different? I believe the function signature would look the same as create_data_arrays.

array_type=f"{gene} Cells",
last_gene_id, last_gene = self.genes[current_idx - 2].split(
"\t"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been reading this code for the past 5 minutes and can't figure it out. Let's find a time to chat--I assume it's doing something very smart that I just can't figure out.

def set_header(self):
csv_file_handler = self.open_file(self.file_path)[0]
@staticmethod
def set_header(csv_file_handler) -> List[str]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bistline will you double check this please

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good now

Copy link
Contributor

@bistline bistline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a couple of questions that are not blocking concerns. The only possible change I might request is to be consistent with the naming convention for expression file ingestors. You have dense_ingestor and mtx. I would say either suffix both classes with _ingestor or just have dense and mtx.

da_cells = GeneExpression.create_data_arrays(
name=gene,
array_type=f"{gene} Cells",
last_gene_id, last_gene = self.genes[current_idx - 2].split(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we certain that we should be offsetting the index by 2 here? Usually MTX files indexes are 1-based rather than 0-based, So I'm unclear how we would be off by 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But the new implementation clears this up.

return True
if len(header) == length_of_next_line:
last_value = header[-1]
if last_value.isspace() or last_value == "":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get some actual matrices generate from R to check this because I'm worried we're not checking the correct things. The 3 header validations we checked for in the old Rails parsers were:

  1. GENE (case insensitive) in first cell
  2. Blank space in first cell
  3. Header row is one entry shorter than all other rows

self.execute_ingest(args)
not self.assertEqual(cm.exception.code, 0)
@patch(
"expression_files.expression_files.GeneExpression.check_unique_cells",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why do we need to patch this? Shouldn't it return true if nothing is found, or is this because there is no MongoDB question?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to patch it because check_unique_cells() makes a query to mongo to check for uniqueness.

unique_gene_names: List[str] = set(gene_names)
if len(unique_gene_names) != len(gene_names):
amount_of_duplicates = len(unique_gene_names) - len(gene_names)
def check_duplicates(names: List, file_type: str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice refactoring.

Copy link
Contributor

@jlchang jlchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_mtx_dimensions in mtx.py does not have error handling if the file is not actually an mtx file (if the first line of the file that doesn't start with % has non-numeric values, trying to convert to int will result in an error). Please add error handling so user will know we expected to find mtx dimensions and failed and allow ingest_pipeline to fail gracefully when a bad matrix file is uploaded.

Is there value in checking that the first line of the file leads with two percent signs (because if it doesn't, it isn't valid format)?

@knapii-developments
Copy link
Contributor Author

get_mtx_dimensions in mtx.py does not have error handling if the file is not actually an mtx file (if the first line of the file that doesn't start with % has non-numeric values, trying to convert to int will result in an error). Please add error handling so user will know we expected to find mtx dimensions and failed and allow ingest_pipeline to fail gracefully when a bad matrix file is uploaded.

Is there value in checking that the first line of the file leads with two percent signs (because if it doesn't, it isn't valid format)?

c8c3232

@knapii-developments knapii-developments requested review from eweitz and removed request for bistline and devonbush August 26, 2020 15:42
Copy link
Contributor

@bistline bistline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two blocking issues:

  1. It appears you've accidentally commented out the call to load() in expression_files.py, which would mean nothing is getting persisted to Mongo
  2. The sort checking logic is mtx is not quite right

Otherwise, things look good!

def set_header(self):
csv_file_handler = self.open_file(self.file_path)[0]
@staticmethod
def set_header(csv_file_handler) -> List[str]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good now

num_processed: int,
force=False,
):
"""Creates models are a given gene and batches them for loading if
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Creates models are a given gene and batches them for loading if
"""Creates models for a given gene and batches them for loading if

this_batch_size = len(data_arrays) + len(current_data_arrays)
# Determine if models should be batched
if this_batch_size >= GeneExpression.DATA_ARRAY_BATCH_SIZE or force:
# self.load(gene_models, GeneExpression.COLLECTION_NAME)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason these are commented out? This seems like a regression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_sorted = MTXIngestor.is_sorted(current_idx, visited_expression_idx)
if not is_sorted:
if current_idx != prev_idx:
if not current_idx == prev_idx + 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no guarantee that the index will increment sequentially like this. There may not be any observed expression for a given gene. So if we are still enforcing the sorting of mtx files (I thought we had a way to not do this?) then you just want to make sure the index isn't less than the current index. The values should only go up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f"Time to load {len(docs)} models: {str(datetime.datetime.now() - start_time)}"
)

def create_models(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eweitz This is the function that @devonbush was referring to and you had touched on in this comment.

Copy link
Member

@eweitz eweitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greatly improved. Filling in for Devon, since the original blocker involved readability, I think one more refinement is worthwhile.

Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>
@eweitz eweitz self-requested a review August 27, 2020 19:49
Copy link
Member

@eweitz eweitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I approve. The build failure is a false positive.

@knapii-developments knapii-developments merged commit 70e8e03 into development Aug 27, 2020
@knapii-developments knapii-developments deleted the ea-mtx-ingest-1M branch October 16, 2020 12:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants