* SCP-2611 Port parsing of mtx from rails to ingest pipeline #132

knapii-developments · 2020-08-17T15:36:00Z

This PR:

Adds functionality to the ingest pipeline to parse MTX files, along with additional format validations
Adds tests for empty files
Adds tests for R-formatted files

codecov · 2020-08-21T15:51:26Z

Codecov Report

Merging #132 into development will increase coverage by 0.37%.
The diff coverage is 95.34%.

@@               Coverage Diff               @@
##           development     #132      +/-   ##
===============================================
+ Coverage        68.04%   68.41%   +0.37%     
===============================================
  Files               22       22              
  Lines             2688     2682       -6     
===============================================
+ Hits              1829     1835       +6     
+ Misses             859      847      -12

Impacted Files	Coverage Δ
ingest/expression_files/mtx.py	`94.06% <92.85%> (+8.57%)`	⬆️
ingest/expression_files/expression_files.py	`90.19% <95.65%> (+3.13%)`	⬆️
ingest/expression_files/dense_ingestor.py	`94.02% <100.00%> (-1.11%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d251216...5992156.

eweitz

Looks good. I noted a possible bug (but I'd be a bit surprised if it manifested) and some readability and maintainability improvements, but I see no functional problems.

eweitz · 2020-08-24T14:56:38Z

ingest/expression_files/dense_ingestor.py


 try:
    from expression_files import GeneExpression
+    from ingest_files import DataArray


Please update the imports in the except block to match what we import in the try block. E.g. it seems DataArray should be imported instead of IngestFiles on line 25.

More generally:

Without testing it myself, I don't know if this try/except is needed -- because manage-study doesn't directly use Ingest Pipeline code for expression file parsing. Keeping it here and updating it per above seems safest, but perhaps making this aspect of Ingest-to-public-CLI integration more easily testable is worth discussing at our process meeting today.

I believe manage-study doesn't use ingest pipeline to ingest expression matrices due to prior implementation and performance issues. I'm not sure if it will in the future. I've gone ahead and modeled it after the other file types just for consistency. d1b33d4

eweitz · 2020-08-24T15:00:01Z

ingest/expression_files/dense_ingestor.py

        next(self.csv_file_handler)
-        for gene_docs, data_array_documents in self.transform():
-            self.load(gene_docs, data_array_documents)
+        for documents, collection_name in self.transform():


Having a """-formatted docstring for the containing execute_ingest method here would be a nice touch.

eweitz · 2020-08-24T15:00:53Z

ingest/expression_files/dense_ingestor.py

-        # An "R formatted" file has one less entry in the header
-        # row than each successive row. Also, "GENE" will not appear in header
+        # An "R formatted" file can:
+        # Not have gene in the header or


Suggested change

-        # Not have gene in the header or
+        # Not have GENE in the header or

eweitz · 2020-08-24T15:03:56Z

ingest/expression_files/dense_ingestor.py

+            # Determine if models should be batched
+            if (
+                len(data_arrays) + len(current_data_arrays)
+                > GeneExpression.DATA_ARRAY_BATCH_SIZE
+            ):


Easier reading:

Suggested change

-            # Determine if models should be batched
-            if (
-                len(data_arrays) + len(current_data_arrays)
-                > GeneExpression.DATA_ARRAY_BATCH_SIZE
-            ):
+            this_batch_size = len(data_arrays) + len(current_data_arrays)
+            # Determine if models should be batched
+            if (this_batch_size > GeneExpression.DATA_ARRAY_BATCH_SIZE):

eweitz · 2020-08-24T15:07:56Z

ingest/expression_files/mtx.py

+        except ValueError as v:
+            error_messages.append(str(v))
+        try:
+            MTXIngestor.check_duplicates(barcodes, "barcodes")


Singular, as done for "gene":

Suggested change

-            MTXIngestor.check_duplicates(barcodes, "barcodes")
+            MTXIngestor.check_duplicates(barcodes, "barcode")

eweitz · 2020-08-24T15:13:49Z

ingest/expression_files/mtx.py

+        if len(unique_names) != len(names):
+            amount_of_duplicates = abs(len(unique_names) - len(names))


Clearer and shorter, given that unique_names will never be longer than names:

Suggested change

-        if len(unique_names) != len(names):
-            amount_of_duplicates = abs(len(unique_names) - len(names))
+        if len(names) > len(unique_names):
+            amount_of_duplicates = len(names) - len(unique_names)
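
A minimal sketch of the duplicate check being discussed, assuming it raises a `ValueError` that names the kind of entry. The standalone function and the message wording are assumptions for illustration, not the PR's exact code.

```python
from typing import List


def check_duplicates(names: List[str], file_type: str) -> None:
    """Raise ValueError if `names` contains repeated entries.

    Hypothetical stand-in for MTXIngestor.check_duplicates; the error
    message wording is an assumption.
    """
    unique_names = set(names)
    # A set is never longer than the list it was built from, so the
    # subtraction below cannot go negative and abs() is unnecessary.
    if len(names) > len(unique_names):
        amount_of_duplicates = len(names) - len(unique_names)
        raise ValueError(
            f"Duplicate {file_type}s are not allowed. "
            f"There are {amount_of_duplicates} duplicates."
        )


check_duplicates(["A", "B", "C"], "barcode")  # no duplicates, returns None
try:
    check_duplicates(["A", "B", "A"], "barcode")
except ValueError as error:
    print(error)
```

If the message pluralizes the label itself, as in this sketch, passing the singular form ("barcode", "gene") keeps the wording consistent across file types.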

eweitz · 2020-08-24T15:15:32Z

ingest/expression_files/mtx.py

            else:
                raise ValueError("MTX file must be sorted")

    def execute_ingest(self):


A """-formatted docstring would be nice here, so maintainers can hover over calls to this and see a summary.

eweitz · 2020-08-24T15:23:34Z

ingest/expression_files/mtx.py

+                        # Determine if models should be batched
+                        if (
+                            len(data_arrays) + len(current_data_arrays)
+                            >= GeneExpression.DATA_ARRAY_BATCH_SIZE
+                        ):
+                            yield gene_models, GeneExpression.COLLECTION_NAME
+                            yield data_arrays, DataArray.COLLECTION_NAME
+                            num_processed += len(gene_models)


This code seems to duplicate code around line 282 in dense_ingestor.py.

If the duplication is feasible to abstract, could you do so?

If not, could you apply my refinement for "Easier reading" from there to here?

This piece of code has been refactored
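
For readers following the batching discussion, here is a simplified sketch of what a shared batching helper could look like. It is not the refactored code from the PR: `batch_models`, its parameters, and the small batch size are hypothetical, and the real threshold lives on `GeneExpression.DATA_ARRAY_BATCH_SIZE`.

```python
from typing import Iterable, Iterator, List

# Hypothetical value for illustration only
DATA_ARRAY_BATCH_SIZE = 3


def batch_models(
    per_gene_arrays: Iterable[List[dict]],
    batch_size: int = DATA_ARRAY_BATCH_SIZE,
) -> Iterator[List[dict]]:
    """Yield lists of data arrays once the pending count reaches batch_size."""
    pending: List[dict] = []
    for current_data_arrays in per_gene_arrays:
        pending.extend(current_data_arrays)
        this_batch_size = len(pending)
        # Determine if models should be batched
        if this_batch_size >= batch_size:
            yield pending
            pending = []
    # Flush the remainder, similar in spirit to forcing a final load
    if pending:
        yield pending


for batch in batch_models([[{"n": 1}, {"n": 2}], [{"n": 3}], [{"n": 4}]]):
    print(len(batch))
```

Keeping the size comparison in one place means the dense and MTX ingestors cannot drift apart on `>` versus `>=`.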

eweitz · 2020-08-24T15:24:36Z

ingest/expression_files/mtx.py

+        # Data array for expression values
+        for data_array in GeneExpression.create_data_arrays(
+            name=f"{gene} Expression",
+            array_type=f"expression",


No concatenation, so no f-string:

Suggested change

-            array_type=f"expression",
+            array_type="expression",

devonbush · 2020-08-24T18:35:42Z

ingest/expression_files/dense_ingestor.py

+        # Have one less entry in the header than each successive row or
+        # Have "" as the last value in header.
        if header[0].upper() != "GENE":
            length_of_next_line = len(row)


I think 'of' is typically disfavored in variable names. I would rename it to next_line_length.

devonbush · 2020-08-24T18:38:31Z

ingest/expression_files/dense_ingestor.py

        Parameters:
            header (List[str]): Header of the dense matrix
            row (List): A single row from the dense matrix
        """


This comment should be extended to mention that the return value is a tuple, not just a boolean.

Although I don't see anywhere in the code that uses the second half of the tuple, should it be omitted?

It's used in set_header()

devonbush · 2020-08-24T18:42:19Z

ingest/expression_files/expression_files.py

            )

-    def load(self, gene_docs: List, data_array_docs: List):
+    def load(self, docs: List, collection_name: List):


this is a nice refactor

devonbush · 2020-08-24T18:48:38Z

ingest/expression_files/mtx.py

+            linear_data_id=self.study_file_id,
+            **self.data_array_kwargs,
+        ):
+            data_arrays.append(data_array)


This section should be extracted into a "create_all_cells_data_array()" method.

How would it be different? I believe the function signature would look the same as create_data_arrays.

devonbush · 2020-08-24T18:51:10Z

ingest/expression_files/mtx.py

-                            array_type=f"{gene} Cells",
+                        last_gene_id, last_gene = self.genes[current_idx - 2].split(
+                            "\t"
+                        )


I've been reading this code for the past 5 minutes and can't figure it out. Let's find a time to chat--I assume it's doing something very smart that I just can't figure out.

knapii-developments · 2020-08-24T20:16:12Z

ingest/expression_files/dense_ingestor.py

-    def set_header(self):
-        csv_file_handler = self.open_file(self.file_path)[0]
+    @staticmethod
+    def set_header(csv_file_handler) -> List[str]:


@bistline will you double check this please

I think this is good now

bistline

Looks good! Just a couple of questions that are not blocking concerns. The only possible change I might request is to be consistent with the naming convention for expression file ingestors. You have dense_ingestor and mtx. I would say either suffix both classes with _ingestor or just have dense and mtx.

bistline · 2020-08-24T20:13:25Z

ingest/expression_files/mtx.py

-                        da_cells = GeneExpression.create_data_arrays(
-                            name=gene,
-                            array_type=f"{gene} Cells",
+                        last_gene_id, last_gene = self.genes[current_idx - 2].split(


Are we certain that we should be offsetting the index by 2 here? Usually MTX file indexes are 1-based rather than 0-based, so I'm unclear how we would be off by 2.

Yes. But the new implementation clears this up.

bistline · 2020-08-24T20:22:26Z

ingest/expression_files/dense_ingestor.py

-                return True
+            if len(header) == length_of_next_line:
+                last_value = header[-1]
+                if last_value.isspace() or last_value == "":


Let's get some actual matrices generated from R to check this, because I'm worried we're not checking the correct things. The 3 header validations we checked for in the old Rails parsers were:

GENE (case insensitive) in first cell

Blank space in first cell

Header row is one entry shorter than all other rows
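
The three header layouts listed above can be sketched as a single check. This is a hedged illustration only: `classify_header` and its return labels are hypothetical and are not part of the ingest pipeline or the old Rails parsers.

```python
from typing import List


def classify_header(header: List[str], row: List[str]) -> str:
    """Return which of the three accepted header layouts applies.

    Hypothetical helper; the name and return labels are illustrative.
    """
    first_cell = header[0]
    # 1. GENE (case insensitive) in the first cell
    if first_cell.upper() == "GENE":
        return "gene_header"
    # 2. Blank space (or nothing) in the first cell
    if first_cell.strip() == "":
        return "blank_first_cell"
    # 3. Header row is one entry shorter than the data row
    if len(header) == len(row) - 1:
        return "short_header"
    raise ValueError("Header is not in an accepted dense matrix format")


print(classify_header(["c1", "c2"], ["gene_a", "1", "2"]))
```

Running real R-generated matrices through a check like this would show which of the three cases R actually produces.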

bistline · 2020-08-24T20:50:28Z

tests/test_ingest.py

-                self.execute_ingest(args)
-            not self.assertEqual(cm.exception.code, 0)
+    @patch(
+        "expression_files.expression_files.GeneExpression.check_unique_cells",


Out of curiosity, why do we need to patch this? Shouldn't it return true if nothing is found, or is this because there is no MongoDB connection?

You have to patch it because check_unique_cells() makes a query to mongo to check for uniqueness.
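
A self-contained sketch of why the patch is needed, using `unittest.mock.patch.object` from the standard library. The `GeneExpression` class here is a hypothetical stand-in whose method fails loudly instead of querying Mongo; the real test patches the method by its dotted path, as in the diff above.

```python
import unittest
from unittest.mock import patch


class GeneExpression:
    """Hypothetical stand-in for the real class."""

    @staticmethod
    def check_unique_cells(cell_names):
        # The real method queries MongoDB. This stand-in raises so that
        # an unpatched call is obvious in a unit test.
        raise ConnectionError("No MongoDB available in unit tests")

    def validate(self, cell_names):
        return GeneExpression.check_unique_cells(cell_names)


class TestValidate(unittest.TestCase):
    @patch.object(GeneExpression, "check_unique_cells", return_value=True)
    def test_validate_without_mongo(self, mock_check):
        # The mock replaces the database query for the duration of the test
        self.assertTrue(GeneExpression().validate(["cell_1", "cell_2"]))
        mock_check.assert_called_once_with(["cell_1", "cell_2"])
```

Running this with `python -m unittest` exercises `validate()` without any database connection.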

bistline · 2020-08-24T21:01:31Z

ingest/expression_files/mtx.py

-        unique_gene_names: List[str] = set(gene_names)
-        if len(unique_gene_names) != len(gene_names):
-            amount_of_duplicates = len(unique_gene_names) - len(gene_names)
+    def check_duplicates(names: List, file_type: str):


Nice refactoring.

jlchang

get_mtx_dimensions in mtx.py does not have error handling if the file is not actually an mtx file (if the first line of the file that doesn't start with % has non-numeric values, trying to convert to int will result in an error). Please add error handling so user will know we expected to find mtx dimensions and failed and allow ingest_pipeline to fail gracefully when a bad matrix file is uploaded.

Is there value in checking that the first line of the file leads with two percent signs (because if it doesn't, it isn't valid format)?
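
One way the requested error handling could look, as a sketch under stated assumptions: this version takes an iterable of lines instead of a file path, and the error messages are invented for illustration. It relies only on the Matrix Market convention that comment lines start with `%` and the first non-comment line holds three integers (rows, columns, entries).

```python
from typing import Iterable, List


def get_mtx_dimensions(lines: Iterable[str]) -> List[int]:
    """Return [rows, columns, entries] from the first non-comment line.

    Hypothetical sketch; the signature and messages are assumptions.
    """
    for line in lines:
        if line.startswith("%"):
            continue  # skip the %%MatrixMarket banner and comments
        try:
            dimensions = [int(value) for value in line.split()]
        except ValueError as error:
            raise ValueError(
                f"Expected MTX dimensions but found: {line.strip()!r}"
            ) from error
        if len(dimensions) != 3:
            raise ValueError(
                f"Expected 3 MTX dimensions but found {len(dimensions)}"
            )
        return dimensions
    raise ValueError("MTX file did not contain data")


print(get_mtx_dimensions(["%%MatrixMarket matrix coordinate real general\n", "25 10 100\n"]))
```

Raising a `ValueError` with a descriptive message lets the caller collect it alongside the other validation errors instead of crashing on a bare `int()` failure.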

knapii-developments · 2020-08-26T15:11:40Z

get_mtx_dimensions in mtx.py does not have error handling if the file is not actually an mtx file (if the first line of the file that doesn't start with % has non-numeric values, trying to convert to int will result in an error). Please add error handling so user will know we expected to find mtx dimensions and failed and allow ingest_pipeline to fail gracefully when a bad matrix file is uploaded.

Is there value in checking that the first line of the file leads with two percent signs (because if it doesn't, it isn't valid format)?

c8c3232

bistline

I see two blocking issues:

It appears you've accidentally commented out the call to load() in expression_files.py, which would mean nothing is getting persisted to Mongo
The sort-checking logic in mtx is not quite right

Otherwise, things look good!

bistline · 2020-08-26T16:06:12Z

ingest/expression_files/dense_ingestor.py

-    def set_header(self):
-        csv_file_handler = self.open_file(self.file_path)[0]
+    @staticmethod
+    def set_header(csv_file_handler) -> List[str]:


I think this is good now

bistline · 2020-08-26T16:08:28Z

ingest/expression_files/expression_files.py

+        num_processed: int,
+        force=False,
+    ):
+        """Creates models are a given gene and batches them for loading if


Suggested change

-        """Creates models are a given gene and batches them for loading if
+        """Creates models for a given gene and batches them for loading if

bistline · 2020-08-26T16:09:24Z

ingest/expression_files/expression_files.py

+        this_batch_size = len(data_arrays) + len(current_data_arrays)
+        # Determine if models should be batched
+        if this_batch_size >= GeneExpression.DATA_ARRAY_BATCH_SIZE or force:
+            # self.load(gene_models, GeneExpression.COLLECTION_NAME)


Is there a reason these are commented out? This seems like a regression.

bistline · 2020-08-26T16:17:26Z

ingest/expression_files/mtx.py

-                is_sorted = MTXIngestor.is_sorted(current_idx, visited_expression_idx)
-                if not is_sorted:
+            if current_idx != prev_idx:
+                if not current_idx == prev_idx + 1:


There is no guarantee that the index will increment sequentially like this. There may not be any observed expression for a given gene. So if we are still enforcing the sorting of mtx files (I thought we had a way to not do this?) then you just want to make sure the current index isn't less than the previous index. The values should only go up.
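
The relaxed check described here can be sketched as follows. The function bodies are assumptions for illustration (the PR's eventual `is_sorted()` may differ); the point is that gaps between gene indexes are allowed and only a decrease is an error.

```python
from typing import Iterable


def is_sorted(current_idx: int, prev_idx: int) -> bool:
    """True when the gene index did not go down (gaps are fine)."""
    return current_idx >= prev_idx


def check_gene_order(gene_indices: Iterable[int]) -> None:
    """Raise ValueError at the first gene index that decreases."""
    prev_idx = 0  # MTX indexes are 1-based, so 0 is below any valid index
    for current_idx in gene_indices:
        if not is_sorted(current_idx, prev_idx):
            raise ValueError("MTX file must be sorted")
        prev_idx = current_idx


# Genes 2 and 4 through 6 have no observed expression; still sorted
check_gene_order([1, 1, 3, 7])
```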

knapii-developments · 2020-08-27T15:20:41Z

ingest/expression_files/expression_files.py

+            f"Time to load {len(docs)} models: {str(datetime.datetime.now() - start_time)}"
        )
+
+    def create_models(


@eweitz This is the function that @devonbush was referring to and you had touched on in this comment.

eweitz

Greatly improved. Filling in for Devon, since the original blocker involved readability, I think one more refinement is worthwhile.

ingest/expression_files/expression_files.py

Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>

eweitz

Thanks! I approve. The build failure is a false positive.

Add file for data arrays mtx & add da for last row

b1a394d

knapii-developments mentioned this pull request Aug 19, 2020

* SCP-2567 Refactor and add tests for dense matrices & expression files #134

Merged

knapii-developments added 2 commits August 20, 2020 09:58

Add tests and models for r files

73661dd

Add test files and merge from development

c6196f5

knapii-developments changed the base branch from master to development August 21, 2020 15:33

Add missing file

de42e56

knapii-developments marked this pull request as ready for review August 21, 2020 15:46

knapii-developments changed the title ~~Ea mtx ingest 1 m~~ * SCP-2611 Port parsing of mtx from rails to ingest pipeline Aug 21, 2020

Uncomment block

0976548

knapii-developments requested review from bistline, devonbush, eweitz and jlchang August 21, 2020 15:54

eweitz approved these changes Aug 24, 2020

devonbush suggested changes Aug 24, 2020

knapii-developments commented Aug 24, 2020

bistline approved these changes Aug 24, 2020

jlchang suggested changes Aug 25, 2020

knapii-developments added 3 commits August 26, 2020 10:52

Refactor transform() and update tests

89bf5db

Add test for mtx dimensions

35224a9

Check for no data in mtx

c8c3232

knapii-developments added 4 commits August 26, 2020 11:21

Rename function to better reflect purpose

981fd6c

Add missing test file

b15bf76

Update imports for consistency

d1b33d4

Add missing test file

269fc3e

knapii-developments requested review from bistline, devonbush and jlchang August 26, 2020 15:42

knapii-developments requested review from eweitz and removed request for bistline and devonbush August 26, 2020 15:42

jlchang approved these changes Aug 26, 2020

bistline requested changes Aug 26, 2020

knapii-developments added 2 commits August 26, 2020 13:13

Refactor and add is_sorted()

7226968

Uncomment load()

075d0e2

knapii-developments requested a review from bistline August 26, 2020 17:16

Add test for unsorted mtx

34f1e2b

bistline approved these changes Aug 26, 2020

knapii-developments added 4 commits August 26, 2020 13:30

Add transform test back

d607d6b

Fix dense test

0bdc629

Simplify name

c524d27

Remove linear data id from mtx models

34a3c75

knapii-developments commented Aug 27, 2020

eweitz requested changes Aug 27, 2020

ingest/expression_files/expression_files.py (outdated, resolved)

Update ingest/expression_files/expression_files.py

5992156

Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>

eweitz self-requested a review August 27, 2020 19:49

eweitz approved these changes Aug 27, 2020

knapii-developments merged commit 70e8e03 into development Aug 27, 2020

knapii-developments deleted the ea-mtx-ingest-1M branch October 16, 2020 12:34
