Skip to content

Conversation

@knapii-developments
Copy link
Contributor

@knapii-developments knapii-developments commented Aug 18, 2020

Refactoring DenseIngestor for conciseness and to improve testability. Adding tests for expression_filles.py

MTX is included so that tests past. This is not a complete PR for MTX. Suggestions are welcomed but not required.

@codecov
Copy link

codecov bot commented Aug 18, 2020

Codecov Report

Merging #134 into development will increase coverage by 3.77%.
The diff coverage is 84.47%.

Impacted file tree graph

@@               Coverage Diff               @@
##           development     #134      +/-   ##
===============================================
+ Coverage        64.26%   68.04%   +3.77%     
===============================================
  Files               22       22              
  Lines             2622     2688      +66     
===============================================
+ Hits              1685     1829     +144     
+ Misses             937      859      -78     
Impacted Files Coverage Δ
ingest/cell_metadata.py 80.30% <55.55%> (ø)
ingest/expression_files/expression_files.py 87.05% <76.92%> (+5.47%) ⬆️
ingest/expression_files/mtx.py 85.49% <85.71%> (-5.53%) ⬇️
ingest/expression_files/dense_ingestor.py 95.13% <100.00%> (+0.03%) ⬆️
ingest/ingest_files.py 74.70% <100.00%> (+0.45%) ⬆️
ingest/ingest_pipeline.py 54.18% <100.00%> (+1.32%) ⬆️
ingest/genomes/utils.py 54.16% <0.00%> (+4.16%) ⬆️
ingest/make_toy_data.py 40.54% <0.00%> (+13.40%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3c814c4...a2c1e80. Read the comment docs.

@knapii-developments knapii-developments marked this pull request as ready for review August 18, 2020 13:28
@knapii-developments knapii-developments changed the title Ea improve dense testing Refactor and add tests for dense matrices & expression files Aug 18, 2020
@knapii-developments knapii-developments changed the title Refactor and add tests for dense matrices & expression files * SCP-2567 Refactor and add tests for dense matrices & expression files Aug 18, 2020
Copy link
Contributor

@bistline bistline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like the refactor! The only changes I would request is to be consistent in the placement of da_kwargs when calling GeneExpression.create_data_array, and then to change the type for gene_id to a string in the Gene model.


@staticmethod
def create_gene_model(
*ignore, name: str, study_file_id, study_id, _id: int, gene_id: int = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gene_id should be a string, not an integer. In this context, we're referring to an ID from a reference collection, such as Ensembl, or Entrez. While Entrez gene IDs are ints, in Ensembl they're strings like ENSG00000139618.

data_arrays = []
for all_cell_model in self.set_data_array_cells(self.header, ObjectId()):
for all_cell_model in GeneExpression.create_data_array(
**self.da_kwargs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why the keyword arguments are passed first here, but last in other calls to create_data_array? We should be consistent for readability even though they are not positionally dependent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also rename that to data_array_kwargs

Copy link
Member

@eweitz eweitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I made some suggestions and asked a question or two, but I see no blockers.

Comment on lines 21 to 28
sys.path.append("../ingest")
from ingest_files import IngestFiles
from monitor import trace
except ImportError:
# Used when importing as external package, e.g. imports in single_cell_portal code
# Used when importing as external package, e.g. imports in
# single_cell_portal code
from .expression_files import GeneExpression

sys.path.append("../ingest")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing sys.path.append("../ingest") twice -- once in the try and once in the except, a cleaner equivalent would be to do that once outside the try (i.e. the line immediately above try).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make this suggestion in this PR for mtx.

Comment on lines +93 to +94
f"Expected {expected_barcodes} cells and {expected_genes} genes. "
f"Got {actual_barcodes} cells and {actual_genes} genes."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great error message.

"""
self.genes = [g.strip().strip('"') for g in self.genes_file.readlines()]
self.cells = [c.strip().strip('"') for c in self.barcodes_file.readlines()]
self.genes: List[str] = [
Copy link
Member

@eweitz eweitz Aug 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, what benefit is List[str] here? I see the value of (optionally) using type hints to annotate function signatures, but the value in cases like this is less clear to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just to be more concise. It lets a developer know that it's a list of strings.

return list(map(int, mtx_dimensions))

@staticmethod
def is_sorted(idx: int, visited_expression_idx: List[int]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed all cases of is_foo() renamed to check_foo() in a recent PR. For consistency, we should use one naming pattern or the other, but not both.

This would be consistent those recent changes:

Suggested change
def is_sorted(idx: int, visited_expression_idx: List[int]):
def check_sorted(idx: int, visited_expression_idx: List[int]):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a check that happens in transform and not in check_valid(). For formatting issues, the rational was that we wanted to know all validation errors prior to ingest which is why "is" changed to "check". If is_sorted() returns false parsing stops and immediately exits.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, helpful to know. That rationale seems reasonable.

current_idx = int(raw_gene_idx)
gene_id, gene = self.genes[current_idx - 1].split("\t")
if current_idx != last_idx:
is_sorted = MTXIngestor.is_sorted(current_idx, visited_expression_idx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See other comment re is_foo vs. check_foo.

Suggested change
is_sorted = MTXIngestor.is_sorted(current_idx, visited_expression_idx)
is_sorted = MTXIngestor.check_sorted(current_idx, visited_expression_idx)

knapii-developments and others added 3 commits August 18, 2020 12:26
Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>
…scp-ingest-pipeline into ea-improve-dense-testing
# Represents row as a list
for row in self.csv_file_handler:
valid_expression_scores, cells = DenseIngestor.filter_expression_scores(
valid_expression_scores, exp_cells = DenseIngestor.filter_expression_scores(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might just write out 'expression' for 'expression_cells'

for gene_cell_model in self.set_data_array_gene_cell_names(
gene, id, cells
# Data array for cell names
for da in GeneExpression.create_data_array(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'da' is too short -- write out data_array

data_arrays = []
for all_cell_model in self.set_data_array_cells(self.header, ObjectId()):
for all_cell_model in GeneExpression.create_data_array(
**self.da_kwargs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also rename that to data_array_kwargs

# load any remaining models (this is necessary here since there isn't an easy way to detect the
# last line of the file in the iteration above
data_arrays.append(da)
if len(data_arrays) >= GeneExpression.DATA_ARRAY_BATCH_SIZE:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert to previous indentation

self.extra_log_params = {"study_id": self.study_id, "duration": None}
self.mongo_connection = MongoConnection()
# Common data array kwargs
self.da_kwargs = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, use data_array_kwargs

"""Abstract method for validating expression matrices"""

@staticmethod
def create_gene_model(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a method comment

self, name: str, linear_data_id: str, values: List
):
@staticmethod
def create_data_array(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to create_data_arrays

if ignore:
raise TypeError("Positional arguments are not accepted.")
del fn_kwargs["ignore"]
for model in DataArray(**fn_kwargs).get_data_array():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to get_data_arrays

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to touch get_data_array() right now because all file types/metadata convention use this method. Testing for other file types/metadata convention aren't as robust as for expression files. I have no doubt that I'll miss something and cause an error in prod.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's we aren't merging/releasing this change for the dense parsing that's going out today. I only see 3 total mentions of this function in the current codebase--that's not a risky change--and it's also something that any smoke test would catch b/c the ingest for that type would fail altogether if you missed a type. Use your editor's "find/replace all" to look for all instances and change them.

self.assertRaises(
TypeError, [GeneExpression.create_data_array], hi="bad_kwarg", **kwargs
)
actual_da: Dict = next(GeneExpression.create_data_array(**kwargs))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to 'data_array'

}
)
return DataArray(**base_data_array_model, **data_array_attrs).get_data_array()
return DataArray(**base_data_array_model, **data_array_attrs).get_data_arrays()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requested change from @devonbush

self.linear_data_id = self.study_id

def get_data_array(self):
def get_data_arrays(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devonbush requested change

@knapii-developments knapii-developments merged commit d251216 into development Aug 20, 2020
@knapii-developments knapii-developments deleted the ea-improve-dense-testing branch August 20, 2020 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants