* SCP-2567 Refactor and add tests for dense matrices & expression files #134


Merged

knapii-developments merged 37 commits into development from ea-improve-dense-testing

Aug 20, 2020

Contributor

knapii-developments commented Aug 18, 2020

Refactoring DenseIngestor for conciseness and to improve testability. Adding tests for expression_files.py.

MTX is included so that tests pass. This is not a complete PR for MTX. Suggestions are welcome but not required.

knapii-developments added 24 commits

July 31, 2020 16:11


Convert rails to ingest pipeline (fb8c9bd)
Update numbers processed (c0ecd2a)
Save work (a55495a)
Add print statements for debugging in papi
Add more print statements for PAPI (4f829ba)
Remove sorted logic (07291f5)
Merg and Add test for sorted mtx (a284709)
Parse mtx files (45b4029)
Refactor data array and gene model creation (913e759)
Refactor keyword arguments in create_da() (2b2f077)
Add test_create_data_array (1400cb5)
Add test_create_gene_model (69ec86c)
Create test_load (d6da047)
Add tests running (762484f)
Remove file (c564078)
Save work (66eec2e)
Resolve merge conflict (5e42477)
Add working test for MTX (ccd9324)
Fix sorted test (e018f3e)
Save Work (c9cae29)
Fix mtx tests (a8da97d)
Add da batch size back (df98e89)
Remove unused function (0ec197b)
Comment out failing test (c0b524f)

codecov bot commented Aug 18, 2020

Codecov Report

Merging #134 into development will increase coverage by 3.77%.
The diff coverage is 84.47%.

@@               Coverage Diff               @@
##           development     #134      +/-   ##
===============================================
+ Coverage        64.26%   68.04%   +3.77%     
===============================================
  Files               22       22              
  Lines             2622     2688      +66     
===============================================
+ Hits              1685     1829     +144     
+ Misses             937      859      -78

Impacted Files                                 Coverage Δ
ingest/cell_metadata.py                        80.30% <55.55%> (ø)
ingest/expression_files/expression_files.py    87.05% <76.92%> (+5.47%)
ingest/expression_files/mtx.py                 85.49% <85.71%> (-5.53%)
ingest/expression_files/dense_ingestor.py      95.13% <100.00%> (+0.03%)
ingest/ingest_files.py                         74.70% <100.00%> (+0.45%)
ingest/ingest_pipeline.py                      54.18% <100.00%> (+1.32%)
ingest/genomes/utils.py                        54.16% <0.00%> (+4.16%)
ingest/make_toy_data.py                        40.54% <0.00%> (+13.40%)
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3c814c4...a2c1e80.


Remove ingest pipeline from PR (ef2d0c6)

knapii-developments marked this pull request as ready for review

August 18, 2020 13:28

knapii-developments requested review from bistline, devonbush and eweitz

August 18, 2020 13:28


Revert file (080e0d4)

knapii-developments changed the title ~~Ea improve dense testing~~ Refactor and add tests for dense matrices & expression files

knapii-developments changed the title ~~Refactor and add tests for dense matrices & expression files~~ * SCP-2567 Refactor and add tests for dense matrices & expression files

bistline requested changes

Contributor

bistline left a comment

Really like the refactor! The only changes I would request are to be consistent in the placement of da_kwargs when calling GeneExpression.create_data_array, and to change the type of gene_id to a string in the Gene model.

ingest/expression_files/expression_files.py Outdated

    
    @staticmethod
    def create_gene_model(
        *ignore, name: str, study_file_id, study_id, _id: int, gene_id: int = None

Contributor

bistline Aug 18, 2020

gene_id should be a string, not an integer. In this context, we're referring to an ID from a reference collection, such as Ensembl, or Entrez. While Entrez gene IDs are ints, in Ensembl they're strings like ENSG00000139618.
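A minimal sketch of what the requested signature change could look like, assuming the method simply returns a dictionary. The returned fields and the sample values are hypothetical and only illustrate that both Ensembl-style and Entrez-style IDs fit a string-typed `gene_id`; the `*ignore` guard mirrors the keyword-only pattern shown elsewhere in this PR.

```python
from typing import Dict, Optional


def create_gene_model(
    *ignore,
    name: str,
    study_file_id,
    study_id,
    _id,
    gene_id: Optional[str] = None,
) -> Dict:
    """Build a gene document; every argument must be passed by keyword."""
    if ignore:
        raise TypeError("Positional arguments are not accepted.")
    # Hypothetical document layout, for illustration only.
    return {
        "name": name,
        "study_file_id": study_file_id,
        "study_id": study_id,
        "_id": _id,
        "gene_id": gene_id,
    }


# An Ensembl ID is already a string.
ensembl_gene = create_gene_model(
    name="BRCA2",
    study_file_id="file-1",
    study_id="study-1",
    _id=1,
    gene_id="ENSG00000139618",
)

# An Entrez ID is numeric, but storing it as text keeps one type for both sources.
entrez_gene = create_gene_model(
    name="BRCA2",
    study_file_id="file-1",
    study_id="study-1",
    _id=2,
    gene_id=str(675),
)
```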

ingest/expression_files/dense_ingestor.py Outdated

    
    data_arrays = []
    for all_cell_model in self.set_data_array_cells(self.header, ObjectId()):
    for all_cell_model in GeneExpression.create_data_array(
        **self.da_kwargs,

Contributor

bistline Aug 18, 2020

Is there a reason why the keyword arguments are passed first here, but last in other calls to create_data_array? We should be consistent for readability even though they are not positionally dependent.

Contributor

devonbush Aug 18, 2020

I would also rename that to data_array_kwargs
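To illustrate the point that placement of the unpacked dictionary is a readability choice rather than a functional one, here is a small sketch. `describe_data_array` is a hypothetical stand-in, not the real `create_data_array`, and the dictionary uses the `data_array_kwargs` name suggested above.

```python
def describe_data_array(*, name, values, study_id, study_file_id):
    """Hypothetical keyword-only factory used only for this illustration."""
    return {
        "name": name,
        "values": values,
        "study_id": study_id,
        "study_file_id": study_file_id,
    }


# Arguments shared by every call, named as suggested in the review.
data_array_kwargs = {"study_id": "study-1", "study_file_id": "file-1"}

# Unpacking first or last binds the same keywords to the same parameters.
kwargs_first = describe_data_array(
    **data_array_kwargs, name="All Cells", values=["cell_1", "cell_2"]
)
kwargs_last = describe_data_array(
    name="All Cells", values=["cell_1", "cell_2"], **data_array_kwargs
)
```

Both calls build the same result, so picking one placement and using it everywhere costs nothing at runtime.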

eweitz approved these changes


Member

eweitz left a comment

Looks good! I made some suggestions and asked a question or two, but I see no blockers.

ingest/expression_files/mtx.py Outdated

Comment on lines 21 to 28

    
        sys.path.append("../ingest")
        from ingest_files import IngestFiles
        from monitor import trace
    except ImportError:
        # Used when importing as external package, e.g. imports in single_cell_portal code
        # Used when importing as external package, e.g. imports in
        # single_cell_portal code
        from .expression_files import GeneExpression
        sys.path.append("../ingest")

Member

eweitz Aug 18, 2020

Instead of doing sys.path.append("../ingest") twice (once in the try and once in the except), a cleaner equivalent would be to do it once outside the try (i.e. on the line immediately above try).

Contributor Author

knapii-developments Aug 19, 2020

I will apply this suggestion for mtx in this PR.
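A rough sketch of the suggested layout, with the path extension hoisted above the `try`. The membership guard and the `None` placeholders in the `except` branch are assumptions added so the example runs as a standalone script; in the real module that branch would hold the package-relative imports.

```python
import sys

INGEST_DIR = "../ingest"  # same relative path as in the snippet above

# Extend the module search path once, before the try block, so it applies
# whichever import branch ends up running.
if INGEST_DIR not in sys.path:
    sys.path.append(INGEST_DIR)

try:
    # Script-style imports, resolved through sys.path.
    from ingest_files import IngestFiles
    from monitor import trace
except ImportError:
    # Placeholder for the package-style relative imports used in the real
    # module; they are omitted here so the sketch runs on its own.
    IngestFiles = None
    trace = None
```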

ingest/expression_files/mtx.py

Comment on lines +93 to +94

    f"Expected {expected_barcodes} cells and {expected_genes} genes. "
    f"Got {actual_barcodes} cells and {actual_genes} genes."

Member

eweitz Aug 18, 2020

This is a great error message.
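For context, a check that produces a message of this shape might look roughly like the sketch below. The function name `check_dimensions`, its parameters, and the assumption that the first two header values are the gene count and the barcode count are illustrative; only the message text comes from the PR.

```python
from typing import List


def check_dimensions(
    mtx_dimensions: List[int], genes: List[str], barcodes: List[str]
) -> None:
    """Raise ValueError when the header counts disagree with the companion files."""
    # Assumption: the header lists the number of genes first, then barcodes.
    expected_genes, expected_barcodes = mtx_dimensions[0], mtx_dimensions[1]
    actual_genes, actual_barcodes = len(genes), len(barcodes)
    if (expected_genes, expected_barcodes) != (actual_genes, actual_barcodes):
        raise ValueError(
            f"Expected {expected_barcodes} cells and {expected_genes} genes. "
            f"Got {actual_barcodes} cells and {actual_genes} genes."
        )


try:
    check_dimensions([3, 2, 5], genes=["g1", "g2"], barcodes=["c1", "c2"])
    mismatch_message = None
except ValueError as error:
    mismatch_message = str(error)
```

Stating both the expected and the actual counts lets a user see at a glance which companion file is out of step with the matrix.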

ingest/expression_files/mtx.py

    """
    self.genes = [g.strip().strip('"') for g in self.genes_file.readlines()]
    self.cells = [c.strip().strip('"') for c in self.barcodes_file.readlines()]
    self.genes: List[str] = [

Member

eweitz Aug 18, 2020

Out of curiosity, what benefit is List[str] here? I see the value of (optionally) using type hints to annotate function signatures, but the value in cases like this is less clear to me.

Contributor Author

knapii-developments Aug 18, 2020

It's just to be more concise. It lets a developer know that it's a list of strings.
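As a small illustration of the annotation being discussed, the sketch below reads a hypothetical in-memory genes file (the sample rows are invented) and annotates the resulting variable. The annotation documents intent for readers and static type checkers; Python does not enforce it at runtime.

```python
import io
from typing import List

# Hypothetical file contents: one quoted, tab-separated "ID<TAB>name" row per gene.
genes_file = io.StringIO('"ENSG01\tGene1"\n"ENSG02\tGene2"\n')

# The annotation tells readers and type checkers that this is a list of strings.
genes: List[str] = [g.strip().strip('"') for g in genes_file.readlines()]
```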

ingest/expression_files/mtx.py

    
                return list(map(int, mtx_dimensions))

    @staticmethod
    def is_sorted(idx: int, visited_expression_idx: List[int]):

Member

eweitz Aug 18, 2020

I noticed all cases of is_foo() renamed to check_foo() in a recent PR. For consistency, we should use one naming pattern or the other, but not both.

This would be consistent with those recent changes:

Suggested change

      
    - def is_sorted(idx: int, visited_expression_idx: List[int]):
    + def check_sorted(idx: int, visited_expression_idx: List[int]):

Contributor Author

knapii-developments Aug 18, 2020

This is a check that happens in transform and not in check_valid(). For formatting issues, the rationale was that we wanted to know all validation errors prior to ingest, which is why "is" changed to "check". If is_sorted() returns false, parsing stops and immediately exits.

Member

eweitz Aug 18, 2020

Gotcha, helpful to know. That rationale seems reasonable.
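The PR does not show the body of `is_sorted()`, so the following is only a guess at the behavior described in this thread: a gene index that reappears after the parser has moved on to another gene means the file is not sorted by gene, and parsing stops at once. The helper `find_unsorted_gene` and the membership test are assumptions made for illustration.

```python
from typing import List, Optional


def is_sorted(idx: int, visited_expression_idx: List[int]) -> bool:
    """Return False when gene index `idx` already appeared in an earlier run."""
    return idx not in visited_expression_idx


def find_unsorted_gene(gene_indices: List[int]) -> Optional[int]:
    """Return the first gene index that breaks sort order, or None if sorted."""
    visited: List[int] = []
    last_idx = None
    for current_idx in gene_indices:
        # Only check when the parser moves from one gene to the next.
        if current_idx != last_idx:
            if not is_sorted(current_idx, visited):
                return current_idx  # stop immediately, as described above
            visited.append(current_idx)
            last_idx = current_idx
    return None


sorted_result = find_unsorted_gene([1, 1, 2, 2, 3])
unsorted_result = find_unsorted_gene([1, 2, 1])
```

This differs from the `check_*` validators discussed above, which collect every formatting problem before ingest begins rather than exiting on the first failure.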

ingest/expression_files/mtx.py

    
    current_idx = int(raw_gene_idx)
    gene_id, gene = self.genes[current_idx - 1].split("\t")
    if current_idx != last_idx:
        is_sorted = MTXIngestor.is_sorted(current_idx, visited_expression_idx)

Member

eweitz Aug 18, 2020

See other comment re is_foo vs. check_foo.

Suggested change

      
    - is_sorted = MTXIngestor.is_sorted(current_idx, visited_expression_idx)
    + is_sorted = MTXIngestor.check_sorted(current_idx, visited_expression_idx)

knapii-developments and others added 3 commits

August 18, 2020 12:26


Update ingest/expression_files/mtx.py (c00b5b8)
Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>
Add test for empty dense file (58d6e6f)
Merge branch 'ea-improve-dense-testing' of github.com:broadinstitute/scp-ingest-pipeline into ea-improve-dense-testing (99ace29)

devonbush suggested changes


ingest/expression_files/dense_ingestor.py

    
    # Represents row as a list
    for row in self.csv_file_handler:
        valid_expression_scores, cells = DenseIngestor.filter_expression_scores(
        valid_expression_scores, exp_cells = DenseIngestor.filter_expression_scores(

Contributor

devonbush Aug 18, 2020

I might just write out 'expression' in full, i.e. 'expression_cells'

ingest/expression_files/dense_ingestor.py Outdated

    
    for gene_cell_model in self.set_data_array_gene_cell_names(
        gene, id, cells
    # Data array for cell names
    for da in GeneExpression.create_data_array(

Contributor

devonbush Aug 18, 2020

'da' is too short -- write out data_array


ingest/expression_files/dense_ingestor.py Outdated

    
    # load any remaining models (this is necessary here since there isn't an easy way to detect the
    # last line of the file in the iteration above
                data_arrays.append(da)
            if len(data_arrays) >= GeneExpression.DATA_ARRAY_BATCH_SIZE:

Contributor

devonbush Aug 18, 2020

revert to previous indentation
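The batching pattern referred to in the code comment above can be sketched as follows. This is a simplified illustration, not the ingestor's code: `load_in_batches`, the `load` callback, and the batch size of 3 are hypothetical. It shows why a final flush after the loop is needed when the loop itself cannot tell which item is the last.

```python
from typing import Callable, Iterable, List

DATA_ARRAY_BATCH_SIZE = 3  # hypothetical small value for the example


def load_in_batches(
    models: Iterable[dict],
    load: Callable[[List[dict]], None],
    batch_size: int = DATA_ARRAY_BATCH_SIZE,
) -> None:
    """Send models to `load` in groups of at most `batch_size`."""
    data_arrays: List[dict] = []
    for model in models:
        data_arrays.append(model)
        if len(data_arrays) >= batch_size:
            load(data_arrays)
            data_arrays = []
    # Load any remaining models: the loop cannot know which item is last,
    # so a partial final batch has to be flushed here.
    if data_arrays:
        load(data_arrays)


batches: List[List[dict]] = []
load_in_batches(({"n": i} for i in range(7)), batches.append)
batch_sizes = [len(batch) for batch in batches]
```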

ingest/expression_files/expression_files.py Outdated

    
    self.extra_log_params = {"study_id": self.study_id, "duration": None}
    self.mongo_connection = MongoConnection()
    # Common data array kwargs
    self.da_kwargs = {

Contributor

devonbush Aug 18, 2020

same, use data_array_kwargs

ingest/expression_files/expression_files.py

        """Abstract method for validating expression matrices"""

    @staticmethod
    def create_gene_model(

Contributor

devonbush Aug 18, 2020

needs a method comment

ingest/expression_files/expression_files.py Outdated

    
        self, name: str, linear_data_id: str, values: List
    ):
    @staticmethod
    def create_data_array(

Contributor

devonbush Aug 18, 2020

rename to create_data_arrays

ingest/expression_files/expression_files.py Outdated

    
    if ignore:
        raise TypeError("Positional arguments are not accepted.")
    del fn_kwargs["ignore"]
    for model in DataArray(**fn_kwargs).get_data_array():

Contributor

devonbush Aug 18, 2020

rename to get_data_arrays

Contributor Author

knapii-developments Aug 19, 2020

I don't want to touch get_data_array() right now because all file types and the metadata convention use this method. Testing for the other file types and the metadata convention isn't as robust as for expression files. I have no doubt that I'll miss something and cause an error in prod.

Contributor

devonbush Aug 20, 2020

Yes, that's why we aren't merging/releasing this change for the dense parsing that's going out today. I only see 3 total mentions of this function in the current codebase--that's not a risky change--and it's also something that any smoke test would catch b/c the ingest for that type would fail altogether if you missed a type. Use your editor's "find/replace all" to look for all instances and change them.

tests/test_expression_files.py Outdated

    
    self.assertRaises(
        TypeError, [GeneExpression.create_data_array], hi="bad_kwarg", **kwargs
    )
    actual_da: Dict = next(GeneExpression.create_data_array(**kwargs))

Contributor

devonbush Aug 18, 2020

rename to 'data_array'
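A sketch of how a test along these lines might be written, using the `data_array` name requested above. `create_data_arrays` here is a hypothetical stand-in for the real method. One detail worth noting: `assertRaises` expects the callable itself as its second argument. If a list is passed instead, as in the outdated snippet above, calling the list raises `TypeError` on its own, so the assertion would pass without ever exercising the function.

```python
import unittest


def create_data_arrays(*, name, values):
    """Hypothetical keyword-only factory; unknown keywords raise TypeError."""
    return iter([{"name": name, "values": values}])


class TestCreateDataArrays(unittest.TestCase):
    def test_rejects_unknown_keyword(self):
        kwargs = {"name": "All Cells", "values": ["cell_1", "cell_2"]}
        # Pass the callable directly; assertRaises calls it with the
        # remaining keyword arguments.
        self.assertRaises(TypeError, create_data_arrays, hi="bad_kwarg", **kwargs)

    def test_returns_expected_model(self):
        kwargs = {"name": "All Cells", "values": ["cell_1", "cell_2"]}
        data_array = next(create_data_arrays(**kwargs))
        self.assertEqual(data_array["name"], "All Cells")
        self.assertEqual(data_array["values"], ["cell_1", "cell_2"])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCreateDataArrays)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```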


Review comments + added tests (021a24f)

knapii-developments requested a review from bistline

August 19, 2020 17:52


Merge branch 'development' into ea-improve-dense-testing (30f362b)

knapii-developments requested a review from devonbush

August 19, 2020 19:05

knapii-developments and others added 5 commits

August 19, 2020 16:01


Merge branch 'development' into ea-improve-dense-testing (18c5b92)
Merge pull request #139 from broadinstitute/development: 1.5.7 release (a5f26f3)
Add remaining review changes (413ce0f)
Merge (092d917)
Merge from local branch (5cd0867)

knapii-developments commented


ingest/cell_metadata.py

    
            }
        )
    return DataArray(**base_data_array_model, **data_array_attrs).get_data_array()
    return DataArray(**base_data_array_model, **data_array_attrs).get_data_arrays()

Contributor Author

knapii-developments Aug 20, 2020

Requested change from @devonbush

knapii-developments commented


ingest/ingest_files.py

    
        self.linear_data_id = self.study_id
    def get_data_array(self):
    def get_data_arrays(self):

Contributor Author

knapii-developments Aug 20, 2020

@devonbush requested change


Fix typo (a2c1e80)

bistline approved these changes


knapii-developments merged commit d251216 into development

knapii-developments deleted the ea-improve-dense-testing branch

August 20, 2020 17:03
