Skip to content

Conversation

@eweitz
Copy link
Member

@eweitz eweitz commented Aug 11, 2020

This improves robustness and maintainability for the SCP Genomes Pipeline and make_toy_data.py.

Previously, these two pieces of infrastructure had no automated tests, and a tedious complex of parameters. The code had moved from another repo in #102 to reduce technical debt after gaining significant features, but those two liabilities remained.

Now those risks are greatly reduced. Genomes Pipeline (genomes_pipeline.py, previously named process_genome_references.py) now has an integration test in test_genomes.py and make_toy_data.py has one in test_toy_data.py. To enable tests, the code has been restructured to enable use as a Python module. Altogether, this substantially increases scp-ingest-pipeline test coverage -- from 52% before to 67% after.

Also, Genomes Pipeline is now more object-oriented, to simplify stateful transformations of genome reference data. Static analysis via flake8 has also been restored for those files. These two changes improve maintainability for four modules comprising ~700 lines of code, which is about a quarter of the code in scp-ingest-pipeline.

While manually verifying this work, I discovered that reference data can no longer be uploaded to GCS due to network timeouts. This seems to be due to slower networks, and independent of these changes. SCP-2662 tracks that.

This satisfies SCP-2491.

@eweitz eweitz changed the title Add tests and refactor Genomes Pipeline and make_toy_data.py (SCP-2491) Harden Genomes Pipeline and make_toy_data.py (SCP-2491) Aug 11, 2020
@eweitz
Copy link
Member Author

eweitz commented Aug 11, 2020

Oddly, even though I used e.g. git mv ingest/genomes/process_genome_references.py ingest/genomes/genomes_pipeline.py, the Files changed tab shows crude file moves (mv instead of git mv). This means Git history is lost for those files and the diffs are slightly larger than they would be.

The affected histories are short, as those files were recently migrated from another repo. The diffs would also be large even with proper git mv. So unless anyone feels strongly I'll plan to not patch the history.

@codecov
Copy link

codecov bot commented Aug 11, 2020

Codecov Report

Merging #129 into master will increase coverage by 15.18%.
The diff coverage is 92.23%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #129       +/-   ##
===========================================
+ Coverage   51.82%   67.00%   +15.18%     
===========================================
  Files          22       22               
  Lines        2580     2576        -4     
===========================================
+ Hits         1337     1726      +389     
+ Misses       1243      850      -393     
Impacted Files Coverage Δ
ingest/expression_files/__init__.py 100.00% <ø> (ø)
ingest/genomes/utils.py 54.16% <ø> (+54.16%) ⬆️
ingest/mongo_connection.py 50.00% <0.00%> (-3.85%) ⬇️
ingest/monitor.py 77.04% <ø> (ø)
ingest/genomes/genome_annotation_metadata.py 7.69% <40.00%> (ø)
ingest/make_toy_data.py 40.54% <84.33%> (+40.54%) ⬆️
ingest/genomes/genomes_pipeline.py 86.66% <86.66%> (ø)
ingest/genomes/genome_annotations.py 98.00% <98.00%> (ø)
ingest/genomes/genome_assemblies.py 98.88% <98.88%> (ø)
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d8454bb...8a486fe. Read the comment docs.

Copy link
Contributor

@devonbush devonbush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a really impressive achievement. I'd love to have you spend a few minutes Monday walking me through this from a user perspective -- this is an area of SCP I'm seeing for the first time.

To use `ingest/genomes/genomes_pipeline.py`:

```
brew install tabix
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad we have this documented now!

Copy link
Contributor

@devonbush devonbush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good, just some style questions

remote_prod_dir = config['remote_prod_dir']

storage_client = get_gcs_storage_client(vault_path)
storage_client = utils.get_gcs_storage_client(vault_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, I like how you made these easier to read where the functions come from

new_metadata_ref = []

ref_file = 'species_metadata_reference.tsv'
ref_file = f'{output_dir}species_metadata_reference.tsv'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a place where os.path.join might be more robust?
https://docs.python.org/3/library/os.path.html

Copy link
Member Author

@eweitz eweitz Aug 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! Done in 8a486fe. It's more verbose, but trading that for robustness makes sense.

# Run all tests, show code coverage metrics
pytest --cov=../ingest/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's overkill to be repeating pytest usage documentation here. Maybe just add a link to the pytest docs from the main readme (line 76ish)?

Copy link
Member Author

@eweitz eweitz Aug 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I can see how repeating these examples in each integration test module is a bit much. At the same time, I think only linking to pytest docs makes conveying tailored examples too inconvenient.

So in 8a486fe I've consolidated this handful of conveniences into the "Test" section of the README, added a pointer here, and removed the examples from the tests themselves.

pytest test_make_toy.py
# Run all tests, show code coverage metrics
pytest --cov=../ingest/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, I don't think it's useful to repeat pytest docs at the top of every test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed; details are in my reply over there.

@eweitz eweitz requested a review from devonbush August 18, 2020 18:19
@eweitz eweitz merged commit c593982 into master Aug 18, 2020
@eweitz eweitz deleted the ew-genomes-pipeline branch August 18, 2020 23:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants