-
Notifications
You must be signed in to change notification settings - Fork 0
Integrating ExpressionWriter into IngestPipeline (SCP-4648)
#273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good overall!
I note two trivial blockers, and many non-blocking maintainability suggestions.
ingest/cli_parser.py
Outdated
| '--matrix-file-path', help='path to matrix file', required=True | ||
| ) | ||
| parser_expression_writer.add_argument( | ||
| '--matrix-file-type', help='type to matrix file (dense or mtx)', required=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leverage the choices keyword argument to get built-in error handling and more idiomatic help output:
| '--matrix-file-type', help='type to matrix file (dense or mtx)', required=True | |
| '--matrix-file-type', help='type of matrix file', required=True, choices=['dense', 'mtx'] |
| parser_expression_writer.add_argument( | ||
| '--gene-file', help='path to gene file (None for dense matrix files)' | ||
| ) | ||
| parser_expression_writer.add_argument( | ||
| '--barcode-file', help='path to barcode file (None for dense matrix files)' | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarify usage, as (rightly) seen in expression_writer.py examples.
| parser_expression_writer.add_argument( | |
| '--gene-file', help='path to gene file (None for dense matrix files)' | |
| ) | |
| parser_expression_writer.add_argument( | |
| '--barcode-file', help='path to barcode file (None for dense matrix files)' | |
| ) | |
| parser_expression_writer.add_argument( | |
| '--gene-file', help='path to gene file (omit for dense matrix files)' | |
| ) | |
| parser_expression_writer.add_argument( | |
| '--barcode-file', help='path to barcode file (omit for dense matrix files)' | |
| ) |
| EXAMPLES (must be invoked via ingest_pipeline.py) | ||
| dense matrix: | ||
| python3 ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
study-file-id ought to be extraneous. I imagine it's only included because of baked in assumptions of Ingest Pipeline. Could you refactor things so this argument can be omitted, or open a ticket to resolve this tech debt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely baked into ingest pipeline, and I don't feel confident removing this. My argument against doing so would be that what we're trying to do with image pipeline & DE fall outside of traditional "ingest", and perhaps we shouldn't be invoking those jobs through the ingest harness. But either way, I can create a ticket to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| render_expression_arrays --matrix-file-path ../tests/data/dense_expression_matrix.txt \ | ||
| --matrix-file-type dense \ | ||
| --cluster-file ../tests/data/cluster_example.txt \ | ||
| --cluster-name 'Dense Example' --render-expression-arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needing to specify both the subparser (render_expression_arrays) and pass essentially the same value as a flag (--render-expression-arrays) is another case of pre-existing tech debt that makes these examples more complex than necessary. Could you refactor or open a tech debt ticket for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same - see above, will ticket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ingest/expression_writer.py
Outdated
|
|
||
| timestamp = datetime.datetime.now().isoformat(sep="T", timespec="seconds") | ||
| url_safe_timestamp = re.sub(':', '', timestamp) | ||
| log_name = f"expression_scatter_images_{url_safe_timestamp}_log.txt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocker: to avoid confusion with the downstream log file:
| log_name = f"expression_scatter_images_{url_safe_timestamp}_log.txt" | |
| log_name = f"expression_scatter_data_{url_safe_timestamp}_log.txt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks - will change.
ingest/writer_functions.py
Outdated
| :param file: (TextIO) open file object | ||
| :param column: (int) specific column to extract from entity file | ||
| :returns: (list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| :returns: (list) |
ingest/writer_functions.py
Outdated
| Determine which column from a 10X entity file contains valid data | ||
| :param file: (TextIO) open file object | ||
| :returns: (int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| :returns: (int) |
Codecov ReportBase: 65.22% // Head: 66.75% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## development #273 +/- ##
===============================================
+ Coverage 65.22% 66.75% +1.53%
===============================================
Files 27 29 +2
Lines 3715 3974 +259
===============================================
+ Hits 2423 2653 +230
- Misses 1292 1321 +29
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
ehanna4
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functional review only - excited to finally run ingest pipeline locally haha through plenty of trial and error and assistance from Jon was able to run through all the test steps and produce the outputs as advertised
BACKGROUND
In order for Image Pipeline to have the necessary pre-rendered expression artifacts available with which to produce static expression scatter plot images, the
ExpressionWriterclass needs to be integrated into the rest ofIngestPipelineso that it can be launched as a PAPI job. This means that the CLI needs to be invoked throughingest_pipeline.py, and that it can localize/delocalize files from GCP buckets, and have adequate test coverage as a part of normal CI forscp-ingest-pipeline.CHANGES
This update changes
ExpressionWriterfrom a standalone class to one that can be invoked throughIngestPipeline. Basic functionality is unchanged, though file handling is now delegated to theIngestFilesclass, which can push/pull files directly from GCP buckets. In addition, logging is now handled by themonitor.pymodule, and MixPanel/Sentry integration is handled through normal codepaths inIngestPipeline. Delocalization still happens serially at the end, so more downstream work will be required to make this performant in delocalizing ~28K files back to the bucket.This update also changes the Docker image used in CircleCI from the
mediumconfiguration tolarge. This is to address test processes getting killed due to parallelization viamultiprocessing.MANUAL TESTING
Note: if you have difficulties installing
scp-ingest-pipelinepackages locally, you can follow the instructions here to build and setup locally in Docker. Once completed, you can skip to step 3 below. Anypythoncommands will be run inside the Docker container.scp-ingest-pipelineenvironment:tests/data/dense_expression_matrix.txttests/data/cluster_example.txttests/data/mtx/matrix_with_header.mtxtests/data/mtx/cluster_mtx_barcodes.tsvtests/data/mtx/sampled_genes.tsvtests/data/mtx/barcodes.tsvstudy_idandstudy_file_idvalues aren't used right now, so don't worry about getting valid ones):expression_scatter_images_<timestamp>_log.txtwith content like the following:_scp_internal/cache/expression_scatter/data/Dense_ExampleItm2a.json.gzandSergef.json.gzC1orf159,RP11-345P4.9,GABRD,THAP3,DNAJC11,OXCT2,HOMER2