@bistline (Contributor) commented Oct 3, 2022

BACKGROUND

In order for Image Pipeline to have the necessary pre-rendered expression artifacts available for producing static expression scatter plot images, the ExpressionWriter class needs to be integrated into the rest of IngestPipeline so that it can be launched as a PAPI job. This means the CLI must be invocable through ingest_pipeline.py, able to localize/delocalize files from GCP buckets, and covered by adequate tests as part of normal CI for scp-ingest-pipeline.

CHANGES

This update changes ExpressionWriter from a standalone class to one that can be invoked through IngestPipeline. Basic functionality is unchanged, though file handling is now delegated to the IngestFiles class, which can push/pull files directly from GCP buckets. In addition, logging is now handled by the monitor.py module, and MixPanel/Sentry integration is handled through normal codepaths in IngestPipeline. Delocalization still happens serially at the end, so more downstream work will be required to make this performant in delocalizing ~28K files back to the bucket.
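For illustration, a minimal sketch of how that serial push could later be parallelized. This is an assumption about future work, not current code; `push_to_bucket` is a hypothetical stand-in for the real per-file upload (e.g. via IngestFiles):

```python
from concurrent.futures import ThreadPoolExecutor

def push_to_bucket(local_path):
    # hypothetical stand-in for a real GCS upload of one output file;
    # returns the path so callers can track completion
    return local_path

def delocalize_all(paths, max_workers=8):
    # upload files concurrently instead of one at a time; uploads are
    # I/O-bound, so a thread pool is a reasonable first pass
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(push_to_bucket, paths))

pushed = delocalize_all([f"gene_{i}.json.gz" for i in range(100)])
```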

This update also changes the Docker image used in CircleCI from the medium configuration to large. This addresses test processes getting killed under the memory load of parallelization via multiprocessing.

MANUAL TESTING

Note: if you have difficulties installing scp-ingest-pipeline packages locally, you can follow the instructions here to build and set up locally in Docker. Once that is complete, you can skip to step 3 below; any Python commands will then be run inside the Docker container.

  1. Pull this branch and initialize the scp-ingest-pipeline environment:

```shell
python3 -m venv env --copies
source env/bin/activate
pip install -r requirements.txt
```
  2. Set up MongoDB integration with the following, passing the path to your vault token (we don't need to talk to the database, but ingest will fail to execute without it):

```shell
source scripts/setup_mongo_dev.sh ~/path/to/github-token-for-vault.txt
```
  3. In a separate terminal, enter a Rails console session and get the bucket ID and browser link for any study with the following:

```ruby
study = Study.all.sample # you can use a specific study if you want, it doesn't matter
study.bucket_id
study.google_bucket_url
```
  4. Copy the URL from above and open it in a browser window, then copy the following files to the bucket:
  • tests/data/dense_expression_matrix.txt
  • tests/data/cluster_example.txt
  • tests/data/mtx/matrix_with_header.mtx
  • tests/data/mtx/cluster_mtx_barcodes.tsv
  • tests/data/mtx/sampled_genes.tsv
  • tests/data/mtx/barcodes.tsv
  5. Using the bucket ID from above, run a job using the dense matrix (the study_id and study_file_id values aren't used right now, so don't worry about getting valid ones):

```shell
BUCKET_ID="fc-c36727b0-1663-40fc-ac37-9016a6709829"
python3 ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 render_expression_arrays \
  --matrix-file-path gs://$BUCKET_ID/dense_expression_matrix.txt \
  --matrix-file-type dense \
  --cluster-file gs://$BUCKET_ID/cluster_example.txt \
  --cluster-name 'Dense Example' --render-expression-arrays
```
  6. You should see output similar to this:

```
distinct_id: 2f30ec50-a04d-4d43-8fd1-b136a2045079
studyAccession: SCPdev
fileName: 5dd5ae25421aa910a723a337
fileType: input_validation_bypassed
fileSize: 1
trigger: dev-mode
logger: ingest-pipeline
appId: single-cell-portal
```
  7. Confirm a log file for this run is created at expression_scatter_images_<timestamp>_log.txt with content like the following:

```
2022-10-03T10:23:25-0400 expression_writer INFO: creating data directory at Dense_Example
2022-10-03T10:23:25-0400 expression_writer INFO: reading gs://fc-c36727b0-1663-40fc-ac37-9016a6709829/expression_matrix_example.txt as dense matrix
2022-10-03T10:23:25-0400 expression_writer INFO: determining seek points for /tmp/expression_matrix_example.txt with chunk size 357
2022-10-03T10:23:31-0400 expression_writer INFO: completed, total runtime: 0h, 0m, 6s
2022-10-03T10:23:31-0400 expression_writer INFO: pushing all output files to gs://fc-c36727b0-1663-40fc-ac37-9016a6709829/_scp_internal/cache/expression_scatter/data/Dense_Example
2022-10-03T10:23:31-0400 expression_writer INFO: push completed
```
  8. In the browser window with the bucket, refresh the page and navigate to _scp_internal/cache/expression_scatter/data/Dense_Example
  9. Confirm there are two files here: Itm2a.json.gz and Sergef.json.gz
  10. Now run the sparse example:

```shell
BUCKET_ID="fc-c36727b0-1663-40fc-ac37-9016a6709829"
python3 ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 \
  render_expression_arrays --matrix-file-path gs://$BUCKET_ID/matrix_with_header.mtx \
  --matrix-file-type mtx \
  --gene-file gs://$BUCKET_ID/sampled_genes.tsv \
  --barcode-file gs://$BUCKET_ID/barcodes.tsv \
  --cluster-file gs://$BUCKET_ID/cluster_mtx_barcodes.tsv \
  --cluster-name 'Sparse Example' --render-expression-arrays
```
  11. Confirm similar output to above, and back in the bucket view, validate there are 7 new files for the following genes: C1orf159, RP11-345P4.9, GABRD, THAP3, DNAJC11, OXCT2, HOMER2
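As an aside, the log line about "determining seek points ... with chunk size" hints at how the matrix is split for multiprocessing. A minimal sketch of that idea follows; the function name and chunking details are illustrative assumptions, not the actual ExpressionWriter code:

```python
import os
import tempfile

def determine_seek_points(path, num_chunks):
    # Split the file into roughly equal byte ranges, then advance each
    # boundary to the next newline so every chunk starts on a full line.
    size = os.path.getsize(path)
    step = max(size // num_chunks, 1)
    points = [0]
    with open(path, "rb") as f:
        for target in range(step, size, step):
            f.seek(target)
            f.readline()  # consume the partial line at the boundary
            pos = f.tell()
            if points[-1] < pos < size:
                points.append(pos)
    return points

# demo on a small synthetic "matrix": 100 lines of 8 bytes each
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    for i in range(100):
        tmp.write(b"gene%03d\n" % i)
seek_points = determine_seek_points(tmp.name, 4)
data = open(tmp.name, "rb").read()
```

Each worker process can then seek to its start point and read up to the next point, with no partial lines.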

@bistline bistline requested review from ehanna4, eweitz and jlchang October 3, 2022 14:31
@eweitz (Member) left a comment:
Code looks good overall!

I note two trivial blockers, and many non-blocking maintainability suggestions.

```python
'--matrix-file-path', help='path to matrix file', required=True
)
parser_expression_writer.add_argument(
'--matrix-file-type', help='type to matrix file (dense or mtx)', required=True
```
eweitz (Member):
Leverage the choices keyword argument to get built-in error handling and more idiomatic help output:

Suggested change:

```diff
-'--matrix-file-type', help='type to matrix file (dense or mtx)', required=True
+'--matrix-file-type', help='type of matrix file', required=True, choices=['dense', 'mtx']
```
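To illustrate the suggestion, a small standalone sketch of how `choices` behaves (the parser here is a minimal stand-in, not the real CLI setup):

```python
import argparse

parser = argparse.ArgumentParser(prog="ingest_pipeline.py")
parser.add_argument(
    "--matrix-file-type", help="type of matrix file",
    required=True, choices=["dense", "mtx"],
)

# a valid value parses normally
args = parser.parse_args(["--matrix-file-type", "mtx"])

# an unsupported value is rejected by argparse itself, which prints a
# usage error and exits, instead of failing somewhere downstream
try:
    parser.parse_args(["--matrix-file-type", "loom"])
    rejected = False
except SystemExit:
    rejected = True
```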

Comment on lines 371 to 376
```python
parser_expression_writer.add_argument(
    '--gene-file', help='path to gene file (None for dense matrix files)'
)
parser_expression_writer.add_argument(
    '--barcode-file', help='path to barcode file (None for dense matrix files)'
)
```
eweitz (Member):
Clarify usage, as (rightly) seen in expression_writer.py examples.

Suggested change:

```diff
 parser_expression_writer.add_argument(
-    '--gene-file', help='path to gene file (None for dense matrix files)'
+    '--gene-file', help='path to gene file (omit for dense matrix files)'
 )
 parser_expression_writer.add_argument(
-    '--barcode-file', help='path to barcode file (None for dense matrix files)'
+    '--barcode-file', help='path to barcode file (omit for dense matrix files)'
 )
```

```
EXAMPLES (must be invoked via ingest_pipeline.py)
dense matrix:
python3 ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 \
```
eweitz (Member):
study-file-id ought to be extraneous. I imagine it's only included because of baked-in assumptions of Ingest Pipeline. Could you refactor things so this argument can be omitted, or open a ticket to resolve this tech debt?

bistline (Contributor, author):
This is definitely baked into ingest pipeline, and I don't feel confident removing this. My argument against doing so would be that what we're trying to do with image pipeline & DE falls outside of traditional "ingest", and perhaps we shouldn't be invoking those jobs through the ingest harness. But either way, I can create a ticket to do so.

Comment on lines +10 to +13
```shell
render_expression_arrays --matrix-file-path ../tests/data/dense_expression_matrix.txt \
  --matrix-file-type dense \
  --cluster-file ../tests/data/cluster_example.txt \
  --cluster-name 'Dense Example' --render-expression-arrays
```
eweitz (Member):
Needing to specify both the subparser (render_expression_arrays) and pass essentially the same value as a flag (--render-expression-arrays) is another case of pre-existing tech debt that makes these examples more complex than necessary. Could you refactor or open a tech debt ticket for that?

bistline (Contributor, author):

Same - see above, will ticket.


```python
timestamp = datetime.datetime.now().isoformat(sep="T", timespec="seconds")
url_safe_timestamp = re.sub(':', '', timestamp)
log_name = f"expression_scatter_images_{url_safe_timestamp}_log.txt"
```
eweitz (Member):
Blocker: to avoid confusion with the downstream log file:

Suggested change:

```diff
-log_name = f"expression_scatter_images_{url_safe_timestamp}_log.txt"
+log_name = f"expression_scatter_data_{url_safe_timestamp}_log.txt"
```

bistline (Contributor, author):
Ah thanks - will change.
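For reference, the timestamp sanitization in the snippet above runs standalone as follows (using the post-review file name from the suggestion):

```python
import datetime
import re

timestamp = datetime.datetime.now().isoformat(sep="T", timespec="seconds")
# e.g. '2022-10-03T10:23:25' -> '2022-10-03T102325'; colons are stripped
# so the name is safe in URLs and bucket object paths
url_safe_timestamp = re.sub(":", "", timestamp)
log_name = f"expression_scatter_data_{url_safe_timestamp}_log.txt"
```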

```
:param file: (TextIO) open file object
:param column: (int) specific column to extract from entity file
:returns: (list)
```
eweitz (Member):
Suggested change:

```diff
-:returns: (list)
```

```
Determine which column from a 10X entity file contains valid data
:param file: (TextIO) open file object
:returns: (int)
```
eweitz (Member):
Suggested change:

```diff
-:returns: (int)
```

codecov bot commented Oct 4, 2022

Codecov Report

Base: 65.22% // Head: 66.75% // Increases project coverage by +1.53% 🎉

Coverage data is based on head (545c4a6) compared to base (71fc658).
Patch coverage: 88.80% of modified lines in pull request are covered.

Additional details and impacted files
```
@@               Coverage Diff               @@
##           development     #273      +/-   ##
===============================================
+ Coverage        65.22%   66.75%   +1.53%
===============================================
  Files               27       29       +2
  Lines             3715     3974     +259
===============================================
+ Hits              2423     2653     +230
- Misses            1292     1321      +29
```

| Impacted Files | Coverage Δ |
|---|---|
| ingest/ingest_pipeline.py | 55.92% <14.28%> (-1.86%) ⬇️ |
| ingest/expression_writer.py | 90.50% <90.50%> (ø) |
| ingest/writer_functions.py | 97.46% <97.46%> (ø) |
| ingest/cli_parser.py | 78.89% <100.00%> (+1.67%) ⬆️ |

☔ View full report at Codecov.

@ehanna4 (Contributor) left a comment:

Functional review only. Excited to finally run ingest pipeline locally! Through plenty of trial and error, and with assistance from Jon, I was able to run through all the test steps and produce the outputs as advertised.
