@bistline (Contributor) commented Oct 3, 2022

BACKGROUND

In order for Image Pipeline to have the necessary pre-rendered expression artifacts available for producing static expression scatter plot images, the ExpressionWriter class needs to be integrated into the rest of IngestPipeline so that it can be launched as a PAPI job. This means the CLI must be invocable through ingest_pipeline.py, able to localize/delocalize files from GCP buckets, and covered by adequate tests as part of normal CI for scp-ingest-pipeline.

CHANGES

This update changes ExpressionWriter from a standalone class to one that can be invoked through IngestPipeline. Basic functionality is unchanged, though file handling is now delegated to the IngestFiles class, which can push/pull files directly from GCP buckets. In addition, logging is now handled by the monitor.py module, and MixPanel/Sentry integration is handled through normal codepaths in IngestPipeline. Delocalization still happens serially at the end, so more downstream work will be required to make this performant in delocalizing ~28K files back to the bucket.
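For illustration, a minimal sketch of how that serial push could later be parallelized. This is an assumption about future work, not current code; `push_to_bucket` is a hypothetical stand-in for the real per-file upload (e.g. via IngestFiles):

```python
from concurrent.futures import ThreadPoolExecutor

def push_to_bucket(local_path):
    # hypothetical stand-in for a real GCS upload of one output file;
    # returns the path so callers can track completion
    return local_path

def delocalize_all(paths, max_workers=8):
    # upload files concurrently instead of one at a time; uploads are
    # I/O-bound, so a thread pool is a reasonable first pass
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(push_to_bucket, paths))

pushed = delocalize_all([f"gene_{i}.json.gz" for i in range(100)])
```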

This update also changes the Docker image used in CircleCI from the medium configuration to large. This addresses test processes getting killed under the memory load of parallelization via multiprocessing.

MANUAL TESTING

Note: if you have difficulties installing scp-ingest-pipeline packages locally, you can follow the instructions here to build and set up locally in Docker. Once that is complete, you can skip to step 3 below; any Python commands will then be run inside the Docker container.

  1. Pull this branch and initialize the scp-ingest-pipeline environment:

```shell
python3 -m venv env --copies
source env/bin/activate
pip install -r requirements.txt
```
  2. Set up MongoDB integration with the following, passing the path to your vault token (we don't need to talk to the database, but ingest will fail to execute without it):

```shell
source scripts/setup_mongo_dev.sh ~/path/to/github-token-for-vault.txt
```
  3. In a separate terminal, enter a Rails console session and get the bucket ID and browser link for any study with the following:

```ruby
study = Study.all.sample # you can use a specific study if you want, it doesn't matter
study.bucket_id
study.google_bucket_url
```
  4. Copy the URL from above and open it in a browser window, then copy the following files to the bucket:
  • tests/data/dense_expression_matrix.txt
  • tests/data/cluster_example.txt
  • tests/data/mtx/matrix_with_header.mtx
  • tests/data/mtx/cluster_mtx_barcodes.tsv
  • tests/data/mtx/sampled_genes.tsv
  • tests/data/mtx/barcodes.tsv
  5. Using the bucket ID from above, run a job using the dense matrix (the study_id and study_file_id values aren't used right now, so don't worry about getting valid ones):

```shell
BUCKET_ID="fc-c36727b0-1663-40fc-ac37-9016a6709829"
python3 ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 render_expression_arrays \
  --matrix-file-path gs://$BUCKET_ID/dense_expression_matrix.txt \
  --matrix-file-type dense \
  --cluster-file gs://$BUCKET_ID/cluster_example.txt \
  --cluster-name 'Dense Example' --render-expression-arrays
```
  6. You should see output similar to this:

```
distinct_id: 2f30ec50-a04d-4d43-8fd1-b136a2045079
studyAccession: SCPdev
fileName: 5dd5ae25421aa910a723a337
fileType: input_validation_bypassed
fileSize: 1
trigger: dev-mode
logger: ingest-pipeline
appId: single-cell-portal
```
  7. Confirm a log file for this run is created at expression_scatter_images_<timestamp>_log.txt with content like the following:

```
2022-10-03T10:23:25-0400 expression_writer INFO: creating data directory at Dense_Example
2022-10-03T10:23:25-0400 expression_writer INFO: reading gs://fc-c36727b0-1663-40fc-ac37-9016a6709829/expression_matrix_example.txt as dense matrix
2022-10-03T10:23:25-0400 expression_writer INFO: determining seek points for /tmp/expression_matrix_example.txt with chunk size 357
2022-10-03T10:23:31-0400 expression_writer INFO: completed, total runtime: 0h, 0m, 6s
2022-10-03T10:23:31-0400 expression_writer INFO: pushing all output files to gs://fc-c36727b0-1663-40fc-ac37-9016a6709829/_scp_internal/cache/expression_scatter/data/Dense_Example
2022-10-03T10:23:31-0400 expression_writer INFO: push completed
```
  8. In the browser window with the bucket, refresh the page and navigate to _scp_internal/cache/expression_scatter/data/Dense_Example
  9. Confirm there are two files here: Itm2a.json.gz and Sergef.json.gz
  10. Now run the sparse example:

```shell
BUCKET_ID="fc-c36727b0-1663-40fc-ac37-9016a6709829"
python3 ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 \
  render_expression_arrays --matrix-file-path gs://$BUCKET_ID/matrix_with_header.mtx \
  --matrix-file-type mtx \
  --gene-file gs://$BUCKET_ID/sampled_genes.tsv \
  --barcode-file gs://$BUCKET_ID/barcodes.tsv \
  --cluster-file gs://$BUCKET_ID/cluster_mtx_barcodes.tsv \
  --cluster-name 'Sparse Example' --render-expression-arrays
```
  11. Confirm similar output to above, and back in the bucket view, validate there are 7 new files for the following genes: C1orf159, RP11-345P4.9, GABRD, THAP3, DNAJC11, OXCT2, HOMER2
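As an aside, the log line about "determining seek points ... with chunk size" hints at how the matrix is split for multiprocessing. A minimal sketch of that idea follows; the function name and chunking details are illustrative assumptions, not the actual ExpressionWriter code:

```python
import os
import tempfile

def determine_seek_points(path, num_chunks):
    # Split the file into roughly equal byte ranges, then advance each
    # boundary to the next newline so every chunk starts on a full line.
    size = os.path.getsize(path)
    step = max(size // num_chunks, 1)
    points = [0]
    with open(path, "rb") as f:
        for target in range(step, size, step):
            f.seek(target)
            f.readline()  # consume the partial line at the boundary
            pos = f.tell()
            if points[-1] < pos < size:
                points.append(pos)
    return points

# demo on a small synthetic "matrix": 100 lines of 8 bytes each
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    for i in range(100):
        tmp.write(b"gene%03d\n" % i)
seek_points = determine_seek_points(tmp.name, 4)
data = open(tmp.name, "rb").read()
```

Each worker process can then seek to its start point and read up to the next point, with no partial lines.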

@bistline bistline requested review from ehanna4, eweitz and jlchang October 3, 2022 14:31
@eweitz (Member) left a comment:
Code looks good overall!

I note two trivial blockers, and many non-blocking maintainability suggestions.

```python
'--matrix-file-path', help='path to matrix file', required=True
)
parser_expression_writer.add_argument(
'--matrix-file-type', help='type to matrix file (dense or mtx)', required=True
```
eweitz (Member):
Leverage the choices keyword argument to get built-in error handling and more idiomatic help output:

Suggested change:

```diff
-'--matrix-file-type', help='type to matrix file (dense or mtx)', required=True
+'--matrix-file-type', help='type of matrix file', required=True, choices=['dense', 'mtx']
```
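To illustrate the suggestion, a small standalone sketch of how `choices` behaves (the parser here is a minimal stand-in, not the real CLI setup):

```python
import argparse

parser = argparse.ArgumentParser(prog="ingest_pipeline.py")
parser.add_argument(
    "--matrix-file-type", help="type of matrix file",
    required=True, choices=["dense", "mtx"],
)

# a valid value parses normally
args = parser.parse_args(["--matrix-file-type", "mtx"])

# an unsupported value is rejected by argparse itself, which prints a
# usage error and exits, instead of failing somewhere downstream
try:
    parser.parse_args(["--matrix-file-type", "loom"])
    rejected = False
except SystemExit:
    rejected = True
```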

Comment on lines 371 to 376
```python
parser_expression_writer.add_argument(
    '--gene-file', help='path to gene file (None for dense matrix files)'
)
parser_expression_writer.add_argument(
    '--barcode-file', help='path to barcode file (None for dense matrix files)'
)
```
eweitz (Member):
Clarify usage, as (rightly) seen in expression_writer.py examples.

Suggested change:

```diff
 parser_expression_writer.add_argument(
-    '--gene-file', help='path to gene file (None for dense matrix files)'
+    '--gene-file', help='path to gene file (omit for dense matrix files)'
 )
 parser_expression_writer.add_argument(
-    '--barcode-file', help='path to barcode file (None for dense matrix files)'
+    '--barcode-file', help='path to barcode file (omit for dense matrix files)'
 )
```

```
EXAMPLES (must be invoked via ingest_pipeline.py)
dense matrix:
python3 ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 \
```
eweitz (Member):
study-file-id ought to be extraneous. I imagine it's only included because of baked-in assumptions of Ingest Pipeline. Could you refactor things so this argument can be omitted, or open a ticket to resolve this tech debt?

bistline (Contributor, author):
This is definitely baked into ingest pipeline, and I don't feel confident removing this. My argument against doing so would be that what we're trying to do with image pipeline & DE falls outside of traditional "ingest", and perhaps we shouldn't be invoking those jobs through the ingest harness. But either way, I can create a ticket to do so.

Comment on lines +10 to +13
```shell
render_expression_arrays --matrix-file-path ../tests/data/dense_expression_matrix.txt \
  --matrix-file-type dense \
  --cluster-file ../tests/data/cluster_example.txt \
  --cluster-name 'Dense Example' --render-expression-arrays
```
eweitz (Member):
Needing to specify both the subparser (render_expression_arrays) and pass essentially the same value as a flag (--render-expression-arrays) is another case of pre-existing tech debt that makes these examples more complex than necessary. Could you refactor or open a tech debt ticket for that?

bistline (Contributor, author):

Same - see above, will ticket.


```python
timestamp = datetime.datetime.now().isoformat(sep="T", timespec="seconds")
url_safe_timestamp = re.sub(':', '', timestamp)
log_name = f"expression_scatter_images_{url_safe_timestamp}_log.txt"
```
eweitz (Member):
Blocker: to avoid confusion with the downstream log file:

Suggested change:

```diff
-log_name = f"expression_scatter_images_{url_safe_timestamp}_log.txt"
+log_name = f"expression_scatter_data_{url_safe_timestamp}_log.txt"
```

bistline (Contributor, author):
Ah thanks - will change.
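For reference, the timestamp sanitization in the snippet above runs standalone as follows (using the post-review file name from the suggestion):

```python
import datetime
import re

timestamp = datetime.datetime.now().isoformat(sep="T", timespec="seconds")
# e.g. '2022-10-03T10:23:25' -> '2022-10-03T102325'; colons are stripped
# so the name is safe in URLs and bucket object paths
url_safe_timestamp = re.sub(":", "", timestamp)
log_name = f"expression_scatter_data_{url_safe_timestamp}_log.txt"
```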

```
:param file: (TextIO) open file object
:param column: (int) specific column to extract from entity file
:returns: (list)
```
eweitz (Member):
Suggested change:

```diff
-:returns: (list)
```

```
Determine which column from a 10X entity file contains valid data
:param file: (TextIO) open file object
:returns: (int)
```
eweitz (Member):
Suggested change:

```diff
-:returns: (int)
```

codecov bot commented Oct 4, 2022

Codecov Report

Base: 65.22% // Head: 66.75% // Increases project coverage by +1.53% 🎉

Coverage data is based on head (545c4a6) compared to base (71fc658).
Patch coverage: 88.80% of modified lines in pull request are covered.

Additional details and impacted files
```
@@               Coverage Diff               @@
##           development     #273      +/-   ##
===============================================
+ Coverage        65.22%   66.75%   +1.53%
===============================================
  Files               27       29       +2
  Lines             3715     3974     +259
===============================================
+ Hits              2423     2653     +230
- Misses            1292     1321      +29
```

| Impacted Files | Coverage Δ |
|---|---|
| ingest/ingest_pipeline.py | 55.92% <14.28%> (-1.86%) ⬇️ |
| ingest/expression_writer.py | 90.50% <90.50%> (ø) |
| ingest/writer_functions.py | 97.46% <97.46%> (ø) |
| ingest/cli_parser.py | 78.89% <100.00%> (+1.67%) ⬆️ |

☔ View full report at Codecov.

@ehanna4 (Contributor) left a comment:

Functional review only. Excited to finally run ingest pipeline locally! Through plenty of trial and error, and with assistance from Jon, I was able to run through all the test steps and produce the outputs as advertised.
