
Conversation

@bistline (Contributor)

BACKGROUND

Work on the image pipeline can proceed much more quickly if, rather than reading expression data from the API, it can read from pre-rendered data artifacts that represent gene-level expression data for a given matrix/cluster combination. This reduces load on the portal and drastically speeds up rendering images.

CHANGES

This adds render_expression_arrays.py, a scratch script that takes a given cluster and matrix (dense, or sparse with features/barcodes files) and writes out optimized, compressed JSON arrays of the resulting expression from the matrix, filtered through the list of cells from the cluster. This mimics the expression attribute on expression visualization responses, including interpolating non-existent 0 values from sparse matrix files. These arrays can then be read directly by Image Pipeline, or even by the Plotly front end in an instance of SCP.
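For illustration, a minimal sketch of how a consumer such as Image Pipeline could read one of these artifacts, assuming the file layout shown under MANUAL TESTING below (hypothetical code, not part of this PR):

import glob
import gzip
import json

# The data directory name carries a per-run UUID suffix, so resolve it via glob
path = glob.glob('Dense_Example-*/Dense_Example--Sergef.json.gz')[0]
with gzip.open(path, 'rt') as artifact:
    expression = json.load(artifact)  # one value per cluster cell, e.g. [0, 1.2, 0, ...]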

MANUAL TESTING

  1. Pull the branch and navigate to the scripts/scratch_ingest directory
  2. Run the following command:
python3 render_expression_arrays.py --matrix-file ../../tests/data/dense_expression_matrix.txt \
                                    --cluster-file ../../tests/data/cluster_example.txt \
                                    --cluster-name 'Dense Example' --precision 1
  3. You should see output in the terminal like the following:
creating data directory at Dense_Example-1f6c26e7-5c6c-455b-b32c-d86e24166b08
using 1 digits of precision for non-zero data
reading ../../tests/data/dense_expression_matrix.txt as dense matrix
writing Itm2a data... 0.0s
writing Sergef data... 0.0s
completed, total runtime in minutes: 5e-05
  4. In the Dense_Example data directory, validate that there are two output files
  5. Open Dense_Example--Sergef.json.gz in a text editor, and validate that the non-zero expression values have only 1 digit of precision and the 0 values are all integers
  6. (Optional) Run the other example usage scripts from render_expression_arrays.py and confirm they all execute (the --help sketch below lists the available flags)
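Since the script builds its CLI with argparse (see the imports quoted in the review below), the full set of flags for those usage examples can be listed with the auto-generated help (assuming the default add_help behavior):

python3 render_expression_arrays.py --help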

@bistline requested review from ehanna4, eweitz and jlchang on August 30, 2022
codecov bot commented Aug 30, 2022

Codecov Report

Base: 65.02% // Head: 65.02% // No change to project coverage 👍

Coverage data is based on head (cbef530) compared to base (688c4b4).
Patch has no changes to coverable lines.

Additional details and impacted files
@@             Coverage Diff              @@
##           development     #269   +/-   ##
============================================
  Coverage        65.02%   65.02%           
============================================
  Files               28       28           
  Lines             3714     3714           
============================================
  Hits              2415     2415           
  Misses            1299     1299           


@eweitz (Member) left a comment


Code looks good! I suggest small maintainability refinements, definitely no blockers at this early prototyping stage.

Comment on lines +28 to +34
import json
import os
import re
import argparse
import uuid
import gzip
import time
@eweitz (Member):

+1 for only using the standard library, at least early on!

Comment on lines +45 to +46
# default level of precision
precision = 3
@eweitz (Member):

I like that this is somewhat abstracted. Per chat, a "contextual precision" approach might yield 10-100x smaller aggregate expression data sizes, as well as faster JSON.parse times for Image Pipeline and interactive end-user clients.

Abstracting this a bit earlier as you've done here makes that potential optimization that much easier.
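To make that concrete, here is a minimal sketch of precision-limited rounding (hypothetical helper, not the script's actual implementation) that keeps exact zeros as integers, matching the output described under MANUAL TESTING:

def round_expression(values, precision=3):
    # Keep exact zeros as ints so they serialize as 0, not 0.000;
    # round everything else to the configured number of digits
    return [0 if value == 0 else round(value, precision) for value in values]

round_expression([0.0, 1.23456, 7.7])  # -> [0, 1.235, 7.7]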


def make_data_dir(name):
    """
    Make a directory to put output files in
@eweitz (Member):

Could you briefly explain the UUIDv4's benefit in this docstring?

@bistline (Contributor, author):

Sure - this was mostly for local testing so that I could do multiple side-by-side runs and compare outputs.
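For anyone reading later, a minimal sketch of that pattern (the real implementation is in the PR diff) suffixes the directory name with a UUIDv4 so repeated runs never collide:

import os
import uuid

def make_data_dir(name):
    """
    Make a directory to put output files in, unique per run via a UUIDv4 suffix
    """
    dir_name = f'{name}-{uuid.uuid4()}'
    os.mkdir(dir_name)
    return dir_name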

Comment on lines +100 to +101
for row in cluster_file:
    cell = re.split(COMMA_OR_TAB, row)[0]
@eweitz (Member):

We use csv.reader from the standard library elsewhere in Ingest Pipeline for this sort of thing. But IIRC, that requires sniffing delimiters, which isn't needed here.

I wouldn't be surprised if that other approach is slightly faster, but I'm not confident it's notably so. Using this approach is fine by me; it just seemed worth commenting on.

@bistline (Contributor, author):

That's a fair point, and as this eventually integrates with the rest of ingest pipeline, it will likely switch to that. But there are also the issues we see with mime types & file extensions in the rest of ingest, and I figured that was just a little more overhead than I wanted to take on for the initial PoC.

@jlchang (Contributor):

From what I've seen in ingest pipeline, we aren't sniffing delimiters, we're only assessing file suffixes. That's probably the root of the mime type issues we've had. Jon's approach (or sniffing delimiters) would be helpful in addressing some of the issues we've had with file types.
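For reference, a minimal sketch of the delimiter-sniffing alternative (hypothetical helper, standard library only, not in this PR):

import csv

def detect_delimiter(file_path, sample_bytes=4096):
    # Let csv.Sniffer inspect a sample of the file instead of trusting its suffix
    with open(file_path) as f:
        dialect = csv.Sniffer().sniff(f.read(sample_bytes), delimiters=',\t')
    return dialect.delimiter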

Comment on lines 118 to 122
entities = []
for line in file:
    entry = re.split(ALL_DELIM, line.strip())[column]
    entities.append(entry)
return entities
@eweitz (Member) commented Aug 31, 2022:

This is a good case where for loops are faster for humans to read than the equivalent list comprehension.

I wonder how much faster/slower for machines a list comprehension would be here, but I don't consider it urgent to benchmark.

@bistline (Contributor, author):

Interesting - I didn't think there would be any performance implications, but this method was just complicated enough that the case for a column offset made the list comprehension hard to read. However, given that we eventually want to speed this up as much as possible, I'll look into benchmarking the two and see which is faster (probably the list comprehension).

@bistline (Contributor, author) commented Aug 31, 2022:

Did some profiling, and a list-comprehension implementation of this (see cbef530) was ~20% faster, though we're only talking about tenths of seconds for this particular method. But faster is faster! And on reflection, I don't think it really affects readability.
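For context, the list-comprehension form of the quoted hunk looks like the following (function name and ALL_DELIM value are approximations; see cbef530 for the real change):

import re

ALL_DELIM = '[,\t]'  # assumed pattern; the script defines its own constant

def get_column(file, column):
    # Equivalent list comprehension for the quoted for loop
    return [re.split(ALL_DELIM, line.strip())[column] for line in file]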

Comment on lines +131 to +136
matrix_file_path (String): path to matrix file
genes (List): gene names
barcodes (List): cell names
cluster_cells (List): cell names from cluster file
cluster_name (String): name of cluster object
data_dir (String): output data directory
@eweitz (Member):

FWIW, Python supports gradual typing. Parameter types are implicit by default, as they've historically been prior to Python 3.5, but types can be added as desired.

From what I understand, Python's take on types (which might also be Ruby's take?) is more readable than, say, TypeScript's. The latter requires all parameters to have explicit types, which often contributes to TS code being speckled with unhelpful, visually noisy any types that would simply be absent in Python.

So while our Python shouldn't require types, they might be worth adding where we document them in docstrings -- once we're beyond a prototyping stage of development.
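As a sketch, gradual typing applied to the parameters documented above would look like this (function name and return type are placeholders):

from typing import List

def render_artifacts(
    matrix_file_path: str,
    genes: List[str],
    barcodes: List[str],
    cluster_cells: List[str],
    cluster_name: str,
    data_dir: str,
) -> None:
    ...  # body unchanged; only the annotations are the point here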

@jlchang (Contributor) left a comment

Output files and directories are generated as described. Everything looks good.

@bistline merged commit b07278a into development on Sep 6, 2022
@bistline deleted the jb-image-data-artifacts branch on September 6, 2022 13:55