Adding script to generate data for image pipeline (SCP-4584) #269
Conversation
Codecov Report
Base: 65.02% // Head: 65.02% // No change to project coverage 👍
Additional details and impacted files:
@@ Coverage Diff @@
## development #269 +/- ##
============================================
Coverage 65.02% 65.02%
============================================
Files 28 28
Lines 3714 3714
============================================
Hits 2415 2415
Misses 1299 1299
☔ View full report at Codecov.
Code looks good! I suggest some small maintainability refinements; definitely no blockers at this early prototyping stage.
import json
import os
import re
import argparse
import uuid
import gzip
import time
+1 for only using the standard library, at least early on!
# default level of precision
precision = 3
I like that this is somewhat abstracted. Per chat, a "contextual precision" approach might yield 10-100x smaller aggregate expression data sizes, as well as faster JSON.parse times for Image Pipeline and interactive end-user clients.
Abstracting this a bit early, as you've done here, makes that potential optimization that much easier.
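To make the idea concrete, here is a minimal sketch of precision-based rounding before serialization; round_expression and its signature are hypothetical illustrations, not the script's actual API:

import json

# Hypothetical sketch: round expression values before serializing so long
# floats don't bloat the JSON payload. Exact zeros stay ints so they
# serialize as a bare 0.
def round_expression(values, precision=3):
    return [0 if v == 0 else round(v, precision) for v in values]

print(json.dumps(round_expression([0.0, 1.23456, 2.0])))  # [0, 1.235, 2.0]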
def make_data_dir(name):
    """
    Make a directory to put output files in
Could you briefly explain the UUIDv4's benefit in this docstring?
Sure - this was mostly for local testing so that I could do multiple side-by-side runs and compare outputs.
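For illustration, a docstring along these lines could capture that rationale; the directory-naming scheme in this sketch is assumed, not copied from the script:

import os
import uuid

def make_data_dir(name):
    """
    Make a directory to put output files in.

    The directory name includes a UUIDv4 suffix so that multiple local
    runs can write output side by side without overwriting each other.
    """
    data_dir = f"{name}_{uuid.uuid4()}"  # naming scheme assumed for illustration
    os.makedirs(data_dir, exist_ok=True)
    return data_dir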
for row in cluster_file:
    cell = re.split(COMMA_OR_TAB, row)[0]
We use csv.reader from the standard library elsewhere in Ingest Pipeline for this sort of thing. But IIRC, that requires sniffing delimiters, which isn't needed here.
I wouldn't be surprised if that other approach is slightly faster, but I'm not confident it's notably so. Using this approach is fine by me; it just seemed worth commenting on.
That's a fair point, and as this eventually integrates with the rest of Ingest Pipeline, it will likely switch to that. But there are also the issues we see with MIME types and file extensions in the rest of ingest, and I figured that was a little more overhead than I wanted to take on for the initial PoC.
From what I've seen in Ingest Pipeline, we aren't sniffing delimiters; we're only assessing file suffixes. That's probably the root of the MIME type issues we've had. Jon's approach (or sniffing delimiters) would be helpful in addressing some of the issues we've had with file types.
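For reference, a minimal sketch of what delimiter sniffing with the standard library could look like, as an alternative to trusting file suffixes (the file path here is hypothetical):

import csv

with open("cluster_file.tsv", newline="") as f:
    # infer comma vs. tab from a sample of the file rather than its suffix
    dialect = csv.Sniffer().sniff(f.read(1024))
    f.seek(0)
    for row in csv.reader(f, dialect):
        cell = row[0]  # same first-column access as the re.split approach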
entities = []
for line in file:
    entry = re.split(ALL_DELIM, line.strip())[column]
    entities.append(entry)
return entities
This is a good case where a for loop is faster for humans to read than the equivalent list comprehension.
I wonder how much faster or slower a list comprehension would be here for machines, but I don't consider it urgent to benchmark.
Interesting - I didn't think there would be any performance implications, but this method was just complicated enough that handling the column offset made the list comprehension hard to read. However, given that we eventually want to speed this up as much as possible, I'll look into benchmarking the two and see which is faster (probably the list comprehension).
Did some profiling, and a list comprehension implementation of this (see cbef530) was ~20% faster, though we're only talking about tenths of seconds for this particular method. But faster is faster! And on reflection, I don't really think it hurts readability.
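For anyone who wants to reproduce the comparison, something like the following works; the sample data and the value of ALL_DELIM are assumptions, not the PR's actual test setup:

import re
import timeit

ALL_DELIM = ",|\t"  # assumed value; the real constant lives in the script
lines = [f"cell_{i}\tCluster1" for i in range(10_000)]
column = 0

def with_loop():
    entities = []
    for line in lines:
        entry = re.split(ALL_DELIM, line.strip())[column]
        entities.append(entry)
    return entities

def with_comprehension():
    return [re.split(ALL_DELIM, line.strip())[column] for line in lines]

# time 100 passes of each implementation over the same input
print(timeit.timeit(with_loop, number=100))
print(timeit.timeit(with_comprehension, number=100))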
matrix_file_path (String): path to matrix file
genes (List): gene names
barcodes (List): cell names
cluster_cells (List): cell names from cluster file
cluster_name (String): name of cluster object
data_dir (String): output data directory
FWIW, Python supports gradual typing. Parameter types are implicit by default, as they've historically been prior to Python 3.5, but type hints can be added as desired.
From what I understand, Python's take on types (which might also be Ruby's take?) is more readable than, say, TypeScript's. The latter, at least with strict settings, requires all parameters to have explicit types, which often contributes to TS code being speckled with unhelpful, visually noisy "any" types -- which in Python would simply be absent.
So while our Python shouldn't require types, they might be worth adding where we document them in docstrings -- once we're beyond the prototyping stage of development.
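To make that concrete, the docstring parameters above could translate into hints like this; the function name and body are placeholders, not the script's actual code:

from typing import List

def render_cluster_arrays(
    matrix_file_path: str,
    genes: List[str],
    barcodes: List[str],
    cluster_cells: List[str],
    cluster_name: str,
    data_dir: str,
) -> None:
    # placeholder body; hints mirror the documented parameters above
    ...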
jlchang left a comment
Output files and directories are generated as described. Everything looks good.
BACKGROUND
Work on Image Pipeline can proceed much more quickly if, rather than reading expression data from the API, it can read from pre-rendered data artifacts that represent gene-level expression data for a given matrix/cluster combination. This reduces load on the portal and drastically speeds up rendering images.
CHANGES
This adds render_expression_arrays.py, a scratch script that takes a given cluster and matrix (dense, or sparse with features/barcodes files) and writes out optimized, compressed JSON arrays of the resulting expression values from the matrix, filtered through the list of cells from the cluster. This mimics the expression attribute on expression visualization responses, including interpolating non-existent 0 values from sparse matrix files. These arrays can then be read directly by Image Pipeline, or even the Plotly front end in an instance of SCP.
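For context, a minimal sketch of how a consumer like Image Pipeline could read one of these artifacts; the file name is taken from the MANUAL TESTING example below:

import gzip
import json

# open the gzipped artifact in text mode and parse the JSON array,
# which mimics the expression attribute on visualization responses
with gzip.open("Dense_Example--Sergef.json.gz", "rt") as f:
    expression = json.load(f)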
MANUAL TESTING
- Run the script from the scripts/scratch_ingest directory
- In the Dense_Example data directory, validate that there are two output files
- Open Dense_Example--Sergef.json.gz in a text editor, and validate that the non-zero expression values have only 1 digit of precision, and the 0 values are all integers
- Run the remaining usage examples in render_expression_arrays.py and confirm they all execute