-
Notifications
You must be signed in to change notification settings - Fork 0
Prototype DE pipeline for dense matrix inputs (SCP-4074) #238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
remove spaces from filename reduce significant digits log exception when annot missing
|
Had CI failure for test_make_toy with |
Codecov Report
@@ Coverage Diff @@
## development #238 +/- ##
===============================================
- Coverage 72.90% 71.29% -1.62%
===============================================
Files 26 27 +1
Lines 3385 3518 +133
===============================================
+ Hits 2468 2508 +40
- Misses 917 1010 +93
Continue to review full report at Codecov.
|
eweitz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kudos, I'm really excited for this prototype! A solid start to a big new scientific feature.
I note a trivial CLI UI blocker, and a bunch of non-blocking suggestions.
| cluster_cell_list = [] | ||
| for value in cluster_cell_values: | ||
| cluster_cell_list.extend(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be simplified / optimized like so?
| cluster_cell_list = [] | |
| for value in cluster_cell_values: | |
| cluster_cell_list.extend(value) | |
| cluster_cell_list = [] | |
| if len(cluster_cell_values) > 0: | |
| cluster_cell_list = cluster_cell_values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cluster_cells.tolist() yields a list of (single value) lists that needs to be flattened
using extend converts each single-value list (ie. cell name) to a plain value
I'll add a comment that reflects the above complication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5928217
ingest/de.py
Outdated
| self.genes, | ||
| self.barcodes, | ||
| ) | ||
| DifferentialExpression.de_logger.info(f"preparing DE on sparse matrix") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No f-string needed here:
| DifferentialExpression.de_logger.info(f"preparing DE on sparse matrix") | |
| DifferentialExpression.de_logger.info("preparing DE on sparse matrix") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5928217
ingest/de.py
Outdated
| self.cluster_name, | ||
| self.method, | ||
| ) | ||
| DifferentialExpression.de_logger.info(f"preparing DE on dense matrix") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No f-string needed here:
| DifferentialExpression.de_logger.info(f"preparing DE on dense matrix") | |
| DifferentialExpression.de_logger.info("preparing DE on dense matrix") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5928217
ingest/de.py
Outdated
| """ | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a description here? Otherwise such stragglers seems best to omit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed 5928217
| # For AnnData, obs are cells and vars are genes | ||
| # BUT transpose needed for both dense and sparse | ||
| # so transpose step is after this data object composition step | ||
| # therefore the assignements below are the reverse of expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Helpful note!
ingest/de.py
Outdated
| f'{cluster_name}--{annotation}--{str(group_filename)}--{method}.tsv' | ||
| ) | ||
|
|
||
| rank.to_csv(out_file, sep='\t', float_format='%.4g', index=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This rounds instead of crudely truncating, right? In either case, a brief explanatory note would ease understanding.
| rank.to_csv(out_file, sep='\t', float_format='%.4g', index=False) | |
| # Round numbers to 4th decimal place, e.g. 0.12345 -> 0.1235 | |
| rank.to_csv(out_file, sep='\t', float_format='%.4g', index=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5928217
ingest/ingest_pipeline.py
Outdated
| **self.kwargs, | ||
| ) | ||
| de.execute_de() | ||
| # ToDo: surface failed DE for analytics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you ticket this? We typically do that for all TODOs.
| # ToDo: surface failed DE for analytics | |
| # ToDo: surface failed DE for analytics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, created SCP-4206 and noted in comment. 5928217
ingest/ingest_pipeline.py
Outdated
| elif "differential_expression" in arguments: | ||
| if arguments["differential_expression"]: | ||
| config.set_parent_event_name( | ||
| "ingest-pipeline:differential_expression:ingest" | ||
| ) | ||
| status_de = ingest.calculate_de() | ||
| status.append(status_de) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The inner condition strikes me as something that ought to be redundant with the outer condition.
Having ingest as the terminal segment of the event name also feels confusing. In the context of our pipelines, it'd be cool if we could simply say "ingest" to mean "traditionally transform and load SCP study files to MongoDB" and "differential expression" to mean "run DE calculations on previously-validated and ingested data".
Finally, trivially, delimiter should be hyphen instead of underscore.
| elif "differential_expression" in arguments: | |
| if arguments["differential_expression"]: | |
| config.set_parent_event_name( | |
| "ingest-pipeline:differential_expression:ingest" | |
| ) | |
| status_de = ingest.calculate_de() | |
| status.append(status_de) | |
| elif "differential-expression" in arguments: | |
| config.set_parent_event_name( | |
| "ingest-pipeline:differential-expression" | |
| ) | |
| status_de = ingest.calculate_de() | |
| status.append(status_de) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5928217
ingest/cli_parser.py
Outdated
| ) | ||
|
|
||
| parser_differential_expression.add_argument( | ||
| "--differential_expression", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocker: for command-line arguments here, we always delimit with hyphens.
| "--differential_expression", | |
| "--differential-expression", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5928217
ingest/ingest_pipeline.py
Outdated
| from .expression_files.dense_ingestor import DenseIngestor | ||
| from .expression_files.mtx import MTXIngestor | ||
| from .cli_parser import create_parser, validate_arguments | ||
| from .de import DifferentialExpression, prepare_h5ad |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks unused in this module, so let's remove it.
Otherwise, I think we'd need to also include it in the import block above this one.
| from .de import DifferentialExpression, prepare_h5ad | |
| from .de import DifferentialExpression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching that! I had removed from the top import block and neglected the second. 5928217
eweitz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I meant to request changes for that trivial blocker per my previous review summary.)
address PR comments
devonbush
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved for merge -- we'll do a deep code dive later
To allow progress on engineering-side development work for the differential expression (DE) feature, this PR aims to provide a working "happy path" DE process that generates output files from a DE analysis on a couple of known good datasets (SCP1671 and SCP1539; Note, the test case is derived from SCP1677 - the cell_type assignments in the de_integration_metadata.tsv file are bogus)
This prototype pipeline still needs tests (both integration and unit tests), try/except clauses etc. and capability to run on studies that have sparse matrix. Suggestions welcomed for how best to harden this code - please indicate whether the suggestion should be implemented for this PR or can be incorporated into a downstream "hardening" PR.
To test:
activate the scp-ingest-pipeline repo virtualenv (added scanpy, updated numpy and scipy versions)
then from the ingest directory of the scp-ingest-pipeline repo:
source ../scripts/setup_mongo_dev.sh <path to your Github token file>unset BARD_HOST_URLpython ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation cell_type__ontology_label --matrix-file-path ../tests/data/differential_expression/de_integration.tsv --matrix-file-type dense --cell-metadata-file ../tests/data/differential_expression/de_integration_unordered_metadata.tsv --cluster-file ../tests/data/differential_expression/de_integration_cluster.tsv --name de_integration --study-accession SCPdev --differential_expressionconfirm that the script runs successfully and generates the following files:
Content of:
can be compared to a reference file at ../tests/data/differential_expression/reference
This PR supports SCP-4074.