Contributor

@jlchang jlchang commented Mar 24, 2022

To allow progress on engineering-side development work for the differential expression (DE) feature, this PR aims to provide a working "happy path" DE process that generates output files from a DE analysis on a couple of known good datasets (SCP1671 and SCP1539). Note: the test case is derived from SCP1677, and the cell_type assignments in the de_integration_metadata.tsv file are bogus.

This prototype pipeline still needs tests (both integration and unit tests), error handling (try/except clauses, etc.), and the ability to run on studies that have sparse matrices. Suggestions are welcome for how best to harden this code; please indicate whether each suggestion should be implemented in this PR or can be incorporated into a downstream "hardening" PR.

To test:

Activate the scp-ingest-pipeline repo virtualenv (scanpy was added and the numpy and scipy versions were updated),
then, from the ingest directory of the scp-ingest-pipeline repo, run:

source ../scripts/setup_mongo_dev.sh <path to your Github token file>
unset BARD_HOST_URL

python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation cell_type__ontology_label --matrix-file-path ../tests/data/differential_expression/de_integration.tsv --matrix-file-type dense --cell-metadata-file ../tests/data/differential_expression/de_integration_unordered_metadata.tsv --cluster-file ../tests/data/differential_expression/de_integration_cluster.tsv --name de_integration --study-accession SCPdev --differential_expression

Confirm that the script runs successfully and generates the following files:

de_integration--cell_type__ontology_label--cholinergic_neuron--wilcoxon.tsv
de_integration--cell_type__ontology_label--cranial_somatomotor_neuron--wilcoxon.tsv
de_integration--cell_type__ontology_label--pyramidal_neuron--wilcoxon.tsv
de_integration--cell_type__ontology_label--somatomotor_neuron--wilcoxon.tsv
de_integration--cell_type__ontology_label--sympathetic_cholinergic_neuron--wilcoxon.tsv
de_log.txt

The content of de_integration--cell_type__ontology_label--cholinergic_neuron--wilcoxon.tsv can be compared to a reference file at ../tests/data/differential_expression/reference
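
As a hedged sketch of the file check described above, the following builds the expected output filenames from the pattern visible in the file list (name, annotation, group, and method joined by `--`) and reports any that are missing from the current directory. The helper names `expected_de_files` and `missing_files` are hypothetical and not part of the pipeline, and the group names are assumed to be already sanitized as they appear in the list.

```python
from pathlib import Path


def expected_de_files(name, annotation, groups, method="wilcoxon"):
    """Build expected DE output filenames (hypothetical helper, not pipeline code)."""
    return [f"{name}--{annotation}--{group}--{method}.tsv" for group in groups]


def missing_files(filenames, directory="."):
    """Return the filenames that do not exist in `directory`."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]


# Group names copied from the file list in the PR description.
groups = [
    "cholinergic_neuron",
    "cranial_somatomotor_neuron",
    "pyramidal_neuron",
    "somatomotor_neuron",
    "sympathetic_cholinergic_neuron",
]
expected = expected_de_files("de_integration", "cell_type__ontology_label", groups)
expected.append("de_log.txt")

missing = missing_files(expected)
print("all expected files present" if not missing else f"missing: {missing}")
```

Run from the directory where the pipeline wrote its output; an empty `missing` list means every listed file was generated.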

This PR supports SCP-4074.

@jlchang jlchang requested review from devonbush, ehanna4 and eweitz March 24, 2022 13:50
Contributor Author

jlchang commented Mar 24, 2022

CI failed on test_make_toy with ConnectionResetError: [Errno 104] Connection reset by peer.
A CI retry succeeded, so the failure was likely a transient problem with a third-party dependency rather than a real regression.


codecov bot commented Mar 24, 2022

Codecov Report

Merging #238 (5928217) into development (6fff171) will decrease coverage by 1.61%.
The diff coverage is 31.61%.

@@               Coverage Diff               @@
##           development     #238      +/-   ##
===============================================
- Coverage        72.90%   71.29%   -1.62%     
===============================================
  Files               26       27       +1     
  Lines             3385     3518     +133     
===============================================
+ Hits              2468     2508      +40     
- Misses             917     1010      +93     
Impacted Files               Coverage Δ
ingest/de.py                 22.01% <22.01%> (ø)
ingest/ingest_pipeline.py    61.29% <38.46%> (-1.40%) ⬇️
ingest/annotations.py        89.50% <100.00%> (+0.05%) ⬆️
ingest/cli_parser.py         76.59% <100.00%> (+3.42%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6fff171...5928217. Read the comment docs.

Member

@eweitz eweitz left a comment

Kudos, I'm really excited for this prototype! A solid start to a big new scientific feature.

I note a trivial CLI UI blocker, and a bunch of non-blocking suggestions.

Comment on lines +76 to +78
cluster_cell_list = []
for value in cluster_cell_values:
    cluster_cell_list.extend(value)
Member

Could this be simplified / optimized like so?

Suggested change
- cluster_cell_list = []
- for value in cluster_cell_values:
-     cluster_cell_list.extend(value)
+ cluster_cell_list = []
+ if len(cluster_cell_values) > 0:
+     cluster_cell_list = cluster_cell_values

Contributor Author

cluster_cells.tolist() yields a list of single-value lists that needs to be flattened.
Using extend converts each single-value list (i.e., a cell name) to a plain value.

I'll add a comment that reflects the above complication.
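
A minimal sketch of the flattening described above, assuming the values arrive as a list of one-element lists (the sample cell names are made up for illustration). It also shows why a plain assignment would not be a drop-in replacement, and one standard-library alternative to the loop.

```python
from itertools import chain

# Assumed shape of the data: each cell name is wrapped in its own list,
# as happens when .tolist() is called on a single-column table.
cluster_cell_values = [["cell_A"], ["cell_B"], ["cell_C"]]

# Approach used in the PR: extend() unpacks each inner list into the flat list.
cluster_cell_list = []
for value in cluster_cell_values:
    cluster_cell_list.extend(value)

# Equivalent one-liner using the standard library.
flattened = list(chain.from_iterable(cluster_cell_values))

print(cluster_cell_list)                         # ['cell_A', 'cell_B', 'cell_C']
print(flattened == cluster_cell_list)            # True

# Plain assignment would keep the nesting, so the result would differ.
print(cluster_cell_values == cluster_cell_list)  # False
```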

Contributor Author

5928217

ingest/de.py Outdated
self.genes,
self.barcodes,
)
DifferentialExpression.de_logger.info(f"preparing DE on sparse matrix")
Member

No f-string needed here:

Suggested change
- DifferentialExpression.de_logger.info(f"preparing DE on sparse matrix")
+ DifferentialExpression.de_logger.info("preparing DE on sparse matrix")

Contributor Author

5928217

ingest/de.py Outdated
self.cluster_name,
self.method,
)
DifferentialExpression.de_logger.info(f"preparing DE on dense matrix")
Member

No f-string needed here:

Suggested change
- DifferentialExpression.de_logger.info(f"preparing DE on dense matrix")
+ DifferentialExpression.de_logger.info("preparing DE on dense matrix")

Contributor Author

5928217

ingest/de.py Outdated
Comment on lines 186 to 187
"""
"""
Member

Could you add a description here? Otherwise such stragglers seem best to omit.

Contributor Author

removed 5928217

Comment on lines +198 to +201
# For AnnData, obs are cells and vars are genes
# BUT transpose needed for both dense and sparse
# so transpose step is after this data object composition step
# therefore the assignments below are the reverse of expected
Member

Helpful note!
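
To illustrate the orientation issue the code comment describes, here is a small sketch in which plain Python lists stand in for the real matrix and AnnData object. It assumes, without confirmation from the PR, that the input matrix has genes as rows and cells as columns; all names and values are illustrative.

```python
# Assumed input layout (not confirmed by the PR): rows are genes, columns are cells.
genes = ["GeneA", "GeneB"]
barcodes = ["cell_1", "cell_2", "cell_3"]
genes_by_cells = [
    [1.0, 0.0, 2.0],  # GeneA expression in cell_1..cell_3
    [0.0, 3.0, 4.0],  # GeneB expression in cell_1..cell_3
]

# Before transposing, row labels are genes and column labels are cells,
# i.e. the reverse of the AnnData convention (obs = cells, var = genes).
row_labels, col_labels = genes, barcodes

# Transpose so that rows become cells and columns become genes.
cells_by_genes = [list(column) for column in zip(*genes_by_cells)]
obs_names, var_names = col_labels, row_labels

print(cells_by_genes)  # [[1.0, 0.0], [0.0, 3.0], [2.0, 4.0]]
print(obs_names)       # ['cell_1', 'cell_2', 'cell_3']
print(var_names)       # ['GeneA', 'GeneB']
```

After the transpose, the number of rows matches the number of cells, which is the orientation the comment says AnnData expects.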

ingest/de.py Outdated
f'{cluster_name}--{annotation}--{str(group_filename)}--{method}.tsv'
)

rank.to_csv(out_file, sep='\t', float_format='%.4g', index=False)
Member

This rounds instead of crudely truncating, right? In either case, a brief explanatory note would ease understanding.

Suggested change
- rank.to_csv(out_file, sep='\t', float_format='%.4g', index=False)
+ # Round numbers to 4 significant digits, e.g. 0.12345 -> 0.1235
+ rank.to_csv(out_file, sep='\t', float_format='%.4g', index=False)

Contributor Author

5928217
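
To illustrate what the `'%.4g'` format does to values, here is a small sketch using Python's printf-style formatting directly rather than pandas; the sample values are made up. The `g` conversion rounds to four significant digits (not four decimal places), strips trailing zeros, and switches to exponent notation for large magnitudes.

```python
values = [0.12345, 123.456, 0.000123456, 1234567.0, 0.5]

# Format each value the same way the float_format string would.
formatted = ["%.4g" % value for value in values]

for value, text in zip(values, formatted):
    print(value, "->", text)
# 0.12345 -> 0.1235
# 123.456 -> 123.5
# 0.000123456 -> 0.0001235
# 1234567.0 -> 1.235e+06
# 0.5 -> 0.5
```

Values are rounded rather than truncated, and large values lose their low-order digits, which may matter when comparing output against a reference file.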

**self.kwargs,
)
de.execute_de()
# ToDo: surface failed DE for analytics
Member

Could you ticket this? We typically do that for all TODOs.

Suggested change
# ToDo: surface failed DE for analytics
# ToDo: surface failed DE for analytics

Contributor Author

Yup, created SCP-4206 and noted in comment. 5928217

Comment on lines 514 to 520
elif "differential_expression" in arguments:
    if arguments["differential_expression"]:
        config.set_parent_event_name(
            "ingest-pipeline:differential_expression:ingest"
        )
        status_de = ingest.calculate_de()
        status.append(status_de)
Member

The inner condition strikes me as something that ought to be redundant with the outer condition.

Having ingest as the terminal segment of the event name also feels confusing. In the context of our pipelines, it'd be cool if we could simply say "ingest" to mean "traditionally transform and load SCP study files to MongoDB" and "differential expression" to mean "run DE calculations on previously-validated and ingested data".

Finally, trivially, the delimiter should be a hyphen instead of an underscore.

Suggested change
- elif "differential_expression" in arguments:
-     if arguments["differential_expression"]:
-         config.set_parent_event_name(
-             "ingest-pipeline:differential_expression:ingest"
-         )
-         status_de = ingest.calculate_de()
-         status.append(status_de)
+ elif "differential-expression" in arguments:
+     config.set_parent_event_name(
+         "ingest-pipeline:differential-expression"
+     )
+     status_de = ingest.calculate_de()
+     status.append(status_de)

Contributor Author

5928217
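
On the question of whether the inner and outer conditions are redundant, the sketch below shows how a key-presence check and a truthiness check can differ under one hypothetical parser layout: a subcommand carrying a `store_true` flag. The real `cli_parser.py` may be set up differently, so this illustrates argparse behavior only, not the pipeline's actual logic.

```python
import argparse

# Hypothetical parser layout; the real cli_parser.py may differ.
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
de_parser = subparsers.add_parser("differential_expression")
de_parser.add_argument("--differential-expression", action="store_true")
subparsers.add_parser("ingest_cluster")

# argparse stores the flag under the underscore name differential_expression.
with_flag = vars(
    parser.parse_args(["differential_expression", "--differential-expression"])
)
without_flag = vars(parser.parse_args(["differential_expression"]))
other_command = vars(parser.parse_args(["ingest_cluster"]))

# Key present and True: subcommand chosen and flag passed.
print("differential_expression" in with_flag, with_flag["differential_expression"])
# Key present but False: subcommand chosen, flag omitted.
print("differential_expression" in without_flag, without_flag["differential_expression"])
# Key absent: a different subcommand was chosen.
print("differential_expression" in other_command)
```

Under this assumed layout the two checks coincide only if the flag is always supplied with the subcommand.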

)

parser_differential_expression.add_argument(
    "--differential_expression",
Member

Blocker: for command-line arguments here, we always delimit with hyphens.

Suggested change
-     "--differential_expression",
+     "--differential-expression",

Contributor Author

5928217
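
One detail worth keeping in mind when switching a flag to hyphens: argparse derives the attribute name from the long option and converts internal hyphens to underscores. The standalone sketch below (a throwaway parser, not the pipeline's) shows that code reading the parsed arguments still looks the value up by the underscore name.

```python
import argparse

# Throwaway parser for illustration; not the pipeline's parser.
parser = argparse.ArgumentParser()
parser.add_argument("--differential-expression", action="store_true")
parser.add_argument("--study-id")

arguments = vars(
    parser.parse_args(
        ["--differential-expression", "--study-id", "addedfeed000000000000000"]
    )
)

print(sorted(arguments))                       # ['differential_expression', 'study_id']
print("differential-expression" in arguments)  # False
print(arguments["differential_expression"])    # True
```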

from .expression_files.dense_ingestor import DenseIngestor
from .expression_files.mtx import MTXIngestor
from .cli_parser import create_parser, validate_arguments
from .de import DifferentialExpression, prepare_h5ad
Member

This looks unused in this module, so let's remove it.

Otherwise, I think we'd need to also include it in the import block above this one.

Suggested change
- from .de import DifferentialExpression, prepare_h5ad
+ from .de import DifferentialExpression

Contributor Author

Thanks for catching that! I had removed it from the top import block and neglected the second. 5928217

Member

@eweitz eweitz left a comment

(I meant to request changes for that trivial blocker per my previous review summary.)

@jlchang jlchang requested a review from eweitz March 25, 2022 19:00
Contributor

@devonbush devonbush left a comment

Approved for merge -- we'll do a deep code dive later

@jlchang jlchang merged commit 90db079 into development Mar 28, 2022
@jlchang jlchang deleted the jlc_make_data_objects branch March 28, 2022 15:07