-
Notifications
You must be signed in to change notification settings - Fork 0
Enable DE on cluster annotations (SCP-4258) #251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
eweitz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great expansion of the DE prototype!
I note one blocker regarding terminology that seems especially likely to confuse maintainers, and suggest a handful of other, non-blocking maintainability refinements.
tests/test_de.py
Outdated
| expected_file = ( | ||
| f"{cluster_name}--{annotation}--{str(sanitized_label)}--{method}.tsv" | ||
| ) | ||
| expected_file = f"{cluster_name}--{annotation}--{str(sanitized_label)}--{scope}--{method}.tsv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This str is extraneous (see prior note):
| expected_file = f"{cluster_name}--{annotation}--{str(sanitized_label)}--{scope}--{method}.tsv" | |
| expected_file = f"{cluster_name}--{annotation}--{sanitized_label}--{scope}--{method}.tsv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops - had removed it in de.py but overlooked it in the test. Fixed now in e314978.
ingest/de.py
Outdated
| annots = np.unique(adata.obs[annotation]).tolist() | ||
| for annot in annots: | ||
| annot_label = re.sub(r'\W+', '_', annot) | ||
| clean_annotation = re.sub(r'\W+', '_', annotation) | ||
| DifferentialExpression.de_logger.info( | ||
| f"Writing DE output for {annot_label}" | ||
| ) | ||
| rank = sc.get.rank_genes_groups_df(adata, key=rank_key, group=annot) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocker: "annotation" vs. "annot" is especially confusing. Referring to "annot" as "group" seems clearer, and better aligns with terminology from single_cell_portal_core.
| annots = np.unique(adata.obs[annotation]).tolist() | |
| for annot in annots: | |
| annot_label = re.sub(r'\W+', '_', annot) | |
| clean_annotation = re.sub(r'\W+', '_', annotation) | |
| DifferentialExpression.de_logger.info( | |
| f"Writing DE output for {annot_label}" | |
| ) | |
| rank = sc.get.rank_genes_groups_df(adata, key=rank_key, group=annot) | |
| groups = np.unique(adata.obs[annotation]).tolist() | |
| for group in groups: | |
| group_label = re.sub(r'\W+', '_', group) | |
| clean_annotation = re.sub(r'\W+', '_', annotation) | |
| DifferentialExpression.de_logger.info( | |
| f"Writing DE output for {group}" | |
| ) | |
| rank = sc.get.rank_genes_groups_df(adata, key=rank_key, group=group) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e314978
ingest/de.py
Outdated
| rank = sc.get.rank_genes_groups_df(adata, key=rank_key, group=annot) | ||
|
|
||
| out_file = f'{cluster_name}--{annotation}--{group_filename}--{method}.tsv' | ||
| out_file = f'{cluster_name}--{clean_annotation}--{annot_label}--{annot_scope}--{method}.tsv' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocker: see adjacent comment.
| out_file = f'{cluster_name}--{clean_annotation}--{annot_label}--{annot_scope}--{method}.tsv' | |
| out_file = f'{cluster_name}--{clean_annotation}--{group}--{annot_scope}--{method}.tsv' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e314978
ingest/cli_parser.py
Outdated
| parser_differential_expression.add_argument( | ||
| "--annotation-scope", | ||
| required=True, | ||
| help="Scope of annotation file for DE analysis", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consistent with other --annotation-foo arguments.
| help="Scope of annotation file for DE analysis", | |
| help="Scope of annotation for DE analysis", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e314978
ingest/cli_parser.py
Outdated
|
|
||
| parser_differential_expression.add_argument( | ||
| "--cell-metadata-file", | ||
| "--annot-metadata-file", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "metadata" qualifier here seems likelier to confuse than to clarify. "Metadata" strongly connotes "metadata file" in SCP file terminology, and I'm unaware of any other case where we use the phrase "annotation metadata file".
We already use the phrase "annotation file" to mean "cluster file or metadata file" in e.g. annotations.py, so reusing that here (and spelling out the word) seems best for maintainability.
| "--annot-metadata-file", | |
| "--annotation-file", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e314978
ingest/ingest_pipeline.py
Outdated
| if "differential_expression" in arguments: | ||
| # DE may use metadata or cluster file for annots BUT | ||
| # IngestPipeline initialization will need a "cell_metadata_file" | ||
| arguments["cell_metadata_file"] = arguments["annot_metadata_file"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per comment re --annot-metadata-file.
| arguments["cell_metadata_file"] = arguments["annot_metadata_file"] | |
| arguments["cell_metadata_file"] = arguments["annotation_file"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e314978
ingest/ingest_pipeline.py
Outdated
| # DE may use metadata or cluster file for annots BUT | ||
| # IngestPipeline initialization will need a "cell_metadata_file" | ||
| arguments["cell_metadata_file"] = arguments["annot_metadata_file"] | ||
| # IngestPipeline initialiation expects "name" and not "cluster_name" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # IngestPipeline initialiation expects "name" and not "cluster_name" | |
| # IngestPipeline initialization expects "name" and not "cluster_name" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e314978
Codecov Report
@@ Coverage Diff @@
## development #251 +/- ##
===============================================
- Coverage 64.52% 64.52% -0.01%
===============================================
Files 27 27
Lines 3572 3586 +14
===============================================
+ Hits 2305 2314 +9
- Misses 1267 1272 +5
Continue to review full report at Codecov.
|
bistline
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, with two small things:
- The help text has an incorrect argument - it lists
--annot-metadata-filewhen it should have--annotation-file - I don't understand the need for the
differential_expressionargument and--differential-expressionkwarg. It seems redundant, but perhaps there's something I'm missing?
| f" Invalid argument: unable to connect to a BigQuery table called {parsed_args.bq_table}." | ||
| ) | ||
| if ( | ||
| "differential_expression" in parsed_args |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this in addition to the --differential-expression kwarg? They seem redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I only check for the annotation_type kwarg, that argument isn't set for other types of ingest pipeline runs and all other non-DE ingest jobs would fail with:
AttributeError: 'Namespace' object has no attribute 'annotation_type'
Checking for "differential_expression" in parsed_args, ensures I'm only doing this validation for DE jobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I didn't make myself clear - apologies. If we look at the example command line:
python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation-name cluster_annot --annotation-type group --annotation-scope cluster --matrix-file-path ../tests/data/differential_expression/de_integration.tsv --matrix-file-type dense --annotation-file ../tests/data/differential_expression/de_integration_annotated_cluster.txt --cluster-file ../tests/data/differential_expression/de_integration_annotated_cluster.txt --cluster-name de_integration --study-accession SCPdev --differential-expression
There is the positional argument of differential_expression that you call out here, but then also at the end, there is the --differential-expression kwarg. Why do we need both? Looking in cli_parser.py, I see this:
# Differential expression subparsers
parser_differential_expression = subparsers.add_parser(
"differential_expression",
help="Indicates differential expression analysis processing",
)
parser_differential_expression.add_argument(
"--differential-expression",
required=True,
action="store_true",
help="Indicates that differential expression analysis should be invoked",
)
They look to me to be functionally equivalent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for chatting, Jon. As we discussed, that paradigm was set up for cluster and cell metadata ingest. I'm following that pattern in DE jobs for consistency and in case there's a quirk about subparsers that I don't understand.
For current public, initialized studies with raw counts, 19/97 studies use cluster-specific annotations as their default annotation. To make DE available for the default explore tab view for ~20% of available studies for DE, extend DE process to incorporate cluster file annotations for DE analysis.
Investigated combining metadata and cluster annotations into same AnnData object. Scanpy’s rank_genes_groups expects annotation for DE to be in observations dataframe. Studies like SCP1415 use the same name for annotations in BOTH metadata AND cluster files - potential name collision when combining both sets of metadata ("study-wide" from metadata file and "cluster-specific" from cluster file) into the same dataframe.
Consulted with Devon and decided reasonable workaround is to run two separate jobs and keep the annotations separate. This update to DE requires --annotation-type and --annotation-scope as inputs and the annotation scope is reflected in the output file name ({cluster_name}--{clean_annotation}--{annot_label}--{annot_scope}--{method}.tsv)
To test, run DE using a cluster file with annotations as the "--annot-metadata-file":
activate the scp-ingest-pipeline repo virtualenv
then from the scripts directory of the scp-ingest-pipeline repo, perform this setup:
Run the DE job from the ingest directory of the scp-ingest-pipeline repo:
python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation-name cluster_annot --annotation-type group --annotation-scope cluster --matrix-file-path ../tests/data/differential_expression/de_integration.tsv --matrix-file-type dense --annotation-file ../tests/data/differential_expression/de_integration_annotated_cluster.txt --cluster-file ../tests/data/differential_expression/de_integration_annotated_cluster.txt --cluster-name de_integration --study-accession SCPdev --differential-expressionconfirm that the script runs successfully and generates the following files:
This satisfies SCP-4258.