Enable DE on cluster annotations (SCP-4258) #251

jlchang · 2022-04-21T14:42:29Z

For current public, initialized studies with raw counts, 19/97 studies use cluster-specific annotations as their default annotation. To make DE available for the default explore tab view for ~20% of available studies for DE, extend DE process to incorporate cluster file annotations for DE analysis.

Investigated combining metadata and cluster annotations into same AnnData object. Scanpy’s rank_genes_groups expects annotation for DE to be in observations dataframe. Studies like SCP1415 use the same name for annotations in BOTH metadata AND cluster files - potential name collision when combining both sets of metadata ("study-wide" from metadata file and "cluster-specific" from cluster file) into the same dataframe.

Consulted with Devon and decided reasonable workaround is to run two separate jobs and keep the annotations separate. This update to DE requires --annotation-type and --annotation-scope as inputs and the annotation scope is reflected in the output file name ({cluster_name}--{clean_annotation}--{annot_label}--{annot_scope}--{method}.tsv)

To test, run DE using a cluster file with annotations as the "--annot-metadata-file":

activate the scp-ingest-pipeline repo virtualenv
then from the scripts directory of the scp-ingest-pipeline repo, perform this setup:

source ../scripts/setup_mongo_dev.sh <path to your Github token file>
unset BARD_HOST_URL

Run the DE job from the ingest directory of the scp-ingest-pipeline repo:
python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation-name cluster_annot --annotation-type group --annotation-scope cluster --matrix-file-path ../tests/data/differential_expression/de_integration.tsv --matrix-file-type dense --annotation-file ../tests/data/differential_expression/de_integration_annotated_cluster.txt --cluster-file ../tests/data/differential_expression/de_integration_annotated_cluster.txt --cluster-name de_integration --study-accession SCPdev --differential-expression

confirm that the script runs successfully and generates the following files:

de_integration--cluster_annot--0--cluster--wilcoxon.tsv
de_integration--cluster_annot--1--cluster--wilcoxon.tsv
de_integration--cluster_annot--2--cluster--wilcoxon.tsv
de_integration--cluster_annot--3--cluster--wilcoxon.tsv
de_integration--cluster_annot--4--cluster--wilcoxon.tsv

This satisfies SCP-4258.

eweitz

Great expansion of the DE prototype!

I note one blocker regarding terminology that seems especially likely to confuse maintainers, and suggest a handful of other, non-blocking maintainability refinements.

eweitz · 2022-04-21T15:35:56Z

tests/test_de.py

-        expected_file = (
-            f"{cluster_name}--{annotation}--{str(sanitized_label)}--{method}.tsv"
-        )
+        expected_file = f"{cluster_name}--{annotation}--{str(sanitized_label)}--{scope}--{method}.tsv"


This str is extraneous (see prior note):

Suggested change

expected_file = f"{cluster_name}--{annotation}--{str(sanitized_label)}--{scope}--{method}.tsv"

expected_file = f"{cluster_name}--{annotation}--{sanitized_label}--{scope}--{method}.tsv"

Oops - had removed it in de.py but overlooked it in the test. Fixed now in e314978.

eweitz · 2022-04-21T15:45:36Z

ingest/de.py

+        annots = np.unique(adata.obs[annotation]).tolist()
+        for annot in annots:
+            annot_label = re.sub(r'\W+', '_', annot)
+            clean_annotation = re.sub(r'\W+', '_', annotation)
+            DifferentialExpression.de_logger.info(
+                f"Writing DE output for {annot_label}"
+            )
+            rank = sc.get.rank_genes_groups_df(adata, key=rank_key, group=annot)


Blocker: "annotation" vs. "annot" is especially confusing. Referring to "annot" as "group" seems clearer, and better aligns with terminology from single_cell_portal_core.

Suggested change

annots = np.unique(adata.obs[annotation]).tolist()

for annot in annots:

annot_label = re.sub(r'\W+', '_', annot)

clean_annotation = re.sub(r'\W+', '_', annotation)

DifferentialExpression.de_logger.info(

f"Writing DE output for {annot_label}"

)

rank = sc.get.rank_genes_groups_df(adata, key=rank_key, group=annot)

groups = np.unique(adata.obs[annotation]).tolist()

for group in groups:

group_label = re.sub(r'\W+', '_', group)

clean_annotation = re.sub(r'\W+', '_', annotation)

DifferentialExpression.de_logger.info(

f"Writing DE output for {group}"

)

rank = sc.get.rank_genes_groups_df(adata, key=rank_key, group=group)

eweitz · 2022-04-21T15:46:39Z

ingest/de.py

+            rank = sc.get.rank_genes_groups_df(adata, key=rank_key, group=annot)

-            out_file = f'{cluster_name}--{annotation}--{group_filename}--{method}.tsv'
+            out_file = f'{cluster_name}--{clean_annotation}--{annot_label}--{annot_scope}--{method}.tsv'


Blocker: see adjacent comment.

Suggested change

out_file = f'{cluster_name}--{clean_annotation}--{annot_label}--{annot_scope}--{method}.tsv'

out_file = f'{cluster_name}--{clean_annotation}--{group}--{annot_scope}--{method}.tsv'

eweitz · 2022-04-21T15:48:02Z

ingest/cli_parser.py

+    parser_differential_expression.add_argument(
+        "--annotation-scope",
+        required=True,
+        help="Scope of annotation file for DE analysis",


Consistent with other --annotation-foo arguments.

Suggested change

help="Scope of annotation file for DE analysis",

help="Scope of annotation for DE analysis",

eweitz · 2022-04-21T15:57:20Z

ingest/cli_parser.py


    parser_differential_expression.add_argument(
-        "--cell-metadata-file",
+        "--annot-metadata-file",


The "metadata" qualifier here seems likelier to confuse than to clarify. "Metadata" strongly connotes "metadata file" in SCP file terminology, and I'm unaware of any other case where we use the phrase "annotation metadata file".

We already use the phrase "annotation file" to mean "cluster file or metadata file" in e.g. annotations.py, so reusing that here (and spelling out the word) seems best for maintainability.

Suggested change

"--annot-metadata-file",

"--annotation-file",

eweitz · 2022-04-21T15:58:37Z

ingest/ingest_pipeline.py

+    if "differential_expression" in arguments:
+        # DE may use metadata or cluster file for annots BUT
+        # IngestPipeline initialization will need a "cell_metadata_file"
+        arguments["cell_metadata_file"] = arguments["annot_metadata_file"]


Per comment re --annot-metadata-file.

Suggested change

arguments["cell_metadata_file"] = arguments["annot_metadata_file"]

arguments["cell_metadata_file"] = arguments["annotation_file"]

eweitz · 2022-04-21T15:58:55Z

ingest/ingest_pipeline.py

+        # DE may use metadata or cluster file for annots BUT
+        # IngestPipeline initialization will need a "cell_metadata_file"
+        arguments["cell_metadata_file"] = arguments["annot_metadata_file"]
+        # IngestPipeline initialiation expects "name" and not "cluster_name"


Suggested change

# IngestPipeline initialiation expects "name" and not "cluster_name"

# IngestPipeline initialization expects "name" and not "cluster_name"

codecov · 2022-04-21T19:43:03Z

Codecov Report

Merging #251 (e314978) into development (00a04a7) will decrease coverage by 0.00%.
The diff coverage is 52.00%.

@@               Coverage Diff               @@
##           development     #251      +/-   ##
===============================================
- Coverage        64.52%   64.52%   -0.01%     
===============================================
  Files               27       27              
  Lines             3572     3586      +14     
===============================================
+ Hits              2305     2314       +9     
- Misses            1267     1272       +5

Impacted Files	Coverage Δ
ingest/ingest_pipeline.py	`59.79% <0.00%> (-0.64%)`	⬇️
ingest/de.py	`83.73% <55.55%> (+0.08%)`	⬆️
ingest/cli_parser.py	`76.53% <75.00%> (-0.07%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 00a04a7...e314978. Read the comment docs.

bistline

Looks good, with two small things:

The help text has an incorrect argument - it lists --annot-metadata-file when it should have --annotation-file
I don't understand the need for the differential_expression argument and --differential-expression kwarg. It seems redundant, but perhaps there's something I'm missing?

bistline · 2022-04-21T20:43:56Z

ingest/cli_parser.py

                f" Invalid argument: unable to connect to a BigQuery table called {parsed_args.bq_table}."
            )
+    if (
+        "differential_expression" in parsed_args


Why do we need this in addition to the --differential-expression kwarg? They seem redundant.

If I only check for the annotation_type kwarg, that argument isn't set for other types of ingest pipeline runs and all other non-DE ingest jobs would fail with:
AttributeError: 'Namespace' object has no attribute 'annotation_type'

Checking for "differential_expression" in parsed_args, ensures I'm only doing this validation for DE jobs.

Maybe I didn't make myself clear - apologies. If we look at the example command line:

python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation-name cluster_annot --annotation-type group --annotation-scope cluster --matrix-file-path ../tests/data/differential_expression/de_integration.tsv --matrix-file-type dense --annotation-file ../tests/data/differential_expression/de_integration_annotated_cluster.txt --cluster-file ../tests/data/differential_expression/de_integration_annotated_cluster.txt --cluster-name de_integration --study-accession SCPdev --differential-expression

There is the positional argument of differential_expression that you call out here, but then also at the end, there is the --differential-expression kwarg. Why do we need both? Looking in cli_parser.py, I see this:

# Differential expression subparsers parser_differential_expression = subparsers.add_parser( "differential_expression", help="Indicates differential expression analysis processing", ) parser_differential_expression.add_argument( "--differential-expression", required=True, action="store_true", help="Indicates that differential expression analysis should be invoked", )

They look to me to be functionally equivalent.

Thanks for chatting, Jon. As we discussed, that paradigm was set up for cluster and cell metadata ingest. I'm following that pattern in DE jobs for consistency and in case there's a quirk about subparsers that I don't understand.

jlchang added 8 commits April 20, 2022 07:29

report dups instead of rejecting or falling back to gene_id

673dc43

clarify outfile name structure

a00b185

include additional cli params for annots

092811b

rename DE 'name' to 'cluster-name' for clarity

c135126

trivial check that input annot is type group

e6783d7

fix error msg typos

ec9eb68

add scope to outfile name; update tests

68c9a60

handle exception for failed DE: annot_label w/ single sample

a6f1c8a

jlchang requested review from bistline, devonbush, ehanna4 and eweitz April 21, 2022 14:42

eweitz requested changes Apr 21, 2022

View reviewed changes

address PR blockers/suggestions

e314978

jlchang requested a review from eweitz April 21, 2022 19:39

eweitz approved these changes Apr 21, 2022

View reviewed changes

bistline approved these changes Apr 21, 2022

View reviewed changes

jlchang merged commit 3baaabf into development Apr 22, 2022

jlchang deleted the jlc_indicate_scope branch April 22, 2022 15:41

	expected_file = f"{cluster_name}--{annotation}--{str(sanitized_label)}--{scope}--{method}.tsv"
	expected_file = f"{cluster_name}--{annotation}--{sanitized_label}--{scope}--{method}.tsv"

	out_file = f'{cluster_name}--{clean_annotation}--{annot_label}--{annot_scope}--{method}.tsv'
	out_file = f'{cluster_name}--{clean_annotation}--{group}--{annot_scope}--{method}.tsv'

	help="Scope of annotation file for DE analysis",
	help="Scope of annotation for DE analysis",

	arguments["cell_metadata_file"] = arguments["annot_metadata_file"]
	arguments["cell_metadata_file"] = arguments["annotation_file"]

	# IngestPipeline initialiation expects "name" and not "cluster_name"
	# IngestPipeline initialization expects "name" and not "cluster_name"

Enable DE on cluster annotations (SCP-4258) #251

Enable DE on cluster annotations (SCP-4258) #251

Uh oh!

Conversation

jlchang commented Apr 21, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eweitz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Apr 21, 2022

Codecov Report

Uh oh!

bistline left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jlchang commented Apr 21, 2022 •

edited

Loading