Skip to content

Conversation

@jlchang
Copy link
Contributor

@jlchang jlchang commented Apr 21, 2022

For current public, initialized studies with raw counts, 19/97 studies use cluster-specific annotations as their default annotation. To make DE available for the default explore tab view for ~20% of available studies for DE, extend DE process to incorporate cluster file annotations for DE analysis.

Investigated combining metadata and cluster annotations into same AnnData object. Scanpy’s rank_genes_groups expects annotation for DE to be in observations dataframe. Studies like SCP1415 use the same name for annotations in BOTH metadata AND cluster files - potential name collision when combining both sets of metadata ("study-wide" from metadata file and "cluster-specific" from cluster file) into the same dataframe.

Consulted with Devon and decided reasonable workaround is to run two separate jobs and keep the annotations separate. This update to DE requires --annotation-type and --annotation-scope as inputs and the annotation scope is reflected in the output file name ({cluster_name}--{clean_annotation}--{annot_label}--{annot_scope}--{method}.tsv)

To test, run DE using a cluster file with annotations as the "--annot-metadata-file":

activate the scp-ingest-pipeline repo virtualenv
then from the scripts directory of the scp-ingest-pipeline repo, perform this setup:

source ../scripts/setup_mongo_dev.sh <path to your Github token file>
unset BARD_HOST_URL

Run the DE job from the ingest directory of the scp-ingest-pipeline repo:
python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation-name cluster_annot --annotation-type group --annotation-scope cluster --matrix-file-path ../tests/data/differential_expression/de_integration.tsv --matrix-file-type dense --annotation-file ../tests/data/differential_expression/de_integration_annotated_cluster.txt --cluster-file ../tests/data/differential_expression/de_integration_annotated_cluster.txt --cluster-name de_integration --study-accession SCPdev --differential-expression

confirm that the script runs successfully and generates the following files:

de_integration--cluster_annot--0--cluster--wilcoxon.tsv
de_integration--cluster_annot--1--cluster--wilcoxon.tsv
de_integration--cluster_annot--2--cluster--wilcoxon.tsv
de_integration--cluster_annot--3--cluster--wilcoxon.tsv
de_integration--cluster_annot--4--cluster--wilcoxon.tsv

This satisfies SCP-4258.

Copy link
Member

@eweitz eweitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great expansion of the DE prototype!

I note one blocker regarding terminology that seems especially likely to confuse maintainers, and suggest a handful of other, non-blocking maintainability refinements.

tests/test_de.py Outdated
expected_file = (
f"{cluster_name}--{annotation}--{str(sanitized_label)}--{method}.tsv"
)
expected_file = f"{cluster_name}--{annotation}--{str(sanitized_label)}--{scope}--{method}.tsv"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This str is extraneous (see prior note):

Suggested change
expected_file = f"{cluster_name}--{annotation}--{str(sanitized_label)}--{scope}--{method}.tsv"
expected_file = f"{cluster_name}--{annotation}--{sanitized_label}--{scope}--{method}.tsv"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops - had removed it in de.py but overlooked it in the test. Fixed now in e314978.

ingest/de.py Outdated
Comment on lines 331 to 338
annots = np.unique(adata.obs[annotation]).tolist()
for annot in annots:
annot_label = re.sub(r'\W+', '_', annot)
clean_annotation = re.sub(r'\W+', '_', annotation)
DifferentialExpression.de_logger.info(
f"Writing DE output for {annot_label}"
)
rank = sc.get.rank_genes_groups_df(adata, key=rank_key, group=annot)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker: "annotation" vs. "annot" is especially confusing. Referring to "annot" as "group" seems clearer, and better aligns with terminology from single_cell_portal_core.

Suggested change
annots = np.unique(adata.obs[annotation]).tolist()
for annot in annots:
annot_label = re.sub(r'\W+', '_', annot)
clean_annotation = re.sub(r'\W+', '_', annotation)
DifferentialExpression.de_logger.info(
f"Writing DE output for {annot_label}"
)
rank = sc.get.rank_genes_groups_df(adata, key=rank_key, group=annot)
groups = np.unique(adata.obs[annotation]).tolist()
for group in groups:
group_label = re.sub(r'\W+', '_', group)
clean_annotation = re.sub(r'\W+', '_', annotation)
DifferentialExpression.de_logger.info(
f"Writing DE output for {group}"
)
rank = sc.get.rank_genes_groups_df(adata, key=rank_key, group=group)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e314978

ingest/de.py Outdated
rank = sc.get.rank_genes_groups_df(adata, key=rank_key, group=annot)

out_file = f'{cluster_name}--{annotation}--{group_filename}--{method}.tsv'
out_file = f'{cluster_name}--{clean_annotation}--{annot_label}--{annot_scope}--{method}.tsv'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker: see adjacent comment.

Suggested change
out_file = f'{cluster_name}--{clean_annotation}--{annot_label}--{annot_scope}--{method}.tsv'
out_file = f'{cluster_name}--{clean_annotation}--{group}--{annot_scope}--{method}.tsv'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e314978

parser_differential_expression.add_argument(
"--annotation-scope",
required=True,
help="Scope of annotation file for DE analysis",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistent with other --annotation-foo arguments.

Suggested change
help="Scope of annotation file for DE analysis",
help="Scope of annotation for DE analysis",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e314978


parser_differential_expression.add_argument(
"--cell-metadata-file",
"--annot-metadata-file",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "metadata" qualifier here seems likelier to confuse than to clarify. "Metadata" strongly connotes "metadata file" in SCP file terminology, and I'm unaware of any other case where we use the phrase "annotation metadata file".

We already use the phrase "annotation file" to mean "cluster file or metadata file" in e.g. annotations.py, so reusing that here (and spelling out the word) seems best for maintainability.

Suggested change
"--annot-metadata-file",
"--annotation-file",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e314978

if "differential_expression" in arguments:
# DE may use metadata or cluster file for annots BUT
# IngestPipeline initialization will need a "cell_metadata_file"
arguments["cell_metadata_file"] = arguments["annot_metadata_file"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per comment re --annot-metadata-file.

Suggested change
arguments["cell_metadata_file"] = arguments["annot_metadata_file"]
arguments["cell_metadata_file"] = arguments["annotation_file"]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e314978

# DE may use metadata or cluster file for annots BUT
# IngestPipeline initialization will need a "cell_metadata_file"
arguments["cell_metadata_file"] = arguments["annot_metadata_file"]
# IngestPipeline initialiation expects "name" and not "cluster_name"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# IngestPipeline initialiation expects "name" and not "cluster_name"
# IngestPipeline initialization expects "name" and not "cluster_name"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e314978

@jlchang jlchang requested a review from eweitz April 21, 2022 19:39
@codecov
Copy link

codecov bot commented Apr 21, 2022

Codecov Report

Merging #251 (e314978) into development (00a04a7) will decrease coverage by 0.00%.
The diff coverage is 52.00%.

@@               Coverage Diff               @@
##           development     #251      +/-   ##
===============================================
- Coverage        64.52%   64.52%   -0.01%     
===============================================
  Files               27       27              
  Lines             3572     3586      +14     
===============================================
+ Hits              2305     2314       +9     
- Misses            1267     1272       +5     
Impacted Files Coverage Δ
ingest/ingest_pipeline.py 59.79% <0.00%> (-0.64%) ⬇️
ingest/de.py 83.73% <55.55%> (+0.08%) ⬆️
ingest/cli_parser.py 76.53% <75.00%> (-0.07%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 00a04a7...e314978. Read the comment docs.

Copy link
Contributor

@bistline bistline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, with two small things:

  1. The help text has an incorrect argument - it lists --annot-metadata-file when it should have --annotation-file
  2. I don't understand the need for the differential_expression argument and --differential-expression kwarg. It seems redundant, but perhaps there's something I'm missing?

f" Invalid argument: unable to connect to a BigQuery table called {parsed_args.bq_table}."
)
if (
"differential_expression" in parsed_args
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this in addition to the --differential-expression kwarg? They seem redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I only check for the annotation_type kwarg, that argument isn't set for other types of ingest pipeline runs and all other non-DE ingest jobs would fail with:
AttributeError: 'Namespace' object has no attribute 'annotation_type'

Checking for "differential_expression" in parsed_args, ensures I'm only doing this validation for DE jobs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I didn't make myself clear - apologies. If we look at the example command line:

python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation-name cluster_annot --annotation-type group --annotation-scope cluster --matrix-file-path ../tests/data/differential_expression/de_integration.tsv --matrix-file-type dense --annotation-file ../tests/data/differential_expression/de_integration_annotated_cluster.txt --cluster-file ../tests/data/differential_expression/de_integration_annotated_cluster.txt --cluster-name de_integration --study-accession SCPdev --differential-expression

There is the positional argument of differential_expression that you call out here, but then also at the end, there is the --differential-expression kwarg. Why do we need both? Looking in cli_parser.py, I see this:

# Differential expression subparsers
parser_differential_expression = subparsers.add_parser(
    "differential_expression",
    help="Indicates differential expression analysis processing",
)

parser_differential_expression.add_argument(
    "--differential-expression",
    required=True,
    action="store_true",
    help="Indicates that differential expression analysis should be invoked",
)

They look to me to be functionally equivalent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for chatting, Jon. As we discussed, that paradigm was set up for cluster and cell metadata ingest. I'm following that pattern in DE jobs for consistency and in case there's a quirk about subparsers that I don't understand.

@jlchang jlchang merged commit 3baaabf into development Apr 22, 2022
@jlchang jlchang deleted the jlc_indicate_scope branch April 22, 2022 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants