@jlchang jlchang commented Jun 20, 2022

Sanitization of the cluster name, annotation name and annotation labels for DE output files currently collapses each run of adjacent characters matching [^a-zA-Z0-9_] into a single underscore. This can cause a file name mismatch, which would prevent display of DE results.
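A minimal sketch of the difference, assuming the collapsing behavior is equivalent to substituting on `[^a-zA-Z0-9_]+` and the intended behavior substitutes each disallowed character individually. The function names are hypothetical and are not taken from the repo.

```python
import re

def sanitize_collapsing(name: str) -> str:
    # Assumed current behavior: a run of disallowed characters becomes ONE underscore.
    return re.sub(r"[^a-zA-Z0-9_]+", "_", name)

def sanitize_per_char(name: str) -> str:
    # Assumed intended behavior: EACH disallowed character becomes its own underscore.
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)

annotation = "misc++cellaneous"
print(sanitize_collapsing(annotation))  # misc_cellaneous
print(sanitize_per_char(annotation))    # misc__cellaneous
```

The two results differ by one underscore, which is enough for a lookup by file name to miss.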

Manual test:

Activate the scp-ingest-pipeline repo virtualenv,
then, from the scripts directory of the scp-ingest-pipeline repo, perform this setup:

source ../scripts/setup_mongo_dev.sh
unset BARD_HOST_URL

Run the DE job from the ingest directory of the scp-ingest-pipeline repo:

python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation-name misc++cellaneous --annotation-type group --annotation-scope study --matrix-file-path ../tests/data/differential_expression/de_dense_matrix.tsv --matrix-file-type dense --annotation-file ../tests/data/differential_expression/de_dense_metadata_sanitize.txt --cluster-file ../tests/data/differential_expression/de_dense_cluster.tsv --cluster-name "UMAP, pre-QC all cells (complexity greater than or equal to 1000)" --study-accession SCPsanitize --differential-expression

Confirm that the job runs successfully and that the output file names contain one underscore per character wherever contiguous characters matching [^a-zA-Z0-9_] occur. (Note: the metadata file has characters matching [^a-zA-Z0-9_] in the annotation labels.)

UMAP__pre_QC_all_cells__complexity_greater_than_or_equal_to_1000_--misc__cellaneous--cholinergic__neuron_--study--wilcoxon.tsv
UMAP__pre_QC_all_cells__complexity_greater_than_or_equal_to_1000_--misc__cellaneous--cranial__somatomotor__neuron--study--wilcoxon.tsv
UMAP__pre_QC_all_cells__complexity_greater_than_or_equal_to_1000_--misc__cellaneous--pyramidal__neuron--study--wilcoxon.tsv
UMAP__pre_QC_all_cells__complexity_greater_than_or_equal_to_1000_--misc__cellaneous--somatomotor_neuron_--study--wilcoxon.tsv
UMAP__pre_QC_all_cells__complexity_greater_than_or_equal_to_1000_--misc__cellaneous--sympathetic__cholinergic_neuron--study--wilcoxon.tsv
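To check names like the ones listed above, the expected file name can be computed from the command-line arguments. This is a sketch only: it assumes the name is the sanitized cluster name, annotation name and label, followed by the scope and method, joined by `--`, as the listed files suggest. `expected_de_filename` is a hypothetical helper, and the raw label `cholinergic (neuron)` is a guess at what the metadata file contains, not a value confirmed by this PR.

```python
import re

def sanitize(part: str) -> str:
    # Replace each character outside [a-zA-Z0-9_] with its own underscore.
    return re.sub(r"[^a-zA-Z0-9_]", "_", part)

def expected_de_filename(cluster, annotation, label, scope, method="wilcoxon"):
    # Hypothetical helper: mirrors the pattern seen in the listed output files.
    parts = [sanitize(cluster), sanitize(annotation), sanitize(label), scope, method]
    return "--".join(parts) + ".tsv"

name = expected_de_filename(
    "UMAP, pre-QC all cells (complexity greater than or equal to 1000)",
    "misc++cellaneous",
    "cholinergic (neuron)",  # assumed raw label, for illustration only
    "study",
)
print(name)
```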

This PR satisfies SCP-4459.

@jlchang jlchang requested review from bistline, ehanna4 and eweitz June 20, 2022 23:07
@ehanna4 ehanna4 left a comment

I don't have this repo locally yet, but I used https://regex101.com/ to test the regexes, and the change does what the description says it should, so I'm confident this fix will work in the wild. 👍

@jlchang jlchang merged commit 852d688 into development Jun 21, 2022
@jlchang jlchang deleted the jlc_fix_name_sanitization branch June 21, 2022 14:24