Fix imprecise name sanitization in DE (SCP-4459) #260
Merged
Sanitization of cluster names, annotation names, and annotation labels for DE output files currently collapses adjacent characters matching [^a-zA-Z0-9_] (anything other than letters, digits, and underscores) into a single underscore. This can cause a file name mismatch, which would prevent display of DE results.
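To illustrate the difference, here is a minimal sketch contrasting the collapsing behavior described above with per-character replacement. The function names `sanitize_collapsing` and `sanitize_per_char` are hypothetical and are not taken from the scp-ingest-pipeline code. Replacing each disallowed character with its own underscore is an assumption about the intended behavior.

```python
import re


def sanitize_collapsing(name: str) -> str:
    """Hypothetical stand-in for the described behavior:
    a run of disallowed characters becomes ONE underscore."""
    return re.sub(r"[^a-zA-Z0-9_]+", "_", name)


def sanitize_per_char(name: str) -> str:
    """Hypothetical stand-in for the assumed fix:
    EACH disallowed character becomes its own underscore."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)


annotation = "misc++cellaneous"
print(sanitize_collapsing(annotation))  # misc_cellaneous
print(sanitize_per_char(annotation))    # misc__cellaneous
```

If another component builds the expected file name with one convention while the DE job writes files with the other, the two names differ wherever two or more disallowed characters are adjacent, which is the kind of mismatch described above.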
Manual test:
Activate the scp-ingest-pipeline repo virtualenv.
Then, from the scripts directory of the scp-ingest-pipeline repo, perform this setup:
Run the DE job from the ingest directory of the scp-ingest-pipeline repo:
python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation-name misc++cellaneous --annotation-type group --annotation-scope study --matrix-file-path ../tests/data/differential_expression/de_dense_matrix.tsv --matrix-file-type dense --annotation-file ../tests/data/differential_expression/de_dense_metadata_sanitize.txt --cluster-file ../tests/data/differential_expression/de_dense_cluster.tsv --cluster-name "UMAP, pre-QC all cells (complexity greater than or equal to 1000)" --study-accession SCPsanitize --differential-expression
Confirm that the job runs successfully and that the output file names contain more than one underscore wherever contiguous characters matching [^a-zA-Z0-9_] occur. (Note: the metadata file has characters matching [^a-zA-Z0-9_] in its annotation labels.)
This PR satisfies SCP-4459.