Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Follow up to PR #418 #430

Open
wants to merge 34 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
34 commits
Select commit Hold shift + click to select a range
e09dc65
UPDATES TO THE Hifiasm pipeline:
SHuang-Broad May 8, 2023
d3afc4b
For both CCS/ONT, update PBSV
SHuang-Broad May 10, 2023
3a2ac5c
For both CCS/ONT, update Sniffles-2
SHuang-Broad May 10, 2023
29aa964
CRITICAL: updating pbindex to that from the SMRTLink 12 release
SHuang-Broad May 12, 2023
0262525
MAJOR REFACTOR: UNIFY CCS/ONT WGS PIPELINE
SHuang-Broad May 11, 2023
7be9309
MISC UPDATES TO SEVERAL UTILS TASKS
SHuang-Broad Nov 26, 2023
1ff0912
New docker that's intended to replace lr-basic:
SHuang-Broad Aug 11, 2023
4224ab4
New docker that offers a new mode remove duplicates from a queryname-…
SHuang-Broad Dec 1, 2023
a13a541
New docker that updates the samtools in GATK to the latest (1.18)
SHuang-Broad Nov 30, 2023
2905c69
significantly boost capabilities of BAMutils.wdl
SHuang-Broad Nov 28, 2023
b16c619
update some legacy code due to new interfaces in updated BAMutils.wdl
SHuang-Broad Dec 1, 2023
c96ea8e
New workflows
SHuang-Broad Dec 1, 2023
39d77d2
Improve various existing codes
SHuang-Broad Dec 11, 2023
cbfed4e
New workflow to save files (avoid multiple calls to Finalize)
SHuang-Broad Dec 10, 2023
8039c88
New workflow to unify QC and metrics collection on aligned BAM
SHuang-Broad Dec 15, 2023
b16155f
Update grouping of pipelines in .dockstore.yml
SHuang-Broad Dec 19, 2023
5bf0b2f
update WDL validation bash script
SHuang-Broad Dec 22, 2023
0d61935
DEPRECATION
SHuang-Broad Dec 1, 2023
989ead1
Fix bug in deduplicating aligned ONT BAM
SHuang-Broad Dec 4, 2023
b8e67ff
DEPRECATION:
SHuang-Broad Jan 2, 2024
e5a79c2
Update utility code:
SHuang-Broad Jan 3, 2024
b5dc978
a few tweaks to to AlignedBamQCandMetrics:
SHuang-Broad Jan 12, 2024
1abcc7a
correct two mistakes: 1. a typo; 2. should append not replace
SHuang-Broad Jan 18, 2024
da5bfb0
make read-length util tasks pre-emptible
SHuang-Broad Jan 22, 2024
7975d6d
make fingerprint reads extraction more efficient/resilient
SHuang-Broad Jan 23, 2024
5620d74
fix a stupid logical typo error (in saving files)
SHuang-Broad Jan 23, 2024
e910bac
option to force run fingerprintcheck on small bams
SHuang-Broad Jan 31, 2024
04b82f8
make pbindex task more frugal
SHuang-Broad Jan 23, 2024
14be4e2
Safer and more efficient way to do targetted pileup conversion
SHuang-Broad Feb 25, 2024
136091a
Update Pacbio alignment task
SHuang-Broad Jan 19, 2024
d95be34
New subworkflow to align PacBio uBAM
SHuang-Broad Dec 23, 2023
1e26de7
Two new workflows:
SHuang-Broad Dec 27, 2023
44b7330
bug fix for bam2fastq
SHuang-Broad Mar 7, 2024
526765e
optimize alignment further
SHuang-Broad Jun 25, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
131 changes: 97 additions & 34 deletions .dockstore.yml
Original file line number Diff line number Diff line change
@@ -1,17 +1,11 @@
version: 1.2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed this. Going to assume it's OK.

workflows:

###################################################
# deprecated
- name: ONT10x
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/ONT10x.wdl
- name: ONTWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/VariantCalling/ONTWholeGenome.wdl
- name: ONTFlowcell
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTFlowcell.wdl
- name: PBFlowcell
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Alignment/PBFlowcell.wdl
- name: PBCCS
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/PBCCS.wdl
Expand All @@ -24,21 +18,24 @@ workflows:
- name: PBCCSDemultiplexWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/PBCCSDemultiplexWholeGenome.wdl
- name: PBCCSIsoSeq
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Utility/PBCCSIsoSeq.wdl
- name: PBCCSWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/VariantCalling/PBCCSWholeGenome.wdl
- name: PBCLRDemultiplexWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/PBCLRDemultiplexWholeGenome.wdl
- name: PBCLRWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/PBCLRWholeGenome.wdl
- name: LRCNVs
- name: DownloadFromHudsonAlpha
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/VariantCalling/LRCNVs.wdl
primaryDescriptorPath: /wdl/deprecated/DownloadFromHudsonAlpha.wdl

###################################################
# ONT
- name: ONTWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/VariantCalling/ONTWholeGenome.wdl
- name: ONTFlowcell
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTFlowcell.wdl
- name: ONTBasecall
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTBasecall.wdl
Expand All @@ -48,15 +45,6 @@ workflows:
- name: ONTAssembleWithFlye
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Assembly/ONTAssembleWithFlye.wdl
- name: VerifyFingerprint
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/VerifyFingerprint.wdl
- name: DownloadFromHudsonAlpha
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/DownloadFromHudsonAlpha.wdl
- name: PBAssembleWithHifiasm
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Assembly/PBAssembleWithHifiasm.wdl
- name: ONTMethylation
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Epigenomics/ONTMethylation.wdl
Expand All @@ -66,12 +54,48 @@ workflows:
- name: ONTPfTypeDrugResistanceMarkers
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/MultiAnalysis/ONTPfTypeDrugResistanceMarkers.wdl
- name: ONTProcessBasecall
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTProcessBasecall.wdl
- name: ONTFlowcellFromMultipleBasecalls
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTFlowcellFromMultipleBasecalls.wdl

###################################################
# PacBio
- name: PBFlowcell
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Alignment/PBFlowcell.wdl
- name: PBCCSIsoSeq
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Utility/PBCCSIsoSeq.wdl
- name: PBCCSWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/VariantCalling/PBCCSWholeGenome.wdl
- name: PBAssembleWithHifiasm
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Assembly/PBAssembleWithHifiasm.wdl
- name: PBMASIsoSeqQuantify
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Utility/PBMASIsoSeqQuantify.wdl
- name: PBMASIsoSeqDemultiplex
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Utility/PBMASIsoSeqDemultiplex.wdl
- name: ProcessOnInstrumentDemuxedChunkRefFree
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Utility/ProcessOnInstrumentDemuxedChunkRefFree.wdl
- name: ProcessOnInstrumentDemuxedChunk
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Utility/ProcessOnInstrumentDemuxedChunk.wdl

###################################################
# TechAgnostic - *mics data processing
- name: CallVariantsReadBased
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/VariantCalling/CallVariantsReadBased.wdl
- name: LRCNVs
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/VariantCalling/LRCNVs.wdl
- name: LRJointCallGVCFs
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/VariantCalling/LRJointCallGVCFs.wdl
Expand All @@ -81,18 +105,57 @@ workflows:
- name: LRConvertBCF
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/LRConvertBCF.wdl
- name: DownloadFromSRA

###################################################
# TechAgnostic - *mics data QC & metrics
- name: ShardWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/DownloadFromSRA.wdl
- name: DownloadFromWeb
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/ShardWholeGenome.wdl
- name: MergeSampleBamsAndCollectMetrics
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/DownloadFromWeb.wdl
- name: ONTProcessBasecall
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/MergeSampleBamsAndCollectMetrics.wdl
- name: AlignedBamQCandMetrics
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTProcessBasecall.wdl
- name: ONTFlowcellFromMultipleBasecalls
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/AlignedBamQCandMetrics.wdl
- name: CollectBamFlagStats
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTFlowcellFromMultipleBasecalls.wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/CollectBamFlagStats.wdl
- name: CountTheBeans
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/CountTheBeans.wdl
- name: DystPeaker
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/DystPeaker.wdl
- name: FASTQstats
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/FASTQstats.wdl
- name: FilterBamByLength
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/FilterBamByLength.wdl
- name: LongReadsContaminationEstimation
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/LongReadsContaminationEstimation.wdl
- name: SexCheckNaive
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/SexCheckNaive.wdl
- name: VerifyFingerprint
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/VerifyFingerprint.wdl
- name: VerifyBamFingerprint
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/VerifyBamFingerprint.wdl

###################################################
# TechAgnostic - utility
- name: CleanupIntermediate
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/CleanupIntermediate.wdl
- name: DownloadFromSRA
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/DownloadFromSRA.wdl
- name: DownloadFromWeb
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/DownloadFromWeb.wdl
- name: SaveFilesToDestination
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/SaveFilesToDestination.wdl
73 changes: 73 additions & 0 deletions docker/lr-bam-dedup/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
FROM python:3.9.16-slim-bullseye

COPY remove_duplicate_ont_aln.py /opt/
COPY remove_duplicate_ont_namesorted_unaligned.py /opt/

RUN pip install pysam==0.21.0

# install gcloud and gsutil cli
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get -qqy update --fix-missing && \
apt-get -qqy dist-upgrade && \
apt-get -qqy install --no-install-recommends \
apt-transport-https \
ca-certificates \
gnupg \
zlib1g-dev \
curl \
wget \
tree \
tabix && \
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
apt-get -qqy update && \
apt-get -qqy install --no-install-recommends google-cloud-cli && \
gcloud config set core/disable_usage_reporting true && \
gcloud config set component_manager/disable_update_check true && \
gcloud config set metrics/environment github_docker_image && \
apt-get -qqy purge gnupg && \
apt-get -qqy clean && \
rm -rf /tmp/* \
/var/tmp/* \
/var/cache/apt/* \
/var/lib/apt/lists/* \
/usr/share/man/?? \
/usr/share/man/??_*

# install latest samtools
ARG DEBIAN_FRONTEND=noninteractive
ARG SAMTOOLS_VERSION=1.18
ARG BCFTOOLS_VERSION=1.18
RUN apt-get -qqy update --fix-missing && \
apt-get -qqy dist-upgrade && \
apt-get -qqy install --no-install-recommends \
ca-certificates \
libbz2-dev \
libcurl4-openssl-dev \
liblzma-dev \
libncurses5-dev \
autoconf \
automake \
bzip2 \
gcc \
make \
wget \
zlib1g-dev && \
wget https://github.com/samtools/samtools/releases/download/${SAMTOOLS_VERSION}/samtools-${SAMTOOLS_VERSION}.tar.bz2 && \
tar xjf samtools-${SAMTOOLS_VERSION}.tar.bz2 && \
cd samtools-${SAMTOOLS_VERSION} && ./configure --without-curses --enable-libcurl && make -s all all-htslib && make install install-htslib && cd - && \
rm -rf samtools-${SAMTOOLS_VERSION}* && \
wget https://github.com/samtools/bcftools/releases/download/${BCFTOOLS_VERSION}/bcftools-${BCFTOOLS_VERSION}.tar.bz2 && \
tar xjf bcftools-${BCFTOOLS_VERSION}.tar.bz2 && \
cd bcftools-${BCFTOOLS_VERSION} && ./configure --without-curses && make -s && make install && cd - && \
rm -rf bcftools-${BCFTOOLS_VERSION}* && \
apt-get -qqy purge autoconf automake bzip2 gcc make wget && \
apt-get -qqy clean && \
rm -rf /tmp/* \
/var/tmp/* \
/var/cache/apt/* \
/var/lib/apt/lists/* \
/usr/share/man/?? \
/usr/share/man/??_* && \
samtools --help && \
bcftools --help
12 changes: 12 additions & 0 deletions docker/lr-bam-dedup/Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
VERSION = 0.1.2
TAG1 = us.gcr.io/broad-dsp-lrma/lr-bam-dedup:$(VERSION)
TAG2 = us.gcr.io/broad-dsp-lrma/lr-bam-dedup:latest

all: build push

build:
docker build -t $(TAG1) -t $(TAG2) .

push:
docker push $(TAG1)
docker push $(TAG2)
76 changes: 76 additions & 0 deletions docker/lr-bam-dedup/remove_duplicate_ont_aln.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
import argparse
import pysam


def main():
parser = argparse.ArgumentParser(description='Remove redundant alignment records from ONT BAM file',
prog='remove_redundant_reads')
parser.add_argument('-p', '--prefix', type=str, default="shard", help="Output prefix")
parser.add_argument('-a', '--annotations', type=str, help="Annotations on (potential) duplicate reads")

parser.add_argument('bam', type=str, help="BAM")
args = parser.parse_args()

# create a dict of set's, a trick to avoid Hash collisions
guilty_dict_per_chr = dict()
with open(args.annotations) as f:
for line in f:
arr = line.strip().split('\t')
name = arr[0]
chrom = arr[2]
guilty_dict_per_chr.setdefault(chrom, set())
guilty_dict_per_chr[chrom].add(name)

print("chromosomes on which there are duplicate records:")
print(f" {guilty_dict_per_chr}")

# Silence message about the .bai file not being found.
pysam.set_verbosity(0)

num_alignments, num_dropped_alignments = 0, 0
duplicate_signatures = []
bf = pysam.Samfile(args.bam, 'rb', check_sq=False)
with pysam.Samfile(f'{args.prefix}.bam', 'wb', header=bf.header) as out:
# we rely on the observation that for coordinate sorted BAM,
# duplicate records will appear in blocks, hence once we step off a position with duplicates, we start afresh
current_position = -1
current_signatures = set()
for read in bf:
num_alignments += 1

chrom = read.reference_name
n = read.query_name
if chrom in guilty_dict_per_chr and n in guilty_dict_per_chr[chrom]:

mq = read.mapping_quality
sam_flag = read.flag
pos = read.reference_start
cigar = read.cigarstring
signature = f"{n}-{chrom}-{pos}-{mq}-{sam_flag}-{cigar}"

if current_position != pos: # new position, let's write and reset
out.write(read)
current_position = pos
current_signatures = set()
current_signatures.add(signature)
elif signature in current_signatures: # You're a duplicate record, and not appearing for the 1st time!
num_dropped_alignments += 1
duplicate_signatures.append(signature) # same signature may appear more than twice, hence list and append
pass
else: # you are in a new group of duplicates that map to this location
out.write(read)
current_signatures.add(signature)
else:
out.write(read)

print(f'num_alignments: {num_alignments}')
print(f'num_dropped_alignments: {num_dropped_alignments}')
print(f'num_kept_alignments: {num_alignments - num_dropped_alignments}')

with open(f'{args.prefix}.duplicate.signatures.txt', 'w') as out:
for sig in duplicate_signatures:
out.write(f"{sig}\n")


if __name__ == "__main__":
main()
48 changes: 48 additions & 0 deletions docker/lr-bam-dedup/remove_duplicate_ont_namesorted_unaligned.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
import argparse
import pysam


def main():
parser = argparse.ArgumentParser(description='Remove redundant reads from renamed-sorted ONT BAM file',
prog='remove_redundant_reads')
parser.add_argument('-p', '--prefix', type=str, default="shard", help="Output prefix")
parser.add_argument('-q', '--qnames', type=str, help="Read names of duplicate records")

parser.add_argument('bam', type=str, help="BAM")
args = parser.parse_args()

# Silence message about the .bai file not being found.
pysam.set_verbosity(0)
bf = pysam.Samfile(args.bam, 'rb', check_sq=False)

num_records, num_dropped_records = 0, 0
duplicate_record_names = list()

with pysam.Samfile(f'{args.prefix}.bam', 'wb', header=bf.header) as out:

# we rely on the observation that for queryname sorted, unaligned BAM,
# if two neighboring records have the same query name, then they must be duplicate of each other
current_qm = ''

for read in bf:
num_records += 1

n = read.query_name
if n == current_qm:
duplicate_record_names.append(n)
num_dropped_records += 1
else:
current_qm = n
out.write(read)

print(f'num_records: {num_records}')
print(f'num_dropped_records: {num_dropped_records}')
print(f'num_kept_alignments: {num_records - num_dropped_records}')

with open(args.qnames, 'w') as outf:
for qn in duplicate_record_names:
outf.write(f'{qn}\n')


if __name__ == "__main__":
main()
Loading
Loading