Assemblytics bed to HACk bed #19

agolicz · 2021-10-01T14:41:18Z

Hi,
I have a .bed file produced by Assemblytics.

The coordinates are as follows:
reference ref_start ref_stop type
chrC01 16417864 16417924 Deletion
chrC01 883429 883429 Insertion

As it is a .bed file I assume coordinates are 0-based. I am wondering how to convert Insertion coordinates to HACk BED, were for insertions column 2 is supposed to be: breakpoint-1 and column 3: breakpoint.

Should a corresponding HACk BED be?:

chrC01 16417864 16417924 deletion
chrC01 883428 883429 insertion

Thanks!

Agnieszka

Here is a further example of Assemblytics bed: http://assemblytics.com/analysis.php?code=arabidopsis

davidebolo1993 · 2021-10-01T14:48:12Z

Hi @agolicz,

yes, in standard BED start coordinates are 0-based (end coordinates are not). For your example (assuming this is a "test.bed" file) from Assemblytics, something like the line below should work:

awk 'OFS=FS="\t"''{if($4 == "Insertion") print $1, $3-1, $3, $4; else print $0}' test.bed

Let me know if I can help further,

Davide

agolicz · 2021-10-01T15:04:54Z

Beautiful! Many thanks. One more question do you have a recommended tool to convert HACk bed to vcf?

davidebolo1993 · 2021-10-01T15:12:15Z

Hi @agolicz,

not a tool, but this script here can convert a HACk BED with standard SVs (DEL,DUP,INV,INS,BND) into a VCF that can be in turn used with downstream softwares like truvari for benchmarking. Usage is something like:

python converter.py <input.reference.fasta> <input.hack.bed> <output.hack.vcf>

Hope this helps,

Davide

agolicz · 2021-10-01T15:16:24Z

Perfect, exactly what I was after :).

Agnieszka

agolicz · 2021-10-06T08:43:31Z

Hi again,
I am trying to convert the SHORtS output to fastq (we want to use different mappers), but something a bit strange seems to be happening.

First I tried samtools fastq, but most of the reads were discarded as singletons.
The I tried bedtools bamtofastq but it complains (for some reads) that reads are marked as paired but there is no mate in the bam file.

Commands used:
VISOR SHORtS -g Express617_v1.chrs.fa -s haplo -b shorts.laser.bed -o shorts.10x.out --threads 8 --coverage 10 --length 100 > shorts.10x.stdout 2>shorts.10x.err
Finishes without isses. err file is empty
tail -n 2 shorts.10x.stdout
[05/10/2021 10:39:53][Message] Merging simulated BAM
[05/10/2021 10:44:15][Message] Done

Attempted extraction with samtools:
samtools view shorts.10x.bam | wc -l
75897493
samtools fastq -1 10x.shorts.R1.fq -2 10x.shorts.R2.fq -0 /dev/null -s /dev/null -n shorts.10x.bam
[M::bam2fq_mainloop] discarded 75846243 singletons
[M::bam2fq_mainloop] processed 75849457 reads

Attempted extraction with bedtools:
samtools sort -n -o shorts.10x.qsort.bam shorts.10x.bam
bamToFastq -i shorts.10x.qsort.bam -fq shorts.10x.R1.fq -fq2 shorts.10x.R2.fq
*****WARNING: Query c1h1fh_chrA01_1_29974771|2319701|2320211|3:0:0|5:0:0|7a57b is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.

Indeed when I just grep for the read name it occurs in the bam file only once. One would expect it twice for paired reads.
view shorts.10x.bam | grep "c1h1fh_chrA01_1_29974771|2319701|2320211|3:0:0|5:0:0|7a57b"
c1h1fh_chrA01_1_29974771|2319701|2320211|3:0:0|5:0:0|7a57b 73 chrA01 2332309 2 100M = 2332309 0 GATGGGTTACCAGATAAAGTAAAAAGTTTATTTGTAATTTCCAGATATGTTTAGATGGAGAGAGAGAGAGTGTGTGTATCTCTGTCAAGGGCGTAATTGG 2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 NM:i:3 ms:i:170 AS:i:170 nn:i:0 tp:A:P cm:i:2 s1:i:27 s2:i:0 de:f:0.03 MD:Z:21T32T25G19 rl:i:0 RG:Z:illumina

I also noticed that the reads which have no mate in bam have the same mapping location for both R1 and R2. I assume this is expected behavior, but now sure what it signifies.

Agnieszka

davidebolo1993 · 2021-10-06T09:00:47Z

Hi @agolicz,

I think this is an issue related to the names used by VISOR in read headers, which are modified to facilitate downstream benchmarking (and debugging, I must say). For the alignment step I think this is not a big issue, because of pairs being read in parallel from 2 (temporary) FASTQ files, but for re-building the initial FASTQ files I guess this can be problematic. If you can share your data, I can try to replicate the issue and dig a bit better into this one as soon as I have time.

Thanks,

Davide

agolicz · 2021-10-06T09:15:27Z

Thanks, I'll share the files later today.
Do you think it would possible to have an option dump the simulated reads directly to fastq and have fastq along the bam. That would save the hassle of rebuilding...

If you are too busy now, I could modify my local copy and test if you could point me to the part of the code which should be changed.

davidebolo1993 · 2021-10-06T09:24:06Z

Yes, that' the idea, basically adding a --fastq flag that stores the fastq in the output folder.
I think I can have a look into this in the next few days but feel free to test your ideas and get back with suggestions.

Thanks,

Davide

davidebolo1993 · 2021-10-06T11:19:44Z

Hi @agolicz,

attached a new version of VISOR.py and SHORtS.py. Can you give these a try?
You need to un-tar the file attached and replace VISOR.py and SHORtS.py in your VISOR folder with the copies attached before re-running setup.py. SHORtS (and LASeR, same stuff, but will be added in the official release) now accepts a --fastq flag that dumps fastq files (r1.fq and r2.fq) to the output folder without changing the previous simulation model. Hopefully, these can be further re-used for testing other aligners. It works on a small test, but maybe needs some refining.

Thanks,

Davide
newvers.tar.gz

agolicz · 2021-10-06T11:47:44Z

Thanks heaps. Testing now. I will let you know how it goes.

In the meantime I modified the BulkSim function to accept fastq file locations as two additional arguments and then dumped the reads there, but this is much nicer :)

bedtools bamtofastq seemed to have worked on LASeR bams without issues, so hopefully those should be fine.

agolicz · 2021-10-06T12:36:45Z

Seems to have worked fine!

I have one last question regarding coverages (genome size: 764629779 bp).
For SHORtS coverage is exactly as I would expect it:
For --coverage 10
wc -l r1.fq
152696444 r1.fq

float(152696444)/42100/764629779
9.984991965634652

Foe LASeR in the --covarege 5 simulation I got
cat laser.5x.fq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c

float(10376507526)/764629779
13.570629618389477

Is the coverage calculation different for long reads (not total bp/genome size)?
I only noticed because LASeR fastq files are much bigger than SHORts.

Commands to get laser.5x.fq
VISOR LASeR -g Express617_v1.chrs.fa -s haplo -b shorts.laser.bed -o laser.5x.out --threads 8 --coverage 5 > laser.5x.stdout 2>laser.5x.err
bamToFastq -i laser.5x.bam -fq laser.5x.fq > btools.laser.5x.out 2>btools.laser.5x.err

Agnieszka

davidebolo1993 · 2021-10-06T12:44:16Z

Is(are) the contig(s) used for simulation same size of the reference genome? I've never seen such an issue with LASeR (more fluctuations than those observed with short reads though) and I'm honestly using this quite frequently.

Thanks,

Davide

agolicz · 2021-10-06T12:51:46Z

Yep.
The reference genome is Express617_v1.chrs.fa.
Same that was used for HACk

VISOR HACk -g Express617_v1.chrs.fa -b Assemblytics.INS.DEL.visor.uniq.no.chrs.bed -o haplo > HACk.log
cut -f1,2 haplo/h1.fa.fai > haplochroms.dim.tsv
awk 'OFS=FS="\t"''{print $1, "1", $2, "100.0", "100.0"}' haplochroms.dim.tsv > shorts.laser.bed

Total length Express617_v1.chrs.fa: 764629779

The simulated and reference haplotypes as seem similar.

t<-read.csv("shorts.laser.bed", sep="\t", header=F)
sum(t$V3)
[1] 763897743
t<-read.csv("Express617_v1.chrs.fa.fai", sep="\t", header=F)
sum(t$V2)
[1] 764629779

I have single haplotype because it's a plant and highly inbred. Is it possible that LASeR simulates reads as if there were two haplotypes and I would need to cut coverage in half?

ll haplo
total 758429
drwxr-sr-x 2 agolicz agcpgl 2 Oct 5 10:00 ./
drwxr-sr-x 12 agolicz agcpgl 87 Oct 6 14:50 ../
-rw-r--r-- 1 agolicz agcpgl 776629534 Oct 5 10:00 h1.fa
-rw-r--r-- 1 agolicz agcpgl 598 Oct 5 10:00 h1.fa.fai

davidebolo1993 · 2021-10-06T13:02:52Z

Just re-checked with a human chromosome and the simulated coverage is what I expected.

mkdir coveragetest && cd coveragetest
curl -LO ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.fa chr22 > chr22.fa
samtools faidx chr22.fa
echo -e "chr22\t15000000\t16000000\tdeletion\tNone\t0\nchr22\t20000000\t21000000\tinversion\tNone\t0\nchr22\t30000000\t31000000\ttandem duplication\t2\t0" > HACk.h1.bed
VISOR HACk -g chr22.fa -b HACk.h1.bed -o hack.1.out
cut -f1,2 hack.*.out/*.fai chr22.fa.fai > haplochroms.dim.tsv
cat haplochroms.dim.tsv | sort  | awk '$2 > maxvals[$1] {lines[$1]=$0; maxvals[$1]=$2} END { for (tag in lines) print lines[tag] }' > maxdims.tsv
awk 'OFS=FS="\t"''{print $1, "1", $2, "100.0", "100.0"}' maxdims.tsv > shorts.laser.simple.bed
VISOR LASeR -g chr22.fa -s hack.1.out -b shorts.laser.simple.bed -o laser.1.out --threads 7 --coverage 1 --tag --fastq
cat laser.1.out/r.fq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c #50862344
#contig size is 50818468
#from ipython
50818468/50862344 = 0.99

Can't really say what's the issue here. Can you share a similar code snippet so that I can try to reproduce this weird behaviour?

agolicz · 2021-10-06T13:19:53Z

I must be doing something strange...

VISOR HACk -g Express617_v1.chrs.fa -b Assemblytics.INS.DEL.visor.uniq.no.chrs.bed -o haplo > HACk.log
cut -f1,2 haplo/h1.fa.fai > haplochroms.dim.tsv
awk 'OFS=FS="\t"''{print $1, "1", $2, "100.0", "100.0"}' haplochroms.dim.tsv > shorts.laser.bed
VISOR LASeR -g Express617_v1.chrs.fa -s haplo -b shorts.laser.bed -o laser.5x.out --threads 8 --coverage 5 > laser.5x.stdout 2>laser.5x.err
mv laser.5x.out/sim.srt.bam laser.5x.bam
bamToFastq -i laser.5x.bam -fq laser.5x.fq > btools.laser.5x.out 2>btools.laser.5x.err
cat laser.5x.fq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c

The two files:
http://sendanywhe.re/HH3597LW
http://sendanywhe.re/XACV97J6

davidebolo1993 · 2021-10-07T07:43:30Z

Hi @agolicz, sorry for being late on this.

I honestly can't say what is happening on your side, because I double checked and I actually ended up having the proper coverage.

VISOR HACk -g Express617_v1.chrs.fa -b Assemblytics.INS.DEL.visor.uniq.no.chrs.bed -o haplo
cut -f1,2 haplo/*.fai Express617_v1.chrs.fa.fai > haplochroms.dim.tsv
cat haplochroms.dim.tsv | sort  | awk '$2 > maxvals[$1] {lines[$1]=$0; maxvals[$1]=$2} END { for (tag in lines) print lines[tag] }' > maxdims.tsv
awk 'OFS=FS="\t"''{print $1, "1", $2, "100.0", "100.0"}' maxdims.tsv > shorts.laser.simple.bed
totalbases=$(cut -f1,2 haplo/h1.fa.fai | cut -f2 | awk '{sum+=$1;} END{print sum;}')
VISOR LASeR -g Express617_v1.chrs.fa -s haplo -o laser.out --threads 5 --coverage 5 --fastq -b shorts.laser.simple.bed
#count coverage
#1 from fastq
simbases=$(cat laser.out/r.fq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c)
echo $((simbases/ totalbases)) #returns 4.99
#2 from bam
mosdepth depth -n -x -b 500 laser.out/sim.srt.bam && column -t laser.out/depth.mosdepth.summary.txt #returns min/max/mean coverage per contig

I am pretty sure that, especially for very low coverages ( < 1), you may have higher coverage fluctuations but this is not happening on this example (at least, not on my side).

Thanks,

Davide

davidebolo1993 · 2021-10-07T07:46:46Z

BTW, the LASeR version which accepts --fastq as input is included in the new tar attached.

newvers.tar.gz

I will push these changes to the master branch as soon as I hear back from you.

Thanks,

Davide

agolicz · 2021-10-07T07:48:19Z

Ok, great thanks! I will test out this one. It could be the bam to fastq conversion. So strange!
Will re-run everything and get back to you.

Agnieszka

davidebolo1993 · 2021-10-07T08:09:15Z

Hi @agolicz,

I can confirm that bamToFastq is causing the issue (well, that's actually not an issue). I guess that bamToFastq do not account for multi-mappings (that is, reads that have a primary alignment as well as supplementary alignments) so that the same read is in the end extracted multiple times. A simple check on the output FASTQ from LASeR and bamToFastq unveil the mistery:

#true # of reads
grep "^@" laser.fq | wc -l #271879, in my case
# of reads from bamToFastq
grep "^@" bamtofastq.fq | wc -l #367849, in my case
#de-duplicating
grep "^@" bamtofastq.fq | sort | uniq | wc -l #271879, matching ground truth

agolicz · 2021-10-07T08:25:29Z

Ok, mystery solved! I suppose I did not expect that many reads to have secondary alignments... But it is a bit of a weird genome... How many secondary alignments do you normally see in the human genome?
I will test the LASeR --fastq anyways and get back to you.

If you are pushing changes can I suggest one more thing?
If there is overlapping regions in bed file HAcK gives the following message:
'xxx variants will be skipped for current chromosome as they overlap others'.
It would be great if along the message the bed entries for the skipped variants were also printed. It helps with debugging issues with bed file and possibly producing a final vcf which reflects variants used for haplotype building.

Thanks for all the help. VISOR is a great tool and will help us immensely.

Agnieszka

davidebolo1993 · 2021-10-07T08:41:09Z

How many secondary alignments do you normally see in the human genome?

Quite a lot, especially in synthetic datasets containing a lot of SVs. This is also reported here, for instance, but it depends a lot on the parameters used for mapping I have to say.

It would be great if along the message the bed entries for the skipped variants were also printed. It helps with debugging issues with bed file and possibly producing a final vcf which reflects variants used for haplotype building.

I know but this is not as trivial as it seems, because of how variants are handled internally by HACk. For instance, a standard cut&paste translocation is translated internally into a deletion and an insertion that do not have a corresponding key (entry) in the original BED. A possible solution is to build a second dictionary where each key is the entry in the VCF and the corresponding value a tuple of variants derived from that entry but I'm honestly looking for something more elegant. I'll experiment with possible strategies as soon as I have time.

Thanks for all the help. VISOR is a great tool and will help us immensely.

Thanks.

Davide

agolicz · 2021-10-08T06:00:25Z

Just wanted to confirm that all seems to have worked well :)

Thanks!
Agnieszka

davidebolo1993 · 2021-10-08T07:28:13Z

Hi @agolicz,

Thanks, I've just pushed the changes to the master branch.

Best,

Davide

agolicz · 2021-11-22T10:03:20Z

Hi @agolicz,

not a tool, but this script here can convert a HACk BED with standard SVs (DEL,DUP,INV,INS,BND) into a VCF that can be in turn used with downstream softwares like truvari for benchmarking. Usage is something like:
python converter.py <input.reference.fasta> <input.hack.bed> <output.hack.vcf>
Hope this helps,

Davide

Hi, I have a quick question about converter.py

I converted mv HACk bed to vcf, but I noticed that for while start position for insertions seems to get updated with +1, this is not the case for deletions. Is there a reason for that? Sample code snippet

python converter.py Express617_v1.chrs.fa Assemblytics.INS.DEL.visor.uniq.no.chrs.bed Assemblytics.INS.DEL.visor.uniq.no.chrs.vcf

#Insertion - different start
grep 139536 Assemblytics.INS.DEL.visor.uniq.no.chrs.bed
chrA01  139535  139536  insertion       TTTTTTAAAAAATTTTAAATTTATTCAAATAAAAAAAAAACTTTTGTACAGAGTTTTATCTGCTACCATATTTAAATACAATGAAATGCAAACCAAAAAAAAATATGCTTCTATATCTCTCAAGCTATACTAGCCTCTGTTAAACCAATCCTCCATCATCTTCTCATATTTACCTCTTACCCTCCTTCTCAAAGAAGATATTCTGTTTCTAATGAGCTTATCAAGCCTTGCAATCAGACAGGTAGCTGGCTGCGATGCCTCACCCACTCTTCTCACATTCCTCTCATGCCATAACGCATAAGCCACGGCCTGAAAGCAATAACGCAGTAACACAGTCGAACTCCACTTATGTAAACCCTTCTCAACAATCCGAAGCACTCTATCCCATTTAC  0
grep 139536 Assemblytics.INS.DEL.visor.uniq.no.chrs.vcf
chrA01  139536  var1    N       <INS>   .       PASS    SVTYPE=INS;END=139536;SVLEN=392 GT      1/1

#Deletion - same start
grep 161804 Assemblytics.INS.DEL.visor.uniq.no.chrs.bed
chrA01  161557  161804  deletion        None    0
grep 161804 Assemblytics.INS.DEL.visor.uniq.no.chrs.vcf
chrA01  161557  var2    N       <DEL>   .       PASS    SVTYPE=DEL;END=161804;SVLEN=-247        GT      1/1

agolicz · 2021-11-22T11:29:38Z

I seem to have gone down coordinate rabbit hole a bit and noticed that different tools seem to handle bed SV coordinates a bit differently...
Just to confirm, for the two attached SVs I would expect the coordinates to be:
Insertion: chr1S 754740078 754740079
Deletion: chr1S 754768762 754768819

Is that correct?

davidebolo1993 · 2021-11-22T11:46:09Z

Hi @agolicz,

Because in VISOR INS events are added immediately after the END coordinate specified in your BED, in the corresponding VCF event you should see the END coordinate as POS (which is exactly what I see in your example here). For DEL events, POS and END are the first and last deleted bases.

Best,

Davide

agolicz · 2021-11-22T11:58:19Z

Yes thanks, I think you are correct. The INS position got me confused. Could you double check if my INS and DEL coordinates based on IGV are correct? Just to make sure I get it now...

Agnieszka

davidebolo1993 · 2021-11-22T12:01:30Z

Yes,

they look good to me. Let me know if I can help further.

Thanks,

Davide

agolicz · 2021-11-22T12:02:40Z

Brilliant thanks. I think that should do it!

Agnieszka

agolicz · 2021-11-22T19:34:50Z

Also as a note if anyone ever encounters this thread for the variants pictured above:

Assemblytics output (Assemblytics_structural_variants.bed):

chr1S   754740080   754740080   Assemblytics_w_183750   25377   +   Insertion   0   25377   ptg000015l:1377388-1402765:+    within_alignment
chr1S   754768763   754768820   Assemblytics_w_183752   57  +   Deletion    57  0   ptg000015l:1419021-1419021:+    within_alignment

Corresponding vcf:

chr1S   754740079   svim_asm.INS.33071  C   N   .   PASS    SVTYPE=INS;END=754740079;SVLEN=25377    GT  1/1
chr1S   754768762   svim_asm.DEL.34093  N   A   .   PASS    SVTYPE=DEL;END=754768819;SVLEN=-57  GT  1/1

Corresponding HACk bed coordinates:

chr1S   754740078   754740079
chr1S   754768762   754768819

davidebolo1993 self-assigned this Oct 1, 2021

davidebolo1993 added the question Further information is requested label Oct 1, 2021

davidebolo1993 closed this as completed Oct 4, 2021

davidebolo1993 reopened this Oct 6, 2021

davidebolo1993 closed this as completed Oct 8, 2021

davidebolo1993 mentioned this issue Nov 18, 2022

Definition of oriantation region of breakpoints #29

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Assemblytics bed to HACk bed #19

Assemblytics bed to HACk bed #19

agolicz commented Oct 1, 2021

davidebolo1993 commented Oct 1, 2021

agolicz commented Oct 1, 2021

davidebolo1993 commented Oct 1, 2021

agolicz commented Oct 1, 2021

agolicz commented Oct 6, 2021

davidebolo1993 commented Oct 6, 2021

agolicz commented Oct 6, 2021

davidebolo1993 commented Oct 6, 2021

davidebolo1993 commented Oct 6, 2021

agolicz commented Oct 6, 2021

agolicz commented Oct 6, 2021 •

edited

Loading

davidebolo1993 commented Oct 6, 2021

agolicz commented Oct 6, 2021 •

edited

Loading

davidebolo1993 commented Oct 6, 2021

agolicz commented Oct 6, 2021 •

edited

Loading

davidebolo1993 commented Oct 7, 2021

davidebolo1993 commented Oct 7, 2021

agolicz commented Oct 7, 2021

davidebolo1993 commented Oct 7, 2021

agolicz commented Oct 7, 2021

davidebolo1993 commented Oct 7, 2021

agolicz commented Oct 8, 2021

davidebolo1993 commented Oct 8, 2021

agolicz commented Nov 22, 2021 •

edited

Loading

agolicz commented Nov 22, 2021

davidebolo1993 commented Nov 22, 2021

agolicz commented Nov 22, 2021

davidebolo1993 commented Nov 22, 2021

agolicz commented Nov 22, 2021

agolicz commented Nov 22, 2021 •

edited

Loading

Assemblytics bed to HACk bed #19

Assemblytics bed to HACk bed #19

Comments

agolicz commented Oct 1, 2021

davidebolo1993 commented Oct 1, 2021

agolicz commented Oct 1, 2021

davidebolo1993 commented Oct 1, 2021

agolicz commented Oct 1, 2021

agolicz commented Oct 6, 2021

davidebolo1993 commented Oct 6, 2021

agolicz commented Oct 6, 2021

davidebolo1993 commented Oct 6, 2021

davidebolo1993 commented Oct 6, 2021

agolicz commented Oct 6, 2021

agolicz commented Oct 6, 2021 • edited Loading

davidebolo1993 commented Oct 6, 2021

agolicz commented Oct 6, 2021 • edited Loading

davidebolo1993 commented Oct 6, 2021

agolicz commented Oct 6, 2021 • edited Loading

davidebolo1993 commented Oct 7, 2021

davidebolo1993 commented Oct 7, 2021

agolicz commented Oct 7, 2021

davidebolo1993 commented Oct 7, 2021

agolicz commented Oct 7, 2021

davidebolo1993 commented Oct 7, 2021

agolicz commented Oct 8, 2021

davidebolo1993 commented Oct 8, 2021

agolicz commented Nov 22, 2021 • edited Loading

agolicz commented Nov 22, 2021

davidebolo1993 commented Nov 22, 2021

agolicz commented Nov 22, 2021

davidebolo1993 commented Nov 22, 2021

agolicz commented Nov 22, 2021

agolicz commented Nov 22, 2021 • edited Loading

agolicz commented Oct 6, 2021 •

edited

Loading

agolicz commented Oct 6, 2021 •

edited

Loading

agolicz commented Oct 6, 2021 •

edited

Loading

agolicz commented Nov 22, 2021 •

edited

Loading

agolicz commented Nov 22, 2021 •

edited

Loading