Assemblytics bed to HACk bed #19
Comments
Hi @agolicz, yes, in standard BED, start coordinates are 0-based (end coordinates are not). For your example from Assemblytics (assuming it is saved as "test.bed"), something like the line below should work:
awk 'BEGIN{OFS=FS="\t"} {if ($4 == "Insertion") print $1, $3-1, $3, $4; else print $0}' test.bed
Let me know if I can help further, Davide |
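As a quick check of the conversion suggested above, here is a minimal sketch that applies the same rule to the two records from the original question. The function name `assemblytics_to_hack_coords`, the temporary directory, and the header-less `test.bed` are illustrative assumptions, not part of Assemblytics or VISOR.

```shell
# Hypothetical helper: shift insertions to (breakpoint-1, breakpoint)
# and pass every other record through unchanged.
assemblytics_to_hack_coords() {
    awk 'BEGIN { OFS = FS = "\t" }
         $4 == "Insertion" { print $1, $3 - 1, $3, $4; next }
         { print }' "$1"
}

# Rebuild the two example records as a tab-separated file (no header line).
workdir=$(mktemp -d)
printf 'chrC01\t16417864\t16417924\tDeletion\nchrC01\t883429\t883429\tInsertion\n' > "$workdir/test.bed"

assemblytics_to_hack_coords "$workdir/test.bed"
# chrC01  16417864  16417924  Deletion
# chrC01  883428    883429    Insertion
```

Only the coordinates are converted here. The HACk BED used later in this thread has six columns and lowercase variant types, so the type column and the remaining fields would still need to be filled in.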
Beautiful! Many thanks. One more question: do you have a recommended tool to convert HACk BED to VCF? |
Hi @agolicz, not a tool, but this script here can convert a HACk BED with standard SVs (DEL, DUP, INV, INS, BND) into a VCF that can in turn be used with downstream software such as truvari for benchmarking. Usage is something like:
python converter.py <input.reference.fasta> <input.hack.bed> <output.hack.vcf>
Hope this helps, Davide |
Perfect, exactly what I was after :). Agnieszka |
Hi again, First I tried samtools fastq, but most of the reads were discarded as singletons. Commands used: Attempted extraction with samtools: Attempted extraction with bedtools: Indeed, when I grep for a read name, it occurs in the BAM file only once; one would expect it twice for paired reads. I also noticed that the reads which have no mate in the BAM have the same mapping location for both R1 and R2. I assume this is expected behavior, but I am not sure what it signifies. Agnieszka |
Hi @agolicz, I think this is an issue related to the names used by VISOR in read headers, which are modified to facilitate downstream benchmarking (and debugging, I must say). For the alignment step I think this is not a big issue, because pairs are read in parallel from 2 (temporary) FASTQ files, but for re-building the initial FASTQ files I guess this can be problematic. If you can share your data, I can try to replicate the issue and dig a bit deeper into this one as soon as I have time. Thanks, Davide |
Thanks, I'll share the files later today. If you are too busy now, I could modify my local copy and test if you could point me to the part of the code which should be changed. |
Yes, that's the idea, basically adding a --fastq flag that stores the FASTQ in the output folder. Thanks, Davide |
Hi @agolicz, attached a new version of VISOR.py and SHORtS.py. Can you give these a try? Thanks, Davide |
Thanks heaps. Testing now. I will let you know how it goes. In the meantime I modified the BulkSim function to accept fastq file locations as two additional arguments and then dumped the reads there, but this is much nicer :) bedtools bamtofastq seemed to have worked on LASeR bams without issues, so hopefully those should be fine. |
Seems to have worked fine! I have one last question regarding coverages (genome size: 764629779 bp).
For LASeR in the --coverage 5 simulation I got
Is the coverage calculation different for long reads (not total bp/genome size)? Commands to get laser.5x.fq Agnieszka |
Is (are) the contig(s) used for simulation the same size as the reference genome? I've never seen such an issue with LASeR (although it shows more fluctuation than short reads), and I'm honestly using it quite frequently. Thanks, Davide |
Yep.
VISOR HACk -g Express617_v1.chrs.fa -b Assemblytics.INS.DEL.visor.uniq.no.chrs.bed -o haplo > HACk.log
Total length of Express617_v1.chrs.fa: 764629779
The simulated and reference haplotypes seem similar.
I have a single haplotype because it's a plant and highly inbred. Is it possible that LASeR simulates reads as if there were two haplotypes, so that I would need to cut the coverage in half? ll haplo |
Just re-checked with a human chromosome and the simulated coverage is what I expected.
mkdir coveragetest && cd coveragetest
curl -LO ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.fa chr22 > chr22.fa
samtools faidx chr22.fa
echo -e "chr22\t15000000\t16000000\tdeletion\tNone\t0\nchr22\t20000000\t21000000\tinversion\tNone\t0\nchr22\t30000000\t31000000\ttandem duplication\t2\t0" > HACk.h1.bed
VISOR HACk -g chr22.fa -b HACk.h1.bed -o hack.1.out
cut -f1,2 hack.*.out/*.fai chr22.fa.fai > haplochroms.dim.tsv
cat haplochroms.dim.tsv | sort | awk '$2 > maxvals[$1] {lines[$1]=$0; maxvals[$1]=$2} END { for (tag in lines) print lines[tag] }' > maxdims.tsv
awk 'BEGIN{OFS=FS="\t"} {print $1, "1", $2, "100.0", "100.0"}' maxdims.tsv > shorts.laser.simple.bed
VISOR LASeR -g chr22.fa -s hack.1.out -b shorts.laser.simple.bed -o laser.1.out --threads 7 --coverage 1 --tag --fastq
cat laser.1.out/r.fq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c #50862344
#contig size is 50818468
#from ipython
50818468/50862344 = 0.99
Can't really say what the issue is here. Can you share a similar code snippet so that I can try to reproduce this weird behaviour? |
I must be doing something strange...
VISOR HACk -g Express617_v1.chrs.fa -b Assemblytics.INS.DEL.visor.uniq.no.chrs.bed -o haplo > HACk.log
The two files: |
Hi @agolicz, sorry for being late on this. I honestly can't say what is happening on your side, because I double checked and I actually ended up having the proper coverage.
VISOR HACk -g Express617_v1.chrs.fa -b Assemblytics.INS.DEL.visor.uniq.no.chrs.bed -o haplo
cut -f1,2 haplo/*.fai Express617_v1.chrs.fa.fai > haplochroms.dim.tsv
cat haplochroms.dim.tsv | sort | awk '$2 > maxvals[$1] {lines[$1]=$0; maxvals[$1]=$2} END { for (tag in lines) print lines[tag] }' > maxdims.tsv
awk 'BEGIN{OFS=FS="\t"} {print $1, "1", $2, "100.0", "100.0"}' maxdims.tsv > shorts.laser.simple.bed
totalbases=$(cut -f1,2 haplo/h1.fa.fai | cut -f2 | awk '{sum+=$1;} END{print sum;}')
VISOR LASeR -g Express617_v1.chrs.fa -s haplo -o laser.out --threads 5 --coverage 5 --fastq -b shorts.laser.simple.bed
#count coverage
#1 from fastq
simbases=$(cat laser.out/r.fq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c)
awk -v s="$simbases" -v t="$totalbases" 'BEGIN { printf "%.2f\n", s / t }' #returns 4.99
#2 from bam
mosdepth -n -x -b 500 laser.out/depth laser.out/sim.srt.bam && column -t laser.out/depth.mosdepth.summary.txt #returns min/max/mean coverage per contig
I am pretty sure that, especially for very low coverages (< 1), you may see larger coverage fluctuations, but this is not happening in this example (at least, not on my side). Thanks, Davide |
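The FASTQ-based coverage check in this comment boils down to total simulated bases divided by total haplotype length. Below is a minimal sketch of that calculation. The function names `fastq_bases`, `fai_bases` and `mean_coverage` are hypothetical helpers, and the sketch assumes an uncompressed FASTQ with exactly four lines per record. Shell arithmetic with `$(( ))` is integer-only, so the division is done in awk to keep the decimals.

```shell
# fastq_bases: sum the lengths of all sequence lines (line 2 of each
# 4-line record). Assumes an uncompressed, non-wrapped FASTQ.
fastq_bases() {
    awk 'NR % 4 == 2 { total += length($0) } END { printf "%.0f\n", total }' "$1"
}

# fai_bases: sum column 2 (contig length) of a samtools faidx index.
fai_bases() {
    awk '{ total += $2 } END { printf "%.0f\n", total }' "$1"
}

# mean_coverage: simulated bases divided by genome size, two decimals.
mean_coverage() {
    awk -v sim="$(fastq_bases "$1")" -v ref="$(fai_bases "$2")" \
        'BEGIN { printf "%.2f\n", sim / ref }'
}

# Usage, with the paths from the commands in this comment:
# mean_coverage laser.out/r.fq haplo/h1.fa.fai
```

Summing line lengths in awk avoids building one very long line with `tr -d '\n'` before counting, which may matter for large simulations.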
BTW, the LASeR version that accepts the --fastq flag is included in the new tar attached. I will push these changes to the master branch as soon as I hear back from you. Thanks, Davide |
Ok, great thanks! I will test out this one. It could be the bam to fastq conversion. So strange! Agnieszka |
Hi @agolicz, I can confirm that bamToFastq is causing the issue (well, that's actually not an issue). I guess that bamToFastq does not account for multi-mappings (that is, reads that have a primary alignment as well as supplementary alignments), so that the same read ends up being extracted multiple times. A simple check on the output FASTQ from LASeR and bamToFastq unveils the mystery:
#true # of reads
grep "^@" laser.fq | wc -l #271879, in my case
# of reads from bamToFastq
grep "^@" bamtofastq.fq | wc -l #367849, in my case
#de-duplicating
grep "^@" bamtofastq.fq | sort | uniq | wc -l #271879, matching ground truth |
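The counts above use `grep "^@"`, which can also match quality lines, since "@" is a valid quality character. A slightly more defensive sketch of the same check reads only the header line of each record. The function names `count_reads` and `count_unique_reads` are hypothetical, and the sketch assumes an uncompressed FASTQ with four lines per record.

```shell
# count_reads: number of FASTQ records, counting header lines only
# (line 1 of each 4-line record).
count_reads() {
    awk 'NR % 4 == 1' "$1" | wc -l | tr -d ' '
}

# count_unique_reads: number of distinct header lines, which collapses
# reads that were written more than once.
count_unique_reads() {
    awk 'NR % 4 == 1' "$1" | sort -u | wc -l | tr -d ' '
}

# Usage, with the file names from this comment:
# count_reads laser.fq
# count_reads bamtofastq.fq
# count_unique_reads bamtofastq.fq
```

If the unique count from the bamToFastq output matches the record count of the LASeR FASTQ, the extra reads are duplicates rather than additional simulated data.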
Ok, mystery solved! I suppose I did not expect that many reads to have secondary alignments... But it is a bit of a weird genome... How many secondary alignments do you normally see in the human genome? If you are pushing changes can I suggest one more thing? Thanks for all the help. VISOR is a great tool and will help us immensely. Agnieszka |
Quite a lot, especially in synthetic datasets containing a lot of SVs. This is also reported here, for instance, but it depends a lot on the parameters used for mapping I have to say.
I know but this is not as trivial as it seems, because of how variants are handled internally by HACk. For instance, a standard cut&paste translocation is translated internally into a deletion and an insertion that do not have a corresponding key (entry) in the original BED. A possible solution is to build a second dictionary where each key is the entry in the VCF and the corresponding value a tuple of variants derived from that entry but I'm honestly looking for something more elegant. I'll experiment with possible strategies as soon as I have time.
Thanks. Davide |
Just wanted to confirm that all seems to have worked well :) Thanks! |
Hi @agolicz, Thanks, I've just pushed the changes to the master branch. Best, Davide |
Hi, I have a quick question about converter.py. I converted my HACk BED to VCF, but I noticed that while the start position for insertions seems to get incremented by 1, this is not the case for deletions. Is there a reason for that? Sample code snippet
|
I seem to have gone down a coordinate rabbit hole a bit and noticed that different tools seem to handle BED SV coordinates a bit differently... |
Hi @agolicz, Because in VISOR INS events are added immediately after the END coordinate specified in your BED, in the corresponding VCF event you should see the END coordinate as POS (which is exactly what I see in your example here). For DEL events, POS and END are the first and last deleted bases. Best, Davide |
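The coordinate rule described in this reply can be restated as a small sketch. This is not the code of converter.py; it only mirrors the convention as explained here. The function name `hack_bed_to_pos_end` is hypothetical, and reporting END equal to POS for insertions is an assumption made for illustration.

```shell
# hack_bed_to_pos_end: print chrom, POS, END and SV type for HACk BED
# insertions and deletions, following the rule described in the thread.
# Illustrative only; other variant types are skipped.
hack_bed_to_pos_end() {
    awk 'BEGIN { OFS = FS = "\t" }
         # Insertion: sequence is added right after the BED end, so POS = end.
         # END = POS is an assumption of this sketch.
         $4 == "insertion" { print $1, $3, $3, "INS"; next }
         # Deletion: POS and END are taken directly from the BED start and end.
         $4 == "deletion"  { print $1, $2, $3, "DEL"; next }
    ' "$1"
}

# Usage, with a hypothetical HACk BED file:
# hack_bed_to_pos_end my.hack.bed
```

With the insertion from the original question (columns 2 and 3 set to 883428 and 883429), this gives POS 883429, which is the "+1" relative to the BED start that was observed for insertions but not for deletions.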
Yes thanks, I think you are correct. The INS position got me confused. Could you double check if my INS and DEL coordinates based on IGV are correct? Just to make sure I get it now... Agnieszka |
Yes, they look good to me. Let me know if I can help further. Thanks, Davide |
Brilliant thanks. I think that should do it! Agnieszka |
Also as a note if anyone ever encounters this thread for the variants pictured above: Assemblytics output (Assemblytics_structural_variants.bed):
Corresponding vcf:
Corresponding HACk bed coordinates:
|
Hi,
I have a .bed file produced by Assemblytics.
The coordinates are as follows:
reference ref_start ref_stop type
chrC01 16417864 16417924 Deletion
chrC01 883429 883429 Insertion
As it is a .bed file, I assume coordinates are 0-based. I am wondering how to convert Insertion coordinates to HACk BED, where for insertions column 2 is supposed to be breakpoint-1 and column 3 the breakpoint.
Should the corresponding HACk BED be:
chrC01 16417864 16417924 deletion
chrC01 883428 883429 insertion
Thanks!
Agnieszka
Here is a further example of Assemblytics bed: http://assemblytics.com/analysis.php?code=arabidopsis