Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct fastq format #44

Closed
francicco opened this issue Nov 30, 2018 · 9 comments
Closed

Correct fastq format #44

francicco opened this issue Nov 30, 2018 · 9 comments

Comments

@francicco
Copy link

HI @warrenlr,

I'm trying to generate the right formatting for the fastq files. I've not found any example, so I tried to deduce it from the bam file NA24143_genome_phased_namesorted.bam1.sorted.bam.
Is this the right way?

@NB501445:179:H2YG7BGX7:2:11312:9490:14604_AAAAAAAAAGCAATTT
AAATTAAGTGATATATCCTCTTACGAAGTTGTTTCGGTTTAAAATTATTGTTATTATTTTTTAATGATGTGCAGAAAGTAAAGCAACATGAAGCTTTATTTACTACCAACTTCATTTCCCTTCAATA
+
EEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEAE<AEA/A/EE/EEAA</EE/EEEEEEE/EEEEEEE//<EE/EEEEEEEEEEEEEE/EAEAEEEAEE<EAE<</<EAEEAEEE</A/E</EE
@NB501445:179:H2YG7BGX7:3:21602:1860:16340_AAAAAAAAATTATAAA
AAAAAAATGTTACCCACAATCCCATGATTTGCCATGCATGTATATAGGGACATATCCAATGATAGACATCGCGTTGTTACCGCGTCAACGGGGCTCTAGTAAGAATAAATTCGCAGTCACAGCTTTG
+
EEEEEEEEE/EE//E/E/E//6A/EE/A</EEAEEEE/E/<E/EEE<///EE//A/EE/EE/AEEEEEAAE6/EE/EA//EEEEEE<EEA</EE/</EE//AEE//<E/EA<AEA</E/AA/E/E/A

and this is the command to align them with bwa:

bwa mem -t32 $ASSEMBLY.fasta $ARCSR1.gz $ARCSR2.gz | samtools view -Sb - | samtools sort -@ $THREADS -n - > $SPECIES.LinkedReads.bam

Correct?
Best
F

@francicco
Copy link
Author

After the alignment ARCS returns me this error

###### Fri 30 Nov 15:35:27 GMT 2018: Running ARCS
Running: arcs 1.0.5
 pid 161016
 -c 5
 -d 0
 -e 30000
 -l 0
 -m 50-10000
 -r 0.05
 -s 98
 -v 0
 -z 500
 --gap=100
 -b ID1090_1_renamed_bcfrac066_pseudohap_1kb.fasta.scaff_s98_c5_l0_d0_e30000_r0.05
 -g ID1090_1_renamed_bcfrac066_pseudohap_1kb.fasta.scaff_s98_c5_l0_d0_e30000_r0.05.dist.gv
 --barcode-counts=NA
 --tsv=NA
 -a Pdid.LinkedReads.Aligned.sortedByCoord.out.bam
 -f ID1090_1_renamed_bcfrac066_pseudohap_1kb.fasta

=> Getting scaffold sizes... Fri Nov 30 15:35:27 2018

=> Reading alignment files... Fri Nov 30 15:35:28 2018
error: `@SQ	SN:bcfrac066.1	LN:1327': No such file or directory

@lcoombe
Copy link
Member

lcoombe commented Nov 30, 2018

Hi @francicco,

The input fastq files can be either in the format you show above, or the barcode can also be in a comment of the read header (@Name BX:Z:barcode). So, you can just use the reads generated by longranger basic. If you keep the barcode in the BX comment, you just need to add the -C option to the bwa mem command so that the barcode will be placed in the BX tag of the SAM entries.

@NB501445:179:H2YG7BGX7:2:11312:9490:14604 BX:Z:AAAAAAAAAGCAATTT-1
AAATTAAGTGATATATCCTCTTACGAAGTTGTTTCGGTTTAAAATTATTGTTATTATTTTTTAATGATGTGCAGAAAGTAAAGCAACATGAAGCTTTATTTACTACCAACTTCATTTCCCTTCAATA
+
EEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEAE<AEA/A/EE/EEAA</EE/EEEEEEE/EEEEEEE//<EE/EEEEEEEEEEEEEE/EAEAEEEAEE<EAE<</<EAEEAEEE</A/E</EE
@NB501445:179:H2YG7BGX7:3:21602:1860:16340 BX:Z:AAAAAAAAATTATAAA-1
AAAAAAATGTTACCCACAATCCCATGATTTGCCATGCATGTATATAGGGACATATCCAATGATAGACATCGCGTTGTTACCGCGTCAACGGGGCTCTAGTAAGAATAAATTCGCAGTCACAGCTTTG
+
EEEEEEEEE/EE//E/E/E//6A/EE/A</EEAEEEE/E/<E/EEE<///EE//A/EE/EE/AEEEEEAAE6/EE/EA//EEEEEE<EEA</EE/</EE//AEE//<E/EA<AEA</E/AA/E/E/A

You can also leave the input read files interleaved if you wish (default of longranger basic), and would just need to add -p to the bwa mem command to make sure the reads are paired properly.

As for the error, the input file to -a is expected to be a text file listing the path to all the alignment files you wish to use. It looks like you supplied the BAM file itself.

-a, --fofName=FILE    text file listing input SAM/BAM filenames

As a side note we do have a Makefile (Examples/arcs-make) which runs the whole ARCS pipeline, taking the reads and the draft assembly as input. Even if you don't want to use the Makefile, I'd recommend taking a look to get an idea of the required steps.

Hope that helps!
Lauren

@lcoombe lcoombe closed this as completed Dec 10, 2018
@francicco
Copy link
Author

Hi @lcoombe,

Sorry, I was a bit lat with the test. I'm still playing around with some tests to check the pipeline.
Now I'm mapping the interlaced reads with BWA.

[M::process] 2819384 single-end sequences; 0 paired-end sequences

It looks like BWA doesn't recognise the paired ends with are formatted like this

@D00352:465:CD1CHANXX:2:2209:5496:86527/1_AAAAAAAAAAATGAAC
TAAGTCAACTAATCAACTAGATTCTCAGGTTCCCTCTGCCTACCCTGCTATGTGCGGTATACAGCGTGAAGCTAAAAAAAGCCAACTTATCAACTATGTCGT
+
BB<BF//</<B/<FFF/BB<FF/<F/B</B/FFFB///<<F/B<B/</<7/7F<FF7/B7BBFBF7FFFF<<<///7<F///////////BB77/////7/7
@D00352:465:CD1CHANXX:2:2209:5496:86527/2_AAAAAAAAAAATGAAC
CTAAATATGTTTTTTGTTTCCTAAGTGTGTGATTAGCACTTTAAAACTTCAAAAAAGATTTTTTTATGCGTAAAAAAGCACGATGTTAAATTCTGGGATTCAACACGTACCTAATATCTTAAATT
+
/BBB//FFBBBFFF/F<<FFBB/<FBFB/F/<FFFF</<FF<F<<<FFBF<FFF//7F<<FFFFF////<//</<BFB<//F//F/<BFFFFFF/7/<FB/77/7/7<B/BBF<FFFF<<FBBFF

It that normal?
F

@lcoombe
Copy link
Member

lcoombe commented Dec 11, 2018

Hi @francicco - Could you post your most recent bwa mem command? Are you using the -p parameter?

@lcoombe lcoombe reopened this Dec 11, 2018
@francicco
Copy link
Author

That's the command

bwa mem -t$THREADS $SPECIES.fa -p $ARCSR.fastq.gz | samtools view -Sb - | samtools sort -@ $THREADS -n - > $SPECIES.LinkedReads.Aligned.sortedByCoord.out.bam

@francicco
Copy link
Author

It is actually sorted by read. I just named it wrong.
F

@lcoombe
Copy link
Member

lcoombe commented Dec 11, 2018

@francicco - I believe it is due to the \1 and \2 suffixes that you added to the read header before appending the barcode sequence. I did a small test case, and this interfered with bwa mem's smart pairing (-p).
I suggest either going back to your original post-longranger fastq file (formatted as I showed in my initial comment) or removing those characters from the read headers so that the 2 read headers of a read pair are identical.

$ head -n8 lr.arcsformat.fq 
@gi|453231901|ref|NC_003280.10|_12670490_12670606_0_1_0_0_0:0:0_0:0:0_5db95_AAACACCTCAGGCTGC
GAAAAGTTTCGTGTACTCCACACGGACACATACATTTAGTTTTAAAACTAAAATCAAGCCGCGACGCGACACGCAACGCGCCGTAAATCTACCCCAGATATGGCCGAGCAAAAATGGCCTAGTTCGGC
+
GFIEHGEFEDEGCEDBCCDB@CE=HBABBCABBADABBBB@@B>BACAA@<@AACCB?D@A@@@A@@??><BC??>?<A<B??A?><>==A;=??;B@==AB>>?=?>>=?><>;;@=<;=?>=>=?=
@gi|453231901|ref|NC_003280.10|_12670490_12670606_0_1_0_0_0:0:0_0:0:0_5db95_AAACACCTCAGGCTGC
CACTGTGCTTCACCACTTCATCACCCGCCGATTCTTTTGAAAAAAAAAATTTCTCACATACACATATATTGAGAGTAAAAGTCACAAGCCACGGATTTCTGGCTTCTCTCATAAATTGAAATGGAAGAGTTTGCCGAACTAGGCCATTTTT
+
GIHIIHGICFIHFEIIDCEDDDCEEEDDDCCEEACFBBBBECDBECCBAAAA>@ABAA@ABAAA@A?BB<B@@@D@@E>A??=?@?@B>@?>B>@>??>CB=A>A@?>=<A<>>?>>>B>>BA>>>>>@<?=>====?<==>?<====;?=
$ bwa mem -p /projects/btl_scratch/lcoombe/barcode-assemblies/unitig-graph/celegans/LRSIM-noVariantData/data/celegansN2_referencegenome.fna lr.arcsformat.fq  > test.sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 25 sequences (3476 bp)...
[M::process] 1 single-end sequences; 24 paired-end sequences
...
$ head -8 lr.arcsformat.subscr.fq 
@gi|453231901|ref|NC_003280.10|_12670490_12670606_0_1_0_0_0:0:0_0:0:0_5db95/1_AAACACCTCAGGCTGC
GAAAAGTTTCGTGTACTCCACACGGACACATACATTTAGTTTTAAAACTAAAATCAAGCCGCGACGCGACACGCAACGCGCCGTAAATCTACCCCAGATATGGCCGAGCAAAAATGGCCTAGTTCGGC
+
GFIEHGEFEDEGCEDBCCDB@CE=HBABBCABBADABBBB@@B>BACAA@<@AACCB?D@A@@@A@@??><BC??>?<A<B??A?><>==A;=??;B@==AB>>?=?>>=?><>;;@=<;=?>=>=?=
@gi|453231901|ref|NC_003280.10|_12670490_12670606_0_1_0_0_0:0:0_0:0:0_5db95/2_AAACACCTCAGGCTGC
CACTGTGCTTCACCACTTCATCACCCGCCGATTCTTTTGAAAAAAAAAATTTCTCACATACACATATATTGAGAGTAAAAGTCACAAGCCACGGATTTCTGGCTTCTCTCATAAATTGAAATGGAAGAGTTTGCCGAACTAGGCCATTTTT
+
GIHIIHGICFIHFEIIDCEDDDCEEEDDDCCEEACFBBBBECDBECCBAAAA>@ABAA@ABAAA@A?BB<B@@@D@@E>A??=?@?@B>@?>B>@>??>CB=A>A@?>=<A<>>?>>>B>>BA>>>>>@<?=>====?<==>?<====;?=
$ bwa mem -p /projects/btl_scratch/lcoombe/barcode-assemblies/unitig-graph/celegans/LRSIM-noVariantData/data/celegansN2_referencegenome.fna lr.arcsformat.subscr.fq > test.sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 25 sequences (3476 bp)...
[M::process] 25 single-end sequences; 0 paired-end sequences
...

Also, you are right that if bwa mem doesn't do proper paired-end alignments that this will interfere with ARCS -- one of the filters that ARCS uses when reading the input BAM(s) is checking that read pairs are mapping in a proper pair.

@francicco
Copy link
Author

Ok, thanks! I'll try to remove them from the header!
Thanks.
F

@lcoombe
Copy link
Member

lcoombe commented Dec 11, 2018

Sounds good - I'll close this, but feel free to re-open or open another issue if you have any other questions.

Thank you for your interest in ARCS!
Lauren

@lcoombe lcoombe closed this as completed Dec 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants