-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reads check after removeBimeraDenove #267
Comments
The length distribution in your merged amplicons will come from a combination of natural length variation in the amplified genetic locus, non-target amplification (eg. of Eukaryotic 18S rather than bacterial 16S) and in some sequencing runs "length-shifted" artefacts (which we don't entirely understand the origin of). In the tutorial, we know the distribution of the targeted bacterial V4 region is tight around 253, so we can look at our length distribution and, in combination with that knowledge, choose a length distribution to keep. There isn't a single formula for every gene region/sequencing run, but looking at your distribution across lengths will certainly help. Here is a simple function to plot the number of SVs and the number of reads by sequence length:
|
hm, I also had the 16S V4 region with primers of 515f and 806R, so actually i should expect the distribution of the sequence size around 253bp as well. |
Yes from that plot it looks like the vast majority of your SVs/reads are falling in that window, so it might make sense to restrict your sequence table to that range. Side note, can you post one of your top sequences? I don't think you should be getting 270 nts as the target length unless there is some primer still attached. |
yeah, I see now maybe there is some primers still there in the sequence. actually the quality of the sequences are pretty bad. I also attached a few the quality score here, and the filter step settings. out_2=filterAndTrim(fwd= file.path(pathF, fastqFs), filt=file.path(filtpathF, fastqFs), |
Yes, you still have parts of the 515F and 806R primers on these reads (about 13/10 nts from each). That will interfere with dada2 analysis, as primers contain ambiguous nucleotides, i.e. nucleotides that are sometimes one base and sometimes another, which appears to the algorithm as biological variation. |
ok, then I see. what would you suggest me to do? I am thinking that I should trim more reads from the left side of both forward and reverse reads, about 10-13 more than I have now? |
Yes you can trim them off with a larger trimLeft, or use the dedicated software tools we mention in the FAQ. |
I see now. if I use trimLeft, the length I set could be the length of my primers (plus the adaptor i guess?) |
Correct. I think you might have a 2nt pad in your reads at the start of each read, in addition to the 17/21nts of the EMP V4 primers. But if you check that, you can use trimLeft to remove the non-biological sequence.
You probably should still use |
Hi, Ben, Illumina adapter-806R: Second primer sets is to attach standard illumina handles and index primers, which looks like this: The data we got was demultiplexed, and should only leave the first set of adaptor and bacterial primers, which means that in the trimming step, I should set trimLeft to 56 for forward and 41 for reverse reads to remove the non-biological sequences? Did I understand it right? after I tried with 56 and 41 in trimLeft, the sequences after removeBimeraDenovo I got looked like this, the reads are much shorter, with the most reads are at 199 and 200bp. Do you think this is acceptable? Thanks a lot for your help 154 156 157 158 159 163 173 177 179 180 181 194 195 196 197 198 199 200 201 203 I also attached the top sequences list, it looked that it still had some non-biological sequences left. |
Ideally you should understand what the layout of your reads is. The most common layout is that the reads contain the primer sequences, but not the rest of the adapters, linkers or index sequence (which is in a separate read). If you don't have such a layout, you can find the F/R primer sequences in your raw reads, and use that to determine the layout. If the non-biological sequence (i.e. primers etc) is only at the start of your reads, you can use dada2 trimLeft to remove it. If it is some more complicated layout, you should use a dedicated tool for the purpose, we recommend cutadapt and trimmomatic on our FAQ. If there are still non-target sequences in your data, you can also perform post-hoc removal of sequences outside the expected length, or that are taxonomically assigned to non-target taxa (eg. Eukaryotes, mitochondria, etc). That would be done in R after dada2 processing. |
I'll point out that whether you end up with primer sequence in your reads is largely a function of what sequencing primers you use. For example, the EMP 16S protocol (http://press.igsb.anl.gov/earthmicrobiome/protocols-and-standards/16s/) uses custom sequencing primers that match the PCR primer sequences. In this way, the first base you read off the sequencer is the first base of the amplicon. @yh330 You may want to ask whoever did your sequencing, but my guess is that they used standard Illumina adapters. For R1:
The first line is your first round PCR adapter, second line is the 2nd PCR adapter, and third line is the Illumina R2 sequencing primer. That means the your R1 sequences start from NNNN-GTGCCAGCMGCCGCGGTAA. Using For R2:
That means your R2 should start from GGACTA. You should check your fastq files to see if this is indeed the case as I can't tell from your list of top sequences. |
Closing under assumption the above solved your problem, but please reopen if still stuck. |
Hi,
I am wondering a bit for the reads check using the table(nchar(getSequences(seqtab.nochime))) after removeBimeraDenovo. I have got a matrix with sequence length and their corresponding number of reads. My amplicon should be with a length of 291bp, but in the matrix it showed a range of sequencing length from 190 to 327. So my questions are:
seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% seq(250,256)]), which is similar like gel cutting. But I am not completely sure how to set this sequence length range?
Thanks a lot :)
The text was updated successfully, but these errors were encountered: