Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reads check after removeBimeraDenove #267

Closed
yh330 opened this issue Jun 21, 2017 · 13 comments
Closed

Reads check after removeBimeraDenove #267

yh330 opened this issue Jun 21, 2017 · 13 comments

Comments

@yh330
Copy link

yh330 commented Jun 21, 2017

Hi,
I am wondering a bit for the reads check using the table(nchar(getSequences(seqtab.nochime))) after removeBimeraDenovo. I have got a matrix with sequence length and their corresponding number of reads. My amplicon should be with a length of 291bp, but in the matrix it showed a range of sequencing length from 190 to 327. So my questions are:

  1. is this the problem with my previous filter/merging steps, for example the parameters I set still need to be adjusted?
  2. I saw in the tutorial, when you have 250bp amplicon, you also get a range of sequence and there is one step in the tutorial using
    seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% seq(250,256)]), which is similar like gel cutting. But I am not completely sure how to set this sequence length range?

Thanks a lot :)

@benjjneb
Copy link
Owner

The length distribution in your merged amplicons will come from a combination of natural length variation in the amplified genetic locus, non-target amplification (eg. of Eukaryotic 18S rather than bacterial 16S) and in some sequencing runs "length-shifted" artefacts (which we don't entirely understand the origin of).

In the tutorial, we know the distribution of the targeted bacterial V4 region is tight around 253, so we can look at our length distribution and, in combination with that knowledge, choose a length distribution to keep. There isn't a single formula for every gene region/sequencing run, but looking at your distribution across lengths will certainly help. Here is a simple function to plot the number of SVs and the number of reads by sequence length:

plotLengthDistro <- function(st) {
  require(ggplot2)
  tot.svs <- table(nchar(colnames(st)))
  tot.reads <- tapply(colSums(st), nchar(colnames(st)), sum)
  df <- data.frame(Length=as.integer(c(names(tot.svs), names(tot.reads))),
                   Count=c(tot.svs, tot.reads),
                   Type=rep(c("SVs", "Reads"), times=c(length(tot.svs), length(tot.reads))))
  p <- ggplot(data=df, aes(x=Length, y=Count, color=Type)) + geom_point() + facet_wrap(~Type, scales="free_y") + theme_bw() + xlab("Amplicon Length")
  p
}
# Use on your sequence table
plotLengthDistro(seqtab.nochim)
plotLengthDistro(seqtab.nochim) + scale_y_log10()

@yh330
Copy link
Author

yh330 commented Jun 21, 2017

hm, I also had the 16S V4 region with primers of 515f and 806R, so actually i should expect the distribution of the sequence size around 253bp as well.
I have the graph ready, from what I interpret is that most of the sequence is around 270bp, so should I set the range around 250 to 280?

Distribution of reads and length.pdf

@benjjneb
Copy link
Owner

Yes from that plot it looks like the vast majority of your SVs/reads are falling in that window, so it might make sense to restrict your sequence table to that range.

Side note, can you post one of your top sequences? I don't think you should be getting 270 nts as the target length unless there is some primer still attached.

@yh330
Copy link
Author

yh330 commented Jun 21, 2017

yeah, I see now maybe there is some primers still there in the sequence. actually the quality of the sequences are pretty bad. I also attached a few the quality score here, and the filter step settings.
top sequences.xlsx

forward.pdf
reverse.pdf

out_2=filterAndTrim(fwd= file.path(pathF, fastqFs), filt=file.path(filtpathF, fastqFs),
rev= file.path(pathR, fastqRs), filt.rev=file.path(filtpathR, fastqRs), trimLeft=c(10,10), minLen=c(50,50), truncLen=c(200,170), maxEE=c(2,2), truncQ=11, maxN=0, rm.phix=TRUE,
compress=TRUE, verbose=TRUE, multithread=TRUE)

@benjjneb
Copy link
Owner

Yes, you still have parts of the 515F and 806R primers on these reads (about 13/10 nts from each).

That will interfere with dada2 analysis, as primers contain ambiguous nucleotides, i.e. nucleotides that are sometimes one base and sometimes another, which appears to the algorithm as biological variation.

@yh330
Copy link
Author

yh330 commented Jun 21, 2017

ok, then I see. what would you suggest me to do? I am thinking that I should trim more reads from the left side of both forward and reverse reads, about 10-13 more than I have now?

@benjjneb
Copy link
Owner

Yes you can trim them off with a larger trimLeft, or use the dedicated software tools we mention in the FAQ.

@benjjneb benjjneb reopened this Jun 21, 2017
@yh330
Copy link
Author

yh330 commented Jun 22, 2017

I see now. if I use trimLeft, the length I set could be the length of my primers (plus the adaptor i guess?)
I would also want to try Trimmomatic, but then I could skip the filterAndTrim step in dada2?

@benjjneb
Copy link
Owner

if I use trimLeft, the length I set could be the length of my primers (plus the adaptor i guess?)

Correct. I think you might have a 2nt pad in your reads at the start of each read, in addition to the 17/21nts of the EMP V4 primers. But if you check that, you can use trimLeft to remove the non-biological sequence.

I would also want to try Trimmomatic, but then I could skip the filterAndTrim step in dada2?

You probably should still use filterAndTrim for its filtering purposes (eg. rm.phix and maxEE). And you should perform length truncation with one tool or the other given the strong quality decay in your reads.

@yh330
Copy link
Author

yh330 commented Jun 26, 2017

Hi, Ben,
Thanks for your explanation, but I think I might make it worse or I did not understand my data completely (which was not good). We used two sets of primers for the library preparations.
The first primer set is with the bacterial primers:
Illumina adapter-N4-515F:
5’-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN GTGCCAGCMGCCGCGGTAA -3’ (56bp)

Illumina adapter-806R:
5’-AGACGTGTGCTCTTCCGATCT GGACTACHVGGGTWTCTAAT -3’ (41bp)

Second primer sets is to attach standard illumina handles and index primers, which looks like this:
forward:
AATGATACGGCGACCACCGAGA{TCTACAC}-[i5 index] ACACTCTTTCCCTACACGACG
reverse:
CAAGCAGAAGACGGCATACGAGAT-[i7 index]-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

The data we got was demultiplexed, and should only leave the first set of adaptor and bacterial primers, which means that in the trimming step, I should set trimLeft to 56 for forward and 41 for reverse reads to remove the non-biological sequences? Did I understand it right?

after I tried with 56 and 41 in trimLeft, the sequences after removeBimeraDenovo I got looked like this, the reads are much shorter, with the most reads are at 199 and 200bp. Do you think this is acceptable? Thanks a lot for your help

154 156 157 158 159 163 173 177 179 180 181 194 195 196 197 198 199 200 201 203
7 2 3 6 4 1 1 1 3 1 1 1 1 5 19 42 580 100 6 1
204 206 209 216 218 221 227 228 238 246 248 257 260 264 265 272
2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2

I also attached the top sequences list, it looked that it still had some non-biological sequences left.

top sequences_2.xlsx

@benjjneb
Copy link
Owner

Ideally you should understand what the layout of your reads is. The most common layout is that the reads contain the primer sequences, but not the rest of the adapters, linkers or index sequence (which is in a separate read). If you don't have such a layout, you can find the F/R primer sequences in your raw reads, and use that to determine the layout.

If the non-biological sequence (i.e. primers etc) is only at the start of your reads, you can use dada2 trimLeft to remove it. If it is some more complicated layout, you should use a dedicated tool for the purpose, we recommend cutadapt and trimmomatic on our FAQ.

If there are still non-target sequences in your data, you can also perform post-hoc removal of sequences outside the expected length, or that are taxonomically assigned to non-target taxa (eg. Eukaryotes, mitochondria, etc). That would be done in R after dada2 processing.

@fanli-gcb
Copy link
Contributor

I'll point out that whether you end up with primer sequence in your reads is largely a function of what sequencing primers you use.

For example, the EMP 16S protocol (http://press.igsb.anl.gov/earthmicrobiome/protocols-and-standards/16s/) uses custom sequencing primers that match the PCR primer sequences. In this way, the first base you read off the sequencer is the first base of the amplicon.

@yh330 You may want to ask whoever did your sequencing, but my guess is that they used standard Illumina adapters.

For R1:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx5’-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN GTGCCAGCMGCCGCGGTAA -3’ (56bp)
AATGATACGGCGACCACCGAGA{TCTACAC}-[i5 index] ACACTCTTTCCCTACACGACG
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxACACTCTTTCCCTACACGACGCTCTTCCGATCT

The first line is your first round PCR adapter, second line is the 2nd PCR adapter, and third line is the Illumina R2 sequencing primer. That means the your R1 sequences start from NNNN-GTGCCAGCMGCCGCGGTAA. Using trimLeft=10 will remove NNNGTGCCA, leaving you with the GCAGCCGCG.. that most of your top sequences start with. That implies you'd actually want trimLeft=23.

For R2:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx5’-AGACGTGTGCTCTTCCGATCT GGACTACHVGGGTWTCTAAT -3’ (41bp)
CAAGCAGAAGACGGCATACGAGAT-[i7 index]-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

That means your R2 should start from GGACTA. You should check your fastq files to see if this is indeed the case as I can't tell from your list of top sequences.

@benjjneb
Copy link
Owner

benjjneb commented Jul 8, 2017

Closing under assumption the above solved your problem, but please reopen if still stuck.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants