Sequence Extraction #1321

markdjohnson · 2021-04-15T15:12:18Z

Hello,

I am using dada2 for a metabarcoding project and I was curious if there was a way for me to extract the sequences identified during the tracking step of the dada2 pipeline. For instance, I would like to get those sequences that are under the nonchim catagory and then run a blast on those now filtered sequences. How can I extract them?

Any help would be apprecaited!
Mark Johnson

input filtered denoisedF denoisedR merged nonchim
21-1-trnL 6242 5657 5542 5555 5509 5508
21-2-trnL 8631 8104 8050 8062 8039 8012
21-3-trnL 11648 10953 10839 10851 10771 10605
21-4-trnL 9738 9120 8936 8986 8771 8731
21-5-trnL 7388 6925 6824 6866 6700 6535
21-6-trnL 7120 6721 6621 6634 6531 6269

track

benjjneb · 2021-04-16T01:45:38Z

Identifying the provenance of each denoised read is fairly straightforward, up until merging (where it is still completely possible, just a bit harder). See this issues and those linked for some guidance: #354

markdjohnson · 2021-04-16T15:39:34Z

Thank you so much, I really appreciate it. I'm about to go into the field for a few days so I may comment again next week if I am still having issues.

Mark

markdjohnson · 2021-04-19T19:35:31Z

Hello Ben,

So I wanted to clarify my request a little bit because I am still having a bit of trouble. So down below is a piece of the summary table from the DADA2 pipeline. I would like to extract only the 5508 sequences shown under the nonchim category. These sequences are the filtered, denoised, merged, and nonchim sequences I would like to run a blast on in genbank. Is there a way to extract these 5508 successfully filtered sequences. Thank you so much for your help, I really appreciate you taking the time to answer questions like these.
input filtered denoisedF denoisedR merged nonchim
Sample 1 6242 5657 5542 5555 5509 5508

Mark

benjjneb · 2021-04-20T01:37:49Z

Ah, you just want to get the sequences in a format that is usable outside R, like BLAST?

library(dada2)
sq <- getSequences(seqtab.nochim)
writeFasta(sq, "path/to/output.fa")

And you should have those sequence in fasta format.

markdjohnson · 2021-04-20T16:18:25Z

Hi Ben,

When I did the sq <- getSequences(seqtab.nochim) it gave a result that had only621 sequences. Based on the table posted below, I would have expected to get 306238 sequences from all my samples combined in the nonchim row. Additionally, when I tried to do the writeFasta it told me it couldn't find the writeFasta function. Any help you could provide would be greatly appreciated. Thank you so much for continuing to answer my questions.

      input filtered denoisedF denoisedR merged nonchim

21-1-trnL 6242 5657 5542 5555 5509 5508
21-2-trnL 8631 8104 8050 8062 8039 8012
21-3-trnL 11648 10953 10839 10851 10771 10605
21-4-trnL 9738 9120 8936 8986 8771 8731
21-5-trnL 7388 6925 6824 6866 6700 6535
21-6-trnL 7120 6721 6621 6634 6531 6269
21-7-trnL 8314 7782 7673 7707 7572 7488
21-8-trnL 8103 7572 7440 7460 7212 7059
21-9-trnL 4615 4161 4025 4034 3835 3738
9-1-trnL 6507 6101 5975 5999 5839 5654
9-2-trnL 6040 5548 5317 5354 5189 5097
9-3-trnL 12217 11507 11367 11401 11283 10999
9-4-trnL 9743 9118 9031 9050 8791 8606
9-5-trnL 9910 8525 8349 8378 8184 8026
9-6-trnL 5904 5547 5487 5497 5350 5246
9-7-trnL 4293 4070 3992 4003 3908 3790
9-8-trnL 8171 7243 7111 7138 6965 6845
9-9-trnL 6832 6457 6397 6382 6236 6164
C1-trnL 8806 7782 7684 7656 7552 7244
C10-trnL 11755 11148 11031 11079 10943 10758
C11-trnL 8960 8513 8366 8380 8225 8174
C12-trnL 10022 9531 9381 9422 9136 9115
C13-trnL 7079 6727 6569 6630 6472 6472
C14-trnL 9020 8601 8465 8482 8292 8284
C15-trnL 9131 8636 8430 8437 8023 8021
C16-trnL 8210 7684 7393 7445 7287 7287
C17-trnL 7233 6877 6780 6808 6683 6467
C18-trnL 11759 11144 10980 11036 10865 10703
C19-trnL 10700 10143 9999 10023 9858 9650
C2-trnL 10426 9654 9507 9515 9372 9156
C20-trnL 6460 6021 5923 5955 5670 5513
C21-trnL 5292 4969 4877 4896 4823 4814
C22-trnL 13226 12564 12407 12440 12096 12039
C3-trnL 11049 9580 9486 9486 9435 9141
C4-trnL 9154 8378 8275 8316 8138 8138
C5-trnL 8176 7497 7427 7439 7357 7178
C6-trnL 9360 8465 8381 8389 8280 8216
C7-trnL 10318 9418 9175 9279 9046 8791
C8-trnL 9885 9089 8993 8990 8879 8878
C9-trnL 8474 8092 7961 7960 7847 7827

benjjneb · 2021-04-22T22:56:58Z

There is a difference between "reads" and "sequences" here. `getSequences' will extract all the "sequences", i.e. the unique DNA sequences in the sequence table. That number will not be the same as the number of reads, which are the numbers that are being shown in the results posted above, because some sequences are represented by many reads.

Additionally, when I tried to do the writeFasta it told me it couldn't find the writeFasta function.

My fault there, you need to load the "ShortRead" package as well as the "dada2" package to use writeFasta in that way.

markdjohnson · 2021-04-23T16:57:17Z

Hi Ben, It looks like I got everything to work, thank you so much for all your help and for running this amazing package! Mark

…

On Thu, Apr 22, 2021 at 5:57 PM Benjamin Callahan ***@***.***> wrote: There is a difference between "reads" and "sequences" here. `getSequences' will extract all the "sequences", i.e. the unique DNA sequences in the sequence table. That number will not be the same as the number of reads, which are the numbers that are being shown in the results posted above, because some sequences are represented by many reads. Additionally, when I tried to do the writeFasta it told me it couldn't find the writeFasta function. My fault there, you need to load the "ShortRead" package as well as the "dada2" package to use writeFasta in that way. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#1321 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ATWFZACETJXAYF4BVHUVWE3TKCSUTANCNFSM427WMMMA> .

markdjohnson · 2021-04-23T18:32:20Z

Hi Ben,

I actually had one hopefully last question for you. Is there a way to see how many reads go into each unique sequence that is extracted. So for example, I have 621 unique sequences, can I see if one of them only had a single read versus one that has 10,000.

Mark

benjjneb · 2021-04-27T14:31:11Z

In R you just can look at the entry in the sequence table corresponding to that sample and ASV. E.g. for the 12th ASV, seqtab.nochim["C12-trnL", 12] will show you how many times that ASV was seen in that sample. To get the total number across all samples, you can look at the column sums, e.g. colSums(seqtab.nochim) will list the total abundance across all samples for each ASV, in order.

benjjneb closed this as completed May 3, 2021

markdjohnson mentioned this issue May 26, 2021

Sequence Extraction #1351

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sequence Extraction #1321

Sequence Extraction #1321

markdjohnson commented Apr 15, 2021

benjjneb commented Apr 16, 2021

markdjohnson commented Apr 16, 2021

markdjohnson commented Apr 19, 2021

benjjneb commented Apr 20, 2021

markdjohnson commented Apr 20, 2021

benjjneb commented Apr 22, 2021

markdjohnson commented Apr 23, 2021 via email

markdjohnson commented Apr 23, 2021

benjjneb commented Apr 27, 2021

Sequence Extraction #1321

Sequence Extraction #1321

Comments

markdjohnson commented Apr 15, 2021

benjjneb commented Apr 16, 2021

markdjohnson commented Apr 16, 2021

markdjohnson commented Apr 19, 2021

benjjneb commented Apr 20, 2021

markdjohnson commented Apr 20, 2021

benjjneb commented Apr 22, 2021

markdjohnson commented Apr 23, 2021 via email

markdjohnson commented Apr 23, 2021

benjjneb commented Apr 27, 2021