Running LRASms_Zymo.Rmd and hung up at learnErrors() #7

Closed

mkhemmani opened this issue Feb 9, 2024 · 6 comments

@mkhemmani
Hi benjjneb,

I've downloaded the fastq.gz from NCBI corresponding to zymo_CCS_99_9.fastq.gz (SRR9089357), and while going through LRASms_Zymo.Rmd I'm hung up at around line 92:

err <- learnErrors(drp, BAND_SIZE=32, multithread=TRUE, errorEstimationFunction=dada2:::PacBioErrfun)

The output is: Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.

The dada2 version is 1.30.0.

I'm comparing my running Rmd to your GitHub LRASms_Zymo.html and I'm noticing some differences:

track <- fastqFilter(nop, filt, minQ=3, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2, verbose=TRUE)
Warning: '.\Zymo\noprimers\filtered' already exists
Overwriting file:./Zymo//noprimers/filtered/zymo_CCS_99_9.fastq.gz
Read in 73057, output 72940 (99.8%) filtered sequences.

drp <- derepFastq(filt, verbose=TRUE)
Dereplicating sequence entries in Fastq file: ./Zymo//noprimers/filtered/zymo_CCS_99_9.fastq.gz
Encountered 22309 unique sequences from 72940 total sequences read.

err <- learnErrors(drp, BAND_SIZE=32, multithread=TRUE, errorEstimationFunction=dada2:::PacBioErrfun)
Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.

I do not know whether the higher number of output reads, 72940 (99.8%) vs. 69367 (94.9%), has anything to do with the eventual learnErrors() error.

I had a similar situation running the LRASms_fecal.Rmd as well.

Thank you for your time,
Mark

@benjjneb
Owner

I do not know whether the higher number of output reads, 72940 (99.8%) vs. 69367 (94.9%), has anything to do with the eventual learnErrors() error.

That is strange to me. As dada2 versions are updated there have occasionally been minor alterations to the output of the base functions, but this seems beyond that. Can you post your full analysis script up to learnErrors, and the sessionInfo() output at the end?

@mkhemmani
Author

Here you go:

library(dada2);packageVersion("dada2")
[1] ‘1.30.0’
library(Biostrings);packageVersion("Biostrings")
[1] ‘2.70.1’
library(ShortRead);packageVersion("ShortRead")
[1] ‘1.60.0’
library(ggplot2);packageVersion("ggplot2")
[1] ‘3.4.4’
library(reshape2);packageVersion("reshape2")
[1] ‘1.4.4’
library(RColorBrewer);packageVersion("RColorBrewer")
[1] ‘1.1.3’
path <- "./Data/Zymo" # CHANGE ME to location of the fastq file
path.out <- "Figures/"
path.rds <- "RDS/"
fn <- file.path(path, "zymo_CCS_99_9.fastq.gz")
F27 <- "AGRGTTYGATYMTGGCTCAG"
R1492 <- "RGYTACCTTGTTACGACTT"
rc <- dada2:::rc
theme_set(theme_bw())
genusPalette <- c(Bacillus="#e41a1c", Enterococcus="#377eb8", Escherichia="#4daf4a", Lactobacillus="#984ea3", Listeria="#ff7f00", Pseudomonas="#ffff33", Salmonella="#a65628", Staphylococcus="#f781bf")
nop <- file.path(path, "noprimers", basename(fn))
prim <- removePrimers(fn, nop, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE, verbose=TRUE)
Multiple matches to the primer(s) in some sequences. Using the longest possible match.
39439 sequences out of 77453 are being reverse-complemented.
Overwriting file:C:\Users\mkhem\OneDrive - Loyola University Chicago\Wolfe\Current Studies\R Work\LRAS\Data\Zymo\noprimers\zymo_CCS_99_9.fastq.gz
Read in 77453, output 73057 (94.3%) filtered sequences.

hist(nchar(getSequences(nop)), 100)
dir.create("./Data/Zymo/noprimers/filtered")
Warning: '.\Data\Zymo\noprimers\filtered' already exists
filt <- file.path(path, "noprimers", "filtered", basename(fn))
track <- fastqFilter(nop, filt, minQ=3, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2, verbose=TRUE)
Overwriting file:./Data/Zymo/noprimers/filtered/zymo_CCS_99_9.fastq.gz
Read in 73057, output 72940 (99.8%) filtered sequences.

drp <- derepFastq(filt, verbose=TRUE)
Dereplicating sequence entries in Fastq file: ./Data/Zymo/noprimers/filtered/zymo_CCS_99_9.fastq.gz
Encountered 22309 unique sequences from 72940 total sequences read.

err <- learnErrors(drp, BAND_SIZE=32, multithread=TRUE, errorEstimationFunction=dada2:::PacBioErrfun) # 10s of seconds
107587452 total bases in 72940 reads from 1 samples will be used for learning the error rates.
The max qual score of 93 was not detected. Using standard error fitting.
Error rates could not be estimated (this is usually because of very few reads).
Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.

R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] RColorBrewer_1.1-3 reshape2_1.4.4 ggplot2_3.4.4
[4] ShortRead_1.60.0 GenomicAlignments_1.38.2 SummarizedExperiment_1.32.0
[7] Biobase_2.62.0 MatrixGenerics_1.14.0 matrixStats_1.2.0
[10] Rsamtools_2.18.0 GenomicRanges_1.54.1 BiocParallel_1.36.0
[13] Biostrings_2.70.1 GenomeInfoDb_1.38.5 XVector_0.42.0
[16] IRanges_2.36.0 S4Vectors_0.40.2 BiocGenerics_0.48.1
[19] dada2_1.30.0 Rcpp_1.0.11

loaded via a namespace (and not attached):
[1] gtable_0.3.4 xfun_0.42 hwriter_1.3.2.1 latticeExtra_0.6-30
[5] lattice_0.22-5 vctrs_0.6.5 tools_4.3.2 bitops_1.0-7
[9] generics_0.1.3 parallel_4.3.2 tibble_3.2.1 fansi_1.0.6
[13] pkgconfig_2.0.3 Matrix_1.6-4 RcppParallel_5.1.7 lifecycle_1.0.4
[17] GenomeInfoDbData_1.2.11 compiler_4.3.2 stringr_1.5.1 deldir_2.0-2
[21] munsell_0.5.0 codetools_0.2-19 htmltools_0.5.7 yaml_2.3.8
[25] RCurl_1.98-1.13 pillar_1.9.0 crayon_1.5.2 DelayedArray_0.28.0
[29] abind_1.4-5 tidyselect_1.2.0 digest_0.6.33 stringi_1.8.3
[33] dplyr_1.1.4 fastmap_1.1.1 grid_4.3.2 colorspace_2.1-0
[37] cli_3.6.2 SparseArray_1.2.2 magrittr_2.0.3 S4Arrays_1.2.0
[41] utf8_1.2.4 withr_3.0.0 scales_1.3.0 rmarkdown_2.25
[45] jpeg_0.1-10 interp_1.1-6 png_0.1-8 evaluate_0.23
[49] knitr_1.45 rlang_1.1.2 glue_1.6.2 rstudioapi_0.15.0
[53] R6_2.5.1 plyr_1.8.9 zlibbioc_1.48.0

@mkhemmani
Author

Also, because I was concerned that the fastq.gz file might be different after downloading, here is the md5sum:
md5sum zymo_CCS_99_9.fastq.gz
d0d9ab9d71b47b40fe31bfe8c9353154 zymo_CCS_99_9.fastq.gz
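For reference, the same checksum can also be computed from inside R with the base tools package (a minimal sketch; the filename assumes the download sits in the working directory):

# tools is part of base R, so no extra install is needed
library(tools)
md5sum("zymo_CCS_99_9.fastq.gz")  # returns a named character vector of MD5 hashes, one per file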

@benjjneb
Owner

Mystery solved (although another mystery is opened)...

I believe you got the fastq file you are using by downloading it from the SRA using the "FASTA/FASTQ download" tab on the web interface (screenshot below).

[Screenshot: SRA Run Browser web interface, "FASTA/FASTQ download" tab]

This fastq file is not the same as the one we submitted. Instead, all of the quality scores have been removed and replaced with Q30 at every position (see plotQualityProfile of this file). The original fastq file we submitted, which will allow you to reproduce our results, is available at the bottom of the "Data Access" tab on the web interface (screenshot below).

[Screenshot: SRA Run Browser web interface, "Data Access" tab]
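For anyone comparing their own download, a quick way to see this is to plot the quality profile of the fastq (a minimal sketch, assuming the file path used in the script above): the web-interface download shows a flat line at Q30 across every position, while the original submitted file shows real per-base variation.

library(dada2)
# Plot per-cycle quality scores of the downloaded fastq;
# a completely flat profile at Q30 indicates the substituted quality scores
plotQualityProfile("./Data/Zymo/zymo_CCS_99_9.fastq.gz")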

But now my question is: When and why did the SRA start making fastq files with made-up quality scores available for download?

@benjjneb
Owner

An update from NCBI:

The data you are seeing in this case is the SRA Lite format. You can find more information about our data formats here:

https://www.ncbi.nlm.nih.gov/sra/docs/sra-data-formats/

But in short, in the SRA Lite format each read will have constant quality scores of either 3 or 30. The SRA normalized format includes the full per base quality scores and can be accessed with the SRA toolkit.
Alternatively, in this case the submitted fastq file can be downloaded from the Run Browser 'data access' page, in the 'original format' section:

https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR9089357&display=data-access

So, in short, don't use the "FASTA/FASTQ download" link; use the SRA toolkit to get the sequencing data.
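A programmatic way to check which copy you have (a sketch using Biostrings, which is already attached in the session above; the file path is the one used earlier in this thread):

library(Biostrings)
# Read the fastq together with its quality strings
fq <- readQualityScaledDNAStringSet("./Data/Zymo/zymo_CCS_99_9.fastq.gz")
# Decode the Phred+33 quality strings of a few reads into integer scores.
# An SRA Lite download has a single constant value (3 or 30) at every base;
# the original submitted file shows a wide range of per-base scores.
qstrings <- as.character(quality(head(fq, 10)))
qscores <- lapply(qstrings, function(s) utf8ToInt(s) - 33L)
range(unlist(qscores))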

@mkhemmani
Author

Thank you for the follow-ups. I suppose NCBI needs to find ways to trim the fat and push the larger files into cloud services.

curl https://sra-pub-src-1.s3.amazonaws.com/SRR9089357/zymo_CCS_99_9.fastq.gz.1 --output zymo_CCS_99_9.fastq.gz

However, I noticed that some of the other original sequence files are stuck behind AWS S3, which is paywall-ish.

One other thing I caught in the Zymo code is that the "tax" variable returns a table with "Escherichia" as a "typo" in the Silva call, which someone serendipitously asked about on Stack Overflow: https://stackoverflow.com/questions/77170403/error-while-running-pacbio-dada2-workflow/78004296#78004296
I addressed it there.
