Running LRASms_Zymo.Rmd and hung up at learnErrors() #7

Closed

mkhemmani opened this issue Feb 9, 2024 · 6 comments

@mkhemmani
Hi benjjneb,

I've downloaded the fastq.gz from NCBI corresponding to zymo_CCS_99_9.fastq.gz (SRR9089357), and while going through LRASms_Zymo.Rmd I'm hung up at around line 92:

err <- learnErrors(drp, BAND_SIZE=32, multithread=TRUE, errorEstimationFunction=dada2:::PacBioErrfun)

The output is: Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.

The dada2 version is 1.30.0.

I'm comparing my running Rmd to your GitHub LRASms_Zymo.html and I'm noticing some differences:

track <- fastqFilter(nop, filt, minQ=3, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2, verbose=TRUE)
Warning: '.\Zymo\noprimers\filtered' already exists
Overwriting file:./Zymo//noprimers/filtered/zymo_CCS_99_9.fastq.gz
Read in 73057, output 72940 (99.8%) filtered sequences.

drp <- derepFastq(filt, verbose=TRUE)
Dereplicating sequence entries in Fastq file: ./Zymo//noprimers/filtered/zymo_CCS_99_9.fastq.gz
Encountered 22309 unique sequences from 72940 total sequences read.

err <- learnErrors(drp, BAND_SIZE=32, multithread=TRUE, errorEstimationFunction=dada2:::PacBioErrfun)
Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.

I do not know whether the higher number of output reads, 72940 (99.8%) vs. 69367 (94.9%), has anything to do with the eventual learnErrors() error.

I had a similar situation running the LRASms_fecal.Rmd as well.

Thank you for your time,
Mark

@benjjneb
Owner

I do not know whether the higher number of output reads, 72940 (99.8%) vs. 69367 (94.9%), has anything to do with the eventual learnErrors() error.

That is strange to me. As dada2 versions are updated there have occasionally been minor alterations to the output of the base functions, but this seems beyond that. Can you post your full analysis script up to learnErrors, and the sessionInfo() output at the end?

@mkhemmani
Author

Here you go:

library(dada2);packageVersion("dada2")
[1] ‘1.30.0’
library(Biostrings);packageVersion("Biostrings")
[1] ‘2.70.1’
library(ShortRead);packageVersion("ShortRead")
[1] ‘1.60.0’
library(ggplot2);packageVersion("ggplot2")
[1] ‘3.4.4’
library(reshape2);packageVersion("reshape2")
[1] ‘1.4.4’
library(RColorBrewer);packageVersion("RColorBrewer")
[1] ‘1.1.3’
path <- "./Data/Zymo" # CHANGE ME to location of the fastq file
path.out <- "Figures/"
path.rds <- "RDS/"
fn <- file.path(path, "zymo_CCS_99_9.fastq.gz")
F27 <- "AGRGTTYGATYMTGGCTCAG"
R1492 <- "RGYTACCTTGTTACGACTT"
rc <- dada2:::rc
theme_set(theme_bw())
genusPalette <- c(Bacillus="#e41a1c", Enterococcus="#377eb8", Escherichia="#4daf4a", Lactobacillus="#984ea3", Listeria="#ff7f00", Pseudomonas="#ffff33", Salmonella="#a65628", Staphylococcus="#f781bf")
nop <- file.path(path, "noprimers", basename(fn))
prim <- removePrimers(fn, nop, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE, verbose=TRUE)
Multiple matches to the primer(s) in some sequences. Using the longest possible match.
39439 sequences out of 77453 are being reverse-complemented.
Overwriting file:C:\Users\mkhem\OneDrive - Loyola University Chicago\Wolfe\Current Studies\R Work\LRAS\Data\Zymo\noprimers\zymo_CCS_99_9.fastq.gz
Read in 77453, output 73057 (94.3%) filtered sequences.

hist(nchar(getSequences(nop)), 100)
dir.create("./Data/Zymo/noprimers/filtered")
Warning: '.\Data\Zymo\noprimers\filtered' already exists
filt <- file.path(path, "noprimers", "filtered", basename(fn))
track <- fastqFilter(nop, filt, minQ=3, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2, verbose=TRUE)
Overwriting file:./Data/Zymo/noprimers/filtered/zymo_CCS_99_9.fastq.gz
Read in 73057, output 72940 (99.8%) filtered sequences.

drp <- derepFastq(filt, verbose=TRUE)
Dereplicating sequence entries in Fastq file: ./Data/Zymo/noprimers/filtered/zymo_CCS_99_9.fastq.gz
Encountered 22309 unique sequences from 72940 total sequences read.

err <- learnErrors(drp, BAND_SIZE=32, multithread=TRUE, errorEstimationFunction=dada2:::PacBioErrfun) # 10s of seconds
107587452 total bases in 72940 reads from 1 samples will be used for learning the error rates.
The max qual score of 93 was not detected. Using standard error fitting.
Error rates could not be estimated (this is usually because of very few reads).
Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.

R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] RColorBrewer_1.1-3 reshape2_1.4.4 ggplot2_3.4.4
[4] ShortRead_1.60.0 GenomicAlignments_1.38.2 SummarizedExperiment_1.32.0
[7] Biobase_2.62.0 MatrixGenerics_1.14.0 matrixStats_1.2.0
[10] Rsamtools_2.18.0 GenomicRanges_1.54.1 BiocParallel_1.36.0
[13] Biostrings_2.70.1 GenomeInfoDb_1.38.5 XVector_0.42.0
[16] IRanges_2.36.0 S4Vectors_0.40.2 BiocGenerics_0.48.1
[19] dada2_1.30.0 Rcpp_1.0.11

loaded via a namespace (and not attached):
[1] gtable_0.3.4 xfun_0.42 hwriter_1.3.2.1 latticeExtra_0.6-30
[5] lattice_0.22-5 vctrs_0.6.5 tools_4.3.2 bitops_1.0-7
[9] generics_0.1.3 parallel_4.3.2 tibble_3.2.1 fansi_1.0.6
[13] pkgconfig_2.0.3 Matrix_1.6-4 RcppParallel_5.1.7 lifecycle_1.0.4
[17] GenomeInfoDbData_1.2.11 compiler_4.3.2 stringr_1.5.1 deldir_2.0-2
[21] munsell_0.5.0 codetools_0.2-19 htmltools_0.5.7 yaml_2.3.8
[25] RCurl_1.98-1.13 pillar_1.9.0 crayon_1.5.2 DelayedArray_0.28.0
[29] abind_1.4-5 tidyselect_1.2.0 digest_0.6.33 stringi_1.8.3
[33] dplyr_1.1.4 fastmap_1.1.1 grid_4.3.2 colorspace_2.1-0
[37] cli_3.6.2 SparseArray_1.2.2 magrittr_2.0.3 S4Arrays_1.2.0
[41] utf8_1.2.4 withr_3.0.0 scales_1.3.0 rmarkdown_2.25
[45] jpeg_0.1-10 interp_1.1-6 png_0.1-8 evaluate_0.23
[49] knitr_1.45 rlang_1.1.2 glue_1.6.2 rstudioapi_0.15.0
[53] R6_2.5.1 plyr_1.8.9 zlibbioc_1.48.0

@mkhemmani
Author

Also, because I was concerned that the fastq.gz file might be different after downloading, here is the md5sum:
md5sum zymo_CCS_99_9.fastq.gz
d0d9ab9d71b47b40fe31bfe8c9353154 zymo_CCS_99_9.fastq.gz
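For reference, the same checksum can also be computed from inside R with the base tools package (a minimal sketch; the filename assumes the download sits in the working directory):

# tools is part of base R, so no extra install is needed
library(tools)
md5sum("zymo_CCS_99_9.fastq.gz")  # returns a named character vector of MD5 hashes, one per file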

@benjjneb
Owner

Mystery solved (although another mystery is opened)...

I believe you got the fastq file you are using by downloading it from the SRA using the "FASTA/FASTQ download" tab on the web interface (screenshot below).

[Screenshot: SRA Run Browser web interface, "FASTA/FASTQ download" tab]

This fastq file is not the same as the one we submitted. Instead, all of the quality scores have been removed and replaced with Q30 at every position (see plotQualityProfile of this file). The original fastq file we submitted, which will allow you to reproduce our results, is available at the bottom of the "Data Access" tab on the web interface (screenshot below).

[Screenshot: SRA Run Browser web interface, "Data Access" tab]
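For anyone comparing their own download, a quick way to see this is to plot the quality profile of the fastq (a minimal sketch, assuming the file path used in the script above): the web-interface download shows a flat line at Q30 across every position, while the original submitted file shows real per-base variation.

library(dada2)
# Plot per-cycle quality scores of the downloaded fastq;
# a completely flat profile at Q30 indicates the substituted quality scores
plotQualityProfile("./Data/Zymo/zymo_CCS_99_9.fastq.gz")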

But now my question is: When and why did the SRA start making fastq files with made-up quality scores available for download?

@benjjneb
Owner

An update from NCBI:

The data you are seeing in this case is the SRA Lite format. You can find more information about our data formats here:

https://www.ncbi.nlm.nih.gov/sra/docs/sra-data-formats/

But in short, in the SRA Lite format each read will have constant quality scores of either 3 or 30. The SRA normalized format includes the full per base quality scores and can be accessed with the SRA toolkit.
Alternatively, in this case the submitted fastq file can be downloaded from the Run Browser 'data access' page, in the 'original format' section:

https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR9089357&display=data-access

So, in short, don't use the "FASTA/FASTQ download" link; use the SRA toolkit to get the sequencing data.
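A programmatic way to check which copy you have (a sketch using Biostrings, which is already attached in the session above; the file path is the one used earlier in this thread):

library(Biostrings)
# Read the fastq together with its quality strings
fq <- readQualityScaledDNAStringSet("./Data/Zymo/zymo_CCS_99_9.fastq.gz")
# Decode the Phred+33 quality strings of a few reads into integer scores.
# An SRA Lite download has a single constant value (3 or 30) at every base;
# the original submitted file shows a wide range of per-base scores.
qstrings <- as.character(quality(head(fq, 10)))
qscores <- lapply(qstrings, function(s) utf8ToInt(s) - 33L)
range(unlist(qscores))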

@mkhemmani
Author

Thank you for the follow-ups. I suppose NCBI needs to find ways to trim the fat and push the larger files into cloud services.

curl https://sra-pub-src-1.s3.amazonaws.com/SRR9089357/zymo_CCS_99_9.fastq.gz.1 --output zymo_CCS_99_9.fastq.gz

However, I noticed that some of the other original sequence files are stuck behind AWS S3, which is paywall-ish.

One other thing I caught in the Zymo code is that the "tax" variable returns a table with "Escherichia" as a "typo" in the Silva call, which someone serendipitously asked about on Stack Overflow: https://stackoverflow.com/questions/77170403/error-while-running-pacbio-dada2-workflow/78004296#78004296
I addressed it there.
