trouble running the pipeline for a bacterial species #99

anzezupanic · 2020-07-29T10:40:00Z

Hi,

after successfully running the pipeline for human and zebrafish, I've encountered some problems with a bacterial species. After going through (hopefully all) issues here that were already encountered and solved on github, I can't really find a solution for this one, so maybe you can help?
The error occurs at tximeta, here's an error Output, apparently there is a problem with .makesplicing. Any ideas on what may be causing this?

args <- (commandArgs(trailingOnly = TRUE))
for (i in seq_len(length(args))) {

```
eval(parse(text = args[[i]]))
```
}

suppressPackageStartupMessages({

```
library(dplyr)
```
```
library(tximport)
```
```
library(tximeta)
```
```
library(SingleCellExperiment)
```
})

print(salmondir)
[1] "example_data/output/salmon"
print(json)
[1] "example_data/reference/SalmonIndex/Cg.sidx.json"
print(metafile)
[1] "example_data/metadata.txt"
print(outrds)
[1] "example_data/output/outputR/tximeta_se.rds"
print(annotation)
[1] "Gencode"
print(organism)
[1] "Corynebacterium_glutamicum"

Load json linkedTxome

loadLinkedTxome(json)
saving linkedTxome in bfc (first time)

Read metadata

metadata <- read.delim(metafile, header = TRUE, as.is = TRUE, sep = "\t")

List Salmon directories

salmonfiles <- paste0(salmondir, "/", metadata$names, "/quant.sf")
names(salmonfiles) <- metadata$names

Add file column to metadata and import annotated abundances

In transcript level

coldata <- cbind(metadata, files = salmonfiles, stringsAsFactors = FALSE)
st <- tximeta::tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 2 3 4
found matching linked transcriptome:
[ GENCODE - Corynebacterium glutamicum - release 1 ]
building TxDb with 'GenomicFeatures' package
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .make_splicings(exons, cds, stop_codons) :
some CDS cannot be mapped to an exon
Calls: ... makeTxDbFromGFF -> makeTxDbFromGRanges -> .make_splicings
In addition: Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type
stop_codon. This information was ignored.
Execution halted

best

Anze

The text was updated successfully, but these errors were encountered:

csoneson · 2020-08-01T11:41:47Z

Hm. ARMOR is really intended for use with species where you have a reference annotation from either GENCODE or Ensembl. Where did you get your reference annotation from? Maybe there's something that can help you in this thread: #97, but we do not make any guarantees for non-{GENCODE, Ensembl} references.

What's causing your specific error is this:

some CDS cannot be mapped to an exon

I'm not sure what your GTF file looks like, but this must work (for tximeta to work):

txdb <- makeTxDbFromGFF(gtf_file)

anzezupanic · 2020-08-10T09:23:48Z

Thank you Charlotte, sorry for the late reply, was on holidays last week.

I got the gtf from NCBI (tried a GenBank and RefSeq assembly), the gtf looks reasonably standard, but guess not. Will look into the details this week and report about any possible solutions I can find. We are interested in using the pipeline with plant genomes as well (not all in Emsembl), but if there's too much trouble making adjustments, we will probably look for other options.
I can send the gtf file if it's of any use, or better here's the link: https://www.ncbi.nlm.nih.gov/assembly/GCF_002277975.1.

csoneson · 2020-08-11T11:59:05Z

So the issue indeed seems to be that the GTF file can not be read with makeTxDbFromGFF() (I downloaded and tried the GenBank one), seemingly because the CDS entries can not be matched with the exon entries. It appears that makeTxDbFromGFF() does support some GTF files where CDSs are direct children of genes (see e.g. this post from a few years back: http://supportupgrade.bioconductor.org/p/71117/), but this one does not seem to fall in that category. If I just crudely replace all CDS with exon in the file, it can be read. So this at least provides an explanation. As for a solution - I would first like to check with @mikelove if he has ever seen (and maybe solved 😃) a similar issue when reading non-model organism data with tximeta. Otherwise we'll see if it can be addressed in another way.

mikelove · 2020-08-11T12:20:45Z

I haven’t seen this one yet. Let me know if there are ways I can help.

anzezupanic · 2020-08-11T12:47:47Z

I've now tried to use an Ensembl genome annotation of a parent strain (with more genes) available in gff3, which I transformed into gtf before running. This has lead me to a completely different error:

Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... OK
generating transcript ranges
Error in checkAssays2Txps(assays, txps) :
none of the transcripts in the quantification files are in the GTF
Calls: -> checkAssays2Txps
Execution halted

which I am currently interpretting is caused by a loss of gene-level data when going from the gff3 to gtf. Here's the gtf file, with only exon and CDS data:
atcc_13032_v4.gtf.txt
Since the gene level data is available in the gff3 file, i just need to convert it better to the gtf and I guess it should work?
Corynebacterium_glutamicum_atcc_13032.ASM19633v1.47.chromosome.Chromosome.gff3.txt

So, not solution yet, but working on it.

csoneson · 2020-08-11T13:06:14Z

Did you requantify the data with the new annotation? It looks like the transcript IDs are not matching.

csoneson · 2020-08-11T13:08:30Z

As for the original GTF file, somehow all the transcripts with annotated CDSs are called "unknown_transcript_1" (regardless of the associated gene), which seems a bit strange.

anzezupanic · 2020-08-11T13:23:26Z

yes, the original gtf really is strange.

yes, I did requantify (i always delete all outputs and the salmon and star indexes, then run the pipeline on a small part of my dataset, my understanding is this forces requantification), this should not be the reason for transcript ids not matching, i think

csoneson · 2020-08-11T13:26:17Z

Could you post just the first lines of one of the quant.sf files from Salmon?

anzezupanic · 2020-08-11T13:33:15Z

Name Length EffectiveLength TPM NumReads
CAF18566 1575 1456.128 72.956849 179.012
CAF18567 327 162.000 0.000000 0.000
CAF18568 1185 913.293 344.404551 530.024
CAF18569 1185 992.982 82.987673 138.858
CAF18570 489 328.875 43.307604 24.000
CAF18571 2055 2078.305 377.541652 1322.179
CAF18572 969 943.521 25.158942 40.000
CAF18573 672 506.014 107.042140 91.271

csoneson · 2020-08-11T13:34:58Z

Right. So the transcript IDs in the GTF file are of the form transcript:CAF18566 (not CAF18566).

anzezupanic · 2020-08-11T13:45:44Z

I see, here's where my utter lack of understanding of how gtf files work let me down. So, if I just replace the form in the GTF file, it should work. I will try later today, thanks!

anzezupanic · 2020-08-12T07:15:11Z

Ran the whole thing over night, and the salmon to edgeR part is solved now, only the shiny part now fails. But, since this is another topic, I'll try and figure it out myself, but if it does not work, will open a new issue. I have a much better idea now of what to look out for when using ARMOR for non-model organisms, thanks for all your help!!!

anzezupanic closed this as completed Aug 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

trouble running the pipeline for a bacterial species #99

trouble running the pipeline for a bacterial species #99

anzezupanic commented Jul 29, 2020 •

edited

Loading

Load json linkedTxome

Read metadata

List Salmon directories

Add file column to metadata and import annotated abundances

In transcript level

csoneson commented Aug 1, 2020

anzezupanic commented Aug 10, 2020

csoneson commented Aug 11, 2020

mikelove commented Aug 11, 2020

anzezupanic commented Aug 11, 2020

csoneson commented Aug 11, 2020

csoneson commented Aug 11, 2020

anzezupanic commented Aug 11, 2020

csoneson commented Aug 11, 2020

anzezupanic commented Aug 11, 2020

csoneson commented Aug 11, 2020

anzezupanic commented Aug 11, 2020

anzezupanic commented Aug 12, 2020

trouble running the pipeline for a bacterial species #99

trouble running the pipeline for a bacterial species #99

Comments

anzezupanic commented Jul 29, 2020 • edited Loading

Load json linkedTxome

Read metadata

List Salmon directories

Add file column to metadata and import annotated abundances

In transcript level

csoneson commented Aug 1, 2020

anzezupanic commented Aug 10, 2020

csoneson commented Aug 11, 2020

mikelove commented Aug 11, 2020

anzezupanic commented Aug 11, 2020

csoneson commented Aug 11, 2020

csoneson commented Aug 11, 2020

anzezupanic commented Aug 11, 2020

csoneson commented Aug 11, 2020

anzezupanic commented Aug 11, 2020

csoneson commented Aug 11, 2020

anzezupanic commented Aug 11, 2020

anzezupanic commented Aug 12, 2020

anzezupanic commented Jul 29, 2020 •

edited

Loading