Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trouble running the pipeline for a bacterial species #99

Closed
anzezupanic opened this issue Jul 29, 2020 · 13 comments
Closed

trouble running the pipeline for a bacterial species #99

anzezupanic opened this issue Jul 29, 2020 · 13 comments

Comments

@anzezupanic
Copy link

anzezupanic commented Jul 29, 2020

Hi,

after successfully running the pipeline for human and zebrafish, I've encountered some problems with a bacterial species. After going through (hopefully all) issues here that were already encountered and solved on github, I can't really find a solution for this one, so maybe you can help?
The error occurs at tximeta, here's an error Output, apparently there is a problem with .makesplicing. Any ideas on what may be causing this?

args <- (commandArgs(trailingOnly = TRUE))
for (i in seq_len(length(args))) {

  • eval(parse(text = args[[i]]))
    
  • }

suppressPackageStartupMessages({

  • library(dplyr)
    
  • library(tximport)
    
  • library(tximeta)
    
  • library(SingleCellExperiment)
    
  • })

print(salmondir)
[1] "example_data/output/salmon"
print(json)
[1] "example_data/reference/SalmonIndex/Cg.sidx.json"
print(metafile)
[1] "example_data/metadata.txt"
print(outrds)
[1] "example_data/output/outputR/tximeta_se.rds"
print(annotation)
[1] "Gencode"
print(organism)
[1] "Corynebacterium_glutamicum"

Load json linkedTxome

loadLinkedTxome(json)
saving linkedTxome in bfc (first time)

Read metadata

metadata <- read.delim(metafile, header = TRUE, as.is = TRUE, sep = "\t")

List Salmon directories

salmonfiles <- paste0(salmondir, "/", metadata$names, "/quant.sf")
names(salmonfiles) <- metadata$names

Add file column to metadata and import annotated abundances

In transcript level

coldata <- cbind(metadata, files = salmonfiles, stringsAsFactors = FALSE)
st <- tximeta::tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 2 3 4
found matching linked transcriptome:
[ GENCODE - Corynebacterium glutamicum - release 1 ]
building TxDb with 'GenomicFeatures' package
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .make_splicings(exons, cds, stop_codons) :
some CDS cannot be mapped to an exon
Calls: ... makeTxDbFromGFF -> makeTxDbFromGRanges -> .make_splicings
In addition: Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type
stop_codon. This information was ignored.
Execution halted

best

Anze

@csoneson
Copy link
Owner

csoneson commented Aug 1, 2020

Hm. ARMOR is really intended for use with species where you have a reference annotation from either GENCODE or Ensembl. Where did you get your reference annotation from? Maybe there's something that can help you in this thread: #97, but we do not make any guarantees for non-{GENCODE, Ensembl} references.

What's causing your specific error is this:

some CDS cannot be mapped to an exon

I'm not sure what your GTF file looks like, but this must work (for tximeta to work):

txdb <- makeTxDbFromGFF(gtf_file)

@anzezupanic
Copy link
Author

Thank you Charlotte, sorry for the late reply, was on holidays last week.

I got the gtf from NCBI (tried a GenBank and RefSeq assembly), the gtf looks reasonably standard, but guess not. Will look into the details this week and report about any possible solutions I can find. We are interested in using the pipeline with plant genomes as well (not all in Emsembl), but if there's too much trouble making adjustments, we will probably look for other options.
I can send the gtf file if it's of any use, or better here's the link: https://www.ncbi.nlm.nih.gov/assembly/GCF_002277975.1.

@csoneson
Copy link
Owner

So the issue indeed seems to be that the GTF file can not be read with makeTxDbFromGFF() (I downloaded and tried the GenBank one), seemingly because the CDS entries can not be matched with the exon entries. It appears that makeTxDbFromGFF() does support some GTF files where CDSs are direct children of genes (see e.g. this post from a few years back: http://supportupgrade.bioconductor.org/p/71117/), but this one does not seem to fall in that category. If I just crudely replace all CDS with exon in the file, it can be read. So this at least provides an explanation. As for a solution - I would first like to check with @mikelove if he has ever seen (and maybe solved 😃) a similar issue when reading non-model organism data with tximeta. Otherwise we'll see if it can be addressed in another way.

@mikelove
Copy link

I haven’t seen this one yet. Let me know if there are ways I can help.

@anzezupanic
Copy link
Author

I've now tried to use an Ensembl genome annotation of a parent strain (with more genes) available in gff3, which I transformed into gtf before running. This has lead me to a completely different error:

Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... OK
generating transcript ranges
Error in checkAssays2Txps(assays, txps) :
none of the transcripts in the quantification files are in the GTF
Calls: -> checkAssays2Txps
Execution halted

which I am currently interpretting is caused by a loss of gene-level data when going from the gff3 to gtf. Here's the gtf file, with only exon and CDS data:
atcc_13032_v4.gtf.txt
Since the gene level data is available in the gff3 file, i just need to convert it better to the gtf and I guess it should work?
Corynebacterium_glutamicum_atcc_13032.ASM19633v1.47.chromosome.Chromosome.gff3.txt

So, not solution yet, but working on it.

@csoneson
Copy link
Owner

Did you requantify the data with the new annotation? It looks like the transcript IDs are not matching.

@csoneson
Copy link
Owner

As for the original GTF file, somehow all the transcripts with annotated CDSs are called "unknown_transcript_1" (regardless of the associated gene), which seems a bit strange.

@anzezupanic
Copy link
Author

yes, the original gtf really is strange.

yes, I did requantify (i always delete all outputs and the salmon and star indexes, then run the pipeline on a small part of my dataset, my understanding is this forces requantification), this should not be the reason for transcript ids not matching, i think

@csoneson
Copy link
Owner

Could you post just the first lines of one of the quant.sf files from Salmon?

@anzezupanic
Copy link
Author

Name Length EffectiveLength TPM NumReads
CAF18566 1575 1456.128 72.956849 179.012
CAF18567 327 162.000 0.000000 0.000
CAF18568 1185 913.293 344.404551 530.024
CAF18569 1185 992.982 82.987673 138.858
CAF18570 489 328.875 43.307604 24.000
CAF18571 2055 2078.305 377.541652 1322.179
CAF18572 969 943.521 25.158942 40.000
CAF18573 672 506.014 107.042140 91.271

@csoneson
Copy link
Owner

Right. So the transcript IDs in the GTF file are of the form transcript:CAF18566 (not CAF18566).

@anzezupanic
Copy link
Author

I see, here's where my utter lack of understanding of how gtf files work let me down. So, if I just replace the form in the GTF file, it should work. I will try later today, thanks!

@anzezupanic
Copy link
Author

Ran the whole thing over night, and the salmon to edgeR part is solved now, only the shiny part now fails. But, since this is another topic, I'll try and figure it out myself, but if it does not work, will open a new issue. I have a much better idea now of what to look out for when using ARMOR for non-model organisms, thanks for all your help!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants