Counts Erroneously Assigned to First Gene in gff file #613

monicabritton · 2019-04-08T21:58:31Z

Some gff3 files (ahem, NCBI) can be composed of lines from a variety of different annotation packages (as indicated in field 2 of the file). I discovered that if the "gene_id" (or equivalent) is not present on some exon lines, the counts associated with those exons are added to the counts for the first gene in the gff3 file.

It would be best if STAR failed during such a run, or at least issued a strong warning message to stderr.

LMK if you want an example ....

monicabritton · 2019-04-08T23:16:52Z

Amendment to my previous post: A "WARNING" message was written to the Log for each of the non-standard exon lines. This was when I indexed the genome (and included the gff3 file in the indexing.)
Sadly, I don't look through the Log much if the job completes without error. No warnings or errors appear in the STAR alignment run, likely because I don't specify the gff3 file there.

alexdobin · 2019-04-10T14:08:47Z

Hi Monica,

I think I need to make this checking optional, as, in principle, the GFF3 file is not required to have a gene_id tag. I could also have a "NoGeneID" gene where all such exons will belong.

Cheers
Alex

monicabritton · 2019-04-10T16:03:19Z

Yes, I like your idea of a "NoGeneID" line in the top "N_" section of ReadsPerGene.out.tab. That makes it very obvious and able to be accounted for when the counts are parsed.

Thanks!
Monica

benslack19 · 2019-04-11T23:01:11Z

Hi @monicabritton , are you also trying a different counting method by any chance?

I'm evaluating counts from STAR BAM files but using featureCounts. I had to convert the NCBI GFF to GTF using this and a custom script to make the "gene_id" tag start first.

I was curious what people's experiences are with using the -quantMode GeneCounts feature within STAR versus another quantifier like featureCounts.

monicabritton · 2019-04-12T07:44:46Z

Hi @benslack19 , I have used featureCounts occasionally, if I need to "re-count" an existing bam, and don't want to re-run the alignment for some reason. If the reads in the bam are PE, then it is necessary to "repair" the bam file to make it name sorted, rather than Coordinate sorted, which takes time and server space, so with PE reads it's probably faster to just re-run STAR, unless you already have the name-sorted bam.

I have run featureCounts using gff3 without converting to GTF, but I almost always make edits to NCBI gff3 files, even when I use them with STAR, primarily because of fields like:
Dbxref=Genbank:NP_148939.1,GeneID:803438
as well as inconsistencies within the gff3 files. I'm not sure if converting to GTF eliminates all of these?

benslack19 · 2019-04-12T14:26:57Z

Thanks @monicabritton . This is very helpful.

The cufflinks gffread package I alluded to earlier converts the NCBI gff column 9 to look like this:
transcript_id "rna0"; gene_id "gene0"; gene_name "DDX11L1";
but in my experience, it didn't work "out of the box" in featureCounts. I had to re-order the line so that gene_id started first.

I haven't fully verified if converting eliminates all gff3 inconsistencies, but I'm getting some lower than expected counts in featureCounts for some genes. I'm still evaluating whether it's because of the gff conversion or some other reason. Trying the quantMode within STAR and just editing the gff3 file could be a way to check. Can I ask what other "inconsistencies within the gff3 files" I should edit for other than the Dbxref observation you pointed out?

alexdobin · 2019-04-12T20:26:36Z

Hi @benslack19, @monicabritton

as far as I remember, featureCounts uses slightly different policy from STAR (=htseq-count) for counting PE reads towards genes - if mate1 overlaps geneA, while mate2 overlaps geneA and geneB, STAR will consider this read Ambiguous, while featureCounts will count it towards geneA.

Cheers
Alex

benslack19 · 2019-04-14T03:06:04Z

Thanks @alexdobin

odoublewen · 2019-08-30T20:59:37Z

According to the change log for STAR-2.7.1a, this issue has been resolved. (So, @alexdobin you may want to close this issue?)

I ran into a very similar issue to this when using v2.5.3a with -quantMode GeneCounts, which I have described in this google-groups posting: "odd behavior in ReadsPerGene.out.tab strand columns, with non-gencode refs".

In a nutshell, the problem was that GTF features lacking transcript_id (in my case, I was creating features that had only gene_id) were getting assigned to the strand of the first gene in the GTF file (but not assigned first gene itself.) So seems a little different from what is discussed in this thread.

In any case, updating to >= STAR 2.7.1a resolves the problem for me.

alexdobin · 2019-09-01T17:45:19Z

Hi Owen,

thanks for sharing the results of your detailed investigation into this problem, it will be helpful to to other users.

Cheers
Alex

alexdobin added the enhancement label Apr 8, 2019

alexdobin added the resolved problem or issue that has been resolved label Sep 1, 2019

alexdobin closed this as completed Sep 1, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Counts Erroneously Assigned to First Gene in gff file #613

Counts Erroneously Assigned to First Gene in gff file #613

monicabritton commented Apr 8, 2019

monicabritton commented Apr 8, 2019

alexdobin commented Apr 10, 2019

monicabritton commented Apr 10, 2019

benslack19 commented Apr 11, 2019

monicabritton commented Apr 12, 2019

benslack19 commented Apr 12, 2019

alexdobin commented Apr 12, 2019

benslack19 commented Apr 14, 2019

odoublewen commented Aug 30, 2019

alexdobin commented Sep 1, 2019

Counts Erroneously Assigned to First Gene in gff file #613

Counts Erroneously Assigned to First Gene in gff file #613

Comments

monicabritton commented Apr 8, 2019

monicabritton commented Apr 8, 2019

alexdobin commented Apr 10, 2019

monicabritton commented Apr 10, 2019

benslack19 commented Apr 11, 2019

monicabritton commented Apr 12, 2019

benslack19 commented Apr 12, 2019

alexdobin commented Apr 12, 2019

benslack19 commented Apr 14, 2019

odoublewen commented Aug 30, 2019

alexdobin commented Sep 1, 2019