Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Counts Erroneously Assigned to First Gene in gff file #613

Closed
monicabritton opened this issue Apr 8, 2019 · 10 comments
Closed

Counts Erroneously Assigned to First Gene in gff file #613

monicabritton opened this issue Apr 8, 2019 · 10 comments
Labels
enhancement resolved problem or issue that has been resolved

Comments

@monicabritton
Copy link

Some gff3 files (ahem, NCBI) can be composed of lines from a variety of different annotation packages (as indicated in field 2 of the file). I discovered that if the "gene_id" (or equivalent) is not present on some exon lines, the counts associated with those exons are added to the counts for the first gene in the gff3 file.

It would be best if STAR failed during such a run, or at least issued a strong warning message to stderr.

LMK if you want an example ....

@monicabritton
Copy link
Author

Amendment to my previous post: A "WARNING" message was written to the Log for each of the non-standard exon lines. This was when I indexed the genome (and included the gff3 file in the indexing.)
Sadly, I don't look through the Log much if the job completes without error. No warnings or errors appear in the STAR alignment run, likely because I don't specify the gff3 file there.

@alexdobin
Copy link
Owner

Hi Monica,

I think I need to make this checking optional, as, in principle, the GFF3 file is not required to have a gene_id tag. I could also have a "NoGeneID" gene where all such exons will belong.

Cheers
Alex

@monicabritton
Copy link
Author

Yes, I like your idea of a "NoGeneID" line in the top "N_" section of ReadsPerGene.out.tab. That makes it very obvious and able to be accounted for when the counts are parsed.

Thanks!
Monica

@benslack19
Copy link

Hi @monicabritton , are you also trying a different counting method by any chance?

I'm evaluating counts from STAR BAM files but using featureCounts. I had to convert the NCBI GFF to GTF using this and a custom script to make the "gene_id" tag start first.

I was curious what people's experiences are with using the -quantMode GeneCounts feature within STAR versus another quantifier like featureCounts.

@monicabritton
Copy link
Author

Hi @benslack19 , I have used featureCounts occasionally, if I need to "re-count" an existing bam, and don't want to re-run the alignment for some reason. If the reads in the bam are PE, then it is necessary to "repair" the bam file to make it name sorted, rather than Coordinate sorted, which takes time and server space, so with PE reads it's probably faster to just re-run STAR, unless you already have the name-sorted bam.

I have run featureCounts using gff3 without converting to GTF, but I almost always make edits to NCBI gff3 files, even when I use them with STAR, primarily because of fields like:
Dbxref=Genbank:NP_148939.1,GeneID:803438
as well as inconsistencies within the gff3 files. I'm not sure if converting to GTF eliminates all of these?

@benslack19
Copy link

Thanks @monicabritton . This is very helpful.

The cufflinks gffread package I alluded to earlier converts the NCBI gff column 9 to look like this:
transcript_id "rna0"; gene_id "gene0"; gene_name "DDX11L1";
but in my experience, it didn't work "out of the box" in featureCounts. I had to re-order the line so that gene_id started first.

I haven't fully verified if converting eliminates all gff3 inconsistencies, but I'm getting some lower than expected counts in featureCounts for some genes. I'm still evaluating whether it's because of the gff conversion or some other reason. Trying the quantMode within STAR and just editing the gff3 file could be a way to check. Can I ask what other "inconsistencies within the gff3 files" I should edit for other than the Dbxref observation you pointed out?

@alexdobin
Copy link
Owner

Hi @benslack19, @monicabritton

as far as I remember, featureCounts uses slightly different policy from STAR (=htseq-count) for counting PE reads towards genes - if mate1 overlaps geneA, while mate2 overlaps geneA and geneB, STAR will consider this read Ambiguous, while featureCounts will count it towards geneA.

Cheers
Alex

@benslack19
Copy link

Thanks @alexdobin

@odoublewen
Copy link

According to the change log for STAR-2.7.1a, this issue has been resolved. (So, @alexdobin you may want to close this issue?)

I ran into a very similar issue to this when using v2.5.3a with -quantMode GeneCounts, which I have described in this google-groups posting: "odd behavior in ReadsPerGene.out.tab strand columns, with non-gencode refs".

In a nutshell, the problem was that GTF features lacking transcript_id (in my case, I was creating features that had only gene_id) were getting assigned to the strand of the first gene in the GTF file (but not assigned first gene itself.) So seems a little different from what is discussed in this thread.

In any case, updating to >= STAR 2.7.1a resolves the problem for me.

@alexdobin alexdobin added the resolved problem or issue that has been resolved label Sep 1, 2019
@alexdobin
Copy link
Owner

Hi Owen,

thanks for sharing the results of your detailed investigation into this problem, it will be helpful to to other users.

Cheers
Alex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement resolved problem or issue that has been resolved
Projects
None yet
Development

No branches or pull requests

4 participants