Counts Erroneously Assigned to First Gene in gff file #613
Comments
Amendment to my previous post: a "WARNING" message was written to the Log for each of the non-standard exon lines. This happened when I indexed the genome (with the gff3 file included in the indexing).
Hi Monica, I think I need to make this check optional since, in principle, the GFF3 file is not required to have a gene_id tag. I could also have a "NoGeneID" gene to which all such exons would belong. Cheers
Yes, I like your idea of a "NoGeneID" line in the top "N_" section of ReadsPerGene.out.tab. That makes it very obvious, and it can be accounted for when the counts are parsed. Thanks!
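For anyone parsing the counts downstream, here is a minimal sketch of splitting the "N_" summary rows from the per-gene rows in ReadsPerGene.out.tab. It assumes the usual four tab-separated columns (ID, unstranded counts, read-1-strand counts, read-2-strand counts). The function name `split_star_counts` and the sample data are made up for illustration, and the `N_NoGeneID` row is only the proposal from this thread, not a confirmed STAR output.

```python
import io


def split_star_counts(handle):
    """Split ReadsPerGene.out.tab rows into N_* summary rows and gene rows.

    Returns two dicts mapping the first column to a tuple of the three
    count columns (unstranded, read-1 strand, read-2 strand).
    """
    summary, genes = {}, {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        counts = tuple(int(value) for value in fields[1:4])
        if fields[0].startswith("N_"):
            summary[fields[0]] = counts
        else:
            genes[fields[0]] = counts
    return summary, genes


# Made-up example data. N_NoGeneID is hypothetical (the row proposed above).
example = io.StringIO(
    "N_unmapped\t10\t10\t10\n"
    "N_multimapping\t4\t4\t4\n"
    "N_noFeature\t7\t20\t9\n"
    "N_ambiguous\t2\t0\t1\n"
    "N_NoGeneID\t3\t1\t2\n"
    "geneA\t50\t5\t45\n"
    "geneB\t12\t1\t11\n"
)
summary, genes = split_star_counts(example)
print(sorted(summary))
print(genes)
```

Because every summary row starts with "N_", an extra row of that kind would land in `summary` automatically and never be mistaken for a gene.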
Hi @monicabritton, are you also trying a different counting method by any chance? I'm evaluating counts from STAR BAM files, but using featureCounts. I had to convert the NCBI GFF to GTF using this and a custom script to make the "gene_id" tag come first. I was curious what people's experiences are with using the --quantMode GeneCounts feature within STAR versus another quantifier like featureCounts.
Hi @benslack19, I have used featureCounts occasionally, when I need to "re-count" an existing bam and don't want to re-run the alignment for some reason. If the reads in the bam are PE, it is necessary to "repair" the bam file so that it is name-sorted rather than coordinate-sorted, which takes time and server space. With PE reads it's therefore probably faster to just re-run STAR, unless you already have the name-sorted bam. I have run featureCounts using gff3 without converting to GTF, but I almost always make edits to NCBI gff3 files, even when I use them with STAR, primarily because of fields like:
Thanks @monicabritton. This is very helpful. The cufflinks gffread package I alluded to earlier converts the NCBI gff column 9 to look like this: I haven't fully verified whether converting eliminates all gff3 inconsistencies, but I'm getting lower than expected counts in featureCounts for some genes. I'm still evaluating whether that's because of the gff conversion or some other reason. Trying the quantMode within STAR and just editing the gff3 file could be a way to check. Can I ask what other "inconsistencies within the gff3 files" I should edit, apart from the Dbxref observation you pointed out?
Hi @benslack19, @monicabritton, as far as I remember, featureCounts uses a slightly different policy from STAR (=htseq-count) for counting PE reads towards genes: if mate1 overlaps geneA, while mate2 overlaps geneA and geneB, STAR will consider this read Ambiguous, while featureCounts will count it towards geneA. Cheers
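To make the difference concrete, here is a toy sketch of the two policies as described in the comment above. It is not code from STAR, htseq-count, or featureCounts, and it ignores strandedness, overlap lengths, and every other option those tools have. The function names `union_policy` and `mate_support_policy` are made up for illustration, and the second function is only one simple rule that reproduces the described outcome.

```python
from collections import Counter


def union_policy(mate1_genes, mate2_genes):
    """Toy version of the STAR / htseq-count behaviour described above:
    pool the genes hit by either mate; more than one gene is ambiguous."""
    genes = set(mate1_genes) | set(mate2_genes)
    if not genes:
        return "no_feature"
    if len(genes) > 1:
        return "ambiguous"
    return next(iter(genes))


def mate_support_policy(mate1_genes, mate2_genes):
    """Toy version of the featureCounts behaviour described above:
    prefer the gene overlapped by the most mates; ties stay ambiguous.
    This rule is an assumption chosen to match the example in the thread."""
    votes = Counter()
    for mate in (mate1_genes, mate2_genes):
        for gene in set(mate):
            votes[gene] += 1
    if not votes:
        return "no_feature"
    best = max(votes.values())
    winners = [gene for gene, count in votes.items() if count == best]
    return winners[0] if len(winners) == 1 else "ambiguous"


# The case from the comment: mate1 overlaps geneA, mate2 overlaps geneA and geneB.
mate1, mate2 = {"geneA"}, {"geneA", "geneB"}
print(union_policy(mate1, mate2))         # ambiguous
print(mate_support_policy(mate1, mate2))  # geneA
```

Under this toy model, pairs like the one above would lower a gene's count in one tool relative to the other, which is worth keeping in mind when comparing the two sets of counts.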
Thanks @alexdobin
According to the change log for STAR-2.7.1a, this issue has been resolved. (So, @alexdobin, you may want to close this issue?) I ran into a very similar issue when using v2.5.3a with --quantMode GeneCounts, which I have described in this Google Groups posting: "odd behavior in ReadsPerGene.out.tab strand columns, with non-gencode refs". In a nutshell, the problem was that GTF features lacking In any case, updating to STAR >= 2.7.1a resolves the problem for me.
Hi Owen, thanks for sharing the results of your detailed investigation into this problem; it will be helpful to other users. Cheers
Some gff3 files (ahem, NCBI) can be composed of lines from a variety of different annotation packages (as indicated in field 2 of the file). I discovered that if the "gene_id" (or equivalent) is not present on some exon lines, the counts associated with those exons are added to the counts for the first gene in the gff3 file.
It would be best if STAR failed during such a run, or at least issued a strong warning message to stderr.
LMK if you want an example ....
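Until STAR reports this itself, a quick pre-check of the annotation can flag the problem lines before indexing. Below is a minimal sketch that lists exon lines whose column 9 lacks a given attribute. It assumes standard 9-column, tab-separated GFF3 with `key=value;` attributes. The function name `exons_missing_tag`, the default tag `gene_id`, and the sample records are made up for illustration, so set `tag` to whatever attribute your STAR run actually uses for the exon's parent gene.

```python
def exons_missing_tag(lines, tag="gene_id", feature="exon"):
    """Return (line_number, source, attributes) for every `feature` line
    whose GFF3 attribute column does not contain `tag`."""
    missing = []
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature:
            continue
        # Attribute keys from column 9, e.g. "ID=exon1;Parent=rna0" -> {"ID", "Parent"}
        keys = {
            item.split("=", 1)[0].strip()
            for item in cols[8].split(";")
            if "=" in item
        }
        if tag not in keys:
            missing.append((lineno, cols[1], cols[8]))
    return missing


# Made-up records for illustration only; not taken from a real NCBI file.
sample = [
    "##gff-version 3\n",
    "chr1\tGnomon\texon\t100\t200\t.\t+\t.\tID=exon1;Parent=rna0;gene_id=geneA\n",
    "chr1\ttRNAscan-SE\texon\t500\t570\t.\t+\t.\tID=exon2;Parent=rna1\n",
]
for lineno, source, attributes in exons_missing_tag(sample):
    print(f"line {lineno} ({source}): no gene_id in '{attributes}'")
```

Reporting the source from field 2 alongside each hit helps show which annotation package produced the non-standard exon lines.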