should extended_annotation.gtf be a superset of the input gtf? #175

jamestwebber · 2024-04-17T18:42:44Z

This is what I assumed should happen, but it doesn't appear to be the case: my reference GTF has ~61k genes (GRCh38, gencode v39) but the output extended_annotation.gtf does not include all the known genes and transcripts (by a large margin: 23k genes). Is there some filtering going on here?
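One way to quantify the gap described above is to compare the sets of `gene_id` values in the two files. The following is a minimal sketch, not part of IsoQuant; the helper names `gene_ids` and `missing_genes` are hypothetical, and it assumes standard tab-separated GTF with `gene_id "..."` in the ninth column.

```python
import re

# Matches the gene_id attribute in the 9th GTF column, e.g. gene_id "ENSG00000227232.5";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gene_ids(gtf_lines):
    """Collect the distinct gene_id values from an iterable of GTF lines."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed or empty lines
        match = GENE_ID_RE.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def missing_genes(reference_lines, output_lines):
    """Return gene IDs present in the reference but absent from the output."""
    return gene_ids(reference_lines) - gene_ids(output_lines)


# Usage (file names are placeholders):
# with open("reference.gtf") as ref, open("OUT.extended_annotation.gtf") as out:
#     print(len(missing_genes(ref, out)))
```

If the output were a true superset of the reference, `missing_genes` would return an empty set.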


andrewprzh · 2024-04-18T09:20:38Z

Hi @jamestwebber

Yes, this is a known flaw in the current version. It is now fixed, and the fix will be out in 3.4 (hopefully soon).

Best
Andrey

andrewprzh · 2024-05-09T09:35:52Z

Should be fixed now in IsoQuant 3.4

jamestwebber · 2024-09-16T19:06:43Z

I thought this was fixed, but I'm seeing some instances where the exon information for a gene was not copied over. I wonder if this is related to whether or not reads were assigned to the gene.

jamestwebber · 2024-09-16T19:42:33Z

I noticed this initially in an unprocessed pseudogene (WASH7P) just because it happens to be very close to the beginning of chr1. So if there's any filtering based on biotype, that could also be involved.

andrewprzh · 2024-09-19T22:22:58Z

@jamestwebber

There should not be any additional filtering, so this sounds odd. What kind of information is missing? Is it the exon records?
Is it possible to take a look at this example?

Thanks
Andrey

jamestwebber · 2024-09-19T22:47:00Z

Ah! This is probably a false alarm: it looks like the transcript name was not copied over to the exon records, but the exons themselves are present. I was looking for the gene name and didn't see the exons. For example, the first exon in both files:

$ grep 'ENST00000488147.1' ~/reference/GRCh38.gencode.v39.annotation.basic.gtf | head -n 2 
chr1    HAVANA  transcript      14404   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    HAVANA  exon    29534   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; exon_number 1; exon_id "ENSE00001890219.1"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
$ grep 'ENST00000488147.1' OUT.extended_annotation.gtf | head -n 2
chr1    HAVANA  transcript      14404   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; exons "11"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1"; 
chr1    HAVANA  exon    29534   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; exon "1"; exon_id "chr1.40908";
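The difference between the two exon records above can also be listed programmatically by comparing attribute keys. This is a rough sketch under the assumption that attributes follow the usual GTF `key "value";` or `key value;` form; `parse_attributes` and `missing_keys` are hypothetical helper names, not IsoQuant functions.

```python
import re

# A key followed by either a quoted string or a bare token, terminated by ';'
# e.g. gene_name "WASH7P";  or  level 2;
ATTR_RE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^\s;]+))\s*;')


def parse_attributes(attr_field):
    """Parse the 9th GTF column into {key: [values]}; keys such as 'tag' may repeat."""
    attrs = {}
    for key, quoted, bare in ATTR_RE.findall(attr_field):
        attrs.setdefault(key, []).append(quoted if quoted else bare)
    return attrs


def missing_keys(reference_attrs, output_attrs):
    """Return the sorted attribute keys found in the reference record but not in the output record."""
    return sorted(set(parse_attributes(reference_attrs)) - set(parse_attributes(output_attrs)))
```

Passing the attribute columns of the reference and output exon lines to `missing_keys` lists the keys (such as `gene_name` and `transcript_name`) that were not carried over.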

andrewprzh · 2024-09-20T09:28:41Z

Yes, additional information such as gene names is only copied for gene and transcript records.
I can do the same for exons if needed.

jamestwebber · 2024-09-20T14:19:31Z

The reason I noticed this is because I was looking at IGV, and it wasn't displaying the exons for WASH7P, only the gene body. I think this is really a bug in how IGV is parsing the GTF (it should be matching on transcript_id), but you will probably update sooner. 😂

andrewprzh · 2024-09-20T15:06:37Z

Yeah, I thought transcript_id would be enough. Maybe converting to GFF3 and having ID and Parent attributes instead will make it work.

Anyway, will fix exon information.
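The GFF3 idea mentioned above could be sketched roughly as follows. This is an illustration only, not IsoQuant code: `gtf_to_gff3_line` is a hypothetical helper, it assumes that `gene_id` and `transcript_id` are usable as GFF3 `ID`/`Parent` values, and whether IGV then groups the exons correctly has not been verified here. Dedicated converters such as gffread are likely a better choice for real data.

```python
import re
from urllib.parse import quote

# A key followed by either a quoted string or a bare token, terminated by ';'
ATTR_RE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^\s;]+))\s*;')


def gtf_to_gff3_line(line):
    """Rewrite one GTF record as a GFF3 record with ID/Parent attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns")

    # Group repeated keys (e.g. tag) so they can be emitted as one comma-separated attribute.
    grouped = {}
    for key, quoted, bare in ATTR_RE.findall(fields[8]):
        grouped.setdefault(key, []).append(quoted if quoted else bare)

    gene_id = grouped["gene_id"][0]
    feature = fields[2]
    if feature == "gene":
        links = {"ID": [gene_id]}
    elif feature == "transcript":
        links = {"ID": [grouped["transcript_id"][0]], "Parent": [gene_id]}
    else:
        # exon, CDS, etc. point to their transcript
        links = {"Parent": [grouped["transcript_id"][0]]}

    # ID/Parent first, then the original attributes; percent-encode reserved characters.
    merged = {**links, **grouped}
    fields[8] = ";".join(
        key + "=" + ",".join(quote(value, safe=" :/") for value in values)
        for key, values in merged.items()
    )
    return "\t".join(fields)
```

With this layout the exon-to-transcript link is carried by `Parent` rather than inferred from matching `transcript_id` values.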

andrewprzh · 2024-09-25T13:30:16Z

Exon attributes should now be copied from the reference in IsoQuant 3.6.1.

andrewprzh added the "bug" (Something isn't working) and "fixed in dev" (Issue resolved but not released yet) labels on Apr 18, 2024

andrewprzh added the "fixed in release" (Issue resolved and the fix is released, waiting for approval) label on May 9, 2024

andrewprzh closed this as completed May 9, 2024

jamestwebber reopened this Sep 16, 2024
