should extended_annotation.gtf be a superset of the input gtf? #175

Open
jamestwebber opened this issue Apr 17, 2024 · 10 comments
Labels
bug (Something isn't working); fixed in dev (Issue resolved but not released yet); fixed in release (Issue resolved and the fix is released, waiting for approval)

Comments

@jamestwebber
Collaborator

jamestwebber commented Apr 17, 2024

This is what I assumed should happen, but it doesn't appear to be the case: my reference GTF has ~61k genes (GRCh38, gencode v39) but the output extended_annotation.gtf does not include all the known genes and transcripts (by a large margin: 23k genes). Is there some filtering going on here?
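One way to check the superset question directly is to compare the sets of `gene_id` (or `transcript_id`) values in the two files. The following is a minimal sketch, not part of IsoQuant: the helper names `collect_ids` and `missing_ids` are made up for illustration, and it assumes the identifiers are quoted in the attribute column, as they are in GENCODE GTFs.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
_ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def collect_ids(gtf_lines, key):
    """Return the set of values of `key` over all feature lines."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        for k, v in _ATTR_RE.findall(fields[8]):
            if k == key:
                ids.add(v)
                break
    return ids


def missing_ids(reference_lines, output_lines, key="gene_id"):
    """IDs present in the reference but absent from the output."""
    return collect_ids(reference_lines, key) - collect_ids(output_lines, key)
```

With two open file handles `ref` and `out`, `len(missing_ids(ref, out))` gives the number of reference genes absent from the output; passing `key="transcript_id"` does the same for transcripts.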

@andrewprzh
Collaborator

Hi @jamestwebber

Yes, this is a known flaw in the current version. It is now fixed, and the fix will be out in 3.4 (hopefully soon).

Best
Andrey

andrewprzh added the bug and fixed in dev labels on Apr 18, 2024
andrewprzh added the fixed in release label on May 9, 2024
@andrewprzh
Collaborator

Should be fixed now in IsoQuant 3.4

@jamestwebber
Collaborator Author

I thought this was fixed, but I'm seeing some instances where the exon information for a gene was not copied over. I wonder if this is related to whether or not reads were assigned to the gene.

@jamestwebber jamestwebber reopened this Sep 16, 2024
@jamestwebber
Collaborator Author

I noticed this initially in an unprocessed pseudogene (WASH7P) just because it happens to be very close to the beginning of chr1. So if there's any filtering based on biotype, that could also be involved.

@andrewprzh
Collaborator

@jamestwebber

There should not be any additional filtering, so this sounds odd. What kind of information is missing? Is it exon records?
Is it possible to take a look at this example?

Thanks
Andrey

@jamestwebber
Collaborator Author

Ah! This is probably a false alarm: it looks like the transcript name was not copied over, but the exons themselves are present. I was looking for the gene name and didn't see the exons. For example, the first exon in both files:

$ grep 'ENST00000488147.1' ~/reference/GRCh38.gencode.v39.annotation.basic.gtf | head -n 2 
chr1    HAVANA  transcript      14404   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    HAVANA  exon    29534   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; exon_number 1; exon_id "ENSE00001890219.1"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
$ grep 'ENST00000488147.1' OUT.extended_annotation.gtf | head -n 2
chr1    HAVANA  transcript      14404   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; exons "11"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1"; 
chr1    HAVANA  exon    29534   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; exon "1"; exon_id "chr1.40908";
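The difference between the two exon records can be made explicit by comparing the attribute keys in column 9. This is a rough sketch with hypothetical helper names (`attribute_keys`, `dropped_keys`). It assumes tab-separated records and that no attribute value contains a semicolon.

```python
def attribute_keys(gtf_line):
    """Return the set of attribute keys in column 9 of one GTF record."""
    fields = gtf_line.rstrip("\n").split("\t")
    keys = set()
    for chunk in fields[8].split(";"):
        chunk = chunk.strip()
        if chunk:
            # The key is everything before the first space, e.g. gene_name.
            keys.add(chunk.split(" ", 1)[0])
    return keys


def dropped_keys(reference_line, output_line):
    """Attribute keys present in the reference record but not in the output."""
    return sorted(attribute_keys(reference_line) - attribute_keys(output_line))
```

Applied to the two exon lines above, `dropped_keys` would include `gene_name` and `transcript_name`, while the coordinates and `transcript_id` are unchanged.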

@andrewprzh
Collaborator

Yes, additional information such as gene names is only copied for gene and transcript records.
I can do the same for exons if needed.
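For output where exon records lack these attributes, a post-processing pass could copy them from the matching transcript record, joined on `transcript_id`. This is a sketch of that idea only, not IsoQuant's implementation. `split_attrs` and `copy_to_exons` are hypothetical names, the default keys are just an example, and `lines` must be a list (it is read twice).

```python
def split_attrs(field):
    """Split GTF column 9 into ordered (key, raw_value) pairs."""
    pairs = []
    for chunk in field.split(";"):
        chunk = chunk.strip()
        if chunk:
            key, _, value = chunk.partition(" ")
            pairs.append((key, value))  # value keeps its quotes
    return pairs


def copy_to_exons(lines, keys=("gene_name", "transcript_name")):
    """Append the chosen transcript attributes to each exon record."""
    # First pass: remember the chosen attributes of every transcript.
    by_transcript = {}
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) == 9 and f[2] == "transcript":
            attrs = dict(split_attrs(f[8]))
            tid = attrs.get("transcript_id")
            by_transcript[tid] = [(k, attrs[k]) for k in keys if k in attrs]
    # Second pass: extend exon records, leave everything else untouched.
    result = []
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) == 9 and f[2] == "exon":
            attrs = split_attrs(f[8])
            present = {k for k, _ in attrs}
            tid = dict(attrs).get("transcript_id")
            extra = [(k, v) for k, v in by_transcript.get(tid, [])
                     if k not in present]
            f[8] = " ".join(f"{k} {v};" for k, v in attrs + extra)
            result.append("\t".join(f) + "\n")
        else:
            result.append(line)
    return result
```

Reading the file with `readlines()`, passing the list through `copy_to_exons`, and writing the result back out yields a GTF whose exon lines carry `gene_name` and `transcript_name`.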

@jamestwebber
Collaborator Author

The reason I noticed this is because I was looking at IGV, and it wasn't displaying the exons for WASH7P, only the gene body. I think this is really a bug in how IGV is parsing the GTF (it should be matching on transcript_id), but you will probably update sooner. 😂

@andrewprzh
Collaborator

Yeah, I thought transcript_id would be enough. Maybe converting to GFF3 and having ID and Parent attributes instead will make it work.

Anyway, will fix exon information.
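The GFF3 idea can be sketched as a per-line conversion that makes the hierarchy explicit through `ID` and `Parent`. This is an illustration under stated assumptions, not a tested fix for the IGV display: `gtf_to_gff3_line` and `escape` are hypothetical names, exon records get a `Parent` but no `ID`, repeated keys such as `tag` are merged into one comma-separated value, and a complete GFF3 file would also need a `##gff-version 3` header line.

```python
# Characters that must be percent-encoded inside GFF3 attribute values.
_ESCAPES = {"%": "%25", ";": "%3B", "=": "%3D", "&": "%26", ",": "%2C",
            "\t": "%09"}


def escape(value):
    """Percent-encode the characters reserved in GFF3 column 9."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def gtf_to_gff3_line(line):
    """Convert one GTF record to GFF3 with explicit ID/Parent attributes."""
    f = line.rstrip("\n").split("\t")
    grouped = {}  # key -> list of values, in order of first appearance
    for chunk in f[8].split(";"):
        chunk = chunk.strip()
        if chunk:
            key, _, value = chunk.partition(" ")
            grouped.setdefault(key, []).append(escape(value.strip().strip('"')))
    attrs = []
    if f[2] == "gene":
        attrs.append("ID=" + grouped["gene_id"][0])
    elif f[2] == "transcript":
        attrs.append("ID=" + grouped["transcript_id"][0])
        attrs.append("Parent=" + grouped["gene_id"][0])
    elif "transcript_id" in grouped:
        # Exons and other sub-features point at their transcript.
        attrs.append("Parent=" + grouped["transcript_id"][0])
    attrs += [f"{k}={','.join(v)}" for k, v in grouped.items()]
    f[8] = ";".join(attrs)
    return "\t".join(f)
```

Because the exon record then names its transcript through `Parent`, a viewer no longer has to match on `transcript_id` or `transcript_name` to group exons.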

@andrewprzh
Collaborator

Exon attributes should now be copied from the reference in IsoQuant 3.6.1.
