Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load_gff3 miscalculates CDS #60

Open
mpoelchau opened this issue Nov 29, 2023 · 13 comments
Open

load_gff3 miscalculates CDS #60

mpoelchau opened this issue Nov 29, 2023 · 13 comments

Comments

@mpoelchau
Copy link

We've been having trouble with load_gff3 - I initially posted this on the Apollo repo (GMOD/Apollo#2662), reposting it here now. Thanks for considering this, I'd appreciate any pointers!

We are trying to use the python-apollo arrow annotations load_gff3 command to load annotations to the user-created annotations track. It is changing the CDS locations of the model, both with and without the --disable_cds_recalculation option.

Here is what a load without --disable_cds_recalculation looks like; the correct frame can be seen in the track below.

Screenshot 2023-11-28 at 2 53 30 PM

The gff3 that was used to load the annotation has 6 CDS lines; the gff3 for the uploaded annotation has 12 CDS lines (even though the view shows only one CDS segment). Apollo also won't calculate a protein or CDS sequence on the uploaded annotation.

Here is what a load with --disable_cds_recalculation looks like (command: arrow annotations load_gff3 --source https://apollo2-stage-node1-cbo.nal.usda.gov/apollo Anoplophora_glabripennis ~/Downloads/NW_019416298.gff3 --disable_cds_recalculation)
Screenshot 2023-11-28 at 2 50 01 PM

Again, the gff3 for the uploaded annotation in Apollo has 12 CDS lines instead of 6. Apollo also won't calculate a protein or CDS sequence on the uploaded annotation.
I'll note that if you run the same command multiple times, the single CDS will display in a different spot each time.

If you load the annotation by dragging it up, it loads correctly:

Screenshot 2023-11-28 at 2 56 20 PM

This is happening for many (but not all) annotations in multiple assemblies/organisms.

Some other observations:

  • We haven't observed the problem for models with a single CDS/exon segment
  • The underlying genomic sequence has lowercase nucleotides
  • I tried using load_legacy_gff3, and that calculated the CDS correctly, but I'm unable to delete features when I load them with that method (Hibernate operation: could not execute statement; SQL [n/a]; ERROR: update or delete on table "feature" violates foreign key constraint "fk_8jm56covt0m7m0m191bc5jseh" on table "feature_relationship" Detail: Key (id)=(4858111) is still referenced from table "feature_relationship".; nested exception is org.postgresql.util.PSQLException: ERROR: update or delete on table "feature" violates foreign key constraint "fk_8jm56covt0m7m0m191bc5jseh" on table "feature_relationship" Detail: Key (id)=(4858111) is still referenced from table "feature_relationship".)

I've attached "before" and "after" gff3s. (Used .txt extension because GitHub wouldn't let me upload otherswise)
before.txt
after-nocdsrecalc.txt
after.txt

  • Provide the javascript console log output generated from the action.
    None.

  • Provide the server log output generated from the action (typically catalina.out).
    nothing is added to Catalina.out when I add the annotations.

@mpoelchau
Copy link
Author

I have a few more observations on this, I have no idea if they're expected/normal. I looked into the json that load_gff3 generates as output:

  • an additional CDS stanza is generated with start+end coordinates that encompass all CDS subfeatures
  • even when disable_CDS_recalculation is used, no phase information is included in the json output

json output attached from following command: arrow annotations load_gff3 --disable_cds_recalculation Anoplophora_glabripennis before.txt
after.json

Happy to keep on sleuthing in another direction if needed... Thanks for your help with this!

@MonicaPoelchau-USDA
Copy link

Update - after looking at the gffs of a number of models after uploading, it looks like duplicate CDS features are being created. There are always 2x the number of CDS lines in the gff after loading, whether using --disable_CDS_recalculation or not.

I'm 99% certain it's not a problem with the gffs that I am trying to load - I've tried gffs from multiple sources, and our gff3_QC program doesn't detect any problems.

@MonicaPoelchau-USDA
Copy link

And another update - I was able to reproduce this on the Apollo demo site (human, region chr4:111819946..111848259 (28.31 Kb)). Input gff attached. arrow annotations load_gff3 --disable_cds_recalculation Human-Hg38 testhumandemo2.gff
testhumandemo2.txt

@hexylena
Copy link
Member

hexylena commented Dec 4, 2023

Hi @MonicaPoelchau-USDA , I'll try and find some time this week to look into it.

Do you have a single gene model you could share, that I could use as a test case? Edit: yes you already attached that. Thank you!

@MonicaPoelchau-USDA
Copy link

Thanks for looking into this, @hexylena ! Let me know if there's anything else I can test/provide.

@hexylena
Copy link
Member

Ok, investigating today (sorry again @MonicaPoelchau-USDA) I can reproduce the issue entirely.

I'm seeing zero functional differences with and without the no-cds-recalc, the files are functionally identical (module UUIDs), so that code path can at least be ignored for now: arrow sends the exact same data to apollo's addTranscript API.

As for the generated transcripts I'm getting back duplicated CDSs as well, only the addition of a number of non-canonical start sites (expected, i'm testing on an ecoli copy.)

I can likewise confirm the inability to delete when loading via legacy, that's really ... not great. It would require apollo changes to correct (or in your case maybe manually deleting it from the DB is an option.) I've tried deleting the CDSs one by one and got almost to the last one which now is just stuck there permanently lol.

I can confirm fetching the sequence does nothing, this is due to a 500 internal server error on the apollo side. Genomic and cdna both return sequences, but cds/peptide fail.

Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: 2024-01-26 14:29:41,353 [http-bio-8080-exec-45] ERROR apollo.FeatureRelationshipService  - More than one child feature relationships found for Feature{id=5502660, symbol='null', description='null', name='Mitotic spindle assembly checkpoint protein MAD1-00001', uniqueName='a45e77e6-673d-4ee4-a477-8126637a9426', sequenceLength=null, status=null, dateCreated=2024-01-26 14:21:33.563, lastUpdated=2024-01-26 14:21:33.992} and ID SO:0000316
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: 2024-01-26 14:29:41,360 [http-bio-8080-exec-45] ERROR errors.GrailsExceptionResolver  - NullPointerException occurred when processing request: [POST] /apollo/778475583185271421178966132/AnnotationEditorService
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: Cannot get property 'fmin' on null object. Stacktrace follows:
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: java.lang.NullPointerException: Cannot get property 'fmin' on null object
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.bbop.apollo.FeatureService.$tt__getSequenceAlterationsForFeature(FeatureService.groovy:2938)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.bbop.apollo.FeatureService.$tt__getResiduesWithAlterationsAndFrameshifts(FeatureService.groovy:1063)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.bbop.apollo.SequenceService.$tt__getSequenceForFeature(SequenceService.groovy:524)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.bbop.apollo.SequenceService.$tt__getSequenceForFeatures(SequenceService.groovy:654)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.bbop.apollo.AnnotationEditorController.getSequence(AnnotationEditorController.groovy:1054)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at grails.plugin.cache.web.filter.PageFragmentCachingFilter.doFilter(PageFragmentCachingFilter.java:198)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at grails.plugin.cache.web.filter.AbstractFilter.doFilter(AbstractFilter.java:63)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.grails.plugins.web.rest.api.ControllersRestApi.forward(ControllersRestApi.groovy:53)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.bbop.apollo.AnnotationEditorController.handleOperation(AnnotationEditorController.groovy:72)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at grails.plugin.cache.web.filter.PageFragmentCachingFilter.doFilter(PageFragmentCachingFilter.java:198)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at grails.plugin.cache.web.filter.AbstractFilter.doFilter(AbstractFilter.java:63)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.apache.shiro.web.servlet.AbstractShiroFilter.executeChain(AbstractShiroFilter.java:449)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.apache.shiro.web.servlet.AbstractShiroFilter$1.call(AbstractShiroFilter.java:365)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.apache.shiro.subject.support.SubjectCallable.doCall(SubjectCallable.java:90)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.apache.shiro.subject.support.SubjectCallable.call(SubjectCallable.java:83)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.apache.shiro.subject.support.DelegatingSubject.execute(DelegatingSubject.java:383)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.apache.shiro.web.servlet.AbstractShiroFilter.doFilterInternal(AbstractShiroFilter.java:362)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:125)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at com.brandseye.cors.CorsFilter.doFilter(CorsFilter.java:82)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at java.lang.Thread.run(Thread.java:750)

Which is seen in the apollo codebase here.
https://github.com/GMOD/Apollo/blob/5985ab8d98cc3b98625101f649c9f300efb5b555/grails-app/services/org/bbop/apollo/FeatureRelationshipService.groovy#L30

I, don't understand why this would be an error, having more than 1 child feature. Your gene model is gff3 spec compliant,
we're sending the correct data to apollo (as evidenced by what it'll show), but there's still something so unusual about this feature for Apollo.

Uploading the GFF3 and manually creating the annotation works perfectly, but that's not what we want. Looking at the uploaded feature in JBrowse's view, it looks right, 6 exons and 6 CDSs.

Investigating the websocket call that's made when I drag and drop this feature onto the annotation track is fascinating, it's very different!

At the top level an mRNA is sent, along with the following:

14:43:33 w-galaxy:[~/arbeit/genomics/python-apollo @master‽]$ cat tmp2 | jq .features[0].children[] -c
{"location":{"fmin":1178316,"fmax":1179335,"strand":1},"type":{"cv":{"name":"sequence"},"name":"exon"},"orig_id":"8f46523a-680c-4f42-ae8c-45b61dc37928"}
{"location":{"fmin":1179393,"fmax":1179530,"strand":1},"type":{"cv":{"name":"sequence"},"name":"exon"},"orig_id":"4d5bc38f-c85b-4ecd-bd42-1f2a036c90d6"}
{"location":{"fmin":1179585,"fmax":1179768,"strand":1},"type":{"cv":{"name":"sequence"},"name":"exon"},"orig_id":"834ee674-1dd2-405a-948f-96140f2c2502"}
{"location":{"fmin":1179824,"fmax":1180016,"strand":1},"type":{"cv":{"name":"sequence"},"name":"exon"},"orig_id":"046e9b0f-0b2f-45b4-98ff-4b11482e073e"}
{"location":{"fmin":1180209,"fmax":1180546,"strand":1},"type":{"cv":{"name":"sequence"},"name":"exon"},"orig_id":"888dc728-2f2a-42e8-8a12-ad298b5189fe"}
{"location":{"fmin":1183892,"fmax":1184805,"strand":1},"type":{"cv":{"name":"sequence"},"name":"exon"},"orig_id":"faf5e3e9-f1de-4675-8c1c-b73e8f8a94bf"}
{"location":{"fmin":1178671,"fmax":1184578,"strand":1},"type":{"cv":{"name":"sequence"},"name":"CDS"}}

Notably it's only exons except for where the CDS doesn't match the parent exon.

So that looks like we're calling the addTranscript API incorrectly? That doesn't seem right, but, perhaps.

ok, so, deleting any CDSs which match their parent exons completely, before sending produces the thing that looks the closest to what was copied from the native jbrowse track (MAD1f, the last track)

image

But that still fails to get the correct sequence from Apollo.

The thing that finally worked was... dropping ALL cds features from the source.

image

I think in this case we're getting "lucky" that there weren't any manual modifications made to the CDS regions and apollo's algorithm works, but, yikes that's not good. Deleting all CDSs from the source gff3 file does not work (weirdly), instead I need this patch to get the loading to work at all:

diff --git a/apollo/util.py b/apollo/util.py
index 802625b..11529cd 100644
--- a/apollo/util.py
+++ b/apollo/util.py
@@ -121,6 +121,28 @@ def _yieldGeneData(gene, disable_cds_recalculation=False, use_name=False):
     # # TODO: handle description
     # # TODO: handle GO, Gene Product, Provenance
  
+    def __floc(location):
+        return f"{location['fmin']}-{location['fmax']}-{location['strand']}"
+
+    for child1 in current['children']:
+        exon_regions = []
+        for child1 in current['children']:
+            for child in child1['children']:
+                print(child)
+                if child['type']['name'] == 'exon': 
+                    exon_regions.append(__floc(child['location']))
+        new_current_children = []
+        for child in child1['children']:
+            if child['type']['name'] == 'CDS':
+                continue
+                nnn = __floc(child['location'])
+                if nnn not in exon_regions:
+                    new_current_children.append(child)
+            else:
+                new_current_children.append(child)
+        child1['children'] = new_current_children
+        print(exon_regions)
+
     if 'children' in current and gene.type == 'gene':
         # Only sending mRNA level as apollo is more comfortable with orphan mRNAs
         return current['children']

I don't love any of this 🙃 I'm not sure what we could change to make Apollo happier here, short of just removing the CDSs blindly. For whatever reason sending CDSs causes problems

@mpoelchau
Copy link
Author

mpoelchau commented Jan 30, 2024

Thanks @hexylena ! I pulled your branch, was able to add the annotations, and they loaded correctly - I was able to retrieve the protein and CDS sequence. I was also able to load them correctly using the main branch and deleting all CDS features from the gff3 file beforehand, as you mentioned. I think this will work for us in most cases since our CDS's should be pretty vanilla (starting in phase 0).

@garrettjstevens
Copy link

ok, so, deleting any CDSs which match their parent exons completely, before sending produces the thing that looks the closest to what was copied from the native jbrowse track (MAD1f, the last track)

I wasn't around when this was written, but this is my best guess of what's going on here:

What Apollo is expecting here for CDS is a single feature that spans the whole coding region. In the sample GFF3 file, all the CDSs have the same ID in column 9, and I think Apollo similarly considers a CDS a single feature, but just keeps track of the start of the first CDS section and the end of the last CDS section since the rest can be inferred from the exon locations.

With this patch, I'm able to load the GFF3 correctly, and it contains the CDS info from the file:

diff --git a/apollo/util.py b/apollo/util.py
--- a/apollo/util.py
+++ b/apollo/util.py
@@ -122,6 +122,25 @@ def _yieldGeneData(gene, disable_cds_recalculation=False, use_name=False):
     # # TODO: handle GO, Gene Product, Provenance
 
     if 'children' in current and gene.type == 'gene':
+        for mRNA in current['children']:
+            new_mRNA_children = []
+            new_cds = None
+            for feature in mRNA['children']:
+                if feature['type']['name'] == 'CDS':
+                    if new_cds:
+                        new_cds_start = new_cds['location']['fmin']
+                        new_cds_end = new_cds['location']['fmax']
+                        this_cds_start = feature['location']['fmin']
+                        this_cds_end = feature['location']['fmax']
+                        new_cds['location']['fmin'] = min(new_cds_start, this_cds_start)
+                        new_cds['location']['fmax'] = max(new_cds_end, this_cds_end)
+                    else:
+                        new_cds = feature
+                else:
+                    new_mRNA_children.append(feature)
+            if new_cds:
+                mRNA['children'] = new_mRNA_children
+                mRNA['children'].append(new_cds)
         # Only sending mRNA level as apollo is more comfortable with orphan mRNAs
         return current['children']
     else:

@hexylena
Copy link
Member

Oh wow @garrettjstevens thanks for looking into it as well!

What Apollo is expecting here for CDS is a single feature that spans the whole coding region.

that is a fascinating interpretation of the spec.

With this patch

Ok, we'll merge that in behind a flag then!

@aathbt
Copy link

aathbt commented Feb 26, 2024

I just encountered the same problem. The fixes you wrote allowed me to properly import the gff, but I now have several exons/transcripts I cannot delete (I get the same error than @mpoelchau). I'd appreciate any pointers on how to solve this!

@mpoelchau
Copy link
Author

As a temporary hack, we have been deleting CDS features from the gff3 before loading them. This is not ideal though (esp if your CDS start and stop don't follow Apollo's assumptions when it recalculates them).

@aathbt
Copy link

aathbt commented Feb 27, 2024

Alright, thanks. Related question: were you able to delete the user-created annotations that you uploaded via load_legacy_gff3? If so, how? I've tried this using both arrow organisms delete_features and the GUI, but I always end up with the update or delete on table "feature" violates foreign key constraint' error (same than in your first post).

@hexylena
Copy link
Member

hexylena commented Feb 27, 2024

@aathbt I was not able to remove the undeletable exons. The ebst result I had was by removing the CDSs, then one could get the exon almost deleted, but something was wrong in the database still. I'm not sure it's something we can fix via the API either

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants