Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve overrides and fix outdated GO IDs #30

Merged
merged 17 commits into from
Apr 14, 2020
Merged

Conversation

bgyori
Copy link
Contributor

@bgyori bgyori commented Apr 12, 2020

This PR makes several changes to the NER-Grounding-Override.tsv:

  • Replace old families mapped to UAZ IDs (these were imported from BEL) with FamPlex
  • Replace some redundant overrides to PF with FamPlex
  • Replace outdated HMDB overrides with other name spaces
  • Fix overrides to specific proteins that should actually be mapped to families
  • Fix a few incorrect overrides that I found as I was examining things
    (I also added a one-off script to help find these cases automatically).

The PR also updates bio_process.tsv and GO-subcellular-locations.tsv to replace deprecated GO IDs with current ones. (Note that I have been planning to implement a script to fully update these resource files from GO, not just replace the outdated IDs in the old file but this is a good first step for the next release).

@MihaiSurdeanu if you see any surprising effects on the Reach tests you have been updating, let me know.

@bgyori
Copy link
Contributor Author

bgyori commented Apr 12, 2020

Actually, this might change the NER files so I'll add those

Copy link
Contributor

@MihaiSurdeanu MihaiSurdeanu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I see what is happening with the tests, or wait?

@bgyori
Copy link
Contributor Author

bgyori commented Apr 12, 2020

Here is the summary of the results and my analysis. Let me try to fix what I can and I'll comment if there are any left that I'm unsure how to fix.

[info] *** 9 TESTS FAILED ***
[error] Failed tests:
[error] 	org.clulab.reach.TestModelEntities
[error] 	org.clulab.reach.TestFamilyResolutions
[error] 	org.clulab.reach.TestOverrides
[info] - should fail for protein complex Bioentities *** FAILED *** (2 milliseconds)
[info]   Some(ArrayBuffer(<KBResolution: Complex IV, be, COX, , <IMKBMetaInfo: be, famplex.tsv.gz, https://identifiers.org/fplx/, , sp=false, f=true, p=false>>, <KBResolution: cytochrome c oxidase, be, COX, , <IMKBMetaInfo: be, famplex.tsv.gz, https://identifiers.org/fplx/, , sp=false, f=true, p=false>>, <KBResolution: COX, be, COX, , <IMKBMetaInfo: be, famplex.tsv.gz, https://identifiers.org/fplx/, , sp=false, f=true, p=false>>)) was not empty (TestFamilyResolutions.scala:325)

-> this test is trying to make a distinction between FamPlex families and complexes but there's no need to do so, labeling them as Family is good.

[info] - should have labeled all mentions as Family in the text ACOX, BMP, Cadherin, CRISP, DDR,
[info]                 COX4, COX6a, COX6b, COX7a, COX7b,
[info]                 COX8, DVL, ETS, FGF, FLOT,
[info]                 GATA, HSP90, IGFBP, IL1, IRS,
[info]                 MAF, NOTCH, PKI, RAS, SAA,
[info]                 and TGFB are unchanged Families. *** FAILED *** (0 milliseconds)
[info]   25 was not equal to 26 (TestOverrides.scala:170)
[info] - should have display labeled all mentions as Some(Family) *** FAILED *** (0 milliseconds)
[info]   25 was not equal to 26 (TestOverrides.scala:175)
[info] - should match expected grounding IDs in the text ACOX, BMP, Cadherin, CRISP, DDR,
[info]                 COX4, COX6a, COX6b, COX7a, COX7b,
[info]                 COX8, DVL, ETS, FGF, FLOT,
[info]                 GATA, HSP90, IGFBP, IL1, IRS,
[info]                 MAF, NOTCH, PKI, RAS, SAA,
[info]                 and TGFB are unchanged Families. *** FAILED *** (0 milliseconds)
[info]   false was not true (TestOverrides.scala:188)

-> Here I reviewed MAF yesterday and it is the symbol of a gene so I think it will currently resolve to that instead of a family. There are also changes to the groundings here, for instance, COX4 will now resolve to FamPlex's COX4.

[info] - should match expected grounding IDs in the text SMAD, SMAD2/3, SMAD 2/3, and TGFB are important Families. *** FAILED *** (0 milliseconds)
[info]   false was not true (TestOverrides.scala:188)

-> TGFB is a FamPlex family

[info] - should match expected grounding IDs in the text Cas8, EGF, EGFR, ErbB, ERK5, GSK3beta are GGPs. *** FAILED *** (1 millisecond)
[info]   false was not true (TestOverrides.scala:188)

-> ERbB is actually a family

[info] - should have labeled all mentions as Gene_or_gene_product in the text Cas8, EGF, EGFR, ErbB, ERK5, GSK3beta are GGPs. *** FAILED *** (0 milliseconds)
[info]   5 was not equal to 6 (TestOverrides.scala:170)
[info] - should have display labeled all mentions as Some(Protein) *** FAILED *** (0 milliseconds)
[info]   5 was not equal to 6 (TestOverrides.scala:175)

-> Same as the one before, ERbB is a family

[info] - should have GPP label *** FAILED *** (32 milliseconds)
[info]   0 was not equal to 1 (TestModelEntities.scala:87)

-> SAPK is actually a family

@bgyori
Copy link
Contributor Author

bgyori commented Apr 12, 2020

I fixed almost all the tests and pushed the changes to the bio31 branch of Reach. I have one question though (for the one remaining failure): there is a test for the string "FLOT" with the expectation that it should ground to FamPlex FLOT. The current grounding entries are relevant here:
In famplex.tsv

485 FLOT»···FLOT
1757 flotillin»··FLOT

In PFAM-families.tsv

12927 Flot»···PF15975                                                                 
12928 Flotillin»··PF15975                                                             

There is currently nothing for this entity in NER-Grounding-Override.tsv.

Why could it be that "FLOT" gets grounded to PF15975 under these circumstances (unless I'm missing something unrelated to the grounding logic)?

@MihaiSurdeanu
Copy link
Contributor

There were a couple of issues:

--- a/main/src/main/scala/org/clulab/reach/grounding/ReachEntityLookup.scala
+++ b/main/src/main/scala/org/clulab/reach/grounding/ReachEntityLookup.scala
@@ -106,9 +106,9 @@ class ReachEntityLookup {
   )

   val familySeq: KBSearchSequence = extraKBs ++ Seq(
+    staticProteinFamilyOrComplex,           // FamPlex families and complexes
     StaticProteinFamily,                    // PFAM families
     StaticProteinFamily2,                   // InterPro families
-    staticProteinFamilyOrComplex,           // FamPlex families and complexes
     ModelGendProteinAndFamily
   )

diff --git a/processors/src/main/resources/reference.conf b/processors/src/main/resources/reference.conf
index edf821d76..cf84b9c50 100644
--- a/processors/src/main/resources/reference.conf
+++ b/processors/src/main/resources/reference.conf
@@ -8,6 +8,7 @@ kbloader: {
   # NB: Order is important as it indicates priority!
   nerKBs: [
     "org/clulab/reach/kb/ner/Gene_or_gene_product.tsv.gz",
+    "org/clulab/reach/kb/ner/FamilyOrComplex.tsv.gz",
     "org/clulab/reach/kb/ner/Family.tsv.gz",
     "org/clulab/reach/kb/ner/Cellular_component.tsv.gz",
     "org/clulab/reach/kb/ner/Simple_chemical.tsv.gz",

I had to prioritize FamPlex over the other Family KBs in ReachEntityLookup. Add, a hidden issue, add FamilyOrComplex to the NER list of KBs. Note that these are configured separately because grounding is a different component. Also, the last issue did not impact FLOT because PFAM had the string. But it was a bug nonetheless.

FLOT is now correctly grounded. Please check. Is my assumption that FamPlex should be prioritized over PFAM for grounding correct?

@bgyori
Copy link
Contributor Author

bgyori commented Apr 12, 2020

I see, I don't fully understand why FamilyOrComplex comes into the picture though since we decided to put FamPlex entries under Family, and in ner_kb.config, we have:

famplex»···Family

I thought that what that meant was that ner/Family.tsv.gz would contain entries from both FamPlex and Pfam. So where does ner/FamilyOrComplex.tsv.gz come into the picture?

@MihaiSurdeanu
Copy link
Contributor

You're right. The change in reference.conf is not needed. I suspect FamilyOrComplex.tsv.gz was a left over from a previous version, which confused me. I'll revert that change.
But the grounding change is what was needed for this issue.

@bgyori
Copy link
Contributor Author

bgyori commented Apr 13, 2020

I just ran the tests again, and I think the reordering of the priority now broke another override test:

[info] - should match expected grounding IDs in the text Activin A, Activin AB, Inhibin A, Inhibin B,
[info]                 AMPK alpha1beta1gamma1, AMPK alpha1beta1gamma2, AMPK alpha1beta1gamma3,
[info]                 AMPK alpha1beta2gamma1, AMPK alpha1beta2gamma2, AMPK alpha2beta1gamma1,
[info]                 AMPK alpha2beta2gamma1, AMPK alpha2beta2gamma2, AMPK alpha2beta2gamma3,
[info]                 AMPK a1b1g1, AMPK a1b1g2, AMPK a1b1g3,
[info]                 AMPK a1b2g1, AMPK a1b2g2, AMPK a2b1g1,
[info]                 AMPK a2b2g1, AMPK a2b2g2, AMPK a2b2g3,
[info]                 alpha1beta1gamma1, alpha1beta1gamma2, alpha1beta1gamma3,
[info]                 alpha1beta2gamma1, alpha1beta2gamma2, alpha2beta1gamma1,
[info]                 alpha2beta2gamma1, alpha2beta2gamma2, and alpha2beta2gamma3
[info]                 are important complexes. *** FAILED *** (10 milliseconds)
[info]   false was not true (TestOverrides.scala:188)

I'll look into this and see what needs to be changed.

@MihaiSurdeanu
Copy link
Contributor

MihaiSurdeanu commented Apr 13, 2020 via email

@bgyori
Copy link
Contributor Author

bgyori commented Apr 13, 2020

Okay, I pushed a change for that test and now all the Reach tests are passing.

@MihaiSurdeanu
Copy link
Contributor

I confirm that tests pass. I'll release next.

@MihaiSurdeanu MihaiSurdeanu merged commit 2a46523 into master Apr 14, 2020
@MihaiSurdeanu MihaiSurdeanu deleted the improve_overrides branch April 14, 2020 15:13
@MihaiSurdeanu
Copy link
Contributor

Bioresources has been released, and the reach branch was merged in master.
Thank you for your help @bgyori !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants