Integrate UniProt fragments #42

Merged
merged 3 commits into from
Oct 14, 2020


bgyori
Contributor

@bgyori bgyori commented Sep 10, 2020

This PR extends the update_uniprot_proteins.py script to download and process protein chains and peptides into grounding entries. The approach taken here is to put these into the existing uniprot-proteins.tsv file with IDs formatted as [UniProtID]#[FragmentID] which is now "officially" supported by UniProt. Examples:

Angiotensin-2   P01019#PRO_0000032458   Homo sapiens
Angiotensin-2   P11859#PRO_0000032462   Mus musculus

This adds a total of around 50k new rows to the grounding file.
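
For illustration, here is a minimal Python sketch of how a consumer of the grounding file might split such an ID back into its parts. It assumes each row has three tab-separated columns (name, ID, species), as in the examples above. `GroundingEntry` and `parse_row` are hypothetical names and are not part of update_uniprot_proteins.py.

```python
from typing import NamedTuple, Optional


class GroundingEntry(NamedTuple):
    name: str
    uniprot_id: str
    fragment_id: Optional[str]  # None for a whole-protein entry
    species: str


def parse_row(line: str) -> GroundingEntry:
    """Parse one row, assuming three tab-separated columns: name, ID, species."""
    name, full_id, species = line.rstrip("\r\n").split("\t")
    # Fragment rows carry the fragment ID after a '#'; plain protein rows do not.
    uniprot_id, _, fragment = full_id.partition("#")
    return GroundingEntry(name, uniprot_id, fragment or None, species)


row = parse_row("Angiotensin-2\tP01019#PRO_0000032458\tHomo sapiens")
print(row.uniprot_id, row.fragment_id)
```

Rows without a `#` in the ID column come back with `fragment_id` set to `None`, so existing whole-protein entries can go through the same code path.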

This PR does not yet touch the NER files, and I have not yet written any tests or tried this against Reach. @JakeWolfe and @MihaiSurdeanu, would you be able to pick this up from here?

@MihaiSurdeanu
Contributor

Thank you @bgyori !

@JakeWolfe : can you please generate the NER files in this branch? And then publishLocal, and write a few unit tests in Reach to make sure that these new entities are visible?

@bgyori : should we keep the Protein Ontology fragments then?

@bgyori
Contributor Author

bgyori commented Sep 10, 2020

I suggest we keep the Protein Ontology for now; we can test how Reach works after these additions and see whether we need to remove or tweak anything further.

@MihaiSurdeanu
Contributor

@JakeWolfe : any update on this? I'd like to release soon. Thank you!

@JakeWolfe
Contributor

@MihaiSurdeanu I will get to this by the end of tonight.

@JakeWolfe
Contributor

Testing in reach after generating the new NER files and publishing locally is showing failed unit tests in TestEntities, specifically:

"ADAMTS18/ClvPrd is a protein fragment": (2 was not equal to 1) in reference to the number of mentions found
"CCL3L/ClvPrd is a protein fragment": (2 was not equal to 1) in reference to the number of mentions found

Reach now seems to split each of these into two mentions instead of the single mention expected previously. This can be verified from the MetaInfo of the ADAMTS18 and CCL3L mentions, which references uniprot-proteins.tsv.gz even though this was a Protein Ontology test. This would seem to indicate that there is overlap between the two files.
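
One way to check the suspected overlap directly is sketched below in Python. It assumes both KBs are gzipped TSV files with the entity name in the first column. The helper names, the case-insensitive comparison, and the sample rows (including the placeholder ID `PR:EXAMPLE`) are assumptions for illustration only.

```python
import gzip
from typing import Iterable, Set


def names_from_lines(lines: Iterable[str]) -> Set[str]:
    """Collect lowercased first-column names from tab-separated lines."""
    names = set()
    for line in lines:
        name = line.rstrip("\r\n").split("\t")[0]
        if name:
            names.add(name.lower())
    return names


def names_from_gzip(path: str) -> Set[str]:
    """Same as names_from_lines, but reading a gzipped TSV file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return names_from_lines(handle)


def overlap(lines_a: Iterable[str], lines_b: Iterable[str]) -> Set[str]:
    """Names that occur in both sets of KB lines."""
    return names_from_lines(lines_a) & names_from_lines(lines_b)


# Tiny in-memory example; these rows are illustrative, not real KB content.
kb_a = ["ADAMTS18\tQ8TE60\tHomo sapiens\n"]
kb_b = ["adamts18\tPR:EXAMPLE\n", "2K fragment\tPR:000036831\n"]
print(overlap(kb_a, kb_b))
```

On the real files, `names_from_gzip` could be called on each KB and the two sets intersected in the same way.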

@MihaiSurdeanu, @bgyori : Please let me know how to proceed

@MihaiSurdeanu
Contributor

Thanks @JakeWolfe!

To understand this problem please check and confirm that:

  • "ADAMTS18/ClvPrd" and "CCL3L/ClvPrd" show up as protein fragments in the NER KB.
  • The 4 tokens after tokenization (ADAMTS18, ClvPrd, CCL3L, ClvPrd) show up as protein names in the NER KB for proteins.

@bgyori
Contributor Author

bgyori commented Sep 21, 2020

I think ClvPrd is just a code for "cleavage product" used in the Protein Ontology and doesn't really have any "real" significance when it comes to recognizing entities in text. For instance "ADAMTS18/ClvPrd" is just a synthetic name given to an arbitrary "ADAMTS18 cleavage product".

@MihaiSurdeanu
Contributor

Thanks @bgyori !

@JakeWolfe: can you please check why we're identifying ClvPrd as proteins? Which KB contains it? @bgyori: we should remove it from there?

@JakeWolfe
Contributor

@MihaiSurdeanu I am now looking into this.

@MihaiSurdeanu
Contributor

MihaiSurdeanu commented Sep 22, 2020 via email

@bgyori
Contributor Author

bgyori commented Oct 6, 2020

Did you run into any issues with the changes? I'm happy to help, just let me know!

@MihaiSurdeanu
Contributor

@kwalcock: can you please pick up this thread? In particular, we want to merge this into master, but a couple of tests in Reach are failing. To understand this problem, we need to check and confirm that:

  • "ADAMTS18/ClvPrd" and "CCL3L/ClvPrd" show up as protein fragments in the NER KB.
  • The 4 tokens after tokenization (ADAMTS18, ClvPrd, CCL3L, ClvPrd) show up as protein names in the NER KB for proteins.

Can you please double check this?

@kwalcock
Member

kwalcock commented Oct 7, 2020

In case anyone else is listening, here's the written acknowledgement.

@MihaiSurdeanu
Contributor

I appreciate it @kwalcock !

@MihaiSurdeanu
Contributor

From @JakeWolfe:

ADAMTS18 was found in Gene_or_gene_product and Gene_or_gene_product_OLD
ClvPrd is in neither Gene_or_gene_product nor Gene_or_gene_product_OLD

ADAMTS18/ClvPrd is in neither

CCL3L1 and CCL3L3 were found in Gene_or_gene_product and Gene_or_gene_product_OLD, but CCL3L was not

CCL3L/ClvPrd is in neither

@MihaiSurdeanu
Contributor

@JakeWolfe: if you run the reach shell (the "shell" script), what is the actual output on the two failing sentences?
Thanks!

@JakeWolfe
Contributor

@MihaiSurdeanu
TEXT: ADAMTS18 / ClvPrd is a protein fragment
TOKENS: (0,ADAMTS18,NN), (1,and,CC), (2,ClvPrd,NN), (3,is,VBZ), (4,a,DT), (5,protein,NN), (6,fragment,NN)
ENTITY LABELS: (ADAMTS18,B-Gene_or_gene_product), (and,O), (ClvPrd,B-Gene_or_gene_product), (is,O), (a,O), (protein,O), (fragment,O)

LEMMAS: adamts18 and clvprd be a protein fragment
roots: 6
outgoing:
0: (1,cc) (2,conj_and)
1:
2:
3:
4:
5:
6: (0,nsubj) (2,nsubj) (3,cop) (4,det) (5,compound)
incoming:
0: (6,nsubj)
1: (0,cc)
2: (0,conj_and) (6,nsubj)
3: (6,cop)
4: (6,det)
5: (6,compound)
6:

ENTITIES: 2

MENTION TEXT: ADAMTS18
LABELS: List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController)
DISPLAY LABEL: Protein
------------------------------
RULE => ner-gene_or_gene_product-entities
TYPE => CorefTextBoundMention
------------------------------
GROUNDING: <KBResolution: ADAMTS21, uniprot, Q8TE60, homo sapiens, <IMKBMetaInfo: uniprot, uniprot-proteins.tsv.gz, http://identifiers.org/uniprot/, MIR:00100164, sp=true, f=false, p=true>>

CONTEXT: NONE
------------------------------

MENTION TEXT: ClvPrd
LABELS: List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController)
DISPLAY LABEL: Protein
------------------------------
RULE => ner-gene_or_gene_product-entities
TYPE => CorefTextBoundMention
------------------------------
GROUNDING: <KBResolution: ClvPrd, uaz, UAZ00001, , <IMKBMetaInfo: uaz, , , , sp=false, f=false, p=false>>

CONTEXT: NONE
------------------------------

EVENTS: 0

==================================================

TEXT: CCL3L / ClvPrd is a protein fragment
TOKENS: (0,CCL3L,NN), (1,and,CC), (2,ClvPrd,NN), (3,is,VBZ), (4,a,DT), (5,protein,NN), (6,fragment,NN)
ENTITY LABELS: (CCL3L,B-Gene_or_gene_product), (and,O), (ClvPrd,B-Gene_or_gene_product), (is,O), (a,O), (protein,O), (fragment,O)

LEMMAS: ccl3l and clvprd be a protein fragment
roots: 6
outgoing:
0: (1,cc) (2,conj_and)
1:
2:
3:
4:
5:
6: (0,nsubj) (2,nsubj) (3,cop) (4,det) (5,compound)
incoming:
0: (6,nsubj)
1: (0,cc)
2: (0,conj_and) (6,nsubj)
3: (6,cop)
4: (6,det)
5: (6,compound)
6:

ENTITIES: 2

MENTION TEXT: CCL3L
LABELS: List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController)
DISPLAY LABEL: Protein
------------------------------
RULE => ner-gene_or_gene_product-entities
TYPE => CorefTextBoundMention
------------------------------
GROUNDING: <KBResolution: CCL3L, uaz, UAZ00002, , <IMKBMetaInfo: uaz, , , , sp=false, f=false, p=false>>

CONTEXT: NONE
------------------------------

MENTION TEXT: ClvPrd
LABELS: List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController)
DISPLAY LABEL: Protein
------------------------------
RULE => ner-gene_or_gene_product-entities
TYPE => CorefTextBoundMention
------------------------------
GROUNDING: <KBResolution: ClvPrd, uaz, UAZ00001, , <IMKBMetaInfo: uaz, , , , sp=false, f=false, p=false>>

CONTEXT: NONE
------------------------------

EVENTS: 0

==================================================

@kwalcock
Member

kwalcock commented Oct 8, 2020

FWIW, the failing tests that I see when running reach on a locally published version of bioresources from the uniprot_fragments branch are these in TestHyphenedEvents and TestApi:

[info] - should have a positive activation of levels of EM by TFs, TWIST1, SNAIL1, SLUG, ZEB1, FOXC2 and CD45 *** FAILED *** (238 milliseconds)
[info]   false was not true (TestHyphenedEvents.scala:17)


[info] - should return 9 positive activation and 3 phosphorylation results from NXML test *** FAILED *** (10 seconds, 407 milliseconds)
[info]   Vector(org.clulab.reach.mentions.CorefEventMention@ed6620fa, org.clulab.reach.mentions.CorefEventMention@1aa155ea, org.clulab.reach.mentions.CorefEventMention@15fde8d6, org.clulab.reach.mentions.CorefEventMention@f71c0834, org.clulab.reach.mentions.CorefEventMention@32c124b2, org.clulab.reach.mentions.CorefEventMention@efacda5c, org.clulab.reach.mentions.CorefEventMention@31d5ece8) had size 7 instead of expected size 9 (TestApi.scala:124)

In reach it is processors/build.sbt that was changed to use the local copy of bioprocessors. Perhaps I've missed some detail.

@MihaiSurdeanu
Contributor

Hmm... @JakeWolfe: can you please list here the steps you used to test this PR?

@JakeWolfe
Contributor

@MihaiSurdeanu @kwalcock In the uniprot_fragments branch of bioresources I built the KBs using sbt publishLocal. I then went to reach, changed the bioresources version to the snapshot produced by publishLocal, and ran the shell commands. In addition, I ran the TestEntities unit test.

@kwalcock
Member

kwalcock commented Oct 8, 2020

So many things could go wrong. I wonder about "I built the KBs using sbt publishLocal". I didn't rebuild anything in the kb directory that I know of. There are instructions for rebuilding the ner directory using reach and the ner_kb.sh script at https://github.com/clulab/bioresources. I assume that's how the new files in ner got to github. Are those the same as your local version (i.e., are we using the same files)?

bioresources/master is set to version 1.1.34-SNAPSHOT so in the uniprot_fragments I used 1.1.35-SNAPSHOT just in case. That could get messed up.

On one or the other of these projects when testing with Windows, it's important to add -Dfile.encoding=UTF-8. I'm not sure which project it was.

I do notice that some CR/LFs are creeping into our files and hope that's not somehow involved.
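
A quick way to see whether CR/LF line endings have crept into a KB file is sketched below in Python. The helper names are hypothetical, and the sketch assumes that files ending in `.gz` are gzip-compressed and small enough to read into memory.

```python
import gzip


def count_crlf(data: bytes) -> int:
    """Number of Windows-style (CR/LF) line endings in a byte string."""
    return data.count(b"\r\n")


def count_crlf_in_file(path: str) -> int:
    """Count CR/LF pairs in a file, decompressing it first if it ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        return count_crlf(handle.read())


sample = b"Angiotensin-2\tP01019#PRO_0000032458\tHomo sapiens\r\n"
print(count_crlf(sample))
```

A nonzero count for a file under the kb or ner directories would point to a file that was written or checked out with Windows line endings.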

@JakeWolfe
Contributor

JakeWolfe commented Oct 8, 2020

I am running on Windows, so these tests may have been affected. The NER files in the uniprot_fragments branch are the same as the ones I am testing on.

How should we move forward with testing?

@kwalcock
Member

kwalcock commented Oct 8, 2020

I reran the tests without the -Dfile.encoding=UTF-8 and there was no change. I suppose it could have been important when generating the files in ner.

Tomorrow I'm going to run the tests on a pristine git clone and on Linux and see if it makes a difference. I want to make sure that we're fixing the right problem and not some strange configuration issue. Chances are I can't make the exact same mistakes twice.

@MihaiSurdeanu
Contributor

Every time a file in the src/main/resources/org/clulab/reach/kb/ folder changes, we need to rerun the ner_kb.sh script to regenerate the files under .../kb/ner, which is what processors actually uses.
Has this been done for this PR?

@JakeWolfe
Contributor

Yes, the KBs were updated in push 7900aff

@kwalcock
Member

kwalcock commented Oct 9, 2020

This is completely weird. It sure looks like KBGenerator.scala (https://github.com/clulab/reach/blob/master/processors/src/main/scala/org/clulab/processors/bionlp/ner/KBGenerator.scala) expects the knowledge bases to have three columns:

https://github.com/clulab/reach/blob/5777d66448f1ccb737e69bedae5f8f8073e562b7/processors/src/main/scala/org/clulab/processors/bionlp/ner/KBGenerator.scala#L28-L30

https://github.com/clulab/reach/blob/5777d66448f1ccb737e69bedae5f8f8073e562b7/processors/src/main/scala/org/clulab/processors/bionlp/ner/KBGenerator.scala#L151-L152

but the file I see (protein-ontology-fragments.tsv after gunzip) has only two columns:

14-3-3 protein gamma proteolytic cleavage product	PR:000021868
155 kDa platelet multimerin (human)	PR:000050084
2K	PR:000036831
2K fragment	PR:000036831

This results in exceptions for both Linux and Windows:

17:31:11.709 [run-main-0] INFO  o.c.p.bionlp.BioNLPProcessor - Converting protein-ontology-fragments...
[error] (run-main-0) java.lang.ArrayIndexOutOfBoundsException: 2
[error] java.lang.ArrayIndexOutOfBoundsException: 2
[error] 	at org.clulab.processors.bionlp.ner.KBGenerator$.containsValidSpecies(KBGenerator.scala:151)
[error] 	at org.clulab.processors.bionlp.ner.KBGenerator$.convertKB(KBGenerator.scala:98)
[error] 	at org.clulab.processors.bionlp.ner.KBGenerator$$anonfun$main$2.apply(KBGenerator.scala:53)
[error] 	at org.clulab.processors.bionlp.ner.KBGenerator$$anonfun$main$2.apply(KBGenerator.scala:52)
[error] 	at scala.collection.immutable.List.foreach(List.scala:392)
[error] 	at org.clulab.processors.bionlp.ner.KBGenerator$.main(KBGenerator.scala:52)
[error] 	at org.clulab.processors.bionlp.ner.KBGenerator.main(KBGenerator.scala)
[error] 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error] 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error] 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error] 	at java.lang.reflect.Method.invoke(Method.java:498)
[error] stack trace is suppressed; run last Compile / bgRunMain for the full output
[error] Nonzero exit code: 1
[error] (Compile / runMain) Nonzero exit code: 1
[error] Total time: 434 s (07:14), completed Oct 8, 2020 5:31:11 PM
sbt:reach-exe> 

It looks like this KBGenerator file was copied in from somewhere else on 4/6/2020 with this commit: clulab/reach@3df3e69, and I don't know what it looked like before.

However, it looks like protein-ontology-fragments.tsv.gz was also added recently, with a43b98c, and that it wasn't previously compressed, which might imply that it didn't even go through this process. I don't know.

Something is mixed up, maybe me.
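
The mismatch described above (a reader that indexes a third column, fed rows that have only two) can be caught before conversion with a check like the following Python sketch. The function name and the default of three expected columns are assumptions drawn from this comment, not from KBGenerator itself.

```python
from typing import Iterable, Iterator, Tuple


def short_rows(lines: Iterable[str], expected_columns: int = 3) -> Iterator[Tuple[int, int]]:
    """Yield (line_number, column_count) for rows with too few tab-separated fields."""
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < expected_columns:
            yield number, len(fields)


# The first two rows are copied from the two-column file shown above;
# the third is a three-column row in the uniprot-proteins.tsv style.
rows = [
    "2K\tPR:000036831\n",
    "2K fragment\tPR:000036831\n",
    "Angiotensin-2\tP01019#PRO_0000032458\tHomo sapiens\n",
]
print(list(short_rows(rows)))
```

Any row reported here would make an unguarded access to the third field fail, which is consistent with the ArrayIndexOutOfBoundsException: 2 in the log.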

@kwalcock
Member

kwalcock commented Oct 9, 2020

I get it. There's also a PR on reach, clulab/reach#693, that contains the updated code needed to process the file.

@bgyori
Contributor Author

bgyori commented Oct 11, 2020

Hopefully, I'm not confusing things further by saying that I made a number of changes to the Protein Ontology resource file here in bioresources before it was merged in #41, and pushed the commit clulab/reach@cb71708 to the PR clulab/reach#693 to adapt the tests to these changes. In addition to the possible issue with the processor versions, it's possible @JakeWolfe didn't use the latest state of the PR branch in reach for testing, causing these failures.

@kwalcock
Member

kwalcock commented Oct 13, 2020

There are unpublished branches of bioresources and reach called proonto. These steps can be followed to reproduce the failing tests:

  1. git clone http://github.com/clulab/bioresources
  2. cd bioresources
  3. git checkout proonto
  4. cd ..
  5. git clone http://github.com/clulab/reach
  6. cd reach
  7. git checkout proonto
  8. cd ../bioresources/
  9. ./ner_kb.sh
  10. sbt publishLocal
  11. cd ../reach
  12. vim processors/build.sbt # I corrected a mistake here
  13. <change bioresources dependency to 1.1.34-SNAPSHOT>
  14. sbt main/test
  15. sbt export/test

Fixes can be pushed back to these same branches.

@kwalcock
Member

BTW these are the errors in TestHyphenedEvents and TestApi, not the ones related to TestEntities.

@JakeWolfe
Contributor

@MihaiSurdeanu I am not familiar with that part of reach, how should we approach fixing this?

@MihaiSurdeanu
Contributor

MihaiSurdeanu commented Oct 13, 2020 via email

@kwalcock
Member

Anyway, I do have a bunch of outstanding commits for reach that update sbt and the plugin dependencies to new versions and make changes to the scripts. If they are needed sooner, please let me know, but I didn't want to make the situation more complicated by trying to merge them now.

@MihaiSurdeanu
Contributor

MihaiSurdeanu commented Oct 13, 2020 via email

@kwalcock
Member

Both Ben's changes to bioresources and the ones to reach are in the proonto branches. The ones to reach are a little suspect because they seem to solve the problems (with things like ClvPrd) by changing the tests, but they are there.

@bgyori
Contributor Author

bgyori commented Oct 13, 2020 via email

@bgyori
Contributor Author

bgyori commented Oct 13, 2020

To be more specific and hopefully avoid further confusion... Here is the specific commit that makes a change to the script generating the protein-ontology-fragments.tsv.gz file:

2e6cb2a#diff-688598b0a184d86ab4292c4b56c1fa02b4704b102f672c14e51c3e8dad69308bR69-R73

    # Remove entries like "YWHAB/ClvPrd", these are not useful
    # synonyms
    if re.match(r'^([^/]+)/(ClvPrd|UnMod|iso:\d+/UnMod)$', synonym):
        return False

As you can see here, I remove entries like "YWHAB/ClvPrd". Given this change, I don't see why my change in Reach in clulab/reach@cb71708, which replaces the tests checking for ClvPrd, might be considered suspect.
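
To make the effect of that pattern concrete, here is a small self-contained Python sketch that applies the same regular expression to a few sample synonyms. The wrapper name `is_useful_synonym` is hypothetical (the commit only shows the `re.match` check inside a larger function), and "YWHAB/iso:1/UnMod" is an illustrative string chosen to exercise the third alternative of the pattern.

```python
import re

# Same pattern as in the commit quoted above.
SYNTHETIC_NAME = re.compile(r'^([^/]+)/(ClvPrd|UnMod|iso:\d+/UnMod)$')


def is_useful_synonym(synonym: str) -> bool:
    """Return False for synthetic names such as "YWHAB/ClvPrd"."""
    return SYNTHETIC_NAME.match(synonym) is None


samples = [
    "YWHAB/ClvPrd",
    "ADAMTS18/ClvPrd",
    "YWHAB/UnMod",
    "YWHAB/iso:1/UnMod",
    "2K fragment",
    "14-3-3 protein gamma proteolytic cleavage product",
]
for synonym in samples:
    print(synonym, is_useful_synonym(synonym))
```

The first four samples are rejected and the last two are kept, so a test that expects "ADAMTS18/ClvPrd" to be found as a single fragment mention no longer has a KB entry to match.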

@MihaiSurdeanu
Contributor

Also @kwalcock,
When is this file generated: src/main/resources/org/clulab/reach/kb/ner/model.ser.gz?
This is a CompactLexicon that you implemented, but it is not refreshed when running ner_kb.sh.
Do we need it?

@MihaiSurdeanu
Contributor

> To be more specific and hopefully avoid further confusion... Here is the specific commit that makes a change to the script generating the protein-ontology-fragments.tsv.gz file:
>
> 2e6cb2a#diff-688598b0a184d86ab4292c4b56c1fa02b4704b102f672c14e51c3e8dad69308bR69-R73
>
>     # Remove entries like "YWHAB/ClvPrd", these are not useful
>     # synonyms
>     if re.match(r'^([^/]+)/(ClvPrd|UnMod|iso:\d+/UnMod)$', synonym):
>         return False
>
> As you can see here, I "Remove entries like "YWHAB/ClvPrd". Given this change, I don't see why my change in Reach in clulab/reach@cb71708 to replace tests checking for ClvPrd might be considered suspect.

I agree.

@MihaiSurdeanu
Contributor

Hey @kwalcock: the shell script in this branch fails with this error:

[info] Running org.clulab.reach.ReachShell
Loading ReachSystem ...
[dynet] random seed: 4052734219
[dynet] allocating memory: 512,512,512,512MB
[dynet] memory allocation done.
19:09:03.452 [run-main-0] INFO  o.c.p.m.DeepLearningPolarityClassifier - Loading saved model SavedLSTM_WideBound_u_tag ...
19:09:04.524 [run-main-0] INFO  o.c.p.m.DeepLearningPolarityClassifier - Loading model finished!
[error] (run-main-0) java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile$1.apply(ReachEntityLookup.scala:35)
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile$1.apply(ReachEntityLookup.scala:33)
	at scala.Option.map(Option.scala:146)
	at org.clulab.reach.grounding.ReachEntityLookup.org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile(ReachEntityLookup.scala:33)
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$1.apply(ReachEntityLookup.scala:80)
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$1.apply(ReachEntityLookup.scala:80)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
	at org.clulab.reach.grounding.ReachEntityLookup.<init>(ReachEntityLookup.scala:80)
	at org.clulab.reach.ReachSystem.<init>(ReachSystem.scala:37)
	at org.clulab.reach.ReachShell$.delayedEndpoint$org$clulab$reach$ReachShell$1(ReachShell.scala:27)
	at org.clulab.reach.ReachShell$delayedInit$body.apply(ReachShell.scala:15)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at org.clulab.reach.ReachShell$.main(ReachShell.scala:15)
	at org.clulab.reach.ReachShell.main(ReachShell.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
Caused by: java.io.IOException: Stream closed
	at java.io.BufferedInputStream.getInIfOpen(BufferedInputStream.java:159)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
	at java.util.zip.CheckedInputStream.read(CheckedInputStream.java:59)
	at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:266)
	at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258)
	at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
	at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
	at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
	at org.clulab.reach.grounding.ReachKBUtils$.sourceFromResource(ReachKBUtils.scala:67)
	at org.clulab.reach.grounding.TsvIMKBFactory.org$clulab$reach$grounding$TsvIMKBFactory$$loadFromKBDir(TsvIMKBFactory.scala:45)
	at org.clulab.reach.grounding.TsvIMKBFactory$$anonfun$make$1.apply(TsvIMKBFactory.scala:23)
	at org.clulab.reach.grounding.TsvIMKBFactory$$anonfun$make$1.apply(TsvIMKBFactory.scala:22)
	at scala.Option.foreach(Option.scala:257)
	at org.clulab.reach.grounding.TsvIMKBFactory.make(TsvIMKBFactory.scala:22)
	at org.clulab.reach.grounding.ReachIMKBMentionLookups$.staticProteinFragmentKBML(ReachIMKBMentionLookups.scala:181)
	at org.clulab.reach.grounding.ReachIMKBMentionLookups$.<init>(ReachIMKBMentionLookups.scala:37)
	at org.clulab.reach.grounding.ReachIMKBMentionLookups$.<clinit>(ReachIMKBMentionLookups.scala)
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile$1.apply(ReachEntityLookup.scala:35)
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile$1.apply(ReachEntityLookup.scala:33)
	at scala.Option.map(Option.scala:146)
	at org.clulab.reach.grounding.ReachEntityLookup.org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile(ReachEntityLookup.scala:33)
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$1.apply(ReachEntityLookup.scala:80)
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$1.apply(ReachEntityLookup.scala:80)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
	at org.clulab.reach.grounding.ReachEntityLookup.<init>(ReachEntityLookup.scala:80)
	at org.clulab.reach.ReachSystem.<init>(ReachSystem.scala:37)
	at org.clulab.reach.ReachShell$.delayedEndpoint$org$clulab$reach$ReachShell$1(ReachShell.scala:27)
	at org.clulab.reach.ReachShell$delayedInit$body.apply(ReachShell.scala:15)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at org.clulab.reach.ReachShell$.main(ReachShell.scala:15)
	at org.clulab.reach.ReachShell.main(ReachShell.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)

It works in master.
Can you please take a look?

@kwalcock
Member

Are you using proonto on both bioresources and reach? Between git and sbt it is extremely easy for things to get mixed up. It was necessary to use "sbt reload", "sbt clean", and "git status" several times when I experimented and to even remove the .ivy2 directory. "git status" on jenny where I was working shows only the *.gz files having been modified and processors/build.sbt. Plus there would be the SNAPSHOT bioresources in .ivy2 local. I would remove the 1.1.33 version from cache just to be safe.

I think the error message is actually complaining that a file is missing. I got it once with a mismatched pair of bioresources and reach. Not much help, I know.

@kwalcock
Member

Thanks, @bgyori. I suspect myself more than anyone or any thing else.

@MihaiSurdeanu
Contributor

Thanks @kwalcock: I'll start from scratch...

@MihaiSurdeanu
Contributor

> Thanks @kwalcock: I'll start from scratch...

Thanks! That fixed it.

@MihaiSurdeanu
Contributor

@bgyori: the failing test passes when I replace "EM" with a known protein such as "KRas". So, "EM" is no longer recognized as a GGP in this branch. Is this on purpose? Thanks!

@kwalcock
Member

It looks like that model.ser.gz should be refreshed by ner_kb.sh with

# generate the serialized LexiconNER model now
sbt 'runMain org.clulab.processors.bionlp.ner.KBLoader ../bioresources/src/main/resources/org/clulab/reach/kb/ner/model.ser.gz'

I think I've noticed it sometimes not being refreshed and hope it was because the files used to build it hadn't changed. That it doesn't change might be a sign of something wrong. Here's what I think happens:

The first time reach is called, it depends on the published bioresources 1.1.33. This seems to result in an unchanged model.ser.gz. After that, reach is updated to depend on bioresources 1.1.34-SNAPSHOT. If ner_kb.sh is called again, the file model.ser.gz will change. It is probably this version that should be published for real. I'm assuming that another round doesn't result in any more changes, but I'm not yet sure that's the case. Perhaps something can be done about the circle.
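
Whether a given run of ner_kb.sh actually changed model.ser.gz can be checked by comparing digests before and after the run. The Python sketch below hashes the decompressed payload so that differences in the gzip header alone are not reported as changes. It assumes the file is a regular gzip stream, and the function names are hypothetical.

```python
import gzip
import hashlib
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of everything readable from a binary stream."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_gzip_payload(path: str) -> str:
    """Digest of the decompressed contents of a gzip file."""
    with gzip.open(path, "rb") as handle:
        return sha256_of_stream(handle)
```

Calling `sha256_of_gzip_payload` on model.ser.gz before and after each round, and comparing the results, would show whether a second round still changes the file.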

@bgyori
Contributor Author

bgyori commented Oct 14, 2020

> @bgyori: the failing test passes when I replace "EM" with a known protein such as "KRas". So, "EM" is no longer recognized as a GGP in this branch. Is this on purpose? Thanks!

I looked into this, and found that "EM" was previously incorrectly grounded to PUBCHEM:6426949, and is one of a group of two-letter acronyms that are listed by CHEBI and PubChem as synonyms for pairs of amino acids. Since these are virtually never correct as synonyms for the purposes of text mining, I removed them in #36.
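
As a rough illustration of the kind of filter involved, the Python sketch below flags two-letter strings in which both letters are standard one-letter amino acid codes. This is only an assumption about how such acronyms could be detected; the actual change in #36 may have been implemented differently, and all names in the sketch are hypothetical.

```python
# One-letter codes of the 20 standard amino acids.
AMINO_ACID_CODES = "ACDEFGHIKLMNPQRSTVWY"

# Every ordered pair of codes, e.g. "EM" for a glutamate-methionine pair.
AMINO_ACID_PAIRS = {a + b for a in AMINO_ACID_CODES for b in AMINO_ACID_CODES}


def is_amino_acid_pair_acronym(name: str) -> bool:
    """True if the name is exactly two uppercase amino acid codes."""
    return name in AMINO_ACID_PAIRS


for name in ["EM", "KRas", "em"]:
    print(name, is_amino_acid_pair_acronym(name))
```

In practice such a check would presumably be applied only to synonyms coming from the chemical resources, since some two-letter strings are also legitimate names in other KBs.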

@MihaiSurdeanu
Contributor

Thanks! Then I think we can finally merge these branches into their respective masters. @kwalcock: can you please do the honors?

Thank you @bgyori, @kwalcock, and @JakeWolfe for your help with this thorny branch!

@kwalcock
Member

I haven't noticed any updates getting as far as github that address the failing tests.

@MihaiSurdeanu
Contributor

The update is in the proonto branch of Reach. This bioresources branch is fine as is.

@kwalcock
Member

My bad. Thanks.

@kwalcock
Member

I'm planning to merge this even though I'm not completely sure the .gz files will be the right ones in the end. They were created using the reach branch proonto, but we'd probably rather have the current reach master create them. In addition, I need to make some changes to files in both projects. I doubt that everything can be done in a single commit anyway. This particular master won't be published however until bioresources and reach are synchronized. I will edit the CHANGES file when we're ready to publish. I shouldn't be too long.

@MihaiSurdeanu
Contributor

MihaiSurdeanu commented Oct 14, 2020 via email

@kwalcock kwalcock merged commit 9b39c28 into master Oct 14, 2020
@kwalcock kwalcock deleted the uniprot_fragments branch October 14, 2020 22:20
@kwalcock
Member

> Also @kwalcock,
> When is this file generated: src/main/resources/org/clulab/reach/kb/ner/model.ser.gz
> This is a CompactLexicon that you implemented. But it is not refreshed when running ner_kb.sh.
> Do we need it?

I too wonder whether we need it and would like to get rid of it. It is a serialized object, so it depends on the Java and Scala versions, even though it is saved in bioresources, which is independent of those versions. The code that does the serialization lives in reach, and the object that is serialized is in processors. The chances that these all line up so that it is a usable resource seem very slim. I added a test for the file, and it will fail if the Scala version is changed. I don't know whether anything did manage to use it, though, or whether anything might break without it.

It seems like the file Gene_or_gene_product-OLD.tsv.gz is something better suited to github history than to maven and should be deleted.

@MihaiSurdeanu
Contributor

MihaiSurdeanu commented Oct 15, 2020 via email
