Integrate UniProt fragments #42

bgyori · 2020-09-10T03:14:55Z

This PR extends the update_uniprot_proteins.py script to download and process protein chains and peptides into grounding entries. The approach taken here is to put these into the existing uniprot-proteins.tsv file with IDs formatted as [UniProtID]#[FragmentID] which is now "officially" supported by UniProt. Examples:

Angiotensin-2   P01019#PRO_0000032458   Homo sapiens
Angiotensin-2   P11859#PRO_0000032462   Mus musculus

This adds a total of around 50k new rows to the grounding file.
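
As a rough illustration of the ID convention described above, the sketch below splits a grounding ID into its UniProt accession and optional fragment ID. The helper name `split_fragment_id` is hypothetical and is not part of update_uniprot_proteins.py.

```python
from typing import Optional, Tuple


def split_fragment_id(grounding_id: str) -> Tuple[str, Optional[str]]:
    """Split '[UniProtID]#[FragmentID]' into its two parts.

    Hypothetical helper for illustration only. Plain UniProt IDs
    without a '#' are returned with None as the fragment ID.
    """
    uniprot_id, sep, fragment_id = grounding_id.partition('#')
    return uniprot_id, (fragment_id if sep else None)


print(split_fragment_id('P01019#PRO_0000032458'))
print(split_fragment_id('P01019'))
```

Keeping the fragment ID optional means existing full-protein rows in uniprot-proteins.tsv can go through the same code path as the new fragment rows.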

This PR does not yet touch the NER files, and I have not yet written any tests or tried this against Reach. @JakeWolfe and @MihaiSurdeanu would you be able to pick this up from here?

MihaiSurdeanu · 2020-09-10T03:40:43Z

Thank you @bgyori !

@JakeWolfe : can you please generate the NER files in this branch? And then publishLocal, and write a few unit tests in Reach to make sure that these new entities are visible?

@bgyori : should we keep the Protein Ontology fragments then?

bgyori · 2020-09-10T22:01:49Z

I suggest we keep the Protein Ontology for now, we can test how Reach works after these additions and see if we need to remove or tweak anything further.

MihaiSurdeanu · 2020-09-10T23:09:23Z

@JakeWolfe : any update on this? I'd like to release soon. Thank you!

JakeWolfe · 2020-09-10T23:10:50Z

@MihaiSurdeanu I will get to this by the end of tonight.

JakeWolfe · 2020-09-11T05:35:33Z

Testing in reach after generating the new NER files and publishing locally is showing failed unit tests in TestEntities, specifically:

"ADAMTS18/ClvPrd is a protein fragment": (2 was not equal to 1) in reference to the number of mentions found
"CCL3L/ClvPrd is a protein fragment": (2 was not equal to 1) in reference to the number of mentions found

It seems to be splitting each of those mentions into two instead of the single mention expected previously. This can be verified from the MetaInfo of the ADAMTS18 and CCL3L mentions, which references uniprot-proteins.tsv.gz even though this was a Protein Ontology test. This would seem to indicate that there is overlap between the two files.

@MihaiSurdeanu, @bgyori : Please let me know how to proceed

MihaiSurdeanu · 2020-09-11T14:34:11Z

Thanks @JakeWolfe!

To understand this problem, please check and confirm that:

"ADAMTS18/ClvPrd" and "CCL3L/ClvPrd" show up as protein fragments in the NER KB.
The 4 tokens after tokenization (ADAMTS18, ClvPrd, CCL3L, ClvPrd) show up as protein names in the NER KB for proteins.

bgyori · 2020-09-21T19:01:41Z

I think ClvPrd is just a code for "cleavage product" used in the Protein Ontology and doesn't really have any "real" significance when it comes to recognizing entities in text. For instance "ADAMTS18/ClvPrd" is just a synthetic name given to an arbitrary "ADAMTS18 cleavage product".

MihaiSurdeanu · 2020-09-22T00:42:15Z

Thanks @bgyori !

@JakeWolfe: can you please check why we're identifying ClvPrd as proteins? Which KB contains it? @bgyori: we should remove it from there?

JakeWolfe · 2020-09-22T18:00:04Z

@MihaiSurdeanu I am now looking into this.

MihaiSurdeanu · 2020-09-22T18:38:55Z

Thanks!

bgyori · 2020-10-06T21:24:11Z

Did you run into any issues with the changes? I'm happy to help, just let me know!

MihaiSurdeanu · 2020-10-07T13:38:10Z

@kwalcock: can you please pick up this thread? In particular, we want to merge this into master, but a couple of tests in Reach are failing. To understand this problem, we need to check and confirm that:

"ADAMTS18/ClvPrd" and "CCL3L/ClvPrd" show up as protein fragments in the NER KB.
The 4 tokens after tokenization (ADAMTS18, ClvPrd, CCL3L, ClvPrd) show up as protein names in the NER KB for proteins.

Can you please double check this?

kwalcock · 2020-10-07T16:33:57Z

In case anyone else is listening, here's the written acknowledgement.

MihaiSurdeanu · 2020-10-07T16:35:09Z

I appreciate it @kwalcock !

MihaiSurdeanu · 2020-10-08T03:20:16Z

From @JakeWolfe:

ADAMTS18 was found in Gene_or_gene_product and Gene_or_gene_product_OLD
ClvPrd is in neither Gene_or_gene_product nor Gene_or_gene_product_OLD

ADAMTS18/ClvPrd is in neither

CCL3L1 and CCL3L3 were found in Gene_or_gene_product and Gene_or_gene_product_OLD, but CCL3L was not

CCL3L/ClvPrd is in neither

MihaiSurdeanu · 2020-10-08T03:20:56Z

@JakeWolfe: if you run the reach shell (the "shell" script), what is the actual output on the two failing sentences?
Thanks!

JakeWolfe · 2020-10-08T03:23:51Z

@MihaiSurdeanu
TEXT: ADAMTS18 / ClvPrd is a protein fragment
TOKENS: (0,ADAMTS18,NN), (1,and,CC), (2,ClvPrd,NN), (3,is,VBZ), (4,a,DT), (5,protein,NN), (6,fragment,NN)
ENTITY LABELS: (ADAMTS18,B-Gene_or_gene_product), (and,O), (ClvPrd,B-Gene_or_gene_product), (is,O), (a,O), (protein,O), (fragment,O)

LEMMAS: adamts18 and clvprd be a protein fragment
roots: 6
outgoing:
0: (1,cc) (2,conj_and)
1:
2:
3:
4:
5:
6: (0,nsubj) (2,nsubj) (3,cop) (4,det) (5,compound)
incoming:
0: (6,nsubj)
1: (0,cc)
2: (0,conj_and) (6,nsubj)
3: (6,cop)
4: (6,det)
5: (6,compound)
6:

ENTITIES: 2

MENTION TEXT: ADAMTS18
LABELS: List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController)
DISPLAY LABEL: Protein
------------------------------
RULE => ner-gene_or_gene_product-entities
TYPE => CorefTextBoundMention
------------------------------
GROUNDING: <KBResolution: ADAMTS21, uniprot, Q8TE60, homo sapiens, <IMKBMetaInfo: uniprot, uniprot-proteins.tsv.gz, http://identifiers.org/uniprot/, MIR:00100164, sp=true, f=false, p=true>>

CONTEXT: NONE
------------------------------

MENTION TEXT: ClvPrd
LABELS: List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController)
DISPLAY LABEL: Protein
------------------------------
RULE => ner-gene_or_gene_product-entities
TYPE => CorefTextBoundMention
------------------------------
GROUNDING: <KBResolution: ClvPrd, uaz, UAZ00001, , <IMKBMetaInfo: uaz, , , , sp=false, f=false, p=false>>

CONTEXT: NONE
------------------------------

EVENTS: 0

==================================================

TEXT: CCL3L / ClvPrd is a protein fragment
TOKENS: (0,CCL3L,NN), (1,and,CC), (2,ClvPrd,NN), (3,is,VBZ), (4,a,DT), (5,protein,NN), (6,fragment,NN)
ENTITY LABELS: (CCL3L,B-Gene_or_gene_product), (and,O), (ClvPrd,B-Gene_or_gene_product), (is,O), (a,O), (protein,O), (fragment,O)

LEMMAS: ccl3l and clvprd be a protein fragment
roots: 6
outgoing:
0: (1,cc) (2,conj_and)
1:
2:
3:
4:
5:
6: (0,nsubj) (2,nsubj) (3,cop) (4,det) (5,compound)
incoming:
0: (6,nsubj)
1: (0,cc)
2: (0,conj_and) (6,nsubj)
3: (6,cop)
4: (6,det)
5: (6,compound)
6:

ENTITIES: 2

MENTION TEXT: CCL3L
LABELS: List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController)
DISPLAY LABEL: Protein
------------------------------
RULE => ner-gene_or_gene_product-entities
TYPE => CorefTextBoundMention
------------------------------
GROUNDING: <KBResolution: CCL3L, uaz, UAZ00002, , <IMKBMetaInfo: uaz, , , , sp=false, f=false, p=false>>

CONTEXT: NONE
------------------------------

MENTION TEXT: ClvPrd
LABELS: List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController)
DISPLAY LABEL: Protein
------------------------------
RULE => ner-gene_or_gene_product-entities
TYPE => CorefTextBoundMention
------------------------------
GROUNDING: <KBResolution: ClvPrd, uaz, UAZ00001, , <IMKBMetaInfo: uaz, , , , sp=false, f=false, p=false>>

CONTEXT: NONE
------------------------------

EVENTS: 0

==================================================

kwalcock · 2020-10-08T03:55:12Z

FWIW, the failing tests that I see when running reach on a locally published version of bioresources from the uniprot_fragments branch are these in TestHyphenedEvents and TestApi:

[info] - should have a positive activation of levels of EM by TFs, TWIST1, SNAIL1, SLUG, ZEB1, FOXC2 and CD45 *** FAILED *** (238 milliseconds)
[info]   false was not true (TestHyphenedEvents.scala:17)


[info] - should return 9 positive activation and 3 phosphorylation results from NXML test *** FAILED *** (10 seconds, 407 milliseconds)
[info]   Vector(org.clulab.reach.mentions.CorefEventMention@ed6620fa, org.clulab.reach.mentions.CorefEventMention@1aa155ea, org.clulab.reach.mentions.CorefEventMention@15fde8d6, org.clulab.reach.mentions.CorefEventMention@f71c0834, org.clulab.reach.mentions.CorefEventMention@32c124b2, org.clulab.reach.mentions.CorefEventMention@efacda5c, org.clulab.reach.mentions.CorefEventMention@31d5ece8) had size 7 instead of expected size 9 (TestApi.scala:124)

In reach it is processors/build.sbt that was changed to use the local copy of bioprocessors. Perhaps I've missed some detail.

MihaiSurdeanu · 2020-10-08T04:02:47Z

Hmm... @JakeWolfe: can you please list here the steps you used to test this PR?

JakeWolfe · 2020-10-08T04:34:59Z

@MihaiSurdeanu @kwalcock In the uniprot_fragments branch of bioresources I built the KBs using sbt publishLocal, then went to reach and changed the bioresources version to the snapshot produced by publishLocal, then ran the shell commands. In addition, I ran the TestEntities unit test.

kwalcock · 2020-10-08T05:03:53Z

So many things could go wrong. I wonder about "I built the KBs using sbt publishLocal". I didn't rebuild anything in the kb directory that I know of. There are instructions for rebuilding the ner directory using reach and the ner_kb.sh script at https://github.com/clulab/bioresources. I assume that's how the new files in ner got to github. Are those the same as your local version (i.e., are we using the same files)?

bioresources/master is set to version 1.1.34-SNAPSHOT so in the uniprot_fragments I used 1.1.35-SNAPSHOT just in case. That could get messed up.

On one or the other of these projects when testing with Windows, it's important to add -Dfile.encoding=UTF-8. I'm not sure which project it was.

I do notice that some CR/LFs are creeping into our files and hope that's not somehow involved.
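
One way to check whether CR/LF line endings have crept into a compressed KB file is to scan its raw bytes. The following is only a sketch; the function name `count_crlf_lines` is hypothetical, and the path passed in would be whichever .tsv.gz file is under suspicion.

```python
import gzip


def count_crlf_lines(path: str) -> int:
    """Count lines ending in CR/LF inside a gzipped file.

    The file is opened in binary mode so that no newline translation
    hides the carriage returns.
    """
    with gzip.open(path, 'rb') as handle:
        return sum(1 for line in handle if line.endswith(b'\r\n'))
```

A non-zero count for a file generated on Windows, compared with zero for the same file generated on Linux, would point to the platform rather than to the KB content.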

JakeWolfe · 2020-10-08T05:11:04Z

I am running on Windows, so these tests may have been affected. The NER files in the uniprot_fragments branch are the same as the ones that I am testing on.

How should we move forward with testing?

kwalcock · 2020-10-08T05:51:48Z

I reran the tests without the -Dfile.encoding=UTF-8 and there was no change. I suppose it could have been important when generating the files in ner.

Tomorrow I'm going to run the tests on a pristine git clone and on Linux and see if it makes a difference. I want to make sure that we're fixing the right problem and not some strange configuration issue. Chances are I can't make the exact same mistakes twice.

MihaiSurdeanu · 2020-10-08T14:54:48Z

Every time a file in the src/main/resources/org/clulab/reach/kb/ folder changes, we need to rerun the ner_kb.sh script to regenerate the files under .../kb/ner, which is what processors actually uses.
Has this been done for this PR?

JakeWolfe · 2020-10-08T20:57:05Z

Yes, the KBs were updated in push 7900aff

kwalcock · 2020-10-09T02:25:50Z

This is completely weird. It sure looks like KBGenerator.scala (https://github.com/clulab/reach/blob/master/processors/src/main/scala/org/clulab/processors/bionlp/ner/KBGenerator.scala) expects the knowledge bases to have three columns:

https://github.com/clulab/reach/blob/5777d66448f1ccb737e69bedae5f8f8073e562b7/processors/src/main/scala/org/clulab/processors/bionlp/ner/KBGenerator.scala#L28-L30

https://github.com/clulab/reach/blob/5777d66448f1ccb737e69bedae5f8f8073e562b7/processors/src/main/scala/org/clulab/processors/bionlp/ner/KBGenerator.scala#L151-L152

but the file I see (protein-ontology-fragments.tsv after gunzip) has only two columns:

14-3-3 protein gamma proteolytic cleavage product	PR:000021868
155 kDa platelet multimerin (human)	PR:000050084
2K	PR:000036831
2K fragment	PR:000036831

This results in exceptions for both Linux and Windows:

17:31:11.709 [run-main-0] INFO  o.c.p.bionlp.BioNLPProcessor - Converting protein-ontology-fragments...
[error] (run-main-0) java.lang.ArrayIndexOutOfBoundsException: 2
[error] java.lang.ArrayIndexOutOfBoundsException: 2
[error] 	at org.clulab.processors.bionlp.ner.KBGenerator$.containsValidSpecies(KBGenerator.scala:151)
[error] 	at org.clulab.processors.bionlp.ner.KBGenerator$.convertKB(KBGenerator.scala:98)
[error] 	at org.clulab.processors.bionlp.ner.KBGenerator$$anonfun$main$2.apply(KBGenerator.scala:53)
[error] 	at org.clulab.processors.bionlp.ner.KBGenerator$$anonfun$main$2.apply(KBGenerator.scala:52)
[error] 	at scala.collection.immutable.List.foreach(List.scala:392)
[error] 	at org.clulab.processors.bionlp.ner.KBGenerator$.main(KBGenerator.scala:52)
[error] 	at org.clulab.processors.bionlp.ner.KBGenerator.main(KBGenerator.scala)
[error] 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error] 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error] 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error] 	at java.lang.reflect.Method.invoke(Method.java:498)
[error] stack trace is suppressed; run last Compile / bgRunMain for the full output
[error] Nonzero exit code: 1
[error] (Compile / runMain) Nonzero exit code: 1
[error] Total time: 434 s (07:14), completed Oct 8, 2020 5:31:11 PM
sbt:reach-exe>
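
A quick way to confirm a column-count mismatch like the one above, before running the generator, is to tally the number of tab-separated columns per line. This is a minimal sketch, assuming the KB is a gzipped, UTF-8, tab-separated file; `column_counts` is a hypothetical helper and not part of reach or bioresources.

```python
import gzip
from collections import Counter


def column_counts(path: str) -> Counter:
    """Tally how many tab-separated columns each line of a gzipped TSV has."""
    counts = Counter()
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        for line in handle:
            counts[len(line.rstrip('\n').split('\t'))] += 1
    return counts
```

If every line reports two columns while the reader indexes a third, an out-of-bounds error of the kind shown in the log is the expected outcome.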

It looks like this KBGenerator file was copied in from somewhere else on 4/6/2020 with this commit: clulab/reach@3df3e69, and I don't know what it looked like before.

However, it also looks like this protein-ontology-fragments.tsv.gz was also added recently, with a43b98c and that it wasn't previously compressed which might imply that it didn't even go through this process. I don't know.

Something is mixed up, maybe me.

kwalcock · 2020-10-09T16:12:48Z

I get it. There's also a PR on reach, clulab/reach#693, that contains the updated code needed to process the file.

bgyori · 2020-10-11T03:57:52Z

Hopefully, I'm not confusing things further by saying that I made a number of changes to the Protein Ontology resource file here in bioresources before it was merged in #41, and pushed the commit clulab/reach@cb71708 to the PR clulab/reach#693 to adapt the tests to these changes. In addition to the possible issue with the processor versions, it's possible @JakeWolfe didn't use the latest state of the PR branch in reach for testing, causing these failures.

kwalcock · 2020-10-13T03:41:51Z

There are unpublished branches of bioresources and reach called proonto. These steps can be followed to reproduce the failing tests:

git clone http://github.com/clulab/bioresources
cd bioresources
git checkout proonto
cd ..
git clone http://github.com/clulab/reach
cd reach
git checkout proonto
cd ../bioresources/
./ner_kb.sh
sbt publishLocal
cd ../reach
vim processors/build.sbt # I corrected a mistake here
# change the bioresources dependency to 1.1.34-SNAPSHOT
sbt main/test
sbt export/test

Fixes can be pushed back to these same branches.

kwalcock · 2020-10-13T03:47:59Z

BTW these are the errors in TestHyphenedEvents and TestApi, not the ones related to TestEntities.

JakeWolfe · 2020-10-13T04:13:19Z

@MihaiSurdeanu I am not familiar with that part of reach, how should we approach fixing this?

MihaiSurdeanu · 2020-10-13T12:13:47Z

I will take over from here. Thanks Keith!

kwalcock · 2020-10-13T22:13:27Z

Anyone, I do have a bunch of outstanding commits for reach that update sbt and the plugin dependencies to new versions and make changes to the scripts. If they are needed sooner, please let me know, but I didn't want to make the situation more complicated by trying to merge them now.

MihaiSurdeanu · 2020-10-13T23:25:14Z

Please wait on those. But did you merge Ben's other branch in here?

kwalcock · 2020-10-13T23:30:01Z

Both Ben's changes to bioresources and the ones to reach are in the proonto branches. The ones to reach are a little suspect because they seem to solve the problems (with things like ClvPrd) by changing the tests, but they are there.

bgyori · 2020-10-13T23:38:10Z

Note that I took out synonyms containing ClvPrd in bioresources first (this is merged into master) and then updated the corresponding Reach tests to remove checking for these. I'm pretty sure the only issue is that Jake was not using the commit in which I updated these tests.

bgyori · 2020-10-13T23:46:03Z

To be more specific and hopefully avoid further confusion... Here is the specific commit that makes a change to the script generating the protein-ontology-fragments.tsv.gz file:

2e6cb2a#diff-688598b0a184d86ab4292c4b56c1fa02b4704b102f672c14e51c3e8dad69308bR69-R73

    # Remove entries like "YWHAB/ClvPrd", these are not useful
    # synonyms
    if re.match(r'^([^/]+)/(ClvPrd|UnMod|iso:\d+/UnMod)$', synonym):
        return False

As you can see here, I remove entries like "YWHAB/ClvPrd". Given this change, I don't see why my change in Reach in clulab/reach@cb71708 to replace tests checking for ClvPrd might be considered suspect.
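
To make the effect of this filter concrete, here is a small standalone sketch that applies the same regular expression to a few names. The wrapper `is_synthetic_synonym` is hypothetical; only the pattern itself is taken from the commit quoted above.

```python
import re

# Pattern copied from the snippet above; the wrapper around it is
# a hypothetical helper for illustration.
SYNTHETIC_SYNONYM = re.compile(r'^([^/]+)/(ClvPrd|UnMod|iso:\d+/UnMod)$')


def is_synthetic_synonym(synonym: str) -> bool:
    """True if the synonym is a synthetic Protein Ontology name."""
    return SYNTHETIC_SYNONYM.match(synonym) is not None


for name in ['YWHAB/ClvPrd', 'ADAMTS18/ClvPrd', 'CCL3L/ClvPrd',
             'YWHAB/iso:2/UnMod', 'Angiotensin-2']:
    print(name, is_synthetic_synonym(name))
```

Both names from the failing TestEntities cases, "ADAMTS18/ClvPrd" and "CCL3L/ClvPrd", match the pattern, so they no longer appear as synonyms in the generated file.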

MihaiSurdeanu · 2020-10-13T23:46:08Z

Also @kwalcock,
When is this file generated: src/main/resources/org/clulab/reach/kb/ner/model.ser.gz
This is a CompactLexicon that you implemented. But it is not refreshed when running ner_kb.sh.
Do we need it?

MihaiSurdeanu · 2020-10-13T23:49:09Z

To be more specific and hopefully avoid further confusion... Here is the specific commit that makes a change to the script generating the protein-ontology-fragments.tsv.gz file:

2e6cb2a#diff-688598b0a184d86ab4292c4b56c1fa02b4704b102f672c14e51c3e8dad69308bR69-R73
    # Remove entries like "YWHAB/ClvPrd", these are not useful
    # synonyms
    if re.match(r'^([^/]+)/(ClvPrd|UnMod|iso:\d+/UnMod)$', synonym):
        return False
As you can see here, I remove entries like "YWHAB/ClvPrd". Given this change, I don't see why my change in Reach in clulab/reach@cb71708 to replace tests checking for ClvPrd might be considered suspect.

I agree.

MihaiSurdeanu · 2020-10-14T00:11:07Z

Hey @kwalcock: the shell script in this branch fails with this error:

[info] Running org.clulab.reach.ReachShell
Loading ReachSystem ...
[dynet] random seed: 4052734219
[dynet] allocating memory: 512,512,512,512MB
[dynet] memory allocation done.
19:09:03.452 [run-main-0] INFO  o.c.p.m.DeepLearningPolarityClassifier - Loading saved model SavedLSTM_WideBound_u_tag ...
19:09:04.524 [run-main-0] INFO  o.c.p.m.DeepLearningPolarityClassifier - Loading model finished!
[error] (run-main-0) java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile$1.apply(ReachEntityLookup.scala:35)
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile$1.apply(ReachEntityLookup.scala:33)
	at scala.Option.map(Option.scala:146)
	at org.clulab.reach.grounding.ReachEntityLookup.org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile(ReachEntityLookup.scala:33)
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$1.apply(ReachEntityLookup.scala:80)
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$1.apply(ReachEntityLookup.scala:80)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
	at org.clulab.reach.grounding.ReachEntityLookup.<init>(ReachEntityLookup.scala:80)
	at org.clulab.reach.ReachSystem.<init>(ReachSystem.scala:37)
	at org.clulab.reach.ReachShell$.delayedEndpoint$org$clulab$reach$ReachShell$1(ReachShell.scala:27)
	at org.clulab.reach.ReachShell$delayedInit$body.apply(ReachShell.scala:15)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at org.clulab.reach.ReachShell$.main(ReachShell.scala:15)
	at org.clulab.reach.ReachShell.main(ReachShell.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
Caused by: java.io.IOException: Stream closed
	at java.io.BufferedInputStream.getInIfOpen(BufferedInputStream.java:159)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
	at java.util.zip.CheckedInputStream.read(CheckedInputStream.java:59)
	at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:266)
	at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258)
	at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
	at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
	at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
	at org.clulab.reach.grounding.ReachKBUtils$.sourceFromResource(ReachKBUtils.scala:67)
	at org.clulab.reach.grounding.TsvIMKBFactory.org$clulab$reach$grounding$TsvIMKBFactory$$loadFromKBDir(TsvIMKBFactory.scala:45)
	at org.clulab.reach.grounding.TsvIMKBFactory$$anonfun$make$1.apply(TsvIMKBFactory.scala:23)
	at org.clulab.reach.grounding.TsvIMKBFactory$$anonfun$make$1.apply(TsvIMKBFactory.scala:22)
	at scala.Option.foreach(Option.scala:257)
	at org.clulab.reach.grounding.TsvIMKBFactory.make(TsvIMKBFactory.scala:22)
	at org.clulab.reach.grounding.ReachIMKBMentionLookups$.staticProteinFragmentKBML(ReachIMKBMentionLookups.scala:181)
	at org.clulab.reach.grounding.ReachIMKBMentionLookups$.<init>(ReachIMKBMentionLookups.scala:37)
	at org.clulab.reach.grounding.ReachIMKBMentionLookups$.<clinit>(ReachIMKBMentionLookups.scala)
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile$1.apply(ReachEntityLookup.scala:35)
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile$1.apply(ReachEntityLookup.scala:33)
	at scala.Option.map(Option.scala:146)
	at org.clulab.reach.grounding.ReachEntityLookup.org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile(ReachEntityLookup.scala:33)
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$1.apply(ReachEntityLookup.scala:80)
	at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$1.apply(ReachEntityLookup.scala:80)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
	at org.clulab.reach.grounding.ReachEntityLookup.<init>(ReachEntityLookup.scala:80)
	at org.clulab.reach.ReachSystem.<init>(ReachSystem.scala:37)
	at org.clulab.reach.ReachShell$.delayedEndpoint$org$clulab$reach$ReachShell$1(ReachShell.scala:27)
	at org.clulab.reach.ReachShell$delayedInit$body.apply(ReachShell.scala:15)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at org.clulab.reach.ReachShell$.main(ReachShell.scala:15)
	at org.clulab.reach.ReachShell.main(ReachShell.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)

It works in master.
Can you please take a look?

kwalcock · 2020-10-14T00:48:08Z

Are you using proonto on both bioresources and reach? Between git and sbt it is extremely easy for things to get mixed up. It was necessary to use "sbt reload", "sbt clean", and "git status" several times when I experimented and to even remove the .ivy2 directory. "git status" on jenny where I was working shows only the *.gz files having been modified and processors/build.sbt. Plus there would be the SNAPSHOT bioresources in .ivy2 local. I would remove the 1.1.33 version from cache just to be safe.

I think the error message is actually complaining that a file is missing. I got it once with a mismatched pair of bioresources and reach. Not much help, I know.

kwalcock · 2020-10-14T00:49:49Z

Thanks, @bgyori. I suspect myself more than anyone or any thing else.

MihaiSurdeanu · 2020-10-14T01:24:50Z

Thanks @kwalcock: I'll start from scratch...

MihaiSurdeanu · 2020-10-14T01:34:00Z

Thanks @kwalcock: I'll start from scratch...

Thanks! That fixed it.

MihaiSurdeanu · 2020-10-14T01:42:19Z

@bgyori: the failing test passes when I replace "EM" with a known protein such as "KRas". So, "EM" is no longer recognized as a GGP in this branch. Is this on purpose? Thanks!

kwalcock · 2020-10-14T02:26:42Z

It looks like that model.ser.gz should be refreshed by ner_kb.sh with

# generate the serialized LexiconNER model now
sbt 'runMain org.clulab.processors.bionlp.ner.KBLoader ../bioresources/src/main/resources/org/clulab/reach/kb/ner/model.ser.gz'

I think I've noticed it sometimes not being refreshed and hope it was because the files used to build it hadn't changed. That it doesn't change might be a sign of something wrong. Here's what I think happens:

The first time reach is called, it depends on the published bioresources 1.1.33. This seems to result in an unchanged model.ser.gz. After that, reach is updated to depend on bioresources 1.1.34-SNAPSHOT. If ner_kb.sh is called again, the file model.ser.gz will change. It is probably this version that should be published for real. I'm assuming that another round doesn't result in any more changes, but I'm not yet sure that's the case. Perhaps something can be done about the circle.
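
To check whether another round of ner_kb.sh really leaves a generated file unchanged, its decompressed content can be hashed before and after each run. This is a sketch under the assumption that comparing decompressed bytes is what matters (gzip headers can carry a timestamp, so hashing the raw .gz bytes could differ even for identical content); `gz_content_sha256` is a hypothetical helper.

```python
import gzip
import hashlib


def gz_content_sha256(path: str) -> str:
    """Return the SHA-256 hex digest of a gzipped file's decompressed bytes."""
    digest = hashlib.sha256()
    with gzip.open(path, 'rb') as handle:
        # Read in 1 MiB chunks so large KB files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the digest after each round would show whether the output settles after the second run or keeps changing.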

bgyori · 2020-10-14T02:41:43Z

@bgyori: the failing test passes when I replace "EM" with a known protein such as "KRas". So, "EM" is no longer recognized as a GGP in this branch. Is this on purpose? Thanks!

I looked into this, and found that "EM" was previously incorrectly grounded to PUBCHEM:6426949, and is one of a group of two-letter acronyms that are listed by CHEBI and PubChem as synonyms for pairs of amino acids. Since these are virtually never correct as synonyms for the purposes of text mining, I removed them in #36.
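
For illustration only, a synonym of this kind can be recognized by checking whether it consists of exactly two one-letter amino-acid codes. This sketch is not the filter used in #36; the name `looks_like_dipeptide` and the choice to consider only the 20 standard amino acids are assumptions.

```python
# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = set('ACDEFGHIKLMNPQRSTVWY')


def looks_like_dipeptide(synonym: str) -> bool:
    """True if the synonym is exactly two standard amino-acid letters."""
    return len(synonym) == 2 and all(ch in AMINO_ACIDS for ch in synonym)


print(looks_like_dipeptide('EM'))    # E (Glu) + M (Met)
print(looks_like_dipeptide('KRas'))
```

A check like this is deliberately broad: it also flags two-letter strings that are legitimate names in other contexts, so it only makes sense for sources such as the CHEBI and PubChem synonym lists mentioned above.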

MihaiSurdeanu · 2020-10-14T14:21:38Z

Thanks! Then I think we can finally merge these branches into their respective masters. @kwalcock: can you please do the honors?

Thank you @bgyori, @kwalcock, and @JakeWolfe for your help with this thorny branch!

kwalcock · 2020-10-14T20:28:28Z

I haven't noticed any updates getting as far as github that address the failing tests.

MihaiSurdeanu · 2020-10-14T20:30:47Z

The update is in the proonto branch of Reach. This bioresources branch is fine as is.

kwalcock · 2020-10-14T20:33:56Z

My bad. Thanks.

kwalcock · 2020-10-14T22:16:40Z

I'm planning to merge this even though I'm not completely sure the .gz files will be the right ones in the end. They were created using the reach branch proonto, but we'd probably rather have the current reach master create them. In addition, I need to make some changes to files in both projects. I doubt that everything can be done in a single commit anyway. This particular master won't be published however until bioresources and reach are synchronized. I will edit the CHANGES file when we're ready to publish. I shouldn't be too long.

MihaiSurdeanu · 2020-10-14T22:18:36Z

I think the .gz files are fine, as nothing changed in the generation part of the code. Agree with everything else!

kwalcock · 2020-10-15T23:00:27Z

Also @kwalcock,
When is this file generated: src/main/resources/org/clulab/reach/kb/ner/model.ser.gz
This is a CompactLexicon that you implemented. But it is not refreshed when running ner_kb.sh.
Do we need it?

I too wonder whether we need it and would like to get rid of it. It is a serialized object, so it is dependent on the Java and Scala versions, even though it is being saved in bioresources, which is independent of those versions. The code that does the serialization lives in reach and the object that is serialized is in processors. The chances that these all line up so that it is a usable resource seem very slim. I added a test for the file and it will fail if the Scala version is changed. I don't know whether anything did manage to use it, though, and something might break without it.

The file Gene_or_gene_produce-OLD.tsv.gz seems better suited to GitHub history than to Maven and should be deleted.

MihaiSurdeanu · 2020-10-15T23:11:59Z

Agree, let's remove both?

bgyori added 2 commits September 9, 2020 22:24

Update UniProt proteins

7d34bc3

Add protein fragments to UniProt

41351fb

Updated NER files, moving to test in reach

7900aff

kwalcock merged commit 9b39c28 into master Oct 14, 2020

kwalcock deleted the uniprot_fragments branch October 14, 2020 22:20
