-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate UniProt fragments #42
Conversation
Thank you @bgyori ! @JakeWolfe : can you please generate the NER files in this branch? And then publishLocal, and write a few unit tests in Reach to make sure that this new entities are visible? @bgyori : should we keep the Protein Ontology fragments then? |
I suggest we keep the Protein Ontology for now, we can test how Reach works after these additions and see if we need to remove or tweak anything further. |
@JakeWolfe : any update on this? I'd like to release soon. Thank you! |
@MihaiSurdeanu I will get to this by the end of tonight. |
Testing in reach after generating the new NER files and publishing locally is showing failed unit tests in TestEntities, specifically: "ADAMTS18/ClvPrd is a protein fragment": (2 was not equal to 1) in reference to the number of mentions found It seems to be splitting those mentions into two mentions instead of the expected one previously, this can be verified by the MetaInfo of the mention of ADAMTS18 and CCL3L referencing uniprot-proteins.tsv.gz when this was a protein ontology test. This would seem to indicate that there is overlap between the two files. @MihaiSurdeanu, @bgyori : Please let me know how to proceed |
Thanks @JakeWolfe! To understand this problem please check and confirm that:
|
I think |
Thanks @bgyori ! @JakeWolfe: can you please check why we're identifying ClvPrd as proteins? Which KB contains it? @bgyori: we should remove it from there? |
@MihaiSurdeanu I am now looking into this. |
Thanks!
…On Tue, Sep 22, 2020 at 11:00 Jake Wolfe ***@***.***> wrote:
@MihaiSurdeanu <https://github.com/MihaiSurdeanu> I am now looking into
this.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#42 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAI75TRDBBRBPR2FMCJLOALSHDQ3JANCNFSM4RD7QWGQ>
.
|
Did you run into any issues with the changes? I'm happy to help, just let me know! |
@kwalcock: can you please pick up this thread? In particular, we want to merge this in master, but a couple of test in Reach are failing. To understand this problem, we need to check and confirm that:
Can you please double check this? |
If case anyone else is listening, here's the written acknowledgement. |
I appreciate it @kwalcock ! |
From @JakeWolfe: ADAMSTS18 was found in Gene_or_gene_product and Gene_or_gene_product_OLD ADAMTS18/ClvPrd is in neither CCL3L1 and CCL3L3 was found in Gene_or_gene_product and Gene_or_gene_product_OLD, but not CCL3L CCL3L/ClvPrd is in neither |
@JakeWolfe: if you run the reach shell (the "shell" script), what is the actual output on the two failing sentences? |
@MihaiSurdeanu LEMMAS: adamts18 and clvprd be a protein fragment ENTITIES: 2 MENTION TEXT: ADAMTS18 CONTEXT: NONE MENTION TEXT: ClvPrd CONTEXT: NONE EVENTS: 0 ================================================== TEXT: CCL3L / ClvPrd is a protein fragment LEMMAS: ccl3l and clvprd be a protein fragment ENTITIES: 2 MENTION TEXT: CCL3L CONTEXT: NONE MENTION TEXT: ClvPrd CONTEXT: NONE EVENTS: 0 ================================================== |
FWIW, the failing tests that I see when running reach on a locally published version of bioresources from the uniprot_fragments branch are these in TestHyphenedEvents and TestApi:
In reach it is processors/build.sbt that was changed to use the local copy of bioprocessors. Perhaps I've missed some detail. |
Hmm... @JakeWolfe: can you please list here the steps you used to test this PR? |
@MihaiSurdeanu @kwalcock In the uniprot_fragments branch of bioresources I built the KBs using sbt publishLocal, I went to reach and changed the bioresources version to the snapshot produced by publishLocal, then ran the shell commands. In addition I run the TestEntites unit test. |
So may things could go wrong. I wonder about "I built the KBs using sbt publishLocal". I didn't rebuild anything in the kb directory that I know of. There are instructions for rebuilding the ner directory using reach and the ner_kb.sh script at https://github.com/clulab/bioresources. I assume that's how the new files in ner got to github. Are those the same as your local version (i.e., are we using the same files)? bioresources/master is set to version On one or the other of these projects when testing with Windows, it's important to add I do notice that some CR/LFs are creeping into our files and hope that's not somehow involved. |
I am running on windows so these tests may have been affected. The NER files in the uniprot_fragments repository is the same as the the one that I am testing on. How should we move forward with testing? |
I reran the tests without the Tomorrow I'm going to run the tests on a pristine git clone and on Linux and see if it makes a difference. I want to make sure that we're fixing the right problem and not some strange configuration issue. Chances are I can't make the exact same mistakes twice. |
Every time a file in the |
Yes, the KBs were updated in push 7900aff |
This is completely weird. It sure looks like KBgenerator.scala (https://github.com/clulab/reach/blob/master/processors/src/main/scala/org/clulab/processors/bionlp/ner/KBGenerator.scala) expects the knowledge bases to have three columns: but the file I see (protein-ontology-fragments.tsv after gunzip) has only two columns:
This results in exceptions for both Linux and Windows:
It looks like this KBGenerator file was copied in from somewhere else on 4/6/2020 with this commit: clulab/reach@3df3e69, and I don't know what it looked like before. However, it also looks like this protein-ontology-fragments.tsv.gz was also added recently, with a43b98c and that it wasn't previously compressed which might imply that it didn't even go through this process. I don't know. Something is mixed up, maybe me. |
I get it. There's also a PR on reach, clulab/reach#693, that contains the updated code needed to process the file. |
Hopefully, I'm not confusing things further by saying that I made a number of changes to the Protein Ontology resource file here in bioresources before it was merged in #41, and pushed the commit clulab/reach@cb71708 to the PR clulab/reach#693 to adapt the tests to these changes. In addition to the possible issue with the processor versions, it's possible @JakeWolfe didn't use the latest state of the PR branch in reach for testing, causing these failures. |
There are unpublished branches of bioresources and reach called proonto. These steps can be followed to reproduce the failing tests:
Fixes can be pushed back to these same branches. |
BTW these are the errors in TestHyphenedEvents and TestApi, not the ones related to TestEntities. |
@MihaiSurdeanu I am not familiar with that part of reach, how should we approach fixing this? |
I will take over from here.
Thanks Keith!
…On Mon, Oct 12, 2020 at 23:13 Jake Wolfe ***@***.***> wrote:
@MihaiSurdeanu <https://github.com/MihaiSurdeanu> I am not familiar with
that part of reach, how should we approach fixing this?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#42 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAI75TXQWGBWTL7SEMO7BXDSKPHWZANCNFSM4RD7QWGQ>
.
|
Anyone, I do have a bunch of outstanding commits for reach that update sbt and the plugin dependencies to new versions and make changes to the scripts. If they are needed sooner, please let me know, but I didn't want to make the situation more complicated by trying to merge them now. |
Please wait on those.
But did you merge Ben's other branch in here?
…On Tue, Oct 13, 2020, 5:13 PM Keith Alcock ***@***.***> wrote:
Anyone, I do have a bunch of outstanding commits for reach that update sbt
and the plugin dependencies to new versions and make changes to the
scripts. If they are needed sooner, please let me know, but I didn't want
to make the situation more complicated by trying to merge them now.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#42 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAI75TTFU7ZBH4U7NHUWYRLSKTGJNANCNFSM4RD7QWGQ>
.
|
Both Ben's changes to bioresources and the ones to reach are in the proonto branches. The ones to reach are a little suspect because they seem to solve the problems (with things like ClvPrd) by changing the tests, but they are there. |
Note that I took out synonyms containing ClvPrd in bioresources first (this
is merged into master) and then updated the corresponding Reach tests to
remove checking for these. I'm pretty sure the only issue is that Jake was
not using the commit in which I updated these tests.
…On Tue, Oct 13, 2020, 7:30 PM Keith Alcock ***@***.***> wrote:
Both Ben's changes to bioresources and the ones to reach are in the
proonto branches. The ones to reach are a little suspect because they seem
to solve the problems (with things like ClvPrd) by changing the tests, but
they are there.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#42 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABNJ6SFY7KGAFWYKRQ5O3TTSKTPIPANCNFSM4RD7QWGQ>
.
|
To be more specific and hopefully avoid further confusion... Here is the specific commit that makes a change to the script generating the protein-ontology-fragments.tsv.gz file: 2e6cb2a#diff-688598b0a184d86ab4292c4b56c1fa02b4704b102f672c14e51c3e8dad69308bR69-R73
As you can see here, I "Remove entries like "YWHAB/ClvPrd". Given this change, I don't see why my change in Reach in clulab/reach@cb71708 to replace tests checking for ClvPrd might be considered suspect. |
Also @kwalcock, |
I agree. |
Hey @kwalcock: the
It works in master. |
Are you using proonto on both bioresources and reach? Between git and sbt it is extremely easy for things to get mixed up. It was necessary to use "sbt reload", "sbt clean", and "git status" several times when I experimented and to even remove the .ivy2 directory. "git status" on jenny where I was working shows only the *.gz files having been modified and processors/build.sbt. Plus there would be the SNAPSHOT bioresources in .ivy2 local. I would remove the 1.1.33 version from cache just to be safe. I think the error message is actually complaining that a file is missing. I got it once with a mismatched pair of bioresources and reach. Not much help, I know. |
Thanks, @bgyori. I suspect myself more than anyone or any thing else. |
Thanks @kwalcock: I'll start from scratch... |
Thanks! That fixed it. |
@bgyori: the failing test passes when I replace "EM" with a known protein such as "KRas". So, "EM" is no longer recognized as a GGP in this branch. Is this on purpose? Thanks! |
It looks like that model.ser.gz should be refreshed by ner_kb.sh with
I think I've noticed it sometimes not being refreshed and hope it was because the files used to build it hadn't changed. That it doesn't change might be a sign of something wrong. Here's what I think happens: The first time reach is called, it depends on the published bioresources 1.1.33. This seems to result in an unchanged model.ser.gz. After that, reach is updated to depend on bioresources 1.1.34-SNAPSHOT. If ner_kb.sh is called again, the file model.ser.gz will change. It is probably this version that should be published for real. I'm assuming that another round doesn't result in any more changes, but I'm not yet sure that's the case. Perhaps something can be done about the circle. |
I looked into this, and found that "EM" was previously incorrectly grounded to PUBCHEM:6426949, and is one of a group of two-letter acronyms that are listed by CHEBI and PubChem as synonyms for pairs of amino acids. Since these are virtually never correct as synonyms for the purposes of text mining, I removed them in #36. |
Thanks! Then I think we can finally merge this branches in their respective masters. @kwalcock: can you please do the honors? Thank you @bgyori, @kwalcock, and @JakeWolfe for your help with this thorny branch! |
I haven't noticed any updates getting as far as github that address the failing tests. |
The update is in the proonto branch of Reach. This bioresources branch is fine as is. |
My bad. Thanks. |
I'm planning to merge this even though I'm not completely sure the .gz files will be the right ones in the end. They were created using the reach branch proonto, but we'd probably rather have the current reach master create them. In addition, I need to make some changes to files in both projects. I doubt that everything can be done in a single commit anyway. This particular master won't be published however until bioresources and reach are synchronized. I will edit the CHANGES file when we're ready to publish. I shouldn't be too long. |
I think the .gz files are fine, as nothing changed in the generation part
of the code. Agree with everything else!
…On Wed, Oct 14, 2020 at 17:16 Keith Alcock ***@***.***> wrote:
I'm planning to merge this even though I'm not completely sure the .gz
files will be the right ones in the end. They were created using the reach
branch proonto, but we'd probably rather have the current reach master
create them. In addition, I need to make some changes to files in both
projects. I doubt that everything can be done in a single commit anyway.
This particular master won't be published however until bioresources and
reach are synchronized. I will edit the CHANGES file when we're ready to
publish. I shouldn't be too long.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#42 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAI75TTBL2NO33YSWNHIYXDSKYPNPANCNFSM4RD7QWGQ>
.
|
I too wonder whether we need it and would like to get rid of it. It is a serialized object, so it is dependent on Java and Scala version, even though it is being saved in bioresources which is independent of those versions. The code that does the serialization lives in reach and the object that is serialized is in processors. Chances that these all line up so that it is a usable resource seems very slim. I added a test for the file and it will fail if the Scala version is changed. I don't know if anything did manage to use it, though, and might break without it. It seems like the file Gene_or_gene_produce-OLD.tsv.gz is something better for github history than for maven and should be deleted. |
Agree, let's remove both?
…On Thu, Oct 15, 2020 at 18:00 Keith Alcock ***@***.***> wrote:
Also @kwalcock <https://github.com/kwalcock>,
When is this file generated:
src/main/resources/org/clulab/reach/kb/ner/model.ser.gz
This is a CompactLexicon that you implemented. But it is not refreshed
when runing ner_kb.sh.
Do we need it?
I too wonder whether we need it and would like to get rid of it. It is a
serialized object, so it is dependent on Java and Scala version, even
though it is being saved in bioresources which is independent of those
versions. The code that does the serialization lives in reach and the
object that is serialized is in processors. Chances that these all line up
so that it is a usable resource seems very slim. I added a test for the
file and it will fail if the Scala version is changed. I don't know if
anything did manage to use it, though, and might break without it.
It seems like the file Gene_or_gene_produce-OLD.tsv.gz is something better
for github history than for maven and should be deleted.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#42 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAI75TUMMGAHJBDTXHBYX63SK55JTANCNFSM4RD7QWGQ>
.
|
This PR extends the
update_uniprot_proteins.py
script to download and process protein chains and peptides into grounding entries. The approach taken here is to put these into the existinguniprot-proteins.tsv
file with IDs formatted as [UniProtID]#[FragmentID] which is now "officially" supported by UniProt. Examples:This adds a total of around 50k new rows to the grounding file.
This PR does not yet touch the NER files and I have not yet written any tests and tried this against Reach. @JakeWolfe and @MihaiSurdeanu would you be able to pick this up from here?