Entity boundary issue related to sites #722
Comments
Thanks @bgyori ! |
I see, is there any documentation / communication from around that time that has examples of issues before that change? That might help figure out how to engineer an overall better solution. |
No... This was an early decision, which may not even be justified right now. Not sure. |
Yes, I think in principle a good prioritization would be to first check if the named entity can be grounded fully as is (in the above example as "cyclin A1"). Then if it cannot be grounded, check if it can be broken up into a protein + site combination. I'm actually not sure what some examples of this latter case might be since in most cases e.g., "ERK2 T185" the site isn't recognized as part of the entity in the first place. In any case, we can try to check empirically if there are any issues with changing this. Also, I couldn't immediately find where this is implemented in the code but I could think more about the issue if you can point me to it. |
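The prioritization proposed above (ground the full mention first, split into protein + site only as a fallback) can be sketched roughly as follows. This is a minimal illustration and not Reach code: `LEXICON`, `SITE`, and `resolve` are hypothetical names, the lexicon entries are placeholders rather than real bioresources rows, and the site pattern is an assumed simplification.

```python
import re

# Hypothetical stand-in for the bioresources lookup tables; keys and labels
# are placeholders for illustration only.
LEXICON = {
    "cyclin": "family:cyclin",
    "cyclin a1": "protein:cyclin A1",
    "erk2": "protein:ERK2",
}

# Assumed site shape: one residue letter (S, T or Y) followed by a position.
SITE = re.compile(r"^[STY]\d+$")


def resolve(mention):
    """Ground the whole mention first; split off a site only as a fallback."""
    full = LEXICON.get(mention.lower())
    if full is not None:
        return [("Entity", mention, full)]
    # Fallback: try "<known entity> <site>" using the last token as the site.
    head, _, tail = mention.rpartition(" ")
    if head and SITE.match(tail) and head.lower() in LEXICON:
        return [("Entity", head, LEXICON[head.lower()]), ("Site", tail, None)]
    return []
```

With this ordering, "cyclin A1" is kept whole because it grounds as is, while a mention such as "ERK2 T185" would only be split if it had been captured as one entity in the first place.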
I'll take a look at the code later today. |
Just FYI, if you look at this file: you'll see that the rule that creates Site has priority 1, whereas these rules: which convert BIO notation into actual Reach entities have priority 3, which means they are executed later. |
There's a little more here than just priorities. More soon, hopefully. |
It was not a priority issue. It was this line: which prevents entities from containing upper-case letters followed by digits. @bgyori: should we remove this constraint? Or adjust it in a meaningful way? |
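Since the line in question is not reproduced here, the following is only a guess at the kind of check being described: a token is rejected from an entity if it contains an upper-case letter immediately followed by a digit. The names `UPPER_THEN_DIGIT`, `blocked`, and `crop` are hypothetical, and the real constraint in Reach may be narrower than this.

```python
import re

# Assumed form of the constraint: an upper-case letter directly followed by a digit.
UPPER_THEN_DIGIT = re.compile(r"[A-Z]\d")


def blocked(token):
    """Return True if the token would be excluded under the assumed constraint."""
    return UPPER_THEN_DIGIT.search(token) is not None


def crop(tokens):
    """Keep leading tokens up to (not including) the first blocked one."""
    kept = []
    for token in tokens:
        if blocked(token):
            break
        kept.append(token)
    return kept
```

Under this assumption, `crop(["cyclin", "A1"])` keeps only `["cyclin"]`, which matches the cropped boundary reported in the issue.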
I see, I'm still wondering if there was a good reason for adding that condition but I couldn't come up with any examples, even synthetically where it would make a difference. For instance, if we have a typical mutation listed as a separate word, like |
Just for the record, some synthetic examples I've tried (not all of them make sense biologically) are e.g.,
where we get a new
where the site is an |
This came up because, in addition to the bioresources NER, which you know, Reach merges in the outputs of a CRF NER, which (at least in the past) we have observed to "eat" into mutations and sites. But it doesn't seem to be happening commonly at all. So maybe it's safe to just remove? Should I do it? |
I'm trying to do some analysis on papers with mutations to see if we lose anything, I'll report what I find here. |
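A larger scale test actually highlighted some consequences of this proposed change. Several instances of figure names were extracted as named entities, e.g.:
- Fig. 5a
- Fig. 4f
- figure 4A
I found that before the change, from the same sentences we still extracted
- Fig.
as a named entity but we didn't get figure, presumably because that is a stop word. Is there a way to recognize Fig. and figure as stopwords assuming they could come with a suffix like Fig. 5a and figure 4a? |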
Can you please give us a couple of example sentences?
…On Thu, Jan 14, 2021 at 19:51 Benjamin M. Gyori ***@***.***> wrote:
A larger scale test actually highlighted some consequences to this
proposed change. Several instances of figure names were extracted as named
entities e.g.:
- Fig. 5a
- Fig. 4f
- figure 4A
I found that before the change, from the same sentences we still
extracted
- Fig.
as a named entity but we didn't get figure, presumably because that is
a stop word. Is there a way to recognize Fig. and figure as stopwords
assuming they could come with a suffix like Fig. 5a and figure 4a?
|
|
Thanks!
…On Thu, Jan 14, 2021 at 20:06 Benjamin M. Gyori ***@***.***> wrote:
- Similarly, we showed that wild-type p53 was polyubiquitinated by
Pirh2 but not by Pirh2-DN and Pirh2-ΔRING (Fig. 5C, compare lane 3 with
lanes 4 and 5).
- In contrast, the levels of IRP2 and TfR1 were increased, whereas the
level of FTH1 was decreased, by ectopic mutant p53 (Fig. 4f, compare lanes
3–4 with 1–2, respectively).
- In addition, knockout of IRP2 led to decreased expression of TfR1
and increased expression of FTH1 (Fig. 5a), consistent with previous report
[41].
- MG132 treatment rescued the NSC59984-mediated down-regulation of
mutant p53 (figure 4A).
|
@enoriega, can you please:
Thanks! |
@MihaiSurdeanu adding If you consider I think the cleanest way to solve this is to discard the mention by encoding figure and variants into the rule itself. What do you suggest? |
I fixed this a while ago in the NER post-processing code: But this obviously doesn't work anymore. Can you please try to fix that block of code? |
Done. I noticed that after fixing it, the figure numbers (4a, 5C, etc.) were recognized as sites. I updated the rule to fix this too.
Pull request coming soon. |
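For illustration, a filter of the kind discussed above (treating "Fig." and "figure" as stop words even when a panel label follows) might look like the sketch below. It is not the actual NER post-processing block or the updated rule; `FIGURE` and `drop_figure_mentions` are hypothetical names, and the pattern only covers the variants seen in this thread.

```python
import re

# Matches "Fig", "Fig." or "figure" (any case), optionally followed by a
# panel label such as "5a" or "4A".
FIGURE = re.compile(r"^(?:fig\.?|figure)(?:\s*\d+[a-z]?)?$", re.IGNORECASE)


def drop_figure_mentions(mentions):
    """Remove candidate entity mentions that are just figure references."""
    return [m for m in mentions if not FIGURE.match(m.strip())]
```

Applied to candidates from the example sentences, `drop_figure_mentions(["Fig. 5a", "IRP2", "figure 4A", "Fig.", "TfR1"])` keeps only `IRP2` and `TfR1`. Discarding the whole figure mention this way would also keep a panel label like "5a" from being picked up later as a site, provided the label is still part of the mention when the filter runs.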
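@MihaiSurdeanu I had to bump the bioresources dependency to version 1.1.37-SNAPSHOT. Should I leave this on for the pull request? Otherwise, what is the recommended course of action? I see how this is inconvenient, and will propose a fix soon. |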
Great, thanks!
reach master should not depend on any snapshot. So, we need to release
bioresources first.
…On Mon, Jan 18, 2021 at 10:17 AM Enrique Noriega ***@***.***> wrote:
@MihaiSurdeanu <https://github.com/MihaiSurdeanu> I had to bump the
bioresources dependency to version 1.1.37-SNAPSHOT. Should I leave this
on for the pull request? Otherwise, what is the recommended course of
action?
I see how this is inconvenient, and will propose a fix soon.
|
A very common entity boundary issue currently happens with cyclins, which can be described either as a family ("cyclin") or as specific proteins like "cyclin A1". Currently, in many of these cases, the entity boundary is cropped to just "cyclin", even if a specific type like "A1" or "D1" follows.
Looking at bioresources, in hgnc.tsv.gz we have
and cyclin A1 also appears in ner/Gene_or_gene_product.tsv.gz. Still, when putting cyclin A1 into the Reach shell, we get
So it appears that NER works correctly, since if I understand correctly, (cyclin,B-Gene_or_gene_product), (A1,I-Gene_or_gene_product) means that a single entity is detected, but it then gets broken up into a Protein and a Site. (One additional note: I checked whether this behavior has anything to do with the fact that "cyclin" is also defined as an override in NER-Grounding-Override.tsv.gz, but that doesn't seem to be the case; the same behavior happens if I remove the override.)
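To make the reading of the BIO tags concrete, here is a small generic decoder showing why the pair (cyclin, B-Gene_or_gene_product), (A1, I-Gene_or_gene_product) corresponds to one two-token entity. This is a standard BIO decoding sketch, not the Reach implementation; `bio_to_spans` is a hypothetical helper name.

```python
def bio_to_spans(tagged):
    """Group (token, tag) pairs in BIO notation into (text, label) spans."""
    spans = []
    tokens, label = [], None

    def flush():
        # Emit the span collected so far, if any.
        if tokens:
            spans.append((" ".join(tokens), label))

    for token, tag in tagged:
        if tag.startswith("B-"):
            # A B- tag always starts a new span.
            flush()
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == label:
            # An I- tag with the same label continues the open span.
            tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes the open span.
            flush()
            tokens, label = [], None
    flush()
    return spans
```

For the tags shown in the issue, the decoder yields a single span, "cyclin A1", labelled Gene_or_gene_product, so the split into a Protein and a Site has to happen in a later step.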