Entity boundary issue related to sites #722

bgyori · 2021-01-12T04:12:24Z

A very common entity boundary issue currently happens with cyclins, which can both be described as a family ("cyclin") or specific proteins like "cyclin A1". Currently, in many of these cases, the entity boundary is cropped to just "cyclin", even if a specific type like "A1" or "D1" follows.

Looking at bioresources, in hgnc.tsv.gz we have

cyclin A1»··P78396»·Human

and cyclin A1 also appears in ner/Gene_or_gene_product.tsv.gz. Still, when putting cyclin A1 into the Reach shell, we get

>>> cyclin A1

sentence #0
TEXT:   cyclin A1
TOKENS: (0,cyclin,NN), (1,A1,NN)
ENTITY LABELS: (cyclin,B-Gene_or_gene_product), (A1,I-Gene_or_gene_product)

LEMMAS: cyclin a1
roots: 1
outgoing:
	0:
	1: (0,compound)
incoming:
	0: (1,compound)
	1:

     
ENTITIES: 2

MENTION TEXT:  cyclin
LABELS:        List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController)
DISPLAY LABEL: Protein
	------------------------------
	RULE => ner-gene_or_gene_product-entities
	TYPE => CorefTextBoundMention
	------------------------------
	GROUNDING: <KBResolution: Cyclin, fplx, Cyclin, human, <IMKBMetaInfo: uaz, NER-Grounding-Override.tsv.gz, , , sp=false, f=false, p=false>>

CONTEXT: NONE
	------------------------------

MENTION TEXT:  A1
LABELS:        List(Site)
DISPLAY LABEL: Site
	------------------------------
	RULE => site_1letter_a
	TYPE => CorefTextBoundMention
	------------------------------
	GROUNDING: <KBResolution: A1, uaz, UAZ00001, , <IMKBMetaInfo: uaz, , , , sp=false, f=false, p=false>>

CONTEXT: NONE
	------------------------------


EVENTS:   0

==================================================

So it appears that NER works correctly, since if I understand correctly, (cyclin,B-Gene_or_gene_product), (A1,I-Gene_or_gene_product) means that a single entity is detected, but it then gets broken up into a Protein and a Site. (One additional note: I checked whether this behavior has anything to do with the fact that "cyclin" is also defined as an override in NER-Grounding-Override.tsv.gz, but that doesn't seem to be the case, the same behavior happens if I remove the override).

The text was updated successfully, but these errors were encountered:

MihaiSurdeanu · 2021-01-12T14:46:16Z

Thanks @bgyori !
I think this behavior was on purpose based on what we saw when we developed it. We observed a while ago that the NER "eats" into descriptions of sites because of the same overlap you observed. So we allowed the Site detector (which is executed downstream of the NER) to take away tokens from the NER if the site grammar thinks it's a site. I think I can revert that if we decide that's the preferred behavior. What do you think?

bgyori · 2021-01-12T15:07:16Z

I see, is there any documentation / communication from around that time that has examples of issues before that change? That might help figure out how to engineer an overall better solution.

MihaiSurdeanu · 2021-01-12T15:10:06Z

No... This was an early decision, which may not even be justified right now. Not sure.
I think we could employ your tactic here: revert the change, and compare outputs before and after to understand the impact.

bgyori · 2021-01-12T15:20:17Z

Yes, I think in principle a good prioritization would be to first check if the named entity can be grounded fully as is (in the above example as "cyclin A1"). Then if it cannot be grounded, check if it can be broken up into a protein + site combination. I'm actually not sure what some examples of this latter case might be since in most cases e.g., "ERK2 T185" the site isn't recognized as part of the entity in the first place. In any case, we can try to check empirically if there are any issues with changing this.

Also, I couldn't immediately find where this is implemented in the code but I could think more about the issue if you can point me to it.

MihaiSurdeanu · 2021-01-12T15:55:58Z

I'll take a look at the code later today.

MihaiSurdeanu · 2021-01-12T22:06:01Z

Just FYI, if you look at this file:
https://github.com/clulab/reach/blob/master/main/src/main/resources/org/clulab/reach/biogrammar/entities/entities.yml

you'll see that the rule that creates Site have priority 1, whereas these rules:
https://github.com/clulab/reach/blob/master/main/src/main/resources/org/clulab/reach/biogrammar/entities/entities.yml#L129-L233

which convert BIO notations into actual Reach entities have priority 3, which means they are executed later.
I'll play with these priorities to see what happens.

MihaiSurdeanu · 2021-01-12T22:52:42Z

There's a little more here than just priorities. More soon, hopefully.

MihaiSurdeanu · 2021-01-13T01:17:43Z

It was not a priority issue. It was this line:
https://github.com/clulab/reach/blob/master/main/src/main/resources/org/clulab/reach/biogrammar/entities/entities.yml#L186

which prevents entities contain upper-case letters followed by digits.
As the comment says this was added to prevent the entity to "eat" mutations.
Removing the !word=... constraint it makes Reach recognize "cyclin A1" as an entity. Also, all tests continue to pass.

@bgyori: should we remove this constraint? Or adjust it in a meaningful way?
Thanks!

bgyori · 2021-01-13T14:40:25Z

I see, I'm still wondering if there was a good reason for adding that condition but I couldn't come up with any examples, even synthetically where it would make a difference. For instance, if we have a typical mutation listed as a separate word, like BRAF V600E, the V600E part is not recognized as part of the named entity anyway and so this condition isn't applied. In other cases, where there is a named entity match for the entire sequence like Cyclin D1, the correct behavior is to not break that up. Could there be cases where NER extends the entity boundary to include the second word even if it is not part of an entity string explicitly listed in the NER files?

bgyori · 2021-01-13T15:16:32Z

Just for the record, some synthetic examples I've tried (not all of them make sense biologically) are e.g., KRAS G13, KRAS G13C, EGFR A1, EGFR C15, EGFR A66C, AKT1 A66, etc. There are a few different cases for entity labels, one is

(EGFR,B-Gene_or_gene_product), (Y1,B-Gene_or_gene_product)
(AKT1,B-Gene_or_gene_product), (A66,B-Simple_chemical)

where we get a new B- (not I-) entity for the site. Another is

(KRAS,B-Gene_or_gene_product), (G13C,O)

where the site is an O (again not I-). I couldn't produce any examples with entities tagged as B- followed by I- where the I- was a legitimate site for B-.

MihaiSurdeanu · 2021-01-13T15:20:49Z

This came up because in addition of the bioresources NER, which you know, Reach merges in the outputs of a CRF NER, which (at least in past) we have observed to "eat" into mutations and sites. But it doesn't seem to be happening commonly at all. So it's maybe safe to just remove? Should I do it?

bgyori · 2021-01-14T19:10:54Z

I'm trying to do some analysis on papers with mutations to see if we lose anything, I'll report what I find here.

bgyori · 2021-01-15T02:51:41Z

A larger scale test actually highlighted some consequences to this proposed change. Several instances of figure names were extracted as named entities e.g.:

Fig. 5a
Fig. 4f
figure 4A

I found that before the change, from the same sentences we still extracted

Fig.

as a named entity but we didn't get figure, presumably because that is a stop word. Is there a way to recognize Fig. and figure as stopwords assuming they could come with a suffix like Fig. 5a and figure 4a?

MihaiSurdeanu · 2021-01-15T02:59:57Z

Can you please give us a couple of example sentences?

…

On Thu, Jan 14, 2021 at 19:51 Benjamin M. Gyori ***@***.***> wrote: A larger scale test actually highlighted some consequences to this proposed change. Several instances of figure names were extracted as named entities e.g.: - Fig. 5a - Fig. 4f - figure 4A I found that before the change, from the same sentences we still extracted - Fig. as a named entity but we didn't get figure, presumably because that is a stop word. Is there a way to recognize Fig. and figure as stopwords assuming they could come with a suffix like Fig. 5a and figure 4a? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#722 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAI75TU4S7RHRIOOF62H2FDSZ6UUXANCNFSM4V6TDIEA> .

bgyori · 2021-01-15T03:06:05Z

Similarly, we showed that wild-type p53 was polyubiquitinated by Pirh2 but not by Pirh2-DN and Pirh2-ΔRING (Fig. 5C, compare lane 3 with lanes 4 and 5).
In contrast, the levels of IRP2 and TfR1 were increased, whereas the level of FTH1 was decreased, by ectopic mutant p53 (Fig. 4f, compare lanes 3–4 with 1–2, respectively).
In addition, knockout of IRP2 led to decreased expression of TfR1 and increased expression of FTH1 (Fig. 5a), consistent with previous report [41].
MG132 treatment rescued the NSC59984-mediated down-regulation of mutant p53 (figure 4A).

MihaiSurdeanu · 2021-01-15T03:37:53Z

Thanks!

…

On Thu, Jan 14, 2021 at 20:06 Benjamin M. Gyori ***@***.***> wrote: - Similarly, we showed that wild-type p53 was polyubiquitinated by Pirh2 but not by Pirh2-DN and Pirh2-ΔRING (Fig. 5C, compare lane 3 with lanes 4 and 5). - In contrast, the levels of IRP2 and TfR1 were increased, whereas the level of FTH1 was decreased, by ectopic mutant p53 (Fig. 4f, compare lanes 3–4 with 1–2, respectively). - In addition, knockout of IRP2 led to decreased expression of TfR1 and increased expression of FTH1 (Fig. 5a), consistent with previous report [41]. - MG132 treatment rescued the NSC59984-mediated down-regulation of mutant p53 (figure 4A). — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#722 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAI75TRHFEJC2JJCJL7SAWTSZ6WKVANCNFSM4V6TDIEA> .

MihaiSurdeanu · 2021-01-15T15:30:23Z

@enoriega, can you please:

Add unit tests for these sentences.
Add all the "figure" variants to the NER stop words here: https://github.com/clulab/bioresources/blob/master/src/main/resources/org/clulab/reach/kb/ner_stoplist.txt

Thanks!

enoriega · 2021-01-18T03:06:22Z

@MihaiSurdeanu adding figure and variants to the stop list file doesn't work for this problem:

If you consider figure 4A, the CRF assigns tags B- and I- to the tokens. The stop list is only checked for single word entities (https://github.com/clulab/reach/blob/master/processors/src/main/scala/org/clulab/processors/bionlp/BioNERPostProcessor.scala#L85-L89). At this point the grammar hasn't been called and 4A hasn't been detached yet.

I think the cleanest way to solve this is to discard the mention by encoding figure and variants into the rule itself.

What do you suggest?

MihaiSurdeanu · 2021-01-18T15:19:54Z

I fixed this a while ago in the NER post-processing code:
https://github.com/clulab/reach/blob/master/processors/src/main/scala/org/clulab/processors/bionlp/BioNERPostProcessor.scala#L100-L107

But this obviously doesn't work anymore. Can you please try to fix that block of code?
Thanks!

enoriega · 2021-01-18T17:12:21Z

Done. I noticed that after fixing it, the figure numbers (4a, 5C, etc) where recognized as sites. I updated the rule to fix this too.

reach/main/src/main/resources/org/clulab/reach/biogrammar/entities/entities.yml

Line 33 in 4d7445b

(?<! [word=/(?i)^(table|figure)/])

Pull request coming soon.

enoriega · 2021-01-18T17:17:26Z

@MihaiSurdeanu I had to bump the bioresources dependency to version 1.1.37-SNAPSHOT. Should I leave this on for the pull request? Otherwise, what is the recommended course of action?

I see how this I inconvenient, and will propose a fix soon.

MihaiSurdeanu · 2021-01-18T18:20:50Z

Great, thanks! reach master should not depend on any snapshot. So, we need to release bioresources first.

…

On Mon, Jan 18, 2021 at 10:17 AM Enrique Noriega ***@***.***> wrote: @MihaiSurdeanu <https://github.com/MihaiSurdeanu> I had to bump the bioresources dependency to version 1.1.37-SNAPSHOT. Should I leave this on for the pull request? Otherwise, what is the recommended course of action? I see how this I inconvenient, and will propose a fix soon. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#722 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAI75TW4VUTU2J4JYLU36ITS2RULNANCNFSM4V6TDIEA> .

Integrate fix to #722 into master

enoriega self-assigned this Jan 15, 2021

enoriega added a commit that referenced this issue Jan 18, 2021

This commit fixes issue #722

2be6afa

kwalcock added a commit that referenced this issue Jan 19, 2021

Merge pull request #726 from clulab/issue722

251d9d7

Integrate fix to #722 into master

enoriega closed this as completed Jan 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Entity boundary issue related to sites #722

Entity boundary issue related to sites #722

bgyori commented Jan 12, 2021

MihaiSurdeanu commented Jan 12, 2021

bgyori commented Jan 12, 2021

MihaiSurdeanu commented Jan 12, 2021

bgyori commented Jan 12, 2021

MihaiSurdeanu commented Jan 12, 2021

MihaiSurdeanu commented Jan 12, 2021

MihaiSurdeanu commented Jan 12, 2021

MihaiSurdeanu commented Jan 13, 2021

bgyori commented Jan 13, 2021

bgyori commented Jan 13, 2021

MihaiSurdeanu commented Jan 13, 2021

bgyori commented Jan 14, 2021

bgyori commented Jan 15, 2021 •

edited

Loading

MihaiSurdeanu commented Jan 15, 2021 via email

bgyori commented Jan 15, 2021

MihaiSurdeanu commented Jan 15, 2021 via email

MihaiSurdeanu commented Jan 15, 2021

enoriega commented Jan 18, 2021

MihaiSurdeanu commented Jan 18, 2021

enoriega commented Jan 18, 2021

enoriega commented Jan 18, 2021

MihaiSurdeanu commented Jan 18, 2021 via email

Entity boundary issue related to sites #722

Entity boundary issue related to sites #722

Comments

bgyori commented Jan 12, 2021

MihaiSurdeanu commented Jan 12, 2021

bgyori commented Jan 12, 2021

MihaiSurdeanu commented Jan 12, 2021

bgyori commented Jan 12, 2021

MihaiSurdeanu commented Jan 12, 2021

MihaiSurdeanu commented Jan 12, 2021

MihaiSurdeanu commented Jan 12, 2021

MihaiSurdeanu commented Jan 13, 2021

bgyori commented Jan 13, 2021

bgyori commented Jan 13, 2021

MihaiSurdeanu commented Jan 13, 2021

bgyori commented Jan 14, 2021

bgyori commented Jan 15, 2021 • edited Loading

MihaiSurdeanu commented Jan 15, 2021 via email

bgyori commented Jan 15, 2021

MihaiSurdeanu commented Jan 15, 2021 via email

MihaiSurdeanu commented Jan 15, 2021

enoriega commented Jan 18, 2021

MihaiSurdeanu commented Jan 18, 2021

enoriega commented Jan 18, 2021

enoriega commented Jan 18, 2021

MihaiSurdeanu commented Jan 18, 2021 via email

bgyori commented Jan 15, 2021 •

edited

Loading