Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entity boundary issue related to sites #722

Closed
bgyori opened this issue Jan 12, 2021 · 22 comments
Closed

Entity boundary issue related to sites #722

bgyori opened this issue Jan 12, 2021 · 22 comments
Assignees

Comments

@bgyori
Copy link
Contributor

bgyori commented Jan 12, 2021

A very common entity boundary issue currently happens with cyclins, which can both be described as a family ("cyclin") or specific proteins like "cyclin A1". Currently, in many of these cases, the entity boundary is cropped to just "cyclin", even if a specific type like "A1" or "D1" follows.

Looking at bioresources, in hgnc.tsv.gz we have

cyclin A1»··P78396»·Human 

and cyclin A1 also appears in ner/Gene_or_gene_product.tsv.gz. Still, when putting cyclin A1 into the Reach shell, we get

>>> cyclin A1

sentence #0
TEXT:   cyclin A1
TOKENS: (0,cyclin,NN), (1,A1,NN)
ENTITY LABELS: (cyclin,B-Gene_or_gene_product), (A1,I-Gene_or_gene_product)

LEMMAS: cyclin a1
roots: 1
outgoing:
	0:
	1: (0,compound)
incoming:
	0: (1,compound)
	1:

     
ENTITIES: 2

MENTION TEXT:  cyclin
LABELS:        List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController)
DISPLAY LABEL: Protein
	------------------------------
	RULE => ner-gene_or_gene_product-entities
	TYPE => CorefTextBoundMention
	------------------------------
	GROUNDING: <KBResolution: Cyclin, fplx, Cyclin, human, <IMKBMetaInfo: uaz, NER-Grounding-Override.tsv.gz, , , sp=false, f=false, p=false>>

CONTEXT: NONE
	------------------------------

MENTION TEXT:  A1
LABELS:        List(Site)
DISPLAY LABEL: Site
	------------------------------
	RULE => site_1letter_a
	TYPE => CorefTextBoundMention
	------------------------------
	GROUNDING: <KBResolution: A1, uaz, UAZ00001, , <IMKBMetaInfo: uaz, , , , sp=false, f=false, p=false>>

CONTEXT: NONE
	------------------------------


EVENTS:   0

==================================================

So it appears that NER works correctly, since if I understand correctly, (cyclin,B-Gene_or_gene_product), (A1,I-Gene_or_gene_product) means that a single entity is detected, but it then gets broken up into a Protein and a Site. (One additional note: I checked whether this behavior has anything to do with the fact that "cyclin" is also defined as an override in NER-Grounding-Override.tsv.gz, but that doesn't seem to be the case, the same behavior happens if I remove the override).

@MihaiSurdeanu
Copy link
Contributor

Thanks @bgyori !
I think this behavior was on purpose based on what we saw when we developed it. We observed a while ago that the NER "eats" into descriptions of sites because of the same overlap you observed. So we allowed the Site detector (which is executed downstream of the NER) to take away tokens from the NER if the site grammar thinks it's a site. I think I can revert that if we decide that's the preferred behavior. What do you think?

@bgyori
Copy link
Contributor Author

bgyori commented Jan 12, 2021

I see, is there any documentation / communication from around that time that has examples of issues before that change? That might help figure out how to engineer an overall better solution.

@MihaiSurdeanu
Copy link
Contributor

No... This was an early decision, which may not even be justified right now. Not sure.
I think we could employ your tactic here: revert the change, and compare outputs before and after to understand the impact.

@bgyori
Copy link
Contributor Author

bgyori commented Jan 12, 2021

Yes, I think in principle a good prioritization would be to first check if the named entity can be grounded fully as is (in the above example as "cyclin A1"). Then if it cannot be grounded, check if it can be broken up into a protein + site combination. I'm actually not sure what some examples of this latter case might be since in most cases e.g., "ERK2 T185" the site isn't recognized as part of the entity in the first place. In any case, we can try to check empirically if there are any issues with changing this.

Also, I couldn't immediately find where this is implemented in the code but I could think more about the issue if you can point me to it.

@MihaiSurdeanu
Copy link
Contributor

I'll take a look at the code later today.

@MihaiSurdeanu
Copy link
Contributor

Just FYI, if you look at this file:
https://github.com/clulab/reach/blob/master/main/src/main/resources/org/clulab/reach/biogrammar/entities/entities.yml

you'll see that the rule that creates Site have priority 1, whereas these rules:
https://github.com/clulab/reach/blob/master/main/src/main/resources/org/clulab/reach/biogrammar/entities/entities.yml#L129-L233

which convert BIO notations into actual Reach entities have priority 3, which means they are executed later.
I'll play with these priorities to see what happens.

@MihaiSurdeanu
Copy link
Contributor

There's a little more here than just priorities. More soon, hopefully.

@MihaiSurdeanu
Copy link
Contributor

It was not a priority issue. It was this line:
https://github.com/clulab/reach/blob/master/main/src/main/resources/org/clulab/reach/biogrammar/entities/entities.yml#L186

which prevents entities contain upper-case letters followed by digits.
As the comment says this was added to prevent the entity to "eat" mutations.
Removing the !word=... constraint it makes Reach recognize "cyclin A1" as an entity. Also, all tests continue to pass.

@bgyori: should we remove this constraint? Or adjust it in a meaningful way?
Thanks!

@bgyori
Copy link
Contributor Author

bgyori commented Jan 13, 2021

I see, I'm still wondering if there was a good reason for adding that condition but I couldn't come up with any examples, even synthetically where it would make a difference. For instance, if we have a typical mutation listed as a separate word, like BRAF V600E, the V600E part is not recognized as part of the named entity anyway and so this condition isn't applied. In other cases, where there is a named entity match for the entire sequence like Cyclin D1, the correct behavior is to not break that up. Could there be cases where NER extends the entity boundary to include the second word even if it is not part of an entity string explicitly listed in the NER files?

@bgyori
Copy link
Contributor Author

bgyori commented Jan 13, 2021

Just for the record, some synthetic examples I've tried (not all of them make sense biologically) are e.g., KRAS G13, KRAS G13C, EGFR A1, EGFR C15, EGFR A66C, AKT1 A66, etc. There are a few different cases for entity labels, one is

(EGFR,B-Gene_or_gene_product), (Y1,B-Gene_or_gene_product)
(AKT1,B-Gene_or_gene_product), (A66,B-Simple_chemical)

where we get a new B- (not I-) entity for the site. Another is

(KRAS,B-Gene_or_gene_product), (G13C,O)

where the site is an O (again not I-). I couldn't produce any examples with entities tagged as B- followed by I- where the I- was a legitimate site for B-.

@MihaiSurdeanu
Copy link
Contributor

This came up because in addition of the bioresources NER, which you know, Reach merges in the outputs of a CRF NER, which (at least in past) we have observed to "eat" into mutations and sites. But it doesn't seem to be happening commonly at all. So it's maybe safe to just remove? Should I do it?

@bgyori
Copy link
Contributor Author

bgyori commented Jan 14, 2021

I'm trying to do some analysis on papers with mutations to see if we lose anything, I'll report what I find here.

@bgyori
Copy link
Contributor Author

bgyori commented Jan 15, 2021

A larger scale test actually highlighted some consequences to this proposed change. Several instances of figure names were extracted as named entities e.g.:

  • Fig. 5a
  • Fig. 4f
  • figure 4A

I found that before the change, from the same sentences we still extracted

  • Fig.

as a named entity but we didn't get figure, presumably because that is a stop word. Is there a way to recognize Fig. and figure as stopwords assuming they could come with a suffix like Fig. 5a and figure 4a?

@MihaiSurdeanu
Copy link
Contributor

MihaiSurdeanu commented Jan 15, 2021 via email

@bgyori
Copy link
Contributor Author

bgyori commented Jan 15, 2021

  • Similarly, we showed that wild-type p53 was polyubiquitinated by Pirh2 but not by Pirh2-DN and Pirh2-ΔRING (Fig. 5C, compare lane 3 with lanes 4 and 5).
  • In contrast, the levels of IRP2 and TfR1 were increased, whereas the level of FTH1 was decreased, by ectopic mutant p53 (Fig. 4f, compare lanes 3–4 with 1–2, respectively).
  • In addition, knockout of IRP2 led to decreased expression of TfR1 and increased expression of FTH1 (Fig. 5a), consistent with previous report [41].
  • MG132 treatment rescued the NSC59984-mediated down-regulation of mutant p53 (figure 4A).

@MihaiSurdeanu
Copy link
Contributor

MihaiSurdeanu commented Jan 15, 2021 via email

@MihaiSurdeanu
Copy link
Contributor

@enoriega, can you please:

Thanks!

@enoriega enoriega self-assigned this Jan 15, 2021
@enoriega
Copy link
Member

@MihaiSurdeanu adding figure and variants to the stop list file doesn't work for this problem:

If you consider figure 4A, the CRF assigns tags B- and I- to the tokens. The stop list is only checked for single word entities (https://github.com/clulab/reach/blob/master/processors/src/main/scala/org/clulab/processors/bionlp/BioNERPostProcessor.scala#L85-L89). At this point the grammar hasn't been called and 4A hasn't been detached yet.

I think the cleanest way to solve this is to discard the mention by encoding figure and variants into the rule itself.

What do you suggest?

@MihaiSurdeanu
Copy link
Contributor

I fixed this a while ago in the NER post-processing code:
https://github.com/clulab/reach/blob/master/processors/src/main/scala/org/clulab/processors/bionlp/BioNERPostProcessor.scala#L100-L107

But this obviously doesn't work anymore. Can you please try to fix that block of code?
Thanks!

@enoriega
Copy link
Member

Done. I noticed that after fixing it, the figure numbers (4a, 5C, etc) where recognized as sites. I updated the rule to fix this too.

Pull request coming soon.

@enoriega
Copy link
Member

@MihaiSurdeanu I had to bump the bioresources dependency to version 1.1.37-SNAPSHOT. Should I leave this on for the pull request? Otherwise, what is the recommended course of action?

I see how this I inconvenient, and will propose a fix soon.

enoriega added a commit that referenced this issue Jan 18, 2021
@MihaiSurdeanu
Copy link
Contributor

MihaiSurdeanu commented Jan 18, 2021 via email

kwalcock added a commit that referenced this issue Jan 19, 2021
Integrate fix to #722 into master
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants