Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to processors 8.3.0 #742

Closed
wants to merge 3 commits into from
Closed

Conversation

kwalcock
Copy link
Member

No description provided.

This avoids deprecation warning
@kwalcock
Copy link
Member Author

[error] Failed tests:
[error] 	org.clulab.processors.TestBioNLPProcessor

[error] Failed tests:
[error] 	org.clulab.reach.TestBindingEvents
[error] 	org.clulab.reach.TestNERLabeling
[error] 	org.clulab.reach.PolaritySuite
[error] 	org.clulab.reach.TestDeModifications
[error] 	org.clulab.reach.TestTemplaticSimpleDeEvents
[error] 	org.clulab.reach.TestPolarity
[error] 	org.clulab.reach.TestAmountEvents
[error] 	org.clulab.reach.NegationTests
[error] 	org.clulab.reach.TestActivationEvents
[error] 	org.clulab.reach.TestCoreference
[error] 	org.clulab.reach.TestModifications
[error] 	org.clulab.reach.ExperimentalRegulationTests
[error] 	org.clulab.reach.TestRegulationEvents
[error] 	org.clulab.reach.TestOverrides
[error] 	org.clulab.reach.TestTemplaticSimpleEvents
[error] 	org.clulab.reach.TestConversionEvents

[error] Failed tests:
[error] 	org.clulab.reach.export.TestApi
[error] 	org.clulab.reach.export.TestOutputDegrader

@kwalcock
Copy link
Member Author

I don't have a good idea why these tests are failing. They must be related to recent changes in the NER code or the embeddings code. Any suggestions out there?

@MihaiSurdeanu
Copy link
Contributor

MihaiSurdeanu commented Mar 27, 2021 via email

@MihaiSurdeanu
Copy link
Contributor

@kwalcock, @enoriega: it seems that the change affected the behavior of the LexiconNER. I propose we focus on understanding this first. Here are two examples:

From org.clulab.reach.TestNERLabeling:
Before 8.3.0, the sentence below marks "E3" as Simple-chemical:

TEXT:   Estrone E1 , estradiol E2 , and estriol E3 do not cause cancer .
TOKENS: (0,Estrone,NN), (1,E1,NN), (2,,,,), (3,estradiol,NN), (4,E2,NN), (5,,,,), (6,and,CC), (7,estriol,NN), (8,E3,NN), (9,do,VBP), (10,not,RB), (11,cause,VB), (12,cancer,NN), (13,.,.)
ENTITY LABELS: (Estrone,B-Simple_chemical), (E1,B-Simple_chemical), (,,O), (estradiol,B-Simple_chemical), (E2,B-Simple_chemical), (,,O), (and,O), (estriol,B-Simple_chemical), (E3,B-Simple_chemical), (do,O), (not,O), (cause,O), (cancer,O), (.,O)

After the change to 8.3.0, "E3" is marked as Family:

TEXT:   Estrone E1 , estradiol E2 , and estriol E3 do not cause cancer .
TOKENS: (0,Estrone,NN), (1,E1,NN), (2,,,,), (3,estradiol,NN), (4,E2,NN), (5,,,,), (6,and,CC), (7,estriol,NN), (8,E3,NN), (9,do,VBP), (10,not,RB), (11,cause,VB), (12,cancer,NN), (13,.,.)
ENTITY LABELS: (Estrone,B-Simple_chemical), (E1,B-Simple_chemical), (,,O), (estradiol,B-Simple_chemical), (E2,B-Simple_chemical), (,,O), (and,O), (estriol,B-Simple_chemical), (E3,B-Family), (do,O), (not,O), (cause,O), (cancer,O), (.,O)

Another one:
In org.clulab.processors.TestBioNLPProcessor, before 8.3.0 "neurofibromin" was marked as a GGP:

TEXT:   We tested the level of neurofibromin present in the sample
TOKENS: (0,We,PRP), (1,tested,VBD), (2,the,DT), (3,level,NN), (4,of,IN), (5,neurofibromin,JJ), (6,present,JJ), (7,in,IN), (8,the,DT), (9,sample,NN)
ENTITY LABELS: (We,O), (tested,O), (the,O), (level,O), (of,O), (neurofibromin,B-Gene_or_gene_product), (present,O), (in,O), (the,O), (sample,O)

After 8.3.0, this entity is no longer recognized:

TEXT:   We tested the level of neurofibromin present in the sample
TOKENS: (0,We,PRP), (1,tested,VBD), (2,the,DT), (3,level,NN), (4,of,IN), (5,neurofibromin,JJ), (6,present,JJ), (7,in,IN), (8,the,DT), (9,sample,NN)
ENTITY LABELS: (We,O), (tested,O), (the,O), (level,O), (of,O), (neurofibromin,O), (present,O), (in,O), (the,O), (sample,O)

Note that in bioresources, we do not have "neurofibromin" as a GGP. We have "neurofibromin 1" and "neurofibromin 2". So technically, the latter behavior is correct.
But I would like to understand what is happening here. @enoriega: can you please help? Can you please try to debug these two examples and understand what behavior changed? I am pretty sure it's in LexiconNER, but not sure what. Once we do, I suspect @kwalcock can fix it.

Thank you both!

@MihaiSurdeanu
Copy link
Contributor

@enoriega : can you please prioritize this?

@enoriega
Copy link
Member

@kwalcock @MihaiSurdeanu

For Estrone E1 , estradiol E2 , and estriol E3 do not cause cancer ., the change happened because there is a conflicting match in NER-Grounding-override.tsv with two matching entries, one for simple_chemical and another for family. This situation was already present, and before the change, the class CompactLexiconNER consistently chose simple chemical. After the change, it now choses family, and I was able to trace the culprit to the longest match function:

https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/sequences/CompactLexiconNER.scala#L69-L137

I'm not sure if the best way to fix it is to make it behave as before or to remove the conflicting entry in the overrides file.


For We tested the level of neurofibromin present in the sample, both NER components work correctly. The CRF labels neurofibromin as B-Gene-or-gene-product. The problem was introduced by a refactor of the following function, which overwrites the CRF's tag with the lexicon's (which is O, because it is no an entry in any KB) while attempting to merge both tag sets:

https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/sequences/LexiconNER.scala#L261-L278

processors' version used by the master branch of reach has a different implementation of mergeLabels

@MihaiSurdeanu
Copy link
Contributor

Thanks @enoriega !

@kwalcock : has anything changed in these methods? why the different behavior?

@kwalcock
Copy link
Member Author

I'll look at it very soon. @enoriega did excellent detective work. I suspect that I broke some things, especially in cases of ties or conflicts, although everything was supposed to have been taken into account.

@kwalcock
Copy link
Member Author

The second one with We tested the level of neurofibromin present is an outright mistake, mine, and it's pretty bad. The line of code in LexiconNER.mergeLabels that was

src.copyToArray(dst, offset, notOutsideCount)

I thought to mean the same as

Array.copy(src, offset, dst, offset, notOutsideCount)

but instead it is

Array.copy(src, 0, dst, offset, notOutsideCount)

so that the wrong values are being copied into the destination. There will be an update and probably a warning not to use version 8.3.0. Master was affected for a time, but it looks like there was only the one official release.

@MihaiSurdeanu
Copy link
Contributor

:)

@kwalcock
Copy link
Member Author

This second issue with Estrone E1 , estradiol E2 , and estriol E3 do not cause cancer . does have to do with there being two entries in the overrideKB and I indeed changed the behavior there on purpose, but now that it looks like a problem, I'll say mistakenly. The explanation is quite long-winded. If this tie breaking is especially important it might be worth visiting more formally, though. I changed processors so that the tests pass, but that is just so that previous behavior is preserved, not because it's good behavior.

Long ago, before changes of 2/2/2019, code for the override KBs looked like

      overrideKBs.get.foreach(okb => {
        val reader = loadStreamFromClasspath(okb)
        val overrideMatchers = loadOverrideKB(reader, lexicalVariationEngine, caseInsensitiveMatching, knownCaseInsensitives)
        for(name <- overrideMatchers.keySet.toList.sorted) {
          val matcher = overrideMatchers(name)
          logger.info(s"Loaded OVERRIDE matcher for label $name. This matcher contains ${matcher.uniqueStrings.size} unique strings; the size of the first layer is ${matcher.entries.size}.")
          matchers += Tuple2(name, matcher)
        }
        reader.close()
      })

If I understand this correctly, the matchers were ordered by the overrideKBs (files) and then within each of those, alphabetically by label which is in the keySet. This may have just been a way to ensure consistency. Afterwards came the regular KBs in the order they were specified. Each NE could appear in numerous of the matchers, but matcher order determined the winner.

The version right after that which has been in production until recently used a single matcher which for each NE included a single number that was the index into an array of labels. There could only be one number. It would have been difficult to calculate the winning number to match previous behavior and I probably missed seeing the sorting anyway. The label first associated with an NE was used.

In the most recent version which is causing the problems, I relaxed that first label requirement and allowed overwriting if there was a duplicate in any override KB whose label would have come earlier in the list of regular KBs. That messes up the reach tests. The change might have been to better approximate original behavior, but I didn't realize that the modified behavior of the previous paragraph was already important.

There will be a processors PR momentarily.

@MihaiSurdeanu
Copy link
Contributor

Thanks @kwalcock !

@MihaiSurdeanu
Copy link
Contributor

In the second version, how were the labels assigned to the same NE sorted? That may be some arbitrary behavior not worth preserving...

Also, @enoriega: can you please remove the multiple labels for E3? Maybe keep chemical, since that is backwards compatible?

@kwalcock
Copy link
Member Author

They were first come, first serve in the second, middle version. However, the queue was first all the overrideKBs in order, then the regular KBs. The overrideKBs refer to labels (names) of the regular KBs, so using an overrideKB one could more or less move an NE from one KB to a different one. Maybe that was meant by override. For the newer version I might have been thinking that the order of the regular KBs should determine the priority, not the line or file in an overrideKB that something appeared. That might make more sense for new entries. One overrides an otherwise closed KB by adding a new entry and it should have the same precedence as everything else in there.

IIRC correctly, I encountered many duplicates, although maybe not ones in the same files like E3. I could make a list.

@MihaiSurdeanu
Copy link
Contributor

Ok, then it is important to preserve this functionality. That order has meaning, and it is what was meant by override.

@kwalcock
Copy link
Member Author

If someone was to decide that the overrides were case sensitive, then things like Trp and TRP would not be redundant or conflicting. If anything is all lower case, then it gets involved in the known case insensitives, and removing them gets complicated. Nevertheless, here are the "duplicates":

eht-1864*
akt*
axin*
camk2*
cdc25*
cdkn1*
cox4*
cox6b*
cox7a*
cox8*
cyclin-a*
dgk*
dvl*
e3*
erk*
fos*
g-alpha*
hsp70*
hsp90*
ifngr*
pp1c*
prkaa*
rps6ka*
raf*
rap1*
ras*
shc*
smc1*
thr*
trp*
vav*
wnt5*
cadherin*
eif4a*
eif4g*
eotaxin*
porin*
thioredoxin* (reductase*)
ubiquitin*

@kwalcock
Copy link
Member Author

Here is also a note to self. If instead of the CombinedLexiconNER, the SeparatedLexiconNER is used, a few of the tests will fail. I think these are tests that were written after the Combined version was put into use, so they depend on the per NE override. Previously the override priority was per file. So, for example, in NER-Grounding-Override.tsv there is now

Line 249 something Family
Line 298 otherthing Site
Line 333 Thr Site
Line 756 THR Family

Because of line 249, the Separated one will give priority to Family because it saw that label first. The Combined version makes that decision on a per NE basis. Line 333 will cause Site to be used because the later Family will not overwrite it now. It had been allowing this overwrite, but that was at odds with the middle version from above, so that won't be done anymore.

@kwalcock
Copy link
Member Author

This PR has been superseded by others, particularly one updating processors to 8.3.3. Some of the individual commits which might be useful will be transferred to a different PR.

@kwalcock kwalcock closed this May 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants