Update to processors 8.3.0 #742

kwalcock · 2021-03-23T18:49:39Z

No description provided.

This avoids deprecation warning

kwalcock · 2021-03-26T21:35:15Z

[error] Failed tests:
[error] 	org.clulab.processors.TestBioNLPProcessor

[error] Failed tests:
[error] 	org.clulab.reach.TestBindingEvents
[error] 	org.clulab.reach.TestNERLabeling
[error] 	org.clulab.reach.PolaritySuite
[error] 	org.clulab.reach.TestDeModifications
[error] 	org.clulab.reach.TestTemplaticSimpleDeEvents
[error] 	org.clulab.reach.TestPolarity
[error] 	org.clulab.reach.TestAmountEvents
[error] 	org.clulab.reach.NegationTests
[error] 	org.clulab.reach.TestActivationEvents
[error] 	org.clulab.reach.TestCoreference
[error] 	org.clulab.reach.TestModifications
[error] 	org.clulab.reach.ExperimentalRegulationTests
[error] 	org.clulab.reach.TestRegulationEvents
[error] 	org.clulab.reach.TestOverrides
[error] 	org.clulab.reach.TestTemplaticSimpleEvents
[error] 	org.clulab.reach.TestConversionEvents

[error] Failed tests:
[error] 	org.clulab.reach.export.TestApi
[error] 	org.clulab.reach.export.TestOutputDegrader

kwalcock · 2021-03-27T02:45:33Z

I don't have a good idea why these tests are failing. They must be related to recent changes in the NER code or the embeddings code. Any suggestions out there?

MihaiSurdeanu · 2021-03-27T02:54:42Z

I’ll take a look soon. On March 26, 2021 at 7:45:46 PM, Keith Alcock ***@***.***) wrote: I don't have a good idea why these tests are failing. They must be related to recent changes in the NER code or the embeddings code. Any suggestions out there? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#742 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAI75TR5HKW3NZ7EUJIT6WDTFVBFTANCNFSM4ZVYREKQ> .

MihaiSurdeanu · 2021-03-28T20:50:51Z

@kwalcock, @enoriega: it seems that the change affected the behavior of the LexiconNER. I propose we focus on understanding this first. Here are two examples:

From org.clulab.reach.TestNERLabeling:
Before 8.3.0, the sentence below marks "E3" as Simple-chemical:

TEXT:   Estrone E1 , estradiol E2 , and estriol E3 do not cause cancer .
TOKENS: (0,Estrone,NN), (1,E1,NN), (2,,,,), (3,estradiol,NN), (4,E2,NN), (5,,,,), (6,and,CC), (7,estriol,NN), (8,E3,NN), (9,do,VBP), (10,not,RB), (11,cause,VB), (12,cancer,NN), (13,.,.)
ENTITY LABELS: (Estrone,B-Simple_chemical), (E1,B-Simple_chemical), (,,O), (estradiol,B-Simple_chemical), (E2,B-Simple_chemical), (,,O), (and,O), (estriol,B-Simple_chemical), (E3,B-Simple_chemical), (do,O), (not,O), (cause,O), (cancer,O), (.,O)

After the change to 8.3.0, "E3" is marked as Family:

TEXT:   Estrone E1 , estradiol E2 , and estriol E3 do not cause cancer .
TOKENS: (0,Estrone,NN), (1,E1,NN), (2,,,,), (3,estradiol,NN), (4,E2,NN), (5,,,,), (6,and,CC), (7,estriol,NN), (8,E3,NN), (9,do,VBP), (10,not,RB), (11,cause,VB), (12,cancer,NN), (13,.,.)
ENTITY LABELS: (Estrone,B-Simple_chemical), (E1,B-Simple_chemical), (,,O), (estradiol,B-Simple_chemical), (E2,B-Simple_chemical), (,,O), (and,O), (estriol,B-Simple_chemical), (E3,B-Family), (do,O), (not,O), (cause,O), (cancer,O), (.,O)

Another one:
In org.clulab.processors.TestBioNLPProcessor, before 8.3.0 "neurofibromin" was marked as a GGP:

TEXT:   We tested the level of neurofibromin present in the sample
TOKENS: (0,We,PRP), (1,tested,VBD), (2,the,DT), (3,level,NN), (4,of,IN), (5,neurofibromin,JJ), (6,present,JJ), (7,in,IN), (8,the,DT), (9,sample,NN)
ENTITY LABELS: (We,O), (tested,O), (the,O), (level,O), (of,O), (neurofibromin,B-Gene_or_gene_product), (present,O), (in,O), (the,O), (sample,O)

After 8.3.0, this entity is no longer recognized:

TEXT:   We tested the level of neurofibromin present in the sample
TOKENS: (0,We,PRP), (1,tested,VBD), (2,the,DT), (3,level,NN), (4,of,IN), (5,neurofibromin,JJ), (6,present,JJ), (7,in,IN), (8,the,DT), (9,sample,NN)
ENTITY LABELS: (We,O), (tested,O), (the,O), (level,O), (of,O), (neurofibromin,O), (present,O), (in,O), (the,O), (sample,O)

Note that in bioresources, we do not have "neurofibromin" as a GGP. We have "neurofibromin 1" and "neurofibromin 2". So technically, the latter behavior is correct.
But I would like to understand what is happening here. @enoriega: can you please help? Can you please try to debug these two examples and understand what behavior changed? I am pretty sure it's in LexiconNER, but not sure what. Once we do, I suspect @kwalcock can fix it.

Thank you both!

MihaiSurdeanu · 2021-03-29T16:48:37Z

@enoriega : can you please prioritize this?

enoriega · 2021-03-29T21:22:49Z

@kwalcock @MihaiSurdeanu

For Estrone E1 , estradiol E2 , and estriol E3 do not cause cancer ., the change happened because there is a conflicting match in NER-Grounding-override.tsv with two matching entries, one for simple_chemical and another for family. This situation was already present, and before the change, the class CompactLexiconNER consistently chose simple chemical. After the change, it now choses family, and I was able to trace the culprit to the longest match function:

https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/sequences/CompactLexiconNER.scala#L69-L137

I'm not sure if the best way to fix it is to make it behave as before or to remove the conflicting entry in the overrides file.

For We tested the level of neurofibromin present in the sample, both NER components work correctly. The CRF labels neurofibromin as B-Gene-or-gene-product. The problem was introduced by a refactor of the following function, which overwrites the CRF's tag with the lexicon's (which is O, because it is no an entry in any KB) while attempting to merge both tag sets:

https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/sequences/LexiconNER.scala#L261-L278

processors' version used by the master branch of reach has a different implementation of mergeLabels

MihaiSurdeanu · 2021-03-29T22:43:20Z

Thanks @enoriega !

@kwalcock : has anything changed in these methods? why the different behavior?

kwalcock · 2021-03-29T23:57:08Z

I'll look at it very soon. @enoriega did excellent detective work. I suspect that I broke some things, especially in cases of ties or conflicts, although everything was supposed to have been taken into account.

kwalcock · 2021-03-30T04:02:08Z

The second one with We tested the level of neurofibromin present is an outright mistake, mine, and it's pretty bad. The line of code in LexiconNER.mergeLabels that was

src.copyToArray(dst, offset, notOutsideCount)

I thought to mean the same as

Array.copy(src, offset, dst, offset, notOutsideCount)

but instead it is

Array.copy(src, 0, dst, offset, notOutsideCount)

so that the wrong values are being copied into the destination. There will be an update and probably a warning not to use version 8.3.0. Master was affected for a time, but it looks like there was only the one official release.

MihaiSurdeanu · 2021-03-30T04:03:08Z

:)

kwalcock · 2021-03-30T18:11:13Z

This second issue with Estrone E1 , estradiol E2 , and estriol E3 do not cause cancer . does have to do with there being two entries in the overrideKB and I indeed changed the behavior there on purpose, but now that it looks like a problem, I'll say mistakenly. The explanation is quite long-winded. If this tie breaking is especially important it might be worth visiting more formally, though. I changed processors so that the tests pass, but that is just so that previous behavior is preserved, not because it's good behavior.

Long ago, before changes of 2/2/2019, code for the override KBs looked like

      overrideKBs.get.foreach(okb => {
        val reader = loadStreamFromClasspath(okb)
        val overrideMatchers = loadOverrideKB(reader, lexicalVariationEngine, caseInsensitiveMatching, knownCaseInsensitives)
        for(name <- overrideMatchers.keySet.toList.sorted) {
          val matcher = overrideMatchers(name)
          logger.info(s"Loaded OVERRIDE matcher for label $name. This matcher contains ${matcher.uniqueStrings.size} unique strings; the size of the first layer is ${matcher.entries.size}.")
          matchers += Tuple2(name, matcher)
        }
        reader.close()
      })

If I understand this correctly, the matchers were ordered by the overrideKBs (files) and then within each of those, alphabetically by label which is in the keySet. This may have just been a way to ensure consistency. Afterwards came the regular KBs in the order they were specified. Each NE could appear in numerous of the matchers, but matcher order determined the winner.

The version right after that which has been in production until recently used a single matcher which for each NE included a single number that was the index into an array of labels. There could only be one number. It would have been difficult to calculate the winning number to match previous behavior and I probably missed seeing the sorting anyway. The label first associated with an NE was used.

In the most recent version which is causing the problems, I relaxed that first label requirement and allowed overwriting if there was a duplicate in any override KB whose label would have come earlier in the list of regular KBs. That messes up the reach tests. The change might have been to better approximate original behavior, but I didn't realize that the modified behavior of the previous paragraph was already important.

There will be a processors PR momentarily.

MihaiSurdeanu · 2021-03-30T18:17:36Z

Thanks @kwalcock !

MihaiSurdeanu · 2021-03-30T18:20:13Z

In the second version, how were the labels assigned to the same NE sorted? That may be some arbitrary behavior not worth preserving...

Also, @enoriega: can you please remove the multiple labels for E3? Maybe keep chemical, since that is backwards compatible?

kwalcock · 2021-03-30T18:52:03Z

They were first come, first serve in the second, middle version. However, the queue was first all the overrideKBs in order, then the regular KBs. The overrideKBs refer to labels (names) of the regular KBs, so using an overrideKB one could more or less move an NE from one KB to a different one. Maybe that was meant by override. For the newer version I might have been thinking that the order of the regular KBs should determine the priority, not the line or file in an overrideKB that something appeared. That might make more sense for new entries. One overrides an otherwise closed KB by adding a new entry and it should have the same precedence as everything else in there.

IIRC correctly, I encountered many duplicates, although maybe not ones in the same files like E3. I could make a list.

MihaiSurdeanu · 2021-03-30T18:54:20Z

Ok, then it is important to preserve this functionality. That order has meaning, and it is what was meant by override.

kwalcock · 2021-03-31T02:05:39Z

If someone was to decide that the overrides were case sensitive, then things like Trp and TRP would not be redundant or conflicting. If anything is all lower case, then it gets involved in the known case insensitives, and removing them gets complicated. Nevertheless, here are the "duplicates":

eht-1864*
akt*
axin*
camk2*
cdc25*
cdkn1*
cox4*
cox6b*
cox7a*
cox8*
cyclin-a*
dgk*
dvl*
e3*
erk*
fos*
g-alpha*
hsp70*
hsp90*
ifngr*
pp1c*
prkaa*
rps6ka*
raf*
rap1*
ras*
shc*
smc1*
thr*
trp*
vav*
wnt5*
cadherin*
eif4a*
eif4g*
eotaxin*
porin*
thioredoxin* (reductase*)
ubiquitin*

kwalcock · 2021-03-31T02:15:52Z

Here is also a note to self. If instead of the CombinedLexiconNER, the SeparatedLexiconNER is used, a few of the tests will fail. I think these are tests that were written after the Combined version was put into use, so they depend on the per NE override. Previously the override priority was per file. So, for example, in NER-Grounding-Override.tsv there is now

Line 249 something Family
Line 298 otherthing Site
Line 333 Thr Site
Line 756 THR Family

Because of line 249, the Separated one will give priority to Family because it saw that label first. The Combined version makes that decision on a per NE basis. Line 333 will cause Site to be used because the later Family will not overwrite it now. It had been allowing this overwrite, but that was at odds with the middle version from above, so that won't be done anymore.

kwalcock · 2021-05-18T15:32:50Z

This PR has been superseded by others, particularly one updating processors to 8.3.3. Some of the individual commits which might be useful will be transferred to a different PR.

kwalcock added 2 commits March 23, 2021 11:22

Update processors

fb86616

Use ThreadUtils

4943012

This avoids deprecation warning

Update to processors 8.3.1

986e8bd

kwalcock mentioned this pull request Apr 6, 2021

Remove duplicate entries from Bioresources #746

Closed

kwalcock closed this May 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update to processors 8.3.0 #742

Update to processors 8.3.0 #742

kwalcock commented Mar 23, 2021

kwalcock commented Mar 26, 2021

kwalcock commented Mar 27, 2021

MihaiSurdeanu commented Mar 27, 2021 via email

MihaiSurdeanu commented Mar 28, 2021

MihaiSurdeanu commented Mar 29, 2021

enoriega commented Mar 29, 2021

MihaiSurdeanu commented Mar 29, 2021

kwalcock commented Mar 29, 2021

kwalcock commented Mar 30, 2021

MihaiSurdeanu commented Mar 30, 2021

kwalcock commented Mar 30, 2021

MihaiSurdeanu commented Mar 30, 2021

MihaiSurdeanu commented Mar 30, 2021

kwalcock commented Mar 30, 2021

MihaiSurdeanu commented Mar 30, 2021

kwalcock commented Mar 31, 2021

kwalcock commented Mar 31, 2021

kwalcock commented May 18, 2021

Update to processors 8.3.0 #742

Update to processors 8.3.0 #742

Conversation

kwalcock commented Mar 23, 2021

kwalcock commented Mar 26, 2021

kwalcock commented Mar 27, 2021

MihaiSurdeanu commented Mar 27, 2021 via email

MihaiSurdeanu commented Mar 28, 2021

MihaiSurdeanu commented Mar 29, 2021

enoriega commented Mar 29, 2021

MihaiSurdeanu commented Mar 29, 2021

kwalcock commented Mar 29, 2021

kwalcock commented Mar 30, 2021

MihaiSurdeanu commented Mar 30, 2021

kwalcock commented Mar 30, 2021

MihaiSurdeanu commented Mar 30, 2021

MihaiSurdeanu commented Mar 30, 2021

kwalcock commented Mar 30, 2021

MihaiSurdeanu commented Mar 30, 2021

kwalcock commented Mar 31, 2021

kwalcock commented Mar 31, 2021

kwalcock commented May 18, 2021