Update to processors 8.3.0 #742
This avoids deprecation warning
|
I don't have a good idea why these tests are failing. They must be related to recent changes in the NER code or the embeddings code. Any suggestions out there? |
I’ll take a look soon.
On March 26, 2021 at 7:45:46 PM, Keith Alcock ***@***.***) wrote:
I don't have a good idea why these tests are failing. They must be related
to recent changes in the NER code or the embeddings code. Any suggestions
out there?
|
@kwalcock, @enoriega: it seems that the change affected the behavior of the LexiconNER. I propose we focus on understanding this first. Here are two examples: From
After the change to 8.3.0, "E3" is marked as
Another one:
After 8.3.0, this entity is no longer recognized:
Note that in bioresources, we do not have "neurofibromin" as a GGP. We have "neurofibromin 1" and "neurofibromin 2". So technically, the latter behavior is correct. Thank you both! |
@enoriega : can you please prioritize this? |
For I'm not sure if the best way to fix it is to make it behave as before or to remove the conflicting entry in the overrides file. For processors' version used by the master branch of reach has a different implementation of |
I'll look at it very soon. @enoriega did excellent detective work. I suspect that I broke some things, especially in cases of ties or conflicts, although everything was supposed to have been taken into account. |
The second one involves src.copyToArray(dst, offset, notOutsideCount), which I thought meant the same as Array.copy(src, offset, dst, offset, notOutsideCount). Instead it is equivalent to Array.copy(src, 0, dst, offset, notOutsideCount), so the wrong values are being copied into the destination. There will be an update and probably a warning not to use version 8.3.0. Master was affected for a time, but it looks like there was only the one official release. |
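The difference between the two calls can be seen in a minimal sketch. `CopyToArrayDemo`, its helper methods, and the sample values are hypothetical and are not taken from processors; only `copyToArray` and `Array.copy` are standard library calls.

```scala
object CopyToArrayDemo {
  // Destination filled with -1 so that the copied cells stand out.
  private def blank(n: Int): Array[Int] = Array.fill(n)(-1)

  // copyToArray always reads from the start of src;
  // `offset` only positions the write inside dst.
  def viaCopyToArray(src: Array[Int], offset: Int, count: Int): List[Int] = {
    val dst = blank(src.length)
    src.copyToArray(dst, offset, count)
    dst.toList
  }

  // Array.copy takes separate read (srcPos) and write (offset) positions.
  def viaArrayCopy(src: Array[Int], srcPos: Int, offset: Int, count: Int): List[Int] = {
    val dst = blank(src.length)
    Array.copy(src, srcPos, dst, offset, count)
    dst.toList
  }

  def main(args: Array[String]): Unit = {
    val src = Array(10, 20, 30, 40, 50)
    println(viaCopyToArray(src, 2, 2))  // reads src(0), src(1)
    println(viaArrayCopy(src, 0, 2, 2)) // same cells as the line above
    println(viaArrayCopy(src, 2, 2, 2)) // reads src(2), src(3): what was assumed
  }
}
```

With a non-zero offset the first two calls fill the destination with the leading elements of the source, while the third copies the elements at the offset itself.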
:) |
This second issue with

Long ago, before the changes of 2/2/2019, the code for the override KBs looked like

    overrideKBs.get.foreach(okb => {
      val reader = loadStreamFromClasspath(okb)
      val overrideMatchers = loadOverrideKB(reader, lexicalVariationEngine, caseInsensitiveMatching, knownCaseInsensitives)
      // Within one override KB, matchers are added alphabetically by label.
      for(name <- overrideMatchers.keySet.toList.sorted) {
        val matcher = overrideMatchers(name)
        logger.info(s"Loaded OVERRIDE matcher for label $name. This matcher contains ${matcher.uniqueStrings.size} unique strings; the size of the first layer is ${matcher.entries.size}.")
        matchers += Tuple2(name, matcher)
      }
      reader.close()
    })

If I understand this correctly, the matchers were ordered by the overrideKBs (files) and then, within each of those, alphabetically by label, which is in the keySet. This may have just been a way to ensure consistency. Afterwards came the regular KBs in the order they were specified. Each NE could appear in several of the matchers, but matcher order determined the winner.

The version right after that, which has been in production until recently, used a single matcher which for each NE included a single number that was the index into an array of labels. There could only be one number. It would have been difficult to calculate the winning number to match previous behavior, and I probably missed seeing the sorting anyway. The label first associated with an NE was used.

In the most recent version, which is causing the problems, I relaxed that first label requirement and allowed overwriting if there was a duplicate in any override KB whose label would have come earlier in the list of regular KBs. That messes up the reach tests. The change might have been to better approximate original behavior, but I didn't realize that the modified behavior of the previous paragraph was already important.

There will be a processors PR momentarily. |
Thanks @kwalcock ! |
In the second version, how were the labels assigned to the same NE sorted? That may be some arbitrary behavior not worth preserving... Also, @enoriega: can you please remove the multiple labels for E3? Maybe keep chemical, since that is backwards compatible? |
They were first come, first served in the second, middle version. However, the queue was first all the overrideKBs in order, then the regular KBs. The overrideKBs refer to labels (names) of the regular KBs, so using an overrideKB one could more or less move an NE from one KB to a different one. Maybe that was meant by override. For the newer version I might have been thinking that the order of the regular KBs should determine the priority, not the line or file in an overrideKB in which something appeared. That might make more sense for new entries. One overrides an otherwise closed KB by adding a new entry and it should have the same precedence as everything else in there. IIRC, I encountered many duplicates, although maybe not ones in the same files like E3. I could make a list. |
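The two label-assignment rules discussed here can be contrasted in a rough sketch. This is only a model of the behavior as described in the thread, not the processors implementation: `KbEntry`, both functions, the entity names, and the label order are hypothetical.

```scala
object LabelPriorityDemo {
  // Hypothetical stand-in for one KB line: a named entity (NE) and its label.
  final case class KbEntry(ne: String, label: String)

  // Adds the entry only if its NE has no label yet.
  private def keepFirst(acc: Map[String, String], e: KbEntry): Map[String, String] =
    if (acc.contains(e.ne)) acc else acc + (e.ne -> e.label)

  // Middle version as described: override entries are queued first, then the
  // regular ones, and the first label seen for an NE is kept.
  def firstComeFirstServed(overrides: Seq[KbEntry], regulars: Seq[KbEntry]): Map[String, String] =
    (overrides ++ regulars).foldLeft(Map.empty[String, String])(keepFirst)

  // 8.3.0 as described: a duplicate in the overrides replaces the earlier label
  // if its own label comes earlier in the (assumed) list of regular KB labels.
  def earlierKbLabelWins(overrides: Seq[KbEntry], regulars: Seq[KbEntry],
      kbLabels: Seq[String]): Map[String, String] = {
    val rank = kbLabels.zipWithIndex.toMap.withDefaultValue(Int.MaxValue)
    val fromOverrides = overrides.foldLeft(Map.empty[String, String]) { (acc, e) =>
      acc.get(e.ne) match {
        case Some(old) if rank(e.label) >= rank(old) => acc // keep existing label
        case _ => acc + (e.ne -> e.label)                   // new NE or better rank
      }
    }
    regulars.foldLeft(fromOverrides)(keepFirst)
  }
}
```

For an override file that lists a hypothetical NE first as Site and later as Family, the first function keeps Site, while the second switches to Family whenever Family precedes Site in the regular KB order.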
Ok, then it is important to preserve this functionality. That order has meaning, and it is what was meant by override. |
If someone were to decide that the overrides were case sensitive, then things like Trp and TRP would not be redundant or conflicting. If anything is all lower case, then it gets involved in the known case insensitives, and removing them gets complicated. Nevertheless, here are the "duplicates": eht-1864* |
Here is also a note to self. If instead of the CombinedLexiconNER, the SeparatedLexiconNER is used, a few of the tests will fail. I think these are tests that were written after the Combined version was put into use, so they depend on the per NE override. Previously the override priority was per file. So, for example, in NER-Grounding-Override.tsv there is now Line 249 something Family Because of line 249, the Separated one will give priority to Family because it saw that label first. The Combined version makes that decision on a per NE basis. Line 333 will cause Site to be used because the later Family will not overwrite it now. It had been allowing this overwrite, but that was at odds with the middle version from above, so that won't be done anymore. |
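The difference between label-level and per-NE priority in this note can be modeled in a few lines. This is a sketch under assumptions and does not reproduce the code of SeparatedLexiconNER or CombinedLexiconNER: `OverrideLine`, `perLabel`, `perNe`, and the sample lines are hypothetical.

```scala
object OverridePriorityDemo {
  // Hypothetical stand-in for one line of an override file.
  final case class OverrideLine(ne: String, label: String)

  // Per-NE priority: the first label seen for each NE wins.
  def perNe(lines: Seq[OverrideLine]): Map[String, String] =
    lines.foldLeft(Map.empty[String, String]) { (acc, l) =>
      if (acc.contains(l.ne)) acc else acc + (l.ne -> l.label)
    }

  // Label-level priority: labels are ranked by their first appearance in the
  // file, and each NE receives the best-ranked label among its lines.
  def perLabel(lines: Seq[OverrideLine]): Map[String, String] = {
    val rank = lines.map(_.label).distinct.zipWithIndex.toMap
    lines.groupBy(_.ne).map { case (ne, ls) =>
      ne -> ls.map(_.label).minBy(label => rank(label))
    }
  }
}
```

If an early line introduces the Family label for some other NE, and a later NE is listed first as Site and then as Family, `perLabel` assigns Family to it while `perNe` assigns Site.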
This PR has been superseded by others, particularly one updating processors to 8.3.3. Some of the individual commits which might be useful will be transferred to a different PR. |