"damage" matched as entity #25
I've prepared an exhaustive list of words that overlap between capitalized protein names and English words. The list isn't filtered for word frequency, so use it with care: "aah" is unlikely to appear in a biomedical text except as a protein name, but that might not be the case for "cat", "ape", "chimp", etc.
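For illustration, an overlap list like this could be built roughly as follows. This is only a sketch under assumptions: `overlapping_names` is a hypothetical helper, and the toy inputs stand in for the real protein and English lexicons, which are not shown in this thread.

```python
def overlapping_names(protein_names, english_words):
    """Return the protein names whose lowercased form is also an English word."""
    english = {w.lower() for w in english_words}
    # Use a set to drop duplicate names, then sort for a stable listing.
    return sorted({name for name in protein_names if name.lower() in english})


# Hypothetical toy inputs, not the actual lexicons.
proteins = ["AAH", "CAT", "APE", "ACT", "Smc1"]
words = ["aah", "cat", "ape", "act", "damage"]

overlap = overlapping_names(proteins, words)
print(overlap)
```

As noted above, nothing here accounts for word frequency, so rare words such as "aah" end up in the result alongside common ones such as "cat".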
Hi @danebell: I would like to add this list to bioresources and use it to filter out false positives in the rule-based NER. But, at this point, it seems over-inclusive. Can you please do another pass over it, and keep just the words that are likely to appear as non-proteins in bio texts? For example, I would remove the first 9 entries (they are likely to be used only as proteins), but keep ACT, ANT, etc.
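A minimal sketch of the kind of stop-list filtering meant here, assuming mentions arrive as `(text, label)` pairs. The function name, the data shape, and the case-insensitive comparison are all assumptions made for illustration; this is not the actual bioresources or NER code.

```python
def filter_mentions(mentions, stop_list):
    """Drop mentions whose surface text appears in the stop list.

    Matching is case-insensitive here, which is an assumption; a real
    filter might prefer case-sensitive matching so that an uppercase
    protein symbol is kept while the lowercase English word is dropped.
    """
    stop = {s.lower() for s in stop_list}
    return [(text, label) for text, label in mentions if text.lower() not in stop]


# Hypothetical NER output for the sentence quoted in this issue.
mentions = [("Smc1", "Protein"), ("ATM", "Protein"), ("damage", "Protein")]
kept = filter_mentions(mentions, ["damage", "act", "ant"])
print(kept)
```

With a stop list containing "damage", the spurious match is dropped while Smc1 and ATM are kept.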
Yes, I can definitely do this, but there are over 1000 entries, so if it's okay, I will wait for at least a week to do so.
Agreed, it can wait. I suspect that less than 10% of this are actually
On Wed, Feb 24, 2016 at 11:03 AM, Dane Bell notifications@github.com
Closed by the addition of the NER stop list in bioresources 1.1.9.
There are a lot of damage-binding proteins and other entities with "damage" in their name, but I don't think the bare word "damage" should be matched as an entity, as it currently is, e.g. in "Two of these sites ( Ser966 and Ser957 in Smc1 ) have been shown to be phosphorylated by the ATM kinase in response to DNA damage."