"damage" matched as entity #25

Closed
herongrove opened this issue Jun 13, 2015 · 5 comments

@herongrove

Many damage-binding proteins and other entities have "damage" in their names, but I don't think the bare word "damage" should be matched as an entity, as it currently is in, e.g., "Two of these sites ( Ser966 and Ser957 in Smc1 ) have been shown to be phosphorylated by the ATM kinase in response to DNA damage."

@herongrove herongrove added the bug label Jun 13, 2015
@dsidi dsidi removed their assignment Jul 2, 2015
@herongrove
Author

I've prepared an exhaustive list of words that overlap between capitalized protein names and English words. It isn't filtered for word frequency, so use it with care: aah is unlikely to appear in a biomedical text except as a protein name, but that might not be the case for cat, ape, chimp, etc.
overlap.txt
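
For readers who want to reproduce this kind of list, here is a minimal sketch of a case-insensitive overlap between protein names and English words. It is not the script used to build overlap.txt; the function name `overlap_with_english` and the toy inputs are assumptions made for illustration only.

```python
def overlap_with_english(protein_names, english_words):
    """Return the sorted, lowercased names that are also English words."""
    english = {w.lower() for w in english_words}
    return sorted({n.lower() for n in protein_names if n.lower() in english})


# Toy inputs (hypothetical; not the contents of overlap.txt).
protein_names = ["AAH", "CAT", "APE", "Smc1", "ATM"]
english_words = ["aah", "cat", "ape", "damage", "chimp"]

overlap = overlap_with_english(protein_names, english_words)
print(overlap)
```

As the comment notes, an overlap like this says nothing about how often each word is used in its ordinary sense in biomedical text, so a frequency or manual pass is still needed before using it as a filter.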

@MihaiSurdeanu
Contributor

Hi @danebell: I would like to add this list to bioresources and use it to filter out false positives in the rule-based NER. At this point, though, it seems over-inclusive. Can you please do another pass over it and keep just the words that are likely to appear as non-proteins in bio texts? For example, I would remove the first 9 entries (they are likely to be used only as proteins), but keep ACT, ANT, etc.
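
The filtering idea can be sketched roughly as follows. This is an illustration only, not the bioresources or rule-based NER implementation: the function `find_entities`, the toy lexicon, and the toy stop list are hypothetical, and the sketch assumes the lexicon contains an entry that matches the word "damage".

```python
import re


def find_entities(text, lexicon, stop_list):
    """Case-insensitive dictionary lookup that skips tokens on the stop list."""
    known = {w.lower() for w in lexicon}
    stop = {w.lower() for w in stop_list}
    mentions = []
    for m in re.finditer(r"[A-Za-z0-9]+", text):
        key = m.group(0).lower()
        if key in known and key not in stop:
            mentions.append((m.group(0), m.start()))
    return mentions


sentence = ("Two of these sites ( Ser966 and Ser957 in Smc1 ) have been shown to be "
            "phosphorylated by the ATM kinase in response to DNA damage.")
lexicon = ["Smc1", "ATM", "damage"]  # toy lexicon (assumption)
stop_list = ["damage"]               # toy stop list (assumption)

unfiltered = [tok for tok, _ in find_entities(sentence, lexicon, [])]
filtered = [tok for tok, _ in find_entities(sentence, lexicon, stop_list)]
print(unfiltered)
print(filtered)
```

With the stop list applied, "damage" is no longer reported while Smc1 and ATM still are, which is why the list should only contain words that commonly occur as non-proteins.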

@herongrove
Author

Yes, I can definitely do this, but there are over 1000 entries, so if it's okay, I will wait for at least a week to do so.

@MihaiSurdeanu
Contributor

Agreed, it can wait. I suspect that fewer than 10% of the entries in this
list are actually relevant (the rest are valid protein names that are
unlikely to be used otherwise in bio texts).

On Wed, Feb 24, 2016 at 11:03 AM, Dane Bell wrote:

> Yes, I can definitely do this, but there are over 1000 entries, so if it's
> okay, I will wait for at least a week to do so.



@herongrove herongrove self-assigned this Apr 4, 2016
@MihaiSurdeanu
Contributor

Closed by the addition of the NER stop list in bioresources 1.1.9.
