How to improve named entity normalization for human proteins? #2
Comments
Hi @dhimmel,
For the gene/protein type, we use the off-the-shelf normalizer GNormPlus, and the human proteins in your examples are entities that GNormPlus could not normalize. If a better gene/protein normalizer is released in the future, we plan to replace the current one with it.
Thanks @mjeensung for the clarification. Feel free to post any leads on better gene/protein normalizers here... I'm happy to help evaluate. Looking at the GNormPlus docs, it does "mention recognition and concept normalization". So are you able to just apply GNormPlus at the concept normalization stage for genes, while using the mention recognition from BERN2? I think the code I'm asking about is: Lines 307 to 401 in 20cef6b
Here's our off-the-shelf gene (and other entity) normalizer that's ready for use: https://github.com/indralab/gilda
@dhimmel, that's correct. For genes, mentions are recognized by the BERN2 NER model (which performs better than GNormPlus at recognition) and then normalized by GNormPlus.
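To make the split between the two stages concrete, here is a minimal sketch of a pipeline where mention recognition and normalization are separate, swappable steps. It is not BERN2's code: `toy_recognizer`, `toy_normalizer`, and `annotate` are hypothetical names, the regex only stands in for a real NER model, and the two-entry lookup table (with an illustrative `NCBIGene:` prefix) only stands in for a real normalizer.

```python
import re
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]


def toy_recognizer(text: str) -> List[Span]:
    """Hypothetical stand-in for an NER model: flags all-caps tokens of 3+ characters."""
    return [m.span() for m in re.finditer(r"\b[A-Z][A-Z0-9]{2,}\b", text)]


def toy_normalizer(mention: str) -> str:
    """Hypothetical stand-in for a normalizer: a tiny lookup table, else 'CUI-less'."""
    table = {"EGFR": "NCBIGene:1956", "TP53": "NCBIGene:7157"}
    return table.get(mention.upper(), "CUI-less")


def annotate(
    text: str,
    recognize: Callable[[str], List[Span]],
    normalize: Callable[[str], str],
) -> List[Dict[str, object]]:
    """Run recognition first, then normalize each recognized mention string."""
    return [
        {"mention": text[start:end], "span": (start, end), "id": normalize(text[start:end])}
        for start, end in recognize(text)
    ]


result = annotate("EGFR activates MAPK1 signaling.", toy_recognizer, toy_normalizer)
for entity in result:
    print(entity)
```

Because `annotate` only depends on the two callables, either stage can be replaced without touching the other, which is the property being discussed here.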
Thank you for recommending this great tool, @cthoyt.
Very excited to see BERN2! Really nice work so far.
I'm looking to map certain mentions of proteins to standard identifiers. Here's a list of these proteins, where each protein is also followed by a direction of activity:
Using the nice web interface, I get:
So overall BERN2 does a good job recognizing the protein mentions. However, we actually already know what the protein text is, and are more interested in normalization. Most of the gene/protein mentions receive "ID: CUI-less". Any advice on how to improve the performance of named entity normalization for human proteins?
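For the case where the protein text is already known, a rough sketch of a dictionary-based fallback might look like the following. Everything here is an assumption for illustration: `TOY_LEXICON`, `make_key`, and `normalize_protein` are hypothetical names, the lexicon has only four entries, and the `NCBIGene:` prefix is just one possible identifier style. A real lexicon would be built from a curated resource.

```python
import re

# Toy lexicon (an assumption for illustration): normalized name -> identifier.
TOY_LEXICON = {
    "tp53": "NCBIGene:7157",
    "p53": "NCBIGene:7157",
    "egfr": "NCBIGene:1956",
    "erbb1": "NCBIGene:1956",
}


def make_key(name: str) -> str:
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def normalize_protein(name: str, lexicon=TOY_LEXICON) -> str:
    """Return the identifier for a known name, or 'CUI-less' when there is no match."""
    return lexicon.get(make_key(name), "CUI-less")


for name in ["TP53", "p-53", "ErbB1", "unknown protein X"]:
    print(name, "->", normalize_protein(name))
```

Folding case and punctuation before lookup catches simple spelling variants, but it will not resolve ambiguous or novel names, so unmatched inputs still come back as "CUI-less".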
I see that the website notes that normalization is done by https://github.com/dmis-lab/BioSyn, so feel free to migrate this issue to that repo if it's best there.