Entity Mention Linker #3388
Conversation
(force-pushed from 2fae0a2 to 3c9d412)
Thanks for this PR @helpmefindaname and @WangXII!
It was a bit hard to put together the code to actually tag a sentence, for several reasons:
- It is not obvious that one first needs to run a biomedical NER tagger
- The entity linker models have a cryptic `entity_tag_type` which needs to be overwritten
- The predictions are just integer codes, and it was not obvious to me which knowledge base they refer to
- If I interpret correctly, the tested model seems to get even simple things wrong
Here is the code used for testing:
```python
from flair.data import Sentence
from flair.datasets import NCBI_GENE_HUMAN_DICTIONARY
from flair.models import EntityMentionLinker
from flair.nn import Classifier

# Example sentence
sentence = Sentence("We observed A2M, CDKN1A and alpha-2-macroglobulin in the specimen.")

# instantiate NER tagger for genes
ner_tagger = Classifier.load("hunflair-gene")

# instantiate gene linker (the entity_label_type needs to be set to "ner")
gene_linker = EntityMentionLinker.load("bio-gene")
gene_linker.entity_label_type = "ner"

# use both taggers to predict
ner_tagger.predict(sentence)
gene_linker.predict(sentence)

# interpret results using dictionary
dictionary = NCBI_GENE_HUMAN_DICTIONARY()

print(sentence)
for entity in sentence.get_labels("gene"):
    print(entity)
    link = dictionary[entity.value]
    print(f" -> linked to: '{link.concept_name}'")
```
In this snippet, even though they are mentioned explicitly, the genes A2M and alpha-2-macroglobulin are linked to seemingly random entries. The "exact match" model performs only marginally better.
@WangXII - am I doing something wrong, or why is the accuracy of the linking in such cases so low?
Some suggestions / questions:
- You could prepare `MultitaskModel`s for each type (gene, disease, etc.) that combine the necessary NER tagger and linker and correctly set the entity_tag_type. This would allow users to easily instantiate a single model that directly works.
- The label_type of all linking models could be "link" - not sure what is gained from having different label_type names for different biomedical NER classes.
- Is there some way of including the dictionary in the linker and preparing convenience functions to interpret the links? (See the sketch after this list.)
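A hypothetical sketch of the first and third suggestions (the GeneLinkingPipeline class and its methods are invented for illustration, not existing flair API): one object bundles the NER tagger, a linker with the label type preset, and the dictionary for human-readable output.

```python
from flair.data import Sentence
from flair.datasets import NCBI_GENE_HUMAN_DICTIONARY
from flair.models import EntityMentionLinker
from flair.nn import Classifier


class GeneLinkingPipeline:
    """Illustrative bundle of NER tagger, entity linker and dictionary (not flair API)."""

    def __init__(self):
        self.tagger = Classifier.load("hunflair-gene")
        self.linker = EntityMentionLinker.load("bio-gene")
        self.linker.entity_label_type = "ner"  # preset so users need not know about it
        self.dictionary = NCBI_GENE_HUMAN_DICTIONARY()

    def predict(self, sentence: Sentence) -> None:
        # run NER first, then link the found mentions
        self.tagger.predict(sentence)
        self.linker.predict(sentence)

    def explain(self, sentence: Sentence):
        # convenience: resolve predicted identifiers to human-readable concept names
        for label in sentence.get_labels("gene"):
            yield label, self.dictionary[label.value].concept_name
```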
Thanks for creating the updated pull request @helpmefindaname, and for pointing out the low accuracy @alanakbik! I've looked at the low accuracies and we indeed had a bug with linking to the correct knowledge base identifiers. This part should be fixed now, and A2M and alpha-2-macroglobulin point to the correct NCBI Gene identifier number 2.
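A quick way to verify the fix, reusing the snippet from further up (assuming the predicted label value is the plain identifier string "2"):

```python
# after re-running the snippet above with the fixed models, both surface forms
# should resolve to NCBI Gene identifier 2 (plain-string id format assumed)
for entity in sentence.get_labels("gene"):
    if entity.data_point.text in ("A2M", "alpha-2-macroglobulin"):
        assert entity.value == "2", (entity.data_point.text, entity.value)
```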
@WangXII thanks for the update! Is there a way I can compute evaluation numbers in Flair, i.e. load a gold dataset, load the model, make predictions and evaluate? Could you post a snippet for this?
@sg-wbi can answer this best. I think we have yet to update the evaluation script to the revised Flair API.
@sg-wbi can you share an evaluation script? I would like to use it to test the models for accuracy before merging.
Unfortunately this is not straightforward. When we developed this, our efforts to integrate it into flair stopped at the model level, because we had a very specific use case. At the current state our evaluation scripts require multiple preprocessing steps. We (@mariosaenger, @WangXII) will get back to you asap with a workable solution (to your suggestions as well).
Tests for accuracy

Here's the script that will give you accuracy results for 3 commonly used datasets.

```python
from collections import defaultdict

from datasets import load_dataset

from flair.models import EntityMentionLinker
from flair.models.entity_mention_linking import BioSynEntityPreprocessor

ENTITY_TYPE_TO_MODEL = {
    "diseases": "dmis-lab/biosyn-sapbert-ncbi-disease",
    "chemical": "dmis-lab/biosyn-sapbert-bc5cdr-chemical",
    "genes": "dmis-lab/biosyn-sapbert-bc2gn",
}
ENTITY_TYPE_TO_DATASET = {"diseases": "ncbi_disease", "chemical": "bc5cdr", "genes": "gnormplus"}
ENTITY_TYPE_TO_DICTIONARY = {"diseases": "ctd-diseases", "chemical": "ctd-chemicals", "genes": "ncbi-gene"}


def main():
    for entity_type, model in ENTITY_TYPE_TO_MODEL.items():
        ds_name = ENTITY_TYPE_TO_DATASET[entity_type]
        dictionary = ENTITY_TYPE_TO_DICTIONARY[entity_type]

        ds = load_dataset(f"bigbio/{ds_name}", f"{ds_name}_bigbio_kb", trust_remote_code=True)
        print(f"Loaded corpus: `{ds_name}`")

        annotations = [a for d in ds["test"] for a in d["entities"]]
        if ds_name == "bc5cdr":
            annotations = [a for a in annotations if a["type"].lower() == entity_type]

        mention_to_uids = defaultdict(list)
        uid_to_link = {}
        for a in annotations:
            # skip mentions without normalization
            if len(a["normalized"]) == 0:
                continue
            if ds_name == "gnormplus":
                # no prefix for NCBI Gene
                uid_to_link[a["id"]] = [n["db_id"] for n in a["normalized"]]
            else:
                uid_to_link[a["id"]] = [":".join((n["db_name"], n["db_id"])) for n in a["normalized"]]
            for t in a["text"]:
                mention_to_uids[t].append(a["id"])

        linker = EntityMentionLinker.build(
            model,
            entity_type,
            dictionary_name_or_path=dictionary,
            hybrid_search=True,
            preprocessor=BioSynEntityPreprocessor(),
            batch_size=1024,
        )

        mentions = sorted(linker.preprocessor.process_mention(m) for m in set(mention_to_uids))
        results = linker.candidate_generator.search(entity_mentions=mentions, top_k=1)

        hits = 0
        total = 0
        for m, r in zip(mentions, results):
            for uid in mention_to_uids[m]:
                y_true = uid_to_link[uid]
                y_pred, _ = r[0]
                total += 1
                if y_pred in y_true:
                    hits += 1

        accuracy = round(hits / total * 100, 2)
        print(f"EVALUATION | MODEL: `{model}`, CORPUS: `{ds_name}`, ACCURACY@1: {accuracy}")


if __name__ == "__main__":
    main()
```

You should get the following results:

```
EVALUATION | MODEL: `dmis-lab/biosyn-sapbert-ncbi-disease`, CORPUS: `ncbi_disease`, ACCURACY@1: 84.3
EVALUATION | MODEL: `dmis-lab/biosyn-sapbert-bc5cdr-chemical`, CORPUS: `bc5cdr`, ACCURACY@1: 93.85
EVALUATION | MODEL: `dmis-lab/biosyn-sapbert-bc2gn`, CORPUS: `gnormplus`, ACCURACY@1: 74.26
```

NOTE: the script reports accuracy on the gold mentions of the dataset (i.e. no NER), since this is how these models are commonly evaluated. Testing that

NOTE: links to fetch models from the huggingface hub need to be updated. Let us know where we should put pre-trained models.

Suggestions
Here we still have to find a nice solution. As you and @mariosaenger discussed last week - we may opt to build a distinct
For this, we adapted the implementation according to the approach done in the
Since this PR is already quite substantial, if that's ok with you, we would provide bundled models which correctly link the new NER and NEN models together when all code changes are done.
I am working on integrating your evaluation script into this PR.

```diff
  ....
- mentions = sorted(linker.preprocessor.process_mention(m) for m in set(mention_to_uids))
- results = linker.candidate_generator.search(entity_mentions=mentions, top_k=1)
+ mentions = sorted(mention_to_uids.keys())
+ preproc_mentions = [linker.preprocessor.process_mention(m) for m in mentions]
+ results = linker.candidate_generator.search(entity_mentions=preproc_mentions, top_k=1)
  hits = 0
  total = 0
  for m, r in zip(mentions, results):
  ....
```

Imo the version before ignored all labels for mentions that were changed by preprocessing (e.g. the hard ones). With the initial script I get the following results:
After the change I get somewhat worse results, with the exception of the genes corpus, which improves a lot:
I also noticed that some labels in the dataset do not exist in the dictionary and therefore cannot be predicted correctly.
Can you confirm this?
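To make the failure mode concrete, a minimal illustration with hypothetical values (not code from the PR):

```python
from collections import defaultdict

# the gold lookup is keyed by the *raw* mention text
mention_to_uids = defaultdict(list)
mention_to_uids["Alpha-2-Macroglobulin"].append("uid1")


def process_mention(m: str) -> str:
    # stand-in for linker.preprocessor.process_mention
    return m.lower().replace("-", " ")


# old approach: preprocess first, then look up gold labels by the processed string
processed = process_mention("Alpha-2-Macroglobulin")  # "alpha 2 macroglobulin"
print(mention_to_uids.get(processed))  # None -> gold labels silently dropped

# new approach: keep raw mentions for the lookup, preprocess only for the search
print(mention_to_uids.get("Alpha-2-Macroglobulin"))  # ["uid1"]
```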
Thanks for taking care of this.
Yes, you are right. In my version, only mentions which do not change after preprocessing were evaluated.
Yes, this is to be expected. Dictionary labels change over time (they become obsolete or are merged), and some of these corpora were created 10 years ago.
Commits:
- …t basic text and Ab3P pre-processing to the new structure; fix bug in Ab3P abbreviation detection
- …m text and (2) entity / concept names from a knowledge base or ontology
- improve name consistency; make code more pythonic; dictionaries always do lazy loading; consistency in dictionary parsing: always yield (cui, name); clean up loading w/ CONSTANTS (easily swap models); allow access to sparse and dense search
- yet better naming; add batched search; fix dictionary loading
- predict only on mentions of given entity type
- fix mypy typing; fix typos; update docstrings; rm faiss from requirements; better naming; allow user to specify annotation layer in predict; allow no mentions
- better naming; unique cache name
- add option to time search; change error to warning if pre-trained model is not hybrid; check if there are mentions to predict
- preprocessing: ensure no empty strings after processing; preprocessing: ensure Ab3P works; generator: separate sparse and dense search; generator: constant with sparse weight for pre-trained models
- normalize entity types: diseases->disease, genes->gene; predict: compatibility with Classifier.load('hunflair').label_type
- fix(preprocessing): rm path from a3bp-preprocessor state
(force-pushed from 3c905d5 to 44c1413)
flair/datasets/entity_linking.py (outdated)

```python
def __init__(
    self,
    candidates: Iterable[EntityCandidate],
    dataset_name: Optional[str] = None,
```
Please add a comment explaining what `dataset_name` is.
Looks good! Just a few requests for more docstrings in some new classes.
Additionally, the evaluate method is quite bare-bones, but I have no good idea how to better reuse existing evaluation code for more informative evaluation. So this issue should not block a merge.
```python
        return InMemoryEntityLinkingDictionary(list(self._idx_to_candidates.values()), self._dataset_name)


class InMemoryEntityLinkingDictionary(EntityLinkingDictionary):
```
Please add a comment explaining this class.
flair/datasets/entity_linking.py (outdated)

```
@@ -1760,3 +2210,398 @@ def __init__(
            banned_sentences=banned_sentences,
            sample_missing_splits=sample_missing_splits,
        )


class BigbioCorpus(Corpus, abc.ABC):
```
Please add a comment explaining this class.
flair/datasets/entity_linking.py (outdated)

```python
        return FlairDatapointDataset(all_sentences)


class BIGBIO_NCBI_DISEASE(BigbioCorpus):
```
Also, for each of the BigbioCorpus subclasses, please add a brief description of the dataset and a link.
flair/datasets/entity_linking.py (outdated)

```python
        yield unified_example


class BIGBIO_BC5CDR_CHEMICAL(BigbioCorpus):
```
Also, for each of the BigbioCorpus subclasses, please add a brief description of the dataset and a link.
flair/datasets/entity_linking.py (outdated)

```python
        yield data


class BIGBIO_GNORMPLUS(BigbioCorpus):
```
Also, for each of the BigbioCorpus subclasses, please add a brief description of the dataset and a link.
```python
log.error("-" * 80)
Path(flair.cache_root / "models" / model_folder).rmdir()  # remove folder again if not valid
raise

model_path = hf_download(model_name)
```
Good change!
```python
    self,
    data_points: Union[List[Sentence], Dataset],
    gold_label_type: str,
    out_path: Optional[Union[str, Path]] = None,
    embedding_storage_mode: str = "none",
    mini_batch_size: int = 32,
    main_evaluation_metric: Tuple[str, str] = ("accuracy", "f1-score"),
    exclude_labels: List[str] = [],
    gold_label_dictionary: Optional[Dictionary] = None,
    return_loss: bool = True,
    k: int = 1,
```
Many of these parameters are unused (like `out_path`, `embedding_storage_mode`, etc.). This is a limitation of the current evaluate signature of the `Model` class.
Some possibilities (not in this PR):
- Have the EntityMentionLinker inherit from Classifier instead of Model and reuse its evaluate function. This would also entail adapting the predict method and determining how many of these parameters make sense for an untrained model. For instance, is there a batch size?
- Implement functionality for as many parameters as possible. For instance, we could implement something for out_path (see the sketch below).
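A rough sketch of the second option (hypothetical, not part of this PR; the helper name and output format are invented):

```python
from pathlib import Path
from typing import List, Optional, Union

from flair.data import Sentence


def write_predictions(
    sentences: List[Sentence], out_path: Optional[Union[str, Path]], label_type: str
) -> None:
    """Dump 'mention <tab> predicted id' lines so users can inspect linking results."""
    if out_path is None:
        return
    with Path(out_path).open("w", encoding="utf-8") as out:
        for sentence in sentences:
            for label in sentence.get_labels(label_type):
                out.write(f"{label.data_point.text}\t{label.value}\n")
```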
@alanakbik I have added the requested docstrings/comments.
@sg-wbi @helpmefindaname @mariosaenger thanks a lot for adding this major new feature to Flair!
via @mariosaenger in #3180
"The main contribution is a entity linking model which uses dense (transformer-based) embeddings and (optionally) sparse character-based representations, for normalizing an entity mention to specific identifiers in a knowledge base / dictionary. To this end, the model embeds the entity mention text and all concept names from the knowledge base and outputs the k best-matching concepts based on embedding similarity."
for each "gene", "disease", "chemical" & "species" I created and uploaded a model to hf,
Those models can be loaded via
EntityMentionLinker.load("bio-{label_type}"
orEntityMentionLinker.load("bio-{label_type}-exact-match"
The first represents the recommended default configuration (note for species it is currently exact match due to lack of alternative, while the latter represents the most simple model to use.I suppose the recommendation of models will change soon, but would recommend to not make this part of this PR, but rather change it afterwards.
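For example (assuming the uploaded models keep these names):

```python
from flair.models import EntityMentionLinker

# recommended default configuration for a given label type
disease_linker = EntityMentionLinker.load("bio-disease")

# simplest configuration: exact string matching against the dictionary
disease_linker_exact = EntityMentionLinker.load("bio-disease-exact-match")
```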