Invalid prediction - no wikipedia entity #89

Closed
schwabmi opened this issue Sep 21, 2022 · 10 comments

@schwabmi

schwabmi commented Sep 21, 2022

Hi, I use the end-to-end entity linking model of GENRE.
Unfortunately, for some predictions I get entity names that do not appear in Wikipedia.

Code:

from genre.hf_model import GENRE
from genre.entity_linking import get_end_to_end_prefix_allowed_tokens_fn_hf as get_prefix_allowed_tokens_fn

model = GENRE.from_pretrained("models/hf_e2e_entity_linking_aidayago").eval()

sentences = ["For some people he's the John Travolta of early 80's art."]

prefix_allowed_tokens_fn = get_prefix_allowed_tokens_fn(model, sentences)


print(model.sample(
    sentences,
    prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
))

Output:

[[{'text': "For some people he's the { John Travolta } [ John Trapolta ] of early 80's art.", 'score': tensor(-0.6125)}, {'text': "For some { people } [ People (magazine) ] he's the { John Travolta } [ John Trapolta ] of early 80's { art } [ Art ].", 'score': tensor(-0.7216)}, {'text': "For some { people } [ People (magazine) ] he's the { John Travolta } [ John Trapolta ] of early 80's art.", 'score': tensor(-0.7357)}, {'text': "For some people he's the { John Travolta } [ John Trapolta ] of early 80's { art } [ Art ].", 'score': tensor(-0.7769)}, {'text': "For some { people } [ People (magazine) ] he's the { John Travolta } [ John Trapolta ] of early 80's { art } [ Visual arts ].", 'score': tensor(-0.8873)}]]

"John Trapolta" does not exist in Wikipedia. If I understood the paper correctly, the model should only output valid Wikipedia entities, right? Can you help me figure out what I did wrong?

Cheers!

@nicola-decao
Contributor

This is because you are not using the constrained search as shown in the example code: https://github.com/facebookresearch/GENRE/tree/main/examples_genre

@schwabmi
Author

Thanks! But I thought that it should always give me a valid entity name, without any constraint on the candidate set.
In the e2e examples, you constrain the candidates via "candidates_trie" with a few candidates, including the name "Einstein". But how can I constrain a sentence when I do not know which entities it contains and hence cannot create a candidate list?

@nicola-decao
Contributor

Then the candidates_trie is a trie with all possible entities in your KB, similar to what is shown for Entity Disambiguation.
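
To illustrate why a trie over the KB keeps the generated names valid, here is a minimal sketch of trie-constrained decoding. It is not GENRE's `Trie` class: `build_trie`, `allowed_next_tokens`, and the toy vocabulary are hypothetical names made up for this example. The idea is that at each decoding step only tokens that continue some known entity name are allowed.

```python
from typing import Dict, List


def build_trie(sequences: List[List[int]]) -> Dict:
    """Build a nested-dict prefix trie from token-id sequences."""
    root: Dict = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root


def allowed_next_tokens(trie: Dict, prefix: List[int]) -> List[int]:
    """Return the token ids that may follow `prefix` (empty if the prefix is not in the trie)."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return sorted(node.keys())


# Toy vocabulary with hypothetical ids (not the real BART vocabulary).
vocab = {"</s>": 0, "John": 1, "Travolta": 2, "Tromp": 3, "Lennon": 4}
entities = [["John", "Travolta"], ["John", "Tromp"], ["John", "Lennon"]]
trie = build_trie([[vocab[t] for t in e] + [vocab["</s>"]] for e in entities])

print(allowed_next_tokens(trie, []))      # only the id of "John" can start a name
print(allowed_next_tokens(trie, [1]))     # ids of "Travolta", "Tromp", "Lennon"
print(allowed_next_tokens(trie, [1, 2]))  # only the end-of-sequence id
```

With such a constraint a misspelled name like "John Trapolta" cannot be produced, because no path in the trie spells it. The model can still pick the wrong valid entity, since the trie only restricts which names are possible, not which one scores best.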

@schwabmi
Author

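Thanks. I tried the following, passing the whole BPE prefix tree as candidates_trie:

import pickle

from genre.trie import Trie

# Load the precomputed KILT title trie shipped for entity disambiguation.
with open("../data/kilt_titles_trie_dict.pkl", "rb") as f:
    trie = Trie.load_from_dict(pickle.load(f))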

prefix_allowed_tokens_fn = get_prefix_allowed_tokens_fn(
    model,
    sentences,
    candidates_trie=trie)

But that does not work; the output does not make sense. This is what you suggested, right? Or is it more complicated, and I have to rewrite the code in entity_linking.py?

@nicola-decao
Contributor

As I show in the example, the trie needs to be formatted as follows:

    candidates_trie=Trie([
        model.encode(" }} [ {} ]".format(e))[1:].tolist()
        for e in ["Albert Einstein", "Nobel Prize in Physics", "NIL"]
    ])

The trie from ../data/kilt_titles_trie_dict.pkl is not formatted like that. You need to generate the trie from a list of valid entity names (i.e., all titles from Wikipedia).
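
To make the required format concrete, the snippet below only evaluates the format string from the example above, without loading any model. It shows that each trie entry is the closing mention marker followed by the bracketed entity name, rather than the bare title. That the `[1:]` slice in the example drops a leading start-of-sequence token from the encoded ids is an assumption here, not something confirmed in this thread.

```python
# Plain-Python look at the strings that get encoded into the candidates trie.
entities = ["Albert Einstein", "Nobel Prize in Physics", "NIL"]

# "}}" is an escaped literal "}", and "{}" is replaced by the entity name.
formatted = [" }} [ {} ]".format(e) for e in entities]

for s in formatted:
    print(repr(s))
```

A trie built from bare titles would not contain the leading " } [ " and trailing " ]" pieces, which is presumably why passing the disambiguation trie directly gives output that does not make sense.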

@schwabmi
Author

Sorry for asking again.
As you suggested in #56, I used the KILT knowledge source to extract all Wikipedia titles.
I saved all titles in the list candidate_list and did what you suggested:

model = GENRE.from_pretrained("models/hf_e2e_entity_linking_aidayago").eval()
sentences = ["For some people he's the John Travolta of early 80's art."]

prefix_allowed_tokens_fn = get_prefix_allowed_tokens_fn(
    model,
    sentences,
    candidates_trie=Trie([
        model.encode(" }} [ {} ]".format(e))[1:].tolist()
        for e in candidate_list]))

print(model.sample(
    sentences,
    prefix_allowed_tokens_fn=prefix_allowed_tokens_fn))

Output

[[{'text': "For some { people } [ People (magazine) ] he's the { John Travolta } [ John Tromp ] of early 80's { art } [ Art ].", 'score': tensor(-1.0483)}, {'text': "For some { people } [ People (magazine) ] he's the { John Travolta } [ John Tromp ] of early 80's art.", 'score': tensor(-1.1079)}, {'text': "For some { people } [ People (magazine) ] he's the { John Travolta } [ John Tromp ] of early 80's { art } [ Visual arts ].", 'score': tensor(-1.2112)}, {'text': "For some { people } [ People (magazine) ] he's the { John Travolta } [ John Tromp ] of early 80's { art } [ Fine art ].", 'score': tensor(-1.2184)}, {'text': "For some { people } [ People (magazine) ] he's the { John Travolta } [ John Tromp ] of early 80's { art } [ Artist ].", 'score': tensor(-1.2463)}]]

The output for John Travolta was not John Travolta, as expected, but John Tromp. If I use the disambiguation model and tag John Travolta, the output is John Travolta. But it should be the same, right?
I checked candidate_list, and John Travolta is in it.

@nicola-decao
Contributor

This looks correct to me. One suggestion: save your trie so you do not need to recompute it every time.

The disambiguation model and the end2end linking model are not the same, so they might give different outputs. The disambiguation model is usually much more precise than the end2end linking model.
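
The suggestion to save the trie can be sketched as follows. This is a self-contained illustration with a plain nested dict standing in for the trie; `build_trie_dict` and the file name are hypothetical. Earlier in this thread a trie is restored with `Trie.load_from_dict(pickle.load(f))`, so pickling the trie's underlying dict is presumably the counterpart, but check the attribute name in genre/trie.py before relying on it.

```python
import os
import pickle
import tempfile
from typing import Dict, List


def build_trie_dict(sequences: List[List[int]]) -> Dict:
    """Build a nested-dict prefix trie from token-id sequences."""
    root: Dict = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root


# Stand-in for the expensive step of encoding every title in the KB.
trie_dict = build_trie_dict([[5, 6, 2], [5, 7, 2]])

# Save once ...
path = os.path.join(tempfile.mkdtemp(), "candidates_trie_dict.pkl")
with open(path, "wb") as f:
    pickle.dump(trie_dict, f)

# ... and reload on later runs instead of re-encoding all titles.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == trie_dict)
```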

@schwabmi
Author

Okay, thank you.
I thought that, after finding a mention in the text, the end2end linking model links it to a Wikipedia entity in the same way as the disambiguation model.

Then it would make sense to first compute all mentions with the e2e model and, instead of using its linked Wikipedia entities, use the disambiguation model to link those mentions, wouldn't it?

@nicola-decao
Contributor

nicola-decao commented Sep 29, 2022

The two models operate in different ways. Please refer to the paper for details.

> Then, it would make sense to first compute all mentions from the e2e model and not use the linked Wikipedia entities but use the disambiguation model to link those mentions, wouldn't it?

Yes, I think so. It is also faster to use something like FLAIR to get the mentions.
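
A rough sketch of the first half of that two-step idea: take an annotated string in the "{ mention } [ Entity ]" format shown in the outputs above, discard the predicted entities, and build one input per mention for the disambiguation model. The helper names are hypothetical, and the [START_ENT] / [END_ENT] markers are the input format I recall from the GENRE README for disambiguation, so verify them against the repository. The actual model call is left out.

```python
import re
from typing import List, Tuple

# Matches "{ mention } [ Entity ]" as produced by the end-to-end model.
ANNOTATION = re.compile(r"\{ (.+?) \} \[ (.+?) \]")


def extract_mentions(annotated: str) -> Tuple[str, List[Tuple[int, int]]]:
    """Return the plain sentence and the character spans of all mentions in it."""
    parts: List[str] = []
    spans: List[Tuple[int, int]] = []
    cursor = 0  # position in the annotated string
    length = 0  # length of the plain text built so far
    for m in ANNOTATION.finditer(annotated):
        before = annotated[cursor:m.start()]
        mention = m.group(1)
        parts += [before, mention]
        length += len(before)
        spans.append((length, length + len(mention)))
        length += len(mention)
        cursor = m.end()
    parts.append(annotated[cursor:])
    return "".join(parts), spans


def mark_mention(plain: str, span: Tuple[int, int]) -> str:
    """Wrap a single mention in the (assumed) disambiguation markers."""
    start, end = span
    return f"{plain[:start]}[START_ENT] {plain[start:end]} [END_ENT]{plain[end:]}"


annotated = ("For some { people } [ People (magazine) ] he's the "
             "{ John Travolta } [ John Tromp ] of early 80's { art } [ Art ].")
plain, spans = extract_mentions(annotated)
inputs = [mark_mention(plain, s) for s in spans]
for line in inputs:
    print(line)
```

Each string in `inputs` could then be passed to the disambiguation model, one mention at a time. If mentions come from a separate tagger such as FLAIR instead, only `mark_mention` would be needed.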

@schwabmi
Author

Thank you for the fast and detailed answers! :)
