
issue in huggingface prefix_allowed_tokens_fn #51

Closed

amy-hyunji opened this issue Aug 8, 2021 · 8 comments

Comments

@amy-hyunji
Hello,

I tried to use constrained beam search with Hugging Face and saw that @nicola-decao has added support for it via prefix_allowed_tokens_fn in the generation code. However, I am occasionally getting a generated token that is not in the constraint.
For example, given the constraint {2: [3, 6, 47], 3: [6], 6: [47], 47: [3]} and [2, 6] as input_ids, I sometimes get a token other than 47, which is the only allowed continuation under this constraint. Is there any way I can solve this, or is there anything I'm missing?

Thanks!

@nicola-decao
Contributor

Hello, can you post a piece of code where you observe the bug?

@amy-hyunji
Author

I'm using T5ForConditionalGeneration as the model.

Code I used to create the dict with constraints:

import copy

possible_dict = {}
for sen in df['output']:
    toks = tokenizer.encode(sen)
    for i in range(len(toks) - 1):
        # record every token that may follow toks[i] in the corpus
        if toks[i] not in possible_dict:
            possible_dict[toks[i]] = [toks[i + 1]]
        elif toks[i + 1] not in possible_dict[toks[i]]:
            possible_dict[toks[i]].append(toks[i + 1])
# 0 is the decoder start token, so allow any observed token after it
all_keys = copy.deepcopy(list(possible_dict.keys()))
possible_dict[0] = all_keys

With this, given a tokenized sentence like [0, 4, 7, 5, 8, 1], possible_dict will look like {0: [4], 4: [7], 7: [5], 5: [8], 8: [1]}.
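The construction above can be checked on that example sequence in isolation (a minimal, self-contained sketch; the toy token list stands in for real tokenizer output):

```python
# Minimal reproduction of the dict-building loop, using a toy token
# sequence instead of tokenizer.encode() output (hypothetical data).
toks = [0, 4, 7, 5, 8, 1]

possible_dict = {}
for i in range(len(toks) - 1):
    # map each token to the list of tokens observed to follow it
    possible_dict.setdefault(toks[i], [])
    if toks[i + 1] not in possible_dict[toks[i]]:
        possible_dict[toks[i]].append(toks[i + 1])

print(possible_dict)  # {0: [4], 4: [7], 7: [5], 5: [8], 8: [1]}
```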

def prefix_allowed_tokens_fn(batch_id, input_ids):
    # allow only tokens that followed the last generated token in the corpus
    return possible_dict[int(input_ids[-1])]

for i, (_input, _output) in enumerate(zip(df['input'], df['output'])):
    toks = tokenizer(_input, return_tensors="pt")
    input_ids = toks['input_ids'].cuda()
    attention_mask = toks['attention_mask'].cuda()
    _predict = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        num_beams=10,
        num_return_sequences=10,
        prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
    )
    _predict = tokenizer.batch_decode(_predict, skip_special_tokens=True)
    _predict = list(set(_predict))

When I run this code, I get the error below:
(screenshot of the traceback: a KeyError raised inside prefix_allowed_tokens_fn)
which occurs because 4 is not in possible_dict's keys.

Moreover, 4 was not in the allowed list for the token generated just before it: when the generated prefix is (..., 53, 4), I can see that 4 is not in possible_dict[53]. Do you have any guess as to why it generates tokens that are not in the constrained list?

Thanks!

@amy-hyunji
Author

@nicola-decao I think I'm getting the issue when num_beams is larger than 1: the model sometimes generates tokens that are not in the pre-constructed constraint dict. I haven't had the same issue with num_beams=1. Is this intended?

Thanks!

@nicola-decao
Contributor

@amy-hyunji you are defining a Trie with one possible output. This is why it fails when num_beams > 1: the model cannot possibly predict any other sequence.

@amy-hyunji
Author

amy-hyunji commented Sep 22, 2021

@nicola-decao Hi, I have one additional question. I tried increasing the beam size for the case where the Trie may have only one possible path, as in the issue above. I noticed that in the _get_from_trie function in genre/trie.py, if the corresponding input_ids is not in the trie_dict, the code returns an empty list (return []). In this case, however, I think paths (lists of tokens) that aren't in the Trie get generated. Is this the expected result, or am I missing something? Is there a way to increase the beam size for a Trie with one path and restrict the generated sentences to all be in the corpus? I'm using the HotpotQA distractor-setting evidence sentences as the corpus and T5 as the model. If more details are needed, I'd be glad to share some by e-mail, since the corpus is quite big!

@nicola-decao
Contributor

genre/trie.py is not responsible for generation: it is just an implementation of a prefix tree. If you are using the Hugging Face T5 and have issues generating, this is not the right repository to ask; please open an issue on the transformers repository instead.

What I believe is happening is that the beam search function selects the top-k tokens to predict, and if all tokens have zero probability (i.e., when the code returns an empty list), it still has to select something, so it selects the first k tokens in the vocabulary.
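The failure mode described above can be illustrated with a toy score-masking function (a sketch in the spirit of a prefix-constrained logits processor, not the actual transformers implementation):

```python
import math

def mask_scores(scores, allowed):
    # Mimic a prefix-constrained logits processor: every token outside
    # the allowed set gets -inf, so it should never be selected.
    return [s if i in allowed else -math.inf for i, s in enumerate(scores)]

scores = [0.1, 0.7, 0.2, 0.5]

# Normal case: one allowed continuation wins regardless of raw score.
masked = mask_scores(scores, allowed={3})
best = max(range(len(masked)), key=masked.__getitem__)
print(best)  # 3

# Empty constraint: every entry ties at -inf, so a top-k over the
# scores just returns the first k vocabulary indices -- arbitrary
# tokens that are not in the constraint at all.
masked = mask_scores(scores, allowed=set())
topk = sorted(range(len(masked)), key=masked.__getitem__, reverse=True)[:2]
print(topk)  # [0, 1]
```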

@bryanzhou008

Hi @amy-hyunji, I am facing the same issue (the model generating a token outside the constraint when num_beams > 1). Were you able to find a workaround for this problem? Any ideas are greatly appreciated, thank you very much!
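One workaround consistent with the explanation earlier in this thread is to never return an empty list from prefix_allowed_tokens_fn (a hedged sketch, not an official fix; the EOS token id value is an assumption for T5, adjust for your tokenizer):

```python
# If the current prefix falls outside the constraint dict, force EOS
# instead of returning [], so the beam terminates rather than drifting
# to arbitrary vocabulary tokens.
EOS_TOKEN_ID = 1  # assumed T5 eos_token_id; check tokenizer.eos_token_id

possible_dict = {2: [3, 6, 47], 3: [6], 6: [47], 47: [3]}

def prefix_allowed_tokens_fn(batch_id, input_ids):
    allowed = possible_dict.get(int(input_ids[-1]), [])
    return allowed if allowed else [EOS_TOKEN_ID]
```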

@nicola-decao
Contributor

@bryanzhou008 There is no problem here. Please read the discussion.
