
issue in huggingface prefix_allowed_tokens_fn #51

Closed

amy-hyunji opened this issue Aug 8, 2021 · 8 comments

Comments

@amy-hyunji
Hello,

I tried to use constrained beam search with Hugging Face and saw that @nicola-decao has added support for it via prefix_allowed_tokens_fn in the generation code. However, I am occasionally getting a generated token that is not in the constraint.
For example, given the constraint {2: [3, 6, 47], 3: [6], 6: [47], 47: [3]} and [2, 6] as input_ids, I sometimes get a token other than 47, which is the only allowed continuation under this constraint. Is there any way I can solve this, or is there anything I'm missing?

Thanks!

@nicola-decao
Contributor

Hello, can you post a piece of code where you observe the bug?

@amy-hyunji
Author

I'm using T5ForConditionalGeneration as the model.

Code I used to create the dict with constraints:

import copy

possible_dict = {}
for sen in df['output']:
    toks = tokenizer.encode(sen)
    for i in range(len(toks) - 1):
        # record every token that may follow toks[i] in the corpus
        if toks[i] not in possible_dict:
            possible_dict[toks[i]] = [toks[i + 1]]
        elif toks[i + 1] not in possible_dict[toks[i]]:
            possible_dict[toks[i]].append(toks[i + 1])
# 0 is the decoder start token, so allow any observed token after it
all_keys = copy.deepcopy(list(possible_dict.keys()))
possible_dict[0] = all_keys

With this, given a tokenized sentence like [0, 4, 7, 5, 8, 1], possible_dict will look like {0: [4], 4: [7], 7: [5], 5: [8], 8: [1]}.
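The construction above can be checked on that example sequence in isolation (a minimal, self-contained sketch; the toy token list stands in for real tokenizer output):

```python
# Minimal reproduction of the dict-building loop, using a toy token
# sequence instead of tokenizer.encode() output (hypothetical data).
toks = [0, 4, 7, 5, 8, 1]

possible_dict = {}
for i in range(len(toks) - 1):
    # map each token to the list of tokens observed to follow it
    possible_dict.setdefault(toks[i], [])
    if toks[i + 1] not in possible_dict[toks[i]]:
        possible_dict[toks[i]].append(toks[i + 1])

print(possible_dict)  # {0: [4], 4: [7], 7: [5], 5: [8], 8: [1]}
```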

def prefix_allowed_tokens_fn(batch_id, input_ids):
    # allow only tokens that followed the last generated token in the corpus
    return possible_dict[int(input_ids[-1])]

for i, (_input, _output) in enumerate(zip(df['input'], df['output'])):
    toks = tokenizer(_input, return_tensors="pt")
    input_ids = toks['input_ids'].cuda()
    attention_mask = toks['attention_mask'].cuda()
    _predict = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        num_beams=10,
        num_return_sequences=10,
        prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
    )
    _predict = tokenizer.batch_decode(_predict, skip_special_tokens=True)
    _predict = list(set(_predict))

When I run this code, I get the error below:
(screenshot of the traceback: a KeyError raised inside prefix_allowed_tokens_fn)
which occurs because 4 is not in possible_dict's keys.

Moreover, 4 was not in the allowed list for the token generated just before it: when the generated prefix is (..., 53, 4), I can see that 4 is not in possible_dict[53]. Do you have any guess as to why it generates tokens that are not in the constrained list?

Thanks!

@amy-hyunji
Author

@nicola-decao I think I'm getting the issue when num_beams is larger than 1: the model sometimes generates tokens that are not in the pre-constructed constraint dict. I haven't had the same issue with num_beams=1. Is this intended?

Thanks!

@nicola-decao
Contributor

@amy-hyunji you are defining a Trie with one possible output. This is why it fails when num_beams > 1: the model cannot possibly predict any other sequence.

@amy-hyunji
Author

amy-hyunji commented Sep 22, 2021

@nicola-decao Hi, I have one additional question. I tried increasing the beam size for the case where the Trie may have only one possible path, as in the issue above. I noticed that in the _get_from_trie function in genre/trie.py, if the corresponding input_ids is not in the trie_dict, the code returns an empty list (return []). In this case, however, I think paths (lists of tokens) that aren't in the Trie get generated. Is this the expected result, or am I missing something? Is there a way to increase the beam size for a Trie with one path and restrict the generated sentences to all be in the corpus? I'm using the HotpotQA distractor-setting evidence sentences as the corpus and T5 as the model. If more details are needed, I'd be glad to share some by e-mail, since the corpus is quite big!

@nicola-decao
Contributor

genre/trie.py is not responsible for generation: it is just an implementation of a prefix tree. If you are using the Hugging Face T5 and have issues generating, this is not the right repository to ask; please open an issue on the transformers repository instead.

What I believe is happening is that the beam search function selects the top-k tokens to predict, and if all tokens have zero probability (i.e., when the code returns an empty list), it still has to select something, so it selects the first k tokens in the vocabulary.
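The failure mode described above can be illustrated with a toy score-masking function (a sketch in the spirit of a prefix-constrained logits processor, not the actual transformers implementation):

```python
import math

def mask_scores(scores, allowed):
    # Mimic a prefix-constrained logits processor: every token outside
    # the allowed set gets -inf, so it should never be selected.
    return [s if i in allowed else -math.inf for i, s in enumerate(scores)]

scores = [0.1, 0.7, 0.2, 0.5]

# Normal case: one allowed continuation wins regardless of raw score.
masked = mask_scores(scores, allowed={3})
best = max(range(len(masked)), key=masked.__getitem__)
print(best)  # 3

# Empty constraint: every entry ties at -inf, so a top-k over the
# scores just returns the first k vocabulary indices -- arbitrary
# tokens that are not in the constraint at all.
masked = mask_scores(scores, allowed=set())
topk = sorted(range(len(masked)), key=masked.__getitem__, reverse=True)[:2]
print(topk)  # [0, 1]
```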

@bryanzhou008

Hi @amy-hyunji, I am facing the same issue (the model generating a token outside the constraint when num_beams > 1). Were you able to find a workaround for this problem? Any ideas are greatly appreciated, thank you very much!
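One workaround consistent with the explanation earlier in this thread is to never return an empty list from prefix_allowed_tokens_fn (a hedged sketch, not an official fix; the EOS token id value is an assumption for T5, adjust for your tokenizer):

```python
# If the current prefix falls outside the constraint dict, force EOS
# instead of returning [], so the beam terminates rather than drifting
# to arbitrary vocabulary tokens.
EOS_TOKEN_ID = 1  # assumed T5 eos_token_id; check tokenizer.eos_token_id

possible_dict = {2: [3, 6, 47], 3: [6], 6: [47], 47: [3]}

def prefix_allowed_tokens_fn(batch_id, input_ids):
    allowed = possible_dict.get(int(input_ids[-1]), [])
    return allowed if allowed else [EOS_TOKEN_ID]
```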

@nicola-decao
Contributor

@bryanzhou008 There is no problem here. Please read the discussion.
