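In language generation, the model produces one word at a time, appends that word to the input, and then predicts the next word.
- At every time step, the model outputs a conditional probability distribution given the words generated so far.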
  
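Let the output vocabulary $\mathcal{Y}$ (which includes the special symbol `<eos>`) have size $\left|\mathcal{Y}\right|$, and let the maximum output length be $T'$. There are $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$ possible output sequences in total. In each of them, the subsequence after the special symbol `<eos>` is discarded.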

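How do we pick the desired sequence out of all candidates and obtain a complete sentence?   
An extra step called **decoding** is needed to combine the model's outputs across time steps so that the resulting sequence has the highest probability.
>- Autoregressive language generation rests on the assumption that the probability of a word sequence factorizes into the product of the per-step conditional probabilities:
$$P(w_{1:T}|w_0) = \prod_{t=1}^TP(w_{t}|w_{1:t-1},w_0)$$
where $w_0$ is the start symbol, and generation stops once `EOS` is produced.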
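
The factorization can be checked numerically. The sketch below is only an illustration: the per-step probabilities are made-up values and `sequence_log_prob` is a hypothetical helper. It sums log-probabilities instead of multiplying raw probabilities, which is the usual way to avoid underflow on long sequences.

```python
import math


def sequence_log_prob(step_probs):
    """Log-probability of a sequence: the sum of the per-step log conditionals."""
    return sum(math.log(p) for p in step_probs)


# Hypothetical values of P(w_t | w_{1:t-1}, w_0) for a 4-step sequence
step_probs = [0.5, 0.4, 0.4, 0.6]

log_p = sequence_log_prob(step_probs)
seq_prob = math.exp(log_p)      # back to a plain probability
product = math.prod(step_probs)  # direct product, for comparison

print(round(seq_prob, 3), round(product, 3))
```

Both numbers agree up to floating-point error, so a decoder can work entirely in log space and only compare sums.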


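# Greedy Search

- At each time step, pick the word with the highest conditional probability:
$$y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$$ 
As the figure below shows, the generated output sequence is `ABC<eos>`, and its conditional probability is $0.5\times0.4\times0.4\times0.6 = 0.048$.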
<img src="../images/greedy-search.svg" width="40%">  
   
      
         
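- Now suppose that at time step 2 we pick `C`, the word with the second-highest probability. The prefix that time step 3 conditions on changes from `“A”“B”` in the figure above to `“A”“C”`, so the conditional probabilities of the words at time step 3 change as well.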
<img src="../images/greedy-search2.svg" width="40%">  
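The output sequence `“A”“C”“B”“<eos>”` now has conditional probability $0.5\times0.3\times0.6\times0.6=0.054$, which is higher than that of the sequence found by greedy search.
>Therefore, **the output sequence found by greedy search is not necessarily the optimal one.**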
   
     
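- Greedy search output quickly starts to repeat itself. Its main drawback is that it misses high-probability words hidden behind a low-probability word.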
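
The comparison between the two paths can be reproduced with a toy model. In the sketch below, the table in `next_token_probs` is an assumption: only the probabilities along the two paths (0.5, 0.4, 0.4, 0.6 and 0.5, 0.3, 0.6, 0.6) come from the text, and the remaining entries are invented so that each row sums to 1.

```python
VOCAB = ["A", "B", "C", "<eos>"]


def next_token_probs(prefix):
    """Toy conditional distribution P(y | prefix); values are illustrative."""
    table = {
        (): [0.5, 0.2, 0.2, 0.1],
        ("A",): [0.1, 0.4, 0.3, 0.2],
        ("A", "B"): [0.2, 0.2, 0.4, 0.2],
        ("A", "C"): [0.1, 0.6, 0.2, 0.1],
        ("A", "B", "C"): [0.0, 0.2, 0.2, 0.6],
        ("A", "C", "B"): [0.2, 0.2, 0.0, 0.6],
    }
    # Any prefix not listed above simply ends the sentence.
    probs = table.get(tuple(prefix), [0.0, 0.0, 0.0, 1.0])
    return dict(zip(VOCAB, probs))


def greedy_decode(max_len=4):
    """Pick the most probable token at every step."""
    seq, prob = [], 1.0
    for _ in range(max_len):
        probs = next_token_probs(seq)
        token = max(probs, key=probs.get)
        prob *= probs[token]
        seq.append(token)
        if token == "<eos>":
            break
    return seq, prob


def path_prob(seq):
    """Probability of a given sequence under the toy model."""
    prob, prefix = 1.0, []
    for token in seq:
        prob *= next_token_probs(prefix)[token]
        prefix.append(token)
    return prob


greedy_seq, greedy_prob = greedy_decode()
alt_prob = path_prob(["A", "C", "B", "<eos>"])
print(greedy_seq, round(greedy_prob, 3), round(alt_prob, 3))
```

The greedy path ends at `A B C <eos>` with probability 0.048, while the path that takes the second-best token at step 2 reaches 0.054.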

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

In [2]:
tokenizer = AutoTokenizer.from_pretrained('../../H/models/huggingface/gpt2')
model = AutoModelForCausalLM.from_pretrained('../../H/models/huggingface/gpt2')
input_context = 'The dog'
input_ids = tokenizer.encode(input_context, return_tensors='pt')
outputs = model.generate(
    input_ids=input_ids,
    max_length=40,
)
print('Generated :\n {}'.format(
    tokenizer.decode(outputs[0], skip_special_tokens=True)))

Generated :
 The dog was found around 7:20 a.m. on the ground. No foul play was found, and the dog was transported to an area hospital that is investigating the incident.



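# Exhaustive Search

Enumerate every possible output sequence and return the one with the highest conditional probability. The computational cost $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$ quickly becomes prohibitive: for example, with $|\mathcal{Y}|=10000$ and $T'=10$ we would have to evaluate $10000^{10} = 10^{40}$ sequences.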
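
On a tiny vocabulary, exhaustive search is still feasible and makes the cost visible. The sketch below assumes a made-up bigram model `P` (each word depends only on the previous one) purely to keep the table small; `truncate_at_eos`, `seq_prob` and `exhaustive_search` are hypothetical helpers.

```python
import itertools

VOCAB = ["A", "B", "<eos>"]
# Hypothetical bigram model P(next | previous); "<bos>" marks the start.
P = {
    "<bos>": {"A": 0.6, "B": 0.3, "<eos>": 0.1},
    "A": {"A": 0.1, "B": 0.5, "<eos>": 0.4},
    "B": {"A": 0.2, "B": 0.1, "<eos>": 0.7},
}


def truncate_at_eos(seq):
    """Discard everything after the first <eos>."""
    if "<eos>" in seq:
        return seq[:seq.index("<eos>") + 1]
    return seq


def seq_prob(seq):
    """Probability of a (truncated) sequence under the bigram model."""
    prob, prev = 1.0, "<bos>"
    for token in seq:
        prob *= P[prev][token]
        prev = token
    return prob


def exhaustive_search(max_len):
    """Score all |VOCAB| ** max_len sequences and keep the best one."""
    best_seq, best_prob, n_evaluated = None, -1.0, 0
    for raw in itertools.product(VOCAB, repeat=max_len):
        n_evaluated += 1
        seq = truncate_at_eos(raw)
        prob = seq_prob(seq)
        if prob > best_prob:
            best_seq, best_prob = seq, prob
    return best_seq, best_prob, n_evaluated


best_seq, best_prob, n_evaluated = exhaustive_search(max_len=3)
print(best_seq, round(best_prob, 2), n_evaluated)
```

With 3 words and length 3 only 27 sequences are scored. The same loop with 10000 words and length 10 would need `10000 ** 10` iterations.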

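# Beam Search
- At each time step, keep the few (`num_beams`) most probable candidates, and at the end choose the hypothesis with the highest overall probability. This amounts to growing a multi-way tree and selecting the path with the highest probability in that tree.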
<img src="../images/beam_search.svg" width="80%">  
$$\frac{1}{L^\alpha} \log P(y_1, \ldots, y_{L}) = \frac{1}{L^\alpha} \sum_{t'=1}^L \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \boldsymbol{c}),$$

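The final candidates are ranked by the length-normalized score above, where $L$ is the length of the candidate sequence and $\alpha$ is typically set to 0.75. The $L^\alpha$ in the denominator penalizes the larger number of log terms that a longer sequence sums in this score. The computational cost of beam search is $\mathcal{O}(k\left|\mathcal{Y}\right|T')$, where $k$ is the beam size (`num_beams`).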
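
A compact beam search over a toy model shows how the length-normalized score changes the final choice. This is a simplified sketch, not the batched implementation that follows: the bigram table `P` is invented, and finished hypotheses are simply moved to a `finished` list as soon as they end in `<eos>`.

```python
import math

# Hypothetical bigram model P(next | previous); "<bos>" marks the start.
P = {
    "<bos>": {"A": 0.6, "B": 0.3, "<eos>": 0.1},
    "A": {"A": 0.1, "B": 0.5, "<eos>": 0.4},
    "B": {"A": 0.2, "B": 0.1, "<eos>": 0.7},
}


def beam_search(k, max_len, alpha=0.75):
    """Return (tokens, sum_logprob) with the best length-normalized score."""
    beams = [((), 0.0)]  # (tokens so far, sum of log-probabilities)
    finished = []
    for _ in range(max_len):
        # Expand every live beam by every possible next token.
        candidates = []
        for tokens, logp in beams:
            prev = tokens[-1] if tokens else "<bos>"
            for token, p in P[prev].items():
                candidates.append((tokens + (token,), logp + math.log(p)))
        # Keep the k best expansions; finished ones leave the beam.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, logp in candidates[:k]:
            if tokens[-1] == "<eos>":
                finished.append((tokens, logp))
            else:
                beams.append((tokens, logp))
        if not beams:
            break
    finished.extend(beams)  # unfinished beams compete as well

    def score(candidate):
        tokens, logp = candidate
        return logp / len(tokens) ** alpha

    return max(finished, key=score)


normalized = beam_search(k=2, max_len=3)
unnormalized = beam_search(k=2, max_len=3, alpha=0.0)
print(normalized[0], unnormalized[0])
```

With `alpha=0.0` the raw log-probability decides and the shorter `A <eos>` wins. With `alpha=0.75` the longer `A B <eos>` is preferred, because its log-probability is divided by a larger $L^\alpha$.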

In [None]:
class BeamHypotheses:
    def __init__(self, num_beams, max_length, length_penalty,
                 early_stopping=False):
        self.max_length = max_length - 1  # ignoring bos_token
        self.length_penalty = length_penalty  # exponent of the length normalization
        self.early_stopping = early_stopping
        self.num_beams = num_beams
        self.beams = []
        self.worst_score = 1e9

    def __len__(self):
        return len(self.beams)

    def add(self, hyp, sum_logprobs):
        score = sum_logprobs / len(hyp)**self.length_penalty
        if len(self) < self.num_beams or score > self.worst_score:
            # Accept the hypothesis: the container is not full yet,
            # or the new score beats the worst stored score
            self.beams.append((score, hyp))
            if len(self) > self.num_beams:
                # The container is over capacity: drop the worst hypothesis
                sorted_scores = sorted([
                    (s, idx) for idx, (s, _) in enumerate(self.beams)
                ])
                del self.beams[sorted_scores[0][1]]
                self.worst_score = sorted_scores[1][0]
            else:
                self.worst_score = min(score, self.worst_score)

    def is_done(self, best_sum_logprobs, cur_len=None):
        """
        Whether generation is finished for this sample.
        best_sum_logprobs is the highest score among the new candidate sequences.
        """

        if len(self) < self.num_beams:
            return False
        elif self.early_stopping:
            # Stop as soon as num_beams finished hypotheses are stored
            return True
        else:
            if cur_len is None:
                cur_len = self.max_length
            cur_score = best_sum_logprobs / cur_len**self.length_penalty
            # Done if even the best new candidate cannot beat the worst stored one
            ret = self.worst_score >= cur_score
            return ret

In [None]:
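def beam_search_generate(context,
                         decoder,
                         vocab_size,
                         batch_size=3,
                         max_length=20,
                         min_length=2,
                         num_beams=2,
                         length_penalty=1.0,
                         early_stopping=False,
                         bos_token_id=101,
                         pad_token_id=0,
                         eos_token_id=102):
    """
    context: vectors produced by the encoder (assumed to be a tensor)
    decoder: decoder object, assumed to expose decode_next_step(context, input_ids)
    vocab_size: size of the output vocabulary
    batch_size: number of samples in each batch
    length_penalty: exponent of the length normalization
    early_stopping: stop a sample once num_beams finished hypotheses exist
    bos_token_id: token id marking the beginning of a sentence
    pad_token_id: token id used for padding
    eos_token_id: token id marking the end of a sentence
    """
    # Create the beam containers, one per sample
    generated_hyps = [
        BeamHypotheses(num_beams,
                       max_length,
                       length_penalty,
                       early_stopping=early_stopping)
        for _ in range(batch_size)
    ]

    # Running score of every beam, batch_size*num_beams in total
    beam_scores = torch.zeros((batch_size, num_beams),
                              dtype=torch.float,
                              device=context.device)
    # All beams start from the same BOS token; mask all but the first one
    # so that step 1 does not yield num_beams identical hypotheses
    beam_scores[:, 1:] = -1e9
    beam_scores = beam_scores.view(-1)

    # Whether each sample has finished generating, batch_size entries
    done = [False for _ in range(batch_size)]

    # For parallelism, generate batch_size*num_beams sequences at once
    # The first step is filled with bos_token automatically
    input_ids = torch.full(
        (batch_size * num_beams, 1),
        bos_token_id,
        dtype=torch.long,
        device=context.device,
    )

    # The current length is 1
    cur_len = 1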

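    while cur_len < max_length:
        # Feed the encoder context and the tokens generated so far
        # into the decoder (step 1 in the figure)
        output = decoder.decode_next_step(context, input_ids)
        # Output shape: (batch*num_beams, cur_len, vocab_size)

        # Take the logits of the last time step and turn them into
        # log-probabilities, i.e. the current conditional distribution
        # (batch*num_beams, vocab_size)
        next_token_logits = output[:, -1, :]
        scores = torch.log_softmax(next_token_logits, dim=-1)

        #########################################################
        # Many tricks for reducing repetition can be applied here #
        #########################################################

        # Sequence scores: these are log-probabilities, so they simply add up.
        # This gives matrix 2 in the figure
        # (batch_size * num_beams, vocab_size)
        next_scores = scores + beam_scores[:, None].expand_as(scores)

        # For speed, reshape the result into the form of matrix 3 in the figure
        next_scores = next_scores.view(
            batch_size,
            num_beams * vocab_size)  # (batch_size, num_beams * vocab_size)

        # Take the best-scoring tokens (black dots in the figure) and their scores
        # sorted=True guarantees that the result is returned in order
        next_scores, next_tokens = torch.topk(next_scores,
                                              2 * num_beams,
                                              dim=1,
                                              largest=True,
                                              sorted=True)

        # Beam list of the whole batch for the next time step
        # Every element of the list is a triple
        # (score, token_id, beam_id)
        next_batch_beam = []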

        # Expand every sample in the batch
        for batch_idx in range(batch_size):

            # Check whether this sample has already finished
            if done[batch_idx]:
                # Finished sentences are simply extended with the pad token
                next_batch_beam.extend([(0, pad_token_id, 0)] *
                                       num_beams)  # pad the batch
                continue

            # Beam list of the current sample for the next time step
            next_sent_beam = []

            # For unfinished samples, find the num_beams best-scoring expansions
            # Note that next_scores and next_tokens are aligned with each other
            # and already sorted by next_scores
            for beam_token_rank, (beam_token_id,
                                  beam_token_score) in enumerate(
                                      zip(next_tokens[batch_idx],
                                          next_scores[batch_idx])):
                # get beam and word IDs
                # See matrix 3 in the figure to understand these two lines
                beam_id = beam_token_id // vocab_size
                token_id = beam_token_id % vocab_size

                effective_beam_id = batch_idx * num_beams + beam_id

                # An EOS token means that a complete sentence has been generated
                if (eos_token_id is
                        not None) and (token_id.item() == eos_token_id):
                    # if beam_token does not belong to top num_beams tokens, it should not be added
                    is_beam_token_worse_than_top_num_beams = beam_token_rank >= num_beams
                    if is_beam_token_worse_than_top_num_beams:
                        continue
                    # Add this sequence to the container
                    generated_hyps[batch_idx].add(
                        input_ids[effective_beam_id].clone(),
                        beam_token_score.item(),
                    )
                else:
                    # add next predicted word if it is not eos_token
                    next_sent_beam.append(
                        (beam_token_score, token_id, effective_beam_id))

                # num_beams expansions are enough
                if len(next_sent_beam) == num_beams:
                    break

            # Check whether this sample is finished; there are two cases
            # 1. The sample has already been marked as finished
            # 2. The new candidates can no longer improve the stored results
            done[batch_idx] = done[batch_idx] or generated_hyps[
                batch_idx].is_done(next_scores[batch_idx].max().item(),
                                   cur_len=cur_len)

            # Append the result of the current sample to the batch result
            next_batch_beam.extend(next_sent_beam)

        # Exit early once every sample has finished
        if all(done):
            break

        # Split the list of triples back into three separate tensors
        beam_scores = beam_scores.new([x[0] for x in next_batch_beam])
        beam_tokens = input_ids.new([x[1] for x in next_batch_beam])
        beam_idx = input_ids.new([x[2] for x in next_batch_beam])

        # Prepare the decoder input for the next time step
        # Select the beams that were actually expanded
        input_ids = input_ids[beam_idx, :]
        # Append the newly generated tokens to these beams
        input_ids = torch.cat([input_ids, beam_tokens.unsqueeze(1)], dim=-1)

        # Update the current length
        cur_len = cur_len + 1
        # end of length while

    # Close the hypotheses that are still open and put them into the containers
    for batch_idx in range(batch_size):
        # Finished samples need no further processing
        if done[batch_idx]:
            continue

        # Add the results to the generated_hyps container
        for beam_id in range(num_beams):
            effective_beam_id = batch_idx * num_beams + beam_id
            final_score = beam_scores[effective_beam_id].item()
            final_tokens = input_ids[effective_beam_id]
            generated_hyps[batch_idx].add(final_tokens, final_score)

    # select the best hypotheses for the final output
    # Number of sentences returned per sample
    output_num_return_sequences_per_batch = 1
    output_batch_size = batch_size * output_num_return_sequences_per_batch
    # Length of every returned sentence, needed for padding below
    sent_lengths = input_ids.new(output_batch_size)
    best = []

    # For every sample, take the best output_num_return_sequences_per_batch sentences
    for i, hypotheses in enumerate(generated_hyps):
        sorted_hyps = sorted(hypotheses.beams, key=lambda x: x[0])
        for j in range(output_num_return_sequences_per_batch):
            effective_batch_idx = output_num_return_sequences_per_batch * i + j
            best_hyp = sorted_hyps.pop()[1]
            sent_lengths[effective_batch_idx] = len(best_hyp)
            best.append(best_hyp)

    # If the lengths differ, pad the sentences so that all results have the same length
    if sent_lengths.min().item() != sent_lengths.max().item():
        sent_max_len = min(sent_lengths.max().item() + 1, max_length)
        # First fill the output matrix with the PAD token
        decoded = input_ids.new(output_batch_size,
                                sent_max_len).fill_(pad_token_id)

        # Fill in the actual content
        for i, hypo in enumerate(best):
            decoded[i, :sent_lengths[i]] = hypo
            # Append the eos token
            if sent_lengths[i] < max_length:
                decoded[i, sent_lengths[i]] = eos_token_id
    else:
        # All returned sequences have the same length, so they can be stacked directly
        decoded = torch.stack(best).type(torch.long).to(context.device)

    # The returned sequences include the BOS token
    return decoded

In [11]:
beam_output = model.generate(input_ids, max_length=50, num_beams=5)
# Decode the beam search result (beam_output), not the earlier greedy `outputs`
print('Generated :\n {}'.format(
    tokenizer.decode(beam_output[0], skip_special_tokens=True)))


In [None]:
beam_output = model.generate(input_ids,
                             max_length=50,
                             num_beams=5,
                             no_repeat_ngram_size=2,
                             early_stopping=True)

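- Beam search works well for tasks such as machine translation or summarization, where the desired output length is more or less predictable.
- In open-ended generation such as dialogue and story generation, however, the desired output length can vary greatly, and beam search may not be the best choice.
- Although beam search is considerably better than greedy search, it still produces bland, repetitive, and self-contradictory text. The paper *The Curious Case of Neural Text Degeneration* [1] argues that the root of the problem is the decoding strategy itself, which tries to maximize the conditional probability of the sequence.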

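# Sampling
- Instead of picking the most probable word, draw the next word at random from the predicted distribution. Generation is then no longer deterministic, and the text can easily become incoherent. In open-domain chit-chat, however, generated texts are usually short, so this problem is naturally less noticeable.
- The shape of the distribution can be controlled through the `temperature` hyperparameter $T$ of the `softmax`. When $T$ is large, the distribution flattens and randomness increases; when $T$ is small, the probability mass concentrates (the strong get stronger) and randomness decreases. Lowering $T$ therefore makes high-probability words more likely to be sampled and the output less random.
$$p_i = \frac{\exp(y_{i}/T)}{\sum_j \exp(y_{j}/T)}$$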
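
The effect of $T$ is easy to see on a small example. The sketch below uses made-up logits and a hypothetical helper `softmax_with_temperature`; subtracting the maximum before exponentiating is only a numerical-stability measure and does not change the result.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T."""
    scaled = [y / T for y in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative values
sharp = softmax_with_temperature(logits, T=0.5)
base = softmax_with_temperature(logits, T=1.0)
flat = softmax_with_temperature(logits, T=2.0)

for name, probs in [("T=0.5", sharp), ("T=1.0", base), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

The probability of the top word grows as $T$ decreases and shrinks as $T$ increases, which matches the description above.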

In [None]:
beam_output = model.generate(input_ids,
                             max_length=50,
                             do_sample=True,
                             top_k=0,
                             temperature=0.7)

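## Top-K Sampling
- Before sampling, truncate the output distribution: keep the `k` most probable words, renormalize their probabilities, and sample from the new distribution.
- The difficulty lies in choosing the hyperparameter `k`, because the shape of the distribution varies a lot: sometimes it is flat, sometimes peaked. A peaked distribution is unproblematic, but when the distribution is flat, a small `k` easily discards many good candidates. If `k` is set too large, on the other hand, the method degenerates into plain sampling.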
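
The truncate, renormalize, and sample steps can be sketched in a few lines. The word probabilities below are invented for illustration, and `top_k_filter` is a hypothetical helper working on a plain dictionary rather than on logits.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}


# Illustrative next-word distribution
probs = {"nice": 0.5, "dog": 0.3, "car": 0.15, "the": 0.05}

filtered = top_k_filter(probs, k=2)
# Draw one word from the truncated distribution
sample = random.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(filtered, sample)
```

Only `nice` and `dog` survive with `k=2`, and their probabilities are rescaled to sum to 1 before sampling.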

In [None]:
beam_output = model.generate(input_ids,
                             max_length=50,
                             do_sample=True,
                             top_k=50)

## Top-p (nucleus) sampling
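- Instead, take words in order of decreasing probability and build the smallest candidate set $V$ such that $$\sum_{x\in V}P(x)>p$$
- Then renormalize the probabilities within the set and sample from it.
- Top-k and Top-p can also be combined.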
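
Unlike a fixed `k`, the size of the nucleus adapts to the shape of the distribution. The sketch below illustrates this with two invented distributions; `top_p_filter` is a hypothetical helper that works on a plain dictionary.

```python
def top_p_filter(probs, p):
    """Smallest set of top words whose cumulative probability exceeds p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for word, prob in ranked:
        nucleus.append((word, prob))
        cumulative += prob
        if cumulative > p:
            break
    # Renormalize inside the nucleus
    return {word: prob / cumulative for word, prob in nucleus}


# Illustrative distributions: one peaked, one flat
peaked = {"nice": 0.9, "dog": 0.05, "car": 0.03, "the": 0.02}
flat = {"nice": 0.25, "dog": 0.25, "car": 0.25, "the": 0.25}

peaked_nucleus = top_p_filter(peaked, p=0.8)
flat_nucleus = top_p_filter(flat, p=0.8)
print(len(peaked_nucleus), len(flat_nucleus))
```

For the same `p`, the peaked distribution keeps a single word while the flat one keeps all four.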

In [None]:
beam_output = model.generate(input_ids,
                             max_length=50,
                             do_sample=True,
                             top_k=0,
                             top_p=0.92)

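## Penalizing Repetition
- Repetition can still occur. Adding an `n_grams` penalty guarantees that no `n_grams` appears twice.
- It must be used with care: with a 2-gram penalty, for example, a text about New York could contain "New York" only once.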
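
The idea behind the n-gram penalty can be sketched without tensors: before each step, find the words that would complete an n-gram already present in the text and forbid them. `banned_next_tokens` below is a hypothetical, simplified helper for a single sequence.

```python
def banned_next_tokens(tokens, n):
    """Words that would complete an n-gram already present in `tokens`."""
    if len(tokens) + 1 < n:
        return []  # too short to contain any n-gram yet
    # The last n-1 words form the prefix of the n-gram about to be completed
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = []
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.append(tokens[i + n - 1])
    return banned


tokens = ["the", "dog", "ate", "the"]
banned = banned_next_tokens(tokens, n=2)
print(banned)
```

Here `dog` is banned because `the dog` has already been generated. A decoder would set the score of every banned word to negative infinity before choosing the next word.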

In [None]:
import torch
import torch.nn.functional as F


def top_k_top_p_filtering(logits,
                          top_k=0,
                          top_p=1.0,
                          filter_value=-float("Inf"),
                          min_tokens_to_keep=1):
    """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
        Args:
            logits: logits distribution shape (batch size, vocabulary size)
            if top_k > 0: keep only top k tokens with highest probability (top-k filtering).
            if top_p < 1.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
                Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
            Make sure we keep at least min_tokens_to_keep per batch example in the output
        From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
    """
    if top_k > 0:
        top_k = min(max(top_k, min_tokens_to_keep),
                    logits.size(-1))  # Safety check
        # Remove all tokens with a probability less than the last token of the top-k
        indices_to_remove = logits < torch.topk(logits,
                                                top_k)[0][..., -1, None]
        logits[indices_to_remove] = filter_value

    if top_p < 1.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1),
                                        dim=-1)

        # Remove tokens with cumulative probability above the threshold (token with 0 are kept)
        sorted_indices_to_remove = cumulative_probs > top_p
        if min_tokens_to_keep > 1:
            # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)
            sorted_indices_to_remove[..., :min_tokens_to_keep] = 0
        # Shift the indices to the right to keep also the first token above the threshold
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[
            ..., :-1].clone()
        sorted_indices_to_remove[..., 0] = 0

        # scatter sorted tensors to original indexing
        indices_to_remove = sorted_indices_to_remove.scatter(
            1, sorted_indices, sorted_indices_to_remove)
        logits[indices_to_remove] = filter_value
    return logits

In [None]:
def enforce_repetition_penalty_(self, lprobs, batch_size, num_beams,
                                prev_output_tokens, repetition_penalty):
    """repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858). """
    for i in range(batch_size * num_beams):
        for previous_token in set(prev_output_tokens[i].tolist()):
            # if score < 0 then the repetition penalty has to be multiplied to reduce the previous token probability
            if lprobs[i, previous_token] < 0:
                lprobs[i, previous_token] *= repetition_penalty
            else:
                lprobs[i, previous_token] /= repetition_penalty

In [None]:
def calc_banned_tokens(prev_input_ids, num_hypos, no_repeat_ngram_size,
                       cur_len):
    """Copied from fairseq for no_repeat_ngram in beam_search."""
    if cur_len + 1 < no_repeat_ngram_size:
        # return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet
        return [[] for _ in range(num_hypos)]
    generated_ngrams = [{} for _ in range(num_hypos)]
    for idx in range(num_hypos):
        gen_tokens = prev_input_ids[idx].numpy().tolist()
        generated_ngram = generated_ngrams[idx]
        # The neat trick: zipping shifted copies of the token list enumerates every n-gram
        for ngram in zip(
                *[gen_tokens[i:] for i in range(no_repeat_ngram_size)]):
            prev_ngram_tuple = tuple(ngram[:-1])
            generated_ngram[prev_ngram_tuple] = generated_ngram.get(
                prev_ngram_tuple, []) + [ngram[-1]]

    def _get_generated_ngrams(hypo_idx):
        # Before decoding the next token, prevent decoding of ngrams that have already appeared
        start_idx = cur_len + 1 - no_repeat_ngram_size
        ngram_idx = tuple(
            prev_input_ids[hypo_idx, start_idx:cur_len].numpy().tolist())
        return generated_ngrams[hypo_idx].get(ngram_idx, [])

    banned_tokens = [
        _get_generated_ngrams(hypo_idx) for hypo_idx in range(num_hypos)
    ]
    return banned_tokens

In [None]:
if do_sample:
    # Sampling branch
    _scores = scores + beam_scores[:, None].expand_as(
        scores)  # (batch_size * num_beams, vocab_size)
    # Top-p/top-k filtering; this step rebuilds the candidate set
    _scores = top_k_top_p_filtering(
        _scores, top_k=top_k, top_p=top_p,
        min_tokens_to_keep=2)  # (batch_size * num_beams, vocab_size)
    # re-organize to group the beam together to sample from all beam_idxs
    _scores = _scores.contiguous().view(
        batch_size,
        num_beams * vocab_size)  # (batch_size, num_beams * vocab_size)

    # Sample 2 next tokens for each beam (so we have some spare tokens and match output of greedy beam search)
    probs = F.softmax(_scores, dim=-1)
    # 采样
    next_tokens = torch.multinomial(probs, num_samples=2 *
                                    num_beams)  # (batch_size, num_beams * 2)
    # Compute next scores
    next_scores = torch.gather(_scores, -1,
                               next_tokens)  # (batch_size, num_beams * 2)
    # sort the sampled vector to make sure that the first num_beams samples are the best
    next_scores, next_scores_indices = torch.sort(next_scores,
                                                  descending=True,
                                                  dim=1)
    next_tokens = torch.gather(
        next_tokens, -1, next_scores_indices)  # (batch_size, num_beams * 2)
else:
    # Plain beam search branch, as described earlier
    # Add the log-probabilities to obtain the sequence score
    next_scores = scores + beam_scores[:, None].expand_as(
        scores)  # (batch_size * num_beams, vocab_size)

    # re-organize to group the beam together (we are keeping top hypothesis across beams)
    next_scores = next_scores.view(
        batch_size,
        num_beams * vocab_size)  # (batch_size, num_beams * vocab_size)

    next_scores, next_tokens = torch.topk(next_scores,
                                          2 * num_beams,
                                          dim=1,
                                          largest=True,
                                          sorted=True)