In [1]:
# For reproducibility
SEED = 34

# Maximum length of the output, in tokens (the prompt counts towards it)
MAX_LEN = 70

In [2]:
input_sequence = "I don't know about you, but there's only one thing I want to do after a long day of work"

In [3]:
# Import the GPT-2 model and tokenizer classes from transformers
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

# Use the GPT-2 large tokenizer and model
local_model_dir = 'D:/Content_Secu/task4/gpt2-large'

# Load the tokenizer and model from the local folder
tokenizer = GPT2Tokenizer.from_pretrained(local_model_dir)
GPT2 = TFGPT2LMHeadModel.from_pretrained(local_model_dir, pad_token_id=tokenizer.eos_token_id)

#tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
#GPT2 = TFGPT2LMHeadModel.from_pretrained("gpt2-medium", pad_token_id=tokenizer.eos_token_id)

#tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
#GPT2 = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# Inspect the model's parameters
GPT2.summary()

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLaye  multiple                 774030080 
 r)                                                              
                                                                 
Total params: 774,030,080
Trainable params: 774,030,080
Non-trainable params: 0
_________________________________________________________________


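# First attempt (greedy search)

**With greedy search, the word with the highest probability is chosen as the next word; that is, at each time step $t$ the next word is set to:**

$$w_t = \arg\max_{w} P(w | w_{1:t-1})$$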
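
**As a minimal sketch of this update rule, the toy example below runs greedy decoding over a hand-written next-word table. The table `TOY_MODEL`, its made-up probabilities, and the helper `greedy_decode` are hypothetical stand-ins for illustration; they are not part of `transformers`.**

```python
# Hypothetical toy "language model": maps the previous word to a
# distribution over next words. The numbers are made up for illustration.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"gym": 0.5, "office": 0.3, "house": 0.2},
    "a": {"gym": 0.9, "house": 0.1},
    "gym": {"the": 0.7, "</s>": 0.3},
    "office": {"</s>": 1.0},
    "house": {"</s>": 1.0},
}


def greedy_decode(model, start="<s>", max_len=6):
    """At every step, append the single most probable next word."""
    words = [start]
    while len(words) < max_len and words[-1] != "</s>":
        next_probs = model[words[-1]]
        words.append(max(next_probs, key=next_probs.get))
    return words


print(greedy_decode(TOY_MODEL))
```

**Because every choice is deterministic, this toy model gets stuck alternating between `the` and `gym` until `max_len` is reached.**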

**Let's see how this simple approach performs:**

In [4]:
# Import TensorFlow and seed its random number generator
import tensorflow as tf
tf.random.set_seed(SEED)

In [5]:
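# Continue our opening sentence with GPT-2, generating at most MAX_LEN tokens
# Encode the context that generation is conditioned on
input_ids = tokenizer.encode(input_sequence, return_tensors='tf')

# Generate text until the output length (including the context) reaches MAX_LEN tokens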
greedy_output = GPT2.generate(input_ids, max_length = MAX_LEN)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens = True))

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work: go to the gym.

I'm not talking about the gym that's right next to my house. I'm talking about the gym that's right next to my office.

I'm not talking about the gym that


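**And with that, we have generated text. The result is not great: we can see that the model starts repeating itself very quickly. The main problem with greedy search is that a high-probability word can be hidden behind a preceding low-probability word, so the model never gets to explore more diverse word combinations. We can mitigate this with beam search:**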

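## Beam search with an n-gram penalty

**Beam search is essentially greedy search, except that at each time step the model tracks and keeps `num_beams` hypotheses, so it can compare alternative paths while generating text. We can also add an n-gram penalty by setting `no_repeat_ngram_size = 2`, which ensures that no 2-gram appears twice. Finally, we set `num_return_sequences = 5` so that we can see what all 5 beams look like.**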
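
**To make the idea concrete, here is a small sketch of beam search over a toy next-word table. `TOY_MODEL`, its made-up probabilities, and the helper `beam_search` are hypothetical and only illustrate the bookkeeping; the real `generate` implementation also handles end-of-sequence tokens, length penalties, and batching.**

```python
import math

# Hypothetical toy model: next-word probabilities that depend only on the
# previous word. The numbers are made up for illustration.
TOY_MODEL = {
    "<s>": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.5, "drives": 0.3, "turns": 0.2},
}


def beam_search(model, start="<s>", num_beams=2, steps=2):
    """Keep the `num_beams` highest-scoring partial sequences at each step."""
    beams = [([start], 0.0)]  # (words, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for words, score in beams:
            for word, prob in model.get(words[-1], {}).items():
                candidates.append((words + [word], score + math.log(prob)))
        if not candidates:
            break
        # Sort by score and keep only the best `num_beams` hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


for words, score in beam_search(TOY_MODEL):
    print(" ".join(words[1:]), round(math.exp(score), 2))
```

**With `num_beams=1` the search degenerates to greedy search and commits to `nice` (0.5) at the first step, ending with `nice woman` (0.5 × 0.4 = 0.2). With two beams it keeps `dog` alive and finds the more probable `dog has` (0.4 × 0.9 = 0.36).**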

**Modify a few arguments of `generate` to use beam search:**

In [6]:
# Pass a few extra arguments to generate() to enable beam search
beam_outputs = GPT2.generate(
    input_ids, 
    max_length = MAX_LEN, 
    num_beams = 5, 
    no_repeat_ngram_size = 2, 
    num_return_sequences = 5, 
    early_stopping = True
)

print('')
print("Output:\n" + 100 * '-')

# Decode and print each of the sequences returned by beam search
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))


Output:
----------------------------------------------------------------------------------------------------
0: I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," you say. "But you're not going to like this one. It's not a good movie. I mean, it's
1: I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," you say. "But you're not going to like this one. It's about a guy who has a crush on a girl
2: I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," you say. "But you're not going to like this one. It's about a guy who has a crush on a woman
3: I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," 

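**That's much better! The 5 beam hypotheses are nearly identical, but if we increased `num_beams` we would see more variation across the beams. Beam search is not perfect, of course. It works well when the length of the generated text is roughly constant, as in translation or summarization, but it is less suited to open-ended tasks such as dialogue or story generation (because it is hard to find a good balance between `num_beams` and `no_repeat_ngram_size`).**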
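
**The n-gram penalty used above can be sketched in plain Python. The helper `banned_next_tokens` is a hypothetical name for illustration, not the `transformers` implementation: given the tokens generated so far, it returns the tokens that would complete an n-gram that has already appeared, so a decoder could set their probability to zero before picking the next token.**

```python
def banned_next_tokens(tokens, ngram_size):
    """Return tokens that would complete an n-gram already seen in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


history = "the gym that is next to the".split()
print(banned_next_tokens(history, ngram_size=2))
```

**Here the history ends in `the`, and the 2-gram `the gym` has already occurred, so `gym` is the only banned continuation.**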

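## Basic sampling

**Now we explore non-deterministic decoding: sampling. Instead of following a fixed path to the final text with the highest probability, we randomly pick the next word according to its conditional probability distribution:**

$$w_t \sim P(w|w_{1:t-1})$$

**However, once we introduce this randomness the generated text tends to become incoherent (see [here](https://arxiv.org/pdf/1904.09751.pdf) for more details). We can therefore add a temperature parameter: with a value below 1, it increases the chance of high-probability words and decreases the chance of low-probability words.**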
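
**The effect of temperature can be checked numerically. The sketch below assumes the usual formulation, in which the logits $z_i$ are divided by a temperature $T$ before the softmax, $P(w_i) = \exp(z_i/T) / \sum_j \exp(z_j/T)$. The helper `softmax_with_temperature` and the example logits are hypothetical and are not taken from `transformers`.**

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate words
for t in (1.0, 0.8, 0.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

**Lowering the temperature below 1 shifts probability mass towards the highest-scoring word; a temperature of exactly 1 leaves the distribution unchanged.**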

**To sample randomly we just set `do_sample = True`, with `top_k = 0` (no Top-K sampling for now) and `temperature = 0.8` as the sampling temperature:**

In [7]:
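# Temperature controls the randomness of the output: the higher it is, the more random the text; the lower it is, the more the model favours high-probability words.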
sample_output = GPT2.generate(
                             input_ids, 
                             do_sample = True, 
                             max_length = MAX_LEN, 
                             top_k = 0, 
                             temperature = 0.8
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True))

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work."

"Hmm. Must be quite the choice of words."

"Well, it's not a choice of words, but a need. I can't find the right answer until I find my answer."

"


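## Top-K sampling

**In Top-K sampling, the k most likely next words are selected and the entire probability mass is redistributed over those words. So instead of increasing the chance of high-probability words, we remove low-probability words altogether.**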
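
**A minimal sketch of the filtering step, in plain Python: the helper `top_k_filter` and the example distribution are hypothetical and only illustrate how the probability mass is redistributed.**

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalise their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}


probs = {"gym": 0.5, "movies": 0.3, "beach": 0.15, "moon": 0.05}  # made-up values
print(top_k_filter(probs, k=2))
```

**With `k=2`, only `gym` and `movies` survive, and their probabilities are rescaled so that they sum to 1 again.**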

**Set the value of `top_k`:**

In [8]:
# Sample only from the k most likely words
sample_output = GPT2.generate(
                             input_ids, 
                             do_sample = True, 
                             max_length = MAX_LEN, 
                             top_k = 50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work. I want to get out of here and go jogging. To go jogging."

"That may be true, but I don't really have much money to spare!"

"That's true too. Why don ...


**Top-K sampling seems to produce more coherent text than our earlier random sampling. But we can do better:**

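## Top-P sampling

**Top-P sampling (also known as nucleus sampling) is similar to Top-K, but instead of choosing the k most likely words, we choose the smallest set of words whose cumulative probability exceeds p, and then redistribute the entire probability mass over the words in that set.**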
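
**The sketch below shows why the candidate set changes size from step to step. The helper `top_p_filter` and both example distributions are hypothetical; the stopping test uses "reaches or exceeds p", which is one common way to write the rule.**

```python
def top_p_filter(probs, p):
    """Keep the smallest set of words whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}


# Made-up distributions: one peaked, one flat.
peaked = {"gym": 0.5, "movies": 0.3, "beach": 0.15, "moon": 0.05}
flat = {"gym": 0.3, "movies": 0.25, "beach": 0.25, "moon": 0.2}

print(len(top_p_filter(peaked, p=0.75)), "words kept for the peaked distribution")
print(len(top_p_filter(flat, p=0.75)), "words kept for the flat distribution")
```

**When the model is confident (peaked distribution) few words are kept; when it is uncertain (flat distribution) the set grows. A fixed `top_k` cannot adapt in this way.**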

**Set `top_k = 0` and choose a value for `top_p`:**

In [9]:
# Sample only from the smallest set of words whose cumulative probability exceeds 0.8
sample_output = GPT2.generate(
                             input_ids, 
                             do_sample = True, 
                             max_length = MAX_LEN, 
                             top_p = 0.8, 
                             top_k = 0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work: try out some dessert! Today I've got a total of four different fruit ice creams from The Baker's Dozen. I'm going to share three of them with you, each with a twist.

One was made ...


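## Top-K and Top-P sampling

**We can use Top-K and Top-P sampling together. This lowers the chance of getting odd (low-probability) words while still allowing the size of the candidate set to vary dynamically. We only need to set a value for each. If we wanted to, we could even add back the temperature parameter from earlier. Let's see how the model does with these parameters. We'll look at the 5 returned sequences to see how much the answers vary:**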

In [10]:
# Combine both sampling techniques
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True, 
                              max_length = 2*MAX_LEN,
                              #temperature = .7,
                              top_k = 50, 
                              top_p = 0.85, 
                              num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: I don't know about you, but there's only one thing I want to do after a long day of work and this is one of it. I have to do something else. It's been quite an exciting couple of weeks at the office, haven't I?

Makes you wonder about the people who didn't get the memo that a long day of work is about to turn into a long day of fun....


1: I don't know about you, but there's only one thing I want to do after a long day of work: watch some movies on my bed!

So, I took a trip to my local mall to check out a new line of "couples" furniture. It's the same type of furniture that I saw on an episode of The Bachelor. It's the kind of furniture that makes me think that if I were to be on a reality TV show, I would be dating one of the characters from that show.

The first thing I noticed about the furniture was that there are no chairs. There is only one bed, a desk, a table, and t

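## Temperature, Top-K and Top-P sampling

**Here we use temperature together with Top-K and Top-P sampling. Top-K and Top-P lower the chance of getting odd (low-probability) words while still allowing the size of the candidate set to vary dynamically, and the temperature reshapes the distribution before sampling. We only need to set a value for each of the three parameters. Let's see how the model does with all of them enabled; we'll again look at the 5 returned sequences to see how much the answers vary:**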
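
**Before running the cell, here is a rough sketch of how the three settings can be composed for a single sampling step. The helper `sample_next_word`, the example scores, and the order of operations (temperature, then Top-K, then Top-P) are assumptions of this sketch, not a description of what `generate` does internally.**

```python
import math
import random


def sample_next_word(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample one word from `logits` (a dict word -> score) after filtering."""
    # 1. Temperature: rescale the scores, then apply a softmax.
    peak = max(logits.values())
    weights = {w: math.exp((s - peak) / temperature) for w, s in logits.items()}
    total = sum(weights.values())
    ranked = sorted(((w, x / total) for w, x in weights.items()),
                    key=lambda item: item[1], reverse=True)
    # 2. Top-K: keep at most `top_k` words (0 disables the filter), renormalise.
    if top_k > 0:
        ranked = ranked[:top_k]
        total = sum(prob for _, prob in ranked)
        ranked = [(w, prob / total) for w, prob in ranked]
    # 3. Top-P: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    words, probs = zip(*kept)
    # random.choices treats the weights as relative, so no further
    # renormalisation is needed here.
    return rng.choices(words, weights=probs, k=1)[0]


scores = {"gym": 3.0, "movies": 2.5, "beach": 1.0, "moon": -2.0}  # made-up logits
rng = random.Random(34)
print([sample_next_word(scores, temperature=0.8, top_k=50, top_p=0.85, rng=rng)
       for _ in range(5)])
```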

In [11]:
# Combine temperature with both sampling techniques
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True, 
                              max_length = 2*MAX_LEN,
                              temperature = 0.8,
                              top_k = 50, 
                              top_p = 0.85, 
                              num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: I don't know about you, but there's only one thing I want to do after a long day of work: drink beer. But, if you're not into that, I've got some great suggestions for you.

1. Try a beer made with a non-alcoholic malt.

For example, this is the first beer I've ever made with a pilsner malt. I wanted a beer that was a bit less bitter than what I was used to. But this beer was delicious.

2. Go with a stout.

I think the most underrated beer in the world. I've been drinking it for years and I've always been impressed by...


1: I don't know about you, but there's only one thing I want to do after a long day of work: I want to go to the movies."

The first person I told about this was the manager of the movie theater I used to frequent. He laughed and said, "You mean you want to go to the movies on a Friday night?"

"Yeah, I do," I said. "I mean, I'll be able to see a movie on 

# Benchmark prompts

**Here we'll see how the GPT-2 model performs when given some more interesting inputs.**

In [12]:
MAX_LEN = 150

## "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."

In [13]:
prompt1 = 'In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.'

input_ids = tokenizer.encode(prompt1, return_tensors='tf')

In [14]:
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True, 
                              max_length = MAX_LEN,
                              temperature = 0.8,
                              top_k = 50, 
                              top_p = 0.85 
                              #num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The discovery came when scientists headed to the area in 2011 to take a look for water. When they didn't find anything, they headed back to the location to get more water samples, which was when the strange creatures were spotted.

The team of researchers, led by Dr. Tom Tarnita of the National Geographic Society, says the animals were not a typical herd of unicorns that could be found in many areas of the world, where they are thought to live. The animals had a white coat,...



## "Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today."

In [21]:
prompt2 = 'Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.'

input_ids = tokenizer.encode(prompt2, return_tensors='tf')

In [22]:
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True, 
                              max_length = MAX_LEN,
                              temperature = 0.8,
                              top_k = 50, 
                              top_p = 0.85
                              #num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today. The singer was caught in the act after a security guard saw her walk out of the store with a bag full of clothes. She was later arrested on suspicion of shoplifting.

The 21-year-old pop star was released on $5,000 bond.

PHOTOS: The '90s' Most Shocking Celebrity Mug Shots

It is unclear whether the incident was a one-time incident or if Cyrus has a history of shoplifting in the past. She was seen walking out of the store with the bag full of clothes.

Cyrus was arrested last month after she was spotted in a Beverly Hills mall with a...



## "Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry."

In [23]:
prompt3 = 'Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.'

input_ids = tokenizer.encode(prompt3, return_tensors='tf')

In [24]:
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True, 
                              max_length = MAX_LEN,
                              temperature = 0.8,
                              top_k = 50, 
                              top_p = 0.85 
                              #num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry. As they passed through the battle, Gimli took a step back, and saw that the orcs had a large group of humans, their heads covered with a red and black cloak. "We will not let them pass," he said. "This is the land of the orcs, and I will not let them pass." He then turned to the humans and spoke: "The people of the north have spoken. They have said that you are not welcome here." They turned to look at him. "Who are you?" they asked. "I am the lord of the north," he said. "The north is mine. The north is my home....



## "For today’s homework assignment, please describe the reasons for the US Civil War."


In [25]:
prompt4 = "For today’s homework assignment, please describe the reasons for the US Civil War."

input_ids = tokenizer.encode(prompt4, return_tensors='tf')

In [26]:
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True, 
                              max_length = MAX_LEN,
                              temperature = 0.8,
                              top_k = 50, 
                              top_p = 0.85 
                              #num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: For today’s homework assignment, please describe the reasons for the US Civil War.

What do you think would have happened if the Civil War never happened?

Share your answer in the comments section below!...

