# 2. NLP Model Applications: Text Generation

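Hugging Face hosts a large model hub. Some of its models are mature, classic models that give reasonably good predictions without any further training, which is commonly called zero-shot learning.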

In [1]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### 1) Download the model

In [2]:
# Download the model
#!HF_ENDPOINT=https://hf-mirror.com hf download openai-community/gpt2 --local-dir ../models/openai-community/gpt2

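### 2) Load the model with pipeline

With the pipeline tool, the caller only has to name the task type. If no model is specified, the pipeline picks a default model for that task and returns predictions directly. If those predictions already meet the caller's needs, no further training is required.

The pipeline API is very concise and hides a large amount of low-level code, so even non-specialists can use it easily.

In [3]:
# Text generation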
from transformers import pipeline
text_generator = pipeline(task="text-generation",
                          model="../models/openai-community/gpt2",
                          device=device)

Device set to use cuda


### 3) Inspect the model configuration

In [4]:
print(text_generator.model.config)

GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.57.1",
  "use_cache": true,
  "vocab_size": 50257
}
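
The fields in this configuration determine the size of the model. As a rough check, the sketch below recomputes the parameter count from the printed values. It assumes the standard GPT-2 layout: input and output embeddings share weights, every linear layer has a bias, and the feed-forward width is 4 × `n_embd` when `n_inner` is null. The helper name `gpt2_param_count` is hypothetical and not part of transformers.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer, n_inner=None):
    """Count parameters of a GPT-2 style decoder (tied output embedding assumed)."""
    # Assumption: a null n_inner means the feed-forward width is 4 * n_embd.
    n_inner = 4 * n_embd if n_inner is None else n_inner
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # One LayerNorm holds a scale vector and a bias vector.
    layer_norm = 2 * n_embd
    # Fused Q/K/V projection plus the attention output projection (with biases).
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Two feed-forward projections (with biases).
    mlp = (n_embd * n_inner + n_inner) + (n_inner * n_embd + n_embd)
    per_layer = 2 * layer_norm + attention + mlp
    # The final LayerNorm comes after the last block.
    return embeddings + n_layer * per_layer + layer_norm


total = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12)
print(f"{total:,}")  # 124,439,808
```

If the model is loaded, the result can be compared against `sum(p.numel() for p in text_generator.model.parameters())`.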



### 4) Predict with the model

In [5]:
# Generate text
start_sentence="Hello, I'm a student, I will"
text_generator(start_sentence)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a student, I will not allow anyone to steal my work.\n\nI was shocked when I read that students were being targeted by the IRS in this case. I feel ashamed and embarrassed to be one of the victims of the IRS. This was not my experience.\n\nI am so sorry for this. I just want to make sure everyone understands that I am only doing what I can.\n\nThank you everyone for your support.\n\nJohn,\n\nThis is a very serious matter. It is my understanding that you are a student. My email address is john.k.chapman@gmail.com.\n\nI did not take this action as a threat. I have been in and out of the U.S. for over 10 years. My parents did not want me to be a threat to my parents. I do not believe that my parents and I can ever be safe from the IRS. I am very concerned.\n\nI will do everything I can to make sure that the law is fair and this case is not one of those cases that is going to go to trial.\n\nIn the meantime, I am going to continue to defend myself and my fam

### 5) Generate several candidate texts

In [6]:
start_sentence="Hello, I'm a student, I will"
results = text_generator(start_sentence, num_return_sequences=5)
for r in results:
    print(r)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'generated_text': 'Hello, I\'m a student, I will be attending you tomorrow, and I\'ll get you to get up for the exam. I\'ll be there for you, too.\n\n"I\'m, uh, you know what, I\'m really sorry. I\'m sorry you didn\'t do a better job than me. I\'m really sorry that I\'m the only one who didn\'t get to see you. I\'ve got a lot of things to say, and I really don\'t want to lose my job. I just want to get up for the exam."\n\n"What?"\n\n"I don\'t wanna lose my job. I want to get up for the exam. That\'s all. I know, I know, I know. I\'m going to be here tomorrow, which means I\'ll be right there with you every minute of the exam, and I\'ll be able to get you to do the test. And I\'m going to be in the office, and I\'m going to be getting up for your exam. So I\'ll be, uh, I\'ll be here, and I\'ll be your teacher, and I\'ll be able to take care of the things you did for me during your time at school. And I\'m not going to be lecturing you. I\'m going to'}
{'generated_text': 'Hello, I\'m a

### 6) Load the model with from_pretrained

See the model card: https://huggingface.co/openai-community/gpt2

In [7]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Load the GPT-2 tokenizer and language model
tokenizer = GPT2Tokenizer.from_pretrained('../models/openai-community/gpt2')
model = GPT2LMHeadModel.from_pretrained('../models/openai-community/gpt2')

# Convert the input text to tokens
start_sentence = "Hello, I'm a student, I will"
encoded_input = tokenizer(start_sentence, return_tensors='pt')

# Run the model once (a single forward pass over the current input)
outputs = model(**encoded_input)
logits = outputs.logits

# Pick the most probable next token; keepdim=True keeps the shape (batch, 1)
next_token_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)

# Append the new token to the original input
generated_sequence = torch.cat([encoded_input['input_ids'], next_token_id], dim=1)

# Decode the tokens back to text
decoded_output = tokenizer.decode(generated_sequence[0], skip_special_tokens=True)

# Print the generated text
print(decoded_output)


Hello, I'm a student, I will be


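#### The code above generates only one token.

To generate longer text, generation can be done manually, one step at a time. At each step, the predicted token is appended to the input sequence, so the sequence grows by one token per step. This produces longer text than a single prediction.

The steps are:

- First, feed the initial text to the model.
- Predict the next token with model(**encoded_input).
- Append the predicted token to the input and use the result as the new input for the next prediction.
- Repeat until the text is long enough.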
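
The steps above can be sketched without loading any model. In the example below, `toy_model` and its score table `TOY_SCORES` are invented stand-ins for a real language model (they look only at the last token), so the loop structure is the point here, not the predictions.

```python
# Toy stand-in for a language model: maps the last token to scores for the
# next token. The table and token names are invented for illustration only.
TOY_SCORES = {
    "I": {"will": 2.0, "am": 1.5},
    "will": {"study": 1.2, "be": 2.5},
    "be": {"studying": 3.0, "<eos>": 0.5},
    "studying": {"<eos>": 4.0, "hard": 1.0},
}


def toy_model(tokens):
    """Return next-token scores for the sequence so far (uses only the last token)."""
    return TOY_SCORES.get(tokens[-1], {"<eos>": 1.0})


def greedy_generate(prompt_tokens, max_new_tokens=10, eos="<eos>"):
    """Extend the prompt one token at a time, always taking the top-scoring token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = toy_model(tokens)                # predict scores for the next token
        next_token = max(scores, key=scores.get)  # greedy choice: highest score
        if next_token == eos:                     # stop at the end marker
            break
        tokens.append(next_token)                 # feed the longer sequence back in
    return tokens


print(greedy_generate(["I"]))  # ['I', 'will', 'be', 'studying']
```

A real model replaces the lookup table with a forward pass over the whole sequence, but the predict, append, repeat loop is the same.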

In [8]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Load the local GPT-2 tokenizer and language model
tokenizer = GPT2Tokenizer.from_pretrained('../models/openai-community/gpt2')
model = GPT2LMHeadModel.from_pretrained('../models/openai-community/gpt2')

# Convert the input text to tokens
start_sentence = "Hello, I'm a student, I will"
encoded_input = tokenizer(start_sentence, return_tensors='pt')

# Maximum number of new tokens to generate
max_length = 100  # adjust as needed

# Generate step by step
generated = encoded_input['input_ids']  # token ids of the initial input

model.eval()
with torch.no_grad():
    for _ in range(max_length):
        # Run the model on the sequence generated so far
        outputs = model(input_ids=generated)
        logits = outputs.logits

        # Take the logits of the last time step and pick the most probable token;
        # keepdim=True keeps the shape (batch, 1)
        next_token_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)

        # Append the new token to the existing sequence
        generated = torch.cat((generated, next_token_id), dim=1)

        # Stop if the end-of-text token `<|endoftext|>` was generated
        if next_token_id.item() == tokenizer.eos_token_id:
            break

# Decode the generated tokens back to text
decoded_output = tokenizer.decode(generated[0], skip_special_tokens=True)

print(decoded_output)


Hello, I'm a student, I will be studying for my degree. I'm not going to be able to go to college. I'm not going to be able to go to college. I'm not going to be able to go to college. I'm not going to be able to go to college. I'm not going to be able to go to college. I'm not going to be able to go to college. I'm not going to be able to go to college. I'm not going to be able to go to


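#### The text generated above contains a lot of repetition.

To avoid repetitive output, use a sampling strategy instead of always choosing the most probable token (argmax). Sampling makes the generated text more diverse.

For example, top-k sampling draws the next token at random from the k most probable tokens, weighted by their renormalized probabilities.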
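
To make the arithmetic visible, here is a minimal sketch of top-k sampling in plain Python on a made-up five-token vocabulary. The function name `top_k_sample` and the values in `toy_logits` are assumptions for illustration; a real model would supply the logits.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring entries of `logits`."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for numerical stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Draw one index according to the renormalized probabilities.
    choice = rng.choices(top, weights=probs, k=1)[0]
    return choice, dict(zip(top, probs))


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # invented scores for a 5-token vocabulary
rng = random.Random(0)                    # fixed seed so the run is repeatable
token, kept = top_k_sample(toy_logits, k=2, rng=rng)
print(token, kept)
```

With `k=2` only tokens 0 and 1 can ever be drawn, and their probabilities are rescaled to sum to 1. With `k=1` the procedure reduces to argmax.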

The following example adds top-k sampling to the code:

In [9]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the local GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('../models/openai-community/gpt2')
model = GPT2LMHeadModel.from_pretrained('../models/openai-community/gpt2')

# Convert the input text to tokens
start_sentence = "Hello, I'm a student, I will"
encoded_input = tokenizer(start_sentence, return_tensors='pt')

# Maximum number of new tokens to generate
max_length = 100
generated = encoded_input['input_ids']

model.eval()
with torch.no_grad():
    for _ in range(max_length):
        # Run the model on the sequence generated so far
        outputs = model(input_ids=generated)
        logits = outputs.logits

        # Take the logits of the last time step
        next_token_logits = logits[:, -1, :]

        # Top-k sampling: keep the k largest logits and renormalize them
        top_k = 50  # value of k
        top_k_logits, top_k_indices = torch.topk(next_token_logits, top_k)
        probs = torch.softmax(top_k_logits, dim=-1)

        # Sample a position within the top-k set, then map it back to a vocabulary id
        next_token_id = torch.multinomial(probs, num_samples=1)
        next_token_id = top_k_indices.gather(dim=-1, index=next_token_id)

        # Append the new token to the existing sequence
        generated = torch.cat((generated, next_token_id), dim=1)

        # Stop if the end-of-text token `<|endoftext|>` was generated
        if next_token_id.item() == tokenizer.eos_token_id:
            break

# Decode the generated tokens back to text
decoded_output = tokenizer.decode(generated[0], skip_special_tokens=True)
print(decoded_output)


Hello, I'm a student, I will stay with you when you decide to leave.

You'll be happy you did; when you decide that you don't mind leaving now, will be happy you did!

The old guard at Equestria Police has been a bad influence on you. Some of his members like it, many of his customers are simply not very smart, others make it to the top jobs in the city.

So you don't mind, and if you do come back, you'll be happy


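## The generated text does not stop automatically at the end of a complete paragraph

To make generation stop once a complete paragraph has been produced, the following strategies can be used:

1. Detect the end-of-text token (<|endoftext|>)
GPT-2 can emit the special end-of-text token (<|endoftext|>) to mark the end of a text. If the model generates this token, generation can stop. This approach depends on the model emitting the token at a suitable point. If it does not, other strategies are needed.

2. Check punctuation
Punctuation in the generated text can indicate whether a paragraph is complete. Common end-of-paragraph signals include the period (.) and the newline character (\n). Once a certain number of periods or newlines has been seen, the paragraph can be treated as complete and generation can stop.

3. Set a length limit
Limiting the number of generated characters or tokens controls the end of the paragraph indirectly. Choose a reasonable limit (for example, 100 to 200 characters or tokens) and stop generating once it is exceeded.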
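
The three strategies can be combined into one check that runs after every generated token. The sketch below is one possible arrangement, not a library API: the helper name `should_stop`, its parameters, and the choice of a blank line (`"\n\n"`) as the paragraph break are all assumptions. The default `eos_token_id=50256` is the value shown in the GPT-2 configuration above.

```python
def should_stop(text, last_token_id, num_new_tokens, *,
                eos_token_id=50256, min_sentences=3,
                paragraph_break="\n\n", max_new_tokens=200):
    """Return the name of the first stopping rule that fires, or None.

    `text` is the decoded text generated so far (hypothetical helper).
    """
    # Strategy 1: the model emitted the end-of-text token.
    if last_token_id == eos_token_id:
        return "eos"
    # Strategy 2: enough periods, or a paragraph break, has appeared.
    if text.count(".") >= min_sentences or paragraph_break in text:
        return "punctuation"
    # Strategy 3: hard cap on the number of generated tokens.
    if num_new_tokens >= max_new_tokens:
        return "length"
    return None


print(should_stop("First. Second. Third.", last_token_id=13, num_new_tokens=6))
# punctuation
```

Counting periods is only a heuristic: abbreviations such as "U.S." also count as sentence ends, so a stricter sentence splitter may be needed for cleaner results.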

Code example: the following shows how to stop generation automatically by detecting periods:

In [10]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the local GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('../models/openai-community/gpt2')
model = GPT2LMHeadModel.from_pretrained('../models/openai-community/gpt2')

# Convert the input text to tokens
start_sentence = "Hello, I'm a student, I will"
encoded_input = tokenizer(start_sentence, return_tensors='pt')

# Maximum number of new tokens to generate
max_length = 200
generated = encoded_input['input_ids']

model.eval()
with torch.no_grad():
    for _ in range(max_length):
        # Run the model on the sequence generated so far
        outputs = model(input_ids=generated)
        logits = outputs.logits

        # Take the logits of the last time step
        next_token_logits = logits[:, -1, :]

        # Generate the next token with top-k sampling
        top_k = 50
        top_k_logits, top_k_indices = torch.topk(next_token_logits, top_k)
        probs = torch.softmax(top_k_logits, dim=-1)
        next_token_id = torch.multinomial(probs, num_samples=1)
        next_token_id = top_k_indices.gather(dim=-1, index=next_token_id)

        # Append the new token to the existing sequence
        generated = torch.cat((generated, next_token_id), dim=1)

        # Decode the tokens generated so far into the current text
        decoded_output = tokenizer.decode(generated[0], skip_special_tokens=True)

        # Stop if the end-of-text token `<|endoftext|>` was generated
        if next_token_id.item() == tokenizer.eos_token_id:
            break

        # Stop once the text contains at least three periods,
        # used here as a rough signal for three complete sentences
        if decoded_output.count('.') >= 3:
            break

# Print the final paragraph
print(decoded_output)


Hello, I'm a student, I will pay the tuition and fees."

After he is satisfied, he will go further and go through the next steps to pay off a loan.

"I need someone to finance the whole loan," he said.


## The code above keeps getting more complex; model.generate can handle generation automatically

In [11]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('../models/openai-community/gpt2')
model = GPT2LMHeadModel.from_pretrained('../models/openai-community/gpt2')

# Convert the input text to tokens
start_sentence = "Hello, I'm a student, I will"
encoded_input = tokenizer(start_sentence, return_tensors='pt')

# Call model.generate
output = model.generate(
    **encoded_input,  # passes input_ids together with attention_mask
    max_new_tokens=200,  # upper bound on the number of generated tokens
    eos_token_id=tokenizer.eos_token_id,  # stop when <|endoftext|> is generated
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token, so reuse eos
    do_sample=True,  # sample instead of taking the argmax
    top_k=50,        # top-k sampling
    top_p=0.95       # top-p (nucleus) sampling
)

# Decode the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

# Print the generated text
print(generated_text)



Hello, I'm a student, I will teach you everything. (A) You are not going to have any time or any training with me on this subject. (B) I'm going to have some personal training with you, you will learn, and you will learn from me. (C) Don't give me advice. (D) The training with you is just my opinion, as far as you are concerned. I cannot provide training, and you are not allowed to talk to me. (E) You are not allowed to do any other exercises with me because, and I know, you are all different types of people. (F) The class schedule is just not perfect, and you are allowed to practice in your own way. I want you to learn from me when you are alone. (G) I'm happy to meet you, but what can I expect from you? (H) What could be better? (I) I am not going to spend money, and you are only paid to sit here and
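
The call above also sets `top_p`, which the earlier hand-written loops did not cover. Top-p (nucleus) sampling keeps the smallest set of tokens whose cumulative probability reaches `top_p` and then samples from that set. The sketch below shows only the filtering step in plain Python. The function name `top_p_filter` and the toy logits are assumptions for illustration, and this is not the transformers implementation.

```python
import math


def top_p_filter(logits, top_p):
    """Return {token_index: renormalized probability} for the nucleus of `logits`."""
    # Softmax over the full vocabulary (subtract the max for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Walk through tokens from most to least probable until the mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize the kept probabilities so they sum to 1.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # invented scores for a 5-token vocabulary
print(sorted(top_p_filter(toy_logits, top_p=0.8)))   # [0, 1]
print(sorted(top_p_filter(toy_logits, top_p=0.95)))  # [0, 1, 2]
```

Unlike top-k, the number of kept tokens adapts to the distribution: a confident prediction keeps few candidates, a flat one keeps many. When both `top_k` and `top_p` are set, both restrictions apply to the candidate set.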
