## AutoTokenizer
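Put simply, `AutoTokenizer` is a factory: `AutoTokenizer.from_pretrained` reads ready-made configuration files (usually `.json`), either downloaded or from a local directory, and instantiates a tokenizer that is ready to use.
The concept of a tokenizer itself is not covered again here.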
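
To make the factory idea concrete, here is a minimal, self-contained sketch of the pattern. It is a toy illustration and not how `transformers` is implemented: `ToyAutoTokenizer`, `WhitespaceTokenizer`, `CharTokenizer`, the registry, and the `tokenizer_class` / `lowercase` keys of the config file are all hypothetical names chosen for this example.

```python
import json
import tempfile
from pathlib import Path


class WhitespaceTokenizer:
    """Toy tokenizer: splits on whitespace."""

    def __init__(self, lowercase=False):
        self.lowercase = lowercase

    def tokenize(self, text):
        if self.lowercase:
            text = text.lower()
        return text.split()


class CharTokenizer:
    """Toy tokenizer: one token per non-space character."""

    def __init__(self, lowercase=False):
        self.lowercase = lowercase

    def tokenize(self, text):
        if self.lowercase:
            text = text.lower()
        return [ch for ch in text if not ch.isspace()]


class ToyAutoTokenizer:
    """Factory: picks a concrete class based on a JSON config (hypothetical layout)."""

    _registry = {
        "WhitespaceTokenizer": WhitespaceTokenizer,
        "CharTokenizer": CharTokenizer,
    }

    @classmethod
    def from_pretrained(cls, directory):
        config_path = Path(directory) / "tokenizer_config.json"
        config = json.loads(config_path.read_text(encoding="utf-8"))
        tokenizer_cls = cls._registry[config["tokenizer_class"]]
        return tokenizer_cls(lowercase=config.get("lowercase", False))


def demo():
    # Write a config file to a temporary directory, then load from it.
    with tempfile.TemporaryDirectory() as tmp:
        config = {"tokenizer_class": "WhitespaceTokenizer", "lowercase": True}
        (Path(tmp) / "tokenizer_config.json").write_text(
            json.dumps(config), encoding="utf-8"
        )
        tokenizer = ToyAutoTokenizer.from_pretrained(tmp)
    return type(tokenizer).__name__, tokenizer.tokenize("Hello Tokenizer World")


print(demo())
```

The caller never names the concrete class; the config file decides which one is built. That is the property that lets the same `from_pretrained` call work across different checkpoints.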

In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
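# Commonly used parameters of from_pretrained:
# pretrained_model_name_or_path (str): name or local path of the tokenizer to load.
# do_lower_case (bool): only applies to some tokenizers (e.g. BERT). If True, text is
#     lowercased before tokenization. Uncased checkpoints default to True, cased ones to False.
# use_fast (bool): whether to use the Rust-backed fast tokenizer. Usually defaults to True
#     and is recommended; fast tokenizers are typically much faster than the pure-Python ones.
# revision (str): a specific version of the model or tokenizer (e.g. a commit hash or branch name).
# token (str): access token for downloading private models that require authentication.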



In [2]:
# Encode the text
text = "The universe is now shuttling across us."
encoded_ipt = tokenizer(text)
print(encoded_ipt)
# sequence_ids() requires a fast tokenizer: None marks special tokens, 0 the first sequence
print(encoded_ipt.sequence_ids())

# Decode the ids back to text
decoded_txt = tokenizer.decode(encoded_ipt["input_ids"])
print(decoded_txt)

# Save and reload the tokenizer
tokenizer.save_pretrained("./my_tokenizer")
# It can later be loaded from the local directory
loaded_tokenizer = AutoTokenizer.from_pretrained("./my_tokenizer")

{'input_ids': [101, 1109, 6271, 1110, 1208, 3210, 13756, 1506, 1366, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, None]
[CLS] The universe is now shuttling across us. [SEP]
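
The output above has nine ids between `[CLS]` (101) and `[SEP]` (102) for eight words and punctuation marks, so one word was split into two subword pieces, and `sequence_ids()` reports `None` for the two special tokens. BERT tokenizers use WordPiece, which splits a word greedily by longest match against the vocabulary. The following is a minimal pure-Python sketch of that idea. The vocabulary, its ids, and the helper names `wordpiece` and `encode` are made up for illustration and do not correspond to `bert-base-cased`; the real tokenizer also normalizes text and splits off punctuation first, which is omitted here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Wrap the pieces in [CLS] ... [SEP] and map them to ids."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    ids = [vocab[t] for t in tokens]
    # Mirror sequence_ids(): None for special tokens, 0 for the first sequence.
    sequence_ids = [None if t in ("[CLS]", "[SEP]") else 0 for t in tokens]
    return tokens, ids, sequence_ids


# Toy vocabulary (hypothetical ids).
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2,
             "the": 3, "shut": 4, "##tling": 5, "bus": 6}

tokens, ids, sequence_ids = encode("the shuttling bus", toy_vocab)
print(tokens)        # ['[CLS]', 'the', 'shut', '##tling', 'bus', '[SEP]']
print(ids)           # [1, 3, 4, 5, 6, 2]
print(sequence_ids)  # [None, 0, 0, 0, 0, None]
```

Decoding reverses the process: pieces starting with `##` are glued to the previous piece, which is why the decoded sentence above reads as ordinary words again.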


In [3]:
import os
from transformers import AutoTokenizer

# --- 1. Create a sample dataset file ---
# The first line is deliberately non-ASCII, which is why the file is written as UTF-8.
sample_texts_content = """这是一个关于自然语言处理的样例句子。
Hello, how are you today? I hope you are having a great time.
The quick brown fox jumps over the lazy dog.
What is the capital of France? Paris.
I love coding in Python.
"""
dataset_file = "nlp_sample_texts.txt"
with open(dataset_file, "w", encoding="utf-8") as f:
    f.write(sample_texts_content)
print(f"Sample dataset '{dataset_file}' written.\n")

# --- 2. Load AutoTokenizer ---
print("--- 2. Load AutoTokenizer ---")
model_name = "bert-base-uncased"  # the uncased BERT checkpoint
# use_fast=True requests the Rust-backed tokenizer. AutoTokenizer returns the
# slow Python tokenizer on its own when no fast version is available, so check
# is_fast rather than catching an exception and retrying.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
kind = "fast" if tokenizer.is_fast else "slow"
print(f"Loaded the {kind} tokenizer for '{model_name}'.")

print(f"Tokenizer type: {type(tokenizer)}\n")

# --- 3. Basic encoding and decoding ---
print("--- 3. Basic encoding and decoding ---")
single_sentence = "Hello, how are you today?"
print(f"Original sentence: '{single_sentence}'")

# Encode
encoded_input = tokenizer(single_sentence)
print(f"Encoding (dict): {encoded_input}")
print(f"input_ids: {encoded_input['input_ids']}")
print(f"attention_mask: {encoded_input['attention_mask']}")
if 'token_type_ids' in encoded_input:
    print(f"token_type_ids: {encoded_input['token_type_ids']}")

# Decode
decoded_text = tokenizer.decode(encoded_input['input_ids'])
print(f"Decoded (with special tokens): '{decoded_text}'")
decoded_text_no_special = tokenizer.decode(encoded_input['input_ids'], skip_special_tokens=True)
print(f"Decoded (special tokens skipped): '{decoded_text_no_special}'\n")

# --- 4. Batch-encoding several sentences ---
print("--- 4. Batch-encoding several sentences ---")
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "What is the capital of France? Paris."
]
print("Original sentences:")
for s in sentences:
    print(f"- '{s}'")

# Encode the batch, pad to the longest sequence, and return PyTorch tensors
# (return_tensors="pt" requires torch to be installed)
batch_encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
print(f"\nBatch encoding (PyTorch tensors):\n{batch_encoded_input}")
print(f"input_ids shape: {batch_encoded_input['input_ids'].shape}")
print(f"attention_mask shape: {batch_encoded_input['attention_mask'].shape}\n")

# Decode the batch
print("Batch decoding:")
for i, ids in enumerate(batch_encoded_input['input_ids']):
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    print(f"Sentence {i+1} decoded: '{decoded}'")
print("\n")

# --- 5. Long text: truncation and padding ---
print("--- 5. Long text: truncation and padding ---")
long_text = "This is a very long sentence that needs to be truncated because it exceeds the maximum length. We will demonstrate how truncation works by setting a small max_length parameter."
print(f"Original long text: '{long_text}'")

# Truncate to 10 tokens (the count includes [CLS] and [SEP])
truncated_encoded = tokenizer(long_text, max_length=10, truncation=True, return_tensors="pt")
print(f"Truncated input_ids: {truncated_encoded['input_ids']}")
print(f"Truncated, decoded: '{tokenizer.decode(truncated_encoded['input_ids'][0], skip_special_tokens=True)}'\n")

# Pad to a fixed length of 20
short_text_for_padding = "This is a short text."
print(f"Original short text: '{short_text_for_padding}'")
padded_encoded = tokenizer(short_text_for_padding, max_length=20, padding='max_length', return_tensors="pt")
print(f"Padded input_ids: {padded_encoded['input_ids']}")
print(f"Padded, decoded: '{tokenizer.decode(padded_encoded['input_ids'][0], skip_special_tokens=False)}' (padding tokens included)\n")

# --- 6. Vocabulary and special tokens ---
print("--- 6. Vocabulary and special tokens ---")
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"CLS token: {tokenizer.cls_token} (ID: {tokenizer.cls_token_id})")
print(f"SEP token: {tokenizer.sep_token} (ID: {tokenizer.sep_token_id})")
print(f"PAD token: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
print(f"UNK token: {tokenizer.unk_token} (ID: {tokenizer.unk_token_id})")
print(f"All special tokens: {tokenizer.all_special_tokens}\n")

# Look up the ID of a specific token
word_id = tokenizer.convert_tokens_to_ids("hello")
print(f"ID of 'hello': {word_id}")

# Look up the token for a specific ID
id_token = tokenizer.convert_ids_to_tokens(100)  # [UNK] in this vocabulary
print(f"Token for ID 100: '{id_token}'\n")

# --- 7. Adding new tokens ---
print("--- 7. Adding new tokens ---")
new_tokens = ["[NEW_TOKEN]", "<DOMAIN_SPECIFIC_WORD>"]
num_added_toks = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added_toks} new tokens: {new_tokens}")
# vocab_size reports only the base vocabulary and does not change after add_tokens;
# len(tokenizer) includes the added tokens.
print(f"New vocabulary size: {len(tokenizer)}")

# Encode a sentence that contains the new tokens
sentence_with_new_token = "This sentence contains [NEW_TOKEN] and <DOMAIN_SPECIFIC_WORD>."
encoded_with_new_token = tokenizer(sentence_with_new_token, return_tensors="pt")
print(f"Encoding of the sentence with new tokens: {encoded_with_new_token['input_ids']}")
print(f"Decoded: '{tokenizer.decode(encoded_with_new_token['input_ids'][0], skip_special_tokens=False)}'\n")

# --- 8. Saving and loading the tokenizer ---
print("--- 8. Saving and loading the tokenizer ---")
save_directory = "./my_bert_tokenizer"
tokenizer.save_pretrained(save_directory)
print(f"Tokenizer saved to: '{save_directory}'")

loaded_tokenizer = AutoTokenizer.from_pretrained(save_directory)
print(f"Type of the tokenizer loaded from disk: {type(loaded_tokenizer)}")
# Compare with len() so that tokens added via add_tokens are counted as well
print(f"Original tokenizer size: {len(tokenizer)}, loaded tokenizer size: {len(loaded_tokenizer)}")

# Clean up the generated files
print("\n--- Cleanup ---")
if os.path.exists(dataset_file):
    os.remove(dataset_file)
    print(f"Deleted sample dataset file: {dataset_file}")

import shutil
if os.path.exists(save_directory):
    shutil.rmtree(save_directory)
    print(f"Deleted saved tokenizer directory: {save_directory}")

Sample dataset 'nlp_sample_texts.txt' written.

--- 2. Load AutoTokenizer ---


Loaded the fast tokenizer for 'bert-base-uncased'.
Tokenizer type: <class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>

--- 3. Basic encoding and decoding ---
Original sentence: 'Hello, how are you today?'
Encoding (dict): {'input_ids': [101, 7592, 1010, 2129, 2024, 2017, 2651, 1029, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
input_ids: [101, 7592, 1010, 2129, 2024, 2017, 2651, 1029, 102]
attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1]
token_type_ids: [0, 0, 0, 0, 0, 0, 0, 0, 0]
Decoded (with special tokens): '[CLS] hello, how are you today? [SEP]'
Decoded (special tokens skipped): 'hello, how are you today?'

--- 4. Batch-encoding several sentences ---
Original sentences:
- 'The quick brown fox jumps over the lazy dog.'
- 'What is the capital of France? Paris.'

Batch encoding (PyTorch tensors):
{'input_ids': tensor([[  101,  1996,  4248,  2829,  4419, 14523,  2058,  1996, 13971,  3899,
          1012,   102],
        [  101,  2054,  2003,  1996,  3007,  1997,  2605,  1029,  3000,  1012,
           102,     0]]), 'token_type_ids': tensor([[0, 0, 0,