## Tokenizing text with the Qwen2 model
- https://huggingface.co/Qwen/Qwen2-7B-Instruct

In [1]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
import torch
import tokenizers

from transformers import AutoConfig
from transformers import Qwen2Tokenizer

### 1. Inspect the model configuration

In [3]:
# The first call downloads config.json from the Hugging Face Hub
config = AutoConfig.from_pretrained("Qwen/Qwen2-7B-Instruct")

In [4]:
config

Qwen2Config {
  "_name_or_path": "Qwen/Qwen2-7B-Instruct",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.2",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}

In [5]:
# Inspect hidden_size
config.hidden_size

3584
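Several fields in the config printed above are related to one another. As a rough sketch that uses only the values shown there (the variable names are illustrative and not part of the transformers API), the per-head dimension and the ratio of query heads to key/value heads can be derived like this:

```python
# Values copied from the Qwen2Config printed above.
hidden_size = 3584
num_attention_heads = 28
num_key_value_heads = 4

# Each attention head works on an equal slice of the hidden vector.
head_dim = hidden_size // num_attention_heads

# With fewer key/value heads than query heads, several query heads
# share one key/value head (grouped-query attention).
queries_per_kv_head = num_attention_heads // num_key_value_heads

print("head_dim:", head_dim)
print("query heads per key/value head:", queries_per_kv_head)
```
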

In [6]:
# vocab_size
config.vocab_size

152064

### 2. Tokenization

In [7]:
tokenizer = Qwen2Tokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [8]:
tokenizer("I love python and transformer.")

{'input_ids': [40, 2948, 10135, 323, 42578, 13], 'attention_mask': [1, 1, 1, 1, 1, 1]}

In [9]:
def tokenizer_info(t):
    print("tokenizer.name_or_path:     \t", t.name_or_path)
    print("tokenizer.vocab_size:       \t", t.vocab_size)
    print("tokenizer.model_max_length: \t", t.model_max_length)
    print("tokenizer.vocab_files_names:\t", t.vocab_files_names)

In [10]:
tokenizer_info(tokenizer)

tokenizer.name_or_path:     	 Qwen/Qwen2-7B-Instruct
tokenizer.vocab_size:       	 151643
tokenizer.model_max_length: 	 131072
tokenizer.vocab_files_names:	 {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt'}
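Note that `tokenizer.vocab_size` (151643) differs from `config.vocab_size` (152064) printed earlier. One plausible reading, which is an assumption and not something this notebook verifies, is that `tokenizer.vocab_size` counts only the base BPE vocabulary, that special tokens take the IDs right after it, and that the model's embedding table is padded to a larger size. The arithmetic below uses only numbers printed in this notebook:

```python
# Numbers copied from the outputs in this notebook.
tokenizer_vocab_size = 151643   # tokenizer.vocab_size
config_vocab_size = 152064      # config.vocab_size
bos_token_id = 151643           # config.bos_token_id
eos_token_id = 151645           # config.eos_token_id

# Base vocabulary IDs run from 0 to tokenizer_vocab_size - 1,
# so both special-token IDs fall outside of that range.
print(bos_token_id >= tokenizer_vocab_size, eos_token_id >= tokenizer_vocab_size)

# They still fit inside an embedding table with config_vocab_size rows.
print(eos_token_id < config_vocab_size)

# Rows beyond the base vocabulary (special tokens plus, presumably, padding).
extra_rows = config_vocab_size - tokenizer_vocab_size
print("extra rows:", extra_rows)
print("config_vocab_size % 128:", config_vocab_size % 128)
```

To count the special tokens as well, `len(tokenizer)` is usually the value to compare against, since it includes added tokens.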


Next, test tokenization on Chinese text.

In [11]:
# Sample sentence: "Here is a piece of Chinese text, to see how qwen2 tokenizes it."
text_zh = "这里是一段中文信息，看qwen2是怎么分词的。"
outputs = tokenizer(text_zh)
outputs

{'input_ids': [99817, 99639, 37474, 104811, 27369, 3837, 50930, 80, 16948, 17, 107343, 17177, 99689, 9370, 1773], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [12]:
text_zh_tokens = tokenizer.convert_ids_to_tokens(outputs["input_ids"])
print(text_zh_tokens)
print(len(text_zh), len(text_zh_tokens))

['è¿ĻéĩĮ', 'æĺ¯ä¸Ģ', 'æ®µ', 'ä¸Ńæĸĩ', 'ä¿¡æģ¯', 'ï¼Į', 'çľĭ', 'q', 'wen', '2', 'æĺ¯æĢİä¹Ī', 'åĪĨ', 'è¯į', 'çļĦ', 'ãĢĤ']
23 15


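> The Chinese string is 23 characters long, and tokenization produces 15 tokens. The Chinese tokens are not human-readable because the tokenizer works on UTF-8 bytes and displays each byte as a printable stand-in character. That is fine: the model only needs the token IDs.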
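The unreadable token strings can be mapped back to text by hand. The sketch below assumes that Qwen2 uses the same byte-to-character table as GPT-2's byte-level BPE. The helper names (`bytes_to_unicode`, `symbols_to_text`, `text_to_symbols`) are written here for illustration and are not imported from transformers.

```python
def bytes_to_unicode():
    """Build a GPT-2 style table mapping each byte to a printable character."""
    # Bytes that are already printable keep their own code point:
    # "!".."~" (0x21-0x7E), 0xA1-0xAC and 0xAE-0xFF.
    keep = (
        list(range(0x21, 0x7E + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    table = {b: chr(b) for b in keep}
    # Every other byte is shifted to an unused code point starting at 256.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def symbols_to_text(symbols):
    """Turn byte-level token symbols back into readable text."""
    raw = bytes(CHAR_TO_BYTE[c] for c in symbols)
    # A single token may end in the middle of a UTF-8 character,
    # so undecodable bytes are replaced instead of raising an error.
    return raw.decode("utf-8", errors="replace")


def text_to_symbols(text):
    """Show how text looks once its UTF-8 bytes are mapped to symbols."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


# Token strings copied from the convert_ids_to_tokens output above.
tokens = ['è¿ĻéĩĮ', 'æĺ¯ä¸Ģ', 'æ®µ', 'ä¸Ńæĸĩ', 'ä¿¡æģ¯', 'ï¼Į', 'çľĭ', 'q',
          'wen', '2', 'æĺ¯æĢİä¹Ī', 'åĪĨ', 'è¯į', 'çļĦ', 'ãĢĤ']

print([symbols_to_text(t) for t in tokens])   # one readable piece per token
print(symbols_to_text("".join(tokens)))       # the tokens joined back together
```

In practice `tokenizer.decode(outputs["input_ids"])` does this conversion for you. The manual version is only meant to show why one Chinese character (three UTF-8 bytes) appears as three odd-looking symbols, and why 23 characters can still compress into 15 tokens.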