## Tokenization

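Three parts of NLP matter most: tokenization (`tokenization`), embedding (`embedding`), and the network architecture (`architecture`).
- The tokenizer, the embedding, and the model can each be viewed as an independent function that takes some input and produces some output.
- Each function passes its output to the next stage of the pipeline.
- The tokenized text goes to the embedding layer, and the embedding layer's output goes to the model.
- While learning, we can treat these functions as black boxes and focus on one stage at a time.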
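
The pipeline idea above can be sketched as three plain functions chained together. This is a toy illustration only: `toy_tokenize`, `toy_embed`, and `toy_model` are hypothetical names, and the "embedding" and "model" here are simple placeholders rather than anything learned.

```python
# Toy pipeline: tokenizer -> embedding -> model.
# All three functions are hypothetical stand-ins, not real library APIs.

def toy_tokenize(text):
    """Split a string into characters and map each one to an integer id."""
    vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
    return [vocab[ch] for ch in text], vocab

def toy_embed(ids, vocab_size):
    """Map each id to a vector; here simply a one-hot list."""
    return [[1 if j == i else 0 for j in range(vocab_size)] for i in ids]

def toy_model(vectors):
    """Placeholder 'model': column-wise sum, i.e. a count per character."""
    return [sum(col) for col in zip(*vectors)]

ids, vocab = toy_tokenize("abca")
vectors = toy_embed(ids, len(vocab))
output = toy_model(vectors)
print(ids)     # [0, 1, 2, 0]
print(output)  # [2, 1, 1]
```

Each stage only needs to know the shape of what the previous stage hands over, which is why the stages can be studied one at a time.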

A tokenizer is a program that converts a string into a sequence of tokens (`Token`).

**The tokenizer is how the model reads text; the embedding is how the model understands it.**

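`tokenization`: the step of breaking a string into `token`s, the atomic units that the model works with.   
There are many tokenization strategies, and the best one for a given case usually has to be learned from the corpus.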

Next, we look at two extreme kinds of tokenizer:
- Character tokenization
- Word tokenization

In [1]:
import torch
import torch.nn.functional as F
import numpy as np

### 1. Character tokenization

> Split the string at the character level.

In [2]:
# Prepare a sample text
text = """
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.

Features described in this documentation are classified by release status:

Stable: These features will be maintained long-term and there should generally be no major performance limitations or gaps in documentation. We also expect to maintain backwards compatibility (although breaking changes can happen and notice will be given one release ahead of time).

Beta: These features are tagged as Beta because the API may change based on user feedback, because the performance needs to improve, or because coverage across operators is not yet complete. For Beta features, we are committing to seeing the feature through to the Stable classification. We are not, however, committing to backwards compatibility.

Prototype: These features are typically not available as part of binary distributions like PyPI or Conda, except sometimes behind run-time flags, and are at an early stage for feedback and testing.
"""

First, replace the newline characters with spaces:

In [3]:
text = text.replace("\n", " ")

In [4]:
text

' PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.  Features described in this documentation are classified by release status:  Stable: These features will be maintained long-term and there should generally be no major performance limitations or gaps in documentation. We also expect to maintain backwards compatibility (although breaking changes can happen and notice will be given one release ahead of time).  Beta: These features are tagged as Beta because the API may change based on user feedback, because the performance needs to improve, or because coverage across operators is not yet complete. For Beta features, we are committing to seeing the feature through to the Stable classification. We are not, however, committing to backwards compatibility.  Prototype: These features are typically not available as part of binary distributions like PyPI or Conda, except sometimes behind run-time flags, and are at an early stage for feedback and testing. '

In [5]:
# Use list() to split the string into a list of single characters
tokenized_text_list = list(text)
len(tokenized_text_list)

987

In [6]:
len(text)

987

In [7]:
# Collect the set of all distinct characters and sort it
tokenized_tokens_set = sorted(set(tokenized_text_list))
len(tokenized_tokens_set)

43

> The corpus contains 43 distinct characters. We number them from first to last as 0, 1, ..., 42.

In [8]:
print(tokenized_tokens_set)

[' ', '(', ')', ',', '-', '.', ':', 'A', 'B', 'C', 'F', 'G', 'I', 'P', 'S', 'T', 'U', 'W', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [9]:
# Mapping from token to integer
token2index = {
    c: index
    for index, c in enumerate(tokenized_tokens_set)
}
print(token2index)
print("\nVocabulary size:", len(token2index))

{' ': 0, '(': 1, ')': 2, ',': 3, '-': 4, '.': 5, ':': 6, 'A': 7, 'B': 8, 'C': 9, 'F': 10, 'G': 11, 'I': 12, 'P': 13, 'S': 14, 'T': 15, 'U': 16, 'W': 17, 'a': 18, 'b': 19, 'c': 20, 'd': 21, 'e': 22, 'f': 23, 'g': 24, 'h': 25, 'i': 26, 'j': 27, 'k': 28, 'l': 29, 'm': 30, 'n': 31, 'o': 32, 'p': 33, 'r': 34, 's': 35, 't': 36, 'u': 37, 'v': 38, 'w': 39, 'x': 40, 'y': 41, 'z': 42}

Vocabulary size: 43


> We now have a mapping from each character to a unique integer, which is the tokenizer's vocabulary. Its size is 43.

**With the vocabulary and the tokenized character list in hand, we can convert the character list into a list of integers.**

In [10]:
input_ids = [token2index.get(token, 0) for token in tokenized_text_list]

In [11]:
print(input_ids)

[0, 13, 41, 15, 32, 34, 20, 25, 0, 26, 35, 0, 18, 31, 0, 32, 33, 36, 26, 30, 26, 42, 22, 21, 0, 36, 22, 31, 35, 32, 34, 0, 29, 26, 19, 34, 18, 34, 41, 0, 23, 32, 34, 0, 21, 22, 22, 33, 0, 29, 22, 18, 34, 31, 26, 31, 24, 0, 37, 35, 26, 31, 24, 0, 11, 13, 16, 35, 0, 18, 31, 21, 0, 9, 13, 16, 35, 5, 0, 0, 10, 22, 18, 36, 37, 34, 22, 35, 0, 21, 22, 35, 20, 34, 26, 19, 22, 21, 0, 26, 31, 0, 36, 25, 26, 35, 0, 21, 32, 20, 37, 30, 22, 31, 36, 18, 36, 26, 32, 31, 0, 18, 34, 22, 0, 20, 29, 18, 35, 35, 26, 23, 26, 22, 21, 0, 19, 41, 0, 34, 22, 29, 22, 18, 35, 22, 0, 35, 36, 18, 36, 37, 35, 6, 0, 0, 14, 36, 18, 19, 29, 22, 6, 0, 15, 25, 22, 35, 22, 0, 23, 22, 18, 36, 37, 34, 22, 35, 0, 39, 26, 29, 29, 0, 19, 22, 0, 30, 18, 26, 31, 36, 18, 26, 31, 22, 21, 0, 29, 32, 31, 24, 4, 36, 22, 34, 30, 0, 18, 31, 21, 0, 36, 25, 22, 34, 22, 0, 35, 25, 32, 37, 29, 21, 0, 24, 22, 31, 22, 34, 18, 29, 29, 41, 0, 19, 22, 0, 31, 32, 0, 30, 18, 27, 32, 34, 0, 33, 22, 34, 23, 32, 34, 30, 18, 31, 20, 22, 0, 29, 26, 3
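
A vocabulary is only useful if the mapping can be reversed. The following is a minimal sketch of encoding and decoding with a character vocabulary. The names `char2id`, `id2char`, `encode`, and `decode` are introduced here for illustration and are not defined in the cells above; a short sample string is used so that the block runs on its own.

```python
sample = "deep learning"

# Build the vocabulary the same way as above: sorted unique characters.
char2id = {c: i for i, c in enumerate(sorted(set(sample)))}
# Inverse mapping from id back to character (hypothetical helper).
id2char = {i: c for c, i in char2id.items()}

def encode(s):
    """Turn a string into a list of integer ids."""
    return [char2id[c] for c in s]

def decode(ids):
    """Turn a list of integer ids back into a string."""
    return "".join(id2char[i] for i in ids)

sample_ids = encode(sample)
print(sample_ids)
print(decode(sample_ids))
```

Note that `encode` raises a `KeyError` for a character outside the vocabulary, instead of silently mapping it to id 0 as `token2index.get(token, 0)` would.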

**Next, we convert this list of integers into one-hot encoded vectors.**

In [12]:
# Convert the list of integers into a tensor
input_ids_tensor = torch.tensor(input_ids)
# Build the one-hot encodings with F.one_hot from torch.nn.functional
input_ids_one_hot_encodings = F.one_hot(input_ids_tensor, num_classes=len(token2index))

# Check the shape of the one-hot encodings
input_ids_one_hot_encodings.shape

torch.Size([987, 43])

In [13]:
# Look at the 11th token (index 10)
print("The 11th token is: {}({})".format(tokenized_text_list[10], token2index[tokenized_text_list[10]]))

The 11th token is: s(35)


In [14]:
# Look at the one-hot encoding of the 11th token
input_ids_one_hot_encodings[10]

tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

In [15]:
# Index of the maximum value in the 11th token's one-hot encoding
torch.argmax(input_ids_one_hot_encodings[10])

tensor(35)
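
To make explicit what the one-hot step computes, here is a pure-Python sketch that does not depend on PyTorch. The `one_hot` and `argmax` functions below are small helpers written for this illustration; they are not the `torch` functions of the same name.

```python
def one_hot(ids, num_classes):
    """Return one row per id, with a single 1 at position id."""
    rows = []
    for i in ids:
        row = [0] * num_classes
        row[i] = 1
        rows.append(row)
    return rows

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

example_ids = [0, 13, 41, 35]  # a few ids from a 43-token vocabulary
encoded = one_hot(example_ids, num_classes=43)

# Taking the argmax of each row recovers the original ids.
recovered = [argmax(row) for row in encoded]
print(recovered)  # [0, 13, 41, 35]
```

The encoding loses nothing: each row has length 43, contains exactly one `1`, and its position is the token id.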

### 2. Word tokenization
> Split the string at the word level.

Split the text on spaces to get all the words:

In [16]:
words = text.split(" ")
# Strip punctuation such as , and . from both ends of each word
words = [word.strip(')(,.;:!?_-"') for word in words]
# Drop empty strings ''
words = [word for word in words if word]

In [17]:
print(words)

['PyTorch', 'is', 'an', 'optimized', 'tensor', 'library', 'for', 'deep', 'learning', 'using', 'GPUs', 'and', 'CPUs', 'Features', 'described', 'in', 'this', 'documentation', 'are', 'classified', 'by', 'release', 'status', 'Stable', 'These', 'features', 'will', 'be', 'maintained', 'long-term', 'and', 'there', 'should', 'generally', 'be', 'no', 'major', 'performance', 'limitations', 'or', 'gaps', 'in', 'documentation', 'We', 'also', 'expect', 'to', 'maintain', 'backwards', 'compatibility', 'although', 'breaking', 'changes', 'can', 'happen', 'and', 'notice', 'will', 'be', 'given', 'one', 'release', 'ahead', 'of', 'time', 'Beta', 'These', 'features', 'are', 'tagged', 'as', 'Beta', 'because', 'the', 'API', 'may', 'change', 'based', 'on', 'user', 'feedback', 'because', 'the', 'performance', 'needs', 'to', 'improve', 'or', 'because', 'coverage', 'across', 'operators', 'is', 'not', 'yet', 'complete', 'For', 'Beta', 'features', 'we', 'are', 'committing', 'to', 'seeing', 'the', 'feature', 'throug
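
Splitting on spaces and then stripping punctuation works for this text, but it takes two passes. As an alternative sketch, a regular expression can extract the words directly. The pattern below is an assumption chosen for this English sample (runs of letters, optionally joined by hyphens). It is not the approach used in the cells above, and it would need adjusting for digits, apostrophes, or other languages.

```python
import re

# Assumed pattern: letters, optionally joined by single hyphens ("long-term").
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")

def split_words(s):
    """Return all word-like substrings of s, in order."""
    return WORD_PATTERN.findall(s)

sample_sentence = (
    "Stable: These features will be maintained long-term "
    "(although breaking changes can happen)."
)
sample_words = split_words(sample_sentence)
print(sample_words)
```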

Deduplicate the words and sort them:

In [18]:
words_set = sorted(set(words))
print(words_set)

['API', 'Beta', 'CPUs', 'Conda', 'Features', 'For', 'GPUs', 'Prototype', 'PyPI', 'PyTorch', 'Stable', 'These', 'We', 'across', 'ahead', 'also', 'although', 'an', 'and', 'are', 'as', 'at', 'available', 'backwards', 'based', 'be', 'because', 'behind', 'binary', 'breaking', 'by', 'can', 'change', 'changes', 'classification', 'classified', 'committing', 'compatibility', 'complete', 'coverage', 'deep', 'described', 'distributions', 'documentation', 'early', 'except', 'expect', 'feature', 'features', 'feedback', 'flags', 'for', 'gaps', 'generally', 'given', 'happen', 'however', 'improve', 'in', 'is', 'learning', 'library', 'like', 'limitations', 'long-term', 'maintain', 'maintained', 'major', 'may', 'needs', 'no', 'not', 'notice', 'of', 'on', 'one', 'operators', 'optimized', 'or', 'part', 'performance', 'release', 'run-time', 'seeing', 'should', 'sometimes', 'stage', 'status', 'tagged', 'tensor', 'testing', 'the', 'there', 'this', 'through', 'time', 'to', 'typically', 'user', 'using', 'we', 

In [19]:
tokens = {
    word: index for index, word in enumerate(words_set)
}

In [20]:
# Look at the first 11 tokens (ids 0 through 10)
for k, v in list(tokens.items())[:11]:
    print(k, ":", v)

API : 0
Beta : 1
CPUs : 2
Conda : 3
Features : 4
For : 5
GPUs : 6
Prototype : 7
PyPI : 8
PyTorch : 9
Stable : 10


In [21]:
len(tokens)

103

In [22]:
# Map each word to its id
word_ids = [tokens.get(word, 0) for word in words]

In [23]:
print(word_ids)

[9, 59, 17, 77, 89, 61, 51, 40, 60, 99, 6, 18, 2, 4, 41, 58, 93, 43, 19, 35, 30, 81, 87, 10, 11, 48, 101, 25, 66, 64, 18, 92, 84, 53, 25, 70, 67, 80, 63, 78, 52, 58, 43, 12, 15, 46, 96, 65, 23, 37, 16, 29, 33, 31, 55, 18, 72, 101, 25, 54, 75, 81, 14, 73, 95, 1, 11, 48, 19, 88, 20, 1, 26, 91, 0, 68, 32, 24, 74, 98, 49, 26, 91, 80, 69, 96, 57, 78, 26, 39, 13, 76, 59, 71, 102, 38, 5, 1, 48, 100, 19, 36, 96, 83, 91, 47, 94, 96, 91, 10, 34, 12, 19, 71, 56, 36, 96, 23, 37, 7, 11, 48, 19, 97, 71, 22, 20, 79, 73, 28, 42, 62, 8, 78, 3, 45, 85, 27, 82, 50, 18, 19, 21, 17, 44, 86, 51, 49, 18, 90]


In [24]:
len(word_ids)

150
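
One caveat with `tokens.get(word, 0)`: a word that is not in the vocabulary silently receives id 0, which is also the id of `'API'`. Every word here comes from the same text, so the fallback never triggers, but on new text it would. A common remedy is to reserve a dedicated unknown token. The sketch below assumes a hypothetical `<unk>` entry at index 0; the names `UNK`, `build_vocab`, and `encode_words` are introduced only for this example.

```python
UNK = "<unk>"  # hypothetical reserved token for out-of-vocabulary words

def build_vocab(words):
    """Reserve id 0 for UNK, then number the sorted unique words from 1."""
    vocab = {UNK: 0}
    for word in sorted(set(words)):
        vocab[word] = len(vocab)
    return vocab

def encode_words(words, vocab):
    """Map words to ids; unseen words fall back to the UNK id."""
    return [vocab.get(word, vocab[UNK]) for word in words]

train_words = ["deep", "learning", "using", "GPUs"]
small_vocab = build_vocab(train_words)
unk_ids = encode_words(["deep", "learning", "using", "TPUs"], small_vocab)
print(unk_ids)  # [2, 3, 4, 0]
```

With this layout, id 0 always means "unknown word" and never collides with a real vocabulary entry.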

**Representing word_ids with one-hot encoding**:

In a `one-hot` vector, exactly one element is `1` and all the others are `0`.

In [25]:
word_ids_one_hot = torch.zeros(len(word_ids), len(tokens))

In [26]:
for idx, token in enumerate(word_ids):
    # print(idx, token)
    word_ids_one_hot[idx][token] = 1

In [27]:
word_ids_one_hot.shape

torch.Size([150, 103])

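Representing words with `one-hot` vectors is one of the simplest ways to map a word to a vector. The dimension of a `one-hot` vector equals the size of the word vocabulary.   
Here the vocabulary has 103 words, so each vector has 103 dimensions.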

**Note:** these `one-hot` vectors carry no real semantic meaning; they cannot express any information about the properties of a word.
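
This can be checked directly: the dot product of any two different one-hot vectors is 0, so every pair of distinct words looks equally unrelated. A small pure-Python sketch, using three word ids taken from the vocabulary above (`features` = 48, `feature` = 47, `tensor` = 89); the helpers `one_hot_vector` and `dot` are written just for this example.

```python
def one_hot_vector(index, size):
    """A list of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

size = 103  # vocabulary size from above
features_vec = one_hot_vector(48, size)
feature_vec = one_hot_vector(47, size)
tensor_vec = one_hot_vector(89, size)

print(dot(features_vec, features_vec))  # 1
print(dot(features_vec, feature_vec))   # 0
print(dot(features_vec, tensor_vec))    # 0
```

By this measure `features` is exactly as far from `feature` as it is from `tensor`, even though the first two are clearly related.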

In [28]:
word_ids[0]

9

In [29]:
word_ids_one_hot[0]

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [30]:
torch.argmax(word_ids_one_hot[0])

tensor(9)

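The key point about a tokenizer: its input is raw text and its output is a sequence of integer token ids, which this notebook then turns into a sequence of vectors (one-hot vectors here).   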

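**Note:** the **character tokenization** and **word tokenization** examples above are generally not used in practice. The usual choice is subword tokenization, which sits between the two.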
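
To give a rough feel for what "between characters and words" means, here is a minimal greedy longest-match sketch. It is an illustration under strong assumptions: the vocabulary `subword_vocab` is hand-picked and hypothetical, whereas real subword tokenizers such as BPE or WordPiece learn their vocabulary from a corpus and differ in detail. The function name `subword_tokenize` is also made up for this example.

```python
def subword_tokenize(word, vocab, unk="<unk>"):
    """Greedily split word into the longest pieces found in vocab.

    Returns [unk] if some part of the word cannot be matched.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no piece starting here is known
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical hand-picked vocabulary, for illustration only.
subword_vocab = {"token", "ization", "s", "learn", "ing"}

print(subword_tokenize("tokenization", subword_vocab))  # ['token', 'ization']
print(subword_tokenize("tokens", subword_vocab))        # ['token', 's']
print(subword_tokenize("learning", subword_vocab))      # ['learn', 'ing']
print(subword_tokenize("xyz", subword_vocab))           # ['<unk>']
```

A small set of pieces can cover many words, which keeps the vocabulary far smaller than a word vocabulary while producing shorter sequences than character tokenization.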