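Models that process natural language often handle different languages with different character sets. Unicode is a standard encoding system used to represent characters from almost all languages. Each character is encoded using a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.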
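As a small illustration in plain Python (not TensorFlow), the built-ins `ord` and `chr` map between characters and code points. The variable names here are only for illustration.

```python
# A string viewed as a sequence of integer code points.
text = "Thanks 😊"

# ord() returns the code point of a single character.
code_points = [ord(ch) for ch in text]
print(code_points)  # [84, 104, 97, 110, 107, 115, 32, 128522]

# Every code point lies in the range 0..0x10FFFF.
in_range = all(0 <= cp <= 0x10FFFF for cp in code_points)

# chr() is the inverse of ord(), so the string can be rebuilt.
rebuilt = "".join(chr(cp) for cp in code_points)
```

The emoji is a single code point (128522, or 0x1F60A), even though it takes several bytes once encoded.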

# 1.Setup

In [3]:
import tensorflow as tf
from IPython.core.interactiveshell import InteractiveShell
# Display the value of every expression in a cell, not only the last one.
InteractiveShell.ast_node_interactivity = "all"

# 2.tf.string data type

In [9]:
tf.constant(u"Thanks 😊")
tf.constant(u"Thanks 😊").shape
tf.constant([u"You're", "welcome!"])
tf.constant([u"You're", "welcome!"]).shape

<tf.Tensor: shape=(), dtype=string, numpy=b'Thanks \xf0\x9f\x98\x8a'>

TensorShape([])

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b"You're", b'welcome!'], dtype=object)>

TensorShape([2])

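# 3.Representing Unicode

There are two standard ways to represent a Unicode string in TensorFlow:

- `string` scalar: the sequence of code points is encoded using a known character encoding
- `int32` vector: each position contains a single code point

TensorFlow provides operations to convert between these representations:

- `tf.strings.unicode_decode`: converts an encoded string scalar to a vector of code points.
- `tf.strings.unicode_encode`: converts a vector of code points to an encoded string scalar.
- `tf.strings.unicode_transcode`: converts an encoded string scalar to a different encoding.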
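The three conversions can be cross-checked with Python's own string methods. This is a rough plain-Python mirror of what the operations compute, not how TensorFlow implements them. The variable names are illustrative.

```python
text = "语言处理"

# Encoded string scalar (UTF-8 bytes).
utf8_bytes = text.encode("utf-8")

# "decode": encoded bytes -> list of code points.
code_points = [ord(ch) for ch in utf8_bytes.decode("utf-8")]

# "encode": list of code points -> encoded bytes.
re_encoded = "".join(chr(cp) for cp in code_points).encode("utf-8")

# "transcode": bytes in one encoding -> bytes in another encoding.
utf16be_bytes = utf8_bytes.decode("utf-8").encode("utf-16-be")

print(code_points)    # [35821, 35328, 22788, 29702]
print(utf16be_bytes)  # b'\x8b\xed\x8a\x00Y\x04t\x06'
```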

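When decoding multiple strings, the number of characters in each string may differ. The result is a `tf.RaggedTensor`, whose innermost dimension varies in length with the number of characters in each string.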
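To see why a ragged structure is needed, the sketch below decodes a small batch with nested Python lists. It only imitates the shape of a ragged result and a padded dense version of it. It does not use `tf.RaggedTensor`, and the batch contents are made up for illustration.

```python
# A hypothetical batch of strings with different character counts.
batch = ["cat", "语言处理", "😊"]

# One row of code points per string; rows have different lengths.
ragged = [[ord(ch) for ch in s] for s in batch]
lengths = [len(row) for row in ragged]

# A dense version needs padding up to the longest row (-1 used here).
width = max(lengths)
padded = [row + [-1] * (width - len(row)) for row in ragged]

print(lengths)  # [3, 4, 1]
print(padded[0])  # [99, 97, 116, -1]
```

The ragged form stores only real characters, while the padded form needs a filler value that cannot be mistaken for a code point.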

In [12]:
text_utf8 = tf.constant(u"语言处理")
text_utf8

text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
text_utf16be

text_chars = tf.constant([ord(char) for char in u"语言处理"])
text_chars

<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>

<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>

In [19]:
tf.strings.unicode_decode(text_utf8, input_encoding = "UTF-8")
tf.strings.unicode_encode(text_chars, output_encoding = "UTF-8")
tf.strings.unicode_transcode(text_utf8, input_encoding = "UTF-8", output_encoding = "UTF-16-BE")

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>

<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>

<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>

In [27]:
batch_utf8 = [s.encode("UTF-8") for s in [u'hÃllo',  u'What is the weather tomorrow',  u'Göödnight', u'😊']]

batch_chars_ragged = tf.strings.unicode_decode(batch_utf8, input_encoding = "UTF-8")
for sentence_chars in batch_chars_ragged.to_list():
    print(sentence_chars)

batch_chars_padded = batch_chars_ragged.to_tensor(default_value = -1)
batch_chars_padded

batch_chars_sparse = batch_chars_ragged.to_sparse()
batch_chars_sparse

[104, 195, 108, 108, 111]
[87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]
[71, 246, 246, 100, 110, 105, 103, 104, 116]
[128522]


<tf.Tensor: shape=(4, 28), dtype=int32, numpy=
array([[   104,    195,    108,    108,    111,     -1,     -1,     -1,
            -1,     -1,     -1,     -1,     -1,     -1,     -1,     -1,
            -1,     -1,     -1,     -1,     -1,     -1,     -1,     -1,
            -1,     -1,     -1,     -1],
       [    87,    104,     97,    116,     32,    105,    115,     32,
           116,    104,    101,     32,    119,    101,     97,    116,
           104,    101,    114,     32,    116,    111,    109,    111,
           114,    114,    111,    119],
       [    71,    246,    246,    100,    110,    105,    103,    104,
           116,     -1,     -1,     -1,     -1,     -1,     -1,     -1,
            -1,     -1,     -1,     -1,     -1,     -1,     -1,     -1,
            -1,     -1,     -1,     -1],
       [128522,     -1,     -1,     -1,     -1,     -1,     -1,     -1,
            -1,     -1,     -1,     -1,     -1,     -1,     -1,     -1,
            -1,     -1,     -1,     -1,     -1,     -1,     -1,     -1,
            -1,     -1,     -1,     -1]], dtype=int32)>

<tensorflow.python.framework.sparse_tensor.SparseTensor at 0x192defa90>

In [29]:
tf.strings.unicode_encode(
    [
        [99, 97, 116], 
        [100, 111, 103],
        [99, 111, 119],
    ],
    output_encoding = "UTF-8"
)
tf.strings.unicode_encode(
    batch_chars_ragged,
    output_encoding = "UTF-8"
)
tf.strings.unicode_encode(
    tf.RaggedTensor.from_sparse(batch_chars_sparse),
    output_encoding = "UTF-8"
)
tf.strings.unicode_encode(
    tf.RaggedTensor.from_tensor(batch_chars_padded, padding = -1),
    output_encoding = "UTF-8"
)

<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'cat', b'dog', b'cow'], dtype=object)>

<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>

<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>

<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>

# 4.Unicode operations

In [34]:
thanks = u'Thanks 😊'.encode("UTF-8")
num_bytes = tf.strings.length(thanks).numpy()
num_chars = tf.strings.length(thanks, unit = "UTF8_CHAR").numpy()
print(f"{num_bytes} bytes")
print(f"{num_chars} UTF-8 characters")

11 bytes
8 UTF-8 characters
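The gap between the two counts comes from UTF-8 being a variable-width encoding. A plain-Python check (independent of TensorFlow, with illustrative variable names) shows how many bytes each character contributes:

```python
text = "Thanks 😊"

# Number of UTF-8 bytes used by each character.
per_char_bytes = [(ch, len(ch.encode("utf-8"))) for ch in text]

num_bytes = sum(n for _, n in per_char_bytes)
num_chars = len(text)

print(per_char_bytes[-1])  # ('😊', 4)
print(num_bytes, num_chars)  # 11 8
```

The seven ASCII characters take one byte each and the emoji takes four, which gives 11 bytes for 8 characters.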


# 5.Unicode scripts

# 6.Simple segmentation

In [41]:
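# dtype: string; shape: [num_sentences]
sentence_texts = [u'Hello, world.', u'世界こんにちは']

# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
# Code point of each character in each sentence.
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, "UTF-8")

# Script code of each character (ICU UScriptCode values,
# e.g. 25 = Latin, 0 = Common, 17 = Han, 20 = Hiragana).
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)

# dtype: bool; True where a character starts a new word: always at the
# first character of a sentence, otherwise wherever the script changes.
sentence_char_starts_word = tf.concat(
    [
        tf.fill([sentence_char_script.nrows(), 1], True),
        tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1]),
    ], 
    axis = 1
)
# dtype: int64; shape: [num_words]
# Index of each word's first character in the flattened character list.
word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis = 1)

# dtype: int32; shape: [num_words, (num_chars_per_word)]
word_char_codepoint = tf.RaggedTensor.from_row_starts(values = sentence_char_codepoint.values, row_starts = word_starts)

# Regroup the words by sentence.
# dtype: int32; shape: [num_sentences, (num_words_per_sentence), (num_chars_per_word)]
sentence_num_words = tf.reduce_sum(tf.cast(sentence_char_starts_word, tf.int64), axis = 1)
sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(values = word_char_codepoint, row_lengths = sentence_num_words)

print(sentence_char_codepoint)
print(sentence_char_script)
print(word_starts)
print(word_char_codepoint)
print(sentence_word_char_codepoint)

# Encode each word back to UTF-8 so the result is readable.
tf.strings.unicode_encode(sentence_word_char_codepoint, "UTF-8").to_list()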

<tf.RaggedTensor [[72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 46], [19990, 30028, 12371, 12435, 12395, 12385, 12399]]>
<tf.RaggedTensor [[25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0], [17, 17, 20, 20, 20, 20, 20]]>
tf.Tensor([ 0  5  7 12 13 15], shape=(6,), dtype=int64)
<tf.RaggedTensor [[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46], [19990, 30028], [12371, 12435, 12395, 12385, 12399]]>
<tf.RaggedTensor [[[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46]], [[19990, 30028], [12371, 12435, 12395, 12385, 12399]]]>


[[b'Hello', b', ', b'world', b'.'],
 [b'\xe4\xb8\x96\xe7\x95\x8c',
  b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf']]
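The boundary rule used in this cell (a new word starts wherever a character's script code differs from the previous one) can be checked without TensorFlow. The sketch below is a plain-Python re-implementation of that rule only. It takes the code points and script codes copied from the printed output above as fixed input, and the helper name `split_on_script_change` is hypothetical.

```python
from itertools import groupby

# Code points and script codes copied from the printed output above.
sentence_char_codepoint = [
    [72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 46],
    [19990, 30028, 12371, 12435, 12395, 12385, 12399],
]
sentence_char_script = [
    [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],
    [17, 17, 20, 20, 20, 20, 20],
]


def split_on_script_change(codepoints, scripts):
    """Group consecutive characters that share a script code into words."""
    words = []
    pairs = zip(codepoints, scripts)
    # groupby starts a new run each time the script code changes.
    for _, run in groupby(pairs, key=lambda pair: pair[1]):
        words.append("".join(chr(cp) for cp, _ in run))
    return words


segmented = [
    split_on_script_change(cps, scripts)
    for cps, scripts in zip(sentence_char_codepoint, sentence_char_script)
]
print(segmented)
# [['Hello', ', ', 'world', '.'], ['世界', 'こんにちは']]
```

The words match the UTF-8 byte strings in the final output. The rule is only a rough approximation of word boundaries: it splits 世界こんにちは correctly only because the script happens to change from Han to Hiragana at the word boundary.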