<a href="https://colab.research.google.com/github/hansglick/book_errata/blob/main/p030_unicode_strings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import tensorflow as tf
import numpy as np
print(tf.__version__)

2.8.2


Unicode can encode characters from virtually every writing system. Each Unicode character is assigned an integer code point between 0 and 0x10FFFF. The `tf.string` dtype lets you build tensors of byte strings, and Python Unicode strings are encoded as UTF-8 by default when they are converted to tensors. In other words, the tensor only stores raw bytes and does not record which encoding they use.
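To make the code point range and the variable width of UTF-8 concrete, here is a small plain-Python sketch (standard library only, no TensorFlow). The sample characters are my own choice, not taken from the tutorial.

```python
# Code points are plain integers; ord() and chr() convert in both directions.
samples = ["A", "é", "语", "😊"]
for ch in samples:
    code_point = ord(ch)
    utf8_bytes = ch.encode("utf-8")
    # UTF-8 uses between 1 and 4 bytes depending on the size of the code point.
    print(f"{ch!r}: U+{code_point:04X} -> {len(utf8_bytes)} byte(s) {utf8_bytes!r}")

# 0x10FFFF is the largest valid code point; one past it is rejected by Python.
highest = chr(0x10FFFF)
try:
    chr(0x110000)
    beyond_range_ok = True
except ValueError:
    beyond_range_ok = False
print(ord(highest) == 0x10FFFF, beyond_range_ok)
```

This is why a `tf.string` scalar holding one emoji carries four bytes even though it is a single character.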

In [2]:
tf.constant(u"Thanks 😊")

<tf.Tensor: shape=(), dtype=string, numpy=b'Thanks \xf0\x9f\x98\x8a'>

The `u` prefix marks a Unicode string literal; in Python 3 it is optional, because every `str` is already Unicode. There are two ways to represent a Unicode string in TensorFlow:
 1. As a string scalar holding the encoded bytes, e.g. `tf.constant(u"语言处理")`
 2. As a vector of integers (odd at first sight), one code point per entry, e.g. `tf.constant([ord(char) for char in u"语言处理"])`

In [3]:
# Unicode string, represented as a UTF-8 encoded string scalar.
text_utf8 = tf.constant(u"语言处理")
text_utf8

# Unicode string, represented as a UTF-16-BE encoded string scalar.
text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
text_utf16be

# Unicode string, represented as a vector of Unicode code points.
text_chars = tf.constant([ord(char) for char in u"语言处理"])
text_chars

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>

# Conversion operations

 * `tf.strings.unicode_decode` : Converts an encoded string scalar to a vector of code points.
 * `tf.strings.unicode_encode` : Converts a vector of code points to an encoded string scalar.
 * `tf.strings.unicode_transcode` : Converts an encoded string scalar to a different encoding.
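As a sanity check on what these three operations compute, here is a rough plain-Python equivalent for a single string. The helper names (`decode_to_code_points`, `encode_from_code_points`, `transcode`) are hypothetical and only mirror the idea; the TensorFlow ops additionally work on batches and accept options such as `errors` that this sketch ignores.

```python
from typing import List


def decode_to_code_points(data: bytes, encoding: str = "utf-8") -> List[int]:
    """Encoded bytes -> list of code points (the idea behind unicode_decode)."""
    return [ord(ch) for ch in data.decode(encoding)]


def encode_from_code_points(code_points: List[int], encoding: str = "utf-8") -> bytes:
    """List of code points -> encoded bytes (the idea behind unicode_encode)."""
    return "".join(chr(cp) for cp in code_points).encode(encoding)


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one encoding to another (the idea behind unicode_transcode)."""
    return data.decode(source).encode(target)


text_utf8 = "语言处理".encode("utf-8")
code_points = decode_to_code_points(text_utf8)
print(code_points)
print(encode_from_code_points(code_points) == text_utf8)
print(transcode(text_utf8, "utf-8", "utf-16-be"))
```

Unlike this sketch, which raises on malformed input, `tf.strings.unicode_decode` substitutes a replacement character by default.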

In [7]:
a = tf.strings.unicode_decode(text_utf8,input_encoding='UTF-8')
print(text_utf8)
print(a)
print("")
a = tf.strings.unicode_encode(text_chars,output_encoding='UTF-8')
print(text_chars)
print(a)
print("")
a = tf.strings.unicode_transcode(text_utf8,input_encoding='UTF-8',output_encoding='UTF-16-BE')
print(text_utf8)
print(a)


tf.Tensor(b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86', shape=(), dtype=string)
tf.Tensor([35821 35328 22788 29702], shape=(4,), dtype=int32)

tf.Tensor([35821 35328 22788 29702], shape=(4,), dtype=int32)
tf.Tensor(b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86', shape=(), dtype=string)

tf.Tensor(b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86', shape=(), dtype=string)
tf.Tensor(b'\x8b\xed\x8a\x00Y\x04t\x06', shape=(), dtype=string)


In [8]:
# A batch of Unicode strings, each represented as a UTF8-encoded string.
batch_utf8 = [s.encode('UTF-8') for s in
              [u'hÃllo', u'What is the weather tomorrow', u'Göödnight', u'😊']]
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8,
                                               input_encoding='UTF-8')
for sentence_chars in batch_chars_ragged.to_list():
  print(sentence_chars)

[104, 195, 108, 108, 111]
[87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]
[71, 246, 246, 100, 110, 105, 103, 104, 116]
[128522]


In [11]:
# Convert the ragged batch to a dense tensor, padding short rows with -1.
print(batch_chars_ragged)
print("")
batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)
print(batch_chars_padded.numpy())

<tf.RaggedTensor [[104, 195, 108, 108, 111],
 [87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116,
  104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]               ,
 [71, 246, 246, 100, 110, 105, 103, 104, 116], [128522]]>

[[   104    195    108    108    111     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]
 [    87    104     97    116     32    105    115     32    116    104
     101     32    119    101     97    116    104    101    114     32
     116    111    109    111    114    114    111    119]
 [    71    246    246    100    110    105    103    104    116     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]
 [128522     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]]
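The padding step can be pictured with plain Python lists. This is only a sketch of the idea behind `to_tensor(default_value=-1)` and of stripping the padding again; `pad_rows` and `strip_padding` are hypothetical helper names, and the long sentence is left out to keep the rows short.

```python
PAD = -1  # same fill value as default_value=-1 in the cell above


def pad_rows(rows, pad=PAD):
    """Right-pad every row to the length of the longest row."""
    width = max(len(row) for row in rows)
    return [row + [pad] * (width - len(row)) for row in rows]


def strip_padding(rows, pad=PAD):
    """Drop the trailing run of pad values from each row."""
    stripped = []
    for row in rows:
        end = len(row)
        while end > 0 and row[end - 1] == pad:
            end -= 1
        stripped.append(row[:end])
    return stripped


ragged = [[ord(c) for c in s] for s in ["hÃllo", "Göödnight", "😊"]]
padded = pad_rows(ragged)
for row in padded:
    print(row)
print(strip_padding(padded) == ragged)
```

Using -1 as the fill value is safe here because no valid code point is negative, so padding can never be confused with real data.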

In [17]:
a = tf.strings.unicode_encode([[99, 97, 116], [100, 111, 103], [99, 111, 119], [99, 111, 119]],
                          output_encoding='UTF-8')
print(a)
print("")

a = tf.strings.unicode_encode(batch_chars_ragged, output_encoding='UTF-8')
print(a)
print("")

a = tf.strings.unicode_encode(tf.RaggedTensor.from_tensor(batch_chars_padded, padding=-1),output_encoding='UTF-8')
print(a)
print("")

tf.Tensor([b'cat' b'dog' b'cow' b'cow'], shape=(4,), dtype=string)

tf.Tensor(
[b'h\xc3\x83llo' b'What is the weather tomorrow'
 b'G\xc3\xb6\xc3\xb6dnight' b'\xf0\x9f\x98\x8a'], shape=(4,), dtype=string)


In [20]:
sentence_texts = [u'Hello, world.', u'世界こんにちは']
print(sentence_texts)
print("")

sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
print(sentence_char_codepoint)
print("")

sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
print(sentence_char_script)
print("")

['Hello, world.', '世界こんにちは']

<tf.RaggedTensor [[72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 46],
 [19990, 30028, 12371, 12435, 12395, 12385, 12399]]>

<tf.RaggedTensor [[25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],
 [17, 17, 20, 20, 20, 20, 20]]>
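The integers returned by `tf.strings.unicode_script` are script codes from ICU's `UScriptCode` enum. The sketch below attaches names to the four codes that appear in the output above; the `SCRIPT_NAMES` table is a hand-written partial lookup for illustration, not something TensorFlow provides.

```python
# Partial lookup for the ICU UScriptCode values seen in the output above.
# Only these four entries are included; the full enum is much longer.
SCRIPT_NAMES = {
    0: "COMMON",     # punctuation, spaces and symbols shared across scripts
    17: "HAN",
    20: "HIRAGANA",
    25: "LATIN",
}

# Script codes copied from the printed RaggedTensor above.
sentence_scripts = [
    [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],
    [17, 17, 20, 20, 20, 20, 20],
]
sentence_texts = ["Hello, world.", "世界こんにちは"]

# Pair every character with the name of its script.
labelled = [
    [(ch, SCRIPT_NAMES[code]) for ch, code in zip(text, codes)]
    for text, codes in zip(sentence_texts, sentence_scripts)
]
for row in labelled:
    print(row)
```

Note that the comma, the space and the full stop all map to `COMMON`, which is why they end up grouped separately from the Latin letters.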



In [25]:
sentence_char_starts_word = tf.concat(
    [tf.fill([sentence_char_script.nrows(), 1], True),
     tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],
    axis=1)
print(sentence_char_starts_word)
print("")

word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)
print(word_starts)
print("")

word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint.values,
    row_starts=word_starts)
print(word_char_codepoint)
print("")

sentence_num_words = tf.reduce_sum(
    tf.cast(sentence_char_starts_word, tf.int64),
    axis=1)
print(sentence_num_words)
print("")

sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(
    values=word_char_codepoint,
    row_lengths=sentence_num_words)
print(sentence_word_char_codepoint)
print("")

<tf.RaggedTensor [[True, False, False, False, False, True, False, True, False, False, False,
  False, True]                                                             ,
 [True, False, True, False, False, False, False]]>

tf.Tensor([ 0  5  7 12 13 15], shape=(6,), dtype=int64)

<tf.RaggedTensor [[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46],
 [19990, 30028], [12371, 12435, 12395, 12385, 12399]]>

tf.Tensor([4 2], shape=(2,), dtype=int64)

<tf.RaggedTensor [[[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46]],
 [[19990, 30028], [12371, 12435, 12395, 12385, 12399]]]>
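The same segmentation rule, "a new word starts wherever the script code changes", can be written as a short loop over Python lists. This is a sketch of the logic only, using the script codes printed earlier; `split_on_script_change` is a hypothetical helper, and the vectorised RaggedTensor version above is what you would use on real batches.

```python
def split_on_script_change(code_points, scripts):
    """Group code points into runs that share the same script code."""
    words = []
    for i, (cp, script) in enumerate(zip(code_points, scripts)):
        # The first character always starts a word; afterwards a word
        # starts whenever the script differs from the previous character's.
        if i == 0 or script != scripts[i - 1]:
            words.append([])
        words[-1].append(cp)
    return words


sentences = ["Hello, world.", "世界こんにちは"]
# Script codes copied from the unicode_script output earlier in the notebook.
scripts = [
    [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],
    [17, 17, 20, 20, 20, 20, 20],
]

segmented = [
    split_on_script_change([ord(c) for c in text], codes)
    for text, codes in zip(sentences, scripts)
]
num_words = [len(words) for words in segmented]
print(segmented)
print(num_words)
```

Because the split is driven purely by script changes, `", "` counts as one word and a Latin phrase with no punctuation would not be split on spaces alone.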



In [28]:
print(sentence_word_char_codepoint)
print("")
a = tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()
print(a)
print("")

<tf.RaggedTensor [[[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46]],
 [[19990, 30028], [12371, 12435, 12395, 12385, 12399]]]>

[[b'Hello', b', ', b'world', b'.'], [b'\xe4\xb8\x96\xe7\x95\x8c', b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf']]
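The nested list above still contains UTF-8 byte strings. A minimal way to read them as text is to decode each one in plain Python; the byte literals below are copied from that output.

```python
# Byte strings as returned by .to_list() in the cell above.
encoded_words = [
    [b'Hello', b', ', b'world', b'.'],
    [b'\xe4\xb8\x96\xe7\x95\x8c',
     b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf'],
]

# Decode every word from UTF-8 bytes to a Python str.
readable_words = [
    [word.decode("utf-8") for word in sentence]
    for sentence in encoded_words
]
print(readable_words)
```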

