<h1>Deep Learning for Text</h1>

<h2>Converting Between Sentences and Sequences</h2>

<h3>Importing Packages</h3>

In [19]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

<h3>Fitting the Tokenizer</h3>

In [21]:
words = [
    "I love my dog",
    "I love mu cat",
    "I love Gu YUting previously"
]
token = Tokenizer(num_words=100, oov_token="<OOV>")
token.fit_on_texts(words)

In [14]:
print(token.word_index)

{'<OOV>': 1, 'i': 2, 'love': 3, 'my': 4, 'dog': 5, 'mu': 6, 'cat': 7, 'gu': 8, 'yuting': 9, 'previously': 10}
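
The indices above follow a simple rule: the OOV token takes index 1, then words are numbered by descending frequency, with ties kept in first-seen order (index 0 is left free for padding). Below is a minimal pure-Python sketch of that rule. `build_word_index` is a hypothetical helper written for illustration, not part of Keras, and it skips the punctuation filtering that the real `Tokenizer` performs.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Simplified stand-in for Tokenizer.fit_on_texts (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        # Lowercase and split on whitespace; no punctuation filtering here.
        counts.update(text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    vocab = [oov_token] + [word for word, _ in ordered]
    # Indices start at 1; 0 is reserved for padding.
    return {word: i for i, word in enumerate(vocab, start=1)}

words = [
    "I love my dog",
    "I love mu cat",
    "I love Gu YUting previously",
]
sketch_index = build_word_index(words)
print(sketch_index)
```

For these three sentences the sketch should reproduce the same mapping that `token.word_index` printed.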


<h3>Converting Sentences to Sequences with the Tokenizer</h3>

In [26]:
seq = token.texts_to_sequences(words)
print(type(seq))         # still a plain Python list before padding
print(seq)

<class 'list'>
[[2, 3, 4, 5], [2, 3, 6, 7], [2, 3, 8, 9, 10]]
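
The `oov_token` only matters for words the tokenizer has not seen, which none of the sentences above contain. The following pure-Python sketch illustrates the lookup under two assumptions: unseen words map to the OOV index, and, when `num_words` is set, any word whose index is at or above `num_words` also maps to OOV. `to_sequences` is a hypothetical helper and the word index is copied from the output printed earlier; this is an illustration of the idea, not the Keras source.

```python
word_index = {
    "<OOV>": 1, "i": 2, "love": 3, "my": 4, "dog": 5,
    "mu": 6, "cat": 7, "gu": 8, "yuting": 9, "previously": 10,
}

def to_sequences(texts, word_index, oov_token="<OOV>", num_words=None):
    """Simplified stand-in for Tokenizer.texts_to_sequences (hypothetical helper)."""
    oov_id = word_index[oov_token]
    result = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            idx = word_index.get(word, oov_id)   # unseen word -> OOV index
            if num_words is not None and idx >= num_words:
                idx = oov_id                     # outside the kept vocabulary -> OOV index
            ids.append(idx)
        result.append(ids)
    return result

unseen = to_sequences(["I love my parrot"], word_index)
limited = to_sequences(["I love my dog"], word_index, num_words=4)
print(unseen)
print(limited)
```

In the first call "parrot" is replaced by index 1. In the second call only indices below 4 survive, so "my" and "dog" also collapse to index 1.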


<h3>Converting Sequences Back to Sentences with the Tokenizer</h3>

In [23]:
print(token.sequences_to_texts(seq))

['i love my dog', 'i love mu cat', 'i love gu yuting previously']


<h2>Padding Sequences</h2>

In [24]:
seq_pad = pad_sequences(
    sequences=seq,
    maxlen=5,            # maximum sequence length
    padding="pre",       # pad with zeros at the front ("pre") or at the back ("post")
    truncating="post"    # if too long, drop tokens from the front ("pre") or from the back ("post")
)
print(type(seq_pad))   # after padding the result is a numpy.ndarray
print(seq_pad)

<class 'numpy.ndarray'>
[[ 0  2  3  4  5]
 [ 0  2  3  6  7]
 [ 2  3  8  9 10]]
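
With `maxlen=5` nothing is truncated, so the effect of `truncating` is not visible in the output above. The pure-Python sketch below mimics the padding and truncation rules on nested lists so both options can be compared. `pad` is a hypothetical helper that assumes `maxlen` is at least 1 and returns lists rather than a NumPy array; it is not the Keras implementation.

```python
def pad(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Simplified list-based stand-in for pad_sequences (hypothetical helper)."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # "pre" drops tokens from the front, "post" drops them from the back.
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # "pre" puts the zeros in front, "post" puts them behind.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded

seq = [[2, 3, 4, 5], [2, 3, 6, 7], [2, 3, 8, 9, 10]]
same_as_above = pad(seq, maxlen=5, padding="pre", truncating="post")
cut_back = pad(seq, maxlen=4, padding="pre", truncating="post")
cut_front = pad(seq, maxlen=4, padding="pre", truncating="pre")
print(same_as_above)
print(cut_back)
print(cut_front)
```

With `maxlen=4` the third sentence loses its last token under `truncating="post"` and its first token under `truncating="pre"`.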


<h2>Working with Data</h2>

<h3>Loading the Data</h3>

In [1]:
import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

<h3>Inspecting the Training and Test Sets</h3>

In [2]:
train_data = imdb["train"]
print(type(train_data))
test_data = imdb["test"]
print(type(test_data))

<class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>


<h3>Separating Text and Labels</h3>

In [3]:
train_sequences = []
train_labels = []
test_sequences = []
test_labels = []
for s, l in train_data:
    train_sequences.append(s.numpy().decode("utf8"))
    train_labels.append(l.numpy())
for s, l in test_data:
    test_sequences.append(s.numpy().decode("utf8"))
    test_labels.append(l.numpy())

In [4]:
print(test_sequences[0])
print(type(test_sequences[0]))
print(test_labels[0])
print(type(test_labels[0]))

There are films that make careers. For George Romero, it was NIGHT OF THE LIVING DEAD; for Kevin Smith, CLERKS; for Robert Rodriguez, EL MARIACHI. Add to that list Onur Tukel's absolutely amazing DING-A-LING-LESS. Flawless film-making, and as assured and as professional as any of the aforementioned movies. I haven't laughed this hard since I saw THE FULL MONTY. (And, even then, I don't think I laughed quite this hard... So to speak.) Tukel's talent is considerable: DING-A-LING-LESS is so chock full of double entendres that one would have to sit down with a copy of this script and do a line-by-line examination of it to fully appreciate the, uh, breadth and width of it. Every shot is beautifully composed (a clear sign of a sure-handed director), and the performances all around are solid (there's none of the over-the-top scenery chewing one might've expected from a film like this). DING-A-LING-LESS is a film whose time has come.
<class 'str'>
1
<class 'numpy.int64'>


<h4>Tracing the Type Conversions for the Text</h4>

<h5>s is an EagerTensor</h5>

In [7]:
print(type(s))
print(s)

<class 'tensorflow.python.framework.ops.EagerTensor'>
tf.Tensor(b"They just don't make cartoons like they used to. This one had wit, great characters, and the greatest ensemble of voice over artists ever assembled for a daytime cartoon show. This still remains as one of the highest rated daytime cartoon shows, and one of the most honored, winning several Emmy Awards.", shape=(), dtype=string)


<h5>Converting to the Type Given by dtype</h5>

In [9]:
print(type(s.numpy()))
print(s.numpy())

<class 'bytes'>
b"They just don't make cartoons like they used to. This one had wit, great characters, and the greatest ensemble of voice over artists ever assembled for a daytime cartoon show. This still remains as one of the highest rated daytime cartoon shows, and one of the most honored, winning several Emmy Awards."


<h5>Decoding into a String</h5>

In [11]:
print(type(s.numpy().decode("utf8")))
print(s.numpy().decode("utf8"))

<class 'str'>
They just don't make cartoons like they used to. This one had wit, great characters, and the greatest ensemble of voice over artists ever assembled for a daytime cartoon show. This still remains as one of the highest rated daytime cartoon shows, and one of the most honored, winning several Emmy Awards.
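
The step from `bytes` to `str` is ordinary Python and does not depend on TensorFlow. The short sketch below shows the round trip on its own. The first sample is a shortened copy of the review above; the accented word is an assumed example added to show that byte length and character length can differ.

```python
raw = b"They just don't make cartoons like they used to."
text = raw.decode("utf8")            # bytes -> str
print(type(raw).__name__, "->", type(text).__name__)

# Encoding the string again gives back the original bytes.
round_trip = text.encode("utf8")
print(round_trip == raw)

# For non-ASCII text, one character can take several bytes in UTF-8.
accented_bytes = "café".encode("utf8")
accented_text = accented_bytes.decode("utf8")
print(len(accented_bytes), len(accented_text))
```

This is why the reviews are decoded before tokenizing: the tokenizer should see text characters, not raw bytes.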


<h4>Tracing the Type Conversions for the Label</h4>

<h5>l is an EagerTensor</h5>

In [13]:
print(type(l))
print(l)

<class 'tensorflow.python.framework.ops.EagerTensor'>
tf.Tensor(1, shape=(), dtype=int64)


<h5>Converting to the Type Given by dtype</h5>

In [None]:
print(type(l.numpy()))
print(l.numpy())