# Chapter 6: Deep Learning for Text and Sequences

Recurrent neural networks and 1D convolutional neural networks

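Text vectorization:
1. Split the text into words and convert each word to a vector
2. Split the text into characters and convert each character to a vector
3. Extract word n-grams and convert each n-gram to a vector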
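
The third option can be sketched as follows. This is a minimal illustration rather than part of the original notebook: `word_ngrams` is a hypothetical helper, and lowercasing plus whitespace splitting stands in for a real tokenizer.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams (as tuples) found in `text`."""
    words = text.lower().split()  # naive whitespace tokenization (assumption)
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sample = 'The cat sat on the mat.'
# Bag of 1-grams and 2-grams: a set, so duplicates and ordering are discarded.
bag = set(word_ngrams(sample, 1)) | set(word_ngrams(sample, 2))

# Assign each distinct n-gram an integer index, starting from 1.
ngram_index = {gram: i for i, gram in enumerate(sorted(bag), start=1)}
print(len(ngram_index), 'distinct n-grams')
```

Each n-gram index can then be turned into a vector in the same way as a word index.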


### Word-level one-hot encoding

In [None]:
import numpy as np

samples = ['The cat sat on the mat.', 'THe dog ate my homework.']
# Note: 'The' and 'THe' are different tokens because matching is case-sensitive.

# Build an index of all tokens in the data.
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            # Indices start at 1; index 0 is left unassigned.
            token_index[word] = len(token_index) + 1

# Only consider the first max_length words of each sample.
max_length = 10
print(token_index.values())
results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))

for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.

print(results)


### Character-level one-hot encoding

In [None]:
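import string

samples = ['The cat sat on the mat.', 'THe dog ate my homework.']
characters = string.printable  # all printable ASCII characters
# Map each character to an integer index (starting at 1), so that
# token_index.get(character) returns that character's index.
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    # Only consider the first max_length characters of each sample.
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.
print(results)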


Word-level one-hot encoding with Keras's built-in utilities

In [None]:
from keras.preprocessing.text import Tokenizer  # the tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Only keep the 1,000 most common words.
tokenizer = Tokenizer(num_words=1000)

# Build the word index.
tokenizer.fit_on_texts(samples)

# Convert the strings into lists of integer indices.
sequences = tokenizer.texts_to_sequences(samples)

# Get the one-hot binary representation of each sample directly.
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

word_index = tokenizer.word_index
print('Found %s unique tokens.'% len(word_index))

In [None]:
print(tokenizer.index_word)

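Word-level one-hot encoding with the hashing trick: it saves memory because no explicit word index is kept, but it is subject to hash collisions.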

In [None]:
samples = ['The cat sat on the mat.','The dog ate my homework.']
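# Store words as vectors of size 1000. Collisions become more likely as the
# number of distinct words approaches this value.
dimensionality = 1000
max_length = 10
results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into an integer index between 0 and dimensionality - 1.
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.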

print(results)
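
Python's built-in `hash` is salted per process for strings, so the indices produced above can change between runs unless `PYTHONHASHSEED` is fixed. Below is a minimal sketch of a reproducible variant, assuming an MD5 digest is an acceptable stand-in hash (`stable_index` is a hypothetical helper, not from the original notebook). It also demonstrates collisions by deliberately choosing a very small number of buckets.

```python
import hashlib

import numpy as np


def stable_index(word, dimensionality):
    """Map a word to a bucket in [0, dimensionality) reproducibly across runs."""
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % dimensionality


samples = ['The cat sat on the mat.', 'The dog ate my homework.']
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        results[i, j, stable_index(word, dimensionality)] = 1.

# With far fewer buckets than distinct words, collisions are unavoidable.
vocabulary = {word for sample in samples for word in sample.split()}
small = 4
buckets = {stable_index(word, small) for word in vocabulary}
print(len(vocabulary), 'distinct words share', len(buckets), 'buckets')
```

Once two words collide, the model cannot tell them apart, which is the price paid for not storing an explicit index.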

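One-hot vectors: sparse, high-dimensional, hard-coded
Word embeddings: dense, low-dimensional, learned from data

There are two ways to obtain word embeddings:
1. Learn them jointly with the main task
2. Use pretrained word embeddings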

### Using the Embedding layer
Learn a new embedding space for each task.

In [None]:
from keras.layers import Embedding
# Arguments: the number of possible tokens (1000) and the embedding dimensionality (64).
embedding_layer = Embedding(1000, 64)


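Think of the Embedding layer as a dictionary that maps integer indices (each standing for a specific word) to dense vectors.
Input: a 2D integer tensor of shape (samples, sequence_length).
Output: a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality), which can then be processed by an RNN layer or a 1D convolutional layer.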
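
The dictionary view can be made concrete with plain NumPy. This is a sketch of the lookup only, assuming a randomly initialized table in place of learned weights; `embedding_table` and `token_ids` are illustrative names, not Keras attributes.

```python
import numpy as np

vocab_size, embedding_dim = 1000, 64
rng = np.random.default_rng(0)
# Stand-in for the layer's weight matrix; in Keras these values are learned.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 samples, each a sequence of 5 integer word indices.
token_ids = np.array([[1, 5, 9, 2, 0],
                      [7, 7, 3, 0, 0]])

# Lookup: each integer selects one row of the table.
embedded = embedding_table[token_ids]
print(embedded.shape)  # (2, 5, 64)

# The same result via one-hot vectors: multiplying a one-hot row by the
# table picks out exactly one row of the table.
one_hot = np.eye(vocab_size)[token_ids]  # shape (2, 5, 1000)
assert np.allclose(one_hot @ embedding_table, embedded)
```

The lookup avoids ever materializing the sparse 1000-dimensional one-hot vectors, which is why it is the cheaper of the two equivalent forms.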

In [30]:
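from keras.datasets import imdb
# Changed: import the sequence utilities from tensorflow.keras
from tensorflow.keras.preprocessing import sequence

max_features = 10000  # number of words to consider as features
maxlen = 20           # keep only 20 words of each review

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# print(x_train) shows a list of integer sequences.
# Turn the lists into 2D integer tensors of shape (samples, maxlen).
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)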
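
With its default arguments, `pad_sequences` pads and truncates at the start of each sequence, so `maxlen=20` keeps only the last 20 words of every review. A NumPy sketch of that behavior, written here for illustration (`pad_pre` is a hypothetical helper, not the Keras implementation):

```python
import numpy as np


def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype='int32')
    for row, seq in enumerate(seqs):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]          # truncate from the front
        out[row, -len(kept):] = kept  # pad on the left
    return out


print(pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4))
```

The first row is padded to `[0, 1, 2, 3]` and the second is cut down to `[6, 7, 8, 9]`.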

In [31]:
from keras.models import Sequential
from keras.layers import Flatten, Dense,Embedding

model= Sequential()
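# After the Embedding layer, the activations have shape (samples, maxlen, 8).
model.add(Embedding(10000, 8, input_length=maxlen))

# Flatten the 3D tensor of embeddings into a 2D tensor of shape (samples, maxlen * 8).
model.add(Flatten())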

model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['acc'])
model.summary()

history = model.fit(x_train,y_train,epochs=10,batch_size=32,validation_split=0.2)

print(history.history.keys())
print(history.history.values())


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 8)             80000     
                                                                 
 flatten (Flatten)           (None, 160)               0         
                                                                 
 dense (Dense)               (None, 1)                 161       
                                                                 
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
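
The parameter counts in the summary follow directly from the layer shapes and can be checked with a few lines of arithmetic:

```python
max_features, embedding_dim, maxlen = 10000, 8, 20

embedding_params = max_features * embedding_dim  # one 8-d vector per word index
flattened_units = maxlen * embedding_dim         # (20, 8) flattened to 160
dense_params = flattened_units * 1 + 1           # 160 weights + 1 bias

print(embedding_params, flattened_units, dense_params)  # 80000 160 161
print(embedding_params + dense_params)                  # 80161
```

Almost all of the model's capacity sits in the embedding table; the classifier on top has only 161 parameters.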



