# Purpose
> In this section we look more closely at RNNs (recurrent neural networks) for sequence data (words after they have been tokenized). The classic RNN architecture is the LSTM (long short-term memory).

In [14]:
import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.preprocessing.text as text
import tensorflow.keras.preprocessing.sequence as sequence
import matplotlib.pyplot as plt
import numpy as np
import json

In [2]:
!ls

00_Tokenization.ipynb  01_Word_Embedding.ipynb	02_RNN.ipynb


# Load the data
> We will again use the Sarcasm dataset for the RNN and LSTM examples here.

In [3]:
file_name = '/home/ddpham/git/TFExam/data/Sarcasm_Headlines_Dataset.json'
sentences = []
labels = []
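with open(file_name, 'r') as file:
    # The file is in JSON Lines format: one JSON object per line.
    for line in file:
        data = json.loads(line)
        sentences.append(data['headline'])
        labels.append(data['is_sarcastic'])
# No explicit close() is needed: the with block closes the file.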

In [4]:
len(sentences)

26709

# Tokenizer

In [5]:
# Tokenize:
num_words = 1000
oov_tok = 'UNK'
train_size = 20000

tokenizer = text.Tokenizer(num_words=num_words, oov_token=oov_tok)
train_sentences = sentences[:train_size]
valid_sentences = sentences[train_size:]
train_labels = np.array(labels[:train_size])
valid_labels = np.array(labels[train_size:])

tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index
train_sequences = tokenizer.texts_to_sequences(train_sentences)
valid_sequences = tokenizer.texts_to_sequences(valid_sentences)

# Pad:
max_len = 20
embed_dim = 16
pad_type = 'post'
trunc_type = 'post'

train_sequences = sequence.pad_sequences(train_sequences, maxlen=max_len, padding=pad_type, truncating=trunc_type)
valid_sequences = sequence.pad_sequences(valid_sequences, maxlen=max_len, padding=pad_type, truncating=trunc_type)
train_sequences.shape, valid_sequences.shape

((20000, 20), (6709, 20))
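
To make the `padding='post'` and `truncating='post'` settings above concrete, here is a minimal pure-Python sketch of what they do to a single sequence. The helper `pad_post` is a hypothetical name used only for illustration; it is not how Keras implements `pad_sequences`.

```python
def pad_post(seq, maxlen, value=0):
    """Illustrative stand-in for pad_sequences(..., padding='post', truncating='post')."""
    trimmed = list(seq)[:maxlen]                          # 'post' truncation drops tokens from the end
    return trimmed + [value] * (maxlen - len(trimmed))    # 'post' padding appends zeros at the end

short = pad_post([5, 3, 8], maxlen=5)
long_ = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [5, 3, 8, 0, 0]
print(long_)  # [1, 2, 3, 4, 5]
```

Keras defaults to `'pre'` for both settings, so they have to be passed explicitly to get this behavior.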

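# Build the model
> We will build the model by combining `Bidirectional` and `LSTM`. Note that `Bidirectional` is only a wrapper around a recurrent layer such as `LSTM`: it runs one copy of the layer over the sequence from start to end and another from end to start, then combines their outputs (by default by concatenating them). The wrapped layer must meet the criteria listed in the docstring below: it has to be a sequence-processing layer that exposes `go_backwards`, `return_sequences` and `return_state`.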
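
The idea of a bidirectional wrapper can be sketched in NumPy. This is a simplified illustration, not the Keras implementation: it uses a plain tanh recurrent cell instead of an LSTM, and the function names and random weights are assumptions made up for the example. It runs one cell forward, one cell over the reversed sequence, and concatenates the two final states.

```python
import numpy as np

def simple_rnn_last_state(x, w_in, w_rec, bias):
    """Run a plain tanh RNN over x (timesteps, features) and return the final hidden state."""
    h = np.zeros(w_rec.shape[0])
    for x_t in x:
        h = np.tanh(x_t @ w_in + h @ w_rec + bias)
    return h

def bidirectional_last_state(x, fwd_params, bwd_params):
    """Concatenate the final states of a forward pass and a backward pass."""
    h_fwd = simple_rnn_last_state(x, *fwd_params)        # reads the sequence start -> end
    h_bwd = simple_rnn_last_state(x[::-1], *bwd_params)  # reads the sequence end -> start
    return np.concatenate([h_fwd, h_bwd])

rng = np.random.default_rng(0)
timesteps, features, units = 20, 16, 20  # same sizes as max_len, embed_dim and the LSTM units

def init_params():
    # Hypothetical random weights, only for demonstrating shapes.
    return (rng.normal(scale=0.1, size=(features, units)),
            rng.normal(scale=0.1, size=(units, units)),
            np.zeros(units))

x = rng.normal(size=(timesteps, features))
out = bidirectional_last_state(x, init_params(), init_params())
print(out.shape)  # (40,)
```

Each direction has its own weights, which is why the output has twice as many features as a single recurrent layer with the same number of units.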

In [29]:
??keras.layers.Bidirectional

Init signature: keras.layers.Bidirectional(*args, **kwargs)
Source:
class Bidirectional(Wrapper):
  """Bidirectional wrapper for RNNs.

  Arguments:
    layer: `keras.layers.RNN` instance, such as `keras.layers.LSTM` or
      `keras.layers.GRU`. It could also be a `keras.layers.Layer` instance
      that meets the following criteria:
      1. Be a sequence-processing layer (accepts 3D+ inputs).
      2. Have a `go_backwards`, `return_sequences` and `return_state`
        attribute (with the same semantics as for the `RNN` class).
      3. Have an `input_spec` attribute.
      4. Implement serialization via `get_config()`

In [6]:
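model = keras.Sequential([
    # Map each of the num_words token ids to an embed_dim-dimensional vector.
    keras.layers.Embedding(num_words, embed_dim, input_length=max_len),
    # 20 LSTM units per direction (max_len is reused here as the unit count);
    # the two directions are concatenated, giving 40 output features.
    keras.layers.Bidirectional(keras.layers.LSTM(max_len)),
    keras.layers.Dense(max_len, activation='relu'),
    # Single sigmoid unit for the binary sarcastic / not sarcastic label.
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()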

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 20, 16)            16000     
_________________________________________________________________
bidirectional (Bidirectional (None, 40)                5920      
_________________________________________________________________
dense (Dense)                (None, 20)                820       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 21        
Total params: 22,761
Trainable params: 22,761
Non-trainable params: 0
_________________________________________________________________
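
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for an embedding, an LSTM with bias (four gates, each with input weights, recurrent weights and a bias) and a dense layer. The helper function names are hypothetical and exist only for this check.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

num_words, embed_dim, units = 1000, 16, 20

counts = {
    "embedding": embedding_params(num_words, embed_dim),
    "bidirectional": 2 * lstm_params(embed_dim, units),  # one LSTM per direction
    "dense": dense_params(2 * units, 20),                # input is the 40-wide concatenation
    "dense_1": dense_params(20, 1),
}
total = sum(counts.values())
print(counts)
print(total)  # 22761
```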


In [7]:
epochs=10
model.fit(train_sequences, train_labels, epochs=epochs, validation_data=(valid_sequences, valid_labels))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f4b44058ac0>

We can see that after just 2 epochs the results are already better than those from the previous section (Word-Embedding).
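
To compare epochs directly rather than by reading the log, the `History` object returned by `model.fit` can be inspected: its `history` attribute is a dict of per-epoch metric lists. The helper below is a small sketch. The name `best_epoch` is hypothetical, and the numbers in `toy_history` are made up purely to exercise the function; they are not results from this notebook.

```python
def best_epoch(history_dict, metric="val_accuracy"):
    """Return (1-based epoch, value) at which the given metric peaks."""
    values = history_dict[metric]
    idx = max(range(len(values)), key=lambda i: values[i])
    return idx + 1, values[idx]

# Typical use in this notebook (not run here):
#   history = model.fit(train_sequences, train_labels, epochs=epochs,
#                       validation_data=(valid_sequences, valid_labels))
#   print(best_epoch(history.history))

toy_history = {"val_accuracy": [0.70, 0.80, 0.78]}  # made-up values, for illustration only
result = best_epoch(toy_history)
print(result)  # (2, 0.8)
```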

# Subword Tokenizer
> In the previous lesson we mentioned subword tokenizers but did not actually use one. Here we discuss them in more detail. Subword tokenization is the approach used by BERT.
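
As a rough illustration of what a subword tokenizer does, here is a minimal pure-Python sketch of the greedy longest-match-first splitting used by WordPiece, the scheme associated with BERT. The vocabulary `toy_vocab` is invented for this example, and the function is a simplified sketch rather than the tensorflow-text implementation. Continuation pieces carry the `##` prefix, and a word that cannot be fully covered becomes the unknown token.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"un", "##believ", "##able", "sarcas", "##m", "##tic"}

print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("sarcasm", toy_vocab))       # ['sarcas', '##m']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Compared with the word-level `Tokenizer` used earlier, this lets a rare word be represented by known pieces instead of collapsing to a single out-of-vocabulary token.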

To use a subword tokenizer, we need to install tensorflow-text.

In [16]:
!pip install tensorflow-text

Collecting tensorflow-text
  Using cached tensorflow_text-2.4.3-cp38-cp38-manylinux1_x86_64.whl (3.4 MB)
Collecting tensorflow-hub>=0.8.0
  Using cached tensorflow_hub-0.12.0-py2.py3-none-any.whl (108 kB)
Collecting opt-einsum~=3.3.0
  Using cached opt_einsum-3.3.0-py3-none-any.whl (65 kB)
Collecting flatbuffers~=1.12.0
  Using cached flatbuffers-1.12-py2.py3-none-any.whl (15 kB)
Collecting grpcio~=1.32.0
  Using cached grpcio-1.32.0-cp38-cp38-manylinux2014_x86_64.whl (3.8 MB)
Installing collected packages: grpcio, opt-einsum, flatbuffers, tensorflow-hub, tensorflow-text
  Attempting uninstall: grpcio
    Found existing installation: grpcio 1.36.1
    Uninstalling grpcio-1.36.1:
      Successfully uninstalled grpcio-1.36.1
  Attempting uninstall: opt-einsum
    Found existing installation: opt-einsum 3.1.0
    Uninstalling opt-einsum-3.1.0:
      Successfully uninstalled opt-einsum-3.1.0
  Attempting uninstall: flatbuffers
    Found existing installation: flatbuffers 20210226132247
   

In [17]:
dir(text)

['Tokenizer',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_sys',
 'hashing_trick',
 'one_hot',
 'text_to_word_sequence',
 'tokenizer_from_json']

In [13]:
!conda list

# packages in environment at /home/ddpham/miniconda3/envs/tf:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_tflow_select             2.1.0                       gpu  
absl-py                   0.12.0           py38h06a4308_0  
aiohttp                   3.7.4            py38h27cfd23_1  
anyio                     2.2.0            py38h06a4308_1  
argon2-cffi               20.1.0           py38h27cfd23_1  
astunparse                1.6.3                      py_0  
async-timeout             3.0.1            py38h06a4308_0  
async_generator           1.10               pyhd3eb1b0_0  
attrs                     20.3.0             pyhd3eb1b0_0  
babel                     2.9.0              pyhd3eb1b0_0  
backcall                  0.2.0              pyhd3eb1b0_0  
blas                      1.0                         mkl  
bleach                    3.3.0              pyhd3eb1b0_0  
blinker                   1.4

In [26]:
import tensorflow_text as tftext

NotFoundError: /home/ddpham/miniconda3/envs/tf/lib/python3.8/site-packages/tensorflow_text/python/metrics/_text_similarity_metric_ops.so: undefined symbol: _ZNK10tensorflow8OpKernel11TraceStringERKNS_15OpKernelContextEb
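
An undefined-symbol error like this one is often a sign that the installed tensorflow-text build does not match the installed TensorFlow version, although that is only an assumption about this environment. A quick way to check is to compare the major.minor versions of the two packages. The helper names below are hypothetical, and `importlib.metadata` may return nothing for packages that were installed without pip metadata.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if no metadata is found."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def same_minor(version_a, version_b):
    """True when both versions share the same major.minor prefix, e.g. 2.4.x and 2.4.y."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]

tf_version = installed_version("tensorflow")
text_version = installed_version("tensorflow-text")
if tf_version and text_version:
    print(tf_version, text_version, "match:", same_minor(tf_version, text_version))
else:
    print("Could not read both versions:", tf_version, text_version)
```

If the versions differ, installing the tensorflow-text release with the same major.minor number as TensorFlow is the usual thing to try.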

In [None]:
subword_tokenizer = text.