A binary classification example that labels the text of 50,000 movie reviews collected from IMDB as positive or negative.

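TensorFlow Hub (tensorflow_hub): a library of reusable machine learning modules.

TensorFlow Datasets (tensorflow_datasets): a collection of datasets prepared for use with TensorFlow.
- It covers many kinds of data, including audio, image, and text.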

In [2]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("TFDS version: ", tfds.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU ", "available" if tf.config.experimental.list_physical_devices("GPU") else "not available")

TFDS version:  2.1.0
Eager mode:  True
Hub version:  0.8.0
GPU  available


In [3]:
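# Take 30% of the training split (train[20%:50%]) and the second half of the test split (test[50%:]).
# IMDB has 25,000 training and 25,000 test reviews, so this yields 7,500 training samples
# and 12,500 test samples. No validation set is created here.
train_data, test_data = tfds.load(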
    name="imdb_reviews", 
    split=('train[20%:50%]', 'test[50%:]'),
    as_supervised=True)
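
The sample counts implied by the split strings can be checked by hand. The sketch below is plain Python and does not call TFDS; the helper name `slice_size` is hypothetical, and the integer arithmetic is a simplification of how TFDS rounds percent boundaries (for these sizes the result is exact either way).

```python
def slice_size(total, start_pct, end_pct):
    """Number of examples in a percent slice such as 'train[20%:50%]'."""
    start = total * start_pct // 100
    end = total * end_pct // 100
    return end - start

# The IMDB reviews dataset ships 25,000 training and 25,000 test examples.
IMDB_TRAIN_SIZE = 25_000
IMDB_TEST_SIZE = 25_000

train_count = slice_size(IMDB_TRAIN_SIZE, 20, 50)   # 'train[20%:50%]'
test_count = slice_size(IMDB_TEST_SIZE, 50, 100)    # 'test[50%:]'
print(train_count, test_count)
```

To hold out a validation set as well, a third split string (for example a different slice of `train`) could be added to the `split` tuple.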


Print the first 10 data samples.

In [3]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
train_examples_batch

<tf.Tensor: shape=(10,), dtype=string, numpy=
array([b'......this film is pretty awful, the only thing stopping me from giving it a rating of 1 was the fact that I unfortunately have seen worse.<br /><br />The jungle music, juttering demons, and fluorescent UV style blood/teeth/eyes give it that "awful" look, and the script is dire.....this film is more like a test to see how long you can last before giving up on it. It\'s also predictable but not in a good way. Nothing this film does is in a good way. I watched it 10 minutes ago and thought I would rant a bit so there you are. (oh and the acting doesn\'t let the film down, it\'s also terrible)',
       b"Think of the ending of the Grudge 2 with the following :<br /><br />- a man who repeatedly says the word Sunshine - a cowboy - a love story - Sarah Michelle Gellar cutting herself - and a creepy mirror<br /><br />OH AND UNDERWATER SEA ANIMALS...yay...<br /><br />not a good movie... I seriously did not enjoy it whatsoever. The poster f

In [None]:
A label of 0 marks a negative review and 1 a positive review.

In [4]:
train_labels_batch

<tf.Tensor: shape=(10,), dtype=int64, numpy=array([0, 0, 1, 1, 0, 0, 1, 0, 0, 1], dtype=int64)>
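
As a small illustration of the label encoding, the ten labels printed above can be mapped to readable names. This is a plain Python sketch; `LABEL_NAMES` is a hypothetical helper, not part of TFDS.

```python
# Hypothetical lookup table for the 0/1 encoding used by imdb_reviews.
LABEL_NAMES = {0: "negative", 1: "positive"}

# Labels copied from the batch output above.
labels = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1]

names = [LABEL_NAMES[label] for label in labels]
positive_share = sum(labels) / len(labels)

print(names)
print("share of positive reviews in this batch:", positive_share)
```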

In [None]:
Word embedding is performed with "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1", a text embedding model pre-trained on Google News. It maps each review string to a 20-dimensional vector.

In [5]:
# Keras layer that wraps the TensorFlow Hub model
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

<tf.Tensor: shape=(3, 20), dtype=float32, numpy=
array([[ 2.080595  , -3.519821  ,  3.820798  ,  0.17795023, -5.7022676 ,
        -4.0035257 , -4.093607  ,  1.0823902 ,  3.7320447 , -1.6127218 ,
        -2.829569  ,  1.5775235 ,  1.0270905 ,  0.35990515, -4.59064   ,
         2.495925  ,  4.2067194 , -1.7372451 , -3.1296098 , -2.2202034 ],
       [ 1.5430363 , -2.7185216 ,  2.4919388 ,  0.12749232, -3.7062044 ,
        -1.839461  , -2.4596212 ,  1.6462464 ,  3.672319  , -0.14821526,
        -3.3468337 ,  0.6979172 ,  0.35498896, -0.18110909, -2.6257515 ,
         0.8833957 ,  2.676844  , -0.6901346 , -2.4871542 , -1.2647508 ],
       [ 1.1911206 , -3.231674  ,  2.916924  , -0.6853074 , -3.8457842 ,
        -2.08943   , -1.9154665 ,  0.01646899,  2.1377332 , -0.74050987,
        -3.3031478 ,  1.0411956 , -1.2922063 ,  0.43493563, -4.214861  ,
         1.7607119 ,  3.2658105 , -2.4540887 , -3.0205731 , -0.5040016 ]],
      dtype=float32)>
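
To give an intuition for how a whole review can become one fixed-size vector, here is a minimal sketch. It assumes the module splits text on spaces, looks up one vector per token, and combines them by summing and dividing by the square root of the token count (a "sqrt-n" combiner); treat that description as an assumption rather than a statement about the module's internals. The vocabulary, the vectors, and the function `embed_sentence` are all made up for illustration.

```python
import math

# Hypothetical 3-dimensional toy embeddings; the real module outputs 20 dimensions.
TOY_VECTORS = {
    "good": [1.0, 0.0, 2.0],
    "movie": [0.0, 1.0, 1.0],
}
# Assumption: every out-of-vocabulary token shares a single vector.
OOV_VECTOR = [0.0, 0.0, 0.0]

def embed_sentence(sentence):
    """Combine per-token vectors into one sentence vector (sum / sqrt(n))."""
    tokens = sentence.split(" ")
    vectors = [TOY_VECTORS.get(token, OOV_VECTOR) for token in tokens]
    scale = math.sqrt(len(vectors))
    return [sum(column) / scale for column in zip(*vectors)]

vec = embed_sentence("good movie")
print(vec)
```

Whatever the sentence length, the result has the same dimensionality, which is why the layer output above has shape (3, 20) for three reviews.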

In [6]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 20)                400020    
_________________________________________________________________
dense (Dense)                (None, 16)                336       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________
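
The parameter counts in the summary can be reproduced by hand. A Dense layer has one weight per input-unit pair plus one bias per unit. For the hub layer, the figure 400,020 factors as 20,001 x 20; reading that as 20,001 embedding rows of 20 dimensions is an inference from the number, not something the summary states. `dense_params` is a hypothetical helper.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units

# Assumption: 20,001 rows of 20-dimensional vectors, which matches 400,020.
embedding_params = 20_001 * 20
hidden_params = dense_params(20, 16)   # dense
output_params = dense_params(16, 1)    # dense_1
total_params = embedding_params + hidden_params + output_params

print(embedding_params, hidden_params, output_params, total_params)
```

Because the hub layer was created with trainable=True, all of these parameters are updated during training, which is why the summary reports zero non-trainable parameters.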


In [7]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
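
For reference, binary cross-entropy averages -(y*log(p) + (1-y)*log(1-p)) over the samples, where y is the 0/1 label and p is the sigmoid output. The following is a plain Python sketch of that formula, not the Keras implementation; the clipping constant `eps` and the example probabilities are assumptions chosen for illustration.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical predictions: confident and correct vs. completely unsure.
loss_confident = binary_crossentropy([1, 0], [0.9, 0.1])
loss_unsure = binary_crossentropy([1, 0], [0.5, 0.5])
print(loss_confident, loss_unsure)
```

The loss shrinks as the predicted probability moves toward the true label, which suits a model whose last layer is a single sigmoid unit.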

In [8]:
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
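
The number of gradient steps per epoch follows from the training set size and the batch size. The sketch below assumes the 7,500 training samples implied by 'train[20%:50%]' of a 25,000-example split; the variable names are illustrative.

```python
import math

TRAIN_SAMPLES = 7_500      # assumption: 30% of 25,000 training reviews
BATCH_SIZE = 512
SHUFFLE_BUFFER = 10_000
EPOCHS = 20

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
last_batch_size = TRAIN_SAMPLES - (steps_per_epoch - 1) * BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS
# A buffer at least as large as the dataset gives a full shuffle.
full_shuffle = SHUFFLE_BUFFER >= TRAIN_SAMPLES

print(steps_per_epoch, last_batch_size, total_steps, full_shuffle)
```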


In [9]:
results = model.evaluate(test_data.batch(512), verbose=2)
for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

loss: 0.394
accuracy: 0.829
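
The accuracy metric compares thresholded sigmoid outputs with the true labels. A minimal plain Python sketch, assuming a 0.5 threshold and using made-up labels and probabilities (these are not results from the model above):

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == y) for pred, y in zip(predictions, y_true))
    return correct / len(y_true)

# Hypothetical labels and predicted probabilities.
acc = accuracy([1, 0, 1, 0], [0.8, 0.3, 0.4, 0.6])
print(acc)
```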
