# TF2 RNN: Recurrent Neural Network

### RNN
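* Recurrent Neural Network (RNN)
* Used when past behavior influences the next decision
* Works on ordered (sequential) data rather than fixed, order-free data
    * Example: "안녕, what's your name?" vs. "안녕, I had a great time." The Korean greeting 안녕 means "hello" in the first sentence and "goodbye" in the second; only the words that follow disambiguate it.
* Google Translate service
    * Seq2Seq neural network model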
![image.png](https://i.imgur.com/Ot9qXuc.png)

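### Recurrent Neural Networks
* Ways to process sequence data
    * RNN
    * 1D Convnet
* Typical use cases
    * Document classification, time-series classification
    * Sentiment analysis
* Key property
    * The previous state is carried over to the next step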
    ```python
    state = 0
    for input in inputs:
        output, state = rnn_cell(input, state)
    ```
* RNN architecture types
![image.png](https://i.imgur.com/MAesSJV.png)
* Improved RNN variants
    * LSTM (Long Short-Term Memory)
    * GRU (Gated Recurrent Unit)

### Vanilla RNN
* The state consists of a single hidden vector $h$
* $h_t = f_W(h_{t-1}, x_t)$
    * $h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b_h)$
    * $y_t = W_{hy}h_t + b_y$
![image.png](https://i.imgur.com/Q1xfRVJ.png)
* Trainable variables
    * $W_{xh}$, $W_{hh}$, $b_h$
    * $W_{hy}$, $b_y$
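
The recurrence above can be sketched without any framework. The block below is a minimal illustration that uses plain Python lists instead of tensors; the helper names (`matvec`, `rnn_step`, `rnn_output`) and all weight values are hypothetical and chosen only to make the shapes easy to follow.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(h_prev, x, W_hh, W_xh, b_h):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)"""
    a = matvec(W_hh, h_prev)
    b = matvec(W_xh, x)
    return [math.tanh(ai + bi + ci) for ai, bi, ci in zip(a, b, b_h)]

def rnn_output(h, W_hy, b_y):
    """y_t = W_hy h_t + b_y"""
    return [yi + bi for yi, bi in zip(matvec(W_hy, h), b_y)]

# Toy weights (arbitrary values): 2 hidden units, 3-dimensional one-hot inputs
W_hh = [[0.5, -0.1], [0.2, 0.3]]
W_xh = [[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5]]
b_h = [0.0, 0.0]
W_hy = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_y = [0.0, 0.0, 0.0]

inputs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
h = [0.0, 0.0]  # initial hidden state h_0
outputs = []
for x in inputs:
    h = rnn_step(h, x, W_hh, W_xh, b_h)       # the state carries over to the next step
    outputs.append(rnn_output(h, W_hy, b_y))  # one output per time step (many to many)

print(outputs)
```

The same five parameters ($W_{xh}$, $W_{hh}$, $b_h$, $W_{hy}$, $b_y$) are reused at every time step; only the hidden state changes.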

### Vanilla RNN Graph
![image.png](https://i.imgur.com/Szpv2aU.png)

### Vanilla RNN Exercise
* Build the RNN structure by hand and predict the next character
    * Input: "hihell"
    * Output: "ihello"
    * many to many
![image.png](https://i.imgur.com/F8toAhT.png)

#### Preparing the input/output data

In [None]:
import tensorflow as tf
import numpy as np

sentence = "hihello"
x = "hihell"
y = "ihello"

voca = list({c for c in sentence})
dic = {c:i for i,c in enumerate(voca)}
dic2idx = {i:c for i, c in enumerate(voca)}
print(dic, dic2idx)

n_class = len(dic)     # vocabulary size (number of distinct characters)
n_time_steps = len(x)  # sequence length
hidden_size = n_class  # RNN state size, set equal to the vocabulary size here

x_idx = [dic[c] for c in x]
print("x_idx:", x_idx)
y_idx = [dic[c] for c in y]
print("y_idx:", y_idx)

x_enc =tf.keras.utils.to_categorical(x_idx, num_classes=n_class)
x_enc = np.expand_dims(x_enc, axis=0)
print("x_enc:", x_enc, x_enc.shape, )

y_enc = tf.keras.utils.to_categorical(y_idx, num_classes=n_class)
y_enc = np.expand_dims(y_enc, axis=0)
print("y_enc:", y_enc, y_enc.shape)

#### Preparing the hidden state

In [None]:

initializer =  tf.keras.initializers.GlorotUniform() #xavier

Wx = tf.Variable(initializer([n_class, hidden_size]), name="Wx" )
Wh = tf.Variable(initializer([hidden_size, hidden_size]), name="Wh" )
bh = tf.Variable(initializer([hidden_size]), name="bias_h" )

#### Computing the hidden state

In [None]:
def rnn_step(previous_hidden_state, x):
    current_hidden_state = tf.tanh(
        tf.matmul(previous_hidden_state, Wh) + 
        tf.matmul(x, Wx) + bh)
    return current_hidden_state


#### Computing the output

In [None]:
Wy = tf.Variable(initializer([hidden_size, n_class]))
by = tf.Variable(initializer([n_class]))

def get_linear_layer(hidden_state):
    return tf.matmul(hidden_state, Wy) + by


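#### Training the network

* `tf.scan(fn, elems)`
    * Iterates over `elems`, passing each element to `fn`
    * Each call to `fn` receives the previous return value of `fn` and the next element of `elems`
    * Returns all of the `fn` return values stacked together
    * Without an `initializer`, the first element of `elems` serves as the initial value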
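
The accumulate-and-collect behavior described above can be previewed with the standard library. This is only an analogy in plain Python, not TensorFlow itself: `itertools.accumulate` plays the role of `tf.scan`, and the names `f`, `running`, and `seeded` are illustrative.

```python
from itertools import accumulate

def f(prev, nxt):
    # Receives the previous return value and the next element
    return prev + nxt

data = [1, 2, 3, 4, 5]

# No initial value: data[0] seeds the accumulator,
# much like tf.scan when no `initializer` is given.
running = list(accumulate(data, f))

# Explicit initial value: comparable to passing `initializer=` to tf.scan,
# except that accumulate also emits the initial value as its first item.
seeded = list(accumulate(data, f, initial=0))

print(running)
print(seeded)
```

In the RNN setting, `prev` corresponds to the previous hidden state and `nxt` to the input at the current time step.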

In [None]:
# Example illustrating how tf.scan() works
def f(prev, next):
    print(prev, next)
    return prev + next

data = np.arange(5).reshape(-1,1)+1
ret = tf.scan(f, data)
print(ret)

In [None]:

learning_rate = 0.001
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
n_epoch = 500
for step in range(n_epoch):
    with tf.GradientTape() as tape:
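        X_ = tf.transpose(x_enc, perm=[1, 0, 2])  # (batch, time, class) -> (time, batch, class)
        # Start from a zero hidden state h_0; without an initializer tf.scan would use X_[0] as the first state
        init_hidden = tf.zeros([1, hidden_size])
        all_hidden_states = tf.scan(rnn_step, X_, initializer=init_hidden, name='states')
        all_outputs = tf.map_fn(get_linear_layer, all_hidden_states)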
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=all_outputs, labels=y_enc))
    grads = tape.gradient(cost, [Wy, by, Wx, Wh,  bh])
    optimizer.apply_gradients(grads_and_vars=zip(grads, [Wy, by, Wx, Wh,  bh]))
    if (step+1) % 20 == 0:
            if step+ 1== n_epoch :
                print(all_outputs)
            prediction = tf.argmax(all_outputs, axis=2)
            print("step:{}, cost:{}, predict:{}, str:{}".format(step, cost, np.squeeze(prediction), [dic2idx[i] for i in np.squeeze(prediction)]))


### TF RNN API
* `cell = tf.keras.layers.SimpleRNNCell(units=hidden_size)`
    * Declares the variables and provides the structure needed to implement an RNN
    * `hidden_size`: size of the output and of the state
* `rnn = tf.keras.layers.RNN(cell, return_sequences=True, return_state=True)`
* `outputs, states = rnn(X)`
    * Takes the cell and applies it repeatedly over the time steps (as `tf.scan()` did above) to compute the outputs and the state
    * `rnn.trainable_variables`: returns the variables to train
* `rnn = tf.keras.layers.SimpleRNN(units=hidden_size, return_sequences=True, return_state=True)`
    * SimpleRNNCell + RNN
    * Plays the roles of the cell and the RNN wrapper at once

In [None]:
import tensorflow as tf
import numpy as np


x_data = np.array([[[1,2,3,4],
                    [5,6,7,8]]], dtype=np.float32) #(1,2,4) : (batch, time_step, depth)
print(x_data.shape)

hidden_size = 2
cell = tf.keras.layers.SimpleRNNCell(hidden_size)

print(cell.output_size, cell.state_size)
rnn = tf.keras.layers.RNN(cell, return_sequences=True, return_state=True)
outputs, return_state = rnn(x_data)
print(outputs, return_state)
print(rnn.trainable_variables)


#### Implementing "hihello" with the TF RNN API

In [None]:
import tensorflow as tf
import numpy as np

sentence = "hihello"
x = "hihell"
y = "ihello"

voca = list({c for c in sentence})
dic = {c:i for i,c in enumerate(voca)}
dic2idx = {i:c for i, c in enumerate(voca)}
print(dic, dic2idx)

n_class = len(dic)     # vocabulary size (number of distinct characters)
n_time_steps = len(x)  # sequence length
hidden_size = n_class  # RNN state size, set equal to the vocabulary size here

x_idx = [dic[c] for c in x]
print("x_idx:", x_idx)
y_idx = [dic[c] for c in y]
print("y_idx:", y_idx)

x_enc =tf.keras.utils.to_categorical(x_idx, num_classes=n_class)
x_enc = np.expand_dims(x_enc, axis=0)
print("x_enc:", x_enc, x_enc.shape, )

y_enc = tf.keras.utils.to_categorical(y_idx, num_classes=n_class)
y_enc = np.expand_dims(y_enc, axis=0)
print("y_enc:", y_enc, y_enc.shape)

In [None]:
initializer =  tf.initializers.GlorotUniform() #xavier
Wy = tf.Variable(initializer([hidden_size, n_class]))
by = tf.Variable(initializer([n_class]))

def get_linear_layer(hidden_state):
    return tf.matmul(hidden_state, Wy) + by


learning_rate = 0.1
#cell = tf.keras.layers.SimpleRNNCell(hidden_size)
#rnn = tf.keras.layers.RNN(cell, return_sequences=True, return_state=True)
rnn = tf.keras.layers.SimpleRNN(hidden_size, return_sequences=True, return_state=True)

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
n_epoch = 500
for step in range(n_epoch):
    with tf.GradientTape() as tape:
        outputs, states = rnn(x_enc)
        all_outputs = tf.map_fn(get_linear_layer, outputs)
        variables = rnn.trainable_variables + [Wy, by]
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=all_outputs, labels=y_enc))
    grads = tape.gradient(cost, variables)
    optimizer.apply_gradients(zip(grads, variables))
    if (step+1) % 20 == 0:
        prediction = tf.argmax(all_outputs, axis=2)
        print("step:{}, cost:{}, predict:{}, str:{}".format(step, cost, np.squeeze(prediction), [dic2idx[i] for i in np.squeeze(prediction)]))                 

## LSTM
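* Long Short-Term Memory unit
* Proposed by Hochreiter and Schmidhuber (1997) as a variant of the RNN
* Among the most widely used deep learning techniques in natural language processing
* Addresses the plain RNN's tendency to forget information from the start of a sequence by the time it reaches the end
* Adds a cell state alongside the RNN's hidden state
* Lets the error gradient flow back through time more easily
* Can preserve the error signal across more than 1,000 steps of backpropagation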
![image.png](https://i.imgur.com/POpVFUa.png)

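* The hidden layer is organized into four components
    * Cell State: carries earlier information forward to the next step; three gates decide what gets through
    * Forget Gate: decides which information to erase from the existing cell state
    * Input Gate: decides which new information to store in the cell state
    * Output Gate: decides how much of the cell state to expose as the next hidden state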
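
For reference, the four components are commonly written as the equations below. This is the standard textbook formulation rather than something stated in this notebook. The weight names ($W_{xf}$, $W_{hf}$, and so on) are assumptions chosen to match the vanilla RNN notation used earlier, $\sigma$ is the logistic sigmoid, and $\odot$ is element-wise multiplication.

$$
\begin{aligned}
f_t &= \sigma(W_{hf}h_{t-1} + W_{xf}x_t + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_{hi}h_{t-1} + W_{xi}x_t + b_i) && \text{(input gate)} \\
\tilde{c}_t &= \tanh(W_{hc}h_{t-1} + W_{xc}x_t + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
o_t &= \sigma(W_{ho}h_{t-1} + W_{xo}x_t + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
$$

Because $c_t$ is updated additively, the gradient flowing from $c_t$ to $c_{t-1}$ is scaled by $f_t$ instead of passing through a squashing nonlinearity at every step, which is why the error signal can survive many time steps.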

### GRUCell
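* Gated Recurrent Unit (GRU)
* Proposed in 2014 by Kyunghyun Cho (now a professor at New York University) and colleagues
* A variant of the LSTM with a simpler structure
* Merges the forget and input gates into a single update gate
* Merges the cell state and the hidden state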
![image.png](https://i.imgur.com/c2YyJz9.png)
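
One common way to write the GRU update is shown below. It is a textbook formulation, not taken from this notebook. The weight names are assumptions that follow the vanilla RNN notation, and some references swap the roles of $z_t$ and $1 - z_t$ in the last line.

$$
\begin{aligned}
z_t &= \sigma(W_{hz}h_{t-1} + W_{xz}x_t + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_{hr}h_{t-1} + W_{xr}x_t + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh(W_{hh}(r_t \odot h_{t-1}) + W_{xh}x_t + b_h) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(hidden state)}
\end{aligned}
$$

The single gate $z_t$ does the work of both the LSTM forget gate (through $1 - z_t$) and the input gate (through $z_t$), and there is no separate cell state.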

### "hihello" Exercise with LSTM/GRU
* `tf.keras.layers.LSTMCell()`
* `tf.keras.layers.LSTM()`
* `tf.keras.layers.GRUCell()`
* `tf.keras.layers.GRU()`

In [None]:
import tensorflow as tf
import numpy as np

sentence = "hihello"
x = "hihell"
y = "ihello"

voca = list({c for c in sentence})
dic = {c:i for i,c in enumerate(voca)}
dic2idx = {i:c for i, c in enumerate(voca)}
print(dic, dic2idx)

n_class = len(dic)     # vocabulary size (number of distinct characters)
n_time_steps = len(x)  # sequence length
hidden_size = n_class  # RNN state size, set equal to the vocabulary size here

x_idx = [dic[c] for c in x]
print("x_idx:", x_idx)
y_idx = [dic[c] for c in y]
print("y_idx:", y_idx)

x_enc =tf.keras.utils.to_categorical(x_idx, num_classes=n_class)
x_enc = np.expand_dims(x_enc, axis=0)
print("x_enc:", x_enc, x_enc.shape, )

y_enc = tf.keras.utils.to_categorical(y_idx, num_classes=n_class)
y_enc = np.expand_dims(y_enc, axis=0)
print("y_enc:", y_enc, y_enc.shape)


############# Cell selection ##############################
#cell = tf.keras.layers.LSTMCell(hidden_size)
cell = tf.keras.layers.GRUCell(hidden_size)
#rnn = tf.keras.layers.RNN(cell, return_sequences=True)
#rnn = tf.keras.layers.LSTM(hidden_size, return_sequences=True)
rnn = tf.keras.layers.GRU(hidden_size, return_sequences=True)

initializer =  tf.initializers.GlorotUniform() #xavier
Wy = tf.Variable(initializer([hidden_size, n_class]))
by = tf.Variable(initializer([n_class]))

def get_linear_layer(hidden_state):
    return tf.matmul(hidden_state, Wy) + by

learning_rate = 0.001
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

n_epoch = 1000
for step in range(n_epoch):
    with tf.GradientTape() as tape:
        outputs = rnn(x_enc)
        all_outputs = tf.map_fn(get_linear_layer, outputs)
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=all_outputs, labels=y_enc))
    grads = tape.gradient(cost, rnn.trainable_variables + [Wy, by])
    optimizer.apply_gradients(zip(grads, rnn.trainable_variables + [Wy, by]))
    if (step+1) % 20 == 0:
        prediction = tf.argmax(all_outputs, axis=2)
        print("step:{}, cost:{}, predict:{}, str:{}".format(step, cost, np.squeeze(prediction), [dic2idx[i] for i in np.squeeze(prediction)]))                                

## "hihello" Exercise Using a TF Model

In [None]:
import tensorflow as tf
import numpy as np

sentence = "hihello"
x = "hihell"
y = "ihello"

voca = list({c for c in sentence})
dic = {c:i for i,c in enumerate(voca)}
dic2idx = {i:c for i, c in enumerate(voca)}
print(dic, dic2idx)

n_class = len(dic)     # vocabulary size (number of distinct characters)
n_time_steps = len(x)  # sequence length
hidden_size = n_class  # RNN state size, set equal to the vocabulary size here

x_idx = [dic[c] for c in x]
print("x_idx:", x_idx)
y_idx = [dic[c] for c in y]
print("y_idx:", y_idx)

x_enc =tf.keras.utils.to_categorical(x_idx, num_classes=n_class)
x_enc = np.expand_dims(x_enc, axis=0)
print("x_enc:", x_enc, x_enc.shape, )

y_enc = tf.keras.utils.to_categorical(y_idx, num_classes=n_class)
y_enc = np.expand_dims(y_enc, axis=0)
print("y_enc:", y_enc, y_enc.shape)


############# model build ##############################

model = tf.keras.models.Sequential()
#model.add(tf.keras.layers.LSTM(hidden_size, input_shape=(n_time_steps, n_class), return_sequences=True))
model.add(tf.keras.layers.GRU(hidden_size, input_shape=(n_time_steps, n_class), return_sequences=True))
model.add(tf.keras.layers.Dense(hidden_size))
#model.summary()


learning_rate = 0.001
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)


n_epoch = 1000
for step in range(n_epoch):
    with tf.GradientTape() as tape:
        outputs = model(x_enc)
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=outputs, labels=y_enc))
    grads = tape.gradient(cost, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if (step+1) % 20 == 0:
        prediction = tf.argmax(model(x_enc), axis=2)
        print("step:{}, cost:{}, predict:{}, str:{}".format(step, cost, np.squeeze(prediction), [dic2idx[i] for i in np.squeeze(prediction)]))                                

### "hihello" Example with the tf.keras API

In [None]:
import tensorflow as tf
import numpy as np

keras = tf.keras

sentence = "hihello"
x = "hihell"
y = "ihello"

voca = list({c for c in sentence})
dic = {c:i for i,c in enumerate(voca)}
print(dic)

n_class = len(dic)     # vocabulary size (number of distinct characters)
n_time_steps = len(x)  # sequence length
hidden_size = n_class  # RNN state size, set equal to the vocabulary size here

x_idx = [dic[c] for c in x]
print("x_idx:", x_idx)
y_idx = [dic[c] for c in y]
print("y_idx:", y_idx)

x_enc =tf.keras.utils.to_categorical(x_idx, num_classes=n_class)
x_enc = np.expand_dims(x_enc, axis=0)
print("x_enc:", x_enc, x_enc.shape, )

y_enc = tf.keras.utils.to_categorical(y_idx, num_classes=n_class)
y_enc = np.expand_dims(y_enc, axis=0)
print("y_enc:", y_enc, y_enc.shape)

model = keras.Sequential()
model.add(keras.layers.LSTM((hidden_size), input_shape=(n_time_steps, n_class), return_sequences=True))
model.add(keras.layers.Dense(hidden_size))
model.add(keras.layers.Activation('softmax'))
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

n_epochs = 500
model.fit(x_enc, np.reshape(y_idx, (1,6,1)), epochs=n_epochs, verbose=0)

preds = model.predict(x_enc, verbose=0)
print(preds, np.squeeze(np.argmax(preds, axis=2)))
print([voca[i] for i in np.squeeze(np.argmax(preds, axis=2))])

### MNIST Images
* many to one
* 28 x 28 digit images: each image is treated as a sequence of 28 rows, with 28 pixels per row
![image.png](https://i.imgur.com/1dwjGgP.png)

In [None]:
import tensorflow as tf

keras = tf.keras  # alias used when building the model below

element_size = 28
time_steps = 28
num_classes = 10
batch_size = 128
hidden_layer_size = 128

# Helper that loads MNIST and scales pixel values to [0, 1]
def mnist_load():
    (train_x, train_y), (test_x, test_y) = tf.keras.datasets.mnist.load_data()
    # Train set
    train_x = train_x.astype('float32') / 255.
    # Test set
    test_x = test_x.astype('float32') / 255.
    return (train_x, train_y), (test_x, test_y)
# Load the MNIST data
(train_x, train_y), (test_x, test_y) = mnist_load()


model = keras.Sequential()
model.add(keras.layers.LSTM((hidden_layer_size), input_shape=(time_steps, element_size)))
model.add(keras.layers.Dense(num_classes))
model.add(keras.layers.Activation('softmax'))
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])


model.fit(train_x, train_y, epochs=20, batch_size=batch_size)

results = model.evaluate(test_x, test_y,  verbose=0)
print(f"Test loss:{results[0]}, accuracy:{results[1]}")

### Morphological Analyzer
#### Installing konlpy
* https://konlpy-ko.readthedocs.io/ko/latest/
* Installation steps
    1. Install a JDK and set JAVA_HOME
        * https://www.oracle.com/technetwork/java/javase/downloads/index.html
        * Set the JAVA_HOME environment variable
    2. Download and install JPype
        * Download: https://www.lfd.uci.edu/~gohlke/pythonlibs/#jpype
            * Pick the wheel that matches your Python version
        * Install: `pip install JPype-XXX.whl`
    3. Install konlpy
        * `!pip install konlpy`
    * macOS:
        * `export MACOSX_DEPLOYMENT_TARGET=10.10`
        * `CFLAGS="-stdlib=libc++" pip install konlpy`


In [None]:
!pip install ./assets/JPype1-0.7.1-cp37-cp37m-win_amd64.whl

In [2]:
import sys
sys.version

'3.6.4 (v3.6.4:d48ecebad5, Dec 18 2017, 21:07:28) \n[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]'

In [1]:
!pip install konlpy


In [3]:
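# Run only on macOS
# Each `!` line runs in its own subshell, so the variables are set inline on the command that needs them
!CFLAGS="-stdlib=libc++" pip install jpype1
!MACOSX_DEPLOYMENT_TARGET=10.10 CFLAGS='-stdlib=libc++' pip install konlpy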


In [1]:
from konlpy.tag import Twitter

twitter = Twitter()
malist = twitter.pos('나는 누구이고 여기는 어디인가.', norm=True, stem=True)
print(malist)


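## Markov Chains and Sentence Generation
* Markov chain: the next state is chosen from the current state alone, independent of earlier states
* Generating sentences with a Markov chain
    1. Split the sentences into morphemes (or words)
    2. Collect each word together with its neighbors in groups of n and build a dictionary
        * Example (n=3): 나는 누구이고 여기는 어디인가 ("Who am I, and where is this?")
        * [나, 는, 누구], [는, 누구, 이고], [누구, 이고, 여기], [이고, 여기, 는], [여기, 는, 어디], [는, 어디, 인가]
        * The dictionary tells us which elements can follow a given morpheme.
    3. Generate arbitrary sentences from the dictionary
        * Example: building a new sentence from registered sentences
            * Registered sentences (Korean proverbs about dogs, split into tokens)
                * 개,도,닷새,가,되면,주인,을,안다.
                * 기르던,개,에게,다리,가,물렸다
                * 닭,쫓던,개,지붕,쳐다,보듯,한다
                * 똥,묻은,개,가,겨,묻은,개,나무란다.
            * A new sentence starting with 개 ("dog"); each line lists the candidates for the next token
                * 개 : 도/에게/지붕/가/나무란다
                * 개가 : 되면/물렸다/겨
                * 개가 되면 : 주인
                * 개가 되면 주인 : 을
                * 개가 되면 주인을 : 안다
                * 개가 되면 주인을 안다
* An LSTM can instead learn the relationships between neighboring words
    * Pick the words with high probability
    * Produces more natural sentences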
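
The dictionary-then-generate procedure can be sketched in a few lines. The block below is a simplified first-order chain (each token depends only on the one before it) built from the four registered sentences; the function names `build_chain` and `generate`, the fixed random seed, and the `max_len` cutoff are assumptions made for this illustration.

```python
import random
from collections import defaultdict

# Token lists copied from the registered sentences
sentences = [
    ["개", "도", "닷새", "가", "되면", "주인", "을", "안다."],
    ["기르던", "개", "에게", "다리", "가", "물렸다"],
    ["닭", "쫓던", "개", "지붕", "쳐다", "보듯", "한다"],
    ["똥", "묻은", "개", "가", "겨", "묻은", "개", "나무란다."],
]

def build_chain(sentences):
    """Map each token to the list of tokens observed right after it."""
    chain = defaultdict(list)
    for tokens in sentences:
        for cur, nxt in zip(tokens, tokens[1:]):
            chain[cur].append(nxt)
    return chain

def generate(chain, head, rng, max_len=20):
    """Walk the chain from `head` until a token has no successor or max_len is reached."""
    out = [head]
    while out[-1] in chain and len(out) < max_len:
        out.append(rng.choice(chain[out[-1]]))
    return out

chain = build_chain(sentences)
rng = random.Random(0)  # fixed seed so repeated runs give the same walk
words = generate(chain, "개", rng)

print(chain["개"])
print(" ".join(words))
```

The `max_len` cutoff matters because the chain contains a cycle (개 → 가 → 겨 → 묻은 → 개), so a walk is not guaranteed to reach a sentence-final token on its own.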

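## Building a Chatbot with a Markov Chain
* Eliza online demo : http://www.masswerk.at/elizabot/
    * Based on patient-centered (client-centered) counseling theory
    * Simple behavior: it mostly repeats what the other person says
    * English only

### Chatbot Exercise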

In [None]:
import codecs
from bs4 import BeautifulSoup
import urllib.parse
import urllib.request
from konlpy.tag import Twitter
import os, re, json, random

script_file = './assets/script.txt'
dict_file = "./chatbot-data.json"
dic = {}
twitter = Twitter()
# Register words in the dictionary --- (※1)
def register_dic(words, save=True):
    global dic
    if len(words) == 0: return
    tmp = ["@"]
    for i in words:
        word = i[0]
        if word == "" or word == "\r\n" or word == "\n": continue
        tmp.append(word)
        if len(tmp) < 3: continue
        if len(tmp) > 3: tmp = tmp[1:]
        set_word3(dic, tmp)
        if word == "." or word == "?":
            tmp = ["@"]
            continue
    # Save the dictionary whenever it changes
    if save:
        with open(dict_file, "w", encoding="utf-8") as f:
            json.dump(dic, f)
# Register a three-word sequence in the dictionary
def set_word3(dic, s3):
    w1, w2, w3 = s3
    if not w1 in dic: dic[w1] = {}
    if not w2 in dic[w1]: dic[w1][w2] = {}
    if not w3 in dic[w1][w2]: dic[w1][w2][w3] = 0
    dic[w1][w2][w3] += 1
# Build a sentence --- (※2)
def make_sentence(head):
    if not head in dic: return ""
    ret = []
    if head != "@": ret.append(head)        
    top = dic[head]
    w1 = word_choice(top)
    w2 = word_choice(top[w1])
    ret.append(w1)
    ret.append(w2)
    while True:
        if w1 in dic and w2 in dic[w1]:
            w3 = word_choice(dic[w1][w2])
        else:
            w3 = ""
        ret.append(w3)
        if w3 in (".", "?", "？", ""): break
        w1, w2 = w2, w3
    ret = "".join(ret)
    # Fix the spacing
    params = urllib.parse.urlencode({
        "q": ret
    })
    # Use the Naver spell checker to restore spacing
    try:
        url = "https://m.search.naver.com/p/csearch/ocontent/util/SpellerProxy?where=nexearch&color_blindness=0&" + params
        data = urllib.request.urlopen(url)
        data = data.read().decode("utf-8")
        data = json.loads(data)
        data = data["message"]["result"]["html"]
        data = BeautifulSoup(data, "html.parser").getText()
    except Exception:
        # Fall back to the unspaced sentence if the request or parsing fails
        data = ret
    return data
def word_choice(sel):
    keys = sel.keys()
    return random.choice(list(keys))
# Build the chatbot reply --- (※3)
def make_reply(text):
    # Learn the words from the input
    if not text.endswith((".", "?")): text += "."
    words = twitter.pos(text)
    register_dic(words)
    # If a word is already in the dictionary, build a sentence starting from it
    for word in words:
        face = word[0]
        if face in dic: return make_sentence(face)
    return make_sentence("@")
# Load the dictionary if it exists
if os.path.exists(dict_file):
    with open(dict_file, "r", encoding="utf-8") as f:
        dic = json.load(f)
    print("dictionary loaded.")
else:
    print("no dictionary. training the chatbot is needed.")
    
if os.path.exists(script_file):
    with open(script_file, 'rt', encoding='UTF8') as f:
        for line in f:
            line = line.strip()
            if line == "":
                continue
            if not line[-1] in [".", "?"]: line += "."
            words = twitter.pos(line)
            register_dic(words, save=False)
    with open(dict_file, "w", encoding="utf-8") as f:
        json.dump(dic, f)
    print("trained using script.")
else:
    print("no script file.")
        


In [None]:
print("Type 'exit' to end the conversation.")
while True:
    txt = input("You :")
    if txt == "exit":
        print("Bye~")
        break
    reply = make_reply(txt)
    print("Bot :%s"%reply)

## Automatic Sentence Generator with a Markov Chain and an LSTM
* Download Park Kyong-ni's novel 토지 (Toji)
    * National Institute of Korean Language: https://ithub.korean.go.kr/user/total/database/corpusView.do?boardSeq=2&articleSeq=2081&boardGb=T&isInsUpd=&boardType=CORPUS
    * ./data/BEXX0004.txt
* Required libraries
    * `!pip install beautifulsoup4`
    * `!pip install konlpy`

In [None]:
import codecs
from bs4 import BeautifulSoup
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.optimizers import RMSprop
#from tensorflow.keras.utils.data_utils import get_file
import numpy as np
import random, sys

fp = codecs.open("./data/BEXX0004.txt", "r", encoding="utf-16")
soup = BeautifulSoup(fp, "html.parser")
body = soup.select_one("body")
text = body.getText() + " "
print(text)
print('Corpus length: ', len(text))
# Read the characters one by one and assign each an ID
chars = sorted(list(set(text)))
print(chars)
print('Number of distinct characters:', len(chars))

In [None]:
char_indices = dict((c, i) for i, c in enumerate(chars)) # character -> ID
indices_char = dict((i, c) for i, c in enumerate(chars)) # ID -> character

# Cut the text into windows of maxlen characters and record the character that follows each window
maxlen = 20
step = 3   # start a new window every 3 characters
sentences = []
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])

print('Number of training sequences:', len(sentences))
    
print('Converting the text to ID vectors...')
# np.bool was removed from recent NumPy releases; the built-in bool works everywhere
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

# Build the model (LSTM)
print('Building the model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

model.fit(X, y, batch_size=128, epochs=30, verbose=0)

In [None]:

# Sample the next index from the predicted probabilities, reshaped by temperature
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    ret = np.argmax(probas)
    #print(f"sample {temperature}, {np.argmax(preds)}, {ret}")
    return ret


start_index = random.randint(0, len(text) - maxlen - 1)
generated = ''
sentence = text[start_index: start_index + maxlen]
generated += sentence
print('--- seed = "' + sentence + '"')
sys.stdout.write(generated)
# Generate text automatically from the seed
for i in range(400):
    x = np.zeros((1, maxlen, len(chars)))
    for t, char in enumerate(sentence):
        x[0, t, char_indices[char]] = 1.
    # Predict the next character
    preds = model.predict(x, verbose=0)[0]
    next_index = sample(preds)
    #next_index = np.argmax(preds)#sample(preds, diversity)
    next_char = indices_char[next_index]
    # Print it
    generated += next_char
    sentence = sentence[1:] + next_char
    sys.stdout.write(next_char)
    sys.stdout.flush()
print()
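
The effect of the `temperature` argument in `sample` may be easier to see on a tiny distribution. The sketch below repeats the same log, divide, exponentiate, renormalize steps in plain Python; `apply_temperature` and the three-element `probs` vector are hypothetical names and values used only for this illustration.

```python
import math

def apply_temperature(probs, temperature=1.0):
    """Rescale a probability vector to p_i ** (1 / T), then renormalize."""
    logits = [math.log(p) / temperature for p in probs]
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.6, 0.3, 0.1]
sharp = apply_temperature(probs, 0.5)  # T < 1: favors the most likely item
same = apply_temperature(probs, 1.0)   # T = 1: distribution unchanged
flat = apply_temperature(probs, 2.0)   # T > 1: moves toward uniform

print(sharp)
print(same)
print(flat)
```

A low temperature makes the generated text more repetitive but safer, while a high temperature makes it more varied at the cost of more mistakes.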



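## Working with Text Data
* Text is the most common kind of sequence data
* Computer vision is pattern recognition over pixels
* Natural language processing is pattern recognition over words, sentences, and paragraphs
* A model cannot consume raw text, so the text must be vectorized
    * Split the text into words and convert each word into a vector
    * Split the text into characters and convert each character into a vector

* Whichever unit is used, the resulting pieces are called tokens.
* Associate a numeric vector with each token
    * One-hot encoding
    * Token embedding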
### One-Hot Encoding
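* The most common way to turn a token into a vector
* Assign every word (or character) a unique index, then one-hot encode the index
    * Example: I love you
        * I : 0, [1,0,0]
        * love : 1, [0,1,0]
        * you : 2, [0,0,1]
* Drawbacks
    * Produces sparse arrays (mostly zeros)
    * The dimensionality grows too large as the number of words (characters) increases
    * No notion of similarity between words (characters)
    * No information about the order of words (characters)
* Keras one-hot encoding utility
    * `keras.preprocessing.text.Tokenizer`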
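
The "I love you" example can be reproduced in a few lines of plain Python. This is a minimal sketch rather than the Keras utility; `one_hot`, `index`, and `encoded` are illustrative names.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

tokens = "I love you".split()
index = {tok: i for i, tok in enumerate(tokens)}  # I: 0, love: 1, you: 2
encoded = {tok: one_hot(i, len(index)) for tok, i in index.items()}

# The dot product of two different one-hot vectors is always 0,
# so the encoding carries no notion of similarity between words.
dot = sum(a * b for a, b in zip(encoded["love"], encoded["you"]))

print(encoded)
print(dot)
```

The vector length equals the vocabulary size, which is why this representation becomes impractical for large vocabularies.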
    

### Token Embedding - Word2Vec
* Addresses the drawbacks of one-hot encoding
* Dense and low-dimensional
* Word2Vec (skip-gram)
* Select neighboring words according to the window size
* Convert each word to a one-hot encoding
* Example: king brave man / queen beautiful women (window_size=1)
    * king[1,0,0,0,0,0] - brave[0,1,0,0,0,0]
    * brave[0,1,0,0,0,0] - man[0,0,1,0,0,0]
    * brave[0,1,0,0,0,0] - king[1,0,0,0,0,0]
    * queen[0,0,0,1,0,0] - beautiful[0,0,0,0,1,0]
    * beautiful[0,0,0,0,1,0] - women[0,0,0,0,0,1]
    * women[0,0,0,0,0,1] - beautiful[0,0,0,0,1,0]
* Feed the pairs as input and output to a network whose hidden layer has 2 units, and train it
    * ![](https://i.imgur.com/vjVYupm.png)
* The learned weights W1 and W2 give the vector for each input word
   * king[1,1]
   * brave[1,2]
   * man[1,3]
   * queen[5,5]
   * beautiful[5,6]
   * women[5,7]
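
The (input, neighbor) pairs for the window-size-1 example can be enumerated mechanically. The helper below is a sketch with an assumed name, `skipgram_pairs`. Note that a full enumeration yields eight pairs, including (man, brave) and (beautiful, queen), which the bullet list above leaves out.

```python
def skipgram_pairs(tokens, window_size=1):
    """Pair every token with each neighbor at most `window_size` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(i - window_size, 0)
        hi = min(i + window_size, len(tokens) - 1)
        for j in range(lo, hi + 1):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

corpus = ["king brave man", "queen beautiful women"]
pairs = [p for sentence in corpus for p in skipgram_pairs(sentence.split())]

for center, neighbor in pairs:
    print(center, "->", neighbor)
```

Each pair becomes one training example: the one-hot vector of the first word is the input and the one-hot vector of the second word is the target.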

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

word2vec = {"king":np.array([1,1]), "brave":np.array([1,2]),"man": np.array([1,3]), "queen":np.array([5,5]), "beautiful":np.array([5,6]),"women": np.array([5,7])}
for k, v in word2vec.items():
    plt.annotate(k, v)
    plt.scatter(v[0], v[1])

king_man_women = word2vec['king'] - word2vec['man'] + word2vec['women']    
print("king-man+women = " , king_man_women)


        

### Representing Text Similarity
* Bag of Words
    * Records how often each word occurs
* N-gram
    * Extract the word or character n-grams from a text and convert them into a single vector
        * n-gram: N adjacent words (or characters) in a sentence
            * Example: "The cat sat on the mat"
            * 2-gram : "The cat", "cat sat", "sat on", "on the", "the mat"
            * 3-gram : "The cat sat", "cat sat on", "sat on the", "on the mat"
### Word2Vec Implementation Example

In [None]:
corpus = ['king is a strong man', 
          'queen is a wise woman', 
          'boy is a young man',
          'girl is a young woman',
          'prince is a young king',
          'princess is a young queen',
          'man is strong', 
          'woman is pretty',
          'prince is a boy will be king',
          'princess is a girl will be queen']

def remove_stop_words(corpus):
    stop_words = ['is', 'a', 'will', 'be']
    results = []
    for text in corpus:
        tmp = text.split(' ')
        for stop_word in stop_words:
            if stop_word in tmp:
                tmp.remove(stop_word)
        results.append(" ".join(tmp))
    
    return results


corpus = remove_stop_words(corpus) # remove stop words

In [None]:
words = []
for text in corpus:
    for word in text.split(' '):
        words.append(word)

words = set(words)
print(words)

In [None]:
word2int = {}

for i,word in enumerate(words):
    word2int[word] = i

sentences = []
for sentence in corpus:
    sentences.append(sentence.split())
    
WINDOW_SIZE = 2

data = []
for sentence in sentences:
    for idx, word in enumerate(sentence):
        for neighbor in sentence[max(idx - WINDOW_SIZE, 0) : min(idx + WINDOW_SIZE, len(sentence)) + 1] : 
            if neighbor != word:
                data.append([word, neighbor])

In [None]:
import pandas as pd
for text in corpus:
    print(text)

df = pd.DataFrame(data, columns = ['input', 'label'])

In [None]:
df.head(10)

In [None]:
word2int

In [None]:
import tensorflow as tf
import numpy as np

ONE_HOT_DIM = len(words)

# function to convert numbers to one hot vectors
def to_one_hot_encoding(data_point_index):
    one_hot_encoding = np.zeros(ONE_HOT_DIM)
    one_hot_encoding[data_point_index] = 1
    return one_hot_encoding

X = [] # input word
Y = [] # target word

for x, y in zip(df['input'], df['label']):
    X.append(to_one_hot_encoding(word2int[ x ]))
    Y.append(to_one_hot_encoding(word2int[ y ]))

# convert them to numpy arrays
X_train = np.asarray(X).astype(np.float32)
Y_train = np.asarray(Y).astype(np.float32)


# word embedding will be 2 dimension for 2d visualization
EMBEDDING_DIM = 2 
init = tf.initializers.GlorotUniform()
# hidden layer: which represents word vector eventually
W1 = tf.Variable(init([ONE_HOT_DIM, EMBEDDING_DIM]))
b1 = tf.Variable(tf.random.normal([1])) #bias


# output layer
W2 = tf.Variable(init([EMBEDDING_DIM, ONE_HOT_DIM]))
b2 = tf.Variable(tf.random.normal([1]))




In [None]:
epochs = 20000
optimizer = tf.keras.optimizers.SGD(learning_rate=0.05)
for i in range(epochs):
    with tf.GradientTape() as tape:
        hidden_layer = tf.add(tf.matmul(X_train,W1), b1)
        prediction = tf.nn.softmax(tf.add( tf.matmul(hidden_layer, W2), b2))
        cost = tf.reduce_mean(-tf.reduce_sum(Y_train * tf.math.log(prediction), axis=[1]))
        #cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=Y_train, logits=prediction))
    grads = tape.gradient(cost, [W1, b1, W2, b2])
    optimizer.apply_gradients(zip(grads, [W1, b1, W2, b2]))    
    if i % 3000 == 0:
        print('epochs '+str(i)+' cost is : ',cost.numpy())

In [None]:
vectors = (W1 + b1).numpy()
vectors

In [None]:
w2v_df = pd.DataFrame(vectors, columns = ['x1', 'x2'])
w2v_df['word'] = list(words)  # pandas rejects an unordered set as a column; list(words) keeps the order used for word2int
w2v_df = w2v_df[['word', 'x1', 'x2']]
w2v_df

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

#fig, ax = plt.subplots()
plt.figure(figsize=(10,10))  
for word, x1, x2 in zip(w2v_df['word'], w2v_df['x1'], w2v_df['x2']):
    plt.plot(x1, x2, 'b.')
    plt.annotate(word, (x1,x2 ))
    
      
PADDING = 1.0
x_axis_min = np.amin(vectors, axis=0)[0] - PADDING
y_axis_min = np.amin(vectors, axis=0)[1] - PADDING
x_axis_max = np.amax(vectors, axis=0)[0] + PADDING
y_axis_max = np.amax(vectors, axis=0)[1] + PADDING
 
plt.xlim(x_axis_min,x_axis_max)
plt.ylim(y_axis_min,y_axis_max)

plt.show()
