---
---
---
# ***Attention***
---
---
---
1. *Attention Mechanism*
2. *Bahdanau Attention*
3. *BiLSTM with Attention mechanism*

---
## 1. ***Attention Mechanism***
---

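`Limitations of Seq2Seq`
- Compressing all information into a single fixed-size vector causes information loss
- The vanishing gradient problem, a chronic weakness of RNNs, also arises

`The idea behind Attention`
- A technique that compensates for the drop in output-sequence accuracy as the input sequence gets longer
- At every time step where the decoder predicts an output word, it consults the entire input sentence from the encoder again
- Rather than weighting the whole input sentence equally, it focuses on the input words most relevant to the word being predicted at that time step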

<br>

#### 1. ***Attention Function***

$Attention(Q, K, V) = Attention \; Value$ <br>
> $Q = \text{Query}: \text{hidden state of the decoder cell at time step } t$ <br>
$K = \text{Keys}: \text{hidden states of the encoder cells at all time steps}$ <br>
$V = \text{Values}: \text{hidden states of the encoder cells at all time steps}$
 - Compute the similarity between the given Query and every Key
 - Apply each similarity to the Value mapped to the corresponding Key
 - Sum all the similarity-weighted Values and return the result
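
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes the dot product as the similarity measure and softmax as the normalization, and the function names and toy numbers are made up for the example.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(query, keys, values):
    """Return (attention value, weights) for a single query.

    query: shape (d,), keys: shape (N, d), values: shape (N, d_v).
    """
    scores = keys @ query        # similarity of the query with every key, shape (N,)
    weights = softmax(scores)    # normalize the similarities so they sum to 1
    value = weights @ values     # sum of the similarity-weighted values, shape (d_v,)
    return value, weights

# Toy example (made-up numbers): three encoder states of size 2.
# Keys and values are the same array, as in the definition above.
keys = values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.0])
attention_value, weights = attention(query, keys, values)
print(weights, attention_value)
```

Keys that point in the same direction as the query receive larger weights, so their values contribute more to the returned attention value.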
 
<br>

#### 2. ***Dot-Product Attention***

![](https://wikidocs.net/images/page/22893/dotproductattention1_final.PNG)

<br>
<br>

> #### **Dot-product Attention Mechanism**

<br>

1. ***Get Attention Score***

    $Score(s_t, h_i) = s_t^T h_i$ <br>
    $e^t = [s_t^T h_1, s_t^T h_2, ... , s_t^T h_N]$
    

2. ***Get Attention Distribution with Softmax***

    $\alpha^t(attention \; distribution) = softmax(e^t)$ <br>
    $attention \; weight = each \; value \; of \; attention \; distribution$
 
3. ***Get Attention Value***

    $\alpha_t = \sum_{i=1}^N \alpha_i^t h_i$

4. ***Concatenate Attention value and hidden state(decoder)***

    $v_t = concat (\alpha_t, s_t)$ 
    
5. ***Calculate Output***

    $\tilde{s_t} = tanh(W_c[\alpha_t;s_t] + b_c)$ <br>
    $\hat{y_t} = softmax(W_y \tilde{s_t} + b_y)$
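
The five steps above can be traced for a single decoder time step with NumPy. This is only a sketch: the sizes, the random weights, and the variable names are assumptions made for illustration, and no training is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, vocab = 4, 3, 5                       # hypothetical: source length, hidden size, vocabulary size

h = rng.normal(size=(N, d))                 # encoder hidden states h_1..h_N
s_t = rng.normal(size=d)                    # decoder hidden state at time step t
W_c, b_c = rng.normal(size=(d, 2 * d)), np.zeros(d)      # untrained weights, for illustration only
W_y, b_y = rng.normal(size=(vocab, d)), np.zeros(vocab)

def softmax(x):
    e = np.exp(x - np.max(x))               # shift by the max for numerical stability
    return e / e.sum()

e_t = h @ s_t                               # 1. attention scores s_t^T h_i, shape (N,)
alpha_dist = softmax(e_t)                   # 2. attention distribution, sums to 1
attn_value = alpha_dist @ h                 # 3. attention value: weighted sum of h_i, shape (d,)
v_t = np.concatenate([attn_value, s_t])     # 4. concatenate with the decoder state, shape (2d,)
s_tilde = np.tanh(W_c @ v_t + b_c)          # 5. project back to size d ...
y_hat = softmax(W_y @ s_tilde + b_y)        #    ... and predict a distribution over the vocabulary

print(alpha_dist.round(3), y_hat.round(3))
```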
    
    
> #### **Various types of attention**

Name | Score function | Defined by
---|---|---
$dot$ | $Score(s_t, h_i) = s_t^T h_i$ | Luong et al. (2015)
$scaled \; dot$ | $Score(s_t, h_i) = \frac{s_t^T h_i}{\sqrt{n}}$ | Vaswani et al. (2017)
$general$ | $Score(s_t, h_i) = s_t^T W_a h_i$, where $W_a$ is a learnable weight matrix | Luong et al. (2015)
$concat$ | $Score(s_t, h_i) = W_a^T \, tanh(W_b[s_t;h_i])$ or $Score(s_t, h_i) = W_a^T \, tanh(W_b s_t + W_c h_i)$ | Bahdanau et al. (2015)
$location\text{-}base$ | $\alpha_t = softmax(W_a s_t)$, computed from $s_t$ alone | Luong et al. (2015)
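
The first four score functions in the table can be written as short NumPy functions. This is a sketch under stated assumptions: $n$ is taken to be the dimension of the hidden state, the function names are made up for the example, and the weights passed in below are arbitrary rather than learned.

```python
import numpy as np

def score_dot(s_t, h_i):
    # dot: s_t^T h_i
    return s_t @ h_i

def score_scaled_dot(s_t, h_i):
    # scaled dot: divide by sqrt(n), with n assumed to be the hidden dimension
    return (s_t @ h_i) / np.sqrt(s_t.shape[-1])

def score_general(s_t, h_i, W_a):
    # general: s_t^T W_a h_i, where W_a would be learned in practice
    return s_t @ W_a @ h_i

def score_concat(s_t, h_i, w_a, W_b, W_c):
    # concat (additive): w_a^T tanh(W_b s_t + W_c h_i)
    return w_a @ np.tanh(W_b @ s_t + W_c @ h_i)

# Arbitrary example vectors and weights, for illustration only.
s_t = np.array([1.0, 2.0])
h_i = np.array([3.0, 4.0])
identity = np.eye(2)
print(score_dot(s_t, h_i))
print(score_scaled_dot(s_t, h_i))
print(score_general(s_t, h_i, identity))
print(score_concat(s_t, h_i, np.ones(2), identity, identity))
```

With the identity matrix as $W_a$, the general score reduces to the plain dot score, which is one way to see that general attention is a learnable extension of dot attention.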



---
## 2. ***Bahdanau Attention***
---


<br>

#### 1. ***Bahdanau Attention Function***

$Attention(Q, K, V) = Attention \; Value$ <br>
> $Q = \text{Query}: \text{hidden state of the decoder cell at time step } t-1$ <br>
$K = \text{Keys}: \text{hidden states of the encoder cells at all time steps}$ <br>
$V = \text{Values}: \text{hidden states of the encoder cells at all time steps}$



1. ***Get Attention Score***

    $e^t_i = Score(s_{t-1}, h_i) = W_a^T \, tanh(W_b s_{t-1} + W_c h_i)$ <br>
    $e^t = [e^t_1, e^t_2, ... , e^t_N]$
    
2. ***Get Attention Distribution with Softmax***

    $\alpha^t(attention \; distribution) = softmax(e^t)$ <br>
    $attention \; weight = each \; value \; of \; attention \; distribution$
 
3. ***Get Attention Value***

    $\alpha_t = \sum_{i=1}^N \alpha_i^t h_i$

4. ***Concatenate Input data and Attention value for next timestep***

    $x_{t(new)} = concat (\alpha_t, x_t)$ 
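
One Bahdanau attention step, following the four steps above, might look like this in NumPy. It is a sketch only: the sizes, the random (untrained) weights, and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_enc, d_dec, units, d_in = 5, 4, 4, 8, 3   # hypothetical sizes

h = rng.normal(size=(N, d_enc))        # encoder hidden states h_1..h_N
s_prev = rng.normal(size=d_dec)        # decoder hidden state s_{t-1} (the query)
x_t = rng.normal(size=d_in)            # decoder input at time step t
w_a = rng.normal(size=units)           # W_a, stored as a vector
W_b = rng.normal(size=(units, d_dec))  # projects the decoder state
W_c = rng.normal(size=(units, d_enc))  # projects each encoder state

# 1. Additive score for every encoder position:
#    (units,) broadcasts against (N, units), then the w_a projection gives shape (N,).
e_t = np.tanh(W_b @ s_prev + h @ W_c.T) @ w_a

# 2. Softmax over the encoder positions.
exp = np.exp(e_t - e_t.max())
alpha_dist = exp / exp.sum()

# 3. Attention value: weighted sum of the encoder hidden states, shape (d_enc,).
attn_value = alpha_dist @ h

# 4. Concatenate with the input x_t to form the decoder input for this step.
x_new = np.concatenate([attn_value, x_t])

print(alpha_dist.round(3), x_new.shape)
```

Unlike the dot-product variant, the query here is the previous decoder state, and the attention value is fed into the decoder cell as part of its input instead of being combined with the output afterwards.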
    

    

---
## 3. ***BiLSTM with Attention mechanism***
---


<br>

1. ***Preprocessing the IMDB review data***
2. ***Bahdanau Attention***
3. ***BiLSTM with Attention Mechanism***

In [1]:
# 1. Data preprocessing
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 10000  # keep only the 10,000 most frequent words
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocab_size)
print('Maximum review length : {}'.format(max(len(l) for l in X_train)))
print('Average review length : {}'.format(sum(map(len, X_train))/len(X_train)))

max_len = 500  # pad or truncate every review to 500 tokens
X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)
print('train data shape : ', X_train.shape)
print('test data shape : ', X_test.shape)


Maximum review length : 2494
Average review length : 238.71364
train data shape :  (25000, 500)
test data shape :  (25000, 500)


In [3]:
# 2. Bahdanau attention
import tensorflow as tf
from tensorflow.keras.layers import Dense  # used by the attention layers below

class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = Dense(units)
    self.W2 = Dense(units)
    self.V = Dense(1)

  def call(self, values, query): # note: keys and values are the same tensor
    # query shape == (batch_size, hidden size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden size)
    # Add a time axis so the query broadcasts in the addition used to compute the score.
    hidden_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying score to self.V
    # the shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(values) + self.W2(hidden_with_time_axis)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights

In [None]:
# 3. Bidirectional LSTM with attention
from tensorflow.keras.layers import Dense, Embedding, Bidirectional, LSTM, Concatenate, Dropout
from tensorflow.keras import Input, Model
from tensorflow.keras import optimizers
import os

sequence_input = Input(shape=(max_len,), dtype='int32')
embedded_sequences = Embedding(vocab_size, 128, input_length=max_len, mask_zero = True)(sequence_input)

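# First BiLSTM layer: returns the full sequence of hidden states.
lstm = Bidirectional(LSTM(64, dropout=0.5, return_sequences = True))(embedded_sequences)
# Second BiLSTM layer: also returns the final hidden and cell states of both directions.
lstm, forward_h, forward_c, backward_h, backward_c = Bidirectional \
  (LSTM(64, dropout=0.5, return_sequences=True, return_state=True))(lstm)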

print(lstm.shape, forward_h.shape, forward_c.shape, backward_h.shape, backward_c.shape)

state_h = Concatenate()([forward_h, backward_h]) # hidden state
state_c = Concatenate()([forward_c, backward_c]) # cell state

attention = BahdanauAttention(64) # number of units in the attention weight layers
context_vector, attention_weights = attention(lstm, state_h)

dense1 = Dense(20, activation="relu")(context_vector)
dropout = Dropout(0.5)(dense1)
output = Dense(1, activation="sigmoid")(dropout)
model = Model(inputs=sequence_input, outputs=output)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs = 3, batch_size = 256, validation_data=(X_test, y_test), verbose=1)
print("\n Test accuracy: %.4f" % (model.evaluate(X_test, y_test)[1]))