Bidirectional Attention Flow for Machine Comprehension(2016)

1. Abstract

MC(Machine Comprehension) : answering a query about a given context paragraph
- 문장(context)의 흐름을 이해하고, 주어진 문장(context)에서 질문(query)에 대답하는 것
- query와 context의 복잡한 상호관계를 이해해야함
- ex) Context : 현우는 동영이에게 1000원을 줬다. 동영이는 우정이에게 다시 1000원을 줬다.
- ex) Query : 동영이가 가진 돈은?
최근에 Attention 모델이 MC에 사용되기 시작함
- 문맥의 작은 부분에 집중 가능하며, 그 정보를 fixed-size vector로 표현
BiDAF : Bi-Directional Attention Flow
- multi-stage hierarchical process
- context를 요약하지않고 query-aware context vector를 찾아내는 방식
- context을 세분화시킴(query에서 사용되는 중요한 단어를 골라냄)
query-aware context vector로 context summarization 대체
- context가 query와 연관지어져 query에 필요한 특성을 반영한 vector

2. Introduction

Attention mechanism은 target의 context paragraph를 잘 이해하게 해준다.
Attention mechanism이 과거에 가졌던 특징
- context를 fixed-size vector로 요약하여 attention weight 계산(문맥정보를 추출)
- attention vector는 계속 update
- uni-directional
BiDAF
- hierarchical multi-stage architecture
- context를 세분화
- char/word/contextual embedding 사용

BiDAF의 장점

attention layer가 문맥을 fixed-size vector로 요약하는 것으로 사용되지 않음.
- 이전 layer에서 계산된 attended vector를 subsequent model로 flow하게 해줌
- 이 방법을 통해 요약으로 인한 정보손실이 줄어들게 됨
memory-less attention
- time-step마다 attention을 계산하여 이전 attention layers를 사용하지않음
- 따라서 현재 time step에서 query와 context의 attention을 학습하는 것에 집중
- 또한, 이전의 잘못된 attention의 영향을 받지 않게 됨
bi-directional
- context와 query가 충분히 정보를 교환가능하다.

3. Model

model

6개의 layer로 이루어진 hierarchical multi-stage process

3.1 Character Embedding Layer

charCNN을 이용하여 word를 vector로 만들어줌
character에 ID vector를 부여하고 CNN에 넣어서 word를 예측하는 방식으로 학습.
CNN output을 pooling으로 차원을 줄여 embedding으로 사용
$$d$$ dimension으로 embedding

3.2 Word Embedding Layer

pre-trained word embedding model러 word를 vector로 만들어줌
논문에서는 GloVe 사용
$$d$$ dimension으로 embedding

3.3 Contextual Embedding Layer

char embedding과 word embedding을 concatenate한 후 2 layer highway network를 통과시킴
word 주변의 contextual cues(문맥적 단서)를 사용하여 단어의 embedding을 재정의
BLSTM을 통해 주변 문맥을 파악할 수 있게 함
- forward, backward output을 concatenate하므로 dimension = $$2d$$
$$T$$ = #context words, $$J$$ = #query words
$$H$$ = Context의 context matrix, $$U$$ = Query의 context matrix where $$H,U \in {R^{2d \times T}},{R^{2d \times J}}$$
matrix에 vector가 행으로 쌓인 것이 아니라 열로 쌓인 것을 주의, 즉 $$m$$번째 열 = $$m$$번째 단어

3.4 Attention Flow Layer

query와 context vector를 연결시키고, 각 단어마다 문맥 속의 query-aware feature vectors를 만듦
context vector와 query를 interaction시켜, context에서 query에 대한 답이 될 수 있는 특징들을 추출
즉, Attention Flow Layer는 context와 query의 정보를 서로 연결한다.
- ex) Context : 사자가 동물원에 있다. Query : 여기엔 무슨 동물이 있나?
- ex) 위의 문장에서 '동물원~~여기', '사자~~동물'을 연결시키는 역할
이전과는 다르게, query와 context를 single feature vectors로 요약하는 것에 사용되지 않음
- 요약을 통한 정보손실을 방지할 수 있음 - query-aware vector
각 time step의 attention vectors가 subsequent modeling layer로 들어가게 해주도록 함
bi-direction으로 attention계산 : Context2Query, Query2Context - context와 query의 정보교환을 도움

3.4.1 similarity matrix

context와 query가 공유
similarity matrix $${S_{tj}}$$ = $$\alpha ({H_{:t}},{U_{:j}})$$ where $$S \in {R^{T \times J}}$$
- $${S_{tj}} \in R$$ 는 t-th context word와 j-th query word의 similarity 의미
- $${H_{:t}}$$ = context의 t번째 word vector, $${U_{:j}}$$ = query의 j번째 word vector
- $$\alpha (h, u)$$ = $${w_{S}^{T}} [h; u; h \odot u]$$
- $$\alpha$$ = a trainable scalar function, similarity를 encoding하는 역할
- $${w_{S}^{T}}$$ = trainable weight vector, where $${w_{S}^{T}} \in {R^{6d}}$$

3.4.2 Context2Query Attention

query words 중에서 어떤 query word가 각각의 context word와 가장 관계가 있는지 signify
$${a_{t}}$$ = $$softmax({S_{t:}}) \in {R^{J}}$$
- $${S_{t:}}$$ = t-th context word와 query word들의 similarity를 의미
- t-th context word와 query word의 attention weight
- $$\sum {a_{tj}} = 1$$ for all t
- similarity $${S_{tj}}$$에 softmax를 취해줬으므로 유사성이 큰 weight만 값이 커지게 됨
attended query vector $$\widetilde{U_{:t}}$$ = $$\sum {a_{tj}}{U_{:j}}$$, where &&{U_{:j}} \in {R^{2d \times 1}}$$
- j-th query word에 t-th context와의 attention weight를 곱하는 것
- context가 총 개의 단어이므로, $$\widetilde{U} \in {R^{2d \times T}}$$
$${a_{t}}$$는 유사성이 큰 것만 값이 크므로, 모든 쿼리의 단어들에 weight를 곱해서 더할 경우 $$\widetilde{U_{:t}}$$는 유사성이 큰 query의 단어들만 반영이 됨
- 즉, t-th context word 입장에서 유사성이 높은 query의 특성을 알게 됨
- 따라서, $$\widetilde{U}$$는 context 전체의 attended query vector를 담고 있음
- t-th context word가 어떤 query word와 가까운지를 알려줌
- t-th context word를 유사성이 높은 query word로 표현함

3.4.3 Query2Context Attention

context words 중에서 어떤 것이 각각의 query word와 가장 관계 있는지 signify
- 따라서, query에 대답하는 것에서 중요한 역할
$$b$$ = $$softmax({max_{col}}(S))$$, where $$b \in {R^{T}}$$
- $${max_{col}}$$ : 각 row에서 가장 큰 값을 뽑아냄
- softmax를 취해줬으므로 $$b = (b_{1}, b_{2}, ..., b_{T})$$에서 값이 큰 $$b_{t}$$만 살아남음
attended context vector $$\widetilde{h}$$ = $$\sum {b_{t}}{H_{:t}} \in {R^{2d \times 1}}$$
- weight $$b_{t}$$를 곱해서 값이 더해지면, similarity가 큰 t-th context word만 값이 더해짐
- 즉, similarity가 큰 context word만 살아남게 되므로 query 입장에서 중요한 word만 살아남게 됨
- context에서 가장 중요한 vector의 가중합을 의미
$$h$$를 $$T$$개 나열하여 $$\widetilde{H} \in {R^{2d \times T}}$$

3.4.3 query-aware representation of each context word

$${G_{:t}}$$ = $$\beta ({H_{:t}}, \widetilde{U_{:t}}, \widetilde{H_{:t}})$$
$${G_{:t}} \in {R^{8d}}$$ = t-th context vector
- $$G \in {R^{8d \times T}}$$ .
$$\beta(h, \widetilde{u}, \widetilde{h})$$ = $$[h;\widetilde{u};h \odot \widetilde{u};h \odot \widetilde{h}]$$

3.5 Modeling Layer

RNN을 통하여 context scan
위에서 구한 $$G$$가 input
query와 context word의 interaction을 파악
BLSTM 사용

3.6 Output Layer

query에 대한 answer을 output
질문에 대한 대답으로 paragraph의 sub-phrase를 찾아야함
paragraph안에 있는 start, end의 indices 를 predict 해서 phrase를 찾음
$${p_{1}}$$ = $$softmax({w_{p_{1}}^{T}}[G;{M^{1}}])$$
$${p_{2}}$$ = $$softmax({w_{p_{2}}^{T}}[G;{M^{2}}])$$
$${M^{1}}, {M^{2}} \in {R^{2d \times T}}$$는 각각 LSTM의 forward, backward output

3.7 Training

loss : sum of negative pobalites of true start and end indices s by the predicted distributions
- $$L( \theta ) = -\sum log({p_{y_{i}^{1}}^{1}}) + log({p_{y_{i}^{2}}^{2}})$$
- $$\theta$$ = training weights
- $${P_{k}}$$ = k-th value of vector P

3.8 Test

The answer span $$(k, l)$$ with maximum value of $${P_{k}^{1}}{P_{l}^{2}}$$

4. Related work

Machine comprehension
Visual question answering

5. Question Answering Experiment

result1

Dataset : SQuAD
Ensemble로 사용한 것이 제일 성능이 좋았다.

5.1 Visualization

result2 result3

contextual embedding layer가 단어의 사용을 좀 더 잘 구별했다.

5.2 Discussion

result4

어떤 Query에 어떤 Context vector가 담겼는지를 보여줌

5.3 Error Analysis

50% : imprecise boundaries of the answers
28% : syntactic complication/ambiguity
14% : paraphrase problems
4% : require external knowledge
2% : need multiple sentences to answer
2% : mistake during tokenization

6. Cloze Test Experiment

result5

7. Conclusion

ablation analysis에서 확인했듯이, model의 구성요소가 모두 중요하다.
context summarization 없이 query-aware context representation을 찾아냈다.

8. Reference

https://arxiv.org/abs/1611.01603
http://dalpo0814.tistory.com/category/Machine%20Learning
Ybigta deepNLP-study

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bidirectional Attention Flow for Machine Comprehension(2016)

1. Abstract

2. Introduction

BiDAF의 장점

3. Model

3.1 Character Embedding Layer

3.2 Word Embedding Layer

3.3 Contextual Embedding Layer

3.4 Attention Flow Layer

3.4.1 similarity matrix

3.4.2 Context2Query Attention

3.4.3 Query2Context Attention

3.4.3 query-aware representation of each context word

3.5 Modeling Layer

3.6 Output Layer

3.7 Training

3.8 Test

4. Related work

5. Question Answering Experiment

5.1 Visualization

5.2 Discussion

5.3 Error Analysis

6. Cloze Test Experiment

7. Conclusion

8. Reference

Clone this wiki locally