# Neural Arithmetic Logic Units (NALU)

## Introduction

- Advances in deep learning have made a wide range of tasks possible
- Systematic generalization, however, is still doubtful
- What networks learn is closer to memorization than to systematic abstraction

![Imgur](https://i.imgur.com/hXScZyC.png)

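- The paper proposes a module that can be plugged into ordinary neural networks
- It represents a numerical quantity in a single neuron without any non-linearity
- It is differentiable, so it can be trained with backpropagation
- It captures the underlying numerical nature of the data
- Applying it only to the final layer of existing models already improves performance noticeably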

### Numerical extrapolation failures in NN

![Imgur](https://i.imgur.com/rPg30Sf.png)

$$ f(x)=x $$

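- Neural networks fail to learn even the simple identity function properly
- The test uses an autoencoder that takes a scalar value as input
- 3 hidden layers of size 8, 10000 iterations, squared error loss
- When trained on [-5, 5], it predicts values inside that range well but fails on [-20, 20]
- Highly linear activations such as PReLU extrapolate well, whereas strongly non-linear activations (tanh, sigmoid, etc.) do very poorly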
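
The saturation effect behind this failure can be illustrated with a small NumPy sketch. This is not the paper's autoencoder: as a simplifying assumption it uses 8 fixed random tanh features with a least-squares linear readout (the names `features`, `predict` and `output_bound` are made up for this note). Because tanh is bounded, the model output is bounded too, so it cannot follow f(x) = x far outside the training range.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained tanh network: 8 fixed random tanh
# features followed by a linear readout fitted by least squares.
hidden = 8
w = rng.normal(size=hidden)   # input-to-hidden weights
b = rng.normal(size=hidden)   # hidden biases


def features(x):
    """tanh hidden activations plus a constant column for the output bias."""
    h = np.tanh(np.outer(x, w) + b)
    return np.hstack([h, np.ones((len(x), 1))])


# Fit the readout on the training range [-5, 5] so that f(x) ~ x.
x_train = np.linspace(-5.0, 5.0, 201)
coef, *_ = np.linalg.lstsq(features(x_train), x_train, rcond=None)


def predict(x):
    return features(x) @ coef


# Points strictly outside the training range, up to |x| = 20.
x_out = np.concatenate([np.linspace(-20.0, -5.5, 50),
                        np.linspace(5.5, 20.0, 50)])

interp_err = np.mean(np.abs(predict(x_train) - x_train))
extrap_err = np.mean(np.abs(predict(x_out) - x_out))

# |tanh| <= 1, so the output magnitude can never exceed this value.
output_bound = np.sum(np.abs(coef))

print("mean abs error inside  [-5, 5]:", interp_err)
print("mean abs error outside [-5, 5]:", extrap_err)
print("largest |output| the model can produce:", output_bound)
```

Any input larger than `output_bound` is therefore guaranteed to be reconstructed with a non-zero error, regardless of how the readout was fitted.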


## NAC, NALU

- special case of a linear layer
- NAC: addition, subtraction
- NALU: extends the NAC with multiplication and division
- NALU has gate-controlled sub-operations
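
As a rough illustration of the NAC part, the sketch below builds the effective weights as `tanh(W_hat) * sigmoid(M_hat)` in NumPy and applies them as a bias-free linear layer. The parameter values are hand-picked for this note (not learned) so that both squashing functions saturate; `nac_weights` and `nac_forward` are hypothetical helper names.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def nac_weights(w_hat, m_hat):
    """Effective NAC weights: tanh(W_hat) * sigmoid(M_hat), each in (-1, 1)."""
    return np.tanh(w_hat) * sigmoid(m_hat)


def nac_forward(x, w_hat, m_hat):
    """Linear layer without bias: a = x @ W."""
    return x @ nac_weights(w_hat, m_hat)


# Hand-picked (not learned) parameters that saturate both squashing functions.
# Column 0 ends up close to [+1, +1] (add), column 1 close to [+1, -1] (subtract).
w_hat = np.array([[10.0, 10.0],
                  [10.0, -10.0]])
m_hat = np.full((2, 2), 10.0)

x = np.array([[3.0, 4.0],
              [10.0, 2.5]])
out = nac_forward(x, w_hat, m_hat)

print(np.round(nac_weights(w_hat, m_hat), 3))
print(np.round(out, 3))  # first column ~ x1 + x2, second column ~ x1 - x2
```

With weights pushed toward -1, 0 or 1, the layer can only add, subtract or ignore its inputs, which is why the same weights keep working on inputs far larger than those seen during training.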

![Imgur](https://i.imgur.com/ryzfWG7.png)

![Imgur](https://i.imgur.com/fXQdJYw.png)

```python
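import tensorflow as tf  # written against the TensorFlow 1.x API


def NALU(prev_layer, num_outputs):
    eps = 1e-7  # keeps log() finite when an input is exactly 0
    shape = (int(prev_layer.shape[-1]), num_outputs)

    # NAC cell: W = tanh(W_hat) * sigmoid(M_hat) pushes weights toward {-1, 0, 1}
    W_hat = tf.Variable(tf.truncated_normal(shape, stddev=0.02))
    M_hat = tf.Variable(tf.truncated_normal(shape, stddev=0.02))
    W = tf.tanh(W_hat) * tf.sigmoid(M_hat)
    a = tf.matmul(prev_layer, W)  # additive path (addition / subtraction)
    G = tf.Variable(tf.truncated_normal(shape, stddev=0.02))  # gate weights

    # NALU: multiplicative path in log space, mixed with `a` by the gate g
    m = tf.exp(tf.matmul(tf.log(tf.abs(prev_layer) + eps), W))
    g = tf.sigmoid(tf.matmul(prev_layer, G))
    out = g * a + (1 - g) * m
    return out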
 ```
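
To see what the two paths and the gate compute, here is a NumPy forward pass that mirrors the TensorFlow function above. It is only a sketch: `W` and `G` are hand-picked rather than learned, and `nalu_forward` is a name made up for this note.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def nalu_forward(x, W, G, eps=1e-7):
    """NumPy forward pass mirroring the TensorFlow function above.

    W is the already-squashed NAC weight matrix, G the gate weight matrix.
    """
    a = x @ W                                 # additive path
    m = np.exp(np.log(np.abs(x) + eps) @ W)   # multiplicative path in log space
    g = sigmoid(x @ G)                        # g ~ 1: additive, g ~ 0: multiplicative
    return g * a + (1.0 - g) * m


x = np.array([[3.0, 4.0]])

# Hand-picked weights: W = [1, 1]^T gives a = x1 + x2 and m = x1 * x2.
W_mul = np.array([[1.0], [1.0]])
# W = [1, -1]^T gives a = x1 - x2 and m = x1 / x2.
W_div = np.array([[1.0], [-1.0]])

G_add = np.array([[50.0], [50.0]])     # large positive gate input: g ~ 1
G_mul = np.array([[-50.0], [-50.0]])   # large negative gate input: g ~ 0

print(nalu_forward(x, W_mul, G_add))   # close to 3 + 4
print(nalu_forward(x, W_mul, G_mul))   # close to 3 * 4
print(nalu_forward(x, W_div, G_mul))   # close to 3 / 4
```

Note that the multiplicative path works on `abs(x)`, so the sign of the inputs is lost there, and inputs near zero are clamped by `eps`.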


## Experiments

### Simple function learning tasks

- Learns the four basic arithmetic operations, squaring, square root, etc.
- Input values are generated randomly
- Both interpolation and extrapolation are tested
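
A possible way to generate such data is sketched below. The operand construction (sums over two disjoint slices of a random vector) follows the notebook code later in these notes; the value ranges, the `OPS` table and the `make_split` helper are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

# Target functions named in the list above, applied to two scalars a and b.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
    "square": lambda a, b: a ** 2,
    "sqrt": lambda a, b: np.sqrt(a),
}


def make_split(op_name, n, low, high, rng, dim=10, split=4):
    """Sample inputs uniformly in [low, high) and build regression targets.

    a and b are sums over two disjoint slices of each input vector.
    """
    x = rng.uniform(low, high, size=(n, dim))
    a = x[:, :split].sum(axis=1)
    b = x[:, split:].sum(axis=1)
    y = OPS[op_name](a, b)[:, None]
    return x, y


rng = np.random.default_rng(0)
# Assumed ranges: training / interpolation on [1, 10), extrapolation on [10, 100).
x_train, y_train = make_split("mul", 1000, 1.0, 10.0, rng)
x_extra, y_extra = make_split("mul", 1000, 10.0, 100.0, rng)

print(x_train.shape, y_train.shape)
print(x_extra.min() >= x_train.max())  # extrapolation inputs lie outside the training range
```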

![Imgur](https://i.imgur.com/SfxbKcM.png)

### MNIST counting and arithmetic tasks

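- 10 MNIST digits are sampled at random
- At the end, the model has to output what it observed across the digits
- MNIST Digit Counting task: count how many of each digit were seen (10-way regression)
- MNIST Digit Addition task: compute the sum of the digits (a linear regression)
- Multiplication probably extrapolates poorly because very small numbers show up (once they land in a denominator, the value blows up)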
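
The two targets can be made concrete with a short sketch. It uses random digit labels as a stand-in for the 10 MNIST images (an assumption, since no image data is loaded here) and only shows how the regression targets are derived.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a sequence of 10 MNIST images: only the digit labels are used.
labels = rng.integers(0, 10, size=10)

# Counting task: one regression target per digit class (10 outputs).
count_target = np.bincount(labels, minlength=10)

# Addition task: a single scalar target, the sum of the observed digits.
sum_target = labels.sum()

print(labels)
print(count_target)
print(sum_target)
```

The sum is a fixed weighted total of the counts (`np.dot(np.arange(10), count_target)`), so both tasks ask the recurrent model to accumulate quantities over the sequence.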

![Imgur](https://i.imgur.com/taGu9I6.png)




### Language to Number translation tasks

- LSTM?
- five hundred and fifteen -> 515
- Experiments cover numbers from 0 to 1000
- Embedding layer -> LSTM -> NAC or NALU
- The plain LSTM overfits
- If the model saw 'eighty one, eighty four, eighty seven' during training, it also gets 'eighty' right straight away!
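
For reference, text/number pairs in the format shown above ("five hundred and fifteen" -> 515) could be produced with a helper like the one below. `number_to_words` is a hypothetical function written for this note; the exact wording rules of the paper's dataset are not given here, so the phrasing for other numbers is an assumption.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..1000 as English words."""
    if not 0 <= n <= 1000:
        raise ValueError("only 0..1000 is supported")
    if n == 1000:
        return "one thousand"
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]
    hundreds, rest = divmod(n, 100)
    head = ONES[hundreds] + " hundred"
    return head if rest == 0 else head + " and " + number_to_words(rest)


# (text, number) pairs covering the whole range used in the experiment.
pairs = [(number_to_words(n), n) for n in range(1001)]

print(number_to_words(515))
print(number_to_words(81), "|", number_to_words(80))
```

Each word then becomes one token for the embedding layer, and the NAC/NALU output is regressed against the integer.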



![Imgur](https://i.imgur.com/0TLT4zQ.png)

### Program evaluation

![Imgur](https://i.imgur.com/rnGmpJY.png)

### Learning to track time in a Grid-World Environment

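- The agent gets a reward if it arrives at the red square exactly at the specified time t
- 5 x 5 grid-world
- 56 x 56 pixel input
- Actions: up, down, left, right, pass
- The experiment is designed so that the target can always be reached within the specified time T
- The model combined with an NAC does well for T <= 19, whereas the A3C model breaks down for T > 13
- Why do the agents eventually fail?
  - For stimuli greater than 12, the baseline agent behaves as if the stimulus were still 12
  - It reaches the destination at t=12 (too early), so it collects less reward on larger stimuli
  - Conversely, when the stimulus is greater than 20, the agent never arrives
  - Extrapolation is somewhat worse than in the other experiments because the model itself still uses an LSTM to encode numeracy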
  

![Imgur](https://i.imgur.com/FIvZ6mX.png)


### MNIST Parity Prediction Task & Ablation Study

- Only the last layer of the existing network is replaced
- Removing the bias and adding the non-linearity made it work better

![Imgur](https://i.imgur.com/Fv2uAID.png)

![Imgur](https://i.imgur.com/icfjGpF.png)

```python
import tensorflow as tf
import numpy as np
from tqdm import tqdm_notebook as tqdm

def NALU(prev_layer, num_outputs):
    eps=1e-7
    shape = (int(prev_layer.shape[-1]),num_outputs)

    # NAC cell
    W_hat = tf.Variable(tf.truncated_normal(shape, stddev=0.02))
    M_hat = tf.Variable(tf.truncated_normal(shape, stddev=0.02))
    W = tf.tanh(W_hat) * tf.sigmoid(M_hat)
    a = tf.matmul(prev_layer, W)
    G = tf.Variable(tf.truncated_normal(shape, stddev=0.02))
    
    # NALU
    m = tf.exp(tf.matmul(tf.log(tf.abs(prev_layer) + eps), W))
    g = tf.sigmoid(tf.matmul(prev_layer, G))
    out = g * a + (1 - g) * m
    return out


arithmetic_functions={
'add': lambda x,y :x+y,
}

def get_data(N, op):
    split = 4
    X_train = np.random.rand(N, 10)*10
    #to be mutually exclusive
    a = X_train[:, :split].sum(1)
    b = X_train[:, split:].sum(1)
    Y_train = op(a, b)[:, None]
    print(X_train.shape)
    print(Y_train.shape)
    
    X_test = np.random.rand(N, 10)*100
    #to be mutually exclusive
    a = X_test[:, :split].sum(1)
    b = X_test[:, split:].sum(1)
    Y_test = op(a, b)[:, None]
    print(X_test.shape)
    print(Y_test.shape)
    
    return (X_train,Y_train),(X_test,Y_test)
```



```python
np.sum(X_train[0])
#Y_train[0]
```

44.01296727592485

```python
tf.reset_default_graph()
train_examples=10000

(X_train,Y_train),(X_test,Y_test)=get_data(train_examples,arithmetic_functions['add'])  
X = tf.placeholder(tf.float32, shape=[train_examples, 10])
Y = tf.placeholder(tf.float32, shape=[train_examples, 1])

X_1=NALU(X,2)
Y_pred=NALU(X_1,1)

# tf.nn.l2_loss returns sum(err ** 2) / 2; the NALU paper uses MSE
loss = tf.nn.l2_loss(Y_pred - Y)
optimizer = tf.train.AdamOptimizer(0.1)
train_op = optimizer.minimize(loss)

with tf.Session() as session:

    session.run(tf.global_variables_initializer())
    for ep in tqdm(range(50000)):
        _,pred,l = session.run([train_op, Y_pred, loss], 
                feed_dict={X: X_train, Y: Y_train})
        if ep % 1000 == 0:
            print('epoch {0}, loss: {1}'.format(ep,l))

    # Evaluate only: train_op is not run here, so the test data does not update the weights
    test_predictions,test_loss = session.run([Y_pred,loss],feed_dict={X:X_test,Y:Y_test})

print(test_loss) # 8.575397e-05 in the original run
```

(10000, 10)
(10000, 1)
(10000, 10)
(10000, 1)


epoch 0, loss: 12629128.0
epoch 1000, loss: 0.0003893437678925693
epoch 2000, loss: 0.00036618439480662346
epoch 3000, loss: 0.000331127637764439
epoch 4000, loss: 0.00028548159752972424
epoch 5000, loss: 0.00023292415426112711
epoch 6000, loss: 0.00017961060802917928
epoch 7000, loss: 0.0002196947461925447
epoch 8000, loss: 8.894406346371397e-05
epoch 9000, loss: 11.192487716674805
epoch 10000, loss: 3.6837584048043936e-05
epoch 11000, loss: 2.1076919438201003e-05
epoch 12000, loss: 1.602229349373374e-05
epoch 13000, loss: 9.249086360796355e-06
epoch 14000, loss: 4.6393128286581486e-05
epoch 15000, loss: 0.0034365211613476276
epoch 16000, loss: 3.094708517892286e-06
epoch 17000, loss: 2.2120443645690102e-06
epoch 18000, loss: 1.7752499843481928e-06
epoch 19000, loss: 1.1439497029641643e-06
epoch 20000, loss: 7.810526767570991e-07
epoch 21000, loss: 5.993938430037815e-07
epoch 22000, loss: 4.33683908340754e-07
epoch 23000, loss: 4.783655640494544e-07
epoch 24000, loss: 5.54413418285548