# Learning Non-linear Features in Neural Networks
---

I recently watched an [interview](https://youtu.be/UMpSrvGB4zs) of Geoffrey Hinton by Andrew Ng. In that interview the question of activation functions came up, and it was mentioned how it took many layers of sigmoid activations to get a ReLU activation. This brought up something I did not have an intuitive handle on: how does a neural network learn arbitrary non-linear functions? How many layers and how much data does it take to learn such a function?

Since I learn by doing, I decided to use this opportunity to play around with TensorFlow to answer these questions.

## The data

The non-linear functions I'm aiming to learn are multiplication and squaring. These are fairly simple and commonly used in feature engineering, so it might be useful to know how much it takes to replicate them.

I'm going to start off with 10,000 training examples. I'm not sure whether I want the dependent variables to be exact or to have some noise, so I will create both.

In [5]:
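import numpy as np

np.random.seed(5)  # fixed seed so the data is reproducible

n = 10000        # number of training examples
noise_std = 10   # standard deviation of the Gaussian noise
# Two input features per example, uniform on [0, 1000); shape (2, n)
X = np.random.random([2, n]).astype(np.float32)*1000
# Multiplication target: product of the two features
y_mul = np.reshape(X[0]*X[1], [1, X.shape[1]])
y_mul_noisy = y_mul + np.random.standard_normal(n)*noise_std
# Squaring target: square of the first feature
y_sq = np.reshape(X[0]**2, [1, X.shape[1]])
y_sq_noisy = y_sq + np.random.standard_normal(n)*noise_std

print(X.shape, y_mul.shape, y_mul_noisy.shape, y_sq.shape)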

(2, 10000) (1, 10000) (1, 10000) (1, 10000)


Now a neural network architecture needs to be chosen. Although it should not work well, I am going with the simplest configuration: a perceptron, or single-layer network. A perceptron used for regression just looks for a linear combination of the inputs that approximates the output. Neither function can really be modeled with a linear combination, but the errors from this model will serve as a benchmark to compare other architectures against.
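Since a perceptron for regression is a linear model, the best it can possibly do on the multiplication target can also be estimated in closed form with ordinary least squares. The following is a minimal sketch of that benchmark, not part of the original notebook: it regenerates data with the same distribution using a hypothetical seed, and the names `A`, `coef`, `rmse`, and `r2` are mine.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, not the notebook's
n = 10000
X = rng.random((2, n)) * 1000   # same layout as the notebook: shape (2, n)
y_mul = X[0] * X[1]             # multiplication target, shape (n,)

# Design matrix with one row per example and a trailing column of ones for the bias.
A = np.column_stack([X.T, np.ones(n)])

# Least-squares solution of A @ coef ~= y_mul.
coef, *_ = np.linalg.lstsq(A, y_mul, rcond=None)

residual = y_mul - A @ coef
rmse = np.sqrt(np.mean(residual ** 2))
r2 = 1 - residual.var() / y_mul.var()

print("weights:", coef[:2], "bias:", coef[2])
print("RMSE:", rmse, "R^2:", r2)
```

For inputs uniform on [0, 1000), the fitted weights should come out near 500 each with a bias near -250000, and R² should land around 0.86. A linear fit captures the overall trend of the product but leaves a large residual, and that residual is the floor any gradient-trained perceptron is working toward.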

In [6]:
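import tensorflow as tf  # TensorFlow 1.x graph-mode API

learning_rate = 0.01
training_epochs = 1000
display_step = 50      # report the cost every 50 epochs

# Placeholders for a batch of inputs (two features each) and targets
X_tf = tf.placeholder("float", [None, X.shape[0]])
Y_tf = tf.placeholder("float", [None, 1])

# One weight per input feature plus a scalar bias, randomly initialized
weights = tf.Variable(tf.random_normal([X.shape[0], 1]))
bias = tf.Variable(tf.random_normal([1]))

print(X_tf.dtype, X_tf.shape, weights.dtype, weights.shape)
print(Y_tf.dtype, Y_tf.shape)

def perceptron(x):
    # Linear combination of the inputs: x @ weights + bias
    y_hat = tf.add(tf.matmul(x, weights), bias)
    return y_hat

Y_hat = perceptron(X_tf)
# Half the summed squared error, divided by the dataset size n
cost = tf.reduce_sum(tf.pow(Y_hat-Y_tf, 2))/(2*n)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
init = tf.global_variables_initializer()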

<dtype: 'float32'> (?, 2) <dtype: 'float32_ref'> (2, 1)
<dtype: 'float32'> (?, 1)
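Written out, the cost and the gradient-descent update defined in the cell above look as follows. The notation is my own: $m$ is the number of rows fed in a given step, $n = 10000$ is the fixed constant from the data cell, and $\eta$ is `learning_rate`.

```latex
\begin{align}
\hat{y}_i &= x_i^{\top} w + b \\
C(w, b) &= \frac{1}{2n} \sum_{i=1}^{m} \left(\hat{y}_i - y_i\right)^2 \\
\frac{\partial C}{\partial w} &= \frac{1}{n} \sum_{i=1}^{m} \left(\hat{y}_i - y_i\right) x_i,
\qquad
\frac{\partial C}{\partial b} = \frac{1}{n} \sum_{i=1}^{m} \left(\hat{y}_i - y_i\right) \\
w &\leftarrow w - \eta \, \frac{\partial C}{\partial w},
\qquad
b \leftarrow b - \eta \, \frac{\partial C}{\partial b}
\end{align}
```

Note that the divisor is always $n$, regardless of how many rows are fed. When a single example is fed ($m = 1$), the step is therefore $1/n$ times the size it would be if the cost were divided by the batch size instead.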


In [7]:
# X.T[0].reshape([1,2])
y_mul.shape

(1, 10000)

In [8]:
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(training_epochs):
        # One optimizer step per training example (n steps per epoch)
        for (x, y) in zip(X.T, y_mul.T):
            sess.run(optimizer, feed_dict={X_tf: x.reshape([1,2]), Y_tf: y.reshape([1,1])})
        # Periodically report the cost over the full dataset
        if (epoch+1) % display_step == 0:
            c = sess.run(cost, feed_dict={X_tf: X.T, Y_tf: y_mul.T})
            print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(c),
                  "W=", sess.run(weights), "b=", sess.run(bias))

KeyboardInterrupt: 
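The loop above issues one `sess.run` call per example, which is 10,000 calls per epoch, and the cell ends in a `KeyboardInterrupt`. As a rough sketch of an alternative, here is the same linear model trained with full-batch gradient descent in plain NumPy. This is not what the notebook does: it swaps per-example updates for whole-dataset updates, and it standardizes the inputs and the target (an assumption of mine) so that a single learning rate behaves well for every parameter. The seed, `Xs`, `ys`, and `final_cost` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)             # hypothetical seed, not the notebook's
n = 10000
X = rng.random((n, 2)) * 1000              # one row per example
y = (X[:, 0] * X[:, 1]).reshape(n, 1)      # multiplication target

# Assumption: standardize inputs and target (the notebook trains on raw values).
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()

w = np.zeros((2, 1))
b = 0.0
learning_rate = 0.1

for epoch in range(200):
    err = Xs @ w + b - ys                  # residuals, shape (n, 1)
    # Gradients of sum(err**2) / (2 * n) with respect to w and b
    w -= learning_rate * (Xs.T @ err) / n
    b -= learning_rate * err.mean()

# Same form of cost as in the notebook, evaluated on the standardized data
final_cost = np.sum((Xs @ w + b - ys) ** 2) / (2 * n)
print("w:", w.ravel(), "b:", b, "cost:", final_cost)
```

Each epoch here is a couple of matrix products rather than 10,000 separate graph executions. The cost should level off at a value clearly above zero, which is the expected outcome: no choice of weights lets a linear model represent a product of its inputs.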