I wanted to see whether I could train a neural network to model the function below.

$ y = x_{0}^2 + \sin(x_{1}) - x_{2} + 5 $

In [180]:
import tensorflow as tf
import numpy as np
import random
np.set_printoptions(suppress=True) # B/c I hate looking at Numpy's scientific notation when printing matrix values LOL

First, we need to create the synthetic dataset. We'll have two matrices: one that contains all of the examples (3 inputs each), and one that contains the corresponding output values.

In [308]:
numExamples = 100
numVariables = 3
maxValue = 25
outputValues = 1

allX = np.asarray([np.random.randint(maxValue, size=numVariables) for i in range(numExamples)])
allY = np.asarray([x[0]**2 + np.sin(x[1]) - x[2] + 5 for x in allX]).reshape(numExamples, outputValues)

Visualizing the shapes of our training pairs.

In [309]:
print("Our X matrix shape: {}".format(allX.shape))
print("Our Y matrix shape: {}".format(allY.shape))

Our X matrix shape: (100, 3)
Our Y matrix shape: (100, 1)


Sample input/output pair.

In [310]:
print("X input: {}".format(allX[0]))
print("Expected Y output: {}".format(allY[0]))

X input: [20  5  7]
Expected Y output: [397.04107573]
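
As a quick sanity check, the sample pair above can be recomputed by hand from the target function. This is only a minimal sketch; the helper name `target_fn` is hypothetical and not part of the notebook.

```python
import numpy as np

def target_fn(x):
    """y = x0^2 + sin(x1) - x2 + 5 (hypothetical helper mirroring the formula above)."""
    return x[0] ** 2 + np.sin(x[1]) - x[2] + 5

sample = np.array([20, 5, 7])
y_sample = target_fn(sample)

# 20**2 = 400 and sin(5) is roughly -0.9589, so y is roughly 400 - 0.9589 - 7 + 5
print("y = %.8f" % y_sample)
```

The printed value should agree with the `Expected Y output` shown above.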


Simple one hidden layer neural network.

In [311]:
numHiddenUnits = 10

x = tf.placeholder(tf.float32, shape=[None, numVariables])
y = tf.placeholder(tf.float32, shape=[None, outputValues])

W1 = tf.Variable(tf.truncated_normal([numVariables, numHiddenUnits], stddev=0.1))
# One bias per unit: the shape goes to tf.constant (tf.Variable's second positional argument is `trainable`)
B1 = tf.Variable(tf.constant(0.1, shape=[numHiddenUnits]))
W2 = tf.Variable(tf.truncated_normal([numHiddenUnits, outputValues], stddev=0.1))
B2 = tf.Variable(tf.constant(0.1, shape=[outputValues]))

H1preRelu = tf.matmul(x,W1) + B1
H1 = tf.nn.relu(H1preRelu)
yLogits = tf.nn.relu(tf.matmul(H1,W2) + B2)
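
To make the shapes in this graph easier to follow, here is a rough NumPy sketch of the same forward pass. It is an illustration under a few assumptions: the variable names are my own, the batch has only 4 examples, and the weights are drawn from a plain (not truncated) normal distribution.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed, only for reproducibility
num_examples, num_variables, num_hidden, num_outputs = 4, 3, 10, 1

X = rng.randint(25, size=(num_examples, num_variables)).astype(np.float32)

# Assumption: plain normal init with stddev 0.1 stands in for tf.truncated_normal
W1 = (0.1 * rng.randn(num_variables, num_hidden)).astype(np.float32)
b1 = np.full(num_hidden, 0.1, dtype=np.float32)
W2 = (0.1 * rng.randn(num_hidden, num_outputs)).astype(np.float32)
b2 = np.full(num_outputs, 0.1, dtype=np.float32)

h1_pre = X.dot(W1) + b1                    # shape (4, 10): hidden pre-activations
h1 = np.maximum(h1_pre, 0.0)               # ReLU zeroes out negative entries
preds = np.maximum(h1.dot(W2) + b2, 0.0)   # shape (4, 1): output also passes through ReLU

print("h1_pre: {}, h1: {}, preds: {}".format(h1_pre.shape, h1.shape, preds.shape))
```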

Traditional MSE loss.

In [318]:
loss = tf.reduce_mean((yLogits - y)**2)
opt = tf.train.GradientDescentOptimizer(learning_rate = .01).minimize(loss)
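
For reference, the loss above and its gradient with respect to the predictions can be written out in NumPy. This is a small sketch with made-up labels (not the notebook's data), and the helper names `mse` and `mse_grad` are hypothetical. It shows that when the network predicts 0 everywhere, the loss is just the mean of the squared labels, and the gradient grows linearly with the size of the error.

```python
import numpy as np

def mse(preds, labels):
    """Mean squared error, the NumPy analogue of tf.reduce_mean((preds - labels)**2)."""
    return np.mean((preds - labels) ** 2)

def mse_grad(preds, labels):
    """Gradient of the MSE with respect to each prediction: 2 * (pred - label) / N."""
    return 2.0 * (preds - labels) / preds.size

labels = np.array([[397.0], [13.0], [5.0]])  # illustrative labels only
preds = np.zeros_like(labels)                # a network that outputs 0 for every example

print("loss: {}".format(mse(preds, labels)))
print("gradient w.r.t. predictions: {}".format(mse_grad(preds, labels).ravel()))
```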

One of the interesting problems I came across while doing this is that the network fails to converge when the initial loss is very large. Let's look at what happens below during the first 6 iterations (0 through 5). Run the cell below, and notice the values of the weights at the beginning. They're all centered around 0, with a good balance of positive and negative weights. Notice that the loss at the beginning is an extremely large number. This means that the derivative of the loss with respect to the weights is also likely to be large, which results in large weight updates in the negative direction.

Notice how the weight values change from iteration 0 to 1 to 2, and so on. You'll see that almost all of them become negative (and some of them are very large in magnitude). Then take a look at the H1 values after the matmul. Many of those are negative, so when we apply our activation function (ReLU in this case), we get outputs of all zeros. A unit whose ReLU output is zero also passes back a zero gradient, so its weights stop updating. The predictions therefore stay at 0, the loss stays extremely high, and the network can't converge.
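
The mechanism is easier to see on a single ReLU unit with one weight. The sketch below is a toy stand-in, not the network above: the input/label pair, the starting weight, and the variable names are all chosen for illustration. With the same learning rate of 0.01, the first update overshoots, the second throws the weight far into the negatives, and from then on the ReLU outputs 0 and the gradient is 0, so the weight never moves again.

```python
x, y = 20.0, 397.0     # one illustrative input/label pair
w = 0.1                # small initial weight, similar in scale to the initialization above
learning_rate = 0.01

history = []
for step in range(4):
    z = w * x                  # pre-activation
    pred = max(z, 0.0)         # ReLU
    loss = (pred - y) ** 2
    # d(loss)/dw = 2 * (pred - y) * x while the unit is active, and 0 once ReLU clips it
    grad = 2.0 * (pred - y) * x if z > 0 else 0.0
    history.append((w, pred, loss, grad))
    print("step %d: w=%10.2f  pred=%10.2f  loss=%14.2f  grad=%12.2f" % (step, w, pred, loss, grad))
    w -= learning_rate * grad
```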

In [319]:
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

trainingIterations = 10000

for i in range(trainingIterations):
    # Note: `opt` is only run inside the two branches below, so the weights are updated on those iterations only.
    if i <= 5:
        # Run one training step and fetch the intermediate values for inspection
        _, trainingLoss, curW1, curH1, curH1pre, curPreds = sess.run([opt, loss, W1, H1, H1preRelu, yLogits], feed_dict={x: allX, y: allY})
        print("====================== Iteration %d ======================" % i)
        print("Current W1 weights (first dim):")
        print("{}\n".format(curW1[0]))
        print("Current H1 (pre-RELU) values:")
        print("{}\n".format(curH1pre[0]))
        print("Current H1 (post-RELU) values:")
        print("{}\n".format(curH1[0]))
        print("Current loss: {}\n".format(trainingLoss))
    if i % 1000 == 0 and i != 0:
        _, trainingLoss, curPreds = sess.run([opt, loss, yLogits], feed_dict={x: allX, y: allY})
        print("========= Iteration %d, Training Loss %g =========" % (i, trainingLoss))
        # Index down to scalars so %g does not rely on implicit array-to-float conversion
        print("Prediction: %g" % curPreds[0, 0])
        print("Label: %g" % allY[0, 0])

Current W1 weights (first dim):
[-0.09582855 -0.01781602  0.00541513  0.14701568  0.04391215  0.15258543
 -0.08114988  0.02420023 -0.0397035  -0.09304301] 

Current H1 (pre-RELU) values:
[-2.283804    1.5308748  -0.9513644   3.6935768   1.5965968   4.8146434
 -1.4702303  -0.47646996 -0.8460771  -1.2214923 ] 

Current H1 (post-RELU) values:
[0.        1.5308748 0.        3.6935768 1.5965968 4.8146434 0.
 0.        0.        0.       ] 

Current loss: 70307.52 

Current W1 weights (first dim):
[-0.09582855 -0.01781602  0.00541513  0.14701568  0.04391215  0.15258543
 -0.08114988  0.02420023 -0.0397035  -0.09304301] 

Current H1 (pre-RELU) values:
[-2.283804    1.5308748  -0.9513644   3.6935768   1.5965968   4.8146434
 -1.4702303  -0.47646996 -0.8460771  -1.2214923 ] 

Current H1 (post-RELU) values:
[0.        1.5308748 0.        3.6935768 1.5965968 4.8146434 0.
 0.        0.        0.       ] 

Current loss: 70307.52 

Current W1 weights (first dim):
[-0.09582855 -0.01781602  0.00541513  

Now, let's see what happens when we divide the loss by a constant factor (100 here) in order to decrease the magnitude of the gradients and weight updates.

In [332]:
loss = tf.reduce_mean(((yLogits - y)**2)/100) ###### LINE WITH THE CHANGE ######

opt = tf.train.GradientDescentOptimizer(learning_rate = .01).minimize(loss)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

trainingIterations = 10000

for i in range(trainingIterations):
    # Note: `opt` is only run on the logged iterations (every 1000th), so the weights are updated on those iterations only.
    if i % 1000 == 0 and i != 0:
        _, trainingLoss, curPreds = sess.run([opt, loss, yLogits], feed_dict={x: allX, y: allY})
        print("========= Iteration %d, Training Loss %g =========" % (i, trainingLoss))
        # Index down to scalars so %g does not rely on implicit array-to-float conversion
        print("Prediction: %g" % curPreds[0, 0])
        print("Label: %g" % allY[0, 0])

Prediction: 0.157072
Label: 397.041
Prediction: 2.44362
Label: 397.041
Prediction: 9.50603
Label: 397.041
Prediction: 35.5889
Label: 397.041
Prediction: 116.434
Label: 397.041
Prediction: 242.384
Label: 397.041
Prediction: 211.31
Label: 397.041
Prediction: 298.193
Label: 397.041
Prediction: 259.967
Label: 397.041


As you can see above, the network is definitely training and getting closer to the output value. So, the scale of your initial loss (and therefore of the first gradients) seems to be very important to whether or not your network will converge. This is a toy example, but it is useful to keep in mind. The two fixes that immediately come to mind are:

1) Decreasing the learning rate so you get smaller weight updates.

2) Dividing the loss function by some constant to make optimization easier.
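
For plain gradient descent, these two fixes amount to the same thing: each update is `learning_rate * gradient`, so dividing the loss by 100 has the same effect as dividing the learning rate by 100. (Adaptive optimizers such as Adam rescale the gradients, so this equivalence should not be assumed there.) The sketch below checks this on a toy single-weight ReLU unit; the `train` helper, the input/label pair, and the step count are all assumptions made for illustration, not part of the notebook.

```python
def train(loss_scale, learning_rate, steps=200, w=0.1, x=20.0, y=397.0):
    """Gradient descent on loss_scale * (relu(w * x) - y)**2 for a single weight (toy helper)."""
    for _ in range(steps):
        z = w * x
        pred = max(z, 0.0)
        # Gradient is zero whenever the ReLU is clipped
        grad = loss_scale * 2.0 * (pred - y) * x if z > 0 else 0.0
        w -= learning_rate * grad
    return w

w_scaled_loss = train(loss_scale=1.0 / 100, learning_rate=0.01)  # fix 2: divide the loss
w_small_lr = train(loss_scale=1.0, learning_rate=0.0001)         # fix 1: shrink the learning rate
w_unscaled = train(loss_scale=1.0, learning_rate=0.01)           # original setting

print("scaled loss:   w = %.6f" % w_scaled_loss)
print("small lr:      w = %.6f" % w_small_lr)
print("unscaled loss: w = %.6f" % w_unscaled)
print("ideal weight:  w = %.6f" % (397.0 / 20.0))
```

In this toy setting the first two runs settle on essentially the same weight, close to the ideal value, while the unscaled run ends up stuck at a negative weight where the ReLU output (and gradient) is zero.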

Anything else?