In [3]:
import tensorflow as tf
import numpy as np
def loadData(fileName):
    with np.load(fileName) as data:
        Data, Target = data["images"], data["labels"]
        np.random.seed(521)
        randIdx = np.arange(len(Data))
        np.random.shuffle(randIdx)
        Data = Data[randIdx]/255
        Target = Target[randIdx]
        trainData, trainTarget = Data[:15000], Target[:15000]
        validData, validTarget = Data[15000:16000], Target[15000:16000]
        testData, testTarget = Data[16000:], Target[16000:]
    return trainData, trainTarget, validData, validTarget, testData, testTarget

  from ._conv import register_converters as _register_converters


In [4]:
fileName = "notMNIST.npz"
trainData, trainTarget, validData, validTarget, testData, testTarget = loadData(fileName)

trainDataSize = len(trainData)
validDataSize = len(validData)
testDataSize = len(testData)

print("Train Data Size is %d, Valid Data Size is %d, Test Data Size is %d" 
      %(trainDataSize, validDataSize, testDataSize))

print(trainTarget)

Train Data Size is 15000, Valid Data Size is 1000, Test Data Size is 2724
[5 9 9 ... 2 0 9]


1.1 Feedforward fully connected neural networks
===
Implement a simple neural network with one hidden layer and 1000 hidden units. Train your neural network on the entire notMNIST training set of ten classes. Because the neural network loss functions are non-convex, a proper weights initialization scheme is crucial to prevent vanishing gradient during back-propagation as a result of learning stuck at a plateau at the beginning of the training. You will use the Xavier initialization to initialize the weight matrices for all the neural networks in this assignment. That is, each weight matrix is initialized from zero-mean independent Gaussians whose variance is 3/(#input_units + #output_units). Unlike the weight matrices, the bias units will be initalized to zero.

1 layer-wise building block
===
Write a vectorized Tensorflow Python function that takes the hidden activations from the preivous layer then return the weighted sum of the inputs (i.e. the z) for the current hidden layer. You will also initailize the weight matrix and the biases in the same function. You should use Xavier initialization for the weight matrix. Your function should be able to compute the weighted sum for all the data points in your mini-batch at once using matrix multiplication. It should not contain loops over the training exmaples in the mini-batch. The function should accept two arguments, the input tensor and the number of the hidden units. Include the snippets of the Python code.



In [5]:
def layerWiseBuildingBlock(X, numHiddenUnits):
    """Takes the hidden activations from the previous layer then return the weighted sum
    of the inputs for the current hidden layer"""
    # INPUT: input tensor and the number of the hidden units
    # Output: the weighted sum of the inputs for the current hidden layer
    prevDim = tf.to_int32(X.get_shape()[1])
    std_dev = tf.to_float(3/(prevDim + numHiddenUnits))
    
    # Variable Creation
    S = tf.placeholder(tf.float32, [None, numHiddenUnits])
    W = tf.Variable(tf.truncated_normal(shape=[prevDim,numHiddenUnits], stddev=std_dev))
    b = tf.Variable(0.0)
    
    # Graph definition
    S = tf.matmul(X, W) + b  # dim is [None, numHiddenUnits]
    
    return S, W

In [6]:
def test():
    c = tf.constant([[1,2],[3,4]], dtype=tf.float32)
    print(sess.run(c))

    prevDim = tf.shape(c)[1]
    print(sess.run(prevDim))

    layerWiseBuildingBlock(c, 1000)

    sess = tf.Session()
    init = tf.global_variables_initializer()
    sess.run(init)


2 Learning
===
Use your function from the previous question to build your neural network model with ReLU activation functions in TensorFlow and tf.nn.relu can be useful. For training your network, you are supposed to find a reasonable value for your learning rate. You should train your neural network for different values of learning rate and choose the one that gives you the fastest convergence in terms of the training loss function. (You might want to "babysit" your experiments and terminate a particular run prematurely as soon as you find out that the learning rate value is not very good.) Trying 3 different values should be enough. You may also find it useful to apply a small amount of weight decay to prevent overfitting. (e.g. lambda=3e-4). On the training set, validation set and test set, record your classification erros and cross-entropy losses after each epoch. Plot the training, validation, and test classification error vs. the number of epochs. Make a second plot for the cross-entropy loss vs. the number of epochs. Comment on  your observations.

In [7]:
def buildGraph(numLayers, numHiddenUnits, learningRate):
    """Build neural network model with ReLU activation functions"""
    
    # Variable creation
    X = tf.placeholder(tf.float32, [None, 28, 28], name='input_x')
    X_flatten = tf.reshape(X, [-1, 28*28])
    y_target = tf.placeholder(tf.float32, name='target_y')
    y_onehot = tf.one_hot(tf.to_int32(y_target), 10, 1.0, 0.0, axis=-1)
    Lambda = tf.placeholder("float32", name='Lambda')
    
    # Graph definition
    # Input <=> Hidden
    S1, W1 = layerWiseBuildingBlock(X_flatten, numHiddenUnits)
    thetaS1 = tf.nn.relu(S1)
    
    # Hidden <=> Output
    S2, W2 = layerWiseBuildingBlock(thetaS1, 10)
    #thetaS2 = tf.nn.relu(S2)
    
    # Final output layer
    y_logit = tf.nn.relu(S2)
    y_predicted = tf.nn.softmax(y_logit)
    
    # Error and accuracy definition
    crossEntropyError = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
                                      labels=y_onehot, logits=y_logit),
                                      name='mean_cross_entropy')
    acc = tf.reduce_mean(tf.to_float(tf.equal(tf.argmax(y_predicted, -1),
                                             tf.to_int64(y_target))))
    weightLoss = (tf.reduce_sum(W1*W1) + tf.reduce_sum(W2*W2)) * Lambda * 0.5
    loss = crossEntropyError + weightLoss
    
    # Training mechanism
    optimizer = tf.train.AdamOptimizer(learning_rate=learningRate)
    train = optimizer.minimize(loss=loss)
    
    return X, y_target, y_predicted, crossEntropyError, train, Lambda, acc

In [None]:
# SGD Implementation
# Run over 100~200 epoches
B = 500
max_iter = 50000
wd_lambda = 0.0
numBatches = np.floor(len(trainData)/B)

trainLoss_list = []
validLoss_list = []
testLoss_list = []

trainAcc_list = []
validAcc_list = []
testAcc_list = []

numLayers=1
numHiddenUnits = 1000
learningRate = 0.001
X, y_target, y_predicted, crossEntropyError, train, Lambda, acc = buildGraph(numLayers, numHiddenUnits, learningRate)

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

for step in range(0, max_iter+1):
    if step % numBatches == 0:
        # Sample minibatch without replacement
        randIdx = np.arange(len(trainData))
        np.random.shuffle(randIdx)
        trainData = trainData[randIdx]
        trainTarget = trainTarget[randIdx]
        i = 0  # cyclic index for mini-batch
        
#         # storing MSE and Acc for the three datasets every epoch
#         err = meanSquaredError.eval(feed_dict={X: trainData, y_target: trainTarget})
#         acc = np.mean((y_predicted.eval(feed_dict={X: trainData}) > 0.5) == trainTarget)
#         trainLoss_list.append(err)
#         trainAcc_list.append(acc)
        
#         err = meanSquaredError.eval(feed_dict={X: validData, y_target: validTarget})
#         acc = np.mean((y_predicted.eval(feed_dict={X: validdata}) > 0.5) == validTarget)
#         validLoss_list.append(err)
#         validAcc_list.append(acc)
        
#         err = meanSquaredError.eval(feed_dict={X: testData, y_target: testTarget})
#         acc = np.mean((y_predicted.eval(feed_dict={X: testData}) > 0.5) == testTarget)
#         testLoss_list.append(err)
#         testAcc_list.append(acc)
        
    # Slicing a mini-batch from the whole training dataset
    feeddict = {X: trainData[i*B:(i+1)*B], y_target: trainTarget[i*B:(i+1)*B],
               Lambda: wd_lambda}
    
    # Update model parameters
    _, err, yhat = sess.run([train, crossEntropyError, y_predicted], feed_dict=feeddict)
    
    # storing weights every iteration
    # wList.append(currentW)
    i += 1
    
    # displaying training MSE error every 100 iterations
    if not (step % 100):
        print("Iter: %3d, CrossEntropyError: %4.2f" % (step, err))
        
        

Iter:   0, CrossEntropyError: 2.30
Iter: 100, CrossEntropyError: 0.69
Iter: 200, CrossEntropyError: 0.74
Iter: 300, CrossEntropyError: 0.73
Iter: 400, CrossEntropyError: 0.65
Iter: 500, CrossEntropyError: 0.64
Iter: 600, CrossEntropyError: 0.66
Iter: 700, CrossEntropyError: 0.70
Iter: 800, CrossEntropyError: 0.62
Iter: 900, CrossEntropyError: 0.66
Iter: 1000, CrossEntropyError: 0.65
Iter: 1100, CrossEntropyError: 0.68
Iter: 1200, CrossEntropyError: 0.65
Iter: 1300, CrossEntropyError: 0.68
Iter: 1400, CrossEntropyError: 0.71
Iter: 1500, CrossEntropyError: 0.60
Iter: 1600, CrossEntropyError: 0.67
Iter: 1700, CrossEntropyError: 0.64
Iter: 1800, CrossEntropyError: 0.62
Iter: 1900, CrossEntropyError: 0.61
Iter: 2000, CrossEntropyError: 0.71
Iter: 2100, CrossEntropyError: 0.65
Iter: 2200, CrossEntropyError: 0.70
Iter: 2300, CrossEntropyError: 0.70
Iter: 2400, CrossEntropyError: 0.64
Iter: 2500, CrossEntropyError: 0.70
Iter: 2600, CrossEntropyError: 0.66
Iter: 2700, CrossEntropyError: 0.63
It

3 Early stopping
===
Early stopping is the simplest procedure to avoid overfitting. Determine and highlight the early stopping point on the classification error plot from question 1.1.2, and report the training, validation and test classification error at the early stopping point. Are the early stopping points the same on the two plots? Why or why not? Which plot should be used for early stopping, and why?