# Practical Session 5: Deep Learning

In this practical, we will continue from where the lecture left off and learn more about using TensorFlow. 

The practical will cover a few different network architectures and we will look at different components that are often used in neural networks.

To start off, let's import TensorFlow into our notebook.

In [1]:
import tensorflow as tf

tf.__version__

'1.14.0'

Note: TensorFlow v2 was only released recently.  The v1 API is still available as a submodule, although we won't be using that in this notebook:

In [2]:
tf.compat.v1

<module 'tensorflow._api.v2.compat.v1' from '/home/dks28/.local/lib/python3.6/site-packages/tensorflow/_api/v2/compat/v1/__init__.py'>

## Minimal TensorFlow Example

This is the first example from the lecture. We first create a network that takes an input vector, multiplies it by a weight matrix, adds a weight vector, and returns the result.

`tf.Variable` defines model parameters, which can be trained (as we will see shortly).  Here, we initialise the matrix variable as a 3x3 matrix, with every entry as 1.  Meanwhile, we initialise the vector variable with every entry as 0.
`tf.linalg.matvec` multiplies a matrix and a vector.

In [2]:
weight_matrix = tf.Variable(tf.ones(shape=(3,3)))
weight_vector = tf.Variable(tf.zeros(shape=(3,)))

def affine_transformation(input_vector):
    return tf.linalg.matvec(weight_matrix, input_vector) + weight_vector

result = affine_transformation([2.,3.,7.])
print(result)

tf.Tensor([12. 12. 12.], shape=(3,), dtype=float32)
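
As a sanity check, the same affine transformation can be worked out by hand. The sketch below uses plain Python lists rather than TensorFlow, and the helper name `matvec` is ours (it is not a TensorFlow function). Each row of the all-ones matrix sums the input, giving 2 + 3 + 7 = 12, and adding the zero vector leaves this unchanged.

```python
# Plain-Python check of the affine transformation W x + b (no TensorFlow needed).
# `matvec` is a hypothetical helper written only for this sketch.

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector (a list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

weights = [[1.0, 1.0, 1.0] for _ in range(3)]   # 3x3 matrix of ones
bias = [0.0, 0.0, 0.0]                          # vector of zeros
x = [2.0, 3.0, 7.0]

result = [wx + b for wx, b in zip(matvec(weights, x), bias)]
print(result)  # [12.0, 12.0, 12.0]
```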


## Training the Parameters

This is the second example from the lecture, showing how to optimise the parameters in your model.

We first define a network that takes an input vector, applies the affine transformation defined above, and sums the elements of the resulting vector (using `tf.math.reduce_sum`).  We then define a loss function, the squared error.  Given a specific input and target output, we can calculate the loss of applying the network to the input.

Next, we define an optimiser &ndash; here, we are using stochastic gradient descent (SGD) with learning rate 0.001.  We then use this optimiser to train this network for 10 epochs, over this single training point.  This optimises the output towards the target value 20.  Printing out the results, we can see that the output gradually moves towards the target.

In [3]:
def network(input_vector):
    return tf.math.reduce_sum(affine_transformation(input_vector))

def loss_fn(predicted, gold):
    return tf.square(predicted - gold)

input = [2.,3.,7.]
gold_output = 20

def loss():
    return loss_fn(network(input), gold_output)

opt = tf.keras.optimizers.SGD(learning_rate=1e-3)

for epoch in range(10):
    opt.minimize(loss, var_list=[weight_matrix, weight_vector])
    print(network(input))

tf.Tensor(29.952, shape=(), dtype=float32)
tf.Tensor(26.190144, shape=(), dtype=float32)
tf.Tensor(23.850271, shape=(), dtype=float32)
tf.Tensor(22.39487, shape=(), dtype=float32)
tf.Tensor(21.489607, shape=(), dtype=float32)
tf.Tensor(20.926535, shape=(), dtype=float32)
tf.Tensor(20.576305, shape=(), dtype=float32)
tf.Tensor(20.358461, shape=(), dtype=float32)
tf.Tensor(20.222961, shape=(), dtype=float32)
tf.Tensor(20.138683, shape=(), dtype=float32)
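
To see where these numbers come from, the update can be reproduced by hand. The following is a minimal plain-Python sketch (not the TensorFlow implementation) that assumes the same starting point as above: a matrix of ones, a vector of zeros, and plain gradient descent on the squared error. For the network output $s$, the gradient with respect to matrix entry $W_{ij}$ is $2(s - 20)\,x_j$ and with respect to each vector entry is $2(s - 20)$. With this input, every step therefore shrinks the error $s - 20$ by the constant factor $1 - 2 \cdot 0.001 \cdot 3 \cdot (2^2 + 3^2 + 7^2 + 1) = 0.622$.

```python
# Hand-written gradient descent for s = sum(W x + b), loss = (s - target)^2.
# All names here are local to this sketch.

x = [2.0, 3.0, 7.0]
W = [[1.0] * 3 for _ in range(3)]   # same initial values as weight_matrix
b = [0.0] * 3                       # same initial values as weight_vector
target = 20.0
lr = 1e-3

def forward(W, b, x):
    """Sum of all elements of W x + b."""
    return sum(sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b))

outputs = []
for epoch in range(10):
    err = forward(W, b, x) - target          # s - target
    for i in range(3):
        for j in range(3):
            W[i][j] -= lr * 2 * err * x[j]   # d(loss)/dW_ij = 2 * err * x_j
        b[i] -= lr * 2 * err                 # d(loss)/db_i  = 2 * err
    outputs.append(forward(W, b, x))

print(outputs[0])   # approximately 29.952, matching the first line printed above
```

Under these assumptions the sequence agrees with the printed TensorFlow outputs up to floating-point precision.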


## Network Layers

In most cases, we don't actually need to create the trainable variables manually. Instead, the feedforward layer is available as a pre-defined module.

We can define a network as a sequence of operations, using `tf.keras.Sequential`.  The first operation here is a dense feedforward layer (`tf.keras.layers.Dense`), which acts like the `affine_transformation` function we defined earlier.  The second operation sums the elements of the vector &ndash; this isn't a standard operation, so we have used `tf.keras.layers.Lambda` to allow a user-defined function.

By default, the parameters in a layer (like `tf.keras.layers.Dense`) are initialised randomly.

In [4]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(3, input_shape=(3,)),
    tf.keras.layers.Lambda(lambda x: tf.math.reduce_sum(x, axis=1))
])

Note that such a model expects the input data to be given as a *minibatch* &ndash; this means that the input tensor should have an extra index, which ranges over datapoints.  In our case, instead of passing a 3-dimensional input vector, we have to pass an Nx3 matrix, where N is the number of datapoints.  Here, we can apply the model to a single datapoint (a 1x3 matrix):

In [5]:
model.predict(tf.constant([[2.,3.,7.]]))

array([13.488218], dtype=float32)

Now that we have a model defined in terms of layers, let's replace the manually created variables of the previous section.

In [7]:
def loss_fn(predicted, gold):
    return tf.square(predicted - gold)

input = tf.constant([[2.,3.,7.]])
gold_output = 20

def loss():
    return loss_fn(model(input), gold_output)

opt = tf.keras.optimizers.SGD(learning_rate=1e-3)

for epoch in range(10):
    opt.minimize(loss, var_list=model.trainable_variables)
    print(model(input))

tf.Tensor([9.799363], shape=(1,), dtype=float32)
tf.Tensor([13.655205], shape=(1,), dtype=float32)
tf.Tensor([16.053537], shape=(1,), dtype=float32)
tf.Tensor([17.545301], shape=(1,), dtype=float32)
tf.Tensor([18.473175], shape=(1,), dtype=float32)
tf.Tensor([19.050314], shape=(1,), dtype=float32)
tf.Tensor([19.409298], shape=(1,), dtype=float32)
tf.Tensor([19.632582], shape=(1,), dtype=float32)
tf.Tensor([19.771467], shape=(1,), dtype=float32)
tf.Tensor([19.857853], shape=(1,), dtype=float32)


In fact, for standard optimizers and loss functions, the TensorFlow API makes it even easier for us:

In [8]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(3, input_shape=(3,)),
    tf.keras.layers.Lambda(lambda x: tf.math.reduce_sum(x))
])

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              loss='mean_squared_error')

input = tf.constant([[2.,3.,7.]])
gold_output = tf.constant([[20.]])

for epoch in range(10):
    model.train_on_batch(input, gold_output)
    print(model(input))

tf.Tensor(8.528937, shape=(), dtype=float32)
tf.Tensor(12.864999, shape=(), dtype=float32)
tf.Tensor(15.562031, shape=(), dtype=float32)
tf.Tensor(17.239582, shape=(), dtype=float32)
tf.Tensor(18.28302, shape=(), dtype=float32)
tf.Tensor(18.93204, shape=(), dtype=float32)
tf.Tensor(19.335726, shape=(), dtype=float32)
tf.Tensor(19.586824, shape=(), dtype=float32)
tf.Tensor(19.743006, shape=(), dtype=float32)
tf.Tensor(19.840149, shape=(), dtype=float32)


## Activation Functions

As we saw in the lecture, activation functions are what gives neural networks their power to model non-linear patterns in the data.  After applying an affine transformation, we then apply a non-linear activation function to each element.

There are a number of different activation functions to choose from.

The [**sigmoid** function](https://en.wikipedia.org/wiki/Logistic_function), also known as the logistic function, transforms the value into the range between 0 and 1.

In [9]:
tf.keras.layers.Dense(100, activation='sigmoid')

<tensorflow.python.keras.layers.core.Dense at 0x7f6e306cca58>

The [**tanh** function](https://en.wikipedia.org/wiki/Hyperbolic_function) has a similar shape to the sigmoid function, but transforms the value into the range between -1 and 1.

In [10]:
tf.keras.layers.Dense(100, activation='tanh')

<tensorflow.python.keras.layers.core.Dense at 0x7f6e306cc550>

The <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">**Rectified Linear Unit** function</a>, or ReLU, is the identity for positive values, but maps all negative values to 0.

In [11]:
tf.keras.layers.Dense(100, activation='relu')

<tensorflow.python.keras.layers.core.Dense at 0x7f6e3957e5f8>
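
To make the three element-wise activation functions concrete, here is a small plain-Python sketch of what each one computes for a single value. The function names `sigmoid` and `relu` are defined here for illustration; in a model you would simply pass the `activation` argument as shown above.

```python
import math

def sigmoid(z):
    """Logistic function: squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified Linear Unit: identity for positive z, 0 otherwise."""
    return max(0.0, z)

# math.tanh is the standard-library hyperbolic tangent, with range (-1, 1).
for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), math.tanh(z), relu(z))
```

At `z = 0` the sigmoid returns 0.5 and tanh returns 0, which shows the difference in where the two curves are centred.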

For classification tasks, an important activation function is the [**softmax**](https://en.wikipedia.org/wiki/Softmax_function).  This is unlike the activation functions mentioned above, because it isn't applied to each element separately.  It converts a vector of scores into a probability distribution &ndash; after applying the softmax, all values are between 0 and 1, and together they sum to 1.  Higher scores are assigned higher probabilities, via the formula:

$P(i) \propto \exp(x_i)$

Or, more explicitly (notice how the value of the denominator depends on all other values):

$P(i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$
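
The formula can be written out directly. This is a plain-Python sketch rather than TensorFlow's implementation; subtracting the maximum score before exponentiating is a common numerical-stability trick that we add here, and it does not change the result because the same factor cancels in the numerator and the denominator.

```python
import math

def softmax(scores):
    """Convert a list of scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # shift for numerical stability
    total = sum(exps)                            # denominator depends on all scores
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)        # roughly [0.090, 0.245, 0.665]
print(sum(probs))   # 1.0, up to floating-point rounding
```

Note that adding the same constant to every score leaves the probabilities unchanged, so only the differences between scores matter.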

The softmax is often used in the output layer of a network performing classification, in order to predict a probability distribution over all the possible classes.  For example, the following model takes a 20-dimensional input, maps it to a 50-dimensional hidden layer, then maps that to a distribution over 10 output classes.

In [6]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, input_shape=(20,), activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

## Operations and Useful Functions

TensorFlow has corresponding versions of all the main operations you might want to use. This means you can add them into your computation graph and into your neural network.  The most common operations are available in `tf`, and further operations are available in `tf.math`.

In [13]:
tf.abs                 # absolute value
tf.negative            # computes the negative value
tf.sign                # returns 1, 0 or -1 depending on the sign of the input
tf.math.reciprocal     # reciprocal 1/x
tf.square              # return input squared
tf.round               # return rounded value
tf.sqrt                # square root
tf.math.rsqrt          # reciprocal of square root
tf.pow                 # power
tf.exp                 # exponential

<function tensorflow.python.ops.gen_math_ops.exp(x, name=None)>

These operations can be applied to scalar values, but also to vectors, matrices and higher-order tensors. In the latter case, they will be applied element-wise. For example:

In [14]:
print(tf.negative([3.2,-2.7]))
print(tf.square([1.5,-2.1]))

tf.Tensor([-3.2  2.7], shape=(2,), dtype=float32)
tf.Tensor([2.25      4.4099994], shape=(2,), dtype=float32)


Some useful operations are performed over a whole vector/matrix tensor and return a single value (we saw `tf.reduce_sum` earlier):

In [15]:
tf.reduce_sum # Add elements together
tf.reduce_mean # Average over elements
tf.reduce_min # Minimum value
tf.reduce_max # Maximum value
tf.argmax # Index of the largest value
tf.argmin # Index of the smallest value

<function tensorflow.python.ops.math_ops.argmin_v2(input, axis=None, output_type=tf.int64, name=None)>
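
When the input is a minibatch (a matrix with one row per datapoint), it matters whether a reduction runs over everything or only along one axis. Without an `axis` argument, `tf.reduce_sum` sums all elements into a single scalar, whereas `axis=1` sums within each row and returns one value per datapoint. The plain-Python sketch below mimics the two behaviours on a hypothetical batch of two datapoints:

```python
# A hypothetical minibatch: 2 datapoints, 3 values each.
batch = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0]]

# Like tf.reduce_sum(batch): one scalar for the whole batch.
total = sum(sum(row) for row in batch)

# Like tf.reduce_sum(batch, axis=1): one sum per datapoint.
per_row = [sum(row) for row in batch]

print(total)    # 21.0
print(per_row)  # [6.0, 15.0]
```

With a batch containing a single datapoint the two give the same number, which is why the difference is easy to overlook.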

## Adaptive Learning Rates

Above, we used stochastic gradient descent (SGD) to train our model.  This uses a fixed learning rate to update the parameters.  Several optimisation algorithms are based on SGD, but adaptively adjust the learning rate (usually for each parameter separately).

Different adaptive learning rate strategies are also implemented in TensorFlow, as optimiser classes in `tf.keras.optimizers`. For example:

In [16]:
tf.keras.optimizers.SGD
tf.keras.optimizers.Adadelta
tf.keras.optimizers.Adam
tf.keras.optimizers.RMSprop

tensorflow.python.keras.optimizer_v2.rmsprop.RMSprop

If you are interested in the differences between these strategies, [this blog post](http://ruder.io/optimizing-gradient-descent/) provides more details.
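
As a rough illustration of what "adaptive" means, here is a simplified sketch of an RMSprop-style update in plain Python. It is not the Keras implementation: the toy objective, the starting values, and the hyperparameters `lr`, `rho` and `eps` are all chosen for this example. Each parameter keeps a running average of its squared gradient, and its step is divided by the square root of that average, so a parameter with very large gradients does not take proportionally larger steps.

```python
import math

def grad(p):
    """Gradient of the toy objective f(p) = p[0]**2 + 100 * p[1]**2."""
    return [2.0 * p[0], 200.0 * p[1]]

params = [5.0, -3.0]
avg_sq = [0.0, 0.0]            # running average of squared gradients, per parameter
lr, rho, eps = 0.05, 0.9, 1e-7

for step in range(500):
    g = grad(params)
    for i in range(len(params)):
        avg_sq[i] = rho * avg_sq[i] + (1.0 - rho) * g[i] ** 2
        # Per-parameter scaling: divide by the root of the running average.
        params[i] -= lr * g[i] / (math.sqrt(avg_sq[i]) + eps)

print(params)   # both parameters end up close to the minimum at 0
```

Plain SGD with a single learning rate would have to use a very small step to stay stable along the steep second coordinate, which would make progress along the first coordinate slow.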

## Training an XOR Function

[XOR](https://en.wikipedia.org/wiki/XOR_gate) is the function that takes two binary values and returns 1 if one of them is 1 and the other 0, while returning 0 if both of them have the same value.

It can be a difficult function to learn and cannot be modelled with a linear model. But let's try anyway.

Our dataset consists of all the possible different states that XOR can take:

In [7]:
xor_input = tf.constant([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
xor_output = tf.constant([0.0, 1.0, 1.0, 0.0])

Now we construct a linear network and optimize it on this dataset, printing out the predictions at each epoch:

In [8]:
linear_model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(2,))])

linear_model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
                     loss='mean_squared_error')

for epoch in range(50):
    linear_model.train_on_batch(xor_input, xor_output)
    if (epoch + 1) % 10 == 0:
        print('after {} epochs:'.format(epoch+1), linear_model(xor_input).numpy().reshape((4,)))

after 10 epochs: [-0.21557409  0.53843343  0.2679081   1.0219157 ]
after 20 epochs: [-0.02778651  0.49849507  0.3365215   0.8628031 ]
after 30 epochs: [0.11615891 0.4882609  0.39128137 0.7633833 ]
after 40 epochs: [0.22096115 0.4852433  0.42717808 0.69146025]
after 50 epochs: [0.29715112 0.48554987 0.4507841  0.63918287]


As you can see, it's not doing very well. Ideally, the predictions should be \[0, 1, 1, 0\], but in this case they are hovering around 0.5 for every input case.
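
The drift towards 0.5 is not an accident of this run. For the XOR data, the linear model that minimises the mean squared error has both weights equal to 0 and bias 0.5, so it predicts 0.5 for every input. The sketch below checks this with hand-written gradient descent in plain Python; it starts from zero weights (an assumption made here, since Keras initialises the weights randomly) and runs for many more steps than the cell above.

```python
inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
targets = [0.0, 1.0, 1.0, 0.0]

w1 = w2 = b = 0.0
lr = 0.1
n = len(inputs)

for step in range(2000):
    # Prediction errors of the linear model w1*x1 + w2*x2 + b.
    errors = [w1 * x1 + w2 * x2 + b - y for (x1, x2), y in zip(inputs, targets)]
    # Gradients of the mean squared error.
    grad_w1 = sum(2 * e * x1 for e, (x1, _) in zip(errors, inputs)) / n
    grad_w2 = sum(2 * e * x2 for e, (_, x2) in zip(errors, inputs)) / n
    grad_b = sum(2 * e for e in errors) / n
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2
    b -= lr * grad_b

predictions = [w1 * x1 + w2 * x2 + b for x1, x2 in inputs]
print(predictions)   # all four values are approximately 0.5
```

No amount of further training helps, because no straight line separates the two XOR classes.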

In order to improve this architecture, let's add some non-linear layers into our model.

In [10]:
nonlinear_model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, input_shape=(2,), activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

nonlinear_model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1),
                        loss='mean_squared_error')

for epoch in range(50):
    nonlinear_model.train_on_batch(xor_input, xor_output)
    if (epoch + 1) % 10 == 0:
        print('after {} epochs:'.format(epoch+1), nonlinear_model(xor_input).numpy().reshape((4,)))

after 10 epochs: [0.4512474 0.5330985 0.594556  0.5184007]
after 20 epochs: [0.41759464 0.5759795  0.65515167 0.46746808]
after 30 epochs: [0.37860784 0.65029824 0.71506745 0.39570573]
after 40 epochs: [0.31847364 0.726701   0.773647   0.32191446]
after 50 epochs: [0.2764116  0.8051618  0.8178675  0.27033544]


This is much better. The values are much closer to \[0, 1, 1, 0\] than before, and they will continue improving if we train for longer.  (But remember that the model is initialised randomly &ndash; if you run it a few times, you will see that the results vary with each run.)

Notice that we have used a higher learning rate for this network. It will still learn with a smaller learning rate, but will converge more slowly. As discussed in the lecture, the learning rate is a hyperparameter that can vary quite a bit depending on the network architecture and dataset.

## XOR Classification

We can also do classification with TensorFlow. For this, we often use the softmax activation function described above, which predicts the probability for each of the possible classes.

We also have to change the loss function, as squared error is not suitable for classification.  A suitable loss function is [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy).  Because the correct output has probability 1 for the correct class, and probability 0 for the rest, cross entropy is the same as the negative log probability of the correct class.  In other words, by minimising cross entropy, we are trying to find the maximum likelihood model.
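
The equivalence described above can be checked on a small example. In this plain-Python sketch the function names and the probability values are made up for illustration. The first function takes the class index of the correct label (the "sparse" form used by `sparse_categorical_crossentropy`), and the second computes the full cross entropy against a one-hot target distribution.

```python
import math

def sparse_cross_entropy(probs, label):
    """Negative log probability assigned to the correct class index."""
    return -math.log(probs[label])

def one_hot_cross_entropy(probs, one_hot):
    """Full cross entropy between a one-hot target and the predictions."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = [0.7, 0.2, 0.1]          # hypothetical predicted distribution
loss_sparse = sparse_cross_entropy(probs, 0)
loss_full = one_hot_cross_entropy(probs, [1.0, 0.0, 0.0])

print(loss_sparse)   # -log(0.7), roughly 0.357
print(loss_full)     # the same value
```

The loss is 0 only when the model assigns probability 1 to the correct class, and grows without bound as that probability approaches 0.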

We can change the XOR example above to perform classification instead.  In this case, we are constructing a binary classifier &ndash; choosing between the classes of 0 and 1.  The output now prints the predicted probabilities of the two classes.

In [11]:
nonlinear_model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, input_shape=(2,), activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
])

nonlinear_model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1),
                        loss='sparse_categorical_crossentropy')

for epoch in range(50):
    nonlinear_model.train_on_batch(xor_input, xor_output)
    if (epoch + 1) % 10 == 0:
        print('after {} epochs:'.format(epoch+1), nonlinear_model(xor_input).numpy(), sep='\n')

after 10 epochs:
[[0.5117795  0.4882205 ]
 [0.18200074 0.8179992 ]
 [0.5093838  0.49061623]
 [0.65182793 0.34817204]]
after 20 epochs:
[[0.5365528  0.46344724]
 [0.09488795 0.905112  ]
 [0.5270997  0.4729003 ]
 [0.82926255 0.17073746]]
after 30 epochs:
[[0.544068   0.45593202]
 [0.0363895  0.9636105 ]
 [0.4307922  0.56920785]
 [0.9330242  0.06697579]]
after 40 epochs:
[[0.68452746 0.31547254]
 [0.0463866  0.9536134 ]
 [0.1929348  0.8070652 ]
 [0.9329038  0.06709618]]
after 50 epochs:
[[0.81181437 0.18818559]
 [0.03061958 0.9693804 ]
 [0.08298508 0.91701496]
 [0.96215767 0.03784232]]


## Minibatching

For the XOR data, there are only 4 datapoints.  However, with realistic datasets, it is inefficient to train on the whole dataset at once, because this will require a lot of computation in order to make a single update step.  Instead, we can train on a batch of data at a time.  For example, taking batches of 2 datapoints for the XOR data:

In [12]:
nonlinear_model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, input_shape=(2,), activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
])

nonlinear_model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1),
                        loss='sparse_categorical_crossentropy')

BATCH_SIZE = 2

for epoch in range(50):
    for i in range(0,len(xor_input),BATCH_SIZE):
        input_batch = xor_input[i:i+BATCH_SIZE]
        output_batch = xor_output[i:i+BATCH_SIZE]
        nonlinear_model.train_on_batch(input_batch, output_batch)
    if (epoch + 1) % 10 == 0:
        print('after {} epochs:'.format(epoch+1), nonlinear_model(xor_input).numpy(), sep='\n')

after 10 epochs:
[[0.5654291  0.43457085]
 [0.5125258  0.48747417]
 [0.1469599  0.8530401 ]
 [0.7588526  0.24114738]]
after 20 epochs:
[[0.78188807 0.21811192]
 [0.0782566  0.9217434 ]
 [0.02357241 0.9764276 ]
 [0.9759774  0.02402258]]
after 30 epochs:
[[0.9252303  0.07476968]
 [0.02335044 0.9766495 ]
 [0.01326589 0.98673403]
 [0.9908187  0.00918128]]
after 40 epochs:
[[0.9599664  0.04003363]
 [0.01128379 0.9887162 ]
 [0.00891465 0.99108535]
 [0.99415094 0.00584907]]
after 50 epochs:
[[0.97475106 0.02524896]
 [0.00826528 0.99173474]
 [0.00618102 0.99381894]
 [0.9962547  0.00374533]]


Again, this kind of functionality is built into TensorFlow.  The following code trains the model with the given batch size and number of epochs.

In [13]:
nonlinear_model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, input_shape=(2,), activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
])

nonlinear_model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1),
                        loss='sparse_categorical_crossentropy')

nonlinear_model.fit(xor_input, xor_output, batch_size=2, epochs=40)

print('final loss:', nonlinear_model.evaluate(xor_input, xor_output))
print('final predictions:', nonlinear_model.predict(xor_input), sep='\n')

Train on 4 samples
Epoch 1/40
...
Epoch 40/40
final loss: 0.485472172498703
final predictions:
[[0.72563225 0.27436778]
 [0.00715317 0.9928469 ]
 [0.72563225 0.27436778]
 [0.72563225 0.27436778]]


# Assignment: Classification of House Locations

In the first practical, you used the California House Prices Dataset to predict house prices from various properties of the houses. In this assignment, we will experiment with TensorFlow and train a model to predict the "ocean proximity" of a house.

First, we read in the dataset:

In [16]:
import pandas as pd
data = pd.read_csv('../DSPNP_practical1/housing/housing.csv')
data.sample(5)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 699 | -122.12 | 37.69 | 10.0 | 2227.0 | 560.0 | 1140.0 | 472.0 | 2.3973 | 167300.0 | NEAR BAY |
| 3848 | -118.45 | 34.18 | 34.0 | 1843.0 | 442.0 | 861.0 | 417.0 | 3.6875 | 246400.0 | <1H OCEAN |
| 16293 | -121.23 | 37.96 | 37.0 | 2351.0 | 564.0 | 1591.0 | 549.0 | 1.6563 | 57200.0 | INLAND |
| 4941 | -118.27 | 33.99 | 38.0 | 1407.0 | 447.0 | 1783.0 | 402.0 | 1.8086 | 97100.0 | <1H OCEAN |
| 2053 | -119.73 | 36.68 | 32.0 | 755.0 | 205.0 | 681.0 | 207.0 | 1.7986 | 49300.0 | INLAND |


Next, we split the ocean proximity column from the other features and convert the values to numerical IDs. Remember, the `ocean_proximity` column already contains discrete classes, so it is well-suited to the classification task. However, these are strings, and in order to train a softmax classifier in TensorFlow, we need numerical IDs instead of strings. We can use the pandas `map` function to do the conversion:

In [17]:
X = data.copy().drop(["ocean_proximity"], axis=1)
Y = data.copy()["ocean_proximity"].map({"<1H OCEAN":0, "INLAND":1, "ISLAND": 2, "NEAR BAY": 3, "NEAR OCEAN": 4}).values
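
Since the model will output numerical IDs, it is convenient to keep the inverse mapping around for reading predictions. The sketch below reuses the same dictionary as the cell above; the names `label_to_id` and `id_to_label` and the list of predicted IDs are introduced here for illustration only.

```python
# Same mapping as passed to .map() above, stored under a (hypothetical) name.
label_to_id = {"<1H OCEAN": 0, "INLAND": 1, "ISLAND": 2, "NEAR BAY": 3, "NEAR OCEAN": 4}

# Invert it so that predicted IDs can be turned back into readable class names.
id_to_label = {i: name for name, i in label_to_id.items()}

predicted_ids = [1, 0, 3]   # made-up predictions
print([id_to_label[i] for i in predicted_ids])   # ['INLAND', '<1H OCEAN', 'NEAR BAY']
```

One thing to keep in mind is that `map` produces a missing value for any string that is not a key of the dictionary, so the mapping needs to cover every class that occurs in the column.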

Now, let's split off some data for development and testing:

In [18]:
import sklearn as sk
import sklearn.model_selection, sklearn.impute, sklearn.preprocessing

X_train, X_test, Y_train, Y_test = sk.model_selection.train_test_split(X, Y, test_size=0.2, train_size=0.8, random_state=28)
X_train, X_dev, Y_train, Y_dev = sk.model_selection.train_test_split(X_train, Y_train, test_size=0.2, train_size=0.8, random_state=28)

And finally, let's preprocess the input features before giving them to the network. We need to fill in missing values with the imputer, and standardize the values to a similar range using the scaler:

In [19]:
imputer = sk.impute.SimpleImputer(strategy="median")
imputer.fit(X_train)

X_train = imputer.transform(X_train)
X_dev = imputer.transform(X_dev)
X_test = imputer.transform(X_test)

scaler = sk.preprocessing.StandardScaler().fit(X_train)

X_train = scaler.transform(X_train)
X_dev = scaler.transform(X_dev)
X_test = scaler.transform(X_test)
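
Note that both the imputer and the scaler are fitted on the training set only, and the resulting statistics are then reused for the development and test sets. The plain-Python sketch below shows the same idea for a single made-up feature column (with `None` marking a missing value); it is only an illustration of the computation, not how scikit-learn is implemented.

```python
import statistics

# Hypothetical values of one feature; None marks a missing entry.
train = [2.0, None, 4.0, 6.0, 8.0]
dev = [None, 10.0]

# "Fit" on the training data only: median for imputation, mean/std for scaling.
median = statistics.median(v for v in train if v is not None)
train_filled = [median if v is None else v for v in train]
mean = statistics.mean(train_filled)
std = statistics.pstdev(train_filled)   # population standard deviation

# "Transform" both sets using the training statistics.
dev_filled = [median if v is None else v for v in dev]
train_scaled = [(v - mean) / std for v in train_filled]
dev_scaled = [(v - mean) / std for v in dev_filled]

print(train_scaled)   # [-1.5, 0.0, -0.5, 0.5, 1.5]
print(dev_scaled)     # [0.0, 2.5]
```

Reusing the training statistics keeps information from the development and test sets out of the preprocessing, so the evaluation stays honest.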

We now have a dataset that we can work with. 

Input features:

In [20]:
print(X_train.shape)
print(X_dev.shape)
print(X_test.shape)
print(X_train[:3])

(13209, 9)
(3303, 9)
(4128, 9)
[[ 1.15689593 -0.71445053  0.34720488 -0.64577547 -0.58476315 -0.49719495
  -0.67595606 -1.38919759 -1.27249527]
 [-1.38215885  0.89977649  1.85989223 -0.25794142 -0.45025071 -0.54065637
  -0.49431384  0.57927013  2.54593863]
 [ 0.31386989 -0.12194524  0.50643512 -0.18439456 -0.34877641 -0.44851816
  -0.32564607 -0.14257305 -0.96144134]]


And the corresponding gold-standard labels:

In [21]:
print(Y_train.shape)
print(Y_dev.shape)
print(Y_test.shape)
print(Y_train[:10])

(13209,)
(3303,)
(4128,)
[1 4 1 0 0 4 1 0 4 0]


Based on the code examples above, construct a TensorFlow model, then train, tune and test it on this dataset. Experiment with different model settings and hyperparameters. Calculate and evaluate classification accuracy - the percentage of datapoints where the predicted class matches the gold-standard class.

During the practical session, give examples of what you tried and what your findings were.

Some suggestions and tips:
 * The XOR classification code can be a good place to start.
 * The output layer needs to have size 5, because the dataset has 5 possible classes.
 * Try testing on the development set as you are training, to make sure you don't overfit.
 * Evaluate on the dev set as much as you want, but evaluate on the test set only after you have chosen a good set of hyperparameters.
 * You could try different learning rates, hidden layer sizes, learning strategies, etc.
 * Adaptive learning rates can (and sometimes should) be used together with a regular hand-picked learning rate, and different adaptive learning rates can prefer very different regular learning rates.
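
The classification accuracy mentioned above can be computed from the predicted probabilities by taking the most probable class for each datapoint and comparing it with the gold-standard label. Here is a minimal plain-Python sketch; the function names and the example probabilities are made up for illustration, and in practice the probabilities would come from `model.predict`.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(prob_rows, gold_labels):
    """Fraction of datapoints whose most probable class matches the gold label."""
    predictions = [argmax(row) for row in prob_rows]
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

# Hypothetical softmax outputs for 4 datapoints and 5 classes.
probs = [
    [0.1, 0.9, 0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0, 0.0, 0.2],
    [0.3, 0.6, 0.0, 0.1, 0.0],
    [0.2, 0.1, 0.0, 0.0, 0.7],
]
gold = [1, 0, 0, 4]

print(accuracy(probs, gold))   # 0.75 (the third datapoint is misclassified)
```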

In [51]:
import numpy as np
nonlinear_model_2l = tf.keras.Sequential([
    tf.keras.layers.Dense(20, input_shape=(9,), activation='sigmoid'),
    tf.keras.layers.Dense(5, activation='softmax')
])

nonlinear_model_2l.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1),
                        loss='sparse_categorical_crossentropy')
# nonlinear_model_2l.fit(X_train, Y_train, batch_size=3000, epochs=40) # acc=0.7914
# nonlinear_model_2l.fit(X_train, Y_train, batch_size=2000, epochs=40) # acc=0.8062
nonlinear_model_2l.fit(X_train, Y_train, batch_size=1000, epochs=40) # acc=0.8392

print('final loss:', nonlinear_model_2l.evaluate(X_dev, Y_dev))
print('final responsibilities:', nonlinear_model_2l.predict(X_dev[:5]), sep='\n')
print('final predictions:', tf.argmax(nonlinear_model_2l.predict(X_dev[:5]), axis=1), Y_dev[:5])
print('accuracy on dev set:', (np.array(tf.argmax(nonlinear_model_2l.predict(X_dev), axis=1))==Y_dev).mean())

Train on 13209 samples
Epoch 1/40
...
Epoch 40/40
final loss: 0.4288221789992581
final responsibilities:
[[9.48291868e-02 9.00334179e-01 1.07743756e-04 3.62093351e-03
  1.10791053e-03]
 [8.33442271e-01 4.67154384e-03 6.89506269e-05 7.42673175e-04
  1.61074638e-01]
 [4.50458407e-04 9.99548256e-01 1.99382200e-07 1.66519385e-07
  9.14955706e-07]
 [9.12196100e-01 4.94983979e-02 1.56282640e-05 6.84234110e-05
  3.82214449e-02]
 [8.61013293e-01 3.09970547e-02 3.08135641e-05 9.74766590e-05
  1.07861347e-01]]
final predictions: tf.Tensor([1 0 1 0 0], shape=(5

In [63]:
nonlinear_model_2l.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1),
                        loss='sparse_categorical_crossentropy')

# nonlinear_model_2l.fit(X_train, Y_train, batch_size=3000, epochs=80) # acc=0.8913
# nonlinear_model_2l.fit(X_train, Y_train, batch_size=2000, epochs=80) # acc=0.8840
nonlinear_model_2l.fit(X_train, Y_train, batch_size=1000, epochs=80) # acc=0.8565

print('final loss:', nonlinear_model_2l.evaluate(X_dev, Y_dev))
print('final responsibilities:', nonlinear_model_2l.predict(X_dev[:5]), sep='\n')
inds = np.random.choice(np.arange(len(Y_dev)), 5)
print('final predictions:', tf.argmax(nonlinear_model_2l.predict(X_dev[inds]), axis=1), Y_dev[inds])
print('accuracy on dev set:', (np.array(tf.argmax(nonlinear_model_2l.predict(X_dev), axis=1))==Y_dev).mean())

Train on 13209 samples
Epoch 1/80
...
Epoch 80/80
final loss: 0.190412072247

In [84]:
nonlinear_model_2l_10 = tf.keras.Sequential([
    tf.keras.layers.Dense(10, input_shape=(9,), activation='sigmoid'),
    tf.keras.layers.Dense(5, activation='softmax')
])

nonlinear_model_2l_10.compile(optimizer=tf.keras.optimizers.Adadelta(learning_rate=1),
                        loss='sparse_categorical_crossentropy')
nonlinear_model_2l_10.fit(X_train, Y_train, batch_size=1000, epochs=200) # dev acc=0.8105

print('final loss:', nonlinear_model_2l_10.evaluate(X_dev, Y_dev))
print('final responsibilities:', nonlinear_model_2l_10.predict(X_dev[:5]), sep='\n')
print('final predictions:', tf.argmax(nonlinear_model_2l_10.predict(X_dev[:5]), axis=1), Y_dev[:5])
print('accuracy on dev set:', (np.array(tf.argmax(nonlinear_model_2l_10.predict(X_dev), axis=1))==Y_dev).mean())

Train on 13209 samples
Epoch 1/200
...
Epoch 200/200
final loss: 0.4865850584275283
final responsibilities:
[[1.58670992e-01 8.12314391e-01 1.40910575e-04 1.27879409e-02
  1.60857458e-02]
 [8.07010293e-01 5.10007469e-03 1.41349446e-04 2.54613697e-04
  1.87493607e-01]
 [6.04497211e-04 9.99382019e-01 2.65310092e-07 2.15647322e-08
  1.32583782e-05]
 [7.67030835e-01 1.74440891e-01 9.17746002e-05 1.55404032e-05
  5.84209189e-02]
 [8.23620498e-01 4.85591888e-02 1.15206196e-04 3.06747097e-05
  1.27674520e-01]]
final predictions: tf.Tensor([1 0 1 0 0], shape=(5,), dtype=int64) [1 0 1 0 0]
accuracy on dev set: 0.8104753254617015


In [90]:
nonlinear_model_3l = tf.keras.Sequential([
    tf.keras.layers.Dense(10, input_shape=(9,), activation='tanh'),
    tf.keras.layers.Dense(20, activation='sigmoid'),
    tf.keras.layers.Dense(5, activation='softmax')
])

nonlinear_model_3l.compile(optimizer=tf.keras.optimizers.Ftrl(learning_rate=1),
                        loss='sparse_categorical_crossentropy')
nonlinear_model_3l.fit(X_train, Y_train, batch_size=3000, epochs=1000, verbose=0) # dev acc=0.9361

print('final loss:', nonlinear_model_3l.evaluate(X_dev, Y_dev))
print('final responsibilities:', nonlinear_model_3l.predict(X_dev[:5]), sep='\n')
print('final predictions:', tf.argmax(nonlinear_model_3l.predict(X_dev[:5]), axis=1), Y_dev[:5])
print('accuracy on train set:', (np.array(tf.argmax(nonlinear_model_3l.predict(X_train), axis=1))==Y_train).mean())
print('accuracy on dev set:', (np.array(tf.argmax(nonlinear_model_3l.predict(X_dev), axis=1))==Y_dev).mean())

final loss: 0.1524726086553573
final responsibilities:
[[3.9775503e-05 9.9996006e-01 2.0025535e-09 9.1685941e-08 6.1773964e-10]
 [9.9034274e-01 3.1162926e-05 2.1506421e-04 1.9875783e-07 9.4107315e-03]
 [1.3791361e-06 9.9999857e-01 3.0815483e-10 1.5014988e-08 2.3488120e-10]
 [9.9972218e-01 1.7338974e-04 1.3641670e-05 3.5361246e-08 9.0792513e-05]
 [9.8702186e-01 5.8072623e-05 1.9058556e-04 1.1102935e-07 1.2729295e-02]]
final predictions: tf.Tensor([1 0 1 0 0], shape=(5,), dtype=int64) [1 0 1 0 0]
accuracy on train set: 0.9330759330759331
accuracy on dev set: 0.9361186799878898


In [104]:
nonlinear_model_3lL = tf.keras.Sequential([
    tf.keras.layers.Dense(18, input_shape=(9,), activation='tanh'),
    tf.keras.layers.Dense(25, activation='softplus'),
    tf.keras.layers.Dense(20, activation='sigmoid'),
    tf.keras.layers.Dense(5, activation='softmax')
])

nonlinear_model_3lL.compile(optimizer=tf.keras.optimizers.Ftrl(learning_rate=1),
                        loss='sparse_categorical_crossentropy')
nonlinear_model_3lL.fit(X_train, Y_train, batch_size=3000, epochs=100, verbose=0) # dev acc=0.8547

print('final loss:', nonlinear_model_3lL.evaluate(X_dev, Y_dev))
print('final responsibilities:', nonlinear_model_3lL.predict(X_dev[:5]), sep='\n')
print('final predictions:', tf.argmax(nonlinear_model_3lL.predict(X_dev[:5]), axis=1), Y_dev[:5])
print('accuracy on train set:', (np.array(tf.argmax(nonlinear_model_3lL.predict(X_train), axis=1))==Y_train).mean())
print('accuracy on dev set:', (np.array(tf.argmax(nonlinear_model_3lL.predict(X_dev), axis=1))==Y_dev).mean())

final loss: 0.3662685560451072
final responsibilities:
[[6.5618856e-03 9.9065650e-01 2.2857690e-04 2.1495228e-03 4.0347278e-04]
 [9.7588718e-01 2.8703150e-03 4.3732007e-06 3.2193452e-04 2.0916145e-02]
 [5.8905976e-03 9.9140888e-01 2.2711218e-04 2.0906748e-03 3.8264907e-04]
 [9.3498063e-01 5.8943801e-02 2.7389069e-05 7.4532663e-04 5.3028637e-03]
 [9.7716522e-01 1.5107216e-02 1.1321932e-05 4.4966702e-04 7.2665447e-03]]
final predictions: tf.Tensor([1 0 1 0 0], shape=(5,), dtype=int64) [1 0 1 0 0]
accuracy on train set: 0.8567643273525627
accuracy on dev set: 0.8546775658492279
