# Machine Learning with TensorFlow

* In the previous unit, we saw how to build a DNN classifier
  "from scratch".

* Knowing how to do this is knowledge at the level of
  "knowing how to derive the Euler-Lagrange equations":
  
  We may well find ourselves in some unusual situation
  where we have to "go back to first principles" to reason something out.

* Much of current ML work (model-defining, model-training) is framework-based,
  but there is both a blessing and a curse here:
  * Frameworks cover many common cases.
  * They also may limit our perspective to the framework's narrow field-of-view.
  * Interesting work remains to be done that will require extending frameworks,
    or even "working outside existing frameworks since we have to break some
    rules".

* Here, we will mostly focus on Google's TensorFlow library.
  Later, we will also look a bit into JAX.

  * TensorFlow is a well-supported, evolving (but somewhat under-documented)
    library that is popular for ML, and has been designed for ML,
    but can be used for many other tasks as well. "A frigate".
  * JAX is a smaller and more experimental project that "tries to
    take the good bits and pieces from TensorFlow", such as
    "fast gradients" and "accelerated linear algebra". "A skiff".

## TensorFlow for the ML Practitioner

Suppose we wanted to redo what we did by hand using the infrastructure provided by TensorFlow. Here is an "easy sailing" version.

First, let us train a model (we will come back to discussing all the nice things that we are getting for free here thanks to TensorFlow).

In [None]:
# We will need this PyPI module later
!pip install tf2onnx


import os
import time

import numpy
import tensorflow as tf
import tensorflow_datasets as tfds


# Loading the training and test set. We split the training-set into 'training',
# 'validation', and 'extra' for ad-hoc purposes.
def get_training_datasets():
  (ds_train_raw,
   ds_validation_raw,
   ds_extra_raw,
   ds_test_raw), ds_info = tfds.load(
      'mnist',
      split=['train[:75%]', 'train[75%:99%]', 'train[99%:]', 'test'],
      shuffle_files=True,
      as_supervised=True,
      with_info=True)
  total_num_examples = sum(s.num_examples for s in ds_info.splits.values())
  #
  def normalize_image(image, label):
    """Normalizes images."""
    return tf.cast(image, tf.float32) / 255., label
  #
  def transform(ds):
    return (ds
            .map(normalize_image, num_parallel_calls=tf.data.AUTOTUNE)
            .cache()
            # Shuffle buffer size needs to be at least as large as the number
            # of examples - but does not matter much otherwise.
            .shuffle(total_num_examples, seed=0)
            .batch(32)
            .prefetch(tf.data.AUTOTUNE))
  #
  return (transform(ds_train_raw),
          transform(ds_validation_raw),
          transform(ds_extra_raw),
          transform(ds_test_raw))


ds_train, ds_validation, ds_extra, ds_test = get_training_datasets()

In [None]:
# Training and test sets are iterable `tf.data.Dataset` objects.
# For later, we save part of the `ds_extra` dataset to the filesystem.

# The os.access() check makes this cell idempotent.
if os.access('training_examples.npz', os.R_OK):
  print('Small sample dataset already was saved to filesystem.')
else:
  sample_batches = list(ds_extra.take(10).as_numpy_iterator())
  sample_batches_images = numpy.stack(
    [image_data for image_data, labels in sample_batches], axis=0)
  sample_batches_labels = numpy.stack(
    [labels for image_data, labels in sample_batches], axis=0)
  #
  # `sample_batches_images` is a `[num_batches, 32, 28, 28, 1]`-array:
  # `num_batches` batches of 32 images each, which are 28x28 with one
  # color-channel.
  # `sample_batches_labels` is a `[num_batches, 32]`-array:
  # one label per image in each batch.
  print('Shapes:', sample_batches_images.shape, sample_batches_labels.shape)
  print('Labels:\n', sample_batches_labels, sep='')
  #
  numpy.savez_compressed('training_examples.npz',
                         images=sample_batches_images,
                         labels=sample_batches_labels)

Shapes: (10, 32, 28, 28, 1) (10, 32)
Labels:
[[6 2 3 1 6 9 3 6 1 2 5 4 2 1 5 8 5 7 6 9 3 8 6 4 1 5 9 8 3 2 6 9]
 [7 2 9 9 9 5 4 4 0 8 2 8 7 4 1 2 6 8 3 4 5 8 0 5 2 8 9 7 5 8 5 7]
 [3 9 6 9 1 3 2 3 0 7 7 2 4 6 1 7 4 3 3 9 4 1 1 2 1 4 6 2 2 8 6 4]
 [2 8 4 3 3 5 6 4 1 1 1 5 7 7 5 7 4 5 1 5 7 3 0 1 1 2 8 8 5 1 7 1]
 [0 1 3 3 5 6 0 3 9 1 7 0 7 3 1 9 4 5 5 8 8 6 1 7 3 7 2 6 7 1 7 3]
 [7 9 0 8 8 7 4 3 6 5 8 8 9 8 1 7 3 4 1 9 5 7 8 1 9 4 0 7 2 3 4 5]
 [2 0 7 2 8 6 2 3 6 1 9 2 7 4 4 8 4 1 5 2 7 2 0 8 8 3 9 3 3 0 3 4]
 [2 3 8 5 0 1 6 5 5 0 8 5 9 9 4 8 8 1 4 4 9 8 4 4 6 5 3 0 8 8 1 7]
 [6 0 1 6 8 2 3 9 4 8 1 1 0 4 4 2 3 4 1 0 5 1 9 0 0 6 2 7 5 2 7 9]
 [1 8 4 9 2 1 9 1 0 8 0 1 4 4 7 5 3 8 2 9 9 4 1 6 2 5 8 1 1 3 7 3]]


In [None]:
### Training a model.

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(50, activation='tanh'),
  tf.keras.layers.Dense(50, activation='tanh'),
  tf.keras.layers.Dense(50, activation='tanh'),
  tf.keras.layers.Dense(50, activation='tanh'),
  tf.keras.layers.Dense(10)
])

# Participant Exercise: Try out other architectures, such as...:
#
# model = tf.keras.models.Sequential([
#   tf.keras.layers.Flatten(input_shape=(28, 28)),
#   tf.keras.layers.Dense(80, activation='relu'),
#   tf.keras.layers.Dense(80, activation='relu'),
#   tf.keras.layers.Dense(80, activation='relu'),
#   tf.keras.layers.Dense(10)
# ])
#
# How simple can we make this and still get >95% accuracy?


model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

model.fit(
    ds_train,
    epochs=10,
    validation_data=ds_validation)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7dd008367700>

We now have a trained model. Perhaps not one with quite state-of-the-art
performance, but at least a model that clearly seems to know something about the task it was built for.

Let us see how to:
  * Get information about the trained model.
  * Save it to a file and load it again.
  * Actually use it to make predictions.

In [None]:
model.summary()

# Saving to a `*.h5` file will make tf-Keras use the HDF5 format
# for the saved model.
model.save("mnist_model.h5")

print('######')
!ls -lah mnist_model*

reloaded_model = tf.keras.models.load_model("mnist_model.h5")


# Let us use one batch of examples we extracted earlier:
examples_images, examples_labels = (
    sample_batches_images[0, ...], sample_batches_labels[0, ...])

def predict(image, verbose=False):
  if image.size != 28*28:
    raise ValueError('Expecting input data to provide 28x28 pixels.')
  # Use the model we just reloaded from the HDF5 file.
  logits = reloaded_model.predict(image.reshape(1, 28, 28), verbose=verbose)
  if verbose:
    print('Logits:', logits.round(3))
  return numpy.argmax(logits)

# We could run predictions on an entire batch, but here process
# individual images.
for num_example, (image, label) in enumerate(zip(examples_images,
                                                 examples_labels)):
  predicted = predict(image, verbose=False)  # Feel free to set verbose=True
  ok = 'OK' if predicted == label else 'BAD'
  print(f'Example Image {num_example:2d}: '
        f'predicted={predicted}, actual={label}  - {ok}')


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 784)               0         
                                                                 
 dense (Dense)               (None, 50)                39250     
                                                                 
 dense_1 (Dense)             (None, 50)                2550      
                                                                 
 dense_2 (Dense)             (None, 50)                2550      
                                                                 
 dense_3 (Dense)             (None, 50)                2550      
                                                                 
 dense_4 (Dense)             (None, 10)                510       
                                                                 
Total params: 47,410
Trainable params: 47,410
Non-traina

While we are at it: A 'trained ML model' is a bit like an 'electronics module': We would typically like to know how to use this as a component in some larger engineering design (/ product).

It is quite possible to use trained TensorFlow models on smartphones, advanced microcontrollers, or merely make them part of some compiled application. Often, a good approach is to convert the model to "TensorFlow Lite" form for deployment.

There are TFLite libraries for various systems/architectures to then load a model, feed input to it, and obtain predictions from it. These exist for: microcontrollers, tiny computers such as the Raspberry Pi, Android apps, iOS apps, compiled binaries, etc.

Here, we will just sketch what this looks like, again using Python with TFLite - so, we are not quite cutting the Python umbilical cord yet. We will first save our model in TFLite form, and then switch over from using TensorFlow to using only the TFLite module.

In [None]:
# Converting the model to TFLite ("TensorFlow Lite") form.

model = tf.keras.models.load_model("mnist_model.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Save the model.
with open('mnist_model.tflite', 'wb') as f:
  f.write(tflite_model)

!ls -lah 'mnist_model.tflite'



-rw-r--r-- 1 root root 189K Aug 22 11:34 mnist_model.tflite


For later, we can also convert the trained model to `.onnx` format, which is understood by other projects as a representation of a computational graph. For graphs that do not use too exotic computational operations, this should generally work. At the time of this writing, we cannot convert from a model saved in HDF5 format, so we need to save our model again in TensorFlow's own format. Rather than importing the `tf2onnx` module here, we perform conversion in a subprocess, executed via a shell escape.

In [None]:
model.save("mnist_model_tf")

!python -m tf2onnx.convert --saved-model mnist_model_tf --output mnist_model.onnx
!ls -lah mnist_model.*



2023-08-22 11:42:04,486 - INFO - Signatures found in model: [serving_default].
2023-08-22 11:42:04,486 - INFO - Output names: ['dense_4']
2023-08-22 11:42:04,566 - INFO - Using tensorflow=2.12.0, onnx=1.14.0, tf2onnx=1.15.0/6d6b6c
2023-08-22 11:42:04,566 - INFO - Using opset <onnx, 15>
2023-08-22 11:42:04,572 - INFO - Computed 0 values for constant folding
2023-08-22 11:42:04,586 - INFO - Optimizing ONNX model
2023-08-22 11:42:04,629 - INFO - After optimization: Cast -1 (1->0), Identity -2 (2->0)
2023-08-22 11:42:04,631 - INFO - 
2023-08-22 11:42:04,631 - INFO - Successfully converted TensorFlow model mnist_model_tf to ONNX
2023-08-22 11:42:04,631 - INFO - Model inputs: ['flatten_input']
2023-08-22 11:42:04,631 - INFO - Model outputs: ['dense_4']
2023-08-22 11:42:04,631 - INFO - ONNX model is saved at mnist_model.onnx
-rw-r--r-- 1 root root 605K Aug 22 11:34 mnist_model.h5
-rw-r--r-- 1 root root 190K Aug 22 11:42 mnist_model.onnx
-rw-r--r-- 1 root root 189K Aug 22 11:34 mnist_model.tfl

Let us actually download the `.tflite` and also `.onnx` file locally...

In [None]:
import google.colab.files
google.colab.files.download('mnist_model.tflite')
google.colab.files.download('mnist_model.onnx')

In [None]:
# NOTE: For some versions of TensorFlow and TFLite, it is not possible
# to import `tflite` and `tensorflow` into the same Python process.
#
# This normally is not even needed, given that `tf` has a `tf.lite` sub-module,
# but here we want to demonstrate using TFLite only and not the full-blown
# TensorFlow module.
#
# If executing this cell fails, then restarting the Colab runtime will
# replace the running Python interpreter with a new one while retaining
# on-filesystem state of the virtual machine. So, it might be necessary to
# do [Runtime] -> [Restart Runtime] (Short-cut: Control-M dot) before
# continuing with this notebook by executing this cell.
#
# Since the runtime system may have been restarted here,
# we re-import the modules that we will need going forward.

# Note: On non-colab systems, if `tflite-runtime` is not yet available
# for too-recent a CPython version, this might require e.g.
# python3.10 -m pip install tflite-runtime
!pip install tflite-runtime

import time
import numpy
import tflite_runtime.interpreter as tflite

reloaded_examples = numpy.load('training_examples.npz')
sample_batches_images =  reloaded_examples['images']
sample_batches_labels =  reloaded_examples['labels']



In [None]:
def get_mnist_tflite_predictor(model_path):
  interpreter = tflite.Interpreter(model_path=model_path)
  interpreter.allocate_tensors()
  input_details = interpreter.get_input_details()
  output_details = interpreter.get_output_details()
  def fn_predict(in_data, verbose=False):
    t0 = time.time()
    # Note that this is not reentrant! Different invocations of the current
    # function use the same `interpreter`, and mutate its state by
    # "setting the input". So, if multiple threads called this function
    # concurrently, things would go wrong.
    interpreter.set_tensor(input_details[0]['index'],
                           in_data.reshape(1, 28, 28))
    interpreter.invoke()
    t1 = time.time()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    result = numpy.squeeze(output_data)
    if verbose:
      print(f'(t={(t1-t0)*1000:.3f} msec):', result.round(3))
    return result
  return fn_predict

mnist_tflite_predictor = get_mnist_tflite_predictor(
    model_path='mnist_model.tflite')

print('Predicted: ',
      numpy.argmax(mnist_tflite_predictor(sample_batches_images[0, 0, ...],
                                          verbose=True)),
      ' -  Actual: ', sample_batches_labels[0, 0])

(t=0.906 msec): [-4.890e-01 -1.157e+00 -2.000e-03 -3.020e+00  8.140e-01  9.350e-01
  8.952e+00 -8.248e+00 -7.890e-01 -3.224e+00]
Predicted:  6  -  Actual:  6


### What TensorFlow did for us here

  * For simple multi-layer architectures:
    * Provide us with an extremely simple way to specify a neural network.
    * Automatically pick reasonable defaults for the distribution of
      randomized initial weights and biases.
    * Hide all the subtleties around computing fast gradients.
  * Offer a convenient way to specify how we want to utilize gradients
    for optimization.
    
    Earlier: 'multiply gradient with a small factor and
    take a small step in the opposite direction (going down)'.
    
    Here: pick a very simple yet quite effective alternative strategy
    known as 'Adam Optimization'.

  * Allow us a convenient way to specify a loss function - here directly
    from logits.
    
    (Earlier, we had to go from logits to probabilities,
    do softmax, and then hand-backpropagate it all).

  * Provide convenience functions for loading and transforming input data,
    which include very substantial performance optimizations.
  * Handle setting up the computation in such a way that training can optionally
    run on CPU, GPU, or TPU(!)
  * Provide us with a straightforward way to save and deploy a trained model.
  * Allow us to specify performance metrics to track during training.
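
To make the "loss directly from logits" item above a little more concrete, here is a minimal numpy sketch of a sparse categorical cross-entropy computed straight from logits. The function name `sparse_cross_entropy_from_logits` is made up for this illustration, and this is not TensorFlow's actual implementation. It only shows the usual trick of subtracting the row-wise maximum before exponentiating.

```python
import numpy


def sparse_cross_entropy_from_logits(logits, labels):
  """Mean cross-entropy for integer labels, computed directly from logits."""
  # Subtracting the row-wise maximum leaves the result unchanged,
  # but keeps exp() from overflowing.
  shifted = logits - logits.max(axis=-1, keepdims=True)
  log_probs = shifted - numpy.log(
      numpy.exp(shifted).sum(axis=-1, keepdims=True))
  # Pick the log-probability of the correct class for every example.
  picked = log_probs[numpy.arange(len(labels)), labels]
  return -picked.mean()


logits = numpy.array([[2.0, 0.5, -1.0],
                      [1000.0, 0.0, -1000.0]])  # Would overflow if done naively.
labels = numpy.array([0, 0])
loss = sparse_cross_entropy_from_logits(logits, labels)
print(loss)
```

Going through probabilities first (softmax, then logarithm) would produce infinities on the second row, which is one reason for working from logits.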

### What we have not seen yet

  * "The scaffolding between this high-level perspective and the
    low-level approach we discussed earlier."
  * How to put unusual data-processing into TensorFlow so that we
    can still use high level infrastructure like model-saving.
  * How to use the "tensor arithmetics with fast gradients" machinery
    inside TensorFlow to do non-ML physics.
  * More tuning and tweaks we can apply to improve performance
    (L2 regularization, architectural elements such as:
    dropout layers, convolutional+max-pooling layers).

### Additional remarks

 * On 'weight initialization': TensorFlow by default uses "[Glorot](https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) (or 'Xavier') initialization" - basic idea: if N normally distributed inputs get summed over, use a 1/sqrt(N) scaling factor for random weight initialization.
 * Here, we are using the 'test set' to obtain performance metrics.
   This is generally a bad idea, since it leads to applying tweaks based
   on information obtained by "peeking at test set performance".

   **If we want to report proper ability-to-generalize, we must not use
   observations on the test set to make model tuning decisions.**

   (The right thing to do here is to lock away the "test set" and split
    the for-training examples into an actual 'training set for computing
    gradients' and a 'validation set' for making tuning decisions.)

   Overall, this example is mostly about showing "how we can wire things up",
   and in this context, this is tolerable, since we are not out to
   "break the world record on MNIST classification".

 * Nowadays, a ML system doing MNIST-classification is a bit like a TV
   screen showing a test image - it merely indicates that we built
   something which allows data to flow the way it should, and we might
   be able to identify some glaring problems.
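
The 1/sqrt(N) idea from the weight-initialization remark above can be checked numerically. The following is a minimal sketch with made-up sizes, using plain numpy rather than TensorFlow's initializers. It covers only the fan-in part of the argument, whereas the scheme in the Glorot paper also takes the number of outputs into account.

```python
import numpy

rng = numpy.random.default_rng(0)
num_inputs, num_units, num_examples = 784, 50, 2000

inputs = rng.normal(size=(num_examples, num_inputs))  # Unit-variance inputs.

# Unscaled weights: each unit sums N terms, so the spread grows like sqrt(N).
w_unscaled = rng.normal(size=(num_inputs, num_units))
# Scaled weights: the 1/sqrt(N) factor keeps pre-activations at unit scale.
w_scaled = w_unscaled / numpy.sqrt(num_inputs)

std_unscaled = (inputs @ w_unscaled).std()
std_scaled = (inputs @ w_scaled).std()
print(f'unscaled: {std_unscaled:.2f}, scaled: {std_scaled:.2f}')
```

With unscaled weights, a tanh unit would start out deep in saturation for almost every example. With the scaling, it starts in the regime where it still responds to its input.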


### Architecture Improvements

Before we "open the engine hood" and take a deeper look at the technology, let us discuss a few easy-to-explain ideas around "what can we do here to further improve image classifier performance?"

There are some simple general ideas that are reasonably easy-to-explain in just one or two paragraphs - if we accept that the "cheap heuristic explanations" we are giving might actually be a little bit off. At this level, this is more "cooking advice" than a thorough discussion of deeper characteristics.

#### **L2 Regularization**

Now that we have seen how to train deep architectures: What would we expect to happen if we substantially over-sized or under-sized a model (in terms of number of layers and also units per layer) relative to what experimentation tells us is the point where increasing model size brings no further benefit?

Under-sizing is easy to understand: The model will not be able to extract and process all the features that are in principle available, and will have to make hard choices what to ignore.

Over-sizing is more interesting: If we massively over-sized the model, and used a fixed-size training set, it would at some point have enough capacity to just memorize every training example - and it may memorize such examples in weird ways, such as "If the top right pixel is green and the top left pixel is brighter than its neighbors, then we know we are looking at example #1735". Naturally, this then messes up the model's ability to generalize.
*In particular*, if there is any classification error in the training set labels, the model will not do what an under-sized model would do - "if I cannot do well and have to take a hit, I perhaps should take one on this crazy outlier in the data". Rather, it would try to come up with a very fancy "explanation" why the outlier (and some region around it) is "really special".

If we built a model with only linear activation functions, there would be no point going to two or more layers - since "all transformations are linear".
So, the power of deep architectures is closely related to their ability to put nonlinearities to good use. If we used an architecture with tanh-activation as the nonlinearity on hidden layers, and a final linear scaling layer, then tanh-s that receive small-magnitude input would behave mostly-linear.
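
Both claims are easy to check numerically. The sketch below uses arbitrary random weights (all shapes are made up for the illustration). It shows that two stacked linear layers collapse into a single one, and that tanh deviates very little from the identity for small-magnitude input.

```python
import numpy

rng = numpy.random.default_rng(0)
x = rng.normal(size=(5, 4))
w1, b1 = rng.normal(size=(4, 6)), rng.normal(size=6)
w2, b2 = rng.normal(size=(6, 3)), rng.normal(size=3)

# Two stacked linear layers...
two_layers = (x @ w1 + b1) @ w2 + b2
# ...are equivalent to one linear layer with merged parameters.
one_layer = x @ (w1 @ w2) + (b1 @ w2 + b2)
layers_agree = numpy.allclose(two_layers, one_layer)
print(layers_agree)

# tanh is close to the identity for small-magnitude input.
small = numpy.linspace(-0.1, 0.1, 11)
max_deviation = numpy.abs(numpy.tanh(small) - small).max()
print(max_deviation)
```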

With this in mind, it appears natural to try an approach where "we give the network the ability to use many nonlinearities, but we attach a small price to it, so the network will only decide to actually use that power if that brings a benefit".

Now, we are in an "optimize performance, but also do X" (i.e. try not to use large weights) situation. We still want to consider ML training as a numerical optimization problem - but that then means that we have to slightly deviate from the idea of "what we optimize for is performance" towards "our loss function is a weighted sum of contributions, where one measures performance, and the others are about other relevant aspects, such as minimizing some quantity that we believe to be a useful proxy for the model's inclination to avoid unnecessarily complex explanations".

The basic idea behind "L2 regularization" is: Let us add a term to the total loss that is the sum-squared of weights. The model can use nonlinearity, and "smallish weights are still cheap", but making a weight large is considered costly.
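
As a minimal sketch of this "weighted sum of contributions", the snippet below adds a sum-of-squared-weights term to a placeholder performance loss. The weight values, `strength` and `data_loss` are all made up for illustration. In Keras, one would typically request such a penalty per layer instead, for example via the `kernel_regularizer` argument of `Dense` together with `tf.keras.regularizers.L2`.

```python
import numpy


def l2_penalty(weights, strength):
  """Sum of squared weights over all weight arrays, times `strength`."""
  return strength * sum(numpy.sum(w ** 2) for w in weights)


def l2_penalty_gradient(w, strength):
  """Gradient of the penalty with respect to one weight array."""
  return 2.0 * strength * w


weights = [numpy.array([[0.1, -0.2], [0.05, 0.0]]),
           numpy.array([[3.0], [-0.1]])]
strength = 1e-2   # Hypothetical value; in practice this is tuned.
data_loss = 0.35  # Placeholder for the performance term, e.g. cross-entropy.
total_loss = data_loss + l2_penalty(weights, strength)
print(total_loss)
print(l2_penalty_gradient(weights[1], strength))
```

Almost all of the penalty here comes from the single weight of size 3.0. The gradient is proportional to the weight itself, so large weights get pulled towards zero much more strongly than small ones.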

The effect is nicely illustrated in these two TensorFlow Playground training runs: Two large-capacity models trained on the same "simple explanation but noisy data" dataset, having the same outliers. Without L2 regularization, we see a tendency to "draw a complicated border to explain the training set". With L2 regularization, we get a nicer border. As expected, the complex border is related to "overfitting", and we do see an accuracy gap between training and test set.

(See screenshots "L2 Regularization" on "supplementary material" document.)

#### **ReLU activation**

Traditionally, the earliest nonlinearity to be used widely in neural network research was the 'sigmoid' $x\mapsto 1/(1+\exp(-x))$ (this again is of course just the logistic function). The thinking here was that this should serve as a crude model for biological neurons that still allows us to do backpropagation: Zero output for low activation, saturation for high activation, but we want a continuous function.

It turns out that one can in principle use just about any "not too wild" nonlinearity for deep neural networks, even functions such as $x\mapsto x^2$ or $x\mapsto\exp(-x^2/2)$. Overall, this might not be too surprising, at least for smooth functions of this kind - "if we zoom in at some point, we will see linear behavior, if we zoom out a bit, we will see quadratic corrections, but higher corrections will still be mostly-negligible".

It came as something of a surprise to discover that an extremely simple nonlinearity, which furthermore happens to be scale invariant (it commutes with rescaling by positive factors), usually performs really well in comparison - the "rectified linear unit",
$x\mapsto (x+|x|)/2$. There was a natural drive towards such a very simple function from the desire to avoid complicated numerical calculations such as doing an exponential - and it stuck when it showed good performance.
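
A quick numerical check of the formula, as a sketch in plain numpy: $(x+|x|)/2$ is the same function as the more commonly written $\max(x, 0)$, and rescaling the input by a positive factor rescales the output by the same factor.

```python
import numpy


def relu(x):
  return (x + numpy.abs(x)) / 2


x = numpy.linspace(-3.0, 3.0, 13)
# Same function as the more common max(x, 0) form.
same_as_max = numpy.allclose(relu(x), numpy.maximum(x, 0.0))
# Rescaling the input by a positive factor rescales the output equally.
commutes_with_scaling = numpy.allclose(relu(2.5 * x), 2.5 * relu(x))
print(same_as_max, commutes_with_scaling)
```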

More recently, there have been variations on the topic, such as ELU and SELU, but ReLU remains an important workhorse.

It is interesting to ponder what the functions described by exclusively-ReLU deep networks look like: these are continuous functions that are affine-linear on (generalized) polyhedral cells. ("Generalized" since they may extend to infinity.) Naturally, "gradients just propagate through, even to very deep layers".
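
In one dimension, the "polyhedral cells" are simply intervals, which makes the piecewise-affine structure easy to see. The sketch below uses a tiny network with hand-picked, purely illustrative weights: three hidden ReLU units give three kinks, and hence four intervals with one slope each.

```python
import numpy


def relu(x):
  return numpy.maximum(x, 0.0)


# A tiny fixed network: 1 input -> 3 hidden ReLU units -> 1 output.
w1 = numpy.array([[1.0, -1.0, 2.0]])
b1 = numpy.array([0.5, 1.0, -1.0])
w2 = numpy.array([[1.0], [0.5], [-1.5]])
b2 = numpy.array([0.25])

xs = numpy.linspace(-3.0, 3.0, 601).reshape(-1, 1)
ys = (relu(xs @ w1 + b1) @ w2 + b2).ravel()

# Slopes between neighboring grid points. Kinks sit at x = -0.5, 0.5, 1.0.
slopes = numpy.diff(ys) / numpy.diff(xs.ravel())
distinct_slopes = numpy.unique(slopes.round(6))
print(distinct_slopes)
```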

The power of a deep ReLU network is nicely illustrated in the TensorFlow Playground by doing a "swiss roll with noisy data" problem using only the x/y coordinate as input features, and going for a maximum-capacity network, even without any regularization - but one really has to watch the training process to appreciate this:

(See screenshot "ReLU Activation" on "supplementary material" document.)

#### **Dropout Regularization**

Looking at some TensorFlow Playground training runs for the "swiss roll" problem that use the same quite noisy input data, but started from differently initialized random weights, it is not surprising that we get roughly-comparable classifier performance, but both models made different doubtful decisions about where to make the contour look complicated to net a few more examples and get slightly lower loss on the training set.

So, each model will have "quirks". Naturally, we would expect that if we just did what we discussed earlier, combining the assessments of different (perhaps not quite independent) "experts", such as majority-vote-of-5-differently-trained-models, we should be able to average out such "structure hallucinations". Of course, training and deploying multiple models is expensive, so: can we perhaps use some variant of this idea which retains some of the "averaging over different models" property while not being so expensive?

One idea here is to use "dropout" layers: During the training process, we keep randomly disabling some units (but correspondingly scaling up the total input to a unit that has temporarily lost some of its input units) in changing patterns. Effectively, we are training an exponentially large family of networks (since with $N$ "faulty" units, we have $\sim 2^N$ ways to disable half
of them) which are all obtained from a "master network" by turning off some units.

Another way to think about this is that "dropout" punishes complex co-adaptation of many units. If a complex hypothesis is realized by having some specific set of 10 units activate in a particular pattern, "dropout" makes it unlikely that they all are available at the same time, so, less complex hypotheses that only require two or three units to cooperate are favored over intricate explanations.
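
The "disable some units, scale up the rest" step can be sketched in a few lines of numpy. This is a simplified training-time version, and the helper name `dropout` is chosen for this illustration only. It rescales the surviving activations by 1/(1 - rate), which is also the convention the Keras `Dropout` layer documents.

```python
import numpy


def dropout(activations, rate, rng):
  """Training-time dropout: zero units at random, rescale the survivors."""
  keep_mask = rng.random(activations.shape) >= rate
  return activations * keep_mask / (1.0 - rate)


rng = numpy.random.default_rng(0)
activations = numpy.ones((1000, 64))
dropped = dropout(activations, rate=0.5, rng=rng)
print('fraction zeroed:', (dropped == 0).mean())
print('mean activation:', dropped.mean())
```

Because of the rescaling, the expected total input to the next layer stays the same, so nothing needs to be adjusted when dropout is switched off at prediction time.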

(See screenshots "Dropout" on "supplementary material" document.)

#### **Early Stopping**

This is in the category of "sometimes it helps, but sometimes we observe the opposite, and it helps more to keep training even when the model stopped learning anything".

The basic idea is that it can happen that a model first learns the overall structure of the problem, those aspects that generalize well, but as it keeps being fed the training examples, it will increasingly fine-tune on accidental properties of the training set - so, we should be able to limit overfitting by stopping the training process early, typically according to some heuristics.
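
One common heuristic of this kind is "stop once the validation loss has not improved for `patience` epochs, and keep the best weights seen so far". The sketch below applies that rule to a made-up list of validation losses. The function name is hypothetical. Keras offers comparable behavior through the `tf.keras.callbacks.EarlyStopping` callback.

```python
def early_stopping_epoch(validation_losses, patience):
  """Returns (stop_epoch, best_epoch), both 0-based."""
  best_loss = float('inf')
  best_epoch = -1
  for epoch, loss in enumerate(validation_losses):
    if loss < best_loss:
      best_loss, best_epoch = loss, epoch
    elif epoch - best_epoch >= patience:
      # No improvement for `patience` epochs in a row: give up.
      return epoch, best_epoch
  return len(validation_losses) - 1, best_epoch


# Made-up validation losses: improvement first, then overfitting sets in.
losses = [0.90, 0.60, 0.45, 0.40, 0.41, 0.43, 0.42, 0.47, 0.50]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```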

#### **Small Batch Sizes**

This point is controversial - there are indications that, while there is some truth to this advice, the opposite also holds.

Overall, neural network training amounts to finding some minimum (in weight/bias parameter space) of a "loss" function that has many local minima (at the very least since we can always permute/relabel units and get an equivalent network). If we use smallish batch sizes (perhaps 32 or 64) to estimate gradients, that makes gradients inherently noisy, and this noisiness keeps us from settling into narrow basins of the loss function, favoring larger, "more robust" minima instead.
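
How much noisier small-batch gradient estimates are can be illustrated with synthetic numbers. In the sketch below, normally distributed values stand in for the per-example gradients of a single parameter. This is an assumption made for the illustration, not data from the MNIST model.

```python
import numpy

rng = numpy.random.default_rng(0)
# Stand-in for per-example gradients of one parameter: true mean 1.0, noisy.
per_example_gradients = rng.normal(loc=1.0, scale=2.0, size=64_000)


def minibatch_gradient_std(batch_size):
  """Spread of the batch-averaged gradient across disjoint batches."""
  batches = per_example_gradients.reshape(-1, batch_size)
  return batches.mean(axis=1).std()


std_small = minibatch_gradient_std(32)
std_large = minibatch_gradient_std(512)
print(std_small, std_large, std_small / std_large)
```

The noise shrinks roughly like one over the square root of the batch size, so going from 32 to 512 examples per batch reduces it by about a factor of four.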

#### **Convolutional Architectures**

So far, we fed our classifiers one-dimension-per-pixel data vectors.

Now, if we picked an image-scrambling permutation of pixels in advance
and applied it in the same way to all training and test examples, this would make the problem no more difficult for the ML architectures we have seen so far, but would make the problem *much* harder for a human. So, clearly, the problem has extra structure which we do not even remotely exploit yet.
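
The reason a fixed pixel permutation does not matter to a dense layer is that we can compensate for it exactly by reordering the rows of the first weight matrix. Here is a minimal numpy sketch with random stand-in data (not actual MNIST images):

```python
import numpy

rng = numpy.random.default_rng(0)
images = rng.random(size=(8, 28 * 28))    # Stand-in for flattened images.
weights = rng.normal(size=(28 * 28, 50))  # First dense layer.
bias = rng.normal(size=50)

permutation = rng.permutation(28 * 28)    # Fixed pixel scrambling.
scrambled_images = images[:, permutation]
# Reordering the weight rows in the same way compensates exactly.
matching_weights = weights[permutation, :]

original = numpy.tanh(images @ weights + bias)
scrambled = numpy.tanh(scrambled_images @ matching_weights + bias)
outputs_agree = numpy.allclose(original, scrambled)
print(outputs_agree)
```

Since training starts from randomly initialized weights anyway, the scrambled problem looks statistically the same to such a network as the original one.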

The MNIST dataset is normalized to always have the ink-center-of-gravity in the middle of the image, and also in some other ways. This makes convolutional architectures somewhat a case of "this happens to work on MNIST too, even though the major reasons why it is a good idea apply only to a limited extent". Sticking with MNIST, one way to think about the problem is that we might be able to reduce "overfitting" (that is, mistaking accidental structure observed in the training set for relevant structure)
by applying some data-reduction that we would expect to reduce such accidental features. Obviously, digits are a quite anthropomorphic solution to the problem of communicating data between humans - in the sense that doing the same between computers, we would likely go for a more barcode or QR-code like solution. So, unsurprisingly, digits work reasonably well even for people with slightly bad eyesight - they do not contain relevant fine detail that must be exactly right.

So, one might wonder whether blurring the image a bit could help with classification. A "blurring" transform basically amounts to performing a convolution with some kernel, such as a Gaussian. This leaves the question: What kernel to use? If we were to answer "this is ML, so we might simply treat the kernel's parameters as learnable and use gradient descent to find out what a good kernel might be", and perhaps add "let's blur in more than one way at the same time", we basically have invented "Convolutional Neural Networks".

Another way to think about this: If we humans look at an object, it mostly does not matter much whether it sits right in the center of our field of vision, or is a tiny bit displaced to the left, right, or up, or down. So, for many image classification problems, there is some inherent translational symmetry in the task: If I want to know whether there is a frog in an image, then it should not matter for detecting the frog if the camera "pointed two pixels further to the left" when the picture was taken. So, we might want to do localized feature-extraction that is only sensitive to some small windowed region in the image, and slide that window over the entire image, both horizontally and vertically. Here, "each window gets treated the same", in the sense that we use the same weights independent of window-position. So, in training, if we have frogs in 4 out of 16 batch-examples, then such a set-up will try to tweak the convolution kernel(s) in such a way that they become sensitive to the differences between images-with-frogs and images-without-frogs, irrespective of how we have to place the window(s) to best fit the frogs.
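
The sliding-window idea can be written down directly. The following is a deliberately slow, minimal numpy sketch with a random kernel and a synthetic "object", both made up for the illustration. As is common in ML frameworks, the kernel is not flipped, so strictly speaking this computes a cross-correlation.

```python
import numpy


def conv2d_valid(image, kernel):
  """Slides `kernel` over `image` ("valid" windows only, no padding)."""
  kh, kw = kernel.shape
  out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
  out = numpy.empty((out_h, out_w))
  for i in range(out_h):
    for j in range(out_w):
      # The same kernel weights are used at every window position.
      out[i, j] = numpy.sum(image[i:i + kh, j:j + kw] * kernel)
  return out


rng = numpy.random.default_rng(0)
kernel = rng.normal(size=(3, 3))
image = numpy.zeros((28, 28))
image[10:15, 8:12] = rng.random(size=(5, 4))  # A small "object".
shifted = numpy.roll(image, shift=2, axis=1)  # Same object, 2 pixels right.

features = conv2d_valid(image, kernel)
features_shifted = conv2d_valid(shifted, kernel)
print(features.shape)
print(numpy.allclose(numpy.roll(features, shift=2, axis=1), features_shifted))
```

Shifting the object by two pixels shifts the extracted features by two pixels and changes nothing else. Note also that a 3x3 window on a 28x28 image gives a 26x26 feature map.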

Right after a convolutional layer, which convolves window-wise with a collection of learnable kernels, one typically puts a resolution-reducing layer, often max-pooling. About max-pooling, Geoffrey Hinton has (in)famously said: "The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster." Here, we content ourselves with observing that "it often works quite well."
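
For a single feature map, 2x2 max-pooling with stride 2 can be sketched with a reshape. The helper name `max_pool_2x2` is made up here, and real layers additionally handle batch and channel dimensions. A trailing odd row or column is simply dropped in this sketch.

```python
import numpy


def max_pool_2x2(feature_map):
  """2x2 max-pooling with stride 2; drops a trailing odd row/column."""
  h, w = feature_map.shape
  h2, w2 = h // 2, w // 2
  # Group pixels into 2x2 blocks, then take the maximum of every block.
  blocks = feature_map[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2)
  return blocks.max(axis=(1, 3))


feature_map = numpy.arange(16.0).reshape(4, 4)
pooled = max_pool_2x2(feature_map)
print(pooled)
print(max_pool_2x2(numpy.zeros((11, 11))).shape)
```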


In [None]:
# Before we proceed with `tf` code again, we might have to re-start the runtime
# system and re-run the very first cell of this notebook, which re-imports
# TF and other modules, and loads the dataset.

In [None]:
# Let us put some of these ideas to the test.
# Example adjusted from: https://keras.io/examples/vision/mnist_convnet/


cnn_model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(50, activation='relu'),
        tf.keras.layers.Dense(10),
    ]
)

cnn_model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

if True:
  # This can take about half an hour or so. Feel free to disable.
  cnn_model.fit(ds_train, epochs=30, validation_data=ds_validation)


In [None]:
if True:
  # Some extra training, for refinement - feel free to disable.
  cnn_model.fit(ds_train, epochs=2, validation_data=ds_validation)


Epoch 1/2
Epoch 2/2


In [None]:
cnn_model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_2 (Conv2D)           (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 13, 13, 32)       0         
 2D)                                                             
                                                                 
 conv2d_3 (Conv2D)           (None, 11, 11, 64)        18496     
                                                                 
 max_pooling2d_3 (MaxPooling  (None, 5, 5, 64)         0         
 2D)                                                             
                                                                 
 flatten_1 (Flatten)         (None, 1600)              0         
                                                                 
 dropout_1 (Dropout)         (None, 1600)             

What we have here may well be superhuman ability at reading handwritten digits
which we can readily deploy as a 400 KB model file on even quite cheap microcontrollers. This certainly is interesting.
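As a rough plausibility check of the model size, we can count the trainable parameters by hand. This is only a sketch under two assumptions: weights are stored as 4-byte `float32` values, and file-format overhead is ignored. The helper functions `conv2d_params` and `dense_params` are made up for this illustration.

```python
# Parameter count for the layer sizes used in the Sequential model above.
# Assumption: every parameter is a 4-byte float32; file overhead is ignored.


def conv2d_params(kernel_h, kernel_w, in_channels, filters):
  # One kernel of size kernel_h x kernel_w x in_channels, plus one bias,
  # per output filter.
  return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(in_features, units):
  # One weight per (input, unit) pair, plus one bias per unit.
  return (in_features + 1) * units


param_counts = {
    'conv2d (1 -> 32 channels)': conv2d_params(3, 3, 1, 32),
    'conv2d (32 -> 64 channels)': conv2d_params(3, 3, 32, 64),
    'dense (1600 -> 50)': dense_params(5 * 5 * 64, 50),
    'dense (50 -> 10)': dense_params(50, 10),
}
total_params = sum(param_counts.values())
total_bytes = 4 * total_params

for name, count in param_counts.items():
  print(f'{name:28s} {count:6d}')
print(f'total: {total_params} parameters, {total_bytes} bytes as float32')
```

The two convolutional counts agree with the `Param #` column of the summary above, and the total comes out at just under 400 000 bytes. Pooling, flattening and dropout layers have no trainable parameters.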

However, given that we have not even tried hard yet, this illustrates another point: "MNIST is in general too simple to demonstrate superiority of some ML method" and "just about every idea works well on MNIST".


In [None]:
# Let us download/save HDF5 and tflite forms of this model.

from google.colab import files

!rm -rf mnist_cnn_model.h5

if 'SAVE-TRAINED-MODEL' and False:  # Remove 'and False' to save the model.
  cnn_model.save('mnist_cnn_model.h5')
  !ls -la mnist_cnn_model.*
  cnn_converter = tf.lite.TFLiteConverter.from_keras_model(cnn_model)
  tflite_cnn_model = cnn_converter.convert()
  #
  # Save the model.
  with open('mnist_cnn_model.tflite', 'wb') as f:
    f.write(tflite_cnn_model)
  #
  files.download('mnist_cnn_model.h5')
  files.download('mnist_cnn_model.tflite')


if 'UPLOAD-MODEL' and True:  # Upload a model instead.
  model_data = next(iter(files.upload().values()))
  with open('mnist_cnn_model.h5', 'wb') as h_model:
    h_model.write(model_data)
  reloaded_cnn_model = tf.keras.models.load_model("mnist_cnn_model.h5")
  #
  print(
      f'Test set accuracy: {reloaded_cnn_model.evaluate(ds_test)[1] * 100:.2f}%'
      )


Test set accuracy: 99.17%


# Other major architectural ideas

We discussed convolutional networks for image recognition tasks, and also some of the common approaches to improve model performance (which typically means: generalization).

Our next major focus will be "using the TensorFlow machinery to do Physics with it - perhaps without any ML involved". So, this may be a good opportunity to discuss a few other important general ML-related ideas.

## Embeddings

Much of "supervised Machine Learning" (i.e. we have target labels) typically is about:
  
  1. Representing some aspect(s) of the real world we care about
     as a high-dimensional vector. (This might be: an image,
     a molecular structure, a sentence, a medical record, etc.)

  1. Defining a "model" $m_{\vec \theta}$
     with trainable parameters $\vec\theta\in\mathbb{R}^D$ that can
     produce the desired output.

  1. Obtaining training examples and tweaking model parameters
     (typically via some variant of stochastic gradient descent)
     to handle training set examples well.

  1. Measuring performance on a validation-set.

  1. Once everything looks good, measuring how well the model
     generalizes on the test set, and deploying the model.

We want to focus on step 1: How do we turn something like a sentence into a vector?


The "[Netflix Prize Problem](https://en.wikipedia.org/wiki/Netflix_Prize)" provides an interesting setting to study this question. The Wikipedia article has details about the back story, but in a nutshell, this was about the Netflix video streaming company setting up a $1M competition back in 2006 for building a video recommendation system that outperforms their own system by at least 10%.

Academically, what was interesting about this problem was that Netflix provided a very large training dataset. Getting training data labels is generally expensive, and here the research community had an opportunity to use a dataset much larger than what they normally had available, on the order of 100M training examples.

The problem was as follows: Netflix customers watch movies, and get an opportunity to rate how much they enjoyed a movie right after watching it.
These ratings do obviously contain personal preferences about movies, and so it clearly is attractive for Netflix to have a recommendation system that produces individualized suggestions about what other movies a customer might enjoy watching.

For our purposes, we can imagine the overall setting to be as follows: We have a large "user movie ratings" matrix $R_{um}$, where user $u$ rates movie $m$ - let's say with score $R_{um}\in [-1, 1] \cup \{{\rm NaN}\}$, where ${\rm NaN}$ is supposed to mean: "this user has not rated (perhaps not watched) that movie".
We might have additional information about movies, but here, we want to focus exclusively on this matrix.

The matrix that we are given has some entries blanked out - Netflix knows how the given user rated the given movie, but for some (perhaps a million or so) ratings, they do not tell us and instead have put a ${\rm NaN}$ entry there, just as if the user had not yet rated the movie. We do not know which entries are the missing ones, but we want to build a system that can predict these ratings well.


**Let us have a break here and give everybody 10 minutes to think about the problem.**

In the end, the prize was won in 2009, in a "photo finish". The winning team did a hundred different things, but we want to focus on the core idea.

This was simply to try to find two matrices $\tilde U, \tilde M$ such that $R\approx \tilde U\tilde M$ holds for the known entries. With indices (and summing over the repeated index $k$), $R_{um}\approx\tilde U_{uk}\,\tilde M_{km}$. Here, $u$ is still a user-index (going up to perhaps $100\,000$), and $m$ is still a movie index (going up to maybe $10\,000$ or so). The range of the index $k$ is small-ish, perhaps going up to $K=50$.

We see how we can regard this as an optimization problem. We also immediately see how we can use such a factorization to make predictions. But why is doing this useful?
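A minimal NumPy sketch of this optimization problem follows. It uses plain gradient descent on the squared error over the known entries. Everything here is an illustrative assumption rather than what the winning team did: the data is synthetic (a random low-rank matrix, not Netflix ratings), and the sizes, learning rate and step count are picked by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_movies, K = 30, 20, 3

# Synthetic rank-K "ratings" (illustrative only, this is not Netflix data).
true_ratings = (rng.normal(size=(num_users, K)) @
                rng.normal(size=(K, num_movies))) / np.sqrt(K)
# Blank out about 30% of the entries: NaN means "rating not known".
known = rng.random(size=true_ratings.shape) > 0.3
ratings = np.where(known, true_ratings, np.nan)

# Trainable factors, starting from small random values.
U = 0.1 * rng.normal(size=(num_users, K))
M = 0.1 * rng.normal(size=(K, num_movies))


def residual_on_known(U, M):
  """Returns U @ M - R on the known entries, and 0 where R is NaN."""
  return np.where(known, U @ M - np.nan_to_num(ratings), 0.0)


def mean_squared_error(U, M):
  return np.sum(residual_on_known(U, M)**2) / known.sum()


initial_loss = mean_squared_error(U, M)
learning_rate = 0.01
for _ in range(3000):
  residual = residual_on_known(U, M)
  # Gradients of 0.5 * sum(residual**2) with respect to U and M.
  grad_U = residual @ M.T
  grad_M = U.T @ residual
  U -= learning_rate * grad_U
  M -= learning_rate * grad_M
final_loss = mean_squared_error(U, M)

# Predictions for the blanked-out entries, compared with the hidden truth.
heldout_rmse = np.sqrt(np.mean((U @ M - true_ratings)[~known]**2))
print(f'loss on known entries: {initial_loss:.4f} -> {final_loss:.6f}')
print(f'RMSE on blanked-out entries: {heldout_rmse:.4f}')
```

The prediction for any missing rating is simply the corresponding entry of `U @ M`. A real system would add regularization and use stochastic updates, since the full matrix is far too large to process densely.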

Suppose individual customers have movie preferences that could be described as "generally likes Jackie Chan movies", "likes comedy", "dislikes horror movies". Now, if we had a fixed list of perhaps 200 such categories, we might try to get to a quantitative prediction if we find every user's and every movie's "profile" as a vector of quantified alignment with each category.
Then, the alignment between a given user's and movie's profile should allow us to predict an unknown rating.

The beauty of this "matrix factorization" approach is that it determines these categories for us, in such a way that they are most useful for the problem at hand!

Now, of course, as stated, the problem has a $GL(K)$ ambiguity, since we can always transform $\tilde U\to \tilde U \Lambda$, $\tilde M\to \Lambda^{-1} \tilde M$, with
$\Lambda\in GL(K)$. But let's say we deal with this somehow, perhaps by additionally postulating that every row-vector of $\tilde U$ and every column-vector of $\tilde M$ must be length-1. This would then still leave us with some ambiguity (we could still do an $O(K)$ rotation of the coordinate basis), but apart from such details, we are left with a linear preference model where the problem itself seeks out the relevant directions.
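The ambiguity is easy to check numerically. The sketch below uses small random matrices (all sizes are arbitrary choices for illustration). It confirms that inserting $\Lambda\Lambda^{-1}$ leaves the product unchanged, and that an orthogonal transformation additionally preserves unit-length rows.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
U = rng.normal(size=(6, K))
M = rng.normal(size=(K, 5))

# Some invertible K x K matrix. Adding K * identity to a random matrix is
# just a convenient way to get something well-conditioned here.
Lambda = rng.normal(size=(K, K)) + K * np.eye(K)
U_transformed = U @ Lambda
M_transformed = np.linalg.inv(Lambda) @ M
same_product = np.allclose(U @ M, U_transformed @ M_transformed)

# Residual O(K) freedom: an orthogonal Q keeps length-1 rows at length 1.
Q, _ = np.linalg.qr(rng.normal(size=(K, K)))
U_unit = U / np.linalg.norm(U, axis=1, keepdims=True)
row_lengths_after_rotation = np.linalg.norm(U_unit @ Q, axis=1)

print('product unchanged:', same_product)
print('row lengths after rotation:', row_lengths_after_rotation)
```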

Another way to view this problem: If all entries of $R$ were known, it very likely would not be a "random" matrix but have extra structure. So, the question is: if we wanted to do data reduction and store far fewer elements, what would be a good proposal to still get a reasonably good approximation to the original matrix? Intuitively, "the best thing we can usually do" is to perform a Principal Component Analysis. This amounts to writing the matrix as a product $R=USV^T$ where, since $R$ is real, $U$ and $V$ are orthogonal, and $S$ is rectangular with non-negative entries only on the diagonal, $S_{(j)(j)}$ (the singular values, or "generalized eigenvalues"), and then trimming $S$ by only retaining the $K$ largest of them. So, the approach is: "Perform a Singular Value Decomposition, and project out all smallish singular values". This is, of course, a tried and tested pre-Deep-Learning ML approach that is widely used in many disciplines, including financial analysis.
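For a fully known matrix, the truncation can be sketched with `numpy.linalg.svd`. The test matrix below is synthetic (approximately rank 3 plus a little noise, chosen only for illustration). The final check uses the Eckart-Young result: the Frobenius-norm error of the rank-$K$ truncation equals the root-sum-of-squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic matrix: rank 3 plus small noise (illustrative data only).
R = (rng.normal(size=(8, 3)) @ rng.normal(size=(3, 6)) +
     0.01 * rng.normal(size=(8, 6)))

# Thin SVD: R = U @ diag(s) @ Vt, with s sorted in descending order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

K = 3
# Keep only the K largest singular values and their singular vectors.
R_K = (U[:, :K] * s[:K]) @ Vt[:K, :]

truncation_error = np.linalg.norm(R - R_K)  # Frobenius norm.
discarded = np.sqrt(np.sum(s[K:]**2))
print('singular values:', np.round(s, 4))
print('truncation error:', truncation_error, ' discarded:', discarded)
```

This only works directly when every entry of the matrix is known. With ${\rm NaN}$ entries, as in the ratings problem, one instead fits the factors on the known entries only.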

Can we then interpret these "principal axes"? It turns out that, indeed, if we work this out for the Netflix matrix, we find that the two most relevant dimensions are "Drama-vs-Comedy" and "Unsurprising-Plot-vs-Plot-Twists".

A nicely readable article about this is: [Matrix Factorization Techniques for Recommender Systems](https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf).


Now, what does this mean for Deep Learning? A basic insight is that if we have tokenized input (such as for text: each word or multi-word term in the dictionary is one token), we often can simply map each token to some $K$-dimensional vector with a trainable mapping (which we call an "embedding"), and leave it to the problem at hand to adjust the token-to-vector mapping in a way that maximizes usefulness-for-the-problem.
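Mechanically, such an embedding is just a lookup in a table with one $K$-dimensional row per token. A minimal NumPy sketch is given below. The vocabulary is a made-up toy example, and the table is random rather than trained. In TensorFlow, `tf.keras.layers.Embedding` provides a trainable table of this kind.

```python
import numpy as np

# Hypothetical toy vocabulary; real ones have many thousands of tokens.
vocabulary = ['the', 'cat', 'sat', 'on', 'mat']
token_id = {word: i for i, word in enumerate(vocabulary)}

K = 4
rng = np.random.default_rng(3)
# One K-dimensional row per token. In a real model, these entries are
# trainable parameters; here they are just random numbers.
embedding_table = rng.normal(size=(len(vocabulary), K))


def embed(words):
  """Maps a list of words to an array of shape [len(words), K]."""
  ids = np.array([token_id[w] for w in words])
  return embedding_table[ids]


sentence = ['the', 'cat', 'sat', 'on', 'the', 'mat']
vectors = embed(sentence)

# Equivalent view: one-hot encode each token, then multiply by the table.
one_hot = np.eye(len(vocabulary))[[token_id[w] for w in sentence]]
print(vectors.shape)
print(np.allclose(vectors, one_hot @ embedding_table))
```

The one-hot view shows that an embedding is a linear layer without bias acting on one-hot input. The table lookup merely avoids multiplying by all the zeros.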

This approach has produced some interesting surprises. Leaving out some (not quite irrelevant) detail: when training a word embedding model on a problem that involves, for example, text from news articles, one may well find that the embedding has evolved some lattice-like structure.
If $E: {\rm token}\to{\mathbb R}^K$ is the trained token-embedding function, we may find that equations such as $E({\rm Paris}) - E({\rm France}) + E({\rm Germany}) \approx E({\rm Berlin})$ hold (in the sense that among the nearest neighbors of the left-hand-side vector, we find the embedding vector for "Berlin" typically in 1st place, and otherwise usually no lower than 2nd or 3rd place).
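The "nearest neighbors" test can be sketched as follows. Note the strong assumption: the embedding below is not trained. It is a hand-built toy in which the lattice structure is put in by construction, so the code only illustrates how such a query is evaluated, not that the structure emerges from training. Excluding the query words from the candidates and ranking by cosine similarity are choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 8
countries = ['France', 'Germany', 'Italy', 'Spain']
capitals = ['Paris', 'Berlin', 'Rome', 'Madrid']

# Toy embedding with the structure put in by hand (NOT a trained model):
# E(capital) = E(country) + shared "capital-of" offset + a little noise.
capital_offset = rng.normal(size=K)
E = {}
for country, capital in zip(countries, capitals):
  E[country] = rng.normal(size=K)
  E[capital] = E[country] + capital_offset + 0.01 * rng.normal(size=K)


def nearest_tokens(query, exclude=()):
  """Ranks tokens by cosine similarity to `query`, most similar first."""
  def cosine(v, w):
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))
  candidates = [t for t in E if t not in exclude]
  return sorted(candidates, key=lambda t: -cosine(E[t], query))


query = E['Paris'] - E['France'] + E['Germany']
ranking = nearest_tokens(query, exclude=('Paris', 'France', 'Germany'))
print(ranking[:3])
```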

Google put some code from 2013 online that allows exploring this phenomenon.
It has meanwhile been affected by some moderate "bit rot", but for those who are curious, the web page is
[https://code.google.com/archive/p/word2vec/](https://code.google.com/archive/p/word2vec/), and the code archive is:
[https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip](https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip)


## The wider ML Landscape

Here are some important notions and ideas from the wider Machine Learning landscape that we should at least mention, but will not have time to discuss in detail.

  * Unsupervised Learning

    In general: Learning where we have no ground truth labels.
    * Autoencoders (Variations on the "input $\to$ {compact representation} $\to$ original" idea.)
    * t-SNE as a "useful heuristic for visualizing in 2d or 3d how data
      clusters in high dimensions".

  * Semi-Supervised Learning

    Learning where we have some labeled examples, and many more
    unlabeled examples. Idea: "Knowing what stuff generally looks like
    will typically help".

  * Reinforcement Learning

    Learning behavior from reward observations.

  * Regression Problems

    Estimating a non-discrete quantity.
    Key questions revolve around "how much spread is there in the 'labels'?"
    and "what features allow me to explain how much of that spread?"
    E.g. "The standard deviation of the height of all children attending
    primary school is $X$, but if I know their age, I can predict
    their height with standard deviation $Y<X$."

  * Non-Neural-Network Approaches

    * K-Means Clustering
    * Support Vector Machines and Kernel Machines

  * Specialized NN architectures

    * Sequence-to-Sequence learning: LSTM ('ancient' but quite powerful).
    * Graph-NNs(/-CNNs)
    * "Transformer Architectures"
      More recent, quite powerful, we are still in the process
      of properly understanding what they can do and how they work.
      Basis for large language models such as
      [GPT-3](https://en.wikipedia.org/wiki/GPT-3).
    * Generative Adversarial Networks - for generating e.g.
      realistic-looking art.
      
      Also, for example: "Neural Photo Editing"
      ([Example YouTube video](https://www.youtube.com/watch?v=FDELBFSeqQs) - [paper](https://arxiv.org/abs/1609.07093)).

  * Tweaking pre-trained models;
    [TensorFlow Hub](https://www.tensorflow.org/hub)
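The regression item in the list above (spread $X$ of the labels versus remaining spread $Y<X$ once a feature is known) can be illustrated with a small NumPy sketch. All numbers are made up for illustration and are not real growth data. The model is a plain least-squares straight line.

```python
import numpy as np

rng = np.random.default_rng(5)
# Made-up numbers for illustration only (not real growth data).
age_years = rng.uniform(5.0, 11.0, size=500)
height_cm = 85.0 + 6.0 * age_years + rng.normal(scale=5.0, size=500)

# X: spread of the heights if we know nothing else.
total_std = np.std(height_cm)

# Least-squares straight line: height as a function of age.
slope, intercept = np.polyfit(age_years, height_cm, deg=1)
residuals = height_cm - (slope * age_years + intercept)
# Y: spread that remains once the age is known.
residual_std = np.std(residuals)

# Fraction of the variance in height that age "explains".
explained_fraction = 1.0 - residual_std**2 / total_std**2
print(f'X = {total_std:.2f} cm, Y = {residual_std:.2f} cm')
print(f'fraction of variance explained by age: {explained_fraction:.2f}')
```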