# CS-5600/6600 Lecture 20 - TensorFlow and Keras

**Instructor: Dylan Zwick**

*Weber State University*

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras

## TensorFlow

**What is TensorFlow?**

TensorFlow is an open-source machine learning platform developed by Google that provides a powerful suite of tools for data scientists and developers to build, deploy, and train machine learning models. It was initially released in 2015, and has evolved significantly since. The TensorFlow library allows developers to create complex neural networks using a variety of programming languages, such as Python and JavaScript. Additionally, TensorFlow makes it easy to deploy models on mobile devices or cloud platforms like Google Cloud Platform (GCP) and Amazon Web Services (AWS).

TensorFlow is used in a variety of applications, ranging from natural language processing (NLP) and image recognition to predictive analytics and autonomous vehicle control. It can be used to train deep neural networks for object detection and classification, generate recommendations, classify images, and build voice-powered applications.

**What are Tensors?**

A tensor is, basically, an $n$-dimensional generalization of a matrix. A zero-dimensional tensor is a scalar, which contains a single value and has no axes. A one-dimensional tensor is a vector, which contains a list of values and has one axis. A two-dimensional tensor is a matrix that contains values stored across two axes.

<center>
  <img src="https://drive.google.com/uc?export=view&id=1gT9zfALy0q1MLieyqWeyYbbawZM8jXfw" alt="Tensors">
</center>
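As a quick illustration of the ranks described above, here is a minimal sketch that builds tensors with zero, one, two, and three axes and prints the rank and shape of each. The variable names and values are made up for the example.

```python
import tensorflow as tf

scalar = tf.constant(7.0)                        # rank 0: a single value, no axes
vector = tf.constant([1.0, 2.0, 3.0])            # rank 1: one axis
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # rank 2: two axes
cube = tf.zeros((2, 3, 4))                       # rank 3: three axes

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    # The rank is just the number of axes, i.e. the length of the shape.
    print(name, "rank:", len(t.shape), "shape:", tuple(t.shape))
```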

**How Does TensorFlow Work?**

At the core of TensorFlow is a dataflow graph, which describes how data moves through a series of operations or transformations. The basic idea behind the dataflow graph is that operations are expressed as nodes, with each node performing a single operation on its inputs. The inputs and outputs of the operations are passed through edges (tensors). This makes it possible to break down complex computations into smaller, more manageable chunks.

TensorFlow also provides a number of tools for constructing and training neural networks. One of the most popular tools is the utility tf.keras (much more on this below), which allows users to quickly build and train deep learning models without having to write code from scratch. It also includes powerful visualization tools to help users understand the data and model parameters.

TensorFlow is also extensible, and it can be used with a variety of programming languages, including Python, C++, JavaScript, and Go. It also has support for running on GPUs (graphics processing units) for maximum performance.

Some important facts about TensorFlow:

* Its core is very similar to NumPy, but with GPU support.
* It supports distributed computing (across multiple devices and servers).
* It includes a kind of just-in-time (JIT) compiler that allows it to optimize computations for speed and memory usage. It works by extracting the *computation graph* from a Python function, then optimizing it, and finally running it efficiently.
* Computation graphs can be exported to a portable format, so you can train a TensorFlow model in one environment and run it in another.
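The graph extraction mentioned in the list above can be seen with *tf.function*. The sketch below wraps an ordinary Python function (a hypothetical `cube_plus_one`, invented for this example), runs it both eagerly and as a traced graph, and lists the operations TensorFlow extracted.

```python
import tensorflow as tf

def cube_plus_one(x):
    return x ** 3 + 1.0

# Wrapping the Python function asks TensorFlow to trace it into a graph.
graph_fn = tf.function(cube_plus_one)

x = tf.constant(2.0)
eager_result = cube_plus_one(x)   # runs op by op in Python
graph_result = graph_fn(x)        # runs the traced graph

# The traced graph can be inspected: these are the ops TensorFlow extracted.
concrete = graph_fn.get_concrete_function(x)
op_types = [op.type for op in concrete.graph.get_operations()]
print(float(eager_result), float(graph_result), op_types)
```

Both calls should give the same number; the difference is in how the computation is executed, not in what it computes.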

TensorFlow offers many more features built on top of these core features: the most important is of course *tf.keras*, but it also has data loading and preprocessing ops (*tf.data*, *tf.io*, etc.), image processing ops (*tf.image*), signal processing ops (*tf.signal*), and more. We won't cover anywhere close to everything, so it's worth checking out the [documentation](https://www.tensorflow.org/).

At the lowest level, each TensorFlow operation (*op* for short) is implemented using highly efficient C++ code. Many operations have multiple implementations called *kernels*: each kernel is dedicated to a specific device type, such as CPUs, GPUs, or even TPUs (*tensor processing units*). As you may know, GPUs can dramatically speed up computations by splitting them into many smaller chunks and running them in parallel across many GPU threads. TPUs are even faster: they are custom ASIC chips built specifically for deep learning operations.

Most of the time your code will use the high-level APIs (especially *tf.keras* and *tf.data*), but when you need more flexibility, you can use the lower-level Python API, handling tensors directly. Note that APIs for other languages are also available. In fact, TensorFlow runs not only on Windows, Linux, and macOS, but also on mobile devices, including both iOS and Android. There's even a JavaScript implementation called *TensorFlow.js* that makes it possible to run models directly in a browser!

**Using TensorFlow like NumPy**

TensorFlow's API revolves, appropriately, around *tensors*, which flow from operation to operation - thus the name. A tensor is very similar to a NumPy *ndarray*: it is usually a multidimensional array, but it can also hold a scalar. Let's see how to create and manipulate them.

In [None]:
a = np.array([2.,4.,5.])
t = tf.constant(a)
t.numpy()

In [None]:
tf.square(a)

In [None]:
np.square(t)

*Type Conversions*

Type conversions can significantly hurt performance, and they can easily go unnoticed when they are done automatically. To avoid this, TensorFlow does not perform any type conversions automatically: it just raises an exception if you try to execute an operation on tensors with incompatible types. For example, you cannot add a float tensor and an integer tensor, and you cannot even add a 32-bit float and a 64-bit float. The first two cells below raise an exception; the third, where both operands are 32-bit floats, works:

In [None]:
tf.constant(2.) + tf.constant(40)

In [None]:
tf.constant(2.) + tf.constant(40, dtype=tf.float64)

In [None]:
tf.constant(2.) + tf.constant(40, dtype=tf.float32)
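When you really do need to combine tensors of different types, the conversion has to be requested explicitly with *tf.cast*. A minimal sketch, reusing the same values as the cells above:

```python
import tensorflow as tf

f32 = tf.constant(2.0)                     # float32 by default
i32 = tf.constant(40)                      # int32 by default
f64 = tf.constant(40, dtype=tf.float64)

# Explicit casts make the conversion visible in the code.
total_a = f32 + tf.cast(i32, tf.float32)
total_b = f32 + tf.cast(f64, tf.float32)
print(total_a.dtype, float(total_a), float(total_b))
```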

**Variables**

The *tf.Tensor* values we've seen so far are immutable: you cannot modify them. This means that we cannot use regular tensors to implement weights in a neural network, since they need to be tweaked by backpropagation. Plus, other parameters may also need to change over time. What we need is a *tf.Variable*:

In [None]:
v = tf.Variable([[1., 2., 3.], [4., 5., 6.]])
v

A *tf.Variable* acts much like a *tf.Tensor*: you can perform the same operations with it, it plays nicely with NumPy as well, and it is just as picky with types. But it can also be modified in place: the *assign()* method overwrites its value, while *assign_add()* and *assign_sub()* increment or decrement it by the given value. Individual cells and slices can be assigned the same way.

In [None]:
v.assign(2 * v)

In [None]:
v[0,1].assign(42)

In [None]:
v[:,2].assign([0., 1.])
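The cells above only use *assign()*. Here is a small sketch of the other two in-place methods on a separate, made-up variable, so the three can be compared side by side:

```python
import tensorflow as tf

v = tf.Variable([1.0, 2.0, 3.0])
v.assign([10.0, 20.0, 30.0])     # overwrite:  [10.0, 20.0, 30.0]
v.assign_add([1.0, 1.0, 1.0])    # increment:  [11.0, 21.0, 31.0]
v.assign_sub([0.5, 0.5, 0.5])    # decrement:  [10.5, 20.5, 30.5]
print(v.numpy())
```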

In practice you will rarely have to create variables manually, since Keras provides an *add_weight()* method that will take care of it for you. Moreover, model parameters will generally be updated directly by the optimizers, so you will rarely need to update variables manually.

OK, so far TensorFlow looks a lot like NumPy, and that's by design. So what's something TensorFlow can do that NumPy can't? Well, with TensorFlow we can retrieve the gradient of any differentiable expression with respect to any of its inputs. This is done within something called a *GradientTape*:

In [None]:
input_var = tf.Variable(initial_value=3.)
with tf.GradientTape() as tape:
    result = tf.square(input_var)
gradient = tape.gradient(result, input_var)
gradient

This is most commonly used to retrieve the gradients of the loss of a model with respect to its weights.
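To make that concrete, here is a minimal sketch with a one-weight "model" and a mean squared error loss. The data and the starting weight are invented for the example, and the hand-computed gradient is included as a check.

```python
import tensorflow as tf

w = tf.Variable(1.0)                      # a single trainable weight
x = tf.constant([1.0, 2.0, 3.0])          # inputs
y = tf.constant([2.0, 4.0, 6.0])          # targets (generated by w = 2)

with tf.GradientTape() as tape:
    predictions = w * x
    loss = tf.reduce_mean(tf.square(predictions - y))

grad = tape.gradient(loss, w)

# By hand: d(loss)/dw = mean(2 * (w*x - y) * x) = 2 * mean([-1, -4, -9]) = -28/3
print(float(loss), float(grad))
```

The gradient is negative, which says that increasing `w` (toward 2) would reduce the loss.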

## Keras

In our last class, we stepped through the math of backpropagation, and went deeper into the implementation of a neural network. Today, we'll learn about Keras, a tool that handles it all for you - and probably does a better job than if you built it yourself!

Keras is a high-level deep learning [API](https://aws.amazon.com/what-is/api/) that allows you to easily build, train, evaluate, and execute all sorts of neural networks. It has quickly gained popularity, owing to its ease of use, flexibility, and design. Its documentation can be found [here](https://keras.io/). A good book on it is [Deep Learning with Python](https://www.amazon.com/Learning-Python-Second-Fran%C3%A7ois-Chollet/dp/1617296864/ref=sr_1_1) by François Chollet, the person who designed Keras; it is one of the recommended books for this class.

To perform the heavy computations required by neural networks, Keras relies on a computation backend. At present, you can choose from (at least) TensorFlow, Microsoft Cognitive Toolkit, and Theano. In fact, TensorFlow itself now comes bundled with its own Keras implementation, *tf.keras*. Unsurprisingly, it only supports TensorFlow as the backend, but it also offers some very useful extra features - for example, it supports TensorFlow's Data API, which makes it easy to load and preprocess data efficiently. For this reason, we'll use tf.keras in this class. However, most of what we do won't be TensorFlow-specific, so the code should mostly run fine on other Python Keras implementations as well. Note also that the PyTorch API is quite similar to Keras, in part because both were inspired by Scikit-Learn.

The Keras library contains a number of common machine learning datasets that you can load, including the handwritten digits dataset - [MNIST](https://keras.io/api/datasets/mnist/) - that has been our main example so far. We can import it with the code below.

In [None]:
mnist = keras.datasets.mnist
(X_train_full, y_train_full), (X_test, y_test) = mnist.load_data()

We note here that the dataset is already split into a training set and a test set, but there is no validation set, so we'll create one now. Additionally, we'll scale the input features, which are represented as integers from $0$ to $255$, to the $0$-$1$ range.

In [None]:
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.0

Now, let's build the neural network!

***Creating the Model***

Here is a classification MLP with one hidden layer:

In [None]:
model = keras.models.Sequential()
model.add(keras.layers.Input(shape=[28, 28]))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(30, activation="sigmoid"))
model.add(keras.layers.Dense(10, activation="sigmoid"))

Let's go through this code line by line:

- The first line creates a *Sequential* model. This is the simplest kind of Keras model for neural networks that are composed of a single stack of layers connected sequentially.
- We define the shape of the input.
- We "flatten" the input, which means we transform our $28 \times 28$ array into a one-dimensional array with $28 \times 28 = 784$ entries.
- Next we add a dense hidden layer with 30 nodes (a.k.a. *neurons*). It will use the sigmoid activation function. Each *Dense* layer manages its own weight matrix, containing all the connection weights between the neurons and their inputs. It also manages a vector of bias terms (one per node).
- Finally, we add a dense output layer with 10 nodes (one per class), again using the sigmoid activation function.
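To see what these layers compute, here is a rough NumPy sketch of the same forward pass on a single image. It is not how Keras implements it internally; the random image and the small random weights are stand-ins chosen only to show the shapes at each step.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

image = rng.random((28, 28))                  # stand-in for one scaled MNIST image
W1 = rng.normal(scale=0.05, size=(784, 30))   # hidden layer weights
b1 = np.zeros(30)                             # hidden layer biases (one per node)
W2 = rng.normal(scale=0.05, size=(30, 10))    # output layer weights
b2 = np.zeros(10)                             # output layer biases

flat = image.reshape(-1)                      # Flatten: (28, 28) -> (784,)
hidden = sigmoid(flat @ W1 + b1)              # Dense(30, activation="sigmoid")
output = sigmoid(hidden @ W2 + b2)            # Dense(10, activation="sigmoid")
print(flat.shape, hidden.shape, output.shape)
```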

The potential activation functions can be found [here](https://keras.io/api/layers/activations/). Note that instead of adding the layers one-by-one, we could pass a list of layers when we create the model.

In [None]:
"""
model = keras.models.Sequential([
    keras.layers.Input(shape=[28, 28]),
    keras.layers.Flatten(),
    keras.layers.Dense(30, activation="sigmoid"),
    keras.layers.Dense(10, activation="sigmoid")
    ])
"""

The model's *summary()* method displays all the model's layers, including each layer's name (which is automatically generated unless you set it when creating the layer), its output shape, and the number of parameters.

In [None]:
model.summary()
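The parameter counts in the summary can be checked by hand: each dense layer has one weight per (input, node) pair plus one bias per node. A short arithmetic sketch for the 784-30-10 network defined above:

```python
n_inputs, n_hidden, n_outputs = 28 * 28, 30, 10

hidden_params = n_inputs * n_hidden + n_hidden     # 784*30 weights + 30 biases
output_params = n_hidden * n_outputs + n_outputs   # 30*10 weights + 10 biases
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)  # 23550 310 23860
```

The *Flatten* layer only reshapes its input, so it contributes no parameters.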

Keras even includes a tool for generating an image of the model.

In [None]:
keras.utils.plot_model(model)

Perhaps this one is not so incredibly enlightening.

You can easily get a model's list of layers and fetch a layer by its index, or you can fetch a layer by its name using the model's *get_layer()* method.

In [None]:
model.layers

In [None]:
hidden1 = model.layers[1]
hidden1.name

All the parameters of a layer can be accessed using its *get_weights()* and *set_weights()* methods. For a dense layer, this includes both the connection weights and the bias terms.

In [None]:
weights, biases = hidden1.get_weights()
weights

In [None]:
weights.shape

In [None]:
biases

In [None]:
biases.shape

Notice the dense layer initialized the connection weights randomly (which is needed to avoid redundancy in learning), while the biases were initialized to zeros, which is fine. If you want to use different initialization methods, you can use kernel_initializer and bias_initializer. More info [here](https://keras.io/api/layers/initializers/).
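As a sketch of how those two arguments are passed, the layer below uses the built-in "he_normal" and "ones" initializers. The choice is arbitrary and only meant to show the mechanism, not a recommendation for this network.

```python
from tensorflow import keras

layer = keras.layers.Dense(
    30,
    activation="sigmoid",
    kernel_initializer="he_normal",   # how the connection weights start out
    bias_initializer="ones",          # how the bias terms start out
)

# Building the layer for 784 inputs creates (and initializes) its variables.
layer.build((None, 784))
weights, biases = layer.get_weights()
print(weights.shape, biases[:5])
```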

***Compiling the Model***

After a model is created, you must call its *compile()* method to specify the loss function and the optimizer to use. You can also, if desired, specify a list of extra metrics to compute during training and evaluation.

In [None]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics = ["accuracy"])

This requires some explanation. First, we use the "sparse_categorical_crossentropy" loss because we have sparse labels (for each instance, there is just a target class index, from $0$ to $9$ in this case), and the classes are exclusive. If we had one-hot vectors instead, we would use the "categorical_crossentropy" loss.
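The difference between the two label formats is easy to see in NumPy. In this sketch the three labels are made up; each sparse label is a class index, and its one-hot version is a length-10 vector with a single 1 at that index.

```python
import numpy as np

sparse_labels = np.array([3, 0, 9])                           # class indices
one_hot_labels = np.eye(10, dtype="float32")[sparse_labels]   # one row per label

print(one_hot_labels)

# Going back from one-hot vectors to class indices:
recovered = one_hot_labels.argmax(axis=1)
```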

What is the cross-entropy? It's the negative of the logarithm of the predicted probability of the result:

<center>
  $\displaystyle L = -\log(\hat{y}[y])$
</center>

Here $y$ is the actual value, and $\hat{y}[y]$ is the model's probability of the actual value. We add these up over all predictions to get our total cross-entropy.
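A small worked example of this formula, using two made-up predictions over three classes (rather than ten) to keep it readable:

```python
import numpy as np

# Each row is the model's predicted probability for each class.
y_hat = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1]])
y = np.array([1, 2])                     # the actual classes

# Pick out the probability assigned to the actual class in each row.
prob_of_actual = y_hat[np.arange(len(y)), y]    # [0.7, 0.1]
per_sample = -np.log(prob_of_actual)            # about [0.357, 2.303]
total = per_sample.sum()
print(per_sample, total)
```

The confident correct prediction (0.7) contributes a small loss, while the prediction that gave the actual class only 0.1 is penalized heavily. Note that Keras, by default, reports the average of these per-sample values over a batch rather than their sum.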

Regarding the optimizer, "sgd" means that we will train the model using simple stochastic gradient descent. Note the learning rate defaults to $lr = 0.01$, but we could specify it with *optimizer = keras.optimizers.SGD(learning_rate=???)* (older Keras releases also accepted *lr=*).
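A minimal sketch of creating the optimizer object explicitly. The value 0.05 is an arbitrary choice for illustration, not a tuned setting for this model.

```python
from tensorflow import keras

# Same optimizer as the "sgd" string, but with a learning rate we chose.
optimizer = keras.optimizers.SGD(learning_rate=0.05)

# The configuration records the value that was set.
config_lr = optimizer.get_config()["learning_rate"]
print(config_lr)
```

The object would then be passed to *compile()* in place of the string, as in `optimizer=optimizer`.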

Finally, since it is a classifier, it's useful to describe its "accuracy" during training and evaluation.

See these links for more info on [losses](https://keras.io/api/losses/), [optimizers](https://keras.io/api/optimizers/), and [metrics](https://keras.io/api/metrics/).

***Training the Model***

Now the model is ready to be trained. To do this, we just call its *fit()* method.

In [None]:
history = model.fit(X_train, y_train, epochs = 10, validation_data=(X_valid,y_valid))

Note that passing the validation set is optional. However, it's good practice. If the performance on the training set is much better than on the validation set, you've probably got an overfit model. Instead of passing a validation set using the *validation_data* argument, we could set *validation_split* to the fraction of the training set that we want Keras to use for validation. So, *validation_split = $0.1$* would tell Keras to use the last 10% of the data (before shuffling) for validation.

That's it! The neural network is trained. The *fit()* method returns a *History* object containing the training parameters (history.params), the list of epochs it went through (history.epoch), and most importantly a dictionary (history.history) containing the loss and extra metrics it measured at the end of each epoch on the training data and the validation data (if provided).

We can use this dictionary to create a pandas dataframe and then plot it.

In [None]:
pd.DataFrame(history.history).plot(figsize=(10,6))
plt.grid(True)
plt.show()

The validation accuracy starts above the training accuracy. Why is this? Because the validation accuracy is calculated at the *end* of each epoch, once that epoch's weight updates are done, while the training accuracy is a running average calculated *during* the epoch, so it includes the weaker early batches. Note that after 10 epochs both validation and training accuracy were quite close, and still growing, indicating that more training probably would be valuable.

If you're not satisfied with the performance of your model, you should go back and tune the hyperparameters. The first one to check is the learning rate. If that doesn't help, try another optimizer. If the performance is still not great, try tuning model hyperparameters such as the number of layers, the number of nodes per layer, and the types of activation function used for each layer.

Finally, we should verify how the model performs on the test set.

In [None]:
model.evaluate(X_test, y_test)

***Predicting With The Model***

We can use the model's *predict()* method to make predictions on new instances. Since we don't have actual new instances, we can just try the first three instances in the test set.

In [None]:
X_new = X_test[:3]
y_proba = model.predict(X_new)
y_proba.round(2)

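*Note* - Because the output layer applies a sigmoid to each node independently (rather than a softmax across all ten), these values are not normalized probabilities, so each row can sum to more or less than 1. The classifier just picks the class with the highest value.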
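Picking the highest value is a single *argmax* per row. In the sketch below the two rows are invented numbers, not real output from the model above; they are only there to show the mechanics.

```python
import numpy as np

# Hypothetical per-class scores for two instances (illustrative values only).
y_proba = np.array([
    [0.01, 0.02, 0.05, 0.03, 0.00, 0.01, 0.00, 0.97, 0.01, 0.04],
    [0.10, 0.03, 0.92, 0.20, 0.00, 0.05, 0.08, 0.00, 0.06, 0.00],
])

y_pred = np.argmax(y_proba, axis=1)   # index of the largest score in each row
row_sums = y_proba.sum(axis=1)        # these need not equal 1
print(y_pred, row_sums)
```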

## Appendix - Building a Linear Classifier With Just TensorFlow

**Example - A Linear Classifier**

Alright, using just what we've learned so far, we now know enough to build any machine learning model based on gradient descent. Woo hoo!

In a machine learning job interview, you may be asked to implement a linear classifier from scratch with TensorFlow. Let's see how we'd do that.

First, let's come up with some nicely linearly separable data to work with: two classes of points in a 2D plane. We'll generate each class of points by drawing their coordinates from a random distribution with a specific covariance matrix and a specific mean. We'll use the same covariance matrix for both clouds, but we'll use two different mean values.

In [None]:
num_samples_per_class = 1000
negative_samples = np.random.multivariate_normal(
    mean = [0,3],
    cov = [[1, 0.5], [0.5, 1]],
    size = num_samples_per_class)

positive_samples = np.random.multivariate_normal(
    mean = [3,0],
    cov = [[1, 0.5], [0.5, 1]],
    size = num_samples_per_class)

inputs = np.vstack((negative_samples, positive_samples)).astype(np.float32)
targets = np.vstack((np.zeros((num_samples_per_class, 1), dtype="float32"),
                   np.ones((num_samples_per_class, 1), dtype="float32")))

We can plot this data:

In [None]:
plt.scatter(inputs[:,0], inputs[:,1], c=targets[:,0])
plt.show()

Now, let's create a linear classifier that can learn to separate these two blobs. A linear classifier is a model of the form:

<center>
    prediction = $\displaystyle W\textbf{x} + \textbf{b}$,
</center>

where $W$ is a matrix and $\textbf{b}$ is a vector. (Here $W$ stands for "weights" and $\textbf{b}$ stands for "bias".)

First, let's create our initial weights and biases, initialized with random values and zeros, respectively.

In [None]:
input_dim = 2
output_dim = 1
W = tf.Variable(initial_value=tf.random.uniform(shape=(input_dim, output_dim)))
b = tf.Variable(initial_value=tf.zeros(shape=(output_dim,)))

Now, we'll create our forward pass function. Note that given how we've set it up (each row of *inputs* is one sample, and $W$ has shape (input_dim, output_dim)), we'll want to have our inputs on the *left* of the multiplication and our weights on the *right*.

In [None]:
def model(inputs):
    return tf.matmul(inputs, W) + b

Our linear classifier is operating on 2D inputs, and so $W$ is really just two scalar coefficients $w_{1}$ and $w_{2}$. Meanwhile, $b$ is just a single scalar coefficient. In other words, the prediction is:

<center>
    $\displaystyle w_{1}x_{1} + w_{2}x_{2} + b$.
</center>
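A quick NumPy check that the matrix form and the scalar form agree. The three sample points and the weight values are made up for the example.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, -1.0],
              [0.5, 0.5]])          # three samples, two features each
W = np.array([[0.4], [-0.2]])       # shape (input_dim, output_dim) = (2, 1)
b = np.array([0.1])                 # shape (output_dim,) = (1,)

matrix_form = X @ W + b             # shape (3, 1); b is broadcast across rows
scalar_form = W[0, 0] * X[:, 0] + W[1, 0] * X[:, 1] + b[0]

print(matrix_form[:, 0], scalar_form)
```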

We can change the index of the variables to be more "pythonic", and the labels for the inputs to reflect a 2D plane:

<center>
    $\displaystyle w_{0}x + w_{1}y + b$.
</center>

Alright, so now that we have our forward pass function, we'll also need our error function - how we measure how close our prediction was to the actual value. For this, we can use our old favorite, the square of the difference.

In [None]:
def square_loss(targets, predictions):
    per_sample_losses = tf.square(targets - predictions)
    return tf.reduce_mean(per_sample_losses)

*Note* - What do we mean by "reduce_mean" here? Why isn't it just "mean"? Well, because for this operation TensorFlow computes the mean through a map-reduce operation in which the order of operations isn't always the same. This matters because for certain very high precision situations, doing the operations in a different order can lead to different rounding decisions and so different final outcomes. The name of the function is meant to make that explicit.
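Whatever the origin of the name, in use *tf.reduce_mean* behaves much like *np.mean*: with no *axis* argument it averages over every element, and with an *axis* it averages along that axis only. A minimal sketch on a made-up 2x2 tensor:

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])

overall = tf.reduce_mean(x)              # mean of all four values
per_column = tf.reduce_mean(x, axis=0)   # collapses the rows
per_row = tf.reduce_mean(x, axis=1)      # collapses the columns
print(float(overall), per_column.numpy(), per_row.numpy())
```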

Alright, so now the final thing we need to do is handle the updates to our weights and biases after a given batch. To do this, we can use TensorFlow to calculate the gradient of our loss with respect to our weights and biases, and then nudge each one in the opposite direction of its gradient, scaled by our learning rate.

In [None]:
learning_rate = .1

def training_step(inputs, targets):
    with tf.GradientTape() as tape:
        predictions = model(inputs)
        loss = square_loss(targets, predictions)
    grad_loss_wrt_W, grad_loss_wrt_b = tape.gradient(loss, [W,b])
    W.assign_sub(grad_loss_wrt_W * learning_rate)
    b.assign_sub(grad_loss_wrt_b * learning_rate)
    return loss
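To see what *tape.gradient* is computing here, the same update can be written in plain NumPy with the gradients derived by hand. For the mean squared loss, the gradient with respect to $W$ is $\frac{2}{N}X^{T}(XW + b - y)$ and the gradient with respect to $b$ is twice the mean error. This is only a sketch on freshly generated data (uncorrelated clouds, a fixed seed, and fewer points than above), not a replacement for the TensorFlow version.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # samples per class (assumed smaller than above to keep it quick)

# Two clouds of 2D points, centered at (0, 3) and (3, 0).
X = np.vstack([rng.normal([0, 3], 1.0, size=(n, 2)),
               rng.normal([3, 0], 1.0, size=(n, 2))])
y = np.vstack([np.zeros((n, 1)), np.ones((n, 1))])

W = rng.uniform(size=(2, 1))
b = np.zeros(1)
learning_rate = 0.1
losses = []

for step in range(40):
    error = X @ W + b - y                   # predictions minus targets, (2n, 1)
    losses.append(float(np.mean(error ** 2)))
    grad_W = 2.0 * X.T @ error / len(X)     # hand-derived d(loss)/dW
    grad_b = 2.0 * error.mean(axis=0)       # hand-derived d(loss)/db
    W -= learning_rate * grad_W             # same role as W.assign_sub(...)
    b -= learning_rate * grad_b             # same role as b.assign_sub(...)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Writing the gradients out by hand works for a model this small; the appeal of the tape is that it produces them automatically for any differentiable model.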

Isn't that nice! That gradient function is pretty sweet.

Now, we just need to run our *training_step* function however many times we think we should. Let's try 42.

In [None]:
for step in range(42):
    loss = training_step(inputs, targets)
    print(f"Loss at step {step+1}: {loss: .4f}")

OK. Looks like the loss has settled down, which means the line has too. So, what are our predicted values going to be? Well, we can plot them using the code below.

In [None]:
predictions = model(inputs)
plt.scatter(inputs[:,0], inputs[:,1], c=predictions[:, 0] > 0.5)
plt.show()

Fundamentally, our predictions are linearly separated. What's the separating line? Well, if we start with our model equation:

<center>
    $\displaystyle w_{0}x + w_{1}y + b$,
</center>

we can note that it predicts the label $1$ when:

<center>
    $\displaystyle w_{0}x + w_{1}y + b \geq .5$,
</center>

and so the boundary line is:

<center>
    $\displaystyle w_{0}x + w_{1}y + b = .5$.
</center>

We can rewrite this in slope-intercept form as:

<center>
    $\displaystyle y = -\frac{w_{0}}{w_{1}}x + \frac{.5-b}{w_{1}}$.
</center>
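A quick numerical sanity check of that rearrangement: for any point on the line, the model's output should come out to exactly $0.5$. The coefficient values below are hypothetical, picked only for the check.

```python
import numpy as np

w0, w1, b = 0.2, -0.15, 0.4            # hypothetical trained values

x = np.linspace(-2, 6, 5)
y = -(w0 / w1) * x + (0.5 - b) / w1    # the boundary line in slope-intercept form

# Plug the points on the line back into the model equation.
model_output = w0 * x + w1 * y + b
print(model_output)
```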

Plotting this line against our initial labels we have:

In [None]:
x = np.linspace(-2,6,200)
y = (-W[0] / W[1]) * x + (0.5-b) / W[1]
plt.plot(x,y,"-r")
plt.scatter(inputs[:,0], inputs[:,1], c=targets[:,0])
plt.show()

Voila! A linear classifier. You move on in the interview.