---

**Load essential libraries**

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('dark_background')
%matplotlib inline

import tensorflow as tf

---

**Check TensorFlow version**

---

In [None]:
tf.__version__

'2.15.0'

---

Answer the following questions inline using the documentation from:
 - Introduction to tensors https://www.tensorflow.org/guide/tensor
 - Introduction to variables https://www.tensorflow.org/guide/variable
 - Introduction to gradients and automatic differentiation https://www.tensorflow.org/guide/autodiff

 ---


1. A scalar is a rank $\underbrace{0/1/2}_\text{choose one}$ tensor.
2. *True/false*: a scalar has no axes.
3. A matrix is a rank $\underbrace{0/1/2}_\text{choose one}$ tensor and has $\underbrace{1/2}_\text{choose one}$ axes.
4. What does the function call $\texttt{tf.reshape(A, [-1])}$ do for a given tensor $\texttt{A}$?
5. *True/false*: $\texttt{tf.reshape()}$ can be used to swap axes of a tensor such as $\texttt{(patients, timestamps, features)}$ to $\texttt{(timestamps, patients, features)}.$
6. $\texttt{tf.keras}$ uses $\underline{\qquad\qquad\qquad}$ to store model parameters.
7. *True/false*: calling $\texttt{assign}$ reuses a tensor's existing memory to assign the values.
8. *True/false*: creating a new tensor $\texttt{b}$ from the value of another tensor $\texttt{a}$ as $\texttt{b = tf.Variable(a)}$ allocates separate memory for the two tensors.
9. *True/false*: two tensor variables can have the same name.
10. An example of a variable that would not need gradients is a $\underline{\qquad\qquad\qquad\qquad}.$
11. Tensor variables are typically placed in $\underbrace{\text{CPU/GPU}}_{\text{choose one}}.$
12. *True/false*: a tensor variable is trainable by default.
13. A gradient tape will automatically watch a $\underbrace{\texttt{tf.Variable/tf.constant}}_\text{choose one}$ but not a $\underbrace{\texttt{tf.Variable/tf.constant}}_\text{choose one}.$
14. What attribute can be used to calculate a layer's gradient w.r.t. all its trainable variables?
15. The option $\texttt{persistent = True}$ for a gradient tape $\underbrace{\texttt{stores/discards}}_\text{choose one}$ all intermediate results during the forward pass.
16. Executing the statement $\texttt{print(type(x).__name__)}$ when $\texttt{x}$ is a constant and when $\texttt{x}$ is a variable results in what?
17. Which one among $\texttt{tf.Tensor}$ and $\texttt{tf.Variable}$ is immutable? Which one has no state but only value? Which one has a state which is actually its value?







---

---

Consider a 1-layer neural network that takes a sample with 3 features (heart rate, blood pressure, and temperature) and predicts one of 2 possible output categories: diabetic and non-diabetic.

An individual who is diabetic has heart rate = 76 BPM, BP = 120 mm Hg, and temperature = 37.5 $^\circ\text{C}.$

Here is the forward propagation: $$\mathbf{x}\longrightarrow\mathbf{x}_B=\begin{bmatrix}\mathbf{x}\\1\end{bmatrix}\longrightarrow\mathbf{z}^{[1]} = \mathbf{W}^{[1]}\mathbf{x}_B\longrightarrow\underbrace{\mathbf{a}^{[1]}}_{=\hat{\mathbf{y}}} = \text{softmax}\left(\mathbf{z}^{[1]}\right)\longrightarrow L = \sum_{k=0}^1-y_k\log\left(\hat{y}_k\right).$$

*   Fill in the missing entries of the code below to calculate the gradients therein.
* Explain why some gradient shapes do not seem to match the usual $\text{input shape}\times\text{output shape}$ rule for gradient shapes, using the documentation on [Gradients of non-scalar targets](https://www.tensorflow.org/guide/autodiff#gradients_of_non-scalar_targets) as a resource.
* Try $\texttt{persistent = True}$ and $\texttt{persistent = False}$ in the gradient tape and observe what happens in each case.
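
As a warm-up for the second bullet, here is a minimal sketch on a toy problem that is unrelated to the network above. The input values and the names `grad_sum` and `jac` are made up for illustration. It contrasts `tape.gradient`, which for a non-scalar target differentiates the sum of the target's elements, with `tape.jacobian`, which keeps one row per target element.

```python
import numpy as np
import tensorflow as tf

x = tf.Variable([1.0, 2.0, 3.0])    # hypothetical toy input, watched automatically

# persistent=True because the tape is queried twice below
with tf.GradientTape(persistent=True) as tape:
    y = x ** 2                      # vector-valued (non-scalar) target

# For a non-scalar target, gradient() returns the gradient of sum(y),
# so the result has the same shape as the source x
grad_sum = tape.gradient(y, x)

# jacobian() keeps the target axis: shape (len(y), len(x)) = (3, 3)
jac = tape.jacobian(y, x)

print(grad_sum.shape, jac.shape)
# Summing the Jacobian over the target axis recovers what gradient() returned
print(np.allclose(jac.numpy().sum(axis=0), grad_sum.numpy()))

del tape                            # release the persistent tape's resources
```

The same comparison can be repeated on the tensors of the exercise once the blanks are filled in.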

---

In [None]:
x = tf.constant([?, ?, ?])
y = tf.constant([?, ?])
xB = tf.concat([?, 1.0*tf.ones(1)], axis = 0)
W = tf.?(0.01*(tf.random.normal((?, ?))))

with tf.GradientTape(persistent = ?) as g:
  z = tf.linalg.matvec(?, ?)
  a = tf.nn.softmax(?)
  yhat = ?
  L = tf.reduce_?(-?*tf.math.log(?))

print('Loss = %f'%(L))
print('Gradient of L w.r.t. yhat')
gradL_yhat = g.gradient(?, ?)
print(gradL_yhat)
print('-----')
print('Gradient of a w.r.t. z')
grada_z = g.gradient(?, ?)
print(grada_z)
print('-----')
print('Gradient of z w.r.t. W')
gradz_W = g.gradient(?, ?)
print(gradz_W)
print('-----')
print('Gradient of L w.r.t. W')
gradL_W = g.gradient(?, ?)
print(gradL_W)

# Delete gradient tape and release memory
del g

---

Recalculate the gradients the pen-and-paper way using numpy, with the same weights as above. Compare the results with those from the previous cell. Why do some gradients differ? Is the gradient of interest, $\nabla_{\mathbf{W}^{[1]}}(L)$, the same in both approaches (this cell and the one above)? Note that this is the only gradient we need to update the weight matrix $\mathbf{W}^{[1]}.$
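
For reference while working through the softmax step, the following is a small stand-alone sketch, not the solution to the cell below. It assumes the standard softmax Jacobian $\partial a_i/\partial z_j = a_i(\delta_{ij} - a_j)$ and checks it against central finite differences. The score vector and the helper names `softmax` and `softmax_jacobian` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """Jacobian with entries J[i, j] = a_i * (delta_ij - a_j)."""
    a = softmax(z)
    return np.diag(a) - np.outer(a, a)

z = np.array([0.5, -1.0])           # hypothetical raw scores
J = softmax_jacobian(z)

# Central finite differences as an independent check, one column per z_j
eps = 1e-6
J_fd = np.zeros((z.size, z.size))
for j in range(z.size):
    dz = np.zeros_like(z)
    dz[j] = eps
    J_fd[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(J, J_fd, atol=1e-8))   # analytic and numeric Jacobians agree
print(J.sum(axis=0))                     # column sums of the Jacobian
```

The column sums are worth a look when comparing against a result that was summed over the target elements.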


---

In [None]:
xB_np = xB.numpy().reshape(-1, 1) # bias-feature added sample vector for numpy
y = np.array([?, ?]) # true label vector
z = np.dot(W.numpy(), ?) # note we use the same weights from the previous cell here
a = tf.nn.softmax(?, axis = 0).numpy().flatten()
yhat = ?
L = tf.reduce_sum(-?*?)

print('Loss = %f'%(L))
print('-----')
print('Gradient of L w.r.t. yhat')
gradL_yhat = (? / ?)
print(gradL_yhat)
print('-----')
print('Gradient of a w.r.t. z')
grada_z = (np.identity(np.size(?))-?.reshape(-1,?).T) * a.reshape(?, ?)
print(grada_z)
print('-----')
print('Gradient of z w.r.t. W')
gradz_W = np.zeros((?.shape[0], W.shape[1], ?.shape[0]))
gradz_W[range(?), :, range(?)] = ?.flatten()
print(gradz_W)
print('-----')
print('Gradient of L w.r.t. W')
gradL_W = np.dot(?, np.dot(?, ?.reshape(-1, 1))).squeeze()
print(gradL_W)

---

For each activation function below,

1.   Sigmoid $\sigma(z)$
2.   Hyperbolic tangent $\tanh(z)$
3.   Rectified Linear Unit $\text{ReLU}(z)$
4.   Leaky rectified linear unit $\text{LReLU}(z)$

*   plot the activation and its gradient in the same figure for raw score values $z$ ranging between $-10$ and $10$;
*   comment on whether the backward flowing gradient on the input side of the activation layer will have a smaller or bigger magnitude compared to the backward flowing gradient on the output side of the activation layer. Recall that what connects these two gradients is the local gradient of the activation layer which you may have just plotted.
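
The link between the two backward-flowing gradients can be sketched in plain numpy for the sigmoid alone; the other activations are left to the exercise. This sketch assumes the closed-form derivative $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$ and a made-up upstream gradient of ones. The names `upstream`, `local_grad`, and `downstream` are illustrative only.

```python
import numpy as np

z = np.linspace(-10.0, 10.0, 129)     # raw scores, same range as the exercise
a = 1.0 / (1.0 + np.exp(-z))          # sigmoid activation
local_grad = a * (1.0 - a)            # local gradient d(sigmoid)/dz

upstream = np.ones_like(z)            # hypothetical gradient on the output side
downstream = upstream * local_grad    # chain rule: gradient on the input side

print(local_grad.max())               # largest local gradient over the range
print(np.abs(downstream).max() <= np.abs(upstream).max())
```

Because the activation acts elementwise, the chain rule reduces to an elementwise product, so the magnitude of the local gradient alone decides whether the gradient shrinks or grows as it passes through the layer.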





---


In [None]:
z = tf.linspace(?, ?, 129) # A tf.Tensor, not a tf.Variable

with tf.GradientTape(persistent = ?) as g:
    g.?(?)
    a_sigmoid = tf.math.?(z)
    a_tanh = tf.math.?(z)
    a_ReLU = z * tf.cast((z > ?), z.dtype) # cast the mask to z's dtype (tf.linspace yields float32 for Python floats)
    a_LReLU = ? + 0.01*z*tf.cast((z ? 0), z.dtype)

grada_sigmoid_z = g.gradient(?, ?)
grada_tanh_z = g.gradient(?, ?)
grada_ReLU_z = g.gradient(?, ?)
grada_LReLU_z = g.gradient(?, ?)

fig, axs = plt.subplots(2, 2, figsize = (8, 8))
axs[0, 0].plot(?, ?, label = 'activated score')
axs[0, 0].plot(?, ?, label='gradient of activated score')
axs[0, 0].legend(loc = 'upper left')
axs[0, 0].set_xlabel('z')
axs[0, 0].set_title('Sigmoid activation and gradient');

axs[?, ?].plot(?, ?, label = 'activated score')
axs[?, ?].plot(?, ?, label='gradient of activated score')
axs[?, ?].legend(loc = 'upper left')
axs[?, ?].set_?('z')
axs[?, ?].set_?('Tanh activation and gradient');

axs[?, ?].plot(?, ?, label = 'activated score')
axs[?, ?].plot(?, ?, label='gradient of activated score')
axs[?, ?].legend(loc = 'upper left')
axs[?, ?].?('z')
axs[?, ?].?('ReLU activation and gradient');

axs[?, ?].plot(?, ?, label = 'activated score')
axs[?, ?].plot(?, ?, label='gradient of activated score')
axs[?, ?].?(loc = 'upper left')
axs[?, ?].?('z')
axs[?, ?].?('Leaky ReLU activation and gradient');