# Introduction

TensorFlow is based on (surprise!) tensors. Tensors are very similar to NumPy arrays. The main differences are that

1. They can be easily handled by GPUs
2. They are immutable

If you want to know more, the official documentation has a very interesting [notebook](https://www.tensorflow.org/guide/tensor)
showing their properties. You may also have a look at the TensorFlow basics [guide](https://www.tensorflow.org/guide). This notebook is loosely based on that guide.

We are going to use tensors to build a neural network from scratch. If you are relatively new to neural networks and don't
know the mathematical details of how neural networks work, I recommend the introductory neural network lectures from Stanford
University (here [part 1](https://www.youtube.com/watch?v=MfIjxPh6Pys) and [part 2](https://www.youtube.com/watch?v=zUazLXZZA2U)).
From now on, I am going to assume that you know all the math background.

OK, let's get our hands dirty! Let's start by importing the libraries.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

from matplotlib import pyplot as plt
from time import time
import tensorflow as tf
from sklearn.preprocessing import OneHotEncoder, scale
from sklearn.model_selection import train_test_split


Now load the data

In [2]:
X_train=np.linspace(0,2*np.pi,10)
target_y = tf.sin(X_train)



And prepare it for our neural network

We are done with the preparation!

# Neural network without hidden layers

Now let's start with a very simple model of one layer. First we need a weight ~~matrix~~ tensor.
We have 784 inputs (pixels) and 10 outputs (numbers 0-9), therefore the weight matrix should be 784 x 10. Let's initialize it.

In [3]:
weights_random = tf.Variable(tf.zeros([1,1]))

Now we are using a *variable* instead of a tensor. A variable is essentially a tensor wrapped in an object that adds
extra capabilities (its value can be updated in place, and TensorFlow can track it when computing gradients), so
for the time being let's consider them equivalent.

Let's define a couple of well-known loss functions: the mean squared error loss and the cross-entropy loss.

In [None]:
def loss_mse(target_y, predicted_y):
  return tf.reduce_mean(tf.square(target_y - predicted_y))

def loss_crsntrpy(target_y,predicted_y):
  return -tf.reduce_sum(tf.reduce_mean(target_y * tf.math.log(predicted_y + 1e-12),axis=0))

If the names `reduce_sum` and `reduce_mean` remind you of Hadoop's MapReduce, you are not [totally wrong](https://realpython.com/python-reduce-function/).
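To make the connection with `reduce` concrete, here is a minimal pure-Python sketch (no TensorFlow involved) of what `loss_crsntrpy` computes for a tiny batch. The helpers `reduce_sum`, `reduce_mean_axis0` and `cross_entropy` are hypothetical stand-ins written for this illustration; they only mimic the TensorFlow functions on plain lists, and the numbers are made up.

```python
from functools import reduce
from math import log

def reduce_sum(values):
    # Fold the sequence into a single number with a running sum
    return reduce(lambda acc, v: acc + v, values, 0.0)

def reduce_mean_axis0(rows):
    # Average over the first axis (samples): one result per column (class)
    n_rows = len(rows)
    return [reduce_sum(column) / n_rows for column in zip(*rows)]

def cross_entropy(target, predicted, eps=1e-12):
    # Same structure as loss_crsntrpy: mean over samples, then sum over classes
    products = [[t * log(p + eps) for t, p in zip(t_row, p_row)]
                for t_row, p_row in zip(target, predicted)]
    return -reduce_sum(reduce_mean_axis0(products))

# Two samples, three classes, one-hot targets (made-up numbers)
target = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
predicted = [[0.7, 0.2, 0.1],
             [0.1, 0.8, 0.1]]

loss = cross_entropy(target, predicted)
print(loss)
```

Only the probabilities assigned to the correct classes (0.7 and 0.8 here) contribute to the result, because every other term is multiplied by a zero in the one-hot target.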

Although we can build our simple neural network model using the `weights_random` variable and `loss_crsntrpy` alone, it is
simpler to use some of the TensorFlow machinery already there for us. One is the [tf.Module](https://www.tensorflow.org/api_docs/python/tf/Module)
object. It is very convenient
to define classes based on `tf.Module`, because we don't need to implement things like `tf.Variable` handling.
Our model will be a class derived from `tf.Module`, where we will define the weight matrix `w` and
the bias `b` as attributes. We make the class callable, so we can make predictions simply by calling `y_predicted = MyModel()(X_model)`.

In [None]:
class MyModel(tf.Module):
  def __init__(self, **kwargs):
    super().__init__(**kwargs)
    # Let's initialize these to zeros. We will see soon why this is not a good idea.
    self.w = tf.Variable(tf.zeros([1]))
    self.b = tf.Variable(0.0)

  def __call__(self, x):
    return tf.nn.softmax(x @ self.w + self.b)


We initialize the model.

In [None]:
model = MyModel()

Now we are going to define the train function. Note that we use this strange class called `GradientTape`: what is this and how
does it work? If you watched the Stanford University lectures, you know that the goal is to minimize the loss, and for that we
need to calculate the derivative of the loss with respect to our weights. How can we calculate derivatives with a computer?
There are three possibilities: [numerical differentiation](https://en.wikipedia.org/wiki/Numerical_differentiation),
[symbolic differentiation](https://en.wikipedia.org/wiki/Computer_algebra) and [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation).
Numerical differentiation is the easiest to understand and implement. The problem is that it is not accurate enough for deep
neural networks. Symbolic differentiation leads to very complicated expressions (and may not be able to handle certain
layers). Automatic differentiation sits somewhere in between and is the best way of calculating derivatives of neural networks
(thanks to its clever use of the chain rule).
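The difference between numerical and automatic differentiation can be seen in a few lines of plain Python. This is only a sketch under assumptions: the `Dual` class and the test function `f` are hypothetical and written for this illustration, and the sketch uses forward-mode automatic differentiation with dual numbers, whereas TensorFlow's `GradientTape` works in reverse mode. The underlying idea, applying the chain rule operation by operation, is the same.

```python
import math

class Dual:
    """A value together with its derivative with respect to the input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def sin(x):
    # Chain rule for the sine: (sin u)' = cos(u) * u'
    if isinstance(x, Dual):
        return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)
    return math.sin(x)

def f(x):
    return x * sin(x) + 3 * x

x0 = 1.5
exact = math.sin(x0) + x0 * math.cos(x0) + 3   # derivative worked out by hand
h = 1e-5
numerical = (f(x0 + h) - f(x0)) / h            # forward finite difference
automatic = f(Dual(x0, 1.0)).deriv             # seed with dx/dx = 1

print("numerical error:", abs(numerical - exact))
print("automatic error:", abs(automatic - exact))
```

The finite-difference result depends on the step `h` and carries a truncation error, while the automatic result is exact up to floating-point rounding.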

OK, but what is gradient tape? In order to calculate the derivative, TensorFlow should somehow keep track of the functions
that we are defining and that we want to differentiate, and of the variables with respect to which we want to differentiate (now you start to see why `tf.Variable` is useful, don't you?). `GradientTape` is the class that is used for that. You can read more
details in the [official documentation](https://www.tensorflow.org/api_docs/python/tf/GradientTape).
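Before using it on a model, it may help to see `GradientTape` on a function whose derivative we can check by hand. The following is a minimal sketch; the names `w`, `c` and `loss` are chosen only for this example.

```python
import tensorflow as tf

w = tf.Variable(3.0)   # trainable variables are watched by the tape automatically
c = tf.constant(2.0)   # a plain (immutable) tensor

with tf.GradientTape() as tape:
    # Operations executed inside the context are recorded on the tape
    loss = c * w * w   # loss = 2 * w**2, so d(loss)/dw = 4 * w

dloss_dw = tape.gradient(loss, w)
print(dloss_dw.numpy())

# Unlike a tensor, a variable can be updated in place:
# this is one gradient descent step with a learning rate of 0.1
w.assign_sub(0.1 * dloss_dw)
print(w.numpy())
```

At `w = 3` the derivative `4 * w` equals 12, so the update moves `w` from 3 to about 1.8.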

In [None]:
def train(model,x,y,learning_rate):

  with tf.GradientTape() as t:
    current_loss = loss_crsntrpy(y, model(x))

  dw, db = t.gradient(current_loss, [model.w, model.b])

  # assign_sub is a function of the variable class that does w = w - (learning_rate * dw) in an efficient way
  model.w.assign_sub(learning_rate * dw)
  model.b.assign_sub(learning_rate * db)


Let's train this simple model

In [None]:
for i_epochs in range(50):
  train(model,X_train,target_y,learning_rate=0.2)
  train_loss = loss_crsntrpy(target_y,model(X_train))
  # test_loss = loss_crsntrpy(y_test,model(X_test))
  # print("Training loss in epoch {0} = {1}.   Test loss = {2}".format(i_epochs,train_loss.numpy(),test_loss.numpy()))
  # Only the training loss is reported here, so the format string takes two arguments
  print("Training loss in epoch {0} = {1}".format(i_epochs,train_loss.numpy()))


Well, we got there! We have a cross-entropy loss of 0.37, which is not bad at all.

# Neural network with a hidden layer

Now let's include a hidden layer of 100 neurons.

In [None]:
class ModelHidden(tf.Module):
  def __init__(self, **kwargs):
    super().__init__(**kwargs)
    # Initialize the weights and the biases to zero
    # In practice, the weights should be randomly initialized
    self.w0 = tf.Variable(tf.zeros([784,100]))
    self.b0 = tf.Variable(0.0)

    self.w1 = tf.Variable(tf.zeros([100,10]))
    self.b1 = tf.Variable(0.0)

  def __call__(self, x0):
    x1 = tf.nn.sigmoid(x0 @ self.w0 + self.b0)
    return tf.nn.softmax(x1 @ self.w1 + self.b1)


We will need a new train function

In [None]:
def train_hidden(model,x,y,learning_rate):

  with tf.GradientTape() as t:
    current_loss = loss_crsntrpy(y, model(x))

  dw1, dw0, db1, db0 = t.gradient(current_loss, [model.w1, model.w0, model.b1, model.b0])

  model.w1.assign_sub(learning_rate * dw1)
  model.b1.assign_sub(learning_rate * db1)
  model.w0.assign_sub(learning_rate * dw0)
  model.b0.assign_sub(learning_rate * db0)


Let's train this model!

In [None]:
model_hidden = ModelHidden()

for i_epochs in range(50):
  train_hidden(model_hidden,X_train,y_train,learning_rate=0.2)
  train_loss = loss_crsntrpy(y_train,model_hidden(X_train))
  test_loss = loss_crsntrpy(y_test,model_hidden(X_test))
  print("Training loss in epoch {0} = {1}.   Test loss = {2}".format(i_epochs,train_loss.numpy(),test_loss.numpy()))

The neural network does not seem to be learning anything. This is because we initialized all the weights to zero. This way, the
neurons are heavily correlated and in practice, they behave as one neuron. One of the advantages of working with TensorFlow at
a low level is that we can easily have a look at the weight tensor. We can see that the weights of all neurons are almost the same.

In [None]:
print(model_hidden.w0)
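The symmetry argument above can be reproduced with a tiny network written in plain Python. This is only an illustrative sketch under assumptions: the three data points, the 2-3-1 architecture, the squared-error loss and the learning rate are all made up for the example and are not the model used in this notebook.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny made-up problem: 2 inputs, 3 hidden sigmoid neurons, 1 linear output
samples = [([0.5, -1.0], 1.0), ([1.5, 2.0], -1.0), ([-0.3, 0.8], 0.5)]
n_in, n_hidden, lr = 2, 3, 0.05

w0 = [[0.0] * n_hidden for _ in range(n_in)]   # input -> hidden, all zeros
w1 = [0.0] * n_hidden                          # hidden -> output, all zeros

for _ in range(100):
    grad_w0 = [[0.0] * n_hidden for _ in range(n_in)]
    grad_w1 = [0.0] * n_hidden
    for x, y in samples:
        h = [sigmoid(sum(x[i] * w0[i][j] for i in range(n_in)))
             for j in range(n_hidden)]
        error = sum(h[j] * w1[j] for j in range(n_hidden)) - y
        for j in range(n_hidden):
            # Gradients of the squared error, obtained with the chain rule
            grad_w1[j] += 2 * error * h[j]
            for i in range(n_in):
                grad_w0[i][j] += 2 * error * w1[j] * h[j] * (1 - h[j]) * x[i]
    # Plain gradient descent update
    for j in range(n_hidden):
        w1[j] -= lr * grad_w1[j]
        for i in range(n_in):
            w0[i][j] -= lr * grad_w0[i][j]

# Each row holds the weights from one input to the three hidden neurons
for row in w0:
    print(row)
```

Every hidden neuron starts from the same weights and receives the same gradient at every step, so the three columns of `w0` stay identical: the weights do move away from zero, but the three neurons remain copies of each other.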

We can get rid of this problem by initializing the weight matrix to random numbers.
(We will soon see that this may not be enough for more complex architectures.)

In [None]:
class BetterModelHidden(tf.Module):
  def __init__(self, **kwargs):
    super().__init__(**kwargs)
    self.w0 = tf.Variable(tf.random.uniform([784, 100]))
    self.b0 = tf.Variable(0.0)

    self.w1 = tf.Variable(tf.random.uniform([100, 10]))
    self.b1 = tf.Variable(0.0)

  def __call__(self, x0):
    x1 = tf.nn.sigmoid(x0 @ self.w0 + self.b0)
    return tf.nn.softmax(x1 @ self.w1 + self.b1)

model_hidden = BetterModelHidden()

for i_epochs in range(50):
  train_hidden(model_hidden, X_train, y_train, learning_rate=0.2)
  train_loss = loss_crsntrpy(y_train, model_hidden(X_train))
  test_loss = loss_crsntrpy(y_test, model_hidden(X_test))
  print("Training loss in epoch {0} = {1}.   Test loss = {2}".format(i_epochs, train_loss.numpy(), test_loss.numpy()))

Now it works better! Its loss seems to be larger than that of the neural network without a hidden layer, but this is just because the model has not converged yet (gradient descent is not the most efficient algorithm). If we train both neural networks with a better algorithm (or for more than 2000 epochs) we will see that the model with a hidden layer is indeed better.

# Deep neural networks and weight initialization strategies

We were able to get relatively good results with the two-layer neural network. But what if we want to implement a complex architecture
like [these](http://slazebni.cs.illinois.edu/spring17/lec01_cnn_architectures.pdf)?
Our code soon becomes too complex and error-prone. For that reason we are going to create a new `MyLayer` class.

In [None]:
class MyLayer(tf.Module):
  def __init__(self,input_size,output_size,is_last=False, **kwargs):
    super().__init__(**kwargs)
    self.w = tf.Variable(tf.random.uniform([input_size,output_size]))
    self.b = tf.Variable(0.0)
    self.is_last = is_last
    self.input_size = input_size
    self.output_size = output_size

  def __call__(self, x):
    if self.is_last:
      result = tf.nn.softmax(x @ self.w + self.b)
    else:
      result = tf.nn.sigmoid(x @ self.w + self.b)
    return result


class MyNeuralNetwork(tf.Module):
  def __init__(self,layers,**kwargs):
    super().__init__(**kwargs)

    # Check that the sizes of consecutive layers are consistent
    for previous_layer, layer in zip(layers, layers[1:]):
      if layer.input_size != previous_layer.output_size:
        print('Inconsistent layers')
    # The softmax is only applied by a layer created with is_last=True
    if not layers[-1].is_last:
      print('The last layer should have is_last = True!')
    self._layers = layers

  def __call__(self, x0):
    for i_layer in self._layers:
      x0 = i_layer(x0)
    return x0

def train_nn(neural_network,x,y,learning_rate):

  with tf.GradientTape(persistent=True) as t:
    current_loss = loss_crsntrpy(y, neural_network(x))

  gradiente = t.gradient(current_loss, neural_network.trainable_variables)

  for i_trainable_variable,i_gradient in zip(neural_network.trainable_variables,gradiente):
    i_trainable_variable.assign_sub(learning_rate*i_gradient)


Let's create a huge neural network and see what happens.

In [None]:
nn = MyNeuralNetwork([MyLayer(784,500),MyLayer(500,500),MyLayer(500,500),
                      MyLayer(500,500),MyLayer(500,500),MyLayer(500,500),
                      MyLayer(500,10,is_last=True)])

for i_epochs in range(50):
  train_nn(nn,X_train,y_train,learning_rate=0.2)
  train_loss = loss_crsntrpy(y_train,nn(X_train))
  test_loss = loss_crsntrpy(y_test,nn(X_test))
  print("Training loss in epoch {0} = {1}.   Test loss = {2}".format(i_epochs,train_loss.numpy(),test_loss.numpy()))

Mmmhhh... what happened here? The loss is definitely not improving! Why does using a larger neural network lead to worse
results? The answer is actually quite easy. Let's have a look at the gradients.

In [None]:
with tf.GradientTape(persistent=True) as t:
  current_loss = loss_crsntrpy(y_train, nn(X_train))

check_gradient = t.gradient(current_loss, nn.trainable_variables)
print(check_gradient[1])
# Add (the absolute value of) all elements
print(np.sum(np.abs(check_gradient[1].numpy())))

The gradient is zero everywhere ($\sum_{ij} |g_{ij}| = 0$)! What is going on here? In fact, this is a well-known problem:
the problem of vanishing gradients. It appears because the typical sigmoid activation
function tends to saturate (i.e. $\sigma(x) \approx \sigma(x + \Delta x)$ for $|x| > 3$). In other words, the
sigmoid barely changes with changes in $x$, therefore the derivatives (gradients) are close to zero. We
can see this problem here in full detail (an advantage of working with TensorFlow at a low level).
Let's plot the sigmoid and then see what happens in the linear part of the first layer (before applying the sigmoid function).

In [None]:
plt.plot(np.linspace(-10,10,200),tf.sigmoid(np.linspace(-10,10,200)))

In [None]:
print(X_train @ nn._layers[0].w + nn._layers[0].b)

And after the sigmoid

In [None]:
print(tf.nn.sigmoid(X_train @ nn._layers[0].w + nn._layers[0].b))

We got mostly either 0s or 1s. That means that our first layer is saturated (at either end of the sigmoid) for almost
every input. No wonder the gradients are zero.
What can we do to improve it? There are better ways to initialize weights than a random number between zero and one.
See this 30-min long TowardsDataScience [article](https://towardsdatascience.com/weight-initialization-in-deep-neural-networks-268a306540c0) (behind paywall) for a long and thorough discussion of different weight initialization techniques,
or just skip the book and go for the [movie](https://youtu.be/zUazLXZZA2U?t=3243).
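To put some numbers on the saturation argument, here is a small plain-Python sketch of the sigmoid derivative. The function names are chosen for this example, and the input values are arbitrary; pre-activations in the hundreds are an assumption about what can happen when many inputs are multiplied by weights of order 0.5.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # d(sigma)/dx = sigma(x) * (1 - sigma(x)), at most 0.25 (reached at x = 0)
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 3.0, 10.0, 200.0]:
    print(x, sigmoid(x), sigmoid_derivative(x))

# Backpropagation multiplies one such factor per layer (chain rule), so a few
# saturated layers shrink the gradient dramatically (weight factors ignored)
print(sigmoid_derivative(10.0) ** 6)
```

For a large enough input the sigmoid is rounded to exactly 1 in floating point, and its derivative becomes exactly zero, not just small.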

Let's create a new layer class using the so-called Xavier initialization scheme for the weights:

In [None]:
class MyBetterLayer(MyLayer):
  def __init__(self,input_size,output_size,is_last = False,**kwargs):
    super().__init__(input_size,output_size,is_last,**kwargs)
    self.is_last = is_last
    self.w = tf.Variable(tf.random.normal([input_size, output_size]) * tf.sqrt(2 / (input_size + output_size)), name='w')
    self.b = tf.Variable(0.0,name='b')
    
nn = MyNeuralNetwork([MyBetterLayer(784,500),MyBetterLayer(500,500),MyBetterLayer(500,500),
                      MyBetterLayer(500,500),MyBetterLayer(500,500),MyBetterLayer(500,500),
                      MyBetterLayer(500,10,is_last=True)])

for i_epochs in range(50):
  train_nn(nn,X_train,y_train,learning_rate=0.2)
  train_loss = loss_crsntrpy(y_train,nn(X_train))
  test_loss = loss_crsntrpy(y_test,nn(X_test))
  print("Training loss in epoch {0} = {1}.   Test loss = {2}".format(i_epochs,train_loss.numpy(),test_loss.numpy()))

Now the gradient is much better!

In [None]:
with tf.GradientTape(persistent=True) as t:
  current_loss = loss_crsntrpy(y_train, nn(X_train))

check_gradient = t.gradient(current_loss, nn.trainable_variables)
print(check_gradient[1])
print(np.sum(np.abs(check_gradient[1].numpy())))
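A rough way to see why the Xavier scaling helps is to compare the size of the linear part of a layer under both initializations. The sketch below uses only the Python standard library and rests on assumptions: a single made-up input sample with 784 features uniformly distributed in [0, 1), and only 200 of the 500 neurons, to keep it fast. The helper names are hypothetical.

```python
import math
import random

random.seed(0)
n_in, n_out = 784, 500

# Assumption: one input sample with features uniformly distributed in [0, 1)
x = [random.random() for _ in range(n_in)]

def pre_activations(draw_weight, n_neurons=200):
    # Linear part x @ w for a handful of neurons, with freshly drawn weights
    return [sum(xi * draw_weight() for xi in x) for _ in range(n_neurons)]

def mean_abs(values):
    return sum(abs(v) for v in values) / len(values)

xavier_std = math.sqrt(2 / (n_in + n_out))

# Weights drawn uniformly from [0, 1), as in MyLayer
uniform_z = pre_activations(lambda: random.random())
# Normal weights scaled by sqrt(2 / (n_in + n_out)), as in MyBetterLayer
xavier_z = pre_activations(lambda: random.gauss(0.0, xavier_std))

print("uniform init, mean |z|:", mean_abs(uniform_z))
print("Xavier init,  mean |z|:", mean_abs(xavier_z))
```

With the uniform weights every term of the sum is positive, so the pre-activations grow with the number of inputs and land deep in the saturated region of the sigmoid. With the Xavier scaling they stay of order one, where the sigmoid still has a usable slope.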

# Final remarks

With TensorFlow we can choose whether the tensors are processed on the GPU or on the CPU (by default, TensorFlow uses a GPU when one is available). Let's compare the training times, just to see the advantage of GPUs.

In [None]:
start_time = time()
nn = MyNeuralNetwork([MyBetterLayer(784,500),MyBetterLayer(500,500),MyBetterLayer(500,500),
                      MyBetterLayer(500,500),MyBetterLayer(500,500),MyBetterLayer(500,500),
                      MyBetterLayer(500,10,is_last=True)])

for i_epochs in range(50):
  train_nn(nn,X_train,y_train,learning_rate=0.2)
end_time = time()

print('Using a GPU we needed {0} seconds to train the network'.format(end_time-start_time))


with tf.device('CPU:0'):
  start_time = time()
  nn = MyNeuralNetwork([MyBetterLayer(784, 500), MyBetterLayer(500, 500), MyBetterLayer(500, 500),
                        MyBetterLayer(500, 500), MyBetterLayer(500, 500), MyBetterLayer(500, 500),
                        MyBetterLayer(500, 10, is_last=True)])

  for i_epochs in range(50):
    train_nn(nn, X_train, y_train, learning_rate=0.2)
  end_time = time()

print('Using a CPU we needed {0} seconds to train the network'.format(end_time-start_time))
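The first timing above only reflects GPU performance if TensorFlow can actually see a GPU. A minimal sketch to check this before comparing the numbers (the variable names are chosen for this example):

```python
import tensorflow as tf

# An empty list means that TensorFlow will run everything on the CPU
gpus = tf.config.list_physical_devices('GPU')
device = 'GPU:0' if gpus else 'CPU:0'
print('Found {0} GPU(s), running on {1}'.format(len(gpus), device))

with tf.device(device):
    a = tf.random.uniform([500, 500])
    b = tf.matmul(a, a)

# Every tensor records the device it was placed on
print(b.device)
```

If no GPU is found, both timings are CPU timings and should come out roughly the same.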


We can slightly change the neural network object to make it more Keras-like.

In [None]:

class MySequential(tf.Module):
  def __init__(self,**kwargs):
    super().__init__(**kwargs)
    self._layers = []

  def add(self,layer):
    self._layers.append(layer)

  def fit(self,x,y,learning_rate,n_epochs):
    for i_epoch in range(n_epochs):
      with tf.GradientTape(persistent=True) as t:
        current_loss = loss_crsntrpy(y, self.predict(x))
      gradiente = t.gradient(current_loss, self.trainable_variables)

      for i_trainable_variable, i_gradient in zip(self.trainable_variables, gradiente):
        i_trainable_variable.assign_sub(learning_rate * i_gradient)

      # Use the data passed to fit and this model (self); X_test and y_test come from the global scope
      train_loss = loss_crsntrpy(y, self.predict(x))
      test_loss = loss_crsntrpy(y_test, self.predict(X_test))
      print("Training loss in epoch {0} = {1}.   Test loss = {2}".format(i_epoch, train_loss.numpy(), test_loss.numpy()))

  def predict(self,x0):
    for i_layer in self._layers:
      x0 = i_layer(x0)
    return x0



In [None]:
nn = MySequential()

nn.add(MyBetterLayer(784,100))
nn.add(MyBetterLayer(100,100))
nn.add(MyBetterLayer(100,10,is_last=True))

nn.fit(X_train,y_train,0.2,100)

In [None]:
nn.predict(X_test)

And that is it! Now you can create your own layers! However, this simple high-level API that we have
built on top of TensorFlow lacks a lot of features that are already in Keras (for instance, better algorithms for loss minimization). Fortunately, the Keras API has its own [Layer](https://keras.io/api/layers/base_layer/) object that you
can easily [extend](https://www.tensorflow.org/tutorials/customization/custom_layers) to create your own layer. This way you can create layers not present in Keras, while taking advantage
of the other parts of the Keras ecosystem.
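As a rough idea of what such an extension looks like, here is a minimal sketch of a custom Keras layer. The class name `MyDense` is hypothetical; it is meant as a Keras counterpart of a non-final `MyBetterLayer`, with one difference chosen for this example: the bias is a vector with one entry per neuron instead of a single scalar.

```python
import tensorflow as tf

class MyDense(tf.keras.layers.Layer):
    """Hypothetical dense layer with a sigmoid activation."""
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        # Weights are created lazily, once the input size is known
        self.w = self.add_weight(name="w", shape=(input_shape[-1], self.units),
                                 initializer="glorot_normal", trainable=True)
        self.b = self.add_weight(name="b", shape=(self.units,),
                                 initializer="zeros", trainable=True)

    def call(self, inputs):
        return tf.nn.sigmoid(tf.matmul(inputs, self.w) + self.b)

layer = MyDense(100)
outputs = layer(tf.zeros([2, 784]))   # the first call triggers build()
print(outputs.shape)
```

A layer written this way can be combined with the built-in Keras layers, optimizers and training loops.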

Happy coding!



![](https://cdn.pixabay.com/photo/2014/10/22/23/14/geek-499140_960_720.jpg)