In [None]:
from typing import Tuple, List, Dict, Any
import numpy as np
import matplotlib.pyplot as plt
# sklearn dependencies
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hands-on Introduction to Deep Learning (Lecture 1)
The **prereqs** are kept (on purpose) to a minimum, i.e.:
- Basic knowledge of <font color="blue"><b>(partial) derivatives and the chain rule</b></font>.
- Ability to perform <font color="blue"><b>simple matrix operations</b></font> (dot product, multiplication, transpose).
- Knowledge of <a href="https://www.python.org"><b>Python</b></a> and <a href="https://www.numpy.org"><b>NumPy</b></a>.<br>
  CHPC provides courses on these topics. You can find them at:<br>
  + <a href="https://github.com/chpc-uofu/python-lectures"><b>Introduction to Python</b></a>
  + <a href="https://github.com/chpc-uofu/intro-numpy"><b>Introduction to NumPy & SciPy</b></a>

# 1.Artificial Intelligence (AI)
<font color="green"><b>Artificial intelligence (AI)</b></font> is the ability of machines to perform tasks that<br>normally <font color="green"><b>require human intelligence</b></font>.
Examples of such tasks include:
* recognizing hand-written digits, addresses, ...
* recognizing faces in order to grant people access to buildings
* writing essays
* translating from one language into another
* converting audio into transcripts
* driving vehicles autonomously
* $\ldots$

Traditionally AI has been performed using <font color="green"><b>different approaches</b></font>:
* <a href="https://pdfs.semanticscholar.org/4fb2/7b22f57442ef8dddfd5eec2e8ef17b901323.pdf">**Rules-Based AI (Expert systems)**</a>:
  - Domain experts extract rules and codify them.
  - Costly and time-consuming.
  - Rather rigid: incorporating new knowledge is hard.
* <a href="https://www.amazon.com/Logic-Based-Artificial-Intelligence-International-Engineering/dp/0792372247">**Logic based AI**</a>: The use of symbolic logic to perform reasoning
* <a href="https://epubs.siam.org/doi/book/10.1137/1.9781611977882">**Machine Learning (ML)**</a>:
  - A data-centric approach
  - Computer systems are able to "learn" from data without explicit instructions
  - Central underpinnings of ML: probability theory, statistics and mathematical optimization.

In the last decade, ML has become the research focus of AI, for the following reasons:
* The enormous explosion of data creation (Internet, audio, video, ...)<br>
  <font color="orangered"><b>Data has become the new currency</b></font>: ML requires huge amounts of data to train its models.
* The increase of parallel computational power:<br>
  - <font color="green"><b>Graphics Processing Units (GPUs)</b></font> were originally developed for gaming
  - General-purpose GPUs (GPGPUs) emerged for tasks beyond visualization such as Scientific Computing and ML.
     
## Types of Machine Learning (ML):
* <font color="green"><b>Supervised Learning:</b></font>
  - <font color="blue"><b>input data and (labeled) output data</b></font> are used for training
  - common techniques: regression, classification
* <font color="green"><b>Unsupervised Learning:</b></font>
  - only <font color="blue"><b>(unlabeled) input data</b></font> are used for training.
  - high-dimensional input data are often reduced in dimensionality
  - common techniques: clustering, PCA, ..
* <font color="green"><b>Reinforcement Learning:</b></font>
  - concept borrowed from behavioral psychology
  - positive actions are rewarded; negative actions are penalized.

## Current status  
- Currently, an ML model is trained for <font color="green"><b>one specific task</b></font>.
  * it may outperform a human specialist in one particular task (e.g. the interpretation of medical images).
  * but <font color="green"><b>NOT as versatile</b></font> as the human brain/intelligence.
- The versatile AI equivalent of the human mind is <font color="green"><b>AGI (Artificial General Intelligence)</b></font>.

# 2.The concept of a neuron/perceptron
A <font color="green"><b>neuron</b></font> or nerve cell:
- fundamental unit of the nervous system
- fires electrical signals across a <font color="green"><b>neural network</b></font>.
- contains a nucleus and mitochondria
- has additional structure:
  - <font color="green"><b>dendrites</b></font>: receive the incoming electric signal (<font color="green"><b>input</b></font>)
  - <font color="green"><b>axon</b></font>: transmits the electrical signal away from the nerve cell body (<font color="green"><b>output</b></font>)

<img src="neuron.jpg" alt="neuron" width="300">

Source: <a href="https://www.ninds.nih.gov/health-information/public-education/brain-basics/brain-basics-life-and-death-neuron"><b>Brain Basics: The Life and Death of a Neuron</b></a>

The concept of a physical neuron:
- gave rise to the concept of perceptron (<a href="https://bpb-us-e2.wpmucdn.com/websites.umass.edu/dist/a/27637/files/2016/03/rosenblatt-1957.pdf"><b>Rosenblatt, 1957</b></a>)
  
In essence, a <font color="green"><b>perceptron</b></font> is a **non-linear function** $\widehat{h}$ 
(<font color="green"><b>activation function</b></font>)<br>
which operates on the <font color="green"><b>input</b></font> and returns the <font color="green"><b>output</b></font>.

Or, a little more formally, $\textbf{y}=\widehat{h}(\textbf{x})$, where:
* $\widehat{h}$: non-linear/activation operator
* $\textbf{x} \in \mathbb{R}^{n_1 \times 1}$: input vector i.e. $\textbf{x}:=(x_1,x_2,\ldots, x_{n_1})^T$.
* $\textbf{y} \in \mathbb{R}^{n_2 \times 1}$: output vector i.e. $\textbf{y}:=(y_1,y_2,\ldots, y_{n_2})^T$.

In what follows we will perform 
<font color="green"><b>logistic regression</b></font> (<font color="orangered"><b>the simplest (shallow) neural net possible</b></font>) on a simple data set. <br>This simple toy model will allow us:
* to display the <b>(essential) features</b> of deep learning.
* to easily transition to the <b>general case</b> (see <a href="./lecture2.ipynb"><b>Lecture 2</b></a>)

# 3.Logistic Regression (LR) - a bird's-eye view
## Goals/Tasks
* To train a <font color="green"><b>binary classifier</b></font> based on a given data set
  <br>using <font color="red"><b>one neuron</b></font> in just <font color="red"><b>one layer</b></font>
  (i.e. a **shallow network**).
* To obtain the <font color="green"><b>accuracy</b></font> of the trained model using a test set.
* To <font color="green"><b>predict</b></font> the outcome for some (provided) data using the model.
* To get started with PyTorch/Keras.

<img src="./perceptron.jpeg" width="400">

<font><i>Logistic regression as a (shallow) neural network:</i></font>
* <font color="green"><b>Input</b></font> ($l=0$): vector $x_i$ with $5$ <font color="green"><b>features</b></font>.
* <font color="red"><b>First layer</b></font> ($l=1$) with <font color="red"><b>one unit/node</b></font>:
  + creation of $z_i:=\displaystyle \sum_{j=1}^5 x_{ij} w_j + b$  (i.e. `affine transformation`)
  + $a_i := \widehat{h}^{[1]}(z_i)$ (i.e. `non linear activation`)
* Output ($a_i$):
  + if `training`: to be used for loss function
  + elif `test`/`inference`: to be used to determine the label

## 3.1.Initialization of the parameters

Before starting the optimization process for the <font color="green"><b>parameters</b></font> (weight vector and bias),<br>
those parameters need to be <font color="green"><b>initialized</b></font>.


## 3.2.Training of the binary classifier
The <font color="green"><b>training</b></font> of a binary classifier (and, <font color="orangered"><b>by extension, of any deep neural net</b></font>) 
<br>consists of an **iterative loop** of the following $3$ tasks:

1. <font color="green"><b>Forward propagation</b></font>:<br>
   Given a training set, and a set of <font color="green"><b>parameters</b></font> 
   we calculate the associated cost function,<br>
   which is a measure of how much the predicted values differ from the true labels.

2. <font color="green"><b>Back propagation</b></font>:<br>
   Based on the cost function we calculate the <font color="green"><b>gradients of the cost function with respect to the parameters</b></font>.
 
3. <font color="green"><b>Update of the parameters</b></font>:<br>
   The parameters are <font color="green"><b>updated</b></font> using <font color="green"><b>gradient descent</b></font>.

The <font color="green"><b>training set</b></font> consists of $m_{\mathrm{train}}$ data points $(\mathbf{x}_i,y_i)$, $i \in \{1,\ldots,m_{\mathrm{train}}\}$<br>
  where:
  - $\mathbf{x_i}$ is a column vector of length $n$, i.e. $\mathbf{x_i} \in \mathbb{R}^{n \times 1}$.<br>
    Each dimension of the vector $\mathbf{x_i}$ represents a <font color="green"><b>feature</b></font>.
  - $y_i$ is either 0 ($\texttt{False}$) or 1 ($\texttt{True}$), i.e. $y_i \in \{0,1\}$.
    

## 3.3.Testing of the binary classifier
In the <font color="green"><b>testing phase</b></font> the accuracy of the model will be determined.

In the next sections, we will describe the details of each of these steps.

# 4.LR: Initialization

During the <font color="green"><b>initialization</b></font> all parameters (weights and bias) are set to zero.
- <font color="green"><b>weight vector</b></font>: $  w \in \mathbb{R}^{n \times 1}$ 
- <font color="green"><b>bias</b></font>: $b \in \mathbb{R}$

#### **Exercise 1**: Initialize the parameters

`def init_param(n: int) -> Tuple[np.ndarray, float]:`<br>
  * The <font color="green"><b>weight</b></font> vector is a **column vector** of length $n$ (#features).
  * The <font color="green"><b>bias</b></font> is a **scalar** of type float.
  * All elements of the weight vector and bias can be initialized to $0.0$.

In [None]:
# Exercise 1:
def init_param(n: int) -> Tuple[np.ndarray, float]:
    """
    Initialize the parameters (weight, bias) for the Binary Classifier.
    
    Args:
        n (int): The number of features/dim. of the input vector
    
    Returns:
        Tuple[np.ndarray, float]: A tuple containing 
           - initialized weight: shape (n,1)
           - initialized bias (scalar).
    """
    # Here comes YOUR code to initialize the weight vector & bias.
    # W =  <--- YOUR CODE: Weight (vector) to zero
    # b =  <--- YOUR CODE: bias (float) to zero
    return W,b

##### <font color="blue"><b>Simple check of the function:</b></font>

In [None]:
# %load tests/lec1/test_ex1.py

In [None]:
# %load solutions/lec1/sol_ex1.py

# 5.LR: Forward Propagation
* At the perceptron/node, each instance $i$ ($i \in \{1,\ldots,m_{\mathrm{train}}\}$) will be<br>
  subjected to the following $2$ <font color="green"><b>transformations</b></font>:
  1. `affine` transformation:<br>
       $\begin{eqnarray}
         z_i & = &   \mathbf{x_i^T}.\mathbf{w} +b  \nonumber \\
            &=  &  \displaystyle{\sum_{j=1}^n x_{ij} w_j +b} \nonumber
       \end{eqnarray}$
    
     where:<br>
     - $ \mathbf{x_i}$: <font color="green"><b>input vector</b></font> for example $i$ ($\in \mathbb{R}^{n\times 1}$)
  
     - $ \mathbf{w}$ : <font color="green"><b>weight</b></font> vector ($\in \mathbb{R}^{n \times 1}$)<br>
       Note: the weight vector has <font color="orangered"><b>as many components as there are features</b></font>.
    
     - $ b$ : <font color="green"><b>bias</b></font> ($\in \mathbb{R}$).

     - Each data instance uses the <font color="orangered"><b>same weight vector and bias</b></font>.   
  1. `non linear activation`:<br>
     $a_i =  \sigma(z_i)$ , $a_i \in \mathbb{R}$<br>
     
     - $\sigma$ is known as the `sigmoid` function:<br>
   
       * $\begin{equation}
         \sigma(z) = \displaystyle \frac{1}{1+e^{-z}}  \nonumber \;\;,\;\; 0 < \sigma(z) < 1
         \end{equation}$
    
     - The activation of the <b>last layer</b> (in this case we only have one layer) is the same<br>
       as the predicted value ($\widehat{y_i})$. Thus, <br>
    
       $\begin{equation}
       \widehat{y_i} := a_i \nonumber
       \end{equation}$
   
* Calculate the <font color="green"><b>cost function</b></font> ($\mathcal{C}$).<br>
  The cost function $\mathcal{C}$ is defined as the mean of the <font color="green"><b>loss functions</b></font> ($\mathcal{L}^{(i)}$) over the $m_{\mathrm{train}}$ data points:

  $\begin{eqnarray}
       \mathcal{C}       & := & \displaystyle \frac{1}{m_{\mathrm{train}}} \sum_{i=1}^{m_{\mathrm{train}}}        \mathcal{L}^{(i)} \nonumber
  \end{eqnarray}$

  In case of <font color="green"><b>binary classification</b></font>, the **loss function** $\mathcal{L}^{(i)}$ for data point $i$ is given by:
  
  $\begin{eqnarray}
       \mathcal{L}^{(i)} & = & - \bigg [ y_i \log(\widehat{y_i}) + (1-y_i)\log(1-\widehat{y_i}) \bigg ] \nonumber\\
                         & = & - \bigg [ y_i \log(a_i) + (1-y_i)\log(1-a_i) \bigg ]  \nonumber
  \end{eqnarray}$

* <font color="red"><b>Note:</b></font><br>
  - The loss function in logistic regression is <font color="green"><b>convex</b></font>.<br>Loss functions in deep learning are
    generally <font color="red"><b>NOT</b></font> convex.
  - <a href="./convexity.ipynb"><b>Convexity</b></a> has some nice properties: e.g. every local minimum is also a **GLOBAL** minimum.
    

* <font color="green"><b>Vectorization:</b></font><br>
  With NumPy, vectorization speeds up the code significantly.

  Given that:
  + $\mathbf{x_i}$ is a column vector, i.e. $\mathbf{x_i} \in \mathbb{R}^{n \times 1}$.
  + $y_i \in \{0,1\}$.
  + The $\mathrm{m}$ column vectors $\mathbf{x_i}$ can be stacked (as rows $\mathbf{x^T_i}$) in a matrix $X \in \mathbb{R}^{m \times n}$,<br>
    given by:

    $\begin{eqnarray}
         X & := & \begin{pmatrix}  \mathbf{x^T_1}\\
                                   \mathbf{x^T_2} \\
                                   \vdots \\
                                   \mathbf{x^T_{m}}
                 \end{pmatrix}\nonumber
    \end{eqnarray}$

      
  + The $\mathrm{m}$ $y_i$ values can be collected in a column vector $Y \in \mathbb{R}^{m \times 1}$,<br>
    given by:

    $\begin{eqnarray}
       Y & := & \begin{pmatrix} y_1 & y_2 & \cdots & y_{\mathrm{m}-1} & y_{\mathrm{m}} 
                 \end{pmatrix}^T \nonumber
      \end{eqnarray}$

  we get:
  - $\begin{eqnarray}
         \mathbf{Z} & = &   \mathbf{X}.\mathbf{w} + b\mathbf{1}  \nonumber 
       \end{eqnarray}$<br>
    where $\mathbf{1} \in \mathbb{R}^{m \times 1}$ is the vector of ones.

  - $\begin{eqnarray}
          \mathbf{A} & = & \sigma(\mathbf{Z}) \nonumber
    \end{eqnarray}$
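
As a minimal sketch of what vectorization means here, the snippet below computes $\mathbf{Z}$ once with an explicit loop over the samples and once with a single matrix product, and checks that both agree. The toy arrays (`X_toy`, `w_toy`, `b_toy`) and their sizes are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3                      # toy sizes (assumed): 4 samples, 3 features
X_toy = rng.normal(size=(m, n))  # each row is a transposed input vector x_i^T
w_toy = rng.normal(size=(n, 1))  # weight (column) vector
b_toy = 0.5                      # scalar bias

# Loop version: z_i = x_i^T . w + b, one sample at a time
Z_loop = np.zeros((m, 1))
for i in range(m):
    Z_loop[i, 0] = X_toy[i, :] @ w_toy[:, 0] + b_toy

# Vectorized version: NumPy broadcasts the scalar b over all m rows,
# which plays the role of b * 1 in the formula
Z_vec = X_toy @ w_toy + b_toy

# Element-wise sigmoid applied to the whole column at once
A_vec = 1.0 / (1.0 + np.exp(-Z_vec))

print(np.allclose(Z_loop, Z_vec))
print(A_vec.shape)
```

Because of broadcasting, the vector $b\mathbf{1}$ never has to be built explicitly in NumPy.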
 

<font color="red"><b>Note: (for sake of completeness)</b></font><br>

The `activation` function $\sigma(z)$ has the following properties:

$
   \begin{eqnarray}
   \lim_{ z \to -\infty} \sigma(z) & =&0 \nonumber \\
   \lim_{ z \to +\infty} \sigma(z) & = & 1 \nonumber \\
   \sigma(0) & = &\frac{1}{2} \nonumber \\
  \displaystyle \frac{d \sigma(z)}{dz} & = & \sigma(z)(1-\sigma(z)) \nonumber
  \end{eqnarray}$

The sigmoid takes values in the open interval $(0,1)$ and can thus be interpreted as a <font color="green"><b>probability</b></font>.
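
The properties listed above can be checked numerically. The following sketch compares the closed-form derivative $\sigma(z)(1-\sigma(z))$ with a central finite difference; the helper name `sigmoid_demo`, the grid and the step size `h` are assumptions chosen for this illustration.

```python
import numpy as np

def sigmoid_demo(z):
    """Logistic sigmoid (helper name chosen for this illustration)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
h = 1.0e-5

# Central finite difference approximates d(sigma)/dz
numeric = (sigmoid_demo(z + h) - sigmoid_demo(z - h)) / (2.0 * h)
# Closed form from the list of properties
analytic = sigmoid_demo(z) * (1.0 - sigmoid_demo(z))

print(sigmoid_demo(0.0))                         # value at z = 0
print(np.max(np.abs(numeric - analytic)))        # should be tiny
print(sigmoid_demo(-30.0), sigmoid_demo(30.0))   # close to the two limits
```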

In [None]:
x = np.linspace(-7.5, 7.5, 1501)
y = 1.0/(1.0+np.exp(-x))
plt.title(r"Sigmoid function and decision boundary")
plt.xlabel(r"$x$")
plt.ylabel(r"$\sigma(x)$",rotation=0)
plt.plot(x,y, label =r"$\sigma(x)$" )
plt.axvline(x=0,color='r',ymin=0.0, ymax=1.0, label="Decision boundary")
plt.plot(0.0,0.5,marker='o',color='orange')
plt.legend()
plt.grid()
plt.show()

#### **Exercise 2**: Implementation of the sigmoid activation function
Apply $\sigma$ to $\mathbf{Z}$ element-wise.

In [None]:
# Exercise 2:
def sigmoid(Z: np.ndarray) -> np.ndarray:
    """
    Compute the sigmoid function.
    
    Args:
        Z (np.ndarray): The input value(s).
    
    Returns:
        np.ndarray: The sigmoid of the input value(s).
    """
    # Here comes YOUR code for the sigmoid function.
    # return <-- YOUR CODE 

##### <font color="blue"><b>Simple check of the function:</b></font>

In [None]:
# %load tests/lec1/test_ex2.py

In [None]:
# %load solutions/lec1/sol_ex2.py

#### **Exercise 3** : Implementation of the forward propagation
*  $ \mathbf{Z} = \mathbf{X} \mathbf{w} + \mathbf{b} $
*  $ \mathbf{A} = \sigma(\mathbf{Z})$  ($\texttt{Ex. 2}$)
*  $ \mathcal{C} = - \frac{1}{m} \displaystyle \sum_{i=1}^m \bigg [ y_i \log(a_i) + (1-y_i)\log(1-a_i) \bigg ]$ <br>
The latter equation can (and should) be vectorized in NumPy using $\mathbf{A}$ and $\mathbf{Y}$.

In [None]:
# Exercise 3:
def forward(X: np.ndarray, Y: np.ndarray,
            W: np.ndarray, b: float) -> Tuple[np.ndarray, float]:
    """
    Perform the forward pass of the binary classifier.
    
    Args:
        X (np.ndarray): The training data -> shape(m, n) 
                        where m is #samples & n is #features.
        Y (np.ndarray): The training labels (targets) -> shape(m,1) 
                        where m is #samples.
        W (np.ndarray): The weight vector             -> shape(n,1)
        b (float)     : The bias term                 -> float
    
    Returns:
        Tuple[np.ndarray, float]: 
          A tuple containing the activation matrix and the cost.
    """
    # Here comes YOUR code for the forward propagation.
    num_samples = X.shape[0]
    # Z = <--- YOUR CODE
    # A = <--- YOUR CODE  
    # cost = <--- YOUR CODE
    return A, cost

##### <font color="blue"><b>Simple check of the function:</b></font>

In [None]:
# %load tests/lec1/test_ex3.py

In [None]:
# %load solutions/lec1/sol_ex3.py

# 6.LR: Back Propagation

In the <font color="green"><b>back propagation</b></font> we calculate the **gradient of the cost function** w.r.t the weights and the bias:

+ $\begin{eqnarray}
      \frac{\partial \mathcal{L}^{(i)}}{\partial a_i} & = & \frac{\partial}{\partial a_i} \bigg [ - \big [ y_i \log(a_i) + (1-y_i)\log(1-a_i) \big ] \bigg ] \nonumber \\
                                                      & = & -\frac{y_i}{a_i} + \frac{(1-y_i)}{(1-a_i)} \nonumber
    \end{eqnarray}$
 
    
+ $\begin{eqnarray}
        \frac{\partial \mathcal{L}^{(i)}}{\partial z_i} & = & \frac{\partial \mathcal{L}^{(i)}}{\partial a_i} \frac{\partial a_i}{\partial z_i} \nonumber \\
         & = & \frac{\partial \mathcal{L}^{(i)}}{\partial a_i} \frac{\partial \sigma(z_i)}{\partial z_i}  \nonumber \\
         & = & \bigg [ -\frac{y_i}{a_i} + \frac{(1-y_i)}{(1-a_i)} \bigg ] a_i (1-a_i) \nonumber \\
         & = & a_i - y_i \nonumber
    \end{eqnarray}$
 
    
+ $\begin{eqnarray}
       \frac{\partial \mathcal{L}^{(i)}}{\partial b} & = & \frac{\partial \mathcal{L}^{(i)}}{\partial a_i} \frac{\partial a_i}{\partial z_i}\frac{\partial z_i}{\partial b} \nonumber \\
                 & = & a_i - y_i \nonumber
    \end{eqnarray}$

  
+ $\begin{eqnarray}
      \frac{\partial \mathcal{L}^{(i)}}{\partial w_j} & = & \frac{\partial \mathcal{L}^{(i)}}{\partial a_i} \frac{\partial a_i}{\partial z_i}\frac{\partial z_i}{\partial w_j} \nonumber \\
                 & = & (a_i - y_i) x_{ij} \nonumber
    \end{eqnarray}$
    
Thus, <br>

+ $\begin{eqnarray}
     \frac{\partial\mathcal{C}}{\partial b}  & = & \frac{1}{m_{\mathrm{train}}}\displaystyle \sum_{i=1}^{m_{\mathrm{train}}} \frac{\partial \mathcal{L}^{(i)}}{\partial b} \nonumber \\
                                             & = & \frac{1}{m_{\mathrm{train}}} \displaystyle \sum_{i=1}^{m_{\mathrm{train}}} (a_i - y_i) \nonumber
     \end{eqnarray}$

+ $\begin{eqnarray}
     \frac{\partial\mathcal{C}}{\partial w_j}  & = & \frac{1}{m_{\mathrm{train}}}\displaystyle \sum_{i=1}^{m_{\mathrm{train}}} \frac{\partial \mathcal{L}^{(i)}}{\partial w_j} \nonumber \\
      & = & \frac{1}{m_{\mathrm{train}}} \displaystyle \sum_{i=1}^{m_{\mathrm{train}}} (a_i - y_i) x_{ij} \;\;,\;\;\forall \, j \in \{1,\ldots,n\} \nonumber
     \end{eqnarray}$
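
A common way to gain confidence in derived gradients is a finite-difference check. The sketch below evaluates the analytic expressions for $\partial\mathcal{C}/\partial w_j$ and $\partial\mathcal{C}/\partial b$ on random toy data and compares them with central differences of the cost. The helper `cost_and_grads`, the `_chk` arrays and the step `eps` are hypothetical names and values introduced only for this illustration.

```python
import numpy as np

def cost_and_grads(X, Y, w, b):
    """Cost C and analytic gradients (dC/dw, dC/db) for logistic regression."""
    m = X.shape[0]
    A = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    cost = -np.mean(Y * np.log(A) + (1.0 - Y) * np.log(1.0 - A))
    dZ = A - Y                       # dL/dz_i = a_i - y_i
    dw = (X.T @ dZ) / m              # (1/m) * sum_i (a_i - y_i) x_ij
    db = float(np.sum(dZ) / m)       # (1/m) * sum_i (a_i - y_i)
    return float(cost), dw, db

rng = np.random.default_rng(1)
X_chk = rng.normal(size=(6, 2))                        # toy inputs (assumed)
Y_chk = rng.integers(0, 2, size=(6, 1)).astype(float)  # toy 0/1 labels
w_chk = rng.normal(size=(2, 1))
b_chk = 0.3
eps = 1.0e-6

_, dw_an, db_an = cost_and_grads(X_chk, Y_chk, w_chk, b_chk)

# Central differences, one weight component at a time
dw_num = np.zeros_like(w_chk)
for j in range(w_chk.shape[0]):
    step = np.zeros_like(w_chk)
    step[j, 0] = eps
    c_plus, _, _ = cost_and_grads(X_chk, Y_chk, w_chk + step, b_chk)
    c_minus, _, _ = cost_and_grads(X_chk, Y_chk, w_chk - step, b_chk)
    dw_num[j, 0] = (c_plus - c_minus) / (2.0 * eps)

# Same check for the bias
c_plus, _, _ = cost_and_grads(X_chk, Y_chk, w_chk, b_chk + eps)
c_minus, _, _ = cost_and_grads(X_chk, Y_chk, w_chk, b_chk - eps)
db_num = (c_plus - c_minus) / (2.0 * eps)

print(np.max(np.abs(dw_an - dw_num)), abs(db_an - db_num))
```

If the derivation is right, both printed differences should be many orders of magnitude smaller than the gradients themselves.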

#### **Exercise 4**: Back propagation

These are the steps:  
* $\mathbf{dZ} = \mathbf{A} - \mathbf{Y}$
* $\mathbf{dW} = \frac{1}{m} \mathbf{X}^T.\mathbf{dZ}$
* $db = \frac{1}{m} \displaystyle \sum_{i=1}^m \mathbf{dZ}_i$
* return ($\mathbf{dW},db$)

In [None]:
# Exercise 4:
def calcgrad(X: np.ndarray, Y: np.ndarray, 
             A:np.ndarray) ->  Tuple[np.ndarray, float]:
    """        
    Computes the gradients of the cost function with respect to W and b.
    Args:
        X (np.ndarray): Training data     -> shape(m,n)
        Y (np.ndarray): Training labels   -> shape(m,1)
        A (np.ndarray): Activation matrix -> shape(m,1) 
    Returns:
        Tuple[np.ndarray, float]: The gradients of the cost w.r.t. W and b.
    """ 
    # Here comes the calcgrad code
    num_samples = X.shape[0]
    # dZ = <--- YOUR CODE
    # dW = <--- YOUR CODE
    # db = <--- YOUR CODE
    return dW,db

##### <font color="blue"><b>Simple check of the function:</b></font>

In [None]:
# %load tests/lec1/test_ex4.py
# Test Ex.4 : Checking the calcgrad function

In [None]:
# %load solutions/lec1/sol_ex4.py

# 7.LR: Update the parameters (with gradient descent)

In this step, the parameters will be updated using <a href="https://en.wikipedia.org/wiki/Gradient_descent"><b>gradient descent</b></a>: 

  $\begin{eqnarray}
      b & = & b - \alpha  \frac{\partial\mathcal{C}}{\partial b} \nonumber \\
      w_j & = & w_j - \alpha  \frac{\partial\mathcal{C}}{\partial w_j} \;\;,\;\;\forall \, j \in \{1,\ldots,n\} \nonumber
    \end{eqnarray}$

  where $\alpha$ is known as the <font color="green"><b>learning rate </b></font> or the <font color="green"><b>step size</b></font>.  

#### **Exercise 5**: Update of the parameters

In [None]:
def update(Weights: np.ndarray, bias: float,
           dWeight: np.ndarray, dbias:float,
           lr:float) -> Tuple[np.ndarray, float]:
    """
    Update the parameters using the gradients and learning rate.    

    Args:
        Weights (np.ndarray): The weight vector              -> shape(n, 1).
        bias (float)        : The bias term                  -> float
        dWeight (np.ndarray): The grad. of the cost w.r.t. W -> shape(n, 1).
        dbias (float)       : The grad. of the cost w.r.t. b -> float
        lr (float)          : The learning rate.    

    Returns:
        Tuple[np.ndarray, float]: 
          A tuple containing the updated weight and bias.
    """
    # Here comes the code to update the weight vector and the bias.
    # Weights = <--- YOUR CODE
    # bias =    <--- YOUR CODE
    return Weights, bias

##### <font color="blue"><b>Simple check of the function:</b></font>

In [None]:
# %load tests/lec1/test_ex5.py

In [None]:
# %load solutions/lec1/sol_ex5.py

# 8.LR: Training: All components combined

During **training**, the following steps are to be performed:

* initialize $\mathbf{W},b$ to 0.0  
* perform the loop over all epochs
  - Calculate $\mathbf{A}, \mathcal{C}$ (cost) using the forward function
  - Calculate the gradients $\mathbf{dW}, db$
  - Update $\mathbf{W}, b$ using the gradients $\mathbf{dW}, db$
* return the list of costs (one per epoch), $\mathbf{W}$ and $b$.

#### **Exercise 6**: Training (complete)

In [None]:
# Exercise 6
def train_model(X: np.ndarray, Y: np.ndarray,
                num_epochs: int, lr: float) -> Tuple[List[float],np.ndarray, float]:
    """
    Train the binary classifier using gradient descent.
    
    Args:
        X (np.ndarray): The training data (features)  -> shape(m,n) 
                        where m is #samples & n is #features.
        Y (np.ndarray): The training labels (targets) -> shape(m,1)
                        where m is #samples.
        num_epochs (int): The number of epochs to train.
        lr (float)    : The learning rate.    
    
    Returns:
        Tuple[List[float], np.ndarray, float]: 
          A tuple containing the list of costs (one per epoch) and
          the final weight vector and bias after training.
    """
    # Here comes YOUR code to train the model.
    lstCost = []
    # W,b =         <--- YOUR CODE : Initialize to 0.0 using previous function
    for i in range(num_epochs):
        # A, cost = <--- YOUR CODE : Use the forward function
        lstCost.append(cost)
        # dW, db =  <--- YOUR CODE : Calc. the gradient
        # W, b   =  <--- YOUR CODE : Perform the update
    return lstCost, W, b    

In [None]:
# %load solutions/lec1/sol_ex6.py

# 9.LR: Testing of the binary classifier 
Once our neural net has been trained, the <font color="green"><b>optimal values</b></font><br> 
for the parameters $\mathbf{w}$ and $b$ (i.e. $\mathbf{\widehat{w}}$ and $\widehat{b}$) are known. <br> 
We are now ready to <font color="green"><b>test</b></font> our neural net model.

* Apply the <font color="green"><b> predict </b></font> function to the **test data set**.<br>
  The **predict** function has $2$ components:
  + Apply the <font color="green"><b>forward propagation</b></font> to the test data set (use $\mathbf{\widehat{w}}$ and $\widehat{b}$).
  + <font color="green"><b>Map</b></font> the obtained activations to either $0$ or $1$
* Calculate the <font color="green"><b>accuracy</b></font> i.e. the ratio of the number of correct predictions over total predictions. 

## 9.1.Prediction function
* The <font color="green"><b>test set</b></font> consists of $m_{\mathrm{test}}$ test data points:<br> $(\mathbf{x}_i,y_i)$, $i \in \{1,\ldots,m_{\mathrm{test}}\}$<br>
  where:
  - $\mathbf{x_i}$ is a column vector of length $n$, i.e. $\mathbf{x_i} \in \mathbb{R}^{n \times 1}$.
  - $y_i$ is either 0 ($\texttt{False}$) or 1 ($\texttt{True}$), i.e. $y_i \in \{0,1\}$.
  - The $m_{\mathrm{test}}$ $\mathbf{x_i}$ column vectors can be collected in the matrix $X \in \mathbb{R}^{ \mathrm{m_{test}}\times n}$,<br>
    given by:

    $\begin{eqnarray}
         X & := & \begin{pmatrix}  \mathbf{x^T_1}\\
                                   \mathbf{x^T_2} \\
                                   \vdots \\
                                   \mathbf{x^T_{\mathrm{m_{test}}}}
                 \end{pmatrix} \nonumber
    \end{eqnarray}$

  - The $m_{\mathrm{test}}$ $y_i$ values can be collected in the column vector $Y$,<br>
    given by:

    $\begin{eqnarray}
         Y & := & \begin{pmatrix} y_1 & y_2 & \cdots & y_{\mathrm{m_{test}-1}} & y_{\mathrm{m_{test}}} 
                 \end{pmatrix}^T \nonumber
     \end{eqnarray}$

* Apply <font color="green"><b>forward propagation</b></font> in matrix form (for efficiency):
  - $\begin{eqnarray}
        \mathbf{z} & = & \mathbf{X.\widehat{w}} + \mathbf{\widehat{b}} \nonumber
     \end{eqnarray}$<br>
     where:
     + $\mathbf{X} \in \mathbb{R}^{m_{\mathrm{test}} \times n}$
     + $\mathbf{\widehat{w}}$ is a column vector with the <font color="green"><b>optimized weights</b></font> ($ \in \mathbb{R}^{n \times 1}$)  
     + $\mathbf{\widehat{b}}$ is a column vector with every element equal to the <font color="green"><b>optimized bias</b>
     </font> $\widehat{b}$, i.e. $\mathbf{\widehat{b}} = \widehat{b}\,\mathbf{1} \in \mathbb{R}^{m_{\mathrm{test}} \times 1}$
     + $\mathbf{z} \in \mathbb{R}^{m_{\mathrm{test}} \times 1}$
  - $\begin{eqnarray}
        \mathbf{a} & = & \sigma(\mathbf{{z}}) \nonumber 
     \end{eqnarray} \;,\;\mathbf{a} \in \mathbb{R}^{m_{\mathrm{test}} \times 1}$    
* The elements that are calculated ($\widehat{\mathbf{y}}:=\mathbf{a}$) lie in the interval $(0,1)$.<br>
  - In order to compare them with the <font color="green"><b>test labels</b></font>, we must <font color="green"><b>map/discretize</b></font> the elements of $\mathbf{a}$ into $\{0,1\}$.
  - $\widetilde{y_i} = F(\widehat{\mathbf{y_i}})$ where
    $F(\widehat{\mathbf{y_i}})$ is defined in the following way:<br>
     $\begin{equation}
      \widetilde{y_i}= 
\begin{cases}
    1,& \text{if } a_i \geq 0.5\\
    0,              & \text{otherwise}
\end{cases} \nonumber
     \end{equation}$    

#### **Exercise 7**: predict labels

Steps:
* Calculate $\mathbf{A}$ using $\mathbf{W}$ and $b$ (obtained from training)
* Map all the elements of A to either $0$ or $1$.

In [None]:
# Exercise 7:
def predict_labels(X: np.ndarray, W: np.ndarray, b: float) -> np.ndarray:
    """
    Make predictions using the trained model.

    Args:
        X (np.ndarray): The input data (features)  -> shape(m,n) 
                        where m is #samples & n is #features.
        W (np.ndarray): The weight vector          -> shape(n,1).
        b (float)     : The bias term              -> float
    
    Returns:
        np.ndarray: The predicted labels (0 or 1).
    """
    # Here comes the code to predict the labels (either 0 or 1)
    # A = <--- YOUR CODE 
    return np.where(A >= 0.5, 1, 0)

##### <font color="blue"><b>Simple check of the function:</b></font>

In [None]:
# %load tests/lec1/test_ex7.py

In [None]:
# %load solutions/lec1/sol_ex7.py

## 9.2.Accuracy

<font color="green"><b>Accuracy</b></font> is defined as the <font color="green"><b>ratio</b></font> of:
- the number of correct classifications to 
- the total number of classifications.
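
A small worked example, with made-up activations and labels (the arrays `a_demo` and `y_demo` are assumptions for illustration only): thresholding at $0.5$ gives the predicted labels, and 3 of the 4 predictions match the true labels, i.e. a ratio of $0.75$.

```python
import numpy as np

a_demo = np.array([[0.91], [0.20], [0.45], [0.73]])  # made-up activations
y_demo = np.array([[1], [0], [1], [1]])              # made-up true labels

# Discretize: 1 if a_i >= 0.5, else 0
y_hat = np.where(a_demo >= 0.5, 1, 0)

# Count the matches and divide by the number of predictions
n_correct = int(np.sum(y_hat == y_demo))
ratio = n_correct / y_demo.shape[0]
print(y_hat.ravel(), n_correct, ratio)
```

The third sample has true label $1$ but an activation below $0.5$, so it is the single misclassification.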



#### **Exercise 8**: Accuracy

Step:
* return ratio (number matches/total number) * 100.

In [None]:
# Exercise 8:
def accuracy(Y_true: np.ndarray, Y_pred: np.ndarray) -> float:
    """
    Calculate the accuracy of the predictions.
    
    Args:
        Y_true (np.ndarray): The true labels      -> shape(m,1)
        Y_pred (np.ndarray): The predicted labels -> shape(m,1)
    
    Returns:
        float: The accuracy as a percentage.
    """
    # return <--- YOUR CODE: Here comes the ratio * 100

##### <font color="blue"><b>Simple check of the function:</b></font>

In [None]:
# %load tests/lec1/test_ex8.py

In [None]:
# %load solutions/lec1/sol_ex8.py

# 10.In praxi: Train a binary classifier

## Generation of a synthetic data set
* The Python library <a href="https://scikit-learn.org/stable/"><b>scikit-learn</b></a> (based on NumPy & SciPy) is used to generate a <font color="green"><b>synthetic data set</b></font>.
* In order to facilitate the visualization we will only choose 2 features.

In [None]:
# Code to generate a data set
X, y = make_moons(n_samples=500, noise=0.25, random_state=42)
print(f"  X.shape:{X.shape}")
print(f"  y.shape:{y.shape}")

## 10.1.Splitting the data set
The data set will be split (using scikit-learn) into:
- <font color="green"><b>training set</b></font> 
- <font color="green"><b>test set</b></font>

<font color="orangered"><b>Note:</b></font>
* Normally, we will also create a <font color="green"><b>dev/validation</b></font> set. In <a href="./lecture2.ipynb"><b>Lecture 2</b></a>, we will elaborate on the use of a validation set.
* Make sure that the training, validation and test sets are drawn from the <font color="green"><b>same distribution</b></font>.<br>
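
One possible way to obtain a three-way split is to call `train_test_split` twice. The sketch below is only an illustration: the ratios, the variable names and the use of `stratify` (which keeps the class proportions similar in each subset) are assumptions, not choices made in this lecture.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X_all, y_all = make_moons(n_samples=500, noise=0.25, random_state=42)

# Step 1: set the test set aside (20% here, an assumed ratio)
X_rest, X_te, y_rest, y_te = train_test_split(
    X_all, y_all, test_size=0.20, random_state=42, stratify=y_all)

# Step 2: split the remainder into training and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest)

print(X_tr.shape, X_val.shape, X_te.shape)
```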

In [None]:
# Code to split the data in training and a test set.
test_ratio = 0.30
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_ratio, random_state=42)
print(f"Splitting the data set ...")
print(f"  Test ratio:{test_ratio}")
print(f"  Training Data Set:")
print(f"    X_train.shape :: {X_train.shape}")
print(f"    y_train.shape :: {y_train.shape}")
print(f"  Test Data Set:")
print(f"    X_test.shape  :: {X_test.shape}")
print(f"    y_test.shape  :: {y_test.shape}")

## 10.2. Visualization of the training data

In [None]:
# Visualization of the training set
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='RdBu')  
plt.title("Synthetic (Training) Data (2 Features)")
plt.xlabel(r"$x_1$")
plt.ylabel(r"$x_2$")
plt.show()

## 10.3. Train the model using your code

* The cost function ($\mathcal{C}$) is <a href="convexity.ipynb"><b>convex</b></a>, which implies that every local minimum is also a global minimum.
* We have only <font color="green"><b>one</b></font> hyperparameter (the learning rate $\alpha$). So, how to choose it?
  + a value of $\alpha$ that is <font color="red"><b>too large</b></font> makes the iterates overshoot: they move around the minimum without reaching it (or even diverge).
  + a value of $\alpha$ that is <font color="red"><b>too small</b></font> leads to extremely small steps and makes convergence very slow.
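
The effect of the learning rate can be seen on a deliberately simple stand-in problem. The sketch below applies gradient descent to $f(w)=w^2$ (gradient $2w$, minimum at $w=0$) rather than to the logistic cost; the function name `gradient_descent_1d` and the chosen values of $\alpha$ are assumptions for illustration.

```python
def gradient_descent_1d(alpha, w0=1.0, num_steps=50):
    """Minimize f(w) = w**2 (gradient 2*w) with a fixed step size alpha."""
    w = w0
    for _ in range(num_steps):
        w = w - alpha * 2.0 * w   # same update rule: w <- w - alpha * df/dw
    return w

for alpha in (0.001, 0.1, 1.0, 1.1):
    w_final = gradient_descent_1d(alpha)
    print(f"alpha = {alpha:5.3f} -> w after 50 steps: {w_final: .4e}")
```

On this toy function, $\alpha=0.001$ has barely moved after 50 steps, $\alpha=0.1$ gets very close to the minimum, $\alpha=1.0$ jumps back and forth between $+1$ and $-1$ without ever approaching it, and $\alpha=1.1$ moves further away with every step.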

In [None]:
lstCost, W, b = train_model(X=X_train, Y=y_train[:,np.newaxis], num_epochs=20000, lr=0.05)
# If you have errors load the following module
#import binclas as bc
#lstCost, W,b = bc.train_model(X=X_train, Y=y_train[:,np.newaxis], num_epochs=20000, lr=0.05)

In [None]:
print("Last 10 elements of the cost list:")
for item in lstCost[-10:]:
    print(f"  Cost:{item:20.14f}")
print(f"\nWeight (optimized): {W.ravel()}")
print(f"Bias (optimized)  : {b}")

# Predict labels (training set)
y_pred = predict_labels(X_train, W,b)
acc = accuracy(y_train[:,np.newaxis], y_pred)
print(f"Accuracy          : {acc:.4f}")

In [None]:
# Visualization of the training set
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='RdBu')  
plt.title("Synthetic (Training) Data (2 Features)")
plt.xlabel(r"$x_1$")
plt.ylabel(r"$x_2$")
plt.show()

## 10.4. Training: comparison with LogisticRegression (from sklearn)

In [None]:
# Training the model with scikit-learn using two different settings
# 1. Without regularization (no L2 penalty term), tight tolerance
model1 = LogisticRegression(penalty=None, max_iter=100000, tol=1.E-12).fit(X_train,y_train)
print(f"  LogisticRegression (sklearn) without L2 (Training Set) ::")
print(f"    coef:{model1.coef_}")
print(f"    intercept:{model1.intercept_}")
print(f"    score:{model1.score(X_train,y_train):8.4f}")

model2 = LogisticRegression(max_iter=100000, tol=1.E-12).fit(X_train,y_train)
print(f"  LogisticRegression (sklearn) with L2 Reg. (default) (Training Set) ::")
print(f"    coef:{model2.coef_}")
print(f"    intercept:{model2.intercept_}")
print(f"    score:{model2.score(X_train,y_train):8.4f}")

## 10.5. Plot training data & the decision boundary
The <font color="green"><b>decision boundary</b></font> is the set of points where $\sigma(z) = \frac{1}{2}$, i.e. where $z = \widehat{w_1} x_1 + \widehat{w_2} x_2 + \widehat{b} = 0$.<br>
Solving for $x_2$ (assuming $\widehat{w_2} \neq 0$) leads to:

$\begin{eqnarray}
   x_2  & = &  \frac{-(\widehat{w_1} x_1 + \widehat{b})}{\widehat{w_2}} \nonumber
\end{eqnarray}$

In [None]:
# Visualization of the training set + decision boundary
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='RdBu')  
plt.title("Train Data (2 Features) + decision boundary")
plt.xlabel(r"$x_1$")
plt.ylabel(r"$x_2$")
(x1_min, x1_max) = np.min(X_train[:,0]), np.max(X_train[:,0])
delta = 0.1
x1 = np.linspace(x1_min-delta,x1_max+delta,501)
x2 = -(W[0]*x1 + b)/W[1]
plt.plot(x1, x2,'g', label=r"$x_2 = - \frac{\widehat{w_1}\,x_1+\widehat{b}}{\widehat{w_2}}$")
plt.legend()
plt.show()

## 10.6. Test the model

In [None]:
# Here comes your code to test the model
# Predict labels (test set)
y_pred = predict_labels(X_test, W,b)
acc = accuracy(y_test[:,np.newaxis], y_pred)
print(f"  Accuracy:{acc:.4f}")

#### Comparison with LogisticRegression (from sklearn)

In [None]:
print(f"  LogisticRegression (sklearn) without L2 (Test Data Set) ::")
print(f"    score:{model1.score(X_test,y_test):8.4f}")
print(f"  LogisticRegression (sklearn) with L2 reg. (Test Set) ::")
print(f"    score:{model2.score(X_test,y_test):8.4f}")

# Conclusion
* You have implemented the <font color="orangered"><b>simplest (shallow) neural net</b></font> from scratch.
* You have learned that <font color="orangered"><b>training of a neural net</b></font> is (in general)  <br> an iterative process consisting of the following <font color="orangered"><b>components</b></font>:
  + <font color="orangered"><b>forward propagation</b></font>.
  + <font color="orangered"><b>backward propagation</b></font>.
  + <font color="orangered"><b>updating parameters</b></font> using gradient descent.
* Once the training is finished, you can <font color="orangered"><b>test/validate your model</b></font>.
* After <font color="orangered"><b>testing</b></font> you can apply your model to new data (from the same distribution), i.e. <font color="orangered"><b>inference</b></font>.

The above algorithm is the <font color="green"><b>basic algorithm</b></font> for <font color="green"><b>all</b></font> neural networks.