In [1]:
import numpy as np

## 2.1 Element-wise `sigmoid`
`z` is a numpy ndarray.
$$ \text{For } z \in \mathbb{R}^n \text{,     } sigmoid(z) = sigmoid\begin{pmatrix}
    z_1  \\
    z_2  \\
    ...  \\
    z_n  \\
\end{pmatrix} = \begin{pmatrix}
    \frac{1}{1+e^{-z_1}}  \\
    \frac{1}{1+e^{-z_2}}  \\
    ...  \\
    \frac{1}{1+e^{-z_n}}  \\
\end{pmatrix}\tag{1} $$

In [2]:
def sigmoid(z):
    """
    Compute the sigmoid (logistic) activation of z element-wise.

    np.exp is preferred over math.exp because z is often a vector or a
    matrix, and np.exp operates on arrays element-wise.
    """
    return 1.0 / (1.0 + np.exp(-z))

## 2.1.1 Sigmoid gradient

$$sigmoid\_derivative(z) = \sigma'(z) = \sigma(z) (1 - \sigma(z))\tag{2}$$

In [3]:
def sigmoid_derivative(z):
    """
    Compute the gradient of the sigmoid function
    """
    a = sigmoid(z)
    ds = a * (1 - a)    
    return ds
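
One way to sanity-check formula (2) is to compare it against a central finite difference. The sketch below redefines the two functions so it runs on its own; the test points in `z` and the step size `eps` are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    # Element-wise logistic function
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # Analytic gradient: sigma(z) * (1 - sigma(z))
    a = sigmoid(z)
    return a * (1 - a)

z = np.array([-2.0, 0.0, 3.0])   # illustrative test points
eps = 1e-6                        # illustrative step size

# Central difference approximation of the derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_derivative(z)

max_abs_diff = np.max(np.abs(numeric - analytic))
print(analytic, max_abs_diff)
```

At `z = 0` the sigmoid equals 0.5, so the analytic gradient there is 0.25, the largest value the derivative takes.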

## 2.2 Normalizing rows

Gradient descent often converges faster when the inputs are normalized. Here, normalizing means replacing x with $ \frac{x}{\| x\|} $, that is, dividing each row vector of x by its norm.


For example, if $$x = 
\begin{bmatrix}
    0 & 3 & 4 \\
    2 & 6 & 4 \\
\end{bmatrix}\tag{3}$$ then $$\| x\| = np.linalg.norm(x, axis = 1, keepdims = True) = \begin{bmatrix}
    5 \\
    \sqrt{56} \\
\end{bmatrix}\tag{4} $$and        $$ x\_normalized = \frac{x}{\| x\|} = \begin{bmatrix}
    0 & \frac{3}{5} & \frac{4}{5} \\
    \frac{2}{\sqrt{56}} & \frac{6}{\sqrt{56}} & \frac{4}{\sqrt{56}} \\
\end{bmatrix}\tag{5}$$

$\| x\|$ has shape (2, 1) and is broadcast across the columns of $x$ to compute $ \frac{x}{\| x\|} $.

In [4]:
def normalize_rows(x):
    """
    Normalize each row of the matrix x to unit length.
    """
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / x_norm
    return x
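
As a quick check, the example matrix from (3) can be passed through the function and the resulting row norms inspected. This is a minimal self-contained sketch; it assumes no row of `x` is all zeros, since that would cause a division by zero.

```python
import numpy as np

def normalize_rows(x):
    # keepdims=True keeps the norms as a column of shape (m, 1),
    # so the division broadcasts across each row.
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)
    return x / x_norm

x = np.array([[0.0, 3.0, 4.0],
              [2.0, 6.0, 4.0]])

x_normalized = normalize_rows(x)
row_norms = np.linalg.norm(x_normalized, axis=1)  # each entry should be 1

print(x_normalized)
print(row_norms)
```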

## 2.3 Softmax

- $ \text{for } x \in \mathbb{R}^{1\times n} \text{,     } softmax(x) = softmax(\begin{bmatrix}
    x_1  &&
    x_2 &&
    ...  &&
    x_n  
\end{bmatrix}) = \begin{bmatrix}
     \frac{e^{x_1}}{\sum_{j}e^{x_j}}  &&
    \frac{e^{x_2}}{\sum_{j}e^{x_j}}  &&
    ...  &&
    \frac{e^{x_n}}{\sum_{j}e^{x_j}} 
\end{bmatrix} $ 

- $\text{for a matrix } x \in \mathbb{R}^{m \times n} \text{,  $x_{ij}$ maps to the element in the $i^{th}$ row and $j^{th}$ column of $x$, thus we have: }$  $$softmax(x) = softmax\begin{bmatrix}
    x_{11} & x_{12} & x_{13} & \dots  & x_{1n} \\
    x_{21} & x_{22} & x_{23} & \dots  & x_{2n} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    x_{m1} & x_{m2} & x_{m3} & \dots  & x_{mn}
\end{bmatrix} = \begin{bmatrix}
    \frac{e^{x_{11}}}{\sum_{j}e^{x_{1j}}} & \frac{e^{x_{12}}}{\sum_{j}e^{x_{1j}}} & \frac{e^{x_{13}}}{\sum_{j}e^{x_{1j}}} & \dots  & \frac{e^{x_{1n}}}{\sum_{j}e^{x_{1j}}} \\
    \frac{e^{x_{21}}}{\sum_{j}e^{x_{2j}}} & \frac{e^{x_{22}}}{\sum_{j}e^{x_{2j}}} & \frac{e^{x_{23}}}{\sum_{j}e^{x_{2j}}} & \dots  & \frac{e^{x_{2n}}}{\sum_{j}e^{x_{2j}}} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    \frac{e^{x_{m1}}}{\sum_{j}e^{x_{mj}}} & \frac{e^{x_{m2}}}{\sum_{j}e^{x_{mj}}} & \frac{e^{x_{m3}}}{\sum_{j}e^{x_{mj}}} & \dots  & \frac{e^{x_{mn}}}{\sum_{j}e^{x_{mj}}}
\end{bmatrix} = \begin{pmatrix}
    softmax\text{(first row of x)}  \\
    softmax\text{(second row of x)} \\
    ...  \\
    softmax\text{(last row of x)} \\
\end{pmatrix} $$

In [5]:
def softmax(x):
    """
    Calculates the softmax for each row of the input x.
    """
    
    x_exp = np.exp(x)
    x_sum = np.sum(x_exp, axis=1, keepdims=True)
    s = x_exp / x_sum
    return s
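
For large inputs, `np.exp(x)` can overflow. A common variant, sketched here under the hypothetical name `softmax_stable`, subtracts each row's maximum before exponentiating. The result is mathematically the same because the shift cancels between numerator and denominator.

```python
import numpy as np

def softmax_stable(x):
    # Shift each row so its largest entry is 0; exp() then never exceeds 1.
    shifted = x - np.max(x, axis=1, keepdims=True)
    x_exp = np.exp(shifted)
    return x_exp / np.sum(x_exp, axis=1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1001.0, 1002.0]])  # second row would overflow without the shift

s = softmax_stable(x)
print(s)
print(s.sum(axis=1))  # each row sums to 1
```

Both rows differ only by a constant offset, so they produce the same softmax output.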

## 2.4 L1 and L2 loss functions

NumPy vectorized versions of the L1 and L2 losses. Let $ \hat{y} $ denote the predictions and $y$ the true values.

- L1 loss is defined as:
$$\begin{align*} & L_1(\hat{y}, y) = \sum_{i=1}^m|y^{(i)} - \hat{y}^{(i)}| \end{align*}\tag{6}$$

- L2 loss is defined as: $$\begin{align*} & L_2(\hat{y},y) = \sum_{i=1}^m(y^{(i)} - \hat{y}^{(i)})^2 \end{align*}\tag{7}$$
The L2 loss can be computed with `np.dot()`: if $x = [x_1, x_2, ..., x_n]$, then `np.dot(x,x)` = $\sum_{j=1}^n x_j^{2}$. 

In [6]:
def L1(yhat, y):
    """
    L1 loss function
    """
    loss = np.sum(np.abs(yhat-y))
    return loss

In [7]:
def L2(yhat, y):
    """
    L2 loss function
    """
    loss = np.dot(yhat-y, yhat-y)
    return loss
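
A small worked example makes the two definitions concrete. The vectors `yhat` and `y` below are illustrative values, and the sketch assumes both are 1-D arrays, which is what `np.dot` needs to return a scalar here.

```python
import numpy as np

def L1(yhat, y):
    # Sum of absolute differences
    return np.sum(np.abs(yhat - y))

def L2(yhat, y):
    # Sum of squared differences, via the dot product of the residual with itself
    diff = yhat - y
    return np.dot(diff, diff)

yhat = np.array([0.9, 0.2, 0.1, 0.4, 0.9])  # illustrative predictions
y = np.array([1.0, 0.0, 0.0, 1.0, 1.0])     # illustrative labels

l1_value = L1(yhat, y)  # 0.1 + 0.2 + 0.1 + 0.6 + 0.1 = 1.1
l2_value = L2(yhat, y)  # 0.01 + 0.04 + 0.01 + 0.36 + 0.01 = 0.43

print(l1_value, l2_value)
```

For inputs of shape (1, m), `np.dot(diff, diff)` raises a shape error; `np.sum(diff ** 2)` is one alternative in that case.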

## 2.5 Logistic regression
Logistic regression can be viewed as a neural network with a single sigmoid unit and no hidden layer.

For one example $x^{(i)}$:
$$z^{(i)} = w^T x^{(i)} + b \tag{1}$$
$$\hat{y}^{(i)} = a^{(i)} = sigmoid(z^{(i)})\tag{2}$$ 
$$ \mathcal{L}(a^{(i)}, y^{(i)}) =  - y^{(i)}  \log(a^{(i)}) - (1-y^{(i)} )  \log(1-a^{(i)})\tag{3}$$

The cost is then computed by averaging the loss over all $m$ training examples:
$$ J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(a^{(i)}, y^{(i)})\tag{4}$$

Forward propagation:
- Take the input X
- Compute $A = \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, ..., a^{(m)})$
- Compute the cost: $J = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)})\right]$

Backward propagation uses the two gradients:

$$ \frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T\tag{5}$$
$$ \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)})\tag{6}$$
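
Both gradients follow from the chain rule applied to the per-example loss $\mathcal{L}$ defined above. A short derivation sketch, writing $a$, $y$, $z$, $x$ for $a^{(i)}$, $y^{(i)}$, $z^{(i)}$, $x^{(i)}$ (a shorthand introduced here only to keep the notation light):

$$ \frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}, \qquad \frac{\partial a}{\partial z} = a(1-a) $$
$$ \frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial a}\,\frac{\partial a}{\partial z} = -y(1-a) + (1-y)\,a = a - y $$
$$ \frac{\partial \mathcal{L}}{\partial w} = (a - y)\,x, \qquad \frac{\partial \mathcal{L}}{\partial b} = a - y $$

Averaging over the $m$ examples, with the $x^{(i)}$ stacked as the columns of $X$, gives the two formulas above.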

In [8]:
# full logistic regression project please see: 
# https://github.com/gaoisbest/Machine-Learning-projects/tree/master/Logistic%20regression
def gradient_descent(X, Y, nx, ny, m, num_iterations, alpha):
    """
    Train logistic regression parameters with batch gradient descent.

    X: inputs of shape (nx, m); Y: labels of shape (1, m).
    nx: number of features; m: number of examples; ny is unused here.
    alpha: learning rate.
    """
    W = np.zeros(shape=(nx, 1), dtype=np.float32) # weights initialization
    b = 0.0 # bias initialization
    for _ in range(num_iterations):
        Z = np.dot(W.T, X) + b # shape: (1, m)
        A = sigmoid(Z) # shape: (1, m)
        cost = -1.0 / m * np.sum(Y * np.log(A) + (1-Y) * np.log(1-A))
        print('cost:{}'.format(cost))
        # backward pass through the computation graph
        dZ = A - Y # per-example derivative of the loss w.r.t. Z (chain rule through A). shape: (1, m)
        dW = 1.0 / m * np.dot(X, dZ.T) # derivative of the cost w.r.t. W. shape: (nx, 1)
        W -= alpha * dW # update W
        db = 1.0 / m * np.sum(dZ) # derivative of the cost w.r.t. b (scalar)
        b -= alpha * db # update b
    return W, b
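
To see the training loop in action, the sketch below runs a compact variant on synthetic data. The function name `train_logistic`, the returned cost history, the random data, and the hyperparameters are all assumptions made for this illustration; the update rule itself is the same as above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, Y, num_iterations, alpha):
    """Variant of the loop above that records the cost instead of printing it."""
    nx, m = X.shape
    W = np.zeros((nx, 1))
    b = 0.0
    costs = []
    for _ in range(num_iterations):
        A = sigmoid(np.dot(W.T, X) + b)           # forward pass, shape (1, m)
        costs.append(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))
        dZ = A - Y                                 # shape (1, m)
        W -= alpha * np.dot(X, dZ.T) / m           # gradient step on W
        b -= alpha * np.sum(dZ) / m                # gradient step on b
    return W, b, costs

# Synthetic, linearly separable data: label is 1 when x1 + x2 > 0
rng = np.random.default_rng(0)
m = 200
X = rng.normal(size=(2, m))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(np.float64)  # shape (1, m)

W, b, costs = train_logistic(X, Y, num_iterations=200, alpha=0.5)

predictions = (sigmoid(np.dot(W.T, X) + b) > 0.5).astype(np.float64)
accuracy = np.mean(predictions == Y)
print(costs[0], costs[-1], accuracy)
```

With zero-initialized parameters every activation starts at 0.5, so the first recorded cost equals $\log 2$.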

## 2.6 Learning rate
**How should the learning rate for gradient descent be chosen?**

**Solution**: plot the cost against the number of iterations (x-axis) for several learning rates and compare how each curve converges.

For a sufficiently **small** learning rate, the cost should decrease on every iteration, but convergence is slow.
For a **large** learning rate, the cost may fail to decrease on every iteration and may not converge.

To choose a learning rate, try values spaced roughly 3x apart, from small to large, and select the one that converges fastest without diverging:
0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1

From https://zh.coursera.org/learn/machine-learning/lecture/3iawu/gradient-descent-in-practice-ii-learning-rate
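
The sweep described above can be sketched as follows. The helper `cost_history`, the synthetic data, and the iteration count are assumptions made for this example; the resulting `histories` dictionary holds one cost curve per learning rate and could be passed to any plotting library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_history(X, Y, alpha, num_iterations=100):
    """Run logistic-regression gradient descent and return the cost per iteration."""
    nx, m = X.shape
    W = np.zeros((nx, 1))
    b = 0.0
    costs = []
    for _ in range(num_iterations):
        A = sigmoid(np.dot(W.T, X) + b)
        A_safe = np.clip(A, 1e-12, 1 - 1e-12)      # keeps the logs finite
        costs.append(-np.mean(Y * np.log(A_safe) + (1 - Y) * np.log(1 - A_safe)))
        dZ = A - Y
        W -= alpha * np.dot(X, dZ.T) / m
        b -= alpha * np.sum(dZ) / m
    return costs

# Synthetic data with label noise, so the classes are not perfectly separable
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
noise = rng.normal(scale=0.5, size=(1, 200))
Y = (X[0:1, :] + X[1:2, :] + noise > 0).astype(np.float64)

learning_rates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
histories = {alpha: cost_history(X, Y, alpha) for alpha in learning_rates}

final_costs = {alpha: history[-1] for alpha, history in histories.items()}
best_alpha = min(final_costs, key=final_costs.get)

for alpha in learning_rates:
    print(alpha, final_costs[alpha])
```

Comparing only the final cost is a simplification; looking at the full curves also shows whether a rate oscillates or diverges before settling.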

## 2.7 Reshape
Goal: flatten X of shape (a,b,c,d) into X_flatten of shape (b$*$c$*$d, a), so that each column holds one flattened example:
```
X_flatten = X.reshape(X.shape[0], -1).T      # reshape to (a, b*c*d), then transpose with .T
```
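
A small shape check illustrates the result. The dimensions (2, 3, 4, 5) are arbitrary values chosen for this sketch.

```python
import numpy as np

# Illustrative array with (a, b, c, d) = (2, 3, 4, 5)
X = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)

# Keep the first axis, flatten the rest, then transpose
X_flatten = X.reshape(X.shape[0], -1).T

print(X.shape)          # (2, 3, 4, 5)
print(X_flatten.shape)  # (60, 2)
```

Column `k` of `X_flatten` contains the flattened entries of `X[k]`.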