## Logistic Regression ##
<p>Lowercase variables refer to a single training example; uppercase variables refer to the entire training set.</p><br>
**Notation**:
* ${m}=\text{Total number of training/input records} $
* ${n}_{x}=\text{Number of features per training/input record} $
* ${X}=\text{Input data, one training example per column, stacked horizontally into a matrix with shape } ({n}_{x},{m}) $
* ${W}=\text{Weights of the input features, stacked vertically into a column vector with shape } ({n}_{x},1) $
* ${b}=\text{Bias term, a single scalar that is broadcast across all m training samples} $
* ${Z}=\text{Outputs for all m training samples before activation, stacked horizontally into a row vector with shape } (1,m) $
* ${A}=\text{Predicted outputs for all m training samples after applying the activation, stacked horizontally into a row vector with shape } (1,m) $
* ${Y}=\text{Actual labels of all m training samples, stacked horizontally into a row vector with shape } (1,m) $
* ${L({a}^{i}, {y}^{i})}=\text{Loss for a single training example, where }{a}^{i}\text{ is the predicted output and }{y}^{i}\text{ is the actual label for the }{i}^{th}\text{ training sample} $
* ${J(W,b)} = \text{Cost function for the entire training set, equal to the mean of the individual losses}$
* ${dZ} = \text{Derivative of the loss with respect to the pre-activation output Z for each training sample, stacked horizontally into a row vector of shape }(1,m) $
* ${dW} = \text{Derivative of the cost function with respect to the weights W, a column vector of shape }({n}_{x},1) $
* ${db} = \text{Derivative of the cost function with respect to the bias b, a scalar} $

**Equations**:

* ${Z}={W^T}{X}+b $
* ${A}=sigmoid(Z) $
* ${L({a}^{i}, {y}^{i})} = -\left({y}^{i}\log({a}^{i}) + (1-{y}^{i})\log(1-{a}^{i})\right)$
* ${J(W,b)} = \frac{1}{m}\sum_{i=1}^{m}L({a}^{i}, {y}^{i}) $
* ${dZ}={A-Y} $
* ${dW}=\frac{np.dot(X,{dZ}^T)}{m} $
* ${db}=\frac{np.sum(dZ)}{m} $
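
The equations above can be sketched as a single vectorized forward and backward pass. This is a minimal illustration, not a reference implementation: the helper name `propagate` and the toy values of `X` and `Y` are assumptions made for the example, the bias `b` is treated as a scalar, and both `dW` and `db` are averaged over the `m` examples so that they are gradients of the mean cost `J`.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def propagate(W, b, X, Y):
    """One forward/backward pass (hypothetical helper name).

    W -- weights, shape (n_x, 1)
    b -- bias, a scalar
    X -- inputs, shape (n_x, m), one example per column
    Y -- labels, shape (1, m)
    """
    m = X.shape[1]
    Z = np.dot(W.T, X) + b                    # shape (1, m); b is broadcast
    A = sigmoid(Z)                            # shape (1, m)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                                # shape (1, m)
    dW = np.dot(X, dZ.T) / m                  # shape (n_x, 1)
    db = np.sum(dZ) / m                       # scalar
    return cost, dW, db

# Toy data (made up for illustration): n_x = 2 features, m = 3 examples
X = np.array([[1.0, 2.0, -1.0],
              [3.0, 4.0, -3.2]])
Y = np.array([[1, 0, 1]])
W = np.zeros((2, 1))
b = 0.0

cost, dW, db = propagate(W, b, X, Y)
print(cost, dW.shape, db)
```

With all-zero weights every prediction is 0.5, so the cost equals $\log 2$ regardless of the labels, which makes a convenient sanity check.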

## 1 - Building basic functions with numpy ##
### 1.1 - sigmoid function, np.exp() ###

In [7]:
import numpy as np
def sigmoid(x):
    """
    Arguments:
    x -- A scalar or numpy array of any size

    Return:
    s -- sigmoid(x)
    """
   
    s = 1/(1+(np.exp(-x)))
    return s
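
As a quick usage sketch (the input values are chosen only for illustration), the same function works on a scalar or on an array, because `np.exp` is applied element-wise:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid of a scalar or numpy array."""
    return 1 / (1 + np.exp(-x))

at_zero = sigmoid(0)            # scalar input
x = np.array([1, 2, 3])
s = sigmoid(x)                  # array input, evaluated element-wise

print(at_zero)
print(s)
```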

### 1.2 - Sigmoid gradient

In [8]:
def sigmoid_derivative(x):
    """
    Arguments:
    x -- A scalar or numpy array

    Return:
    ds -- The gradient of the sigmoid function evaluated at x.
    """
    
    s = sigmoid(x)      # compute the sigmoid once and reuse it
    ds = s * (1 - s)
    
    return ds
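
One way to gain confidence in the analytic gradient is to compare it with a central finite difference. The sketch below is only a sanity check under assumed values: the step size `eps` and the test points are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

x = np.array([1.0, 2.0, 3.0])
eps = 1e-6  # step size, chosen for illustration

# Central difference approximation of d(sigmoid)/dx
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid_derivative(x)

max_abs_diff = np.max(np.abs(numeric - analytic))
print(analytic)
print(max_abs_diff)
```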

### 1.3 - Reshaping arrays ###

In [14]:
def image2vector(image):
    """
    Argument:
    image -- a numpy array of shape (length, height, depth)
    
    Returns:
    v -- a vector of shape (length*height*depth, 1)
    """
    
    v = image.reshape(image.shape[0]*image.shape[1]*image.shape[2], 1)
    
    return v
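
A short usage sketch, with a small made-up array standing in for an image. It also shows that `reshape(-1, 1)` should give the same result, since `-1` asks numpy to infer the number of rows:

```python
import numpy as np

def image2vector(image):
    """Flatten a (length, height, depth) array into a column vector."""
    return image.reshape(image.shape[0] * image.shape[1] * image.shape[2], 1)

# Stand-in for a tiny image of shape (length, height, depth) = (4, 3, 2)
image = np.arange(24).reshape(4, 3, 2)

v = image2vector(image)
v_alt = image.reshape(-1, 1)    # numpy infers the first dimension

print(v.shape)
```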
    

### 1.4 - Normalizing rows

Another common technique in machine learning and deep learning is to normalize the data. Normalization often improves performance because gradient descent converges faster on normalized data. Here, normalization means changing x to $ \frac{x}{\| x\|} $ (dividing each row vector of x by its norm).

For example, if $$x = 
\begin{bmatrix}
    0 & 3 & 4 \\
    2 & 6 & 4 \\
\end{bmatrix}\tag{3}$$ then $$\| x\| = np.linalg.norm(x, axis = 1, keepdims = True) = \begin{bmatrix}
    5 \\
    \sqrt{56} \\
\end{bmatrix}\tag{4} $$and        $$ x\_normalized = \frac{x}{\| x\|} = \begin{bmatrix}
    0 & \frac{3}{5} & \frac{4}{5} \\
    \frac{2}{\sqrt{56}} & \frac{6}{\sqrt{56}} & \frac{4}{\sqrt{56}} \\
\end{bmatrix}\tag{5}$$ Note that you can divide arrays of different shapes and it works: this is called broadcasting, and it is covered in section 1.5.


**Exercise**: Implement normalizeRows() to normalize the rows of a matrix. After applying this function to an input matrix x, each row of x should be a vector of unit length (meaning length 1).

In [18]:
def normalizeRows(x):
    """
    Implement a function that normalizes each row of the matrix x (to have unit length).
    
    Argument:
    x -- A numpy matrix of shape (n, m)
    
    Returns:
    x -- The normalized (by row) numpy matrix. You are allowed to modify x.
    """

    x_norm = np.linalg.norm(x, axis=1, keepdims=True)  # per-row L2 norm, shape (n, 1)
    x = x / x_norm                                     # broadcasting divides each row by its norm

    return x
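
Applying the function to the example matrix from the text gives a way to check the result: every row of the output should have length 1. This is a minimal sketch; note that a row of all zeros would have norm 0 and lead to a division by zero, which this version does not guard against.

```python
import numpy as np

def normalizeRows(x):
    """Normalize each row of x to unit length."""
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)   # shape (n, 1)
    return x / x_norm                                   # broadcast over columns

x = np.array([[0, 3, 4],
              [2, 6, 4]])

x_normalized = normalizeRows(x)
row_lengths = np.linalg.norm(x_normalized, axis=1)

print(x_normalized)
print(row_lengths)
```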

### 1.5 - Broadcasting and the softmax function ###
**Exercise**: Implement a softmax function using numpy. You can think of softmax as a normalizing function used when your algorithm needs to classify two or more classes.

**Instructions**:
- $ \text{for } x \in \mathbb{R}^{1\times n} \text{,     } softmax(x) = softmax(\begin{bmatrix}
    x_1  &&
    x_2 &&
    ...  &&
    x_n  
\end{bmatrix}) = \begin{bmatrix}
     \frac{e^{x_1}}{\sum_{j}e^{x_j}}  &&
    \frac{e^{x_2}}{\sum_{j}e^{x_j}}  &&
    ...  &&
    \frac{e^{x_n}}{\sum_{j}e^{x_j}} 
\end{bmatrix} $ 

- $\text{for a matrix } x \in \mathbb{R}^{m \times n} \text{,  $x_{ij}$ maps to the element in the $i^{th}$ row and $j^{th}$ column of $x$, thus we have: }$  $$softmax(x) = softmax\begin{bmatrix}
    x_{11} & x_{12} & x_{13} & \dots  & x_{1n} \\
    x_{21} & x_{22} & x_{23} & \dots  & x_{2n} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    x_{m1} & x_{m2} & x_{m3} & \dots  & x_{mn}
\end{bmatrix} = \begin{bmatrix}
    \frac{e^{x_{11}}}{\sum_{j}e^{x_{1j}}} & \frac{e^{x_{12}}}{\sum_{j}e^{x_{1j}}} & \frac{e^{x_{13}}}{\sum_{j}e^{x_{1j}}} & \dots  & \frac{e^{x_{1n}}}{\sum_{j}e^{x_{1j}}} \\
    \frac{e^{x_{21}}}{\sum_{j}e^{x_{2j}}} & \frac{e^{x_{22}}}{\sum_{j}e^{x_{2j}}} & \frac{e^{x_{23}}}{\sum_{j}e^{x_{2j}}} & \dots  & \frac{e^{x_{2n}}}{\sum_{j}e^{x_{2j}}} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    \frac{e^{x_{m1}}}{\sum_{j}e^{x_{mj}}} & \frac{e^{x_{m2}}}{\sum_{j}e^{x_{mj}}} & \frac{e^{x_{m3}}}{\sum_{j}e^{x_{mj}}} & \dots  & \frac{e^{x_{mn}}}{\sum_{j}e^{x_{mj}}}
\end{bmatrix} = \begin{pmatrix}
    softmax\text{(first row of x)}  \\
    softmax\text{(second row of x)} \\
    ...  \\
    softmax\text{(last row of x)} \\
\end{pmatrix} $$

In [21]:
def softmax(x):
    """Calculates the softmax for each row of the input x.
    Argument:
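    x -- A numpy matrix of shape (m, n)

    Returns:
    s -- A numpy matrix equal to the softmax of x, of shape (m, n)
    """
    
    x_exp = np.exp(x)                              # element-wise exponential
    x_sum = np.sum(x_exp, axis=1, keepdims=True)   # row sums, shape (m, 1)
    s = x_exp / x_sum                              # broadcasting divides each row by its sum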
    
    return s
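
The version above can overflow when the entries of x are large, because `np.exp` grows very quickly. A common variant, sketched here under the hypothetical name `softmax_stable` (it is not part of the original exercise), subtracts each row's maximum before exponentiating. Softmax is unchanged by adding a constant to a row, so the result should be the same while the exponentials stay bounded by 1.

```python
import numpy as np

def softmax_stable(x):
    """Row-wise softmax with the row maximum subtracted first (hypothetical variant)."""
    shifted = x - np.max(x, axis=1, keepdims=True)   # largest entry in each row becomes 0
    x_exp = np.exp(shifted)
    return x_exp / np.sum(x_exp, axis=1, keepdims=True)

x = np.array([[9.0, 2.0, 5.0, 0.0, 0.0],
              [7.0, 5.0, 0.0, 0.0, 0.0]])
s = softmax_stable(x)
row_sums = s.sum(axis=1)                             # each row should sum to 1

# Inputs this large would overflow np.exp without the shift
big = softmax_stable(np.array([[1000.0, 1001.0, 1002.0]]))
small = softmax_stable(np.array([[0.0, 1.0, 2.0]]))

print(row_sums)
print(big)
```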

In [34]:
def L1(yhat, y):
    """
    Arguments:
    yhat -- vector of size m (predicted labels)
    y -- vector of size m (true labels)
    
    Returns:
    loss -- the value of the L1 loss, the sum of absolute differences between yhat and y
    """
    
    loss = np.sum(np.abs(yhat - y))
    
    return loss

In [29]:
def L2(yhat, y):
    """
    Arguments:
    yhat -- vector of size m (predicted labels)
    y -- vector of size m (true labels)
    
    Returns:
    loss -- the value of the L2 loss, the sum of squared differences between yhat and y
    """
    
    loss = np.sum(np.square(yhat - y))
    
    return loss
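
A small usage sketch for both losses, with prediction and label vectors made up for illustration. It also shows that the L2 loss can be written as a dot product of the difference vector with itself, which is one possible vectorized alternative to `np.sum(np.square(...))`.

```python
import numpy as np

def L1(yhat, y):
    """Sum of absolute differences."""
    return np.sum(np.abs(yhat - y))

def L2(yhat, y):
    """Sum of squared differences."""
    return np.sum(np.square(yhat - y))

# Illustrative values only
yhat = np.array([0.9, 0.2, 0.1, 0.4, 0.9])
y = np.array([1, 0, 0, 1, 1])

l1 = L1(yhat, y)
l2 = L2(yhat, y)

# Equivalent L2 computation using a dot product
diff = yhat - y
l2_dot = np.dot(diff, diff)

print(l1, l2, l2_dot)
```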