# Logistic regression as a neural network

**Notation**

* assigning predicted category based on observed feature values
* an observation $(x,y)$; $x \in R^{N_x}$; $y\in \{0,1\}$
* $m$ training examples $\{(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), ..., (x^{(m)},y^{(m)})\}$
    * $M_{train}; M_{test}$

* X
    * conveniently $X = \begin{bmatrix} x^{(1)} & x^{(2)} & \dots & x^{(m)} \end{bmatrix}$ with one example per column, as opposed to the traditional row-wise convention (makes implementation more straightforward)
    * matrix $X \in R^{N_x \times m}$
* Y
    * conveniently $Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \dots & y^{(m)}\end{bmatrix}$
    * matrix $Y \in R^{1 \times m}$
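
A minimal sketch of the column-wise layout in `numpy`. The three examples and their labels are made-up values used only to show the shapes.

```python
import numpy as np

# Three hypothetical examples, each with N_x = 2 features.
examples = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
labels = [0, 1, 1]

# Stack the examples as columns: X has shape (N_x, m).
X = np.stack(examples, axis=1)
# Labels form a single row: Y has shape (1, m).
Y = np.array(labels).reshape(1, -1)

print(X.shape)  # (2, 3)
print(Y.shape)  # (1, 3)
```

With this layout `X[:, i]` is the $i$-th example and `Y[0, i]` is its label.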

**Logistic regression**  

* the goal of binary classification is to estimate $\hat{y} = P(y=1|x)$, using parameters $w \in R^{N_x}$ and $b \in R$
* output $\hat{y} = \sigma(w^T \cdot x+b)$
* $\sigma (z) = \frac{1}{1+e^{-z}}$
    * if $z$ is large and positive, $e^{-z} \approx 0$, so $\sigma(z) \approx \frac{1}{1+0} = 1$
    * if $z$ is large and negative, $e^{-z}$ is large, so $\sigma(z) \approx \frac{1}{1+large} \approx 0$
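
The output formula can be sketched in `numpy` as below. The function names `sigmoid` and `predict_proba` are illustrative choices, not names from the notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z}), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """y_hat = sigma(w^T x + b) for a single example x."""
    return sigmoid(np.dot(w, x) + b)

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```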

**Cost function**

Loss function
* single example  
* $L(\hat{y},y) = -(y \log{\hat{y}}+(1-y)\log(1-\hat{y}))$
    * if $y=1$: $L(\hat{y},y) = -\log(\hat{y})$ <- we want $\hat{y}$ to be large
    * if $y=0$: $L(\hat{y},y) = -\log(1-\hat{y})$ <- we want $\hat{y}$ small

Cost function
* average of the loss over all observations
* $J(w,b) = \frac{1}{m}\sum_{i=1}^m L(\hat{y}^{(i)},y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)}) \right]$
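
A small sketch of the loss and cost in `numpy`. The names `loss` and `cost` are illustrative, and the predictions are assumed to lie strictly between 0 and 1 so that the logarithms are finite.

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for one example (or element-wise for arrays)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Average loss over all m examples."""
    return np.mean(loss(Y_hat, Y))

# For y = 1 the loss shrinks as y_hat grows; for y = 0 it shrinks as y_hat falls.
print(loss(0.9, 1) < loss(0.1, 1))  # True
print(loss(0.1, 0) < loss(0.9, 0))  # True
```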

**Gradient descent**

* algorithm for iteratively minimizing $J(w, b)$; well suited to convex problems such as this one, where the only minimum is the global one
* update steps
    * $w:= w - \alpha\frac{\partial J(w,b)}{\partial w}$
    * $b:= b- \alpha\frac{\partial J(w,b)}{\partial b}$
    * $\alpha$ is the learning rate (step size of each update)
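
To illustrate the update rule in isolation, here is a sketch on a made-up one-parameter convex function $J(w) = (w-3)^2$, whose derivative is $2(w-3)$. The function `gradient_descent` and its default values are assumptions for this example only.

```python
def gradient_descent(grad, w0, alpha=0.1, steps=100):
    """Repeatedly apply w := w - alpha * dJ/dw."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# J(w) = (w - 3)^2 has its minimum at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_min)  # very close to 3
```

With $\alpha = 0.1$ each step shrinks the distance to the minimum by a constant factor here; a much larger $\alpha$ would overshoot and could diverge.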

**Derivatives**

* slope of a function, formally $\lim_{h \rightarrow 0} \frac{f(x+h)-f(x)}{h}$
* the slope can differ from point to point on a function
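
The limit definition can be approximated numerically with a small but finite $h$. This is only a sketch; the helper name `numerical_derivative` and the step `h=1e-6` are arbitrary choices.

```python
def numerical_derivative(f, x, h=1e-6):
    """Forward-difference approximation of f'(x)."""
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2  # exact derivative is 2x

# The slope depends on the point at which it is evaluated.
slope_at_1 = numerical_derivative(f, 1.0)  # approximately 2
slope_at_3 = numerical_derivative(f, 3.0)  # approximately 6
print(slope_at_1, slope_at_3)
```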

**Computation graph**

* based on the chain rule -> $\frac{dJ}{da} = \frac{dJ}{dv} \frac{dv}{da}$
* allows reusing (caching) intermediate derivatives during the backward pass over the computation graph
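
A sketch of a forward and backward pass on a tiny graph. The function $J = 3(a + bc)$ and the input values are chosen here purely for illustration; $u$ and $v$ are the intermediate nodes.

```python
# Inputs (arbitrary illustrative values).
a, b, c = 5.0, 3.0, 2.0

# Forward pass: compute and keep each intermediate node.
u = b * c      # u = bc
v = a + u      # v = a + u
J = 3 * v      # J = 3v

# Backward pass: each derivative reuses the one computed before it.
dJ_dv = 3.0
dJ_da = dJ_dv * 1.0   # dv/da = 1
dJ_du = dJ_dv * 1.0   # dv/du = 1
dJ_db = dJ_du * c     # du/db = c
dJ_dc = dJ_du * b     # du/dc = b

print(J, dJ_da, dJ_db, dJ_dc)  # 33.0 3.0 6.0 9.0
```

Note that `dJ_du` is computed once and then used for both `dJ_db` and `dJ_dc`, which is the caching the chain rule makes possible.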

**Logistic regression derivatives**

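* use the chain rule to get from the loss to the weight and bias updates
* $\frac{\partial L(\hat{y},y)}{\partial w} = \frac{\partial L(\hat{y},y)}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z} \frac{\partial z}{\partial w}$, where the product of the first two factors can be computed once and reused; the bias term is analogous, with $\frac{\partial z}{\partial b}$ as the last factor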
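
A worked version of this chain, writing $a = \hat{y} = \sigma(z)$ and $z = w^Tx+b$. The first two factors are

$$
\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}, \qquad
\frac{\partial a}{\partial z} = a(1-a)
$$

and their product simplifies to

$$
\frac{\partial L}{\partial z} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) a(1-a) = -y(1-a) + (1-y)a = a - y
$$

Since $\frac{\partial z}{\partial w_j} = x_j$ and $\frac{\partial z}{\partial b} = 1$, the gradients for a single example are

$$
\frac{\partial L}{\partial w_j} = (a - y)\,x_j, \qquad
\frac{\partial L}{\partial b} = a - y
$$

The shared factor $a - y$ is presumably what the step-by-step list abbreviates as $dz$.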

Step-by-step implementation  
* $J = 0; dw_1 = 0; dw_2 = 0; db = 0$
* for i=1 to m (number of examples):
    * $z^{(i)}=w^Tx^{(i)}+b$
    * $a^{(i)}=\sigma(z^{(i)})$
    * $dz^{(i)}=a^{(i)}-y^{(i)}$
    * updates
        * $J += -\frac{1}{m}[y^{(i)}\log(a^{(i)}) + (1-y^{(i)})\log(1-a^{(i)})]$
        * $dw_1 += \frac{1}{m} dz^{(i)} x_1^{(i)}$
        * $dw_2 += \frac{1}{m} dz^{(i)} x_2^{(i)}$
        * $db += \frac{1}{m} dz^{(i)}$

* final update
    * $w_1 = w_1-\alpha dw_1$
    * $w_2 = w_2-\alpha dw_2$
    * $b = b-\alpha db$
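
The steps above can be sketched in Python as follows, taking $dz^{(i)} = a^{(i)} - y^{(i)}$. The function name `gradient_step`, the toy data and the learning rate are assumptions made for this example; `dw` holds $dw_1, dw_2$ as one array.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(X, Y, w, b, alpha):
    """One gradient-descent step using an explicit loop over the examples.

    X has shape (n_x, m), Y has shape (1, m), w has shape (n_x,).
    Returns the updated w, the updated b and the cost before the update.
    """
    n_x, m = X.shape
    J, dw, db = 0.0, np.zeros(n_x), 0.0
    for i in range(m):
        x_i, y_i = X[:, i], Y[0, i]
        a_i = sigmoid(np.dot(w, x_i) + b)   # forward pass
        dz_i = a_i - y_i                    # dL/dz for this example
        J += -(y_i * np.log(a_i) + (1 - y_i) * np.log(1 - a_i)) / m
        dw += dz_i * x_i / m
        db += dz_i / m
    return w - alpha * dw, b - alpha * db, J

# Hypothetical toy data: 2 features, 4 examples stored as columns.
X = np.array([[-2.0, -1.0, 1.0, 2.0],
              [-1.0, -2.0, 2.0, 1.0]])
Y = np.array([[0, 0, 1, 1]])

w, b = np.zeros(2), 0.0
w, b, J_first = gradient_step(X, Y, w, b, alpha=0.1)
w, b, J_second = gradient_step(X, Y, w, b, alpha=0.1)
print(J_first, J_second)  # the cost decreases after the first update
```

Starting from zero weights, every prediction is 0.5, so the first cost equals $\log 2$.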



# Python and vectorization

**Vectorization**

* executing a loop-based operation on whole arrays without explicitly looping through the elements
* relies on SIMD instructions for parallelization on CPU/GPU (e.g. `np.dot`)
* avoid explicit for-loops whenever possible
* `numpy` built-in functions support vectorization for scalar multiplication, matrix multiplication, exp/log operations, and others
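
As an illustration, the logistic regression gradients can be computed without a loop over examples. This is a sketch under the assumption that `X` has shape `(n_x, m)`, `Y` has shape `(1, m)` and `w` is a 1-D array; the function names are hypothetical, and the loop version is included only to check that both agree.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_loop(X, Y, w, b):
    """Reference version: explicit loop over the m examples."""
    n_x, m = X.shape
    dw, db = np.zeros(n_x), 0.0
    for i in range(m):
        dz_i = sigmoid(np.dot(w, X[:, i]) + b) - Y[0, i]
        dw += dz_i * X[:, i] / m
        db += dz_i / m
    return dw, db

def gradients_vectorized(X, Y, w, b):
    """Same gradients using whole-array operations."""
    m = X.shape[1]
    A = sigmoid(np.dot(w, X) + b)   # all activations at once, shape (m,)
    dZ = A - Y[0]                   # shape (m,)
    dw = np.dot(X, dZ) / m          # shape (n_x,)
    db = np.sum(dZ) / m
    return dw, db

# Random data just to compare the two implementations.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 50))
Y = rng.integers(0, 2, size=(1, 50))
w, b = rng.standard_normal(3), 0.5

dw_loop, db_loop = gradients_loop(X, Y, w, b)
dw_vec, db_vec = gradients_vectorized(X, Y, w, b)
print(np.allclose(dw_loop, dw_vec), np.isclose(db_loop, db_vec))  # True True
```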

**Broadcasting**

* compute an aggregate over part of a matrix using the correct `axis` (e.g. column sums), then let `numpy` broadcast that result across the full matrix in an element-wise operation (use `reshape` or `keepdims` where needed to make the shapes compatible)
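
A short sketch of this pattern: expressing each entry as a percentage of its column total. The matrix values are made up for the example.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0]])

# Aggregate along axis 0 (down the rows); keepdims keeps the shape as (1, 3).
col_sums = A.sum(axis=0, keepdims=True)

# (2, 3) divided by (1, 3): the row of sums is broadcast to every row of A.
percent = 100 * A / col_sums

print(col_sums.shape)  # (1, 3)
print(percent)         # [[25. 50. 75.] [75. 50. 25.]]
```

Keeping the reduced axis with `keepdims=True` makes the intended broadcast explicit, which helps when aggregating along `axis=1`, where a plain 1-D result would not line up with the rows.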

# Logistic regression cost function

* $p(y|x) = \hat{y}^y \cdot (1-\hat{y})^{(1-y)}$
* we can maximize the log instead, since $\log$ is monotonically increasing and so attains its maximum at the same point as the original function; after simplification $\log(p(y|x)) = y \log(\hat{y}) + (1-y)\log(1-\hat{y}) = -L(\hat{y},y)$
* reasoning
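    * $p(...) = \prod_{i=1}^m p(y^{(i)}|x^{(i)})$ (assuming the examples are independent)
    * $\log(p(...)) = \sum_{i=1}^m \log(p(y^{(i)}|x^{(i)})) = - \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)})$
    * cost $J(w,b) = \frac{1}{m}\sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)})$, so minimizing the cost maximizes the log-likelihood (the factor $\frac{1}{m}$ only rescales it)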
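
A quick numerical check of this reasoning, using made-up labels and predictions: the log of the product of per-example probabilities equals the sum of their logs, and the cost is that sum negated and divided by $m$.

```python
import numpy as np

# Hypothetical labels and predicted probabilities for m = 4 examples.
Y = np.array([1, 0, 1, 1])
Y_hat = np.array([0.9, 0.2, 0.7, 0.6])

# Likelihood: product over examples of y_hat^y * (1 - y_hat)^(1 - y).
likelihood = np.prod(Y_hat ** Y * (1 - Y_hat) ** (1 - Y))

# Log-likelihood: sum over examples of the per-example log-probability.
log_likelihood = np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

# Cost: negative log-likelihood averaged over the examples.
J = -log_likelihood / Y.size

print(np.isclose(np.log(likelihood), log_likelihood))  # True
```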