## Logistic regression

- Note the difference from the squared-error loss used in _linear regression_: combined with a sigmoid output it makes the cost non-convex, so gradient descent can get trapped in local minima in binary classification (which is the use case here)
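
The non-convexity claim can be checked numerically. The sketch below assumes a single training example with `x = 1`, `y = 1` and no bias (values chosen only for illustration), so the prediction is just `sigmoid(w)`. It applies the midpoint test: a convex function never lies above the chord between two of its points.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative setup (an assumption, not from the notes):
# one example with x = 1, y = 1, b = 0, so y_hat = sigmoid(w).
def squared_loss(w):
    return (sigmoid(w) - 1.0) ** 2

def log_loss(w):
    return -np.log(sigmoid(w))

def violates_convexity(f, w1, w2):
    """True if f at the midpoint lies above the chord between w1 and w2."""
    return f((w1 + w2) / 2.0) > (f(w1) + f(w2)) / 2.0

print(violates_convexity(squared_loss, -10.0, 0.0))  # True: squared loss is not convex in w
print(violates_convexity(log_loss, -10.0, 0.0))      # False: no violation on this pair
```

A single violating pair is enough to show non-convexity; the `False` for the log loss is only consistent with convexity, not a proof of it.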

$\hat{y} = P(y = 1 | x)$

where 

$0\le\hat{y}\le1 \text{ and } x\in\mathbb{R}^{n_x}$

$\text{parameters: } w\in\mathbb{R}^{n_x},\ b\in\mathbb{R}$

$\text{output: } \hat{y}=\sigma(w^{T}x+b)$

the sigmoid func $\sigma$ above is defined as

$\sigma(z)=\frac{1}{1+e^{-z}}$
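
As a minimal sketch of the output equation $\hat{y}=\sigma(w^{T}x+b)$, the snippet below computes $\hat{y}$ for one input. The helper name `predict_proba` and the values of `w`, `b`, and `x` are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Estimated P(y = 1 | x) for a single input vector x (hypothetical helper)."""
    return sigmoid(np.dot(w, x) + b)

# Illustrative values, not from the notes.
w = np.array([0.5, -0.25])
b = 0.0
x = np.array([1.0, 2.0])

y_hat = predict_proba(w, b, x)  # z = 0.5*1 - 0.25*2 + 0 = 0, and sigmoid(0) = 0.5
print(y_hat)
```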

this gives the loss func
$\mathcal{L}(\hat{y},y) = -\left(y\log\hat{y} + (1 - y)\log(1-\hat{y})\right)$

and the cost func
$\mathcal{J}(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)})$
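
A minimal NumPy sketch of the loss and cost, written with the leading minus sign so that the loss is non-negative and is the quantity being minimized. The `eps` clipping is an added assumption to keep `log` away from zero, and the sample values are invented.

```python
import numpy as np

def loss(y_hat, y, eps=1e-12):
    """Per-example cross-entropy loss."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # assumption: guard against log(0)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def cost(y_hat, y):
    """Cost J: the mean of the per-example losses over the m examples."""
    return np.mean(loss(y_hat, y))

# Illustrative labels and predictions, not from the notes.
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])

print(loss(y_hat, y))  # small when the prediction is close to the label
print(cost(y_hat, y))
```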

## Gradient descent
- the goal is to find the parameters _w_ and _b_ that minimize the cost `J(w, b)`, which is convex for logistic regression

rough algorithm:

$\text{Loop:}\quad w := w - \alpha\frac{\partial J(w, b)}{\partial w},\quad b := b - \alpha\frac{\partial J(w, b)}{\partial b}$


where alpha is the learning rate
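
To see the update rule in isolation, here is a sketch on a toy one-parameter cost, $J(w) = (w-3)^2$. This cost is a stand-in chosen for illustration, not the logistic regression cost, and the learning rate and step count are arbitrary.

```python
def grad_J(w):
    # Gradient of the toy cost J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0      # initial guess
alpha = 0.1  # learning rate
for _ in range(100):
    w = w - alpha * grad_J(w)  # step against the gradient

print(w)  # close to 3, the minimizer of the toy cost
```

With this cost each step shrinks the distance to the minimum by a constant factor of `1 - 2*alpha`, so a smaller `alpha` converges more slowly and an `alpha` above 1 would diverge.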

## Vectorization

this benchmark shows the perf increase from vectorization:

In [1]:
import numpy as np
import time

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
c = np.dot(a,b)
toc = time.time()
print("vectorized: {}ms".format(1000*(toc-tic)))

c = 0
tic = time.time()
for i in range(1000000):
    c += a[i]*b[i]
toc = time.time()
print("unvectorized (for loop): {}ms".format(1000*(toc-tic)))

vectorized: 1.489877700805664ms
unvectorized (for loop): 448.9467144012451ms


### Some functions to note

In [None]:
import numpy as np

def sigmoid(x):
    return 1/(1+np.exp(-x))

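def derivative(x):
    # Derivative of the sigmoid: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s*(1-s)

def im2vec(img):
    # Flatten a (length, width, depth) image into one column vector
    length = img.shape[0]
    width = img.shape[1]
    depth = img.shape[2]
    return img.reshape(length*width*depth, 1)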

def normalize_rows(x):
    return x/np.linalg.norm(x, axis=1, keepdims=True)

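def softmax(x):
    # Row-wise softmax; subtracting each row's max leaves the result unchanged but avoids overflow in exp
    e = np.exp(x - np.max(x, axis=1, keepdims=True))
    return e/np.sum(e, axis=1, keepdims=True)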

def l1_loss(y_hat, y):
    return np.sum(abs(y-y_hat))

def l2_loss(y_hat, y):
    # Assumes 1-D vectors: the dot product of y_diff with itself is already the sum of squares
    y_diff = y-y_hat
    return np.dot(y_diff, y_diff)

### A few interesting formulae from the assignment

$$A = \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, ..., a^{(m-1)}, a^{(m)})$$
$$J = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)})\right]$$
$$ \frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T$$
$$ \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)})$$
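
The formulae above can be sketched as a small vectorized training loop. This assumes `X` has shape `(n_x, m)`, `Y` has shape `(1, m)`, and `w` has shape `(n_x, 1)`. The function name `propagate`, the synthetic data, the learning rate, and the iteration count are all illustrative choices rather than anything fixed by the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def propagate(w, b, X, Y):
    """Return cost J and gradients (dw, db) for X of shape (n_x, m), Y of shape (1, m)."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)                               # activations, shape (1, m)
    J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # cross-entropy cost
    dw = X @ (A - Y).T / m                                 # dJ/dw, shape (n_x, 1)
    db = np.sum(A - Y) / m                                 # dJ/db, scalar
    return J, dw, db

# Toy data (hypothetical): the label is 1 when the two features sum to a positive number.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)

w = np.zeros((2, 1))
b = 0.0
alpha = 0.5  # learning rate, chosen arbitrarily
costs = []
for _ in range(200):
    J, dw, db = propagate(w, b, X, Y)
    costs.append(J)
    w = w - alpha * dw
    b = b - alpha * db

# Fraction of training examples classified correctly with a 0.5 threshold
accuracy = np.mean((sigmoid(w.T @ X + b) > 0.5) == (Y == 1))
print(costs[0], costs[-1], accuracy)
```

With zero-initialized parameters every activation starts at 0.5, so the first recorded cost is $\log 2$; the cost should then fall as the loop runs.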