## One-Hot Encoding

In [19]:
import numpy as np
import pandas as pd

In [17]:
idx = ['Duck', 'Walrus', 'Beaver', 'Salamander']

def thing_vector(thing, idx):
    return [1 if thing == t else 0 for t in idx]

df = pd.DataFrame({
    thing : pd.Series(data=thing_vector(thing, idx), index=idx) for thing in idx
})
df

| | Beaver | Duck | Salamander | Walrus |
|---|---|---|---|---|
| Duck | 0 | 1 | 0 | 0 |
| Walrus | 0 | 0 | 0 | 1 |
| Beaver | 1 | 0 | 0 | 0 |
| Salamander | 0 | 0 | 1 | 0 |
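
For comparison, pandas has a built-in helper, `pd.get_dummies`, that should produce the same kind of table. This is a minimal sketch rather than a replacement for the hand-rolled version above; the variable names `animals` and `one_hot` are made up for illustration, and `dtype=int` is passed so the cells are 0/1 rather than booleans.

```python
import pandas as pd

# Same four labels as in the cell above (illustrative variable names).
animals = pd.Series(['Duck', 'Walrus', 'Beaver', 'Salamander'])

# One column per distinct label; each row has a single 1 in its own column.
one_hot = pd.get_dummies(animals, dtype=int)
print(one_hot)
```

Here the rows are indexed 0 to 3 instead of by animal name, and the columns come out in alphabetical order.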


## Maximum Likelihood

$P(\text{all}) = P_1 \times P_2 \times \cdots \times P_n$

We want to maximize this function. The higher the value, the more accurately the model classifies the points. However, a product of many probabilities is awkward to work with: every factor lies between 0 and 1, so the product becomes extremely small, and changing just one factor can change the whole result drastically. How can we turn products into sums?

$\ln a + \ln b = \ln(ab)$

In [43]:
assert np.allclose(np.log(.6) + np.log(.7), np.log(.42))
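
To make the motivation concrete, here is a small sketch with made-up numbers: 2000 points that each get probability 0.5. The raw product underflows to zero in 64-bit floating point, while the sum of logs stays an ordinary number. The names `probs`, `product`, and `log_likelihood` are illustrative only.

```python
import numpy as np

# Hypothetical data: 2000 points, each assigned probability 0.5.
probs = np.full(2000, 0.5)

# 0.5 ** 2000 is far below the smallest positive float64, so this is 0.0.
product = np.prod(probs)

# The sum of logs is just 2000 * ln(0.5), which is easy to represent.
log_likelihood = np.sum(np.log(probs))

print(product)
print(log_likelihood)
```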

## Cross Entropy

Since the natural log of a number between 0 and 1 is negative, we can take the negative log of each term to yield a positive number. Summing these negative logs gives the cross entropy, and a good model gives us low cross entropy. We can think of each term as the "error" of a given point. If the error is low, that means the probability of the point being a correct prediction is high.

Error function (binary classification)
$$
\displaystyle-\frac{1}{m}\sum_{i=1}^m \left[ y_i \ln(\hat{y}_i) + (1 - y_i)\ln(1-\hat{y}_i) \right]
$$

Error function (multi category)
$$
\displaystyle-\frac{1}{m}\sum_{i=1}^m \sum_{j=1}^n y_{ij} \ln(\hat{y}_{ij})
$$
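
A minimal sketch of the multi-category formula, assuming the labels are one-hot rows and every predicted probability is strictly positive. The function name `multiclass_cross_entropy` and the example data are made up for illustration.

```python
import numpy as np

def multiclass_cross_entropy(Y, P):
    """Mean cross-entropy over m points and n classes.

    Y: one-hot labels, shape (m, n).
    P: predicted probabilities, shape (m, n), all entries > 0.
    """
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    # Only the log-probability of the true class survives the product with Y.
    return -np.sum(Y * np.log(P)) / Y.shape[0]

# Two points, three classes (illustrative values).
Y = [[1, 0, 0], [0, 1, 0]]
P = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = multiclass_cross_entropy(Y, P)
print(loss)
```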

Recall that NumPy functions such as `np.log` operate elementwise on whole lists and arrays.

In [52]:
# The values in L are probabilities, so they must lie strictly between 0 and 1:
# ln(L) is undefined for L <= 0 and ln(1 - L) is undefined for L >= 1.
L = np.array([0.5, 0.1, 0.9, 0.1], dtype=float)
print(np.log(L))
print(np.log(1.0 - L))

[-0.69314718 -2.30258509 -0.10536052 -2.30258509]
[-0.69314718 -0.10536052 -2.30258509 -0.10536052]


In [53]:
# Write a function that takes as input two lists Y, P,
# and returns the float corresponding to their cross-entropy.

# Me 🍄
def cross_entropy(Y, P):
    return -1 * sum([yp[0]*np.log(yp[1])+(1-yp[0])*np.log(1-yp[1]) for yp in zip(Y,P)])

# Solution
def cross_entropy(Y, P):
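    # Convert both inputs to float arrays so the arithmetic is elementwise.
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)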
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))
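
A quick usage sketch for the solution above, with made-up labels and predictions. The function is repeated here so the block runs on its own; note that it returns the sum, without the $\frac{1}{m}$ factor from the formula. The names `good` and `bad` are illustrative.

```python
import numpy as np

def cross_entropy(Y, P):
    # Same computation as the solution above (sum, not mean).
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))

Y = [1, 0, 1, 1]

# Predictions that mostly agree with the labels.
good = cross_entropy(Y, [0.7, 0.3, 0.8, 0.9])

# Predictions that mostly disagree with the labels.
bad = cross_entropy(Y, [0.4, 0.6, 0.1, 0.5])

print(good, bad)
```

The better predictions should give the smaller cross entropy.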

## Logistic Regression

We minimize the error function through gradient descent.

The derivation of logistic regression involves calculating the partial derivative of the error function.

The error function is given above. Recall that the sigmoid function $\sigma$ is defined by:
$$
\sigma(x) = \frac{1}{1+e^{-x}}
$$

and that $\hat{y}$ is given by:
$$
    \hat{y}_i = \sigma(Wx_i+b)
$$
We can calculate the gradient:
$$
    \nabla E = (\frac{\partial}{\partial w_1}E,\dots,\frac{\partial}{\partial w_n}E,\frac{\partial}{\partial b}E)
$$
The sigmoid function has a handy derivative. [A good explanation of how the sigmoid derivative is derived can be found here.](https://beckernick.github.io/sigmoid-derivative-neural-network/)

The derivative is: $\sigma'(x) = \sigma(x)(1-\sigma(x))$
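
One way to convince yourself of this identity is a numerical check. The sketch below compares the formula against a central finite difference; the helper names `sigmoid` and `sigmoid_prime`, the step size `h`, and the sample points are all illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # Closed form: sigma(x) * (1 - sigma(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5.0, 5.0, 11)
h = 1e-6

# Central difference approximation of the derivative.
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

assert np.allclose(numeric, sigmoid_prime(x), atol=1e-8)
```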

[TODO: Research derivation]

$$
    \nabla E = -(y-\hat{y})(x_1,\dots,x_n,1)
$$
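
A sketch of where this gradient comes from, for a single point with error $E = -y\ln(\hat{y}) - (1-y)\ln(1-\hat{y})$. The shorthand $z = \sum_j w_j x_j + b$ and the index $j$ are introduced here for illustration, with $\hat{y} = \sigma(z)$. By the sigmoid derivative and the chain rule:
$$
    \frac{\partial \hat{y}}{\partial w_j} = \hat{y}(1-\hat{y})\,x_j
$$
Substituting into the derivative of $E$:
$$
    \frac{\partial E}{\partial w_j} = -\frac{y}{\hat{y}}\,\hat{y}(1-\hat{y})\,x_j + \frac{1-y}{1-\hat{y}}\,\hat{y}(1-\hat{y})\,x_j = -y(1-\hat{y})x_j + (1-y)\hat{y}x_j = -(y-\hat{y})x_j
$$
The same steps with $\frac{\partial z}{\partial b} = 1$ give $\frac{\partial E}{\partial b} = -(y-\hat{y})$.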

The weights get updated according to the following:
$$
    w_i' \longleftarrow w_i + \alpha(y - \hat{y})x_i 
$$
and
$$
    b' \longleftarrow b + \alpha(y - \hat{y})
$$

## Gradient Descent Algorithm


<ol>
    <li>Start with random weights: $ w_1,\dots,w_n,b $ </li>
    <li>For every point: $ (x_1,\dots,x_n) $ </li>
    <ol>
        <li>For $ i = 1 $ to $n$</li>
        <ol>
            <li>$ w_i' \longleftarrow w_i + \alpha(y - \hat{y})x_i $</li>
        </ol>
        <li>$ b' \longleftarrow b + \alpha(y - \hat{y})$ </li>
    </ol>
    <li>Repeat until error is small. </li>
</ol>

This is very similar to the perceptron algorithm.
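
The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a tuned implementation: the function name `train`, the learning rate, the fixed number of passes (used in place of "repeat until error is small"), and the tiny dataset are all assumptions made for the example. The bias starts at zero here for simplicity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(X, y, alpha=0.1, epochs=100, seed=0):
    """Per-point gradient descent for logistic regression.

    X: features, shape (m, n). y: labels in {0, 1}, shape (m,).
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])  # step 1: small random weights
    b = 0.0
    for _ in range(epochs):                     # step 3: fixed number of passes
        for x_i, y_i in zip(X, y):              # step 2: every point
            y_hat = sigmoid(np.dot(w, x_i) + b)
            w += alpha * (y_i - y_hat) * x_i    # updates all w_i at once
            b += alpha * (y_i - y_hat)
    return w, b

# Made-up, linearly separable data.
X = np.array([[-2.0, -1.0], [-1.0, -2.0], [-1.0, -1.0],
              [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w, b = train(X, y)
pred = (sigmoid(X @ w + b) > 0.5).astype(int)
print(pred)
```

The vectorized line `w += alpha * (y_i - y_hat) * x_i` plays the role of the inner loop over $i$.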
