As we can see above, the equation for the sigmoid function is $\sigma (x)=\frac{1}{1+e^{-x}}$. Given a Bernoulli random variable $Y$ that takes only two values, e.g., $1$ and $0$, we can write $p(x) = \mathop{\mathbb{P}}[Y=1 | X=x]$. Our model asserts that $$p(x)=\sigma (x)=\frac{1}{1+e^{-x}},$$ but we'd like to parameterize this as $$\frac{1}{1+e^{-(\beta^Tx + \beta_{0})}}.$$ To do so, we need to solve for $(\beta^Tx + \beta_{0})$; if we abbreviate it as $x'=(\beta^Tx + \beta_{0})$, then we only need to solve for $x'$ in $p(x')$.

$$p(x')=\frac{1}{1+e^{-x'}}\implies\frac{1}{p(x')}=1+e^{-x'}\implies\frac{1}{p(x')}-1=e^{-x'}\implies\frac{1-p(x')}{p(x')}=e^{-x'}\implies\frac{p(x')}{1-p(x')}=e^{x'}\implies \ln\left(\frac{p(x')}{1-p(x')}\right)=x'$$

So the log-odds of $Y=1$ equal $x'=\beta^Tx + \beta_{0}$, a linear function of $x$, and we have our model: $$p(x)=\frac{1}{1+e^{-(\beta^Tx + \beta_{0})}}$$
Now that we have our new model for our data, we aim to fit this model by tuning the coefficients $\beta$.
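
As a quick sanity check of the derivation above, the following minimal sketch evaluates the model for one input and confirms that taking the log-odds of the result recovers the linear term. The coefficient values and the input `x` are hypothetical, chosen only for illustration, and the helper names `sigmoid` and `logit` are our own.

```python
import numpy as np

def sigmoid(z):
    """Map a real number (or array) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds ln(p / (1 - p)): the inverse of the sigmoid for p in (0, 1)."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients and a single input, for illustration only.
beta = np.array([0.5, -1.0])
beta_0 = 0.25
x = np.array([2.0, 1.0])

z = beta @ x + beta_0   # linear term beta^T x + beta_0
p = sigmoid(z)          # model probability P[Y=1 | X=x]
recovered = logit(p)    # log-odds of p, which should match z up to rounding

print(z, p, recovered)
```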

### Optimization

We've seen two examples of *modeling* the data, as either a straight line or a sigmoid curve. Both of these models involve **parameters** whose values must be chosen to make the models **fit** the data.

There are infinitely many straight lines or sigmoid curves we could model the data with. We need to find the *best fit* in an efficient way.

**Optimization** is the process of **maximizing** some parameterized **function** based on a **metric** (or **minimizing** based on its negative).

##### Which objective function are we minimizing?

For the case of **linear regression**, it is natural to measure the **error** of our predictions, i.e., how far each prediction falls from the observed value.

More specifically, we use the L2 loss for linear regression, also known as the **least squares** loss:

<p style="text-align: center;"> $l(x_{i}, y_{i}) = \|y_{i} - (\beta^T x_{i} + \beta_{0})\|^2$ </p>

When dealing with the entire dataset, we'll combine $\beta_{0}$ with vector $\beta$ and add a dummy column of 1's to X:

<p style="text-align: center;"> $l(X, y) = \|y - \beta^T X\|^2$ </p>
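
The least squares loss can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy data is made up, the function name `least_squares_loss` is our own, and we assume one example per row of `X` (with the dummy column of 1's first), so the prediction vector is computed as `X @ beta`.

```python
import numpy as np

def least_squares_loss(X, y, beta):
    """Sum of squared residuals between y and the linear predictions."""
    residuals = y - X @ beta
    return float(residuals @ residuals)

# Made-up data lying exactly on the line y = 2x + 1.
x_raw = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Prepend a dummy column of 1's so beta[0] plays the role of beta_0.
X = np.hstack([np.ones((x_raw.shape[0], 1)), x_raw])

beta_exact = np.array([1.0, 2.0])  # intercept 1, slope 2: fits perfectly
beta_off = np.array([0.0, 2.0])    # wrong intercept: every residual is 1

loss_exact = least_squares_loss(X, y, beta_exact)
loss_off = least_squares_loss(X, y, beta_off)
print(loss_exact, loss_off)
```

A lower loss means a better fit, which is what the optimizer searches for when it adjusts `beta`.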

**Logistic regression** has some interesting properties that let us optimize its objective function more directly.

If we suppose `y` takes the values -1 and 1 instead of 0 and 1, we can write the two class probabilities as follows:

<p style="text-align: center;"> $\mathop{\mathbb{P}}[Y=1 | X=x] = \frac{1}{1 + e^{-\beta^T x}}$ and $\mathop{\mathbb{P}}[Y=-1 | X=x] = \frac{1}{1 + e^{\beta^T x}}$ </p>

This can be simplified to: $\mathop{\mathbb{P}}[Y=y | X=x] = \frac{1}{1 + e^{-y \beta^T x}} = \sigma(y \beta^T x)$
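
To see that the single fraction $\frac{1}{1 + e^{-y \beta^T x}}$ really covers both cases, here is a small numerical check. The value of `z`, which stands in for $\beta^T x$, is hypothetical, and `prob_of_label` is our own helper name.

```python
import numpy as np

def prob_of_label(y, z):
    """P[Y=y | X=x] for y in {-1, +1}, where z stands for beta^T x."""
    return 1.0 / (1.0 + np.exp(-y * z))

z = 0.8  # hypothetical value of beta^T x

# The two class probabilities written out separately.
p_pos = 1.0 / (1.0 + np.exp(-z))  # P[Y=+1 | X=x]
p_neg = 1.0 / (1.0 + np.exp(z))   # P[Y=-1 | X=x]

# The compact form reproduces each of them, and together they sum to 1.
print(np.isclose(prob_of_label(+1, z), p_pos))
print(np.isclose(prob_of_label(-1, z), p_neg))
print(np.isclose(p_pos + p_neg, 1.0))
```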



We seek to **maximize** the probability of the observed labels, $Y=y$, given the data, by fitting our parameters $\beta$. This is referred to as **maximum likelihood estimation**:

<p style="text-align: center;"> ${\displaystyle \max_{\beta \in \Theta}} {\mathop{\mathcal{L}}}(\beta;X)$ </p>

where $\Theta$ represents the parameter space of all possible parameter values.

Assuming the examples are independent, the likelihood factorizes as

$$
\begin{aligned}
  {\mathop{\mathcal{L}}}(\beta;X) &= p((x_{1},y_{1}),(x_{2},y_{2}),...,(x_{n},y_{n});\beta) \\
 &= {\displaystyle \prod_{i=1}^{n} p(x_{i},y_{i}; \beta)} \\
 &= {\displaystyle \prod_{i=1}^{n} {\rm p}^{y_{i}}(1-{\rm p})^{(1-y_{i})}},
\end{aligned}
$$
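where ${\rm p} = \sigma(\beta^T x_{i})$ is the predicted probability that $y_{i}=1$ for the $i$-th example, and $y_{i} \in \{0, 1\}$ is a **Bernoulli** variable (note that we are back to 0/1 labels here).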

The **log likelihood function** is more convenient to use:

$$
\begin{aligned}
\log {\mathop{\mathcal{L}}}(\beta;X) &= \log {\displaystyle \prod_{i=1}^{n} {\rm p}^{y_{i}}(1-{\rm p})^{(1-y_{i})}} \\
&= {\displaystyle \sum_{i=1}^{n} y_{i} \log {\rm p} + (1 - y_{i}) \log(1- {\rm p}) }
\end{aligned}
$$

The negative of this expression is the **cross entropy** loss, so maximizing the **log likelihood** is equivalent to minimizing the cross entropy.
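
The likelihood product and the log likelihood sum can be compared numerically with a minimal sketch. It assumes labels in {0, 1} and takes `p` to be the predicted probability of class 1 for each example. The label and probability values are made up, and the cross entropy is taken here as the negative of the log likelihood sum, the usual sign convention for a loss.

```python
import numpy as np

def log_likelihood(y, p):
    """Sum over examples of y*log(p) + (1 - y)*log(1 - p), for y in {0, 1}."""
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Made-up labels and predicted probabilities of class 1.
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])

# Likelihood as a product: use p where y = 1 and (1 - p) where y = 0.
likelihood = float(np.prod(np.where(y == 1, p, 1 - p)))

ll = log_likelihood(y, p)
cross_entropy = -ll  # negative log likelihood (summed, not averaged)

print(likelihood, ll, cross_entropy)
print(np.isclose(np.log(likelihood), ll))
```

Working with the sum rather than the product also avoids numerical underflow when $n$ is large, since a product of many probabilities quickly shrinks toward zero.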