In [12]:
import numpy as np
from matplotlib import pyplot as plt
from pprint import pprint
%matplotlib inline

# Predicting

The linear equation $0=ax+by+c$ describes a straight line in a 2D plane.

In machine learning, the same equation is written in vector form, where:
- $(x,y) \rightarrow (x_1, x_2) = \textbf{X}$
- constants are renamed to $w_i$
- the bias term / intercept $= w_0$

$h(\textbf{X}) = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n$

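Where $h()$ is a *linear combination* of the components of $\textbf{X}$. In vector form this is represented as $h(\textbf{X}) = \textbf{W}^T\textbf{X}$ or $h(\textbf{X}) = \textbf{W}\bullet\textbf{X}$, with a constant $x_0 = 1$ prepended to $\textbf{X}$ so that the bias $w_0$ is included in the product.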

This allows us to work with planes (3D) or hyperplanes (>3D).
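As a quick check of the vector form, the sketch below compares the explicit sum with the dot product for a single sample. The weights and feature values are made up for illustration, and `x_with_bias` is a hypothetical name for the sample with $x_0 = 1$ prepended.

```python
import numpy as np

# Hypothetical weights and one sample, chosen only for illustration
w = np.array([0.5, -1.0, 2.0])   # [w_0 (bias), w_1, w_2]
x = np.array([3.0, 4.0])         # [x_1, x_2]

# Explicit sum: w_0 + w_1*x_1 + w_2*x_2
h_explicit = w[0] + w[1] * x[0] + w[2] * x[1]

# Vector form: prepend x_0 = 1 so the bias is absorbed into the dot product
x_with_bias = np.concatenate(([1.0], x))
h_vector = w.dot(x_with_bias)

print(h_explicit, h_vector)
```

Both expressions give the same number, which is why the bias can be handled by adding a column of ones to the data.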

Logistic regression turns this linear combination into a probability by passing it through a sigmoid function. There are various sigmoid functions, but we will use the following:

$\sigma(z) = \frac{1}{1+ e^{-z}} \in (0,1)$, with a $y$-intercept of $\sigma(0) = 0.5$

So to calculate our probability of an output belonging to a class, we will use:

$y = \sigma(\textbf{W}^T\textbf{X})$
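The properties of this sigmoid can be checked numerically. The sketch below is only an illustration on a handful of arbitrary `z_values`; it shows that the output stays strictly between 0 and 1, that $\sigma(0) = 0.5$, and that $\sigma(-z) = 1 - \sigma(z)$.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real z into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Arbitrary inputs, symmetric around zero
z_values = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probs = sigmoid(z_values)

print(probs)
print(sigmoid(0.0))                                 # value at z = 0
print(np.allclose(sigmoid(-z_values), 1 - probs))   # symmetry check
```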

Logistic regression assumes that the classes can be separated by a line, a plane, or a hyperplane, i.e. by a linear decision boundary.

## Calculating the output of a logistic classifier

In [13]:
# Create normally distributed matrix
N = 100
D = 2

X = np.random.randn(N,D)
X.shape

(100, 2)

In [14]:
# Create a bias term = column of ones
ones = np.ones((N, 1))
ones.shape

(100, 1)

In [15]:
# Concatenate vectors to add the bias term
Xb = np.concatenate((ones,X), axis=1)
Xb.shape

(100, 3)

In [16]:
# Randomly initialise a weight vector
w = np.random.randn(D + 1)
w.shape

(3,)

In [17]:
# Calculate the dot product of Xb and w (one value of z per row of Xb)
z = Xb.dot(w)
z.shape

(100,)

In [18]:
def sigmoid(z):
    return 1/(1+np.exp(-z))

sig_z = sigmoid(z) # Calculate probability of y given x
sig_z

array([0.45953914, 0.49461177, 0.21312108, 0.53386388, 0.91529471,
       0.64430297, 0.89616554, 0.8953878 , 0.53240738, 0.12782428,
       0.32352807, 0.28921246, 0.30620428, 0.79895575, 0.36417114,
       0.11955983, 0.15422503, 0.54304219, 0.56050085, 0.4601314 ,
       0.35405477, 0.63573829, 0.14284789, 0.63046775, 0.67807454,
       0.44202639, 0.80620858, 0.88566566, 0.65055425, 0.38111102,
       0.22537933, 0.95732103, 0.58246147, 0.30358739, 0.24885075,
       0.90928501, 0.31118029, 0.13352612, 0.31873313, 0.59053022,
       0.92289846, 0.31194472, 0.3419616 , 0.14869494, 0.90288809,
       0.71714416, 0.3206792 , 0.84955526, 0.51186351, 0.40488691,
       0.79302755, 0.71168917, 0.7044027 , 0.58254855, 0.31110882,
       0.59342342, 0.68958463, 0.81913778, 0.1551673 , 0.83078454,
       0.91808648, 0.59885064, 0.30710988, 0.46490165, 0.5973267 ,
       0.31027135, 0.26122523, 0.80469842, 0.75381591, 0.5366081 ,
       0.3853422 , 0.96117809, 0.7864576 , 0.73192863, 0.78855

## Interpreting the output

The output of logistic regression is the sigmoid of $w^{T}x$, which represents the probability that $y$ belongs to class 1 given $x$:

$$output = p(y=1|x) = \sigma(w^{T}x) \in (0,1)$$

$y$ can only take 2 values so:

$$p(y=1|x) + p(y=0|x) = 1$$

In ML we can then round this value: if $p(y=1|x) > 0.5$, predict class 1; otherwise, predict class 0.

As a point moves away from the dividing line, $|w^{T}x|$ grows, so $\sigma(w^{T}x)$ approaches 0 or 1 and it becomes increasingly clear which group the point belongs to.
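The rounding rule can be sketched as follows. The activations in `z` are hypothetical values of $w^{T}x$, not taken from the cells above. Because $\sigma(0) = 0.5$ and the sigmoid is increasing, thresholding the probability at 0.5 gives the same predictions as checking the sign of $w^{T}x$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical activations z = w^T x for five points
z = np.array([-6.0, -0.5, 0.2, 3.0, 8.0])

p_class1 = sigmoid(z)      # p(y=1|x)
p_class0 = 1 - p_class1    # p(y=0|x), since the two must sum to 1

# Thresholding the probability at 0.5 ...
pred_from_prob = (p_class1 > 0.5).astype(int)
# ... is equivalent to checking which side of the dividing line the point is on
pred_from_sign = (z > 0).astype(int)

print(p_class1)
print(pred_from_prob, pred_from_sign)
```

Points with large $|z|$ get probabilities very close to 0 or 1, while points near the dividing line stay close to 0.5.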

## Summary

- Input features $x$ are weighted by $w$ through $\textbf{W}^T\textbf{X}$
- The result is mapped into the range $(0,1)$ by the sigmoid through $y=\sigma(\textbf{W}^T\textbf{X})$
- $\textbf{W}$ needs to be adjusted to provide the right answer $\rightarrow$ training

# Training

Training means finding the weights so that the model correctly describes the event being modelled.

In general, the weights must be found through gradient descent, but if the data is distributed in a specific way, they can be found with a closed-form solution. The problem below is one such case.

Problem: <br>
- Data from 2 Gaussian distributed classes
- Same covariance but different means

#### Bayes rule

$p(Y|X) = \frac{p(X|Y)p(Y)}{p(X)}$

- $p(X|Y)$ is the Gaussian, calculated from the mean and covariance of the data in each class
- The prior $p(Y)$ is the maximum likelihood estimate (frequency estimate) $\rightarrow$ e.g. $p(Y=1) = \frac{\text{number of times class 1 appeared}}{\text{total number of samples}}$
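The quantities in the list above could be estimated as in the sketch below. The data is synthetic, and the class sizes, means, and seed are assumptions made only for illustration. Because the two classes are assumed to share one covariance, a single pooled maximum likelihood covariance is computed after centering each class on its own mean.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Synthetic data (assumption): two 2D Gaussian classes with the same
# covariance (identity) but different means
n0, n1 = 300, 100
X0 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(n0, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(n1, 2))
X = np.vstack((X0, X1))
Y = np.array([0] * n0 + [1] * n1)

# Prior: frequency estimate, # times class 1 appeared / # total
prior1 = np.mean(Y == 1)
prior0 = 1 - prior1

# Class means
mu0 = X[Y == 0].mean(axis=0)
mu1 = X[Y == 1].mean(axis=0)

# Shared covariance: center each class on its own mean, then pool
centered = np.vstack((X[Y == 0] - mu0, X[Y == 1] - mu1))
cov = centered.T.dot(centered) / len(X)

print(prior1, mu0, mu1)
print(cov)
```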

We manipulate Bayes rule to fit in the logistic regression framework

$p(y=1|x) = \frac{p(x|y=1)p(y=1)}{p(x)} = \frac{p(x|y=1)p(y=1)}{p(x|y=1)p(y=1)+p(x|y=0)p(y=0)}$

$p(x) = p(x|y=1)p(y=1)+p(x|y=0)p(y=0)$ because $p(x)$ is the marginal of $p(x, y)$ (sum over all possible values of y - [resources](https://towardsdatascience.com/probability-concepts-explained-marginalisation-2296846344fc)) and $p(x, y) = p(x | y)p(y)$ as per the rules of conditional probability.

Simplify the equation by dividing the numerator and the denominator by $p(x|y=1)p(y=1)$ and we get $p(y=1|x) = \frac{1}{1+\frac{p(x|y=0)p(y=0)}{p(x|y=1)p(y=1)}}$
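The simplification can be checked numerically. The sketch below uses two 1D Gaussians with the same variance; the means, variance, and prior are hypothetical values, and `gaussian_pdf` is a helper written for this example. It computes the posterior once with Bayes rule directly and once with the $\frac{1}{1+\text{ratio}}$ form, and the two should agree.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical parameters: same variance, different means
mean0, mean1, var = -1.0, 1.0, 1.0
prior1 = 0.3
prior0 = 1 - prior1

x = np.linspace(-4, 4, 9)

lik0 = gaussian_pdf(x, mean0, var)  # p(x|y=0)
lik1 = gaussian_pdf(x, mean1, var)  # p(x|y=1)

# Bayes rule with p(x) expanded as the marginal over y
posterior_bayes = lik1 * prior1 / (lik1 * prior1 + lik0 * prior0)

# Same quantity after dividing numerator and denominator by p(x|y=1)p(y=1)
posterior_ratio = 1 / (1 + (lik0 * prior0) / (lik1 * prior1))

print(np.allclose(posterior_bayes, posterior_ratio))
print(posterior_ratio)
```

The second form has the same $\frac{1}{1+\dots}$ shape as the sigmoid, which is what lets Bayes rule fit into the logistic regression framework.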