# An Introduction To Classification With Logistic Regression

Author: K. Voudouris (2022)

In this tutorial, we will extend linear regression to deal with binary variables, creating an algorithm that can classify examples into a binary set of categories, namely, logistic regression. Both linear and logistic regression form special cases of a broad family of models called Generalized Linear Models, used widely in statistics. They also form the basis for simple Artificial Neural Networks, which we will begin to investigate by looking at the perceptron and how that can be extended to multi-layer perceptron models.

First, let's import the libraries.

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
import math
import ipympl
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
plt.rcParams['figure.figsize'] = (20.0, 10.0)

Next, we will import the dataset. This week, we will be using a dataset from [here](https://www.sheffield.ac.uk/mash/statistics/datasets), which includes various details about newborn babies. It includes the following variables:
- ID: Baby number
- length: Length of baby (cm)
- Birthweight: Weight of baby (kg)
- headcirumference: Head circumference
- Gestation: Gestation (weeks)
- smoker: Mother smokes, 1 = smoker and 0 = non-smoker
- motherage: Maternal age
- mnocig: Number of cigarettes smoked per day by mother
- mheight: Mother's height (cm)
- mppwt: Mother's pre-pregnancy weight (kg)
- fage: Father's age
- fedyrs: Father's years in education
- fnocig: Number of cigarettes smoked per day by father
- fheight: Father's height (cm)
- lowbwt: Low birth weight, 0 = no and 1 = yes
- mage35: Mother over 35, 0 = no and 1 = yes

In [None]:
birthweight_data = pd.read_csv("./Birthweight_reduced_kg_R.csv", na_values = ' ')

birthweight_data.head() #show the first 5 rows of the dataset

## Introducing Classification

In linear regression up to now, we have been predicting values of *y* based on values of **x**. In the cases we looked at, the y-variable has been a continuous numerical scale, like the Sepal Width or Petal Length of an iris flower. But what happens if *y* only takes on a small number of discrete values? In this case, we don't want to output a value on a continuous scale, we want our prediction about *y* given **x** to be one of the discrete values. Essentially, we want to *classify* a new data point, characterised by a set of x-values, **x**, as one of those discrete values. This clearly has important uses. Say we want to predict whether a child has clinically low birth weight (a binary variable). It would make no sense to predict intermediate values between 0 and 1, because a baby is either of low birth weight or it is not. Classification algorithms give us the tools to make predictions when the dependent variable is discrete, and intermediate values are meaningless. Many neural networks and other ML methods are classifiers, allowing us to make discrete predictions, rather than continuous ones.

In this tutorial, we will focus on the case where the dependent variable, **y**, is binary. We will be looking at logistic regression to solve the binary classification problem. Logistic regression is one of the simplest and oldest classification techniques out there. Understanding it paves the way for a deeper understanding of how neural networks work and ultimately how to build better ones.

First, let's see more clearly why linear regression is inadequate for classification problems such as these. We will make up a dataset first.

In [None]:
## Let's make up a dataset first

x =  np.linspace(-5, 5, 20)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1])
plt.scatter(x, y)
plt.show()

Let's implement linear regression to see what happens. We can implement it quickly using sklearn, as we saw last week.

In [None]:
LR = LinearRegression()
LR.fit(x.reshape(-1,1), y)
pred = LR.predict(x.reshape(-1,1))
plt.scatter(x, y, label="Actual")
plt.plot(x, pred, label="Predicted")
plt.legend() # display the labels defined above
plt.show()

Let's do the same with some real data from the Babyweight dataset. Let's theorise that the length of a newborn baby is predictive of whether that baby is of low birthweight. Fill in the appropriate bits of code. We want to plot `Length` on the x-axis and `lowbwt` on the y-axis.

In [None]:
## Fill in the appropriate bits of code

birthweight_data['Length'] = pd.to_numeric(birthweight_data['Length'], errors='coerce')
birthweight_data['lowbwt'] = pd.to_numeric(birthweight_data['lowbwt'], errors='coerce')
babylength = birthweight_data[''].values
babyweight = birthweight_data[''].values


LR.fit(babylength.reshape(-1,1), babyweight)
pred = LR.predict(babylength.reshape(-1,1))
plt.scatter(babylength, babyweight, label="Actual")
plt.plot(babylength, pred, label="Predicted")
plt.title('Babyweight')
plt.xlabel('Length of Baby at birth / cm') # '\c' is not a valid Python escape sequence
plt.ylabel('Low Birth Weight Binary (1 = low birthweight)')

Clearly, using linear regression as we have done up to now is not working well for this classification problem. Indeed, in general linear regression suffers from 3 problems here:
1. A high error rate - a lot of the predicted y-values, $\hat{y}$, are very different from the real y-values, **y** in the dataset, especially as *x* tends towards $\infty$ and $-\infty$.
2. The model predicts values greater than 1 and less than 0 a lot of the time.
3. It is very sensitive to outliers, see below for what would happen to our predictor if a baby was born 4ft in length (!)

In [None]:
babylength_outlier = np.append(babylength, 120)
babyweight_outlier = np.append(babyweight, 0)

LR.fit(babylength_outlier.reshape(-1,1), babyweight_outlier)
pred = LR.predict(babylength_outlier.reshape(-1,1))
plt.scatter(babylength_outlier, babyweight_outlier, label="Actual")
plt.plot(babylength_outlier, pred, label="Predicted")
plt.title('Babyweight with Outlier')
plt.xlabel('Length of Baby at birth / cm') # '\c' is not a valid Python escape sequence
plt.ylabel('Low Birth Weight Binary (1 = low birthweight)')

## Introducing Logistic Regression

So, the linear regression model representation is insufficient for these classification problems. As a reminder, the linear regression model representation is this:
$$\hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + ... + \beta_{n}x_{n}$$
It can also be written as a *hypothesis function*, basically stating what our hypothesis is about the dependent variable given a bunch of x-values.
$$h_{\beta}(\mathbf{x}) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + ... + \beta_{n}x_{n}$$
In vectorised notation, this can be written as:
$$h_{\beta}(\mathbf{x}) = \mathbf{β}^{T}\mathbf{x}$$ 
*where $x_{0} = 1$*

We will proceed with the latter notation for simplicity, but note that all three of the equations above are just notational variants of each other. Revisit the last section of the previous tutorial if you wish to understand why.
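
As a quick sanity check on that equivalence, here is a minimal sketch with made-up coefficients and a single made-up example (none of these numbers come from the dataset). It computes the hypothesis once term by term and once as a dot product with $x_{0} = 1$ prepended.

```python
import numpy as np

# Hypothetical coefficients (beta0, beta1, beta2) and one example with two features.
beta = np.array([0.5, 2.0, -1.0])
features = np.array([3.0, 4.0])

# Written out term by term: beta0 + beta1*x1 + beta2*x2
written_out = beta[0] + beta[1] * features[0] + beta[2] * features[1]

# Vectorised: prepend x0 = 1 so the intercept is absorbed into the dot product.
x = np.concatenate(([1.0], features))
vectorised = beta @ x

print(written_out, vectorised)
```

Both numbers should agree, which is all the vectorised notation is claiming.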

What we would like to do is adapt the hypothesis function above so that our output is bounded between 0 and 1. We do this by using a *link function*, or (in ML) an *activation function*, which essentially transforms an output in a certain way to give us some desired properties. There are several functions that can transform the output of the linear function above from y-values below 0 and above 1 to y-values bounded between 0 and 1. Here, we will use the **logistic function**, also known as the **sigmoid function**, whose handy properties explain why it is so ubiquitous. (Strictly speaking, the **logit** is the inverse of the logistic function: it maps a probability back onto the unbounded linear scale.)

The logistic function looks like this:
$$g(z) = \frac{1}{1+e^{-z}}$$
What this means is that whatever value you take in as *z*, you transform it by putting negative z (-z) as the exponent of *e* (Euler's number), adding 1 to that, and then dividing 1 by the result. You don't need to fully understand what this means, the important thing is that g(z) is always greater than 0 and less than 1. More formally:
$$0<g(z)<1$$

We can now wrap this function around our linear regression model representation from earlier, and only generate results between 0 and 1:
$$h_{\beta}(\mathbf{x}) = g(\mathbf{β}^{T}\mathbf{x}) = \frac{1}{1+e^{-\mathbf{β}^{T}\mathbf{x}}}$$

Before we get into the details, it's worth pointing out that we can now simply state what we need to do to solve these kinds of classification problems. We need to find values of $\mathbf{β}$ that result in small error between $h_{\beta}(\mathbf{x})$ (or $\hat{y}$) as produced by the above model representation, and actual values of *y* in our dataset.

Now write a function that computes the logistic function given a vector of values, note that the exponent function is `np.exp()`:

In [None]:
## Write a function to calculate values of g(z) given a vector of z values

def sigmoid_fun(vector):
    ##Insert code here
    return(result)

In [None]:
x_sigmoid =  np.linspace(-5, 5, 1000)
y_sigmoid = sigmoid_fun(x_sigmoid)
plt.scatter(x_sigmoid, y_sigmoid)
plt.title('Sigmoid Curve')
plt.xlabel('Linear model, beta-transpose-x')
plt.ylabel('Hypothesis')

What are the axes here? The x-axis is $\beta^{T}x$, the set of x values and their beta coefficients, just as in the linear model. In the simplest case, where $\beta_{0} = 0$ (the y-intercept) and $\beta_{1} = 1$ (the gradient associated with x), the x-axis would just be the raw x-values. The y axis is the result of transforming $\beta^{T}x$ according to the logistic function.
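
If you want something to compare your `sigmoid_fun` against, the following is one possible sketch of the logistic function. The name `logistic` and the test values are my own choices for illustration. It checks the two properties discussed above: every output lies strictly between 0 and 1, and $g(0) = 0.5$.

```python
import numpy as np

def logistic(z):
    """Return 1 / (1 + exp(-z)) element-wise for an array of z values."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# A few illustrative z values, symmetric around zero.
z_values = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
g_values = logistic(z_values)

print(g_values)
print("all strictly between 0 and 1:", np.all((g_values > 0) & (g_values < 1)))
```

Note that the curve is symmetric about the point (0, 0.5), so $g(-z) = 1 - g(z)$.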

### A Note on Generalised Linear Models

As mentioned, there are several link functions that can be used that increase smoothly between two values. The *tanh* (pronounced *tansh* or *tanch*) function is commonly used in ML. The log link function is used often in statistics for Poisson regression. These all constitute members, along with plain old linear regression, of the **Generalised Linear Model** family (GLM). They are all the same in nature, with just the details of the link function differing.

There are two main ways to interpret GLMs. The first is to interpret them geometrically, in terms of minimising the distance between what the model predicts and the actual values in the dataset. The second is to interpret them probabilistically, in terms of maximising the probability that the predicted value matches the actual value in the dataset. This is the maximum likelihood approach, which I will introduce here. Maximum likelihood estimation can also be applied to the linear regression cases we have already discussed.

## The Probabilistic (Maximum Likelihood) Interpretation of Logistic Regression

(NOTE: this section borrows a lot from [Andrew Ng's Stanford Lecture Notes](https://see.stanford.edu/materials/aimlcs229/cs229-notes1.pdf), credit where credit is due)

There is still a problem with the logistic function we have got. While we have bounded our hypothesis function to be between 0 and 1, it still predicts values of *y* that are not real in some sense. What we can do is assume that the sigmoid curve instead provides a probability that some set of x-values is truly categorised as 0 or 1. Assuming this brings in some assumptions about the data, which we will discuss below. For now, let's show how we can interpret the output of the sigmoid function as a probability.

Let's assume that:
$$P(y=1|x;β) = h_{β}(x)$$
$$P(y=0|x;β) = 1 - h_{β}(x)$$

This means that the probability of y being equal to 1, **given some x-values and some beta-values**, is equal to the output of our model representation, i.e., the point on the sigmoid curve corresponding to our x-values.
By definition, the probability of y being equal to 0 is one minus the probability of it being equal to 1.

We can re-write these two equations as one:
$$p(y|x;β) = (h_{β}(x))^{y}(1 - h_{β}(x))^{1-y}$$

To see how this is equivalent to the above two equations, replace all ys with 1s and then replace all ys with 0s.



For y = 1:
$$p(y=1|x;β) = (h_{β}(x))^{y = 1}(1 - h_{β}(x))^{1-(y=1)}$$
Which is equivalent to:
$$p(1|x;β) = (h_{β}(x))^{1}(1 - h_{β}(x))^{1-1}$$
Which is equivalent to (recall that something to the power of 0 is 1):
$$p(1|x;β) = (h_{β}(x))^{1}(1 - h_{β}(x))^{0}$$
Which is equivalent to:
$$p(1|x;β) = (h_{β}(x))^{1}*1$$
Which is equivalent to:
$$p(1|x;β) = (h_{β}(x))$$



For y = 0:
$$p(y=0|x;β) = (h_{β}(x))^{y = 0}(1 - h_{β}(x))^{1-(y=0)}$$
Which is equivalent to:
$$p(0|x;β) = (h_{β}(x))^{0}(1 - h_{β}(x))^{1-0}$$
Which is equivalent to:
$$p(0|x;β) = (h_{β}(x))^{0}(1 - h_{β}(x))^{1}$$
Which is equivalent to:
$$p(0|x;β) = 1*(1 - h_{β}(x))^{1}$$
Which is equivalent to:
$$p(0|x;β) = (1 - h_{β}(x))$$

Now that we can interpret the output of the sigmoid function as a probability, we have a really useful tool for defining a cost function to optimise logistic regression.

## Cost Functions for Logistic Regression

How do we go about finding the coefficients (or parameters) **β** that maximise the probability that y = 0 or y = 1, given a set of x-values? In other words, which beta values maximise the likelihood that y = 0 or y = 1? Just as we did with Mean Squared Error in the previous tutorial, we want to find the optimum beta values, and to do this we use a cost function. However, in this case, we don't want to minimise the function, we want to *maximise* it. Other than that, the problem is exactly the same.

To do this, we need to know how the probability of y given x changes depending on different beta values. We can write this down as a *likelihood function*:
$$L(β) = p(y|x;β)$$
This computes how the probability changes as we adjust beta. 

Let's imagine a very simple dataset with one predictor variable, x, and one predicted variable, y. Now, in our dataset, we have a list of 0s and 1s as our y-values and a list of x-values of whatever variety (but converted to numerical values). Given a candidate beta value, we want to calculate the probability that y is a 0 or a 1, and we need to do this for each of the x-values in the list. So, for the *i*-th x-value, we compute the probability using the above equations as follows: $p(y^{(i)}|x^{(i)};β)$. We want to iterate through all the x-values and get the probability that y equals whatever it actually equals for each x. When we have all these probabilities, we want to combine them to give a likelihood value for that particular beta value. We can't simply add them all together, like we did with Mean Squared Error, because this would often result in probabilities over 1. Instead, we multiply them all together (creating the joint probability), which we denote with a capital pi symbol, $\prod$. For m training examples, we get:
$$L(β) = \prod^{m}_{i = 1}p(y^{(i)}|x^{(i)};β)$$
Breaking that down into the likelihood function we have above, we get:
$$L(β) = \prod^{m}_{i = 1}(h_{β}(x^{(i)}))^{y^{(i)}}(1 - h_{β}(x^{(i)}))^{1-y^{(i)}}$$

The likelihood function gives us the likelihood of a y-value, $y^{(i)}$, in our dataset based on our model, $h_{β}(x)$ and an x-value $x^{(i)}$. When our model is good at predicting y-values, the likelihood value should get closer to 1. Let's see how this works by looking at some concrete cases.

Good case:
$$h_{β}(x) = 0.75, y = 1$$
$$L(β) = 0.75^{1}*(1-0.75)^{(1-1)} = 0.75$$
$$h_{β}(x) = 0.15, y = 0$$
$$L(β) = 0.15^{0}*(1-0.15)^{(1-0)} = 0.85$$
$$\text{Joint probability: } 0.75*0.85 = 0.6375$$

Bad case:
$$h_{β}(x) = 0.25, y = 1$$
$$L(β) = 0.25^{1}*(1-0.25)^{(1-1)} = 0.25$$
$$h_{β}(x) = 0.85, y = 0$$
$$L(β) = 0.85^{0}*(1-0.85)^{(1-0)} = 0.15$$
$$\text{Joint probability: } 0.25*0.15 = 0.0375$$

So, we want to find the beta parameter(s) that maximise the likelihood.
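
The two concrete cases above can be reproduced in a few lines. This is only a sketch: the helper name `likelihood` is my own, and the predicted probabilities are the ones used in the worked examples, not values fitted to any data.

```python
import numpy as np

def likelihood(h, y):
    """Joint probability of the labels y given predicted probabilities h."""
    h = np.asarray(h, dtype=float)
    y = np.asarray(y, dtype=float)
    # Each term is h when y = 1 and (1 - h) when y = 0; np.prod multiplies them.
    return np.prod(h**y * (1 - h)**(1 - y))

y = np.array([1, 0])                   # the two labels from the examples above
good = likelihood([0.75, 0.15], y)     # predictions that agree with the labels
bad = likelihood([0.25, 0.85], y)      # predictions that disagree with the labels

print(good, bad)
```

The 'good' predictions give the larger joint probability, which is exactly what maximising the likelihood rewards.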

There is one small adjustment left to make. Just as we can maximise the likelihood, $L(β)$, we can also maximise any function of $L(β)$ (so long as it strictly increases). To make things easier, we therefore transform the likelihood function into the *log likelihood*, taking the natural logarithm of the likelihood:
$$l(β) = log(L(β))$$
Which can be expanded to the following:
$$l(β) = \sum^{m}_{i = 1}[y^{(i)}log(h(x^{(i)}))+(1-y^{(i)})log(1-h(x^{(i)}))]$$

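We do this for two reasons. Multiplying many probabilities together quickly produces numbers too small for a computer to represent accurately, whereas summing their logarithms does not, and sums are also easier to differentiate than products. Otherwise, the intuition is exactly the same. Now we just have to maximise $l(β)$. Note that the log likelihood, unlike the likelihood, is not bounded between 0 and 1: because it is the natural logarithm of a probability, it is always less than or equal to 0, and values closer to 0 indicate a better fit.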
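
One practical motivation for working with logs can be sketched as follows. The setup is entirely made up for illustration: 2000 examples, each predicted with probability 0.5. Multiplying the probabilities underflows to zero in floating point, while the sum of logs remains a perfectly ordinary number.

```python
import numpy as np

# Hypothetical setup: 2000 examples, all labelled 1, each predicted with h = 0.5.
h = np.full(2000, 0.5)
y = np.ones(2000)

# Likelihood as a product of per-example probabilities.
per_example = h**y * (1 - h)**(1 - y)
product = np.prod(per_example)

# Log likelihood as a sum of per-example log probabilities.
log_sum = np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

print("likelihood:", product)       # too small to represent, so it collapses to 0.0
print("log likelihood:", log_sum)   # a finite negative number
```

Once the product has collapsed to zero there is nothing left to compare between candidate beta values, whereas the log likelihood still distinguishes them.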

That's a *lot* of equations! Let's try it out with some data to try to get the intuition. Let's take it step-by-step. First, let's plot whether or not a baby was of low birth weight as a function of length again.

In [None]:
birthweight_data['Length'] = pd.to_numeric(birthweight_data['Length'], errors='coerce')
birthweight_data['lowbwt'] = pd.to_numeric(birthweight_data['lowbwt'], errors='coerce')
babylength = birthweight_data['Length'].values
babyweight = birthweight_data['lowbwt'].values

plt.scatter(babylength, babyweight)

plt.title('Babyweight')
plt.xlabel('Length of Baby at birth / cm') # '\c' is not a valid Python escape sequence
plt.ylabel('Low Birth Weight Binary (1 = low birthweight)')


Now we will build the logistic regression. Recall that our equation is this:
$$h_{\beta}(\mathbf{x}) = g(\mathbf{β}^{T}\mathbf{x}) = \frac{1}{1+e^{-\mathbf{β}^{T}\mathbf{x}}}$$

First, build a function that calculates $\mathbf{β}^{T}\mathbf{x}$, just as we did with linear regression. Recall that this is identical to $\mathbf{y} = \beta_{0} + \beta_{1}\mathbf{x}_{1}$. 

It needs to take as input a y-intercept, $β_{0}$, a gradient coefficient $β_{1}$, and a vector of x's.

In [None]:
def linregress(beta0, beta1, datax):
    ##Insert code here
    return(z)

Now use the output of this function, a vector, as input to your sigmoid function from earlier. Play around with the beta parameters to find ones that work. This is essentially squidging the linear regression between 0 and 1.

In [None]:
b0 = 49
b1 = -1.03

z = linregress(b0, b1, babylength)
LogitTransformedVector = sigmoid_fun(z)

plt.scatter(babylength, babyweight)
plt.scatter(babylength, LogitTransformedVector)

Using these functions that you have built, build a function that calculates the log likelihood:
$$l(β) = \sum^{m}_{i = 1}[y^{(i)}log(h(x^{(i)}))+(1-y^{(i)})log(1-h(x^{(i)}))]$$

It should take as input beta0, beta1, a vector of x values and a vector of y values.

The function for the natural logarithm is `np.log()`.

In [None]:
def loglikelihood_fun(beta0, beta1, datax, datay):
    ## Insert code here
    
    return(loglikelihood)

loglikelihood_fun(50, -0.5, babylength, babyweight) # you should get a value of -863.000

Let's plot a range of values of the log likelihood for different values of beta0 and beta1.

In [None]:
beta0, beta1 = np.meshgrid(np.linspace(48,52,100), np.linspace(-1.2,-0.8,100)) # create a bunch of values that define a grid surface
beta0 = np.ndarray.flatten(beta0) #flatten them to put in a dataframe
beta1 = np.ndarray.flatten(beta1)

ll_df = {
    'beta0': beta0,
    'beta1': beta1} # make a dictionary
ll_df = pd.DataFrame(ll_df) #convert it to a dataframe

ll_df['LogLikelihood'] = ll_df.apply(lambda row : loglikelihood_fun(row['beta0'], row['beta1'], babylength, babyweight), axis = 1) # this function applies your function to every beta0 and beta1 value

ax = plt.axes(projection='3d')
ax.set_title('')
ax.set_xlabel('beta0', fontsize = 15) #change the axis labels as necessary
ax.set_ylabel('beta1', fontsize = 15)
ax.set_zlabel('Log Likelihood', fontsize = 15)
ax.scatter(ll_df['beta0'], ll_df['beta1'], ll_df['LogLikelihood'], alpha=0.2)


As with linear regression, some values of β fit the data better than others, and the job of the machine learning algorithm is to find the optimum. In this case, we want to maximise likelihood, so we use **gradient ascent**, which is exactly like **gradient descent**, except we ascend *up* the gradient rather than down. The log likelihood surface has the neat property of being concave, so any local maximum is also the global maximum. However, there is no closed form solution like the Normal Equations to find this maximum, so we have to use an iterative algorithm like **gradient ascent**.

Gradient ascent has an update rule like this:
$$\beta_{0} := \beta_{0} + \alpha\frac{\partial}{\partial (\beta_{0})}l(\beta_{0}, \beta_{1})$$
$$\beta_{1} := \beta_{1} + \alpha\frac{\partial}{\partial (\beta_{1})}l(\beta_{0}, \beta_{1})$$

Note the positive rather than negative sign in the update, since we are maximising rather than minimising the cost function.

The partial derivative of the log likelihood is nice and simple, because of the nature of the logistic function (which is why we choose it!). You can check out the derivation [here on page 18](https://see.stanford.edu/materials/aimlcs229/cs229-notes1.pdf).
Suffice it to say, our update rule looks like this:
$$\beta_{0} := \beta_{0} + \alpha(y^{(i)} - h_{β}(x^{(i)}))x_{0}^{(i)}$$
$$\beta_{1} := \beta_{1} + \alpha(y^{(i)} - h_{β}(x^{(i)}))x_{1}^{(i)}$$
*where $x_{0}^{(i)} = 1$*
This looks very similar to our gradient descent algorithm for linear regression. However, it is different in that $h_{β}(x^{(i)})$ is the result of a logistic function, not our linear model representation from the previous tutorial.
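
To make the update rule concrete, here is a minimal sketch of gradient ascent on the made-up 20-point dataset from the start of this tutorial. Several details are my own assumptions rather than part of the tutorial: the helper names, the starting values of zero, the learning rate of 0.1, the number of iterations, and the choice to average the update over all examples on each step (a batch version) instead of applying it one example at a time.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta0, beta1, x, y):
    h = logistic(beta0 + beta1 * x)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# The made-up dataset from the beginning of the tutorial.
x = np.linspace(-5, 5, 20)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1], dtype=float)

beta0, beta1 = 0.0, 0.0   # assumed starting point
alpha = 0.1               # assumed learning rate
ll_start = log_likelihood(beta0, beta1, x, y)

for _ in range(2000):
    error = y - logistic(beta0 + beta1 * x)   # (y - h) for every example
    beta0 += alpha * np.mean(error)           # x0 = 1, so the error is used as is
    beta1 += alpha * np.mean(error * x)       # error weighted by x1

ll_end = log_likelihood(beta0, beta1, x, y)
print("beta0, beta1:", beta0, beta1)
print("log likelihood before and after:", ll_start, ll_end)
```

The log likelihood should move towards 0 as the loop runs, and the fitted slope should be positive because the 1s sit at the larger x-values.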

### Interpreting the coefficients

You may know that interpreting logistic regression is usually done in terms of odds ratios. This comes from rearranging the logistic function so that we can interpret what the coefficients **β** mean. If the probability, p, is defined as $\frac{1}{1+e^{-\mathbf{β}^{T}\mathbf{x}}}$, then we can understand $\mathbf{β}^{T}\mathbf{x}$ as $log(p/(1-p))$. This is the log of the ratio between the probability of y = 1 and the probability of y = 0. We can then understand the coefficients **β** just as we understood them in linear regression. A one unit increase in *x* results in a *β* unit increase in the log odds of being categorised as y = 1. 
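
The log odds relationship can be checked numerically. The sketch below reuses the illustrative values `b0 = 49` and `b1 = -1.03` from the earlier cell (they are starting guesses, not fitted coefficients) and two arbitrary baby lengths one centimetre apart.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = 49.0, -1.03                 # illustrative values, not fitted coefficients
lengths = np.array([47.0, 48.0])     # two hypothetical lengths, one unit apart

p = logistic(b0 + b1 * lengths)      # probability of y = 1 at each length
odds = p / (1 - p)
log_odds = np.log(odds)

print("log odds:", log_odds)
print("change in log odds per unit of x:", log_odds[1] - log_odds[0])
print("odds ratio per unit of x:", odds[1] / odds[0], "vs exp(b1):", np.exp(b1))
```

The change in log odds for a one unit increase in x recovers `b1`, and exponentiating `b1` gives the odds ratio that is usually reported.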

The probabilistic interpretation of logistic regression is extremely useful, allowing us to quantify the probability that some new data are a member of a particular category, rather than giving a binary yes/no answer to whether they are, or generating meaningless intermediary values. However, it does import some assumptions. The most important of these are that the observations are independent of one another and that each *y* follows a Bernoulli distribution whose probability is given by $h_{β}(x)$, which in turn means that the log odds are assumed to be linear in *x*. Unlike linear regression, logistic regression makes no normality assumption, so in statistics we check for independence and for linearity of the log odds rather than for normally distributed residuals.

As has been clear over these tutorials, these methods are general to as many independent variables as you want. There are also extensions of logistic regression for multiple dependent variables and for discretely ordered dependent variables with more than 2 categories.

Try implementing logistic regression with the scikit-learn package:

In [None]:
model = LogisticRegression()
reg = model.fit(babylength.reshape(-1,1), babyweight)
print(reg.coef_, reg.intercept_)

Using the above code as a template, try fitting a logistic regression model to predict babyweight using several independent variables.

## A Geometric Intuition about Classifiers

Classifiers can also be interpreted geometrically, although it is somewhat less rigorous (in my opinion). We can think of the logistic regression as finding a *linear separator*, a line or (hyper)plane that divides up the cluster of y=1 and y=0. Let's plot low birthweight as a function of length and head circumference.

In [None]:
birthweight_data['Length'] = pd.to_numeric(birthweight_data['Length'], errors='coerce')
birthweight_data['lowbwt'] = pd.to_numeric(birthweight_data['lowbwt'], errors='coerce')
birthweight_data['Headcirc'] = pd.to_numeric(birthweight_data['Headcirc'], errors='coerce')

plt.scatter(birthweight_data.Length, birthweight_data.Headcirc, c=birthweight_data.lowbwt, s=75)

What logistic regression does is try to find a line that separates the yellow dots from the purple dots. Play around with the slope and intercept parameters to try to plot a line that separates the groups.

In [None]:
SLOPE = -1.5
INTERCEPT = 110
x_vals = np.linspace(40,60,1000)
y_vals = INTERCEPT + (SLOPE * x_vals)

plt.scatter(birthweight_data.Length, birthweight_data.Headcirc, c=birthweight_data.lowbwt, s=75)
plt.plot(x_vals, y_vals, '--')

We can think of the logistic regression as finding the optimum line separating the groups, which is why the classic linear regression equation turns up in the logistic function from earlier. We can think of the maximum likelihood estimation as minimising the number of 'yellow' points on the 'purple' side of the line, and the number of 'purple' points on the 'yellow' side of the line, and the distance of those *misclassified* points from the linear separator. You can formalise this using tools from linear algebra, namely vector norms, but we won't do that here.
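
The link between the separating line and the logistic model can be sketched directly. With two predictors, the model outputs a probability of exactly 0.5 wherever $\beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} = 0$, and that set of points is a straight line. In the sketch below, the slope and intercept are the starting values from the cell above, while the coefficients are hypothetical ones chosen so that their boundary coincides with that line. Which side counts as y = 1 depends on the sign of the coefficients and is arbitrary here.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

SLOPE, INTERCEPT = -1.5, 110.0   # starting values from the cell above

# Hypothetical coefficients whose boundary is that line:
# beta0 + beta1*Length + beta2*Headcirc = 0
#   <=>  Headcirc = -(beta0 / beta2) - (beta1 / beta2) * Length
beta2 = 1.0
beta1 = -SLOPE * beta2
beta0 = -INTERCEPT * beta2

length = np.linspace(40, 60, 5)
headcirc_on_line = INTERCEPT + SLOPE * length

# Probabilities on the line, and 2 units above it.
p_on_line = logistic(beta0 + beta1 * length + beta2 * headcirc_on_line)
p_above = logistic(beta0 + beta1 * length + beta2 * (headcirc_on_line + 2))

print(p_on_line)   # 0.5 everywhere on the line
print(p_above)     # the same value everywhere, on one side of 0.5
```

Scaling all three coefficients by the same positive number leaves the line where it is but makes the probabilities change more sharply as you move away from it.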

Using the distance of misclassified points from the linear separator makes where we place the line very sensitive to outliers. If there was a yellow point in the top right of the above plot, it would shift the optimum line up a bit, to minimise the overall distance between the line and the misclassified points. To get around this, we need a function that diminishes the effect of these outliers as they get further and further away from the main cluster of points. The logistic function does exactly this, as it tapers off towards 0 and 1. So now we have another motivation for the logistic function for the geometric interpretation. It makes finding the linear separator far less sensitive to outliers than just using a plain old linear regression line-of-best-fit.

It's a bit complicated to understand how the geometric interpretation and the probabilistic interpretation of logistic regression line up. I find it easier to think of them as complementary perspectives on the classification problem, and they ultimately lead to the same result via different means. However, they do highlight different useful properties of logistic regression:
1. The probabilistic interpretation allows us to quantify the probability that some data are a member of a particular category. However, the observations need to be independently sampled, with each *y* following a Bernoulli distribution, for us to do this. No normality assumption is involved.
2. The geometric interpretation highlights that logistic regression is finding a *linear separator* between the two groups (in binary logistic regression), and gives us an intuition as to why we can wrap the classic linear regression model representation of y = mx + c inside the logistic function. However, as we will see below, this means that logistic regression can only separate the groups cleanly when the data are *linearly separable*.

## The Importance of Logistic Regression for Machine Learning

You have probably come across logistic regression in statistics classes. However, it is very important to machine learning, since it is very similar to the *perceptron learning algorithm*. The perceptron was developed by Frank Rosenblatt in 1958 and further developed during the 60s. It was thought to be an approximate model of how neurons work when it comes to classifying things, by working in an all-or-nothing manner. It was used to classify rudimentary images in the late 50s, by outputting whether an image was a member of one of two classes as a yes/no answer.

It did this by having a step-like function, instead of our logistic function:

In [None]:
x = np.linspace(-5, 5, 100)
x = np.append(x, np.linspace(0,0,50))
y = np.linspace(0, 0, 50)
y = np.append(y, np.linspace(1,1,50))
y = np.append(y, np.linspace(0,1,50))
plt.scatter(x, y)
plt.show()

The function can be written as:
$$
s(z)=
\begin{cases}
0, & \text{if}\ z < 0 \\
1, & \text{if}\ z \geq 0
\end{cases}
$$

The step function outputs a value of 0 if z is less than 0 and a 1 otherwise (assigning z = 0 to the upper step is a convention). The threshold for the step can be changed depending on the application. If we let z be $β^{T}x$, our linear separator, then we are forcing the algorithm to classify things on either side of the linear separator. The perceptron algorithm can then learn what the best values of beta are using an update of the same form as the gradient ascent rule above:
$$\beta_{0} := \beta_{0} + \alpha(y^{(i)} - s(β^{T}x^{(i)}))x_{0}^{(i)}$$
$$\beta_{1} := \beta_{1} + \alpha(y^{(i)} - s(β^{T}x^{(i)}))x_{1}^{(i)}$$

Whilst the algorithm is a bad representation of the data, $(y - s(β^{T}x))$ will be either -1 or 1 for every misclassified point, meaning the gradient ascent will be taking big steps. As it gets better, these values will become 0, and the gradient ascent rule will converge on optimal beta values.
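
Here is a minimal sketch of that update in action. Everything specific in it is an assumption made for illustration: the toy data (20 points labelled 1 whenever x is positive, so the classes are linearly separable), the starting values of zero, the learning rate, the cap on the number of passes, and the convention that the step function returns 1 when z is exactly 0.

```python
import numpy as np

def step(z):
    """Step activation; returning 1 at z == 0 is a convention assumed here."""
    return np.where(z >= 0, 1.0, 0.0)

# Hypothetical, linearly separable toy data.
x = np.linspace(-5, 5, 20)
y = (x > 0).astype(float)

beta0, beta1 = 0.0, 0.0
alpha = 0.1

for epoch in range(1000):
    mistakes = 0
    for xi, yi in zip(x, y):
        error = yi - step(beta0 + beta1 * xi)   # -1, 0 or 1
        if error != 0:
            beta0 += alpha * error              # x0 = 1
            beta1 += alpha * error * xi
            mistakes += 1
    if mistakes == 0:                           # a full pass with no errors: stop
        break

predictions = step(beta0 + beta1 * x)
print("passes needed:", epoch + 1)
print("all points classified correctly:", np.array_equal(predictions, y))
```

Because the updates only happen on misclassified points, the coefficients stop changing as soon as a separating line has been found.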

The perceptron algorithm is therefore identical to logistic regression, except that instead of the logistic function, it has this step-like function. However, that makes it difficult to interpret the output of the perceptron probabilistically or use maximum likelihood estimation to optimise it.

The perceptron is important in the history of machine learning, because it forms the basis of simple feedforward neural networks. However, we usually don't use the step-function from the original perceptron, and instead use the logistic function because of its utility.

### The First AI Winter

The perceptron algorithm was a step forward in machine learning, allowing rudimentary classification with little human input. Everyone was very excited. But then, in 1969, Marvin Minsky and Seymour Papert published a book, *Perceptrons*, demonstrating some of the limits of the perceptron algorithm, and by extension, logistic regression. This contributed to a sharp decline in neural network research until the rise of connectionism in the 80s. Let's look at one case of the limitations of logistic regression.

The geometric interpretation of logistic regression led us to the conclusion that it finds a linear separator between groupings. This is problematic, because not all data are linearly separable. Take the data below:

In [None]:
x, y = np.meshgrid(np.linspace(0,5,20), np.linspace(0,5,20))
x = np.ndarray.flatten(x)
y = np.ndarray.flatten(y)
labs = np.linspace(0,0,400)

for i in range(len(labs)): # visit all 400 grid points (range(0,399) skipped the last one)
    if x[i] > 1 and x[i] < 4 and y[i] > 1 and y[i] < 4:
        labs[i] = 1

plt.scatter(x, y, c=labs)

This is known as the 'donut problem'. Clearly, a linear separator is inadequate here.

The Achilles heel for the perceptron back in the day was another problem. Researchers back then were interested in modelling logic using computers. It was spotted that the perceptron and logistic regression are unable to learn an XOR rule. The XOR rule goes like this:

| X1 | X2 | X1 XOR X2 |
| --- | --- | --- |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

The XOR operator outputs 1 only when the two elements are exclusive, i.e., they have different values, and outputs 0 when the two elements are not exclusive, i.e., they have the same value. If you look at the plot below, you'll see that you can't draw a line separating the two groups:

In [None]:
x1 = np.array([0,0,1,1])
x2 = np.array([0,1,0,1])
XOR = np.array([0,1,1,0])

plt.scatter(x1, x2, c=XOR, s = 150)
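
As a rough numerical illustration of the same point, the sketch below tries a grid of candidate lines and records the best accuracy any of them achieves on the four XOR points. The grid range, its resolution, and the rule of predicting 1 when the linear score is positive are all assumptions made for this example. A grid search is not a proof, but no line can do better than three out of four.

```python
import numpy as np
from itertools import product

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
xor = np.array([0, 1, 1, 0])

grid = np.linspace(-2, 2, 17)   # candidate values for beta0, beta1 and beta2
best_accuracy = 0.0

for b0, b1, b2 in product(grid, repeat=3):
    # Predict 1 on the positive side of the line b0 + b1*x1 + b2*x2 = 0.
    predictions = (b0 + X @ np.array([b1, b2]) > 0).astype(int)
    best_accuracy = max(best_accuracy, np.mean(predictions == xor))

print("best accuracy of any line on the grid:", best_accuracy)
```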

Logistic regression, as a linear classifier, cannot classify these when we just take X1 and X2 as input. There are ways to classify the output of the XOR rule. The first is to use a new input, the product of X1 and X2:

| X1 | X2 | X1\*X2 | X1 XOR X2 |
| --- | --- | --- | --- |
| 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 |


In [None]:
ax = plt.axes(projection='3d')
ax.scatter(x1, x2, x1*x2, c=XOR, s=150)

Now that we take the product of the two variables as a third input, there is a plane that can divide the two groups. See if you can use scikit-learn to build a logistic regression that can correctly classify all these points, and compare it to the logistic regression that uses only X1 and X2 as independent variables.
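
To see that such a plane exists before fitting anything, here is a sketch with hand-picked coefficients. They are not fitted by any algorithm, and many other choices would work equally well. Predicting 1 whenever the probability exceeds 0.5 is also an assumption of this example.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])
xor = np.array([0, 1, 1, 0])

# Hand-picked (not fitted) coefficients for the intercept, x1, x2 and x1*x2.
beta0, beta1, beta2, beta3 = -0.5, 1.0, 1.0, -2.0

z = beta0 + beta1 * x1 + beta2 * x2 + beta3 * (x1 * x2)
probabilities = logistic(z)
predictions = (probabilities > 0.5).astype(int)

print("linear scores:", z)
print("predictions:  ", predictions)
print("matches XOR:  ", np.array_equal(predictions, xor))
```

The large negative weight on the product term is what pulls the point (1, 1) back onto the 0 side of the plane.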

This works nicely for XOR, but it doesn't always solve the problem when data are not linearly separable. For example, it struggles to find a linear separator for the donut problem above. 

What we need is something that can find non-linear separators. That is what neural networks do. We can link up several logistic regressions, which take as input the output of another logistic regression, enabling us to find 'wiggly' separators when the thing we are trying to classify is complex. You can think of a neural network as a set of linked up logistic regressions, which is why they are sometimes called 'Multi-layer Perceptrons'. This is what connectionists like Geoffrey Hinton, James McClelland, and David E. Rumelhart innovated in the 1980s, ending the first AI Winter.

We will leave neural networks to the next tutorial. However, we now have all the tools to really understand what they are doing. You can think of Neural Networks as just more complex regressions (logistic and linear). They use link functions (like the logistic), cost functions, and gradient descent/ascent, and just apply them to contexts too complex for linear methods.