### Introduction

Logistic regression is a method for binary classification: it divides the points in a dataset into two distinct classes, or categories. For simplicity, let’s call them class A and class B. The model gives us the probability that a given point belongs to class B. If that probability is below 50%, we classify the point as class A; otherwise, it falls in class B.

Logistic regression suits this purpose better than linear regression with a threshold. A linear fit produces unbounded values that cannot be read as probabilities, and the threshold would have to be chosen by hand. Logistic regression instead fits an S-shaped curve (the sigmoid function), whose output also conveys certainty, since it is a value between 0 and 1 rather than just a one or a zero. Here is the standard logistic function. Note that the output is always between 0 and 1 but never reaches either of those values.

<img src="https://i0.wp.com/www.stokastik.in/wp-content/uploads/2017/07/sigmoid.png?w=400">
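The behaviour of the curve can also be checked numerically. The following is a minimal sketch, assuming `numpy` is available; the helper name `sigmoid` and the sample inputs are chosen here purely for illustration.

```python
import numpy as np

def sigmoid(z):
    """Standard logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A few illustrative inputs, from strongly negative to strongly positive.
z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
probabilities = sigmoid(z)

for value, p in zip(z, probabilities):
    print(f"sigmoid({value:+.1f}) = {p:.4f}")
```

An input of 0 maps to exactly 0.5, which is why 50% is the natural cutoff between the two classes. Larger inputs move the output toward 1 and smaller inputs move it toward 0.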

#### When to Use

Logistic regression is great for situations where you need to classify between two categories. Good examples include whether an applicant is accepted or rejected, and whether a competitor wins or loses.

#### How does it work?

Logistic regression works using a linear combination of inputs, so multiple information sources can govern the output of the model. The parameters of the model are the weights of the various features, and represent their relative importance to the result. In the equation that follows, you should recognize the formula used in linear regression. Logistic regression is, at its base, a transformation from a linear predictor to a probability between 0 and 1.

![logisticEquation](../img/WikiLogisticEQ)
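To make the transformation concrete, here is a small sketch of the two steps: a linear combination of the inputs, followed by the logistic function. The weights, bias, and feature values are hypothetical numbers picked for illustration, not fitted parameters, and `predict_probability` is a helper name introduced here.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration (not fitted values).
weights = np.array([0.8, -0.5])  # one weight per feature
bias = 0.1                       # intercept term

def predict_probability(features, weights, bias):
    # Step 1: the same linear predictor used in linear regression.
    linear_predictor = np.dot(features, weights) + bias
    # Step 2: squash the linear predictor into a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-linear_predictor))

features = np.array([2.0, 1.0])
p = predict_probability(features, weights, bias)

# Classify with the usual 50% cutoff.
label = 1 if p >= 0.5 else 0
print(p, label)
```

Here the linear predictor is 0.8 * 2.0 - 0.5 * 1.0 + 0.1 = 1.2, which is positive, so the probability is above 0.5 and the point is assigned to class 1.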

#### Code

In the example, `scikit-learn` and `numpy` are used to train a simple logistic regression model. The model is basic, but extensible. With logistic regression, more features could be added to the data set seamlessly, simply as a column in the 2D arrays.
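As a rough illustration of adding a feature as a column, the sketch below stacks a second, made-up feature next to the scores. The `attendance` feature, the sample count of 200, and the fixed seed are all assumptions for this example only and are not part of the walkthrough that follows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable

exam_scores = rng.integers(0, 1001, size=200)  # scores from 0 to 1000
attendance = rng.integers(0, 101, size=200)    # hypothetical second feature

# Each feature becomes one column; each sample is one row.
X = np.column_stack([exam_scores, attendance])
y = (exam_scores >= 700).astype(int)

model = LogisticRegression(solver='liblinear').fit(X, y)

print(X.shape)            # rows = samples, columns = features
print(model.coef_.shape)  # one weight per feature
```

The model call is unchanged; only the shape of the input array grows from one column to two, and the fitted model gains one weight per added feature.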

The code creates a 2D array representing the training input. In this case it is 1000 x 1, since there are 1000 samples and 1 feature. These inputs are scores from 0 to 1000. A training output array is also created, with a classification of 1 for pass and 0 for fail, based on a threshold of 700. Then scikit-learn’s `LogisticRegression` class is used to fit a classifier to the data. The next step is to test for accuracy on a different data set, so we create another 100 random samples and predict against them using the model.

In [4]:
# load the required libraries
from sklearn.linear_model import LogisticRegression
import numpy as np
import random

In [5]:
#Generate a random dataset which includes random scores from 0 to 1000.
x = np.array([ random.randint(0,1000) for i in range(0,1000) ])

In [7]:
#defines the classification for the training data.
def true_classifier(i):
    if i >= 700:
        return 1
    else:
        return 0

In [8]:
#The model will expect a 2D array, so we must reshape
#For the model, the 2D array must have rows equal to the number of samples,
#and columns equal to the number of features.
#For this example, we have 1000 samples and 1 feature.
x = x.reshape((-1, 1))

In [9]:
#For each point, y is a pass/fail for the grade. The simple threshold is arbitrary,
#and can be changed as you would like. Classes are 1 for success and 0 for failure
y = [ true_classifier(x[i][0]) for i in range(0,1000) ]

In [10]:
#Again, we need a numpy array, so we convert.
y = np.array(y)

#Our goal is to train a logistic regression model that reproduces the same pass/fail threshold.
model = LogisticRegression(solver='liblinear')

#The fit method actually fits the model to our training data
model = model.fit(x,y)
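
Once a one-feature model like this is fitted, the score at which it switches from fail to pass can be read off its parameters. The following is a self-contained sketch of that idea; it rebuilds the same kind of data with a fixed seed (an assumption, for repeatability) and uses the fitted `coef_` and `intercept_` attributes. The names `boundary` and `p_at_boundary` are introduced here for illustration.

```python
import random
import numpy as np
from sklearn.linear_model import LogisticRegression

random.seed(0)  # fixed seed is an assumption, only so the sketch is repeatable

# Same kind of setup as the walkthrough: 1000 scores, pass at 700 or above.
x = np.array([random.randint(0, 1000) for _ in range(1000)]).reshape(-1, 1)
y = (x[:, 0] >= 700).astype(int)

model = LogisticRegression(solver='liblinear').fit(x, y)

# With one feature the model is sigmoid(coef * score + intercept),
# so the probability is 0.5 where coef * score + intercept = 0.
coef = model.coef_[0][0]
intercept = model.intercept_[0]
boundary = -intercept / coef

# Probability of class 1 (pass) evaluated at the boundary score.
p_at_boundary = model.predict_proba([[boundary]])[0][1]
print("learned boundary:", boundary)
print("P(pass) at boundary:", p_at_boundary)
```

The learned boundary need not land exactly on 700, because the model estimates it from the data rather than being told the rule.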

In [11]:
#Create 100 random samples to try against our model as test data
samples = [random.randint(0,1000) for i in range(0,100)]
#Once again, we need a 2d Numpy array
samples = np.array(samples)
samples = samples.reshape(-1, 1)

In [12]:
#Now we use our model against the samples. proba is the probability, and _class is the class.
_class = model.predict(samples)
proba = model.predict_proba(samples)

In [13]:
num_accurate = 0

In [14]:
#Finally, output the results, formatted for nicer viewing.
#The format is [<sample value>]: Class <class number>, probability [ <probability for class 0> <probability for class 1>]
#So, the probability array is the probability of failure, followed by the probability of passing.
#In an example run, [7]: Class 0, probability [  9.99966694e-01   3.33062825e-05]
#means that for value 7, the class is 0 (failure) and the probability of failure is over 99.9%
for i in range(0,100):
    #samples[i] is a one-element array, so take the scalar score before classifying
    if true_classifier(samples[i][0]) == _class[i]:
        num_accurate = num_accurate + 1
    print(str(samples[i]) + ": Class " + str(_class[i]) + ", probability " + str(proba[i]))
#skip a line to separate overall result from sample output
print("")
print(str(num_accurate) + " out of 100 correct.")

[541]: Class 0, probability [0.89636615 0.10363385]
[712]: Class 1, probability [0.42488221 0.57511779]
[491]: Class 0, probability [0.9466904 0.0533096]
[662]: Class 0, probability [0.60267125 0.39732875]
[913]: Class 1, probability [0.03936819 0.96063181]
[67]: Class 0, probability [9.99873750e-01 1.26250097e-04]
[276]: Class 0, probability [0.99745265 0.00254735]
[474]: Class 0, probability [0.95776858 0.04223142]
[432]: Class 0, probability [0.97647104 0.02352896]
[572]: Class 0, probability [0.84702752 0.15297248]
[446]: Class 0, probability [0.97137103 0.02862897]
[613]: Class 0, probability [0.75428374 0.24571626]
[940]: Class 1, probability [0.02703822 0.97296178]
[599]: Class 0, probability [0.78968288 0.21031712]
[124]: Class 0, probability [9.99713371e-01 2.86628527e-04]
[447]: Class 0, probability [0.9709682 0.0290318]
[172]: Class 0, probability [9.99428367e-01 5.71633365e-04]
[674]: Class 0, probability [0.56068902 0.43931098]
[615]: Class 0, probability [0.74891168 0.251