# What is Logistic Regression?

## Overview

This tutorial covers the basics of logistic regression. It is also a continuation of the Intro to Machine Learning post, "What is Linear Regression?", which can be found [here](https://towardsdatascience.com/what-is-linear-regression-e44d2c4bf025).

### So what is Logistic Regression?

It is a little counterintuitive, but Logistic Regression is typically used as a classifier. In fact, it is one of the most widely used and well-known classification methods among Data Scientists. The idea behind this classification method is that the output will be between 0 and 1, essentially the probability that the data you gave the model belongs to a certain group or class. From there the developer can set a threshold, depending on how much tolerance for error they are willing to accept.

For example, I may set a threshold of 0.8, which means any output from the Logistic Regression model equal to or greater than 0.8 will be classified as a 1, and anything less will be classified as a 0. From there I can move that 0.8 threshold up or down depending on my use case and the metrics I care about.
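To make the thresholding step concrete, here is a minimal sketch. The probabilities are made-up values used only for illustration (they do not come from a trained model), and the variable names are my own.

```python
import numpy as np

# Hypothetical model outputs (probabilities of belonging to class 1)
probabilities = np.array([0.95, 0.80, 0.79, 0.40, 0.10])

# Anything equal to or greater than the threshold becomes a 1, the rest a 0
threshold = 0.8
predicted_classes = (probabilities >= threshold).astype(int)

print(predicted_classes)  # [1 1 0 0 0]
```

Lowering `threshold` turns more of the outputs into 1s, and raising it turns more of them into 0s.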

### How does Logistic Regression relate to Linear Regression?

The Linear Regression formula is included in the Logistic Regression formula. 
If you recall, the Linear Regression formula is: $$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon $$

Well...the formula for Logistic Regression is: $$ p = \frac{1}{1 + e^{-y}} $$

And we can swap out the `y` with our Linear Regression formula: $$ p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}} $$
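One way to see the connection is to rearrange the formula for `p`. This is a standard algebra step rather than anything specific to this tutorial. Since $$ 1 - p = \frac{e^{-y}}{1 + e^{-y}} $$ dividing `p` by `1 - p` gives $$ \frac{p}{1 - p} = e^{y} $$ and taking the natural log of both sides gives:

$$ \ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n $$

In other words, the Linear Regression part models the log of the odds, and the Logistic Regression formula converts those log-odds back into a probability.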

### What is this equation doing?
The Linear Regression portion produces some output value, and the Logistic Regression portion squeezes that value into the range between 0 and 1 (it approaches but never actually reaches either end). This forms an S-curve, which you may have seen before.
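As a quick sketch of that squeezing behaviour, the snippet below applies the formula to a handful of linear outputs. The helper name `sigmoid` and the input values are my own choices for illustration.

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued linear output z to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


# A few example outputs from the linear part, from very negative to very positive
z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
p = sigmoid(z)

for z_value, p_value in zip(z, p):
    print(f"linear output = {z_value:+.1f} -> probability = {p_value:.4f}")
```

Large negative inputs land close to 0, large positive inputs land close to 1, and an input of 0 lands exactly on 0.5, which is what gives the curve its S shape.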

If not, here is an example from the Scikit Learn docs:

![s-curve](https://scikit-learn.org/stable/_images/sphx_glr_plot_logistic_001.png)

## Let's Go Through An Example

For this example, we really just need Scikit Learn. If you aren't familiar with Scikit Learn, it is one of the most popular Python libraries for running machine learning algorithms. I am going to create some fake sample data about patients and whether they should be approved for some type of treatment. I will create columns for age, weight, average resting heart rate, and the approval decision. We will use every column except the approval decision as our features. Our model should estimate how much each feature contributes to the approval decision, and we should get a probability of the patient being approved for the treatment.

**Note**: I will generate some sample data and use Numpy to store it in an array. Numpy isn't specifically needed to run Logistic Regression; it is used just for the sake of the example. Also, this data is completely made up. It is not indicative of any real treatments or decision making by medical professionals.

### Imports

Here we are importing the libraries Numpy and Scikit Learn. Numpy is used to create data structures, while scikit learn is used for a lot of machine learning use cases.

In [1]:
# Import numpy to create numpy arrays
import numpy as np

# Import Scikit-Learn to allow us to run Logistic Regression
from sklearn.linear_model import LogisticRegression

### Creating some sample data

The data created here is fake sample data: each patient has an age, a weight, and an average resting heart rate.
Our approval decision will be a column called `approved` containing 0s and 1s that indicate whether or not the patient was approved: 0 means the patient was not approved, and 1 means the patient was approved.

In [2]:
# Creating sample data
approved = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])
age = [21, 42, 35, 33, 63, 70, 26, 31, 52, 53]
weight = [110, 180, 175, 235, 95, 90, 175, 190, 250, 185]
avg_hrt = [65, 70, 72, 77, 67, 62, 68, 65, 73, 75]

### Structuring the data

These two lines shape the data properly for use with Scikit Learn. We stack the multiple lists (the x's) into one Numpy array, while also specifying the "shape" of the `approved` array.

In [3]:
# Combining the multiple lists into one object called "X"
X = np.column_stack([age, weight, avg_hrt])
# Reshaping the approvals to work with scikit learn
y = approved.reshape(len(approved), )
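If you want to check what the stacking step produced, a small sketch like the one below prints the shapes. It repeats the sample data from above so it can run on its own.

```python
import numpy as np

# Same sample data as in the tutorial
approved = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])
age = [21, 42, 35, 33, 63, 70, 26, 31, 52, 53]
weight = [110, 180, 175, 235, 95, 90, 175, 190, 250, 185]
avg_hrt = [65, 70, 72, 77, 67, 62, 68, 65, 73, 75]

# Each list becomes one column, so every row describes one patient
X = np.column_stack([age, weight, avg_hrt])
y = approved.reshape(len(approved), )

print(X.shape)  # (10, 3): 10 patients, 3 features
print(y.shape)  # (10,): one label per patient
```

Scikit Learn expects the features as a 2-D array with one row per sample and the labels as a 1-D array. Since `approved` is already 1-D here, the reshape leaves it unchanged and mainly makes the intended shape explicit.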

### Creating a Logistic Regression Model

Now we can create the shell of the logistic regression model using this line below.

In [4]:
# Instantiating the model object
model = LogisticRegression(fit_intercept = False)

Once we have created the shell, we need to fit the model with data. This is how we can get the Beta values and find some function that can approximate a patient's approval status - given their age, weight, and average resting heart rate.

In [5]:
# Fitting the model with data
fitted_model = model.fit(X, y)

We have trained a Logistic Regression model! Now let's take a look at the coefficients.

In [6]:
# Printing out the coefficient for each feature (there is no intercept because we set fit_intercept = False)
print(f"The coefficient, or the beta one value, for age = {fitted_model.coef_[0][0]}")
print(f"The coefficient, or the beta two value, for weight = {fitted_model.coef_[0][1]}")
print(f"The coefficient, or the beta three value, for average resting heart rate = {fitted_model.coef_[0][2]}")

The coefficient, or the beta one value, for age = -0.2820214641206744
The coefficient, or the beta two value, for weight = -0.06858172570968078
The coefficient, or the beta three value, for average resting heart rate = 0.3403873155933503


## Interpreting the model & its coefficients

The sign of each coefficient tells us the direction of its effect: age and weight have negative coefficients, so higher values push the probability of approval down, while average resting heart rate has a positive coefficient and pushes it up. By magnitude, average resting heart rate has the largest coefficient (about 0.34), closely followed by age (about -0.28), with the patient's weight coming in last (about -0.07) for this specific data and case. Keep in mind that the features are on different scales, so raw coefficient sizes are only a rough guide to importance. This is useful for understanding which features contribute to your model and impact your decisions. When talking about explainable AI, having a way of interpreting the model and seeing how it came to its decision is extremely important. In this case we can look at the coefficients to see how impactful each feature was and why an algorithm may approve some people but not others.
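One common way to read Logistic Regression coefficients is to convert them to odds ratios by exponentiating them. The sketch below does that for the coefficient values printed above; the dictionary and variable names are my own, and the interpretation assumes the other features are held fixed.

```python
import math

# Coefficient values copied from the printed output above
coefficients = {
    "age": -0.2820214641206744,
    "weight": -0.06858172570968078,
    "average resting heart rate": 0.3403873155933503,
}

# exp(coefficient) is the factor by which the odds of approval are multiplied
# when that feature increases by one unit
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"A one-unit increase in {name} multiplies the odds of approval by {ratio:.3f}")
```

An odds ratio below 1 means the feature lowers the odds of approval as it increases, and an odds ratio above 1 means it raises them.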

## Let's test the model on new data

### Create new data

In [7]:
new_age = [20, 45, 33, 31, 62, 71, 72, 25, 30, 53, 55]
new_weight = [105, 175, 170, 240, 100, 95, 200, 170, 195, 255, 180]
new_avg_hrt = [64, 68, 70, 78, 67, 61, 68, 67, 66, 75, 76]

# Combining the multiple lists into one object called "test_X"
test_X = np.column_stack([new_age, new_weight, new_avg_hrt])

### Run new data through model

In [8]:
results = fitted_model.predict(test_X)

### Take a look at the results

In [9]:
print(f"Our approval results are: {results}")

Our approval results are: [1 1 1 0 0 0 0 1 1 0 0]


As you can see, Scikit Learn automatically applied a threshold for us (for a two-class problem like this one, `predict()` uses 0.5) and determined our approvals. If you want to look at the actual probabilities, you can use a different function provided by Scikit Learn:

In [10]:
results_w_probs = fitted_model.predict_proba(test_X)
print("Our approval results with their probabilities:")
for result in results_w_probs:
    print(f"Probability of being 0 (not approved) = {result[0]:.2f}, Probability of being 1 (approved) = {result[1]:.2f}")

Our approval results with their probabilities:
Probability of being 0 (not approved) = 0.00, Probability of being 1 (approved) = 1.00
Probability of being 0 (not approved) = 0.28, Probability of being 1 (approved) = 0.72
Probability of being 0 (not approved) = 0.00, Probability of being 1 (approved) = 1.00
Probability of being 0 (not approved) = 0.84, Probability of being 1 (approved) = 0.16
Probability of being 0 (not approved) = 0.92, Probability of being 1 (approved) = 0.08
Probability of being 0 (not approved) = 0.99, Probability of being 1 (approved) = 0.01
Probability of being 0 (not approved) = 1.00, Probability of being 1 (approved) = 0.00
Probability of being 0 (not approved) = 0.00, Probability of being 1 (approved) = 1.00
Probability of being 0 (not approved) = 0.00, Probability of being 1 (approved) = 1.00
Probability of being 0 (not approved) = 1.00, Probability of being 1 (approved) = 0.00
Probability of being 0 (not approved) = 1.00, Probability of being 1 (approved) = 0.00

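**Note**: The rows of probabilities are in the same order as the rows of the Numpy array (`test_X`) we passed to the `predict_proba()` function earlier. Within each row, the first value is the probability of class 0 and the second is the probability of class 1.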

From here we can set different thresholds, which changes the number of approvals. If we want to be cautious about approving people, we can set the threshold to 0.8 or 0.9. If the treatment is safe, non-invasive, and low cost, we can set the threshold lower, to 0.25 or 0.3. This all depends on the use case.
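Here is a minimal sketch of applying your own thresholds to the probabilities. It refits the same model on the same sample data so it can run on its own. The thresholds 0.3, 0.5, and 0.8 and the variable names are just illustrative choices, and the exact counts will depend on the fitted model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training data from the tutorial
approved = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])
age = [21, 42, 35, 33, 63, 70, 26, 31, 52, 53]
weight = [110, 180, 175, 235, 95, 90, 175, 190, 250, 185]
avg_hrt = [65, 70, 72, 77, 67, 62, 68, 65, 73, 75]
X = np.column_stack([age, weight, avg_hrt])

# New data from the tutorial
new_age = [20, 45, 33, 31, 62, 71, 72, 25, 30, 53, 55]
new_weight = [105, 175, 170, 240, 100, 95, 200, 170, 195, 255, 180]
new_avg_hrt = [64, 68, 70, 78, 67, 61, 68, 67, 66, 75, 76]
test_X = np.column_stack([new_age, new_weight, new_avg_hrt])

model = LogisticRegression(fit_intercept=False).fit(X, approved)

# Column 1 of predict_proba holds the probability of class 1 (approved)
approval_probs = model.predict_proba(test_X)[:, 1]

approvals_by_threshold = {}
for threshold in (0.3, 0.5, 0.8):
    decisions = (approval_probs >= threshold).astype(int)
    approvals_by_threshold[threshold] = int(decisions.sum())
    print(f"Threshold {threshold}: {decisions} -> {decisions.sum()} approvals")
```

Raising the threshold can only keep the number of approvals the same or reduce it, which is why a stricter threshold suits cases where a wrong approval is costly.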

## What would you do next?

Some next steps on how to make your model better include:
- Adding more data.
- Normalize your data. Scale each column between 0 and 1 or between -1 and 1 to help the model. Without scaling, learning is harder for the model because it can give more weight to features that simply have naturally larger values.
- Test out different thresholds.
- Learn about classification metrics. (Scikit Learn's [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) is a good place to start)
- Tailor your model towards classification metrics that make sense for your use case. Accuracy may not be the best metric. (An example would be when you have imbalanced classes)
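As a rough sketch of two of these steps together (scaling and classification metrics), the snippet below chains a `MinMaxScaler` with the model and prints a classification report. The choice of `MinMaxScaler`, the use of a pipeline, and keeping the default intercept are my own assumptions, not something prescribed above. Because the sample data has no labels for the new patients, the report is computed on the training data, which will look more optimistic than results on unseen data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Training data from the tutorial
approved = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])
age = [21, 42, 35, 33, 63, 70, 26, 31, 52, 53]
weight = [110, 180, 175, 235, 95, 90, 175, 190, 250, 185]
avg_hrt = [65, 70, 72, 77, 67, 62, 68, 65, 73, 75]
X = np.column_stack([age, weight, avg_hrt])

# Scale every column to the 0-1 range, then fit Logistic Regression.
# The default intercept is kept here because the scaled features are not centered.
scaled_model = make_pipeline(MinMaxScaler(), LogisticRegression())
scaled_model.fit(X, approved)

# Evaluated on the training data only (no held-out labels exist in this example)
train_predictions = scaled_model.predict(X)
report = classification_report(
    approved, train_predictions, target_names=["not approved", "approved"]
)
print(report)
```

With real data you would hold out a labeled test set and compute the report on that instead.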

## Conclusion

I hope this tutorial was informative and that you learned something about Logistic Regression. Logistic Regression is a fundamental tool in any Data Scientist's tool belt. You can look at the coefficients to learn which data is impacting your model. You can use Logistic Regression as a baseline to determine whether other models may be better. Logistic Regression is quick to train and doesn't require a lot of data, although in some cases it can be too simple. Therefore, don't count out more complex or additional methods to make your model better!

## Bio

Frankie Cancino is a Senior AI Scientist for Target and the founder of the Data Science Minneapolis group.

### Links
* [Scikit-Learn Docs](https://scikit-learn.org/stable/)
* [LinkedIn](https://www.linkedin.com/in/frankie-cancino/)
* [Twitter](https://twitter.com/frankiecancino)