# Beginners Guide to Logistic Regression in Python
This notebook discusses logistic regression and the math behind it, with a practical example and Python code. Logistic regression is one of the fundamental classification algorithms. In its basic form it handles binary classification problems, but it can be extended to multi-class classification with some modifications.

References:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

https://www.statsmodels.org/stable/generated/statsmodels.discrete.discrete_model.Logit.html#statsmodels.discrete.discrete_model.Logit


# Define a Binary Classification Problem

Set up the environment by installing and importing the necessary libraries.

In [None]:
!python -m pip install pip --upgrade --user -q
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels scikit-learn --user -q

In [None]:
# restart the kernel so the newly installed packages are picked up
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn import metrics

sns.set_style('darkgrid')

Load a binary classification problem from scikit-learn's built-in datasets. The breast cancer data is a binary classification problem with two classes. Load the data and metadata using the following code.

In [None]:
raw_data = load_breast_cancer()

raw_data.keys()

Read Description

We can read more about the loaded data in its DESCR entry.

In [None]:
print(raw_data['DESCR'])

The dataset contains 30 features and one target. The target has two classes: malignant (cancerous, encoded as 0) and benign (non-cancerous, encoded as 1).

Create a pandas dataframe for the features and a pandas series for the target.

In [None]:
data = pd.DataFrame(raw_data['data'], columns=raw_data['feature_names'])
target = pd.Series(raw_data['target'], name='target')
data.head()

Merge the features and the target into a single dataframe so that they can be plotted together.

In [None]:
merge = pd.concat([data, target], axis=1)
merge.head()

Plot each feature against mean radius, colored by target, to see how the features relate to each other and to the classes.

In [None]:
columns = list(data.columns)
plt.figure(figsize=(10,12))
k=1
for col in columns:
  plt.subplot(6,5,k)
  sns.scatterplot(x=col, y='mean radius', hue='target', data=merge)
  k+=1
plt.tight_layout()
plt.show()

Select five of the features as independent variables for the model.

In [None]:
# selected features
features = ['mean radius', 'mean texture', 'mean smoothness', 'mean compactness', 'mean concavity']

X = data[features]
y = target.copy()
X.head()

X holds the selected features and y holds the target. If we trained on all of the available data, nothing would be left to evaluate the model on. Hence, we split the data into training and validation sets. The training set is used to fit the model and the validation set is used to evaluate the fitted model.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size=0.2, random_state=6)
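
Conceptually, the split above shuffles the row positions and holds out a fraction of them. The following is a minimal NumPy sketch of that idea on toy data. The function `simple_split` and the `*_toy` names are hypothetical, and this is not scikit-learn's actual implementation.

```python
import numpy as np

def simple_split(X, y, test_size=0.2, seed=0):
    """Shuffle row positions and hold out a fraction for validation."""
    rng = np.random.default_rng(seed)
    n = len(X)
    n_val = int(np.ceil(test_size * n))   # size of the held-out set
    idx = rng.permutation(n)              # shuffled row positions
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]

# toy data: 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

X_tr, X_va, y_tr, y_va = simple_split(X_toy, y_toy, test_size=0.2, seed=6)
print(X_tr.shape, X_va.shape)
```

Fixing the seed plays the same role as random_state in the call above: the same rows land in the validation set on every run.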


# Logistic Regression using statsmodels library

Logistic regression can be performed using either the scikit-learn library or the statsmodels library. However, the underlying math is easier to explore with statsmodels, because its model summary reports the fitted coefficients.

In [None]:
from statsmodels.api import Logit, add_constant

# add intercept manually
X_train_const = add_constant(X_train)
# build model and fit training data
model_1 = Logit(y_train, X_train_const).fit()
# print the model summary
model_1.summary()

The Logit model estimates the bias (intercept) and the feature weights by Maximum Likelihood Estimation (MLE). In the summary above, the coef column lists the bias (const) followed by the five weights, in that order.
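
To make the role of the bias and the weights concrete, the sketch below applies the logistic (sigmoid) function by hand to made-up numbers. The parameter values and the names `params_demo`, `X_small`, and `y_small` are assumptions for illustration, not the values fitted above. It also evaluates the log-likelihood, which is the quantity that MLE maximizes.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: one bias followed by two weights.
params_demo = np.array([-1.0, 2.0, 0.5])

# Two toy examples with two features each.
X_small = np.array([[0.0, 2.0],
                    [1.5, -1.0]])
y_small = np.array([0, 1])

# Prepend a column of ones so that the bias is multiplied by 1.
X_small_const = np.column_stack([np.ones(len(X_small)), X_small])

z = X_small_const @ params_demo   # linear score (log-odds)
p = sigmoid(z)                    # predicted P(y = 1)

# Log-likelihood of the labels under these parameters.
log_lik = np.sum(y_small * np.log(p) + (1 - y_small) * np.log(1 - p))
print(z, p, log_lik)
```

A score of 0 maps to a probability of exactly 0.5, which is why thresholding the probability at 0.5 is the same as checking the sign of the linear score.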

The predicted probabilities for the training data can be obtained and visualized using the following code.

In [None]:
# Probability Distribution for Training data
prob_train = model_1.predict(X_train_const)

# sort the prob dist for visualization
sorted_train = sorted(prob_train.values)
index_train = np.arange(len(sorted_train))
# plot it
plt.plot(index_train, sorted_train, '+r')
plt.title('Training Data: Probability Distribution', size=14, color='orange')
plt.xlabel('Examples (sorted by output value)')
plt.ylabel('Probability of Logit function')
plt.show()

It can be observed that most predicted probabilities lie close to either 0 or 1, and only a few points fall in the transition between them. Because the transition from 0 to 1 is sharp, the model assigns most points to a class with high confidence. Conventionally, 0.5 is used as the decision threshold. Shifting this threshold slightly up or down changes the classification of very few points. Let's predict the probabilities for the validation data and plot them.

In [None]:
# Probability Distribution for Validation data
X_val_const = add_constant(X_val)
prob_val = model_1.predict(X_val_const)

sorted_val = sorted(prob_val.values)
index_val = np.arange(len(sorted_val))
plt.plot(index_val, sorted_val, '+g')
plt.title('Validation Data: Probability Distribution', size=14, color='orange')
plt.xlabel('Examples (sorted by output value)')
plt.ylabel('Probability of Logit function')
plt.show()

Because the model outputs a continuous value between 0 and 1 rather than a class label, the method is called logistic regression rather than logistic classification.

Let’s perform classification using the probability distribution. Define 0.5 as threshold and classify data points either as 0 or 1.

In [None]:
threshold = 0.5
y_pred = (prob_val > threshold).astype(np.int8)

Evaluate the model using the accuracy score, followed by the full classification report.

In [None]:
metrics.accuracy_score(y_val,y_pred)

In [None]:
print(metrics.classification_report(y_val, y_pred))

The confusion matrix may give better insight into performance.

In [None]:
conf = pd.DataFrame(metrics.confusion_matrix(y_val,y_pred), 
                    index=['Actual Malignant', 'Actual Benign'], 
                    columns=['Predicted Malignant', 'Predicted Benign'])
conf

In total, 9 of the 114 validation points are misclassified.
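
To see how the four cells of a confusion matrix relate to accuracy, precision, and recall, here is a small sketch that counts them by hand. The labels are made up for illustration and are not the validation results above. Class 1 (benign) is treated as the positive class, matching the row and column order used by the confusion matrix above.

```python
import numpy as np

# Made-up labels for illustration only (0 = malignant, 1 = benign).
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 1, 1, 1, 1, 0, 1])

tn = np.sum((y_true_demo == 0) & (y_pred_demo == 0))  # malignant predicted malignant
fp = np.sum((y_true_demo == 0) & (y_pred_demo == 1))  # malignant predicted benign
fn = np.sum((y_true_demo == 1) & (y_pred_demo == 0))  # benign predicted malignant
tp = np.sum((y_true_demo == 1) & (y_pred_demo == 1))  # benign predicted benign

# Rows are actual classes, columns are predicted classes.
conf_demo = np.array([[tn, fp],
                      [fn, tp]])

accuracy = (tp + tn) / conf_demo.sum()
precision = tp / (tp + fp)   # of the points predicted as 1, how many are truly 1
recall = tp / (tp + fn)      # of the points that are truly 1, how many were found
print(conf_demo, accuracy, precision, recall)
```

The off-diagonal cells are the misclassifications, so their sum divided by the total is one minus the accuracy.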

We can try different threshold values manually to check model performance.

In [None]:
accuracies = []
thresholds = np.arange(0.0, 1.01, 0.05)
for th in thresholds:
  y_preds = (prob_val > th).astype(np.int8)
  acc = metrics.accuracy_score(y_val,y_preds)
  accuracies.append(acc)


In [None]:
# plot the accuracy values
plt.plot(thresholds, accuracies, '*m')
plt.xlabel('Threshold')
plt.ylabel('Accuracy')
plt.show()
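
One way to read the best threshold off such a sweep is `np.argmax`. The sketch below uses made-up probabilities and labels (the `*_demo` names are hypothetical), so it runs on its own. Note that a threshold chosen on the same data used for evaluation gives an optimistic accuracy estimate.

```python
import numpy as np

# Made-up predicted probabilities and true labels, for illustration only.
prob_demo = np.array([0.03, 0.22, 0.47, 0.53, 0.61, 0.82, 0.96])
label_demo = np.array([0, 0, 1, 0, 1, 1, 1])

thresholds_demo = np.linspace(0.0, 1.0, 21)

# Accuracy at each threshold: fraction of points whose thresholded
# prediction matches the true label.
acc_demo = np.array([
    np.mean((prob_demo > th).astype(int) == label_demo)
    for th in thresholds_demo
])

best = np.argmax(acc_demo)          # index of the first maximum
best_th = thresholds_demo[best]
best_acc = acc_demo[best]
print(best_th, best_acc)
```

When several thresholds tie for the highest accuracy, `np.argmax` returns the first one, so a different tie-breaking rule may be preferable in practice.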

# Using scikit-learn Library

Logistic regression can be performed with a few lines of code using the scikit-learn library. Unlike statsmodels, it fits the intercept by default, so no constant column needs to be added.

In [None]:
from sklearn.linear_model import LogisticRegression

# penalty=None disables regularization (scikit-learn >= 1.2; older releases used the string 'none')
model_2 = LogisticRegression(penalty=None)
model_2.fit(X_train, y_train)
y_pred_2 = model_2.predict(X_val)
metrics.accuracy_score(y_val, y_pred_2)

The accuracy above is computed from predictions on the validation data X_val. The classification report gives per-class precision and recall for the same predictions.

In [None]:
print(metrics.classification_report(y_val, y_pred_2))

Obtain the predicted class probabilities. Each row holds the probability of class 0 followed by the probability of class 1.

In [None]:
model_2.predict_proba(X_val)
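
For a binary problem, `predict_proba` returns one column per class, ordered as in `model_2.classes_`, and each row sums to 1. The sketch below mimics that layout with made-up numbers (it does not call scikit-learn, and `p1_demo` is hypothetical) to show that picking the larger column matches thresholding the class-1 probability at 0.5.

```python
import numpy as np

# Hypothetical P(class 1) values for four examples.
p1_demo = np.array([0.10, 0.40, 0.70, 0.95])

# Two-column layout: column 0 is P(class 0), column 1 is P(class 1).
proba_demo = np.column_stack([1.0 - p1_demo, p1_demo])

labels_from_argmax = proba_demo.argmax(axis=1)          # pick the likelier class
labels_from_threshold = (p1_demo > 0.5).astype(int)     # threshold class-1 probability
print(proba_demo.sum(axis=1), labels_from_argmax, labels_from_threshold)
```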

# Compare both libraries

In [None]:
# y_pred holds the statsmodels predictions
# y_pred_2 holds the scikit-learn predictions
# True only if both libraries agree on every validation example
(y_pred == y_pred_2).all()

Please refer to these articles:

> * [Beginners Guide to Logistic Regression](https://analyticsindiamag.com/beginners-guide-to-logistic-regression-in-python/)

> * [Important Regression Techniques](https://analyticsindiamag.com/a-beginners-guide-to-regression-techniques/) 


> * [Fake News Classification](https://analyticsindiamag.com/hands-on-guide-to-predict-fake-news-using-logistic-regression-svm-and-naive-bayes-methods/)