<a href="https://colab.research.google.com/github/atlas-github/20190731StarMediaGroup/blob/master/4_Logistic_Regression_using_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic Regression using Python (scikit-learn)


![MNIST](https://cdn-images-1.medium.com/max/800/1*1TkgO9Zz6rC3KpAYNl5KfA.png)

One of the most useful things about Python’s scikit-learn library is that it has a 4-step modeling pattern that makes it easy to code a machine learning classifier. While this tutorial uses a classifier called Logistic Regression, the same coding process applies to other classifiers in sklearn (Decision Tree, K-Nearest Neighbors, etc.). In this tutorial, we use Logistic Regression to predict digit labels based on images. The image above shows a set of training digits (observations) from the MNIST dataset whose category membership is known (labels 0–9). After a logistic regression model is trained, it can be used to predict the label (0–9) of a given image.

The first part of this tutorial goes over a toy dataset (digits dataset) to quickly illustrate scikit-learn’s 4-step modeling pattern and show the behavior of the logistic regression algorithm. The second part goes over a more realistic dataset (MNIST dataset) to briefly show how changing a model’s default parameters can affect performance (both the timing and the accuracy of the model).

# Logistic Regression on Digits Dataset

## Step 1: Loading the Data (Digits Dataset)

The digits dataset is one of the datasets that ship with scikit-learn, so it does not require downloading any file from an external website. The code below loads the digits dataset.

In [0]:
from sklearn.datasets import load_digits
digits = load_digits()

Now that you have the dataset loaded, you can use the commands below to see that there are 1797 images and 1797 labels in the dataset.

In [0]:
# Print to show there are 1797 images (8 by 8 images for a dimensionality of 64)
print("Image Data Shape", digits.data.shape)
# Print to show there are 1797 labels (integers from 0–9)
print("Label Data Shape", digits.target.shape)
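As a quick sanity check on what was loaded, the sketch below looks at the pixel value range and at how many images each digit has. It relies only on `load_digits`; the names `pixel_min`, `pixel_max`, and `counts_per_digit` are illustrative and not part of the original tutorial.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Each row of digits.data is one flattened 8x8 image
pixel_min, pixel_max = digits.data.min(), digits.data.max()
print("Pixel value range:", pixel_min, "to", pixel_max)

# Count how many images belong to each label (0-9)
counts_per_digit = np.bincount(digits.target)
for digit, count in enumerate(counts_per_digit):
    print("Digit", digit, "has", count, "images")
```

Checking the class counts early is a cheap way to notice an unbalanced dataset before relying on accuracy as the only metric.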

## Step 2: Showing the Images and the Labels (Digits Dataset)

This section is really just to show what the images and labels look like. It usually helps to visualize your data to see what you are working with.

In [0]:
import numpy as np 
import matplotlib.pyplot as plt
plt.figure(figsize=(20,4))
for index, (image, label) in enumerate(zip(digits.data[0:5], digits.target[0:5])):
 plt.subplot(1, 5, index + 1)
 plt.imshow(np.reshape(image, (8,8)), cmap=plt.cm.gray)
 plt.title('Training: %i\n' % label, fontsize = 20)
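The plotting code reshapes each 64-value row back into an 8 by 8 grid before calling `imshow`. A minimal sketch of that relationship, assuming only the `load_digits` dataset (the names `flat_image` and `grid_image` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# digits.data stores each image as a flat vector of 64 values
flat_image = digits.data[0]
# Reshaping the flat vector recovers the 8x8 grid that imshow expects
grid_image = flat_image.reshape(8, 8)

# digits.images holds the same pixels, already shaped as 8x8 arrays
print(flat_image.shape, grid_image.shape, digits.images.shape)
print(np.array_equal(grid_image, digits.images[0]))
```

The flat form is what the classifier consumes, while the grid form is only needed for display.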

## Step 3: Splitting Data into Training and Test Sets (Digits Dataset)

We make training and test sets to make sure that after we train our classification algorithm, it is able to generalize well to new data.

In [0]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=0)
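To see what the split produces, the sketch below prints the shapes of the resulting arrays and checks that a fixed `random_state` makes the split repeatable. It repeats the same call as above so that it runs on its own; `x_test_again` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Roughly three quarters of the rows go to training, one quarter to testing
print("Training set:", x_train.shape, y_train.shape)
print("Test set:", x_test.shape, y_test.shape)

# Using the same random_state again reproduces exactly the same split
_, x_test_again, _, _ = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)
print("Same split both times:", np.array_equal(x_test, x_test_again))
```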

## Step 4: Scikit-learn 4-Step Modeling Pattern (Digits Dataset)

### Import the [model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) you want to use

In sklearn, all machine learning models are implemented as Python classes.

In [0]:
from sklearn.linear_model import LogisticRegression

### Make an instance of the [Model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [0]:
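# all parameters not specified are set to their defaults
# (recent scikit-learn releases deprecate multi_class; if this line warns or fails, check the documentation linked above)
logisticRegr = LogisticRegression(solver='liblinear', multi_class='auto')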

### Training the model on the data, storing the information learned from the data

The model learns the relationship between digits (x_train) and labels (y_train).

In [0]:
logisticRegr.fit(x_train, y_train)

### Predict labels for new data (new images)

Prediction uses the information the model learned during training.

In [0]:
# Returns a NumPy Array
# Predict for One Observation (image)
logisticRegr.predict(x_test[0].reshape(1,-1))
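The `reshape(1,-1)` call is needed because scikit-learn estimators expect a 2-D array of shape (n_samples, n_features), even for a single observation. A small sketch using a made-up array of zeros in place of a real image:

```python
import numpy as np

# A stand-in for one flattened 8x8 image (64 pixel values)
single_image = np.zeros(64)
print(single_image.shape)

# reshape(1, -1) means "one row, and infer the number of columns"
as_batch = single_image.reshape(1, -1)
print(as_batch.shape)
```

The first shape is one-dimensional, while the second describes a batch containing a single 64-feature observation, which is what `predict` accepts.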

In [0]:
import numpy as np 
import matplotlib.pyplot as plt
plt.figure(figsize=(20,4))
plt.imshow(np.reshape(x_test[0], (8,8)), cmap=plt.cm.gray)

Predict for Multiple Observations (images) at Once

In [0]:
logisticRegr.predict(x_test[0:10])
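Behind each predicted label, logistic regression produces a probability for every class, and `predict` returns the most likely one. The sketch below illustrates this with `predict_proba`. It is an assumption-laden variant rather than the tutorial's exact model: it scales the pixels to the 0 to 1 range and uses `LogisticRegression(max_iter=1000)` with the library's default solver.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
# Scale pixels from 0-16 down to 0-1 to help the solver converge (an added step)
x_train, x_test, y_train, y_test = train_test_split(
    digits.data / 16.0, digits.target, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)

# One row per image, one column per class (digits 0-9)
probabilities = model.predict_proba(x_test[0:10])
print(probabilities.shape)
print(np.round(probabilities[0], 3))

# Picking the class with the highest probability matches predict()
most_likely = model.classes_[np.argmax(probabilities, axis=1)]
print(most_likely)
print(model.predict(x_test[0:10]))
```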

In [0]:
import numpy as np 
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8))
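# Lay the ten 8x8 images side by side (a plain reshape to (8, 80) would interleave rows from different images)
plt.imshow(np.hstack(x_test[0:10].reshape(10, 8, 8)), cmap=plt.cm.gray)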

Make predictions on entire test data

In [0]:
predictions = logisticRegr.predict(x_test)
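The same four steps (import, instantiate, fit, predict) carry over to the other classifiers mentioned at the start of the tutorial. A minimal sketch using K-Nearest Neighbors and a Decision Tree with their default settings; the dictionary names `classifiers` and `test_accuracy` are illustrative, and the default hyperparameters are not tuned.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier  # Step 1: import the models
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Step 2: make an instance of each model
classifiers = {
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

test_accuracy = {}
for name, clf in classifiers.items():
    clf.fit(x_train, y_train)            # Step 3: train on the training set
    predictions = clf.predict(x_test)    # Step 4: predict labels for new data
    test_accuracy[name] = clf.score(x_test, y_test)
    print(name, "test accuracy:", test_accuracy[name])
```

Only the import and the constructor change between models, which is what makes it easy to swap classifiers in and out.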

### Measuring Model Performance (Digits Dataset)

While there are other ways of measuring model performance (precision, recall, F1 score, ROC curve, etc.), we are going to keep this simple and use accuracy as our metric. 
To do this, we are going to see how the model performs on new data (the test set).

Accuracy is defined as the fraction of correct predictions: correct predictions / total number of data points

In [0]:
# Use score method to get accuracy of model
score = logisticRegr.score(x_test, y_test)
print(score)

Our accuracy was 95.3%.
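To connect the definition of accuracy to the `score` method, the sketch below computes the fraction of correct predictions by hand and compares it with `score`. It uses a model with scaled pixels and `max_iter=1000` (assumptions made so the snippet runs on its own), so the number it prints may differ slightly from the figure quoted above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data / 16.0, digits.target, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)
predictions = model.predict(x_test)

# Accuracy by hand: number of correct predictions / total number of data points
correct = np.sum(predictions == y_test)
manual_accuracy = correct / len(y_test)

print("Manual accuracy:", manual_accuracy)
print("score() accuracy:", model.score(x_test, y_test))
```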

### Confusion Matrix (Digits Dataset)

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. In this section, I am just showing two Python packages (Seaborn and Matplotlib) for making confusion matrices more understandable and visually appealing.

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics

On its own, the raw confusion matrix is just an array of counts, which is neither very informative nor visually appealing. The Seaborn heatmap below makes it easier to read.

In [0]:
cm = metrics.confusion_matrix(y_test, predictions)

plt.figure(figsize=(9,9))
# The cells hold integer counts, so annotate them as integers rather than floats
sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15);
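The confusion matrix also contains enough information to recover accuracy and per-digit recall. In scikit-learn's convention, rows are actual labels and columns are predicted labels, so the diagonal holds the correct predictions. A sketch under the same assumptions as before (scaled pixels, `max_iter=1000`, illustrative variable names):

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data / 16.0, digits.target, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(x_train, y_train)
predictions = model.predict(x_test)
cm = metrics.confusion_matrix(y_test, predictions)

# Correct predictions sit on the diagonal, so accuracy = diagonal sum / total
accuracy_from_cm = np.trace(cm) / cm.sum()

# Recall for each digit = correct predictions for it / all actual images of it (row sum)
recall_per_digit = np.diag(cm) / cm.sum(axis=1)

print("Accuracy from confusion matrix:", accuracy_from_cm)
for digit, recall in enumerate(recall_per_digit):
    print("Digit", digit, "recall:", round(recall, 3))
```

Per-digit recall shows which digits the model struggles with, something a single accuracy number hides.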

# Logistic Regression (MNIST)

One important point to emphasize is that the digits dataset contained in sklearn is too small to be representative of a real-world machine learning task.
We are going to use the MNIST dataset because it is intended for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. One of the things we will notice is that parameter tuning can greatly speed up a machine learning algorithm’s training time.

## Downloading the Data (MNIST)

The MNIST dataset doesn’t come bundled with scikit-learn, so the code below downloads it from OpenML.

In [0]:
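from sklearn.datasets import fetch_openml
# as_frame=False returns NumPy arrays; newer scikit-learn versions may otherwise return
# a pandas DataFrame, which breaks the positional indexing and reshaping used later
mnist = fetch_openml(name='mnist_784', as_frame=False)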

Now that you have the dataset loaded you can use the commands below to see that there are 70000 images and 70000 labels in the dataset.

In [0]:
# These are the images
# There are 70,000 images (28 by 28 images for a dimensionality of 784)
print(mnist.data.shape)
# These are the labels
print(mnist.target.shape)

## Splitting Data into Training and Test Sets (MNIST)

The code below splits the data into training and test sets. Setting test_size=1/7.0 gives a training set of 60,000 images and a test set of 10,000 images.

In [0]:
from sklearn.model_selection import train_test_split
train_img, test_img, train_lbl, test_lbl = train_test_split(
 mnist.data, mnist.target, test_size=1/7.0, random_state=0)
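The arithmetic behind `test_size=1/7.0` can be checked without downloading MNIST. The sketch below uses made-up placeholder arrays with 70,000 rows (`fake_images` and `fake_labels` are stand-ins, not real data) and confirms the sizes of the two parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as MNIST (no download needed)
fake_images = np.arange(70000).reshape(-1, 1)
fake_labels = np.zeros(70000, dtype=int)

# One seventh of 70,000 rows is held out for testing
train_part, test_part, _, _ = train_test_split(
    fake_images, fake_labels, test_size=1/7.0, random_state=0)

print("Training rows:", len(train_part))
print("Test rows:", len(test_part))
```

Passing an integer such as `test_size=10000` is another way to request an absolute number of test samples instead of a fraction.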

## Showing the Images and Labels (MNIST)

In [0]:
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(20,4))
for index, (image, label) in enumerate(zip(train_img[0:5], train_lbl[0:5])):
  plt.subplot(1, 5, index + 1)
  plt.imshow(np.reshape(image, (28,28)), cmap=plt.cm.gray)
  plt.title('Training: %i\n' % int(label), fontsize = 20)

# Scikit-learn 4-Step Modeling Pattern (MNIST)

Parameter tuning makes a large difference on larger and more complex datasets. While one usually adjusts parameters for the sake of accuracy, in the case below we adjust the solver parameter to speed up the fitting of the model.

## Step 1: Import the model you want to use

In sklearn, all machine learning models are implemented as Python classes.

In [0]:
from sklearn.linear_model import LogisticRegression

## Step 2: Make an instance of the Model

Please see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) if you are curious what changing solver does. Essentially, we are changing the optimization algorithm.

In [0]:
# all parameters not specified are set to their defaults
# the default solver is incredibly slow on this dataset, which is why we change it
logisticRegr = LogisticRegression(solver = 'newton-cg', multi_class = 'auto')

## Step 3: Training the model on the data, storing the information learned from the data

The model learns the relationship between x (digits) and y (labels).

In [0]:
logisticRegr.fit(train_img, train_lbl)
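To get a feel for how the choice of solver affects fitting time, the sketch below times a few solvers on the small digits dataset, used here as a stand-in so that the snippet runs quickly without a download. The solver list, the pixel scaling, and `max_iter=1000` are assumptions for illustration. Timings on MNIST will be very different, and the ranking of solvers may change with the data size and the scikit-learn version.

```python
import time
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data / 16.0, digits.target, test_size=0.25, random_state=0)

results = {}
for solver in ["lbfgs", "newton-cg", "saga"]:
    model = LogisticRegression(solver=solver, max_iter=1000)
    start = time.perf_counter()
    model.fit(x_train, y_train)              # only the fit is timed
    elapsed = time.perf_counter() - start
    accuracy = model.score(x_test, y_test)
    results[solver] = (elapsed, accuracy)
    print(f"{solver}: fit in {elapsed:.2f}s, test accuracy {accuracy:.3f}")
```

Measuring both time and accuracy together helps confirm that a faster solver is not giving up noticeable predictive quality.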

## Step 4: Predict the labels of new data (new images)

Prediction uses the information the model learned during training.

In [0]:
# Returns a NumPy Array
# Predict for One Observation (image)
logisticRegr.predict(test_img[0].reshape(1,-1))

Predict for Multiple Observations (images) at Once

In [0]:
logisticRegr.predict(test_img[0:10])

Make predictions on entire test data

In [0]:
predictions = logisticRegr.predict(test_img)

## Measuring Model Performance (MNIST)

While there are other ways of measuring model performance (precision, recall, F1 score, ROC curve, etc.), we are going to keep this simple and use accuracy as our metric. 
To do this, we are going to see how the model performs on new data (the test set).

Accuracy is defined as the fraction of correct predictions:

$ \text{accuracy} = \frac{\text{correct predictions}}{\text{total number of data points}} $

In [0]:
score = logisticRegr.score(test_img, test_lbl)
print(score)

## Display Misclassified images with Predicted Labels (MNIST)

While I could show another confusion matrix, I figured people would rather see misclassified images on the off chance someone finds it interesting.

Getting the indexes of the misclassified images

In [0]:
import numpy as np 
import matplotlib.pyplot as plt
# Collect the position of every test image whose prediction differs from its label
misclassifiedIndexes = []
for index, (label, predict) in enumerate(zip(test_lbl, predictions)):
    if label != predict:
        misclassifiedIndexes.append(index)
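The index collection above can also be written without an explicit loop by comparing the two arrays element-wise. A minimal sketch with small made-up label arrays standing in for `test_lbl` and `predictions`:

```python
import numpy as np

# Small made-up labels standing in for test_lbl and predictions
true_labels = np.array(['7', '2', '1', '0', '4', '1'])
predicted_labels = np.array(['7', '3', '1', '0', '9', '1'])

# The comparison yields a boolean mask; flatnonzero returns the positions that are True
misclassified = np.flatnonzero(true_labels != predicted_labels)
print(misclassified)  # positions 1 and 4 are where the labels differ
```

The resulting array can be sliced and used for plotting in the same way as the list built by the loop.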

Showing the misclassified images and image labels using matplotlib

In [0]:
plt.figure(figsize=(20,4))
for plotIndex, badIndex in enumerate(misclassifiedIndexes[0:5]):
 plt.subplot(1, 5, plotIndex + 1)
 plt.imshow(np.reshape(test_img[badIndex], (28,28)), cmap=plt.cm.gray)
 plt.title("Predicted: {}, Actual: {}".format(predictions[badIndex], test_lbl[badIndex]), fontsize = 15)

## Conclusion

The important thing to note here is that making a machine learning model in scikit-learn is not a lot of work.

This article was written by Michael Galarnyk and can be accessed [here](https://towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a).