<a href="https://colab.research.google.com/github/cagBRT/Machine-Learning/blob/master/Logistic_Regression_NB3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Logistic Regression on a small sample of the MNIST dataset**
This notebook uses scikit-learn's logistic regression model on scikit-learn's built-in dataset of handwritten digits, a small MNIST-style dataset of 8x8 images. <br>
The model will classify the data into 10 classes (digits 0-9).

Import the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
import seaborn as sns
from sklearn import metrics

**Get the data**
The data consists of images of handwritten digits. It is one of the datasets bundled with sklearn.

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()

There are 1797 images and each is 8*8 pixels. 
There are 1797 labels, one for each image.

In [None]:
# Print to show there are 1797 images (8 by 8 images for a dimensionality of 64)
print("Image Data Shape = " , digits.data.shape)
# Print to show there are 1797 labels (integers from 0–9)
print("Label Data Shape", digits.target.shape)
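As a quick check of how the 64-value rows relate to the pictures, the sketch below assumes the standard `load_digits` attributes `images` (8x8 arrays) and `data` (flattened rows) and confirms that one is a reshaped copy of the other. The variable names `flattened` and `same_pixels` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# digits.images holds the 8x8 pictures; digits.data holds the same pixels flattened.
print("images shape:", digits.images.shape)
print("data shape:  ", digits.data.shape)

# Flattening every image row by row reproduces digits.data.
flattened = digits.images.reshape(len(digits.images), -1)
same_pixels = np.array_equal(flattened, digits.data)
print("data is the flattened images:", same_pixels)

# Pixel intensities are small integers stored as floats.
print("pixel range:", digits.data.min(), "to", digits.data.max())
```

This is why the plotting cells reshape each row back to `(8,8)` before calling `imshow`.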

Let's look at some of the images. 
You will see they are heavily pixelated. The images show the digits 0 to 4.

In [None]:
plt.figure(figsize=(20,4))
for index, (image, label) in enumerate(zip(digits.data[0:5], digits.target[0:5])):
 plt.subplot(1, 5, index + 1)
 plt.imshow(np.reshape(image, (8,8)), cmap=plt.cm.gray)
 plt.title('Training: %i\n' % label, fontsize = 20)

**Training - Test Data split**
The data is split into 75% for training and 25% for testing. 

In [None]:
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=0)
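To see what the split produces, the following sketch repeats the same call and prints the resulting sizes along with the number of test images per digit. The use of `np.bincount` to count classes is an addition for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Sizes of the two partitions (rows are images, columns are the 64 pixels).
print("train:", x_train.shape, y_train.shape)
print("test: ", x_test.shape, y_test.shape)

# How many test images there are of each digit 0-9.
test_counts = np.bincount(y_test, minlength=10)
print("test images per digit:", test_counts)
```

Fixing `random_state` makes the split reproducible, so the same images land in the test set on every run.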

**Create the model**
We are using a classic logistic regression model. 
In scikit-learn we can modify the parameters of the model. Here we set the solver, the iteration limit and the regularization strength `C` explicitly; every parameter that is not listed keeps its default value. 

In [None]:
# all parameters not specified are set to their defaults
logisticRegr = LogisticRegression(fit_intercept=True,
                        multi_class='auto',
                        penalty='l2', # L2 (ridge-style) regularization
                        solver='saga',
                        max_iter=10000,
                        C=50)
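In scikit-learn, `C` is the inverse of the regularization strength, so a larger `C` means a weaker L2 penalty. The sketch below is one way to see that effect: it fits two models and compares the size of their coefficients. It uses the default solver and divides the pixels by 16 to help convergence; both choices are assumptions made for this illustration and differ from the notebook's settings.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
# Assumption for this sketch: scale pixels from 0-16 down to 0-1.
X = digits.data / 16.0
x_train, x_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.25, random_state=0
)

coef_norms = {}
for C in (0.01, 50):
    model = LogisticRegression(C=C, max_iter=5000)
    model.fit(x_train, y_train)
    # A stronger penalty (small C) shrinks the coefficients toward zero.
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print("C =", C,
          "| coefficient norm =", round(coef_norms[C], 2),
          "| test accuracy =", round(model.score(x_test, y_test), 3))
```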

In [None]:
print("training shape: ",x_train.shape)
print("label shape: ", y_train.shape)

**Train the model**<br>
Train the model on the training images and the training labels. 

In [None]:
logisticRegr.fit(x_train, y_train)

**Make predictions with the trained model**
Let's test the model by giving it one test case x_test[0]<br>
As you can see from the images below, it looks like this is the digit '2'. 

In [None]:
plt.figure(figsize=(20,4))
for index, (image, label) in enumerate(zip(x_test[0:10], y_test[0:10])):
 plt.subplot(1, 10, index + 1)
 plt.imshow(np.reshape(image, (8,8)), cmap=plt.cm.gray)
 plt.title('Label: %i\n' % label, fontsize = 20)

The model predicts x_test[0] is a '2'

In [None]:
# Returns a NumPy Array
# Predict for One Observation (image)
logisticRegr.predict(x_test[0].reshape(1,-1))
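Besides the predicted label, a fitted `LogisticRegression` can report a probability for each class through `predict_proba`. The sketch below trains a stand-in model (default solver with a raised `max_iter`, which is an assumption and not the notebook's configuration) and inspects the probabilities for a single test image.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Stand-in model for illustration; the notebook's own settings differ.
model = LogisticRegression(max_iter=5000)
model.fit(x_train, y_train)

# predict and predict_proba both expect a 2D array, hence the reshape.
sample = x_test[0].reshape(1, -1)
probabilities = model.predict_proba(sample)  # one row, one column per class
predicted = model.predict(sample)

# The predicted label is the class with the highest probability.
most_likely = model.classes_[np.argmax(probabilities, axis=1)]
print("probabilities:", np.round(probabilities[0], 3))
print("predicted:", predicted[0], "| most likely class:", most_likely[0])
```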

**Make multiple predictions**<br>
Have the model make predictions on the first 10 images in the test set.<br>
Compare the predictions to the images. Is the model predicting accurately?


In [None]:
logisticRegr.predict(x_test[0:10])

**Predict with the test set**

In [None]:
predictions = logisticRegr.predict(x_test)

**What is the model accuracy?**

In [None]:
# Use score method to get accuracy of model
score = logisticRegr.score(x_test, y_test)
print(score)
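For a classifier, `score` reports the mean accuracy: the fraction of samples whose predicted label equals the true label. The sketch below reproduces that calculation by hand on small, made-up arrays (`y_true` and `y_pred` are hypothetical values, not results from the model).

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1])

# True where the prediction matches the label.
correct = y_true == y_pred

# Accuracy is the share of matches.
accuracy = correct.mean()
print("correct:", int(correct.sum()), "of", len(y_true))
print("accuracy:", accuracy)
```

Applied to the notebook's variables, the same idea would be `np.mean(predictions == y_test)`.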

**The confusion matrix for the digits**<br>
The confusion matrix is shown below. <br>
This is a matrix for 10 classes. <br>
Each row corresponds to an actual digit and each column to a predicted digit. The diagonal from upper left to lower right holds the number of images of each class that the model predicted correctly. <br>
For example:<br>
In the first row there is a 37. This means the model correctly predicted 37 images of handwritten '0's. The rest of the row contains zeros, which means it did not mislabel any of the '0's as other digits. <br>
The second row has a 40 on the diagonal. This means the model predicted 40 handwritten '1's correctly. It also predicted 2 images of '1's as '8's and 1 image of a '1' as a '9'. <br>
Each row therefore shows, for one class, how many images were correctly identified and how many were misidentified. 

In [None]:
cm = metrics.confusion_matrix(y_test, predictions)
print(cm)
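The row-by-row reading described above can also be done numerically. The sketch below uses a small, made-up 3-class matrix (`cm_example` is hypothetical) and follows the `confusion_matrix` convention of actual labels in rows and predicted labels in columns.

```python
import numpy as np

# Hypothetical 3-class confusion matrix, for illustration only.
# Rows are actual labels, columns are predicted labels.
cm_example = np.array([
    [5, 0, 0],
    [0, 3, 1],
    [0, 1, 4],
])

# Diagonal entries are the correctly classified samples of each class.
correct_per_class = np.diag(cm_example)

# Row sums are the number of samples that truly belong to each class.
actual_per_class = cm_example.sum(axis=1)

# Fraction of each class that was identified correctly (per-class recall).
recall_per_class = correct_per_class / actual_per_class

# Overall accuracy is the diagonal total divided by the grand total.
overall_accuracy = correct_per_class.sum() / cm_example.sum()

print("recall per class:", recall_per_class)
print("overall accuracy:", overall_accuracy)
```

The same lines would work on the notebook's `cm` by swapping it in for `cm_example`.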

In [None]:
plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15);

List the instances where ground truth and the prediction differ<br>
(The model got it wrong)


In [None]:
# Collect the positions in the test set where the prediction is wrong
misclassifiedIndexes = []
print("Index\tGnd Truth\tPrediction")
for index, (label, predict) in enumerate(zip(y_test, predictions)):
    if label != predict:
        print(index, "\t", label, "\t\t", predict)
        misclassifiedIndexes.append(index)

When we look at the images and their labels, how many would a human expert have gotten correct?

In [None]:
plt.figure(figsize=(20,4))
# Show the first five misclassified test images with predicted and actual labels
for plotIndex, badIndex in enumerate(misclassifiedIndexes[0:5]):
    plt.subplot(1, 5, plotIndex + 1)
    plt.imshow(np.reshape(x_test[badIndex], (8,8)), cmap=plt.cm.gray)
    plt.title("Predicted {}, Actual {}".format(predictions[badIndex], y_test[badIndex]), fontsize = 15)
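The positions where the prediction and the ground truth disagree can also be found without an explicit loop. The sketch below shows one way to do this with `np.flatnonzero` on small, made-up arrays (`y_true` and `y_pred` are hypothetical, not model output).

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([2, 8, 2, 6, 6])
y_pred = np.array([2, 8, 1, 6, 5])

# Positions where the prediction differs from the ground truth.
wrong_positions = np.flatnonzero(y_true != y_pred)

print("Index\tGnd Truth\tPrediction")
for position in wrong_positions:
    print(position, "\t", y_true[position], "\t\t", y_pred[position])
```

Because these are positions in the test set, they can be used directly to look up the matching rows of `x_test` for plotting.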
 