# Project 1: Digit Classification with KNN and Naive Bayes

In this project, you'll implement your own image recognition system for classifying digits. Read through the code and the instructions carefully and add your own code where indicated. Each problem can be addressed succinctly with the included packages -- please don't add any more. Grading will be based on writing clean, commented code, along with a few short answers.

As always, you're welcome to work on the project in groups and discuss ideas on the course wall, but <b> please prepare your own write-up (with your own code). </b>

If you're interested, check out these links related to digit recognition:

Yann LeCun's MNIST benchmarks: http://yann.lecun.com/exdb/mnist/

Stanford Streetview research and data: http://ufldl.stanford.edu/housenumbers/

In [3]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# Import a bunch of libraries.
import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_openml
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')
# Set the randomizer seed so results are the same each time.
np.random.seed(0)

Load the data. Notice that we are splitting the data into training, development, and test. We also have a small subset of the training data called mini_train_data and mini_train_labels that you should use in all the experiments below, unless otherwise noted.

In [4]:
# Load the digit data from https://www.openml.org/d/554 or from default local location `~/scikit_learn_data/...`
X, Y = fetch_openml(name='mnist_784', return_X_y=True, cache=False)

# Rescale grayscale values to [0,1].
X = X / 255.0

# Shuffle the input: create a random permutation of the integers between 0 and the number of data points and apply this
# permutation to X and Y.
# NOTE: Each time you run this cell, you'll re-shuffle the data, resulting in a different ordering.
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

print('data shape:  {0}'.format(X.shape))
print('label shape: {0}'.format(Y.shape))

# Set some variables to hold test, dev, and training data.
test_data, test_labels = X[61000:], Y[61000:]
dev_data, dev_labels = X[60000:61000], Y[60000:61000]
train_data, train_labels = X[:60000], Y[:60000] 
mini_train_data, mini_train_labels = X[:1000], Y[:1000]


(1) Create a 10x10 grid to visualize 10 examples of each digit. Python hints:

- plt.rc() for setting the colormap, for example to black and white
- plt.subplot() for creating subplots
- plt.imshow() for rendering a matrix
- np.array.reshape() for reshaping a 1D feature vector into a 2D matrix (for rendering)

In [None]:
def P1(num_examples=10):
    #Defining the colormap for image properties
    plt.rc('image', cmap='gray')
    plt.figure(figsize=(num_examples,num_examples))

    # One row per digit 0-9: the unique labels in Y cover this part
    digits = np.unique(Y)
    for i in digits:
        features = X[Y == i][:num_examples]
        # Side length of the square image (28 for 784 pixels)
        vector_size = int(np.sqrt(features.shape[1]))
        for j in range(num_examples):
            # Subplot positions start at 1: row int(i), column j
            plt.subplot(len(digits), num_examples, 1 + int(i) * num_examples + j)
            # Hide axes
            plt.axis('off')
            # Plot the corresponding digit (reshaped to a 2D matrix)
            digit = features[j].reshape((vector_size, vector_size))
            plt.imshow(digit)

P1(10)

(2) Evaluate a K-Nearest-Neighbors model with k = [1,3,5,7,9] using the mini training set. Report accuracy on the dev set. For k=1, show precision, recall, and F1 for each label. Which is the most difficult digit?

- KNeighborsClassifier() for fitting and predicting
- classification_report() for producing precision, recall, F1 results

In [None]:
def P2(k_values):
    for i in k_values:
        # create KNN neigh model from sklearn
        neigh = KNeighborsClassifier(n_neighbors=i)
        # fit the KNN model using mini training data and mini training labels set
        neigh.fit(mini_train_data,mini_train_labels)
        # prediction on the dev data set
        dev_prediction_labels = neigh.predict(dev_data)
        # Report the accuracy on the dev data set
        print('Accuracy for k-> {0:d} neighbours: {1:.3f}'.format(i, neigh.score(dev_data, dev_labels)))
        # For k=1, show precision, recall, and F1 for each label
        if i == 1:
            target_names = np.unique(Y)
            print('\nPrecision, recall, and F1 for each digit when k = 1:')
            print(classification_report(dev_labels, dev_prediction_labels, target_names=target_names))
            
k_values = [1, 3, 5, 7, 9]
P2(k_values)



ANSWER: Which is the most difficult digit for K=1?

    From the classification report, looking at the F1-score, which combines precision and recall, the most difficult digit is 8.
    Its recall is 0.77, which means there are a lot of false negatives.
    Precision is also low for digit 9 (0.8), which implies a lot of false positives.
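
To make the link between the report and the confusion matrix concrete, here is a minimal NumPy sketch of how per-label precision, recall, and F1 can be derived from a confusion matrix. The helper name `per_class_metrics` and the toy counts are made up for illustration; they are not results from this project.

```python
import numpy as np

def per_class_metrics(conf_mat):
    """Return (precision, recall, f1) arrays from a confusion matrix.

    Rows are true labels and columns are predicted labels, which is the
    layout sklearn's confusion_matrix() produces.
    """
    conf_mat = np.asarray(conf_mat, dtype=float)
    true_pos = np.diag(conf_mat)
    predicted_totals = conf_mat.sum(axis=0)  # column sums: how often each label was predicted
    actual_totals = conf_mat.sum(axis=1)     # row sums: how often each label actually occurred
    # Guard against division by zero for labels that never occur or are never predicted.
    precision = np.divide(true_pos, predicted_totals,
                          out=np.zeros_like(true_pos), where=predicted_totals > 0)
    recall = np.divide(true_pos, actual_totals,
                       out=np.zeros_like(true_pos), where=actual_totals > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(true_pos), where=denom > 0)
    return precision, recall, f1

# Toy 3-class matrix with made-up counts.
toy = [[8, 1, 1],
       [2, 7, 1],
       [0, 2, 8]]
precision, recall, f1 = per_class_metrics(toy)
print(precision, recall, f1)
```

Low recall for a label means its row has large off-diagonal counts (false negatives); low precision means its column does (false positives).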


(3) Using k=1, report dev set accuracy for the training set sizes below. Also, measure the amount of time needed for prediction with each training size.

- time.time() gives a wall clock value you can use for timing operations

In [None]:
def P3(train_sizes, accuracies):
    # Create a 1-nearest-neighbor model
    neigh = KNeighborsClassifier(n_neighbors=1)
    # Prediction time (in seconds) for each training size
    time_taken = []
    for size in train_sizes:
        # Take the first `size` examples of the training set
        train_data_size, train_labels_size = train_data[:size], train_labels[:size]
        # Fit the KNN model on this subset
        neigh.fit(train_data_size, train_labels_size)
        # Time only the prediction step, as the problem asks;
        # score() predicts on the dev set and compares with the dev labels
        start = time.time()
        accuracies.append(neigh.score(dev_data, dev_labels))
        end = time.time()
        time_taken.append(end - start)
    # Report accuracy and prediction time for each training size
    for size, accuracy, time_value in zip(train_sizes, accuracies, time_taken):
        print("Training size %5d: accuracy = %0.3f, prediction time = %.3f seconds"
              % (size, accuracy, time_value))
    
train_sizes = [100, 200, 400, 800, 1600, 3200, 6400, 12800, 25000]
accuracies = []

P3(train_sizes, accuracies)
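
As a rough intuition for why prediction time grows with the training size, the following is a brute-force 1-NN sketch in NumPy. It is an illustration only: `predict_1nn` is a hypothetical helper, and sklearn's `KNeighborsClassifier` may use faster search structures internally.

```python
import numpy as np

def predict_1nn(train_X, train_y, query_X):
    """Brute-force 1-NN: compare every query with every training example."""
    train_X = np.asarray(train_X, dtype=float)
    query_X = np.asarray(query_X, dtype=float)
    # Squared Euclidean distances, shape (n_queries, n_train).
    diffs = query_X[:, np.newaxis, :] - train_X[np.newaxis, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Each query takes the label of its closest training example.
    return np.asarray(train_y)[dists.argmin(axis=1)]

# Toy data, made up for illustration.
train_X = [[0, 0], [1, 1], [5, 5]]
train_y = ['a', 'b', 'c']
preds = predict_1nn(train_X, train_y, [[0.1, 0.2], [4.0, 4.5]])
print(preds)
```

The distance matrix has one column per training example, so the work per prediction scales with the number of training examples, while "fitting" amounts to storing the data.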


(4) Fit a regression model that predicts accuracy from training size. What does it predict for n=60000? What's wrong with using regression here? Can you apply a transformation that makes the predictions more reasonable?

- Remember that the sklearn fit() functions take an input matrix X and output vector Y. So each input example in X is a vector, even if it contains only a single value.

In [None]:
def P4():
    lr = LinearRegression(fit_intercept=True)
    size = np.asarray([60000])
    X = np.asarray(train_sizes)[:, np.newaxis]
    y = np.asarray(accuracies)[:, np.newaxis]
    lr.fit(X, y)
    print('Accuracy for n = 60000 for estimated function y=a+bX is {0:.3f}'.format(lr.predict(size.reshape(1, -1))[0, 0]))
    
    # The predicted value for n=60000 is greater than 1 which is not possible. 
    
    plt.figure(figsize=(14, 2))
  
    #plotting the linear function
    
    ax = plt.subplot(1, 3, 1)
    # Turn off tick marks to keep things clean.
    plt.setp(ax, xticks=())
    
    x = np.linspace(X.min(), 60000, 100)[:,np.newaxis]
    plt.plot(x, lr.predict(x))
    plt.scatter(X, y, color='red')
    plt.xlabel("x")
    plt.ylabel("y")
    plt.title('Linear plot')
    
    # Applying log-transformation for the training values.
  
    logX = np.log(X)
    size = np.log(size)
    lr.fit(logX, y)
    print('Accuracy for n = 60000 for estimated function y=a+blogX is {0:.3f}\n'.format(lr.predict(size.reshape(1, -1))[0, 0]))
    ax = plt.subplot(1, 3, 2)
    # Turn off tick marks to keep things clean.
    plt.setp(ax, xticks=())
    
    x = np.linspace(np.log(X.min()), np.log(60000), 100)[:,np.newaxis]
    plt.plot(x, lr.predict(x))
    plt.scatter(np.log(X), y, color='red')
    plt.xlabel("log x")
    plt.ylabel("y")
    plt.title('Log of X')
    

P4()

ANSWER: 

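    The linear model predicts an accuracy of 1.24 for n=60000, which is impossible because accuracy cannot exceed 1.
    
    Even after applying a log transformation to the training sizes, the predicted accuracy for n=60000 is 1.033, which is
    still greater than 1.
    
    The problem with regression here is that a linear model is unbounded, while accuracy is bounded between 0 and 1 and
    levels off as the training size grows, so extrapolating to n=60000 overshoots. Transforming the target as well, for
    example by regressing the log-odds of accuracy on log n, would keep the predictions below 1.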
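
One transformation that keeps predictions in a valid range is to regress the log-odds (logit) of accuracy on log n and map the prediction back with a sigmoid. The sketch below uses NumPy only, and the (size, accuracy) pairs are made up for illustration; they are not the accuracies measured in this project.

```python
import numpy as np

# Made-up (training size, accuracy) pairs, for illustration only.
sizes = np.array([100, 400, 1600, 6400, 25000], dtype=float)
accs = np.array([0.70, 0.82, 0.90, 0.94, 0.97])

def logit(p):
    # Log-odds: maps (0, 1) onto the whole real line.
    return np.log(p / (1.0 - p))

def sigmoid(z):
    # Inverse of logit: maps any real number back into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Least-squares line: logit(accuracy) = a + b * log(n).
b, a = np.polyfit(np.log(sizes), logit(accs), deg=1)

def predict_accuracy(n):
    return sigmoid(a + b * np.log(n))

pred_60000 = predict_accuracy(60000.0)
print('predicted accuracy for n=60000: %.3f' % pred_60000)
```

Because the sigmoid can only return values strictly between 0 and 1, the extrapolated accuracy can approach 1 but never exceed it.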
  

(5) Fit a 1-NN and output a confusion matrix for the dev data. Use the confusion matrix to identify the most confused pair of digits, and display a few example mistakes.

- confusion_matrix() produces a confusion matrix

In [None]:
def P5():
    # create KNN neigh model from sklearn with k value 1
    neigh = KNeighborsClassifier(n_neighbors=1)
   
    # fit the KNN model using mini training set 
    neigh.fit(mini_train_data,mini_train_labels)
    #prediction for dev data
    pred_dev_labels = neigh.predict(dev_data)
    #creating confusion matrix for the dev data
    confusion_mat = confusion_matrix(dev_labels,pred_dev_labels)
    # Print the confusion matrix (rows are true labels, columns are predicted labels)
    print("confusion matrix when training the model with mini training data\n")
    print(confusion_mat)

    #Defining the colormap for image properties
    plt.rc('image', cmap='gray')
    plt.figure(figsize=(5,2))

    # Display up to 10 dev examples whose true label is 4 but which were predicted as 9
    count = 1
    for j in range(len(pred_dev_labels)):
        if pred_dev_labels[j] == '9' and dev_labels[j] == '4' and count <= 10:
            features = np.array(dev_data[j]).reshape(28,28)
            # Place the image in the next free cell of the 2x5 grid
            plt.subplot(2, 5, count)
            # Hide axes
            plt.axis('off')
            plt.imshow(features)
            count += 1
                            
                            
P5()

Examples: 

    Using Mini Training data:
        With the mini training data, accuracy is somewhat lower than with the full training data.
        Digit 2 is confused with digit 8 6 times, and with digits 1 and 7 4 times each.
        Digit 4 is confused with digit 9 11 times.
        Digit 8 is confused with other digits 22 times, especially with 1, 2 and 9.
        Digit 9 is confused with digit 7 7 times.
        
        Digits 2, 4, 8 and 9 are confused more often than the other digits. The images plotted above are 4s that the model labeled as 9.
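
Rather than reading the matrix by eye, the most confused pair can be picked out programmatically. This is a minimal sketch assuming the sklearn layout (rows are true labels, columns are predictions); `most_confused_pair` is a hypothetical helper and the toy counts are made up.

```python
import numpy as np

def most_confused_pair(conf_mat):
    """Return (true_index, predicted_index, count) of the largest off-diagonal cell."""
    off_diag = np.array(conf_mat, dtype=int)  # copy, so the input is not modified
    np.fill_diagonal(off_diag, 0)             # ignore correct predictions
    true_idx, pred_idx = np.unravel_index(off_diag.argmax(), off_diag.shape)
    return int(true_idx), int(pred_idx), int(off_diag[true_idx, pred_idx])

# Toy 3-class confusion matrix with made-up counts.
toy = [[50, 2, 1],
       [3, 40, 9],
       [0, 4, 45]]
pair = most_confused_pair(toy)
print(pair)
```

To treat "a mistaken for b" and "b mistaken for a" as the same pair, one option is to apply the same search to the sum of the off-diagonal matrix and its transpose.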
    

(6) A common image processing technique is to smooth an image by blurring. The idea is that the value of a particular pixel is estimated as the weighted combination of the original value and the values around it. Typically, the blurring is Gaussian -- that is, the weight of a pixel's influence is determined by a Gaussian function over the distance to the relevant pixel.

Implement a simplified Gaussian blur by just using the 8 neighboring pixels: the smoothed value of a pixel is a weighted combination of the original value and the 8 neighboring values. Try applying your blur filter in 3 ways:
- preprocess the training data but not the dev data
- preprocess the dev data but not the training data
- preprocess both training and dev data

Note that there are Gaussian blur filters available, for example in scipy.ndimage.filters. You're welcome to experiment with those, but you are likely to get the best results with the simplified version I described above.

In [None]:
def Gaussian_blur(image, sigma):
    # Simplified Gaussian blur: each pixel becomes a weighted average of itself and its
    # 8 neighbors, with weights given by a Gaussian over the distance to the center pixel.
    vector_size = int(np.sqrt(image.shape[0]))
    img = np.asarray(image, dtype=float).reshape(vector_size, vector_size)
    # Pad with zeros so that border pixels also have 8 neighbors
    padded = np.pad(img, 1, mode='constant')
    blurred_image = np.zeros_like(img)
    total_weight = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            weight = np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))
            blurred_image += weight * padded[1 + dy:1 + dy + vector_size, 1 + dx:1 + dx + vector_size]
            total_weight += weight
    # Normalize so the weights sum to 1, then flatten back to a feature vector
    return (blurred_image / total_weight).ravel()


def P6():
    # KNN classifier; the problem does not fix k, so k=8 is used for all three experiments
    neigh = KNeighborsClassifier(n_neighbors=8)

    # Blur each image separately so pixels are only mixed with neighbors from the same image
    blur_mini_train_data = np.array([Gaussian_blur(image, 0.5) for image in mini_train_data])
    blur_dev_data = np.array([Gaussian_blur(image, 0.5) for image in dev_data])

    # 1) Preprocess the training data but not the dev data
    neigh.fit(blur_mini_train_data, mini_train_labels)
    print('Accuracy (k=8), blurred train / original dev: {0:.3f}'.format(neigh.score(dev_data, dev_labels)))

    # 2) Preprocess the dev data but not the training data
    neigh.fit(mini_train_data, mini_train_labels)
    print('Accuracy (k=8), original train / blurred dev: {0:.3f}'.format(neigh.score(blur_dev_data, dev_labels)))

    # 3) Preprocess both the training data and the dev data
    neigh.fit(blur_mini_train_data, mini_train_labels)
    print('Accuracy (k=8), blurred train / blurred dev: {0:.3f}'.format(neigh.score(blur_dev_data, dev_labels)))

P6()


ANSWER: 

    Accuracy seems to improve when we pre-process the training data but not the dev data.
    However, accuracy is reduced when we pre-process only the dev data and not the training data.
    When both data sets are pre-processed, accuracy seems to improve as well.

(7) Fit a Naive Bayes classifier and report accuracy on the dev data. Remember that Naive Bayes estimates P(feature|label). While sklearn can handle real-valued features, let's start by mapping the pixel values to either 0 or 1. You can do this as a preprocessing step, or with the binarize argument. With binary-valued features, you can use BernoulliNB. Next try mapping the pixel values to 0, 1, or 2, representing white, grey, or black. This mapping requires MultinomialNB. Does the multi-class version improve the results? Why or why not?

In [None]:
def P7():
    # For binary-valued features, use BernoulliNB
    Binary_Bernoulli_model = BernoulliNB(binarize=0.333)
    # Use the mini training data to fit the model
    Binary_Bernoulli_model.fit(mini_train_data, mini_train_labels)
    print('Accuracy using Bernoulli NB for dev set-> {0:.3f}'.format(Binary_Bernoulli_model.score(dev_data, dev_labels)))

    # For multinomial values, divide the pixel values into 3 levels:
    # 0 for values less than 0.333
    # 1 for values from 0.333 up to (but not including) 0.667
    # 2 for values of 0.667 and above
    Multinomial_model = MultinomialNB()
    changed_mini_train_data_for_multinomial = np.where(mini_train_data < 0.333, 0,
                                      np.where(mini_train_data < 0.667, 1, 2))
    changed_dev_data_for_multinomial = np.where(dev_data < 0.333, 0,
                                      np.where(dev_data < 0.667, 1, 2))
    
    # Use the transformed mini training data to fit the model
    Multinomial_model.fit(changed_mini_train_data_for_multinomial, mini_train_labels)
    
    print('Accuracy using Multinomial NB for dev set-> {0:.3f}'.format(
        Multinomial_model.score(changed_dev_data_for_multinomial, dev_labels)))

P7()

ANSWER: Does the multi-class version improve the results? Why or why not?

    The multi-class version doesn't improve the results, although the two accuracies are close. MultinomialNB treats the values 0, 1 and 2 as counts rather than as three separate categories, and the grey level adds little information because most pixels are close to either 0 or 1. Accuracy may drop further if we map the pixel values to more than 3 levels.
    
    I also tried a binarize threshold of 0.25 for Bernoulli NB, which gave an accuracy of 0.826, and bin limits of 0.25, 0.75 and 1 for Multinomial NB, which gave an accuracy of 0.809.
    
    These look lower than the accuracies obtained with the limits 0.333, 0.667 and 1.
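
The three-level mapping can also be written with `np.digitize`, which assigns each value to a bin given the bin edges. A small sketch on a handful of sample pixel values (chosen here only to exercise the boundaries):

```python
import numpy as np

# Sample pixel intensities in [0, 1], including values exactly on the thresholds.
pixels = np.array([0.0, 0.2, 0.333, 0.5, 0.667, 0.9, 1.0])

# With the default right=False, a value x lands in bin i when bins[i-1] <= x < bins[i]:
#   x < 0.333           -> 0 (white)
#   0.333 <= x < 0.667  -> 1 (grey)
#   x >= 0.667          -> 2 (black)
levels = np.digitize(pixels, bins=[0.333, 0.667])
print(levels)
```

This makes the handling of values that sit exactly on a threshold explicit, and extends to more levels by adding bin edges.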

(8) Use GridSearchCV to perform a search over values of alpha (the Laplace smoothing parameter) in a Bernoulli NB model. What is the best value for alpha? What is the accuracy when alpha=0? Is this what you'd expect?

- Note that GridSearchCV partitions the training data so the results will be a bit different than if you used the dev data for evaluation.

In [None]:
def P8(alphas):
    # Create a GridSearchCV over the given alpha values, using a BernoulliNB estimator
    nb = GridSearchCV(BernoulliNB(binarize=0.333), alphas)
    nb.fit(mini_train_data, mini_train_labels)
    means = nb.cv_results_['mean_test_score']
    stds = nb.cv_results_['std_test_score']
    for params, accuracy in zip(nb.cv_results_['params'], means ):
        print(" For %r \t Accuracy -> %0.03f "
              % (params, accuracy))
    return nb

alphas = {'alpha': [0.0, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
nb = P8(alphas)

In [None]:
print(nb.best_params_)

ANSWER: What is the best value for alpha? What is the accuracy when alpha=0? Is this what you'd expect?

    The best value for alpha is 0.001.
    The accuracy at alpha=0 is 0.814, a little lower than at the best value. With alpha=0 we tell the model that no
    smoothing is required, so a pixel value that never occurs for a class in the training data gets an estimated
    probability of essentially 0, and an example containing it is given almost no likelihood for that class regardless
    of its other pixels. It therefore makes sense that the accuracy falls to some extent. Yes, I would expect it to be that way.
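
The effect of alpha can be seen on a tiny example. The sketch below estimates P(pixel = 1 | class) for a single class with additive (Laplace) smoothing, (count + alpha) / (n + 2 * alpha), and then scores a new example. The helper names and the toy data are made up for illustration.

```python
import numpy as np

def bernoulli_feature_probs(X_binary, alpha):
    """Estimate P(pixel = 1 | class) from the binary examples of one class."""
    X_binary = np.asarray(X_binary, dtype=float)
    on_counts = X_binary.sum(axis=0)   # how often each pixel is on
    n_examples = X_binary.shape[0]
    # The 2 * alpha in the denominator accounts for the two outcomes (on / off).
    return (on_counts + alpha) / (n_examples + 2.0 * alpha)

def log_likelihood(x, probs):
    """Sum of log P(pixel value | class) over the pixels of one example."""
    x = np.asarray(x)
    with np.errstate(divide='ignore'):
        # Pick log(p) where the pixel is on and log(1 - p) where it is off.
        terms = np.where(x == 1, np.log(probs), np.log(1.0 - probs))
    return terms.sum()

# Three toy training examples with three pixels; pixel 0 is never on.
toy = [[0, 1, 1],
       [0, 0, 1],
       [0, 1, 1]]
unsmoothed = bernoulli_feature_probs(toy, alpha=0.0)
smoothed = bernoulli_feature_probs(toy, alpha=1.0)

# A new example in which the never-seen pixel 0 is on.
new_example = [1, 1, 1]
ll_unsmoothed = log_likelihood(new_example, unsmoothed)
ll_smoothed = log_likelihood(new_example, smoothed)
print(unsmoothed, smoothed, ll_unsmoothed, ll_smoothed)
```

Without smoothing, one unseen pixel value drives the log-likelihood to minus infinity and rules out the class; with alpha > 0 the estimate stays away from 0 and 1, so the other pixels still contribute.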

(9) Try training a model using GaussianNB, which is intended for real-valued features, and evaluate on the dev data. You'll notice that it doesn't work so well. Try to diagnose the problem. You should be able to find a simple fix that returns the accuracy to around the same rate as BernoulliNB. Explain your solution.

Hint: examine the parameters estimated by the fit() method, theta\_ and sigma\_.

In [None]:
def P9():
    gnb = GaussianNB()
    print(gnb)
    gnb.fit(mini_train_data, mini_train_labels)
    print('Accuracy using Gaussian NB for dev set-> {0:.3f}'.format(gnb.score(dev_data, dev_labels)))
    
    # theta_ holds the mean of each feature per class; print the average over features
    print(gnb.theta_.mean(axis=1))
    # sigma_ holds the variance of each feature per class; print the average over features
    print(gnb.sigma_.mean(axis=1))

    # Gaussian NB assumes each feature is normally distributed within a class.
    # Check that assumption on an example image.
    image_example = X[0]
    plt.figure(figsize=(12, 4))
    plt.subplot(1,1,1)
    fig = plt.hist(image_example)
    plt.title('Histogram of an example image')
    
    # The pixel values are not normally distributed: most are either 0 or 1. Overwrite the
    # estimated variances with a constant and see whether the Gaussian classifier improves.
    # Start above 0, since a variance of 0 makes the Gaussian density degenerate.
    new_sigma_values = np.arange(0.05, 1, 0.05)
    
    for i in new_sigma_values:
        # Set the variance of every feature in every class to i
        gnb.sigma_[:] = i
        print('Accuracy using Gaussian NB for dev set with sigma {0:.3f} -> {1:.3f} '.format(i, gnb.score(dev_data, dev_labels)))
    
    # Based on the accuracies above, the sigma values 0.05 to 0.3 are tried next as noise levels
    updated_sigma_values = [0.05,0.1,0.15,0.2,0.25,0.3]
    # Try adding normal random noise to the mini training data using these sigma values
    for sigma in updated_sigma_values:
        noised_mini_train_data = mini_train_data + np.random.normal(0.0, sigma, mini_train_data.shape)
        gnb.fit(noised_mini_train_data, mini_train_labels)
        print('Accuracy for dev set when adding random noise with sigma {0:.3f} -> {1:.3f} '.format(sigma, gnb.score(dev_data, dev_labels)))
    
    return gnb

gnb = P9()

ANSWER: 

    I found two ways to improve the Gaussian NB:
    - One by increasing the variance (sigma_) used by the model
    - Two by adding some random normal noise to the training data
    
    Gaussian NB works best when the features are roughly normally distributed, but our image data consists mostly of values that are either 0 or 1, so many of the estimated variances are very small. To improve the classifier I tried the two approaches above: setting the variance to 0.05 gave an accuracy of 0.803, and adding random normal noise with sigma 0.15 gave an accuracy of 0.818.
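
To see why a very small variance hurts, consider the Gaussian log-density of a single pixel. The sketch below uses made-up numbers (a pixel that is almost always 0 in the training data for some class) and compares the raw variance with a variance that has a constant added to it.

```python
import numpy as np

def gaussian_log_pdf(x, mean, var):
    # Log of the normal density N(x; mean, var).
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

# Hypothetical pixel: mean 0 and almost no variance within the class.
mean, tiny_var = 0.0, 1e-9
# A dev image in which this pixel is slightly on.
x = 0.1

raw = gaussian_log_pdf(x, mean, tiny_var)
floored = gaussian_log_pdf(x, mean, tiny_var + 0.05)
print(raw, floored)
```

With the tiny variance, a small deviation produces an enormous negative log-density, so one such pixel can outweigh all the others. Adding a constant to the variance (or noise to the training data, which has a similar effect) removes that penalty. Recent scikit-learn versions also expose a `var_smoothing` argument on `GaussianNB` for a similar purpose.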

(10) Because Naive Bayes is a generative model, we can use the trained model to generate digits. Train a BernoulliNB model and then generate a 10x20 grid with 20 examples of each digit. Because you're using a Bernoulli model, each pixel output will be either 0 or 1. How do the generated digits compare to the training digits?

- You can use np.random.rand() to generate random numbers from a uniform distribution
- The estimated probability of each pixel is stored in feature\_log\_prob\_. You'll need to use np.exp() to convert a log probability back to a probability.
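
The two hints above can be combined as in this minimal sketch: convert log probabilities back with `np.exp()`, draw uniform random numbers, and turn a pixel on when the draw falls below its probability. The four "pixel" probabilities are made up for illustration, and a seeded `RandomState` is used only to make the sketch repeatable.

```python
import numpy as np

rng = np.random.RandomState(0)

# Made-up log probabilities P(pixel = 1 | class) for one class with 4 pixels.
log_probs = np.log(np.array([0.95, 0.05, 0.5, 0.8]))
probs = np.exp(log_probs)

# Draw 1000 independent samples: each row is one generated "image".
# A pixel is on (1) when its uniform draw is below the pixel's probability.
samples = (rng.rand(1000, probs.shape[0]) < probs).astype(int)

# The fraction of samples with each pixel on should be close to probs.
print(samples.mean(axis=0))
```

Note that a fresh set of random numbers is drawn for every generated image; reusing one draw would produce identical copies.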

In [None]:
def P10(num_examples):

    # For binary-valued features, use BernoulliNB
    Binary_Bernoulli_model = BernoulliNB(binarize=0.333)
    # Use the mini training data to fit the model
    Binary_Bernoulli_model.fit(mini_train_data, mini_train_labels)
    # Convert the log probabilities P(pixel = 1 | digit) back to probabilities
    probs = np.exp(Binary_Bernoulli_model.feature_log_prob_)
    num_digits, num_pixels = probs.shape
    vector_size = int(np.sqrt(num_pixels))
    #Defining the colormap for image properties
    plt.rc('image', cmap='gray')
    plt.figure(figsize=(10,10))

    # One row per digit 0-9
    for i in range(num_digits):
        for j in range(num_examples):
            # Draw a fresh sample for every example: a pixel is turned on when a uniform
            # random number falls below the probability the model has estimated for it
            features = np.where(probs[i] > np.random.rand(num_pixels), 1, 0)
            # Subplot positions start at 1: row i, column j
            plt.subplot(num_digits, num_examples, 1 + i * num_examples + j)
            # Hide axes
            plt.axis('off')
            # Plot the generated digit (reshaped to a 2D matrix)
            digit = features.reshape((vector_size, vector_size))
            plt.imshow(digit)

P10(20)

ANSWER: How do the generated digits compare to the training digits?
    
    The generated digits look much fainter and are harder to read than the training digits, since each pixel is sampled independently of its neighbors.
    Still, the sampled pixels follow the estimated probabilities, so the strokes of each digit appear in the right places.

(11) Remember that a strongly calibrated classifier is roughly 90% accurate when the posterior probability of the predicted class is 0.9. A weakly calibrated classifier is more accurate when the posterior is 90% than when it is 80%. A poorly calibrated classifier has no positive correlation between posterior and accuracy.

Train a BernoulliNB model with a reasonable alpha value. For each posterior bucket (think of a bin in a histogram), you want to estimate the classifier's accuracy. So for each prediction, find the bucket the maximum posterior belongs to and update the "correct" and "total" counters.

How would you characterize the calibration for the Naive Bayes model?

In [None]:
def P11(buckets, correct, total):
    
    # Use BernoulliNB with the best alpha found above and a binarize value of 0.333
    Binary_Bernoulli_model = BernoulliNB(alpha=0.001, binarize=0.333)
    # Use the mini training data to fit the model
    Binary_Bernoulli_model.fit(mini_train_data, mini_train_labels)
    # Get the predicted labels for the dev data set
    dev_predicted_labels = Binary_Bernoulli_model.predict(dev_data)
    # Get the posterior probability of every class for each dev example
    dev_predicted_probs = Binary_Bernoulli_model.predict_proba(dev_data)
    # The maximum posterior is the probability the model assigns to its own prediction
    max_probs = dev_predicted_probs.max(axis=1)
    # True where the predicted label matches the dev label
    is_correct = dev_predicted_labels == np.asarray(dev_labels)

    for i in range(len(buckets)):
        # Bucket i covers (buckets[i-1], buckets[i]]; the first bucket covers everything up to buckets[0]
        lower = buckets[i - 1] if i > 0 else -np.inf
        in_bucket = (max_probs > lower) & (max_probs <= buckets[i])
        # Store the number of predictions and of correct predictions in this bucket
        total[i] = float(in_bucket.sum())
        correct[i] = float((in_bucket & is_correct).sum())


buckets = [0.5, 0.9, 0.999, 0.99999, 0.9999999, 0.999999999, 0.99999999999, 0.9999999999999, 1.0]
correct = [0 for i in buckets]
total = [0 for i in buckets]

P11(buckets, correct, total)

for i in range(len(buckets)):
    accuracy = 0.0
    if (total[i] > 0): accuracy = correct[i] / total[i]
    print('p(pred) <= %.13f    total = %3d  correct = %3d  accuracy = %.3f' % (buckets[i], total[i], correct[i], accuracy))

ANSWER: How would you characterize the calibration for the Naive Bayes model?

    The accuracy in the 0.5-0.9 posterior bucket is only 0.406, compared to 0.814 in the bucket ending at 0.9999999999900.
    Accuracy is higher in the higher-posterior bucket, but it is far below the posterior itself: when the posterior
    probability of the predicted class is at most 0.9, the model is right only about 40% of the time. The model is
    therefore not strongly calibrated and shows signs of a weakly calibrated classifier. So we cannot simply trust the
    predicted digit even when its posterior probability is around 0.9.
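
The bucketing itself can be expressed compactly with `np.searchsorted`. This is a sketch on made-up posteriors and outcomes, not the project's results; `bucket_accuracy` is a hypothetical helper.

```python
import numpy as np

def bucket_accuracy(max_posteriors, is_correct, buckets):
    """Count total and correct predictions per posterior bucket.

    Bucket i holds posteriors in (buckets[i-1], buckets[i]]; the first
    bucket holds everything up to and including buckets[0].
    """
    max_posteriors = np.asarray(max_posteriors, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    # side='left' places a value equal to an upper bound inside that bucket.
    idx = np.searchsorted(buckets, max_posteriors, side='left')
    # Guard against posteriors that exceed the last bound by rounding error.
    idx = np.minimum(idx, len(buckets) - 1)
    total = np.bincount(idx, minlength=len(buckets))
    correct = np.bincount(idx, weights=is_correct.astype(float), minlength=len(buckets))
    return correct, total

# Made-up maximum posteriors and whether each prediction was right.
toy_buckets = [0.5, 0.9, 1.0]
toy_posteriors = [0.4, 0.5, 0.7, 0.9, 0.95, 1.0]
toy_is_correct = [False, True, True, False, True, True]
toy_correct, toy_total = bucket_accuracy(toy_posteriors, toy_is_correct, toy_buckets)
print(toy_correct / toy_total)
```

For a strongly calibrated classifier, the per-bucket accuracy would roughly match the posterior values that fall into each bucket.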


(12) EXTRA CREDIT

Try designing extra features to see if you can improve the performance of Naive Bayes on the dev set. Here are a few ideas to get you started:
- Try summing the pixel values in each row and each column.
- Try counting the number of enclosed regions; 8 usually has 2 enclosed regions, 9 usually has 1, and 7 usually has 0.

Make sure you comment your code well!

In [None]:
#def P12():

### STUDENT START ###


### STUDENT END ###

#P12()