# GTI770 - Systèmes intelligents et apprentissage machine

#### Alessandro L. Koerich

## Jupyter Notebook - 6_Bayes_Normal - Optdigits Dataset

##### Created: May 2018
##### Revised: Jan 2019

----
## Title of Database: Optical Recognition of Handwritten Digits


* #### Relevant Information:
	We used preprocessing programs made available by NIST to extract
	normalized bitmaps of handwritten digits from a preprinted form. From
	a total of 43 people, 30 contributed to the training set and different
	13 to the test set. 32x32 bitmaps are divided into nonoverlapping 
	blocks of 4x4 and the number of on pixels are counted in each block.
	This generates an input matrix of 8x8 where each element is an 
	integer in the range 0..16. This reduces dimensionality and gives 
	invariance to small distortions.

* #### Number of Instances
	optdigits.tra	Training	3823
	optdigits.tes	Testing		1797
	
	The way we used the dataset was to use half of training for 
	actual training, one-fourth for validation and one-fourth
	for writer-dependent testing. The test set was used for 
	writer-independent testing and is the actual quality measure.

* #### Number of Attributes
	64 inputs + 1 class attribute

* #### For Each Attribute:
	All input attributes are integers in the range 0..16.
	The last attribute is the class code 0..9
    
* #### Class Distribution


	No of examples in training set
	 0.  376
	 1.  389
	 2.  380
	 3.  389
	 4.  387
	 5.  376
	 6.  377
	 7.  387
	 8.  380
	 9.  382


	No of examples in testing set
	 0.  178
	 1.  182
	 2.  177
	 3.  183
	 4.  181
	 5.  182
	 6.  181
	 7.  179
	 8.  174
	 9.  180
    
 ----
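
The block-counting preprocessing described under "Relevant Information" can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the NIST preprocessing code: the function name `block_counts` and the random bitmap are assumptions made for the example.

```python
import numpy as np

def block_counts(bitmap, block=4):
    """Count the 'on' pixels in each non-overlapping block x block tile."""
    rows, cols = bitmap.shape
    # Split each axis into (number of tiles, tile size), then sum inside each tile
    tiles = bitmap.reshape(rows // block, block, cols // block, block)
    return tiles.sum(axis=(1, 3))

# Hypothetical 32x32 binary bitmap (random values, for illustration only)
rng = np.random.default_rng(0)
bitmap = rng.integers(0, 2, size=(32, 32))

features = block_counts(bitmap)   # 8x8 matrix, each entry in 0..16
x = features.reshape(-1)          # flattened into the 64 input attributes
print(features.shape, x.shape, features.min(), features.max())
```

Each 4x4 tile holds 16 pixels, which is why every attribute falls in the range 0..16.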

In [None]:
# Imports

import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB

In [None]:
# Load TRAIN, TEST, UNKNOWN CLASS data from files
# Numeric inputs and outputs
num_features    = 64
data_train      = np.loadtxt("CSV_Files/optdigits-train.arff.csv", delimiter="," , skiprows=1)
data_test       = np.loadtxt("CSV_Files/optdigits-test.arff.csv", delimiter="," , skiprows=1)
data_unlabelled = np.loadtxt("CSV_Files/optdigits-predict-nolabels.arff.csv", delimiter="," , skiprows=1)

In [None]:
data_train.shape, data_test.shape, data_unlabelled.shape

In [None]:
# Separate inputs (features) from outputs (labels)
X_train       = data_train[:,0:num_features]
Y_train       = data_train[:,num_features] # last column = class labels
X_test        = data_test[:,0:num_features]
Y_test        = data_test[:,num_features] # last column = class labels
X_unlabelled  = data_unlabelled[:,0:num_features]
# Y_unlabelled  = ??? We don't have the labels, so there is no Y_unlabelled!!

In [None]:
X_train
# 64 columns = inputs

In [None]:
Y_train
# last column = output = class labels

## Scikit-Learn Naïve Bayes Documentation

* http://scikit-learn.org/stable/modules/naive_bayes.html

* http://scikit-learn.org/stable/modules/classes.html#module-sklearn.naive_bayes

### Decide on the Probability Distribution

Now that our data is ready for training the model, we first need to choose an appropriate probability distribution to model/represent the data.
 
Which probability distributions should we use?

    1. Bernoulli distribution: discrete features, 2 possible states (binary features)
    
    2. Multinomial distribution: discrete features, 3 or more possible states (n-ary features)
    
    3. Normal distribution: real-valued features

If you choose:

    1. BernoulliNB implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a binary-valued (Bernoulli, boolean) variable.
        
    2. MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, where the data are typically represented as counts.
        
    3. GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian.
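
To make the Gaussian assumption concrete, here is a minimal NumPy sketch of what a Gaussian naive Bayes classifier computes: one mean and one variance per class and per feature, followed by a class decision based on the log prior plus the summed per-feature log-likelihoods. It is not scikit-learn's implementation. The helper names `fit_gaussian_nb` and `predict_gaussian_nb`, the `eps` term and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate class priors, per-class feature means and per-class feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # eps keeps every variance strictly positive (illustrative choice)
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + eps
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class maximising log P(c) + sum_j log N(x_j | mean_cj, var_cj)."""
    # diff has shape (n_samples, n_classes, n_features)
    diff = X[:, np.newaxis, :] - means[np.newaxis, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances).sum(axis=2)
    return classes[np.argmax(np.log(priors) + log_lik, axis=1)]

# Synthetic two-class data, for illustration only
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),
                    rng.normal(4.0, 1.0, size=(100, 3))])
y_demo = np.array([0] * 100 + [1] * 100)

params = fit_gaussian_nb(X_demo, y_demo)
y_hat = predict_gaussian_nb(X_demo, *params)
print("Training accuracy on the synthetic data:", np.mean(y_hat == y_demo))
```

A feature that is constant within a class would have zero variance, which is why the sketch adds `eps`. scikit-learn's `GaussianNB` has a `var_smoothing` parameter that serves a similar purpose.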
        

In [None]:
# Train the naive Bayes model with the training set
# model = BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
# model = MultinomialNB(fit_prior=True)
model = GaussianNB()
model = model.fit(X_train, Y_train)

In [None]:
# Show the parameters of the fitted Gaussian (normal) model
# You can change all these parameters
# See the documentation
model

In [None]:
# Use the model to predict the class of samples
# Note that we are predicting on the training set here
Y_train_pred = model.predict(X_train)
Y_train_pred

In [None]:
# You can also predict the probability of each class
# train dataset
Y_train_pred_prob = model.predict_proba(X_train)
Y_train_pred_prob
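
As a quick sanity check on how `predict_proba` relates to `predict`, the sketch below uses a small synthetic dataset (an assumption for the example, not the Optdigits data). Each row of the probability matrix should sum to 1, and its columns follow the order of `classes_`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic three-class data, for illustration only
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc, 1.0, size=(50, 4)) for loc in (0.0, 3.0, 6.0)])
y_demo = np.repeat([0, 1, 2], 50)

demo_model = GaussianNB().fit(X_demo, y_demo)
proba = demo_model.predict_proba(X_demo)   # shape (n_samples, n_classes)

# Each row is a probability distribution over demo_model.classes_
print(proba.shape, np.allclose(proba.sum(axis=1), 1.0))

# The most probable class in each row should agree with predict()
top_class = demo_model.classes_[np.argmax(proba, axis=1)]
print(np.mean(top_class == demo_model.predict(X_demo)))
```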

In [None]:
# Evaluation metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [None]:
acc_digits_data = accuracy_score(Y_train, Y_train_pred )
print("Correct classification rate for the training dataset = "+str(acc_digits_data*100)+"%")

In [None]:
from sklearn.metrics import classification_report

In [None]:
target_names = ['0','1','2','3','4','5','6','7','8','9']
print( classification_report(Y_train, Y_train_pred, target_names=target_names))
# This works, but we have labels with no predicted samples

In [None]:
cm_digits_data = confusion_matrix(Y_train, Y_train_pred )
cm_digits_data
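
In the matrix returned by `confusion_matrix`, rows correspond to the true class and columns to the predicted class. The sketch below shows how accuracy, per-class recall and per-class precision can be read off such a matrix. The 3x3 matrix is made up for the example.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true class, columns = predicted class)
cm_demo = np.array([[50,  2,  0],
                    [ 3, 40,  7],
                    [ 0,  5, 45]])

accuracy  = np.trace(cm_demo) / cm_demo.sum()        # correct predictions / all samples
recall    = np.diag(cm_demo) / cm_demo.sum(axis=1)   # diagonal / row sums (per true class)
precision = np.diag(cm_demo) / cm_demo.sum(axis=0)   # diagonal / column sums (per predicted class)

print("accuracy :", accuracy)
print("recall   :", recall)
print("precision:", precision)
```

Dividing each row by its sum, as the `normalize=True` option of the plotting function further down does, puts the per-class recall on the diagonal.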

In [None]:
import itertools
import matplotlib.pyplot as plt

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap, aspect = 'auto')
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    #plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
np.set_printoptions(precision=2)

In [None]:
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cm_digits_data, classes = ['0','1','2','3','4','5','6','7','8','9'],
                      title='Confusion matrix, without normalization')

In [None]:
plt.show()

In [None]:
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cm_digits_data, classes = ['0','1','2','3','4','5','6','7','8','9'],
                      normalize=True,
                      title='Confusion matrix, with normalization')

In [None]:
plt.show()

## OK, but TRAINING and TESTING on the same dataset does not give us a fair evaluation of the model...

So, to make a fair evaluation, we need to split our dataset into TRAIN and VALID partitions, or use another evaluation scheme:

1. HOLD-OUT: hold out part of the available data as a validation (or test) set

2. k-FOLD CROSS VALIDATION (CV): In k-fold CV, the training set is split into k smaller sets and for each of the k “folds”:
        
        -- A model is trained using k-1 of the folds as training data;
        
        -- the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
        
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. 

3. Leave-One-Out Cross Validation (LOOCV): each training set is created by taking all the samples except one, and the test set is the single sample left out. Thus, for n samples, we have n different training sets and n different test sets.


## 1) HOLD-OUT
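
When a dataset does not come pre-split, a hold-out partition can be built with `train_test_split`. The sketch below reproduces the half / one-fourth / one-fourth scheme mentioned in the dataset description by splitting twice. The data is synthetic, and the names `X_all`, `X_tr`, `X_val` and `X_te` are assumptions chosen so that the notebook's own variables are not touched.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labelled dataset: 400 samples, 4 balanced classes
rng = np.random.default_rng(0)
X_all = rng.normal(size=(400, 8))
y_all = np.repeat([0, 1, 2, 3], 100)

# First split: half for training, the other half set aside
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_all, y_all, test_size=0.5, random_state=0, stratify=y_all)

# Second split: divide the set-aside half into validation and test quarters
X_val, X_te, y_val, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)

print(len(y_tr), len(y_val), len(y_te))
print(np.bincount(y_tr), np.bincount(y_val), np.bincount(y_te))
```

Passing `stratify` keeps the class proportions of each partition close to those of the full dataset.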


In [None]:
# Using hold-out evaluation
from sklearn.model_selection import train_test_split

In [None]:
# The data is already split into train and valid/test...
# So, we don't need to use the "train_test_split"
# X_train, X_valid, Y_train, Y_valid = train_test_split( X_data, Y_data, test_size=0.4, random_state=0, shuffle=True,
#                                                     stratify=Y_data)

In [None]:
# Evaluating the test dataset 
Y_test_pred      = model.predict(X_test)
Y_test_pred_prob = model.predict_proba(X_test)
acc_digits_test  = accuracy_score(Y_test, Y_test_pred )
print("Correct classification rate for the test dataset = "+str(acc_digits_test*100)+"% on "+str(Y_test.shape[0])+" samples")


## 2) k-fold CROSS VALIDATION
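
Before relying on `cross_val_score`, it can help to see the loop it replaces. The sketch below runs k-fold CV by hand on synthetic data (an assumption for the example) and prints the result next to that of `cross_val_score`. For a classifier and an integer `cv`, scikit-learn uses stratified folds, so the manual loop uses `StratifiedKFold` to match.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic three-class data, for illustration only
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc, 1.5, size=(60, 5)) for loc in (0.0, 2.0, 4.0)])
y_demo = np.repeat([0, 1, 2], 60)

k = 5
fold_scores = []
for train_idx, valid_idx in StratifiedKFold(n_splits=k).split(X_demo, y_demo):
    # Train on k-1 folds
    fold_model = GaussianNB().fit(X_demo[train_idx], y_demo[train_idx])
    # Validate on the remaining fold (score() returns the accuracy)
    fold_scores.append(fold_model.score(X_demo[valid_idx], y_demo[valid_idx]))
fold_scores = np.array(fold_scores)

auto_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=k)
print("Manual loop:    ", fold_scores, "mean =", fold_scores.mean())
print("cross_val_score:", auto_scores, "mean =", auto_scores.mean())
```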


In [None]:
# Using k-fold cross validation (CV) evaluation
from sklearn.model_selection import cross_val_score

In [None]:
# Just to play with CV, let's concatenate the train and the test sets...
# Usually we don't do that
X_data = np.concatenate( (X_train, X_test), axis=0)
Y_data = np.concatenate( (Y_train, Y_test), axis=0)
print(X_data.shape, Y_data.shape)

In [None]:
# Evaluate our model using 10-fold CV 
model = GaussianNB()
np.set_printoptions(precision=5)
scores = cross_val_score(model, X_data, Y_data, cv=10)
scores

In [None]:
print("k-fold cross validation accuracy: %0.5f%% (+/- %0.5f%%)" % (scores.mean()*100, scores.std() * 2 * 100))


## 3) Leave-One-Out Cross Validation (LOOCV)
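
LOOCV can also be run through `cross_val_score` by passing a `LeaveOneOut` splitter as `cv`, which avoids writing the loop by hand. A minimal sketch on a small synthetic dataset (an assumption for the example, kept small because LOOCV trains one model per sample):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Small synthetic two-class dataset, for illustration only
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(30, 4)),
                    rng.normal(3.0, 1.0, size=(30, 4))])
y_demo = np.repeat([0, 1], 30)

# One score per left-out sample: 1.0 if it was classified correctly, 0.0 otherwise
loo_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=LeaveOneOut())
print(len(loo_scores), np.unique(loo_scores))
print("LOOCV accuracy:", loo_scores.mean())
```

Since every individual score is either 0 or 1, their standard deviation is determined by the accuracy itself and should not be read like the fold-to-fold spread of k-fold CV.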


In [None]:
# Using leave-one-out cross validation (LOOCV) evaluation
from sklearn.model_selection import LeaveOneOut

In [None]:
# Create n data splits, where n is the total number of samples
# 5,620 in our case
loo = LeaveOneOut()
loo.get_n_splits(X_data)

In [None]:
# Train 5,620 models, one per data split, and test each model on its single held-out sample.
# Loop-local names (X_tr, X_te, ...) keep the original X_train / X_test arrays from being overwritten.
acc = np.zeros(loo.get_n_splits(X_data))
for index, (train_index, test_index) in enumerate(loo.split(X_data)):
    X_tr, X_te = X_data[train_index], X_data[test_index]
    Y_tr, Y_te = Y_data[train_index], Y_data[test_index]
    # Train the model on X_tr, Y_tr
    model = GaussianNB()
    model = model.fit(X_tr, Y_tr)
    # Use the learned model to predict the held-out sample X_te
    Y_te_pred = model.predict(X_te)
    acc_digits_valid = accuracy_score(Y_te, Y_te_pred)
    print("Correct classification rate for the model "+str(index+1)+": "+str(acc_digits_valid*100)+"%")
    acc[index] = acc_digits_valid

In [None]:
print("Accuracy: %0.5f%% (+/- %0.5f%%)" % (acc.mean()*100, acc.std() * 2 * 100))


## Now, we want to use our learned model to predict the labels of new data

### The unlabeled data from optdigits-predict-nolabels.arff.csv

### WHAT MODEL MUST WE USE?

    1. From hold-out?
    2. From k-fold CV?
    3. From LOOCV?
    4. None of them?


In [None]:
# Which model will be our FINAL MODEL?
# model = 
model = model.fit(X_train, Y_train)

In [None]:
# Making prediction on the unlabelled dataset (X_unlabelled)
Y_test_pred = model.predict(X_unlabelled)
Y_test_pred_prob = model.predict_proba(X_unlabelled)

In [None]:
print(Y_test_pred)

In [None]:
print(Y_test_pred_prob)

In [None]:
print("Notebook ended")