# Multiclass Classification
Multiclass classification is a generalization of binary classification, where instead of 2 classes, we have 3 or more.  It turns out that with the sklearn methods we have been using, it is **extremely simple** to extend our procedures from binary to multiclass classification.

One thing is tricky, however: choosing a performance metric for the model.   Our preferred metric, AUC, does not have a simple multiclass analog, since the ROC curve from which it is derived is explicitly defined for a binary classifier.

Instead, we will use metrics derived from the **confusion matrix**, including versions of recall and precision.
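
As a minimal sketch of what a confusion matrix holds, the toy example below counts (true class, predicted class) pairs for a made-up 3-class problem. The labels `y_true` and `y_pred` are hypothetical values chosen for illustration, not MNIST results.

```python
from collections import Counter

# Hypothetical toy labels for a 3-class problem (not MNIST data)
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]

# Count each (true class, predicted class) pair
counts = Counter(zip(y_true, y_pred))
classes = sorted(set(y_true) | set(y_pred))

# Rows are true classes, columns are predicted classes
matrix = [[counts[(t, p)] for p in classes] for t in classes]
for t, row in zip(classes, matrix):
    print("True:", t, row)

# Overall accuracy is the diagonal sum divided by the number of samples
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(y_true)
print("accuracy:", accuracy)
```

Correct predictions land on the diagonal, so every off-diagonal entry is a specific kind of mistake (true class in the row, predicted class in the column).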

## Classifying All 10 digits simultaneously
We will again use the MNIST sample, but this time our goal is to develop a model which, given an image of a digit from 0-9, predicts which digit the image shows.

So let's begin by reading in all of our data.   Start with the **short** sample since the full sample takes quite a bit of time to run.

In [1]:
import pandas as pd
#
# Choose which sample to read: "short_" for the short sample, "" for the full sample
#short = ""
short = "short_"

#
# Read in all ten digits, labeling each row with its digit
dfCombined = pd.DataFrame()
for digit in range(10):
    print("Processing digit ",digit)
    fname = '/fs/scratch/PAS1585/ch3/digit_' + short + str(digit) + '.csv'
    df = pd.read_csv(fname,header=None)
    df['digit'] = digit
    dfCombined = pd.concat([dfCombined, df])

print("Length of sample:     ",len(dfCombined))


Processing digit  0
Processing digit  1
Processing digit  2
Processing digit  3
Processing digit  4
Processing digit  5
Processing digit  6
Processing digit  7
Processing digit  8
Processing digit  9
Length of sample:      10000


## Multi-class performance

A good discussion of multiclass performance can be found here:  https://medium.com/usf-msds/choosing-the-right-metric-for-evaluating-machine-learning-models-part-2-86d5649a5428

In the code below, we implement two "accuracy" measures.   In our assignment later, we will add:
* "macro" averaged recall = recall averaged over all classes
    * recall for a single class = number of items correctly identified as that class out of all items that truly belong to that class: TP/(TP+FN)
* "macro" averaged precision = precision averaged over all classes
    * precision for a single class = number of items correctly identified as that class out of all items predicted to be that class: TP/(TP+FP)
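
The two definitions above can be sketched directly from a confusion matrix. This is only an illustrative sketch, not the assignment solution: the function name `macro_recall_precision` and the matrix `toy` are hypothetical, and the matrix is assumed to be a square list of rows with rows = true class and columns = predicted class.

```python
def macro_recall_precision(matrix):
    """Return (macro recall, macro precision) for a square confusion matrix
    given as a list of rows (rows = true class, columns = predicted class)."""
    n = len(matrix)
    recalls, precisions = [], []
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                       # truly class k, predicted as another class
        fp = sum(matrix[r][k] for r in range(n)) - tp  # predicted class k, truly another class
        # Guard against classes that never occur or are never predicted
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
    # "Macro" averaging: every class counts equally, regardless of its size
    return sum(recalls) / n, sum(precisions) / n

# Hypothetical 3-class confusion matrix (not MNIST results)
toy = [[3, 0, 0],
       [1, 1, 0],
       [0, 0, 5]]
macro_recall, macro_precision = macro_recall_precision(toy)
print("macro recall:   ", macro_recall)
print("macro precision:", macro_precision)
```

Recall uses row totals (everything that truly belongs to the class), while precision uses column totals (everything predicted as the class), which is why the two can differ for the same matrix.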


In [2]:
#
# Used to implement the multi-dimensional counter we need in the performance class
from collections import defaultdict
from functools import partial
from itertools import repeat
def nested_defaultdict(default_factory, depth=1):
    result = partial(defaultdict, default_factory)
    for _ in repeat(None, depth - 1):
        result = partial(defaultdict, result)
    return result()
  

#
# Determine the performance
def multiPerformance(y,y_pred,y_score,debug=False):
#
# Make our matrix
  confusionMatrix = nested_defaultdict(int,2)
  classes = set()
  totalTrue = defaultdict(int)
  totalPred = defaultdict(int)
  for i in range(len(y_pred)):
    trueClass = y[i]
    classes.add(trueClass)
    predClass = y_pred[i]
    totalTrue[trueClass] += 1
    totalPred[predClass] += 1
    confusionMatrix[trueClass][predClass] += 1

  if debug:
    for trueClass in classes:
      print("True: ",trueClass,end="")
      for predClass in classes:
        print("\t",confusionMatrix[trueClass][predClass],end="")
      print()
    print()
#
# Micro accuracy: sum the diagonal and divide by the total number of samples
# Macro accuracy: average the per-class fraction correct (diagonal / row total)
  accMicro = 0.0
  accMacro = 0.0
  for cl in classes:
    accMicro += confusionMatrix[cl][cl]
    accMacro += confusionMatrix[cl][cl]/totalTrue[cl]
  accMicro /= len(y)
  accMacro = accMacro / len(classes)
  results = {"confusionMatrix":confusionMatrix,"accuracyMicro":accMicro,"accuracyMacro":accMacro}
  return results

## runFitter Method
The runFitter method is pretty similar to what we had before.  The primary change is how the performance method is called, as well as the returned data from that method.

In [3]:
def runFitter(estimator,X_train,y_train,X_test,y_test,debug=False):
#
# Now fit to our training set
  estimator.fit(X_train,y_train)
#
# Now predict the classes and get the score for our training set
  y_train_pred = estimator.predict(X_train)
  y_train_score = estimator.decision_function(X_train)   # NOTE: some estimators have a predict_proba method instead of decision_function
#
# Now predict the classes and get the score for our test set
  y_test_pred = estimator.predict(X_test)
  y_test_score = estimator.decision_function(X_test)

#
# Now get the performance
  results_test = multiPerformance(y_test,y_test_pred,y_test_score,debug=debug)
  results_train = multiPerformance(y_train,y_train_pred,y_train_score,debug=debug)
#
# Decide what you want to return: for now, the confusion matrix plus micro and macro accuracy for both test and train
  results = {
      'cf_test':results_test['confusionMatrix'],
      'cf_train':results_train['confusionMatrix'],
      'accuracyMicro_test':results_test['accuracyMicro'],
      'accuracyMacro_test':results_test['accuracyMacro'],
      'accuracyMicro_train':results_train['accuracyMicro'],
      'accuracyMacro_train':results_train['accuracyMacro'],
}

  return results
  

## Shuffle the data
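We must shuffle the data, since the data is in digit order when we read it in.   Without shuffling, the contiguous KFold splits used below would place whole digits in a test fold that never appear in the corresponding training set.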

In [4]:
from sklearn.utils import shuffle
dfCombinedShuffle = shuffle(dfCombined,random_state=42)    # by setting the random state we will get reproducible results

X = dfCombinedShuffle[dfCombinedShuffle.columns[:784]].values    # as_matrix() has been removed from recent pandas versions
y = dfCombinedShuffle['digit'].values



## Set up kfolds

In [5]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
kfolds = 5

#skf = StratifiedKFold(n_splits=kfolds)
skf = KFold(n_splits=kfolds)


## Loop over folds
Here we loop over the folds and calculate the statistics.

In [6]:
#
# Get our estimator and predict
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

estimator = LinearSVC(random_state=42,dual=False,max_iter=500,tol=0.01)    # use dual=False when  n_samples > n_features which is what we have
#estimator = SGDClassifier(random_state=42,max_iter=500,tol=0.01)    # alternative: a linear model trained with stochastic gradient descent
#
# Create some vars to keep track of everything
avg_accuracyMicro_test = 0.0
avg_accuracyMicro_train = 0.0
avg_accuracyMacro_test = 0.0
avg_accuracyMacro_train = 0.0
numSplits = 0.0
#
# Also keep track of the confusion matrices from the last fold
#
# Now loop
lastCF_train = None
lastCF_test = None
for train_index, test_index in skf.split(X, y):
  print("Training")
  X_train = X[train_index]
  y_train = y[train_index]
  X_test = X[test_index]
  y_test = y[test_index]
  
#
# Now fit to our training set
  results = runFitter(estimator,X_train,y_train,X_test,y_test)
#
# Accumulate the metrics for this fold
  avg_accuracyMicro_test += results['accuracyMicro_test']
  avg_accuracyMicro_train += results['accuracyMicro_train']
  avg_accuracyMacro_test += results['accuracyMacro_test']
  avg_accuracyMacro_train += results['accuracyMacro_train']
  lastCF_train = results['cf_train']
  lastCF_test = results['cf_test']
  numSplits += 1.0
  print("   Split ",numSplits,"; accuracyMicro test/train",results['accuracyMicro_test'],results['accuracyMicro_train'],"; accuracyMacro test/train",results['accuracyMacro_test'],results['accuracyMacro_train'])
#
avg_accuracyMicro_test /= numSplits
avg_accuracyMicro_train /= numSplits
avg_accuracyMacro_test /= numSplits
avg_accuracyMacro_train /= numSplits
print("average accuracyMicro test:  ",avg_accuracyMicro_test)
print("average accuracyMicro train: ",avg_accuracyMicro_train)
print("average accuracyMacro test:  ",avg_accuracyMacro_test)
print("average accuracyMacro train: ",avg_accuracyMacro_train)
print("Test confusion matrix")
for trueClass in range(10):
  print("True: ",trueClass,end="")
  for predClass in range(10):
    print("\t",lastCF_test[trueClass][predClass],end="")
  print()
print()
print("Train confusion matrix")
for trueClass in range(10):
  print("True: ",trueClass,end="")
  for predClass in range(10):
    print("\t",lastCF_train[trueClass][predClass],end="")
  print()
print()



Training
   Split  1.0 ; accuracyMicro test/train 0.88 0.9695 ; accuracyMacro test/train 0.878196276550469 0.9696235926606676
Training
   Split  2.0 ; accuracyMicro test/train 0.876 0.971625 ; accuracyMacro test/train 0.8769621816979397 0.9715621779654109
Training
   Split  3.0 ; accuracyMicro test/train 0.884 0.971125 ; accuracyMacro test/train 0.8836038491939668 0.9712211524207893
Training
   Split  4.0 ; accuracyMicro test/train 0.878 0.971 ; accuracyMacro test/train 0.8800539279914761 0.9708348505351652
Training
   Split  5.0 ; accuracyMicro test/train 0.8745 0.9705 ; accuracyMacro test/train 0.8739619219012592 0.9705410412994956
average accuracyMicro test:   0.8785000000000001
average accuracyMicro train:  0.9707500000000001
average accuracyMacro test:   0.878555631467022
average accuracyMacro train:  0.9707565629763056
Test confusion matrix
True:  0	 194	 0	 3	 0	 0	 1	 4	 0	 0	 0
True:  1	 1	 188	 4	 2	 0	 1	 0	 1	 1	 0
True:  2	 4	 4	 154	 5	 3	 1	 6	 2	 9	 1
True:  3	 2	 1	 6	

## Examine the results
When you run this using the full data sample, it is instructive to examine which digits are misclassified.   Also, note that the matrix is not symmetric, though it is nearly so.
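
One way to make this examination concrete is to rank the off-diagonal entries and to measure how far the matrix is from symmetric. The sketch below uses hypothetical helper names (`most_confused`, `asymmetry`) and a made-up 3-class matrix. It only indexes `matrix[true][pred]`, so it should also accept the nested dictionaries stored in `lastCF_test`, assuming the classes are the integers 0 to n-1 and all of them occur as true labels.

```python
def most_confused(matrix, top=3):
    """Return the largest off-diagonal entries as (count, true, predicted)."""
    n = len(matrix)
    pairs = [(matrix[t][p], t, p)
             for t in range(n) for p in range(n)
             if t != p and matrix[t][p] > 0]
    return sorted(pairs, reverse=True)[:top]

def asymmetry(matrix):
    """Sum of |M[t][p] - M[p][t]| over all pairs t < p; 0 means symmetric."""
    n = len(matrix)
    return sum(abs(matrix[t][p] - matrix[p][t])
               for t in range(n) for p in range(t + 1, n))

# Hypothetical 3-class confusion matrix (not MNIST results)
toy = [[5, 0, 1],
       [0, 4, 2],
       [3, 1, 6]]
worst = most_confused(toy)
for count, t, p in worst:
    print("true", t, "predicted as", p, ":", count, "times")
print("asymmetry:", asymmetry(toy))
```

An asymmetric pair means the confusion is directional: in the toy matrix, class 2 is mistaken for class 0 more often than class 0 is mistaken for class 2.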