# Assignment 6: Multi-class classification

For this assignment, there are two basic tasks and an open ended additional task for more challenges:
1.  Add the "macro" averaged precision and recall to the **multiPerformance** method from the multicassv2 exercise (copy all of the relevant code to this module).
2.  Add one more class - letters - to your multi-class classifier.   So your classifier will have a total of 11 classes.   The letters data sample is here:
            *  Shorter (1000 samples): /fs/scratch/PAS1585/emnist/emnist_letters_shuffled_1k.csv
            *  Longer (7000 samples): /fs/scratch/PAS1585/emnist/emnist_letters_shuffled_7k.csv
      Note that each of these files has a random sample of 26 upper and lower case english letters.
      
3.  Extra stuff: if you have time and are looking to expand your abilities, try your hand at data augmentation of the MNIST dataset.   The idea is to examine how to increase (or augment) your data sample, by resampling your existing data sample.  Here are some ideas:
             * shift randomly, by 1-2 pixels, the images up/down/left/right
             * random rotations by a few degrees.
             
      First verify that your modifications are working (by displaying the images), then make a "test" set using the augmented images only, and compare how it performs using a training set drawn from the original (non-augmented) data set

In [51]:
'''IMPORT DATA'''
import pandas as pd
#
# Define our "signal" digit
short = ""
#short = "short_"

#
# Read in all of the other digits
dfCombined = pd.DataFrame()
for digit in range(10):
    print("Processing digit ",digit)
    fname = '/fs/scratch/PAS1585/ch3/digit_' + short + str(digit) + '.csv'
    df = pd.read_csv(fname,header=None)
    df['digit'] = digit
    dfCombined = pd.concat([dfCombined, df])

print("Length of sample:     ",len(dfCombined))


Processing digit  0
Processing digit  1
Processing digit  2
Processing digit  3
Processing digit  4
Processing digit  5
Processing digit  6
Processing digit  7
Processing digit  8
Processing digit  9
Length of sample:      10000


In [61]:
# Used to implement the multi-dimensional counter we need in the performance class
from collections import defaultdict
from functools import partial
from itertools import repeat
def nested_defaultdict(default_factory, depth=1):
    result = partial(defaultdict, default_factory)
    for _ in repeat(None, depth - 1):
        result = partial(defaultdict, result)
    return result()
  

# Determine the performance
def multiPerformance(y,y_pred,y_score,debug=False):

# Make our matrix
    confusionMatrix = nested_defaultdict(int,2)
    classes = set()
    totalTrue = defaultdict(int)
    totalPred = defaultdict(int)

    for i in range(len(y_pred)):
        trueClass = y[i]
        classes.add(trueClass)
        predClass = y_pred[i]
        totalTrue[trueClass] += 1
        totalPred[predClass] += 1
        confusionMatrix[trueClass][predClass] += 1
        
    recall = 0
    precision = 0
    for trueClass in classes:
        
        TP = confusionMatrix[trueClass][trueClass]
        
        for predClass in classes:
            
            if predClass != trueClass:
                
                FP = confusionMatrix[predClass][trueClass]
                FN = confusionMatrix[trueClass][predClass]
                TN = confusionMatrix[predClass][predClass]
        
        
        recall += TP / (TP + FN)
        precision += TP / (TP + FP)
        
    RecallMacro = recall/len(classes)
    PrecisionMacro = precision/len(classes)
        
    results = {"confusionMatrix":confusionMatrix,"RecallMacro":RecallMacro,"PrecisionMacro":PrecisionMacro}
        
    return results

In [69]:
def runFitter(estimator,X_train,y_train,X_test,y_test,debug=False):
#
# Now fit to our training set
  estimator.fit(X_train,y_train)
#
# Now predict the classes and get the score for our traing set
  y_train_pred = estimator.predict(X_train)
  y_train_score = estimator.decision_function(X_train)   # NOTE: some estimators have a predict_prob method instead od descision_function
#
# Now predict the classes and get the score for our test set
  y_test_pred = estimator.predict(X_test)
  y_test_score = estimator.decision_function(X_test)

#
# Now get the performaance
  results_test = multiPerformance(y_test,y_test_pred,y_test_score,debug=False)
  results_train = multiPerformance(y_train,y_train_pred,y_train_score,debug=False)
#
# Decide what you want to return: for now, just precision, recall, and auc for both test and train
  results = {
      'cf_test':results_test['confusionMatrix'],
      'cf_train':results_train['confusionMatrix'],
      'RecallMacro_test':results_test['RecallMacro'],
      'PrecisionMacro_test':results_test['PrecisionMacro'],
      'RecallMacro_train':results_train['RecallMacro'],
      'PrecisionMacro_train':results_train['PrecisionMacro'],
}

  return results
  

In [70]:
fname = '/fs/scratch/PAS1585/emnist/emnist_letters_shuffled_7k.csv'

df_letters = pd.read_csv(fname,header=None)
df_letters['digit'] = 10

In [71]:
df_letters.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,digit
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,10
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,10
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,10
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,10
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,10


In [72]:
dfCombined_Letters = pd.concat([dfCombined, df_letters])
dfCombined_Letters.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11000 entries, 0 to 999
Columns: 785 entries, 0 to digit
dtypes: int64(785)
memory usage: 66.0 MB


In [73]:
from sklearn.utils import shuffle
dfCombinedShuffle = shuffle(dfCombined_Letters,random_state=42)    # by setting the random state we will get reproducible results

X = dfCombinedShuffle.as_matrix(columns=dfCombinedShuffle.columns[:784])
y = dfCombinedShuffle['digit'].values

  after removing the cwd from sys.path.


In [74]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
kfolds = 5

#skf = StratifiedKFold(n_splits=kfolds)
skf = KFold(n_splits=kfolds)

In [82]:
#
# Get our estimator and predict
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

estimator = LinearSVC(random_state=42,dual=False,max_iter=500,tol=0.01)    # use dual=False when  n_samples > n_features which is what we have
#estimator = SGDClassifier(random_state=42,max_iter=500,tol=0.01)    # use dual=False when  n_samples > n_features which is what we have
#
# Cresate some vars to keep track of everything
avg_RecallMacro_test = 0.0
avg_RecallMacro_train = 0.0
avg_PrecisionMacro_test = 0.0
avg_PrecisionMacro_train = 0.0
numSplits = 0.0
#
# Also keep track of the 
#
# Now loop
lastCF_train = None
lastCF_test = None

for train_index, test_index in skf.split(X, y):
    print("Training")
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]

#
# Now fit to our training set
    results = runFitter(estimator,X_train,y_train,X_test,y_test)
#
# 
    avg_RecallMacro_test += results['RecallMacro_test']
    avg_RecallMacro_train += results['RecallMacro_train']
    avg_PrecisionMacro_test += results['PrecisionMacro_test']
    avg_PrecisionMacro_train += results['PrecisionMacro_train']
    lastCF_train = results['cf_train']
    lastCF_test = results['cf_test']
    numSplits += 1.0
    print("   Split ",numSplits,"; RecallMacro test/train",results['RecallMacro_test'],results['RecallMacro_train'],"; PrecisionMacro test/train",results['PrecisionMacro_test'],results['PrecisionMacro_train'])

avg_RecallMacro_test /= numSplits
avg_RecallMacro_train /= numSplits
avg_PrecisionMacro_test /= numSplits
avg_PrecisionMacro_train /= numSplits

print("average RecallMacro test:  ",avg_RecallMacro_test)
print("average RecallMacro train: ",avg_RecallMacro_train)
print("average PrecisionMacro test:  ",avg_PrecisionMacro_test)
print("average PrecisionMacro train: ",avg_PrecisionMacro_train)
print("Test confusion matrix")

n = 11

for trueClass in range(n):
  print("True: ",trueClass,end="")
  for predClass in range(n):
    print("\t",lastCF_test[trueClass][predClass],end="")
  print()
print()
print("Train confusion matrix")
for trueClass in range(n):
  print("True: ",trueClass,end="")
  for predClass in range(n):
    print("\t",lastCF_train[trueClass][predClass],end="")
  print()
print()



Training
   Split  1.0 ; RecallMacro test/train 0.9891404210216713 0.9958037492389938 ; PrecisionMacro test/train 0.9911456904220867 0.9901176402005092
Training
   Split  2.0 ; RecallMacro test/train 0.9787626644934986 0.9962992790691189 ; PrecisionMacro test/train 0.9858358929755494 0.9928150291184431
Training
   Split  3.0 ; RecallMacro test/train 0.9773808421688959 0.9966585674353127 ; PrecisionMacro test/train 0.9875100890884162 0.9918842706782951
Training
   Split  4.0 ; RecallMacro test/train 0.9782981368291442 0.9966227001332303 ; PrecisionMacro test/train 0.9861421955039043 0.9930460339090026
Training
   Split  5.0 ; RecallMacro test/train 0.9810720481502062 0.9967901386105786 ; PrecisionMacro test/train 0.9811750812669842 0.9926869045017608
average RecallMacro test:   0.980930822532683
average RecallMacro train:  0.9964348868974469
average PrecisionMacro test:   0.9863617898513881
average PrecisionMacro train:  0.9921099756816021
Test confusion matrix
True:  0	 188	 0	 4	 0	 0