# Pattern Recognition (Mustererkennung): Exercise Sheet 3

Questions: [florian.hartmann@fu-berlin.de](mailto:florian.hartmann@fu-berlin.de?subject=[Mustererkennung]). Please use an email subject that starts with [ME] or [Mustererkennung].

In [1]:
import pandas as pd
import numpy as np

In [2]:
digits = range(10)
base_path = "data/digits/"

## Loading the training data

In [3]:
Xs = []
y = []

for digit in digits:
    Xi = pd.read_csv(base_path + "train.%d.csv" % digit, header=None)
    # DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    Xs.append(Xi.to_numpy())
    y += [digit] * len(Xi)
    
X = np.concatenate(Xs)
y = np.array(y)

## Loading the test data

In [4]:
# The first column holds the label, the remaining columns the features
test_data = pd.read_csv(base_path + "test.ssv", delimiter=" ", header=None).to_numpy()
X_test = test_data[:, 1:]
y_test = test_data[:, 0]

## Generic base class for classifiers

In [5]:
class Classifier:
    def score(self, X, y):
        predictions = self.predict(X)
        return np.mean(predictions == y)
    
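    def confusion_matrix(self, X, y):
        # Assumes integer labels 0..k-1, since labels are used directly as indices.
        # Rows index the predicted label, columns the true label.
        size = len(set(y))
        predicted = self.predict(X)
        
        results = np.zeros((size, size), dtype=np.int32)

        for pi, yi in zip(predicted, y):
            results[int(pi)][int(yi)] += 1

        return results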
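
To illustrate how a subclass plugs into this base class, the sketch below defines a hypothetical `ConstantClassifier` (not part of the exercise) that always predicts one fixed label. The base class is repeated so the snippet runs on its own, and the toy arrays are made up.

```python
import numpy as np


class Classifier:
    def score(self, X, y):
        predictions = self.predict(X)
        return np.mean(predictions == y)

    def confusion_matrix(self, X, y):
        size = len(set(y))
        predicted = self.predict(X)
        results = np.zeros((size, size), dtype=np.int32)
        for pi, yi in zip(predicted, y):
            results[int(pi)][int(yi)] += 1
        return results


class ConstantClassifier(Classifier):
    """Hypothetical toy classifier: always predicts one fixed label."""

    def __init__(self, label):
        self.label = label

    def predict(self, X):
        return np.full(len(X), self.label)


X_toy = np.zeros((4, 2))        # features are ignored by this classifier
y_toy = np.array([0, 0, 1, 1])  # labels 0..k-1, as confusion_matrix expects

clf_toy = ConstantClassifier(0)
accuracy_toy = clf_toy.score(X_toy, y_toy)
confusion_toy = clf_toy.confusion_matrix(X_toy, y_toy)
```

Any subclass only has to provide `predict`; `score` and `confusion_matrix` are inherited.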

## Splitting the data set

In [6]:
def join_classes(X_pos, X_neg):
    X_joined = np.concatenate((X_pos, X_neg))
    y_joined = np.array([1] * len(X_pos) + [-1] * len(X_neg))
    return X_joined, y_joined

In [7]:
def split_train_data(train_data, positive_class, negative_class):
    X_pos = train_data[positive_class]
    X_neg = train_data[negative_class]
    return join_classes(X_pos, X_neg)

In [8]:
def split_test_data(test_data, positive_class, negative_class):
    X_pos = test_data[test_data[:, 0] == positive_class, 1:]
    X_neg = test_data[test_data[:, 0] == negative_class, 1:]
    return join_classes(X_pos, X_neg)
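
As a small usage sketch, the snippet below applies `split_test_data` to a made-up array in the same layout as the test file (label in column 0, features after it). The two helper functions are repeated so the snippet is self-contained; the values are illustrative only.

```python
import numpy as np


def join_classes(X_pos, X_neg):
    X_joined = np.concatenate((X_pos, X_neg))
    y_joined = np.array([1] * len(X_pos) + [-1] * len(X_neg))
    return X_joined, y_joined


def split_test_data(test_data, positive_class, negative_class):
    X_pos = test_data[test_data[:, 0] == positive_class, 1:]
    X_neg = test_data[test_data[:, 0] == negative_class, 1:]
    return join_classes(X_pos, X_neg)


# Made-up rows: [label, feature_1, feature_2]
toy_test = np.array([
    [3.0, 0.1, 0.2],
    [8.0, 0.3, 0.4],
    [5.0, 0.5, 0.6],
    [3.0, 0.7, 0.8],
])

# Class 3 becomes +1, class 8 becomes -1; the row with label 5 is dropped
X_pair, y_pair = split_test_data(toy_test, 3, 8)
```

Note that the positive samples always come first in the result, so the original row order is not preserved.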

In [9]:
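def get_subset(X, y, labels):
    # Boolean mask selecting the samples whose label is in `labels`.
    # np.isin avoids `reduce`, which is no longer a builtin in Python 3.
    indices = np.isin(y, labels)
    return X[indices], y[indices]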
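
The label mask can be built in more than one way. The sketch below, on made-up arrays, checks that OR-ing one comparison per label with `functools.reduce` gives the same mask as `np.isin`.

```python
from functools import reduce

import numpy as np

X_toy = np.arange(12).reshape(6, 2)
y_toy = np.array([3, 5, 7, 8, 0, 3])
wanted = [3, 8]

# Variant 1: OR together one boolean mask per label
mask_reduce = reduce(np.logical_or, [y_toy == label for label in wanted])

# Variant 2: membership test in a single call
mask_isin = np.isin(y_toy, wanted)

X_sel, y_sel = X_toy[mask_isin], y_toy[mask_isin]
```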

## Gaussian classifier

Fitting: compute the mean vector and the covariance matrix for each class.

Prediction: for each class, evaluate the probability density of the sample under that class's Gaussian (using the previously computed mean vector and covariance matrix), then predict the class with the highest density.
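
For reference, the rule can be written out as follows. The symbols are notation chosen for this note and do not appear in the notebook: $x \in \mathbb{R}^d$ is a sample, and $\mu_k$ and $\Sigma_k$ are the mean vector and covariance matrix of class $k$.

$$
p(x \mid k) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma_k}} \exp\left(-\tfrac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1} (x-\mu_k)\right),
\qquad
\hat{y}(x) = \arg\max_k \; p(x \mid k)
$$

Comparing the class-conditional densities directly, without weighting them by class frequencies, amounts to assuming equal class priors. The code in this notebook uses the pseudo-inverse of $\Sigma_k$ in place of $\Sigma_k^{-1}$.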

In [10]:
from numpy import sqrt, pi, e
from numpy.linalg import pinv, det, slogdet

In [11]:
def covariance_matrix(X, mu):
    # Maximum-likelihood estimate: center the data, then normalize by n (not n - 1)
    num_samples, _ = X.shape
    X_centered = X - mu
    return X_centered.T.dot(X_centered) / num_samples
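
As a sanity check, this estimator should agree with NumPy's `np.cov` when the latter is asked for the biased estimate (`bias=True`) with samples in rows (`rowvar=False`). The sketch below compares the two on random toy data; the function is repeated so the snippet runs on its own.

```python
import numpy as np


def covariance_matrix(X, mu):
    num_samples, _ = X.shape
    X_centered = X - mu
    return X_centered.T.dot(X_centered) / num_samples


rng = np.random.RandomState(0)
X_toy = rng.randn(50, 3)      # 50 samples, 3 features
mu_toy = X_toy.mean(axis=0)

sigma_own = covariance_matrix(X_toy, mu_toy)
sigma_np = np.cov(X_toy, rowvar=False, bias=True)

# Largest elementwise deviation between the two estimates
max_abs_diff = np.abs(sigma_own - sigma_np).max()
```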

In [12]:
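def normal_distribution_pdf(X, sigma, mu):
    X_centered = X - mu
    # Row-wise quadratic form (x - mu)^T sigma^+ (x - mu). This equals the diagonal of
    # X_centered.dot(pinv(sigma)).dot(X_centered.T) without building the n x n matrix.
    exponent = -0.5 * np.sum(X_centered.dot(pinv(sigma)) * X_centered, axis=1)
    
    dim = X.shape[1]
    eps = 1e-15 # To avoid division by zero
    normalization_term = 1. / sqrt((2 * pi)**dim * det(sigma) + eps)
    
    return normalization_term * e**exponent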
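
In high dimensions the determinant and the exponential can underflow, so working with log-densities is a common alternative. The sketch below is one possible log-domain variant built on `slogdet`, which the notebook imports but does not use. The function name `normal_distribution_logpdf` is hypothetical. It assumes `sigma` is positive definite; a singular covariance would need some form of regularization, which is not part of the original notebook.

```python
import numpy as np
from numpy.linalg import pinv, slogdet


def normal_distribution_logpdf(X, sigma, mu):
    """Hypothetical log-domain variant; assumes sigma is positive definite."""
    X_centered = X - mu
    # Squared Mahalanobis distance of every row to mu
    mahalanobis_sq = np.sum(X_centered.dot(pinv(sigma)) * X_centered, axis=1)
    dim = X.shape[1]
    # slogdet returns (sign, log|det|) and avoids overflow/underflow of det
    _, logdet = slogdet(sigma)
    return -0.5 * (dim * np.log(2 * np.pi) + logdet + mahalanobis_sq)


# Standard normal in 2D: the log-density at the origin is -log(2*pi)
X_toy = np.array([[0.0, 0.0], [1.0, 0.0]])
log_p = normal_distribution_logpdf(X_toy, np.eye(2), np.zeros(2))
```

Because the logarithm is monotonic, taking the class with the largest log-density selects the same class as taking the largest density, as long as neither computation underflows.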

In [13]:
class GaussianClassifier(Classifier):
    def fit(self, X, y):
        self.labels = np.unique(y)
        Xs = [X[y == label] for label in self.labels]
        
        self.means = []
        self.covariances = []
        
        for X_label in Xs:  # separate name avoids shadowing the parameter X
            mean = np.mean(X_label, axis=0)
            self.means.append(mean)
            self.covariances.append(covariance_matrix(X_label, mean))
        
    def predict(self, X):
        results = np.array([None] * len(X))
        largest_probs = np.array([-np.inf] * len(X))
        
        for label, mean, covariance in zip(self.labels, self.means, self.covariances):
            probs = normal_distribution_pdf(X, covariance, mean)
            results[probs > largest_probs] = label
            largest_probs = np.maximum(largest_probs, probs)
        
        return results
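
The running-maximum update in `predict` picks, for each sample, the first class with the highest score. The sketch below uses made-up per-class scores to show that this matches stacking the scores and taking an `argmax` over the class axis; the array names are illustrative.

```python
import numpy as np

labels = np.array([3, 5, 7])

# Made-up scores: one row per class, one column per sample
scores = np.array([
    [0.1, 0.7, 0.2, 0.3],
    [0.6, 0.2, 0.2, 0.3],
    [0.3, 0.1, 0.6, 0.1],
])

# Running-maximum update, as in predict()
results = np.array([None] * scores.shape[1])
largest = np.full(scores.shape[1], -np.inf)
for label, probs in zip(labels, scores):
    results[probs > largest] = label
    largest = np.maximum(largest, probs)

# Equivalent formulation: index of the best class per column, mapped to labels
predicted = labels[np.argmax(scores, axis=0)]
```

Both variants resolve ties in favor of the class that comes first in `labels`, because the comparison is strict and `argmax` returns the first maximum.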

## Applied to the data set

### Binary classification

In [14]:
digits = range(10)

In [15]:
accuracies = []

for i in range(len(digits)):
    for j in range(i + 1, len(digits)):
        positive_class = digits[i]
        negative_class = digits[j]
        
        X_train_sub, y_train_sub = split_train_data(Xs, positive_class, negative_class)
        X_test_sub, y_test_sub = split_test_data(test_data, positive_class, negative_class)
        
        clf = GaussianClassifier()
        clf.fit(X_train_sub, y_train_sub)

        accuracy = clf.score(X_test_sub, y_test_sub)
        accuracies.append(accuracy)
        
        print("%d vs %d => %.6f" % (positive_class, negative_class, accuracy))
        
print("Average: %.6f" % np.mean(accuracies))

0 vs 1 => 0.951846
0 vs 2 => 0.946140
0 vs 3 => 0.948571
0 vs 4 => 0.957066
0 vs 5 => 0.951830
0 vs 6 => 0.943289
0 vs 7 => 0.952569
0 vs 8 => 0.948571
0 vs 9 => 0.958955
1 vs 2 => 0.943723
1 vs 3 => 0.953488
1 vs 4 => 0.909483
1 vs 5 => 0.938679
1 vs 6 => 0.951613
1 vs 7 => 0.946472
1 vs 8 => 0.900000
1 vs 9 => 0.945578
2 vs 3 => 0.947802
2 vs 4 => 0.952261
2 vs 5 => 0.960894
2 vs 6 => 0.910326
2 vs 7 => 0.947826
2 vs 8 => 0.917582
2 vs 9 => 0.957333
3 vs 4 => 0.948087
3 vs 5 => 0.914110
3 vs 6 => 0.928571
3 vs 7 => 0.920128
3 vs 8 => 0.885542
3 vs 9 => 0.927114
4 vs 5 => 0.936111
4 vs 6 => 0.929730
4 vs 7 => 0.896254
4 vs 8 => 0.934426
4 vs 9 => 0.899204
5 vs 6 => 0.903030
5 vs 7 => 0.951140
5 vs 8 => 0.938650
5 vs 9 => 0.958457
6 vs 7 => 0.924290
6 vs 8 => 0.916667
6 vs 9 => 0.930836
7 vs 8 => 0.916933
7 vs 9 => 0.919753
8 vs 9 => 0.845481
Average: 0.932587
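
The nested index loops visit every unordered pair of digits exactly once, which gives the 45 results listed above. A more compact way to enumerate the same pairs, shown here only as a sketch, is `itertools.combinations`.

```python
from itertools import combinations

digits = range(10)

# All unordered pairs (i, j) with i < j, in the same order as the nested loops
pairs = list(combinations(digits, 2))
num_pairs = len(pairs)
```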


### Multiple classes

All 10 digits:

In [16]:
clf = GaussianClassifier()
clf.fit(X, y)
clf.score(X_test, y_test)

0.84055804683607371

Digits 3, 5, 7, and 8 only:

In [17]:
labels = [3,5,7,8]

In [18]:
X_sub, y_sub = get_subset(X, y, labels)
X_test_sub, y_test_sub = get_subset(X_test, y_test, labels)

In [19]:
clf = GaussianClassifier()
clf.fit(X_sub, y_sub)
clf.score(X_test_sub, y_test_sub)

0.83881064162754304