# Dimensionality Reduction and Ensemble Methods

In this exercise we look at principal component analysis (PCA) and use several ensemble methods to build a face recognition model.
To do so, we use the *Olivetti faces* dataset, which consists of 400 images of 40 different people. Each image has $64 \times 64$ pixels.

In [2]:
from sklearn.datasets import fetch_olivetti_faces
faces, labels = fetch_olivetti_faces(return_X_y=True, shuffle=True, random_state=42)

In [3]:
faces.shape

(400, 4096)

In [4]:
64*64

4096

To plot an image we can use `imshow`. The following code plots the first face of each of the 40 people.

In [5]:
import numpy as np
import matplotlib.pyplot as plt

def _plot_face(face):
    if face.shape != (64, 64):
        face = face.reshape(64, 64)
    plt.imshow(face, cmap='gray')
    plt.axis('off')

    
def plot_faces(faces, cols=4):
    faces = np.array(faces)
    if len(faces.shape) == 1:
        faces = faces[None, :]
    m = faces.shape[0]
    
    rows = m // cols
    if m % cols != 0:
        rows += 1
    # squeeze=False always returns a 2D array of axes, even for a single row or column
    fig, axes = plt.subplots(rows, cols, figsize=(3*cols, 3*rows), squeeze=False)
    
    for i in range(rows):
        for j in range(cols):
            try:
                plt.sca(axes[i, j])  # set current axes
                face = faces[i * cols + j]  # get face
            except IndexError:
                plt.axis('off')
                continue
            _plot_face(face)
    return fig, axes

# plot the first image of each distinct person
_, idx = np.unique(labels, return_index=True)
plot_faces(faces[idx], cols=5);

## a) Eigenfaces
- Plot the first 20 principal axes (eigenvectors of the covariance matrix).

In [6]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 20)
X_reduced = pca.fit_transform(faces)



In [7]:
pca.components_.shape

(20, 4096)
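Each row of `components_` has 4096 entries and can therefore be reshaped back into a $64 \times 64$ image. The following is a minimal sketch of such a plot. It assumes random stand-in data (`X_demo`) so that it runs without downloading the dataset; the names `X_demo`, `pca_demo` and `eigenfaces` are illustrative and not part of the notebook.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Stand-in data (assumption): 100 random "images" of 64x64 pixels.
# In the notebook, `faces` would be used instead.
rng = np.random.default_rng(0)
X_demo = rng.random((100, 64 * 64)).astype(np.float32)

pca_demo = PCA(n_components=20).fit(X_demo)

# One principal axis per row; reshape each row to the image grid
eigenfaces = pca_demo.components_.reshape(-1, 64, 64)

fig, axes = plt.subplots(4, 5, figsize=(10, 8))
for k, (ax, eigenface) in enumerate(zip(axes.ravel(), eigenfaces)):
    ax.imshow(eigenface, cmap="gray")
    ax.set_title(f"PC {k + 1}")
    ax.axis("off")
```

With the helper defined earlier in this notebook, `plot_faces(pca.components_, cols=5)` should produce the same kind of grid.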

- How large are the corresponding eigenvalues (variances) of these eigenfaces?

In [8]:
pca.singular_values_

array([86.70191 , 66.46529 , 50.155174, 39.72255 , 33.757397, 31.56875 ,
       27.678602, 25.354546, 24.862415, 22.975147, 22.440619, 21.298529,
       19.838667, 19.029673, 18.317497, 17.568377, 17.033192, 16.04559 ,
       15.426598, 15.355796], dtype=float32)
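Note that `singular_values_` holds the singular values $s_i$ of the centered data matrix, not the eigenvalues of the covariance matrix. The two are related by $\lambda_i = s_i^2 / (n - 1)$, where $n$ is the number of samples, and scikit-learn exposes the eigenvalues directly as `explained_variance_`. The sketch below checks this relation on random stand-in data (an assumption, to keep the example independent of the dataset).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 30))  # stand-in data (assumption)

pca_demo = PCA(n_components=5, svd_solver="full").fit(X_demo)

# Eigenvalues of the covariance matrix from the singular values
n_samples = X_demo.shape[0]
eigenvalues = pca_demo.singular_values_ ** 2 / (n_samples - 1)

# Cross-check against a direct eigendecomposition of the sample covariance
cov = np.cov(X_demo, rowvar=False)
top5 = np.sort(np.linalg.eigvalsh(cov))[::-1][:5]

print(eigenvalues)
print(pca_demo.explained_variance_)
```

Applied to the values printed above, with 400 samples the first singular value 86.70 would correspond to a variance of roughly $86.70^2 / 399 \approx 18.8$.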

- Plot the fraction of explained variance as a function of the number of principal components used. You can use the attribute `explained_variance_ratio_` for this.

In [9]:
pca.explained_variance_ratio_

array([0.23812702, 0.13993974, 0.07968614, 0.04998337, 0.03609851,
       0.03156938, 0.02426832, 0.02036399, 0.01958114, 0.01672122,
       0.01595221, 0.01436979, 0.01246741, 0.01147133, 0.01062878,
       0.0097772 , 0.00919059, 0.00815573, 0.00753862, 0.00746958],
      dtype=float32)
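The 20 ratios above sum to about 0.76, so the first 20 eigenfaces explain roughly three quarters of the total variance. A cumulative plot makes this easier to read off. The following sketch assumes correlated random stand-in data (`X_demo`) and an illustrative threshold of 95 %; neither comes from the notebook.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in data (assumption): correlated features so that a few components dominate
X_demo = rng.normal(size=(300, 50)) @ rng.normal(size=(50, 50))

pca_demo = PCA().fit(X_demo)  # keep all components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components that explains at least 95 % of the variance
n_95 = int(np.argmax(cumulative >= 0.95)) + 1

fig, ax = plt.subplots()
ax.plot(np.arange(1, len(cumulative) + 1), cumulative, marker=".")
ax.axhline(0.95, color="gray", linestyle="--")
ax.set_xlabel("number of principal components")
ax.set_ylabel("cumulative explained variance ratio")
```

In the notebook, fitting `PCA()` on `faces` and plotting `np.cumsum(pca.explained_variance_ratio_)` would give the corresponding curve for the faces.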

## b) Inverse Transform

Fit a PCA and plot the reconstruction of 5 faces based on $5, 10, 20, 50, 100, 200, 300$ and $400$ principal components. You can use the method `inverse_transform` for this.

In [10]:
faces.shape

(400, 4096)

In [12]:
pca = PCA(n_components = 154)
X_reduced = pca.fit_transform(faces)
X_recovered = pca.inverse_transform(X_reduced)
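To compare reconstructions visually, the originals can be shown in the first row and one row can be added per number of components. This is a sketch under assumptions: it uses random stand-in images (`X_demo`) and only three illustrative values of `k`, not the full list from the exercise.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.random((120, 64 * 64)).astype(np.float32)  # stand-in for `faces` (assumption)

ks = [5, 20, 100]  # numbers of components to compare (illustrative)
n_show = 5         # how many images to reconstruct

fig, axes = plt.subplots(len(ks) + 1, n_show, figsize=(2 * n_show, 2 * (len(ks) + 1)))

# First row: the original images
for j in range(n_show):
    axes[0, j].imshow(X_demo[j].reshape(64, 64), cmap="gray")
    axes[0, j].axis("off")

# One row per k: project onto k components, then map back to pixel space
for i, k in enumerate(ks, start=1):
    pca_k = PCA(n_components=k).fit(X_demo)
    X_rec = pca_k.inverse_transform(pca_k.transform(X_demo[:n_show]))
    for j in range(n_show):
        axes[i, j].imshow(X_rec[j].reshape(64, 64), cmap="gray")
        axes[i, j].axis("off")
    axes[i, 0].set_title(f"k = {k}", loc="left")
```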

In [13]:
# Reconstruction error for an increasing number of principal components
X = faces
for k in [5, 10, 20, 50, 100, 200, 300, 400]:
    pca = PCA(n_components=k)
    X_reduced = pca.fit_transform(X)
    X_recovered = pca.inverse_transform(X_reduced)
    # ord=2 on a matrix is the spectral norm (largest singular value of the residual)
    print(np.linalg.norm(X - X_recovered, ord=2))
    # plot_faces(X_recovered[:5], cols=5)

31.568754
22.44062
14.850184
7.886724
5.0892935
2.8776712
1.7904152
0.00010027897
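The printed errors have a simple interpretation: the spectral norm of the residual after keeping $k$ components equals the $(k+1)$-th singular value of the centered data. This matches the output above, where the errors for $k = 5$ and $k = 10$ (31.568754 and 22.44062) coincide with the 6th and 11th entries of `singular_values_` from part a). The sketch below verifies the relation on random stand-in data (an assumption).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 40))  # stand-in data (assumption)

# All singular values of the centered data, in decreasing order
singular_values = PCA(svd_solver="full").fit(X_demo).singular_values_

errors = {}
for k in [5, 10, 20]:
    pca_k = PCA(n_components=k, svd_solver="full")
    X_rec = pca_k.inverse_transform(pca_k.fit_transform(X_demo))
    # Spectral norm of the residual = largest singular value that was dropped
    errors[k] = np.linalg.norm(X_demo - X_rec, ord=2)

for k, err in errors.items():
    print(k, err, singular_values[k])  # index k is the (k+1)-th singular value
```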


## c) Feature Importance
- Compute the *Gini feature importance*. You can use `imshow` for the visualization.

In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(faces, labels, test_size=0.2, stratify=labels, random_state=42)

In [15]:
rnd_clf = RandomForestClassifier(n_estimators=500)
rnd_clf.fit(X_train, y_train)

RandomForestClassifier(n_estimators=500)

In [16]:
np.round(rnd_clf.feature_importances_,1)

array([0., 0., 0., ..., 0., 0., 0.])

In [17]:
rnd_clf.feature_importances_.argsort()

array([3040, 1252, 3232, ...,  659, 4022,  527], dtype=int64)

In [18]:
# indices of the pixels whose importance exceeds 1e-3
indices = np.flatnonzero(rnd_clf.feature_importances_ > 1e-03)
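Rounding the importances to one decimal hides all structure, because each of the 4096 pixels carries only a tiny share. Reshaping the importance vector to the image grid and plotting it with `imshow` is more informative. The following sketch assumes the 8x8 digits dataset bundled with scikit-learn as a small offline stand-in; the names `forest` and `importance_map` are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# Stand-in data (assumption): 8x8 digit images bundled with scikit-learn
X_demo, y_demo = load_digits(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# One importance value per pixel; reshape to the image grid for plotting
importance_map = forest.feature_importances_.reshape(8, 8)

fig, ax = plt.subplots()
im = ax.imshow(importance_map, cmap="hot")
fig.colorbar(im, ax=ax, label="Gini importance")
ax.axis("off")

# Indices of the ten most important pixels, most important first
top10 = np.argsort(forest.feature_importances_)[::-1][:10]
```

For the faces, the analogous call would be `plt.imshow(rnd_clf.feature_importances_.reshape(64, 64), cmap="hot")`.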

## d) Face Recognition

Build a face recognition model. You can use any methods you like. Experiment with different models (`SVC`, `RandomForestClassifier`, ...).
- Also build a `VotingClassifier` based on different models.
- Also try a `PCA` as a preprocessing step. What does the parameter `whiten` of the PCA do?
- What is the accuracy score on the training and test sets? Can you reach a score of 1 on the test set? If not, which people are confused with each other? Plot the faces of these people.


In [19]:
len(indices)

58

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')



In [21]:
voting_clf.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])
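The warning above comes from the `LogisticRegression` solver stopping before it converged. As the message suggests, two common remedies are scaling the features and raising `max_iter`. The sketch below combines both in a pipeline. It assumes the bundled digits dataset as a stand-in, and `max_iter=1000` is an arbitrary illustrative value, not a tuned one.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): digits dataset bundled with scikit-learn
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

# Standardize the features, then give the solver a larger iteration budget
scaled_log_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_clf.fit(X_tr, y_tr)

test_accuracy = scaled_log_clf.score(X_te, y_te)
print(test_accuracy)
```

Such a pipeline could be passed to the `VotingClassifier` in place of the plain `log_clf`.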

In [22]:
y_pred = voting_clf.predict(X_test)

In [23]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.95
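A test accuracy of 0.95 on 80 images means that 4 images were misclassified. To find out which people are confused, the off-diagonal entries of the confusion matrix can be inspected. The following sketch uses toy label arrays (an assumption) in place of `y_test` and `y_pred`; `confused_pairs` and `wrong_idx` are illustrative names.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (assumption) standing in for y_test and y_pred
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred_demo = np.array([0, 0, 1, 2, 2, 2, 1, 3])

# Fix the label order explicitly so that matrix indices can be mapped back to labels
class_labels = np.unique(np.concatenate([y_true_demo, y_pred_demo]))
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=class_labels)

# Off-diagonal entries are mistakes: row = true class, column = predicted class
mistakes = cm.copy()
np.fill_diagonal(mistakes, 0)
confused_pairs = [
    (int(class_labels[t]), int(class_labels[p]))
    for t, p in zip(*np.nonzero(mistakes))
]

# Positions of the misclassified samples, e.g. for indexing into X_test
wrong_idx = np.flatnonzero(y_true_demo != y_pred_demo)

print(confused_pairs)  # (true label, predicted label)
print(wrong_idx)
```

In the notebook, `plot_faces(X_test[wrong_idx])` would then show the misclassified faces.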

Second approach

In [24]:
from tensorflow import keras

In [25]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [26]:

log_clf = LogisticRegression()

svm_clf = SVC()
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

In [27]:
kernel_svm_clf = Pipeline([
    ("pca", PCA(n_components=0.95)),
    ("svm_clf", SVC(kernel="rbf")),
])

In [28]:
kernels = ["linear", "rbf", "poly"]
param_grid = [{
    "svm_clf__kernel": kernels
}]
grid_search = GridSearchCV(kernel_svm_clf, param_grid=param_grid)

In [29]:
grid_search.fit(X_train,y_train)

GridSearchCV(estimator=Pipeline(steps=[('pca', PCA(n_components=0.95)),
                                       ('svm_clf', SVC())]),
             param_grid=[{'svm_clf__kernel': ['linear', 'rbf', 'poly']}])

In [30]:
grid_search.best_estimator_

Pipeline(steps=[('pca', PCA(n_components=0.95)),
                ('svm_clf', SVC(kernel='linear'))])
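The same grid search can also cover parameters of the PCA step, since pipeline parameters are addressed as `<step name>__<parameter name>`. The following sketch extends the grid to `whiten` and the SVM regularization `C`. It assumes the bundled digits dataset as a stand-in, and the grid values are illustrative choices rather than recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Stand-in data (assumption): digits dataset bundled with scikit-learn
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

pipe = Pipeline([
    ("pca", PCA(n_components=0.95)),
    ("svm_clf", SVC()),
])

# Keys follow the pattern <step name>__<parameter name>
param_grid_demo = {
    "pca__whiten": [False, True],
    "svm_clf__kernel": ["linear", "rbf"],
    "svm_clf__C": [1, 10],
}
search = GridSearchCV(pipe, param_grid=param_grid_demo, cv=3)
search.fit(X_tr, y_tr)

best_params = search.best_params_
test_accuracy = search.score(X_te, y_te)
print(best_params, test_accuracy)
```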

In [31]:
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svm_pca', grid_search.best_estimator_)],
    voting='hard')

In [32]:
# hyperparameter tuning with grid search

In [33]:
rnd_clf = RandomForestClassifier()
param_grid = [{
    'bootstrap': [True, False],
    'max_depth': [1,10,100, None]
 }]
grid_search_rfc = GridSearchCV(rnd_clf,param_grid=param_grid)
grid_search_rfc.fit(X_train,y_train)

GridSearchCV(estimator=RandomForestClassifier(),
             param_grid=[{'bootstrap': [True, False],
                          'max_depth': [1, 10, 100, None]}])

In [34]:
def create_conv():
    conv = keras.models.Sequential([
        keras.layers.Input([64*64]),
        keras.layers.Reshape([64, 64, 1]),
        keras.layers.Conv2D(20, 3, activation="relu"),
        keras.layers.BatchNormalization(),
        keras.layers.Conv2D(10, 3, activation="relu"),
        keras.layers.BatchNormalization(),
        keras.layers.Flatten(),
        keras.layers.Dense(40, activation="softmax")
    ])
    conv.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.Adam(),
              metrics=["accuracy"])
    return conv

# conv.summary()
# conv.compile(loss="sparse_categorical_crossentropy",
#               optimizer=keras.optimizers.Adam(),
#               metrics=["accuracy"])

# history = conv.fit(X_train, y_train, epochs=10)


In [42]:

conv = keras.models.Sequential([
    keras.layers.Input([64*64]),
    keras.layers.Reshape([64, 64, 1]),
    keras.layers.Conv2D(20, 3, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Conv2D(10, 3, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Flatten(),
    keras.layers.Dense(40, activation="softmax")
])
conv.compile(loss="sparse_categorical_crossentropy",
            optimizer=keras.optimizers.Adam(),
            metrics=["accuracy"])

conv.summary()

history = conv.fit(X_train, y_train, epochs=10)

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 reshape_1 (Reshape)         (None, 64, 64, 1)         0         
                                                                 
 conv2d_2 (Conv2D)           (None, 62, 62, 20)        200       
                                                                 
 batch_normalization_2 (Batc  (None, 62, 62, 20)       80        
 hNormalization)                                                 
                                                                 
 conv2d_3 (Conv2D)           (None, 60, 60, 10)        1810      
                                                                 
 batch_normalization_3 (Batc  (None, 60, 60, 10)       40        
 hNormalization)                                                 
                                                                 
 flatten_1 (Flatten)         (None, 36000)            

In [36]:
from keras.wrappers.scikit_learn import KerasClassifier

In [46]:
X_test.shape

(80, 4096)

In [54]:
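y_proba = conv.predict(X_test)  # softmax output, one probability per class
y_pred = np.argmax(y_proba, axis=1)  # most probable class for each image
print(accuracy_score(y_test, y_pred))
y_pred.tolist()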





0.025

[15, 31, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 31, 15, 31, 31, 15,
 31, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 31, 15,
 15, 15, 15, 15, 15, 31, 15, 15, 15, 15, 15, 15, 31, 31, 15, 15, 31, 31, 15, 31,
 31, 15, 15, 15, 15, 15, 15, 15, 31, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]

In [38]:
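# NOTE: `nb_epoch` is the legacy Keras 1 spelling; Keras 2 `fit` expects `epochs`,
# so this value may not be forwarded to training.
keras_model = KerasClassifier(build_fn=create_conv, nb_epoch=30)
keras_model._estimator_type = "classifier"  # lets VotingClassifier treat the wrapper as a classifier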


In [39]:
voting_clf = VotingClassifier(
    estimators=[
        ('keras', keras_model),
        ('rf', grid_search_rfc.best_estimator_),
        ('svm_pca', grid_search.best_estimator_),
    ],
    voting='hard')

In [40]:
voting_clf.fit(X_train, y_train)



VotingClassifier(estimators=[('keras',
                              <keras.wrappers.scikit_learn.KerasClassifier object at 0x0000020B82E61CA0>),
                             ('rf',
                              RandomForestClassifier(bootstrap=False,
                                                     max_depth=100)),
                             ('svm_pca',
                              Pipeline(steps=[('pca', PCA(n_components=0.95)),
                                              ('svm_clf',
                                               SVC(kernel='linear'))]))])

In [41]:
y_pred = voting_clf.predict(X_test)
accuracy_score(y_test, y_pred)



0.925
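The exercise also asks for the accuracy on the training set. Comparing training and test accuracy for each member of the ensemble shows how strongly the individual models overfit. This is a sketch under assumptions: it uses the bundled digits dataset and three standard scikit-learn classifiers instead of the Keras model, and `voting_demo` and `scores` are illustrative names.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): digits dataset bundled with scikit-learn
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

voting_demo = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC()),
    ],
    voting="hard",
)
voting_demo.fit(X_tr, y_tr)

# named_estimators_ holds the fitted clones of the individual models.
# They are trained on label-encoded targets; the labels here are already
# 0..9, so the encoding is the identity and score() can use y directly.
scores = {
    name: (est.score(X_tr, y_tr), est.score(X_te, y_te))
    for name, est in voting_demo.named_estimators_.items()
}
scores["voting"] = (voting_demo.score(X_tr, y_tr), voting_demo.score(X_te, y_te))

for name, (train_acc, test_acc) in scores.items():
    print(f"{name:>6}: train={train_acc:.3f}  test={test_acc:.3f}")
```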

Final version.

In [None]:
#  |  whiten : bool, default=False
#  |      When True (False by default) the `components_` vectors are multiplied
#  |      by the square root of n_samples and then divided by the singular values
#  |      to ensure uncorrelated outputs with unit component-wise variances.
#  |  
#  |      Whitening will remove some information from the transformed signal
#  |      (the relative variance scales of the components) but can sometime
#  |      improve the predictive accuracy of the downstream estimators by
#  |      making their data respect some hard-wired assumptions.
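The effect described in the docstring above can be checked numerically: without whitening, the projected components have decreasing variances, and with `whiten=True` they all have unit variance and remain uncorrelated. The sketch below assumes random stand-in data with very different feature scales; `Z_plain` and `Z_white` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in data (assumption): features with very different scales
X_demo = rng.normal(size=(500, 6)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5, 0.1])

Z_plain = PCA(n_components=3).fit_transform(X_demo)
Z_white = PCA(n_components=3, whiten=True).fit_transform(X_demo)

var_plain = Z_plain.var(axis=0, ddof=1)  # decreasing variances
var_white = Z_white.var(axis=0, ddof=1)  # all approximately 1

print(var_plain)
print(var_white)
```

Distance-based models such as an RBF-kernel `SVC` are sensitive to the relative scale of their inputs, which is one reason why whitening can change the accuracy of the pipeline.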