Iantsa Provost and Bastien Soucasse – Group 5

# ACID Project – January 2, 2022

[Assignment](https://masterinfo.emi.u-bordeaux.fr/wiki/lib/exe/fetch.php?media=mini_projet.pdf)

## Introduction

To compare the different _Machine Learning_ algorithms, we need data to use for training and for testing.

Let's start by importing the modules the project needs.

In [1]:
from tensorflow.keras.datasets import fashion_mnist

import matplotlib.pyplot as plt
import matplotlib.cm as cm

%matplotlib inline

VERBOSE = True

We can then load our data with **keras** and scale the pixel intensities to the range [0, 1].

In [3]:
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

assert len(x_train.shape) == 3
assert len(x_test.shape) == 3
assert len(y_train.shape) == 1
assert len(y_test.shape) == 1

if VERBOSE:
    print('x_train.shape =', x_train.shape)
    print('y_train.shape =', y_train.shape)
    print('x_test.shape =', x_test.shape)
    print('y_test.shape =', y_test.shape)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
x_train.shape = (60000, 28, 28)
y_train.shape = (60000,)
x_test.shape = (10000, 28, 28)
y_test.shape = (10000,)


If needed, we can preview the data.

In [None]:
if VERBOSE:
    NUM_EXAMPLES = 5

    for i in range(NUM_EXAMPLES):
        print('x_train[%d]:' % i)
        plt.imshow(x_train[i], cmap=cm.Greys)
        plt.show()
        print('y_train[%d] =' % i, y_train[i])

We also need to flatten the data, so that each 28×28 image becomes a single vector of 784 features.

In [4]:
x_train = x_train.reshape(x_train.shape[0], -1)
x_test = x_test.reshape(x_test.shape[0], -1)

assert len(x_train.shape) == 2
assert len(x_test.shape) == 2

if VERBOSE:
    print('x_train.shape =', x_train.shape)
    print('x_test.shape =', x_test.shape)

x_train.shape = (60000, 784)
x_test.shape = (10000, 784)
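
As a minimal sketch of what scaling and flattening do, the following uses a small synthetic array in place of the real images. The names `images`, `scaled` and `flat` are illustrative and not part of the notebook.

```python
import numpy as np

# Hypothetical stand-in for the image data: 3 grayscale images of 28x28 pixels
# with integer intensities in [0, 255].
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(3, 28, 28))

# Scale intensities to [0, 1], then flatten each image into one row vector.
scaled = images / 255.0
flat = scaled.reshape(scaled.shape[0], -1)

print('flat.shape =', flat.shape)

# Flattening is row-major: pixel (r, c) of image i lands in column r * 28 + c.
assert flat[1, 5 * 28 + 7] == scaled[1, 5, 7]
```

Passing `-1` lets NumPy infer the second dimension (28 × 28 = 784), which avoids spelling out the product of the image dimensions.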


## Data Classification

This part is devoted to the _Machine Learning_ itself. Several methods can be applied to classify the data.

- `k`-nearest neighbors method
- Naive Bayes classification
- Linear discriminant analysis
- Multilayer perceptron [?]

### `k`-Nearest Neighbors Method

Let's start by importing the model class for classification with the `k`-nearest neighbors method.

In [5]:
from sklearn.neighbors import KNeighborsClassifier

We then define the candidate values for `k`.

In [6]:
K_VALS = list(range(1, 11)) + list(range(11, 101, 20))

if VERBOSE:
    print('K_VALS =', K_VALS)

K_VALS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 31, 51, 71, 91]


We can compute the `accuracy` on the test set for each `k`.

In [7]:
knn_accuracies = {}

for k in K_VALS:
    if VERBOSE:
        print('Computing %d-Nearest Neighbors classification…' % k)

    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    knn_accuracies[k] = knn.score(x_test, y_test)

    if VERBOSE:
        print('accuracy =', knn_accuracies.get(k))

Computing 1-Nearest Neighbors classification…
accuracy = 0.8497
Computing 2-Nearest Neighbors classification…
accuracy = 0.846
Computing 3-Nearest Neighbors classification…
accuracy = 0.8541
Computing 4-Nearest Neighbors classification…
accuracy = 0.8577
Computing 5-Nearest Neighbors classification…
accuracy = 0.8554
Computing 6-Nearest Neighbors classification…
accuracy = 0.8544
Computing 7-Nearest Neighbors classification…
accuracy = 0.854
Computing 8-Nearest Neighbors classification…
accuracy = 0.8534
Computing 9-Nearest Neighbors classification…
accuracy = 0.8519
Computing 10-Nearest Neighbors classification…
accuracy = 0.8515
Computing 11-Nearest Neighbors classification…
accuracy = 0.8495
Computing 31-Nearest Neighbors classification…


KeyboardInterrupt: 

We thus retrieve the accuracy of the best classification model obtained with the `k`-nearest neighbors method.

In [8]:
best_k = max(knn_accuracies, key=knn_accuracies.get)

print('%d-Nearest Neighbors:' % best_k)
print('  - Mean accuracy: %.2f%%.' % (knn_accuracies.get(best_k) * 100))

4-Nearest Neighbors:
  - Mean accuracy: 85.77%.


The best classification model obtained with the `k`-nearest neighbors method is the one with `k` equal to 4, with an accuracy of about 86% (85.77%). Note that the run was interrupted while computing `k` = 31, so only the values 1 to 11 were actually evaluated. For those values, accuracy stays above 84% in every case.
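
To illustrate how the majority vote behind `k`-nearest neighbors works, here is a from-scratch sketch in NumPy on toy data. It is not scikit-learn's implementation: the function `knn_predict_many` and the toy arrays are hypothetical names introduced for this example. It computes the distances once and reuses them for every `k`, which is one possible way to make a sweep over `k` cheaper.

```python
import numpy as np


def knn_predict_many(x_train, y_train, x_test, k_values):
    """Predict labels for several k while computing distances only once."""
    # Squared Euclidean distances between every test and training point.
    # This builds an (n_test, n_train, n_features) array, so it only suits
    # small data; the full dataset would need batching.
    diffs = x_test[:, None, :] - x_train[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Training labels ordered from nearest to farthest for each test point.
    order = np.argsort(dists, axis=1, kind='stable')
    sorted_labels = y_train[order]

    predictions = {}
    for k in k_values:
        nearest = sorted_labels[:, :k]
        # Majority vote; argmax breaks ties in favor of the smallest label.
        predictions[k] = np.array([np.bincount(row).argmax() for row in nearest])
    return predictions


# Toy data: two well-separated groups of points labeled 0 and 1.
toy_x_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                        [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
toy_y_train = np.array([0, 0, 0, 1, 1, 1])
toy_x_test = np.array([[0.05, 0.05], [0.95, 0.95]])
toy_y_test = np.array([0, 1])

toy_predictions = knn_predict_many(toy_x_train, toy_y_train, toy_x_test, [1, 3, 5])
for k, y_pred in toy_predictions.items():
    print('k = %d, accuracy = %.2f' % (k, (y_pred == toy_y_test).mean()))
```

Sorting the neighbors once and slicing the first `k` columns means each additional `k` only costs a vote, not a new distance computation.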

### Naive Bayes Classification

Let's start by importing the model class for naive Bayes classification.

In [9]:
from sklearn.naive_bayes import GaussianNB

We can compute the `accuracy` on the test set.

In [10]:
if VERBOSE:
    print('Computing gaussian naive Bayes classification…')

gnb = GaussianNB()
gnb.fit(x_train, y_train)
gnb_accuracy = gnb.score(x_test, y_test)

if VERBOSE:
    print('accuracy =', gnb_accuracy)

Computing gaussian naive Bayes classification…
accuracy = 0.5856


We thus retrieve the accuracy of naive Bayes classification.

In [11]:
print('Gaussian Naive Bayes:')
print('  - Mean accuracy: %.2f%%.' % (gnb_accuracy * 100))

Gaussian Naive Bayes:
  - Mean accuracy: 58.56%.


Naive Bayes classification reaches an accuracy of about 59% (58.56%).
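
For intuition about what a Gaussian naive Bayes classifier computes, here is a simplified from-scratch sketch in NumPy. It is not scikit-learn's `GaussianNB` code: the functions `fit_gaussian_nb` and `predict_gaussian_nb`, the `epsilon` smoothing constant and the toy data are all assumptions made for illustration. Each class gets a per-feature mean and variance, and a sample is assigned to the class with the highest log-posterior.

```python
import numpy as np


def fit_gaussian_nb(x, y, epsilon=1e-9):
    """Estimate per-class feature means, variances and log-priors."""
    classes = np.unique(y)
    means = np.array([x[y == c].mean(axis=0) for c in classes])
    # epsilon (hypothetical smoothing constant) keeps variances strictly positive.
    variances = np.array([x[y == c].var(axis=0) for c in classes]) + epsilon
    log_priors = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, means, variances, log_priors


def predict_gaussian_nb(model, x):
    """Pick the class with the highest log-posterior for each row of x."""
    classes, means, variances, log_priors = model
    diff = x[:, None, :] - means[None, :, :]
    # Features are treated as independent, so per-feature log-densities add up.
    log_likelihood = -0.5 * (np.log(2 * np.pi * variances)
                             + diff ** 2 / variances).sum(axis=2)
    return classes[np.argmax(log_likelihood + log_priors, axis=1)]


# Toy data: two Gaussian blobs centered at (0, 0) and (6, 6).
rng = np.random.default_rng(0)
gnb_toy_x = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                       rng.normal(6.0, 1.0, size=(50, 2))])
gnb_toy_y = np.array([0] * 50 + [1] * 50)

gnb_toy_model = fit_gaussian_nb(gnb_toy_x, gnb_toy_y)
gnb_toy_accuracy = (predict_gaussian_nb(gnb_toy_model, gnb_toy_x) == gnb_toy_y).mean()
print('toy accuracy =', gnb_toy_accuracy)
```

The independence assumption is visible in the sum over features: neighboring pixels of an image are strongly correlated, which may partly explain the lower accuracy observed on this dataset.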

### Linear Discriminant Analysis

Let's start by importing the model class for classification with linear discriminant analysis.

In [12]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

We can compute the `accuracy` on the test set.

In [13]:
if VERBOSE:
    print('Computing linear discriminant analysis classification…')

lda = LDA()
lda.fit(x_train, y_train)
lda_accuracy = lda.score(x_test, y_test)

if VERBOSE:
    print('accuracy =', lda_accuracy)

Computing linear discriminant analysis classification…
accuracy = 0.8151


We thus retrieve the accuracy of classification with linear discriminant analysis.

In [14]:
print('Linear Discriminant Analysis:')
print('  - Mean accuracy: %.2f%%.' % (lda_accuracy * 100))

Linear Discriminant Analysis:
  - Mean accuracy: 81.51%.


Classification with linear discriminant analysis reaches an accuracy of about 82% (81.51%).
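
As a rough sketch of the idea behind linear discriminant analysis, the following NumPy code fits one mean per class and a single covariance matrix shared by all classes, then scores each sample with a linear function per class. It is a textbook-style simplification, not scikit-learn's solver; `fit_lda`, `predict_lda` and the toy data are hypothetical names introduced here.

```python
import numpy as np


def fit_lda(x, y):
    """Fit class means, priors and a pooled covariance; return linear scorers."""
    classes = np.unique(y)
    means = np.array([x[y == c].mean(axis=0) for c in classes])
    priors = np.array([(y == c).mean() for c in classes])
    # Pooled within-class covariance, shared by all classes.
    centered = x - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (len(x) - len(classes))
    # The pseudo-inverse tolerates a singular covariance (e.g. constant features).
    cov_inv = np.linalg.pinv(cov)
    weights = means @ cov_inv  # one weight vector per class
    biases = -0.5 * np.sum(weights * means, axis=1) + np.log(priors)
    return classes, weights, biases


def predict_lda(model, x):
    """Assign each row of x to the class with the highest linear score."""
    classes, weights, biases = model
    scores = x @ weights.T + biases
    return classes[np.argmax(scores, axis=1)]


# Toy data: two Gaussian blobs centered at (0, 0) and (6, 6).
rng = np.random.default_rng(0)
lda_toy_x = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                       rng.normal(6.0, 1.0, size=(50, 2))])
lda_toy_y = np.array([0] * 50 + [1] * 50)

lda_toy_model = fit_lda(lda_toy_x, lda_toy_y)
lda_toy_accuracy = (predict_lda(lda_toy_model, lda_toy_x) == lda_toy_y).mean()
print('toy accuracy =', lda_toy_accuracy)
```

Because the covariance is shared, the quadratic term in the Gaussian log-density is the same for every class and cancels out, which is why the decision boundaries are linear.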

## Conclusion

From the accuracies computed in the previous part, we can determine which models perform best on this data.

We can sort all the accuracies in descending order.

In [15]:
accuracies = {
    '%d-Nearest Neighbors' % best_k: knn_accuracies.get(best_k),
    'Gaussian Naive Bayes': gnb_accuracy,
    'Linear Discriminant Analysis': lda_accuracy
}

accuracies = dict(sorted(accuracies.items(), key=lambda item: item[1], reverse=True))

if VERBOSE:
    print(accuracies)

{'4-Nearest Neighbors': 0.8577, 'Linear Discriminant Analysis': 0.8151, 'Gaussian Naive Bayes': 0.5856}


We can then determine which classification models are the best.

In [16]:
print('Best Models:')

# Rank the models, starting at 1, in the sorted order computed above.
for i, (model, accuracy) in enumerate(accuracies.items(), start=1):
    print('  #%d: %s (with %.2f%%).' % (i, model, accuracy * 100))

Best Models:
  #1: 4-Nearest Neighbors (with 85.77%).
  #2: Linear Discriminant Analysis (with 81.51%).
  #3: Gaussian Naive Bayes (with 58.56%).


Classification with the 4-nearest neighbors method was therefore the most effective, followed by linear discriminant analysis, and finally naive Bayes classification, which was the least effective on this data.

Well, all of this is probably still rough, but I tried to get a bit ahead by doing what I could beforehand. 🙂