# The main stages of a machine learning project

## 1. Formalizing the problem

When faced with a problem, it usually pays to step back and analyze it rather than rushing headlong into the code. This is especially true in machine learning, because a problem has to be _formalized_ before a machine can learn to solve it. Here are a few questions to ask yourself every time:

1. Is machine learning really appropriate for my problem?
2. What data will I be able to use?
3. How will I measure my model's performance?

For this lab, our problem is to tell mines from rocks using measurements taken by a sonar. Let's try to answer the three questions above:

1. Yes, because we do not know of any other way to solve it.
2. The data are provided.
3. We measure the percentage of correctly classified observations.
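
The metric in answer 3, the fraction of correctly classified observations, is commonly called accuracy. Here is a minimal sketch in plain Python; the function name `accuracy` and the labels are made up for illustration and are not part of the lab's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels for illustration: 1 = mine, 0 = rock
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 4 correct out of 5 -> 0.8
```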

## 2. Processing the data

### 2.a) Collecting the data

Using the `read_csv` function from the `pandas` library, import the data in the file `sonar.all-data.csv` as a DataFrame. Note that the columns have no names in the CSV file, so you need to pass a `names` parameter to `read_csv`: the first 60 columns will be called _F1, F2, ..., F60_ and the last column will be called _Type_.


In [2]:
import pandas
noms = []
for i in range(1, 61):
    noms.append("F"+str(i))
noms.append("Type")
df = pandas.read_csv("sonar.all-data.csv", names=noms)

In [3]:
# Alternative
df = pandas.read_csv("sonar.all-data.csv", names=["F"+str(i) for i in range(1, 61)] + ["Type"])
df

Unnamed: 0,F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,...,F52,F53,F54,F55,F56,F57,F58,F59,F60,Type
0,0.0200,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0027,0.0065,0.0159,0.0072,0.0167,0.0180,0.0084,0.0090,0.0032,R
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.0140,0.0049,0.0052,0.0044,R
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.2280,0.2431,0.3771,0.5598,0.6194,...,0.0232,0.0166,0.0095,0.0180,0.0244,0.0316,0.0164,0.0095,0.0078,R
3,0.0100,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0121,0.0036,0.0150,0.0085,0.0073,0.0050,0.0044,0.0040,0.0117,R
4,0.0762,0.0666,0.0481,0.0394,0.0590,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0031,0.0054,0.0105,0.0110,0.0015,0.0072,0.0048,0.0107,0.0094,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203,0.0187,0.0346,0.0168,0.0177,0.0393,0.1630,0.2028,0.1694,0.2328,0.2684,...,0.0116,0.0098,0.0199,0.0033,0.0101,0.0065,0.0115,0.0193,0.0157,M
204,0.0323,0.0101,0.0298,0.0564,0.0760,0.0958,0.0990,0.1018,0.1030,0.2154,...,0.0061,0.0093,0.0135,0.0063,0.0063,0.0034,0.0032,0.0062,0.0067,M
205,0.0522,0.0437,0.0180,0.0292,0.0351,0.1171,0.1257,0.1178,0.1258,0.2529,...,0.0160,0.0029,0.0051,0.0062,0.0089,0.0140,0.0138,0.0077,0.0031,M
206,0.0303,0.0353,0.0490,0.0608,0.0167,0.1354,0.1465,0.1123,0.1945,0.2354,...,0.0086,0.0046,0.0126,0.0036,0.0035,0.0034,0.0079,0.0036,0.0048,M


Modify the DataFrame to turn the last column into a boolean feature: replace each "M" with 1 and each "R" with 0.

In [4]:
df = df.replace({"M": 1, "R": 0})

# Alternative
df.replace({"M": 1, "R": 0}, inplace=True)

df["Type"]

0      0
1      0
2      0
3      0
4      0
      ..
203    1
204    1
205    1
206    1
207    1
Name: Type, Length: 208, dtype: int64

Are any values missing? If so, replace the missing values with the mean of the corresponding column.


In [5]:
df.isna().sum().sum()

0

_There are no missing values._
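
Since the sonar data have no gaps, the replacement step is never exercised here. The sketch below shows how it might look on a toy DataFrame with made-up values; `toy` and `toy_filled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (values are made up)
toy = pd.DataFrame({"F1": [0.1, np.nan, 0.3], "F2": [1.0, 2.0, np.nan]})

# Number of missing values per column
print(toy.isna().sum())

# Replace each missing value with the mean of its own column
# (toy.mean() skips NaN values by default)
toy_filled = toy.fillna(toy.mean())
print(toy_filled)
```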

How many observations correspond to mines? To rocks? What can we conclude?


In [6]:
# Number of mine observations
len(df[df["Type"] == 1])

# Alternative
len(df.query("Type == 1"))

111

In [7]:
# Number of rock observations
len(df[df["Type"] == 0])

97

_We have roughly as many mine observations as rock observations (111 versus 97), which is preferable for learning._
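
Both counts can also be obtained in a single call with `value_counts`. A small sketch on made-up labels (the Series `labels` is illustrative, not the lab's `Type` column):

```python
import pandas as pd

# Made-up labels for illustration: 1 = mine, 0 = rock
labels = pd.Series([1, 1, 0, 1, 0, 0, 1], name="Type")

counts = labels.value_counts()                      # absolute count per class
proportions = labels.value_counts(normalize=True)   # relative frequency per class

print(counts)
print(proportions)
```

Looking at the proportions rather than the raw counts makes it easier to judge whether the classes are balanced.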


Shuffle the observations in the DataFrame.

In [8]:
df = df.sample(frac=1)
df

Unnamed: 0,F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,...,F52,F53,F54,F55,F56,F57,F58,F59,F60,Type
74,0.0109,0.0093,0.0121,0.0378,0.0679,0.0863,0.1004,0.0664,0.0941,0.1036,...,0.0077,0.0023,0.0117,0.0053,0.0077,0.0076,0.0056,0.0055,0.0039,0
167,0.0137,0.0297,0.0116,0.0082,0.0241,0.0253,0.0279,0.0130,0.0489,0.0874,...,0.0081,0.0040,0.0025,0.0036,0.0058,0.0067,0.0035,0.0043,0.0033,1
41,0.0093,0.0185,0.0056,0.0064,0.0260,0.0458,0.0470,0.0057,0.0425,0.0640,...,0.0069,0.0064,0.0129,0.0114,0.0054,0.0089,0.0050,0.0058,0.0025,0
18,0.0270,0.0092,0.0145,0.0278,0.0412,0.0757,0.1026,0.1138,0.0794,0.1520,...,0.0084,0.0010,0.0018,0.0068,0.0039,0.0120,0.0132,0.0070,0.0088,0
200,0.0131,0.0387,0.0329,0.0078,0.0721,0.1341,0.1626,0.1902,0.2610,0.3193,...,0.0150,0.0076,0.0032,0.0037,0.0071,0.0040,0.0009,0.0015,0.0085,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32,0.0195,0.0213,0.0058,0.0190,0.0319,0.0571,0.1004,0.0668,0.0691,0.0242,...,0.0157,0.0074,0.0271,0.0203,0.0089,0.0095,0.0095,0.0021,0.0053,0
174,0.0191,0.0173,0.0291,0.0301,0.0463,0.0690,0.0576,0.1103,0.2423,0.3134,...,0.0040,0.0136,0.0137,0.0172,0.0132,0.0110,0.0122,0.0114,0.0068,1
114,0.0114,0.0222,0.0269,0.0384,0.1217,0.2062,0.1489,0.0929,0.1350,0.1799,...,0.0269,0.0152,0.0257,0.0097,0.0041,0.0050,0.0145,0.0103,0.0025,1
113,0.0283,0.0599,0.0656,0.0229,0.0839,0.1673,0.1154,0.1098,0.1370,0.1767,...,0.0147,0.0170,0.0158,0.0046,0.0073,0.0054,0.0033,0.0045,0.0079,1



For our algorithm to learn, we will separate the sonar observations from the Mine or Rock label. Split the DataFrame in two: a DataFrame named `X` that contains the first 60 columns, and `Y`, which contains only the last column.

In [9]:
Y = df["Type"]
Y

74     0
167    1
41     0
18     0
200    1
      ..
32     0
174    1
114    1
113    1
92     0
Name: Type, Length: 208, dtype: int64

In [10]:
# For X, several options:
# With positional indices
X = df.iloc[:,:-1]

# With column names
X = df[["F"+str(i) for i in range(1,61)]]

# With drop
X = df.drop(columns=["Type"])

X

Unnamed: 0,F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,...,F51,F52,F53,F54,F55,F56,F57,F58,F59,F60
74,0.0109,0.0093,0.0121,0.0378,0.0679,0.0863,0.1004,0.0664,0.0941,0.1036,...,0.0124,0.0077,0.0023,0.0117,0.0053,0.0077,0.0076,0.0056,0.0055,0.0039
167,0.0137,0.0297,0.0116,0.0082,0.0241,0.0253,0.0279,0.0130,0.0489,0.0874,...,0.0169,0.0081,0.0040,0.0025,0.0036,0.0058,0.0067,0.0035,0.0043,0.0033
41,0.0093,0.0185,0.0056,0.0064,0.0260,0.0458,0.0470,0.0057,0.0425,0.0640,...,0.0022,0.0069,0.0064,0.0129,0.0114,0.0054,0.0089,0.0050,0.0058,0.0025
18,0.0270,0.0092,0.0145,0.0278,0.0412,0.0757,0.1026,0.1138,0.0794,0.1520,...,0.0045,0.0084,0.0010,0.0018,0.0068,0.0039,0.0120,0.0132,0.0070,0.0088
200,0.0131,0.0387,0.0329,0.0078,0.0721,0.1341,0.1626,0.1902,0.2610,0.3193,...,0.0137,0.0150,0.0076,0.0032,0.0037,0.0071,0.0040,0.0009,0.0015,0.0085
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32,0.0195,0.0213,0.0058,0.0190,0.0319,0.0571,0.1004,0.0668,0.0691,0.0242,...,0.0261,0.0157,0.0074,0.0271,0.0203,0.0089,0.0095,0.0095,0.0021,0.0053
174,0.0191,0.0173,0.0291,0.0301,0.0463,0.0690,0.0576,0.1103,0.2423,0.3134,...,0.0125,0.0040,0.0136,0.0137,0.0172,0.0132,0.0110,0.0122,0.0114,0.0068
114,0.0114,0.0222,0.0269,0.0384,0.1217,0.2062,0.1489,0.0929,0.1350,0.1799,...,0.0213,0.0269,0.0152,0.0257,0.0097,0.0041,0.0050,0.0145,0.0103,0.0025
113,0.0283,0.0599,0.0656,0.0229,0.0839,0.1673,0.1154,0.1098,0.1370,0.1767,...,0.0109,0.0147,0.0170,0.0158,0.0046,0.0073,0.0054,0.0033,0.0045,0.0079


### 2.b) Training, validation and test sets

Once the data have been collected and preprocessed, we must split them right away into two (or three) subsets so that we can evaluate our solution.

What do these subsets correspond to? To understand, imagine that our program is a student revising for an exam. In this metaphor, the student has to learn by reading the course textbook. At the end of the year there is an exam made up of several questions about the course. For the exam to really assess the student's understanding, it is important that the student has never seen the questions before (otherwise they could simply have memorized all the answers without really understanding them).

Coming back to our machine learning algorithm, we therefore split the available data into two sets:

- A training set, generally made up of 80% of the available observations, which corresponds to the course textbook.
- A test set (or evaluation set), made up of the remaining 20% of observations, which corresponds to the final exam.

It is essential not to use the test data before the evaluation at the very end of the project.

Split the DataFrame `X` into two parts, `X_train` (70% of the observations) and `X_test` (the remaining 30%). Do the same for `Y`: be careful to make sure that the observations in `Y_train` match those in `X_train`, and that those in `Y_test` match those in `X_test`.

In [15]:
nb_observations_train = int(0.7*len(df))
nb_observations_test = len(df) - nb_observations_train

# Several ways to do it:

# With positional indices (the first 70% of the rows go to training, the rest to test)
X_train = X.iloc[:nb_observations_train, :]
X_test = X.iloc[nb_observations_train:, :]

Y_train = Y.iloc[:nb_observations_train]
Y_test = Y.iloc[nb_observations_train:]

# With head and tail
X_train = X.head(nb_observations_train)
X_test = X.tail(nb_observations_test)

Y_train = Y.head(nb_observations_train)
Y_test = Y.tail(nb_observations_test)
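
As an alternative to splitting by hand, scikit-learn (introduced in section 3.c) provides `train_test_split`, which shuffles the data and keeps observations and labels aligned. The sketch below uses stand-in arrays rather than the sonar data; `X_demo`, `Y_demo` and the `random_state` value are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 10 observations with 2 features each, and alternating labels
X_demo = np.arange(20).reshape(10, 2)
Y_demo = np.array([0, 1] * 5)

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo,
    test_size=0.3,      # 30% of the observations go to the test set
    random_state=0,     # fixed seed so the split is reproducible
    stratify=Y_demo,    # keep the class proportions similar in both subsets
)
print(X_tr.shape, X_te.shape)
```

The `stratify` argument is optional; it can help when the classes are not perfectly balanced.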

## 3. Training a model

### 3.a) What is a model? General considerations

A model is a more or less accurate representation of reality. It is also called a hypothesis.

For example, if we are interested in the link between hand size and foot size, we can hypothesize that the two quantities are related by $\text{foot size} = 3 \times \text{hand size} + 2$. We have modeled the relationship between the two quantities with a linear function, so this model belongs to the _model class_ of linear functions.

A machine learning problem consists in searching for the best possible model within a given model class. In theory, we could take the set of all possible programs as the model class: we would then be sure that the best possible program lies in that set. In practice, this would cost far too much time and energy, so we restrict ourselves to model classes that are simpler to explore.
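
To make "searching within a model class" concrete: in the class of linear functions, each model is defined by a slope `a` and an intercept `b`, and learning amounts to picking the pair that best fits the observations. The sketch below assumes a least-squares criterion, which the text does not specify, and uses made-up measurements; `fit_line` is an illustrative name.

```python
# Made-up measurements in cm (hand length, foot length), for illustration only
hands = [17.0, 18.0, 19.0, 20.0]
feet = [24.0, 25.5, 26.5, 28.5]


def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


a, b = fit_line(hands, feet)
print(f"foot size = {a:.2f} * hand size + {b:.2f}")
```
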
<!---
For later: choosing hyperparameters, validating one's choices
--->
### 3.b) A first model class for classification: "$k$-nearest neighbors"

A very simple way to learn to classify from observations is to compare observations that resemble each other. For example: I am given a photo of an animal to classify, so I compare it with the 20 photos in my collection that look most like it. Among these, there are 14 photos of cats, 5 of tigers and 1 of an alligator. I conclude that the photo I was given shows a cat.

This elementary algorithm is called "$k$-nearest neighbors": we look at what the $k$ (here $20$) nearest neighbors do in order to classify.

I have not been entirely honest: I left out a potentially complex point in this seemingly simple algorithm. Can you see which part of my description deserves more explanation? (Hint: imagine implementing the algorithm in detail; there is a point where you are likely to get stuck...)

_To be completed_
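
The description above leaves open what "most similar" means. Here is a minimal sketch of the algorithm, assuming that observations are numeric vectors and that similarity is measured by Euclidean distance (one common choice among others). The function name `knn_predict` and the data are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Most frequent label among those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up 2D points for illustration
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["rock", "rock", "rock", "mine", "mine", "mine"]

print(knn_predict(points, labels, (0.15, 0.15)))  # the three nearest are all "rock"
print(knn_predict(points, labels, (1.0, 0.95)))   # the three nearest are all "mine"
```

Changing the distance function changes which neighbors count as "nearest", so this choice is part of the model.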

### 3.c) Scikit-learn

The `scikit-learn` library (often abbreviated `sklearn`) implements a [large number](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) of "traditional" machine learning algorithms (almost everything except deep learning algorithms).

If you have not done so already, install `scikit-learn` on your machine:

In [None]:
!pip install scikit-learn

All model classes provided by `sklearn` share three essential methods:

1. `fit` takes training data (observations and their labels) and learns a model of the class from those data.
2. `score` takes test data (observations and their labels) and returns a measure of the model's performance on those data.
3. `predict` takes one or more observations and returns the trained model's prediction for each of them.

If you master these three methods, congratulations, you have (almost) mastered `sklearn`!

Create a model of type `KNeighborsClassifier` with `scikit-learn`.
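
The three calls can be tried out on a toy problem before using the sonar data. This is a sketch with made-up one-dimensional data; `X_toy`, `y_toy`, `model` and the choice `n_neighbors=3` are illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1D data: small values belong to class 0, large values to class 1
X_toy = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y_toy = [0, 0, 0, 1, 1, 1]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_toy, y_toy)                          # learn from the training data
train_accuracy = model.score(X_toy, y_toy)       # mean accuracy on the given data
predictions = model.predict([[0.05], [1.15]])    # one prediction per input row

print(train_accuracy, predictions)
```

Note that `predict` expects a two-dimensional input (one row per observation), even for a single observation.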

In [12]:
from sklearn.neighbors import KNeighborsClassifier

cls = KNeighborsClassifier()

Train the model on the training data.

In [16]:
cls.fit(X_train, Y_train)

## 4. Testing the final model

Once the model is trained, it is important to check that it works "in real life", that is, on data it has never seen before.

Use the `score` method to evaluate the trained model, first on the training set for comparison, then on the test set.

In [17]:
cls.score(X_train, Y_train)

0.8620689655172413

In [18]:
cls.score(X_test, Y_test)

0.8095238095238095

In [28]:
# Example prediction
# Keep the first test row as a one-row DataFrame, since predict expects 2D input
donnees_sous_marin = X_test.iloc[[0], :]
cls.predict(donnees_sous_marin)



array([0])

In [29]:
import pickle

# Save the trained model to disk (the with block closes the file automatically)
with open("modele.pkl", "wb") as f:
    pickle.dump(cls, f)


In [30]:
# Reload the saved model and check that it gives the same test score
with open("modele.pkl", "rb") as f:
    mon_cls = pickle.load(f)

mon_cls.score(X_test, Y_test)


0.8095238095238095

You have almost finished your first machine learning project, but your work does not end there! Once a model has been trained and tested, you often have to "sell" it to clients. This mainly means explaining your choices and why they are sound, so that the client trusts your AI. In this guided lab you did not have many choices to make, so I will spare you that task. However, the next lab, on validation, will give you practice with it!