## Model explanation

The Nearest Centroid model is a simple supervised classification algorithm that represents each class by the mean (centroid) of its training samples.
It is one of the more underused classifiers in machine learning and is somewhat similar to the K-Nearest Neighbors model.
It works in three steps:
- During training, the centroid of each target class is computed.
- At prediction time, the distance between a point X and the centroid of each class is computed.
- The smallest of these distances is selected, and the point is assigned the class of the corresponding centroid.
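
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration on made-up 2D points, assuming Euclidean distance; the function names `fit_centroids` and `predict_nearest` are hypothetical and are not part of scikit-learn.

```python
import numpy as np

def fit_centroids(X, y):
    """Step 1: compute the mean (centroid) of each class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_nearest(X, classes, centroids):
    """Steps 2 and 3: distance to every centroid, then pick the closest."""
    # dists[i, k] = Euclidean distance between sample i and centroid k
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Toy data (made up for illustration): two well-separated classes
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]])
y_toy = np.array([0, 0, 1, 1])

classes, centroids = fit_centroids(X_toy, y_toy)
preds = predict_nearest(np.array([[2.0, 2.0], [8.0, 7.0]]), classes, centroids)
print(centroids)
print(preds)
```

The scikit-learn `NearestCentroid` estimator used later in this notebook follows the same idea, with additional options such as the distance metric.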

## Data explanation

The dataset is derived from images of two raisin varieties, Kecimen and Besni.
Both varieties are grown in Turkey, and the images were obtained with a computer vision system (CVS).
A total of 900 raisin grains were used, 450 from each variety.
The images went through several preprocessing stages, and 7 morphological features were extracted.
These features were then classified using three different artificial intelligence techniques.

Attribute information:

1.) Area: The number of pixels within the boundaries of the raisin.

2.) Perimeter: Measures the perimeter by calculating the distance between the boundaries of the raisin and the pixels around it.

3.) MajorAxisLength: The length of the major axis, which is the longest line that can be drawn on the raisin.

4.) MinorAxisLength: The length of the minor axis, which is the shortest line that can be drawn on the raisin.

5.) Eccentricity: A measure of the eccentricity of the ellipse that has the same moments as the raisin.

6.) ConvexArea: The number of pixels in the smallest convex hull of the region formed by the raisin.

7.) Extent: The ratio of the region formed by the raisin to the total number of pixels in the bounding box.

8.) Class: The raisin variety, Kecimen or Besni.
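
As a sanity check on the Eccentricity definition, the value can be recomputed from the two axis lengths. The sketch assumes the standard ellipse formula, eccentricity = sqrt(1 - (minor / major)^2); that the feature was derived this way is an assumption, and `ellipse_eccentricity` is a hypothetical helper. The numbers are those of the first row of the dataset as displayed later in this notebook.

```python
import math

def ellipse_eccentricity(major_axis, minor_axis):
    """Eccentricity of an ellipse from its full axis lengths."""
    return math.sqrt(1.0 - (minor_axis / major_axis) ** 2)

# First row of the dataset: MajorAxisLength, MinorAxisLength, Eccentricity
major, minor, reported = 442.246011, 253.291155, 0.819738
computed = ellipse_eccentricity(major, minor)
print(round(computed, 6), reported)
```

The recomputed value agrees with the reported one to within rounding, which suggests that Eccentricity carries little information beyond the two axis lengths.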


## Model implementation

In [1]:
# Import modules
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import NearestCentroid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

In [2]:
# Load the data
datas = pd.read_csv('raisin.csv')
datas

Unnamed: 0,Area,MajorAxisLength,MinorAxisLength,Eccentricity,ConvexArea,Extent,Perimeter,Class
0,87524,442.246011,253.291155,0.819738,90546,0.758651,1184.040,Kecimen
1,75166,406.690687,243.032436,0.801805,78789,0.684130,1121.786,Kecimen
2,90856,442.267048,266.328318,0.798354,93717,0.637613,1208.575,Kecimen
3,45928,286.540559,208.760042,0.684989,47336,0.699599,844.162,Kecimen
4,79408,352.190770,290.827533,0.564011,81463,0.792772,1073.251,Kecimen
...,...,...,...,...,...,...,...,...
895,83248,430.077308,247.838695,0.817263,85839,0.668793,1129.072,Besni
896,87350,440.735698,259.293149,0.808629,90899,0.636476,1214.252,Besni
897,99657,431.706981,298.837323,0.721684,106264,0.741099,1292.828,Besni
898,93523,476.344094,254.176054,0.845739,97653,0.658798,1258.548,Besni


In [3]:
# Preprocessing: inspect the column types
datas.dtypes

Area                 int64
MajorAxisLength    float64
MinorAxisLength    float64
Eccentricity       float64
ConvexArea           int64
Extent             float64
Perimeter          float64
Class               object
dtype: object

In [4]:
# Encode the target as integers: Kecimen -> 0, Besni -> 1
datas['Class'] = datas['Class'].map({'Kecimen': 0, 'Besni': 1})

In [5]:
datas.head(700)

Unnamed: 0,Area,MajorAxisLength,MinorAxisLength,Eccentricity,ConvexArea,Extent,Perimeter,Class
0,87524,442.246011,253.291155,0.819738,90546,0.758651,1184.040,0
1,75166,406.690687,243.032436,0.801805,78789,0.684130,1121.786,0
2,90856,442.267048,266.328318,0.798354,93717,0.637613,1208.575,0
3,45928,286.540559,208.760042,0.684989,47336,0.699599,844.162,0
4,79408,352.190770,290.827533,0.564011,81463,0.792772,1073.251,0
...,...,...,...,...,...,...,...,...
695,86852,456.478688,248.606869,0.838684,90550,0.607854,1207.534,1
696,91464,433.219793,273.255461,0.775982,93852,0.717702,1182.210,1
697,93441,396.790780,300.812608,0.652122,95370,0.723317,1157.771,1
698,94211,450.004617,269.286569,0.801191,96340,0.716848,1194.631,1


In [6]:
# Separate the features (X) from the target (Y)
Y = datas['Class']
X = datas.drop(['Class'], axis=1)

In [7]:
X

Unnamed: 0,Area,MajorAxisLength,MinorAxisLength,Eccentricity,ConvexArea,Extent,Perimeter
0,87524,442.246011,253.291155,0.819738,90546,0.758651,1184.040
1,75166,406.690687,243.032436,0.801805,78789,0.684130,1121.786
2,90856,442.267048,266.328318,0.798354,93717,0.637613,1208.575
3,45928,286.540559,208.760042,0.684989,47336,0.699599,844.162
4,79408,352.190770,290.827533,0.564011,81463,0.792772,1073.251
...,...,...,...,...,...,...,...
895,83248,430.077308,247.838695,0.817263,85839,0.668793,1129.072
896,87350,440.735698,259.293149,0.808629,90899,0.636476,1214.252
897,99657,431.706981,298.837323,0.721684,106264,0.741099,1292.828
898,93523,476.344094,254.176054,0.845739,97653,0.658798,1258.548


In [8]:
Y

0      0
1      0
2      0
3      0
4      0
      ..
895    1
896    1
897    1
898    1
899    1
Name: Class, Length: 900, dtype: int64

In [9]:
# Train/test split: 70% train, 30% test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=12)

In [10]:
print('Size of X:', X.shape, '\nSize of train:', X_train.shape)
print('Train percentage: ', X_train.shape[0] / X.shape[0] * 100, '%')

Size of X: (900, 7) 
Size of train: (630, 7)
Train percentage:  70.0 %
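
With a balanced dataset such as this one, it can be useful to keep the class proportions identical in the train and test subsets. A minimal sketch, assuming synthetic data with the same 450/450 balance as the raisin dataset; the `stratify` argument is not used in the notebook itself, and `X_demo` / `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the raisin data: 900 samples, 7 features, 450 per class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(900, 7))
y_demo = np.array([0] * 450 + [1] * 450)

# stratify=y_demo keeps the 50/50 class balance in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=12, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))
```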


## Comparing the model with classic models

### Nearest Centroid Classifier

In [11]:
# Instantiate the model
model_ncc = NearestCentroid()
# Fit the model
model_ncc.fit(X_train, Y_train)
# Predict
pred_ncc = model_ncc.predict(X_test)

In [12]:
# Parameter grid
params = [
    {'metric':['euclidean','manhattan']},
    {'shrink_threshold':[None,0.2,5]}
]
ncc_best_model = GridSearchCV(model_ncc, params, cv=5, verbose=10)
ncc_best_model.fit(X_train,Y_train)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV 1/5; 1/5] START metric=euclidean............................................
[CV 1/5; 1/5] END .............metric=euclidean;, score=0.833 total time=   0.0s
[CV 2/5; 1/5] START metric=euclidean............................................
[CV 2/5; 1/5] END .............metric=euclidean;, score=0.841 total time=   0.0s
[CV 3/5; 1/5] START metric=euclidean............................................
[CV 3/5; 1/5] END .............metric=euclidean;, score=0.817 total time=   0.0s
[CV 4/5; 1/5] START metric=euclidean............................................
[CV 4/5; 1/5] END .............metric=euclidean;, score=0.786 total time=   0.0s
[CV 5/5; 1/5] START metric=euclidean............................................
[CV 5/5; 1/5] END .............metric=euclidean;, score=0.825 total time=   0.0s
[CV 1/5; 2/5] START metric=manhattan............................................
[CV 1/5; 2/5] END .............metric=manhattan;,

GridSearchCV(cv=5, estimator=NearestCentroid(),
             param_grid=[{'metric': ['euclidean', 'manhattan']},
                         {'shrink_threshold': [None, 0.2, 5]}],
             verbose=10)
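
The log above reports 5 candidates because a list of dictionaries is searched one dictionary at a time: 2 values of `metric` plus 3 values of `shrink_threshold`. A single dictionary would instead explore every combination. The sketch below uses `ParameterGrid`, which enumerates parameter candidates in the same way, to compare the two forms; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same structure as the notebook: two independent sub-grids
separate = ParameterGrid([
    {'metric': ['euclidean', 'manhattan']},
    {'shrink_threshold': [None, 0.2, 5]},
])

# Alternative: one dictionary, so the two parameters are crossed
combined = ParameterGrid({
    'metric': ['euclidean', 'manhattan'],
    'shrink_threshold': [None, 0.2, 5],
})

print(len(separate))  # 2 + 3 candidates
print(len(combined))  # 2 * 3 candidates
```

Crossing parameters can produce combinations that a given estimator rejects, so the combined form should be checked against the estimator's documentation before use.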

In [13]:
print('Nearest Centroid Classifier')
# Best parameters
print('Best parameters', ncc_best_model.best_params_)
# Best model
print('Best model', ncc_best_model.best_estimator_)
# Best mean cross-validation score
print('Best score', ncc_best_model.best_score_)

Nearest Centroid Classifier
Best parameters {'metric': 'manhattan'}
Best model NearestCentroid(metric='manhattan')
Best score 0.8365079365079364
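
Because Nearest Centroid relies on distances, features with large numeric ranges (Area, ConvexArea) can dominate those with small ranges (Eccentricity, Extent). The sketch below measures this on the first two rows of the dataset displayed earlier; it is only an illustration and is not part of the original notebook.

```python
import numpy as np

features = ['Area', 'MajorAxisLength', 'MinorAxisLength', 'Eccentricity',
            'ConvexArea', 'Extent', 'Perimeter']
# Rows 0 and 1 of the dataset, copied from the table above
row0 = np.array([87524, 442.246011, 253.291155, 0.819738, 90546, 0.758651, 1184.040])
row1 = np.array([75166, 406.690687, 243.032436, 0.801805, 78789, 0.684130, 1121.786])

# Contribution of each feature to the squared Euclidean distance
sq_diff = (row0 - row1) ** 2
share = sq_diff / sq_diff.sum()
for name, s in zip(features, share):
    print(f'{name:16s} {s:.6f}')

# Combined share of the two pixel-count features
area_share = share[features.index('Area')] + share[features.index('ConvexArea')]
```

Standardizing the features (for example with `StandardScaler`) before fitting is one common way to reduce this effect.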


### K-Nearest Neighbors

In [14]:
# Instantiate the model
model_knn = KNeighborsClassifier()
# Fit the model
model_knn.fit(X_train, Y_train)
# Predict
pred_knn = model_knn.predict(X_test)

In [15]:
# Parameter grid
params = [
    {'n_neighbors':[1,8,3,5,7,18,9,4,10,13]},
    {'leaf_size':[2,4,6,8,9,10,5,7]}
]
knn_best_model = GridSearchCV(model_knn, params, cv=5, verbose=10)
knn_best_model.fit(X_train,Y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV 1/5; 1/18] START n_neighbors=1..............................................
[CV 1/5; 1/18] END ...............n_neighbors=1;, score=0.778 total time=   0.0s
[CV 2/5; 1/18] START n_neighbors=1..............................................
[CV 2/5; 1/18] END ...............n_neighbors=1;, score=0.849 total time=   0.0s
[CV 3/5; 1/18] START n_neighbors=1..............................................
[CV 3/5; 1/18] END ...............n_neighbors=1;, score=0.817 total time=   0.0s
[CV 4/5; 1/18] START n_neighbors=1..............................................
[CV 4/5; 1/18] END ...............n_neighbors=1;, score=0.802 total time=   0.0s
[CV 5/5; 1/18] START n_neighbors=1..............................................
[CV 5/5; 1/18] END ...............n_neighbors=1;, score=0.786 total time=   0.0s
[CV 1/5; 2/18] START n_neighbors=8..............................................
[CV 1/5; 2/18] END ...............n_neighbors=8;

[CV 3/5; 11/18] END ................leaf_size=2;, score=0.873 total time=   0.0s
[CV 4/5; 11/18] START leaf_size=2...............................................
[CV 4/5; 11/18] END ................leaf_size=2;, score=0.810 total time=   0.0s
[CV 5/5; 11/18] START leaf_size=2...............................................
[CV 5/5; 11/18] END ................leaf_size=2;, score=0.857 total time=   0.0s
[CV 1/5; 12/18] START leaf_size=4...............................................
[CV 1/5; 12/18] END ................leaf_size=4;, score=0.833 total time=   0.0s
[CV 2/5; 12/18] START leaf_size=4...............................................
[CV 2/5; 12/18] END ................leaf_size=4;, score=0.873 total time=   0.0s
[CV 3/5; 12/18] START leaf_size=4...............................................
[CV 3/5; 12/18] END ................leaf_size=4;, score=0.873 total time=   0.0s
[CV 4/5; 12/18] START leaf_size=4...............................................
[CV 4/5; 12/18] END ........

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid=[{'n_neighbors': [1, 8, 3, 5, 7, 18, 9, 4, 10, 13]},
                         {'leaf_size': [2, 4, 6, 8, 9, 10, 5, 7]}],
             verbose=10)

In [16]:
print('K-Nearest Neighbors')
# Best parameters
print('Best parameters', knn_best_model.best_params_)
# Best model
print('Best model', knn_best_model.best_estimator_)
# Best mean cross-validation score
print('Best score', knn_best_model.best_score_)

K-Nearest Neighbors
Best parameters {'n_neighbors': 7}
Best model KNeighborsClassifier(n_neighbors=7)
Best score 0.8539682539682539
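
In scikit-learn, `leaf_size` controls how the BallTree or KDTree index is built; it affects speed and memory, not which neighbors are found. The sketch below, on synthetic data, illustrates why varying it in a grid is not expected to change the score. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 7))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_new = rng.normal(size=(50, 7))

predictions = []
for leaf_size in (2, 10, 30):
    # Force a tree-based search so that leaf_size is actually used
    knn = KNeighborsClassifier(n_neighbors=7, algorithm='kd_tree', leaf_size=leaf_size)
    knn.fit(X_demo, y_demo)
    predictions.append(knn.predict(X_new))

# True when every leaf_size produced the same predictions
same = all(np.array_equal(predictions[0], p) for p in predictions[1:])
print(same)
```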


### Logistic Regression

In [17]:
# Instantiate the model
model_lr = LogisticRegression()
# Fit the model
model_lr.fit(X_train, Y_train)
# Predict
pred_lr = model_lr.predict(X_test)

In [18]:
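# Parameter grid
# Note: 'l1' and 'elasticnet' are not supported by the default 'lbfgs' solver,
# which explains the nan scores and the failed fits in the log below.
params = [
    {'penalty': ['l1', 'l2', 'elasticnet', 'none']},
    {'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']},
    {'max_iter': [10, 34, 55, 78]}
]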
lr_best_model = GridSearchCV(model_lr,params,cv=5,verbose=10)
lr_best_model.fit(X_train,Y_train)

Fitting 5 folds for each of 13 candidates, totalling 65 fits
[CV 1/5; 1/13] START penalty=l1.................................................
[CV 1/5; 1/13] END ....................penalty=l1;, score=nan total time=   0.0s
[CV 2/5; 1/13] START penalty=l1.................................................
[CV 2/5; 1/13] END ....................penalty=l1;, score=nan total time=   0.0s
[CV 3/5; 1/13] START penalty=l1.................................................
[CV 3/5; 1/13] END ....................penalty=l1;, score=nan total time=   0.0s
[CV 4/5; 1/13] START penalty=l1.................................................
[CV 4/5; 1/13] END ....................penalty=l1;, score=nan total time=   0.0s
[CV 5/5; 1/13] START penalty=l1.................................................
[CV 5/5; 1/13] END ....................penalty=l1;, score=nan total time=   0.0s
[CV 1/5; 2/13] START penalty=l2.................................................
[CV 1/5; 2/13] END ..................penalty=l2;

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/5; 4/13] END ................penalty=none;, score=0.865 total time=   0.0s
[CV 2/5; 4/13] START penalty=none...............................................
[CV 2/5; 4/13] END ................penalty=none;, score=0.873 total time=   0.0s
[CV 3/5; 4/13] START penalty=none...............................................
[CV 3/5; 4/13] END ................penalty=none;, score=0.881 total time=   0.0s
[CV 4/5; 4/13] START penalty=none...............................................
[CV 4/5; 4/13] END ................penalty=none;, score=0.833 total time=   0.0s
[CV 5/5; 4/13] START penalty=none...............................................
[CV 5/5; 4/13] END ................penalty=none;, score=0.857 total time=   0.0s
[CV 1/5; 5/13] START solver=newton-cg...........................................




[CV 1/5; 5/13] END ............solver=newton-cg;, score=0.865 total time=   0.2s
[CV 2/5; 5/13] START solver=newton-cg...........................................




[CV 2/5; 5/13] END ............solver=newton-cg;, score=0.873 total time=   0.1s
[CV 3/5; 5/13] START solver=newton-cg...........................................
[CV 3/5; 5/13] END ............solver=newton-cg;, score=0.889 total time=   0.1s
[CV 4/5; 5/13] START solver=newton-cg...........................................




[CV 4/5; 5/13] END ............solver=newton-cg;, score=0.833 total time=   0.1s
[CV 5/5; 5/13] START solver=newton-cg...........................................
[CV 5/5; 5/13] END ............solver=newton-cg;, score=0.857 total time=   0.1s
[CV 1/5; 6/13] START solver=lbfgs...............................................
[CV 1/5; 6/13] END ................solver=lbfgs;, score=0.865 total time=   0.0s
[CV 2/5; 6/13] START solver=lbfgs...............................................
[CV 2/5; 6/13] END ................solver=lbfgs;, score=0.873 total time=   0.0s
[CV 3/5; 6/13] START solver=lbfgs...............................................




[CV 3/5; 6/13] END ................solver=lbfgs;, score=0.873 total time=   0.0s
[CV 4/5; 6/13] START solver=lbfgs...............................................
[CV 4/5; 6/13] END ................solver=lbfgs;, score=0.833 total time=   0.0s
[CV 5/5; 6/13] START solver=lbfgs...............................................
[CV 5/5; 6/13] END ................solver=lbfgs;, score=0.857 total time=   0.0s
[CV 1/5; 7/13] START solver=liblinear...........................................
[CV 1/5; 7/13] END ............solver=liblinear;, score=0.873 total time=   0.0s
[CV 2/5; 7/13] START solver=liblinear...........................................
[CV 2/5; 7/13] END ............solver=liblinear;, score=0.873 total time=   0.0s
[CV 3/5; 7/13] START solver=liblinear...........................................
[CV 3/5; 7/13] END ............solver=liblinear;, score=0.849 total time=   0.0s
[CV 4/5; 7/13] START solver=liblinear...........................................
[CV 4/5; 7/13] END .........


[CV 4/5; 9/13] END .................solver=saga;, score=0.492 total time=   0.0s
[CV 5/5; 9/13] START solver=saga................................................
[CV 5/5; 9/13] END .................solver=saga;, score=0.492 total time=   0.0s
[CV 1/5; 10/13] START max_iter=10...............................................
[CV 1/5; 10/13] END ................max_iter=10;, score=0.492 total time=   0.0s
[CV 2/5; 10/13] START max_iter=10...............................................
[CV 2/5; 10/13] END ................max_iter=10;, score=0.492 total time=   0.0s
[CV 3/5; 10/13] START max_iter=10...............................................
[CV 3/5; 10/13] END ................max_iter=10;, score=0.492 total time=   0.0s
[CV 4/5; 10/13] START max_iter=10...............................................
[CV 4/5; 10/13] END ................max_iter=10;, score=0.492 total time=   0.0s
[CV 5/5; 10/13] START max_iter=10...............................................
[CV 5/5; 10/13] END ........


[CV 4/5; 12/13] END ................max_iter=55;, score=0.833 total time=   0.0s
[CV 5/5; 12/13] START max_iter=55...............................................
[CV 5/5; 12/13] END ................max_iter=55;, score=0.833 total time=   0.0s
[CV 1/5; 13/13] START max_iter=78...............................................
[CV 1/5; 13/13] END ................max_iter=78;, score=0.873 total time=   0.0s
[CV 2/5; 13/13] START max_iter=78...............................................
[CV 2/5; 13/13] END ................max_iter=78;, score=0.873 total time=   0.0s
[CV 3/5; 13/13] START max_iter=78...............................................
[CV 3/5; 13/13] END ................max_iter=78;, score=0.849 total time=   0.0s
[CV 4/5; 13/13] START max_iter=78...............................................
[CV 4/5; 13/13] END ................max_iter=78;, score=0.841 total time=   0.0s
[CV 5/5; 13/13] START max_iter=78...............................................
[CV 5/5; 13/13] END ........

10 fits failed out of a total of 65.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the fai

GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid=[{'penalty': ['l1', 'l2', 'elasticnet', 'none']},
                         {'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag',
                                     'saga']},
                         {'max_iter': [10, 34, 55, 78]}],
             verbose=10)

In [19]:
print('Logistic Regression')
# Best parameters
print('Best parameters', lr_best_model.best_params_)
# Best model
print('Best model', lr_best_model.best_estimator_)
# Best mean cross-validation score
print('Best score', lr_best_model.best_score_)

Logistic Regression
Best parameters {'solver': 'newton-cg'}
Best model LogisticRegression(solver='newton-cg')
Best score 0.8634920634920634
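
The convergence warnings in the log above recommend increasing `max_iter` or scaling the data. Below is a hedged sketch of the second option, using a pipeline on synthetic data with deliberately mismatched feature scales. The data is made up, and the result says nothing about the raisin dataset itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, then stretch the columns to very different magnitudes
X_demo, y_demo = make_classification(n_samples=600, n_features=7, random_state=0)
X_demo = X_demo * np.array([1e5, 1e2, 1e2, 1.0, 1e5, 1.0, 1e3])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=12)

# The scaler is fitted on the training data only, then applied before the classifier
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)
acc = scaled_lr.score(X_te, y_te)
print(acc)
```

Passing such a pipeline to `GridSearchCV` would refit the scaler inside each fold, which avoids leaking information from the validation fold into the scaling step.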


### Decision Tree Classifier

In [20]:
# Instantiate the model
model_dtc = DecisionTreeClassifier()
# Fit the model
model_dtc.fit(X_train, Y_train)
# Predict
pred_dtc = model_dtc.predict(X_test)

In [21]:
# Parameter grid
params=[
    {'criterion':["gini", "entropy"]},
    {'max_depth':[10,15,2,33,5,None]},
    {'min_samples_leaf':[1,5,4,3,2]},
    {'max_leaf_nodes':[None,3,50,100,23,10,56]}
]
dtc_best_model = GridSearchCV(model_dtc,params,cv=5,verbose=10)
dtc_best_model.fit(X_train,Y_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
[CV 1/5; 1/20] START criterion=gini.............................................
[CV 1/5; 1/20] END ..............criterion=gini;, score=0.810 total time=   0.0s
[CV 2/5; 1/20] START criterion=gini.............................................
[CV 2/5; 1/20] END ..............criterion=gini;, score=0.833 total time=   0.0s
[CV 3/5; 1/20] START criterion=gini.............................................
[CV 3/5; 1/20] END ..............criterion=gini;, score=0.825 total time=   0.0s
[CV 4/5; 1/20] START criterion=gini.............................................
[CV 4/5; 1/20] END ..............criterion=gini;, score=0.833 total time=   0.0s
[CV 5/5; 1/20] START criterion=gini.............................................
[CV 5/5; 1/20] END ..............criterion=gini;, score=0.841 total time=   0.0s
[CV 1/5; 2/20] START criterion=entropy..........................................
[CV 1/5; 2/20] END ...........criterion=entropy

[CV 3/5; 11/20] END .........min_samples_leaf=4;, score=0.817 total time=   0.0s
[CV 4/5; 11/20] START min_samples_leaf=4........................................
[CV 4/5; 11/20] END .........min_samples_leaf=4;, score=0.762 total time=   0.0s
[CV 5/5; 11/20] START min_samples_leaf=4........................................
[CV 5/5; 11/20] END .........min_samples_leaf=4;, score=0.889 total time=   0.0s
[CV 1/5; 12/20] START min_samples_leaf=3........................................
[CV 1/5; 12/20] END .........min_samples_leaf=3;, score=0.825 total time=   0.0s
[CV 2/5; 12/20] START min_samples_leaf=3........................................
[CV 2/5; 12/20] END .........min_samples_leaf=3;, score=0.857 total time=   0.0s
[CV 3/5; 12/20] START min_samples_leaf=3........................................
[CV 3/5; 12/20] END .........min_samples_leaf=3;, score=0.825 total time=   0.0s
[CV 4/5; 12/20] START min_samples_leaf=3........................................
[CV 4/5; 12/20] END ........

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid=[{'criterion': ['gini', 'entropy']},
                         {'max_depth': [10, 15, 2, 33, 5, None]},
                         {'min_samples_leaf': [1, 5, 4, 3, 2]},
                         {'max_leaf_nodes': [None, 3, 50, 100, 23, 10, 56]}],
             verbose=10)

In [22]:
print('Decision Tree Classifier')
# Best parameters
print('Best parameters', dtc_best_model.best_params_)
# Best model
print('Best model', dtc_best_model.best_estimator_)
# Best mean cross-validation score
print('Best score', dtc_best_model.best_score_)

Decision Tree Classifier
Best parameters {'max_depth': 2}
Best model DecisionTreeClassifier(max_depth=2)
Best score 0.8714285714285716


### Naive Bayes

In [23]:
# Instantiate the model
model_gnb = GaussianNB()
# Fit the model
model_gnb.fit(X_train, Y_train)
# Predict
pred_gnb = model_gnb.predict(X_test)

In [24]:
# Parameter grid
params=[
    {'priors':[None]},
    {'var_smoothing':[1.5,2,3.7]}
]
gnb_best_model = GridSearchCV(model_gnb,params,cv=5,verbose=10)
gnb_best_model.fit(X_train,Y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV 1/5; 1/4] START priors=None.................................................
[CV 1/5; 1/4] END ..................priors=None;, score=0.841 total time=   0.0s
[CV 2/5; 1/4] START priors=None.................................................
[CV 2/5; 1/4] END ..................priors=None;, score=0.865 total time=   0.0s
[CV 3/5; 1/4] START priors=None.................................................
[CV 3/5; 1/4] END ..................priors=None;, score=0.841 total time=   0.0s
[CV 4/5; 1/4] START priors=None.................................................
[CV 4/5; 1/4] END ..................priors=None;, score=0.825 total time=   0.0s
[CV 5/5; 1/4] START priors=None.................................................
[CV 5/5; 1/4] END ..................priors=None;, score=0.833 total time=   0.0s
[CV 1/5; 2/4] START var_smoothing=1.5...........................................
[CV 1/5; 2/4] END ............var_smoothing=1.5;,

GridSearchCV(cv=5, estimator=GaussianNB(),
             param_grid=[{'priors': [None]}, {'var_smoothing': [1.5, 2, 3.7]}],
             verbose=10)

In [25]:
print('Naive Bayes')
# Best parameters
print('Best parameters', gnb_best_model.best_params_)
# Best model
print('Best model', gnb_best_model.best_estimator_)
# Best mean cross-validation score
print('Best score', gnb_best_model.best_score_)

Naive Bayes
Best parameters {'priors': None}
Best model GaussianNB()
Best score 0.8412698412698413
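
The default `var_smoothing` of `GaussianNB` is `1e-9`, so the values searched above (1.5, 2, 3.7) are very large in comparison. Below is a sketch of a log-spaced grid around the default, run on synthetic data; the grid bounds are an arbitrary choice made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic data standing in for the real features
X_demo, y_demo = make_classification(n_samples=600, n_features=7, random_state=0)

# Ten values from 1e-12 to 1e-3, evenly spaced on a log scale
grid = {'var_smoothing': np.logspace(-12, -3, 10)}
search = GridSearchCV(GaussianNB(), grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```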


In [26]:
# 1- Nearest Centroid
# Instantiate the model with the best parameters found
model_ncc = NearestCentroid(metric='manhattan')
# Fit the model
model_ncc.fit(X_train, Y_train)
# Predict
pred_ncc = model_ncc.predict(X_test)
# Compute the accuracy on the test set
ncc_score = accuracy_score(Y_test, pred_ncc)
print('model_ncc accuracy:', (ncc_score*100), '%')

model_ncc accuracy: 79.25925925925927 %
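
Accuracy is the fraction of test samples whose predicted label matches the true label. Here is a small check, on made-up labels, that `accuracy_score` agrees with that definition:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

manual = np.mean(y_true == y_pred)          # correct predictions / total
library = accuracy_score(y_true, y_pred)
print(manual, library)
```

With the 270 test samples of this notebook, the 79.26% reported above corresponds to 214 correct predictions (214 / 270).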


In [27]:
# 2- K-Nearest Neighbors
# Instantiate the model with the best parameters found
model_knn = KNeighborsClassifier(n_neighbors=7)
# Fit the model
model_knn.fit(X_train, Y_train)
# Predict
pred_knn = model_knn.predict(X_test)
# Compute the accuracy on the test set
knn_score = accuracy_score(Y_test, pred_knn)
print('model_knn accuracy:', (knn_score*100), '%')

model_knn accuracy: 80.74074074074075 %


In [28]:
# 3- Logistic Regression
# Instantiate the model with the best parameters found
model_lr = LogisticRegression(solver='newton-cg')
# Fit the model
model_lr.fit(X_train, Y_train)
# Predict
pred_lr = model_lr.predict(X_test)
# Compute the accuracy on the test set
score_lr = accuracy_score(Y_test, pred_lr)
print('model_lr accuracy:', (score_lr*100), '%')

model_lr accuracy: 83.7037037037037 %




In [29]:
# 4- Naive Bayes
# Instantiate the model
model_gnb = GaussianNB()
# Fit the model
model_gnb.fit(X_train, Y_train)
# Predict
pred_gnb = model_gnb.predict(X_test)
# Compute the accuracy on the test set
gnb_score = accuracy_score(Y_test, pred_gnb)
print('model_gnb accuracy:', (gnb_score*100), '%')

model_gnb accuracy: 78.14814814814814 %


In [30]:
# 5- Decision Tree
# Instantiate the model with the best parameters found
model_dtc = DecisionTreeClassifier(max_depth=2)
# Fit the model
model_dtc.fit(X_train, Y_train)
# Predict
pred_dtc = model_dtc.predict(X_test)
# Compute the accuracy on the test set
dtc_score = accuracy_score(Y_test, pred_dtc)
print('model_dtc accuracy:', (dtc_score*100), '%')

model_dtc accuracy: 82.96296296296296 %


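The logistic regression model (model_lr) achieves the highest test accuracy, about 83.7%, ahead of the decision tree (about 83.0%) and the other models.
We can conclude that, of the models compared here, logistic regression is the most effective on this dataset.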
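
To make the comparison easier to read, the test accuracies printed above can be gathered in one sorted table. The values below are copied from the outputs of cells 26 to 30 and rounded to two decimals; the variable names are illustrative.

```python
import pandas as pd

# Test accuracies (%) reported in the cells above
test_accuracy = pd.Series({
    'Nearest Centroid': 79.26,
    'K-Nearest Neighbors': 80.74,
    'Logistic Regression': 83.70,
    'Gaussian Naive Bayes': 78.15,
    'Decision Tree': 82.96,
}, name='test accuracy (%)')

# Sort from best to worst
ranking = test_accuracy.sort_values(ascending=False)
print(ranking)
```

The gap between the best and worst model is under six percentage points on a single 270-sample test split, so the ranking could change with a different `random_state`.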