<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Classification using decision trees with scikit-learn </p>

**Analysis of daily weather data**

# Decision Tree

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Note: scikit-learn's trees use an optimized version of CART, not ID3 or C4.5

In [2]:
data = pd.read_csv('./weather/daily_weather.csv')

In [3]:
data.columns

Index(['number', 'air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm'],
      dtype='object')



The dataset consists of the following variables:

* **number:** unique number for each row
* **air_pressure_9am:** air pressure averaged over a period from 8:55am to 9:04am (*Unit: hectopascals*)
* **air_temp_9am:** air temperature averaged over a period from 8:55am to 9:04am (*Unit: degrees Fahrenheit*)
* **avg_wind_direction_9am:** wind direction averaged over a period from 8:55am to 9:04am (*Unit: degrees, with 0 meaning wind coming from the North, and increasing clockwise*)
* **avg_wind_speed_9am:** wind speed averaged over a period from 8:55am to 9:04am (*Unit: miles per hour*)
* **max_wind_direction_9am:** wind gust direction averaged over a period from 8:55am to 9:10am (*Unit: degrees, with 0 being North and increasing clockwise*)
* **max_wind_speed_9am:** wind gust speed averaged over a period from 8:55am to 9:04am (*Unit: miles per hour*)
* **rain_accumulation_9am:** amount of rain accumulated in the 24 hours prior to 9am (*Unit: millimeters*)
* **rain_duration_9am:** amount of time rain was recorded in the 24 hours prior to 9am (*Unit: seconds*)
* **relative_humidity_9am:** relative humidity averaged over a period from 8:55am to 9:04am (*Unit: percent*)
* **relative_humidity_3pm:** relative humidity averaged over a period from 2:55pm to 3:04pm (*Unit: percent*)


In [4]:
data.head()

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
0,0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16
1,1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697,19.426597
2,2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46
3,3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547
4,4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74


In [5]:
data[data.isnull().any(axis=1)]

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
16,16,917.89,,169.2,2.192201,196.8,2.930391,0.0,0.0,48.99,51.19
111,111,915.29,58.82,182.6,15.613841,189.0,,0.0,0.0,21.5,29.69
177,177,915.9,,183.3,4.719943,189.9,5.346287,0.0,0.0,29.26,46.5
262,262,923.596607,58.380598,47.737753,10.636273,67.145843,13.671423,0.0,,17.990876,16.461685
277,277,920.48,62.6,194.4,2.751436,,3.869906,0.0,0.0,52.58,54.03
334,334,916.23,75.74,149.1,2.751436,187.5,4.183078,,1480.0,31.88,32.9
358,358,917.44,58.514,55.1,10.021491,,12.705819,0.0,0.0,13.88,25.93
361,361,920.444946,65.801845,49.823346,21.520177,61.886944,25.549112,,40.364018,12.278715,7.618649
381,381,918.48,66.542,90.9,3.467257,89.4,4.406772,,0.0,20.64,14.35
409,409,,67.853833,65.880616,4.328594,78.570923,5.216734,0.0,0.0,18.487385,20.356594


**Data cleaning**

In [6]:
del data['number']

**Dropping rows with missing values**

In [7]:
before_rows = data.shape[0]
print(before_rows)

1095


In [8]:
data = data.dropna()

In [9]:
after_rows = data.shape[0]
print(after_rows)

1064


In [10]:
before_rows - after_rows

31
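The 31 dropped rows can be cross-checked by counting, before calling `dropna()`, how many rows contain at least one missing value. The sketch below uses a small made-up frame (the `toy` name and its values are assumptions, not the weather data) to show that both counts agree.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the weather data (values are made up).
toy = pd.DataFrame({
    "air_temp_9am": [74.8, np.nan, 60.6, 70.1],
    "max_wind_speed_9am": [2.9, 3.5, np.nan, 5.2],
    "rain_duration_9am": [0.0, 0.0, 20.0, np.nan],
})

# Rows with at least one missing value, counted before dropping anything.
rows_with_nan = int(toy.isnull().any(axis=1).sum())

# dropna() removes exactly those rows.
dropped = len(toy) - len(toy.dropna())

print(rows_with_nan, dropped)
```

On the real data, the same check would be `data.isnull().any(axis=1).sum()` run before cell 8.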

**Preparing the data for the classification task**

Binarize relative_humidity_3pm to 0 or 1: the label is 1 when the 3pm relative humidity is above 24.99 percent, and 0 otherwise.

In [11]:
clean_data = data.copy()
clean_data['high_humidity_label'] = (clean_data['relative_humidity_3pm'] > 24.99)*1
print(clean_data['high_humidity_label'])

0       1
1       0
2       0
3       0
4       1
5       1
6       0
7       1
8       0
9       1
10      1
11      1
12      1
13      1
14      0
15      0
17      0
18      1
19      0
20      0
21      1
22      0
23      1
24      0
25      1
26      1
27      1
28      1
29      1
30      1
       ..
1064    1
1065    1
1067    1
1068    1
1069    1
1070    1
1071    1
1072    0
1073    1
1074    1
1075    0
1076    0
1077    1
1078    0
1079    1
1080    0
1081    0
1082    1
1083    1
1084    1
1085    1
1086    1
1087    1
1088    1
1089    1
1090    1
1091    1
1092    1
1093    1
1094    0
Name: high_humidity_label, Length: 1064, dtype: int32
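As an illustration of the thresholding rule, here is a minimal sketch on made-up readings (the `humidity_3pm` series and `THRESHOLD` name are assumptions). It uses `.astype(int)` instead of `*1`; both turn the boolean mask into 0/1. Note that the comparison is strict, so a reading of exactly 24.99 gets label 0.

```python
import pandas as pd

# Made-up humidity readings, in percent (not taken from the dataset).
humidity_3pm = pd.Series([36.16, 19.43, 14.46, 24.99, 76.74])

# Strictly greater than the threshold -> 1, otherwise 0.
THRESHOLD = 24.99
label = (humidity_3pm > THRESHOLD).astype(int)

print(label.tolist())  # [1, 0, 0, 0, 1]
```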


In [12]:
# Store the target (what we have to predict) in y
y = clean_data[['high_humidity_label']].copy()

In [13]:
clean_data['relative_humidity_3pm'].head()

0    36.160000
1    19.426597
2    14.460000
3    12.742547
4    76.740000
Name: relative_humidity_3pm, dtype: float64

In [14]:
y.head()

Unnamed: 0,high_humidity_label
0,1
1,0
2,0
3,0
4,1


In [49]:
y[y == 0].count()

high_humidity_label    535
dtype: int64

In [50]:
y[y == 1].count()

high_humidity_label    529
dtype: int64

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Using the 9am sensor signals to predict humidity at 3pm</p>

In [15]:
morning_features = ['air_pressure_9am','air_temp_9am','avg_wind_direction_9am','avg_wind_speed_9am',
        'max_wind_direction_9am','max_wind_speed_9am','rain_accumulation_9am',
        'rain_duration_9am', 'relative_humidity_9am' ]

In [16]:
X = clean_data[morning_features].copy()

In [17]:
X.columns

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am'],
      dtype='object')

In [18]:
X.head()

Unnamed: 0,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am
0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42
1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697
2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9
3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102
4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41


In [19]:
y.columns

Index(['high_humidity_label'], dtype='object')

**Splitting the dataset into training and test sets**

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)

In [21]:
y_train.describe()

Unnamed: 0,high_humidity_label
count,712.0
mean,0.494382
std,0.50032
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0
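With `test_size=0.33` on 1064 rows, scikit-learn rounds the test set up to 352 rows, which leaves the 712 training rows reported above. The sketch below reproduces those sizes on synthetic data with the same class counts (535 and 529). The `stratify` argument is an addition that the notebook does not use; it is shown only as one way to keep the class proportions equal in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the cleaned data.
rng = np.random.RandomState(0)
X_demo = rng.rand(1064, 9)
y_demo = np.array([0] * 535 + [1] * 529)

# stratify=y_demo is an assumption added here, not part of the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```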


**Training the model**

In [22]:
# max_leaf_nodes caps the number of leaves (the terminal nodes that hold the predictions)
humidity_classifier10 = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
humidity_classifier10.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=10,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

In [23]:
# max_leaf_nodes caps the number of leaves (the terminal nodes that hold the predictions)
humidity_classifier12 = DecisionTreeClassifier(max_leaf_nodes=12, random_state=0)
humidity_classifier12.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=12,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')
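`max_leaf_nodes` is an upper bound, not a target: the fitted tree may end up with fewer leaves. A minimal sketch on synthetic data (the `X_demo`, `y_demo`, and `leaf_counts` names are assumptions, and this is not the weather data) checks the size of the fitted trees with `get_n_leaves()` and `get_depth()`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with 9 features (not the weather data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=9, n_informative=5, random_state=0
)

leaf_counts = {}
for cap in (10, 12):
    tree = DecisionTreeClassifier(max_leaf_nodes=cap, random_state=0)
    tree.fit(X_demo, y_demo)
    # get_n_leaves() reports how many leaves were actually grown.
    leaf_counts[cap] = tree.get_n_leaves()
    print(cap, leaf_counts[cap], tree.get_depth())
```

The same two calls can be applied to `humidity_classifier10` and `humidity_classifier12`.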

In [24]:
type(humidity_classifier10)

sklearn.tree.tree.DecisionTreeClassifier

**Predicting on the test data**

In [25]:
predictions = humidity_classifier10.predict(X_test)

In [26]:
predictions[:10]

array([0, 0, 1, 1, 1, 1, 1, 0, 1, 1])

In [27]:
y_test['high_humidity_label'][:10]

456     0
845     0
693     1
259     1
723     1
224     1
300     1
442     0
585     1
1057    1
Name: high_humidity_label, dtype: int32

**Measuring the model's accuracy**

In [28]:
accuracy_score(y_true = y_test, y_pred = predictions)

0.9005681818181818

In [29]:
y_pred = predictions
class_names = y

In [30]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_curve, f1_score

dtRes , dt_pred, dt_accuracy, dt_confmat, dt_precision, dt_recall, dt_f1 = [], [], [], [], [], [], []

for max_leaf_nodes in [5, 10, 20]:
    # the fit method returns the object self so we can instantiate it
    # and fit in one line
    clf = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=0).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='binary') 
    dt_precision.append(precision)
    dt_recall.append(recall)
    dt_f1.append(f1)
    dt_pred.append(y_pred)
    dt_accuracy.append(accuracy_score(y_test, y_pred))
    dt_confmat.append(confusion_matrix(y_test, y_pred))
    dtRes.append({'Training Score': {'Max leaf nodes' : max_leaf_nodes, 'Score' :'{:.2f}'.format(clf.score(X_train, y_train))},
          'Test Score' : {'Max leaf nodes' : max_leaf_nodes, 'Score' :'{:.2f}'.format(clf.score(X_test, y_test))} })
    
for elet in dtRes:
    print("\nDecision Tree results: \n")
    print("{}".format(elet))


Decision Tree results: 

{'Training Score': {'Max leaf nodes': 5, 'Score': '0.88'}, 'Test Score': {'Max leaf nodes': 5, 'Score': '0.91'}}

Decision Tree results: 

{'Training Score': {'Max leaf nodes': 10, 'Score': '0.90'}, 'Test Score': {'Max leaf nodes': 10, 'Score': '0.90'}}

Decision Tree results: 

{'Training Score': {'Max leaf nodes': 20, 'Score': '0.93'}, 'Test Score': {'Max leaf nodes': 20, 'Score': '0.89'}}


# KNN

In [31]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_curve, f1_score

knnRes , knn_pred, knn_accuracy, knn_confmat, knn_precision, knn_recall, knn_f1 = [], [], [], [], [], [], []

for n_neighbors in [1, 3, 9]:
    # the fit method returns the object self so we can instantiate it
    # and fit in one line
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='binary') 
    knn_precision.append(precision)
    knn_recall.append(recall)
    knn_f1.append(f1)
    knn_pred.append(y_pred)
    knn_accuracy.append(accuracy_score(y_test, y_pred))
    knn_confmat.append(confusion_matrix(y_test, y_pred))
    knnRes.append({'Training Score': {'Nbr of neighbors' : n_neighbors, 'Score' :'{:.2f}'.format(clf.score(X_train, y_train))},
          'Test Score' : {'Nbr of neighbors' : n_neighbors, 'Score' :'{:.2f}'.format(clf.score(X_test, y_test))} })
    
for elet in knnRes:
    print("\nKNN results: \n")
    print("{}".format(elet))


KNN results: 

{'Training Score': {'Nbr of neighbors': 1, 'Score': '1.00'}, 'Test Score': {'Nbr of neighbors': 1, 'Score': '0.91'}}

KNN results: 

{'Training Score': {'Nbr of neighbors': 3, 'Score': '0.92'}, 'Test Score': {'Nbr of neighbors': 3, 'Score': '0.91'}}

KNN results: 

{'Training Score': {'Nbr of neighbors': 9, 'Score': '0.88'}, 'Test Score': {'Nbr of neighbors': 9, 'Score': '0.87'}}
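KNN classifies by distance, so features measured on large scales (here, for example, `rain_duration_9am` in seconds against wind speeds in miles per hour) can dominate the neighbor search. The notebook fits KNN on raw features; the sketch below, on synthetic data, shows one common alternative, a `StandardScaler` placed in a pipeline before the classifier. Whether scaling helps on the weather data is not tested here, so this is only an illustration of the mechanics.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem with 9 features (not the weather data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=9, n_informative=5, random_state=0
)
# Inflate one feature's scale, loosely mimicking a column recorded in seconds.
X_demo[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324
)

# Same classifier, with and without standardization.
raw_knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=3)
).fit(X_tr, y_tr)

raw_score = raw_knn.score(X_te, y_te)
scaled_score = scaled_knn.score(X_te, y_te)
print("raw: {:.2f}  scaled: {:.2f}".format(raw_score, scaled_score))
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no test-set statistics leak into training.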




In [32]:
print(knn_precision)

[array([0.50284091, 0.93452381, 1.        ]), array([0.50284091, 0.91860465, 1.        ]), array([0.50284091, 0.91666667, 1.        ])]
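Each array has three entries because `precision_recall_curve` receives hard 0/1 predictions rather than continuous scores. There are then only two thresholds: the first entry is the share of positives (everything predicted positive), the middle entry is the precision of the actual predictions, and the last is the conventional end point 1.0. A hedged sketch on toy labels (made-up arrays, not the notebook's predictions) shows that `precision_score` and `recall_score` return that middle value directly.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

# Toy labels and hard predictions (made up for illustration).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Curve computed from hard labels: only three points.
curve_precision, curve_recall, _ = precision_recall_curve(y_true, y_hat)

# Scalar metrics for the positive class.
p = precision_score(y_true, y_hat)
r = recall_score(y_true, y_hat)

print(curve_precision, curve_recall)
print(p, r)
```

To get a full curve, the usual approach is to pass scores such as `clf.predict_proba(X_test)[:, 1]` instead of `clf.predict(X_test)`.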


# Nearest Centroid

In [33]:
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_curve, f1_score

ncRes , nc_pred, nc_accuracy, nc_confmat, nc_precision, nc_recall, nc_f1 = [], [], [], [], [], [], []

for nc_metric in ['euclidean', 'manhattan']:
    # the fit method returns the object self so we can instantiate it
    # and fit in one line
    clf = NearestCentroid(metric=nc_metric).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='binary') 
    nc_precision.append(precision)
    nc_recall.append(recall)
    nc_f1.append(f1)
    nc_pred.append(y_pred)
    nc_accuracy.append(accuracy_score(y_test, y_pred))
    nc_confmat.append(confusion_matrix(y_test, y_pred))
    ncRes.append({'Training Score': {'metric' : nc_metric, 'Score' :'{:.2f}'.format(clf.score(X_train, y_train))},
          'Test Score' : {'metric' : nc_metric, 'Score' :'{:.2f}'.format(clf.score(X_test, y_test))} })

for elet in ncRes:
    print("\nNearest Centroid results: \n")
    print("{}".format(elet))


Nearest Centroid results: 

{'Training Score': {'metric': 'euclidean', 'Score': '0.55'}, 'Test Score': {'metric': 'euclidean', 'Score': '0.55'}}

Nearest Centroid results: 

{'Training Score': {'metric': 'manhattan', 'Score': '0.70'}, 'Test Score': {'metric': 'manhattan', 'Score': '0.71'}}
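The target `y` is stored as a one-column DataFrame, which has shape `(n, 1)`. Several estimators expect a 1-D target of shape `(n,)` and convert it themselves, usually with a warning. A minimal sketch, on made-up data, of flattening the target explicitly with `.values.ravel()` before fitting:

```python
import pandas as pd
from sklearn.neighbors import NearestCentroid

# Tiny made-up training set with two well-separated groups.
X_demo = pd.DataFrame({"f1": [0.0, 0.1, 1.0, 1.1], "f2": [0.0, 0.2, 1.0, 0.9]})
y_frame = pd.DataFrame({"high_humidity_label": [0, 0, 1, 1]})

# (n, 1) DataFrame -> (n,) NumPy array.
y_flat = y_frame.values.ravel()

clf = NearestCentroid().fit(X_demo, y_flat)
pred = clf.predict(pd.DataFrame({"f1": [0.05, 1.05], "f2": [0.1, 0.95]}))

print(y_frame.shape, y_flat.shape, pred.tolist())
```

In the notebook this would correspond to fitting with `y_train.values.ravel()`.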




# SVM

In [34]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_curve, f1_score

linSVMRes , linSVM_pred, linSVM_accuracy, linSVM_confmat, linSVM_precision, linSVM_recall, linSVM_f1 = [], [], [], [], [], [], []

for C in [0.001, 1, 100]:
    clf = LinearSVC(C=C).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='binary') 
    linSVM_precision.append(precision)
    linSVM_recall.append(recall)
    linSVM_f1.append(f1)
    linSVM_pred.append(y_pred)
    linSVM_accuracy.append(accuracy_score(y_test, y_pred))
    linSVM_confmat.append(confusion_matrix(y_test, y_pred))
    linSVMRes.append({'Training Score': {'C value' : C, 'Score' :'{:.2f}'.format(clf.score(X_train, y_train))},
          'Test Score' : {'C value' : C, 'Score' :'{:.2f}'.format(clf.score(X_test, y_test))} })
    
for elet in linSVMRes:
    print("\n SVM results: \n")
    print("{}".format(elet))


 SVM results: 

{'Training Score': {'C value': 0.001, 'Score': '0.75'}, 'Test Score': {'C value': 0.001, 'Score': '0.76'}}

 SVM results: 

{'Training Score': {'C value': 1, 'Score': '0.86'}, 'Test Score': {'C value': 1, 'Score': '0.88'}}

 SVM results: 

{'Training Score': {'C value': 100, 'Score': '0.83'}, 'Test Score': {'C value': 100, 'Score': '0.85'}}


# Naive Bayes

## Multinomial NB

In [35]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_curve, f1_score

mnbRes , mnb_pred, mnb_accuracy, mnb_confmat, mnb_precision, mnb_recall, mnb_f1 = [], [], [], [], [], [], []

for alpha in [0.1, 0.01, 10]:
    clf = MultinomialNB(alpha=alpha).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='binary') 
    mnb_precision.append(precision)
    mnb_recall.append(recall)
    mnb_f1.append(f1)
    mnb_pred.append(y_pred)
    mnb_accuracy.append(accuracy_score(y_test, y_pred))
    mnb_confmat.append(confusion_matrix(y_test, y_pred))
    mnbRes.append({'Training Score': {'Alpha value' : alpha, 'Score' :'{:.2f}'.format(clf.score(X_train, y_train))},
          'Test Score' : {'Alpha value' : alpha, 'Score' :'{:.2f}'.format(clf.score(X_test, y_test))} })
    
for elet in mnbRes:
    print("\n MultinomialNB results: \n")
    print("{}".format(elet))


 MultinomialNB results: 

{'Training Score': {'Alpha value': 0.1, 'Score': '0.55'}, 'Test Score': {'Alpha value': 0.1, 'Score': '0.57'}}

 MultinomialNB results: 

{'Training Score': {'Alpha value': 0.01, 'Score': '0.55'}, 'Test Score': {'Alpha value': 0.01, 'Score': '0.57'}}

 MultinomialNB results: 

{'Training Score': {'Alpha value': 10, 'Score': '0.55'}, 'Test Score': {'Alpha value': 10, 'Score': '0.57'}}


## Gaussian NB

In [36]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_curve, f1_score

gnbRes = []

clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)
precision, recall, thresholds = precision_recall_curve(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='binary') 
gnb_precision = precision
gnb_recall = recall
gnb_f1 = f1
gnb_pred = y_pred
gnb_accuracy = accuracy_score(y_test, y_pred)
gnb_confmat = confusion_matrix(y_test, y_pred)
gnbRes.append({'Training Score':'{:.2f}'.format(clf.score(X_train, y_train)),
          'Test Score' :'{:.2f}'.format(clf.score(X_test, y_test))} )
    
for elet in gnbRes:
    print("\n GaussianNB results: \n")
    print("{}".format(elet))


 GaussianNB results: 

{'Training Score': '0.77', 'Test Score': '0.78'}


# Results and discussion

## Decision Tree

In [37]:
import pandas as pd

dt_index = [5, 10, 20]
dt = pd.DataFrame({
        'Max leaf nodes' : dt_index,
        'accuracy': dt_accuracy,
        'confusion_matrix': dt_confmat,
        'precision': dt_precision,
        'recall': dt_recall,
        'f1' : dt_f1  
})

In [38]:
dt.set_index('Max leaf nodes')

Unnamed: 0_level_0,accuracy,confusion_matrix,f1,precision,recall
Max leaf nodes,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5,0.911932,"[[160, 15], [16, 161]]",0.912181,"[0.5028409090909091, 0.9147727272727273, 1.0]","[1.0, 0.9096045197740112, 0.0]"
10,0.900568,"[[162, 13], [22, 155]]",0.898551,"[0.5028409090909091, 0.9226190476190477, 1.0]","[1.0, 0.8757062146892656, 0.0]"
20,0.892045,"[[160, 15], [23, 154]]",0.890173,"[0.5028409090909091, 0.9112426035502958, 1.0]","[1.0, 0.8700564971751412, 0.0]"


Comparing the metrics, the best setting is max leaf nodes equal to 5: it has the highest accuracy (0.912) and the highest F1 (0.912) of the three. We therefore keep only the first model.
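The scalar metrics in the table can be re-derived from the confusion matrix alone. The sketch below does so for the max leaf nodes = 5 row, using the scikit-learn layout in which rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix for max_leaf_nodes=5, copied from the table above.
# Layout: [[TN, FP], [FN, TP]] for the positive class 1.
confmat = np.array([[160, 15], [16, 161]])
tn, fp, fn, tp = confmat.ravel()

accuracy = (tp + tn) / confmat.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 6), round(precision, 6), round(recall, 6), round(f1, 6))
# 0.911932 0.914773 0.909605 0.912181
```

These values match the `accuracy`, `f1`, and middle `precision` and `recall` entries of that row.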

In [67]:
bdt = dt.loc[dt['Max leaf nodes'] == 5 ]

In [68]:
bdt

Unnamed: 0,Max leaf nodes,accuracy,confusion_matrix,f1,precision,recall
0,5,0.911932,"[[160, 15], [16, 161]]",0.912181,"[0.5028409090909091, 0.9147727272727273, 1.0]","[1.0, 0.9096045197740112, 0.0]"


In [69]:
print(type(bdt))

<class 'pandas.core.frame.DataFrame'>


## KNN

In [40]:
import pandas as pd

knn_index = [1, 3, 9]
knn = pd.DataFrame({
        'Neighbors' : knn_index,
        'accuracy': knn_accuracy,
        'confusion_matrix': knn_confmat,
        'precision': knn_precision,
        'recall': knn_recall,
        'f1' : knn_f1  
})
knn.set_index('Neighbors')

Unnamed: 0_level_0,accuracy,confusion_matrix,f1,precision,recall
Neighbors,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,0.911932,"[[164, 11], [20, 157]]",0.910145,"[0.5028409090909091, 0.9345238095238095, 1.0]","[1.0, 0.8870056497175142, 0.0]"
3,0.90625,"[[161, 14], [19, 158]]",0.905444,"[0.5028409090909091, 0.9186046511627907, 1.0]","[1.0, 0.8926553672316384, 0.0]"
9,0.866477,"[[162, 13], [34, 143]]",0.858859,"[0.5028409090909091, 0.9166666666666666, 1.0]","[1.0, 0.807909604519774, 0.0]"


Comparing the metrics, the best setting is a single neighbor (n_neighbors = 1), with the highest accuracy (0.912) and F1 (0.910). We therefore keep only the first model, the one with 1 neighbor.

In [56]:
bknn = knn.iloc[[0]]

In [102]:
bknn

Unnamed: 0,Neighbors,accuracy,confusion_matrix,f1,precision,recall
0,1,0.911932,"[[164, 11], [20, 157]]",0.910145,"[0.5028409090909091, 0.9345238095238095, 1.0]","[1.0, 0.8870056497175142, 0.0]"


In [57]:
print(type(bknn))

<class 'pandas.core.frame.DataFrame'>


## Nearest Centroid

In [42]:
import pandas as pd

nc_index = ['euclidean', 'manhattan']
nc = pd.DataFrame({
        'Metric' : nc_index,
        'accuracy': nc_accuracy,
        'confusion_matrix': nc_confmat,
        'precision': nc_precision,
        'recall': nc_recall,
        'f1' : nc_f1  
})
nc.set_index('Metric')

Unnamed: 0_level_0,accuracy,confusion_matrix,f1,precision,recall
Metric,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
euclidean,0.553977,"[[172, 3], [154, 23]]",0.226601,"[0.5028409090909091, 0.8846153846153846, 1.0]","[1.0, 0.12994350282485875, 0.0]"
manhattan,0.707386,"[[111, 64], [39, 138]]",0.728232,"[0.5028409090909091, 0.6831683168316832, 1.0]","[1.0, 0.7796610169491526, 0.0]"


Comparing the metrics, the best setting is the Manhattan distance metric (accuracy 0.707 against 0.554, F1 0.728 against 0.227). We therefore keep only the second result.

In [61]:
bnc = nc.loc[nc['Metric'] == 'manhattan']

In [63]:
bnc

Unnamed: 0,Metric,accuracy,confusion_matrix,f1,precision,recall
1,manhattan,0.707386,"[[111, 64], [39, 138]]",0.728232,"[0.5028409090909091, 0.6831683168316832, 1.0]","[1.0, 0.7796610169491526, 0.0]"


In [62]:
print(type(bnc))

<class 'pandas.core.frame.DataFrame'>


## SVM

In [43]:
import pandas as pd

linSVM_index = [0.001, 1, 100]
linSVM = pd.DataFrame({
        'C' : linSVM_index,
        'accuracy': linSVM_accuracy,
        'confusion_matrix': linSVM_confmat,
        'precision': linSVM_precision,
        'recall': linSVM_recall,
        'f1' : linSVM_f1  
})
linSVM.set_index('C')

Unnamed: 0_level_0,accuracy,confusion_matrix,f1,precision,recall
C,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0.001,0.761364,"[[173, 2], [82, 95]]",0.693431,"[0.5028409090909091, 0.979381443298969, 1.0]","[1.0, 0.536723163841808, 0.0]"
1.0,0.880682,"[[168, 7], [35, 142]]",0.871166,"[0.5028409090909091, 0.9530201342281879, 1.0]","[1.0, 0.8022598870056498, 0.0]"
100.0,0.852273,"[[168, 7], [45, 132]]",0.835443,"[0.5028409090909091, 0.9496402877697842, 1.0]","[1.0, 0.7457627118644068, 0.0]"


Comparing the metrics, the best setting is C = 1 (accuracy 0.881, F1 0.871). We therefore keep only the second result.

In [64]:
blinSVM = linSVM.loc[linSVM['C'] == 1.000 ]

In [65]:
blinSVM

Unnamed: 0,C,accuracy,confusion_matrix,f1,precision,recall
1,1.0,0.880682,"[[168, 7], [35, 142]]",0.871166,"[0.5028409090909091, 0.9530201342281879, 1.0]","[1.0, 0.8022598870056498, 0.0]"


In [66]:
print(type(blinSVM))

<class 'pandas.core.frame.DataFrame'>


## Naive Bayes

In [44]:
import pandas as pd

gnb = pd.DataFrame({
        'accuracy': [gnb_accuracy],
        'confusion_matrix': [gnb_confmat],
        'precision': [gnb_precision],
        'recall': [gnb_recall],
        'f1' : [gnb_f1]  
})

In [45]:
gnb

Unnamed: 0,accuracy,confusion_matrix,f1,precision,recall
0,0.78125,"[[171, 4], [73, 104]]",0.729825,"[0.5028409090909091, 0.9629629629629629, 1.0]","[1.0, 0.5875706214689266, 0.0]"


## Comparison between the models

The first question to ask is: are our data balanced (do we have the same number of examples in class 0 as in class 1)?

In [49]:
y[y == 0].count()

high_humidity_label    535
dtype: int64

In [50]:
y[y == 1].count()

high_humidity_label    529
dtype: int64

**Conclusion:**   

- The results above show that the two classes are not exactly equal in size (535 examples in class 0 against 529 in class 1), although the gap is small: about 50.3% against 49.7%. This lets us narrow down the metrics used to compare the models.

- The **F-measure** (F1) is a well-known metric for imbalanced data, and it remains informative when the classes are nearly balanced as they are here. That is why we use it to determine the best-performing model.

In [94]:
import pandas as pd

dfs = [bdt, bknn, bnc, blinSVM, gnb]
alldf = pd.concat(dfs)

In [95]:
# Named `summary` rather than `all` to avoid shadowing Python's built-in all()
summary = (alldf.loc[:, 'accuracy':'recall']).reset_index()

In [98]:
summary['Model'] = ['Decision Trees', 'KNN', 'Nearest Centroid', 'SVM', 'Naive Bayes']

In [101]:
summary.set_index('Model')

Unnamed: 0_level_0,index,accuracy,confusion_matrix,f1,precision,recall
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Decision Trees,0,0.911932,"[[160, 15], [16, 161]]",0.912181,"[0.5028409090909091, 0.9147727272727273, 1.0]","[1.0, 0.9096045197740112, 0.0]"
KNN,0,0.911932,"[[164, 11], [20, 157]]",0.910145,"[0.5028409090909091, 0.9345238095238095, 1.0]","[1.0, 0.8870056497175142, 0.0]"
Nearest Centroid,1,0.707386,"[[111, 64], [39, 138]]",0.728232,"[0.5028409090909091, 0.6831683168316832, 1.0]","[1.0, 0.7796610169491526, 0.0]"
SVM,1,0.880682,"[[168, 7], [35, 142]]",0.871166,"[0.5028409090909091, 0.9530201342281879, 1.0]","[1.0, 0.8022598870056498, 0.0]"
Naive Bayes,0,0.78125,"[[171, 4], [73, 104]]",0.729825,"[0.5028409090909091, 0.9629629629629629, 1.0]","[1.0, 0.5875706214689266, 0.0]"


**Conclusion:** Comparing the models on the F-measure (F1), the best model for this dataset is `Decision Trees` (F1 = 0.912), narrowly ahead of `KNN` (F1 = 0.910).
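The ranking can also be read off programmatically. A small sketch, with the F1 values copied by hand from the comparison table above (the `f1_by_model` name is an assumption), sorts the models and picks the top one.

```python
import pandas as pd

# F1 scores copied from the comparison table above.
f1_by_model = pd.Series({
    "Decision Trees": 0.912181,
    "KNN": 0.910145,
    "Nearest Centroid": 0.728232,
    "SVM": 0.871166,
    "Naive Bayes": 0.729825,
})

# Sort from best to worst and take the first entry.
ranking = f1_by_model.sort_values(ascending=False)
best_model = ranking.index[0]

print(ranking)
print(best_model)  # Decision Trees
```

The margin between the first two models is about 0.002, so a different `random_state` in the split could plausibly change their order.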