**Authors:** Guillaume Poirier-Morency and Gabriel Lemyre

Each model is presented in turn, trained, and finally tested with the best parameters found by the validation process.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_mldata
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from collections import OrderedDict
%matplotlib inline

The salary dataset already comes split into two sets, one for training and one for testing.

In [None]:
salary_dtype = OrderedDict([('age', 'int'), 
                            ('workclass', 'category'), 
                            ('financial_weight', 'int'), 
                            ('education', 'category'), 
                            ('education_code', 'int'),
                            ('marital_status', 'category'), 
                            ('occupation', 'category'),
                            ('relationship', 'category'),
                            ('race', 'category'),
                            ('sex', 'category'),
                            ('capital_gain', 'int'),
                            ('capital_loss', 'int'),
                            ('hours_per_week', 'int'),
                            ('native_country', 'category'),
                            ('target', 'category')])
salary_continuous_columns = ['age', 'financial_weight', 'capital_gain', 'capital_loss', 'hours_per_week']
salary_categorical_columns = ['workclass', 'education', 'education_code', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']

salary_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', sep=', ', engine='python', names=salary_dtype.keys(), dtype=salary_dtype, na_values=['?'])
salary_test = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test', sep=', ', engine='python', skiprows=[0], names=salary_dtype.keys(), dtype=salary_dtype, na_values=['?'])

salary_data[salary_categorical_columns] = salary_data[salary_categorical_columns].apply(lambda x: x.fillna(x.mode()[0]))
salary_test[salary_categorical_columns] = salary_test[salary_categorical_columns].apply(lambda x: x.fillna(x.mode()[0]))
identity = lambda x: x
salary_transform = {
    'age': identity,
    'workclass': LabelEncoder().fit(salary_data.workclass).transform,
    'financial_weight': identity,
    'education': LabelEncoder().fit(salary_data.education).transform,
    'education_code': identity,
    'marital_status': LabelEncoder().fit(salary_data.marital_status).transform,
    'occupation': LabelEncoder().fit(salary_data.occupation).transform,
    'relationship': LabelEncoder().fit(salary_data.relationship).transform,
    'race': LabelEncoder().fit(salary_data.race).transform,
    'sex': LabelEncoder().fit(salary_data.sex).transform,
    'capital_gain': identity,
    'capital_loss': identity,
    'hours_per_week': identity,
    'native_country': LabelEncoder().fit(salary_data.native_country).transform,
    'target': lambda x: LabelBinarizer().fit_transform(x).ravel()}
salary_data = salary_data.transform(salary_transform)
salary_test = salary_test.transform(salary_transform)

salary_train_X, salary_train_Y = salary_data.iloc[:,:len(salary_dtype)-1], salary_data['target']
salary_test_X, salary_test_Y = salary_test.iloc[:,:len(salary_dtype)-1], salary_test['target']

In [None]:
import seaborn as sb
sb.pairplot(salary_data, vars=salary_continuous_columns, hue='target')
plt.savefig('figures/salary-pair-plot', dpi=300)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, OneHotEncoder, LabelEncoder, FunctionTransformer

salary_preprocessing_pipeline = Pipeline([('imputer', Imputer(strategy='mean')),
                                          ('cat-to-one-hot', OneHotEncoder(categorical_features=[salary_data.columns.get_loc(c) for c in salary_categorical_columns],
                                                                           n_values=salary_data[salary_categorical_columns].nunique().as_matrix(),
                                                                           handle_unknown='ignore',
                                                                           sparse=False))])
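
To make the one-hot step of this pipeline concrete, here is a minimal NumPy sketch that does not depend on scikit-learn. The helper name `one_hot` and the toy codes are assumptions made for this illustration; codes outside the declared range give an all-zero row, which is roughly the behaviour requested by `handle_unknown='ignore'`.

```python
import numpy as np

def one_hot(codes, n_values):
    """Hypothetical helper: encode integer codes as one-hot rows.

    Codes outside [0, n_values) produce an all-zero row.
    """
    codes = np.asarray(codes)
    out = np.zeros((codes.shape[0], n_values), dtype=float)
    valid = (codes >= 0) & (codes < n_values)
    # Set a single 1 per valid row, in the column given by the code.
    out[np.nonzero(valid)[0], codes[valid]] = 1.0
    return out

# Three known categories (0, 1, 2) and one unseen code (5).
encoded = one_hot([0, 2, 1, 5], n_values=3)
print(encoded)
```

Each categorical column expands into as many binary columns as it has categories, which is why the encoded salary data is much wider than the 14 original features.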

We pass a fixed random state to `train_test_split` so that the split is reproducible and the test set is never touched until the very end.
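
The effect of a fixed seed can be sketched without scikit-learn. The helper `seeded_split` below is hypothetical and only mimics the idea of a seeded shuffle followed by a cut; the 25% test fraction is an assumption chosen for the example.

```python
import numpy as np

def seeded_split(n_samples, test_fraction=0.25, seed=123):
    """Hypothetical helper: shuffle indices with a fixed seed and split them."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    # The first n_test shuffled indices are held out, the rest are for training.
    return indices[n_test:], indices[:n_test]

train_a, test_a = seeded_split(100)
train_b, test_b = seeded_split(100)

# Same seed, same partition: the held-out indices are identical across calls.
print(np.array_equal(test_a, test_b))
```

Because the partition never changes between runs, no example from the held-out part can leak into training or validation when the notebook is re-executed.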

In [None]:
mnist_data = fetch_mldata('mnist-original')
mnist_train_X, mnist_test_X, mnist_train_Y, mnist_test_Y = train_test_split(mnist_data['data'], mnist_data['target'], random_state=123)

# Bayes classifier

In [None]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB

## Salary

To classify the categorical features of the salary dataset, we convert them to one-hot vectors and use an ad hoc naive classifier with a Bernoulli density. We then consider the following probability: $\Pr[c \mid X_{cont}, X_{cat}] = \frac{\Pr[X_{cont}, X_{cat} \mid c]\Pr[c]}{\Pr[X_{cont}, X_{cat}]}$.

Under the naive assumptions $\Pr[X_{cont},X_{cat}] = \Pr[X_{cont}] \Pr[X_{cat}]$ and $\Pr[X_{cont},X_{cat} \mid c] = \Pr[X_{cont} \mid c] \Pr[X_{cat} \mid c]$, taking the logarithm gives:

$\log \Pr[c \mid X_{cont},X_{cat}] = \log \Pr[X_{cont} \mid c] + \log \Pr[X_{cat} \mid c] + \log \Pr[c] - (\log \Pr[X_{cont}] + \log \Pr[X_{cat}])$

Since the final probability combines densities (i.e. continuous) and masses (i.e. discrete), we weight each classifier with a hyperparameter $\lambda$:

$\implies \lambda \log \Pr[X_{cont} \mid c] + (1 - \lambda) \log \Pr[X_{cat} \mid c] + \log \Pr[c] - (\lambda \log \Pr[X_{cont}] + (1 - \lambda) \log \Pr[X_{cat}])$
$\implies \lambda \log \Pr[c \mid X_{cont}] + (1 - \lambda) \log \Pr[c \mid X_{cat}]$
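
As a sanity check on the last step, the following sketch evaluates both weighted expressions on toy numbers and confirms that they coincide. All values (priors, likelihoods, and $\lambda$) are made up for the illustration, and the marginals are computed by summing over the two classes.

```python
import numpy as np

prior = np.array([0.75, 0.25])        # Pr[c], toy values
lik_cont = np.array([0.20, 0.05])     # Pr[X_cont | c], toy values
lik_cat = np.array([0.10, 0.40])      # Pr[X_cat | c], toy values
lam = 0.3                             # weight given to the continuous part

# Marginals Pr[X_cont] and Pr[X_cat], obtained by summing over classes.
ev_cont = np.sum(lik_cont * prior)
ev_cat = np.sum(lik_cat * prior)

# First form: weighted log-likelihoods, log-prior, minus weighted log-marginals.
first = (lam * np.log(lik_cont) + (1 - lam) * np.log(lik_cat) + np.log(prior)
         - (lam * np.log(ev_cont) + (1 - lam) * np.log(ev_cat)))

# Second form: weighted log-posteriors of each classifier taken separately.
post_cont = lik_cont * prior / ev_cont
post_cat = lik_cat * prior / ev_cat
second = lam * np.log(post_cont) + (1 - lam) * np.log(post_cat)

print(np.allclose(first, second))
print(int(np.argmax(second)))  # index of the predicted class
```

The two forms agree because the weights $\lambda$ and $1 - \lambda$ sum to one, so the $\log \Pr[c]$ term is counted exactly once in both.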

# Mixed Bayes

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin
class MixedNB(BaseEstimator, ClassifierMixin):
    """
    Mixed weighted Gaussian and Bernoulli naive Bayes classifier.
    """
    def __init__(self, cont, cat, alpha=1.0, lambda_=0.5):
        self.cont = cont
        self.cat = cat
        self.alpha = alpha
        self.lambda_ = lambda_
        self.gnb = GaussianNB()
        self.bnb = BernoulliNB(alpha=alpha)
    def get_params(self, deep=False):
        return {'cont': self.cont, 'cat': self.cat, 'alpha': self.bnb.alpha, 'lambda_': self.lambda_}
    def set_params(self, **parameters):
        for name, val in parameters.items():
            setattr(self, name, val)
        self.bnb.set_params(alpha=self.alpha)
        return self
    def fit(self, X, y):
        self.gnb.fit(X[:,self.cont], y)
        self.bnb.fit(X[:,self.cat], y)
        return self
    def predict_log_proba(self, X):
        return self.lambda_ * self.gnb.predict_log_proba(X[:,self.cont]) + (1 - self.lambda_) * self.bnb.predict_log_proba(X[:,self.cat])
    def predict_proba(self, X):
        return np.exp(self.predict_log_proba(X))
    def predict(self, X):
        return self.gnb.classes_[np.argmax(self.predict_log_proba(X), axis=1)]
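
One caveat about `predict_proba`: the weighted sum returned by `predict_log_proba` is a convex combination of two log-posteriors, so exponentiating it does not in general give values that sum to 1. This does not affect `predict`, which only takes the argmax. If calibrated-looking probabilities are needed, one possible fix is to renormalize each row, as in the sketch below. The helper `normalize_log_scores` and the toy posteriors are assumptions and are not part of the class above.

```python
import numpy as np

def normalize_log_scores(log_scores):
    """Hypothetical helper: turn unnormalized per-class log-scores into probabilities."""
    log_scores = np.asarray(log_scores, dtype=float)
    # Subtract the row maximum before exponentiating, for numerical stability.
    shifted = log_scores - log_scores.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    return probs / probs.sum(axis=1, keepdims=True)

lam = 0.5
log_p_cont = np.log(np.array([[0.9, 0.1], [0.4, 0.6]]))  # toy Gaussian NB posteriors
log_p_cat = np.log(np.array([[0.3, 0.7], [0.2, 0.8]]))   # toy Bernoulli NB posteriors
combined = lam * log_p_cont + (1 - lam) * log_p_cat

raw = np.exp(combined)
normalized = normalize_log_scores(combined)
print(raw.sum(axis=1))         # rows do not sum to 1
print(normalized.sum(axis=1))  # rows sum to 1 after renormalization
```

Renormalizing rescales each row by a positive constant, so the predicted class is unchanged.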

In [None]:
%%time
mnb_salary_param_grid = {'mixed_nb__lambda_': np.linspace(0, 1), 
                         'mixed_nb__alpha': np.linspace(1, 10)}
mnb_salary = GridSearchCV(Pipeline([('pre', salary_preprocessing_pipeline), 
                                    ('mixed_nb', MixedNB(cont=np.arange(115, 120), cat=np.arange(115)))]), param_grid=mnb_salary_param_grid, scoring='accuracy', n_jobs=16, return_train_score=True)
mnb_salary.fit(salary_train_X, salary_train_Y)

In [None]:
r = pd.DataFrame(mnb_salary.cv_results_)
r = r.groupby('param_mixed_nb__lambda_').apply(lambda x: x.sort_values(by='mean_test_score', ascending=False).head(1))
plt.plot(r.param_mixed_nb__lambda_, 1 - r.mean_train_score, label='Erreur d\'entraînement')
plt.plot(r.param_mixed_nb__lambda_, 1 - r.mean_test_score, label='Erreur de validation')
plt.title('Courbe d\'apprentissage du Bayes mixte sur les données de salaire')
plt.xlabel('Poids de chaque classifieur')
plt.ylabel('Erreur')
plt.legend()
plt.savefig('figures/mixed-naive-bayes-salary-learning-curve-lambda', dpi=300)

In [None]:
r = pd.DataFrame(mnb_salary.cv_results_)
r = r.groupby('param_mixed_nb__alpha').apply(lambda x: x.sort_values(by='mean_test_score', ascending=False).head(1))
plt.plot(r.param_mixed_nb__alpha, 1 - r.mean_train_score, label='Erreur d\'entraînement')
plt.plot(r.param_mixed_nb__alpha, 1 - r.mean_test_score, label='Erreur de validation')
plt.title('Courbe d\'apprentissage du Bayes mixte sur les données de salaire')
plt.xlabel('Lissage laplacien')
plt.ylabel('Erreur')
plt.legend()
plt.savefig('figures/mixed-naive-bayes-salary-learning-curve-alpha', dpi=300)

## MNIST

In [None]:
gnb_mnist = GaussianNB()
cross_val_score(gnb_mnist, mnist_train_X, mnist_train_Y, scoring='accuracy', n_jobs=16).mean()

In [None]:
gnb_mnist.fit(mnist_train_X, mnist_train_Y)

In [None]:
from sklearn.preprocessing import Binarizer
bnb_mnist_param_grid = {'bnb__alpha': range(1, 10)}
bnb_mnist = GridSearchCV(Pipeline([('binarize', Binarizer()), 
                                   ('bnb', BernoulliNB())]), param_grid=bnb_mnist_param_grid, scoring='accuracy', n_jobs=16, return_train_score=True)
bnb_mnist.fit(mnist_train_X, mnist_train_Y)

In [None]:
r = pd.DataFrame(bnb_mnist.cv_results_)
plt.plot(r.param_bnb__alpha, 1 - r.mean_train_score, label='Erreur d\'entraînement')
plt.plot(r.param_bnb__alpha, 1 - r.mean_test_score, label='Erreur de validation')
plt.title('Courbe d\'apprentissage du Bayes à noyau de Bernoulli sur les données de MNIST')
plt.xlabel('Lissage laplacien')
plt.ylabel('Erreur')
plt.legend()
plt.savefig('figures/bernoulli-naive-bayes-mnist-learning-curve-alpha', dpi=300)

# Decision trees

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dtc_salary_param_grid = {'dtc__max_depth': range(1, 20), 
                         'dtc__min_samples_leaf': range(1, 30)}

## Salary

In [None]:
%%time
dtc_salary = GridSearchCV(Pipeline([('pre', salary_preprocessing_pipeline), 
                                    ('dtc', DecisionTreeClassifier())]), param_grid=dtc_salary_param_grid, scoring='accuracy', n_jobs=16, return_train_score=True)
dtc_salary.fit(salary_train_X, salary_train_Y)

In [None]:
r = pd.DataFrame(dtc_salary.cv_results_)
r = r.groupby('param_dtc__max_depth').apply(lambda x: x.sort_values(by='mean_test_score', ascending=False).head(1))
plt.plot(r.param_dtc__max_depth, 1 - r.mean_train_score, label='Erreur d\'entraînement')
plt.plot(r.param_dtc__max_depth, 1 - r.mean_test_score, label='Erreur de validation')
plt.title('Courbe d\'apprentissage des arbres de décisions sur salary')
plt.xlabel('Profondeur maximale')
plt.ylabel('Erreur')
plt.xticks(dtc_salary_param_grid['dtc__max_depth'])
plt.legend()
plt.savefig('figures/decision-tree-salary-learning-curve-max-depth', dpi=300)

In [None]:
r = pd.DataFrame(dtc_salary.cv_results_)
r = r.groupby('param_dtc__min_samples_leaf').apply(lambda x: x.sort_values(by='mean_test_score', ascending=False).head(1))
plt.plot(r.param_dtc__min_samples_leaf, 1 - r.mean_train_score, label='Erreur d\'entraînement')
plt.plot(r.param_dtc__min_samples_leaf, 1 - r.mean_test_score, label='Erreur de validation')
plt.title('Courbe d\'apprentissage des arbres de décisions sur salary')
plt.xlabel('Nombre minimal d\'échantillons aux feuilles')
plt.ylabel('Erreur')
plt.xticks(dtc_salary_param_grid['dtc__min_samples_leaf'])
plt.legend()
plt.savefig('figures/decision-tree-salary-learning-curve-min-samples-leaf', dpi=300)

## MNIST

In [None]:
dtc_mnist_param_grid = {'dtc__max_depth': range(1, 30), 
                        'dtc__min_samples_leaf': range(1, 10)}

In [None]:
%%time
dtc_mnist = GridSearchCV(Pipeline([('dtc', DecisionTreeClassifier())]), param_grid=dtc_mnist_param_grid, scoring='accuracy', n_jobs=16, return_train_score=True)
dtc_mnist.fit(mnist_train_X, mnist_train_Y)

In [None]:
r = pd.DataFrame(dtc_mnist.cv_results_)
r = r.groupby('param_dtc__max_depth').apply(lambda x: x.sort_values(by='mean_test_score', ascending=False).head(1))
plt.plot(r.param_dtc__max_depth, 1 - r.mean_train_score, label='Erreur d\'entraînement')
plt.plot(r.param_dtc__max_depth, 1 - r.mean_test_score, label='Erreur de validation')
plt.title('Courbe d\'apprentissage des arbres de décisions sur MNIST')
plt.xlabel('Profondeur maximale')
plt.ylabel('Erreur')
plt.xticks(dtc_mnist_param_grid['dtc__max_depth'])
plt.legend()
plt.savefig('figures/decision-tree-mnist-learning-curve-max-depth', dpi=300)

In [None]:
r = pd.DataFrame(dtc_mnist.cv_results_)
r = r.groupby('param_dtc__min_samples_leaf').apply(lambda x: x.sort_values(by='mean_test_score', ascending=False).head(1))
plt.plot(r.param_dtc__min_samples_leaf, 1 - r.mean_train_score, label='Erreur d\'entraînement')
plt.plot(r.param_dtc__min_samples_leaf, 1 - r.mean_test_score, label='Erreur de validation')
plt.title('Courbe d\'apprentissage des arbres de décisions sur MNIST')
plt.xlabel('Nombre minimal d\'échantillons aux feuilles')
plt.ylabel('Erreur')
plt.xticks(dtc_mnist_param_grid['dtc__min_samples_leaf'])
plt.legend()
plt.savefig('figures/decision-tree-mnist-learning-curve-min-samples-leaf', dpi=300)

# Multilayer perceptron

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.losses import categorical_crossentropy, binary_crossentropy
from keras.utils import to_categorical
from keras.optimizers import SGD, Adagrad, Adadelta
from keras.regularizers import l1_l2, l1, l2
from keras.wrappers.scikit_learn import KerasClassifier

## Salary

In [None]:
salary_mlp = Sequential()
salary_mlp.add(Dense(units=500, activation='relu', input_dim=120))
salary_mlp.add(Dense(units=2, activation='softmax'))
salary_mlp.compile(loss=categorical_crossentropy, optimizer=Adadelta(), metrics=['accuracy'])

In [None]:
from sklearn.utils.class_weight import compute_class_weight
salary_mlp_history = salary_mlp.fit(salary_preprocessing_pipeline.fit_transform(salary_train_X), 
                                    to_categorical(salary_train_Y), 
                                    validation_split=0.33, batch_size=64, epochs=50)

In [None]:
plt.suptitle('Courbe d\'apprentissage du perceptron multi-couche sur les données de salaire')
plt.title('Une couche cachée de 500 neurones')
plt.plot(1 - np.array(salary_mlp_history.history['acc']), label='Erreur d\'entraînement')
plt.plot(1 - np.array(salary_mlp_history.history['val_acc']), label='Erreur de validation')
plt.xlabel('Époque')
plt.ylabel('Erreur')
plt.legend()
plt.savefig('figures/multilayer-perceptron-salary-learning-curve-epoch', dpi=300)

## MNIST

In [None]:
mnist_mlp = Sequential()
mnist_mlp.add(Dense(units=512, activation='relu', input_dim=784))
mnist_mlp.add(Dropout(0.1))
mnist_mlp.add(Dense(units=10, activation='softmax'))
mnist_mlp.compile(loss=categorical_crossentropy, optimizer=Adadelta(), metrics=['accuracy'])

In [None]:
mnist_mlp_history = mnist_mlp.fit(mnist_train_X, to_categorical(mnist_train_Y), validation_split=0.33, epochs=50, batch_size=128)

In [None]:
plt.suptitle('Courbe d\'apprentissage du perceptron multi-couche sur MNIST')
plt.title('Une couche cachée de 512 neurones et 0.1 dropout')
plt.plot(1 - np.array(mnist_mlp_history.history['acc']), label='Erreur d\'entraînement')
plt.plot(1 - np.array(mnist_mlp_history.history['val_acc']), label='Erreur de validation')
plt.xlabel('Époque')
plt.ylabel('Erreur')
plt.legend()
plt.savefig('figures/multilayer-perceptron-mnist-learning-curve-epoch', dpi=300)

# Convolutional neural network

In [None]:
from keras.layers import Conv2D, MaxPooling2D, Flatten, Reshape

In [None]:
mnist_cnn = Sequential()
mnist_cnn.add(Reshape((28,28,1), input_shape=(784,)))
mnist_cnn.add(Conv2D(32, kernel_size=(3, 3), activation='relu'))
mnist_cnn.add(Conv2D(64, (3, 3), activation='relu'))
mnist_cnn.add(MaxPooling2D(pool_size=(2, 2)))
mnist_cnn.add(Flatten())
mnist_cnn.add(Dropout(0.3))
mnist_cnn.add(Dense(128, activation='relu'))
mnist_cnn.add(Dropout(0.3))
mnist_cnn.add(Dense(10, activation='softmax'))
mnist_cnn.compile(loss=categorical_crossentropy,
                  optimizer=Adadelta(),
                  metrics=['accuracy'])

In [None]:
mnist_cnn_history = mnist_cnn.fit(mnist_train_X, to_categorical(mnist_train_Y), validation_split=0.33, batch_size=128, epochs=20)

In [None]:
plt.title('Courbe d\'apprentissage du réseau de neurones convolutif sur MNIST\n'
            'Convolution 3x3 de 32 features, convolution 3x3 de 64 features,\npooling, dropout 0.3, 128 neurones cachés et dropout 0.3')
plt.plot(1 - np.array(mnist_cnn_history.history['acc']), label='Erreur d\'entraînement')
plt.plot(1 - np.array(mnist_cnn_history.history['val_acc']), label='Erreur de validation')
plt.xlabel('Époque')
plt.ylabel('Erreur')
plt.legend()
plt.savefig('figures/convolutional-neural-network-mnist-learning-curve-epoch', dpi=300)

# Tests

This section contains the code for the final tests, which were run at the very end, independently of the validation process, to get the best possible estimate of each model's generalization performance.
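
For reference, the accuracy reported in these tests is simply the fraction of held-out examples that are classified correctly. A minimal pure-Python sketch, with made-up labels and a hypothetical `accuracy` helper, looks like this:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

held_out_labels = [0, 1, 1, 0, 1]   # toy ground truth
predictions = [0, 1, 0, 0, 1]       # toy model output

test_accuracy = accuracy(held_out_labels, predictions)
test_error = 1 - test_accuracy      # the quantity plotted in the learning curves
print(test_accuracy, test_error)
```

The learning curves earlier in the notebook plot one minus this quantity, measured on the training and validation folds rather than on the test set.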

## Bayesian classifiers

In [None]:
accuracy_score(salary_test_Y, mnb_salary.predict(salary_test_X))

In [None]:
accuracy_score(mnist_test_Y, gnb_mnist.predict(mnist_test_X))

In [None]:
accuracy_score(mnist_test_Y, bnb_mnist.predict(mnist_test_X))

## Decision trees

In [None]:
accuracy_score(salary_test_Y, dtc_salary.predict(salary_test_X))

In [None]:
accuracy_score(mnist_test_Y, dtc_mnist.predict(mnist_test_X))

## Neural networks

In [None]:
salary_mlp.evaluate(salary_preprocessing_pipeline.transform(salary_test_X), to_categorical(salary_test_Y))

In [None]:
mnist_mlp.evaluate(mnist_test_X, to_categorical(mnist_test_Y))

In [None]:
mnist_cnn.evaluate(mnist_test_X, to_categorical(mnist_test_Y))