# Active Model Selection

The idea of active model selection is the process of generating data points and asking the user to label them in order to help the user select a model among a set of _k_ candidate models.  An automated machine learning process can often provide a set of candidate models to solve a domain scientists needs.  However, there are several issues with expecting the domain scientist to simply take the top-ranked candidate model and using it in production, or training it further.  First, the SME has to trust that the autoML process actually executed correctly and did something intelligent.  While we know that these algorithms can and should work, there is no proof of correctness, and we are well aware that there can be errors resulting in nonsensical models.  Second, even if the autoML process worked correctly, there could have been some bias present in the training set.  To remedy these issues, the SME needs to interact with each model to get a feel for how they make predictions on domain data.

Previous efforts towards model selection, such as TreePOD, show the predictions of the candidate models on the training or validation parts of the dataset.  However, that might not solve the issue of revealing bias in the training process; in fact, it may reinforce it, since model selection is occurring on the same source of data.  SMEs expect that resulting models will be robust, and that they agree with their intuition.  

Active Model Selection aids in model selection by generating data points and asking the SME to provide labels for those data points.  Then, the SME is shown the set of models that agree with their own judgment.  By comparing their predictions, they are able to determine which models agree with their judgments, as well as gaining trust in the resulting models.

## Algorithm

Assuming that there are _n_ data points, {x_i} = D, with _m_ features, and _k_ potential models.  

    for i in 0, ...
        d_i = generate_data_point(D \cup {d_0, ..., d_i})
        y_i = active_query(d_i)
        
Our contribution is a novel way of generating data points.  We can compare it to randomly generated data points, as well as some other naive methods of generating data points.  

Our method is to build a random variable _h_ representing the __entropy__ of the set of predictions of the available models which has a gaussian process prior, based on the entropy of the predictions on the training set and all labeled points.  

This method is reminiscent of interpretability methods which make approximations to complex models to explain them.  Gaussian processes could be used to make approximations of what a model would predict.  However, we aren't using a GP to approximate a model, we are using it to predict which spaces of the data are most disagreed upon.

In [33]:
# Loading data
TRAINING_SET_CT = 50
VALIDATION_SET_CT = 20
NROWS = TRAINING_SET_CT + VALIDATION_SET_CT

import pandas as pd
import numpy as np

training_df = pd.read_csv('adult_data/adult.csv', sep='\s*,\s*', nrows=NROWS, usecols=['education-num', 'capital-gain', 'capital-loss', 'hours-per-week', 'age', 'label'])

training_labels = training_df['label'][:TRAINING_SET_CT]
training_predictors = training_df.drop(columns=('label'))[0:TRAINING_SET_CT]

val_labels = training_df['label'][TRAINING_SET_CT:]
val_predictors = training_df.drop(columns=('label'))[TRAINING_SET_CT:]


# training_df




In [34]:
# Build random models
model_dict = {}

# We build 10 models each of:
#  - logistic regression
#  - SVM
#  - kNN
#  - Decision Tree
#  - Naive Bayes

# Logistic regression
# https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic_l1_l2_sparsity.html#sphx-glr-auto-examples-linear-model-plot-logistic-l1-l2-sparsity-py
from sklearn.linear_model import LogisticRegression
for i, C in enumerate(np.logspace(0, 1, num=5)):
    # turn down tolerance for short training time
    clf_l1_LR = LogisticRegression(C=C, penalty='l1', tol=0.01)
    clf_l2_LR = LogisticRegression(C=C, penalty='l2', tol=0.01)
    clf_l1_LR.fit(training_predictors, training_labels)
    clf_l2_LR.fit(training_predictors, training_labels)

    coef_l1_LR = clf_l1_LR.coef_.ravel()
    coef_l2_LR = clf_l2_LR.coef_.ravel()

    # coef_l1_LR contains zeros due to the
    # L1 sparsity inducing norm

    sparsity_l1_LR = np.mean(coef_l1_LR == 0) * 100
    sparsity_l2_LR = np.mean(coef_l2_LR == 0) * 100

#     print("C=%.2f" % C)
#     print("Sparsity with L1 penalty: %.2f%%" % sparsity_l1_LR)
#     print("score with L1 penalty: %.4f" % clf_l1_LR.score(predictors, labels))
#     print("Sparsity with L2 penalty: %.2f%%" % sparsity_l2_LR)
#     print("score with L2 penalty: %.4f" % clf_l2_LR.score(predictors, labels))
    
    model_dict["logreg_c%.2f_l1" % C] = clf_l1_LR
    model_dict["logreg_c%.2f_l2" % C] = clf_l2_LR

#  SVM
# https://scikit-learn.org/stable/auto_examples/svm/plot_iris.html#sphx-glr-auto-examples-svm-plot-iris-py
from sklearn import svm
for i, C in enumerate(np.logspace(0, 1, num=5)):
    m1 = svm.SVC(kernel='linear', C=C)
    m2 = svm.SVC(kernel='rbf', gamma=0.7, C=C)
    
    m1.fit(training_predictors, training_labels)
    m2.fit(training_predictors, training_labels)
    print("m1 score", m1.score(training_predictors, training_labels))
    print("m2 score", m2.score(training_predictors, training_labels))

    model_dict["svm_c%.2f_linear" % C] = m1
    model_dict["svm_c%.2f_rbf" % C] = m2
    
# kNN
# https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py
from sklearn import neighbors
for i in range(5):
    k = 2 + (2 * i)
    uniform_knn = neighbors.KNeighborsClassifier(k, weights='uniform')
    distance_knn = neighbors.KNeighborsClassifier(k, weights='distance')
    uniform_knn.fit(training_predictors, training_labels)
    distance_knn.fit(training_predictors, training_labels)
    print("uniform_knn score", uniform_knn.score(training_predictors, training_labels))
    print("distance_knn score", distance_knn.score(training_predictors, training_labels))

    model_dict["knn_k%.2f_uniform" % k] = uniform_knn
    model_dict["knn_k%.2f_distance" % k] = distance_knn
    
# Decision Tree
# https://scikit-learn.org/stable/auto_examples/tree/plot_iris.html#sphx-glr-auto-examples-tree-plot-iris-py
from sklearn import tree
for i in range(5):
    max_depth = 2 + i
    gini_dt = tree.DecisionTreeClassifier(criterion='gini', max_depth=max_depth)
    entropy_dt = tree.DecisionTreeClassifier(criterion='entropy', max_depth=max_depth)

    gini_dt.fit(training_predictors, training_labels)
    entropy_dt.fit(training_predictors, training_labels)
    print("gini_dt score", gini_dt.score(training_predictors, training_labels))
    print("entropy_dt score", entropy_dt.score(training_predictors, training_labels))

    model_dict["dt_d%.2f_gini" % max_depth] = gini_dt
    model_dict["dt_d%.2f_entropy" % max_depth] = entropy_dt

# Naive Bayes
# https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes
from sklearn import naive_bayes
for i, smoothing in enumerate(np.logspace(1e-10, 1e-7, num=5)):
    nb_1 = naive_bayes.GaussianNB(priors=(0.8, 0.2))
    nb_2 = naive_bayes.GaussianNB(priors=(0.2, 0.8))

    nb_1.fit(training_predictors, training_labels)
    nb_2.fit(training_predictors, training_labels)
    print("nb_1 score", nb_1.score(training_predictors, training_labels))
    print("nb_2 score", nb_2.score(training_predictors, training_labels))

    model_dict["nb1_s%.2f" % smoothing] = nb_1
    model_dict["nb2_s%.2f" % smoothing] = nb_2



m1 score 0.86
m2 score 0.98
m1 score 0.86
m2 score 1.0
m1 score 0.86
m2 score 1.0
m1 score 0.86
m2 score 1.0
m1 score 0.86
m2 score 1.0
uniform_knn score 0.84
distance_knn score 1.0
uniform_knn score 0.78
distance_knn score 1.0
uniform_knn score 0.78
distance_knn score 1.0
uniform_knn score 0.78
distance_knn score 1.0
uniform_knn score 0.78
distance_knn score 1.0
gini_dt score 0.78
entropy_dt score 0.78
gini_dt score 0.86
entropy_dt score 0.84
gini_dt score 0.9
entropy_dt score 0.9
gini_dt score 0.92
entropy_dt score 0.9
gini_dt score 0.94
entropy_dt score 0.92
nb_1 score 0.34
nb_2 score 0.28
nb_1 score 0.34
nb_2 score 0.28
nb_1 score 0.34
nb_2 score 0.28
nb_1 score 0.34
nb_2 score 0.28
nb_1 score 0.34
nb_2 score 0.28


In [45]:
# calculate entropies
from scipy import stats
from collections import Counter

predictions = {}
for name, m in model_dict.items():
    predictions[name] = m.predict(val_predictors)

predictions_df = pd.DataFrame(predictions)
predictions_df
entropies = predictions_df.apply(lambda r: stats.entropy([x for x in Counter(r).values()], base=2), axis=1)
entropies


0     0.286397
1     0.286397
2     0.909736
3     0.998196
4     0.992774
5     0.609840
6     0.168661
7     0.609840
8     0.286397
9     0.998196
10    0.954434
11    0.168661
12    0.286397
13    0.983708
14    0.286397
15    0.286397
16    0.286397
17    0.286397
18    0.848548
19    0.286397
dtype: float64

In [46]:
# Next, we build the GP estimator for entropy over the whole space.
from sklearn import gaussian_process
gp = gaussian_process.GaussianProcessRegressor()
gp.fit(val_predictors, entropies)

gp

GaussianProcessRegressor(alpha=1e-10, copy_X_train=True, kernel=None,
             n_restarts_optimizer=0, normalize_y=False,
             optimizer='fmin_l_bfgs_b', random_state=None)