![Machine Learning Workshop: Content Insights 2020](assets/mlci_banner.jpg)

# Machine Learning Workshop: Content Insights 2020

Welcome to the workshop notebooks!  These notebooks are designed to walk you through the steps of creating a model, refining it with user labels, and testing it on content.  You can access the main [workshop forum page](https://INFO_SITE/forums/html/forum?id=241a0b77-7aa6-4fef-9f25-5ea351825725&ps=25), the [workshop files repo](https://INFO_SITE/communities/service/html/communityview?communityUuid=fb400868-b17c-44d8-8b63-b445d26a0be4#fullpageWidgetId=W403a0d6f86de_45aa_8b67_c52cf90fca16&folder=d8138bef-9182-4bdc-8b12-3c88158a219c), or the [symposium home page](https://software.web.DOMAIN) for additional help.

The notebooks are divided into five core components: (A) setup & data, (B) model exploration, (C) labeling, (D) active labeling, and (E) deployment.  You are currently viewing the *active labeling* workbook.

In [156]:
# constants for running the workshop; we'll repeat these in the top line of each workbook.
#   why repeat them? the backup routine only serializes .ipynb files, so others will need 
#   to be downloaded again if your compute instance restarts (a small price to pay, right?)

WORKSHOP_BASE = "https://vmlr-workshop.STORAGE"
AGG_AVFEAT = "models/agg_avfeature.pkl.gz"             # custom file for merged audio and video features
CLASS_LABELS_FLAT = "assets/labels_final.json"  # provided file for label info
CLASS_DEFINITIONS = "assets/classes.json"       # provided file for class info
LEARNING_PERF_RERANK = "assets/learning_rerank.pkl.gz"
LEARNING_PERF_QUERY = "assets/learning_query.pkl.gz"

# Notebook D: Active Labeling Analysis

Now that we've reviewed **how** you can solicit labels and update the order of that solicitation, let's analyze the implications of solicitation reordering.  The last notebook focused on interacting with the labeling interface, so here we'll just use offline labels and simulate labeler entries.  Additionally, this notebook will focus on the training of a custom classifier instead of reusing other tags.

In this notebook, we evaluate a few critical questions.

1. Does reranking unlabeled instances (e.g. online learning) help to improve efficiency?
1. What strategies for ordering results can improve labeling efficiency?
1. What consensus measures should be taken for multiple labels?
1. Are there trends in performance curves that can point to a moment of model stability?

If you're really curious about the space, this overview paper, [A Wholistic View of Continual Learning with Deep Neural Networks:  Forgotten Lessons and the Bridge to Active and Open World Learning](https://arxiv.org/abs/2009.01797), gives a great (and dizzying) review of active learning topics.

![Machine Learning Workshop: Content Insights 2020](assets/active_overview.jpg)


## Modeling Basics
The cell below provides the basic training functions that are used throughout the notebook.  It is derived from the av-feature training method (classifier 3) evaluated in notebook B.

In [55]:
from sklearn import metrics
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from pathlib import Path

# define scoring functions
def classifier_score(df_prediction, df_labels, class_name):
    """Function to provide metric outputs for the evaluation of a prediction dataframe.
    
    Parameters:
        df_prediction (DataFrame): dataframe containing 'asset' and 'score' as columns
        df_labels (DataFrame): dataframe containing 'asset' and 'class' for labels
        class_name (str): class name for evaluation against labels

    Returns:
        dict of metrics (AUC, AP, precision, recall) ({"ap":X, "class":Y, ...}) and joined dataframe
    """
    metrics_obj = {"class":class_name}
    
    # clean up input labels, prune to relevant class
    df_join = df_prediction
    if "class" not in df_prediction.columns:
        df_labels = df_labels[df_labels["class"] == class_name].drop(columns=["etag", "url"]) 
        # join labels and scores by asset, normalize score to float
        df_join = df_prediction.set_index('asset').join(df_labels.set_index('asset'), how="left").fillna(0)  # join at asset level, 0 for nonscoring
        df_join["class"] = df_join["class"].apply(lambda x: 1 if x != 0 else 0).astype(int)
        df_join = df_join.reset_index().sort_values("score", ascending=False)

    # print(f"{class_name}: Found {len(df_join)} samples from {len(df_labels)} labels and {len(df_prediction)} scores.")

    def thresh(x):
        return 1 if x >= 0.5 else 0
    
    metrics_obj["AP"] = metrics.average_precision_score(df_join['class'], df_join['score'])
    fpr, tpr, thresholds = metrics.roc_curve(df_join['class'], df_join['score'])
    metrics_obj["AUC"] = metrics.auc(fpr, tpr)
    metrics_obj["Accuracy"] = metrics.accuracy_score(df_join['class'], df_join['score'].apply(thresh))
    _, metrics_obj["Recall"], metrics_obj["F1"], _ = metrics.precision_recall_fscore_support(
        df_join['class'], df_join['score'].apply(thresh), average='macro', warn_for=())
    #metrics_obj["Recall"] = metrics.recall_score(df_join['class'], df_join['score'].apply(thresh))
    #metrics_obj["F1"] = metrics.f1_score(df_join['class'], df_join['score'].apply(thresh))
    # print(f"{class_name}: {metrics_obj}")
        
    # return our computation!
    return metrics_obj, df_join

def classifier_plot(metrics_obj, df_scored):
    fpr, tpr, thresholds = metrics.roc_curve(df_scored['class'], df_scored['score'])
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

    lw = 2
    ax1.plot(fpr, tpr, color='darkorange',
             lw=lw, label=f"AUC curve (area={metrics_obj['AUC']:0.2})")
    ax1.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    ax1.set_xlim([0.0, 1.0])
    ax1.set_ylim([0.0, 1.05])
    ax1.set_xlabel('False Positive Rate')
    ax1.set_ylabel('True Positive Rate')
    ax1.legend(loc="lower right")

    precision, recall, thresholds = metrics.precision_recall_curve(df_scored['class'], df_scored['score'])
    ax2.plot(recall, precision, color='red',
             lw=lw, label=f"PR Curve (AP={metrics_obj['AP']:0.2}, F1={metrics_obj['F1']:0.2})")
    ax2.plot([1, 0], [0, 1], color='navy', lw=lw, linestyle='--')
    ax2.set_xlim([0.0, 1.0])
    ax2.set_ylim([0.0, 1.05])
    ax2.set_ylabel('Precision')
    ax2.set_xlabel('Recall')
    ax2.legend(loc="upper right")
    plt.show()
    
# read label data for use later!
df_labels = pd.read_json(CLASS_LABELS_FLAT).explode('labels').fillna('none of the above')
df_labels.rename(columns={"data":"url", "labels":"class"}, inplace=True)
df_labels["asset"] = df_labels['url'].replace(regex={r'^' + WORKSHOP_BASE + '/': ''})
print(f"Loaded a total of {len(df_labels)} labels across {len(df_labels['asset'].unique())} samples and {len(df_labels['class'].unique())} classes.")

# clear out other performance stores
df_performance = None

# load features
path_features = Path(AGG_AVFEAT)
if not path_features.exists():
    raise Exception(f"""
       Sorry, the set of aggregate features was not found.  
       Please return to notebook B to create file '{str(path_features)}'...
    """)
df_avfeature = pd.read_pickle(str(path_features))
print(f"Loaded a total of {len(df_avfeature)} samples.")

df_classes = pd.read_json(CLASS_DEFINITIONS)
display(df_classes)

Loaded a total of 1525 labels across 1000 samples and 6 classes.
Loaded a total of 1033 samples.


| | class | definition | primary |
|---|---|---|---|
| 0 | holiday | holiday scenes or objects like decorated trees... | 1 |
| 1 | halloween | halloween scenes where one or more characters ... | 1 |
| 2 | gift giving | scenes of gift giving, receiving, or opening/u... | 1 |
| 3 | family moments | at least two people on screen, typically famil... | 0 |
| 4 | shopping scenes | one or more primary actors in a store-like env... | 0 |
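
As a quick sanity check of the metrics that `classifier_score` reports, the sketch below computes AP, AUC, and thresholded accuracy on a tiny hand-made example. The labels and scores are illustrative values only (not workshop data), and the 0.5 threshold mirrors the `thresh` helper in the cell above.

```python
import numpy as np
from sklearn import metrics

# toy ground truth and scores (illustrative values only, not workshop data)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ranking-based metrics use the raw scores
ap = metrics.average_precision_score(y_true, y_score)
fpr, tpr, _ = metrics.roc_curve(y_true, y_score)
auc = metrics.auc(fpr, tpr)

# accuracy needs hard decisions, so threshold at 0.5 first
y_pred = (y_score >= 0.5).astype(int)
accuracy = metrics.accuracy_score(y_true, y_pred)

print(f"AP={ap:.3f}, AUC={auc:.3f}, Accuracy={accuracy:.3f}")
```

AP and AUC depend only on how the scores rank the samples, while accuracy also depends on the chosen threshold, which is one reason the curves later in this notebook can disagree across metrics.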


### Shortcut to Learning
Disable the below cell (change `True` to `False`) to perform a full re-run of simulations.  Yes, it's a little bit of a cheat, but the numbers should be the same wherever they are run (*we use a constant random number generator seed*), so there's no shame in taking this shortcut.

In [None]:
df_perf_rerank = None  # variables that are defined later
df_query_sizes = None

if True:
    df_perf_rerank = pd.read_pickle(LEARNING_PERF_RERANK)
    df_query_sizes = pd.read_pickle(LEARNING_PERF_QUERY)
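
The shortcut above is a simple load-or-compute cache. A minimal sketch of the same idea as a reusable helper follows; `load_or_compute` and its `use_cache` flag are hypothetical names introduced for illustration and are not part of the workshop code.

```python
from pathlib import Path
import pandas as pd

def load_or_compute(path, compute_fn, use_cache=True):
    """Return the DataFrame cached at `path` if allowed and present; otherwise compute and store it."""
    path = Path(path)
    if use_cache and path.exists():
        return pd.read_pickle(path)       # compression is inferred from the file extension
    df = compute_fn()                     # the expensive simulation
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(path)
    return df
```

Under these assumptions, a call such as `load_or_compute(LEARNING_PERF_RERANK, classify_av)` would play the role of the `if True:` switch, with `use_cache=False` forcing a full re-run.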


# Reranking to Improve Efficiency
One major advantage of active learning is shortening the labeling time needed to reach a highly performant model.  We'll explore three families of strategies on this dataset to demonstrate the power of active learning.  For a more in-depth read (and more in-depth methods), we suggest consulting the paper mentioned at the top of this notebook.

1. **Random** query - random sample of remaining items
2. Most outlying, dissimilar tasks first - learn a classifier and pick from the most confusing (e.g. on the decision boundary)
    1. **Outlier** - scores that are the most divergent from their label
    2. **Anomalies** - look for singleton outliers to target first
    3. **Entropy** - compute the entropy of the training set and look for highest entropy additions
    4. **Local Outliers** - using a kNN approach, find the most disconnected samples
3. Most inlying, similar tasks first - learn a classifier and pick from the top scoring results
    1. **Inlier** - scores that agree the most with their label (positive or negative)
    2. **Positive** - sample from the high scoring items of the unlabeled (may lead to lower recall)
    3. **Negative** - sample from the low scoring items of the unlabeled (may lead to better diversity)
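
To make the ordering idea concrete, here is a minimal sketch of how a few of these strategies turn classifier scores into a labeling order. It is not the notebook's implementation: `rerank_unlabeled` is a hypothetical helper, and the `boundary` strategy (distance to a 0.5 decision threshold) is an assumption used as a label-free stand-in for "most confusing".

```python
import numpy as np

def rerank_unlabeled(scores, strategy, rng=None):
    """Return indices into `scores` in the order items should be sent for labeling."""
    scores = np.asarray(scores, dtype=float)
    if strategy == "random":        # baseline: shuffle the remaining items
        rng = rng if rng is not None else np.random.RandomState(0)
        return rng.permutation(len(scores))
    if strategy == "positive":      # highest scores first
        return np.argsort(-scores, kind="stable")
    if strategy == "negative":      # lowest scores first
        return np.argsort(scores, kind="stable")
    if strategy == "boundary":      # closest to the 0.5 decision boundary first
        return np.argsort(np.abs(scores - 0.5), kind="stable")
    raise ValueError(f"unknown strategy: {strategy}")

# made-up scores for four unlabeled items
scores = [0.95, 0.10, 0.55, 0.40]
for name in ("positive", "negative", "boundary"):
    print(name, rerank_unlabeled(scores, name))
```

Each strategy only changes the order of the unlabeled pool; the next query then takes the first few items from that reordered pool.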



In [221]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.calibration import CalibratedClassifierCV
from sklearn.exceptions import UndefinedMetricWarning
from sklearn.semi_supervised import LabelSpreading
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
from scipy import stats
import numpy as np
import warnings

def classify_av(methods=None, label_increment=5, query_sizes=None):
    global df_labels
    cv_folds = 5
    feature = "combined"
    score_calibrate = False
    cv_jobs = -1  # -1 is auto, otherwise specific number
    if not methods:
        methods = ['anomaly', 'entropy', 'random', 'outlier', 'inlier', 'positive', 'negative'] 
        #, 'localoutlier'] - almost always similar (but worse) to anomaly
    
    label_len = label_increment
    total_len = len(df_labels["asset"].unique())
    if not query_sizes:   # should we compute label size?
        query_sizes = []   # fresh list per call (avoids a shared mutable default)
        total_len_eval = total_len
        while total_len_eval > 0:   # consume chunks of labels: 5, 5, 10, 10, 20, 20, 40, 40, ...
            query_sizes.append(label_len)
            total_len_eval -= label_len
            query_sizes.append(label_len)
            total_len_eval -= label_len
            label_len *= 2
    print(f"Query Sizes: {query_sizes}")

    list_res_intermediate = []
    for idx in range(len(df_classes)):  # iterate classes for evaluation
        row = df_classes.iloc[idx]
        
        df_label_sub = df_labels[df_labels["class"]==row["class"]]  # subselect for this class
        df_feat = df_avfeature.set_index("asset").copy()   # get slice of right features
        df_feat = df_feat.join(df_label_sub.set_index("asset"), how="left").fillna(0)  # join with labels
        df_feat["class"] = df_feat["class"].apply(lambda x: 1 if x != 0 else 0).astype(int)  # blank out text

        idx_asset = df_feat.index
        idx_numeric = list(range(len(idx_asset)))  # pull index from features
        rng = np.random.RandomState(0)
        
        for n_fold in range(cv_folds):   # our cross-validation folds
            rng.shuffle(idx_numeric)

            X_feat = np.vstack(df_feat[feature])
            y_label = df_feat["class"].values
            #print("INITIAL", len(idx_asset), len(X_feat), X_feat.shape, y_label.shape)

            for method_act in methods:   # iterate through all sampling methods
                idx_seen = []
                idx_unseen = idx_numeric.copy()
                
                for query_idx in range(len(query_sizes)):   # how many labels can we see in this round?
                    query_len = query_sizes[query_idx]
                    if query_len > len(idx_unseen):   # query exceeds the remaining samples, stop here
                        break
                    idx_seen += idx_unseen[:query_len]   # add to labels we can see
                    idx_unseen = idx_unseen[query_len:]  # remove from available set                    
                    
                    model = LogisticRegression()  # basic logistic regression
                    if score_calibrate:   # try to re-calibrate outputs for better threshold?
                        model = CalibratedClassifierCV(model, method="sigmoid")
                    
                    X_train = X_feat[idx_seen, :]
                    y_train = y_label[idx_seen]
                    if len(np.unique(y_train)) < 2:
                        print(f"Warning, insufficient class diversity with {len(y_train)} samples..")
                        continue
                    model.fit(X_train, y_train)   # retrain on this segment
                    
                    X_test = X_feat[idx_unseen,:]
                    probs = model.predict_proba(X_test)   # predict for all samples
                    scores = probs[:,1]
                    y_test = y_label[idx_unseen]

                    
                    # last time we trained over everything via cross-fold
                    scores_train = None
                    if False:   # doubles training time at least!
                        try:
                            probs_train = cross_val_predict(model, X_train, y_train, cv=2, 
                                                          n_jobs=cv_jobs, method='predict_proba')
                            scores_train = probs_train[:,1]
                        except ValueError as e:
                            pass

                    if method_act == "inlier":   # inlier is most confident (most similar to label)
                        delta_val = np.absolute(scores - y_test)
                        idx_resort = np.argsort(-delta_val)
                        idx_unseen[:] = [idx_unseen[i] for i in idx_resort]  # reorder by priority
                        
                    elif method_act == "outlier":   # outlier is least confident (divergent from label)
                        delta_val = np.absolute(scores - y_test)
                        idx_resort = np.argsort(delta_val)
                        #print(idx_resort, len(idx_resort), idx_resort.min(), idx_resort.max())
                        #print(len(idx_unseen))
                        idx_unseen[:] = [idx_unseen[i] for i in idx_resort]  # reorder by priority
                        
                    elif method_act == "positive":   # grab the top of the ranked list
                        idx_resort = np.argsort(-scores)
                        idx_unseen[:] = [idx_unseen[i] for i in idx_resort]  # reorder by priority

                    elif method_act == "negative":   # grab the bottom of the ranked list (lowest scores first)
                        idx_resort = np.argsort(scores)
                        idx_unseen[:] = [idx_unseen[i] for i in idx_resort]  # reorder by priority    
                        
                    elif method_act == "anomaly":   # find outliers from the unlabeled set
                        # See this page for more details...
                        #.  https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html#sklearn.ensemble.IsolationForest.score_samples
                        outlier_model = IsolationForest(random_state=0).fit(X_train)
                        anomaly_score = outlier_model.score_samples(X_test)
                        idx_resort = np.argsort(anomaly_score)
                        idx_unseen[:] = [idx_unseen[i] for i in idx_resort]  # reorder by priority

                    elif method_act == "localoutlier":   # find outliers from the unlabeled set
                        # See this page for more details...
                        #.  https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html#sklearn.neighbors.LocalOutlierFactor
                        # Similar, but almost always worse than `anomaly`
                        outlier_model = LocalOutlierFactor(n_neighbors=5, novelty=True).fit(X_train)
                        anomaly_score = outlier_model.score_samples(X_test)
                        idx_resort = np.argsort(anomaly_score)
                        idx_unseen[:] = [idx_unseen[i] for i in idx_resort]  # reorder by priority    
                        
                    elif method_act == "entropy":   # look for least similar in feature space
                        # NOTE: This code block is copied from scikit documentation
                        # https://scikit-learn.org/stable/auto_examples/semi_supervised/plot_label_propagation_digits_active_learning.html?highlight=active
                        lp_model = LabelSpreading(gamma=0.25, max_iter=20)
                        y_unknown = np.copy(y_label)
                        y_unknown[idx_unseen] = -1   # add new "unknown" class
                        lp_model.fit(X_feat, y_unknown)
                        
                        # compute the entropies of transduced label distributions
                        pred_entropies = stats.distributions.entropy(
                            lp_model.label_distributions_.T)
                        pred_entropies_sub = pred_entropies[idx_unseen]
                        #print(pred_entropies, pred_entropies.shape)

                        # rank the unseen samples from most to least uncertain (highest entropy first)
                        uncertainty_index = np.argsort(pred_entropies_sub)[::-1]
                        #print(len(idx_unseen), uncertainty_index.shape, pred_entropies_sub.shape)
                        idx_unseen[:] = [idx_unseen[i] for i in uncertainty_index]  # reorder by priority
                        #uncertainty_index = uncertainty_index[
                        #    np.in1d(uncertainty_index, unlabeled_indices)][:5]

                    with warnings.catch_warnings():
                        # NOTE: we get some nasty warnings because sampling can't always guarantee classes
                        warnings.simplefilter("ignore", category=RuntimeWarning)
                        warnings.simplefilter("ignore", category=UndefinedMetricWarning)
                        warnings.simplefilter("ignore", category=UserWarning)
                        warnings.simplefilter("ignore", category=FutureWarning)
                        
                        metrics_obj, _ = classifier_score(
                            pd.DataFrame({"score":scores, "class":y_test}), df_feat["class"], row['class'])
                        if scores_train is not None:  # had enough data in train set?
                            metrics_train, _ = classifier_score(
                                pd.DataFrame({"score":scores_train, "class":y_train}), df_feat["class"], row['class'])
                            metrics_obj["AP_train"] = metrics_train["AP"]
                            metrics_obj["Recall_train"] = metrics_train["Recall"]
                            metrics_obj["AUC_train"] = metrics_train["AUC"]
                        
                    # too many iterations here, discard the scores
                    metrics_obj.update({'class':row['class'], 'method':method_act, 
                                        "size_train":len(idx_seen), "size_test":len(idx_unseen),
                                        "query_idx":query_idx, "label_increment":label_increment})
                    list_res_intermediate.append(metrics_obj)
            print(f"fold: {n_fold}, class: {row['class']}, methods: {methods}")
    # end loop over everything
    df_intermediate = pd.DataFrame(list_res_intermediate)
    return df_intermediate

# run classifier for overall sampling
if df_perf_rerank is None:
    df_perf_rerank = classify_av()
    df_perf_rerank.to_pickle(LEARNING_PERF_RERANK)
print("Computed impact by reranking strategy.")


Query Sizes: [5, 5, 10, 10, 20, 20, 40, 40, 80, 80, 160, 160, 320, 320]
fold: 0, class: holiday, methods: ['anomaly', 'entropy', 'random', 'outlier', 'inlier', 'positive', 'negative']
fold: 1, class: holiday, methods: ['anomaly', 'entropy', 'random', 'outlier', 'inlier', 'positive', 'negative']
fold: 2, class: holiday, methods: ['anomaly', 'entropy', 'random', 'outlier', 'inlier', 'positive', 'negative']
fold: 3, class: holiday, methods: ['anomaly', 'entropy', 'random', 'outlier', 'inlier', 'positive', 'negative']
fold: 4, class: holiday, methods: ['anomaly', 'entropy', 'random', 'outlier', 'inlier', 'positive', 'negative']
fold: 0, class: halloween, methods: ['anomaly', 'entropy', 'random', 'outlier', 'inlier', 'positive', 'negative']
fold: 1, class: halloween, methods: ['anomaly', 'entropy', 'random', 'outlier', 'inlier', 'positive', 'negative']
fold: 2, class: halloween, methods: ['anomaly', 'entropy', 'random', 'outlier', 'inlier', 'positive', 'negative']
fold: 3, class: halloween,

In [223]:
import matplotlib.pyplot as plt
import ipywidgets as widgets
from functools import partial

def active_perf_visualize(f, df, title_overall, metric_name, class_name=None):
    df_perf_avg = df.groupby(['method', 'class', 'size_train', 'label_increment']).mean().reset_index()
    if class_name is not None:
        df_perf_avg = df_perf_avg[df_perf_avg["class"]==class_name]
    lw = 2
    method_list = list(df_perf_avg['method'].unique())
    share_axy = None
    f.clf()
    def plt_local(ax, grp_class):
        for idx_method, grp_method in grp_class.groupby('method'):
            queryK = grp_method["size_train"]
            resultAP = grp_method[metric_name]
            ax.plot(queryK, resultAP, lw=lw, label=f"{idx_method} ({metric_name})")
        #ax.set_prop_cycle(None) # color reset - https://stackoverflow.com/a/24283087
        #for idx_method, grp_method in grp_class.groupby('method'):
        #    queryK = grp_method["size_train"]
        #    resultAP = grp_method["AUC"]
        #    ax.plot(queryK, resultAP, lw=lw, linestyle="--", label=f"{idx_method} (AUC)")
    
    idx_graph = 1
    for idx_class, grp_class in df_perf_avg.groupby(['class', 'label_increment']):
        ax = f.add_subplot(3, 3, idx_graph, sharey=share_axy)
        plt_local(ax, grp_class)
        if idx_graph > 3:
            ax.set_xlabel('train size')
        ax.grid()
        ax.set_title(f"{idx_class[0]} @ Query {idx_class[1]}")
        if share_axy is None:
            share_axy = ax
            # ax.set_ylabel('mAP (mean average precision)')
        # ax.legend(loc="best")
        idx_graph += 1
    # one more graph that we'll hide most of 
    ax = f.add_subplot(3, 3, idx_graph)
    plt_local(ax, grp_class)
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_axis_off()
    ax.legend(loc="upper left") #, ncol=2)
    ax.set_title(title_overall)
    display(f)


# use widget interaction basic
f = plt.figure(figsize=(12,12))

demo_fn = partial(active_perf_visualize, f, df_perf_rerank, "Average Precision on Unseen Samples")
dropdown_metric = widgets.Dropdown(
    options=["AP", "AUC", "Accuracy"], 
    value="AP",  # send run names
    description='Metric:',
    disabled=False
)
out = widgets.interactive_output(demo_fn, {"metric_name":dropdown_metric})
# output = dropdown.children[-1]  # anti-flicker trick (https://ipywidgets.readthedocs.io/en/stable/examples/Using%20Interact.html#Flickering-and-jumping-output)
# output.layout.height = '750px'  # disable this if you make your output window longer!
display(dropdown_metric, out)

# active_perf_visualize()


<Figure size 864x864 with 0 Axes>

Dropdown(description='Metric:', options=('AP', 'AUC', 'Accuracy'), value='AP')

Output()

### Reranking Post-Mortem
Whoa, there's a lot to unpack in the graphs above, so let's just iterate directly!  Overall there are a few winning methods that we'll carry into our next analysis.  Remember that these scores are computed on the `unknown` set of samples, so the size of that set decreases over time.
1. Random does okay, but flatlines
2. The simpler, prediction score-based methods (`positive`, `negative`) don't do well overall; in fact, we see that because of increasing homogeneity in `positive`, the performance drops with more samples.
3. In these experiments, training sizes between 200 and 300 worked best (approximately 20-30% of the data).
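
To read off a "best" training size numerically rather than by eye, a table shaped like `df_perf_rerank` can be averaged and searched for the peak per method. The sketch below uses synthetic rows (the AP values are made up for illustration); only the column names `method`, `class`, `size_train`, and `AP` are taken from the results table.

```python
import pandas as pd

# synthetic rows shaped like the performance table (values are made up for illustration)
df_demo = pd.DataFrame({
    "method":     ["random"] * 4 + ["anomaly"] * 4,
    "class":      ["holiday"] * 8,
    "size_train": [50, 150, 230, 470] * 2,
    "AP":         [0.40, 0.47, 0.46, 0.44, 0.38, 0.50, 0.58, 0.52],
})

# average over folds and classes, then pick the training size with the best mean AP per method
df_mean = df_demo.groupby(["method", "size_train"])["AP"].mean().reset_index()
best = df_mean.loc[df_mean.groupby("method")["AP"].idxmax()]
print(best[["method", "size_train", "AP"]])
```

Because scores are computed on the shrinking unseen set, a peak found this way is a rough guide rather than a precise optimum.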

# Query Depth Experiments

Now that we've got some better performance statistics, let's benchmark the effect of bigger or smaller query sizes.  Specifically, this indicates how often we may need to retrain and update scores in our labeling system.  

One school of thought holds that small micro-updates from the user allow the machine learning model to adapt quickly.  However, keep in mind that the `random` baseline above (just random ordering of samples) was quite performant across all of the different classes. 

Another line of practice, supported by recent studies, is to "go big" on the first initialization of a classifier, as applied in [Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers](https://arxiv.org/abs/2002.11794).  This choice asserts that learning over a large sample first better guarantees the diversity of those samples.  After a large model is created, additional transfer learning and adaptation strategies can then exploit and tune the model for various sample space variances.

We won't delve into those advanced techniques here, but we can experiment to emulate various query depths that can be utilized in a labeling system.
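
To see what a query schedule implies before running the simulation, the sketch below reproduces the doubling schedule used in the reranking experiment and compares it with fixed query depths. The helper names `doubling_schedule` and `train_sizes` are hypothetical; the sample counts (1000 labeled assets, 1033 feature rows) come from the outputs printed earlier in this notebook.

```python
def doubling_schedule(total, start=5):
    """Query sizes that double every two rounds until `total` labels are covered."""
    sizes, remaining, size = [], total, start
    while remaining > 0:
        sizes += [size, size]
        remaining -= 2 * size
        size *= 2
    return sizes

def train_sizes(schedule, n_samples):
    """Cumulative labeled-set size after each round, stopping when a query no longer fits."""
    seen, out = 0, []
    for q in schedule:
        if q > n_samples - seen:   # same stop rule as the simulation loop
            break
        seen += q
        out.append(seen)
    return out

print(doubling_schedule(1000))
print(train_sizes(doubling_schedule(1000), 1033))
for depth, rounds in [(20, 20), (50, 8), (100, 4)]:
    sizes = train_sizes([depth] * rounds, 1033)
    print(f"depth {depth}: {len(sizes)} retrains, {sizes[-1]} labels")
```

Each entry of a schedule corresponds to one retrain-and-rerank cycle, so for the same label budget a smaller depth means more retraining.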

In [220]:
# run query size estimation
if df_query_sizes is None:
    list_methods = ['anomaly', 'inlier', 'outlier', 'random', 'negative']
    list_query_runs = []
    # fixed query depths: 20 x 20, 50 x 8, and 100 x 4 labels (400 labels in each run)
    for depth, rounds in [(20, 20), (50, 8), (100, 4)]:
        list_query_runs.append(classify_av(methods=list_methods, label_increment=depth,
                                           query_sizes=[depth] * rounds))
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    df_query_sizes = pd.concat(list_query_runs)
    df_query_sizes.to_pickle(LEARNING_PERF_QUERY)
print("Computed impact by query (audience solicitation) size.")

Query Sizes: [20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
fold: 0, class: holiday, methods: ['anomaly', 'inlier', 'outlier', 'random', 'negative']
fold: 1, class: holiday, methods: ['anomaly', 'inlier', 'outlier', 'random', 'negative']
fold: 2, class: holiday, methods: ['anomaly', 'inlier', 'outlier', 'random', 'negative']
fold: 3, class: holiday, methods: ['anomaly', 'inlier', 'outlier', 'random', 'negative']
fold: 4, class: holiday, methods: ['anomaly', 'inlier', 'outlier', 'random', 'negative']
fold: 0, class: halloween, methods: ['anomaly', 'inlier', 'outlier', 'random', 'negative']
fold: 1, class: halloween, methods: ['anomaly', 'inlier', 'outlier', 'random', 'negative']
fold: 2, class: halloween, methods: ['anomaly', 'inlier', 'outlier', 'random', 'negative']
fold: 3, class: halloween, methods: ['anomaly', 'inlier', 'outlier', 'random', 'negative']
fold: 4, class: halloween, methods: ['anomaly', 'inlier', 'outlier', 'random', 'negative']
fold: 

In [222]:
# use widget interaction basic
f = plt.figure(figsize=(12,12))
demo_fn = partial(active_perf_visualize, f, df_query_sizes, "Average Precision Query Size Effect")
dropdown_class = widgets.Dropdown(
    options=list(df_query_sizes['class'].unique()),  # send run names
    value=list(df_query_sizes['class'].unique())[0],  # send run names
    description='Class:',
    disabled=False
)
out = widgets.interactive_output(demo_fn, {'metric_name':dropdown_metric, 'class_name':dropdown_class})
display(dropdown_class, dropdown_metric, out)


<Figure size 864x864 with 0 Axes>

Dropdown(description='Class:', options=('holiday', 'halloween', 'gift giving', 'family moments', 'shopping sce…

Dropdown(description='Metric:', index=1, options=('AP', 'AUC', 'Accuracy'), value='AUC')

Output()

### Query Depth Post-Mortem
A few messages come out of this experiment set that mostly agree with recent publications: start big and then hone in quickly with small tuning.
1. The average precision (AP) of small depth query sampling worked well for the `inlier` method, but only after a quarter of the dataset (or half of this experiment) was already confirmed. 
2. All models (except `inlier`) did well on "big starts" where a random set of as many as 100 samples was labeled before any reranking occurred.
3. Perhaps most telling of a potential weakness, the `inlier` method shows dramatic oscillations when used early in sampling and with small query depths.
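
One way to put a number on the oscillations and on "model stability" is the average step-to-step change of a performance curve. This is a minimal sketch under that assumption: `curve_roughness` is a hypothetical helper, and the two curves are made-up AP values rather than results from the experiments above.

```python
import numpy as np

def curve_roughness(values):
    """Mean absolute change between consecutive points of a performance curve."""
    values = np.asarray(values, dtype=float)
    if len(values) < 2:
        return 0.0
    return float(np.mean(np.abs(np.diff(values))))

# made-up AP curves over successive retraining rounds (illustrative only)
smooth = [0.40, 0.45, 0.48, 0.50, 0.51]
jumpy = [0.40, 0.55, 0.35, 0.60, 0.45]
print(curve_roughness(smooth), curve_roughness(jumpy))
```

A curve whose roughness stays low over the last few rounds is a reasonable hint that additional labels are no longer changing the model much.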

# Consensus Measures
Consensus is an important factor for determining agreement among labelers.  Few labeling systems properly use consensus to determine the accuracy of an underlying label because the methods for establishing agreement among labelers rely both on the labels themselves and on each individual's performance record.  The correct aggregation and balance of these two signals could be a hackathon or workshop in its own right (and arguably was in such datasets as the [Netflix Prize](https://www.kaggle.com/netflix-inc/netflix-prize-data)) because a variety of ETL and data clean-up processes are required.

Unfortunately, at the time of writing, there weren't enough labels on this dataset, so we could not conduct an in-depth consensus analysis.  Never fear! Parallel work on other tracks in [LabelQuest](https://lq.web.DOMAIN) comes and goes as required by business, so future datasets will provide sufficient means for exploration.

![Counts of Label Overlap](assets/active_consensus.png)
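
Although the workshop data did not support a full analysis, the simplest consensus measure, a majority vote with an agreement ratio, can be sketched on toy data. Everything in this example is hypothetical: the `labeler` column, the votes, and the `majority_consensus` helper are illustrations, and the sketch ignores the per-labeler performance record mentioned above.

```python
import pandas as pd

# toy annotations: one row per (asset, labeler) vote; names and values are hypothetical
df_votes = pd.DataFrame({
    "asset":   ["a", "a", "a", "b", "b", "c"],
    "labeler": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "class":   ["holiday", "holiday", "halloween", "holiday", "halloween", "holiday"],
})

def majority_consensus(df):
    """Per asset: the most frequent label, its share of the votes, and the vote count."""
    rows = []
    for asset, grp in df.groupby("asset"):
        counts = grp["class"].value_counts()   # sorted by frequency, most common first
        rows.append({"asset": asset, "label": counts.index[0],
                     "agreement": counts.iloc[0] / len(grp), "votes": len(grp)})
    return pd.DataFrame(rows)

print(majority_consensus(df_votes))
```

Assets with an agreement near 0.5 (ties) or with a single vote are the ones a labeling system would likely send out for additional labels.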

# End of Active Learning Material

This is where the core technical evaluations end, congratulations -- you made it!  Armed with this knowledge, you have a few strategies to map from a new concept into a custom av-centric classifier with several evaluation metrics along the way.  

The next notebook, [notebook E](E_deployment.ipynb) *(that link may not work)*, visits advanced methods that can apply and utilize models generated from these workbooks.
