![Machine Learning Workshop: Content Insights 2020](assets/mlci_banner.jpg)

# Machine Learning Workshop: Content Insights 2020

Welcome to the workshop notebooks!  These notebooks walk you through the steps of creating a model, refining it with user labels, and testing it on content.  You can access the main [workshop forum page](https://INFO_SITE/forums/html/forum?id=241a0b77-7aa6-4fef-9f25-5ea351825725&ps=25), the [workshop files repo](https://INFO_SITE/communities/service/html/communityview?communityUuid=fb400868-b17c-44d8-8b63-b445d26a0be4#fullpageWidgetId=W403a0d6f86de_45aa_8b67_c52cf90fca16&folder=d8138bef-9182-4bdc-8b12-3c88158a219c), or the [symposium home page](https://software.web.DOMAIN) for additional help.

The notebooks are divided into five core components: (A) setup & data, (B) model exploration, (C) labeling, (D) active labeling, and (E) deployment.  You are currently viewing the *model exploration* workbook.

In [207]:
# constants for running the workshop; we'll repeat these in the top line of each workbook.
#   why repeat them? the backup routine only serializes .ipynb files, so others will need 
#   to be downloaded again if your compute instance restarts (a small price to pay, right?)

WORKSHOP_BASE = "https://vmlr-workshop.STORAGE"
# WORKSHOP_BASE = "http://content.research.DOMAIN/projects/mlci_2020"
AGG_METADATA = "agg_metadata.pkl.gz"            # custom file for merged metadata
CLASS_DEFINITIONS = "assets/classes.json"       # provided file for class info
CLASS_LABELS_FLAT = "assets/labels_final.json"  # provided file for label info
AGG_TAG_EMBEDDING = "agg_tag_vocab.w2v"         # custom file for tag-based vocabulary
AGG_LABELS = "agg_labels.pkl.gz"                # custom file for merged labels
AGG_AVFEAT = "agg_avfeature.pkl.gz"             # custom file for merged audio and video features
MODEL_PERFORMANCE = "model_performance.pkl.gz"  # custom file for storing model performance


# Notebook B: Building Models for Contextual Ads

One goal of combining timed metadata and content is to better align content from WarnerMedia with ad creatives from Xandr.  The image below shows one example where a detected keyword or scene can trigger a related ad.  Without this technology, this ad spot (or inventory) may be undersold and filled with an unrelated or standard campaign ad.

![Contextual Ad Product](assets/mlci_contextual.jpg)

## Class Exploration
Let's quickly load and display the target classes used in this experiment. The table below indicates the class and the definition utilized for our classifier.  The field `primary` indicates whether or not a class will be used for performance evaluations.  Some non-primary classes were also included for additional experimentation, but they will not be the focus here.

In [92]:
import pandas as pd
from pathlib import Path   # needed for the metadata path check below
from IPython.display import display
pd.set_option('display.max_colwidth',1000)
df_classes = pd.read_json(CLASS_DEFINITIONS)
display(df_classes)

path_metadata = Path(AGG_METADATA)
if not path_metadata.exists():
    print("Please return to notebook A to run data flattening and merging!")
df_flatten = pd.read_pickle(str(path_metadata))
print(f"Loaded {len(df_classes)} classes and {len(df_flatten)} tag rows.")

Unnamed: 0,class,definition,primary
0,holiday,"holiday scenes or objects like decorated trees, presents, or character, holiday party",1
1,halloween,"halloween scenes where one or more characters are in costume, ideally one or more characters are trick-or-treating",1
2,gift giving,"scenes of gift giving, receiving, or opening/unwrapping",1
3,family moments,"at least two people on screen, typically familes at parties, enjoying a meal, lounging at home",0
4,shopping scenes,one or more primary actors in a store-like environment; not necessary to see their face,0


Loaded 5 classes and 122736 tag rows.


### Result Browsing for Subjective Scoring
Below, we've created a simple method to view keyframe results from the video assets used in this workshop.  It uses simple widgets (as we did above) and displays them in a columnar form in this notebook.  In addition to numerical performance (accuracy, AUC, etc.), we can visually inspect the performance of a classifier model with this utility.

In [104]:
import ipywidgets as widgets

import numpy as np
import random

def display_hit_html(path_result, score, label=1, idx=None, base=WORKSHOP_BASE):
    '''
     Convert a result path into an HTML snippet: a linked
     '<img src="'+ path + '"/>' keyframe thumbnail with its score underneath
     (the score text is red when label != 1).  Adjust the markup below to
     control the height, aspect ratio, size, etc.
    '''
    path_parts = path_result.split('/')
    url_full = f"{base}/{path_parts[0]}/keyframe/{path_parts[1]}.jpg"
    text_color = f" style='color:{'black' if label==1 else 'red'}' "
    return f"""
        <a href='{url_full}' title='{path_result}' target='_new'><img width='100%' src='{url_full}' /></a>
        <small {text_color}><strong>{f"[{idx}] " if idx is not None else ''}{score:0.4}</strong></small>
        """

def results_display(df, max_results=32, num_cols=8, base=WORKSHOP_BASE, title="Example Results"):
    """Input DataFrame with 'asset' and 'score' and show those results in columnar format.
    Providing an additional column 'class' [as 0 or 1] will allow incorrect items to be shown in red."""
    width_col = f"{round(100/num_cols)}%"
    #list_results.sort(reverse=True, key=lambda x: x[-1]) 
    #list_html = [widgets.HTML(path_to_image_html(*x, base=base)) for x in list_results]
    df = df.sort_values("score", ascending=False).head(max_results).reset_index(drop=True)
    if "class" not in df.columns:
        df["class"] = 1
    list_html = [widgets.HTML(display_hit_html(
        x['asset'], x['score'], x['class'], idx=i, base=base)) for i,x in df.iterrows()]
    display(widgets.VBox([widgets.HTML(f"<h3>{title}</h3>"), widgets.GridBox(list_html,
               layout=widgets.Layout(grid_template_columns=f"repeat({num_cols}, {width_col})"))]))

# demo of input as a single result, but with random scores
def regen(x):
    df_demo = pd.DataFrame({"asset":list(df_flatten['asset'].sample(16)),
                            "score":np.random.rand(1, 16)[0],
                            "class":np.random.randint(2, size=16)})
    results_display(df_demo, title="Results demo (random images, scores, and labels utilized)")
widgets.interact(regen, x=widgets.ToggleButton(description="Randomize"))



### Objective Result Scoring
Similar to the code above, this function can be used for evaluation of models by a few different metrics -- but here it is aimed at objective, numerical methods.  For convenience, this method can also plot the results of both objective and subjective scoring.

In [122]:
from sklearn import metrics
import matplotlib.pyplot as plt

def classifier_score(df_prediction, df_labels, class_name):
    """Function to provide metric outputs for the evaluation of a prediction dataframe.
    
    Parameters:
        df_prediction (DataFrame): dataframe containing 'asset' and 'score' as columns
        df_labels (DataFrame): dataframe containing 'asset' and 'label' for labels
        class_name (str): class name for evaluation against labels

    Returns:
        dict of metrics (AP, AUC, Accuracy, Recall, F1) ({"class":X, "AP":Y, ...}) and joined dataframe
    """
    metrics_obj = {"class":class_name}
    
    # clean up input labels, prune to relevant class
    df_labels = df_labels[df_labels["class"] == class_name].drop(columns=["etag", "url"]) 
    # join labels and scores by asset, normalize label to int
    df_join = df_prediction.set_index('asset').join(df_labels.set_index('asset'), how="left").fillna(0)  # join at asset level, 0 for unlabeled
    df_join["class"] = df_join["class"].apply(lambda x: 1 if x != 0 else 0).astype(int)
    df_join = df_join.reset_index().sort_values("score", ascending=False)

    # print(f"{class_name}: Found {len(df_join)} samples from {len(df_labels)} labels and {len(df_prediction)} scores.")

    def thresh(x):
        return 1 if x >= 0.5 else 0
    
    metrics_obj["AP"] = metrics.average_precision_score(df_join['class'], df_join['score'])
    fpr, tpr, thresholds = metrics.roc_curve(df_join['class'], df_join['score'])
    metrics_obj["AUC"] = metrics.auc(fpr, tpr)
    metrics_obj["Accuracy"] = metrics.accuracy_score(df_join['class'], df_join['score'].apply(thresh))
    metrics_obj["Recall"] = metrics.recall_score(df_join['class'], df_join['score'].apply(thresh))
    metrics_obj["F1"] = metrics.f1_score(df_join['class'], df_join['score'].apply(thresh))
    # print(f"{class_name}: {metrics_obj}")
        
    # return our computation!
    return metrics_obj, df_join

def classifier_plot(metrics_obj, df_scored):
    fpr, tpr, thresholds = metrics.roc_curve(df_scored['class'], df_scored['score'])
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

    lw = 2
    ax1.plot(fpr, tpr, color='darkorange',
             lw=lw, label=f"ROC curve (AUC = {metrics_obj['AUC']:0.2})")
    ax1.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    ax1.set_xlim([0.0, 1.0])
    ax1.set_ylim([0.0, 1.05])
    ax1.set_xlabel('False Positive Rate')
    ax1.set_ylabel('True Positive Rate')
    ax1.legend(loc="lower right")

    precision, recall, thresholds = metrics.precision_recall_curve(df_scored['class'], df_scored['score'])
    ax2.plot(recall, precision, color='red',
             lw=lw, label=f"PR Curve (F1 = {metrics_obj['F1']:0.2})")
    ax2.plot([1, 0], [0, 1], color='navy', lw=lw, linestyle='--')
    ax2.set_xlim([0.0, 1.0])
    ax2.set_ylim([0.0, 1.05])
    ax2.set_ylabel('Precision')
    ax2.set_xlabel('Recall')
    ax2.legend(loc="upper right")
    plt.show()
    
# read label data for use later!
df_labels = pd.read_json(CLASS_LABELS_FLAT).explode('labels').fillna('none of the above')
df_labels.rename(columns={"data":"url", "labels":"class"}, inplace=True)
df_labels["asset"] = df_labels['url'].replace(regex={r'^' + WORKSHOP_BASE + '/': ''})
print(f"Loaded a total of {len(df_labels)} labels across {len(df_labels['asset'].unique())} samples and {len(df_labels['class'].unique())} classes.")

# dataframe for tracking performance...
df_performance = None

Loaded a total of 947 labels across 589 samples and 6 classes.
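
To make the metrics returned by `classifier_score` more concrete, here is a minimal sketch on made-up data.  The labels and scores below are hypothetical and are not taken from the workshop data; the 0.5 cut-off mirrors the `thresh()` helper in the cell above.

```python
import numpy as np
from sklearn import metrics

# toy ground truth (1 = class present) and classifier scores; values are made up
y_true = np.array([1, 0, 1, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])

# ranking metrics work directly on the raw scores
ap = metrics.average_precision_score(y_true, y_score)
auc = metrics.roc_auc_score(y_true, y_score)

# thresholded metrics need a binary decision first
y_pred = (y_score >= 0.5).astype(int)
acc = metrics.accuracy_score(y_true, y_pred)
recall = metrics.recall_score(y_true, y_pred)
f1 = metrics.f1_score(y_true, y_pred)

print(f"AP={ap:0.3f} AUC={auc:0.3f} Accuracy={acc:0.3f} Recall={recall:0.3f} F1={f1:0.3f}")
```

Note that AP and AUC depend only on how the scores rank the samples, while accuracy, recall, and F1 change whenever the threshold changes.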


# NLP and Text processing
Our first classifiers take a classical approach: they map each class definition to tag names through text search.  Using the classes loaded above, let's perform a quick mapping.  

* **text2doc** - this function will allow us to strip out [stop words](https://www.analyticsvidhya.com/blog/2019/08/how-to-remove-stopwords-text-normalization-nltk-spacy-gensim-python/); we'll use it both for the `raw match` and the `nlp embedding` models
* **doc2vec** - this function will take a filtered text string and produce a numerical embedding

This processing will be used to test the performance of two text-based methods:

1. direct string matching (e.g. find a tag `gift` that matches the definition of `gift`) and 
2. embedding-based matching by `k` nearest neighbors.

**A-ha Questions** *(that means you probably get it)*
* Q: Wait, where does the extra text come from? 
    * A: How insightful! We use both the class name and the definition to compute textual features.  This is a reasonable step because those textual prompts would be provided by the stakeholder and are provided to the user labeler in subsequent notebooks.
* Q: What is embedding?
    * A: Embedding is a technique that assigns a numerical vector to a term based on its co-occurrence with other words; it is often attributed to an early Google algorithm called [word2vec](https://en.wikipedia.org/wiki/Word2vec).  Methods have progressed since then, but the co-occurrence is still computed from statistics in a training corpus.  In the cell below, we load a spaCy model [en_core_web_md](https://spacy.io/models/en) trained on the [Common Crawl dataset](https://nlp.stanford.edu/projects/glove/).
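
As a rough illustration of why embeddings help, the sketch below compares words by cosine similarity.  The three-dimensional vectors are hypothetical values invented for this example (real spaCy vectors have 300 dimensions), and summing word vectors is only one simple way to embed a phrase.

```python
import numpy as np

# hypothetical 3-D embeddings, invented for illustration only
toy_vectors = {
    "gift":    np.array([0.9, 0.1, 0.0]),
    "present": np.array([0.8, 0.2, 0.1]),
    "pumpkin": np.array([0.0, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed_phrase(words):
    """Sum the word vectors of a phrase into a single vector."""
    return np.sum([toy_vectors[w] for w in words], axis=0)

sim_close = cosine_similarity(toy_vectors["gift"], toy_vectors["present"])
sim_far = cosine_similarity(toy_vectors["gift"], toy_vectors["pumpkin"])
phrase = embed_phrase(["gift", "present"])

print(f"gift~present: {sim_close:0.3f}, gift~pumpkin: {sim_far:0.3f}")
```

Words that appear in similar contexts end up pointing in similar directions, which is what lets a class definition find tags it never mentions verbatim.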


In [28]:
import spacy
from spacy.vocab import Vocab
from spacy.tokenizer import Tokenizer
from scipy import spatial
from pathlib import Path

def text2doc(nlp, tag_raw):
    """Given a raw text input, tokenize it to remove stop words"""
    return [x for x in nlp(tag_raw) if not x.is_stop and not x.is_punct]

def doc2vec(nlp, doc, target_domain=None):
    """Given a specific model, clean line of text into an output embedding space"""
    # https://spacy.io/usage/vectors-similarity
    if target_domain is None:
        target_domain = nlp.vocab
    if type(doc) != list:
        doc = [doc]
    tag_doc = None
    for token in doc:
        text = token if isinstance(token, str) else token.text   # accept str or spaCy token
        tag_id = target_domain.strings[text]
        if tag_id in target_domain.vectors:   # search existing one
            new_vec = target_domain.vectors[tag_id]
        elif isinstance(token, str):
            new_vec = nlp(token).vector
        else:
            new_vec = token.vector
        if tag_doc is None:
            tag_doc = new_vec.copy()   # copy so the in-place sum never alters the model's own vectors
        else:
            tag_doc += new_vec
    return tag_doc

# doc = text2doc(nlp, "this is a phrase to clean")
# vec = doc2vec(nlp, doc)

# also load our spacy NLP model
nlp = spacy.load('en_core_web_md')
print("NLP model ready to go!")

NLP model and flattened featured ready to go!


## Classifiers 1+2: Text Token Matching and Embedding
The cell below prepares both classifiers: it tokenizes and embeds each class, then builds a token-to-tag lookup and an embedding matrix for all known tags.

In [106]:
# let's plow through our class dataframe and do the cleaning and mapping
# print([x for x in df_classes[df_classes['primary']==1].iloc])
print("Tokenizing and embedding classes...")
list_tokens = [""] * len(df_classes)
list_vect = [np.ndarray((1, nlp.vocab.vectors.shape[1]))] * len(df_classes)
for idx in range(len(df_classes)):  # iterate row indexes
    row = df_classes.iloc[idx]
    # tokenize to remove stop words, return tokens
    doc = text2doc(nlp, row["definition"]) + text2doc(nlp, row["class"])
    list_tokens[idx] = [str(x) for x in doc]
    # convert to an embedding array
    list_vect[idx] = doc2vec(nlp, doc)
df_classes["token"] = list_tokens
df_classes["embedding"] = list_vect

# now collect all of the tags that match simple text bag
print("Lookup specific tags by match...")
list_tags = df_flatten['tag'].unique()
map_tokens = {}
embed_tokens = np.ndarray((len(list_tags), nlp.vocab.vectors.shape[1]))
for idx in range(len(list_tags)):   # iterate through full list of known tags
    doc = text2doc(nlp, list_tags[idx])
    for x in doc:  # create map/reference to each term
        x = str(x)
        if x not in map_tokens:  # first time to see this tag?
            map_tokens[x] = []
        map_tokens[x].append(idx)   # save reference to original tag set
    embed_tokens[idx, :] = doc2vec(nlp, doc)  # compute embedding 

# df_classes["token"] = list_tokens
# df_classes["embedding"] = list_vect
print(f"Count of new text-based mapping: {len(map_tokens)}...")
print(f"Shape of embedded tag matrix: {embed_tokens.shape}...")


Tokenizing and embedding classes...
Lookup specific tags by match...
Count of new text-based mapping: 1804...
Shape of embedded tag matrix: (1763, 300)...


### Classifier 1
The scores for "classifier 1" are the `max()` of all scores of tags that had a string-based match against the name or definition of the class.
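
A minimal sketch of this scoring rule on made-up rows follows.  The frame below only imitates the `asset`, `tag`, and `score` columns of `df_flatten`; its values and the list of matched tags are hypothetical.

```python
import pandas as pd

# hypothetical rows, one per (asset, tag) detection
df_toy = pd.DataFrame({
    "asset": ["clip_a", "clip_a", "clip_a", "clip_b", "clip_b"],
    "tag":   ["gift", "tree", "wrapping paper", "gift", "car"],
    "score": [0.60, 0.95, 0.80, 0.30, 0.99],
})
matched_tags = ["gift", "wrapping paper"]   # tags assumed to match the class text

# keep only rows whose tag matched, then take the best score per asset
df_hits = df_toy[df_toy["tag"].isin(matched_tags)]
df_score = df_hits.groupby("asset")["score"].max().reset_index()
print(df_score)
```

High-scoring tags that did not match (like `tree` or `car` here) are ignored, so the asset score is only as good as the matched tag list.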

In [277]:
dict_results = {}
for idx in range(len(df_classes)):  # iterate classes for evaluation
    row = df_classes.iloc[idx]
    tag_match = []
    for x in row["token"]:  # scan our tokens for this class
        if x in map_tokens:   # now check for hits in tokenized tags
            tag_match += map_tokens[x]   # found one? save all of the original references
    tag_match = [list_tags[x] for x in tag_match]   # dereference to actual tag text
    # print(tag_match)
    df_sub = df_flatten[df_flatten['tag'].isin(tag_match)]  # grab dataframe that matched one tag
    #print(df_sub)
    df_score = df_sub.groupby(['asset'])['score'].max().reset_index(drop=False)
    #print(df_score, row["class"])

    metrics_obj, df_scored = classifier_score(df_score, df_labels, row['class'])
    dict_results[row['class']] = {'class':row['class'], 'method':'match', 'token':tag_match,
                                  'metrics': metrics_obj, 'scored': df_scored, "details": ""}

# save the results 
def result_update(df_performance, dict_results):
    df_new = pd.DataFrame(dict_results.values())
    df_new['id'] = [f"{x['class']} {x['method']}".replace(' ', '_') for i,x in df_new.iterrows()]
    df_new.set_index('id', inplace=True, drop=True)
    if df_performance is None:   # no prior records
        df_performance = df_new
    else:   # update or insert new performance records
        for i,r in df_new.iterrows():
            df_performance.loc[i] = r
    df_performance.to_pickle(MODEL_PERFORMANCE)
    return df_performance
    
# create a quick interaction grid for display
def result_visualize(run_name):
    global df_performance
    if run_name not in df_performance.index:
        print(f"Error: Couldn't find the specified run {run_name} in results!")
        return   # nothing to display for an unknown run
    class_name = df_performance.loc[run_name, "class"]
    display(f"Matched Tags (for class {class_name}): {df_performance.loc[run_name, 'token']}")
    classifier_plot(df_performance.loc[run_name, 'metrics'], df_performance.loc[run_name, 'scored'])
    results_display(df_performance.loc[run_name, 'scored'], 16, title=class_name)

# save results
df_performance = result_update(df_performance, dict_results)
            
# use widget interaction basic    
dropdown = widgets.interactive(result_visualize, run_name=widgets.Dropdown(
    options=list(df_performance[df_performance['method']=="match"].index),  # send run names
    description='Run:',
    disabled=False,
))
output = dropdown.children[-1]  # anti-flicker trick (https://ipywidgets.readthedocs.io/en/stable/examples/Using%20Interact.html#Flickering-and-jumping-output)
output.layout.height = '750px'  # disable this if you make your output window longer!
display(dropdown)



### Classifier 1 - Post Mortem

What did we learn from this experiment?  Briefly, we confirmed that text matching alone is probably too simplistic and can lead to far too many tokens being selected for a class.
1. Token matching had lots of errors and went outside of the main goal for each class.
2. Performance with matching never really went above random for AUC and PR was similarly erratic.
3. There are no tuning knobs to take out spurious mappings, like the term `scene`, which was included in the definition of many classes, but was likely more of a stop word than something helpful!
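
One possible knob, which is not part of the workshop code, would be a small hand-curated list of extra stop words applied to the class tokens before matching.  The sketch below is only an illustration; the names `EXTRA_STOP_WORDS` and `prune_tokens` and the word list itself are hypothetical.

```python
# hypothetical domain-specific words to ignore; not part of the workshop code
EXTRA_STOP_WORDS = {"scene", "scenes", "character", "characters"}

def prune_tokens(tokens, extra_stop_words=EXTRA_STOP_WORDS):
    """Drop tokens (case-insensitive) that appear in the extra stop-word set."""
    return [t for t in tokens if t.lower() not in extra_stop_words]

tokens = ["holiday", "scenes", "objects", "decorated", "trees", "presents", "character"]
pruned = prune_tokens(tokens)
print(pruned)
```

Such a list has to be maintained by hand for each domain, which is part of why pure string matching scales poorly.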

### Classifier 2
The scores for "classifier 2" are the weighted combination of the scores and nearest N neighbors in an NLP-based embedding space, as determined by a class name and definition.

In [282]:
from sklearn.neighbors import BallTree
from sklearn.preprocessing import normalize

embed_tree = BallTree(embed_tokens, leaf_size=16)
def classify_embed(kNN=10):
    dict_results = {}
    for idx in range(len(df_classes)):  # iterate classes for evaluation
        row = df_classes.iloc[idx]
        tag_match = []

        tag_dist, tag_idx = embed_tree.query(row['embedding'].reshape(1, len(row['embedding'])), k=kNN)
        # normalize into a "similarity" weight instead of distance by dividing by the max distance (farthest neighbor gets weight 0)
        tag_sim = 1 - normalize(tag_dist, norm='max', axis=1)
        
        # dereference to actual tag text
        tag_match = {list_tags[tag_idx[0][i]]:tag_sim[0][i] for i in range(len(tag_idx[0]))}
        # print(tag_match.keys())
        
        # print(tag_match)
        df_sub = df_flatten[df_flatten['tag'].isin(tag_match.keys())].copy()  # grab dataframe that matched one tag
        for k in tag_match:   # scale by the weight 
            df_sub.loc[df_sub['tag']==k, 'score'] *= tag_match[k]
        
        #print(df_sub)
        df_score = df_sub.groupby(['asset'])['score'].max().reset_index(drop=False)
        #print(df_score, row["class"])

        metrics_obj, df_scored = classifier_score(df_score, df_labels, row['class'])
        dict_results[row['class']] = {'class':row['class'], 'method':f'embed', 
                                      'token':list(tag_match.keys()), "details":str(kNN),
                                      'metrics': metrics_obj, 'scored': df_scored}
    return dict_results

def result_remap(knn, run_name):
    global df_performance
    if len(df_performance[df_performance["details"]==str(knn)]) == 0:
        dict_results = classify_embed(knn)
        df_performance = result_update(df_performance, dict_results)   # save results
    result_visualize(run_name)

            
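# make sure default results exist so the dropdown below has runs to list
if len(df_performance[df_performance['method']=="embed"]) == 0:
    df_performance = result_update(df_performance, classify_embed(10))

# use widget interaction basic    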
dropdown = widgets.interactive(result_remap, 
    knn=widgets.IntSlider(min=5, max=30, step=5, value=10, continuous_update=False),
    run_name=widgets.Dropdown(
    options=list(df_performance[df_performance['method']=="embed"].index),  # send run names
    description='Run:',
    disabled=False,
))
output = dropdown.children[-1]  # anti-flicker trick (https://ipywidgets.readthedocs.io/en/stable/examples/Using%20Interact.html#Flickering-and-jumping-output)
output.layout.height = '750px'  # disable this if you make your output window longer!
display(dropdown)



### Classifier 2 - Post Mortem

What did we learn from this experiment?  Briefly, we saw that semantic embedding **does** help to pick better metadata tags for a given class. However, similar issues for stop words came up that may be hard to solve algorithmically.
1. Semantic embedding got much better tags that weren't direct matches.
2. We hit a sweet spot with `kNN=10` (or 10 nearest neighbors) from semantic embedding, which generally resulted in the highest AUC.
3. The PR curve looked much healthier with this classifier, often above the break-even line.
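
The neighbor lookup and distance-to-similarity weighting used by this classifier can be sketched on a tiny example.  The two-dimensional vectors and tag names below are hypothetical stand-ins for the 300-dimensional tag embeddings.

```python
import numpy as np
from sklearn.neighbors import BallTree
from sklearn.preprocessing import normalize

# hypothetical 2-D "tag embeddings", invented for illustration
tag_names = ["gift", "present", "ribbon", "car"]
tag_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.7, 0.3], [-1.0, 0.5]])
class_vector = np.array([[1.0, 0.05]])   # stand-in for a class embedding

tree = BallTree(tag_vectors, leaf_size=16)
dist, idx = tree.query(class_vector, k=3)   # neighbors come back sorted by distance

# divide by the largest distance, then flip so that closer tags get larger weights
sim = 1 - normalize(dist, norm="max", axis=1)
weights = {tag_names[i]: float(s) for i, s in zip(idx[0], sim[0])}
print(weights)
```

One consequence of this normalization is that the farthest of the `k` neighbors always receives a weight of 0, so its tag scores are zeroed out and it contributes no usable evidence to the final `max()`.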

# Low-level Video and Audio Features
Our final classifier requires supervision to train.  Specifically, whereas the other two text-based classifiers used only the class name and description, we'll need explicit labels of whether or not a class is valid to continue.  Looking back at the list of [ContentAI extractors](https://www.contentai.io/docs/extractors), we'll be working with two wrapped extractors that produce [video 3DCNN](https://www.contentai.io/docs/dsai_videocnn) features and [audio VGGish](https://www.contentai.io/docs/dsai_vggish) features.  

The native output of each extractor is a set of numerical features at regular time intervals, according to its sample rate.  This is useful in long-form video because it helps us detect which 30s segment of a two-hour comedy is relevant.  Luckily, our source data (YouTube clips) has already been segmented into 15s clips, so we can aggregate the timed feature outputs into a single vector for each video.


## Flattening Features
This block loads the raw features and flattens them to a single row via averaging across time samples and then applying unit normalization.  The normalization step is a common practice that makes the feature more amenable to DNN and linear regressor functions.

In [253]:
import numpy as np
import pandas as pd
import h5py
import json

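def feature_flatten(x, include_norm=True):
    """Average a (time, dim) feature array over time; return a (1, dim) row, optionally L2-normalized."""
    x = np.asarray(x)   # accepts numpy arrays (as read from hdf5) or pandas Series
    if len(x.shape) > 1:   # average across time samples
        x = x.mean(axis=0)
    vec = np.reshape(x, (1, x.shape[0]))
    if include_norm:
        vec = vec * 1 / feature_l2norm(vec)
    return vec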

def feature_l2norm(x):
    # print(f"NORM: {x.shape, type(x), len(x), x}")
    return np.linalg.norm(x.astype(np.float32), 2)

path_features = Path(AGG_AVFEAT)
if path_features.exists():
    print(f"Skipping re-create of feature file '{str(path_features)}'...")
    df_avfeature = pd.read_pickle(str(path_features))
else:
    df_avfeature = None
    num_files = 0
    path_content = Path("packages/content/vmlr-workshop")

    list_feat = []
    list_asset = []
    list_files = list(path_content.rglob("dsai_vggish/data.hdf5"))
    print(f"Ingesting {len(list_files)} audio files in path '{str(path_content)}'...")
    for path_file in list_files:  # search for flattened files
        with h5py.File(str(path_file),'r') as f:
            path_asset = Path(*path_file.parent.relative_to(path_content).parts[:2])
            for asset_key in f.keys():
                nd_new = feature_flatten(f[asset_key][()])   # read the numpy array, flatten
                list_asset.append(str(path_asset))   # save new feature
                list_feat.append(nd_new)
                #if nd_feat is None:
                #    nd_feat = nd_new
                #else:
                #    print(nd_new.shape, nd_feat.shape)
                #    nd_feat = np.append(nd_feat, nd_new, 0)  # concat feature
    df_audio = pd.DataFrame({"asset": list_asset, "audio":list_feat}).set_index("asset")
    
    list_feat = []
    list_asset = []
    list_files = list(path_content.rglob("dsai_videocnn/data.hdf5"))
    print(f"Ingesting {len(list_files)} video files in path '{str(path_content)}'...")
    for path_file in list_files:  # search for flattened files
        with h5py.File(str(path_file),'r') as f:
            path_asset = Path(*path_file.parent.relative_to(path_content).parts[:2])
            for asset_key in f.keys():
                nd_new = feature_flatten(f[asset_key][()])   # read the numpy array, flatten
                list_asset.append(str(path_asset))   # save new feature
                list_feat.append(nd_new)
    df_video = pd.DataFrame({"asset": list_asset, "video":list_feat}).set_index("asset")
    df_avfeature = df_audio.join(df_video, how='inner').reset_index()  # merge audio and video features
    print(f"Found audio {len(df_audio)} samples, {len(df_video)} video, and {len(df_avfeature)} combined samples")
    # uh oh, looks like there is a disparity between audio and video features!
    #    that happens, so we'll just have to accommodate it later...

    list_combined = []
    for idx, row in df_avfeature.iterrows():
        nd_new = np.append(row['video'], row['audio'], axis=1)
        nd_new *= 1 / feature_l2norm(nd_new)
        list_combined.append(nd_new)
    df_avfeature["combined"] = list_combined
    df_avfeature.to_pickle(str(path_features))

print(f"Found {len(df_avfeature)} rows with these data columns... {list(df_avfeature.columns)}")

Ingesting 1016 audio files in path 'packages/content/vmlr-workshop'...
Ingesting 1016 video files in path 'packages/content/vmlr-workshop'...
Found audio 1013 samples, 1013 video, and 1033 combined samples
Found 1033 rows with these data columns... ['asset', 'audio', 'video', 'combined']
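
As a minimal illustration of the flattening step, the sketch below averages a small made-up clip over time and scales the result to unit length.  The function name `flatten_clip` and the input values are hypothetical.

```python
import numpy as np

def flatten_clip(features):
    """Average a (time, dim) array over time and scale the result to unit L2 length."""
    mean_vec = np.asarray(features, dtype=np.float32).mean(axis=0)
    return (mean_vec / np.linalg.norm(mean_vec)).reshape(1, -1)

# hypothetical clip with 3 time samples of a 2-D feature
clip = np.array([[3.0, 0.0], [3.0, 8.0], [3.0, 4.0]])
row = flatten_clip(clip)
print(row, np.linalg.norm(row))
```

After normalization every clip has the same vector length, so a classifier compares feature direction rather than overall magnitude.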


### Classifier 3: Supervised Audio and Video 
In this section we'll train an A/V-based model using some labels.  The general expectation is that this model should perform the best because it is specifically trained for each class.

In [305]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.calibration import CalibratedClassifierCV

def classify_av(feature="combined", score_calibrate=False):
    cv_folds = 5 if score_calibrate else 10
    cv_jobs = -1  # -1 is auto, otherwise specific number
    dict_results = {}
    for idx in range(len(df_classes)):  # iterate classes for evaluation
        row = df_classes.iloc[idx]
        tag_match = []

        df_label_sub = df_labels[df_labels["class"]==row["class"]]  # subselect for this class
        df_feat = df_avfeature.set_index("asset").copy()   # get slice of right features
        df_feat = df_feat.join(df_label_sub.set_index("asset"), how="left").fillna(0)  # join with labels
        df_feat["class"] = df_feat["class"].apply(lambda x: 1 if x != 0 else 0).astype(int)  # blank out text
        
        model = LogisticRegression()  # basic logistic regression
        if score_calibrate:   # try to re-calibrate outputs for better threshold?
            model = CalibratedClassifierCV(model, method="sigmoid")
        probs = cross_val_predict(model, np.vstack(df_feat[feature]), 
                                  df_feat["class"], cv=cv_folds, 
                                  n_jobs=cv_jobs, method='predict_proba')
        df_feat["score"] = probs[:,1]  # grab prediction for second class
        df_feat = df_feat.reset_index().drop(columns=['class'])  # reset index, drop label

        metrics_obj, df_scored = classifier_score(df_feat[["asset", "score"]], df_labels, row['class'])
        dict_results[row['class']] = {'class':row['class'], 'method':f'avfeat', 
                                      'token':['audio', 'video'], "details":f"{feature}_{score_calibrate}",
                                      'metrics': metrics_obj, 'scored': df_scored}
    return dict_results

def result_retrain(run_name, modality, calibrate):
    global df_performance
    details_mode = f"{modality}_{calibrate}"
    if len(df_performance[df_performance["details"]==details_mode]) == 0:
        print("Model condition change detected, retraining classifier...")
        dict_results = classify_av(modality, calibrate)
        df_performance = result_update(df_performance, dict_results)   # save results
    result_visualize(run_name)

            
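# make sure default results exist so the dropdown below has runs to list
if len(df_performance[df_performance['method']=="avfeat"]) == 0:
    df_performance = result_update(df_performance, classify_av("combined", False))

# use widget interaction basic    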
dropdown = widgets.interactive(result_retrain
    ,run_name=widgets.Dropdown(
        options=list(df_performance[df_performance['method']=="avfeat"].index),  # send run names
        value=list(df_performance[df_performance['method']=="avfeat"].index)[0],
        description='Class Name:',
        disabled=False)
    ,modality=widgets.Dropdown(options=['combined', 'audio', 'video'], description='Modality:')
    ,calibrate=widgets.Checkbox(value=False, description='Score re-calibration')
)
output = dropdown.children[-1]  # anti-flicker trick (https://ipywidgets.readthedocs.io/en/stable/examples/Using%20Interact.html#Flickering-and-jumping-output)
# output.layout.height = '750px'  # disable this if you make your output window longer!
display(dropdown)



### Classifier 3 - Post Mortem

What did we learn from this experiment?  We saw that feature-based analysis does a great job at finding things that look or sound a lot like what we've seen before.  Unfortunately, as in real life, the classifier hadn't seen enough of the world and its recall of diverse conditions is poor.  Additionally, this under-training affected the values of the score output, which are more erratic than with other methods.
1. A/V features are great for finding content with slight perturbations to what was labeled true.
2. Combined audio and video features seem to always perform better than audio or video alone.
3. Additional score calibration may be required for better threshold binarization.
4. As always with cold-start machine learning, more data would help to grow this classifier.
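
The out-of-fold scoring used by this classifier can be sketched on synthetic data.  The features below are random numbers standing in for the A/V vectors; the class separation, sample counts, and fold count are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
# synthetic stand-in for A/V features: 40 negatives near 0, 40 positives shifted by +2
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 4)),
               rng.normal(2.0, 1.0, size=(40, 4))])
y = np.array([0] * 40 + [1] * 40)

# each sample is scored by a model that never saw it during training
probs = cross_val_predict(LogisticRegression(), X, y, cv=5, method="predict_proba")
scores = probs[:, 1]   # probability of the positive class

print(f"mean score, positives: {scores[y == 1].mean():0.3f}")
print(f"mean score, negatives: {scores[y == 0].mean():0.3f}")
```

Because every score comes from a fold that held the sample out, the resulting metrics are a fairer estimate of performance on unseen clips than scoring the training data directly.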

# End of Model Material

Another notebook done! Now models trained on text features, embeddings, and audio-video features are under your belt.  One thing to consider now is the price of those supervised labels, which often gave us the best performance.  

If you've ever proposed such an idea to a manager or business partner, some of these questions are likely to be first responses... How do you get those labels?  How many labeled examples do you need?  How long will label collection and stability take?

Fret not, for in [notebook C](C_labeling.ipynb) *(that link may not work)* we will focus on programmatic use of an internal label acquisition tool, [LabelQuest](https://lq.web.DOMAIN/), which has been configured for self-service answers to some of these questions.