![Machine Learning Workshop: Content Insights 2020](assets/mlci_banner.jpg)

# Machine Learning Workshop: Content Insights 2020

Welcome to the workshop notebooks!  These notebooks walk you through the steps of creating a model, refining it with user labels, and testing it on content.  For additional help, you can access the main [workshop forum page](https://INFO_SITE/forums/html/forum?id=241a0b77-7aa6-4fef-9f25-5ea351825725&ps=25), the [workshop files repo](https://INFO_SITE/communities/service/html/communityview?communityUuid=fb400868-b17c-44d8-8b63-b445d26a0be4#fullpageWidgetId=W403a0d6f86de_45aa_8b67_c52cf90fca16&folder=d8138bef-9182-4bdc-8b12-3c88158a219c), or the [symposium home page](https://software.web.DOMAIN).

The notebooks are divided into five core components: (A) setup & data, (B) model exploration, (C) labeling, (D) active labeling, and (E) deployment.  You are currently viewing the *setup & data* workbook.

In [187]:
# constants for running the workshop; we'll repeat these in the top line of each workbook.
#   why repeat them? the backup routine only serializes .ipynb files, so others will need 
#   to be downloaded again if your compute instance restarts (a small price to pay, right?)

WORKSHOP_BASE = "https://vmlr-workshop.STORAGE"
# WORKSHOP_BASE = "http://content.research.DOMAIN/projects/mlci_2020"
AGG_METADATA = "agg_metadata.pkl.gz"     # custom file for merged metadata
CLASS_DEFINITIONS = "assets/classes.json"     # provided file for class info
CLASS_LABELS_FLAT = "assets/labels_final.json"     # provided file for label info
AGG_TAG_EMBEDDING = "agg_tag_vocab.w2v"  # custom file for tag-based vocabulary
AGG_LABELS = "agg_labels.pkl.gz"         # custom file for merged labels


## Code Dependency Downloads

This section will grab and install other package files required for the 
execution of this workshop.  This may be required if you did not start from the 
all-in-one package download.  

* `packages` - contains installed packages that may not exist in other public repos
* `data` - contains the data that will be used in this workshop in [pickled](https://docs.python.org/3/library/pickle.html) and [hdf5](https://docs.h5py.org/en/stable/) file formats.

In [53]:

import os
from pathlib import Path

ATT_JUPYTER = False
for k in os.environ:   # scan some environment vars
    if "user" in k.lower():   # found user, check setting
        if "DOMAIN" in os.environ[k].lower():
            ATT_JUPYTER = True   # found AT&T, set marker

proxies = None
if ATT_JUPYTER:   # switch for proxy setting
    # os.environ['http_proxy'] = 'http://PROXY:8080'
    # os.environ['https_proxy'] = 'http://PROXY:8080'
    os.environ['no_proxy'] = '*.DOMAIN'
    proxies = {
        "http": "http://pxyapp.proxy.DOMAIN:8080",
        "https": "http://pxyapp.proxy.DOMAIN:8080",
    }
    os.environ['http_proxy'] = proxies['http']
    os.environ['https_proxy'] = proxies['https']

files = {
    "lq-latest-py3-none-any.whl": f"{WORKSHOP_BASE}/packages/lq-latest-py3-none-any.whl"
    , "features_tag.tgz": f"{WORKSHOP_BASE}/packages/features_tag.tgz"
    , "features_binary.tgz": f"{WORKSHOP_BASE}/packages/features_binary.tgz"
}

def remote_download(dict_files, proxies, dir_dest="packages", overwrite=False):
    import requests

    path_dest = Path(dir_dest)
    if not path_dest.exists():
        path_dest.mkdir(parents=True)

    for name, location in dict_files.items():   # use the argument, not the global 'files'
        path_local = path_dest.joinpath(name)
        if path_local.exists() and not overwrite:
            print(f"{str(path_local.resolve())} already exists!")
            continue

        print(f"Getting file '{location}'")
        r = requests.get(location, proxies=proxies, stream=True)
        r.raise_for_status()   # stop early rather than saving an HTTP error page as a package
        print(f"Writing to file {name}")
        with path_local.open('wb') as f:
            for chunk in r.iter_content(4096):
                f.write(chunk)

# consider changing this to True if you have odd install errors
remote_download(files, overwrite=False, proxies=proxies)   
print("... file download complete.")

print("Installing packages...")

# the labelquest client library, this is mostly used in workbook 'B'
!pip install -q --no-cache-dir --no-index packages/lq-latest-py3-none-any.whl

# some visualization helpers for the workshop
!pip install -q --no-cache-dir --no-index ipywidgets

# include basic text mapping utility
!pip install -q --no-cache-dir --no-index spacy

# check out this URL for other text models, but since we're not using it much, a smaller version is okay 
#    https://spacy.io/models/en
!python -m spacy download en_core_web_md -q --no-cache-dir --no-index

print("Expanding features...")
!cd packages && ls *.tgz | xargs -I {} tar -zxf {}

print("...all setup operations complete.")

/Users/quinone/Documents/projects/miracle/ml_hack2020/cmlp/work/packages/lq-latest-py3-none-any.whl already exists!
/Users/quinone/Documents/projects/miracle/ml_hack2020/cmlp/work/packages/features_tag.tgz already exists!
/Users/quinone/Documents/projects/miracle/ml_hack2020/cmlp/work/packages/features_binary.tgz already exists!
... file download complete.
Installing packages...
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')
Expanding features...
...all setup operations complete.


## Exploring Textual Features and Tags

In this section, we'll take our first look at the timed metadata. Specifically, this data has been computed within the [ContentAI](https://www.contentai.io/) platform and downloaded with the steps above.  ContentAI is a flexible cloud-native platform that can accept a content reference and run one or more [extractors](https://www.contentai.io/docs/extractors) to provide metadata, processed video, etc.  

In this workshop, we'll be looking at some of the tags and recognition features that come from the [Azure extractor](https://www.contentai.io/docs/azure-videoindexer-api) which wraps many of the features from the [Azure Video Indexer](https://azure.microsoft.com/en-us/services/media-services/video-indexer/) service in a secure fashion.  
```
content/vmlr-workshop/halloween/vid_halloween_0-13-of-23.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer
content/vmlr-workshop/gifts/vid_gift_give_take_9-2-of-14.mp4/batches/1hl1l0V3BNumdsZc3DaAJd3JlB2/azure_videoindexer
content/vmlr-workshop/xmas/vid_xmas_8-28-of-49.mp4/batches/1hiPieI6mb2Dyzzsg84TCI5cCmM/azure_videoindexer
content/vmlr-workshop/halloween/vid_halloween_7-34-of-56.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer
content/vmlr-workshop/halloween/vid_halloween_7-19-of-56.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer
```

Further, we'll also be looking at a normalized (or flattened) version of the data produced by the [DSAI Metadata Flattener](https://www.contentai.io/docs/dsai_metadata_flatten) ([code repo](https://CODE_SITE/projects/ST_VMLR/repos/contentai-metadata-flatten/browse)) which has been rendered to CSVs.  
```
content/vmlr-workshop/halloween/vid_halloween_0-13-of-23.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/csv_flatten_azure_videoindexer.csv.gz
content/vmlr-workshop/gifts/vid_gift_give_take_9-2-of-14.mp4/batches/1hl1l0V3BNumdsZc3DaAJd3JlB2/dsai_metadata_flatten/csv_flatten_azure_videoindexer.csv.gz
content/vmlr-workshop/xmas/vid_xmas_8-28-of-49.mp4/batches/1hiPieI6mb2Dyzzsg84TCI5cCmM/dsai_metadata_flatten/csv_flatten_azure_videoindexer.csv.gz
content/vmlr-workshop/halloween/vid_halloween_7-34-of-56.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/csv_flatten_azure_videoindexer.csv.gz
content/vmlr-workshop/halloween/vid_halloween_7-19-of-56.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/csv_flatten_azure_videoindexer.csv.gz
```
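
As a rough illustration of what one of these flattened files holds, the sketch below parses a tiny, made-up gzipped CSV with Python's standard `csv` and `gzip` modules.  Only the column names mirror the flattened schema printed later in this notebook; the two rows, their values, and the helper name `read_flatten_csv` are assumptions made for illustration and do not come from the workshop data.

```python
import csv
import gzip
import io

# Hypothetical rows; only the column names mirror the flattened schema.
SAMPLE_CSV = (
    "time_begin,source_event,tag_type,time_end,time_event,tag,score,details,extractor\n"
    "0.0,image,tag,4.5,0.0,Pumpkin,0.92,,azure_videoindexer\n"
    "4.5,image,tag,9.0,4.5,Costume,0.61,,azure_videoindexer\n"
)

def read_flatten_csv(raw_bytes):
    """Decompress gzip bytes and return one dict per CSV row."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", newline="") as handle:
        rows = list(csv.DictReader(handle))
    for row in rows:
        row["tag"] = row["tag"].lower()      # normalize tag text
        row["score"] = float(row["score"])   # scores arrive as strings
    return rows

# Compress the sample in memory so it behaves like a .csv.gz file.
flatten_rows = read_flatten_csv(gzip.compress(SAMPLE_CSV.encode("utf-8")))
for row in flatten_rows:
    print(row["tag"], row["score"], row["extractor"])
```

The notebook itself reads these files with `pd.read_csv`, which handles the gzip compression automatically; the standard-library version here is only meant to make the row layout visible.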


### Aggregating Insights
As an example, let's parse and store the flattened data for Azure output, activity output, and moderation output from the flattener service.


For inquisitive minds, the original data from the extractors is also included, typically as a simple `data.json` in their corresponding directory.  If you've got the hang of it, try to figure out what other extractors have been run for this asset.


```
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_videocnn/data.json
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_videocnn/data.hdf5
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_activity_classifier/data.json
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/csv_flatten_dsai_activity_classifier.csv.gz
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/csv_flatten_azure_videoindexer.csv.gz
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/csv_flatten_dsai_moderation.csv.gz
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/wbTimeTaggedMetadata.json.gz
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_vggish/data.json
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_vggish/data.hdf5
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_moderation_image/data.json
.../1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer/data.csv
.../1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer/data.json
.../1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer/data.ttml
.../1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer/data.txt
.../1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer/data.vtt
.../1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer/data.srt
```

(answer for above...)
* **input path** - `content/vmlr-workshop/halloween/vid_halloween_0-13-of-23.mp4`
* **nested job id** - `batches/1hhadDBuEtRUPd6v8vCr5H3346r`
* **extractor and data file** - `dsai_videocnn/data.json`
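
The breakdown above can be sketched in a few lines with `pathlib`.  This is a minimal example that assumes every result path contains a `batches` component separating the asset from the job id, as in the listings above; the function name `split_result_path` is hypothetical.

```python
from pathlib import PurePosixPath

def split_result_path(path_text):
    """Split a result path into asset, job id, extractor, and data file."""
    parts = PurePosixPath(path_text).parts
    idx = parts.index("batches")            # marker between asset and job id
    return {
        "asset": "/".join(parts[:idx]),
        "job_id": parts[idx + 1],
        "extractor": parts[idx + 2],
        "data_file": "/".join(parts[idx + 3:]),
    }

example = ("content/vmlr-workshop/halloween/vid_halloween_0-13-of-23.mp4/"
           "batches/1hhadDBuEtRUPd6v8vCr5H3346r/dsai_videocnn/data.json")
info = split_result_path(example)
print(info)
```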



In [119]:
import numpy as np
import pandas as pd

path_metadata = Path(AGG_METADATA)
if path_metadata.exists():
    print(f"Skipping re-create of metadata file '{str(path_metadata)}'...")
    df_flatten = pd.read_pickle(str(path_metadata))
else:
    df_flatten = None
    num_files = 0
    path_content = Path("packages/content/vmlr-workshop")
    list_files = list(path_content.rglob("csv_flatten*.csv*"))
    print(f"Ingesting {len(list_files)} flatten files in path '{str(path_content)}'...")
    for path_file in list_files:  # search for flattened files
        df_new = pd.read_csv(path_file)
        # FROM content/vmlr-workshop/halloween/vid_halloween_0-13-of-23.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/csv_flatten_azure_videoindexer.csv.gz -> 
        # TO halloween/vid_halloween_0-13-of-23.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten (relative_to)
        # TO halloween/vid_halloween_0-13-of-23.mp4  (joining base path parts)
        path_asset = Path(*path_file.parent.relative_to(path_content).parts[:2])
        df_new['tag'] = df_new['tag'].str.lower()   # lower case the tags
        df_new['details'] = df_new['details'].fillna('').str.lower()   # lower case the enhanced information
        df_new['asset'] = str(path_asset)
        if df_flatten is None:   # first one we saw
            df_flatten = df_new
        else:
            df_flatten = pd.concat([df_flatten, df_new], ignore_index=True)   # append new dataframe (DataFrame.append was removed in pandas 2.0)
        num_files += 1
        if num_files % 500 == 0:
            print(f"... read {num_files}...")
    df_flatten.reset_index(drop=True, inplace=True)  # drop prior index
    df_flatten.to_pickle(str(path_metadata))
    print(f"Wrote {num_files} aggregations to file '{str(path_metadata)}'...")

print(f"New columns in this data... {list(df_flatten.columns)}")


Skipping re-create of metadata file 'agg_metadata.pkl.gz'...
New columns in this data... ['time_begin', 'source_event', 'tag_type', 'time_end', 'time_event', 'tag', 'score', 'details', 'extractor', 'asset']


### Plotting tags
Let's plot some statistics about tags, both their numbers and their names.  First, a histogram of how many unique and total tags were present for an asset, which helps us find the typical number of tags per asset in both raw and unique counts.  Second, a bar chart of the top `N` tags, showing each tag's total count and the number of assets it appears in.
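
To make the difference between "unique" and "total" tags concrete, here is a small sketch using only the standard library.  The `(asset, tag)` pairs are invented for illustration; in the notebook the same counts are derived from the flattened metadata with pandas `groupby`.

```python
from collections import Counter, defaultdict

# Hypothetical (asset, tag) detections; real rows come from the flattened metadata.
detections = [
    ("halloween/a.mp4", "pumpkin"),
    ("halloween/a.mp4", "pumpkin"),
    ("halloween/a.mp4", "costume"),
    ("xmas/b.mp4", "tree"),
]

# Count how often each tag occurs within each asset.
tags_by_asset = defaultdict(Counter)
for asset, tag in detections:
    tags_by_asset[asset][tag] += 1

# Unique tags = number of distinct tags; total tags = number of detections.
tag_stats = {
    asset: {"unique tags": len(counts), "total tags": sum(counts.values())}
    for asset, counts in tags_by_asset.items()
}
print(tag_stats)
```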

In [44]:
import pylab as pl
import ipywidgets as widgets

# this is a handy update function
def tag_count_hist(x):
    x = (round(x[0], 2), round(x[1], 2))
    df_sub = df_flatten[(df_flatten['score'] >= x[0]) & (df_flatten['score'] <= x[1])]
    df_pairs = df_sub.groupby(['asset','tag']).count()['score'].reset_index()   # group by two params, reset into dataframe
    df_unitags = df_pairs.groupby(['asset'])['score'].agg(['count','sum']).reset_index()   # group by asset to find unique tag count per asset
    df_unitags.rename(columns={"count":"unique tags", "sum":"total tags"}, inplace=True)
    # print(df_unitags)
    ax = df_unitags[['unique tags', 'total tags']].plot.hist(bins=40, figsize=(10,4), alpha=0.75)   # one histogram over all assets
    pl.title(f"Histogram of Unique Tags Among Assets ({x[0]} <= Score <= {x[1]})")
    pl.ylabel('number of assets')
    pl.xlabel('count of tags')
    pl.grid()
    pl.show()
    
    df_pairs = df_sub.groupby(['tag','asset']).count()['score'].reset_index()   # group by two params, reset into dataframe
    df_unitags = df_pairs.groupby(['tag'])['score'].agg(['count','sum']).reset_index()   # group by tag to find asset count and total count per tag
    df_unitags.sort_values('sum', ignore_index=True, inplace=True, ascending=False)
    df_unitags.rename(columns={"count":"Asset Frequency", "sum":"Total Frequency"}, inplace=True)
    top_n = 20
    df_topn = df_unitags.iloc[:top_n]
    ax = df_topn.plot.barh(x='tag', figsize=(10,4), width=0.8, log=True)
    pl.title(f"Total and Asset Frequency of {top_n} Most Frequent Tags ({x[0]} <= Score <= {x[1]})")
    pl.ylabel('tag text')
    pl.xlabel('count of tags')
    pl.grid()
    pl.show()
    
    

# get an interactive widget/graph
widgets.interact(tag_count_hist, x=widgets.FloatRangeSlider(
    value=[0.5, 1.0],
    step=0.05,
    min=df_flatten['score'].min(),
    max=df_flatten['score'].max(),
    description='Score Range:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='.2f',   # two decimals so the 0.05 step is visible in the readout
))



# Contextual Ads

One goal of combining timed metadata and content is better alignment of content from WarnerMedia with ads from Xandr creatives.  The image below demonstrates one example where a detected keyword or scene can trigger a related ad.  Without this technology, this ad spot (or inventory) may be undersold and filled with an unrelated or standard campaign ad.

![Contextual Ad Product](assets/mlci_contextual.jpg)

## Class Exploration
Let's quickly load and display the target classes used in this experiment. The table below indicates the class and the definition utilized for our classifier.  The field `primary` indicates whether or not a class will be used for performance evaluations.  Some non-primary classes were also included for additional experimentation, but they will not be the focus here.

In [121]:
import json
import pandas as pd
with open(CLASS_DEFINITIONS, 'r') as f:
    obj_classes = json.load(f)
pd.set_option('display.max_colwidth',1000)
df_classes = pd.read_json(CLASS_DEFINITIONS)
df_classes

|   | class | definition | primary |
|---|---|---|---|
| 0 | holiday | holiday scenes or objects like decorated trees, presents, or character, holiday party | 1 |
| 1 | halloween | halloween scenes where one or more characters are in costume, ideally one or more characters are trick-or-treating | 1 |
| 2 | gift giving | scenes of gift giving, receiving, or opening/unwrapping | 1 |
| 3 | family moments | at least two people on screen, typically familes at parties, enjoying a meal, lounging at home | 0 |
| 4 | shopping scenes | one or more primary actors in a store-like environment; not necessary to see their face | 0 |


### Result Browsing
Below, we've created a simple method to view keyframe results from the video assets used in this workshop.  It uses simple widgets (as we did above) and displays them in a columnar form in this notebook.  In addition to numerical performance (accuracy, AUC, etc.), we can visually inspect the performance of a classifier model with this utility.

In [335]:
import ipywidgets as widgets
from IPython.display import display
import random

def display_hit_html(path_result, score, label=1, base=WORKSHOP_BASE):
    '''
     Build the HTML snippet for one result: a keyframe image, linked to the
     image URL, with the score printed beneath it.  The score text is black
     when label==1 and red otherwise.  Adjust the <img> attributes below to
     control the height, aspect ratio, size, etc.
    '''
    path_parts = path_result.split('/')
    url_full = f"{base}/{path_parts[0]}/keyframe/{path_parts[1]}.jpg"
    text_color = f" style='color:{'black' if label==1 else 'red'}' "
    return f"""
        <a href='{url_full}' title='{path_result}' target='_new'><img width='100%' src='{url_full}' /></a>
        <small {text_color}><strong>{score:0.4}</strong></small>
        """

def results_display(df, max_results=32, num_cols=8, base=WORKSHOP_BASE, title="Example Results"):
    """Input DataFrame with 'asset' and 'score' and show those results in columnar format.
    Providing an additional column 'label' [as 0 or 1] will allow incorrect items to be shown in red."""
    width_col = f"{round(100/num_cols)}%"
    #list_results.sort(reverse=True, key=lambda x: x[-1]) 
    #list_html = [widgets.HTML(path_to_image_html(*x, base=base)) for x in list_results]
    df = df.sort_values("score", ascending=False).head(max_results)
    if "label" not in df.columns:
        df["label"] = 1
    list_html = [widgets.HTML(display_hit_html(x['asset'], x['score'], x['label'], base=base)) for i,x in df.iterrows()]
    display(widgets.VBox([widgets.HTML(f"<h3>{title}</h3>"), widgets.GridBox(list_html,
               layout=widgets.Layout(grid_template_columns=f"repeat({num_cols}, {width_col})"))]))   # honor num_cols instead of a fixed 8

# demo of input as a single result, but with random scores
df_demo = pd.DataFrame({"asset":["xmas/vid_xmas_9-62-of-67.mp4"]*16, 
                        "score":np.random.rand(1, 16)[0],
                         "label":np.random.randint(2, size=16)})
results_display(df_demo, title="Results demo")
    



### Objective Result Scoring
Similar to the code above, this function can be used for evaluation of models by a few different metrics -- but here it is aimed at objective, numerical methods.  For convenience, this method can also plot the results of both objective and subjective scoring.
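
Before turning to the scikit-learn version, it may help to see what the thresholded metrics mean.  The sketch below recomputes accuracy, precision, recall, and F1 in plain Python with the same 0.5 cutoff used by this notebook's scoring function.  The labels and scores are invented, and the helper name `threshold_metrics` is hypothetical; it is not a replacement for the `sklearn.metrics` calls.

```python
def threshold_metrics(labels, scores, cutoff=0.5):
    """Compute accuracy, precision, recall, and F1 after thresholding scores."""
    preds = [1 if s >= cutoff else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = len(labels) - tp - fp - fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "Accuracy": (tp + tn) / len(labels),
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }

# Hypothetical ground-truth labels and classifier scores.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.1]
toy_metrics = threshold_metrics(toy_labels, toy_scores)
print(toy_metrics)
```

AP and AUC are different in kind: they are computed over the full ranking of scores rather than at a single cutoff, which is why the notebook leaves them to scikit-learn.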

In [333]:
from sklearn import metrics
import matplotlib.pyplot as plt

def classifier_score(df_prediction, df_labels, class_name):
    """Function to provide metric outputs for the evaluation of a prediction dataframe.
    
    Parameters:
        df_prediction (DataFrame): dataframe containing 'asset' and 'score' as columns
        df_labels (DataFrame): dataframe containing 'asset', 'label', 'etag', and 'url' columns (the last two are dropped)
        class_name (str): class name for evaluation against labels

    Returns:
        dict of metrics (AP, AUC, Accuracy, Recall, F1) ({"AP":X, "class":Y, ...}) and joined dataframe
    """
    metrics_obj = {"class":class_name}
    
    # clean up input labels, prune to relevant class
    df_labels = df_labels[df_labels["label"] == class_name].drop(columns=["etag", "url"]) 
    # join labels and scores by asset, then binarize the label
    df_join = df_prediction.set_index('asset').join(df_labels.set_index('asset'), how="left").fillna(0)  # join at asset level, 0 where an asset has no label for this class
    df_join["label"] = df_join["label"].apply(lambda x: 1 if x != 0 else 0).astype(int)
    df_join = df_join.reset_index().sort_values("score", ascending=False)

    print(f"{class_name}: Found {len(df_join)} samples from {len(df_labels)} labels and {len(df_prediction)} scores.")

    def thresh(x):
        return 1 if x >= 0.5 else 0
    
    metrics_obj["AP"] = metrics.average_precision_score(df_join['label'], df_join['score'])
    fpr, tpr, thresholds = metrics.roc_curve(df_join['label'], df_join['score'])
    metrics_obj["AUC"] = metrics.auc(fpr, tpr)
    metrics_obj["Accuracy"] = metrics.accuracy_score(df_join['label'], df_join['score'].apply(thresh))
    metrics_obj["Recall"] = metrics.recall_score(df_join['label'], df_join['score'].apply(thresh))
    metrics_obj["F1"] = metrics.f1_score(df_join['label'], df_join['score'].apply(thresh))
    print(f"{class_name}: {metrics_obj}")
        
    # return our computation!
    return metrics_obj, df_join

def classifier_plot(metrics_obj, df_scored):
    fpr, tpr, thresholds = metrics.roc_curve(df_scored['label'], df_scored['score'])
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

    lw = 2
    ax1.plot(fpr, tpr, color='darkorange',
             lw=lw, label=f"ROC curve (area = {metrics_obj['AUC']:0.2})")
    ax1.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    ax1.set_xlim([0.0, 1.0])
    ax1.set_ylim([0.0, 1.05])
    ax1.set_xlabel('False Positive Rate')
    ax1.set_ylabel('True Positive Rate')
    ax1.legend(loc="lower right")

    precision, recall, thresholds = metrics.precision_recall_curve(df_scored['label'], df_scored['score'])
    ax2.plot(recall, precision, color='red',
             lw=lw, label=f"PR Curve (F1 = {metrics_obj['F1']:0.2})")
    ax2.plot([1, 0], [0, 1], color='navy', lw=lw, linestyle='--')
    ax2.set_xlim([0.0, 1.0])
    ax2.set_ylim([0.0, 1.05])
    ax2.set_ylabel('Precision')
    ax2.set_xlabel('Recall')
    ax2.legend(loc="upper right")
    plt.show()
    
# read label data for use later!
df_labels = pd.read_json(CLASS_LABELS_FLAT).explode('labels').rename(columns={"data":"url", "labels":"label"})
df_labels["asset"] = df_labels['url'].replace(regex={r'^' + WORKSHOP_BASE + '/': ''})
print(f"Loaded a total of {len(df_labels)} labels across {len(df_labels['asset'].unique())} samples and {len(df_labels['label'].unique())} labels.")


Loaded a total of 826 labels across 520 samples and 7 labels.


## Classifiers 1, 2: Classifier By Text
Our first classifier model will be constructed by using classical mapping of the class definitions through text search and mapping to tag names.  Using the classes loaded above, let's perform a quick mapping.  

* **text2doc** - this function will allow us to strip out [stop words](https://www.analyticsvidhya.com/blog/2019/08/how-to-remove-stopwords-text-normalization-nltk-spacy-gensim-python/); we'll use it both for the `raw match` and the `nlp embedding` models
* **doc2vec** - this function will take a filtered text string and produce a numerical embedding

This processing will be used to test the performance of (1) direct string matching (e.g. find a tag `gift` that matches the definition of `gift`) and (2) embedding-based matching by `k` nearest neighbors.
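
The contrast between the two strategies can be sketched with toy data.  In the example below, the three-dimensional vectors, the tag names, and the helper names are all assumptions made for illustration (the spaCy vectors used in this notebook have 300 dimensions); the point is only that a string match can come back empty while a cosine-similarity nearest-neighbor search still finds related tags.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = (math.sqrt(sum(a * a for a in vec_a))
            * math.sqrt(sum(b * b for b in vec_b)))
    return dot / norm if norm else 0.0

def nearest_tags(query_vec, tag_vectors, k=2):
    """Return the k tag names whose vectors are most similar to query_vec."""
    ranked = sorted(
        tag_vectors,
        key=lambda tag: cosine_similarity(query_vec, tag_vectors[tag]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical 3-d embeddings standing in for real tag vectors.
toy_tags = {
    "present": [0.9, 0.1, 0.0],
    "wrapping paper": [0.7, 0.3, 0.1],
    "pumpkin": [0.0, 0.2, 0.9],
}
class_vec = [1.0, 0.0, 0.0]   # stand-in for the embedded class 'gift giving'

# (1) direct string matching: no tag contains the word 'gift'
direct_hits = [tag for tag in toy_tags if "gift" in tag.split()]
# (2) embedding-based matching: rank tags by similarity to the class vector
neighbor_hits = nearest_tags(class_vec, toy_tags, k=2)
print(direct_hits, neighbor_hits)
```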


In [170]:
import spacy
from spacy.vocab import Vocab
from spacy.tokenizer import Tokenizer
from scipy import spatial

def text2doc(nlp, tag_raw):
    """Given a raw text input, tokenize it to remove stop words"""
    return [x for x in nlp(tag_raw) if not x.is_stop and not x.is_punct]

def doc2vec(nlp, doc, target_domain=None):
    """Given a specific model, map a cleaned line of text (tokens) into an output embedding space"""
    # https://spacy.io/usage/vectors-similarity
    if target_domain is None:
        target_domain = nlp.vocab
    if not isinstance(doc, list):
        doc = [doc]
    tag_doc = None
    for token in doc:
        token_text = token if isinstance(token, str) else token.text   # accept raw strings or spaCy tokens
        tag_id = target_domain.strings[token_text]
        if tag_id in target_domain.vectors:   # search existing one
            new_vec = target_domain.vectors[tag_id]
        elif isinstance(token, str):
            new_vec = nlp(token).vector
        else:
            new_vec = token.vector
        if tag_doc is None:
            tag_doc = np.array(new_vec, copy=True)   # copy so the in-place sum below never edits the vocab table
        else:
            tag_doc += new_vec
    return tag_doc

nlp = spacy.load('en_core_web_md')
# doc = text2doc(nlp, "this is a phrase to clean")
# vec = doc2vec(nlp, doc)

# let's plow through our class dataframe and do the cleaning and mapping
# print([x for x in df_classes[df_classes['primary']==1].iloc])
print("Tokenizing and embedding classes...")
list_tokens = [""] * len(df_classes)
list_vect = [np.ndarray((1, nlp.vocab.vectors.shape[1]))] * len(df_classes)
for idx in range(len(df_classes)):  # iterate row indexes
    row = df_classes.iloc[idx]
    # tokenize to remove stop words and punctuation, return tokens
    doc = text2doc(nlp, row["definition"]) + text2doc(nlp, row["class"])
    list_tokens[idx] = [str(x) for x in doc]
    # convert to an embedding array
    list_vect[idx] = doc2vec(nlp, doc)
df_classes["token"] = list_tokens
df_classes["embedding"] = list_vect

# now collect all of the tags that match simple text bag
print("Lookup specific tags by match...")
list_tags = df_flatten['tag'].unique()
map_tokens = {}
embed_tokens = np.ndarray((len(list_tags), nlp.vocab.vectors.shape[1]))
for idx in range(len(list_tags)):   # iterate through full list of known tags
    doc = text2doc(nlp, list_tags[idx])
    for x in doc:  # create map/reference to each term
        x = str(x)
        if x not in map_tokens:  # first time to see this tag?
            map_tokens[x] = []
        map_tokens[x].append(idx)   # save reference to original tag set
    embed_tokens[idx, :] = doc2vec(nlp, doc)  # compute embedding 

# df_classes["token"] = list_tokens
# df_classes["embedding"] = list_vect
print(f"Count of new text-based mapping: {len(map_tokens)}...")
print(f"Shape of embedded tag matrix: {embed_tokens.shape}...")



Tokenizing and embedding classes...
Lookup specific tags by match...
Count of new text-based mapping: 1804...
Shape of embedded tag matrix: (1763, 300)...


### Classifier 1
For each asset, the "classifier 1" score is the `max()` of the scores over all tags that matched the class tokens.
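
As a minimal sketch of that rule, the example below keeps only rows whose tag is in the matched set and records the highest score seen for each asset.  The rows and the matched tag set are invented for illustration; the notebook performs the same reduction with a pandas `groupby(...).max()`.

```python
# Hypothetical rows: (asset, tag, score)
rows = [
    ("xmas/a.mp4", "gift", 0.40),
    ("xmas/a.mp4", "present", 0.85),
    ("xmas/b.mp4", "gift", 0.30),
    ("xmas/b.mp4", "tree", 0.99),   # not a matched tag, so it is ignored
]
matched_tags = {"gift", "present"}   # assumed output of the token lookup

# Keep the highest score per asset among the matched tags only.
asset_scores = {}
for asset, tag, score in rows:
    if tag in matched_tags:
        asset_scores[asset] = max(score, asset_scores.get(asset, 0.0))
print(asset_scores)
```

Note that an asset with no matched tag never enters the result at all, which is why the "Found N samples" counts printed by the scoring cell can be smaller than the number of labels.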

In [337]:
dict_results = {}
for idx in range(len(df_classes)):  # iterate classes for evaluation
    row = df_classes.iloc[idx]
    if row["primary"] != 1:   # skip non-priority items
        continue
    tag_match = []
    for x in row["token"]:  # scan our tokens for this class
        if x in map_tokens:   # now check for hits in tokenized tags
            tag_match += map_tokens[x]   # found one? save all of the original references
    tag_match = [list_tags[x] for x in tag_match]   # dereference to actual tag text
    # print(tag_match)
    df_sub = df_flatten[df_flatten['tag'].isin(tag_match)]  # grab dataframe that matched one tag
    #print(df_sub)
    df_score = df_sub.groupby(['asset'])['score'].max().reset_index(drop=False)
    #print(df_score, row["class"])

    metrics_obj, df_scored = classifier_score(df_score, df_labels, row['class'])
    dict_results[row['class']] = {'metrics': metrics_obj, 'scored': df_scored}

# create a quick interaction grid for display
def result_visualize(class_name):
    classifier_plot(dict_results[class_name]['metrics'], dict_results[class_name]['scored'])
    results_display(dict_results[class_name]['scored'], 16, title=class_name)
    
# use widget interaction basic    
widgets.interactive(result_visualize, class_name=widgets.Dropdown(
    options=list(dict_results.keys()),
    value=list(dict_results.keys())[0],
    description='Class:',
    disabled=False,
))



holiday: Found 121 samples from 182 labels and 121 scores.
holiday: {'class': 'holiday', 'AP': 0.431933284866166, 'AUC': 0.5834876543209877, 'Accuracy': 0.33884297520661155, 'Recall': 0.975, 'F1': 0.49367088607594933}
halloween: Found 47 samples from 150 labels and 47 scores.
halloween: {'class': 'halloween', 'AP': 0.4147021310041902, 'AUC': 0.5478927203065134, 'Accuracy': 0.3829787234042553, 'Recall': 0.8888888888888888, 'F1': 0.5245901639344261}
gift giving: Found 25 samples from 115 labels and 25 scores.
gift giving: {'class': 'gift giving', 'AP': 0.3350140056022409, 'AUC': 0.6403508771929824, 'Accuracy': 0.32, 'Recall': 1.0, 'F1': 0.41379310344827586}



In [197]:

print(df_labels)


                         etag              label  \
0    a039cb37f290fe4a4127bbd2            holiday   
0    a039cb37f290fe4a4127bbd2        gift giving   
0    a039cb37f290fe4a4127bbd2     family moments   
1    b6dbe02fdf00f08a61a44b8f            holiday   
1    b6dbe02fdf00f08a61a44b8f        gift giving   
..                        ...                ...   
515  bf3731149d36b7af5703e1ea          halloween   
516  d2e47869a53995007d279ac4          halloween   
517  1a9528594a26c2645da8076a  none of the above   
518  ac3eb9016c3b19c85006b0ac            holiday   
519  5f0dc388e0964e590a69f166  none of the above   

                                                                                url  \
0     https://vmlr-workshop.STORAGE/gifts/vid_gift_give_take_8-4-of-12.mp4   
0     https://vmlr-workshop.STORAGE/gifts/vid_gift_give_take_8-4-of-12.mp4   
0     https://vmlr-workshop.STORAGE/gifts/vid_gift_give_take_8-4-of-12.mp4   
1    https://vmlr-workshop.STORAGE/gifts/vid_gift_give

In [None]:

print("Tokenizing and embedding tags...")

def list2vocab(nlp, list_tags, output_file):
    """Given a specific model, map the file set (single line of text) into an output embedding space"""
    vocab = Vocab()
    idx = 0
    for tag_raw in list_tags:   # for each tag input
        tag_raw = tag_raw.lower()
        tag_id = nlp.vocab.strings[tag_raw]
        if tag_id not in nlp.vocab:   # search existing one
            tag_doc = nlp(tag_raw)
            vocab.set_vector(tag_raw, tag_doc.vector)
        else:
            vocab.set_vector(tag_raw, nlp.vocab[tag_id].vector)
        idx += 1
        if (idx % 5000) == 0:
            print(f"list2vocab: [{idx}/{len(list_tags)}] processing complete")
    # write a w2v file
    path_target = Path(output_file).resolve()
    vocab.to_disk(str(path_target))
    return vocab

AGG_TAG_EMBEDDING

# keys, rows, score = nlp.vocab.vectors.most_similar(query_vect, batch_size=2048, n=5)
# print(keys, rows, score)    
    

In [None]:
########################################################################
# This block only needs to be run once.  If your notebook is inactive  #
#     for an extended period of time and restarts, then you may have   #
#     to re-run this again to reinstall these libraries.               #
########################################################################
import os


# Some commands to get things running
!pip install contentai_metadata_flatten
!pip install git+https://CODE_SITE/scm/st_lq/pylq.git

# See custom pypi package creation here:
# !pip install --extra-index-url https://repocentral.it.DOMAIN:8443/nexus/repository/att-pypi/ contentai_metadata_flatten


In [None]:
for k in os.environ:
    if "user" in k.lower() or "att" in k.lower() :
        print(k, os.environ[k])