![Machine Learning Workshop: Content Insights 2020](assets/mlci_banner.jpg)

# Machine Learning Workshop: Content Insights 2020

Welcome to the workshop notebooks!  These notebooks walk you through the steps of creating a model, refining it with user labels, and testing it on content.  You can access the main [workshop forum page](https://INFO_SITE/forums/html/forum?id=241a0b77-7aa6-4fef-9f25-5ea351825725&ps=25), the [workshop files repo](https://INFO_SITE/communities/service/html/communityview?communityUuid=fb400868-b17c-44d8-8b63-b445d26a0be4#fullpageWidgetId=W403a0d6f86de_45aa_8b67_c52cf90fca16&folder=d8138bef-9182-4bdc-8b12-3c88158a219c), or the [symposium home page](https://software.web.DOMAIN) for additional help.

The notebooks are divided into five core components: (A) setup & data, (B) model exploration, (C) labeling, (D) active labeling, and (E) deployment.  You are currently viewing the *setup & data* workbook.

In [1]:
# constants for running the workshop; we'll repeat these in the top line of each workbook.
#   why repeat them? the backup routine only serializes .ipynb files, so others will need 
#   to be downloaded again if your compute instance restarts (a small price to pay, right?)

WORKSHOP_BASE = "https://vmlr-workshop.STORAGE"
# WORKSHOP_BASE = "http://content.research.DOMAIN/projects/mlci_2020"
AGG_METADATA = "models/agg_metadata.pkl.gz"     # custom file for merged metadata

## Code Dependency Downloads

This section will grab and install other package files required for the 
execution of this workshop.  This may be required if you did not start from the 
all-in-one package download.  

* `packages` - contains installed packages that may not exist in other public repos
* `data` - contains the data that will be used in this workshop in [pickled](https://docs.python.org/3/library/pickle.html) and [hdf5](https://docs.h5py.org/en/stable/) file formats.

In [2]:

import os
from pathlib import Path

ATT_JUPYTER = False
for k in os.environ:   # scan some environment vars
    if "user" in k.lower():   # found user, check setting
        if "DOMAIN" in os.environ[k].lower():
            ATT_JUPYTER = True   # found AT&T, set marker

proxies = None
if ATT_JUPYTER:   # switch for proxy setting
    # os.environ['http_proxy'] = 'http://PROXY:8080'
    # os.environ['https_proxy'] = 'http://PROXY:8080'
    os.environ['no_proxy'] = '*.DOMAIN'
    proxies = {
        "http": "http://pxyapp.proxy.DOMAIN:8080",
        "https": "http://pxyapp.proxy.DOMAIN:8080",
    }
    os.environ['http_proxy'] = proxies['http']
    os.environ['https_proxy'] = proxies['https']

files = {
    "lq-latest-py3-none-any.whl": f"{WORKSHOP_BASE}/packages/lq-latest-py3-none-any.whl"
    , "features_tag.tgz": f"{WORKSHOP_BASE}/packages/features_tag.tgz"
    , "features_binary.tgz": f"{WORKSHOP_BASE}/packages/features_binary.tgz"
    , "features_imdb_5000.csv.tgz":  f"{WORKSHOP_BASE}/packages/features_imdb_5000.csv.tgz"
}

def remote_download(dict_files, proxies, dir_dest="packages", overwrite=False):
    import requests

    path_dest = Path(dir_dest)
    if not path_dest.exists():
        path_dest.mkdir(parents=True)

    for name, location in dict_files.items():   # iterate the argument, not the global 'files'
        path_local = path_dest.joinpath(name)
        if path_local.exists() and not overwrite:
            print(f"{str(path_local.resolve())} already exists!")
            continue

        print(f"Getting file '{location}'")
        r = requests.get(location, proxies=proxies, stream=True)
        r.raise_for_status()   # stop here rather than saving an HTTP error page as the file
        print(f"Writing to file {name}")
        with path_local.open('wb') as f:
            for chunk in r.iter_content(4096):
                f.write(chunk)

# consider changing this to True if you have odd install errors
remote_download(files, overwrite=False, proxies=proxies)   
print("... file download complete.")

print("Installing packages...")

# the labelquest client library, this is mostly used in workbook 'C'
!pip install -q --no-cache-dir --no-index --upgrade packages/lq-latest-py3-none-any.whl
!pip freeze | grep lq

# some visualization and data management helpers for the workshop
!pip install --no-cache-dir --upgrade pandas sklearn numpy
!pip install --no-cache-dir ipywidgets scipy  

# include basic text mapping utility and loading low level features
!pip install -U --no-cache-dir spacy h5py

# check out this URL for other text models, but since we're not using it much, a smaller version is okay 
#    https://spacy.io/models/en
!python -m spacy download en_core_web_md --no-cache-dir 

print("Expanding features...")
!cd packages && ls *.tgz | xargs -I {} tar -zxf {}

print("...all setup operations complete.")

/Users/quinone/Documents/projects/miracle/ml_hack2020/cmlp/work/packages/lq-latest-py3-none-any.whl already exists!
/Users/quinone/Documents/projects/miracle/ml_hack2020/cmlp/work/packages/features_tag.tgz already exists!
/Users/quinone/Documents/projects/miracle/ml_hack2020/cmlp/work/packages/features_binary.tgz already exists!
/Users/quinone/Documents/projects/miracle/ml_hack2020/cmlp/work/packages/features_imdb_5000.csv.tgz already exists!
... file download complete.
Installing packages...
lq==0.2.3
Collecting pandas
  Downloading https://files.pythonhosted.org/packages/a3/b9/b6e214ef4d4cc4eca5918011f5237ee646792f23b4acce325f2d60b10373/pandas-1.1.2-cp36-cp36m-macosx_10_9_x86_64.whl (10.6MB)
Requirement already up-to-date: sklearn in /Users/quinone/anaconda/envs/cognita36/lib/python3.6/site-packages (0.0)
Collecting numpy
  Downloading https://files.pythonhosted.org/packages/be/8e/800113bd3a0c9

Collecting spacy
  Downloading https://files.pythonhosted.org/packages/f2/49/76009c236922eb94903725760c2d42d5f211af10a7ff687840ecbcbf4acd/spacy-2.3.2-cp36-cp36m-macosx_10_9_x86_64.whl (10.2MB)
Requirement already up-to-date: h5py in /Users/quinone/anaconda/envs/cognita36/lib/python3.6/site-packages (2.10.0)
Installing collected packages: spacy
  Found existing installation: spacy 2.3.1
    Uninstalling spacy-2.3.1:
      Successfully uninstalled spacy-2.3.1
Successfully installed spacy-2.3.2


✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')
Expanding features...
...all setup operations complete.


# Notebook A: Initializing Data and Features

Ready to get started *(technically you already did)*?!  In this section we'll explore timed metadata and merge it into some more easily usable [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html).  Specifically, we'll merge the raw output from many assets and content analysis tools.  Mmkay, let's go!

## Exploring Textual Features and Tags

In this section, we'll take our first look at the timed metadata. Specifically, this data has been computed within the [ContentAI](https://www.contentai.io/) platform and downloaded with the steps above.  ContentAI is a flexible cloud-native platform that can accept a content reference and run one or more [extractors](https://www.contentai.io/docs/extractors) to provide metadata, processed video, etc.  

In this workshop, we'll be looking at some of the tags and recognition features that come from the [Azure extractor](https://www.contentai.io/docs/azure-videoindexer-api) which wraps many of the features from the [Azure Video Indexer](https://azure.microsoft.com/en-us/services/media-services/video-indexer/) service in a secure fashion.  
```
content/vmlr-workshop/halloween/vid_halloween_0-13-of-23.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer
content/vmlr-workshop/gifts/vid_gift_give_take_9-2-of-14.mp4/batches/1hl1l0V3BNumdsZc3DaAJd3JlB2/azure_videoindexer
content/vmlr-workshop/xmas/vid_xmas_8-28-of-49.mp4/batches/1hiPieI6mb2Dyzzsg84TCI5cCmM/azure_videoindexer
content/vmlr-workshop/halloween/vid_halloween_7-34-of-56.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer
content/vmlr-workshop/halloween/vid_halloween_7-19-of-56.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer
```

Further, we'll also be looking at a normalized (or flattened) version of the data produced by the [DSAI Metadata Flattener](https://www.contentai.io/docs/dsai_metadata_flatten) ([code repo](https://CODE_SITE/projects/ST_VMLR/repos/contentai-metadata-flatten/browse)) which has been rendered to CSVs.  
```
content/vmlr-workshop/halloween/vid_halloween_0-13-of-23.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/csv_flatten_azure_videoindexer.csv.gz
content/vmlr-workshop/gifts/vid_gift_give_take_9-2-of-14.mp4/batches/1hl1l0V3BNumdsZc3DaAJd3JlB2/dsai_metadata_flatten/csv_flatten_azure_videoindexer.csv.gz
content/vmlr-workshop/xmas/vid_xmas_8-28-of-49.mp4/batches/1hiPieI6mb2Dyzzsg84TCI5cCmM/dsai_metadata_flatten/csv_flatten_azure_videoindexer.csv.gz
content/vmlr-workshop/halloween/vid_halloween_7-34-of-56.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/csv_flatten_azure_videoindexer.csv.gz
content/vmlr-workshop/halloween/vid_halloween_7-19-of-56.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/csv_flatten_azure_videoindexer.csv.gz
```
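
To get a feel for these flattened files before aggregating them, here is a minimal sketch that reads a gzip-compressed CSV using only the standard library. The rows are made up for illustration and only a subset of the real columns is shown; the helper name `read_flattened` is hypothetical and not part of the workshop code.

```python
import csv
import gzip
import io

# Made-up rows for illustration; only the column names mirror the workshop data.
SAMPLE_CSV = (
    "time_begin,time_end,tag_type,tag,score,extractor\n"
    "0.0,4.5,tag,Pumpkin,0.92,azure_videoindexer\n"
    "4.5,9.0,tag,Person,0.71,azure_videoindexer\n"
)

def read_flattened(raw_bytes):
    """Decompress gzip bytes and parse the CSV into a list of dict rows."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", newline="") as handle:
        rows = list(csv.DictReader(handle))
    for row in rows:
        row["tag"] = row["tag"].lower()      # normalize tag case
        row["score"] = float(row["score"])   # CSV values arrive as strings
    return rows

# Compress the sample in memory so the sketch needs no files on disk.
rows = read_flattened(gzip.compress(SAMPLE_CSV.encode("utf-8")))
print(rows[0]["tag"], rows[0]["score"])
```

In the notebook itself, `pd.read_csv` handles the `.gz` decompression automatically, so this is only meant to show what is inside one of the files.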


### Aggregating Insights
As an example, let's parse and store the flattened data for Azure output, activity output, and moderation output from the flattener service.


For inquisitive minds, the original data from the extractors is also included, typically as a simple `data.json` in the corresponding directory.  If you've got the hang of it, try to figure out what other extractors have been run for this asset.


```
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_videocnn/data.json
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_videocnn/data.hdf5
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_activity_classifier/data.json
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/csv_flatten_dsai_activity_classifier.csv.gz
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/csv_flatten_azure_videoindexer.csv.gz
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/csv_flatten_dsai_moderation.csv.gz
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/wbTimeTaggedMetadata.json.gz
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_vggish/data.json
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_vggish/data.hdf5
.../1hhadDBuEtRUPd6v8vCr5H3346r/dsai_moderation_image/data.json
.../1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer/data.csv
.../1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer/data.json
.../1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer/data.ttml
.../1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer/data.txt
.../1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer/data.vtt
.../1hhadDBuEtRUPd6v8vCr5H3346r/azure_videoindexer/data.srt
```

(answer for above: each path breaks down as follows, so the extractor is the directory name right after the job id)
* **input path** - `content/vmlr-workshop/halloween/vid_halloween_0-13-of-23.mp4`
* **nested job id** - `batches/1hhadDBuEtRUPd6v8vCr5H3346r`
* **extractor and data file** - `dsai_videocnn/data.json`
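
The same breakdown can be sketched in code. This is a small illustration, assuming every result path contains a literal `batches` segment as in the listings above; the function name `split_result_path` is hypothetical.

```python
from pathlib import PurePosixPath

def split_result_path(path_str):
    """Split '<input path>/batches/<job id>/<extractor>/<data file>' into its parts."""
    parts = PurePosixPath(path_str).parts
    idx = parts.index("batches")   # marker between the input path and the job id
    return {
        "input_path": "/".join(parts[:idx]),
        "job_id": parts[idx + 1],
        "extractor": parts[idx + 2],
        "data_file": "/".join(parts[idx + 3:]),
    }

example = ("content/vmlr-workshop/halloween/vid_halloween_0-13-of-23.mp4/"
           "batches/1hhadDBuEtRUPd6v8vCr5H3346r/dsai_videocnn/data.json")
info = split_result_path(example)
print(info["extractor"], info["data_file"])
```

Collecting `info["extractor"]` over every path in a listing and taking the unique values is one way to find which extractors were run for an asset.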



In [3]:
import numpy as np
import pandas as pd

path_metadata = Path(AGG_METADATA)
if path_metadata.exists():
    print(f"Skipping re-create of metadata file '{str(path_metadata)}'...")
    df_flatten = pd.read_pickle(str(path_metadata))
else:
    df_flatten = None
    num_files = 0
    path_content = Path("packages/content/vmlr-workshop")
    list_files = list(path_content.rglob("csv_flatten*.csv*"))
    print(f"Ingesting {len(list_files)} flatten files in path '{str(path_content)}'...")
    for path_file in list_files:  # search for flattened files
        df_new = pd.read_csv(path_file)
        # FROM content/vmlr-workshop/halloween/vid_halloween_0-13-of-23.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/csv_flatten_azure_videoindexer.csv.gz -> 
        # TO halloween/vid_halloween_0-13-of-23.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten (relative_to)
        # TO halloween/vid_halloween_0-13-of-23.mp4  (joining base path parts)
        path_asset = Path(*path_file.parent.relative_to(path_content).parts[:2])
        df_new['tag'] = df_new['tag'].str.lower()   # lower case the tags
        df_new['details'] = df_new['details'].fillna('').str.lower()   # lower case the enhanced information
        df_new['asset'] = str(path_asset)
        if df_flatten is None:   # first one we saw
            df_flatten = df_new
        else:
            # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
            df_flatten = pd.concat([df_flatten, df_new], ignore_index=True)
        num_files += 1
        if num_files % 500 == 0:
            print(f"... read {num_files}...")
    df_flatten.reset_index(drop=True, inplace=True)  # drop prior index
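    path_metadata.parent.mkdir(parents=True, exist_ok=True)   # make sure 'models/' exists before writing
    df_flatten.to_pickle(str(path_metadata))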
    print(f"Wrote {num_files} aggregations to file '{str(path_metadata)}'...")

print(f"New columns in this data... {list(df_flatten.columns)}")


Skipping re-create of metadata file 'models/agg_metadata.pkl.gz'...
New columns in this data... ['time_begin', 'source_event', 'tag_type', 'time_end', 'time_event', 'tag', 'score', 'details', 'extractor', 'asset']


### Plotting tag statistics
Let's plot some statistics about tags, both their numbers and their names.  First, a histogram of how many unique and total tags were present for each asset.  This plot helps us find the average number of tags per asset, both as raw counts and as unique tags.  Second, the asset frequency and total frequency of the top `N` tags found in this dataset.
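
To make the "unique" versus "total" distinction concrete, here is a minimal sketch of the two-step grouping on a toy table. The `df_toy` frame and its values are invented for illustration; only the column names match the aggregated metadata.

```python
import pandas as pd

# Toy stand-in for the aggregated metadata; the values are invented.
df_toy = pd.DataFrame({
    "asset": ["a.mp4", "a.mp4", "a.mp4", "b.mp4"],
    "tag":   ["person", "person", "party", "person"],
    "score": [0.9, 0.8, 0.7, 0.6],
})

# Step 1: one row per (asset, tag) pair, holding how often that pair occurs.
df_pairs = df_toy.groupby(["asset", "tag"])["score"].count().reset_index()

# Step 2: per asset, 'count' is the number of unique tags and
# 'sum' is the total number of tag instances.
df_counts = df_pairs.groupby("asset")["score"].agg(["count", "sum"]).reset_index()
df_counts = df_counts.rename(columns={"count": "unique tags", "sum": "total tags"})
print(df_counts)
```

Here `a.mp4` has two unique tags but three tag instances, because `person` appears twice.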

In [4]:
import pylab as pl
import ipywidgets as widgets
import matplotlib.pyplot as plt

# this is a handy update function
def tag_count_hist(x):
    x = (round(x[0], 2), round(x[1], 2))
    df_sub = df_flatten[(df_flatten['score'] >= x[0]) & (df_flatten['score'] <= x[1])]
    df_pairs = df_sub.groupby(['asset','tag']).count()['score'].reset_index()   # group by two params, reset into dataframe
    df_unitags = df_pairs.groupby(['asset'])['score'].agg(['count','sum']).reset_index()   # group by asset to find unique tag count per asset
    df_unitags.rename(columns={"count":"unique tags", "sum":"total tags"}, inplace=True)
    # print(df_unitags)
    ax = df_unitags.plot.hist(by='asset', bins=40, figsize=(12,4), alpha=0.75)
    pl.title(f"Histogram of Tags Counts Among Assets ({x[0]} <= Score <= {x[1]})")
    pl.ylabel('number of assets')
    pl.xlabel('count of tags')
    pl.grid()
    pl.show()
    
    df_pairs = df_sub.groupby(['tag','asset']).count()['score'].reset_index()   # group by two params, reset into dataframe
    df_unitags = df_pairs.groupby(['tag'])['score'].agg(['count','sum']).reset_index()   # group by tag to find asset count and total count per tag
    df_unitags.sort_values('sum', ignore_index=True, inplace=True, ascending=False)
    df_unitags.rename(columns={"count":"Asset Frequency", "sum":"Total Frequency"}, inplace=True)
    top_n = 20
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))
    
    df_topn = df_unitags.iloc[:top_n]
    df_topn.plot.barh(ax=ax1, x='tag', width=0.8, log=True)
    ax1.set_title(f"Top {top_n} Tags ({x[0]} <= Score <= {x[1]})")
    ax1.set_ylabel('tag text')
    ax1.set_xlabel('count of tags')
    ax1.legend(loc="lower left")
    ax1.grid()
    
    skip_percent = 0.05
    top_percent = int(len(df_unitags)*skip_percent)
    df_topn = df_unitags.iloc[top_percent:top_percent+top_n]
    df_topn.plot.barh(ax=ax2, x='tag', width=0.8, log=True)
    ax2.set_title(f"Top {top_n} (skip {skip_percent*100:.0f}%) Tags ({x[0]} <= Score <= {x[1]})")
    ax2.set_ylabel('')
    ax2.set_xlabel('count of tags')
    ax2.legend(loc="lower left")
    ax2.grid()
    
    

# get an interactive widget/graph
out = widgets.interactive(tag_count_hist, x=widgets.FloatRangeSlider(
    value=[0.5, 1.0],
    step=0.05,
    min=df_flatten['score'].min(),
    max=df_flatten['score'].max(),
    description='Score Range:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='.1f',
))
output = out.children[-1]  # anti-flicker trick (https://ipywidgets.readthedocs.io/en/stable/examples/Using%20Interact.html#Flickering-and-jumping-output)
output.layout.height = '575px'  # disable this if you make your output window longer!
display(out)

interactive(children=(FloatRangeSlider(value=(0.5, 1.0), continuous_update=False, description='Score Range:', …

### Tag Frequency Post-Mortem
The graph on the top demonstrates that typically within an asset there are two to three instances of each unique tag.  This aligns with expectations since we have 15-second clips that typically have 2-3 different scenes.  Fewer scenes would mean fewer diverse appearances and fewer instances.

The raw set of top 20 tags (left) seems to find mostly people-based tags.  The top 20 after skipping the most frequent 5% of tags by count (right) looks more interesting and includes useful tags like `party`, `family` and `christmas tree`.  
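
The "skip the head, then take the next `N`" selection behind the right-hand chart can be sketched on a plain list. This assumes the items are already sorted from most to least frequent; `skip_top_slice` and the tag names are hypothetical.

```python
def skip_top_slice(sorted_items, skip_fraction=0.05, top_n=20):
    """Drop the first skip_fraction of a descending-sorted list, then keep the next top_n."""
    start = int(len(sorted_items) * skip_fraction)
    return sorted_items[start:start + top_n]

# 100 hypothetical tags, already sorted from most to least frequent
tags = [f"tag_{i:03d}" for i in range(100)]
window = skip_top_slice(tags, skip_fraction=0.05, top_n=3)
print(window)
```

Skipping the head removes the handful of very common tags that dominate the raw ranking, which is why the second chart surfaces more specific tags.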



# End of Intro Data Material

Nice work, you've just created a usable, aggregated form of content metadata.  Consider switching over to [notebook B](B_models.ipynb) *(that link may not work)* to continue exploring and building models using existing metadata tags.