![Machine Learning Workshop: Content Insights 2020](assets/mlci_banner.jpg)

# Machine Learning Workshop: Content Insights 2020

Welcome to the workshop notebooks!  These notebooks are designed to walk you through the steps of creating a model, refining it with user labels, and testing it on content.  You can access the main [workshop forum page](https://INFO_SITE/forums/html/forum?id=241a0b77-7aa6-4fef-9f25-5ea351825725&ps=25), the [workshop files repo](https://INFO_SITE/communities/service/html/communityview?communityUuid=fb400868-b17c-44d8-8b63-b445d26a0be4#fullpageWidgetId=W403a0d6f86de_45aa_8b67_c52cf90fca16&folder=d8138bef-9182-4bdc-8b12-3c88158a219c), or the [symposium home page](https://software.web.DOMAIN) for additional help.

The notebooks are divided into five core components: (A) setup & data, (B) model exploration, (C) labeling, (D) active labeling, and (E) deployment.  You are currently viewing the *deployment* workbook.

In [1]:
# constants for running the workshop; we'll repeat these in the top line of each workbook.
#   why repeat them? the backup routine only serializes .ipynb files, so others will need 
#   to be downloaded again if your compute instance restarts (a small price to pay, right?)

WORKSHOP_BASE = "https://vmlr-workshop.STORAGE"
# WORKSHOP_BASE = "http://content.research.DOMAIN/projects/mlci_2020"
AGG_PROCESSED = "models/agg_processed.pkl.gz"     # custom file for processed metadata

# Notebook E: Deployment

We're repeating the exploration of compressed timed metadata by first merging it into some more easily usable [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html).  Specifically, we'll merge the raw output from many assets and content analysis tools.  This should look familiar!
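
As a rough illustration of the merge described above, the sketch below concatenates two tiny, made-up frames and derives an asset name from a file path by keeping the first two path parts below the content root. The sample rows, the shortened path, and the `asset_from_path` helper are hypothetical; only the column names follow the workshop data.

```python
from pathlib import PurePosixPath

import pandas as pd


def asset_from_path(path_file, path_root):
    """Keep the first two path parts below the root (collection / asset file)."""
    relative = PurePosixPath(path_file).parent.relative_to(path_root)
    return str(PurePosixPath(*relative.parts[:2]))


# two tiny, invented extractor outputs
df_a = pd.DataFrame({"tag": ["Pumpkin"], "score": [0.9], "time_begin": [1.0]})
df_b = pd.DataFrame({"tag": ["Mask"], "score": [0.7], "time_begin": [4.0]})

frames = []
for df_new, name in [(df_a, "csv_flatten_a.csv.gz"), (df_b, "csv_flatten_b.csv.gz")]:
    # hypothetical location of one flattened file below the content root
    path_file = f"packages/trailers/halloween/vid.mp4/batches/xyz/dsai_metadata_flatten/{name}"
    df_new = df_new.copy()
    df_new["tag"] = df_new["tag"].str.lower()   # normalize tag case
    df_new["asset"] = asset_from_path(path_file, "packages/trailers")
    frames.append(df_new)

# one concat at the end avoids repeatedly copying a growing frame
df_merged = pd.concat(frames, ignore_index=True)
print(df_merged)
```

Collecting the frames in a list and concatenating once is usually cheaper than growing a DataFrame inside the loop, although for a few hundred files either approach should work.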

In the rest of this notebook, we'll take a quick look at what a deployed extractor may generate.  If you want to know more about deployment itself, head over to [ContentAI Extractors](https://www.contentai.io/docs/extractor-getting-started) and get started.  Feel free to skip to the [appendix](#Appendix---Extractor-Sample) for example code to use in your own extractor.

## Collate processed data

In [3]:
import numpy as np
import pandas as pd
import json
from pathlib import Path

path_metadata = Path(AGG_PROCESSED)
if path_metadata.exists():
    print(f"Skipping re-create of metadata file '{str(path_metadata)}'...")
    df_flatten = pd.read_pickle(str(path_metadata))
else:
    df_flatten = None
    num_files = 0
    path_content = Path("packages/trailers")
    list_files = list(path_content.rglob("csv_flatten*.csv*"))
    print(f"Ingesting {len(list_files)} flatten files in path '{str(path_content)}'...")
    for path_file in list_files:  # search for flattened files
        df_new = pd.read_csv(path_file)
        # FROM content/vmlr-workshop/halloween/vid_halloween_0-13-of-23.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten/csv_flatten_azure_videoindexer.csv.gz -> 
        # TO halloween/vid_halloween_0-13-of-23.mp4/batches/1hhadDBuEtRUPd6v8vCr5H3346r/dsai_metadata_flatten (relative_to)
        # TO halloween/vid_halloween_0-13-of-23.mp4  (joining base path parts)
        path_asset = Path(*path_file.parent.relative_to(path_content).parts[:2])
        df_new['tag'] = df_new['tag'].str.lower()   # lower case the tags
        df_new['details'] = df_new['details'].fillna('').str.lower()   # lower case the enhanced information
        df_new['asset'] = str(path_asset)
        if df_flatten is None:   # first one we saw
            df_flatten = df_new
        else:
            df_flatten = pd.concat([df_flatten, df_new], ignore_index=True)   # append new dataframe (DataFrame.append was removed in pandas 2.0)
        num_files += 1
        if num_files % 500 == 0:
            print(f"... read {num_files}...")
            
    # experimental code to parse halloween output
    path_content = Path("packages/trailers")
    list_files = list(path_content.rglob("*halloween*/*.json*"))
    print(list_files)
    obj_halloween = None
    for path_local in list_files:
        with path_local.open('r') as f:   # read the current file, not always the first one
            obj_halloween = json.load(f)
        df_new = pd.DataFrame(obj_halloween['results'])
        # print(json.dumps(obj_halloween, indent=4))
        df_new["time_event"] = df_new["time_begin"]
        df_new["tag_type"] = "tag"
        df_new["tag"] = "halloween"
        df_new["source_event"] = "video"
        df_new["extractor"] = "dsai_activity_halloween"
        df_new.drop(columns=["type_audio", "type_video", "path_video", "class", "id"], inplace=True)
        df_new["details"] = ""
        path_local = Path(*path_local.parent.relative_to(path_content).parts[:2])
        df_new["asset"] = str(path_local)
        
        if df_flatten is None:
            df_flatten = df_new
        else:
            df_flatten = pd.concat([df_flatten, df_new], ignore_index=True)   # append new dataframe
        num_files += 1   # count every file, including the first
    df_flatten.reset_index(drop=True, inplace=True)  # drop prior index
    df_flatten.to_pickle(str(path_metadata))
    print(f"Wrote {num_files} aggregations to file '{str(path_metadata)}'...")
    
print(f"New columns in this data... {list(df_flatten.columns)}")


Skipping re-create of metadata file 'models/agg_processed.pkl.gz'...
New columns in this data... ['time_begin', 'source_event', 'tag_type', 'time_end', 'time_event', 'tag', 'score', 'details', 'extractor', 'asset']


### Plotting tag statistics
Let's plot some statistics about tags, both their numbers and their names.  First, a histogram of how many unique and total tags were present for an asset.  This plot helps us find the average number of tags for an asset, both in raw counts and in unique tags.  Second, an average and raw count of the top `N` tags found in this dataset.
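
To make the two counts concrete, here is a minimal sketch on invented rows: it keeps the rows whose score falls inside a range, then counts for each tag how many assets it appears in and how many times it appears overall. The toy data and the variable names `df_toy`, `score_min`, and `score_max` are assumptions for illustration; the column names follow the workshop data.

```python
import pandas as pd

# invented rows; only the column names follow the workshop data
df_toy = pd.DataFrame({
    "tag":   ["pumpkin", "pumpkin", "pumpkin", "mask", "mask"],
    "asset": ["a.mp4",   "a.mp4",   "b.mp4",   "a.mp4", "a.mp4"],
    "score": [0.9,       0.6,       0.8,       0.4,     0.95],
})

score_min, score_max = 0.5, 1.0
in_range = df_toy["score"].between(score_min, score_max)   # inclusive on both ends
df_sub = df_toy[in_range]

# rows per (tag, asset) pair, then per tag: number of assets and total rows
df_pairs = df_sub.groupby(["tag", "asset"])["score"].count().reset_index()
df_stats = (
    df_pairs.groupby("tag")["score"]
    .agg(["count", "sum"])
    .rename(columns={"count": "Asset Frequency", "sum": "Total Frequency"})
    .sort_values("Total Frequency", ascending=False)
)
print(df_stats)
```

In this toy data, `pumpkin` appears in two assets and three times in total, while `mask` survives the score filter only once.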

In [14]:
import pylab as pl
import ipywidgets as widgets
import matplotlib.pyplot as plt
from IPython.display import YouTubeVideo, display   # explicit import so display() also works outside a notebook

VIDEO_URL="go6GEIrcvFY"


# this is a handy update function
def tag_video(x=None, clip=None, tag=None):
    
    # print(clip)
    if tag is None:
        tag = "halloween"
    
    df_sub = df_flatten[(df_flatten['score'] >= x[0]) & (df_flatten['score'] <= x[1])]
    df_history = df_sub[df_sub["tag"]==tag]
    df_rev = df_history.sort_values(["score"], ascending=False)
    df_rev["jump"] = f"{tag}, "+df_rev["score"].round(2).map(str) + "@ , " + \
        df_rev["time_begin"].round(2).map(str) + ", " + df_rev["time_end"].round(2).map(str)

    if clip is not None and len(df_rev):
        time_parts = clip.split(',')
        if tag != time_parts[0].strip():
            clip_dropdown.options = list(df_rev["jump"].head(10))
            return

    time_start = 0
    time_end = None
    if len(df_rev):
        # print(df_rev.head(10))
        time_start = int(df_rev["time_begin"].iloc[0])   # highest-scoring clip comes first
        time_end = int(df_rev["time_end"].iloc[0])
        if not clip_dropdown.options: 
            clip_dropdown.options = list(df_rev["jump"].head(10))
            return
    if clip is not None:
        time_parts = clip.split(',')
        time_start = int(float(time_parts[-2].strip()))
        time_end = int(float(time_parts[-1].strip()))
    vid = YouTubeVideo(VIDEO_URL, start=time_start, end=time_end, loop=1)
    print(f"Video Start: {time_start}, End: {time_end}")
    with out_v:
        out_v.clear_output()
        display(vid)


# this is a handy update function
def tag_count_hist(x=None):
    
    x = (round(x[0], 2), round(x[1], 2))
    df_sub = df_flatten[(df_flatten['score'] >= x[0]) & (df_flatten['score'] <= x[1])]
    df_history = df_sub[df_sub["tag"]=="halloween"]
    f, ax = plt.subplots(figsize=(12,2))
    ax.stem(df_history["time_event"], df_history['score'])   # use_line_collection was removed in newer Matplotlib releases
    ax.set_title(f"Instances of Halloween ({x[0]} <= Score <= {x[1]})")
    ax.set_ylabel('score')
    ax.set_xlabel('time (in seconds)')
    ax.set_xlim([0.0, df_flatten['time_end'].max()])
    ax.set_ylim([0.0, 1.05])
    ax.grid()
#     plt.show()
    

    df_pairs = df_sub.groupby(['tag','asset']).count()['score'].reset_index()   # group by two params, reset into dataframe
    df_unitags = df_pairs.groupby(['tag'])['score'].agg(['count','sum']).reset_index()   # group by tag: 'count' = number of assets with the tag, 'sum' = total occurrences
    df_unitags.sort_values('sum', ignore_index=True, inplace=True, ascending=False)
    df_unitags.rename(columns={"count":"Asset Frequency", "sum":"Total Frequency"}, inplace=True)
    top_n = 20
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))
    
    df_topn = df_unitags.iloc[:top_n]
    df_topn.plot.barh(ax=ax1, x='tag', width=0.8, log=True)
    ax1.set_title(f"Top {top_n} Tags ({x[0]} <= Score <= {x[1]})")
    ax1.set_ylabel('tag text')
    ax1.set_xlabel('count of tags')
    ax1.legend(loc="lower left")
    ax1.grid()
    
    skip_percent = 0.05
    top_percent = int(len(df_unitags)*skip_percent)
    df_topn = df_unitags.iloc[top_percent:top_percent+top_n]
    df_topn.plot.barh(ax=ax2, x='tag', width=0.8, log=True)
    ax2.set_title(f"Top {top_n} (skip {skip_percent*100:.0f}%) Tags ({x[0]} <= Score <= {x[1]})")
    ax2.set_ylabel('')
    ax2.set_xlabel('count of tags')
    ax2.legend(loc="lower left")
    ax2.grid()
    
    tag_dropdown.options = ["halloween"] + list(df_topn["tag"])
    clip_dropdown.options = []
    
    #plt.show()
    


# get an interactive widget/graph
range_slider = widgets.FloatRangeSlider(
    value=[0.5, 1.0],
    step=0.05,
    min=df_flatten['score'].min(),
    max=df_flatten['score'].max(),
    description='Score Range:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='.1f',
)
tag_dropdown = widgets.Dropdown(
    options=list(),  # filled in by tag_count_hist
    description='Tag:',
    disabled=False,
)
clip_dropdown = widgets.Dropdown(
    options=list(),  # filled in by tag_video
    description='Clip:',
    disabled=False,
)

out = widgets.interactive(tag_count_hist, x=range_slider)
out_v = widgets.Output()
out2 = widgets.interactive(tag_video, tag=tag_dropdown, clip=clip_dropdown, x=range_slider)
# out.layout.height = '575px'  # disable this if you make your output window longer!
display(out2, out_v, out)


## Appendix - Extractor Sample

This is a functional main file that could be called within a ContentAI extractor.  It uses OpenCV to read a video or a directory of frames and perform some model analysis (**NOTE** there is no real model evaluation here, it's just an example!).  Other trappings needed to turn this into a full-fledged extractor include a manifest file with extractor information and a simple Dockerfile; [see the documentation](https://www.contentai.io/docs/extractor-getting-started) for the exact formats.
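
Before reading the full listing, it may help to see the frame-sampling idea in isolation. The sketch below is a simplified, OpenCV-free illustration of how a sampling interval in seconds turns into a frame "hop"; the `sampled_frames` helper is hypothetical and uses zero-based frame indices, which may differ slightly from the positions OpenCV reports while reading.

```python
def sampled_frames(total_frames, fps, time_interval=None):
    """Return the frame indices that would be sent to the model."""
    if time_interval is None:
        return list(range(total_frames))          # no hop: every frame
    hop = max(1, round(fps * time_interval))      # frames between samples, never zero
    return [idx for idx in range(total_frames) if idx % hop == 0]


# a 100-frame clip at 30 fps, sampled once per second
print(sampled_frames(total_frames=100, fps=30, time_interval=1))
```

Clamping the hop to at least one frame guards against a very small interval rounding down to zero, which would otherwise cause a division-by-zero error in the modulo test.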

In [3]:
from pathlib import Path
import sys
import argparse
from datetime import datetime
import time
import json
import tempfile

import cv2

import contentaiextractor as contentai


def classify_image(model, nd_image=None, path_image=None):

    # # Predict single image
    # predict.classify(model, '2.jpg')
    # # {'2.jpg': {'magic': 4.3454722e-05, 'natural': 0.85139805}}

    # # Predict multiple images at once
    # predict.classify(model, ['/Users/bedapudi/Desktop/2.jpg', '/Users/bedapudi/Desktop/6.jpg'])
    # # {'2.jpg': {'magic': 0.14751942, 'natural': 0.8513979}, '6.jpg': {'magic': 0.013342537, 'natural': 0.5209196}}

    
    if path_image is None:   # plan for future where we can use nd_image prediction
        return None
    return {path_image: {"magic": 0.5, "natural": 0.5}}   # stub scores, keyed by the input path so callers can look them up


def video_stats(cap):
    """Gets video stats. OpenCV does not reliably report the total frame count or duration,
    so we seek to the end of the video and read the position from there.

    Parameters
    ----------
    cap : VideoCapture
        object for video capturing from video files
    
    Returns
    -------
    total_milliseconds
        total video duration in milliseconds
    total_frames
        total frame count from video duration
    """

    total_milliseconds = 0
    total_frames = 0
   
    # jump to the end of the video and get the current milliseconds and frame number
    cap.set(cv2.CAP_PROP_POS_AVI_RATIO, 1)
    total_milliseconds = cap.get(cv2.CAP_PROP_POS_MSEC)
    total_frames = int(cap.get(cv2.CAP_PROP_POS_FRAMES))

    # reset to frame zero
    cap.set(cv2.CAP_PROP_POS_AVI_RATIO, 0)

    print(f"[INFO] total milliseconds: {total_milliseconds}")
    print(f"[INFO] total frames: {total_frames}")

    return total_milliseconds, total_frames


def analyze_dir(model, content_path, verbose=False, match_pattern="*.jpg", round_decimals=5):
    """Run for all results in a directory

    Parameters
    ----------
    content_path : str
        The local path of the directory we want to analyze
    verbose: bool
        verbose printing of processing at regular intervals

    Returns
    -------
    array
        an array of objects representing keypoints identified in the frames
    """
    
    results = []
    idx_frame = 0
    path_source = Path(content_path).resolve()
    for path_image in sorted(path_source.rglob(match_pattern)):
        # run inference on image from video
        start_inference = cv2.getTickCount()
        path_full = str(path_image)
        response = classify_image(model, path_image=path_full)

        # total inference time for the frame, in seconds
        inference_time = (cv2.getTickCount() - start_inference) / cv2.getTickFrequency()
        if response is not None and path_full in response:
            result = response[path_full]
            for key_name in result:  # round decimals
                result[key_name] = round(result[key_name], round_decimals)
            result['time_event'] = 0
            result['name_event'] = str(path_image.relative_to(path_source))
            result['index_frame'] = idx_frame
            results.append(result)
        else:
            print(f"Warning: Classifications not found at index {idx_frame} using input file {path_image.name}")
        idx_frame += 1

    return results


def analyze_video(model, content_path, time_interval=None, verbose=False, round_decimals=5):
    """Gets all results from running inference on images in the entire video file

    Parameters
    ----------
    content_path : str
        The local path to the video file we want to analyze
    time_interval : float, optional
        how often do we want to sample (e.g. time between sample)
    verbose: bool
        verbose printing of processing at regular intervals

    Returns
    -------
    array
        an array of objects representing keypoints identified in the video frames
    """
    
    # use OpenCV to open video file
    cap = cv2.VideoCapture(content_path)
    print(f"[INFO] asset: {content_path}")

    # get total milliseconds
    total_milliseconds, total_frames = video_stats(cap)
    
    PRINT_INTERVAL = 60 * 1000
    time_print_next = 0   # next print time

    # images_per_second = 5 #
    fps = round(cap.get(cv2.CAP_PROP_FPS))

    # get temp item
    obj_temp = tempfile.NamedTemporaryFile(suffix=".jpg", delete=False)
    obj_temp.close()

    hop = None
    if time_interval is not None:
        hop = max(1, round(fps * time_interval))   # at least one frame, avoids modulo by zero
    print(f"Info: Detected HOP interval {hop} from fps {fps} and time_interval {time_interval}")

    # holds all results from inference call
    results = []
    while (cap.isOpened()):

        # get frame from the video
        hasFrame, frame = cap.read()

        # Stop the program if we've reached the end of the video
        if not hasFrame:
            time.sleep(3)

            # Release device
            cap.release()
            break

        # get current milliseconds and frame number
        milliseconds = cap.get(cv2.CAP_PROP_POS_MSEC)
        frame_number = int(cap.get(cv2.CAP_PROP_POS_FRAMES))

        if hop is None or (frame_number % hop == 0):
            # cv2.imwrite(obj_temp.name, frame)
            start_inference = cv2.getTickCount()

            # run inference on image from video
            response = classify_image(model, path_image=obj_temp.name, nd_image=frame)
            complete_inference = cv2.getTickCount()
            time_in_sec = milliseconds / float(1000)

            # total inference time for the frame
            inference_time = (complete_inference - start_inference) / cv2.getTickFrequency()
            if milliseconds > time_print_next:
                time_print_next += PRINT_INTERVAL
                if verbose:
                    print(f"[INFO] video {time_in_sec}s | frame: {frame_number}/{total_frames} | inference: {str(round(inference_time, 2))}s")

            if response is not None and obj_temp.name in response:
                result = response[obj_temp.name]
                for key_name in result:  # round decimals
                    result[key_name] = round(result[key_name], round_decimals)
                result['time_event'] = time_in_sec
                result['index_frame'] = frame_number
                results.append(result)
            else:
                print(f"Warning: Classifications not found at time {time_in_sec}s using temp file {obj_temp.name}")

    # delete temp file used for writing frames
    if obj_temp is not None:
        path_temp = Path(obj_temp.name)
        if path_temp.exists():
            path_temp.unlink()
            
    return results


def classify(input_params=None, args=None):
    # extract data from contentai.content_url
    # or if needed locally use contentai.content_path
    # after calling contentai.download_content()
    print("Downloading content from ContentAI")
    contentai.download_content()
    path_root = Path(__file__).parent

    parser = argparse.ArgumentParser(
        description="""A script to perform model classification""",
        epilog="""
        Launch with video parsing at a regular frame interval
            python -u main.py --path_content a/video/file.mp4 --time_interval 3
        Launch with moderation analysis for frames in a directory
            python -u main.py --path_content a/frame --match_pattern "*.jpg"
            ....
    """, formatter_class=argparse.RawTextHelpFormatter)
    submain = parser.add_argument_group('main execution and evaluation functionality')
    submain.add_argument('--path_content', dest='path_content', type=str, default=contentai.content_path, 
                            help='input video path for files to label')
    submain.add_argument('--path_result', dest='path_result', type=str, default=contentai.result_path, 
                            help='output path for samples')
    submain.add_argument('--path_model', dest='path_model', type=str, 
                            default=str(path_root.joinpath('models', 'mobilenet_v2_140_224').resolve()),
                            help='manifest path for model information')
    submain.add_argument('--time_interval', dest='time_interval', type=float, default=1,  
                            help='time interval for predictions from models')
    submain.add_argument('--round_decimals', dest='round_decimals', type=int, default=5,  
                            help='rounding decimals for predictions')
    submain.add_argument('--match_pattern', dest='match_pattern', type=str, default="*.jpg",  
                            help='frame match pattern when running on an input directory')
    submain.add_argument('--verbose', dest='verbose', default=False, action='store_true', 
                            help='verbosely print operations')

    if args is not None:
        config = vars(parser.parse_args(args))
    else:
        config = vars(parser.parse_args())
    if input_params is not None:
        config.update(input_params)
    config.update(contentai.metadata())

    print(f"Run arguments: {config}")

    path_output = Path(config['path_result'])
    path_content = Path(config['path_content'])

    # call process with i/o specified
    print(f"Processing input {config['path_content']} and writing to output dir '{path_output}'")

    model = {} # in a real example, you would load a model here...
    list_items = []
    if not path_content.is_dir():   # is_dir is a method; a single file is treated as a video
        list_items = analyze_video(model, config['path_content'], config['time_interval'], config['verbose'], config['round_decimals'])
    if len(list_items) == 0 and path_content.is_dir():
        list_items = analyze_dir(model, config['path_content'], config['verbose'], config['match_pattern'], config['round_decimals'])

    # write output of each class and segment
    dict_result = {'config': {'version':"1.0", 'extractor':"fancy_extractor",
                            'input':str(path_content.resolve()), 'timestamp': str(datetime.now()) }, 
                    'results':list_items }

    if list_items is None or len(list_items) == 0:
        print(f"No predictions made from unique models...")

    elif len(config['path_result']) > 1:
        if not path_output.exists():           # write out data if completed
            path_output.mkdir(parents=True)

        path_output = path_output.joinpath("data.json")
        with path_output.open('wt') as f:
            json.dump(dict_result, f)
        print(f"Written to '{path_output.resolve()}'...")


    # return dict of data
    if not contentai.running_in_contentai:
        return dict_result
    


if __name__ == "__main__":
    #classify()
    print("Disabled in notebook mode...")


Disabled in notebook mode...
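
The dictionary returned by `classify()` keeps one entry per analyzed frame under `results`. Below is a minimal sketch, assuming each entry carries the stub's two class scores (`magic`, `natural`) plus `time_event` and `index_frame`, of reshaping that output into one row per tag and score, similar in spirit to the flattened table used earlier in this notebook. The `dict_result` literal and its values are invented for illustration.

```python
import pandas as pd

# invented sample shaped like the output of the stub extractor
dict_result = {
    "config": {"version": "1.0", "extractor": "fancy_extractor"},
    "results": [
        {"magic": 0.5, "natural": 0.5, "time_event": 0.0, "index_frame": 0},
        {"magic": 0.2, "natural": 0.8, "time_event": 1.0, "index_frame": 30},
    ],
}

# one row per frame, one column per class score
df_wide = pd.DataFrame(dict_result["results"])

# reshape to one row per (frame, tag) pair
df_long = df_wide.melt(
    id_vars=["time_event", "index_frame"],
    var_name="tag",
    value_name="score",
)
df_long["extractor"] = dict_result["config"]["extractor"]
print(df_long.sort_values(["time_event", "tag"]).reset_index(drop=True))
```

A long table like this can be filtered by `tag` and `score` in the same way as the aggregated metadata, though a real extractor may of course emit different fields.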


# End of Deployment Material

Nice job, that's all we wrote -- literally.