![Machine Learning Workshop: Content Insights 2020](assets/mlci_banner.jpg)

# Machine Learning Workshop: Content Insights 2020

Welcome to the workshop notebooks! These notebooks are designed to walk you through the steps of creating a model, refining it with user labels, and testing it on content. You can access the main [workshop forum page](https://INFO_SITE/forums/html/forum?id=241a0b77-7aa6-4fef-9f25-5ea351825725&ps=25), the [workshop files repo](https://INFO_SITE/communities/service/html/communityview?communityUuid=fb400868-b17c-44d8-8b63-b445d26a0be4#fullpageWidgetId=W403a0d6f86de_45aa_8b67_c52cf90fca16&folder=d8138bef-9182-4bdc-8b12-3c88158a219c), or the [symposium home page](https://software.web.DOMAIN) for additional help.

The notebooks are divided into five core components: (A) setup & data, (B) model exploration, (C) labeling, (D) active labeling, and (E) deployment. You are currently viewing the *labeling* workbook.

Start your LabelQuest journey by clicking here ...

## Log In: https://APP_SITE

## Get your Token : https://APP_SITE/api/lq/v1/uam/auth

In [1]:
# constants for running the workshop; we'll repeat these in the top line of each workbook.
#   why repeat them? the backup routine only serializes .ipynb files, so others will need 
#   to be downloaded again if your compute instance restarts (a small price to pay, right?)

WORKSHOP_BASE = "https://vmlr-workshop.STORAGE"
# WORKSHOP_BASE = "http://content.research.DOMAIN/projects/mlci_2020"
AGG_METADATA = "models/agg_metadata.pkl.gz"            # custom file for merged metadata
CLASS_DEFINITIONS = "assets/classes.json"       # provided file for class info

# you need to provide this (copy the string from https://APP_SITE/api/lq/v1/uam/auth)
LQ_JWT = ""  
LQ_ROOT_URL = "https://APP_SITE"
LQ_ROOT_SSL_VERIFY = False

IMDB5000_FEAT = "packages/movie_metadata.csv"   # public dataset for movies
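The comment in the cell above notes that non-notebook files must be downloaded again after a compute restart. A minimal sketch of a "download only if missing" helper follows. The function name `ensure_local` and its `fetch` hook are hypothetical, and it assumes each file is served under `WORKSHOP_BASE` at the same relative path, which the workshop does not confirm for every file.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_local(rel_path, base_url, fetch=urlretrieve):
    """Return a local Path for rel_path, downloading it only when it is missing.

    Hypothetical helper (not part of the workshop code). `fetch(url, filename)`
    defaults to urllib's urlretrieve and can be swapped out for testing.
    """
    local = Path(rel_path)
    if local.exists():
        return local  # already on disk, nothing to do
    local.parent.mkdir(parents=True, exist_ok=True)
    # assumption: the remote layout mirrors the local relative path
    fetch(f"{base_url.rstrip('/')}/{rel_path}", str(local))
    return local
```

A call such as `ensure_local(CLASS_DEFINITIONS, WORKSHOP_BASE)` at the top of a notebook would then be cheap to repeat, since it only touches the network when the file is absent.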

# Notebook C: Self-Service Labeling

Collecting labels can be an arduous, expensive task. That's why we're turning to an internal platform that simplifies the process with a programmatic API and democratizes labeling across some or all of your fellow employees. In this notebook, we will focus on creating, exploring, and tuning a labeling campaign. The task herein takes a momentary break from our contextual advertising focus, but we'll use the skills learned here to continue in that direction in the next notebook.

![LabelQuest](assets/labelquest_banner.jpg)

[LabelQuest](https://lq.web.DOMAIN) is an AT&T labeling platform that allows task creation through a programmatic API and broad label solicitation across the enterprise. While due diligence is still required to avoid sensitive information and content, the tracking of labels and compliance-approved usage of the software on desktops, laptops, and tablets is already in place. Additionally, the gamification of the labeling task (a variety of tasks, a simple but intuitive UX, and the awarding of points and badges) may keep spirits up if the number of tasks starts to build up.

# Pythonic API


In [2]:
import lq
from lq.content_label import ContentLabeler
from pathlib import Path
import pandas as pd

if not LQ_JWT:
    LQ_JWT = ContentLabeler.jwt_load("auth.json")
if not LQ_JWT:
    raise Exception("""
        No token detected (in LQ_JWT), please authenticate and get your JWT token.
        1. Log into the test instance of LQ - https://APP_SITE/
        2a. Get your LQ token from here - https://APP_SITE/api/lq/v1/uam/auth
        2b. OR Save the produced JSON file to the same directory as this script (as auth.json)
    """)

pd.set_option('display.max_colwidth',1000)
df_classes = pd.read_json(CLASS_DEFINITIONS)
display(df_classes)

path_metadata = Path(AGG_METADATA)
if not path_metadata.exists():
    raise Exception(f"""
        Please return to notebook A to run data flattening and merging!
        A file '{AGG_METADATA}' will be created after successful execution.
    """)
df_flatten = pd.read_pickle(str(path_metadata))
print(f"Loaded {len(df_classes)} classes and {len(df_flatten)} tag rows.")


|   | class | definition | primary |
|---|---|---|---|
| 0 | holiday | holiday scenes or objects like decorated trees, presents, or character, holiday party | 1 |
| 1 | halloween | halloween scenes where one or more characters are in costume, ideally one or more characters are trick-or-treating | 1 |
| 2 | gift giving | scenes of gift giving, receiving, or opening/unwrapping | 1 |
| 3 | family moments | at least two people on screen, typically familes at parties, enjoying a meal, lounging at home | 0 |
| 4 | shopping scenes | one or more primary actors in a store-like environment; not necessary to see their face | 0 |


Loaded 5 classes and 122736 tag rows.


## Project Creation
This cell demonstrates basic functions for listing, retrieving, and creating new projects. To avoid any confusion, the code discussions will use the term `project` instead of `campaign`, but otherwise the two are meant to be interchangeable in this workshop.

The sample below demos these functions...
* `ContentLabeler.list()` - query the active projects
* `ContentLabeler.tasks_retrieve()` - query the tasks under a single project
* `ContentLabeler.load()` - create or load an existing project
* `ContentLabeler.delete()` - delete a project that has been loaded
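
The four calls above can be walked through offline with a small in-memory stand-in. This is only a sketch: `FakeLabeler` is hypothetical, it imitates the method names and the "zero new inserts on a repeat" behavior seen in this notebook, and it is not the real `lq` client.

```python
class FakeLabeler:
    """In-memory stand-in mirroring the method names listed above (hypothetical)."""

    def __init__(self):
        self._projects = {}   # title -> list of task dicts
        self.title = None
        self.valid = False

    def list(self):
        return [{"title": title} for title in self._projects]

    def load(self, title, create_if_missing=False):
        if title not in self._projects and create_if_missing:
            self._projects[title] = []
        self.valid = title in self._projects
        self.title = title if self.valid else None

    def tasks_insert(self, classes, media, **options):
        tasks = self._projects[self.title]
        known = {task["data"] for task in tasks}
        new_items = [item for item in media if item not in known]
        for item in new_items:
            tasks.append({"data": item, "classes": list(classes), "labels": []})
        return len(new_items)   # 0 when every item was already present

    def tasks_retrieve(self, mode=0):
        return list(self._projects[self.title])

    def delete(self):
        self._projects.pop(self.title, None)
        self.title, self.valid = None, False


demo = FakeLabeler()
demo.load("emotion_check", create_if_missing=True)          # create or load
first = demo.tasks_insert(["happy", "sad", "neutral"], ["Tax day", "Fuzzy cats"])
again = demo.tasks_insert(["happy", "sad", "neutral"], ["Tax day", "Fuzzy cats"])
titles = [proj["title"] for proj in demo.list()]             # list active projects
num_tasks = len(demo.tasks_retrieve(0))                      # tasks in the loaded project
demo.delete()                                                # delete the loaded project
remaining = demo.list()
```

Note that `delete()` acts on whichever project was loaded last, which is why the notebook code always calls `load()` before deleting.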


In [3]:
from IPython.display import display
import ipywidgets as widgets
from functools import partial
import random

pd.set_option('display.max_colwidth',1000)

# first, create a handy content labeling instance
labeler = ContentLabeler(LQ_JWT, LQ_ROOT_URL, verbose=False, ssl_verify=LQ_ROOT_SSL_VERIFY)

def proj_create_ex(proj_title="emotion_check"):
    # let's create a simple textual campaign
    labeler.load(proj_title, create_if_missing=True)
    if not labeler.valid:
        raise Exception("""
        Uh oh, failed to create a new project.  Make sure your token is valid and check above errors.""")
    list_classes = ["happy", "sad", "neutral"]
    list_media = ["A juicy apple", "A long-time friend", "In-laws", "Tax day", "Rainy days", "Fuzzy cats", "A cold beverage", "Spam", "Pocket Lint"]
    random.shuffle(list_media)
    text_questions = "Please select your primary emotion when thinking of these items"
    list_cut = list_media[:3]  # just enough
    num_inserted = labeler.tasks_insert(list_classes, list_cut, is_exclusive=True, data_question=text_questions)
    if num_inserted == 0:  # hrm, it says no new inserts, let's check for a prior project
        list_tasks = labeler.tasks_retrieve(0)   # -1=label only, 0=all, 1=unlabeled
        print(f"The project '{proj_title}' already exists, with {len(list_tasks)} tasks of expected {len(list_cut)} tasks")
        assert len(list_tasks) == len(list_cut)
    else:
        print(f"The project '{proj_title}' was created, with {num_inserted} tasks of expected {len(list_cut)} tasks")
        assert num_inserted == len(list_cut)
    print(f"That's it for basic project creation, head to LQ to check it out: {LQ_ROOT_URL}")


# create a quick interaction grid for display
def proj_view(create_fn, template_name, proj=None, delete=False, create=False):
    global df_proj
    def refresh():
        list_active = labeler.list()
        # error retrieving or empty (no projects!)
        if not isinstance(list_active, list) or not list_active:
            return pd.DataFrame()
        df_proj = pd.DataFrame(list_active)   # reuse the result instead of querying twice
        df_proj['id'] = df_proj['title'].map(str) + ": " + df_proj['description'].map(str)
        df_proj.set_index('id', inplace=True)
        return df_proj
    
    if create:
        title_new = f"{template_name}_{len(df_proj)}"
        result = create_fn(title_new)
        df_proj = refresh()
        idx_new = df_proj[df_proj["title"]==title_new].index[0]
        btn_create.value = False
        dropdown.options = list(df_proj.index)
        dropdown.value = idx_new
        return
    if proj is None:
        df_proj = refresh()
        dropdown.options = list(df_proj.index)
        return
    if proj not in df_proj.index:
        print(f"Error: Couldn't find the specified project {proj} in results!")
        return
    if delete:
        result = labeler.delete()
        df_proj = refresh()
        btn_del.value = False
        if len(df_proj):
            dropdown.options = list(df_proj.index)
            dropdown.value = df_proj.index[0]
        else:
            dropdown.options = []
        return

    # display(f"Matched Tags (for class {class_name}): {df_performance.loc[run_name, 'token']}")
    proj_name = df_proj.loc[proj, "title"]
    labeler.load(proj_name, create_if_missing=False)
    
    if labeler.valid:
        df_tasks = pd.DataFrame(labeler.tasks_retrieve(0))
        df_tasks_sub = df_tasks[['etag', 'data', 'labels']]
        display(df_tasks_sub)
    else:
        print(f"No tasks found for project {proj_name}")

df_proj = None  # start with empty obj

dropdown = widgets.Dropdown(
    options=[] if df_proj is None else list(df_proj.index),  # send run names
    description='Project:', layout={'width': '90%'},
    disabled=False,
)
btn_del = widgets.ToggleButton(description="Delete")
btn_create = widgets.ToggleButton(description="Create")

# create two rows ; "partial" allows us to sub in another function for create
demo_fn = partial(proj_view, proj_create_ex, "new_project")
out = widgets.interactive_output(demo_fn, {'proj':dropdown, 'delete':btn_del, 'create':btn_create})

display(widgets.VBox([dropdown, widgets.HBox([btn_del, btn_create])]), out)
demo_fn()


### Creation Post-Mortem
Not much to say here, but you should consider checking out the projects you created and how they are rendered on the test version of LabelQuest: https://APP_SITE.

## Task Type Variance
Demo code to add text, image, or video tasks. Additionally, other capabilities like textual info and video playback speed will be explored. Be warned that this cell will delete all projects on the test instance so that you can more easily explore the various settings herein.


In [4]:
import time

def proj_task_media_create(proj_title="media_check"):
    # let's create a simple media campaign
    labeler.load(proj_title, create_if_missing=True)
    if not labeler.valid:
        raise Exception("""
        Uh oh, failed to create a new project.  Make sure your token is valid and check above errors.""")
    list_classes = list(df_classes["class"].unique())
    list_media = list(df_flatten["asset"].unique())
    random.shuffle(list_media)
    list_media = [f"{WORKSHOP_BASE}/{x}" for x in list_media[:6]]   # prefix URL
    text_questions = "Please label with one or more of these classes."
    text_info = """
        <h2>Information</h2>
        <h3>Not all information is created equally</h3>
        <ul>
        <li><strong>this item</strong> is first</li>
        <li>that item is second</li>
        </ul>
        <small>That is all</small>
    """
    num_inserted = 0
    num_new = labeler.tasks_insert(list_classes, list_media[:2], is_exclusive=True, 
                    data_question=text_questions + " <strong>THIS IS THE FIRST SET OF TASKS</strong>")
    num_inserted += num_new
    num_new = labeler.tasks_insert(list_classes, list_media[2:4], is_exclusive=True, 
                    data_question=text_questions + " <strong>THIS IS THE SECOND SET OF TASKS</strong>",
                    data_info=text_info)
    num_inserted += num_new
    num_new = labeler.tasks_insert(list_classes, list_media[4:], is_exclusive=False, 
                    data_question=text_questions + " <strong>THIS IS THE THIRD SET OF TASKS</strong>",
                    media_playback="1.5")
    num_inserted += num_new
    
    if num_inserted == 0:  # hrm, it says no new inserts, let's check for a prior project
        list_tasks = labeler.tasks_retrieve(0)   # -1=label only, 0=all, 1=unlabeled
        print(f"The project '{proj_title}' already exists, with {len(list_tasks)} tasks of expected {len(list_media)} tasks")
        assert len(list_tasks) == len(list_media)
    else:
        print(f"The project '{proj_title}' was created, with {num_inserted} tasks of expected {len(list_media)} tasks")
        assert num_inserted == len(list_media)
        list_tasks = labeler.tasks_retrieve(0)   # -1=label only, 0=all, 1=unlabeled
        print(f"The project '{proj_title}' now exists, with {len(list_tasks)} tasks ")
    print(f"Check out your new project with video, info, and variable speed playback: {LQ_ROOT_URL}")

def delete_all(sleep_secs=5):
    list_found = labeler.list()
    if not isinstance(list_found, list):   # error retrieving or no projects
        list_found = []
    for proj in list_found:
        labeler.load(proj['title'])   # delete() acts on the loaded project
        labeler.delete()
    print(f"Deleted {len(list_found)}, sleeping {sleep_secs}")
    time.sleep(sleep_secs)  # the backend updates lazily, so wait a bit before continuing

# delete all projects
delete_all()

# create a new project
proj_task_media_create()


Deleted 0, sleeping 5
The project 'media_check' was created, with 6 tasks of expected 6 tasks
The project 'media_check' now exists, with 6 tasks 
Check out your new project with video, info, and variable speed playback: https://APP_SITE


# Interface Exploration

## Reranking of Tasks
How do you rerank or prioritize the tasks with label data?

In our user survey, the majority indicated they'd be willing to provide a few labels to a service for better product recommendations, so let's test that promise and its efficacy.  However, while the survey indicates willingness to add labels, those labels should be as relevant as possible.

![willingness to label](assets/labelquest_agreement.jpg)

## Reranking items
In this example we create a temporary project and rerank a few of the items.  This demonstrates how to dynamically reorder samples within a campaign.
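
One way to express such a reordering is as a mapping from task identifier to priority score. The sketch below builds that mapping from keyword rules. The helper `build_rescore`, the field names `etag` and `question`, and the sample records are all assumptions for illustration. How the mapping is handed to the labeler is not shown or verified here.

```python
def build_rescore(tasks, rules, key="etag", text_field="question"):
    """Map each task identifier to the score of the first matching keyword.

    `key` and `text_field` are assumed field names; `rules` is an ordered
    list of (keyword, score) pairs. Tasks matching no keyword are omitted.
    """
    scores = {}
    for task in tasks:
        text = str(task.get(text_field, "")).lower()
        for keyword, score in rules:
            if keyword.lower() in text:
                scores[task[key]] = score
                break
    return scores


# hypothetical task records, shaped like rows returned for a project
sample_tasks = [
    {"etag": "a1", "question": "Please label. <strong>THIS IS THE THIRD SET OF TASKS</strong>"},
    {"etag": "b2", "question": "Please label. <strong>THIS IS THE FIRST SET OF TASKS</strong>"},
    {"etag": "c3", "question": "Please label. <strong>THIS IS THE SECOND SET OF TASKS</strong>"},
]
dict_rescore = build_rescore(sample_tasks, [("first", 0.99), ("second", 0.60)])

# highest score first: this is the order labelers would see under these rules
ranked = sorted(dict_rescore, key=dict_rescore.get, reverse=True)
```

Tasks without a matching keyword keep whatever priority they already had, so only the items that matter need a rule.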

In [30]:
    
# delete all projects
delete_all(10)

# create a new project having "FIRST" "SECOND" "THIRD" in title
project_name = "priority"
proj_task_media_create(project_name)

# now query for those tasks
# force items with "FIRST" to be first, then "SECOND", then "THIRD"
df_tasks = pd.DataFrame(labeler.tasks_retrieve(0))

# rescoring is currently disabled; the block below is kept for reference
# if False:
#     k = 'etag'   # assumed task-key column
#     dict_rescore = {}
#     for i, r in df_tasks.iterrows():
#         qtext = r['question'].lower()
#         if "first" in qtext:
#             dict_rescore[r[k]] = 0.99
#         elif "second" in qtext:
#             dict_rescore[r[k]] = 0.60
#
#     num_updates = labeler.tasks_rescore(dict_rescore)   # send request
#
#     print(f"Updated {num_updates} of {len(df_tasks)}...")
# else:
#     print("Weird ghost in the machine is preventing normal operation")



Deleted 0, sleeping 10
The project 'priority' was created, with 6 tasks of expected 6 tasks
The project 'priority' now exists, with 42 tasks 
Check out your new project with video, info, and variable speed playback: https://APP_SITE


## A few Labels
What does it look like when someone has added labels?

In [None]:
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import display
from functools import partial

campaign = lq.project.Project(jwt_token=LQ_JWT, url=LQ_ROOT_URL, ssl_verify=False)
campaign_df = pd.DataFrame(campaign.retrieve_data(project_id='11de6843906842beb50d228ed9107914'))
labels = campaign_df[campaign_df.id=='5f6b283a2525a124ccf2922e']['labels']
print(labels)

In [None]:
def get_votes_by_count(labels, count=None):
    # labels is a filtered Series, so use positional access (its index label may not be 0)
    label_list = labels.iloc[0]
    if count is None:
        count = len(label_list)
    options = []
    for label in label_list[0:count]:
        for selection in label['selection']:
            options.append(selection['selected'])
    return options

out = widgets.IntSlider(
    value=1,
    step=1,
    min=1,
    max=len(labels.iloc[0]),
    description='Votes',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d',
)
def update_barchart(votes):
    # bar chart of how often each option was selected within the first `votes` labels
    return pd.Series(get_votes_by_count(labels, votes)).value_counts().sort_index().plot(kind='bar')

interactive(update_barchart, votes=out)
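
The counting behind the bar chart can also be sketched without widgets or a live project. The helper `tally_votes` and the sample records below are hypothetical. They assume each label entry carries a `selection` list whose items have a `selected` field, as the cell above reads them.

```python
from collections import Counter


def tally_votes(label_list, count=None):
    """Count selected options across the first `count` label entries (all if None)."""
    votes = Counter()
    for label in label_list[:count]:
        for selection in label.get("selection", []):
            votes[selection["selected"]] += 1
    return votes


# hypothetical label records for a single task
sample_labels = [
    {"selection": [{"selected": "happy"}]},
    {"selection": [{"selected": "sad"}]},
    {"selection": [{"selected": "happy"}, {"selected": "neutral"}]},
]
all_votes = tally_votes(sample_labels)
top_label, top_votes = all_votes.most_common(1)[0]
first_two = tally_votes(sample_labels, 2)   # what the slider shows at Votes=2
```

Comparing `first_two` with `all_votes` shows how the leading option can change as more votes arrive, which is what the slider is meant to reveal.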

# End of Labeling Material

Ready to label the world? Just remember that, out of context, your friends and family may not appreciate you giving them a `happy` or `sad` label, and much less so when you try to click a submit button afterwards! Now that we are familiar with a labeling system, we have the skills to return to the content labeling task.

The next notebook, [notebook D](D_active_labels.ipynb) *(that link may not work)* returns to the contextual ads problem and applies what was learned here to the labeling task that everyone contributed to (hopefully!) before the workshop.
