![Machine Learning Workshop: Content Insights 2020](assets/mlci_banner.jpg)

# Machine Learning Workshop: Content Insights 2020

Welcome to the workshop notebooks! These notebooks walk you through the steps of creating a model, refining it with user labels, and testing it on content. For additional help, you can access the main [workshop forum page](https://INFO_SITE/forums/html/forum?id=241a0b77-7aa6-4fef-9f25-5ea351825725&ps=25), the [workshop files repo](https://INFO_SITE/communities/service/html/communityview?communityUuid=fb400868-b17c-44d8-8b63-b445d26a0be4#fullpageWidgetId=W403a0d6f86de_45aa_8b67_c52cf90fca16&folder=d8138bef-9182-4bdc-8b12-3c88158a219c), or the [symposium home page](https://software.web.DOMAIN).

The notebooks are divided into five core components: (A) setup & data, (B) model exploration, (C) labeling, (D) active labeling, and (E) deployment.  You are currently viewing the *labeling* workbook.

In [207]:
# constants for running the workshop; we'll repeat these in the top line of each workbook.
#   why repeat them? the backup routine only serializes .ipynb files, so others will need 
#   to be downloaded again if your compute instance restarts (a small price to pay, right?)

WORKSHOP_BASE = "https://vmlr-workshop.STORAGE"
# WORKSHOP_BASE = "http://content.research.DOMAIN/projects/mlci_2020"
AGG_METADATA = "agg_metadata.pkl.gz"            # custom file for merged metadata
AGG_AVFEAT = "agg_avfeature.pkl.gz"             # custom file for merged audio and video features


# Notebook C: Self-Service Labeling

Collecting labels can be an arduous, expensive task.  That's why we're turning to an internal platform that simplifies the process with a programmatic API and democratizes labeling by opening it to some or all of your fellow employees.  In this notebook, we will focus on creating, exploring, and tuning a labeling campaign.  The task here takes a momentary break from our contextual advertising focus, but we'll use the skills learned here to continue in that direction in the next notebook.

![LabelQuest](assets/labelquest_banner.jpg)

[LabelQuest](https://lq.web.DOMAIN) is an AT&T labeling platform that allows task creation through a programmatic API and broad label solicitation across the enterprise.  While due diligence is still required to avoid sensitive information and content, label tracking and compliance-approved use of the software on desktops, laptops, and tablets are already in place.  Additionally, the gamification of the labeling task (varied tasks, a simple but intuitive UX, and the awarding of points and badges) may keep spirits up if the number of tasks starts to build up.

# Pythonic API
Quick discussion of the library

## Project Creation
Demo code to create a project

## Task Creation
Demo code to gather materials from a directory or text file
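
Until the demo code is filled in, here is a minimal sketch of the gathering step only. It assumes materials are either files in a directory or one entry per line of a text file. The helper name `gather_materials` and its default patterns are hypothetical and are not part of any LabelQuest library.

```python
from pathlib import Path


def gather_materials(source, patterns=("*.jpg", "*.png", "*.mp4", "*.txt")):
    """Return a list of material references from a directory or a text file.

    Hypothetical helper for illustration; not a LabelQuest API call.
    """
    source = Path(source)
    if source.is_dir():
        # directory: collect every file that matches one of the glob patterns
        found = {p for pattern in patterns for p in source.glob(pattern)}
        return sorted(str(p) for p in found)
    # text file: one material (URL, path, or sentence) per non-empty line
    lines = source.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

The returned list of strings can then be handed to whatever task-creation call the labeling library provides.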

## Extra Information
Demo code to add extra information to the tasks (e.g. the info box or question additions)

## Task Type Variance
Demo code to add text, image, or video tasks
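
As a placeholder for that demo, the sketch below shows one way to decide whether a material should become a text, image, or video task by looking at its file extension. The extension sets, the `task_type` and `build_tasks` helpers, and the dictionary layout are all assumptions made for illustration; the real platform may describe tasks differently.

```python
from pathlib import Path

# assumed extension lists; extend them to match your own materials
IMAGE_EXT = {".jpg", ".jpeg", ".png", ".gif"}
VIDEO_EXT = {".mp4", ".mov", ".webm"}


def task_type(material):
    """Guess the task type from the material's extension (default: text)."""
    suffix = Path(material).suffix.lower()
    if suffix in IMAGE_EXT:
        return "image"
    if suffix in VIDEO_EXT:
        return "video"
    return "text"


def build_tasks(materials, question):
    """Pair each material with its guessed type and a shared question."""
    return [
        {"type": task_type(m), "content": m, "question": question}
        for m in materials
    ]
```

Anything without a recognized image or video extension, including a plain sentence, falls back to a text task.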

# Interface Exploration

## Recommendation by Association
In our user survey, the majority indicated they'd be willing to provide a few labels to a service for better product recommendations, so let's test that promise and its efficacy.  

![willingness to label](assets/labelquest_agreement.jpg)

In the remainder of this notebook, we'll use the [IMDB 5000 dataset](https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset) that has a small collection of movies and some metadata about actors and directors.  Unfortunately, you won't be able to make direct requests to WarnerBrothers or HBO to produce your most preferred combination, but this example should demonstrate the power of rapid model refinement with just a few labels.

To jump right to the fun part, we've done a little bit of preprocessing to formulate a few models on the data.  The code in the following cell **will not be executed** because we did it for you, but it's provided here for some fun examples.

1. **Actor Affinity** - finding your preferred actor by the links to others
2. **Genre Preferences** - finding preference for genres by direct categorical links
3. **Crowd Alignment** - how closely your opinions match those of others, as determined by `likes`
4. **Embedded Topics** - (advanced) a method that uses our NLP embedding to get recommendations, similar to genre work
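
To make the second idea concrete, here is a small sketch of genre preferences built from a handful of labels. It is not the workshop's preprocessing code: the function names, the +1/-1 vote convention, and the toy titles are assumptions for illustration only.

```python
from collections import Counter


def genre_preferences(labels, movie_genres):
    """Sum votes per genre. labels: {title: +1 or -1}; movie_genres: {title: [genres]}."""
    scores = Counter()
    for title, vote in labels.items():
        for genre in movie_genres.get(title, []):
            scores[genre] += vote
    return scores


def recommend(scores, movie_genres, seen, top_n=3):
    """Rank unseen titles by the summed score of their genres, best first."""
    unseen = [t for t in movie_genres if t not in seen]
    ranked = sorted(
        unseen,
        key=lambda t: sum(scores[g] for g in movie_genres[t]),
        reverse=True,
    )
    return ranked[:top_n]


# toy data, purely illustrative
movie_genres = {
    "Movie A": ["Action", "Sci-Fi"],
    "Movie B": ["Romance"],
    "Movie C": ["Action"],
    "Movie D": ["Romance", "Drama"],
}
labels = {"Movie A": 1, "Movie B": -1}
scores = genre_preferences(labels, movie_genres)
picks = recommend(scores, movie_genres, seen=labels)
```

With only two labels, the action title already ranks ahead of the romance title, which is the kind of rapid refinement this notebook is after.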

In [28]:
# OLD CODE, but examples of things that would be run FOR the user to be used in above experiments


import spacy
from spacy.vocab import Vocab
from spacy.tokenizer import Tokenizer
from scipy import spatial
from pathlib import Path

def text2doc(nlp, tag_raw):
    """Given a raw text input, tokenize it and drop stop words and punctuation"""
    return [x for x in nlp(tag_raw) if not x.is_stop and not x.is_punct]

def doc2vec(nlp, doc, target_domain=None):
    """Sum the vectors of a token (or list of tokens/strings) into one embedding; None if doc is empty"""
    # https://spacy.io/usage/vectors-similarity
    if target_domain is None:
        target_domain = nlp.vocab
    if not isinstance(doc, list):
        doc = [doc]
    tag_doc = None
    for token in doc:
        # plain strings have no .text attribute, so handle them explicitly
        text = token if isinstance(token, str) else token.text
        tag_id = target_domain.strings[text]
        if tag_id in target_domain.vectors:   # search existing one
            new_vec = target_domain.vectors[tag_id]
        elif isinstance(token, str):
            new_vec = nlp(token).vector
        else:
            new_vec = token.vector
        if tag_doc is None:
            tag_doc = new_vec.copy()   # copy so the in-place add never alters the vocab table
        else:
            tag_doc += new_vec
    return tag_doc

# doc = text2doc(nlp, "this is a phrase to clean")
# vec = doc2vec(nlp, doc)

# also load our spacy NLP model
nlp = spacy.load('en_core_web_md')
print("NLP model ready to go!")

NLP model ready to go!


In [106]:
# let's plow through our class dataframe and do the cleaning and mapping
# (df_classes and df_flatten are assumed to be loaded by earlier workshop cells)
import numpy as np

print("Tokenizing and embedding classes...")
embed_dim = nlp.vocab.vectors.shape[1]   # width of the embedding space
list_tokens = [""] * len(df_classes)
list_vect = [np.zeros((1, embed_dim))] * len(df_classes)
for idx in range(len(df_classes)):  # iterate row indexes
    row = df_classes.iloc[idx]
    # tokenize to remove stop words and punctuation, return tokens
    doc = text2doc(nlp, row["definition"]) + text2doc(nlp, row["class"])
    list_tokens[idx] = [str(x) for x in doc]
    # convert to an embedding array
    list_vect[idx] = doc2vec(nlp, doc)
df_classes["token"] = list_tokens
df_classes["embedding"] = list_vect

# now collect all of the tags that match simple text bag
print("Lookup specific tags by match...")
list_tags = df_flatten['tag'].unique()
map_tokens = {}
embed_tokens = np.zeros((len(list_tags), embed_dim))   # zeros rather than uninitialized memory
for idx in range(len(list_tags)):   # iterate through full list of known tags
    doc = text2doc(nlp, list_tags[idx])
    for x in doc:  # create map/reference to each term
        x = str(x)
        if x not in map_tokens:  # first time to see this tag?
            map_tokens[x] = []
        map_tokens[x].append(idx)   # save reference to original tag set
    if doc:  # tags made only of stop words/punctuation keep a zero row
        embed_tokens[idx, :] = doc2vec(nlp, doc)  # compute embedding

print(f"Count of new text-based mapping: {len(map_tokens)}...")
print(f"Shape of embedded tag matrix: {embed_tokens.shape}...")


Tokenizing and embedding classes...
Lookup specific tags by match...
Count of new text-based mapping: 1804...
Shape of embedded tag matrix: (1763, 300)...
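
Once a matrix like `embed_tokens` exists, a natural next step is to look up the rows closest to a query embedding. The sketch below uses cosine similarity with plain NumPy. The function name `nearest_rows` and the toy matrix are assumptions; the cells above do not show how the workshop actually queried the embeddings.

```python
import numpy as np


def nearest_rows(query, matrix, top_n=3):
    """Return (row indexes, cosine similarities) of the rows most similar to query."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # product of the row norms and the query norm; zero for all-zero rows
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    sims = np.zeros(len(matrix))
    # divide only where the norm is non-zero so zero rows keep a similarity of 0
    np.divide(matrix @ query, norms, out=sims, where=norms > 0)
    order = np.argsort(-sims)[:top_n]
    return order, sims[order]


# toy 3-dimensional "embeddings", purely illustrative
toy_matrix = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 0]]
rows, scores = nearest_rows([1, 0, 0], toy_matrix, top_n=2)
```

In the notebook's setting, `matrix` would be `embed_tokens` and `query` a vector produced by `doc2vec`, with the returned indexes pointing back into `list_tags`.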


## A Few Labels
What does it look like once someone has added labels?

## Reranking of Tasks
How do you rerank or prioritize the tasks with label data?
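
One common heuristic, sketched here under stated assumptions, is to surface tasks that still need labels first and then tasks whose labelers disagree the most. The `min_labels` threshold, the entropy measure, and the function names are illustrative choices and not the platform's built-in behavior.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the labels collected for one task."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rerank_tasks(task_labels, min_labels=3):
    """Order task ids: under-labeled tasks first, then highest disagreement."""
    def priority(item):
        task_id, labels = item
        needs_more = len(labels) < min_labels
        entropy = label_entropy(labels) if labels else 0.0
        # tuples sort ascending: False (needs more labels) before True,
        # then larger entropy first, then task id as a stable tie-breaker
        return (not needs_more, -entropy, task_id)

    return [task_id for task_id, _ in sorted(task_labels.items(), key=priority)]


# toy label data, purely illustrative
task_labels = {
    "t1": ["happy", "happy", "happy"],
    "t2": ["happy", "sad", "happy", "sad"],
    "t3": ["sad"],
    "t4": [],
}
order = rerank_tasks(task_labels)
```

Here the two under-labeled tasks come first, followed by the task with split votes, and the task everyone agrees on goes last.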

# End of Labeling Material

Ready to label the world?  Just remember that, out of context, your friends and family may not appreciate you giving them a `happy` or `sad` label, and much less so when you try to click a submit button afterwards! Now that we are familiar with a labeling system, we have the skills to return to the task of content labeling.

The next notebook, [notebook D](D_active_labels.ipynb) *(that link may not work)*, returns to the contextual ads problem and applies what was learned here to the labeling task that everyone contributed to (hopefully!) before the workshop.
