![Machine Learning Workshop: Content Insights 2020](../assets/mlci_banner.jpg)

# Machine Learning Workshop: Content Insights 2020

Welcome to the workshop notebooks! These notebooks walk you through the steps of creating a model, refining it with user labels, and testing it on content. For additional help, visit the main [workshop forum page](https://INFO_SITE/forums/html/forum?id=241a0b77-7aa6-4fef-9f25-5ea351825725&ps=25), the [workshop files repo](https://INFO_SITE/communities/service/html/communityview?communityUuid=fb400868-b17c-44d8-8b63-b445d26a0be4#fullpageWidgetId=W403a0d6f86de_45aa_8b67_c52cf90fca16&folder=d8138bef-9182-4bdc-8b12-3c88158a219c), or the [symposium home page](https://software.web.DOMAIN).

The notebooks are divided into five core components: (A) setup & data, (B) model exploration, (C) labeling, (D) active labeling, and (E) deployment. You are currently viewing the *setup & data* workbook.

Start your LabelQuest journey by clicking here:

## Log In: https://APP_SITE

## Get your Token: https://APP_SITE/api/lq/v1/uam/auth

In [1]:
# constants for running the workshop; we'll repeat these in the top line of each workbook.
#   why repeat them? the backup routine only serializes .ipynb files, so others will need 
#   to be downloaded again if your compute instance restarts (a small price to pay, right?)

WORKSHOP_BASE = "https://vmlr-workshop.STORAGE"
# WORKSHOP_BASE = "http://content.research.DOMAIN/projects/mlci_2020"
AGG_METADATA = "models/agg_metadata.pkl.gz"            # custom file for merged metadata

# you need to provide this (copy the string from https://APP_SITE/api/lq/v1/uam/auth)
LQ_JWT = ""  
LQ_ROOT_URL = "https://APP_SITE"
LQ_ROOT_SSL_VERIFY = False

IMDB5000_FEAT = "packages/movie_metadata.csv"   # public dataset for movies
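
Because only the `.ipynb` files survive a restart, the data files named by these constants have to be fetched again. The following is a minimal sketch of a download-if-missing helper, not the workshop's own routine: `ensure_local` and its `fetch` parameter are hypothetical names, and the sketch assumes each file is served at `WORKSHOP_BASE` joined with its relative path.

```python
import os
from urllib.request import urlretrieve

def ensure_local(rel_path, base_url, fetch=urlretrieve):
    """Download base_url/rel_path to rel_path unless it already exists locally.

    Hypothetical helper; the workshop may fetch its files differently.
    """
    if os.path.exists(rel_path):
        return rel_path                      # already on disk, nothing to do
    parent = os.path.dirname(rel_path)
    if parent:
        os.makedirs(parent, exist_ok=True)   # create e.g. "models/" or "packages/"
    url = base_url.rstrip("/") + "/" + rel_path.lstrip("/")
    fetch(url, rel_path)                     # same call shape as urlretrieve(url, filename)
    return rel_path

# usage (not run here): ensure_local(AGG_METADATA, WORKSHOP_BASE)
```

Passing `fetch` as a parameter keeps the helper testable without network access.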

# Pythonic API


In [2]:
import lq
from lq.content_label import ContentLabeler

if not LQ_JWT:
    LQ_JWT = ContentLabeler.jwt_load("auth.json")
if not LQ_JWT:
    raise Exception("""
        No token detected (in LQ_JWT), please authenticate and get your JWT token.
        1. Log into the test instance of LQ - https://APP_SITE/
        2a. Get your LQ token from here - https://APP_SITE/api/lq/v1/uam/auth
        2b. OR Save the produced JSON file to the same directory as this script (as auth.json)
    """)

Exception: 
        No token detected (in LQ_JWT), please authenticate and get your JWT token.
        1. Log into the test instance of LQ - https://APP_SITE/
        2a. Get your LQ token from here - https://APP_SITE/api/lq/v1/uam/auth
        2b. OR Save the produced JSON file to the same directory as this script (as auth.json)
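
The cell above first uses `LQ_JWT` and then falls back to `auth.json`. As an illustration of that fallback only, here is a sketch of reading a token from a JSON file. `load_token` is a hypothetical stand-in for `ContentLabeler.jwt_load`, and the `token` key is an assumption; the file produced by the auth endpoint may use a different field name.

```python
import json
from pathlib import Path

def load_token(path="auth.json", key="token"):
    """Return the token stored under `key` in a JSON file, or "" if unavailable.

    Hypothetical stand-in for ContentLabeler.jwt_load; the key name is assumed.
    """
    file = Path(path)
    if not file.is_file():
        return ""                            # no file saved yet
    try:
        data = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return ""                            # file exists but is not valid JSON
    if not isinstance(data, dict):
        return ""
    return str(data.get(key) or "")

# usage (not run here): LQ_JWT = LQ_JWT or load_token("auth.json")
```

Returning an empty string on every failure path lets the caller keep the same `if not LQ_JWT:` check shown above.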
    

# Interface Exploration

## Recommendation by Association
In our user survey, the majority indicated they'd be willing to provide a few labels to a service for better product recommendations, so let's test that promise and its efficacy.  

![willingness to label](../assets/labelquest_agreement.jpg)

In the remainder of this notebook, we'll use the [IMDB 5000 dataset](https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset) that has a small collection of movies and some metadata about actors and directors.  Unfortunately, you won't be able to make direct requests to WarnerBrothers or HBO to produce your most preferred combination, but this example should demonstrate the power of rapid model refinement with just a few labels.

To jump right to the fun part, we've done a little bit of preprocessing to formulate a few models on the data.  The code in the following cell **will not be executed** because we did it for you, but it's provided here for some fun examples.

1. **Actor Affinity** - finding your preferred actor by the links to others
2. **Genre Preferences** - finding preference for genres by direct categorical links
3. **Crowd Alignment** - how closely your opinions match those of others, as determined by `likes` or `box office gross`
4. **Embedded Topics** - (advanced) a method that uses our NLP embedding to get recommendations, similar to genre work
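
To make the second idea in the list concrete, here is a small sketch of a genre-preference model built from a handful of labels. It is an illustration on toy data, not the preprocessing used for the workshop: the movie titles, the +1/-1 label encoding, and the function names `genre_preferences` and `score_unlabeled` are all assumptions.

```python
import pandas as pd

# toy data in the same shape as the IMDB 5000 "genres" column (pipe-delimited)
movies = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D", "E"],
    "genres": ["Action|Sci-Fi", "Drama", "Action|Drama", "Comedy", "Action|Comedy"],
})
# hypothetical user labels: +1 = liked, -1 = disliked
labels = pd.Series({"A": 1, "B": -1, "C": 1})

def genre_preferences(movies, labels):
    """Average the user's labels over every genre attached to a labeled movie."""
    labeled = movies[movies["movie_title"].isin(labels.index)].copy()
    labeled["label"] = labeled["movie_title"].map(labels)
    labeled["genres"] = labeled["genres"].str.split("|")
    per_genre = labeled.explode("genres").groupby("genres")["label"].mean()
    return per_genre.sort_values(ascending=False)

def score_unlabeled(movies, labels, prefs):
    """Score each unlabeled movie by the mean preference of its genres (0 if unseen)."""
    rest = movies[~movies["movie_title"].isin(labels.index)]
    scores = rest["genres"].str.split("|").apply(
        lambda gs: sum(prefs.get(g, 0.0) for g in gs) / len(gs))
    return pd.Series(scores.values, index=rest["movie_title"].values)

prefs = genre_preferences(movies, labels)
scores = score_unlabeled(movies, labels, prefs)
print(prefs)
print(scores.sort_values(ascending=False))
```

With only three labels, the unlabeled movie that shares a liked genre already ranks above the one that does not, which is the effect the labeling exercise relies on.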

In [28]:
import pandas as pd

# IMDB5000_FEAT - load the movie metadata used to train the simple models described above
df_movies = pd.read_csv(IMDB5000_FEAT)
print(df_movies.columns, len(df_movies))

# task 1: encode all features into movie-based rows
# task 2: select actors by affinity to likes

# join the director and the three billed actors into one pipe-delimited 'people' column
df_movies['people'] = df_movies['director_name'].map(str) + "|" + df_movies['actor_1_name'].map(str) + \
                         "|" + df_movies['actor_2_name'].map(str) + "|" + df_movies['actor_3_name'].map(str)
# split both pipe-delimited columns into lists, then expand to one row per (person, genre) pair
df_movies['people'] = df_movies['people'].apply(lambda x: x.split('|'))
df_movies['genres'] = df_movies['genres'].apply(lambda x: x.split('|'))
df_movies = df_movies.explode('people').explode('genres')

print(df_movies.columns, len(df_movies))
df_movies.head(5)

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object') 5043
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'co

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,people
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action,...,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,James Cameron
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Adventure,...,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,James Cameron
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Fantasy,...,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,James Cameron
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Sci-Fi,...,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,James Cameron
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action,...,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,CCH Pounder
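
The repeated index `0` in the output above comes from chaining two `explode` calls: each movie becomes one row per (person, genre) pair, so a movie with four people and four genres occupies sixteen rows. The sketch below reproduces this on a toy frame (the titles and names are made up) and shows two ways to avoid double counting afterwards.

```python
import pandas as pd

# toy frame with list-valued columns, mirroring 'people' and 'genres' after the split
toy = pd.DataFrame({
    "movie_title": ["M1", "M2"],
    "people": [["Dir A", "Actor X"], ["Dir A", "Actor Y", "Actor Z"]],
    "genres": [["Action", "Sci-Fi"], ["Drama"]],
})

long_df = toy.explode("people").explode("genres")
# M1: 2 people x 2 genres = 4 rows, M2: 3 people x 1 genre = 3 rows
print(len(long_df))

# count distinct movies per person, so repeated genre rows are not counted twice
movies_per_person = long_df.groupby("people")["movie_title"].nunique()

# explode keeps the original row index, so it identifies the source movie
rows_per_movie = long_df.groupby(level=0).size()
print(movies_per_person)
print(rows_per_movie)
```

Any aggregate computed on the exploded frame (for example a mean of `gross`) is weighted by this row multiplication unless it is grouped or de-duplicated first.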


In [29]:



# OLD CODE, but examples of things that would be run FOR the user to be used in the above experiments


import spacy
from spacy.vocab import Vocab
from spacy.tokenizer import Tokenizer
from scipy import spatial
from pathlib import Path

def text2doc(nlp, tag_raw):
    """Given a raw text input, tokenize it and drop stop words and punctuation"""
    return [x for x in nlp(tag_raw) if not x.is_stop and not x.is_punct]

def doc2vec(nlp, doc, target_domain=None):
    """Sum the word vectors of the tokens (or strings) in `doc` into one embedding"""
    # https://spacy.io/usage/vectors-similarity
    if target_domain is None:
        target_domain = nlp.vocab
    if not isinstance(doc, list):
        doc = [doc]
    tag_doc = None
    for token in doc:
        # accept plain strings as well as spaCy tokens
        text = token if isinstance(token, str) else token.text
        tag_id = target_domain.strings[text]
        if tag_id in target_domain.vectors:   # reuse the vector already in the vocab
            new_vec = target_domain.vectors[tag_id]
        elif isinstance(token, str):
            new_vec = nlp(token).vector
        else:
            new_vec = token.vector
        if tag_doc is None:
            tag_doc = new_vec.copy()   # copy so the in-place sum never touches the stored vector
        else:
            tag_doc += new_vec
    return tag_doc

# doc = text2doc(nlp, "this is a phrase to clean")
# vec = doc2vec(nlp, doc)

# also load our spacy NLP model
nlp = spacy.load('en_core_web_md')
print("NLP model ready to go!")

NLP model ready to go!
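
The `doc2vec` helper above builds an embedding by summing token vectors. A common way to compare two such embeddings is cosine similarity; the sketch below shows the idea on made-up three-dimensional vectors so it runs without a spaCy model. The vector values and the function names are illustrative assumptions, not part of the workshop code.

```python
import numpy as np

def sum_vectors(vectors):
    """Sum a list of equal-length vectors into one embedding (bag-of-words style)."""
    total = np.zeros_like(vectors[0], dtype=float)
    for vec in vectors:
        total += vec
    return total

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# toy "word vectors" (hypothetical values)
word_vec = {
    "space": np.array([1.0, 0.0, 0.0]),
    "ship":  np.array([0.0, 1.0, 0.0]),
    "love":  np.array([0.0, 0.0, 1.0]),
}

doc_a = sum_vectors([word_vec["space"], word_vec["ship"]])
doc_b = sum_vectors([word_vec["space"], word_vec["love"]])
sim = cosine_similarity(doc_a, doc_b)
print(sim)
```

Because cosine similarity ignores vector length, a summed embedding and an averaged one give the same score, which is why summing is enough here.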


In [30]:
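# let's plow through our class dataframe and do the cleaning and mapping
# NOTE: df_classes and df_flatten are not created anywhere in this notebook, so this
#   cell raises a NameError unless they are defined first
import numpy as np

# print([x for x in df_classes[df_classes['primary']==1].iloc])
print("Tokenizing and embedding classes...")
list_tokens = [""] * len(df_classes)
list_vect = [None] * len(df_classes)   # placeholders, each entry is overwritten below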
for idx in range(len(df_classes)):  # iterate row indexes
    row = df_classes.iloc[idx]
    # tokenize to remove tags, return tokens
    doc = text2doc(nlp, row["definition"]) + text2doc(nlp, row["class"])
    list_tokens[idx] = [str(x) for x in doc]
    # convert to an embedding array
    list_vect[idx] = doc2vec(nlp, doc)
df_classes["token"] = list_tokens
df_classes["embedding"] = list_vect

# now collect all of the tags that match simple text bag
print("Lookup specific tags by match...")
list_tags = df_flatten['tag'].unique()
map_tokens = {}
embed_tokens = np.zeros((len(list_tags), nlp.vocab.vectors.shape[1]))  # zeros rather than uninitialized memory
for idx in range(len(list_tags)):   # iterate through full list of known tags
    doc = text2doc(nlp, list_tags[idx])
    for x in doc:  # create map/reference to each term
        x = str(x)
        if x not in map_tokens:  # first time to see this tag?
            map_tokens[x] = []
        map_tokens[x].append(idx)   # save reference to original tag set
    embed_tokens[idx, :] = doc2vec(nlp, doc)  # compute embedding 

# df_classes["token"] = list_tokens
# df_classes["embedding"] = list_vect
print(f"Count of new text-based mapping: {len(map_tokens)}...")
print(f"Shape of embedded tag matrix: {embed_tokens.shape}...")


Tokenizing and embedding classes...


NameError: name 'df_classes' is not defined
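
Since the cell above cannot run without `df_classes` and `df_flatten`, here is a self-contained sketch of the token-to-tag lookup it builds in `map_tokens`. Lowercased whitespace splitting stands in for the spaCy tokenizer, and the tag strings are made up; both are assumptions for illustration.

```python
from collections import defaultdict

def build_token_map(tags, tokenize=str.split):
    """Map each token to the indexes of the tags that contain it."""
    token_map = defaultdict(list)
    for idx, tag in enumerate(tags):
        for token in tokenize(tag.lower()):
            token_map[token].append(idx)   # keep a reference back to the original tag
    return dict(token_map)

tags = ["space opera", "space western", "romantic comedy"]
token_map = build_token_map(tags)
print(token_map["space"])   # [0, 1]
```

Looking up a token then returns every tag that mentions it, which is the same reference structure the cell stores alongside the embedded tag matrix.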