# Generating User Embeddings
This notebook generates user embeddings for use in the Safegraph Lookalike model.

To use this notebook, edit the variables in the `Loading the data` section below.

#### Expected label-dataset columns:
- `titles`: A string of aggregated webpage titles visited by the user
- `keywords`: A string of aggregated keywords from webpages visited by the user
- `domains`: A string of aggregated domains of the webpages visited by the user

#### Outputs (see results section below):
- 2-Dimensional plot: `plotly_results/{file_name}_{embedding_type}.html`
- NPY pickle of generated embeddings: `npy_pickles/{file_name}_{embedding_type}.npy`

## Setup

In [None]:
%env TFHUB_CACHE_DIR=models
# Hide all GPUs; comment out the next line to allow GPU use (an inline comment would become part of the value)
%env CUDA_VISIBLE_DEVICES=-1
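
If this pipeline is run as a plain Python script rather than in a notebook, the `%env` magics are not available. A minimal sketch of an equivalent setup, assuming the variables are set before any TensorFlow-based library is imported:

```python
import os

# Cache downloaded TF Hub models in ./models (same value as the %env cell)
os.environ["TFHUB_CACHE_DIR"] = "models"

# "-1" hides every GPU from CUDA; delete this line to allow GPU use
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# Import TensorFlow-based libraries only after the variables are set, e.g.:
# import tensorflow_hub as hub
```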

In [None]:
from time import time

import numpy as np
import pandas as pd
import tldextract as tld

# Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

# Embeddings Models
import tensorflow_text
import tensorflow_hub as hub
import fasttext
from sentence_transformers import SentenceTransformer

# Dimensionality Reduction
from sklearn.manifold import TSNE
from umap import UMAP
from sklearn.decomposition import PCA

## Data Pre-Processing

In [None]:
def parse_domains(domains):
    """ Extracts only the primary domain for each domain in the list  """
    return ', '.join([tld.extract(s).domain for s in domains.split(',')])

def parse_keywords(keywords):
    """ Replace '_' characters with a space and add a space after each comma """
    return keywords.replace('_', ' ').replace(',', ', ')

def parse_titles(titles):
    """ Replace '|' characters with a comma """
    return titles.replace('|', ', ')

def process(df, group_name):
    """ Apply all pre-processing filters to the dataframe in place and tag it with a group name """
    df['domains'] = df['domains'].apply(parse_domains)
    df['keywords'] = df['keywords'].apply(parse_keywords)
    df['titles'] = df['titles'].apply(parse_titles)
    df['group'] = group_name
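
To make the effect of the string clean-up concrete, the sketch below repeats the keyword and title helpers and applies them to hypothetical sample strings. `parse_domains` is left out because it depends on `tldextract`; the sample values are made up for illustration.

```python
def parse_keywords(keywords):
    """Replace '_' with a space and add a space after each comma."""
    return keywords.replace('_', ' ').replace(',', ', ')

def parse_titles(titles):
    """Replace '|' separators with a comma and a space."""
    return titles.replace('|', ', ')

# Hypothetical raw values, in the shape the label dataset is expected to have
sample_keywords = "skin_care,lip_gloss"
sample_titles = "Home|Lipstick Sale"

clean_keywords = parse_keywords(sample_keywords)
clean_titles = parse_titles(sample_titles)

print(clean_keywords)  # skin care, lip gloss
print(clean_titles)    # Home, Lipstick Sale
```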

## Generating Embeddings

In [None]:
def batch(text_sets, batch_size=300):
    """ Return a batch of sentences from a set of text series """
    for texts in text_sets:
        idx = 0
        t_len = len(texts)
        while idx < t_len:
            yield texts[idx:min(idx+batch_size, t_len)]
            idx += batch_size

def gen_embeddings(emb_type, df_list):
    """ Generates embeddings from all rows of all DF's in the input list
    
    Valid Embedding Types:
        - 'USE': Generates English-only USE-4 sentence embeddings
        - 'MUSE': Generates Multilingual USE-3 sentence embeddings
        - 'SBERT': Generates English-only SBERT sentence embeddings
        - 'SBERT-POST': Generates post-trained English-only SBERT sentence embeddings
        - 'FASTTEXT': Generates English-only FastText composite word embeddings
        
    Development:
        Adding or changing models is trivial; simply extend the if-chain. Another
        viable option is to use a dictionary mapping each key to a function/lambda.
    """
    
    # Model inputs are the concatenated keywords and titles. `text` is a list
    # of pandas Series, one per input DF, holding that DF's text.
    text = [df[['keywords', 'titles']].agg(' '.join, axis=1) for df in df_list]
    
    # Generate embeddings based on the model type. The model is instantiated
    # on demand to save GPU memory.
    # TODO: use locally saved models, and perhaps store embeddings in a
    #       better-fitting data-structure
    if emb_type=='USE':
        model_use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
        return [model_use(s) for s in batch(text)]
    elif emb_type=='MUSE':
        # Did not run this to completion yet, but code "compiles"
        model_use = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")
        return [model_use(s) for s in batch(text)]
    elif emb_type=='SBERT':
        model_sbert = SentenceTransformer('bert-base-nli-mean-tokens') 
        return [model_sbert.encode(s) for s in batch(text)]
    elif emb_type=='SBERT-POST':
        model_sbert = SentenceTransformer('output/training_stsbenchmark_continue_training-bert-base-nli-mean-tokens-2020-08-21_13-47-18/')
        return [model_sbert.encode(s) for s in batch(text)]
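    elif emb_type=='FASTTEXT':
        # Untested fix: indexing the model with a whole Series raised a type error,
        # so embed one string at a time. get_sentence_vector() rejects newlines.
        model_fast = fasttext.load_model('models/cc.en.300.bin')
        return [np.vstack([model_fast.get_sentence_vector(t.replace('\n', ' ')) for t in s])
                for s in text]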
    else:
        raise KeyError("Unrecognized embedding type: {0}".format(emb_type))
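
The docstring above mentions a dictionary as an alternative to the if-chain. A minimal sketch of that idea follows, using hypothetical stand-in "models" (`load_upper_encoder`, `load_length_encoder`) so that it runs without downloading anything. The names `MODEL_LOADERS` and `gen_embeddings_dispatch` are illustrative and not part of the notebook.

```python
def batch(text_sets, batch_size=300):
    """Yield consecutive slices of each text set, never mixing sets."""
    for texts in text_sets:
        for idx in range(0, len(texts), batch_size):
            yield texts[idx:idx + batch_size]

def load_upper_encoder():
    """Hypothetical stand-in for loading a real model; returns an encode function."""
    return lambda sentences: [s.upper() for s in sentences]

def load_length_encoder():
    """Hypothetical stand-in that 'embeds' each sentence as its length."""
    return lambda sentences: [len(s) for s in sentences]

# Registry: embedding-type key -> zero-argument loader, so a model is only
# instantiated when its key is actually requested.
MODEL_LOADERS = {
    'UPPER': load_upper_encoder,
    'LENGTH': load_length_encoder,
}

def gen_embeddings_dispatch(emb_type, text_sets, batch_size=300):
    """Look up the loader for emb_type and encode every batch with it."""
    try:
        loader = MODEL_LOADERS[emb_type]
    except KeyError:
        raise KeyError("Unrecognized embedding type: {0}".format(emb_type))
    encode = loader()
    return [encode(s) for s in batch(text_sets, batch_size)]

text_sets = [["a", "bb", "ccc"], ["dddd"]]
result = gen_embeddings_dispatch('LENGTH', text_sets, batch_size=2)
print(result)  # [[1, 2], [3], [4]]
```

Note that batches never span two text sets, so the order of the returned list matches the order of the rows in the input list.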

## Post-Processing

### Dimensionality Reduction

In [None]:
def reduce(embeddings, reduce_type='tSNE'):
    """ Reduces input embeddings to 2 dimensions for visualization 
    
    Valid Reduction Types:
        - 'tSNE'
        - 'PCA'
        - 'UMAP'
        
    Development:
        Adding or changing reduction methods is trivial; simply
        extend the if-chain
    """
    # Define dimensionality reduction algorithm
    alg = PCA(n_components=2) if reduce_type == 'PCA' else \
          TSNE(n_components=2) if reduce_type == 'tSNE' else \
          UMAP() if reduce_type == 'UMAP' else None
    if alg is None:
        raise KeyError("Unrecognized reduction type: {0}".format(reduce_type))

    # Execute reduction
    principalComponents = alg.fit_transform(embeddings)
    principalDf = pd.DataFrame(data=principalComponents, 
                               columns=['principal component 1', 'principal component 2'])
    return principalDf
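
As a quick check of the shape contract, the sketch below runs PCA on random data standing in for real embeddings (the 100 x 768 size is an assumption chosen for illustration). Every reduction method is expected to return one 2-D point per input row.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in: 100 users with 768-dimensional embeddings
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(100, 768))

# Reduce to 2 components, one output row per input embedding
reduced = PCA(n_components=2).fit_transform(fake_embeddings)
print(reduced.shape)  # (100, 2)
```

t-SNE and UMAP are stochastic, so passing a fixed `random_state` to `TSNE(...)` or `UMAP(...)` is one way to make plots comparable between runs.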

### Add Labels

In [None]:
def concat_labels(df, label_dfs, char_lim=200):
    """ Attach domain, title, and group labels to an embedding DF """
    replace_endl = lambda s: s.replace('\n', '<br> ')
    domains = pd.concat([df_i['domains'] for df_i in label_dfs], ignore_index=True)
    titles = pd.concat([df_i['titles'] for df_i in label_dfs], ignore_index=True)
    groups = pd.concat([df_i['group'] for df_i in label_dfs], ignore_index=True)
    keywords = pd.concat([df_i['keywords'] for df_i in label_dfs], ignore_index=True)
    
    # Clip and wrap labels for hover text
    hover = "<br><br>Groups: " + groups.astype(str).str[:char_lim].str.wrap(75).apply(replace_endl) + \
        "<br><br>Domains: " + domains.astype(str).str[:char_lim].str.wrap(75).apply(replace_endl) + \
        "<br><br>Keywords: " + keywords.astype(str).str[:char_lim].str.wrap(75).apply(replace_endl) + \
        "<br><br>Titles: " + titles.astype(str).str[:char_lim].str.wrap(75).apply(replace_endl)
    
    # If memory becomes an issue, add only the hover text and no other label
    new_df = pd.concat([df, domains, titles, groups, hover], axis = 1)
    new_df.columns = ['principal component 1', 'principal component 2', 'domains', 'titles', 'group', 'hover']
    
    return new_df
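
The clip-and-wrap step can be hard to picture from the chained pandas calls. The sketch below applies roughly the same steps to a single string with the standard-library `textwrap` module; `make_hover_line` is a hypothetical helper, not part of the notebook.

```python
import textwrap

def make_hover_line(label, text, char_lim=200, width=75):
    """Clip text to char_lim characters, wrap at width, and use <br> for line breaks."""
    clipped = str(text)[:char_lim]
    wrapped = textwrap.fill(clipped, width=width)
    return "<br><br>{0}: ".format(label) + wrapped.replace('\n', '<br> ')

# Hypothetical long title string (450 characters before clipping)
long_title = "Lipstick Sale, " * 30
hover_line = make_hover_line("Titles", long_title)
print(hover_line)
```

Plotly renders `<br>` as a line break in hover text, which is why newlines are replaced rather than kept.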

# Results

## Loading the data

This is usually the only cell you will need to edit. Simply change the variables within this cell to fit your job, and then run the notebook.

#### Outputs
- 2-Dimensional plot: `plotly_results/{file_name}_{embedding_type}.html`
- NPY pickle of generated embeddings: `npy_pickles/{file_name}_{embedding_type}.npy`

In [None]:
num_embeddings = 2000
full_dir="data/safegraph/merge_brands/safegraph_sephora.csv"

file_name = "sephora"
embedding_type = 'SBERT-POST'
reduce_type = 'tSNE'

ful_df = pd.read_csv(full_dir).dropna().head(num_embeddings)
process(ful_df, 'sephora')

df_list = [ful_df]
total_embeddings = sum(df_i.shape[0] for df_i in df_list)
print("Loaded {0} total embeddings".format(total_embeddings))
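
Because `df_list` is a list, several groups can be embedded and plotted together. The sketch below shows one way to load two groups, using hypothetical in-memory CSVs in place of real files and a hypothetical `load_group` helper. It only sets the `group` column and skips the text clean-up that `process` performs.

```python
import io
import pandas as pd

# Hypothetical in-memory CSVs standing in for two brand files
csv_a = io.StringIO(
    "titles,keywords,domains\n"
    "Home|Sale,skin_care,sephora.com\n"
    ",lip_gloss,ulta.com\n"  # missing title, removed by dropna()
)
csv_b = io.StringIO(
    "titles,keywords,domains\n"
    "New Games,video_games,gamestop.com\n"
)

def load_group(source, group_name, limit):
    """Read a CSV, drop incomplete rows, keep at most `limit` rows, tag the group."""
    df = pd.read_csv(source).dropna().head(limit).copy()
    df['group'] = group_name
    return df

df_list = [load_group(csv_a, 'brand_a', 2000), load_group(csv_b, 'brand_b', 2000)]
total_rows = sum(df_i.shape[0] for df_i in df_list)
print("Loaded {0} total rows".format(total_rows))
```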

## Generate Embeddings
This step takes the longest to run, because the models are slow on long text inputs. The `%%capture` cell magic stores the cell's printed output in `emb_output`, so the developer can close the browser while the kernel keeps running and come back later to view the results.

**Suggested Solution:** limit the input size by selecting only the "most significant" text for each user. How to define significance is still an open question.
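
One simple reading of this suggestion is to keep only the first N words of each user's text. The sketch below assumes that reading; `truncate_words` and the word limits are hypothetical, and a ranking-based selection of "significant" text would need a different approach.

```python
def truncate_words(text, max_words=128):
    """Keep at most max_words whitespace-separated words of text."""
    words = text.split()
    return ' '.join(words[:max_words])

# Hypothetical long input: 400 words
sample = "lipstick sale " * 200
short = truncate_words(sample, max_words=10)
print(len(short.split()))  # 10
```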

#### Output from previous runs on Beast2:
- **USE game_seph:**                   Took 13.070s to generate 600 total embeddings
- **SBERT game_seph:**                 Took 50.435s to generate 600 total embeddings
- **USE all_brands:**                  Took 128.175s to generate 951 total embeddings
- **SBERT all_brands:**                Took 1204.685s to generate 951 total embeddings
- **SBERT-POST all_brands_ndomain10:** Took 10921.594s to generate 938 total embeddings
- **SBERT-POST sephora:**              Took 292.435s to generate 2000 total embeddings

In [None]:
%%capture emb_output
# ^ Capture this cell's output in `emb_output` so it can be shown after a long run

start = time()

# Currently converting a python list of arrays to a 
# single numpy array, perhaps there is a better way
embeddings = np.vstack(gen_embeddings(embedding_type, df_list))

print('Took {:.3f}s to generate {} total embeddings'.format(time()-start, total_embeddings))

In [None]:
# After the above cell is completed, print the output
emb_output.show()
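
On the question of combining the per-batch arrays: for 2-D arrays with the same number of columns, `np.concatenate(..., axis=0)` gives the same result as `np.vstack`. The sketch below checks this on small hypothetical batches.

```python
import numpy as np

# Hypothetical batch outputs: 3 and 2 embeddings with 4 dimensions each
batches = [np.ones((3, 4)), np.zeros((2, 4))]

stacked = np.vstack(batches)
joined = np.concatenate(batches, axis=0)

print(stacked.shape)  # (5, 4)
print(np.array_equal(stacked, joined))  # True
```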

## Reduce and Label Embeddings

In [None]:
start = time()

re_df = reduce(embeddings, reduce_type=reduce_type)

print('Took {:.3f}s to reduce {} total embeddings'.format(time()-start, total_embeddings))

finalDf = concat_labels(re_df, df_list)

## Visualize Results

Use plotly to render results and examine each point for in-depth analysis.

In [None]:
title = '2 component {} on {} URLs'.format(reduce_type, num_embeddings)
fig = px.scatter(finalDf, x="principal component 1", y="principal component 2", 
                 color="group", hover_data=['hover'], title=title,
                 width=1000, height=800)
fig.show()

## Saving Results
Save the above plot and the generated embeddings for later analysis and use in the Lookalike Model.

In [None]:
fn = "{0}_{1}".format(file_name, embedding_type)
fig.write_html("plotly_results/{0}.html".format(fn))
np.save("npy_pickles/{0}.npy".format(fn), embeddings)
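
`np.save` does not create missing directories, so the output folder must exist before saving. The sketch below creates the folder first and then reloads the array with `np.load`. It writes to a temporary directory and uses a small hypothetical array in place of real embeddings.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-in for the generated embeddings
embeddings = np.arange(12, dtype=np.float32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "npy_pickles")
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder already exists

    path = os.path.join(out_dir, "sephora_SBERT-POST.npy")
    np.save(path, embeddings)

    # Reload for later analysis or for use in a downstream model
    loaded = np.load(path)

print(loaded.shape)  # (3, 4)
```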