# Embeddings Analysis

Now that we have run the training flows, we can use Metaflow's Client API as a handy way to fetch results, analyze performance and decide how to iterate on embeddings, modeling approaches, and experiment design.

First we import the packages we need and define some config variables:

In [None]:
import os

for _ in range(3):
    if os.path.exists(f'{os.getcwd()}/setup.py'):
        break
    os.chdir('..')
print('Current working directory:', os.getcwd())

In [None]:
from collections import Counter
from random import choice

import matplotlib.pyplot as plt
import numpy as np
from metaflow import Flow
from sklearn.manifold import TSNE

from src.utils.styling import apply_styling

In [None]:
colors = apply_styling()
palette = colors['palette']

FLOW_NAME = 'ModelingFlow'

Let's retrieve the artifacts from the latest successful run.
The `get_latest_successful_run` function uses the `metaflow.Flow` object to look up runs by the (class) name of the flow.

In [None]:
def get_latest_successful_run(flow_name: str):
    "Gets the latest successful run."
    for run in Flow(flow_name).runs():
        if run.successful:
            return run
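Note that the helper above returns `None` implicitly when no run has succeeded, which would only surface later as an attribute error. The sketch below shows the same "first successful run" pattern with an explicit failure. It uses stand-in objects rather than real Metaflow runs, and assumes runs are yielded newest first, as the helper does; `first_successful` and `fake_runs` are hypothetical names for illustration.

```python
from types import SimpleNamespace


def first_successful(runs):
    """Return the first run flagged as successful, or raise if there is none."""
    for run in runs:
        if run.successful:
            return run
    raise ValueError("no successful run found")


# Stand-in objects, newest first; real code would iterate Flow(flow_name).runs()
fake_runs = [
    SimpleNamespace(id="3", successful=False),
    SimpleNamespace(id="2", successful=True),
    SimpleNamespace(id="1", successful=True),
]
print(first_successful(fake_runs).id)
```

Metaflow's `Flow` object also exposes a `latest_successful_run` property, which may be a shorter route to the same result.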

In [None]:
latest_run = get_latest_successful_run(FLOW_NAME)
latest_model = latest_run.data.final_vectors
latest_dataset = latest_run.data.final_dataset

First, we check that all is in order by printing the number of rows and previewing the dataset:

In [None]:
print(len(latest_dataset))
latest_dataset.head(3)

### Model vectors

Now, let's turn our attention to the model - the embedding space we trained: let's check how big it is and use it to make a test prediction.

In [None]:
print("Track vectors in the space: {}".format(len(latest_model)))

test_track = choice(list(latest_model.index_to_key))
test_vector = latest_model[test_track]
test_sims = latest_model.most_similar(test_track, topn=3)

print("Example track: '{}'".format(test_track))
print("Test vector for '{}': {}".format(test_track, test_vector[:5]))
print("Similar songs to '{}': {}".format(test_track, test_sims))

The skip-gram model we trained is an embedding space: if we did our job correctly, the space is such that tracks closer in the space are actually similar, and tracks that are far apart are pretty unrelated.
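To make "closer in the space" concrete, here is a minimal sketch of a nearest-neighbor lookup, assuming cosine similarity as the closeness measure (a common choice for embedding spaces). The vectors and track names are toy values, and `cosine_similarity` and `nearest_neighbors` are hypothetical helpers, not part of the trained model's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, vectors, topn=2):
    """Rank every other key by cosine similarity to the query key."""
    scores = [
        (key, cosine_similarity(vectors[query], vec))
        for key, vec in vectors.items()
        if key != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 2D vectors: the two "rock" tracks point in nearly the same direction
toy_vectors = {
    'rock_a': [1.0, 0.1],
    'rock_b': [0.9, 0.2],
    'rap_a': [0.1, 1.0],
}
neighbors = nearest_neighbors('rock_a', toy_vectors, topn=2)
print(neighbors)
```

In this toy space `rock_b` ranks first for `rock_a`, while `rap_a` is far less similar.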

[Judging the quality of "fantastic embeddings" is hard](https://arxiv.org/abs/2007.14906), but we point here to some common qualitative checks you can run.

In [None]:
test_track = 'Daft Punk-Get Lucky - Radio Edit'
test_sims = latest_model.most_similar(test_track, topn=5)
print(f"Similar songs to '{test_track}':")
for song in test_sims:
    print('  ',song[0])

If you use the track above as the query item in the space, you will discover a pretty interesting phenomenon: there are unfortunately many duplicates in the dataset, that is, songs which are technically different items but semantically the same, e.g. `Get Lucky - Radio Edit` vs `Get Lucky`, both by Daft Punk.

This is a problem because

i) working with dirty data may be misleading, and

ii) these issues make data sparsity worse, so the task for our model is now harder.

That said, it is cool that KNN (nearest-neighbor search) can be used to quickly identify and potentially remove duplicates, depending on your dataset and use cases.
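As a rough illustration of using nearest neighbors to spot duplicates, the sketch below flags neighbors that are both highly similar and share a name prefix with the query. Everything here is an assumption for illustration: the `artist|||title` key style, the similarity scores (made-up toy values), the `min_score` threshold, and the `flag_near_duplicates` helper itself.

```python
def flag_near_duplicates(query, neighbors, min_score=0.9):
    """Return neighbor keys that look like variants of the query track.

    A neighbor is flagged when its similarity is at least `min_score` and
    one key is a case-insensitive prefix of the other.
    """
    query_lower = query.lower()
    flagged = []
    for key, score in neighbors:
        key_lower = key.lower()
        shares_prefix = key_lower.startswith(query_lower) or query_lower.startswith(key_lower)
        if score >= min_score and shares_prefix:
            flagged.append(key)
    return flagged


# Toy (key, similarity) pairs in the shape returned by a top-n neighbor query
toy_neighbors = [
    ('Daft Punk|||Get Lucky - Radio Edit', 0.97),
    ('Artist X|||Other Song', 0.95),
    ('Artist Y|||Another Song', 0.80),
]
duplicates = flag_near_duplicates('Daft Punk|||Get Lucky', toy_neighbors)
print(duplicates)
```

A prefix check is deliberately crude; fuzzier string matching may work better on a real catalog.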

Let's map some tracks to known categories: the intuition is that songs that are similar will be colored in the same way in the chart, and so we will expect them to be close in the embedding space.

In [None]:
track_sequence = latest_dataset['track_sequence'] 
songs = [item for sublist in track_sequence for item in sublist]
song_counter = Counter(songs)

We downsample the vector space to the `TOP_N_TRACKS` most common songs to avoid crowding the plot and the analysis.

In [None]:
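TOP_N_TRACKS = 1000
# Most frequent tracks, stored in a set for fast membership checks
top_tracks = {track for track, _ in song_counter.most_common(TOP_N_TRACKS)}
# Keep only the popular tracks that also have a vector in the model
tracks = [track for track in latest_model.index_to_key if track in top_tracks]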
print(tracks)

# assert TOP_N_TRACKS == len(tracks)
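The downsampling step can be sketched on toy data as follows. It also shows one possible reason the commented-out assertion may not hold: a frequent track might have no vector in the model. The playlists and `toy_vocabulary` are made-up stand-ins for the real dataset and the model's keys.

```python
from collections import Counter

# Toy playlists: counts are a=4, b=3, c=2, d=1
toy_playlists = [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b'], ['a', 'd']]
toy_counter = Counter(song for playlist in toy_playlists for song in playlist)

# The three most common songs
top_three = {song for song, _ in toy_counter.most_common(3)}

# Stand-in for the model's keys; note that 'c' has no vector here
toy_vocabulary = ['d', 'b', 'a']
kept = [song for song in toy_vocabulary if song in top_three]
print(kept)
```

Here only two of the three most common songs survive the filter, so the number of kept tracks can be smaller than the requested top-N.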

In [None]:
tracks_to_category = {t: 'unknown' for t in tracks}

We tag songs based on keywords found in the playlist name. Of course, better heuristics are possible.

In [None]:
all_playlists_names = set(
    latest_dataset['playlist_id'].apply(lambda r: r.split('-')[1].lower().strip())
)
target_categories = [
    'rock',
    'rap',
    'country'
]

This selects the playlists containing the target keyword and marks their tracks as belonging to that category.

In [None]:
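def tag_tracks_with_category(_df, target_word, tracks_to_category):
    """Tag still-untagged tracks from playlists whose id contains target_word."""
    # case-sensitive match of target_word against the playlist id
    _df = _df[_df['playlist_id'].str.contains(target_word)]
    # debug: number of matching rows
    print(len(_df))
    # unnest the list
    songs = [item for sublist in _df['track_sequence'] for item in sublist]
    for song in songs:
        # only tag tracks we plan to plot, and never overwrite an earlier tag
        if song in tracks_to_category and tracks_to_category[song] == 'unknown':
            tracks_to_category[song] = target_word

    return tracks_to_category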


for cat in target_categories:
    print('Processing {}'.format(cat))
    tracks_to_category = tag_tracks_with_category(
        latest_dataset, cat, tracks_to_category
    )
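One consequence of this heuristic is that the first matching category wins: a track appearing in both a "rock" and a "rap" playlist keeps whichever tag is processed first. A minimal pure-Python sketch of the same logic, using made-up playlists and a hypothetical `tag_tracks` helper, makes this visible:

```python
def tag_tracks(playlists, categories):
    """Map each track to the first category whose keyword is in a playlist name.

    `playlists` is a list of (name, tracks) pairs; untagged tracks stay 'unknown'.
    """
    tags = {track: 'unknown' for _, tracks in playlists for track in tracks}
    for category in categories:
        for name, tracks in playlists:
            if category not in name:  # case-sensitive, like the cell above
                continue
            for track in tracks:
                if tags[track] == 'unknown':
                    tags[track] = category
    return tags


toy_playlists = [
    ('classic rock', ['t1', 't2']),
    ('rap hits', ['t2', 't3']),
    ('Chill', ['t4']),
]
rock_first = tag_tracks(toy_playlists, ['rock', 'rap'])
rap_first = tag_tracks(toy_playlists, ['rap', 'rock'])
print(rock_first['t2'], rap_first['t2'])
```

Track `t2` is tagged differently depending on the category order, so the order of `target_categories` matters for ambiguous tracks.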

Note: to visualize an n-dimensional space, we need to project it into 2D. We can use a dimensionality reduction technique like [TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) for this.

In [None]:
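def tsne_analysis(embeddings, perplexity=50, n_iter=1000):
    """TSNE dimensionality reduction of track embeddings. It may take a while!"""
    # Note: scikit-learn 1.5 renamed TSNE's `n_iter` argument to `max_iter`;
    # on recent versions you may need to pass `max_iter=n_iter` below instead.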
    return TSNE(
        n_components=2,
        perplexity=perplexity,
        n_iter=n_iter,
        verbose=1,
        learning_rate='auto',
        init='random',
        random_state=42
    ).fit_transform(embeddings)

In [None]:
# Add all the tagged tracks to the embedding space, on top of the popular tracks
for track, cat in tracks_to_category.items():
    # Add a track if we have a tag, if we have a vector for it
    # and if not there already
    if (
        cat in target_categories
        and track in latest_model.index_to_key
        and track not in tracks
    ):
        tracks.append(track)

print(len(tracks))

In [None]:
# extract the vectors from the model and project them in 2D
embeddings = np.array([latest_model[t] for t in tracks])
# debug, print out embedding shape
print(embeddings.shape)
tsne_results = tsne_analysis(embeddings, perplexity=10, n_iter=5000)
assert len(tsne_results) == len(tracks)

Now we can plot the 2D representations produced by the TSNE algorithm.

In [None]:
groups = {}
# Map each track to its row in tsne_results (avoids repeated list searches)
track_to_idx = {track: idx for idx, track in enumerate(tracks)}
for item, target_cat in tracks_to_category.items():
    if item not in track_to_idx:
        continue

    # 2D coordinates of this track, collected per category
    x, y = tsne_results[track_to_idx[item]]
    group = groups.setdefault(target_cat, {'x': [], 'y': []})
    group['x'].append(x)
    group['y'].append(y)

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(10, 4))

group_colors = {
    'rock': palette[0],
    'rap': palette[1],
    'country': palette[2],
    'unknown': colors['lines'],
}
for group, data in groups.items():
    axs[0].scatter(
        data['x'],
        data['y'],
        alpha=0.4 if group == 'unknown' else 0.8,
        color=group_colors[group],
        edgecolors='none',
        s=25,
        marker='o',
        label=group,
    )


# Hide the frame around the scatter plot
for side in ['top', 'bottom', 'left', 'right']:
    axs[0].spines[side].set_visible(False)
axs[0].set_title('Music in (latent) space')
axs[0].set_ylabel('')
axs[0].set_xlabel('')
axs[0].set_xticks([])
axs[0].set_yticks([])
axs[0].legend(frameon=False)
axs[1].axis('off')

fig.tight_layout(pad=3.0)
fig.savefig('data/06_viz/tsne_latent_space.png', bbox_inches='tight')
plt.show()