# Embeddings for Recommendation Systems

As we’ve mentioned, embeddings are useful well beyond language. In industry, for example, they are widely used in recommendation systems.

In this section, we’ll use the word2vec algorithm to embed songs using human-made music playlists. Imagine that we treat each song as a word or token and each playlist as a sentence. The resulting embeddings can then be used to recommend songs that often appear together in playlists.
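
To make the song-as-word analogy concrete, here is a minimal sketch of how a playlist yields the co-occurrence signal that word2vec-style training relies on. The function name `context_pairs` and the toy playlist are illustrative assumptions, not part of the dataset or of Gensim, and real implementations add details (such as randomly shrinking the window) that are left out here.

```python
from typing import List, Tuple

def context_pairs(playlist: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (target, context) pairs for songs at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(playlist):
        start = max(0, i - window)
        end = min(len(playlist), i + window + 1)
        for j in range(start, end):
            if j != i:  # a song is not its own context
                pairs.append((target, playlist[j]))
    return pairs

# Hypothetical playlist of song IDs, treated like a four-word sentence
toy_playlist = ["12", "7", "98", "3"]
pairs = context_pairs(toy_playlist, window=1)
print(pairs)
```

Songs that sit near each other in many playlists show up in many such pairs, which is what pushes their embeddings close together.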

The dataset we’ll use was collected by Shuo Chen from Cornell University. It contains playlists from hundreds of radio stations around the US. Figure 2-17 illustrates this dataset.

![Three playlists containing watched video IDs](../assets/videos_playlists.png)

Figure 2-17. For video embeddings that capture video similarity, we’ll use a dataset made up of a collection of playlists, each containing a list of videos.


Let’s demonstrate the end product before we look at how it’s built: we’ll give it a few songs and see what it recommends in response.



### Training a Song Embedding Model

We’ll start by loading the dataset containing the song playlists as well as each song’s metadata, such as its title and artist:



In [None]:
import pandas as pd
from urllib import request

# Get the playlist dataset file
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

# Parse the playlist dataset file. Skip the first two lines as
# they only contain metadata
lines = data.read().decode("utf-8").split('\n')[2:]

# Remove playlists with only one song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

# Load song metadata
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode("utf-8").split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')

In [None]:
print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])
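
The parsing step above can be tried without downloading anything. The sketch below applies the same logic (skip two metadata lines, split each line on whitespace, drop single-song playlists) to a small in-memory string. The helper name `parse_playlists` and the toy text are assumptions made for illustration, and the toy text only mimics the layout described in the comments above.

```python
from typing import List

def parse_playlists(raw_text: str, header_lines: int = 2) -> List[List[str]]:
    """Split raw text into playlists of song IDs, dropping one-song playlists."""
    lines = raw_text.split("\n")[header_lines:]
    return [line.split() for line in lines if len(line.split()) > 1]

# Toy stand-in for the dataset file: two metadata lines, then one playlist per line
toy_text = "metadata line 1\nmetadata line 2\n0 1 2 \n3 \n4 5 \n"

toy_playlists = parse_playlists(toy_text)
unique_songs = {song for playlist in toy_playlists for song in playlist}

print("Playlists kept:", len(toy_playlists))
print("Unique songs:", len(unique_songs))
```

The same two counts, computed on the real `playlists` list, give a quick sanity check of how many "sentences" and "words" the model will see.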

The next code snippet trains the model by calling `Word2Vec`. Based on the official [Gensim Word2Vec documentation](https://radimrehurek.com/gensim/models/word2vec.html), its parameters are:

* **`sentences` (`playlists`):** The input data. It must be an iterable of lists of tokens (in our case, the song IDs in each playlist).
* **`vector_size=32`:** The dimensionality of the embedding vectors. This is the size of the hidden layer whose weights become each song's 32-number representation.
* **`window=20`:** The maximum distance between the current and predicted word within a sentence. A larger window lets songs that are farther apart in a playlist count as context for each other.
* **`negative=50`:** Specifies how many "noise words" are drawn for **negative sampling**. The documentation gives 5 to 20 as the usual range, and the original word2vec paper notes that 2 to 5 can suffice for large datasets. We set it higher (50), so each observed pair is contrasted with more randomly drawn songs.
* **`min_count=1`:** The model ignores all words with a total frequency lower than this. Setting it to 1 ensures that every song in our playlists is included in the vocabulary.
* **`workers=4`:** The number of worker threads used to train the model, which speeds up training on multicore machines.
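
As a rough illustration of what `negative` controls, the sketch below builds a noise distribution and draws a handful of "noise" songs from it. It assumes the common word2vec choice of sampling in proportion to frequency raised to the power 0.75 (Gensim's default `ns_exponent`). The helper names and toy playlists are hypothetical, and this is not how Gensim implements sampling internally.

```python
import random
from collections import Counter
from typing import Dict, List

def noise_distribution(playlists: List[List[str]], exponent: float = 0.75) -> Dict[str, float]:
    """Probability of drawing each song as noise: count ** exponent, normalized."""
    counts = Counter(song for playlist in playlists for song in playlist)
    weights = {song: count ** exponent for song, count in counts.items()}
    total = sum(weights.values())
    return {song: weight / total for song, weight in weights.items()}

def draw_noise(probs: Dict[str, float], k: int, seed: int = 0) -> List[str]:
    """Draw k noise songs (with replacement) from the noise distribution."""
    rng = random.Random(seed)
    return rng.choices(list(probs), weights=list(probs.values()), k=k)

# Toy data: song "a" appears 16 times, song "b" once
toy = [["a"] * 8 + ["b"], ["a"] * 8]
probs = noise_distribution(toy)
noise = draw_noise(probs, k=5)

print(probs)
print(noise)
```

Raising counts to a power below 1 flattens the distribution, so rare songs are drawn as noise somewhat more often than their raw frequency would suggest.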

In [None]:
from gensim.models import Word2Vec

# Train our Word2Vec model
model = Word2Vec(
    playlists,
    vector_size=32,
    window=20,
    negative=50,
    min_count=1,
    workers=4
)

In [None]:
song_id = 2172

# Ask the model for songs similar to song #2172
model.wv.most_similar(positive=str(song_id))

In [None]:
print(songs_df.iloc[2172])
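
According to the Gensim documentation, `most_similar` ranks keys by cosine similarity. The sketch below reproduces that ranking in plain Python on made-up two-dimensional vectors, so the idea can be checked by hand. The function `rank_by_cosine` and the toy vectors are assumptions for illustration and are unrelated to the trained model.

```python
import math
from typing import Dict, List, Tuple

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(vectors: Dict[str, List[float]], query_id: str,
                   topn: int = 2) -> List[Tuple[str, float]]:
    """Return the topn (song ID, similarity) pairs, excluding the query itself."""
    query = vectors[query_id]
    scores = [(song_id, cosine(query, vec))
              for song_id, vec in vectors.items() if song_id != query_id]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical embeddings keyed by song ID
toy_vectors = {
    "10": [1.0, 0.0],
    "11": [0.9, 0.1],   # points almost the same way as "10"
    "12": [0.0, 1.0],   # orthogonal to "10"
    "13": [-1.0, 0.0],  # opposite of "10"
}

ranked = rank_by_cosine(toy_vectors, "10")
print(ranked)
```

Because cosine similarity depends only on direction, a song whose vector is a scaled copy of the query's would still score 1.0.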

In [None]:
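def print_recommendations(song_id):
    # most_similar returns (song ID, similarity) pairs; the IDs are strings
    similar_songs = model.wv.most_similar(positive=str(song_id), topn=5)
    # Convert the IDs to integers, since .iloc expects numeric positions
    positions = [int(similar_id) for similar_id, _ in similar_songs]
    return songs_df.iloc[positions]

# Extract recommendations
print_recommendations(2172)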

In [None]:
print_recommendations(842)