# 02 - Song Embeddings - Skipgram Recommender

In this notebook, we'll use human-made music playlists to learn song embeddings. We'll treat each playlist as a sentence and the songs it contains as words. We feed these "sentences" to the word2vec algorithm, which learns an embedding for every song in the dataset. Songs that tend to appear near each other in playlists end up with similar embeddings, so the embeddings can be used to recommend similar songs. Variants of this technique are used by Spotify, Airbnb, Alibaba, and others, where recommendations drive a large share of user activity, media consumption, and/or sales (in the case of Alibaba).
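
To make the playlist-as-sentence idea concrete, here is a minimal sketch of how skip-gram style (center, context) training pairs could be derived from one playlist. The function name `skipgram_pairs` and the toy song IDs are made up for illustration, and this is a simplification: real word2vec implementations add details such as randomized window sizes and subsampling.

```python
def skipgram_pairs(playlist, window):
    """Return (center, context) pairs for one playlist.

    Every song within `window` positions of the center song, on either
    side, is treated as a context song.
    """
    pairs = []
    for i, center in enumerate(playlist):
        start = max(0, i - window)
        stop = min(len(playlist), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, playlist[j]))
    return pairs


# Hypothetical toy playlist of song IDs (not taken from the dataset)
toy_playlist = ['12', '7', '301', '45']
pairs = skipgram_pairs(toy_playlist, window=1)
print(pairs)
```

With `window=1`, each song is paired only with its immediate neighbours; a larger window pairs it with songs further away in the same playlist.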

The [dataset we'll use](https://www.cs.cornell.edu/~shuochen/lme/data_page.html) was collected by Shuo Chen from Cornell University. The dataset contains playlists from hundreds of radio stations from around the US.

## Importing packages and dataset

In [None]:
import numpy as np
import pandas as pd
import gensim 
from gensim.models import Word2Vec
from urllib import request
import warnings
warnings.filterwarnings('ignore')

The playlist dataset is a text file in which every line represents one playlist, written as a whitespace-separated series of song IDs.

In [None]:
# Get the playlist dataset file
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

# Parse the playlist dataset file. Skip the first two lines as 
# they only contain metadata
lines = data.read().decode("utf-8").split('\n')[2:] 

# Remove playlists with only one song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]
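
The parsing steps above can be tried offline on a small stand-in string. The text below is invented for illustration and only mimics the layout described here (two metadata lines, then one playlist per line); the `toy_` names are assumptions chosen so they do not overwrite the real variables.

```python
# Hypothetical stand-in for the downloaded file contents
toy_text = "metadata line 1\nmetadata line 2\n0 1 2 3 \n4 \n5 6 \n"

# Skip the two metadata lines, as in the cell above
toy_lines = toy_text.split('\n')[2:]

# Keep only playlists with more than one song
toy_playlists = [s.rstrip().split() for s in toy_lines if len(s.split()) > 1]

print(toy_playlists)  # [['0', '1', '2', '3'], ['5', '6']]
```

The single-song playlist `4` and the empty trailing line are both dropped by the length filter.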


In [None]:
len(lines)

In [None]:
lines[0:2]

The `playlists` variable now contains a Python list. Each item in this list is a playlist, itself a list of song IDs. We can look at the first two playlists here:

In [None]:
print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])

## Training the Word2Vec Model
Our dataset is now in the shape the Word2Vec model expects as input. We pass the dataset to the model and set the following key parameters:
 * **size**: Embedding size for the songs (this parameter was renamed `vector_size` in gensim 4.0).
 * **window**: word2vec algorithm parameter -- maximum distance between the current and predicted word (song) within a sentence
 * **negative**: word2vec algorithm parameter -- Number of negative examples to use at each training step that the model needs to identify as noise
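
To illustrate what the `negative` parameter controls, here is a rough NumPy sketch of the skip-gram negative-sampling loss for a single (center, context) pair. This is not gensim's implementation; the function names and the randomly generated vectors are assumptions for illustration only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sgns_loss(center_vec, context_vec, negative_vecs):
    """Negative-sampling loss for one (center, context) pair.

    The true context song should score high against the center song,
    while each sampled negative ("noise") song should score low.
    """
    positive_term = -np.log(sigmoid(context_vec @ center_vec))
    negative_term = -np.sum(np.log(sigmoid(-(negative_vecs @ center_vec))))
    return positive_term + negative_term


rng = np.random.default_rng(0)
embedding_size = 32   # plays the role of the embedding-size parameter
num_negatives = 50    # plays the role of `negative`

# Randomly initialised toy vectors (hypothetical, not learned embeddings)
center = rng.normal(scale=0.1, size=embedding_size)
context = rng.normal(scale=0.1, size=embedding_size)
negatives = rng.normal(scale=0.1, size=(num_negatives, embedding_size))

loss = sgns_loss(center, context, negatives)
print(loss)
```

Training lowers this loss by pulling the center and context vectors together and pushing the center away from the sampled noise songs; more negatives means more noise songs contribute at each step.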


In [None]:
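# `vector_size` is the gensim >= 4.0 name for the embedding size;
# on gensim 3.x, pass `size=32` instead.
# Note: gensim trains CBOW by default (sg=0); pass sg=1 for skip-gram.
model = Word2Vec(playlists, vector_size=32, window=20, negative=50, min_count=1, workers=4)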

The model is now trained. Every song has an embedding. We only have song IDs, though, no titles or other info. Let's grab the song information file.

## Song Title and Artist File
Let's load and parse the file containing song titles and artists.

In [None]:
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode("utf-8").split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]

Now, `songs` is a list containing the ID, title, and artist of every song in our dataset. It looks like this:

In [None]:
songs[:3]

To simplify looking up song titles by ID, we'll define a pandas dataframe to hold song information.

In [None]:
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')

In [None]:
songs_df.head()

Pandas dataframes give us the ability to easily search through the columns of our dataset. We can look at the songs of a certain artist, for example.

In [None]:
songs_df[songs_df.artist == 'Rush'].head()

### Looking up songs by their IDs
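Pandas also gives us the ability to retrieve the information for multiple songs at once. Note that `iloc` selects rows by position, which matches the song ID only as long as the file lists the IDs consecutively starting from 0. Let's, for example, retrieve the info for songs number 1, 10, and 100.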

In [None]:
songs_df.iloc[[1,10,100]]
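
The difference between position-based and label-based lookup is easy to see on a small table. The sketch below uses an invented three-song dataframe (`toy_df`, with made-up titles and artists) whose IDs deliberately skip numbers, so position and ID no longer coincide.

```python
import pandas as pd

# Hypothetical three-song table; the IDs deliberately skip numbers
toy_df = pd.DataFrame(
    data=[['0', 'Song A', 'Artist X'],
          ['5', 'Song B', 'Artist Y'],
          ['9', 'Song C', 'Artist Z']],
    columns=['id', 'title', 'artist'],
).set_index('id')

by_position = toy_df.iloc[1]   # second row, whatever its ID is
by_label = toy_df.loc['5']     # row whose ID is the string '5'

print(by_position['title'], '|', by_label['title'])
```

Both lookups return the same row here, but `toy_df.iloc[5]` would fail because the table has only three rows, whereas `toy_df.loc['5']` looks the song up by its ID.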

## Recommending Similar Songs
Let's now pick a song and see which similar songs the model recommends.

In [None]:
songs_df.iloc[2172]

In [None]:
song_id = 2172

# Ask the model for songs similar to song #2172
model.wv.most_similar(positive=str(song_id))
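
`most_similar` ranks songs by the cosine similarity between their embeddings. The sketch below reproduces that idea in plain NumPy on a handful of invented 2-D vectors; the function name `most_similar_by_cosine` and the toy data are assumptions for illustration, not gensim's internals.

```python
import numpy as np


def most_similar_by_cosine(embeddings, ids, query_id, topn=2):
    """Rank songs by cosine similarity to `query_id`, excluding itself."""
    # Normalise rows so a dot product equals cosine similarity
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = unit[ids.index(query_id)]
    scores = unit @ query
    order = np.argsort(-scores)  # highest similarity first
    ranked = [(ids[i], float(scores[i])) for i in order if ids[i] != query_id]
    return ranked[:topn]


# Hypothetical song IDs and 2-D embeddings
toy_ids = ['0', '1', '2', '3']
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

result = most_similar_by_cosine(toy_embeddings, toy_ids, '0')
print(result)
```

Song `'1'` points in almost the same direction as song `'0'`, so it ranks first; song `'3'` points the opposite way and would rank last.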

Let's look up the titles and artists of these songs:

In [None]:
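# most_similar returns (song_id, similarity) pairs with string IDs;
# convert the IDs to integers before using them as row positions
similar_songs = [int(sid) for sid, _ in model.wv.most_similar(positive=str(song_id))]
songs_df.iloc[similar_songs]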

Let's define a function that prints out both the song title and the recommendations based on it:


In [None]:
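def print_recommendations(song_id):
    """Print the given song, then return its most similar songs."""
    print(songs_df.iloc[song_id])
    # most_similar returns string IDs; iloc needs integer row positions
    similar_songs = [int(sid) for sid, _ in model.wv.most_similar(positive=str(song_id))]
    return songs_df.iloc[similar_songs]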


## More Example Recommendations

### Paranoid Android - Radiohead

In [None]:
print_recommendations(19563)

### California Love - 2Pac

In [None]:
print_recommendations(842)

### Billie Jean - Michael Jackson

In [None]:
print_recommendations(3822)