# Demo music query

We perform the following steps, in order, to generate recommendations.

1. Load an embedding model and a clustering model.
2. Query by specifying a song title and any metadata to condition on.
3. Get lyrics through an API.
    - First with [this API](http://www.chartlyrics.com/api.aspx), as it's free and does not require an API key.
    - Otherwise, fall back on [this API](https://github.com/johnwmillr/LyricsGenius) to access Genius. **Note that you will need an API key, which you can create [here](https://genius.com/api-clients).**
4. Get Spotify acoustic features and metadata with [this API](https://spotipy.readthedocs.io/en/2.19.0/). **Note that you will need a client ID and secret key, which you can create [here](https://developer.spotify.com).**
5. Return top K recommendations by:
    - Computing the embedding of the query lyrics.
    - Identifying the corresponding cluster.
    - Subsetting the cluster based on the query metadata.
    
First some imports.

In [2]:
import urllib.request
import json
import numpy as np
import lyricsgenius
import re
from tqdm import tqdm
import pandas as pd
import xml.etree.ElementTree as ET
from pprint import pprint
import os
import spotipy
import torch
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

from spotipy.oauth2 import SpotifyClientCredentials
from sentence_transformers import SentenceTransformer
from transformers import AutoConfig, AutoModel,AutoModelForPreTraining, AutoTokenizer


def get_token(token_name, token_path="tokens.json"):
    """Return a credential from the environment, falling back on a JSON file."""
    token = os.environ.get(token_name)
    if not token and os.path.isfile(token_path):
        # `with` closes the file, and .get avoids a KeyError on a missing entry
        with open(token_path) as f:
            token = json.load(f).get(token_name)
    assert token, f"No value for {token_name}."
    return token


def standardize_lyrics(lyrics, i=0, verbose=False):
    if verbose:
        print(i)
    if lyrics is np.nan or len(lyrics) == 0:
        return np.nan
    
    # remove new lines
    clean = lyrics.replace("\\n\\n", ". ").replace("\\n", ". ").replace("\\", "")
    
    # remove square brackets around lyrics
    # if possible, extract chorus, pre-chorus, post-chorus, bridge, verses
    song_parts = ["Chorus", "Pre-Chorus", "Post-Chorus", "Bridge", "Verse 1", "Verse 2", "Verse 3", "Verse 4"]
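    if verbose:
        for part in song_parts:
            # text between the "[part]. " tag and the next opening bracket
            match = re.search(re.escape(f"[{part}]. ") + r"(.*?)\[", clean)
            if match and match.group(1):
                print(f"\n{part} : {match.group(1)}")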
    
    for part in song_parts:
        clean = clean.replace(f"[{part}]. ", "")
        
        
    # remove anything else in square brackets (raw string avoids invalid escapes)
    clean = re.sub(r"\[.*?\]", "", clean)
    
    # clean up: drop quotes, then leading periods, spaces and apostrophes
    clean = clean.replace('"', "").lstrip(". '")
    if not clean:
        return np.nan
    if clean.endswith("'"):
        clean = clean[:-1]
    
    clean = clean.strip().replace("\n", " ")
        
    return clean
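
To make the bracket handling concrete, here is a minimal standalone sketch of the tag-stripping step only. The helper name `strip_section_tags` and the sample string are illustrative assumptions, not part of the notebook; the tag list and the regular expression mirror the ones in `standardize_lyrics` above.

```python
import re

# Section tags that standardize_lyrics removes together with their trailing ". "
SONG_PARTS = ["Chorus", "Pre-Chorus", "Post-Chorus", "Bridge",
              "Verse 1", "Verse 2", "Verse 3", "Verse 4"]


def strip_section_tags(text):
    """Hypothetical helper: drop known '[Part]. ' tags, then any other [bracketed] span."""
    for part in SONG_PARTS:
        text = text.replace(f"[{part}]. ", "")
    # non-greedy match, so each bracket pair is removed separately
    return re.sub(r"\[.*?\]", "", text)


# Made-up input in the escaped-newline-free form the function works on
sample = "[Verse 1]. Hello there. [Chorus]. La la la [x2]"
cleaned = strip_section_tags(sample)
print(repr(cleaned))
```

Known tags are removed together with the period that follows them, while unknown tags such as `[x2]` are removed on their own, which is why stray spaces or periods can remain and are cleaned up afterwards.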

# 1) specify encoder and clustering

In [4]:
# from model hub or local path
# embeddings_model = "all-mpnet-base-v2"    # pretrained model from the hub
embeddings_model = "all-mpnet-base-v2-finetuned-genre_unfrozen_base-checkpoint-1735/checkpoint-1735"    # local fine-tuned checkpoint
n_clusters = 4
affinity = "cosine"     # "euclidean", "l1", "l2", "manhattan", "cosine", or "precomputed"; if linkage is "ward", only "euclidean" is accepted
linkage = "complete"    # one of {"ward", "complete", "average", "single"}; default is "ward"
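
The cluster assignments themselves are produced in a separate clustering notebook that is not shown here. As a rough sketch of what these three parameters control, and assuming that notebook uses scikit-learn's `AgglomerativeClustering` (the option lists in the comments above match its documentation, but this is an assumption), the following runs on synthetic data rather than the real embeddings.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

n_clusters = 4
affinity = "cosine"
linkage = "complete"

# Synthetic stand-in for the corpus embeddings (the real ones are 768-dimensional)
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(40, 8))

try:
    # newer scikit-learn releases name the distance argument `metric`
    clusterer = AgglomerativeClustering(
        n_clusters=n_clusters, metric=affinity, linkage=linkage
    )
except TypeError:
    # older releases name it `affinity`
    clusterer = AgglomerativeClustering(
        n_clusters=n_clusters, affinity=affinity, linkage=linkage
    )

# One integer label per embedding, in the range [0, n_clusters)
cluster_assignment = clusterer.fit_predict(toy_embeddings)
print(cluster_assignment.shape, np.unique(cluster_assignment))
```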

# 2) specify query

In [108]:
song_title = "the lazy song"
artist = "bruno mars"
# song_title = "we are the champions"
# artist = "queen"
genre = "country"    # ['dance pop', 'acoustic/folk', 'hip-hop/rap', 'pop', 'soul/disco', 'country', 'r&b', 'rock']
danceability = 0   # positive for more, 0 for no preference, negative for less

# 3) get lyrics

- http://www.chartlyrics.com/api.aspx
- https://github.com/johnwmillr/LyricsGenius

For the latter approach, you need an [API token](https://genius.com/api-clients), which should be added to your environment variables:
```
export GENIUS_ACCESS_TOKEN="my_access_token_here"
```
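
Alternatively, `get_token` (defined with the imports) reads a `tokens.json` file when the environment variable is not set, looking up the token name as a key. The sketch below shows the file layout this implies; the values are placeholders, and the file is written to a temporary directory here only so that the example has no side effects.

```python
import json
import os
import tempfile

# Placeholder values; replace with your own credentials.
tokens = {
    "GENIUS_ACCESS_TOKEN": "my_access_token_here",
    "SPOTIPY_CLIENT_ID": "your-spotify-client-id",
    "SPOTIPY_CLIENT_SECRET": "your-spotify-client-secret",
}

# get_token looks for "tokens.json" in the working directory by default;
# a temporary directory is used here instead.
token_path = os.path.join(tempfile.mkdtemp(), "tokens.json")
with open(token_path, "w") as f:
    json.dump(tokens, f, indent=2)

# Reading the file back mirrors what get_token does with it.
with open(token_path) as f:
    loaded = json.load(f)
print(sorted(loaded))
```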

In [99]:
GENIUS_ACCESS_TOKEN = get_token("GENIUS_ACCESS_TOKEN")

In [104]:
import urllib.parse

# query ChartLyrics first (free, no API key required)
params = urllib.parse.urlencode({"artist": artist, "song": song_title}, quote_via=urllib.parse.quote)
url = f"http://api.chartlyrics.com/apiv1.asmx/SearchLyricDirect?{params}"
contents = urllib.request.urlopen(url).read()
root = ET.fromstring(contents.decode("utf-8"))

lyrics = None   # stays None if the response has no Lyric element
for child in root:
    tag = child.tag.split("}")[-1]   # drop the XML namespace prefix
    if tag == "Lyric":
        lyrics = child.text

if lyrics:
    lyrics = lyrics.strip().replace("\n", " ")
elif GENIUS_ACCESS_TOKEN:
    # fall back on the Genius API
    print("Using Genius...")
    genius = lyricsgenius.Genius(GENIUS_ACCESS_TOKEN)
    song = genius.search_song(song_title, artist)
    if song is None:
        raise ValueError("Could not find song.")
    lyrics = standardize_lyrics(song.lyrics)
    lyrics = ' '.join(lyrics.split(' ')[:-1])[:-13]   # remove last part Genius adds
else:
    raise ValueError("Could not find song.")
    
print(lyrics)

Using Genius...
Searching for "the lazy song" by bruno mars...
Done.
Today, I don't feel like doing anything I just wanna lay in my bed Don't feel like picking up my phone So leave a message at the tone 'Cause today, I swear, I'm not doing anything   I'm gonna kick my feet up, then stare at the fan Turn the TV on, throw my hand in my pants Nobody's gon' tell me I can't, nah I'll be lounging on the couch, just chillin' in my Snuggie Click to MTV, so they can teach me how to dougie 'Cause in my castle, I'm the freaking man   Oh-oh, yes, I said it I said it, I said it, 'cause I can   Today, I don't feel like doing anything I just wanna lay in my bed Don't feel like picking up my phone So leave a message at the tone 'Cause today, I swear, I'm not doing anything Nothing at all (Woo-hoo, woo-hoo, ooh) Nothing at all (Woo-hoo, woo-hoo, ooh)  Tomorrow, I'll wake up, do some P90X Meet a really nice girl, have some really nice sex And she's gonna scream out, This is great! (Oh my God, this is gr
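
The ChartLyrics part of the lookup can be tried without network access. The sketch below builds the request URL and parses a hand-written response; the XML body, including its element names other than `Lyric` and its namespace, is a hypothetical and heavily abbreviated stand-in for the real payload.

```python
import urllib.parse
import xml.etree.ElementTree as ET

artist = "bruno mars"
song_title = "the lazy song"

# Percent-encode the query parameters (spaces become %20)
params = urllib.parse.urlencode(
    {"artist": artist, "song": song_title}, quote_via=urllib.parse.quote
)
url = f"http://api.chartlyrics.com/apiv1.asmx/SearchLyricDirect?{params}"
print(url)

# Hypothetical response body; the real one carries more fields.
sample_xml = """<GetLyricResult xmlns="http://api.chartlyrics.com/">
  <LyricSong>Example Song</LyricSong>
  <Lyric>first line
second line</Lyric>
</GetLyricResult>"""

root = ET.fromstring(sample_xml)
lyrics = None
for child in root:
    # tags look like "{namespace}Lyric"; keep only the local name
    tag = child.tag.split("}")[-1]
    if tag == "Lyric":
        lyrics = child.text

# An empty or missing Lyric element is the signal to fall back on Genius
if lyrics:
    lyrics = lyrics.strip().replace("\n", " ")
print(lyrics)
```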

# 4) get spotipy metadata and features

Be sure to get credentials from [here](https://developer.spotify.com) and save them as environment variables:
```
export SPOTIPY_CLIENT_ID='your-spotify-client-id'
export SPOTIPY_CLIENT_SECRET='your-spotify-client-secret'
```

In [105]:
SPOTIPY_CLIENT_ID = get_token("SPOTIPY_CLIENT_ID")
SPOTIPY_CLIENT_SECRET = get_token("SPOTIPY_CLIENT_SECRET")

auth_manager = SpotifyClientCredentials(client_id=SPOTIPY_CLIENT_ID, client_secret=SPOTIPY_CLIENT_SECRET)
sp = spotipy.Spotify(auth_manager=auth_manager)

In [106]:
# search for song, https://developer.spotify.com/documentation/web-api/reference/#/operations/search
query = f"track:{song_title}"
if artist is not None:
    query += f" artist:{artist}"
res = sp.search(q=query, type='track')
if not res["tracks"]["items"]:
    raise ValueError("No Spotify match for query.")

# take top entry
_id = 0
rx_song = res["tracks"]["items"][_id]["name"]
rx_artists = [a["name"] for a in res["tracks"]["items"][_id]["artists"]]
print(f"{rx_song} by {rx_artists}")

The Lazy Song by ['Bruno Mars']
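
The query construction and result handling above can be checked offline. In this sketch, `build_track_query` and `top_match` are hypothetical helper names, and `mock_res` is a mocked response containing only the fields the notebook reads, not a full Spotify payload.

```python
def build_track_query(song_title, artist=None):
    """Hypothetical helper: compose a search string with track/artist field filters."""
    query = f"track:{song_title}"
    if artist is not None:
        query += f" artist:{artist}"
    return query


def top_match(res, idx=0):
    """Hypothetical helper: (name, artist names) of the idx-th hit, or None if absent."""
    items = res["tracks"]["items"]
    if len(items) <= idx:
        return None
    item = items[idx]
    return item["name"], [a["name"] for a in item["artists"]]


# Mocked response with only the fields used above
mock_res = {"tracks": {"items": [
    {"name": "The Lazy Song", "artists": [{"name": "Bruno Mars"}]},
]}}

print(build_track_query("the lazy song", "bruno mars"))
print(top_match(mock_res))
print(top_match({"tracks": {"items": []}}))
```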


In [107]:
song_metadata = dict()
song_metadata["release_year"] = int(res["tracks"]["items"][_id]["album"]["release_date"][:4])
song_metadata["popularity"] = res["tracks"]["items"][_id]["popularity"]

# get acoustic features
acoustic_features = ["mode", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "valence", "tempo"]
uri = res["tracks"]["items"][_id]["uri"]
feat_results = sp.audio_features(uri)[0]
for _feat in acoustic_features:
    song_metadata[_feat] = feat_results[_feat]
pprint(song_metadata)

# could probably also get genre metadata from this API

{'acousticness': 0.3,
 'danceability': 0.794,
 'energy': 0.711,
 'instrumentalness': 0,
 'liveness': 0.0955,
 'loudness': -5.124,
 'mode': 0,
 'popularity': 74,
 'release_year': 2010,
 'speechiness': 0.0699,
 'tempo': 174.915,
 'valence': 0.955}


# 5) return top K recommendations

First, compute the embedding of the query lyrics.

In [109]:
if not os.path.isdir(embeddings_model):
    # coming from model hub
    model = SentenceTransformer(embeddings_model)
    query_embed = model.encode(lyrics)
else:
    # local model
    device = "cuda" if torch.cuda.is_available() else "cpu"
    config = AutoConfig.from_pretrained(f'{embeddings_model}/config.json')
    model = AutoModel.from_pretrained(f'{embeddings_model}/pytorch_model.bin',config=config)
    model.eval()
    model.to(device)
    tokenizer = AutoTokenizer.from_pretrained(embeddings_model, use_fast=True)
    
    # tokenize the single lyric, truncated to 512 tokens
    tokens = tokenizer(lyrics, max_length=512, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**tokens)['last_hidden_state']
    # mean-pool over the token axis; a single sequence has no padding to mask out
    query_embed = out.mean(1).cpu().numpy()[0]

print(query_embed.shape)

Some weights of the model checkpoint at all-mpnet-base-v2-finetuned-genre_unfrozen_base-checkpoint-1735/checkpoint-1735/pytorch_model.bin were not used when initializing MPNetModel: ['classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.dense.bias']
- This IS expected if you are initializing MPNetModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MPNetModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of MPNetModel were not initialized from the model checkpoint at all-mpnet-base-v2-finetuned-genre_unfrozen_base-checkpoint-1735/checkpoint-1735/pytorch_model.bin and are newly initialized: ['mpnet.pooler.dense.bias', 'mpnet.pooler.dense.we

(768,)
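
The local-model branch above averages `last_hidden_state` over all token positions, which is unproblematic for a single sequence because nothing is padded. If the same code were applied to a padded batch, the padding positions would enter the average. A common alternative, shown here as a NumPy sketch on toy values and not something the notebook does, is to weight the average by the attention mask; `masked_mean_pool` is a hypothetical name.

```python
import numpy as np


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, counting only positions where attention_mask == 1.

    last_hidden_state: (batch, seq_len, dim); attention_mask: (batch, seq_len)
    """
    mask = attention_mask[..., np.newaxis].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


# Toy batch: 2 sequences, 3 token positions, 2-dimensional hidden states
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                   [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]])
mask = np.array([[1, 1, 0],    # last position of the first sequence is padding
                 [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

For a sequence whose mask is all ones, this reduces to the plain mean used in the notebook.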





Next, identify the corresponding cluster.

In [110]:
if os.path.isdir(embeddings_model):
    clustering_fp = os.path.split(embeddings_model)[0]
else:
    clustering_fp = embeddings_model
clustering_fp += f"_{n_clusters}clusters_affinity={affinity}_linkage={linkage}.npy"
clustering_fp = os.path.join("clustering", clustering_fp)
print(clustering_fp)

# check if clustering already exists
if os.path.isfile(clustering_fp):
    cluster_assignment = np.load(clustering_fp)
    
else:
    # compute with clustering notebook
    raise ValueError("Cluster assignment not available.")
    
    
# compute centroids
# -- load embeddings
if os.path.isdir(embeddings_model):
    embedding_fp = f"{os.path.split(embeddings_model)[0]}_embeddings.pt"
else:
    embedding_fp = f"{embeddings_model}_embeddings.pt"
embedding_fp = os.path.join("embeddings", embedding_fp)
assert os.path.isfile(embedding_fp)
corpus_embeddings = torch.load(embedding_fp)
if torch.is_tensor(corpus_embeddings):
    corpus_embeddings = corpus_embeddings.cpu().data.numpy()
assert len(corpus_embeddings) == len(cluster_assignment)

# -- average according to cluster assignment
centroids = []
for i in range(n_clusters):
    inds = cluster_assignment == i
    centroids.append(np.mean(corpus_embeddings[inds,:], axis=0))
centroids = np.vstack(centroids)

# identify closest cluster according to correct metric
query_embed = query_embed / np.linalg.norm(query_embed, axis=-1, keepdims=True)

if affinity == "cosine":
    scores = cosine_distances(query_embed[np.newaxis, :], centroids)
elif affinity == "euclidean":
    scores = euclidean_distances(query_embed[np.newaxis, :], centroids)
else:
    raise ValueError(f"Unsupported affinity: {affinity}")
assigned_cluster = np.argmin(scores)
print(scores)
print("assigned cluster :", assigned_cluster)

clustering/all-mpnet-base-v2-finetuned-genre_unfrozen_base-checkpoint-1735_4clusters_affinity=cosine_linkage=complete.npy
[[0.5367125  0.7559653  0.72158474 0.5945208 ]]
assigned cluster : 0
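
The centroid step is easier to follow on a toy example. This sketch uses made-up 2-dimensional embeddings and a made-up assignment, and computes cosine distance directly in NumPy (`cosine_distance_matrix` is a hypothetical helper standing in for scikit-learn's `cosine_distances`).

```python
import numpy as np


def cosine_distance_matrix(a, b):
    """Pairwise cosine distance (1 - cosine similarity) between rows of a and b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_unit @ b_unit.T


# Toy corpus: 4 embeddings with a made-up cluster assignment
corpus_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
cluster_assignment = np.array([0, 0, 1, 1])
n_clusters = 2

# One centroid per cluster: the mean of the embeddings assigned to it
centroids = np.vstack([
    corpus_embeddings[cluster_assignment == i].mean(axis=0)
    for i in range(n_clusters)
])

# The query is assigned to the cluster whose centroid is closest
query_embed = np.array([0.2, 0.8])
scores = cosine_distance_matrix(query_embed[np.newaxis, :], centroids)
assigned_cluster = int(np.argmin(scores))
print(scores, assigned_cluster)
```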


Finally, subset the cluster based on the query and return the top K recommendations.

In [111]:
# subset according to cluster
inds = cluster_assignment == assigned_cluster
embeddings_subset = corpus_embeddings[inds]
dataset_path = "df_clean_v4_14122021_py35.pkl"
df_clean = pd.read_pickle(dataset_path)
df_subset = df_clean.iloc[inds]

# subset according to criteria
if genre is not None:
    # convert to a NumPy mask so it indexes the embeddings and the DataFrame alike
    ind_genre = (df_subset["genre"] == genre).to_numpy()
    embeddings_subset = embeddings_subset[ind_genre]
    df_subset = df_subset[ind_genre]
    
if danceability:
    # positive: keep songs more danceable than the query; negative: less danceable
    if danceability > 0:
        _ind = df_subset["danceability"] > song_metadata["danceability"]
    else:
        _ind = df_subset["danceability"] < song_metadata["danceability"]
    _ind = _ind.to_numpy()
    embeddings_subset = embeddings_subset[_ind]
    df_subset = df_subset[_ind]
    
# TODO other criteria
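
The filtering above and the scoring that follows rely on the DataFrame and the embedding matrix staying row-aligned. The toy sketch below, with made-up songs and 2-dimensional embeddings, shows one way to keep them aligned and to look up the best matches by position; the non-contiguous index imitates what remains after earlier subsetting.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: row i of `df` and row i of `emb` describe the same song
df = pd.DataFrame(
    {"song_name": ["a", "b", "c", "d"],
     "genre": ["country", "rock", "country", "country"]},
    index=[10, 11, 12, 13],   # labels no longer match positions
)
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]])
query = np.array([1.0, 0.0])

# Apply the same boolean mask to both objects so rows stay aligned
mask = (df["genre"] == "country").to_numpy()
df_f, emb_f = df[mask], emb[mask]

# Cosine distance from the query to each remaining row
emb_unit = emb_f / np.linalg.norm(emb_f, axis=1, keepdims=True)
scores = 1.0 - emb_unit @ (query / np.linalg.norm(query))

K = 2
topk = scores.argsort()[:K]   # positions into df_f, not index labels
best = df_f.iloc[topk]["song_name"].tolist()
print(best)
```

Because `argsort` returns positions, the lookup uses `.iloc`; label-based indexing with the same integers would point at different rows or raise a `KeyError`.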


In [112]:
# compute scores
if affinity == "cosine":
    scores = cosine_distances(query_embed[np.newaxis, :], embeddings_subset)[0]
elif affinity == "euclidean":
    scores = euclidean_distances(query_embed[np.newaxis, :], embeddings_subset)[0]
print(scores.shape)

(118,)


In [113]:
K = 10
max_len = 200  # for printing lyrics

topk = scores.argsort()[:K]
for i in topk:
    row = df_subset.iloc[i]   # positional lookup: i indexes scores, not the DataFrame's labels
    print("Score:", scores[i])
    print('Genre:', row['genre'], 'Artist:', row['artist'], 'SongName:', row['song_name'])
    print("Lyrics:", row['lyrics'][:max_len])
    print('*****')

Score: 0.5227248
Genre: country Artist: Maren Morris SongName: Rich
Lyrics: La-a-a-a-a-di-da. La-a-a-a-a-di-da. If I had a dollar every time that I swore you off. And a twenty every time that i picked up when you called. And a crisp new Benjamin for when you're here then gone
*****
Score: 0.530687
Genre: country Artist: Luke Bryan SongName: Buzzkill
Lyrics: Baby you're a buzzkill(x2). . We had a good thing but that was in the past tense. Since it went bad then you haven't been back since. You said I changed but really I never did. I just really stopped c
*****
Score: 0.5381939
Genre: country Artist: Sam Hunt SongName: House Party
Lyrics: You're on the couch, blowing up my phone. You don't want to come out, but you don't want to be alone. It don't take but two to have a little soiree. If you're in the mood, sit tight right where you ar
*****
Score: 0.5396657
Genre: country Artist: Old Dominion SongName: Snapback
Lyrics: Strictly outta curiosity. What would happen if you got with me?. Ki