# k-means music

This is an attempt to use [Spotify's audio feature data](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/) and [k-means clustering](https://en.wikipedia.org/wiki/K-means_clustering) to algorithmically generate playlists of similar songs.

I'm currently working on the project in this Jupyter notebook, but will ultimately post it to my website (link to come).

## Background

[Back in 2016](https://www.recode.net/2016/9/27/13070726/spotify-daily-mix-playlist), Spotify rolled out their Daily Mixes feature, which automatically generates playlists of songs that you've already saved to your library. Spotify had previously released a number of auto-generated playlists (Discover Weekly, Release Radar), but as the names imply, these playlists were intentionally filled with tracks that you had not already saved. The Daily Mixes were different, instead focusing on creating playlists made up of songs saved in your library. Each mix is intended to hit on a different ["listening mode or grouping"](https://newsroom.spotify.com/2018-05-18/how-your-daily-mix-just-gets-you/) specific to each person, which means you might have a lo-fi hip hop mix and a stomp-and-holler folk mix both show up.

I appreciated the concept behind the Daily Mixes, but I often found that the "listening mode" I was into at the moment was not always represented in the Daily Mixes. This made me wonder how difficult it would be to create my own generated mixes so that I could find a playlist of my own music that truly matched the vibe of a song or album I was into. This gave birth to the concept of using clustering to quickly generate a bunch of (hopefully) representative mixes.

I have previously written about and used both [k-means clustering](http://ben-tanen.com/blog/2016/03/09/clustering-with-kmeans.html) and [Spotify's API data](http://ben-tanen.com/blog/2016/08/26/spotify-popularity.html), so these two were a natural combination to try for this experiment. Spotify obviously has substantially more data available for their analyses and they have teams of much more qualified data scientists working on their algorithms, but maybe, just maybe, I would be able to crack into one of their secrets...

---

## Starter code

Let's start with some basic code setup and then get into the meat of the analysis.

#### Load relevant packages and sign into Spotify instance

I'm going to use [spotipy](https://spotipy.readthedocs.io/en/latest/) for interfacing with Spotify and then [scikit-learn](https://scikit-learn.org/) for my clustering analysis.

In [1]:
import os, json
import pandas as pd

import spotipy
import spotipy.util as util

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA # possibly going to try some PCA

In [22]:
# set API keys
apikeys = json.load(open("data/api-keys.json"))
os.environ["SPOTIPY_CLIENT_ID"]     = apikeys["spotipy-client-id"]
os.environ["SPOTIPY_CLIENT_SECRET"] = apikeys["spotipy-client-secret"]
os.environ["SPOTIPY_REDIRECT_URI"]  = apikeys["redirect-url"]

# set my user_id
user_id = '129874447'

In [60]:
# connect to spotify
token = util.prompt_for_user_token(user_id, \
                                   scope = 'user-library-read, playlist-modify-public, playlist-modify-private')
sp = spotipy.Spotify(auth = token)

#### Define helper functions for interfacing with Spotify

In [4]:
### function to get the current user's saved tracks (track name, artist, id)
def get_saved_tracks(limit = 50, offset = 0):
    saved_tracks = [ ]
    
    # get initial list of tracks to determine length
    saved_tracks_obj = sp.current_user_saved_tracks(limit = limit, offset = offset)
    num_saved_tracks = saved_tracks_obj['total']
    
    # loop through to get all saved tracks
    while (offset < num_saved_tracks):
        saved_tracks_obj = sp.current_user_saved_tracks(limit = limit, offset = offset)
        
        # add track information to running list
        for track_obj in saved_tracks_obj['items']:
            saved_tracks.append({
                'name': track_obj['track']['name'],
                'artists': ', '.join([artist['name'] for artist in track_obj['track']['artists']]),
                'track_id': track_obj['track']['id']
            })
            
        offset += limit
        
    return saved_tracks

### function to get tracks from a specified playlist (track name, artist, id)
def get_playlist_tracks(user_id, playlist_id, limit = 100, offset = 0):
    playlist_tracks = [ ]
    
    # get initial list of tracks in playlist to determine length
    playlist_obj = sp.user_playlist_tracks(user = user_id, playlist_id = playlist_id, \
                                           limit = limit, offset = offset)
    num_playlist_tracks = playlist_obj['total']
    
    # loop through to get all playlist tracks
    while (offset < num_playlist_tracks):
        playlist_obj = sp.user_playlist_tracks(user = user_id, playlist_id = playlist_id, \
                                               limit = limit, offset = offset)

        # add track information to running list
        for track_obj in playlist_obj['items']:
            playlist_tracks.append({
                'name': track_obj['track']['name'],
                'artists': ', '.join([artist['name'] for artist in track_obj['track']['artists']]),
                'track_id': track_obj['track']['id']
            })
            
        offset += limit
        
    return playlist_tracks

### function to get spotify audio features when given a list of track ids
def get_audio_features(track_ids):
    saved_tracks_audiofeat = [ ]
    
    # iterate through track_ids in groups of 50
    for ix in range(0,len(track_ids),50):
        audio_feats = sp.audio_features(track_ids[ix:ix+50])
        saved_tracks_audiofeat += audio_feats
        
    return saved_tracks_audiofeat

### function to get all of the current user's playlists (playlist names, ids)
def get_all_user_playlists(playlist_limit = 50, playlist_offset = 0):
    # get initial list of user's playlists (first n = playlist_limit), determine total number of playlists
    playlists_obj = sp.user_playlists(user_id, limit = playlist_limit, offset = playlist_offset)
    num_playlists = playlists_obj['total']

    # start accumulating playlist names and ids
    all_playlists = [{'name': playlist['name'], 'id': playlist['id']} for playlist in playlists_obj['items']]
    playlist_offset += playlist_limit

    # continue accumulating through all playlists
    while (playlist_offset < num_playlists):
        playlists_obj = sp.user_playlists(user_id, limit = playlist_limit, offset = playlist_offset)
        all_playlists += [{'name': playlist['name'], 'id': playlist['id']} for playlist in playlists_obj['items']]
        playlist_offset += playlist_limit
        
    return(all_playlists)
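
The helpers above all repeat the same offset-based paging loop. As a rough sketch of that pattern (not part of the original notebook), the loop could be factored into one generic function. `fetch_all_items` and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for a spotipy call such as `sp.current_user_saved_tracks`, assuming only that each response is a dict with `items` and `total` keys, as the code above relies on.

```python
def fetch_all_items(fetch_page, limit = 50):
    """Collect every item from an offset-paginated endpoint.

    fetch_page(limit, offset) is assumed to return a dict with
    'items' (the current page) and 'total' (the overall item count).
    """
    items, offset = [ ], 0
    while True:
        page = fetch_page(limit = limit, offset = offset)
        items += page['items']
        offset += limit
        # stop once the offset has passed the reported total (or a page comes back empty)
        if offset >= page['total'] or not page['items']:
            break
    return items

# hypothetical stand-in for a Spotify endpoint: 120 fake track ids, served page by page
FAKE_LIBRARY = ['track_%03d' % ix for ix in range(120)]

def fake_fetch_page(limit, offset):
    return {'items': FAKE_LIBRARY[offset:offset + limit], 'total': len(FAKE_LIBRARY)}

all_items = fetch_all_items(fake_fetch_page, limit = 50)
print(len(all_items))
```

With a real client, something like `fetch_all_items(sp.current_user_saved_tracks)` should behave the same way, and it has the small side benefit of not requesting the first page twice.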

## An initial rushed attempt (a.k.a., how not to cluster)

With this code set up, let's get started! First, we'll pull in a list of all of my saved tracks and then merge on the audio feature data associated with these songs. From there, we should be able to let the clustering algorithm loose, right?

In [29]:
# get list of saved songs
saved_tracks    = get_saved_tracks()
saved_tracks_df = pd.DataFrame(saved_tracks)

print("tracks: %d" % saved_tracks_df.shape[0])
saved_tracks_df.head()

tracks: 2532


| | artists | name | track_id |
|---|---|---|---|
| 0 | J. Cole | BRACKETS | 5sWbwccBcyHsg5LEKWGZo9 |
| 1 | Nathaniel Rateliff & The Night Sweats | You Worry Me | 6jwsbxD1nvTc4UGLgRoCa6 |
| 2 | Jorja Smith | The One | 1Ahp4PZ1vzdbzBCedUrsqI |
| 3 | Anderson .Paak, Kendrick Lamar | Tints (feat. Kendrick Lamar) | 7c3SbTuufigBWURcICnAWy |
| 4 | Anderson East | This Too Shall Last | 0CuXzMEgFzuQhLEYQHYas4 |


In [27]:
# get audio features for saved songs
saved_tracks_audiofeat    = get_audio_features(track_ids = list(saved_tracks_df['track_id']))
saved_tracks_audiofeat_df = pd.DataFrame(saved_tracks_audiofeat).drop(['analysis_url', 'track_href', \
                                                                       'type', 'uri'], axis = 1)

# merge audio features onto tracks df
saved_tracks_plus_df = saved_tracks_df.merge(saved_tracks_audiofeat_df, how = 'left', \
                                             left_on = 'track_id', right_on = 'id').drop('id', axis = 1)
saved_tracks_plus_df.head()

Unnamed: 0,artists,name,track_id,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,J. Cole,BRACKETS,5sWbwccBcyHsg5LEKWGZo9,0.19,0.675,315771,0.567,2e-06,10,0.175,-9.147,0,0.263,84.039,4,0.658
1,Nathaniel Rateliff & The Night Sweats,You Worry Me,6jwsbxD1nvTc4UGLgRoCa6,0.124,0.666,214160,0.646,0.286,2,0.095,-7.722,1,0.0268,100.05,4,0.664
2,Jorja Smith,The One,1Ahp4PZ1vzdbzBCedUrsqI,0.252,0.312,197575,0.661,0.000135,0,0.102,-8.508,0,0.092,82.39,4,0.443
3,"Anderson .Paak, Kendrick Lamar",Tints (feat. Kendrick Lamar),7c3SbTuufigBWURcICnAWy,0.0859,0.805,268400,0.833,0.0,1,0.0578,-6.73,0,0.12,109.076,4,0.703
4,Anderson East,This Too Shall Last,0CuXzMEgFzuQhLEYQHYas4,0.323,0.503,222400,0.49,0.799,4,0.0873,-8.553,1,0.0279,148.884,4,0.68


With a full table of songs and hopefully meaningful audio features, we should be good to let the scikit-learn function do its thing.

The goal of this exercise is to make mixes similar to Spotify's Daily Mixes. Their mixes are technically endless (they grow as you listen), but for now let's shoot for playlists of 10 - 20 songs, which should be small enough to see whether we are getting meaningful results. As of writing this, I have more than 2,500 tracks saved, so creating `k = 200` clusters should give roughly 12 - 13 tracks per cluster on average.
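
As a quick sanity check on that choice of `k`, here is the arithmetic as a small sketch. It assumes perfectly even clusters, which k-means does not guarantee, so individual clusters can end up noticeably smaller or larger than the average.

```python
import math

num_tracks = 2532                 # size of my saved library (printed above)
min_size, max_size = 10, 20       # target playlist length

# if clusters were perfectly even, k would need to fall in this range to hit the target size
k_min = math.ceil(num_tracks / max_size)
k_max = math.floor(num_tracks / min_size)

k = 200
avg_cluster_size = num_tracks / k

print("k between %d and %d; k = %d gives %.1f tracks per cluster on average"
      % (k_min, k_max, k, avg_cluster_size))
# → k between 127 and 253; k = 200 gives 12.7 tracks per cluster on average
```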

In [34]:
# try clustering on the full dataset, excluding the non-numeric variables
kmeans = KMeans(n_clusters = 200).fit(saved_tracks_plus_df.drop(['track_id', 'name', 'artists'], axis = 1))

# add results to df
saved_tracks_plus_df['cluster'] = pd.Series(kmeans.labels_) + 1

With our tracks clustered together, let's take a look at a few and see what we've got!

In [36]:
saved_tracks_plus_df[saved_tracks_plus_df['cluster'] == 1]

Unnamed: 0,artists,name,track_id,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,cluster
64,The Vaccines,Put It On a T-Shirt - Acoustic Version,68hDyCIudkojeX9TO85EgH,0.965,0.67,162227,0.155,0.0,0,0.116,-12.68,1,0.0444,120.352,3,0.216,1
442,Otis Redding,Shake,13snzJGKyzUtWxyXsfhk5t,0.323,0.576,161667,0.555,0.000185,0,0.107,-9.634,1,0.0774,164.788,4,0.863,1
465,Logic,Everybody,7cGFbx7MP0H23iHZTZpqMM,0.158,0.885,162347,0.94,0.0,1,0.0675,-5.908,1,0.0909,110.005,4,0.77,1
510,Chuck Berry,Johnny B. Goode,4Hbe0lRKsXtDZ2wQIovz7I,0.748,0.522,161560,0.806,5.5e-05,10,0.313,-9.097,1,0.0817,168.078,4,0.969,1
717,Susto,"Friends, Lovers, Ex-Lovers: Whatever",4WrceuCclK7z1Y24Re8U1p,0.175,0.499,161789,0.721,0.00312,0,0.0682,-7.353,1,0.0364,129.98,4,0.441,1
967,The Colourist,Little Games,0nWWxoOmEPUtAHRiFOSAMc,0.000504,0.571,161987,0.82,0.00308,10,0.159,-4.745,1,0.0451,110.496,4,0.649,1
1013,Otis Redding,Shake - Remastered Stereo,6RkyopJ2y0DnoIrq57zrap,0.2,0.583,161480,0.471,0.0037,7,0.0871,-9.712,0,0.0692,163.364,4,0.796,1
1682,Jimmy Cliff,Many Rivers To Cross,2jQQJgvmr8fmTsWULa2pct,0.693,0.388,161133,0.297,0.194,5,0.134,-13.68,1,0.0721,139.784,4,0.232,1
1957,Junip,Far Away,7hB7p58iHbZNuxJnmkv8qx,0.561,0.421,162293,0.734,0.739,2,0.262,-8.711,1,0.0344,167.564,4,0.437,1
2476,AWOLNATION,THISKIDSNOTALRIGHT,0ESptRacCBJUeBMSPleIP8,0.0259,0.483,161200,0.826,0.000546,1,0.408,-4.982,1,0.0705,200.042,4,0.519,1


In [43]:
saved_tracks_plus_df[saved_tracks_plus_df['cluster'] == 30]

Unnamed: 0,artists,name,track_id,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,cluster
102,clipping.,Face,2fPYXByPcdTO82OKQpNFrC,0.0956,0.711,119745,0.926,0.00039,2,0.0839,-5.287,1,0.343,98.015,4,0.408,30
752,You Won't,Untitled 1,23wVZSH44ueGWiEQjIRqKR,0.148,0.144,119357,0.519,0.0,9,0.138,-7.387,1,0.0479,77.588,4,0.0953,30
992,"Lil Dicky, Hannibal Buress",Hannibal Interlude (feat. Hannibal Buress),7xVwwsQeCIYYafI9hiehii,0.0218,0.77,120691,0.665,0.0,1,0.656,-7.605,1,0.373,120.054,4,0.766,30
1070,Houndmouth,Krampus,3SpkH1bfjt04RBH4mUV3yU,0.52,0.407,120147,0.835,0.000684,7,0.111,-7.258,1,0.0905,121.945,4,0.106,30
1638,Neutral Milk Hotel,King of Carrot Flowers Pt. 1,17Nowmq4iF2rkbd1rAe1Vt,0.11,0.419,120427,0.519,0.889,5,0.409,-6.47,1,0.0334,94.044,4,0.343,30
2519,Aaron Embry,Raven's Song,4HC29wxOQARYNfvIXymKEw,0.898,0.386,120053,0.282,0.0895,3,0.314,-11.147,1,0.0297,99.825,4,0.435,30


These clusters don't look *that* great. The songs all seem pretty different from each other, or at least no more similar than if I were to just take a random sample from my saved library. I don't think I would be in the mood to listen to Neutral Milk Hotel and clipping at the same time. What could be going on?

Just from scanning the data, it seems like all of the songs in a cluster do have one thing in common: song length. The songs that are being grouped together appear to share a relatively similar `duration_ms`, which makes sense because that variable (measured in milliseconds) has far more variance than the other metrics, many of which range from 0 to 1. The k-means algorithm works on squared Euclidean distances, so it is going to be driven by the variables with the largest spread, even if that is not our intention. This is why you are *supposed* to normalize your data first!

While this is a slightly interesting result - maybe I want to listen to exactly 13 songs in 45 minutes, so I need all of my songs to be 3:28 long - it was not exactly what I was shooting for. It does highlight how important it is for me to normalize and center my data though (silly mistake on my part)!
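
To make the scale problem concrete, here is a minimal sketch (not from the original analysis) using just two features, `danceability` and `duration_ms`, for the first five saved tracks shown earlier. It measures how much of the squared Euclidean distance between two tracks comes from duration, before and after z-scoring each column. The helper names `standardize` and `duration_share` are made up for this illustration; the z-scoring uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
from statistics import mean, pstdev

# (danceability, duration_ms) for the first five saved tracks shown earlier
tracks = [
    (0.675, 315771),   # BRACKETS
    (0.666, 214160),   # You Worry Me
    (0.312, 197575),   # The One
    (0.805, 268400),   # Tints (feat. Kendrick Lamar)
    (0.503, 222400),   # This Too Shall Last
]

def standardize(rows):
    """z-score each column: subtract the column mean, divide by the column standard deviation."""
    columns = list(zip(*rows))
    means   = [mean(col) for col in columns]
    stdevs  = [pstdev(col) for col in columns]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stdevs)) for row in rows]

def duration_share(a, b):
    """Fraction of the squared Euclidean distance between two tracks that comes from duration."""
    sq_diffs = [(x - y) ** 2 for x, y in zip(a, b)]
    return sq_diffs[1] / sum(sq_diffs)

scaled = standardize(tracks)

# compare "The One" with "Tints", which differ a lot in danceability
raw_share    = duration_share(tracks[2], tracks[3])
scaled_share = duration_share(scaled[2], scaled[3])

print("raw: %.6f, standardized: %.2f" % (raw_share, scaled_share))
```

On the raw values, duration accounts for essentially all of the distance between these two tracks; after standardizing, it drops to roughly a quarter, so danceability finally gets a say.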

---

## Validating the method

Before rushing into another attempt (with normalized data), it may also be worth taking a step back and seeing if clustering based on this audio feature data even produces meaningful results. One of the difficulties of any clustering analysis is validating your results and knowing if the clusters that are output even make sense. Before getting too ahead of myself, I want to try clustering the songs of two very different pre-made playlists and see if I have any luck. If that works, great; if not, I might be SOL.

For ease, I can use Spotify's massive selection of playlists to find two drastically different playlists. To start, let's compare the smooth and calm sounds of the ["Ambient Chill"](https://open.spotify.com/user/spotify/playlist/37i9dQZF1DX3Ogo9pFvBkY) playlist to the more "lit" musings of ["Get Turnt"](https://open.spotify.com/user/spotify/playlist/37i9dQZF1DWY4xHQp97fN6?si=D3-isiIdRRuwzbfPkHHh4w).

#### Getting the tracks from "Ambient Chill" and "Get Turnt"

In [53]:
# get tracks for "ambient chill" playlist
testA_tracks    = get_playlist_tracks(user_id = 'spotify', playlist_id = '37i9dQZF1DX3Ogo9pFvBkY')
testA_tracks_df = pd.DataFrame(testA_tracks)
testA_tracks_df['playlist'] = "ambient chill"

# get tracks for "get turnt" playlist
testB_tracks    = get_playlist_tracks(user_id = 'spotify', playlist_id = '37i9dQZF1DWY4xHQp97fN6')
testB_tracks_df = pd.DataFrame(testB_tracks)
testB_tracks_df['playlist'] = "get turnt"

# stack all tracks together
testAB_tracks_df = pd.concat([testA_tracks_df, testB_tracks_df]).sort_values(by = "track_id") # DataFrame.append was removed in pandas 2.0
testAB_tracks_df.head()

| | artists | name | track_id | playlist |
|---|---|---|---|---|
| 88 | Saweetie, London On Da Track, G-Eazy, Rich The... | Up Now (feat. G-Eazy and Rich The Kid) | 01TreyTchXP0J1Mn6wcVHt | get turnt |
| 12 | French Montana, Drake | No Stylist | 04MLEeAMuV9IlHEsD8vF6A | get turnt |
| 44 | David Wingo | Taken Away | 0DUqyYJuKmVD6LDjUYDjgY | ambient chill |
| 61 | The Carters | APESHIT | 0E6PsO3ymCfUh7pJQjBgkj | get turnt |
| 20 | 6ix9ine, Kanye West | KANGA (feat. Kanye West) | 0EwS8XJQspgrj1zSUXhFkl | get turnt |


#### Get the audio features for our set of songs

In [54]:
_testAB_audiofeat    = get_audio_features(track_ids = list(testAB_tracks_df['track_id']))
_testAB_audiofeat_df = pd.DataFrame(_testAB_audiofeat).drop(['analysis_url', 'track_href', 'type', 'uri'], axis = 1)

_testAB_audiofeat_df.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,id,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,0.00294,0.956,198088,0.544,01TreyTchXP0J1Mn6wcVHt,0.0,1,0.104,-6.41,1,0.0822,104.011,4,0.127
1,0.0215,0.765,192172,0.704,04MLEeAMuV9IlHEsD8vF6A,0.0,5,0.227,-4.589,0,0.127,147.055,4,0.498
2,0.979,0.19,93160,0.0199,0DUqyYJuKmVD6LDjUYDjgY,0.92,2,0.0873,-28.495,0,0.044,89.936,3,0.0343
3,0.0133,0.705,264853,0.784,0E6PsO3ymCfUh7pJQjBgkj,0.0,2,0.168,-6.477,1,0.271,160.035,4,0.377
4,0.0192,0.901,132331,0.598,0EwS8XJQspgrj1zSUXhFkl,0.0,1,0.0791,-6.934,1,0.17,117.9,4,0.355


#### Normalize the audio feature data before merging on and clustering

In [55]:
testAB_audiofeat_scaler = StandardScaler()

testAB_audiofeat    = testAB_audiofeat_scaler.fit_transform(_testAB_audiofeat_df.drop(['id'], axis = 1))
testAB_audiofeat_df = pd.DataFrame(testAB_audiofeat, columns = _testAB_audiofeat_df.drop('id', axis = 1).columns)
testAB_audiofeat_df['id'] = _testAB_audiofeat_df['id']

testAB_audiofeat_df.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,id
0,-1.087991,1.28261,0.068631,0.541926,-0.930963,-0.893109,-0.493509,0.81802,0.763763,-0.354083,-0.379299,0.210567,-0.627097,01TreyTchXP0J1Mn6wcVHt
1,-1.043656,0.681089,-0.024785,1.126576,-0.930963,0.187724,0.845658,1.034872,-1.309307,0.048287,0.951187,0.210567,0.847257,04MLEeAMuV9IlHEsD8vF6A
2,1.24353,-1.129771,-1.588209,-1.37317,1.1511,-0.622901,-0.675331,-1.811947,-1.309307,-0.697176,-0.814355,-1.607968,-0.995487,0DUqyYJuKmVD6LDjUYDjgY
3,-1.063244,0.49213,1.122867,1.418901,-0.930963,-0.622901,0.203293,0.810042,0.763763,1.341621,1.352398,0.210567,0.366403,0E6PsO3ymCfUh7pJQjBgkj
4,-1.04915,1.109397,-0.969689,0.739245,-0.930963,-0.893109,-0.764609,0.75562,0.763763,0.434491,0.050009,0.210567,0.278975,0EwS8XJQspgrj1zSUXhFkl


#### Merge on (normalized) audio feature data and try clustering

In [56]:
testAB_tracks_plus_df = testAB_tracks_df.merge(testAB_audiofeat_df, how = 'left', \
                                               left_on = 'track_id', right_on = 'id').drop('id', axis = 1)
testAB_tracks_plus_df.head()

Unnamed: 0,artists,name,track_id,playlist,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,"Saweetie, London On Da Track, G-Eazy, Rich The...",Up Now (feat. G-Eazy and Rich The Kid),01TreyTchXP0J1Mn6wcVHt,get turnt,-1.087991,1.28261,0.068631,0.541926,-0.930963,-0.893109,-0.493509,0.81802,0.763763,-0.354083,-0.379299,0.210567,-0.627097
1,"French Montana, Drake",No Stylist,04MLEeAMuV9IlHEsD8vF6A,get turnt,-1.043656,0.681089,-0.024785,1.126576,-0.930963,0.187724,0.845658,1.034872,-1.309307,0.048287,0.951187,0.210567,0.847257
2,David Wingo,Taken Away,0DUqyYJuKmVD6LDjUYDjgY,ambient chill,1.24353,-1.129771,-1.588209,-1.37317,1.1511,-0.622901,-0.675331,-1.811947,-1.309307,-0.697176,-0.814355,-1.607968,-0.995487
3,The Carters,APESHIT,0E6PsO3ymCfUh7pJQjBgkj,get turnt,-1.063244,0.49213,1.122867,1.418901,-0.930963,-0.622901,0.203293,0.810042,0.763763,1.341621,1.352398,0.210567,0.366403
4,"6ix9ine, Kanye West",KANGA (feat. Kanye West),0EwS8XJQspgrj1zSUXhFkl,get turnt,-1.04915,1.109397,-0.969689,0.739245,-0.930963,-0.893109,-0.764609,0.75562,0.763763,0.434491,0.050009,0.210567,0.278975


In [57]:
# try clustering full stack of songs into two distinct playlists
kmeans = KMeans(n_clusters = 2).fit(testAB_tracks_plus_df.drop(['track_id', 'name', \
                                                                'artists', 'playlist'], axis = 1))
testAB_tracks_plus_df['cluster'] = pd.Series(kmeans.labels_) + 1

# see if successful (hopefully see the playlists are clustered mutually exclusively)
testAB_tracks_plus_df[['track_id', 'playlist', 'cluster']].groupby(['playlist', 'cluster']).agg('count')

| playlist | cluster | track_id |
|---|---|---|
| ambient chill | 1 | 90 |
| get turnt | 2 | 100 |


Aha! Using normalized audio feature data, the songs do get properly clustered into their respective, mutually exclusive playlists. To an extent, this is validation of the method, at least on two quite different sounding sets of songs.

As a secondary experiment, before trying it on my own library, I want to see how it performs on more similar playlists.

But first, I'm going to throw this code into some functions for later use.

#### Define functions to more easily create our necessary dataframes and cluster

In [58]:
### function to create "tracks plus" df (including normalized audio features) when given a tracks df
def build_tracks_plus_df(tracks_df, normalize = True):
    # get raw audio features
    _audiofeat    = get_audio_features(track_ids = list(tracks_df['track_id']))
    _audiofeat_df = pd.DataFrame(_audiofeat).drop(['analysis_url', 'track_href', 'type', 'uri'], axis = 1)
    
    # scale audio features (if desired)
    if normalize:
        scaler = StandardScaler()
        audiofeat    = scaler.fit_transform(_audiofeat_df.drop(['id'], axis = 1))
        audiofeat_df = pd.DataFrame(audiofeat, columns = _audiofeat_df.drop('id', axis = 1).columns)
        audiofeat_df['id'] = _audiofeat_df['id']
    else:
        audiofeat_df = _audiofeat_df
    
    # merge audio features with tracks_df
    tracks_plus_df = tracks_df.merge(audiofeat_df, how = 'left', left_on = 'track_id', right_on = 'id')
    return(tracks_plus_df)

### function to cluster tracks based on normalized audio features
def cluster_tracks_plus_df(tracks_plus_df, num_clusters, drop_vars = None):
    kmeans = KMeans(n_clusters = num_clusters).fit(tracks_plus_df.drop(['track_id', 'id', 'name', 'artists'] + \
                                                                       (drop_vars if drop_vars is not None else []), \
                                                                       axis = 1))
    tracks_plus_df['cluster'] = pd.Series(kmeans.labels_) + 1
    return(tracks_plus_df)

This time around, let's try the same approach but on *three* playlists that are all pretty similar (really just slight variants of the same type of indie music). Because these are all pretty similar to one another (**## insert explanation of why ##**), I don't expect these results to be as clean as the last run, but we'll see!

In [64]:
# get tracks for "lo-fi indie" playlist
testC_tracks    = get_playlist_tracks(user_id = 'spotify', playlist_id = '37i9dQZF1DX0CIO5EOSHeD')
testC_tracks_df = pd.DataFrame(testC_tracks)
testC_tracks_df['playlist'] = "lo-fi indie"

# get tracks for "dreampop" playlist
testD_tracks    = get_playlist_tracks(user_id = 'spotify', playlist_id = '37i9dQZF1DX6uhsAfngvaD')
testD_tracks_df = pd.DataFrame(testD_tracks)
testD_tracks_df['playlist'] = "dreampop"

# get tracks for "bedroom pop" playlist
testE_tracks    = get_playlist_tracks(user_id = 'spotify', playlist_id = '37i9dQZF1DXcxvFzl58uP7')
testE_tracks_df = pd.DataFrame(testE_tracks)
testE_tracks_df['playlist'] = "bedroom pop"

# stack all tracks together
testCDE_tracks_df = pd.concat([testC_tracks_df, testD_tracks_df, testE_tracks_df]) # DataFrame.append was removed in pandas 2.0
testCDE_tracks_df.head()

# build plus df and cluster
testCDE_tracks_plus_df = cluster_tracks_plus_df(build_tracks_plus_df(testCDE_tracks_df), 3, drop_vars = ['playlist'])
testCDE_tracks_plus_df[['track_id', 'playlist', 'cluster']].groupby(['playlist', 'cluster']).agg('count')

| playlist | cluster | track_id |
|---|---|---|
| bedroom pop | 1 | 12 |
| bedroom pop | 2 | 92 |
| bedroom pop | 3 | 30 |
| dreampop | 1 | 2 |
| dreampop | 2 | 8 |
| dreampop | 3 | 54 |
| lo-fi indie | 1 | 1 |
| lo-fi indie | 2 | 18 |
| lo-fi indie | 3 | 35 |


The results aren't as clean on this try, but that makes sense because these are similar sounding playlists. When I listened through them myself, it wasn't entirely obvious which playlist a given song belonged in, so it would be a lot to expect the clustering algorithm to figure it out.
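
One way to put a number on "not as clean" is cluster purity: the share of tracks that sit in a cluster where their own playlist is the most common one. This metric is not part of the original analysis, and `cluster_purity` is a hypothetical helper, so treat this as a rough sketch using the counts from the groupby output above.

```python
# track counts per (playlist, cluster), copied from the groupby output above
counts = {
    'bedroom pop': {1: 12, 2: 92, 3: 30},
    'dreampop':    {1: 2,  2: 8,  3: 54},
    'lo-fi indie': {1: 1,  2: 18, 3: 35},
}

def cluster_purity(counts):
    """Share of tracks that belong to the most common playlist within their cluster."""
    clusters = sorted({c for per_playlist in counts.values() for c in per_playlist})
    total    = sum(sum(per_playlist.values()) for per_playlist in counts.values())
    majority = sum(max(per_playlist.get(c, 0) for per_playlist in counts.values()) for c in clusters)
    return majority / total

purity = cluster_purity(counts)
print("purity: %.3f" % purity)
# → purity: 0.627
```

For comparison, the "Ambient Chill" vs. "Get Turnt" run (90 tracks in one cluster, 100 in the other, no overlap) has a purity of 1.0 under the same definition.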

Nonetheless, the goal of this experiment is not to make perfectly partitioned playlists purely based on Spotify's own genres, but instead to create playlists that have similar vibes, hopefully grouping together songs that are not entirely obvious at first listen.

### Get track and audio feature information for all of my saved tracks

In [12]:
# get list of saved songs
saved_tracks_df      = pd.DataFrame(get_saved_tracks())
saved_tracks_plus_df = build_tracks_plus_df(saved_tracks_df)

In [13]:
saved_tracks_clustered1_df = cluster_tracks_plus_df(saved_tracks_plus_df, 200)
saved_tracks_clustered1_df[saved_tracks_clustered1_df['cluster'] == 1]

Unnamed: 0,artists,name,track_id,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,id,cluster
74,Caroline Rose,Cry!,7dKfbQC4PhNmgyK2OFwozT,-1.029702,0.542236,-0.17511,1.151729,3.302412,-0.233011,-0.547785,0.623582,0.557887,-0.40009,0.729236,0.269542,1.768134,7dKfbQC4PhNmgyK2OFwozT,1
613,Rubblebucket,If U C My Enemies,1PCRs7aBfxcFjl4noFtDaV,-1.001727,-0.005383,-0.106035,1.573766,3.471135,-0.7896,-0.476971,1.099774,0.557887,-0.152936,-0.973612,0.269542,1.342849,1PCRs7aBfxcFjl4noFtDaV,1
673,St. Paul & The Broken Bones,Midnight on the Earth,4XKLRlYnnz9OoD2jiduPZK,-1.00234,0.768129,-0.029626,1.493801,3.386773,0.323578,0.013757,0.935387,0.557887,-0.243508,-0.472562,0.269542,1.433981,4XKLRlYnnz9OoD2jiduPZK,1
1187,Alabama Shakes,Shoegaze,1bEfe3IX7GT7gsCtfSVye5,-1.013782,0.186283,-0.83101,0.880737,2.243205,1.158462,-0.352736,0.738918,0.557887,-0.421581,-0.075462,0.269542,1.043413,1bEfe3IX7GT7gsCtfSVye5,1
1294,AWOLNATION,Dreamers,4hrr2I7Vh8Cu1B4WTQbyMA,-1.006205,0.152057,-1.337341,1.280561,1.488637,0.045283,0.56039,1.115682,0.557887,-0.300307,0.421988,0.269542,-0.102252,4hrr2I7Vh8Cu1B4WTQbyMA,1
1336,Augustines,Cruel City,3OoFP31h2lntxSZYyPqwKk,-0.791607,0.446402,-0.167612,1.582651,1.605806,-0.7896,-0.066996,0.248143,0.557887,0.299923,-0.187277,0.269542,-0.314895,3OoFP31h2lntxSZYyPqwKk,1
1349,Gary Clark Jr.,Next Door Neighbor Blues,0cJh6Ghx9Hy5nw9Cvo3ytn,0.383836,0.836581,-0.804906,1.431606,2.463483,0.601872,-0.470759,0.624377,0.557887,0.487207,-1.22337,0.269542,0.943602,0cJh6Ghx9Hy5nw9Cvo3ytn,1
1386,The Districts,Bold,5pKssLOUeoa5sP4SfZvjFK,-1.01961,-0.621455,-0.713841,1.071764,2.716567,-1.34619,-0.01109,0.272006,0.557887,2.363118,-1.695524,0.269542,-0.688104,5pKssLOUeoa5sP4SfZvjFK,1
1418,Cloud Nothings,I'm Not Part of Me,3p9hLevLl6y3DsApanbnff,-0.875655,-0.977407,0.59187,1.653731,2.646266,-1.067895,-0.647794,0.838346,0.557887,0.028207,-1.108122,0.269542,0.001899,3p9hLevLl6y3DsApanbnff,1


In [14]:
def save_cluster_tracks_to_playlist(playlist_name, track_ids):
    # get all of the user's playlists
    all_playlists = get_all_user_playlists()
    
    # create the playlist if it does not exist yet, otherwise look it up by name
    if (playlist_name not in [playlist['name'] for playlist in all_playlists]):
        playlist = sp.user_playlist_create(user = user_id, name = playlist_name, public = True)
    else:
        playlist_id = [playlist['id'] for playlist in all_playlists if playlist['name'] == playlist_name][0]
        playlist = sp.user_playlist(user = user_id, playlist_id = playlist_id)

    # make sure playlist_id is defined in both branches (it was previously unset for new playlists)
    playlist_id = playlist['id']

    # remove any existing tracks in playlist
    while (playlist['tracks']['total'] > 0):
        sp.user_playlist_remove_all_occurrences_of_tracks(user_id, playlist_id, \
                                                          tracks = [track['track']['id'] for track in \
                                                                    playlist['tracks']['items']])
        playlist = sp.user_playlist(user = user_id, playlist_id = playlist_id)

    # add tracks from cluster
    sp.user_playlist_add_tracks(user_id, playlist_id = playlist_id, tracks = track_ids)

In [15]:
save_cluster_tracks_to_playlist("k-means, cluster 1", \
                                list(saved_tracks_clustered1_df[saved_tracks_clustered1_df['cluster'] == 1]['id']))
save_cluster_tracks_to_playlist("k-means, cluster 2", \
                                list(saved_tracks_clustered1_df[saved_tracks_clustered1_df['cluster'] == 2]['id']))
save_cluster_tracks_to_playlist("k-means, cluster 3", \
                                list(saved_tracks_clustered1_df[saved_tracks_clustered1_df['cluster'] == 24]['id']))

SpotifyException: http status: 401, code:-1 - https://api.spotify.com/v1/users/129874447/playlists?limit=50&offset=0:
 The access token expired

In [17]:
saved_tracks_clustered1_df[saved_tracks_clustered1_df['name'] == "Upgrade"]

Unnamed: 0,artists,name,track_id,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,id,cluster
547,Logic,Upgrade,1kUnxg3OKR5itogg7WSX51,-0.740074,0.932414,-0.919937,1.071764,-0.409467,0.601872,-0.352736,0.533964,-1.792478,0.26615,1.526636,0.269542,-0.393008,1kUnxg3OKR5itogg7WSX51,7


In [19]:
saved_tracks_clustered1_df[saved_tracks_clustered1_df['cluster'] == 7]

Unnamed: 0,artists,name,track_id,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,id,cluster
490,Future Islands,Ran,5GxeKtNdjSVAZTfxSsnXhF,-1.028659,-0.457169,-0.443392,0.525338,-0.2,0.601872,-0.729789,0.343594,-1.792478,-0.373993,0.82502,2.681129,-0.219422,5GxeKtNdjSVAZTfxSsnXhF,7
547,Logic,Upgrade,1kUnxg3OKR5itogg7WSX51,-0.740074,0.932414,-0.919937,1.071764,-0.409467,0.601872,-0.352736,0.533964,-1.792478,0.26615,1.526636,0.269542,-0.393008,1kUnxg3OKR5itogg7WSX51,7
553,Father John Misty,Total Entertainment Forever,2EDu3Xi4LORS0HRJ7kt0hF,-0.499587,-0.991098,-0.913997,0.965144,-0.409469,0.601872,-0.240925,0.868837,-1.792478,-0.106883,1.253581,0.269542,0.60077,2EDu3Xi4LORS0HRJ7kt0hF,7
697,"Jack White, Alicia Keys",Another Way to Die,01bMpqmvH031R417l3AQTA,-0.913385,-0.101217,0.420579,0.680825,-0.40919,-0.233011,-0.476971,0.781871,-1.792478,0.889407,0.729469,0.269542,0.071334,01bMpqmvH031R417l3AQTA,7
716,NEEDTOBREATHE,MONEY & FAME,4jtJZ9mep9Cdh8RGl5vw0g,-0.825963,0.152057,-0.642954,1.369411,-0.40943,0.323578,-0.591888,1.114887,-1.792478,0.629973,1.127669,0.269542,0.101711,4jtJZ9mep9Cdh8RGl5vw0g,7
810,NEEDTOBREATHE,MONEY & FAME,22JGzWpzuBQMbMMxq6egMq,-0.825963,0.152057,-0.642954,1.369411,-0.40943,0.323578,-0.591888,1.114887,-1.792478,0.629973,1.127669,0.269542,0.101711,22JGzWpzuBQMbMMxq6egMq,7
901,Alabama Shakes,Future People,5nwmpDEN8CqQoLeypoaenL,-0.730565,-1.032169,-0.502414,0.822985,0.176346,0.323578,-0.56145,1.010156,-1.792478,-0.083856,1.134368,0.269542,0.344731,5nwmpDEN8CqQoLeypoaenL,7
1138,NEEDTOBREATHE,Quit,6L0DOHfPcM6mUd4oj4Fe4c,-1.035687,-0.539312,-0.444579,1.711483,-0.409281,0.880167,-0.588782,1.192838,-1.792478,0.674491,1.260447,0.269542,-0.440744,6L0DOHfPcM6mUd4oj4Fe4c,7
1143,NEEDTOBREATHE,Knew It All,7HUXxv5aRll28Z0SkfuErR,-1.03569,-0.854193,-0.70315,1.564881,-0.409408,-0.233011,-0.427277,1.145113,-1.792478,0.006716,1.06098,0.269542,-0.410367,7HUXxv5aRll28Z0SkfuErR,7
1238,BØRNS,Electric Love,0zulYs8vHhhN0hl8wvYgdv,-1.021144,-1.086931,-0.256702,1.231694,-0.405843,0.323578,0.144203,0.425787,-1.792478,0.474926,-0.005407,0.269542,0.661525,0zulYs8vHhhN0hl8wvYgdv,7
