# Creating final dataset of playlists (our sample data)

## Description (summary):

The goal of this section is to create the final dataset of playlists (our sample), with independent variables (track and artist features) and the dependent variable (the genre of the playlist). Most importantly, we made sure that the sample is balanced, with the same number of playlists in each class, since this matters when fitting the models to the training dataset. To do so, we carried out the following steps:

- Requesting playlist IDs and track and artist features from Spotify's API using the Spotipy package
- Setting up a pandas dataframe at the track level
- Classifying each song into one of 5 genres (rock, pop, pop rock, rap, and other)
- Collapsing the songs to unique playlist IDs, so that each playlist is characterized by a vector of the average features of its songs
- Classifying each playlist into one of 5 genres (rock, pop, pop rock, rap, and other), according to the most frequent genre in that playlist
- Setting up the final sample of playlists, equally distributed across the classes (genres)

<hr style="height:2pt">

## Requesting playlist IDs, tracks and artist features from Spotify's API using Spotipy Package

First, we imported the necessary libraries used by our functions and code:

In [1]:
import json # import the json library
%matplotlib inline
import numpy as np
import scipy 
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
import re
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util

Then we read the JSON files provided by the Million Playlist Dataset (explained in the Description of Data):

In [2]:
path = "data"

file_names = ["mpd.slice.0-999", "mpd.slice.1000-1999", "mpd.slice.2000-2999", "mpd.slice.3000-3999", "mpd.slice.4000-4999", "mpd.slice.5000-5999", "mpd.slice.6000-6999", "mpd.slice.7000-7999", "mpd.slice.8000-8999", "mpd.slice.9000-9999","mpd.slice.10000-10999", "mpd.slice.11000-11999", "mpd.slice.12000-12999", "mpd.slice.13000-13999", "mpd.slice.14000-14999"]
plylist = []
for file in file_names:
    with open(path+"/"+file+".json", "r") as fd:
        plylist_temp = json.load(fd)
        plylist_temp = plylist_temp.get('playlists')
        plylist = plylist + plylist_temp
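
The slice names follow a regular pattern, so the list could also be generated instead of typed out. The following is a minimal sketch, assuming every slice holds 1,000 playlists and the files are named as in the list above; `slice_file_names` is a hypothetical helper and not part of the original notebook.

```python
def slice_file_names(n_slices, slice_size=1000):
    """Return names such as 'mpd.slice.0-999' for the first n_slices slices."""
    names = []
    for k in range(n_slices):
        first = k * slice_size
        last = first + slice_size - 1
        names.append("mpd.slice.{}-{}".format(first, last))
    return names


file_names = slice_file_names(15)
print(file_names[0], file_names[-1])
```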

The following function takes a number of playlists and returns the chosen feature for every track in the selected playlists:

In [3]:
def feature_list_func(plylist_dic, feature, n_playlist, first_pid):
    """
    This function takes a number of playlists and returns the chosen feature for the tracks of those playlists.
    
    input: 
        1 - plylist_dic: list of all playlists (each one a dictionary, as loaded from the json files)
        2 - feature: feature to be selected from each song in the selected playlists
        3 - n_playlist: number of playlists to be selected 
        4 - first_pid: position in plylist_dic of the first playlist to be selected
    
    output: 
        1 - pid_list: playlist ID of each track in the selected playlists
        2 - feature_list: value of the chosen feature for each of those tracks
    
    """
    feature_list = []
    pid_list = []
    # never read past the end of the list of playlists
    length_playlist = max(0, min(n_playlist, len(plylist_dic) - first_pid))
    for i in range(length_playlist):
        playlist = plylist_dic[first_pid + i]
        playlist_pid = playlist.get('pid')
        for track in playlist.get('tracks'):
            feature_list.append(track.get(feature))
            pid_list.append(playlist_pid)
    return pid_list, feature_list

The following code calls the function above to get the playlist IDs and the track and artist URIs, which are used later to request the features that make up our dataframe.

In [4]:
pid_t, track_uri = feature_list_func(plylist, feature = 'track_uri', n_playlist = 10, first_pid = 0)
pid_a, artist_uri = feature_list_func(plylist, feature = 'artist_uri', n_playlist = 10, first_pid = 0)
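
To make the shape of the two returned lists concrete, here is a small sketch on toy data. It assumes the same `pid`, `tracks`, `track_uri` and `artist_uri` keys used above; the playlists and the helper name `flatten_feature` are made up for illustration.

```python
def flatten_feature(playlists, feature):
    """Return (pid_list, feature_list), one entry per track, in playlist order."""
    pid_list, feature_list = [], []
    for playlist in playlists:
        for track in playlist.get("tracks", []):
            pid_list.append(playlist.get("pid"))
            feature_list.append(track.get(feature))
    return pid_list, feature_list


# hypothetical toy data with the same keys as the playlist dictionaries
toy_playlists = [
    {"pid": 0, "tracks": [
        {"track_uri": "spotify:track:A", "artist_uri": "spotify:artist:X"},
        {"track_uri": "spotify:track:B", "artist_uri": "spotify:artist:Y"},
    ]},
    {"pid": 1, "tracks": [
        {"track_uri": "spotify:track:C", "artist_uri": "spotify:artist:X"},
    ]},
]

pids, uris = flatten_feature(toy_playlists, "track_uri")
print(pids)  # [0, 0, 1]
print(uris)  # ['spotify:track:A', 'spotify:track:B', 'spotify:track:C']
```

Both lists have one entry per track, so the playlist ID at position `k` always belongs to the track at position `k`.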

After getting the URIs of the tracks and artists, we requested their features from the Spotify API to create a pandas dataframe at the track level. We used the Spotipy package, whose documentation can be found at: https://spotipy.readthedocs.io

In [5]:
def create_spotipy_obj():
#    import spotipy
#    from spotipy.oauth2 import SpotifyClientCredentials
#    import spotipy.util as util
    
    """
    Uses dbarjum's client id for DS Project
    """

    SPOTIPY_CLIENT_ID = '54006da9bd7849b7906b944a7fa4e29d'
    SPOTIPY_CLIENT_SECRET = 'f54ae294a30c4a99b2ff330a923cd6e3'
    SPOTIPY_REDIRECT_URI = 'http://localhost/'

    username = 'dbarjum'
    scope = 'user-library-read'
    
    token = util.prompt_for_user_token(username,scope,client_id=SPOTIPY_CLIENT_ID,
                           client_secret=SPOTIPY_CLIENT_SECRET,
                           redirect_uri=SPOTIPY_REDIRECT_URI)
    client_credentials_manager = SpotifyClientCredentials(client_id=SPOTIPY_CLIENT_ID, 
                                                          client_secret=SPOTIPY_CLIENT_SECRET, proxies=None)
    sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
    
    return sp

In [6]:
sp = create_spotipy_obj()
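
The function above keeps the client ID and secret inside the notebook. One possible alternative is to read them from environment variables. The sketch below assumes the variables `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` have been set beforehand (these are the names Spotipy itself looks up when no credentials are passed); `load_spotify_credentials` is a hypothetical helper, not part of the original code.

```python
import os


def load_spotify_credentials(environ=os.environ):
    """Read the client ID and secret from environment variables."""
    try:
        return environ["SPOTIPY_CLIENT_ID"], environ["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing.args[0]))


# Usage (requires the variables to be set and spotipy to be installed):
# client_id, client_secret = load_spotify_credentials()
# manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
# sp = spotipy.Spotify(client_credentials_manager=manager)
```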

In [7]:
def get_all_features(track_list, artist_list, sp):
    
    """
    This function takes in a list of tracks and a list of artists, along
    with a spotipy object and generates two lists of features from Spotify's API.
    
    inputs:
        1. track_list: list of all tracks to be included in dataframe
        2. artist_list: list of all artists corresponding to tracks
        3. sp: spotipy object to communicate with Spotify API
    
    returns:
        1. track_features: list of all features for each track in track_list
        2. artist_features: list of all artist features for each artist in artist_list
    """
    
    track_features = []
    artist_features = []
    
    # request the features in batches of 50 tracks/artists per API call
    batch_size = 50
    for start in range(0, len(track_list), batch_size):
        end = start + batch_size
        track_features.extend(sp.audio_features(track_list[start:end]))
        artist_features.extend(sp.artists(artist_list[start:end]).get('artists'))
    
    return track_features, artist_features

In [8]:
start_time = time.time()
t_features, a_features = get_all_features(track_uri, artist_uri, sp)
print("--- %s seconds ---" % (time.time() - start_time))

--- 2.8701727390289307 seconds ---
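
`get_all_features` walks through the URI lists in windows of 50, presumably to stay within the API's per-request limit. The batching itself can be sketched without any API call; `batches` is a hypothetical helper and the identifiers are made up.

```python
def batches(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


uris = ["uri{}".format(i) for i in range(120)]  # hypothetical identifiers
print([len(b) for b in batches(uris)])  # [50, 50, 20]
```

The last batch simply holds whatever is left, which is the role of the remainder branch in the function above.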


## Setting up a pandas dataframe at the track level

The following function takes in the lists of track and artist features and generates a dataframe of the features. It also creates one column per genre listed for the artist of each track (g1, g2, and so on). These columns are used later to classify each track into one of 5 genres (rock, pop, pop rock, rap, and other).

In [9]:
def create_song_df(track_features, artist_features, pid):
    
    """
    This function takes in two lists of track and artist features, respectively,
    and generates a dataframe of the features.
    
    inputs:
        1. track_features: list of all tracks including features
        2. artist_features: list of all artists including features
        3. pid: list with the playlist ID of each track, in the same order as track_features
    
    returns:
        1. df: a pandas dataframe of size (N, X) where N corresponds to the number of songs
        in track_features, X is the number of features in the dataframe.
    """
    
    selected_song_features = ['uri', 'duration_ms', 'time_signature', 'key',
                              'tempo', 'energy', 'mode', 'loudness', 'speechiness', 
                              'danceability', 'acousticness', 'instrumentalness', 
                              'valence', 'liveness']
    selected_artist_features = ['followers', 'uri', 'name', 'popularity', 'genres']
    
    col_names = ['song_uri', 'duration_ms', 'time_signature', 'key',
                 'tempo', 'energy', 'mode', 'loudness', 'speechiness', 
                 'danceability', 'acousticness', 'instrumentalness', 
                 'valence', 'liveness', 'artist_followers', 'artist_uri',
                 'artist_name', 'artist_popularity']
    
    
    data = []

    for i, j in zip(track_features, artist_features):
        temp = []
        for sf in selected_song_features:
            temp.append(i.get(sf))
        for af in selected_artist_features:
            if af == 'followers':
                temp.append(j.get('followers').get('total'))
            elif af == 'genres':
                for g in j.get('genres'):
                    temp.append(g)
            else:
                temp.append(j.get(af))

        data.append(list(temp))
    
    df = pd.DataFrame(data)

    for i in range(len(df.columns)- len(col_names)):
        col_names.append('g'+str(i+1))
    
    df.columns = col_names
    
    df.insert(loc=0, column='pid', value=pid)
    
    return df

In [10]:
songs_df = create_song_df(t_features, a_features, pid_t)
songs_df.head()

Unnamed: 0,pid,song_uri,duration_ms,time_signature,key,tempo,energy,mode,loudness,speechiness,danceability,acousticness,instrumentalness,valence,liveness,artist_followers,artist_uri,artist_name,artist_popularity,g1,g2,g3,g4,g5,g6,g7,g8,g9,g10,g11,g12,g13,g14,g15,g16,g17,g18,g19,g20,g21
0,0,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,226864,4,4,125.461,0.813,0,-7.105,0.121,0.904,0.0311,0.00697,0.81,0.0471,909647,spotify:artist:2wIVse2owClT7go1WT98tk,Missy Elliott,76,dance pop,hip hop,hip pop,pop,pop rap,r&b,rap,southern hip hop,urban contemporary,,,,,,,,,,,,
1,0,spotify:track:6I9VzXrHxO9rA9A5euc8Ak,198800,4,5,143.04,0.838,0,-3.914,0.114,0.774,0.0249,0.025,0.924,0.242,5457673,spotify:artist:26dSoYclwsYLMAKD3tpOr4,Britney Spears,82,dance pop,pop,post-teen pop,,,,,,,,,,,,,,,,,,
2,0,spotify:track:0WqIKmW4BTrj3eJFmnCKMv,235933,4,2,99.259,0.758,0,-6.583,0.21,0.664,0.00238,0.0,0.701,0.0598,16686181,spotify:artist:6vWDO969PvNqNYHIOW5v0m,Beyoncé,87,dance pop,pop,post-teen pop,r&b,,,,,,,,,,,,,,,,,
3,0,spotify:track:1AWQoqb9bSvzTjaLralEkT,267267,4,4,100.972,0.714,0,-6.055,0.14,0.891,0.202,0.000234,0.818,0.0521,7343717,spotify:artist:31TPClRtHm23RisEBtV3X7,Justin Timberlake,83,dance pop,pop,pop rap,,,,,,,,,,,,,,,,,,
4,0,spotify:track:1lzr43nnXAijIGYnCT8M8H,227600,4,0,94.759,0.606,1,-4.596,0.0713,0.853,0.0561,0.0,0.654,0.313,1044930,spotify:artist:5EvFsr3kj42KNv97ZEnqij,Shaggy,74,dance pop,pop rap,reggae fusion,,,,,,,,,,,,,,,,,,
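
The number of genre columns (g1 to g21 in the table above) is set by the artist with the most genres; shorter genre lists are padded with missing values. The sketch below shows that padding in plain Python, using two genre lists from the table; `pad_genres` is a hypothetical helper that only mimics what the dataframe construction does with rows of unequal length.

```python
def pad_genres(genre_lists):
    """Pad every genre list with None so that all rows have the same width."""
    width = max((len(g) for g in genre_lists), default=0)
    columns = ["g{}".format(i + 1) for i in range(width)]
    rows = [list(g) + [None] * (width - len(g)) for g in genre_lists]
    return columns, rows


columns, rows = pad_genres([
    ["dance pop", "pop", "post-teen pop"],         # Britney Spears
    ["dance pop", "pop", "post-teen pop", "r&b"],  # Beyoncé
])
print(columns)  # ['g1', 'g2', 'g3', 'g4']
print(rows[0])  # ['dance pop', 'pop', 'post-teen pop', None]
```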


## Collapsing songs to unique playlists

In this section we collapse the songs to unique playlist IDs, so that each playlist is characterized by a vector of the average features of its songs. We also classify the songs and the playlists by genre.

The following function classifies each song according to the genres listed for its artist, using a series of "if" statements:

In [11]:
def genre_generator(songs_df):
    
    """
    This function classifies songs according to the genres of the artist of each song, using "if" statements.
    
    Input: dataframe with a list of songs
    
    Output: dataframe with an added column holding a unique genre for each song 
    
    """
    # substrings that mark a song as "rap"; note that tags spelled "hip hop" or "r&b"
    # (as in the table above) do not contain "hiphop" or "r&d"
    rap = ["rap","hiphop", "r&d"]

    # genre columns are named "g1", "g2", ..., "gX"
    genre_columns = [c for c in songs_df.columns if re.fullmatch(r"g\d+", str(c))]

    # create new column that will hold the unique genre (class) of each song 
    songs_df["genre"] = "" 

    # loop to assign a genre to each song in the dataframe     
    for j in range(len(songs_df)):

        # list of artist genres for the given song, without missing values
        genres_row = list(songs_df.iloc[j][genre_columns].dropna())
        # e.g. genres_row = ['british invasion', 'merseybeat', 'psychedelic']

        # classifying the genre of the song
        genre = "other"

        if any("rock" in s for s in genres_row) and any("pop" in s for s in genres_row):
            genre = "pop rock"
        elif any("rock" in s for s in genres_row):
            genre = "rock"
        elif any("pop" in s for s in genres_row):
            genre = "pop"

        # the rap check comes last, so it overrides the labels above
        for i in rap:
            if any(i in s for s in genres_row):
                genre = "rap"

        # storing the classified genre for the given song         
        songs_df.at[songs_df.index[j], 'genre'] = genre
    
    return songs_df

The code below calls the song genre generator function, and the result is a dataframe with songs containing a genre, which has been classified according to the genre of the artists of each song.

In [12]:
songs_df_new = genre_generator(songs_df)
songs_df_new.head()



Unnamed: 0,pid,song_uri,duration_ms,time_signature,key,tempo,energy,mode,loudness,speechiness,danceability,acousticness,instrumentalness,valence,liveness,artist_followers,artist_uri,artist_name,artist_popularity,g1,g2,g3,g4,g5,g6,g7,g8,g9,g10,g11,g12,g13,g14,g15,g16,g17,g18,g19,g20,g21,genre
0,0,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,226864,4,4,125.461,0.813,0,-7.105,0.121,0.904,0.0311,0.00697,0.81,0.0471,909647,spotify:artist:2wIVse2owClT7go1WT98tk,Missy Elliott,76,dance pop,hip hop,hip pop,pop,pop rap,r&b,rap,southern hip hop,urban contemporary,,,,,,,,,,,,,rap
1,0,spotify:track:6I9VzXrHxO9rA9A5euc8Ak,198800,4,5,143.04,0.838,0,-3.914,0.114,0.774,0.0249,0.025,0.924,0.242,5457673,spotify:artist:26dSoYclwsYLMAKD3tpOr4,Britney Spears,82,dance pop,pop,post-teen pop,,,,,,,,,,,,,,,,,,,pop
2,0,spotify:track:0WqIKmW4BTrj3eJFmnCKMv,235933,4,2,99.259,0.758,0,-6.583,0.21,0.664,0.00238,0.0,0.701,0.0598,16686181,spotify:artist:6vWDO969PvNqNYHIOW5v0m,Beyoncé,87,dance pop,pop,post-teen pop,r&b,,,,,,,,,,,,,,,,,,pop
3,0,spotify:track:1AWQoqb9bSvzTjaLralEkT,267267,4,4,100.972,0.714,0,-6.055,0.14,0.891,0.202,0.000234,0.818,0.0521,7343717,spotify:artist:31TPClRtHm23RisEBtV3X7,Justin Timberlake,83,dance pop,pop,pop rap,,,,,,,,,,,,,,,,,,,rap
4,0,spotify:track:1lzr43nnXAijIGYnCT8M8H,227600,4,0,94.759,0.606,1,-4.596,0.0713,0.853,0.0561,0.0,0.654,0.313,1044930,spotify:artist:5EvFsr3kj42KNv97ZEnqij,Shaggy,74,dance pop,pop rap,reggae fusion,,,,,,,,,,,,,,,,,,,rap
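
The classification rule can be isolated as a function of a single list of artist genres, which makes it easy to check against the rows above. This is a sketch of the same rule outside the dataframe; `classify_song` and `RAP_MARKERS` are hypothetical names.

```python
RAP_MARKERS = ["rap", "hiphop", "r&d"]  # same strings as in genre_generator


def classify_song(artist_genres):
    """Map a list of artist genre tags to one of the five classes."""
    has_rock = any("rock" in g for g in artist_genres)
    has_pop = any("pop" in g for g in artist_genres)

    if has_rock and has_pop:
        genre = "pop rock"
    elif has_rock:
        genre = "rock"
    elif has_pop:
        genre = "pop"
    else:
        genre = "other"

    # the rap check runs last, so it overrides any earlier label
    if any(marker in g for marker in RAP_MARKERS for g in artist_genres):
        genre = "rap"
    return genre


print(classify_song(["dance pop", "pop", "post-teen pop"]))  # pop
print(classify_song(["dance pop", "pop", "pop rap"]))        # rap
print(classify_song(["british invasion", "merseybeat", "psychedelic"]))  # other
```

Because the rule matches substrings, a tag such as "pop rap" counts as both pop and rap, and the rap check wins. This is why Justin Timberlake and Shaggy are labelled rap in the table above.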


The following lines clean the dataframe by dropping the columns that are no longer needed: the artist genre columns (g1 to gX), which were only used to create the single genre column used later in the algorithm, and artist_popularity, which falls in the same column slice.

In [13]:
temp = songs_df_new.copy()

In [14]:
column_names_temp = songs_df_new.columns.values[18:-1]
column_names_temp

array(['artist_popularity', 'g1', 'g2', 'g3', 'g4', 'g5', 'g6', 'g7', 'g8',
       'g9', 'g10', 'g11', 'g12', 'g13', 'g14', 'g15', 'g16', 'g17', 'g18',
       'g19', 'g20', 'g21'], dtype=object)

In [15]:
temp = temp.drop(column_names_temp,axis=1)
temp.head()

Unnamed: 0,pid,song_uri,duration_ms,time_signature,key,tempo,energy,mode,loudness,speechiness,danceability,acousticness,instrumentalness,valence,liveness,artist_followers,artist_uri,artist_name,genre
0,0,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,226864,4,4,125.461,0.813,0,-7.105,0.121,0.904,0.0311,0.00697,0.81,0.0471,909647,spotify:artist:2wIVse2owClT7go1WT98tk,Missy Elliott,rap
1,0,spotify:track:6I9VzXrHxO9rA9A5euc8Ak,198800,4,5,143.04,0.838,0,-3.914,0.114,0.774,0.0249,0.025,0.924,0.242,5457673,spotify:artist:26dSoYclwsYLMAKD3tpOr4,Britney Spears,pop
2,0,spotify:track:0WqIKmW4BTrj3eJFmnCKMv,235933,4,2,99.259,0.758,0,-6.583,0.21,0.664,0.00238,0.0,0.701,0.0598,16686181,spotify:artist:6vWDO969PvNqNYHIOW5v0m,Beyoncé,pop
3,0,spotify:track:1AWQoqb9bSvzTjaLralEkT,267267,4,4,100.972,0.714,0,-6.055,0.14,0.891,0.202,0.000234,0.818,0.0521,7343717,spotify:artist:31TPClRtHm23RisEBtV3X7,Justin Timberlake,rap
4,0,spotify:track:1lzr43nnXAijIGYnCT8M8H,227600,4,0,94.759,0.606,1,-4.596,0.0713,0.853,0.0561,0.0,0.654,0.313,1044930,spotify:artist:5EvFsr3kj42KNv97ZEnqij,Shaggy,rap


In [16]:
feature_indexes = list(range(len(temp.columns)-1))

In [17]:
col_names_temp = ['duration_ms','time_signature','key','tempo','energy','loudness','speechiness','danceability','acousticness',
         'instrumentalness', 'valence', 'liveness', 'artist_followers', 'artist_popularity'  ]


In [18]:
col_names = temp.columns

The code below one-hot-encodes the variable genre, so that we can calculate the proportion of songs of each genre in each playlist. This helps classify the genre of each playlist according to the most frequent genre among its songs.

In [19]:
songs_encoded = pd.get_dummies(temp,columns = ['genre'],drop_first=False)
songs_encoded.head()

Unnamed: 0,pid,song_uri,duration_ms,time_signature,key,tempo,energy,mode,loudness,speechiness,danceability,acousticness,instrumentalness,valence,liveness,artist_followers,artist_uri,artist_name,genre_other,genre_pop,genre_pop rock,genre_rap,genre_rock
0,0,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,226864,4,4,125.461,0.813,0,-7.105,0.121,0.904,0.0311,0.00697,0.81,0.0471,909647,spotify:artist:2wIVse2owClT7go1WT98tk,Missy Elliott,0,0,0,1,0
1,0,spotify:track:6I9VzXrHxO9rA9A5euc8Ak,198800,4,5,143.04,0.838,0,-3.914,0.114,0.774,0.0249,0.025,0.924,0.242,5457673,spotify:artist:26dSoYclwsYLMAKD3tpOr4,Britney Spears,0,1,0,0,0
2,0,spotify:track:0WqIKmW4BTrj3eJFmnCKMv,235933,4,2,99.259,0.758,0,-6.583,0.21,0.664,0.00238,0.0,0.701,0.0598,16686181,spotify:artist:6vWDO969PvNqNYHIOW5v0m,Beyoncé,0,1,0,0,0
3,0,spotify:track:1AWQoqb9bSvzTjaLralEkT,267267,4,4,100.972,0.714,0,-6.055,0.14,0.891,0.202,0.000234,0.818,0.0521,7343717,spotify:artist:31TPClRtHm23RisEBtV3X7,Justin Timberlake,0,0,0,1,0
4,0,spotify:track:1lzr43nnXAijIGYnCT8M8H,227600,4,0,94.759,0.606,1,-4.596,0.0713,0.853,0.0561,0.0,0.654,0.313,1044930,spotify:artist:5EvFsr3kj42KNv97ZEnqij,Shaggy,0,0,0,1,0


The following function takes a dataframe of songs (with playlist IDs) and collapses it at the playlist ID level, taking the average of each column (which characterizes each playlist). This creates a dataframe at the playlist level.

In [20]:
def collapse_pid(df):
    
    """
    This function takes a data frame of songs (with playlists IDs) and collapses the dataframe at the playlist ID level, to get averages for each column.
    
    Input: data frame of songs (with playlists IDs)
    
    Output: data frame of playlists (collapsing songs into playlist IDs, using average)
    
    """
    
    # Group by playlist ID
    pid_groups = df.groupby('pid')
    
    # Average every numeric column within each playlist
    return pid_groups.mean(numeric_only=True)

playlists_collapsed = collapse_pid(songs_encoded)
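
Averaging a one-hot indicator within a playlist gives the share of songs of that genre, which is what the playlist classification relies on. The following is a plain-Python sketch of the same group-and-average step on made-up songs; `average_by_pid` is a hypothetical helper, not the pandas call used above.

```python
from collections import defaultdict


def average_by_pid(rows, columns):
    """Average the given numeric columns over all songs sharing a pid."""
    sums = defaultdict(lambda: [0.0] * len(columns))
    counts = defaultdict(int)
    for row in rows:
        counts[row["pid"]] += 1
        for k, col in enumerate(columns):
            sums[row["pid"]][k] += row[col]
    return {pid: dict(zip(columns, [s / counts[pid] for s in sums[pid]]))
            for pid in sums}


# hypothetical songs: two playlists, one audio feature and two genre indicators
songs = [
    {"pid": 0, "energy": 0.8, "genre_pop": 1, "genre_rap": 0},
    {"pid": 0, "energy": 0.6, "genre_pop": 0, "genre_rap": 1},
    {"pid": 0, "energy": 0.7, "genre_pop": 0, "genre_rap": 1},
    {"pid": 1, "energy": 0.5, "genre_pop": 1, "genre_rap": 0},
]
playlists = average_by_pid(songs, ["energy", "genre_pop", "genre_rap"])
print(round(playlists[0]["genre_rap"], 3))  # 0.667
```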


## Classifying each playlist to one of 5 genres (rock, pop, poprock, rap, and others)

The following function classifies playlists according to the most frequent genre of the songs in the playlist:

In [21]:
def playlist_genre_generator (df, first_row):
    
    """
    This function classifies playlists according to the most frequent genre of the songs in the playlist
    
    Input: dataframe with a list of playlists
    
    Output: dataframe with added column with unique genre for each playlist 
    
    """

    # create new column that will hold the genre (class) of each playlist 
    df["playlist_genre"] = ""

    column_names = df.columns.values

    # the one-hot genre columns ("genre_other", ..., "genre_rock") sit between
    # "artist_followers" and the "playlist_genre" column that was just added
    first_genre_index = list(column_names).index("artist_followers") + 1
    last_genre_index = len(column_names) - 1

    for j in range(len(df)):

        # share of songs of each genre in the given playlist
        genre_shares = list(df.iloc[j][column_names[first_genre_index:last_genre_index]])

        # the playlist genre is the one with the largest share (the first one wins ties)
        max_index = genre_shares.index(max(genre_shares))
        playlist_genre = column_names[first_genre_index + max_index]

        # storing the classified genre for the given playlist
        df.at[j + first_row, 'playlist_genre'] = playlist_genre
    return df
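
The core of this function is an argmax over the genre shares of a playlist. Here is a minimal sketch, using the genre shares of the playlist with pid 3 from the baseline output table in this document; `playlist_genre_from_shares` is a hypothetical helper.

```python
def playlist_genre_from_shares(genre_shares):
    """Return the genre column with the largest share (the first one wins ties)."""
    best_name, best_share = None, float("-inf")
    for name, share in genre_shares.items():
        if share > best_share:
            best_name, best_share = name, share
    return best_name


shares_pid_3 = {
    "genre_other": 0.246032,
    "genre_pop": 0.150794,
    "genre_pop rock": 0.31746,
    "genre_rap": 0.071429,
    "genre_rock": 0.214286,
}
print(playlist_genre_from_shares(shares_pid_3))  # genre_pop rock
```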

## Setting up final sample of playlists of equally distributed in each of the classes (genres)

The following code creates a baseline sample from the first 2,000 playlists. This baseline has an unequal distribution of genres across playlists. The first rows of the resulting dataframe are shown in the output table below.

In [22]:
### creating base_line data frame

import warnings
warnings.filterwarnings('ignore')

n_playlist = 2000
# get URIs for the tracks and artists of the first n_playlist playlists
pid_t, track_uri = feature_list_func(plylist, feature = 'track_uri', n_playlist = n_playlist, first_pid = 0)
pid_a, artist_uri = feature_list_func(plylist, feature = 'artist_uri', n_playlist = n_playlist, first_pid = 0)

t_features, a_features = get_all_features(track_uri, artist_uri, sp)

#create dataframe of songs
songs_df = create_song_df(t_features, a_features, pid_t)
songs_df_new = genre_generator(songs_df)
temp = songs_df_new.copy()
column_names_temp = songs_df_new.columns.values[18:-1]
temp = temp.drop(column_names_temp,axis=1)
songs_encoded = pd.get_dummies(temp,columns = ['genre'],drop_first=False)

#create dataframe of playlists
playlists_collapsed = collapse_pid(songs_encoded)
genre_classified_playlists = playlist_genre_generator (playlists_collapsed, first_row = 0)
genre_classified_playlists.head()

pid,duration_ms,time_signature,key,tempo,energy,mode,loudness,speechiness,danceability,acousticness,instrumentalness,valence,liveness,artist_followers,genre_other,genre_pop,genre_pop rock,genre_rap,genre_rock,playlist_genre
0,221777.461538,4.0,5.038462,123.006885,0.782173,0.692308,-4.881942,0.107021,0.659288,0.08344,0.000676,0.642904,0.192127,4800843.0,0.0,0.288462,0.230769,0.461538,0.019231,genre_rap
1,298844.128205,3.769231,4.461538,122.669615,0.691077,0.538462,-8.291667,0.088449,0.496459,0.1631,0.22227,0.476667,0.178433,1704673.0,0.358974,0.0,0.051282,0.0,0.589744,genre_rock
2,219374.875,4.0,5.0,114.600672,0.693203,0.515625,-4.874156,0.096288,0.671875,0.26923,0.000638,0.565078,0.169028,1691574.0,0.0625,0.9375,0.0,0.0,0.0,genre_pop
3,229575.055556,3.952381,5.103175,125.032413,0.621282,0.714286,-9.614937,0.067186,0.513714,0.27387,0.202042,0.451623,0.188585,212510.9,0.246032,0.150794,0.31746,0.071429,0.214286,genre_pop rock
4,255014.352941,3.941176,3.352941,127.759882,0.650535,0.823529,-7.634471,0.041159,0.576765,0.177148,0.081875,0.490765,0.166524,1167521.0,0.117647,0.117647,0.705882,0.0,0.058824,genre_pop rock


The following code is an intermediate step in adjusting the sample towards an equal distribution of genres across all playlists. It finds the most frequent genre among the playlists and counts the playlists of each genre, so that in the next step we can fill up the sample with playlists of underrepresented genres.

In [28]:
# getting the number of playlists of the most frequent genre in the dataframe 
table = genre_classified_playlists['playlist_genre'].value_counts()
mode_genre = table.idxmax()
number_mode_genre = table.loc[mode_genre]

# defining variables to use as baseline for sampling function
number_genre_pop = table.loc["genre_pop"]
number_genre_rap = table.loc["genre_rap"]
number_genre_other = table.loc["genre_other"]
number_genre_poprock = table.loc["genre_pop rock"]
number_genre_rock = table.loc["genre_rock"]

# getting the genre mode and the total number of playlists for the genre mode
mode_genre = genre_classified_playlists['playlist_genre'].value_counts().idxmax()
mode_genre
total_number = number_genre_pop + number_genre_rap + number_genre_other + number_genre_poprock + number_genre_rock
total_number

2675
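
The quantity that drives the next step is how far each genre is from the count of the most frequent one. The sketch below uses made-up labels, not the counts from our sample; `genre_deficits` is a hypothetical helper.

```python
from collections import Counter


def genre_deficits(playlist_genres):
    """Return the modal count and how many playlists each genre needs to reach it."""
    counts = Counter(playlist_genres)
    target = max(counts.values())
    return target, {genre: target - n for genre, n in counts.items()}


# hypothetical labels
labels = ["genre_pop"] * 5 + ["genre_rap"] * 3 + ["genre_rock"] * 1
target, deficits = genre_deficits(labels)
print(target)    # 5
print(deficits)  # {'genre_pop': 0, 'genre_rap': 2, 'genre_rock': 4}
```

The balanced sample is complete once every deficit is zero, that is, when the total equals the modal count times the number of genres.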

The code below takes one playlist at a time from the pool of 15,000 playlists (read from the Million Playlist JSON files at the beginning of this page), checks which genre it belongs to, and adds it to the baseline sample if that genre is underrepresented, until the full sample is equally distributed.

Candidate playlists are taken in sequence, starting right after the playlists already in the baseline sample. A playlist is discarded if it belongs to a genre that is already well represented.

In [29]:
### adjusting base_line data frame to get to desired distribution

# to calculate how long this code will take
start_time = time.time()

# t counts the candidate playlists examined so far; candidates start right after the baseline playlists
t = 0

# number of playlists of each genre currently in the sample
genre_counts = {"genre_pop": number_genre_pop,
                "genre_rap": number_genre_rap,
                "genre_other": number_genre_other,
                "genre_pop rock": number_genre_poprock,
                "genre_rock": number_genre_rock}

# stop when every genre has as many playlists as the most frequent one, or when we run out of playlists
while total_number < number_mode_genre*5 and n_playlist + t < len(plylist):

    first_pid = n_playlist + t

    # get uri for tracks and artists of playlist selected
    pid_t, track_uri = feature_list_func(plylist, feature = 'track_uri', n_playlist = 1, first_pid = first_pid)
    pid_a, artist_uri = feature_list_func(plylist, feature = 'artist_uri', n_playlist = 1, first_pid = first_pid)
    t_features, a_features = get_all_features(track_uri, artist_uri, sp)

    #create dataframe of songs
    songs_df = create_song_df(t_features, a_features, pid_t)
    songs_df_new = genre_generator(songs_df)
    temp = songs_df_new.copy()
    column_names_temp = songs_df_new.columns.values[18:-1]
    temp = temp.drop(column_names_temp,axis=1)
    songs_encoded = pd.get_dummies(temp,columns = ['genre'],drop_first=False)

    #create dataframe of playlists
    playlists_collapsed = collapse_pid(songs_encoded)
    genre_classified_SinglePlaylist = playlist_genre_generator (playlists_collapsed, first_row = first_pid)

    # add the playlist only if its genre is still below the count of the most frequent genre
    candidate_genre = genre_classified_SinglePlaylist.playlist_genre.iloc[0]
    if genre_counts.get(candidate_genre, number_mode_genre) < number_mode_genre:
        genre_classified_playlists = pd.concat([genre_classified_playlists, genre_classified_SinglePlaylist], sort=False)
        genre_counts[candidate_genre] += 1

    t += 1

    total_number = sum(genre_counts.values())

    # print (total_number)
    # print (number_genre_pop)
    # print (number_genre_rap)
    # print (number_genre_other)
    # print (number_genre_poprock)
    # print (number_genre_rock)
    
print("--- %s seconds ---" % (time.time() - start_time))
    
genre_classified_playlists.head()

--- 0.0009975433349609375 seconds ---


Unnamed: 0,pid,duration_ms,time_signature,key,tempo,energy,mode,loudness,speechiness,danceability,acousticness,instrumentalness,valence,liveness,artist_followers,genre_other,genre_pop,genre_pop rock,genre_rap,genre_rock,playlist_genre
0,0,221777.461538,4.0,5.038462,123.006885,0.782173,0.692308,-4.881942,0.107021,0.659288,0.08344,0.000676,0.642904,0.192127,4797984.0,0.0,0.288462,0.230769,0.461538,0.019231,genre_rap
1,1,298844.128205,3.769231,4.461538,122.669615,0.691077,0.538462,-8.291667,0.088449,0.496459,0.1631,0.22227,0.476667,0.178433,1702573.0,0.358974,0.0,0.051282,0.0,0.589744,genre_rock
2,2,219374.875,4.0,5.0,114.600672,0.693203,0.515625,-4.874156,0.096288,0.671875,0.26923,0.000638,0.565078,0.169028,1688725.0,0.0625,0.9375,0.0,0.0,0.0,genre_pop
3,3,229575.055556,3.952381,5.103175,125.032413,0.621282,0.714286,-9.614937,0.067186,0.513714,0.27387,0.202042,0.451623,0.188585,212325.8,0.246032,0.150794,0.31746,0.071429,0.214286,genre_pop rock
4,4,255014.352941,3.941176,3.352941,127.759882,0.650535,0.823529,-7.634471,0.041159,0.576765,0.177148,0.081875,0.490765,0.166524,1166320.0,0.117647,0.117647,0.705882,0.0,0.058824,genre_pop rock
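
Stripped of the API calls and dataframes, the balancing procedure is a filter over a stream of candidate genres. The following sketch uses made-up counts and candidates; `balance` is a hypothetical helper that mirrors the accept-or-discard rule described above.

```python
def balance(counts, candidate_genres):
    """Accept candidates of underrepresented genres until all genres reach the modal count.

    counts: dict genre -> number of playlists already in the sample
    candidate_genres: iterable with the genre of each further playlist, in order
    Returns the updated counts and the positions of the accepted candidates.
    """
    counts = dict(counts)
    target = max(counts.values())
    accepted = []
    for position, genre in enumerate(candidate_genres):
        if sum(counts.values()) >= target * len(counts):
            break  # every genre has reached the target
        if counts.get(genre, target) < target:
            counts[genre] += 1
            accepted.append(position)
    return counts, accepted


start = {"pop": 3, "rap": 1, "rock": 2}                       # hypothetical counts
stream = ["pop", "rap", "rock", "rap", "pop", "rap", "rock"]  # hypothetical candidates
final_counts, accepted = balance(start, stream)
print(final_counts)  # {'pop': 3, 'rap': 3, 'rock': 3}
print(accepted)      # [1, 2, 3]
```

The first candidate is discarded because pop is already at the target, and the loop stops as soon as all genres are level, so later candidates are never examined.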


Finally, we check to make sure that the final dataframe is equally distributed among all genres:

In [30]:
display(genre_classified_playlists['playlist_genre'].value_counts())
display(genre_classified_playlists['playlist_genre'].value_counts(normalize=True))

genre_other       535
genre_pop         535
genre_rock        535
genre_rap         535
genre_pop rock    535
Name: playlist_genre, dtype: int64

genre_other       0.2
genre_pop         0.2
genre_rock        0.2
genre_rap         0.2
genre_pop rock    0.2
Name: playlist_genre, dtype: float64

Finally, we export the final dataframe as a CSV file, which will be used as the sample data for our machine learning models. This sample will be split into training and test data: the former for training different models and assessing their performance, and the latter for evaluating how well the trained models generalize.

In [None]:
genre_classified_playlists.to_csv ("playlist_df.csv")
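
Since the sample is balanced by genre, the later split into training and test data can preserve that balance by splitting within each genre. The sketch below is only an illustration: the 20% test share, the seed and the helper `stratified_split` are assumptions, and the labels are made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with the same test share in every class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = int(round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = ["genre_pop"] * 10 + ["genre_rap"] * 10  # hypothetical balanced labels
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 16 4
```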