# Spotify's song analysis

This is a personal project with two goals:

-  Analyze the user's favorite types of music based on their playlists
-  Discover who their favorite artists are

This project is also a way for me to strengthen my knowledge of Unsupervised Learning techniques and to become more proficient at building a Data Science project from scratch (from Data Acquisition to Storytelling).

Spotify for Developers : http://developer.spotify.com

### 1. Import Libraries

In [162]:
### Libraries to manage Spotify data

import spotipy
import spotipy.util as util
from spotipy.oauth2 import SpotifyClientCredentials

### Libraries for Data Processing

import pandas as pd 
import numpy as np

### Libraries for Data Visualization

import matplotlib.pyplot as plt
import seaborn as sns

### Libraries for Machine Learning

from sklearn.preprocessing import MinMaxScaler, Normalizer, OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans

### 2. Functions to load the data and create the DataFrame

#### 2.1. Connecting to Spotify

In [2]:
def connect_spotify(client_id, client_secret, username, data_location):
    """
    Connect the user to the Spotify API, giving access to their song data.
    
    The client ID and client secret are available in the user's Spotify for Developers dashboard.
    ------
    Parameters:
        - client_id: the user's Spotify client ID 
        - client_secret: the user's Spotify client secret
        - username: the user's Spotify username
        - data_location: the authorization scope, i.e. the part of the user's Spotify data to load (ex: 'user-library-read' for saved tracks)
    """
    
    ## Creating a Spotify object with the user logs
    
    client_credentials_manager = SpotifyClientCredentials(
        client_id=client_id, 
        client_secret=client_secret
    )
    
    sp = spotipy.Spotify(
        client_credentials_manager=client_credentials_manager
    )
    
    ## Choosing specific user data (the scope selects which part of the user's Spotify data to load, the token authenticates the user)
    
    scope = data_location # the location from where we want to load the data on Spotify (playlist, library, etc ...)
    token = util.prompt_for_user_token(username, scope) # Generate an authentication token

    ## Get read access to your library

    if token:
        sp = spotipy.Spotify(auth=token)
    else:
        print("Can't get token for", username)
        
    return sp    

#### 2.2. Creating the DataFrame

In [47]:
def create_dataframe(spotify_data, offset):
    """
    Create a DataFrame containing the data related to one page of songs in the user's library.

    Each call fetches a single page of saved tracks (20 by default) starting at `offset`,
    so the caller has to loop over the offsets to cover the whole library.
    ------
    Parameters:
        - spotify_data: the authenticated Spotify client returned by connect_spotify
        - offset: index of the first saved track to fetch
    """

    track_ids = []
    added_time_list = []
    artist_list = []
    title_list = []

    ## The API returns saved tracks in pages of 20 by default, so the offset determines from which track we start to collect the data
    ## An offset of 0 means that we will collect data from track 0 to track 19
    ## So, if you have 220 songs in your library, you will have to use 11 offsets to load your data entirely

    songs = spotify_data.current_user_saved_tracks(offset=offset)

    ## The response is a JSON-like dictionary whose 'items' key holds the tracks
    ## So we loop over the items to collect all the necessary data for our study

    for song in songs['items']:
        track_ids.append(song['track']['id'])               # ID of the song
        added_time_list.append(song['added_at'])            # Time at which the song got added
        title_list.append(song['track']['name'])            # Name of the song

        # As a song can have multiple artists, we join all of their names
        # into a single comma-separated string
        artist_list.append(','.join(artist['name'] for artist in song['track']['artists']))

    ## Collecting the audio features for each song and putting them into a DataFrame

    track_features = spotify_data.audio_features(track_ids)
    df_saved_tracks = pd.DataFrame(track_features)

    ### Adding timestamp, title and artists of a song as features in the DataFrame

    df_saved_tracks['added_at'] = added_time_list
    df_saved_tracks['song_title'] = title_list
    df_saved_tracks['artists'] = artist_list

    return df_saved_tracks
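
The paging logic described in the comments above can be sketched without calling the API. The snippet below is only an illustration, assuming a response shaped like the one used in `create_dataframe` (`items`, `limit` and `next` keys); `fake_saved_tracks` and `collect_all_ids` are hypothetical helpers, not part of spotipy.

```python
def fake_saved_tracks(offset=0, limit=20, total=45):
    """Hypothetical stand-in for the API: returns one page of a 45-track library."""
    last = min(offset + limit, total)
    items = [{'track': {'id': f'track{i}'}} for i in range(offset, last)]
    has_next = offset + limit < total
    return {'items': items, 'limit': limit, 'next': 'next-page-url' if has_next else None}


def collect_all_ids(fetch_page):
    """Walk through every page by moving the offset forward until 'next' is None."""
    track_ids = []
    offset = 0
    while True:
        page = fetch_page(offset=offset)
        track_ids.extend(item['track']['id'] for item in page['items'])
        if page['next'] is None:      # no more pages to read
            break
        offset += page['limit']       # jump to the first track of the next page
    return track_ids


all_ids = collect_all_ids(fake_saved_tracks)
print(len(all_ids))   # 45 tracks collected over 3 pages (20 + 20 + 5)
```

Stopping on `next` rather than on a fixed number of tracks means the loop does not need to know the size of the library in advance.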

### 3. Loading data

Link to explore the JSON response that contains all the songs in the user's library (cf. 'user-library-read' in the code): https://developer.spotify.com/console/get-current-user-saved-tracks

#### 3.1. Connecting to Spotify API

In [1]:
### User's credentials

user_credentials = {
    'client_id': 'XXX',
    'client_secret': 'XXX',
    'username': 'XXX',
    'data_location': 'user-library-read',
    'total_tracks': 220
}
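
As an optional variation on the dictionary above, the credentials can be read from environment variables so that they never appear in the notebook. This is a minimal sketch under assumptions: `load_credentials` is a hypothetical helper, `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the variable names spotipy itself looks for, and `SPOTIFY_USERNAME` is a name made up for this example.

```python
import os


def load_credentials(env=os.environ):
    """Build the credentials dictionary from environment variables."""
    return {
        'client_id': env.get('SPOTIPY_CLIENT_ID', 'XXX'),
        'client_secret': env.get('SPOTIPY_CLIENT_SECRET', 'XXX'),
        'username': env.get('SPOTIFY_USERNAME', 'XXX'),   # hypothetical variable name
        'data_location': 'user-library-read',
    }


# A plain dictionary stands in for the real environment in this example
example = load_credentials({'SPOTIPY_CLIENT_ID': 'my-id', 'SPOTIPY_CLIENT_SECRET': 'my-secret'})
print(example['client_id'])
```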

In [145]:
### Loading user's Spotify songs data

spotify_data = connect_spotify(
    user_credentials['client_id'], user_credentials['client_secret'], user_credentials['username'],
    user_credentials['data_location'])

spotify_data

<spotipy.client.Spotify at 0x1136a0990>

#### 3.2. Creating the DataFrame

In [144]:
### Creating the DataFrame containing all the track data in the user's Spotify library
### As the API returns 20 songs per call by default, we change the offset at each call in order to collect all tracks

offsets = np.arange(0, user_credentials['total_tracks'], 20)    # the offset range must be set between 0 and the number of songs in your library

# pd.concat stacks the pages (DataFrame.append is no longer available in pandas 2.0+)
df_songs = pd.concat([create_dataframe(spotify_data, offset) for offset in offsets])

df_songs.reset_index().head()

|   | index | acousticness | analysis_url | danceability | duration_ms | energy | id | instrumentalness | key | liveness | ... | speechiness | tempo | time_signature | track_href | type | uri | valence | added_at | song_title | artists |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.488 | https://api.spotify.com/v1/audio-analysis/4Bii... | 0.773 | 180120 | 0.699 | 4BiiOzZCrXEzHRLYcYFiD5 | 4e-06 | 1 | 0.0814 | ... | 0.0958 | 104.941 | 4 | https://api.spotify.com/v1/tracks/4BiiOzZCrXEz... | audio_features | spotify:track:4BiiOzZCrXEzHRLYcYFiD5 | 0.513 | 2019-09-08T14:30:51Z | Hope | The Chainsmokers,Winona Oak |
| 1 | 1 | 0.0975 | https://api.spotify.com/v1/audio-analysis/3Ni1... | 0.619 | 221693 | 0.901 | 3Ni1v6Xq3hIrnYqTlBlZyI | 0.0455 | 11 | 0.178 | ... | 0.0563 | 123.002 | 4 | https://api.spotify.com/v1/tracks/3Ni1v6Xq3hIr... | audio_features | spotify:track:3Ni1v6Xq3hIrnYqTlBlZyI | 0.503 | 2019-09-08T14:30:49Z | Remember (with ZOHARA) | Gryffin,ZOHARA |
| 2 | 2 | 0.00271 | https://api.spotify.com/v1/audio-analysis/3SEu... | 0.618 | 210928 | 0.754 | 3SEupjP7CBdIoNPrFrMozG | 0.0 | 4 | 0.213 | ... | 0.0882 | 148.013 | 4 | https://api.spotify.com/v1/tracks/3SEupjP7CBdI... | audio_features | spotify:track:3SEupjP7CBdIoNPrFrMozG | 0.485 | 2019-09-08T14:30:48Z | Bye Bye (feat. Ivy Adara) | Gryffin,Ivy Adara |
| 3 | 3 | 0.478 | https://api.spotify.com/v1/audio-analysis/50jj... | 0.644 | 192667 | 0.608 | 50jjD5M7AQOuJyF0PidOhj | 0.0 | 6 | 0.137 | ... | 0.0484 | 144.04 | 4 | https://api.spotify.com/v1/tracks/50jjD5M7AQOu... | audio_features | spotify:track:50jjD5M7AQOuJyF0PidOhj | 0.49 | 2019-09-08T14:30:47Z | Ugly - English Version | Anitta |
| 4 | 4 | 0.0371 | https://api.spotify.com/v1/audio-analysis/4DM3... | 0.685 | 183894 | 0.638 | 4DM3zxFlei14ZOyKFtEx5p | 0.0 | 6 | 0.0987 | ... | 0.0367 | 123.442 | 4 | https://api.spotify.com/v1/tracks/4DM3zxFlei14... | audio_features | spotify:track:4DM3zxFlei14ZOyKFtEx5p | 0.414 | 2019-09-08T14:30:46Z | Party For One | Carly Rae Jepsen |
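
The offsets passed to `create_dataframe` follow directly from the page size. As a minimal sketch, the hypothetical helper `page_offsets` below computes them for any track count; in practice that count could be taken from the `total` field of the Spotify paging response instead of a hard-coded value, which is an assumption since the notebook does not read that field.

```python
def page_offsets(total_tracks, page_size=20):
    """Return the offset of the first track of every page."""
    return list(range(0, total_tracks, page_size))


print(page_offsets(220))        # 11 offsets: 0, 20, ..., 200
print(len(page_offsets(204)))   # 11 pages as well, the last one holding 4 tracks
```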


### 4. Data Cleaning

For more information about the meaning of each audio feature of a track in Spotify: https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/

#### 4.1. Features Verification

In [138]:
### Structure of the DataFrame

df_songs.shape

(204, 21)

In [141]:
### Features of the DataFrame

df_songs.columns

Index(['acousticness', 'analysis_url', 'danceability', 'duration_ms', 'energy',
       'id', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'time_signature', 'track_href', 'type', 'uri',
       'valence', 'added_at', 'song_title', 'artists'],
      dtype='object')

In [153]:
### Remove some columns

col_to_drop = ['id', 'uri', 'track_href', 'analysis_url', 'type']
df_final = df_songs.drop(columns=col_to_drop)
print(df_final.shape)

(204, 16)


In [156]:
### Checking data type of each feature

print(df_final.info())
df_final.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 204 entries, 0 to 3
Data columns (total 16 columns):
acousticness        204 non-null float64
danceability        204 non-null float64
duration_ms         204 non-null int64
energy              204 non-null float64
instrumentalness    204 non-null float64
key                 204 non-null int64
liveness            204 non-null float64
loudness            204 non-null float64
mode                204 non-null int64
speechiness         204 non-null float64
tempo               204 non-null float64
time_signature      204 non-null int64
valence             204 non-null float64
added_at            204 non-null object
song_title          204 non-null object
artists             204 non-null object
dtypes: float64(9), int64(4), object(3)
memory usage: 27.1+ KB
None


|   | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence | added_at | song_title | artists |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.488 | 0.773 | 180120 | 0.699 | 4e-06 | 1 | 0.0814 | -5.982 | 0 | 0.0958 | 104.941 | 4 | 0.513 | 2019-09-08T14:30:51Z | Hope | The Chainsmokers,Winona Oak |
| 1 | 0.0975 | 0.619 | 221693 | 0.901 | 0.0455 | 11 | 0.178 | -4.173 | 0 | 0.0563 | 123.002 | 4 | 0.503 | 2019-09-08T14:30:49Z | Remember (with ZOHARA) | Gryffin,ZOHARA |
| 2 | 0.00271 | 0.618 | 210928 | 0.754 | 0.0 | 4 | 0.213 | -3.739 | 1 | 0.0882 | 148.013 | 4 | 0.485 | 2019-09-08T14:30:48Z | Bye Bye (feat. Ivy Adara) | Gryffin,Ivy Adara |
| 3 | 0.478 | 0.644 | 192667 | 0.608 | 0.0 | 6 | 0.137 | -5.058 | 1 | 0.0484 | 144.04 | 4 | 0.49 | 2019-09-08T14:30:47Z | Ugly - English Version | Anitta |
| 4 | 0.0371 | 0.685 | 183894 | 0.638 | 0.0 | 6 | 0.0987 | -6.539 | 1 | 0.0367 | 123.442 | 4 | 0.414 | 2019-09-08T14:30:46Z | Party For One | Carly Rae Jepsen |
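
The `Int64Index: 204 entries, 0 to 3` line in the `info()` output suggests that each page kept its own row labels when the pages were stacked, so the labels repeat from one page to the next. The sketch below reproduces the effect on two small made-up frames and shows how `ignore_index=True` in `pd.concat` yields a unique index instead; `page_1` and `page_2` are illustrative data, not the notebook's.

```python
import pandas as pd

# Two made-up "pages", each with its own default index starting at 0
page_1 = pd.DataFrame({'song_title': ['a', 'b', 'c']})
page_2 = pd.DataFrame({'song_title': ['d', 'e']})

# Plain stacking keeps the original labels, so they repeat
stacked = pd.concat([page_1, page_2])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([page_1, page_2], ignore_index=True)
print(renumbered.index.tolist())
```

A unique index matters later on, because label-based selection such as `df.loc[0]` would otherwise return one row per page.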


#### 4.2. Features Engineering

In [161]:
### Converting the duration of the track from ms to min

df_final = df_final.rename(columns={'duration_ms': 'duration_min'})
df_final['duration_min'] = df_final['duration_min'] / 60000
df_final.head()

|   | acousticness | danceability | duration_min | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence | added_at | song_title | artists |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.488 | 0.773 | 3.002 | 0.699 | 4e-06 | 1 | 0.0814 | -5.982 | 0 | 0.0958 | 104.941 | 4 | 0.513 | 2019-09-08T14:30:51Z | Hope | The Chainsmokers,Winona Oak |
| 1 | 0.0975 | 0.619 | 3.694883 | 0.901 | 0.0455 | 11 | 0.178 | -4.173 | 0 | 0.0563 | 123.002 | 4 | 0.503 | 2019-09-08T14:30:49Z | Remember (with ZOHARA) | Gryffin,ZOHARA |
| 2 | 0.00271 | 0.618 | 3.515467 | 0.754 | 0.0 | 4 | 0.213 | -3.739 | 1 | 0.0882 | 148.013 | 4 | 0.485 | 2019-09-08T14:30:48Z | Bye Bye (feat. Ivy Adara) | Gryffin,Ivy Adara |
| 3 | 0.478 | 0.644 | 3.211117 | 0.608 | 0.0 | 6 | 0.137 | -5.058 | 1 | 0.0484 | 144.04 | 4 | 0.49 | 2019-09-08T14:30:47Z | Ugly - English Version | Anitta |
| 4 | 0.0371 | 0.685 | 3.0649 | 0.638 | 0.0 | 6 | 0.0987 | -6.539 | 1 | 0.0367 | 123.442 | 4 | 0.414 | 2019-09-08T14:30:46Z | Party For One | Carly Rae Jepsen |
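
The conversion divides by 60000 because one minute is 60 × 1000 milliseconds, so the result is in decimal minutes: 3.002 means just over three minutes, not 3 minutes and 2 seconds. The sketch below checks the arithmetic on two durations from the table; `ms_to_minutes` and `ms_to_clock` are hypothetical helpers added for illustration.

```python
def ms_to_minutes(duration_ms):
    """Duration in decimal minutes (1 min = 60 * 1000 ms)."""
    return duration_ms / 60000


def ms_to_clock(duration_ms):
    """Duration as a 'minutes:seconds' string, dropping the milliseconds."""
    minutes, seconds = divmod(duration_ms // 1000, 60)
    return f"{minutes}:{seconds:02d}"


print(ms_to_minutes(180120))   # 3.002
print(ms_to_clock(180120))     # 3:00
print(ms_to_clock(221693))     # 3:41
```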


In [167]:
### Converting the timestamp at which each song was added from object type to datetime type
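
One possible way to perform this conversion is `pd.to_datetime`. The snippet below is only a sketch on two timestamps copied from the table above, assuming the `added_at` strings keep the ISO 8601 format shown there; `df_example` is a made-up frame, and this is not necessarily the code used in the notebook.

```python
import pandas as pd

# Two timestamps copied from the 'added_at' column
df_example = pd.DataFrame({'added_at': ['2019-09-08T14:30:51Z', '2019-09-08T14:30:49Z']})

# utc=True makes the result timezone-aware (the trailing 'Z' means UTC)
df_example['added_at'] = pd.to_datetime(df_example['added_at'], utc=True)

# Once converted, the .dt accessor exposes the date components
print(df_example['added_at'].dt.year.tolist())
print(df_example['added_at'].dt.hour.tolist())
```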

