# Introduction
This notebook is a first part of [eargasm-spotify](https://github.com/adamsiemaszkiewicz/eargasm-music) repository meant to fetch and clean up the playlist and track information from the [eargasm music channel](https://open.spotify.com/user/eargasmusic?si=HtTLbkG6QoqkdKU3uTRjAQ) at Spotify.

# Setup up the environment

## Google Drive mount

In [1]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [2]:
%cd /content/gdrive/My Drive/Colab Notebooks/eargasm-music/

/content/gdrive/My Drive/Colab Notebooks/eargasm-music


## Install extra libraries

### Spotipy
Spotipy handles the Spotify Web API.
https://spotipy.readthedocs.io/

In [3]:
!pip install spotipy



### Colab-env
Colab-env handles environment variables in Google Colab.
https://pypi.org/project/colab-env/

In [4]:
!pip install colab-env -qU

## Import libraries and functions
Let's import all the libraries and function we're gonna use throughout the notebook.

### System
- `os` - Miscellaneous operating system interfaces
- `timeit` - Measure execution time of small code snippets
- `colab_env` - Google Colab environment variables handling (*it will ask you to log in to your google account and enter your authorization code*)


In [5]:
import os
import timeit
import colab_env

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### Notebook
- `IPython.display` `Audio` & `Image` - Audio & imagery handling
- `tqdm.notebook` `tqdm` - Progress bars

In [6]:
import IPython
from IPython.display import Image # for images display
from tqdm.notebook import tqdm # for progress bars

### Data science

In [7]:
import pandas as pd # for data manipulation & analysis
import numpy as np # for linear algebra

### Spotify

In [8]:
import spotipy # Spotify Web API
from spotipy.oauth2 import SpotifyClientCredentials # for Spotify authentication

# Authentication
To authenticate the requests to the Spotify API I'll use [Spotipy](https://spotipy.readthedocs.io/en/2.16.0/) library. I'll use *Client Credentials Flow* method for authorization due to a higher rate limit and no need for `SPOTIPY_REDIRECT_URI`. 

Due to safety reasons the `client_id` & `client_secret` variables are loaded from the [colab-env](https://pypi.org/project/colab-env/) environment variables using `os.getenv()` method. 

After passing the variables to the `SpotifyClientCredentials()` module the `Spotify()` API client is created as `sp`. I set `request_timeout` parameter higher to avoid `ReadTimeoutError` errors,

In [9]:
client_id = os.getenv('CLIENT_ID')
client_secret = os.getenv('CLIENT_SECRET')

client_credentials_manager = SpotifyClientCredentials(client_id=client_id,
                                                      client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager,
                     requests_timeout=20)

# Playlists
Let's fetch all the existing playlists from the [eargasm music channel](https://open.spotify.com/user/eargasmusic?si=HtTLbkG6QoqkdKU3uTRjAQ).


## Channel info
Basic channel information. *Feel free to change this for your own channel.*

1. Set the username
2. Create a user object
3. Iterate through dictionary to display basic info


In [10]:
USER_NAME = 'eargasmusic'
user = sp.user(USER_NAME)
for key, value in user.items():
  if key == 'images': IMAGE_URL=value[0]['url']
  print('{}: {}'.format(key, value))
Image(url = IMAGE_URL, width = 300, height = 300)

display_name: Eargasm Musicblog
external_urls: {'spotify': 'https://open.spotify.com/user/eargasmusic'}
followers: {'href': None, 'total': 322}
href: https://api.spotify.com/v1/users/eargasmusic
id: eargasmusic
images: [{'height': None, 'url': 'https://i.scdn.co/image/ab6775700000ee85c34f83b0dc82b7ff64a06ac5', 'width': None}]
type: user
uri: spotify:user:eargasmusic


## Playlists info
Let's fetch the list of available playlists, display number of playlists as well as their names.

***Disclaimer:*** *There are two types of playlists:*

*- named playlists containing songs of a similar mood/sentiment (i.e. `eargasm | curvatronic`),*

*- yearly playlists containing all songs posted in a given year (i.e. `eargasm music | 2017`).*

*I will mostly focus on the named playlists but I'll take advantage of the yearly playlists in latter notebooks.*


In [11]:
playlist_dict = sp.user_playlists(USER_NAME)
playlist_items = playlist_dict['items']

print('The channel contains a total of {} playlists:'.format(len(playlist_items)))
for item in playlist_items:
  print('Name: {}\n Number of tracks: {}\n'.format(item['name'], item['tracks']['total']))

The channel contains a total of 38 playlists:
Name: eargasm | breathe easy
 Number of tracks: 173

Name: eargasm | city walk
 Number of tracks: 242

Name: eargasm | curvatronik
 Number of tracks: 213

Name: eargasm | decadency
 Number of tracks: 68

Name: eargasm | deep water
 Number of tracks: 93

Name: eargasm | departure
 Number of tracks: 112

Name: eargasm | dust settling
 Number of tracks: 85

Name: eargasm | get moving
 Number of tracks: 125

Name: eargasm | glide
 Number of tracks: 83

Name: eargasm | high frequency radio
 Number of tracks: 170

Name: eargasm | into the wild
 Number of tracks: 75

Name: eargasm | joyride
 Number of tracks: 167

Name: eargasm | kickin' it ol' skool
 Number of tracks: 94

Name: eargasm | loungin'
 Number of tracks: 133

Name: eargasm | neon socks
 Number of tracks: 100

Name: eargasm | on top
 Number of tracks: 39

Name: eargasm | organised noise
 Number of tracks: 163

Name: eargasm | oscilloscope
 Number of tracks: 131

Name: eargasm | polymers

## All playlists DataFrame
Let's extract the basic information we need for each of the playlists and put them in a Pandas DataFrame. 

We can either do it by fetching fresh data from Spotify API or import previously saved CSV file.

1. Perform a list comprehension
2. Create a DataFrame based on the list of details
3. Display first rows and information

### Fetch data

In [12]:
playlist_details = [[item['id'],
                     item['name'],
                     item['external_urls']['spotify'],
                     item['images'][0]['url'],
                     item['tracks']['total']] for item in playlist_items]

all_playlists = pd.DataFrame(playlist_details,
                             columns=['id', 'name', 'url', 'image', 'tracks'])

In [13]:
all_playlists.head()

Unnamed: 0,id,name,url,image,tracks
0,5apHWYcigR3lSZpyzyGKEa,eargasm | breathe easy,https://open.spotify.com/playlist/5apHWYcigR3l...,https://i.scdn.co/image/ab67706c0000bebbd3ccf5...,173
1,3MXM4ca1b3bT198F7mG9ms,eargasm | city walk,https://open.spotify.com/playlist/3MXM4ca1b3bT...,https://i.scdn.co/image/ab67706c0000da84e10d9c...,242
2,2QdM3NBe7lkOzC7OqWXfNI,eargasm | curvatronik,https://open.spotify.com/playlist/2QdM3NBe7lkO...,https://i.scdn.co/image/ab67706c0000bebb2aa390...,213
3,1CwPTyGbQDSda6m7vTys1d,eargasm | decadency,https://open.spotify.com/playlist/1CwPTyGbQDSd...,https://i.scdn.co/image/ab67706c0000da84c70dd4...,68
4,6pGQQZ4PITmFnSC0rTnmXp,eargasm | deep water,https://open.spotify.com/playlist/6pGQQZ4PITmF...,https://i.scdn.co/image/ab67706c0000da846e34ff...,93


In [14]:
all_playlists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38 entries, 0 to 37
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      38 non-null     object
 1   name    38 non-null     object
 2   url     38 non-null     object
 3   image   38 non-null     object
 4   tracks  38 non-null     int64 
dtypes: int64(1), object(4)
memory usage: 1.6+ KB


### Save to CSV
Let's export the DataFrame to external CSV file for backup and later use.

In [15]:
all_playlists.to_csv('all_playlists.csv')

### Load from CSV
Let's extract the basic information we need for each of the playlists and put them in a Pandas DataFrame using previously saved CSV to save time.

In [16]:
all_playlists = pd.read_csv('all_playlists.csv', index_col=0)

## Named & unnamed playlists DataFrames
Let's divide the `all_playlists` DataFrame into two seperate playlists for named and unnamed playlists.

1. Create a boolean filter Series to mask named and unnamed playlists based on the naming convention.
2. Mask `all_playlists` into `unnamed_playlists` and `named_playlists` using filter Series.


### Named playlists

In [17]:
named = all_playlists['name'].str.startswith('eargasm | ')
named_playlists = all_playlists[named.values]

In [18]:
named_playlists.head()

Unnamed: 0,id,name,url,image,tracks
0,5apHWYcigR3lSZpyzyGKEa,eargasm | breathe easy,https://open.spotify.com/playlist/5apHWYcigR3l...,https://i.scdn.co/image/ab67706c0000bebbd3ccf5...,173
1,3MXM4ca1b3bT198F7mG9ms,eargasm | city walk,https://open.spotify.com/playlist/3MXM4ca1b3bT...,https://i.scdn.co/image/ab67706c0000da84e10d9c...,242
2,2QdM3NBe7lkOzC7OqWXfNI,eargasm | curvatronik,https://open.spotify.com/playlist/2QdM3NBe7lkO...,https://i.scdn.co/image/ab67706c0000bebb2aa390...,213
3,1CwPTyGbQDSda6m7vTys1d,eargasm | decadency,https://open.spotify.com/playlist/1CwPTyGbQDSd...,https://i.scdn.co/image/ab67706c0000da84c70dd4...,68
4,6pGQQZ4PITmFnSC0rTnmXp,eargasm | deep water,https://open.spotify.com/playlist/6pGQQZ4PITmF...,https://i.scdn.co/image/ab67706c0000da846e34ff...,93


In [19]:
named_playlists.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29 entries, 0 to 28
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      29 non-null     object
 1   name    29 non-null     object
 2   url     29 non-null     object
 3   image   29 non-null     object
 4   tracks  29 non-null     int64 
dtypes: int64(1), object(4)
memory usage: 1.4+ KB


### Example named playlist
Let's see the example  named playlist information.

In [20]:
random_playlist_id = named_playlists['id'][np.random.randint(0, len(named_playlists['id']))]
random_filter = named_playlists['id'] == random_playlist_id
random_playlist_info = named_playlists[random_filter.values].iloc[0]
for key, value in random_playlist_info.iteritems():
    if key == 'image': IMAGE_URL=value
    else: print('{}: {}'.format(key, value))
Image(url = IMAGE_URL, width = 300, height = 300)

id: 5LzidLwoF4Th3WDHUoUEOK
name: eargasm | on top
url: https://open.spotify.com/playlist/5LzidLwoF4Th3WDHUoUEOK
tracks: 39


### Unnamed playlists

In [21]:
unnamed = all_playlists['name'].str.startswith('eargasm music ')
unnamed_playlists = all_playlists[unnamed.values]
unnamed_playlists.reset_index(drop=True, inplace=True)

In [22]:
unnamed_playlists.head()

Unnamed: 0,id,name,url,image,tracks
0,43754bIdP7b0ygh8tTMenW,eargasm music 2020,https://open.spotify.com/playlist/43754bIdP7b0...,https://i.scdn.co/image/ab67706c0000da84c881c9...,254
1,0MsxZLGhAKJyBMXAfD03db,eargasm music 2019,https://open.spotify.com/playlist/0MsxZLGhAKJy...,https://i.scdn.co/image/ab67706c0000da84eabc26...,815
2,4tFrGBRcTYsrz5BwCGZS8L,eargasm music 2018,https://open.spotify.com/playlist/4tFrGBRcTYsr...,https://i.scdn.co/image/ab67706c0000bebb96c0b2...,826
3,0tNl58CSFwviwg7LxWzdwy,eargasm music 2017,https://open.spotify.com/playlist/0tNl58CSFwvi...,https://i.scdn.co/image/ab67706c0000da84b7c821...,876
4,2CDNi9K1M0ilAUQn1FTVp4,eargasm music 2016,https://open.spotify.com/playlist/2CDNi9K1M0il...,https://i.scdn.co/image/ab67706c0000bebbde6035...,61


In [23]:
unnamed_playlists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      9 non-null      object
 1   name    9 non-null      object
 2   url     9 non-null      object
 3   image   9 non-null      object
 4   tracks  9 non-null      int64 
dtypes: int64(1), object(4)
memory usage: 488.0+ bytes


### Example unnamed playlist
Let's see the example unnamed playlist information.

In [24]:
random_playlist_id = unnamed_playlists['id'][np.random.randint(0, len(unnamed_playlists['id']))]
random_filter = unnamed_playlists['id'] == random_playlist_id
random_playlist_info = unnamed_playlists[random_filter.values].iloc[0]
for key, value in random_playlist_info.iteritems():
    if key == 'image': IMAGE_URL=value
    else: print('{}: {}'.format(key, value))
Image(url = IMAGE_URL, width = 300, height = 300)

id: 4DyXuBus6lNF6rBYqcOEp5
name: eargasm music 2015
url: https://open.spotify.com/playlist/4DyXuBus6lNF6rBYqcOEp5
tracks: 427


# Tracks
Now, let's gather information about track in the playlists. I'll fetch:
1. Basic track information from [Get a Track](https://developer.spotify.com/documentation/web-api/reference/tracks/get-track/) using `sp.playlist_items()`
2. Audio features from [Get Audio Features for a Track](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/) using `sp.audio_features()`
3. Audio analysis from [Get Audio Analysis for a Track](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-analysis/) using `sp.audio_analysis()`

## Basic information
Let's fetch the basic track info we need from `sp.playlist_items()` object and save it to DataFrame.
- **id** - individual track is (*string*)
- **artist** - artist name (*string*)
- **name** - track name (*string*)
- **duration_ms** - duration in miliseconds (*integer*)
- **popularity** - popularity of the track (*integer from 0 to 100*) 
- **release_date** - album release_date (*string*)
- **preview** - 30-sec mp3 track preview link (*string*)
- **url** - Spotify URL for the track (*string*)

### Initiate variables

In [25]:
track_id = []
track_artists = []
track_name = []
track_duration = []
track_popularity = []
track_releasedate = []
track_preview = []
track_image = []
track_url = []
track_playlist = []

### Fetch data 
Let's now fetch the tracks data for each of the named playlists using from `sp.playlist_items()`. The loops iterate through (starting from the most outer):
1. Named playlists ids
2. Chunks of 100 tracks within a playlist (*To bypass the Spotify query limit*)
3. Catalog information of a single track

***Disclaimer:*** *For convenience I'll use `timeit` to check the runtime of the cell as well as `tqdm()` for a progress bar.*

In [26]:
start = timeit.default_timer()

for playlist_id in tqdm(named_playlists['id']):
  
  for i in range(0,10000,100):
    playlist_items = sp.playlist_items(playlist_id, limit=100, offset=i)['items']

    for item in playlist_items:
      track_id.append(item['track']['id'])
      track_artists.append(item['track']['artists'][0]['name'])
      track_name.append(item['track']['name'])
      track_duration.append(item['track']['duration_ms'])
      track_popularity.append(item['track']['popularity'])
      track_releasedate.append(item['track']['album']['release_date'])
      track_preview.append(item['track']['preview_url'])
      track_image.append(item['track']['album']['images'][0]['url'])
      track_url.append(item['track']['external_urls']['spotify'])
      track_playlist.append(sp.playlist(playlist_id)['name'])  
    
stop = timeit.default_timer()
print('Runtime: {} seconds.'.format(stop-start))

HBox(children=(FloatProgress(value=0.0, max=29.0), HTML(value='')))


Runtime: 566.9173681399998 seconds.


### Create a DataFrame
Let's create a DataFrame using created variables.

In [27]:
basic_info_df = pd.DataFrame({'track_id': track_id,
                              'track_artists': track_artists,
                              'track_name': track_name,
                              'track_duration': track_duration,
                              'track_popularity': track_popularity,
                              'track_releasedate': track_releasedate,
                              'track_preview': track_preview,
                              'track_image': track_image,
                              'track_url': track_url,
                              'track_playlist': track_playlist})

In [28]:
basic_info_df.head()

Unnamed: 0,track_id,track_artists,track_name,track_duration,track_popularity,track_releasedate,track_preview,track_image,track_url,track_playlist
0,1ua6hBq18qZLyprXjMcpyf,Virgil Howe,Someday,251266,43,2009-10-19,https://p.scdn.co/mp3-preview/a2bdcba6acda937f...,https://i.scdn.co/image/ab67616d0000b27356dc5e...,https://open.spotify.com/track/1ua6hBq18qZLypr...,eargasm | breathe easy
1,42VpxSdGQgnV1UJkWeGYkA,Cass McCombs,Switch,254233,52,2016-08-26,https://p.scdn.co/mp3-preview/d0feea85b84ce9f5...,https://i.scdn.co/image/ab67616d0000b27396782c...,https://open.spotify.com/track/42VpxSdGQgnV1UJ...,eargasm | breathe easy
2,1g8A166soQjwl1ihqBWKGW,The Slow Revolt,Lean,207699,0,2016-09-09,,https://i.scdn.co/image/ab67616d0000b273ce48d6...,https://open.spotify.com/track/1g8A166soQjwl1i...,eargasm | breathe easy
3,6cAVWcj8TQ5yR2T6BZjnOg,Dirty Nice,Zero Summer,212640,0,2017-06-09,,https://i.scdn.co/image/ab67616d0000b2733a028c...,https://open.spotify.com/track/6cAVWcj8TQ5yR2T...,eargasm | breathe easy
4,3YA509E9ki7a3Ic9cf25Vt,Alex Ebert,Broken Record,274800,47,2017-05-05,https://p.scdn.co/mp3-preview/96c62ba3b9d730d3...,https://i.scdn.co/image/ab67616d0000b2738a6904...,https://open.spotify.com/track/3YA509E9ki7a3Ic...,eargasm | breathe easy


In [29]:
basic_info_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3621 entries, 0 to 3620
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   track_id           3621 non-null   object
 1   track_artists      3621 non-null   object
 2   track_name         3621 non-null   object
 3   track_duration     3621 non-null   int64 
 4   track_popularity   3621 non-null   int64 
 5   track_releasedate  3621 non-null   object
 6   track_preview      2532 non-null   object
 7   track_image        3621 non-null   object
 8   track_url          3621 non-null   object
 9   track_playlist     3621 non-null   object
dtypes: int64(2), object(8)
memory usage: 283.0+ KB


###Save to CSV
Export the basic information DataFrame to external CSV file.

In [30]:
basic_info_df.to_csv('basic_info.csv')

### Load from CSV
Import the basic track information from previously saved CSV file and put it into a Pandas DataFrame.

In [31]:
basic_info_df = pd.read_csv('basic_info.csv', index_col=0)

### Example track basic info
Let's see the example track basic information.

In [32]:
random_track_id = basic_info_df['track_id'][np.random.randint(0, len(basic_info_df['track_id']))]
random_filter = basic_info_df['track_id'] == random_track_id
random_track_info = basic_info_df[random_filter.values].iloc[0]
PREVIEW_URL = ''
IMAGE_URL = ''
for key, value in random_track_info.iteritems():
  
  if key == 'track_preview' and isinstance(value, str):
    PREVIEW_URL = value+'.mp3'
  elif key == 'track_image' and isinstance(value, str):
    IMAGE_URL = value
  print('{}: {}'.format(key, value))

IPython.display.Audio(url=PREVIEW_URL, embed=True)
Image(url=IMAGE_URL, width=300, height=300)

track_id: 5NiKic5j4TWoqS2J4iQTCt
track_artists: Mori
track_name: Ice Cream Summers
track_duration: 134883
track_popularity: 33
track_releasedate: 2019-09-25
track_preview: https://p.scdn.co/mp3-preview/2b2b49d95a4ec67340de0e8382ca58aedcb80e6b?cid=6f8f846f86b24d08a3de0e01d381894a
track_image: https://i.scdn.co/image/ab67616d0000b27343d137e1f5b64a7e73567a24
track_url: https://open.spotify.com/track/5NiKic5j4TWoqS2J4iQTCt
track_playlist: eargasm | city walk


## Audio features
Let's fetch the audio features of track info we need from `sp.audio_features()` object and save it to DataFrame.

- **id** - individual track is (*string*)
- **danceability** - suitability of a track for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity (*float between 0.0 and 1.0*)
- **energy** - perceptual measure of intensity and activity; perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy (*float between 0.0 and 1.0*)
- **speechiness** - presence of spoken words in a track; speech-like recordings - the closer to 1.0, above 0.66 tracks probably made entirely of spoken words above 0.66, track with both music and speech between 0.33 and 0.66, music and instrumental tracks below 0.33 (*float between 0.0 and 1.0*)
- **acousticness** - confidence measure wheather the track is acoustic (*float between 0.0 and 1.0*)
- **instrumnetalness** - confidence measure wheather the track contains no vocals (*float between 0.0 and 1.0*)
- **liveness** - confidence measure wheather the song was recorded live (detects audience in the recording) (**float between 0.0 and 1.0*)
- **valence** - describes the musical positiveness conveyed by a track (*float between 0.0 and 1.0*)


### The function
Let's build a function which takes the track id and returns a dictionary of features of our choice using `sp.audio_features()` method.

In [33]:
def audio_features(id):
    all_features = sp.audio_features(id)[0]
    columns_to_keep = ['id',
                       'danceability',
                       'energy',
                       'speechiness',
                       'acousticness',
                       'instrumentalness',
                       'liveness',
                       'valence']
    selected_features = { column: all_features[column] for column in columns_to_keep }
    
    return selected_features

### Example audio features
Let's fetch the sample of audio features for a random song in our set and see if the features align with real world listening experience.

1. Get a random track id
2. Print the track artist and name using `basic_info_df`
3. Extract the audio features using `audio_features()` function
4. Extract the 30-second mp3 preview URL if exists
5. Play the track preview

In [34]:
random_track_id = basic_info_df['track_id'][np.random.randint(0, len(basic_info_df['track_id']))]

df = basic_info_df.loc[basic_info_df['track_id'] == random_track_id]
print(df.iloc[0].loc[['track_artists', 'track_name']])

for feature, value in audio_features(random_track_id).items():
  print(feature, value) 
  
PREVIEW_URL = df.iloc[0].loc[['track_preview']][0]
if isinstance(PREVIEW_URL, str):
  PREVIEW_URL = PREVIEW_URL+'.mp3'
else: PREVIEW_URL = ''

IPython.display.Audio(url=PREVIEW_URL, embed=True)

track_artists             Bonobo
track_name       Flicker - Mixed
Name: 419, dtype: object
id 4pisjFv269FiNKKOCzU8DN
danceability 0.68
energy 0.718
speechiness 0.043
acousticness 0.037
instrumentalness 0.787
liveness 0.0823
valence 0.092


### Fetch data and create a DataFrame
Let's now fetch the audio features of each track in the `basic_info_df` DataFrame using our `audio_features()` function and create the `audio_features_df` DataFrame cointaining all the data retrieved. I'll also change the id column name to track_id.

***Disclaimer:*** *For convenience I'll use timeit to check the runtime of the cell as well as tqdm() for a progress bar.*

In [35]:
start = timeit.default_timer()

audio_features_df = pd.DataFrame()

for track_id in tqdm(basic_info_df['track_id']):
    features = audio_features(track_id)   
    audio_features_df = audio_features_df.append(features, ignore_index=True)

stop = timeit.default_timer()
print('Runtime: {} seconds.'.format(stop-start))

HBox(children=(FloatProgress(value=0.0, max=3621.0), HTML(value='')))


Runtime: 328.5993837850001 seconds.


In [36]:
audio_features_df.rename(columns={'id': 'track_id'},
                         inplace=True)

In [37]:
audio_features_df.head()

Unnamed: 0,acousticness,danceability,energy,track_id,instrumentalness,liveness,speechiness,valence
0,0.37,0.483,0.462,1ua6hBq18qZLyprXjMcpyf,0.21,0.0875,0.029,0.351
1,0.362,0.682,0.538,42VpxSdGQgnV1UJkWeGYkA,0.000123,0.324,0.0283,0.713
2,0.195,0.536,0.753,1g8A166soQjwl1ihqBWKGW,0.801,0.12,0.0309,0.676
3,0.742,0.663,0.509,6cAVWcj8TQ5yR2T6BZjnOg,6e-06,0.112,0.0889,0.303
4,0.24,0.464,0.57,3YA509E9ki7a3Ic9cf25Vt,0.00121,0.138,0.04,0.548


In [38]:
audio_features_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3621 entries, 0 to 3620
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      3621 non-null   float64
 1   danceability      3621 non-null   float64
 2   energy            3621 non-null   float64
 3   track_id          3621 non-null   object 
 4   instrumentalness  3621 non-null   float64
 5   liveness          3621 non-null   float64
 6   speechiness       3621 non-null   float64
 7   valence           3621 non-null   float64
dtypes: float64(7), object(1)
memory usage: 226.4+ KB


### Save to CSV
Let's export the DataFrame to external CSV file for backup and later use.

In [39]:
audio_features_df.to_csv('audio_features.csv')

### Load from CSV
Import the audio features from previously saved CSV file and put it into a Pandas DataFrame.

In [40]:
audio_feature_df = pd.read_csv('audio_features.csv', index_col=0)

## Audio analysis
Let's fetch the audio analysis info we need from `sp.audio_analysis()` object and save it to DataFrame.

- **tempo** - the overall estimated tempo of the track in beats per minute (BPM) (*float*)
- **tempo_confidence** - the reliability of the tempo (*float between 0.0 to 1.0*)
- **time_signature** - tn estimated overall time signature of a track; the time signature ranges from 3 to 7 indicating time signatures of 3/4, to 7/4. (*integer between 3 to 8*)
- **time-signature_confidence** - the reliability of the time signature (*float between 0.0 to 1.0*)
- **key** - the estimated overall key of the track; the values in this field range from 0 to 11 mapping to pitches using standard [Pitch Class notation](https://en.wikipedia.org/wiki/Pitch_class) (i.e. 0 = C, 1 = C♯/D♭, 2 = D, and so on). If no key was detected, the value is -1.
- **key_confidence** - the reliability of the key (*float between 0.0 and 1.0*)
- **mode** - the modality (major or minor) of a track, the type of scale from which its melodic content is derived (*0 - minor, 1 - major, or -1 - no result*)
- **mode_confidence** - the reliability of the mode (*float between 0.0 and 1.0*)
- **number_of_sections** - the number of parts of the track defined by large variations in rhythm or timbre, e.g. chorus, verse, bridge, guitar solo, etc. 

### The function
Let's build a function which takes the track id and returns a dictionary of audio analysis features of our choice using sp.audio_analysis() method.

In [41]:
def audio_analysis(id):
    track_features = sp.audio_analysis(id)['track']  
    
    columns_to_keep = ['tempo',
                       'tempo_confidence',
                       'time_signature',
                       'time_signature_confidence',
                       'key',
                       'key_confidence',
                       'mode',
                       'mode_confidence']
    
    selected_analysis = { column: track_features[column] for column in columns_to_keep }
    
    selected_analysis['number_of_sections'] = len(sp.audio_analysis(id)['sections'][0])
    selected_analysis['track_id'] = id
    
    return selected_analysis

### Example audio analysis
Let's fetch the sample of audio analysis for a random song in our set and see if the features align with real world listening experience.

1. Get a random track id
2. Print the track artist and name using `basic_info_df`
3. Extract the audio features using `audio_features()` function
4. Extract the 30-second mp3 preview URL if exists
5. Play the track preview

In [42]:
random_track_id = basic_info_df['track_id'][np.random.randint(0, len(basic_info_df['track_id']))]

df = basic_info_df.loc[basic_info_df['track_id'] == random_track_id]
print(df.iloc[0].loc[['track_artists', 'track_name']])

for feature, value in audio_analysis(random_track_id).items():
  print(feature, value) 
  
PREVIEW_URL = df.iloc[0].loc[['track_preview']][0]
if isinstance(PREVIEW_URL, str):
  PREVIEW_URL = PREVIEW_URL+'.mp3'
else: PREVIEW_URL = ''
IPython.display.Audio(url=PREVIEW_URL, embed=True)

track_artists    Gibmafuffi
track_name            Intro
Name: 299, dtype: object
tempo 80.209
tempo_confidence 0.071
time_signature 4
time_signature_confidence 0.916
key 5
key_confidence 0.845
mode 0
mode_confidence 0.699
number_of_sections 12
track_id 242vrIt51Rk4EPfqqe8uvp


### Fetch data and create a DataFrame
Let's now fetch the audio analysis of each track in the basic_info_df DataFrame using our audio_analysis() function and create the audio_analysis_df DataFrame cointaining all the data retrieved. In order to deal with track with no audio analysis data I'll use `try except` statement and fill missing data with `NaN's`

***Disclaimer:*** *For convenience I'll use timeit to check the runtime of the cell as well as tqdm() for a progress bar.*

In [43]:
start = timeit.default_timer()
audio_analysis_df = pd.DataFrame()


for track_id in tqdm(basic_info_df['track_id']):
    # Continue when encountered an error
    try:
        analysis = audio_analysis(track_id)
        audio_analysis_df = audio_analysis_df.append(analysis, ignore_index=True)
    except:
        audio_analysis_df = audio_analysis_df.append({'tempo': np.nan,
                                                      'tempo_confidence': np.nan,
                                                      'time_signature': np.nan,
                                                      'time_signature_confidence': np.nan,
                                                      'key': np.nan,
                                                      'key_confidence': np.nan,
                                                      'mode': np.nan,
                                                      'mode_confidence': np.nan,
                                                      'number_of_sections': np.nan,
                                                      'track_id': track_id},
                                                      ignore_index=True)

stop = timeit.default_timer()
print('Runtime: {} seconds.'.format(stop-start))

HBox(children=(FloatProgress(value=0.0, max=3621.0), HTML(value='')))

HTTP Error for GET to https://api.spotify.com/v1/audio-analysis/0XNRWcT0GVA8cyKaU8aT4D returned 404 due to analysis not found
HTTP Error for GET to https://api.spotify.com/v1/audio-analysis/0XNRWcT0GVA8cyKaU8aT4D returned 404 due to analysis not found



Runtime: 16963.136651581 seconds.


In [44]:
audio_analysis_df.head()

Unnamed: 0,key,key_confidence,mode,mode_confidence,number_of_sections,tempo,tempo_confidence,time_signature,time_signature_confidence,track_id
0,9.0,0.292,0.0,0.398,12.0,86.502,0.16,4.0,1.0,1ua6hBq18qZLyprXjMcpyf
1,7.0,0.743,0.0,0.453,12.0,98.003,0.583,4.0,1.0,42VpxSdGQgnV1UJkWeGYkA
2,1.0,0.655,1.0,0.368,12.0,85.036,0.325,4.0,0.979,1g8A166soQjwl1ihqBWKGW
3,11.0,0.547,1.0,0.56,12.0,125.088,0.476,4.0,0.845,6cAVWcj8TQ5yR2T6BZjnOg
4,0.0,0.905,1.0,0.753,12.0,170.556,0.183,4.0,0.799,3YA509E9ki7a3Ic9cf25Vt


In [45]:
audio_analysis_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3621 entries, 0 to 3620
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   key                        3619 non-null   float64
 1   key_confidence             3619 non-null   float64
 2   mode                       3619 non-null   float64
 3   mode_confidence            3619 non-null   float64
 4   number_of_sections         3619 non-null   float64
 5   tempo                      3619 non-null   float64
 6   tempo_confidence           3619 non-null   float64
 7   time_signature             3619 non-null   float64
 8   time_signature_confidence  3619 non-null   float64
 9   track_id                   3621 non-null   object 
dtypes: float64(9), object(1)
memory usage: 283.0+ KB


### Save to CSV
Let's export the DataFrame to external CSV file for backup and later use.

In [46]:
audio_analysis_df.to_csv('audio_analysis.csv')

### Load from CSV
Import the audio analysis information from previously saved CSV file and put it into a Pandas DataFrame.

In [47]:
audio_analysis_df = pd.read_csv('audio_analysis.csv', index_col=0)

## Merge DataFrames
Now, let's combine all previously fetched DataFrames.

### DataFrames summary

In [48]:
print('Basic info DataFrame\'s shape:', basic_info_df.shape)
print('Columns: ', list(basic_info_df.columns))
print('Audio features DataFrame\'s shape:', audio_features_df.shape)
print('Columns: ', list(audio_features_df.columns))
print('Audio analysis DataFrame\'s shape:', audio_analysis_df.shape)
print('Columns: ', list(audio_analysis_df.columns))

Basic info DataFrame's shape: (3621, 10)
Columns:  ['track_id', 'track_artists', 'track_name', 'track_duration', 'track_popularity', 'track_releasedate', 'track_preview', 'track_image', 'track_url', 'track_playlist']
Audio features DataFrame's shape: (3621, 8)
Columns:  ['acousticness', 'danceability', 'energy', 'track_id', 'instrumentalness', 'liveness', 'speechiness', 'valence']
Audio analysis DataFrame's shape: (3621, 10)
Columns:  ['key', 'key_confidence', 'mode', 'mode_confidence', 'number_of_sections', 'tempo', 'tempo_confidence', 'time_signature', 'time_signature_confidence', 'track_id']


In [49]:
basic_info_df.head()



Unnamed: 0,track_id,track_artists,track_name,track_duration,track_popularity,track_releasedate,track_preview,track_image,track_url,track_playlist
0,1ua6hBq18qZLyprXjMcpyf,Virgil Howe,Someday,251266,43,2009-10-19,https://p.scdn.co/mp3-preview/a2bdcba6acda937f...,https://i.scdn.co/image/ab67616d0000b27356dc5e...,https://open.spotify.com/track/1ua6hBq18qZLypr...,eargasm | breathe easy
1,42VpxSdGQgnV1UJkWeGYkA,Cass McCombs,Switch,254233,52,2016-08-26,https://p.scdn.co/mp3-preview/d0feea85b84ce9f5...,https://i.scdn.co/image/ab67616d0000b27396782c...,https://open.spotify.com/track/42VpxSdGQgnV1UJ...,eargasm | breathe easy
2,1g8A166soQjwl1ihqBWKGW,The Slow Revolt,Lean,207699,0,2016-09-09,,https://i.scdn.co/image/ab67616d0000b273ce48d6...,https://open.spotify.com/track/1g8A166soQjwl1i...,eargasm | breathe easy
3,6cAVWcj8TQ5yR2T6BZjnOg,Dirty Nice,Zero Summer,212640,0,2017-06-09,,https://i.scdn.co/image/ab67616d0000b2733a028c...,https://open.spotify.com/track/6cAVWcj8TQ5yR2T...,eargasm | breathe easy
4,3YA509E9ki7a3Ic9cf25Vt,Alex Ebert,Broken Record,274800,47,2017-05-05,https://p.scdn.co/mp3-preview/96c62ba3b9d730d3...,https://i.scdn.co/image/ab67616d0000b2738a6904...,https://open.spotify.com/track/3YA509E9ki7a3Ic...,eargasm | breathe easy


In [50]:
audio_features_df.head()

Unnamed: 0,acousticness,danceability,energy,track_id,instrumentalness,liveness,speechiness,valence
0,0.37,0.483,0.462,1ua6hBq18qZLyprXjMcpyf,0.21,0.0875,0.029,0.351
1,0.362,0.682,0.538,42VpxSdGQgnV1UJkWeGYkA,0.000123,0.324,0.0283,0.713
2,0.195,0.536,0.753,1g8A166soQjwl1ihqBWKGW,0.801,0.12,0.0309,0.676
3,0.742,0.663,0.509,6cAVWcj8TQ5yR2T6BZjnOg,6e-06,0.112,0.0889,0.303
4,0.24,0.464,0.57,3YA509E9ki7a3Ic9cf25Vt,0.00121,0.138,0.04,0.548


In [51]:
audio_analysis_df.head()

Unnamed: 0,key,key_confidence,mode,mode_confidence,number_of_sections,tempo,tempo_confidence,time_signature,time_signature_confidence,track_id
0,9.0,0.292,0.0,0.398,12.0,86.502,0.16,4.0,1.0,1ua6hBq18qZLyprXjMcpyf
1,7.0,0.743,0.0,0.453,12.0,98.003,0.583,4.0,1.0,42VpxSdGQgnV1UJkWeGYkA
2,1.0,0.655,1.0,0.368,12.0,85.036,0.325,4.0,0.979,1g8A166soQjwl1ihqBWKGW
3,11.0,0.547,1.0,0.56,12.0,125.088,0.476,4.0,0.845,6cAVWcj8TQ5yR2T6BZjnOg
4,0.0,0.905,1.0,0.753,12.0,170.556,0.183,4.0,0.799,3YA509E9ki7a3Ic9cf25Vt


### Merge
1. I will first merge `audio_features_df` with `audio_analysis_df` which represent features for each individual track and then drop duplicates of the same `track_id`.
2. Finally, I'll marge the `basic_info_df` with the `features_df` to add all the features to the entire list of tracks I fetched with a proper `track_playlist` label.

In [52]:

features_df = audio_features_df.merge(audio_analysis_df,
                                      how='inner',
                                      on='track_id')

features_df.drop_duplicates(subset='track_id', inplace=True)

final_df = basic_info_df.merge(features_df,
                               how='inner',
                               on='track_id')

In [53]:
final_df.head()

Unnamed: 0,track_id,track_artists,track_name,track_duration,track_popularity,track_releasedate,track_preview,track_image,track_url,track_playlist,acousticness,danceability,energy,instrumentalness,liveness,speechiness,valence,key,key_confidence,mode,mode_confidence,number_of_sections,tempo,tempo_confidence,time_signature,time_signature_confidence
0,1ua6hBq18qZLyprXjMcpyf,Virgil Howe,Someday,251266,43,2009-10-19,https://p.scdn.co/mp3-preview/a2bdcba6acda937f...,https://i.scdn.co/image/ab67616d0000b27356dc5e...,https://open.spotify.com/track/1ua6hBq18qZLypr...,eargasm | breathe easy,0.37,0.483,0.462,0.21,0.0875,0.029,0.351,9.0,0.292,0.0,0.398,12.0,86.502,0.16,4.0,1.0
1,42VpxSdGQgnV1UJkWeGYkA,Cass McCombs,Switch,254233,52,2016-08-26,https://p.scdn.co/mp3-preview/d0feea85b84ce9f5...,https://i.scdn.co/image/ab67616d0000b27396782c...,https://open.spotify.com/track/42VpxSdGQgnV1UJ...,eargasm | breathe easy,0.362,0.682,0.538,0.000123,0.324,0.0283,0.713,7.0,0.743,0.0,0.453,12.0,98.003,0.583,4.0,1.0
2,1g8A166soQjwl1ihqBWKGW,The Slow Revolt,Lean,207699,0,2016-09-09,,https://i.scdn.co/image/ab67616d0000b273ce48d6...,https://open.spotify.com/track/1g8A166soQjwl1i...,eargasm | breathe easy,0.195,0.536,0.753,0.801,0.12,0.0309,0.676,1.0,0.655,1.0,0.368,12.0,85.036,0.325,4.0,0.979
3,6cAVWcj8TQ5yR2T6BZjnOg,Dirty Nice,Zero Summer,212640,0,2017-06-09,,https://i.scdn.co/image/ab67616d0000b2733a028c...,https://open.spotify.com/track/6cAVWcj8TQ5yR2T...,eargasm | breathe easy,0.742,0.663,0.509,6e-06,0.112,0.0889,0.303,11.0,0.547,1.0,0.56,12.0,125.088,0.476,4.0,0.845
4,3YA509E9ki7a3Ic9cf25Vt,Alex Ebert,Broken Record,274800,47,2017-05-05,https://p.scdn.co/mp3-preview/96c62ba3b9d730d3...,https://i.scdn.co/image/ab67616d0000b2738a6904...,https://open.spotify.com/track/3YA509E9ki7a3Ic...,eargasm | breathe easy,0.24,0.464,0.57,0.00121,0.138,0.04,0.548,0.0,0.905,1.0,0.753,12.0,170.556,0.183,4.0,0.799


In [54]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3621 entries, 0 to 3620
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   track_id                   3621 non-null   object 
 1   track_artists              3621 non-null   object 
 2   track_name                 3621 non-null   object 
 3   track_duration             3621 non-null   int64  
 4   track_popularity           3621 non-null   int64  
 5   track_releasedate          3621 non-null   object 
 6   track_preview              2532 non-null   object 
 7   track_image                3621 non-null   object 
 8   track_url                  3621 non-null   object 
 9   track_playlist             3621 non-null   object 
 10  acousticness               3621 non-null   float64
 11  danceability               3621 non-null   float64
 12  energy                     3621 non-null   float64
 13  instrumentalness           3621 non-null   float

### Save to CSV
Let's export the DataFrame to external CSV file for backup and later use.

In [55]:
final_df.to_csv('final_df.csv')

### Load from CSV
Import the entire list of tracks with features from previously saved CSV file and put it into a Pandas DataFrame.

In [56]:
final_df = pd.read_csv('final_df.csv', index_col=0)

## References
For this part of the project I used mostly:

- Spotify Web API ([link](https://developer.spotify.com/documentation/web-api/))
- Spotipy library documentation ([link](https://spotipy.readthedocs.io/))