# Project: Algo-Rhythms
## The Spotify Song Popularity Predictor and Song Recommender System

# Introduction 

Founded in 2006, Spotify is a digital music and podcast streaming service that gives users access to its massive library of over 70 million songs and other content from all over the world. As of the end of 2020, <a href="https://www.businessofapps.com/data/spotify-statistics/">Spotify</a> held the title of the world's most popular music streaming platform with 345 million users (of which 155 million are paying subscribers), generating USD 9.5 billion in revenue that year.

One key reason for Spotify's dominance in the music streaming space was its "Discover Weekly" feature, which was designed entirely around a machine learning algorithm that generates a curated playlist of 30 songs tailored to the listening habits of the user. Using a collaborative filtering recommender system, the algorithm examines the playlists of other users to identify similarities among the songs and then recommends similar songs that match the user's musical profile. Since then, Spotify's continuous innovation and refinement of its algorithms, drawing on the massive amounts of data collected from its listeners, has helped it implement new features such as "Daily Mixes", "Release Radar" and "Playlist Radio" through <a href="https://www.analyticssteps.com/blogs/how-spotify-using-big-data">variations</a> of its highly successful recommender system.

However, despite the numerous algorithm-driven features Spotify offers its users, its enormous database of tracks can still yield many more insights that benefit music creators and music listeners alike.

# Problem Statement

The aims of this project are two-fold:
- Predict the popularity of a song based on its musical features (e.g. acousticness, energy, tempo, etc.) to identify the ideal blend of features characteristic of popular songs.
- Based on the key musical features, build a song recommender system that returns a playlist of 30 songs with musical features similar to a particular genre (e.g. newcastle nsw indie). A subset of the top 30 genres will be selected to test the recommender system, and the resulting playlists will be shared with other Spotify users to get feedback on how similar the songs in each playlist are.

# Executive Summary

The Kaggle Spotify Dataset consists of over 170,000 tracks, including songs, audiobooks and podcasts. As of 29 April 2021, the most recent version of the dataset (version 15) consists of around 600,000 tracks, so to access the dataset I used in this project, use version 11 found <a href="https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks/version/11">here.</a>

There are 2 main files explored in this project: 
- Data: consisting of all songs, their musical features, popularity score and release date 
- Data by genres: consisting of all genres and the average score of the musical features of all songs belonging to these genres. 

More emphasis was placed on cleaning and exploring the data file, as it contains all songs together with our target variable (popularity). We began by dropping all duplicated songs with the same ID and musical features, and by removing audiobooks, podcasts and <a href="https://www.whathifi.com/features/25-best-tracks-testing-bass">bass testing songs</a> from the dataset, as we do not want these to influence the modelling and song recommendation processes. In addition, given that most recent songs typically have a duration of up to <a href="https://www.dancemusicnw.com/science-behind-shorter-songs-2019/">5 minutes</a>, songs that are too long were dropped as well (the song November Rain, at 8.93 mins, was used as our reference point to provide some variety in the model). With this, the initial 175,000 songs were reduced to around 160,000.
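The duration filter described above can be sketched on a toy frame. This is only an illustration: the rows are made up, and the names `toy`, `MAX_DURATION_MINS` and `kept` are hypothetical. It assumes the cut-off is applied to a `duration_mins` column derived from `duration_ms`.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "name": ["short song", "long but acceptable song", "very long track"],
    "duration_ms": [210_000, 530_000, 1_200_000],
})

# Convert milliseconds to minutes
toy["duration_mins"] = toy["duration_ms"] / 60_000

# Assumed cut-off: the 8.93-minute reference point mentioned above
MAX_DURATION_MINS = 8.93
kept = toy[toy["duration_mins"] <= MAX_DURATION_MINS]

print(kept[["name", "duration_mins"]])
```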

For the data analysis process, the key musical features (acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo and valence) were examined over the years to ascertain their relationship with a song's popularity and to determine how music has evolved over the decades. After conducting time series analysis on these features, these relationships became more pronounced and we were able to obtain more meaningful results.

To provide context on how Spotify calculates the "popularity" score: "The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are" (<a href="https://help.musicinsights.com/hc/en-us/articles/360049246653-What-is-the-Spotify-Popularity-Score-#:~:text=The%20score%20is%20received%20from,how%20recent%20those%20plays%20are.">source</a>). Thus, recent songs would be expected to have, by and large, a higher popularity score than older songs (e.g. from the 1920s/1930s). We want popularity to be based purely on musical features, so the "year" feature was dropped from the modelling process. Finally, an arbitrary threshold of 0.70 was set as the minimum popularity for a song to be considered "popular". Such songs made up only 2.6% of our dataset.
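The thresholding step can be sketched as follows. The raw `popularity` column runs from 0 to 100, so this sketch assumes the 0.70 threshold corresponds to a raw score of 70 (that is, popularity rescaled by 100). The rescaling and the names `toy`, `POPULAR_THRESHOLD_RAW` and `popular` are assumptions for illustration.

```python
import pandas as pd

# Toy popularity scores on Spotify's raw 0-100 scale
toy = pd.DataFrame({"popularity": [0, 25, 69, 70, 100]})

# Assumption: a threshold of 0.70 on a 0-1 scale equals 70 on the raw scale
POPULAR_THRESHOLD_RAW = 70

# 1 = "popular", 0 = "not popular"
toy["popular"] = (toy["popularity"] >= POPULAR_THRESHOLD_RAW).astype(int)

# Share of positive labels; a small share signals class imbalance
popular_share = toy["popular"].mean()
print(toy)
print(f"Share of popular songs: {popular_share:.1%}")
```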

The features considered were: 'acousticness', 'danceability', 'duration_mins', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence'

These were then fed into the modelling process. To account for the imbalanced classes, we oversampled the minority (popular) class using SMOTE. The models used were:

- Logistic Regression
- K Nearest Neighbors
- Random Forest
- XGBoost
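To make the SMOTE step mentioned above more concrete, here is a minimal NumPy sketch of its core idea: each synthetic minority sample is placed at a random point on the line segment between a real minority sample and one of its nearest minority neighbours. This is not the library implementation used in the project. The function `smote_like_oversample` and its parameters are hypothetical, and the toy points are made up.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its k neighbours for each new row
    base = rng.integers(0, len(X_min), size=n_new)
    picks = neighbours[base, rng.integers(0, k, size=n_new)]

    # Random position along the segment between base and neighbour
    gaps = rng.random((n_new, 1))
    return X_min[base] + gaps * (X_min[picks] - X_min[base])

# Toy minority class with two features
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like_oversample(X_min, n_new=10, k=3)
print(synthetic.shape)
```

Because every synthetic row is an interpolation of two real rows, it always stays inside the range spanned by the existing minority samples.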

Hyperparameter tuning was done to obtain the best training and cross-validated scores for each pipeline. We then computed the classification metrics for the best model and examined its ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve) to determine the model's ability to distinguish between popular and non-popular songs. Due to the immense size of the dataset, the dataset was broken up into 2 parts (1980 to 1999, and 2000 to 2021) to gain a clearer picture of which musical features contributed to the popularity of songs in the last 40 years.
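As a reminder of what the AUC measures, the sketch below computes it directly from its ranking interpretation: the probability that a randomly chosen popular song receives a higher score than a randomly chosen non-popular one, with ties counted as half. The helper `roc_auc` and the toy labels and scores are hypothetical. In practice a library routine would normally be used instead.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]

    # Compare every positive score with every negative score
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.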

Finally, the song recommendation system was built on a smaller subset of the dataset - only songs from 2000 onwards were used, giving us around 36,000 songs. This was due to computational constraints (over 200GB of RAM required for 160,000 songs), as well as the fact that the more recent songs were created with better thought and instruments, and are generally more diverse. 
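A quick back-of-the-envelope check makes the memory constraint concrete. The sketch assumes the recommender stores a dense song-by-song similarity matrix of 8-byte floats. That storage layout is an assumption for illustration, and `dense_matrix_gb` is a hypothetical helper.

```python
def dense_matrix_gb(n_items, bytes_per_value=8):
    """Memory (in GB, 1 GB = 1e9 bytes) of a dense n_items x n_items matrix."""
    return n_items ** 2 * bytes_per_value / 1e9

full_gb = dense_matrix_gb(160_000)    # all cleaned songs
subset_gb = dense_matrix_gb(36_000)   # songs from 2000 onwards

print(f"160,000 songs: {full_gb:.1f} GB")
print(f" 36,000 songs: {subset_gb:.1f} GB")
```

Under this assumption the full dataset needs about 204.8 GB, in line with the "over 200GB" figure above, while the 36,000-song subset needs roughly 10.4 GB.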

## 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

%matplotlib inline

## 2. Data Cleaning 

### 2.1 Main Dataset

In [2]:
df = pd.read_csv("../datasets/data.csv")
df.tail(10)

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
174379,0.795,['Alessia Cara'],0.429,144720,0.211,0,45XnLMuqf3vRfskEAMUeCH,0.0,4,0.196,-11.665,1,A Little More,0,2021-01-22,0.036,94.71,0.228,2021
174380,0.0484,"['Stephan F', 'YA-YA']",0.693,177148,0.826,0,1Cbf6PLWsL4s51eFepXx6L,1.2e-05,1,0.231,-2.669,1,Only Tonight - Radio Edit,0,2020-12-25,0.0762,126.049,0.361,2020
174381,0.795,['Alessia Cara'],0.429,144720,0.211,0,4pPFI9jsguIh3wC7Otoyy8,0.0,4,0.196,-11.665,1,A Little More,0,2021-01-22,0.036,94.71,0.228,2021
174382,0.141,"['BigBankCarti', 'Keyvo400']",0.544,215014,0.407,1,3ASGdyWXeXsXtOIWtm0tv4,0.0,4,0.253,-12.745,0,LayUp,0,2020-12-31,0.233,129.75,0.49,2020
174383,0.795,['Alessia Cara'],0.429,144720,0.211,0,52YtxLVUyvtiGPxwwxayHZ,0.0,4,0.196,-11.665,1,A Little More,0,2021-01-22,0.036,94.71,0.228,2021
174384,0.00917,"['DJ Combo', 'Sander-7', 'Tony T']",0.792,147615,0.866,0,46LhBf6TvYjZU2SMvGZAbn,6e-05,6,0.178,-5.089,0,The One,0,2020-12-25,0.0356,125.972,0.186,2020
174385,0.795,['Alessia Cara'],0.429,144720,0.211,0,7tue2Wemjd0FZzRtDrQFZd,0.0,4,0.196,-11.665,1,A Little More,0,2021-01-22,0.036,94.71,0.228,2021
174386,0.806,['Roger Fly'],0.671,218147,0.589,0,48Qj61hOdYmUCFJbpQ29Ob,0.92,4,0.113,-12.393,0,Together,0,2020-12-09,0.0282,108.058,0.714,2020
174387,0.92,['Taylor Swift'],0.462,244000,0.24,1,1gcyHQpBQ1lfXGdhZmWrHP,0.0,0,0.113,-12.077,1,champagne problems,69,2021-01-07,0.0377,171.319,0.32,2021
174388,0.239,['Roger Fly'],0.677,197710,0.46,0,57tgYkWQTNHVFEt6xDKKZj,0.891,7,0.215,-12.237,1,Improvisations,0,2020-12-09,0.0258,112.208,0.747,2020


In [3]:
df.shape

(174389, 19)

In [4]:
df.columns

Index(['acousticness', 'artists', 'danceability', 'duration_ms', 'energy',
       'explicit', 'id', 'instrumentalness', 'key', 'liveness', 'loudness',
       'mode', 'name', 'popularity', 'release_date', 'speechiness', 'tempo',
       'valence', 'year'],
      dtype='object')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174389 entries, 0 to 174388
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   acousticness      174389 non-null  float64
 1   artists           174389 non-null  object 
 2   danceability      174389 non-null  float64
 3   duration_ms       174389 non-null  int64  
 4   energy            174389 non-null  float64
 5   explicit          174389 non-null  int64  
 6   id                174389 non-null  object 
 7   instrumentalness  174389 non-null  float64
 8   key               174389 non-null  int64  
 9   liveness          174389 non-null  float64
 10  loudness          174389 non-null  float64
 11  mode              174389 non-null  int64  
 12  name              174389 non-null  object 
 13  popularity        174389 non-null  int64  
 14  release_date      174389 non-null  object 
 15  speechiness       174389 non-null  float64
 16  tempo             17

In this dataset, there are a total of 19 variables: 13 numerical variables, 3 text variables (artists, name and release_date), 2 dummy variables (explicit and mode) and the ID variable.
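One way to tally variable types programmatically is sketched below on a toy frame with a handful of the dataset's column names. Treating numeric columns that contain only 0 and 1 as dummy variables is an assumption of this sketch, and the toy values are illustrative.

```python
import pandas as pd

# Toy frame reusing a few column names from the dataset
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "name": ["x", "y", "z"],
    "explicit": [0, 1, 0],
    "mode": [1, 1, 0],
    "tempo": [94.71, 126.049, 129.75],
    "year": [2021, 2020, 2020],
})

numeric_cols = list(toy.select_dtypes(include="number").columns)

# Assumption: numeric columns holding only 0/1 are dummy variables
dummy_cols = [c for c in numeric_cols if set(toy[c].unique()) <= {0, 1}]
numerical_cols = [c for c in numeric_cols if c not in dummy_cols]
text_cols = [c for c in toy.columns if c not in numeric_cols]

print("numerical:", numerical_cols)
print("dummy:", dummy_cols)
print("text / ID:", text_cols)
```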

In [6]:
df.describe()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,popularity,speechiness,tempo,valence,year
count,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0,174389.0
mean,0.499228,0.536758,232810.0,0.482721,0.068135,0.197252,5.205305,0.211123,-11.750865,0.702384,25.693381,0.105729,117.0065,0.524533,1977.061764
std,0.379936,0.176025,148395.8,0.272685,0.251978,0.334574,3.518292,0.180493,5.691591,0.457211,21.87274,0.18226,30.254178,0.264477,26.90795
min,0.0,0.0,4937.0,0.0,0.0,0.0,0.0,0.0,-60.0,0.0,0.0,0.0,0.0,0.0,1920.0
25%,0.0877,0.414,166133.0,0.249,0.0,0.0,2.0,0.0992,-14.908,0.0,1.0,0.0352,93.931,0.311,1955.0
50%,0.517,0.548,205787.0,0.465,0.0,0.000524,5.0,0.138,-10.836,1.0,25.0,0.0455,115.816,0.536,1977.0
75%,0.895,0.669,265720.0,0.711,0.0,0.252,8.0,0.27,-7.499,1.0,42.0,0.0763,135.011,0.743,1999.0
max,0.996,0.988,5338302.0,1.0,1.0,1.0,11.0,1.0,3.855,1.0,100.0,0.971,243.507,1.0,2021.0


#### Check for null values

In [7]:
df.isnull().sum()

acousticness        0
artists             0
danceability        0
duration_ms         0
energy              0
explicit            0
id                  0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
name                0
popularity          0
release_date        0
speechiness         0
tempo               0
valence             0
year                0
dtype: int64

#### Duplicate song IDs

In [8]:
df[df.duplicated(subset='id')]

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
9525,0.56700,['Neil Diamond'],0.515,180253,0.6410,0,1BmVQ5RGqqtF5cnsv6cQYu,0.064200,5,0.3220,-5.573,1,"Girl, You'll Be A Woman Soon",60,1968,0.0272,109.558,0.655,1968
9534,0.02710,['Neil Diamond'],0.560,163907,0.8270,0,2SS3WeSe24ZqTlTSK4KzQZ,0.002850,8,0.0551,-4.157,1,Cherry Cherry,54,1968,0.0306,84.383,0.904,1968
16113,0.97400,"['Johann Strauss II', 'Riccardo Muti', 'Wiener...",0.219,459053,0.0855,0,5zZbXSRIFe1uWNmEM7f2XI,0.922000,0,0.3550,-19.703,0,"Frühlingsstimmen, Walzer, Op. 410",34,2021-01-08,0.0404,171.849,0.156,2021
16663,0.35500,"['Waylon Jennings', 'Willie Nelson']",0.626,184267,0.4570,0,0sFq478LIo9BFwf2qzMzzF,0.000009,4,0.0668,-13.785,1,The Year 2003 Minus 25 - Remastered,43,1978-01-01,0.0384,102.166,0.474,1978
16669,0.20200,['Ten Years After'],0.384,224133,0.5160,0,19HjHUjCfDrEYhVSIKG6nK,0.180000,9,0.1140,-12.032,0,I'd Love to Change the World - 2004 Remaster,60,1971-11-11,0.0345,118.129,0.371,1971
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174380,0.04840,"['Stephan F', 'YA-YA']",0.693,177148,0.8260,0,1Cbf6PLWsL4s51eFepXx6L,0.000012,1,0.2310,-2.669,1,Only Tonight - Radio Edit,0,2020-12-25,0.0762,126.049,0.361,2020
174382,0.14100,"['BigBankCarti', 'Keyvo400']",0.544,215014,0.4070,1,3ASGdyWXeXsXtOIWtm0tv4,0.000000,4,0.2530,-12.745,0,LayUp,0,2020-12-31,0.2330,129.750,0.490,2020
174384,0.00917,"['DJ Combo', 'Sander-7', 'Tony T']",0.792,147615,0.8660,0,46LhBf6TvYjZU2SMvGZAbn,0.000060,6,0.1780,-5.089,0,The One,0,2020-12-25,0.0356,125.972,0.186,2020
174386,0.80600,['Roger Fly'],0.671,218147,0.5890,0,48Qj61hOdYmUCFJbpQ29Ob,0.920000,4,0.1130,-12.393,0,Together,0,2020-12-09,0.0282,108.058,0.714,2020


In [9]:
df['id'].value_counts().sort_values(ascending=False).head()

1xQvPFljQXA3GCK869ERvC    9
7tJS1cjSD1P8bodNGblYiK    9
0UsmyJDsst2xhX1ZiFF3JW    9
7GlixhQpXo76vgALqoJ3L5    8
7Kh32CyazzTdVEBXjKINVO    8
Name: id, dtype: int64

In [10]:
# Inspecting the first instance of repeated IDs 
df[df['id'] == '0UsmyJDsst2xhX1ZiFF3JW']

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
16071,0.969,['Schoolgirl Byebye'],0.314,74302,0.0855,0,0UsmyJDsst2xhX1ZiFF3JW,0.795,9,0.16,-15.775,1,"Year,2015",18,2020-09-16,0.0342,69.893,0.161,2020
16881,0.969,['Schoolgirl Byebye'],0.314,74302,0.0855,0,0UsmyJDsst2xhX1ZiFF3JW,0.795,9,0.16,-15.775,1,"Year,2015",18,2020-09-16,0.0342,69.893,0.161,2020
17883,0.969,['Schoolgirl Byebye'],0.314,74302,0.0855,0,0UsmyJDsst2xhX1ZiFF3JW,0.795,9,0.16,-15.775,1,"Year,2015",18,2020-09-16,0.0342,69.893,0.161,2020
18063,0.969,['Schoolgirl Byebye'],0.314,74302,0.0855,0,0UsmyJDsst2xhX1ZiFF3JW,0.795,9,0.16,-15.775,1,"Year,2015",18,2020-09-16,0.0342,69.893,0.161,2020
18269,0.969,['Schoolgirl Byebye'],0.314,74302,0.0855,0,0UsmyJDsst2xhX1ZiFF3JW,0.795,9,0.16,-15.775,1,"Year,2015",18,2020-09-16,0.0342,69.893,0.161,2020
18863,0.969,['Schoolgirl Byebye'],0.314,74302,0.0855,0,0UsmyJDsst2xhX1ZiFF3JW,0.795,9,0.16,-15.775,1,"Year,2015",18,2020-09-16,0.0342,69.893,0.161,2020
19065,0.969,['Schoolgirl Byebye'],0.314,74302,0.0855,0,0UsmyJDsst2xhX1ZiFF3JW,0.795,9,0.16,-15.775,1,"Year,2015",18,2020-09-16,0.0342,69.893,0.161,2020
19467,0.969,['Schoolgirl Byebye'],0.314,74302,0.0855,0,0UsmyJDsst2xhX1ZiFF3JW,0.795,9,0.16,-15.775,1,"Year,2015",18,2020-09-16,0.0342,69.893,0.161,2020
20071,0.969,['Schoolgirl Byebye'],0.314,74302,0.0855,0,0UsmyJDsst2xhX1ZiFF3JW,0.795,9,0.16,-15.775,1,"Year,2015",18,2020-09-16,0.0342,69.893,0.161,2020


Looking at these rows, we note that all the features are identical, so all duplicates with the same ID are dropped.
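Inspecting one ID by eye does not guarantee that every repeated ID is an exact copy. A possible check, sketched on a toy frame with made-up values, counts the distinct values per column within each ID. Any ID with more than one distinct value in some column would need a closer look before dropping.

```python
import pandas as pd

# Toy frame: ID "A" is an exact duplicate, ID "B" differs in energy
toy = pd.DataFrame({
    "id": ["A", "A", "B", "B", "C"],
    "energy": [0.5, 0.5, 0.3, 0.4, 0.9],
    "tempo": [100.0, 100.0, 120.0, 120.0, 90.0],
})

# Number of distinct values per column within each ID
distinct_per_id = toy.groupby("id").nunique()

# IDs whose rows disagree in at least one column
conflicting_ids = distinct_per_id[(distinct_per_id > 1).any(axis=1)].index.tolist()
print(conflicting_ids)
```

An empty list would confirm that keeping only the first row per ID loses no information.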

In [11]:
df.drop_duplicates(['id'], keep='first', inplace=True)

In [12]:
df.shape

(172230, 19)

#### Duplicate song names

In [13]:
df[df.duplicated(subset='name')]

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
86,0.995,['Amalia Vaka'],0.287,216720,0.621,0,2XVujD5mP2bMSXLkpqeWKU,0.7840,3,0.1510,-10.843,0,Argilamas kai psarades,0,1920-01-01,0.0394,77.537,0.559,1920
88,0.932,['Marika Papagkika'],0.382,231400,0.492,0,2ZZ4rkQvNDkQt86sdwd398,0.0137,1,0.1570,-8.321,1,Sti filaki me valane,0,1920-01-01,0.0345,88.525,0.640,1920
126,0.895,['Xaralampos Kritikos'],0.474,227320,0.444,0,3kjaYn5CnAM2v00hkJY7Wb,0.0000,6,0.1660,-9.783,1,To ouest,0,1920-01-01,0.0378,80.317,0.920,1920
129,0.993,['Takis Nikolaou'],0.453,255520,0.410,0,3vTys7kWK733lvNmrcbIiP,0.1600,2,0.2920,-10.469,1,Mparmpaouzos,0,1920-01-01,0.0440,67.271,0.894,1920
133,0.700,['Greg Fieler'],0.197,224052,0.511,0,40j8iS1ClIh73xUFrvdEJg,0.0000,4,0.0661,-11.444,0,I'll See You in C U B A,0,1920,0.0413,148.502,0.267,1920
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174379,0.795,['Alessia Cara'],0.429,144720,0.211,0,45XnLMuqf3vRfskEAMUeCH,0.0000,4,0.1960,-11.665,1,A Little More,0,2021-01-22,0.0360,94.710,0.228,2021
174381,0.795,['Alessia Cara'],0.429,144720,0.211,0,4pPFI9jsguIh3wC7Otoyy8,0.0000,4,0.1960,-11.665,1,A Little More,0,2021-01-22,0.0360,94.710,0.228,2021
174383,0.795,['Alessia Cara'],0.429,144720,0.211,0,52YtxLVUyvtiGPxwwxayHZ,0.0000,4,0.1960,-11.665,1,A Little More,0,2021-01-22,0.0360,94.710,0.228,2021
174385,0.795,['Alessia Cara'],0.429,144720,0.211,0,7tue2Wemjd0FZzRtDrQFZd,0.0000,4,0.1960,-11.665,1,A Little More,0,2021-01-22,0.0360,94.710,0.228,2021


In [14]:
df[df['name'] == "Angel"]

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
13315,0.304,['Aerosmith'],0.239,308907,0.819,0,3kfXUGIdBNpyr2gBvU3Guj,0.0,3,0.0855,-3.583,1,Angel,62,1987-01-01,0.0393,171.449,0.439,1987
15341,0.949,['Sarah McLachlan'],0.532,270400,0.0744,0,3xZMPZQYETEn4hjor3TR1A,1.2e-05,1,0.106,-16.092,1,Angel,68,1997-08-29,0.0355,117.131,0.125,1997
15918,0.116,"['Shaggy', 'Rayvon']",0.74,235133,0.766,0,7FDV5ELOJHCGLe52AnttEd,0.0,6,0.0406,-2.939,1,Angel,72,2000-08-08,0.178,170.531,0.807,2000
22242,0.185,['Why Not'],0.789,286853,0.366,0,4DFcVahVfx8Xv8u0vmgHjW,0.0,2,0.0986,-10.291,1,Angel,0,1934,0.0358,99.989,0.526,1934
32318,0.296,['Madonna'],0.752,236133,0.593,0,7ov1nZ2QZc3LIuCXXERZP0,0.000196,7,0.168,-13.673,1,Angel,46,1984-11-12,0.0367,133.266,0.834,1984
33863,0.295,['Jon Secada'],0.485,275040,0.477,0,543d68hZ4dzVePFfdVc0Ki,3.4e-05,0,0.202,-8.948,1,Angel,60,1992-01-01,0.0262,151.704,0.134,1992
34930,0.0157,['Massive Attack'],0.714,379533,0.309,0,7uv632EkfwYhXoqf8rhYrg,0.807,7,0.0777,-10.796,1,Angel,62,1998-01-01,0.0291,107.346,0.0671,1998
35672,0.428,['Amanda Perez'],0.631,217627,0.496,0,7DSfv2r0Km90N97qlRTxou,1e-06,9,0.107,-8.279,0,Angel,51,2002-01-01,0.0598,143.735,0.365,2002
38126,0.00251,['Theory of a Deadman'],0.443,202213,0.844,0,2qJkesdHu9sMMVFgkRkqhQ,0.0,11,0.085,-5.414,0,Angel,66,2014-07-08,0.0425,76.001,0.452,2014
48559,0.229,['Jimi Hendrix'],0.386,265813,0.536,0,4m6KlFD4eMiQFzh4IVVFVm,0.0123,3,0.11,-14.151,1,Angel,42,1971,0.0546,133.451,0.31,1971


In the cells above, we check for repeated songs with different IDs. We have 37376 duplicated songs, some of which have multiple versions (eg. White Christmas, Angel, etc.) by different artists. Although they have the same titles, we do note several interesting observations about these songs. This will be explored further in the EDA section. 

In [15]:
df['name'].value_counts().head(25)

White Christmas                             103
Winter Wonderland                            88
Silent Night                                 81
Jingle Bells                                 71
2000 Years                                   65
Sleigh Ride                                  54
Summertime                                   53
The Christmas Song                           51
Silver Bells                                 51
Overture                                     49
Happy New Year                               49
O Holy Night                                 48
99 Year Blues                                45
It's the Most Wonderful Time of the Year     39
Ave Maria                                    39
My Only Wish (This Year)                     38
Have Yourself a Merry Little Christmas       36
7 Years                                      34
Blue Christmas                               34
Have Yourself A Merry Little Christmas       34
Stay                                    


In the output above, we note that the song "Have Yourself a Merry Little Christmas" appears twice because of the differing capitalisation of the letter 'a'. We therefore rename the song so that both spellings are counted together.

In [16]:
df.loc[df['name'] == "Have Yourself A Merry Little Christmas", 'name'] = "Have Yourself a Merry Little Christmas"
len(df.loc[df['name'] == "Have Yourself a Merry Little Christmas"])

70
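The capitalisation issue above was spotted by eye. A possible way to search for other titles that differ only in letter case is sketched below on a toy series. The variable names and toy titles are illustrative.

```python
import pandas as pd

names = pd.Series([
    "Have Yourself A Merry Little Christmas",
    "Have Yourself a Merry Little Christmas",
    "White Christmas",
    "White Christmas",
    "Angel",
])

# Keep each distinct spelling once, then count spellings per lower-cased title
unique_names = names.drop_duplicates()
lower_counts = unique_names.str.lower().value_counts()

# Lower-cased titles that occur under more than one spelling
case_variants = lower_counts[lower_counts > 1].index.tolist()

# The actual spellings involved
spellings = unique_names[unique_names.str.lower().isin(case_variants)].tolist()
print(spellings)
```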

It's noted that some songs are repeated, despite having different IDs. Given that some artists may release multiple versions of the same song (eg. acoustic remix), we will only drop duplicates whose musical features are completely identical. 

In [17]:
df[df.duplicated(['artists', 'name', 'acousticness', 'danceability','energy', 'instrumentalness', 'liveness', 'speechiness', 'valence','year'])]

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
86,0.995,['Amalia Vaka'],0.287,216720,0.6210,0,2XVujD5mP2bMSXLkpqeWKU,0.7840,3,0.1510,-10.843,0,Argilamas kai psarades,0,1920-01-01,0.0394,77.537,0.559,1920
88,0.932,['Marika Papagkika'],0.382,231400,0.4920,0,2ZZ4rkQvNDkQt86sdwd398,0.0137,1,0.1570,-8.321,1,Sti filaki me valane,0,1920-01-01,0.0345,88.525,0.640,1920
126,0.895,['Xaralampos Kritikos'],0.474,227320,0.4440,0,3kjaYn5CnAM2v00hkJY7Wb,0.0000,6,0.1660,-9.783,1,To ouest,0,1920-01-01,0.0378,80.317,0.920,1920
129,0.993,['Takis Nikolaou'],0.453,255520,0.4100,0,3vTys7kWK733lvNmrcbIiP,0.1600,2,0.2920,-10.469,1,Mparmpaouzos,0,1920-01-01,0.0440,67.271,0.894,1920
143,0.972,['Giorgos Katsaros'],0.528,277720,0.3250,0,4HsM025MRTDfhvpPPOSqaN,0.0234,11,0.1020,-12.508,1,Oli mera paizei zaria,0,1920-01-01,0.0421,118.562,0.146,1920
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174373,0.966,"['Ludovico Einaudi', 'Johannes Bornlöf']",0.269,424200,0.0926,0,4yBReaKJW78ZYCHpc1cfaK,0.8900,9,0.0992,-24.280,0,Divenire,0,2021-01-23,0.0609,120.323,0.102,2021
174379,0.795,['Alessia Cara'],0.429,144720,0.2110,0,45XnLMuqf3vRfskEAMUeCH,0.0000,4,0.1960,-11.665,1,A Little More,0,2021-01-22,0.0360,94.710,0.228,2021
174381,0.795,['Alessia Cara'],0.429,144720,0.2110,0,4pPFI9jsguIh3wC7Otoyy8,0.0000,4,0.1960,-11.665,1,A Little More,0,2021-01-22,0.0360,94.710,0.228,2021
174383,0.795,['Alessia Cara'],0.429,144720,0.2110,0,52YtxLVUyvtiGPxwwxayHZ,0.0000,4,0.1960,-11.665,1,A Little More,0,2021-01-22,0.0360,94.710,0.228,2021


In [18]:
df.drop_duplicates(['artists', 'name', 'acousticness', 'danceability','energy', 'instrumentalness', 'liveness', 'speechiness', 'valence','year'], keep='first', inplace=True)

In [19]:
df.shape

(169252, 19)

#### Artists to Drop (Audiobooks) 
Let's see how many unique artists we have. 

In [20]:
df['artists'].nunique()

36195

With 36195 unique artists in a dataset of almost 170,000 songs, it would be interesting to see which artists have the largest number of tracks.

In [21]:
# Top 50 artists by number of tracks 
df.groupby('artists').count().sort_values('id', ascending=False).head(50)

Unnamed: 0_level_0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
artists,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
['Tadeusz Dolega Mostowicz'],1281,1281,1281,1281,1281,1281,1281,1281,1281,1281,1281,1281,1281,1281,1281,1281,1281,1281
['Эрнест Хемингуэй'],1175,1175,1175,1175,1175,1175,1175,1175,1175,1175,1175,1175,1175,1175,1175,1175,1175,1175
['Эрих Мария Ремарк'],1062,1062,1062,1062,1062,1062,1062,1062,1062,1062,1062,1062,1062,1062,1062,1062,1062,1062
['Francisco Canaro'],951,951,951,951,951,951,951,951,951,951,951,951,951,951,951,951,951,951
['Ignacio Corsini'],624,624,624,624,624,624,624,624,624,624,624,624,624,624,624,624,624,624
['Frank Sinatra'],621,621,621,621,621,621,621,621,621,621,621,621,621,621,621,621,621,621
['Elvis Presley'],494,494,494,494,494,494,494,494,494,494,494,494,494,494,494,494,494,494
['Bob Dylan'],459,459,459,459,459,459,459,459,459,459,459,459,459,459,459,459,459,459
"['Francisco Canaro', 'Charlo']",456,456,456,456,456,456,456,456,456,456,456,456,456,456,456,456,456,456
['Johnny Cash'],455,455,455,455,455,455,455,455,455,455,455,455,455,455,455,455,455,455
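Since `groupby('artists').count()` repeats the same count in every column, a more compact view of the same ranking can be obtained with `value_counts()`. The sketch below uses a toy frame, and the artist labels are placeholders.

```python
import pandas as pd

# Toy frame with the artists stored as list-like strings, as in the dataset
toy = pd.DataFrame({
    "artists": ["['A']", "['B']", "['A']", "['C']", "['A']", "['B']"],
    "id": ["1", "2", "3", "4", "5", "6"],
})

# One count per artist, already sorted from most to fewest tracks
track_counts = toy["artists"].value_counts()

top_two = track_counts.head(2)      # analogous to the top 50 above
next_one = track_counts.iloc[2:3]   # analogous to slicing the next block of artists
print(top_two)
```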


Upon closer inspection, the top 3 artists with the largest number of "tracks" are actually book authors who have their audiobooks listed on Spotify. Since we are building a popularity predictor and music recommender, this data is irrelevant to us. Listed below are the 7 artists (including these 3) whose audiobooks have made it to our top 50 list.

- Tadeusz Dolega Mostowicz: Polish writer, journalist and author.
- Эрнест Хемингуэй is actually <a href="https://www.britannica.com/biography/Ernest-Hemingway">Ernest Hemingway</a>, an American author. These are his audiobooks. 
- Эрих Мария Ремарк is actually <a href="https://www.britannica.com/biography/Erich-Maria-Remarque">Erich Maria Remarque</a>, a German novelist. These are his audiobooks.
- Georgette Heyer, Irina Salkow: <a href="https://vimeo.com/527679668">Audiobooks</a> by Georgette Heyer, narrated by Irina Salkow. 
- Georgette Heyer, Brigitte Carlsen: <a href="https://vimeo.com/527679386">Audiobooks</a> by Georgette Heyer, narrated by Irina Salkow and Brigitte Carlsen 
- Sinclair Lewis, Frank Arnold: <a href="https://open.spotify.com/track/5AAgVY7r7EZd7Mgr9cV3m2">Audiobooks</a> by Sinclair Lewis, translated into German by Frank Arnold
- H.P. Lovecraft: <a href="https://www.britannica.com/biography/H-P-Lovecraft">Audiobooks</a> by the American writer of weird and horror fiction.



- 'Unspecified' is actually the name of a <a href="https://www.youtube.com/watch?v=MPDIgg5odT8">Bengali band</a>. Here is one of their songs. We will not be dropping this. 

Because these audiobooks are contributing significantly to the track count, the next 150 artists will be examined as well. 

In [22]:
# Top 51st-100th artists with most number of tracks 
df.groupby('artists').count().sort_values('id', ascending=False)[50:100]

Unnamed: 0_level_0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
artists,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
['Duke Ellington'],193,193,193,193,193,193,193,193,193,193,193,193,193,193,193,193,193,193
['The Kinks'],190,190,190,190,190,190,190,190,190,190,190,190,190,190,190,190,190,190
['Armin van Buuren'],188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188,188
['John Coltrane'],187,187,187,187,187,187,187,187,187,187,187,187,187,187,187,187,187,187
['Taylor Swift'],185,185,185,185,185,185,185,185,185,185,185,185,185,185,185,185,185,185
['KISS'],184,184,184,184,184,184,184,184,184,184,184,184,184,184,184,184,184,184
['Shamshad Begum'],183,183,183,183,183,183,183,183,183,183,183,183,183,183,183,183,183,183
['Count Basie'],183,183,183,183,183,183,183,183,183,183,183,183,183,183,183,183,183,183
"['Billie Holiday', 'Teddy Wilson']",181,181,181,181,181,181,181,181,181,181,181,181,181,181,181,181,181,181
['Metallica'],181,181,181,181,181,181,181,181,181,181,181,181,181,181,181,181,181,181


- Трумен Капоте: <a href="https://www.britannica.com/biography/Truman-Capote">Audiobooks</a> by the American novelist Truman Capote.
- Arthur Conan Doyle: <a href="https://www.arthurconandoyle.com/biography.html">Audiobooks</a> by the British novelist and creator of the famous character Sherlock Holmes.

In [23]:
# Top 101st-150th artists with most number of tracks
df.groupby('artists').count().sort_values('id', ascending=False)[100:150]

Unnamed: 0_level_0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
artists,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
['Gerry & The Pacemakers'],131,131,131,131,131,131,131,131,131,131,131,131,131,131,131,131,131,131
['Chet Baker'],131,131,131,131,131,131,131,131,131,131,131,131,131,131,131,131,131,131
['James Taylor'],129,129,129,129,129,129,129,129,129,129,129,129,129,129,129,129,129,129
['Van Morrison'],128,128,128,128,128,128,128,128,128,128,128,128,128,128,128,128,128,128
['Doris Day'],127,127,127,127,127,127,127,127,127,127,127,127,127,127,127,127,127,127
['Judy Garland'],127,127,127,127,127,127,127,127,127,127,127,127,127,127,127,127,127,127
['Neil Diamond'],126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126
"['Dale Carnegie', 'Till Hagen', 'Stefan Kaminski']",125,125,125,125,125,125,125,125,125,125,125,125,125,125,125,125,125,125
['Juan Gabriel'],125,125,125,125,125,125,125,125,125,125,125,125,125,125,125,125,125,125
['Wings'],125,125,125,125,125,125,125,125,125,125,125,125,125,125,125,125,125,125


- Dale Carnegie, Till Hagen, Stefan Kaminski: <a href="https://www.amazon.com/Wie-man-Freunde-gewinnt-audiobook/dp/B017XETBNC">Audiobook</a> of Dale Carnegie's "How to win friends and influence people" in German. 

In [24]:
# Top 151st-200th artists with most number of tracks
df.groupby('artists').count().sort_values('id', ascending=False)[150:200]

Unnamed: 0_level_0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
artists,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
"['Ernst H. Gombrich', 'Christoph Waltz']",106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106
['Stevie Ray Vaughan'],106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106
['Luis Miguel'],106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106
['Simon & Garfunkel'],105,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105
['Johnny Mathis'],105,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105
['Green Day'],104,104,104,104,104,104,104,104,104,104,104,104,104,104,104,104,104,104
['ZZ Top'],104,104,104,104,104,104,104,104,104,104,104,104,104,104,104,104,104,104
['Robin Trower'],104,104,104,104,104,104,104,104,104,104,104,104,104,104,104,104,104,104
['Yes'],104,104,104,104,104,104,104,104,104,104,104,104,104,104,104,104,104,104
['The Isley Brothers'],103,103,103,103,103,103,103,103,103,103,103,103,103,103,103,103,103,103


- Ernst H. Gombrich, Christoph Waltz: <a href="https://www.amazon.com/Education-Reference-Ernst-H-Gombrich/s?rh=n%3A3344092011%2Cp_27%3AErnst+H.+Gombrich">Audiobooks</a> by Ernst H. Gombrich, Christoph Waltz in German 
- Zofia Dromlewiczowa: <a href="https://www.imdb.com/name/nm2396458/">Audiobooks</a> by the Polish writer Zofia Dromlewiczowa.

In [25]:
top_audiobook_artists = ["['Tadeusz Dolega Mostowicz']", "['Эрнест Хемингуэй']", "['Эрих Мария Ремарк']",
                         "['Georgette Heyer', 'Irina Salkow']", "['Georgette Heyer', 'Brigitte Carlsen']", 
                         "['Sinclair Lewis', 'Frank Arnold']", "['H.P. Lovecraft']", "['Трумен Капоте']", 
                         "['Arthur Conan Doyle']", "['Dale Carnegie', 'Till Hagen', 'Stefan Kaminski']",
                         "['Ernst H. Gombrich', 'Christoph Waltz']", "['Zofia Dromlewiczowa']"
                        ]
length_dropped = 0
for artist in top_audiobook_artists:
    # Index of all rows whose 'artists' string matches this entry exactly
    artist_idx = df[df['artists'] == artist].index
    length_dropped += len(artist_idx)
    df.drop(artist_idx, inplace=True)
print(f'No. of tracks of audiobook artists in top 200 artists dropped: {length_dropped}')

No. of tracks of audiobook artists in top 200 artists dropped: 5260
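
The loop above filters one artist at a time. As a minimal sketch on toy data (the rows and the `audiobook_artists` shortlist below are made up for illustration and are not the real dataset), the same kind of removal can be expressed as a single boolean mask with `Series.isin`:

```python
import pandas as pd

# Toy stand-in for the tracks dataframe (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    "artists": ["['H.P. Lovecraft']", "['Green Day']", "['Arthur Conan Doyle']", "['Yes']"],
    "name": ["Chapter 1", "Basket Case", "Chapter 2", "Roundabout"],
})

# Hypothetical shortlist of audiobook "artists" to remove
audiobook_artists = ["['H.P. Lovecraft']", "['Arthur Conan Doyle']"]

# True for every row whose artists string appears on the shortlist
is_audiobook = toy["artists"].isin(audiobook_artists)
n_dropped = int(is_audiobook.sum())

# Keep only the rows that are not flagged as audiobooks
toy_music = toy[~is_audiobook]
print(f"Dropped {n_dropped} audiobook tracks, {len(toy_music)} tracks remain")
```

Building the mask first makes it easy to count the affected rows before anything is removed, and it avoids filtering the dataframe twice per artist.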


In [26]:
df.shape

(163992, 19)

However, there are likely more audiobooks or podcasts hidden in the dataset. A distinguishing feature of these tracks on Spotify is their high "speechiness" value, which indicates that the track consists mostly of spoken words rather than music. For the sake of completeness, we'll also remove tracks with a speechiness level above 0.9.

In [27]:
df.sort_values('speechiness', ascending=False).head(20)

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
162011,0.379,"['Harper Lee', 'Eva Mattes']",0.603,361735,0.214,0,4ucGpd15TTUybXsoS5mH9Q,0.0,1,0.536,-20.355,0,"Wer die Nachtigall stört ..., Kapitel 1",12,1960,0.971,68.686,0.28,1960
25872,0.831,"['Ernest Hemingway', 'Christian Brückner']",0.606,613240,0.31,0,5c1iGqmxzXvDhiDYO1MU7Q,7e-06,11,0.5,-17.651,0,Kapitel 15 - Der alte Mann und das Meer - Erzä...,10,1952,0.97,76.283,0.437,1952
25873,0.837,"['Ernest Hemingway', 'Christian Brückner']",0.595,513600,0.305,0,5nnUR41x2t9KvRDKumNcKC,1.7e-05,11,0.201,-17.821,0,Kapitel 16 - Der alte Mann und das Meer - Erzä...,10,1952,0.97,64.964,0.388,1952
145240,0.82,"['Ernest Hemingway', 'Christian Brückner']",0.608,510133,0.261,0,2UAgXacpKvUlpTbtTyeD4G,8e-06,11,0.295,-19.298,0,Kapitel 5 - Der alte Mann und das Meer - Erzäh...,12,1952,0.968,84.616,0.378,1952
145162,0.73,"['Ernest Hemingway', 'Christian Brückner']",0.611,476213,0.28,0,2LAPdENXKnvaKUT11qoLAF,1.7e-05,11,0.649,-19.092,0,Kapitel 20 - Der alte Mann und das Meer - Erzä...,9,1952,0.968,170.235,0.309,1952
387,0.7,['Fernando Pessoa'],0.717,99800,0.202,0,1KxCqUwjiAHw7W0IIrRUwP,0.0,5,0.294,-22.807,1,Capítulo 2.20 - Banquero Anarquista,0,1922-06-01,0.967,81.64,0.683,1922
418,0.73,['Fernando Pessoa'],0.619,106900,0.173,0,3FfL0WRdyrEFPlPvmqj2lJ,0.0,0,0.145,-23.13,1,Capítulo 1.9 - Banquero Anarquista,0,1922-06-01,0.967,181.485,0.653,1922
60606,0.799,['Alcoholics Anonymous'],0.643,160396,0.249,0,5itzRJSnhTTIYhlcakSh9f,0.0,11,0.182,-18.347,1,An Alcoholic's Wife,0,1939-01-01,0.967,87.481,0.659,1939
60585,0.82,['Alcoholics Anonymous'],0.642,1120581,0.233,0,5WVTv9uDjriC3QJBf9UfTX,0.0,0,0.226,-18.47,1,He Thought He Could Drink Like A Gentleman,0,1939-01-01,0.966,87.248,0.644,1939
451,0.865,['Fernando Pessoa'],0.661,117900,0.225,0,5ZQdHBQwtS0iKSyRmFg7hB,0.0,2,0.239,-23.33,0,Capítulo 1.26 & Capítulo 2.1 - Banquero Anarqu...,0,1922-06-01,0.966,86.89,0.692,1922


In [28]:
df[df['speechiness'] > 0.9]

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
357,0.797,['Fernando Pessoa'],0.695,98200,0.263,0,021ht4sdgPcrDgSk7JTbKY,0.000000,0,0.148,-22.136,1,Capítulo 2.16 - Banquero Anarquista,0,1922-06-01,0.957,102.009,0.6550,1922
364,0.794,['Fernando Pessoa'],0.676,99100,0.235,0,0OYGe21oScKJfanLyM7daU,0.000000,11,0.210,-22.447,0,Capítulo 2.8 - Banquero Anarquista,0,1922-06-01,0.960,96.777,0.7240,1922
365,0.578,['Fernando Pessoa'],0.750,132700,0.229,0,0PE42H6tslQuyMMiGRiqtb,0.000000,2,0.314,-22.077,1,Capítulo 2.25 - Banquero Anarquista,0,1922-06-01,0.955,102.629,0.5310,1922
369,0.754,['Fernando Pessoa'],0.687,96600,0.198,0,0cC9CYjLRIzwchQ42xVnq6,0.000000,4,0.197,-24.264,0,Capítulo 1.23 - Banquero Anarquista,0,1922-06-01,0.962,78.453,0.4780,1922
370,0.670,['Fernando Pessoa'],0.800,103200,0.171,0,0eb1PfHxT6HnXvvdUOzmME,0.000000,8,0.123,-24.384,1,Capítulo 1.18 - Banquero Anarquista,0,1922-06-01,0.953,59.613,0.6930,1922
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173082,0.903,"['TandMProductionCo', 'TandMMusic', 'TandMTV']",0.855,81894,0.159,0,7dB5NaMjhRecZfkFmudlrS,0.152000,10,0.113,-14.197,0,"Frederick's Bass Tester, Year 5, Track #99",0,2014,0.908,149.961,0.5130,2014
173160,0.590,"['TandMProductionCo', 'TandMMusic', 'TandMTV']",0.882,437238,0.322,0,3380X865BgjcAf3mHxGNVR,0.059700,1,0.114,-13.982,1,"Frederick's Bass Tester, Year 5, Track #77",0,2014,0.930,112.001,0.0757,2014
173164,0.268,"['TandMProductionCo', 'TandMMusic', 'TandMTV']",0.732,207569,0.470,0,4LCgaEvTM4V7JYW3HndSQB,0.000262,1,0.177,-11.139,1,"Frederick's Bass Tester, Year 5, Track #81",0,2014,0.941,140.025,0.3520,2014
173172,0.742,"['TandMProductionCo', 'TandMMusic', 'TandMTV']",0.967,421251,0.127,0,2WXwnKkOJkQlrl9xH0dEtd,0.045400,1,0.164,-14.772,1,"Frederick's Bass Tester, Year 5, Track #56",0,2014,0.905,130.018,0.1520,2014


Similarly, most of the tracks here are audiobooks.

TandMProductionCo, TandMMusic and TandMTV belong to a company called Tech and Media Production Company. Their tracks are meant for musicians to test their bass systems and are therefore not relevant to our song recommender. The rest of the tracks by this artist will also be dropped.

In [29]:
df.drop(df[df['speechiness'] > 0.9].index, inplace=True)
print(df.shape)
df.drop(df[df['artists'] == "['TandMProductionCo', 'TandMMusic', 'TandMTV']"].index, inplace=True)
print(df.shape)

(163125, 19)
(162981, 19)
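
The two drops above can also be combined into one mask. The sketch below runs on a made-up four-row frame (the `toy` rows, the constant names and the 0.40 speechiness value are illustrative assumptions), flagging a row if it is either mostly spoken word or credited to the bass-test artist:

```python
import pandas as pd

# Toy rows for illustration; not the real dataset
toy = pd.DataFrame({
    "artists": [
        "['Fernando Pessoa']",
        "['TandMProductionCo', 'TandMMusic', 'TandMTV']",
        "['Eagles']",
        "['Eminem', 'Dido']",
    ],
    "speechiness": [0.957, 0.40, 0.027, 0.238],
})

SPEECH_THRESHOLD = 0.9  # cut-off used in this section
TEST_TONE_ARTIST = "['TandMProductionCo', 'TandMMusic', 'TandMTV']"

# Condition 1: track is almost entirely spoken word
too_spoken = toy["speechiness"] > SPEECH_THRESHOLD
# Condition 2: track belongs to the bass-test artist, whatever its speechiness
is_test_track = toy["artists"] == TEST_TONE_ARTIST

# Keep rows that satisfy neither condition
kept = toy[~(too_spoken | is_test_track)]
print(kept)
```

Note that the second condition is needed because some bass-test tracks may fall below the speechiness cut-off, as the hypothetical 0.40 row shows.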


In [30]:
df.describe()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,popularity,speechiness,tempo,valence,year
count,162981.0,162981.0,162981.0,162981.0,162981.0,162981.0,162981.0,162981.0,162981.0,162981.0,162981.0,162981.0,162981.0,162981.0,162981.0
mean,0.50436,0.530546,235190.7,0.489053,0.055491,0.200637,5.205576,0.209517,-11.49481,0.705941,27.006614,0.075131,117.257093,0.525628,1977.828054
std,0.38255,0.176133,146960.9,0.271277,0.228937,0.335812,3.501255,0.1809,5.509868,0.45562,21.718419,0.090214,30.207793,0.267443,25.938153
min,0.0,0.0,4937.0,0.0,0.0,0.0,0.0,0.0,-60.0,0.0,0.0,0.0,0.0,0.0,1920.0
25%,0.0872,0.408,169853.0,0.26,0.0,2e-06,2.0,0.0984,-14.435,0.0,4.0,0.0348,94.236,0.307,1957.0
50%,0.531,0.539,207806.0,0.476,0.0,0.000729,5.0,0.136,-10.652,1.0,27.0,0.0444,116.023,0.536,1978.0
75%,0.904,0.659,267120.0,0.714,0.0,0.275,8.0,0.267,-7.444,1.0,43.0,0.0695,135.173,0.751,1999.0
max,0.996,0.988,5338302.0,1.0,1.0,1.0,11.0,1.0,3.855,1.0,100.0,0.9,243.507,1.0,2021.0


In [31]:
df.shape

(162981, 19)

#### Song Duration 

In [32]:
# For easier interpretation of song length -> Convert to mins 
df['duration_mins'] = df['duration_ms'].apply(lambda x: round(x/(1000*60), 2))
df['duration_mins']

0         2.81
1         2.50
2         2.73
3         7.03
4         2.75
          ... 
174369    5.82
174371    3.44
174375    5.06
174377    2.41
174387    4.07
Name: duration_mins, Length: 162981, dtype: float64

In [33]:
df.drop('duration_ms', axis=1, inplace=True)

In [34]:
df[df['duration_mins'] > 6].sort_values('popularity', ascending=False).head(10)

Unnamed: 0,acousticness,artists,danceability,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year,duration_mins
11062,0.00574,['Eagles'],0.579,0.508,0,40riOy7x9W7GXjyGp4pjAv,0.000494,2,0.0575,-9.484,1,Hotel California - 2013 Remaster,83,1976-12-08,0.027,147.125,0.609,1976,6.52
126687,0.104,"['Jay Wheeler', 'Myke Towers', 'Rauw Alejandro...",0.841,0.674,1,7LE1K0lZyRFZjSlrnm7lfD,0.0,2,0.131,-5.35,1,"La Curiosidad (feat. Dj Nelson, Jhay Cortez, L...",83,2020-12-17,0.0962,90.007,0.761,2020,6.38
18492,0.234,['Justin Timberlake'],0.574,0.512,0,4rHZZAmHpZrA3iH5zx8frV,0.0,5,0.0946,-6.664,0,Mirrors,82,2013-03-15,0.0503,76.899,0.512,2013,8.07
14070,0.0163,"[""Guns N' Roses""]",0.294,0.641,0,3YRCqOhFifThpSRFJ1VWFM,0.22,11,0.112,-9.316,1,November Rain,79,1991-09-17,0.0291,79.759,0.226,1991,8.93
39122,0.000532,['Harry Styles'],0.535,0.521,0,6SQLk9HSNketfgs2AyIiMs,0.371,0,0.19,-5.942,1,She,79,2019-12-13,0.0272,140.026,0.457,2019,6.04
10064,0.58,['Led Zeppelin'],0.338,0.34,0,5CQ30WqJwcep0pYcV4AMNc,0.0032,9,0.116,-12.049,0,Stairway to Heaven - Remaster,78,1971-11-08,0.0339,82.433,0.197,1971,8.05
39106,0.172,['Harry Styles'],0.306,0.347,0,6VzcQuzTNTMFnJ6rBSaLH9,0.00013,2,0.0485,-8.5,1,Fine Line,78,2019-12-13,0.0334,120.996,0.0511,2019,6.3
15874,0.0371,"['Eminem', 'Dido']",0.78,0.768,1,3UmaczJpikHgJFyBTAJVoz,2e-06,6,0.518,-4.325,0,Stan,78,2000-05-23,0.238,80.063,0.507,2000,6.74
10062,0.382,['Elton John'],0.414,0.428,0,2TVxnKdb3tqe1nhQWwwZCO,0.000243,0,0.148,-11.097,1,Tiny Dancer,78,1971-11-05,0.0278,145.075,0.282,1971,6.28
38968,0.542,"['Nio Garcia', 'Casper Magico', 'Bad Bunny', '...",0.903,0.675,1,3V8UKqhEK5zBkBb6d6ub8i,1.3e-05,11,0.0595,-3.445,0,Te Boté - Remix,77,2018-04-13,0.214,96.507,0.442,2018,6.97


In [46]:
df.groupby('year').sum()['popularity'].tail(30)

year
1992    84269
1993    83988
1994    87745
1995    88233
1996    88753
1997    89870
1998    90894
1999    92921
2000    53695
2001    49639
2002    93201
2003    56713
2004    50239
2005    55246
2006    57384
2007    56512
2008    58404
2009    56351
2010    57173
2011    58577
2012    59597
2013    62201
2014    64013
2015    66713
2016    67857
2017    67735
2018    73443
2019    74248
2020    97144
2021    12429
Name: popularity, dtype: int64

We don't want songs that are too long, as most popular songs nowadays are shorter than <a href="https://www.vox.com/2014/8/18/6003271/why-are-songs-3-minutes-long">5 minutes</a>. However, we do recognise that some famous older songs are long (e.g. Hotel California at 6.52 mins, November Rain at 8.93 mins). Using November Rain as the reference because of its immense popularity, we will drop songs longer than 8.93 minutes.

In [36]:
df.drop(df[df['duration_mins'] > 8.93].index, inplace=True)
df.shape

(159334, 19)
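
As a small worked example of the conversion and the cut-off, the sketch below uses three toy rows. The millisecond values are assumptions back-computed from the minute values quoted above, and the third row is hypothetical. It also divides the whole column at once instead of calling `apply` row by row:

```python
import pandas as pd

# Toy rows; duration_ms values are illustrative, not read from the dataset
toy = pd.DataFrame({
    "name": ["Hotel California - 2013 Remaster", "November Rain", "Hypothetical long track"],
    "duration_ms": [391200, 535800, 1120581],
})

MAX_MINS = 8.93  # length of November Rain, used as the reference cut-off

# 1 minute = 60,000 ms; vectorised division followed by rounding to 2 decimals
toy["duration_mins"] = (toy["duration_ms"] / 60_000).round(2)

# Tracks at or below the cut-off are kept, so the reference song itself survives
kept = toy[toy["duration_mins"] <= MAX_MINS]
print(kept[["name", "duration_mins"]])
```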

#### Convert Popularity to Scale 0-1

In [37]:
df['popularity'] = df['popularity'].apply(lambda x: x/100)
df['popularity']

0         0.12
1         0.07
2         0.04
3         0.17
4         0.02
          ... 
174369    0.00
174371    0.00
174375    0.00
174377    0.00
174387    0.69
Name: popularity, Length: 159334, dtype: float64

In [38]:
df.head()

Unnamed: 0,acousticness,artists,danceability,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year,duration_mins
0,0.991,['Mamie Smith'],0.598,0.224,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.379,-12.628,0,Keep A Song In Your Soul,0.12,1920,0.0936,149.976,0.634,1920,2.81
1,0.643,"[""Screamin' Jay Hawkins""]",0.852,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,0.0264,5,0.0809,-7.261,0,I Put A Spell On You,0.07,1920-01-05,0.0534,86.889,0.95,1920,2.5
2,0.993,['Mamie Smith'],0.647,0.186,0,11m7laMUgmOKqI3oYzuhne,1.8e-05,0,0.519,-12.098,1,Golfing Papa,0.04,1920,0.174,97.6,0.689,1920,2.73
3,0.000173,['Oscar Velazquez'],0.73,0.798,0,19Lc5SfJJ5O1oaxY0fpwfh,0.801,2,0.128,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,0.17,1920-01-01,0.0425,127.997,0.0422,1920,7.03
4,0.295,['Mixe'],0.704,0.707,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.402,-6.036,0,Xuniverxe,0.02,1920-10-01,0.0768,122.076,0.299,1920,2.75


#### Datetime Index

As the 'release_date' feature is inconsistent (i.e. either in YYYY or YYYY-MM-DD format), it will be converted to datetime and set as the index of the dataframe. The data collected spans about 100 years, so we will use the year attribute of the datetime index in our EDA section.

We are also given the 'year' feature of each song, which will be used in our classification modelling later on.

In [39]:
df['release_date'] = pd.DatetimeIndex(df['release_date'])
df.set_index(['release_date'], inplace=True)
df.head()

Unnamed: 0_level_0,acousticness,artists,danceability,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,speechiness,tempo,valence,year,duration_mins
release_date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1920-01-01,0.991,['Mamie Smith'],0.598,0.224,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.379,-12.628,0,Keep A Song In Your Soul,0.12,0.0936,149.976,0.634,1920,2.81
1920-01-05,0.643,"[""Screamin' Jay Hawkins""]",0.852,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,0.0264,5,0.0809,-7.261,0,I Put A Spell On You,0.07,0.0534,86.889,0.95,1920,2.5
1920-01-01,0.993,['Mamie Smith'],0.647,0.186,0,11m7laMUgmOKqI3oYzuhne,1.8e-05,0,0.519,-12.098,1,Golfing Papa,0.04,0.174,97.6,0.689,1920,2.73
1920-01-01,0.000173,['Oscar Velazquez'],0.73,0.798,0,19Lc5SfJJ5O1oaxY0fpwfh,0.801,2,0.128,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,0.17,0.0425,127.997,0.0422,1920,7.03
1920-10-01,0.295,['Mixe'],0.704,0.707,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.402,-6.036,0,Xuniverxe,0.02,0.0768,122.076,0.299,1920,2.75
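
The output above shows year-only values such as `1920` becoming `1920-01-01`. As a hedged sketch of how that mixed-format handling could be made explicit, the example below pads year-only strings before parsing with a single format. The `raw` series is toy data, and the sketch assumes only the two formats mentioned above occur:

```python
import pandas as pd

# Toy release dates mixing the two formats described above
raw = pd.Series(["1920", "1920-01-05", "1920-10-01", "2021-01-23"], name="release_date")

# Year-only strings are exactly four characters long
year_only = raw.str.len() == 4

# Pad them to 1 January of that year so every value shares one format
padded = raw.mask(year_only, raw + "-01-01")

# With a uniform layout, an explicit format string can be used for parsing
dates = pd.to_datetime(padded, format="%Y-%m-%d")
print(dates.dt.year.tolist())
```

Counting `year_only` beforehand also tells us how many tracks carry only an approximate release date.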


In [40]:
df.to_csv("../datasets/data_cleaned.csv")

## 2.2. Data By Genre

In [41]:
df_genres = pd.read_csv("../datasets/data_by_genres.csv")
df_genres

Unnamed: 0,genres,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode
0,21st century classical,0.754600,0.284100,3.525932e+05,0.159580,0.484374,0.168580,-22.153400,0.062060,91.351000,0.143380,6.600000,4,1
1,432hz,0.485515,0.312000,1.047430e+06,0.391678,0.477250,0.265940,-18.131267,0.071717,118.900933,0.236483,41.200000,11,1
2,8-bit,0.028900,0.673000,1.334540e+05,0.950000,0.630000,0.069000,-7.899000,0.292000,192.816000,0.997000,0.000000,5,1
3,[],0.535793,0.546937,2.495312e+05,0.485430,0.278442,0.220970,-11.624754,0.101511,116.068980,0.486361,12.350770,7,1
4,a cappella,0.694276,0.516172,2.018391e+05,0.330533,0.036080,0.222983,-12.656547,0.083627,105.506031,0.454077,39.086248,7,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3227,zim urban groove,0.003910,0.553000,4.267200e+04,0.942000,0.961000,0.113000,-8.004000,0.039900,134.995000,0.752000,9.000000,7,1
3228,zolo,0.208648,0.533837,2.641016e+05,0.620470,0.163334,0.201430,-10.878906,0.061828,126.765194,0.576721,31.108254,9,1
3229,zouk,0.272928,0.641889,4.416418e+05,0.695778,0.257604,0.166011,-9.518889,0.050511,105.848889,0.878444,32.555556,7,1
3230,zurich indie,0.993000,0.705667,1.984173e+05,0.172667,0.468633,0.179667,-11.453333,0.348667,91.278000,0.739000,0.000000,7,0


In [42]:
df_genres.isnull().sum()

genres              0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
liveness            0
loudness            0
speechiness         0
tempo               0
valence             0
popularity          0
key                 0
mode                0
dtype: int64

In [43]:
df_genres.shape

(3232, 14)

In [44]:
df_genres[df_genres.duplicated(subset='genres')]

Unnamed: 0,genres,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode


There are no duplicated genres.

#### Convert Duration to Minutes

In [45]:
df_genres['duration_mins'] = df_genres['duration_ms'].apply(lambda x: round(x/(1000*60), 2))
df_genres['duration_mins']

0        5.88
1       17.46
2        2.22
3        4.16
4        3.36
        ...  
3227     0.71
3228     4.40
3229     7.36
3230     3.31
3231     3.07
Name: duration_mins, Length: 3232, dtype: float64

In [46]:
df_genres.drop('duration_ms', axis=1, inplace=True)

In [47]:
df_genres.head()

Unnamed: 0,genres,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,duration_mins
0,21st century classical,0.7546,0.2841,0.15958,0.484374,0.16858,-22.1534,0.06206,91.351,0.14338,6.6,4,1,5.88
1,432hz,0.485515,0.312,0.391678,0.47725,0.26594,-18.131267,0.071717,118.900933,0.236483,41.2,11,1,17.46
2,8-bit,0.0289,0.673,0.95,0.63,0.069,-7.899,0.292,192.816,0.997,0.0,5,1,2.22
3,[],0.535793,0.546937,0.48543,0.278442,0.22097,-11.624754,0.101511,116.06898,0.486361,12.35077,7,1,4.16
4,a cappella,0.694276,0.516172,0.330533,0.03608,0.222983,-12.656547,0.083627,105.506031,0.454077,39.086248,7,1,3.36


#### Convert Popularity to Scale 0-1

In [48]:
df_genres['popularity'] = df_genres['popularity'].apply(lambda x: x/100)

Saving Genre dataset

In [49]:
df_genres.to_csv("../datasets/data_by_genres_cleaned.csv", index=False)

## 3. Exploratory Data Analysis

The exploratory data analysis is covered in Notebook 2.

## 4. Preprocessing for Classification Model

In [50]:
df_cleaned = pd.read_csv("../datasets/data_cleaned.csv")
df_cleaned

Unnamed: 0,release_date,acousticness,artists,danceability,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,speechiness,tempo,valence,year,duration_mins
0,1920-01-01,0.991000,['Mamie Smith'],0.598,0.2240,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.3790,-12.628,0,Keep A Song In Your Soul,0.12,0.0936,149.976,0.6340,1920,2.81
1,1920-01-05,0.643000,"[""Screamin' Jay Hawkins""]",0.852,0.5170,0,0hbkKFIJm7Z05H8Zl9w30f,0.026400,5,0.0809,-7.261,0,I Put A Spell On You,0.07,0.0534,86.889,0.9500,1920,2.50
2,1920-01-01,0.993000,['Mamie Smith'],0.647,0.1860,0,11m7laMUgmOKqI3oYzuhne,0.000018,0,0.5190,-12.098,1,Golfing Papa,0.04,0.1740,97.600,0.6890,1920,2.73
3,1920-01-01,0.000173,['Oscar Velazquez'],0.730,0.7980,0,19Lc5SfJJ5O1oaxY0fpwfh,0.801000,2,0.1280,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,0.17,0.0425,127.997,0.0422,1920,7.03
4,1920-10-01,0.295000,['Mixe'],0.704,0.7070,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.4020,-6.036,0,Xuniverxe,0.02,0.0768,122.076,0.2990,1920,2.75
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
159329,2021-01-23,0.995000,"['Ludovico Einaudi', 'Johannes Bornlöf']",0.297,0.0287,0,2LeqqwzobL5ktfQhWA3bHh,0.908000,8,0.0995,-30.008,1,Nuvole bianche,0.00,0.0564,141.636,0.0678,2021,5.82
159330,2021-01-23,0.995000,"['Ludovico Einaudi', 'Johannes Bornlöf']",0.343,0.0165,0,3Glmyv3hbGGTgeR3FZrxJA,0.878000,9,0.0774,-30.915,0,Una Mattina,0.00,0.0455,126.970,0.1510,2021,3.44
159331,2021-01-23,0.988000,"['Ludovico Einaudi', 'Johannes Bornlöf']",0.316,0.0573,0,6QGVWUbmlePAiY5zJjfCmT,0.879000,3,0.1200,-24.121,1,Night,0.00,0.0515,81.070,0.0373,2021,5.06
159332,2021-01-22,0.795000,['Alessia Cara'],0.429,0.2110,0,3N3Wi5Un7iT8amLezSRwub,0.000000,4,0.1960,-11.665,1,A Little More,0.00,0.0360,94.710,0.2280,2021,2.41


Dropping columns release_date, artists, id, key, mode, name and explicit for modelling

In [None]:
df_modelling = df_cleaned.drop(['release_date', 'artists', 'id', 'key', 'mode', 'name', 'explicit'], axis=1)
df_modelling.head()

In [None]:
df_modelling.to_csv("../datasets/data_to_model.csv", index=False)
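
A quick sanity check on a modelling frame is to confirm that only numeric columns remain after the drop. The sketch below uses a two-row toy frame reduced to a handful of columns (the column subset and the `non_feature_cols` name are illustrative assumptions):

```python
import pandas as pd

# Toy frame with a mix of text and numeric columns, for illustration only
toy = pd.DataFrame({
    "release_date": ["1920-01-01", "1920-01-05"],
    "artists": ["['Mamie Smith']", "[\"Screamin' Jay Hawkins\"]"],
    "name": ["Keep A Song In Your Soul", "I Put A Spell On You"],
    "energy": [0.224, 0.517],
    "popularity": [0.12, 0.07],
    "year": [1920, 1920],
})

# Hypothetical list of identifier/text columns that are not model features
non_feature_cols = ["release_date", "artists", "name"]
toy_model = toy.drop(columns=non_feature_cols)

# Any column listed here would still need encoding or removal before modelling
non_numeric = toy_model.select_dtypes(exclude="number").columns.tolist()
print(f"Non-numeric columns left: {non_numeric}")
```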

## 5. Preprocessing for Recommender System

In [51]:
df_recommender = df_cleaned.drop(['release_date', 'id'], axis=1)
df_recommender.head()

Unnamed: 0,acousticness,artists,danceability,energy,explicit,instrumentalness,key,liveness,loudness,mode,name,popularity,speechiness,tempo,valence,year,duration_mins
0,0.991,['Mamie Smith'],0.598,0.224,0,0.000522,5,0.379,-12.628,0,Keep A Song In Your Soul,0.12,0.0936,149.976,0.634,1920,2.81
1,0.643,"[""Screamin' Jay Hawkins""]",0.852,0.517,0,0.0264,5,0.0809,-7.261,0,I Put A Spell On You,0.07,0.0534,86.889,0.95,1920,2.5
2,0.993,['Mamie Smith'],0.647,0.186,0,1.8e-05,0,0.519,-12.098,1,Golfing Papa,0.04,0.174,97.6,0.689,1920,2.73
3,0.000173,['Oscar Velazquez'],0.73,0.798,0,0.801,2,0.128,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,0.17,0.0425,127.997,0.0422,1920,7.03
4,0.295,['Mixe'],0.704,0.707,1,0.000246,10,0.402,-6.036,0,Xuniverxe,0.02,0.0768,122.076,0.299,1920,2.75


In [52]:
df_recommender.to_csv("../datasets/data_recommender.csv", index=False)