
# Who Do You Sound Like?
### Notebook 1: Data Collection & Cleaning
#### Adam Zucker
---

## The Problem

The advent of affordable tools for home recording over the last 20 years has made music creation more accessible than ever before. Simultaneously, social media platforms such as Instagram and Facebook have made it similarly cheap and easy to advertise yourself to a targeted audience. However, many budding musicians, and even seasoned industry veterans, don't know the best way to reach new fans; the *how*, *who*, and *where* questions always loom large when it's time for a new release.

**In this study, I intend to create a recommender system to show similarities between the music of a new artist and that of popular artists around the globe, in order to generate basic recommendations to best connect new artists with new fans. I'll base this model on data from hundreds of thousands of songs hosted on the popular streaming platform Spotify, along with metrics I'll extract from a new artist's music using a Python library called Librosa.**

## Background

Music things.

## Scope

How far I made it. What's the data?

## Contents

- **Section 1:** Package and data imports
- **Section 2:** Data cleaning and feature selection
- **Section 3:** Data exports

---
## Section 1
### Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy as sp
import seaborn as sns

import os
import IPython.display as ipd

import spotipy  # left unaliased: "sp" would overwrite the scipy alias above
import tekore as tk
import librosa as lib
import librosa.display as libd

---

**BELOW:** Importing master Spotify song dataset, sourced from [Kaggle](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data.csv). Some brief descriptions of less tangible features, as defined by [Spotify](https://developer.spotify.com/documentation/web-api/reference/):
- **Acousticness:** A confidence measure from 0.0 to 1.0 of whether the track is acoustic.
- **Danceability:** How suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. 
- **Energy:** A perceptual measure of intensity and activity, from 0.0 to 1.0. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
- **Instrumentalness:** Predicts whether a track contains vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- **Liveness:** Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
- **Loudness:** The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
- **Popularity:** The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by an algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g., the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity.
- **Speechiness:** Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g., talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- **Valence:** A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g., happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g., sad, depressed, angry).
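
The thresholds quoted above can be turned into simple labels when exploring the data. The sketch below is only an illustration: the helper name `describe_speechiness` and the label strings are my own, while the cut-offs (0.33 and 0.66 for speechiness, 0.8 for liveness) come straight from the definitions above. How the exact boundary values are treated is an assumption on my part.

```python
import pandas as pd

def describe_speechiness(value):
    """Label a speechiness score using the 0.33 / 0.66 cut-offs quoted above."""
    if value > 0.66:
        return 'mostly speech'
    if value >= 0.33:
        return 'music and speech'
    return 'mostly music'

# Toy rows; the values are copied from dataframe previews shown in this notebook
toy = pd.DataFrame({
    'name': ['Clancy Lowered the Boom', 'Danny Boy', 'Final Speech'],
    'speechiness': [0.415, 0.0354, 0.659],
    'liveness': [0.16, 0.381, 0.503],
})

toy['speech_label'] = toy['speechiness'].apply(describe_speechiness)
toy['likely_live'] = toy['liveness'] > 0.8  # Spotify's "strong likelihood" threshold
print(toy)
```

Note that a speechiness of 0.659 sits just under the 0.66 cut-off, so hard labels like these are best treated as rough guides.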

In [2]:
df = pd.read_csv('../data_raw/kg_spotify_master_data.csv')
df.head()

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0,0.0594,1921,0.982,"['Sergei Rachmaninoff', 'James Levine', 'Berli...",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...",4,1921,0.0366,80.954
1,0.963,1921,0.732,['Dennis Day'],0.819,180533,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,1921,0.415,60.936
2,0.0394,1921,0.961,['KHP Kridhamardawa Karaton Ngayogyakarta Hadi...,0.328,500062,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,-14.85,1,Gati Bali,5,1921,0.0339,110.339
3,0.165,1921,0.967,['Frank Parker'],0.275,210000,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.8e-05,5,0.381,-9.316,1,Danny Boy,3,1921,0.0354,100.109
4,0.253,1921,0.957,['Phil Regan'],0.418,166693,0.193,0,4d6HGyGT8e121BsdKmw9v6,2e-06,3,0.229,-10.096,1,When Irish Eyes Are Smiling,2,1921,0.038,101.665


In [3]:
df.tail()

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
170648,0.608,2020,0.0846,"['Anuel AA', 'Daddy Yankee', 'KAROL G', 'Ozuna...",0.786,301714,0.808,0,0KkIkfsLEJbrcIhYsCL7L5,0.000289,7,0.0822,-3.702,1,China,72,2020-05-29,0.0881,105.029
170649,0.734,2020,0.206,['Ashnikko'],0.717,150654,0.753,0,0OStKKAuXlxA0fMH54Qs6E,0.0,7,0.101,-6.02,1,Halloweenie III: Seven Days,68,2020-10-23,0.0605,137.936
170650,0.637,2020,0.101,['MAMAMOO'],0.634,211280,0.858,0,4BZXVFYCb76Q0Klojq4piV,9e-06,4,0.258,-2.226,0,AYA,76,2020-11-03,0.0809,91.688
170651,0.195,2020,0.00998,['Eminem'],0.671,337147,0.623,1,5SiZJoLXp3WOl3J4C8IK0d,8e-06,2,0.643,-7.161,1,Darkness,70,2020-01-17,0.308,75.055
170652,0.642,2020,0.132,"['KEVVO', 'J Balvin']",0.856,189507,0.721,1,7HmnJHfs0BkFzX4x8j0hkl,0.00471,7,0.182,-4.928,1,Billetes Azules (with J Balvin),74,2020-10-16,0.108,94.991


In [4]:
df.shape

(170653, 19)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170653 entries, 0 to 170652
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   valence           170653 non-null  float64
 1   year              170653 non-null  int64  
 2   acousticness      170653 non-null  float64
 3   artists           170653 non-null  object 
 4   danceability      170653 non-null  float64
 5   duration_ms       170653 non-null  int64  
 6   energy            170653 non-null  float64
 7   explicit          170653 non-null  int64  
 8   id                170653 non-null  object 
 9   instrumentalness  170653 non-null  float64
 10  key               170653 non-null  int64  
 11  liveness          170653 non-null  float64
 12  loudness          170653 non-null  float64
 13  mode              170653 non-null  int64  
 14  name              170653 non-null  object 
 15  popularity        170653 non-null  int64  
 16  release_date      17

In [6]:
# Dropping the "explicit" column, as this won't be relevant for audio analysis
df.drop(columns='explicit', inplace=True)

In [7]:
# According to Spotify, "liveness" predicts whether or not the track was recorded as part of a live show.
# Spotify's algorithm looks for audience noise in the track to make these predictions. A value > 0.8 suggests the track is a live recording.
# Here, isolating and dropping assumed "live" recordings, as their sonic qualities usually differ quite a bit from studio-recorded music.
df = df[(df['liveness'] <= 0.8)]

In [8]:
df.shape

(167244, 18)

In [9]:
df.columns

Index(['valence', 'year', 'acousticness', 'artists', 'danceability',
       'duration_ms', 'energy', 'id', 'instrumentalness', 'key', 'liveness',
       'loudness', 'mode', 'name', 'popularity', 'release_date', 'speechiness',
       'tempo'],
      dtype='object')

**BELOW:** Some songs have duplicate entries in the dataframe. Duplicates can occur legitimately for several reasons: cover versions by other artists, *single* and *album* versions of the same song, different songs with the same name, and (often with orchestral music) different recordings of the same piece by different ensembles.

However, I want to be sure not to include any duplicate songs that actually have identical data. Below, I'm using a simple `drop_duplicates()` call that compares every column except the song's `id` (which is always unique on Spotify). I'll also investigate the other potentially problematic duplicate case, in which there are album and single versions of a song that are very similar, but don't have identical metrics. This can happen because of the way albums are formatted versus the way singles are formatted.

In [10]:
# Defining the subset of columns I'll pass into drop_duplicates()
no_id = ['name', 'artists', 'tempo', 'key', 'loudness', 'mode', 'duration_ms', 'energy', 'instrumentalness', 'speechiness', 
         'acousticness', 'danceability', 'valence', 'popularity', 'liveness', 'year', 'release_date']

df.drop_duplicates(subset=no_id, inplace=True)

In [11]:
df.shape

(166706, 18)

In [12]:
# Viewing duplicate song titles
#pd.concat(n for _, n in df.groupby('name') if len(n) > 1)
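
One way to look into the near-duplicate case (album and single versions with slightly different metrics) is sketched below. It is a rough illustration rather than the approach used in this notebook: the toy rows are made up, and matching on `name` plus `artists` is my own assumption about what counts as "the same song".

```python
import pandas as pd

# Toy data (hypothetical values): two near-identical versions of one song,
# plus an unrelated song
toy = pd.DataFrame({
    'name': ['Song A', 'Song A', 'Song B'],
    'artists': ["['Artist X']", "['Artist X']", "['Artist Y']"],
    'duration_ms': [200000, 201500, 180000],
    'tempo': [120.0, 120.1, 95.0],
    'id': ['id1', 'id2', 'id3'],
})

# Keep every row whose (name, artists) pair appears more than once
candidates = toy[toy.duplicated(subset=['name', 'artists'], keep=False)]

# For each candidate group, measure how far apart the versions are per metric
spread = (candidates
          .groupby(['name', 'artists'])[['duration_ms', 'tempo']]
          .agg(lambda s: s.max() - s.min()))
print(spread)
```

Groups with a small spread on every metric are likely to be single and album versions of the same recording, while large spreads point to covers or different songs sharing a title.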

---

**BELOW**: Rounding `tempo` values to the nearest single decimal place. A few notes on this choice, and on tempo in general:
1. `tempo` is measured in beats per minute, and is generally expressed in integer values in the music world.
2. The decimal places exist in the original dataframe because the `tempo` metric is averaged for the entire song. Many songs recorded before the advent of digital recording and *click tracks* (a regular, metronomic pulse musicians hear in their headphones when recording to keep a steady pace) do not have a consistent tempo, due to natural human error in timing. 
3. Some old songs have also been altered by the idiosyncrasies of analog recording - an improperly aligned tape machine can cause small tempo fluctuations on playback, or when transferred to a new tape reel during the mastering process.
4. Ambient, orchestral, purely acoustic, purely vocal, and other such pieces without a strong pulse can also "trick" Spotify's tempo algorithm. Computers are largely able to infer tempo based on regular pulses (often in the attack sounds of drums and other rhythm instruments, known as *transients*) that machine learning algorithms can use as something like anchor points. Measuring the temporal distance between these "anchors" ideally yields a tempo.
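
Point 4 can be made concrete with a small sketch. This is not Spotify's algorithm, which I don't have access to. It assumes the onset times are already known and that every onset falls on a beat, and the helper name `tempo_from_onsets` is my own.

```python
import numpy as np

def tempo_from_onsets(onset_times):
    """Estimate beats per minute from onset times given in seconds.

    Simplifying assumption: every onset falls on a beat.
    """
    intervals = np.diff(np.asarray(onset_times, dtype=float))
    # The median interval is less sensitive to a few mistimed onsets than the mean
    return 60.0 / np.median(intervals)

# A perfectly steady pulse every 0.5 seconds (as with a click track)
steady = np.arange(0, 10, 0.5)

# The same pulse with small random timing errors, as a human player might produce
rng = np.random.default_rng(0)
human = steady + rng.normal(0, 0.01, size=steady.size)

print(tempo_from_onsets(steady))
print(tempo_from_onsets(human))
```

The steady pulse works out to 120 BPM, while the jittered one will generally land near, rather than exactly on, 120. That is the kind of fractional value found in the raw `tempo` column.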

In [13]:
# Rounding "tempo" to a single decimal place
df['tempo'] = df['tempo'].round(1)

In [14]:
df['tempo'].value_counts()

120.0    916
130.0    702
128.0    699
100.0    685
140.0    651
        ... 
38.3       1
214.1      1
44.8       1
44.7       1
38.0       1
Name: tempo, Length: 1776, dtype: int64

As mentioned **above**, age can drastically affect an audio recording. In the first half of the 20th century, and especially in the 1920s and 1930s, recording technology was still very much in the process of maturing. The full frequency range of human hearing (20Hz to 20kHz) wasn't reproducible with the microphones and media of the time, and additional noise, often in the form of white noise or electrical hum, was clearly audible. With this in mind, I'm going to drop entries for songs recorded prior to 1940, as they tend to suffer most from these issues. The excessive noise and poor sound reproduction of many of these recordings are likely to skew their metrics.

In [15]:
df['year'].value_counts()

2018    2090
2010    1989
2020    1988
1967    1980
2014    1980
        ... 
1925     268
1924     233
1923     182
1921     150
1922      68
Name: year, Length: 100, dtype: int64

In [16]:
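# .copy() gives an independent dataframe, avoiding SettingWithCopyWarning when columns are added later
df = df[df['year'] >= 1940].copy()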

---

In [17]:
# Modality (here defined as 0 for a minor key, 1 for a major key)
# Checking for entries where "mode" is equal to -1. This is the equivalent of a null or non-conclusive analysis.
# None found
df[df['mode'] == -1]

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo


In [18]:
# Key (here defined as 0-11, where 0 represents C, and keys increase chromatically with integer)
# Checking for entries where "key" is equal to -1. This is the equivalent of a null or non-conclusive analysis.
# None found
df[df['key'] == -1]

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo


In [19]:
df['year'].value_counts()

2018    2090
2010    1989
2020    1988
2014    1980
1967    1980
        ... 
1942    1631
1947    1618
1941     922
1944     737
1943     687
Name: year, Length: 81, dtype: int64

In [20]:
# Engineering a feature for "full_key", which contains the letter representing the key's fundamental pitch, and the modality,
# representing whether the key is major or minor.
df['key_letter'] = df['key'].map({0: 'C', 1: 'C#', 2: 'D', 3: 'D#', 4: 'E', 5: 'F', 6: 'F#', 7: 'G', 8: 'G#', 9: 'A', 10: 'A#', 11: 'B'})
df['maj_min'] = df['mode'].map({0: 'minor', 1: 'major'})

df['full_key'] = df['key_letter'] + ' ' + df['maj_min']

In [21]:
# Dropping columns engineered to make "full_key", as they're no longer necessary.
df.drop(columns=['key_letter', 'maj_min'], inplace=True)
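
The same mapping can also be written as a small helper that copes with the -1 "no result" value checked for earlier. This is a sketch under my own naming (`KEY_LETTERS`, `MODES`, and `full_key` as a function are not part of the notebook's pipeline), shown on made-up rows.

```python
import pandas as pd

KEY_LETTERS = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
MODES = {0: 'minor', 1: 'major'}

def full_key(key, mode):
    """Return e.g. 'F major', or None when either value is outside its expected range."""
    if key not in range(12) or mode not in MODES:
        return None
    return f'{KEY_LETTERS[key]} {MODES[mode]}'

# Toy rows: the last one has an inconclusive key of -1
toy = pd.DataFrame({'key': [5, 2, -1], 'mode': [1, 0, 1]})
toy['full_key'] = [full_key(int(k), int(m)) for k, m in zip(toy['key'], toy['mode'])]
print(toy)
```

With this version, an inconclusive key shows up as a missing value rather than as a malformed label.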

---

In [22]:
# Reordering columns
df = df[['name', 'artists', 'tempo', 'key', 'mode', 'full_key', 'loudness', 'duration_ms', 'energy', 'instrumentalness', 'speechiness', 
         'acousticness', 'danceability', 'valence', 'popularity', 'liveness', 'year', 'release_date', 'id']]

In [23]:
df.head()

Unnamed: 0,name,artists,tempo,key,mode,full_key,loudness,duration_ms,energy,instrumentalness,speechiness,acousticness,danceability,valence,popularity,liveness,year,release_date,id
3606,Main Title - Gone With the Wind,['Max Steiner'],121.8,5,1,F major,-8.326,246773,0.43,0.854,0.0357,0.68,0.17,0.171,33,0.459,1940,1940-01-17,6kucJsz2GR6oN8glzqGdTh
3607,Let's Do It (with Eddie Heywood & His Orchestr...,"['Billie Holiday', 'Eddie Heywood']",142.3,5,0,F minor,-14.45,175893,0.182,0.000171,0.0537,0.983,0.543,0.597,23,0.258,1940,1940,3LKwoKXJBpWBSKID58IgYw
3608,9:20 Special,['Count Basie'],198.7,2,0,D minor,-14.557,187400,0.333,0.0892,0.101,0.871,0.513,0.876,27,0.297,1940,1940,1GoA6bkY0rZsc7aAHGlfl5
3609,Seven Come Eleven (feat. Benny Goodman & Charl...,"['Benny Goodman Sextet', 'Benny Goodman', 'Cha...",115.7,1,1,C# major,-6.975,165800,0.72,0.016,0.0292,0.727,0.703,0.823,26,0.185,1940,1940,5SikmTT1Bf2M1NetWLeDyi
3610,"Final Speech (From ""The Great Dictator"")",['Charlie Chaplin'],124.4,2,1,D major,-8.429,222867,0.383,0.0,0.659,0.885,0.721,0.629,25,0.503,1940,1940,5HjcDXSfMbBmG4x04wELjK


---

In [24]:
df['popularity'].describe()

count    152342.000000
mean         34.349654
std          20.757630
min           0.000000
25%          20.000000
50%          36.000000
75%          50.000000
max         100.000000
Name: popularity, dtype: float64

In [25]:
# Based on the metrics immediately above, I'm going to drop all songs with a popularity rating of less than 10 out of 100.
# The mean popularity skews low, which makes sense based on how Spotify measures it (definition above). Even still, I'd like to filter
# out the least popular songs, given that the point of the project is to help new artists market themselves.
df = df[(df['popularity'] >= 10)]

# Dropping all rows where the "artists" feature contains "Cast" or "Broadway Company".
# These are Broadway and musical film cast recordings, which are not within the scope of this project's goals.
# Note: this is a case-insensitive substring match, so any artist name containing "cast" is dropped too.
df = df[~df['artists'].str.contains('Cast', case=False)]
df = df[~df['artists'].str.contains('Broadway Company', case=False)]

# Similarly dropping artists containing the phrase "Motion Picture".
df = df[~df['artists'].str.contains('Motion Picture', case=False)]

# I also noticed some Charlie Chaplin dialogue recordings in the dataframe. Dropping here.
df = df[~df['artists'].str.contains('Charlie Chaplin', case=False)]

# Same with "Legally Blonde"
df = df[~df['artists'].str.contains('Legally Blonde', case=False)]
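
A substring filter on "Cast" is broad. The sketch below compares it with a whole-word match on made-up rows, in case a stricter filter is wanted later. The first artist string is illustrative, and I have not checked how many rows in the real dataframe the two approaches treat differently.

```python
import pandas as pd

toy = pd.DataFrame({'artists': [
    "['Original Broadway Cast']",   # illustrative cast recording
    "['Casting Crowns']",           # band name that merely contains "cast"
    "['Frank Sinatra']",
]})

# Plain substring match: flags the first two rows
substring_hit = toy['artists'].str.contains('Cast', case=False)

# Whole-word match: \b marks a word boundary, so "Casting" no longer matches
word_hit = toy['artists'].str.contains(r'\bCast\b', case=False, regex=True)

print(toy.assign(substring_hit=substring_hit, word_hit=word_hit))
```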

In [26]:
df.head()

Unnamed: 0,name,artists,tempo,key,mode,full_key,loudness,duration_ms,energy,instrumentalness,speechiness,acousticness,danceability,valence,popularity,liveness,year,release_date,id
3606,Main Title - Gone With the Wind,['Max Steiner'],121.8,5,1,F major,-8.326,246773,0.43,0.854,0.0357,0.68,0.17,0.171,33,0.459,1940,1940-01-17,6kucJsz2GR6oN8glzqGdTh
3607,Let's Do It (with Eddie Heywood & His Orchestr...,"['Billie Holiday', 'Eddie Heywood']",142.3,5,0,F minor,-14.45,175893,0.182,0.000171,0.0537,0.983,0.543,0.597,23,0.258,1940,1940,3LKwoKXJBpWBSKID58IgYw
3608,9:20 Special,['Count Basie'],198.7,2,0,D minor,-14.557,187400,0.333,0.0892,0.101,0.871,0.513,0.876,27,0.297,1940,1940,1GoA6bkY0rZsc7aAHGlfl5
3609,Seven Come Eleven (feat. Benny Goodman & Charl...,"['Benny Goodman Sextet', 'Benny Goodman', 'Cha...",115.7,1,1,C# major,-6.975,165800,0.72,0.016,0.0292,0.727,0.703,0.823,26,0.185,1940,1940,5SikmTT1Bf2M1NetWLeDyi
3611,Things Ain't What They Used To Be - 1995 Remas...,['Johnny Hodges & His Orchestra'],89.1,1,1,C# major,-17.038,217933,0.127,0.897,0.0486,0.971,0.651,0.529,22,0.113,1940,1940,61HOxhLI26v49GePTFZ1pN


---

In [42]:
df[['artists']].describe()

Unnamed: 0,artists
count,125561
unique,25575
top,['Frank Sinatra']
freq,590


In [43]:
# df.groupby('artists').mean()
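
If the per-artist averages are revisited, note that recent pandas versions raise an error when `mean()` meets text columns such as `name`. A minimal sketch on made-up rows, restricting the aggregation to numeric columns:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'name': ['Song A', 'Song B', 'Song C'],
    'artists': ["['Artist X']", "['Artist X']", "['Artist Y']"],
    'tempo': [100.0, 120.0, 90.0],
    'energy': [0.4, 0.6, 0.8],
})

# numeric_only=True skips text columns such as "name", which cannot be averaged
artist_means = toy.groupby('artists').mean(numeric_only=True)
print(artist_means)
```

Because `artists` holds the whole list as one string, each distinct combination of collaborators forms its own group here.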

---

In [29]:
# RE-INDEX DATAFRAME
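
A possible way to do the re-indexing, sketched on a toy frame rather than the real data: after filtering, the index keeps the original row labels (the previews above start at 3606 and skip 3610), and `reset_index(drop=True)` replaces them with a clean 0-based range.

```python
import pandas as pd

# Toy frame whose index has gaps, like the filtered dataframe
toy = pd.DataFrame({'name': ['a', 'b', 'c']}, index=[3606, 3607, 3611])

# drop=True discards the old labels instead of keeping them as a new column
toy = toy.reset_index(drop=True)
print(toy.index.tolist())
```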

In [30]:
# df.to_csv('../data_clean/spotify_kg_master.csv', index=False)

---