# Taylor Swift & Analytics at Spotify
***How Spotify Makes Quantitative Sense of Music***

Note: this notebook was prepared as part of a presentation party with friends. The goal of this workbook is to learn about the features behind Spotify's algorithm using Taylor Swift's Spotify data, exploring questions like:

* What are the audio features Spotify thinks are most important when classifying music?
* Why is feature engineering hard?
* Most importantly, what can this stuff tell us about Taylor Swift?

*Outline*
* SECTION 1: Intro to Recommender Systems @ Spotify
* SECTION 2: The 12 Dimensions Spotify Uses to Measure Music
* SECTION 3: Features Deep Dive

Dataset provided by Jarred Priester - https://www.kaggle.com/datasets/jarredpriester/taylor-swift-spotify-dataset

by *Brett Carpenter*
![](https://media.gettyimages.com/id/1170427711/photo/2019-mtv-video-music-awards-red-carpet.jpg?s=1024x1024&w=gi&k=20&c=I9EtDWpzBb7VV-F31cbDlptIB_7Q3xFn-Mn41hw1aRE=)
This picture costs $499.00!

#### SECTION 1
# Intro to Recommender Systems @ Spotify
* What: Recommender Systems are algorithms aimed at suggesting relevant items to users
* Why: Spotify uses personalization to drive user retention
* How: Spotify uses a "compare & share" method 


## The Compare & Share Method

**1. What do you like?**
* Repeat listens
* Skips (60 second rule)
* Save-to-play ratio
* Playlist adds

**2. Who else Likes this stuff?**
* Similar Users to you
* Similar Playlists to yours

**3. Split the difference**
* Musical penpals
* They like this song but you've never heard it
* Works very well despite being relatively naive - We may not know why the tastes are shared
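
The three steps above can be sketched as a tiny user-based collaborative filter. This is only an illustration, not Spotify's actual system: the listeners, songs, and play counts are made up, and cosine similarity is just one common way to answer "who else likes this stuff?".

```python
import math

# Hypothetical play counts per listener (made-up data for illustration only).
plays = {
    "you":   {"Cruel Summer": 12, "Anti-Hero": 8, "Style": 5},
    "pal_a": {"Cruel Summer": 10, "Anti-Hero": 9, "Style": 4, "August": 7},
    "pal_b": {"Shake It Off": 6, "August": 1},
}

def cosine_similarity(a, b):
    """Cosine similarity between two sparse play-count dicts."""
    shared = set(a) & set(b)
    dot = sum(a[s] * b[s] for s in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(target, plays):
    """Suggest songs the most similar listener played that the target has not."""
    others = [u for u in plays if u != target]
    # Step 2: find the "musical penpal" with the closest taste.
    best = max(others, key=lambda u: cosine_similarity(plays[target], plays[u]))
    # Step 3: split the difference, i.e. songs they play that the target never has.
    unheard = set(plays[best]) - set(plays[target])
    return best, sorted(unheard, key=lambda s: plays[best][s], reverse=True)

penpal, suggestions = recommend("you", plays)
print(penpal, suggestions)
```

Note that nothing in the sketch looks at what the songs sound like. It only uses listening behavior, which is why the method is called naive.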

![](https://files.realpython.com/media/Build-a-Recommendation-Engine-With-Collaborative-Filtering_Watermarked.451abc4ecb9f.jpg)

## Where Audio Features Come into Play

**Limitations of Collaborative Filtering**

Collaborative Filtering is very efficient at identifying clusters, but doesn't give us any insight into the character of these clusters.
* Why do these clusters exist?
* How would you describe this cluster?
* If we introduce audio features, we can create genres
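
One way to picture how audio features help describe a cluster: average the features of the tracks inside it. The sketch below assumes we already have cluster labels (the `cluster` column is a stand-in for groups found by collaborative filtering), and all numbers are invented for illustration.

```python
import pandas as pd

# Made-up tracks: 'cluster' stands in for a group found by collaborative filtering.
tracks = pd.DataFrame({
    "cluster":      ["A", "A", "A", "B", "B", "B"],
    "acousticness": [0.91, 0.85, 0.78, 0.05, 0.12, 0.08],
    "energy":       [0.25, 0.31, 0.40, 0.88, 0.79, 0.93],
    "danceability": [0.42, 0.38, 0.50, 0.75, 0.81, 0.70],
})

# Average each audio feature within a cluster to get a rough "profile".
profile = tracks.groupby("cluster").mean(numeric_only=True)
print(profile.round(2))

# The feature with the highest average gives a crude one-word description.
dominant = profile.idxmax(axis=1)
print(dominant)
```

Here cluster A comes out as mostly acoustic and cluster B as mostly high-energy, which is the kind of label a genre name ends up summarizing.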

**Cool Stuff you can do with Genres**
* [Genre graphing](https://manasahariharan.github.io/Spotify-Artists-Network/#country%20rock) - map relationships
* Genre Based Recommendation
* Taste Mapping: "The Best discoveries are between or on the edge of genres".

<div style="width:100%;text-align: center;"><img src="https://preview.redd.it/2htfahv13k451.png?width=960&crop=smart&auto=webp&v=enabled&s=20bfca1954fd2b8f50d846b99cdd3dfc1aa56f33" width="900" height="900"></div>

<div style="width:100%;text-align: center;"><img src="https://storage.googleapis.com/research-production/1/2022/03/3.-Graph-2-1-1536x864.png" width="900" height="900"></div>


#### SECTION 2
# The 12 Dimensions Spotify Uses to Measure Music
***How TSwift Looks to an Algorithm***
* These are the factors that Spotify thinks are most important to understanding music
* Music is so complex, so why are there only 12 features?
* The art and science of feature engineering

Overview of Dimensions

<div style="width:100%;text-align: center;"><img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*Z5QnuTYai72jqxmPH-qaSw.png" width="600" height="600"></div>

## Features Explained

Categorical Features

1. **Key**: The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
2. **Explicit Lyric Status**: Whether or not the track has explicit lyrics (True = yes it does; False = no it does not OR unknown).
3. **Mode**: Indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

Numerical Features

4. **Song Duration** (ms): Length of the track in ms.
5. **Acousticness**: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. The closer this value is to 1.0, the greater the confidence that the track is acoustic. An example song with high acousticness is Offing by Julianna Barwick (acousticness = 0.996), while an example song with a low acousticness is Memory Machine by Dismemberment Plan (acousticness = 0.003320).
6. **Danceability**: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. An example song with a high danceability is Leave Me Now by Herbert (danceability = 0.985), while an example of a song with low danceability is Surf by Fennesz (danceability = 0.0636)
7. **Energy**: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. An example of a song with high energy is When Doves Cry by Prince (energy = 0.989), while an example of a song with low energy is Zawinul/Lava by Brian Eno (energy = 0.010600)
8. **Instrumentalness**: Predicts whether a track contains no vocals. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
9. **Liveness**: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
10. **Speechiness**: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
11. **Tempo** (BPM): The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
12. **Valence**: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). An example of a song with high valence is Don’t Laugh (I Love You) by Ween (valence = 0.983), while an example of a song with low valence is Hedphelym by Aphex Twin (valence = 0.0188)
13. **Popularity**: The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.
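
To make the categorical encodings above concrete, here is a small helper that turns the integer `key` and `mode` values into a readable label. The function name `describe_key` is ours, not part of any Spotify library, and the mapping simply follows the Pitch Class convention described in the list.

```python
# Pitch Class notation: index 0 = C, 1 = C♯/D♭, ... 11 = B.
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]

def describe_key(key, mode):
    """Turn integer key/mode values into a readable label (hypothetical helper)."""
    if key == -1:
        # -1 means no key was detected for the track.
        return "no key detected"
    quality = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {quality}"

print(describe_key(0, 1))   # C major
print(describe_key(9, 0))   # A minor
print(describe_key(-1, 1))  # no key detected
```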



### Introducing Data
We'll start by loading the data and seeing what it looks like.

In [None]:
import pandas as pd
import numpy as np
import random

# Models
from sklearn import ensemble
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.linear_model import ElasticNet
import xgboost as xg 
from sklearn.metrics import r2_score, mean_squared_error, mean_gamma_deviance

# Data Vis
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data, drop identifier columns and duplicate rows, then preview the first rows
raw = pd.read_csv('/kaggle/input/taylor-swift-spotify-dataset/taylor_swift_spotify.csv', index_col=0)
df = raw.drop(['id', 'uri'], axis=1)
df.drop_duplicates(inplace = True)
df.head(10)


## The Shape of You
*Applying these Features to Taylor*
* Correlation Matrix: how to read it, takeaways
* Descriptive stats

In [None]:
# Feature Correlation
plt.figure(figsize=(12, 6))
sns.set(style="whitegrid")
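# numeric_only=True skips text columns such as name and album (needed in pandas >= 2.0)
corr = raw.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm")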

#### Correlations:

1) Expected: There is a strong positive link between loudness and energy (0.80).

2) Expected: There is a strong negative link between loudness/energy and acousticness (-0.72 & -0.73).

3) Unexpected: Instrumentalness is negatively correlated with popularity (-0.53).

4) Unexpected: Valence has a moderate negative correlation with the length of a song (-0.43).

5) Unexpected: Valence is positively correlated with danceability (0.39), but tempo is not. This suggests that danceability has more to do with how a song makes people feel than with how fast it is.
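
For anyone wondering what the numbers in the heatmap mean: by default `DataFrame.corr()` computes the Pearson correlation coefficient, which ranges from -1 (perfect negative link) to +1 (perfect positive link). Below is a by-hand sketch of that calculation. The feature values are invented for illustration and are not taken from the Taylor Swift dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance term: do the two variables move away from their means together?
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for five songs.
loudness     = [-12.0, -9.5, -7.0, -5.5, -4.0]
energy       = [0.30, 0.45, 0.60, 0.72, 0.85]
acousticness = [0.90, 0.70, 0.40, 0.25, 0.10]

r_energy = pearson(loudness, energy)          # close to +1: louder songs are more energetic
r_acoustic = pearson(loudness, acousticness)  # close to -1: louder songs are less acoustic
print(round(r_energy, 2), round(r_acoustic, 2))
```

A correlation only says that two features move together in this sample. It does not say that one causes the other.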

### Descriptive Stats for Taylor
* Avg danceability for Taylor's songs
* Avg popularity
* Etc.

In [None]:
# GET STATS ON EACH FIELD
df.describe()

#### SECTION 3
# Features Deep Dive
Now we can start to look at Taylor's data and ask some questions to learn about the algorithm features and what they can tell us about her music.

## 1. Popularity

### What is Spotify’s Popularity Index?

The Spotify Popularity Index is a 0-to-100 score that ranks how popular an artist is relative to other artists on Spotify. As your numbers grow, you’ll get placed in more editorial playlists and increase your reach on algorithmic playlists and recommendations. Some say that the magic number is 50!

The Index can be used to monitor and influence the progress of new releases. Each track has its own Popularity Index, which feeds into the artist’s overall index. While the Popularity Index is largely determined by recent stream count, other factors like save rate, number of playlists, skip rate, and share rate can indirectly raise or lower a song’s score.

### What does popularity look like in general? 

![](https://miro.medium.com/v2/resize:fit:1024/format:webp/1*RpPeirYvUS0_7VKXbINZqw.png)

*Pareto principle - only a handful of songs get the vast majority of listens, and most songs fail!*
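
A quick way to check a Pareto-style claim is to compute what share of all streams the top 20% of songs capture. The stream counts below are invented for illustration, and `top_share` is a hypothetical helper name.

```python
# Made-up stream counts for ten songs, for illustration only.
streams = [9_000_000, 4_000_000, 500_000, 200_000, 120_000,
           80_000, 50_000, 30_000, 15_000, 5_000]

def top_share(counts, fraction=0.2):
    """Share of total streams captured by the top `fraction` of songs."""
    ranked = sorted(counts, reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    return sum(ranked[:top_n]) / sum(ranked)

share = top_share(streams)
print(f"Top 20% of songs account for {share:.0%} of streams")
```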

### Distribution of Taylor's Song Popularity
Are all songs popular?

In [None]:
# DISTRIBUTION OF SONG POPULARITY
plt.figure(figsize=(16, 8))
sns.histplot(df["popularity"],bins='auto')

As the histogram above shows, popularity varies quite widely. The dataset includes karaoke and international versions of the same songs, which may explain why some scores are so low and why the data is left-skewed overall.
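
To put a number on the skew rather than eyeballing the histogram, pandas offers `Series.skew()`: a negative value means a long tail on the left. The sketch below uses a made-up popularity sample, not the real dataset; on the notebook's data the equivalent call would be `df["popularity"].skew()`.

```python
import pandas as pd

# Made-up popularity scores: most songs score well, a few score very low.
popularity = pd.Series([5, 20, 55, 60, 62, 65, 66, 68, 70, 72, 75])

skew = popularity.skew()
print(f"skew = {skew:.2f}")

# In a left-skewed sample the low outliers drag the mean below the median.
print(f"mean = {popularity.mean():.1f}, median = {popularity.median():.1f}")
```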


In [None]:
df2 = df[['album', 'name','popularity']].copy()

In [None]:
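# Average popularity per album, least popular first (numeric_only skips the text 'name' column)
sorted_df = df2.groupby('album').mean(numeric_only=True).sort_values(['popularity'], ascending=True)
sorted_df = sorted_df.reset_index()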

### Taking a Closer Look at Popularity
Most and Least Popular Albums

In [None]:
# LEAST Popular
sorted_df.head(10)

**Removing Karaoke Albums**

Several of the albums here are karaoke versions, which strip out the vocals. Leaving them in could unfairly boost some correlations with speechiness and could affect how we predict popularity. I am going to remove any album whose name contains the string 'karaoke'.


In [None]:
# Remove any rows where the album name contains 'karaoke' (case-insensitive)
df3 = df2[~df2['album'].str.contains("karaoke", case=False, na=False)]

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,4), sharey=True)
sns.histplot(raw["popularity"],bins='auto', ax=axes[0])
sns.histplot(df3["popularity"],bins='auto', ax=axes[1])
axes[0].set_title('Raw Popularity')
axes[1].set_title('Cleaned Popularity')


### Most Popular Albums

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x=sorted_df['album'], y=sorted_df['popularity'])
plt.xlabel('Album')
plt.ylabel('Popularity')
plt.xticks(rotation=90)
plt.title('Popularity of Albums')
plt.show()

### Most Popular Songs

In [None]:
# Popularity of Songs
df2 = df[['name','album','danceability','popularity', 'release_date','valence','speechiness']].copy()
df2 = df2.loc[df2['album'].isin(['Midnights','reputation','Lover'])]
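# Average the numeric features per song; numeric_only skips text columns like album and release_date
song_sorted_df = df2.groupby('name').mean(numeric_only=True).sort_values(['popularity'], ascending=True)
song_sorted_df = song_sorted_df.reset_index()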
#song_sorted_df=song_sorted_df.head(11)

plt.figure(figsize=(20,10))
sns.barplot(x=song_sorted_df['popularity'], y=song_sorted_df['name'],orient='h')
plt.xlabel('Popularity')
plt.ylabel('Song')
plt.xticks(rotation=90)
plt.title('Popularity of Song')
plt.show()

## Danceability

In [None]:
## Most Danceable songs - Radio edits
sns.scatterplot(data=df2, x="danceability", y="popularity", style="album", hue="album", legend="full")

In [None]:
# What is the distribution of Tay's danceability?
plt.figure(figsize=(16, 8))
sns.histplot(df["danceability"],bins='auto')

In [None]:
# Danceability over time, is Taylor getting more dancey?
# Parse release_date as datetimes so seaborn plots it on a proper time axis
df['release_date'] = pd.to_datetime(df['release_date'])

sns.lineplot(data=df, x="release_date", y="danceability")



## Valence
**How Happy is Taylor over time?**

Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). An example of a song with high valence is Don’t Laugh (I Love You) by Ween (valence = 0.983), while an example of a song with low valence is Hedphelym by Aphex Twin (valence = 0.0188)

In [None]:
sns.lineplot(data=df, x="release_date", y="valence")

In [None]:
plt.figure(figsize=(12, 6))
sns.set(style="whitegrid")
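# numeric_only=True skips text columns such as name and album (needed in pandas >= 2.0)
corr = raw.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm")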

In [None]:
## Happier Songs and Speechiness
sns.lineplot(data=df, x="valence", y="speechiness")

When Taylor wants to express happiness, she tends to do so through her voice or her songwriting, as opposed to through the music itself.

## Release Times

In [None]:
## Songs by Danceability, Popularity, Year
# Convert the release_date to datetime type
df['release_date'] = pd.to_datetime(df['release_date'])


# Create features for each day/month/year each song was released.
month_of_release = df['release_date'].dt.month
day_of_month = df['release_date'].dt.day
year_of_release = df['release_date'].dt.year

# Visualise by release 
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(20,4), sharey=False)

sns.countplot(x=day_of_month, ax=axes[0])
sns.countplot(x=month_of_release, ax=axes[1])
sns.countplot(x=year_of_release, ax=axes[2])

axes[0].set_title('Day of month released')
axes[1].set_title('Month of release')
axes[2].set_title('Year of release')

#sns.scatterplot(data=df, x="danceability", y="popularity")