# Spotify Playlist In-Depth

Hello everyone, in this project I'll analyze one of my Spotify playlists. I'll discuss the data, how to [replicate this analysis](#how-to-replicate-this-project) with your own playlist, where to find resources, some background on the music, and how I plan to use this project in different ways for different audiences.

### Loading the data

There are two main ways to generate a dataset from a Spotify playlist: A) use the official [Spotify API](https://developer.spotify.com/documentation/web-api/), or B) use a third-party app that is built on the Spotify API.

I chose option B because it's the easier route: it's free and anyone can do it. I used a website called [Chosic Spotify Playlist Analyzer](https://www.chosic.com/spotify-playlist-analyzer/) to generate a CSV file of my playlist. The website also generates some basic analytics which you can check out.
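If you'd rather go with option A, most of the work is flattening the JSON that the Web API returns into rows. Here's a minimal sketch, assuming the usual shape of a playlist items response (an `items` list where each entry has `added_at` and a nested `track` object). The helper names and the sample payload are hypothetical, the `added_at` timestamp is made up, and the audio features that Chosic exports (BPM, Dance, Energy, etc.) would need separate requests that aren't shown here.

```python
import csv
import io


def flatten_playlist_items(payload):
    """Turn one page of playlist items into flat dict rows."""
    rows = []
    for item in payload.get("items", []):
        track = item.get("track") or {}  # an unavailable track can come back as null
        if not track:
            continue
        album = track.get("album") or {}
        rows.append({
            "Song": track.get("name"),
            "Artist": ",".join(a["name"] for a in track.get("artists", [])),
            "Album": album.get("name"),
            "Album Date": album.get("release_date"),
            "Popularity": track.get("popularity"),
            "Added At": item.get("added_at"),
        })
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text, using the keys of the first row as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical payload, trimmed to the fields used above
sample = {
    "items": [
        {
            "added_at": "2020-01-20T00:00:00Z",  # illustrative timestamp
            "track": {
                "name": "Dazed (feat. Gabrielle & Geoffroy)",
                "popularity": 44,
                "artists": [{"name": "Men I Trust"}, {"name": "Geoffroy"}, {"name": "Gabrielle"}],
                "album": {"name": "Men I Trust", "release_date": "2014-05-28"},
            },
        },
        {"added_at": "2020-01-21T00:00:00Z", "track": None},  # skipped by the helper
    ]
}

rows = flatten_playlist_items(sample)
csv_text = rows_to_csv(rows)
print(csv_text)
```

The column names mirror the ones in the Chosic export so the rest of this notebook would need fewer changes, but that mapping is my own choice and not something the API dictates.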

I created another Python notebook to scrape Wikipedia & [Resident Advisor](https://ra.co) for the artists' country information. Wikipedia is a good source for gathering info about popular artists, and Resident Advisor is a great knowledge base for everything DJ.

Note: Resident Advisor blocks web scraping after a certain number of requests. It returns a 403 Forbidden (a Cloudflare service to prevent bots), so I manually filled in some of the entries in the country CSV.
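One way to soften the 403 problem is to slow down and retry before giving up on an artist. This is only a sketch of that idea, not what my scraping notebook does: `fetch_with_backoff` is a hypothetical helper, `fetch` stands in for whatever function makes the actual request, and the URL is a placeholder. There's also no guarantee that waiting is enough to get past a bot filter, so the manual fallback stays.

```python
import time


def fetch_with_backoff(fetch, url, max_tries=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it stops returning 403 or the tries run out.

    `fetch` is any callable that returns (status_code, body). The function
    returns the body on success, or None so the entry can be filled in by hand.
    """
    for attempt in range(max_tries):
        status, body = fetch(url)
        if status == 200:
            return body
        if status != 403:
            return None  # other errors are not retried in this sketch
        if attempt < max_tries - 1:
            sleep(base_delay * 2 ** attempt)  # wait 2s, 4s, 8s, ... between tries
    return None


# Fake fetcher for demonstration: blocked twice, then allowed through
responses = iter([(403, ""), (403, ""), (200, "<html>Berlin</html>")])
waits = []
page = fetch_with_backoff(
    lambda url: next(responses),
    "https://ra.co/dj/example",  # placeholder URL
    sleep=waits.append,          # record the delays instead of actually sleeping
)
print(page, waits)
```

Passing `sleep` in as a parameter keeps the example quick to run; in a real scrape you'd leave the default `time.sleep`.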

### Analysis Overview

The playlist I'm analyzing is called Production. I use this playlist to store tracks that inspire my own music production. It has a great variety of tracks that I've been adding since 2018.

For the analysis, I'll look into details about the tracks themselves, like genre, release year, record label, etc. Then I'll look into my own actions with the playlist, like how many tracks I added over the years, what type of tracks, etc.

I'll be using plotly for the graphs because it generates interactive plots and it's also pretty easy to customize. 

In [89]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# List every file under the local ../datasets/ directory

import os
for dirname, _, filenames in os.walk('../datasets/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


../datasets/all_countries_231.txt
../datasets/Hotel Reservations.csv
../datasets/Production_Playlist.csv
../datasets/Production_Playlist_Countries.csv
../datasets/Production_Playlist_Countries_first.csv
../datasets/Production_Playlist_RA_Countries.csv


In [50]:
import ast

# from datetime import date
import plotly.io as pio
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import cufflinks as cf
%matplotlib inline

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

cf.go_offline()

In [51]:
df=pd.read_csv('https://raw.githubusercontent.com/dekaghub/Data-Projects-Deka/main/Datasets/Production_Playlist.csv', parse_dates=['Added At','Album Date'])
df_country = pd.read_csv('https://github.com/dekaghub/Data-Projects-Deka/raw/main/Datasets/Production_Playlist_Countries.csv')

In [52]:
def optimize_memory(df):
    return (
        df
        .astype({
            'Popularity':'int8',
            'BPM':'int16',
            'Dance':'int8',
            'Energy':'int8',
            'Acoustic':'int8',
            'Instrumental':'int8',
            'Happy':'int8', 
            'Speech':'int8', 
            'Live':'int8', 
            'Loud':'int8', 
            'Time Signature':'int8',
            'Key':'category',
            'Camelot':'category'
        })
        .rename(columns={'Parent Genres':'Parent_Genres',
                         'Album Date':'Album_Date',
                         'Time Signature':'Time_Signature',
                         'Added At':'Added_At',
                         'Album Label':'Record_Label'
                         })
        .drop(columns=['#','Spotify Track Id','Spotify Track Img','Song Preview'])
    )

In [53]:
df.memory_usage(deep=True).sum()

733021

In [54]:
optimize_memory(df).memory_usage(deep=True).sum()

371517
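The memory use roughly halves (733021 to 371517 bytes) mostly because the numeric columns get a narrower integer type. The catch is that each type only covers a fixed range, which is why BPM is cast to int16 while the 0 to 100 scores fit in int8. Here's a small sketch of that check; `smallest_int_dtype` is a hypothetical helper and not part of pandas.

```python
# (name, minimum, maximum) for the signed integer types, narrowest first
INT_RANGES = [
    ("int8", -2**7, 2**7 - 1),      # -128 .. 127
    ("int16", -2**15, 2**15 - 1),   # -32768 .. 32767
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]


def smallest_int_dtype(values):
    """Return the name of the narrowest signed integer type that holds every value."""
    lowest, highest = min(values), max(values)
    for name, dtype_min, dtype_max in INT_RANGES:
        if dtype_min <= lowest and highest <= dtype_max:
            return name
    raise OverflowError("values do not fit in int64")


print(smallest_int_dtype([63, 204]))   # min and max BPM in this playlist -> int16
print(smallest_int_dtype([0, 100]))    # a 0 to 100 score -> int8
print(smallest_int_dtype([-15, -3]))   # a few Loud values -> int8
```

The BPM maximum of 204 is above the int8 limit of 127, so that column needs the wider type.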

In [55]:
playlist = optimize_memory(df)
playlist['Country'] = df_country['Artist']

In [56]:
# borrowed from https://www.kaggle.com/sukhdeepk
def missing_pct(df):
    # Calculate missing value and their percentage for each column
    missing_count_percent = df.isnull().sum() * 100 / df.shape[0]
    df_missing_count_percent = pd.DataFrame(missing_count_percent).round(2)
    df_missing_count_percent = df_missing_count_percent.reset_index().rename(
                    columns={
                            'index':'Column',
                            0:'Missing_Percentage (%)'
                    }
                )
    df_missing_value = df.isnull().sum()
    df_missing_value = df_missing_value.reset_index().rename(
                    columns={
                            'index':'Column',
                            0:'Missing_value_count'
                    }
                )
    # Sort the data frame
    #df_missing = df_missing.sort_values('Missing_Percentage (%)', ascending=False)
    Final = df_missing_value.merge(df_missing_count_percent, how = 'inner', left_on = 'Column', right_on = 'Column')
    Final = Final.sort_values(by = 'Missing_Percentage (%)',ascending = False)
    return Final

missing_pct(playlist)

Unnamed: 0,Column,Missing_value_count,Missing_Percentage (%)
5,Parent_Genres,69,10.21
4,Genres,59,8.73
0,Song,0,0.0
13,Happy,0,0.0
21,Camelot,0,0.0
20,Record_Label,0,0.0
19,Added_At,0,0.0
18,Time_Signature,0,0.0
17,Key,0,0.0
16,Loud,0,0.0


### Basic Overview

Here I'll look for unique values as well as duplicates. In the context of a playlist, a duplicate would be the same song by the same artist. Then I'll try to understand the overall numbers i.e. a quick summary of the dataset.

In [57]:
playlist[playlist[['Song','Artist']].duplicated(keep='first')].count()

Song              2
Artist            2
Popularity        2
BPM               2
Genres            2
Parent_Genres     2
Album             2
Album_Date        2
Time              2
Dance             2
Energy            2
Acoustic          2
Instrumental      2
Happy             2
Speech            2
Live              2
Loud              2
Key               2
Time_Signature    2
Added_At          2
Record_Label      2
Camelot           2
Country           2
dtype: int64

In [58]:
playlist.nunique()

Song              670
Artist            526
Popularity         86
BPM               117
Genres            457
Parent_Genres     107
Album             617
Album_Date        491
Time              246
Dance              79
Energy             80
Acoustic           93
Instrumental       86
Happy              92
Speech              8
Live                9
Loud               17
Key                24
Time_Signature      4
Added_At          305
Record_Label      439
Camelot            24
Country           527
dtype: int64

### Basic Summary

* Artists - 526 : For 676 tracks (670 unique song titles), that's a lot of unique artists. 107 Parent Genres and 457 Genres also mean this playlist is pretty diverse in terms of sound; some may consider that good while others may say that a playlist should have a uniform sound. I'd say it depends. In this case, this is a playlist for musical inspiration and not a vibe/mood playlist, so this is good.
* Added_At - 305 : This means that I've added songs to this playlist on 305 separate days since 2018.
* Duplicates - 2
* Country - 527 : This number does not reflect the number of unique countries. The country data is formatted as Artist + Country, so for songs with multiple artists there are multiple country values in the same row.
* Key - 24 : This playlist has songs in all 12 major and all 12 minor keys (which is also why Camelot is 24)

### Genre Analysis

One of the most basic ways to differentiate a song is by its genre. A genre label is broad, yet it's still a fairly precise way of describing a song or an artist.

There are two genre columns that come with the dataset, Parent Genres & Genres. There are a few missing values, but it's only about 10% of the rows, so I'll ignore them. Parent Genres holds the broader genre, whereas the more specific genre label can be found in Genres. For instance, [Dam Swindle - 64 Ways](https://youtu.be/M9rsg2YOcvY?t=50) is a deep house track but it's also a dance/electronic track.

To visualize genre changes by year i.e. my change in music taste, I'll take the first genre label out of Parent Genres & Genres.

In [59]:
playlist['Genre'] = playlist['Genres'].str.split(",").str[0]
playlist['Parent_Genre'] = playlist['Parent_Genres'].str.split(",").str[0]

playlist[['Artist','Song','Parent_Genres','Parent_Genre','Genres','Genre']].sample(3)

Unnamed: 0,Artist,Song,Parent_Genres,Parent_Genre,Genres,Genre
227,Bakar,Big Dreams,Hip Hop,Hip Hop,uk alternative hip hop,uk alternative hip hop
250,Eris Drew,Trans Love Vibration (Eris Goes to Church),"New age, Dance/Electronic",New age,"experimental house, float house",experimental house
576,The Backseat Lovers,Growing/Dying,"Rock, New age",Rock,"indie pop, modern rock, slc indie",indie pop


In [60]:
t = playlist.groupby([playlist['Added_At'].dt.year, 'Parent_Genre'])['Song'].count().reset_index().query("Song > 3")

px.bar(x=t.Added_At, y=t.Song, color=t.Parent_Genre).update_layout(
                  title_text="Parent Genre Distribution by Year",
                  title_x=0.5,
                  yaxis_title='Track Count',
                  xaxis_title='Track Added Year',
                  paper_bgcolor="LightSteelBlue",
                  height=600,
                  margin=dict(l=100, r=150, t=100, b=20)
)

From the graph above, it looks like I added more Rock and Dance/Electronic songs over the years. I used to listen to a lot of pop punk and alternative rock when I was younger, so I added the ones that I still listen to today. There are more dance songs because, as I grew older and did more music production, I developed a liking for the sound aesthetics of electronic music.

The graph below is a breakdown of subgenre changes, i.e. the more precise genre. During 2020-2022, I started listening to a lot of bedroom pop, artists like Still Woozy, Benee, JAWNY, etc. These artists are like the love child of Rock & Dance music, just the right mix of electronic sounds paired with guitar & drums.

In [61]:
t = playlist.groupby([playlist['Added_At'].dt.year, 'Genre'])['Song'].count().reset_index().query("Song > 3")

px.bar(x=t.Added_At, y=t.Song, color=t.Genre, text_auto=True).update_layout(
                  title_text="Sub-Genre Distribution by Year",
                  title_x=0.5,
                  yaxis_title='Track Count',
                  xaxis_title='Track Added Year',
                  paper_bgcolor="LightSteelBlue",
                  height=800,
                  margin=dict(l=100, r=150, t=100, b=20)
).update_traces(textangle=0).update_xaxes(type='category', categoryarray=[2019,2021,2022,2023])

In [62]:
playlist[(playlist.Genres.str.contains("indie pop", na=False))].sample(3)

Unnamed: 0,Song,Artist,Popularity,BPM,Genres,Parent_Genres,Album,Album_Date,Time,Dance,...,Live,Loud,Key,Time_Signature,Added_At,Record_Label,Camelot,Country,Genre,Parent_Genre
531,Preoccupied,Slow Pulp,36,140,"bubblegrunge, chicago indie, indie pop, indie ...",Rock,Ep2,2017-03-09,3:20,69,...,0,-3,F#/G♭ Minor,4,2022-08-12,Slow Pulp,11A,{'Slow Pulp': 'N/A'},bubblegrunge,Rock
483,sex money feelings die,Lykke Li,69,134,"art pop, dance pop, electropop, pop, swedish e...","Pop, Dance/Electronic",so sad so sexy,2018-06-08,2:19,79,...,0,-6,G#/A♭ Minor,4,2022-07-08,LL Recordings/RCA Records,1A,{'Lykke Li': 'Sweden'},art pop,Pop
461,MESS U MADE,MICHELLE,49,127,"bedroom pop, indie pop, indie r&b","Pop, Rock, R&B",MESS U MADE,2021-10-27,2:48,67,...,10,-7,E Major,3,2022-06-21,Canvasback/ATL,12B,{'MICHELLE': 'United States of America'},bedroom pop,Pop


My brother is a DJ so I've always been a big fan of house music and the DJ culture. Over the years, I've scoured through [Beatport](https://www.beatport.com/), the capital of dance music, for underground tracks alongside other DJ oriented websites. For entertainment, I'd recommend the [People Of](https://youtu.be/09yeEisT-FQ) YouTube videos. Great music, funny moments.

I personally prefer the genre labelling of Beatport over Spotify's for dance/electronic tracks because they're more accurate. But anyway, I'll visualize all the tracks that have "house" as one of the labels inside Genres to see the changes of house tracks over the years.

In [63]:
t = playlist[(playlist.Genres.str.contains("house", na=False))].groupby([playlist['Added_At'].dt.year, 'Genre'])['Song'].count().reset_index()

px.bar(x=t.Added_At, y=t.Song, color=t.Genre, text_auto=True).update_xaxes(type='category', categoryarray=[2019,2020, 2021,2022,2023]).update_layout(
                  title_text="House tracks by year",
                  title_x=0.5,
                  yaxis_title='Track Count',
                  xaxis_title='Track Added Year',
                  paper_bgcolor="LightSteelBlue",
                  height=600,
                  margin=dict(l=100, r=150, t=100, b=20)
).update_traces(textangle=0)

I'm a little surprised to see the dip in 2020 but I also remember that was the year of lockdown. I was mostly inside the apartment with my roommates and that year I listened to a lot of rap & hip hop. Deep House is one of my fav genres and it's nice to see the 2022 deep house count as the highest count.

In [64]:
playlist[(playlist.Genres.str.contains("house", na=False))].sample(3)

Unnamed: 0,Song,Artist,Popularity,BPM,Genres,Parent_Genres,Album,Album_Date,Time,Dance,...,Live,Loud,Key,Time_Signature,Added_At,Record_Label,Camelot,Country,Genre,Parent_Genre
76,Je t'aime encore,Yelle,32,100,"alternative dance, dance pop, electro-pop fran...","Rock, Pop, R&B, Dance/Electronic",Je t'aime encore,2020-04-28,3:37,52,...,10,-9,F Major,4,2020-06-14,Recreation Center,7B,{'Yelle': 'France'},alternative dance,Rock
638,King Bromeliad,Floating Points,40,115,"electronica, microhouse, uk bass, wonky","Rock, Dance/Electronic",King Bromeliad / Montparnasse,2014-06-23,8:52,90,...,0,-10,C♯/D♭ Major,4,2023-01-04,Pluto,3B,{'Floating Points': 'UK'},electronica,Rock
440,Armonia,"Michel Degen,Mikhu",19,124,"swiss house,",Dance/Electronic,Armonia,2018-06-04,7:19,81,...,10,-11,B Minor,4,2022-06-09,Samani,10A,"{'Michel Degen': 'N/A', 'Mikhu': 'N/A'}",swiss house,Dance/Electronic


### Instrumental Analysis

This feature measures the balance between vocals and purely musical content in a song. A value of 0 means that almost the entire song contains vocals. The higher the Instrumental number, the more of the song is music without singing.

Note: singing is different from vocal fx or vocal edits, like Ooohs & Aaaahs, which are not considered singing.

[Smash Mouth - All Star](https://youtu.be/L_jWHffIx5E?t=38) - This track has an Instrumental rating of 0 and there's singing pretty much the entire song 

[Bonobo - Linked](https://youtu.be/0W-a11Tdk7Y?t=102) - This track has an Instrumental rating of 89 and you can listen to some of the vocal stabs 

[Lorn - Soft Room](https://youtu.be/YNPPdkxE7S0?t=36) - This track has an Instrumental rating of 79 and you can listen to some of the vocal Ooohs fx

For analysis, I want to compare songs that are full of singing vs those that are not. I also want to investigate if newer songs tend to be singing heavy compared to older releases.

In [65]:
# Tracks with 0 Instrumental vs all Tracks by Release Year
d1 = playlist.Album_Date.dt.strftime("%Y").value_counts()
d2 = playlist[playlist.Instrumental==0].Album_Date.dt.strftime("%Y").value_counts()
# Align the Instrumental == 0 counts to the years in d1; years with no such tracks get 0
vals = d2.reindex(d1.index, fill_value=0).values
t = pd.DataFrame({'Year':d1.index,'All':d1.values,'instrumentalzero':vals})

fig = go.Figure()

fig.add_trace(go.Bar(name='All',x=t.Year, y=t.All, offsetgroup=0))
fig.add_trace(go.Bar(name='Instrumental 0',x=t.Year, y=t.instrumentalzero, offsetgroup=0, text=t.instrumentalzero, marker_color = '#204887'))
fig.update_layout(title_text="All Tracks vs Tracks that are 0 Instrumental",
                  yaxis_title='# of tracks',
                  xaxis_title='Album Release Year',
                  paper_bgcolor="LightSteelBlue",
                  margin=dict(l=100, r=150, t=100, b=20)
)

The chart above shows tracks by release year, comparing all tracks against those that are sung almost all the way through, i.e. Instrumental = 0. It's almost an even split between the two, which means there's a good collection of tracks with instrumental content. You can also see that tracks released in 2017 have the highest count. I think 2017 was a great year for music, and it definitely gave us a lot of artists who continued to release good music over the years.

The chart below makes the same comparison, but based on the year the track was added to the playlist. Once again the split is pretty even: about half the tracks are sung throughout, while the other half have some instrumental content. An interesting thing to note here is that I added more tracks in 2021 & 2022, but most of those tracks are from the 2017 - 2022 period.

In [66]:
# Tracks with 0 Instrumental vs all Tracks as added
d1 = playlist.Added_At.dt.strftime("%Y").value_counts()
d2 = playlist[playlist.Instrumental==0].Added_At.dt.strftime("%Y").value_counts()
# Align the Instrumental == 0 counts to the years in d1; years with no such tracks get 0
vals = d2.reindex(d1.index, fill_value=0).values
t = pd.DataFrame({'Year':d1.index,'All':d1.values,'instrumentalzero':vals}).sort_values('Year').query("Year != '2018'")
fig = go.Figure()

fig.add_trace(go.Bar(name='All',x=t.Year, y=t.All, offsetgroup=0))
fig.add_trace(go.Bar(name='Instrumental 0',x=t.Year, y=t.instrumentalzero, offsetgroup=0, text=t.instrumentalzero, marker_color = '#204887'))
fig.update_layout(title_text="Total Tracks vs Tracks that are 0 Instrumental",
                  yaxis_title='# of tracks',
                  xaxis_title='Year track was added to playlist',
                  paper_bgcolor="LightSteelBlue",
                  margin=dict(l=100, r=150, t=100, b=20)
)
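The "pretty even split" is easier to judge as a number than from bar heights. Here's a minimal sketch of the calculation on made-up rows; `zero_instrumental_share` is a hypothetical helper and the sample values are not taken from the playlist.

```python
from collections import Counter


def zero_instrumental_share(tracks):
    """tracks: iterable of (year, instrumental_score) pairs.

    Returns {year: (zero_count, total, share)} where share is the fraction
    of that year's tracks with an Instrumental score of 0.
    """
    totals, zeros = Counter(), Counter()
    for year, score in tracks:
        totals[year] += 1
        if score == 0:
            zeros[year] += 1
    return {
        year: (zeros[year], totals[year], round(zeros[year] / totals[year], 2))
        for year in sorted(totals)
    }


# Made-up rows for illustration only
sample = [(2021, 0), (2021, 89), (2021, 0), (2022, 79), (2022, 0)]
shares = zero_instrumental_share(sample)
print(shares)
```

In the notebook itself, something along the lines of `playlist.Instrumental.eq(0).groupby(playlist.Added_At.dt.year).mean()` should give the same per-year shares without the loop.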

In [67]:
t = (playlist
[playlist.Album_Date.dt.year.isin([2017,2018,2019,2020,2021,2022])]
 .assign(release_year=playlist.Album_Date.dt.year,
         added_year=playlist.Added_At.dt.year)
 .groupby(['release_year','added_year'])
 .size()
 .rename('song_count')
 .reset_index()
)

data = []
for year in t['release_year'].unique():
    df_year = t[t['release_year'] == year]
    data.append(go.Bar(x=df_year['added_year'], y=df_year['song_count'], name=str(year), text=df_year['song_count']))
layout = go.Layout(title='Track Count by Release Year and Added Year', 
                   xaxis_title='Added Year', title_x=0.5,
                   yaxis_title='Track Count',
                   paper_bgcolor="LightSteelBlue",
                   margin=dict(l=100, r=150, t=100, b=20)
            )
fig = go.Figure(data=data, layout=layout)

fig.show()

This chart shows when tracks released during 2017 - 2022 were added to the playlist. Again, 2017 was a dope year and I've added tracks from that year every year.

### Record Labels

I follow record labels as much as the artists and their music because finding a good label is like finding a musical gold mine. This works especially well for smaller labels since they are usually picky, or you could say picky artists find those labels. Bigger labels, like Universal, Sony/RCA, etc., are a good indicator of artists who've made it in terms of commercial success. In the industry, it's very common for a track to get signed to multiple labels over time, usually moving from smaller to bigger ones.

For instance, [Fury's Laughter by S.A.M.](https://youtu.be/YC81VY4VE1Q?t=125) was initially a PIV record but as it was gaining popularity, later it was signed to Spinnin' Records. PIV is like the high fashion niche record label and Spinnin is like Zara, it's cool & everywhere but stylistically not that rare i.e. commercial.

For my playlist analysis, I want to see which Record Labels are the majority as well as the year over year changes.

In [68]:
# Of the 676 songs, there are 439 Record Labels
playlist['Record_Label'].nunique()

439

In [69]:
# To narrow down the number, I'm only going to consider Record Labels that have more than 5 songs
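# Filter the counts directly; this does not depend on how value_counts().reset_index() names its columns
label_counts = playlist['Record_Label'].value_counts()
recLabels = label_counts[label_counts > 5].index.tolist()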
t = (playlist[playlist['Record_Label'].isin(recLabels)]
 .assign(Added_Year=playlist.Added_At.dt.strftime("%Y"))
 [['Record_Label','Added_Year','Song']]
 .groupby(['Added_Year','Record_Label'])['Song']
 .count().reset_index()
 )

px.bar(x=t.Added_Year, y=t.Song, color=t.Record_Label).update_xaxes(type='category').update_layout(
                  title_text="Record Label distribution by year",
                  title_x=0.5,
                  yaxis_title='Track Count',
                  xaxis_title='Track Added Year',
                  paper_bgcolor="LightSteelBlue",
                  height=600,
                  margin=dict(l=100, r=150, t=100, b=20)
)

In [70]:
playlist[playlist.Record_Label.str.contains("Columbia")].sample(3)

Unnamed: 0,Song,Artist,Popularity,BPM,Genres,Parent_Genres,Album,Album_Date,Time,Dance,...,Live,Loud,Key,Time_Signature,Added_At,Record_Label,Camelot,Country,Genre,Parent_Genre
571,Más Que Amigos,Matisse,28,92,"latin arena pop, latin pop, mexican pop, urban...",Latin,Sube (Summer Edition),2016-07-08,2:57,64,...,0,-6,D Major,4,2022-10-22,Columbia,10B,{'Matisse': 'France'},latin arena pop,Latin
80,1 Thing,Amerie,60,130,"contemporary r&b, dance pop, hip hop, hip pop,...","Pop, Hip Hop, R&B",Touch,2005-01-01,3:58,61,...,0,-3,A#/B♭ Minor,5,2020-06-27,Richcraft/Sony Urban Music/Columbia,3A,{'Amerie': 'U.S.'},contemporary r&b,Pop
519,PIENSO EN TU MIRÁ - Cap.3: Celos,ROSALÍA,47,165,r&b en espanol,R&B,PIENSO EN TU MIRÁ (Cap.3: Celos),2018-07-24,3:13,66,...,0,-6,G#/A♭ Major,3,2022-08-02,Columbia,4B,{'ROSALÍA': 'N/A'},r&b en espanol,R&B


Since this playlist contains songs of all genres and from different time periods, I want to see what are some of the earliest & latest tracks from each Record Label. This should give a good idea about the Record Label's ongoing activity in the context of this playlist. 

In [71]:
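# list of record labels with more than 3 songs
# Filter the counts directly; this does not depend on how value_counts().reset_index() names its columns
label_counts = playlist['Record_Label'].value_counts()
recLabels = label_counts[label_counts > 3].index.tolist()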
(playlist[playlist['Record_Label'].isin(recLabels)]
 .groupby('Record_Label').agg({'Album_Date':['min','max']}))

Record_Label,Album_Date min,Album_Date max
A&M,2003-06-24,2003-06-24
Atlantic Records,2018-02-16,2020-08-14
Canvasback/ATL,2011-09-02,2021-10-27
Capitol Records,2004-01-01,2022-08-19
Columbia,1999-11-04,2022-07-01
Domino Recording Co,2006-02-21,2018-08-24
Fueled By Ramen,2012-02-21,2017-05-12
Geffen,1991-09-26,2008-01-01
Glassnote Entertainment Group LLC,2009-05-25,2022-09-07
Island Records,1995-01-01,2022-05-13


Looking at the table above, it's not surprising to see that the big labels have the widest gap between their earliest and latest releases. See Columbia, Geffen, Island Records, Matador, Virgin Records and Warner Records. While I was aware of some of the big names here, I'm learning about Geffen, Island Records and Matador for the first time.

After some research, I found that these old-timey labels have always been big and have continued to flourish as the artists they signed stayed successful. It's a win-win situation. Also, the newer artists they sign are usually already doing well commercially.

## Country Analysis

I'll preface this by talking about markets. In the music industry, there are the major markets: US, Europe & Japan. Then there's trigger markets which are basically the major cities in South East Asia & South America. If a song is doing well in the major markets, it's pretty much a global success. If a song is unknown in the major markets but is trending in one of the trigger markets, then it's missing either a global distributor or global platform. In other words, that song has a great potential to be a global success.

I have two graphs to visualize the country of origin for the artists.

In [72]:
artist_country_dict = {}
for x in playlist.Country:
    artist_country_dict.update(ast.literal_eval(x))

artist_country_df = pd.DataFrame({'Artist':artist_country_dict.keys(),'Country':artist_country_dict.values()})

px.bar(artist_country_df.Country.value_counts(), text_auto=True).update_layout(
                  title_text="Artist Country of Origin",
                  yaxis_title='# of Artists',
                  xaxis_title='Country',
                  paper_bgcolor="LightSteelBlue",
                  margin=dict(l=100, r=150, t=100, b=20)
                ).update_traces(textangle=0)

From the first graph it's clear that nearly half of my playlist has no country information. Many of the artists in this playlist are relatively underground, so they don't have much of a public profile, i.e. a detailed Wikipedia page. That's one reason why there are a lot of missing country values. Another issue with scraped data is inconsistency: an artist might be labeled as being from US, U.S. or America.

So in the second graph, I removed the N/A artists and consolidated all the US variations into one value. It's no surprise that I listen to a lot of American artists since it's Hollywood, baby.

In [73]:
t = artist_country_df.assign(country_clean = np.where(artist_country_df.Country.isin(['U.S.', 'US', 'United States', 'United States of America']), 'United States', artist_country_df.Country))

px.bar(t[~artist_country_df.Country.isin(['N/A'])].country_clean.value_counts(), text_auto=True).update_layout(
                  title_text="Artist Country of Origin",
                  yaxis_title='# of Artists',
                  xaxis_title='Country',
                  paper_bgcolor="LightSteelBlue",
                  margin=dict(l=100, r=150, t=100, b=20)
                ).update_traces(textangle=0)

I listen primarily to English-speaking artists, so I guess it makes sense that US artists are the majority. The media platforms where I find my music are also American, so there's that as well.
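The clean-up described above (parse each Country cell, drop the N/A entries, merge the spelling variants) can also be written with a lookup table, which is easier to extend when new variants turn up. This is a sketch, not the notebook's method: the alias table is my assumption, `count_artist_countries` is a hypothetical helper, and the sample cells are copied from rows shown earlier.

```python
import ast
from collections import Counter

# Assumed alias table; extend it with whatever spellings show up in your data
COUNTRY_ALIASES = {
    "U.S.": "United States",
    "US": "United States",
    "America": "United States",
    "United States of America": "United States",
    "UK": "United Kingdom",
}


def count_artist_countries(country_cells, aliases=COUNTRY_ALIASES, missing="N/A"):
    """Count artists per country from cells like "{'Artist': 'Country'}"."""
    artists = {}
    for cell in country_cells:
        # Each cell is a dict literal; repeated artists overwrite themselves,
        # so every artist is counted once
        artists.update(ast.literal_eval(cell))
    return Counter(
        aliases.get(country, country)
        for country in artists.values()
        if country != missing
    )


cells = [
    "{'Lykke Li': 'Sweden'}",
    "{'MICHELLE': 'United States of America'}",
    "{'Amerie': 'U.S.'}",
    "{'Floating Points': 'UK'}",
    "{'Michel Degen': 'N/A', 'Mikhu': 'N/A'}",
    "{'Amerie': 'U.S.'}",  # same artist on a second track
]
counts = count_artist_countries(cells)
print(counts.most_common())
```

Mapping `UK` to `United Kingdom` is just an example of a second alias group; the notebook only consolidates the US variants.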

## BPM & Key Analysis

As someone who does music production, I personally love and work in the range of 80 - 110 BPM. Anything above 110 pretty much turns into a house track. With that said, I'm guessing most of the songs will have their BPM in that 80 - 110 range. Any 110+ BPM songs should be the Dance tracks. It's a little tricky because tempo can be reported at double or half the speed you actually feel: songs that are 60 - 75 BPM could also be labelled as 120 - 150 BPM, and the same goes for tracks with 150+ BPM, which are usually just 2x values.

As for the key, it will be interesting to see what patterns are there, if any.
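The half-time/double-time ambiguity mentioned above can be handled by folding every tempo into a single window before comparing tracks. Here's a minimal sketch; `fold_bpm` is a hypothetical helper, and the 75 to 150 window is my assumption based on the ranges I described, not something Spotify or Chosic defines.

```python
def fold_bpm(bpm, low=75, high=150):
    """Halve or double a tempo until it lands in the window [low, high)."""
    if bpm <= 0:
        raise ValueError("bpm must be positive")
    if high < 2 * low:
        raise ValueError("the window must cover a full doubling (high >= 2 * low)")
    while bpm >= high:
        bpm = bpm / 2
    while bpm < low:
        bpm = bpm * 2
    return bpm


print(fold_bpm(204))  # the fastest listed tempo folds down to 102.0
print(fold_bpm(63))   # the slowest listed tempo folds up to 126
print(fold_bpm(120))  # already inside the window, stays 120
```

This is a heuristic: it will also halve a track that really is that fast, like a Drum & Bass tune, so it's worth checking the outliers by ear.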

In [74]:
playlist['BPM'].describe()

count    676.000000
mean     120.544379
std       27.534374
min       63.000000
25%      100.000000
50%      119.000000
75%      132.000000
max      204.000000
Name: BPM, dtype: float64

In [75]:
playlist['BPM'].median()

119.0

In [76]:
px.bar(playlist.BPM.value_counts()).update_layout(
                  title_text="Tempo of songs", title_x=0.5,
                  yaxis_title='Track Count',
                  xaxis_title='BPM',
                  paper_bgcolor="LightSteelBlue",
                  margin=dict(l=100, r=150, t=100, b=20)
                )

In [77]:
playlist[playlist.BPM==120].sample(3)

Unnamed: 0,Song,Artist,Popularity,BPM,Genres,Parent_Genres,Album,Album_Date,Time,Dance,...,Live,Loud,Key,Time_Signature,Added_At,Record_Label,Camelot,Country,Genre,Parent_Genre
595,I Deserve to Bleed,Sushi Soucy,65,120,"indie pop, pixel",Rock,I Deserve to Bleed,2020-12-27,1:44,76,...,10,-12,G Major,4,2022-10-29,Sushi Soucy,9B,{'Sushi Soucy': 'N/A'},indie pop,Rock
555,The Coffin Was So Light I Thought It Might Flo...,Eiafuawn,43,120,"dream pop, lo-fi, lo-fi indie",Rock,Birds In The Ground,2006-01-01,3:51,77,...,30,-15,F#/G♭ Minor,4,2022-09-17,Numero Group,11A,{'Eiafuawn': 'N/A'},dream pop,Rock
54,Dazed (feat. Gabrielle & Geoffroy),"Men I Trust,Geoffroy,Gabrielle",44,120,"indie pop, indie quebecois,","Rock, Blues",Men I Trust,2014-05-28,3:52,84,...,10,-11,F#/G♭ Minor,4,2020-01-20,Indie,11A,"{'Men I Trust': 'Canada', 'Geoffroy': 'N/A', '...",indie pop,Rock


In [78]:
playlist[playlist.BPM==120][['Artist','Song','BPM','Parent_Genre','Genre']].sample(3)

Unnamed: 0,Artist,Song,BPM,Parent_Genre,Genre
559,"Kosmo Kint,Atjazz",Too Big - Atjazz Remix,120,Dance/Electronic,afro house
633,"Fabich,Pastel,Jafunk,Bambie",Ecstasy,120,R&B,indie soul
467,347aidan,Dancing in My Room,120,Hip Hop,sad rap


In [79]:
playlist.groupby(['BPM','Key']).size().rename('Count').reset_index().query('Count > 3')

Unnamed: 0,BPM,Key,Count
749,100,B Minor,4
764,100,G Major,4
1230,120,C Major,4
1239,120,E Minor,4
1244,120,G Major,4
1290,122,F#/G♭ Major,4
1316,123,G Major,4


In [80]:
px.bar(playlist['Key'].value_counts()).update_layout(
                  title_text="Song Key",
                  yaxis_title='Track Count',
                  xaxis_title='Key',
                  paper_bgcolor="LightSteelBlue",
                  margin=dict(l=100, r=150, t=100, b=20)
                )

### BPM & Key Summary

Most tracks sit between roughly 70 and 130 BPM, with the middle half falling between 100 and 132 BPM. As expected, the higher BPMs are just 2x numbers except for that one Drum & Bass song, [SpectraSoul - Push & Pull](https://youtu.be/LdYeEtVfGnk?t=47). There are also very few songs that share the same BPM and Key, the max count being 4.

As for the Key, I was surprised to see that D Minor has the lowest count. I love to play in D minor since it's the easiest minor key to play on the piano and it has a very dark tone to it. As for the highest count of G Major, I'm going to guess it's because most guitar players use standard tuning (also called standard E tuning), where E Minor chords come naturally, and G Major is the relative major of E Minor, so the two keys share the same notes.
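The link between G Major and E Minor is also what the Camelot column encodes: a major key and its relative minor share the same number, with B for major and A for minor. Here's a small sketch of both ideas. The function names are hypothetical, the note names are written in plain ASCII rather than the way the dataset spells them, and the formula is my own reconstruction that I checked against the Key and Camelot pairs visible in the sample rows above (for example G Major = 9B and F#/G♭ Minor = 11A).

```python
NOTES = ["C", "C#/Db", "D", "D#/Eb", "E", "F", "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]


def relative_minor(major_tonic):
    """The relative minor starts three semitones below the major tonic."""
    return NOTES[(NOTES.index(major_tonic) - 3) % 12]


def camelot(tonic, mode):
    """Camelot code for a key, e.g. camelot("G", "Major") -> "9B"."""
    if mode not in ("Major", "Minor"):
        raise ValueError("mode must be 'Major' or 'Minor'")
    pitch = NOTES.index(tonic)
    if mode == "Minor":
        pitch = (pitch + 3) % 12  # switch to the relative major
    # Moving up a perfect fifth (7 semitones) adds 1 to the number; C Major is 8
    number = (pitch * 7 + 7) % 12 + 1
    return f"{number}{'B' if mode == 'Major' else 'A'}"


print(relative_minor("G"))     # E
print(camelot("G", "Major"))   # 9B
print(camelot("E", "Minor"))   # 9A
```

Since G Major and E Minor land on 9B and 9A, a guitar-driven playlist leaning on E Minor shapes would naturally show up around that spot on the wheel.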

### Spotify Features Analysis

Spotify generates a set of audio features: Dance, Energy, Acoustic, Instrumental, Happy, Speech, Live & Loud. I've already worked with Instrumental, so here I'll work on Happy, Dance & Energy.

Definitions, paraphrased from the Spotify docs (the API reports these on a 0.0 to 1.0 scale, while the values in this dataset run from 0 to 100):
* Happy - Spotify calls this valence. It's a linear scale of 0 to 100, with 100 being very happy and 0 being sad/negative.
* Dance - It's a scale of 0 to 100 that describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.
* Energy - Again a scale of 0 to 100 that represents a perceptual measure of intensity and activity.
* Acoustic - A confidence score of 0 to 100 for whether a track is acoustic, i.e. made with real instruments rather than electronic sounds.

[Spotify Docs Reference](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features)

In [88]:
scatter = go.Scatter(
    x=playlist.Dance,
    y=playlist.Energy,
    mode='markers',
    marker_color=playlist.Happy,
    marker=dict(
        colorscale='teal',
        showscale=True,
        line_width=1,
        size=10,
        colorbar={'title': 'Happy'}
    ),
    hovertext=['Artist: {}<br>Song: {}<br>Dance: {}<br>Energy: {}<br>Happy: {}'\
               .format(a, s, d, e, h)\
                for a ,s, d, e, h in zip(playlist.Artist, playlist.Song, playlist.Dance, playlist.Energy, playlist.Happy)],
)

rect = {
    'type': 'rect',
    'xref': 'x',
    'yref': 'y',
    'x0': playlist.Dance.max()/2,
    'y0': playlist.Energy.max()/2,
    'x1': playlist.Dance.max()+3,
    'y1': playlist.Energy.max()+5,
    'line': {'color': '#0b5685', 'width': 2},
    'opacity': 0.5,
    'fillcolor': 'rgba(250, 220, 60, 0.1)',
}

text = {
    'x': playlist.Dance.max() * 0.95,
    'y': playlist.Energy.max(),
    'text':  'Happy Quadrant',
    'showarrow': False,
    'font': {
        'size': 30,
        'color': 'black',
        'family': 'Balto',
    }
}

layout = go.Layout(shapes=[rect], annotations=[text])

fig = go.Figure(data=[scatter], layout=layout)

fig.update_layout(
                  title_text="Energy vs Dance vs Happiness Distribution",title_x=0.5,
                  yaxis_title='Energy',
                  xaxis_title='Dance',
                  plot_bgcolor="#edf5fa",
                  paper_bgcolor="LightSteelBlue",
                  height=600,
                  margin=dict(l=100, r=150, t=80, b=20)
                )
fig.show()


In [82]:
playlist\
    [(playlist.Energy < 30) & (playlist.Dance > 80)]\
    [['Song','Artist','Energy','Dance','Happy']]

Unnamed: 0,Song,Artist,Energy,Dance,Happy
264,Cristal (feat. BxRod),"Cráneo,Made in M,BxRod",26,82,34
296,No! No! No! No!,Axel Boman,29,94,40
434,Tears - Original Mix,HNNY,28,88,48
667,Kiss,Prince,27,90,74


### Energy Dance Happy Summary

Looking at the plot, you can see that most happy tracks have higher Dance & Energy values. It makes sense that the three are related, since it's hard to dance while being sad and without energy.

However, there are a few tracks that have low energy & a low happy score but high danceability. And I love these tracks. They are musically very pleasing. These are dance music tracks that use gloomy instruments & chords and have that very underground vibe. They're also signed to record labels that are pretty underground and pretty unique.

Examples:

[Axel Boman - No! No! No! No!](https://youtu.be/SaMGBT_Y_fw?t=82) - Swedish artist Axel Boman, he's an underground legend.

[HNNY - Tears](https://youtu.be/PYQngi1wU4U?t=66) - HNNY, another Swedish artist, another legend.
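The relationship I described at the start of this summary is a visual impression from the scatter plot. A correlation coefficient is one way to put a number on it. Here's a minimal sketch with a hand-written Pearson function; the scores below are made up for illustration and are not rows from the playlist.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(spread_x * spread_y)


# Made-up scores for illustration only
dance = [40, 55, 70, 85]
energy = [35, 60, 65, 90]
happy = [20, 45, 60, 80]

print(round(pearson(dance, happy), 2))
print(round(pearson(energy, happy), 2))
```

In the notebook, `playlist[['Dance','Energy','Happy']].corr()` should produce the same kind of numbers for the real data in one line.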

## Summary

That concludes my analysis of my Spotify playlist. I hope you enjoyed the visualizations as well as the music from my playlist. Feel free to use the code in here for your own projects as well. Thanks.

### How to replicate this project

* Generate & download the csv of your playlist using the [Chosic Spotify Playlist Analyzer](https://www.chosic.com/spotify-playlist-analyzer/)
* Download the [playlist artist location](https://github.com/dekaghub/Data-Projects-Deka/raw/main/Python%20Notebooks/Playlist_Artist_Location.ipynb) notebook
* Load your playlist csv to both the Playlist Analysis & Artist Location notebooks
* Run all cells and have fun tinkering with the analysis