In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import plotly.express as px


The dataset I chose is the Top Songs on Spotify by year from 2010 to 2019. The data originally comes from Billboard, and each year has a different number of top songs. I chose this dataset because I am passionate about music and wanted to analyze trends such as which genres are most popular in each year. Other attributes are also provided for each song, such as beats per minute, energy, and danceability. I wanted to analyze a few of these, specifically energy, to understand how release year and genre affect the perceived energy level of these songs. I will use a histogram, a sunburst plot, a box plot, and a treemap to tell a story with this dataset.




In [None]:
from google.colab import files
upload = files.upload()

Saving top10s.csv to top10s (1).csv


In [None]:
df = pd.read_csv("top10s.csv", encoding='unicode_escape')
df.head(15)

Unnamed: 0.1,Unnamed: 0,title,artist,top genre,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop
0,1,"Hey, Soul Sister",Train,neo mellow,2010,97,89,67,-4,8,80,217,19,4,83
1,2,Love The Way You Lie,Eminem,detroit hip hop,2010,87,93,75,-5,52,64,263,24,23,82
2,3,TiK ToK,Kesha,dance pop,2010,120,84,76,-3,29,71,200,10,14,80
3,4,Bad Romance,Lady Gaga,dance pop,2010,119,92,70,-4,8,71,295,0,4,79
4,5,Just the Way You Are,Bruno Mars,pop,2010,109,84,64,-5,9,43,221,2,4,78
5,6,Baby,Justin Bieber,canadian pop,2010,65,86,73,-5,11,54,214,4,14,77
6,7,Dynamite,Taio Cruz,dance pop,2010,120,78,75,-4,4,82,203,0,9,77
7,8,Secrets,OneRepublic,dance pop,2010,148,76,52,-6,12,38,225,7,4,77
8,9,Empire State of Mind (Part II) Broken Down,Alicia Keys,hip pop,2010,93,37,48,-8,12,14,216,74,3,76
9,10,Only Girl (In The World),Rihanna,barbadian pop,2010,126,72,79,-4,7,61,235,13,4,73


In [None]:
df.describe()  # call the method (with parentheses) to get summary statistics for the numeric columns

In [None]:
df.shape

(603, 15)

In [None]:
df.drop(['Unnamed: 0'], axis=1, inplace=True)

In [None]:
df.shape

(603, 14)

In [None]:
fig=px.histogram(df, x="top genre", nbins=30, histnorm="probability")
fig.update_xaxes(categoryorder="total ascending")
fig.show()


This histogram shows the share of all 603 songs that each genre makes up. For example, over 50% of the songs are classified as dance pop. Although there are many other genres, none of them accounts for more than 10%. I chose to plot shares of the total instead of raw counts because dance pop has so many more songs than any other genre that the other bars were even less visible. Overall, the histogram shows that dance pop dominated the decade more than any other genre on this list.
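The shares behind a histogram like this can also be checked numerically. The sketch below is a minimal illustration that uses only the genres of the ten rows displayed by `df.head()` above (the name `sample` is introduced here for illustration); on the full data the same call would be `df["top genre"].value_counts(normalize=True)`.

```python
import pandas as pd

# Genres of the ten 2010 rows displayed by df.head() above.
# `sample` is an illustrative stand-in for the full 603-row df.
sample = pd.DataFrame({
    "top genre": [
        "neo mellow", "detroit hip hop", "dance pop", "dance pop", "pop",
        "canadian pop", "dance pop", "dance pop", "hip pop", "barbadian pop",
    ],
})

# normalize=True returns fractions of the total instead of raw counts,
# which is the same quantity histnorm="probability" plots.
genre_share = sample["top genre"].value_counts(normalize=True)
print(genre_share.head(3))
```

Multiplying `genre_share` by 100 gives percentages if those are easier to read.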

In [None]:
fig = px.sunburst(df,
                  path=["year", "top genre", "title"],
                  title="2010-2019 Spotify Top Hits Genre Popularity by Year [Year, top genre, title]",
                  width=1000, height=1000)

fig.show()

I chose a sunburst plot here to build on the previous histogram and show the distribution of genres within each year. It is a good way to see how popular each genre is in each individual year, and also how much each year contributes to the total number of songs. For example, 2015 dance pop catches my eye, as it clearly contains more songs than any other genre. The only problem I found with this graph is that I could not find a way to get the years into a better order.

In [None]:
## color by year so each year's box is easy to tell apart
fig = px.box(df, y="nrgy", x="year", color="year")
fig.show()

Next, I wanted to explore the energy (`nrgy`) feature a little more and see whether there was any trend in typical energy levels as the years progressed. I chose a boxplot so I could see the median energy for each year, the distribution of energy within each year, and potential outliers. There appears to be a slight negative trend: median energy starts at 82 in 2010 and moves down to about 68 in 2019. I found it interesting that, even with the rise of more produced/electronic music, median energy is roughly 14 points lower in 2019 than in 2010.
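The medians read off the boxplot can be cross-checked with a group-by. This is a minimal sketch on a toy frame: the energy values below are invented for illustration and are not taken from the dataset. On the real data the same line would be `df.groupby("year")["nrgy"].median()`.

```python
import pandas as pd

# Toy stand-in for df; these nrgy values are made up for illustration only.
toy = pd.DataFrame({
    "year": [2010, 2010, 2010, 2019, 2019, 2019],
    "nrgy": [89, 93, 84, 60, 70, 65],
})

# Median energy per year, i.e. the center line of each box in the boxplot.
median_by_year = toy.groupby("year")["nrgy"].median()

# Change in median between the first and last year present in the data.
change = median_by_year.iloc[-1] - median_by_year.iloc[0]
print(median_by_year)
print(change)
```

A negative `change` corresponds to the downward trend described above.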

In [None]:
import statistics
mean = statistics.mean(df.nrgy)  # mean energy across all songs
print(mean)


70.50414593698176


In [None]:
fig = px.treemap(df, path=[px.Constant("All"), "year", "top genre"],
                 color="nrgy",
                 color_continuous_scale="RdBu",
                 color_continuous_midpoint=70.5)  # overall mean energy computed above
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()

With a treemap, we can expand on what we learned from the sunburst plot and see not only the most popular genres each year, but also the average energy level for each genre within that year. Plotly Express colors each box by the average `nrgy` of the songs it contains, so the bluer a box is, the more energetic the songs in that genre and year are on average. I computed the mean energy level across all songs and set the color midpoint to this value (around 70.5) so that the colors represent the data better. For example, we can see that dance pop is almost always at or above the overall mean energy level (blue-ish), except in 2019, where it is below (red). In addition, 2010 clearly has the most genres above the overall mean, which is consistent with the high median energy for 2010 in the previous boxplot. 2018 and 2019 seem to have the most genres below the overall mean. I also wish I could have ordered the years better in this graph.
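To check which boxes should appear blue or red, the per-box averages can be computed directly. As far as I understand, when `color` is numeric Plotly Express colors each box by the average of that column over its songs (weighted by count when no `values` column is given). The sketch below uses only the ten 2010 rows displayed by `df.head()` above; the name `sample` and the `above_midpoint` column are introduced here for illustration.

```python
import pandas as pd

# year, genre, and nrgy of the ten rows displayed by df.head() above.
sample = pd.DataFrame({
    "year": [2010] * 10,
    "top genre": [
        "neo mellow", "detroit hip hop", "dance pop", "dance pop", "pop",
        "canadian pop", "dance pop", "dance pop", "hip pop", "barbadian pop",
    ],
    "nrgy": [89, 93, 84, 92, 84, 86, 78, 76, 37, 72],
})

# Mean energy and song count for each (year, genre) box.
group_mean = sample.groupby(["year", "top genre"])["nrgy"].agg(["mean", "count"])

# Compare each box with the treemap's color midpoint (overall mean, about 70.5).
group_mean["above_midpoint"] = group_mean["mean"] > 70.5
print(group_mean)
```

Boxes with `above_midpoint` set to `True` should fall on the blue side of the `RdBu` scale, and the rest on the red side.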