# Top hits from Spotify
An analysis of the top 2000 Spotify tracks from 2000 to 2019, based on a Kaggle dataset which you can find [here](https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019).

* 2000 rows, 18 columns
* Each row represents a different Hit Song
* Used columns:
    * artist: name of the artist
    * song: name of the song
    * duration_ms: duration of the song expressed in milliseconds
    * explicit: a boolean value indicating whether the song contains explicit content (content considered unsuitable for children)
    * year: release year of the song
    * popularity: the popularity of the song on a scale from 0 to 100 (89 is the highest value encountered in our dataset)
    * danceability: the danceability of the song on a scale from 0.0 to 1.0
    * energy: perceptual intensity and activity on a scale from 0.0 to 1.0
    * key: the key the song is in based on the Pitch Class notation. If there is no key, the value is -1.
        * {C = 0, C-sharp = 1, D = 2, D-sharp = 3, E = 4, F = 5, F-sharp = 6, G = 7, G-sharp = 8, A = 9, B-flat = 10 (T), B = 11 (E)}
    * loudness: overall loudness of a song in decibels (dB) on a range between -60 and 0
    * mode: modality of the song (1 - major; 0 - minor)
    * speechiness: represents the presence of spoken words in a song.
        * 0 - 0.33 -> music and other non-speech-like songs
        * 0.33 - 0.66 -> may contain both music and speech (e.g: rap music)
        * 0.66 - 1 -> made entirely of spoken words
    * acousticness: confidence measure of whether the song is acoustic on a scale from 0.0 to 1.0 
    * instrumentalness: prediction of whether the song contains no vocals, on a scale from 0.0 to 1.0.
        * Words count as vocals, but sounds such as "ooh" or "aah" do not. The closer the value is to 1.0, the more likely the song is instrumental.
    * liveness: presence of an audience in the recording on a scale from 0.0 to 1.0. 
    * valence: a measure from 0.0 to 1.0 representing the musical positiveness conveyed by a song. 
        * High values sound more positive (e.g. happy, cheerful, euphoric). Low values sound more negative (e.g. sad, depressed, angry)
    * tempo: overall estimated tempo of a song in BPM (beats per minute).
    * genre: genre of a song. There could be more than one per song
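
The `key` encoding and the `speechiness` ranges listed above can be turned into readable labels. The following is a minimal sketch; the helper names `key_name` and `speechiness_band` are hypothetical and are not used elsewhere in this notebook.

```python
# Pitch Class notation used by the "key" column; -1 means no key was detected.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "Bb", "B"]


def key_name(key: int) -> str:
    """Return the note name for a pitch-class integer (hypothetical helper)."""
    if key == -1:
        return "no key"
    if not 0 <= key <= 11:
        raise ValueError(f"key must be -1 or in 0..11, got {key}")
    return PITCH_CLASSES[key]


def speechiness_band(value: float) -> str:
    """Classify a speechiness score using the three ranges listed above."""
    if value < 0.33:
        return "music"
    if value <= 0.66:
        return "music and speech"
    return "spoken word"


print(key_name(9), "/", speechiness_band(0.107))
```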

We analyze the following aspects:
- Number of tracks per artist
- Artist popularity
- Popularity by number of songs
- Top genres
- Popularity by genre
- Technical mean values by genre (for each one of them)
- Duration of the most popular songs
- Duration of songs throughout the years

We expect pop to be the most popular genre throughout the years, the duration of songs to decrease over time, and classical music to be less present and less popular than the other genres.
We also expect the most popular songs to last between 2 and 5 minutes, since listeners get bored easily with long songs and are less likely to finish or replay them.
Finally, we expect a high mean acousticness for classical and jazz music, the highest mean danceability and energy for R&B, hip hop, latin and pop music, and the highest mean share of explicit content for hip hop music.

## Load dataset

In [52]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

In [53]:
data = pd.read_csv("songs_normalize.csv")
data.sample(5)

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
1653,Ariana Grande,Into You,244453,False,2016,3,0.623,0.734,9,-5.948,1,0.107,0.0162,2e-06,0.145,0.37,107.853,pop
145,Jennifer Lopez,Love Don't Cost a Thing,221226,False,2001,67,0.786,0.842,4,-5.115,0,0.0707,0.00305,4e-06,0.473,0.685,97.577,"hip hop, pop, R&B"
138,Ricky Martin,Nobody Wants to Be Lonely (with Christina Agui...,252706,False,2008,52,0.635,0.854,10,-5.02,0,0.0612,0.00579,0.0083,0.0623,0.59,100.851,"pop, latin"
1963,J. Cole,MIDDLE CHILD,213593,True,2019,80,0.837,0.364,8,-11.713,1,0.276,0.149,0.0,0.271,0.463,123.984,hip hop
216,Las Ketchup,The Ketchup Song (Aserejé) - Spanglish Version,213973,False,2002,66,0.607,0.923,1,-6.777,1,0.0948,0.0193,1e-06,0.0924,0.868,184.819,set()


Each row in the dataset represents a different hit song.

In [54]:
print(f"Number of rows: {data.shape[0]}")
print(f"Number of columns: {data.shape[1]}")

Number of rows: 2000
Number of columns: 18


## Cleanup
### Data types
Before starting the analysis, we check whether the dataset needs any cleanup. First, we verify that the data type of each column is consistent with what the column represents.

In [55]:
data.dtypes

artist               object
song                 object
duration_ms           int64
explicit               bool
year                  int64
popularity            int64
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
genre                object
dtype: object

From what is shown, we have no reason for concern, except for the "mode" column: it has just two possible values, which represent the "minor" and "major" modes. We can replace 0 and 1 with these labels.

In [56]:
data["mode"] = data["mode"].map({0: "minor", 1: "major"})
data["mode"]


0       minor
1       major
2       major
3       minor
4       minor
        ...  
1995    minor
1996    major
1997    major
1998    major
1999    major
Name: mode, Length: 2000, dtype: object

It would also be better to express the duration in minutes rather than milliseconds, which makes it easier to read.

In [57]:
data["duration_ms"] = round(data["duration_ms"] / (6*10**4), 2)
data.rename(columns={"duration_ms": "duration_min"}, inplace=True)
data.sample(5)["duration_min"]

509     3.97
1937    4.28
665     3.68
562     3.97
1308    3.87
Name: duration_min, dtype: float64

### Year outliers
Grouping the songs by year shows that the dataset also includes the years 1998, 1999 and 2020, which fall outside the 2000-2019 range it is supposed to cover.

In [58]:
artists_sum = data.groupby("year", as_index=False).count()
artists_sum[["year", "song"]]


Unnamed: 0,year,song
0,1998,1
1,1999,38
2,2000,74
3,2001,108
4,2002,90
5,2003,97
6,2004,96
7,2005,104
8,2006,95
9,2007,94


The dataset spans 23 distinct years, 3 of which are not supposed to be there, so we drop the rows belonging to those years.

In [59]:
data = data.drop(data[data.year < 2000].index)
data = data.drop(data[data.year == 2020].index)
data.shape[0]

1958

In [60]:
artists_sum = data.groupby("year", as_index=False).count()
artists_sum[["year", "song"]]


Unnamed: 0,year,song
0,2000,74
1,2001,108
2,2002,90
3,2003,97
4,2004,96
5,2005,104
6,2006,95
7,2007,94
8,2008,97
9,2009,84


### Rescaling
Some of the columns in the dataset are supposed to lie on a fixed scale, but the values do not always cover it: in some columns the largest value is far below the top of the scale. For this reason, and because it makes the columns easier to work with later, we rescale them.

The function below, given a column name and a target maximum, rescales the values of that column so that its largest value equals the target maximum.

In [61]:
def rescale_col(column_name: str, max_value):
    coeff = data[column_name].max()
    data[column_name] = (data[column_name] / coeff) * max_value

For example, applying this function to the "popularity" column rescales it as we wanted.

In [62]:
print(data["popularity"].max())
rescale_col("popularity", 100)
print(data["popularity"].max())

89
100.0
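
`rescale_col` modifies the global `data` frame in place. As a minimal sketch of the same arithmetic without side effects, the hypothetical helper `rescale_series` below returns a new Series; the input values are made up and only the maximum (89) mirrors the dataset. Note that dividing by the maximum stretches the top of the scale only, so the smallest value is not moved to any particular lower bound.

```python
import pandas as pd


def rescale_series(values: pd.Series, max_value: float) -> pd.Series:
    """Scale values so that the largest one equals max_value (hypothetical helper)."""
    return values / values.max() * max_value


# Toy popularity scores, not taken from the dataset
toy = pd.Series([0, 44.5, 89])
rescaled = rescale_series(toy, 100)
print(rescaled.tolist())  # [0.0, 50.0, 100.0]
```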


Now that we have seen that the function works, we can also apply it to other columns, such as "acousticness", "danceability" and "instrumentalness".

In [63]:
rescale_col("acousticness", 1.0)
rescale_col("danceability", 1.0)
rescale_col("instrumentalness", 1.0)


### set() in genre column
Some of the rows in the dataset contain the value "set()" as their genre.

In [64]:
columns = ["artist", "song", "genre"]
data[data["genre"] == "set()"][columns].sample(3)


Unnamed: 0,artist,song,genre
481,Eamon,Fuck It (I Don't Want You Back),set()
949,A.R. Rahman,Jai Ho! (You Are My Destiny),set()
1487,Natalie La Rose,Somebody,set()


Since we have no way to interpret "set()", it is better to replace it with "N/A".

In [65]:
data["genre"] = data["genre"].replace("set()", "N/A")
data[data["genre"] == "N/A"][columns].sample(3)

Unnamed: 0,artist,song,genre
1406,MAGIC!,Rude,
291,Blazin' Squad,Crossroads - Radio Edit,
455,DJ Casper,Cha Cha Slide - Hardino Mix,


## Number of tracks per artist
The graph below shows how many songs each of the 50 most present artists has placed in the Spotify hit charts over the years.

In [66]:
artists_sum = data.groupby("artist", as_index=False).count().sort_values(by="song", ascending=False).head(50)
fig = px.bar(artists_sum, x="song", y="artist", color="artist",
             width=1300, height=1000, title="Most present artists")
fig.update_layout(font_size=12)
fig.show()


## Artists popularity
Now that we've seen the most present artists in the Spotify charts, it would be interesting to look at their average popularity index.

In [67]:
top_artists = artists_sum["artist"]
popularity_means = dict()
for artist in top_artists:
    popularity_means[artist] = round(data[data["artist"] == artist]["popularity"].mean())
artists = list(popularity_means.keys())
popularities = list(popularity_means.values())
plot_dict = {"artist": artists,
             "popularity": popularities}
fig = px.bar(plot_dict, x="popularity", y="artist", color="artist",
             width=1300, height=1000, title="Average popularity of top artists")
fig.update_layout(font_size=12)
fig.show()
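
The per-artist loop above could also be written as a single `groupby`. The following is a minimal sketch on a toy frame (artist names and scores are made up), assuming the same `artist` and `popularity` column names as the dataset; it is an alternative formulation, not the code used for the figure.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "artist": ["A", "A", "B", "B", "B", "C"],
    "popularity": [60, 80, 50, 70, 90, 40],
})

# One pass: number of songs and mean popularity per artist
summary = (
    toy.groupby("artist", as_index=False)
       .agg(songs=("popularity", "size"), popularity=("popularity", "mean"))
       .sort_values("songs", ascending=False)
)
print(summary)
```

Computing both columns at once also avoids the separate merge step when the two quantities are compared.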


## Popularity by number of songs
The two previous figures raise a question: how are an artist's popularity and total number of songs related? Let's find out.

In [68]:
artists_sum = pd.DataFrame(artists_sum.drop(artists_sum.columns.values[2:], axis=1))
artists_popularities = pd.DataFrame(plot_dict)
df_artists_merged = pd.merge(artists_sum, artists_popularities, on="artist")
fig = px.scatter(df_artists_merged, x="song",
                 y="popularity", trendline="ols", width=800, height=600)
fig.show()


Judging by the graph, we can answer the question raised above: the average popularity of an artist tends to grow with the number of their hit songs.
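
A trend read off a scatter plot can also be summarized with a correlation coefficient. The following is a minimal sketch, assuming a frame shaped like `df_artists_merged` (one row per artist with `song` and `popularity` columns); the numbers are made up for illustration and say nothing about the real result.

```python
import pandas as pd

# Made-up values: number of hit songs and average popularity per artist
toy = pd.DataFrame({
    "song": [3, 5, 8, 12, 20],
    "popularity": [62, 60, 66, 70, 75],
})

# Pearson correlation (the default method of Series.corr)
r = toy["song"].corr(toy["popularity"])
print(f"Pearson r = {r:.2f}")
```

A value close to 1 would support the reading of the trendline, while a value close to 0 would suggest that the slope is mostly noise.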

## Top genres
We saw which artists were the most present in the dataset; we can now do the same for genres. Pop is one of the most popular genres nowadays, so we expect it to be at the top.

In [69]:
# Explode the genre column
data.genre = data.genre.apply(lambda x: x.split(", "))
df_genre_exp = data.explode("genre")
genres_amount = df_genre_exp.groupby("genre", as_index=False).count()


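# Numeric feature columns (popularity ... tempo) plus "explicit".
# "mode" now holds strings, so it is left out of the mean.
genre_cols = [col for col in df_genre_exp.columns[5:-1] if col != "mode"]
genre_cols.append("explicit")
genres = df_genre_exp.groupby("genre")[genre_cols].mean().reset_index()
# See results
genres.head()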


Unnamed: 0,genre,popularity,danceability,energy,key,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,explicit
0,Dance/Electronic,61.760009,0.68384,0.766687,5.754522,-5.151388,0.079919,0.112884,0.038239,0.18285,0.537117,124.054483,0.118863
1,Folk/Acoustic,67.356594,0.569339,0.720789,4.421053,-6.480842,0.042058,0.204917,0.046239,0.193379,0.560684,109.683263,0.052632
2,,63.830926,0.763516,0.733667,3.809524,-5.956571,0.065595,0.17759,0.038299,0.160933,0.664714,121.345857,0.047619
3,R&B,65.876176,0.706778,0.663233,5.331828,-5.879093,0.114886,0.154248,0.006679,0.162687,0.562708,115.507377,0.275395
4,World/Traditional,60.049938,0.591909,0.692889,4.222222,-6.354889,0.078433,0.266463,0.223766,0.204456,0.641667,111.259889,0.0
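
To make the effect of the split-and-explode step concrete, here is a minimal sketch on three toy rows whose genre strings follow the same comma-separated format as the dataset (the song names are placeholders). A song with several genres contributes one row to each of them, so the per-genre counts add up to more than the number of songs.

```python
import pandas as pd

toy = pd.DataFrame({
    "song": ["s1", "s2", "s3"],
    "genre": ["pop", "hip hop, pop, R&B", "pop, latin"],
})

# Turn each genre string into a list, then emit one row per list element
toy["genre"] = toy["genre"].str.split(", ")
exploded = toy.explode("genre")

counts = exploded["genre"].value_counts()
print(len(toy), "songs ->", len(exploded), "rows after explode")
print(counts)
```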


In [70]:
fig = px.bar(genres_amount.sort_values("song", ascending=False), x="song", y="genre", color="genre",
             width=1000, height=750, title="Most present genres")
fig.update_layout(font_size=18)
fig.show()


As the years go by, we hear more and more pop in the Spotify charts, on the radio, in shops, etc. It is therefore not so surprising to see pop as the most present genre, even if its lead over the other genres is this large. The one result we expected to be slightly different is the latin genre, which we thought would be more present, since summer hits are pretty much all latin/hispanic. On reflection it makes sense, since we don't hear many of them throughout the rest of the year.

## Popularity by Genre
Now that we've seen how much the genres appear in the dataset, let's take a look at the popularity of the songs with those genres.

In [71]:
fig = px.bar(genres.sort_values("popularity", ascending=False), x="popularity", y="genre",color="genre",
             width=1000, height=750, title="Average popularity of genres")
fig.update_layout(font_size=18)
fig.show()


In contrast to the number of songs per genre, the average popularity shows completely different results. A possible explanation is that genres such as Metal, Classical or Rock have fewer songs in the dataset, and the few that made it into the Spotify charts had to be more popular on average. A Pop song has a lower average popularity, but is much more likely to become a so-called hit.

## Technical values by genre
Every genre has its own values of speechiness, danceability, etc. The graphs below show, for every genre, the mean values of those parameters.

In [72]:

genre_num_songs = df_genre_exp.groupby("genre").size().reset_index()
genre_num_songs.columns = ["genre", "songs"]
genre_num_songs.drop(genre_num_songs[genre_num_songs.genre.isin(
    ["N/A"])].index.values, inplace=True)
valid_genres = genre_num_songs.genre.values

for genre in valid_genres:
    values_data = genres.drop(["popularity"], axis=1)[genres.genre == genre]

    values_data = values_data.drop(["genre", "tempo", "key", "loudness"], axis=1).transpose().reset_index()
    values_data.columns = ["parameter", "value"]

    fig = px.bar(values_data, x="value", y="parameter", color="parameter",
             width=750, height=500, title="Mean values per genre")
    fig.update_layout(font_size=18, xaxis_title=f"Mean values per {genre}")
    fig.show()


From the per-genre graphs, we can draw the following conclusions:
- As could have been expected, Blues and World/Traditional have the highest values of Instrumentalness, which indicates the absence of vocals such as words or spoken lyrics.
- Jazz has the lowest value of Energy, so it is pretty calm and definitely not upbeat.
- Hip Hop is more explicit and spoken, as shown by the Explicit and Speechiness parameters, which are higher than in the other genres. Since the genre is built almost entirely on words, this was predictable.

## How long are the most popular songs?
The scatter plot below shows the relationship between the duration of the songs and their popularity.

In [73]:
fig = px.scatter(data, x="duration_min", y="popularity", hover_data=[
                 "artist", "song"],  width=1200, height=800)
fig.update_layout(font_size=20)
fig.show()


As we can see, and as could be expected, the most popular songs last on average between 2.5 and 5 minutes. This graph is not very explanatory though, so it would be better to visualize the data in another way.
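
One possible alternative view is to group the songs into duration bins and compare the average popularity per bin. The following is a minimal sketch, assuming the `duration_min` and `popularity` columns used above; the bin edges are an arbitrary choice and the toy values are made up.

```python
import pandas as pd

# Made-up durations (minutes) and popularity scores
toy = pd.DataFrame({
    "duration_min": [2.1, 2.8, 3.2, 3.6, 3.9, 4.4, 5.3],
    "popularity": [55, 70, 82, 78, 74, 65, 40],
})

# One-minute bins; pd.cut uses right-closed intervals by default, e.g. (3, 4]
bins = [2, 3, 4, 5, 6]
toy["duration_bin"] = pd.cut(toy["duration_min"], bins=bins)

# Number of songs and mean popularity in each duration bin
binned = toy.groupby("duration_bin", observed=True)["popularity"].agg(["count", "mean"])
print(binned)
```

The resulting table can be passed to a bar chart, which makes the most popular duration range easier to read than a dense scatter plot.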

## Duration of songs throughout the years
From the following graph we can see the mean duration of the songs throughout the years.

In [74]:
year_data = data.groupby("year", as_index=False)["duration_min"].mean()
fig = px.bar(year_data, x="year", y="duration_min", width=1200, height=800)
fig.update_layout(font_size=20)
fig.show()


From what is shown, the mean duration of hit songs tends to be lower than in earlier years. As many would say, people's attention span keeps getting shorter, which could be the reason for this decrease.
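
The size of such a decrease can be estimated by fitting a straight line to the yearly means. The following is a minimal sketch with made-up numbers, assuming a table shaped like `year_data` (one mean duration per year); the years are centered before fitting to keep the fit numerically well behaved.

```python
import numpy as np

# Made-up yearly mean durations in minutes, for illustration only
years = np.array([2000, 2005, 2010, 2015, 2019])
mean_duration = np.array([4.1, 3.9, 3.8, 3.6, 3.3])

# Least-squares line through the points; slope is the change per year
centered_years = years - years.mean()
slope, intercept = np.polyfit(centered_years, mean_duration, deg=1)
print(f"Change per year: {slope:.3f} minutes")
```

A negative slope corresponds to songs getting shorter over time; its magnitude gives the average change in minutes per year.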

<center>
    <h1>Thanks for your attention.</h1><br/>
    <h2>Any questions?</h2>
</center>