 <font size="+2">***Chart-Topping Qualities:</font>
 <font size="+1">Commonalities Between Spotify's 20 Most-Streamed Songs***</font>
 
**an analysis by Allison Ward** // March 2024

Spotify is the world's largest audio streaming platform, boasting over 602 million users and a catalog of over 100 million songs. With thousands of tracks being uploaded every day, what qualities determine whether a song will be a hit? The tempo? The energy? This analysis examines characteristics of the 20 most-streamed songs on Spotify as of 2023.


Dataset taken from https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023/data


In [1]:
# importing packages
import pandas as pd
import numpy as np



In [2]:
# reading in the data
df = pd.read_csv("spotify2023.csv", encoding = 'latin-1')
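
A note on the `encoding = 'latin-1'` argument: Latin-1 maps every possible byte to a character, so the read never fails, but any multi-byte UTF-8 sequence in the file comes out as several unrelated characters. The sketch below is only an illustration with hand-written strings, not values read from `spotify2023.csv`. It shows what a UTF-8 `ñ` and the Unicode replacement character (U+FFFD) look like once decoded as Latin-1.

```python
# Illustration only: how UTF-8 bytes look when decoded as Latin-1.
# The strings here are hand-written examples, not data from spotify2023.csv.
original = "Se\u00f1orita"  # "Señorita"
utf8_bytes = original.encode("utf-8")

# Latin-1 decodes each byte separately, so the two-byte "ñ" becomes two characters.
misread = utf8_bytes.decode("latin-1")
print(misread)

# The replacement character U+FFFD is three bytes in UTF-8,
# so a Latin-1 read turns it into three characters.
replacement_misread = "\ufffd".encode("utf-8").decode("latin-1")
print(replacement_misread)

# When no characters were lost, the misread text can be round-tripped back.
recovered = misread.encode("latin-1").decode("utf-8")
print(recovered == original)
```

If a source file already contains U+FFFD in place of an accented letter, the original character is gone and no choice of encoding will bring it back. That would be consistent with the garbled track name that is renamed by hand later in this notebook, though this is an assumption about the file rather than something verified here.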

In [3]:
# taking a look
df.head()

Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,streams,in_apple_playlists,...,bpm,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
0,Seven (feat. Latto) (Explicit Ver.),"Latto, Jung Kook",2,2023,7,14,553,147,141381703,43,...,125,B,Major,80,89,83,31,0,8,4
1,LALA,Myke Towers,1,2023,3,23,1474,48,133716286,48,...,92,C#,Major,71,61,74,7,0,10,4
2,vampire,Olivia Rodrigo,1,2023,6,30,1397,113,140003974,94,...,138,F,Major,51,32,53,17,0,31,6
3,Cruel Summer,Taylor Swift,1,2019,8,23,7858,100,800840817,116,...,170,A,Major,55,58,72,11,0,11,15
4,WHERE SHE GOES,Bad Bunny,1,2023,5,18,3133,50,303236322,84,...,144,A,Minor,65,23,80,14,63,11,6


In [4]:
# Checking out info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 953 entries, 0 to 952
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   track_name            953 non-null    object
 1   artist(s)_name        953 non-null    object
 2   artist_count          953 non-null    int64 
 3   released_year         953 non-null    int64 
 4   released_month        953 non-null    int64 
 5   released_day          953 non-null    int64 
 6   in_spotify_playlists  953 non-null    int64 
 7   in_spotify_charts     953 non-null    int64 
 8   streams               953 non-null    object
 9   in_apple_playlists    953 non-null    int64 
 10  in_apple_charts       953 non-null    int64 
 11  in_deezer_playlists   953 non-null    object
 12  in_deezer_charts      953 non-null    int64 
 13  in_shazam_charts      903 non-null    object
 14  bpm                   953 non-null    int64 
 15  key                   858 non-null    ob

In [5]:
# Getting a count of unknown values
df.isna().sum()

track_name               0
artist(s)_name           0
artist_count             0
released_year            0
released_month           0
released_day             0
in_spotify_playlists     0
in_spotify_charts        0
streams                  0
in_apple_playlists       0
in_apple_charts          0
in_deezer_playlists      0
in_deezer_charts         0
in_shazam_charts        50
bpm                      0
key                     95
mode                     0
danceability_%           0
valence_%                0
energy_%                 0
acousticness_%           0
instrumentalness_%       0
liveness_%               0
speechiness_%            0
dtype: int64

In [6]:
# filling in unknown values with zeros or an "Unknown" placeholder
# (assigning the result back avoids the chained inplace call, which pandas warns will stop working in 3.0)
df["in_shazam_charts"] = df["in_shazam_charts"].fillna(0)
df["key"] = df["key"].fillna("Unknown")
df.isna().sum()


track_name              0
artist(s)_name          0
artist_count            0
released_year           0
released_month          0
released_day            0
in_spotify_playlists    0
in_spotify_charts       0
streams                 0
in_apple_playlists      0
in_apple_charts         0
in_deezer_playlists     0
in_deezer_charts        0
in_shazam_charts        0
bpm                     0
key                     0
mode                    0
danceability_%          0
valence_%               0
energy_%                0
acousticness_%          0
instrumentalness_%      0
liveness_%              0
speechiness_%           0
dtype: int64

In [7]:
# The "streams" value in row 574 is a non-numeric string, which prevents converting the column from object to int.
# Dropping this row since it's likely not in the top 20 anyway
df.drop([574], inplace = True)
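
Rather than hard-coding the row label, the offending rows can be located programmatically. The following is a minimal sketch on an invented three-row frame (the `demo` data and its values are hypothetical, not taken from the dataset); it uses `pd.to_numeric` with `errors="coerce"` to flag values that cannot be parsed.

```python
import pandas as pd

# Hypothetical stand-in for the "streams" column; values are invented for illustration.
demo = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C"],
    "streams": ["1000", "not a number", "2500"],
})

# errors="coerce" turns anything that cannot be parsed into NaN.
parsed = pd.to_numeric(demo["streams"], errors="coerce")

# Rows where parsing failed are the ones that block the conversion.
bad_rows = demo[parsed.isna()]
print(bad_rows)

# Keep only the parseable rows and store the column as integers.
clean = demo[parsed.notna()].copy()
clean["streams"] = parsed[parsed.notna()].astype("int64")
print(clean.dtypes)
```

Inspecting `bad_rows` before dropping anything makes it easier to confirm that only the expected row is being discarded.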

In [8]:
# Converting this column to numeric
df["streams"] = df["streams"].astype('int64')

In [9]:
# Getting a list of the top 20 songs by number of streams
top_20 = df.sort_values(by=["streams"], ascending= False).head(20)

In [10]:
# Resetting the index
top_20.reset_index(inplace = True)

In [11]:
# Creating a column "Rank" based on the position in the table and deleting the original index labels
top_20["Rank"] = top_20.index + 1
top_20.drop(["index"], axis = 1, inplace = True)
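
The sort, index reset, and rank steps above could also be condensed. Here is one possible sketch using `DataFrame.nlargest` and `reset_index(drop=True)` on an invented four-row frame (the `demo` data is hypothetical); it is an alternative formulation, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the dataset; names and stream counts are invented.
demo = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "streams": [300, 900, 100, 500],
})

# nlargest orders by the column descending and keeps the first n rows;
# drop=True discards the old index instead of adding it as an "index" column.
top_3 = demo.nlargest(3, "streams").reset_index(drop=True)

# With a fresh 0-based index, the rank is simply index + 1.
top_3["Rank"] = top_3.index + 1
print(top_3)
```

Passing `drop=True` removes the need to delete the leftover `index` column afterwards.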

In [12]:
# looking at the info of our new table
top_20.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   track_name            20 non-null     object
 1   artist(s)_name        20 non-null     object
 2   artist_count          20 non-null     int64 
 3   released_year         20 non-null     int64 
 4   released_month        20 non-null     int64 
 5   released_day          20 non-null     int64 
 6   in_spotify_playlists  20 non-null     int64 
 7   in_spotify_charts     20 non-null     int64 
 8   streams               20 non-null     int64 
 9   in_apple_playlists    20 non-null     int64 
 10  in_apple_charts       20 non-null     int64 
 11  in_deezer_playlists   20 non-null     object
 12  in_deezer_charts      20 non-null     int64 
 13  in_shazam_charts      20 non-null     object
 14  bpm                   20 non-null     int64 
 15  key                   20 non-null     obje

In [13]:
# converting our new Top 20 table to its own .csv file
top_20.to_csv('top_20_spotify_2023.csv', index=False)

In [14]:
# taking a look at the stats
top_20.describe()

Unnamed: 0,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,streams,in_apple_playlists,in_apple_charts,in_deezer_charts,bpm,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%,Rank
count,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0
mean,1.4,2017.75,6.45,16.7,23403.7,36.05,2652270000.0,342.25,103.45,8.6,117.65,63.75,52.45,61.1,31.9,0.1,15.6,7.65,10.5
std,0.598243,2.173404,3.379115,10.219177,8673.499203,35.153011,382768200.0,170.454771,57.861427,12.861202,32.283734,15.043358,22.177454,15.165404,29.622005,0.447214,9.885662,6.799961,5.91608
min,1.0,2012.0,1.0,1.0,14749.0,0.0,2282771000.0,7.0,0.0,0.0,81.0,35.0,12.0,30.0,0.0,0.0,7.0,2.0,1.0
25%,1.0,2016.75,4.75,9.0,16550.25,8.0,2404276000.0,228.5,70.25,0.0,95.0,51.5,40.25,51.5,4.0,0.0,9.0,3.0,5.75
50%,1.0,2017.5,6.0,14.5,22229.0,27.0,2562529000.0,379.0,111.0,2.5,107.0,64.5,51.0,59.0,25.0,0.0,10.5,5.0,10.5
75%,2.0,2019.0,9.25,28.25,27347.25,61.5,2737466000.0,443.25,138.75,10.5,124.25,76.25,66.5,76.5,55.0,0.0,19.25,8.25,15.25
max,3.0,2022.0,11.0,31.0,43899.0,130.0,3703895000.0,672.0,199.0,46.0,186.0,83.0,93.0,82.0,93.0,2.0,36.0,28.0,20.0


In [15]:
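# Renaming the song Senorita to its correct name
# (replace returns a new DataFrame, so the result is assigned back to keep the change)
top_20 = top_20.replace({"track_name": "Seï¿½ï¿½o"}, {"track_name":"Senorita"})
top_20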

Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,streams,in_apple_playlists,...,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%,Rank
0,Blinding Lights,The Weeknd,1,2019,11,29,43899,69,3703895074,672,...,C#,Major,50,38,80,0,0,9,7,1
1,Shape of You,Ed Sheeran,1,2017,1,6,32181,10,3562543890,33,...,C#,Minor,83,93,65,58,0,9,8,2
2,Someone You Loved,Lewis Capaldi,1,2018,11,8,17836,53,2887241814,440,...,C#,Major,50,45,41,75,0,11,3,3
3,Dance Monkey,Tones and I,1,2019,5,10,24529,0,2864791672,533,...,F#,Minor,82,54,59,69,0,18,10,4
4,Sunflower - Spider-Man: Into the Spider-Verse,"Post Malone, Swae Lee",2,2018,10,9,24094,78,2808096550,372,...,D,Major,76,91,50,54,0,7,5,5
5,One Dance,"Drake, WizKid, Kyla",3,2016,4,4,43257,24,2713922350,433,...,C#,Major,77,36,63,1,0,36,5,6
6,STAY (with Justin Bieber),"Justin Bieber, The Kid Laroi",2,2021,7,9,17050,36,2665343922,492,...,C#,Major,59,48,76,4,0,10,5,7
7,Believer,Imagine Dragons,1,2017,1,31,18986,23,2594040133,250,...,A#,Minor,77,74,78,4,0,23,11,8
8,Closer,"The Chainsmokers, Halsey",2,2016,5,31,28032,0,2591224264,315,...,G#,Major,75,64,52,41,0,11,3,9
9,Starboy,"The Weeknd, Daft Punk",2,2016,9,21,29536,79,2565529693,281,...,G,Major,68,49,59,16,0,13,28,10
