In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import re

# Data Visualization and Exploratory Data Analysis Lab
## Visualizing and exploring data. The data mining process

In this lab, you'll get acquainted with the most streamed songs on Spotify in 2024. The dataset and its associated metadata can be found [here](https://www.kaggle.com/datasets/nelgiriyewithana/most-streamed-spotify-songs-2024). The version you'll need is provided in the `data/` folder.

You know the drill. Do what you can / want / need to answer the questions to the best of your ability. Answers do not need to be trivial, or even the same among different people.

### Problem 1. Read the dataset (1 point)
Read the file without unzipping it first. You can try a different character encoding, like `unicode_escape`. Don't worry too much about weird characters.

In [2]:
spotify_data = pd.read_csv('./data/spotify_most_streamed_2024.zip', encoding='unicode_escape')

In [3]:
spotify_data

,Track,Album Name,Artist,Release Date,ISRC,All Time Rank,Track Score,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,...,SiriusXM Spins,Deezer Playlist Count,Deezer Playlist Reach,Amazon Playlist Count,Pandora Streams,Pandora Track Stations,Soundcloud Streams,Shazam Counts,TIDAL Popularity,Explicit Track
0,MILLION DOLLAR BABY,Million Dollar Baby - Single,Tommy Richman,4/26/2024,QM24S2402528,1,725.4,390470936,30716,196631588,...,684,62.0,17598718,114.0,18004655,22931,4818457,2669262,,0
1,Not Like Us,Not Like Us,Kendrick Lamar,5/4/2024,USUG12400910,2,545.9,323703884,28113,174597137,...,3,67.0,10422430,111.0,7780028,28444,6623075,1118279,,1
2,i like the way you kiss me,I like the way you kiss me,Artemas,3/19/2024,QZJ842400387,3,538.4,601309283,54331,211607669,...,536,136.0,36321847,172.0,5022621,5639,7208651,5285340,,0
3,Flowers,Flowers - Single,Miley Cyrus,1/12/2023,USSM12209777,4,444.9,2031280633,269802,136569078,...,2182,264.0,24684248,210.0,190260277,203384,,11822942,,0
4,Houdini,Houdini,Eminem,5/31/2024,USUG12403398,5,423.3,107034922,7223,151469874,...,1,82.0,17660624,105.0,4493884,7006,207179,457017,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,For the Last Time,For the Last Time,$uicideboy$,9/5/2017,QM8DG1703420,4585,19.4,305049963,65770,5103054,...,,2.0,14217,,20104066,13184,50633006,656337,,1
4596,Dil Meri Na Sune,"Dil Meri Na Sune (From ""Genius"")",Atif Aslam,7/27/2018,INT101800122,4575,19.4,52282360,4602,1449767,...,,1.0,927,,,,,193590,,0
4597,Grace (feat. 42 Dugg),My Turn,Lil Baby,2/28/2020,USUG12000043,4571,19.4,189972685,72066,6704802,...,,1.0,74,6.0,84426740,28999,,1135998,,1
4598,Nashe Si Chadh Gayi,November Top 10 Songs,Arijit Singh,11/8/2016,INY091600067,4591,19.4,145467020,14037,7387064,...,,,,7.0,6817840,,,448292,,0


### Problem 2. Perform some cleaning (1 point)
Ensure all data has been read correctly; check the data types. Give the columns better names (e.g. `all_time_rank`, `track_score`, etc.). To do so, try to use `apply()` instead of a manual mapping between old and new names. Get rid of any unnecessary columns.

In [4]:
def to_snake_case(col_name):
    """Convert a column label such as 'All Time Rank' to 'all_time_rank'."""
    col_name = col_name.strip()
    col_name = col_name.lower()
    col_name = re.sub(r"[^\w\s]", "_", col_name)  # replace punctuation (anything but word characters and whitespace) with underscores
    col_name = re.sub(r"\s+", "_", col_name)  # regex instead of replace, so a run of several spaces becomes one underscore
    col_name = re.sub(r"_+", "_", col_name)  # collapse repeated underscores into a single one
    col_name = col_name.rstrip("_")  # drop any trailing underscore

    return col_name

In [5]:
spotify_data.columns = spotify_data.columns.to_series().apply(to_snake_case)

In [6]:
spotify_data.columns

Index(['track', 'album_name', 'artist', 'release_date', 'isrc',
       'all_time_rank', 'track_score', 'spotify_streams',
       'spotify_playlist_count', 'spotify_playlist_reach',
       'spotify_popularity', 'youtube_views', 'youtube_likes', 'tiktok_posts',
       'tiktok_likes', 'tiktok_views', 'youtube_playlist_reach',
       'apple_music_playlist_count', 'airplay_spins', 'siriusxm_spins',
       'deezer_playlist_count', 'deezer_playlist_reach',
       'amazon_playlist_count', 'pandora_streams', 'pandora_track_stations',
       'soundcloud_streams', 'shazam_counts', 'tidal_popularity',
       'explicit_track'],
      dtype='object')

In [7]:
spotify_data.isrc.count() == spotify_data.shape[0]  # True when no ISRC value is missing

np.True_

In [8]:
spotify_data = spotify_data.drop(columns=["isrc"])

In [9]:
spotify_data.dtypes.reset_index().rename(columns={"index": "feature", 0: "data_type"})

,feature,data_type
0,track,object
1,album_name,object
2,artist,object
3,release_date,object
4,all_time_rank,object
5,track_score,float64
6,spotify_streams,object
7,spotify_playlist_count,object
8,spotify_playlist_reach,object
9,spotify_popularity,float64


In [10]:
object_to_float_cols = [
    "all_time_rank",
    "spotify_streams",
    "spotify_playlist_count",
    "spotify_playlist_reach",
    "youtube_views",
    "youtube_likes",
    "tiktok_posts",
    "tiktok_likes",
    "tiktok_views",
    "youtube_playlist_reach",
    "airplay_spins",
    "siriusxm_spins",
    "deezer_playlist_reach",
    "pandora_streams",
    "pandora_track_stations",
    "soundcloud_streams",
    "shazam_counts"]

In [11]:
spotify_data[object_to_float_cols]

,all_time_rank,spotify_streams,spotify_playlist_count,spotify_playlist_reach,youtube_views,youtube_likes,tiktok_posts,tiktok_likes,tiktok_views,youtube_playlist_reach,airplay_spins,siriusxm_spins,deezer_playlist_reach,pandora_streams,pandora_track_stations,soundcloud_streams,shazam_counts
0,1,390470936,30716,196631588,84274754,1713126,5767700,651565900,5332281936,150597040,40975,684,17598718,18004655,22931,4818457,2669262
1,2,323703884,28113,174597137,116347040,3486739,674700,35223547,208339025,156380351,40778,3,10422430,7780028,28444,6623075,1118279
2,3,601309283,54331,211607669,122599116,2228730,3025400,275154237,3369120610,373784955,74333,536,36321847,5022621,5639,7208651,5285340
3,4,2031280633,269802,136569078,1096100899,10629796,7189811,1078757968,14603725994,3351188582,1474799,2182,24684248,190260277,203384,,11822942
4,5,107034922,7223,151469874,77373957,3670188,16400,,,112763851,12185,1,17660624,4493884,7006,207179,457017
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,4585,305049963,65770,5103054,149247747,1397590,48370,20202000,143934379,53016,6,,14217,20104066,13184,50633006,656337
4596,4575,52282360,4602,1449767,943920245,5347766,,1721,19935,24973048,412,,927,,,,193590
4597,4571,189972685,72066,6704802,201027333,1081402,7596,5288677,36849005,29253152,204,,74,84426740,28999,,1135998
4598,4591,145467020,14037,7387064,1118595159,3868828,11433,2534831,37757301,25150516,1200,,,6817840,,,448292


In [12]:
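def clean_numeric_with_commas(value):
    """Parse a value such as '390,470,936' as a number; return NaN if it cannot be parsed."""
    value = str(value)
    value = value.replace(",", "")  # drop thousands separators

    return pd.to_numeric(value, errors="coerce")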

In [13]:
spotify_data[object_to_float_cols] = spotify_data[object_to_float_cols].apply(
    lambda col: col.apply(clean_numeric_with_commas)
)
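
The element-wise cleaning above calls a Python function once per cell. A possible column-wise alternative is sketched below; `strip_commas_to_numeric` and the sample values are made up for illustration and are not part of the lab.

```python
import pandas as pd


def strip_commas_to_numeric(col: pd.Series) -> pd.Series:
    """Remove thousands separators from a whole column, then parse it as numbers."""
    without_commas = col.astype(str).str.replace(",", "", regex=False)
    # Anything that still is not a number (including missing cells) becomes NaN.
    return pd.to_numeric(without_commas, errors="coerce")


# Invented sample values; the real input would be spotify_data[object_to_float_cols].
raw = pd.DataFrame({
    "spotify_streams": ["390,470,936", "1,000", None, "n/a"],
    "shazam_counts": ["2,669,262", "12", "7", None],
})

# DataFrame.apply passes each column to the function as a Series.
cleaned = raw.apply(strip_commas_to_numeric)
print(cleaned.dtypes)
```

Another option worth trying is the `thousands=","` parameter of `pd.read_csv`, which handles the separators while the file is being read.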

In [14]:
spotify_data.release_date = pd.to_datetime(spotify_data.release_date, errors='coerce')

In [15]:
spotify_data

,track,album_name,artist,release_date,all_time_rank,track_score,spotify_streams,spotify_playlist_count,spotify_playlist_reach,spotify_popularity,...,siriusxm_spins,deezer_playlist_count,deezer_playlist_reach,amazon_playlist_count,pandora_streams,pandora_track_stations,soundcloud_streams,shazam_counts,tidal_popularity,explicit_track
0,MILLION DOLLAR BABY,Million Dollar Baby - Single,Tommy Richman,2024-04-26,1,725.4,3.904709e+08,30716.0,196631588.0,92.0,...,684.0,62.0,17598718.0,114.0,18004655.0,22931.0,4818457.0,2669262.0,,0
1,Not Like Us,Not Like Us,Kendrick Lamar,2024-05-04,2,545.9,3.237039e+08,28113.0,174597137.0,92.0,...,3.0,67.0,10422430.0,111.0,7780028.0,28444.0,6623075.0,1118279.0,,1
2,i like the way you kiss me,I like the way you kiss me,Artemas,2024-03-19,3,538.4,6.013093e+08,54331.0,211607669.0,92.0,...,536.0,136.0,36321847.0,172.0,5022621.0,5639.0,7208651.0,5285340.0,,0
3,Flowers,Flowers - Single,Miley Cyrus,2023-01-12,4,444.9,2.031281e+09,269802.0,136569078.0,85.0,...,2182.0,264.0,24684248.0,210.0,190260277.0,203384.0,,11822942.0,,0
4,Houdini,Houdini,Eminem,2024-05-31,5,423.3,1.070349e+08,7223.0,151469874.0,88.0,...,1.0,82.0,17660624.0,105.0,4493884.0,7006.0,207179.0,457017.0,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,For the Last Time,For the Last Time,$uicideboy$,2017-09-05,4585,19.4,3.050500e+08,65770.0,5103054.0,71.0,...,,2.0,14217.0,,20104066.0,13184.0,50633006.0,656337.0,,1
4596,Dil Meri Na Sune,"Dil Meri Na Sune (From ""Genius"")",Atif Aslam,2018-07-27,4575,19.4,5.228236e+07,4602.0,1449767.0,56.0,...,,1.0,927.0,,,,,193590.0,,0
4597,Grace (feat. 42 Dugg),My Turn,Lil Baby,2020-02-28,4571,19.4,1.899727e+08,72066.0,6704802.0,65.0,...,,1.0,74.0,6.0,84426740.0,28999.0,,1135998.0,,1
4598,Nashe Si Chadh Gayi,November Top 10 Songs,Arijit Singh,2016-11-08,4591,19.4,1.454670e+08,14037.0,7387064.0,66.0,...,,,,7.0,6817840.0,,,448292.0,,0


### Problem 3. Most productive artists (1 point)
Who are the five artists with the most songs in the dataset?

Who are the five "clean-mouthed" artists (i.e., with no explicit songs)? **Note:** We're not going into details but we can start a discussion about whether a song needs swearing to be popular.
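
One possible starting point is sketched below. It assumes the cleaned column names `artist` and `explicit_track` (1 for explicit, 0 otherwise) and uses a tiny invented frame in place of `spotify_data`.

```python
import pandas as pd

# Invented rows standing in for the cleaned spotify_data frame.
songs = pd.DataFrame({
    "artist": ["A", "A", "A", "B", "B", "C"],
    "track": ["a1", "a2", "a3", "b1", "b2", "c1"],
    "explicit_track": [0, 1, 0, 0, 0, 0],
})

# Number of songs per artist, most productive first.
songs_per_artist = songs["artist"].value_counts()
top_artists = songs_per_artist.head(5)

# An artist counts as "clean-mouthed" if none of their songs is flagged explicit.
explicit_per_artist = songs.groupby("artist")["explicit_track"].sum()
clean_artists = explicit_per_artist[explicit_per_artist == 0].index

# Rank the clean-mouthed artists by how many songs they have.
is_clean = songs_per_artist.index.isin(clean_artists)
top_clean_artists = songs_per_artist[is_clean].head(5)

print(top_artists)
print(top_clean_artists)
```

Ranking the clean-mouthed artists by song count is one interpretation of the question; ranking them by streams or track score would be equally defensible.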

### Problem 4. Most streamed artists (1 point)
And who are the top five most streamed (as measured by Spotify streams) artists?
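
A minimal sketch of the aggregation, assuming `spotify_streams` has already been converted to a numeric column. The rows are invented.

```python
import pandas as pd

# Invented rows; a missing stream count is represented as None/NaN.
songs = pd.DataFrame({
    "artist": ["A", "A", "B", "C", "C"],
    "spotify_streams": [100.0, 50.0, 300.0, None, 20.0],
})

# Sum the streams of all songs by each artist (NaN values are skipped by default),
# then sort so that the most streamed artist comes first.
streams_per_artist = (
    songs.groupby("artist")["spotify_streams"]
    .sum()
    .sort_values(ascending=False)
)
top_streamed = streams_per_artist.head(5)

print(top_streamed)
```

Comparing this ranking with the one by song count shows whether the most productive artists are also the most streamed ones.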

### Problem 5. Songs by year and month (1 point)
How many songs have been released each year? Present an appropriate plot. Can you explain the behavior of the plot for 2024?

How about months? Is / Are there (a) popular month(s) to release music?
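
The counting step could look like the sketch below, which assumes `release_date` is already a datetime column. The dates are invented.

```python
import pandas as pd

# Invented release dates; None becomes NaT, like unparseable dates in the real data.
songs = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2024-04-26", "2024-05-04", "2023-01-12", "2023-01-13", "2022-01-20", None]
    ),
})

# Drop rows without a usable date so the year and month stay integers.
dates = songs["release_date"].dropna()

# Count songs per release year and per calendar month, in chronological order.
songs_per_year = dates.dt.year.value_counts().sort_index()
songs_per_month = dates.dt.month.value_counts().sort_index()

print(songs_per_year)
print(songs_per_month)
```

Either result can be passed to `.plot(kind="bar")` for a quick chart. When interpreting 2024, it may help to look at `dates.max()` and see how much of the year the dataset actually covers.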

### Problem 6. Playlists (2 points)
Is there any connection (correlation) between how often users add a song to playlists on one service and on another? Only Spotify, Apple, Deezer, and Amazon offer the ability to add a song to a playlist. Find a way to plot all these relationships at the same time, and analyze them. Experiment with different types of correlations.
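
A sketch of how two correlation types can be compared with `DataFrame.corr`. The column names follow the cleaned frame; the numbers are invented so that the two methods disagree visibly.

```python
import pandas as pd

playlist_cols = [
    "spotify_playlist_count",
    "apple_music_playlist_count",
    "deezer_playlist_count",
    "amazon_playlist_count",
]

# Invented counts: Apple grows with Spotify, but not linearly; Deezer moves the opposite way.
playlists = pd.DataFrame({
    "spotify_playlist_count": [1, 2, 3, 4, 5],
    "apple_music_playlist_count": [1, 4, 9, 16, 25],
    "deezer_playlist_count": [5, 4, 3, 2, 1],
    "amazon_playlist_count": [2, 1, 4, 3, 5],
})

# Pearson measures linear association; Spearman measures monotonic (rank) association.
pearson = playlists[playlist_cols].corr(method="pearson")
spearman = playlists[playlist_cols].corr(method="spearman")

print(pearson.round(3))
print(spearman.round(3))
```

Each matrix can be shown at once with `sns.heatmap(pearson, annot=True)`, and `sns.pairplot(playlists)` draws every pairwise scatter plot. Because counts like these tend to be heavily skewed, the rank-based result may differ noticeably from the linear one.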

### Problem 7. YouTube views and likes (1 point)
What is the relationship between YouTube views and likes? Present an appropriate plot. 

What is the mean YouTube views-to-likes ratio? What is its distribution? Find a way to plot it and describe it.
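
The ratio can be computed as sketched below, assuming numeric `youtube_views` and `youtube_likes` columns. The values are invented and include a zero and a missing entry, since both need handling.

```python
import numpy as np
import pandas as pd

# Invented values; one song has zero likes and one has no view count.
youtube = pd.DataFrame({
    "youtube_views": [1000.0, 5000.0, 200.0, np.nan, 300.0],
    "youtube_likes": [10.0, 100.0, 0.0, 50.0, 3.0],
})

# Views per like; division by zero yields inf, which is treated as missing here.
ratio = (
    (youtube["youtube_views"] / youtube["youtube_likes"])
    .replace([np.inf, -np.inf], np.nan)
    .dropna()
)

mean_ratio = ratio.mean()
median_ratio = ratio.median()

print(ratio.describe())
```

For the plots, a scatter of likes against views on logarithmic axes and a histogram of `np.log10(ratio)` are reasonable first attempts. Comparing the mean with the median gives a hint about how skewed the distribution is.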

### Problem 8. TikTok stuff (2 points)
The most popular songs on TikTok released every year show... interesting behavior. Which release years have the highest peak TikTok views? Show an appropriate chart. Can you explain this behavior? For a bit of context, TikTok was created in 2016.

Now, how much more popular is the most popular song for each release year than the mean popularity for that year? Analyze the results.

In both parts, it would be helpful to see the actual songs.
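
One way to keep the actual songs visible is sketched below. It assumes that "popularity" means `tiktok_views`, that `release_date` is a datetime column, and it uses invented rows.

```python
import pandas as pd

# Invented rows standing in for the cleaned spotify_data frame.
songs = pd.DataFrame({
    "track": ["old hit", "old filler", "new hit", "new filler 1", "new filler 2", "undated"],
    "artist": ["A", "B", "C", "D", "E", "F"],
    "release_date": pd.to_datetime(
        ["2019-03-01", "2019-07-15", "2020-01-10", "2020-05-05", "2020-09-09", None]
    ),
    "tiktok_views": [300.0, 100.0, 1000.0, 200.0, 300.0, 50.0],
})

# Keep only songs with both a release date and a view count.
by_year = songs.dropna(subset=["release_date", "tiktok_views"]).copy()
by_year["release_year"] = by_year["release_date"].dt.year

# Row label of the most viewed song in each release year.
top_idx = by_year.groupby("release_year")["tiktok_views"].idxmax()
top_songs = by_year.loc[
    top_idx, ["release_year", "track", "artist", "tiktok_views"]
].set_index("release_year")

# How many times the yearly mean the top song reaches (aligned on release_year).
mean_views = by_year.groupby("release_year")["tiktok_views"].mean()
top_songs["times_the_mean"] = top_songs["tiktok_views"] / mean_views

print(top_songs)
```

Plotting `top_songs["tiktok_views"]` as a bar chart, possibly with a logarithmic y-axis, addresses the first part; the `times_the_mean` column addresses the second.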

### * Problem 9. Explore (and clean) at will
There is a lot to look for here. For example, you can easily link a song to its genres and lyrics. You may also try to link artists and albums to more info about them. Or you can compare and contrast a song's performance across different platforms, in a similar manner to what you already did above; maybe even devise a better song ranking system (across platforms with different popularity metrics and different requirements) than the one provided in the dataset.