# Assignment 1: Data Analysis

Reference: https://www.kaggle.com/datasets/nelgiriyewithana/most-streamed-spotify-songs-2024

## Description

This dataset presents a comprehensive compilation of the most streamed songs on Spotify in 2024. It provides extensive insights into each track's attributes, popularity, and presence on various music platforms, offering a valuable resource for music analysts, enthusiasts, and industry professionals. The dataset includes information such as track name, artist, release date, ISRC, streaming statistics, and presence on platforms like YouTube, TikTok, and more.

## Key Features

- Track Name: Name of the song.
- Album Name: Name of the album the song belongs to.
- Artist: Name of the artist(s) of the song.
- Release Date: Date when the song was released.
- ISRC: International Standard Recording Code for the song.
- All Time Rank: Ranking of the song based on its all-time popularity.
- Track Score: Score assigned to the track based on various factors.
- Spotify Streams: Total number of streams on Spotify.
- Spotify Playlist Count: Number of Spotify playlists the song is included in.
- Spotify Playlist Reach: Reach of the song across Spotify playlists.
- Spotify Popularity: Popularity score of the song on Spotify.
- YouTube Views: Total views of the song's official video on YouTube.
- YouTube Likes: Total likes on the song's official video on YouTube.
- TikTok Posts: Number of TikTok posts featuring the song.
- TikTok Likes: Total likes on TikTok posts featuring the song.
- TikTok Views: Total views on TikTok posts featuring the song.
- YouTube Playlist Reach: Reach of the song across YouTube playlists.
- Apple Music Playlist Count: Number of Apple Music playlists the song is included in.
- AirPlay Spins: Number of times the song has been played on radio stations.
- SiriusXM Spins: Number of times the song has been played on SiriusXM.
- Deezer Playlist Count: Number of Deezer playlists the song is included in.
- Deezer Playlist Reach: Reach of the song across Deezer playlists.
- Amazon Playlist Count: Number of Amazon Music playlists the song is included in.
- Pandora Streams: Total number of streams on Pandora.
- Pandora Track Stations: Number of Pandora stations featuring the song.
- Soundcloud Streams: Total number of streams on Soundcloud.
- Shazam Counts: Total number of times the song has been Shazamed.
- TIDAL Popularity: Popularity score of the song on TIDAL.
- Explicit Track: Indicates whether the song contains explicit content.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [8]:
df = pd.read_csv('https://raw.githubusercontent.com/desareca/Tareas_Analisis_Datos-G6/main/Most%20Streamed%20Spotify%20Songs%202024.csv', encoding='ISO-8859-1')

In [10]:
df.head()

Unnamed: 0,Track,Album Name,Artist,Release Date,ISRC,All Time Rank,Track Score,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,...,SiriusXM Spins,Deezer Playlist Count,Deezer Playlist Reach,Amazon Playlist Count,Pandora Streams,Pandora Track Stations,Soundcloud Streams,Shazam Counts,TIDAL Popularity,Explicit Track
0,MILLION DOLLAR BABY,Million Dollar Baby - Single,Tommy Richman,4/26/2024,QM24S2402528,1,725.4,390470936,30716,196631588,...,684,62.0,17598718,114.0,18004655,22931,4818457.0,2669262,,0
1,Not Like Us,Not Like Us,Kendrick Lamar,5/4/2024,USUG12400910,2,545.9,323703884,28113,174597137,...,3,67.0,10422430,111.0,7780028,28444,6623075.0,1118279,,1
2,i like the way you kiss me,I like the way you kiss me,Artemas,3/19/2024,QZJ842400387,3,538.4,601309283,54331,211607669,...,536,136.0,36321847,172.0,5022621,5639,7208651.0,5285340,,0
3,Flowers,Flowers - Single,Miley Cyrus,1/12/2023,USSM12209777,4,444.9,2031280633,269802,136569078,...,2182,264.0,24684248,210.0,190260277,203384,,11822942,,0
4,Houdini,Houdini,Eminem,5/31/2024,USUG12403398,5,423.3,107034922,7223,151469874,...,1,82.0,17660624,105.0,4493884,7006,207179.0,457017,,1


In [11]:
df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Track,4600.0,4370.0,Danza Kuduro - Cover,13.0,,,,,,,
Album Name,4600.0,4005.0,Un Verano Sin Ti,20.0,,,,,,,
Artist,4595.0,1999.0,Drake,63.0,,,,,,,
Release Date,4600.0,1562.0,1/1/2012,38.0,,,,,,,
ISRC,4600.0,4598.0,USWL11700269,2.0,,,,,,,
All Time Rank,4600.0,4577.0,3441,2.0,,,,,,,
Track Score,4600.0,,,,41.844043,38.543766,19.4,23.3,29.9,44.425,725.4
Spotify Streams,4487.0,4425.0,1655575417,4.0,,,,,,,
Spotify Playlist Count,4530.0,4207.0,1,46.0,,,,,,,
Spotify Playlist Reach,4528.0,4478.0,3,8.0,,,,,,,


In [12]:
# Drop the TIDAL Popularity column; all of its values are NaN
df = df.drop('TIDAL Popularity', axis=1)
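The claim that a column is entirely NaN can be checked programmatically rather than read off `describe()`. The following is a minimal sketch on a made-up toy frame (the values are illustrative assumptions, not taken from the dataset); it lists fully empty columns and shows `dropna(axis=1, how="all")` as one possible alternative to naming the column by hand.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values standing in for the real dataset.
toy = pd.DataFrame({
    "Track": ["a", "b", "c"],
    "Spotify Popularity": [92.0, np.nan, 85.0],
    "TIDAL Popularity": [np.nan, np.nan, np.nan],
})

# Names of the columns in which every value is missing.
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

# Drop every fully empty column in one step.
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Partially filled columns such as `Spotify Popularity` are left untouched by this approach.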

In [13]:
# The ISRC code is unique per recording, so duplicates must be removed.
# First we check whether the duplicated rows carry the same information.
# Reference: https://isrc.ifpi.org/es/
cancion_repetida = (df.groupby('ISRC')['ISRC'].count()>1)
df[df['ISRC'].isin(cancion_repetida[cancion_repetida].index)].T

Unnamed: 0,2449,2450,3447,3450
Track,Tennessee Orange,Tennessee Orange,Dembow,Dembow
Album Name,Tennessee Orange,Tennessee Orange,Dembow,Dembow
Artist,Megan Moroney,Megan Moroney,Danny Ocean,Danny Ocean
Release Date,9/2/2022,9/2/2022,12/8/2017,12/8/2017
ISRC,TCAGJ2289254,TCAGJ2289254,USWL11700269,USWL11700269
All Time Rank,2424,2424,3441,3441
Track Score,28.9,28.9,23.3,23.3
Spotify Streams,227893586,227893586,579189526,579189526
Spotify Playlist Count,28139,28139,60397,60397
Spotify Playlist Reach,12480714,12480714,11805084,11805084


In [14]:
# Remove the duplicates
df = df.drop_duplicates(subset='ISRC')
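The visual check above (comparing duplicated rows side by side) could also be automated. The sketch below uses a made-up toy frame and a hypothetical variable name `conflicting`; it flags any ISRC whose repeated rows are not identical, which is the case where `drop_duplicates(subset='ISRC')` would silently discard differing data.

```python
import pandas as pd

# Toy frame with made-up values: "AAA" is an exact duplicate, "CCC" is not.
toy = pd.DataFrame({
    "ISRC": ["AAA", "AAA", "BBB", "CCC", "CCC"],
    "Track": ["x", "x", "y", "z", "z"],
    "Spotify Streams": ["10", "10", "20", "30", "31"],
})

# All rows whose ISRC appears more than once.
dupes = toy[toy.duplicated(subset="ISRC", keep=False)]

# Count the distinct full rows per ISRC; 1 means the copies are identical.
distinct_rows = dupes.drop_duplicates().groupby("ISRC").size()
conflicting = distinct_rows[distinct_rows > 1].index.tolist()
print(conflicting)
```

An empty `conflicting` list would suggest that dropping by ISRC loses no information.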

In [15]:
df[df['Artist'].isna()]

Unnamed: 0,Track,Album Name,Artist,Release Date,ISRC,All Time Rank,Track Score,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,...,AirPlay Spins,SiriusXM Spins,Deezer Playlist Count,Deezer Playlist Reach,Amazon Playlist Count,Pandora Streams,Pandora Track Stations,Soundcloud Streams,Shazam Counts,Explicit Track
311,Cool,JnD Mix,,5/25/2024,QZNWQ2410638,311,86.5,,,,...,,,,,,,,,8624577.0,0
480,I Wanna Party,I Wanna Party - Single,,5/31/2024,QZYFZ2445017,482,70.3,,,,...,,,,,,,,,,0
1345,Marlboro Remix,Marlboro Remix - Single,,6/7/2024,QZNWT2471497,1343,40.6,,,,...,,,,,,,,,504979.0,0
1561,Melting,Melting - Single,,6/10/2024,QZNWU2402635,1553,37.2,,,,...,,,,,,,,,289134.0,0
3402,La ï¿½ï¿½ltima Vez (Yo Te Per,La ï¿½ï¿½ltima Vez (Yo Te Perdï¿½ï¿½),,5/2/2024,MX2832415361,3381,23.6,,,,...,,,,,,,,,1606561.0,0


In [16]:
df[df['Artist'].isna()].info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 311 to 3402
Data columns (total 28 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Track                       5 non-null      object 
 1   Album Name                  5 non-null      object 
 2   Artist                      0 non-null      object 
 3   Release Date                5 non-null      object 
 4   ISRC                        5 non-null      object 
 5   All Time Rank               5 non-null      object 
 6   Track Score                 5 non-null      float64
 7   Spotify Streams             0 non-null      object 
 8   Spotify Playlist Count      0 non-null      object 
 9   Spotify Playlist Reach      0 non-null      object 
 10  Spotify Popularity          0 non-null      float64
 11  YouTube Views               0 non-null      object 
 12  YouTube Likes               0 non-null      object 
 13  TikTok Posts                0 non-null 

In [17]:
# Remove the songs without an artist; there are only 5.
# Most of the other fields for these songs are null as well.
# Depending on how the objective is defined they might need to be brought back, though that seems unlikely.
df = df[~df['Artist'].isna()].copy()
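To back the observation that the artist-less rows are mostly empty, the share of missing fields per row can be computed. This is a minimal sketch on made-up toy data; `null_share` is a hypothetical name introduced here.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; row 1 has no artist.
toy = pd.DataFrame({
    "Track": ["a", "b", "c"],
    "Artist": ["X", np.nan, "Z"],
    "Spotify Streams": ["1,000", np.nan, "3,000"],
    "YouTube Views": ["5", np.nan, np.nan],
})

# Fraction of missing fields in each row that lacks an artist.
no_artist = toy[toy["Artist"].isna()]
null_share = no_artist.isna().mean(axis=1)
print(null_share)
```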

In [18]:
df.head()

Unnamed: 0,Track,Album Name,Artist,Release Date,ISRC,All Time Rank,Track Score,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,...,AirPlay Spins,SiriusXM Spins,Deezer Playlist Count,Deezer Playlist Reach,Amazon Playlist Count,Pandora Streams,Pandora Track Stations,Soundcloud Streams,Shazam Counts,Explicit Track
0,MILLION DOLLAR BABY,Million Dollar Baby - Single,Tommy Richman,4/26/2024,QM24S2402528,1,725.4,390470936,30716,196631588,...,40975,684,62.0,17598718,114.0,18004655,22931,4818457.0,2669262,0
1,Not Like Us,Not Like Us,Kendrick Lamar,5/4/2024,USUG12400910,2,545.9,323703884,28113,174597137,...,40778,3,67.0,10422430,111.0,7780028,28444,6623075.0,1118279,1
2,i like the way you kiss me,I like the way you kiss me,Artemas,3/19/2024,QZJ842400387,3,538.4,601309283,54331,211607669,...,74333,536,136.0,36321847,172.0,5022621,5639,7208651.0,5285340,0
3,Flowers,Flowers - Single,Miley Cyrus,1/12/2023,USSM12209777,4,444.9,2031280633,269802,136569078,...,1474799,2182,264.0,24684248,210.0,190260277,203384,,11822942,0
4,Houdini,Houdini,Eminem,5/31/2024,USUG12403398,5,423.3,107034922,7223,151469874,...,12185,1,82.0,17660624,105.0,4493884,7006,207179.0,457017,1


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4593 entries, 0 to 4599
Data columns (total 28 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Track                       4593 non-null   object 
 1   Album Name                  4593 non-null   object 
 2   Artist                      4593 non-null   object 
 3   Release Date                4593 non-null   object 
 4   ISRC                        4593 non-null   object 
 5   All Time Rank               4593 non-null   object 
 6   Track Score                 4593 non-null   float64
 7   Spotify Streams             4485 non-null   object 
 8   Spotify Playlist Count      4528 non-null   object 
 9   Spotify Playlist Reach      4526 non-null   object 
 10  Spotify Popularity          3794 non-null   float64
 11  YouTube Views               4290 non-null   object 
 12  YouTube Likes               4283 non-null   object 
 13  TikTok Posts                3425 non-n

In [20]:
# Convert the release date to datetime
df['Release Date'] = pd.to_datetime(df['Release Date'], format='%m/%d/%Y')
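With an explicit `format`, `pd.to_datetime` raises on the first value that does not match. If malformed dates were a concern, one option (an assumption, not what the notebook does) is `errors="coerce"`, which turns unparseable values into `NaT` so they can be inspected. A small sketch with made-up strings:

```python
import pandas as pd

# Made-up date strings; the last one is deliberately invalid.
dates = pd.Series(["4/26/2024", "5/4/2024", "not a date"])

# Unparseable values become NaT instead of raising an error.
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

# Original strings that failed to parse.
bad = dates[parsed.isna()]
print(bad.tolist())
```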

In [21]:
# Many string-typed variables are actually counts associated with platforms.
# Many have missing values; these are filled with 0, assuming they are counts of streams, playlists, views, etc.
cols_plataformas = ['Spotify Streams', 'Spotify Playlist Count', 'Spotify Playlist Reach', 'YouTube Views', 'YouTube Likes', 'TikTok Posts',
                    'TikTok Likes', 'TikTok Views', 'YouTube Playlist Reach', 'AirPlay Spins', 'SiriusXM Spins', 'Deezer Playlist Reach',
                    'Pandora Streams', 'Pandora Track Stations', 'Soundcloud Streams', 'Shazam Counts', 'All Time Rank']

for col_p in cols_plataformas:
    df[col_p] = df[col_p].str.replace(',', '').fillna(0).astype(float)
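Filling missing counts with 0 is a modeling choice that affects summary statistics. The sketch below, on a made-up three-value series, contrasts keeping the gaps as NaN with filling them with 0; `pd.to_numeric` is used here as an alternative to `astype(float)`, not as the notebook's method.

```python
import numpy as np
import pandas as pd

# Made-up values with thousands separators and one missing entry.
raw = pd.Series(["1,000", np.nan, "3,000"], name="Spotify Streams")

# Strip the separators and convert; missing entries stay NaN.
as_nan = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

# Same conversion, but treating missing entries as zero.
as_zero = as_nan.fillna(0)

print(as_nan.mean())   # mean over the two known values
print(as_zero.mean())  # mean pulled down by the filled zero
```

If a missing value means "not tracked" rather than "no plays", keeping NaN avoids biasing averages downward.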

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4593 entries, 0 to 4599
Data columns (total 28 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Track                       4593 non-null   object        
 1   Album Name                  4593 non-null   object        
 2   Artist                      4593 non-null   object        
 3   Release Date                4593 non-null   datetime64[ns]
 4   ISRC                        4593 non-null   object        
 5   All Time Rank               4593 non-null   float64       
 6   Track Score                 4593 non-null   float64       
 7   Spotify Streams             4593 non-null   float64       
 8   Spotify Playlist Count      4593 non-null   float64       
 9   Spotify Playlist Reach      4593 non-null   float64       
 10  Spotify Popularity          3794 non-null   float64       
 11  YouTube Views               4593 non-null   float64       
 1

In [23]:
# All Time Rank has repeated values, but the rows carry different data, so they are kept for now.
df[df.duplicated(subset='All Time Rank', keep=False)].sort_values('All Time Rank')

Unnamed: 0,Track,Album Name,Artist,Release Date,ISRC,All Time Rank,Track Score,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,...,AirPlay Spins,SiriusXM Spins,Deezer Playlist Count,Deezer Playlist Reach,Amazon Playlist Count,Pandora Streams,Pandora Track Stations,Soundcloud Streams,Shazam Counts,Explicit Track
310,Danza Kuduro - Cover,A collection of Western music that will get yo...,MUSIC LAB JPN,2023-11-15,TCJPX2396779,355.0,86.6,1627430000.0,12.0,163.0,...,0.0,0.0,,0.0,,0.0,0.0,0.0,0.0,1
355,deja vu,deja vu,Olivia Rodrigo,2021-04-01,USUG12101240,355.0,81.3,1606976000.0,172376.0,64346126.0,...,279408.0,332.0,27.0,76887.0,49.0,62922329.0,74501.0,0.0,3545397.0,1
400,I DONï¿½ï¿½ï¿½T WANNA DO THIS A,King of the Dead,Juliï¿½ï¿½n Kh,2024-03-18,QZLL92480334,454.0,76.5,564444800.0,5.0,20.0,...,0.0,0.0,,0.0,,0.0,0.0,0.0,0.0,1
455,Unforgettable,EPIC AF,French Montana,2017-01-01,USSM11703478,454.0,72.6,2065697000.0,275044.0,91691537.0,...,105138.0,267.0,114.0,739260.0,45.0,505767047.0,265307.0,0.0,0.0,1
553,DIME QUE,DIME QUE,Ian G.,2021-07-16,QZS632157305,559.0,65.4,0.0,0.0,0.0,...,0.0,0.0,,0.0,,0.0,0.0,210524496.0,5.0,0
564,3D (feat. Jack Harlow),3D (feat. Jack Harlow),Jung Kook,2023-09-29,USA2P2346470,559.0,64.3,573249000.0,27154.0,33077841.0,...,0.0,25.0,30.0,1569004.0,49.0,2506963.0,1609.0,0.0,1379108.0,1
557,Broccoli,Broccoli - Single,ati2x06,2024-01-17,ES64E2455101,626.0,64.9,846435900.0,1.0,42.0,...,0.0,0.0,,0.0,,0.0,0.0,0.0,0.0,0
630,SICKO MODE,ASTROWORLD,Travis Scott,2018-08-03,USSM11806660,626.0,60.9,2134272000.0,330314.0,97027433.0,...,79741.0,107.0,70.0,834774.0,115.0,307942023.0,300265.0,0.0,10345810.0,1
1108,PRINCESITA DE ...,ýýNFASIS,Jere Klein,2023-12-01,QMFMF2334329,1103.0,45.5,106290200.0,11766.0,9230911.0,...,388.0,0.0,3.0,122602.0,,85692.0,48.0,0.0,195908.0,1
997,Cake By The Ocean - Cover,NEW TIK&TOK - BUZZ BEST -,MUSIC LAB JPN,2023-06-07,TCJPV2340543,1103.0,48.2,1611085000.0,0.0,0.0,...,0.0,0.0,,0.0,,0.0,0.0,0.0,0.0,1


In [24]:
df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max,std
Track,4593.0,4365.0,Danza Kuduro - Cover,13.0,,,,,,,
Album Name,4593.0,4000.0,Un Verano Sin Ti,20.0,,,,,,,
Artist,4593.0,1999.0,Taylor Swift,63.0,,,,,,,
Release Date,4593.0,,,,2021-01-26 02:58:42.403657728,1987-07-21 00:00:00,2019-07-16 00:00:00,2022-05-31 00:00:00,2023-08-11 00:00:00,2024-06-14 00:00:00,
ISRC,4593.0,4593.0,QM24S2402528,1.0,,,,,,,
All Time Rank,4593.0,,,,2291.352928,1.0,1145.0,2291.0,3437.0,4998.0,1322.967651
Track Score,4593.0,,,,41.840235,19.4,23.3,29.9,44.4,725.4,38.562767
Spotify Streams,4593.0,,,,436886522.567603,0.0,63642484.0,226576178.0,613000788.0,4281468720.0,536481420.949785
Spotify Playlist Count,4593.0,,,,58556.917701,0.0,6068.0,31275.0,85006.0,590392.0,70987.892886
Spotify Playlist Reach,4593.0,,,,23011125.038101,0.0,4517785.0,12953022.0,29317904.0,262343414.0,29608078.192488


In [38]:
# Add year and month
df['Release Date Year'] = df['Release Date'].dt.year
df['Release Date Month'] = df['Release Date'].dt.month_name()
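Month names stored as plain strings sort alphabetically (April before January) in later group-bys and plots. One possible remedy is an ordered categorical, sketched below on made-up dates. It assumes the default English locale for both `month_name()` and `calendar.month_name`.

```python
import calendar
import pandas as pd

# Made-up release dates.
toy = pd.DataFrame({
    "Release Date": pd.to_datetime(["2024-04-26", "2023-01-12", "2024-05-31"])
})
toy["Release Date Month"] = toy["Release Date"].dt.month_name()

# Calendar order: January..December (index 0 of month_name is an empty string).
month_order = list(calendar.month_name)[1:]
toy["Release Date Month"] = pd.Categorical(
    toy["Release Date Month"], categories=month_order, ordered=True
)

# Groups now come out in calendar order rather than alphabetical order.
counts = toy.groupby("Release Date Month", observed=True).size()
print(counts)
```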

In [43]:
df.groupby('Release Date Year').sum(numeric_only=True).T

Release Date Year,1987,1991,1994,1998,1999,2000,2001,2002,2003,2004,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
All Time Rank,3931.0,4510.0,2097.0,4155.0,3730.0,4599.0,4361.0,14950.0,10465.0,16926.0,...,221411.0,282867.0,498045.0,600912.0,677727.0,839853.0,954650.0,1574966.0,2580740.0,1442027.0
Track Score,21.5,19.6,31.5,20.8,22.2,60.8,20.1,130.3,71.3,121.2,...,3646.9,4652.9,8060.0,9207.4,11656.6,13695.2,15562.8,28906.2,50100.4,37247.0
Spotify Streams,1879386000.0,2021910000.0,1810650000.0,178339900.0,1405354000.0,3843533000.0,399145200.0,9063840000.0,4508044000.0,6163006000.0,...,101825800000.0,121976300000.0,189508100000.0,190967100000.0,197469500000.0,181546700000.0,189608800000.0,209409300000.0,225245800000.0,69397260000.0
Spotify Playlist Count,295491.0,410054.0,109411.0,189706.0,276377.0,677692.0,94038.0,1434701.0,707254.0,1024487.0,...,12032350.0,14327590.0,22504370.0,24707090.0,28784230.0,29775070.0,30026340.0,31328310.0,24526090.0,3021930.0
Spotify Playlist Reach,96032050.0,113407000.0,45003970.0,77373840.0,115315500.0,174574100.0,14869140.0,436269100.0,208184100.0,328673500.0,...,3763328000.0,4245182000.0,6145930000.0,5979434000.0,7629016000.0,6940408000.0,8316037000.0,11651160000.0,17439260000.0,17327050000.0
Spotify Popularity,80.0,77.0,60.0,72.0,79.0,163.0,68.0,390.0,240.0,367.0,...,6417.0,7991.0,13285.0,15193.0,18844.0,21785.0,24337.0,38706.0,48104.0,25361.0
YouTube Views,2007461000.0,348081700.0,824500200.0,1499386000.0,1518597000.0,3336688000.0,1632845000.0,6675818000.0,4170808000.0,5594428000.0,...,126822000000.0,138250600000.0,208754100000.0,186945900000.0,165556300000.0,141741200000.0,112600400000.0,139507000000.0,136857100000.0,17428060000.0
YouTube Likes,11222080.0,4710499.0,7455155.0,12220360.0,7781636.0,25692750.0,11706970.0,43264820.0,29442680.0,20620950.0,...,760184700.0,848596900.0,1407272000.0,1420957000.0,1416435000.0,1337382000.0,1127686000.0,998819200.0,973930400.0,237528000.0
TikTok Posts,623935.0,264172.0,18894720.0,7779004.0,52722.0,597235.0,70375.0,1208088.0,1215588.0,773127.0,...,117267200.0,121726700.0,243077400.0,271266700.0,403133100.0,526946400.0,313520700.0,533081400.0,414117700.0,41562580.0
TikTok Likes,57957600.0,95247740.0,1432228000.0,94939590.0,16291850.0,150733700.0,19778660.0,123319400.0,272930100.0,209173000.0,...,7422776000.0,7860850000.0,17872740000.0,26430750000.0,37934370000.0,56829160000.0,64080600000.0,73174180000.0,41296270000.0,4644734000.0
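Summing every numeric column per year mixes quantities where a sum is meaningful (streams, views) with ones where it is not (ranks, scores, popularity). A possible refinement, sketched here on made-up data with hypothetical output column names, is named aggregation with a statistic chosen per column.

```python
import pandas as pd

# Made-up values standing in for the cleaned dataset.
toy = pd.DataFrame({
    "Release Date Year": [2023, 2023, 2024],
    "Track Score": [40.0, 60.0, 70.0],
    "Spotify Streams": [100.0, 300.0, 50.0],
    "All Time Rank": [10.0, 4.0, 1.0],
})

# One aggregation per column: count tracks, average scores, sum streams, best rank.
summary = toy.groupby("Release Date Year").agg(
    tracks=("Track Score", "count"),
    mean_track_score=("Track Score", "mean"),
    total_spotify_streams=("Spotify Streams", "sum"),
    best_rank=("All Time Rank", "min"),
)
print(summary)
```

Including the track count per year also makes it easier to judge whether a large total reflects popular songs or simply more songs released that year.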
