# Spotify and YouTube Analysis

## Description
A comparison of various metrics across YouTube and Spotify for a range of music tracks.

## About the Dataset
Link: https://www.kaggle.com/datasets/salvatorerastelli/spotify-and-youtube

The dataset contains songs by artists from around the world. For each song it records:  
Several statistics for the track on Spotify, including the number of streams.  
The number of views of the song's official music video on YouTube.  

It includes 26 variables for each of the songs collected from Spotify. These variables are briefly described below:

**Track**: name of the song, as visible on the Spotify platform.  
**Artist**: name of the artist.  
**Url_spotify**: the URL of the artist on Spotify.  
**Album**: the album in which the song is contained on Spotify.  
**Album_type**: indicates whether the song is released on Spotify as a single or contained in an album.  
**Uri**: a Spotify link used to find the song through the API.  
**Danceability**: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.  
**Energy**: is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.  
**Key**: the key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.  
**Loudness**: the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.  
**Speechiness**: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.  
**Acousticness**: a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.  
**Instrumentalness**: predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.  
**Liveness**: detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.  
**Valence**: a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).  
**Tempo**: the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.  
**Duration_ms**: the duration of the track in milliseconds.  
**Stream**: number of streams of the song on Spotify.  
**Url_youtube**: URL of the video linked to the song on YouTube, if it has one.  
**Title**: title of the video clip on YouTube.  
**Channel**: name of the channel that published the video.  
**Views**: number of views.  
**Likes**: number of likes.  
**Comments**: number of comments.  
**Description**: description of the video on YouTube.  
**Licensed**: indicates whether the video represents licensed content, which means that the content was uploaded to a channel linked to a YouTube content partner and then claimed by that partner.  
**official_video**: boolean value that indicates if the video found is the official video of the song.  
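
Two of the fields above are encoded rather than directly readable: `Key` uses pitch-class integers and `Speechiness` is interpreted through thresholds. The following is a minimal sketch of how they could be decoded. The helper names `key_to_pitch` and `speechiness_category`, the category labels, and the handling of values exactly at 0.33 and 0.66 are assumptions made for this example and are not part of the dataset.

```python
import pandas as pd

# Standard pitch-class names: index 0 = C, 1 = C♯/D♭, 2 = D, and so on.
PITCH_CLASSES = [
    "C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
    "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B",
]


def key_to_pitch(key):
    """Translate a Key value into a pitch name (hypothetical helper)."""
    # -1 means no key was detected; NaN means the value is missing.
    if pd.isna(key) or int(key) == -1:
        return "unknown"
    return PITCH_CLASSES[int(key)]


def speechiness_category(value):
    """Bucket Speechiness using the thresholds from the description above."""
    if value > 0.66:
        return "spoken word"
    if value >= 0.33:  # boundary handling is an assumption
        return "mixed"
    return "music"


# Toy rows; the first uses values shown for "Feel Good Inc." in this notebook.
sample = pd.DataFrame({"Key": [6.0, -1.0], "Speechiness": [0.177, 0.9]})
sample["Key_name"] = sample["Key"].map(key_to_pitch)
sample["Speech_type"] = sample["Speechiness"].map(speechiness_category)
print(sample)
```
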

In [104]:
import pandas as pd

path_to_csv = "./Spotify_Youtube.csv"

spyt_df = pd.read_csv(path_to_csv)

spyt_df.head()

Unnamed: 0.1,Unnamed: 0,Artist,Url_spotify,Track,Album,Album_type,Uri,Danceability,Energy,Key,...,Url_youtube,Title,Channel,Views,Likes,Comments,Description,Licensed,official_video,Stream
0,0,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Feel Good Inc.,Demon Days,album,spotify:track:0d28khcov6AiegSCpG5TuT,0.818,0.705,6.0,...,https://www.youtube.com/watch?v=HyHNuVaZJ-k,Gorillaz - Feel Good Inc. (Official Video),Gorillaz,693555221.0,6220896.0,169907.0,Official HD Video for Gorillaz' fantastic trac...,True,True,1040235000.0
1,1,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Rhinestone Eyes,Plastic Beach,album,spotify:track:1foMv2HQwfQ2vntFf9HFeG,0.676,0.703,8.0,...,https://www.youtube.com/watch?v=yYDmaexVHic,Gorillaz - Rhinestone Eyes [Storyboard Film] (...,Gorillaz,72011645.0,1079128.0,31003.0,The official video for Gorillaz - Rhinestone E...,True,True,310083700.0
2,2,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,New Gold (feat. Tame Impala and Bootie Brown),New Gold (feat. Tame Impala and Bootie Brown),single,spotify:track:64dLd6rVqDLtkXFYrEUHIU,0.695,0.923,1.0,...,https://www.youtube.com/watch?v=qJa-VFwPpYA,Gorillaz - New Gold ft. Tame Impala & Bootie B...,Gorillaz,8435055.0,282142.0,7399.0,Gorillaz - New Gold ft. Tame Impala & Bootie B...,True,True,63063470.0
3,3,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,On Melancholy Hill,Plastic Beach,album,spotify:track:0q6LuUqGLUiCPP1cbdwFs3,0.689,0.739,2.0,...,https://www.youtube.com/watch?v=04mfKJWDSzI,Gorillaz - On Melancholy Hill (Official Video),Gorillaz,211754952.0,1788577.0,55229.0,Follow Gorillaz online:\nhttp://gorillaz.com \...,True,True,434663600.0
4,4,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Clint Eastwood,Gorillaz,album,spotify:track:7yMiX7n9SBvadzox8T5jzT,0.663,0.694,10.0,...,https://www.youtube.com/watch?v=1V_xRb0x9aw,Gorillaz - Clint Eastwood (Official Video),Gorillaz,618480958.0,6197318.0,155930.0,The official music video for Gorillaz - Clint ...,True,True,617259700.0


In [69]:
# Clean up unnecessary columns
spyt_df.drop(columns=["Unnamed: 0", "Url_spotify", "Uri", "Url_youtube", "Description", "Title", "Channel"], inplace=True)

In [None]:
# Count fully duplicated rows
duplicates = spyt_df.duplicated().sum()
print(duplicates)

## No duplicates found

0
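
Note that `duplicated()` without arguments only flags rows that match in every column. The sketch below uses made-up toy data to show the difference between a full-row check and a check restricted to a `subset` of columns. Treating `Artist` plus `Track` as the identity of a song is an assumption for illustration, not something the notebook establishes.

```python
import pandas as pd

# Toy data: rows 0 and 1 are identical; rows 2 and 3 share Artist and Track
# but differ in Album.
toy = pd.DataFrame({
    "Artist": ["A", "A", "B", "B"],
    "Track": ["x", "x", "y", "y"],
    "Album": ["one", "one", "two", "three"],
})

# Full-row duplicates: only later copies of an identical row are counted.
full_row_dupes = int(toy.duplicated().sum())

# Same song by the assumed identity (Artist, Track), regardless of Album.
same_song_dupes = int(toy.duplicated(subset=["Artist", "Track"]).sum())

# Counting does not remove anything; drop_duplicates() does.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(full_row_dupes, same_song_dupes, len(deduped))
```
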


In [71]:
# Count missing values per column
missing = spyt_df.isnull().sum()
print(missing)

Artist                0
Track                 0
Album                 0
Album_type            0
Danceability          2
Energy                2
Key                   2
Loudness              2
Speechiness           2
Acousticness          2
Instrumentalness      2
Liveness              2
Valence               2
Tempo                 2
Duration_ms           2
Views               470
Likes               541
Comments            569
Licensed            470
official_video      470
Stream              576
dtype: int64


#### Checked what share of the overall data is missing for the most relevant metrics

Most affected columns:  
- Spotify-related fields: Stream ~2.78%  
- YouTube-related fields: Views ~2.27%  

In [72]:
missing_stream = spyt_df["Stream"].isnull().sum()
percent_missing = (missing_stream / len(spyt_df)) * 100
print(f"Dropping Stream nulls would remove {missing_stream} rows ({percent_missing:.2f}%)")

Dropping Stream nulls would remove 576 rows (2.78%)


In [73]:
missing_views = spyt_df["Views"].isnull().sum()
percent_missing = (missing_views / len(spyt_df)) * 100
print(f"Dropping Views nulls would remove {missing_views} rows ({percent_missing:.2f}%)")

Dropping Views nulls would remove 470 rows (2.27%)
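
The two checks above repeat the same calculation for one column each. As a sketch, the same numbers could be produced for every column at once. The function name `missing_summary` and the toy DataFrame are hypothetical and only serve to illustrate the idea.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return count and percentage of missing values per column (hypothetical helper)."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),
        # The mean of a boolean mask is the fraction of True values.
        "percent": (df.isnull().mean() * 100).round(2),
    })
    # Keep only columns that actually have gaps, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy stand-in for spyt_df.
toy = pd.DataFrame({
    "Artist": ["a", "b", "c", "d"],
    "Views": [1.0, np.nan, 3.0, 4.0],
    "Stream": [np.nan, np.nan, 30.0, 40.0],
})

report = missing_summary(toy)
print(report)
```

On the real data this would be called as `missing_summary(spyt_df)`.
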


In [None]:
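# Flag rows with missing YouTube or Spotify data before filling anything
spyt_df["missing_youtube_data"] = spyt_df["Views"].isnull()
spyt_df["missing_spotify_data"] = spyt_df["Stream"].isnull()
# Assign the result back: chained fillna(inplace=True) on a selected column
# does not reliably update the DataFrame in recent pandas versions
spyt_df["Comments"] = spyt_df["Comments"].fillna(0)
spyt_df["Likes"] = spyt_df["Likes"].fillna(0)
# Drop rows where Danceability is missing
spyt_df.dropna(subset=["Danceability"], inplace=True)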
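
The order of operations in this cell matters: the missing-data flags are recorded before any values are filled, so they still reflect the raw data afterwards. The following sketch walks through the same flag, fill, drop, and filter pattern on a made-up four-row DataFrame; the values are invented for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Track": ["a", "b", "c", "d"],
    "Danceability": [0.5, 0.6, np.nan, 0.8],
    "Views": [100.0, np.nan, 300.0, 400.0],
    "Likes": [10.0, np.nan, 30.0, 40.0],
    "Stream": [1000.0, 2000.0, 3000.0, np.nan],
})

# 1. Record which rows lack platform data while the gaps are still visible.
toy["missing_youtube_data"] = toy["Views"].isnull()
toy["missing_spotify_data"] = toy["Stream"].isnull()

# 2. Fill engagement counts with 0. After this step a filled 0 cannot be
#    told apart from a genuine 0, which is why the flags come first.
toy["Likes"] = toy["Likes"].fillna(0)

# 3. Drop rows without audio features.
toy = toy.dropna(subset=["Danceability"])

# 4. Keep only rows that have data from both platforms.
complete = toy[~toy["missing_youtube_data"] & ~toy["missing_spotify_data"]].copy()

print(complete["Track"].tolist())
```
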


In [96]:
# Keep rows with data from both platforms; .copy() avoids SettingWithCopyWarning on later assignments
spyt_df_clean = spyt_df[~spyt_df["missing_youtube_data"] & ~spyt_df["missing_spotify_data"]].copy()
missing = spyt_df_clean.isnull().sum()
print(missing)

Artist                  0
Track                   0
Album                   0
Album_type              0
Danceability            0
Energy                  0
Key                     0
Loudness                0
Speechiness             0
Acousticness            0
Instrumentalness        0
Liveness                0
Valence                 0
Tempo                   0
Duration_ms             0
Views                   0
Likes                   0
Comments                0
Licensed                0
official_video          0
Stream                  0
missing_youtube_data    0
missing_spotify_data    0
dtype: int64


In [97]:
spyt_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19691 entries, 0 to 20717
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Artist                19691 non-null  object 
 1   Track                 19691 non-null  object 
 2   Album                 19691 non-null  object 
 3   Album_type            19691 non-null  object 
 4   Danceability          19691 non-null  float64
 5   Energy                19691 non-null  float64
 6   Key                   19691 non-null  float64
 7   Loudness              19691 non-null  float64
 8   Speechiness           19691 non-null  float64
 9   Acousticness          19691 non-null  float64
 10  Instrumentalness      19691 non-null  float64
 11  Liveness              19691 non-null  float64
 12  Valence               19691 non-null  float64
 13  Tempo                 19691 non-null  float64
 14  Duration_ms           19691 non-null  float64
 15  Views                 19

In [101]:
# Convert columns to boolean
# astype returns a new DataFrame, so reassign instead of writing through .loc
spyt_df_clean = spyt_df_clean.astype({"Licensed": bool, "official_video": bool})

# Check data types
print(spyt_df_clean[["Licensed", "official_video"]].dtypes)

Licensed          bool
official_video    bool
dtype: object
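
One caveat with `astype(bool)`: a missing value in an object column is converted to `True`, because `bool(float("nan"))` is `True` in Python. The conversion above is safe only because rows with missing YouTube data were filtered out first. The sketch below demonstrates the behaviour on a made-up Series and shows pandas' nullable `"boolean"` dtype as one alternative that keeps the gap visible.

```python
import numpy as np
import pandas as pd

# Toy object column with one missing value, like Licensed before cleaning.
flags = pd.Series([True, False, np.nan], dtype=object)

# Plain bool: the NaN silently becomes True.
as_plain_bool = flags.astype(bool)

# Nullable boolean: the NaN is preserved as <NA>.
as_nullable = flags.astype("boolean")

print(as_plain_bool.tolist())
print(int(as_nullable.isna().sum()))
```
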


In [105]:
spyt_df_clean.head(10)

Unnamed: 0,Artist,Track,Album,Album_type,Danceability,Energy,Key,Loudness,Speechiness,Acousticness,...,Tempo,Duration_ms,Views,Likes,Comments,Licensed,official_video,Stream,missing_youtube_data,missing_spotify_data
0,Gorillaz,Feel Good Inc.,Demon Days,album,0.818,0.705,6.0,-6.679,0.177,0.00836,...,138.559,222640.0,693555221.0,6220896.0,169907.0,True,True,1040235000.0,False,False
1,Gorillaz,Rhinestone Eyes,Plastic Beach,album,0.676,0.703,8.0,-5.815,0.0302,0.0869,...,92.761,200173.0,72011645.0,1079128.0,31003.0,True,True,310083700.0,False,False
2,Gorillaz,New Gold (feat. Tame Impala and Bootie Brown),New Gold (feat. Tame Impala and Bootie Brown),single,0.695,0.923,1.0,-3.93,0.0522,0.0425,...,108.014,215150.0,8435055.0,282142.0,7399.0,True,True,63063470.0,False,False
3,Gorillaz,On Melancholy Hill,Plastic Beach,album,0.689,0.739,2.0,-5.81,0.026,1.5e-05,...,120.423,233867.0,211754952.0,1788577.0,55229.0,True,True,434663600.0,False,False
4,Gorillaz,Clint Eastwood,Gorillaz,album,0.663,0.694,10.0,-8.627,0.171,0.0253,...,167.953,340920.0,618480958.0,6197318.0,155930.0,True,True,617259700.0,False,False
5,Gorillaz,DARE,Demon Days,album,0.76,0.891,11.0,-5.852,0.0372,0.0229,...,120.264,245000.0,259021161.0,1844658.0,72008.0,True,True,323850300.0,False,False
6,Gorillaz,New Gold (feat. Tame Impala and Bootie Brown) ...,New Gold (feat. Tame Impala and Bootie Brown) ...,single,0.716,0.897,4.0,-7.185,0.0629,0.012,...,127.03,274142.0,451996.0,11686.0,241.0,False,True,10666150.0,False,False
7,Gorillaz,She's My Collar (feat. Kali Uchis),Humanz (Deluxe),album,0.726,0.815,11.0,-5.886,0.0313,0.00799,...,140.158,209560.0,1010982.0,17675.0,260.0,False,False,159605900.0,False,False
8,Gorillaz,Cracker Island (feat. Thundercat),Cracker Island (feat. Thundercat),single,0.741,0.913,2.0,-3.34,0.0465,0.00343,...,120.012,213750.0,24459820.0,739527.0,20296.0,True,True,42671900.0,False,False
9,Gorillaz,Dirty Harry,Demon Days,album,0.625,0.877,10.0,-7.176,0.162,0.0315,...,192.296,230426.0,154761056.0,1386920.0,39240.0,True,True,191074700.0,False,False
