# Establishing the Dataset

In [23]:
import numpy as np
import pandas as pd

In [24]:
data_src = "Datasets/music.csv"  # change if needed
df = pd.read_csv(data_src)
# The CSV stores the original row index in an unnamed first column
df.set_index("Unnamed: 0", inplace=True)
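As an alternative to calling `set_index` after loading, `pd.read_csv` accepts `index_col=0`, which uses the first column as the index at read time. The following is a minimal sketch on a hypothetical two-row stand-in for the CSV (the names `sample_csv` and `sample_df` and the placeholder values are assumptions for illustration, not part of the dataset):

```python
import io

import pandas as pd

# Hypothetical stand-in for Datasets/music.csv: an unnamed first column
# holding the row index, followed by placeholder data columns.
sample_csv = io.StringIO(
    ",Artist,Track\n"
    "0,Artist A,Track 1\n"
    "1,Artist A,Track 2\n"
)

# index_col=0 tells read_csv to use the first column as the index,
# so no separate set_index("Unnamed: 0") call is needed.
sample_df = pd.read_csv(sample_csv, index_col=0)

print(sample_df)
```

Either approach should give the same frame; `index_col=0` simply avoids depending on the auto-generated column name.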

## The Dataset

The dataset that I will be using for this project is hosted on Kaggle under the title **“Spotify and Youtube”**: https://www.kaggle.com/datasets/salvatorerastelli/spotify-and-youtube

The dataset can be accessed either by downloading the CSV file directly (the "tiny" form) or through the Kaggle API. This notebook reads the data from a CSV file, which can be found in the GitHub repo under "Datasets".

### Background Information: What is this Dataset?
- **Who?**
    **The dataset is published on Kaggle** by user *“salvatorerastelli”* and is released under a public domain license.
- **What?**
    This user compiled information on the top 10 songs of various Spotify artists, collecting a wide range of data such as **Spotify metadata, measurements of auditory features, and information on the affiliated music video on YouTube**.
- **When?**
    All of this data was recorded/updated as of **February 2023**. In total, the set contains **27 features and 20,718 entries**.
- **How?**
    **The dataset was collected through APIs**. The creator likely queried YouTube’s search or video data to assemble video metadata (such as Title, Views, Description, License). Spotify metadata is accessed through Spotify's Web API: https://developer.spotify.com/documentation/web-api, which provides a wide range of information including identifiers and Spotify's pre-calculated "audio features".
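The "top 10 songs per artist" description can be checked against the data. With 2,079 unique artists (see the column summary later in this notebook), exactly 10 tracks each would give 20,790 rows, slightly more than the 20,718 entries present, which suggests some artists have fewer than 10. A minimal sketch of that check, using a hypothetical toy frame (`toy` and its values are placeholders; in the notebook, `df` would be used instead):

```python
import pandas as pd

# Hypothetical toy frame standing in for df; only the Artist column matters here.
toy = pd.DataFrame({
    "Artist": ["A", "A", "A", "B", "B"],
    "Track": ["a1", "a2", "a3", "b1", "b2"],
})

# Number of rows (tracks) recorded for each artist.
tracks_per_artist = toy.groupby("Artist").size()

# How many artists have each track count, e.g. how many have exactly 10.
size_distribution = tracks_per_artist.value_counts().sort_index()

print(tracks_per_artist)
print(size_distribution)
```

On the real data, `size_distribution` would show whether most artists sit at exactly 10 tracks and how many fall short.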

### What are audio features?

Audio features are numerical measurements of certain attributes of a track, which Spotify calculates through its own audio analysis algorithms. These features can be direct measurements such as the key or tempo of the song, or more specialized and complex measurements like "danceability" and "valence". While the dataset page describes all of the features used, the following are the most important to define for later in this analysis:
- **Acousticness**: "A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic." AI models reportedly determine this metric by analyzing patterns of acoustic/traditional instruments (guitar strums, piano notes, natural percussion) versus electronically produced sounds (synths, drum machines).
- **Energy**: "Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy."
- **Loudness**: The average loudness of a track in decibels. "Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude)."
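
Since Acousticness and Energy are defined on the interval 0.0 to 1.0, a quick range check can confirm that the recorded values match the definitions above. The sketch below uses a hypothetical toy frame with one deliberately out-of-range value (`toy`, `unit_cols`, and all numbers are placeholders, not values from the dataset):

```python
import pandas as pd

# Hypothetical values; the 1.2 is deliberately outside the documented range.
toy = pd.DataFrame({
    "Acousticness": [0.02, 0.85, 1.2],
    "Energy": [0.9, 0.3, 0.5],
    "Loudness": [-5.1, -12.4, -8.0],  # decibels
})

unit_cols = ["Acousticness", "Energy"]

# between() is inclusive on both ends by default. NaN compares as False,
# so missing values are excluded explicitly and not counted as violations.
in_range = toy[unit_cols].apply(lambda s: s.between(0.0, 1.0))
ok = in_range | toy[unit_cols].isna()
violations = (~ok).sum()

# Loudness has no fixed 0-1 scale, so only its observed span is reported.
loudness_span = (toy["Loudness"].min(), toy["Loudness"].max())

print(violations)
print(loudness_span)
```

On the real data, any nonzero count in `violations` would point to rows worth inspecting before analysis.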

*References on auditory features:*
- Kumar, Aman. “Unleashing the Power of Audio Features with the Spotify API.” Medium, 22 Mar. 2024, onlyoneaman.medium.com/unleashing-the-power-of-audio-features-with-the-spotify-api-c544fda1af40. 
- “AI Prediction in Music: Deep Learning to Predict Hits • Studio Vi.” Studio Vi, 2 Oct. 2025, www.studiovi.com/article/from-beats-to-data-applying-ai-to-predict-hits/.

## Basic Inspection

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20718 entries, 0 to 20717
Data columns (total 27 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Artist            20718 non-null  object 
 1   Url_spotify       20718 non-null  object 
 2   Track             20718 non-null  object 
 3   Album             20718 non-null  object 
 4   Album_type        20718 non-null  object 
 5   Uri               20718 non-null  object 
 6   Danceability      20716 non-null  float64
 7   Energy            20716 non-null  float64
 8   Key               20716 non-null  float64
 9   Loudness          20716 non-null  float64
 10  Speechiness       20716 non-null  float64
 11  Acousticness      20716 non-null  float64
 12  Instrumentalness  20716 non-null  float64
 13  Liveness          20716 non-null  float64
 14  Valence           20716 non-null  float64
 15  Tempo             20716 non-null  float64
 16  Duration_ms       20716 non-null  float64
 17

In [26]:
# Per-column summary: dtype, number of distinct values, and count of missing values
COLS_df = pd.DataFrame({
    "column": df.columns,
    "dtype": df.dtypes.values,
    "unique_values": df.nunique().values,
    "missing_values": df.isna().sum().values,
})
COLS_df.set_index("column", inplace=True)

COLS_df

| column | dtype | unique_values | missing_values |
|---|---|---|---|
| Artist | object | 2079 | 0 |
| Url_spotify | object | 2079 | 0 |
| Track | object | 17841 | 0 |
| Album | object | 11937 | 0 |
| Album_type | object | 3 | 0 |
| Uri | object | 18862 | 0 |
| Danceability | float64 | 898 | 2 |
| Energy | float64 | 1268 | 2 |
| Key | float64 | 12 | 2 |
| Loudness | float64 | 9417 | 2 |

A simple summary of the column information suggests that every track has its Spotify metadata, only 2 entries are missing audio feature data, hundreds of entries are missing an affiliated YouTube video along with other YouTube video data, and many rows are also missing information on Spotify streams. Even so, the dataset is generally clean, since the missing values are few relative to its more than 20,000 entries.
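
To put "generally clean" on a numeric footing, the missing counts can be expressed as a share of all rows. For the audio features, 2 missing values out of 20,718 entries is roughly 0.01%. A minimal sketch of the calculation on a hypothetical toy frame (`toy` and its values are placeholders; `Views` stands in for any YouTube column):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values.
toy = pd.DataFrame({
    "Energy": [0.5, np.nan, 0.7, 0.9],
    "Views": [100.0, np.nan, np.nan, 250.0],
})

# isna().mean() gives the fraction of missing values per column;
# multiplying by 100 converts it to a percentage.
missing_pct = toy.isna().mean().mul(100).round(2)

# Number of rows with no missing values in any column.
complete_rows = len(toy.dropna())

print(missing_pct)
print(complete_rows)
```

Applied to `df`, the same two lines would show which columns drive the missingness and how many rows remain if incomplete rows were dropped.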