# artist-genre-evolution
This notebook is submitted as a prerequisite of CSMODEL and uses the [Spotify Music Dataset](https://www.kaggle.com/datasets/solomonameh/spotify-music-dataset) from Kaggle.

Our main goal is to answer the question **"Do artists in certain music genres exhibit more change in their musical features over time than others?"**

That is, does the genre an artist belongs to affect how much they *change* or *evolve* in terms of their musical features (danceability, energy, acousticness)?



## About the Dataset
__________

This project utilizes the "High Popularity Songs" portion of the Spotify Music Dataset, sourced from Kaggle. This dataset is a curated collection of songs from Spotify that are considered popular, specifically those **with a popularity score greater than 68.** 

For each song, the dataset provides a rich set of attributes, including quantitative audio features derived from Spotify's audio analysis (e.g., `danceability`, `energy`, `valence`) and descriptive metadata (e.g., `track_name`, `track_artist`, `release_date`).

## Data Collection & Potential Implications


The dataset was collated by querying the official Spotify Web API using custom Python scripts. The data for each track, including its audio features, is what Spotify provides directly to developers.

The method of collection has several key implications for our analysis:
-  **Popularity Bias**: The dataset is intentionally filtered to include only songs with a popularity score above 68. This means our insights will be specific to commercially successful music on the platform and cannot be generalized to all music or to less popular/niche genres. The factors driving popularity on Spotify (e.g., playlist placement, marketing, platform promotion) are complex and may introduce a bias towards mainstream music.

While there is a `low_popularity_songs` dataset, the researchers agreed not to use it because the primary research question requires analyzing an artist's musical evolution over time. **High-popularity artists are more likely to have extensive discographies with multiple release dates, providing sufficient data for a longitudinal analysis.** Conversely, low-popularity artists often have too few songs in the dataset to establish a meaningful timeline of change.

- **Standardized Metrics**: Since the audio features (energy, liveness, etc.) are calculated by Spotify's proprietary algorithms, they represent a consistent and standardized measurement system. However, these metrics are a "black box," and we must rely on Spotify's definitions for their interpretation.

- **Temporal Relevance**: The dataset represents a snapshot in time. Song popularity is highly dynamic, and the trends identified may be specific to the period in which the data was collected.

- **Playlist-Based Genre**: The `playlist_genre` and `playlist_subgenre` fields are derived from the playlists the songs appeared in, not from an official artist-level genre tag. An artist's songs may appear in playlists of various genres (e.g., a rock song in a "Workout" playlist categorized under Pop). This means our genre groupings are an approximation and could introduce noise into the analysis.
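One way to work with playlist-based genres is to give each artist a single "dominant" genre, namely the `playlist_genre` value that covers most of their rows. The sketch below is only an illustration on made-up rows: the helper `dominant_genre`, the `genre_share` column, and the alphabetical tie-break are assumptions, not part of the dataset or of the analysis in this notebook.

```python
import pandas as pd

# Toy rows standing in for the real data; all values are made up for illustration.
toy = pd.DataFrame({
    "track_artist": ["artist a", "artist a", "artist a", "artist b", "artist b"],
    "playlist_genre": ["pop", "pop", "rock", "latin", "latin"],
})

def dominant_genre(df):
    """Return each artist's most frequent playlist_genre and the share of rows it covers."""
    counts = (
        df.groupby(["track_artist", "playlist_genre"])
          .size()
          .rename("n_rows")
          .reset_index()
    )
    # Most frequent genre first within each artist; ties are broken alphabetically.
    counts = counts.sort_values(
        ["track_artist", "n_rows", "playlist_genre"],
        ascending=[True, False, True],
    )
    top = counts.drop_duplicates("track_artist").set_index("track_artist")
    totals = df.groupby("track_artist").size()
    top["genre_share"] = top["n_rows"] / totals
    return top[["playlist_genre", "genre_share"]]

artist_genre = dominant_genre(toy)
print(artist_genre)
```

A low `genre_share` flags artists whose genre label is ambiguous, which is exactly where the noise described above would enter the comparison.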


## Structure 


The data is presented in a tabular format within the `high_popularity_spotify_data.csv` file.
- Rows: Each row represents a single track as it appears in a source playlist, so a track that belongs to several playlists can occupy more than one row.
- Columns: Each column represents a specific attribute or feature of that track.
- Observations: The dataset contains 1686 observations (rows) and 29 features (columns).



## Attributes 



The 29 features can be broadly categorized into Audio Features and Descriptive Features.

**Audio Features**

These are quantitative features generated by Spotify's analysis of a track's audio.

- `energy`: A measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
- `tempo`: The speed of a track, measured in beats per minute (BPM).

- `danceability`: A score describing how suitable a track is for dancing based on tempo, rhythm stability, beat strength, and overall regularity.

- `loudness`: The overall loudness of a track in decibels (dB).

- `liveness`: The likelihood of a track being a live recording. Higher values suggest the presence of an audience.

- `valence`: The musical positiveness (emotion) of a track. High valence sounds happy; low valence sounds sad or angry.

- `speechiness`: Measures the presence of spoken words in a track.

- `instrumentalness`: The likelihood a track contains no vocals. Values near 1.0 suggest purely instrumental tracks.

- `mode`: Indicates the modality (major or minor) of the track.

- `key`: The musical key, represented as an integer (0 - 11) mapping to Pitch Class notation.

- `duration_ms`: The length of the track in milliseconds.

- `acousticness`: A confidence measure of whether a track is acoustic.
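Some of these features are encoded as integers and are easier to read once decoded. The sketch below maps `key` to pitch-class names and `mode` to major or minor (1 = major, 0 = minor in Spotify's encoding), and converts `duration_ms` to minutes. The helper `describe_key` and the `key_label` and `duration_min` columns are hypothetical names introduced here; the two sample rows reuse the `key`, `mode`, and `duration_ms` values of the first two rows of the dataset.

```python
import pandas as pd

# Standard pitch-class numbering: 0 = C, 1 = C#/Db, ..., 11 = B.
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]

def describe_key(key, mode):
    """Turn integer key and mode codes into a label such as 'F#/Gb minor'."""
    if key < 0 or key > 11:
        return "unknown"  # values outside 0-11 carry no pitch-class meaning
    quality = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {quality}"

# key, mode, and duration_ms of the first two rows of the dataset.
sample = pd.DataFrame({
    "key": [6, 2],
    "mode": [0, 1],
    "duration_ms": [251668, 210373],
})
sample["key_label"] = [describe_key(k, m) for k, m in zip(sample["key"], sample["mode"])]
sample["duration_min"] = sample["duration_ms"] / 60_000  # milliseconds to minutes
print(sample)
```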

**Descriptive Features**

These are metadata attributes that describe the track and its context.

- `track_name`: The name of the song.

- `track_artist`: The artist(s) who performed the song.

- `track_album_name`: The album the song belongs to.

- `track_album_release_date`: The release date of the album.

- `track_popularity`: A score (0 - 100) calculated by Spotify based on the total number of streams and how recent they are.

- `playlist_name`: The name of the playlist the track was sourced from.

- `playlist_genre`: The main genre associated with the source playlist.

- `playlist_subgenre`: A more specific subgenre of the source playlist.

- `track_id`: A unique identifier for the track, assigned by Spotify.

- `track_album_id`: A unique identifier for the album.

- `playlist_id`: A unique identifier for the source playlist.
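Because the analysis is longitudinal, `track_album_release_date` will need to be turned into something sortable. Spotify reports album release dates at day, month, or year precision, so the column may mix strings such as `2024-08-16`, `2019-05`, and `1975`; whether this particular file does so is an assumption to verify. A minimal sketch that sidesteps the issue by keeping only the year (the `release_year` column name and the toy values are hypothetical):

```python
import pandas as pd

# Hypothetical date strings at day, month, and year precision, plus one missing value.
toy = pd.DataFrame({
    "track_album_release_date": ["2024-08-16", "2019-05", "1975", None],
})

# The first four characters are the year regardless of precision.
# Anything that is not a number becomes missing instead of raising an error.
toy["release_year"] = pd.to_numeric(
    toy["track_album_release_date"].str[:4], errors="coerce"
).astype("Int64")
print(toy)
```

Working at year granularity loses within-year ordering, which seems acceptable when the goal is to track change across an artist's releases over several years.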

## Data Cleaning 

- Handling missing values
- Multiple representations of the same categorical value
- Incorrect data types
- Inconsistent formatting
- Filtering artists: we want to evaluate change over time, so we must filter out artists who have only one song, or whose songs all share a single release date.
- Checking for duplicates and outliers that might skew the metric
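The artist filter described above can be sketched as follows. This is a minimal illustration on made-up rows, assuming a threshold of at least two distinct release dates per artist; the constant `MIN_RELEASE_DATES` and the `eligible` frame are hypothetical names.

```python
import pandas as pd

# Toy rows: artist a has two release dates, artist b has one song,
# artist c has two songs released on the same date.
toy = pd.DataFrame({
    "track_artist": ["artist a", "artist a", "artist b", "artist c", "artist c"],
    "track_name": ["s1", "s2", "s3", "s4", "s5"],
    "track_album_release_date": [
        "2018-01-01", "2022-06-10", "2020-03-03", "2021-09-09", "2021-09-09",
    ],
})

MIN_RELEASE_DATES = 2  # assumed threshold; a stricter value may suit the analysis better

# Count distinct release dates per artist and broadcast the count back to every row.
dates_per_artist = (
    toy.groupby("track_artist")["track_album_release_date"].transform("nunique")
)
eligible = toy[dates_per_artist >= MIN_RELEASE_DATES].reset_index(drop=True)
print(eligible)
```

In the notebook itself the same filter would be applied to the frame with one row per artist (`high_df_cleaned`), so that collaborators are counted individually.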

For unstructured data, you must check for potential noise or unwanted data, or apply a modelling scheme to first convert it into a tabular format.
In the Notebook, explain all the procedures applied during the data cleaning process.


sample formula formatting:
$$p \pm ME$$
$$ME =  z^* \times \sqrt{\frac{p(1-p)}{n}}$$

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


high_df = pd.read_csv('high_popularity_spotify_data.csv')
high_df.head()

Unnamed: 0,energy,tempo,danceability,playlist_genre,loudness,liveness,valence,track_artist,time_signature,speechiness,...,instrumentalness,track_album_id,mode,key,duration_ms,acousticness,id,playlist_subgenre,type,playlist_id
0,0.592,157.969,0.521,pop,-7.777,0.122,0.535,"Lady Gaga, Bruno Mars",3,0.0304,...,0.0,10FLjwfpbxLmW8c25Xyc2N,0,6,251668,0.308,2plbrEY59IikOBgBGLjaoe,mainstream,audio_features,37i9dQZF1DXcBWIGoYBM5M
1,0.507,104.978,0.747,pop,-10.171,0.117,0.438,Billie Eilish,4,0.0358,...,0.0608,7aJuG4TFXa2hmE4z1yxc3n,1,2,210373,0.2,6dOtVTDdiauQNBQEDOtlAB,mainstream,audio_features,37i9dQZF1DXcBWIGoYBM5M
2,0.808,108.548,0.554,pop,-4.169,0.159,0.372,Gracie Abrams,4,0.0368,...,0.0,0hBRqPYPXhr1RkTDG3n4Mk,1,1,166300,0.214,7ne4VBA60CxGM75vw0EYad,mainstream,audio_features,37i9dQZF1DXcBWIGoYBM5M
3,0.91,112.966,0.67,pop,-4.07,0.304,0.786,Sabrina Carpenter,4,0.0634,...,0.0,4B4Elma4nNDUyl6D5PvQkj,0,0,157280,0.0939,1d7Ptw3qYcfpdLNL5REhtJ,mainstream,audio_features,37i9dQZF1DXcBWIGoYBM5M
4,0.783,149.027,0.777,pop,-4.477,0.355,0.939,"ROSÉ, Bruno Mars",4,0.26,...,0.0,2IYQwwgxgOIn7t3iF6ufFD,0,0,169917,0.0283,5vNRhkKd0yEAg8suGBpjeY,mainstream,audio_features,37i9dQZF1DXcBWIGoYBM5M


In [5]:
high_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1686 entries, 0 to 1685
Data columns (total 29 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   energy                    1686 non-null   float64
 1   tempo                     1686 non-null   float64
 2   danceability              1686 non-null   float64
 3   playlist_genre            1686 non-null   object 
 4   loudness                  1686 non-null   float64
 5   liveness                  1686 non-null   float64
 6   valence                   1686 non-null   float64
 7   track_artist              1686 non-null   object 
 8   time_signature            1686 non-null   int64  
 9   speechiness               1686 non-null   float64
 10  track_popularity          1686 non-null   int64  
 11  track_href                1686 non-null   object 
 12  uri                       1686 non-null   object 
 13  track_album_name          1685 non-null   object 
 14  playlist

In [6]:
# Normalize 'playlist_genre' and 'track_artist' to lowercase and strip surrounding whitespace,
# so the same value is not represented in several ways (e.g. 'Pop' vs 'pop ').
high_df['playlist_genre'] = high_df['playlist_genre'].str.lower().str.strip()
high_df['track_artist'] = high_df['track_artist'].str.lower().str.strip()

In [7]:
high_df


Unnamed: 0,energy,tempo,danceability,playlist_genre,loudness,liveness,valence,track_artist,time_signature,speechiness,...,instrumentalness,track_album_id,mode,key,duration_ms,acousticness,id,playlist_subgenre,type,playlist_id
0,0.592,157.969,0.521,pop,-7.777,0.1220,0.535,"lady gaga, bruno mars",3,0.0304,...,0.000000,10FLjwfpbxLmW8c25Xyc2N,0,6,251668,0.3080,2plbrEY59IikOBgBGLjaoe,mainstream,audio_features,37i9dQZF1DXcBWIGoYBM5M
1,0.507,104.978,0.747,pop,-10.171,0.1170,0.438,billie eilish,4,0.0358,...,0.060800,7aJuG4TFXa2hmE4z1yxc3n,1,2,210373,0.2000,6dOtVTDdiauQNBQEDOtlAB,mainstream,audio_features,37i9dQZF1DXcBWIGoYBM5M
2,0.808,108.548,0.554,pop,-4.169,0.1590,0.372,gracie abrams,4,0.0368,...,0.000000,0hBRqPYPXhr1RkTDG3n4Mk,1,1,166300,0.2140,7ne4VBA60CxGM75vw0EYad,mainstream,audio_features,37i9dQZF1DXcBWIGoYBM5M
3,0.910,112.966,0.670,pop,-4.070,0.3040,0.786,sabrina carpenter,4,0.0634,...,0.000000,4B4Elma4nNDUyl6D5PvQkj,0,0,157280,0.0939,1d7Ptw3qYcfpdLNL5REhtJ,mainstream,audio_features,37i9dQZF1DXcBWIGoYBM5M
4,0.783,149.027,0.777,pop,-4.477,0.3550,0.939,"rosé, bruno mars",4,0.2600,...,0.000000,2IYQwwgxgOIn7t3iF6ufFD,0,0,169917,0.0283,5vNRhkKd0yEAg8suGBpjeY,mainstream,audio_features,37i9dQZF1DXcBWIGoYBM5M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1681,0.422,124.357,0.573,latin,-7.621,0.1020,0.693,libianca,5,0.0678,...,0.000013,5Hmh6N8oisrcuZKa8EY5dn,0,10,184791,0.5510,26b3oVLrRUaaybJulow9kz,afro-latin,audio_features,0oU30cCr8klmMsuOKHDLkh
1682,0.725,105.016,0.711,latin,-8.315,0.1100,0.530,omah lay,4,0.0941,...,0.129000,5NLjxx8nRy9ooUmgpOvfem,0,3,183057,0.4240,1wADwLSkYhrSmy4vdy6BRn,afro-latin,audio_features,0oU30cCr8klmMsuOKHDLkh
1683,0.809,99.005,0.724,latin,-5.022,0.0765,0.606,"davido, fave",4,0.0929,...,0.000000,6lI21W76LD0S3vC55GrfSS,0,6,194040,0.1820,7vKXc90NT5WBm3UTT4iTVG,afro-latin,audio_features,0oU30cCr8klmMsuOKHDLkh
1684,0.642,83.389,0.463,latin,-4.474,0.0686,0.339,"future, drake, tems",4,0.3400,...,0.000000,6tE9Dnp2zInFij4jKssysL,1,1,189893,0.3140,59nOXPmaKlBfGMDeOVGrIK,afro-latin,audio_features,0oU30cCr8klmMsuOKHDLkh


In [8]:
from scipy.stats import zscore

# Calculate z-scores for track_popularity
high_df['zscore_popularity'] = zscore(high_df['track_popularity'])

# Identify outliers (z-score > 3 or < -3)
outliers = high_df[(high_df['zscore_popularity'] > 3) | (high_df['zscore_popularity'] < -3)]


print(outliers)

     energy    tempo  danceability playlist_genre  loudness  liveness  \
0     0.592  157.969         0.521            pop    -7.777    0.1220   
1     0.507  104.978         0.747            pop   -10.171    0.1170   
4     0.783  149.027         0.777            pop    -4.477    0.3550   
5     0.582  116.712         0.700            pop    -5.960    0.0881   
455   0.592  157.969         0.521         gaming    -7.777    0.1220   
456   0.507  104.978         0.747         gaming   -10.171    0.1170   
457   0.582  116.712         0.700         gaming    -5.960    0.0881   
676   0.592  157.969         0.521            pop    -7.777    0.1220   
677   0.783  149.027         0.777            pop    -4.477    0.3550   
678   0.507  104.978         0.747            pop   -10.171    0.1170   
688   0.582  116.712         0.700            pop    -5.960    0.0881   

     valence           track_artist  time_signature  speechiness  ...  \
0      0.535  lady gaga, bruno mars               

In [12]:
# The code below credits every artist on a collaborative track by giving each one their own row.

In [9]:
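# Split collaborations such as 'lady gaga, bruno mars' into a list of artist names.
high_df['track_artist'] = high_df['track_artist'].str.split(', ')
# Give every listed artist their own row, then renumber the rows.
high_df_cleaned = high_df.explode('track_artist').reset_index(drop=True)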

In [None]:
Below is an example using "Die With A Smile", a collaboration between Lady Gaga and Bruno Mars.

In [10]:
high_df_cleaned[high_df_cleaned['track_name'] == 'Die With A Smile'][['track_name', 'track_artist']]


Unnamed: 0,track_name,track_artist
0,Die With A Smile,lady gaga
1,Die With A Smile,bruno mars
725,Die With A Smile,lady gaga
726,Die With A Smile,bruno mars
1048,Die With A Smile,lady gaga
1049,Die With A Smile,bruno mars


In [None]:
Another example is "APT.", a collaboration between ROSÉ and Bruno Mars.

In [11]:
high_df_cleaned[high_df_cleaned['track_name'] == 'APT.'][['track_name', 'track_artist']]


Unnamed: 0,track_name,track_artist
5,APT.,rosé
6,APT.,bruno mars
1050,APT.,rosé
1051,APT.,bruno mars


## Exploratory Data Analysis

To help us answer the research question, we have chosen the following EDA questions:

### 1. How can we mathematically define and measure an artist's "musical change" over time using the available audio features?

Before we can compare artists or genres, we must first establish a **consistent and quantifiable** metric for `musical change` using the available audio features. In essence, this question forces us to define our dependent variable.
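As a starting point for discussion, one candidate definition is the average absolute year-to-year change in an artist's yearly feature means, averaged over the features of interest. The sketch below is only one possible metric under stated assumptions, not the definition adopted in this notebook: the `release_year` column, the function `musical_change`, and the toy values are all hypothetical.

```python
import pandas as pd

FEATURES = ["danceability", "energy", "acousticness"]

# Toy data: artist a drifts over time, artist b stays constant.
toy = pd.DataFrame({
    "track_artist": ["artist a"] * 3 + ["artist b"] * 3,
    "release_year": [2018, 2020, 2022, 2018, 2020, 2022],
    "danceability": [0.50, 0.60, 0.80, 0.70, 0.70, 0.70],
    "energy":       [0.40, 0.40, 0.70, 0.60, 0.60, 0.60],
    "acousticness": [0.30, 0.20, 0.10, 0.10, 0.10, 0.10],
})

def musical_change(df, features=FEATURES):
    """Mean absolute change between consecutive release years, averaged over features."""
    # 1. Average each feature per artist and release year.
    yearly = df.groupby(["track_artist", "release_year"])[features].mean().sort_index()
    # 2. Absolute difference between consecutive years within each artist.
    step = yearly.groupby(level="track_artist").diff().abs()
    # 3. Average the steps per artist (the first year has no predecessor and is skipped).
    per_feature = step.groupby(level="track_artist").mean()
    # 4. Collapse the features into a single score per artist.
    return per_feature.mean(axis=1).rename("change_score")

scores = musical_change(toy)
print(scores)
```

The three features used here all lie between 0 and 1, so averaging them directly is reasonable. Features on other scales, such as `tempo` or `loudness`, would need to be standardized first. Alternatives worth comparing include the per-artist standard deviation of each feature or the slope of a linear fit against release year.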

### 2. What is the distribution of artists across the dataset's `playlist_genre` values?
This is to understand the composition of our dataset and check for **feasibility**. We need to ensure there are enough artists in each genre to make meaningful comparisons. If one genre is heavily overrepresented or another is barely present, it could affect the validity of our conclusions.
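A minimal sketch of this count, using made-up rows, is shown below. It counts distinct artists per `playlist_genre`; the name `artists_per_genre` is hypothetical.

```python
import pandas as pd

# Toy rows: artist c appears under two different genres.
toy = pd.DataFrame({
    "track_artist": ["artist a", "artist a", "artist b", "artist c", "artist c"],
    "playlist_genre": ["pop", "pop", "pop", "latin", "rock"],
})

# Number of distinct artists in each genre, largest genre first.
artists_per_genre = (
    toy.groupby("playlist_genre")["track_artist"]
       .nunique()
       .sort_values(ascending=False)
)
print(artists_per_genre)
```

An artist who appears under several genres is counted once in each of them, so the per-genre counts can sum to more than the number of distinct artists.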

### 3. For a few sample artists from different genres, what do their key audio features look like when plotted against their album release dates? 
This is to visually validate our core premise. By plotting the trajectories of individual artists, we can get a first look at whether musical change is observable in the data (a quick sanity check before committing to the full analysis).
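Such a plot could look like the sketch below, which draws one panel per feature and one line per artist. The data is made up, and the `release_year` column is a hypothetical year extracted from `track_album_release_date`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

FEATURES = ["danceability", "energy", "acousticness"]

# Toy data for two artists over three release years.
toy = pd.DataFrame({
    "track_artist": ["artist a"] * 3 + ["artist b"] * 3,
    "release_year": [2018, 2020, 2022, 2018, 2020, 2022],
    "danceability": [0.50, 0.60, 0.80, 0.70, 0.70, 0.70],
    "energy":       [0.40, 0.40, 0.70, 0.60, 0.60, 0.60],
    "acousticness": [0.30, 0.20, 0.10, 0.10, 0.10, 0.10],
})

fig, axes = plt.subplots(1, len(FEATURES), figsize=(12, 3), sharex=True)
for ax, feature in zip(axes, FEATURES):
    for artist, group in toy.groupby("track_artist"):
        # Average tracks released in the same year so each year yields one point.
        yearly = group.groupby("release_year")[feature].mean()
        ax.plot(yearly.index, yearly.values, marker="o", label=artist)
    ax.set_title(feature)
    ax.set_xlabel("release year")
axes[0].set_ylabel("yearly mean")
axes[0].legend()
fig.tight_layout()
```

Flat lines would suggest little observable change for that artist, while sloped or jagged lines would indicate that the features do move between releases.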
