# 🎵 Spotify Music Analysis
## Codédex February 2026 Dataset Challenge

This notebook explores the Spotify Songs Dataset to understand what makes songs popular, focusing on genres and audio features.


In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Visualization settings
sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is an alias for it
plt.rcParams["figure.figsize"] = (10, 6)


## Load the Dataset

In [56]:
df = pd.read_csv("../data/spotify_songs.csv")
# Display the first few rows of the dataset
print(df.head())

   genre        artist_name                        track_name  \
0  Movie     Henri Salvador       C'est beau de faire un Show   
1  Movie  Martin & les fées  Perdu d'avance (par Gad Elmaleh)   
2  Movie    Joseph Williams    Don't Let Me Be Lonely Tonight   
3  Movie     Henri Salvador    Dis-moi Monsieur Gordon Cooper   
4  Movie       Fabien Nataf                         Ouverture   

                 track_id  popularity  acousticness  danceability  \
0  0BRjO6ga9RKCKjfDqeFgWV           0         0.611         0.389   
1  0BjC1NfoEOOusryehmNudP           1         0.246         0.590   
2  0CoSDzoNIKCRs124s9uTVy           3         0.952         0.663   
3  0Gc6TVm52BwZD07Ki6tIvf           0         0.703         0.240   
4  0IuslXpMROHdEPvSl1fTQK           4         0.950         0.331   

   duration_ms  energy  instrumentalness key  liveness  loudness   mode  \
0        99373   0.910             0.000  C#    0.3460    -1.828  Major   
1       137373   0.737             0.000  F

## Basic Dataset Inspection

In [57]:
# Basic Dataset Inspection
print("Dataset Shape:", df.shape)
print("Dataset Columns:", df.columns)

Dataset Shape: (232725, 18)
Dataset Columns: Index(['genre', 'artist_name', 'track_name', 'track_id', 'popularity',
       'acousticness', 'danceability', 'duration_ms', 'energy',
       'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'time_signature', 'valence'],
      dtype='object')


## Dataset Inspection Summary

After loading the dataset, the initial inspection shows that it was read correctly and is well structured:

- **232,725 rows**: A large dataset with enough observations to reveal meaningful patterns.  
- **18 columns**: Audio features, genre, and popularity attributes to explore.  
- **Clean and readable column headers**: No formatting or encoding issues in the column names.  
- **No loading errors**: The dataset was imported successfully with the correct file path.  
- **Numeric audio features**: Features like danceability, energy, valence, tempo, and loudness should load as numeric values, which is what statistical analysis and visualization require. The `df.info()` call later in the notebook reports the dtype of every column.

Overall, the dataset looks ready for the cleaning checks below, followed by exploration and modeling.
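
The numeric-dtype point above can be checked directly rather than read off the preview. The following is a minimal sketch on a small stand-in dataframe, not the real file: the rows in `sample` and the `expected_numeric` list are illustrative assumptions, while `select_dtypes` is the standard pandas call for picking columns by dtype.

```python
import pandas as pd

# Toy stand-in for the Spotify dataframe (values are illustrative only).
sample = pd.DataFrame({
    "genre": ["Movie", "Pop"],
    "popularity": [0, 66],
    "danceability": [0.389, 0.590],
    "tempo": [100.0, 120.0],
    "key": ["C#", "F"],
})

# Columns we expect to be numeric (an assumed subset of the audio features).
expected_numeric = ["popularity", "danceability", "tempo"]

# select_dtypes(include="number") keeps only integer and float columns.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Any expected column missing from numeric_cols was read as text instead.
non_numeric = [col for col in expected_numeric if col not in numeric_cols]

print("Numeric columns:", numeric_cols)
print("Expected numeric but loaded as another type:", non_numeric)
```

On the real data, replacing `sample` with `df` and extending `expected_numeric` to all audio features would flag any column that was parsed as text.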

## Basic Data Cleaning

Before analyzing the Spotify dataset, several cleaning steps were performed to ensure data quality and reliability:

- **Removed duplicate rows** to avoid repeated tracks influencing the results.  
- **Handled missing values** by dropping or filtering incomplete entries when necessary.  
- **Converted units** (e.g., duration in milliseconds → duration in minutes) for better readability and analysis.  
- **Standardized audio features** such as danceability, energy, and tempo to ensure consistency across all records.  
- **Filtered out extreme outliers** in duration, loudness, and tempo that could distort visualizations.  

These steps prepare the dataset for accurate exploration, visualization, and deeper analysis.
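
The steps listed above could be sketched roughly as follows. This is one possible reading, not the notebook's actual code: `clean_tracks` is a hypothetical helper, the 1.5 × IQR rule for outliers and the z-score for standardization are assumptions (the text does not say which methods were used), and the toy rows are made up for illustration.

```python
import pandas as pd

def clean_tracks(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper sketching the cleaning steps on a tracks dataframe."""
    # 1. Remove rows that are exact duplicates of an earlier row.
    out = df.drop_duplicates()
    # 2. Drop rows with a missing track name.
    out = out.dropna(subset=["track_name"]).copy()
    # 3. Convert duration from milliseconds to minutes.
    out["duration_min"] = out["duration_ms"] / 60_000
    # 4. Filter duration outliers with the 1.5 * IQR rule (an assumed choice).
    q1 = out["duration_min"].quantile(0.25)
    q3 = out["duration_min"].quantile(0.75)
    iqr = q3 - q1
    in_range = out["duration_min"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    out = out[in_range].copy()
    # 5. Standardize tempo as a z-score (an assumed choice of scaling).
    out["tempo_z"] = (out["tempo"] - out["tempo"].mean()) / out["tempo"].std()
    return out.reset_index(drop=True)

# Toy data: one exact duplicate, one missing name, one hour-long outlier.
toy = pd.DataFrame({
    "track_name": ["a", "b", "c", "d", "e", None, "a", "g"],
    "duration_ms": [180000, 200000, 210000, 190000, 220000, 200000, 180000, 3600000],
    "tempo": [100.0, 120.0, 110.0, 130.0, 90.0, 115.0, 100.0, 125.0],
})

cleaned = clean_tracks(toy)
print(cleaned[["track_name", "duration_min", "tempo_z"]])
```

Keeping each step as a separate statement makes it easy to skip one, for example leaving outliers in when the goal is to study very long tracks.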

In [58]:

# Check for missing values
df.isna().sum()


genre               0
artist_name         0
track_name          1
track_id            0
popularity          0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
time_signature      0
valence             0
dtype: int64

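## Interpretation
- Only 1 missing value in the entire dataset, in the `track_name` column
- All audio features, popularity values, and the remaining metadata fields are complete
- With a single missing entry across 232,725 rows, dropping that row has a negligible effect on the analysis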

In [59]:
# Dropping missing values in the 'track_name' column
df = df.dropna(subset=["track_name"])

# checking missing values again
df.isna().sum()

genre               0
artist_name         0
track_name          0
track_id            0
popularity          0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
time_signature      0
valence             0
dtype: int64

### Now that there are no missing values in the `track_name` column, we can proceed with the analysis.

In [60]:

# Check for duplicate rows
duplicate_count = df.duplicated().sum()
print("Duplicate Rows:", duplicate_count)



Duplicate Rows: 0


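**Duplicate Rows: 0** means:

- No two rows are identical across all 18 columns
- No row needs to be dropped as an exact duplicate
- `df.duplicated()` compares entire rows, so this result alone does not show that every `track_id` is unique; the same track could still appear in more than one row, for example under different genres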
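
To see how a full-row check differs from a per-track check, here is a small sketch on made-up data. The rows and IDs in `toy` are hypothetical; `duplicated(subset=...)` is the standard pandas way to restrict the comparison to chosen columns.

```python
import pandas as pd

# Made-up rows: the same track_id listed under two different genres.
toy = pd.DataFrame({
    "genre": ["Pop", "Dance", "Rock"],
    "track_id": ["id_1", "id_1", "id_2"],
    "popularity": [70, 70, 55],
})

# Full-row comparison: rows 0 and 1 differ in genre, so nothing is flagged.
full_row_dupes = int(toy.duplicated().sum())

# Comparison on track_id only: the second occurrence of id_1 is flagged.
track_id_dupes = int(toy.duplicated(subset=["track_id"]).sum())

print("Exact duplicate rows:", full_row_dupes)
print("Repeated track_id values:", track_id_dupes)
```

Running `df.duplicated(subset=["track_id"]).sum()` on the real dataframe would show whether any track appears more than once.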

In [None]:
# Explore the data with head(), info(), and describe()
# look at the first few rows
print('First Few Rows of the Dataframe:')
print(df.head())
print()
# get a summary of the dataframe
print('Dataframe Info:')
df.info()  # info() prints its summary directly and returns None
print()
# get descriptive statistics
print('Descriptive Statistics:')
print(df.describe())

First Few Rows of the Dataframe:
   genre        artist_name                        track_name  \
0  Movie     Henri Salvador       C'est beau de faire un Show   
1  Movie  Martin & les fées  Perdu d'avance (par Gad Elmaleh)   
2  Movie    Joseph Williams    Don't Let Me Be Lonely Tonight   
3  Movie     Henri Salvador    Dis-moi Monsieur Gordon Cooper   
4  Movie       Fabien Nataf                         Ouverture   

                 track_id  popularity  acousticness  danceability  \
0  0BRjO6ga9RKCKjfDqeFgWV           0         0.611         0.389   
1  0BjC1NfoEOOusryehmNudP           1         0.246         0.590   
2  0CoSDzoNIKCRs124s9uTVy           3         0.952         0.663   
3  0Gc6TVm52BwZD07Ki6tIvf           0         0.703         0.240   
4  0IuslXpMROHdEPvSl1fTQK           4         0.950         0.331   

   duration_ms  energy  instrumentalness key  liveness  loudness   mode  \
0        99373   0.910             0.000  C#    0.3460    -1.828  Major   
1       13

In [63]:
df.groupby('genre')['popularity'].mean()


genre
A Capella            9.302521
Alternative         50.213430
Anime               24.258729
Blues               34.742879
Children's Music     4.252637
Children’s Music    54.659040
Classical           29.282195
Comedy              21.342630
Country             46.100416
Dance               57.275256
Electronic          38.056095
Folk                49.940209
Hip-Hop             58.423131
Indie               54.701561
Jazz                40.824383
Movie               12.174097
Opera               13.335628
Pop                 66.590667
R&B                 52.308719
Rap                 60.533795
Reggae              35.589328
Reggaeton           37.742915
Rock                59.619392
Ska                 28.612351
Soul                47.027836
Soundtrack          33.954800
World               35.523145
Name: popularity, dtype: float64
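
The output above lists Children's Music twice, once with a straight apostrophe (mean popularity about 4.25) and once with a typographic apostrophe (about 54.66), so pandas treats them as two separate genres. Whether they should be merged is a judgment call, since the two labels may describe different sets of tracks. If merging is wanted, one way is to normalize the apostrophe before grouping, sketched below on made-up data. The `genre_clean` column name and the toy values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows with both apostrophe variants of the same genre label.
toy = pd.DataFrame({
    "genre": ["Children's Music", "Children\u2019s Music", "Pop", "Pop"],
    "popularity": [4, 54, 60, 70],
})

# Replace the typographic apostrophe (U+2019) with a straight one.
toy["genre_clean"] = toy["genre"].str.replace("\u2019", "'", regex=False)

# Mean popularity per normalized genre, most popular first.
mean_popularity = (
    toy.groupby("genre_clean")["popularity"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_popularity)
```

After merging, the mean for the combined label is weighted by how many rows each variant contributes, so it will not simply be the midpoint of the two values shown above. Sorting the result also makes the ranking of genres easier to read than the alphabetical default.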