In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [None]:
df = pd.read_csv("../../data/spotify-2023.csv", encoding="latin-1")

In [4]:
# Check unique values
print("\nUnique values in 'key':")
print(df['key'].value_counts(dropna=False))

print("\nUnique values in 'in_shazam_charts':")
print(df['in_shazam_charts'].value_counts(dropna=False))

# Summarise the rows with missing values to see whether missingness relates to other features
print("\nAre missing values related to other features?")
print("Songs with missing key - average stats:")
print(df[df['key'].isnull()].describe())

print("\nSongs with missing Shazam charts - average stats:")
print(df[df['in_shazam_charts'].isnull()].describe())


Unique values in 'key':
key
C#     120
G       96
NaN     95
G#      91
F       89
B       81
D       81
A       75
F#      73
E       62
A#      57
D#      33
Name: count, dtype: int64

Unique values in 'in_shazam_charts':
in_shazam_charts
0      344
1       73
NaN     50
2       35
3       21
      ... 
115      1
230      1
169      1
529      1
95       1
Name: count, Length: 199, dtype: int64

Are missing values related to other features?
Songs with missing key - average stats:
       artist_count  released_year  released_month  released_day  \
count     95.000000      95.000000       95.000000     95.000000   
mean       1.600000    2018.210526        6.105263     15.789474   
std        1.124202      11.251378        3.559446      8.151354   
min        1.000000    1952.000000        1.000000      1.000000   
25%        1.000000    2020.000000        3.000000     10.000000   
50%        1.000000    2022.000000        6.000000     17.000000   
75%        2.000000    2022.000000 

1. Musical Key (key column - 95 missing)

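Only 11 of the 12 pitch classes appear in the output: C# through B are present, but C is absent
The 95 missing values may therefore be songs in C whose label was lost when the data was assembled (a hypothesis to verify, not a confirmed fact)
C# is the most common key (120 songs)
D# is the least common key (33 songs)
Songs with a missing key have a mean release year of 2018.2, but the median is 2022 and the 25th percentile is 2020, so most of them are recent; the mean is pulled down by a few old tracks (the earliest is from 1952)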
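
The two points above that are easiest to get wrong (which pitch classes actually occur, and whether "avg. 2018" describes a typical song) can be checked directly. The sketch below is a minimal illustration on made-up toy data, not on the Spotify file; `ALL_KEYS`, `absent_keys`, and `year_summary_for_missing` are hypothetical helper names introduced here.

```python
import pandas as pd

# The 12 pitch classes, written with sharps as in the 'key' column.
ALL_KEYS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def absent_keys(keys):
    """Return the pitch classes that never appear among the non-null values."""
    present = set(keys.dropna().unique())
    return [k for k in ALL_KEYS if k not in present]


def year_summary_for_missing(df, col):
    """Mean and median release year for rows where `col` is missing."""
    years = df.loc[df[col].isna(), "released_year"]
    return years.agg(["mean", "median"])


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "key": ["C#", None, "G", None, None, "D"],
    "released_year": [2023, 2022, 2021, 2022, 1952, 2020],
})

print(absent_keys(toy["key"]))
print(year_summary_for_missing(toy, "key"))
```

On the real data the same calls would be `absent_keys(df["key"])` and `year_summary_for_missing(df, "key")`. A large gap between mean and median, as in the toy data, signals that a few very old tracks dominate the mean.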

2. Shazam Charts (in_shazam_charts column - 50 missing)

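The notebook treats this as the number of Shazam charts a song appears in; the values shown run from 0 up to at least 529, but value_counts() sorts by frequency, so the true maximum is not visible in this output
344 songs (36%) have a value of 0, meaning they appear in no Shazam charts
The 50 songs with missing values are older on average (mean release year about 2015)
Songs with missing Shazam data are more popular on Spotify and Apple Music: they appear in about 2.4x as many playlists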
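
The range and the playlist comparison above can be reproduced with a few lines. This is a hedged sketch on invented toy data: it assumes the chart column may be stored as text with thousands separators (which would explain a non-numeric column), and it assumes a playlist-count column named `in_spotify_playlists`; neither is confirmed by the output shown here. `to_number` and `missing_vs_present_ratio` are hypothetical helper names.

```python
import pandas as pd


def to_number(col):
    """Convert a text column such as '1,021' to numbers; unparseable values become NaN."""
    cleaned = col.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


def missing_vs_present_ratio(df, flag_col, value_col):
    """Mean of `value_col` where `flag_col` is missing, divided by the mean where it is present."""
    missing = df[flag_col].isna()
    return df.loc[missing, value_col].mean() / df.loc[~missing, value_col].mean()


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "in_shazam_charts": ["0", "1,021", None, "2", None],
    "in_spotify_playlists": [100, 300, 1200, 200, 1200],
})

shazam_numeric = to_number(toy["in_shazam_charts"])
ratio = missing_vs_present_ratio(toy, "in_shazam_charts", "in_spotify_playlists")

print(shazam_numeric.min(), shazam_numeric.max())
print(ratio)
```

Taking `min()` and `max()` of the converted column gives the actual range, which the frequency-sorted value_counts() output cannot show. Because playlist counts are usually skewed, comparing medians as well as means is a reasonable cross-check.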

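Missing values do not appear to be random; they seem to be related to release year:

Missing key → mean release year about 2018 (but median 2022, so most affected songs are recent)
Missing Shazam data → mean release year about 2015
One possible explanation is that Shazam chart coverage is thinner for older releases, but nothing in this output confirms that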
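
One way to test the link between missingness and release year is to compare the share of missing values on either side of a cutoff year. The sketch below uses invented toy data; `missing_rate_by_era` is a hypothetical helper, and the cutoff of 2018 is only an example value, not a threshold established by the analysis.

```python
import pandas as pd


def missing_rate_by_era(df, col, cutoff_year):
    """Share of rows with a missing `col`, split by release year relative to `cutoff_year`."""
    labels = {True: f"before {cutoff_year}", False: f"{cutoff_year} or later"}
    era = (df["released_year"] < cutoff_year).map(labels)
    return df[col].isna().groupby(era).mean()


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "released_year": [1990, 2010, 2019, 2021, 2022, 2023],
    "in_shazam_charts": [None, None, 1, 2, None, 3],
})

rates = missing_rate_by_era(toy, "in_shazam_charts", 2018)
print(rates)
```

If missingness were unrelated to age, the two rates would be similar. A clearly higher rate before the cutoff would support the pattern described above, although it would not explain its cause.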