# Spotify Data Visualization Lab

## Overview
In this lab, you will explore and visualize Spotify music data to uncover insights about music trends, popular genres, artists, and song characteristics. You'll work with a real-world dataset containing information about songs, their attributes, and popularity metrics.

Amisha & Melek

## Part I: Data Loading & Initial Exploration


1.   **Task 1 -** Import all necessary libraries for data manipulation & visualization
2.   **Task 2 -** Load the dataset, display the first few rows, and check the shape of the dataset.
3.   **Task 3 -** Answer the questions:
     1.    How many songs are in the dataset?
     2.    How many features/columns does each song have?
4.   **Task 4 -** Display column names and their data types, check for missing values, and get summary statistics for numerical columns.






**Task 1**

In [1]:
# Importing all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**Task 2**

In [3]:
# Load dataset and display first few rows
df = pd.read_csv('songs_normalize.csv')
df.head()

| | artist | song | duration_ms | explicit | year | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Britney Spears | Oops!...I Did It Again | 211160 | False | 2000 | 77 | 0.751 | 0.834 | 1 | -5.444 | 0 | 0.0437 | 0.3 | 1.8e-05 | 0.355 | 0.894 | 95.053 | pop |
| 1 | blink-182 | All The Small Things | 167066 | False | 1999 | 79 | 0.434 | 0.897 | 0 | -4.918 | 1 | 0.0488 | 0.0103 | 0.0 | 0.612 | 0.684 | 148.726 | rock, pop |
| 2 | Faith Hill | Breathe | 250546 | False | 1999 | 66 | 0.529 | 0.496 | 7 | -9.007 | 1 | 0.029 | 0.173 | 0.0 | 0.251 | 0.278 | 136.859 | pop, country |
| 3 | Bon Jovi | It's My Life | 224493 | False | 2000 | 78 | 0.551 | 0.913 | 0 | -4.063 | 0 | 0.0466 | 0.0263 | 1.3e-05 | 0.347 | 0.544 | 119.992 | rock, metal |
| 4 | \*NSYNC | Bye Bye Bye | 200560 | False | 2000 | 65 | 0.614 | 0.928 | 8 | -4.806 | 0 | 0.0516 | 0.0408 | 0.00104 | 0.0845 | 0.879 | 172.656 | pop |


**Task 3**

In [4]:
# How many songs are in the dataset?
num_songs = df.shape[0]
print(f'The dataset contains {num_songs} songs.')

The dataset contains 2000 songs.


In [5]:
# How many features/columns does each song have?
num_features = df.shape[1]
print(f'Each song has {num_features} features/columns.')


Each song has 18 features/columns.


**Task 4**

In [6]:
# Display column names and their data types
df.dtypes

artist               object
song                 object
duration_ms           int64
explicit               bool
year                  int64
popularity            int64
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
genre                object
dtype: object

In [11]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:")
missing_values

Missing values in each column:


artist              0
song                0
duration_ms         0
explicit            0
year                0
popularity          0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
genre               0
dtype: int64

In [10]:
# Summary statistics for numerical columns
summary_stats = df.describe()
print("Summary statistics for numerical columns:")
summary_stats

Summary statistics for numerical columns:


| | duration_ms | year | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 |
| mean | 228748.1245 | 2009.494 | 59.8725 | 0.667438 | 0.720366 | 5.378 | -5.512434 | 0.5535 | 0.103568 | 0.128955 | 0.015226 | 0.181216 | 0.55169 | 120.122558 |
| std | 39136.569008 | 5.85996 | 21.335577 | 0.140416 | 0.152745 | 3.615059 | 1.933482 | 0.497254 | 0.096159 | 0.173346 | 0.087771 | 0.140669 | 0.220864 | 26.967112 |
| min | 113000.0 | 1998.0 | 0.0 | 0.129 | 0.0549 | 0.0 | -20.514 | 0.0 | 0.0232 | 1.9e-05 | 0.0 | 0.0215 | 0.0381 | 60.019 |
| 25% | 203580.0 | 2004.0 | 56.0 | 0.581 | 0.622 | 2.0 | -6.49025 | 0.0 | 0.0396 | 0.014 | 0.0 | 0.0881 | 0.38675 | 98.98575 |
| 50% | 223279.5 | 2010.0 | 65.5 | 0.676 | 0.736 | 6.0 | -5.285 | 1.0 | 0.05985 | 0.0557 | 0.0 | 0.124 | 0.5575 | 120.0215 |
| 75% | 248133.0 | 2015.0 | 73.0 | 0.764 | 0.839 | 8.0 | -4.16775 | 1.0 | 0.129 | 0.17625 | 6.8e-05 | 0.241 | 0.73 | 134.2655 |
| max | 484146.0 | 2020.0 | 89.0 | 0.975 | 0.999 | 11.0 | -0.276 | 1.0 | 0.576 | 0.976 | 0.985 | 0.853 | 0.973 | 210.851 |


## Part II: Data Cleaning & Preprocessing

1.   **Task 1 -** Identify missing values, then decide on and implement an appropriate strategy for each column.
2.   **Task 2 -** Check if there are duplicate songs, decide whether to keep or remove them and document your decision and reasoning.
3.   **Task 3 -** Ensure all columns have appropriate data types and convert columns if necessary.

**Task 1**

In [None]:
# No missing values were found (see Part I, Task 4), so no imputation or row removal is needed.

**Task 2**

In [14]:
# Count fully identical rows (exact duplicates across every column)
duplicate_songs = df.duplicated().sum()
print(f'Number of duplicate songs in the dataset: {duplicate_songs}')

Number of duplicate songs in the dataset: 59


In [None]:
# Count duplicates by artist + song
dups_artist_song = df[df.duplicated(subset=['artist','song'], keep=False)]
dups_artist_song.groupby(['artist','song']).size().sort_values(ascending=False)

artist          song                           
Alessia Cara    Here                               2
Snoop Dogg      Drop It Like It's Hot              2
Selena Gomez    The Heart Wants What It Wants      2
                Same Old Love                      2
SAYGRACE        You Don't Own Me (feat. G-Eazy)    2
                                                  ..
Finger Eleven   Paralyzer                          2
Eminem          Like Toy Soldiers                  2
Ellie Goulding  Burn                               2
Edward Maya     Stereo Love - Radio Edit           2
X Ambassadors   Renegades                          2
Length: 74, dtype: int64

In [None]:
# Check duration variation among duplicates
dups_artist_song.groupby(['artist','song'])['duration_ms'].std()

artist             song                               
Alessia Cara       Here                                   0.0
Ariana Grande      Love Me Harder                         0.0
Baby Bash          Suga Suga                              0.0
Billie Eilish      lovely (with Khalid)                   0.0
Bruno Mars         Locked out of Heaven                   0.0
                                                         ... 
Travis Scott       SICKO MODE                             0.0
Trey Songz         Na Na                                  0.0
Waka Flocka Flame  No Hands (feat. Roscoe Dash & Wale)    0.0
Will Young         Leave Right Now                        0.0
X Ambassadors      Renegades                              0.0
Name: duration_ms, Length: 74, dtype: float64

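Almost all duplicate pairs have a `duration_ms` standard deviation of 0.0, so their durations are identical. That makes the usual sources of legitimate repeats unlikely:

- radio edit vs. album version

- live version

- remix

The duplicated rows most likely describe the same recording.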
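
The duration check looks at one column only. The same grouping can be extended to show which other columns vary inside duplicate pairs. The sketch below is a minimal illustration on a toy frame. The helper name `columns_that_differ` and the toy values are assumptions for demonstration, not part of the lab dataset.

```python
import pandas as pd

def columns_that_differ(df, keys):
    """For rows sharing the same `keys`, count in how many groups each other column varies."""
    dups = df[df.duplicated(subset=keys, keep=False)]
    # nunique() > 1 means the column takes more than one value inside a group
    varies = dups.groupby(keys).nunique() > 1
    return varies.sum().sort_values(ascending=False)

# Toy data (hypothetical values): pair A/x differs in popularity, pair B/y in genre
toy = pd.DataFrame({
    "artist": ["A", "A", "B", "B", "C"],
    "song": ["x", "x", "y", "y", "z"],
    "duration_ms": [200, 200, 180, 180, 210],
    "popularity": [50, 60, 70, 70, 40],
    "genre": ["pop", "pop", "rock", "rock, pop", "pop"],
})

diff_counts = columns_that_differ(toy, ["artist", "song"])
print(diff_counts)
```

Running `columns_that_differ(df, ['artist', 'song'])` on the real data would show directly whether genre, popularity, or any audio feature differs within a pair.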


In [None]:
# Check duplicates caused by multi-genre tagging
df[df.duplicated(subset=['artist','song','duration_ms'], keep=False)][['artist','song','genre']]

| | artist | song | genre |
|---|---|---|---|
| 20 | Linkin Park | In the End | rock, metal |
| 36 | Kylie Minogue | Spinning Around | pop, Dance/Electronic |
| 63 | Craig David | Fill Me In | hip hop, pop, R&B |
| 85 | Destiny's Child | Independent Women, Pt. 1 | pop, R&B |
| 90 | Gabrielle | Rise | pop, R&B |
| ... | ... | ... | ... |
| 1832 | Jax Jones | Breathe | hip hop, pop, Dance/Electronic |
| 1861 | Post Malone | Better Now | hip hop |
| 1921 | Travis Scott | SICKO MODE | hip hop, Dance/Electronic |
| 1929 | Billie Eilish | lovely (with Khalid) | pop, Dance/Electronic |


Two key observations:

A. Many songs have multi-genre labels, in formats like:

- "rock, metal"

- "pop, Dance/Electronic"

- "hip hop, pop, R&B"

B. Every duplicated artist + song combination appears exactly twice. Of these 74 pairs, 59 are exact row duplicates (the `df.duplicated()` count above), so the remaining 15 pairs must differ in at least one other column, such as genre, popularity, or an audio feature.

This is consistent with the duration std of 0.0. Each pair very likely shares:

✔ the same audio track

✔ the same artist + song

✔ an identical duration

and, where the two rows are not exact copies, differs in:

✖ genre or popularity metadata

✖ possibly small differences in audio features

One possible explanation, not verified here, is that these songs appear in multiple playlists or sources with different genre classifications.

We therefore appear to have metadata duplicates: multiple rows describing the same recording, some of them with variations in:

- genre

- popularity

- audio features (slight)

Decision: keep the row with the highest popularity for each artist + song pair. Popularity is the field most likely to differ across playlists or sources, and we assume the highest value is the most up-to-date reading.

This gives us:

- 1 row per song

- the popularity value we assume to be the most current

- all duplicates removed

In [33]:
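# Keep the most popular row per (artist, song): sort ascending by popularity,
# then keep the last (highest-popularity) row of each pair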
df = (
    df.sort_values('popularity')
      .drop_duplicates(subset=['artist','song'], keep='last')
)


In [34]:
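# Verify on the same key used for removal: no (artist, song) pair should remain
duplicate_songs = df.duplicated(subset=['artist','song']).sum()
print(f'Number of duplicate songs in the dataset: {duplicate_songs}')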

Number of duplicate songs in the dataset: 0
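
Task 3 has no cell of its own above. The dtypes printed in Part I (bool for `explicit`, integers for `year` and `popularity`, floats for the audio features) already look appropriate. The sketch below shows one way such a check could be written and how a mis-typed column could be converted. The `expected` mapping, the helper name `check_dtypes`, and the toy frame are assumptions for illustration only.

```python
import pandas as pd
from pandas.api import types as ptypes

def check_dtypes(df, expected):
    """Return the columns whose dtype fails its expected type check."""
    return [col for col, check in expected.items() if not check(df[col])]

# Toy frame (hypothetical): `explicit` is stored as text on purpose
toy = pd.DataFrame({
    "explicit": ["False", "True"],
    "year": [2000, 1999],
    "danceability": [0.751, 0.434],
})

# Assumed expectations, modeled on the dtypes shown in Part I
expected = {
    "explicit": ptypes.is_bool_dtype,
    "year": ptypes.is_integer_dtype,
    "danceability": ptypes.is_float_dtype,
}

bad_before = check_dtypes(toy, expected)
# Convert the text flags to real booleans
toy["explicit"] = toy["explicit"].map({"True": True, "False": False})
bad_after = check_dtypes(toy, expected)
print(bad_before, bad_after)
```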


## Part III: Exploratory Data Analysis (EDA)

1.   **Task 1 -** Create visualizations to analyze music genres:
     1.    **(Visualization)** Plot showing the distribution of songs across different genres (you can use one subplot per genre or stack all distributions in one graph).
     2.    **(Question)** Which genre has the most songs?
     3.    **(Question)** What percentage of songs belong to the top 3 genres?
2.   **Task 2 -** Analyze song popularity patterns:
     1.    **(Visualization)** Distribution plot of song popularity scores.
     2.    **(Question)** Calculate and display mean, median, and mode of popularity
     3.    **(Visualization)** Create a box plot to identify outliers
     4.    **(Question)** What is the typical popularity score range?
     5.    **(Question)** Are there songs with exceptionally high or low popularity?
3.   **Task 3 -** Investigate artist-related metrics:
     1.    **(Visualization)** Bar chart showing top 10 artists by number of songs.
     2.    **(Visualization)** Bar chart showing top 10 artists by average popularity
     3.    **(Question)** Does having more songs correlate with higher popularity?
     4.    **(Question)** Who are the most prolific artists in the dataset?
4.   **Task 4 -** Analyze explicit content in songs:
     1.    **(Question)** Calculate the percentage of songs with explicit content
     2.    **(Visualization)** Pie chart or bar chart comparing explicit vs non-explicit songs.
     3.    **(Visualization)** Box plot comparing popularity scores between explicit and non-explicit songs.
     4.    **(Question)** Do explicit songs tend to be more or less popular?
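
Because `genre` can hold several comma-separated labels (for example "rock, pop"), counting the raw column would treat each combination as its own genre. The sketch below is one possible approach to Task 1, shown on a toy frame. It assumes that a song counts once toward every genre it lists, and that the top-3 percentage means the share of songs tagged with at least one of the top 3 genres. Both are interpretation choices, and the helper name `genre_counts` is illustrative.

```python
import pandas as pd

def genre_counts(df, col="genre"):
    """Count songs per individual genre, splitting multi-genre labels."""
    genres = df[col].str.split(",").explode().str.strip()
    return genres.value_counts()

# Toy data (hypothetical), using the label format seen in the dataset
toy = pd.DataFrame({
    "genre": ["pop", "rock, pop", "pop, country", "rock, metal",
              "pop", "country", "latin"]
})

counts = genre_counts(toy)

# Share of songs tagged with at least one of the top 3 genres
top3 = set(counts.head(3).index)
in_top3 = toy["genre"].apply(
    lambda label: any(g.strip() in top3 for g in label.split(","))
)
top3_share = 100 * in_top3.mean()

print(counts)
print(f"{top3_share:.1f}% of songs belong to a top-3 genre")
```

The resulting `counts` Series can be passed straight to a bar chart, for example with `counts.plot(kind='bar')`.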

In [None]:
# Write your code here
# (You can add more cells)
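
For Task 2, the questions about the typical range and exceptional songs can be answered numerically as well as visually. The sketch below is a minimal illustration on toy scores. It assumes the "typical range" is the interquartile range and flags outliers with the 1.5 × IQR rule, which is also the default whisker rule in matplotlib box plots. The helper name `popularity_summary` is hypothetical.

```python
import pandas as pd

def popularity_summary(pop):
    """Central tendency, interquartile range, and 1.5*IQR outliers of a score Series."""
    q1, q3 = pop.quantile(0.25), pop.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": pop.mean(),
        "median": pop.median(),
        "mode": pop.mode().tolist(),  # mode() can return several values
        "typical_range": (q1, q3),
        "outliers": pop[(pop < lower) | (pop > upper)].tolist(),
    }

# Toy popularity scores (hypothetical values)
toy_pop = pd.Series([0, 56, 60, 65, 65, 70, 73, 80, 89])
summary = popularity_summary(toy_pop)
print(summary)
```

On the lab data the call would be `popularity_summary(df['popularity'])`.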

## Part IV: Correlation & Relationship Analysis

1.   **Task 1 -** Correlation Matrix:
     1.    **(Visualization)** Create a heatmap showing correlations between Popularity, Energy, Danceability, Loudness, Valence, Tempo, Acousticness, etc.
     2.    **(Question)** Which features are most strongly correlated?
     3.    **(Question)** Are there any surprising correlations?
2.   **Task 2 -** Multi-Feature Comparison:
     1.    **(Visualization)** Create a parallel coordinates plot or radar chart showing audio features for top 10 most popular songs
     2.    **(Question)** Compare their feature profiles
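
To answer which features are most strongly correlated, the correlation matrix can be ranked rather than read off the heatmap by eye. The sketch below shows one way to list feature pairs by absolute Pearson correlation. The helper name `strongest_pairs` and the toy values are assumptions for illustration.

```python
from itertools import combinations

import pandas as pd

def strongest_pairs(df, n=3):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes("number").corr()
    # combinations() yields each unordered pair once and skips the diagonal
    pairs = [((a, b), corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    return sorted(pairs, key=lambda item: abs(item[1]), reverse=True)[:n]

# Toy data (hypothetical values); the text column is ignored by select_dtypes
toy = pd.DataFrame({
    "genre": ["pop", "rock", "pop", "metal", "pop"],
    "energy": [0.2, 0.4, 0.6, 0.8, 0.9],
    "loudness": [-12, -9, -7, -5, -4],
    "acousticness": [0.9, 0.5, 0.6, 0.2, 0.1],
})

top = strongest_pairs(toy, n=2)
for (a, b), r in top:
    print(f"{a} vs {b}: r = {r:.2f}")
```

The same `corr` matrix is what `sns.heatmap(corr, annot=True)` would draw for the visualization.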

In [None]:
# Write your code here
# (You can add more cells)
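
Audio features live on different scales (tempo runs from about 60 to 211, loudness is negative, most others lie between 0 and 1), so a radar chart or parallel coordinates plot needs them on a common axis first. The sketch below is one possible preparation step, using min-max scaling against the full dataset's range. The helper name `top_n_profiles` and the toy values are assumptions, and the function assumes no feature is constant (otherwise the division would be by zero).

```python
import pandas as pd

def top_n_profiles(df, features, n=10, by="popularity"):
    """Select the n highest rows by `by` and min-max scale `features` to [0, 1]."""
    top = df.nlargest(n, by)
    scaled = top[features].copy()
    for col in features:
        # Scale against the whole dataset so 0 and 1 mean dataset min and max
        lo, hi = df[col].min(), df[col].max()
        scaled[col] = (top[col] - lo) / (hi - lo)
    return scaled

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "popularity": [89, 50, 70],
    "tempo": [60, 210, 135],
    "loudness": [-20, 0, -5],
})

profiles = top_n_profiles(toy, ["tempo", "loudness"], n=2)
print(profiles)
```

Once a label column such as `song` is added back, a frame like this can be passed to `pandas.plotting.parallel_coordinates`, which takes the frame and the name of the label column.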