# Spotify Track Analytics Popularity Prediction

Dataset Content
The dataset used for this project is available on [Kaggle](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset). It is a collection of ~114,000 songs across 125 genres with features such as danceability, energy, tempo, and popularity, which makes it well suited to audio analysis, genre classification, and music trend exploration.

The dataset consists of the following columns:

* track_id: Unique Spotify identifier for each track.
* artists: List of artists performing the track, separated by semicolons.
* album_name: Title of the album where the track appears.
* track_name: Title of the song.
* popularity: Score from 0–100 based on recent play counts; higher means more popular.
* duration_ms: Length of the track in milliseconds.
* explicit: Indicates whether the track contains explicit content (True/False).
* danceability: Score (0.0–1.0) measuring how suitable the song is for dancing.
* energy: Score (0.0–1.0) reflecting intensity, speed, and loudness.
* key: Musical key using Pitch Class notation (0 = C, 1 = C♯/D♭, etc.).
* loudness: Overall volume of the track in decibels.
* mode: Indicates scale type (1 = major, 0 = minor).
* speechiness: Score estimating spoken content in the track.
* acousticness: Likelihood (0.0–1.0) that the song is acoustic.
* instrumentalness: Probability that the track has no vocals.
* liveness: Measures if the song was recorded live (higher = more live).
* valence: Positivity of the music (0.0 = sad, 1.0 = happy).
* tempo: Speed of the song in beats per minute (BPM).
* time_signature: Musical meter (e.g. 4 = 4/4 time).
* track_genre: Musical genre classification of the track.

Import the libraries that will be used:

In [1]:

from rapidfuzz import process, fuzz  #import RapidFuzz for fuzzy string matching
import pandas as pd                 #import Pandas for data manipulation
import numpy as np                  #import Numpy for numerical operations
import matplotlib.pyplot as plt     #import Matplotlib for data visualization
import seaborn as sns               #import Seaborn for statistical data visualization
from plotly.subplots import make_subplots  #import Plotly subplots for creating complex figures
import plotly.express as px         #import Plotly Express for interactive visualizations

In [2]:
sns.set(style="whitegrid")                  # Set Seaborn style for plots
plt.rcParams["figure.figsize"] = (10,6)     # Set default figure size for Matplotlib plots

## 1. Exploratory Data Analysis

This section covers exploratory data analysis (EDA), including data loading and cleaning.
As a first step, the dataset is loaded into a DataFrame.

In [3]:
df = pd.read_csv('data/spotify_dataset.csv')  # Load the Spotify tracks dataset
df.head()                                            # Display the first few rows of the dataset

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,...,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,...,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,...,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,...,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,...,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic
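
The `Unnamed: 0` column in the output above is typical of a CSV that was saved together with its row index. As a minimal sketch, assuming that is the case here, `index_col=0` tells `pd.read_csv` to use that column as the index instead of loading it as data. The CSV text below is made-up toy data, not rows from the real file.

```python
import io
import pandas as pd

# Toy CSV with an unnamed leading index column (made-up rows, for illustration only)
csv_text = ",track_id,popularity\n0,abc,73\n1,def,55\n"

# Default behaviour: the unnamed first column is loaded as "Unnamed: 0"
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column becomes the row index
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
```

The notebook keeps the default load and drops the column later, which gives the same result.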


### 1.1 Initial data exploration

This subsection performs an initial inspection of the dataset. The shape, summary info, and column data types of the DataFrame are shown below.

In [4]:
print(df.shape)   # Print the shape of the DataFrame (rows, columns)
df.info()         # Print a concise summary; info() prints directly and returns None
print(df.dtypes)  # Print the data type of each column

(114000, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114000 entries, 0 to 113999
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        114000 non-null  int64  
 1   track_id          114000 non-null  object 
 2   artists           113999 non-null  object 
 3   album_name        113999 non-null  object 
 4   track_name        113999 non-null  object 
 5   popularity        114000 non-null  int64  
 6   duration_ms       114000 non-null  int64  
 7   explicit          114000 non-null  bool   
 8   danceability      114000 non-null  float64
 9   energy            114000 non-null  float64
 10  key               114000 non-null  int64  
 11  loudness          114000 non-null  float64
 12  mode              114000 non-null  int64  
 13  speechiness       114000 non-null  float64
 14  acousticness      114000 non-null  float64
 15  instrumentalness  114000 non-null  float64
 16  livenes

The dataset contains 114,000 entries (rows) and 21 columns.

In the next steps, the DataFrame is checked for inconsistencies (duplicates, missing values, etc.).

In [5]:
df.isnull().sum()                  # Check for missing values in each column

Unnamed: 0          0
track_id            0
artists             1
album_name          1
track_name          1
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64

As can be seen, there are three missing values (one each in artists, album_name, and track_name). This number is negligible, so the affected rows can be dropped.

In [6]:
df.dropna(inplace=True)           # Drop rows with missing values
df.reset_index(drop=True, inplace=True)  # Reset index after dropping rows
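
As a hedged sketch of the same step without `inplace=True`, the snippet below inspects the incomplete rows first and then reports how many rows were removed. It runs on a small made-up DataFrame (`toy`), and the names `incomplete`, `cleaned`, and `rows_dropped` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Made-up data: the second row has missing text fields
toy = pd.DataFrame({
    "artists": ["a", None, "c"],
    "track_name": ["x", None, "z"],
    "popularity": [10, 0, 30],
})

rows_before = len(toy)

# Rows that contain at least one missing value
incomplete = toy[toy.isnull().any(axis=1)]

# Drop them and rebuild a contiguous 0..n-1 index in one chained expression
cleaned = toy.dropna().reset_index(drop=True)

rows_dropped = rows_before - len(cleaned)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```

Looking at `incomplete` before dropping shows whether the missing values sit in one row or are spread across several.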

Next, we verify that the missing values have been removed.

In [7]:
df.isnull().sum()                  # Verify that there are no missing values left

Unnamed: 0          0
track_id            0
artists             0
album_name          0
track_name          0
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64

The column names are displayed below.

In [8]:
df.columns

Index(['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name',
       'popularity', 'duration_ms', 'explicit', 'danceability', 'energy',
       'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
       'track_genre'],
      dtype='object')

Here we see the columns "Unnamed: 0" and "track_id". These two columns are not needed for further analysis and modeling, so they can be dropped.

In [9]:
df.drop(columns=['Unnamed: 0', 'track_id'], inplace=True)  # Drop columns that are not needed

In [10]:
df.columns

Index(['artists', 'album_name', 'track_name', 'popularity', 'duration_ms',
       'explicit', 'danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'time_signature', 'track_genre'],
      dtype='object')

The next step is to check whether the categorical columns contain misspelled or inconsistently formatted names, which would prevent duplicates from being detected.

In [11]:
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

print("Categorical columns:")
print(categorical_cols)

Categorical columns:
['artists', 'album_name', 'track_name', 'track_genre']


The next cell performs the following steps:

1. Detects duplicate rows.
2. Prints the total count.
3. Displays which rows are duplicated (track name, artist, album).
4. Shows which columns differ from the first occurrence of the same track.
5. Removes the duplicates and keeps the cleaned dataset in `df`.

In [12]:
# Step 1: Detect duplicates
duplicates = df[df.duplicated(keep=False)]
print("🔁 Number of duplicate records:", len(duplicates))

# Step 2: Show duplicate track details
if not duplicates.empty:
    print("\n📀 Duplicated Tracks (sample):")
    print(duplicates[['track_name', 'artists', 'album_name']].head())

    # Step 3: Compare duplicates to first occurrence
    print("\n🔎 Columns with different values in duplicated rows:")
    duplicated_indices = duplicates.index

    for idx in duplicated_indices:
        row = df.loc[idx]
        first_occurrence = df[(df['track_name'] == row['track_name']) & 
                              (df['artists'] == row['artists']) & 
                              (df['album_name'] == row['album_name'])].iloc[0]
        
        differing_columns = [col for col in df.columns if row[col] != first_occurrence[col]]

        if differing_columns:
            print(f"Row {idx} differs in columns: {differing_columns}")

# Step 4: Remove duplicate rows (keeping the first occurrence)
df = df[~df.duplicated()].copy()  # .copy() avoids SettingWithCopyWarning on later column assignments
print("\n✅ Duplicate records removed. Cleaned dataset ready in `df`.")


🔁 Number of duplicate records: 1129

📀 Duplicated Tracks (sample):
                track_name                  artists  \
1874      Song for Rollins   Buena Onda Reggae Club   
1925      Song for Rollins   Buena Onda Reggae Club   
2044  Don't Shoot Me Santa  The Killers;Ryan Pardey   
2046  Don't Shoot Me Santa  The Killers;Ryan Pardey   
2082         Christmastime    The Smashing Pumpkins   

                      album_name  
1874                     Disco 2  
1925                     Disco 2  
2044  Alternative Christmas 2022  
2046  Alternative Christmas 2022  
2082  Alternative Christmas 2022  

🔎 Columns with different values in duplicated rows:
Row 3043 differs in columns: ['track_genre']
Row 3044 differs in columns: ['track_genre']
Row 3129 differs in columns: ['track_genre']
Row 3138 differs in columns: ['track_genre']
Row 3147 differs in columns: ['track_genre']
Row 3149 differs in columns: ['track_genre']
Row 3269 differs in columns: ['track_genre']
Row 3270 differ
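
The row-by-row comparison above re-filters the whole DataFrame once per duplicated row, which can be slow on ~114,000 rows. A possible vectorized alternative, shown here as a sketch on made-up data, groups rows by the same three identifying columns and counts how many distinct values each remaining column takes within a group. The names `toy`, `key_cols`, and `groups_differing` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Made-up data: "song a" appears twice with different genres
toy = pd.DataFrame({
    "track_name": ["song a", "song a", "song b", "song b"],
    "artists": ["x", "x", "y", "y"],
    "album_name": ["alb1", "alb1", "alb2", "alb2"],
    "popularity": [50, 50, 20, 20],
    "track_genre": ["rock", "alt rock", "pop", "pop"],
})

key_cols = ["track_name", "artists", "album_name"]

# Number of distinct values per column within each (track, artist, album) group
distinct_per_group = toy.groupby(key_cols).nunique()

# For each column, count the groups in which the values are not all identical
groups_differing = distinct_per_group.gt(1).sum()
print(groups_differing)
```

A column with a non-zero count is one that varies between rows describing the same track, which in the output above is `track_genre`.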

Next, the text in the categorical columns is normalized so that names are formatted uniformly.

In [13]:
# Loop through each categorical column and clean it
for col in categorical_cols:
    df[col] = (
        df[col]
        .astype(str)                 # ensure the column is string type
        .str.strip()                 # remove leading/trailing spaces
        .str.lower()                 # make everything lowercase
        .str.replace("-", " ", regex=False)   # replace hyphens with spaces
        .str.replace("_", " ", regex=False)   # replace underscores with spaces
        .str.replace(";", " and ", regex=False)   # replace semicolons with " and "
    )

# Check the result
df.head()

Unnamed: 0,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,gen hoshino,comedy,comedy,73,230666,False,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,ben woodward,ghost (acoustic),ghost acoustic,55,149610,False,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,ingrid michaelson and zayn,to begin again,to begin again,57,210826,False,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,kina grannis,crazy rich asians (original motion picture sou...,can't help falling in love,71,201933,False,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,chord overstreet,hold on,hold on,82,198853,False,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic
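
The same normalization can be wrapped in a reusable function, sketched below. Two details differ from the cell above and are assumptions of this sketch rather than the notebook's behavior: it uses the pandas `"string"` dtype, which keeps missing values as `<NA>` instead of turning them into the literal text `"nan"` as `astype(str)` does, and it collapses repeated whitespace left behind when a spaced hyphen is replaced. The name `normalize_text` is hypothetical.

```python
import pandas as pd

def normalize_text(series: pd.Series) -> pd.Series:
    """Lowercase, trim, and unify separators in a text column."""
    return (
        series.astype("string")                    # keeps missing values as <NA>
        .str.strip()                               # remove leading/trailing spaces
        .str.lower()                               # make everything lowercase
        .str.replace(r"[-_]", " ", regex=True)     # hyphens/underscores -> spaces
        .str.replace(";", " and ", regex=False)    # artist separator -> " and "
        .str.replace(r"\s+", " ", regex=True)      # collapse repeated whitespace
    )

sample = pd.Series(["  Hip-Hop ", "Ingrid Michaelson;ZAYN", "Ghost - Acoustic", None])
normalized = normalize_text(sample)
print(normalized.tolist())
```

It could be applied column by column, for example `df[col] = normalize_text(df[col])` inside the loop over `categorical_cols`.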


Detecting misspellings or inconsistent categories in categorical columns is important for both EDA and modeling (e.g., "hip hop" vs "Hip-Hop" vs "Hip hop").

In [14]:
for col in categorical_cols:      # Loop through each categorical column
    print(f"\n🔹 Column: {col}")
    print(f"Unique values: {df[col].nunique()}")  # Print number of unique categories
    print(df[col].value_counts())  # Show category frequencies (pandas truncates long output)
    print("-" * 50)


🔹 Column: artists
Unique values: 31427
artists
the beatles                                      279
george jones                                     260
stevie wonder                                    235
linkin park                                      224
ella fitzgerald                                  221
                                                ... 
alexandre aposan and salomão                        1
nando fortunato and sephora                        1
coral voice soul and melk villar                   1
sync 3 and ericka nascimento and matheus bird      1
jesus culture                                      1
Name: count, Length: 31427, dtype: int64
--------------------------------------------------

🔹 Column: album_name
Unique values: 46150
album_name
feliz cumpleaños con perreo     180
alternative christmas 2022     156
metal                          143
halloween con perreito         122
halloween party 2022           111
                              ... 
5150 

This gives an overview of what's inside each categorical column, making it possible to spot potential misspellings visually.
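
Visual inspection does not scale to columns with tens of thousands of unique values. As a rough sketch of an automated check, the snippet below flags pairs of values whose spelling is nearly identical. It uses `difflib` from the Python standard library in place of RapidFuzz, purely so the example has no extra dependency. The function name `near_duplicates` and the 0.85 threshold are assumptions for illustration. Because it compares every pair, it is only practical for small columns such as `track_genre`.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicates(values, threshold=0.85):
    """Return (a, b, score) for pairs of distinct values with similar spelling."""
    pairs = []
    # Compare every unordered pair of unique values
    for a, b in combinations(sorted(set(values)), 2):
        score = SequenceMatcher(None, a, b).ratio()  # similarity in [0, 1]
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    return pairs

# Made-up category labels for illustration
genres = ["hip hop", "hiphop", "rock", "jazz", "rock"]
suspects = near_duplicates(genres)
print(suspects)
```

Flagged pairs are candidates for manual review, not automatic merging, since distinct names can also be spelled alike.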

It is also necessary to convert the song duration from milliseconds to minutes and drop the "duration_ms" column.

In [15]:
df["duration_min"] = df["duration_ms"] / 60000  # Convert duration from milliseconds to minutes
df.drop(columns=["duration_ms"], inplace=True)  # Drop the original duration_ms column
df.head()  # Display the first few rows to verify changes

Unnamed: 0,artists,album_name,track_name,popularity,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre,duration_min
0,gen hoshino,comedy,comedy,73,False,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic,3.844433
1,ben woodward,ghost (acoustic),ghost acoustic,55,False,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic,2.4935
2,ingrid michaelson and zayn,to begin again,to begin again,57,False,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic,3.513767
3,kina grannis,crazy rich asians (original motion picture sou...,can't help falling in love,71,False,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic,3.36555
4,chord overstreet,hold on,hold on,82,False,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic,3.314217


The next step is to look at a summary of descriptive statistics for each numeric column in `df`.

We use descriptive statistics to summarize, explore, and validate the dataset before modeling. They give us a quick overview of central tendencies, spread, and anomalies, helping us decide how to clean, visualize, and model the data.

This summary includes the following metrics:

* count: Number of non-missing (non-NaN) values. Helps check for missing data.
* mean: Average (sum / count). Central tendency of numeric data.
* std: Standard deviation. How spread out the values are from the mean.
* min: Minimum value. The smallest observed value.
* 25%: 25th percentile (Q1). 25% of the data is below this value.
* 50% (median): 50th percentile. Half the data is below this value.
* 75%: 75th percentile (Q3). 75% of the data is below this value.
* max: Maximum value. The largest observed value.

The following cell computes descriptive statistics for the numeric columns:

In [16]:
df.describe() # Generate descriptive statistics of numerical columns

Unnamed: 0,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,duration_min
count,113422.0,113422.0,113422.0,113422.0,113422.0,113422.0,113422.0,113422.0,113422.0,113422.0,113422.0,113422.0,113422.0,113422.0
mean,33.359674,0.567113,0.642174,5.309332,-8.242913,0.637681,0.084697,0.314075,0.155802,0.21361,0.474239,122.176181,3.904225,3.801686
std,22.269626,0.173402,0.251031,3.559767,5.011931,0.480673,0.105803,0.331943,0.309314,0.190481,0.259239,29.972104,0.432077,1.77418
min,0.0,0.0,0.0,0.0,-49.531,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1431
25%,17.0,0.456,0.473,2.0,-9.998,0.0,0.0359,0.0168,0.0,0.098,0.26,99.299,4.0,2.902833
50%,35.0,0.58,0.685,5.0,-6.996,1.0,0.0489,0.168,4.1e-05,0.132,0.464,122.019,4.0,3.550267
75%,50.0,0.695,0.854,8.0,-5.001,1.0,0.084575,0.596,0.0487,0.273,0.683,140.073,4.0,4.36
max,100.0,0.985,1.0,11.0,4.532,1.0,0.965,0.996,1.0,1.0,0.995,243.372,5.0,87.28825


After cleaning, the dataset contains 113,422 tracks, and the statistics cover 14 numerical features related to track popularity and audio characteristics.

Key Feature Insights

1. Popularity. Range: 0 to 100 (Spotify-defined scale). Mean: 33.4 (relatively low), median: 35. Insight: Most songs in the dataset are not highly popular. The 75th percentile is only 50, so only a small portion reaches scores above 75.
2. Danceability. Scale: 0.0 to 1.0 (higher = more danceable). Mean: 0.567. Insight: Most tracks are moderately danceable, with the middle half of the songs between 0.456 and 0.695. The mean sits slightly below the median (0.58), which hints at a mild left skew.
3. Energy. Scale: 0.0 to 1.0. Mean: 0.64. Insight: Tracks generally have high energy, indicating a tendency toward upbeat or intense music.
4. Key. Range: 0 to 11 (12 semitones, C to B). Mean: ~5.3. Insight: The quartiles (2, 5, 8) are consistent with keys spread fairly evenly across the 12 pitch classes. Because key is a categorical pitch class, the mean itself carries little meaning.
5. Loudness. Unit: decibels (dB). Mean: -8.24 dB, min: -49.5 dB. Insight: Some tracks have extremely low loudness, possibly ambient or silent tracks; the majority are mastered for streaming loudness levels (median around -7 dB).
6. Mode. Binary: 0 = minor, 1 = major. Mean: 0.638 → ~64% of songs are in major mode. Insight: Major mode dominates (typically associated with a “happy” sound).
7. Speechiness. Scale: 0.0 to 1.0. Mean: 0.085. Insight: Most tracks have low speech content (e.g., songs, not podcasts), but the max of 0.965 suggests some spoken word/music hybrids.
8. Acousticness. Mean: 0.314. Insight: The majority of tracks are not acoustic-heavy (median 0.168), but the high standard deviation (0.33) shows considerable variety.
9. Instrumentalness. Mean: 0.156. Insight: Most tracks contain vocals (median ~0.000041), but there is a small yet significant subset of instrumental music.
10. Liveness. Mean: 0.214. Insight: Most tracks are studio recordings; live recordings are rare.
11. Valence. Scale: 0.0 (sad) to 1.0 (happy). Mean: 0.47. Insight: Balanced distribution between positive and negative mood songs.
12. Tempo. Mean: 122 BPM. Insight: A common tempo for pop/dance tracks. The range is wide (0–243 BPM), but the quartiles (25% = 99 BPM, 75% = 140 BPM) confirm a core range of ~100–140 BPM.
13. Time Signature. Mean: ~3.9. Insight: Most tracks are in 4/4 time (common time), as expected. Very little variation.
14. Duration (in minutes). Mean: 3.8 minutes. Max: 87 minutes. Insight: Typical track duration matches mainstream standards. The maximum suggests the presence of podcasts, live sets, or compilation tracks.
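
Some of the insights above, such as how few tracks exceed a popularity of 75 or how unusual very long tracks are, can be quantified directly rather than read off the summary table. The sketch below shows one way to do so on made-up numbers. The 1.5 × IQR upper fence is a common rule of thumb chosen here for illustration, not a method used in the notebook, and the names `toy`, `share_above_75`, and `long_tracks` are hypothetical.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    "popularity": [0, 20, 35, 50, 80],
    "duration_min": [2.9, 3.3, 3.6, 4.4, 87.3],
})

# Share of tracks with popularity above 75 (mean of a boolean mask)
share_above_75 = (toy["popularity"] > 75).mean()

# Flag unusually long tracks with the 1.5 * IQR rule
q1 = toy["duration_min"].quantile(0.25)
q3 = toy["duration_min"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
long_tracks = toy[toy["duration_min"] > upper_fence]

print(f"Share above 75: {share_above_75:.1%}")
print(long_tracks)
```

Running the same expressions on `df` would give the actual share of highly popular tracks and the list of duration outliers.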

In [17]:
df_cleaned = df.copy()  # Create a copy of the cleaned DataFrame
df_cleaned.to_csv('data/spotify_cleaned_data.csv', index=False)

### 1.2 Initial data visualization

In this section, initial data visualization is performed.

1.2.1 Pairplot of key numerical 