# Spotify Track Analytics Popularity Prediction

Dataset Content
The dataset used for this project comes from [Kaggle](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset). It is a collection of ~114,000 songs across 125 genres with features such as danceability, energy, tempo, and popularity, which makes it suitable for audio analysis, genre classification, and music trend exploration.

The dataset consists of the following columns:

* track_id: Unique Spotify identifier for each track.
* artists: List of artists performing the track, separated by semicolons.
* album_name: Title of the album where the track appears.
* track_name: Title of the song.
* popularity: Score from 0–100 based on recent play counts; higher means more popular.
* duration_ms: Length of the track in milliseconds.
* explicit: Indicates whether the track contains explicit content (True/False).
* danceability: Score (0.0–1.0) measuring how suitable the song is for dancing.
* energy: Score (0.0–1.0) reflecting intensity, speed, and loudness.
* key: Musical key using Pitch Class notation (0 = C, 1 = C♯/D♭, etc.).
* loudness: Overall volume of the track in decibels.
* mode: Indicates scale type (1 = major, 0 = minor).
* speechiness: Score estimating spoken content in the track.
* acousticness: Likelihood (0.0–1.0) that the song is acoustic.
* instrumentalness: Probability that the track has no vocals.
* liveness: Measures if the song was recorded live (higher = more live).
* valence: Positivity of the music (0.0 = sad, 1.0 = happy).
* tempo: Speed of the song in beats per minute (BPM).
* time_signature: Musical meter (e.g. 4 = 4/4 time).
* track_genre: Musical genre classification of the track.
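
To make the pitch-class encoding of the `key` column concrete, here is a minimal sketch that maps the integers 0 to 11 to note names. The helper name `key_to_note` is hypothetical, and returning `"unknown"` for values outside 0 to 11 is an assumption for illustration, not something the dataset description specifies. Note names are written with ASCII `#` and `b` instead of the sharp and flat symbols.

```python
# Note names for pitch classes 0..11 (ASCII "#" = sharp, "b" = flat).
PITCH_CLASSES = [
    "C", "C#/Db", "D", "D#/Eb", "E", "F",
    "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B",
]


def key_to_note(key: int) -> str:
    """Hypothetical helper: map a pitch-class integer to its note name."""
    if 0 <= key < len(PITCH_CLASSES):
        return PITCH_CLASSES[key]
    # Assumption: any value outside 0..11 is reported as unknown.
    return "unknown"


print(key_to_note(0))
print(key_to_note(5))
```

Once the data is loaded, a call such as `df["key"].map(key_to_note)` could produce a readable label column for plots.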

Import libraries that will be used

In [None]:

import pandas as pd                 #import Pandas for data manipulation
import numpy as np                  #import Numpy for numerical operations
import matplotlib.pyplot as plt     #import Matplotlib for data visualization
import seaborn as sns               #import Seaborn for statistical data visualization
from plotly.subplots import make_subplots  #import Plotly subplots for creating complex figures
import plotly.express as px         #import Plotly Express for interactive visualizations

In [None]:
sns.set(style="whitegrid")                  # Set Seaborn style for plots
plt.rcParams["figure.figsize"] = (10,6)     # Set default figure size for Matplotlib plots

## 1. Exploratory Data Analysis

In this section, exploratory data analysis (EDA) is performed, including data loading and cleaning.
As a first step, the dataset is loaded into a DataFrame.

In [None]:
df = pd.read_csv('data/spotify_dataset.csv')  # Load the Spotify tracks dataset
df.head()                                     # Display the first few rows of the dataset

### 1.1 Initial data exploration

In this subsection, an initial inspection of the dataset is performed. The shape and summary info of the DataFrame are shown.

In [None]:
print(df.shape)                     # Print the shape of the DataFrame
df.info()                           # Print a concise summary (info() prints directly and returns None)
print(df.dtypes)                    # Print data types of each column

The dataset contains roughly 114,000 entries (rows) and 21 columns.

In the next steps, the DataFrame is checked for inconsistencies (duplicates, missing values, etc.).

In [None]:
df.isnull().sum()                  # Check for missing values in each column

As can be seen, there are 3 missing entries. This number is negligible, so we can drop these entries.

In [None]:
df.dropna(inplace=True)           # Drop rows with missing values
df.reset_index(drop=True, inplace=True)  # Reset index after dropping rows

Next, we verify that the missing values have been removed.

In [None]:
df.isnull().sum()                  # Verify that there are no missing values left
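
To back up the claim that the missing entries are negligible, it can help to express them as a share of all rows before dropping them. The sketch below does this on a small made-up DataFrame (`toy` and its values are assumptions for illustration, not the real data).

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "track_name": ["a", "b", None, "d"],
    "artists": ["x", None, None, "z"],
    "popularity": [10, 20, 30, 40],
})

missing_per_column = toy.isnull().sum()              # missing cells per column
rows_with_missing = toy.isnull().any(axis=1).sum()   # rows with at least one missing cell
share = rows_with_missing / len(toy)                 # fraction of rows that would be dropped

print(missing_per_column)
print(f"Rows affected: {rows_with_missing} ({share:.1%})")

# Same cleaning steps as in the cells above, applied to the toy data.
cleaned = toy.dropna().reset_index(drop=True)
print(cleaned.isnull().sum().sum())  # total missing cells left after dropping
```

On the real DataFrame, the same `isnull().any(axis=1)` expression applied to `df` gives the number of rows that `dropna` removes.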

Here the column names are displayed:

In [None]:
df.columns

Here we see the columns "Unnamed: 0" and "track_id". These two columns are not relevant for further analysis and modeling, so they can be dropped.

In [None]:
df.drop(columns=['Unnamed: 0', 'track_id'], inplace=True)  # Drop unnecessary columns

In [None]:
df.columns

The next cell handles duplicate records. It:

1. Detects duplicates
2. Prints the total count
3. Displays which rows are duplicated (track name, artist, album)
4. Shows which columns differ across duplicates
5. Removes duplicates and keeps the clean dataset in `df`

In [None]:
# Step 1: Detect duplicates
duplicates = df[df.duplicated(keep=False)]
print("🔁 Number of duplicate records:", len(duplicates))

# Step 2: Show duplicate track details
if not duplicates.empty:
    print("\n📀 Duplicated Tracks (sample):")
    print(duplicates[['track_name', 'artists', 'album_name']].head())

    # Step 3: Compare duplicates to first occurrence
    print("\n🔎 Columns with different values in duplicated rows:")
    duplicated_indices = duplicates.index

    for idx in duplicated_indices:
        row = df.loc[idx]
        first_occurrence = df[(df['track_name'] == row['track_name']) & 
                              (df['artists'] == row['artists']) & 
                              (df['album_name'] == row['album_name'])].iloc[0]
        
        differing_columns = [col for col in df.columns if row[col] != first_occurrence[col]]

        if differing_columns:
            print(f"Row {idx} differs in columns: {differing_columns}")

# Step 4: Remove duplicate rows (keeping the first occurrence)
df = df[~df.duplicated()].copy()  # .copy() avoids chained-assignment warnings in later cells
print("\n✅ Duplicate records removed. Cleaned dataset ready in `df`.")
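
The difference between `duplicated(keep=False)` and the default `duplicated()` is easy to miss, so here is a small sketch on made-up data. It also shows a vectorized alternative to the row-by-row comparison, using `groupby(...).nunique()`. This is a swapped-in technique rather than the approach used in the cell above, and the `toy` frame and `id_cols` list are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: the first two are exact copies, the third differs only in genre.
toy = pd.DataFrame({
    "track_name": ["song a", "song a", "song a", "song b"],
    "artists": ["x", "x", "x", "y"],
    "album_name": ["lp", "lp", "lp", "ep"],
    "track_genre": ["pop", "pop", "rock", "jazz"],
})

# keep=False flags every member of a duplicate group;
# the default (keep="first") flags only the later copies.
all_copies = toy.duplicated(keep=False)
later_copies = toy.duplicated()
print(all_copies.sum(), later_copies.sum())

# For each (track, artist, album) group, mark columns that take more than one value.
id_cols = ["track_name", "artists", "album_name"]
varying = toy.groupby(id_cols).nunique().gt(1)
print(varying)

# Drop exact duplicates, keeping the first occurrence.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(deduped))
```

The `varying` table makes it visible when the same track appears several times under different genres, which an exact-row duplicate check alone does not catch.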


To simplify later steps, the categorical and numerical columns are split into separate lists. The "explicit" column is also converted to a numerical type.

In [None]:
df['explicit'] = df['explicit'].astype(int)  # Convert 'explicit' column to integer type
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist() # Select categorical columns
numerical_cols = df.select_dtypes(include=['number']).columns.tolist()              # Select numerical columns

print("Categorical columns:")
print(categorical_cols)
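
The order of the two steps above matters: a boolean column is not matched by `select_dtypes(include=['number'])`, so `explicit` has to be cast to integer before the split. A minimal sketch on made-up data (`toy` is an assumption for illustration):

```python
import pandas as pd

# Made-up rows with one text, one boolean, and one float column.
toy = pd.DataFrame({
    "track_name": ["a", "b"],
    "explicit": [True, False],
    "tempo": [120.0, 98.5],
})

# Before the cast, the boolean column is not treated as numeric.
before = toy.select_dtypes(include=["number"]).columns.tolist()

# After casting to int, it is picked up together with the other numeric columns.
toy["explicit"] = toy["explicit"].astype(int)
after = toy.select_dtypes(include=["number"]).columns.tolist()

print(before)
print(after)
```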

The next step is to unify the formatting of the values in the categorical columns.

In [None]:
# Loop through each categorical column and clean it
for col in categorical_cols:
    df[col] = (
        df[col]
        .astype(str)                 # ensure the column is string type
        .str.strip()                 # remove leading/trailing spaces
        .str.lower()                 # make everything lowercase
        .str.replace("-", " ", regex=False)   # replace hyphens with spaces
        .str.replace("_", " ", regex=False)   # replace underscores with spaces
        .str.replace(";", " and ", regex=False)   # replace semicolons (artist separators) with " and "
    )

# Check the result
df.head()
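
As a quick check of what this cleaning chain does, the sketch below applies the same steps to a few made-up strings. The helper name `normalize` and the sample values are assumptions for illustration.

```python
import pandas as pd


def normalize(s: pd.Series) -> pd.Series:
    """Apply the same cleaning chain as the loop above to a single Series."""
    return (
        s.astype(str)
        .str.strip()                             # remove leading/trailing spaces
        .str.lower()                             # make everything lowercase
        .str.replace("-", " ", regex=False)      # hyphens -> spaces
        .str.replace("_", " ", regex=False)      # underscores -> spaces
        .str.replace(";", " and ", regex=False)  # semicolons -> " and "
    )


# Three spellings of the same genre and one multi-artist entry (made up).
genres = pd.Series(["  Hip-Hop", "hip_hop ", "Hip hop"])
artists = pd.Series(["Artist A;Artist B"])

print(normalize(genres).unique())
print(normalize(artists).iloc[0])
```

One thing to keep in mind is that `astype(str)` turns missing values into the literal string `"nan"`, so this chain is best applied after missing values have been handled, as is done here.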

Detecting misspellings or inconsistent categories in categorical columns is important for both EDA and modeling (e.g., "hip hop" vs "Hip-Hop" vs "Hip hop").

In [None]:
for col in categorical_cols:      # Loop through each categorical column
    print(f"\n🔹 Column: {col}")
    print(f"Unique values: {df[col].nunique()}")  # Print number of unique categories
    print(df[col].value_counts().head(10))  # Show top 10 most common categories
    print("-" * 50)

This gives you an overview of what's inside each categorical column, letting you visually spot potential misspellings.
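
Visual inspection becomes tedious for columns with many unique values. One possible complement, sketched here under stated assumptions, is to flag pairs of similar-looking labels with `difflib.get_close_matches` from the Python standard library. The helper name `near_duplicates`, the sample labels, and the `cutoff` of 0.85 are all assumptions chosen for illustration and would need tuning on real data.

```python
from difflib import get_close_matches


def near_duplicates(values, cutoff=0.85):
    """Hypothetical helper: list pairs of labels that look very similar."""
    labels = sorted(set(values))
    pairs = []
    for i, label in enumerate(labels):
        # Compare only against later labels so each pair is reported once.
        matches = get_close_matches(label, labels[i + 1:], n=5, cutoff=cutoff)
        pairs.extend((label, match) for match in matches)
    return pairs


# Made-up category labels, including one likely spelling variant.
sample = ["hip hop", "hiphop", "rock", "jazz", "rokc"]
print(near_duplicates(sample))
```

A call such as `near_duplicates(df["track_genre"].unique())` could then shortlist candidates for manual review. Because the comparison is pairwise, it is better suited to low-cardinality columns such as genre than to track or artist names.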

It is also useful to convert the song duration from milliseconds to minutes and drop the "duration_ms" column.

In [None]:
df["duration_min"] = df["duration_ms"] / 60000  # Convert duration from milliseconds to minutes
df.drop(columns=["duration_ms"], inplace=True)  # Drop the original duration_ms column
df.head()  # Display the first few rows to verify changes

The next step is to look at a descriptive statistics summary for each numerical column in `df`.

We use descriptive statistics to summarize, explore, and validate the dataset before modeling. They give a quick overview of central tendencies, spread, and anomalies, which helps us decide how to clean, visualize, and model the data.

This summary includes the following metrics:

* count: Number of non-missing (non-NaN) values. Helps check missing data.
* mean: Average (sum / count). Central tendency of numeric data.
* std: Standard deviation. How spread out the values are from the mean.
* min: Minimum value. The smallest observed value.
* 25%: 25th percentile (Q1). 25% of data is below this value.
* 50% (median): 50th percentile. Half the data is below this value.
* 75%: 75th percentile (Q3). 75% of data is below this value.
* max: Maximum value. The largest observed value.

In the following cell, descriptive statistics of the numerical columns are computed:

In [None]:
df.describe() # Generate descriptive statistics of numerical columns
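
The quartiles reported by `describe()` can also be used to flag anomalies. The sketch below applies the common 1.5 × IQR rule to a made-up duration column. This rule is an illustrative choice and not a step taken in this notebook, and the `toy` values are assumptions.

```python
import pandas as pd

# Made-up durations in minutes, with one extreme value.
toy = pd.DataFrame({"duration_min": [2.5, 3.0, 3.5, 4.0, 4.5, 87.0]})

stats = toy["duration_min"].describe()
iqr = stats["75%"] - stats["25%"]            # interquartile range (Q3 - Q1)
upper_fence = stats["75%"] + 1.5 * iqr       # values above this are flagged
lower_fence = stats["25%"] - 1.5 * iqr       # values below this are flagged

outliers = toy[
    (toy["duration_min"] > upper_fence) | (toy["duration_min"] < lower_fence)
]
print(outliers)
```

Flagged rows are candidates for inspection rather than automatic removal, since long tracks may be legitimate entries.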

The dataset contains 113,999 tracks, and the statistics cover 14 numerical features related to track popularity and audio characteristics.

Key Feature Insights

1. Popularity
   * Range: 0 to 100 (Spotify-defined scale). Mean: 33.2 (relatively low), Median: 35.
   * Insight: The majority of songs in the dataset are not highly popular. Only a small portion reaches scores above 75.
2. Danceability
   * Scale: 0.0 to 1.0 (higher = more danceable). Mean: 0.567.
   * Insight: Most tracks are moderately danceable. The distribution is slightly right-skewed with many songs in the 0.5–0.7 range.
3. Energy
   * Scale: 0.0 to 1.0. Mean: 0.64.
   * Insight: Tracks generally have high energy, indicating a tendency toward upbeat or intense music.
4. Key
   * Range: 0 to 11 (12 semitones, C to B). Mean: ~5.3.
   * Insight: Keys are evenly distributed, with slight clustering near 5 (F major / D minor).
5. Loudness
   * Unit: Decibels (dB). Mean: -8.26 dB, Min: -49.5 dB.
   * Insight: Some tracks have extremely low loudness, possibly ambient or silent tracks; the majority are mastered for streaming loudness levels (around -7 dB).
6. Mode
   * Binary: 0 = minor, 1 = major. Mean: 0.637 → ~64% of songs are in major mode.
   * Insight: Major mode dominates (typically associated with a "happy" sound).
7. Speechiness
   * Scale: 0.0 to 1.0. Mean: 0.084.
   * Insight: Most tracks have low speech content (e.g., songs, not podcasts), but the max of 0.965 suggests some spoken word/music hybrids.
8. Acousticness
   * Mean: 0.315.
   * Insight: The majority of tracks are not acoustic-heavy, but the high standard deviation (0.33) shows some variety.
9. Instrumentalness
   * Mean: 0.156.
   * Insight: Most tracks contain vocals (median ~0.000042), but there's a small but significant subset of instrumental music.
10. Liveness
    * Mean: 0.213.
    * Insight: Most tracks are studio recordings; live recordings are rare.
11. Valence
    * Scale: 0.0 (sad) to 1.0 (happy). Mean: 0.47.
    * Insight: Balanced distribution between positive and negative mood songs.
12. Tempo
    * Mean: 122 BPM.
    * Insight: Common tempo for pop/dance tracks. The range is wide (0–243 BPM), but the quartiles (25% = 99 BPM, 75% = 140 BPM) confirm a core BPM range of ~100–140.
13. Time Signature
    * Mean: ~3.9.
    * Insight: Most tracks are in 4/4 time (common time), as expected. Very little variation.
14. Duration (in minutes)
    * Mean: 3.8 minutes. Max: 87 minutes!
    * Insight: Typical track duration matches mainstream standards. The max suggests the presence of podcasts, live sets, or compilation tracks.
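
Several of the insights above are statements about shares, such as "only a small portion reaches scores above 75" or "~64% of songs are in major mode". These can be checked directly, because the mean of a boolean mask or of a 0/1 column is a proportion. A minimal sketch on made-up values (the `toy` frame is an assumption; the real figures come from `df`):

```python
import pandas as pd

# Made-up values, used only to illustrate the calculations.
toy = pd.DataFrame({
    "popularity": [0, 10, 20, 35, 35, 40, 50, 60, 80, 90],
    "mode": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
})

share_above_75 = (toy["popularity"] > 75).mean()  # fraction of tracks with popularity > 75
share_major = toy["mode"].mean()                  # mean of a 0/1 column = share of ones
mean_pop = toy["popularity"].mean()
median_pop = toy["popularity"].median()

print(f"Share above 75: {share_above_75:.1%}")
print(f"Share in major mode: {share_major:.1%}")
print(f"Mean vs median popularity: {mean_pop:.1f} vs {median_pop:.1f}")
```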

In [None]:
df_cleaned = df.copy()  # Create a copy of the cleaned DataFrame
df_cleaned.to_csv('data/spotify_cleaned_data.csv', index=False)

### 1.2 Initial data visualization

In this section, an initial visualization of the data is performed.

1.2.1 Pairplot of key numerical 