Data Overview:
We analyze a dataset of `1,994` Spotify songs released between `1956` and `2019`. It includes key musical features such as tempo (BPM), energy, and danceability, along with a popularity score for each track.

Our Process:
1. Data Loading & Cleaning
   - Loaded the Spotify dataset, which contains `15` columns (track metadata plus audio features)
   - Verified data quality (no missing values)
   - Cleaned numerical data for analysis

2. Initial Analysis
   - Explored genre distribution (dominated by `album rock` with `413` songs)
   - Identified top artists (`Queen` leads with `37` songs)
   - Examined popularity patterns across different genres

3. Deep Dive
   - Created correlation matrices to understand feature relationships
   - Analyzed distribution patterns of key musical elements
   - Found that popularity is not strongly correlated with any single feature
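
The initial-analysis and deep-dive steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the helper name `summarize_songs` is hypothetical, the column names are taken from the dataset's header, and the small inline DataFrame is made-up data standing in for the real file.

```python
import pandas as pd

def summarize_songs(df, target="Popularity"):
    """Return genre counts, artist counts, and each numeric feature's correlation with `target`."""
    genre_counts = df["Top Genre"].value_counts()
    artist_counts = df["Artist"].value_counts()
    # Pearson correlation of every numeric column with the target, strongest first
    corr = (
        df.select_dtypes("number")
        .corr()[target]
        .drop(target)
        .sort_values(key=abs, ascending=False)
    )
    return genre_counts, artist_counts, corr

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "Artist": ["A", "A", "B", "C"],
    "Top Genre": ["album rock", "album rock", "pop", "album rock"],
    "Energy": [80, 60, 40, 20],
    "Danceability": [50, 55, 70, 65],
    "Popularity": [40, 30, 20, 10],
})
genres, artists, corr = summarize_songs(toy)
print(genres.head(1))
print(corr)
```

On the real data, the same call would take the loaded DataFrame instead of `toy`; sorting by absolute correlation makes it easy to see whether any single feature stands out.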

Next Steps:
We'll create an interactive Dash-Plotly dashboard to:
- Visualize trends over time
- Allow users to explore relationships between musical features
- Compare different genres and artists
- Provide insights into what makes songs popular

This will help both music enthusiasts and industry professionals understand patterns in popular music over the past six decades.
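
As a starting point for the "trends over time" view, the per-decade averages that such a chart would plot could be computed as below. This is a minimal sketch under assumptions: `decade_trends` is a hypothetical helper, the inline DataFrame is toy data, and the choice of features is arbitrary.

```python
import pandas as pd

def decade_trends(df, features=("Energy", "Danceability", "Popularity")):
    """Average the chosen features per decade of release."""
    decade = (df["Year"] // 10) * 10  # e.g. 1987 -> 1980
    return df.groupby(decade)[list(features)].mean().rename_axis("Decade")

# Toy data for illustration only
toy = pd.DataFrame({
    "Year": [1956, 1958, 1987, 2019],
    "Energy": [30, 50, 80, 60],
    "Danceability": [40, 60, 50, 70],
    "Popularity": [55, 65, 70, 90],
})
trends = decade_trends(toy)
print(trends)
```

The resulting table has one row per decade, so it could be passed to a line chart in the dashboard with little further reshaping.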

In [1]:
# pip install spotipy
# pip install librosa
# pip install deezer-python requests

import pandas as pd
import numpy as np
import os
import plotly.express as px
import plotly.graph_objects as go


#from dash import Dash, dcc, html
import spotipy
from sklearn import preprocessing, metrics
import seaborn as sns
import matplotlib.pyplot as plt
# librosa for audio feature extraction
import librosa
import librosa.display as ld
import IPython.display as ipd

In [2]:
dataset_path = os.path.join('..', 'data', 'spotify-2000.csv')
df = pd.read_csv(dataset_path)
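
The `info()` summary of this dataset reports `Length (Duration)` as `object` rather than a number. One likely cause, stated here as an assumption rather than something verified against the file, is that durations of 1,000 seconds or more are written with a thousands separator (for example `"1,412"`). If so, `pd.read_csv` can parse them directly through its `thousands` parameter. The inline CSV below is made-up sample data.

```python
import io
import pandas as pd

# Made-up sample rows; the quoted value mimics a duration written with a comma
sample_csv = io.StringIO(
    'Title,Length (Duration)\n'
    'Short Song,201\n'
    'Long Song,"1,412"\n'
)

# thousands=',' tells pandas to treat the comma as a digit-group separator
sample = pd.read_csv(sample_csv, thousands=',')
print(sample.dtypes)
```

If the assumption holds for the real file, passing `thousands=','` when reading `dataset_path` would load the column as an integer, so it would also appear in `describe()` and in correlation matrices.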

In [3]:
# function to report which columns in the dataset have null values
def get_percent_of_na(df, num):
    """Print the share (to `num` decimal places) and count of nulls for each column that has any."""
    null_counts = df.isna().sum()
    null_counts = null_counts[null_counts > 0]
    for column, num_of_nulls in null_counts.items():
        percent = num_of_nulls / df.shape[0]
        print('Column {} has {:.{}%} nulls ({} rows)'.format(column, percent, num, num_of_nulls))
    if len(null_counts) != 0:
        print("\033[1m" + 'There are {} columns with NA.'.format(len(null_counts)) + "\033[0m")
    else:
        print()
        print("\033[1m" + 'There are no columns with NA.' + "\033[0m")
        
# function to display general information about the dataset
def get_info(df):
    """
    Display general information about the dataset using head(), info(),
    describe(), the null-value report, the shape attribute and duplicated().
    """
    print("\033[1m" + '-'*100 + "\033[0m")
    print('Head:')
    print()
    display(df.head())
    print('-'*100)
    print('Info:')
    print()
    df.info()  # info() prints directly and returns None, so it is not wrapped in display()
    print('-'*100)
    print('Describe:')
    print()
    display(df.describe())
    print('-'*100)
    print('Columns with nulls:')
    get_percent_of_na(df, 4)  # prints its own report and returns None
    print('-'*100)
    print('Shape:')
    print(df.shape)
    print('-'*100)
    print('Duplicated:')
    print("\033[1m" + 'We have {} duplicated rows.\n'.format(df.duplicated().sum()) + "\033[0m")

In [4]:
get_info(df)

----------------------------------------------------------------------------------------------------
Head:



Unnamed: 0,Index,Title,Artist,Top Genre,Year,Beats Per Minute (BPM),Energy,Danceability,Loudness (dB),Liveness,Valence,Length (Duration),Acousticness,Speechiness,Popularity
0,1,Sunrise,Norah Jones,adult standards,2004,157,30,53,-14,11,68,201,94,3,71
1,2,Black Night,Deep Purple,album rock,2000,135,79,50,-11,17,81,207,17,7,39
2,3,Clint Eastwood,Gorillaz,alternative hip hop,2001,168,69,66,-9,7,52,341,2,17,69
3,4,The Pretender,Foo Fighters,alternative metal,2007,173,96,43,-4,3,37,269,0,4,76
4,5,Waitin' On A Sunny Day,Bruce Springsteen,classic rock,2002,106,82,58,-5,10,87,256,1,3,59


----------------------------------------------------------------------------------------------------
Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1994 entries, 0 to 1993
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Index                   1994 non-null   int64 
 1   Title                   1994 non-null   object
 2   Artist                  1994 non-null   object
 3   Top Genre               1994 non-null   object
 4   Year                    1994 non-null   int64 
 5   Beats Per Minute (BPM)  1994 non-null   int64 
 6   Energy                  1994 non-null   int64 
 7   Danceability            1994 non-null   int64 
 8   Loudness (dB)           1994 non-null   int64 
 9   Liveness                1994 non-null   int64 
 10  Valence                 1994 non-null   int64 
 11  Length (Duration)       1994 non-null   object
 12  Acousticness            1994 non-null   int64 
 13  

----------------------------------------------------------------------------------------------------
Describe:



Unnamed: 0,Index,Year,Beats Per Minute (BPM),Energy,Danceability,Loudness (dB),Liveness,Valence,Acousticness,Speechiness,Popularity
count,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0
mean,997.5,1992.992979,120.215647,59.679539,53.238215,-9.008526,19.012036,49.408726,28.858074,4.994985,59.52658
std,575.762538,16.116048,28.028096,22.154322,15.351507,3.647876,16.727378,24.858212,29.011986,4.401566,14.3516
min,1.0,1956.0,37.0,3.0,10.0,-27.0,2.0,3.0,0.0,2.0,11.0
25%,499.25,1979.0,99.0,42.0,43.0,-11.0,9.0,29.0,3.0,3.0,49.25
50%,997.5,1993.0,119.0,61.0,53.0,-8.0,12.0,47.0,18.0,4.0,62.0
75%,1495.75,2007.0,136.0,78.0,64.0,-6.0,23.0,69.75,50.0,5.0,71.0
max,1994.0,2019.0,206.0,100.0,96.0,-2.0,99.0,99.0,99.0,55.0,100.0


----------------------------------------------------------------------------------------------------
Columns with nulls:

There are no columns with NA.
----------------------------------------------------------------------------------------------------
Shape:
(1994, 15)
----------------------------------------------------------------------------------------------------
Duplicated:
We have 0 duplicated rows.


## Audio Features

Download audio previews for the 10 most popular songs in this dataset through the Deezer API, then load the downloaded files with the librosa library.


In [53]:
# define function for audio previews
def audio_previews(df):
    import requests
    import os
    import deezer

    top_popular = df.nlargest(10, 'Popularity')[['Title', 'Artist', 'Popularity']]
    top_songs = top_popular['Title'].tolist()

    # Initialize Deezer client
    client = deezer.Client()

    # Create an empty list to store track data
    track_data = []

    # Search for each song and get its preview URL.
    # Note: the query is the title only, so the first hit may be a different artist's version.
    for title in top_songs:
        search_results = client.search(title)

        # Check if any tracks were found
        if search_results:
            # Get the first track from the search results
            track = search_results[0]

            # Append track title and preview URL to the list
            track_data.append({"Title": track.title, "Preview URL": track.preview})

    # Create a Pandas DataFrame from the track data
    # (declared global, which makes `data` available outside this function)
    global data
    data = pd.DataFrame(track_data)


    # Create a directory to save the downloaded files
    download_dir = "audio_previews"
    os.makedirs(download_dir, exist_ok=True) # Create the directory if it doesn't exist

    # Loop through each row of the DataFrame and download the audio
    for index, row in data.iterrows():
        track_title = row['Title']
        preview_url = row['Preview URL']

        # Create a basic filename from the track title (you might want to sanitize this further)
        filename = f"{track_title}.mp3"  # Adjust file extension if needed
        filepath = os.path.join(download_dir, filename)

        try:
            # Download the audio file
            response = requests.get(preview_url, stream=True)
            response.raise_for_status()  # Check for bad status codes

            # Save the audio content to the file
            with open(filepath, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)

            print(f"Downloaded '{filename}' successfully!")

        except requests.exceptions.RequestException as e:
            print(f"Error downloading '{filename}' from {preview_url}: {e}")
        except Exception as e:
            print(f"An unexpected error occurred for '{filename}': {e}")

    # select a random song from the downloaded previews
    random_idx = np.random.randint(0, data.shape[0])
    song = data.loc[random_idx, :]

    # load the file and get its sampling rate
    file_path = os.path.join(download_dir, song["Title"] + '.mp3')
    audio, sample_rate = librosa.load(file_path)
    # print info about this song
    print(' ')
    print(f"Sampling rate: {sample_rate}")
    print(song)

    # output the audio
    display(ipd.Audio(file_path))

In [54]:
audio_previews(df)

Downloaded 'Dance Monkey.mp3' successfully!
Downloaded 'Memories.mp3' successfully!
Downloaded 'bad guy.mp3' successfully!
Downloaded 'All I Want for Christmas Is You.mp3' successfully!
Downloaded 'Believer.mp3' successfully!
Downloaded 'Shallow.mp3' successfully!
Downloaded 'Perfect.mp3' successfully!
Downloaded 'Shape of You.mp3' successfully!
Downloaded 'High Hopes.mp3' successfully!
Downloaded 'All of Me.mp3' successfully!
 
Sampling rate: 22050
Title                                               Shape of You
Preview URL    https://cdnt-preview.dzcdn.net/api/1/1/f/1/b/0...
Name: 7, dtype: object
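
The download loop builds each filename directly from the track title, and its own comment notes that further sanitizing may be needed. A title containing a character such as `/` would otherwise be interpreted as a path separator. Below is a minimal sketch of one way to handle this; `safe_filename` is a hypothetical helper, not part of the notebook, and the set of replaced characters is an assumption based on what common filesystems reject.

```python
import re

def safe_filename(title, extension=".mp3"):
    """Build a filesystem-friendly filename from a track title."""
    # Replace characters that are invalid on common filesystems with an underscore
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title)
    # Collapse runs of whitespace and trim the ends
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Fall back to a placeholder if nothing usable is left
    return (cleaned or "untitled") + extension

print(safe_filename("AC/DC: Back in Black"))
```

If a helper like this is used when saving a preview, the same helper would need to be used when building the path passed to `librosa.load`, so that both steps agree on the filename.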


