# Exploratory Data Analysis of Spotify Music Tracks

## Introduction

**Spotify** is a digital music streaming service founded in Stockholm, Sweden, in 2006 by Daniel Ek and Martin Lorentzon. It began as a small startup aiming to provide a legal, accessible platform for listening to music. It has since grown into one of the most popular music streaming services worldwide, with millions of active users and an extensive music library.

## Table of Contents

- Import Necessary Libraries
- Data Understanding
- Data Preparation
- Exploratory Data Analysis
- Conclusions

## Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

sns.set_style("darkgrid")

In [2]:
# Read data in csv format

df = pd.read_csv("spotifyfeatures.csv")

## Data Understanding

- head and tail
- sample
- shape
- size
- columns
- dtypes
- info
- describe
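
The `size` and `dtypes` items in the list above could be checked as in this minimal sketch. It uses a small hypothetical frame named `toy` instead of the real CSV, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the Spotify data (illustrative values only)
toy = pd.DataFrame({
    "genre": ["Movie", "Soul", "Dance"],
    "popularity": [0, 39, 82],
    "energy": [0.91, 0.714, 0.622],
})

n_cells = toy.size       # total number of cells: rows * columns
col_types = toy.dtypes   # one dtype per column

print(n_cells)
print(col_types)
```

On the real data the same two attributes would be read from `df`.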



In [3]:
# preview the first 5 rows of dataset

df.head()

| | genre | artist_name | track_name | track_id | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Movie | Henri Salvador | C'est beau de faire un Show | 0BRjO6ga9RKCKjfDqeFgWV | 0 | 0.611 | 0.389 | 99373 | 0.91 | 0.0 | C# | 0.346 | -1.828 | Major | 0.0525 | 166.969 | 4/4 | 0.814 |
| 1 | Movie | Martin & les fées | Perdu d'avance (par Gad Elmaleh) | 0BjC1NfoEOOusryehmNudP | 1 | 0.246 | 0.59 | 137373 | 0.737 | 0.0 | F# | 0.151 | -5.559 | Minor | 0.0868 | 174.003 | 4/4 | 0.816 |
| 2 | Movie | Joseph Williams | Don't Let Me Be Lonely Tonight | 0CoSDzoNIKCRs124s9uTVy | 3 | 0.952 | 0.663 | 170267 | 0.131 | 0.0 | C | 0.103 | -13.879 | Minor | 0.0362 | 99.488 | 5/4 | 0.368 |
| 3 | Movie | Henri Salvador | Dis-moi Monsieur Gordon Cooper | 0Gc6TVm52BwZD07Ki6tIvf | 0 | 0.703 | 0.24 | 152427 | 0.326 | 0.0 | C# | 0.0985 | -12.178 | Major | 0.0395 | 171.758 | 4/4 | 0.227 |
| 4 | Movie | Fabien Nataf | Ouverture | 0IuslXpMROHdEPvSl1fTQK | 4 | 0.95 | 0.331 | 82625 | 0.225 | 0.123 | F | 0.202 | -21.15 | Major | 0.0456 | 140.576 | 4/4 | 0.39 |


In [4]:
# preview the last 5 rows of dataset

df.tail()

| | genre | artist_name | track_name | track_id | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 232720 | Soul | Slave | Son Of Slide | 2XGLdVl7lGeq8ksM6Al7jT | 39 | 0.00384 | 0.687 | 326240 | 0.714 | 0.544 | D | 0.0845 | -10.626 | Major | 0.0316 | 115.542 | 4/4 | 0.962 |
| 232721 | Soul | Jr Thomas & The Volcanos | Burning Fire | 1qWZdkBl4UVPj9lK6HuuFM | 38 | 0.0329 | 0.785 | 282447 | 0.683 | 0.00088 | E | 0.237 | -6.944 | Minor | 0.0337 | 113.83 | 4/4 | 0.969 |
| 232722 | Soul | Muddy Waters | (I'm Your) Hoochie Coochie Man | 2ziWXUmQLrXTiYjCg2fZ2t | 47 | 0.901 | 0.517 | 166960 | 0.419 | 0.0 | D | 0.0945 | -8.282 | Major | 0.148 | 84.135 | 4/4 | 0.813 |
| 232723 | Soul | R.LUM.R | With My Words | 6EFsue2YbIG4Qkq8Zr9Rir | 44 | 0.262 | 0.745 | 222442 | 0.704 | 0.0 | A | 0.333 | -7.137 | Major | 0.146 | 100.031 | 4/4 | 0.489 |
| 232724 | Soul | Mint Condition | You Don't Have To Hurt No More | 34XO9RwPMKjbvRry54QzWn | 35 | 0.0973 | 0.758 | 323027 | 0.47 | 4.9e-05 | G# | 0.0836 | -6.708 | Minor | 0.0287 | 113.897 | 4/4 | 0.479 |


In [5]:
# sample 7 random rows to spot-check data quality (fixed seed for reproducibility)

df.sample(7, random_state=12345)

| | genre | artist_name | track_name | track_id | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27447 | Alternative | Phoenix | Fior di Latte | 1Bit6aQpVL9mQ2jIz4wQO7 | 40 | 0.203 | 0.568 | 243733 | 0.499 | 0.000532 | A# | 0.228 | -9.108 | Major | 0.0307 | 130.029 | 4/4 | 0.628 |
| 97011 | Children’s Music | Xavier Omär | If This Is Love | 2v9TxPdblHMBRir1y9uIqw | 56 | 0.224 | 0.744 | 220301 | 0.472 | 0.000215 | G | 0.225 | -9.757 | Minor | 0.111 | 104.063 | 4/4 | 0.281 |
| 211858 | Comedy | Doug Benson | Anti-Truth | 5m7AA94zU702nhxwGEqKL0 | 10 | 0.841 | 0.687 | 105594 | 0.587 | 0.0 | B | 0.816 | -9.076 | Major | 0.875 | 58.137 | 4/4 | 0.275 |
| 41304 | Folk | Car Seat Headrest | Drunk Drivers/Killer Whales | 2os0aK782bakCPmjow0SU0 | 60 | 0.218 | 0.534 | 374653 | 0.414 | 2e-06 | D | 0.103 | -7.952 | Major | 0.0403 | 117.079 | 4/4 | 0.458 |
| 13642 | Dance | Zedd | Stay (with Alessia Cara) | 6uBhi9gBXWjanegOb2Phh0 | 82 | 0.253 | 0.69 | 210091 | 0.622 | 0.0 | F | 0.116 | -5.025 | Minor | 0.0622 | 102.04 | 4/4 | 0.544 |
| 194412 | Movie | Fabien Nataf | Lilly | 76ykXT0nshkQGutCsNK3fv | 8 | 0.372 | 0.697 | 102556 | 0.799 | 0.00551 | E | 0.0811 | -4.598 | Major | 0.054 | 144.946 | 4/4 | 0.961 |
| 90479 | Hip-Hop | Dom Kennedy | I Love Dom | 1kFQ2V8nH923JPFzMsNWxS | 49 | 0.278 | 0.873 | 199674 | 0.522 | 0.0 | D | 0.0919 | -5.801 | Major | 0.0565 | 86.998 | 4/4 | 0.535 |


In [6]:
# shape of dataset

df.shape

(232725, 18)

In [7]:
# all columns of dataset

df.columns

Index(['genre', 'artist_name', 'track_name', 'track_id', 'popularity',
       'acousticness', 'danceability', 'duration_ms', 'energy',
       'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'time_signature', 'valence'],
      dtype='object')

In [8]:
# the summary of the dataset

df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232725 entries, 0 to 232724
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   genre             232725 non-null  object 
 1   artist_name       232725 non-null  object 
 2   track_name        232725 non-null  object 
 3   track_id          232725 non-null  object 
 4   popularity        232725 non-null  int64  
 5   acousticness      232725 non-null  float64
 6   danceability      232725 non-null  float64
 7   duration_ms       232725 non-null  int64  
 8   energy            232725 non-null  float64
 9   instrumentalness  232725 non-null  float64
 10  key               232725 non-null  object 
 11  liveness          232725 non-null  float64
 12  loudness          232725 non-null  float64
 13  mode              232725 non-null  object 
 14  speechiness       232725 non-null  float64
 15  tempo             232725 non-null  float64
 16  time_signature    23

In [10]:
# the descriptive statistics summary

df.describe().T.round(2)

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| popularity | 232725.0 | 41.13 | 18.19 | 0.0 | 29.0 | 43.0 | 55.0 | 100.0 |
| acousticness | 232725.0 | 0.37 | 0.35 | 0.0 | 0.04 | 0.23 | 0.72 | 1.0 |
| danceability | 232725.0 | 0.55 | 0.19 | 0.06 | 0.44 | 0.57 | 0.69 | 0.99 |
| duration_ms | 232725.0 | 235122.34 | 118935.91 | 15387.0 | 182857.0 | 220427.0 | 265768.0 | 5552917.0 |
| energy | 232725.0 | 0.57 | 0.26 | 0.0 | 0.38 | 0.6 | 0.79 | 1.0 |
| instrumentalness | 232725.0 | 0.15 | 0.3 | 0.0 | 0.0 | 0.0 | 0.04 | 1.0 |
| liveness | 232725.0 | 0.22 | 0.2 | 0.01 | 0.1 | 0.13 | 0.26 | 1.0 |
| loudness | 232725.0 | -9.57 | 6.0 | -52.46 | -11.77 | -7.76 | -5.5 | 3.74 |
| speechiness | 232725.0 | 0.12 | 0.19 | 0.02 | 0.04 | 0.05 | 0.1 | 0.97 |
| tempo | 232725.0 | 117.67 | 30.9 | 30.38 | 92.96 | 115.78 | 139.05 | 242.9 |


**Popularity**: Track popularity ranges from 0 to 100, with a mean of 41.13 and a median of 43. The middle half of the tracks falls between 29 and 55, so most tracks sit in the low-to-middle part of the scale.

**Acousticness**: Acousticness ranges from 0 to 1, with a mean of 0.37 and a median of 0.23. The mean sits well above the median, which points to a right-skewed distribution: most tracks have low acousticness, while a quarter score 0.72 or higher.

**Danceability**: Danceability ranges from 0.06 to 0.99, with a mean of 0.55 and a median of 0.57. The two are close, and the middle half of the tracks lies between 0.44 and 0.69.

**Duration**: Track duration ranges from 15,387 ms (about 15 seconds) to 5,552,917 ms (about 92.5 minutes), with a mean of 235,122.34 ms (about 3.9 minutes) and a median of 220,427 ms (about 3.7 minutes). The maximum is far above the 75th percentile of 265,768 ms (about 4.4 minutes), so the upper tail contains some very long tracks.

**Energy**: Energy ranges from 0 to 1, with a mean of 0.57 and a median of 0.6. The middle half of the tracks lies between 0.38 and 0.79.

**Instrumentalness**: Instrumentalness ranges from 0 to 1, with a mean of 0.15, a median of 0.0, and a 75th percentile of 0.04. At least three quarters of the tracks score close to zero, so the mean is pulled up by a minority of strongly instrumental tracks.

**Liveness**: Liveness ranges from 0.01 to 1, with a mean of 0.22 and a median of 0.13. Three quarters of the tracks score 0.26 or lower, which suggests that most tracks were not recorded in front of an audience.

**Loudness**: Loudness ranges from -52.46 dB to 3.74 dB, with a mean of -9.57 dB and a median of -7.76 dB. The middle half of the tracks lies between -11.77 dB and -5.5 dB, so the very quiet tracks near the minimum are outliers.

**Speechiness**: Speechiness ranges from 0.02 to 0.97, with a mean of 0.12 and a median of 0.05. Three quarters of the tracks score 0.1 or lower, so tracks dominated by spoken words are a minority.

**Tempo**: Tempo ranges from 30.38 BPM to 242.90 BPM, with a mean of 117.67 BPM and a median of 115.78 BPM. The middle half of the tracks lies between 92.96 BPM and 139.05 BPM.

**Valence**: Valence ranges from 0 to 1, with a mean of 0.45, so the tracks span the full scale from negative to positive mood.
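
One way to go beyond the range and the mean is to compare each feature's mean with its median and to look at its skewness. The following is a minimal sketch on a hypothetical frame named `toy`; the column names match the dataset, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample; the notebook itself would use df from spotifyfeatures.csv
toy = pd.DataFrame({
    "instrumentalness": [0.0, 0.0, 0.0, 0.01, 0.9],
    "danceability": [0.40, 0.50, 0.55, 0.60, 0.70],
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})
# A mean well above the median (positive skew) signals a long right tail
summary["mean_minus_median"] = summary["mean"] - summary["median"]

print(summary.round(2))
```

A gap near zero, as for the toy `danceability` column, indicates a roughly symmetric distribution.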

## Data Preparation

- Dropping irrelevant columns and rows
- Identifying missing values
- Identifying duplicate rows
- Renaming columns
- Feature Creation

In [11]:
# the number of missing values

df.isna().sum().sort_values(ascending=False)

genre               0
artist_name         0
time_signature      0
tempo               0
speechiness         0
mode                0
loudness            0
liveness            0
key                 0
instrumentalness    0
energy              0
duration_ms         0
danceability        0
acousticness        0
popularity          0
track_id            0
track_name          0
valence             0
dtype: int64

There are no missing values.

In [12]:
# the number of duplicate rows of data

df.duplicated().sum()

0

There are no duplicate rows.
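
`df.duplicated()` compares entire rows, so it would not flag a track that is listed under more than one genre. Whether that happens in this dataset is not checked here; the sketch below only shows how such a check could look, using hypothetical rows in a frame named `toy`.

```python
import pandas as pd

# Hypothetical rows: the same track_id listed under two genres
toy = pd.DataFrame({
    "genre": ["Pop", "Dance", "Soul"],
    "track_id": ["a1", "a1", "b2"],
    "popularity": [82, 82, 39],
})

# Full-row comparison: the genre differs, so nothing is flagged
full_row_dupes = toy.duplicated().sum()
# keep=False marks every member of a repeated group, not just the later ones
repeated_ids = toy.duplicated(subset="track_id", keep=False).sum()

print(full_row_dupes, repeated_ids)
```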

In [13]:
# To convert the duration from milliseconds to minutes

df['duration_minutes'] = df['duration_ms'] / 60000

In [14]:
df.head()

| | genre | artist_name | track_name | track_id | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence | duration_minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Movie | Henri Salvador | C'est beau de faire un Show | 0BRjO6ga9RKCKjfDqeFgWV | 0 | 0.611 | 0.389 | 99373 | 0.91 | 0.0 | C# | 0.346 | -1.828 | Major | 0.0525 | 166.969 | 4/4 | 0.814 | 1.656217 |
| 1 | Movie | Martin & les fées | Perdu d'avance (par Gad Elmaleh) | 0BjC1NfoEOOusryehmNudP | 1 | 0.246 | 0.59 | 137373 | 0.737 | 0.0 | F# | 0.151 | -5.559 | Minor | 0.0868 | 174.003 | 4/4 | 0.816 | 2.28955 |
| 2 | Movie | Joseph Williams | Don't Let Me Be Lonely Tonight | 0CoSDzoNIKCRs124s9uTVy | 3 | 0.952 | 0.663 | 170267 | 0.131 | 0.0 | C | 0.103 | -13.879 | Minor | 0.0362 | 99.488 | 5/4 | 0.368 | 2.837783 |
| 3 | Movie | Henri Salvador | Dis-moi Monsieur Gordon Cooper | 0Gc6TVm52BwZD07Ki6tIvf | 0 | 0.703 | 0.24 | 152427 | 0.326 | 0.0 | C# | 0.0985 | -12.178 | Major | 0.0395 | 171.758 | 4/4 | 0.227 | 2.54045 |
| 4 | Movie | Fabien Nataf | Ouverture | 0IuslXpMROHdEPvSl1fTQK | 4 | 0.95 | 0.331 | 82625 | 0.225 | 0.123 | F | 0.202 | -21.15 | Major | 0.0456 | 140.576 | 4/4 | 0.39 | 1.377083 |


## Exploratory Data Analysis

**Genre Analysis**:

* Plot a bar chart showing the distribution of tracks across different genres.
* Create a horizontal bar chart to visualize the top genres based on the number of tracks.
* Generate a box plot to compare the popularity distribution across different genres.
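
The genre plots listed above could be sketched as follows. This is an illustration under assumptions, not the notebook's own code: `toy` is a hypothetical six-row frame, and the figure layout is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical rows standing in for df
toy = pd.DataFrame({
    "genre": ["Movie", "Movie", "Soul", "Soul", "Soul", "Dance"],
    "popularity": [0, 4, 39, 47, 35, 82],
})

genre_counts = toy["genre"].value_counts()  # sorted from most to fewest tracks

fig, (ax_count, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
# Horizontal bars: ascending sort puts the largest genre at the top
genre_counts.sort_values().plot.barh(ax=ax_count)
ax_count.set_xlabel("Number of tracks")
# Box plot: popularity distribution within each genre
sns.boxplot(data=toy, x="popularity", y="genre", ax=ax_box)
fig.tight_layout()
```

On the full dataset, `genre_counts.head(10)` would restrict the bar chart to the ten largest genres.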

**Artist Analysis**:

* Create a bar chart to visualize the top artists based on the number of tracks.
* Generate a histogram or violin plot to analyze the popularity distribution for tracks by different artists.
* Create scatter plots to explore relationships between artist attributes (e.g., energy or danceability) and track popularity.

**Track Analysis**:

* Plot a histogram or density plot to analyze the distribution of track durations.
* Create a scatter plot to visualize the relationship between track duration and popularity.
* Generate violin plots to compare the distribution of acousticness or instrumentalness across different track names.

**Popularity Analysis**:

* Create a histogram or density plot to visualize the distribution of track popularity.
* Generate a box plot to compare the popularity distribution across different genres or artists.
* Create a scatter plot to explore relationships between popularity and other attributes (e.g., energy or danceability).

**Attribute Analysis**:

* Plot histograms or density plots to analyze the distribution of different attributes (acousticness, danceability, energy, etc.).
* Create scatter plots or heatmaps to explore relationships or correlations between different attributes.
* Generate grouped bar charts to compare attribute distributions across different genres or artists.
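
Relationships between the numeric attributes are often summarized with a correlation heatmap. The following minimal sketch uses five rows from the `df.head()` preview in a frame named `toy`; the colour map and value limits are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import pandas as pd
import seaborn as sns

# Three numeric features from the first five rows, plus one text column
toy = pd.DataFrame({
    "genre": ["Movie"] * 5,
    "energy": [0.91, 0.737, 0.131, 0.326, 0.225],
    "loudness": [-1.828, -5.559, -13.879, -12.178, -21.15],
    "acousticness": [0.611, 0.246, 0.952, 0.703, 0.95],
})

# Keep only numeric columns before computing pairwise Pearson correlations
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
```

Five rows are far too few to draw conclusions; the point is only the shape of the computation.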

**Key and Mode Analysis**:

* Create bar charts to visualize the distribution of keys and modes in the dataset.
* Generate box plots or violin plots to compare attribute distributions across different keys or modes.
* Create scatter plots or heatmaps to explore relationships between key or mode and other attributes.
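
Since `key` and `mode` are both categorical, a cross-tabulation and a grouped count plot are natural first views. The sketch below uses the `key` and `mode` values from the `df.head()` and `df.tail()` previews in a frame named `toy`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import pandas as pd
import seaborn as sns

# key and mode from the ten rows shown by df.head() and df.tail()
toy = pd.DataFrame({
    "key": ["C#", "F#", "C", "C#", "F", "D", "E", "D", "A", "G#"],
    "mode": ["Major", "Minor", "Minor", "Major", "Major",
             "Major", "Minor", "Major", "Major", "Minor"],
})

# Rows are keys, columns are modes, cells are track counts
key_mode_counts = pd.crosstab(toy["key"], toy["mode"])
print(key_mode_counts)

# One bar per key, split by mode
ax = sns.countplot(data=toy, x="key", hue="mode")
```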

**Time Signature Analysis**:

* Plot a bar chart to visualize the distribution of time signatures in the dataset.
* Create box plots or violin plots to compare attribute distributions across different time signatures.
* Generate scatter plots or heatmaps to explore relationships between time signature and other attributes.
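
A sketch of the time signature analysis, assuming that shares and per-group medians are the quantities of interest. The frame `toy` holds the `time_signature` and `tempo` values from the `df.head()` preview.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import pandas as pd

# time_signature and tempo from the first five rows
toy = pd.DataFrame({
    "time_signature": ["4/4", "4/4", "5/4", "4/4", "4/4"],
    "tempo": [166.969, 174.003, 99.488, 171.758, 140.576],
})

# Share of tracks per time signature (fractions summing to 1)
shares = toy["time_signature"].value_counts(normalize=True)
# Median tempo within each time signature
median_tempo = toy.groupby("time_signature")["tempo"].median()

ax = shares.plot.bar()
ax.set_ylabel("Share of tracks")
print(median_tempo)
```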

**Valence Analysis**:

* Plot a histogram or density plot to analyze the distribution of valence across tracks.
* Create box plots or violin plots to compare the valence distribution across different genres or artists.
* Generate scatter plots to explore relationships between valence and other attributes (e.g., energy or tempo).
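
The valence comparison across genres could start from per-genre medians, as in this minimal sketch. The frame `toy` reuses the `genre` and `valence` values from the `df.head()` and `df.tail()` previews; ten rows are only enough to illustrate the mechanics.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import pandas as pd

# genre and valence from the ten rows shown by df.head() and df.tail()
toy = pd.DataFrame({
    "genre": ["Movie"] * 5 + ["Soul"] * 5,
    "valence": [0.814, 0.816, 0.368, 0.227, 0.39,
                0.962, 0.969, 0.813, 0.489, 0.479],
})

# Median valence per genre, highest first
median_valence = (
    toy.groupby("genre")["valence"].median().sort_values(ascending=False)
)

ax = median_valence.plot.bar()
ax.set_ylabel("Median valence")
print(median_valence)
```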

## Conclusions



**Author**: **Ergyun Hasan**  **2023**