Step - 1: Importing Libraries and Loading Dataset

In [5]:
# Importing all the necessary packages and libraries

import numpy as np # linear algebra
import pandas as pd # Data pre-processing
import seaborn as sns # Required for plotting
import matplotlib.pyplot as plt # Required for plotting

In [6]:
spotify = pd.read_csv("../dataset/songs_updated_v2.csv") # Using pandas lib to read table
spotify.head() # Checking data schema for validation purpose

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,...,pop,folk_acoustic,dance_electronic,r&b,classical,blues,latin,easy_listening,country,hasFeature
0,Britney Spears,Oops!...I Did It Again,211160,False,2000,77,0.751,0.834,1,-5.444,...,True,False,False,False,False,False,False,False,False,False
1,blink-182,All The Small Things,167066,False,1999,79,0.434,0.897,0,-4.918,...,True,False,False,False,False,False,False,False,False,False
2,Faith Hill,Breathe,250546,False,1999,66,0.529,0.496,7,-9.007,...,True,False,False,False,False,False,False,False,True,False
3,Bon Jovi,It's My Life,224493,False,2000,78,0.551,0.913,0,-4.063,...,False,False,False,False,False,False,False,False,False,False
4,*NSYNC,Bye Bye Bye,200560,False,2000,65,0.614,0.928,8,-4.806,...,True,False,False,False,False,False,False,False,False,False


Step - 2: Identifying Missing values

In [7]:
spotify.info() # Checking presence of any null values in columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 33 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   artist             2000 non-null   object 
 1   song               2000 non-null   object 
 2   duration_ms        2000 non-null   int64  
 3   explicit           2000 non-null   bool   
 4   year               2000 non-null   int64  
 5   popularity         2000 non-null   int64  
 6   danceability       2000 non-null   float64
 7   energy             2000 non-null   float64
 8   key                2000 non-null   int64  
 9   loudness           2000 non-null   float64
 10  mode               2000 non-null   int64  
 11  speechiness        2000 non-null   float64
 12  acousticness       2000 non-null   float64
 13  instrumentalness   2000 non-null   float64
 14  liveness           2000 non-null   float64
 15  valence            2000 non-null   float64
 16  tempo              2000 
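`info()` lists a non-null count per column, which has to be compared against the row count by eye. A more direct check, sketched here on a small made-up frame (the `toy` data and its values are illustrative assumptions, not rows from the dataset), is to count missing values per column with `isna().sum()`:

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "artist": ["A", "B", None],
    "popularity": [77, None, 65],
    "tempo": [95.0, 148.7, 117.1],
})

# Number of missing values in each column; 0 means the column has no nulls
null_counts = toy.isna().sum()
print(null_counts)

# Single yes/no answer for the whole frame
has_nulls = bool(toy.isna().any().any())
print("Any nulls:", has_nulls)
```

Applied to `spotify`, the same two calls would be expected to report zero for every column if the `info()` output holds for all 33 columns.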

Check for any duplicates

In [8]:
dp_rows = spotify[spotify.duplicated()]
print("Number of duplicated rows: ",len(dp_rows))

Number of duplicated rows:  59
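It helps to be clear about what `duplicated()` counts. With the default `keep="first"`, the first occurrence of a row is not flagged, so the count is the number of extra copies that `drop_duplicates()` would remove. A small sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Rows 0, 1 and 3 are identical; row 2 is unique
toy = pd.DataFrame({
    "artist": ["A", "A", "B", "A"],
    "song": ["x", "x", "y", "x"],
})

# Default keep="first": only the later copies are flagged
n_extra = int(toy.duplicated().sum())

# keep=False flags every row that has at least one identical twin
n_involved = int(toy.duplicated(keep=False).sum())

print("Extra copies:", n_extra)
print("Rows involved in duplication:", n_involved)
```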


There are no nulls, but there are 59 duplicated rows that need to be removed.

Step - 3: Removing Duplicates

In [9]:
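# Copy the spotify data frame so the original is not modified.
# Plain assignment (spotify_clone1 = spotify) would only create a second name for the same object.
spotify_clone1 = spotify.copy()

# Remove duplicate rows so that repeated songs do not bias the analysis
spotify_unique = spotify_clone1.drop_duplicates()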
spotify_unique

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,...,pop,folk_acoustic,dance_electronic,r&b,classical,blues,latin,easy_listening,country,hasFeature
0,Britney Spears,Oops!...I Did It Again,211160,False,2000,77,0.751,0.834,1,-5.444,...,True,False,False,False,False,False,False,False,False,False
1,blink-182,All The Small Things,167066,False,1999,79,0.434,0.897,0,-4.918,...,True,False,False,False,False,False,False,False,False,False
2,Faith Hill,Breathe,250546,False,1999,66,0.529,0.496,7,-9.007,...,True,False,False,False,False,False,False,False,True,False
3,Bon Jovi,It's My Life,224493,False,2000,78,0.551,0.913,0,-4.063,...,False,False,False,False,False,False,False,False,False,False
4,*NSYNC,Bye Bye Bye,200560,False,2000,65,0.614,0.928,8,-4.806,...,True,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,Jonas Brothers,Sucker,181026,False,2019,79,0.842,0.734,1,-5.065,...,True,False,False,False,False,False,False,False,False,False
1996,Taylor Swift,Cruel Summer,178426,False,2019,78,0.552,0.702,9,-5.707,...,True,False,False,False,False,False,False,False,False,False
1997,Blanco Brown,The Git Up,200593,False,2019,69,0.847,0.678,9,-8.635,...,False,False,False,False,False,False,False,False,True,False
1998,Sam Smith,Dancing With A Stranger (with Normani),171029,False,2019,75,0.741,0.520,8,-7.513,...,True,False,False,False,False,False,False,False,False,True


After removal, 1941 rows remain, which confirms that 59 rows were duplicates (1941 + 59 = 2000).
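Two pandas behaviours matter in this step: plain assignment binds a second name to the same DataFrame, whereas `.copy()` creates an independent one, and `drop_duplicates()` returns a new frame rather than modifying its input. A minimal sketch on made-up data (all names and values here are illustrative assumptions):

```python
import pandas as pd

original = pd.DataFrame({"song": ["x", "x", "y"], "popularity": [70, 70, 80]})

alias = original         # second name for the same object, nothing is copied
clone = original.copy()  # independent copy of the data

# drop_duplicates returns a new frame; `clone` keeps all of its rows
unique = clone.drop_duplicates()
n_removed = len(clone) - len(unique)
print("Rows removed:", n_removed)

# Editing the clone leaves the original untouched
clone.loc[0, "popularity"] = 0
print(alias is original, clone is original)
print(original.loc[0, "popularity"])
```

The same subtraction, `len(spotify) - len(spotify_unique)`, is a quick way to check the row arithmetic quoted above.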

1. How many songs were released in each year?

In [10]:
year_distr = spotify_clone1.groupby("year").size().sort_values(ascending=False).reset_index(name='count') 
year_distr

Unnamed: 0,year,count
0,2012,115
1,2017,111
2,2001,108
3,2018,107
4,2010,107
5,2014,104
6,2005,104
7,2016,99
8,2015,99
9,2011,99
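The counts above are computed on `spotify_clone1`, which still contains the duplicated rows. A minimal sketch of the same aggregation on a deduplicated frame, using made-up data (the `toy` frame is an assumption; no real counts are implied):

```python
import pandas as pd

# Songs "a" appear twice with identical values, so one copy is a duplicate
toy = pd.DataFrame({
    "song": ["a", "a", "b", "c", "d"],
    "year": [2012, 2012, 2012, 2012, 2017],
})

# Counts on the raw frame include the duplicated row
raw_counts = toy.groupby("year").size()

# Drop duplicates first, then count songs per year, largest first
year_counts = (
    toy.drop_duplicates()
       .groupby("year")
       .size()
       .sort_values(ascending=False)
       .reset_index(name="count")
)
print(year_counts)
```

On the real data this would mean grouping `spotify_unique` instead of `spotify_clone1`; the resulting numbers may differ from the table shown above.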


2. How many songs belong to each genre?

In [15]:
genre_distr = spotify_clone1.groupby("genre").size().sort_values(ascending=False).reset_index(name='count') 
genre_distr

Unnamed: 0,genre,count
0,pop,428
1,"hip hop, pop",277
2,"hip hop, pop, R&B",244
3,"pop, Dance/Electronic",221
4,"pop, R&B",178
5,hip hop,124
6,"hip hop, pop, Dance/Electronic",78
7,rock,58
8,"rock, pop",43
9,Dance/Electronic,41
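Grouping on `genre` counts each distinct combination string (for example "hip hop, pop") as its own category. If a count per individual genre is wanted instead, one possible approach is to split the string and explode it into one row per label. This is a sketch on made-up data, assuming the labels are always separated by a comma and a space:

```python
import pandas as pd

# Toy genre column using the same comma-separated style as the dataset
toy = pd.DataFrame({"genre": ["pop", "hip hop, pop", "pop, R&B", "rock"]})

# Split each string into a list of labels, then emit one row per label
per_genre = (
    toy["genre"]
    .str.split(", ")
    .explode()
    .value_counts()
)
print(per_genre)
```

With this view a song tagged "hip hop, pop" contributes to both "hip hop" and "pop", so the counts sum to more than the number of songs.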


The genre column contains the value set() in 22 records. This looks like an invalid entry (an empty genre set written out as text) and would distort the genre distribution, so these records should be removed.

In [16]:
spotify_unique = spotify_unique.query("genre != 'set()'")
spotify_unique

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,...,pop,folk_acoustic,dance_electronic,r&b,classical,blues,latin,easy_listening,country,hasFeature
0,Britney Spears,Oops!...I Did It Again,211160,False,2000,77,0.751,0.834,1,-5.444,...,True,False,False,False,False,False,False,False,False,False
1,blink-182,All The Small Things,167066,False,1999,79,0.434,0.897,0,-4.918,...,True,False,False,False,False,False,False,False,False,False
2,Faith Hill,Breathe,250546,False,1999,66,0.529,0.496,7,-9.007,...,True,False,False,False,False,False,False,False,True,False
3,Bon Jovi,It's My Life,224493,False,2000,78,0.551,0.913,0,-4.063,...,False,False,False,False,False,False,False,False,False,False
4,*NSYNC,Bye Bye Bye,200560,False,2000,65,0.614,0.928,8,-4.806,...,True,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,Jonas Brothers,Sucker,181026,False,2019,79,0.842,0.734,1,-5.065,...,True,False,False,False,False,False,False,False,False,False
1996,Taylor Swift,Cruel Summer,178426,False,2019,78,0.552,0.702,9,-5.707,...,True,False,False,False,False,False,False,False,False,False
1997,Blanco Brown,The Git Up,200593,False,2019,69,0.847,0.678,9,-8.635,...,False,False,False,False,False,False,False,False,True,False
1998,Sam Smith,Dancing With A Stranger (with Normani),171029,False,2019,75,0.741,0.520,8,-7.513,...,True,False,False,False,False,False,False,False,False,True
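Before dropping rows it can be useful to count how many match the placeholder and to confirm the filter removed exactly that many. A minimal sketch on made-up data (the `toy` frame is an assumption), showing that a boolean mask and `query` give the same result:

```python
import pandas as pd

toy = pd.DataFrame({
    "song": ["a", "b", "c", "d"],
    "genre": ["pop", "set()", "rock", "set()"],
})

# How many rows carry the placeholder value?
n_invalid = int((toy["genre"] == "set()").sum())

# Two equivalent ways to drop them
cleaned_mask = toy[toy["genre"] != "set()"].reset_index(drop=True)
cleaned_query = toy.query("genre != 'set()'").reset_index(drop=True)

print("Invalid rows:", n_invalid)
print("Rows kept:", len(cleaned_mask))
```

`reset_index(drop=True)` is optional; it only renumbers the rows so the index has no gaps after filtering.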


In [17]:
# Write updated data frame to CSV
spotify_unique.to_csv('../dataset/songs_updated_v3.csv', index=False)
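The `index=False` argument keeps the row index out of the file. Without it, an unnamed index is written as an extra first column, which `read_csv` then loads as "Unnamed: 0". A small round-trip sketch using an in-memory buffer instead of a file (the `toy` frame is an illustrative assumption):

```python
import io
import pandas as pd

# Index with gaps, as left behind after rows have been filtered out
toy = pd.DataFrame({"artist": ["A", "B"], "popularity": [77, 79]}, index=[0, 5])

# Write without the index, then read back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Write with the index for comparison
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

print(list(restored.columns))
print(list(with_index.columns))
```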

Data pre-processing is complete.