# Analysis of Spotify Data - Data preparation

Data sources: 
- personal data: https://skiley.net/ & http://sortyourmusic.playlistmachinery.com/index.html & data requested from the Spotify website 
- Kaggle: https://www.kaggle.com/rodolfofigueroa/spotify-12m-songs & https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks

In [1]:
import pandas as pd
import numpy as np

**Data exploration**

a) Data requested from Spotify (15.02.2021)

In [2]:
import json

File 1: Playlist1

In [None]:
file1 = "C:/Users/Agnieszka/Downloads/Datasets/Music/my_spotify_data/MyData/Playlist1.json"

Spotify_Playlists = json.load(open(file1, "r", encoding="utf8"))
Spotify_Playlists

# File includes information about each playlist - artist, track and album names

In [None]:
# Get playlist names

playlist_names = []
for index in range(len(Spotify_Playlists['playlists'])):
    playlist_names.append(Spotify_Playlists['playlists'][index]['name'])
    
playlist_names

In [5]:
# Number of personal playlists

len(Spotify_Playlists['playlists'])

61

In [6]:
# Number of songs in each playlist

songs=[]
names=[]

for item in Spotify_Playlists["playlists"]:
    songs.append(len(item["items"]))
    names.append(item["name"])

songs_df = pd.DataFrame([songs]).transpose()
songs_df

names_df = pd.DataFrame([names]).transpose()
names_df

playlists_songs = names_df.join(songs_df, how="left", rsuffix = "new")
playlists_songs.head()

playlists_songs.rename(columns={"0": "PlaylistName", "0new": "Songs"}, inplace=True)
playlists_songs.sort_values(by=["Songs"], ascending=False, inplace=True)

In [7]:
playlists_songs.head()

Unnamed: 0,PlaylistName,Songs
21,House (all),362
17,Afro House,319
55,Polskie,304
10,Hip Hop,210
53,Oldies,183
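
The playlist/song-count table above can also be built in one step from a dictionary of lists, which avoids the transpose, join and rename. The following is only a sketch: `sample_playlists` is a made-up miniature of the `Playlist1.json` structure, and `playlists_songs_alt` is a hypothetical name.

```python
import pandas as pd

# Hypothetical miniature of the Playlist1.json structure (names and items are made up).
sample_playlists = {
    "playlists": [
        {"name": "A", "items": [{}, {}, {}]},
        {"name": "B", "items": [{}]},
        {"name": "C", "items": [{}, {}]},
    ]
}

# One column per list; each list has one entry per playlist.
playlists_songs_alt = pd.DataFrame(
    {
        "PlaylistName": [p["name"] for p in sample_playlists["playlists"]],
        "Songs": [len(p["items"]) for p in sample_playlists["playlists"]],
    }
).sort_values("Songs", ascending=False)

print(playlists_songs_alt)
```

Because the column names are set at construction time, no later `rename` is needed.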


In [8]:
# Export to csv file for further analysis

playlists_songs.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/songs_in_playlists.csv")

File 2: Inferences

In [None]:
file2 = "C:/Users/Agnieszka/Downloads/Datasets/Music/my_spotify_data/MyData/Inferences.json"

Spotify_Inferences = json.load(open(file2, "r", encoding="utf8"))
Spotify_Inferences

# Inferences -> from https://support.spotify.com/uk/article/understanding-my-data/ :
# "We draw certain inferences about your interests and preferences based on your usage of the Spotify service and 
# using data obtained from our advertisers and other advertising partners. 
# This includes a list of market segments with which you are currently associated.  
# Depending on your settings, this data may be used to serve interest-based advertising to you within the 
# Spotify service."

# -> not useful for this analysis

File 3: Userdata

In [None]:
file3 = "C:/Users/Agnieszka/Downloads/Datasets/Music/my_spotify_data/MyData/Userdata.json"

Spotify_Userdata = json.load(open(file3, "r", encoding="utf8"))
Spotify_Userdata

# CreationTime: 2017-03-08

Files 4-4B: Streaming History

In [None]:
file4 = "C:/Users/Agnieszka/Downloads/Datasets/Music/my_spotify_data/MyData/StreamingHistory0.json"

Spotify_StreamingHistory0 = json.load(open(file4, "r", encoding="utf8"))
Spotify_StreamingHistory0

# Other files: StreamingHistory1 / StreamingHistory2

# Information about the streams ->
# "endTime": date and time when the stream ended
# "msPlayed": number of milliseconds the track was played

In [12]:
range(len(Spotify_StreamingHistory0))

range(0, 10000)

Saving StreamTime and ArtistName to a csv file:

In [13]:
# Save StreamTime as a list

StreamTime0 = []
for index in range(len(Spotify_StreamingHistory0)):
    StreamTime0.append(Spotify_StreamingHistory0[index]['endTime'])
    
len(StreamTime0)

10000

In [14]:
# Save ArtistName as a list

ArtistName0 = []
for index in range(len(Spotify_StreamingHistory0)):
    ArtistName0.append(Spotify_StreamingHistory0[index]['artistName'])
    
ArtistName0
len(ArtistName0)

10000

In [None]:
file4A = "C:/Users/Agnieszka/Downloads/Datasets/Music/my_spotify_data/MyData/StreamingHistory1.json"

Spotify_StreamingHistory1 = json.load(open(file4A, "r", encoding="utf8"))
Spotify_StreamingHistory1

In [None]:
StreamTime1 = []
for index in range(len(Spotify_StreamingHistory1)):
    StreamTime1.append(Spotify_StreamingHistory1[index]['endTime'])
    
StreamTime1

In [None]:
ArtistName1 = []
for index in range(len(Spotify_StreamingHistory1)):
    ArtistName1.append(Spotify_StreamingHistory1[index]['artistName'])
    
ArtistName1

In [None]:
file4B = "C:/Users/Agnieszka/Downloads/Datasets/Music/my_spotify_data/MyData/StreamingHistory2.json"

Spotify_StreamingHistory2 = json.load(open(file4B, "r", encoding="utf8"))
Spotify_StreamingHistory2

In [None]:
StreamTime2 = []
for index in range(len(Spotify_StreamingHistory2)):
    StreamTime2.append(Spotify_StreamingHistory2[index]['endTime'])
    
StreamTime2

In [None]:
ArtistName2 = []
for index in range(len(Spotify_StreamingHistory2)):
    ArtistName2.append(Spotify_StreamingHistory2[index]['artistName'])
    
ArtistName2

In [21]:
# Connect all lists (StreamTime)

#len(StreamTime0) # 10000
#len(StreamTime1) # 10000
#len(StreamTime2) # 7988

StreamTime = StreamTime0 + StreamTime1 + StreamTime2
len(StreamTime)

27988

In [22]:
# Connect all lists (ArtistName)

# StreamTime0.extend(StreamTime1)

#len(ArtistName0) # 10000
#len(ArtistName1) # 10000
#len(ArtistName2) # 7988

ArtistNames = ArtistName0 + ArtistName1 + ArtistName2
len(ArtistNames)

27988

In [23]:
# Convert into DataFrame (StreamTime)

StreamTime_df = pd.DataFrame(StreamTime)
StreamTime_df 

Unnamed: 0,0
0,2020-02-02 23:25
1,2020-02-13 17:33
2,2020-02-14 11:34
3,2020-02-14 12:40
4,2020-02-15 11:29
...,...
27983,2021-02-14 14:45
27984,2021-02-14 14:54
27985,2021-02-14 15:04
27986,2021-02-14 15:12


In [24]:
# Convert into DataFrame (ArtistNames)

ArtistNames_df = pd.DataFrame(ArtistNames)
ArtistNames_df

Unnamed: 0,0
0,Klevakeys
1,Shakarchi & Stranéus
2,Dennis Ferrer
3,Renato Xtrova
4,Demuir
...,...
27983,DJEFF
27984,Andreas Horvat
27985,Isolée
27986,Filippos


In [25]:
# Merge both DataFrames

ArtistNames_df.reset_index(inplace=True)
StreamTime_df.reset_index(inplace=True)
StreamTime_ArtistNames = ArtistNames_df.merge(StreamTime_df, on="index")
StreamTime_ArtistNames.head()

Unnamed: 0,index,0_x,0_y
0,0,Klevakeys,2020-02-02 23:25
1,1,Shakarchi & Stranéus,2020-02-13 17:33
2,2,Dennis Ferrer,2020-02-14 11:34
3,3,Renato Xtrova,2020-02-14 12:40
4,4,Demuir,2020-02-15 11:29


In [26]:
# Rename columns

StreamTime_ArtistNames.rename(columns={"0_x":"ArtistNames", "0_y":"StreamTime"}, inplace=True)
StreamTime_ArtistNames.head()

Unnamed: 0,index,ArtistNames,StreamTime
0,0,Klevakeys,2020-02-02 23:25
1,1,Shakarchi & Stranéus,2020-02-13 17:33
2,2,Dennis Ferrer,2020-02-14 11:34
3,3,Renato Xtrova,2020-02-14 12:40
4,4,Demuir,2020-02-15 11:29


In [27]:
StreamTime_ArtistNames.drop(columns=["index"], inplace=True)
StreamTime_ArtistNames.head()

Unnamed: 0,ArtistNames,StreamTime
0,Klevakeys,2020-02-02 23:25
1,Shakarchi & Stranéus,2020-02-13 17:33
2,Dennis Ferrer,2020-02-14 11:34
3,Renato Xtrova,2020-02-14 12:40
4,Demuir,2020-02-15 11:29


In [28]:
# Save to csv-file

StreamTime_ArtistNames.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/StreamTime_ArtistNames.csv")
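
The list-building, merging and renaming steps above could probably be condensed, because each streaming-history file is a list of dictionaries that `pd.DataFrame` accepts directly. This is a minimal sketch with a few hand-written records: only the keys `endTime`, `artistName` and `msPlayed` come from the notebook, the `msPlayed` values are invented, and `history_part0`, `history_part1` and `stream_times` are hypothetical names.

```python
import pandas as pd

# Hand-written records mimicking the StreamingHistory*.json entries.
# The msPlayed values are invented for illustration.
history_part0 = [
    {"endTime": "2020-02-02 23:25", "artistName": "Klevakeys", "msPlayed": 1000},
    {"endTime": "2020-02-14 11:34", "artistName": "Dennis Ferrer", "msPlayed": 2000},
]
history_part1 = [
    {"endTime": "2021-02-14 14:45", "artistName": "DJEFF", "msPlayed": 3000},
]

# Concatenate the parsed lists and build a single DataFrame from them.
streams = pd.DataFrame(history_part0 + history_part1)

# Keep the two columns of interest and give them the names used in this notebook.
stream_times = streams[["artistName", "endTime"]].rename(
    columns={"artistName": "ArtistNames", "endTime": "StreamTime"}
)

# Parse the timestamps so that later time-based analysis does not work on strings.
stream_times["StreamTime"] = pd.to_datetime(stream_times["StreamTime"])
print(stream_times)
```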

File 5: SearchQueries

In [None]:
file5 = "C:/Users/Agnieszka/Downloads/Datasets/Music/my_spotify_data/MyData/SearchQueries.json"

Spotify_SearchQueries = json.load(open(file5, "r", encoding="utf8"))
Spotify_SearchQueries

# A list of searches made: search time / device or platform used / search text / list of Uniform Resource Identifiers (URIs) of the search results the user interacted with

b) Data from skiley.net

In [30]:
# Load one file to see the available features

test = pd.read_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/Silkey_MySpotify/AfroHouse.csv")
test.head()

# Important: 'trackName' / 'artistName' / 'secondaryArtistsNames' /'trackDuration' / 'addedAt' (??) /
# "isLocal" - "stream" vs. "local" / 'trackPopularity' / 'artistPopularity' / 'artistFollowers' / 'artistGenres' /
# 'albumRecordLabel' / 'albumReleaseDate'

# not important: 'albumName'/ "trackNumber" / "addedBy" / 'albumArtistsNames' /'trackUrl'/ 
# 'artistUrl' / 'albumUrl'/ 'trackIsrc' (International Standard Recording Code)/ 'albumUpc' (Universal Product Code) / 
# 'albumType' / 'albumPopularity'

# artistPopularity: The popularity of the artist. The value will be between 0 and 100, 
# with 100 being the most popular. The artist’s popularity is calculated from the popularity of all the artist’s tracks.
# https://developer.spotify.com/documentation/web-api/reference/


Unnamed: 0,trackName,artistName,albumName,trackDuration,trackNumber,addedBy,addedAt,isLocal,trackPopularity,artistPopularity,...,artistGenres,albumType,albumRecordLabel,albumReleaseDate,albumArtistsNames,trackUrl,artistUrl,albumUrl,trackIsrc,albumUpc
0,Fadjamou - TWOTEK Remix,Oumou Sangaré,Mogoya Remixed,04:54,6,leposia,2020-02-21T15:48:03Z,stream,33,55.0,...,"Afropop, Malian blues, Mande pop, Wassoulou, W...",album,NO FORMAT,2018-03-02,Oumou Sangaré,https://open.spotify.com/track/7o4FDJSXetDLaZQ...,https://open.spotify.com/artist/65CKKZilbcSKkA...,https://open.spotify.com/album/3tPkkxfJ4PMlVWH...,FR2W61700101,886447000000.0
1,Sondela feat. Xolisa,David Mayer,Sondela EP,09:16,1,leposia,2020-05-08T20:45:25Z,stream,34,31.0,...,,single,Connected,2018-05-11,David Mayer & Floyd Lavine,https://open.spotify.com/track/3JSbVbuKCAEKPQG...,https://open.spotify.com/artist/7Mf7RKErlQ5maE...,https://open.spotify.com/album/62Stp3OydYGdunK...,DEU671702764,880319900000.0
2,Shango,Waajeed,Shango EP,05:38,2,leposia,2020-05-08T20:46:03Z,stream,24,29.0,...,"Detroit house, Experimental house",single,Dirt Tech Reck,2017-07-07,Waajeed,https://open.spotify.com/track/3nac7Q00wdPbA7E...,https://open.spotify.com/artist/4dXRZcf1AdzvwZ...,https://open.spotify.com/album/7heAeiemIsKYZtc...,ushm21722078,191061900000.0
3,Sittin' Here - Boddhi Satva Ancestral Soul Remix,St Germain,Sittin' Here (Remixes),06:27,2,leposia,2020-05-08T20:46:16Z,stream,20,59.0,...,"Downtempo, Electronica, Jazztronica, Nu jazz",single,Parlophone (France),2016-03-11,St Germain,https://open.spotify.com/track/4LbkNYxou1Z3uKi...,https://open.spotify.com/artist/484sZUYmnRXN84...,https://open.spotify.com/album/4XgRSFKNjnzrKvU...,FR92S1500040,190296000000.0
4,"99,9 - Alix Alvarez Sole Channel Remix",Minimoogli,"99,9 Remixes",06:22,1,leposia,2020-05-08T20:49:27Z,stream,0,5.0,...,,single,Tubetracks,2008-05-24,Minimoogli,https://open.spotify.com/track/5SL7Ht67ZGlRPFg...,https://open.spotify.com/artist/5crLJQJosOleex...,https://open.spotify.com/album/76QL5IslNgG8Cn2...,FR6V80803305,3661585000000.0


Dataset with all playlists:

In [31]:
# Dataset with all playlists

import glob
df_Silkey = pd.concat([pd.read_csv(f) for f in glob.glob("C:/Users/Agnieszka/Downloads/Datasets/Music/Silkey_MySpotify/" + "*.csv")], ignore_index=True)
len(df_Silkey)

4007
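
`glob.glob` does not guarantee any particular file order, and the concatenated frame no longer records which playlist a row came from. A possible variant, shown here on two tiny temporary files, sorts the paths and adds the file name as a column. The file `Afrobeats.csv`, the column `playlist` and the names `tmp_dir`, `frames` and `df_all` are assumptions made for this sketch.

```python
import glob
import os
import tempfile

import pandas as pd

# Build two tiny CSV files in a temporary folder (stand-ins for the real exports).
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"trackName": ["Shango"]}).to_csv(
    os.path.join(tmp_dir, "AfroHouse.csv"), index=False
)
pd.DataFrame({"trackName": ["Pana", "Duro"]}).to_csv(
    os.path.join(tmp_dir, "Afrobeats.csv"), index=False
)

frames = []
# sorted() makes the file order reproducible across runs and machines.
for path in sorted(glob.glob(os.path.join(tmp_dir, "*.csv"))):
    frame = pd.read_csv(path)
    # Hypothetical extra column: remember which playlist file each row came from.
    frame["playlist"] = os.path.splitext(os.path.basename(path))[0]
    frames.append(frame)

df_all = pd.concat(frames, ignore_index=True)
print(df_all)
```

A reproducible order matters later in this notebook, because the two sources are merged on their row position.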

c) Data from sortyourmusic.playlistmachinery.com

In [32]:
test2 = pd.read_excel("C:/Users/Agnieszka/Downloads/Datasets/Music/SortYourMusic_20210305.xlsx", sheet_name="AfroHouse")
test2.head()


# Duplicate information (already in the skiley.net data): Title / Artist / Release / Length / Pop.
# not important: RND

Unnamed: 0,#,Title,Artist,Release,BPM,Energy,Dance,Loud,Valence,Length,Acoustic,Pop.,RND
0,1,Fadjamou - TWOTEK Remix,Oumou Sangaré,2018-03-02 00:00:00,129.0,59.0,84.0,-9.0,46.0,04:54:00,5.0,33.0,4111.0
1,2,Sondela feat. Xolisa,David Mayer,2018-05-11 00:00:00,122.0,68.0,77.0,-8.0,14.0,09:17:00,1.0,34.0,7037.0
2,3,Shango,Waajeed,2017-07-07 00:00:00,127.0,62.0,85.0,-11.0,60.0,05:38:00,6.0,24.0,5515.0
3,4,Sittin' Here - Boddhi Satva Ancestral Soul Remix,St Germain,2016-03-11 00:00:00,122.0,72.0,79.0,-8.0,52.0,06:28:00,13.0,20.0,3277.0
4,5,"99,9 - Alix Alvarez Sole Channel Remix",Minimoogli,2008-05-24 00:00:00,126.0,86.0,80.0,-9.0,66.0,06:22:00,0.0,0.0,281.0


In [33]:
# Check files in the directory

#import os
#files = os.listdir("C:/Users/Agnieszka/Downloads/Datasets/Music/SortYourMusic/")
#files

Dataset with all playlists:

In [34]:
# Dataset with all playlists

df_SortYourMusic = pd.concat([pd.read_excel(f) for f in glob.glob("C:/Users/Agnieszka/Downloads/Datasets/Music/SortYourMusic/" + "*.xlsx")], ignore_index=True)
len(df_SortYourMusic)

4007

**Datasets for further analysis and ML models**

a) Dataset with data from all playlists 

In [35]:
df_Silkey.reset_index(inplace=True)
df_Silkey.head()

Unnamed: 0,index,trackName,artistName,albumName,trackDuration,trackNumber,addedBy,addedAt,isLocal,trackPopularity,...,artistGenres,albumType,albumRecordLabel,albumReleaseDate,albumArtistsNames,trackUrl,artistUrl,albumUrl,trackIsrc,albumUpc
0,0,"My Woman, My Everything (feat. Wandecoal)",Patoranking,"My Woman, My Everything (feat. Wandecoal) - Si...",03:53,1,leposia,2018-08-19T20:44:37Z,stream,54,...,"Afro dancehall, Afropop, Azontobeats, Gqom, Ni...",single,Greensleeves Records,2015-07-06,Patoranking,https://open.spotify.com/track/5MAScTGPHzWAeiv...,https://open.spotify.com/artist/2hKQc001G7ggs3...,https://open.spotify.com/album/0cUhwhoX9Yi7vhJ...,GBBZV1555523,601811200000.0
1,1,Pana,Tekno,Pana,04:02,1,leposia,2018-08-19T20:51:51Z,stream,59,...,"Afro dancehall, Afropop, Azontobeats, Nigerian...",single,Columbia,2016-12-20,Tekno,https://open.spotify.com/track/3ddd0nUY2jZtPYz...,https://open.spotify.com/artist/6IhG3Yxm3UW98j...,https://open.spotify.com/album/62sUZK9tqrSH3AM...,USSM11610316,886446300000.0
2,2,Jogodo,Tekno,Jogodo,04:23,1,leposia,2018-08-19T20:51:54Z,stream,44,...,"Afro dancehall, Afropop, Azontobeats, Nigerian...",single,MADE MEN MUSIC GROUP,2018-04-19,Tekno,https://open.spotify.com/track/2irIMb2mjl8lgVz...,https://open.spotify.com/artist/6IhG3Yxm3UW98j...,https://open.spotify.com/album/6aR2Z7ByqS9rxAz...,GBZB51800047,5050581000000.0
3,3,Duro,Tekno,Duro,03:32,1,leposia,2018-08-19T20:52:04Z,stream,0,...,"Afro dancehall, Afropop, Azontobeats, Nigerian...",single,Made Men Music Group,2015-06-22,Tekno,https://open.spotify.com/track/4zI3yq6BBpbnKZY...,https://open.spotify.com/artist/6IhG3Yxm3UW98j...,https://open.spotify.com/album/2RtngUBJGTBgQuY...,UK3AZ1512397,632181500000.0
4,4,Samantha,Tekno,Samantha,03:49,1,leposia,2018-08-19T20:52:24Z,stream,42,...,"Afro dancehall, Afropop, Azontobeats, Nigerian...",single,MADE MEN MUSIC GROUP,2017-06-09,Tekno,https://open.spotify.com/track/0rI1YdcUXmvWd6R...,https://open.spotify.com/artist/6IhG3Yxm3UW98j...,https://open.spotify.com/album/4JS9Fgzb4XzmWMQ...,GBZB51700035,5050581000000.0


In [36]:
df_SortYourMusic.reset_index(inplace=True)
df_SortYourMusic.head()

Unnamed: 0,index,#,Title,Artist,Release,BPM,Energy,Dance,Loud,Valence,Length,Acoustic,Pop.,RND
0,0,1,"My Woman, My Everything (feat. Wandecoal)",Patoranking,2015-07-06 00:00:00,112.0,85.0,90.0,-3.0,63.0,03:54:00,4.0,54.0,7655.0
1,1,2,Pana,Tekno,2016-12-20 00:00:00,73.0,55.0,55.0,-7.0,66.0,04:03:00,31.0,58.0,5685.0
2,2,3,Jogodo,Tekno,2018-04-19 00:00:00,100.0,77.0,74.0,-5.0,82.0,04:23:00,38.0,44.0,8581.0
3,3,4,Duro,Tekno,2015-06-22 00:00:00,90.0,84.0,74.0,-4.0,97.0,03:32:00,28.0,0.0,400.0
4,4,5,Samantha,Tekno,2017-06-09 00:00:00,106.0,75.0,83.0,-3.0,79.0,03:50:00,3.0,42.0,6739.0


Merge DataFrames

In [37]:
# Merge both DataFrames (df_Silkey & df_SortYourMusic)

All_songs = df_Silkey.merge(df_SortYourMusic, on="index")

In [38]:
#All_songs[["trackName", "Title"]].iloc[4000:4040]
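
Merging on the positional `index` only works if both sources list the same tracks in the same order. One way to check this is to compare the two title columns row by row. The sketch below uses two small hand-made frames; `left`, `right`, `same_title` and `mismatches` are hypothetical names, and the last title is deliberately different to show what a mismatch looks like.

```python
import pandas as pd

# Hand-made miniature frames using the title columns of the two exports.
left = pd.DataFrame({"trackName": ["Pana", "Jogodo", "Duro"]})
right = pd.DataFrame({"Title": ["Pana", "Jogodo", "Samantha"]})

# Compare titles row by row after normalising case and surrounding whitespace.
same_title = (
    left["trackName"].str.strip().str.lower()
    == right["Title"].str.strip().str.lower()
)

# Show the rows where the two sources disagree.
mismatches = pd.concat([left, right], axis=1)[~same_title]
print(f"{len(mismatches)} of {len(left)} rows do not line up")
print(mismatches)
```

If rows fail to line up on the real data, merging on the track and artist names would likely be safer than merging on the position.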

Remove columns

In [39]:
# Check columns

All_songs.columns

Index(['index', 'trackName', 'artistName', 'albumName', 'trackDuration',
       'trackNumber', 'addedBy', 'addedAt', 'isLocal', 'trackPopularity',
       'artistPopularity', 'secondaryArtistsNames', 'albumPopularity',
       'artistFollowers', 'artistGenres', 'albumType', 'albumRecordLabel',
       'albumReleaseDate', 'albumArtistsNames', 'trackUrl', 'artistUrl',
       'albumUrl', 'trackIsrc', 'albumUpc', '#', 'Title', 'Artist', 'Release',
       'BPM', 'Energy', 'Dance', 'Loud', 'Valence', 'Length', 'Acoustic',
       'Pop.', 'RND'],
      dtype='object')

In [40]:
# Drop columns

# df_SortYourMusic:
# Title / Artist / Release / Length / Pop. <- duplicate information & not important: RND
# df_Silkey:
# not important: 'albumName'/ "trackNumber" / "addedBy" / 'albumArtistsNames' /'trackUrl'/ 
# 'artistUrl' / 'albumUrl'/ 'trackIsrc' (International Standard Recording Code)/ 'albumUpc' (Universal Product Code) / 
# 'albumType' / 'albumPopularity'


All_songs_cleaned1 = All_songs.drop(columns=["albumName", 
                            "trackNumber",
                            "addedBy",
                            "albumArtistsNames",
                            "trackUrl",
                            "artistUrl",
                            "albumUrl",
                            "trackIsrc",
                            "albumUpc",
                            "albumType",
                            "albumPopularity",
                            'Title', 
                            'Artist', 
                            'Release', 
                            "Length", 
                            "Pop.",
                            "RND",
                            '#',          
                           ])

In [41]:
All_songs_cleaned1.head()

Unnamed: 0,index,trackName,artistName,trackDuration,addedAt,isLocal,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,artistGenres,albumRecordLabel,albumReleaseDate,BPM,Energy,Dance,Loud,Valence,Acoustic
0,0,"My Woman, My Everything (feat. Wandecoal)",Patoranking,03:53,2018-08-19T20:44:37Z,stream,54,61.0,Wande Coal,149884.0,"Afro dancehall, Afropop, Azontobeats, Gqom, Ni...",Greensleeves Records,2015-07-06,112.0,85.0,90.0,-3.0,63.0,4.0
1,1,Pana,Tekno,04:02,2018-08-19T20:51:51Z,stream,59,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",Columbia,2016-12-20,73.0,55.0,55.0,-7.0,66.0,31.0
2,2,Jogodo,Tekno,04:23,2018-08-19T20:51:54Z,stream,44,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",MADE MEN MUSIC GROUP,2018-04-19,100.0,77.0,74.0,-5.0,82.0,38.0
3,3,Duro,Tekno,03:32,2018-08-19T20:52:04Z,stream,0,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",Made Men Music Group,2015-06-22,90.0,84.0,74.0,-4.0,97.0,28.0
4,4,Samantha,Tekno,03:49,2018-08-19T20:52:24Z,stream,42,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",MADE MEN MUSIC GROUP,2017-06-09,106.0,75.0,83.0,-3.0,79.0,3.0


In [42]:
len(All_songs_cleaned1)

4007

Check for and remove duplicates

In [43]:
# Check if there are any duplicates
Duplicates = All_songs_cleaned1[["artistName", "trackName", "index"]].groupby(["trackName", "artistName"]).count().sort_values("index", ascending=False)
Duplicates.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,index
trackName,artistName,Unnamed: 2_level_1
I'm So Excited,The Pointer Sisters,4
Promise,Ben Howard,4
"Sports, Not Heavy Crime",Xploding Plastix,4
Blackbird Song,Lee DeWyze,4
Blacked Out,Yussef Dayes,4
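
As an alternative to the `groupby` count, `DataFrame.duplicated` can list the duplicated rows and count how many would be dropped. This is a sketch on a small hand-made frame; "Another Artist" is an invented value, and `songs`, `key`, `dupe_rows`, `n_removed` and `deduped` are hypothetical names.

```python
import pandas as pd

songs = pd.DataFrame(
    {
        "trackName": ["Promise", "Promise", "Pana", "Promise"],
        "artistName": ["Ben Howard", "Ben Howard", "Tekno", "Another Artist"],
    }
)
key = ["artistName", "trackName"]

# keep=False marks every member of a duplicate group, not only the repeats.
dupe_rows = songs[songs.duplicated(subset=key, keep=False)]

# With the default keep="first", only the repeats are marked, so the sum is
# the number of rows that drop_duplicates would remove.
n_removed = songs.duplicated(subset=key).sum()

deduped = songs.drop_duplicates(subset=key, keep="first")
print(dupe_rows)
print(n_removed, len(deduped))
```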


In [44]:
# Duplicate items
#Duplicates2 = Duplicates[Duplicates["index"] > 1]
#len(Duplicates2)

In [45]:
# Single items
#Duplicates3 = Duplicates[Duplicates["index"] == 1]
#len(Duplicates3)

In [46]:
All_songs_cleaned2 = All_songs_cleaned1.drop_duplicates(subset=["artistName", "trackName"], keep="first")
All_songs_cleaned2.shape

(3510, 19)

Check for NaN values

In [47]:
All_songs_cleaned2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3510 entries, 0 to 4006
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   index                  3510 non-null   int64  
 1   trackName              3510 non-null   object 
 2   artistName             3482 non-null   object 
 3   trackDuration          3510 non-null   object 
 4   addedAt                3510 non-null   object 
 5   isLocal                3510 non-null   object 
 6   trackPopularity        3510 non-null   int64  
 7   artistPopularity       3461 non-null   float64
 8   secondaryArtistsNames  1031 non-null   object 
 9   artistFollowers        3461 non-null   float64
 10  artistGenres           3054 non-null   object 
 11  albumRecordLabel       3461 non-null   object 
 12  albumReleaseDate       3461 non-null   object 
 13  BPM                    3461 non-null   float64
 14  Energy                 3461 non-null   float64
 15  Danc

In [48]:
# Remove rows without "artistName" -> this is safe because all rows without an artist name are local files
# (assigning the result instead of using inplace=True avoids pandas' SettingWithCopyWarning)
All_songs_cleaned2 = All_songs_cleaned2.dropna(axis=0, subset=["artistName"])
All_songs_cleaned2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3482 entries, 0 to 4006
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   index                  3482 non-null   int64  
 1   trackName              3482 non-null   object 
 2   artistName             3482 non-null   object 
 3   trackDuration          3482 non-null   object 
 4   addedAt                3482 non-null   object 
 5   isLocal                3482 non-null   object 
 6   trackPopularity        3482 non-null   int64  
 7   artistPopularity       3461 non-null   float64
 8   secondaryArtistsNames  1031 non-null   object 
 9   artistFollowers        3461 non-null   float64
 10  artistGenres           3054 non-null   object 
 11  albumRecordLabel       3461 non-null   object 
 12  albumReleaseDate       3461 non-null   object 
 13  BPM                    3461 non-null   float64
 14  Energy                 3461 non-null   float64
 15  Danc


In [49]:
# NaN values in BPM, Energy etc. columns -> There still might be local data in the dataset
All_songs_cleaned2_test2 = All_songs_cleaned2[All_songs_cleaned2["BPM"].isna()]
All_songs_cleaned2_test2.head()                                 

Unnamed: 0,index,trackName,artistName,trackDuration,addedAt,isLocal,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,artistGenres,albumRecordLabel,albumReleaseDate,BPM,Energy,Dance,Loud,Valence,Acoustic
61,61,DJ Flex,AfroBeat Remix Drake in my feelings (KeKe),01:52,2018-11-30T08:56:07Z,local,0,,,,,,,,,,,,
142,142,Where Have They Gone (Mr Mendel Edit),M...,05:49,2020-10-24T18:17:05Z,local,0,,,,,,,,,,,,
206,206,Kenkan (David Montoya edit),Mory Kante,06:15,2020-05-08T21:10:33Z,local,0,,,,,,,,,,,,
207,207,Isibusiso Sami,Schomane RA ft Poizen,06:58,2020-05-08T21:10:45Z,local,0,,,,,,,,,,,,
265,265,Uyankenteza (Nostalgic Mix),The Godfathers Of Deep House SA,07:35,2020-05-08T21:40:27Z,local,0,,,,,,,,,,,,
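
The assumption that missing artist names and missing audio features occur only in local files can be checked directly instead of by inspecting a few rows. The sketch below uses a small hand-made frame with the notebook's column names; `tracks`, `missing_artist_is_local` and `missing_bpm_by_source` are hypothetical names.

```python
import numpy as np
import pandas as pd

tracks = pd.DataFrame(
    {
        "artistName": ["Tekno", None, "Mory Kante"],
        "isLocal": ["stream", "local", "local"],
        "BPM": [73.0, np.nan, np.nan],
    }
)

# True only if every row without an artist name is flagged as a local file.
missing_artist_is_local = (
    tracks.loc[tracks["artistName"].isna(), "isLocal"].eq("local").all()
)

# Number of missing BPM values per source ("local" vs. "stream").
missing_bpm_by_source = tracks["BPM"].isna().groupby(tracks["isLocal"]).sum()

print(missing_artist_is_local)
print(missing_bpm_by_source)
```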


In [50]:
# Remove all local data -> there is no information about this data that can be used for further analysis
All_songs_cleaned3 = All_songs_cleaned2[All_songs_cleaned2["isLocal"]=="stream"]
All_songs_cleaned3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3461 entries, 0 to 4006
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   index                  3461 non-null   int64  
 1   trackName              3461 non-null   object 
 2   artistName             3461 non-null   object 
 3   trackDuration          3461 non-null   object 
 4   addedAt                3461 non-null   object 
 5   isLocal                3461 non-null   object 
 6   trackPopularity        3461 non-null   int64  
 7   artistPopularity       3461 non-null   float64
 8   secondaryArtistsNames  1031 non-null   object 
 9   artistFollowers        3461 non-null   float64
 10  artistGenres           3054 non-null   object 
 11  albumRecordLabel       3461 non-null   object 
 12  albumReleaseDate       3461 non-null   object 
 13  BPM                    3461 non-null   float64
 14  Energy                 3461 non-null   float64
 15  Danc

In [51]:
# There are still missing values in "secondaryArtistsNames" and "artistGenres" 
# -> These columns remain in the dataset for the EDA; they will be removed for modeling (machine learning)

In [52]:
All_songs_cleaned3.head()

Unnamed: 0,index,trackName,artistName,trackDuration,addedAt,isLocal,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,artistGenres,albumRecordLabel,albumReleaseDate,BPM,Energy,Dance,Loud,Valence,Acoustic
0,0,"My Woman, My Everything (feat. Wandecoal)",Patoranking,03:53,2018-08-19T20:44:37Z,stream,54,61.0,Wande Coal,149884.0,"Afro dancehall, Afropop, Azontobeats, Gqom, Ni...",Greensleeves Records,2015-07-06,112.0,85.0,90.0,-3.0,63.0,4.0
1,1,Pana,Tekno,04:02,2018-08-19T20:51:51Z,stream,59,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",Columbia,2016-12-20,73.0,55.0,55.0,-7.0,66.0,31.0
2,2,Jogodo,Tekno,04:23,2018-08-19T20:51:54Z,stream,44,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",MADE MEN MUSIC GROUP,2018-04-19,100.0,77.0,74.0,-5.0,82.0,38.0
3,3,Duro,Tekno,03:32,2018-08-19T20:52:04Z,stream,0,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",Made Men Music Group,2015-06-22,90.0,84.0,74.0,-4.0,97.0,28.0
4,4,Samantha,Tekno,03:49,2018-08-19T20:52:24Z,stream,42,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",MADE MEN MUSIC GROUP,2017-06-09,106.0,75.0,83.0,-3.0,79.0,3.0


Split "artistGenres"

In [54]:
# Split "artistGenres"
All_songs_cleaned3["artistGenres"][50]

'Afro dancehall, Afropop, Ghanaian hip hop, Nigerian pop, Swedish dancehall'

In [55]:
# Dataset without NaNs in "artistGenres"

#All_songs_clean_Genre = All_songs_cleaned3.dropna(axis=0, subset=["artistGenres"])
#All_songs_clean_Genre.info()

In [56]:
# Split on ", " (comma plus space) so that genre names do not keep a leading space
All_songs_cleaned3_Genres = All_songs_cleaned3["artistGenres"].str.split(", ", expand=True)
print(len(All_songs_cleaned3_Genres))
All_songs_cleaned3_Genres.head()
# up to 16 different genres for some songs! 

3461


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,Afro dancehall,Afropop,Azontobeats,Gqom,Nigerian pop,,,,,,,,,,,
1,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop,,,,,,,,,,,
2,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop,,,,,,,,,,,
3,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop,,,,,,,,,,,
4,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop,,,,,,,,,,,


In [57]:
# Reduce genres 
All_songs_cleaned3_Genres_reduced = All_songs_cleaned3_Genres.iloc[:,0:5]
All_songs_cleaned3_Genres_reduced.head()

Unnamed: 0,0,1,2,3,4
0,Afro dancehall,Afropop,Azontobeats,Gqom,Nigerian pop
1,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
2,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
3,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
4,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop


In [58]:
# Assign the renamed frame instead of using inplace=True on a slice (avoids SettingWithCopyWarning)
All_songs_cleaned3_Genres_reduced = All_songs_cleaned3_Genres_reduced.rename(columns={0:"Genre1",
                                                    1:"Genre2",
                                                    2:"Genre3",
                                                    3:"Genre4",
                                                    4:"Genre5"})


In [59]:
All_songs_cleaned3_Genres_reduced.head()

Unnamed: 0,Genre1,Genre2,Genre3,Genre4,Genre5
0,Afro dancehall,Afropop,Azontobeats,Gqom,Nigerian pop
1,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
2,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
3,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
4,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop


Merge Genre-DataFrame with previous cleaned DataFrame

In [60]:
All_songs_cleaned3_Genres_reduced.reset_index(inplace=True)
All_songs_cleaned4 = All_songs_cleaned3.merge(All_songs_cleaned3_Genres_reduced, on="index")
All_songs_cleaned4.head()

Unnamed: 0,index,trackName,artistName,trackDuration,addedAt,isLocal,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,0,"My Woman, My Everything (feat. Wandecoal)",Patoranking,03:53,2018-08-19T20:44:37Z,stream,54,61.0,Wande Coal,149884.0,...,85.0,90.0,-3.0,63.0,4.0,Afro dancehall,Afropop,Azontobeats,Gqom,Nigerian pop
1,1,Pana,Tekno,04:02,2018-08-19T20:51:51Z,stream,59,62.0,,175849.0,...,55.0,55.0,-7.0,66.0,31.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
2,2,Jogodo,Tekno,04:23,2018-08-19T20:51:54Z,stream,44,62.0,,175849.0,...,77.0,74.0,-5.0,82.0,38.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
3,3,Duro,Tekno,03:32,2018-08-19T20:52:04Z,stream,0,62.0,,175849.0,...,84.0,74.0,-4.0,97.0,28.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
4,4,Samantha,Tekno,03:49,2018-08-19T20:52:24Z,stream,42,62.0,,175849.0,...,75.0,83.0,-3.0,79.0,3.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
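
The split, reduce, rename and merge steps for the genres can likely be written more compactly by joining on the existing row index. The following sketch runs on a small hand-made frame: "Extra genre" is an invented sixth genre that demonstrates the cut-off, and `tracks`, `max_genres`, `genres` and `tracks_with_genres` are hypothetical names.

```python
import pandas as pd

tracks = pd.DataFrame(
    {
        "trackName": ["Pana", "Shango", "Sondela feat. Xolisa"],
        "artistGenres": [
            "Afro dancehall, Afropop, Azontobeats, Nigerian hip hop, Nigerian pop, Extra genre",
            "Detroit house, Experimental house",
            None,  # tracks without genre information stay empty in every genre column
        ],
    }
)

max_genres = 5

# Split on ", " and keep at most the first max_genres columns.
genres = tracks["artistGenres"].str.split(", ", expand=True).iloc[:, :max_genres]

# Name the columns Genre1, Genre2, ... according to their position.
genres.columns = [f"Genre{i + 1}" for i in range(genres.shape[1])]

# join() aligns on the row index, so no reset_index or merge key is needed.
tracks_with_genres = tracks.join(genres)
print(tracks_with_genres)
```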


In [61]:
All_songs_cleaned4.drop(columns=["isLocal", "artistGenres"], inplace=True)
All_songs_cleaned4.columns

Index(['index', 'trackName', 'artistName', 'trackDuration', 'addedAt',
       'trackPopularity', 'artistPopularity', 'secondaryArtistsNames',
       'artistFollowers', 'albumRecordLabel', 'albumReleaseDate', 'BPM',
       'Energy', 'Dance', 'Loud', 'Valence', 'Acoustic', 'Genre1', 'Genre2',
       'Genre3', 'Genre4', 'Genre5'],
      dtype='object')

In [62]:
All_songs_cleaned4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3461 entries, 0 to 3460
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   index                  3461 non-null   int64  
 1   trackName              3461 non-null   object 
 2   artistName             3461 non-null   object 
 3   trackDuration          3461 non-null   object 
 4   addedAt                3461 non-null   object 
 5   trackPopularity        3461 non-null   int64  
 6   artistPopularity       3461 non-null   float64
 7   secondaryArtistsNames  1031 non-null   object 
 8   artistFollowers        3461 non-null   float64
 9   albumRecordLabel       3461 non-null   object 
 10  albumReleaseDate       3461 non-null   object 
 11  BPM                    3461 non-null   float64
 12  Energy                 3461 non-null   float64
 13  Dance                  3461 non-null   float64
 14  Loud                   3461 non-null   float64
 15  Vale

Convert "object" into "datetime"

In [66]:
# Change data type to datetime:
# - trackDuration [3] -> format: 03:53
# - addedAt [4] -> format: 2018-08-19T20:44:37Z
# - albumReleaseDate [10] -> format: 2015-07-06

In [67]:
# All_songs_cleaned4["trackDuration"] = pd.to_datetime(All_songs_cleaned4["trackDuration"], format="%H:%M:%S") # Datetime
#All_songs_cleaned4["trackDuration"] = pd.to_datetime(All_songs_cleaned4["trackDuration"]).dt.time #string

In [68]:
All_songs_cleaned4["trackDuration"].head()

0    03:53
1    04:02
2    04:23
3    03:32
4    03:49
Name: trackDuration, dtype: object
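
`trackDuration` is a length of time rather than a point in time, so a timedelta (or a plain number of seconds) may suit it better than a datetime. This is a minimal sketch, assuming every value has the `mm:ss` form shown above and no track reaches one hour; `durations`, `as_timedelta` and `duration_seconds` are hypothetical names.

```python
import pandas as pd

durations = pd.Series(["03:53", "04:02", "09:16"], name="trackDuration")

# Prefix "00:" so that pandas reads each value as hours:minutes:seconds.
as_timedelta = pd.to_timedelta("00:" + durations)

# A plain integer number of seconds is convenient as a numeric model feature.
duration_seconds = as_timedelta.dt.total_seconds().astype(int)
print(duration_seconds)
```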

In [69]:
All_songs_cleaned4["addedAt"] = pd.to_datetime(All_songs_cleaned4["addedAt"]).dt.tz_convert(None)

In [70]:
All_songs_cleaned4["addedAt"].head()

0   2018-08-19 20:44:37
1   2018-08-19 20:51:51
2   2018-08-19 20:51:54
3   2018-08-19 20:52:04
4   2018-08-19 20:52:24
Name: addedAt, dtype: datetime64[ns]

In [71]:
All_songs_cleaned4["albumReleaseDate"] = pd.to_datetime(All_songs_cleaned4["albumReleaseDate"], yearfirst=True)

In [72]:
All_songs_cleaned4["albumReleaseDate"].head()

0   2015-07-06
1   2016-12-20
2   2018-04-19
3   2015-06-22
4   2017-06-09
Name: albumReleaseDate, dtype: datetime64[ns]

In [73]:
All_songs_cleaned4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3461 entries, 0 to 3460
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   index                  3461 non-null   int64         
 1   trackName              3461 non-null   object        
 2   artistName             3461 non-null   object        
 3   trackDuration          3461 non-null   object        
 4   addedAt                3461 non-null   datetime64[ns]
 5   trackPopularity        3461 non-null   int64         
 6   artistPopularity       3461 non-null   float64       
 7   secondaryArtistsNames  1031 non-null   object        
 8   artistFollowers        3461 non-null   float64       
 9   albumRecordLabel       3461 non-null   object        
 10  albumReleaseDate       3461 non-null   datetime64[ns]
 11  BPM                    3461 non-null   float64       
 12  Energy                 3461 non-null   float64       
 13  Dan

Save DataFrame to csv-file

In [74]:
All_songs_cleaned4.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/AllSongsCleaned_EDA.csv")

**2) Dataset for machine learning models** (includes all songs)

This dataset combines songs that I like with songs that I do not like.
The songs I like come from my personal playlists; the songs I do not like come from a Kaggle dataset and are selected by genre. Each dataset gets an extra column "like" for the classification (modeling): like = 1 for songs I like and like = 0 for songs I do not like. In the last step I combine the two datasets.

a) My data

In [75]:
MySpotify_cleaned = pd.read_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/AllSongsCleaned_EDA.csv")
MySpotify_cleaned.drop(columns="Unnamed: 0", inplace=True)
MySpotify_cleaned.head()

Unnamed: 0,index,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,0,"My Woman, My Everything (feat. Wandecoal)",Patoranking,03:53,2018-08-19 20:44:37,54,61.0,Wande Coal,149884.0,Greensleeves Records,...,85.0,90.0,-3.0,63.0,4.0,Afro dancehall,Afropop,Azontobeats,Gqom,Nigerian pop
1,1,Pana,Tekno,04:02,2018-08-19 20:51:51,59,62.0,,175849.0,Columbia,...,55.0,55.0,-7.0,66.0,31.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
2,2,Jogodo,Tekno,04:23,2018-08-19 20:51:54,44,62.0,,175849.0,MADE MEN MUSIC GROUP,...,77.0,74.0,-5.0,82.0,38.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
3,3,Duro,Tekno,03:32,2018-08-19 20:52:04,0,62.0,,175849.0,Made Men Music Group,...,84.0,74.0,-4.0,97.0,28.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
4,4,Samantha,Tekno,03:49,2018-08-19 20:52:24,42,62.0,,175849.0,MADE MEN MUSIC GROUP,...,75.0,83.0,-3.0,79.0,3.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
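
Writing a CSV with the default settings stores the row index as an extra "Unnamed: 0" column, and datetime columns come back as plain strings when the file is read again. A minimal sketch of a round trip that avoids both, using a hypothetical two-row frame and an in-memory buffer instead of the real file path:

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for All_songs_cleaned4
df = pd.DataFrame({
    "trackName": ["Pana", "Jogodo"],
    "addedAt": pd.to_datetime(["2018-08-19 20:51:51", "2018-08-19 20:51:54"]),
})

# An in-memory buffer replaces the CSV file on disk for this sketch
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # do not write the row index as a column
buffer.seek(0)

# parse_dates restores the datetime dtype that read_csv would otherwise load as strings
restored = pd.read_csv(buffer, parse_dates=["addedAt"])
```

With `index=False` there is no "Unnamed: 0" column to drop after loading.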


Remove columns that are not relevant for modeling

In [76]:
MySpotify_cleaned.columns

Index(['index', 'trackName', 'artistName', 'trackDuration', 'addedAt',
       'trackPopularity', 'artistPopularity', 'secondaryArtistsNames',
       'artistFollowers', 'albumRecordLabel', 'albumReleaseDate', 'BPM',
       'Energy', 'Dance', 'Loud', 'Valence', 'Acoustic', 'Genre1', 'Genre2',
       'Genre3', 'Genre4', 'Genre5'],
      dtype='object')

In [77]:
MySpotify_cleaned_modeling = MySpotify_cleaned[["artistName", "trackPopularity", "BPM", "Energy", "Dance", "Loud",
                                              "Valence", "Acoustic", "Genre1"]]
MySpotify_cleaned_modeling.head()      

Unnamed: 0,artistName,trackPopularity,BPM,Energy,Dance,Loud,Valence,Acoustic,Genre1
0,Patoranking,54,112.0,85.0,90.0,-3.0,63.0,4.0,Afro dancehall
1,Tekno,59,73.0,55.0,55.0,-7.0,66.0,31.0,Afro dancehall
2,Tekno,44,100.0,77.0,74.0,-5.0,82.0,38.0,Afro dancehall
3,Tekno,0,90.0,84.0,74.0,-4.0,97.0,28.0,Afro dancehall
4,Tekno,42,106.0,75.0,83.0,-3.0,79.0,3.0,Afro dancehall


Add "like" column for classification (modeling)

In [78]:
MySpotify_cleaned_modeling = MySpotify_cleaned_modeling.assign(like=1)
MySpotify_cleaned_modeling.head()

Unnamed: 0,artistName,trackPopularity,BPM,Energy,Dance,Loud,Valence,Acoustic,Genre1,like
0,Patoranking,54,112.0,85.0,90.0,-3.0,63.0,4.0,Afro dancehall,1
1,Tekno,59,73.0,55.0,55.0,-7.0,66.0,31.0,Afro dancehall,1
2,Tekno,44,100.0,77.0,74.0,-5.0,82.0,38.0,Afro dancehall,1
3,Tekno,0,90.0,84.0,74.0,-4.0,97.0,28.0,Afro dancehall,1
4,Tekno,42,106.0,75.0,83.0,-3.0,79.0,3.0,Afro dancehall,1
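
Adding a column to a frame that was created by selecting columns from another frame can make pandas emit a SettingWithCopyWarning, because it cannot tell whether the selection is a view or a copy. Two common ways to make the intent explicit are sketched below on a hypothetical toy frame (`songs` is not part of the notebook):

```python
import pandas as pd

# Hypothetical toy frame standing in for MySpotify_cleaned
songs = pd.DataFrame({
    "artistName": ["Patoranking", "Tekno"],
    "BPM": [112.0, 73.0],
    "Loud": [-3.0, -7.0],
})

# Option 1: take an explicit copy of the selection, then add the column
subset_copy = songs[["artistName", "BPM"]].copy()
subset_copy["like"] = 1

# Option 2: assign() returns a new frame with the extra column
subset_assign = songs[["artistName", "BPM"]].assign(like=1)
```

Both options leave the original frame untouched and produce the same result.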


b) Kaggle dataset

In [79]:
Kaggle = pd.read_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/Spotify_Dataset1/data_w_genres.csv")

In [80]:
Kaggle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32539 entries, 0 to 32538
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artists           32539 non-null  object 
 1   acousticness      32539 non-null  float64
 2   danceability      32539 non-null  float64
 3   duration_ms       32539 non-null  float64
 4   energy            32539 non-null  float64
 5   instrumentalness  32539 non-null  float64
 6   liveness          32539 non-null  float64
 7   loudness          32539 non-null  float64
 8   speechiness       32539 non-null  float64
 9   tempo             32539 non-null  float64
 10  valence           32539 non-null  float64
 11  popularity        32539 non-null  float64
 12  key               32539 non-null  int64  
 13  mode              32539 non-null  int64  
 14  count             32539 non-null  int64  
 15  genres            32539 non-null  object 
dtypes: float64(11), int64(3), object(2)
memo

Select songs that I probably wouldn't like based on the genre

In [81]:
# Split genre column

Kaggle_Genres = Kaggle["genres"].str.split(",", expand=True)
# Result: 20 columns; only the first one is kept
Kaggle_Genres1 = pd.DataFrame(Kaggle_Genres[0])
Kaggle_Genres1.head()

# Note: each value in "genres" is a string such as "['show tunes']" or "[]", not a Python list,
# so pd.DataFrame(Kaggle_Genres[0].to_list()) cannot expand it

Unnamed: 0,0
0,['show tunes']
1,[]
2,[]
3,[]
4,[]
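
Because each value is the text of a Python list, splitting on "," leaves brackets and quotes in the result (for example `['neon pop punk'`). One alternative, sketched here as an assumption rather than the approach used in this notebook, is to parse the text with `ast.literal_eval` and take the first element. The sample Series and the helper `first_genre` are hypothetical.

```python
import ast
import pandas as pd

# Hypothetical rows in the same string format as the "genres" column
genres = pd.Series(["['show tunes']", "[]", "['neon pop punk', 'pop punk']"])

def first_genre(text):
    """Parse the stringified list and return its first genre, or None if it is empty."""
    parsed = ast.literal_eval(text)
    return parsed[0] if parsed else None

first = genres.apply(first_genre)
```

The parsed values carry no brackets or quotes, and empty lists become missing values instead of the string "[]".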


In [82]:
# Add the genre column with only one genre to the initial dataset
Kaggle1 = pd.concat([Kaggle, Kaggle_Genres1], axis=1)
Kaggle1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32539 entries, 0 to 32538
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artists           32539 non-null  object 
 1   acousticness      32539 non-null  float64
 2   danceability      32539 non-null  float64
 3   duration_ms       32539 non-null  float64
 4   energy            32539 non-null  float64
 5   instrumentalness  32539 non-null  float64
 6   liveness          32539 non-null  float64
 7   loudness          32539 non-null  float64
 8   speechiness       32539 non-null  float64
 9   tempo             32539 non-null  float64
 10  valence           32539 non-null  float64
 11  popularity        32539 non-null  float64
 12  key               32539 non-null  int64  
 13  mode              32539 non-null  int64  
 14  count             32539 non-null  int64  
 15  genres            32539 non-null  object 
 16  0                 32539 non-null  object

Select ca. 3461 random songs based on genres I don't like 

In [83]:
def Kaggle_random():
    # genre that contains: "..."
    Kaggle_pop = Kaggle1[Kaggle1[0].str.contains("pop")]
    # Take n samples from this genre
    Kaggle_pop_random = Kaggle_pop.sample(n=1770) # 1500 + ca. 200 duplicates
    
    Kaggle_techno = Kaggle1[Kaggle1["genres"].str.contains("techno")]
    Kaggle_techno_random = Kaggle_techno.sample(n=250)
    
    Kaggle_acid = Kaggle1[Kaggle1["genres"].str.contains("acid")]
    Kaggle_acid_random = Kaggle_acid.sample(n=31)
    
    Kaggle_trance = Kaggle1[Kaggle1[0].str.contains("trance")]
    Kaggle_trance_random = Kaggle_trance.sample(n=250)
    
    Kaggle_death = Kaggle1[Kaggle1["genres"].str.contains("death")]
    Kaggle_death_random = Kaggle_death.sample(n=50)
    
    Kaggle_punk = Kaggle1[Kaggle1[0].str.contains("punk")]
    Kaggle_punk_random = Kaggle_punk.sample(n=200)
    
    Kaggle_country = Kaggle1[Kaggle1[0].str.contains("country")]
    Kaggle_country_random = Kaggle_country.sample(n=400)
    
    Kaggle_emo = Kaggle1[Kaggle1[0].str.contains("emo")]
    Kaggle_emo_random = Kaggle_emo.sample(n=100)
    
    Kaggle_chinese = Kaggle1[Kaggle1["genres"].str.contains("chinese")] #
    Kaggle_chinese_random = Kaggle_chinese.sample(n=50)
    
    Kaggle_psy = Kaggle1[Kaggle1[0].str.contains("psy")]
    Kaggle_psy_random = Kaggle_psy.sample(n=80)
    
    Kaggle_german = Kaggle1[Kaggle1["genres"].str.contains("german")] # 
    Kaggle_german_random = Kaggle_german.sample(n=150)
    
    Kaggle_edm = Kaggle1[Kaggle1[0].str.contains("edm")]
    Kaggle_edm_random = Kaggle_edm.sample(n=70)
    
    Kaggle_romantic = Kaggle1[Kaggle1[0].str.contains("romantic")]
    Kaggle_romantic_random = Kaggle_romantic.sample(n=70)
    
    Kaggle_synth = Kaggle1[Kaggle1["genres"].str.contains("synth")] #
    Kaggle_synth_random = Kaggle_synth.sample(n=200)
      
    Kaggle_dataset = pd.concat([Kaggle_pop_random, Kaggle_techno_random, Kaggle_acid_random,
                                Kaggle_death_random, Kaggle_trance_random, Kaggle_punk_random, 
                                Kaggle_country_random, Kaggle_emo_random, Kaggle_chinese_random, 
                                Kaggle_psy_random, Kaggle_german_random, Kaggle_edm_random,
                               Kaggle_romantic_random, Kaggle_synth_random, Kaggle_chinese_random], axis=0)
    return Kaggle_dataset

In [84]:
Kaggle_dataset = Kaggle_random()
len(Kaggle_dataset)

3721
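
The function above repeats the same filter-and-sample step for every genre. A more compact variant, sketched under assumptions, keeps the genre substrings and sample sizes in a dictionary and loops over them. The toy frame `artists`, the helper `sample_by_genre`, the fixed `seed` and the small sample sizes are all hypothetical, and the sketch always searches the full "genres" string, whereas the function above searches the first genre for some substrings.

```python
import pandas as pd

# Hypothetical stand-in for Kaggle1: artist-level rows with a raw "genres" string
artists = pd.DataFrame({
    "artists": ["A", "B", "C", "D", "E", "F"],
    "genres": ["['dance pop']", "['k-pop']", "['techno']",
               "['minimal techno']", "['country']", "['jazz']"],
})

# Genre substring -> number of rows to draw (small numbers for the toy data)
genre_samples = {"pop": 2, "techno": 1, "country": 1}

def sample_by_genre(df, genre_samples, column="genres", seed=42):
    """Draw a random sample for every genre substring and stack the results."""
    parts = []
    for genre, n in genre_samples.items():
        matches = df[df[column].str.contains(genre, regex=False)]
        # Cap n so sample() does not fail when a genre has too few rows
        parts.append(matches.sample(n=min(n, len(matches)), random_state=seed))
    # Rows matched by several genres appear more than once, so drop duplicates
    return pd.concat(parts, axis=0).drop_duplicates(keep="first")

disliked = sample_by_genre(artists, genre_samples)
```

Passing `random_state` makes the sample reproducible, so the row count after dropping duplicates stays the same between runs.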

Drop duplicates

In [85]:
#Duplicates2A = Kaggle_test["artists"].value_counts()
#Duplicates2A.head(50)

Kaggle_dataset_drop = Kaggle_dataset.drop_duplicates(keep="first")
len(Kaggle_dataset_drop)
# Target: ca. 3461 rows (the size of my own dataset)

3432

Remove columns to merge the dataset with "MySpotify" dataset

In [86]:
print(Kaggle_dataset_drop.columns)
print(MySpotify_cleaned_modeling.columns)

Index([         'artists',     'acousticness',     'danceability',
            'duration_ms',           'energy', 'instrumentalness',
               'liveness',         'loudness',      'speechiness',
                  'tempo',          'valence',       'popularity',
                    'key',             'mode',            'count',
                 'genres',                  0],
      dtype='object')
Index(['artistName', 'trackPopularity', 'BPM', 'Energy', 'Dance', 'Loud',
       'Valence', 'Acoustic', 'Genre1', 'like'],
      dtype='object')


In [87]:
Kaggle_dataset_modeling = Kaggle_dataset_drop[["artists", "popularity", "tempo", "energy", "danceability", "loudness",
                                              "valence", "acousticness", 0]]
Kaggle_dataset_modeling

Unnamed: 0,artists,popularity,tempo,energy,danceability,loudness,valence,acousticness,0
8297,Edu K,0.000000,129.983000,0.795000,0.951000,-4.996000,0.897000,0.177000,['classic russian pop']
11816,Hellogoodbye,51.500000,126.043000,0.610000,0.700000,-6.802500,0.792000,0.188500,['neon pop punk'
24719,SKY WU,47.000000,101.971000,0.470000,0.714000,-12.044000,0.468000,0.624000,['c-pop'
31101,Whitney Houston,49.301471,118.497963,0.537507,0.581419,-9.857316,0.430720,0.424661,['dance pop'
25874,Sivan,9.333333,130.448333,0.929667,0.426667,-6.441000,0.332333,0.000174,['australian pop'
...,...,...,...,...,...,...,...,...,...
11595,Harold Faltermeyer,44.857143,96.521571,0.487286,0.605000,-13.773857,0.392029,0.265284,['german soundtrack'
30455,Vanligt Folk,0.000000,153.006000,0.928000,0.650000,-6.751000,0.967000,0.029000,['swedish synth']
16038,La Roux,42.142857,133.856429,0.880429,0.636714,-2.475143,0.710714,0.009981,['alternative dance'
12356,Icona Pop,46.600000,111.184100,0.882700,0.686700,-3.646200,0.626600,0.055197,['candy pop'


Add "like" column for modeling

In [88]:
Kaggle_dataset_modeling = Kaggle_dataset_modeling.assign(like=0)

Rename columns

In [89]:
print(Kaggle_dataset_modeling.columns)
print(MySpotify_cleaned_modeling.columns)

Index([     'artists',   'popularity',        'tempo',       'energy',
       'danceability',     'loudness',      'valence', 'acousticness',
                    0,         'like'],
      dtype='object')
Index(['artistName', 'trackPopularity', 'BPM', 'Energy', 'Dance', 'Loud',
       'Valence', 'Acoustic', 'Genre1', 'like'],
      dtype='object')


In [90]:
MySpotify_cleaned_modeling = MySpotify_cleaned_modeling.rename(columns={"artistName": "artists",
                                          "trackPopularity": "popularity",
                                          "Energy":"energy",
                                          "Dance":"danceability",
                                          "Loud":"loudness",
                                          "Valence":"valence",
                                          "Acoustic": "acousticness",
                                          "Genre1": "genre"})

Kaggle_dataset_modeling = Kaggle_dataset_modeling.rename(columns={"tempo":"BPM",
                                       0:"genre"})


In [91]:
#Kaggle_dataset_modeling.rename(columns={0:"genre"}, inplace=True)

In [92]:
print(Kaggle_dataset_modeling.columns)
print(MySpotify_cleaned_modeling.columns)

Index(['artists', 'popularity', 'BPM', 'energy', 'danceability', 'loudness',
       'valence', 'acousticness', 'genre', 'like'],
      dtype='object')
Index(['artists', 'popularity', 'BPM', 'energy', 'danceability', 'loudness',
       'valence', 'acousticness', 'genre', 'like'],
      dtype='object')


Concatenate both datasets

In [93]:
dataset_modeling = pd.concat([MySpotify_cleaned_modeling, Kaggle_dataset_modeling])
dataset_modeling.head()

Unnamed: 0,artists,popularity,BPM,energy,danceability,loudness,valence,acousticness,genre,like
0,Patoranking,54.0,112.0,85.0,90.0,-3.0,63.0,4.0,Afro dancehall,1
1,Tekno,59.0,73.0,55.0,55.0,-7.0,66.0,31.0,Afro dancehall,1
2,Tekno,44.0,100.0,77.0,74.0,-5.0,82.0,38.0,Afro dancehall,1
3,Tekno,0.0,90.0,84.0,74.0,-4.0,97.0,28.0,Afro dancehall,1
4,Tekno,42.0,106.0,75.0,83.0,-3.0,79.0,3.0,Afro dancehall,1


In [94]:
print(len(Kaggle_dataset_modeling) + len(MySpotify_cleaned_modeling))
print(len(dataset_modeling))

6893
6893


In [95]:
dataset_modeling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6893 entries, 0 to 19422
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   artists       6893 non-null   object 
 1   popularity    6893 non-null   float64
 2   BPM           6893 non-null   float64
 3   energy        6893 non-null   float64
 4   danceability  6893 non-null   float64
 5   loudness      6893 non-null   float64
 6   valence       6893 non-null   float64
 7   acousticness  6893 non-null   float64
 8   genre         6486 non-null   object 
 9   like          6893 non-null   int64  
dtypes: float64(7), int64(1), object(2)
memory usage: 592.4+ KB
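
In the `head()` output above, my playlist rows hold values such as 85.0 for `energy`, while the Kaggle rows printed earlier hold values such as 0.795 in the same column. The two sources therefore appear to use different scales (0 to 100 versus 0 to 1) for `energy`, `danceability`, `valence` and `acousticness`. If that is the case, a model could separate the classes by scale alone. The following is a minimal sketch, under that assumption, of bringing the liked rows onto the 0 to 1 range; the toy frame and the division by 100 are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy rows mirroring the two sources: like=1 on a 0-100 scale, like=0 on a 0-1 scale
combined = pd.DataFrame({
    "energy": [85.0, 55.0, 0.795, 0.610],
    "valence": [63.0, 66.0, 0.897, 0.792],
    "like": [1, 1, 0, 0],
})

# Columns assumed to differ in scale between the two sources
scaled_columns = ["energy", "valence"]

# Divide only the liked rows by 100 so that every row uses the 0-1 range
liked = combined["like"] == 1
combined.loc[liked, scaled_columns] = combined.loc[liked, scaled_columns] / 100
```

Whether the scales really differ should be checked on the full data first, for example with `dataset_modeling.groupby("like").describe()`.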


In [96]:
dataset_modeling.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/dataset_modeling_all.csv")

**3) Dataset for machine learning models** (includes songs recently played)

Since my playlists cover very different genres, and I have not listened to many of them for ages, I will build models based only on the playlists I have listened to more recently, to check whether this improves the model performance.

In [97]:
def preparation_train():
    
    ## My Spotify data
    
    # Datasets from skiley.net
    import glob
    df_Silkey1 = pd.concat([pd.read_csv(f) for f in glob.glob("C:/Users/Agnieszka/Downloads/Datasets/Music/Silkey_MySpotify/Recent/" + "*.csv")], ignore_index=True)
    #len(df_Silkey)
    df_Silkey1.reset_index(inplace=True)
    
    
    # Dataset SortYourMusic
    df_SortYourMusic1 = pd.concat([pd.read_excel(f) for f in glob.glob("C:/Users/Agnieszka/Downloads/Datasets/Music/SortYourMusic/Recent/" + "*.xlsx")], ignore_index=True)
    #len(df_SortYourMusic)
    df_SortYourMusic1.reset_index(inplace=True)
    
    # Merge both datasets
    All_songs = df_Silkey1.merge(df_SortYourMusic1, on="index")
    
    # Select relevant features
    All_songs_cleaned1 = All_songs[["artistName", "trackName", "trackPopularity", "BPM", "Energy", "Dance", "Loud",
                                    "Valence", "Acoustic","artistGenres","isLocal"]]
    
    # Remove duplicates
    All_songs_cleaned2 = All_songs_cleaned1.drop_duplicates(subset=["artistName", "trackName"], keep="first")
    
    # Drop NaN
    # Remove all local data -> there is no information about this data that can be used for further analysis
    All_songs_cleaned3 = All_songs_cleaned2[All_songs_cleaned2["isLocal"]=="stream"]
    
    
    # Split Genre column 
    All_songs_cleaned3_Genres = All_songs_cleaned3["artistGenres"].str.split(",", expand=True)
    # Leave only first column
    All_songs_cleaned3_Genres_reduced = All_songs_cleaned3_Genres.iloc[:,0]
    All_songs_cleaned3_Genres_reduced1 = pd.DataFrame(All_songs_cleaned3_Genres_reduced)
    
    # Rename column
    All_songs_cleaned3_Genres_reduced1.rename(columns={0:"Genre1"}, inplace=True)
    
    # Merge "Genre"-DataFrame with the initial dataset
    All_songs_cleaned3_Genres_reduced1.reset_index(inplace=True)
    All_songs_cleaned3.reset_index(inplace=True)
    All_songs_cleaned4 = All_songs_cleaned3.merge(All_songs_cleaned3_Genres_reduced1, on="index")
    
    #Drop columns
    All_songs_cleaned4.drop(columns=["isLocal", "artistGenres", "trackName", "index"], inplace=True)
    
    #Add "like" column for classification (modeling)
    All_songs_cleaned4["like"] = 1
    
    # Rename columns
    All_songs_cleaned4.rename(columns={"artistName": "artists",
                                          "trackPopularity": "popularity",
                                          "Energy":"energy",
                                          "Dance":"danceability",
                                          "Loud":"loudness",
                                          "Valence":"valence",
                                          "Acoustic": "acousticness",
                                          "Genre1": "genre"}, inplace=True)
    
    ## Kaggle dataset
    Kaggle_dataset = pd.read_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/Spotify_Dataset1/data_w_genres.csv")
    
    # Split genre column
    Kaggle_Genres = Kaggle_dataset["genres"].str.split(",", expand=True)
    # Select only first column (from 20!)
    Kaggle_Genres1 = pd.DataFrame(Kaggle_Genres[0])
    
    # Add the genre column with only one genre to the initial dataset
    Kaggle1 = pd.concat([Kaggle_dataset, Kaggle_Genres1], axis=1)
    
    # Select ca. 1762 (=len(All_songs_cleaned4)) random songs based on genres I don't like
    # genre that contains: "..."
    Kaggle_pop = Kaggle1[Kaggle1[0].str.contains("pop")]
    # Take n samples from this genre
    Kaggle_pop_random = Kaggle_pop.sample(n=870) 
    
    Kaggle_techno = Kaggle1[Kaggle1["genres"].str.contains("techno")]
    Kaggle_techno_random = Kaggle_techno.sample(n=125)
    
    Kaggle_acid = Kaggle1[Kaggle1["genres"].str.contains("acid")]
    Kaggle_acid_random = Kaggle_acid.sample(n=16)
    
    Kaggle_trance = Kaggle1[Kaggle1[0].str.contains("trance")]
    Kaggle_trance_random = Kaggle_trance.sample(n=125)
    
    Kaggle_death = Kaggle1[Kaggle1["genres"].str.contains("death")]
    Kaggle_death_random = Kaggle_death.sample(n=25)
    
    Kaggle_punk = Kaggle1[Kaggle1[0].str.contains("punk")]
    Kaggle_punk_random = Kaggle_punk.sample(n=100)
    
    Kaggle_country = Kaggle1[Kaggle1[0].str.contains("country")]
    Kaggle_country_random = Kaggle_country.sample(n=200)
    
    Kaggle_emo = Kaggle1[Kaggle1[0].str.contains("emo")]
    Kaggle_emo_random = Kaggle_emo.sample(n=50)
    
    Kaggle_chinese = Kaggle1[Kaggle1["genres"].str.contains("chinese")] #
    Kaggle_chinese_random = Kaggle_chinese.sample(n=25)
    
    Kaggle_psy = Kaggle1[Kaggle1[0].str.contains("psy")]
    Kaggle_psy_random = Kaggle_psy.sample(n=40)
    
    Kaggle_german = Kaggle1[Kaggle1["genres"].str.contains("german")] # 
    Kaggle_german_random = Kaggle_german.sample(n=75)
    
    Kaggle_edm = Kaggle1[Kaggle1[0].str.contains("edm")]
    Kaggle_edm_random = Kaggle_edm.sample(n=35)
    
    Kaggle_romantic = Kaggle1[Kaggle1[0].str.contains("romantic")]
    Kaggle_romantic_random = Kaggle_romantic.sample(n=35)
    
    Kaggle_synth = Kaggle1[Kaggle1["genres"].str.contains("synth")] #
    Kaggle_synth_random = Kaggle_synth.sample(n=100)
    
    
    Kaggle_dataset = pd.concat([Kaggle_pop_random, Kaggle_techno_random, Kaggle_acid_random,
                                Kaggle_death_random, Kaggle_trance_random, Kaggle_punk_random, 
                                Kaggle_country_random, Kaggle_emo_random, Kaggle_chinese_random, 
                                Kaggle_psy_random, Kaggle_german_random, Kaggle_edm_random,
                               Kaggle_romantic_random, Kaggle_synth_random, Kaggle_chinese_random], axis=0)
     
    # Drop duplicates
    Kaggle_dataset_drop1 = Kaggle_dataset.drop_duplicates(keep="first")
    
    # Select relevant features
    Kaggle_dataset_modeling = Kaggle_dataset_drop1[["artists", "popularity", "tempo", "energy", "danceability", "loudness",
                                              "valence", "acousticness", 0]].copy()
    
    # Add "like" column for classification (modeling)
    Kaggle_dataset_modeling["like"] = 0
    
    # Rename columns
    Kaggle_dataset_modeling.rename(columns={"tempo":"BPM",
                                           0:"genre"}, inplace=True)
    
    # Concatenate both datasets
    dataset_modeling2 = pd.concat([All_songs_cleaned4, Kaggle_dataset_modeling])
    
    return dataset_modeling2

In [98]:
Dataset_modeling_recent = preparation_train()
print(len(Dataset_modeling_recent))
Dataset_modeling_recent.head()

3524


Unnamed: 0,artists,popularity,BPM,energy,danceability,loudness,valence,acousticness,genre,like
0,Patoranking,54.0,112.0,85.0,90.0,-3.0,63.0,4.0,Afro dancehall,1
1,Tekno,59.0,73.0,55.0,55.0,-7.0,66.0,31.0,Afro dancehall,1
2,Tekno,44.0,100.0,77.0,74.0,-5.0,82.0,38.0,Afro dancehall,1
3,Tekno,0.0,90.0,84.0,74.0,-4.0,97.0,28.0,Afro dancehall,1
4,Tekno,42.0,106.0,75.0,83.0,-3.0,79.0,3.0,Afro dancehall,1


In [99]:
Dataset_modeling_recent.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/dataset_modeling_recent.csv")

**4) Dataset for machine learning models** (test sets)

In [100]:
# filename = name of the playlist file (Check, Colors_AllShows, DiscoverWeekly_March1)

def prepare_test(filename):
    
    # Load datasets
    dataset_Silkey = pd.read_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/Silkey_MySpotify/PlaylistsForTesting/" + filename + ".csv")
    dataset_SortYourMusic = pd.read_excel("C:/Users/Agnieszka/Downloads/Datasets/Music/SortYourMusic/PlaylistsForTesting/" + filename + ".xlsx")
    
    # Merge both datasets
    dataset_Silkey.reset_index(inplace=True)
    dataset_SortYourMusic.reset_index(inplace=True)
    
    Test_songs = dataset_Silkey.merge(dataset_SortYourMusic, on="index")
    
    # Select relevant features
    Test_songs_cleaned1 = Test_songs[["artistName", "trackName", "trackPopularity", "BPM", "Energy", "Dance", "Loud",
                                    "Valence", "Acoustic","artistGenres","isLocal"]]
    
    # Remove duplicates
    Test_songs_cleaned2 = Test_songs_cleaned1.drop_duplicates(subset=["artistName", "trackName"], keep="first")
    
    # Drop NaN
    # Remove all local data -> there is no information about this data that can be used for further analysis
    Test_songs_cleaned3 = Test_songs_cleaned2[Test_songs_cleaned2["isLocal"]=="stream"]
    
    
    # Split Genre column 
    Test_songs_cleaned3_Genres = Test_songs_cleaned3["artistGenres"].str.split(",", expand=True)
    # Leave only first column
    Test_songs_cleaned3_Genres_reduced = Test_songs_cleaned3_Genres.iloc[:,0]
    Test_songs_cleaned3_Genres_reduced1 = pd.DataFrame(Test_songs_cleaned3_Genres_reduced)
    
    # Rename column
    Test_songs_cleaned3_Genres_reduced1.rename(columns={0:"Genre1"}, inplace=True)
    
    # Merge "Genre"-DataFrame with the initial dataset
    Test_songs_cleaned3_Genres_reduced1.reset_index(inplace=True)
    Test_songs_cleaned3.reset_index(inplace=True)
    Test_songs_cleaned4 = Test_songs_cleaned3.merge(Test_songs_cleaned3_Genres_reduced1, on="index")
    
    #Drop columns
    Test_songs_cleaned4.drop(columns=["isLocal", "artistGenres", "trackName", "index"], inplace=True)
    
    # Rename columns
    Test_songs_cleaned4.rename(columns={"artistName": "artists",
                                          "trackPopularity": "popularity",
                                          "Energy":"energy",
                                          "Dance":"danceability",
                                          "Loud":"loudness",
                                          "Valence":"valence",
                                          "Acoustic": "acousticness",
                                          "Genre1": "genre"}, inplace=True)
    
    return Test_songs_cleaned4
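
Both `preparation_train` and `prepare_test` join the two exports on the row position (the "index" column created by `reset_index`). This is only correct when both files list the same tracks in the same order. A guard for that assumption could look like the sketch below; the helper `merge_by_position` and the miniature frames are hypothetical, and "Title" is taken to be the track title column of the Sort Your Music export.

```python
import pandas as pd

def merge_by_position(left, right):
    """Join two exports row by row after checking that they have the same length."""
    if len(left) != len(right):
        raise ValueError(f"Row counts differ: {len(left)} vs {len(right)}")
    left = left.reset_index(drop=True).reset_index()
    right = right.reset_index(drop=True).reset_index()
    # validate="one_to_one" makes pandas raise if an index value repeats on either side
    return left.merge(right, on="index", validate="one_to_one")

# Hypothetical miniature exports from the two sources
skiley = pd.DataFrame({"trackName": ["Pana", "Duro"], "artistName": ["Tekno", "Tekno"]})
sort_your_music = pd.DataFrame({"Title": ["Pana", "Duro"], "BPM": [73, 90]})

merged = merge_by_position(skiley, sort_your_music)

# Spot-check that the titles from both sources agree row by row
titles_match = (merged["trackName"] == merged["Title"]).all()
```

If `titles_match` is false, the two files are not aligned and the audio features would be attached to the wrong tracks.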

a) My "Check" playlist

In [101]:
Test_set_Check = prepare_test("Check")
Test_set_Check.head()

Unnamed: 0,artists,popularity,BPM,energy,danceability,loudness,valence,acousticness,genre
0,The Howlin',0,144,61,34,-6,44,5,
1,Moodymann,39,121,81,78,-10,74,0,Deep house
2,London Residents,0,118,61,69,-10,12,92,
3,Sheitan Brothers,30,120,84,86,-6,37,2,
4,Bantwanas,31,120,63,82,-9,6,0,Afro house


In [102]:
Test_set_Check.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49 entries, 0 to 48
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   artists       49 non-null     object
 1   popularity    49 non-null     int64 
 2   BPM           49 non-null     int64 
 3   energy        49 non-null     int64 
 4   danceability  49 non-null     int64 
 5   loudness      49 non-null     int64 
 6   valence       49 non-null     int64 
 7   acousticness  49 non-null     int64 
 8   genre         32 non-null     object
dtypes: int64(7), object(2)
memory usage: 3.8+ KB


In [103]:
Test_set_Check.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/testset_check.csv")

b) My "Discover Weekly" playlist

In [104]:
Test_set_Weekly = prepare_test("DiscoverWeekly_March1")
Test_set_Weekly.head()

Unnamed: 0,artists,popularity,BPM,energy,danceability,loudness,valence,acousticness,genre
0,Sun-El Musician,48,105,62,74,-10,56,34,Afro house
1,Honey Dijon,47,95,74,73,-7,85,2,Deep house
2,Bucie,39,126,64,71,-8,40,11,Afro house
3,Kondi Band,43,105,43,85,-11,59,71,African electronic
4,Girls of the Internet,49,122,33,87,-15,61,12,Uk house


In [105]:
Test_set_Weekly.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 29
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   artists       30 non-null     object
 1   popularity    30 non-null     int64 
 2   BPM           30 non-null     int64 
 3   energy        30 non-null     int64 
 4   danceability  30 non-null     int64 
 5   loudness      30 non-null     int64 
 6   valence       30 non-null     int64 
 7   acousticness  30 non-null     int64 
 8   genre         21 non-null     object
dtypes: int64(7), object(2)
memory usage: 2.3+ KB


In [106]:
Test_set_Weekly.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/testset_weekly.csv")

c) "Colors - All shows" playlist (from Colors Studios)

In [107]:
Test_set_Colors = prepare_test("Colors_AllShows")
Test_set_Colors.head()

Unnamed: 0,artists,popularity,BPM,energy,danceability,loudness,valence,acousticness,genre
0,Eddy de Pretto,45,95,39,68,-11,23,50,French indie pop
1,Mustafa,27,171,24,41,-15,53,89,Chill pop
2,Nenny,41,120,37,66,-11,23,84,Hip hop tuga
3,Nathy Peluso,53,100,65,76,-10,83,34,R&b en espanol
4,Poté,36,119,44,69,-12,34,22,


In [108]:
Test_set_Colors.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 216 entries, 0 to 215
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   artists       216 non-null    object
 1   popularity    216 non-null    int64 
 2   BPM           216 non-null    int64 
 3   energy        216 non-null    int64 
 4   danceability  216 non-null    int64 
 5   loudness      216 non-null    int64 
 6   valence       216 non-null    int64 
 7   acousticness  216 non-null    int64 
 8   genre         189 non-null    object
dtypes: int64(7), object(2)
memory usage: 16.9+ KB


In [109]:
Test_set_Colors.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/testset_colors.csv")

**5) Datasets for EDA**

In [110]:
# EDA only for selected playlists

In [111]:
def prepare_EDA(filename):
    
    # Load datasets
    dataset_Silkey = pd.read_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/Silkey_MySpotify/" + filename + ".csv")
    dataset_SortYourMusic = pd.read_excel("C:/Users/Agnieszka/Downloads/Datasets/Music/SortYourMusic/" + filename + ".xlsx")
    
    # Merge both datasets
    dataset_Silkey.reset_index(inplace=True)
    dataset_SortYourMusic.reset_index(inplace=True)
    
    EDA_songs = dataset_Silkey.merge(dataset_SortYourMusic, on="index")
    
    # Drop columns that are not needed for the analysis
    EDA_songs_cleaned1 = EDA_songs.drop(columns=["albumName", 
                                    "trackNumber",
                                    "addedBy",
                                    "albumArtistsNames",
                                    "trackUrl",
                                    "artistUrl",
                                    "albumUrl",
                                    "trackIsrc",
                                    "albumUpc",
                                    "albumType",
                                    "albumPopularity",
                                    'Title', 
                                    'Artist', 
                                    'Release', 
                                    "Length", 
                                    "Pop.",
                                    "RND",
                                    '#',          
                                    ])
    
    # Remove duplicates
    #EDA_songs_cleaned2 = EDA_songs_cleaned1.drop_duplicates(subset=["artistName", "trackName"], keep="first")
    
    # Drop NaN
    # Remove all local data -> there is no information about this data that can be used for further analysis
    EDA_songs_cleaned2 = EDA_songs_cleaned1[EDA_songs_cleaned1["isLocal"]=="stream"]
    
    
    # Split Genre column 
    EDA_songs_cleaned2_Genres = EDA_songs_cleaned2["artistGenres"].str.split(",", expand=True)
    # Keep only the first five genre columns
    EDA_songs_cleaned2_Genres_reduced = EDA_songs_cleaned2_Genres.iloc[:,0:5]
    # Rename columns
    EDA_songs_cleaned2_Genres_reduced.rename(columns={0:"Genre1",
                                                    1:"Genre2",
                                                    2:"Genre3",
                                                    3:"Genre4",
                                                    4:"Genre5"}, inplace=True)
    
    # Merge "Genre"-DataFrame with the initial dataset
    EDA_songs_cleaned2_Genres_reduced.reset_index(inplace=True)
    EDA_songs_cleaned3 = EDA_songs_cleaned2.merge(EDA_songs_cleaned2_Genres_reduced, on="index")
    
    
    # Drop helper columns that are no longer needed
    EDA_songs_cleaned3.drop(columns=["isLocal", "artistGenres","index"], inplace=True)
    
    # Convert date columns to datetime; addedAt is converted to UTC and made timezone-naive
    EDA_songs_cleaned3["addedAt"] = pd.to_datetime(EDA_songs_cleaned3["addedAt"]).dt.tz_convert(None)
    EDA_songs_cleaned3["albumReleaseDate"] = pd.to_datetime(EDA_songs_cleaned3["albumReleaseDate"], yearfirst=True)
    #EDA_songs_cleaned3["trackDuration"] = pd.to_datetime(EDA_songs_cleaned3["trackDuration"])
    
    return EDA_songs_cleaned3
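
Because `prepare_EDA` joins the two exports purely by row position, a shifted or reordered row in either file would silently pair the wrong audio features with a track. The sketch below is one possible sanity check to run before the merge. It compares `trackName` with `Title`, the two title columns named in the function above. The helper name `find_misaligned_rows` and the sample frames are hypothetical, and the two tools may format titles differently, so any reported positions would still need a manual look.

```python
import pandas as pd


def find_misaligned_rows(left, right, left_col="trackName", right_col="Title"):
    """Return row positions where the two exports disagree on the track title."""
    n = min(len(left), len(right))
    # Normalise case and surrounding whitespace before comparing
    a = left[left_col].iloc[:n].astype(str).str.strip().str.lower().reset_index(drop=True)
    b = right[right_col].iloc[:n].astype(str).str.strip().str.lower().reset_index(drop=True)
    return a.index[(a != b).to_numpy()].tolist()


# Hypothetical miniature exports: rows 1 and 2 are swapped in the second file
skiley = pd.DataFrame({"trackName": ["Shango", "Pana", "Duro"]})
sort_your_music = pd.DataFrame({"Title": ["Shango", "Duro", "Pana"]})

mismatches = find_misaligned_rows(skiley, sort_your_music)
print(mismatches)
```

An empty list suggests the positional merge is safe for that playlist.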

a) AfroHouse

In [112]:
AfroHouse = prepare_EDA("AfroHouse")
AfroHouse.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(


Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,Fadjamou - TWOTEK Remix,Oumou Sangaré,04:54,2020-02-21 15:48:03,33,55.0,Tony Allen & TWOTEK & LiTek & Two4Kay,72800.0,NO FORMAT,2018-03-02,...,59.0,84.0,-9.0,46.0,5.0,Afropop,Malian blues,Mande pop,Wassoulou,World
1,Sondela feat. Xolisa,David Mayer,09:16,2020-05-08 20:45:25,34,31.0,Floyd Lavine,7751.0,Connected,2018-05-11,...,68.0,77.0,-8.0,14.0,1.0,,,,,
2,Shango,Waajeed,05:38,2020-05-08 20:46:03,24,29.0,,6848.0,Dirt Tech Reck,2017-07-07,...,62.0,85.0,-11.0,60.0,6.0,Detroit house,Experimental house,,,
3,Sittin' Here - Boddhi Satva Ancestral Soul Remix,St Germain,06:27,2020-05-08 20:46:16,20,59.0,Boddhi Satva,213537.0,Parlophone (France),2016-03-11,...,72.0,79.0,-8.0,52.0,13.0,Downtempo,Electronica,Jazztronica,Nu jazz,
4,"99,9 - Alix Alvarez Sole Channel Remix",Minimoogli,06:22,2020-05-08 20:49:27,0,5.0,,126.0,Tubetracks,2008-05-24,...,86.0,80.0,-9.0,66.0,0.0,,,,,
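
The warning printed with this output is pandas' `SettingWithCopyWarning`. It most likely comes from the genre step in `prepare_EDA`, where a frame obtained with `.iloc[:, 0:5]` is renamed with `inplace=True`. A minimal sketch of one way to avoid it is to take an explicit copy of the selection and rename without `inplace`. The sample data below is made up for illustration.

```python
import pandas as pd

# Made-up sample in the same comma-separated layout as artistGenres
songs = pd.DataFrame({"artistGenres": ["Afropop,Malian blues,World", "Dub", None]})

genres = songs["artistGenres"].str.split(",", expand=True)

# .copy() makes the selection an independent DataFrame, so later changes
# are not applied to a slice of `genres`
genres_reduced = genres.iloc[:, 0:5].copy()

# The split produces integer column labels 0, 1, 2, ...; map them to Genre1, Genre2, ...
genres_reduced = genres_reduced.rename(columns=lambda i: f"Genre{i + 1}")

print(genres_reduced.columns.tolist())
```

Note that a playlist whose artists never have five genres would end up with fewer than five `Genre` columns.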


In [113]:
AfroHouse.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 317 entries, 0 to 316
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              317 non-null    object        
 1   artistName             317 non-null    object        
 2   trackDuration          317 non-null    object        
 3   addedAt                317 non-null    datetime64[ns]
 4   trackPopularity        317 non-null    int64         
 5   artistPopularity       317 non-null    float64       
 6   secondaryArtistsNames  197 non-null    object        
 7   artistFollowers        317 non-null    float64       
 8   albumRecordLabel       317 non-null    object        
 9   albumReleaseDate       317 non-null    datetime64[ns]
 10  BPM                    317 non-null    float64       
 11  Energy                 317 non-null    float64       
 12  Dance                  317 non-null    float64       
 13  Loud 
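
`trackDuration` is still stored as `object` (strings such as `04:54`), since the conversion in `prepare_EDA` is commented out. A possible way to make it usable for numeric analysis is to convert it to seconds, sketched below. This assumes every value has the form `mm:ss`; tracks of an hour or longer may be exported in a different layout and would need separate handling.

```python
import pandas as pd

# Sample values taken from the AfroHouse and Jazz outputs in this notebook
durations = pd.Series(["04:54", "09:16", "10:57"], name="trackDuration")

# Assumption: all values are "mm:ss"; prefixing "00:" yields "hh:mm:ss"
as_timedelta = pd.to_timedelta("00:" + durations)

# Whole seconds are easier to plot and aggregate than timedeltas
duration_seconds = as_timedelta.dt.total_seconds().astype(int)
print(duration_seconds.tolist())
```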

In [114]:
AfroHouse.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_AfroHouse.csv")
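
By default `to_csv` also writes the row index as an extra, unnamed first column, and the datetime columns come back as plain strings when the file is read again. The sketch below shows one way to get a clean round trip with `index=False` and `parse_dates`. It writes to an in-memory buffer instead of the notebook's paths, and the single sample row is copied from the AfroHouse output.

```python
import io

import pandas as pd

frame = pd.DataFrame({
    "trackName": ["Shango"],
    "addedAt": pd.to_datetime(["2020-05-08 20:46:03"]),
    "albumReleaseDate": pd.to_datetime(["2017-07-07"]),
})

# index=False keeps the row index out of the file
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when loading the file again
restored = pd.read_csv(buffer, parse_dates=["addedAt", "albumReleaseDate"])
print(restored.dtypes)
```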

b) Hip Hop

In [115]:
HipHop = prepare_EDA("HipHop")
HipHop.head()

Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,Stylin (feat. Saukrates),Shad,03:52,2018-08-19 20:31:57,0,39.0,Saukrates,30775.0,Black Box Recordings Inc.,2013-10-15,...,88.0,52.0,-5.0,56.0,8.0,Alternative hip hop,Canadian hip hop,Canadian old school hip hop,Conscious hip hop,
1,Leave Me Alone,Hugo Kant,03:27,2018-08-19 20:38:01,12,47.0,Lostpoet,37581.0,Bellring,2013-01-01,...,54.0,64.0,-7.0,58.0,2.0,,,,,
2,Tamagotchi,TACONAFIDE,03:21,2018-08-19 20:38:20,52,60.0,Quebonafide & Taco Hemingway,617084.0,Step Dystrybucja,2018-04-13,...,73.0,76.0,-6.0,39.0,33.0,Polish hip hop,Polish pop,Polish trap,,
3,Sound of da Police,KRS-One,04:18,2018-08-19 20:45:11,62,59.0,,442929.0,Jive,1993-09-03,...,95.0,79.0,-9.0,32.0,12.0,Alternative hip hop,Bronx hip hop,Conscious hip hop,East coast hip hop,Gangster rap
4,Guess Who's Back,Rakim,04:10,2018-08-19 20:45:21,0,61.0,,560940.0,Universal Records,1996-11-04,...,80.0,79.0,-7.0,69.0,2.0,East coast hip hop,Gangster rap,Hardcore hip hop,Hip hop,Rap


In [116]:
HipHop.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 205 entries, 0 to 204
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              205 non-null    object        
 1   artistName             205 non-null    object        
 2   trackDuration          205 non-null    object        
 3   addedAt                205 non-null    datetime64[ns]
 4   trackPopularity        205 non-null    int64         
 5   artistPopularity       205 non-null    float64       
 6   secondaryArtistsNames  92 non-null     object        
 7   artistFollowers        205 non-null    float64       
 8   albumRecordLabel       205 non-null    object        
 9   albumReleaseDate       205 non-null    datetime64[ns]
 10  BPM                    205 non-null    float64       
 11  Energy                 205 non-null    float64       
 12  Dance                  205 non-null    float64       
 13  Loud 

In [117]:
HipHop.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_HipHop.csv")
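
The prepare, inspect and export cells follow the same pattern for every playlist, so they could also be driven by a loop. The sketch below is one possible arrangement, not the notebook's actual workflow: `export_playlists` is a hypothetical helper that receives the preparation function as an argument, and `fake_prepare` stands in for `prepare_EDA` so the example runs without the original files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_playlists(names, prepare, out_dir):
    """Run `prepare` for each playlist and write EDA_<label>.csv into out_dir."""
    out_dir = Path(out_dir)
    frames = {}
    for label, filename in names.items():
        frame = prepare(filename)
        frame.to_csv(out_dir / f"EDA_{label}.csv")
        frames[label] = frame
    return frames


def fake_prepare(filename):
    # Stand-in for prepare_EDA, used only to keep the sketch self-contained
    return pd.DataFrame({"trackName": [f"track from {filename}"]})


# Keys are the labels used in the output file names, values the source file names
playlists = {"AfroHouse": "AfroHouse", "Kuduro": "KuduroNaija"}

with tempfile.TemporaryDirectory() as tmp:
    frames = export_playlists(playlists, fake_prepare, tmp)
    written = sorted(p.name for p in Path(tmp).iterdir())

print(written)
```

In the notebook, `prepare_EDA` and the `MyDatasets` folder would take the place of `fake_prepare` and the temporary directory.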

c) Hip Hop Oldschool

In [118]:
HipHopOldschool = prepare_EDA("HipHopOldschool")
HipHopOldschool.head()

Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,Candy Shop,50 Cent,03:28,2018-08-19 20:45:44,0,86.0,Olivia,6874358.0,Interscope,2005-03-03,...,58.0,61.0,-8.0,75.0,2.0,East coast hip hop,Gangster rap,Hip hop,Pop rap,Queens hip hop
1,Lighters Up - Nationwide Version,Lil' Kim,04:22,2018-08-29 20:58:56,40,70.0,,1081700.0,Atlantic Records,2005-09-26,...,77.0,62.0,-6.0,71.0,0.0,Dance pop,East coast hip hop,Gangster rap,Hip hop,Hip pop
2,Let Me Blow Ya Mind,Eve,03:50,2018-08-29 20:59:30,71,70.0,Gwen Stefani,832767.0,Ruff Ryders RECORDS,2001-01-01,...,56.0,91.0,-4.0,90.0,24.0,Dance pop,Hip hop,Hip pop,Philly rap,Pop rap
3,Who's That Girl?,Eve,04:41,2018-08-29 20:59:40,62,70.0,,832767.0,Ruff Ryders RECORDS,2001-01-01,...,86.0,77.0,-4.0,55.0,2.0,Dance pop,Hip hop,Hip pop,Philly rap,Pop rap
4,212,Azealia Banks,03:24,2018-08-29 21:01:06,11,66.0,Lazy Jay,679355.0,Prospect Park,2014-11-07,...,77.0,85.0,-6.0,63.0,1.0,Bmore,Dance pop,Escape room,Hip house,Hip pop


In [119]:
HipHopOldschool.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 106 entries, 0 to 105
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              106 non-null    object        
 1   artistName             106 non-null    object        
 2   trackDuration          106 non-null    object        
 3   addedAt                106 non-null    datetime64[ns]
 4   trackPopularity        106 non-null    int64         
 5   artistPopularity       106 non-null    float64       
 6   secondaryArtistsNames  34 non-null     object        
 7   artistFollowers        106 non-null    float64       
 8   albumRecordLabel       106 non-null    object        
 9   albumReleaseDate       106 non-null    datetime64[ns]
 10  BPM                    106 non-null    float64       
 11  Energy                 106 non-null    float64       
 12  Dance                  106 non-null    float64       
 13  Loud 

In [120]:
HipHopOldschool.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_HipHopOldschool.csv")

d) Afrobeats

In [121]:
Afrobeats = prepare_EDA("Afrobeats")
Afrobeats.head()

Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,"My Woman, My Everything (feat. Wandecoal)",Patoranking,03:53,2018-08-19 20:44:37,54,61.0,Wande Coal,149884.0,Greensleeves Records,2015-07-06,...,85.0,90.0,-3.0,63.0,4.0,Afro dancehall,Afropop,Azontobeats,Gqom,Nigerian pop
1,Pana,Tekno,04:02,2018-08-19 20:51:51,59,62.0,,175849.0,Columbia,2016-12-20,...,55.0,55.0,-7.0,66.0,31.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
2,Jogodo,Tekno,04:23,2018-08-19 20:51:54,44,62.0,,175849.0,MADE MEN MUSIC GROUP,2018-04-19,...,77.0,74.0,-5.0,82.0,38.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
3,Duro,Tekno,03:32,2018-08-19 20:52:04,0,62.0,,175849.0,Made Men Music Group,2015-06-22,...,84.0,74.0,-4.0,97.0,28.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
4,Samantha,Tekno,03:49,2018-08-19 20:52:24,42,62.0,,175849.0,MADE MEN MUSIC GROUP,2017-06-09,...,75.0,83.0,-3.0,79.0,3.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop


In [122]:
Afrobeats.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 124 entries, 0 to 123
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              124 non-null    object        
 1   artistName             124 non-null    object        
 2   trackDuration          124 non-null    object        
 3   addedAt                124 non-null    datetime64[ns]
 4   trackPopularity        124 non-null    int64         
 5   artistPopularity       124 non-null    float64       
 6   secondaryArtistsNames  59 non-null     object        
 7   artistFollowers        124 non-null    float64       
 8   albumRecordLabel       124 non-null    object        
 9   albumReleaseDate       124 non-null    datetime64[ns]
 10  BPM                    124 non-null    float64       
 11  Energy                 124 non-null    float64       
 12  Dance                  124 non-null    float64       
 13  Loud 

In [123]:
Afrobeats.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_Afrobeats.csv")

e) Afrobeat & African Classics

In [124]:
Afrobeat_AfricanClassics = prepare_EDA("Afrobeat_AfricanClassics")
Afrobeat_AfricanClassics.head()

Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,Struggle Sounds,Seun Kuti,07:05,2020-10-24 18:07:53,17,41.0,Egypt 80,26690.0,Strut,2018-03-02,...,94.0,69.0,-7.0,88.0,7.0,Afrobeat,Afropop,,,
1,Black Times - Radio Edit,Seun Kuti,04:55,2020-10-24 18:09:18,37,41.0,Egypt 80 & Carlos Santana,26690.0,Strut,2017-12-08,...,77.0,61.0,-7.0,93.0,30.0,Afrobeat,Afropop,,,
2,Ene Nyame'A'Mensuro,Pat Thomas,06:16,2020-10-24 18:10:50,37,44.0,Ebo Taylor,6793.0,Strut,1977-01-01,...,79.0,69.0,-6.0,90.0,67.0,Afrobeat,Highlife,,,
3,Fungi Mama (Bebopafunkadiscolypso),Tom Browne,05:55,2020-10-24 18:11:51,3,46.0,,40083.0,Arista/Legacy,2017-10-27,...,83.0,77.0,-9.0,67.0,24.0,Disco,Funk,Jazz funk,Post-disco,Quiet storm
4,Yamona,Pat Thomas,05:00,2020-10-24 18:12:16,10,44.0,Kwashibu Area Band,6793.0,Strut,2019-10-04,...,75.0,83.0,-8.0,95.0,37.0,Afrobeat,Highlife,,,


In [125]:
Afrobeat_AfricanClassics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 36 entries, 0 to 35
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              36 non-null     object        
 1   artistName             36 non-null     object        
 2   trackDuration          36 non-null     object        
 3   addedAt                36 non-null     datetime64[ns]
 4   trackPopularity        36 non-null     int64         
 5   artistPopularity       36 non-null     float64       
 6   secondaryArtistsNames  14 non-null     object        
 7   artistFollowers        36 non-null     float64       
 8   albumRecordLabel       36 non-null     object        
 9   albumReleaseDate       36 non-null     datetime64[ns]
 10  BPM                    36 non-null     float64       
 11  Energy                 36 non-null     float64       
 12  Dance                  36 non-null     float64       
 13  Loud   

In [126]:
Afrobeat_AfricanClassics.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_Afrobeat_AfricanClassics.csv")

f) Kuduro & Naija

In [127]:
Kuduro = prepare_EDA("KuduroNaija")
Kuduro.head()

Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,Body Touch,Afro Pupo,06:48,2020-10-24 17:19:18,1,7,,122,Seres Producoes,2016-07-29,...,77,54,-6,16,1,,,,,
1,Mwana Pwo,Afro Pupo,07:24,2020-10-24 17:21:03,3,7,Homeboyz,122,Kazukuta Records,2017-05-05,...,61,61,-9,68,2,,,,,
2,Third Place,Afrokillerz,06:22,2020-10-24 17:34:49,4,25,,5506,Offering Recordings,2018-03-23,...,45,80,-16,23,0,Kuduro,,,,
3,Xe Mana Bella - Main Mix,Boddhi Satva,05:04,2020-10-24 17:54:00,24,38,DJ Satelite & Fredy Massamba,11921,Offering Recordings,2016-08-19,...,58,86,-11,28,0,Afro house,,,,
4,Mama Na Bheto,DJ Satelite,05:40,2020-10-24 17:55:02,6,21,Dorivaldo Mix & Bamfumu,765,Seres Producoes,2015-09-21,...,72,57,-7,36,4,South african house,,,,


In [128]:
Kuduro.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26 entries, 0 to 25
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              26 non-null     object        
 1   artistName             26 non-null     object        
 2   trackDuration          26 non-null     object        
 3   addedAt                26 non-null     datetime64[ns]
 4   trackPopularity        26 non-null     int64         
 5   artistPopularity       26 non-null     int64         
 6   secondaryArtistsNames  14 non-null     object        
 7   artistFollowers        26 non-null     int64         
 8   albumRecordLabel       26 non-null     object        
 9   albumReleaseDate       26 non-null     datetime64[ns]
 10  BPM                    26 non-null     int64         
 11  Energy                 26 non-null     int64         
 12  Dance                  26 non-null     int64         
 13  Loud   
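
In this playlist `artistPopularity`, `artistFollowers` and the audio features are `int64`, whereas the earlier playlists show them as `float64`. A plausible cause, stated here as an assumption, is that the other raw exports contained rows with missing values (for example local files), which makes pandas read those columns as floats. If the playlists are later combined, the columns could be cast to the nullable `Int64` dtype so that every file shares one schema. The sketch uses the first two rows of the AfroHouse and Kuduro outputs.

```python
import pandas as pd

# First two rows of the AfroHouse (float64) and Kuduro (int64) outputs
afro_house = pd.DataFrame({"artistPopularity": [55.0, 31.0], "Energy": [59.0, 68.0]})
kuduro = pd.DataFrame({"artistPopularity": [7, 7], "Energy": [77, 61]})

numeric_cols = ["artistPopularity", "Energy"]
frames = {"AfroHouse": afro_house, "Kuduro": kuduro}

# Cast to nullable Int64 (works for whole-number floats and keeps missing values),
# then tag each row with its playlist before stacking
aligned = [
    frame.astype({col: "Int64" for col in numeric_cols}).assign(playlist=name)
    for name, frame in frames.items()
]
combined = pd.concat(aligned, ignore_index=True)
print(combined.dtypes)
```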

In [129]:
Kuduro.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_Kuduro.csv")

g) SA Sounds

In [130]:
SA_sounds = prepare_EDA("SAsounds_Amapiano_Gqom")
SA_sounds.head()

Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,Get Over It,Bucie,06:59,2020-05-08 21:48:47,45,46,,50621,SubSensual,2016-05-23,...,56,77,-11,71,3,Afro house,Deep house,South african house,,
1,It's You (feat. Miss P),Black Motion,08:28,2020-05-08 21:53:12,40,49,Miss P,81950,Defected Records,2017-02-03,...,39,57,-11,25,11,Afro house,Afro soul,Deep house,South african alternative,South african house
2,Lalela (feat. Celimpilo),Black Motion,06:18,2020-05-08 21:54:05,11,49,Celimpilo,81950,Sound African Recordings,2016-11-04,...,65,45,-8,27,0,Afro house,Afro soul,Deep house,South african alternative,South african house
3,Voice of a Thousand Africans,DJ Qness,07:19,2020-05-08 21:55:59,1,31,,2314,Shelter Music,2011-07-08,...,86,68,-3,15,0,Afro house,South african house,,,
4,Heartless Intentions (feat. Dr Moruti),Black Motion,05:46,2020-05-08 21:56:44,21,49,Dr Moruti,81950,Sound African Recordings,2016-11-04,...,74,81,-8,70,3,Afro house,Afro soul,Deep house,South african alternative,South african house


In [131]:
SA_sounds.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 97 entries, 0 to 96
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              97 non-null     object        
 1   artistName             97 non-null     object        
 2   trackDuration          97 non-null     object        
 3   addedAt                97 non-null     datetime64[ns]
 4   trackPopularity        97 non-null     int64         
 5   artistPopularity       97 non-null     int64         
 6   secondaryArtistsNames  85 non-null     object        
 7   artistFollowers        97 non-null     int64         
 8   albumRecordLabel       97 non-null     object        
 9   albumReleaseDate       97 non-null     datetime64[ns]
 10  BPM                    97 non-null     int64         
 11  Energy                 97 non-null     int64         
 12  Dance                  97 non-null     int64         
 13  Loud   

In [132]:
SA_sounds.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_SA_sounds.csv")

h) Roots Reggae

In [133]:
Roots = prepare_EDA("RootsReggae")
Roots.head()

Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,Ice Cream Love,Johnny Osbourne,03:23,2018-08-20 20:53:11,41,46.0,,38140.0,Greensleeves Records,1980-01-01,...,69.0,74.0,-7.0,78.0,21.0,Dub,Lovers rock,Old school dancehall,Reggae,Rock steady
1,No Lollipop No Sweet So,Johnny Osbourne,03:00,2018-08-20 20:53:43,26,46.0,,38140.0,Greensleeves Records,1980-01-01,...,54.0,81.0,-7.0,94.0,15.0,Dub,Lovers rock,Old school dancehall,Reggae,Rock steady
2,Budy Bye,Johnny Osbourne,04:04,2018-08-20 20:55:01,31,46.0,,38140.0,VP Records,2005-01-01,...,37.0,71.0,-15.0,87.0,23.0,Dub,Lovers rock,Old school dancehall,Reggae,Rock steady
3,Folly Ranking Remaster,Johnny Osbourne,03:29,2018-08-20 20:56:16,0,46.0,,38140.0,Hammer Musik,2018-05-18,...,35.0,87.0,-18.0,87.0,0.0,Dub,Lovers rock,Old school dancehall,Reggae,Rock steady
4,Mr Marshal Remaster,Johnny Osbourne,03:35,2018-08-20 20:56:49,0,46.0,,38140.0,Hammer Musik,2016-12-09,...,33.0,82.0,-21.0,90.0,0.0,Dub,Lovers rock,Old school dancehall,Reggae,Rock steady


In [134]:
Roots.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 167 entries, 0 to 166
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              167 non-null    object        
 1   artistName             167 non-null    object        
 2   trackDuration          167 non-null    object        
 3   addedAt                167 non-null    datetime64[ns]
 4   trackPopularity        167 non-null    int64         
 5   artistPopularity       167 non-null    float64       
 6   secondaryArtistsNames  10 non-null     object        
 7   artistFollowers        167 non-null    float64       
 8   albumRecordLabel       167 non-null    object        
 9   albumReleaseDate       167 non-null    datetime64[ns]
 10  BPM                    167 non-null    float64       
 11  Energy                 167 non-null    float64       
 12  Dance                  167 non-null    float64       
 13  Loud 

In [135]:
Roots.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_Roots.csv")

i) Modern Roots

In [136]:
ModernRoots = prepare_EDA("ModernRoots")
ModernRoots.head()

Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,Same War,Randy Valentine,04:18,2018-08-19 20:32:04,14,42.0,,8264.0,Troots Music,2018-06-01,...,56.0,68.0,-11.0,88.0,11.0,Modern reggae,,,,
1,Chase the Devil,"Lee ""Scratch"" Perry",03:19,2018-08-19 20:41:26,0,55.0,Subatomic Sound System & Blakkamoore & Screech...,127541.0,Echo Beach,2017-09-15,...,70.0,70.0,-6.0,81.0,23.0,Dub,Reggae,Rock steady,Roots reggae,Ska
2,Coconut Jelly Man,Shanique Marie,02:16,2018-08-19 20:42:05,31,34.0,Cali P,1586.0,Equiknoxx Music,2016-07-01,...,51.0,76.0,-9.0,55.0,4.0,,,,,
3,We No Worry 'bout Them,Konshens,03:57,2018-08-19 21:14:27,33,65.0,Romain Virgo,255912.0,Maximum Sound,2013-01-01,...,82.0,69.0,-6.0,76.0,3.0,Dancehall,Modern reggae,Rap kreyol,,
4,Fade Away (feat. Agent Sasco),Romain Virgo,04:52,2018-08-19 21:38:56,33,49.0,Agent Sasco (Assassin),83966.0,VP Records,2015-10-30,...,79.0,59.0,-3.0,68.0,8.0,Lovers rock,Modern reggae,Riddim,Soca,


In [137]:
ModernRoots.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90 entries, 0 to 89
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              90 non-null     object        
 1   artistName             90 non-null     object        
 2   trackDuration          90 non-null     object        
 3   addedAt                90 non-null     datetime64[ns]
 4   trackPopularity        90 non-null     int64         
 5   artistPopularity       90 non-null     float64       
 6   secondaryArtistsNames  30 non-null     object        
 7   artistFollowers        90 non-null     float64       
 8   albumRecordLabel       90 non-null     object        
 9   albumReleaseDate       90 non-null     datetime64[ns]
 10  BPM                    90 non-null     float64       
 11  Energy                 90 non-null     float64       
 12  Dance                  90 non-null     float64       
 13  Loud   

In [138]:
ModernRoots.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_ModernRoots.csv")

j) Dancehall

In [139]:
Dancehall = prepare_EDA("Dancehall")
Dancehall.head()

Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,One Dance,Drake,02:53,2018-08-19 20:44:30,2,98.0,WizKid & Kyla,53137907.0,Universal Music Group,2016-05-06,...,62.0,79.0,-6.0,37.0,1.0,Canadian hip hop,Canadian pop,Hip hop,Pop rap,Rap
1,"Particula (feat. DJ Maphorisa, Nasty C, Ice Pr...",Major Lazer,03:24,2018-08-19 20:56:41,38,82.0,DJ Maphorisa & Nasty C & Ice Prince & Patorank...,6138871.0,Third Pardee Records,2017-11-17,...,74.0,68.0,-5.0,39.0,3.0,Dance pop,Edm,Electro house,Moombahton,Pop
2,Stay So,Busy Signal,03:41,2018-08-19 21:07:33,0,63.0,,289717.0,Warriors Musick Productions,2017-11-10,...,60.0,86.0,-4.0,81.0,11.0,Dancehall,Lovers rock,Reggae fusion,Soca,
3,Come Over [Missing You],Busy Signal,03:28,2018-08-19 21:07:50,49,63.0,,289717.0,VP Records,2012-04-09,...,65.0,65.0,-5.0,54.0,6.0,Dancehall,Lovers rock,Reggae fusion,Soca,
4,Missing You,Busy Signal,03:32,2018-08-19 21:08:06,43,63.0,,289717.0,21st Hapilos Compilations,2013-06-18,...,76.0,41.0,-4.0,68.0,24.0,Dancehall,Lovers rock,Reggae fusion,Soca,


In [140]:
Dancehall.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165 entries, 0 to 164
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              165 non-null    object        
 1   artistName             165 non-null    object        
 2   trackDuration          165 non-null    object        
 3   addedAt                165 non-null    datetime64[ns]
 4   trackPopularity        165 non-null    int64         
 5   artistPopularity       165 non-null    float64       
 6   secondaryArtistsNames  60 non-null     object        
 7   artistFollowers        165 non-null    float64       
 8   albumRecordLabel       165 non-null    object        
 9   albumReleaseDate       165 non-null    datetime64[ns]
 10  BPM                    165 non-null    float64       
 11  Energy                 165 non-null    float64       
 12  Dance                  165 non-null    float64       
 13  Loud 

In [141]:
Dancehall.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_Dancehall.csv")

k) Dub

In [142]:
Dub = prepare_EDA("Dub")
Dub.head()

Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,Nu Rizla,Adrian Sherwood,04:29,2018-08-19 20:39:01,0,32,,10770,Real World Records,2006-01-01,...,85,72,-6,86,22,Dub,Experimental dub,Uk reggae,,
1,Detonator,Dubmatix,04:16,2018-08-19 20:39:09,6,38,Anthony B,17793,Soulbeats Records,2014-11-17,...,84,75,-6,42,5,Dub,,,,
2,Sound System,Radikal Guru,04:36,2018-08-19 20:39:15,29,39,Cian Finn,21118,Moonshine Recordings,2015-02-23,...,67,74,-6,39,15,Dub,Dub reggae,Irish hip hop,Polish reggae,
3,Rub A Dub Del Barrio (feat. Junior),O.B.F,03:24,2018-08-19 20:39:19,12,48,Junior,27598,Dubquake Records,2017-09-15,...,51,76,-10,67,0,Dub reggae,French dub,,,
4,Décroche pong,Tetra Hydro K,04:28,2018-08-19 20:39:30,11,40,,17826,7Lakes,2011-11-09,...,76,58,-6,53,3,Dub reggae,Electro dub,French dub,,


In [143]:
Dub.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56 entries, 0 to 55
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              56 non-null     object        
 1   artistName             56 non-null     object        
 2   trackDuration          56 non-null     object        
 3   addedAt                56 non-null     datetime64[ns]
 4   trackPopularity        56 non-null     int64         
 5   artistPopularity       56 non-null     int64         
 6   secondaryArtistsNames  34 non-null     object        
 7   artistFollowers        56 non-null     int64         
 8   albumRecordLabel       56 non-null     object        
 9   albumReleaseDate       56 non-null     datetime64[ns]
 10  BPM                    56 non-null     int64         
 11  Energy                 56 non-null     int64         
 12  Dance                  56 non-null     int64         
 13  Loud   

In [144]:
Dub.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_Dub.csv")

l) Rock

In [145]:
Rock = prepare_EDA("Rock_Indie")
Rock.head()

Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,Jackie And Wilson,Hozier,03:43,2018-08-19 20:32:40,0,83,,4487679,Universal Music Group,2014-11-06,...,73,56,-4,78,6,Irish singer-songwriter,Modern rock,Pop,,
1,Amsterdam,Nothing But Thieves,04:32,2018-08-19 22:03:16,64,72,,695187,RCA Records Label,2017-09-08,...,93,51,-5,67,0,Alternative pop rock,Modern alternative rock,Modern rock,Rock,
2,Broken Machine,Nothing But Thieves,03:54,2018-08-19 22:03:34,52,72,,695187,RCA Records Label,2017-09-08,...,67,65,-6,78,6,Alternative pop rock,Modern alternative rock,Modern rock,Rock,
3,Bad Habit,The Kooks,03:41,2018-08-19 22:04:14,0,72,,2040822,Universal Music Group,2014-01-01,...,89,73,-4,86,5,Brighton indie,Modern rock,Rock,,
4,Way down We Go,KALEO,03:33,2018-08-19 22:05:05,78,74,,1164561,Elektra (NEK),2016-06-10,...,51,49,-8,34,58,Icelandic indie,Icelandic rock,Modern alternative rock,,


In [146]:
Rock.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 180 entries, 0 to 179
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              180 non-null    object        
 1   artistName             180 non-null    object        
 2   trackDuration          180 non-null    object        
 3   addedAt                180 non-null    datetime64[ns]
 4   trackPopularity        180 non-null    int64         
 5   artistPopularity       180 non-null    int64         
 6   secondaryArtistsNames  3 non-null      object        
 7   artistFollowers        180 non-null    int64         
 8   albumRecordLabel       180 non-null    object        
 9   albumReleaseDate       180 non-null    datetime64[ns]
 10  BPM                    180 non-null    int64         
 11  Energy                 180 non-null    int64         
 12  Dance                  180 non-null    int64         
 13  Loud 

In [147]:
Rock.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_Rock.csv")

m) Jazz (instrumental)

In [148]:
Jazz_inst = prepare_EDA("Jazz_instrumental")
Jazz_inst.head()

Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,"Sports, Not Heavy Crime",Xploding Plastix,05:07,2019-03-19 18:58:12,0,32,,19291,Beatservice Records,2002-05-28,...,84,57,-5,46,12,Jazztronica,Ninja,Nu jazz,Trip hop,
1,Svantetic,Krzysztof Komeda,03:35,2019-03-19 18:58:12,21,37,Leszek Możdżer,14651,ACT Music,2011-06-24,...,21,32,-17,30,98,Polish jazz,,,,
2,Sleep Safe and Warm,Krzysztof Komeda,06:26,2019-03-19 18:58:12,21,37,Leszek Możdżer,14651,ACT Music,2011-06-24,...,8,42,-24,13,98,Polish jazz,,,,
3,Ballad for Bernt,Krzysztof Komeda,04:19,2019-03-19 18:58:12,19,37,Leszek Możdżer,14651,ACT Music,2011-06-24,...,13,35,-24,10,98,Polish jazz,,,,
4,The Law and The Fist,Krzysztof Komeda,10:57,2019-03-19 18:58:12,32,37,Leszek Możdżer,14651,ACT Music,2011-06-24,...,7,36,-22,16,99,Polish jazz,,,,
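
The message printed before the table is pandas' `SettingWithCopyWarning`. Its `return super().rename(` line suggests it is triggered by an in-place `rename` on a frame that was produced by slicing another frame. `prepare_EDA` is defined earlier in the notebook, so the following is only a sketch of the usual remedy on a hypothetical frame: take an explicit `.copy()` of the slice before modifying it.

```python
import pandas as pd

# Hypothetical source frame; the column names are illustrative only.
all_tracks = pd.DataFrame(
    {
        "name": ["Svantetic", "Respect", "Think"],
        "playlist": ["Jazz", "Soul", "Soul"],
    }
)

# An explicit copy makes the subset an independent frame, so later
# in-place changes cannot be mistaken for writes to all_tracks.
soul = all_tracks[all_tracks["playlist"] == "Soul"].copy()
soul.rename(columns={"name": "trackName"}, inplace=True)

print(list(soul.columns))
print(list(all_tracks.columns))
```

The original frame keeps its column names, and only the copied subset is renamed.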


In [149]:
Jazz_inst.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 57 entries, 0 to 56
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              57 non-null     object        
 1   artistName             57 non-null     object        
 2   trackDuration          57 non-null     object        
 3   addedAt                57 non-null     datetime64[ns]
 4   trackPopularity        57 non-null     int64         
 5   artistPopularity       57 non-null     int64         
 6   secondaryArtistsNames  14 non-null     object        
 7   artistFollowers        57 non-null     int64         
 8   albumRecordLabel       57 non-null     object        
 9   albumReleaseDate       57 non-null     datetime64[ns]
 10  BPM                    57 non-null     int64         
 11  Energy                 57 non-null     int64         
 12  Dance                  57 non-null     int64         
 13  Loud   

In [150]:
Jazz_inst.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_Jazz_inst.csv")
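
The `info()` output lists `trackDuration` as `object`, meaning the durations are stored as strings such as `05:07`. If numeric durations are needed for later analysis, one option is to convert them to seconds. The helper below is a hypothetical sketch and assumes every value has the `mm:ss` form seen in the `head()` output (no hours component).

```python
def mmss_to_seconds(value):
    """Convert a 'mm:ss' string to a number of seconds."""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)


# Durations copied from the Jazz_inst.head() output.
durations = ["05:07", "03:35", "06:26", "04:19", "10:57"]
duration_seconds = [mmss_to_seconds(d) for d in durations]

print(duration_seconds)  # [307, 215, 386, 259, 657]
```

On the real frame the same helper could be applied with `Jazz_inst["trackDuration"].map(mmss_to_seconds)`.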

n) Deep House

In [151]:
DeepHouse = prepare_EDA("DeepHouse")
DeepHouse.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(


Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,Heal Yourself And Move,Theo Parrish,10:17,2020-05-08 20:43:31,30,43.0,,53510.0,Peacefrog Records,1998-01-01,...,46.0,83.0,-14.0,35.0,38.0,Broken beat,Deep house,Deep soul house,Detroit house,Detroit techno
1,Tech This Out Pt.1,Karizma,07:37,2020-05-08 20:43:51,6,30.0,,6197.0,Atjazz Record Company,2018-07-13,...,48.0,80.0,-9.0,23.0,0.0,Deep techno,,,,
2,Aus - Original Mix,Nina Kraviz,04:54,2020-05-08 20:49:50,31,51.0,King Aus,313693.0,Rekids,2012-02-27,...,26.0,88.0,-14.0,22.0,6.0,Electra,Russian electronic,Russian techno,,
3,The Bus Ride,Coflo,07:40,2020-05-08 20:56:10,0,20.0,Tomahawk Bang,1099.0,R2 Records,2017-05-05,...,71.0,79.0,-11.0,28.0,5.0,Soulful house,,,,
4,All New - FAKE SELF Remix,Chad(UK),05:54,2020-05-08 21:01:18,16,8.0,FAKE SELF,73.0,In You Records,2016-11-24,...,73.0,91.0,-10.0,38.0,0.0,,,,,


In [152]:
DeepHouse.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 113 entries, 0 to 112
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              113 non-null    object        
 1   artistName             113 non-null    object        
 2   trackDuration          113 non-null    object        
 3   addedAt                113 non-null    datetime64[ns]
 4   trackPopularity        113 non-null    int64         
 5   artistPopularity       113 non-null    float64       
 6   secondaryArtistsNames  43 non-null     object        
 7   artistFollowers        113 non-null    float64       
 8   albumRecordLabel       113 non-null    object        
 9   albumReleaseDate       113 non-null    datetime64[ns]
 10  BPM                    113 non-null    float64       
 11  Energy                 113 non-null    float64       
 12  Dance                  113 non-null    float64       
 13  Loud 

In [153]:
DeepHouse.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_DeepHouse.csv")
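
In the `DeepHouse.info()` output, `artistPopularity`, `artistFollowers` and the audio features are `float64`, whereas the Rock and Jazz playlists show them as `int64`. One plausible cause (an assumption, since `prepare_EDA` is not shown here) is that missing values were present at some stage before being filled or dropped, which forces pandas to use floats. If whole numbers are wanted for consistency across playlists, a guarded cast could look like this sketch:

```python
import pandas as pd

# Stand-in frame using values from the DeepHouse head() output.
demo = pd.DataFrame(
    {
        "artistPopularity": [43.0, 30.0, 51.0],
        "artistFollowers": [53510.0, 6197.0, 313693.0],
    }
)

for column in ["artistPopularity", "artistFollowers"]:
    values = demo[column]
    # Cast only when nothing is missing and every value is a whole number.
    if values.notna().all() and (values % 1 == 0).all():
        demo[column] = values.astype("int64")

print(demo.dtypes)
```

The guard leaves a column untouched if it still contains missing or fractional values, so nothing is silently truncated.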

o) Groovy House

In [154]:
GroovyHouse = prepare_EDA("GroovyHouse")
GroovyHouse.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(


Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,Passing Me By,Maya Jane Coles,06:13,2020-05-08 20:41:00,45,58.0,,185803.0,I/AM/ME,2017-10-13,...,89.0,80.0,-6.0,60.0,1.0,Deep disco house,Deep house,Electra,Electronica,House
1,All About Disco - Radio Edit,Robbie Rivera,03:26,2020-05-08 20:41:08,36,50.0,Angelo Ferreri,33571.0,Mood Funk Records,2018-08-23,...,95.0,75.0,-7.0,76.0,1.0,Circuit,Edm,Electro house,Funky tech house,House
2,Brighter Days - Underground Goodie Mix,Cajmere,05:26,2020-05-08 20:41:14,0,42.0,Dajae,17197.0,BBE,2018-01-26,...,77.0,88.0,-8.0,30.0,0.0,Chicago house,Classic house,Deep house,Footwork,
3,Body N Deep - Jovonn Fabric Mix,Jovonn,06:11,2020-05-08 20:41:19,6,32.0,,5636.0,Body'N Deep,2017-08-07,...,51.0,86.0,-12.0,26.0,0.0,Chicago house,Deep house,Deep soul house,Minimal tech house,Soulful house
4,Sunday Walk,Seren Deep,06:12,2020-05-08 20:41:44,38,27.0,,647.0,Seren Deep,2017-12-12,...,39.0,89.0,-11.0,54.0,23.0,,,,,


In [155]:
GroovyHouse.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159 entries, 0 to 158
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              159 non-null    object        
 1   artistName             159 non-null    object        
 2   trackDuration          159 non-null    object        
 3   addedAt                159 non-null    datetime64[ns]
 4   trackPopularity        159 non-null    int64         
 5   artistPopularity       159 non-null    float64       
 6   secondaryArtistsNames  68 non-null     object        
 7   artistFollowers        159 non-null    float64       
 8   albumRecordLabel       159 non-null    object        
 9   albumReleaseDate       159 non-null    datetime64[ns]
 10  BPM                    159 non-null    float64       
 11  Energy                 159 non-null    float64       
 12  Dance                  159 non-null    float64       
 13  Loud 

In [156]:
GroovyHouse.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_GroovyHouse.csv")

p) Funk & Soul & Disco

In [157]:
FunkSoulDisco = prepare_EDA("FunkSoulDisco")
FunkSoulDisco.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(


Unnamed: 0,trackName,artistName,trackDuration,addedAt,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,albumRecordLabel,albumReleaseDate,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,It's Your Thing,The Isley Brothers,02:46,2018-08-19 20:43:58,54,68.0,,1173039.0,Epic,1999-01-01,...,68.0,38.0,-10.0,89.0,8.0,Classic soul,Disco,Funk,Motown,Quiet storm
1,Superstition,Stevie Wonder,04:04,2018-08-19 20:44:10,0,81.0,,4798243.0,Universal Strategic Marketing,2007-01-01,...,75.0,65.0,-9.0,86.0,2.0,Adult standards,Funk,Motown,Quiet storm,Soul
2,I Say a Little Prayer,Aretha Franklin,03:36,2018-08-19 22:11:52,75,76.0,,2482889.0,Rhino Atlantic,1968-06-14,...,36.0,59.0,-14.0,50.0,48.0,Classic soul,Jazz blues,Memphis soul,Soul,Southern soul
3,Respect,Aretha Franklin,02:27,2018-08-19 22:12:00,72,76.0,,2482889.0,Rhino Atlantic,1967-03-10,...,56.0,81.0,-5.0,97.0,16.0,Classic soul,Jazz blues,Memphis soul,Soul,Southern soul
4,Think,Aretha Franklin,02:19,2018-08-19 22:12:18,67,76.0,,2482889.0,Rhino Atlantic,1968-06-14,...,63.0,69.0,-11.0,93.0,25.0,Classic soul,Jazz blues,Memphis soul,Soul,Southern soul


In [158]:
FunkSoulDisco.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159 entries, 0 to 158
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   trackName              159 non-null    object        
 1   artistName             159 non-null    object        
 2   trackDuration          159 non-null    object        
 3   addedAt                159 non-null    datetime64[ns]
 4   trackPopularity        159 non-null    int64         
 5   artistPopularity       159 non-null    float64       
 6   secondaryArtistsNames  34 non-null     object        
 7   artistFollowers        159 non-null    float64       
 8   albumRecordLabel       159 non-null    object        
 9   albumReleaseDate       159 non-null    datetime64[ns]
 10  BPM                    159 non-null    float64       
 11  Energy                 159 non-null    float64       
 12  Dance                  159 non-null    float64       
 13  Loud 

In [159]:
FunkSoulDisco.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/MyDatasets/EDA_FunkSoulDisco.csv")
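
Sections m) to p) repeat the same steps for each playlist: call `prepare_EDA`, inspect the result, and export it. The export step could instead be driven from a mapping, as sketched below. `prepare_eda_stub` is a hypothetical stand-in for the notebook's `prepare_EDA`, and a temporary directory replaces the real `MyDatasets` folder. A mapping is used rather than a plain list because the Jazz playlist is loaded as `Jazz_instrumental` but saved as `EDA_Jazz_inst.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def prepare_eda_stub(playlist_name):
    """Hypothetical stand-in for prepare_EDA; returns a one-row frame."""
    return pd.DataFrame({"playlist": [playlist_name]})


# Name passed to the prepare function -> suffix used in the exported file name.
exports = {
    "Jazz_instrumental": "Jazz_inst",
    "DeepHouse": "DeepHouse",
    "GroovyHouse": "GroovyHouse",
    "FunkSoulDisco": "FunkSoulDisco",
}

written = []
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    for playlist_name, suffix in exports.items():
        frame = prepare_eda_stub(playlist_name)
        target = out_dir / f"EDA_{suffix}.csv"
        frame.to_csv(target)
        written.append(target.name)

print(written)
```

Keeping the output directory in a single `Path` variable also means the absolute path only has to be changed in one place.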