# Analysis of Spotify Data - Data preparation

Data sources: 
- personal data: https://skiley.net/ & http://sortyourmusic.playlistmachinery.com/index.html & data requested from the Spotify website 
- Kaggle: https://www.kaggle.com/rodolfofigueroa/spotify-12m-songs & https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks

In [1]:
import pandas as pd
import numpy as np

**Data exploration**

a) Data requested from Spotify (15.02.2021)

In [2]:
import json

File 1: Playlist1

In [3]:
file1 = "C:/Users/Agnieszka/Downloads/Datasets/Music/my_spotify_data/MyData/Playlist1.json"

# Use a context manager so the file handle is closed after loading
with open(file1, "r", encoding="utf8") as f:
    Spotify_Playlists = json.load(f)
Spotify_Playlists

# File includes information about each playlist - artist, track and album names

{'playlists': [{'name': 'Podcasts',
   'lastModifiedDate': '2021-01-08',
   'items': [{'track': None,
     'episode': {'episodeName': 'Photographer and Creative Live CEO Chase Jarvis on overcoming creative blocks, becoming a better photographer, and morning routines.',
      'showName': 'How I Work'},
     'localTrack': None},
    {'track': None,
     'episode': {'episodeName': 'Ramit Sethi: Change Your Habits, Change Your Life',
      'showName': 'The Chase Jarvis LIVE Show'},
     'localTrack': None}],
   'description': None,
   'numberOfFollowers': 0},
  {'name': 'Afro Mix',
   'lastModifiedDate': '2021-02-11',
   'items': [{'track': {'trackName': 'From the Edge',
      'artistName': 'El Deep',
      'albumName': 'From the Edge - EP'},
     'episode': None,
     'localTrack': None},
    {'track': {'trackName': "I Can't Sleep - Extended Mix",
      'artistName': 'The Deepshakerz',
      'albumName': "I Can't Sleep EP"},
     'episode': None,
     'localTrack': None},
    {'track': {'
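The nested structure shown above (playlists, then items, then an optional `track`) can be flattened into one row per track. The following is only a minimal sketch: the `sample` dictionary is a shortened, hypothetical stand-in for `Spotify_Playlists`, and `playlist_tracks` is an illustrative helper name, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample with the same shape as Playlist1.json (values shortened)
sample = {
    "playlists": [
        {
            "name": "Podcasts",
            "items": [
                {"track": None,
                 "episode": {"episodeName": "Episode 1", "showName": "Some Show"},
                 "localTrack": None},
            ],
        },
        {
            "name": "Afro Mix",
            "items": [
                {"track": {"trackName": "From the Edge",
                           "artistName": "El Deep",
                           "albumName": "From the Edge - EP"},
                 "episode": None,
                 "localTrack": None},
            ],
        },
    ]
}


def playlist_tracks(data):
    """Return one row per track; podcast episodes and local tracks are skipped."""
    rows = []
    for playlist in data["playlists"]:
        for item in playlist["items"]:
            track = item.get("track")
            if track is None:
                continue
            rows.append({"playlist": playlist["name"], **track})
    return pd.DataFrame(rows)


tracks_df = playlist_tracks(sample)
```

Calling `playlist_tracks(Spotify_Playlists)` on the real data should give a table with the columns `playlist`, `trackName`, `artistName` and `albumName`, provided every track entry has the three keys shown in the output above.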

In [4]:
# Get playlist names

playlist_names = [playlist['name'] for playlist in Spotify_Playlists['playlists']]

playlist_names

['Podcasts',
 'Afro Mix',
 'Deep House Mix',
 'Sing',
 'Check',
 'Neo-Soul',
 'Funk / Soul / Disco',
 'Nu-Disco',
 'Jazz (instrumental)',
 'Jazz',
 'Hip Hop',
 'Hip Hop (beats)',
 'Hip Hop Oldschool',
 "Hip Hop (after '00)",
 'Hip Hop PL',
 'Break',
 'Groovy House',
 'Afro House',
 'Deep House',
 'Chill House',
 'Soulful House',
 'House (all)',
 'Electro',
 'Techno',
 'Trap / Breaks',
 'Electronic (classics)',
 'EDM',
 'Roots',
 'Modern roots',
 'Reggae',
 'Dub',
 'World music ',
 'World music (português)',
 'World music (español)',
 'World music (français)',
 'World music (orient & balkan)',
 'Other african sounds',
 'Afrobeat / African Classics',
 'Afrobeats',
 'Kuduro / Naija',
 'SA sounds / Amapiano / Qgom',
 'Dancehall',
 'Dancehall Oldschool',
 'Chillout / Lounge',
 'Deep',
 'Archive / Massive Attack / etc.',
 '...',
 'Slow',
 'Instrumantal / Classic / Soundtrack',
 "Jungle / Drum'n'Bass",
 'Rock / Indie',
 'Indie',
 'Linkin Park',
 'Oldies',
 'Fun',
 'Polskie',
 'Poezja śpiewana

In [5]:
# Number of personal playlists

len(Spotify_Playlists['playlists'])

61

In [6]:
# Number of songs in each playlist

songs = []
names = []

for item in Spotify_Playlists["playlists"]:
    songs.append(len(item["items"]))
    names.append(item["name"])

# Build the table directly from both lists instead of joining two one-column DataFrames
playlists_songs = pd.DataFrame({"PlaylistName": names, "Songs": songs})
playlists_songs.sort_values(by=["Songs"], ascending=False, inplace=True)

In [8]:
playlists_songs.head()

Unnamed: 0,PlaylistName,Songs
21,House (all),362
17,Afro House,319
55,Polskie,304
10,Hip Hop,210
53,Oldies,183


In [11]:
# Export to csv file for further analysis

playlists_songs.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/songs_in_playlists.csv")

File 2: Inferences

In [None]:
file2 = "C:/Users/Agnieszka/Downloads/Datasets/Music/my_spotify_data/MyData/Inferences.json"

with open(file2, "r", encoding="utf8") as f:
    Spotify_Inferences = json.load(f)
Spotify_Inferences

# Inferences -> from https://support.spotify.com/uk/article/understanding-my-data/ :
# "We draw certain inferences about your interests and preferences based on your usage of the Spotify service and 
# using data obtained from our advertisers and other advertising partners. 
# This includes a list of market segments with which you are currently associated.  
# Depending on your settings, this data may be used to serve interest-based advertising to you within the 
# Spotify service."

# -> not useful for this analysis

File 3: Userdata

In [None]:
file3 = "C:/Users/Agnieszka/Downloads/Datasets/Music/my_spotify_data/MyData/Userdata.json"

with open(file3, "r", encoding="utf8") as f:
    Spotify_Userdata = json.load(f)
Spotify_Userdata

# CreationTime: 2017-03-08

Files 4-4B: Streaming History

In [14]:
file4 = "C:/Users/Agnieszka/Downloads/Datasets/Music/my_spotify_data/MyData/StreamingHistory0.json"

with open(file4, "r", encoding="utf8") as f:
    Spotify_StreamingHistory0 = json.load(f)
Spotify_StreamingHistory0

# Other files: StreamingHistory1 / StreamingHistory2

# Information about each stream ->
# "endTime": date and time when the stream ended
# "msPlayed": number of milliseconds the track was played

[{'endTime': '2020-02-02 23:25',
  'artistName': 'Klevakeys',
  'trackName': 'Lost In The Music - Original Mix',
  'msPlayed': 8753},
 {'endTime': '2020-02-13 17:33',
  'artistName': 'Shakarchi & Stranéus',
  'trackName': 'Mogadishu - Original Mix',
  'msPlayed': 55967},
 {'endTime': '2020-02-14 11:34',
  'artistName': 'Dennis Ferrer',
  'trackName': 'The Red Room - Kyle MF Hall MeMix',
  'msPlayed': 422750},
 {'endTime': '2020-02-14 12:40',
  'artistName': 'Renato Xtrova',
  'trackName': 'Angolano Kikola',
  'msPlayed': 197470},
 {'endTime': '2020-02-15 11:29',
  'artistName': 'Demuir',
  'trackName': "Disco Ain't Jack",
  'msPlayed': 382004},
 {'endTime': '2020-02-15 11:32',
  'artistName': 'Jacques Renault',
  'trackName': 'Faith - Borrowed Identity Remix',
  'msPlayed': 184803},
 {'endTime': '2020-02-15 11:32',
  'artistName': 'David Penn',
  'trackName': 'El Barrio - Instrumental',
  'msPlayed': 2020},
 {'endTime': '2020-02-15 11:40',
  'artistName': 'Lee Burridge',
  'trackName':
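For later analysis, the two fields described above are easier to work with once `endTime` is parsed as a timestamp and `msPlayed` is expressed in minutes. A minimal sketch, using two records copied from the output above; the column name `minutesPlayed` is an illustrative choice, not part of the Spotify export.

```python
import pandas as pd

# Two records taken from the StreamingHistory0 output above
records = [
    {"endTime": "2020-02-02 23:25", "artistName": "Klevakeys",
     "trackName": "Lost In The Music - Original Mix", "msPlayed": 8753},
    {"endTime": "2020-02-14 11:34", "artistName": "Dennis Ferrer",
     "trackName": "The Red Room - Kyle MF Hall MeMix", "msPlayed": 422750},
]

history = pd.DataFrame(records)

# Parse the "YYYY-MM-DD HH:MM" strings into timestamps
history["endTime"] = pd.to_datetime(history["endTime"], format="%Y-%m-%d %H:%M")

# 60,000 milliseconds per minute
history["minutesPlayed"] = history["msPlayed"] / 60_000
```

With a datetime column in place, accessors such as `history["endTime"].dt.hour` become available for time-of-day analysis.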

In [15]:
range(len(Spotify_StreamingHistory0))

range(0, 10000)

Saving StreamTime and ArtistName to a csv file:

In [16]:
# Save StreamTime as a list

StreamTime0 = []
for index in range(len(Spotify_StreamingHistory0)):
    StreamTime0.append(Spotify_StreamingHistory0[index]['endTime'])
    
len(StreamTime0)

10000

In [18]:
# Save ArtistName as a list

ArtistName0 = []
for index in range(len(Spotify_StreamingHistory0)):
    ArtistName0.append(Spotify_StreamingHistory0[index]['artistName'])
    
ArtistName0
len(ArtistName0)

10000

In [17]:
file4A = "C:/Users/Agnieszka/Downloads/Datasets/Music/my_spotify_data/MyData/StreamingHistory1.json"

with open(file4A, "r", encoding="utf8") as f:
    Spotify_StreamingHistory1 = json.load(f)
Spotify_StreamingHistory1

[{'endTime': '2020-06-08 20:48',
  'artistName': 'T-Connection',
  'trackName': 'The Best Of My Love - 2004 Digital Remaster',
  'msPlayed': 176090},
 {'endTime': '2020-06-08 20:48',
  'artistName': 'Ezy & Isaac',
  'trackName': 'Let Your Body Move (Oba Balu Balu)',
  'msPlayed': 7640},
 {'endTime': '2020-06-08 20:49',
  'artistName': 'Lyn Collins',
  'trackName': 'Think (About It)',
  'msPlayed': 3050},
 {'endTime': '2020-06-08 20:49',
  'artistName': 'Black Ivory',
  'trackName': "Get Down (Comin' Down)",
  'msPlayed': 30010},
 {'endTime': '2020-06-08 20:49',
  'artistName': 'Star B',
  'trackName': 'Gotta Have You',
  'msPlayed': 1120},
 {'endTime': '2020-06-08 20:51',
  'artistName': 'Mulatu Astatke',
  'trackName': 'Yègellé Tezeta',
  'msPlayed': 840},
 {'endTime': '2020-06-08 20:51',
  'artistName': 'CHIC',
  'trackName': 'Le Freak',
  'msPlayed': 2720},
 {'endTime': '2020-06-08 20:51',
  'artistName': 'Marumo',
  'trackName': 'Khomo Tsaka Deile Kae?',
  'msPlayed': 143060},
 {'e

In [19]:
StreamTime1 = []
for index in range(len(Spotify_StreamingHistory1)):
    StreamTime1.append(Spotify_StreamingHistory1[index]['endTime'])
    
StreamTime1

['2020-06-08 20:48',
 '2020-06-08 20:48',
 '2020-06-08 20:49',
 '2020-06-08 20:49',
 '2020-06-08 20:49',
 '2020-06-08 20:51',
 '2020-06-08 20:51',
 '2020-06-08 20:51',
 '2020-06-08 20:51',
 '2020-06-08 20:51',
 '2020-06-08 21:15',
 '2020-06-08 21:16',
 '2020-06-08 21:18',
 '2020-06-08 21:18',
 '2020-06-08 21:26',
 '2020-06-08 21:32',
 '2020-06-08 21:33',
 '2020-06-08 21:33',
 '2020-06-08 21:40',
 '2020-06-08 21:47',
 '2020-06-08 21:53',
 '2020-06-08 22:00',
 '2020-06-08 22:07',
 '2020-06-08 22:08',
 '2020-06-09 04:36',
 '2020-06-09 04:43',
 '2020-06-09 04:51',
 '2020-06-09 04:56',
 '2020-06-09 05:01',
 '2020-06-09 05:06',
 '2020-06-09 05:06',
 '2020-06-09 05:06',
 '2020-06-09 05:09',
 '2020-06-09 05:09',
 '2020-06-09 05:09',
 '2020-06-09 05:45',
 '2020-06-09 17:09',
 '2020-06-09 17:10',
 '2020-06-09 17:14',
 '2020-06-09 17:16',
 '2020-06-09 17:18',
 '2020-06-09 17:21',
 '2020-06-09 17:21',
 '2020-06-09 17:25',
 '2020-06-09 17:25',
 '2020-06-09 17:25',
 '2020-06-09 17:25',
 '2020-06-09 

In [20]:
ArtistName1 = []
for index in range(len(Spotify_StreamingHistory1)):
    ArtistName1.append(Spotify_StreamingHistory1[index]['artistName'])
    
ArtistName1

['T-Connection',
 'Ezy & Isaac',
 'Lyn Collins',
 'Black Ivory',
 'Star B',
 'Mulatu Astatke',
 'CHIC',
 'Marumo',
 'Cymande',
 'The Vision',
 'Fela Kuti',
 'Unknown Artist',
 'Afro Warriors',
 'Unknown Artist',
 'ANII',
 'Mr Raoul K',
 'Mr Raoul K',
 'St Germain',
 'Da Capo',
 'Collé',
 'Vanco',
 'Infinite Souls',
 'Citizen Deep',
 'Tunnelvisions',
 'Folamour',
 'Nachtbraker',
 'Soulsearcher',
 'Folamour',
 "Pete Heller's Big Love",
 'Nightmares On Wax',
 'Kapote',
 'Block & Crown',
 'Demuja',
 'Tiger & Woods',
 'Demuja',
 'Huxley',
 'Lido Pimienta',
 'Manu Chao',
 'Ana Tijoux',
 'Manu Chao',
 'Quantic',
 'Afterclapp',
 'Sudan Archives',
 'Vetusta Morla',
 'Ojolo',
 'Laboratorium Pieśni',
 'Frankie Chavez',
 'Pat Thomas',
 'Miriam Makeba',
 'Manu Chao',
 'Warsaw Village Band',
 'Tom Browne',
 'Manu Chao',
 'Kayah',
 'Edyta Gorniak',
 'Dawid Podsiadło',
 'Bill Withers',
 'Aga Zaryan',
 'Grzegorz Turnau',
 'Aga Zaryan',
 'No Doubt',
 'Indios Bravos',
 'Vetusta Morla',
 'Frankie Chavez',

In [21]:
file4B = "C:/Users/Agnieszka/Downloads/Datasets/Music/my_spotify_data/MyData/StreamingHistory2.json"

with open(file4B, "r", encoding="utf8") as f:
    Spotify_StreamingHistory2 = json.load(f)
Spotify_StreamingHistory2

[{'endTime': '2020-10-20 20:24',
  'artistName': 'Smoove',
  'trackName': "I'm a Man",
  'msPlayed': 8740},
 {'endTime': '2020-10-20 20:24',
  'artistName': 'Patti Drew',
  'trackName': 'Fever',
  'msPlayed': 8470},
 {'endTime': '2020-10-20 20:24',
  'artistName': 'The Mighty Show-stoppers',
  'trackName': 'Hippy Skippy Moon Strut',
  'msPlayed': 3580},
 {'endTime': '2020-10-20 20:24',
  'artistName': 'Def Cut',
  'trackName': 'The Lord',
  'msPlayed': 4190},
 {'endTime': '2020-10-20 20:24',
  'artistName': 'Soho',
  'trackName': 'Hot Music',
  'msPlayed': 17050},
 {'endTime': '2020-10-20 20:24',
  'artistName': 'Lack Of Afro',
  'trackName': 'The Outsider',
  'msPlayed': 8550},
 {'endTime': '2020-10-20 20:25',
  'artistName': 'Mash Hall',
  'msPlayed': 4510},
 {'endTime': '2020-10-20 20:25',
  'artistName': 'Eric B. & Rakim',
  'trackName': 'Know The Ledge',
  'msPlayed': 4180},
 {'endTime': '2020-10-20 20:25',
  'artistName': 'Dave "Baby" Cortez',
  'trackName': 'Happy Soul With A Ho

In [22]:
StreamTime2 = []
for index in range(len(Spotify_StreamingHistory2)):
    StreamTime2.append(Spotify_StreamingHistory2[index]['endTime'])
    
StreamTime2

['2020-10-20 20:24',
 '2020-10-20 20:24',
 '2020-10-20 20:24',
 '2020-10-20 20:24',
 '2020-10-20 20:24',
 '2020-10-20 20:24',
 '2020-10-20 20:25',
 '2020-10-20 20:25',
 '2020-10-20 20:25',
 '2020-10-20 20:25',
 '2020-10-20 20:25',
 '2020-10-20 20:26',
 '2020-10-20 20:26',
 '2020-10-20 20:26',
 '2020-10-20 20:26',
 '2020-10-20 20:27',
 '2020-10-20 20:28',
 '2020-10-20 20:28',
 '2020-10-20 20:28',
 '2020-10-20 20:30',
 '2020-10-20 20:31',
 '2020-10-20 20:31',
 '2020-10-20 20:32',
 '2020-10-20 20:33',
 '2020-10-20 20:34',
 '2020-10-20 20:34',
 '2020-10-20 20:34',
 '2020-10-20 20:34',
 '2020-10-20 20:34',
 '2020-10-20 20:35',
 '2020-10-20 20:35',
 '2020-10-20 20:36',
 '2020-10-20 20:36',
 '2020-10-20 20:36',
 '2020-10-20 20:36',
 '2020-10-20 20:38',
 '2020-10-20 20:38',
 '2020-10-20 20:38',
 '2020-10-20 20:38',
 '2020-10-20 20:40',
 '2020-10-20 20:40',
 '2020-10-20 20:40',
 '2020-10-20 20:40',
 '2020-10-20 20:41',
 '2020-10-20 20:41',
 '2020-10-20 20:41',
 '2020-10-20 20:41',
 '2020-10-20 

In [23]:
ArtistName2 = []
for index in range(len(Spotify_StreamingHistory2)):
    ArtistName2.append(Spotify_StreamingHistory2[index]['artistName'])
    
ArtistName2

['Smoove',
 'Patti Drew',
 'The Mighty Show-stoppers',
 'Def Cut',
 'Soho',
 'Lack Of Afro',
 'Mash Hall',
 'Eric B. & Rakim',
 'Dave "Baby" Cortez',
 'Ugly Duckling',
 'DJ Pablo',
 'Brownout',
 'Paktofonika',
 'DJ Tee',
 'Jay-Roc',
 'WWO',
 'Paktofonika',
 'Paktofonika',
 'O.S.T.R.',
 'Fu',
 'Kaliber 44',
 'O.S.T.R.',
 'Paktofonika',
 'WWO',
 'Kaliber 44',
 'Paktofonika',
 'O.S.T.R.',
 'Kaliber 44',
 'Paktofonika',
 'O.S.T.R.',
 'Fisz Emade Tworzywo',
 'Fisz Emade Tworzywo',
 'Paktofonika',
 'O.S.T.R.',
 'O.S.T.R.',
 'Paktofonika',
 'O.S.T.R.',
 'Paktofonika',
 'Paktofonika',
 'O.S.T.R.',
 'Zipera',
 'Paktofonika',
 'Paktofonika',
 'Paktofonika',
 'O.S.T.R.',
 'Zipera',
 'O.S.T.R.',
 'Fisz Emade Tworzywo',
 'Paktofonika',
 'Paktofonika',
 'O.S.T.R.',
 'Paktofonika',
 'Fisz Emade Jako Tworzywo Sztuczne',
 'Fisz Emade Tworzywo',
 'Paktofonika',
 'O.S.T.R.',
 'O.S.T.R.',
 'WWO',
 'Paktofonika',
 'O.S.T.R.',
 'Paktofonika',
 'Paktofonika',
 'O.S.T.R.',
 'O.S.T.R.',
 'Paktofonika',
 'O.S.T

In [24]:
# Connect all lists (StreamTime)

#len(StreamTime0) # 10000
#len(StreamTime1) # 10000
#len(StreamTime2) # 7988

StreamTime = StreamTime0 + StreamTime1 + StreamTime2
len(StreamTime)

27988

In [25]:
# Connect all lists (ArtistName)

# StreamTime0.extend(StreamTime1)

#len(ArtistName0) # 10000
#len(ArtistName1) # 10000
#len(ArtistName2) # 7988

ArtistNames = ArtistName0 + ArtistName1 + ArtistName2
len(ArtistNames)

27988
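The three streaming-history files were loaded and combined by hand above. The same result could be reached with a single loop over the files; the sketch below is one possible way to do it. The helper names `load_streaming_history` and `column` are hypothetical, and the demo writes two tiny temporary files instead of touching the real export.

```python
import json
import tempfile
from pathlib import Path


def load_streaming_history(folder):
    """Concatenate the records of all StreamingHistory*.json files in `folder`."""
    records = []
    # sorted() keeps StreamingHistory0, 1, 2 in order (sufficient for fewer than ten files)
    for path in sorted(Path(folder).glob("StreamingHistory*.json")):
        with open(path, "r", encoding="utf8") as fh:
            records.extend(json.load(fh))
    return records


def column(records, key):
    """Pull one field out of every record; missing keys become None."""
    return [record.get(key) for record in records]


# Demo on two small temporary files with one record each
with tempfile.TemporaryDirectory() as folder:
    samples = (
        [{"endTime": "2020-02-02 23:25", "artistName": "Klevakeys"}],
        [{"endTime": "2020-06-08 20:48", "artistName": "T-Connection"}],
    )
    for i, recs in enumerate(samples):
        Path(folder, f"StreamingHistory{i}.json").write_text(json.dumps(recs), encoding="utf8")
    demo = load_streaming_history(folder)

stream_times = column(demo, "endTime")
artist_names = column(demo, "artistName")
```

Using `record.get(key)` rather than `record[key]` means a record with a missing field yields `None` instead of raising a `KeyError`.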

In [26]:
# Convert into DataFrame (StreamTime)

StreamTime_df = pd.DataFrame(StreamTime)
StreamTime_df 

Unnamed: 0,0
0,2020-02-02 23:25
1,2020-02-13 17:33
2,2020-02-14 11:34
3,2020-02-14 12:40
4,2020-02-15 11:29
...,...
27983,2021-02-14 14:45
27984,2021-02-14 14:54
27985,2021-02-14 15:04
27986,2021-02-14 15:12


In [27]:
# Convert into DataFrame (ArtistNames)

ArtistNames_df = pd.DataFrame(ArtistNames)
ArtistNames_df

Unnamed: 0,0
0,Klevakeys
1,Shakarchi & Stranéus
2,Dennis Ferrer
3,Renato Xtrova
4,Demuir
...,...
27983,DJEFF
27984,Andreas Horvat
27985,Isolée
27986,Filippos


In [28]:
# Merge both DataFrames

ArtistNames_df.reset_index(inplace=True)
StreamTime_df.reset_index(inplace=True)
StreamTime_ArtistNames = ArtistNames_df.merge(StreamTime_df, on="index")
StreamTime_ArtistNames.head()

Unnamed: 0,index,0_x,0_y
0,0,Klevakeys,2020-02-02 23:25
1,1,Shakarchi & Stranéus,2020-02-13 17:33
2,2,Dennis Ferrer,2020-02-14 11:34
3,3,Renato Xtrova,2020-02-14 12:40
4,4,Demuir,2020-02-15 11:29


In [32]:
# Rename columns

StreamTime_ArtistNames.rename(columns={"0_x":"ArtistNames", "0_y":"StreamTime"}, inplace=True)
StreamTime_ArtistNames.head()

Unnamed: 0,ArtistNames,StreamTime
0,Klevakeys,2020-02-02 23:25
1,Shakarchi & Stranéus,2020-02-13 17:33
2,Dennis Ferrer,2020-02-14 11:34
3,Renato Xtrova,2020-02-14 12:40
4,Demuir,2020-02-15 11:29


In [31]:
StreamTime_ArtistNames.drop(columns=["index"], inplace=True)
StreamTime_ArtistNames.head()

Unnamed: 0,ArtistNames,StreamTime
0,Klevakeys,2020-02-02 23:25
1,Shakarchi & Stranéus,2020-02-13 17:33
2,Dennis Ferrer,2020-02-14 11:34
3,Renato Xtrova,2020-02-14 12:40
4,Demuir,2020-02-15 11:29


In [33]:
# Save to csv-file

StreamTime_ArtistNames.to_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/StreamTime_ArtistNames.csv")
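By default, `to_csv` also writes the DataFrame index as an extra first column without a header, which shows up as `Unnamed: 0` when the file is read back with `read_csv`. If the index carries no information, passing `index=False` avoids this. A small sketch with an in-memory buffer instead of a real file (the two rows are taken from the table above):

```python
import io

import pandas as pd

df = pd.DataFrame({
    "ArtistNames": ["Klevakeys", "Demuir"],
    "StreamTime": ["2020-02-02 23:25", "2020-02-15 11:29"],
})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)
```

Here `reloaded_with_index` has three columns (including `Unnamed: 0`), while `reloaded_clean` has only the two original ones.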

File 5: SearchQueries

In [None]:
file5 = "C:/Users/Agnieszka/Downloads/Datasets/Music/my_spotify_data/MyData/SearchQueries.json"

with open(file5, "r", encoding="utf8") as f:
    Spotify_SearchQueries = json.load(f)
Spotify_SearchQueries

# A list of searches made: search time / device or platform used / search text / list of Uniform Resource Identifiers (URI) of the search results the user interacted with

b) Data from skiley.net

In [2]:
# Load one file to see the available features

test = pd.read_csv("C:/Users/Agnieszka/Downloads/Datasets/Music/Silkey_MySpotify/AfroHouse.csv")
test.head()

# Important: 'trackName' / 'artistName' / 'secondaryArtistsNames' /'trackDuration' / 'addedAt' (??) /
# "isLocal" - "stream" vs. "local" / 'trackPopularity' / 'artistPopularity' / 'artistFollowers' / 'artistGenres' /
# 'albumRecordLabel' / 'albumReleaseDate'

# not important: 'albumName'/ "trackNumber" / "addedBy" / 'albumArtistsNames' /'trackUrl'/ 
# 'artistUrl' / 'albumUrl'/ 'trackIsrc' (International Standard Recording Code)/ 'albumUpc' (Universal Product Code) / 
# 'albumType' / 'albumPopularity'

# artistPopularity: The popularity of the artist. The value will be between 0 and 100, 
# with 100 being the most popular. The artist’s popularity is calculated from the popularity of all the artist’s tracks.
# https://developer.spotify.com/documentation/web-api/reference/


Unnamed: 0,trackName,artistName,albumName,trackDuration,trackNumber,addedBy,addedAt,isLocal,trackPopularity,artistPopularity,...,artistGenres,albumType,albumRecordLabel,albumReleaseDate,albumArtistsNames,trackUrl,artistUrl,albumUrl,trackIsrc,albumUpc
0,Fadjamou - TWOTEK Remix,Oumou Sangaré,Mogoya Remixed,04:54,6,leposia,2020-02-21T15:48:03Z,stream,33,55.0,...,"Afropop, Malian blues, Mande pop, Wassoulou, W...",album,NO FORMAT,2018-03-02,Oumou Sangaré,https://open.spotify.com/track/7o4FDJSXetDLaZQ...,https://open.spotify.com/artist/65CKKZilbcSKkA...,https://open.spotify.com/album/3tPkkxfJ4PMlVWH...,FR2W61700101,886447000000.0
1,Sondela feat. Xolisa,David Mayer,Sondela EP,09:16,1,leposia,2020-05-08T20:45:25Z,stream,34,31.0,...,,single,Connected,2018-05-11,David Mayer & Floyd Lavine,https://open.spotify.com/track/3JSbVbuKCAEKPQG...,https://open.spotify.com/artist/7Mf7RKErlQ5maE...,https://open.spotify.com/album/62Stp3OydYGdunK...,DEU671702764,880319900000.0
2,Shango,Waajeed,Shango EP,05:38,2,leposia,2020-05-08T20:46:03Z,stream,24,29.0,...,"Detroit house, Experimental house",single,Dirt Tech Reck,2017-07-07,Waajeed,https://open.spotify.com/track/3nac7Q00wdPbA7E...,https://open.spotify.com/artist/4dXRZcf1AdzvwZ...,https://open.spotify.com/album/7heAeiemIsKYZtc...,ushm21722078,191061900000.0
3,Sittin' Here - Boddhi Satva Ancestral Soul Remix,St Germain,Sittin' Here (Remixes),06:27,2,leposia,2020-05-08T20:46:16Z,stream,20,59.0,...,"Downtempo, Electronica, Jazztronica, Nu jazz",single,Parlophone (France),2016-03-11,St Germain,https://open.spotify.com/track/4LbkNYxou1Z3uKi...,https://open.spotify.com/artist/484sZUYmnRXN84...,https://open.spotify.com/album/4XgRSFKNjnzrKvU...,FR92S1500040,190296000000.0
4,"99,9 - Alix Alvarez Sole Channel Remix",Minimoogli,"99,9 Remixes",06:22,1,leposia,2020-05-08T20:49:27Z,stream,0,5.0,...,,single,Tubetracks,2008-05-24,Minimoogli,https://open.spotify.com/track/5SL7Ht67ZGlRPFg...,https://open.spotify.com/artist/5crLJQJosOleex...,https://open.spotify.com/album/76QL5IslNgG8Cn2...,FR6V80803305,3661585000000.0


Dataset with all playlists:

In [34]:
# Dataset with all playlists

import glob
df_Silkey = pd.concat([pd.read_csv(f) for f in glob.glob("C:/Users/Agnieszka/Downloads/Datasets/Music/Silkey_MySpotify/" + "*.csv")], ignore_index=True)
len(df_Silkey)

4007
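The concatenation above loses the information about which playlist each row came from, and the order in which `glob.glob` returns files is not guaranteed to be sorted. The sketch below shows one way to make the order explicit and to keep the file name as a `playlist` column. The helper `read_playlist_folder` and the `playlist` column are assumptions for illustration, and the demo uses two tiny temporary CSV files.

```python
import glob
import os
import tempfile

import pandas as pd


def read_playlist_folder(folder, reader=pd.read_csv, pattern="*.csv"):
    """Read every matching file in a fixed order and tag rows with the file name."""
    frames = []
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        frame = reader(path)
        # File name without extension, e.g. "AfroHouse"
        frame["playlist"] = os.path.splitext(os.path.basename(path))[0]
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Demo with two hypothetical one-row playlist files
with tempfile.TemporaryDirectory() as folder:
    pd.DataFrame({"trackName": ["Shango"]}).to_csv(
        os.path.join(folder, "AfroHouse.csv"), index=False)
    pd.DataFrame({"trackName": ["Pana"]}).to_csv(
        os.path.join(folder, "Afrobeats.csv"), index=False)
    demo = read_playlist_folder(folder)
```

Note that Python's `sorted` is case-sensitive, so the resulting order can differ from the order a file manager shows. What matters for any later row-by-row combination is that both sources are read in the same order.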

c) Data from sortyourmusic.playlistmachinery.com

In [4]:
test2 = pd.read_excel("C:/Users/Agnieszka/Downloads/Datasets/Music/SortYourMusic_20210305.xlsx", sheet_name="AfroHouse")
test2.head()


# Duplicate information (already available in the skiley.net data): Title / Artist / Release / Length / Pop.
# Not important: RND

Unnamed: 0,#,Title,Artist,Release,BPM,Energy,Dance,Loud,Valence,Length,Acoustic,Pop.,RND
0,1,Fadjamou - TWOTEK Remix,Oumou Sangaré,2018-03-02 00:00:00,129.0,59.0,84.0,-9.0,46.0,04:54:00,5.0,33.0,4111.0
1,2,Sondela feat. Xolisa,David Mayer,2018-05-11 00:00:00,122.0,68.0,77.0,-8.0,14.0,09:17:00,1.0,34.0,7037.0
2,3,Shango,Waajeed,2017-07-07 00:00:00,127.0,62.0,85.0,-11.0,60.0,05:38:00,6.0,24.0,5515.0
3,4,Sittin' Here - Boddhi Satva Ancestral Soul Remix,St Germain,2016-03-11 00:00:00,122.0,72.0,79.0,-8.0,52.0,06:28:00,13.0,20.0,3277.0
4,5,"99,9 - Alix Alvarez Sole Channel Remix",Minimoogli,2008-05-24 00:00:00,126.0,86.0,80.0,-9.0,66.0,06:22:00,0.0,0.0,281.0


In [90]:
# Check files in the directory

#import os
#files = os.listdir("C:/Users/Agnieszka/Downloads/Datasets/Music/SortYourMusic/")
#files

Dataset with all playlists:

In [35]:
# Dataset with all playlists

df_SortYourMusic = pd.concat([pd.read_excel(f) for f in glob.glob("C:/Users/Agnieszka/Downloads/Datasets/Music/SortYourMusic/" + "*.xlsx")], ignore_index=True)
len(df_SortYourMusic)

4007

**Datasets for further analysis and ML models**

a) Dataset with data from all playlists 

In [36]:
df_Silkey.reset_index(inplace=True)
df_Silkey.head()

Unnamed: 0,index,trackName,artistName,albumName,trackDuration,trackNumber,addedBy,addedAt,isLocal,trackPopularity,...,artistGenres,albumType,albumRecordLabel,albumReleaseDate,albumArtistsNames,trackUrl,artistUrl,albumUrl,trackIsrc,albumUpc
0,0,"My Woman, My Everything (feat. Wandecoal)",Patoranking,"My Woman, My Everything (feat. Wandecoal) - Si...",03:53,1,leposia,2018-08-19T20:44:37Z,stream,54,...,"Afro dancehall, Afropop, Azontobeats, Gqom, Ni...",single,Greensleeves Records,2015-07-06,Patoranking,https://open.spotify.com/track/5MAScTGPHzWAeiv...,https://open.spotify.com/artist/2hKQc001G7ggs3...,https://open.spotify.com/album/0cUhwhoX9Yi7vhJ...,GBBZV1555523,601811200000.0
1,1,Pana,Tekno,Pana,04:02,1,leposia,2018-08-19T20:51:51Z,stream,59,...,"Afro dancehall, Afropop, Azontobeats, Nigerian...",single,Columbia,2016-12-20,Tekno,https://open.spotify.com/track/3ddd0nUY2jZtPYz...,https://open.spotify.com/artist/6IhG3Yxm3UW98j...,https://open.spotify.com/album/62sUZK9tqrSH3AM...,USSM11610316,886446300000.0
2,2,Jogodo,Tekno,Jogodo,04:23,1,leposia,2018-08-19T20:51:54Z,stream,44,...,"Afro dancehall, Afropop, Azontobeats, Nigerian...",single,MADE MEN MUSIC GROUP,2018-04-19,Tekno,https://open.spotify.com/track/2irIMb2mjl8lgVz...,https://open.spotify.com/artist/6IhG3Yxm3UW98j...,https://open.spotify.com/album/6aR2Z7ByqS9rxAz...,GBZB51800047,5050581000000.0
3,3,Duro,Tekno,Duro,03:32,1,leposia,2018-08-19T20:52:04Z,stream,0,...,"Afro dancehall, Afropop, Azontobeats, Nigerian...",single,Made Men Music Group,2015-06-22,Tekno,https://open.spotify.com/track/4zI3yq6BBpbnKZY...,https://open.spotify.com/artist/6IhG3Yxm3UW98j...,https://open.spotify.com/album/2RtngUBJGTBgQuY...,UK3AZ1512397,632181500000.0
4,4,Samantha,Tekno,Samantha,03:49,1,leposia,2018-08-19T20:52:24Z,stream,42,...,"Afro dancehall, Afropop, Azontobeats, Nigerian...",single,MADE MEN MUSIC GROUP,2017-06-09,Tekno,https://open.spotify.com/track/0rI1YdcUXmvWd6R...,https://open.spotify.com/artist/6IhG3Yxm3UW98j...,https://open.spotify.com/album/4JS9Fgzb4XzmWMQ...,GBZB51700035,5050581000000.0


In [37]:
df_SortYourMusic.reset_index(inplace=True)
df_SortYourMusic.head()

Unnamed: 0,index,#,Title,Artist,Release,BPM,Energy,Dance,Loud,Valence,Length,Acoustic,Pop.,RND
0,0,1,"My Woman, My Everything (feat. Wandecoal)",Patoranking,2015-07-06 00:00:00,112.0,85.0,90.0,-3.0,63.0,03:54:00,4.0,54.0,7655.0
1,1,2,Pana,Tekno,2016-12-20 00:00:00,73.0,55.0,55.0,-7.0,66.0,04:03:00,31.0,58.0,5685.0
2,2,3,Jogodo,Tekno,2018-04-19 00:00:00,100.0,77.0,74.0,-5.0,82.0,04:23:00,38.0,44.0,8581.0
3,3,4,Duro,Tekno,2015-06-22 00:00:00,90.0,84.0,74.0,-4.0,97.0,03:32:00,28.0,0.0,400.0
4,4,5,Samantha,Tekno,2017-06-09 00:00:00,106.0,75.0,83.0,-3.0,79.0,03:50:00,3.0,42.0,6739.0


Merge DataFrames

In [38]:
# Merge both DataFrames (df_Silkey & df_SortYourMusic)

All_songs = df_Silkey.merge(df_SortYourMusic, on="index")

In [198]:
#All_songs[["trackName", "Title"]].iloc[4000:4040]
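Merging on the positional `index` only gives correct rows if both sources list the same tracks in the same order. The commented-out spot check above can be turned into a systematic one. This is a sketch under the assumption that `trackName` and `Title` should match apart from case and surrounding whitespace; `mismatched_rows` is a hypothetical helper, and the two small frames are made-up examples.

```python
import pandas as pd


def mismatched_rows(left, right, left_col="trackName", right_col="Title"):
    """Return the row positions where the two sources disagree on the track title."""
    if len(left) != len(right):
        raise ValueError("sources have different lengths; a positional merge is unsafe")
    a = left[left_col].astype(str).str.strip().str.lower().reset_index(drop=True)
    b = right[right_col].astype(str).str.strip().str.lower().reset_index(drop=True)
    return a.index[(a != b).to_numpy()].tolist()


# Made-up example in which the last two rows are swapped
skiley = pd.DataFrame({"trackName": ["Pana", "Jogodo", "Duro"]})
sort_your_music = pd.DataFrame({"Title": ["Pana", "Duro", "Jogodo"]})

bad = mismatched_rows(skiley, sort_your_music)
```

An empty list from `mismatched_rows(df_Silkey, df_SortYourMusic)` would support the positional merge; any positions returned are worth inspecting by hand, since the two services may also spell some titles differently.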

Remove columns

In [39]:
# Check columns

All_songs.columns

Index(['index', 'trackName', 'artistName', 'albumName', 'trackDuration',
       'trackNumber', 'addedBy', 'addedAt', 'isLocal', 'trackPopularity',
       'artistPopularity', 'secondaryArtistsNames', 'albumPopularity',
       'artistFollowers', 'artistGenres', 'albumType', 'albumRecordLabel',
       'albumReleaseDate', 'albumArtistsNames', 'trackUrl', 'artistUrl',
       'albumUrl', 'trackIsrc', 'albumUpc', '#', 'Title', 'Artist', 'Release',
       'BPM', 'Energy', 'Dance', 'Loud', 'Valence', 'Length', 'Acoustic',
       'Pop.', 'RND'],
      dtype='object')

In [44]:
# Drop columns

# df_SortYourMusic:
# Title / Artist / Release / Length / Pop. <- duplicate information; RND is not important
# df_Silkey:
# not important: 'albumName' / "trackNumber" / "addedBy" / 'albumArtistsNames' / 'trackUrl' /
# 'artistUrl' / 'albumUrl' / 'trackIsrc' (International Standard Recording Code) / 'albumUpc' (Universal Product Code) /
# 'albumType' / 'albumPopularity'


# The columns= keyword already selects the column axis, so no axis argument is needed
All_songs_cleaned1 = All_songs.drop(columns=["albumName",
                            "trackNumber",
                            "addedBy",
                            "albumArtistsNames",
                            "trackUrl",
                            "artistUrl",
                            "albumUrl",
                            "trackIsrc",
                            "albumUpc",
                            "albumType",
                            "albumPopularity",
                            'Title',
                            'Artist',
                            'Release',
                            "Length",
                            "Pop.",
                            "RND",
                            '#',
                           ])

In [45]:
All_songs_cleaned1.head()

Unnamed: 0,index,trackName,artistName,trackDuration,addedAt,isLocal,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,artistGenres,albumRecordLabel,albumReleaseDate,BPM,Energy,Dance,Loud,Valence,Acoustic
0,0,"My Woman, My Everything (feat. Wandecoal)",Patoranking,03:53,2018-08-19T20:44:37Z,stream,54,61.0,Wande Coal,149884.0,"Afro dancehall, Afropop, Azontobeats, Gqom, Ni...",Greensleeves Records,2015-07-06,112.0,85.0,90.0,-3.0,63.0,4.0
1,1,Pana,Tekno,04:02,2018-08-19T20:51:51Z,stream,59,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",Columbia,2016-12-20,73.0,55.0,55.0,-7.0,66.0,31.0
2,2,Jogodo,Tekno,04:23,2018-08-19T20:51:54Z,stream,44,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",MADE MEN MUSIC GROUP,2018-04-19,100.0,77.0,74.0,-5.0,82.0,38.0
3,3,Duro,Tekno,03:32,2018-08-19T20:52:04Z,stream,0,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",Made Men Music Group,2015-06-22,90.0,84.0,74.0,-4.0,97.0,28.0
4,4,Samantha,Tekno,03:49,2018-08-19T20:52:24Z,stream,42,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",MADE MEN MUSIC GROUP,2017-06-09,106.0,75.0,83.0,-3.0,79.0,3.0
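The `trackDuration` column kept above is a string in `mm:ss` form. For numeric analysis it could be converted to seconds. A minimal sketch, assuming every value consists of exactly two parts separated by a colon (tracks of an hour or more might need extra handling); `duration_to_seconds` is an illustrative helper name.

```python
def duration_to_seconds(value):
    """Convert a 'mm:ss' string such as '04:54' to a number of seconds."""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)


# First three durations from the table above
durations = ["03:53", "04:02", "04:23"]
seconds = [duration_to_seconds(d) for d in durations]
```

On the DataFrame this could be applied as `All_songs_cleaned1["trackDuration"].map(duration_to_seconds)`.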


In [46]:
len(All_songs_cleaned1)

4007

Check for and remove duplicates

In [47]:
# Check if there are any duplicates
Duplicates = All_songs_cleaned1[["artistName", "trackName", "index"]].groupby(["trackName", "artistName"]).count().sort_values("index", ascending=False)
Duplicates.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,index
trackName,artistName,Unnamed: 2_level_1
I'm So Excited,The Pointer Sisters,4
Promise,Ben Howard,4
"Sports, Not Heavy Crime",Xploding Plastix,4
Blackbird Song,Lee DeWyze,4
Blacked Out,Yussef Dayes,4


In [15]:
# Duplicate items
#Duplicates2 = Duplicates[Duplicates["index"] > 1]
#len(Duplicates2)

433

In [16]:
# Single items
#Duplicates3 = Duplicates[Duplicates["index"] == 1]
#len(Duplicates3)

3049

In [48]:
All_songs_cleaned2 = All_songs_cleaned1.drop_duplicates(subset=["artistName", "trackName"], keep="first")
All_songs_cleaned2.shape

(3510, 19)
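The group counts above add up to 433 + 3049 = 3482, while `drop_duplicates` keeps 3510 rows. The difference of 28 equals the number of rows with a missing `artistName` in the `info()` output below, and a likely explanation is that `groupby` omits groups with a missing key by default, whereas `drop_duplicates` keeps such rows. The toy example below illustrates this behaviour; the four rows are made up.

```python
import pandas as pd

songs = pd.DataFrame({
    "trackName": ["Promise", "Promise", "Fever", "Local file"],
    "artistName": ["Ben Howard", "Ben Howard", "Patti Drew", None],
})
keys = ["trackName", "artistName"]

# Rows that share both keys with at least one other row
is_dup = songs.duplicated(subset=keys, keep=False)
n_rows_in_dup_groups = int(is_dup.sum())

# groupby leaves out groups with a missing key unless dropna=False is passed
n_groups_default = len(songs.groupby(keys).size())
n_groups_with_nan = len(songs.groupby(keys, dropna=False).size())

# drop_duplicates keeps the row with the missing artist name
n_after_drop = len(songs.drop_duplicates(subset=keys, keep="first"))
```

Passing `dropna=False` to `groupby` makes the group count agree with the number of rows left after `drop_duplicates`.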

Check for NaN values

In [49]:
All_songs_cleaned2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3510 entries, 0 to 4006
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   index                  3510 non-null   int64  
 1   trackName              3510 non-null   object 
 2   artistName             3482 non-null   object 
 3   trackDuration          3510 non-null   object 
 4   addedAt                3510 non-null   object 
 5   isLocal                3510 non-null   object 
 6   trackPopularity        3510 non-null   int64  
 7   artistPopularity       3461 non-null   float64
 8   secondaryArtistsNames  1031 non-null   object 
 9   artistFollowers        3461 non-null   float64
 10  artistGenres           3054 non-null   object 
 11  albumRecordLabel       3461 non-null   object 
 12  albumReleaseDate       3461 non-null   object 
 13  BPM                    3461 non-null   float64
 14  Energy                 3461 non-null   float64
 15  Danc

In [52]:
# Remove rows without "artistName" -> I can do that because all rows without an artist name are local files
# Reassigning the result (instead of inplace=True) avoids the SettingWithCopyWarning
# that the in-place version triggered on this DataFrame
All_songs_cleaned2 = All_songs_cleaned2.dropna(axis=0, subset=["artistName"])
All_songs_cleaned2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3482 entries, 0 to 4006
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   index                  3482 non-null   int64  
 1   trackName              3482 non-null   object 
 2   artistName             3482 non-null   object 
 3   trackDuration          3482 non-null   object 
 4   addedAt                3482 non-null   object 
 5   isLocal                3482 non-null   object 
 6   trackPopularity        3482 non-null   int64  
 7   artistPopularity       3461 non-null   float64
 8   secondaryArtistsNames  1031 non-null   object 
 9   artistFollowers        3461 non-null   float64
 10  artistGenres           3054 non-null   object 
 11  albumRecordLabel       3461 non-null   object 
 12  albumReleaseDate       3461 non-null   object 
 13  BPM                    3461 non-null   float64
 14  Energy                 3461 non-null   float64
 15  Danc


In [55]:
# NaN values in BPM, Energy etc. columns -> There still might be local data in the dataset
All_songs_cleaned2_test2 = All_songs_cleaned2[All_songs_cleaned2["BPM"].isna()]
All_songs_cleaned2_test2.head()

Unnamed: 0,index,trackName,artistName,trackDuration,addedAt,isLocal,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,artistGenres,albumRecordLabel,albumReleaseDate,BPM,Energy,Dance,Loud,Valence,Acoustic
61,61,DJ Flex,AfroBeat Remix Drake in my feelings (KeKe),01:52,2018-11-30T08:56:07Z,local,0,,,,,,,,,,,,
142,142,Where Have They Gone (Mr Mendel Edit),M...,05:49,2020-10-24T18:17:05Z,local,0,,,,,,,,,,,,
206,206,Kenkan (David Montoya edit),Mory Kante,06:15,2020-05-08T21:10:33Z,local,0,,,,,,,,,,,,
207,207,Isibusiso Sami,Schomane RA ft Poizen,06:58,2020-05-08T21:10:45Z,local,0,,,,,,,,,,,,
265,265,Uyankenteza (Nostalgic Mix),The Godfathers Of Deep House SA,07:35,2020-05-08T21:40:27Z,local,0,,,,,,,,,,,,
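The rows above suggest that the missing audio features belong to local files. One way to check this for the whole table, rather than only the first five rows, is to count missing values separately for each `isLocal` value. A sketch on a made-up three-row frame with only two feature columns:

```python
import pandas as pd

songs = pd.DataFrame({
    "trackName": ["Pana", "Jogodo", "Kenkan (David Montoya edit)"],
    "isLocal": ["stream", "stream", "local"],
    "BPM": [73.0, 100.0, None],
    "Energy": [55.0, 77.0, None],
})

# Count missing values per column, separately for streamed and local tracks
missing_by_source = songs[["BPM", "Energy"]].isna().groupby(songs["isLocal"]).sum()
```

If the `stream` row of the resulting table contains only zeros, all missing feature values come from local files, which would justify filtering on `isLocal`.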


In [57]:
# Remove all local data -> there is no information about this data that can be used for further analysis
All_songs_cleaned3 = All_songs_cleaned2[All_songs_cleaned2["isLocal"]=="stream"]
All_songs_cleaned3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3461 entries, 0 to 4006
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   index                  3461 non-null   int64  
 1   trackName              3461 non-null   object 
 2   artistName             3461 non-null   object 
 3   trackDuration          3461 non-null   object 
 4   addedAt                3461 non-null   object 
 5   isLocal                3461 non-null   object 
 6   trackPopularity        3461 non-null   int64  
 7   artistPopularity       3461 non-null   float64
 8   secondaryArtistsNames  1031 non-null   object 
 9   artistFollowers        3461 non-null   float64
 10  artistGenres           3054 non-null   object 
 11  albumRecordLabel       3461 non-null   object 
 12  albumReleaseDate       3461 non-null   object 
 13  BPM                    3461 non-null   float64
 14  Energy                 3461 non-null   float64
 15  Danc

In [28]:
# There are still missing values in "secondaryArtistsNames" and "artistGenres"
# -> These columns stay in the dataset for the EDA; they will be removed for modeling (machine learning)

In [58]:
All_songs_cleaned3.head()

Unnamed: 0,index,trackName,artistName,trackDuration,addedAt,isLocal,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,artistGenres,albumRecordLabel,albumReleaseDate,BPM,Energy,Dance,Loud,Valence,Acoustic
0,0,"My Woman, My Everything (feat. Wandecoal)",Patoranking,03:53,2018-08-19T20:44:37Z,stream,54,61.0,Wande Coal,149884.0,"Afro dancehall, Afropop, Azontobeats, Gqom, Ni...",Greensleeves Records,2015-07-06,112.0,85.0,90.0,-3.0,63.0,4.0
1,1,Pana,Tekno,04:02,2018-08-19T20:51:51Z,stream,59,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",Columbia,2016-12-20,73.0,55.0,55.0,-7.0,66.0,31.0
2,2,Jogodo,Tekno,04:23,2018-08-19T20:51:54Z,stream,44,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",MADE MEN MUSIC GROUP,2018-04-19,100.0,77.0,74.0,-5.0,82.0,38.0
3,3,Duro,Tekno,03:32,2018-08-19T20:52:04Z,stream,0,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",Made Men Music Group,2015-06-22,90.0,84.0,74.0,-4.0,97.0,28.0
4,4,Samantha,Tekno,03:49,2018-08-19T20:52:24Z,stream,42,62.0,,175849.0,"Afro dancehall, Afropop, Azontobeats, Nigerian...",MADE MEN MUSIC GROUP,2017-06-09,106.0,75.0,83.0,-3.0,79.0,3.0


Split "artistGenres"

In [47]:
# Inspect one raw "artistGenres" value
All_songs_cleaned3.loc[50, "artistGenres"]

'Afro dancehall, Afropop, Ghanaian hip hop, Nigerian pop, Swedish dancehall'

In [49]:
# Dataset without NaNs in "artistGenres"

All_songs_clean_Genre = All_songs_cleaned3.dropna(axis=0, subset=["artistGenres"])
All_songs_clean_Genre.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3054 entries, 0 to 4005
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   index                  3054 non-null   int64  
 1   trackName              3054 non-null   object 
 2   artistName             3054 non-null   object 
 3   trackDuration          3054 non-null   object 
 4   addedAt                3054 non-null   object 
 5   isLocal                3054 non-null   object 
 6   trackPopularity        3054 non-null   int64  
 7   artistPopularity       3054 non-null   float64
 8   secondaryArtistsNames  865 non-null    object 
 9   artistFollowers        3054 non-null   float64
 10  artistGenres           3054 non-null   object 
 11  albumRecordLabel       3054 non-null   object 
 12  albumReleaseDate       3054 non-null   object 
 13  BPM                    3054 non-null   float64
 14  Energy                 3054 non-null   float64
 15  Danc

In [59]:
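# Split the genre string into one column per genre
# (separator ", " so that the genre names do not keep a leading space)
All_songs_cleaned3_Genres = All_songs_cleaned3["artistGenres"].str.split(", ", expand=True)
print(len(All_songs_cleaned3_Genres))
All_songs_cleaned3_Genres.head()
# Some songs have up to 16 different genres!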

3461


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,Afro dancehall,Afropop,Azontobeats,Gqom,Nigerian pop,,,,,,,,,,,
1,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop,,,,,,,,,,,
2,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop,,,,,,,,,,,
3,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop,,,,,,,,,,,
4,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop,,,,,,,,,,,
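
The choice of separator matters for genre strings like these, because the entries are separated by a comma followed by a space. A small sketch on made-up values (the Series `genres` is illustrative) that compares both separators and counts the genres per track:

```python
import pandas as pd

# Toy genre strings in the same "a, b, c" layout as "artistGenres"
genres = pd.Series([
    "Afro dancehall, Afropop, Gqom",
    "Afropop",
    None,
])

# Splitting on "," alone keeps the space in front of every later genre
raw = genres.str.split(",", expand=True)

# Splitting on ", " yields clean labels
clean = genres.str.split(", ", expand=True)

# Number of genres per track; tracks without genre information count as 0
n_genres = genres.str.count(", ").add(1).fillna(0).astype(int)

print(repr(raw.loc[0, 1]))
print(repr(clean.loc[0, 1]))
print(n_genres.tolist())
```

A per-track count like `n_genres` also shows how many tracks would lose information if only the first few genre columns are kept.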


In [60]:
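# Keep only the first five genres per song
# .copy() so that the in-place rename further down works on an independent DataFrame
All_songs_cleaned3_Genres_reduced = All_songs_cleaned3_Genres.iloc[:, 0:5].copy()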
All_songs_cleaned3_Genres_reduced.head()

Unnamed: 0,0,1,2,3,4
0,Afro dancehall,Afropop,Azontobeats,Gqom,Nigerian pop
1,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
2,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
3,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
4,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop


In [75]:
All_songs_cleaned3_Genres_reduced.rename(columns={0:"Genre1",
                                                    1:"Genre2",
                                                    2:"Genre3",
                                                    3:"Genre4",
                                                    4:"Genre5"}, inplace=True)

In [76]:
All_songs_cleaned3_Genres_reduced.head()

Unnamed: 0,Genre1,Genre2,Genre3,Genre4,Genre5
0,Afro dancehall,Afropop,Azontobeats,Gqom,Nigerian pop
1,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
2,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
3,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
4,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop


Merge Genre-DataFrame with previous cleaned DataFrame

In [77]:
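# Turn the row labels into an "index" column and use it as the merge key
# (assumes the "index" column of All_songs_cleaned3 matches its row labels)
All_songs_cleaned3_Genres_reduced.reset_index(inplace=True)
All_songs_cleaned4 = All_songs_cleaned3.merge(All_songs_cleaned3_Genres_reduced, on="index")
All_songs_cleaned4.head()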

Unnamed: 0,index,trackName,artistName,trackDuration,addedAt,isLocal,trackPopularity,artistPopularity,secondaryArtistsNames,artistFollowers,...,Energy,Dance,Loud,Valence,Acoustic,Genre1,Genre2,Genre3,Genre4,Genre5
0,0,"My Woman, My Everything (feat. Wandecoal)",Patoranking,03:53,2018-08-19T20:44:37Z,stream,54,61.0,Wande Coal,149884.0,...,85.0,90.0,-3.0,63.0,4.0,Afro dancehall,Afropop,Azontobeats,Gqom,Nigerian pop
1,1,Pana,Tekno,04:02,2018-08-19T20:51:51Z,stream,59,62.0,,175849.0,...,55.0,55.0,-7.0,66.0,31.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
2,2,Jogodo,Tekno,04:23,2018-08-19T20:51:54Z,stream,44,62.0,,175849.0,...,77.0,74.0,-5.0,82.0,38.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
3,3,Duro,Tekno,03:32,2018-08-19T20:52:04Z,stream,0,62.0,,175849.0,...,84.0,74.0,-4.0,97.0,28.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
4,4,Samantha,Tekno,03:49,2018-08-19T20:52:24Z,stream,42,62.0,,175849.0,...,75.0,83.0,-3.0,79.0,3.0,Afro dancehall,Afropop,Azontobeats,Nigerian hip hop,Nigerian pop
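
Because the genre columns are derived from the same DataFrame, they share its row labels, so an index-aligned `join` is a possible alternative to `reset_index` plus `merge`. A minimal sketch on made-up data (the frames `songs`, `genre_cols` and `combined` are illustrative):

```python
import pandas as pd

# Toy frame with a non-contiguous index, as left behind by filtering rows
songs = pd.DataFrame(
    {"trackName": ["Pana", "Duro", "Jogodo"],
     "artistGenres": ["Afropop, Nigerian pop", None, "Afropop"]},
    index=[0, 3, 7],
)

# One column per genre, named Genre1, Genre2, ...
genre_cols = songs["artistGenres"].str.split(", ", expand=True)
genre_cols.columns = [f"Genre{i + 1}" for i in range(genre_cols.shape[1])]

# join() aligns rows on the index, so no separate merge key is needed
combined = songs.join(genre_cols)
print(combined)
```

Unlike the merge, this variant keeps the original row labels and does not rely on a separate "index" column.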


In [80]:
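# "isLocal" is constant ("stream") after filtering, and "artistGenres" is now covered by Genre1-Genre5
All_songs_cleaned4.drop(columns=["isLocal", "artistGenres"], inplace=True)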
All_songs_cleaned4.columns

Index(['index', 'trackName', 'artistName', 'trackDuration', 'addedAt',
       'trackPopularity', 'artistPopularity', 'secondaryArtistsNames',
       'artistFollowers', 'albumRecordLabel', 'albumReleaseDate', 'BPM',
       'Energy', 'Dance', 'Loud', 'Valence', 'Acoustic', 'Genre1', 'Genre2',
       'Genre3', 'Genre4', 'Genre5'],
      dtype='object')

In [81]:
All_songs_cleaned4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3461 entries, 0 to 3460
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   index                  3461 non-null   int64  
 1   trackName              3461 non-null   object 
 2   artistName             3461 non-null   object 
 3   trackDuration          3461 non-null   object 
 4   addedAt                3461 non-null   object 
 5   trackPopularity        3461 non-null   int64  
 6   artistPopularity       3461 non-null   float64
 7   secondaryArtistsNames  1031 non-null   object 
 8   artistFollowers        3461 non-null   float64
 9   albumRecordLabel       3461 non-null   object 
 10  albumReleaseDate       3461 non-null   object 
 11  BPM                    3461 non-null   float64
 12  Energy                 3461 non-null   float64
 13  Dance                  3461 non-null   float64
 14  Loud                   3461 non-null   float64
 15  Vale

In [None]:
# To do:
# - trackDuration [3], addedAt [4], albumReleaseDate [10] -> change data type to datetime
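
A possible way to do these conversions, sketched on a small made-up frame with the same column names and value formats as shown above. It assumes durations are always "mm:ss" and release dates are full "YYYY-MM-DD" strings; tracks longer than an hour or year-only release dates would need extra handling. Treating the duration as a timedelta rather than a datetime is a choice made here, not something the notebook prescribes.

```python
import pandas as pd

# Toy rows in the same string formats as the cleaned dataset
df = pd.DataFrame({
    "trackDuration": ["03:53", "04:02"],
    "addedAt": ["2018-08-19T20:44:37Z", "2018-08-19T20:51:51Z"],
    "albumReleaseDate": ["2015-07-06", "2016-12-20"],
})

# "mm:ss" strings are durations; prefixing "00:" gives the "hh:mm:ss"
# form that pd.to_timedelta parses
df["trackDuration"] = pd.to_timedelta("00:" + df["trackDuration"])

# ISO 8601 timestamps with a trailing "Z" become timezone-aware UTC values
df["addedAt"] = pd.to_datetime(df["addedAt"], utc=True)

# errors="coerce" turns unparseable release dates into NaT instead of raising
df["albumReleaseDate"] = pd.to_datetime(df["albumReleaseDate"], errors="coerce")

print(df.dtypes)
```

After the conversion, expressions such as `df["trackDuration"].dt.total_seconds()` or `df["albumReleaseDate"].dt.year` can be used directly in the EDA.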

2) Dataset for machine learning models

In [None]:
# Remove: Genre1-Genre5, trackDuration, addedAt, secondaryArtistsNames, albumRecordLabel, albumReleaseDate
# Add: "like" column (value 1)

# Load the Kaggle dataset -> select songs that I do not like (3461, to get a balanced dataset)
# -> create a DataFrame with those songs -> add "like" column (value 0) -> merge with the cleaned dataset
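
The plan above could be sketched roughly as follows. The frames `liked` and `kaggle` are made-up stand-ins; the real Kaggle column names may differ, and matching tracks by name alone is a simplification. Sampling at random from tracks that are not in the playlists is also only an approximation of "songs that I do not like".

```python
import pandas as pd

# Toy stand-ins: `liked` for the cleaned playlist songs, `kaggle` for the Kaggle tracks
liked = pd.DataFrame({"trackName": ["a", "b", "c"],
                      "BPM": [112.0, 73.0, 100.0]})
kaggle = pd.DataFrame({"trackName": ["a", "d", "e", "f", "g"],
                       "BPM": [112.0, 90.0, 120.0, 95.0, 130.0]})

# Positive class: every song from the playlists gets like = 1
liked = liked.assign(like=1)

# Exclude Kaggle tracks that already appear among the liked songs
candidates = kaggle[~kaggle["trackName"].isin(liked["trackName"])]

# Draw as many negatives as there are positives to keep the classes balanced
disliked = candidates.sample(n=len(liked), random_state=42).assign(like=0)

# Stack both classes into one modeling dataset
ml_data = pd.concat([liked, disliked], ignore_index=True)
print(ml_data["like"].value_counts())
```

With the real data, `n` would be the 3461 cleaned songs, and both frames would first be reduced to the same set of numeric feature columns.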

3) Datasets for EDA  

In [None]:
# EDA only for selected playlists: Afro House / Hip Hop / ...