# Spotify Million Playlist Project

For this project, I will look into the [AI Crowd - Spotify Million Playlist Dataset Challenge](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge). If you would like to explore the dataset yourself, download instructions are available at the link above.

My version of the project will take 100,000 playlists from the dataset and explore them for interesting insights. Afterwards, I will attempt to build a set of recommended songs for a given playlist, based on how many songs it has and which songs are already in it.

To start, we will bring in the necessary libraries and load our data into the dataframe *df*.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from zipfile import ZipFile
from urllib.request import urlopen   
from io import BytesIO
import json
import os
import glob

In [3]:
path_to_json = 'data/' 

json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

In [4]:
temp = []

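# Read the first 100 JSON files and collect each file's list of playlists
for file in file_list[:100]:
    # Use a context manager so each file handle is closed after reading
    with open(file) as f:
        data = json.load(f)
    temp.append(data['playlists'])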

In [5]:
df0 = pd.DataFrame(temp[0])

In [6]:
# Concatenate all per-file frames in a single call rather than growing df inside a loop
df = pd.concat([df0] + [pd.DataFrame(item) for item in temp[1:]], ignore_index=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 12 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   name           100000 non-null  object
 1   collaborative  100000 non-null  object
 2   pid            100000 non-null  int64 
 3   modified_at    100000 non-null  int64 
 4   num_tracks     100000 non-null  int64 
 5   num_albums     100000 non-null  int64 
 6   num_followers  100000 non-null  int64 
 7   tracks         100000 non-null  object
 8   num_edits      100000 non-null  int64 
 9   duration_ms    100000 non-null  int64 
 10  num_artists    100000 non-null  int64 
 11  description    1796 non-null    object
dtypes: int64(8), object(4)
memory usage: 9.2+ MB


In [7]:
df.head()

| | name | collaborative | pid | modified_at | num_tracks | num_albums | num_followers | tracks | num_edits | duration_ms | num_artists | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Throwbacks | False | 0 | 1493424000 | 52 | 47 | 1 | `[{'pos': 0, 'artist_name': 'Missy Elliott', 't...` | 6 | 11532414 | 37 | |
| 1 | Awesome Playlist | False | 1 | 1506556800 | 39 | 23 | 1 | `[{'pos': 0, 'artist_name': 'Survivor', 'track_...` | 5 | 11656470 | 21 | |
| 2 | korean | False | 2 | 1505692800 | 64 | 51 | 1 | `[{'pos': 0, 'artist_name': 'Hoody', 'track_uri...` | 18 | 14039958 | 31 | |
| 3 | mat | False | 3 | 1501027200 | 126 | 107 | 1 | `[{'pos': 0, 'artist_name': 'Camille Saint-Saën...` | 4 | 28926058 | 86 | |
| 4 | 90s | False | 4 | 1401667200 | 17 | 16 | 2 | `[{'pos': 0, 'artist_name': 'The Smashing Pumpk...` | 7 | 4335282 | 16 | |


Now that the data is loaded, let's see what insights we can gather. First, note that `description` is the only column with missing data: only 1,796 of the 100,000 playlists have one. We can inspect this column to see whether anything in it is worth keeping, or whether too much is missing for it to be useful.

In [8]:
df[df['description'].notnull()]['description'].head(10)

94                                  chilllll out
102                                          uzi
320                           sit back and chill
329                            el espanish trap.
339             roasty toasty in the holy ghosty
353                           Always thinking...
354    What I listen to crusing on my motorcycle
370                               merry chrysler
475                                          sad
491                   A little bit of everything
Name: description, dtype: object

The first 10 non-null entries show that descriptions carry some information about the purpose of each playlist.
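
To put a number on how sparse the column is, a quick check along these lines may help. This is a minimal sketch on a toy frame: `toy` and its values are made up for illustration, and the same two calls would be applied to `df['description']` on the real data.

```python
import pandas as pd

# Toy stand-in for df; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "name": ["chill", "gym", "road trip", "focus"],
    "description": ["sit back and chill", None, None, "deep work"],
})

# Count and share of playlists that actually have a description.
n_described = int(toy["description"].notna().sum())
share_described = toy["description"].notna().mean()

print(f"{n_described} of {len(toy)} playlists have a description ({share_described:.1%})")
```

Given the counts reported by `df.info()`, the same calculation on `df` would come out to roughly 1.8%, which suggests the column is better suited to spot checks than to analysis across all playlists.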

In [9]:
artists = [item.get('artist_name') for i in df['tracks'] for item in i]
tracks = [str(item.get('artist_name'))+': '+str(item.get('track_name')) for i in df['tracks'] for item in i]

In [10]:
tracks[:10]

['Missy Elliott: Lose Control (feat. Ciara & Fat Man Scoop)',
 'Britney Spears: Toxic',
 'Beyoncé: Crazy In Love',
 'Justin Timberlake: Rock Your Body',
 "Shaggy: It Wasn't Me",
 'Usher: Yeah!',
 'Usher: My Boo',
 'The Pussycat Dolls: Buttons',
 "Destiny's Child: Say My Name",
 'OutKast: Hey Ya! - Radio Mix / Club Mix']

In [11]:
artists[:10]

['Missy Elliott',
 'Britney Spears',
 'Beyoncé',
 'Justin Timberlake',
 'Shaggy',
 'Usher',
 'Usher',
 'The Pussycat Dolls',
 "Destiny's Child",
 'OutKast']

In [12]:
from collections import Counter

In [13]:
Counter(tracks).most_common(10)

[('Kendrick Lamar: HUMBLE.', 4562),
 ('Drake: One Dance', 4355),
 ('DRAM: Broccoli (feat. Lil Yachty)', 4105),
 ('The Chainsmokers: Closer', 4015),
 ('Post Malone: Congratulations', 3985),
 ('Aminé: Caroline', 3540),
 ('Khalid: Location', 3510),
 ('Migos: Bad and Boujee (feat. Lil Uzi Vert)', 3480),
 ('KYLE: iSpy (feat. Lil Yachty)', 3473),
 ('Lil Uzi Vert: XO TOUR Llif3', 3456)]

Below are the 10 most common artists across all 100,000 sampled playlists, counted by track appearances:

In [14]:
Counter(artists).most_common(10)

[('Drake', 85765),
 ('Kanye West', 42109),
 ('Kendrick Lamar', 34420),
 ('Rihanna', 33330),
 ('The Weeknd', 31199),
 ('Eminem', 30087),
 ('Ed Sheeran', 27154),
 ('Future', 25780),
 ('J. Cole', 24684),
 ('Justin Bieber', 23970)]
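
Note that `Counter(artists)` counts every track appearance, so an artist with many songs in a single playlist is counted many times. If the question is instead how many playlists feature an artist at least once, one option is to de-duplicate artists within each playlist before counting. The sketch below uses a small made-up list in place of `df['tracks']`; `toy_tracks`, `track_counts`, and `playlist_counts` are illustrative names, not part of the notebook.

```python
from collections import Counter

# Toy stand-in for df['tracks']: one list of track dicts per playlist (made-up data).
toy_tracks = [
    [{"artist_name": "Drake"}, {"artist_name": "Drake"}, {"artist_name": "Rihanna"}],
    [{"artist_name": "Drake"}],
    [{"artist_name": "Rihanna"}, {"artist_name": "Usher"}],
]

# Track-level count: every appearance of an artist is counted.
track_counts = Counter(t.get("artist_name") for pl in toy_tracks for t in pl)

# Playlist-level count: each artist is counted at most once per playlist.
playlist_counts = Counter(
    artist
    for pl in toy_tracks
    for artist in {t.get("artist_name") for t in pl}
)

print(track_counts.most_common(3))
print(playlist_counts.most_common(3))
```

The two rankings can differ noticeably when a few playlists are dominated by a single artist, so it is worth deciding which notion of "most common" the analysis needs.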

#### Playlist Title examination

In [15]:
import spacy
import nltk

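# Load the small English pipeline; its tagger, parser, and NER components are enabled by default,
# and recent spaCy versions reject the old parse/tag/entity keyword arguments
nlp = spacy.load('en_core_web_sm')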

In [16]:
def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

playlists = list(df['name'])

In [17]:
playlist_lemma = []
for i in playlists:
    playlist_lemma.append(lemmatize_text(i))


In [1]:
from sklearn.cluster import KMeans

In [20]:
playlist_lemma[:10]

['throwback',
 'Awesome Playlist',
 'korean',
 'mat',
 '90s',
 'wedding',
 'I put a spell on You',
 '2017',
 'BOP',
 'old country']

In [8]:
df.to_csv('data/sample.csv')