# Project: Million Playlist Dataset
## Spotify


1. Present your approach to handling this big dataset. Given that the dataset is big, at about 35 GB,
you are not expected to be able to process all of it. Ideally, of course, you should have a solution
that scales to the whole dataset. Describe the approach you took and all the alternatives you
considered.

## A.Random selection
Randomly, we could pick a sample of playlists.
### Simple random selection
1st, a random selection on the 1000 Json files playlists.
### Double random selection
2nd, a random selection on the 1000 playlists inside the randomly picked Json files playlists.

But we could suppose the available playlists groups were already randomly build. Therefore, represent already a representative sample.
This assumption saves a loop.

## B.Stratified selection
Another option would be to build a stratified sample, based on the playlists content, but it supposes the 1 million playlists have been crossed through once.
Then, to create a list recording the 1 million playlists names. Based on that, we could group playlists by types, and produce some stats to build a stratified selection. 

## C. Strategy choice

We finally chose the 2nd strategy: a stratified sample of the 1 million playlists, data expensive but with SPARK, it should be quite fast...
Eventually, facing the awful installation configuration needed to make Spark work on Windows with Jupyter, Python (Pyspark), Java, Hadoop,...
We decided to work on DASK

In [4]:
import dask, json, os, re, sys
import dask.dataframe as dd
import dask.bag as db
import pandas as pd
import numpy as np
import random as rand

In [5]:
from pathlib import Path

In [6]:
folder_single = Path('data_spotify/mpd.slice.0-999.json')
folder = Path('data_spotify/')

In [7]:
def flatten(record):
    return {
        'zip_slice': record['info']['slice'],
        'playlist_name': record['playlist']['name'],
        'playlist_number': record['playlist']['pid'],
        'playlist_total_tracks': record['playlist']['num_tracks'],
        'playlist_total_albums': record['playlist']['num_albums'],
        'playlist_tracks': record['playlist']['tracks'],
        'playlist_tracks_artist_name':record['playlist']['tracks']['artist_name'],
        'playlist_tracks_artist_uri':record['playlist']['tracks']['artist_uri'],
        'playlist_tracks_track_name':record['playlist']['tracks']['track_name'],
        'playlist_tracks_track_uri':record['playlist']['tracks']['track_uri'],
        'playlist_tracks_album_name':record['playlist']['tracks']['album_name'],
        'playlist_tracks_album_uri':record['playlist']['tracks']['album_uri']
    }


In [8]:
loose_playlists = db.read_text(folder_single)

In [9]:
loose_playlists.take(20)

('{\n',
 '    "info": {\n',
 '        "generated_on": "2017-12-03 08:41:42.057563", \n',
 '        "slice": "0-999", \n',
 '        "version": "v1"\n',
 '    }, \n',
 '    "playlists": [\n',
 '        {\n',
 '            "name": "Throwbacks", \n',
 '            "collaborative": "false", \n',
 '            "pid": 0, \n',
 '            "modified_at": 1493424000, \n',
 '            "num_tracks": 52, \n',
 '            "num_albums": 47, \n',
 '            "num_followers": 1, \n',
 '            "tracks": [\n',
 '                {\n',
 '                    "pos": 0, \n',
 '                    "artist_name": "Missy Elliott", \n',
 '                    "track_uri": "spotify:track:0UaMYEvWZi0ZqiDOoHU3YI", \n')

list of the 1k Json paths

In [11]:
playlists_paths = [os.path.join(folder,kilo_playlist) for kilo_playlist in os.listdir(folder)]

In [12]:
'''for single_path in playlists_paths:
    with open(single_path,'r') as loose_playlists:
         straight_playlists = json.load(loose_playlists)
         mega_playlist.update(straight_playlists)'''

"for single_path in playlists_paths:\n    with open(single_path,'r') as loose_playlists:\n         straight_playlists = json.load(loose_playlists)\n         mega_playlist.update(straight_playlists)"

In [13]:
with open(playlists_paths[11],'r') as loose_playlist:
     straight_playlists = json.load(loose_playlist)

In [14]:
sys.getsizeof(straight_playlists)

184

illustrations

In [16]:
clefs = straight_playlists.keys()

In [17]:
clefs

dict_keys(['info', 'playlists'])

In [18]:
straight_playlists["info"]

{'generated_on': '2017-12-03 08:41:42.057563',
 'slice': '108000-108999',
 'version': 'v1'}

In [19]:
playlists_dd =dd.from_pandas(pd.json_normalize(straight_playlists, record_path="playlists",meta="info"))

In [20]:
playlists_dd.head(6)

Unnamed: 0,name,collaborative,pid,modified_at,num_tracks,num_albums,num_followers,tracks,num_edits,duration_ms,num_artists,description,info
0,Summer 2017,False,108000,1504656000,127,121,2,"[{'pos': 0, 'artist_name': 'Julia Michaels', '...",39,26593378,117,,"{'generated_on': '2017-12-03 08:41:42.057563',..."
1,Ambient,False,108001,1502496000,55,30,3,"[{'pos': 0, 'artist_name': 'A Winged Victory f...",16,21884913,24,,"{'generated_on': '2017-12-03 08:41:42.057563',..."
2,House Party,False,108002,1479340800,88,82,1,"[{'pos': 0, 'artist_name': 'Beyoncé', 'track_u...",2,22139851,62,,"{'generated_on': '2017-12-03 08:41:42.057563',..."
3,Coachella,False,108003,1430611200,72,55,3,"[{'pos': 0, 'artist_name': 'Alison Wonderland'...",7,16964038,40,,"{'generated_on': '2017-12-03 08:41:42.057563',..."
4,cruisin,False,108004,1471737600,82,72,1,"[{'pos': 0, 'artist_name': 'Josh Garrels', 'tr...",17,19765041,63,,"{'generated_on': '2017-12-03 08:41:42.057563',..."
5,lift,False,108005,1490140800,52,50,1,"[{'pos': 0, 'artist_name': 'Ciara', 'track_uri...",29,12205359,45,,"{'generated_on': '2017-12-03 08:41:42.057563',..."


still have to flatten the tracks field

In [22]:
key_info = ["info",["playlists","pid"],["playlists","name"],["playlists","num_tracks"],["playlists","num_albums"],["playlists","num_artists"]]

In [23]:
playlists_dd = dd.from_pandas(pd.json_normalize(straight_playlists, record_path=["playlists","tracks"],meta=key_info))

In [24]:
playlists_dd.compute().shape

(69098, 14)

In [25]:
playlists_dd.head(4)

Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms,album_name,info,playlists.pid,playlists.name,playlists.num_tracks,playlists.num_albums,playlists.num_artists
0,0,Julia Michaels,spotify:track:7hCcLbNDCVPNqAwuoutJ31,spotify:artist:0ZED1XzwlLHW4ZaG4lOT6m,Uh Huh,spotify:album:22aqlgX39cWQMWazg9qwuv,178973,Nervous System,"{'generated_on': '2017-12-03 08:41:42.057563',...",108000,Summer 2017,127,121,117
1,1,Lostboycrow,spotify:track:0to2wztJrgQuAmjDXDK9B1,spotify:artist:5PxCTrv3Y1xVACfngpt7D2,Start Something,spotify:album:2LgngcCBV44eFeM6xs5QaU,235856,Start Something,"{'generated_on': '2017-12-03 08:41:42.057563',...",108000,Summer 2017,127,121,117
2,2,Spirix,spotify:track:6p95ii7EDv1DUDvqvDeMKS,spotify:artist:45DhxmSg67EFf4eB9ZXRFo,Runaway,spotify:album:4uj9FxZtiuSW9q9ykj5SKl,188399,Runaway,"{'generated_on': '2017-12-03 08:41:42.057563',...",108000,Summer 2017,127,121,117
3,3,StéLouse,spotify:track:5r7PIaS6r6V0hghjnmrdI5,spotify:artist:6kyUfziWP8ydAsKzyXsC83,Been So Long,spotify:album:60BOANDxXuiL5WmDeV5e07,246146,StéLouse,"{'generated_on': '2017-12-03 08:41:42.057563',...",108000,Summer 2017,127,121,117


In [26]:
playlists_dd.tail()

Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms,album_name,info,playlists.pid,playlists.name,playlists.num_tracks,playlists.num_albums,playlists.num_artists
69093,9,David Bowie,spotify:track:6cDvpixBNcb1dNw6QmbpEQ,spotify:artist:0oSGxfWSnnOXhD2fKuz2Gy,Space Oddity - Mitty Mix,spotify:album:7tWBYtZrCxFsheRtsqk9ra,311053,Das erstaunliche Leben des Walter Mitty (Origi...,"{'generated_on': '2017-12-03 08:41:42.057563',...",108999,The Secret Life Of Walter Mitty,14,10,10
69094,10,José González,spotify:track:6uuOuNPbiZq5i7CePm76Sb,spotify:artist:6xrCU6zdcSTsG2hLrojpmI,#9 Dream,spotify:album:1WfjeDnn6P7umwi5gLZSGx,337733,The Secret Life Of Walter Mitty,"{'generated_on': '2017-12-03 08:41:42.057563',...",108999,The Secret Life Of Walter Mitty,14,10,10
69095,11,Grace Mitchell,spotify:track:01xAeS2cg8WRrGGc0VlenZ,spotify:artist:3VjZEaHXvNbCmV5jc0VfZR,Maneater,spotify:album:1WfjeDnn6P7umwi5gLZSGx,234906,The Secret Life Of Walter Mitty,"{'generated_on': '2017-12-03 08:41:42.057563',...",108999,The Secret Life Of Walter Mitty,14,10,10
69096,12,Jack Johnson,spotify:track:4VVqD5gzpnt9Yb8AA9Mjjb,spotify:artist:3GBPw9NK25X1Wt2OUvOwY3,Escape (The Pina Colada Song),spotify:album:1WfjeDnn6P7umwi5gLZSGx,242080,The Secret Life Of Walter Mitty,"{'generated_on': '2017-12-03 08:41:42.057563',...",108999,The Secret Life Of Walter Mitty,14,10,10
69097,13,Arcade Fire,spotify:track:6Hmj7SrLRbreLVfVS7mV1S,spotify:artist:3kjuyTCjPG1WMFCiyc5IuB,Wake Up,spotify:album:6ZB8qaR9JNuS0Q0bG1nbcH,335333,Funeral,"{'generated_on': '2017-12-03 08:41:42.057563',...",108999,The Secret Life Of Walter Mitty,14,10,10


build of a random dask dataframe

In [28]:
nb_json=len(playlists_paths)
np.random.seed(398567)
random_sample = np.random.randint(nb_json,size=100)
random_playlists = [playlists_paths[s] for s in random_sample]

In [29]:
random_playlists

['data_spotify\\mpd.slice.970000-970999.json',
 'data_spotify\\mpd.slice.933000-933999.json',
 'data_spotify\\mpd.slice.274000-274999.json',
 'data_spotify\\mpd.slice.990000-990999.json',
 'data_spotify\\mpd.slice.158000-158999.json',
 'data_spotify\\mpd.slice.738000-738999.json',
 'data_spotify\\mpd.slice.807000-807999.json',
 'data_spotify\\mpd.slice.528000-528999.json',
 'data_spotify\\mpd.slice.359000-359999.json',
 'data_spotify\\mpd.slice.15000-15999.json',
 'data_spotify\\mpd.slice.489000-489999.json',
 'data_spotify\\mpd.slice.165000-165999.json',
 'data_spotify\\mpd.slice.187000-187999.json',
 'data_spotify\\mpd.slice.740000-740999.json',
 'data_spotify\\mpd.slice.380000-380999.json',
 'data_spotify\\mpd.slice.517000-517999.json',
 'data_spotify\\mpd.slice.984000-984999.json',
 'data_spotify\\mpd.slice.252000-252999.json',
 'data_spotify\\mpd.slice.602000-602999.json',
 'data_spotify\\mpd.slice.279000-279999.json',
 'data_spotify\\mpd.slice.127000-127999.json',
 'data_spotify\

In [39]:
HECTO_ddf=dd.from_pandas(pd.json_normalize(random_playlists[0], record_path=["playlists","tracks"],meta=key_info))

In [47]:
for i in range(1,len(random_playlists)):
    single_path = random_playlists[i]
    with open(single_path) as loose_playlists:
         straight_playlists = json.load(loose_playlists)
    
    random_dd = dd.from_pandas(pd.json_normalize(straight_playlists, record_path=["playlists","tracks"],meta=key_info))
    HECTO_ddf = dd.concat([HECTO_ddf,random_dd],axis=0)

In [49]:
HECTO_ddf.head(4)

Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms,album_name,info,playlists.pid,playlists.name,playlists.num_tracks,playlists.num_albums,playlists.num_artists
0,0,Fetty Wap,spotify:track:5NQbUaeTEOGdD6hHcre0dZ,spotify:artist:6PXS4YHDkKvl1wkIl4V8DL,679 (feat. Remy Boyz),spotify:album:0TyDTzG2zt2tg0wONT9wSU,196693,679 (feat. Remy Boyz),"{'generated_on': '2017-12-04 03:05:11.774401',...",933000,Old Mix,82,63,40
1,1,Kendrick Lamar,spotify:track:3iVcZ5G6tvkXZkZKlMpIUs,spotify:artist:2YZyLoL8N0Wb9xBt1NhZWg,Alright,spotify:album:7ycBtnsMtyVbbwTfJwRjSP,219333,To Pimp A Butterfly,"{'generated_on': '2017-12-04 03:05:11.774401',...",933000,Old Mix,82,63,40
2,2,Drake,spotify:track:5lFDtgWsjRJu8fPOAyJIAK,spotify:artist:3TVXtAsR1Inumwj472S9r4,Back To Back,spotify:album:79qV4McLzhs8U3FyRKnocz,170637,Back To Back,"{'generated_on': '2017-12-04 03:05:11.774401',...",933000,Old Mix,82,63,40
3,3,Big Sean,spotify:track:4E92bmgFiHxvy2SNjXjC3b,spotify:artist:0c173mlxpT3dSFRgMO8XPh,Blessings,spotify:album:3QZ9gSv35M8vfQA127l3vn,252040,Dark Sky Paradise,"{'generated_on': '2017-12-04 03:05:11.774401',...",933000,Old Mix,82,63,40


In [51]:
HECTO_ddf.tail(4)

Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms,album_name,info,playlists.pid,playlists.name,playlists.num_tracks,playlists.num_albums,playlists.num_artists
65623,28,Michael Jackson,spotify:track:7oOOI85fVQvVnK5ynNMdW7,spotify:artist:3fMbdgg4jU18AjLCKBhRSm,Rock with You - Single Version,spotify:album:2ZytN2cY4Zjrr9ukb2rqTP,220626,Off the Wall,"{'generated_on': '2017-12-03 08:41:42.057563',...",266999,Michael Jackson,32,13,3
65624,29,The Jacksons,spotify:track:3qI94hINNNeb4S7xQi18lS,spotify:artist:2yrbLiuBmc9j81lTX3XUuI,Blame It on the Boogie,spotify:album:46yTC334aJI4QTv4lkzj3a,214200,Destiny,"{'generated_on': '2017-12-03 08:41:42.057563',...",266999,Michael Jackson,32,13,3
65625,30,Michael Jackson,spotify:track:5lA3pwMkBdd24StM90QrNR,spotify:artist:3fMbdgg4jU18AjLCKBhRSm,P.Y.T. (Pretty Young Thing),spotify:album:1C2h7mLntPSeVYciMRTF4a,238733,Thriller 25 Super Deluxe Edition,"{'generated_on': '2017-12-03 08:41:42.057563',...",266999,Michael Jackson,32,13,3
65626,31,Michael Jackson,spotify:track:63gcHyuhAC8G2C0bVQUOJI,spotify:artist:3fMbdgg4jU18AjLCKBhRSm,Rockin' Robin,spotify:album:0F4XW0iBOhNFkbn1BuQ8cu,154840,Got To Be There,"{'generated_on': '2017-12-03 08:41:42.057563',...",266999,Michael Jackson,32,13,3


In [53]:
HECTO_ddf.compute().shape

(6630374, 14)