# Project: Million Playlist Dataset
## Spotify


1. Present your approach to handling this big dataset. Given that the dataset is big, at about 35 GB,
you are not expected to be able to process all of it. Ideally, of course, you should have a solution
that scales to the whole dataset. Describe the approach you took and all the alternatives you
considered.

## A. Random selection
We could pick a random sample of playlists.
### Simple random selection
First, randomly select some of the 1,000 playlist groups.
### Double random selection
Second, randomly select some of the 1,000 playlists inside each selected group.

However, we could assume that the available playlist groups were already built at random, so that each group is already a representative sample.
Under this assumption, one of the two selection stages, and the loop it requires, can be skipped.
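
The two-stage selection described above can be sketched with the standard library alone. This is a minimal illustration, not the code used in this notebook: the function name `two_stage_sample`, the sample sizes, and the rule that a playlist id equals `group index * 1000 + offset` are all assumptions (the last one is suggested by the slice named `0-999` starting at `pid` 0).

```python
import random


def two_stage_sample(n_groups_total, group_size, n_groups, n_per_group, seed=0):
    """Return sorted playlist ids chosen by a two-stage random selection."""
    rng = random.Random(seed)
    # Stage 1: choose which playlist groups to open.
    groups = rng.sample(range(n_groups_total), n_groups)
    pids = []
    for group in groups:
        # Stage 2: choose playlists inside the selected group.
        offsets = rng.sample(range(group_size), n_per_group)
        # Assumption: pid = group index * group size + position in the group.
        pids.extend(group * group_size + offset for offset in offsets)
    return sorted(pids)


# Hypothetical sizes: 10 groups out of 1,000, then 50 playlists in each.
sample_pids = two_stage_sample(1000, 1000, n_groups=10, n_per_group=50)
```

If the groups are assumed to be random already, stage 2 can be dropped by setting `n_per_group` equal to `group_size`, which keeps every playlist of the selected groups.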

## B. Stratified selection
Another option would be to build a stratified sample based on playlist content, but this requires one full pass over the 1 million playlists.
That pass would build a list of the 1 million playlist names. From that list, we could group playlists by type and compute statistics to drive a stratified selection.
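
As a rough sketch of the stratified selection described above, the playlists can be grouped into strata and sampled in proportion to each stratum's size. Everything here is illustrative: `stratified_sample` and `name_type` are hypothetical helpers, the normalized playlist name is only a stand-in for a real playlist "type", and the toy list replaces the list of names that the full pass would produce.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(playlists, key, fraction, seed=0):
    """Sample the same fraction of playlists from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for playlist in playlists:
        strata[key(playlist)].append(playlist)
    sample = []
    for members in strata.values():
        # Keep at least one playlist per stratum so rare types are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def name_type(playlist):
    """Hypothetical stratum rule: the normalized playlist name."""
    return playlist["name"].strip().lower()


# Toy data standing in for the list built during the full pass.
toy = [{"pid": i, "name": "Rock" if i < 80 else "jazz "} for i in range(100)]
picked = stratified_sample(toy, name_type, fraction=0.1)
counts = Counter(name_type(p) for p in picked)
```

With real data, most playlist names are likely to be too varied to serve as strata directly, so `name_type` would need to map names to a small set of categories first.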

## C. Strategy choice

We finally chose the 2nd strategy: a stratified sample of the 1 million playlists. It is data-expensive, but with Spark it should be quite fast.
In the end, faced with the painful installation and configuration needed to make Spark work on Windows with Jupyter, Python (PySpark), Java, Hadoop, and so on,
we decided to work with Dask instead.

In [4]:
import dask
import dask.dataframe as dd
import dask.bag as db
import pandas as pd
import numpy as np
import json
import os
import re

In [5]:
from pathlib import Path

In [6]:
folder_single = Path('data_spotify/mpd.slice.0-999.json')
folder = Path('data_spotify/')
#folder = 'C:/Users/ngoma/data_spotify/mpd.slice.0-999.json'

In [7]:
def flatten(record):
    """Return one flat dict per track of a slice (top-level keys 'info' and 'playlists')."""
    # 'playlists' is a list and each playlist's 'tracks' is a list,
    # so both levels are iterated instead of being indexed by key.
    return [
        {
            'zip_slice': record['info']['slice'],
            'playlist_name': playlist['name'],
            'playlist_number': playlist['pid'],
            'playlist_total_tracks': playlist['num_tracks'],
            'playlist_total_albums': playlist['num_albums'],
            'playlist_tracks_artist_name': track['artist_name'],
            'playlist_tracks_artist_uri': track['artist_uri'],
            'playlist_tracks_track_name': track['track_name'],
            'playlist_tracks_track_uri': track['track_uri'],
            'playlist_tracks_album_name': track['album_name'],
            'playlist_tracks_album_uri': track['album_uri'],
        }
        for playlist in record['playlists']
        for track in playlist['tracks']
    ]


In [8]:
loose_playlists = db.read_text(folder_single)

In [9]:
loose_playlists.take(20)

('{\n',
 '    "info": {\n',
 '        "generated_on": "2017-12-03 08:41:42.057563", \n',
 '        "slice": "0-999", \n',
 '        "version": "v1"\n',
 '    }, \n',
 '    "playlists": [\n',
 '        {\n',
 '            "name": "Throwbacks", \n',
 '            "collaborative": "false", \n',
 '            "pid": 0, \n',
 '            "modified_at": 1493424000, \n',
 '            "num_tracks": 52, \n',
 '            "num_albums": 47, \n',
 '            "num_followers": 1, \n',
 '            "tracks": [\n',
 '                {\n',
 '                    "pos": 0, \n',
 '                    "artist_name": "Missy Elliott", \n',
 '                    "track_uri": "spotify:track:0UaMYEvWZi0ZqiDOoHU3YI", \n')
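
The output above shows that `read_text` yields one line of text per element, so a pretty-printed slice file cannot be parsed element by element. One possible alternative, sketched here under assumptions, is to build the bag from file paths and parse each file whole. The helpers `load_slice` and `playlist_bag` are hypothetical, and the glob pattern `mpd.slice.*.json` is only inferred from the single file name used in this notebook.

```python
import json
from pathlib import Path

import dask.bag as db


def load_slice(path):
    """Parse one whole slice file and return its list of playlists."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)["playlists"]


def playlist_bag(folder, pattern="mpd.slice.*.json"):
    """Build a lazy bag with one element per playlist across all slice files."""
    # Assumption: every slice file in the folder matches this pattern.
    paths = sorted(str(p) for p in Path(folder).glob(pattern))
    # One path per partition, so each task parses a single file.
    return db.from_sequence(paths, partition_size=1).map(load_slice).flatten()
```

Nothing is read until a result is requested, for example with `playlist_bag("data_spotify/").count().compute()`. Because each task only parses one file, the same code should apply unchanged to a larger share of the dataset, although this has not been measured here.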

In [10]:
#mega_playlist = []

In [14]:
# The file handle gets its own name so the bag `loose_playlists` is not overwritten.
with open(folder_single, 'r') as slice_file:
    straight_playlists = json.load(slice_file)

In [16]:
# Disabled draft: load every slice file in the folder, one at a time.
# Each iteration overwrites straight_folder with the latest slice.
# for kilo_playlist in os.listdir(folder):
#     playlists_path = os.path.join(folder, kilo_playlist)
#     with open(playlists_path, 'r') as loose_playlist:
#         straight_folder = json.load(loose_playlist)

Illustrations: a look at the structure of the loaded slice.

In [19]:
clefs = straight_playlists.keys()

In [21]:
clefs

dict_keys(['info', 'playlists'])

In [23]:
straight_playlists["info"]

{'generated_on': '2017-12-03 08:41:42.057563',
 'slice': '0-999',
 'version': 'v1'}

In [25]:
# npartitions is passed explicitly because older Dask versions require it.
playlists_dd = dd.from_pandas(pd.json_normalize(straight_playlists, record_path="playlists", meta="info"), npartitions=1)

In [26]:
playlists_dd.head(6)

| | name | collaborative | pid | modified_at | num_tracks | num_albums | num_followers | tracks | num_edits | duration_ms | num_artists | description | info |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Throwbacks | False | 0 | 1493424000 | 52 | 47 | 1 | [{'pos': 0, 'artist_name': 'Missy Elliott', 't... | 6 | 11532414 | 37 | | {'generated_on': '2017-12-03 08:41:42.057563',... |
| 1 | Awesome Playlist | False | 1 | 1506556800 | 39 | 23 | 1 | [{'pos': 0, 'artist_name': 'Survivor', 'track_... | 5 | 11656470 | 21 | | {'generated_on': '2017-12-03 08:41:42.057563',... |
| 2 | korean | False | 2 | 1505692800 | 64 | 51 | 1 | [{'pos': 0, 'artist_name': 'Hoody', 'track_uri... | 18 | 14039958 | 31 | | {'generated_on': '2017-12-03 08:41:42.057563',... |
| 3 | mat | False | 3 | 1501027200 | 126 | 107 | 1 | [{'pos': 0, 'artist_name': 'Camille Saint-Saën... | 4 | 28926058 | 86 | | {'generated_on': '2017-12-03 08:41:42.057563',... |
| 4 | 90s | False | 4 | 1401667200 | 17 | 16 | 2 | [{'pos': 0, 'artist_name': 'The Smashing Pumpk... | 7 | 4335282 | 16 | | {'generated_on': '2017-12-03 08:41:42.057563',... |
| 5 | Wedding | False | 5 | 1430956800 | 80 | 71 | 1 | [{'pos': 0, 'artist_name': 'Cali Swag District... | 3 | 19156557 | 56 | | {'generated_on': '2017-12-03 08:41:42.057563',... |


We still have to flatten the tracks field, which holds a list of track records for each playlist.

In [30]:
key_info = ["info",["playlists","pid"],["playlists","name"],["playlists","num_tracks"],["playlists","num_albums"],["playlists","num_artists"]]

In [32]:
# npartitions is passed explicitly because older Dask versions require it.
playlists_dd = dd.from_pandas(pd.json_normalize(straight_playlists, record_path=["playlists", "tracks"], meta=key_info), npartitions=1)

In [33]:
playlists_dd.head(4)

| | pos | artist_name | track_uri | artist_uri | track_name | album_uri | duration_ms | album_name | info | playlists.pid | playlists.name | playlists.num_tracks | playlists.num_albums | playlists.num_artists |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Missy Elliott | spotify:track:0UaMYEvWZi0ZqiDOoHU3YI | spotify:artist:2wIVse2owClT7go1WT98tk | Lose Control (feat. Ciara & Fat Man Scoop) | spotify:album:6vV5UrXcfyQD1wu4Qo2I9K | 226863 | The Cookbook | {'generated_on': '2017-12-03 08:41:42.057563',... | 0 | Throwbacks | 52 | 47 | 37 |
| 1 | 1 | Britney Spears | spotify:track:6I9VzXrHxO9rA9A5euc8Ak | spotify:artist:26dSoYclwsYLMAKD3tpOr4 | Toxic | spotify:album:0z7pVBGOD7HCIB7S8eLkLI | 198800 | In The Zone | {'generated_on': '2017-12-03 08:41:42.057563',... | 0 | Throwbacks | 52 | 47 | 37 |
| 2 | 2 | Beyoncé | spotify:track:0WqIKmW4BTrj3eJFmnCKMv | spotify:artist:6vWDO969PvNqNYHIOW5v0m | Crazy In Love | spotify:album:25hVFAxTlDvXbx2X2QkUkE | 235933 | Dangerously In Love (Alben für die Ewigkeit) | {'generated_on': '2017-12-03 08:41:42.057563',... | 0 | Throwbacks | 52 | 47 | 37 |
| 3 | 3 | Justin Timberlake | spotify:track:1AWQoqb9bSvzTjaLralEkT | spotify:artist:31TPClRtHm23RisEBtV3X7 | Rock Your Body | spotify:album:6QPkyl04rXwTGlGlcYaRoW | 267266 | Justified | {'generated_on': '2017-12-03 08:41:42.057563',... | 0 | Throwbacks | 52 | 47 | 37 |
