# Preprocessing

## Goal
The goal of this notebook is to create a utility matrix consisting of playlists (rows) and the tracks included in those playlists (columns). The data package provided by AIcrowd, [here](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge/dataset_files), is split into 1,000 separate JSON files, each containing 1,000 playlists, for a total of 1,000,000 playlists. The package also includes a text file, `stats.txt`, with a basic summary of the dataset. That file was particularly useful because it told me to expect 2,262,292 unique tracks. Given this information, I expect the dimensions of the final utility matrix to be 1,000,000 by 2,262,292.

**NOTICE**

* The data package from AIcrowd is much too large to upload to GitHub. You will have to navigate to the link above, download the data package (ZIP file - 5.39GB) to the project folder on your local computer, and then extract the contents of the ZIP file there.
* This notebook may require more than 8GB of RAM to run successfully. 

In [None]:
# Import entire modules
import json
import numpy as np
import pandas as pd
import sys
 
# Import specific functions from modules
# from pathlib import Path
from sklearn.preprocessing import MultiLabelBinarizer
from scipy.sparse import save_npz
from scipy.sparse import load_npz

# Remove warnings as required
# import warnings
# warnings.simplefilter(action='ignore', category=FutureWarning)
# warnings.simplefilter("ignore", UserWarning)

The first objective is to create a `for` loop that iterates through each JSON file to ultimately create a DataFrame, `final_df`, with 1 million rows representing playlists and one column that consists of lists of tracks pertaining to each playlist.

The first step to building the `for` loop is to read from the JSON files. Each of the JSON file names has two identifying features: an initial playlist number ending in 0, `initial_num`, and a final playlist number ending in 999, `final_num`. Applying an `incrementer` of 1,000 to both the initial and final playlist numbers within the `for` loop allows us to effectively read from each JSON file.
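
The naming pattern described above can be sketched on its own, without reading any files. This is a minimal illustration, not the notebook's actual loop: the helper name `slice_paths` and its parameters are hypothetical, and the directory is the one the loading cell already assumes.

```python
from pathlib import Path


def slice_paths(data_dir, num_files=1000, slice_size=1000):
    """Return the expected JSON slice paths in order (hypothetical helper)."""
    paths = []
    for file_index in range(num_files):
        # First playlist number in the slice ends in 0, last one ends in 999.
        initial_num = file_index * slice_size
        final_num = initial_num + slice_size - 1
        paths.append(Path(data_dir) / f"mpd.slice.{initial_num}-{final_num}.json")
    return paths


# Only the first three slices, to keep the output short.
paths = slice_paths("./spotify_million_playlist_dataset/data", num_files=3)
print([p.name for p in paths])
```

Deriving both numbers from the file index gives the same sequence as adding `incrementer` to `initial_num` and `final_num` on every pass.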

Next, we have to pull the tracks from each playlist. This requires a nested `for` loop that populates a temporary list, `data`, with 1,000 lists, one per playlist. Each list holds every track in that playlist, and each track carries the identifying information shown below:
   * `track_name` - the name of the track
   * `track_uri` - the Spotify URI of the track
   * `album_name` - the name of the track's album
   * `album_uri` - the Spotify URI of the album
   * `artist_name` - the name of the track's primary artist
   * `artist_uri` - the Spotify URI of the track's primary artist
   * `duration_ms` - the duration of the track in milliseconds
   * `pos` - the position of the track in the playlist (zero-based)

Once `data` has been fully populated with 1,000 lists from the nested `for` loop, I convert `data` to a temporary DataFrame, `df`, with dimensions 1,000 by 1. I then manipulate the single column in `df` to create a new column that represents a list of tracks with only one identifying feature per track, as opposed to all of the identifying features mentioned above. I also decided to use `track_uri` instead of `track_name` as the primary identifying feature for a track, so I could pull additional track data from Spotify's API later if needed.
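
To make the reduction to a single identifying feature concrete, here is a small sketch on made-up data. The dictionary `mock_slice` is a hypothetical stand-in for one loaded JSON file; it contains only a few of the fields listed above, and the URIs are invented.

```python
# Made-up stand-in for the contents of one JSON slice (two tiny playlists).
mock_slice = {
    "playlists": [
        {"tracks": [
            {"track_name": "Song A", "track_uri": "spotify:track:aaa", "pos": 0},
            {"track_name": "Song B", "track_uri": "spotify:track:bbb", "pos": 1},
        ]},
        {"tracks": [
            {"track_name": "Song B", "track_uri": "spotify:track:bbb", "pos": 0},
        ]},
    ]
}

# Keep only the `track_uri` of every track, one list per playlist.
track_uris = [
    [track["track_uri"] for track in playlist["tracks"]]
    for playlist in mock_slice["playlists"]
]
print(track_uris)
```

Each inner list then holds short strings rather than full track dictionaries, which is where the memory saving comes from.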

Since memory consumption is an issue with this dataset, keeping a single identifying feature per track reduced the problem considerably. As you will find, I also took additional measures throughout this notebook to reduce memory consumption as much as I could.

The final step to the `for` loop before iterating to the next JSON file is to concatenate `df` with `final_df`, essentially adding the list of tracks for each playlist from the currently open JSON file to the final DataFrame.
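
One possible variation on this step, shown only as a sketch: since `pd.concat` builds a new DataFrame on every call, a common alternative is to collect the per-file frames in a list and concatenate once after the loop. The data below is made up, and `per_file_uris` and `frames` are hypothetical names that are not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up per-file results standing in for the per-slice lists of track URIs.
per_file_uris = [
    [["spotify:track:aaa", "spotify:track:bbb"], ["spotify:track:bbb"]],
    [["spotify:track:ccc"]],
]

# Collect one small DataFrame per file instead of growing `final_df` each pass.
frames = []
for uris in per_file_uris:
    frames.append(pd.DataFrame({"track_uris": uris}))

# A single concatenation at the end; `ignore_index=True` also resets the index.
final_df = pd.concat(frames, ignore_index=True)
print(final_df.shape)
```

With `ignore_index=True` the separate index reset becomes unnecessary, although the end result should match the incremental approach.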

**NOTICE**: This block of code will take a while to run.

In [None]:
# DO NOT CHANGE THESE VALUES!!!
initial_num = 0
final_num = 999
incrementer = 1000

# If memory is a limitation, reduce the `num_files` as needed.
num_files = 1000

# Declaring empty DataFrame, `final_df`
# This DataFrame will consist of the full amount of playlists (aka 1,000,000)
# and one column that consists of lists of tracks pertaining to each playlist.
final_df = pd.DataFrame()

# The following `for` loop is used to iterate through each JSON file and populate `final_df`
for file_index in range(0, num_files):
    # `print` function shows the progress of the `for` loop
    print(file_index)
    
    # Declaring empty list, `data`
    data = []
    
    # Opening the JSON file and creating a dictionary, `d`, from its contents.
    # The `with` block closes the file automatically once the data is loaded.
    with open(f'./spotify_million_playlist_dataset/data/mpd.slice.{initial_num}-{final_num}.json') as f:
        d = json.load(f)
    
    # The following `for` loop is used to populate `data` with 1,000 lists.
    # Each list pertains to one playlist and consists of every track in that playlist.
    # Additionally, each track has identifying information pertaining to it.
    for playlist in d['playlists']:
        data.append(playlist['tracks'])
    
    # Converting `data` from a list to a DataFrame with dimensions 1,000 by 1
    df = pd.DataFrame(pd.Series(data))
    df.rename(columns={0: "tracks"}, inplace=True)
    
    # Creating an additional column within `df` that will include lists of tracks with one identifying feature.
    # The primary identifying feature chosen: `track_uri`
    df['track_uris'] = df['tracks'].map(lambda x: [track['track_uri'] for track in x])
    
    # Dropping the first column, which is no longer needed, to reduce memory consumption
    df.drop(columns='tracks', inplace=True)
    
    # Concatenating `final_df` with `df`
    final_df = pd.concat([final_df, df])
    
    # Incrementing initial and final playlist numbers in order to select the next JSON file
    initial_num += incrementer
    final_num += incrementer

# Reducing memory used by the following variables
d = {}
data = []
df = pd.DataFrame()

In [None]:
# Output Expectation: (1000000, 1)
final_df.shape

In [None]:
# Reset indices to `final_df`
final_df.reset_index(drop=True, inplace=True)

In [None]:
# Here's a good visual representation of the DataFrame in its current state
final_df

`final_df` looks perfect so far! Just one more step to create the final utility matrix we desire. This will require the use of a multilabel binarizer to create a Compressed Sparse Row (CSR) matrix that will act as our final utility matrix.

A multilabel binarizer is well suited to what we want to accomplish. We want one column for each track in the entire dataset and, most importantly, a 1 (yes) or a 0 (no) in each cell indicating whether that track is included in a given playlist. A multilabel binarizer does exactly this, and scikit-learn's `MultiLabelBinarizer` can output a CSR matrix if the `sparse_output` parameter is set to `True`. Since most of the elements in our utility matrix will be zero-valued, a CSR matrix is an ideal output datatype for reducing memory consumption.
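
The behaviour described above is easy to see on a toy example. The three playlists and their URIs below are made up for illustration; only `MultiLabelBinarizer` and its `sparse_output` parameter come from this notebook.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three made-up playlists, each a list of track URIs.
toy_playlists = [
    ["spotify:track:aaa", "spotify:track:bbb"],
    ["spotify:track:bbb"],
    ["spotify:track:ccc", "spotify:track:aaa"],
]

mlb_toy = MultiLabelBinarizer(sparse_output=True)
U_toy = mlb_toy.fit_transform(toy_playlists)

print(U_toy.shape)            # playlists by unique tracks
print(list(mlb_toy.classes_)) # column labels, in sorted order
print(U_toy.toarray())        # dense view, only sensible for tiny matrices
```

Rows follow the order of the input playlists and columns follow `classes_`, which is why those labels are worth saving alongside the matrix.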

**NOTICE**: This block of code will take a while to run.

In [None]:
mlb = MultiLabelBinarizer(sparse_output=True)

U = mlb.fit_transform(final_df.pop('track_uris'))

U

Notice that the dimensions of `U` are exactly what we had hoped for: 1,000,000 by 2,262,292!

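Additionally, `sys.getsizeof(U)` reports only 48 bytes! Keep in mind, though, that this figure covers only the small Python wrapper object and not the `data`, `indices`, and `indptr` arrays that actually store the matrix, so the true footprint is larger (although still far smaller than a dense matrix of the same shape).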

In [None]:
# Output Expectation: 48 (bytes); this is the wrapper object only, not the stored matrix data
sys.getsizeof(U)
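
For a fuller picture of memory use, one option is to add up the sizes of the three arrays that back a CSR matrix. The sketch below uses a small made-up matrix; `csr_nbytes` is a hypothetical helper, and the exact byte counts depend on the dtypes SciPy selects.

```python
import sys

import numpy as np
from scipy.sparse import csr_matrix

# Made-up matrix: 1,000 by 2,000 with 10,000 non-zero entries.
dense = np.zeros((1000, 2000), dtype=np.int64)
dense[::10, ::20] = 1
U_toy = csr_matrix(dense)


def csr_nbytes(m):
    """Bytes held by the arrays backing a CSR matrix (hypothetical helper)."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


print(sys.getsizeof(U_toy))  # wrapper object only
print(csr_nbytes(U_toy))     # stored values plus index arrays
print(dense.nbytes)          # dense equivalent
```

Calling the same helper on `U` should give a more realistic figure to compare against the DataFrame version.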

To avoid rerunning the time-consuming block of code above, I'm going to save the CSR matrix as an NPZ file and the arrays of tracks and playlists as NPY files to a folder called `tmp`, so I can simply load them into the modeling notebook later. This will save lots of time down the line.

In [None]:
save_npz('./tmp/U.npz', U)

In [None]:
tracks = mlb.classes_
np.save('./tmp/tracks.npy', tracks)

In [None]:
playlists = np.asarray(final_df.index)
np.save('./tmp/playlists.npy', playlists)

In [None]:
U = load_npz('./tmp/U.npz')

U

In [None]:
tracks = np.load('./tmp/tracks.npy', allow_pickle=True)

display(len(tracks))
display(tracks)

In [None]:
playlists = np.load('./tmp/playlists.npy', allow_pickle=True)

display(len(playlists))
display(playlists)

It appears that loading in the NPZ and NPY files we created earlier is working well!
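
Beyond inspecting the output by eye, a round trip can also be checked programmatically. The following is a sketch on made-up data that writes to a temporary directory rather than `./tmp`; the `_toy` and `_loaded` names are hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz

# Made-up matrix and labels standing in for `U` and `tracks`.
U_toy = csr_matrix(np.array([[1, 0, 1], [0, 1, 0]]))
tracks_toy = np.array(
    ["spotify:track:aaa", "spotify:track:bbb", "spotify:track:ccc"], dtype=object
)

with tempfile.TemporaryDirectory() as tmp_dir:
    u_path = os.path.join(tmp_dir, "U.npz")
    tracks_path = os.path.join(tmp_dir, "tracks.npy")

    save_npz(u_path, U_toy)
    np.save(tracks_path, tracks_toy)

    U_loaded = load_npz(u_path)
    # Object arrays are pickled on save, so loading needs `allow_pickle=True`.
    tracks_loaded = np.load(tracks_path, allow_pickle=True)

# No differing entries means the matrix survived the round trip unchanged.
same_matrix = (U_toy != U_loaded).nnz == 0
same_tracks = np.array_equal(tracks_toy, tracks_loaded)
print(same_matrix, same_tracks)
```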

Although we have a utility matrix that looks promising, let's still convert the CSR matrix to a DataFrame in an effort to visually verify that the CSR matrix is a correct representation of the data.

In [None]:
final_df = pd.DataFrame.sparse.from_spmatrix(U, index=playlists, columns=tracks)

# final_df = pd.DataFrame.sparse.from_spmatrix(U, index=final_df.index, columns=mlb.classes_)

In [None]:
# Output Expectation: (1000000, 2262292)
final_df.shape

In [None]:
# Here's a good visual representation of the utility matrix in its final state
final_df

In [None]:
# Output Expectation: 6 occurrences of this specific song
final_df['spotify:track:0002yNGLtYSYtc0X6ZnFvp'].value_counts()

The data in the DataFrame version of the utility matrix appears to be a correct representation of the dataset, which implies that the CSR matrix is as well. It's now time to move on to the modeling phase!
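
A similar spot check can be run on the CSR matrix directly, without building a DataFrame. The sketch below uses made-up data, and `playlist_count` is a hypothetical helper; it assumes the column labels are stored in an array like `tracks`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up labels and matrix standing in for `tracks` and `U`.
tracks_toy = np.array(
    ["spotify:track:aaa", "spotify:track:bbb", "spotify:track:ccc"], dtype=object
)
U_toy = csr_matrix(np.array([[1, 1, 0], [0, 1, 0], [1, 0, 1]]))


def playlist_count(U, tracks, track_uri):
    """Number of playlists containing `track_uri` (hypothetical helper)."""
    # Position of the track in the label array is its column in the matrix.
    column = np.flatnonzero(tracks == track_uri)[0]
    return int(U[:, column].sum())


print(playlist_count(U_toy, tracks_toy, "spotify:track:bbb"))
```

Column slicing on a CSR matrix is not its fastest operation, so converting to CSC first may be worthwhile if many columns need to be checked.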

Side Note: The DataFrame version of our utility matrix is significantly larger than the CSR matrix, although `sys.getsizeof` understates the CSR matrix's true size because it does not count the arrays that back it.

In [None]:
# # NOTICE: This block of code will take a while to run.
# # Output Expectation: 531718224 (bytes)
# sys.getsizeof(final_df)

In [None]:
# # This will take absolutely forever to run for this DataFrame,
# # but this could be useful down the road.

# filepath = Path('./output_files/playlists_vs_songs.csv')  
# filepath.parent.mkdir(parents=True, exist_ok=True)  
# final_df.to_csv(filepath, index=False)