# Automated Playlist Description Generation (APDG) Preprocessing Notebook

This notebook outlines the process for fetching and preprocessing the Spotify Million Playlist Dataset (MPD) to prepare it for training a model capable of generating playlist descriptions. The preprocessing steps are based on criteria outlined in a recent study, focusing on playlist titles that incorporate common features among the songs.

## Setup and Imports
This section imports necessary libraries and defines the setup for our preprocessing task.

In [13]:
import json
import os
from itertools import islice
import numpy as np
from sklearn.model_selection import train_test_split

## Fetching Data
Due to the large size of the Spotify Million Playlist Dataset, it is typically downloaded manually from the Spotify research website and extracted into a local directory. This section assumes that the dataset is already available locally.

Note: Since the Spotify Million Playlist Dataset is large (several GBs), it's typically downloaded manually from the Spotify research website and extracted to a local directory. The following step assumes you have already downloaded and extracted the dataset.

## Load Dataset

This section includes the function load_spotify_dataset(directory_path), which loads the playlists from the Spotify Million Playlist Dataset from a specified directory. Each playlist in the dataset includes a unique identifier, the playlist title, and a list of track URIs.

In [2]:
def load_mpd_dataset(directory_path, max_files=None):
    json_files = (file for file in os.listdir(directory_path) if file.endswith('.json'))
    playlists = []

    for filename in islice(json_files, max_files):
        with open(os.path.join(directory_path, filename), 'r') as f:
            data = json.load(f)
            playlists.extend(data['playlists'])

    return playlists

# Path to the dataset
dataset_path = "/Users/bestricemossberg/Projects/automated-playlist-description-generation-system/spotify_million_playlist_dataset/data"

# Load a limited dataset for testing (e.g., only the first 50 files)
playlists = load_mpd_dataset(dataset_path, max_files=50)

## Preprocess Dataset

The function preprocess_playlists(playlists) filters the playlists based on several criteria:

- The number of tracks in the playlist must be more than 10.
- The playlist title must have more than 3 tokens.
- The average character length of title tokens must be more than 3.
- Titles are normalized to lowercase to ensure consistency.

This section aims to refine the dataset by removing playlists that do not meet these criteria, reducing noise and improving the quality of our training data.

In [14]:
# Preprocess the playlists
def preprocess_playlists(playlists):
    return [
        {
            'pid': playlist['pid'],
            'name': playlist['name'].lower(),
            'description': playlist['description'].lower(),
            'tracks': [track['track_uri'] for track in playlist['tracks']]
        }
        for playlist in playlists
        if "description" in playlist and "name" in playlist and len(playlist['tracks']) > 10 # Filter out playlists with less than 10 tracks
        and len((tokens := playlist['description'].split())) > 3 # Filter out playlists with less than 3 words in the description
        and (char_length := sum(len(token) for token in tokens) / len(tokens)) > 3 # Filter out playlists with less than 3 characters per word in the description
    ]

preprocessed_playlists = preprocess_playlists(playlists)

In [11]:
preprocessed_playlists[0]

{'pid': 549056,
 'name': 'indie vibes',
 'description': 'that good summer vibe feeling &lt;3',
 'tracks': ['spotify:track:0GO8y8jQk1PkHzS31d699N',
  'spotify:track:1lbWbnWiEbAya5FlCzfsrq',
  'spotify:track:3bhhM8sG53lsPYRpakieZB',
  'spotify:track:5DfWswkGoWTEUJrflSC9hN',
  'spotify:track:316r1KLN0bcmpr7TZcMCXT',
  'spotify:track:4WiiRw2PHMNQE0ad6y6GdD',
  'spotify:track:5vgdeMt4uKUN2BeltZjoDh',
  'spotify:track:5CgihnZO9To8wj7ALOoTPD',
  'spotify:track:51cd3bzVmLAjlnsSZn4ecW',
  'spotify:track:6jrMVRReY24qzCfe1BRrww',
  'spotify:track:2jnvdMCTvtdVCci3YLqxGY',
  'spotify:track:0UeYCHOETPfai02uskjJ3x',
  'spotify:track:4M1xxMtl43A2JBMYLeF9Gg',
  'spotify:track:2EDuTLFathp2H49IfULO9G',
  'spotify:track:1aGvLFHJ2shKqO9uycaUcW',
  'spotify:track:5R9CJ2SnHywwwjGQwCLiIL',
  'spotify:track:6PZ5g4V0DM1sILf1oLlS42',
  'spotify:track:6SRdJTBk65cxlI87QfZmWw',
  'spotify:track:3KIIwkf6lNwJqLcx6GUIzr',
  'spotify:track:0FFTiimY7SZfLj2hPDOUl3',
  'spotify:track:3pLTOP0G0etiWUknFoRpsr',
  'spotify:tr

## Split Dataset
After preprocessing, we need to split our dataset into training, validation, and test sets. This is crucial for training our model effectively and evaluating its performance. The function split_dataset(playlists, test_size=0.1, validation_size=0.1) handles this task, ensuring that we have a balanced split.

In [16]:
def split_dataset(playlists, test_size=0.1, validation_size=0.1):
    train_val, test = train_test_split(playlists, test_size=test_size, random_state=42)
    train, validation = train_test_split(train_val, test_size=validation_size/(1-test_size), random_state=42)
    return train, validation, test

train, validation, test = split_dataset(preprocessed_playlists)

## Save Preprocessed Data
Once the playlists are preprocessed and split, we save them to disk for future use in model training and evaluation. The function save_dataset(datasets, directory_path) saves the data in JSON format, making it easy to load and use in various stages of the project.

In [18]:
def save_dataset(datasets, directory_path):
    for name, data in datasets.items():
        with open(os.path.join(directory_path, f"{name}.json"), 'w') as f:
            json.dump(data, f, ensure_ascii=False, indent=4)

save_path = '/Users/bestricemossberg/Projects/automated-playlist-description-generation-system/datasets' # Replace with the path to save preprocessed data
save_dataset({'train': train, 'validation': validation, 'test': test}, save_path)

## Main Workflow
The main workflow orchestrates the loading, preprocessing, splitting, and saving of the Spotify Million Playlist Dataset. This ensures that the dataset is ready for use in training models capable of generating meaningful and relevant playlist titles based on the tracks they contain.

In [None]:
if __name__ == "__main__":
    dataset_path = "path/to/spotify_million_playlist_dataset"
    output_path = "path/to/preprocessed_dataset"
    playlists = load_spotify_dataset(dataset_path)
    preprocessed_playlists = preprocess_playlists(playlists)
    train, validation, test = split_dataset(preprocessed_playlists)
    save_dataset({'train': train, 'validation': validation, 'test': test}, output_path)


This notebook provides a structured approach to preparing the Spotify Million Playlist Dataset for the task of generating playlist descriptions, adhering to the preprocessing criteria specified in the study.