# PlaylistDivider
A program that uses machine learning to learn each user's own definition of a category, then splits that user's playlists into smaller playlists along those categories for better organization.

* Use the Spotify API together with the last.fm API (the last.fm API doesn't seem to be working, but the project can still be done with the GetSongBPM API)
* Use pandas and NumPy to manipulate and store the data
* Use scikit-learn for ML

How to scale up:
Add a login system so that a separate model is trained for each user. Each model learns what that user means by 'happy' or 'sad' and can therefore divide that user's songs more accurately.

https://getsongbpm.com/api

https://www.last.fm/api/account/create
https://www.last.fm/music/Sabrina+Carpenter/_/Tears
https://listenbrainz.readthedocs.io/en/latest/users/api-compat.html
https://www.reddit.com/r/spotifyapi/comments/1ldtwro/track_audio_feature_substitute/
https://www.reddit.com/r/spotifyapi/comments/1h75k49/spotifys_api_changes_hurt_developersheres_a/
https://developer.spotify.com/blog/2024-11-27-changes-to-the-web-api
https://developer.spotify.com/documentation/web-api/reference/get-audio-features

https://www.kaggle.com/docs/api#authentication
https://www.kaggle.com/settings
https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset
https://www.kaggle.com/datasets/conorvaneden/best-songs-on-spotify-for-every-year-2000-2023
https://www.kaggle.com/datasets/undefinenull/million-song-dataset-spotify-lastfm

https://developer.spotify.com/dashboard/8c862e73e7714036837f2fec988afcda
https://developer.spotify.com/documentation/web-api/concepts/access-token
https://developer.spotify.com/documentation/web-api/concepts/authorization

**This is the model training script**

This project is divided into two parts: this script trains the model, and the other script lets you choose a playlist and then divides it.

In [24]:
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import pylast
import pandas as pd
from tqdm import tqdm
import time
import joblib
import os
import random
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


In [18]:
#Authentication
SPOTIPY_CLIENT_ID = os.getenv("SPOTIPY_CLIENT_ID")
SPOTIPY_CLIENT_SECRET = os.getenv("SPOTIPY_CLIENT_SECRET")
SPOTIPY_REDIRECT_URI = os.getenv("SPOTIPY_REDIRECT_URI")

LASTFM_API_KEY = os.getenv("LASTFM_API_KEY")
LASTFM_API_SECRET = os.getenv("LASTFM_API_SECRET")
LASTFM_PSWD = os.getenv("LASTFM_PSWD")

# To generate your password hash:
# password_hash = pylast.md5("YourLastFmPassword")

USERNAME = "bujjujj"
PASSWORD_HASH = pylast.md5(LASTFM_PSWD)

# Authenticate with both services
print("Connecting to APIs...")
try:
    sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
        client_id=SPOTIPY_CLIENT_ID,
        client_secret=SPOTIPY_CLIENT_SECRET,
        redirect_uri=SPOTIPY_REDIRECT_URI,
        scope="playlist-modify-public playlist-read-private"
    ))
    user_id = sp.current_user()['id']
    network = pylast.LastFMNetwork(api_key=LASTFM_API_KEY, api_secret=LASTFM_API_SECRET)
    print("Successfully connected to Spotify & Last.fm!")
except Exception as e:
    print(f"Error during authentication: {e}")
    raise SystemExit(1)  # stop here; nothing below works without both API clients

Connecting to APIs...
Successfully connected to Spotify & Last.fm!
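
The authentication cell takes its credentials from environment variables, and `os.getenv` silently returns `None` for any that are unset. A minimal sketch of a pre-flight check could list what is missing before any API call is made. `missing_env_vars` is a hypothetical helper, not part of the notebook, and the variable list is an assumption based on the names used in that cell (`LASTFM_API_SECRET` is assumed to be the intended name for the Last.fm secret).

```python
import os

# Assumed list of credentials, based on the names in the authentication cell.
# LASTFM_API_SECRET is an assumption about the intended variable name.
REQUIRED_VARS = [
    "SPOTIPY_CLIENT_ID",
    "SPOTIPY_CLIENT_SECRET",
    "SPOTIPY_REDIRECT_URI",
    "LASTFM_API_KEY",
    "LASTFM_API_SECRET",
    "LASTFM_PSWD",
]

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running a check like this first would turn a confusing authentication error into a message that names the missing variable.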


In [19]:
TRAINING_PLAYLISTS = {
    'hiphop-workout': 'spotify:playlist:6nZs9qkLc2Pg6Q2iZOSgbk',
    'makeout': 'spotify:playlist:4qZs86VNq1kXaRtCE5lcSr',
    'chase': 'spotify:playlist:74s26XHRLZ716UvUj3hL4S',
    'lofi-downtempo': 'spotify:playlist:2DRvUsr4TnWlAvFYv5B1xi', # very large playlist; a random sample of 450 songs is used
    'instrumental-happy': 'spotify:playlist:3L3ChTfSqTO6QdEfCd7l0s',
    'ambient-focus': 'spotify:playlist:70S8eB9yATWo90aQny9oGb', # very large playlist; a random sample of 450 songs is used
    'atmospheric-room': 'spotify:playlist:36vNl3AjU4sbCQcsQUOq3K',
    'acoustic-guitar': 'spotify:playlist:3ru5tc8HpvzsOklw78rKnf',
    'citypop': 'spotify:playlist:5drMgosoieMPSYbq46ugqa',
    'feel-good': 'spotify:playlist:4bY71u1Mc66zxbWSmPjBeF',
    'edm-club': 'spotify:playlist:2Fl0AxmDN4BPYvgZrtQSZF'
}

In [25]:
# --- 2. Data Collection Functions ---

def get_lastfm_features(artist, track_name):
    """
    Cleans search terms and uses Last.fm's search to find the best match
    before fetching tags.
    """
    try:
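        # 1. Clean the track name by removing common extra info such as
        # "(feat. ...)", "(with ...)", and " - ... Remaster ..." suffixes.
        # The Remaster pattern requires whitespace around the hyphen so that
        # hyphenated titles are not cut off at their first hyphen.
        cleaned_track_name = re.sub(r"\(feat\..*\)|_feat\..*|\(with.*\)|\s-\s.*Remaster.*", "", track_name, flags=re.IGNORECASE).strip()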

        # 2. Use Last.fm's search function to find the best match
        search_results = network.search_for_track(artist, cleaned_track_name)
        
        # 3. Get the top result from the search
        top_result = search_results.get_next_page()[0]
        
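        # 4. Get the tags for that top result
        top_tags = top_result.get_top_tags()
        
        # 5. Repeat each tag `weight` times so that stronger tags get higher counts
        # in the bag-of-words features; spaces inside a tag become hyphens
        feature_string = " ".join([tag.item.name.lower().replace(" ", "-") for tag in top_tags for _ in range(int(tag.weight))])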
        return feature_string
    except (IndexError, pylast.WSError):
        # IndexError means the search returned no results
        # WSError handles other API issues
        return ""
    except Exception:
        return ""

def fetch_training_data(playlists_dict):
    """Fetches tracks and features for all training playlists with success reporting."""
    all_tracks_data = []
    MAX_SONGS_PER_PLAYLIST = 450
    
    for label, playlist_id in playlists_dict.items():
        print(f"\nFetching tracks for '{label}' playlist...")
        
        results = sp.playlist_items(playlist_id)
        tracks = results['items']
        while results['next']:
            results = sp.next(results)
            tracks.extend(results['items'])
        
        # Record the playlist's full size before any sampling
        total_tracks_in_playlist = len(tracks)
        
        if total_tracks_in_playlist > MAX_SONGS_PER_PLAYLIST:
            print(f"Playlist has {total_tracks_in_playlist} songs. Taking a random sample of {MAX_SONGS_PER_PLAYLIST}.")
            tracks = random.sample(tracks, MAX_SONGS_PER_PLAYLIST)
        
        # Count the songs for which Last.fm returned usable tags
        successful_fetches = 0
        
        for item in tqdm(tracks, desc=f"Processing '{label}'"):
            track = item['track']
            if track and track['artists']:
                artist = track['artists'][0]['name']
                name = track['name']
                features = get_lastfm_features(artist, name)
                if features:
                    all_tracks_data.append({'features': features, 'label': label})
                    # Increment the counter on success
                    successful_fetches += 1
                time.sleep(0.1)

        # Print the summary for this playlist; the denominator is the full
        # playlist size, even when only a random sample was processed
        print(f"Finished '{label}': Successfully fetched data for {successful_fetches} / {total_tracks_in_playlist} songs considered.")


    return pd.DataFrame(all_tracks_data)
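
`get_lastfm_features` turns Last.fm tags into one string by repeating each tag `weight` times, and that string is what a `CountVectorizer` step tokenizes. The sketch below mimics both steps with the standard library only. The `(name, weight)` pairs are made-up sample data and `build_feature_string` is a hypothetical stand-in for the list comprehension in the function. The regular expression is scikit-learn's documented default `token_pattern`, applied here with `re.findall` as an approximation of what the vectorizer would see.

```python
import re
from collections import Counter

# scikit-learn's default CountVectorizer token_pattern: words of 2+ characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def build_feature_string(tags):
    """Repeat each lowercased, hyphenated tag `weight` times (hypothetical helper)."""
    return " ".join(
        name.lower().replace(" ", "-")
        for name, weight in tags
        for _ in range(int(weight))
    )

# Made-up sample tags in (name, weight) form.
sample_tags = [("Hip Hop", 3), ("rap", 2)]
features = build_feature_string(sample_tags)
print(features)

# Tokens under the default pattern: the hyphen acts as a word boundary,
# so "hip-hop" is counted as "hip" and "hop" separately.
default_tokens = Counter(re.findall(DEFAULT_TOKEN_PATTERN, features))
print(default_tokens)

# Tokens when splitting on whitespace only: each tag stays whole.
whole_tags = Counter(features.split())
print(whole_tags)
```

If the hyphenated tags are meant to stay intact, one option (an assumption, not something the notebook does) would be to pass `token_pattern=r"\S+"` to `CountVectorizer`.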

In [26]:
# --- 3. Main Training Logic ---

print("Starting training data collection...")
df = fetch_training_data(TRAINING_PLAYLISTS)

if df.empty:
    print("\nNo training data could be collected. Please check your playlist IDs and API keys.")
else:
    print(f"\nCollected data for {len(df)} songs. Training model...")
    
    X = df['features']
    y = df['label']
    
    # Create a machine learning pipeline
    # 1. CountVectorizer: Converts the tag string into a vector of word counts
    # 2. MultinomialNB: The classifier algorithm
    model_pipeline = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('classifier', MultinomialNB())
    ])
    
    # Train the model
    model_pipeline.fit(X, y)
    
    # Save the trained model to a file
    joblib.dump(model_pipeline, 'song_classifier.joblib')
    
    print("\nModel training complete!")
    print("The trained model has been saved to 'song_classifier.joblib'.")

Starting training data collection...

Fetching tracks for 'hiphop-workout' playlist...


Processing 'hiphop-workout': 100%|██████████| 432/432 [03:35<00:00,  2.01it/s]


Finished 'hiphop-workout': Successfully fetched data for 151 / 432 songs considered.

Fetching tracks for 'makeout' playlist...


Processing 'makeout': 100%|██████████| 252/252 [02:00<00:00,  2.10it/s]


Finished 'makeout': Successfully fetched data for 107 / 252 songs considered.

Fetching tracks for 'chase' playlist...


Processing 'chase': 100%|██████████| 252/252 [01:50<00:00,  2.28it/s]


Finished 'chase': Successfully fetched data for 122 / 252 songs considered.

Fetching tracks for 'lofi-downtempo' playlist...
Playlist has 1243 songs. Taking a random sample of 450.


Processing 'lofi-downtempo': 100%|██████████| 450/450 [03:19<00:00,  2.26it/s]


Finished 'lofi-downtempo': Successfully fetched data for 0 / 1243 songs considered.

Fetching tracks for 'instrumental-happy' playlist...


Processing 'instrumental-happy': 100%|██████████| 235/235 [02:01<00:00,  1.93it/s]


Finished 'instrumental-happy': Successfully fetched data for 0 / 235 songs considered.

Fetching tracks for 'ambient-focus' playlist...
Playlist has 1067 songs. Taking a random sample of 450.


Processing 'ambient-focus': 100%|██████████| 450/450 [03:40<00:00,  2.04it/s]


Finished 'ambient-focus': Successfully fetched data for 0 / 1067 songs considered.

Fetching tracks for 'atmospheric-room' playlist...


Processing 'atmospheric-room': 100%|██████████| 200/200 [01:35<00:00,  2.10it/s]


Finished 'atmospheric-room': Successfully fetched data for 29 / 200 songs considered.

Fetching tracks for 'acoustic-guitar' playlist...


Processing 'acoustic-guitar': 100%|██████████| 180/180 [01:24<00:00,  2.14it/s]


Finished 'acoustic-guitar': Successfully fetched data for 13 / 180 songs considered.

Fetching tracks for 'citypop' playlist...


Processing 'citypop': 100%|██████████| 101/101 [00:44<00:00,  2.28it/s]


Finished 'citypop': Successfully fetched data for 3 / 101 songs considered.

Fetching tracks for 'feel-good' playlist...


Processing 'feel-good': 100%|██████████| 408/408 [03:27<00:00,  1.97it/s]


Finished 'feel-good': Successfully fetched data for 97 / 408 songs considered.

Fetching tracks for 'edm-club' playlist...


Processing 'edm-club': 100%|██████████| 217/217 [01:52<00:00,  1.92it/s]


Finished 'edm-club': Successfully fetched data for 39 / 217 songs considered.

Collected data for 561 songs. Training model...

Model training complete!
The trained model has been saved to 'song_classifier.joblib'.
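
The log above can be condensed into per-label hit rates to show how uneven the resulting training set is. The sketch below hard-codes the counts printed in that output. The `processed` figures are taken from the progress bars, so the two sampled playlists use 450 rather than their full size, and the variable names are illustrative only.

```python
# (label, songs with Last.fm tags, songs processed), copied from the log above.
results = [
    ("hiphop-workout", 151, 432),
    ("makeout", 107, 252),
    ("chase", 122, 252),
    ("lofi-downtempo", 0, 450),
    ("instrumental-happy", 0, 235),
    ("ambient-focus", 0, 450),
    ("atmospheric-room", 29, 200),
    ("acoustic-guitar", 13, 180),
    ("citypop", 3, 101),
    ("feel-good", 97, 408),
    ("edm-club", 39, 217),
]

total_hits = sum(hits for _, hits, _ in results)
empty_labels = [label for label, hits, _ in results if hits == 0]

# One line per label: hits, songs processed, and the hit rate as a percentage.
for label, hits, processed in results:
    print(f"{label:<20} {hits:>4} / {processed:<4} ({hits / processed:.0%})")

print(f"Total training rows: {total_hits}")
print(f"Labels with no training rows: {empty_labels}")
```

Three labels contribute no rows at all, so a classifier fitted on this data has no examples of them and cannot predict them. Several others rest on very few songs.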
