In [None]:
import numpy as np
import pandas as pd

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from getpass import getpass

import keyring
import time

<b> NOTE: <br>Before running the notebook, retrieve your `client_id` and `client_secret` from the Developer Dashboard of your Spotify for Developers account. </b> <br>

<p> Then enter them in the next two cells. The values are read with `getpass()`, so they are not echoed on screen or stored in the notebook output. </p>
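As an optional variation (not part of the original notebook), the credentials could be read from environment variables first and prompted for only when they are missing. The sketch below assumes a hypothetical helper named `read_secret`; the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the ones Spotipy itself looks for.

```python
import os
from getpass import getpass


def read_secret(env_name, prompt):
    """Return the value of env_name if it is set, else prompt without echoing."""
    value = os.environ.get(env_name)
    if value:
        return value
    return getpass(prompt)


# Example usage (commented out so the sketch does not block waiting for input):
# client_id = read_secret("SPOTIPY_CLIENT_ID", "Spotify client_id: ")
# client_secret = read_secret("SPOTIPY_CLIENT_SECRET", "Spotify client_secret: ")
```

Either way, the secrets never appear in the notebook source or its saved output.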

In [None]:
client_id = getpass()

In [None]:
client_secret = getpass()

In [None]:
client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

In [None]:
# Helper function to print structure of a nested dict object
def print_dict_structure(d, indent=0):
    """Print the keys of a nested dict, indented by nesting depth."""
    for key, value in d.items():
        print('  ' * indent  +  str(key))
        if isinstance(value, dict):
            print_dict_structure(value, indent + 2)
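
A quick check of the helper on a hand-made dict. The values are invented and only loosely mimic the nesting of a search response; the function is repeated here so the snippet runs on its own.

```python
def print_dict_structure(d, indent=0):
    """Print the keys of a nested dict, indented by nesting depth."""
    for key, value in d.items():
        print('  ' * indent + str(key))
        if isinstance(value, dict):
            print_dict_structure(value, indent + 2)


# Made-up sample, not real API output
sample = {"playlists": {"items": [], "total": 0, "next": None}, "href": "..."}
print_dict_structure(sample)
```

Lists are not descended into, so only the keys of nested dicts appear in the printout.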


## Data Collection using Spotify API and Spotipy

The function `fetch_playlists` returns the playlists that match your chosen query. Its parameters are:
<li> <b>sp</b>: This parameter expects an instance of the Spotify client. It is used to make authenticated requests to the Spotify Web API, such as searching for playlists based on a given query. </li>
<li> <b>query</b>: A string parameter representing the search query to find playlists on Spotify. This could include specific genres, artists, songs, or keywords related to the playlists you're trying to fetch. </li>
<li> <b>total_limit</b>: An integer specifying the total number of playlists to retrieve. Because search results are paginated, the function requests them in batches of at most 50 until this limit is reached. </li>
<li> <b>market</b>: (optional with a default value of "PH"): A string parameter that specifies the market (country) to consider when fetching playlists. This is used to tailor the search results to a particular geographic market, as playlist availability can vary by country. The default value "PH" stands for the Philippines, but it can be changed to any valid ISO 3166-1 alpha-2 country code. </li>
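
To illustrate how `total_limit` translates into individual requests, here is a minimal sketch of the offset and batch-size arithmetic. The helper name `batch_plan` and its `page_size` argument are hypothetical and only mirror the loop used in `fetch_playlists`.

```python
def batch_plan(total_limit, page_size=50):
    """Return the (offset, limit) pair for each request that would be made."""
    return [
        (offset, min(page_size, total_limit - offset))
        for offset in range(0, total_limit, page_size)
    ]


print(batch_plan(120))  # [(0, 50), (50, 50), (100, 20)]
```

The last batch is shortened so that the total never exceeds `total_limit`.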

In [None]:
def fetch_playlists(sp, query, total_limit, market="PH"):
    """
    Fetch playlists from Spotify based on a query and compile the
    data into a DataFrame.

    Returns:
    - DataFrame with columns for playlist_id, playlist_name,
    and playlist_owner.
    """
    # Initialize lists
    playlist_ids = []
    playlist_names = []
    playlist_owners = []

    # Loop to fetch playlists in batches
    for offset in range(0, total_limit, 50):
        current_limit = min(50, total_limit - offset)

        # search
        results = sp.search(
            query,
            limit=current_limit,
            offset=offset,
            type="playlist",
            market=market
        )

        # Extract playlist information, skipping any empty entries
        for item in results["playlists"]["items"]:
            if not item:
                continue
            playlist_ids.append(item["id"])
            playlist_names.append(item["name"])
            playlist_owners.append(item["owner"]["display_name"])

    # Compile playlist
    df_playlists = pd.DataFrame(
        {
            "playlist_id": playlist_ids,
            "playlist_name": playlist_names,
            "playlist_owner": playlist_owners,
        }
    )

    return df_playlists


Two search queries are used to find playlists: `coffee shop café` and `coffee shop pinoy café kape`.

In [None]:
df_coffee = fetch_playlists(sp, "coffee shop café", 500, 'PH')
df_coffee.head()

Unnamed: 0,playlist_id,playlist_name,playlist_owner
0,6Jh3UZ9pJDnBMOKrAYysE1,"Background Noises - Coffee Shop, Cafe & More",Granular Noise
1,37i9dQZF1DXa1BeMIGX5Du,Coffee + Chill,Spotify
2,37i9dQZF1DWVqfgj8NZEp1,Coffee Table Jazz,Spotify
3,6AshwSEJ74Hpte22dnkahc,Coffee Shop Music,vylo
4,37i9dQZF1DWSGaMpjluQpy,Jazz in the Rain,Spotify


In [None]:
df_coffee_pinoy = fetch_playlists(sp, "coffee shop pinoy café kape", 500, 'PH')
df_coffee_pinoy.head()

Unnamed: 0,playlist_id,playlist_name,playlist_owner
0,2woCPqlMsaC7SosGhiV0tc,Pinoy Coffee shop,newwithtags
1,0KRWomdHs9KyTeJaiXlb9P,Pinoy cafe vibes ☕️,niki
2,3foVOF8aTnbzyxhgnq4QNZ,Pinoy Coffee House,G-Boy
3,37i9dQZF1DX4olOMiqFeqU,OPM Favorites,Spotify
4,0Toja9vGmlFO8q8lAjla1k,Kapoy - Coffee Store,Poy Lechonsito


In [None]:
df_coffee = pd.concat([df_coffee, df_coffee_pinoy])
df_coffee = df_coffee.drop_duplicates(subset='playlist_id')
df_coffee.reset_index(drop=True, inplace=True)
df_coffee.head()

Unnamed: 0,playlist_id,playlist_name,playlist_owner
0,6Jh3UZ9pJDnBMOKrAYysE1,"Background Noises - Coffee Shop, Cafe & More",Granular Noise
1,37i9dQZF1DXa1BeMIGX5Du,Coffee + Chill,Spotify
2,37i9dQZF1DWVqfgj8NZEp1,Coffee Table Jazz,Spotify
3,6AshwSEJ74Hpte22dnkahc,Coffee Shop Music,vylo
4,37i9dQZF1DWSGaMpjluQpy,Jazz in the Rain,Spotify


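Playlists owned by the `Spotify` account are removed so that the dataset reflects user-curated playlists. `df_coffee_unique` additionally keeps only the first playlist for each `playlist_name` and each `playlist_owner`.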

In [None]:
df_coffee = (
    df_coffee[df_coffee['playlist_owner'] != 'Spotify']
    .reset_index(drop=True)
)
df_coffee_unique = df_coffee.drop_duplicates(subset=['playlist_name'])
df_coffee_unique = df_coffee_unique.drop_duplicates(subset=['playlist_owner'])
df_coffee_unique

Unnamed: 0,playlist_id,playlist_name,playlist_owner
0,6Jh3UZ9pJDnBMOKrAYysE1,"Background Noises - Coffee Shop, Cafe & More",Granular Noise
1,6AshwSEJ74Hpte22dnkahc,Coffee Shop Music,vylo
2,582DERU3rbGArEYQ1gi4Bw,Coffee shop soundtrack,Jake Cope
3,11nSleISOWGLboWVWPDuwB,Café Music 2024 ☕ Coffee Lounge Vibes,LoudKult
4,7K6fVGUcL6ChCsRMJP4oOC,Lofi Coffee Shop (Lo-Fi Café Morning),LoFi Coffee
...,...,...,...
803,6x1sZ0q2AHCDwUHXkDgWsf,CHILL/CAFE KOREAN SONGS,♡
806,5qgkey5QdQXKD0sQPS4EzL,PINOY FOLK SONGS OPM,Ricardo Bognot
807,26oD7sFxuNcV580wMImcBs,Calm And Relaxing OPM Songs,Michael John
808,10bTdZg58svtXY7ZHaUwVQ,Coffee shop bangers,Ashley Dehmlow


The `playlist_descriptions` are gathered by iterating through each `playlist_id` and fetching the corresponding descriptions. This information will be utilized to refine the dataset, ensuring it only includes rows that reference specific keywords.
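
A minimal sketch of the same loop, written as a function and exercised against a stand-in client so that it runs without network access. `StubClient` and `fetch_descriptions` are hypothetical names. Passing `fields="description"` assumes the installed Spotipy version supports the `fields` argument of `playlist`, which would limit the response to the one field needed here.

```python
class StubClient:
    """Stand-in for a Spotify client; returns canned data, no network calls."""

    def __init__(self, data):
        self._data = data

    def playlist(self, playlist_id, fields=None):
        if playlist_id not in self._data:
            raise KeyError(playlist_id)
        return {"description": self._data[playlist_id]}


def fetch_descriptions(client, playlist_ids):
    """Return one description (or None on failure) per playlist ID."""
    descriptions = []
    for playlist_id in playlist_ids:
        try:
            data = client.playlist(playlist_id, fields="description")
            descriptions.append(data.get("description"))
        except Exception:
            # Keep the list aligned with the input even when a lookup fails
            descriptions.append(None)
    return descriptions


stub = StubClient({"id_a": "Morning coffee mix", "id_b": ""})
print(fetch_descriptions(stub, ["id_a", "id_b", "missing"]))
# ['Morning coffee mix', '', None]
```

Appending `None` on failure keeps the result the same length as the input, which is what allows it to be assigned directly as a new DataFrame column.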

In [None]:
playlist_descriptions = []

# Loop through each row in the DataFrame to fetch playlist descriptions
for playlist_id in df_coffee["playlist_id"]:
    try:
        # Fetch the playlist data
        playlist_data = sp.playlist(playlist_id)

        # Extract the description and append it to the list
        playlist_descriptions.append(playlist_data["description"])
    except Exception as e:
        # Append a None if you cannot fetch the description
        playlist_descriptions.append(None)

# Add the list of descriptions as a new column in the DataFrame
df_coffee["playlist_description"] = playlist_descriptions

In [None]:
df_coffee.head()

Unnamed: 0,playlist_id,playlist_name,playlist_owner,playlist_description
0,6Jh3UZ9pJDnBMOKrAYysE1,"Background Noises - Coffee Shop, Cafe & More",Granular Noise,The best ASMR White Noise ambience incl Coffee...
1,6AshwSEJ74Hpte22dnkahc,Coffee Shop Music,vylo,Relaxing Coffee Shop Music. Listen to this pla...
2,582DERU3rbGArEYQ1gi4Bw,Coffee shop soundtrack,Jake Cope,
3,11nSleISOWGLboWVWPDuwB,Café Music 2024 ☕ Coffee Lounge Vibes,LoudKult,Coffe shop Music to enjoy in the mornings! Caf...
4,7K6fVGUcL6ChCsRMJP4oOC,Lofi Coffee Shop (Lo-Fi Café Morning),LoFi Coffee,Best LoFi to sip your morning coffee to..Morni...


The dataset is filtered based on `playlist_name` and `playlist_description` to include only rows that mention the keywords `coffee`, `cafe`, `shop`, or `café`.
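
The matching rule can be tried on a few made-up titles with the standard `re` module. This is only a sketch of the pattern's behavior, not the notebook's pandas code. It uses a non-capturing group `(?:...)`, since pandas' `str.contains` emits a warning when the pattern contains capturing groups.

```python
import re

# Non-capturing group: only a yes/no match is needed, not the matched text
keywords = re.compile(r"(?:coffee|cafe|shop|café)", flags=re.IGNORECASE)

titles = ["Lofi COFFEE Shop", "Café Music 2024", "Jazz in the Rain", "Guitar Workshop"]
for title in titles:
    print(title, "->", bool(keywords.search(title)))
```

Because the pattern matches substrings, a title such as "Guitar Workshop" also passes the filter through the keyword `shop`.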

In [None]:
# Non-capturing group, so pandas does not warn about match groups
keywords_pattern = r"(?:coffee|cafe|shop|café)"

# Filter rows that contain any of the keywords
df_coffee_rev = df_coffee[
    df_coffee["playlist_name"].str.contains(keywords_pattern,
                                            case=False,
                                            na=False)
    | df_coffee["playlist_description"].str.contains(
        keywords_pattern, case=False, na=False
    )
]

df_coffee_rev



Unnamed: 0,playlist_id,playlist_name,playlist_owner,playlist_description
0,6Jh3UZ9pJDnBMOKrAYysE1,"Background Noises - Coffee Shop, Cafe & More",Granular Noise,The best ASMR White Noise ambience incl Coffee...
1,6AshwSEJ74Hpte22dnkahc,Coffee Shop Music,vylo,Relaxing Coffee Shop Music. Listen to this pla...
2,582DERU3rbGArEYQ1gi4Bw,Coffee shop soundtrack,Jake Cope,
3,11nSleISOWGLboWVWPDuwB,Café Music 2024 ☕ Coffee Lounge Vibes,LoudKult,Coffe shop Music to enjoy in the mornings! Caf...
4,7K6fVGUcL6ChCsRMJP4oOC,Lofi Coffee Shop (Lo-Fi Café Morning),LoFi Coffee,Best LoFi to sip your morning coffee to..Morni...
...,...,...,...,...
791,47lhDA0weXryO4XqQm2Dj3,Cafe Playlist,Cloverleaf Cafe,
801,1TteYRdmI9msGhsYslvr3Z,Coffeeshop Feels,Amber Martin,
803,6x1sZ0q2AHCDwUHXkDgWsf,CHILL/CAFE KOREAN SONGS,♡,
804,7wluQO0aGz8D025SmF4DPw,Coffee Shop,Saul,


The algorithm iterates through each `playlist_id` to retrieve all the songs contained in each playlist. This comprehensive collection of songs will later serve as the basis for Frequent Itemset Mining, enabling the analysis of common patterns and relationships between tracks across various playlists.
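
The pagination step can be sketched in isolation. The example below uses a stand-in client (`StubPager`, a hypothetical class) whose `next` method imitates following the `next` link of a paged response, so it runs without network access. `iter_items` is likewise a hypothetical helper.

```python
class StubPager:
    """Stand-in for a Spotify client: pages are linked by a 'next' key."""

    def __init__(self, pages):
        self._pages = pages

    def next(self, page):
        return self._pages[page["next"]]


def iter_items(client, first_page):
    """Yield every item across all pages, following 'next' links."""
    page = first_page
    while page:
        yield from page["items"]
        page = client.next(page) if page["next"] else None


pages = {
    "p1": {"items": ["song A", "song B"], "next": "p2"},
    "p2": {"items": ["song C"], "next": None},
}
client = StubPager(pages)
print(list(iter_items(client, pages["p1"])))  # ['song A', 'song B', 'song C']
```

The loop ends when a page has no `next` link, which is the same stopping condition used in the cell that follows.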

In [None]:
df = df_coffee_rev
# Initialize the DataFrame to store songs information
df_playlist_songs = pd.DataFrame(
    columns=[
        "playlist_id",
        "playlist_name",
        "song_name",
        "artist_name",
        "song_id",
        "artist_id",
    ]
)

# Loop through each playlist ID in the DataFrame obtained from previous step
for playlist_id in df["playlist_id"]:
    try:
        playlist_data = sp.playlist(playlist_id)
        if (playlist_data and "name" in playlist_data and
            "tracks" in playlist_data):
            playlist_name = playlist_data["name"]
            tracks = playlist_data["tracks"]

            # Initialize a list to collect song data
            songs_data = []

            # Loop to handle pagination and fetch all tracks
            while tracks:
                for item in tracks["items"]:
                    if (
                        item
                        and "track" in item
                        and item["track"]
                        and item["track"].get("artists")
                    ):
                        artist_name = (
                            item["track"]["artists"][0]["name"]
                            if item["track"]["artists"]
                            else "Unknown Artist"
                        )
                        artist_id = (
                            item["track"]["artists"][0]["id"]
                            if item["track"]["artists"]
                            else "Unknown ID"
                        )
                        song_name = (
                            item["track"]["name"]
                            if "name" in item["track"]
                            else "Unknown Song"
                        )
                        song_id = (
                            item["track"]["id"]
                            if "id" in item["track"]
                            else "Unknown ID"
                        )

                        # Collect each song's info as a dictionary
                        song_info = {
                            "playlist_id": playlist_id,
                            "playlist_name": playlist_name,
                            "song_id": song_id,
                            "song_name": song_name,
                            "artist_id": artist_id,
                            "artist_name": artist_name,
                        }
                        songs_data.append(song_info)

                # Check if there are more tracks to fetch
                if tracks["next"]:
                    tracks = sp.next(tracks)
                else:
                    tracks = None

            # Create a DataFrame from the collected song data
            df_current_playlist_songs = pd.DataFrame(songs_data)

            # Concatenate the playlist's songs DataFrame with the DataFrame
            df_playlist_songs = pd.concat(
                [df_playlist_songs, df_current_playlist_songs],
                ignore_index=True
            )
    except Exception as e:
        print(f"An error occurred with playlist {playlist_id}: {e}")

# Display the compiled songs DataFrame
df_playlist_songs

Unnamed: 0,playlist_id,playlist_name,song_name,artist_name,song_id,artist_id
0,6Jh3UZ9pJDnBMOKrAYysE1,"Background Noises - Coffee Shop, Cafe & More",Coffee Shop Sound for Working and Studying Par...,Background Music & Sounds From I’m In Records,2nr2nZscQKGywtgHTUj1b7,0CmiAwb3M1ErCaws6d6H6e
1,6Jh3UZ9pJDnBMOKrAYysE1,"Background Noises - Coffee Shop, Cafe & More",Coffee Shop Sound for Working and Studying Par...,Background Music & Sounds From I’m In Records,0Bx80DISZZeU1Fq5jPa7IY,0CmiAwb3M1ErCaws6d6H6e
2,6Jh3UZ9pJDnBMOKrAYysE1,"Background Noises - Coffee Shop, Cafe & More",Coffee Shop Sound for Working and Studying Par...,Background Music & Sounds From I’m In Records,3UAlhre5mLRbUO7ESKcpNC,0CmiAwb3M1ErCaws6d6H6e
3,6Jh3UZ9pJDnBMOKrAYysE1,"Background Noises - Coffee Shop, Cafe & More",Coffee Shop Sound for Working and Studying Par...,Background Music & Sounds From I’m In Records,1AKDpveHMcLTdLFnKe66Ce,0CmiAwb3M1ErCaws6d6H6e
4,6Jh3UZ9pJDnBMOKrAYysE1,"Background Noises - Coffee Shop, Cafe & More",Coffee Shop Sound for Working and Studying Par...,Background Music & Sounds From I’m In Records,54jN6mX4zbh3ewPtS7mZ89,0CmiAwb3M1ErCaws6d6H6e
...,...,...,...,...,...,...
173358,10bTdZg58svtXY7ZHaUwVQ,Coffee shop bangers,everything i wanted,Billie Eilish,3ZCTVFBt2Brf31RLEnCkWJ,6qqNVTkY8uBg9cP3Jd7DAH
173359,10bTdZg58svtXY7ZHaUwVQ,Coffee shop bangers,ocean eyes,Billie Eilish,7hDVYcQq6MxkdJGweuCtl9,6qqNVTkY8uBg9cP3Jd7DAH
173360,10bTdZg58svtXY7ZHaUwVQ,Coffee shop bangers,when the party's over,Billie Eilish,43zdsphuZLzwA9k4DJhU0I,6qqNVTkY8uBg9cP3Jd7DAH
173361,10bTdZg58svtXY7ZHaUwVQ,Coffee shop bangers,bellyache,Billie Eilish,51NFxnQvaosfDDutk0tams,6qqNVTkY8uBg9cP3Jd7DAH


Several filters are applied to the dataset:
<li> First, playlists whose `playlist_name` contains the word 'korean' (case-insensitive) are excluded, so that Korean-themed playlists are left out of the study. </li>
<li> Leading and trailing spaces are stripped from `song_name`, so that the same title is spelled consistently across the dataset. </li>
<li> The `playlist_id` column is converted to strings, so that playlist IDs are treated as categorical labels in subsequent operations. </li>
<li> The dataset is then filtered to focus on songs that recur across playlists. The code groups the rows by `song_name` and keeps the groups with more than one distinct `song_id`. Note that this criterion keeps titles associated with several track IDs, which approximates, but is not identical to, a song appearing in more than one playlist. </li>
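
To make the last criterion concrete, the toy example below (invented data) contrasts the grouping used in the notebook with one possible alternative that counts distinct playlists per `song_id`. The alternative is an illustration only and is not what the notebook runs.

```python
import pandas as pd

# Invented rows: "Alpha" is one track in two playlists,
# "Beta" is two different tracks sharing a title
toy = pd.DataFrame({
    "playlist_id": ["p1", "p2", "p1", "p3", "p2"],
    "song_name": ["Alpha", "Alpha", "Beta", "Beta", "Gamma"],
    "song_id": ["a1", "a1", "b1", "b2", "g1"],
})

# Criterion used in the notebook: titles linked to more than one track ID
multi_id = toy.groupby("song_name").filter(
    lambda g: g["song_id"].nunique() > 1
)

# Alternative: track IDs that occur in more than one playlist
multi_playlist = toy.groupby("song_id").filter(
    lambda g: g["playlist_id"].nunique() > 1
)

print(sorted(multi_id["song_name"].unique()))        # ['Beta']
print(sorted(multi_playlist["song_name"].unique()))  # ['Alpha']
```

Which of the two is appropriate depends on whether different recordings that share a title should count as the same song.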

In [None]:
# Drop Korean-themed playlists; .copy() avoids SettingWithCopyWarning below
df_coffee_playlist_songs = df_playlist_songs[
    ~df_playlist_songs['playlist_name'].str.contains('korean',
                                                     case=False,
                                                     na=False)
].copy()
df_coffee_playlist_songs['song_name'] = (
    df_coffee_playlist_songs['song_name'].str.strip()
)
df_coffee_playlist_songs['playlist_id'] = (
    df_coffee_playlist_songs['playlist_id'].astype('str')
)
df_coffee_filtered = df_coffee_playlist_songs.groupby('song_name').filter(
    lambda x: x['song_id'].nunique() > 1
)
df_coffee_sorted = df_coffee_filtered.sort_values(by='song_name')



Finally, rows whose `song_name` contains Japanese kana, CJK ideographs, or Korean Hangul characters are removed, using a regular expression over the Unicode ranges `\u3040-\u30FF`, `\u4E00-\u9FAF`, and `\uAC00-\uD7AF`.
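
The cell that follows drops song names containing Japanese, Chinese, or Korean script. A minimal sketch of how its character class behaves, using only the standard `re` module on a few sample titles:

```python
import re

# Hiragana/Katakana, CJK ideographs, and Hangul syllables
cjk_pattern = re.compile("[\u3040-\u30FF\u4E00-\u9FAF\uAC00-\uD7AF]")

titles = ["ocean eyes", "夜に駆ける", "봄날", "Café del Mar"]
kept = [t for t in titles if not cjk_pattern.search(t)]
print(kept)  # ['ocean eyes', 'Café del Mar']
```

Accented Latin characters such as `é` fall outside these ranges, so titles like "Café del Mar" are retained.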

In [None]:
df_coffee_final = df_coffee_sorted[
    ~df_coffee_sorted['song_name'].str.contains(
        '[\u3040-\u30FF\u4E00-\u9FAF\uAC00-\uD7AF]', regex=True, na=False
    )
]
df_coffee_final.tail(3)

Unnamed: 0,playlist_id,playlist_name,song_name,artist_name,song_id,artist_id
14581,48GTjO4Pc1Z6yFAmp47n4g,coffee shops in the fall,‘tis the damn season,Taylor Swift,6sQckd3Z8NPxVVKUnavY1F,06HL4z0CvFAxyc27GXpf02
25534,6LuBlCUSnil3ECt6yh8IAG,coffee shop music,‘tis the damn season,Taylor Swift,6sQckd3Z8NPxVVKUnavY1F,06HL4z0CvFAxyc27GXpf02
29905,05UG0w16MESsDPVy9y3nYf,taylor swift café,‘tis the damn season,Taylor Swift,6sQckd3Z8NPxVVKUnavY1F,06HL4z0CvFAxyc27GXpf02


### Saving the file for backup

In [None]:
df_coffee_final.to_csv('df_coffee_final.csv', index=False)