## Data Collection
I use two datasets to build a song recommendation system. The first is a library of tracks available on Kaggle, and the second is my extended listening history, which I requested directly from Spotify. Because both datasets are incomplete, I call the Spotify Web API to pull in artist and track features so that I have a complete library. Lastly, I build a function that takes an artist and track name as inputs and returns all the data points my recommendation model needs. I use this function in model evaluation and in my Streamlit app.

Data Sources
<br>1- Spotify Listening History
<br>2- Kaggle Song List
<br>3- Spotify API

#### Import Libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing

import time
import pickle

#### Load in Datasets

In [2]:
#load kaggle data
kaggle_df = pd.read_csv('../data/data.csv')

#load extended listening history
extended = pd.read_json('../data/endsong_0.json')
extended1 = pd.read_json('../data/endsong_1.json')
extended2 = pd.read_json('../data/endsong_2.json')
extended = pd.concat([extended, extended1, extended2])

#remove rows with null track ids
extended = extended[extended['spotify_track_uri'].notnull()]
extended.reset_index(drop = True, inplace = True)

#strip the 14-character 'spotify:track:' prefix to get the bare track id
extended['track_id'] = extended['spotify_track_uri'].str[14:]
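
The slice at position 14 works because every track URI starts with the prefix `spotify:track:`, which is 14 characters long. As a minimal sketch, assuming made-up URIs and a hypothetical helper name `uri_to_track_id`, the same result can be reached without counting characters by splitting on the colon:

```python
import pandas as pd

PREFIX = "spotify:track:"  # the prefix the [14:] slice removes

def uri_to_track_id(uris: pd.Series) -> pd.Series:
    """Return the last colon-separated field of each URI (the bare track id)."""
    return uris.str.split(":").str[-1]

# Made-up URIs, for illustration only
sample = pd.Series(["spotify:track:abc123", "spotify:track:xyz789"])
track_ids = uri_to_track_id(sample)

# The slice used in the notebook gives the same ids
sliced_ids = sample.str[len(PREFIX):]
```

Both approaches assume the column only holds track URIs; a URI of another type would need separate handling.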

### Spotify API
To access the Spotify API, I first set up a developer account to obtain credentials for making requests. The main endpoints I used were search, tracks, audio features, and artists. These endpoints provide artist and track features, including genre, energy, danceability, and acousticness, which I can use to measure the similarity between tracks and make recommendations.

Please note there are specific rate limits associated with each endpoint. As such I split my datasets into smaller batches and utilized a sleep function to ensure that I did not exceed these rate limits. Rerunning these API calls will take a few hours. 
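
The batch-and-pause pattern can be sketched as a small reusable helper. This is an illustration rather than the exact code used in this notebook: `run_in_batches`, its parameters, and the stand-in `fetch` function are all hypothetical names, and the `sleep` argument exists only so the demo can run without actually waiting.

```python
import time
import numpy as np

def run_in_batches(items, fetch, n_batches, pause_s, sleep=time.sleep):
    """Split `items` into `n_batches` pieces, call `fetch` on each piece,
    and pause `pause_s` seconds after every call."""
    results = []
    for batch in np.array_split(np.asarray(items), n_batches):
        if len(batch) == 0:
            continue  # array_split can yield empty pieces for short inputs
        results.append(fetch(batch))
        sleep(pause_s)
    return results

# Demo: a stand-in fetch (sum of the batch) and a recording no-op sleep
pauses = []
out = run_in_batches(
    list(range(10)),
    fetch=lambda b: int(b.sum()),
    n_batches=3,
    pause_s=30,
    sleep=pauses.append,
)
```

In real use, `fetch` would wrap one of the API functions below and `sleep` would be left at its default.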

In [4]:
#import spotipy and use credentials to authenticate through spotify api
import spotipy

#!ln -s ../config.py config.py 
import config

from spotipy.oauth2 import SpotifyClientCredentials
client_credentials_manager = SpotifyClientCredentials(client_id=config.cid, client_secret=config.secret)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

In [5]:
def track_features(data):
    danceability = []
    energy = []
    key = []
    loudness = []
    mode = []
    speechiness = []
    acousticness = []
    instrumentalness = []
    liveness = []
    valence = []
    tempo = []
    track_id = []
    duration_ms = []
    time_signature = []
    
    for t in data:
        try:
            results = sp.audio_features(tracks = t)
            danceability.append(results[0]['danceability'])
            energy.append(results[0]['energy'])
            key.append(results[0]['key'])
            loudness.append(results[0]['loudness'])
            mode.append(results[0]['mode'])
            speechiness.append(results[0]['speechiness'])
            acousticness.append(results[0]['acousticness'])
            instrumentalness.append(results[0]['instrumentalness'])
            liveness.append(results[0]['liveness'])
            valence.append(results[0]['valence'])
            tempo.append(results[0]['tempo'])
            track_id.append(results[0]['id'])
            duration_ms.append(results[0]['duration_ms'])
            time_signature.append(results[0]['time_signature'])
        except Exception:
            #skip tracks whose features could not be retrieved
            pass
    
    track_features = pd.DataFrame(track_id, columns = ['trackID'])
    track_features['danceability'] = danceability
    track_features['energy'] = energy
    track_features['key'] = key
    track_features['loudness'] = loudness 
    track_features['mode'] = mode 
    track_features['speechiness'] = speechiness
    track_features['acousticness'] = acousticness
    track_features['instrumentalness'] = instrumentalness
    track_features['liveness'] = liveness
    track_features['valence'] = valence
    track_features['tempo'] = tempo
    track_features['duration_ms'] = duration_ms
    track_features['time_signature'] = time_signature
    
    return track_features
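
An alternative way to assemble the same table is to collect one dictionary per track and let pandas build the columns. The sketch below runs on made-up response dictionaries rather than live API calls, uses only a subset of the feature keys for brevity, and assumes that unresolved tracks come back as `None`; `features_to_frame` is a hypothetical name.

```python
import pandas as pd

# Subset of the audio-feature keys, for brevity
FEATURE_KEYS = ["id", "danceability", "energy", "tempo"]

def features_to_frame(responses):
    """Build a DataFrame from audio-feature dicts, skipping missing entries."""
    rows = [{k: r[k] for k in FEATURE_KEYS} for r in responses if r is not None]
    frame = pd.DataFrame(rows, columns=FEATURE_KEYS)
    return frame.rename(columns={"id": "trackID"})

# Made-up responses, for illustration only
fake_responses = [
    {"id": "abc", "danceability": 0.5, "energy": 0.7, "tempo": 120.0, "uri": "ignored"},
    None,  # assumed shape of a track that could not be resolved
    {"id": "xyz", "danceability": 0.8, "energy": 0.4, "tempo": 98.5, "uri": "ignored"},
]
frame = features_to_frame(fake_responses)
```

Keeping each track's values together in one dictionary means a failed lookup cannot leave the columns with different lengths.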

#### Kaggle Dataset - Genre & Followers

While the Kaggle dataset included features relating to each track it did not contain information regarding the genre or followers associated with each artist. The function below pulls this additional information from the Spotify API and combines it with the original dataset.

In [9]:
#Using a different function as the kaggle data doesn't include artist ID
def kaggle_artist_features(artist):
    results = sp.search(q=f'artist: {artist}', type='artist', limit=1)
    t = results['artists']['items']
    ids = []
    name = []
    genre = []
    followers = []
    
    try:
        for s in t: 
            ids.append(s['id'])
            genre.append(s['genres'])
            followers.append(s['followers']['total'])
            name.append(s['name'])   
    except:
        ids.append(0)
        
    art_feat = pd.DataFrame(ids, columns = ['artist_id'])
    art_feat['artists'] = artist
    art_feat['artistName'] = name
    art_feat['genre'] = genre
    art_feat['followers'] = followers
    
    return art_feat

In [90]:
#I ran into issues trying to pull artist data so I had to run API requests in smaller batches
#I also had to filter out strings that were too long to be searched

artist_batches = np.array_split(kaggle_df['artists'].loc[(kaggle_df['artists'].str.len() < 200)].unique(), 150)

kaggle_artists = []

In [17]:
#pause before each batch to stay under the rate limit
for x in range(140,150):
    time.sleep(30)
    kaggle_artists.append([kaggle_artist_features(a) for a in artist_batches[x]])

In [71]:
#Condensing the artists data into one dataframe with the track data from kaggle_df
batch_df = [pd.concat(batch) for batch in kaggle_artists]

test = pd.concat(batch_df)

kaggle = pd.merge(left = kaggle_df, right = test, on = 'artists', how = 'left')

#### Extended Streaming History
Spotify provided me with a few JSON files containing my extended listening history. However, this dataset does not include most of the track or artist features that I need to identify song similarities. The functions below pull the additional track and artist data directly from the Spotify API.

In [6]:
def get_track_artist(tracks):
    results = sp.tracks(tracks)
    t = results['tracks']
    ids = []
    track_id = [track for track in tracks]
    track_name = []
    
    try:
        for s in t: 
            ids.append(s['artists'][0]['id'])
            track_name.append(s['name']) 
    except:
        ids.append(0)
        
    
    #art_feat = pd.DataFrame(ids, columns = ['artist_id'])
    #art_feat['track_id'] = track_id
    
    df = artist_features(ids)
    df['track_id'] = track_id
    df['trackName'] = track_name
    
    return df

In [7]:
def artist_features(artist):
    results = sp.artists(artist)
    t = results['artists']
    ids = []
    artist_id = [x for x in artist]
    name = []
    genre = []
    popularity = []
    followers = []
    
    
    try:
        for s in t: 
            ids.append(s['id'])
            genre.append(s['genres'])
            popularity.append(s['popularity'])
            followers.append(s['followers']['total'])   
    except:
        ids.append(0)
        
    art_feat = pd.DataFrame(ids, columns = ['artist_id'])
    art_feat['artist_id'] = artist_id
    art_feat['genre'] = genre
    art_feat['popularity'] = popularity
    art_feat['followers'] = followers
    
    return art_feat

In [None]:
#Splitting the data into smaller batches 
ext_batches = np.array_split(extended['track_id'], 1000)
art_feat = []
counter = 0

In [155]:
#Running each call in small batches with built in pauses to avoid rate limiting
for x in range(900,1000):
    time.sleep(5)
    art_feat.append(get_track_artist(ext_batches[x]))
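
Splitting into a fixed number of batches makes the batch size depend on how long the history is. As I understand the Web API, the several-tracks and several-artists endpoints accept at most 50 IDs per request, so a fixed-size chunker is one way to guarantee that cap. The sketch below uses made-up IDs, and `chunked` is a hypothetical helper name.

```python
def chunked(ids, size=50):
    """Split `ids` into consecutive lists of at most `size` items."""
    ids = list(ids)
    return [ids[i:i + size] for i in range(0, len(ids), size)]

# Made-up ids, for illustration only
fake_ids = [f"id{i}" for i in range(120)]
chunks = chunked(fake_ids, size=50)
chunk_sizes = [len(c) for c in chunks]
```

Each chunk could then be passed to `get_track_artist` in place of an entry of `ext_batches`.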

In [32]:
#Pulling the track features for each song in extended history
ext_batches = np.array_split(extended['track_id'], 100)
track_feat = []
counter = 0

In [58]:
#Running each call in small batches with built in pauses to avoid rate limiting
for x in range(95,100):
    time.sleep(30)
    track_feat.append(track_features(ext_batches[x]))

In [185]:
#Merge the track and artist features pulled from the Spotify API
track_df = pd.concat(track_feat)
art_df = pd.concat(art_feat)
track_df.drop_duplicates(inplace = True)
art_df.drop_duplicates(subset = 'track_id', inplace = True)

extended_spotify = pd.merge(track_df, art_df, how = 'left', left_on = 'trackID', right_on ='track_id')
extended = pd.merge(extended, extended_spotify, how = 'left', left_on = 'track_id', right_on = 'trackID')
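
Because both merges are left joins, tracks with no match on the right side are kept with missing values. A minimal sketch on toy data (the frames below are made up and only mimic the column names used above) shows how the `indicator` option of `pd.merge` can list those unmatched tracks:

```python
import pandas as pd

# Toy frames that mimic the key columns used in the notebook
track_df = pd.DataFrame({"trackID": ["a", "b", "c"], "energy": [0.1, 0.2, 0.3]})
art_df = pd.DataFrame({"track_id": ["a", "b"], "genre": [["pop"], ["rock"]]})

merged = pd.merge(
    track_df,
    art_df,
    how="left",
    left_on="trackID",
    right_on="track_id",
    indicator=True,  # adds a _merge column: 'both' or 'left_only'
)

# Tracks that found no artist row
unmatched = merged.loc[merged["_merge"] == "left_only", "trackID"].tolist()
```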

In [194]:
#save dataframes as csvs so they can be read in other notebooks
kaggle.to_csv('../data/kaggle.csv')
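#note: `history` is not defined in the cells shown in this notebook; it must exist before this line runs
history.to_csv('../data/streaminghistory.csv')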
extended.to_csv('../data/extendedhistory.csv')

### Streamlit User Input Function
This function is the basis for my Streamlit app, in which a user enters an artist and track and receives recommendations. It calls the Spotify API to gather all the data points needed by the recommendation system built in later notebooks.

In [28]:
#User Input Functions
def get_users_track(artist, track):
    
    #separate the artist and track filters with a space so they are not run together
    results = sp.search(q=f"artist:{artist} track:{track}", type="track", limit=1)['tracks']['items'][0]
    artist_id = results['album']['artists'][0]['id']
    track_id = results['id']
    trackName = results['name']
    
    #get artist and track features
    user_artists = artist_features([artist_id])
    user_track = track_features([track_id])
    user_table = pd.concat([user_artists, user_track], axis = 1)
    user_table.drop(columns = ['trackID','duration_ms', 'time_signature'], inplace = True)
    user_table.index = [trackName]
        
    return user_table
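
Two details of the search step are easy to get wrong: the filters in the query string need a space between them, and a search with no hits returns an empty `items` list, so indexing `[0]` raises an `IndexError`. The sketch below handles both without calling the API. The helper names `build_track_query` and `first_track` are hypothetical, the `artist:` and `track:` filter syntax is my understanding of Spotify's search format, and the response dictionaries are made up.

```python
def build_track_query(artist: str, track: str) -> str:
    """Combine artist and track filters into one search query string."""
    return f"artist:{artist} track:{track}"

def first_track(search_response: dict):
    """Return the first track item of a search response, or None if empty."""
    items = search_response.get("tracks", {}).get("items", [])
    return items[0] if items else None

query = build_track_query("Leon Bridges", "Coming Home")

# Made-up responses, for illustration only
hit = first_track({"tracks": {"items": [{"id": "abc", "name": "Coming Home"}]}})
miss = first_track({"tracks": {"items": []}})
```

In a Streamlit app, returning `None` gives the interface a chance to show a "track not found" message instead of a traceback.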

In [27]:
#Using My Top 5 Songs from 2022 for Model Evaluation
track1 = get_users_track('Summer Walker', 'CPR') 
track2 = get_users_track('MEDUZA', 'Lose Control') 
track3 = get_users_track('Leon Bridges', 'Coming Home')
track4 = get_users_track('Summer Walker', 'Session 32')
track5 = get_users_track('Kali Uchis', 'telepatía') 

%store track1
%store track2
%store track3
%store track4
%store track5