# Spotify Songs Network - Dataset Generation
* In this notebook we build the dataset for a network of Spotify songs, based on users' playlists.
* Specifically, we want to create a network with the following characteristics:
  * **Nodes**: Songs
  * **Edges**: an edge connects two songs that appear in the same playlist.
* To build the dataset, we obtain data from:
  1. [Spotify Playlists](https://www.kaggle.com/andrewmvd/spotify-playlists) Dataset from [Kaggle](https://www.kaggle.com/).
    * Pichl, Martin; Zangerle, Eva; Specht, Günther: "Towards a Context-Aware Music Recommendation Approach: What is Hidden in the Playlist Name?" in 15th IEEE International Conference on Data Mining Workshops (ICDM 2015), pp. 1360-1365, IEEE, Atlantic City, 2015.
    * **License**: CC BY 4.0
  2. [Spotify Web API](https://developer.spotify.com/documentation/web-api/)
  3. [Chosic Music Genre Finder](https://www.chosic.com/music-genre-finder/)

## Spotify for Developers Credentials
* To run the cells that connect to [Spotify's Web API](https://developer.spotify.com/documentation/web-api/), you first need to create an application at http://developer.spotify.com.
* This gives you a client ID and a client secret.
* Then create a file `spotify_config.py` with the following contents:

  ```
  config = {
      'client_id' : 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
      'client_secret' :'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
  }
  ```
  where the Xs are replaced by your own client ID and client secret.
* Place this file in the same folder as this notebook.

## Import packages
* To begin with, we import the packages used in the rest of the project:
    * [pandas](https://pandas.pydata.org/)
    * [Spotipy](https://spotipy.readthedocs.io/en/2.19.0/)
    * [webdriver-manager](https://pypi.org/project/webdriver-manager/)
    * [Selenium](https://selenium-python.readthedocs.io/)
    * [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/)
* Note that these packages **must be installed locally** before they can be imported.

In [None]:
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from webdriver_manager.firefox import GeckoDriverManager
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import bs4

import random
from itertools import combinations
from collections import defaultdict
import csv

## Kaggle Dataset
* As mentioned above, we will get the basic data from [Spotify Playlists](https://www.kaggle.com/andrewmvd/spotify-playlists) Dataset from [Kaggle](https://www.kaggle.com/).
* After downloading it, create a folder <code>data</code> and place the file in it under the name <code>spotify_dataset.csv.zip</code>.
* So, let's read it.

In [4]:
df = pd.read_csv('data/spotify_dataset.csv.zip', on_bad_lines='skip')
df.head(5)

Unnamed: 0,user_id,"""artistname""","""trackname""","""playlistname"""
0,9cc0cfd4d7d7885102480dd99e7a90d6,Elvis Costello,(The Angels Wanna Wear My) Red Shoes,HARD ROCK 2010
1,9cc0cfd4d7d7885102480dd99e7a90d6,Elvis Costello & The Attractions,"(What's So Funny 'Bout) Peace, Love And Unders...",HARD ROCK 2010
2,9cc0cfd4d7d7885102480dd99e7a90d6,Tiffany Page,7 Years Too Late,HARD ROCK 2010
3,9cc0cfd4d7d7885102480dd99e7a90d6,Elvis Costello & The Attractions,Accidents Will Happen,HARD ROCK 2010
4,9cc0cfd4d7d7885102480dd99e7a90d6,Elvis Costello,Alison,HARD ROCK 2010


* Next, we will rename the columns.

In [None]:
df.rename(columns={' "artistname"' : 'Artist', ' "trackname"': 'Track_Name', ' "playlistname"': 'Playlist_Name'}, inplace=True)

* Because our dataset contains too many songs, we will **keep** only those that are included in more than 500 playlists.
* We do this because a network with too many nodes would not be easily **interpretable**.

In [None]:
#https://stackoverflow.com/questions/44888858/how-to-drop-unique-rows-in-a-pandas-dataframe
df = df[df.groupby(['Track_Name', 'Artist'])['Track_Name'].transform('size') > 500]
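
* The filter above can be illustrated on a toy dataframe. This is only a sketch: the data and the threshold `min_rows` are made up for the example (the notebook itself uses 500).

```python
import pandas as pd

# Toy data: the (track, artist) pair "A"/"x" appears three times, the others once.
toy = pd.DataFrame({
    'Track_Name': ['A', 'A', 'A', 'B', 'C'],
    'Artist':     ['x', 'x', 'x', 'y', 'z'],
    'Playlist':   ['p1', 'p2', 'p3', 'p1', 'p2'],
})

# Hypothetical threshold for the toy example (the notebook uses 500).
min_rows = 2

# transform('size') returns, for every row, the size of the group it belongs to,
# so the result can be used directly as a boolean mask on the original rows.
group_sizes = toy.groupby(['Track_Name', 'Artist'])['Track_Name'].transform('size')
toy_filtered = toy[group_sizes > min_rows]
print(toy_filtered)
```

* Note that this counts rows, so a track that is repeated inside a single playlist is counted once per occurrence.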

* Also, we will combine the columns <code>user_id</code> and <code>Playlist_Name</code> into a single <code>Playlist</code> column, so that each playlist is identified by one value.

In [None]:
df['Playlist'] = df.apply(lambda row: str(row['Playlist_Name']) + " by " + row['user_id'], axis=1)
df.drop(columns=['user_id', 'Playlist_Name'], inplace=True)

## Edges Creation
* Next, we will create our edges, which will be **weighted**.
* Each edge will have a *Source*, a *Target* and a *Weight*.
* The *Weight* will be the number of playlists in which the two songs appear together.

* Before doing that, note that some song names contain characters that make them unsearchable through the API, so we will slightly modify those names.

In [None]:
track_name_mapping = {
    "Baba O'Riley - Original Album Version" : "Baba O'Riley",
    'Jerk It Out - Original Mix' : 'Jerk It Out',
    'Jump - Remastered Version' : 'Jump',
    "Don't You Worry Child (Radio Edit) [feat. John Martin]" : "Don't You Worry Child Radio Edit",
    'Save the World - Radio Mix' : 'Save the World',
    'Wildfire (feat. Little Dragon)' : 'Wildfire',
    'Blister In The Sun (Remastered Album Version)' : 'Blister In The Sun',
    'Hey Ya! - Radio Mix / Club Mix' : 'Hey Ya!',
    'How Soon Is Now? (2008 Remastered Version)' : 'How Soon Is Now?',
    'Intergalactic - 2009 Digital Remaster' : 'Intergalactic',
    'This Charming Man (2008 Remastered Version)' : 'This Charming Man',
    'Suit & Tie featuring JAY Z' : 'Suit & Tie',
    'A-Punk (Album)' : 'A-Punk',
    'Heroes - 1999 Remastered Version' : 'Heroes',
    'Sexy Bitch (feat. Akon) - Featuring Akon;explicit' : 'Sexy Bitch',
    'Wannabe - Radio Edit' : 'Wannabe'    
}

df['Track_Name'] = df['Track_Name'].map(lambda x: track_name_mapping.get(x, x))
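
* As a small illustration of the renaming step, the sketch below applies a one-entry mapping (`toy_mapping`, made up for the example) to a toy series. Names that are not in the mapping fall through unchanged because of the default argument of `dict.get`.

```python
import pandas as pd

# Hypothetical one-entry mapping and toy track names, for illustration only.
toy_mapping = {'Jump - Remastered Version': 'Jump'}
names = pd.Series(['Jump - Remastered Version', 'Alison'])

# dict.get(x, x) returns the mapped name if present, otherwise the original name.
cleaned = names.map(lambda x: toy_mapping.get(x, x))

# For exact matches, Series.replace with a dict gives the same result.
cleaned_alt = names.replace(toy_mapping)
print(cleaned.tolist())
```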

* Now, we are ready to create the nodes of our Network.

In [None]:
nodes = df[['Track_Name', 'Artist']].copy().drop_duplicates()
nodes.reset_index(inplace=True, drop=True)
nodes['Id'] = nodes.index
nodes.rename(columns={'Track_Name': 'Label'}, inplace=True)

* Then, we will add a new column to the dataset that contains the *Node ID* of each track.
* To do that, we will use a mapping whose keys are (track name, artist) pairs and whose values are the node ids.

In [None]:
list_tracks_artist = list(zip(nodes['Label'], nodes['Artist']))
nodes_id_mapping = dict(zip(list_tracks_artist, nodes['Id']))

df['Track_Id'] = df.apply(lambda row: nodes_id_mapping[(row['Track_Name'], row['Artist'])], axis=1)
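
* On a large dataframe, a row-wise `apply` can be slow. As a possible alternative (a sketch on toy data, not the method used in this notebook), the same ids can be attached with a left merge on the track name and artist.

```python
import pandas as pd

# Toy rows: the pair "A"/"x" appears twice.
toy = pd.DataFrame({
    'Track_Name': ['A', 'B', 'A', 'C'],
    'Artist':     ['x', 'y', 'x', 'z'],
})

# Build the toy nodes table the same way as in the notebook.
toy_nodes = toy[['Track_Name', 'Artist']].drop_duplicates().reset_index(drop=True)
toy_nodes['Id'] = toy_nodes.index
toy_nodes = toy_nodes.rename(columns={'Track_Name': 'Label'})

# A left merge keeps the row order of `toy` and adds the matching node id.
toy_with_ids = toy.merge(
    toy_nodes.rename(columns={'Label': 'Track_Name', 'Id': 'Track_Id'}),
    on=['Track_Name', 'Artist'],
    how='left',
)
print(toy_with_ids)
```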

* The next thing we will do is count, for each pair of tracks, the number of playlists in which both tracks appear.
* To do that, we will first **group** our dataframe by the <code>Playlist</code> column.
* Then, we will use [itertools](https://docs.python.org/3/library/itertools.html#itertools) to get all the possible pairs within each playlist.

In [None]:
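df_grouped = df.groupby('Playlist')

pair_counts = defaultdict(int)
for name, group in df_grouped:
    try:
        # Count each track once per playlist, so that the weight is the number
        # of shared playlists and a repeated track cannot be paired with itself.
        track_ids = sorted(group['Track_Id'].unique().tolist())
        # Sorted input makes combinations() yield every pair as (smaller, larger).
        for pair in combinations(track_ids, 2):
            pair_counts[pair] += 1
    except MemoryError:
        print('Group {} is too big, it contains {} rows.'.format(name, len(group)))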
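
* A toy illustration of the counting step, assuming each track is counted once per playlist. The playlists and ids are made up; `Counter` is used here only for brevity.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Two toy playlists: p1 contains tracks 0, 1, 2 and p2 contains tracks 0, 1.
toy = pd.DataFrame({
    'Playlist': ['p1', 'p1', 'p1', 'p2', 'p2'],
    'Track_Id': [2, 0, 1, 1, 0],
})

toy_pair_counts = Counter()
for _, group in toy.groupby('Playlist'):
    # Sorting the distinct ids first makes every pair come out as (smaller, larger).
    track_ids = sorted(set(group['Track_Id'].tolist()))
    toy_pair_counts.update(combinations(track_ids, 2))

print(dict(toy_pair_counts))
```

* Tracks 0 and 1 share both playlists, so their edge gets weight 2, while the other two pairs get weight 1.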

* We are now ready to extract our edges.

In [None]:
import csv

def write_headers(writer):
    headers = ['Source', 'Target', 'Weight']
    writer.writerow(headers)

def write_edges(writer, edges_weights_dict):
    for (source, target), weight in edges_weights_dict.items():
        writer.writerow([source, target, weight])

# The network_data folder must already exist. The with-block closes the file
# even if an error occurs while writing.
with open('network_data/edges.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    write_headers(writer)
    write_edges(writer, pair_counts)
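
* To sanity-check the edge file format, the sketch below writes a few made-up edges in the same layout and reads them back with pandas. An in-memory buffer stands in for <code>network_data/edges.csv</code>, so nothing is written to disk.

```python
import csv
import io

import pandas as pd

# Made-up edge weights, keyed by (source, target) as in the notebook.
toy_pair_counts = {(0, 1): 2, (0, 2): 1}

buffer = io.StringIO()  # stands in for network_data/edges.csv
writer = csv.writer(buffer)
writer.writerow(['Source', 'Target', 'Weight'])
for (source, target), weight in toy_pair_counts.items():
    writer.writerow([source, target, weight])

# Rewind the buffer and read it back as a dataframe.
buffer.seek(0)
edges_check = pd.read_csv(buffer)
print(edges_check)
```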

## Spotify Web API
* Unfortunately, our main dataset does not contain any **extra information** about the songs, except their name and artist.
* So we will try to enrich our dataset by using the [Spotify Web API](https://developer.spotify.com/documentation/web-api/).
* We don't even have the Spotify ID of each song, so we have to **search** for it, using the name of the song and the artist.
* *Remember, to create a connection with Spotify's API using the following code a <code>spotify_config.py</code> file must have been created as mentioned in the beginning*.

In [None]:
from spotify_config import config

client_credentials_manager = SpotifyClientCredentials(config['client_id'],
                                                      config['client_secret'])
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
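
* If <code>spotify_config.py</code> is missing, the import above fails. One possible way to handle this is sketched below. The helper `load_spotify_credentials` is hypothetical and not part of this notebook; it falls back to the `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` environment variables, which Spotipy also reads by default.

```python
import os

def load_spotify_credentials():
    """Return (client_id, client_secret) from spotify_config.py or the environment.

    Hypothetical helper for illustration; either value may be None if unset.
    """
    try:
        # The file described at the beginning of the notebook.
        from spotify_config import config
        return config['client_id'], config['client_secret']
    except ImportError:
        # Fall back to the environment variables that Spotipy itself looks for.
        return (os.environ.get('SPOTIPY_CLIENT_ID'),
                os.environ.get('SPOTIPY_CLIENT_SECRET'))
```

* The returned pair could then be passed to `SpotifyClientCredentials` in place of the values read from `config`.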

* Next, for each node of our network we will get the *audio features* that Spotify provides.
* The following code searches the API for each song by name and artist in order to get its Spotify id. Then, it retrieves the audio features of the songs in batches of 100.

In [71]:
nodes['Track'] = nodes.apply(lambda row: row['Label'] + " " + row['Artist'], axis=1)

ids = list(nodes['Id'])
tracks = list(nodes['Track'])

def get_track_id_from_json(track_json):
    # Returns the Spotify URI of the first search hit.
    return track_json['tracks']['items'][0]['uri']

features = {}
start = 0
num_tracks = 100  # audio_features accepts at most 100 tracks per request
while start < len(ids):
    print(f'getting from {start} to {start+num_tracks}')
    ids_batch = ids[start:start+num_tracks]
    tracks_batch = tracks[start:start+num_tracks]
    found_ids_batch = []
    spotify_ids_batch = []
    for node_id, track in zip(ids_batch, tracks_batch):
        try:
            search_result = sp.search(q=track, type='track', limit=1)
            spotify_ids_batch.append(get_track_id_from_json(search_result))
            # Remember which node the hit belongs to, so that a failed search
            # does not shift the features of the following tracks.
            found_ids_batch.append(node_id)
        except Exception:
            print(track)
    if spotify_ids_batch:
        features_batch = sp.audio_features(spotify_ids_batch)
        features.update({ node_id : track_features
                         for node_id, track_features in zip(found_ids_batch, features_batch)
                         if track_features is not None })
    start += num_tracks

getting from 0 to 100
getting from 100 to 200
getting from 200 to 300
getting from 300 to 400
getting from 400 to 500
getting from 500 to 600
getting from 600 to 700
getting from 700 to 800
getting from 800 to 900
getting from 900 to 1000
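
* The batching logic can be tried out without network access. In the sketch below, `fake_search` is a hypothetical stand-in for the API search, and `chunked` is an illustrative helper; neither is part of Spotipy. The point is that each result is stored under its own node id, so a failed lookup cannot misalign the remaining entries.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fake_search(track):
    """Hypothetical stand-in for the API search: returns None when nothing is found."""
    known = {'song a': 'uri:a', 'song c': 'uri:c'}
    return known.get(track)

node_ids = [0, 1, 2]
track_queries = ['song a', 'song b', 'song c']

resolved = {}
for batch in chunked(list(zip(node_ids, track_queries)), 2):
    for node_id, query in batch:
        uri = fake_search(query)
        if uri is not None:
            # Store the result under its own node id, so a miss cannot shift later entries.
            resolved[node_id] = uri

print(resolved)
```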


* Let's put them into a dataframe.

In [178]:
features_df = pd.DataFrame.from_dict(features, orient='index')
features_df.drop(columns=['key', 'mode', 'type', 'id', 'uri', 'track_href', 'analysis_url', 'time_signature'], inplace=True)
features_df.columns = features_df.columns.str.capitalize()
features_df.head(5)

Unnamed: 0,Danceability,Energy,Loudness,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Duration_ms
0,0.439,0.422,-17.227,0.0409,0.0148,4.8e-05,0.0697,0.551,81.833,236933
1,0.803,0.548,-7.103,0.12,0.351,0.0,0.0953,0.75,121.942,214920
2,0.284,0.875,-6.069,0.0422,0.00752,0.000461,0.402,0.595,75.009,340907
3,0.36,0.684,-6.457,0.0308,0.323,0.0,0.34,0.2,77.15,342653
4,0.27,0.944,-4.199,0.0975,0.00501,2.1e-05,0.116,0.606,146.347,269920
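
* The conversion from a dictionary of per-node features to a dataframe can be seen on a small made-up example. The values and URIs below are placeholders, not real API output.

```python
import pandas as pd

# Placeholder features keyed by node id, for illustration only.
toy_features = {
    0: {'danceability': 0.5, 'energy': 0.4, 'key': 2, 'uri': 'spotify:track:xxx'},
    1: {'danceability': 0.8, 'energy': 0.6, 'key': 7, 'uri': 'spotify:track:yyy'},
}

# orient='index' turns the outer keys into the row index and the inner keys into columns.
toy_df = pd.DataFrame.from_dict(toy_features, orient='index')

# Drop the columns that are not needed and capitalize the remaining names.
toy_df = toy_df.drop(columns=['key', 'uri'])
toy_df.columns = toy_df.columns.str.capitalize()
print(toy_df)
```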
