# Spotify Songs Network - Dataset Generation
* In this notebook we create the dataset that will be used to build a network of Spotify songs, based on users' playlists.
* Specifically, we want to create a network with the following characteristics:
  * **Nodes**: Songs
  * **Edges**: created between two songs whenever both are found in the same playlist.
* To build the dataset, we will obtain data from:
  1. [Spotify Playlists](https://www.kaggle.com/andrewmvd/spotify-playlists) Dataset from [Kaggle](https://www.kaggle.com/).
    * Pichl, Martin; Zangerle, Eva; Specht, Günther: "Towards a Context-Aware Music Recommendation Approach: What is Hidden in the Playlist Name?" in 15th IEEE International Conference on Data Mining Workshops (ICDM 2015), pp. 1360-1365, IEEE, Atlantic City, 2015.
    * **License**: CC BY 4.0
  2. [Spotify Web API](https://developer.spotify.com/documentation/web-api/)
  3. [Chosic Music Genre Finder](https://www.chosic.com/music-genre-finder/)
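* The network definition above (songs as nodes, an edge between two songs that share a playlist) can be sketched on toy data as follows. The playlist contents and the function name `build_edges` are hypothetical, and counting shared playlists as an edge weight is an assumption made for illustration; the text above only states that an edge exists.

```python
from collections import defaultdict
from itertools import combinations

# Toy data (hypothetical): playlist name -> songs it contains.
playlists = {
    "road trip": ["Song A", "Song B", "Song C"],
    "workout": ["Song B", "Song C"],
}


def build_edges(playlists):
    """Count, for each unordered pair of songs, how many playlists contain both."""
    edge_weights = defaultdict(int)
    for songs in playlists.values():
        # sorted(set(...)) drops duplicates and gives every pair one canonical order.
        for pair in combinations(sorted(set(songs)), 2):
            edge_weights[pair] += 1
    return dict(edge_weights)


edges = build_edges(playlists)
for (song_a, song_b), weight in edges.items():
    print(song_a, "<->", song_b, "shared playlists:", weight)
```

* Each key of `edges` is one undirected edge. A playlist with *n* distinct songs contributes *n(n-1)/2* pairs, which is one reason to limit the number of songs before building the network.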

## Spotify for Developers Credentials
* Anyone who wants to execute the cells that connect to the [Spotify Web API](https://developer.spotify.com/documentation/web-api/) must first create an application at http://developer.spotify.com.
* Doing so provides a client ID and a client secret.
* Then, create a file `spotify_config.py` with the following contents:

  ```
  config = {
      'client_id': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
      'client_secret': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
  }
  ```
  where the Xs are replaced by the user's own client ID and client secret.
* This file must be placed in the same folder as this notebook.
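* As a minimal sketch of how such a config dictionary might be turned into an authenticated client, assuming Spotipy's Client Credentials flow: the helper names `load_credentials` and `make_spotify_client` are hypothetical and not part of the notebook or of Spotipy.

```python
def load_credentials(config):
    """Return (client_id, client_secret); fail if one is missing or still all Xs."""
    creds = []
    for key in ("client_id", "client_secret"):
        value = config.get(key, "")
        if not value or set(value) == {"X"}:
            raise ValueError(f"'{key}' is missing or still a placeholder")
        creds.append(value)
    return tuple(creds)


def make_spotify_client(config):
    """Create a Spotipy client authenticated with the Client Credentials flow."""
    # Imported here so the sketch can be loaded even if Spotipy is not installed.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    client_id, client_secret = load_credentials(config)
    auth_manager = SpotifyClientCredentials(
        client_id=client_id, client_secret=client_secret
    )
    return spotipy.Spotify(auth_manager=auth_manager)
```

* With `spotify_config.py` in place, usage would look like `from spotify_config import config` followed by `sp = make_spotify_client(config)`.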

## Import packages
* To begin with, we import the packages that are used in the rest of the project:
    * [pandas](https://pandas.pydata.org/)
    * [Spotipy](https://spotipy.readthedocs.io/en/2.19.0/)
    * [webdriver-manager](https://pypi.org/project/webdriver-manager/)
    * [Selenium](https://selenium-python.readthedocs.io/)
    * [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/)
* Note that the aforementioned packages **must also be installed locally** before they can be imported.
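* One way to check the requirement above before running the import cell is sketched below. The helper name `missing_packages` is hypothetical. The list uses import names, which differ from the PyPI names for two packages: `webdriver_manager` is installed as `webdriver-manager` and `bs4` as `beautifulsoup4`.

```python
from importlib.util import find_spec

# Import names (not PyPI names) of the packages listed above.
REQUIRED = ["pandas", "spotipy", "webdriver_manager", "selenium", "bs4"]


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```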

In [None]:
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from webdriver_manager.firefox import GeckoDriverManager
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import bs4

import random
from itertools import combinations
from collections import defaultdict
import csv

## Kaggle Dataset
* As mentioned above, the base data comes from the [Spotify Playlists](https://www.kaggle.com/andrewmvd/spotify-playlists) dataset on [Kaggle](https://www.kaggle.com/).
* After downloading it, create a folder <code>data</code> and place the file in it under the name <code>spotify_dataset.csv.zip</code>.
* Now we can read it. The argument <code>on_bad_lines='skip'</code> tells pandas to skip rows that cannot be parsed instead of raising an error.

In [4]:
df = pd.read_csv('data/spotify_dataset.csv.zip', on_bad_lines='skip')
df.head(5)

| | user_id | "artistname" | "trackname" | "playlistname" |
|---|---|---|---|---|
| 0 | 9cc0cfd4d7d7885102480dd99e7a90d6 | Elvis Costello | (The Angels Wanna Wear My) Red Shoes | HARD ROCK 2010 |
| 1 | 9cc0cfd4d7d7885102480dd99e7a90d6 | Elvis Costello & The Attractions | (What's So Funny 'Bout) Peace, Love And Unders... | HARD ROCK 2010 |
| 2 | 9cc0cfd4d7d7885102480dd99e7a90d6 | Tiffany Page | 7 Years Too Late | HARD ROCK 2010 |
| 3 | 9cc0cfd4d7d7885102480dd99e7a90d6 | Elvis Costello & The Attractions | Accidents Will Happen | HARD ROCK 2010 |
| 4 | 9cc0cfd4d7d7885102480dd99e7a90d6 | Elvis Costello | Alison | HARD ROCK 2010 |


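* Next, we rename the columns. Note that the original column names contain a leading space and literal double quotes, so the keys passed to `rename` must match them exactly.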

In [None]:
df.rename(columns={' "artistname"' : 'Artist', ' "trackname"': 'Track_Name', ' "playlistname"': 'Playlist_Name'}, inplace=True)

* Because our dataset contains too many songs, we will **keep** only those that are included in more than 500 playlists.
* We do this because a network with too many nodes would not be easily **interpretable**.

In [None]:
# Keep only the rows whose (Track_Name, Artist) pair occurs more than 500 times.
# See https://stackoverflow.com/questions/44888858/how-to-drop-unique-rows-in-a-pandas-dataframe
df = df[df.groupby(['Track_Name', 'Artist'])['Track_Name'].transform('size') > 500]
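
* To see what the filter above does, here is the same `groupby(...).transform('size')` pattern applied to a small toy frame. The rows and the threshold of 2 are made up for illustration; the notebook itself uses 500.

```python
import pandas as pd

# Toy rows (hypothetical): one row per (playlist, track) entry.
toy = pd.DataFrame({
    "Artist": ["A", "A", "A", "B", "C", "C"],
    "Track_Name": ["x", "x", "x", "y", "z", "z"],
    "Playlist_Name": ["p1", "p2", "p3", "p1", "p2", "p3"],
})

# transform('size') gives every row the number of rows in its (Track_Name, Artist) group.
counts = toy.groupby(["Track_Name", "Artist"])["Track_Name"].transform("size")

# Keep only the rows of tracks that occur more than 2 times.
popular = toy[counts > 2]
print(popular)
```

* The count is per row, so a track listed twice in the same playlist is counted twice. Row counts therefore equal playlist counts only if each track appears at most once per playlist.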