# Notebook 01: Data Collection

## Contents:
1. [Summary](#section1)
2. [Configure SQL Server](#section2)
3. [Collect Song Info](#section3)
4. [Collect Lyric Info](#section4)

## Summary <a name="section1"></a>
I will be collecting song data directly from the Spotify and Genius APIs to build a corpus of song lyrics. Spotify-curated playlists from the Romance category will inform the tracklist, while Genius will provide the lyrics. Once the tracklists and lyrics have been queried, a SQL table will be populated on a remote database, allowing for easy transfer between notebooks.

Importing the libraries used within this notebook. Of note is the spotipy package, a Python wrapper for the Spotify API. API credentials are also loaded from external helper files.

In [3]:
# !pip install spotipy

Collecting spotipy
  Downloading https://files.pythonhosted.org/packages/59/46/3c957255c96910a8a0e2d9c25db1de51a8676ebba01d7966bedc6e748822/spotipy-2.4.4.tar.gz
Building wheels for collected packages: spotipy
  Building wheel for spotipy (setup.py) ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/76/28/19/a86ca9bb0e32dbd4a4f580870250f5aeef852870578e0427e6
Successfully built spotipy
Installing collected packages: spotipy
Successfully installed spotipy-2.4.4


In [5]:
import json, time, re, requests, spotipy
import pandas as pd
import psycopg2 as pg2

from bs4 import BeautifulSoup
from time import sleep
from spotipy.oauth2 import SpotifyClientCredentials

from psycopg2.extras import RealDictCursor, Json
from psycopg2.extensions import AsIs
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

%run ../assets/sql_cred.py
%run ../assets/spotify_cred.py
%run ../assets/genius_cred.py

Supplying client credentials to the Spotify Client manager to receive an API access token which allows access to their data.

In [3]:
client_credential_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credential_manager)

## Configure SQL Server <a name="section2"></a>

Defining helper functions to allow easy interfacing with the SQL database:
-  **con_cur_to_db**: returns both a connection and a cursor object for the database
-  **execute_query**: executes a query against the database without having to create a connection and cursor each time

In [7]:
def con_cur_to_db(dbname=DBNAME, dict_cur=None):
    con = pg2.connect(host=IP_ADDRESS,
                  dbname=dbname,
                  user=USER,
                  password=PASSWORD)
    if dict_cur:
        cur = con.cursor(cursor_factory=RealDictCursor)
    else:
        cur = con.cursor()
    return con, cur
    
def execute_query(query, dbname=DBNAME, dict_cur=None, command=False):
    con, cur = con_cur_to_db(dbname, dict_cur)
    cur.execute(f'{query}')
    if not command:
        data = cur.fetchall()
        con.close()
        return data
    con.commit() #sends to server
    con.close() #closes server connection
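
The fetch-versus-command split in `execute_query` can be tried without database credentials. The sketch below is only a stand-in under stated assumptions: it swaps PostgreSQL for the standard-library `sqlite3` module, and `run_query`, `DB_PATH`, and the `demo` table are hypothetical names that do not appear elsewhere in this notebook.

```python
import os
import sqlite3
import tempfile

# Hypothetical stand-in database file; the notebook connects to PostgreSQL instead.
DB_PATH = os.path.join(tempfile.mkdtemp(), "demo.db")

def run_query(query, command=False):
    """Open a connection, run one statement, then close the connection."""
    con = sqlite3.connect(DB_PATH)
    try:
        cur = con.cursor()
        cur.execute(query)
        if command:
            con.commit()       # persist CREATE / INSERT / UPDATE statements
            return None
        return cur.fetchall()  # SELECT statements return their rows
    finally:
        con.close()            # runs even if the statement raises

run_query("CREATE TABLE demo (track_id TEXT PRIMARY KEY, track_name TEXT)", command=True)
run_query("INSERT INTO demo VALUES ('abc123', 'Example Song')", command=True)
rows = run_query("SELECT track_id, track_name FROM demo")
print(rows)
```

Closing the connection in a `finally` clause is a design choice of this sketch; it keeps a failed statement from leaving a connection open.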

Create table `track_table` to save the collected data. 

The primary key will be the Spotify ID, a unique identifier which will allow me to reference song metadata throughout this project. The SQL table will be organized with the following features:

| Feature        | SQL Type              | Data Type | Description                                                                                          |
|----------------|-----------------------|-----------|------------------------------------------------------------------------------------------------------|
| track_id       | VARCHAR / PRIMARY KEY | string    | Unique track ID assigned by Spotify and used to trace back to song metadata throughout this project. |
| playlist_id    | VARCHAR               | string    | Unique playlist ID assigned by Spotify and used to group tracks by playlist.                         |
| track_name     | VARCHAR               | string    | Name of the Track.                                                                                   |
| artist_name    | VARCHAR               | string    | Name of the Artist.                                                                                  |
| album_name     | VARCHAR               | string    | Name of the Album.                                                                                   |
| playlist_name  | VARCHAR               | string    | Name of the Spotify playlist.                                                                        |
| playlist_owner | VARCHAR               | string    | Name of the playlist creator (Spotify).                                                              |
| lyrics         | JSON                  | string    | Lyrics queried from Genius (http://genius.com)                                                       |

In [None]:
# query = '''CREATE TABLE track_table (
# track_id VARCHAR PRIMARY KEY,
# playlist_id VARCHAR,
# track_name VARCHAR,
# artist_name VARCHAR,
# album_name VARCHAR,
# playlist_name VARCHAR,
# playlist_owner VARCHAR,
# lyrics JSON
# )
# ;'''

# execute_query(query, command=True)

Creating a helper function to insert track metadata into the SQL table once it's been collected from the Spotify API. If the track has already been added from an earlier playlist it will alert the user that it has been skipped. 

In [5]:
def insert_playlist_info(track_dict):
    con, cur = con_cur_to_db()
    
    columns = track_dict.keys()
    values = [track_dict[column] for column in columns]

    insert_statement = 'INSERT INTO track_table (%s) VALUES %s;'
    
    try:
        cur.execute(insert_statement, (AsIs(','.join(columns)), tuple(values)))
    
    except pg2.IntegrityError:
        print("Duplicate track. Skipping:")
        print(tuple(values))

    con.commit()
    con.close()
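
The duplicate-skipping behaviour relies on the primary key rejecting a second row with the same `track_id`. As a rough illustration, assuming an in-memory `sqlite3` database in place of the remote PostgreSQL server and a hypothetical `insert_track` helper, the same idea looks like this:

```python
import sqlite3

# In-memory stand-in for the remote database (assumption for illustration only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE track_table (track_id TEXT PRIMARY KEY, track_name TEXT)")

def insert_track(con, track_dict):
    """Insert one track; return False when the primary key already exists."""
    columns = list(track_dict)  # column names are assumed to come from trusted code
    placeholders = ", ".join("?" for _ in columns)
    statement = f"INSERT INTO track_table ({', '.join(columns)}) VALUES ({placeholders})"
    try:
        con.execute(statement, [track_dict[c] for c in columns])
    except sqlite3.IntegrityError:
        con.rollback()  # discard the failed statement before continuing
        return False
    con.commit()
    return True

first = insert_track(con, {"track_id": "abc123", "track_name": "Example Song"})
second = insert_track(con, {"track_id": "abc123", "track_name": "Example Song"})
count = con.execute("SELECT COUNT(*) FROM track_table").fetchone()[0]
print(first, second, count)
```

Only the column names are interpolated into the statement; the values travel as parameters, so the driver handles quoting.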

Creating a helper function to insert track lyrics into the SQL table once they have been collected from Genius. If the lyrics have already been added or the upload fails, the user will be notified accordingly.

In [6]:
def insert_lyrics(lyrics, track_id):
    con, cur = con_cur_to_db()
    
    # Parameterized query: psycopg2 quotes and escapes both values itself
    insert_statement = "UPDATE track_table SET lyrics = %s WHERE track_id = %s;"
    
    try:
        cur.execute(insert_statement, (Json(lyrics), track_id))

    except pg2.IntegrityError:
        print("Lyrics already added. Skipping:")
        print(track_id)
    
    except pg2.Error:
        print("Upload error. Skipping:")
        print(track_id)
        
    con.commit()
    con.close()

Creating a function to collect all tracks from all playlists for a given category. This function leverages the helper function above to ensure that track metadata is added as each playlist's data is collected.

## Collect Song Info <a name="section3"></a>

In [7]:
def get_playlist_tracks(category_id='romance', country='us', limit=50, cycles=10, offset=0):

    fin_cycles = 0
    
    while fin_cycles < cycles:
        
        playlists = sp.category_playlists(category_id=category_id, country=country, limit=limit, offset=offset)

        offset = offset + len(playlists['playlists']['items'])

        for playlist in playlists['playlists']['items']:
            playlist_id = playlist['id']
            playlist_name = playlist['name']
            playlist_owner = playlist['owner']['id']

            tracks = sp.user_playlist_tracks(user = playlist_owner, 
                                             playlist_id = playlist_id) 

            for track in tracks['items']:
                try:
                    track_dict = {
                        'track_id' : track['track']['id'],
                        'playlist_id' : playlist_id,
                        'track_name' : track['track']['name'],
                        'artist_name' : track['track']['artists'][0]['name'],
                        'album_name' : track['track']['album']['name'],
                        'playlist_name' : playlist_name,
                        'playlist_owner' : playlist_owner
                    }

                    insert_playlist_info(track_dict)
                
                except TypeError:
                    print(f'Missing track info for track in {playlist_id}. Skipping track.')
                    
            print(f'Uploaded {playlist_name} playlist to SQLdb')

        print(f'Finished uploading {offset} playlists in {category_id} category')
        
        fin_cycles += 1
    
    print(f'Finished uploading cycle: {fin_cycles}')
    print(f'Current offset: {offset}')
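
In the loop above, the offset only advances by the number of items returned, so once a request comes back empty every remaining cycle repeats the same request. One possible refinement, sketched here with a hypothetical `fetch_page` function standing in for the Spotify call, is to stop at the first empty page. This is not what `get_playlist_tracks` does; it is an illustration of the pagination pattern only.

```python
def fetch_page(offset, limit):
    """Hypothetical stand-in for a paginated API call over 7 fake playlists."""
    all_items = [f"playlist_{i}" for i in range(7)]
    return all_items[offset:offset + limit]

def collect_all(limit=3, max_cycles=10):
    """Advance the offset page by page and stop at the first empty page."""
    offset, collected, cycles = 0, [], 0
    while cycles < max_cycles:
        page = fetch_page(offset, limit)
        cycles += 1
        if not page:  # nothing left: later requests would repeat this result
            break
        collected.extend(page)
        offset += len(page)
    return collected, offset, cycles

items, final_offset, cycles_used = collect_all()
print(len(items), final_offset, cycles_used)
```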

Because song lyrics are not available through a public endpoint of the Genius API, I created a helper function that uses a track name and artist name to find the URL of the lyrics page. The page is then scraped for the lyrics, if they exist. If the track is not present in the Genius database, the user is notified and the track is skipped. 

In [14]:
def request_song_info(song_title, artist_name):
    search_url = 'https://api.genius.com/search'  
    data = {'q': song_title + ' ' + artist_name}
    headers = {'Authorization': 'Bearer ' + TOKEN}
    
    try:
        # Send the search term as a URL query parameter
        response = requests.get(search_url, params=data, headers=headers)

    except requests.exceptions.ConnectionError: 
        print(f'*****Connection error for {artist_name}, {song_title}*****')
        print('Retrying connection')
        
        # Retry through a session that backs off between connection attempts
        session = requests.Session()
        retry = Retry(connect=3, backoff_factor=0.5)
        adapter = HTTPAdapter(max_retries=retry)
        session.mount('http://', adapter)
        session.mount('https://', adapter)

        response = session.get(search_url, params=data, headers=headers)

    # Named `results` so the imported `json` module is not shadowed
    results = response.json()
        
    for hit in results['response']['hits']:
        if (song_title.lower() in hit['result']['title'].lower()) and (artist_name.lower() in hit['result']['primary_artist']['name'].lower()): 
            song_url = hit['result']['url']
            print(song_url)
            page = requests.get(song_url)
            html = BeautifulSoup(page.text, 'html.parser')
            lyrics = html.find('div', class_='lyrics').get_text()

            return lyrics

        if (song_title.lower() in hit['result']['title'].lower()) or (artist_name.lower() in hit['result']['primary_artist']['name'].lower()): 
            song_url = hit['result']['url']
            print(song_url)
            page = requests.get(song_url)
            html = BeautifulSoup(page.text, 'html.parser')
            lyrics = html.find('div', class_='lyrics').get_text()

            return lyrics
        
    else:
        print(f'{song_title}, {artist_name} not found')
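
The matching step can be examined separately from the network calls. The sketch below is a variation rather than the function above: it scans every hit for a title-and-artist match before falling back to the first partial match, whereas `request_song_info` accepts the first hit that matches on either field. `pick_song_url` and the sample hits are hypothetical, and the dictionaries mimic only the fields this notebook reads.

```python
def pick_song_url(hits, song_title, artist_name):
    """Return the URL of the best hit: a title-and-artist match first, else a partial match."""
    title, artist = song_title.lower(), artist_name.lower()
    partial = None
    for hit in hits:
        result = hit["result"]
        title_ok = title in result["title"].lower()
        artist_ok = artist in result["primary_artist"]["name"].lower()
        if title_ok and artist_ok:
            return result["url"]     # strict match wins immediately
        if (title_ok or artist_ok) and partial is None:
            partial = result["url"]  # remember the first partial match
    return partial                   # None when nothing matched

# Made-up hits that mimic only the fields the notebook reads.
hits = [
    {"result": {"title": "Other Song",
                "primary_artist": {"name": "Example Artist"},
                "url": "https://example.com/other"}},
    {"result": {"title": "Example Song",
                "primary_artist": {"name": "Example Artist"},
                "url": "https://example.com/exact"}},
]
print(pick_song_url(hits, "Example Song", "example artist"))
```

With these sample hits the first entry matches on artist only, so preferring strict matches avoids returning a different song by the same artist.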

Collecting the tracklists for Spotify-curated playlists in the Romance category. 

In [23]:
get_playlist_tracks(category_id='romance', country='us', limit=50, cycles=10, offset=52)

Finished uploading 52 playlists in romance category
Finished uploading 52 playlists in romance category
Finished uploading 52 playlists in romance category
Finished uploading 52 playlists in romance category
Finished uploading 52 playlists in romance category
Finished uploading 52 playlists in romance category
Finished uploading 52 playlists in romance category
Finished uploading 52 playlists in romance category
Finished uploading 52 playlists in romance category
Finished uploading 52 playlists in romance category
Finished uploading cycle: 10
Current offset: 52


Pulling the SQL table into a dataframe that we can then iterate through to collect lyrics.

In [8]:
query = '''SELECT * FROM track_table;'''
response = execute_query(query, dict_cur=True)
track_df = pd.DataFrame(response)
track_df.set_index('track_id', inplace=True)

In [9]:
track_df.head()

| track_id | album_name | artist_name | lyrics | playlist_id | playlist_name | playlist_owner | track_name |
|----------|------------|-------------|--------|-------------|---------------|----------------|------------|
| 0h7TlF8gKb61aSm874s3cV | I Can't Tell You How Much It Hurts | moow | `\n\nIf your needle is near\nNeedle is near\nYo...` | 37i9dQZF1DXarebqD2nAVg | Tender | spotify | You'r in My Head |
| 6koowTu9pFHPEcZnACLKbK | Coming Home | Leon Bridges | `\n\n[Verse 1]\nBrown skin girl on the other si...` | 37i9dQZF1DX4adj7PFEBwf | Wedding Bells | spotify | Brown Skin Girl |
| 1JkhKUXAoNivi87ipmV3rp | Back To Love (Deluxe Version) | Anthony Hamilton | `\n\n[Verse 1]\nIt's simple, I love it\nHaving ...` | 37i9dQZF1DX4adj7PFEBwf | Wedding Bells | spotify | Best of Me |
| 51lPx6ZCSalL2kvSrDUyJc | The Search for Everything | John Mayer | `\n\n[Intro: Whistling]\n\n[Verse 1]\nA great b...` | 37i9dQZF1DX4adj7PFEBwf | Wedding Bells | spotify | You're Gonna Live Forever in Me |
| 3vqlZUIT3rEmLaYKDBfb4Q | Songs In The Key Of Life | Stevie Wonder | `\n\n[Verse 1]\nIsn't she lovely\nIsn't she won...` | 37i9dQZF1DX4adj7PFEBwf | Wedding Bells | spotify | Isn't She Lovely |


Checking the shape of the dataframe to know how many tracks have been collected.

In [11]:
track_df.shape

(2861, 7)

Reordering the columns for easier viewing.

In [65]:
track_df = track_df[['track_name','artist_name', 'album_name', 'playlist_name', 'playlist_id', 'lyrics']]

In [66]:
track_df.head()

| track_id | track_name | artist_name | album_name | playlist_name | playlist_id | lyrics |
|----------|------------|-------------|------------|---------------|-------------|--------|
| 0h7TlF8gKb61aSm874s3cV | You'r in My Head | moow | I Can't Tell You How Much It Hurts | Tender | 37i9dQZF1DXarebqD2nAVg | `\n\nIf your needle is near\nNeedle is near\nYo...` |
| 6koowTu9pFHPEcZnACLKbK | Brown Skin Girl | Leon Bridges | Coming Home | Wedding Bells | 37i9dQZF1DX4adj7PFEBwf | `\n\n[Verse 1]\nBrown skin girl on the other si...` |
| 1JkhKUXAoNivi87ipmV3rp | Best of Me | Anthony Hamilton | Back To Love (Deluxe Version) | Wedding Bells | 37i9dQZF1DX4adj7PFEBwf | `\n\n[Verse 1]\nIt's simple, I love it\nHaving ...` |
| 51lPx6ZCSalL2kvSrDUyJc | You're Gonna Live Forever in Me | John Mayer | The Search for Everything | Wedding Bells | 37i9dQZF1DX4adj7PFEBwf | `\n\n[Intro: Whistling]\n\n[Verse 1]\nA great b...` |
| 3vqlZUIT3rEmLaYKDBfb4Q | Isn't She Lovely | Stevie Wonder | Songs In The Key Of Life | Wedding Bells | 37i9dQZF1DX4adj7PFEBwf | `\n\n[Verse 1]\nIsn't she lovely\nIsn't she won...` |


Creating a helper function to clean song titles for more consistent results with the Genius API. If a track is featured in the Genius database, its lyrics will be scraped and added to our SQL table; lyrics of 100 characters or fewer will instead be flagged for user review.

## Collect Lyric Info <a name="section4"></a>

In [12]:
def get_genius_lyrics(df=track_df):
    
    for ix, row in df.iterrows():
        track_id = str(ix)
        song_title = row['track_name'].lower()
        artist_name = row['artist_name']
    
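        if row['lyrics'] is None:
            # Raw strings keep the regex escapes (\s, \d, \w) intact.
            # Strip " - ..." suffixes, parentheticals, "soundtrack", and quoted text
            song_title = re.sub(r'(^|)((?<=)\s-.+|\s\(.+\)|(\ssoundtrack)|\s(\".+\"))','', song_title) 
            # Strip remaster tags, four-digit years, and "... version"
            song_title = re.sub(r'(^|)((\sremastered)|(\sremaster)|(\s\d{4})|(\w+\sversion))','', song_title)
            # Strip "... edit" and Spotify session tags
            song_title = re.sub(r'(^|)((\w+\s+edit)|(\sspotify.+)|(\sat\sspotify.+))','', song_title)
            # Strip featured-artist credits
            song_title = re.sub(r'\s(feat\..+)','', song_title)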

            lyrics = request_song_info(song_title=song_title, artist_name=artist_name)

            try:
                if len(lyrics) > 100:
                    insert_lyrics(lyrics, track_id)
                    print(f'Uploaded lyrics for {track_id}, {song_title}')
                    time.sleep(.5)

                else:
                    print(f'*****Lyrics look short. Retry {track_id}, {song_title}*****')

            except TypeError:
                print(f'*****No lyrics for {track_id}, {song_title}*****')
        
    return 'Finished uploading tracks to SQLdb'
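
The title-cleaning step can be checked on its own before running the full collection. The sketch below applies the same four substitutions in the same order; `clean_title` and `PATTERNS` are hypothetical names, and the patterns are adapted from `get_genius_lyrics` with the empty groups dropped, which is assumed not to change what they match.

```python
import re

# Patterns adapted from get_genius_lyrics; applied in order to a lower-cased title.
PATTERNS = [
    r'\s-.+|\s\(.+\)|\ssoundtrack|\s".+"',            # " - ..." suffixes, parentheticals, quotes
    r'\sremastered|\sremaster|\s\d{4}|\w+\sversion',  # remaster tags, years, versions
    r'\w+\s+edit|\sspotify.+|\sat\sspotify.+',        # edits and Spotify session tags
    r'\sfeat\..+',                                    # featured-artist credits
]

def clean_title(title):
    """Lower-case a track title and strip the suffixes listed in PATTERNS."""
    cleaned = title.lower()
    for pattern in PATTERNS:
        cleaned = re.sub(pattern, '', cleaned)
    return cleaned

for raw in ['Riri Thick - Moods Remix', 'LOVE. FEAT. ZACARI.', 'Brown Skin Girl']:
    print(repr(raw), '->', repr(clean_title(raw)))
```

Running the cleaner over a handful of titles from the table is a quick way to see which suffixes still slip through before adjusting the patterns.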

Collecting lyrics

In [None]:
get_genius_lyrics(track_df)

Reviewing the results

In [16]:
query = '''SELECT * FROM track_table;'''
response = execute_query(query, dict_cur=True)
track_df = pd.DataFrame(response)
track_df.set_index('track_id', inplace=True)

Checking to see how many tracks returned null values.

In [10]:
track_df[['lyrics']].isna().sum()

lyrics    585
dtype: int64

Checking to see what proportion of our dataset is missing.

In [11]:
track_df[['lyrics']].isna().sum().values[0] / track_df.shape[0]

0.20447396015379238

Reviewing the tracks that did not return any lyrics. If there is a consistent pattern in the missing tracks, such as live recordings, remasterings, or special versions, it will be addressed programmatically in the API query function.

In [12]:
track_df[['lyrics', 'track_name', 'artist_name']][track_df['lyrics'].isna()]

| track_id | lyrics | track_name | artist_name |
|----------|--------|------------|-------------|
| 4s22ihyn2FtBLeKxzXQ3FX | | Bye Bye | ElMari. |
| 5a0a6WC2FzLAxNfK7XMYtu | | Freshmen | S. Fidelity |
| 3PIl4BCEbLuhRax4mgoJ4N | | Caught Her Eyes | C Y G N |
| 72WU1V0HVg4HWQAk5vdop8 | | Sweetie | Devin Morrison |
| 1HY91LimEeoRRmq3EYcUzK | | Spazzn | Decap |
| 6qMExjyIOHf4M5rf8Ft2AA | | Drumss | J.Robb |
| 0xT4knFEialXamJIax97Yx | | Riri Thick - Moods Remix | Ian Ewing |
| 2UQ1FMQDQuAQKTf8UKohyv | | Memory Well | David Blazer |
| 49VY7zaIgfNI96qdK1Xi2l | | Yours and Nobody Else's | Fallen Roses |
| 1kL4RYBpdW5F33NAqGbxMK | | Swell | Ian Ewing |


Helper code to search the Genius database for a single track as a one-off.

In [110]:
song_title = 'LOVE. FEAT. ZACARI.'
artist_name = 'kendrick lamar'

lyrics = request_song_info(song_title=song_title, artist_name=artist_name)

LOVE. FEAT. ZACARI., kendrick lamar not found


In [19]:
len(lyrics)

2608

In [20]:
lyrics

'\n\n[Chorus: Cass]\nI just wanna say fuck you \'til I fuck you again\nI think I\'ve had enough of you, but I don\'t wanna lose a friend\nI\'m gonna love you forever\nThat\'s just my curse\nIt\'s whatever\n\n[Verse 1: Karizma]\nI thought you\'d leave me\nI\'m the bottle, you\'re the genie\nAnd I don\'t think I give a damn\nI felt the beating\nThank god I ate my wheaties\nOr I don\'t know if I could stand\nWe hate to give in, but we\'re hardly living comfortably\nI hate you because you think that you\'re still in love with me\nIt\'s time we drift apart, we\'re blind and in the dark\nI kinda wanna care but it\'s kinda too late to start\nAnd here you are, hitting my line\nHitting rewind on the past for the 50th time\nCome on tell me, what do you think\nWanna lay down, up in the clouds or under the sheets\n\n[Chorus: Cass]\nI just wanna say fuck you \'til I fuck you again\nI think I\'ve had enough of you, but I don\'t wanna lose a friend\nI\'m gonna love you forever\nThat\'s just my curse\

# CONTINUE TO NOTEBOOK 02: EDA