## Data collection & storage...
*Collect updated music track list and lyrics from web...*
    
#### Data Sources ####
*All data was collected via web scraping or developer APIs...*
- Spotify API
- Genius API
- Genius website

##### Soon-to-be-Sources #####
- Facebook API
- Apple Music API

###### Known Data Source Columns ######
- Spotify Acoustic Features (11 total, ~2 need normalizing)
    - energy
    - liveness
    - tempo
    - speechiness
    - acousticness
    - instrumentalness
    - time_signature
    - danceability
    - duration_ms
    - loudness
    - valence
- Utils for data_mapping (2)
    - utils_spotify_id
    - utils_genius_data
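
The list above does not say which two features need normalizing. As a minimal sketch, assuming they are `tempo` and `loudness` (Spotify reports these in BPM and dB rather than on a 0 to 1 scale), min-max scaling could look like the following. The helper name `min_max_scale` and the toy values are illustrative only and are not part of the notebook.

```python
import pandas as pd


def min_max_scale(df, cols):
    """Return a copy of df with each column in cols rescaled to [0, 1]."""
    out = df.copy()
    for col in cols:
        lo, hi = out[col].min(), out[col].max()
        # a constant column has no range to scale; map it to 0.0
        out[col] = 0.0 if hi == lo else (out[col] - lo) / (hi - lo)
    return out


# toy values, not taken from the real dataset
features = pd.DataFrame(
    {"tempo": [90.0, 120.0, 150.0], "loudness": [-12.0, -6.0, -3.0]}
)
scaled = min_max_scale(features, ["tempo", "loudness"])
```

The function returns a copy, so the raw feature values stay available for inspection.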

### NOTE: Will need to re-download all song lyrics

### Imports...

#### Tools... 

In [1]:
# TODO: Figure out if this is the complete list of imports needed...

# libraries & tool imports...
import os
import re
import ast
import sys
import json
import pprint
import requests

import numpy as np
import pandas as pd
from os import listdir
from tqdm import tqdm
from os.path import isfile, join
from datetime import datetime
from spotify import spotifyApi
from text_miner import textMiner
from text_miner import geniusApi

In [2]:
# initializing helpers..
helper = spotifyApi()
txtMiner = textMiner()
genius = geniusApi()

In [6]:
# existing lyrics dataset imports...
data_path = "/Users/dayoorigunwa/code_base/music_mapping/data/"
allfiles = [f for f in listdir(data_path) if isfile(join(data_path, f))]

gScrapeLyrics = pd.read_csv(data_path + "SpotifySongLyrics.csv", encoding="utf-8")
gScrapeLyrics.columns = ["name", "lyrics"]
gApiData = pd.read_csv(data_path + "GeniusAPISearchResults", encoding="utf-8")
# rename cols example -> df2 = df.rename({'a': 'X', 'b': 'Y'}, axis=1)

#### Testing encoded import...

In [7]:
# # explicit utf-8 import...
# utf_scrape_df = pd.read_csv(data_path + "SpotifySongLyrics.csv", encoding='utf-8')
# utf_scrape_df.columns = ['name', 'lyrics']

# # GET unique tracklist
# utf_songs = utf_scrape_df.name.unique()

# # comparing with non-encoded import

# raw_songs = gScrapeLyrics.name.unique()
# diff_songs = [song for song in utf_songs if song not in raw_songs]
# print(len(diff_songs),diff_songs)

# # ...sadness

#### Investigating geniusApi breadcrumbs...

In [8]:
# extracting genius data...
names = list(gApiData.columns)
genius_data = pd.DataFrame(
    [{"name": str(s), "data": list(gApiData[s].values)[-1]} for s in names]
)
genius_data.dropna(inplace=True)
genius_data["name"] = genius_data["name"].astype(str)
genius_data["data"] = genius_data["data"].apply(lambda x: ast.literal_eval(x))
name_to_genius = dict(zip(genius_data.name, genius_data.data))

In [6]:
# sanity check
name_to_genius["Then the Quiet Explosion"]["url"]

'https://genius.com/Hammock-then-the-quiet-explosion-lyrics'

#### This cell parses a stringified dict into a Python object (via ast.literal_eval)

In [7]:
# # extracting
# test = list(gApiData["Then the Quiet Explosion"].values)
# test = [val for val in test if val not in [np.nan, 'song', '[]']]

# # jsonifying data...
# parsed_json = ast.literal_eval(test[0])
# parsed_json
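
The stored Genius results are Python-literal strings (single quotes, `None`), which is why `ast.literal_eval` is used rather than `json.loads`. A minimal sketch of a tolerant parser follows; the name `parse_genius_cell` is hypothetical, and the placeholder values it skips (`'song'`, `'[]'`, NaN) are taken from the commented-out cell above.

```python
import ast
import math


def parse_genius_cell(cell):
    """Parse a stringified dict; return None for NaN, placeholders or bad input."""
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return None
    if cell in ("song", "[]"):
        return None
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return None
    # only dict payloads are useful downstream
    return parsed if isinstance(parsed, dict) else None


raw = "{'url': 'https://genius.com/Hammock-then-the-quiet-explosion-lyrics'}"
record = parse_genius_cell(raw)
```

Returning `None` instead of raising lets the caller drop unusable cells with a single `dropna()`.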

### GET Spotify Data...

#### Audio...

In [3]:
# updating audio feature datasets...
helper = spotifyApi()
helper.download_spotify_playlist_audio_features()

<spotipy.client.Spotify object at 0x11f6d9750>
Name: b'tide pools ', Number of songs: 4, Playlist ID: 7HCAhke3De2jsnXb4ZFXaf 
Name: b'Unbeknownst Bangers', Number of songs: 27, Playlist ID: 3E684Qjlabk2LmcJjZ2zZQ 
Name: b'sampleLove', Number of songs: 116, Playlist ID: 0DYobfEIwMJlvVBymPoPCf 
Name: b'Costa Rica ', Number of songs: 0, Playlist ID: 46CQNWsKAfp3O2dD12aGV6 
Name: b'purpleParrots pt.3', Number of songs: 34, Playlist ID: 01RLihOalQPRbNAJdCiwOn 
Name: b'Study \xf0\x9f\x93\x9a', Number of songs: 53, Playlist ID: 5c68TrgrsR1COt97GYiHTm 
Name: b'yeye', Number of songs: 12, Playlist ID: 3YUa8BAtVbKmStMYcUnF5z 
Name: b"I'm Just Snacking", Number of songs: 23, Playlist ID: 7f7xecW1muGd5KamjC3EEM 
Name: b'la flame', Number of songs: 472, Playlist ID: 0i3CkxalKp9RRW4jf63PU5 
Name: b'Fashion Killa', Number of songs: 372, Playlist ID: 6k96myNgO28Lt9lZUPqdiU 
Name: b'vamos', Number of songs: 8, Playlist ID: 6pS0A7SrSkuxPSZnsaHd54 
Name: b'View From A Blue Moon', Number of songs: 13, Pla

100%|██████| 14/14 [00:20<00:00,  1.48s/it]


#### Tracklists...

In [4]:
# getting most up-to-date songlist from Spotify...
print("Getting my spotify playlist data...\n")
songs = []
sp = helper.sp
username = helper.username
playlists = helper.get_user_playlists(username, sp)
for playlist in tqdm(playlists):
    songs.extend(helper.get_playlist_content(username, playlist, sp, False))

print(
    f"\nExtracted {len(songs)} songs from {len(playlists)} playlists from your Spotify Account."
)

Getting my spotify playlist data...

Name: b'tide pools ', Number of songs: 4, Playlist ID: 7HCAhke3De2jsnXb4ZFXaf 
Name: b'Unbeknownst Bangers', Number of songs: 27, Playlist ID: 3E684Qjlabk2LmcJjZ2zZQ 
Name: b'sampleLove', Number of songs: 116, Playlist ID: 0DYobfEIwMJlvVBymPoPCf 
Name: b'Costa Rica ', Number of songs: 0, Playlist ID: 46CQNWsKAfp3O2dD12aGV6 
Name: b'purpleParrots pt.3', Number of songs: 34, Playlist ID: 01RLihOalQPRbNAJdCiwOn 
Name: b'Study \xf0\x9f\x93\x9a', Number of songs: 53, Playlist ID: 5c68TrgrsR1COt97GYiHTm 
Name: b'yeye', Number of songs: 12, Playlist ID: 3YUa8BAtVbKmStMYcUnF5z 
Name: b"I'm Just Snacking", Number of songs: 23, Playlist ID: 7f7xecW1muGd5KamjC3EEM 
Name: b'la flame', Number of songs: 472, Playlist ID: 0i3CkxalKp9RRW4jf63PU5 
Name: b'Fashion Killa', Number of songs: 372, Playlist ID: 6k96myNgO28Lt9lZUPqdiU 
Name: b'vamos', Number of songs: 8, Playlist ID: 6pS0A7SrSkuxPSZnsaHd54 
Name: b'View From A Blue Moon', Number of songs: 13, Playlist ID: 

100%|██████| 14/14 [00:08<00:00,  1.71it/s]


Extracted 1333 songs from 14 playlists from your Spotify Account.





In [9]:
# preparing song lists...
sSongs = [val["track"]["name"] for val in songs]
gSongs = [val for val in list(gApiData.columns)]
nSongs = [val for val in songs if val["track"]["name"] not in gSongs]

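# sanity checks...
# note: this is not a true subtraction; nSongs holds every track (repeats included)
# whose exact name is absent from the Genius columns
print(len(sSongs), "-", len(gSongs), "=", len(nSongs), "missing songs")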

1333 - 999 = 297 missing songs
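
The three counts above do not subtract cleanly because the comparison is by exact name string rather than by set difference. A small sketch of a stricter comparison, assuming case and surrounding whitespace should be ignored and each missing title reported once (the helper `missing_tracks` and the sample lists are illustrative):

```python
def missing_tracks(spotify_names, genius_names):
    """Return unique Spotify names with no case-insensitive match in genius_names."""
    known = {name.strip().lower() for name in genius_names}
    seen = set()
    missing = []
    for name in spotify_names:
        key = name.strip().lower()
        if key not in known and key not in seen:
            seen.add(key)
            missing.append(name)
    return missing


# toy inputs: one repeat, one casing/whitespace variant
spotify_names = ["Pepas", "pepas ", "Woodlawn", "Over", "Over"]
genius_names = ["Woodlawn", "Immunity"]
todo = missing_tracks(spotify_names, genius_names)
```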


### GET Genius Lyrics...

#### method 1: web-scraping

In [11]:
url_lyrics = dict()
for songname in tqdm(name_to_genius):
    url = name_to_genius[songname]["url"]
    songlyrics = txtMiner.scrape_genius_lyrics(
        artistname="none", songname=songname, url=url
    )
    url_lyrics[songname] = songlyrics

100%|████| 774/774 [16:32<00:00,  1.28s/it]
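
The scrape above makes 774 sequential requests over roughly 16 minutes, so one failed request can cost a long re-run. A minimal sketch of a retry wrapper follows. Here `fetch` stands in for `txtMiner.scrape_genius_lyrics`, whose internals are not shown in this notebook, and `flaky_fetch` and the example URL are made up for the demonstration.

```python
import time


def scrape_with_retry(fetch, url, retries=3, pause=1.0):
    """Call fetch(url), retrying on any exception; return "" if every attempt fails."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            # back off a little longer after each failure
            time.sleep(pause * (attempt + 1))
    return ""


# stand-in fetcher that fails once, then succeeds (hypothetical)
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("temporary failure")
    return f"lyrics from {url}"


result = scrape_with_retry(flaky_fetch, "https://genius.com/example", pause=0.0)
```

Returning an empty string on failure keeps such entries visible to a length-based sanity check on the scraped lyrics.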


In [10]:
# Sanity check
url_lyrics["Then the Quiet Explosion"]

NameError: name 'url_lyrics' is not defined

In [19]:
# saving intermediate results for now...
print("Saving scraped song lyric data... \n")
date_piece = datetime.today().strftime("%Y-%m-%d")
scraped_df = pd.DataFrame([url_lyrics])
scraped_df.to_csv(
    data_path + "scraped_SpotifySongLyrics_" + date_piece + ".csv", index=False
)

Saving scraped song lyric data... 



In [20]:
# Sanity check
count = 0
for key in url_lyrics:
    if len(url_lyrics[key]) < 15:
        count += 1
print(f"Genius web-scraping accuracy: {1 - count/len(url_lyrics.keys())}")

Genius web-scraping accuracy: 0.9573643410852714
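
The cutoff for "too short to be real lyrics" is 15 characters in this check and 25 in the later filtering cell. A sketch that keeps the two in sync through one constant; `MIN_LYRIC_CHARS` and `lyric_coverage` are hypothetical names, and the value 25 is simply the stricter of the two cutoffs used in the notebook.

```python
MIN_LYRIC_CHARS = 25  # assumed shared cutoff; the notebook uses 15 here and 25 later


def lyric_coverage(lyrics_by_name, min_chars=MIN_LYRIC_CHARS):
    """Fraction of entries whose lyrics are longer than min_chars."""
    if not lyrics_by_name:
        return 0.0
    kept = sum(1 for text in lyrics_by_name.values() if len(text) > min_chars)
    return kept / len(lyrics_by_name)


# toy data: two usable entries, one empty, one placeholder tag
sample = {"a": "x" * 40, "b": "", "c": "[Instrumental]", "d": "y" * 30}
coverage = lyric_coverage(sample)
```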


In [21]:
# names = [s['track']['name'] for s in nSongs if s['track']['name'] not in list(url_lyrics.keys())]
# [n for n in names if n not in a]

#### method 2: querying Genius API

In [22]:
# checking the geniusApi for song matches...
print("Querying GENIUS api for track metadata... \n")
song_set = dict()
for song in tqdm(nSongs):
    s = song["track"]
    artistname = s["artists"][0]["name"]
    songname = s["name"]
    songlyrics = txtMiner.search_genius_song(
        track_name=songname, track_artist=artistname
    )
    song_set[songname] = songlyrics

Querying GENIUS api for track metadata... 



Searching for a match for: family ties (with Kendrick Lamar), Baby Keem
Searching for a match for: trademark usa, Baby Keem
Found a match!
Searching for a match for: VALENTINO, 24kGoldn
Found a match!
Searching for a match for: Pepas, Farruko
Found a match!
Searching for a match for: Stick (with JID & J. Cole feat. Kenny Mason & Sheck Wes), Dreamville
Searching for a match for: BOP, DaBaby
Found a match!
Searching for a match for: DOLLAZ ON MY HEAD (feat. Young Thug), Gunna
Searching for a match for: RNP (feat. Anderson .Paak), Cordae
Found a match!
Searching for a match for: Check, Young Thug
Found a match!
Searching for a match for: Diet Coke, Pusha T
Found a match!
Searching for a match for: Arya (feat. A$AP Rocky), A$AP Rocky
Found a match!
Searching for a match for: INDUSTRY BABY (feat. Jack Harlow), Lil Nas X
Searching for a match for: The Island, Pt. I (Dawn), Pendulum
Found a match!
Searching for a match for: Woodlawn, Aminé
Found a match!
Searching for a match for: Baianá, Bakermat
Found a match!
Searching for a match for: Piece Of Your Heart, MEDUZA
Found a match!
Searching for a match for: What A Life, Scarlet Pleasure
Found a match!
Searching for a match for: The Bigger Picture, Lil Baby
Found a match!
Searching for a match for: Save Your Tears, The Weeknd
Found a match!
Searching for a match for: Thief, Ookay
Found a match!
Searching for a match for: Spicy (feat. Fivio Foreign & A$AP Ferg), Nas
Found a match!
Searching for a match for: WHATS POPPIN (feat. DaBaby, Tory Lanez & Lil Wayne) - Remix, Jack Harlow
Searching for a match for: New God Flow, Pusha T
Found a match!
Searching for a match for: The Box, Roddy Ricch
Found a match!
Searching for a match for: Marvin & Chardonnay, Big Sean
Found a match!
Searching for a match for: Right Above It, Lil Wayne
Found a match!
Searching for a match for: Over, Drake
Found a match!
Searching for a match for: Valley of Peace, TEYMORI
Searching for a match for: Segrados Do Samba, Souleance
Searching for a match for: Ani Kuni, Polo & Pan
Found a match!
Searching for a match for: Rêverie, L. 68, Claude Debussy
Found a match!
Searching for a match for: Old English (feat. Travi$ Scott), King Chip
Searching for a match for: Knife, Travis Scott
Found a match!
Searching for a match for: Quintana Pt 2 / Upper Echelon (Mike Dean Live Mix), Travis Scott
Searching for a match for: Pablo, Kanye West
Found a match!
Searching for a match for: NO BYSTANDERS (remix), Travis Scott
Searching for a match for: Future Sounds, Kanye West & Travis Scott
Searching for a match for: Hold That Heat (feat. Travis Scott), Southside
Found a match!
Searching for a match for: LEVITATE, Don Toliver & Travis Scott
Searching for a match for: CHAIN SEX FT. 21 SAVAGE, Travis Scott
Found a match!
Searching for a match for: OUTLANDISH, Travis Scott
Found a match!
Searching for a match for: Same Way ft. Quavo, Travis Scott
Found a match!
Searching for a match for: Gasoline ft. Drake, Travis Scott
Searching for a match for: Last Time, Travis Scott
Found a match!
Searching for a match for: Can't See That ft. Quavo, Travis Scott
Found a match!
Searching for a match for: Patience ft. Justin Bieber, Travis Scott
Searching for a match for: Never Settle ft. Drake, Travis Scott
Found a match!
Searching for a match for: Every Night ft. Kid Cudi, Travis Scott
Found a match!
Searching for a match for: Help Myself (feat. Don Toliver & Quavo), Travis Scott
Searching for a match for: MAFIA/LOST FOREVER, Travis Scott
Searching for a match for: MAFIA (Live), Travis Scott
Searching for a match for: About Damn Time, Lizzo
Found a match!
Searching for a match for: First Class, Jack Harlow
Found a match!
Searching for a match for: up at night (feat. justin bieber), Kehlani
Found a match!
Searching for a match for: MEAN!, Madeline The Person
Found a match!
Searching for a match for: Where you are (feat. WILLOW), PinkPantheress
Found a match!
Searching for a match for: Birthday Cake, Dylan Conrique
Found a match!
Searching for a match for: As It Was, Harry Styles
Found a match!
Searching for a match for: Maybe You’re The Problem, Ava Max
Found a match!
Searching for a match for: Tamagotchi, Omar Apollo
Found a match!
Searching for a match for: Dua Lipa, Jack Harlow
Found a match!
Searching for a match for: Lola, Maude Latour
Found a match!
Searching for a match for: Slow Song (with Dragonette), The Knocks
Found a match!
Searching for a match for: Sweetest Pie, Megan Thee Stallion
Found a match!
Searching for a match for: Sigue, J Balvin
Found a match!
Searching for a match for: Something in the Orange, Zach Bryan
Found a match!
Searching for a match for: Attention, Omah Lay
Found a match!
Searching for a match for: Beg For You (feat. Rina Sawayama), Charli XCX
Found a match!
Searching for a match for: In The Stars, Benson Boone
Found a match!
Searching for a match for: everything, Kehlani
Found a match!
Searching for a match for: Envolver, Anitta
Found a match!
Searching for a match for: Closer (feat. H.E.R.), Saweetie
Searching for a match for: Finesse, Pheelz
Found a match!
Searching for a match for: Light Switch, Charlie Puth
Found a match!
Searching for a match for: Headlights (feat. KIDDO), Alok
Found a match!
Searching for a match for: Infinity, Jaymes Young
Found a match!
Searching for a match for: Boyfriend, Dove Cameron
Found a match!
Searching for a match for: Killing Me, Omar Apollo
Found a match!
Searching for a match for: That's Hilarious, Charlie Puth
Found a match!
Searching for a match for: May Snow, Sidders
Searching for a match for: Let You Go, Diplo
Found a match!
Searching for a match for: What Would You Do?, Joel Corry
Found a match!
Searching for a match for: Everything But You (feat. A7S), Clean Bandit
Found a match!
Searching for a match for: COMPLETE MESS, 5 Seconds of Summer
Found a match!
Searching for a match for: Head on Fire, Griff

100%|██████████████████████████████| 85/85 [00:40<00:00,  2.12it/s]

Found a match!

In [23]:
# extracting query results...
# keep only the songs with a search hit, mapped to the hit's "result" payload
song_set = {name: hit["result"] for name, hit in song_set.items() if hit}
for songname in tqdm(song_set):
    url = song_set[songname]["url"]
    songlyrics = txtMiner.scrape_genius_lyrics(
        artistname="none", songname=songname, url=url
    )
    url_lyrics[songname] = songlyrics

100%|██████████████████████████████| 66/66 [01:17<00:00,  1.18s/it]


In [24]:
missed = [val for val in nSongs if val["track"]["name"] not in song_set.keys()]
print(f"Genius url accuracy: {(len(nSongs) - len(missed))/len(nSongs)}")

Genius url accuracy: 0.7764705882352941
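
Several of the misses in the search log carry featured-artist credits in the title, for example `family ties (with Kendrick Lamar)` and `Gasoline ft. Drake`. Stripping those credits before querying might raise the match rate; that is an untested assumption, and the regex and the helper name `clean_title` below are only a sketch.

```python
import re

# matches "(feat. X)", "(ft. X)", "(with X)" and a trailing "ft. X" / "feat. X"
_FEATURE_RE = re.compile(
    r"\s*[\(\[](?:feat\.?|ft\.?|with)\s[^\)\]]*[\)\]]|\s+(?:feat\.?|ft\.?)\s.*$",
    flags=re.IGNORECASE,
)


def clean_title(title):
    """Strip featured-artist credits from a track title."""
    return _FEATURE_RE.sub("", title).strip()


# titles copied from the search log
titles = [
    "family ties (with Kendrick Lamar)",
    "Gasoline ft. Drake",
    "WHATS POPPIN (feat. DaBaby, Tory Lanez & Lil Wayne) - Remix",
    "Over",
]
cleaned = [clean_title(t) for t in titles]
```

A title that legitimately contains the standalone word "feat" would be truncated by the second pattern, so the cleaned name is best used only for the search query and not as the stored track name.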


#### Importing audio datasets...

In [25]:
# importing audio file... (see SpotifySongAudio Wrangling.ipynb for code)
data_path = "/Users/dayoorigunwa/code_base/music_mapping/data/"
allfiles = [f for f in listdir(data_path) if isfile(join(data_path, f))]
audio_files = [f for f in allfiles if ("12178525311" in f) and (".csv" in f)]

# stack every audio-feature file into one frame
audio_df = pd.concat(
    [pd.read_csv(join(data_path, filename)) for filename in audio_files], axis=0
)
# the track id is the last segment of the uri (e.g. "spotify:track:<id>")
audio_df["id"] = audio_df["uri"].apply(lambda x: x.split(":")[-1])

In [26]:
# Sanity checks...
count = 0
ids = [s["track"]["id"] for s in songs]
for val in list(audio_df["id"].values):
    if val in ids:
        count += 1
count

889
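
The overlap count can also be computed without an explicit loop, and the same data shows which playlist tracks still lack audio features. A small sketch on toy ids (`toy_audio`, `playlist_ids` and the id values are made up):

```python
import pandas as pd

toy_audio = pd.DataFrame(
    {"uri": ["spotify:track:aaa", "spotify:track:bbb", "spotify:track:ccc"]}
)
# the track id is the last colon-separated segment of the uri
toy_audio["id"] = toy_audio["uri"].str.split(":").str[-1]

playlist_ids = ["aaa", "ccc", "zzz"]

# rows of the audio table whose id appears in the playlists
matched = int(toy_audio["id"].isin(playlist_ids).sum())
# playlist tracks with no audio-feature row yet
unmatched_ids = sorted(set(playlist_ids) - set(toy_audio["id"]))
```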

In [27]:
# extracting genius data...
names = list(gApiData.columns)
genius_data = pd.DataFrame(
    [{"name": str(s), "data": list(gApiData[s].values)[-1]} for s in names]
)
genius_data.dropna(inplace=True)
genius_data["name"] = genius_data["name"].astype(str)
genius_data["data"] = genius_data["data"].apply(lambda x: ast.literal_eval(x))
name_to_genius = dict(zip(genius_data.name, genius_data.data))

In [28]:
# extracting genius data pt2...
url_lyrics = {key: url_lyrics[key] for key in url_lyrics if len(url_lyrics[key]) > 25}
names = list(url_lyrics.keys())
url_lyric_df = pd.DataFrame([{"name": str(s), "lyrics": url_lyrics[s]} for s in names])
url_lyric_df.head()

| | name | lyrics |
|---|---|---|
| 0 | Then the Quiet Explosion | I can’t feel you There’s no trace Lights will ... |
| 1 | Immunity | You've answered my prayer For a worthless diam... |
| 2 | Mumma Don't Tell | My mama don't tell I'm the same My mama don't ... |
| 3 | Quick Musical Doodles | [Verse 1] You remember You remember my love Yo... |
| 4 | Killing Me To Love You | [Verse 1] Your body is broken but you’re tryin... |


In [29]:
# collecting lyric data...
lyric_df_list = [gScrapeLyrics, url_lyric_df]
lyrics_df = pd.concat(lyric_df_list, axis=0)
# later frames win when a song name appears in more than one source
name_to_lyrics = dict(zip(lyrics_df["name"], lyrics_df["lyrics"]))
lyrics_df.head()

| | name | lyrics |
|---|---|---|
| 0 | Then the Quiet Explosion | I can’t feel youThere’s no traceLights will bu... |
| 1 | Two Thousand and Seventeen | [Non-Lyrical Vocals] |
| 2 | Immunity | You've answered my prayerFor a worthless diamo... |
| 3 | Quick Musical Doodles | [Verse 1]You rememberYou remember my loveYou s... |
| 4 | Pretending | [Intrumental] |
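
Both lyric sources contain some of the same songs ("Then the Quiet Explosion" and "Immunity" appear in each), and building a dict from the concatenated frame silently keeps whichever row comes last. A sketch that makes that choice explicit with `drop_duplicates`, on toy frames (`old`, `new` and `toy_lookup` are illustrative names):

```python
import pandas as pd

old = pd.DataFrame(
    {"name": ["Immunity", "Pretending"], "lyrics": ["old text", "[Instrumental]"]}
)
new = pd.DataFrame({"name": ["Immunity"], "lyrics": ["new text"]})

combined = pd.concat([old, new], ignore_index=True)
# keep the most recently appended row for each song name
deduped = combined.drop_duplicates(subset="name", keep="last").reset_index(drop=True)
toy_lookup = dict(zip(deduped["name"], deduped["lyrics"]))
```

Switching to `keep="first"` would instead prefer the older source.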


#### Unifying Datasets... ####

In [30]:
# vars...
audio_cols = [
    "energy",
    "liveness",
    "tempo",
    "speechiness",
    "acousticness",
    "instrumentalness",
    "time_signature",
    "danceability",
    "valence",
    "duration_ms",
    "loudness",
]
columns = (
    ["name", "artist", "lyrics", "date_added"]
    + audio_cols
    + ["utils_spotify_id", "utils_genius_data"]
)

rows = []
for s in songs:

    # get base fields...
    track = s["track"]
    name = track["name"]
    utils_spotify_id = track["id"]
    row = {
        "name": name,
        "artist": track["artists"][0]["name"],
        "lyrics": name_to_lyrics.get(name, np.nan),
        "date_added": s["added_at"],
        "utils_spotify_id": utils_spotify_id,
        "utils_genius_data": name_to_genius.get(name, np.nan),
    }

    # set spotify fields... (last matching audio row wins, NaN when there is none)
    audio_row = audio_df[audio_df["id"] == utils_spotify_id]
    for col in audio_cols:
        row[col] = audio_row[col].values[-1] if not audio_row.empty else np.nan

    rows.append(row)

# build the frame with the columns in a fixed order...
final_df = pd.DataFrame(rows, columns=columns)
final_df.head()

| | name | artist | lyrics | date_added | energy | liveness | tempo | speechiness | acousticness | instrumentalness | time_signature | danceability | valence | duration_ms | loudness | utils_spotify_id | utils_genius_data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | family ties (with Kendrick Lamar) | Baby Keem | NaN | 2022-04-21T00:18:15Z | 0.611 | 0.23 | 134.093 | 0.33 | 0.00588 | 0.0 | 4.0 | 0.711 | 0.144 | 252262.0 | -5.453 | 3QFInJAm9eyaho5vBzxInN | NaN |
| 1 | trademark usa | Baby Keem | [Part I] [Intro] I can't help but feel neglect... | 2022-04-21T00:18:29Z | 0.6 | 0.274 | 130.732 | 0.281 | 0.108 | 2e-06 | 4.0 | 0.613 | 0.067 | 270671.0 | -5.621 | 6G9aDedv5hYaTgNYDuduqk | NaN |
| 2 | VALENTINO | 24kGoldn | [Chorus] I don't want a valentine, I just wa... | 2022-04-21T00:19:39Z | 0.717 | 0.132 | 150.964 | 0.179 | 0.199 | 0.0 | 4.0 | 0.746 | 0.523 | 179133.0 | -4.841 | 6piAUJJQFD8oHDUr0b7l7q | NaN |
| 3 | Pepas | Farruko | [Letra de "Pepas"] [Refrán] No me importa lo q... | 2022-04-21T00:41:47Z | 0.766 | 0.128 | 130.001 | 0.0343 | 0.00776 | 7e-05 | 4.0 | 0.762 | 0.442 | 287120.0 | -3.955 | 5fwSHlTEWpluwOM0Sxnh5k | NaN |
| 4 | Stick (with JID & J. Cole feat. Kenny Mason & ... | Dreamville | NaN | 2022-04-21T00:42:09Z | 0.857 | 0.668 | 118.574 | 0.292 | 0.266 | 0.0 | 4.0 | 0.671 | 0.597 | 309323.0 | -5.435 | 1BzXvBpIFWJgu0P8P6xmP4 | NaN |
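
The unification step amounts to a left join of the playlist tracks onto the audio-feature table by Spotify id. As a sketch of that alternative formulation (not the notebook's own code), using toy frames whose names and values are made up:

```python
import pandas as pd

tracks = pd.DataFrame(
    {"name": ["Song A", "Song B"], "utils_spotify_id": ["id1", "id2"]}
)
audio = pd.DataFrame(
    {
        "id": ["id1", "id1", "id3"],
        "energy": [0.5, 0.6, 0.9],
        "tempo": [100.0, 101.0, 120.0],
    }
)

# keep one audio row per id (the last one, as the loop above does)
audio_unique = audio.drop_duplicates(subset="id", keep="last")

# left join: every track is kept, tracks without audio data get NaN features
merged = tracks.merge(
    audio_unique, how="left", left_on="utils_spotify_id", right_on="id"
).drop(columns="id")
```

The left join keeps tracks with no audio match and fills their feature columns with NaN, which mirrors the `else` branch of the loop.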


In [31]:
final_df.lyrics.isna().sum()

282

In [32]:
date_piece = datetime.today().strftime("%Y-%m-%d")
# index=False keeps the row index out of the file, matching the earlier lyrics export
final_df.to_csv(data_path + "scraped_dataset_" + date_piece + ".csv", index=False)

In [None]:
final_df.info()