# 1. Download lyrics

The first step in building our lyrics generator is to find existing songs from which the model can learn how to write lyrics. In other words, we need to **build a dataset of lyrics**. <br><br>
**Idea:** scrape song titles and the corresponding artists from the weekly Billboard Hot 100 charts, then use them to download the corresponding lyrics from genius.com. <br><br>
Luckily, a dataset of scraped charts already exists. Its most recent entries are from November 2021, so if we want more recent songs, we would have to scrape the charts from https://www.billboard.com/charts/hot-100/ ourselves. The source of the dataset is https://www.kaggle.com/dhruvildave/billboard-the-hot-100-songs/version/11.

In [1]:
# imports
# install library LyricsGenius to access the Genius API
!pip install git+https://github.com/johnwmillr/LyricsGenius.git
# install library spotipy to access song characteristics by spotify
!pip install spotipy
import lyricsgenius
import pandas as pd
import numpy as np
import re
from datetime import date
# import helper functions
import lg_functions as lg

Collecting git+https://github.com/johnwmillr/LyricsGenius.git
  Cloning https://github.com/johnwmillr/LyricsGenius.git to c:\users\stock\appdata\local\temp\pip-req-build-alx43kp2
  Resolved https://github.com/johnwmillr/LyricsGenius.git to commit bec02665b807941ca95e045be910e861789fc4a7
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'


  Running command git clone --filter=blob:none --quiet https://github.com/johnwmillr/LyricsGenius.git 'C:\Users\stock\AppData\Local\Temp\pip-req-build-alx43kp2'


Collecting spotipy
  Downloading spotipy-2.22.0-py3-none-any.whl (28 kB)
Collecting redis>=3.5.3
  Downloading redis-4.4.0-py3-none-any.whl (236 kB)
     -------------------------------------- 236.4/236.4 kB 4.8 MB/s eta 0:00:00
Collecting async-timeout>=4.0.2
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Installing collected packages: async-timeout, redis, spotipy
Successfully installed async-timeout-4.0.2 redis-4.4.0 spotipy-2.22.0


## Weekly Billboard Hot 100 Charts

In [2]:
# open dataset from kaggle
charts = pd.read_csv("./data/charts.csv")
charts.head()

Unnamed: 0,date,rank,song,artist,last-week,peak-rank,weeks-on-board
0,2021-11-06,1,Easy On Me,Adele,1.0,1,3
1,2021-11-06,2,Stay,The Kid LAROI & Justin Bieber,2.0,1,16
2,2021-11-06,3,Industry Baby,Lil Nas X & Jack Harlow,3.0,1,14
3,2021-11-06,4,Fancy Like,Walker Hayes,4.0,3,19
4,2021-11-06,5,Bad Habits,Ed Sheeran,5.0,2,18


In [3]:
len(charts)

330087

We need each song only once. We also restrict the data to more "recent" songs, i.e. songs that charted after 1980. Note that we initially used only songs from 2000 onwards, but realised that the model then produced very rap-heavy lyrics, so we widened the date range.

In [4]:
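# drop the chart-position columns we don't need
charts = charts.drop(["rank", "last-week", "peak-rank", "weeks-on-board"], axis = 1)
# keep one row per (song, artist) pair; drop_duplicates keeps the first occurrence
charts = charts.drop_duplicates(["song", "artist"])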
charts.tail()

Unnamed: 0,date,song,artist
330076,1958-08-04,Stay,The Ames Brothers
330082,1958-08-04,Over And Over,Thurston Harris
330084,1958-08-04,Little Serenade,The Ames Brothers
330085,1958-08-04,I'll Get By (As Long As I Have You),Billy Williams
330086,1958-08-04,Judy,Frankie Vaughan


In [5]:
# number of unique songs before filter
len(charts)

29681

In [7]:
# restrict to recent songs
# change column to datetime
charts["date"] = pd.to_datetime(charts["date"])

# restrict to dates after 1980-01-01
charts = charts.loc[charts["date"] > pd.to_datetime("1980-01-01")]

charts.tail()

Unnamed: 0,date,song,artist
113898,2000-01-15,The Greatest Romance Ever Sold,Prince
113917,2000-01-08,The Christmas Song (Chestnuts Roasting On An O...,Christina Aguilera
113960,2000-01-08,Deck The Halls,SHeDAISY
113970,2000-01-08,I Love You,Martina McBride
113990,2000-01-08,Left & Right,D'Angelo Featuring Method Man And Redman


In [9]:
# number of unique songs after filter
len(charts)

9194

In [10]:
# Now drop the date column, we don't need it anymore
charts = charts.drop(["date"], axis = 1).reset_index(drop = True)

In [11]:
# helper: strip surrounding whitespace from every element of a list
def strip_element(my_list):
    return [x.strip() for x in my_list]

# split multiple artists into list of artists and strip whitespace
charts["artist"] = charts["artist"].apply(lambda x: re.split(r"Featuring|&", x)).apply(strip_element)
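
The split above only recognises "Featuring" and "&", so credits joined by "And", "With" or "Duet With" stay together as one artist string. The following is a minimal sketch of how the split behaves on a few chart entries, together with a broader pattern one might try. `split_artists` and `BROAD_PATTERN` are hypothetical names for illustration and are not part of `lg_functions.py`.

```python
import re

# Pattern used in the notebook: split on "Featuring" or "&".
NOTEBOOK_PATTERN = r"Featuring|&"

# Hypothetical broader pattern (an assumption, not the notebook's behaviour):
# also split on "Duet With", "With" and "And" when surrounded by whitespace.
# The group is non-capturing so re.split does not return the separators.
BROAD_PATTERN = r"\s+(?:Featuring|Duet With|With|And|&)\s+"


def split_artists(credit, pattern=NOTEBOOK_PATTERN):
    """Split an artist credit into a list of stripped, non-empty names."""
    return [part.strip() for part in re.split(pattern, credit) if part.strip()]


credit = "D'Angelo Featuring Method Man And Redman"
print(split_artists(credit))                 # → ["D'Angelo", 'Method Man And Redman']
print(split_artists(credit, BROAD_PATTERN))  # → ["D'Angelo", 'Method Man', 'Redman']
```

A broader pattern is a trade-off: it separates more collaborators, but it would also split band names that legitimately contain "And" or "With", so it is worth checking a sample of the results before adopting it.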

## Find song link with Genius API

Use the custom function `search_url` in `lg_functions.py` to look up each song's URL on Genius from the artist and the song title. This URL is later used to download the lyrics of the song.
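
The implementation of `search_url` is not shown in this notebook. As a rough sketch of what such a lookup might do, the code below tries each listed artist in turn and keeps the first URL it finds. The names `find_song_url` and `add_urls` and the injected `search_fn` callable are assumptions made for illustration; the real helper may work differently.

```python
def find_song_url(title, artists, search_fn):
    """Return the first URL found for `title` by any of `artists`, else None.

    `search_fn(title, artist)` is assumed to return an object with a `url`
    attribute, or None when nothing is found.
    """
    for artist in artists:
        hit = search_fn(title, artist)
        if hit is not None and getattr(hit, "url", None):
            return hit.url
    return None


def add_urls(rows, search_fn):
    """Return copies of the row dicts with a "url" key added (None if not found)."""
    return [
        {**row, "url": find_song_url(row["song"], row["artist"], search_fn)}
        for row in rows
    ]


# Possible wiring to LyricsGenius (assumption: a valid Genius API token exists):
#   genius = lyricsgenius.Genius("YOUR_TOKEN")
#   search_fn = lambda title, artist: genius.search_song(title, artist)
```

Passing the search function in as an argument keeps the lookup logic testable without network access; rows for which no artist yields a hit end up with `url` set to `None`.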

In [None]:
# useful if we need to reload lg_functions.py after adapting functions there.
# from importlib import reload
# reload(lg)


In [None]:
# search for url of the songs on genius
songs = lg.search_url(charts)

In [None]:
# for some songs, no url can be found
songs[songs["url"].isnull()]

Unnamed: 0,song,artist,url
27,Let's Go Brandon,"[Bryson Gray, Tyson James, Chandler Crump]",
91,"Ya Superame (En Vivo Desde Culiacan, Sinaloa)",[Grupo Firme],
270,One Too Many,[Keith Urban Duet With P!nk],
348,pride.is.the.devil,"[J. Cole, Lil Baby]",
402,let.go.my.hand,"[J. Cole, Bas, 6LACK]",
...,...,...,...
8780,Change The Game,"[Jay-Z, Beanie Sigel And Memphis Bleek]",
8842,You All Dat,[Baha Men With Imani Coppola],
8908,Where I Wanna Be,"[Damizza Presents Shade Sheist, Nate Dogg, Kur...",
8942,Toca's Miracle,[Fragma],


**Note:** the songs with missing URLs could be handled separately, but there are only 90 of them, so it is probably not worth the effort.
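
If the missing songs were ever worth recovering, one possible treatment would be to retry the search with a simplified title, for example without parenthetical suffixes such as "(En Vivo Desde Culiacan, Sinaloa)". The sketch below only builds the candidate titles; `simplify_title` and `title_variants` are hypothetical helpers, and whether the simplified titles actually produce more hits on Genius is untested.

```python
import re


def simplify_title(title):
    """Drop parenthetical or bracketed parts and collapse repeated whitespace."""
    without_parens = re.sub(r"\s*[\(\[][^\)\]]*[\)\]]", "", title)
    return re.sub(r"\s+", " ", without_parens).strip()


def title_variants(title):
    """Return the original title followed by its simplified form, if different."""
    variants = [title]
    simplified = simplify_title(title)
    if simplified and simplified != title:
        variants.append(simplified)
    return variants


print(title_variants("Ya Superame (En Vivo Desde Culiacan, Sinaloa)"))
# → ['Ya Superame (En Vivo Desde Culiacan, Sinaloa)', 'Ya Superame']
```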

In [None]:
# drop songs without url
songs = songs.loc[~songs["url"].isnull()]

## Find lyrics using the song link

Use the song link to download the lyrics of the song, using `lyrics_from_url`.

In [None]:
songs = lg.lyrics_from_url(songs)

Couldn't find the lyrics section. Please report this if the song has lyrics.
Song URL: https://genius.com/8-bit-arcade-95-south-8-bit-j-cole-emulation-lyrics
Couldn't find the lyrics section. Please report this if the song has lyrics.
Song URL: https://genius.com/John-legend-happy-xmas-war-is-over-lyrics
Couldn't find the lyrics section. Please report this if the song has lyrics.
Song URL: https://genius.com/8-bit-arcade-ganja-burn-8-bit-nicki-minaj-emulation-lyrics
Couldn't find the lyrics section. Please report this if the song has lyrics.
Song URL: https://genius.com/8-bit-arcade-bigger-than-you-8-bit-2-chainz-drake-and-quavo-emulation-lyrics
Couldn't find the lyrics section. Please report this if the song has lyrics.
Song URL: https://genius.com/Lindsey-stirling-hallelujah-lyrics
Couldn't find the lyrics section. Please report this if the song has lyrics.
Song URL: https://genius.com/8-bit-arcade-lady-marmalade-8-bit-christina-aguilera-lil-kim-mya-and-pink-emulation-lyrics
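
As the messages above show, some pages have no lyrics section, so the download step has to tolerate failures. The body of `lyrics_from_url` is not shown here; the following is a minimal sketch of one way to collect lyrics while recording the URLs that failed. `download_lyrics` and the injected `fetch_fn` are assumptions for illustration, not the helper's actual code.

```python
def download_lyrics(urls, fetch_fn):
    """Fetch lyrics for each URL.

    `fetch_fn(url)` is assumed to return the lyrics as a string, or None
    when the page has no lyrics section.
    Returns (lyrics_by_url, failed_urls).
    """
    lyrics_by_url = {}
    failed_urls = []
    for url in urls:
        try:
            text = fetch_fn(url)
        except Exception:
            # Treat any error (e.g. a network problem) as a failed download.
            text = None
        if text:
            lyrics_by_url[url] = text
        else:
            failed_urls.append(url)
    return lyrics_by_url, failed_urls


# Possible wiring to LyricsGenius (assumption: `genius` is an authenticated
# lyricsgenius.Genius instance):
#   fetch_fn = lambda url: genius.lyrics(song_url=url)
```

Keeping the failed URLs in a separate list makes it easy to inspect them afterwards, for instance to confirm that they are mostly instrumental covers without lyrics.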


In [None]:
# save as csv
songs.to_csv("./data/songs_only_lyrics.csv", index = False)