### Creating the data set
The data set consists of the publicly available song lyrics at songlyrics.com.

We start with the Wu-Tang Clan lyrics at http://www.songlyrics.com/wu-tang-clan-lyrics/, but the code could be edited to work for any artist that has lyrics on songlyrics.com.

In [1]:
import bs4
import requests
import re

url = "http://www.songlyrics.com/wu-tang-clan-lyrics/"
song_table_class = "tracklist"
song_lyric_id = "songLyricsDiv"
attribute_to_link = "href"

# dividers for the output file
d1 = '#'*10
song_start_div = '\n{} BEGIN SONG {}\n'.format(d1, d1)
song_end_div = '\n{} END SONG {}\n'.format(d1, d1)


Use requests to get the HTML content and create a BeautifulSoup object.

In [2]:
req = requests.get(url)
# name the parser explicitly so bs4 does not guess one and emit a warning
bs_html = bs4.BeautifulSoup(req.text, "html.parser")

Find the table that contains the links to the song lyrics, then extract the song links and the song names.

In [3]:
song_table_html = bs_html.find(name='table', attrs={'class':song_table_class})
links_html = song_table_html.find_all(name='a')
links = [(x[attribute_to_link], x.text) for x in links_html]

Go through each link and retrieve the element containing the song lyrics. This might take some time, depending on your internet speed.

In [4]:
lyrics = []
for link, name in links:
    try:
        req = requests.get(link)
        bs_html = bs4.BeautifulSoup(req.text, "html.parser")
        song_lyric_html = bs_html.find(name='p', attrs={'id':song_lyric_id})
        lyrics.append((name, song_lyric_html.text))
    except Exception:
        # broken link, or a page without the lyrics element: skip this song
        pass
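
The loop above silently skips any song that fails. A minimal sketch of a variant that records which links failed and pauses between requests is shown here. The names `fetch_all`, `get`, and `delay` are hypothetical and not part of the original notebook; `get` stands in for whatever function downloads and parses one page.

```python
import time

def fetch_all(links, get, delay=1.0):
    """Call get(url) for every (url, name) pair and keep track of failures.

    `get` is any callable that returns the lyric text for a URL
    (an assumption for this sketch; it is injected so the loop can be tested offline).
    """
    fetched, failed = [], []
    for url, name in links:
        try:
            fetched.append((name, get(url)))
        except Exception as exc:
            # remember which link failed and why, instead of discarding it
            failed.append((url, type(exc).__name__))
        time.sleep(delay)  # pause between requests to go easy on the server
    return fetched, failed

# Offline demonstration with a stand-in for the real download function
def fake_get(url):
    if "broken" in url:
        raise ValueError("no lyrics element")
    return "lyrics for " + url

fetched, failed = fetch_all([("a", "Song A"), ("broken", "Song B")], fake_get, delay=0)
print(fetched)
print(failed)
```

Keeping the list of failures makes it possible to check afterwards whether songs were dropped because of dead links or because the page layout differed.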

Now that we have the lyrics in a raw format, we want to remove redundant information; all we want is the lyrics. First, check how many songs were retrieved.

In [5]:
len(lyrics)

401

After inspecting the syntax of the lyrics, we find that only some parts can be cleaned. For example, parentheses are used both for ad-libs and for repetitions, so we can't simply expand or remove them. Brackets mark which verse it is and who is singing or rapping. What we can remove is everything enclosed in brackets, because we want to generate Wu-Tang lyrics that are not specific to each member.

In [6]:
for i in range(len(lyrics)):
    title, lyric = lyrics[i]

    # non-greedy match so each [...] tag is removed on its own;
    # a greedy .* would also delete any text between two tags on the same line
    song_lyric = re.sub(r'\[.*?\]', '', lyric)
    song_title = re.sub(r'\[.*?\]', '', title)
    # titles only: also drop parenthesised additions
    song_title = re.sub(r'\(.*?\)', '', song_title)

    lyrics[i] = (song_title, song_lyric)
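
Whether the bracket pattern is greedy or non-greedy matters when a line carries more than one tag. The following sketch compares the two on a made-up line (the sample text is an assumption for illustration, not taken from the scraped data):

```python
import re

line = "[Intro: RZA] Bring the ruckus [x2]"

# Greedy: .* runs from the first '[' to the last ']' on the line,
# so the words between the two tags are removed as well.
greedy = re.sub(r'\[.*\]', '', line)

# Non-greedy: .*? stops at the nearest ']', so each tag is removed separately.
lazy = re.sub(r'\[.*?\]', '', line)

print(repr(greedy))
print(repr(lazy))
```

In both cases `.` does not match newlines by default, so a match never spans more than one line of a lyric.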

Now we put it all together and wrap each song in dividers. The dividers let us pass the entire file to the model for training.

In [7]:
# the with-statement closes the file even if a write fails
with open('wu_tang_songs.txt', 'w', encoding='utf8') as file_out:
    for name, lyric in lyrics:
        # write one song, wrapped in its start and end dividers
        file_out.write(song_start_div)
        file_out.write(lyric)
        file_out.write(song_end_div)
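
Because every song is wrapped in the same start and end dividers, the file can later be split back into individual songs. A minimal sketch, assuming the divider strings defined in the first cell; the helper name `split_songs` is hypothetical:

```python
# Same divider strings as defined at the top of the notebook
d1 = '#' * 10
song_start_div = '\n{} BEGIN SONG {}\n'.format(d1, d1)
song_end_div = '\n{} END SONG {}\n'.format(d1, d1)

def split_songs(text):
    """Return the list of song bodies found between start and end dividers."""
    songs = []
    # everything before the first start divider is not a song, so skip it
    for chunk in text.split(song_start_div)[1:]:
        body, _, _ = chunk.partition(song_end_div)
        songs.append(body)
    return songs

# Small in-memory sample instead of reading wu_tang_songs.txt
sample = (song_start_div + "first song" + song_end_div +
          song_start_div + "second song" + song_end_div)
songs = split_songs(sample)
print(songs)
```

To apply it to the real output, read `wu_tang_songs.txt` with `encoding='utf8'` and pass its contents to `split_songs`.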