In this first notebook I use web scraping to collect the lyrics of Evanescence and Within Temptation.

First, I looked for lyrics sites that carry songs from both bands, so that the code written to extract lyrics for one band can be reused for the other.

Some of the websites I've explored were http://www.metrolyrics.com/evanescence-lyrics.html, https://www.azlyrics.com/, https://www.songteksten.nl/, and https://www.songteksten.net/.

To extract hyperlinks I've used the packages: 

* [requests](https://pypi.org/project/requests/) 
* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). 

You will notice that some simple [Python string methods](https://www.w3schools.com/python/python_strings.asp) are also used to retrieve the lyrics. So let's start web scraping.

# **Webscraping**

The first step is to collect the address of each lyric page of the band. The chosen websites usually present a list of song titles in which each title is a hyperlink to the corresponding lyric. Once we have all these hyperlinks, we can use them to extract each lyric.

The following function retrieves all hyperlinks within the main page. However, each website has its own particularities, which must be taken into account when picking out the hyperlinks that lead to lyrics.


In [2]:
# importing packages

import requests
from bs4 import BeautifulSoup


def retrieve_hyperlinks(main_url):
    """ Extract all hyperlinks in 'main_url' and return a list with these hyperlinks """
    
    # Send a GET request and store the response: r

    r = requests.get(main_url)

    # Extract the response body as HTML text: html_doc
    html_doc = r.text

    # Create a BeautifulSoup object from the HTML: soup
    # (naming the parser explicitly avoids bs4's "no parser specified" warning)
    soup = BeautifulSoup(html_doc, 'html.parser')
    
    # Find all 'a' tags (which define hyperlinks): a_tags

    a_tags = soup.find_all('a')
    
    # Create a list with hyperlinks found

    list_links = [link.get('href') for link in a_tags]
    
    # Remove None values, if there are any
    
    list_links = list(filter(None, list_links)) 
    
    return list_links
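
To make clear what `retrieve_hyperlinks` does with the `a` tags, here is a minimal offline sketch of the same idea. It swaps in Python's standard-library `html.parser` for `BeautifulSoup` and works on a hard-coded HTML string instead of a downloaded page. `LinkCollector`, `links_in_html`, and `sample_html` are names made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the 'href' value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:  # skip <a> tags without an href (the None values)
                self.links.append(href)


def links_in_html(html_doc):
    """Return all hyperlinks found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html_doc)
    return collector.links


# Hypothetical page fragment, not taken from any of the sites above
sample_html = (
    '<a href="/song-one.html">One</a>'
    '<a name="top">anchor</a>'
    '<a href="javascript:void(0)">menu</a>'
)

print(links_in_html(sample_html))  # ['/song-one.html', 'javascript:void(0)']
```

Note that entries such as `javascript:void(0)` survive this step, which is why the lists need further filtering.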

## Retrieving hyperlinks of lyrics

Once we have all hyperlinks, we need to select those that point to lyrics. For now, we are investigating different lyrics websites, and each one has a slightly different structure that must be taken into account when filtering the hyperlinks.

By working with different websites, we can choose the one that we consider the most appropriate for extracting the lyrics.

Notice that initially we will only consider Evanescence lyrics. However, we have already checked that all websites considered also contain lyrics from Within Temptation. Therefore, once we have our solution, it will be applicable to both bands.

### Retrieving from [metrolyrics](https://www.metrolyrics.com/)

To retrieve the Evanescence lyrics from `metrolyrics`, we first apply the function above with `main_url` = 'http://www.metrolyrics.com/evanescence-lyrics.html'.

Then, as expected, we notice that not all hyperlinks are related to lyrics, so we need to filter those out and keep only the ones containing lyrics.


In [3]:
url = 'http://www.metrolyrics.com/evanescence-lyrics.html'
list_links_lyrics_metrolyrics = retrieve_hyperlinks(url)

# remove probable repetitions

list_links_lyrics_metrolyrics = list(set(list_links_lyrics_metrolyrics))

print('\n Number of links before filtering:', len(list_links_lyrics_metrolyrics))
list_links_lyrics_metrolyrics[:20]


 Number of links before filtering: 112


['https://www.metrolyrics.com/war-lyrics-drake.html',
 'https://www.metrolyrics.com/news.html',
 'https://www.metrolyrics.com/call-me-when-youre-sober-lyrics-evanescence.html',
 'https://www.metrolyrics.com/away-from-me-lyrics-evanescence.html',
 'https://www.metrolyrics.com/imaginary-lyrics-evanescence.html',
 'https://www.metrolyrics.com/october-lyrics-evanescence.html',
 'https://www.metrolyrics.com/liquid-blue-lyrics-evanescence.html',
 'javascript:void(0)',
 'https://www.metrolyrics.com/sweet-sacrifice-lyrics-evanescence.html',
 'https://www.metrolyrics.com/sallys-song-lyrics-evanescence.html',
 'https://www.metrolyrics.com/solitude-lyrics-evanescence.html',
 'https://www.metrolyrics.com/the-chain-lyrics-evanescence.html',
 'https://www.metrolyrics.com/imperfection-lyrics-evanescence.html',
 'https://www.metrolyrics.com/cloud-nine-lyrics-evanescence.html',
 'https://www.metrolyrics.com/bring-me-to-life-lyrics-dancing-with-the-stars.html',
 'https://www.metrolyrics.com/',
 'https:/

A quick look at the list above reveals that links to lyrics are of the form 'https://www.metrolyrics.com/TITLE-lyrics-evanescence.html'. So, we can select the elements of the list that contain '-lyrics-evanescence.html'.

The list comprehension below does the job and returns 80 hyperlinks instead of the initial 112.

In [4]:
list_links_lyrics_metrolyrics = [link for link in list_links_lyrics_metrolyrics if '-lyrics-evanescence.html' in link]
print('\n Number of links after filtering:', len(list_links_lyrics_metrolyrics))
list_links_lyrics_metrolyrics



 Number of links after filtering: 80


['https://www.metrolyrics.com/call-me-when-youre-sober-lyrics-evanescence.html',
 'https://www.metrolyrics.com/away-from-me-lyrics-evanescence.html',
 'https://www.metrolyrics.com/imaginary-lyrics-evanescence.html',
 'https://www.metrolyrics.com/october-lyrics-evanescence.html',
 'https://www.metrolyrics.com/liquid-blue-lyrics-evanescence.html',
 'https://www.metrolyrics.com/sweet-sacrifice-lyrics-evanescence.html',
 'https://www.metrolyrics.com/sallys-song-lyrics-evanescence.html',
 'https://www.metrolyrics.com/solitude-lyrics-evanescence.html',
 'https://www.metrolyrics.com/the-chain-lyrics-evanescence.html',
 'https://www.metrolyrics.com/imperfection-lyrics-evanescence.html',
 'https://www.metrolyrics.com/cloud-nine-lyrics-evanescence.html',
 'https://www.metrolyrics.com/broken-lyrics-evanescence.html',
 'https://www.metrolyrics.com/whisper-lyrics-evanescence.html',
 'https://www.metrolyrics.com/field-of-innocence-lyrics-evanescence.html',
 'https://www.metrolyrics.com/weight-of-the

After filtering, we see that the last 3 links end with '/correction', so we update the list comprehension above to also eliminate the links containing '/correction'.

In [7]:
# updated version

list_links_lyrics_metrolyrics = [link for link in list_links_lyrics_metrolyrics if 
                                 ('-lyrics-evanescence.html' in link and '/correction' not in link)]

print('\n Number of links after updated filtering and applying set:', len(list_links_lyrics_metrolyrics))

list_links_lyrics_metrolyrics.sort()
list_links_lyrics_metrolyrics


 Number of links after updated filtering and applying set: 77


['https://www.metrolyrics.com/angel-of-mine-lyrics-evanescence.html',
 'https://www.metrolyrics.com/anywhere-lyrics-evanescence.html',
 'https://www.metrolyrics.com/away-from-me-lyrics-evanescence.html',
 'https://www.metrolyrics.com/before-the-dawn-lyrics-evanescence.html',
 'https://www.metrolyrics.com/bleed-lyrics-evanescence.html',
 'https://www.metrolyrics.com/breathe-no-more-lyrics-evanescence.html',
 'https://www.metrolyrics.com/bring-me-to-life-lyrics-evanescence.html',
 'https://www.metrolyrics.com/broken-lyrics-evanescence.html',
 'https://www.metrolyrics.com/call-me-when-your-sober-lyrics-evanescence.html',
 'https://www.metrolyrics.com/call-me-when-youre-a-sober-lyrics-evanescence.html',
 'https://www.metrolyrics.com/call-me-when-youre-sober-lyrics-evanescence.html',
 'https://www.metrolyrics.com/cartoon-network-song-lyrics-evanescence.html',
 'https://www.metrolyrics.com/cloud-nine-lyrics-evanescence.html',
 'https://www.metrolyrics.com/end-of-the-dream-lyrics-evanescence.

So finally, for metrolyrics we have obtained 77 hyperlinks of Evanescence's lyrics. 
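
The filtering used for metrolyrics (keep links containing one substring, drop links containing another, remove duplicates, sort) can be wrapped in a small helper. This is only a sketch: `filter_links` and the `example_links` list are made up for illustration, and the exact shape of the '/correction' link is an assumption.

```python
def filter_links(links, must_contain, must_not_contain=()):
    """Keep unique links that contain `must_contain` and none of `must_not_contain`."""
    return sorted(
        link for link in set(links)  # set() drops duplicates
        if must_contain in link
        and not any(unwanted in link for unwanted in must_not_contain)
    )


# Hypothetical input that mimics the kinds of links seen above
example_links = [
    'https://www.metrolyrics.com/news.html',
    'https://www.metrolyrics.com/broken-lyrics-evanescence.html',
    'https://www.metrolyrics.com/broken-lyrics-evanescence.html',
    'https://www.metrolyrics.com/broken-lyrics-evanescence.html/correction',
    'javascript:void(0)',
]

kept = filter_links(example_links, '-lyrics-evanescence.html',
                    must_not_contain=('/correction',))
print(kept)  # ['https://www.metrolyrics.com/broken-lyrics-evanescence.html']
```

The same helper could serve the other sites by changing only the substrings passed in.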

Let's check the other websites.

### Retrieving from [songteksten.net](https://songteksten.net/)

This website spreads the links to lyrics over 3 pages, so we need to apply the function to all three pages.

After removing duplicates we have 223 links before filtering and 86 after filtering. Observe that although there is apparently no repetition, there are different versions of the same song (e.g. lies and lies-remix).

In [8]:
# retrieving all hyperlinks
urls = ['https://songteksten.net/artist/lyrics/1938/evanescence.html',
       'https://songteksten.net/artist/lyrics/1938/evanescence/page/2.html',
       'https://songteksten.net/artist/lyrics/1938/evanescence/page/3.html']

list_links_lyrics_songteksten_net = []

for url in urls:
    list_links_lyrics_songteksten_net.extend(retrieve_hyperlinks(url))
    
# remove probable repetitions

list_links_lyrics_songteksten_net = list(set(list_links_lyrics_songteksten_net))

    
print('Number of links before filtering:', len(list_links_lyrics_songteksten_net))

Number of links before filtering: 223


In [9]:
list_links_lyrics_songteksten_net

['https://songteksten.net/lyric/7419/102714/niels-destadsbader/het-licht-gaat-uit.html',
 'https://songteksten.net/lyric/1938/77224/evanescence/angel-of-mine.html',
 'https://songteksten.net/lyric/1938/93923/evanescence/new-way-to-bleed.html',
 'https://songteksten.net/lyric/7419/102709/niels-destadsbader/vergeef-me.html',
 'https://songteksten.net/lyric/1938/30601/evanescence/understanding-wash-it-all-away.html',
 'https://songteksten.net/artist/lyrics/1938/evanescence/page/1.html',
 'https://songteksten.net/artist/lyrics/9049/danny-vera.html',
 '//songteksten.net/lyrics/latest.html',
 'https://songteksten.net/lyric/1938/30596/evanescence/lies.html',
 '//songteksten.net/artists.html',
 'https://songteksten.net/artists/o.html',
 'https://songteksten.net/artists/p.html',
 'https://songteksten.net/lyric/1938/87754/evanescence/wake-me-up-inside-bring-me-to-life.html',
 '//songteksten.net/lyrics/popular.html',
 'https://songteksten.net/lyric/1938/59309/evanescence/all-that-im-living-for.ht

In this case, to obtain lyrics we need to keep the links that contain 'songteksten.net/lyric/1938'. We build this string by extracting the site name and the artist number from the url address.

In [10]:
# using url address to filter lyrics

spliting = urls[0].split('/')
filter_lyrics = spliting[2]+'/lyric/'+spliting[-2]

list_links_lyrics_songteksten_net = [link for link in list_links_lyrics_songteksten_net if (filter_lyrics 
                                                                              in link) ]

print('Number of links after filtering:', len(list_links_lyrics_songteksten_net))


Number of links after filtering: 86
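
As an alternative to indexing the result of `split('/')`, the filter string can be derived with `urllib.parse` from the standard library. The sketch below assumes the artist pages always follow the layout '/artist/lyrics/ARTIST_NUMBER/...' seen in the urls above; `lyric_link_prefix` is a hypothetical helper name. Unlike `spliting[-2]`, it gives the same result for the paginated urls, and the trailing '/' keeps an artist number such as 1938 from also matching a longer number that starts with the same digits.

```python
from urllib.parse import urlparse


def lyric_link_prefix(artist_page_url):
    """Build 'HOST/lyric/ARTIST_NUMBER/' from a songteksten.net artist page url."""
    parsed = urlparse(artist_page_url)
    # e.g. ['artist', 'lyrics', '1938', 'evanescence.html']
    path_parts = [part for part in parsed.path.split('/') if part]
    artist_number = path_parts[2]  # assumed position of the artist number
    return f"{parsed.netloc}/lyric/{artist_number}/"


first_page = 'https://songteksten.net/artist/lyrics/1938/evanescence.html'
second_page = 'https://songteksten.net/artist/lyrics/1938/evanescence/page/2.html'

print(lyric_link_prefix(first_page))   # songteksten.net/lyric/1938/
print(lyric_link_prefix(second_page))  # songteksten.net/lyric/1938/
```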


In [14]:
#Extracting the titles of the song and rearranging in alphabetical order for quick inspection

list_titles = [link.split('/')[-1].split('.')[-2] for link in list_links_lyrics_songteksten_net ]
list_titles.sort()
list_titles

['4th-of-july',
 'all-that-im-living-for',
 'angel-of-mine',
 'anything-for-you',
 'anywhere',
 'away-from-me',
 'before-the-dawn',
 'bleed',
 'breathe-no-more',
 'bring-me-to-life',
 'broken',
 'call-me-when-youre-sober',
 'cloud-nine',
 'disappear',
 'end-of-the-dream',
 'erase-this',
 'even-in-death',
 'everybodys-fool',
 'exodus',
 'fall-into-you',
 'farther-away',
 'fields-of-innocence',
 'forever-gone-forever-you',
 'forgive-me',
 'give-unto-me',
 'going-under',
 'good-enough',
 'goodnight',
 'haunted',
 'haunting-you',
 'heart-shaped-box',
 'hello',
 'i-believe-in-you',
 'i-must-be-dreaming',
 'imaginary',
 'imperfection',
 'lacrymosa',
 'lies',
 'lies-remix',
 'like-you',
 'listen-to-the-rain',
 'lithium',
 'lose-control',
 'lost-in-paradise',
 'made-of-stone',
 'missing',
 'must-be-dreaming',
 'my-cartoon-network',
 'my-heart-is-broken',
 'my-immortal',
 'my-last-breath',
 'never-go-back',
 'new-way-to-bleed',
 'october',
 'restless',
 'say-you-will',
 'secret-door',
 'sick',


### Retrieving from [songteksten.nl](https://www.songteksten.nl/)

Similarly to songteksten.net, here we have multiple pages with hyperlinks.

In [15]:
# retrieving all hyperlinks
urls = ['https://www.songteksten.nl/artiest/4713/evanescence.htm',
       'https://www.songteksten.nl/artiest/4713/2/evanescence.htm']


list_links_lyrics_songteksten_nl = []

for url in urls:
    list_links_lyrics_songteksten_nl.extend(retrieve_hyperlinks(url))
    
# removing probable repetitions
list_links_lyrics_songteksten_nl = list(set(list_links_lyrics_songteksten_nl))
    
print('Number of links before filtering:', len(list_links_lyrics_songteksten_nl))

Number of links before filtering: 128


In [16]:
list_links_lyrics_songteksten_nl

['/songteksten/795870/evanescence/on-the-songs%3a-lost-in-paradise.htm',
 '/songteksten/174661/evanescence/field-of-innocence.htm',
 '/songteksten/41316/evanescence/my-last-breath.htm',
 '/songteksten/74926/evanescence/imaginery.htm',
 '/songteksten/41309/evanescence/fields-of-innocence.htm',
 '/songteksten/41812/evanescence/even-in-death.htm',
 '/songteksten/67506/evanescence/cloud-nine.htm',
 '/songteksten/360416/evanescence/what-you-want.htm',
 '/songteksten/174658/evanescence/goodnight.htm',
 '/songteksten/41310/evanescence/going-under.htm',
 '/songteksten/43906/evanescence/away-from-me.htm',
 '/songteksten/67509/evanescence/like-you.htm',
 '/songteksten/994425/evanescence/the-in-between--piano-solo-.htm',
 '/songteksten/48710/evanescence/exodus.htm',
 '/songteksten/363714/evanescence/new-way-to-bleed.htm',
 '/songteksten/41318/evanescence/tourniquet.htm',
 '/songteksten/49745/evanescence/heart-shaped-box.htm',
 '/',
 '/songteksten/41320/evanescence/whisper.htm',
 '/songteksten/675

In [17]:
# Only keeping the hyperlinks of the lyrics

list_links_lyrics_songteksten_nl = [link for link in list_links_lyrics_songteksten_nl 
                                    if ('/evanescence/' in link) ]

len(list_links_lyrics_songteksten_nl)

112

In [18]:
list_links_lyrics_songteksten_nl

['/songteksten/795870/evanescence/on-the-songs%3a-lost-in-paradise.htm',
 '/songteksten/174661/evanescence/field-of-innocence.htm',
 '/songteksten/41316/evanescence/my-last-breath.htm',
 '/songteksten/74926/evanescence/imaginery.htm',
 '/songteksten/41309/evanescence/fields-of-innocence.htm',
 '/songteksten/41812/evanescence/even-in-death.htm',
 '/songteksten/67506/evanescence/cloud-nine.htm',
 '/songteksten/360416/evanescence/what-you-want.htm',
 '/songteksten/174658/evanescence/goodnight.htm',
 '/songteksten/41310/evanescence/going-under.htm',
 '/songteksten/43906/evanescence/away-from-me.htm',
 '/songteksten/67509/evanescence/like-you.htm',
 '/songteksten/994425/evanescence/the-in-between--piano-solo-.htm',
 '/songteksten/48710/evanescence/exodus.htm',
 '/songteksten/363714/evanescence/new-way-to-bleed.htm',
 '/songteksten/41318/evanescence/tourniquet.htm',
 '/songteksten/49745/evanescence/heart-shaped-box.htm',
 '/songteksten/41320/evanescence/whisper.htm',
 '/songteksten/67513/eva

A quick inspection reveals frequent misspellings in the titles (e.g. 'imaginery').
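
Misspelled or near-duplicate titles can also be spotted programmatically instead of by eye. The sketch below uses `difflib.SequenceMatcher` from the standard library to flag pairs of very similar titles. The helper name `near_duplicate_titles` and the 0.9 threshold are assumptions chosen for illustration, not something tuned on the real data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_titles(titles, threshold=0.9):
    """Return pairs of distinct titles whose similarity ratio is at least `threshold`."""
    pairs = []
    for first, second in combinations(sorted(set(titles)), 2):
        ratio = SequenceMatcher(None, first, second).ratio()
        if ratio >= threshold:
            pairs.append((first, second))
    return pairs


# Titles taken from the link list above
sample_titles = ['field-of-innocence', 'fields-of-innocence', 'going-under']

print(near_duplicate_titles(sample_titles))
# [('field-of-innocence', 'fields-of-innocence')]
```

Flagged pairs still need a manual check, since two similar titles can be genuinely different songs.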

### Retrieving from [AZLyrics](https://www.azlyrics.com/)

The last website investigated, like the first one, has a single main page with links to all lyrics.

In [20]:
url = 'https://www.azlyrics.com/e/evanescence.html'

list_links_lyrics_azlyrics = retrieve_hyperlinks(url)

# removing possible repetitions 

list_links_lyrics_azlyrics = list(set(list_links_lyrics_azlyrics))

print('\n Number of links before filtering:', len(list_links_lyrics_azlyrics))
list_links_lyrics_azlyrics


 Number of links before filtering: 128


['../lyrics/evanescence/ifyoudontmind.html',
 '//www.facebook.com/pages/AZLyricscom/154139197951223',
 '../lyrics/evanescence/awayfromme.html',
 '../lyrics/evanescence/anewwaytobleed.html',
 '../lyrics/evanescence/zero.html',
 '//www.azlyrics.com/o.html',
 '//www.azlyrics.com/n.html',
 '//www.azlyrics.com/copyright.html',
 '../lyrics/evanescence/haunteddemo.html',
 '../lyrics/evanescence/lacrymosa572218.html',
 '../lyrics/evanescence/hello.html',
 '//www.azlyrics.com/c.html',
 '../lyrics/evanescence/sweetsacrifice.html',
 '//www.azlyrics.com/cookie.html',
 '//www.azlyrics.com/s.html',
 '//www.azlyrics.com/p.html',
 '../lyrics/evanescence/beforethedawn.html',
 '../lyrics/evanescence/forevergoneforeveryou.html',
 '../lyrics/evanescence/lacrymosa.html',
 '../a/amylee.html',
 '//www.azlyrics.com/v.html',
 '../lyrics/evanescence/thoughtless.html',
 '//www.azlyrics.com/adv.html',
 '../lyrics/evanescence/thechange.html',
 '../lyrics/evanescence/everybodysfool.html',
 '../lyrics/evanescence/oc

And in this case our filtering list comprehension uses '/lyrics/' to keep only the relevant links. 

In [22]:
list_links_lyrics_azlyrics = [link for link in list_links_lyrics_azlyrics if '/lyrics/' in link ]

print('\n Number of links after updated filtering and applying set:', len(list_links_lyrics_azlyrics))


 Number of links after updated filtering and applying set: 88


In [23]:
# organizing in alphabetical order to make visualization easy
list_links_lyrics_azlyrics.sort()
list_links_lyrics_azlyrics

['../lyrics/evanescence/allthatimlivingfor.html',
 '../lyrics/evanescence/anewwaytobleed.html',
 '../lyrics/evanescence/anythingforyou.html',
 '../lyrics/evanescence/anywhere.html',
 '../lyrics/evanescence/awayfromme.html',
 '../lyrics/evanescence/beforethedawn.html',
 '../lyrics/evanescence/bleedimustbedreaming.html',
 '../lyrics/evanescence/breathenomore.html',
 '../lyrics/evanescence/bringmetolife.html',
 '../lyrics/evanescence/bringmetolifesynthesis.html',
 '../lyrics/evanescence/callmewhenyouresober.html',
 '../lyrics/evanescence/cloudnine.html',
 '../lyrics/evanescence/disappear.html',
 '../lyrics/evanescence/endofthedream.html',
 '../lyrics/evanescence/erasethis.html',
 '../lyrics/evanescence/evenindeath.html',
 '../lyrics/evanescence/evenindeath2016version.html',
 '../lyrics/evanescence/everybodysfool.html',
 '../lyrics/evanescence/exodus.html',
 '../lyrics/evanescence/fartheraway.html',
 '../lyrics/evanescence/fieldofinnocence.html',
 '../lyrics/evanescence/forevergoneforevery
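
The links collected from AZLyrics ('../lyrics/...') and from songteksten.nl ('/songteksten/...') are relative, and some links start with '//' without a scheme. Before such links can be passed to `requests.get`, they have to be resolved against the url of the page they came from. A minimal sketch with `urllib.parse.urljoin` from the standard library (the variable names are made up for this example):

```python
from urllib.parse import urljoin

# (page url, link found on that page) pairs taken from the outputs above
examples = [
    ('https://www.azlyrics.com/e/evanescence.html',
     '../lyrics/evanescence/hello.html'),
    ('https://www.songteksten.nl/artiest/4713/evanescence.htm',
     '/songteksten/41310/evanescence/going-under.htm'),
    ('https://songteksten.net/artist/lyrics/1938/evanescence.html',
     '//songteksten.net/artists.html'),
]

absolute_links = [urljoin(page_url, link) for page_url, link in examples]

for link in absolute_links:
    print(link)
# https://www.azlyrics.com/lyrics/evanescence/hello.html
# https://www.songteksten.nl/songteksten/41310/evanescence/going-under.htm
# https://songteksten.net/artists.html
```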

# **Extract lyrics from a webpage**

Now it is time to use the hyperlinks obtained in the previous section to extract the lyrics.

From the sites inspected before, I've decided to drop songteksten.nl because of the misspelling issues, and to work further with songteksten.net because its pages seemed the least complicated to extract lyrics from.

The following function handles this task by using `BeautifulSoup` to parse the HTML and prettify it. After that, I use some Python string methods to obtain the lyrics both as a string and as a list of lines.
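
The string-method part of this approach can be tried offline on a small hand-written snippet. The sketch below keeps the text between a start marker and an end marker and then splits it at the line breaks. `split_lyric_lines` is a hypothetical helper, and `sample` is an invented, simplified stand-in for prettified HTML, not the real structure of a songteksten.net page.

```python
def split_lyric_lines(pretty_html, start_marker, end_marker, line_break='<br/>\n'):
    """Return the non-empty text lines found between two markers."""
    # keep only what lies between the two markers
    body = pretty_html.split(start_marker)[1].split(end_marker)[0]
    # one chunk per <br/>, with newlines and surrounding spaces removed
    lines = [chunk.replace('\n', '').strip() for chunk in body.split(line_break)]
    # drop empty chunks
    return [line for line in lines if line]


# Invented sample with placeholder text instead of real lyrics
sample = (
    '<h1>\n Song title\n</h1>\n'
    ' first line\n <br/>\n'
    ' second line\n <br/>\n'
    ' <div class="buma-consent" role="alert">\n </div>\n'
)

lines = split_lyric_lines(sample, '</h1>\n', '<div class="buma-consent" role="alert">')
print(lines)             # ['first line', 'second line']
print('. '.join(lines))  # first line. second line
```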

In [24]:
def extract_lyric_from_url(url_lyric):
    """ Extract lyrics from a songteksten.net lyric page, returning them as a list of lines and as a single string """
    
    
    # send an HTTP request
    r_lyric = requests.get(url_lyric)
    
    # obtain the HTML content of the page as text
    html_doc_lyric = r_lyric.text
    
    # parse the HTML (naming the parser explicitly avoids bs4's "no parser specified" warning)
    soup_lyric = BeautifulSoup(html_doc_lyric, 'html.parser')

    
    # prettifying it
    soup_lyric_pretty = soup_lyric.prettify()
    
    # Isolating the part that contains the lyric
    
    text = soup_lyric_pretty.split('</h1>\n')[1].split('<div class="buma-consent" role="alert">')[0]

    # Cleaning text and building a list with it
    list_lyrics = text.split('<br/>\n')
    list_lyrics = [item.replace('\n','') for item in list_lyrics]
    list_lyrics = [item.strip() for item in list_lyrics]
    
    # removing empty elements from the list
    # (a comprehension is used because removing items while looping over the same list can skip neighbours)
    
    list_lyrics = [item for item in list_lyrics if item != '']
            
    # this part was added after noticing that at least one lyric was not following the normal pattern
    
    if '<div' in list_lyrics[0]:
        list_lyrics = list_lyrics[1:]
        
        
    # Having the lyrics in string format
    
    lyrics = '. '.join(list_lyrics)
            
    
    # returning both list and string
    
    return list_lyrics, lyrics
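
A note on dropping empty strings from a list of lines: calling `list.remove` inside a `for` loop over the same list can skip the element that follows each removal, so consecutive empty items are not all removed. A list comprehension does not have this problem. The small demonstration below uses placeholder lines.

```python
lines = ['first line', '', '', 'second line']

# Pitfall: the list shrinks while the loop index keeps advancing,
# so the second empty string is never visited.
buggy = list(lines)
for item in buggy:
    if item == '':
        buggy.remove(item)

# Safe version: build a new list with the non-empty items only.
cleaned = [item for item in lines if item != '']

print(buggy)    # ['first line', '', 'second line']
print(cleaned)  # ['first line', 'second line']
```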

## Extracting lyric from songteksten.net

In [25]:
url_lyric = list_links_lyrics_songteksten_net[1]

print(url_lyric,'\n')

https://songteksten.net/lyric/1938/30581/evanescence/going-under.html 



In [26]:
list_lyrics = extract_lyric_from_url(url_lyric)[0]
list_lyrics

["now I will tell you what I've done for you",
 "50 thousand tears I've cried",
 'screaming deceiving and bleeding for you',
 "and you still won't hear me",
 "don't want your hand this time I'll save myself",
 "maybe I'll wake up for once",
 'not tormented daily defeated by you',
 "just when I thought I'd reached the bottom",
 "I'm dying again",
 "I'm going under",
 'drowning in you',
 "I'm falling forever",
 "I've got to break through",
 "I'm going under",
 'blurring and stirring the truth and the lies',
 "so I don't know what's real and what's not",
 'always confusing the thoughts in my head',
 "so I can't trust myself anymore",
 "I'm dying again",
 "I'm going under",
 'drowning in you',
 "I'm falling forever",
 "I've got to break through",
 'so go on and scream',
 "scream at me I' m so far away",
 "I won't be broken again",
 "I've got to breathe I can't keep going under",
 "I'm dying again",
 "I'm going under",
 'drowning in you',
 "I'm falling forever",
 "I've got to break through"

In [27]:
lyrics = extract_lyric_from_url(url_lyric)[1]
lyrics

"now I will tell you what I've done for you. 50 thousand tears I've cried. screaming deceiving and bleeding for you. and you still won't hear me. don't want your hand this time I'll save myself. maybe I'll wake up for once. not tormented daily defeated by you. just when I thought I'd reached the bottom. I'm dying again. I'm going under. drowning in you. I'm falling forever. I've got to break through. I'm going under. blurring and stirring the truth and the lies. so I don't know what's real and what's not. always confusing the thoughts in my head. so I can't trust myself anymore. I'm dying again. I'm going under. drowning in you. I'm falling forever. I've got to break through. so go on and scream. scream at me I' m so far away. I won't be broken again. I've got to breathe I can't keep going under. I'm dying again. I'm going under. drowning in you. I'm falling forever. I've got to break through. I'm going under"

### Try another one

In [29]:
url_lyric = list_links_lyrics_songteksten_net[10]

print(url_lyric)

https://songteksten.net/lyric/1938/30592/evanescence/where-will-you-go.html


In [30]:
list_lyrics = extract_lyric_from_url(url_lyric)[0]
list_lyrics

["You're too important for anyone",
 'You play the role of all you long to be',
 'But I, I know who you really are',
 "You're the one who cries when you're alone",
 'But where will you go',
 'With no one left to save you from yourself',
 "You can't escape",
 "You can't escape",
 "You think that I can't see right through your eyes",
 'Scared to death to face reality',
 'No one seems to hear your hidden cries',
 "You're left to face yourself alone",
 'But where will you go',
 'With no one left to save you from yourself',
 "You can't escape",
 "You can't escape",
 "I realize you're afraid",
 "But you can't abandon everyone",
 "You can't escape",
 "You don't want to escape",
 "I'm so sick",
 'of speaking words that no one understands',
 "Is it clear enough that you can't live",
 'your whole life all alone',
 'I can hear you in a whisper',
 "But you can't even hear me screaming",
 'But where will you go',
 'With no one left to save you from yourself',
 "You can't escape",
 "You can't escape

In [31]:
lyrics = extract_lyric_from_url(url_lyric)[1]
lyrics

"You're too important for anyone. You play the role of all you long to be. But I, I know who you really are. You're the one who cries when you're alone. But where will you go. With no one left to save you from yourself. You can't escape. You can't escape. You think that I can't see right through your eyes. Scared to death to face reality. No one seems to hear your hidden cries. You're left to face yourself alone. But where will you go. With no one left to save you from yourself. You can't escape. You can't escape. I realize you're afraid. But you can't abandon everyone. You can't escape. You don't want to escape. I'm so sick. of speaking words that no one understands. Is it clear enough that you can't live. your whole life all alone. I can hear you in a whisper. But you can't even hear me screaming. But where will you go. With no one left to save you from yourself. You can't escape. You can't escape. I realize you're afraid. But you can't reject the whole world. You can't escape. You w

It seems to work just fine!

Let's run it for all songs and save the resulting strings in a list, so that we can later combine them with other data about Evanescence's songs!

In [32]:
list_lyrics_evanescence = []
list_title_lyrics_evanescence = []

In [34]:
len(list_links_lyrics_songteksten_net)

86

In [36]:
# building lists with titles of lyrics and lyrics

for url_lyric in list_links_lyrics_songteksten_net:
    
    list_title_lyrics_evanescence.append(url_lyric.split('/')[-1].split('.')[-2])
    list_lyrics_evanescence.append(extract_lyric_from_url(url_lyric)[1])
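
When looping over many pages, one failed request or one page with an unexpected layout stops the whole loop. A more defensive variant is sketched below, under a few assumptions: `collect_lyrics` is a hypothetical helper, the pause between requests is my own addition (a courtesy to the website), and the extraction function is passed in as an argument so that the sketch can be run offline with a stand-in.

```python
import time


def collect_lyrics(links, extract, delay_seconds=1.0):
    """Apply `extract` to every link; return titles, lyrics and the links that failed."""
    titles, lyrics, failed = [], [], []
    for link in links:
        try:
            _, text = extract(link)  # extract returns (list_of_lines, string)
        except (OSError, IndexError) as error:
            # OSError covers network errors; IndexError covers a missing split marker
            failed.append((link, repr(error)))
        else:
            titles.append(link.split('/')[-1].split('.')[-2])
            lyrics.append(text)
        time.sleep(delay_seconds)  # pause between requests
    return titles, lyrics, failed


def fake_extract(link):
    """Offline stand-in for extract_lyric_from_url, for demonstration only."""
    if 'missing' in link:
        raise IndexError('marker not found')
    return ['line one', 'line two'], 'line one. line two'


demo_links = [
    'https://example.test/lyric/1/2/artist/good-song.html',
    'https://example.test/lyric/1/3/artist/missing-page.html',
]
titles, lyrics, failed = collect_lyrics(demo_links, fake_extract, delay_seconds=0)
print(titles)       # ['good-song']
print(len(failed))  # 1
```

In the notebook, `extract_lyric_from_url` would be passed in place of `fake_extract`.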


Just verifying that everything is as expected.

In [37]:
len(list_title_lyrics_evanescence)

86

In [38]:
len(list_lyrics_evanescence)

86

Now we create a Pandas data frame with song title and lyrics that will be saved in .csv format for later use.

In [39]:
# Creating a dataframe with song titles and lyrics

import pandas as pd
df = pd.DataFrame({'song_title': list_title_lyrics_evanescence,
                  'lyrics': list_lyrics_evanescence})

In [40]:
df.head()

| | song_title | lyrics |
|---|---|---|
| 0 | imperfection | The more you try to fight it. The more you try... |
| 1 | going-under | now I will tell you what I've done for you. 50... |
| 2 | everybodys-fool | perfect by nature. icons of self indulgence. j... |
| 3 | my-immortal | I'm so tired of being here. suppressed by all ... |
| 4 | haunted | long lost words whisper slowly to me. still ca... |


Before saving this data frame in .csv format for future use, let's replace '-' with a space in the song titles and make all titles lower case by applying `.lower()`.

Keeping the format of the song titles uniform makes our life easier when we want to add more information about those songs, for instance metadata such as the album title and the year of recording.

In [41]:
df['song_title'] = df['song_title'].apply(lambda x: x.replace('-',' ').lower())
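
The two title steps (taking the last part of the url without its extension, then replacing '-' and lower-casing) can also be combined into one helper that works directly on a link. This is a sketch using only the standard library; `title_from_link` is a hypothetical name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def title_from_link(link):
    """Turn '.../evanescence/going-under.html' into 'going under'."""
    # .stem is the last path component without its extension
    slug = PurePosixPath(urlparse(link).path).stem
    return slug.replace('-', ' ').lower()


link = 'https://songteksten.net/lyric/1938/30581/evanescence/going-under.html'
print(title_from_link(link))  # going under
```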

In [42]:
df.head()

| | song_title | lyrics |
|---|---|---|
| 0 | imperfection | The more you try to fight it. The more you try... |
| 1 | going under | now I will tell you what I've done for you. 50... |
| 2 | everybodys fool | perfect by nature. icons of self indulgence. j... |
| 3 | my immortal | I'm so tired of being here. suppressed by all ... |
| 4 | haunted | long lost words whisper slowly to me. still ca... |


In [43]:
# saving dataframe to .csv

df.to_csv("./data/lyrics_evanescence.csv", index = False)

In [44]:
# testing saved .csv

df2 = pd.read_csv("./data/lyrics_evanescence.csv")

df2.head()

| | song_title | lyrics |
|---|---|---|
| 0 | imperfection | The more you try to fight it. The more you try... |
| 1 | going under | now I will tell you what I've done for you. 50... |
| 2 | everybodys fool | perfect by nature. icons of self indulgence. j... |
| 3 | my immortal | I'm so tired of being here. suppressed by all ... |
| 4 | haunted | long lost words whisper slowly to me. still ca... |


In [45]:
df2.shape

(86, 2)

In [46]:
del df2

# **Applying the same to Within Temptation**

Previously, we tried 4 different lyrics websites to retrieve Evanescence's lyrics, and we chose songteksten.net to extract them.

Some reasons why I chose `songteksten.net`:

1. When filtering, we used `songteksten.net/lyric/number_of_the_artist`, which can be extracted from the url. This makes it easy to generalize the code to both bands, or even to other bands.

2. Song titles consist of words separated by '-', which can easily be converted into a form that facilitates merging with additional data, as we have already planned.

Summarizing, the steps to retrieve lyrics are:

1. Apply function to retrieve hyperlinks
2. Filter out hyperlinks that do not contain lyrics
3. Apply the function `extract_lyric_from_url` to each remaining hyperlink to extract the lyrics.

These steps will now also be applied to retrieve the lyrics of Within Temptation.
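
The three steps above can be sketched as a single function. To keep the sketch runnable offline, the functions that retrieve hyperlinks and extract lyrics are passed in as arguments and replaced here by stand-ins. `scrape_artist`, `fake_pages`, and the placeholder text are invented for this illustration, and the code assumes the url layout '/artist/lyrics/ARTIST_NUMBER/...' used by songteksten.net.

```python
from urllib.parse import urlparse


def scrape_artist(page_urls, retrieve, extract):
    """Return {lyric_link: lyrics_string} for all lyric links found on the artist pages."""
    # Step 1: collect the hyperlinks of every page, without duplicates
    links = set()
    for page_url in page_urls:
        links.update(retrieve(page_url))

    # Step 2: keep only the links to lyrics of this artist
    parsed = urlparse(page_urls[0])
    artist_number = parsed.path.split('/')[3]  # '/artist/lyrics/320/...' -> '320'
    prefix = f"{parsed.netloc}/lyric/{artist_number}/"
    lyric_links = sorted(link for link in links if prefix in link)

    # Step 3: extract the lyrics (string form) from each remaining link
    return {link: extract(link)[1] for link in lyric_links}


# Offline stand-ins for the two scraping functions
fake_pages = {
    'https://songteksten.net/artist/lyrics/320/within-temptation.html': [
        'https://songteksten.net/lyric/320/45119/within-temptation/angels.html',
        '//songteksten.net/artists.html',
    ],
}

result = scrape_artist(
    list(fake_pages),
    fake_pages.get,                           # stands in for retrieve_hyperlinks
    lambda link: ([], 'placeholder lyrics'),  # stands in for extract_lyric_from_url
)
print(list(result))
# ['https://songteksten.net/lyric/320/45119/within-temptation/angels.html']
```

With the real functions, the call would be `scrape_artist(urls, retrieve_hyperlinks, extract_lyric_from_url)`.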


## STEP 1 - Retrieving hyperlinks from songteksten.net

In [47]:
# retrieving all hyperlinks
urls = ['https://songteksten.net/artist/lyrics/320/within-temptation.html',
       'https://songteksten.net/artist/lyrics/320/within-temptation/page/2.html',
       'https://songteksten.net/artist/lyrics/320/within-temptation/page/3.html']

list_links_lyrics_songteksten_net = []

for url in urls:
    list_links_lyrics_songteksten_net.extend(retrieve_hyperlinks(url))
    
# removing possible duplicates
list_links_lyrics_songteksten_net = list(set(list_links_lyrics_songteksten_net))

    
print('Number of links before filtering:', len(list_links_lyrics_songteksten_net))

Number of links before filtering: 220


In [48]:
list_links_lyrics_songteksten_net[:10]

['https://songteksten.net/lyric/7419/102714/niels-destadsbader/het-licht-gaat-uit.html',
 'https://songteksten.net/lyric/320/24061/within-temptation/world-of-make-believe.html',
 'https://songteksten.net/lyric/320/45119/within-temptation/angels.html',
 'https://songteksten.net/lyric/320/89809/within-temptation/stairway-to-the-skies.html',
 'https://songteksten.net/lyric/7419/102709/niels-destadsbader/vergeef-me.html',
 'https://songteksten.net/lyric/320/53788/within-temptation/towards-the-end.html',
 'https://songteksten.net/artist/lyrics/9049/danny-vera.html',
 '//songteksten.net/lyrics/latest.html',
 'https://songteksten.net/albums/album/320/382/within-temptation/enter.html',
 '//songteksten.net/artists.html']

## STEP 2 - Keeping only hyperlinks of lyrics

In [49]:
# filtering hyperlinks which contain lyrics - specific for songteksten.net

# using url address to filter lyrics

spliting = urls[0].split('/')
filter_lyrics = spliting[2]+'/lyric/'+spliting[-2]

list_links_lyrics_songteksten_net = [link for link in list_links_lyrics_songteksten_net if (filter_lyrics 
                                                                              in link) ]

print('Number of links after filtering:', len(list_links_lyrics_songteksten_net))

Number of links after filtering: 74


In [50]:
url_lyric = list_links_lyrics_songteksten_net[1]

print(url_lyric)

https://songteksten.net/lyric/320/45119/within-temptation/angels.html


## STEP 3 - Extracting song titles and lyrics from lyric's hyperlink

In [51]:
# lyrics in form of list
list_lyrics = extract_lyric_from_url(url_lyric)[0]
list_lyrics

['Sparkling angel I believed',
 'You were my saviour in my time of need',
 "Blinded by faith I couldn't hear",
 'I see the angels',
 "I'll lead them to your door",
 "There's no escape now",
 'No mercy no more',
 'No remorse cause I still remember',
 'The smile when you tore me apart',
 'You took my heart',
 'Deceived me right from the start',
 'You showed me dreams',
 "I wished they'd turn into real",
 'You broke the promise and made me realise',
 'It was all just a lie',
 "Sparkling angel, I couldn't see",
 'Your dark intensions, your feelings for me',
 'Fallen angel, tell me why?',
 'What is the reason, the thorn in your eye?',
 'I see the angels',
 "I'll lead them to your door",
 "There's no escape now",
 'No mercy no more',
 'No remorse cause I still remember',
 'The smile when you tore me apart',
 'You took my heart',
 'Deceived me right from the start',
 'You showed me dreams',
 "I wished they'd turn into real",
 'You broke the promise and made me realise',
 'It was all just a li

In [52]:
# lyrics in form of string

lyrics = extract_lyric_from_url(url_lyric)[1]
lyrics



## Saving song's titles and lyrics in a .csv file

Just like before we save the result in a .csv to be used in our further analysis.

In [53]:
# building lists with titles of lyrics and lyrics
list_title_lyrics_within_temptation = []
list_lyrics_within_temptation = []

for url_lyric in list_links_lyrics_songteksten_net:
    
    list_title_lyrics_within_temptation.append(url_lyric.split('/')[-1].split('.')[-2])
    list_lyrics_within_temptation.append(extract_lyric_from_url(url_lyric)[1])


In [54]:
len(list_title_lyrics_within_temptation)

74

In [55]:
len(list_lyrics_within_temptation)

74

In [56]:
import pandas as pd
df = pd.DataFrame({'song_title': list_title_lyrics_within_temptation,
                  'lyrics': list_lyrics_within_temptation})

In [59]:
# Here we also remove '-' from the title of the songs

df['song_title'] = df['song_title'].apply(lambda x: x.replace('-',' ').lower())

In [64]:
df.head()

| | song_title | lyrics |
|---|---|---|
| 0 | world of make believe | On golden wings. She flies at night. With her ... |
| 1 | angels | Sparkling angel I believed. You were my saviou... |
| 2 | stairway to the skies | Seven seconds to the rise. Can't believe I'm s... |
| 3 | towards the end | The turn against. The world we know. Now our d... |
| 4 | in perfect harmony | At the end of a closing day. A little child wa... |


In [65]:
# saving dataframe to .csv

df.to_csv("./data/lyrics_within_temptation.csv", index = False)

In [66]:
df2 = pd.read_csv("./data/lyrics_within_temptation.csv")

In [67]:
df2.head()

| | song_title | lyrics |
|---|---|---|
| 0 | world of make believe | On golden wings. She flies at night. With her ... |
| 1 | angels | Sparkling angel I believed. You were my saviou... |
| 2 | stairway to the skies | Seven seconds to the rise. Can't believe I'm s... |
| 3 | towards the end | The turn against. The world we know. Now our d... |
| 4 | in perfect harmony | At the end of a closing day. A little child wa... |


In [68]:
df2.shape

(74, 2)

In [None]:
del df2