In this first notebook I'll apply web scraping to obtain lyrics of Evanescence and Within Temptation from the web. 

First, I searched for lyrics sites that carry both bands. My goal then is to use the same code to extract lyrics for both bands.

Some of the websites I explored were:

- http://www.metrolyrics.com
- https://www.azlyrics.com/
- https://www.songteksten.nl/
- https://www.songteksten.net/

To extract hyperlinks I've used the packages: 

* [requests](https://pypi.org/project/requests/) 
* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). 

You will notice that some simple [Python string methods](https://www.w3schools.com/python/python_strings.asp) are also used for retrieving the lyrics. So let's start web scraping.

# **Webscraping**

The first step is to extract the address of the link to each of the band's lyrics. The chosen webpages usually list the song titles, and each title is a hyperlink to a lyric. Once we have all the hyperlinks, we can use them to extract each one of the lyrics.

The following function retrieves all hyperlinks within the main page. However, each website has its own particularities, which must be taken into account when picking out the hyperlinks of the lyrics.


In [1]:
# importing packages

import requests
from bs4 import BeautifulSoup


def retrieve_hyperlinks(main_url):
    """ Extract all hyperlinks in 'main_url' and return a list with these hyperlinks """
    
    # Package the request, send it and catch the response: r

    r = requests.get(main_url)

    # Extracts the response as html: html_doc
    html_doc = r.text

    # Create a BeautifulSoup object from the HTML: soup
    soup = BeautifulSoup(html_doc,"lxml")
    
    # Find all 'a' tags (which define hyperlinks): a_tags

    a_tags = soup.find_all('a')
    
    # Create a list with hyperlinks found

    list_links = [link.get('href') for link in a_tags]
    
    # Remove None values (tags without an href) and empty strings
    
    list_links = list(filter(None, list_links)) 
    
    return list_links
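
To see what the function collects without touching the network, the same idea can be tried on a small inline HTML string. This is only a sketch: it swaps BeautifulSoup for the standard library's `html.parser.HTMLParser`, and `HrefCollector`, `hrefs_from_html` and `sample_html` are hypothetical names that do not appear in the notebook.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (stdlib stand-in for soup.find_all('a'))."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:  # skip <a> tags without an href, like filter(None, ...) does
                self.hrefs.append(href)


def hrefs_from_html(html_doc):
    """Return all hyperlinks found in an HTML string."""
    collector = HrefCollector()
    collector.feed(html_doc)
    return collector.hrefs


# Hypothetical HTML used only for illustration
sample_html = ('<p><a href="/a.html">A</a> <a name="x">no href</a> '
               '<a href="https://example.com/b.html">B</a></p>')
print(hrefs_from_html(sample_html))
```

The anchor without an `href` is dropped, which mirrors the `filter(None, ...)` step in `retrieve_hyperlinks`.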

## Retrieving hyperlinks of lyrics

Once we obtain all hyperlinks, we need to select those that point to lyrics. For now, we are investigating different lyrics websites, and each one has a slightly different structure that must be considered when filtering the hyperlinks.

By working with different websites, we can choose the one that we consider the most appropriate for extracting the lyrics.

Notice that initially we will only consider Evanescence lyrics. However, as said, we have already checked that all websites considered also contain lyrics from Within Temptation. Therefore, once we have our solution, it will be applicable to both bands.

### Retrieving from [metrolyrics](https://www.metrolyrics.com/)

To retrieve the lyrics of Evanescence from `metrolyrics`, we first apply the above function using `main_url` = 'http://www.metrolyrics.com/evanescence-lyrics.html'.

As expected, not all hyperlinks are related to lyrics, so we need to filter those out and keep only the ones that point to lyrics.


In [2]:
url = 'http://www.metrolyrics.com/evanescence-lyrics.html'
list_links_lyrics_metrolyrics = retrieve_hyperlinks(url)

# remove probable repetitions

list_links_lyrics_metrolyrics = list(set(list_links_lyrics_metrolyrics))

print('\n Number of links before filtering:', len(list_links_lyrics_metrolyrics))
list_links_lyrics_metrolyrics[:20]


 Number of links before filtering: 111


['https://www.metrolyrics.com/bring-me-to-life-lyrics-x-factor-uk.html',
 'https://www.metrolyrics.com/the-only-one-lyrics-evanescence.html',
 'https://www.metrolyrics.com/evanescence-lyrics.html',
 'https://www.metrolyrics.com/if-you-dont-mind-lyrics-evanescence.html',
 'https://www.metrolyrics.com/the-chain-from-gears-5-lyrics-evanescence.html',
 'https://www.metrolyrics.com/liquid-blue-lyrics-evanescence.html',
 'https://www.metrolyrics.com/even-in-death-lyrics-evanescence.html',
 'https://www.metrolyrics.com/bleed-lyrics-evanescence.html',
 'https://www.metrolyrics.com/my-heart-is-broken-lyrics-evanescence.html',
 'https://www.metrolyrics.com/haunted-lyrics-evanescence.html',
 'https://www.metrolyrics.com/rock-this-town-stray-cats-ml-video-n43.html',
 'https://www.metrolyrics.com/lacrymosa-lyrics-evanescence.html',
 'https://www.metrolyrics.com/your-star-lyrics-evanescence.html',
 'https://www.metrolyrics.com/sallys-song-lyrics-evanescence.html',
 'https://www.metrolyrics.com/evane

A quick look at the list above reveals that links to lyrics are of the form 'https://www.metrolyrics.com/TITLE-lyrics-evanescence.html'. So we can select the elements of the list that contain '-lyrics-evanescence.html'.

The list comprehension below does the job and returns 80 hyperlinks instead of the initial 111.

In [4]:
list_links_lyrics_metrolyrics = [link for link in list_links_lyrics_metrolyrics if '-lyrics-evanescence.html' in link]
print('\n Number of links after filtering:', len(list_links_lyrics_metrolyrics))
list_links_lyrics_metrolyrics



 Number of links after filtering: 80


['https://www.metrolyrics.com/the-only-one-lyrics-evanescence.html',
 'https://www.metrolyrics.com/if-you-dont-mind-lyrics-evanescence.html',
 'https://www.metrolyrics.com/the-chain-from-gears-5-lyrics-evanescence.html',
 'https://www.metrolyrics.com/liquid-blue-lyrics-evanescence.html',
 'https://www.metrolyrics.com/even-in-death-lyrics-evanescence.html',
 'https://www.metrolyrics.com/bleed-lyrics-evanescence.html',
 'https://www.metrolyrics.com/my-heart-is-broken-lyrics-evanescence.html',
 'https://www.metrolyrics.com/haunted-lyrics-evanescence.html',
 'https://www.metrolyrics.com/lacrymosa-lyrics-evanescence.html',
 'https://www.metrolyrics.com/your-star-lyrics-evanescence.html',
 'https://www.metrolyrics.com/sallys-song-lyrics-evanescence.html',
 'https://www.metrolyrics.com/sweet-sacrifice-lyrics-evanescence.html',
 'https://www.metrolyrics.com/missing-lyrics-evanescence.html',
 'https://www.metrolyrics.com/like-you-lyrics-evanescence.html',
 'https://www.metrolyrics.com/the-other

After filtering, we see that some links end with '/correction', so we update the list comprehension above to eliminate them.

In [4]:
# updated version

list_links_lyrics_metrolyrics = [link for link in list_links_lyrics_metrolyrics if 
                                 ('-lyrics-evanescence.html' in link and '/correction' not in link)]

print('\n Number of links after updated filtering:', len(list_links_lyrics_metrolyrics))

list_links_lyrics_metrolyrics.sort()
list_links_lyrics_metrolyrics


 Number of links after updated filtering and applying set: 77


['https://www.metrolyrics.com/angel-of-mine-lyrics-evanescence.html',
 'https://www.metrolyrics.com/anywhere-lyrics-evanescence.html',
 'https://www.metrolyrics.com/away-from-me-lyrics-evanescence.html',
 'https://www.metrolyrics.com/before-the-dawn-lyrics-evanescence.html',
 'https://www.metrolyrics.com/bleed-lyrics-evanescence.html',
 'https://www.metrolyrics.com/breathe-no-more-lyrics-evanescence.html',
 'https://www.metrolyrics.com/bring-me-to-life-lyrics-evanescence.html',
 'https://www.metrolyrics.com/broken-lyrics-evanescence.html',
 'https://www.metrolyrics.com/call-me-when-your-sober-lyrics-evanescence.html',
 'https://www.metrolyrics.com/call-me-when-youre-a-sober-lyrics-evanescence.html',
 'https://www.metrolyrics.com/call-me-when-youre-sober-lyrics-evanescence.html',
 'https://www.metrolyrics.com/cartoon-network-song-lyrics-evanescence.html',
 'https://www.metrolyrics.com/cloud-nine-lyrics-evanescence.html',
 'https://www.metrolyrics.com/end-of-the-dream-lyrics-evanescence.

So finally, for metrolyrics we have obtained 77 hyperlinks of Evanescence's lyrics. 
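
The filtering logic above can be wrapped in a small function so that the band is a parameter. This is a sketch under assumptions: `filter_metrolyrics_links` and `band_slug` are hypothetical names, the exact shape of the '/correction' link is made up for illustration, and whether another band's slug follows the same pattern on the site would still need to be checked.

```python
def filter_metrolyrics_links(links, band_slug):
    """Keep links that look like lyric pages of `band_slug`, without '/correction' links."""
    marker = '-lyrics-' + band_slug + '.html'
    # A set removes duplicates; sorting makes the result easy to inspect
    return sorted({link for link in links
                   if marker in link and '/correction' not in link})


sample_links = [
    'https://www.metrolyrics.com/bleed-lyrics-evanescence.html',
    'https://www.metrolyrics.com/bleed-lyrics-evanescence.html',  # duplicate
    'https://www.metrolyrics.com/evanescence-lyrics.html',  # artist page, not a lyric
    'https://www.metrolyrics.com/rock-this-town-stray-cats-ml-video-n43.html',
    # hypothetical shape of a correction link
    'https://www.metrolyrics.com/bleed-lyrics-evanescence.html/correction',
]
print(filter_metrolyrics_links(sample_links, 'evanescence'))
```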

Let's check the other websites.

### Retrieving from [songteksten.net](https://songteksten.net/)

This website spreads the links to the lyrics over 3 pages, so we need to apply the function to each of them.

Before filtering we have 229 links; after filtering, 86. Observe that although there is apparently no repetition, there are different versions of the same song (e.g. lies and lies-remix).

In [9]:
# retrieving all hyperlinks
urls = ['https://songteksten.net/artist/lyrics/1938/evanescence.html',
       'https://songteksten.net/artist/lyrics/1938/evanescence/page/2.html',
       'https://songteksten.net/artist/lyrics/1938/evanescence/page/3.html']

list_links_lyrics_songteksten_net = []

for url in urls:
    list_links_lyrics_songteksten_net.extend(retrieve_hyperlinks(url))
    
# remove probable repetitions

list_links_lyrics_songteksten_net = list(set(list_links_lyrics_songteksten_net))

    
print('Number of links before filtering:', len(list_links_lyrics_songteksten_net))

Number of links before filtering: 229


In [10]:
list_links_lyrics_songteksten_net

['https://songteksten.net/lyric/1938/30606/evanescence/breathe-no-more.html',
 'https://forum.songteksten.net/index.php?topic=1327.msg86864;topicseen#new',
 '//songteksten.net/genres.html',
 'https://songteksten.net/albums/add/1938.html',
 'https://songteksten.net/news/102225/marcel-van-roosmalen-verkozen-tot-republikein-van-het-jaar.html',
 '//songteksten.net/albums.html',
 'https://songteksten.net/artists/3.html',
 'https://songteksten.net/artists/n.html',
 'https://songteksten.net/lyric/1938/30586/evanescence/tourniquet.html',
 'https://songteksten.net/info/1187/rss-api-functies.html',
 'http://bandhosting.nl',
 'https://songteksten.net/lyric/1938/93924/evanescence/never-go-back.html',
 'https://songteksten.net/lyric/1938/59299/evanescence/sweet-sacrifice.html',
 'https://songteksten.net/lyric/1938/59309/evanescence/all-that-im-living-for.html',
 'https://songteksten.net/artists/q.html',
 'https://songteksten.net/lyric/9019/102741/snelle/smoorverliefd.html',
 'https://songteksten.ne

In this case, to obtain lyrics we need to keep the links that contain 'songteksten.net/lyric/1938'. We build this filter string from the url address itself.

In [17]:
# using url address to filter lyrics

spliting = urls[0].split('/')
filter_lyrics = spliting[2]+'/lyric/'+spliting[-2]

list_links_lyrics_songteksten_net = [link for link in list_links_lyrics_songteksten_net if (filter_lyrics 
                                                                              in link) ]

print('Number of links after filtering:', len(list_links_lyrics_songteksten_net))


Number of links after filtering: 86
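
The split-by-index approach above works for the first page of the artist, but not for the `page/2.html` urls, where the second-to-last segment is `page`. A slightly more defensive sketch uses `urllib.parse.urlparse` and looks for the segment that follows `lyrics`. The helper name `build_lyric_filter` is hypothetical, and the url layout is assumed from the addresses shown in this notebook.

```python
from urllib.parse import urlparse


def build_lyric_filter(artist_url):
    """Build 'domain/lyric/artist_id' from a songteksten.net artist page url."""
    parsed = urlparse(artist_url)
    segments = parsed.path.split('/')
    # Assumption: the artist id is the path segment right after 'lyrics'
    artist_id = segments[segments.index('lyrics') + 1]
    return parsed.netloc + '/lyric/' + artist_id


print(build_lyric_filter('https://songteksten.net/artist/lyrics/1938/evanescence.html'))
print(build_lyric_filter('https://songteksten.net/artist/lyrics/1938/evanescence/page/2.html'))
```

Both calls give the same filter string, so any of the artist pages can be used as input.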


In [18]:
#Extracting the titles of the song and rearranging in alphabetical order for quick inspection

list_titles = [link.split('/')[-1].split('.')[-2] for link in list_links_lyrics_songteksten_net ]
list_titles.sort()
list_titles

['4th-of-july',
 'all-that-im-living-for',
 'angel-of-mine',
 'anything-for-you',
 'anywhere',
 'away-from-me',
 'before-the-dawn',
 'bleed',
 'breathe-no-more',
 'bring-me-to-life',
 'broken',
 'call-me-when-youre-sober',
 'cloud-nine',
 'disappear',
 'end-of-the-dream',
 'erase-this',
 'even-in-death',
 'everybodys-fool',
 'exodus',
 'fall-into-you',
 'farther-away',
 'fields-of-innocence',
 'forever-gone-forever-you',
 'forgive-me',
 'give-unto-me',
 'going-under',
 'good-enough',
 'goodnight',
 'haunted',
 'haunting-you',
 'heart-shaped-box',
 'hello',
 'i-believe-in-you',
 'i-must-be-dreaming',
 'imaginary',
 'imperfection',
 'lacrymosa',
 'lies',
 'lies-remix',
 'like-you',
 'listen-to-the-rain',
 'lithium',
 'lose-control',
 'lost-in-paradise',
 'made-of-stone',
 'missing',
 'must-be-dreaming',
 'my-cartoon-network',
 'my-heart-is-broken',
 'my-immortal',
 'my-last-breath',
 'never-go-back',
 'new-way-to-bleed',
 'october',
 'restless',
 'say-you-will',
 'secret-door',
 'sick',
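
The title extraction and the search for alternative versions (such as lies and lies-remix) can also be written as two small helpers. This is a rough sketch: `title_from_link` and `find_variants` are hypothetical names, and the prefix rule is only a heuristic that can miss variants or flag unrelated songs.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def title_from_link(link):
    """Return the last path component of a lyric link without its extension."""
    return PurePosixPath(urlparse(link).path).stem


def find_variants(titles):
    """Return (base, variant) pairs where variant is base plus a '-suffix'."""
    unique = sorted(set(titles))
    return [(base, other) for base in unique for other in unique
            if other.startswith(base + '-')]


link = 'https://songteksten.net/lyric/1938/30606/evanescence/breathe-no-more.html'
print(title_from_link(link))
print(find_variants(['lies', 'lies-remix', 'like-you', 'lithium']))
```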


### Retrieving from [songteksten.nl](https://www.songteksten.nl/)

Similarly to songteksten.net, here we have multiple pages with hyperlinks.

In [19]:
# retrieving all hyperlinks
urls = ['https://www.songteksten.nl/artiest/4713/evanescence.htm',
       'https://www.songteksten.nl/artiest/4713/2/evanescence.htm']


list_links_lyrics_songteksten_nl = []

for url in urls:
    list_links_lyrics_songteksten_nl.extend(retrieve_hyperlinks(url))
    
# removing probable repetitions
list_links_lyrics_songteksten_nl = list(set(list_links_lyrics_songteksten_nl))
    
print('Number of links before filtering:', len(list_links_lyrics_songteksten_nl))

Number of links before filtering: 128


In [20]:
list_links_lyrics_songteksten_nl

['/songteksten/41311/evanescence/haunted.htm',
 'https://www.twitter.com/songtekstennl',
 '/top45',
 '/songteksten/67507/evanescence/good-enough.htm',
 '/songteksten/67698/evanescence/sweet-sacrafice.htm',
 '/songteksten/74928/evanescence/lynn.htm',
 '/songteksten/42848/evanescence/forgive-me.htm',
 '/songteksten/41308/evanescence/farther-away.htm',
 '/songteksten/45785/evanescence/4th-of-july.htm',
 '/index.php',
 '/songteksten/48713/evanescence/bleed.htm',
 '/songteksten/45370/evanescence/goodnite.htm',
 '/songteksten/41313/evanescence/i-believe-in-you.htm',
 '/songteksten/43242/evanescence/surrender.htm',
 '/songteksten/88807/evanescence/together-again.htm',
 '/songteksten/43243/evanescence/anything-for-you.htm',
 '/browse/e.htm',
 '/songteksten/41319/evanescence/wake-me-up-inside.htm',
 '/songteksten/67510/evanescence/lithium.htm',
 '/songteksten/53789/evanescence/solitude.htm',
 '/',
 '/songteksten/43906/evanescence/away-from-me.htm',
 '/songteksten/67513/evanescence/sweet-sacrifi

In [21]:
# Only keeping the hyperlinks of the lyrics

list_links_lyrics_songteksten_nl = [link for link in list_links_lyrics_songteksten_nl 
                                    if ('/evanescence/' in link) ]

len(list_links_lyrics_songteksten_nl)

112

In [22]:
list_links_lyrics_songteksten_nl

['/songteksten/41311/evanescence/haunted.htm',
 '/songteksten/67507/evanescence/good-enough.htm',
 '/songteksten/67698/evanescence/sweet-sacrafice.htm',
 '/songteksten/74928/evanescence/lynn.htm',
 '/songteksten/42848/evanescence/forgive-me.htm',
 '/songteksten/41308/evanescence/farther-away.htm',
 '/songteksten/45785/evanescence/4th-of-july.htm',
 '/songteksten/48713/evanescence/bleed.htm',
 '/songteksten/45370/evanescence/goodnite.htm',
 '/songteksten/41313/evanescence/i-believe-in-you.htm',
 '/songteksten/43242/evanescence/surrender.htm',
 '/songteksten/88807/evanescence/together-again.htm',
 '/songteksten/43243/evanescence/anything-for-you.htm',
 '/songteksten/41319/evanescence/wake-me-up-inside.htm',
 '/songteksten/67510/evanescence/lithium.htm',
 '/songteksten/53789/evanescence/solitude.htm',
 '/songteksten/43906/evanescence/away-from-me.htm',
 '/songteksten/67513/evanescence/sweet-sacrifice.htm',
 '/songteksten/44308/evanescence/breathe-no-more.htm',
 '/songteksten/50229/evanesc

A quick inspection reveals frequent misspellings (e.g. 'sweet-sacrafice' and 'goodnite').
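
Misspelled duplicates such as 'sweet-sacrafice' next to 'sweet-sacrifice' (both visible in the list above) can be flagged automatically by comparing titles pairwise. The sketch below uses `difflib.SequenceMatcher` from the standard library; the function name `similar_titles` and the 0.9 threshold are assumptions chosen for illustration, not values tuned on the real data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_titles(titles, threshold=0.9):
    """Return pairs of distinct titles whose similarity ratio reaches the threshold."""
    pairs = []
    for first, second in combinations(sorted(set(titles)), 2):
        if SequenceMatcher(None, first, second).ratio() >= threshold:
            pairs.append((first, second))
    return pairs


titles = ['sweet-sacrafice', 'sweet-sacrifice', 'haunted', 'lithium']
print(similar_titles(titles))
```

Pairs reported this way still need a manual look, since two genuinely different songs can have very similar titles.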

### Retrieving from [AZLyrics](https://www.azlyrics.com/)

Like the first website, the last one investigated has a single main page with all the lyrics.

In [23]:
url = 'https://www.azlyrics.com/e/evanescence.html'

list_links_lyrics_azlyrics = retrieve_hyperlinks(url)

# removing possible repetitions 

list_links_lyrics_azlyrics = list(set(list_links_lyrics_azlyrics))

print('\n Number of links before filtering:', len(list_links_lyrics_azlyrics))
list_links_lyrics_azlyrics


 Number of links before filtering: 128


['//www.azlyrics.com/copyright.html',
 '../lyrics/evanescence/whisper.html',
 '../lyrics/evanescence/haunteddemo.html',
 '//www.stlyrics.com',
 '//www.azlyrics.com/o.html',
 '../lyrics/evanescence/snowwhitequeen.html',
 '../lyrics/evanescence/wherewillyougo.html',
 '../lyrics/evanescence/soclose.html',
 '../lyrics/evanescence/whatyouwant.html',
 '../lyrics/evanescence/takingoverme.html',
 '../lyrics/evanescence/forgiveme.html',
 '../lyrics/evanescence/goodnight.html',
 '../lyrics/evanescence/ifyoudontmind.html',
 '../lyrics/evanescence/breathenomore.html',
 '../lyrics/evanescence/thechange.html',
 '../lyrics/evanescence/hello.html',
 '../a/amylee.html',
 '../lyrics/evanescence/heartshapedbox.html',
 '../lyrics/evanescence/losecontrol.html',
 '../lyrics/evanescence/exodus.html',
 '../lyrics/evanescence/october.html',
 '../lyrics/evanescence/untitledyoucantkillthemeinyou.html',
 '//www.azlyrics.com/z.html',
 '../lyrics/evanescence/everybodysfool.html',
 '//www.azlyrics.com',
 '../lyrics/

In this case, the filtering list comprehension uses '/lyrics/' to keep only the relevant links.

In [24]:
list_links_lyrics_azlyrics = [link for link in list_links_lyrics_azlyrics if '/lyrics/' in link ]

print('\n Number of links after updated filtering and applying set:', len(list_links_lyrics_azlyrics))


 Number of links after updated filtering and applying set: 88


In [25]:
# organizing in alphabetical order to make visualization easy
list_links_lyrics_azlyrics.sort()
list_links_lyrics_azlyrics

['../lyrics/evanescence/allthatimlivingfor.html',
 '../lyrics/evanescence/anewwaytobleed.html',
 '../lyrics/evanescence/anythingforyou.html',
 '../lyrics/evanescence/anywhere.html',
 '../lyrics/evanescence/awayfromme.html',
 '../lyrics/evanescence/beforethedawn.html',
 '../lyrics/evanescence/bleedimustbedreaming.html',
 '../lyrics/evanescence/breathenomore.html',
 '../lyrics/evanescence/bringmetolife.html',
 '../lyrics/evanescence/bringmetolifesynthesis.html',
 '../lyrics/evanescence/callmewhenyouresober.html',
 '../lyrics/evanescence/cloudnine.html',
 '../lyrics/evanescence/disappear.html',
 '../lyrics/evanescence/endofthedream.html',
 '../lyrics/evanescence/erasethis.html',
 '../lyrics/evanescence/evenindeath.html',
 '../lyrics/evanescence/evenindeath2016version.html',
 '../lyrics/evanescence/everybodysfool.html',
 '../lyrics/evanescence/exodus.html',
 '../lyrics/evanescence/fartheraway.html',
 '../lyrics/evanescence/fieldofinnocence.html',
 '../lyrics/evanescence/forevergoneforevery
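
Several of the sites return relative links ('../lyrics/...', '/songteksten/...', '//songteksten.net/...'), which cannot be passed to `requests.get` as they are. A minimal sketch of turning them into absolute addresses with `urllib.parse.urljoin` follows; the helper name `absolutize` is hypothetical.

```python
from urllib.parse import urljoin


def absolutize(links, page_url):
    """Resolve each link against the url of the page it was found on."""
    return [urljoin(page_url, link) for link in links]


page_url = 'https://www.azlyrics.com/e/evanescence.html'
relative_links = ['../lyrics/evanescence/whisper.html',
                  '//www.azlyrics.com/copyright.html']
print(absolutize(relative_links, page_url))
```

Links that are already absolute are returned unchanged by `urljoin`, so the helper can be applied to the whole list.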

# **Extract lyrics from a webpage**

Now it is time to use the hyperlinks obtained in the previous section to extract the lyrics.

From the sites inspected before, I've decided to eliminate songteksten.nl because of the misspelling issues, and to work further with songteksten.net because it seemed to be the least complicated one to extract lyrics from.

The following cells handle this task by using `BeautifulSoup` to parse the html and prettify it. After that, I use some Python string methods to get the lyrics both as a string and as a list.

In [33]:
url_lyric = 'https://songteksten.net/lyric/1938/30609/evanescence/bring-me-to-life.html'
r_lyric = requests.get(url_lyric)

In [34]:
r_lyric

<Response [200]>

In [35]:
html_doc_lyric = r_lyric.text
html_doc_lyric

'<!DOCTYPE html>\n<html lang="nl">\n<head>\n<meta charset="iso-8859-1">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<meta name="robots" content="index, follow" />\n<title>Songteksten.net - Songtekst: Evanescence - Bring Me To Life</title>\n<meta name="description" content="Bekijk de songtekst Bring Me To Life van Evanescence op Songteksten.net">\n<meta name="verify-v1" content="Z400kCtagpTtN7S0faT4/8v+PFXYrWHVMhCEOLMOQYk=" />\n<meta name="google-site-verification" content="uyniupsja-pMOHt0klVT2713XR6GflHhdoBeh5uqwgk" />\n<base href="https://songteksten.net/" />\n<link rel="stylesheet" href="/css/bootstrap.min.css" />\n<link rel="stylesheet" href="/css/template.min.css?201901" />\n<script src="/js/jquery-2.2.0.min.js"></script>\n<script src="/js/bootstrap.min.js"></script>\n<script async src="//edgecastcdn.net/000540/cdn/client/bandhosting/init.js" type="text/javascript"></script>\n<link rel="shortcut icon

In [36]:
soup_lyric = BeautifulSoup(html_doc_lyric,"lxml")
soup_lyric

<!DOCTYPE html>
<html lang="nl">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="index, follow" name="robots"/>
<title>Songteksten.net - Songtekst: Evanescence - Bring Me To Life</title>
<meta content="Bekijk de songtekst Bring Me To Life van Evanescence op Songteksten.net" name="description"/>
<meta content="Z400kCtagpTtN7S0faT4/8v+PFXYrWHVMhCEOLMOQYk=" name="verify-v1"/>
<meta content="uyniupsja-pMOHt0klVT2713XR6GflHhdoBeh5uqwgk" name="google-site-verification"/>
<base href="https://songteksten.net/"/>
<link href="/css/bootstrap.min.css" rel="stylesheet"/>
<link href="/css/template.min.css?201901" rel="stylesheet"/>
<script src="/js/jquery-2.2.0.min.js"></script>
<script src="/js/bootstrap.min.js"></script>
<script async="" src="//edgecastcdn.net/000540/cdn/client/bandhosting/init.js" type="text/javascript"></script>
<link href="/img/favicon.ico" rel="shortcut ic

In [38]:
soup_lyric_pretty = soup_lyric.prettify()
print(soup_lyric_pretty)

<!DOCTYPE html>
<html lang="nl">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="index, follow" name="robots"/>
  <title>
   Songteksten.net - Songtekst: Evanescence - Bring Me To Life
  </title>
  <meta content="Bekijk de songtekst Bring Me To Life van Evanescence op Songteksten.net" name="description"/>
  <meta content="Z400kCtagpTtN7S0faT4/8v+PFXYrWHVMhCEOLMOQYk=" name="verify-v1"/>
  <meta content="uyniupsja-pMOHt0klVT2713XR6GflHhdoBeh5uqwgk" name="google-site-verification"/>
  <base href="https://songteksten.net/"/>
  <link href="/css/bootstrap.min.css" rel="stylesheet"/>
  <link href="/css/template.min.css?201901" rel="stylesheet"/>
  <script src="/js/jquery-2.2.0.min.js">
  </script>
  <script src="/js/bootstrap.min.js">
  </script>
  <script async="" src="//edgecastcdn.net/000540/cdn/client/bandhosting/init.js" type="text/javascript">
  </script>
 

In [48]:
text = soup_lyric_pretty.split('</h1>\n')
print(text[1])

          how can you see into my eyes
          <br/>
          like open doors
          <br/>
          leading you down into my core
          <br/>
          where I've become so numb without a soul
          <br/>
          my spirit sleeping somewhere cold
          <br/>
          until you find it there and lead it back home
          <br/>
          <br/>
          (Wake me up)
          <br/>
          Wake me up inside
          <br/>
          (I can't wake up)
          <br/>
          Wake me up inside
          <br/>
          (Save me)
          <br/>
          call my name and save me from the dark
          <br/>
          (Wake me up)
          <br/>
          bid my blood to run
          <br/>
          (I can't wake up)
          <br/>
          before I come undone
          <br/>
          (Save me)
          <br/>
          save me from the nothing I've become
          <br/>
          <br/>
          now that I know what I'm without
          <br/>
          

In [37]:
# `split` is a str method, so it must be called on the prettified string,
# not on the BeautifulSoup object (which raises a TypeError)
text = soup_lyric_pretty.split('</h1>\n')[1].split('<div class="buma-consent" role="alert">')[0]
text

In [16]:
def extract_lyric_from_url(url_lyric):
    """ Extract lyrics from a songteksten.net lyric page; return them as a list of lines and as one string """
    
    # send an http request
    r_lyric = requests.get(url_lyric)
    
    # obtain the html content of the url as text
    html_doc_lyric = r_lyric.text
    
    # parse the html
    soup_lyric = BeautifulSoup(html_doc_lyric, "lxml")
    
    # prettify it (this returns a str, so string methods can be used from here on)
    soup_lyric_pretty = soup_lyric.prettify()
    
    # Isolating the part that contains the lyric
    
    text = soup_lyric_pretty.split('</h1>\n')[1].split('<div class="buma-consent" role="alert">')[0]

    # Cleaning text and building a list with it
    list_lyrics = text.split('<br/>\n')
    list_lyrics = [item.replace('\n','') for item in list_lyrics]
    list_lyrics = [item.strip() for item in list_lyrics]
    
    # removing empty elements from the list
    # (a comprehension is used because removing items while looping over the same list skips elements)
    
    list_lyrics = [item for item in list_lyrics if item != '']
            
    # this part was added after noticing that at least one lyric was not following the normal pattern
    
    if list_lyrics and '<div' in list_lyrics[0]:
        list_lyrics = list_lyrics[1:]
        
    # Having the lyrics in string format
    
    lyrics = '. '.join(list_lyrics)
    
    # returning both list and string
    
    return list_lyrics, lyrics
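
The cleaning step can be tried on its own, without any request, on a fragment laid out like the prettified output. The sketch below uses placeholder text instead of real lyrics, and `lines_from_fragment` is a hypothetical helper name. It also shows a pitfall worth knowing when dropping empty strings: removing items from a list while looping over that same list skips the element that follows each removal.

```python
def lines_from_fragment(fragment):
    """Split an HTML fragment on <br/> tags and return the non-empty, stripped lines."""
    pieces = fragment.split('<br/>')
    stripped = [piece.strip() for piece in pieces]
    return [line for line in stripped if line]


# Placeholder text arranged like the prettified lyric block
fragment = """
          first line
          <br/>
          second line
          <br/>
          <br/>
          third line
          <br/>
"""
print(lines_from_fragment(fragment))

# Pitfall: in-place removal during iteration leaves one of two consecutive blanks behind
items = ['a', '', '', 'b']
for item in items:
    if item == '':
        items.remove(item)
print(items)
```

Building a new list with a comprehension, as in `lines_from_fragment`, avoids the problem.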

## Extracting lyric from songteksten.net

In [17]:
url_lyric = list_links_lyrics_songteksten_net[1]

print(url_lyric,'\n')

https://songteksten.net/lyric/1938/30606/evanescence/breathe-no-more.html 



In [18]:
list_lyrics = extract_lyric_from_url(url_lyric)[0]
list_lyrics

["I've been looking in the mirror for so long.",
 "That I've come to believe my soul's on the other side.",
 'Oh the little pieces falling, shatter.',
 'Shards of me,',
 'Too sharp to put back together.',
 'Too small to matter,',
 'But big enough to cut me into so many little pieces.',
 'If I try to touch her,',
 'And I bleed,',
 'I bleed,',
 'And I breathe,',
 'I breathe no more.',
 'Take a breath and I try to draw from my spirits well.',
 'Yet again you refuse to drink like a stubborn child.',
 'Lie to me,',
 "Convince me that I've been sick forever.",
 'And all of this,',
 'Will make sense when I get better.',
 'I know the difference,',
 'Between myself and my reflection.',
 "I just can't help but to wonder,",
 'Which of us do you love.',
 'So I bleed,',
 'I bleed,',
 'And I breathe,',
 'I breathe now...',
 'Bleed,',
 'I bleed,',
 'And I breathe,',
 'I breathe,',
 'I breathe-',
 'I breathe no more.']

In [19]:
lyrics = extract_lyric_from_url(url_lyric)[1]
lyrics

"I've been looking in the mirror for so long.. That I've come to believe my soul's on the other side.. Oh the little pieces falling, shatter.. Shards of me,. Too sharp to put back together.. Too small to matter,. But big enough to cut me into so many little pieces.. If I try to touch her,. And I bleed,. I bleed,. And I breathe,. I breathe no more.. Take a breath and I try to draw from my spirits well.. Yet again you refuse to drink like a stubborn child.. Lie to me,. Convince me that I've been sick forever.. And all of this,. Will make sense when I get better.. I know the difference,. Between myself and my reflection.. I just can't help but to wonder,. Which of us do you love.. So I bleed,. I bleed,. And I breathe,. I breathe now.... Bleed,. I bleed,. And I breathe,. I breathe,. I breathe-. I breathe no more."

### Try another one

In [20]:
url_lyric = list_links_lyrics_songteksten_net[10]

print(url_lyric)

https://songteksten.net/lyric/1938/77224/evanescence/angel-of-mine.html


In [21]:
list_lyrics = extract_lyric_from_url(url_lyric)[0]
list_lyrics

['You are everything I need to see',
 'Smile and sunlight makes sunlight to me',
 'Laugh and come and look into me',
 'Drips of moonlight washing over me',
 'Can I show you what you are for me',
 'Angel of mine, let me I thank you',
 'You have saved me time and time again',
 'Angel, I must confess',
 "It's you that always gives me strength",
 "And I don't know where I'd be without you",
 'After all these years, one thing is true',
 'Constant force within my heart is you',
 "You touch me, I feel I'm moving into you",
 'I treasure every day I spend with you',
 'All the things I am come down to you',
 'Angel of mine',
 'Let me thank you',
 'You have saved me time and time again',
 'Angel, I must confess',
 "It's you that always gives me strength",
 "And I don't know where I'd be without you",
 'Back in the arms of my angel',
 'Back to the peace that I so love',
 'Back in the arms of my angel I can finally rest',
 'Giving you a gift that you remind me']

In [22]:
lyrics = extract_lyric_from_url(url_lyric)[1]
lyrics

"You are everything I need to see. Smile and sunlight makes sunlight to me. Laugh and come and look into me. Drips of moonlight washing over me. Can I show you what you are for me. Angel of mine, let me I thank you. You have saved me time and time again. Angel, I must confess. It's you that always gives me strength. And I don't know where I'd be without you. After all these years, one thing is true. Constant force within my heart is you. You touch me, I feel I'm moving into you. I treasure every day I spend with you. All the things I am come down to you. Angel of mine. Let me thank you. You have saved me time and time again. Angel, I must confess. It's you that always gives me strength. And I don't know where I'd be without you. Back in the arms of my angel. Back to the peace that I so love. Back in the arms of my angel I can finally rest. Giving you a gift that you remind me"

It seems to work just fine!
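
One detail visible in the joined strings above: lines that already end with punctuation get a second mark from the '. ' separator ("long.. That", "me,. Too"). If that matters for later text processing, a possible variation is to add a period only where a line has no closing punctuation. This is a sketch, not the approach used in the rest of the notebook; `join_lines` and its `terminal` parameter are hypothetical, and unlike the original join it also closes the last line with a period.

```python
def join_lines(lines, terminal='.,;:!?-'):
    """Join lyric lines with spaces, adding a period only to lines without closing punctuation."""
    sentences = []
    for line in lines:
        if not line:
            continue  # ignore empty lines
        sentences.append(line if line[-1] in terminal else line + '.')
    return ' '.join(sentences)


# Placeholder lines standing in for a lyric
sample = ['First line.', 'Second line,', 'Third line']
print(join_lines(sample))
```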

Let's run it for all songs and save all the strings in a list, so that we can later combine them with the other data we will gather about Evanescence's songs!

In [23]:
list_lyrics_evanescence = []
list_title_lyrics_evanescence = []

In [24]:
len(list_links_lyrics_songteksten_net)

86

In [25]:
# building lists with titles of lyrics and lyrics

for url_lyric in list_links_lyrics_songteksten_net:
    
    list_title_lyrics_evanescence.append(url_lyric.split('/')[-1].split('.')[-2])
    list_lyrics_evanescence.append(extract_lyric_from_url(url_lyric)[1])
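
With 86 requests in a row, a single page with an unexpected layout would stop the loop above. A more forgiving variant records failures and pauses between requests. The sketch below is only an assumption about how this could look: `collect_lyrics`, `delay_seconds` and `fake_extract` are hypothetical, and the extraction function is passed in as an argument so the example runs without network access.

```python
import time


def collect_lyrics(links, extract, delay_seconds=1.0):
    """Apply `extract` to every link; return titles, lyrics and the links that failed."""
    titles, lyrics, failed = [], [], []
    for link in links:
        try:
            _, text = extract(link)  # extract returns (list_of_lines, joined_string)
        except Exception as error:  # broad on purpose: keep going and inspect failures later
            failed.append((link, repr(error)))
            continue
        titles.append(link.split('/')[-1].rsplit('.', 1)[0])
        lyrics.append(text)
        time.sleep(delay_seconds)  # be polite to the server between requests
    return titles, lyrics, failed


def fake_extract(link):
    """Stand-in for extract_lyric_from_url, used only for this illustration."""
    if 'broken' in link:
        raise ValueError('page layout not recognised')
    return ['line'], 'line'


links = ['https://example.com/lyric/1/band/first-song.html',
         'https://example.com/lyric/1/band/broken-page.html']
titles, lyrics, failed = collect_lyrics(links, fake_extract, delay_seconds=0)
print(titles, lyrics, len(failed))
```

In the notebook, `extract_lyric_from_url` would be passed instead of `fake_extract`.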


Just verifying that everything is as expected.

In [26]:
len(list_title_lyrics_evanescence)

86

In [27]:
len(list_lyrics_evanescence)

86

Now we create a Pandas data frame with song title and lyrics that will be saved in .csv format for later use.

In [28]:
# Creating a dataframe with song titles and lyrics

import pandas as pd
df = pd.DataFrame({'song_title': list_title_lyrics_evanescence,
                  'lyrics': list_lyrics_evanescence})

In [29]:
df.head()

| | song_title | lyrics |
|---|---|---|
| 0 | wake-me-up-inside-bring-me-to-life | How can you see into my eyes. Like open doors.... |
| 1 | breathe-no-more | I've been looking in the mirror for so long.. ... |
| 2 | going-under | now I will tell you what I've done for you. 50... |
| 3 | bleed | How can I pretend that I don't see. What you h... |
| 4 | the-only-one | When they all come crashing down- midflight. y... |


Before saving this data frame in .csv format for future use, let's replace '-' with a space in the song titles and make all titles lowercase by applying '.lower()'.

Keeping the format of the song titles uniform makes our life easier when we want to add more information about those songs, for instance metadata such as album title and recording year.

In [30]:
df['song_title'] = df['song_title'].apply(lambda x: x.replace('-',' ').lower())

In [31]:
df.head()

| | song_title | lyrics |
|---|---|---|
| 0 | wake me up inside bring me to life | How can you see into my eyes. Like open doors.... |
| 1 | breathe no more | I've been looking in the mirror for so long.. ... |
| 2 | going under | now I will tell you what I've done for you. 50... |
| 3 | bleed | How can I pretend that I don't see. What you h... |
| 4 | the only one | When they all come crashing down- midflight. y... |
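
The same normalization is useful on the metadata side, so that both tables share an identical key before merging. Below is a plain-Python sketch of the idea; `normalize_title` is a hypothetical helper, and the lyric text and album value are placeholders, not real data.

```python
def normalize_title(title):
    """Lowercase a title, turn '-' into spaces and collapse repeated whitespace."""
    return ' '.join(title.replace('-', ' ').lower().split())


# Placeholder records keyed by normalized title
scraped = {normalize_title('Bring-Me-To-Life'): 'lyrics text'}
metadata = {normalize_title('bring me  to life'): {'album': 'ALBUM PLACEHOLDER'}}

# Merge the records whose normalized titles match
merged = {title: {'lyrics': scraped[title], **metadata[title]}
          for title in scraped.keys() & metadata.keys()}
print(merged)
```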


In [32]:
# saving dataframe to .csv

df.to_csv("./data/lyrics_evanescence.csv", index = False)

In [33]:
# testing saved .csv

df2 = pd.read_csv("./data/lyrics_evanescence.csv")

df2.head()

| | song_title | lyrics |
|---|---|---|
| 0 | wake me up inside bring me to life | How can you see into my eyes. Like open doors.... |
| 1 | breathe no more | I've been looking in the mirror for so long.. ... |
| 2 | going under | now I will tell you what I've done for you. 50... |
| 3 | bleed | How can I pretend that I don't see. What you h... |
| 4 | the only one | When they all come crashing down- midflight. y... |


In [34]:
df2.shape

(86, 2)

In [35]:
del df2
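
Reading the file back, as done above, is a cheap way to confirm that commas and quotes inside the lyrics survive the CSV round trip. The sketch below shows the same check with the standard library's `csv` module and an in-memory buffer instead of pandas and a file on disk; the rows are placeholder text, not scraped lyrics.

```python
import csv
import io

# Placeholder rows containing the characters that usually cause trouble in CSV files
rows = [
    {'song_title': 'first song', 'lyrics': 'Line one, with a comma. Line "two" with quotes.'},
    {'song_title': 'second song', 'lyrics': "It's a line with an apostrophe."},
]

# Write to an in-memory buffer
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['song_title', 'lyrics'])
writer.writeheader()
writer.writerows(rows)

# Read back and compare with the original rows
buffer.seek(0)
restored = [dict(row) for row in csv.DictReader(buffer)]
print(restored == rows)
```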

# **Applying the same to Within Temptation**

Previously, we have tried 4 different lyrics' websites to retrieve Evanescence's lyrics and we have chosen songteksten.net to extract those lyrics. 

Some reasons why I chose `songteksten.net`:

1. When filtering we used `songteksten.net/lyric/number_of_the_artist`, and this can be extracted from the url, which makes it easy to generalize the code to both bands or even to other ones.

2. Song titles consist of words separated by '-', which can easily be converted into a form that facilitates merging with additional data, as we've already planned.

Summarizing, the steps to retrieve lyrics are:

1. Apply the function to retrieve hyperlinks
2. Filter out hyperlinks that do not point to lyrics
3. Use each remaining hyperlink to extract the song title (from the URL) and the lyrics (with `extract_lyric_from_url`)

These steps will now be applied to retrieve the lyrics of Within Temptation.
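
The three steps could be wrapped in a single function, roughly as sketched here. The function `scrape_lyrics` is hypothetical. It receives the link-retrieval and lyric-extraction functions as arguments, and the two `fake_...` functions are stand-ins that return made-up values so the sketch runs without network access. In this notebook, `retrieve_hyperlinks` and `extract_lyric_from_url` would be passed instead.

```python
def scrape_lyrics(page_urls, link_filter, retrieve_links, extract_lyric):
    """Run steps 1-3 and return a list of (title, lyrics) tuples."""
    # Step 1: gather hyperlinks from every page, dropping duplicates
    links = set()
    for url in page_urls:
        links.update(retrieve_links(url))

    # Step 2: keep only hyperlinks that point to lyrics of the artist
    lyric_links = sorted(link for link in links if link_filter in link)

    # Step 3: take the title from the URL and the lyrics string from the page
    rows = []
    for link in lyric_links:
        title = link.split('/')[-1].split('.')[-2]
        rows.append((title, extract_lyric(link)[1]))
    return rows


# Stand-in for retrieve_hyperlinks: returns the same two links for any page
def fake_retrieve_links(url):
    return ['https://songteksten.net/artists/u.html',
            'https://songteksten.net/lyric/320/89805/within-temptation/fire-and-ice.html']


# Stand-in for extract_lyric_from_url: returns (list of lines, joined string)
def fake_extract_lyric(url):
    return (['line one', 'line two'], 'line one. line two')


rows = scrape_lyrics(['page-1', 'page-2'], 'songteksten.net/lyric/320',
                     fake_retrieve_links, fake_extract_lyric)
```

Sorting the filtered links is a choice made for this sketch only: it gives a reproducible order, whereas `list(set(...))` does not guarantee one.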


## STEP 1 - Retrieving hyperlinks from songteksten.net

In [36]:
# retrieving all hyperlinks
urls = ['https://songteksten.net/artist/lyrics/320/within-temptation.html',
       'https://songteksten.net/artist/lyrics/320/within-temptation/page/2.html',
       'https://songteksten.net/artist/lyrics/320/within-temptation/page/3.html']

list_links_lyrics_songteksten_net = []

for url in urls:
    list_links_lyrics_songteksten_net.extend(retrieve_hyperlinks(url))
    
# removing possible duplicates
list_links_lyrics_songteksten_net = list(set(list_links_lyrics_songteksten_net))

    
print('Number of links before filtering:', len(list_links_lyrics_songteksten_net))

Number of links before filtering: 220


In [37]:
list_links_lyrics_songteksten_net[:10]

['https://songteksten.net/artists/u.html',
 'https://songteksten.net/artists/7.html',
 'https://songteksten.net/lyric/320/24061/within-temptation/world-of-make-believe.html',
 'https://songteksten.net/artists/j.html',
 'https://songteksten.net/lyric/2294/46260/leonard-cohen/hallelujah.html',
 'https://songteksten.net/lyric/320/89805/within-temptation/fire-and-ice.html',
 'https://songteksten.net/artists/x.html',
 'https://songteksten.net/lyric/9173/102289/tones-and-i/dance-monkey.html',
 'https://songteksten.net/adverteren.html',
 '//songteksten.net/albums/add.html']

## STEP 2 - Keeping only hyperlinks of lyrics

In [38]:
# filtering hyperlinks which contain lyrics - specific for songteksten.net

# using the url address to build the filter 'songteksten.net/lyric/<artist number>'

splitting = urls[0].split('/')
filter_lyrics = splitting[2] + '/lyric/' + splitting[-2]

list_links_lyrics_songteksten_net = [link for link in list_links_lyrics_songteksten_net
                                     if filter_lyrics in link]

print('Number of links after filtering:', len(list_links_lyrics_songteksten_net))

Number of links after filtering: 74


In [39]:
url_lyric = list_links_lyrics_songteksten_net[1]

print(url_lyric)

https://songteksten.net/lyric/320/89805/within-temptation/fire-and-ice.html


## STEP 3 - Extracting song titles and lyrics from each lyric hyperlink

In [40]:
# lyrics in form of list
list_lyrics = extract_lyric_from_url(url_lyric)[0]
list_lyrics

["Every word you're saying is a lie",
 'Run away my dear',
 'But every sign will say',
 'Your heart is dead',
 'Bury all the memories',
 'Cover them with dirt',
 "Where's the love we once had",
 "Our destiny's unsure",
 "Why can't you see what we had",
 'Let the fire burn the ice',
 "Where's the love we once had",
 'Is it all a lie',
 'And I still wonder',
 'Why heaven has died',
 'The skies are all falling',
 "I'm breathing but why",
 'In silence I hold on',
 'To you and I',
 'Closer to insantiy',
 'Buries me alive',
 "Where's the life we once had",
 'It cannot be denied',
 "Why can't you see what we had",
 'Let the fire burn the ice',
 "Where's the love we once had",
 'Is it all a lie',
 'And I still wonder',
 'Why heaven has died',
 'The skies are all falling',
 "I'm breathing but why",
 'In silence I hold on',
 'To you and I',
 'You run away, you hide away',
 'To the other side of the universe',
 "Where you're safe from all that hunts you down",
 'But the world has gone, where you 

In [41]:
# lyrics in form of string

lyrics = extract_lyric_from_url(url_lyric)[1]
lyrics

"Every word you're saying is a lie. Run away my dear. But every sign will say. Your heart is dead. Bury all the memories. Cover them with dirt. Where's the love we once had. Our destiny's unsure. Why can't you see what we had. Let the fire burn the ice. Where's the love we once had. Is it all a lie. And I still wonder. Why heaven has died. The skies are all falling. I'm breathing but why. In silence I hold on. To you and I. Closer to insantiy. Buries me alive. Where's the life we once had. It cannot be denied. Why can't you see what we had. Let the fire burn the ice. Where's the love we once had. Is it all a lie. And I still wonder. Why heaven has died. The skies are all falling. I'm breathing but why. In silence I hold on. To you and I. You run away, you hide away. To the other side of the universe. Where you're safe from all that hunts you down. But the world has gone, where you belong. And it feels too late so you're moving on. Can you find your way back home. And I still wonder. Wh

## Saving song titles and lyrics in a .csv file

Just like before, we save the result in a .csv file to be used in our further analysis.

In [42]:
# building lists with titles of lyrics and lyrics
list_title_lyrics_within_temptation = []
list_lyrics_within_temptation = []

for url_lyric in list_links_lyrics_songteksten_net:
    
    list_title_lyrics_within_temptation.append(url_lyric.split('/')[-1].split('.')[-2])
    list_lyrics_within_temptation.append(extract_lyric_from_url(url_lyric)[1])


In [43]:
len(list_title_lyrics_within_temptation)

74

In [44]:
len(list_lyrics_within_temptation)

74

In [45]:
df = pd.DataFrame({'song_title': list_title_lyrics_within_temptation,
                  'lyrics': list_lyrics_within_temptation})

In [46]:
# Here we also replace '-' with a space in the song titles and make them lowercase

df['song_title'] = df['song_title'].apply(lambda x: x.replace('-',' ').lower())

In [47]:
df.head()

Unnamed: 0,song_title,lyrics
0,world of make believe,On golden wings. She flies at night. With her ...
1,fire and ice,Every word you're saying is a lie. Run away my...
2,grace,Cold are the bones of thy soldiers. Longing fo...
3,overcome,Where are the heroes. I my time of need. Is my...
4,somewhere,Lost in the darkness. Hoping for a sign. Inste...


In [48]:
# saving dataframe to .csv

df.to_csv("./data/lyrics_within_temptation.csv", index = False)

In [49]:
df2 = pd.read_csv("./data/lyrics_within_temptation.csv")

In [50]:
df2.head()

Unnamed: 0,song_title,lyrics
0,world of make believe,On golden wings. She flies at night. With her ...
1,fire and ice,Every word you're saying is a lie. Run away my...
2,grace,Cold are the bones of thy soldiers. Longing fo...
3,overcome,Where are the heroes. I my time of need. Is my...
4,somewhere,Lost in the darkness. Hoping for a sign. Inste...


In [51]:
df2.shape

(74, 2)

In [52]:
del df2