In this first notebook I'll apply web scraping to obtain lyrics of Evanescence and Within Temptation from the web. 

First, I looked around the web for lyrics sites that carry lyrics of both bands. My goal is then to use the same code to extract lyrics for both bands.

Some of the websites I've explored are:

- http://www.metrolyrics.com
- https://www.azlyrics.com/
- https://www.songteksten.nl/
- https://www.songteksten.net/

To extract hyperlinks I've used the packages: 

* [requests](https://pypi.org/project/requests/) 
* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

You will notice that some simple [Python string methods](https://www.w3schools.com/python/python_strings.asp) are also used to retrieve the lyrics. So let's start web scraping.

P.S.: The data retrieved depends on when you run the notebook, since the websites are updated over time.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-Packages" data-toc-modified-id="Load-Packages-1"><span class="toc-item-num">1&nbsp;&nbsp;</span><strong>Load Packages</strong></a></span></li><li><span><a href="#Webscraping" data-toc-modified-id="Webscraping-2"><span class="toc-item-num">2&nbsp;&nbsp;</span><strong>Webscraping</strong></a></span><ul class="toc-item"><li><span><a href="#Retrieving-hyperlinks-of-lyrics" data-toc-modified-id="Retrieving-hyperlinks-of-lyrics-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Retrieving hyperlinks of lyrics</a></span><ul class="toc-item"><li><span><a href="#Retrieving-from-metrolyrics" data-toc-modified-id="Retrieving-from-metrolyrics-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Retrieving from <a href="https://www.metrolyrics.com/" target="_blank">metrolyrics</a></a></span></li><li><span><a href="#Retrieving-from-songteksten.net" data-toc-modified-id="Retrieving-from-songteksten.net-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Retrieving from <a href="https://songteksten.net/" target="_blank">songteksten.net</a></a></span></li><li><span><a href="#Retrieving-from-songteksten.nl" data-toc-modified-id="Retrieving-from-songteksten.nl-2.1.3"><span class="toc-item-num">2.1.3&nbsp;&nbsp;</span>Retrieving from <a href="https://www.songteksten.nl/" target="_blank">songteksten.nl</a></a></span></li><li><span><a href="#Retrieving-from-AZLyrics" data-toc-modified-id="Retrieving-from-AZLyrics-2.1.4"><span class="toc-item-num">2.1.4&nbsp;&nbsp;</span>Retrieving from <a href="https://www.azlyrics.com/" target="_blank">AZLyrics</a></a></span></li></ul></li></ul></li><li><span><a href="#Extracting-lyrics-from-a-webpage" data-toc-modified-id="Extracting-lyrics-from-a-webpage-3"><span class="toc-item-num">3&nbsp;&nbsp;</span><strong>Extracting lyrics from a webpage</strong></a></span><ul class="toc-item"><li><span><a href="#Extracting-lyric-from-songteksten.net" 
data-toc-modified-id="Extracting-lyric-from-songteksten.net-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Extracting lyric from songteksten.net</a></span><ul class="toc-item"><li><span><a href="#Try-another-one" data-toc-modified-id="Try-another-one-3.1.1"><span class="toc-item-num">3.1.1&nbsp;&nbsp;</span>Try another one</a></span></li></ul></li></ul></li><li><span><a href="#Applying-the-same-to-Within-Temptation" data-toc-modified-id="Applying-the-same-to-Within-Temptation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span><strong>Applying the same to Within Temptation</strong></a></span><ul class="toc-item"><li><span><a href="#STEP-1---Retrieving-hyperlinks-from-songteksten.net" data-toc-modified-id="STEP-1---Retrieving-hyperlinks-from-songteksten.net-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>STEP 1 - Retrieving hyperlinks from songteksten.net</a></span></li><li><span><a href="#STEP-2---Keeping-only-hyperlinks-of-lyrics" data-toc-modified-id="STEP-2---Keeping-only-hyperlinks-of-lyrics-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>STEP 2 - Keeping only hyperlinks of lyrics</a></span></li><li><span><a href="#STEP-3---Extracting-song-titles-and-lyrics-from-lyric's-hyperlink" data-toc-modified-id="STEP-3---Extracting-song-titles-and-lyrics-from-lyric's-hyperlink-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>STEP 3 - Extracting song titles and lyrics from lyric's hyperlink</a></span></li><li><span><a href="#Saving-song's-titles-and-lyrics-in-a-.csv-file" data-toc-modified-id="Saving-song's-titles-and-lyrics-in-a-.csv-file-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Saving song's titles and lyrics in a .csv file</a></span></li></ul></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-5"><span class="toc-item-num">5&nbsp;&nbsp;</span><strong>Conclusions</strong></a></span></li></ul></div>

# **Load Packages**

In [1]:
# importing packages

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

TodaysDate = time.strftime("%Y-%m-%d")

pd.options.display.max_rows = 999


# **Webscraping**

The first step is to extract the hyperlinks to each band's lyrics. The webpages chosen usually list song titles, where each title is a hyperlink to the corresponding lyrics. This way, once we have all hyperlinks, we can use them to extract each of the lyrics.

The following function retrieves all hyperlinks within the main page. However, each website considered has its own particularities, which must be taken into account when selecting the hyperlinks of lyrics.


In [2]:
def retrieve_hyperlinks(main_url):
    """ 
    Extract all hyperlinks in 'main_url' and return a list with these hyperlinks 
    """
    
    # Send request and catch response: r

    r = requests.get(main_url)

    # Extracts response as html: html_doc
    html_doc = r.text

    # Create a BeautifulSoup object from the HTML: soup
    soup = BeautifulSoup(html_doc,"lxml")
    
    # Find all 'a' tags (which define hyperlinks): a_tags

    a_tags = soup.find_all('a')
    
    # Create a list with hyperlinks found

    list_links = [link.get('href') for link in a_tags]
    
    # Remove None values (tags without an href), if there are any
    
    list_links = list(filter(None, list_links)) 
    
    return list_links
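
To see what `retrieve_hyperlinks` returns without touching the network, here is a minimal offline sketch. It is a stand-in and not the notebook's method: it uses the standard library's `html.parser` instead of BeautifulSoup, and both the HTML snippet and the class name `HrefCollector` are made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip <a> tags without an href, like filter(None, ...)
                self.links.append(href)


# Made-up page: one lyric link, one anchor without href, one news link
sample_html = """
<a href="/lyric/1/band/song-one.html">Song one</a>
<a name="top">No link here</a>
<a href="/news/42/some-news.html">News</a>
"""

collector = HrefCollector()
collector.feed(sample_html)
sample_links = collector.links
print(sample_links)
```

As in the function above, the result still mixes lyric links with unrelated ones, which is why a site-specific filter is needed afterwards.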

## Retrieving hyperlinks of lyrics

Once we have all hyperlinks, we need to select those that lead to lyrics. For now, we are investigating different lyrics websites, and each one has a slightly different structure that must be taken into account when filtering the hyperlinks.

By working with different websites, we can choose the one we consider most appropriate for extracting the lyrics.

Notice that initially we only consider `Evanescence` lyrics. However, as said, we have already checked that all websites considered also contain lyrics from `Within Temptation`. Therefore, once we have our solution, it will be applicable to both bands.

### Retrieving from [metrolyrics](https://www.metrolyrics.com/)

To retrieve the lyrics of Evanescence from `metrolyrics`, we first apply the above function using `main_url` = 'http://www.metrolyrics.com/evanescence-lyrics.html'.

As expected, not all hyperlinks are related to lyrics, so we need to filter the list and keep only the ones that point to lyrics.


In [3]:
url = 'http://www.metrolyrics.com/evanescence-lyrics.html'
list_links_lyrics_metrolyrics = retrieve_hyperlinks(url)

# remove possible duplicates

list_links_lyrics_metrolyrics = list(set(list_links_lyrics_metrolyrics))

print('\n Number of links before filtering:', len(list_links_lyrics_metrolyrics))
list_links_lyrics_metrolyrics[:20]


 Number of links before filtering: 110


['https://www.metrolyrics.com/lost-in-paradise-lyrics-evanescence.html',
 'https://www.metrolyrics.com/gcf-in-saipan-lyrics-bts.html#/correction',
 'https://www.metrolyrics.com/weight-of-the-world-lyrics-evanescence.html',
 'https://www.metrolyrics.com/evanescence-news.html',
 'https://www.metrolyrics.com/missing-lyrics-evanescence.html',
 'https://www.metrolyrics.com/if-you-dont-mind-lyrics-evanescence.html',
 'https://www.metrolyrics.com/never-was-never-will-be-everybodys-fool-lyrics-evanescence.html',
 'https://www.metrolyrics.com/going-under-lyrics-evanescence.html',
 'https://www.metrolyrics.com/my-immortal-lyrics-evanescence.html',
 'https://www.metrolyrics.com/everybodys-fool-lyrics-evanescence.html',
 'https://www.metrolyrics.com/bleed-lyrics-evanescence.html',
 'https://www.metrolyrics.com/lies-lyrics-evanescence.html',
 'https://www.metrolyrics.com/the-last-song-im-wasting-on-you-lyrics-evanescence.html',
 'https://www.metrolyrics.com/even-in-death-lyrics-evanescence.html',
 

A quick look at the list above reveals that links to lyrics are of the form 'https://www.metrolyrics.com/TITLE-lyrics-evanescence.html'. So we can select the elements of the list that contain '-lyrics-evanescence.html'.

The list comprehension below does the job and returns 77 hyperlinks instead of the initial 110.

In [4]:
list_links_lyrics_metrolyrics = [link for link in list_links_lyrics_metrolyrics if '-lyrics-evanescence.html' in link]
print('\n Number of links after filtering:', len(list_links_lyrics_metrolyrics))
list_links_lyrics_metrolyrics



 Number of links after filtering: 77


['https://www.metrolyrics.com/lost-in-paradise-lyrics-evanescence.html',
 'https://www.metrolyrics.com/weight-of-the-world-lyrics-evanescence.html',
 'https://www.metrolyrics.com/missing-lyrics-evanescence.html',
 'https://www.metrolyrics.com/if-you-dont-mind-lyrics-evanescence.html',
 'https://www.metrolyrics.com/never-was-never-will-be-everybodys-fool-lyrics-evanescence.html',
 'https://www.metrolyrics.com/going-under-lyrics-evanescence.html',
 'https://www.metrolyrics.com/my-immortal-lyrics-evanescence.html',
 'https://www.metrolyrics.com/everybodys-fool-lyrics-evanescence.html',
 'https://www.metrolyrics.com/bleed-lyrics-evanescence.html',
 'https://www.metrolyrics.com/lies-lyrics-evanescence.html',
 'https://www.metrolyrics.com/the-last-song-im-wasting-on-you-lyrics-evanescence.html',
 'https://www.metrolyrics.com/even-in-death-lyrics-evanescence.html',
 'https://www.metrolyrics.com/hells-angels-lyrics-evanescence.html',
 'https://www.metrolyrics.com/bring-me-to-life-lyrics-evanes

After filtering, we see that some links end with '/correction', so we update the list comprehension above to eliminate them.

In [5]:
# updated version

list_links_lyrics_metrolyrics = [link for link in list_links_lyrics_metrolyrics if 
                                 ('-lyrics-evanescence.html' in link and '/correction' not in link)]

print('\n Number of links after updated filtering:', len(list_links_lyrics_metrolyrics))

list_links_lyrics_metrolyrics.sort()
list_links_lyrics_metrolyrics


 Number of links after updated filtering: 76


['https://www.metrolyrics.com/all-that-i-am-living-for-lyrics-evanescence.html',
 'https://www.metrolyrics.com/angel-of-mine-lyrics-evanescence.html',
 'https://www.metrolyrics.com/anywhere-lyrics-evanescence.html',
 'https://www.metrolyrics.com/away-from-me-lyrics-evanescence.html',
 'https://www.metrolyrics.com/before-the-dawn-lyrics-evanescence.html',
 'https://www.metrolyrics.com/bleed-lyrics-evanescence.html',
 'https://www.metrolyrics.com/breathe-no-more-lyrics-evanescence.html',
 'https://www.metrolyrics.com/bring-me-to-life-lyrics-evanescence.html',
 'https://www.metrolyrics.com/broken-lyrics-evanescence.html',
 'https://www.metrolyrics.com/call-me-when-your-sober-lyrics-evanescence.html',
 'https://www.metrolyrics.com/call-me-when-youre-sober-lyrics-evanescence.html',
 'https://www.metrolyrics.com/cartoon-network-song-lyrics-evanescence.html',
 'https://www.metrolyrics.com/end-of-the-dream-lyrics-evanescence.html',
 'https://www.metrolyrics.com/erase-this-lyrics-evanescence.ht

So finally, for metrolyrics we have obtained 76 hyperlinks of Evanescence's lyrics. 
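
The two filtering steps used for metrolyrics can be folded into one reusable helper. The sketch below is only one possible way to generalize them: the function name `filter_lyric_links` and its `band_slug` parameter are hypothetical, and the last sample link is invented to show the '/correction' case. Using `str.endswith` keeps only links that end with the lyric suffix, so the '#/correction' variants drop out as a side effect.

```python
def filter_lyric_links(links, band_slug):
    """Keep unique links ending in '-lyrics-<band_slug>.html', sorted (hypothetical helper)."""
    suffix = f"-lyrics-{band_slug}.html"
    return sorted({link for link in links if link.endswith(suffix)})


sample = [
    'https://www.metrolyrics.com/missing-lyrics-evanescence.html',
    'https://www.metrolyrics.com/evanescence-news.html',
    'https://www.metrolyrics.com/gcf-in-saipan-lyrics-bts.html#/correction',
    # invented link, only to illustrate a '/correction' variant of a lyric page
    'https://www.metrolyrics.com/missing-lyrics-evanescence.html#/correction',
]

filtered = filter_lyric_links(sample, "evanescence")
print(filtered)
```

For another band only `band_slug` would change, assuming the site uses the same URL pattern for that band.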

Let's check the other websites.

### Retrieving from [songteksten.net](https://songteksten.net/)

For this band, songteksten.net spreads the links to lyrics over 3 pages, so we need to apply the function `retrieve_hyperlinks` to all 3 pages. To do so, we put the three URLs in a list and use a for-loop to retrieve all hyperlinks from these pages. After that, we apply a filter that keeps only the hyperlinks of lyrics.

Before filtering we have 233 links; after filtering, 86. Observe that although there is apparently no repetition, there are different versions of the same song (e.g. lies and lies-remix).

In [6]:
# retrieving all hyperlinks
urls = ['https://songteksten.net/artist/lyrics/1938/evanescence.html',
       'https://songteksten.net/artist/lyrics/1938/evanescence/page/2.html',
       'https://songteksten.net/artist/lyrics/1938/evanescence/page/3.html']

list_links_lyrics_songteksten_net = []

for url in urls:
    list_links_lyrics_songteksten_net.extend(retrieve_hyperlinks(url))
    
# remove possible duplicates

list_links_lyrics_songteksten_net = list(set(list_links_lyrics_songteksten_net))

    
print('Number of links before filtering:', len(list_links_lyrics_songteksten_net))

Number of links before filtering: 233


In [7]:
list_links_lyrics_songteksten_net

['https://songteksten.net/lyric/3648/96617/john-legend/all-of-me.html',
 'https://songteksten.net/news/106417/anouk-gaat-in-zelfquarantaine-om-lockdownversoepeling.html',
 'https://songteksten.net/artists/o.html',
 'https://songteksten.net/news/106410/bners-demonstreren-online-voor-culturele-sector.html',
 'https://songteksten.net/albums/album/447/2312/george-michael/older.html',
 'https://songteksten.net/artist/lyrics/1938/evanescence.html',
 'https://forum.songteksten.net/index.php?topic=3564.msg86895;topicseen#new',
 'https://songteksten.net/lyric/1938/32041/evanescence/listen-to-the-rain.html',
 'https://songteksten.net/lyric/1938/48350/evanescence/zero.html',
 'https://songteksten.net/news/106412/robert-kijkt-terug-op-bizar-seizoen-all-you-need-is-love.html',
 'https://songteksten.net/lyric/1938/32518/evanescence/i-must-be-dreaming.html',
 'https://songteksten.net/news/106411/actrice-en-presentatrice-martine-bijl-een-jaar-dood.html',
 'https://songteksten.net/lyric/1938/92309/evan

In this case, to obtain the lyrics we need to keep the links that contain 'songteksten.net/lyric/1938'. We build this filter string from parts of the URL itself: the domain and the artist number.

In [8]:
# using url address to filter lyrics

spliting = urls[0].split('/')
filter_lyrics = spliting[2]+'/lyric/'+spliting[-2]

list_links_lyrics_songteksten_net = [link for link in list_links_lyrics_songteksten_net if (filter_lyrics 
                                                                              in link) ]

print('Number of links after filtering:', len(list_links_lyrics_songteksten_net))


Number of links after filtering: 86
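
The filter string can also be derived with `urllib.parse` instead of splitting the whole URL on '/'. This is a sketch under the assumption that the artist number is always the third path segment of a songteksten.net artist URL; the function name `build_lyric_filter` is hypothetical. One motivation: indexing the split URL with `[-2]` gives 'page' rather than '1938' for the page-2 and page-3 URLs, whereas a fixed position in the path works for all three.

```python
from urllib.parse import urlparse


def build_lyric_filter(artist_url):
    """Return '<domain>/lyric/<artist_id>' for an artist URL (hypothetical helper)."""
    parsed = urlparse(artist_url)
    # path looks like '/artist/lyrics/1938/evanescence.html'
    segments = parsed.path.strip('/').split('/')
    artist_id = segments[2]  # assumption: third segment is the artist number
    return f"{parsed.netloc}/lyric/{artist_id}"


first_page = 'https://songteksten.net/artist/lyrics/1938/evanescence.html'
second_page = 'https://songteksten.net/artist/lyrics/1938/evanescence/page/2.html'

filter_first = build_lyric_filter(first_page)
filter_second = build_lyric_filter(second_page)
print(filter_first, filter_second)
```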


In [9]:
#Extracting the titles of the song and rearranging in alphabetical order for quick inspection

list_titles = [link.split('/')[-1].split('.')[-2] for link in list_links_lyrics_songteksten_net ]
list_titles.sort()
list_titles

['4th-of-july',
 'all-that-im-living-for',
 'angel-of-mine',
 'anything-for-you',
 'anywhere',
 'away-from-me',
 'before-the-dawn',
 'bleed',
 'breathe-no-more',
 'bring-me-to-life',
 'broken',
 'call-me-when-youre-sober',
 'cloud-nine',
 'disappear',
 'end-of-the-dream',
 'erase-this',
 'even-in-death',
 'everybodys-fool',
 'exodus',
 'fall-into-you',
 'farther-away',
 'fields-of-innocence',
 'forever-gone-forever-you',
 'forgive-me',
 'give-unto-me',
 'going-under',
 'good-enough',
 'goodnight',
 'haunted',
 'haunting-you',
 'heart-shaped-box',
 'hello',
 'i-believe-in-you',
 'i-must-be-dreaming',
 'imaginary',
 'imperfection',
 'lacrymosa',
 'lies',
 'lies-remix',
 'like-you',
 'listen-to-the-rain',
 'lithium',
 'lose-control',
 'lost-in-paradise',
 'made-of-stone',
 'missing',
 'must-be-dreaming',
 'my-cartoon-network',
 'my-heart-is-broken',
 'my-immortal',
 'my-last-breath',
 'never-go-back',
 'new-way-to-bleed',
 'october',
 'restless',
 'say-you-will',
 'secret-door',
 'sick',


### Retrieving from [songteksten.nl](https://www.songteksten.nl/)

Similarly to songteksten.net, we have multiple pages with hyperlinks.

In [10]:
# retrieving all hyperlinks
urls = ['https://www.songteksten.nl/artiest/4713/evanescence.htm',
       'https://www.songteksten.nl/artiest/4713/2/evanescence.htm']


list_links_lyrics_songteksten_nl = []

for url in urls:
    list_links_lyrics_songteksten_nl.extend(retrieve_hyperlinks(url))
    
# removing possible duplicates
list_links_lyrics_songteksten_nl = list(set(list_links_lyrics_songteksten_nl))
    
print('Number of links before filtering:', len(list_links_lyrics_songteksten_nl))

Number of links before filtering: 128


In [11]:
list_links_lyrics_songteksten_nl

['/songteksten/41310/evanescence/going-under.htm',
 '/songteksten/42848/evanescence/forgive-me.htm',
 '/songteksten/174663/evanescence/demise.htm',
 '/songteksten/67508/evanescence/lacrymosa.htm',
 '/songteksten/795424/evanescence/sick.htm',
 '/songteksten/43906/evanescence/away-from-me.htm',
 '/songteksten/67507/evanescence/good-enough.htm',
 '/songteksten/174664/evanescence/ascension-of-the-spirit.htm',
 '/songteksten/67513/evanescence/sweet-sacrifice.htm',
 '/songteksten/49240/evanescence/forever-gone-forever-you.htm',
 '/songteksten/360750/evanescence/made-of-stone-.htm',
 '/',
 '/songteksten/41812/evanescence/even-in-death.htm',
 '/index.php',
 '/songteksten/67512/evanescence/snow-white-queen.htm',
 '/songteksten/174658/evanescence/goodnight.htm',
 '/songteksten/360416/evanescence/what-you-want.htm',
 '/songteksten/41313/evanescence/i-believe-in-you.htm',
 '/songteksten/45787/evanescence/zero.htm',
 '/songteksten/74927/evanescence/jealous.htm',
 'https://www.twitter.com/songtekste

In [12]:
# Only keeping the hyperlinks of the lyrics

list_links_lyrics_songteksten_nl = [link for link in list_links_lyrics_songteksten_nl 
                                    if ('/evanescence/' in link) ]

len(list_links_lyrics_songteksten_nl)

112

In [13]:
list_links_lyrics_songteksten_nl

['/songteksten/41310/evanescence/going-under.htm',
 '/songteksten/42848/evanescence/forgive-me.htm',
 '/songteksten/174663/evanescence/demise.htm',
 '/songteksten/67508/evanescence/lacrymosa.htm',
 '/songteksten/795424/evanescence/sick.htm',
 '/songteksten/43906/evanescence/away-from-me.htm',
 '/songteksten/67507/evanescence/good-enough.htm',
 '/songteksten/174664/evanescence/ascension-of-the-spirit.htm',
 '/songteksten/67513/evanescence/sweet-sacrifice.htm',
 '/songteksten/49240/evanescence/forever-gone-forever-you.htm',
 '/songteksten/360750/evanescence/made-of-stone-.htm',
 '/songteksten/41812/evanescence/even-in-death.htm',
 '/songteksten/67512/evanescence/snow-white-queen.htm',
 '/songteksten/174658/evanescence/goodnight.htm',
 '/songteksten/360416/evanescence/what-you-want.htm',
 '/songteksten/41313/evanescence/i-believe-in-you.htm',
 '/songteksten/45787/evanescence/zero.htm',
 '/songteksten/74927/evanescence/jealous.htm',
 '/songteksten/795426/evanescence/the-other-side.htm',
 '

In [14]:
#Extracting the titles of the song and rearranging in alphabetical order for quick inspection

list_titles = [link.split('/')[-1].split('.')[-2] for link in list_links_lyrics_songteksten_nl]
list_titles.sort()
list_titles

['4th-of-july',
 'all-that-i-am-living-for',
 'all-that-i-m-living-for',
 'angel-of-mine',
 'anything-for-you',
 'anywhere',
 'ascension-of-the-spirit',
 'away-from-me',
 'before-the-dawn',
 'bleed',
 'breathe-no-more',
 'bring-me-to-life',
 'bring-me-to-life--synthesis-',
 'broken',
 'call-me-when-you-re-sober',
 'can-t-wash-it-all-away',
 'cartoon-network',
 'cloud-nine',
 'demise',
 'disappear',
 'do-what-you-want',
 'erase-this',
 'eternal',
 'even-in-death',
 'everybody-s-fool',
 'exodus',
 'exodus--demo-',
 'fallen',
 'farther-away',
 'feel-for-you',
 'field-of-innocence',
 'fields-of-innocence',
 'forever',
 'forever-gone-forever-you',
 'forgive-me',
 'give-unto-me',
 'going-under',
 'good-enough',
 'goodnight',
 'goodnite',
 'haunted',
 'heart-shaped-box',
 'hello',
 'hi-lo',
 'holding-my-last-breathe',
 'i-believe-in-you',
 'i-must-be-dreaming',
 'imaginary',
 'imaginery',
 'imperfection',
 'jealous',
 'lacrymosa',
 'lacrymose',
 'lies',
 'like-you',
 'listen-to-the-rain',
 'l

Before filtering we have 128 hyperlinks; after filtering, 112. A quick inspection reveals frequent misspellings (e.g. imaginery, lacrymose).
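
Spotting such near-duplicates by eye does not scale, so here is a hedged sketch of how they could be flagged automatically with the standard library's `difflib`. The function name `near_duplicate_titles` and the 0.8 threshold are assumptions chosen for illustration; the sample titles are taken from the list above. Flagged pairs are only candidates for manual review, since two similar titles may still be different songs.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_titles(titles, threshold=0.8):
    """Return pairs of distinct titles whose similarity ratio is >= threshold."""
    pairs = []
    for first, second in combinations(sorted(set(titles)), 2):
        if SequenceMatcher(None, first, second).ratio() >= threshold:
            pairs.append((first, second))
    return pairs


sample_titles = ['goodnight', 'goodnite', 'imaginary', 'imaginery',
                 'lacrymosa', 'lacrymose', 'hello', 'lies']

suspects = near_duplicate_titles(sample_titles)
print(suspects)
```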

### Retrieving from [AZLyrics](https://www.azlyrics.com/)

The last website investigated, like the first one, has a single main page with all lyrics.

In [15]:
url = 'https://www.azlyrics.com/e/evanescence.html'

list_links_lyrics_azlyrics = retrieve_hyperlinks(url)

# removing possible repetitions 

list_links_lyrics_azlyrics = list(set(list_links_lyrics_azlyrics))

print('\n Number of links before filtering:', len(list_links_lyrics_azlyrics))
list_links_lyrics_azlyrics


 Number of links before filtering: 129


['../lyrics/evanescence/hilo.html',
 '../lyrics/evanescence/secretdoor.html',
 '//www.azlyrics.com/privacy.html',
 '../lyrics/evanescence/thechange.html',
 '//www.azlyrics.com/u.html',
 '../a/amylee.html',
 '../lyrics/evanescence/goingunder.html',
 '../lyrics/evanescence/lostwhispers.html',
 '../lyrics/evanescence/yourstar.html',
 '//www.azlyrics.com/contact.html',
 '../lyrics/evanescence/itwasallalie.html',
 'mailto:?subject=Evanescence%20Lyrics&body=https%3A%2F%2Fwww.azlyrics.com%2Fe%2Fevanescence.html',
 '../lyrics/evanescence/lithium.html',
 '//www.azlyrics.com/i.html',
 '../lyrics/evanescence/nevergoback.html',
 '../lyrics/evanescence/surrender.html',
 '//www.azlyrics.com/copyright.html',
 '../lyrics/evanescence/exodus.html',
 '../lyrics/evanescence/breathenomore.html',
 '../lyrics/evanescence/wherewillyougo.html',
 '../lyrics/evanescence/fieldofinnocence.html',
 '//www.azlyrics.com/adv.html',
 '../lyrics/evanescence/heartshapedbox.html',
 '../lyrics/evanescence/october.html',
 '.

And in this case our filtering list comprehension uses '/lyrics/' to keep only the relevant links. 

In [16]:
list_links_lyrics_azlyrics = [link for link in list_links_lyrics_azlyrics if '/lyrics/' in link ]

print('\n Number of links after updated filtering and applying set:', len(list_links_lyrics_azlyrics))


 Number of links after updated filtering and applying set: 89
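
Note that the links from AZLyrics and songteksten.nl are relative ('../lyrics/...', '//www.azlyrics.com/...', '/songteksten/...'), so they cannot be requested as they are. A minimal sketch of resolving them against the page they came from, using `urllib.parse.urljoin` from the standard library (the variable names are illustrative):

```python
from urllib.parse import urljoin

# AZLyrics: links relative to the artist page, plus a scheme-relative link
azlyrics_page = 'https://www.azlyrics.com/e/evanescence.html'
azlyrics_raw = ['../lyrics/evanescence/hilo.html',
                '//www.azlyrics.com/privacy.html']
azlyrics_absolute = [urljoin(azlyrics_page, link) for link in azlyrics_raw]

# songteksten.nl: links relative to the site root
nl_page = 'https://www.songteksten.nl/artiest/4713/evanescence.htm'
nl_absolute = urljoin(nl_page, '/songteksten/41310/evanescence/going-under.htm')

print(azlyrics_absolute)
print(nl_absolute)
```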


In [17]:
# organizing in alphabetical order to make visualization easy
list_links_lyrics_azlyrics.sort()
list_links_lyrics_azlyrics

['../lyrics/evanescence/allthatimlivingfor.html',
 '../lyrics/evanescence/anewwaytobleed.html',
 '../lyrics/evanescence/anythingforyou.html',
 '../lyrics/evanescence/anywhere.html',
 '../lyrics/evanescence/awayfromme.html',
 '../lyrics/evanescence/beforethedawn.html',
 '../lyrics/evanescence/bleedimustbedreaming.html',
 '../lyrics/evanescence/breathenomore.html',
 '../lyrics/evanescence/bringmetolife.html',
 '../lyrics/evanescence/bringmetolifesynthesis.html',
 '../lyrics/evanescence/callmewhenyouresober.html',
 '../lyrics/evanescence/cloudnine.html',
 '../lyrics/evanescence/disappear.html',
 '../lyrics/evanescence/endofthedream.html',
 '../lyrics/evanescence/erasethis.html',
 '../lyrics/evanescence/evenindeath.html',
 '../lyrics/evanescence/evenindeath2016version.html',
 '../lyrics/evanescence/everybodysfool.html',
 '../lyrics/evanescence/exodus.html',
 '../lyrics/evanescence/fartheraway.html',
 '../lyrics/evanescence/fieldofinnocence.html',
 '../lyrics/evanescence/forevergoneforevery

# **Extracting lyrics from a webpage**

Now it is time to use the hyperlinks of lyrics obtained in the previous section to extract the lyrics.

From the sites inspected before, I've decided to eliminate songteksten.nl because of the misspelling issues, and to work further with songteksten.net because it seemed to be the least complicated one for extracting lyrics (based on its HTML structure).

The following function handles this task by using `BeautifulSoup` to parse the HTML and prettify it. After that, I use some Python string methods to get the lyrics both as a string and as a list.

In [18]:
def extract_lyric_from_url(url_lyric):
    """ 
    Extract lyrics from a songteksten.net lyric page, using the prettified BeautifulSoup HTML.
    Return the lyrics both as a list of lines and as a single string.
    """
    
    # send an HTTP request
    r_lyric = requests.get(url_lyric)
    
    # obtain the HTML content of the page as text
    html_doc_lyric = r_lyric.text
    
    # parse the HTML
    soup_lyric = BeautifulSoup(html_doc_lyric,"lxml")
    
    # prettify it, so that each tag and each string sits on its own line
    soup_lyric_pretty = soup_lyric.prettify()
    
    # isolate the part that contains the lyric:
    # everything between the closing </h1> and the 'buma-consent' div
    text = soup_lyric_pretty.split('</h1>\n')[1].split('<div class="buma-consent" role="alert">')[0]

    # clean the text and build a list of lines from it
    list_lyrics = text.split('<br/>\n')
    list_lyrics = [item.replace('\n','').strip() for item in list_lyrics]
    
    # remove empty elements from the list
    # (a list comprehension avoids removing items from the list while iterating over it)
    list_lyrics = [item for item in list_lyrics if item != '']
            
    # this part was added after noticing that at least one lyric was not following the normal pattern
    if list_lyrics and '<div' in list_lyrics[0]:
        list_lyrics = list_lyrics[1:]
        
    # the lyrics in string format
    lyrics = '. '.join(list_lyrics)
    
    # return both list and string
    return list_lyrics, lyrics
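
When dropping empty strings from a list such as `list_lyrics`, a list comprehension is safer than calling `remove` inside a `for` loop over the same list, because removing an element shifts the remaining ones and the loop can skip the next item. A small sketch on made-up data:

```python
# Made-up lines with two consecutive empty strings
lines = ['first line', '', '', 'second line']

# Removing while iterating: the second '' is skipped and survives
looped = list(lines)
for item in looped:
    if item == '':
        looped.remove(item)

# Building a new list keeps only the non-empty items
filtered_lines = [item for item in lines if item != '']

print(looped)
print(filtered_lines)
```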

## Extracting lyric from songteksten.net

In [19]:
url_lyric = list_links_lyrics_songteksten_net[1]

print(url_lyric,'\n')

https://songteksten.net/lyric/1938/48350/evanescence/zero.html 



In [20]:
list_lyrics = extract_lyric_from_url(url_lyric)[0]
list_lyrics

['My reflection, dirty mirror',
 "There's no connection to myself",
 "I'm your lover, I'm your zero",
 "I'm in the face of your dreams of glass",
 'So save your prayers',
 "For when we're really gonna need'em",
 'Throw out your cares and fly',
 'Wanna go for a ride?',
 "She's the one for me",
 "She's all I really need",
 "Cause she's the one for me",
 'Emptiness is loneliness,',
 'and loneliness is cleanliness',
 'And cleanliness is godliness,',
 'And god is empty just like me',
 'You never believe',
 'Intoxicated with the madness',
 "I'm in love with my sadness",
 'Bullshit fakers, enchanted kingdoms',
 'The fasion victims chew their charcoal teeth',
 'I never let on',
 'That I was on a sinking ship',
 'I never let on that I was down',
 'You blame yourself',
 "For what you can't ignore",
 'You blame yourself for wanting more',
 "She's the one for me",
 "She's all I really need",
 "She's the one for me",
 "She's my one and only"]

In [21]:
lyrics = extract_lyric_from_url(url_lyric)[1]
lyrics

"My reflection, dirty mirror. There's no connection to myself. I'm your lover, I'm your zero. I'm in the face of your dreams of glass. So save your prayers. For when we're really gonna need'em. Throw out your cares and fly. Wanna go for a ride?. She's the one for me. She's all I really need. Cause she's the one for me. Emptiness is loneliness,. and loneliness is cleanliness. And cleanliness is godliness,. And god is empty just like me. You never believe. Intoxicated with the madness. I'm in love with my sadness. Bullshit fakers, enchanted kingdoms. The fasion victims chew their charcoal teeth. I never let on. That I was on a sinking ship. I never let on that I was down. You blame yourself. For what you can't ignore. You blame yourself for wanting more. She's the one for me. She's all I really need. She's the one for me. She's my one and only"

### Try another one

In [22]:
url_lyric = list_links_lyrics_songteksten_net[10]

print(url_lyric)

https://songteksten.net/lyric/1938/30586/evanescence/tourniquet.html


In [23]:
list_lyrics = extract_lyric_from_url(url_lyric)[0]
list_lyrics

['I tried to kill the pain',
 'But only brought more (so much more)',
 'I lay dying',
 "And I'm pouring crimson regret and betrayal",
 "I'm dying (dying)",
 'Praying (praying)',
 'Bleeding (bleeding)',
 'And screaming',
 'Am I too lost to be saved?',
 'Am I too lost?',
 'My God, my tourniquet',
 'Return to me salvation',
 'My God, my tourniquet',
 'Return to me salvation',
 'Do you remember me?',
 'Lost for so long',
 'Will you be on the other side?',
 'Will you forget me?',
 "I'm dying (dying)",
 'Praying (praying)',
 'Bleeding (bleeding)',
 'And screaming',
 'Am I too lost to be saved?',
 'Am I too lost?',
 'My God, my tourniquet',
 'Return to me salvation',
 'My God, my tourniquet',
 'Return to me salvation',
 '(Return to me salvation)',
 '[I want to die]',
 'My God, my tourniquet',
 'Return to me salvation',
 'My God, my tourniquet',
 'Return to me salvation',
 'My wounds cry for the grave',
 'My soul cries for deliverance',
 'Will I be denied?',
 'Christ -',
 'tourniquet -',
 'my 

In [24]:
lyrics = extract_lyric_from_url(url_lyric)[1]
lyrics

"I tried to kill the pain. But only brought more (so much more). I lay dying. And I'm pouring crimson regret and betrayal. I'm dying (dying). Praying (praying). Bleeding (bleeding). And screaming. Am I too lost to be saved?. Am I too lost?. My God, my tourniquet. Return to me salvation. My God, my tourniquet. Return to me salvation. Do you remember me?. Lost for so long. Will you be on the other side?. Will you forget me?. I'm dying (dying). Praying (praying). Bleeding (bleeding). And screaming. Am I too lost to be saved?. Am I too lost?. My God, my tourniquet. Return to me salvation. My God, my tourniquet. Return to me salvation. (Return to me salvation). [I want to die]. My God, my tourniquet. Return to me salvation. My God, my tourniquet. Return to me salvation. My wounds cry for the grave. My soul cries for deliverance. Will I be denied?. Christ -. tourniquet -. my suicide"

It seems to work just fine!

Let's run it for all songs and save all the strings in a list, so that we can later combine them with the other data we will gather about Evanescence's songs!

In [25]:
list_lyrics_evanescence = []
list_title_lyrics_evanescence = []

In [26]:
len(list_links_lyrics_songteksten_net)

86

In [27]:
# building lists with titles of lyrics and lyrics

for url_lyric in list_links_lyrics_songteksten_net:
    
    list_title_lyrics_evanescence.append(url_lyric.split('/')[-1].split('.')[-2])
    list_lyrics_evanescence.append(extract_lyric_from_url(url_lyric)[1])
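
The loop above issues one request per song without pausing, and an unexpected page layout would raise an exception and stop the whole run. A hedged sketch of a more defensive loop follows; `scrape_many`, `fetch`, `delay_seconds` and `fake_fetch` are hypothetical names, and the one-second default pause is an arbitrary assumption. In this notebook, `fetch` could be something like `lambda u: extract_lyric_from_url(u)[1]`.

```python
import time


def scrape_many(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each url, pausing between calls and collecting failures."""
    results, failures = {}, []
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as error:  # keep going, but remember what failed
            failures.append((url, repr(error)))
        time.sleep(delay_seconds)
    return results, failures


# Stand-in for a real download, so the sketch runs offline
def fake_fetch(url):
    if "broken" in url:
        raise ValueError("unexpected page layout")
    return "lyrics of " + url


results, failures = scrape_many(["song-a", "broken-page", "song-b"],
                                fake_fetch, delay_seconds=0)
print(list(results), failures)
```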


Just verifying that everything is as expected.

In [28]:
len(list_title_lyrics_evanescence)

86

In [29]:
len(list_lyrics_evanescence)

86

Now we create a Pandas data frame with song title and lyrics that will be saved in .csv format for later use.

In [30]:
# Creating a dataframe with song titles and lyrics

df = pd.DataFrame({'song_title': list_title_lyrics_evanescence,
                  'lyrics': list_lyrics_evanescence})

In [31]:
df.head()

|  | song_title | lyrics |
|---|---|---|
| 0 | listen-to-the-rain | Listen listen. Listen listen. Listen listen. L... |
| 1 | zero | My reflection, dirty mirror. There's no connec... |
| 2 | i-must-be-dreaming | How can I pretend that I don't see. What you h... |
| 3 | the-end | I found a grave. Brushed off the face. Felt yo... |
| 4 | breathe-no-more | I've been looking in the mirror for so long.. ... |


Before saving this data frame in .csv format for future use, let's replace '-' with a space in the song titles and make all titles lower case by applying '.lower()'.

Keeping the format of the song titles uniform makes our life easier when we want to add more information about those songs, for instance metadata such as album title and recording year.

In [32]:
df['song_title'] = df['song_title'].apply(lambda x: x.replace('-',' ').lower())
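
Slugs from other sites can carry doubled or trailing hyphens (songteksten.nl has 'made-of-stone-' and 'exodus--demo-', for example), and a plain `replace` would leave stray spaces in those titles. A slightly more tolerant normalization is sketched below; the function name `normalize_title` is hypothetical, and this goes beyond what the notebook actually needs for songteksten.net.

```python
import re


def normalize_title(slug):
    """Turn a URL slug into a lower-case title with single spaces (hypothetical helper)."""
    words = re.split(r"-+", slug.strip("-"))  # runs of '-' count as one separator
    return " ".join(words).lower()


normalized = [normalize_title(s) for s in
              ['listen-to-the-rain', 'made-of-stone-', 'exodus--demo-']]
print(normalized)
```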

In [33]:
df.head()

|  | song_title | lyrics |
|---|---|---|
| 0 | listen to the rain | Listen listen. Listen listen. Listen listen. L... |
| 1 | zero | My reflection, dirty mirror. There's no connec... |
| 2 | i must be dreaming | How can I pretend that I don't see. What you h... |
| 3 | the end | I found a grave. Brushed off the face. Felt yo... |
| 4 | breathe no more | I've been looking in the mirror for so long.. ... |


Checking if there are duplicated `song_title`.

In [34]:
# Find duplicate rows
duplicateDFRow = df[df.duplicated()]
print(duplicateDFRow)

Empty DataFrame
Columns: [song_title, lyrics]
Index: []
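
A song could also appear twice under the same title with slightly different lyrics, which a full-row comparison would not catch. As a minimal sketch, the titles alone can be checked with the standard library; `duplicated_titles` is a hypothetical helper, not part of the notebook. In pandas, `df[df.duplicated(subset='song_title', keep=False)]` should give an equivalent view.

```python
from collections import Counter
from typing import Iterable, List


def duplicated_titles(titles: Iterable[str]) -> List[str]:
    """Hypothetical helper: return the titles that occur more than once, sorted."""
    counts = Counter(titles)
    return sorted(title for title, n in counts.items() if n > 1)


# made-up titles, only to illustrate the call
print(duplicated_titles(["zero", "the end", "zero", "anywhere"]))
```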


In [35]:
# Organize dataframe in alphabetical order
df.sort_values('song_title', inplace = True)
df.reset_index(drop = True, inplace = True)
df.head()

| | song_title | lyrics |
|---|---|---|
| 0 | 4th of july | Shower in the dark day. Clean sparks driving d... |
| 1 | all that im living for | All that I'm living for. All that I'm dying fo... |
| 2 | angel of mine | You are everything I need to see. Smile and su... |
| 3 | anything for you | I'd give anything to give me to you. Can you f... |
| 4 | anywhere | Dear my love, haven't you wanted to be with me... |


In [36]:
# saving dataframe to .csv

df.to_csv("./data/lyrics_evanescence_"+TodaysDate+".csv", index = False)

In [37]:
# testing saved .csv

df2 = pd.read_csv("./data/lyrics_evanescence_"+TodaysDate+".csv")

df2.head()

| | song_title | lyrics |
|---|---|---|
| 0 | 4th of july | Shower in the dark day. Clean sparks driving d... |
| 1 | all that im living for | All that I'm living for. All that I'm dying fo... |
| 2 | angel of mine | You are everything I need to see. Smile and su... |
| 3 | anything for you | I'd give anything to give me to you. Can you f... |
| 4 | anywhere | Dear my love, haven't you wanted to be with me... |


In [38]:
df2.shape

(86, 2)

In [39]:
del df2
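
Saving with `to_csv` fails if the `./data` folder does not exist yet. The sketch below builds the dated file name and creates the folder first. `dated_csv_path` is a hypothetical helper, and the ISO date format (`YYYY-MM-DD`) is an assumption, since the exact format of `TodaysDate` may differ.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def dated_csv_path(band: str, base_dir: str = "./data",
                   today: Optional[date] = None) -> Path:
    """Hypothetical helper: build e.g. ./data/lyrics_<band>_<date>.csv."""
    folder = Path(base_dir)
    # create the output folder if it is missing
    folder.mkdir(parents=True, exist_ok=True)
    # assumption: ISO date stamp; the notebook's TodaysDate may use another format
    stamp = (today or date.today()).isoformat()
    return folder / f"lyrics_{band}_{stamp}.csv"
```

It could then be used as `df.to_csv(dated_csv_path("evanescence"), index=False)`.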

# **Applying the same to Within Temptation**

Previously, we tried four different lyrics websites to retrieve Evanescence's lyrics and chose songteksten.net to extract them.

Some reasons why I chose `songteksten.net`:

1. The filter has the form `songteksten.net/lyric/number_of_the_artist`, and the artist number can be extracted from the URL of the artist page. This makes it easy to generalize the code to both bands, or even to other bands.

2. Song titles appear in the URL as words separated by '-', which can easily be converted into a form that facilitates merging with additional data, as we've already planned.

Summarizing, the steps to retrieve lyrics are:

1. Apply the function that retrieves hyperlinks.
2. Filter out hyperlinks that do not point to lyrics.
3. Apply the function that extracts lyrics to each of the remaining hyperlinks.

These steps will now be applied to retrieve the lyrics of Within Temptation.
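
The three steps can be sketched as one function. This is an illustration, not the notebook's code: `scrape_artist` is a hypothetical name, and the two scraping functions are passed in as arguments so the sketch stays self-contained. They are assumed to behave like the notebook's `retrieve_hyperlinks` (URL to list of links) and `extract_lyric_from_url` (URL to a pair of lyrics as list and lyrics as string).

```python
from typing import Callable, Iterable, List, Tuple


def scrape_artist(
    urls: Iterable[str],
    retrieve_hyperlinks: Callable[[str], List[str]],
    extract_lyric_from_url: Callable[[str], Tuple[List[str], str]],
    lyric_filter: str,
) -> List[Tuple[str, str]]:
    """Hypothetical wrapper returning (song_title, lyrics) pairs."""
    # Step 1: collect all hyperlinks; a set removes duplicates
    links = set()
    for url in urls:
        links.update(retrieve_hyperlinks(url))

    # Step 2: keep only hyperlinks that point to lyrics of the artist
    # (sorted, so the result does not depend on set ordering)
    lyric_links = sorted(link for link in links if lyric_filter in link)

    # Step 3: title from the last URL segment, lyrics as a single string
    rows = []
    for link in lyric_links:
        title = link.split("/")[-1].split(".")[-2]
        rows.append((title, extract_lyric_from_url(link)[1]))
    return rows
```

The returned pairs can be passed to `pd.DataFrame(rows, columns=['song_title', 'lyrics'])`.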


## STEP 1 - Retrieving hyperlinks from songteksten.net

In [40]:
# retrieving all hyperlinks
urls = ['https://songteksten.net/artist/lyrics/320/within-temptation.html',
       'https://songteksten.net/artist/lyrics/320/within-temptation/page/2.html',
       'https://songteksten.net/artist/lyrics/320/within-temptation/page/3.html']

list_links_lyrics_songteksten_net = []

for url in urls:
    list_links_lyrics_songteksten_net.extend(retrieve_hyperlinks(url))
    
# removing possible duplicates
list_links_lyrics_songteksten_net = list(set(list_links_lyrics_songteksten_net))

    
print('Number of links before filtering:', len(list_links_lyrics_songteksten_net))

Number of links before filtering: 231


In [41]:
list_links_lyrics_songteksten_net[:10]

['https://songteksten.net/lyric/3648/96617/john-legend/all-of-me.html',
 'https://songteksten.net/lyric/320/11149/within-temptation/our-farewell.html',
 'https://songteksten.net/lyric/320/75323/within-temptation/sounds-of-freedom.html',
 'https://songteksten.net/news/106417/anouk-gaat-in-zelfquarantaine-om-lockdownversoepeling.html',
 'https://songteksten.net/artists/o.html',
 'https://songteksten.net/news/106410/bners-demonstreren-online-voor-culturele-sector.html',
 'https://songteksten.net/albums/album/447/2312/george-michael/older.html',
 'https://songteksten.net/lyric/320/11878/within-temptation/the-gatekeeper.html',
 'https://songteksten.net/lyric/320/73688/within-temptation/gothic-christmas.html',
 'https://forum.songteksten.net/index.php?topic=3564.msg86895;topicseen#new']

## STEP 2 - Keeping only hyperlinks of lyrics

In [42]:
# filtering hyperlinks which contain lyrics - specific for songteksten.net

# using the url of the first artist page to build the filter:
# splitting[2] is the domain and splitting[-2] is the artist number
splitting = urls[0].split('/')
filter_lyrics = splitting[2]+'/lyric/'+splitting[-2]

list_links_lyrics_songteksten_net = [link for link in list_links_lyrics_songteksten_net
                                     if filter_lyrics in link]

print('Number of links after filtering:', len(list_links_lyrics_songteksten_net))

Number of links after filtering: 74


Before filtering there were 231 hyperlinks; after filtering, 74 remain, all pointing to Within Temptation lyrics.
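
The filter built in the cell above relies on the position of the artist number in the first URL of `urls`, so it would pick up `page` if one of the paginated URLs were used instead. A slightly more defensive sketch parses the URL with `urllib.parse`. `build_lyric_filter` is a hypothetical helper, and the expected path layout `/artist/lyrics/<number>/...` is an assumption taken from the URLs used in this notebook.

```python
from urllib.parse import urlparse


def build_lyric_filter(artist_page_url: str) -> str:
    """Hypothetical helper: 'songteksten.net/lyric/<artist number>/'."""
    parsed = urlparse(artist_page_url)
    parts = [p for p in parsed.path.split("/") if p]
    # assumed layout: ['artist', 'lyrics', '<artist number>', ...]
    if len(parts) < 3 or parts[:2] != ["artist", "lyrics"]:
        raise ValueError(f"Unexpected artist page url: {artist_page_url}")
    # the trailing '/' prevents artist 320 from also matching e.g. 3200
    return f"{parsed.netloc}/lyric/{parts[2]}/"


url = "https://songteksten.net/artist/lyrics/320/within-temptation/page/2.html"
print(build_lyric_filter(url))
```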

## STEP 3 - Extracting song titles and lyrics from the lyrics' hyperlinks

In [43]:
url_lyric = list_links_lyrics_songteksten_net[1]

print(url_lyric)

https://songteksten.net/lyric/320/75323/within-temptation/sounds-of-freedom.html


In [44]:
# lyrics in form of list
list_lyrics = extract_lyric_from_url(url_lyric)[0]
list_lyrics

['Your call is coming',
 "I'm dreaming away",
 'For what lies hidden',
 'It needs to be found',
 'Sounds of freedom make me wanna try',
 'When the ghosts are found',
 'They will lead us to tomorrow',
 'Sounds of freedom make me wanna try',
 'Voices forgotten',
 'I hear them close by',
 'Ghosts from the past i can see through their eyes',
 'Are these the ancestors leaving me signs?',
 'Sounds of freedom make me wanna try',
 'When the ghosts are found',
 'They will lead us to tomorrow',
 'Sounds of freedom make me wanna try',
 'The sounds, they are all around',
 'Forces start moving out',
 "Taking sides, though there's so much",
 'That i need to know',
 'And soon it will be shown',
 'Sounds of freedom make me wanna try',
 'When the ghosts are found',
 'They will lead us to tomorrow',
 'Sounds of freedom make me wanna try',
 'When the ghosts are found,',
 'They will lead us to tomorrow',
 'When the ghosts are found,',
 'They will lead us to tomorrow',
 'If we could restart how it was befo

In [45]:
# lyrics in form of string

lyrics = extract_lyric_from_url(url_lyric)[1]
lyrics

"Your call is coming. I'm dreaming away. For what lies hidden. It needs to be found. Sounds of freedom make me wanna try. When the ghosts are found. They will lead us to tomorrow. Sounds of freedom make me wanna try. Voices forgotten. I hear them close by. Ghosts from the past i can see through their eyes. Are these the ancestors leaving me signs?. Sounds of freedom make me wanna try. When the ghosts are found. They will lead us to tomorrow. Sounds of freedom make me wanna try. The sounds, they are all around. Forces start moving out. Taking sides, though there's so much. That i need to know. And soon it will be shown. Sounds of freedom make me wanna try. When the ghosts are found. They will lead us to tomorrow. Sounds of freedom make me wanna try. When the ghosts are found,. They will lead us to tomorrow. When the ghosts are found,. They will lead us to tomorrow. If we could restart how it was before tomorrow"

## Saving song titles and lyrics in a .csv file

Just like before, we save the result in a .csv file to be used in our further analysis.

In [46]:
# building lists with titles of lyrics and lyrics
list_title_lyrics_within_temptation = []
list_lyrics_within_temptation = []

for url_lyric in list_links_lyrics_songteksten_net:
    
    list_title_lyrics_within_temptation.append(url_lyric.split('/')[-1].split('.')[-2])
    list_lyrics_within_temptation.append(extract_lyric_from_url(url_lyric)[1])


In [47]:
len(list_title_lyrics_within_temptation)

74

In [48]:
len(list_lyrics_within_temptation)

74

In [49]:
df = pd.DataFrame({'song_title': list_title_lyrics_within_temptation,
                  'lyrics': list_lyrics_within_temptation})

In [50]:
# Here we also replace '-' with a space in the song titles and make them lower case

df['song_title'] = df['song_title'].apply(lambda x: x.replace('-',' ').lower())

In [51]:
df.head()

| | song_title | lyrics |
|---|---|---|
| 0 | our farewell | In my hands. A legacy of memories. I can hear ... |
| 1 | sounds of freedom | Your call is coming. I'm dreaming away. For wh... |
| 2 | the gatekeeper | The shadows of the night,. are unleashed again... |
| 3 | gothic christmas | We're gonna have a gothic christmas. That is w... |
| 4 | enter | Welcome to my home. The gates of time have ope... |


In [52]:
# Find duplicate rows
duplicateDFRow = df[df.duplicated()]
print(duplicateDFRow)

Empty DataFrame
Columns: [song_title, lyrics]
Index: []


In [53]:
# Organize dataframe in alphabetical order
df.sort_values('song_title', inplace = True)
df.reset_index(drop = True, inplace = True)
df.head()

| | song_title | lyrics |
|---|---|---|
| 0 | a dangerous mind | Cause something is not right. I follow the sig... |
| 1 | a demons fate | Ooh, ooh, ooh, ooh, ooh. Ooh, ooh, ooh, ooh, o... |
| 2 | all i need | I'm dying to catch my breath. Oh why don't I e... |
| 3 | angels | Sparkling angel I believed. You were my saviou... |
| 4 | another day | I know you are going away. I take my love into... |


In [54]:
# saving dataframe to .csv

df.to_csv("./data/lyrics_within_temptation_"+TodaysDate+".csv", index = False)

In [55]:
df2 = pd.read_csv("./data/lyrics_within_temptation_"+TodaysDate+".csv")

In [56]:
df2.head()

| | song_title | lyrics |
|---|---|---|
| 0 | a dangerous mind | Cause something is not right. I follow the sig... |
| 1 | a demons fate | Ooh, ooh, ooh, ooh, ooh. Ooh, ooh, ooh, ooh, o... |
| 2 | all i need | I'm dying to catch my breath. Oh why don't I e... |
| 3 | angels | Sparkling angel I believed. You were my saviou... |
| 4 | another day | I know you are going away. I take my love into... |


In [57]:
df2.shape

(74, 2)

In [58]:
del df2

# **Conclusions**

In this notebook we have built a Python web scraper using `BeautifulSoup` and `requests`. A script is also available on [GitHub](https://github.com/dpbac/evanescence_and_within_temptation_in_Python/blob/master/webscraping_lyrics.py).

After trying different websites, we chose [songteksten.net](https://songteksten.net/) to retrieve lyrics of [Evanescence](https://songteksten.net/artist/lyrics/1938/evanescence.html) and [Within Temptation](https://songteksten.net/artist/lyrics/320/within-temptation.html). Dataframes with the retrieved lyrics and their titles were built and saved as .csv files for later analysis.

Web scraping is a nice and very useful way to obtain data. One disadvantage is that the structure of a webpage can change, forcing major updates to the code. In this sense, retrieving data through an API is more stable, although not always possible.

In the future, I also plan to use the [Genius API](https://docs.genius.com/) to retrieve lyrics.