## Web Scraping Now!

As a prelude to later notebooks on natural language processing of the lyrics on Arcade Fire's album Everything Now, this notebook demonstrates basic web scraping to collect text from the internet. The objective is to lay out a basic framework for scraping and cleaning that text.

### Writing the Dictionary

Using the website https://songmeanings.com/, lyrics from Arcade Fire's album Everything Now are scraped.

The first task is to create a dictionary whose keys (the song names) will eventually act as file names and whose values (URL IDs) identify each song's unique page, as listed on the album page (https://songmeanings.com/albums/view/tracks/282088/).

In [1]:
# Import the libraries we will use
from bs4 import BeautifulSoup  # Front-end web parsing library
import requests  # Requests is the only Non-GMO HTTP library for Python, safe for human consumption.
import re  # Regular expression parsing

In [2]:
# Dictionary with each song and corresponding URL id on songmeanings.com
everythingnow_dict = {
    "Everything_Now_1": 3530822107859552543,
    "Everything_Now": 3530822107859549671,
    "Signs_of_Life": 3530822107859552545,
    "Creature_Comfort": 3530822107859549890,
    "Peter_Pan": 3530822107859552546,
    "Chemistry": 3530822107859552547,
    "Infinite_Content": 3530822107859552548,
    "Infinite_Content_2": 3530822107859552549,
    "Electric_Blue": 3530822107859552544,
    "Good_God_Damn": 3530822107859552550,
    "Put_Your_Money_On_Me": 3530822107859552551,
    "We_Dont_Deserve_Love": 3530822107859552552,
    "Everything_Now_2": 3530822107859552553
}
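
To make the role of the keys and values concrete, here is a minimal sketch of how one dictionary entry turns into a page URL and an output file name. The helper names `song_url` and `song_path` are hypothetical and do not appear in the notebook; the URL pattern and the `.txt` suffix are the ones used in the scraping cell.

```python
# Hypothetical helpers (not part of the notebook) showing how a
# dictionary entry maps to a page URL and an output file path.
BASE_URL = "https://songmeanings.com/songs/view/%s/"


def song_url(song_id):
    """Build the song page URL from its numeric ID."""
    return BASE_URL % song_id


def song_path(directory, song):
    """Build the output path; the key becomes the file name, so it should not carry its own extension."""
    return "%s/%s.txt" % (directory, song)


sample = {"Peter_Pan": 3530822107859552546}
for name, song_id in sample.items():
    print(song_url(song_id))
    print(song_path("./Scraped_Lyrics", name))
```

Because the key is reused as the file name, a key that already ends in `.txt` would produce a doubled extension.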

### Scrape Baby, Scrape

Looping through each key (song) and its corresponding value (unique song ID), each song's page is requested from the internet.

Searching through divisions of the webpage, "holder lyric-box" was identified through manual exploration using Google Chrome's Developer Tools as the class where the lyrics are contained. This class could also be identified directly from the website's source.

Using the BeautifulSoup library with the html.parser option, the page is parsed and the text inside the corresponding class is extracted.
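
The class lookup can be tried offline before hitting the website. The sketch below runs the same `find_all` call on a small, made-up HTML string; the markup, including the `editbutton` class, is an assumption for illustration and is not copied from songmeanings.com.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the real page markup is not reproduced here.
sample_html = """
<div class="header">Site navigation</div>
<div class="holder lyric-box">First line<br/>Second line
<div class="editbutton">Edit Lyrics</div></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A class_ string with a space is matched against the exact value of the class attribute
boxes = soup.find_all("div", class_="holder lyric-box")

for box in boxes:
    # get_text() drops the tags but keeps all nested text, including widget labels
    print(box.get_text())
```

Since `get_text()` keeps the text of nested elements, labels such as "Edit Lyrics" come along with the lyrics, which is why a cleaning step is needed later.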

In [3]:
# Loop through each song and corresponding url id
for song in everythingnow_dict:

    # Build the song URL, request the page, and parse it with Beautiful Soup
    listingurl = "https://songmeanings.com/songs/view/%s/" % everythingnow_dict[song]
    response = requests.get(listingurl)
    soup = BeautifulSoup(response.text, "html.parser")
    lyrics = []

    # Search for class that contains lyrics
    for rows in soup.find_all("div", class_="holder lyric-box"):
        lyrics.append(rows.get_text())

    # Write scraped lyrics to text file (the Scraped_Lyrics directory must already exist)
    with open('./Scraped_Lyrics/%s.txt' % song, 'w') as song_file:
        for lyric in lyrics:
            song_file.write("%s\n" % lyric)
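
The write step above fails if the output directory is missing. As a possible variant, the sketch below creates the directory first and closes the file automatically. `write_lyrics` is a hypothetical helper, not part of the notebook, and the example writes into a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def write_lyrics(directory, song, lyrics):
    """Hypothetical helper: write each lyric block to <directory>/<song>.txt."""
    out_dir = Path(directory)
    # Create the directory (and any parents) if it does not exist yet
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / ("%s.txt" % song)
    # The with block closes the file even if a write fails
    with open(out_path, "w") as song_file:
        for lyric in lyrics:
            song_file.write("%s\n" % lyric)
    return out_path


# Usage with placeholder text in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    path = write_lyrics(Path(tmp) / "Scraped_Lyrics", "Peter_Pan", ["first block", "second block"])
    print(path.read_text())
```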


### "Clean up clean up, everybody do your share." -Barney

After downloading the lyrics of each song in the album, some cleanup is required to get rid of extraneous text, HTML, and JavaScript artifacts in the raw scraped data.

Parsing through each song, regular expressions (RegEx) are used to collapse consecutive new lines and remove all tabs. Python's string replace method is used to remove additional irrelevant strings that were scraped.

The cleaned lyrics are saved in a separate directory for safekeeping.
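
The cleaning rules can be checked on a small string before running them over the files. The sketch below is a variant, not the notebook's own cell: it applies the rules to the whole text at once rather than line by line, so the newline pattern can actually see runs of blank lines. The name `clean_text` is hypothetical, and the phrase list is taken from the replacements used in this notebook.

```python
import re

# Widget labels that the scrape picks up alongside the lyrics
UNWANTED = ["Add Video", "Edit Wiki", "Edit Lyrics"]


def clean_text(raw):
    """Hypothetical helper: strip tabs, widget labels, and repeated newlines."""
    text = re.sub(r"\t+", "", raw)          # remove all tabs
    for phrase in UNWANTED:
        text = text.replace(phrase, "")     # remove non-lyric phrases
    text = re.sub(r"\n+", "\n", text)       # collapse runs of newlines into one
    return text.strip("\n")                 # drop leading and trailing newlines


raw = "\t\tPut your money on me\n\n\n'Cause I can barely breathe\nEdit Lyrics\n\tAdd Video\n"
print(clean_text(raw))
```

Note that collapsing every run of newlines also removes the blank lines between stanzas; if stanza breaks matter for later analysis, a pattern such as `\n{3,}` replaced with `\n\n` would be one way to keep them.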

In [4]:
# Loop through each song and read its scraped lyrics file
for song in everythingnow_dict:
    with open('./Scraped_Lyrics/%s.txt' % song) as f:
        lyrics = f.readlines()
        cleaned_lyrics = []
        for line in lyrics:

            # RegEx to collapse consecutive new lines and remove tabs
            line = re.sub(r'\n+', '\n', line)
            line = re.sub(r'\t+', '', line)

            # String replace to remove phrases parsed but not part of lyrics
            line = line.replace('Add Video', '')
            line = line.replace('Edit Wiki', '')
            line = line.replace('Edit Lyrics', '')
            line = line.replace(" eval(ez_write_tag([[300,250],'songmeanings_com-medrectangle-4','ezslot_4']));", '')
            line = line.replace(" eval(ez_write_tag([[300,250],'songmeanings_com-medrectangle-4','ezslot_5']));", '')

            # Append cleaned line
            cleaned_lyrics.append(line)

        # Write cleaned lyrics to a new file in a separate directory
        with open('./Cleaned_Lyrics/%s.txt' % song, 'w') as song_file:
            for lyric in cleaned_lyrics:
                song_file.write("%s\n" % lyric)
                # Print lyrics to spot check
                print(lyric)



Put your money on me

'Cause I can barely breathe

Put your money on me



Put your money on me

If you think I'm losing you, you must be crazy

All your money on me

I'm never gonna let you go, even when it's easy

Put your money on me

Go tuck me into bed, and wake me when I'm dead

I know that you gotta be free

But I'm never gonna let it go



If there was a race

A race for your heart

It started before you were born

Above the chloroform sky

Clouds made of ambien

Sitting on carpets in the basement of heaven

We were born innocent, but it lies today

And baby you can give all the money away

But if there's a race, a race for your heart

It's over, before it starts

Singing put your money on me



If you think I'm losing you, you must be crazy

All your money on me

I'm never gonna let you go, even when it's easy

Put your money on me

Go tuck me into bed, and wake me when I'm dead

I know that you gotta be free

But I'm never gonna let it go



All my presents are broken, befo