# Web-scraping Playground

According to Wikipedia:
> Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

We will use https://genius.com/ to scrape the lyrics of your own favorite songs.

This notebook walks you through scraping the lyrics, and I will explain each block of code step by step.

Let's start by importing the necessary libraries.

In [1]:
import requests
import time
import re

from bs4 import BeautifulSoup
from tqdm import tqdm

1. requests: HTTP requests (we mainly use the GET method from this library)
2. time: pauses between requests
3. bs4: the BeautifulSoup library, which is widely used in web scraping
4. re: regular expressions
5. tqdm: a progress bar that shows how far the scraping has gotten

Next, find your favorite songs or artists! I will use the songs of a K-pop girl group, IZ*ONE: https://genius.com/artists/Izone

In [2]:
# specify the album URLs to scrape
URL = ['https://genius.com/albums/Izone/Color-iz',
       'https://genius.com/albums/Izone/Heart-iz',
       'https://genius.com/albums/Izone/Vampire',
       'https://genius.com/albums/Izone/Bloom-iz',
       'https://genius.com/albums/Izone/Oneiric-diary'] # let's start with these urls

TIMEOUT = 20 # seconds to wait before each request (a politeness delay, not a network timeout)
LYRIC_PATH = '../data/lyrics.txt' # path to store the scraped lyrics

Now, onto the exciting part, scraping!

In [None]:
for u in URL:
    print('Waiting for', TIMEOUT, 'seconds before the next album...')
    time.sleep(TIMEOUT)
    print('Current album:', u)
    
    req = requests.get(u) # initiate GET request to the url
    soup = BeautifulSoup(req.text, 'lxml') # use BeautifulSoup to find the html tags
    
    raw_url = []
    for s in soup.find_all('div'):
        link = s.find('a')
        if link is not None and link.has_attr('href'): # skip anchors without an href
            raw_url.append(link['href'])
            
    songs = []
    for r in raw_url:
        x = re.search('^(http|https)://.*Izone-.*lyrics$', r)
        if x is not None:
            songs.append(r)
            
    non_duplicate_songs = []
    for s in songs:
        if s not in non_duplicate_songs:
            non_duplicate_songs.append(s)
            
    for n in non_duplicate_songs:
        print('Waiting for', TIMEOUT, 'seconds before writing the next song...')
        time.sleep(TIMEOUT)
        
        print('Writing lyrics:', n)
        req = requests.get(n)
        soup = BeautifulSoup(req.text, 'lxml')
        lyrics = []
        for a in tqdm(soup.find_all('div')):
            lyric = a.find('p')
            if lyric is not None:
                lyrics.append(lyric)
    
        lyrics = clean_html_tags(lyrics[0])
        with open(LYRIC_PATH, 'a', encoding='utf-8') as f:
            f.write(lyrics)

## 1 - GET Request   

    req = requests.get(u) # initiate GET request to the url
    soup = BeautifulSoup(req.text, 'lxml') # use BeautifulSoup to find the html tags

HTTP defines several request methods. In this project, we only use GET to retrieve the HTML of a page. BeautifulSoup then parses that HTML with the lxml parser into a tree of tags that we can search.
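As a rough sketch of the same two steps with some basic error handling, the helper below wraps the GET request and the parsing. The name `fetch_soup` and the `request_timeout` parameter are my own assumptions, not part of the notebook, and `html.parser` is used instead of `lxml` only so the sketch runs without an extra install. The parsing step is demonstrated on a hard-coded HTML string so nothing is downloaded.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, request_timeout=10):
    """Hypothetical helper: download a page and return it as a BeautifulSoup tree."""
    response = requests.get(url, timeout=request_timeout)  # give up if the server hangs
    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.text, 'html.parser')


# Offline demonstration of the parsing step on a made-up page.
sample_html = '<html><body><h1>Album title</h1><p>Track list</p></body></html>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')
heading = sample_soup.find('h1').get_text()
print(heading)
```

Calling `fetch_soup(u)` inside the loop would replace the two lines above, with the difference that a failed request raises an exception instead of silently parsing an error page.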

Be careful when you parse non-English lyrics. If you don't specify the parser, and the encoding when you write the file, you can end up with weird-looking characters.
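One common way such characters appear is when UTF-8 bytes are decoded with a different encoding. The sketch below reproduces the effect with plain Python strings; the Korean sample text is just an illustration. As far as I know, requests can fall back to ISO-8859-1 when a text response does not declare a charset, so treat the comments about `req.encoding` and `req.content` as options to try rather than a guaranteed fix.

```python
# A short Korean string standing in for a line of lyrics (illustrative only).
title = '안녕하세요'

raw_bytes = title.encode('utf-8')      # what the server actually sends
garbled = raw_bytes.decode('latin-1')  # wrong guess: one odd character per byte
restored = raw_bytes.decode('utf-8')   # right guess: the original text

print(garbled)
print(restored)

# With requests, two options that may help if req.text looks garbled:
#   req.encoding = 'utf-8'                      # override the guessed encoding
#   soup = BeautifulSoup(req.content, 'lxml')   # let BeautifulSoup detect it from the bytes
```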

## 2.1 - Scraping Links

    raw_url = []
    for s in soup.find_all('div'):
        link = s.find('a')
        if link is not None and link.has_attr('href'): # skip anchors without an href
            raw_url.append(link['href'])
    
Before scraping, it is good practice to open the website and inspect it yourself. Press F12, or right-click and choose Inspect Element, to view the HTML.
Genius structures the body of an album page like this:

![Genius HTML Body](../assets/html_body_1.png)

Notice that the songs in the album are arranged in order:
```html
<div>
    content goes here...
    <div>
        content goes here...
        <a href> link to song 1 </a>
    </div>
    <div>
        content goes here...
        <a href> link to song 2 </a>
    </div>
    <div>
        content goes here...
        <a href> link to song 3 </a>
    </div>
    ...
</div>
```

This structure makes it easy to scrape the links to the songs in an album. We can use find_all('div') to retrieve every 'div' tag, then use find('a') on each one to retrieve the first link inside it.
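To see what this does on a small scale, here is a sketch that runs the same per-div loop on a hand-written snippet shaped like the structure above. The song URLs are made-up placeholders, and `html.parser` is used only so the example runs without lxml. It also shows an alternative I am assuming you may prefer: asking BeautifulSoup directly for every anchor that has an href.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for an album page; the URLs are placeholders.
album_html = """
<div>
    <div><a href="https://genius.com/Izone-song-one-lyrics">Song 1</a></div>
    <div><a href="https://genius.com/Izone-song-two-lyrics">Song 2</a></div>
    <div><a>anchor without a link</a></div>
</div>
"""
soup = BeautifulSoup(album_html, 'html.parser')

# The notebook's approach: first anchor inside every div.
per_div = []
for div in soup.find_all('div'):
    link = div.find('a')
    if link is not None and link.has_attr('href'):
        per_div.append(link['href'])

# Alternative: every anchor that carries an href, each visited once.
direct = [a['href'] for a in soup.find_all('a', href=True)]

print(per_div)
print(direct)
```

The outer div and the first inner div both report the same first anchor, which is one reason the per-div loop can return the same link more than once.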

    songs = []
    for r in raw_url:
        x = re.search('^(http|https)://.*Izone-.*lyrics$', r)
        if x is not None:
            songs.append(r)
            
    non_duplicate_songs = []
    for s in songs:
        if s not in non_duplicate_songs:
            non_duplicate_songs.append(s)
            
This is where the re library is useful. The page usually contains unrelated links that are not part of the album, so use re.search(pattern, string) to keep only the song links. You can also encounter duplicate links when scraping, so add a few more lines of code to keep only one copy of each link.
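The filtering and de-duplication can also be written more compactly. The sketch below uses a compiled version of the same pattern and `dict.fromkeys`, which keeps the first occurrence of each link in its original order. The URLs in `raw_url` are placeholders I made up to exercise both steps.

```python
import re

# Same pattern as above, compiled once; 'https?' matches both http and https.
SONG_PATTERN = re.compile(r'^https?://.*Izone-.*lyrics$')

# Placeholder links standing in for what the album page might return.
raw_url = [
    'https://genius.com/Izone-song-one-lyrics',
    'https://genius.com/artists/Izone',          # not a lyrics page
    'https://genius.com/Izone-song-one-lyrics',  # duplicate
    'https://genius.com/Izone-song-two-lyrics',
    '/login',                                    # relative link, no scheme
]

songs = [url for url in raw_url if SONG_PATTERN.search(url)]
non_duplicate_songs = list(dict.fromkeys(songs))  # drop repeats, keep order

print(non_duplicate_songs)
```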

## 2.2 - Scraping Lyrics

    for n in non_duplicate_songs:
        print('Waiting for', TIMEOUT, 'seconds before writing the next song...')
        time.sleep(TIMEOUT)
        
        print('Writing lyrics:', n)
        req = requests.get(n)
        soup = BeautifulSoup(req.text, 'lxml')
        lyrics = []
        for a in tqdm(soup.find_all('div')):
            lyric = a.find('p')
            if lyric is not None:
                lyrics.append(lyric)
    
        lyrics = clean_html_tags(lyrics[0])
        with open(LYRIC_PATH, 'a', encoding='utf-8') as f:
            f.write(lyrics)