# Web Scraping

If you want to do a project with scraping and want to focus on the analysis, I would advise you to choose one scraper. 

I'll be using two in this notebook because one of my objectives is to learn about web scraping.

## Selenium Webdriver

Selenium is useful for webpages that are dynamically generated or that you need to interact with. For example, you might want to scrape content in a card whose default display state is set to `none`. In that case, you'd want a tool like Selenium to go in and click the right elements to make the content visible.

### Installation Process (for Chrome)
To use Selenium WebDriver, install the `selenium` Python package using

`pip3 install selenium`

Next, check the version of Chrome you're using by going to the ... button in the top-right corner, then "Help" and then "About Chrome", or by opening
`chrome://settings/help`
Based on the version, download the matching ChromeDriver from
[ChromeDriver Downloads](https://sites.google.com/a/chromium.org/chromedriver/downloads)
and place it somewhere on your laptop. Then use the path as I did in the code cell below.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options

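options = Options()
options.headless = True  # run Chrome without opening a window
PATH = r'C:\Program Files (x86)'
# pass the options in, otherwise the headless setting is never applied
driver = webdriver.Chrome(PATH + r'\chromedriver.exe', options=options)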

Thankfully, because of the scope of the project, there aren't many sites I'd need to visit and copy transcripts from. However, I know very little about web scrapers and thought this would be a good opportunity to try one out (as well as show my progress in learning).

In [2]:
speeches_url = "https://jamesclear.com/great-speeches"

driver.get(speeches_url)
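# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which newer Selenium releases removed
link_tags = driver.find_elements(By.XPATH, "//h2[text()='Famous Speeches and Great Talks']/following::ul[1]/li/a") # //ul/li/a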
links = [tag.get_attribute('href') for tag in link_tags]


# driver.quit()

(I used the ChroPath browser extension to work out the XPath above.)

In [3]:
links[:10]

['https://jamesclear.com/great-speeches/the-danger-of-a-single-story-by-chimamanda-ngozi-adichie',
 'https://jamesclear.com/great-speeches/what-matters-more-than-your-talents-by-jeff-bezos',
 'https://jamesclear.com/great-speeches/enough-by-john-c-bogle',
 'https://jamesclear.com/great-speeches/the-anatomy-of-trust-by-brene-brown',
 'https://jamesclear.com/great-speeches/creativity-in-management-by-john-cleese',
 'https://jamesclear.com/great-speeches/solitude-and-leadership-by-william-deresiewicz',
 'https://jamesclear.com/great-speeches/seeking-new-laws-by-richard-feynman',
 'https://jamesclear.com/great-speeches/make-good-art-by-neil-gaiman',
 'https://jamesclear.com/great-speeches/personal-renewal-by-john-w-gardner',
 'https://jamesclear.com/great-speeches/your-elusive-creative-genius-by-elizabeth-gilbert']

In [4]:
transcripts = {}
for link in links[:5]:
    # visit each speech page itself rather than the index page
    driver.get(link)

    link_tags = driver.find_elements(By.XPATH, '//div[@class="entry content"]/h2[. = "Speech Transcript"]/following-sibling::p')
        #"//*[preceding-sibling::h2[. = 'Speech Transcript'] and following-sibling::h2[. = 'Headline 2']]")
    # read the text now, because the elements go stale once the driver loads another page
    transcripts[link] = [tag.text for tag in link_tags]
driver.quit()
link_tags

[]

## Beautiful Soup

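beautifulsoup4 is normally preferred for static sites (like the ones I'm using) or when there are a lot of pages to scrape. Because bs4 only parses HTML and never launches a browser, it is faster and generally more robust than Selenium, so it's usually the better choice when the page doesn't need any interaction.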

Before running the code, make sure you have installed the necessary packages:

`pip3 install beautifulsoup4`

`pip3 install requests` (for http requests)

`pip3 install lxml` (for parsing)
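
Once the packages are installed, the basic bs4 workflow can be sketched roughly as follows. The HTML string here is a made-up stand-in for a downloaded page (in a real run it would come from `requests.get(url).text`), and I'm using Python's built-in `html.parser` so the sketch doesn't depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; a real run would use
# requests.get(url).text instead of this inline string.
html = """
<html><body>
  <h2>Speeches</h2>
  <ul>
    <li><a href="/speech-one">Speech One</a></li>
    <li><a href="/speech-two">Speech Two</a></li>
  </ul>
</body></html>
"""

# "html.parser" ships with Python; "lxml" can be swapped in once installed.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, find_all() returns every match.
heading = soup.find("h2").text
hrefs = [a["href"] for a in soup.find_all("a")]

print(heading)
print(hrefs)
```

The same two calls, `find` and `find_all`, are all the scraping further down relies on.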

If you have the time, I would recommend trying out both of these popular options because they are quite different and scraping *can* be fun.

I'll be using bs4 to scrape lyrics from https://www.azlyrics.com/, which seems to have most songs and albums except for "Greatest Hits" albums, which we'll be avoiding. Using a Business Insider article from 2016, we'll collect lyrics for the songs on the best-selling albums of all time.

Best-selling is a quantifiable selection criterion, so I'll go with this methodology for songs. I'm selecting albums instead of individual songs because song popularity is very subjective (it's hard to find an objective list) and albums give corpora that are more comparable to speeches than single songs are. Plus, songs normally focus on one topic, whereas speeches often have stories or subtopics that give them more variety.

Again, we want to check whether the pages we're scraping allow it. https://www.businessinsider.com/robots.txt tells us that for `User-agent: *` (i.e. all bots), a whole bunch of paths are off limits. They don't disallow `/`, and since my article below sits directly under "businessinsider.com/...", I'm OK to scrape it.

For how to interpret a robots.txt file, go to the [Google Webmasters](https://support.google.com/webmasters/answer/6062596?hl=en) page.
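
If you'd rather not read the file by eye, Python's standard library can do the check. Below is a minimal sketch using `urllib.robotparser`; the rules and the `example.com` URLs are made up for illustration and are not the real Business Insider file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; not the real Business Insider file.
sample_robots = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

# can_fetch(user_agent, url) answers "may this bot request this URL?"
article_ok = parser.can_fetch("*", "https://example.com/some-article-2016-9")
private_ok = parser.can_fetch("*", "https://example.com/private/page")

print(article_ok, private_ok)
```

To check a live site instead, point the parser at the file with `parser.set_url(...)` and call `parser.read()` before `can_fetch`.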

In [None]:
import re
import requests
# import lxml
from bs4 import BeautifulSoup


best_albums_src = requests.get('https://www.businessinsider.com/50-best-selling-albums-all-time-2016-9').text
# class 'slide'
soup = BeautifulSoup(best_albums_src, 'lxml')
# show (still not that) pretty html
# print(soup.prettify())

In [None]:
slides = soup.find('div',class_='slide-wrapper')
slide_ls = slides.find_all('div',class_='slide')
len(slide_ls)

Alright, so it looks like we have our 50 best-selling albums. Next, we'll want the 5 best-selling albums and artists without "Greatest" in the album title (since azlyrics doesn't have those albums and it's a good site to scrape). Possible alternatives like Lyrics.com are harder to navigate en masse (page URLs can't be guessed since they contain long numbers) and they don't have straightforward album-song navigation.

In [None]:
num_albums = 5

def get_valid_albums(slides, n):
    # walk backwards from the last slide until n albums are collected
    i = len(slides) - 1
    album_info = []
    while i > -1 and len(album_info) < n:
        txt = slides[i].text
        if "Greatest" not in txt:
            artist_album_str = re.search(r'.*\d\. (.+) — (".+").*', txt)
            # skip slides whose text doesn't match the '<rank>. <artist> — "<album>"' pattern
            if artist_album_str:
                album_info.append([artist_album_str.group(1), artist_album_str.group(2)])
        i -= 1
    return album_info

get_valid_albums(slide_ls, num_albums)
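
To see what the regular expression captures without hitting the site, here is a small sketch run on made-up slide text. The artist and album names are placeholders, and the text shape (rank, artist, dash, quoted album) is my assumption based on the pattern above.

```python
import re

# Hypothetical slide text, written to match the shape the pattern expects.
sample_slides = [
    '3. Some Band — "Greatest Hits" Copies sold: (made up)',
    '2. Another Artist — "Second Album" Copies sold: (made up)',
    '1. Example Singer — "First Album" Copies sold: (made up)',
]

pattern = re.compile(r'\d\. (.+) — (".+")')

picked = []
for text in reversed(sample_slides):   # start from the end of the list
    if "Greatest" in text:
        continue                        # azlyrics doesn't carry these albums
    match = pattern.search(text)
    if match:                           # ignore text that isn't an album entry
        picked.append([match.group(1), match.group(2)])

print(picked)
```

Note that the second group keeps the quotation marks around the album title, so they'd need stripping before the title is compared with anything on azlyrics.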

So we now have our artists and albums. The URL of the page that lists an artist's albums and song lyrics is pretty straightforward. URLs follow the pattern
`https://www.azlyrics.com/<first letter of artist's last name>/<last name>.html` or 
`https://www.azlyrics.com/<first letter of group name>/<group name (only letters)>.html`

Our problem is that we can't really tell whether a name belongs to a group or an individual (and I'm not about to train a model to distinguish the two). To get around this, I'll use the fact that individuals' names are (usually) two words and that we're redirected to the main page if the URL isn't found. Artists with three names (so far as I've seen) have URLs like groups. There's also some inconsistency with the two-word names, since Billy Joel is at `/b/billyjoel.html` and Michael Jackson is at `/j/jackson.html`. I thought the different pattern might be because the Jackson 5 is linked on the same page, but David Bowie is also at `/b/bowie.html`. I won't agonize over it too much; I'll just see if I can get a hit on one of the possible pages.

The home page has "Welcome to AZLyrics!" so I'll be looking for that as a sign that my URL didn't work.
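
The guessing strategy above could be sketched like this. `candidate_urls` and `is_home_page` are hypothetical helper names of my own, and apart from the three artists mentioned above I haven't checked that the generated URLs actually exist on the site.

```python
import re

AZ_ROOT = "https://www.azlyrics.com/"

def candidate_urls(artist, root_url=AZ_ROOT):
    """Return the artist-page URLs worth trying, last-name guess first."""
    # Lowercase each word and keep only its letters.
    words = [re.sub(r"[^a-z]", "", w) for w in artist.lower().split()]
    words = [w for w in words if w]
    if not words:
        return []
    candidates = []
    if len(words) == 2:
        last = words[1]
        # last-name-centric form, e.g. /j/jackson.html
        candidates.append(f"{root_url}{last[0]}/{last}.html")
    full = "".join(words)
    # group-style form, e.g. /b/billyjoel.html
    candidates.append(f"{root_url}{full[0]}/{full}.html")
    return candidates

def is_home_page(html):
    """True when the response looks like the AZLyrics home page (a URL miss)."""
    return "Welcome to AZLyrics!" in html

print(candidate_urls("Michael Jackson"))
print(candidate_urls("Billy Joel"))
```

The idea would be to request each candidate in turn and keep the first response for which `is_home_page` is false.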

In [None]:
az_root = 'https://www.azlyrics.com/'

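def get_songs(artist, album, root_url=az_root):
    name = artist.lower().split()
    if len(name) == 2:
        # try last name centric URL, e.g. <root>/j/jackson.html
        url = root_url + '/'.join([name[1][0], name[1] + '.html'])
        page = requests.get(url)
        # TODO: pull the album's song links (the <a> tags' ['href']) out of page.text
        return page
    return None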