# Web Scraping

If your project involves scraping but your real focus is the analysis, I would advise you to pick one scraper and stick with it.

I'll be using two in this notebook because one of my objectives is to learn about web scraping.

## Selenium Webdriver

Selenium is useful for webpages that are dynamically generated or that you need to interact with. For example, you might want to scrape content in a card whose `display` is set to `none` by default. In that case you'd want a tool like Selenium to go in and click the right elements to make the content visible.

### Installation Process (for Chrome)
To use Selenium WebDriver, install the `selenium` Python package using

`pip3 install selenium`

Next, check which version of Chrome you're using by going to the ... button in the top-right corner, then "Help" and then "About Chrome", or by opening
`chrome://settings/help`
Based on the version, download the matching chromedriver from
[ChromeDriver Downloads](https://sites.google.com/a/chromium.org/chromedriver/downloads)
and place it somewhere on your laptop. Then use its path as I did in the code cell below.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
import pandas as pd
import re
import sys

options = Options()
options.headless = True
PATH = r'C:\Program Files (x86)'

Thankfully, because of the scope of the project, there aren't many sites I'd need to visit and copy transcripts from. However, I know very little about web scrapers and thought this would be a good opportunity to try one out (as well as show my progress in learning).

In [2]:
driver = webdriver.Chrome(PATH+r'\chromedriver.exe')
speeches_url = "https://jamesclear.com/great-speeches"

driver.get(speeches_url)
link_tags = driver.find_elements_by_xpath("//h2[text()='Famous Speeches and Great Talks']/following::ul[1]/li/a") # //ul/li/a
links = [tag.get_attribute('href') for tag in link_tags]

driver.quit()


In [3]:
links[:10]

['https://jamesclear.com/great-speeches/the-danger-of-a-single-story-by-chimamanda-ngozi-adichie',
 'https://jamesclear.com/great-speeches/what-matters-more-than-your-talents-by-jeff-bezos',
 'https://jamesclear.com/great-speeches/enough-by-john-c-bogle',
 'https://jamesclear.com/great-speeches/the-anatomy-of-trust-by-brene-brown',
 'https://jamesclear.com/great-speeches/creativity-in-management-by-john-cleese',
 'https://jamesclear.com/great-speeches/solitude-and-leadership-by-william-deresiewicz',
 'https://jamesclear.com/great-speeches/seeking-new-laws-by-richard-feynman',
 'https://jamesclear.com/great-speeches/make-good-art-by-neil-gaiman',
 'https://jamesclear.com/great-speeches/personal-renewal-by-john-w-gardner',
 'https://jamesclear.com/great-speeches/your-elusive-creative-genius-by-elizabeth-gilbert']

In [4]:
# driver = webdriver.Chrome(PATH+r'\chromedriver.exe')
# driver.get(links[0])
# transcript = ''
# title_orator = driver.find_element_by_xpath("//h1[@class='entry-title']").text
# driver.quit()
# title_orator
# title_orator_re = re.search('.*“(.+)”.by (.+)', ' '.join(title_orator.split("\n")))
# title_orator_re.group(2)
# .split("\n")
# ''.join(title_orator.split("\n"))


In [8]:
def scrape_speech_data(speech_urls, save_as='', n=5):
    """Scrape up to n speeches into a DataFrame; optionally save it to Excel."""
    df = pd.DataFrame(columns=['Orator','Title','Transcript','Link'])
    num_speeches = min(len(speech_urls),n)
    driver = webdriver.Chrome(PATH+r'\chromedriver.exe')
    for i in range(num_speeches):
        try:
            driver.get(speech_urls[i])
            transcript = ''
            title_orator = driver.find_element_by_xpath("//h1[@class='entry-title']").text
            # Note the non-standard (curly) quotation marks in my regex string
            title_orator_re = re.search(r'“(.+)”.+by (.+)', ' '.join(title_orator.split("\n")))
            transcript_elements = driver.find_elements_by_xpath('//h2[text() = "Speech Transcript"]/following-sibling::p')
            if not transcript_elements:
                transcript_elements = driver.find_elements_by_xpath("//h2[contains(text(),'Speech Transcript')]/following-sibling::div/div/div/p")
            # Join all transcript paragraphs into one string
            for p in transcript_elements:
                transcript += p.text + ' '
            df.loc[i] = [title_orator_re.group(2),title_orator_re.group(1),transcript,speech_urls[i]]
        except Exception:
            # Report which speech failed and carry on with the next one
            print(i)
            print("Unexpected error:",sys.exc_info()[0],sys.exc_info()[1])
    driver.quit()
    if save_as:
        df.to_excel(save_as, index=False)
    return df

scrape_speech_data(links,'',10)



Unnamed: 0,Orator,Title,Transcript,Link
0,Chimamanda Ngozi Adichie,The Danger of a Single Story,I'm a storyteller. And I would like to tell yo...,https://jamesclear.com/great-speeches/the-dang...
1,Jeff Bezos,What Matters More Than Your Talents,"As a kid, I spent my summers with my grandpare...",https://jamesclear.com/great-speeches/what-mat...
2,John C. Bogle,Enough,Here’s how I recall the wonderful story that s...,https://jamesclear.com/great-speeches/enough-b...
3,Brené Brown,The Anatomy of Trust,"Oh, it just feels like an incredible understat...",https://jamesclear.com/great-speeches/the-anat...
4,John Cleese,Creativity in Management,"You know, when Video Arts asked me if I'd like...",https://jamesclear.com/great-speeches/creativi...
5,William Deresiewicz,Solitude and Leadership,My title must seem like a contradiction. What ...,https://jamesclear.com/great-speeches/solitude...
6,Richard Feynman,Seeking New Laws,What I want to talk to you about tonight is st...,https://jamesclear.com/great-speeches/seeking-...
7,Neil Gaiman,Make Good Art,I never really expected to find myself giving ...,https://jamesclear.com/great-speeches/make-goo...
8,John W. Gardner,Personal Renewal,I'm going to talk about “Self-Renewal.” One of...,https://jamesclear.com/great-speeches/personal...
9,Elizabeth Gilbert,Your Elusive Creative Genius,I am a writer. Writing books is my profession ...,https://jamesclear.com/great-speeches/your-elu...
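
The title parsing inside `scrape_speech_data` can be tried on its own without opening a browser. Here's a minimal sketch, assuming the heading text looks like the made-up `sample_heading` below (title in curly quotes on one line, "by" plus the orator on the next). `parse_title_orator` and `sample_heading` are names I'm introducing just for illustration.

```python
import re

# Same pattern as in scrape_speech_data: curly quotes around the title,
# followed somewhere later by "by <orator>".
TITLE_ORATOR_RE = re.compile(r'“(.+)”.+by (.+)')

def parse_title_orator(heading_text):
    """Return (title, orator), or None if the heading doesn't match."""
    # Flatten line breaks the same way the scraper does
    flat = ' '.join(heading_text.split('\n'))
    match = TITLE_ORATOR_RE.search(flat)
    if match is None:
        return None
    return match.group(1), match.group(2)

# Hypothetical heading text, not copied from the site
sample_heading = '“Enough”\nby John C. Bogle'
print(parse_title_orator(sample_heading))
print(parse_title_orator('No curly quotes here'))
```

Returning `None` for a heading that doesn't match makes a badly formatted page easier to spot than the `AttributeError` you'd otherwise get from calling `.group()` on a failed search.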


After I got the function working, my remaining problem was getting the transcript for John C. Bogle's speech "Enough". I noticed (since Selenium actually opens a browser window) that a banner often comes up on this page, but it isn't actually a modal, so I don't think it should be a problem. First I'll take a look at the HTML.

The HTML does indeed have a different structure, but at least among the first 10 transcripts it's the only page structured like that. After looking at the page for "Enough", I added the next two lines:

```
if not transcript_elements:
    transcript_elements = driver.find_elements_by_xpath("//h2[contains(text(),'Speech Transcript')]/following-sibling::div/div/div/p")
```

These lines use the alternate XPath whenever the usual `transcript_elements` list comes back empty. Although sort of hacky, it works without raising an error. We now have our complete speech dataset.
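
If more page layouts turn up later, the same fallback idea could be written as "try each XPath in order and keep the first one that finds something". The sketch below is only an illustration: `first_nonempty` and `fake_find` are hypothetical names, and `fake_find` stands in for the driver's XPath lookup so the example runs without a browser.

```python
def first_nonempty(find, queries):
    """Return the first non-empty result of find(query), else an empty list."""
    for query in queries:
        found = find(query)
        if found:
            return found
    return []

# The two XPaths used in scrape_speech_data, in order of preference
TRANSCRIPT_XPATHS = [
    '//h2[text() = "Speech Transcript"]/following-sibling::p',
    "//h2[contains(text(),'Speech Transcript')]/following-sibling::div/div/div/p",
]

def fake_find(xpath):
    """Stand-in for the driver lookup: pretends only the second layout exists."""
    return ['paragraph 1', 'paragraph 2'] if 'contains' in xpath else []

print(first_nonempty(fake_find, TRANSCRIPT_XPATHS))
```

With a real driver you would pass its XPath lookup method as `find` instead of `fake_find`.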

## Beautiful Soup

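beautifulsoup4 (bs4) is normally preferred for static sites (like the ones I'm using) or when there are a lot of pages to scrape. Because it only parses HTML you've already downloaded instead of driving a real browser, it's typically much faster and has fewer moving parts than Selenium, so it's usually the better choice when the page doesn't need to be rendered or clicked on.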

Before running the code, make sure you've installed the necessary packages:

`pip3 install beautifulsoup4`

`pip3 install requests` (for http requests)

`pip3 install lxml` (for parsing)

If you have the time, I would recommend trying out both of these popular options because they are quite different and scraping *can* be fun.

I'll be using bs4 to scrape lyrics from https://www.azlyrics.com/, which seems to have most songs and albums except "Greatest Hits" albums, so we'll be avoiding those. Using a Business Insider article from 2016, we'll collect lyrics for the songs on the best-selling albums of all time.

Best selling is a quantifiable selection criterion, so I'll go with this methodology for songs. I'm selecting albums instead of individual songs because song popularity is very subjective (it's hard to find an objective list) and because an album gives a corpus that is more comparable to a speech than a single song does. A song normally focuses on one topic, whereas a speech often has stories or subtopics that give it more variety.

Again, we want to check whether the pages we're scraping allow it. https://www.businessinsider.com/robots.txt tells us that for `User-agent: *` (i.e. all bots) a whole bunch of paths are off limits. They don't disallow `/`, though, and since my article below sits directly under "businessinsider.com/...", I'm OK to scrape it.

For how to interpret a robots.txt file, go to the [Google Webmasters](https://support.google.com/webmasters/answer/6062596?hl=en) page.
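
The same check can be done in code with `urllib.robotparser` from the standard library. This is a rough sketch: the rules in `example_rules` are made up so the example runs offline, and the real robots.txt will contain different entries.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only; the real file lives at
# https://www.businessinsider.com/robots.txt and will differ.
example_rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(example_rules)

article = 'https://www.businessinsider.com/50-best-selling-albums-all-time-2016-9'
blocked = 'https://www.businessinsider.com/private/page'  # hypothetical path

# can_fetch(user_agent, url) applies the parsed rules to a URL
print(parser.can_fetch('*', article))
print(parser.can_fetch('*', blocked))
```

To check against the live file instead, call `parser.set_url(...)` with the robots.txt address and then `parser.read()` in place of `parser.parse(...)`.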

In [2]:
import sys
import re
import requests
# import lxml
from bs4 import BeautifulSoup


best_albums_src = requests.get('https://www.businessinsider.com/50-best-selling-albums-all-time-2016-9').text
# class 'slide'
soup = BeautifulSoup(best_albums_src, 'lxml')

# show (still not that) pretty html
# print(soup.prettify())

In [3]:
slides = soup.find('div',class_='slide-wrapper')
slide_ls = slides.find_all('div',class_='slide')
len(slide_ls)

50

Alright, so it looks like we have our 50 best-selling albums. Next, we'll want the 5 best-selling albums (and their artists) without "Greatest" in the album title, since azlyrics doesn't have those albums and it's otherwise a good site to scrape. Possible alternatives like Lyrics.com are harder to navigate en masse (page URLs can't be guessed since they contain long numbers) and they don't have straightforward album-to-song navigation.

In [5]:
num_albums = 5

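def get_valid_albums(slides, n):
    """Return up to n [artist, album] pairs, skipping "Greatest" albums."""
    # Walk backwards from the last slide, which holds the top seller
    # (judging by the output below, the article counts down to #1)
    i=len(slides)-1
    selected_albums = 0
    album_info = []
    while i > -1 and selected_albums < n:
        txt = slides[i].text
        if "Greatest" not in txt:
            # Slide text looks like: <rank>. <artist> — "<album>"
            artist_album_str = re.search(r'.*\d\. (.+) — (".+").*', txt)
            album_info.append([artist_album_str.group(1),artist_album_str.group(2)])
            selected_albums+=1
        i-=1
    return album_info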

# Note that album names are in quotes on AZlyrics so we'll keep them in the string
art_alb_pairs = get_valid_albums(slide_ls, num_albums)
art_alb_pairs

[['Michael Jackson', '"Thriller"'],
 ['Eagles', '"Hotel California"'],
 ['Led Zeppelin', '"Led Zeppelin IV"'],
 ['Pink Floyd', '"The Wall"'],
 ['AC/DC', '"Back In Black"']]

So we now have our artist and albums. The URL with album and song lyrics is pretty straightforward. URLs follow the pattern
`https://www.azlyrics.com/<first letter of artist's last name>/<last name>.html` or 
`https://www.azlyrics.com/<first letter of group name>/<group name (only letters)>.html`

Our problem is that we can't really tell whether a name belongs to a group or an individual (and I'm not about to train a model to distinguish the two). To get around this, I'll use the fact that personal names are (usually) two words and that we get redirected to the home page if the URL isn't found. Artists with three names (so far as I've seen) have URLs like groups. There's also some inconsistency among the two-word names, since Billy Joel is at `/b/billyjoel.html` while Michael Jackson is at `/j/jackson.html`. I thought the Jackson 5 being linked on the same page might be the reason for the different pattern, but David Bowie is also at `/b/bowie.html`. I won't agonize over it too much; I'll just see if I can get a hit on one of the possible pages.

The home page has "Welcome to AZLyrics!" so I'll be looking for that as a sign that my URL didn't work.
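
To make the two URL patterns concrete, here's a small sketch that only builds the candidate URLs for an artist, in the order I plan to try them, without making any requests. `candidate_artist_urls` is a hypothetical helper name. Unlike the function in the next cell, it also strips non-letters from the last name. Whether a given candidate actually exists on the site is a separate question.

```python
import re

AZ_ROOT = 'https://www.azlyrics.com'

def candidate_artist_urls(artist, root_url=AZ_ROOT):
    """Build the possible artist-page URLs, most specific guess first."""
    words = artist.lower().split()
    letters = re.sub('[^a-z]+', '', artist.lower())
    urls = []
    if len(words) == 2:
        # Two words: guess it's a person and try the last-name pattern first
        last = re.sub('[^a-z]+', '', words[1])
        if last:
            urls.append(f'{root_url}/{last[0]}/{last}.html')
    if letters:
        # Group pattern: all letters of the name run together
        urls.append(f'{root_url}/{letters[0]}/{letters}.html')
    return urls

for artist in ['Michael Jackson', 'Billy Joel', 'AC/DC']:
    print(artist, candidate_artist_urls(artist))
```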

In [75]:
az_root = 'https://www.azlyrics.com'

def get_songs(artist,album, root_url=az_root):
    """Return the lyric-page URLs for every song listed under album."""
    songs=[]
    artist_lc = artist.lower()
    name = artist_lc.split()
    artist_letters = re.sub('[^a-zA-Z]+', '', artist_lc)
    # Assume we need the group-style URL unless the last-name URL works
    on_home_page = True
    if len(name) == 2:
        # try last name centric URL
        artist_url = '/'.join([root_url,name[1][0],name[1]])+'.html'
        src = requests.get(artist_url).text
        soup = BeautifulSoup(src, 'lxml')
        h1 = soup.find('h1')
        on_home_page = h1 is not None and h1.text == "Welcome to AZLyrics!"
    # try other url (lowercase first letter + all letters of the name)
    if on_home_page:
        artist_url = '/'.join([root_url,artist_letters[0],artist_letters])+'.html'
        src = requests.get(artist_url).text
        soup = BeautifulSoup(src, 'lxml')
    try:
        # every other sibling is a navigablestring (and don't want to add more complexity to the loop)
        cur_node = soup.find_all('b',string=album)[0].parent.next_sibling.next_sibling
        while "listalbum-item" in cur_node['class']:
            songs.append(artist_url + '/../' + cur_node.find('a')['href'])
            cur_node = cur_node.next_sibling.next_sibling
    except Exception:
        print("Unexpected error:", sys.exc_info()[0])
    return songs

get_songs(art_alb_pairs[0][0],art_alb_pairs[0][1])

['https://www.azlyrics.com/j/jackson.html/../../lyrics/michaeljackson/wannabestartinsomethin.html',
 'https://www.azlyrics.com/j/jackson.html/../../lyrics/michaeljackson/babybemine.html',
 'https://www.azlyrics.com/j/jackson.html/../../lyrics/michaeljackson/thegirlismine.html',
 'https://www.azlyrics.com/j/jackson.html/../../lyrics/michaeljackson/thriller.html',
 'https://www.azlyrics.com/j/jackson.html/../../lyrics/michaeljackson/beatit.html',
 'https://www.azlyrics.com/j/jackson.html/../../lyrics/michaeljackson/billiejean.html',
 'https://www.azlyrics.com/j/jackson.html/../../lyrics/michaeljackson/humannature.html',
 'https://www.azlyrics.com/j/jackson.html/../../lyrics/michaeljackson/pytprettyyoungthing.html',
 'https://www.azlyrics.com/j/jackson.html/../../lyrics/michaeljackson/theladyinmylife.html',
 'https://www.azlyrics.com/j/jackson.html/../../lyrics/michaeljackson/carousel.html']
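
The URLs above contain a `.html/../../` detour because I glued the relative link onto the artist page URL by hand. One way to tidy this up is `urllib.parse.urljoin` from the standard library, which resolves a relative link against the page it came from. In this sketch the `href` value is inferred from the output above rather than copied from the page source.

```python
from urllib.parse import urljoin

artist_url = 'https://www.azlyrics.com/j/jackson.html'
# Relative link as it appears to be written on the artist page
# (inferred from the output above)
href = '../lyrics/michaeljackson/thriller.html'

# urljoin resolves the "../" against the directory of artist_url
clean_url = urljoin(artist_url, href)
print(clean_url)
```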

Success! Well... maybe we're halfway there, but that's still something!