### GESIS Fall Seminar in Computational Social Science 2022
### Introduction to Computational Social Science with Python
# Day 3-2: Scraping Web Data

## Overview

* Inspecting webpages
* Parsing static HTML with BeautifulSoup
* Scraping dynamic pages with Selenium

## Inspecting webpages
* Scraping is a useful tool for accessing and analysing the vast amount of unstructured data on the web.
* It is much more efficient than manually copying and pasting information in bulk.
* We need a systematic way of identifying and extracting the information we require from webpages.
* To do this, we need to be able to read and understand the underlying HTML (HyperText Markup Language) and CSS (Cascading Style Sheets) that build a webpage.
* This is only a brief introduction to scraping; next week's GESIS course "Automated Web Data Collection with Python" goes into much more detail.

Consider the file 'simple.html' for a simple webpage. Open it with both a web browser and a text editor and study its structure.

Real pages are usually much more complicated!

![html](figs/html.png "html")

Now consider the file 'css_demo.html' for a simple webpage. Open it with both a web browser and a text editor and study its structure.

We have added some style information with CSS. The CSS style information is often very useful for finding and selecting elements.

### BeautifulSoup
* BeautifulSoup is a package for parsing HTML.
* It creates a navigable tree of the structured HTML.
* We can also easily search through the tree for the elements we require.

In [None]:
from bs4 import BeautifulSoup

with open('data/simple.html', 'r') as f:
    webpage = f.read()

soup = BeautifulSoup(webpage, 'html.parser') # Name the parser explicitly to avoid a warning
print(soup.prettify())

In [None]:
# Let's navigate through the tree:

print(soup.head)
# print(soup.head.title)
# print(soup.head.title.text)

# print(soup.body.p.text)
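Dot navigation only ever returns the first matching tag. The sketch below illustrates this on a toy HTML string invented for this example (it is not the contents of `simple.html`), and uses `find_all()` to collect every match instead. The name `toy_soup` is only used here so that it does not overwrite `soup`.

```python
from bs4 import BeautifulSoup

# Toy HTML invented for this example (an assumption, not simple.html)
html = """
<html>
  <head><title>Toy page</title></head>
  <body>
    <h1>A heading</h1>
    <p>First paragraph with <b>bold</b> text.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

toy_soup = BeautifulSoup(html, "html.parser")

# Dot navigation returns only the FIRST matching tag at each step
title_text = toy_soup.head.title.text
first_paragraph = toy_soup.body.p.text

# find_all() returns every matching tag as a list
all_paragraphs = [p.text for p in toy_soup.find_all("p")]

print(title_text)
print(first_paragraph)
print(all_paragraphs)
```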

In [None]:
# Let's load the css_demo.html file

with open('data/css_demo.html', 'r') as f:
    webpage = f.read()

soup = BeautifulSoup(webpage, 'html.parser') # Name the parser explicitly to avoid a warning
print(soup.prettify())

In [None]:
# We can use select() to find all the elements that match a particular tag

print(soup.select('p'))

In [None]:
# Or to find all the elements that have a particular 'class' (note the leading dot)

print(soup.select('.division-two'))
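Tag names, classes and ids can be combined in a single selector. As a rough sketch, the toy HTML below (invented for this example, with made-up class and id names) shows a few common patterns; `select()` always returns a list, while `select_one()` returns the first match or `None`.

```python
from bs4 import BeautifulSoup

# Toy HTML with hypothetical class and id names, for illustration only
html = """
<div class="intro"><p class="note">Intro note</p></div>
<div class="main">
  <p class="note">Main note</p>
  <p id="special">Special paragraph</p>
</div>
"""
toy_soup = BeautifulSoup(html, "html.parser")

# Any tag with class "note"
by_class = [el.text for el in toy_soup.select(".note")]

# Only <p> tags with class "note"
tag_and_class = [el.text for el in toy_soup.select("p.note")]

# Class "note" anywhere inside an element with class "main"
nested = [el.text for el in toy_soup.select(".main .note")]

# The element with id "special" (first match only)
by_id = toy_soup.select_one("#special").text

print(by_class, tag_and_class, nested, by_id)
```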

## 🏋️‍♀️ PRACTICE

In [None]:
# Q1: 
# a) Use BeautifulSoup to navigate the css_demo.html file down to the bolded text "formatted"
#    i.e., soup.body...
# b) Use BeautifulSoup select() to find the same text
#    i.e., soup.select(...)



In [None]:
# Q2: Get the text of all paragraphs in css_demo.html with the class "text-about-web-scraping"


In [None]:
# Q3: Open the file PL_table.html in a text editor. Search the html until you find the Premier League table.
# Read the file into BeautifulSoup
# Try to identify the selector that gives you all the table cell names (club names) from the table.
# (Yes, this is meant to be quite difficult at this point!)


## Parsing static HTML with BeautifulSoup
* As you have just seen, real webpages are a lot more complicated than our toy examples!
* We need a better way of identifying the selectors for the elements we want.
* Let's navigate to Wikipedia's [List of United States cities by population](https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population).

* "Right-click > Inspect" quickly shows us the HTML structure
* [SelectorGadget](https://selectorgadget.com/) is a useful tool to instantly show us the relevant tags.

In [None]:
# Let's integrate this into Python...

import requests
import time

url_population = "https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population"

page_population = requests.get(url_population) # Requests gets the page html for us
soup = BeautifulSoup(page_population.content, 'html.parser') # Parse with BeautifulSoup
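A request can fail (for example with a 404 or 500 response), and BeautifulSoup will happily parse the error page. One possible way to guard against this is sketched below; `fetch_soup` is a hypothetical helper name, and the 10 second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it, raising an error if the request failed.

    `fetch_soup` is a hypothetical helper written for this example.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return BeautifulSoup(response.content, "html.parser")
```

It could then be used as `soup = fetch_soup(url_population)`.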

In [None]:
# We can extract the top 10 cities from the map using the selector we identified:

soup.select(".noresize b")
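Selected elements are tag objects, so we can read their text and attributes. The sketch below works on toy HTML invented for this example (the class name, page names and `base_url` are all assumptions) and uses `urljoin` from the standard library to turn relative links into full URLs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical address of the page the HTML came from
base_url = "https://en.wikipedia.org/wiki/Example"

# Toy HTML invented for this example
html = """
<div class="content">
  <a href="/wiki/Page_A">Page A</a>
  <a href="/wiki/Page_B">Page B</a>
  <a>No link here</a>
</div>
"""
toy_soup = BeautifulSoup(html, "html.parser")

links = {}
for a in toy_soup.select(".content a"):
    href = a.get("href")  # .get() returns None if the attribute is missing
    if href is not None:
        links[a.text] = urljoin(base_url, href)  # resolve the relative link

print(links)
```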


## 🏋️‍♀️ PRACTICE

In [None]:
# Q4: Re-open PL_table.html from Q3 in your web browser.
# Use the Selector Gadget and BeautifulSoup to select the elements (club names) as before.


In [None]:
# Q5: Select all of the city names, as well as their associated links (.get('href')) from the map
# at the bottom of the Wikipedia article List of United States cities by population
# Save the result as a dictionary, with city name as the key, and full wikipedia link as the value.


In [None]:
# Q6: Loop through all of the links you have just collected.
# For each of these city pages, extract all hyperlinks to pages within Wikipedia
# (i.e. starting with /wiki/) from the main article content
# Add time.sleep(1) as the last step in your loop so that you don't overload the site!
# Save the result as a dictionary with city name as key, and list of links as values
# Perfect parsing here is quite tough!


## Scraping dynamic pages with Selenium
* Not all of the web is simple HTML files.
* Lots of dynamic content (e.g. JavaScript) can only be accessed by having a proper web browser render the website.
* We may also want to interact with pages (e.g. scrolling, clicking) and even send data to websites.
* **[Selenium](https://selenium-python.readthedocs.io/)** is the solution!

### Installation
* Ensure you [install Selenium](https://selenium-python.readthedocs.io/installation.html) correctly:
  - Install the Python package, e.g. with `conda install selenium`.
  - Download a 'driver' for your browser and system (Firefox recommended), and move it to the correct location on your system.
* The 'driver' lets Selenium control a real web browser, so you can watch it performing the actions you program.

In [None]:
# If this cell runs, you have correctly installed Selenium!
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.python.org")

In [None]:
elem = driver.find_element(By.NAME, "q") # find the search box
elem.clear() # clear anything currently in the search box
elem.send_keys("Guido")  # type the keys "Guido"
elem.send_keys(Keys.RETURN) # Send a return key to search

In [None]:
driver.quit() # Quit the driver once finished

In [None]:
# Let's interact with a page further by playing Happy Birthday

driver = webdriver.Firefox()
driver.get("https://en.wikipedia.org/wiki/Happy_Birthday_to_You")
elem = driver.find_element(By.CSS_SELECTOR, ":nth-child(9) .mw-tmh-play-icon") # find the play button

In [None]:
elem.click() # click the element

In [None]:
driver.quit() # Quit the driver once finished

In [None]:
# Let's use Selenium to extract data from a page on Amazon deals

driver = webdriver.Firefox()
driver.get("https://www.amazon.com/gp/goldbox/")

In [None]:
# Get selector from SelectorGadget for the deal names and use the driver to select elements
elems = driver.find_elements(By.CSS_SELECTOR, "#slot-15 .DealContent-module__truncate_sWbxETx42ZPStTc9jwySW")
dealnames = [x.text for x in elems]
driver.quit()

In [None]:
dealnames

## 🏋️‍♀️ PRACTICE

In [None]:
# Q7: Use Selenium to navigate to this shared Google Doc
# https://docs.google.com/document/d/1NKY4eeVtBjNNdx1cpS8VPiyazVCPnQRfMBTB4yHtk7I/edit?usp=sharing
# Select the document using elem = driver.find_element(By.XPATH, '//body')
# Write a message on a **new line** using send_keys


In [None]:
# Q8: Use Selenium to navigate to the Amazon deals page https://www.amazon.com/gp/goldbox/
# As demonstrated previously, extract the names of all the deals on the page
# Then, find the element to progress to the next page of results and click it
# Extract the names of all the deals on this page, and repeat for the first 5 pages of results
# Sleep 1s between pages


In [None]:
# Q9: If you have finished the previous exercises, consider how you could improve your scraper from Q8.
# Scrape deals by category, or get the hyperlink, price, use a custom search etc.
