# Data Acquisition - Web Scraping

## Structure
1. HTML pages
2. Web scraping packages
    * Exercise
3. Ethical considerations of web scraping

## What you will be able to do after the lecture
* Inspect an HTML page and identify which parts you want to scrape.
* Scrape web pages with `requests` and `BeautifulSoup`.
* Judge when web scraping is the most suitable approach and what you should consider before doing so (be a good citizen of the Internet).

## HTML page structure

**Hypertext Markup Language (HTML)** is the standard markup language for documents designed to be displayed in a web browser. HTML describes the structure of a web page and can be combined with **Cascading Style Sheets (CSS)** and a scripting language such as **JavaScript** to create interactive websites. An HTML document consists of a series of elements that tell the browser how to display the content, and each element is written using **tags**.

Here are some tags:
* `<!DOCTYPE html>`   
* `<html>`   
* `<div>` 
* `<head>` 
* `<title>` 
* `<body>` 
* `<h1>` 
* `<p>` 
* `<a>` 

Here is what each of these tags does:
* The `<!DOCTYPE html>` declaration defines this document to be HTML5.
* The `<html>` element is the root element of an HTML page.
* The `<div>` element defines a division or a section in an HTML document. It's usually a container for other elements.
* The `<head>` element contains meta information about the document.
* The `<title>` element specifies a title for the document.
* The `<body>` element contains the visible page content.
* The `<h1>` element defines a top-level heading.
* The `<p>` element defines a paragraph.
* The `<a>` element defines a hyperlink.

HTML tags normally come in pairs, like `<p>` and `</p>`. The first tag in a pair is the opening (start) tag and the second is the closing (end) tag. The closing tag is written like the opening tag, but with a slash inserted before the tag name.

![image.png](attachment:image.png)



An HTML document has a tree-like 🌳 🌲 structure. The **Document Object Model (DOM)**, a cross-platform and language-independent interface, represents the document as this tree of nested elements. Here's what a very simple HTML tree looks like.

![image.png](attachment:image.png)
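
To make the tree idea concrete, here is a minimal sketch that parses a tiny page and prints its element nesting. The HTML snippet and the helper name `print_tree` are made up for illustration; the sketch uses `BeautifulSoup` with Python's built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page used only for illustration
html = """
<html>
<head><title>Intro to HTML</title></head>
<body>
<h1>Heading h1</h1>
<p>A paragraph with a <a href="https://en.wikipedia.org/wiki/Main_Page">link</a>.</p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")


def print_tree(node, depth=0):
    """Print the name of every element, indented by its depth in the tree."""
    for child in node.children:
        if child.name is not None:  # plain text nodes have no tag name
            print("  " * depth + child.name)
            print_tree(child, depth + 1)


print_tree(soup)

# Walk upwards from the link to the root of the document
link = soup.find("a")
ancestors = [parent.name for parent in link.parents]
print(ancestors)
```

The indentation of the printed names mirrors the nesting of the tags: `<a>` sits inside `<p>`, which sits inside `<body>`, which sits inside `<html>`.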


In [None]:
# Creating a simple HTML page
from IPython.display import display, HTML

display(HTML("""
<!DOCTYPE html>
<html lang="en" dir="ltr">
<head>
  <title>Intro to HTML</title>
</head>

<body>
  <h1>Heading h1</h1>
  <h2>Heading h2</h2>
  <h3>Heading h3</h3>
  <h4>Heading h4</h4>

  <p>
    That's a text paragraph. You can also <b>bold</b>, <mark>mark</mark>, <ins>underline</ins>, <del>strikethrough</del> and <i>emphasize</i> words.
    You can also add links - here's one to <a href="https://en.wikipedia.org/wiki/Main_Page">Wikipedia</a>.
  </p>

  <p>
    This <br> is a paragraph <br> with <br> line breaks
  </p>

  <p style="color:red">
    Add colour to your paragraphs.
  </p>

  <p>Unordered list:</p>
  <ul>
    <li>Python</li>
    <li>R</li>
    <li>Julia</li>
  </ul>

  <p>Ordered list:</p>
  <ol>
    <li>Data collection</li>
    <li>Exploratory data analysis</li>
    <li>Data analysis</li>
    <li>Policy recommendations</li>
  </ol>
  <hr>

  <!-- This is a comment -->

</body>
</html>
"""))

## Web Scraping with `requests` and `BeautifulSoup`


### What is `BeautifulSoup`?

It is a Python library for pulling data out of HTML and XML files. It provides methods to navigate the document's tree structure that we discussed before and scrape its content.
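
As a small offline illustration, the sketch below runs the most common `BeautifulSoup` calls (`find`, `find_all`, `get`, `get_text`) on a made-up HTML snippet. The class names, file names and titles are assumptions chosen for the example, not taken from a real website.

```python
from bs4 import BeautifulSoup

# Made-up snippet; the class names and links are illustrative only
html = """
<ul class="books">
  <li class="book"><a href="book-1.html" title="First Book">First...</a></li>
  <li class="book"><a href="book-2.html" title="Second Book">Second...</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element, find_all() returns all of them
first = soup.find(class_="book")
books = soup.find_all(class_="book")

# .get() reads an attribute of a tag, .get_text() reads the text inside it
records = [
    {
        "href": li.a.get("href"),
        "title": li.a.get("title"),
        "text": li.a.get_text(strip=True),
    }
    for li in books
]
print(records)
```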

### Our pipeline
<img src='https://github.com/nestauk/im-tutorials/blob/3-ysi-tutorial/figures/Web-Scraping/scrape-pipeline.png?raw=1' width="1024">

### Exercise
Scrape all the book information (title, UPC, price, tax, availability, ...) from the following website:
http://books.toscrape.com/
(a 'hello-world' website for web scraping)

In [None]:
import urllib.request
import re
import pandas as pd
from bs4 import BeautifulSoup

In [None]:
# get the first book info

baseUrl = "http://books.toscrape.com/"
page = urllib.request.urlopen(baseUrl).read()
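# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page, "html.parser")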

product = soup.find(class_="product_pod")

productLinks = product.find_all("a")

# Uncomment to inspect every link inside the product block:
# for p in productLinks:
#     print(p, end="\n\n")

print(productLinks[0].get('href'))
print(productLinks[0].get('title'))

print(productLinks[1].get('href'))
print(productLinks[1].get('title'))

In [None]:
# go to the hyperlink

href = productLinks[0].get('href')

bookUrl = baseUrl + href

bookPage = urllib.request.urlopen(bookUrl).read()
bookSoup = BeautifulSoup(bookPage, "html.parser")

product = bookSoup.find(class_="product_page")
bookTable = product.find_all("table")

tableRows = bookTable[0].find_all("tr")

# each row holds a header cell (th) and a value cell (td)
for tr in tableRows:
    td = tr.find_all("td")
    th = tr.find_all("th")

    print(th[0].text, ":", td[0].text)

In [None]:
# Scrape all the info of all the books across all the pages

import urllib.request
import re
import pandas as pd
from bs4 import BeautifulSoup

from urllib.error import HTTPError, URLError

allBookData = []

baseUrl = "http://books.toscrape.com/"

curPage = baseUrl

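while True:

    try:
        print("Retrieving ", curPage)
        curPageContent = urllib.request.urlopen(curPage).read()
    except (HTTPError, URLError) as e:
        # stop crawling if a listing page cannot be retrieved
        print(curPage, " ", e)
        break

    soup = BeautifulSoup(curPageContent, "html.parser")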

    products = soup.find_all(class_="product_pod")

    for product in products:
        bookData = {}
        productLink = product.find_all("a")

        href  = productLink[0].get('href')
        title = productLink[1].get('title')

        # Add Data to dict
        bookData["href"]  = href
        bookData["title"] = title

        if "catalogue" not in href:
            bookUrl = baseUrl + "catalogue/" + href
        else:
            bookUrl = baseUrl + href
            
        try:
            bookPage = urllib.request.urlopen(bookUrl).read()
        except (HTTPError, URLError) as e:
            # skip this book; otherwise bookPage would be undefined or left over from the previous book
            print(curPage, " ", bookUrl, " ", e)
            continue

        bookSoup = BeautifulSoup(bookPage, "html.parser")

        productSoup = bookSoup.find(class_="product_page")
        bookTable = productSoup.find_all("table")

        tableRows = bookTable[0].find_all("tr")
        for tr in tableRows:
            td = tr.find_all("td")
            th = tr.find_all("th")

            value  = td[0].text
            column = th[0].text

            bookData[column] = value

        allBookData.append(bookData)
        
    # `string=` is the current name of the older `text=` argument
    nextPage = soup.find("a", string="next")
    if nextPage is None:
        break
        
    elif "catalogue" not in nextPage.get("href"):
        curPage = baseUrl +  "catalogue/" + nextPage.get("href")
    else:
        curPage = baseUrl + nextPage.get("href")
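
The `"catalogue" not in href` checks above work around relative links. One alternative, sketched here as an assumption rather than as part of the original solution, is to resolve each link against the URL of the page it was found on with `urllib.parse.urljoin`. The relative links below are illustrative examples of the kind of `href` values a listing page might contain.

```python
from urllib.parse import urljoin

base_url = "http://books.toscrape.com/"

# A link found on the home page, resolved against the home page URL
page_2 = urljoin(base_url, "catalogue/page-2.html")

# Links found on page 2 are resolved against the URL of page 2,
# so no manual "catalogue/" prefix is needed
page_3 = urljoin(page_2, "page-3.html")
book_url = urljoin(page_2, "a-light-in-the-attic_1000/index.html")

# "../" segments are handled as well
home = urljoin(page_2, "../index.html")

print(page_2)
print(page_3)
print(book_url)
print(home)
```

The key design choice is to always pass the URL of the page currently being parsed as the first argument, since relative links are relative to that page and not to the site root.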

In [None]:
# Display the results

#len(allBookData)
#allBookData[0]

headers = allBookData[0].keys()
bookDf = pd.DataFrame(allBookData, columns=headers)
print(bookDf.shape)  # only the last expression in a cell is displayed, so print the shape
bookDf
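
Scraped values arrive as text, so numeric columns usually need converting before analysis. The sketch below shows one way to do this with `pandas` string methods. The sample rows, the column names and the new column names `price` and `in_stock` are assumptions made up for illustration; check them against your own `bookDf` before reusing the code.

```python
import pandas as pd

# Made-up rows shaped like scraped output; the values are illustrative only
sample = pd.DataFrame([
    {"title": "Book A", "Price (incl. tax)": "£51.77", "Availability": "In stock (22 available)"},
    {"title": "Book B", "Price (incl. tax)": "£13.99", "Availability": "In stock (3 available)"},
])

# Pull the number out of the price text and convert it to float
sample["price"] = (
    sample["Price (incl. tax)"].str.extract(r"(\d+\.\d+)", expand=False).astype(float)
)

# Pull the stock count out of the availability text and convert it to int
sample["in_stock"] = (
    sample["Availability"].str.extract(r"(\d+)", expand=False).astype(int)
)

print(sample[["title", "price", "in_stock"]])
```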

In [None]:
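# Save the results to disk
import os

os.makedirs("results", exist_ok=True)  # to_csv/to_excel do not create missing folders
bookDf.to_csv("results/book_info.csv")
bookDf.to_excel("results/book_info.xlsx")  # needs an Excel writer such as openpyxl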

## Ethical considerations

**You can scrape it, should you though?**

A very good summary of practices for [ethical web scraping](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01).

Some other [important components](http://robertorocha.info/on-the-ethics-of-web-scraping/) of ethical web scraping practices include:

* Read the Terms of Service and Privacy Policies of a website before scraping it (this might not be possible in many situations though).
* If it’s not clear from looking at the website, contact the webmaster and ask whether, and what, you’re allowed to harvest.
* Be gentle on smaller websites:
    * Run your scraper in off-peak hours.
    * Space out your requests.
* Identify yourself by name and email in your User-Agent string.
* Inspect the **robots.txt** file for rules about which pages can be scraped, indexed, etc.
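
Spacing out requests can be as simple as sleeping between them. The following is a minimal sketch, not a complete crawler: the function name `fetch_politely`, the stand-in `fake_fetch` and the example URLs are hypothetical, and the fetching function is passed in as an argument so the example runs without touching the network.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results[url] = fetch(url)
    return results


def fake_fetch(url):
    """Stand-in for a real download so the sketch runs offline."""
    return f"<html>content of {url}</html>"


start = time.monotonic()
pages = fetch_politely(
    ["http://example.com/a", "http://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
elapsed = time.monotonic() - start
print(len(pages), "pages fetched in", round(elapsed, 2), "seconds")
```

In a real scraper, `fetch` could be a small function that calls `requests.get(url, headers=headers, timeout=10).text`, with a delay of a second or more for small websites.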

### What is a robots.txt?

robots.txt is a simple text file placed on the web server that tells crawlers which files they can and cannot access. The convention it follows is called the _Robots Exclusion Protocol_.

![image.png](attachment:image.png)

In [None]:
import requests

# some examples
print(requests.get('https://www.google.com/robots.txt').text)
print('-----')
print(requests.get('https://www.boxofficemojo.com/robots.txt').text)
print('-----')
print(requests.get('https://www.howtogeek.com/robots.txt').text)
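
Rather than reading robots.txt by eye, the rules can also be checked programmatically with the standard library's `urllib.robotparser`. The sketch below parses a made-up robots.txt from a string so that it runs offline; the bot name `MyCourseBot` and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content used only for illustration
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) reports whether the rules allow that agent to fetch the URL
can_public = parser.can_fetch("MyCourseBot", "https://example.com/catalogue/page-1.html")
can_private = parser.can_fetch("MyCourseBot", "https://example.com/private/data.html")

print("public page allowed: ", can_public)
print("private page allowed:", can_private)
```

For a live website, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` downloads and parses the file instead of `parse()`.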

#### What's a User-Agent?

A User-Agent is a string identifying the browser and operating system to the web server. It's your machine's way of saying _Hi, I am Chrome on macOS_ to a web server.

#### Q: Why do web servers use user agents?

Web servers use user agents for a variety of purposes:
* Serving different web pages to different web browsers. This can be used for good – for example, to serve simpler web pages to older browsers – or evil – for example, to display a “This web page must be viewed in Internet Explorer” message.
* Displaying different content to different operating systems – for example, by displaying a slimmed-down page on mobile devices.
* Gathering statistics showing the browsers and operating systems in use by their users. If you ever see browser market-share statistics, this is how they’re acquired.

Let's break down the structure of the User-Agent sent by a human-operated browser:

```Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405```

The components of this string are as follows:

* `Mozilla/5.0`: Previously used to indicate compatibility with the Mozilla rendering engine.
* `(iPad; U; CPU OS 3_2_1 like Mac OS X; en-us)`: Details of the system on which the browser is running.
* `AppleWebKit/531.21.10`: The platform the browser uses.
* `(KHTML, like Gecko)`: Browser platform details.
* `Mobile/7B405`: This is used by the browser to indicate specific enhancements that are available directly in the browser or through third parties. An example of this is Microsoft Live Meeting, which registers an extension so that the Live Meeting service knows whether the software is already installed and can provide a streamlined experience for joining meetings.
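
The same breakdown can be reproduced in code. This is a rough sketch that assumes the User-Agent follows the common `product/version (comment)` layout; the regular expression is an illustrative assumption, not a complete User-Agent parser.

```python
import re

user_agent = (
    "Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) "
    "AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405"
)

# product/version, optionally followed by a comment in parentheses
pattern = re.compile(r"([^\s/()]+)/(\S+)(?:\s+\(([^)]*)\))?")

components = [
    {"product": product, "version": version, "comment": comment}
    for product, version, comment in pattern.findall(user_agent)
]

for component in components:
    print(component)
```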

When scraping websites, it is a good idea to include your contact information in a custom **User-Agent** string so that the webmaster can contact you. For example:

In [None]:
headers = {
    'User-Agent': 'BGSU bot',
    'From': 'myname@bgsu.edu'
}
response = requests.get('https://www.google.com/', headers=headers)
# the headers that were actually sent with the request
print(response.request.headers)