In [None]:
# Before we begin, run this cell if you are using Colab
!git clone https://github.com/danielinux7/StemLab.git

# Web Scraping

## Content
1. HTML pages
2. Chrome DevTools
3. Web scraping packages
    * BeautifulSoup
        * Exercise
    * Selenium
        * Exercise
4. Ethical considerations of web scraping

## What you will be able to do after the tutorial
* Inspect an HTML page and identify which parts you want to scrape.
* Scrape web pages with `requests` and `BeautifulSoup`.
* Navigate JavaScript elements with `Selenium`.
* Judge when web scraping is the most suitable approach and what you should consider before doing so (be a good citizen of the Internet).


## HTML page structure

**Hypertext Markup Language (HTML)** is the standard markup language for documents designed to be displayed in a web browser. HTML describes the structure of a web page, and it can be combined with **Cascading Style Sheets (CSS)** and a scripting language such as **JavaScript** to create interactive websites. HTML consists of a series of elements that "tell" the browser how to display the content. Elements are represented by **tags**.

Here are some tags:
* `<!DOCTYPE html>` declaration defines this document to be HTML5.  
* `<html>` element is the root element of an HTML page.  
* `<div>` tag defines a division or a section in an HTML document. It's usually a container for other elements.
* `<head>` element contains meta information about the document.  
* `<title>` element specifies a title for the document.  
* `<body>` element contains the visible page content.  
* `<h1>` element defines a large heading.  
* `<p>` element defines a paragraph.  
* `<a>` element defines a hyperlink.

HTML tags normally come in pairs like `<p>` and `</p>`. The first tag in a pair is the opening tag, the second tag is the closing tag. The end tag is written like the start tag, but with a slash inserted before the tag name.

![](https://github.com/danielinux7/StemLab/blob/master/3-Web-Scraping/figures/tags.png?raw=1)

HTML has a tree-like 🌳 🌲 structure thanks to the **Document Object Model (DOM)**, a cross-platform and language-independent interface. Here's what a very simple HTML tree looks like.

![](https://github.com/danielinux7/StemLab/blob/master/3-Web-Scraping/figures/dom_tree.gif?raw=1)
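
To make the tree idea concrete, here is a minimal sketch that prints the nesting of a small page using only Python's standard library. The `TreePrinter` class, the `VOID_TAGS` set and the sample HTML are made up for this illustration; they are not part of any scraping package.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TreePrinter(HTMLParser):
    """Collect one indented line per opening tag to show the nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


sample_html = """
<html>
  <head><title>My title</title></head>
  <body>
    <h1>A heading</h1>
    <a href="https://example.com">A link</a>
  </body>
</html>
"""

printer = TreePrinter()
printer.feed(sample_html)
outline = "\n".join(printer.lines)
print(outline)
```

Each level of indentation in the printed outline corresponds to one level of the tree: `head` and `body` are children of `html`, and `h1` and `a` are children of `body`.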


## Creating a simple HTML page

In [None]:
# display and HTML are part of the public IPython.display module
from IPython.display import display, HTML

In [None]:
display(HTML("""
<!DOCTYPE html>
<html lang="en" dir="ltr">
<head>
  <title>Intro to HTML</title>
</head>

<body>
  <h1>Heading h1</h1>
  <h2>Heading h2</h2>
  <h3>Heading h3</h3>
  <h4>Heading h4</h4>

  <p>
    That's a text paragraph. You can also <b>bold</b>, <mark>mark</mark>, <ins>underline</ins>, <del>strikethrough</del> and <i>emphasize</i> words.
    You can also add links - here's one to <a href="https://en.wikipedia.org/wiki/Main_Page">Wikipedia</a>.
  </p>

  <p>
    This <br> is a paragraph <br> with <br> line breaks
  </p>

  <p style="color:red">
    Add colour to your paragraphs.
  </p>

  <p>Unordered list:</p>
  <ul>
    <li>Python</li>
    <li>R</li>
    <li>Julia</li>
  </ul>

  <p>Ordered list:</p>
  <ol>
    <li>Data collection</li>
    <li>Exploratory data analysis</li>
    <li>Data analysis</li>
    <li>Policy recommendations</li>
  </ol>
  <hr>

  <!-- This is a comment -->

</body>
</html>
"""))

## Chrome DevTools: exercise

[Chrome DevTools](https://developers.google.com/web/tools/chrome-devtools/) is a set of web developer tools built directly into the Google Chrome browser. DevTools can help you view and edit web pages. We will use Chrome's tool to inspect an HTML page and find which elements correspond to the data we might want to scrape.

### Short exercise
To get some experience with the HTML page structure and Chrome DevTools, we will search and locate elements in [Apsnypress](https://www.apsnypress.info/). 

**Tip**: Hit *Command+Option+C* (Mac) or *Control+Shift+C* (Windows, Linux) to access the elements panel.

#### Tasks (we will do them together)
* Find the flag images on the main page.
* Find the headers in the news list.
* Find the links in the pagination strip.
* Locate one of the photos in the news list.
* Locate the "read more" link in the news list.

# `BeautifulSoup`

We will use `requests` and `BeautifulSoup` to access and scrape the content of [Sputnik Abkhazia](https://sputnik-abkhazia.info). We need to scrape parallel articles in Russian and Abkhazian, then build a parallel corpus of text.

### What is `BeautifulSoup`?

It is a Python library for pulling data out of HTML and XML files. It provides methods to navigate the document's tree structure that we discussed before and scrape its content.
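
As a small offline illustration of that navigation, the sketch below parses an HTML string and pulls out links by tag name and class. The HTML snippet and its `href` values are invented for the example; only the `list__title` class name is borrowed from the pages scraped later in this tutorial.

```python
from bs4 import BeautifulSoup

# A hypothetical snippet; real pages will use different tags and classes.
sample_html = """
<div class="list">
  <a class="list__title" href="/20220609/first.html">First article</a>
  <a class="list__title" href="/20220609/second.html">Second article</a>
  <a class="other" href="/about">About</a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching element (or None if nothing matches).
first = soup.find("a", class_="list__title")
print(first.get_text())   # text inside the tag
print(first.get("href"))  # value of the href attribute

# find_all() returns every matching element.
hrefs = [a.get("href") for a in soup.find_all("a", class_="list__title")]
print(hrefs)
```

The same `find` and `find_all` calls work on a page downloaded with `requests`, because `BeautifulSoup` only needs the HTML text.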

### Our pipeline

![](https://github.com/danielinux7/StemLab/blob/master/3-Web-Scraping/figures/scrape-pipeline2.png?raw=1)

In [1]:
# Imports
import requests
from bs4 import BeautifulSoup

In [None]:
# Sputnik's homepage for news in Abkhazia
url_ru = "https://sputnik-abkhazia.ru/Abkhazia"
url_ab = "https://sputnik-abkhazia.info"

# After some research, it seems the website groups articles by the day they were released.
# We need to append the date as YYYYMMDD, e.g. 20220609
ymd = "20220609"

# Use requests to retrieve data from a given URL
sputnik_response_ru = requests.get(url_ru+"/"+ymd)
sputnik_response_ab = requests.get(url_ab+"/"+ymd)

# Parse the whole HTML page using BeautifulSoup
sputnik_soup_ru = BeautifulSoup(sputnik_response_ru.text, 'html.parser')
sputnik_soup_ab = BeautifulSoup(sputnik_response_ab.text, 'html.parser')

# Title of the parsed page
print(sputnik_soup_ru.title)
print(sputnik_soup_ab.title)
# sputnik_soup_ru.title.string
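
If you want more than one day, the date part of the URL can be generated rather than typed by hand. The helper below is a sketch under the assumption that every daily page follows the same `base_url/YYYYMMDD` pattern used above; `daily_archive_urls` is a hypothetical name, and the function only builds the URLs without downloading anything.

```python
from datetime import date, timedelta


def daily_archive_urls(base_url, start, end):
    """Return one archive URL per day from start to end (inclusive)."""
    urls = []
    current = start
    while current <= end:
        # strftime("%Y%m%d") gives the YYYYMMDD form, e.g. 20220609
        urls.append(f"{base_url}/{current.strftime('%Y%m%d')}")
        current += timedelta(days=1)
    return urls


archive_urls = daily_archive_urls(
    "https://sputnik-abkhazia.info", date(2022, 6, 8), date(2022, 6, 10)
)
for archive_url in archive_urls:
    print(archive_url)
```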

In [None]:
# Print the HTML of the web page. Use Chrome DevTools alongside it to better understand the page,
# e.g. https://sputnik-abkhazia.ru/Abkhazia/20220609/
print(sputnik_soup_ru.prettify())

### Find links

In many cases, it is useful to collect the links contained in a webpage (for example, you might want to scrape them too). Here is how you can do this.

In [None]:
# Collect first link from the articles list
first_link = sputnik_soup_ab.find('a', {'class': 'list__title'})
print(first_link.prettify())


In [None]:
first_link.get('href')

In [4]:
# Find all links

# With a for loop:
# links_ab = []
# for link in sputnik_soup_ab.find_all('a', {'class': 'list__title'}):
#   links_ab.append(link.get('href'))

# The same loop in one line (list comprehension: create a new list based on the values of an existing list)
links_ru = [link.get('href') for link in sputnik_soup_ru.find_all('a', {'class': 'list__title'})]
links_ab = [link.get('href') for link in sputnik_soup_ab.find_all('a', {'class': 'list__title'})]

# Prepend the homepage, skip empty hrefs and keep only the unique links (in their original order)
fixed_links_ru = list(dict.fromkeys(url_ru + link for link in links_ru if link))
fixed_links_ab = list(dict.fromkeys(url_ab + link for link in links_ab if link))

In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=1)
pp.pprint(fixed_links_ab)
print()
pp.pprint(fixed_links_ru)
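
Gluing the homepage and the `href` together as strings works when every `href` is a path such as `/20220609/...`. If a page mixes relative and absolute links, `urllib.parse.urljoin` from the standard library resolves both kinds the way a browser would. The sketch below uses invented `href` values and a hypothetical helper name, `absolute_unique_links`. Note that `urljoin` replaces any path in the base URL when the `href` starts with `/`, so its result can differ from plain concatenation.

```python
from urllib.parse import urljoin


def absolute_unique_links(base_url, hrefs):
    """Resolve hrefs against base_url, dropping empty values and duplicates."""
    resolved = [urljoin(base_url, href) for href in hrefs if href]
    # dict.fromkeys keeps the first occurrence of each link, in order.
    return list(dict.fromkeys(resolved))


# Hypothetical hrefs, similar in shape to what a listing page might contain.
sample_hrefs = [
    "/20220609/first.html",
    "/20220609/second.html",
    "/20220609/first.html",  # duplicate
    None,  # an <a> tag without an href attribute
    "https://sputnik-abkhazia.info/20220609/third.html",  # already absolute
]

links = absolute_unique_links("https://sputnik-abkhazia.info", sample_hrefs)
for link in links:
    print(link)
```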

### Collect articles

In [6]:
# Collect all articles by requesting the pages with their links.
article__texts_ab = []
article__texts_ru = []

for link in fixed_links_ab:
  article_response = requests.get(link)
  article_soup = BeautifulSoup(article_response.text, 'html.parser')
  article__texts_ab.append(article_soup.find_all('div', {'class': 'article__text'}))

for link in fixed_links_ru:
  article_response = requests.get(link)
  article_soup = BeautifulSoup(article_response.text, 'html.parser')
  article__texts_ru.append(article_soup.find_all('div', {'class': 'article__text'}))

In [None]:
# Print the articles to the standard output
for article__text in article__texts_ab:
  for text in article__text:
    print(text.getText())
  print()

In [10]:
# Write the articles to two files
# (UTF-8 so that Abkhazian and Russian text is saved correctly;
# the with-blocks close each file when writing is done)
with open("ab.txt", "w", encoding="utf-8") as ab_file:
  for article__text in article__texts_ab:
    for text in article__text:
      print(text.getText(), file=ab_file)
    print(file=ab_file)

with open("ru.txt", "w", encoding="utf-8") as ru_file:
  for article__text in article__texts_ru:
    for text in article__text:
      print(text.getText(), file=ru_file)
    print(file=ru_file)

## Advanced web scraping tools

**[Scrapy](https://scrapy.org)** is a Python framework for large-scale web scraping. It gives you the tools you need to efficiently extract data from websites, process it as you want, and store it in your preferred structure and format.

**[ARGUS](https://github.com/datawizard1337/ARGUS)** is an easy-to-use web mining tool that's built on Scrapy. It is able to crawl a broad range of different websites.

**[Selenium](https://selenium-python.readthedocs.io/index.html)** is an umbrella project encapsulating a variety of tools and libraries enabling web browser automation. Selenium specifically provides infrastructure for the W3C WebDriver specification, a platform- and language-neutral coding interface compatible with all major web browsers. We can use it to imitate a user's behaviour and interact with JavaScript elements (buttons, sliders, etc.).



## Ethical considerations

**You can scrape it, should you though?**

A very good summary of practices for [ethical web scraping](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01):

* If you have a public API that provides the data I’m looking for, I’ll use it and avoid scraping altogether.
* I will only save the data I absolutely need from your page.
* I will respect any content I do keep. I’ll never pass it off as my own.
* I will look for ways to return value to you. Maybe I can drive some (real) traffic to your site or credit you in an article or post.
* I will respond in a timely fashion to your outreach and work with you towards a resolution.
* I will scrape for the purpose of creating new value from the data, not to duplicate it.

Some other [important components](http://robertorocha.info/on-the-ethics-of-web-scraping/) of ethical web scraping practices include:

* Read the Terms of Service and Privacy Policies of a website before scraping it (this might not be possible in many situations though).
* If it’s not clear from looking at the website, contact the webmaster and ask if and what you’re allowed to harvest.
* Be gentle on smaller websites:
    * Run your scraper in off-peak hours.
    * Space out your requests.
* Identify yourself by name and email in your User-Agent strings.
* Inspect the **robots.txt** file for rules about what pages can be scraped, indexed, etc.
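
One way to space out requests is to pause between downloads. The sketch below is only an illustration: `fetch_politely` and `fake_fetch` are hypothetical names, the one-second default delay is an arbitrary choice rather than a recommended value, and the download step is passed in as a function so that the example runs without touching any website.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request but the first
        pages[url] = fetch(url)
    return pages


# Stand-in for a real download so the example runs offline.
def fake_fetch(url):
    return f"<html><title>{url}</title></html>"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(len(pages))
```

With `requests`, the `fetch` argument could be something like `lambda url: requests.get(url, timeout=10).text`, ideally with headers that identify you.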

### What is a robots.txt?

A simple text file placed on the web server that tells crawlers which files they can and cannot access. The convention it follows is called the _Robots Exclusion Protocol_.

![](https://github.com/danielinux7/StemLab/blob/master/3-Web-Scraping/figures/robots.png?raw=1)

#### Some examples

In [None]:
print(requests.get('https://sputnik-abkhazia.info/robots.txt').text)

# Check out apsnypress's robots file on their site
# https://apsnypress.info/robots.txt
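
Reading the rules by eye is fine for one site, but Python's standard library can also interpret them for you. The sketch below feeds a made-up set of rules to `urllib.robotparser.RobotFileParser` and asks whether two example URLs may be fetched. The rules, the `my-bot` name and the `example.com` URLs are all invented for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, written only for this example.
sample_robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

allowed = parser.can_fetch("my-bot", "https://example.com/news/today.html")
blocked = parser.can_fetch("my-bot", "https://example.com/private/data.html")
print(allowed, blocked)
```

To check a live site instead, call `parser.set_url(...)` with the address of its robots.txt file, followed by `parser.read()`, before using `can_fetch`.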

#### What's a User-Agent?

A User-Agent is a string identifying the browser and operating system to the web server. It's your machine's way of saying _Hi, I am Chrome on macOS_ to a web server.

Web servers use user agents for a variety of purposes:
* Serving different web pages to different web browsers. This can be used for good – for example, to serve simpler web pages to older browsers – or evil – for example, to display a “This web page must be viewed in Internet Explorer” message.
* Displaying different content to different operating systems – for example, by displaying a slimmed-down page on mobile devices.
* Gathering statistics showing the browsers and operating systems in use by their users. If you ever see browser market-share statistics, this is how they’re acquired.

Let's break down the structure of a human-operated User-Agent:

```Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405```

The components of this string are as follows:

* Mozilla/5.0: Previously used to indicate compatibility with the Mozilla rendering engine.
* (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us): Details of the system in which the browser is running.
* AppleWebKit/531.21.10: The rendering engine (platform) the browser uses, followed by its version.
* (KHTML, like Gecko): Browser platform details.
* Mobile/7B405: This is used by the browser to indicate specific enhancements that are available directly in the browser or through third parties. An example of this is Microsoft Live Meeting which registers an extension so that the Live Meeting service knows if the software is already installed, which means it can provide a streamlined experience to joining meetings.
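
The same breakdown can be done in code. The sketch below splits the example string into `product/version` pairs, each optionally followed by a comment in parentheses. The regular expression is a simplification written for this one string; real User-Agent strings vary a lot, so treat it as an illustration rather than a general parser.

```python
import re

user_agent = (
    "Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) "
    "AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405"
)

# Each component is "product/version", optionally followed by "(comment)".
pattern = re.compile(r"(\S+?)/(\S+)(?:\s+\(([^)]*)\))?")

components = pattern.findall(user_agent)
for product, version, comment in components:
    print(f"{product:12} {version:12} {comment}")
```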

When scraping websites, it is a good idea to include your contact information as a custom **User-Agent** string so that the webmaster can get in contact. For example:

In [None]:
headers = {
    'User-Agent': 'Nart Tlisha bot',
    'From': 'daniel.abzakh@gmail.com'
}
response = requests.get('https://apsnypress.info', headers=headers)
# Show the headers that were actually sent with the request
print(response.request.headers)

## Additional resources/references:

* [Document Object Model (DOM)](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction)
* [HTML elements reference guide](https://www.w3schools.com/tags/default.asp)
* [About /robots.txt](https://www.robotstxt.org/robotstxt.html)
* [The robots.txt file](https://varvy.com/robottxt.html)
* [Ethics in Web Scraping](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01)
* [On the Ethics of Web Scraping](http://robertorocha.info/on-the-ethics-of-web-scraping/)
* [User-Agent](https://en.wikipedia.org/wiki/User_agent)
* [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Selenium Python - Unofficial documentation](https://selenium-python.readthedocs.io/)
* [ARGUS paper](http://ftp.zew.de/pub/zew-docs/dp/dp18033.pdf)
* [Brian C. Keegan](http://www.brianckeegan.com/)'s excellent [5-week web scraping course](https://github.com/CU-ITSS/Web-Data-Scraping-S2019) intended for researchers in the social sciences and humanities.