# Web Scraping

<img src="http://images.fandango.com/MDCsite/images/featured/201405/no1_great_power_comic.jpg" width=600>

## Foreword: 

A cautionary tale:

https://en.wikipedia.org/wiki/Aaron_Swartz

http://www.rollingstone.com/culture/news/the-brilliant-life-and-tragic-death-of-aaron-swartz-20130215

Other readings:

https://www.quora.com/What-is-the-legality-of-web-scraping

https://www.bna.com/legal-issues-raised-by-the-use-of-web-crawling-and-scraping-tools-for-analytics-purposes

## Take-away message

__==> Before embarking on any web scraping exercise to collect data for your research, talk to your supervisor. <==__

## The Document Object Model (DOM)

"The ___Document Object Model (DOM)___ is a cross-platform and language-independent convention for _representing and interacting_ with objects in HTML, XHTML, and XML documents. 

The _nodes_ of every document are organized in a _tree structure_, called the ___DOM tree___. Objects in the DOM tree may be addressed and manipulated by using methods on the objects. 
The public interface of a DOM is specified in its application programming interface (API)." 
Source: https://en.wikipedia.org/wiki/Document_Object_Model

<img src="https://telerikhelper.files.wordpress.com/2013/04/image15.png" width=800>


In practice, what happens when you navigate the web is that your browser creates the DOM tree after parsing the HTML code. This is done following the specifications and rules provided by the World Wide Web Consortium (W3C, https://www.w3.org/TR/REC-DOM-Level-1/level-one-core.html  ).

Once the DOM has been generated, it can be accessed from programming languages such as JavaScript. Because the DOM exposes every aspect of a web page as an object, all of its content (elements, attributes, styles, and so on) can be read and modified.

<img src="https://telerikhelper.files.wordpress.com/2013/04/image14.png" width=800>


In this class, we are going to learn how we can search for specific elements within the DOM tree using Python.
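To make the idea of a tree of nodes concrete, here is a minimal sketch that prints an indented outline of a small page using only Python's standard-library `html.parser` module. The `TreePrinter` class and the `page` string are hypothetical, made up for this illustration, and the sketch assumes that every opening tag has a matching closing tag (void elements such as `<img>` would need extra handling).

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collect an indented outline of the element tree (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # current nesting level in the tree
        self.lines = []  # one outline line per node

    def handle_starttag(self, tag, attrs):
        # An opening tag creates a new node one level deeper
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag moves back up to the parent node
        self.depth -= 1

    def handle_data(self, data):
        # Text inside an element is a child (leaf) node of that element
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


# Made-up page, assumed to be well formed
page = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Hello</h1><p>Some text.</p></body></html>"
)

printer = TreePrinter()
printer.feed(page)
printer.close()
print("\n".join(printer.lines))
```

Each level of indentation in the printed outline corresponds to one step down the tree, from `html` to its children `head` and `body`, and so on down to the text nodes.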

# HTML Introduction

from https://www.w3schools.com/html/html_intro.asp

## What is HTML?
HTML is the standard markup language for creating Web pages.

- HTML stands for Hyper Text Markup Language
- HTML describes the structure of Web pages using markup
- HTML elements are the building blocks of HTML pages
- HTML elements are represented by tags
- HTML tags label pieces of content such as "heading", "paragraph", "table", and so on
- Browsers do not display the HTML tags, but use them to render the content of the page

## Example

## HTML Tags
HTML tags are element names surrounded by angle brackets:

- HTML tags normally come in pairs like $<p>$ and $</p>$
- The first tag in a pair is the start tag (or opening tag), the second tag is the end tag (or closing tag)
- The end tag is written like the start tag, but with a forward slash inserted before the tag name
- The purpose of a web browser (Chrome, IE, Firefox, Safari) is to read HTML documents and display them
- The browser does not display the HTML tags, but uses them to determine how to display the document

## HTML Basic Examples

### HTML Documents

- All HTML documents must start with a document type declaration: <!DOCTYPE html>.
- The HTML document itself begins with $<html>$ and ends with $</html>$.
- The visible part of the HTML document is between $<body>$ and $</body>$.

<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>

### HTML Headings

- HTML headings are defined with the $<h1>$ to $<h6>$ tags.
- $<h1>$ defines the most important heading. $<h6>$ defines the least important heading. 

<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>
<h4>This is heading 4</h4>

### HTML Paragraphs
- HTML paragraphs are defined with the $<p>$ tag

<p>This is a paragraph.</p>
<p>This is another paragraph.</p>

### HTML Links
- HTML links are defined with the $<a>$ tag

<a href="https://www.w3schools.com">This is a link</a>

- The link's destination is specified in the href attribute. 
- Attributes are used to provide additional information about HTML elements.

### HTML Images
- HTML images are defined with the $<img>$ tag
- The source file (src), alternative text (alt), width, and height are provided as attributes.
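Attributes such as `href` and `src` are usually what a scraper is after. The following is a minimal sketch, using only the standard-library `html.parser` module, of how the attributes of link and image tags can be read from Python. The `AttributeCollector` class, the image file name, and the attribute values in `snippet` are assumptions made up for this illustration.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record the attributes of <a> and <img> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []   # href values of <a> tags
        self.images = []  # attribute dictionaries of <img> tags

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attributes = dict(attrs)
        if tag == "a":
            self.links.append(attributes.get("href"))
        elif tag == "img":
            self.images.append(attributes)


# Made-up snippet: the image file and its dimensions are illustrative only
snippet = """
<a href="https://www.w3schools.com">This is a link</a>
<img src="example.jpg" alt="An example image" width="104" height="142">
"""

collector = AttributeCollector()
collector.feed(snippet)
collector.close()

print(collector.links)
print(collector.images[0]["src"], collector.images[0]["alt"])
```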

# How to Read HTML in Python: Beautiful Soup

"Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work."

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

## Toy Example

In [1]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [2]:
from bs4 import BeautifulSoup

In [3]:
# Make a soup
soup = BeautifulSoup(html_doc,'lxml')

In [4]:
# Let us see what it looks like
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


In [5]:
# Get the title node
print(soup.title)


<title>The Dormouse's story</title>


In [6]:
title = soup.title

In [10]:
title.text

"The Dormouse's story"

In [11]:
# Get the string contained in the title node
title.string



"The Dormouse's story"

In [12]:
# Get the name of the parent node
title.parent.name



'head'

In [13]:
# Print the first paragraph of the webpage
soup.p



<p class="title"><b>The Dormouse's story</b></p>

In [14]:
soup.p['class']


['title']

In [15]:
soup.p.attrs

{'class': ['title']}

In [16]:
# Get all the link-type nodes in the webpage
soup.find_all('a')


[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [17]:
soup.find(id='link3')

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [20]:
# Print the URL (href attribute) of every link-type node
for link in soup.find_all('a'):
    print(link.get('href'))



http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
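Searches can also be narrowed by attribute. The sketch below, which assumes the `bs4` package is installed, filters nodes by CSS class with `find_all` and by a CSS selector with `select`. It uses a shortened copy of the toy document and the built-in `"html.parser"` backend so that it runs without `lxml`; the variable names are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the toy document, so that the example is self-contained
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Keep only the <a> nodes whose class attribute contains "sister"
# (the keyword is class_ because class is a reserved word in Python)
sisters = soup.find_all("a", class_="sister")
names = [a.get_text() for a in sisters]

# The same nodes, selected with a CSS selector:
# every <a> nested inside a <p class="story">
story_links = soup.select("p.story a")
link_by_id = {a["id"]: a["href"] for a in story_links}

print(names)
print(link_by_id)
```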


In [21]:
soup.get_text()


"The Dormouse's story\n\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n"

In [22]:
print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...



# Robots Exclusion Standard (robots.txt)

https://en.wikipedia.org/wiki/Robots_exclusion_standard

http://www.robotstxt.org/robotstxt.html

- It is a standard used by websites to communicate with web crawlers and other web robots
- The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned
- Robots are often used by search engines to categorize web sites
- Not all robots cooperate with the standard; email harvesters, spambots, malware, and robots that scan for security vulnerabilities may even start with the portions of the website where they have been told to stay out

In practice,
- when a site owner wishes to give instructions to web robots they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt)
- this text file contains the instructions in a specific format
- robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the web site
- if this file doesn't exist, web robots assume that the site owner wishes to provide no specific instructions, and crawl the entire site.
- a robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. 


## Examples

Tell all robots that they can visit all files (the wildcard * stands for all robots, and the Disallow directive has no value, meaning that no pages are disallowed):

Tell all robots to stay out of a website:

Tell all robots not to enter three directories:

Tell all robots to stay away from one specific file:
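The four situations described above can be checked from Python with the standard-library `urllib.robotparser` module. This is a minimal sketch: the robots.txt contents below are written to match the descriptions, but the directory names, the file name, and the `https://www.example.com` URLs are assumptions made up for this illustration, as is the `can_visit` helper.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt contents; all paths are made up
EXAMPLES = {
    "allow everything": "User-agent: *\nDisallow:\n",
    "block everything": "User-agent: *\nDisallow: /\n",
    "block three directories": (
        "User-agent: *\n"
        "Disallow: /cgi-bin/\n"
        "Disallow: /tmp/\n"
        "Disallow: /junk/\n"
    ),
    "block one file": "User-agent: *\nDisallow: /directory/file.html\n",
}


def can_visit(robots_txt, url, user_agent="*"):
    """Return True if robots_txt allows user_agent to fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() takes the lines of the file, so no network access is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


base = "https://www.example.com"
for label, text in EXAMPLES.items():
    for path in ("/index.html", "/tmp/data.csv", "/directory/file.html"):
        print(label, path, can_visit(text, base + path))
```

In a real crawler, the same check would typically be run against the site's actual robots.txt before requesting any other page.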

# Python "requests" module

http://docs.python-requests.org/en/master/

Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. 

In other words, we are going to use "requests" to download web pages from the Internet and read them *as if* they were files saved on our hard drives.

In [23]:
import requests


In [24]:
r = requests.get("http://www.mobs-lab.org/robots.txt")

In [26]:
r

<Response [200]>

In [27]:
r.text

'Sitemap: http://www.mobs-lab.org/sitemap.xml\n\nUser-agent: *\nDisallow: /\n'

In [28]:
print(r.text)

Sitemap: http://www.mobs-lab.org/sitemap.xml

User-agent: *
Disallow: /



In [29]:
r.status_code

200



### HTTP response status codes

https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
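Status codes are grouped into families by their first digit. As a minimal sketch, the standard-library `http.HTTPStatus` enumeration can translate a numeric code, such as the value of `r.status_code`, into its standard reason phrase. The `describe_status` helper and the family labels are hypothetical names chosen for this illustration.

```python
from http import HTTPStatus

# The first digit of a status code identifies its family
FAMILIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a description such as '200 OK (success)' (hypothetical helper)."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered codes known to the standard library
        phrase = "Unknown"
    family = FAMILIES.get(code // 100, "unknown family")
    return f"{code} {phrase} ({family})"


for code in (200, 301, 404, 503):
    print(describe_status(code))
```

When scraping, a code in the 4xx or 5xx families generally means that the page content should not be parsed as if the download had succeeded.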

In [31]:
r.encoding


'UTF-8'

In [32]:
r.headers

{'Date': 'Thu, 28 Sep 2017 21:40:36 GMT', 'Server': 'Apache', 'Cache-Control': 'public', 'ETag': 'W/"8d4cfd8f77dd91b28b8b34f4267eefb0-gzip"', 'Vary': 'Accept-Encoding,User-Agent', 'Content-Encoding': 'gzip', 'X-Host': 'pages27.sf2p.intern.weebly.net', 'X-UA-Compatible': 'IE=edge,chrome=1', 'Content-Length': '88', 'Keep-Alive': 'timeout=10, max=74', 'Connection': 'Keep-Alive', 'Content-Type': 'text/plain; charset=UTF-8'}