# Web Scraping
### Applied Data Science 

- Web scraping (web harvesting or web data extraction) is data scraping used for extracting data from websites
- Web scraping should be done only with the permission of the website’s administrators. Doing otherwise may result in significant costs attributed to hosts
- Scraping technologies must tolerate several artifacts of real-world HTML, such as unclosed or malformed tags
- These errors make information retrieval difficult
- Several tools have been developed to tackle this problem
- BeautifulSoup will be discussed and used as an example
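
Part of the permission check mentioned above can be automated: many sites publish a `robots.txt` file stating which paths crawlers may visit. The sketch below uses Python's standard `urllib.robotparser` on a made-up `robots.txt`. The rules, the `my-scraper` user agent, and the `example.com` URLs are all illustrative assumptions. It complements, rather than replaces, reading the site's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it from the site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/articles/1"))
```

For a live site, `RobotFileParser` can also be pointed at the real file with `set_url(...)` followed by `read()`.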

In [1]:
from bs4 import BeautifulSoup
import requests

This imports BeautifulSoup, which is used for parsing, and `requests`, which is used later to download pages.

In [2]:
source = """
<!DOCTYPE html>  
<html>  
  <head>
    <title>Scraping</title>
  </head>
  <body class="col-sm-12">
    <h1>section1</h1>
    <p>paragraph1</p>
    <p>paragraph2</p>
    <div class="col-sm-2">
      <h2>section2</h2>
      <p>paragraph3</p>
      <p>unclosed
    </div>
  </body>
</html>  
"""

soup = BeautifulSoup(source, "html.parser")

- `<!DOCTYPE html>`: HTML documents must start with a type declaration.
- The HTML document is contained between `<html>` and `</html>`.
- Metadata and script declarations for the HTML document are between `<head>` and `</head>`.
- The visible part of the HTML document is between `<body>` and `</body>`.
- Headings are defined with the `<h1>` to `<h6>` tags.
- The section/division tags `<div>` are often used to segment the source.
- Paragraphs are defined with the `<p>` tag. Note that the final `<p>` in the example is left unclosed, which is the kind of error a scraper must tolerate.

## The DOM

The Document Object Model (DOM) is a cross-platform and language-independent application programming interface that treats an HTML, XHTML, or XML document as a tree structure wherein each node is an object representing a part of the document. Tools (including BeautifulSoup) parse HTML and produce a DOM-like representation.

![alt text](resources/dom.png)
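
To make the tree structure concrete, here is a minimal sketch that walks a parsed document and lists each tag indented by its depth. The helper name `outline` and the shortened HTML string are my own choices for illustration; only `BeautifulSoup`, `Tag`, and the `.children` iterator come from the library.

```python
from bs4 import BeautifulSoup, Tag

# A shortened, well-formed version of the example document
html = (
    "<html><head><title>Scraping</title></head>"
    "<body><h1>section1</h1><p>paragraph1</p></body></html>"
)


def outline(node, depth=0):
    """Return one indented line per tag, visiting the tree depth-first."""
    lines = []
    for child in node.children:
        # Text nodes are also children; only tags are listed here
        if isinstance(child, Tag):
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines


tree = outline(BeautifulSoup(html, "html.parser"))
print("\n".join(tree))
```

Each level of indentation corresponds to one level of nesting in the DOM, so `title` appears beneath `head`, which in turn appears beneath `html`.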

## Retrieving Information
- Once the HTML has been parsed, BeautifulSoup objects may be queried
- Two query functions exist:
    - `find` returns the first tag in the DOM that matches the query
    - `find_all` returns a list of all tags that match the query
- These functions are typically called with two parameters: the first defines the tag to be searched for (e.g. `head`), and the second (`attrs`) specifies attribute filters
- Attributes can therefore be handled by the search functions, e.g. to select only the segments with particular class attributes.
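
One practical difference between the two functions is what happens when nothing matches: `find` returns `None`, while `find_all` returns an empty list. The sketch below illustrates this on a tiny made-up document. The helper `text_or_default` is a hypothetical convenience function, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p class='a'>one</p><p>two</p></body>", "html.parser")

first = soup.find("p")                # first matching tag
all_p = soup.find_all("p")            # every matching tag
missing = soup.find("table")          # no match: None
none_found = soup.find_all("table")   # no match: empty list


def text_or_default(tag, default=""):
    """Return the tag's text, or a default when find() matched nothing."""
    return tag.get_text(strip=True) if tag is not None else default


print(text_or_default(first), len(all_p), text_or_default(missing, "n/a"))
```

Guarding against `None` in this way avoids an `AttributeError` when a page does not contain the expected tag.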

In [3]:
print("Head:")
print('', soup.find_all("head"))

Head:
 [<head>
<title>Scraping</title>
</head>]


In [4]:
print('\nType of head:')
# map() is lazy in Python 3, so wrap it in list() to display the types
print('', list(map(type, soup.find_all("head"))))


Type of head:
 [<class 'bs4.element.Tag'>]


In [5]:
print('\nTitle tag:')
print('', soup.find("title"))


Title tag:
 <title>Scraping</title>


In [6]:
print('\nTitle text:')
print('', soup.find("title").text)


Title text:
 Scraping


In [7]:
divs = soup.find_all("div", attrs={"class": "col-sm-2"})
print('\nDiv with class=col-sm-2:')
print('', divs)


Div with class=col-sm-2:
 [<div class="col-sm-2">
<h2>section2</h2>
<p>paragraph3</p>
<p>unclosed
    </p></div>]


In [8]:
print('\nClass of first div:')
print('', divs[0].attrs['class'])


Class of first div:
 ['col-sm-2']


In [9]:
print('\nAll paragraphs:')
print('', soup.find_all("p"))


All paragraphs:
 [<p>paragraph1</p>, <p>paragraph2</p>, <p>paragraph3</p>, <p>unclosed
    </p>]


## BeautifulSoup on real data

In this example I will show how BeautifulSoup can be used to retrieve information from live web pages. We make use of The Guardian newspaper, and retrieve the HTML from an arbitrary article. We then create the BeautifulSoup object, and query the links that were discovered in the DOM. Since a large number are returned, we then apply attribute filters that significantly reduce the number of returned links. I selected the filters for this example in order to focus on the names in the paper. The attribute values were discovered by using the inspect functionality of Google Chrome.


In [10]:
url = 'https://www.theguardian.com/technology/2017/jan/31/amazon-expedia-microsoft-support-washington-action-against-donald-trump-travel-ban'
req = requests.get(url)
source = req.text
soup = BeautifulSoup(source, 'html.parser')

In [11]:
links = soup.find_all('a')
print(links)

[<a class="u-h skip" data-link-name="skip : main content" href="#maincontent">Skip to main content</a>, <a class="new-header__logo" data-link-name="nav2 : logo" href="https://www.theguardian.com/uk">
<span class="u-h">The Guardian - Back to home</span>
<span class="inline-guardian-logo-160 inline-logo new-header__logo">
<svg class="new-header__logo__svg inline-guardian-logo-160__svg inline-logo__svg" height="30" viewbox="0 0 320 60" width="160">
<path d="M284 45h16v-3l-3-1.5v-20c1.2-.9 2.8-1.1 4.3-1.1 2.8 0 3.8.9 3.8 4.1v17l-3 1.5v3h16v-3l-3-1.5v-19c0-5.7-2.2-8.3-7.2-8.3-4.1 0-8.1 1.5-10.8 4V13h-1l-12.4 2.2v2.7l3.4 1.6v21l-3 1.5-.1 3zM245.3.4c-3 0-5.4 2.4-5.4 5.5 0 3 2.4 5.4 5.4 5.4 2.9 0 5.4-2.4 5.4-5.4-.1-3.1-2.5-5.5-5.4-5.5zM237 15.1v2.8l3 1.6v20.9l-3 1.5v3h16v-3l-3-1.5V13.1h-1l-12 2zM222.9 39c-.7.6-1.6 1.1-3.1 1.1-4 0-5.9-3.3-5.9-10.9 0-8.7 2.4-11.7 5.6-11.7 1.8 0 2.7.6 3.4 1.4V39zm0-24.5c-1.2-.9-3.2-1.4-4.9-1.4-7.4 0-14.5 4.3-14.5 16.8 0 11.9 7.1 15.7 11.8 15.7 3.8 0 6.4-1.7 7.6-3

In [12]:
links = soup.find_all('a', attrs={
    'data-component': 'auto-linked-tag'
})

for link in links: 
    print(link['href'], link.text)

https://www.theguardian.com/us-news/donaldtrump Donald Trump
https://www.theguardian.com/technology/amazon Amazon


## Chaining queries

Now, let us consider a more general query that might be done on a website such as this.
We will query the base technology page, and attempt to list all articles that are linked from this main page.

In [13]:
url = 'https://www.theguardian.com/uk/technology'
req = requests.get(url)
source = req.text
soup = BeautifulSoup(source, 'html.parser')

After inspecting the DOM (via the `inspect` tool in my browser), I see that the attribute that identifies
a `technology` article link is:
    
    class = "js-headline-text"

In [14]:
articles = soup.find_all('a', attrs={
    'class': 'js-headline-text'
})

for article in articles: 
    print(article['href'][:70], article.text[:20])

https://www.theguardian.com/technology/2017/jul/25/microsoft-paint-sav Program saved after 
https://www.theguardian.com/technology/2017/jul/25/roomba-maker-could- Roomba maker may sha
https://www.theguardian.com/technology/commentisfree/2017/jul/25/dont- Don't let the sun go
https://www.theguardian.com/technology/2017/jul/25/amazon-double-numbe Amazon to double num
https://www.theguardian.com/technology/2017/jul/25/splatoon-2-review-n Return of Nintendo's
https://www.theguardian.com/technology/2017/jul/24/facebook-cafeteria- Worker living in gar
https://www.theguardian.com/technology/2017/jul/24/smart-tvs-fridges-s Smart fridges and TV
https://www.theguardian.com/technology/2017/jul/24/internet-firms-shou Internet firms shoul
https://www.theguardian.com/technology/2017/jul/24/microsoft-paint-kil Microsoft Paint to b
https://www.theguardian.com/technology/2017/jul/24/pokemon-go-fest-fan Pokémon Go fans enra
https://www.theguardian.com/technology/gallery/2017/jul/21/joy-of-stic 10 greate

With this set of articles, it is now possible to chain further querying, for example with code 
similar to the following 

```python
for article in articles:
    # Fetch each linked article and parse it in turn
    req = requests.get(article['href'])
    source = req.text
    soup = BeautifulSoup(source, 'html.parser')

    # ... and so on ...
```

However, I won't go into much detail about this now. For scraping like this, tools such as `scrapy` are more
appropriate than `BeautifulSoup` since they are designed for large-scale, concurrent web crawling.
Once again, however, I urge caution and hope that before any crawling is initiated you determine whether
crawling is within the terms of use of the website.
If in doubt, contact the website administrators.

https://scrapy.org/
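
If a small chained crawl is done by hand, two details tend to matter: links found in a page may be relative or duplicated, and requests should be spaced out to limit the load on the host. The following is a minimal sketch under those assumptions. The helper names `normalise_links` and `crawl`, the one-second default delay, and the idea of passing in a `fetch` callable (for example a thin wrapper around `requests.get`) are all my own illustrative choices.

```python
import time
from urllib.parse import urldefrag, urljoin


def normalise_links(base_url, hrefs):
    """Resolve hrefs against base_url, drop fragments and duplicates, keep order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


def crawl(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is any callable mapping a URL to its page source.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # be polite to the host
        pages[url] = fetch(url)
    return pages


links = normalise_links(
    "https://www.theguardian.com/uk/technology",
    ["#maincontent", "/technology/amazon", "/technology/amazon#comments"],
)
print(links)
```

Keeping `fetch` as a parameter means the crawl logic can be tried out with a stand-in function before any real requests are sent.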

## Web APIs

- We have seen in the last section that parsing raw HTML is nontrivial since it will be necessary to contend with:
    - Evolving source code
    - Erroneous HTML tags
- Fortunately, web APIs for data retrieval exist, and these are generally less prone to the previous issues since:
    - Code is optimised for retrieval and not for visual layout/aesthetics
    - Standard serialisation formats (e.g. JSON) are typically used
    - The core items of interest have already been extracted (e.g. dates, URLs)

## Web API Examples

- https://github.com/toddmotto/public-apis
- https://en.wikipedia.org/wiki/List_of_open_APIs
- These APIs, although similar in principle, will be very different in practice.
- However, writing code for APIs is generally simpler, requires less maintenance, and results in faster overall code.

### Worked Example

- I will continue using ‘The Guardian’ since they have an open platform
- The task will be to acquire a list of recent posts from the ‘technology’ section of the newspaper
- http://open-platform.theguardian.com/
- http://open-platform.theguardian.com/documentation/

## The Guardian API

In the `beautiful_soup.ipynb` notebook, I showed how BeautifulSoup can be used to parse messy HTML, to extract information, and to act as a rudimentary web crawler. I used The Guardian as an illustrative example of how this can be achieved. The reason for choosing The Guardian was that they provide a REST API to their servers. With this it is possible to perform specific queries on their servers, and to receive current information in the format described in their API guide (i.e. JSON).

Python bindings to their API are available here:
https://github.com/prabhath6/theguardian-api-python

We use four parameters in our queries here:
1. `section`: the section of the newspaper that we are interested in querying. In this case I'm looking in the technology section
2. `order-by`: I have specified that the newest items should be closer to the front of the query list
3. `api-key`: I have left this as `test` (which works here), but for *real* deployment of such a spider a real API key should be specified
4. `page-size`: the number of results to return.

In [15]:
import requests 
import json 

These are the packages that are required for this example.

## Inspect all sections and search for technology-based sections

In [16]:
url = 'https://content.guardianapis.com/sections?api-key=test'
req = requests.get(url)
src = req.text 

In [17]:
sections = json.loads(src)['response']

print(sections.keys())

dict_keys(['status', 'userTier', 'results', 'total'])


In [18]:
print(json.dumps(sections['results'][0], indent=2, sort_keys=True))

{
  "apiUrl": "https://content.guardianapis.com/artanddesign",
  "editions": [
    {
      "apiUrl": "https://content.guardianapis.com/artanddesign",
      "code": "default",
      "id": "artanddesign",
      "webTitle": "Art and design",
      "webUrl": "https://www.theguardian.com/artanddesign"
    }
  ],
  "id": "artanddesign",
  "webTitle": "Art and design",
  "webUrl": "https://www.theguardian.com/artanddesign"
}


In [19]:
for result in sections['results']: 
    if 'tech' in result['id'].lower(): 
        print(result['webTitle'], result['apiUrl'])

Technology https://content.guardianapis.com/technology


## Manual query on whole API

In [20]:
# Specify the arguments
args = {
    'section': 'technology', 
    'order-by': 'newest', 
    'api-key': 'test', 
    'page-size': '100'
}
# Construct the URL
base_url = 'http://content.guardianapis.com/search'
url = '{}?{}'.format(
    base_url, '&'.join(["{}={}".format(kk, vv) for kk, vv in args.items()])
)
# Make the request and extract the source
req = requests.get(url) 
src = req.text
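
The manual string join above works for these particular values, but it does not escape characters such as spaces or `&`. As a hedged alternative, the sketch below builds the same query string with the standard library's `urlencode`. The helper name `build_url` and the `q` example value are illustrative assumptions.

```python
from urllib.parse import urlencode

args = {
    'section': 'technology',
    'order-by': 'newest',
    'api-key': 'test',
    'page-size': '100',
}
base_url = 'https://content.guardianapis.com/search'


def build_url(base, params):
    """Return base?key=value&... with keys and values percent-encoded."""
    return '{}?{}'.format(base, urlencode(params))


encoded_url = build_url(base_url, args)
print(encoded_url)

# A value containing a space is escaped rather than breaking the URL
# (the 'q' parameter is used here purely as an illustration)
print(build_url(base_url, {'q': 'machine learning'}))
```

With `requests`, the same effect can be had by passing the dictionary directly, as in `requests.get(base_url, params=args)`.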

In [21]:
print('Number of characters received:', len(src))

Number of characters received: 53847


## Parsing the returned JSON

The API returns JSON, so we parse this using the built-in `json` library. The API specifies that all data are returned within the `response` key, even under failure. Therefore, I have immediately descended to the `response` field.

In [22]:
response = json.loads(src)['response']
print('The following are available:\n ', sorted(response.keys()))

The following are available:
  ['currentPage', 'orderBy', 'pageSize', 'pages', 'results', 'startIndex', 'status', 'total', 'userTier']
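
The `currentPage` and `pages` fields suggest that the results are paginated. Below is a minimal sketch of walking through every page, assuming a hypothetical `fetch_page(page_number)` callable that returns the parsed `response` dictionary for one page. How the page number is passed to the real API is not shown here and should be checked against the API documentation. The `fake_pages` data is invented so that the loop can be run without a network connection.

```python
def iter_results(fetch_page):
    """Yield every result across all pages, starting from page 1."""
    page = 1
    while True:
        response = fetch_page(page)
        yield from response['results']
        # Stop once the last page reported by the API has been consumed
        if page >= response['pages']:
            break
        page += 1


# Invented stand-in for the API: two pages with three results in total
fake_pages = {
    1: {'currentPage': 1, 'pages': 2, 'results': ['a', 'b']},
    2: {'currentPage': 2, 'pages': 2, 'results': ['c']},
}

all_results = list(iter_results(lambda page: fake_pages[page]))
print(all_results)
```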


## Verifying the status code

It is important to verify that the status message is `ok` before continuing - if it is not `ok` no 'real' data 
will have been received. 

In [23]:
assert response['status'] == 'ok'
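
A bare `assert` is fine for a notebook, but it gives little information on failure and is skipped when Python runs with the `-O` flag. A slightly more defensive sketch is shown below. The function name `extract_response`, the `message` field, and the failure payload are assumptions made for illustration rather than details taken from the API documentation.

```python
import json


def extract_response(raw_json):
    """Parse an API reply and return its 'response' field.

    Raises ValueError when the status is anything other than 'ok'.
    """
    response = json.loads(raw_json).get('response', {})
    status = response.get('status')
    if status != 'ok':
        # 'message' is an assumed field name for the error description
        detail = response.get('message', 'no message')
        raise ValueError('API request failed (status={!r}): {}'.format(status, detail))
    return response


good = '{"response": {"status": "ok", "results": []}}'
bad = '{"response": {"status": "error", "message": "made-up failure"}}'

print(extract_response(good)['status'])
try:
    extract_response(bad)
except ValueError as error:
    print(error)
```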

## Listing the results 

The API standard states that the results will be found in the `results` field under the `response` field. 
Furthermore, the URLs will be found in the `webUrl` field, and the title will be found in the `webTitle` 
field. 

First let's look at what a single result looks like in full, and then I will print a restricted
set of fields for the full set of results.

In [24]:
print(json.dumps(response['results'][0], indent=2, sort_keys=True))

{
  "apiUrl": "https://content.guardianapis.com/technology/commentisfree/2017/jul/25/dont-let-sun-go-down-on-snopes-helped-start-the-internet",
  "id": "technology/commentisfree/2017/jul/25/dont-let-sun-go-down-on-snopes-helped-start-the-internet",
  "isHosted": false,
  "sectionId": "technology",
  "sectionName": "Technology",
  "type": "article",
  "webPublicationDate": "2017-07-25T12:10:09Z",
  "webTitle": "Don't let the sun go down on Snopes \u2013 it helped start the internet",
  "webUrl": "https://www.theguardian.com/technology/commentisfree/2017/jul/25/dont-let-sun-go-down-on-snopes-helped-start-the-internet"
}
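
Because the API returns structured fields, values such as the publication date can be converted directly into Python objects. The sketch below parses the `webPublicationDate` shown above, assuming that the trailing `Z` always denotes UTC and that every date follows this exact format. The helper name `publication_time` is my own.

```python
from datetime import datetime, timezone

# A cut-down copy of the result printed above
result = {
    "sectionId": "technology",
    "webPublicationDate": "2017-07-25T12:10:09Z",
}


def publication_time(item):
    """Return the item's publication date as a timezone-aware datetime (UTC)."""
    parsed = datetime.strptime(item["webPublicationDate"], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


published = publication_time(result)
print(published.isoformat())
```

Once parsed, the dates can be compared or sorted, which would be far harder to do reliably with text scraped from the HTML page.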


In [25]:
for result in response['results']: 
    print(result['webUrl'][:70], result['webTitle'][:20])

https://www.theguardian.com/technology/commentisfree/2017/jul/25/dont- Don't let the sun go
https://www.theguardian.com/technology/2017/jul/25/amazon-double-numbe Amazon to double num
https://www.theguardian.com/technology/2017/jul/25/roomba-maker-could- Roomba maker may sha
https://www.theguardian.com/technology/2017/jul/25/splatoon-2-review-n Splatoon 2 review: r
https://www.theguardian.com/technology/2017/jul/25/microsoft-paint-sav Microsoft Paint save
https://www.theguardian.com/technology/2017/jul/25/as-it-dies-i-die-al 'As it dies, I die a
https://www.theguardian.com/technology/2017/jul/24/microsoft-paint-kil Microsoft Paint to b
https://www.theguardian.com/technology/2017/jul/24/kingdom-hearts-3-rp Kingdom Hearts 3: th
https://www.theguardian.com/technology/2017/jul/24/share-your-microsof Share your Microsoft
https://www.theguardian.com/technology/2017/jul/24/pokemon-go-fest-fan Pokémon Go fans enra
https://www.theguardian.com/technology/2017/jul/24/facebook-cafeteria- Facebook 