# The Guardian API

In the `beautiful_soup.ipynb` notebook, I showed how BeautifulSoup can be used 
to parse messy HTML, to extract information, and to act as a rudimentary web crawler. 
I used The Guardian as an illustrative example of how this can be achieved. 
I chose The Guardian because they provide a REST API to their servers. 
With this API it is possible to perform specific queries and to receive 
current information in the format described in their API guide (i.e. JSON):

http://open-platform.theguardian.com/

In order to use their API, you will need to register for an API key. 
At the time of writing (Feb 1, 2017) this was an automated process that could be completed at 

http://open-platform.theguardian.com/access/

The API is documented here: 

http://open-platform.theguardian.com/documentation/

and Python bindings to their API are available here

https://github.com/prabhath6/theguardian-api-python

and these can easily be integrated into a web crawler that is based on API calls rather than 
on HTML parsing. 

We use four parameters in our queries here: 

1. `section`: the section of the newspaper that we are interested in querying. In this case I'm looking in 
the technology section. 

2. `order-by`: I have specified `newest`, so that the newest items appear first in the list of results. 

3. `api-key`: I have left this as `test` (which works here), but for *real* deployment of such a spider 
a real API key should be specified. 

4. `page-size`: the number of results to return per page. 

In [1]:
import requests 
import json 

# Inspect all sections and search for technology-based sections

In [2]:
url = 'https://content.guardianapis.com/sections?api-key=test'
req = requests.get(url)
src = req.text 

In [3]:
sections = json.loads(src)['response']

print sections.keys()

[u'status', u'userTier', u'total', u'results']


In [4]:
print json.dumps(sections['results'][0], indent=2, sort_keys=True)

{
  "apiUrl": "https://content.guardianapis.com/artanddesign", 
  "editions": [
    {
      "apiUrl": "https://content.guardianapis.com/artanddesign", 
      "code": "default", 
      "id": "artanddesign", 
      "webTitle": "Art and design", 
      "webUrl": "https://www.theguardian.com/artanddesign"
    }
  ], 
  "id": "artanddesign", 
  "webTitle": "Art and design", 
  "webUrl": "https://www.theguardian.com/artanddesign"
}


In [5]:
for result in sections['results']: 
    if 'tech' in result['id'].lower(): 
        print result['webTitle'], result['apiUrl']

Technology https://content.guardianapis.com/technology


# Manual query on whole API

In [6]:
# Specify the arguments
args = {
    'section': 'technology', 
    'order-by': 'newest', 
    'api-key': 'test', 
    'page-size': '100'
}

# Construct the URL
base_url = 'https://content.guardianapis.com/search'
url = '{}?{}'.format(
    base_url, 
    '&'.join(["{}={}".format(kk, vv) for kk, vv in args.iteritems()])
)

# Make the request and extract the source
req = requests.get(url) 
src = req.text
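
Joining the key/value pairs by hand works here because none of the values contain reserved characters. As a hedged alternative, the standard library can escape the values for us. The sketch below is written for Python 3 (`urllib.parse`), whereas the cells in this notebook use Python 2; `build_url` is a hypothetical helper name, not part of any Guardian library.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_url(base_url, args):
    """Return base_url with args appended as an escaped query string."""
    return '{}?{}'.format(base_url, urlencode(args))

# Same arguments as in the cell above
args = {
    'section': 'technology',
    'order-by': 'newest',
    'api-key': 'test',
    'page-size': '100',
}

url = build_url('https://content.guardianapis.com/search', args)
print(url)

# Round-trip check: the query string decodes back to the same arguments
decoded = parse_qs(urlsplit(url).query)
print(decoded['section'], decoded['page-size'])
```

Passing the dictionary directly, as in `requests.get(base_url, params=args)`, lets `requests` perform the same encoding internally.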

In [7]:
print 'Number of characters received:', len(src)

Number of characters received: 51824


The API returns JSON, so we parse this using the built-in `json` library. 
The API specifies that all data are returned within the `response` key, even under failure. 
Therefore, I descend immediately to the `response` field. 

# Parsing the JSON
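
A minimal sketch of this step, run on a small hand-written payload rather than the real `src` (the payload is an assumption that only mimics the shape of the API output). It is written for Python 3, whereas the cells in this notebook use Python 2 `print` statements.

```python
import json

# Hypothetical, heavily shortened payload standing in for `src`
sample_src = '{"response": {"status": "ok", "total": 1, "results": []}}'

# Parse the JSON text and descend straight into the `response` field
response = json.loads(sample_src)['response']

# List the fields that are available at this level
print('The following are available:')
print(' ', sorted(response.keys()))
```

With the real `src` from the request, the same two steps give the `response` dictionary used in the rest of this notebook.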

The following are available:
  [u'currentPage', u'orderBy', u'pageSize', u'pages', u'results', u'startIndex', u'status', u'total', u'userTier']


# Verifying the status code

It is important to verify that the status message is `ok` before continuing - if it is not `ok` no 'real' data 
will have been received. 

In [9]:
assert response['status'] == 'ok'
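
An `assert` is fine for a notebook, but assertions are skipped when Python runs with `-O`, so a long-running spider may prefer an explicit check. The following is a sketch only: `check_status` is a hypothetical helper, and the `message` field is an assumption about what an error response contains, so it is read defensively.

```python
def check_status(response):
    """Return response unchanged, or raise ValueError if status is not 'ok'."""
    status = response.get('status')
    if status != 'ok':
        # `message` is assumed to hold the error text; fall back if it is absent
        detail = response.get('message', 'no message provided')
        raise ValueError('API returned status {!r}: {}'.format(status, detail))
    return response

# A successful response passes straight through
ok = check_status({'status': 'ok', 'results': []})
print(ok['status'])

# A failed response raises, so the spider can log the problem and stop
try:
    check_status({'status': 'error', 'message': 'example failure'})
except ValueError as exc:
    print('Request failed:', exc)
```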

# Listing the results 

The API standard states that the results will be found in the `results` field under the `response` field. 
Furthermore, the URLs will be found in the `webUrl` field, and the title will be found in the `webTitle` 
field. 

First let's look at what a single result looks like in full, and then I will print a restricted 
set of fields for the full set of results.

In [10]:
print json.dumps(response['results'][0], indent=2, sort_keys=True)

{
  "apiUrl": "https://content.guardianapis.com/technology/2017/jan/31/apple-record-revenue-holiday-sales-iphone-7", 
  "id": "technology/2017/jan/31/apple-record-revenue-holiday-sales-iphone-7", 
  "isHosted": false, 
  "sectionId": "technology", 
  "sectionName": "Technology", 
  "type": "article", 
  "webPublicationDate": "2017-01-31T22:10:00Z", 
  "webTitle": "Apple posts record revenue thanks to holiday sales of iPhone 7", 
  "webUrl": "https://www.theguardian.com/technology/2017/jan/31/apple-record-revenue-holiday-sales-iphone-7"
}
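
For a crawler it is often convenient to keep only the fields of interest from each result. The sketch below does this for the single result shown above; `summarise` is a hypothetical helper, the timestamp format is inferred from that one example, and the code is written for Python 3.

```python
from datetime import datetime

# The single result printed above, copied by hand
result = {
    "id": "technology/2017/jan/31/apple-record-revenue-holiday-sales-iphone-7",
    "sectionId": "technology",
    "type": "article",
    "webPublicationDate": "2017-01-31T22:10:00Z",
    "webTitle": "Apple posts record revenue thanks to holiday sales of iPhone 7",
    "webUrl": "https://www.theguardian.com/technology/2017/jan/31/"
              "apple-record-revenue-holiday-sales-iphone-7",
}

def summarise(result):
    """Keep the title, URL and parsed publication date of one result."""
    published = datetime.strptime(result['webPublicationDate'],
                                  '%Y-%m-%dT%H:%M:%SZ')
    return {
        'title': result['webTitle'],
        'url': result['webUrl'],
        'published': published,
    }

summary = summarise(result)
print(summary['published'].date(), summary['title'])
```

Applying `summarise` to every entry in `response['results']` would give a list that is ready to be queued for crawling or stored.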


In [11]:
for result in response['results']: 
    print result['webUrl'][:70], result['webTitle'][:20]

https://www.theguardian.com/technology/2017/jan/31/apple-record-revenu Apple posts record r
https://www.theguardian.com/technology/2017/jan/31/club-penguin-the-ki Club Penguin: the ki
https://www.theguardian.com/technology/2017/jan/31/trump-travel-ban-te #DeleteUber: how tec
https://www.theguardian.com/technology/2017/jan/31/amazon-expedia-micr Amazon pledges legal
https://www.theguardian.com/technology/2017/jan/31/hitman-review-a-bea Hitman review – a be
https://www.theguardian.com/technology/2017/jan/31/chatterbox-tuesday Chatterbox: Tuesday
https://www.theguardian.com/technology/2017/jan/30/libratus-poker-arti Oh the humanity! Pok
https://www.theguardian.com/technology/2017/jan/30/deleteuber-how-soci #DeleteUber: how soc
https://www.theguardian.com/technology/2017/jan/30/honor-6x-review-a-l Honor 6X review: a l
https://www.theguardian.com/technology/2017/jan/30/chatterbox-monday Chatterbox: Monday
https://www.theguardian.com/media/2017/jan/28/how-to-stop-arguing-and- How to stop arg