**References**: the code in this notebook is adapted from this script (written by my baby-programmer self 5 years ago): https://github.com/effernan/New-York-Times-Archive-API-code, and from this more senior script: https://github.com/rochelleterman/scrape-interwebz/blob/master/1_APIs/3_api_workbook.ipynb 

# The Guardian

In this exercise we will use the API of **The Guardian**, which does provide the full text of articles: https://open-platform.theguardian.com/access

First, you need a key, which you can get here: https://bonobo.capi.gutools.co.uk/register/developer 

Let's explore **The Guardian** documentation website: https://open-platform.theguardian.com/documentation/

There are **5 endpoints**: Content, Tags, Sections, Editions, Single item. The one that we need is **content**: https://open-platform.theguardian.com/documentation/search 

So, in there, we have all the different options that the API provides. The base_url is: "https://content.guardianapis.com/search?"

Everything else is very similar to The New York Times, but here there is one option that we can include in the parameters that will return the full text of the articles: "show-fields" : "body". How **super cool** is that?

#### 1. Let's build the API request

In [1]:
import requests
import json
import time  # to pause after each API call 
import math
import csv
import matplotlib.pyplot as plt
import pandas as pd  # to see our CSV 

In [2]:
# set key
key= "8ab705db-157a-4270-9fd1-a5499f3f1196"

# set base url
base_url= "https://content.guardianapis.com/search?"

# set response format
response_format= ".json"

In [3]:
search_params = {"q": "David Beckham",
                 "from-date" : "2001-01-01", #we need to change the dates format
                 "to-date" : "2001-12-31", 
                 "show-fields" : "body", #this is the full text of the article!
                 "id" : "id",
                 "format" : "json",
                 "api-key": key}    

In [8]:
# make request
r = requests.get(base_url + response_format, params = search_params)

In [13]:
print(r.url)

https://content.guardianapis.com/search?.json&q=David+Beckham&from-date=2001-01-01&to-date=2001-12-31&show-fields=body&id=id&format=json&api-key=8ab705db-157a-4270-9fd1-a5499f3f1196
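
The printed URL carries a stray `.json` right after the `?`, because `response_format` is glued onto a base URL that already ends in `?`. The request still works here, since `format=json` is also in the parameters. As a minimal sketch of how the query string is put together, here is the same URL built with the standard library only. The name `endpoint` and the value `YOUR-API-KEY` are placeholders for illustration, not part of the original script.

```python
from urllib.parse import urlencode

# Base URL without the trailing "?" (hypothetical variable name)
endpoint = "https://content.guardianapis.com/search"

params = {
    "q": "David Beckham",
    "from-date": "2001-01-01",
    "to-date": "2001-12-31",
    "show-fields": "body",      # ask for the full text of each article
    "format": "json",
    "api-key": "YOUR-API-KEY",  # placeholder, replace with your own key
}

# urlencode turns the dictionary into "key=value&key=value..."
url = endpoint + "?" + urlencode(params)
print(url)
```

Passing the same dictionary as `requests.get(endpoint, params=params)` should produce an equivalent URL, with `requests` adding the `?` for us.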


In [14]:
response_text= r.text
print(response_text[:1000])  #looks good!

{"response":{"status":"ok","userTier":"developer","total":2859,"startIndex":1,"pageSize":10,"currentPage":1,"pages":286,"orderBy":"relevance","results":[{"id":"football/2001/oct/09/sport.seaningle","type":"article","sectionId":"football","sectionName":"Football","webPublicationDate":"2001-10-09T14:04:18Z","webTitle":"David Beckham: how deep is your love?","webUrl":"https://www.theguardian.com/football/2001/oct/09/sport.seaningle","apiUrl":"https://content.guardianapis.com/football/2001/oct/09/sport.seaningle","fields":{"body":"<p>Back in the 1970s, the Bee Gees asked the world: 'How deep is your love.' Twenty-five years later, Guardian Unlimited feels like asking our readers the same question about England captain David Beckham.</p> <p>So we will. How deep is your love, self-confessed \"Manchester United fan, Beckham admirer and Essex girl\" Mrs Mandy Seymore?</p> <p>\"Beckham's performance on Saturday was one that all Brits will be proud of,\" she gushes, and before we can speak to an

In [15]:
# Convert JSON response to a dictionary
data = json.loads(response_text)

In [16]:
print(data.keys()) #we need to access the data that is stored in "response"

dict_keys(['response'])


In [17]:
data['response'].keys()

dict_keys(['status', 'userTier', 'total', 'startIndex', 'pageSize', 'currentPage', 'pages', 'orderBy', 'results'])

Now the keys have changed! What we need to access is "results"

In [18]:
type(data['response']['results'])

list

In [19]:
docs = data['response']['results']

In [20]:
len(docs) #same problem as before: we need to loop over the request to get all the documents!

10
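
The response metadata already tells us how many requests we will need: `total` is the number of matching articles and `pageSize` is how many come back per call. A quick check, with the two values copied by hand from the response printed earlier:

```python
import math

total = 2859     # data['response']['total']
page_size = 10   # data['response']['pageSize']

# every page holds up to page_size articles, so round up
pages = math.ceil(total / page_size)
print(pages)
```

This matches the `pages` value that the API reports in the same response (286), so we can read it straight from there.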

In [21]:
#docs[0] #and there is the body of our article!

Now let's build the proper call by modifying our previous script.

In [23]:
# set key
key=  "8ab705db-157a-4270-9fd1-a5499f3f1196"

# set base url
base_url= "https://content.guardianapis.com/search?" #we also need to change this to The Guardian one

# set response format
response_format=".json"

# set search parameters
search_params = {"q": "David Beckham",
                 "from-date" : "2001-01-01", #we need to change the dates format
                 "to-date" : "2001-12-31", 
                 "show-fields" : "body", #this is the full text of the article!
                 "id" : "id",
                 "format" : "json",
                 "api-key": key}    

# make request
r = requests.get(base_url+response_format, params=search_params)
    
# convert to a dictionary
data=json.loads(r.text)
    
# get number of hits and number of pages
hits = data['response']['total'] #we need to change this too
pages = data['response']['pages']
print("number of hits: ", str(hits))
print("number of pages: ", str(pages))
    
# make an empty list where we'll hold all of our docs for every page
all_docs = [] 
    
# now we're ready to loop through the pages
# (pages start at 1: the first response reports "currentPage": 1)
for i in range(1, pages + 1):
    print("collecting page", str(i))
        
    # set the page parameter
    search_params['page'] = i
        
    # make request
    r = requests.get(base_url+response_format, params=search_params)
    if r.status_code == 200:
        # get text and convert to a dictionary
        data=json.loads(r.text)
        
        # get just the docs
        if 'response' in data and 'results' in data['response']: #we need to change this to "results"
            docs = data['response']['results']
        
            # add those docs to the big list
            all_docs.extend(docs)
    
    #IMPORTANT pause between calls
    time.sleep(1)

number of hits:  2859
number of pages:  286
collecting page 1
collecting page 2
collecting page 3
collecting page 4
collecting page 5
...
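
The paging logic can also be written as a small function that we can try out without touching the API. This is only a sketch: `collect_all_pages`, `fetch_page` and `fake_fetch` are hypothetical names, and it assumes that pages are numbered from 1, as the `"currentPage": 1` in the first response suggests.

```python
import time

def collect_all_pages(fetch_page, pause=1.0):
    """Collect the results of every page into one list.

    fetch_page is a hypothetical callable: it takes a page number and
    returns the parsed 'response' dictionary for that page.
    """
    first = fetch_page(1)
    all_docs = list(first["results"])
    for page in range(2, first["pages"] + 1):
        time.sleep(pause)  # pause between calls
        all_docs.extend(fetch_page(page)["results"])
    return all_docs


def fake_fetch(page, page_size=10, total=25):
    """Offline stand-in for the API: 25 fake articles, 10 per page."""
    ids = list(range(total))[(page - 1) * page_size : page * page_size]
    return {"pages": 3, "results": [{"id": n} for n in ids]}


docs = collect_all_pages(fake_fetch, pause=0)
print(len(docs))
```

With the real API, `fetch_page` could set `search_params['page']`, call `requests.get`, and return `r.json()['response']`.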

Let's have a look at that first element in our list of documents.

In [24]:
all_docs[0]

{'id': 'football/2001/oct/09/sport.seaningle',
 'type': 'article',
 'sectionId': 'football',
 'sectionName': 'Football',
 'webPublicationDate': '2001-10-09T14:04:18Z',
 'webTitle': 'David Beckham: how deep is your love?',
 'webUrl': 'https://www.theguardian.com/football/2001/oct/09/sport.seaningle',
 'apiUrl': 'https://content.guardianapis.com/football/2001/oct/09/sport.seaningle',
 'fields': {'body': '<p>Back in the 1970s, the Bee Gees asked the world: \'How deep is your love.\' Twenty-five years later, Guardian Unlimited feels like asking our readers the same question about England captain David Beckham.</p> <p>So we will. How deep is your love, self-confessed "Manchester United fan, Beckham admirer and Essex girl" Mrs Mandy Seymore?</p> <p>"Beckham\'s performance on Saturday was one that all Brits will be proud of," she gushes, and before we can speak to anyone in Scotland about her comments, she\'s off again. "I have always admired him for his footballing abilities and can only hop

So, what we want is: id, webPublicationDate, webTitle, webUrl, and the content of the article (which is in fields). Let's modify our function to get us that!

In [25]:
def format_articles(unformatted_docs):
    '''
    This function takes in a list of documents returned by The Guardian API 
    and parses the documents into a list of dictionaries, 
    with 'id', 'webPublicationDate', 'webTitle', 'webUrl', and 'fields' keys
    '''
    formatted = []
    for i in unformatted_docs:
        dic = {}
        dic['id'] = i['id']
        dic['webPublicationDate'] = i['webPublicationDate'] # full timestamp, date and time of day
        dic['webTitle'] = i['webTitle']
        dic["webUrl"] = i["webUrl"]
        dic["fields"] = i["fields"] # dictionary that holds the article body
        formatted.append(dic)
    return formatted

In [26]:
all_formatted = format_articles(all_docs)

In [27]:
all_formatted[0]

{'id': 'football/2001/oct/09/sport.seaningle',
 'webPublicationDate': '2001-10-09T14:04:18Z',
 'webTitle': 'David Beckham: how deep is your love?',
 'webUrl': 'https://www.theguardian.com/football/2001/oct/09/sport.seaningle',
 'fields': {'body': '<p>Back in the 1970s, the Bee Gees asked the world: \'How deep is your love.\' Twenty-five years later, Guardian Unlimited feels like asking our readers the same question about England captain David Beckham.</p> <p>So we will. How deep is your love, self-confessed "Manchester United fan, Beckham admirer and Essex girl" Mrs Mandy Seymore?</p> <p>"Beckham\'s performance on Saturday was one that all Brits will be proud of," she gushes, and before we can speak to anyone in Scotland about her comments, she\'s off again. "I have always admired him for his footballing abilities and can only hope my son grows up to half as good as he is. His attitude and determination whenever he is on the field is something else.</p> <p>Anything else? "Add to the fa
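
Note that the body comes back as HTML, nested inside the `fields` dictionary. If plain text is more convenient for later analysis, a rough sketch using only the standard library could look like this. `TextExtractor` and `html_to_text` are hypothetical helpers, not part of the original script, and the sample is a shortened, made-up fragment in the same shape as our documents.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keeps the text between tags and ignores the tags themselves."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # join the pieces and collapse repeated whitespace
    return " ".join("".join(parser.chunks).split())

sample = {"fields": {"body": "<p>So we will.</p> <p>Anything else?</p>"}}
text = html_to_text(sample["fields"]["body"])
print(text)
```

This is deliberately simple: paragraphs that are not separated by whitespace in the HTML would run together, so a dedicated HTML library may be a better fit for serious cleaning.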

And now let's store that. We have a super cool David Beckham dataset that we can use for our future data analysis!

In [28]:
# use the keys of the first document as the column names
keys = all_formatted[0].keys()

# write the header and then all the rows
with open('david_beckham.csv', 'w', encoding = 'utf-8', newline = '') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_formatted)
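
To check that a file like this can be read back, here is a small round trip with the `csv` module. It is a sketch on a one-row sample written to a temporary folder, so it does not touch `david_beckham.csv`; the names `rows`, `path` and `loaded` are just for illustration.

```python
import csv
import os
import tempfile

rows = [
    {"id": "football/2001/oct/09/sport.seaningle",
     "webTitle": "David Beckham: how deep is your love?"},
]

# write the sample to a temporary file
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", encoding="utf-8", newline="") as output_file:
    writer = csv.DictWriter(output_file, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# read it back: each row becomes a dictionary keyed by the header
with open(path, encoding="utf-8", newline="") as input_file:
    loaded = list(csv.DictReader(input_file))

print(loaded[0]["webTitle"])
```

With pandas, `pd.read_csv('david_beckham.csv')` is another way to look at the real file as a table.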

# Exercise

And now: repeat the exercise with a search term that you are interested in (e.g. another athlete, or some other kind of news that you would like to see). Remember to change the name of the csv file so that you do not overwrite your data!