![UKDS Logo](./images/UKDS_Logos_Col_Grey_300dpi.png)

# Web-scraping for Social Science Research

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *New Forms of Data for Social Science Research*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites and social media platforms, text data, and simulations (agent-based modelling), to name a few. To help you get to grips with these new forms of data, we provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
April 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Guide-to-using-this-resource" data-toc-modified-id="Guide-to-using-this-resource-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Guide to using this resource</a></span><ul class="toc-item"><li><span><a href="#Interaction" data-toc-modified-id="Interaction-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Interaction</a></span></li><li><span><a href="#Learn-more" data-toc-modified-id="Learn-more-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Learn more</a></span></li></ul></li><li><span><a href="#Collecting-data-from-online-databases-using-an-API" data-toc-modified-id="Collecting-data-from-online-databases-using-an-API-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Collecting data from online databases using an API</a></span><ul class="toc-item"><li><span><a href="#What-is-an-API?" 
data-toc-modified-id="What-is-an-API?-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>What is an API?</a></span></li><li><span><a href="#Reasons-to-interact-with-an-API" data-toc-modified-id="Reasons-to-interact-with-an-API-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Reasons to interact with an API</a></span></li><li><span><a href="#Logic-of-using-an-API" data-toc-modified-id="Logic-of-using-an-API-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Logic of using an API</a></span></li></ul></li><li><span><a href="#Example:-Capturing-Covid-19-data" data-toc-modified-id="Example:-Capturing-Covid-19-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Example: Capturing Covid-19 data</a></span><ul class="toc-item"><li><span><a href="#Locating-the-API" data-toc-modified-id="Locating-the-API-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Locating the API</a></span></li><li><span><a href="#API-terms-of-use" data-toc-modified-id="API-terms-of-use-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>API terms of use</a></span></li><li><span><a href="#Locating-data" data-toc-modified-id="Locating-data-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Locating data</a></span></li><li><span><a href="#Registering-use-of-API" data-toc-modified-id="Registering-use-of-API-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Registering use of API</a></span></li><li><span><a href="#Requesting-data" data-toc-modified-id="Requesting-data-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Requesting data</a></span></li><li><span><a href="#Saving-results" data-toc-modified-id="Saving-results-4.6"><span class="toc-item-num">4.6&nbsp;&nbsp;</span>Saving results</a></span></li><li><span><a href="#Refining-Covid-19-data-collection" data-toc-modified-id="Refining-Covid-19-data-collection-4.7"><span class="toc-item-num">4.7&nbsp;&nbsp;</span>Refining Covid-19 data collection</a></span></li></ul></li><li><span><a href="#Value,-limitations-and-ethics" 
data-toc-modified-id="Value,-limitations-and-ethics-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Value, limitations and ethics</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Bibliography" data-toc-modified-id="Bibliography-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Bibliography</a></span></li><li><span><a href="#Further-reading-and-resources" data-toc-modified-id="Further-reading-and-resources-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Further reading and resources</a></span></li><li><span><a href="#Appendices" data-toc-modified-id="Appendices-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Appendices</a></span><ul class="toc-item"><li><span><a href="#Appendix-A---Requesting-URLs" data-toc-modified-id="Appendix-A---Requesting-URLs-9.1"><span class="toc-item-num">9.1&nbsp;&nbsp;</span>Appendix A - Requesting URLs</a></span></li><li><span><a href="#Appendix-B---Capturing-UK-Police-data" data-toc-modified-id="Appendix-B---Capturing-UK-Police-data-9.2"><span class="toc-item-num">9.2&nbsp;&nbsp;</span>Appendix B - Capturing UK Police data</a></span></li></ul></li></ul></div>

## Introduction

In this training series we cover some of the essential skills needed to collect data from the web. In particular we focus on two different approaches:
1. Collecting data stored on web pages. <a href="./web-scraping-websites-code-2020-04-23.ipynb" target=_blank>[LINK]</a>
2. Downloading data from online databases using Application Programming Interfaces (APIs). [Focus of this notebook]
    
Do not be alarmed by the technical aspects: both approaches can be implemented using simple code, a standard desktop or laptop, and a decent internet connection.    

Given the Covid-19 public health crisis in which this programme of work occurred, we will examine ways in which web-scraping techniques can provide valuable data for studying this phenomenon. This is a fast moving, evolving public health emergency that, in addition to other impacts, will shape research agendas across the sciences for years to come. Therefore it is important to learn how we, as social scientists, can access or generate data that will provide a better understanding of this disease.

In this lesson we will search the <a href="https://www.theguardian.com" target=_blank>Guardian newspaper's</a> online database for articles referring to Covid-19. We will also examine two APIs that should be of wider interest to the social science community: 
* UK Police API, which provides data on street-level crime and outcomes, police stations and much more.
* Companies House API, which provides data on the governance, finances and activities of UK registered companies.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*Collecting data from online databases using an API*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("Hello {}, enjoy learning more about Python and web-scraping!".format(name)) 

Enter your name and press enter:


### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## Collecting data from online databases using an API

### What is an API?

An Application Programming Interface (API) is
> "a set of functions and procedures allowing the creation of applications that access the features or data of an operating system, application, or other service" (Oxford English Dictionary). 

In essence: an API acts as an intermediary between software applications. It performs this role by providing a set of protocols/standards for making *requests* and formulating *responses* between applications. For example, a smart phone application might need real-time traffic data from an online database. An API can validate the smart phone's request for data, and handle the online database's response (i.e., the transfer of data to the application). In the absence of an API, the smart phone application would need to know a lot more technical information about the online database in order to communicate with it. But thanks to the API, it only needs to know how to formulate a request that the API understands; the API then communicates the request to the database.

Think of an API's role as similar to that of a translator facilitating a conversation between two individuals who do not speak the same language. Neither individual needs to know the other's language, just how to formulate their response in a way the translator can understand. In sum, an API simplifies how applications communicate with each other.

(If you want to learn more about APIs, especially from a technical perspective, we highly recommend one of our previous webinars: <a href="https://www.youtube.com/watch?v=ffSQHPHO_c0" target=_blank>What are APIs?</a>)

### Reasons to interact with an API

Many public, private and charitable institutions collect and share data of value to social scientists. Often they deposit their data in a data portal (e.g., <a href="https://data.gov.uk/" target=_blank>UK Government Open Data</a>), allowing you to download the files as and when needed. However, another approach they can adopt is to allow access to the underlying information stored in their database through an API. Using this method, you send a customised *request* for information to the database; if the request is valid, the database *responds* by providing you with the information you asked for. Think of the difference as follows: with a data portal you download a raw data file and then filter it to arrive at the information you need; with an API you perform the filtering when you request the data, so only what you need is returned.

Before we delve into writing code to capture data through an API, let's clearly state the logic underpinning the technique.

### Logic of using an API

We begin by identifying an online database containing information of interest. Then we need to **know** the following:
1. The location of the API (i.e., web address) through which the database can be accessed. For example, the UK Police API can be accessed via <a href="https://data.police.uk/api" target=_blank>https://data.police.uk/api</a>.
2. The terms of use associated with the API. Many APIs restrict the number of requests you can make over a given time period, while others require registration in order to authenticate who is trying to access the data. For example, the UK Police API does not require you to provide authentication but restricts the number of requests for data you can make (15 per second) - the number of allowable requests is known as the *rate limit*.
3. The location of the data of interest on the API. For example, data on street-level crime from the UK Police API is available at: <a href="https://data.police.uk/api/crimes-street" target=_blank>https://data.police.uk/api/crimes-street</a>. The location of the data is known as its *endpoint*.

We can usually find all of the information we need by reading the API's documentation e.g., <a href="https://data.police.uk/docs/" target=_blank>https://data.police.uk/docs/</a>.

Then we need to **do** the following:
4. Register your use of the API (if required).
5. Request data from the endpoint of interest, supplying authentication if required. This process is known as *making a call* to the API.
6. Write this data to a file for future use.
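The six steps above can be sketched in a few lines of Python. This is only an outline under stated assumptions: the helper names `build_call` and `save_result` are our own invention, and the response is a placeholder dictionary rather than real output from the UK Police API, so the sketch runs without an internet connection.

```python
import json
import os
import tempfile
from urllib.parse import urlencode

API_LOCATION = "https://data.police.uk/api"  # step 1: where the API lives
ENDPOINT = "crimes-street"                   # step 3: where the data of interest lives


def build_call(api_location, endpoint, params=None):
    """Join the API location, endpoint and optional query parameters into one URL."""
    url = api_location.rstrip("/") + "/" + endpoint.lstrip("/")
    if params:
        url += "?" + urlencode(params)
    return url


def save_result(data, path):
    """Step 6: write the returned data to a JSON file for future use."""
    with open(path, "w") as f:
        json.dump(data, f)


url = build_call(API_LOCATION, ENDPOINT)

# Step 5 would send `url` to the API (e.g., with the requests module).
# Here a hypothetical placeholder stands in for the real response.
placeholder_response = {"requested": url, "results": []}

outfile = os.path.join(tempfile.gettempdir(), "api-sketch.json")
save_result(placeholder_response, outfile)
print(url)
```

Steps 2 and 4 (reading the terms of use and registering) happen outside of the code, which is why they do not appear in the sketch.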

## Example: Capturing Covid-19 data

<a href="https://open-platform.theguardian.com/documentation/" target=_blank>The Guardian's API</a> provides access to some of the data and metadata associated with its content. For example, you can query its database to search for articles with certain tags ("environment", "covid-19"), or articles published over a certain date range.

### Locating the API

In some cases the organisation will allow you to explore the API without the need to write code or register for an API key. For instance, the Guardian's API can be explored at <a href="https://open-platform.theguardian.com/explore/" target=_blank>https://open-platform.theguardian.com/explore/</a>.

**TASK**: take some time to interact with the Guardian's API through its user interface.

(Note: it is possible to load websites into Python in order to view them; however, the Guardian's API doesn't allow this. See the example code below for how it would work for a different website - just remove the quotation marks enclosing the code and run the cell).

In [None]:
"""
from IPython.display import IFrame

IFrame("https://ukdataservice.ac.uk/", width="600", height="650")
"""

However, interacting with an API through a user interface (i.e., text boxes, drop-down menus) is slow, opaque, labour intensive and often not possible. We will instead focus on writing code that performs the requests for us. To do so, we need to use the following web address to access the API: https://content.guardianapis.com.

(Note that you cannot request this web address through your browser; this is because access to the API is only possible by providing authentication i.e., an API key. Try for yourself by clicking on the links).

### API terms of use

The Guardian API is well documented (not always the case, unfortunately) and we can clearly identify what is required in order to interact with it. Firstly, the API requires authentication, in the form of an API key, to be provided when making requests. The API key is generated when you register your use of the API.

Secondly, the API provides multiple levels of access, each with its own set of restrictions and usage allowances. The free level of access allows you to make up to 12 calls per second, with a daily limit of 5,000; requests are also restricted to text content (no access to images, audio or video). If you need more than the free level of access provides, there is a commercial product with custom rate limits, access to other types of data (e.g., images, videos) etc.

See <a href="https://open-platform.theguardian.com/access/" target=_blank>https://open-platform.theguardian.com/access/</a> for full information on levels of access.
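One simple way of staying within a per-second rate limit is to pause between calls. The sketch below is a minimal illustration, not part of the Guardian's own tooling: the `Throttle` class is a hypothetical helper of ours, and it only handles the per-second limit (the daily limit would need separate bookkeeping).

```python
import time


class Throttle:
    """Space out calls so that no more than `calls_per_second` are made."""

    def __init__(self, calls_per_second):
        self.min_interval = 1.0 / calls_per_second  # minimum gap between calls, in seconds
        self.last_call = None

    def wait(self):
        """Sleep just long enough to respect the minimum gap since the previous call."""
        if self.last_call is not None:
            elapsed = time.monotonic() - self.last_call
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()


throttle = Throttle(calls_per_second=12)  # the free-tier limit quoted above

start = time.monotonic()
for page in range(1, 4):
    throttle.wait()
    # a real script would make its call to the API here
duration = time.monotonic() - start
print(round(duration, 2))
```

Calling `throttle.wait()` immediately before each request means the first call goes ahead straight away and every later call is delayed only if it would otherwise arrive too soon.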

### Locating data

The Guardian API allows access to five endpoints:
* Content - https://content.guardianapis.com/search
* Tags - https://content.guardianapis.com/tags
* Sections - https://content.guardianapis.com/sections
* Editions - https://content.guardianapis.com/editions
* Single Item - https://content.guardianapis.com/

### Registering use of API

We need an API key in order to start requesting data. The API key acts as a user's unique id when accessing the API. For the purposes of this lesson we have <a href="https://bonobo.capi.gutools.co.uk/register/developer" target=_blank>registered our use</a> and been given an API key which is contained in a file called *guardian-api-key.txt*. 

Run the code below to check if the file exists:

In [1]:
import os

os.listdir("./auth")

['guardian-api-key.txt']

You should generate your own API key, and keep it private and secure, when using the Guardian API for your own purposes.
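A common way of keeping a key out of your code and notebooks is to store it in an environment variable, falling back to a local file if the variable is not set. The sketch below assumes a variable called `GUARDIAN_API_KEY`; that name, and the `load_api_key` function, are our own choices rather than anything required by the Guardian.

```python
import os


def load_api_key(env_var="GUARDIAN_API_KEY", key_file="./auth/guardian-api-key.txt"):
    """Return the API key from an environment variable, falling back to a text file."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()  # remove accidental spaces or newlines
    with open(key_file, "r") as f:
        return f.read().strip()
```

Whichever option you use, avoid printing the key in a notebook you intend to share, and do not upload the key file to a public repository.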

### Requesting data

We're ready for the interesting bit: requesting data through the API. We'll focus on finding and saving data about articles relating to the Covid-19 public health crisis.

There is a preliminary step, which is setting up Python with the modules it needs to interact with the API.

In [1]:
# Import modules

import os # module for navigating your machine (e.g., file directories)
import requests # module for requesting urls
import json # module for working with JSON data structures
from datetime import datetime # module for working with dates and time
print("Successfully imported necessary modules")

Successfully imported necessary modules


Modules are additional techniques or functions that are not present when you launch Python. Some do not even come with Python when you download it and must be installed on your machine separately - think of using `ssc install <package>` in Stata, or `install.packages(<package>)` in R. For now just understand that many useful modules need to be imported every time you start a new Python session.

Now let's make our first request to the API. First, we need to load in the API key as it is stored in a separate file for privacy reasons.

(Delete the `#` symbol if you want to see the value of the API key).

In [2]:
with open("./auth/guardian-api-key.txt", "r") as f: # open the file, closing it automatically afterwards
    api_key = f.read().strip() # read its contents, removing any stray whitespace or newline
# api_key

Next, let's search for articles mentioning the term "covid-19".

In [3]:
# Define web address and search terms

baseurl = "http://content.guardianapis.com/search?" # base web address
searchterm = "covid-19" # term we want to search for
auth = {"api-key": api_key} # authentication
webadd = baseurl + "q=" + searchterm # construct web address for requesting
print(webadd)

# Make call to API

response = requests.get(webadd, headers=auth) # request the web address
response.status_code # check if API was requested successfully

http://content.guardianapis.com/search?q=covid-19


200

Let's unpack the above code, as there is a lot happening. First, we declare a variable (also known as an 'object' in Python) called `baseurl` that contains the base web address of the *Content* endpoint. Then we declare a variable containing the term we are interested in searching for (`searchterm`). Next, we define a variable to store the API key that is needed when making the request (`auth`). Finally we concatenate these separate elements to form a valid web address that can be requested from the API (`webadd`). <br> The next step is to use the `get()` method of the `requests` module to request the web address, and in the same line of code, we store the results of the request in a variable called `response`. Finally, we check whether the request was successful by calling on the `status_code` attribute of the `response` variable.

Confused? Don't worry, the conventions of Python and using its modules take a bit of getting used to. At this point, just understand that you can store the results of commands in variables, and a variable can have different attributes that can be accessed when needed. Also note that you have a lot of freedom in how you name your variables (subject to certain restrictions - see <a href="https://www.python.org/dev/peps/pep-0008/" target=_blank>here for some guidance</a>).

For example, the following would also work (bonus points if you can name the brewery lending its name to the variables):

In [42]:
highlander = "http://content.guardianapis.com/tags?" # base web address
jarl = "covid-19" # term we want to search for
avalanche = {"api-key": api_key} # authentication
hurricane_jack = highlander + "q=" + jarl # construct web address for requesting
print(hurricane_jack)

# Make call to API

beers = requests.get(hurricane_jack, headers=avalanche) # request the web address
beers.status_code # check if API was requested successfully

http://content.guardianapis.com/tags?q=covid-19


200

Back to the request:

Good, we get a status code of _200_ - this means we made a successful call to the API. <a href="https://www.textbook.ds100.org/ch/07/web_http.html" target=_blank>Lau, Gonzalez and Nolan</a> provide a succinct description of different types of status codes:

* **100s** - Informational: More input is expected from client or server (e.g. 100 Continue, 102 Processing)
* **200s** - Success: The client's request was successful (e.g. 200 OK, 202 Accepted)
* **300s** - Redirection: Requested URL is located elsewhere; May need user's further action (e.g. 300 Multiple Choices, 301 Moved Permanently)
* **400s** - Client Error: Client-side error (e.g. 400 Bad Request, 403 Forbidden, 404 Not Found)
* **500s** - Server Error: Server-side error or server is incapable of performing the request (e.g. 500 Internal Server Error, 503 Service Unavailable)

For clarity:
* **Client**: your machine
* **Server**: the machine you are requesting the resource from
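The grouping of status codes described above can be expressed as a small function. This is a sketch for illustration: `describe_status` is a hypothetical helper of ours, built on the `HTTPStatus` class from Python's standard library.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the class of an HTTP status code together with its standard phrase."""
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    category = classes.get(code // 100, "Unknown")  # the first digit determines the class
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # code is not one of the registered status codes
        phrase = "Unrecognised code"
    return category, phrase


for code in (200, 404, 503):
    print(code, describe_status(code))
```

In your own scripts, the `requests` module also provides `response.raise_for_status()`, which raises an exception when the status code is in the 400s or 500s.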

You may be wondering exactly what it is we received. To see the content of the response, we can call the `json()` method on the `response` variable:

In [6]:
data = response.json() 
data # view the content of the response

{'response': {'status': 'ok',
  'userTier': 'developer',
  'total': 15654,
  'startIndex': 1,
  'pageSize': 10,
  'currentPage': 1,
  'pages': 1566,
  'orderBy': 'relevance',
  'results': [{'id': 'world/2020/feb/27/what-is-covid-19',
    'type': 'article',
    'sectionId': 'world',
    'sectionName': 'World news',
    'webPublicationDate': '2020-03-25T07:00:39Z',
    'webTitle': 'What is Covid-19?',
    'webUrl': 'https://www.theguardian.com/world/2020/feb/27/what-is-covid-19',
    'apiUrl': 'https://content.guardianapis.com/world/2020/feb/27/what-is-covid-19',
    'isHosted': False,
    'pillarId': 'pillar/news',
    'pillarName': 'News'},
   {'id': 'environment/2020/apr/18/covid-19-a-blessing-for-pangolins',
    'type': 'article',
    'sectionId': 'environment',
    'sectionName': 'Environment',
    'webPublicationDate': '2020-04-18T17:00:14Z',
    'webTitle': 'Covid-19 – a blessing for pangolins?',
    'webUrl': 'https://www.theguardian.com/environment/2020/apr/18/covid-19-a-blessin

Just by scanning the first few lines we can pick out useful metadata about the response: we can see our search has yielded over 15,000 results, and there are ten results per page spread out over 1,500 pages. We can also see where the actual content of the search is contained, in a section helpfully called `results`.
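The relationship between these pieces of metadata is simple arithmetic: the number of pages is the total number of results divided by the page size, rounded up. The sketch below assumes this is how the API calculates `pages`, which is consistent with the figures shown above; the function name is our own.

```python
import math


def number_of_pages(total, page_size):
    """Number of pages needed to hold `total` results at `page_size` results per page."""
    return math.ceil(total / page_size)


print(number_of_pages(15654, 10))  # figures taken from the response above; prints 1566
```

Knowing this in advance is useful when planning how many calls you will need to make to retrieve every result.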

It is important we familiarise ourselves with the hierarchical structure of the response. Note how we needed to call the `json()` method on the `response` variable. JSON stands for JavaScript Object Notation, and is a hierarchical data structure based on key-value pairs (known as *items*), which are separated by commas (Brooker, 2020; Tagliaferri, n.d.). For example, the `pageSize` key stores the value `10`. The hierarchical structure is evident in how a key can contain a list of other key-value pairs (items). For example, the `results` key contains a list of items relating to each search result.

A JSON data structure (which Python represents as a *dictionary* once it has been parsed) can be difficult to understand at first, in no small part due to the unappealing presentation format. Visually, it is worth noting that this data structure begins and ends with curly braces (`{}`); this is useful to know when you want to create your own dictionary. 
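To see how JSON maps onto Python's data types, the sketch below parses a short piece of JSON text using the `json` module. The text itself is a made-up, heavily shortened imitation of the Guardian response, used only for illustration.

```python
import json

# A made-up, shortened imitation of the response, written as raw JSON text
raw_text = '{"response": {"pageSize": 10, "results": [{"type": "article", "isHosted": false}]}}'

parsed = json.loads(raw_text)  # convert JSON text into Python objects

print(type(parsed))                                  # the outer JSON object becomes a dictionary
print(type(parsed["response"]["results"]))           # a JSON array becomes a list
print(parsed["response"]["results"][0]["isHosted"])  # JSON false becomes Python's False
```

Note the small differences in notation: JSON writes `false` in lower case and always uses double quotation marks, whereas Python displays `False` and usually single quotation marks.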

Let's examine some of Python's methods for navigating and processing this data structure.

#### Navigating a dictionary (JSON) variable

The first thing we should do is list the keys contained in the dictionary:

In [7]:
data.keys()

dict_keys(['response'])

If you're wondering why only one key is listed, remember that we are dealing with a hierarchical data structure. The value of the `response` key is itself another dictionary, therefore we can access those keys as follows: 

In [8]:
data["response"].keys() # list keys contained in the "response" key

dict_keys(['status', 'userTier', 'total', 'startIndex', 'pageSize', 'currentPage', 'pages', 'orderBy', 'results'])

Note how we mentioned that keys can contain a list of other key-value pairs. A *list* is a Python data type used to store ordered sequences of elements (Tagliaferri, n.d.). Let's see how we can deal with lists by examining the `results` key:

In [10]:
search_results = data["response"]["results"] 
search_results # view the contents of the "results" key nested within the "response" key

[{'id': 'world/2020/feb/27/what-is-covid-19',
  'type': 'article',
  'sectionId': 'world',
  'sectionName': 'World news',
  'webPublicationDate': '2020-03-25T07:00:39Z',
  'webTitle': 'What is Covid-19?',
  'webUrl': 'https://www.theguardian.com/world/2020/feb/27/what-is-covid-19',
  'apiUrl': 'https://content.guardianapis.com/world/2020/feb/27/what-is-covid-19',
  'isHosted': False,
  'pillarId': 'pillar/news',
  'pillarName': 'News'},
 {'id': 'environment/2020/apr/18/covid-19-a-blessing-for-pangolins',
  'type': 'article',
  'sectionId': 'environment',
  'sectionName': 'Environment',
  'webPublicationDate': '2020-04-18T17:00:14Z',
  'webTitle': 'Covid-19 – a blessing for pangolins?',
  'webUrl': 'https://www.theguardian.com/environment/2020/apr/18/covid-19-a-blessing-for-pangolins',
  'apiUrl': 'https://content.guardianapis.com/environment/2020/apr/18/covid-19-a-blessing-for-pangolins',
  'isHosted': False,
  'pillarId': 'pillar/news',
  'pillarName': 'News'},
 {'id': 'world/2020/apr

Visually, a list begins and ends with square brackets (`[]`) - however we can just ask Python to confirm that the `search_results` variable is a list:

In [11]:
type(search_results)

list

We can check how many elements are in a list by calling on the `len()` function:

In [12]:
len(search_results)

10

We can view each element in the list of search results as follows:

In [13]:
for result in search_results:
    print(result["type"]) # view content type
    print(result["sectionName"]) # view newspaper section content appeared in
    print(result["webPublicationDate"]) # view date content was published online
    print() # print a blank line
    print("-------------")
    print() # print a blank line

article
World news
2020-03-25T07:00:39Z

-------------

article
Environment
2020-04-18T17:00:14Z

-------------

article
World news
2020-04-17T15:46:13Z

-------------

article
World news
2020-04-19T16:57:00Z

-------------

article
World news
2020-03-29T16:37:18Z

-------------

article
World news
2020-03-25T19:27:20Z

-------------

article
World news
2020-03-18T16:57:30Z

-------------

article
Environment
2020-04-17T09:00:10Z

-------------

article
Technology
2020-04-06T17:00:23Z

-------------

article
World news
2020-04-04T16:29:35Z

-------------
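Looping over a list is also a convenient way of summarising it. As a small sketch, the section names printed above can be tallied with `Counter` from Python's standard library; the list is typed out by hand here so the example runs on its own, whereas in the notebook you would build it from `search_results`.

```python
from collections import Counter

# Section names copied from the ten results printed above
section_names = [
    "World news", "Environment", "World news", "World news", "World news",
    "World news", "World news", "Environment", "Technology", "World news",
]

counts = Counter(section_names)  # tally how often each section appears
print(counts.most_common())      # sections ordered from most to least frequent
```

With the live data, the equivalent list could be produced with `[result["sectionName"] for result in search_results]`.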



It takes a bit of time to get used to unfamiliar data structures and types. If you are a quantitative social scientist, you may be used to working with data structured in a tabular (i.e., variable-by-case) format, such as the sample dataset below.
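As a rough illustration of the tabular (variable-by-case) layout, the sketch below arranges two of the search results printed earlier into rows and columns using Python's built-in `csv` module. The choice of three columns is ours, made purely for illustration.

```python
import csv
import io

# Two records copied from the search results printed earlier
records = [
    {"type": "article", "sectionName": "World news", "webPublicationDate": "2020-03-25T07:00:39Z"},
    {"type": "article", "sectionName": "Environment", "webPublicationDate": "2020-04-18T17:00:14Z"},
]

columns = ["type", "sectionName", "webPublicationDate"]  # one column per variable

buffer = io.StringIO()  # an in-memory text file, so nothing is written to disk
writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
writer.writeheader()       # first row: variable names
writer.writerows(records)  # one row per case (search result)

table = buffer.getvalue()
print(table)
```

Each row is a case (an article) and each column a variable, which is the shape most statistical software expects.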

In section 4.7, we'll see how we can convert data from a JSON structure to a tabular format, but just keep in mind that JSON is a popular means of storing and sharing data found on the web and it is worth becoming proficient in managing and manipulating it.


### Saving results

The final task is to save the data to a file that we can use in the future. We'll write to a JSON file, as this is the format the data were returned in.

In [50]:
# Create a downloads folder

try:
    os.mkdir("./downloads")
except FileExistsError: # only catch the error raised when the folder is already there
    print("Unable to create folder: already exists")

Unable to create folder: already exists


The use of "./" tells the `os.mkdir()` command that the "downloads" folder should be created inside the current working directory, which is normally the directory where this notebook is located. So if this notebook was stored in a directory located at "C:/Users/joebloggs/notebooks", the `os.mkdir()` command would result in a new folder located at "C:/Users/joebloggs/notebooks/downloads".
   
(Technically the "./" is not needed and you could just write `os.mkdir("downloads")` but it's good practice to be explicit)
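An alternative worth knowing about is `os.makedirs()` with the `exist_ok=True` option, which creates the folder if it is missing and does nothing if it is already there. The sketch below works inside a temporary directory (an assumption made so it can be run anywhere without touching your own files).

```python
import os
import tempfile

base = tempfile.mkdtemp()                 # a throwaway directory for this demonstration
target = os.path.join(base, "downloads")

os.makedirs(target, exist_ok=True)        # creates the folder
os.makedirs(target, exist_ok=True)        # running it again raises no error

created = os.path.isdir(target)
print(created)
```

This removes the need for a `try`/`except` block, at the cost of not being told whether the folder already existed.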

In [51]:
# Write the results to a JSON file

date = datetime.now().strftime("%Y-%m-%d") # get today's date in YYYY-MM-DD format
print(date)

outfile = "./downloads/guardian-api-covid-19-search-" + date + ".json"

with open(outfile, "w") as f:
    json.dump(data, f)

2020-04-27


The code above defines a name and location for the file which will store the results of the API request. We then open the file in *write* mode and save (or "dump") the contents of the `data` variable to it.

How do we know this worked? The simplest way is to check whether a) the file was created, and b) the results were written to it.

In [52]:
# a) Check presence of file in "downloads" folder

os.listdir("./downloads")

['covid-19-country-statistics-2020-04-23.csv',
 'covid-19-statistics-2020-04-23.csv',
 'guardian-api-covid-19-search-2020-04-27.json']

In [53]:
# b) Open file and read (import) its contents

with open(outfile, "r") as f:
    data = json.load(f) # use the "load()" method of the "json" module
    
print(data)

{'response': {'status': 'ok', 'userTier': 'developer', 'total': 15585, 'startIndex': 1, 'pageSize': 10, 'currentPage': 1, 'pages': 1559, 'orderBy': 'relevance', 'results': [{'id': 'world/2020/feb/27/what-is-covid-19', 'type': 'article', 'sectionId': 'world', 'sectionName': 'World news', 'webPublicationDate': '2020-03-25T07:00:39Z', 'webTitle': 'What is Covid-19?', 'webUrl': 'https://www.theguardian.com/world/2020/feb/27/what-is-covid-19', 'apiUrl': 'https://content.guardianapis.com/world/2020/feb/27/what-is-covid-19', 'isHosted': False, 'pillarId': 'pillar/news', 'pillarName': 'News'}, {'id': 'world/2020/apr/17/pregnancy-and-the-covid-19-frontline', 'type': 'article', 'sectionId': 'world', 'sectionName': 'World news', 'webPublicationDate': '2020-04-17T15:46:13Z', 'webTitle': 'Pregnancy and the Covid-19 frontline | Letters', 'webUrl': 'https://www.theguardian.com/world/2020/apr/17/pregnancy-and-the-covid-19-frontline', 'apiUrl': 'https://content.guardianapis.com/world/2020/apr/17/pregna
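Rather than reading through the printed output, the same two checks can be expressed as code. The sketch below uses a small made-up dictionary and a temporary folder so that it runs independently of the Guardian data; in the notebook you would use the existing `data` and `outfile` variables instead.

```python
import json
import os
import tempfile

# A made-up stand-in for the API response
data = {"response": {"status": "ok", "pageSize": 10, "results": []}}
outfile = os.path.join(tempfile.mkdtemp(), "roundtrip-check.json")

with open(outfile, "w") as f:
    json.dump(data, f)            # write the data to the file

with open(outfile, "r") as f:
    reloaded = json.load(f)       # read it back in

file_exists = os.path.isfile(outfile)  # a) was the file created?
identical = reloaded == data           # b) does it contain what we saved?
print(file_exists, identical)
```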

And voilà, we have successfully requested data using an API!

### Refining Covid-19 data collection

We will complete our work gathering data on articles relating to Covid-19 by refining our request and handling of the response. In particular, we will deal with the following outstanding issues:
1. Including additional search terms
2. Dealing with multiple pages of results
3. Requesting the contents (text) of some of the articles contained in our list of search results 
4. Converting data stored in JSON to a different file format
5. Handling the rate limit

The following blocks of code contain fewer comments explaining what each command is doing, so feel free to add these back in if you like.

Let's get some of the preliminaries out of the way before we begin:

In [40]:
import os
import requests
import json
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup as soup

date = datetime.now().strftime("%Y-%m-%d")

#### Including additional search terms

We can use logical operators - AND, OR and NOT - to include additional search terms in our request.

In [41]:
# Define web address and search terms

baseurl = "http://content.guardianapis.com/search?"
searchterms = "covid-19 OR coronavirus" # terms we want to search for
auth = {"api-key": api_key}

webadd = baseurl + "q=" + searchterms
print(webadd)

# Make call to API

response = requests.get(webadd, headers=auth)
response.status_code

http://content.guardianapis.com/search?q=covid-19 OR coronavirus


200
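
The address printed above contains spaces, which are not valid in a URL; `requests` encodes them for us before the request is sent. As a minimal sketch of making that encoding explicit, the standard library's `urlencode` function can build the query string from a dictionary. The `query` variable name is illustrative, and no request is sent here.

```python
from urllib.parse import urlencode

baseurl = "http://content.guardianapis.com/search?"

# Search terms expressed as a dictionary rather than joined together by hand
query = {"q": "covid-19 OR coronavirus"}

# urlencode() replaces each space with "+" so the address is a valid URL
webadd = baseurl + urlencode(query)
print(webadd)
```

`requests.get()` also accepts a `params` argument that takes a dictionary like `query` and performs similar encoding itself.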

In [42]:
response.json()

{'response': {'status': 'ok',
  'userTier': 'developer',
  'total': 19610,
  'startIndex': 1,
  'pageSize': 10,
  'currentPage': 1,
  'pages': 1961,
  'orderBy': 'relevance',
  'results': [{'id': 'world/2020/apr/24/coronavirus-what-have-scientists-learned-about-covid-19-so-far',
    'type': 'article',
    'sectionId': 'world',
    'sectionName': 'World news',
    'webPublicationDate': '2020-04-24T10:16:18Z',
    'webTitle': 'Coronavirus: what have scientists learned about Covid-19 so far?',
    'webUrl': 'https://www.theguardian.com/world/2020/apr/24/coronavirus-what-have-scientists-learned-about-covid-19-so-far',
    'apiUrl': 'https://content.guardianapis.com/world/2020/apr/24/coronavirus-what-have-scientists-learned-about-covid-19-so-far',
    'isHosted': False,
    'pillarId': 'pillar/news',
    'pillarName': 'News'},
   {'id': 'australia-news/2020/apr/21/australia-coronavirus-victims-age-names-deaths-covid-19-australian-death-toll',
    'type': 'article',
    'sectionId': 'austral

Note how the total number of results is higher than if we just searched for "covid-19" on its own.

#### Dealing with multiple pages of results

Sticking with our previous example, we noted how the search results were restricted to ten per page. There are a couple of ways of dealing with this issue:
* Increasing the number of results returned per page; AND
* Requesting each individual page of results

First, increase the number of results per page using the `page-size` filter:

In [43]:
# Define web address and search terms

baseurl = "http://content.guardianapis.com/search?"
searchterms = "covid-19 OR coronavirus"
auth = {"api-key": api_key}
numresults = "50" # number of results per page

webadd = baseurl + "q=" + searchterms + "&page-size=" + numresults
print(webadd)

# Make call to API

response = requests.get(webadd, headers=auth)
response.status_code

http://content.guardianapis.com/search?q=covid-19 OR coronavirus&page-size=50


200

Confirm the `page-size` filter was applied:

In [44]:
data = response.json()
data["response"]["pageSize"]

50

Good, now let's request each page of results. We start by finding out how many pages we need to request:

In [45]:
total_pages = data["response"]["pages"]
total_pages = total_pages + 1
total_pages

394

Note how we needed to add one to the total number of pages. This is because of the way Python loops over a range of numbers. For example, the code below loops over numbers beginning at one and **up to but not including** five:

In [46]:
for i in range(1, 5):
    print(i)

1
2
3
4
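
To make the adjustment concrete, here is a small sketch that derives the number of pages from the `total` of 19,610 results reported earlier and a page size of 50, then builds the range of page numbers to request. The variable names are illustrative; in practice the API reports the page count directly in `data["response"]["pages"]`.

```python
import math

total_results = 19610  # "total" reported in the earlier response
page_size = 50         # number of results per page

# Number of pages needed to hold every result (the last page may be partly full)
pages = math.ceil(total_results / page_size)

# range() stops before its second argument, so add one to include the last page
page_numbers = range(1, pages + 1)

print(pages)
print(page_numbers[0], page_numbers[-1])
```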


Requesting the total number of pages may take a while, so let's just request the first 20:

In [49]:
for pagenum in range(1, 21):
    
    baseurl = "http://content.guardianapis.com/search?"
    searchterms = "covid-19 OR coronavirus"
    auth = {"api-key": api_key}
    numresults = "50"
    
    webadd = baseurl + "q=" + searchterms + "&page-size=" + numresults \
        + "&page=" + str(pagenum)
    print(webadd)

    # Make call to API

    response = requests.get(webadd, headers=auth)
    page_data = response.json() # parse the results for this page
    
    # Save to file
    
    outfile = "./downloads/guardian-api-covid-19-search-page-" \
        + str(pagenum) + "-" + date + ".json"

    with open(outfile, "w") as f:
        json.dump(page_data, f) # save this page's results, not the earlier `data` variable

http://content.guardianapis.com/search?q=covid-19 OR coronavirus&page-size=50&page=1
http://content.guardianapis.com/search?q=covid-19 OR coronavirus&page-size=50&page=2
http://content.guardianapis.com/search?q=covid-19 OR coronavirus&page-size=50&page=3
http://content.guardianapis.com/search?q=covid-19 OR coronavirus&page-size=50&page=4
http://content.guardianapis.com/search?q=covid-19 OR coronavirus&page-size=50&page=5
http://content.guardianapis.com/search?q=covid-19 OR coronavirus&page-size=50&page=6
http://content.guardianapis.com/search?q=covid-19 OR coronavirus&page-size=50&page=7
http://content.guardianapis.com/search?q=covid-19 OR coronavirus&page-size=50&page=8
http://content.guardianapis.com/search?q=covid-19 OR coronavirus&page-size=50&page=9
http://content.guardianapis.com/search?q=covid-19 OR coronavirus&page-size=50&page=10
http://content.guardianapis.com/search?q=covid-19 OR coronavirus&page-size=50&page=11
http://content.guardianapis.com/search?q=covid-19 OR coronaviru

In [50]:
os.listdir("./downloads")

['guardian-api-covid-19-search-page-1-2020-04-28.json',
 'guardian-api-covid-19-search-page-10-2020-04-28.json',
 'guardian-api-covid-19-search-page-11-2020-04-28.json',
 'guardian-api-covid-19-search-page-12-2020-04-28.json',
 'guardian-api-covid-19-search-page-13-2020-04-28.json',
 'guardian-api-covid-19-search-page-14-2020-04-28.json',
 'guardian-api-covid-19-search-page-15-2020-04-28.json',
 'guardian-api-covid-19-search-page-16-2020-04-28.json',
 'guardian-api-covid-19-search-page-17-2020-04-28.json',
 'guardian-api-covid-19-search-page-18-2020-04-28.json',
 'guardian-api-covid-19-search-page-19-2020-04-28.json',
 'guardian-api-covid-19-search-page-2-2020-04-28.json',
 'guardian-api-covid-19-search-page-20-2020-04-28.json',
 'guardian-api-covid-19-search-page-3-2020-04-28.json',
 'guardian-api-covid-19-search-page-4-2020-04-28.json',
 'guardian-api-covid-19-search-page-5-2020-04-28.json',
 'guardian-api-covid-19-search-page-6-2020-04-28.json',
 'guardian-api-covid-19-search-page-7
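
The paging loop earlier in this section sends its requests one immediately after another. APIs usually cap how many requests you can make in a given period (the rate limit), so it can be worth pausing between calls. The sketch below shows one way of doing this with `time.sleep()`. The `paced` function and the delay value are illustrative assumptions rather than anything prescribed by the Guardian API, so check the documentation for the limits attached to your key.

```python
import time

def paced(items, delay_seconds):
    """Yield each item in turn, pausing between consecutive items."""
    for position, item in enumerate(items):
        if position > 0:
            time.sleep(delay_seconds)  # wait before handing over the next item
        yield item

# The request itself is replaced by a print() so the example runs offline
for pagenum in paced(range(1, 4), delay_seconds=0.1):
    print("Would request page", pagenum)
```

Wrapping the pause in a generator keeps the body of the loop unchanged: only the `range()` call in the `for` statement needs to be wrapped.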

#### Requesting article contents (text)

The list of search results contains data and metadata about individual articles. What if we wanted the contents of the article? Thankfully, the API has an endpoint (*Single Item*) that provides access to this data.

First, we need to extract an article from our list of search results. We do this by accessing its *positional value* (index) in the list: 

In [51]:
search_results = data["response"]["results"]
article = search_results[0] # extract first article in list of results
article

{'id': 'world/2020/apr/24/coronavirus-what-have-scientists-learned-about-covid-19-so-far',
 'type': 'article',
 'sectionId': 'world',
 'sectionName': 'World news',
 'webPublicationDate': '2020-04-24T10:16:18Z',
 'webTitle': 'Coronavirus: what have scientists learned about Covid-19 so far?',
 'webUrl': 'https://www.theguardian.com/world/2020/apr/24/coronavirus-what-have-scientists-learned-about-covid-19-so-far',
 'apiUrl': 'https://content.guardianapis.com/world/2020/apr/24/coronavirus-what-have-scientists-learned-about-covid-19-so-far',
 'isHosted': False,
 'pillarId': 'pillar/news',
 'pillarName': 'News'}

In Python, indexing begins at zero (in R indexing begins at 1). Therefore, the first item in a list is located at position zero (e.g., `search_results[0]`), the second item at position one (e.g., `search_results[1]`) etc.

**TASK**: extract a different article using another index value. Remember, there are 50 elements in the `search_results` list, so `[49]` is the highest index value you can refer to.

Next, we need to request its contents using the web address contained in the `apiUrl` key and specifying we want the body (text) of the article returned also:

In [60]:
baseurl = article["apiUrl"]
auth = {"api-key": api_key}
field = "body"

webadd = baseurl + "?" + "show-fields=" + field

# Make call to API

response = requests.get(webadd, headers=auth)
response.status_code

200

In [61]:
data = response.json()
data

{'response': {'status': 'ok',
  'userTier': 'developer',
  'total': 1,
  'content': {'id': 'world/2020/apr/24/coronavirus-what-have-scientists-learned-about-covid-19-so-far',
   'type': 'article',
   'sectionId': 'world',
   'sectionName': 'World news',
   'webPublicationDate': '2020-04-24T10:16:18Z',
   'webTitle': 'Coronavirus: what have scientists learned about Covid-19 so far?',
   'webUrl': 'https://www.theguardian.com/world/2020/apr/24/coronavirus-what-have-scientists-learned-about-covid-19-so-far',
   'apiUrl': 'https://content.guardianapis.com/world/2020/apr/24/coronavirus-what-have-scientists-learned-about-covid-19-so-far',
   'fields': {'body': '<p>Coronaviruses have been causing problems for humanity for a long time. Several versions are known to trigger common colds and more recently two types have set off outbreaks of deadly illnesses: severe acute respiratory syndrome (Sars) and Middle East respiratory syndrome (Mers).</p> \n<p>But their impact has been mild compared with

Note we now have a key called `fields` which contains a dictionary with one key (`body`):

In [63]:
data["response"]["content"]["fields"].keys()

dict_keys(['body'])

In [64]:
text = data["response"]["content"]["fields"]["body"]
text

'<p>Coronaviruses have been causing problems for humanity for a long time. Several versions are known to trigger common colds and more recently two types have set off outbreaks of deadly illnesses: severe acute respiratory syndrome (Sars) and Middle East respiratory syndrome (Mers).</p> \n<p>But their impact has been mild compared with the global havoc unleashed by the coronavirus that is causing the Covid-19 pandemic. In only a few months it has triggered lockdowns in dozens of nations, and the disease continues to spread.</p> \n<p>That is an extraordinary achievement for a spiky ball of genetic material coated in fatty chemicals called lipids, and which measures 80 billionths of a metre in diameter. Humanity has been brought low by a very humble assailant.</p> \n<p>On the other hand, our knowledge about the Sars-CoV-2, the virus that causes Covid-19, is also remarkable. This was an organism unknown to science five months ago. Today it is the subject of study on an unprecedented scale

The article's content is returned as a long piece of text (known as a *string* data type in Python). However, you may have noticed lots of tags scattered throughout (e.g., `<p>`, `</ul>`): their presence confirms that the content is actually a piece of HTML, the markup language that web pages are written in.

The `BeautifulSoup` module, which we imported earlier as `soup`, can parse this HTML and separate the text from the tags.
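
As a minimal sketch of how the tags can be stripped out, the example below parses a shortened excerpt of the article body with `BeautifulSoup` and keeps only the text of each paragraph. The `sample_html` string and the other variable names are illustrative; to process the real article, pass the `text` variable in its place.

```python
from bs4 import BeautifulSoup as soup

# Shortened excerpt of the article body, for illustration only
sample_html = (
    "<p>Coronaviruses have been causing problems for humanity for a long time.</p> \n"
    "<p>Humanity has been brought low by a very humble assailant.</p>"
)

# Parse the HTML using Python's built-in parser
parsed = soup(sample_html, "html.parser")

# Find every <p> tag and keep only the text inside it
paragraphs = [p.get_text(strip=True) for p in parsed.find_all("p")]

# Join the paragraphs back into one plain-text string
plain_text = "\n\n".join(paragraphs)
print(plain_text)
```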

Finally, let's conclude by saving the article's data and metadata to a JSON file:

In [67]:
# Use the article's unique id to name the file

article_id = data["response"]["content"]["id"].replace("/", "-")
#
# We need to remove forward slashes ("/") as our computer interprets
# these as folder separators.
#

outfile = "./downloads/" + article_id + ".json"

with open(outfile, "w") as f:
    json.dump(data, f)

#### Converting data stored in JSON to a different file format

While JSON is an excellent format for storing and sharing data found on the web, it is not always the easiest to work with when using other common social science software applications (e.g., Stata, SPSS, NVivo). In some circumstances it can be worth converting data stored as JSON into a "friendlier" data structure. There are two ways of achieving this:
1. Extracting information of interest from the JSON variable and storing it in a different data structure (e.g., a table, data frame).
2. Collapsing ("flattening") the hierarchical structure of the JSON variable.

The former is simpler but time intensive if there are lots of fields you need to extract. The latter is quicker but can be difficult to implement if there are lots of nested fields in the data.

We'll demonstrate both approaches, leaving you to choose the method you prefer for your own work.
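
A minimal sketch of both approaches using `pandas` follows. The `sample` dictionary is a cut-down imitation of the single-article response shown earlier (with the body shortened), and the variable names and the output file name in the final comment are illustrative assumptions.

```python
import pandas as pd

# Cut-down imitation of the API response for one article (body shortened)
sample = {
    "response": {
        "status": "ok",
        "content": {
            "id": "world/2020/apr/24/coronavirus-what-have-scientists-learned-about-covid-19-so-far",
            "sectionName": "World news",
            "webPublicationDate": "2020-04-24T10:16:18Z",
            "webTitle": "Coronavirus: what have scientists learned about Covid-19 so far?",
            "fields": {
                "body": "<p>Coronaviruses have been causing problems for humanity for a long time.</p>"
            },
        },
    }
}
content = sample["response"]["content"]

# Approach 1: pick out the fields of interest and store them in a data frame
row = {
    "id": content["id"],
    "title": content["webTitle"],
    "published": content["webPublicationDate"],
    "body": content["fields"]["body"],
}
extracted = pd.DataFrame([row])

# Approach 2: flatten the nested structure (requires pandas 1.0 or later);
# nested keys become column names joined by a full stop, e.g. "fields.body"
flattened = pd.json_normalize(content)

print(extracted.columns.tolist())
print(flattened.columns.tolist())

# Either data frame can then be saved in a tabular format such as CSV:
# extracted.to_csv("./downloads/article.csv", index=False)
```

The first approach gives you control over which columns appear and what they are called; the second keeps every field but leaves you with the column names generated from the nesting.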

## Value, limitations and ethics

Computational methods for collecting data from the web are an increasingly important component of a social scientist's toolkit. They enable individuals to collect and reshape data - qualitative and quantitative - that otherwise would be inaccessible for research purposes. Thus far, this notebook has focused on the logic and practice of using APIs; however, it is crucial that we also reflect critically on the value, limitations and ethical implications of these methods for social science research.

## Conclusion

*Web-scraping is a simple yet powerful computational method for collecting data of value for social science research. It also provides a relatively gentle introduction to using programming languages. However, "with great power comes great responsibility" (sorry). Web-scraping takes you into the realm of data protection, website Terms of Service (ToS), and many murky ethical issues. Wielded sensibly and sensitively, web-scraping is a valuable and exciting social science research method.*

Good luck on your data-driven travels!

## Bibliography

Barba, Lorena A. et al. (2019). *Teaching and Learning with Jupyter*. <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>https://jupyter4edu.github.io/jupyter-edu-book/</a>.

Brooker, P. (2020). *Programming with Python for Social Scientists*. London: SAGE Publications Ltd.

Lau, S., Gonzalez, J., & Nolan, D. (n.d.). *Principles and Techniques of Data Science*. https://www.textbook.ds100.org

Tagliaferri, L. (n.d.). *How to Code in Python 3*. https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf

## Further reading and resources

We publish a list of useful books, papers, websites and other resources on our web-scraping GitHub repository: <a href="https://github.com/UKDataServiceOpen/web-scraping/tree/master/reading-list/" target=_blank>[Reading list]</a>

The help documentation for the `requests` module is refreshingly readable and useful:
* <a href="https://requests.readthedocs.io/en/master/" target=_blank>`requests`</a>

You may also be interested in the following articles specifically relating to APIs:

## Appendices

### Appendix A - Requesting URLs

We refer to a website's location on the internet as its web address or Uniform Resource Locator (URL).

In Python we've made use of the excellent `requests` module. By calling the `requests.get()` method, we mimic the manual process of launching a web browser and visiting a website. The `requests` module achieves this by placing a _request_ to the server hosting the website (e.g., show me the contents of the website), and handling the _response_ that is returned (e.g., the contents of the website and some metadata about the request). This _request-response_ protocol is known as HTTP (HyperText Transfer Protocol); HTTP allows computers to communicate with each other over the internet - you can learn more about it at <a href="https://www.w3schools.com/whatis/whatis_http.asp" target=_blank>W3 Schools</a>.

Run the code below to learn more about the data and metadata returned by `requests.get()`.

In [None]:
import requests

url = "https://httpbin.org/html"
response = requests.get(url)

print("1. {}".format(response)) # returns the object type (i.e. a response) and status code
print("\r")

print("2. {}".format(response.headers)) # returns a dictionary of response headers
print("\r")

print("3. {}".format(response.headers["Date"])) # return a particular header
print("\r")

print("4. {}".format(response.request)) # returns the request object that requested this response
print("\r")

print("5. {}".format(response.url)) # returns the URL of the response
print("\r")

#print(response.text) # returns the body of the response decoded as a string (i.e. the HTML of the web page)
#print(response.content) # returns the body of the response as raw bytes

# Visit https://www.w3schools.com/python/ref_requests_response.asp for a full list of what is returned by the server
# in response to a request.

### Appendix B - Capturing UK Police data

For our first example we will attempt to download data on 'Stop and search' activity by police forces in England, Wales and Northern Ireland. We will use open police data available at [https://data.police.uk/](https://data.police.uk/).

Let's see how we can use Python to achieve this task.

In [None]:
## Title: Downloading 'Stop and search' police data
## Created: 10/02/2020
## Creator: Diarmuid McDonnell, University of Manchester

# Importing modules #

# Python comes with a large suite of ready-to-use functions; however, some must be explicitly downloaded and imported 
# into your Python session. 
#
# A module bundles together code, data, documentation and tests, and provides an easy method to share with others.

try:
    import requests # module for requesting urls
    import csv # module for handling csv files
    import json # module for handling json data
    import os # module for performing operating system tasks
    print("Successfully imported modules")
except ImportError: # only catch failed imports, not every possible error
    print("Did not import one or more modules!")

**QUESTION:** What do you think the ```try, except``` block does? 

The next step is to figure out what datasets are available through the API and how they can be accessed. This can only be learned by reading the [API documentation](https://data.police.uk/docs/).

The unique id or name of a dataset available via an API is known as an *endpoint*. For example, 'Stop and search' data is available via the ```stops-street``` endpoint, and we would request this dataset by sending the following URL to the API: ```https://data.police.uk/api/stops-street?``` (any characters following the '?' symbol represent customisable search terms e.g. 'Stop and search' data for certain areas).
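
As a sketch of how such search terms are attached, the example below builds a request URL for the `stops-street` endpoint. The `lat`, `lng` and `date` parameter names are taken to be those described in the API documentation, and the coordinate and month values are arbitrary placeholders, so treat all of them as assumptions to check against the documentation. No request is sent.

```python
from urllib.parse import urlencode

endpoint = "https://data.police.uk/api/stops-street?"

# Search terms: a location and a month (placeholder values)
search_terms = {"lat": 53.4668, "lng": -2.2339, "date": "2019-12"}

# Each key-value pair becomes "key=value", and the pairs are joined with "&"
url = endpoint + urlencode(search_terms)
print(url)
```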

In [None]:
# Exploring the Police API #

# See what 'Stop and search' datasets are available

datasets = 'https://data.police.uk/api/crimes-street-dates' # define the endpoint where a list of available datasets is found


# Request the list of available datasets from the above endpoint

response = requests.get(datasets, allow_redirects=True) # request the url
print("----------------------------") # additional print() commands to format output
print("\r")
print(response.status_code, " | ", response.headers) # print the metadata behind the request to see if it was successful
print("\r")

The status code is **200**, signifying we made a successful request to the API. Now let's examine the contents of the response, i.e. the list of available datasets:

In [None]:
sdata = json.loads(response.content) # parse the JSON response into a Python object (easier for manipulating and saving)
print(type(sdata)) # view the response type
sdata[0] # view the first item in the list

Let's parse the results of the API request:
1. We parse the returned response - a list of all 'Stop and search' data available via the API - from JSON into a Python object; this makes it easier to manipulate the contents and save to a file.
2. We ask Python to tell us what data type we are dealing with. In this instance it is a ```list```.

-- END OF FILE --