![UKDS Logo](./images/UKDS_Logos_Col_Grey_300dpi.png)

# Web-scraping for Social Science Research

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *New Forms of Data for Social Science Research*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites and social media platforms, text data, and simulations (agent-based modelling), to name a few. To help you get to grips with these new forms of data, we provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
April 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Guide-to-using-this-resource" data-toc-modified-id="Guide-to-using-this-resource-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Guide to using this resource</a></span><ul class="toc-item"><li><span><a href="#Interaction" data-toc-modified-id="Interaction-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Interaction</a></span></li><li><span><a href="#Learn-more" data-toc-modified-id="Learn-more-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Learn more</a></span></li></ul></li><li><span><a href="#Collecting-data-from-online-databases-using-an-API" data-toc-modified-id="Collecting-data-from-online-databases-using-an-API-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Collecting data from online databases using an API</a></span><ul class="toc-item"><li><span><a href="#What-is-an-API?" data-toc-modified-id="What-is-an-API?-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>What is an API?</a></span></li><li><span><a href="#Reasons-to-interact-with-an-API" data-toc-modified-id="Reasons-to-interact-with-an-API-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Reasons to interact with an API</a></span></li><li><span><a href="#2.1.1.-What-is-an-API?" data-toc-modified-id="2.1.1.-What-is-an-API?-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>2.1.1. What is an API?</a></span></li><li><span><a href="#2.1.3.-Example:-Requesting-'Stop-and-search'-police-data" data-toc-modified-id="2.1.3.-Example:-Requesting-'Stop-and-search'-police-data-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>2.1.3. 
Example: Requesting 'Stop and search' police data</a></span></li></ul></li><li><span><a href="#Value,-limitations-and-ethics" data-toc-modified-id="Value,-limitations-and-ethics-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Value, limitations and ethics</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Bibliography" data-toc-modified-id="Bibliography-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Bibliography</a></span></li><li><span><a href="#Further-reading-and-resources" data-toc-modified-id="Further-reading-and-resources-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Further reading and resources</a></span></li><li><span><a href="#Appendices" data-toc-modified-id="Appendices-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Appendices</a></span></li></ul></div>

## Introduction

In this training series we cover some of the essential skills needed to collect data from the web. In particular we focus on two different approaches:
1. Collecting data stored on web pages. <a href="./web-scraping-websites-code-2020-04-23.ipynb" target=_blank>[LINK]</a>
2. Downloading data from online databases using Application Programming Interfaces (APIs). [Focus of this notebook]
    
Do not be alarmed by the technical aspects: both approaches can be implemented using simple code, a standard desktop or laptop, and a decent internet connection.    

Given the Covid-19 public health crisis in which this programme of work occurred, we will examine ways in which web-scraping techniques can provide valuable data for studying this phenomenon. This is a fast moving, evolving public health emergency that, in addition to other impacts, will shape research agendas across the sciences for years to come. Therefore it is important to learn how we, as social scientists, can access or generate data that will provide a better understanding of this disease.

In this lesson we will search the <a href="https://www.theguardian.com" target=_blank>Guardian newspaper's</a> online database for articles referring to Covid-19. We will also examine two APIs that should be of wider interest to the social science community: 
* UK Police API, which provides data on street-level crime and outcomes, police stations and much more.
* Companies House API, which provides data on the governance, finances and activities of UK registered companies.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> put it:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*Collecting data from online databases using an API*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("Hello {}, enjoy learning more about Python and web-scraping!".format(name)) 

Enter your name and press enter:


### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## Collecting data from online databases using an API

### What is an API?

An Application Programming Interface (API) is "a set of functions and procedures allowing the creation of applications that access the features or data of an operating system, application, or other service" (Oxford English Dictionary). In plain English: an API acts as an intermediary between software applications. For example, if I want to build a smart phone application...

### Reasons to interact with an API

Many public, private and charitable institutions collect and share data of value to social scientists. Often they deposit their data in a data portal (e.g., <a href="https://data.gov.uk/" target=_blank>UK Government Open Data</a>), allowing you to download the files as and when needed. However, another approach they can adopt is to allow direct access to the underlying information that is stored in their database. Using this method, individuals can send a customised *request* for information to the database; if the request is valid, the database *responds* by providing the data. Think of this approach as the difference between downloading a raw data file and needing to filter out the rows you do not want, and asking for only the rows you need in the first place.

Online databases can be an important source of publicly available information on phenomena of interest: for instance, they are used to store and disseminate files, text, photos, videos, tables etc. However, the data stored on websites are typically not structured or formatted for ease of use by researchers: for example, it may not be possible to perform a bulk download of all the files you need (think of needing the annual accounts of all registered companies in London for your research...), or the information may not even be held in a file and may instead be spread across paragraphs and tables throughout a web page (or worse, web pages). Luckily, web-scraping provides a means of quickly and accurately capturing and formatting data stored on web pages.

Before we delve into writing code to capture data from the web, let's clearly state the logic underpinning the technique.

In this section we will learn how to request and process data from online databases. We interact with these databases using an approach known as an *Application Programming Interface* (API).


### 2.1.3. Example: Requesting 'Stop and search' police data

For our first example we will attempt to download data on 'Stop and search' activity by police forces in England, Wales and Northern Ireland. We will use open police data available at [https://data.police.uk/](https://data.police.uk/).

Let's see how we can use Python to achieve this task.

In [None]:
## Title: Downloading 'Stop and search' police data
## Created: 10/02/2020
## Creator: Diarmuid McDonnell, University of Manchester

# Importing modules #

# Python comes with a large suite of ready-to-use functions; however, some must be explicitly downloaded and imported 
# into your Python session. 
#
# A module bundles together code, data, documentation and tests, and provides an easy method to share with others.

try:
    import requests # module for requesting urls
    import csv # module for handling csv files
    import json # module for handling json data
    import os # module for performing operating system tasks
    print("Successfully imported modules")
except ImportError: # only catch failures to import a module
    print("Did not import one or more modules!")   

**QUESTION:** What do you think the ```try, except``` block does? 

The next step is to figure out what datasets are available through the API and how they can be accessed. This can only be learned by reading the [API documentation](https://data.police.uk/docs/).

The unique id or name of a dataset available via an API is known as an *endpoint*. For example, 'Stop and search' data is available via the ```stops-street``` endpoint, and we would request this dataset by sending the following URL to the API: ```https://data.police.uk/api/stops-street?``` (any characters following the '?' symbol represent customisable search terms e.g. 'Stop and search' data for certain areas).
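
To make the idea of customisable search terms more concrete, here is a minimal sketch of assembling a request URL for the `stops-street` endpoint. The parameter names `lat`, `lng` and `date`, the example values, and the helper `build_stops_street_url` are assumptions made for illustration and are not part of this notebook; check the API documentation for the search terms the endpoint actually accepts.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.police.uk/api/stops-street"  # endpoint named in the text above


def build_stops_street_url(lat, lng, date):
    """Return the endpoint URL followed by '?' and the encoded search terms.

    NOTE: the parameter names (lat, lng, date) are assumptions made for
    illustration; confirm them against the API documentation.
    """
    params = {"lat": lat, "lng": lng, "date": date}
    return BASE_URL + "?" + urlencode(params)


# Hypothetical example values: a latitude, a longitude and a year-month
example_url = build_stops_street_url(52.629729, -1.131592, "2018-06")
print(example_url)
```

A similar result can be achieved by passing a dictionary of search terms to the `params` argument of `requests.get()`, which encodes them and appends them to the URL for you.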

In [None]:
# Exploring the Police API #

# See what 'Stop and search' datasets are available

datasets = 'https://data.police.uk/api/crimes-street-dates' # define the endpoint where a list of available datasets is found


# Request the list of available datasets from the above endpoint

response = requests.get(datasets, allow_redirects=True) # request the url
print("----------------------------") # additional print() commands to format output
print() # blank line
print(response.status_code, " | ", response.headers) # print the metadata behind the response to see if the request was successful
print() # blank line
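
The cell above prints the status code of the response. As a rough sketch of how such codes can be interpreted, the helper below looks up the standard phrase for a code using Python's built-in `http` module. The function name `describe_status` and the wording of its output are hypothetical and not used elsewhere in this notebook.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Translate a numeric HTTP status code into a short description."""
    try:
        phrase = HTTPStatus(status_code).phrase  # standard phrase for the code
    except ValueError:
        phrase = "Unknown status"  # not a code known to the http module
    # Codes in the 200-299 range indicate that the request succeeded
    outcome = "success" if 200 <= status_code < 300 else "check the request"
    return "{} {} ({})".format(status_code, phrase, outcome)


print(describe_status(200))
print(describe_status(404))
```

In practice you could pass `response.status_code` to this function. The `requests` module also provides `response.raise_for_status()`, which raises an exception when the server reports a client or server error.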

A status code of **200** signifies that we made a successful request to the API. Now let's examine the contents of the response, i.e. the list of available datasets:

In [None]:
sdata = json.loads(response.content) # convert the json in the response into a Python object (easier for manipulating and saving)
print(type(sdata)) # view the response type
sdata[0] # view the first item in the list

Let's walk through what the code above does:
1. We convert the returned response - a list of all 'Stop and search' data available via the API - from json into a Python object; this makes it easier to manipulate the contents and save to a file.
2. We ask Python to tell us what data type we are dealing with. In this instance it is a ```list```.
3. We view the first item in the list to get a feel for its contents.
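
Once the response is a Python list, ordinary list and dictionary operations apply. The sketch below uses a small hand-written sample in place of the live response. The keys `date` and `stop-and-search`, and every value shown, are assumptions about the shape of the data made for illustration, so inspect `sdata[0]` to see what the API really returns.

```python
import json

# Hand-written stand-in for the API response (NOT real data)
sample_text = """
[
    {"date": "2020-01", "stop-and-search": ["force-a", "force-b"]},
    {"date": "2019-12", "stop-and-search": ["force-a"]}
]
"""

sample = json.loads(sample_text)  # json text -> Python list of dictionaries

# Collect the assumed "date" field from every item in the list
dates = [item["date"] for item in sample]

# Count how many entries are listed under the assumed "stop-and-search" key
forces_per_date = {item["date"]: len(item["stop-and-search"]) for item in sample}

print(type(sample), len(sample))
print(dates)
print(forces_per_date)
```

If the real response has this shape, replacing `sample` with `sdata` would produce the same summaries for the live data.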

## Value, limitations and ethics

Computational methods for collecting data from the web are an increasingly important component of a social scientist's toolkit. They enable individuals to collect and reshape data - qualitative and quantitative - that otherwise would be inaccessible for research purposes. Thus far, this notebook has focused on the logic and practice of using APIs; however, it is crucial that we reflect critically on their value, limitations and ethical implications for social science purposes.

## Conclusion

*Web-scraping is a simple yet powerful computational method for collecting data of value for social science research. It also provides a relatively gentle introduction to using programming languages. However, "with great power comes great responsibility" (sorry). Web-scraping takes you into the realm of data protection, website Terms of Service (ToS), and many murky ethical issues. Wielded sensibly and sensitively, web-scraping is a valuable and exciting social science research method.*

Good luck on your data-driven travels!

## Bibliography

Barba, Lorena A. et al. (2019). *Teaching and Learning with Jupyter*. <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>https://jupyter4edu.github.io/jupyter-edu-book/</a>.

Brooker, P. (2020). *Programming with Python for Social Scientists*. London: SAGE Publications Ltd.

Lau, S., Gonzalez, J., & Nolan, D. (n.d.). *Principles and Techniques of Data Science*. https://www.textbook.ds100.org

Tagliaferri, L. (n.d.). *How to Code in Python 3*. https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf

## Further reading and resources

We publish a list of useful books, papers, websites and other resources on our web-scraping GitHub repository: <a href="https://github.com/UKDataServiceOpen/web-scraping/tree/master/reading-list/" target=_blank>[Reading list]</a>

The help documentation for the `requests` module is refreshingly readable and useful:
* <a href="https://requests.readthedocs.io/en/master/" target=_blank>`requests`</a>

You may also be interested in the following articles specifically relating to APIs:

## Appendices
