![UKDS Logo](./images/UKDS_Logos_Col_Grey_300dpi.png)

# Collecting Data I: Web-scraping

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *Computational Social Science*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites and social media platforms, text data, and data generated by simulations (agent-based modelling), to name a few. To help you get to grips with these new forms of data, we provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/computational-social-science" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
May 2020

## Introduction

Computational methods for collecting, cleaning and analysing data are an increasingly important component of a social scientist’s toolkit. Central to engaging in these methods is the ability to write readable and effective code using a programming language.

In this training series we demonstrate core programming concepts and methods through the use of social science examples. In particular we focus on four areas of programming/computational social science:
1. Introduction to Python.
2. Collecting data I: web-scraping. [Focus of this notebook]
3. Collecting data II: APIs.
4. Setting up your computational environment.

### Aims

This lesson - **Collecting data I: web-scraping** - has two aims:
1. Demonstrate how to use Python to collect data found on websites.
2. Cultivate your computational thinking skills through coding examples. In particular, how to define and solve a data collection problem using a computational method.

### Lesson details

* **Level**: Introductory
* **Time**: 30-60 minutes
* **Pre-requisites**: None, though you may find it useful to work through our <a href="https://github.com/UKDataServiceOpen/code-demos/blob/master/code/ukds-intro-to-python-2020-05-06.ipynb" target=_blank>*Introduction to Python for social scientists*</a> lesson first.
* **Audience**: Researchers and analysts from any disciplinary background. The materials are slightly tailored for social scientists through the use of social data.
* **Learning outcomes**:
    1. Understand the key steps and requirements for collecting data from web pages using computational methods.
    2. Be able to use Python for requesting, parsing, extracting and saving data stored on a web page.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*What is web-scraping?*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and web-scraping!".format(name))

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## What is web-scraping?

Web-scraping is a computational technique for capturing information stored on a web page. "Computational" is the key word: it is possible to perform this task manually, but doing so carries considerable disadvantages in terms of accuracy and labour.

### Why would you want to "scrape" a web page?

Web-scraping provides a computational means of quickly and accurately capturing data stored on web pages. Web pages can be an important source of publicly available information on phenomena of interest - for instance, they are used to store and disseminate files, text, photos, videos, tables etc. However, the data stored on websites are typically not structured or formatted for ease of use by researchers: for example, it may not be possible to perform a bulk download of all the files you need (think of needing the annual accounts of all registered companies in London for your research...), or the information may not even be held in a file and instead spread across paragraphs and tables throughout a web page (or worse, web pages).

### What is the general approach for scraping data from a web page?

We begin by identifying a web page containing information we are interested in collecting. Then we need to **know** the following:
1. The location (i.e., web address) where the web page can be accessed. For example, the UK Data Service homepage can be accessed via <a href="https://ukdataservice.ac.uk/" target=_blank>https://ukdataservice.ac.uk/</a>.
2. The location of the information we are interested in within the structure of the web page. This involves visually inspecting a web page's underlying code using a web browser.

And **do** the following:
3. Request the web page using its web address.
4. Parse the structure of the web page so your programming language can work with its contents.
5. Extract the information we are interested in.
6. Write this information to a file for future use.

For any programming task, it is useful to write out the steps needed to solve the problem: we call this *pseudo-code*, as it captures the main tasks and the order in which they need to be executed.

## A simple web-scraping example

Let's work through the steps in our general approach using a real web page, one that is designed for practicing web-scraping.

###  Identifying the web address

The web page we are interested in can be found at the following web address: <a href="https://httpbin.org/html" target=_blank>https://httpbin.org/html</a>.

You can click on the link to open the web page in your browser, though we could just use Python to view it in this notebook:

In [None]:
from IPython.display import IFrame

IFrame("https://httpbin.org/html", width="800", height="650")

We can see that the web page contains some text - this is an excerpt from Herman Melville's classic novel *Moby Dick*.

### Locating information

Our task is to extract the text on this web page. In order to do so, we need to understand where the text is located within the underlying *source code* of the web page. Web pages are written in a language called HyperText Markup Language (HTML). HTML describes the structure of a web page, and consists of a number of elements (e.g., paragraphs, tables, headers), with each element represented by a tag (e.g., `<p>`, `<table>`, `<h1>`).

See <a href="https://www.w3schools.com/html/html_intro.asp" target=_blank>https://www.w3schools.com/html/html_intro.asp</a> for more information on HTML.
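
To make the idea of elements and tags concrete, here is a minimal sketch that lists the opening tags found in a small, made-up HTML snippet. It uses `HTMLParser` from Python's standard library purely for illustration (later in this lesson we use `BeautifulSoup` instead); the names `SAMPLE_HTML` and `TagLister` are our own and do not refer to the web page we are scraping.

```python
from html.parser import HTMLParser

# A made-up HTML snippet: a title, a section and a paragraph
SAMPLE_HTML = """
<html>
  <body>
    <h1>A page title</h1>
    <div>
      <p>A paragraph of text.</p>
    </div>
  </body>
</html>
"""


class TagLister(HTMLParser):
    """Record the name of every opening tag, in the order it appears."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


lister = TagLister()
lister.feed(SAMPLE_HTML)
print(lister.tags)  # ['html', 'body', 'h1', 'div', 'p']
```

Notice how the tags are nested: the `<p>` tag sits inside the `<div>` tag, which in turn sits inside the `<body>` tag.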

#### Visually inspecting the underlying HTML code

Therefore, what we need are the tags that identify the section of the web page where the text is stored. We can discover the tags by examining the *source code* (HTML) of the web page. This can be done using your web browser: for example, if you use Firefox you can right-click on the web page and select *View Page Source* from the list of options. (Chrome: *View page source*; Safari: follow <a href="https://www.lifewire.com/view-html-source-in-safari-3469315" target=_blank>these instructions</a>).

The cell below shows the full HTML code for the web page.

In the HTML code above, we can see multiple tags identifying different elements on the web page: there is a set of `<h1></h1>` tags representing the page title, a set of `<div></div>` tags representing a section, and a set of `<p></p>` tags representing the paragraph containing the text we are interested in. (There are also some metadata tags outside the `<body></body>` tags that we do not need to concern ourselves with).

### Requesting the web page

Now that we possess the necessary information, let's begin the process of scraping the web page. There is a preliminary step, which is setting up Python with the modules it needs to perform the web-scrape.

In [None]:
# Import modules

import os # module for navigating your machine (e.g., file directories)
import requests # module for requesting urls
from bs4 import BeautifulSoup as soup # module for parsing web pages

print("Successfully imported necessary modules")

Modules are additional techniques or functions that are not present when you launch Python. Some do not even come with Python when you download it and must be installed on your machine separately - think of using `ssc install <package>` in Stata, or `install.packages(<package>)` in R. For now just understand that many useful modules need to be imported every time you start a new Python session.
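
If you are unsure whether a module is installed on your machine, one possible check is sketched below. It relies on `importlib.util.find_spec()` from the standard library; the helper name `is_installed` is our own. Note that the import name can differ from the name used to install a package: `bs4` is installed with `pip install beautifulsoup4`, while `requests` is installed with `pip install requests`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can find a module with this import name."""
    return importlib.util.find_spec(module_name) is not None


# "os" ships with Python; "requests" and "bs4" must be installed separately
for module_name in ["os", "requests", "bs4"]:
    if is_installed(module_name):
        print(f"{module_name}: available")
    else:
        print(f"{module_name}: not found, install it before importing")
```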

Now, let's implement the process of scraping the page. First, we need to request the web page using Python; this is analogous to opening a web browser and entering the web address manually. We refer to a page's location on the internet as its web address or Uniform Resource Locator (URL).

In [None]:
# Define the URL where the web page can be accessed

url = "https://httpbin.org/html"

# Request the web page

response = requests.get(url) # request the url
response.status_code # check if page was requested successfully

Good, we get a status code of *200*, which means the request was successful. A status code in the *400s* or *500s* represents an unsuccessful attempt at requesting a web page (see <a href="https://www.textbook.ds100.org/ch/07/web_http.html" target=_blank>Lau, Gonzalez and Nolan</a> for a succinct description of different types of response status codes).
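
As a rough illustration of how status codes fall into ranges, the sketch below looks up the standard phrase for a code using `HTTPStatus` from Python's standard library. The helper name `describe_status` and the wording of the outcomes are our own.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # standard reason phrase, e.g. "OK"
    if 200 <= code < 300:
        outcome = "success"
    elif 400 <= code < 500:
        outcome = "client error (problem with the request)"
    elif 500 <= code < 600:
        outcome = "server error (problem on the website's side)"
    else:
        outcome = "other"
    return f"{code} {phrase}: {outcome}"


for code in [200, 404, 500]:
    print(describe_status(code))
```

In practice you rarely need to write this yourself: the `response` variable also has an `ok` attribute, which is `True` when the status code is below 400.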

Let's unpack the request code a bit. First, we define a variable (also known as an 'object' in Python) called `url` that contains the web address of the page we want to request. Next, we use the `get()` function of the `requests` module to request the web page and, in the same line of code, store the result of the request in a variable called `response`. Finally, we check whether the request was successful by inspecting the `status_code` attribute of the `response` variable.
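
Requests do not always succeed: a website may be slow, offline, or the URL may be mistyped. Below is a minimal sketch of a more defensive request, assuming a ten-second timeout is acceptable. The function name `fetch_page` is our own; `timeout`, `raise_for_status()` and `RequestException` are all part of the `requests` module.

```python
import requests


def fetch_page(url, timeout=10):
    """Request a URL and return the response, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
        response.raise_for_status()  # raise an error for 4xx/5xx status codes
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response
```

You could then write `response = fetch_page("https://httpbin.org/html")` and check that the result is not `None` before continuing.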

We can also view the metadata associated with our request:

In [None]:
response.headers
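
The headers behave like a Python dictionary whose keys are not case-sensitive. The sketch below illustrates this using the same type of object that `requests` uses for `response.headers`; the header values shown are hypothetical, not those returned by the page we requested.

```python
from requests.structures import CaseInsensitiveDict

# Hypothetical header values, standing in for `response.headers`
example_headers = CaseInsensitiveDict({
    "Content-Type": "text/html; charset=utf-8",
    "Content-Length": "3741",
})

# Keys can be looked up regardless of capitalisation
print(example_headers["content-type"])

# Use .get() with a default for headers that may not be present
print(example_headers.get("Last-Modified", "not provided"))
```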

You may be wondering exactly what it is we requested: if you were to type the URL (https://httpbin.org/html) into your browser and hit `enter`, the web page should appear on your screen. This is not the case when we request the URL through Python but rest assured, we have successfully requested the web page. To see the content of our request, we can examine the `text` attribute of the `response` variable:

In [None]:
response.text

This shows us the underlying code (HTML) of the web page we requested. It should be obvious that in its current form, the result of this request will be difficult to work with. This is where the `BeautifulSoup` module comes in handy.

### Parsing the web page

Now it's time to identify and understand the structure of the web page we requested. We do this by converting the content contained in the `response.text` attribute into a `BeautifulSoup` variable. `BeautifulSoup` is a Python module that provides a systematic way of navigating the elements of a web page and extracting its contents. Let's see how it works in practice:

In [None]:
# Extract the contents of the webpage from the response

soup_response = soup(response.text, "html.parser") # Parse the text as a Beautiful Soup object
soup_response

Notice how the hierarchical structure of the web page is now recognised by Python? Not only that, `BeautifulSoup` provides some methods for accessing the tags contained in the web page.
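
To give a flavour of these methods, here is a small sketch using a made-up HTML snippet rather than the page we requested. We import `BeautifulSoup` under its full name here to keep the example self-contained; `sample_html` and `sample_soup` are illustrative names.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet with a title, a heading and a paragraph
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Main heading</h1>
    <div>
      <p>First paragraph.</p>
    </div>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

print(sample_soup.title.text)     # text inside the <title> tag
print(sample_soup.h1.text)        # text inside the first <h1> tag
print(sample_soup.p.parent.name)  # name of the tag that contains the first <p>
```

The last line prints `div`, showing that Python now knows which element each tag is nested inside.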

### Extracting information

Now that we have parsed the web page, we can use Python to navigate and extract the information of interest.

In [None]:
paragraph = soup_response.find("p")
paragraph

We used the `find()` method on the `soup_response` variable to capture the set of `<p></p>` tags on the page. Remember, we used our visual inspection of the source code to identify that the text we needed was contained within a set of `<p></p>` tags, and that there was only one set.
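
Many web pages contain more than one paragraph. The sketch below, based on a made-up HTML snippet, shows how `find()` returns only the first matching tag, how `find_all()` returns every match as a list, and how `find()` returns `None` when no matching tag exists.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet containing two paragraphs and no table
sample_html = "<div><p>First paragraph.</p><p>Second paragraph.</p></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_paragraph = sample_soup.find("p")      # first <p> tag only
all_paragraphs = sample_soup.find_all("p")   # list of every <p> tag
missing_table = sample_soup.find("table")    # no <table> tag on this page

print(first_paragraph.text)
print([p.text for p in all_paragraphs])
print(missing_table)  # None
```

Checking for `None` before extracting text is a sensible habit, as calling `.text` on `None` produces an error.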

We're near the end of the scrape: we just need to extract the text from within the tags like so:

In [None]:
data = paragraph.text
print(data)
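
Scraped text often carries stray line breaks and runs of spaces from the original HTML. If you want tidier text, one possible approach is sketched below using a made-up paragraph: `get_text(strip=True)` removes whitespace at the start and end, while splitting and re-joining the text collapses internal whitespace too.

```python
from bs4 import BeautifulSoup

# A made-up paragraph with untidy whitespace
sample_html = "<p>\n    Some text   spread over\n    several lines.\n</p>"
sample_paragraph = BeautifulSoup(sample_html, "html.parser").find("p")

raw_text = sample_paragraph.text                       # whitespace kept as-is
stripped_text = sample_paragraph.get_text(strip=True)  # leading/trailing whitespace removed
tidy_text = " ".join(raw_text.split())                 # all runs of whitespace collapsed

print(repr(raw_text))
print(repr(stripped_text))
print(repr(tidy_text))  # 'Some text spread over several lines.'
```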

### Saving results from the scrape

Let's conclude by saving the scraped data to a file for future use.

In [None]:
# Define a file to store the data

outfile = "./moby-dick-scraped-data.txt" # location and name of file

# Open the file and write (save) the data to it

with open(outfile, "w") as f:
    f.write(data)

How do we know this worked? The simplest way is to check whether a) the file was created, and b) the results were written to it.

In [None]:
# Check presence of file in current folder

os.listdir()

In [None]:
# Open file and read (import) its contents

with open(outfile, "r") as f:
    newdata = f.read()
    
print(newdata)  
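
The two checks above can also be expressed directly in code. Here is a minimal sketch using `pathlib` from the standard library, which writes some stand-in text to a temporary folder and then confirms that the file exists and contains what we wrote. The names `sample_data` and `sample_outfile` are illustrative; specifying `encoding="utf-8"` is our own choice, made so that the file reads the same way on different machines.

```python
import tempfile
from pathlib import Path

sample_data = "Some scraped text."  # stand-in for the `data` variable

# A temporary folder keeps this example from touching your working directory
with tempfile.TemporaryDirectory() as folder:
    sample_outfile = Path(folder) / "sample-scraped-data.txt"
    sample_outfile.write_text(sample_data, encoding="utf-8")

    # a) was the file created? b) were the results written to it?
    file_exists = sample_outfile.exists()
    contents_match = sample_outfile.read_text(encoding="utf-8") == sample_data

print(file_exists, contents_match)  # True True
```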

And voilà, we have successfully scraped a web page!

## What have we learned?

Let's recap what key skills and techniques we've learned:
* **How to import modules**. You will usually need to import modules into Python to support your work. Python does come with some methods and functions that are ready to use straight away, but for computational social science tasks you'll almost certainly need to import some additional modules.
* **How to request and parse web pages**. You can use Python to request a web page, and the `BeautifulSoup` module to parse its contents.
* **How to read and write data**. You can save the results of your scrape to a file for future use.
* **How to do all of the above in an efficient, clear and effective manner**.
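
As one way of bringing these skills together, the sketch below wraps the request, parse, extract and save steps in two small functions. This is an illustration under our own assumptions rather than the only way to organise a scrape: the function names, the ten-second timeout and the choice to raise an error when no paragraph is found are all ours.

```python
import requests
from bs4 import BeautifulSoup


def extract_first_paragraph(html):
    """Parse HTML and return the text of the first <p> tag, or None."""
    parsed = BeautifulSoup(html, "html.parser")
    paragraph = parsed.find("p")
    if paragraph is None:
        return None
    return paragraph.get_text(strip=True)


def scrape_first_paragraph(url, outfile):
    """Request a page, extract its first paragraph and save it to a file."""
    response = requests.get(url, timeout=10)  # request the web page
    response.raise_for_status()               # stop if the request failed
    text = extract_first_paragraph(response.text)
    if text is None:
        raise ValueError(f"No <p> tag found at {url}")
    with open(outfile, "w", encoding="utf-8") as f:
        f.write(text)                         # save the text for future use
    return text
```

For example, `scrape_first_paragraph("https://httpbin.org/html", "./moby-dick-scraped-data.txt")` would repeat the scrape from this lesson in a single line. Keeping the parsing step in its own function means it can be tried out on a short HTML string without requesting a web page.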

## Conclusion

Web-scraping is a simple yet powerful computational method for collecting data of value for social science research. It also provides a relatively gentle introduction to using programming languages. However, "with great power comes great responsibility" (sorry). Web-scraping takes you into the realm of data protection, website Terms of Service (ToS), and many murky ethical issues. Wielded sensibly and sensitively, web-scraping is a valuable and exciting social science research method.

Good luck on your data-driven travels!

## Bibliography

Barba, Lorena A. et al. (2019). *Teaching and Learning with Jupyter*. <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>https://jupyter4edu.github.io/jupyter-edu-book/</a>.

Lau, S., Gonzalez, J., & Nolan, D. (n.d.). *Principles and Techniques of Data Science*. <a href="https://www.textbook.ds100.org" target=_blank>https://www.textbook.ds100.org</a>.

## Further reading and resources

We hope this brief lesson has whetted your appetite for learning more about web-scraping and Python programming in general. There are some fantastic learning materials available to you, many of them free. We highly recommend the materials referenced in the Bibliography.

In addition, you may find the following resources useful:
* <a href="https://github.com/UKDataServiceOpen/web-scraping" target=_blank>**Web-scraping for Social Science Research**</a> - a free UK Data Service training series on web-scraping, with three webinars and lots of detailed coding examples.
* <a href="https://automatetheboringstuff.com/" target=_blank>**Automate the Boring Stuff with Python**</a> - a free ebook covering lots of interesting, practical uses of Python. Chapter 12 covers web-scraping.

