# New Forms of Data

## Web-scraping for Social Science Research: Websites as a Source of Data

* Date: 2020-04-23
* Time: 3pm (GMT)
* Facilitator: GM
* Presenter: DM

### List of non-Powerpoint resources covered during the webinar

These are the web pages or other resources that I need to switch to and from Powerpoint during the webinar:
* [Worldometer](https://www.worldometers.info/coronavirus/)
* [Web-scraping notebook](https://on-machine)
* [UKDS Github](https://github.com/UKDataServiceOpen/new-forms-of-data)

### Slide 1

[*No notes*]

### Slide 2

Good afternoon, welcome to our webinar on computational methods for collecting data from the web. This is part of a larger UK Data Service training programme for social scientists wishing to learn about new forms of data e.g. data from the web, social media sites, text data.


Firstly, I hope everyone is well under the current circumstances and thank you to those joining us live during this challenging period.

Before we begin with the webinar, let's check the setup is working for you.

### Slide 3

There should be a poll where you can respond yes or no as to whether you can hear me; please take a couple of seconds to complete this.

[*If 100% yes, move to Slide 5; else move to Slide 4*]

### Slide 4

If you cannot hear me then please work through these steps for a few moments; if your issue is unresolved then please write a message using the *question* box and the facilitator will help you solve the issue.

Finally, if you have any questions at any point during the webinar then please log them using the "Question" box and I will go through them one-by-one at the end.

### Slide 5

This webinar is part of a work package at UK Data Service focusing on new forms of data for social science research. Most of our activities will be producing webinars on topics such as web-scraping, social networks, text mining, agent-based modelling, biomarker data and more. We will also be running live coding demonstrations for social scientists wanting to learn the Python programming language. We will also provide sample programming code that you can run, edit and adapt for your own purposes, and I'll demonstrate one of these scripts today.

[*Post UKDS training and events link in Question box*]

### Slide 6

Let's begin. The aims of today's webinar are as follows:
1. Outline the logic of web-scraping as a social science research method.
2. Demonstrate the practice of web-scraping through a live demonstration to capture statistics about the Covid-19 public health crisis.
3. Reflect on the value, limitations and ethical implications of web-scraping for social science purposes.

We covered some of this material - in particular, the logic and value of web-scraping - in a previous webinar, so today we will focus more on demonstrating the Python code that underpins web-scraping techniques.

### Slide 7

What is web-scraping? It is a computational technique for capturing information stored on a web page. "Computational" is the key word, as it is possible to perform this task manually, though that carries considerable disadvantages in terms of accuracy and labour resource.

It is generally implemented using a programming script (that is, executable code written in some programming language), though there are software applications that you can use: a previous webinar by our colleague Peter Smyth on 16 April demonstrated how to use MS Excel to collect data from a website.

It is relatively simple to implement using open-source programming languages e.g., Python, R. You do not need to be highly computationally literate, nor write screeds of code: this is a popular and mature computational method, with tons of documentation and examples for you to learn from.

### Slide 8

Web pages can be an important source of publicly available information on social phenomena of interest. The Covid-19 public health crisis is, unfortunately, an excellent case in point: see, for example, the UK Government and World Health Organisation websites, and many others.

The information may only be available on a web page, and not shared as downloadable data files.

### Slide 9

By parse, we mean we explicitly tell our programming language that we are working with a web page and thus we need certain methods or functions. Web pages are written in a language called HTML, which uses tags to delineate the elements that make up a web page. For example, there are tags to identify tables, paragraphs, links, images etc.

For any programming task, it is useful to write out the steps needed to solve the problem: we call this *pseudo-code*, as it captures the main tasks and the order in which they need to be executed. Of course, we cannot just supply these instructions as is to Python; we have to be more specific and adhere to the rules of the language.

For our first example, let's convert the steps above into executable Python code for capturing data about Covid-19 cases.

### External 1 

[*Switch to Worldometer website*]

Let's say we're interested in capturing up-to-date information about Covid-19. We've inspected a few websites and this one looks to be a valid and reliable source of information [*show the Covid-19 about page*].

Let's keep our task simple for now and restrict our interest to the headline statistics: cases, deaths and recoveries. We can see that the statistics are near the top of the web page; however, we need more information than this in order to extract the figures. What we need are the tags that identify the section of the web page where the statistics are stored. We can discover the tags by examining the *source code* (HTML) of the web page. This can be done using your web browser: for example, if you use Firefox you can right-click on the web page and select *View Page Source* from the list of options.

A better way is to right-click on the element of interest and select *Inspect Element*, which should take you to the tag(s) you need [*demonstrate this*].

### External 2

[*Switch to Jupyter Notebook presentation*]
[*Make sure to clear all output before entering Slideshow mode*]

Now that we possess the two key bits of information we need - the web address and the tags containing the statistics - we can begin scraping the data. We're going to use Python to do so, and we're going to write our code in an application called a Jupyter notebook. Jupyter notebooks allow you to mix code, narrative and output in a single, sharable electronic document [*scroll through notebook*].

If you have a second screen, then you will be able to work through this notebook and follow what I'm doing at the same time. I've posted the link in the *Questions* box [*post link to Binder notebook*]. However the notebook has been designed as a self-directed learning activity, so for now you can just observe what I'm doing and access the notebook later.

[*Begin slideshow*]

**Importing modules**

Modules are additional techniques or functions that are not present when you launch Python (remember: we are using Python through this notebook); some do not even come with Python when you download it, they must be installed on your machine separately - think of using `ssc install <package>` in Stata, or `install.packages(<package>)` in R. For now just understand that many useful modules need to be imported every time you start a new Python session.
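As a minimal sketch, an import cell might look like the one below. The exact set of modules is an assumption rather than a copy of the notebook: `requests` and `bs4` (BeautifulSoup) are third-party packages that usually need installing first (for example with `pip install requests beautifulsoup4`), while `csv` and `datetime` come with Python.

```python
# Modules that ship with Python (the standard library)
import csv
from datetime import datetime

# Third-party modules: these must be installed before they can be imported
import requests
from bs4 import BeautifulSoup as soup  # "soup" is just a shorter alias

# Once imported, a module's functions are reached using dot notation
print(datetime.now().strftime("%Y-%m-%d"))
```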

**Requesting the web page**

First, we declare a variable (also known as an 'object') called `url` that contains the web address of the web page we want to request. Next, we use the `get()` method of the `requests` module to request the web page, and in the same line of code, we store the results of the request in a variable called `response`. Finally, we check whether the request was successful by calling on the `status_code` attribute of the `response` variable.
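As a hedged sketch, the three steps just described might look as follows. The function name `request_page` and the `timeout` value are illustrative choices and not taken from the notebook; the request is wrapped in a function only so that nothing is sent over the network until you call it.

```python
import requests

# The web address of the page we want to request
url = "https://www.worldometers.info/coronavirus/"


def request_page(address):
    """Request a web page and return the response object."""
    # timeout=30 is an assumption: give up if the server takes longer than 30 seconds
    response = requests.get(address, timeout=30)
    return response


# To make the live request (needs an internet connection):
# response = request_page(url)
# response.status_code   # 200 means the request was successful
```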

Confused? Don't worry, the conventions of Python and using its modules take a bit of getting used to. At this point, just understand that you can store the results of commands in variables, and a variable can have different attributes that can be accessed when needed. Also note that you have a lot of freedom in how you name your variables (subject to certain restrictions - see <a href="https://www.python.org/dev/peps/pep-0008/" target=_blank>here for some guidance</a>).

For example, the following would also work: [*Run code*]

You may be wondering exactly what it is we requested: if you were to type the URL (https://www.worldometers.info/coronavirus/) into your browser and hit `enter`, the web page should appear on your screen. This is not the case when we request the URL through Python but rest assured, we have successfully requested the web page. To see the content of our request, we can call the `text` attribute of the `response` variable: [*Run code*]

This shows us the underlying code (HTML) of the web page we requested. It should be obvious that in its current form, it will be difficult to work with. This is where the `BeautifulSoup` module comes in handy.

**Parsing the web page**

The mass of text that is produced should look familiar: it is the full version of the source code we examined earlier. Note again how we call on a method (`soup()`) from a module (`BeautifulSoup`) and store the results in a variable (`soup_response`).
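To see parsing in isolation, here is a small sketch that uses a short, made-up HTML string in place of `response.text`, so it runs without an internet connection. It assumes BeautifulSoup is imported under the alias `soup`; `html.parser` is the parser built into Python.

```python
from bs4 import BeautifulSoup as soup

# Made-up stand-in for response.text (not real Worldometer content)
html = "<html><body><h1>Example page</h1><p>First paragraph.</p></body></html>"

# Tell Python to treat the text as HTML, so we can navigate it by tag
soup_response = soup(html, "html.parser")

print(soup_response.prettify())  # the same HTML, indented for readability
print(soup_response.h1.text)     # the text inside the first <h1> tag
```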

How do we navigate such voluminous results? Thankfully the BeautifulSoup module provides some intuitive methods for doing so.

We used the `find_all()` method to search for all `<div>` tags whose `id` is `maincounter-wrap`. And because there is more than one set of tags matching this id, we get a list of results. We can check how many tags match this id by calling on the `len()` function:
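The sketch below illustrates the search on a made-up fragment of HTML that imitates the structure described above. The figures are invented and the real page contains far more markup, so treat this as an approximation of what the notebook does.

```python
from bs4 import BeautifulSoup as soup

# Made-up HTML imitating the headline statistics section; the figures are invented
html = """
<div id="maincounter-wrap"><h1>Coronavirus Cases:</h1>
  <div class="maincounter-number"><span>1,234,567 </span></div></div>
<div id="maincounter-wrap"><h1>Deaths:</h1>
  <div class="maincounter-number"><span>89,012</span></div></div>
<div id="maincounter-wrap"><h1>Recovered:</h1>
  <div class="maincounter-number"><span>345,678</span></div></div>
"""

soup_response = soup(html, "html.parser")

# Find every <div> tag whose id attribute is "maincounter-wrap"
sections = soup_response.find_all("div", id="maincounter-wrap")

print(len(sections))  # 3: one section each for cases, deaths and recoveries
```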

**Extracting information**

The above code performs a couple of operations:

* For each item (i.e., set of `<div>` tags) in the list, it finds the `<span>` tags and extracts the text enclosed within them.
* We clean the text by removing blank spaces and commas.

In this example, referring to an item's positional index works because our list of `<div>` tags stored in the `sections` variable is ordered: the tag containing the number of cases appears before the tag containing the number of deaths, which appears before the tag containing the number of recovered patients.

In Python, indexing begins at zero (in R indexing begins at 1). Therefore, the first item in the list is accessed using `sections[0]`, the second using `sections[1]` etc.
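A minimal sketch of the extraction and cleaning steps, again run on made-up HTML with invented figures. The helper name `clean` is an illustrative choice and may not match the notebook.

```python
from bs4 import BeautifulSoup as soup

# Made-up HTML with invented figures, ordered as cases, deaths, recovered
html = """
<div id="maincounter-wrap"><h1>Coronavirus Cases:</h1><span>1,234,567 </span></div>
<div id="maincounter-wrap"><h1>Deaths:</h1><span>89,012</span></div>
<div id="maincounter-wrap"><h1>Recovered:</h1><span>345,678</span></div>
"""

sections = soup(html, "html.parser").find_all("div", id="maincounter-wrap")


def clean(text):
    """Remove blank spaces and commas from a piece of text."""
    return text.replace(" ", "").replace(",", "")


# Indexing starts at zero, so sections[0] is the first item in the list
cases = clean(sections[0].find("span").text)
deaths = clean(sections[1].find("span").text)
recovered = clean(sections[2].find("span").text)

print(cases, deaths, recovered)  # 1234567 89012 345678
```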

**Saving results from the scrape**

The code above defines some headers and a name and location for the file which will store the results of the scrape. We then open the file in *write* mode, and write the headers to the first row, and the statistics to subsequent rows.
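The saving step might be sketched as follows, using only the `csv` module that comes with Python. The file name, header labels and figures are all assumptions made for illustration; the notebook may organise its output differently.

```python
import csv

# Headers for the first row, and the (invented) statistics for the second
variables = ["Cases", "Deaths", "Recovered"]
values = ["1234567", "89012", "345678"]

# Hypothetical file name; the file is saved in the current working directory
outfile = "covid19-headline-stats.csv"

# Open the file in write mode ("w"), then write one row at a time
with open(outfile, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(variables)
    writer.writerow(values)
```

The resulting file can be opened in Excel, Stata, R or any other package that reads CSV files.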

**Country-level statistics**

Let's extend what we've learned to capture more information relating to Covid-19 for individual countries. [*View table in IFrame*]

First, we load in modules (libraries), make the request, parse the result and find the rows in the table of interest.

Then we need to loop through each row in the table and extract the content etc...
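As a rough sketch of that loop, the example below works on a tiny made-up table. The table `id`, the column headings and every value are placeholders: you would need to inspect the live page to find the real tag and id for the country table.

```python
from bs4 import BeautifulSoup as soup

# Made-up table; the id, country names and figures are all placeholders
html = """
<table id="country-table">
  <thead><tr><th>Country</th><th>Total Cases</th><th>Total Deaths</th></tr></thead>
  <tbody>
    <tr><td>Country A</td><td>1,000</td><td>50</td></tr>
    <tr><td>Country B</td><td>2,500</td><td>75</td></tr>
  </tbody>
</table>
"""

soup_response = soup(html, "html.parser")

# Find the table of interest, then every row (<tr>) in its body
table = soup_response.find("table", id="country-table")
rows = table.find("tbody").find_all("tr")

country_data = []
for row in rows:
    # Extract the text from each cell (<td>), removing whitespace and commas
    cells = [cell.text.strip().replace(",", "") for cell in row.find_all("td")]
    country_data.append(cells)

print(country_data)  # [['Country A', '1000', '50'], ['Country B', '2500', '75']]
```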

**Concluding remarks on Covid-19**

The Covid-19 pandemic is a seismic public health crisis that will dominate our lives for the foreseeable future. The example code above is not a craven attempt to provide some topicality to these materials, nor is it simply a particularly good example for learning web-scraping techniques. There are real opportunities for social scientists to capture and analyse data on this phenomenon, starting with the core figures provided through the Worldometer website.

### Slide 10

Computational methods for collecting data from the web are an increasingly important component of a social scientist's toolkit. They enable individuals to collect and reshape data - qualitative and quantitative - that otherwise would be inaccessible for research purposes.

Value:
* As a result the learning curve is not as steep as with other methods, and it is possible for a beginner to create and execute a functioning web-scraping script in a matter of hours.
* Schedule the script to run every month/quarter etc.
* Many public, private and charitable institutions use their web sites to release and regularly update information of value to social scientists. Getting a handle on the volume, variety and velocity of this information is extremely challenging without the use of computational methods.
* While Python and HTML might be unfamiliar, the data that is returned through web-scraping can be formatted in such a way as to be compatible with your usual analytical methods (e.g., regression modelling, content analysis) and software applications (e.g., Stata, NVivo). In fact, we would go as far as to say that computational methods are particularly valuable to social scientists from a data collection and processing perspective, and you can achieve much without ever engaging in "big data analytics" (e.g., machine learning, neural networks, natural language processing).

Limitations:
* For example, the <a href="https://www.worldometers.info/licensing/faq/" target=_blank>Worldometer Covid-19 data</a> that we scrape in this notebook cannot be used without their permission, even though the <a href="https://www.worldometers.info/disclaimer/" target=_blank>ToS</a> do not expressly prohibit web-scraping. So even in instances where scraping data is not prohibited, you may not be able to use it without seeking permission from the data owner. The safest approach is to seek permission in advance of conducting a web-scrape, especially if you intend to build a working relationship with the data owner - do not rely on the argument that the information is publicly available to begin with.
* In the UK there is no specific law prohibiting web-scraping or the use of data obtained via this method; however there are other laws which impinge on this activity.
* [*Nothing more to say about updates*]
* For example, they may "blacklist" (ban) your IP address - your computer's unique id on the internet - from making requests to its server.
* For example, you may not possess the administrative rights for your machine, preventing you from scheduling your script to run on a regular basis (i.e., your computer automatically goes to sleep after a set period of time). There are ways around this and you do not need a high performance computing setup, but it is worth keeping in mind nonetheless.

### Slide 11

Thank you for joining us today. Let's work through some of the questions you have posed.

[*If questions: read them out and answer, also copy-and-paste them somewhere*]

[*If no questions: skip to final slide*]

### Final slide

That concludes our webinar; I hope it has demonstrated the value of web-scraping as a social science research method. Please consider completing the evaluation form that will appear once this webinar concludes: feedback from the previous session has influenced today's webinar, and we're keen to hear about how we can improve our offering.

This video will shortly be hosted on our Youtube channel. The code underpinning this webinar is available through our Github repository (see link), which also contains a copy of the slides, a reading list on this topic, and more information about our New Forms of Data training series.

[*Post link to Github in Question box*]

[*Switch to UKDS Github*]

[*Switch back to final slide*]

Please don't hesitate to get in contact if you have questions, queries or feedback, particularly around your training needs in this area; we're also on Twitter and Facebook if you want to stay up-to-date with our activities. Stay safe and hope to engage with you again in the near future.