![UKDS Logo](./media/UKDS_Logos_Col_Grey_300dpi.png)

# Web-scraping for Social Science Research

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *New Forms of Data for Social Science Research*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites and social media platforms, text data, and simulations (agent-based modelling), to name a few. To help you get to grips with these new forms of data, we provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
April 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Guide-to-using-this-notebook" data-toc-modified-id="Guide-to-using-this-notebook-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Guide to using this notebook</a></span><ul class="toc-item"><li><span><a href="#Interaction" data-toc-modified-id="Interaction-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Interaction</a></span></li><li><span><a href="#Learn-more" data-toc-modified-id="Learn-more-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Learn more</a></span></li></ul></li><li><span><a href="#Collecting-data-from-web-pages-(web-scraping)" data-toc-modified-id="Collecting-data-from-web-pages-(web-scraping)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Collecting data from web pages (web-scraping)</a></span><ul class="toc-item"><li><span><a href="#Reasons-to-engage-in-web-scraping" data-toc-modified-id="Reasons-to-engage-in-web-scraping-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Reasons to engage in web-scraping</a></span></li><li><span><a href="#Logic-of-web-scraping" data-toc-modified-id="Logic-of-web-scraping-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Logic of web-scraping</a></span></li></ul></li><li><span><a href="#Example:-Capturing-Covid-19-data" data-toc-modified-id="Example:-Capturing-Covid-19-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Example: Capturing Covid-19 data</a></span><ul class="toc-item"><li><span><a href="#Identifying-URL-of-web-page" data-toc-modified-id="Identifying-URL-of-web-page-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Identifying URL of web page</a></span></li><li><span><a href="#Locating-information" data-toc-modified-id="Locating-information-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Locating information</a></span></li><li><span><a 
href="#Requesting-the-web-page" data-toc-modified-id="Requesting-the-web-page-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Requesting the web page</a></span></li><li><span><a href="#Parsing-the-web-page" data-toc-modified-id="Parsing-the-web-page-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Parsing the web page</a></span></li><li><span><a href="#Extracting-information" data-toc-modified-id="Extracting-information-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Extracting information</a></span></li><li><span><a href="#Saving-results-from-the-scrape" data-toc-modified-id="Saving-results-from-the-scrape-4.6"><span class="toc-item-num">4.6&nbsp;&nbsp;</span>Saving results from the scrape</a></span></li><li><span><a href="#Country-level-Covid-19-data" data-toc-modified-id="Country-level-Covid-19-data-4.7"><span class="toc-item-num">4.7&nbsp;&nbsp;</span>Country-level Covid-19 data</a></span></li><li><span><a href="#Concluding-remarks-on-Covid-19-data" data-toc-modified-id="Concluding-remarks-on-Covid-19-data-4.8"><span class="toc-item-num">4.8&nbsp;&nbsp;</span>Concluding remarks on Covid-19 data</a></span></li><li><span><a href="#A-social-research-example:-charity-data" data-toc-modified-id="A-social-research-example:-charity-data-4.9"><span class="toc-item-num">4.9&nbsp;&nbsp;</span>A social research example: charity data</a></span></li></ul></li><li><span><a href="#Value,-limitations-and-ethics" data-toc-modified-id="Value,-limitations-and-ethics-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Value, limitations and ethics</a></span><ul class="toc-item"><li><span><a href="#Value" data-toc-modified-id="Value-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Value</a></span></li><li><span><a href="#Limitations" data-toc-modified-id="Limitations-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Limitations</a></span></li><li><span><a href="#Ethical-considerations" data-toc-modified-id="Ethical-considerations-5.3"><span 
class="toc-item-num">5.3&nbsp;&nbsp;</span>Ethical considerations</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Bibliography" data-toc-modified-id="Bibliography-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Bibliography</a></span></li><li><span><a href="#Further-reading-and-resources" data-toc-modified-id="Further-reading-and-resources-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Further reading and resources</a></span></li><li><span><a href="#Appendices" data-toc-modified-id="Appendices-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Appendices</a></span><ul class="toc-item"><li><span><a href="#Appendix-A---Requesting-URLs" data-toc-modified-id="Appendix-A---Requesting-URLs-9.1"><span class="toc-item-num">9.1&nbsp;&nbsp;</span>Appendix A - Requesting URLs</a></span></li><li><span><a href="#Appendix-B---Covid-19-annotated-script" data-toc-modified-id="Appendix-B---Covid-19-annotated-script-9.2"><span class="toc-item-num">9.2&nbsp;&nbsp;</span>Appendix B - Covid-19 annotated script</a></span></li><li><span><a href="#Appendix-C:-Capturing-charity-data" data-toc-modified-id="Appendix-C:-Capturing-charity-data-9.3"><span class="toc-item-num">9.3&nbsp;&nbsp;</span>Appendix C: Capturing charity data</a></span></li></ul></li></ul></div>

## Introduction

In this training series we cover some of the essential skills needed to collect data from the web. In particular we focus on two different approaches:
1. Collecting data stored on web pages. [Focus of this notebook]
2. Downloading data from online databases using Application Programming Interfaces (APIs).
    
Do not be alarmed by the technical aspects: both approaches can be implemented using simple code, a standard desktop or laptop, and a decent internet connection.    

Given the Covid-19 public health crisis in which this programme of work occurred, we will examine ways in which web-scraping techniques can provide valuable data for studying this phenomenon. This is a fast moving, evolving public health emergency that, in addition to other impacts, will shape research agendas across the sciences for years to come. Therefore it is important to learn how we, as social scientists, can access or generate data that will provide a better understanding of this disease.

We will also indulge the research interests of one of the authors of these materials (Diarmuid) by drawing on examples relating to the UK charity sector. Rest assured that this field offers an excellent example of what is possible using web-based data collection techniques.

## Guide to using this notebook

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> put it:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (Section 3). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

If you want to turn this resource from a static web page into a live Python programming environment, simply click on the following button (usually at the top of a page):
<img src="https://camo.githubusercontent.com/483bae47a175c24dfbfc57390edd8b6982ac5fb3/68747470733a2f2f6d7962696e6465722e6f72672f62616467655f6c6f676f2e737667" alt="Binder" data-canonical-src="https://mybinder.org/badge_logo.svg" style="max-width:100%;">

After a short loading period, you will be able to execute/run the code yourself and see the results in real time.

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself: click the *Launch Binder* button and then execute the cell below.

In [2]:
print("Enter your name:")
name = input()
print("Hello {}, enjoy learning more about Python and web-scraping!".format(name)) 

Enter your name:
Diarmuid
Hello Diarmuid, enjoy learning more about Python and web-scraping!


### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## Collecting data from web pages (web-scraping)


### Reasons to engage in web-scraping

Websites can be an important source of publicly available information on phenomena of interest - for instance, they are used to store and disseminate files, text, photos, videos, tables etc. However, the data stored on websites are typically not structured or formatted for ease of use by researchers: for example, it may not be possible to perform a bulk download of all the files you need (think of needing the annual accounts of all registered companies in London for your research...), or the information may not even be held in a file and instead spread across paragraphs and tables throughout a web page (or worse, web pages). Luckily, web-scraping provides a means of quickly and accurately capturing and formatting data stored on web pages.

Before we delve into writing code to capture data from the web, let's clearly state the logic underpinning the technique.

### Logic of web-scraping

We begin by identifying a web page containing information we are interested in collecting. Then we need to **know** the following:
1. The location (i.e., web address) where the web page can be accessed. For example, the UK Data Service homepage can be accessed via <a href="https://ukdataservice.ac.uk/" target=_blank>https://ukdataservice.ac.uk/</a>.
2. The location of the information we are interested in within the structure of the web page. This involves visually inspecting a web page's underlying code using a web browser.

And **do** the following:
3. Request the web page using its web address.
4. Parse the structure of the web page so your programming language can work with its contents.
5. Extract the information we are interested in.
6. Write this information to a file for future use.

For any programming task, it is useful to write out the steps needed to solve the problem: we call this *pseudo-code*, as it captures the main tasks and the order in which they need to be executed.
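The steps above can be sketched as a skeleton in Python before any real scraping is done. This is only an illustrative outline: the URL, the function names and the `<p id="stat">` element are all hypothetical, and the request step is replaced by a hard-coded page so that the sketch runs without an internet connection.

```python
import csv
import io

from bs4 import BeautifulSoup

# Steps 1-2 (know): where the page is and where the information sits in it.
# Both values are hypothetical stand-ins for this sketch.
URL = "https://example.org/statistics"
TAG, TAG_ID = "p", "stat"


def request_page(url):
    """Step 3: request the page (stubbed with a fixed string, so no network is needed)."""
    return '<html><body><p id="stat">1,234</p></body></html>'


def parse_page(html):
    """Step 4: parse the HTML so Python can navigate it."""
    return BeautifulSoup(html, "html.parser")


def extract_value(page):
    """Step 5: extract the text we are interested in and tidy it."""
    return page.find(TAG, id=TAG_ID).text.replace(",", "").strip()


def write_value(value, handle):
    """Step 6: write the value to a CSV file (or any file-like object)."""
    writer = csv.writer(handle)
    writer.writerow(["Statistic"])
    writer.writerow([value])


buffer = io.StringIO()  # in-memory stand-in for a real file
write_value(extract_value(parse_page(request_page(URL))), buffer)
print(buffer.getvalue())
```

Splitting the task into one small function per step is a design choice rather than a requirement; it simply makes it easier to test or replace a single step later.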

For our first example, let's convert the steps above into executable Python code for capturing data about Covid-19.

## Example: Capturing Covid-19 data

Worldometer is a website that provides up-to-date statistics about: the global population; food, water and energy consumption; environmental degradation etc (known as its Real Time Statistics Project). In its own words:<sup>[1]</sup>
> Worldometer is run by an international team of developers, researchers, and volunteers with the goal of making world statistics available in a thought-provoking and time relevant format to a wide audience around the world. Worldometer is owned by Dadax, an independent company. We have no political, governmental, or corporate affiliation.

Since the outbreak of Covid-19 it has provided regular daily snapshots on the progress of this disease, both globally and at a country level.

[1]: https://www.worldometers.info/about/

### Identifying URL of web page

The website can be accessed here: <a href="https://www.worldometers.info/coronavirus/" target=_blank>https://www.worldometers.info/coronavirus/</a>
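A web address is made up of several parts, and it can help to see them separately. As a small illustration (not a step in the scrape itself), Python's built-in `urllib.parse` module can split the Worldometer address into its components:

```python
from urllib.parse import urlparse

url = "https://www.worldometers.info/coronavirus/"
parts = urlparse(url)

print(parts.scheme)  # protocol used to request the page
print(parts.netloc)  # the server (domain) hosting the website
print(parts.path)    # the location of the page on that server
```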

Let's work through the steps necessary to collect data about the number of Covid-19 cases, deaths and recoveries globally.

First, let's use Python to view this website in our notebook. We can do this by using the `IFrame` function to display external content in the notebook.

In [14]:
from IPython.display import IFrame

IFrame("https://www.worldometers.info/coronavirus/", width="600", height="700")

### Locating information

The statistics we need are near the top of the page under the following headings:
* Coronavirus Cases:
* Deaths:
* Recovered:

However, we need more information than this in order to scrape the statistics. Websites are written in a language called HyperText Markup Language (HTML), which can be understood as follows:<sup>[2]</sup>
* HTML describes the structure of a web page
* HTML consists of a series of elements
* HTML elements tell the browser how to display the content
* HTML elements are represented by tags
* HTML tags label pieces of content such as "heading", "paragraph", "table", and so on
* Browsers do not display the HTML tags, but use them to render the content of the page 

#### Visually inspecting the underlying HTML code

Therefore, what we need are the tags that identify the section of the web page where the statistics are stored. We can discover the tags by examining the *source code* (HTML) of the web page. This can be done using your web browser: for example, if you use Firefox you can right-click on the web page and select *View Page Source* from the list of options.

**TASK**: Try this yourself with the Worldometer web page that was produced using the `IFrame` command above.

The snippet below shows sample source code for the section of the Covid-19 web page we are interested in.

<a id="source_code_example"></a>

[2]: https://www.w3schools.com/html/html_intro.asp

In the above example, we can see multiple tags enclosing elements of text. For instance, we can see that the Covid-19 statistics are enclosed in `<span></span>` tags, which themselves are located within `<div></div>` tags.
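To make the idea of tags sitting inside other tags concrete, here is a minimal sketch that prints an indented outline of the opening tags in a piece of HTML. The snippet is the first section of the parsed Worldometer page shown in this notebook; the `TagPrinter` class is a hypothetical helper built on Python's standard `html.parser` module, not a step in the scrape.

```python
from html.parser import HTMLParser

# One section of the Worldometer page, as shown in the parsed output in this notebook
snippet = """
<div id="maincounter-wrap" style="margin-top:15px">
<h1>Coronavirus Cases:</h1>
<div class="maincounter-number">
<span style="color:#aaa">2,426,789 </span>
</div>
</div>
"""


class TagPrinter(HTMLParser):
    """Record each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + "<" + tag + ">")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


printer = TagPrinter()
printer.feed(snippet)
print("\n".join(printer.outline))
```

The indentation of the printed outline shows that the `<span>` holding the statistic is two levels inside the outer `<div>`.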

Navigating and locating the contents of a web page remains a manual and visual process, and in Brooker's estimation (2020, 252):
> Hence, more so than the actual Python, it's the detective work of unpicking the internal structure of a webpage that is probably the most vital skill here.

### Requesting the web page

Now that we possess the necessary information, let's begin the process of scraping the web page. There is a preliminary step, which is setting up Python with the modules it needs to perform the web-scrape.

In [2]:
# Import modules

import os # module for navigating your machine (e.g., file directories)
import requests # module for requesting urls
import csv # module for handling csv files
import pandas as pd # module for handling data frames
from datetime import datetime # module for working with dates and time
from bs4 import BeautifulSoup as soup # module for parsing web pages

print("Successfully imported necessary modules")

Successfully imported necessary modules


Modules are additional techniques or functions that are not present when you launch Python (remember: we are using Python through this notebook); some do not even come with Python when you download it, they must be installed on your machine separately - think of using `ssc install <package>` in Stata, or `install.packages(<package>)` in R. For now just understand that many useful modules need to be imported every time you start a new Python session.
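If you are unsure whether a module is installed on your machine, one option is to ask Python to look for it before you try to import it. The following is a small sketch using the standard `importlib` module; the list of names simply mirrors the imports used in this notebook (`bs4` is the import name of the `BeautifulSoup` package). A missing module is typically installed from the command line with `pip install <package>`.

```python
import importlib.util

# Modules this notebook relies on; "bs4" is the import name of BeautifulSoup
needed = ["os", "csv", "datetime", "requests", "pandas", "bs4"]

for name in needed:
    # find_spec returns None when a top-level module cannot be found
    installed = importlib.util.find_spec(name) is not None
    print(name, "- available" if installed else "- missing: install it before importing")
```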

Now, let's implement the process of scraping the page. First, we need to request the web page using Python; this is analogous to opening a web browser and entering the web address manually. We refer to a page's location on the internet as its web address or Uniform Resource Locator (URL).

In [5]:
# Define the URL where the webpage can be accessed

url = "https://www.worldometers.info/coronavirus/"

# Request the webpage from the URL

response = requests.get(url, allow_redirects=True) # request the url
response.status_code # check if page was requested successfully

First, we declare a variable (also known as an 'object') called `url` that contains the web address of the web page we want to request. Next, we use the `get()` method of the `requests` module to request the web page, and in the same line of code, we store the results of the request in a variable called `response`. Finally, we check whether the request was successful by calling on the `status_code` attribute of the `response` variable.

Confused? Don't worry, the conventions of Python and using its modules take a bit of getting used to. At this point, just understand that you can store the results of commands in variables, and a variable can have different attributes that can be accessed when needed. Also note that you have a lot of freedom in how you name your variables (subject to certain restrictions - see <a href="https://www.python.org/dev/peps/pep-0008/" target=_blank>here for some guidance</a>).

For example, the following would also work:

In [6]:
web_address = "https://www.worldometers.info/coronavirus/"

scrape_result = requests.get(web_address, allow_redirects=True)
scrape_result.status_code

200

Back to the request:

Good, we get a status code of _200_ - this means we successfully requested the web page. <a href="https://www.textbook.ds100.org/ch/07/web_http.html" target=_blank>Lau, Gonzalez and Nolan</a> provide a succinct description of different types of response status codes:

* **100s** - Informational: More input is expected from client or server (e.g. 100 Continue, 102 Processing)
* **200s** - Success: The client's request was successful (e.g. 200 OK, 202 Accepted)
* **300s** - Redirection: Requested URL is located elsewhere; May need user's further action (e.g. 300 Multiple Choices, 301 Moved Permanently)
* **400s** - Client Error: Client-side error (e.g. 400 Bad Request, 403 Forbidden, 404 Not Found)
* **500s** - Server Error: Server-side error or server is incapable of performing the request (e.g. 500 Internal Server Error, 503 Service Unavailable)

For clarity:
* **Client**: your machine
* **Server**: the machine you are requesting the web page from
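The grouping of status codes described above can be reproduced with Python's built-in `http` module, which knows the standard phrase attached to each code. In this sketch `describe_status` is a hypothetical helper written for illustration; it is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the standard phrase and the class of an HTTP status code."""
    classes = {1: "Informational", 2: "Success", 3: "Redirection",
               4: "Client Error", 5: "Server Error"}
    phrase = HTTPStatus(code).phrase  # e.g. "OK" for 200
    return "{} {} ({})".format(code, phrase, classes[code // 100])


for code in [200, 301, 404, 503]:
    print(describe_status(code))
```

The first digit of the code is enough to tell you which of the five classes a response belongs to, which is why the helper uses integer division by 100.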

You may be wondering exactly what it is we requested: if you were to type the URL (https://www.worldometers.info/coronavirus/) into your browser and hit `enter`, the web page should appear on your screen. This is not the case when we request the URL through Python but rest assured, we have successfully requested the web page. To see the content of our request, we can call the `text` attribute of the `response` variable:

In [7]:
response.text

'\n<!DOCTYPE html>\n<!--[if IE 8]> <html lang="en" class="ie8"> <![endif]-->\n<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->\n<!--[if !IE]><!-->\n<html lang="en">\n<!--<![endif]-->\n<head>\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<title>Coronavirus Update (Live): 2,426,789 Cases and 166,122 Deaths from COVID-19 Virus Pandemic - Worldometer</title>\n<meta name="description" content="Live statistics and coronavirus news tracking the number of confirmed cases, recovered patients, tests, and death toll due to the COVID-19 coronavirus from Wuhan, China. Coronavirus counter with new cases, deaths, and number of tests per 1 Million population. Historical data and info. Daily charts, graphs, news and updates">\n\n<link rel="shortcut icon" href="/favicon/favicon.ico" type="image/x-icon">\n<link rel="apple-touch-icon" sizes="57x57" href="/favicon/apple-icon-57x57.png">\n<link re

This shows us the underlying code (HTML) of the web page we requested. It should be obvious that in its current form, it will be difficult to work with. This is where the `BeautifulSoup` module comes in handy.

(See Appendix A for more examples of how `requests` works and what information it returns.)

### Parsing the web page

Now it's time to identify and understand the structure of the web page we requested. We do this by converting the content contained in the `response.text` attribute into a `BeautifulSoup` variable. `BeautifulSoup` is a Python module that provides a systematic way of navigating the elements of a web page and extracting its contents. Let's see how it works in practice:

In [9]:
# Extract the contents of the webpage from the response

soup_response = soup(response.text, "html.parser") # Parse the text as a Beautiful Soup object
soup_response # print the contents of the web page; this is analogous to looking at the source code


<!DOCTYPE html>

<!--[if IE 8]> <html lang="en" class="ie8"> <![endif]-->
<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->
<!--[if !IE]><!-->
<html lang="en">
<!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Coronavirus Update (Live): 2,426,789 Cases and 166,122 Deaths from COVID-19 Virus Pandemic - Worldometer</title>
<meta content="Live statistics and coronavirus news tracking the number of confirmed cases, recovered patients, tests, and death toll due to the COVID-19 coronavirus from Wuhan, China. Coronavirus counter with new cases, deaths, and number of tests per 1 Million population. Historical data and info. Daily charts, graphs, news and updates" name="description"/>
<link href="/favicon/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="/favicon/apple-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
<link href="/favico

The mass of text that is produced should look familiar: it is the full version of the source code [we examined earlier](#source_code_example). Note again how we call on something provided by a module (the `BeautifulSoup` class, which we imported under the shorter name `soup`) and store the results in a variable (`soup_response`).

How do we navigate such voluminous results? Thankfully the `BeautifulSoup` module provides some intuitive methods for doing so.

In [10]:
# Find the sections containing the data of interest

sections = soup_response.find_all("div", id="maincounter-wrap") # find the matching <div> tags and store them in an object called 'sections'
sections

[<div id="maincounter-wrap" style="margin-top:15px">
 <h1>Coronavirus Cases:</h1>
 <div class="maincounter-number">
 <span style="color:#aaa">2,426,789 </span>
 </div>
 </div>,
 <div id="maincounter-wrap" style="margin-top:15px">
 <h1>Deaths:</h1>
 <div class="maincounter-number">
 <span>166,122</span>
 </div>
 </div>,
 <div id="maincounter-wrap" style="margin-top:15px;">
 <h1>Recovered:</h1>
 <div class="maincounter-number" style="color:#8ACA2B ">
 <span>636,702</span>
 </div>
 </div>]

We used the `find_all()` method to search for all `<div>` tags where the id="maincounter-wrap". And because there is more than one set of tags matching this id, we get a list of results. We can check how many tags match this id by calling on the `len()` function:

In [15]:
len(sections)

3
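It may help to contrast `find_all()` with its sibling `find()`. The sketch below uses a simplified, hard-coded stand-in for the page (only two of the three sections, with the inner tags trimmed) so that it runs without requesting anything; the id `does-not-exist` is made up to show what happens when nothing matches.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page: two <div> tags sharing one id
html = """
<div id="maincounter-wrap"><h1>Deaths:</h1><span>166,122</span></div>
<div id="maincounter-wrap"><h1>Recovered:</h1><span>636,702</span></div>
"""
page = BeautifulSoup(html, "html.parser")

all_matches = page.find_all("div", id="maincounter-wrap")  # every match, as a list
first_match = page.find("div", id="maincounter-wrap")      # the first match only
no_match = page.find("div", id="does-not-exist")           # None when nothing matches

print(len(all_matches))
print(first_match.h1.text)
print(no_match)
```

Because `find()` returns `None` when there is no match, it is worth checking the result before asking for its text, otherwise Python raises an error.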

We can view each element in the list of results as follows:

In [17]:
for section in sections:
    print("--------")
    print(section)
    print("--------")
    print("\r") # print some blank space for better formatting

--------
<div id="maincounter-wrap" style="margin-top:15px">
<h1>Coronavirus Cases:</h1>
<div class="maincounter-number">
<span style="color:#aaa">2,426,789 </span>
</div>
</div>
--------

--------
<div id="maincounter-wrap" style="margin-top:15px">
<h1>Deaths:</h1>
<div class="maincounter-number">
<span>166,122</span>
</div>
</div>
--------

--------
<div id="maincounter-wrap" style="margin-top:15px;">
<h1>Recovered:</h1>
<div class="maincounter-number" style="color:#8ACA2B ">
<span>636,702</span>
</div>
</div>
--------



### Extracting information

We are nearing the end of our scrape. The penultimate task is to extract the statistics within the `<span>` tags and store them in some variables. We do this by accessing each item in the _sections_ list using its positional value (index).

In [25]:
cases = sections[0].find("span").text.replace(" ", "").replace(",", "")
deaths = sections[1].find("span").text.replace(",", "")
recov = sections[2].find("span").text.replace(",", "")
print("Number of cases: {}; deaths: {}; and recoveries: {}.".format(cases, deaths, recov))

Number of cases: 2414617; deaths: 165174; and recoveries: 629441.


The above code performs a couple of operations:
* For each item (i.e., set of `<div>` tags) in the list, it finds the `<span>` tags and extracts the text enclosed within them.
* We clean the text by removing blank spaces and commas.
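Note that the cleaned values are still strings of text. If you want to do arithmetic with them, they need converting to numbers. A minimal sketch, where `to_number` is a hypothetical helper and the inputs are the figures shown in the parsed page earlier:

```python
def to_number(text):
    """Convert a string such as '2,426,789 ' into the integer 2426789."""
    return int(text.replace(",", "").strip())


print(to_number("2,426,789 "))
print(to_number("166,122") + to_number("636,702"))
```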

In this example, referring to an item's positional index works because our list of `<div>` tags stored in the `sections` variable is ordered: the tag containing the number of cases appears before the tag containing the number of deaths, which appears before the tag containing the number of recovered patients.

In Python, indexing begins at zero (in R indexing begins at 1). Therefore, the first item in the list is accessed using `sections[0]`, the second using `sections[1]` etc.
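Relying on position works here, but it would silently give wrong results if the website changed the order of the sections. One possible alternative, sketched below, is to label each statistic with the heading found in its own section. The HTML is a condensed copy of the `find_all()` output shown earlier, hard-coded so the sketch runs offline; the `stats` dictionary is an illustrative choice, not part of the original script.

```python
from bs4 import BeautifulSoup

# HTML condensed from the output of find_all() shown earlier in this notebook
html = """
<div id="maincounter-wrap"><h1>Coronavirus Cases:</h1>
<div class="maincounter-number"><span style="color:#aaa">2,426,789 </span></div></div>
<div id="maincounter-wrap"><h1>Deaths:</h1>
<div class="maincounter-number"><span>166,122</span></div></div>
<div id="maincounter-wrap"><h1>Recovered:</h1>
<div class="maincounter-number"><span>636,702</span></div></div>
"""
sections = BeautifulSoup(html, "html.parser").find_all("div", id="maincounter-wrap")

stats = {}
for section in sections:
    label = section.find("h1").text.strip().rstrip(":")  # e.g. "Deaths"
    value = section.find("span").text.replace(",", "").strip()
    stats[label] = int(value)

print(stats)
```

Each value can then be looked up by name, for example `stats["Deaths"]`, regardless of where its section appears on the page.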

### Saving results from the scrape

The final task is to save the variables to a file that we can use in the future. We'll write to a Comma-Separated Values (CSV) file for this purpose, as it is an open, text-based file format that is very common for sharing data on the web.

In [20]:
# Create a downloads folder

try:
    os.mkdir("./downloads")
except FileExistsError:
    print("Unable to create folder: already exists")

Unable to create folder: already exists


The use of "./" tells the `os.mkdir()` command that the "downloads" folder should be created in the current working directory, which for a notebook is normally the directory where the notebook is located. So if this notebook was stored in a directory located at "C:/Users/joebloggs/notebooks", the `os.mkdir()` command would result in a new folder located at "C:/Users/joebloggs/notebooks/downloads".
   
Technically the "./" is not needed and you could just write `os.mkdir("downloads")` but it's good practice to be explicit.
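As an aside, the standard library offers a way to create a folder without needing `try`/`except` at all: `os.makedirs()` with `exist_ok=True` does nothing if the folder is already there. The sketch below works inside a temporary directory (an assumption made purely so that it leaves no files behind) and also prints where "./downloads" would resolve to on your machine.

```python
import os
import tempfile

# Work inside a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, "downloads")

    os.makedirs(folder, exist_ok=True)  # creates the folder
    os.makedirs(folder, exist_ok=True)  # second call is harmless: no error is raised

    created = os.path.isdir(folder)
    print(created)

# Show the full path that "./downloads" refers to from the current working directory
print(os.path.abspath("./downloads"))
```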

In [21]:
# Write the results to a CSV file

date = datetime.now().strftime("%Y-%m-%d") # get today's date in YYYY-MM-DD format
date # view today's date; useful for naming files

variables = ["Cases", "Deaths", "Recoveries"] # define variable names for the file
outfile = "./downloads/covid-19-statistics-" + date + ".csv" # define a file for writing the results
obs = cases, deaths, recov # define an observation (row)

with open(outfile, "w", newline="") as f: # with the file open in "write" mode, and giving it a shorter name (f)
    writer = csv.writer(f) # define a 'writer' object that allows us to export information to a CSV
    writer.writerow(variables) # write the variable names to the first row of the file
    writer.writerow(obs) # write the observation to the next row in the file


The code above defines some headers and a name and location for the file which will store the results of the scrape. We then open the file in *write* mode, and write the headers to the first row, and the statistics to subsequent rows.

How do we know this worked? The simplest way is to check whether a) the file was created, and b) the results were written to it.

In [None]:
# Check presence of file in "downloads" folder

os.listdir("./downloads")

In [None]:
# Open file and read (import) its contents

with open(outfile, "r") as f:
    data = f.read()
    
print(data)    

And voilà, we have successfully carried out a web-scrape!
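Reading the file back as plain text confirms that it exists, but you may also want the values in a form you can work with. One option is the `csv` module's `DictReader`, which returns each row as a dictionary keyed by the header names. The sketch below writes and then reads a small example file in a temporary folder; the file name is hypothetical and the figures are those printed in the extraction step above.

```python
import csv
import os
import tempfile

# Hypothetical file name; values taken from the extraction output above
folder = tempfile.mkdtemp()
path = os.path.join(folder, "covid-19-statistics-example.csv")

with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Cases", "Deaths", "Recoveries"])
    writer.writerow(["2414617", "165174", "629441"])

with open(path, "r", newline="") as f:
    rows = list(csv.DictReader(f))  # one dictionary per data row, keyed by header

print(len(rows))
print(rows[0]["Deaths"])
```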

### Country-level Covid-19 data

We will complete our work gathering data on the Covid-19 pandemic by employing the techniques we learned previously to capture country-level statistics. The code is presented in full without annotation or explanation as to what each block (i.e., chunk or section) is doing. There are some coding techniques you haven't seen before, and there are some tasks and questions for you to complete - enjoy!

**TASK**: Document the code below so that a person new to Python and web-scraping could understand what is happening in each block. Once you've written your notes, execute the code to see the results. (If you get stuck, see Appendix B for an annotated copy of the script.)

In [30]:
# Block 1

import os
import requests
import csv
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup as soup

date = datetime.now().strftime("%Y-%m-%d")


# Block 2

url = "https://www.worldometers.info/coronavirus/"
response = requests.get(url, allow_redirects=True)
print(response.status_code, " | ", response.headers)
#
# QUESTION: How do you know if the web page was requested successfully?
#


# Block 3

html_response = response.text
soup_response = soup(html_response, "html.parser")
# print(soup_response)


# Block 4

table = soup_response.find("table", id="main_table_countries_today").find("tbody")
#
# QUESTION: What is the second "find()" function looking for?
#
rows = table.find_all("tr", style="")


# Block 5

global_info = []
for row in rows:
    tds = row.find_all("td")
    country_info = [var.text.strip() for var in tds]
    global_info.append(country_info)

print("\r")
print("Number of rows in table: {}".format(len(global_info)))
print("\r")
#
# QUESTION: What would happen if we created the blank list within the loop?
# (Feel free to change the code to see what happens).
#


# Block 6

try:
    os.mkdir("./downloads")
except OSError as error:
    print(error)
    print("Folder already exists")

variables = ["Country", "Total Cases", "New Cases", "Total Deaths", 
            "New Deaths", "Total Recovered", "Active Cases", 
            "Serious_Critical", "Total Cases Per 1m Pop", "Deaths Per 1m Pop",
            "Total Tests", "Tests Per 1m Pop"]
outfile = "./downloads/covid-19-country-statistics-" + date + ".csv"

with open(outfile, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(variables)
    for country in global_info:
        writer.writerow(country)
#
# QUESTION: How can you check if the file was saved to the machine?
#


# Block 7

data = pd.read_csv(outfile, encoding = "ISO-8859-1", index_col=False)
data.sample(5)
#
# QUESTION: In the output below, what do you think the value "NaN" represents?
#

200  |  {'Date': 'Mon, 20 Apr 2020 08:50:09 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Set-Cookie': '__cfduid=d7814426b115d464b8931c40b396d27771587372609; expires=Wed, 20-May-20 08:50:09 GMT; path=/; domain=.worldometers.info; HttpOnly; SameSite=Lax; Secure', 'X-LiteSpeed-Cache': 'hit', 'Vary': 'Accept-Encoding', 'X-Turbo-Charged-By': 'LiteSpeed', 'CF-Cache-Status': 'DYNAMIC', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Server': 'cloudflare', 'CF-RAY': '586d9dbb2a92bc42-LHR', 'Content-Encoding': 'gzip', 'alt-svc': 'h3-27=":443"; ma=86400, h3-25=":443"; ma=86400, h3-24=":443"; ma=86400, h3-23=":443"; ma=86400', 'cf-request-id': '023860e8f40000bc42f6b12200000001'}

Number of rows in table: 210

[WinError 183] Cannot create a file when that file already exists: './downloads'
Folder already exists


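| | Country | Total Cases | New Cases | Total Deaths | New Deaths | Total Recovered | Active Cases | Serious_Critical | Total Cases Per 1m Pop | Deaths Per 1m Pop | Total Tests | Tests Per 1m Pop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 49 | South Africa | 3158 | NaN | 54 | NaN | 903 | 2201 | 36.0 | 53 | 0.9 | 114711 | 1934 |
| 101 | San Marino | 461 | NaN | 39 | NaN | 60 | 362 | 4.0 | 13586 | 1149.0 | 1711 | 50426 |
| 116 | Kenya | 270 | NaN | 14 | NaN | 67 | 189 | 2.0 | 5 | 0.3 | 13239 | 246 |
| 145 | Liechtenstein | 81 | NaN | 1 | NaN | 55 | 25 | NaN | 2124 | 26.0 | 900 | 23605 |
| 70 | Armenia | 1339 | 48.0 | 22 | 2.0 | 580 | 737 | 30.0 | 452 | 7.0 | 13373 | 4513 |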


In [31]:
data[data["Country"]=="China"]

| | Country | Total Cases | New Cases | Total Deaths | New Deaths | Total Recovered | Active Cases | Serious_Critical | Total Cases Per 1m Pop | Deaths Per 1m Pop | Total Tests | Tests Per 1m Pop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 209 | China | 82747 | 12 | 4632 | NaN | 77084 | 1031 | 81 | 57 | 3 | NaN | NaN |


In [32]:
# TASK: Search for a different country by amending the code below

#data[data["Country"]=="COUNTRY"]

| | Country | Total Cases | New Cases | Total Deaths | New Deaths | Total Recovered | Active Cases | Serious_Critical | Total Cases Per 1m Pop | Deaths Per 1m Pop | Total Tests | Tests Per 1m Pop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | Ireland | 15251 | NaN | 610 | NaN | 77 | 14564 | 294 | 3089 | 124 | 90646 | 18358 |


### Concluding remarks on Covid-19 data

The Covid-19 pandemic is a seismic public health crisis that will dominate our lives for the foreseeable future. The example code above is not a craven attempt to provide some topicality to these materials, nor is it simply a particularly good example for learning web-scraping techniques. There are real opportunities for social scientists to capture and analyse data on this phenomenon, starting with the core figures provided through the <a href="https://www.worldometers.info/coronavirus/" target=_blank>Worldometer website</a>.

You may also be interested in the publicly available data repository provided by Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE): <a href="https://github.com/CSSEGISandData/COVID-19" target=_blank>https://github.com/CSSEGISandData/COVID-19</a>. Updated daily, this resource provides CSV (Comma Separated Values) files of global Covid-19 statistics (e.g., country-level time series), as well as PDF copies of the World Health Organisation's situation reports.

At a UK level, the NHS releases data about Covid-19 symptoms reported through its NHS Pathways and 111 online platforms: <a href="https://digital.nhs.uk/data-and-information/publications/statistical/mi-potential-covid-19-symptoms-reported-through-nhs-pathways-and-111-online/latest" target=_blank>NHS Open Data</a>. Data on reported cases is also provided by Public Health England (PHE): <a href="https://www.gov.uk/government/publications/covid-19-track-coronavirus-cases" target=_blank>COVID-19: track coronavirus cases</a>. Many of these datasets are openly available as CSV files - you can learn how to download files in the [charity data example](#section_9_3).

Finally, the Office for National Statistics (ONS) provides data and experimental indicators of social life in the UK under Covid-19: <a href="https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/conditionsanddiseases" target=_blank>Coronavirus (COVID-19)</a>.

### A social research example: charity data

This example introduces slightly more complicated web pages, and techniques for handling exceptions, downloading files and more. If you feel comfortable with what you've learned so far then we highly recommend completing this lesson; if not, take some more time to digest the Covid-19 example and return to this lesson at a later date.

[Charity Data Example - Appendix C](#section_9_3)

## Value, limitations and ethics

Computational methods for collecting data from the web are an increasingly important component of a social scientist's toolkit. They enable individuals to collect and reshape data - qualitative and quantitative - that otherwise would be inaccessible for research purposes. Thus far, this notebook has focused on the logic and practice of web-scraping; however, it is crucial that we reflect critically on its value, limitations and ethical implications for social science purposes.

### Value

* Web-scraping is a mature computational method, with lots of established packages (e.g., `requests` and `BeautifulSoup` in Python), examples and help available. As a result the learning curve is not as steep as with other methods, and it is possible for a beginner to create and execute a functioning web-scraping script in a matter of hours.
* Using computational, rather than manual, methods provides the ability to schedule or automate your data collection activities. For instance, you could schedule the Covid-19 script in this notebook to execute at a set time every day.
* The richness of some of the information and data stored on web pages is a point worth repeating. Many public, private and charitable institutions use their websites to release and regularly update information of value to social scientists. Getting a handle on the volume, variety and velocity of this information is extremely challenging without the use of computational methods.
* Computational methods not only enable accurate, real-time and reliable data collection, they also enable the reshaping of data into familiar formats (e.g., a CSV file, a database, a text document). While Python and HTML might be unfamiliar, the data that is returned through web-scraping can be formatted in such a way as to be compatible with your usual analytical methods (e.g., regression modelling, content analysis) and software applications (e.g., Stata, NVivo). In fact, we would go as far as to say that computational methods are particularly valuable to social scientists from a data collection and processing perspective, and you can achieve much without ever engaging in "big data analytics" (e.g., machine learning, neural networks, natural language processing).

### Limitations

* Web-scraping may contravene the Terms of Service (ToS) of a website.  Much like open datasets will have a licence stipulating how the data may be used, information stored on the web can also come with restrictions on use. For example, the <a href="https://www.worldometers.info/licensing/faq/" target=_blank>Worldometer Covid-19 data</a> that we scrape in this notebook cannot be used without their permission, even though the <a href="https://www.worldometers.info/disclaimer/" target=_blank>ToS</a> do not expressly prohibit web-scraping. In contrast, the beneficiary data provided by the Charity Commission for England and Wales is available under the <a href="https://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/" target=_blank>Open Government Licence (OGL) Version 2</a>, which permits copying, publishing, distribution and transmission. <br>So even in instances where scraping data is not prohibited, you may not be able to use it without seeking permission from the data owner. The safest approach is to seek permission in advance of conducting a web-scrape, especially if you intend to build a working relationship with the data owner - do not rely on the argument that the information is publicly available to begin with. Sometimes it may be easier to manually record or collect the data you are interested in.
* This brings us to a related point concerning the legal basis for collecting and using data from websites. In the UK there is no specific law prohibiting web-scraping or the use of data obtained via this method; however, there are other laws which impinge on this activity. Copyright or intellectual property law may restrict what information, if any, may be scraped from a website (<a href="http://copyrightblog.kluweriplaw.com/2015/01/26/ryanair-ltd-v-pr-aviation-bv-contracts-rights-and-users-in-a-low-cost-database-law/?doing_wp_cron=1586338639.9420111179351806640625" target=_blank>see this example</a>). <br>Data protection laws, such as the General Data Protection Regulation (GDPR), also influence whether and how you collect data about individuals. This means you take responsibility for processing personal data, even if it’s publicly available. This is a critical and detailed area of data-driven activities, and we encourage you to consult relevant guidance (see Further Reading and Resources section).
* Web pages are frequently updated, and changes to their structure can break your script: for example, the URL for a file may change, or the table element may now have a different id or have moved to a different web page. It can be a lot of work maintaining your code, especially if you make it available for use by others.
* Some websites may be advanced enough that they throttle or block scraping of their contents. For example, they may "blacklist" (ban) your IP address - your computer's unique id on the internet - from making requests to their servers.
* Web-scraping, and computational social science in general, is dependent on your computing setup. For example, you may not possess the administrative rights for your machine, or your computer may automatically go to sleep after a set period of time, preventing you from scheduling your script to run on a regular basis. There are ways around this and you do not need a high performance computing setup, but it is worth keeping in mind nonetheless.

See the *Further Reading and Resources* section for useful articles exploring many of these issues.
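
The point above about page structure changing can be made concrete. The sketch below is a hedged illustration, not part of the scripts in this notebook: the function name `get_table_rows`, the table ids and the two snippets of HTML are all made up. It relies on the fact that `find()` returns `None` when nothing matches, and turns that into a readable error message.

```python
from bs4 import BeautifulSoup


def get_table_rows(html, table_id):
    """Return the <tr> rows of the table with the given id.

    Raises a ValueError with a readable message if the table is missing,
    which usually suggests the page structure has changed.
    """
    page = BeautifulSoup(html, "html.parser")
    table = page.find("table", id=table_id)
    if table is None:  # find() returns None when nothing matches
        raise ValueError("No <table> with id '{}' found".format(table_id))
    return table.find_all("tr")


# Made-up pages: one with the expected id, one where the id has changed
old_page = '<table id="stats"><tr><td>A</td></tr><tr><td>B</td></tr></table>'
new_page = '<table id="statistics"><tr><td>A</td></tr></table>'

print(len(get_table_rows(old_page, "stats")))  # number of rows found

try:
    get_table_rows(new_page, "stats")
except ValueError as error:
    print(error)  # a clear message instead of a confusing AttributeError
```

Failing early with a clear message will not stop a page from changing, but it should make it quicker to see why a scheduled script has stopped working.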

### Ethical considerations

For the purposes of this discussion, we will assume a researcher has sought and received ethical approval for a piece of research through the usual institutional processes: you've already considered consent, harm to researcher and participant, data security and curation etc. Instead, we will focus on a major ethical implication specific to web-scraping: the impact of web-scraping on the data owner's website. Each request you make to a website consumes computational resources, on your end and theirs: the server (i.e., computer) hosting the website must use some of its processing power and bandwidth to respond to the request. Web-scraping, especially frequently scheduled scripts, can overload a server by making too many requests, causing the website to crash. Individuals and organisations may rely on a website for vital and timely information, and causing a website to crash could carry significant real-world implications.
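
One common way of reducing the load on a server is to pause between requests. The sketch below is a minimal illustration under stated assumptions: `fetch_politely` is a hypothetical helper, `fetch` stands in for any function that requests a URL (e.g. `requests.get`), and the two-second default delay is an arbitrary value rather than a recommendation from any website.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Request each URL in turn, pausing between requests.

    `fetch` is any function that takes a URL and returns a response.
    `sleep` can be swapped out when testing so that we do not
    actually have to wait.
    """
    responses = []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay_seconds)  # wait before every request after the first
        responses.append(fetch(url))
    return responses


# Demonstration with a stand-in for a real request function
def fake_fetch(url):
    return "response for {}".format(url)


results = fetch_politely(["page-1", "page-2", "page-3"], fake_fetch, delay_seconds=0.1)
print(results)
```

In a real scrape you might call `fetch_politely(list_of_urls, requests.get, delay_seconds=5)`; a suitable delay depends on the website and on any guidance the data owner provides.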

## Conclusion

Web-scraping is a simple yet powerful computational method for collecting data of value for social science research. It also provides a relatively gentle introduction to using programming languages. However, "with great power comes great responsibility" (sorry). Web-scraping takes you into the realm of data protection, website Terms of Service (ToS), and many murky ethical issues. Wielded sensibly and sensitively, web-scraping is a valuable and exciting social science method. 

Good luck on your data-driven travels!

## Bibliography

Barba, Lorena A. et al. (2019). *Teaching and Learning with Jupyter*. <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>https://jupyter4edu.github.io/jupyter-edu-book/</a>.

Brooker, P. (2020). *Programming with Python for Social Scientists*. London: SAGE Publications Ltd.

Lau, S., Gonzalez, J., & Nolan, D. (n.d.). *Principles and Techniques of Data Science*. <a href="https://www.textbook.ds100.org" target=_blank>https://www.textbook.ds100.org</a>.

## Further reading and resources

We publish a list of useful books, papers, websites and other resources on our Github repository: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data/blob/master/web-scraping/reading-list.md" target=_blank>[Reading list]</a>

The help documentation for the `requests` and `BeautifulSoup` modules is refreshingly readable and useful:
* <a href="https://requests.readthedocs.io/en/master/" target=_blank>`requests`</a>
* <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target=_blank>`BeautifulSoup`</a> 

You may also be interested in the following articles specifically relating to web-scraping:
* <a href="https://ico.org.uk/for-organisations/guide-to-data-protection" target=_blank>Guide to Data Protection</a>
* <a href="https://ocean.sagepub.com/blog/collecting-social-media-data-for-research" target=_blank>Collecting social media data for research</a>
* <a href="https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/" target=_blank>Web Scraping and Crawling Are Perfectly Legal, Right?</a>
* <a href="https://parissmith.co.uk/blog/web-crawling-screen-scraping-legal-position/" target=_blank>Web Crawling and Screen Scraping – the Legal Position</a>

## Appendices

### Appendix A - Requesting URLs

In Python we've made use of the excellent `requests` module. By calling the `requests.get()` function, we mimic the manual process of launching a web browser and visiting a website. The `requests` module achieves this by placing a _request_ to the server hosting the website (e.g., show me the contents of the website), and handling the _response_ that is returned (e.g., the contents of the website and some metadata about the request). This _request-response_ protocol is known as HTTP (HyperText Transfer Protocol); HTTP allows computers to communicate with each other over the internet - you can learn more about it at <a href="https://www.w3schools.com/whatis/whatis_http.asp" target=_blank>W3Schools</a>.

Run the simple example below to learn more about the data and metadata returned by `requests.get()`. To learn more about the `requests` module, see the <a href="https://requests.readthedocs.io/en/master/" target=_blank>official documentation</a>.

In [33]:
import requests

url = "https://httpbin.org/html"
response = requests.get(url)

print("1. {}".format(response)) # returns the object type (i.e. a response) and status code
print("\r")

print("2. {}".format(response.headers)) # returns a dictionary of response headers
print("\r")

print("3. {}".format(response.headers["Date"])) # return a particular header
print("\r")

print("4. {}".format(response.request)) # returns the request object that requested this response
print("\r")

print("5. {}".format(response.url)) # returns the URL of the response
print("\r")

#print(response.text) # returns the body of the response decoded as text (i.e. the HTML of the web page as a string)
#print(response.content) # returns the body of the response as raw bytes (useful for files such as PDFs)

# Visit https://www.w3schools.com/python/ref_requests_response.asp for a full list of what is returned by the server
# in response to a request.

1. <Response [200]>

2. {'Date': 'Mon, 20 Apr 2020 09:19:55 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '3741', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}

3. Mon, 20 Apr 2020 09:19:55 GMT

4. <PreparedRequest [GET]>

5. https://httpbin.org/html
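
The number in `<Response [200]>` is the HTTP status code. As a hedged illustration of how these codes are grouped, the sketch below uses Python's built-in `http` module rather than `requests`; `describe_status` is a made-up helper name, and the function raises a `ValueError` for codes that `HTTPStatus` does not recognise.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short, human-readable summary of an HTTP status code."""
    phrase = HTTPStatus(status_code).phrase  # standard phrase, e.g. "Not Found"
    if status_code < 200:
        outcome = "informational"
    elif status_code < 300:
        outcome = "success"
    elif status_code < 400:
        outcome = "redirect"
    elif status_code < 500:
        outcome = "client error"
    else:
        outcome = "server error"
    return "{} {} ({})".format(status_code, phrase, outcome)


for code in [200, 404, 503]:
    print(describe_status(code))
```

With a real response you could pass in `response.status_code`; anything in the 400s or 500s means the page was not returned as requested.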



### Appendix B - Covid-19 annotated script

We've provided answers to some of the questions posed throughout the Covid-19 country-level statistics code. Note also that you can copy-and-paste any of the code found in this notebook into a different p

In [None]:
# Import modules

import os
import requests
import csv
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup as soup
   
date = datetime.now().strftime("%Y-%m-%d") # get today's date and format it 


# Request web page

url = "https://www.worldometers.info/coronavirus/"
response = requests.get(url, allow_redirects=True)
print(response.status_code, " | ", response.headers)
#
# QUESTION: How do you know if the web page was requested successfully?
#
# ANSWER: We get a response code of '200'.
#


# Parse the contents of the web page as an HTML document

html_response = response.text
soup_response = soup(html_response, "html.parser")
# print(soup_response)


# Locate information of interest

table = soup_response.find("table", id="main_table_countries_today").find("tbody")
#
# QUESTION: What is the second "find()" function looking for?
#
# ANSWER: Within the <table> tags, find the <tbody> tags.
# <tbody> tags contain the content of a table i.e., rows and columns.
#

rows = table.find_all("tr", style="") # find all rows in the table


# Extract country information from table

global_info = []
for row in rows:
    tds = row.find_all("td")
    country_info = [var.text.strip() for var in tds] # for every cell in the row, extract the text
    global_info.append(country_info)

print("\r")
print("Number of rows in table: {}".format(len(global_info)))
print("\r")
#
# QUESTION: Why do we create a blank list (global_info) and then
# populate it within the "for loop"?
#
# ANSWER: If the list was created within the loop, then it would be
# overwritten every time the loop iterates, leaving only the final row.
#
# What would happen if we created the blank list within the loop?
# (Feel free to change the code to see what happens).
#


# Save results to a file

try:
    os.mkdir("./downloads")
except OSError as error:
    print(error)
    print("Folder already exists")

variables = ["Country", "Total Cases", "New Cases", "Total Deaths", 
            "New Deaths", "Total Recovered", "Active Cases", 
            "Serious_Critical", "Total Cases Per 1m Pop", "Deaths Per 1m Pop",
            "Total Tests", "Tests Per 1m Pop"]
outfile = "./downloads/covid-19-country-statistics-" + date + ".csv"

with open(outfile, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(variables)
    for country in global_info:
        writer.writerow(country)
#
# QUESTION: How can you check if the file was saved to the machine?
#
# ANSWER: Using Python, we can list all files in the "downloads" folder,
# or open the newly created file and view its contents.
#


# Read in data set containing country-level statistics

data = pd.read_csv(outfile, encoding = "ISO-8859-1", index_col=False)
data.sample(5)
#
# QUESTION: In the output below, what do you think the value "NaN" represents?
#
# ANSWER: "NaN" is "Not a Number" and represents missing values in pandas.
#
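
The answer about checking whether the file was saved can be sketched in code. This is a minimal illustration under stated assumptions: it writes to a temporary folder rather than your real `downloads` folder, and the country name and figure are made-up placeholder values. It also uses `os.makedirs(..., exist_ok=True)`, an alternative to the `try`/`except` approach above that does not raise an error if the folder already exists.

```python
import csv
import os
import tempfile

# A temporary folder stands in for the notebook's working directory,
# so this sketch does not touch your real "downloads" folder.
base_folder = tempfile.mkdtemp()
downloads = os.path.join(base_folder, "downloads")

os.makedirs(downloads, exist_ok=True)  # creates the folder if needed
os.makedirs(downloads, exist_ok=True)  # running it again raises no error

# Write a tiny CSV file; the row contains made-up placeholder values
outfile = os.path.join(downloads, "example.csv")
with open(outfile, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Country", "Total Cases"])
    writer.writerow(["Exampleland", "123"])

# Three ways of checking the file was saved
print(os.path.isfile(outfile))   # does the file exist?
print(os.listdir(downloads))     # what is in the folder?
print(os.path.getsize(outfile))  # how many bytes were written?
```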

<a id="section_9_3"></a>
### Appendix C: Capturing charity data

We conclude this notebook with a social research-oriented example: who do UK charities claim to support through their activities?

Let's decompose the data collection process into its constituent parts:
1. Extract a list of charity numbers from an administrative dataset.
2. For each of these charity numbers, request and parse its web page from the Charity Commission for England and Wales (CCEW) website.
3. Extract the list of beneficiary groups - listed under the header "Who the charity helps".
4. Write the results to a CSV file.
5. Download a charity's set of annual accounts to capture additional information (e.g., how the organisation helped its beneficiaries).

When developing a more complicated script, it's best to test it with a simple/limited example first. In our case, let's try to extract the beneficiary groups for a randomly selected charity before doing so for a larger sample of organisations.

First, let's take care of importing modules and other preliminary tasks.

In [None]:
# Import modules

try:
    import csv # module for handling csv files
    import requests # module for requesting urls
    import os # module for performing operating system tasks
    import pandas as pd # module for working with datasets
    import random # module for generating pseudo-random numbers
    from datetime import datetime # module for working with dates and time
    from bs4 import BeautifulSoup as soup # module for parsing web pages
    print("Successfully imported modules")
except ImportError:
    print("Did not import one or more modules!")  
    
# Create folder and file
try:
    os.mkdir('./downloads')
except OSError as error:
    print("Folder already exists")

date = datetime.now().strftime("%Y-%m-%d") # get today's date

outfile = "./downloads/charity-beneficiaries-" + date + ".csv" # CSV file for saving results of scrape
outfile # view path to file

#### Extract list of charity numbers

Good, we've successfully imported the modules we need and defined a location and file for storing the results of the scrape. Now let us import the raw data file containing a list of charity numbers and take a random sample of these for testing our script.

In [None]:
# Import list of charities from raw data
#
# We'll use an older copy of the Register of Charities provided by the Charity Commission for England and Wales (CCEW).
# This file contains information on all registered charities in England and Wales (c. 160,000 organisations).
#

charreg_file = "./data/extract_main_charity.csv" # location of file
charreg = pd.read_csv(charreg_file, encoding = "ISO-8859-1", index_col=False) # import file


# Explore dataset characteristics

print("{} observations in the dataset".format(len(charreg))) # print the number of rows in the dataset
print(charreg.shape) # print the number of rows and columns in the dataset
print(charreg.columns) # print the names of the columns in the dataset
charreg.sample(5) # view 5 randomly chosen observations

In [None]:
# Extract charity numbers and take a random sample of these

charreg["regno"] = charreg["regno"].fillna(0).astype(int) # convert missing values to "0" and remove decimal places
regno_list = charreg.regno.values.tolist() # extract values in "regno" column and place in a list
print(regno_list[0:5]) # return first five charity numbers in the list

random.seed(2) # ensure the random sample is consistent every time the code is run
regno_rsamp = random.sample(regno_list, 10) # Draw a random sample of charity numbers from the list
print(regno_rsamp)
regno = regno_rsamp[2] # select third number in list
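
To see why `random.seed()` makes the sample consistent, consider the short sketch below. The list of charity numbers is made up for illustration; the point is only that setting the same seed before sampling gives the same sample.

```python
import random

charity_numbers = list(range(1000, 1100))  # made-up ids for illustration

random.seed(2)
first_sample = random.sample(charity_numbers, 10)

random.seed(2)  # resetting the seed restarts the same pseudo-random sequence
second_sample = random.sample(charity_numbers, 10)

print(first_sample)
print(first_sample == second_sample)  # True: same seed, same sample
```

Without the second call to `random.seed(2)`, the second sample would be drawn from a later point in the pseudo-random sequence and would almost certainly differ from the first.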

#### Requesting and parsing charity details

Now we need to use the ```requests``` and ```BeautifulSoup``` modules to request and parse each charity's web page.

In [None]:
# Request web pages of charities

print("Charity number: {}".format(str(regno)))
print("\r")

url = "https://beta.charitycommission.gov.uk/charity-details/?regId=" + str(regno) + "&subId=0"
print("Requesting URL: {}".format(url))
print("\r")
#
# For now we will just request the details of one charity: the third
# number in the random sample, selected in the previous cell.
#

response = requests.get(url, allow_redirects=True) # request the url
print(response.status_code, " | ", response.headers) # print the metadata behind the request to see if it was successful
print("\r")


# Parse the contents of the web page

html_response = response.text # get the HTML of the page as text
soup_response = soup(html_response, 'html.parser') # parse the text as a BeautifulSoup object
print(type(soup_response)) # return object type (just to confirm it is a BeautifulSoup object)

Great, we've successfully requested the web page and parsed it as a BeautifulSoup object; this is important as the ```BeautifulSoup``` module has a number of useful functions (e.g. ```find()```, ```find_all()```) that only work on this object type.

Now we come to the scrape - finding and extracting the list of beneficiaries.

#### Extracting beneficiary group information

In [None]:
# Extract the list of beneficiary groups
#
# The information we need is contained in a set of <div></div> tags identified by its 
# "class=pcg-charity-details__block col-lg-4" attribute. Unfortunately this is not a unique id, so we
# need to find all instances where "class=pcg-charity-details__block col-lg-4" and filter to the correct set of <div></div>.
#

sections = soup_response.find_all("div", class_="pcg-charity-details__block col-lg-4") # find all <div> tags where class equals stated value
print(len(sections)) # return how many tags were found matching the soup_response.find_all() expression above
print("\r")


# Find beneficiary section in list of sections

searchterm = "Who the charity helps" # search term identifying section containing list of beneficiaries
for section in sections: # for each section contained in the sections list:
    if searchterm in str(section): # if the search term exists in the section
        print("Index (position) of relevant section: {}".format(sections.index(section))) # return a message saying we found the correct section
        benlocation = sections.index(section) # store the location in the list of the correct section
        print("\r")
    else:
        continue

bensection = sections[benlocation] # create a new object containing the correct section
print(bensection) # return the contents of the section
print("\r")


# Extract beneficiary groups from correct section #

benlist = [] # define a blank list for storing results of scrape

for item in bensection.find_all('li'): # for each <li> tag in the beneficiaries section
    charid = str(regno) # store the unique id of the charity
    beneficiary = str(item.text) # store the text contained within the <li></li> tags
    
    observation = [charid, beneficiary] # create a list containing charity id and beneficiary group
    benlist.append(observation) # add the observation to the original list

print(benlist) # now we have a list of beneficiary groups (long format) for the selected charity

Let's unpick the logic of the code above:
1. We know the list of beneficiaries is contained in a section (`<div>`) where *class_="pcg-charity-details__block col-lg-4"*.
2. We find all sections where the _class_ attribute equals "pcg-charity-details__block col-lg-4", and navigate to the correct one by evaluating whether it contains a relevant string ("Who the charity helps"). This process revealed that the list of beneficiaries was contained in the fifth section (remember: lists begin at position 0, so 4 identifies the fifth element of a list). If we knew that the list of beneficiaries was always contained in the fifth section we wouldn't need the use of a search term, but this way is more robust to deviations in the structure and content of each charity's web page.
3. Once we identify the correct section, we extract all of the text contained in the `<li></li>` tags and store the results in a list.
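
The three steps above can be sketched as a single function. This is a hedged illustration rather than the script used in this notebook: the HTML is made up to imitate the structure described, and `find_beneficiaries` is a hypothetical helper that returns an empty list when no section contains the search term, instead of raising an error.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure described above
html = """
<div class="pcg-charity-details__block col-lg-4">
  <h3>What the charity does</h3>
  <ul><li>Education/training</li></ul>
</div>
<div class="pcg-charity-details__block col-lg-4">
  <h3>Who the charity helps</h3>
  <ul><li>Children/young people</li><li>Elderly/old people</li></ul>
</div>
"""


def find_beneficiaries(page, searchterm="Who the charity helps"):
    """Return the beneficiary groups on a parsed page, or [] if none are found."""
    sections = page.find_all("div", class_="pcg-charity-details__block col-lg-4")
    for section in sections:
        if searchterm in section.get_text():  # is this the right section?
            return [item.get_text(strip=True) for item in section.find_all("li")]
    return []  # no section contained the search term


page = BeautifulSoup(html, "html.parser")
print(find_beneficiaries(page))

empty_page = BeautifulSoup("<p>No details available</p>", "html.parser")
print(find_beneficiaries(empty_page))
```

Returning an empty list is one design choice among several; it lets a loop over many charities carry on when a page lacks beneficiary information.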

#### Save results to a file

Our final task is to write the results of the scrape (benlist) to a CSV file.

In [None]:
# 4. Write the results to a CSV file #

variables = ["Charity Number", "Beneficiary Group"] # define variable names for the file

with open(outfile, 'w', newline='') as f: # with the file open in "write" mode, and giving it a shorter name (f)
    writer = csv.writer(f) # define a 'writer' object that allows us to export information to a CSV
    writer.writerow(variables) # write the variable names to the first row of the file
    for row in benlist: # for every observation in the list
        writer.writerow(row) # write the observation to a row in the file

print("----------------------------")   
print("\r")
print("Successfully saved the file")
print("\r")
print("Contents of downloads folder: {}".format(os.listdir("./downloads"))) # list the contents of the 'downloads' directory
print("\r")


# Open file to check scrape worked #

with open(outfile, "r") as f: # with the file open in "read" mode, and giving it a shorter name (f)
    print(f.read()) # print the contents of the file

print("----------------------------")    

And that completes the scrape! It seems to work as intended for this one charity but it would be good to check its robustness with a larger sample. The *Extensions* section below contains a more detailed script which:
* Executes the script for a random sample of 1000 charities.
* Creates a log file to record metadata about the scrape (e.g., how long it takes to execute, which urls it requested).

Let's conclude by using the `requests` module to download the most recent annual report for this charity.

#### Downloading files

As a final task, we will attempt to download the 2018/19 annual accounts for our charity (regno: 211535), which are available on the Charity Commission for England and Wales (CCEW) public website. Our task is simplified by knowing the URL through which the file can be accessed: <a href="http://apps.charitycommission.gov.uk/Accounts/Ends35/0000211535_AC_20190331_E_C.PDF" target=_blank>http://apps.charitycommission.gov.uk/Accounts/Ends35/0000211535_AC_20190331_E_C.PDF</a>.

Let's see how we can use Python to achieve this task.

In [None]:
# Define the URL where the file can be downloaded

accounts = "http://apps.charitycommission.gov.uk/Accounts/Ends35/0000211535_AC_20190331_E_C.PDF"


# Define where the file should be downloaded to

outfile = "./downloads/annual-accounts-211535-2019.pdf"


# Request the file from the URL 

response = requests.get(accounts, allow_redirects=True) # request the url and allow redirects if needed
print(response.status_code, " | ", response.headers) # print the metadata behind the request to see if it was successful

In [None]:
print(type(response.content)) # reveal what type of Python object the response content is; 'bytes' is raw binary data (here, the PDF file)

# Save the PDF in the location we defined earlier (i.e. 'outfile') #

with open(outfile, "wb") as f: # with the file open in "write binary" mode, and giving it a shorter name (f)
    f.write(response.content) # write the contents (i.e. PDF) of the request we made to the file

print("Successfully saved the file")
#
# It may seem strange to open the file before placing the contents of
# the PDF in it, but think of it like opening a blank spreadsheet and 
# copying-and-pasting a table into it. 
#

How can we tell it worked? First, we wrote a simple ```print``` command that would only execute if the code preceding it worked correctly. We could also ask Python to list the contents of the _downloads_ folder.

In [None]:
os.listdir("./downloads") # list the contents of the 'downloads' directory
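
Listing the folder confirms that a file exists, but not that it is a PDF: a server can return an HTML error page, which would be saved under the same file name. One hedged way of checking is to look at the first few bytes of the content, since PDF files begin with the signature `%PDF-`. The helper name `looks_like_pdf` and the byte strings below are made up for illustration.

```python
def looks_like_pdf(content):
    """Return True if the bytes start with the PDF file signature."""
    return content.startswith(b"%PDF-")


# Made-up byte strings standing in for `response.content`
pdf_bytes = b"%PDF-1.4 rest of the file..."
html_bytes = b"<html><body>Page not found</body></html>"

print(looks_like_pdf(pdf_bytes))   # starts with the PDF signature
print(looks_like_pdf(html_bytes))  # an HTML page, not a PDF
```

In the download code above, you could call `looks_like_pdf(response.content)` before writing the file.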

Finally, we can open the PDF in our Jupyter notebook to view its contents.

In [None]:
from IPython.display import IFrame
IFrame(outfile, width=600, height=500)

Congratulations! You've learned how to download a file from a URL using simple and efficient Python code. Why didn't we just open a browser and perform this task manually? Well, a programming script has the following advantages:
* Quicker (once the script has been written) 
* Reproducible
* Automatable 
* Less prone to error

The advantages are obvious once we expand our data collection efforts to more units of analysis: there are c. 160,000 registered charities in England and Wales, many of which must submit accounts going back at least five years. I certainly don't fancy performing this task manually or subjecting a research assistant to it...

**EXERCISE**: Adapt the script below to download all of the annual accounts available for this charity from the regulator's website: <br><br><a href="https://beta.charitycommission.gov.uk/charity-details/?regid=211535&subid=0" target=_blank>https://beta.charitycommission.gov.uk/charity-details/?regid=211535&subid=0</a>

In [None]:
# Download charity accounts

# Create a list of URLs where the files can be downloaded from

accounts = ["http://apps.charitycommission.gov.uk/Accounts/Ends35/0000211535_AC_20161231_E_C.PDF",
            "http://apps.charitycommission.gov.uk/Accounts/Ends35/0000211535_AC_20151231_E_C.PDF",
            "http://apps.charitycommission.gov.uk/Accounts/Ends35/0000211535_AC_20141231_E_C.PDF"]

year = 2016
for url in accounts:
    
    outfile = "./downloads/annual-accounts-211535-" + str(year) + ".pdf" # folder and file name for downloaded file
    
    # TASK: INSERT CODE HERE TO REQUEST THE URL
    print(response.status_code, " | ", response.headers) # print the metadata behind the request to see if it was successful
    

    with open(outfile, "wb") as f: # with the file open in "write binary" mode, and giving it a shorter name (f)
        # TASK: INSERT CODE HERE TO WRITE CONTENTS OF THE PDF TO THE FILE
        pass # placeholder so the cell is valid Python; replace with your code

    year -= 1 # decrement the year variable by one
    
# QUESTION: Why do we use a "year" variable for naming the files?

In [None]:
# TASK: INSERT CODE HERE TO VIEW THE 2016 ACCOUNTS
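
The exercise names each file using a `year` counter, which relies on the URLs being listed in descending order. As a hedged alternative, the year could be read from the URL itself. The sketch below assumes the file name contains an eight-digit date (YYYYMMDD) between underscores, as in the URLs listed above; `year_from_accounts_url` is a made-up helper name.

```python
import re


def year_from_accounts_url(url):
    """Extract the four-digit year from an accounts URL.

    Assumes the file name contains an eight-digit date (YYYYMMDD)
    between underscores, e.g. "..._AC_20161231_E_C.PDF".
    """
    match = re.search(r"_(\d{4})\d{4}_", url)
    if match is None:
        raise ValueError("No date found in URL: {}".format(url))
    return int(match.group(1))


url = "http://apps.charitycommission.gov.uk/Accounts/Ends35/0000211535_AC_20161231_E_C.PDF"
year = year_from_accounts_url(url)
outfile = "./downloads/annual-accounts-211535-" + str(year) + ".pdf"
print(outfile)
```

This keeps each file name tied to its own URL, so reordering the list would not mislabel the downloads.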

#### Extensions

This section contains a more detailed script for scraping charity beneficiaries. It performs the scrape for a larger sample of charities, is robust to instances where beneficiary information is not available, and also captures metadata about the scrape. It is not perfect, and I'm sure you can think of improvements such as the following:
* Currently, the script scrapes information for each charity and stores this in an expanding list, which is only written to the CSV file once the scrape is complete. This could be improved by writing to the CSV file **within** the loop; not only does this ensure we capture some records in the event the script stops executing, it also prevents the list becoming so large that it uses up too much computer memory.
* What happens if the web page cannot be requested? At the moment the script would break, because it does not distinguish between an unsuccessful request (no web page is returned) and a successful request where the beneficiary information is simply not available (which the script is able to handle).
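
The two improvements above could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of the script below: the helper names `fetch_page` and `append_rows` and the ten-second timeout are hypothetical choices. It uses `requests.RequestException`, the base class of the exceptions raised by the `requests` module, to catch failed requests.

```python
import csv
import requests

def fetch_page(url, timeout=10):
    """Return the page's HTML, or None if the request fails or returns an error status."""
    try:
        response = requests.get(url, allow_redirects=True, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx status codes as failures
    except requests.RequestException:
        return None  # signal an unsuccessful request to the caller
    return response.text

def append_rows(path, rows):
    """Append rows to a CSV file immediately, so results are saved as the scrape runs."""
    with open(path, "a", newline="") as f:  # "a" = append mode
        csv.writer(f).writerows(rows)

# Hypothetical usage inside the scraping loop:
# html = fetch_page(charurl)
# if html is None:
#     append_rows(logfile, [[charid, charurl, "Request failed"]])
# else:
#     ... parse html and call append_rows(outfile, observations) ...
```

Because each charity's rows are written as soon as they are scraped, an interruption loses at most the charity currently being processed. The header row would need to be written once, before the loop starts.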

In [None]:
## Title: Scraping charity beneficiaries
## Created: 01/04/2020
## Creator: Diarmuid McDonnell, University of Manchester

# 0. Preliminaries

# Import modules

try:
    import csv # module for handling csv files
    import requests # module for requesting urls
    import os # module for performing operating system tasks
    import pandas as pd # module for working with datasets
    import random # module for generating pseudo-random numbers
    from datetime import datetime # module for working with dates and time
    from bs4 import BeautifulSoup as soup # module for parsing web pages
    print("Successfully imported modules")
except ImportError as error: # only catch failed imports, not every possible error
    print("Did not import one or more modules: {}".format(error))
    

# Define files to save results of scrape

os.makedirs('./downloads', exist_ok=True) # create the folder if it does not already exist

date = datetime.now().strftime("%Y-%m-%d") # get today's date

outfile = "./downloads/charity-beneficiaries-" + date + ".csv" # CSV file for saving results of scrape
logfile = "./downloads/charity-beneficiaries-log-" + date + ".csv" # log file for saving metadata of scrape



##############################################################################################

##############################################################################################

    
# 1. Import list of charities from raw data
#
# We'll use an older copy of the Register of Charities provided by the Charity Commission for England and Wales (CCEW).
# This file contains information on all registered charities in England and Wales (c. 160,000 organisations).
#

charreg_file = "./data/extract_main_charity.csv" # location of file
charreg = pd.read_csv(charreg_file, encoding = "ISO-8859-1", index_col=False) # import file


# Extract charity numbers and take a random sample of these

charreg["regno"] = charreg["regno"].fillna(0).astype(int) # convert missing values to "0" and remove decimal places
regno_list = charreg.regno.values.tolist() # extract values in "regno" column and place in a list
print(regno_list[0:5]) # return first five charity numbers in the list

regno_rsamp = random.sample(regno_list, 1000) # Draw a random sample of charity numbers from the list


##############################################################################################

##############################################################################################

    
# 2. Request web pages of charities

benlist = [] # define a blank list for storing results of scrape
loglist = [] # define a blank list for storing metadata of scrape

for regno in regno_rsamp:
    
    print("--------------------------------------------------------")
    print("Starting scrape of charity number: {}".format(str(regno)))
    print("\r")
    starttime = datetime.now() # Track how long it takes to capture information for each charity
    
    charurl = "https://beta.charitycommission.gov.uk/charity-details/?regId=" + str(regno) + "&subId=0" 
    # define the URL for a given charity's web page

    response = requests.get(charurl, allow_redirects=True) # request the url
    print(response.status_code, " | ", response.headers) # print the metadata behind the request to see if it was successful
    print("\r")

    html_response = response.text # Get the text elements of the page
    soup_response = soup(html_response, 'html.parser') # Parse the text as a BeautifulSoup object

    
    # 3. Find beneficiary section in list of sections
    
    if not soup_response.find('span', {"class": "pcg-page-path__status pcg-page-path__status--removed"}): # If the charity isn't removed from the Register then proceed with scraping beneficiary info
        sections = soup_response.find_all("div", class_="pcg-charity-details__block col-lg-4") # find all <div> tags where class equals stated value
        #print(len(sections)) # return how many tags were found matching the soup_response.find_all() expression above

        searchterm = "Who the charity helps" # search term identifying section containing list of beneficiaries
        charid = str(regno) # store the unique id of the charity
        benlocation = None # reset for each charity so a position found for a previous charity is never reused
        for section in sections: # for each section contained in the sections list:
            if searchterm in str(section): # if the search term exists in the section
                #print("Index (position) of relevant section: {}".format(sections.index(section))) # return a message saying we found the correct section
                benlocation = sections.index(section) # store the location in the list of the correct section
                #print("\r")

        if benlocation is not None: # only extract beneficiaries if the relevant section was found
            bensection = sections[benlocation] # create a new object containing the correct section

            for item in bensection.find_all('li'): # for each <li> tag in the beneficiaries section
                beneficiary = str(item.text) # store the text contained within the <li></li> tags
                observation = [charid, beneficiary] # create a list containing charity id and beneficiary group
                benlist.append(observation) # add the observation to the results list
            scraped = "Yes"
        else: # the page loaded but contains no beneficiary section
            scraped = "No"

        runtime = datetime.now() - starttime # calculate how long the scrape took for this charity
        logobs = [runtime, charid, charurl, response.status_code, scraped]
        loglist.append(logobs)
        
        print("Finished scraping beneficiaries for: {}".format(str(regno)))
        print("--------------------------------------------------------")
        
    else: # charity is no longer registered, thus no beneficiary info available
        charid = str(regno)
        runtime = datetime.now() - starttime # calculate how long the scrape took for this charity
        scraped = "No"
        logobs = [runtime, charid, charurl, response.status_code, scraped]
        loglist.append(logobs)
        
        print("Could not scrape beneficiaries for: {}".format(str(regno)))
        print("--------------------------------------------------------")

##############################################################################################

##############################################################################################


# 4. Write the results to a CSV file 

variables = ["Charity Number", "Beneficiary Group"] # define variable names for the results file (one per column in each observation)
logheaders = ["Runtime", "Charity Number", "URL", "Status Code", "Scraped"] # define variable names for the log file


# Write the results #

with open(outfile, "w", newline="") as f: # with the file open in "write" mode, and giving it a shorter name (f)
    writer = csv.writer(f) # define a 'writer' object that allows us to export information to a CSV
    writer.writerow(variables) # write the variable names to the first row of the file
    for row in benlist: # for every observation in the list
        writer.writerow(row) # write the observation to a row in the file
        
        
# Write the log file #

with open(logfile, "w", newline="") as f: # with the file open in "write" mode, and giving it a shorter name (f)
    writer = csv.writer(f) # define a 'writer' object that allows us to export information to a CSV
    writer.writerow(logheaders) # write the variable names to the first row of the file
    for row in loglist: # for every observation in the list
        writer.writerow(row) # write the observation to a row in the file        

        
print("----------------------------")   
print("\r")
print("Successfully saved the files")
print(os.listdir("./downloads")) # list the contents of the 'downloads' directory
print("\r")
print("Script complete!")

-- END OF FILE --