# Web Scraping with Python

MIT Political Methodology Lab Workshop Series  
Spring 2021  
Andy Halterman

## Scraping Overview

Most news sites and similar web pages can be scraped through a three-step process:

1. Given a link to an article, extract and format all the needed information from the page.
2. Given an archive-type page of links, find all the links on the page.
3. Iterate through each archive page, scrape every article linked from it, and save the results to disk.

We'll write one function for each.

## Libraries and setup

In [1]:
# requests is for general HTTP loading
import requests
# BeautifulSoup is an HTML parser
from bs4 import BeautifulSoup
# JSON is a nice format for writing out
# (the third-party ujson package can handle datetimes better and is a drop-in replacement for the json module)
import json

In [2]:
# Sometimes you'll need the regular expressions library and a date library
import re
import dateutil.parser

## Page scraper

We'll write a function here that takes in the URL of an article or page, extracts the information we want from the HTML, and structures the output.

Python things to learn:

- calling methods from objects
- what a dictionary looks like
- how to define a function

HTML things to learn: 

- Chrome inspector
- what HTML tags look like

In [3]:
# find the URL of an article to scrape
url = "https://reliefweb.int/report/afghanistan/more-200-displaced-families-receive-cash-assistance-laghman-province"

In [4]:
# download article page and get content
page = requests.get(url)
content = page.content
content

b'<!DOCTYPE html>\n<html lang="en" dir="ltr" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article#" class="no-js">\n\n<head>\n  <meta charset=\'UTF-8\'>\n  <title>More than 200 Displaced families Receive Cash Assistance in Laghman Province - Afghanistan | ReliefWeb</title>\n  <script>document.documentElement.className = typeof document.addEventListener !== \'undefined\' ? \'js\' : \'no-js\';</script>\n  <meta name="viewport" content="width=device-width, initial-scale=1" />\n\n  <link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png">\n  <link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png">\n  <link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png">\n  <link rel="manifest" href="/manifest.json" crossOrigin="use-credentials">\n  <link rel="mask-icon" href="/safari-pinned-tab.svg" color="#5bbad5">\n  <meta name="msapplication-TileColor" content="#055372">\n  <meta name="theme-color" content="#ffff

In [5]:
page.status_code

200

In [6]:
# convert to BeautifulSoup
soup = BeautifulSoup(content, "lxml")

In [7]:
# quick aside to demonstrate "strip"
txt = "  Andy asdf  "
print(txt)
print(txt.strip())

  Andy asdf  
Andy asdf
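
`strip()` only trims the two ends of a string. Scraped text often carries runs of blank lines in the middle as well, so a small helper can be handy. The sketch below assumes that "clean" means collapsing every run of whitespace to a single space; `collapse_whitespace` is a hypothetical helper name and the sample string is made up for illustration.

```python
def collapse_whitespace(text):
    """Trim the ends and squeeze every run of whitespace down to one space."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops the empty pieces.
    return " ".join(text.split())

# Made-up sample in the style of a scraped body
raw = "  Attachments\n\n\n Download document  \n\nSouth: Hundreds of people displaced  "

stripped = raw.strip()               # only leading/trailing whitespace is removed
cleaned = collapse_whitespace(raw)   # interior newlines and double spaces go too

print(repr(stripped))
print(repr(cleaned))
```

Note that this also removes paragraph breaks, so it is only appropriate if you don't need them later.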


In [8]:
# extract body text from page HTML (we'll do this together)
body = soup.find("div", {"class" : "content"})
#body.text
print(body.text.strip())

Directorate of Refugees and Repatriations(DoRR) of Laghman province with co-operation of DRC and CWSA have distributed for 203 displaced families which displaced from insecure places of Alingar, Kanda gul, districts and Khogiani district of Nangarhar province and returnees from Pakistan have placed in the central of this province.
Thursday January 4, 2018 of the current year for 143 displaced families distributed 18000Af by DRC and for 60 displaced families distributed 100$ dolor by CWSA.


In [9]:
# extract title from page HTML
title = soup.find("article").find("h2")
print(title.text)
title

More than 200 Displaced families Receive Cash Assistance in Laghman Province


<h2 lang="en">More than 200 Displaced families Receive Cash Assistance in Laghman Province</h2>

In [10]:
# extract author from page HTML
author = soup.find("dd", {"class" : "source"})
author.text.strip()

'Govt. Afghanistan'

In [11]:
# extract date from page HTML
raw_date = soup.find("dd", {"class" : "date published"})
print(raw_date.text)
# convert from raw text into a standardized form
date = dateutil.parser.parse(raw_date.text)
# Put it into a standard ISO format 
date = date.strftime("%Y-%m-%d")
print(date)

6 Jan 2018
2018-01-06
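
`dateutil.parser.parse` guesses the layout of the date string. If every page on a site uses the same layout, a stricter alternative is the standard library's `datetime.strptime`, which fails loudly when a string doesn't match. This is a sketch of that alternative, not what the scraper below uses; it assumes the dates always look like `6 Jan 2018` and that month names are in English (`%b` depends on the locale).

```python
from datetime import datetime

raw = "6 Jan 2018"

# %d = day of month, %b = abbreviated month name, %Y = four-digit year
parsed = datetime.strptime(raw, "%d %b %Y")
iso_date = parsed.strftime("%Y-%m-%d")
print(iso_date)

# A string in a different layout is rejected rather than guessed at.
try:
    datetime.strptime("2018/01/06", "%d %b %Y")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```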


In [12]:
# put it all together!
def page_scraper(url):
    """Function to scrape a page"""
    # Code to download and soupify the page
    page = requests.get(url)
    content = page.content
    soup = BeautifulSoup(content, "lxml")
    # All the code to extract pieces from the HTML
    title = soup.find("article").find("h2")
    title = title.text.strip()
    body = soup.find("div", {"class" : "content"})
    body = body.text.strip()
    raw_date = soup.find("dd", {"class" : "date published"})
    date = dateutil.parser.parse(raw_date.text)
    date = date.strftime("%Y-%m-%d")
    author = soup.find("dd", {"class" : "source"})
    author = author.text.strip()
    article = {
        "title" : title,
        "body" : body,
        "date" : date,
        "author" : author
    }
    return article

In [13]:
# test it
page_scraper(url)

{'title': 'More than 200 Displaced families Receive Cash Assistance in Laghman Province',
 'body': 'Directorate of Refugees and Repatriations(DoRR) of Laghman province with co-operation of DRC and CWSA have distributed for 203 displaced families which displaced from insecure places of Alingar, Kanda gul, districts and Khogiani district of Nangarhar province and returnees from Pakistan have placed in the central of this province.\nThursday January 4, 2018 of the current year for 143 displaced families distributed 18000Af by DRC and for 60 displaced families distributed 100$ dolor by CWSA.',
 'date': '2018-01-06',
 'author': 'Govt. Afghanistan'}

In [14]:
# test it on a different page
page_scraper("https://reliefweb.int/report/iraq/reconstruction-needed-displaced-iraqis-continue-return-iom-iraq")

{'title': 'Reconstruction Needed as Displaced Iraqis Continue to Return: IOM Iraq [EN/AR/KU]',
 'body': 'IOM reports that thousands of Iraqi internally displaced persons (IDPs) continue to return home. © IOM\n\nAttachments\n\n\n Download document (PDF | 96.27 KB | Arabic version)\n\n\n Download document (PDF | 91.81 KB | Kurkish version)\n\n\n\n\nKuwait – As the recent Kuwait International Conference for the Reconstruction of Iraq came to a close, IOM, the UN Migration Agency, reported that internally displaced persons (IDPs) in Iraq continue to return home by the thousands.\nAccording to IOM’s latest Displacement Tracking Matrix (DTM) report, as of 31 January 2018 more than 3.3 million Iraqis have returned to their areas of origin, while nearly 2.5 million people continue to live in displacement. New IOM figures show that return movements are ongoing – in January another 125,000 returnees were identified – mainly to the four governorates of Ninewa, Salah al-Din, Kirkuk and Anbar.\nSin
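
One thing to watch for when running `page_scraper` on many pages: `soup.find()` returns `None` when nothing matches, so a line like `author.text.strip()` raises an `AttributeError` on any page without that element. A minimal sketch of one way to guard against this is below. `safe_text` is a hypothetical helper, and `SimpleNamespace` objects stand in for real BeautifulSoup tags so the example runs without downloading anything.

```python
from types import SimpleNamespace

def safe_text(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    if tag is None:
        return None
    return tag.text.strip()

# Stand-ins for what soup.find() gives back: an object with a .text
# attribute when the element exists, and None when it does not.
found = SimpleNamespace(text="  Govt. Afghanistan  ")
missing = None

author = safe_text(found)
no_author = safe_text(missing)
print(author, no_author)
```

Inside `page_scraper`, this would look like `author = safe_text(soup.find("dd", {"class" : "source"}))`, leaving `None` in the dictionary for pages that lack the field.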

## Link getter

Second, we need to get the URLs of all the pages we want to scrape. We can do this by finding the directory pages, working out where the links sit on each one, and figuring out how to step through all the available directory pages.

Python things to learn here:

- getting values from dictionaries
- for loops and list comprehensions
- regex with the `re` library
- basic string operations

In [15]:
### Function to get all the article links from a single directory page

In [16]:
url = "https://reliefweb.int/updates?advanced-search=%28PC13%29_%28F10%29&page={0}".format(1)
page = requests.get(url)
content = page.content

In [17]:
soup = BeautifulSoup(content, "lxml")

In [18]:
links_raw = soup.find("div", {"class" : "articles"}).find_all("a")
print(links_raw[0:5])

[<a href="/country/afg">Afghanistan</a>, <a href="https://reliefweb.int/report/afghanistan/return-undocumented-afghans-weekly-situation-report-05-11-feb-2021-enpashtodari">Return of Undocumented Afghans - Weekly Situation Report (05 - 11 Feb 2021) [EN/Pashto/Dari]</a>, <a href="/organization/iom">IOM</a>, <a href="/country/afg">Afghanistan</a>, <a href="https://reliefweb.int/report/afghanistan/afghanistan-humanitarian-bulletin-issue-84-december-2020-february-2021">Afghanistan Humanitarian Bulletin Issue 84 | December 2020 – February 2021</a>]


In [19]:
# pull out just the links
links_raw = [i['href'] for i in links_raw]
links_raw[0:5]

['/country/afg',
 'https://reliefweb.int/report/afghanistan/return-undocumented-afghans-weekly-situation-report-05-11-feb-2021-enpashtodari',
 '/organization/iom',
 '/country/afg',
 'https://reliefweb.int/report/afghanistan/afghanistan-humanitarian-bulletin-issue-84-december-2020-february-2021']

In [20]:
# uh oh! `links_raw` is full of all sorts of garbage. Is there a term
#  that we can search for to reliably pull out article links only?
links = [i for i in links_raw if bool(re.search("/report", i))]
links[0:5]

['https://reliefweb.int/report/afghanistan/return-undocumented-afghans-weekly-situation-report-05-11-feb-2021-enpashtodari',
 'https://reliefweb.int/report/afghanistan/afghanistan-humanitarian-bulletin-issue-84-december-2020-february-2021',
 'https://reliefweb.int/report/afghanistan/afghanistan-covid-19-multi-sectoral-response-operational-situation-report-18',
 'https://reliefweb.int/report/afghanistan/who-emro-weekly-epidemiological-monitor-volume-14-issue-no-5-31-january-2021',
 'https://reliefweb.int/report/afghanistan/afghanistan-weekly-humanitarian-update-8-14-february-2021']
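
Searching for `"/report"` anywhere in the string works here, but it would also keep any link that merely contains that text somewhere in its path. A slightly stricter sketch anchors the pattern to the start of the URL and drops repeated links while keeping their order. The decoy link and the duplicated article link in the sample list are made up for illustration, and `ARTICLE_PATTERN` is a hypothetical name.

```python
import re

# Sample hrefs: the article URL appears in the listing above; the decoy
# and the duplicate entry are invented to show the difference.
links_raw = [
    "/country/afg",
    "https://reliefweb.int/report/afghanistan/afghanistan-weekly-humanitarian-update-8-14-february-2021",
    "/blog/how-to-read-a/report",
    "/organization/iom",
    "https://reliefweb.int/report/afghanistan/afghanistan-weekly-humanitarian-update-8-14-february-2021",
]

# Loose filter from the cell above: matches "/report" anywhere
loose = [i for i in links_raw if re.search("/report", i)]

# Stricter filter: the link must start with the article prefix
ARTICLE_PATTERN = re.compile(r"^https://reliefweb\.int/report/")
strict = [i for i in links_raw if ARTICLE_PATTERN.search(i)]

# dict.fromkeys keeps the first occurrence of each link, in order
unique_links = list(dict.fromkeys(strict))

print(len(loose), len(strict), len(unique_links))
```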

In [21]:
# If these weren't complete URLs we could use string operations plus a list comprehension to fix this:
#links = ["https://reliefweb.int" + i for i in links]
#links[0:5]
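
Plain string concatenation breaks if some links are already complete URLs. The standard library's `urllib.parse.urljoin` handles both cases: relative paths are resolved against the base, and absolute URLs are returned unchanged. A short sketch, using hrefs that appear in the outputs above (`BASE` is a name chosen for this example):

```python
from urllib.parse import urljoin

BASE = "https://reliefweb.int"

hrefs = [
    "/country/afg",  # relative link
    "https://reliefweb.int/report/afghanistan/who-afghanistan-situation-report-june-2018",  # already absolute
]

absolute = [urljoin(BASE, h) for h in hrefs]
for link in absolute:
    print(link)
```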

In [22]:
# Put it all together into a function that takes in a "page number" 
#  and returns all the links to scrape from it.
def page_to_link(page_num):
    # how to use .format()
    url = "https://reliefweb.int/updates?advanced-search=%28PC13%29_%28F10%29&page={0}".format(page_num)
    # download the page
    page = requests.get(url)
    # get its content
    content = page.content
    # soupify
    soup = BeautifulSoup(content, "lxml")
    # pull out links
    links_raw = soup.find("div", {"class" : "articles"}).find_all("a")
    links = [i['href'] for i in links_raw]
    # clean links
    links = [i for i in links if bool(re.search("/report", i))]
    return links
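
The `%28` and `%29` in the archive URL are percent-encoded parentheses. If you ever need to build a similar URL from a readable search string, `urllib.parse.quote` and `unquote` convert in both directions. This sketch only decodes and re-encodes the string that is already in the URL above; it makes no claim about what the search codes themselves mean.

```python
from urllib.parse import quote, unquote

encoded = "%28PC13%29_%28F10%29"

decoded = unquote(encoded)    # human-readable form
re_encoded = quote(decoded)   # back to the form used in the URL

# An f-string does the same job as .format()
page_num = 44
url = f"https://reliefweb.int/updates?advanced-search={re_encoded}&page={page_num}"

print(decoded)
print(url)
```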

In [23]:
page_to_link(44)

['https://reliefweb.int/report/afghanistan/afghanistan-weekly-field-report-16-22-july-2018-enps',
 'https://reliefweb.int/report/afghanistan/afghan-assisted-return-weekly-update-21-july-2018',
 'https://reliefweb.int/report/afghanistan/afghanistan-humanitarian-response-wash-cluster-updates-cluster-achievements-june',
 'https://reliefweb.int/report/afghanistan/who-afghanistan-situation-report-june-2018',
 'https://reliefweb.int/report/afghanistan/iom-humanitarian-assistance-programme-hap-weekly-report-11-17-july-2018',
 'https://reliefweb.int/report/afghanistan/afghanistan-weekly-field-report-9-15-july-2018-enps',
 'https://reliefweb.int/report/afghanistan/afghan-assisted-return-weekly-update-14-july-2018',
 'https://reliefweb.int/report/afghanistan/midyear-update-protection-civilians-armed-conflict-1-january-30-june-2018',
 'https://reliefweb.int/report/afghanistan/iom-humanitarian-assistance-programme-hap-weekly-report-4-10-july-2018',
 'https://reliefweb.int/report/afghanistan/afghan

## Putting it together

Now we have a function that'll take a page number for the archive page and return all the links.
We have another function that'll take in an article URL and give us the structured content from the page.

Let's put them together and download a (small!) range of stories.

Note: let's be nice to the UN and not download the whole thing at once.

In [24]:
# get all the links we want to scrape
all_links = []

for num in range(1, 3):
    lks = page_to_link(num)
    all_links.extend(lks) # extend! not append.
    
len(all_links)

all_links[12]

'https://reliefweb.int/report/afghanistan/return-undocumented-afghans-weekly-situation-report-22-28-january-2021'
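
Why `extend` and not `append` in the loop above? `page_to_link` returns a list, and `append` would add that whole list as a single element. A small illustration with placeholder strings in place of real links:

```python
# Placeholder "links" standing in for what page_to_link returns
page_1 = ["link-a", "link-b"]
page_2 = ["link-c"]

with_extend = []
with_append = []

for lks in (page_1, page_2):
    with_extend.extend(lks)   # adds each item of lks individually
    with_append.append(lks)   # adds lks itself as one nested item

print(with_extend)
print(with_append)
```

The first result is a flat list of three links, which is what the scraping loop needs. The second is a list of two lists.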

In [25]:
all_content = []

for link in all_links[10:20]: # be nice to reliefweb and only get some 
    try:
        content = page_scraper(link)
        all_content.append(content) # back to append!
    except Exception as e:
        # if something goes wrong, keep trucking,
        #  but print out the link so we can diagnose it.
        print(e)
        print(link)
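
Another way to be nice to the server is to pause between requests. The sketch below wraps the same try/except pattern in a function and adds a delay. `scrape_all`, its `delay` parameter, and `fake_scraper` are all hypothetical names for this example; the fake scraper stands in for `page_scraper` so the code runs without touching the network, and the one-second default is an assumption, not a documented ReliefWeb limit.

```python
import time

def scrape_all(links, scraper, delay=1.0):
    """Scrape each link in turn, pausing `delay` seconds after each request."""
    results, failures = [], []
    for link in links:
        try:
            results.append(scraper(link))
        except Exception as e:
            # keep going, but remember what failed so we can diagnose it
            failures.append((link, str(e)))
        time.sleep(delay)
    return results, failures

def fake_scraper(link):
    """Stand-in for page_scraper: fails on links containing 'broken'."""
    if "broken" in link:
        raise ValueError("no article tag found")
    return {"title": link.upper()}

# delay=0 here only so the demonstration finishes instantly
results, failures = scrape_all(["page-one", "broken-page", "page-two"], fake_scraper, delay=0)
print(len(results), failures)
```

With the real functions, the call would be `scrape_all(all_links[10:20], page_scraper)`.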

In [26]:
len(all_content)

10

In [27]:
all_content[4]

{'title': 'Afghanistan: Weekly Humanitarian Update (25 – 31 January 2021)',
 'body': "Attachments\n\n\n Download document (PDF | 364.25 KB)\n\n\n\n\nSouth: Hundreds of people displaced by ongoing fighting in Kandahar province\nFighting between Afghan National Security Forces (ANSF) and a non-state armed group (NSAG) continued in Hilmand, Kandahar and Uruzgan provinces.\nIn Kandahar, fighting continued mainly in Arghandab, Zheray and Panjwayi districts. Ongoing fighting displaced hundreds of people in Kandahar province, but the exact number of internally displaced persons (IDPs) is yet to be confirmed.\nHumanitarian actors with coordination of provincial authorities are assessing the needs of IDPs and will provide them with immediate assistance. Farmers and agricultural activities continued to be affected by ongoing fighting. All movements on the main highway-1 connecting Hilmand to Kandahar provinces reportedly resumed, however improvised explosive devices (IEDs) along the highway cont

## Saving as CSV

If you're going to work with your text next in R, a CSV is probably the most useful format to save it in. To save it as a CSV, we will convert it into a `DataFrame` using `pandas`, a package for working with data in Python. `pandas` will then let us easily write it out to a CSV.

In [28]:
import pandas as pd

content_df = pd.DataFrame(all_content)
#content_df
content_df.to_csv("all_content.csv")
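
Article bodies contain commas and newlines, which is exactly what makes hand-written CSV fragile. The sketch below uses the standard library's `csv` module instead of `pandas`, purely to show that such fields are quoted and survive a round trip. The sample rows are made up, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io

# Made-up rows in the same shape as the dictionaries from page_scraper
rows = [
    {"title": "First report", "body": "Line one\nLine two, with a comma",
     "date": "2018-01-06", "author": "Govt. Afghanistan"},
    {"title": "Second report", "body": 'A body with "quotes" in it',
     "date": "2021-01-31", "author": "IOM"},
]

buffer = io.StringIO()  # in-memory stand-in for an open file
writer = csv.DictWriter(buffer, fieldnames=["title", "body", "date", "author"])
writer.writeheader()
writer.writerows(rows)

# Read it back and check that nothing was mangled
buffer.seek(0)
read_back = [dict(r) for r in csv.DictReader(buffer)]
print(read_back == rows)
```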

## Saving as JSON

We can also store it as a JSON file. JSON objects and Python dictionaries are almost equivalent, so JSON is a natural format for saving a list of dicts.

In [29]:
import json

In [30]:
FILENAME = "reliefweb.json"

with open(FILENAME, "w") as f:
    json.dump(all_content, f)

### Read it back in

If you want to load it back in later to analyze, you can do this:

In [31]:
FILENAME = "reliefweb.json"

with open(FILENAME, "r") as f:
    loaded_content = json.load(f)

In [32]:
# is it the same?
assert loaded_content[4] == all_content[4]
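
The round trip works because `page_scraper` already converted the date to a string. The built-in `json` module cannot serialize date objects directly, which is worth knowing if you decide to keep them as dates. A small sketch with a made-up article; passing `default=str` is one common workaround, and it means the value comes back as a plain string rather than a date.

```python
import json
from datetime import date

# Made-up article that keeps the date as a date object
article = {"title": "Example", "date": date(2018, 1, 6)}

try:
    json.dumps(article)
    failed = False
except TypeError:
    # json does not know how to write a date object
    failed = True

# default=str tells json to call str() on anything it can't serialize
as_text = json.dumps(article, default=str)
restored = json.loads(as_text)

print(failed)
print(restored)
```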

In [33]:
list(range(0, 3))

[0, 1, 2]