# Web Scraping with Python

MIT Political Science Methods Series  
Spring 2018  
Andy Halterman

## Overview

Most news sites and similar web pages can be scraped through a three-step process:

1. Given a link to an article, extract and format all the needed information from the page.
2. Given an archive-type page of links, find all the links on the page.
3. Iterate through each archive page, scrape every article it links to, and save the results to disk.

We'll write one function for each.
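
As a rough sketch of how the three pieces fit together, the skeleton below wires up three placeholder functions over a tiny in-memory "site". The names `scrape_article`, `get_links`, `scrape_archive`, and the `FAKE_SITE` dictionary are hypothetical stand-ins chosen for illustration, and nothing here touches the network.

```python
# A hypothetical in-memory "website": archive pages map to lists of article
# URLs, and article URLs map to raw text. No network access is involved.
FAKE_SITE = {
    "archive/1": ["article/a", "article/b"],
    "archive/2": ["article/c"],
    "article/a": "First story",
    "article/b": "Second story",
    "article/c": "Third story",
}

def scrape_article(url):
    """Step 1: turn one article URL into a structured record."""
    return {"url": url, "body": FAKE_SITE[url]}

def get_links(archive_url):
    """Step 2: return every article link listed on one archive page."""
    return FAKE_SITE[archive_url]

def scrape_archive(page_numbers):
    """Step 3: loop over archive pages and scrape each linked article."""
    articles = []
    for num in page_numbers:
        for link in get_links("archive/{0}".format(num)):
            articles.append(scrape_article(link))
    return articles

articles = scrape_archive([1, 2])
print(len(articles))
```

The real versions replace the dictionary lookups with HTTP requests and HTML parsing, but the overall shape stays the same.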

## Libraries and setup

In [None]:
# requests is for general HTTP loading
import requests
# BeautifulSoup is an HTML parser
from bs4 import BeautifulSoup
# JSON is a nice format for writing out results.
# ujson can handle datetimes better and is a drop-in
# replacement for the json module, but here we use
# the standard library's json.
import json

In [None]:
# Sometimes you'll need the regular expressions library and a date library
import re
import dateutil.parser

In [109]:
from __future__ import print_function

## Page scraper

We'll write a function here that takes in the URL of an article or page, extracts the information we want from the HTML, and returns it as structured output.

Python things to learn:

- calling methods from objects
- what a dictionary is
- defining a function
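
For readers new to Python, here is a minimal sketch of those three ideas on made-up data. The `article` dictionary and the `tidy` function are hypothetical examples, not part of the scraper itself.

```python
# A dictionary maps keys to values.
article = {"title": "  Example headline  ", "author": "Example Agency"}

# Methods are functions attached to an object; call them with a dot.
clean_title = article["title"].strip()    # str.strip() removes outer whitespace
upper_author = article["author"].upper()  # str.upper() returns an uppercase copy

# A function packages steps so they can be reused on new inputs.
def tidy(record):
    """Return a copy of `record` with whitespace stripped from every value."""
    return {key: value.strip() for key, value in record.items()}

tidied = tidy(article)
print(clean_title)
print(tidied)
```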

HTML things to learn: 

- Chrome inspector
- what HTML tags look like
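
To get a feel for tags before opening a real page, the sketch below parses a small made-up HTML fragment with BeautifulSoup. The class names and link are invented for illustration, and it uses Python's built-in `"html.parser"` rather than `lxml` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment; the class names are illustrative only.
html = """
<html><body>
  <h1 class="headline main">Example title</h1>
  <div class="story text"><p>First paragraph.</p><p>Second paragraph.</p></div>
  <a href="/report/example">Read more</a>
</body></html>
"""

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if nothing matches).
title = soup.find("h1", {"class": "headline main"}).text.strip()

# Tag attributes are read like dictionary keys.
href = soup.find("a")["href"]

# find_all() returns every matching tag as a list.
paragraphs = [p.text for p in soup.find_all("p")]

print(title, href, paragraphs)
```

In Chrome's inspector, the tag name and `class` attribute you see when you right-click an element are the two arguments you would pass to `find()`.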

In [121]:
# find the URL of an article to scrape
url = "https://reliefweb.int/report/afghanistan/more-200-displaced-families-receive-cash-assistance-laghman-province"

In [122]:
# download article page and get content
page = requests.get(url)
content = page.content

In [123]:
# convert to BeautifulSoup
soup = BeautifulSoup(content, "lxml")
soup

<!DOCTYPE html>
<html dir="ltr" lang="en" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article#">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/profiles/reliefweb/themes/kobe/images/apple-touch-icon.png" rel="apple-touch-icon"/>
<link href="/profiles/reliefweb/themes/kobe/images/favicon.png" rel="icon"/>
<!--[if IE]><link rel="shortcut icon" href="/profiles/reliefweb/themes/kobe/favicon.ico"><![endif]-->
<link href="https://reliefweb.int/profiles/reliefweb/themes/kobe/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<meta content="News and Press Release from Government of Afghanistan, published 08 Jan 2018" name="description"/>
<meta content="Afghanistan, , , Government of Afghanistan, News and Press Release, Shelter and Non-Food Items, English, humanitarian, IDPs" name="keywords"/>
<meta content="Afghanistan, , , Government of Afghanistan, News and Press Release, Shelter and Non-Food Items

In [128]:
# extract body text from page HTML (we'll do this together)
body = soup.find("div", {"class" : "field body"})
body = body.text.strip()
body

'Directorate of Refugees and Repatriations(DoRR) of Laghman province with co-operation of DRC and CWSA have distributed for 203 displaced families which displaced from insecure places of Alingar, Kanda gul, districts and Khogiani district of Nangarhar province and returnees from Pakistan have placed in the central of this province.\nThursday January 4, 2018 of the current year for 143 displaced families distributed 18000Af by DRC and for 60 displaced families distributed 100$ dolor by CWSA.'

In [129]:
# extract title from page HTML
title = soup.find("h1", {"class" : "node-title clearfix"})
title.text

'More than 200 Displaced families Receive Cash Assistance in Laghman Province'

In [131]:
# extract author from page HTML
author = soup.find("span", {"class" : "field field-source"})
author.text.strip()

'Government of Afghanistan'

In [136]:
# extract date from page HTML
raw_date = soup.find("span", {"class" : "date-display-single"}).text
# convert from raw text into a standardized form
date = dateutil.parser.parse(raw_date)
# Put it into a standard ISO format 
date = date.strftime("%Y-%m-%d")
date

'2018-01-06'
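
`dateutil.parser.parse` guesses the date format for you. When a site always uses one fixed format, the standard library's `datetime.strptime` is a stricter alternative that raises a `ValueError` if the text does not match. The sketch below assumes the raw text looks like `"06 Jan 2018"`; that input is an assumption, since the raw string is not printed above.

```python
from datetime import datetime

# Assumed example input; the notebook does not show the raw date text.
raw_date = "06 Jan 2018"

# %d = day of month, %b = abbreviated month name, %Y = four-digit year.
parsed = datetime.strptime(raw_date, "%d %b %Y")

# Same ISO-style output format as used with dateutil.
iso_date = parsed.strftime("%Y-%m-%d")
print(iso_date)
```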

In [140]:
# put it all together!
def page_scraper(url):
    """Function to scrape a page"""
    # Code to download and soupify the page
    page = requests.get(url)
    content = page.content
    soup = BeautifulSoup(content, "lxml")
    # All the code to extract pieces from the HTML
    title = soup.find("h1", {"class" : "node-title clearfix"})
    title = title.text.strip()
    body = soup.find("div", {"class" : "field body"})
    body = body.text.strip()
    raw_date = soup.find("span", {"class" : "date-display-single"}).text
    date = dateutil.parser.parse(raw_date)
    date = date.strftime("%Y-%m-%d")
    author = soup.find("span", {"class" : "field field-source"})
    author = author.text.strip()
    article = {
        "title" : title,
        "body" : body,
        "date" : date,
        "author" : author
    }
    return article

In [141]:
# test it
page_scraper(url)

{'author': 'Government of Afghanistan',
 'body': 'Directorate of Refugees and Repatriations(DoRR) of Laghman province with co-operation of DRC and CWSA have distributed for 203 displaced families which displaced from insecure places of Alingar, Kanda gul, districts and Khogiani district of Nangarhar province and returnees from Pakistan have placed in the central of this province.\nThursday January 4, 2018 of the current year for 143 displaced families distributed 18000Af by DRC and for 60 displaced families distributed 100$ dolor by CWSA.',
 'date': '2018-01-06',
 'title': 'More than 200 Displaced families Receive Cash Assistance in Laghman Province'}

In [142]:
# test it on a different page
page_scraper("https://reliefweb.int/report/iraq/reconstruction-needed-displaced-iraqis-continue-return-iom-iraq")

{'author': 'International Organization for Migration',
 'body': 'Kuwait – As the recent Kuwait International Conference for the Reconstruction of Iraq came to a close, IOM, the UN Migration Agency, reported that internally displaced persons (IDPs) in Iraq continue to return home by the thousands.\nAccording to IOM’s latest Displacement Tracking Matrix (DTM) report, as of 31 January 2018 more than 3.3 million Iraqis have returned to their areas of origin, while nearly 2.5 million people continue to live in displacement. New IOM figures show that return movements are ongoing – in January another 125,000 returnees were identified – mainly to the four governorates of Ninewa, Salah al-Din, Kirkuk and Anbar.\nSince the start of the crisis in early 2014, with ISIL later holding large parts of central Iraq and the subsequent conflict to retake these areas, nearly six million Iraqis have been displaced. Their communities have suffered widespread devastation and damage.\nIn January 2018, for the

## Link getter

Second, we need to get the URLs of all the pages we want to scrape. To do this, we find the site's directory (archive) pages, work out where the article links sit on each one, and figure out how to step through all the available directory pages.

Python things to learn here:

- getting values from dictionaries
- for loops and list comprehensions
- regex with the `re` library
- basic string operations

In [None]:
### Function to get all the article links from a single directory page

In [144]:
url = "https://reliefweb.int/country/afg?format=8&page=1"
page = requests.get(url)
content = page.content

In [146]:
soup = BeautifulSoup(content, "lxml")

In [152]:
links = soup.find("div", {"class" : "river-list river-updates river-sort-latest"})
links = links.find_all("a")
print(links[0:5])

[<a class="toggle" href="#item-2465814">description</a>, <a href="/country/afg">Afghanistan</a>, <a href="/report/afghanistan/uk-helps-rid-war-torn-afghanistan-lethal-landmines-making-thousands-safer">UK helps rid war-torn Afghanistan of lethal landmines making thousands safer</a>, <a href="/organization/dfid">Department for International Development</a>, <a class="toggle" href="#item-2464539">description</a>]


In [153]:
# pull out just the links
links = [i['href'] for i in links]
links[0:10]

['#item-2465814',
 '/country/afg',
 '/report/afghanistan/uk-helps-rid-war-torn-afghanistan-lethal-landmines-making-thousands-safer',
 '/organization/dfid',
 '#item-2464539',
 '/country/afg',
 '/report/afghanistan/polio-week-07-february-2018',
 '/organization/gpei',
 'https://reliefweb.int/sites/reliefweb.int/files/resources/This%20Week%20%E2%80%93%20GPEI%20FEB%207.pdf',
 'https://reliefweb.int/sites/reliefweb.int/files/resources/This%20Week%20%E2%80%93%20GPEI%20FEB%207.pdf']

In [154]:
# uh oh! `links` is full of all sorts of garbage. Is there a term
#  that we can search for to reliably pull out article links only?
links = [i for i in links if bool(re.match("/report", i))]
links[0:5]

['/report/afghanistan/uk-helps-rid-war-torn-afghanistan-lethal-landmines-making-thousands-safer',
 '/report/afghanistan/polio-week-07-february-2018',
 '/report/afghanistan/afghanistan-managing-wastewater-help-improve-living-conditions',
 '/report/world/we-need-olympic-effort-climate-change',
 '/report/afghanistan/community-unites-resolve-long-standing-tribal-conflict-surobi-district']
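
The filter works because `re.match` only matches at the start of a string. A small sketch on made-up links (the `hrefs` list is hypothetical) shows how it differs from `re.search`, which matches anywhere:

```python
import re

# Made-up hrefs similar in shape to the ones on the directory page.
hrefs = [
    "#item-1",
    "/country/afg",
    "/report/afghanistan/example-story",
    "https://example.org/files/report.pdf",
]

# re.match only succeeds if the pattern matches at the start of the string.
starts_with_report = [h for h in hrefs if re.match("/report", h)]

# re.search succeeds if the pattern matches anywhere in the string,
# so it would also let the PDF link through.
contains_report = [h for h in hrefs if re.search("/report", h)]

print(starts_with_report)
print(contains_report)
```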

In [155]:
# These aren't complete urls! We can use string operations plus a list comprehension to fix this:
links = ["https://reliefweb.int" + i for i in links]
links[0:5]

['https://reliefweb.int/report/afghanistan/uk-helps-rid-war-torn-afghanistan-lethal-landmines-making-thousands-safer',
 'https://reliefweb.int/report/afghanistan/polio-week-07-february-2018',
 'https://reliefweb.int/report/afghanistan/afghanistan-managing-wastewater-help-improve-living-conditions',
 'https://reliefweb.int/report/world/we-need-olympic-effort-climate-change',
 'https://reliefweb.int/report/afghanistan/community-unites-resolve-long-standing-tribal-conflict-surobi-district']
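
String concatenation is fine when every link is relative. As an alternative sketch, `urljoin` from Python 3's standard library handles a mix of relative and already-absolute links. The example paths below are made up.

```python
from urllib.parse import urljoin

BASE = "https://reliefweb.int"

# One relative link and one that is already a full URL (both invented).
hrefs = [
    "/report/afghanistan/example-story",
    "https://reliefweb.int/report/afghanistan/already-absolute",
]

# urljoin prepends the base to relative links and leaves absolute ones alone.
full_links = [urljoin(BASE, h) for h in hrefs]
print(full_links)
```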

In [157]:
# Put it all together into a function that takes in a "page number" 
#  and returns all the links to scrape from it.
def page_to_link(page_num):
    # use .format() to insert the page number into the URL
    url = "https://reliefweb.int/country/afg?format=8&page={0}#content".format(page_num)
    # download the page
    page = requests.get(url)
    # get its content
    content = page.content
    # soupify
    soup = BeautifulSoup(content, "lxml")
    # pull out links
    links = soup.find("div", {"class" : "river-list river-updates river-sort-latest"})
    links = links.find_all("a")
    links = [i['href'] for i in links]
    # clean links
    links = [i for i in links if bool(re.match("/report", i))]
    links = ["https://reliefweb.int" + i for i in links]
    return links

In [158]:
page_to_link(33)

['https://reliefweb.int/report/afghanistan/toll-conflict-afghan-women-worsens',
 'https://reliefweb.int/report/pakistan/pakistan-prepared-polio-high-season',
 'https://reliefweb.int/report/afghanistan/surobi-leaders-call-local-communities-embrace-mediation-peace',
 'https://reliefweb.int/report/afghanistan/afghanistan-flash-floods-noaa-pmd-media-echo-daily-flash-18-july-2017',
 'https://reliefweb.int/report/afghanistan/iom-special-case-nimroz',
 'https://reliefweb.int/report/afghanistan/unfpa-midwifery-helpline-supports-helps-midwife-save-mother-s-life-remote-village',
 'https://reliefweb.int/report/afghanistan/nearly-100-displaced-families-eastern-waziristan-return-home',
 'https://reliefweb.int/report/afghanistan/nearly-300-afghan-returnees-received-cash-assistance-turkham',
 'https://reliefweb.int/report/afghanistan/nearly-50-displaced-families-receive-humanitarian-assistance',
 'https://reliefweb.int/report/pakistan/council-conclusions-pakistan-17-july-2017',
 'https://reliefweb.in

## Putting it together

Now we have a function that'll take a page number for the archive page and return all the links.
We have another function that'll take in an article URL and give us the structured content from the page.

Let's put them together and download a (small!) range of stories.

Note: let's be nice to the UN and not download the whole archive at once.

In [163]:
# get all the links we want to scrape
all_links = []

for num in range(1, 5):
    lks = page_to_link(num)
    all_links.extend(lks) # extend! not append.
    
len(all_links)

# deliberately break one link so we can see the error handling in the next cell
all_links[12] = "http://asdngasdng.com"
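
Why `extend` and not `append`? A quick sketch with made-up link lists shows the difference: `extend` adds each item of a list, while `append` adds the list itself as a single item.

```python
page_one = ["link-a", "link-b"]
page_two = ["link-c"]

# extend: the items of each page are added one by one, giving a flat list.
extended = []
extended.extend(page_one)
extended.extend(page_two)

# append: each page's list is added as a single element, giving nested lists.
appended = []
appended.append(page_one)
appended.append(page_two)

print(extended)
print(appended)
```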

In [164]:
all_content = []

for link in all_links[10:20]: # be nice to reliefweb and only get some 
    try:
        content = page_scraper(link)
        all_content.append(content) # back to append!
    except Exception as e:
        # if something goes wrong, keep trucking,
        #  but print out the link so we can diagnose it.
        print(e)
        print(link)

HTTPConnectionPool(host='asdngasdng.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10e625080>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))
http://asdngasdng.com


In [165]:
len(all_content)

9
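
One way to be even gentler on a site is to pause between requests and to keep a list of failures rather than only printing them. The sketch below is a variation on the loop above, not the notebook's own method: `fake_scraper` is a hypothetical stand-in that fails on purpose, and the 0.1 second delay is kept short for the demo.

```python
import time

def fake_scraper(url):
    """Hypothetical stand-in for a real scraper: fails on 'bad' URLs."""
    if "bad" in url:
        raise ValueError("could not scrape " + url)
    return {"url": url}

urls = ["page-1", "bad-page", "page-2"]
scraped, failed = [], []

for url in urls:
    try:
        scraped.append(fake_scraper(url))
    except Exception as e:
        # record the failure so it can be inspected or retried later
        failed.append((url, str(e)))
    # pause between requests; a real scraper would likely wait longer
    time.sleep(0.1)

print(len(scraped), failed)
```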

In [162]:
all_content[4]

{'author': 'Food and Agriculture Organization of the United Nations',
 'body': 'With conflict and climate-related shocks sending global hunger numbers marching back up after declining for decades, FAO is asking for $1.06 billion to save lives and livelihoods and address acute hunger in 26 countries.\nWith donor support, FAO is hoping to reach 30+ million people who rely on agriculture for their livelihoods via a range of interventions that seek to rapidly restore local food production and enhance nutrition. These include, for example, providing seeds, tools and other materials for crop farming, safeguarding livestock through lifesaving veterinary care, organizing trainings in improved production, processing, and land and water management, and giving at-need families cash so they can immediately access food.\nEscalating humanitarian needs are largely the result of the persistence, intensification and spread of violence and conflict – the impacts of which are often being amplified and ag

## Saving as CSV

If you're going to work with your text next in R, a CSV is probably the most useful format to save it in. To save it as a CSV, we will convert it into a `DataFrame` using `pandas`, a package for working with data in Python. `pandas` will then let us easily write it out to a CSV file.

In [166]:
from __future__ import unicode_literals
import pandas as pd

content_df = pd.DataFrame(all_content)
content_df

| | author | body | date | title |
|---|---|---|---|---|
| 0 | UN Assistance Mission in Afghanistan | HERAT – Participants at a UN-backed radio prog... | 2018-02-10 | Women empowerment discussed at UN-backed radio... |
| 1 | International Organization for Migration | Jakarta – IOM, the UN Migration Agency, has fa... | 2018-02-09 | UN Migration Agency Facilitates Release of Ref... |
| 2 | European Commission, Government of Afghanistan | Bruxelles\nUNIQUE ID: 180208_8\nToday, the fir... | 2018-02-08 | Joint Press Release: The European Union and Af... |
| 3 | Food and Agriculture Organization of the Unite... | With conflict and climate-related shocks sendi... | 2018-02-08 | $1.06 billion needed to fight back against hun... |
| 4 | European Commission's Directorate-General for ... | FACTS & FIGURES\nOver 815 000 refugees and mig... | 2018-02-07 | Factsheet on the former Yugoslav Republic of M... |
| 5 | UN Assistance Mission in Afghanistan | JALALABAD - Civilians must be protected from a... | 2018-02-08 | Protection of civilians is the responsibility ... |
| 6 | UN Office for the Coordination of Humanitarian... | Geneva, 7 February 2018\nAs delivered\nExcelle... | 2018-02-07 | Assistant-Secretary-General and Deputy Emergen... |
| 7 | Jesuit Refugee Service | Rome, 1 February 2018 – The Jesuit Refugee Ser... | 2018-02-01 | World Interfaith Harmony Week: Working togethe... |
| 8 | Norwegian Refugee Council | According to a new report by The Norwegian Ref... | 2018-02-07 | 50 people displaced every hour |


In [174]:
content_df.to_csv("all_content.csv", encoding = 'utf-8')
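
If `pandas` is not available, the standard library's `csv` module can write a list of dictionaries directly. This is an alternative sketch, not what the cell above does: the two records are made up, and an in-memory `io.StringIO` buffer stands in for a file on disk.

```python
import csv
import io

# Made-up records with the same keys the page scraper produces.
records = [
    {"title": "First, with a comma", "author": "Agency A",
     "date": "2018-02-01", "body": "Line one\nLine two"},
    {"title": "Second", "author": "Agency B",
     "date": "2018-02-02", "body": "Short body"},
]

# An in-memory text buffer that behaves like an open file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "author", "date", "body"])
writer.writeheader()
writer.writerows(records)  # commas and newlines inside fields get quoted

# Read it back to check that nothing was lost.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["title"])
```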

## Saving as JSON

We can also store it as a JSON file. JSON objects and Python dictionaries are almost equivalent, so JSON is a natural format for saving a list of dicts.

In [168]:
import json

In [169]:
FILENAME = "reliefweb.json"

with open(FILENAME, "w") as f:
    json.dump(all_content, f)

### Read it back in

If you want to load it back in later to analyze, you can do this:

In [170]:
FILENAME = "reliefweb.json"

with open(FILENAME, "r") as f:
    loaded_content = json.load(f)

In [172]:
# is it the same?
assert loaded_content[4] == all_content[4]
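
JSON and Python dictionaries are close but not identical, which matters once records hold more than plain strings. The sketch below uses a made-up `record` to show two differences: tuples come back as lists, and `date` objects must be converted to strings before the standard `json` module will write them.

```python
import json
from datetime import date

# A made-up record containing a tuple and a date object.
record = {"title": "Example", "tags": ("a", "b"), "date": date(2018, 2, 8)}

# The standard json module raises TypeError on date objects.
try:
    json.dumps(record)
    serializable = True
except TypeError:
    serializable = False

# Converting the date to an ISO string first makes the record writable.
record["date"] = record["date"].isoformat()
restored = json.loads(json.dumps(record))

print(serializable)
print(restored)
```

This is one reason the page scraper stores the date as a formatted string rather than a `datetime` object.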