# Python and Web Scraping

MIT Political Science Methods Series  
Fall 2019  
Andy Halterman

## Jupyter Notebooks

Jupyter notebooks are a way of writing code, running code, and displaying output in one convenient place. You can write code in code blocks, or write markdown/text in blocks like this. It's often useful to explain what you're doing and finding so when you or someone else picks up the notebook in the future, they'll know what's going on.

You can execute a code chunk by clicking the cell you want to run and hitting the "Run" button on the top bar, or by typing "shift-enter". You can always return to a previous code chunk, modify it, and re-run it.

You can write math in notebooks, just like in R Markdown: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$

Jupyter is great for prototyping, but for more heavy duty use, replication, running on a server, etc., I recommend re-writing the code into a `.py` file that can be called from the command line with `python mycode.py`. This process also gives you a chance to refactor your code, making it more efficient, readable, and dependable.

## Python vs. R

Python and R are fairly similar. This is a quick overview of the differences to help you get up to speed.

### Importing packages

- Importing packages is similar: instead of `[R] library(mypackage)`, do `[Python] import mypackage`.
- Python also lets you import specific functions from a package: `from mypackage import cool_function`
- You can also rename packages if they're too long: `import numpy as np`

In [27]:
## Practice! We need to import
# 1. a library called "requests"
import requests
# 2. a module called "BeautifulSoup" from a library called "bs4"
from bs4 import BeautifulSoup

### What are all those dots for? (Or, methods, attributes, and namespaces)

- Dots have special meaning in Python. It's not like R, where people put dots in all sorts of names.
- Python is much more careful about keeping functions attached to the packages they come from. If the `requests` library has a function called `get`, you call it like this: `requests.get()`. This reminds you where the `get` function came from and prevents you from overwriting some other package's `get` function.
- Python is also more "object oriented" than R. Objects often have built in or attached functions, called methods. 
- These methods are called with a dot notation. Compare: 
```
[R] strsplit("Andy Halterman", " ")
```
and

```
    [Python] "Andy Halterman".split(" ")
```

- Objects can also have attributes, which are pieces of data attached to an object. Example: `andy.subfields = ['methods', 'security']`
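
To make the difference between methods and attributes concrete, here is a minimal sketch. The `Researcher` class, its `name` and `subfields` attributes, and its `initials` method are all hypothetical and exist only for illustration.

```python
# Hypothetical class, for illustration only
class Researcher:
    def __init__(self, name, subfields):
        # attributes: pieces of data attached to the object
        self.name = name
        self.subfields = subfields

    def initials(self):
        # method: a function attached to the object
        return "".join(word[0] for word in self.name.split(" "))


andy = Researcher("Andy Halterman", ["methods", "security"])

print(andy.subfields)         # attribute access: no parentheses
print(andy.initials())        # method call: parentheses
print(andy.name.split(" "))   # strings are objects with their own methods
```

Attributes are read without parentheses, while methods are called with them, even when they take no arguments.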


In [23]:
## Practice! Can you figure out how to make a string all upper case?


### Data Structures

- Like R's vectors, Python uses a lot of lists. These are ordered arrays. Note that Python indexing starts at 0, not 1!

```
my_list = ["x", "y", "z"]
> my_list[0] 
x
```

- Python has a data structure called a dictionary, which is like a list whose items you access by key name instead of by position (think of R's named lists, or a more general form of R's dataframes). Example:

```
article = {"title": "Rivalry and Revenge",
           "author" : "Balcells",
           "year" : "2017"}

> article.keys()
dict_keys(['title', 'author', 'year'])

> article['author']
"Balcells"
```
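
A few more list and dictionary operations come up constantly when scraping. The sketch below reuses the `my_list` and `article` examples from above; the `"read"` key and the `"journal"` lookup are made up for illustration.

```python
my_list = ["x", "y", "z"]
first = my_list[0]    # indexing starts at 0
last = my_list[-1]    # negative indices count from the end

article = {"title": "Rivalry and Revenge",
           "author": "Balcells",
           "year": "2017"}

# assignment overwrites an existing key or adds a new one
article["year"] = 2017
article["read"] = True   # hypothetical key, for illustration only

# .get() returns a default instead of raising KeyError when a key is missing
journal = article.get("journal", "not recorded")

# loop over key/value pairs
for key, value in article.items():
    print(key, "->", value)
```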

### Loops and functions


- Functions are only slightly different from R:

```
def my_function(x):
    z = (x + 2) ** 2  # ** is exponentiation; ^ is bitwise XOR in Python
    return z
```

- Loops are fast and nice in Python, unlike in R, and are very easily done:

```
for i in my_list:
    my_function(i)
```

- Pro move: list comprehensions:

```
[my_function(i) for i in my_list]
```
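
Here is a runnable version of the three ideas above (a function, a loop, and a list comprehension) side by side. The `numbers` list is made-up input.

```python
def my_function(x):
    # ** is exponentiation in Python; ^ is bitwise XOR
    return (x + 2) ** 2

numbers = [1, 2, 3]   # made-up input values

# explicit loop that collects results in a new list
squared = []
for i in numbers:
    squared.append(my_function(i))

# the same thing as a list comprehension
squared_lc = [my_function(i) for i in numbers]

# comprehensions can also filter with a trailing "if"
evens = [i for i in numbers if i % 2 == 0]

print(squared)      # [9, 16, 25]
print(squared_lc)   # [9, 16, 25]
print(evens)        # [2]
```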

In [25]:
## Practice! Can you write a loop that goes over a list of words and prints a capitalized version of each one?
words = 'i love to scrape'.split(" ")
for w in words:
    print(w.upper())

I
LOVE
TO
SCRAPE


In [None]:
### Whitespace

- As you can tell, Python makes heavy use of whitespace to set apart different levels of functions, for loops, etc. Use four spaces (Jupyter converts tabs to four spaces automatically).
- No need for curly braces!

```
def my_function(big_list):
    print(len(big_list))
    for l in big_list:
        for i in l:
            ...
    return stuff
```

## Scraping Overview

Most news sites and similar web pages can be scraped through a three-step process:

1. given a link to an article, extracting and formatting all the needed info from the page
2. given an archive-type page of links, finding all the links on the page.
3. iterating through each page of archives, scraping all the pages from it, and saving to disk

We'll write one function for each.
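
Before filling in the details, it can help to see how the three steps nest. The following is only a structural sketch: `scrape_archive`, `fake_get_links`, and `fake_scrape_page` are hypothetical names, and the two "fake" functions are stand-ins so the code runs without touching the network.

```python
def scrape_archive(page_numbers, get_links, scrape_page):
    """Step 3: walk the archive pages and scrape every linked article."""
    articles = []
    for num in page_numbers:
        for link in get_links(num):              # step 2: links on one archive page
            articles.append(scrape_page(link))   # step 1: one article -> structured info
    return articles


# Stand-in functions (hypothetical) so the sketch runs offline.
def fake_get_links(page_num):
    return ["https://example.com/p{0}/a{1}".format(page_num, i) for i in range(2)]

def fake_scrape_page(url):
    return {"url": url, "title": "placeholder"}


results = scrape_archive(range(1, 3), fake_get_links, fake_scrape_page)
print(len(results))   # 2 archive pages x 2 links each = 4
```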

In [2]:
# my comment

## Libraries and setup

In [4]:
# requests is for general HTTP loading
import requests
# BeautifulSoup is an HTML parser
from bs4 import BeautifulSoup
# JSON is a nice format for writing out
# ujson can handle datetimes better and is a drop in replacement for the json module
import json

In [5]:
# Sometimes you'll need the regular expressions library and a date library
import re
import dateutil.parser

## Page scraper

We'll write a function here that takes in the URL of an article or page, extracts the information we want from HTML, and structures the output.

Python things to learn:

- calling methods from objects
- what a dictionary looks like
- how to define a function

HTML things to learn: 

- Chrome inspector
- what HTML tags look like

In [6]:
# find the URL of an article to scrape
url = "https://reliefweb.int/report/afghanistan/more-200-displaced-families-receive-cash-assistance-laghman-province"

In [11]:
# download article page and get content
page = requests.get(url)
content = page.content
content

b'<!DOCTYPE html>\n<html lang="en" dir="ltr" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article#">\n\n<head>\n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<link rel="apple-touch-icon" href="/profiles/reliefweb/themes/kobe/images/apple-touch-icon.png"/>\n<link rel="icon" href="/profiles/reliefweb/themes/kobe/images/favicon.png"/>\n<!--[if IE]><link rel="shortcut icon" href="/profiles/reliefweb/themes/kobe/favicon.ico"><![endif]-->\n<link rel="shortcut icon" href="https://reliefweb.int/profiles/reliefweb/themes/kobe/favicon.ico" type="image/vnd.microsoft.icon" />\n<meta name="description" content="English News and Press Release on Afghanistan about Shelter and Non-Food Items and IDPs; published on 06 Jan 2018 by Govt. Afghanistan" />\n<link rel="canonical" href="https://reliefweb.int/report/afghanistan/more-200-displaced-families-receive-cash-assistance-laghman-province" />\n<link rel="shortlink" href="https://reliefweb.in

200

In [10]:
# convert to BeautifulSoup
soup = BeautifulSoup(content, "lxml")

In [31]:
# quick aside to demonstrate "strip"
txt = "  Andy asdf  "
print(txt)
print(txt.strip())

  Andy asdf  
Andy asdf


In [28]:
# extract body text from page HTML (we'll do this together)
body = soup.find("div", {"class" : "field body"})
#body.text
print(body.text.strip())

Directorate of Refugees and Repatriations(DoRR) of Laghman province with co-operation of DRC and CWSA have distributed for 203 displaced families which displaced from insecure places of Alingar, Kanda gul, districts and Khogiani district of Nangarhar province and returnees from Pakistan have placed in the central of this province.
Thursday January 4, 2018 of the current year for 143 displaced families distributed 18000Af by DRC and for 60 displaced families distributed 100$ dolor by CWSA.


In [22]:
# extract title from page HTML
title = soup.find("h1", {"class" : "node-title clearfix"})
print(title.text)
title

More than 200 Displaced families Receive Cash Assistance in Laghman Province


<h1 class="node-title clearfix">More than 200 Displaced families Receive Cash Assistance in Laghman Province</h1>

In [33]:
# extract author from page HTML
author = soup.find("span", {"class" : "field field-source"})
author.text.strip()

'Government of Afghanistan'
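
The `soup.find` pattern used in the cells above can be tried offline on a small HTML string. This is a sketch on a made-up fragment that borrows the class names from the page above; it uses Python's built-in `"html.parser"` instead of `"lxml"` so it has no extra dependency, and the `demo_` names are chosen to avoid overwriting the notebook's variables.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment, loosely modeled on the class names used above
html = """
<html><body>
  <h1 class="node-title clearfix"> Example headline </h1>
  <div class="field body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

# "html.parser" ships with Python, so this sketch does not need lxml
demo_soup = BeautifulSoup(html, "html.parser")

demo_title = demo_soup.find("h1", {"class": "node-title clearfix"})
print(demo_title.text.strip())

# find() returns None when nothing matches, so check before using .text
demo_author = demo_soup.find("span", {"class": "field field-source"})
author_text = demo_author.text.strip() if demo_author is not None else None
print(author_text)

# find_all() returns every match as a list
paragraphs = [p.text for p in demo_soup.find_all("p")]
print(paragraphs)
```

Because `find` returns `None` for a missing element, calling `.text` on the result raises an `AttributeError` on pages that lack that element. That is worth keeping in mind when a scraper runs over many pages.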

In [36]:
# extract date from page HTML
raw_date = soup.find("span", {"class" : "date-display-single"})
print(raw_date.text)
# convert from raw text into a standardized form
date = dateutil.parser.parse(raw_date.text)
# Put it into a standard ISO format 
date = date.strftime("%Y-%m-%d")
print(date)

06 Jan 2018
2018-01-06
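
`dateutil` guesses the date format for you. When the format is known and fixed, the standard library's `datetime.strptime` is a stricter alternative; the sketch below is a swapped-in technique, not what the notebook uses. It assumes an English locale, since `%b` matches locale-dependent month abbreviations.

```python
from datetime import datetime

raw = "06 Jan 2018"

# %d = day of month, %b = abbreviated month name, %Y = four-digit year
parsed = datetime.strptime(raw, "%d %b %Y")
iso_date = parsed.strftime("%Y-%m-%d")
print(iso_date)

# ISO-formatted strings sort chronologically when sorted as plain text
dates = ["2018-09-26", "2018-01-06", "2018-09-18"]
print(sorted(dates))
```

Unlike `dateutil.parser.parse`, `strptime` raises a `ValueError` if the text does not match the format exactly, which can be a useful early warning that a page layout has changed.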


In [37]:
# put it all together!
def page_scraper(url):
    """Function to scrape a page"""
    # Code to download and soupify the page
    page = requests.get(url)
    content = page.content
    soup = BeautifulSoup(content, "lxml")
    # All the code to extract pieces from the HTML
    title = soup.find("h1", {"class" : "node-title clearfix"})
    title = title.text.strip()
    body = soup.find("div", {"class" : "field body"})
    body = body.text.strip()
    raw_date = soup.find("span", {"class" : "date-display-single"}).text
    date = dateutil.parser.parse(raw_date)
    date = date.strftime("%Y-%m-%d")
    author = soup.find("span", {"class" : "field field-source"})
    author = author.text.strip()
    article = {
        "title" : title,
        "body" : body,
        "date" : date,
        "author" : author
    }
    return article

In [38]:
# test it
page_scraper(url)

{'author': 'Government of Afghanistan',
 'body': 'Directorate of Refugees and Repatriations(DoRR) of Laghman province with co-operation of DRC and CWSA have distributed for 203 displaced families which displaced from insecure places of Alingar, Kanda gul, districts and Khogiani district of Nangarhar province and returnees from Pakistan have placed in the central of this province.\nThursday January 4, 2018 of the current year for 143 displaced families distributed 18000Af by DRC and for 60 displaced families distributed 100$ dolor by CWSA.',
 'date': '2018-01-06',
 'title': 'More than 200 Displaced families Receive Cash Assistance in Laghman Province'}

In [39]:
# test it on a different page
page_scraper("https://reliefweb.int/report/iraq/reconstruction-needed-displaced-iraqis-continue-return-iom-iraq")

{'author': 'International Organization for Migration',
 'body': 'Kuwait – As the recent Kuwait International Conference for the Reconstruction of Iraq came to a close, IOM, the UN Migration Agency, reported that internally displaced persons (IDPs) in Iraq continue to return home by the thousands.\nAccording to IOM’s latest Displacement Tracking Matrix (DTM) report, as of 31 January 2018 more than 3.3 million Iraqis have returned to their areas of origin, while nearly 2.5 million people continue to live in displacement. New IOM figures show that return movements are ongoing – in January another 125,000 returnees were identified – mainly to the four governorates of Ninewa, Salah al-Din, Kirkuk and Anbar.\nSince the start of the crisis in early 2014, with ISIL later holding large parts of central Iraq and the subsequent conflict to retake these areas, nearly six million Iraqis have been displaced. Their communities have suffered widespread devastation and damage.\nIn January 2018, for the

## Link getter

Second, we need to get the URLs of all the pages we want to scrape. To do this, we find the directory (archive) pages, work out where the article links sit on each one, and figure out how to step through all the available directory pages.

Python things to learn here:

- getting values from dictionaries
- for loops and list comprehensions
- regex with the `re` library
- basic string operations

In [None]:
### Function to get all the article links from a single directory page

In [40]:
url = "https://reliefweb.int/country/afg?format=8&page=1"
page = requests.get(url)
content = page.content

In [41]:
soup = BeautifulSoup(content, "lxml")

In [58]:
links = soup.find("div", {"class" : "river-list"}).find_all("a")
print(links[0:5])

[<a class="toggle" href="#item-2809629">description</a>, <a href="/country/pak">Pakistan</a>, <a href="/report/pakistan/refugees-may-be-allowed-stay-till-june-2019">Refugees may be allowed to stay till June 2019</a>, <a href="/organization/frontier-post">Frontier Post</a>, <a class="toggle" href="#item-2809464">description</a>]


In [59]:
# pull out just the links
links = [i['href'] for i in links]
links[0:10]

['#item-2809629',
 '/country/pak',
 '/report/pakistan/refugees-may-be-allowed-stay-till-june-2019',
 '/organization/frontier-post',
 '#item-2809464',
 '/country/afg',
 '/report/afghanistan/ghor-leaders-strategize-building-community-cohesion-foundations-peace',
 '/organization/unama',
 '#item-2808884',
 '/country/afg']

In [60]:
# uh oh! `links` is full of all sorts of garbage. Is there a term
#  that we can search for to reliably pull out article links only?
links = [i for i in links if bool(re.match("/report", i))]
links[0:5]

['/report/pakistan/refugees-may-be-allowed-stay-till-june-2019',
 '/report/afghanistan/ghor-leaders-strategize-building-community-cohesion-foundations-peace',
 '/report/afghanistan/afghanistan-eqra-education-project',
 '/report/world/children-under-attack-six-grave-violations-against-children-times-war',
 '/report/spain/mediterranean-migrant-arrivals-reach-81207-2018-deaths-reach-1733']
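
The filter above works because `re.match` only matches at the start of a string. A short sketch of that behavior, using a few of the hrefs from the output above plus one made-up href (`tricky`) for contrast:

```python
import re

hrefs = ["#item-2809629",
         "/country/pak",
         "/report/pakistan/refugees-may-be-allowed-stay-till-june-2019",
         "/organization/frontier-post"]

# re.match only matches at the start of the string
report_links = [h for h in hrefs if re.match("/report", h)]

# re.search would also match "/report" in the middle of a string
tricky = "/country/afg/report"   # made-up href, for illustration only
print(bool(re.match("/report", tricky)))    # False
print(bool(re.search("/report", tricky)))   # True

# for a plain prefix, str.startswith does the same job without regex
same_links = [h for h in hrefs if h.startswith("/report")]
print(report_links == same_links)           # True
```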

In [61]:
# These aren't complete urls! We can use string operations plus a list comprehension to fix this:
links = ["https://reliefweb.int" + i for i in links]
links[0:5]

['https://reliefweb.int/report/pakistan/refugees-may-be-allowed-stay-till-june-2019',
 'https://reliefweb.int/report/afghanistan/ghor-leaders-strategize-building-community-cohesion-foundations-peace',
 'https://reliefweb.int/report/afghanistan/afghanistan-eqra-education-project',
 'https://reliefweb.int/report/world/children-under-attack-six-grave-violations-against-children-times-war',
 'https://reliefweb.int/report/spain/mediterranean-migrant-arrivals-reach-81207-2018-deaths-reach-1733']
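
String concatenation works here because every link is site-relative. As a hedged alternative, the standard library's `urllib.parse.urljoin` handles both relative and already-absolute links; the `absolute` URL below is made up for illustration.

```python
from urllib.parse import urljoin

base = "https://reliefweb.int"

relative = "/report/afghanistan/afghanistan-eqra-education-project"
full = urljoin(base, relative)
print(full)

# an href that is already absolute is left alone,
# whereas plain string concatenation would mangle it
absolute = "https://example.org/some/page"   # made-up URL
unchanged = urljoin(base, absolute)
print(unchanged)
```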

In [65]:
# Put it all together into a function that takes in a "page number" 
#  and returns all the links to scrape from it.
def page_to_link(page_num):
    # how to use .format()
    url = "https://reliefweb.int/country/afg?format=8&page={0}#content".format(page_num)
    # download the page
    page = requests.get(url)
    # get its content
    content = page.content
    # soupify
    soup = BeautifulSoup(content, "lxml")
    # pull out links
    links = soup.find("div", {"class" : "river-list river-updates river-sort-latest"})
    links = links.find_all("a")
    links = [i['href'] for i in links]
    # clean links
    links = [i for i in links if bool(re.match("/report", i))]
    links = ["https://reliefweb.int" + i for i in links]
    return links

## Putting it together

Now we have a function that'll take a page number for the archive page and return all the links.
We have another function that'll take in an article URL and give us the structured content from the page.

Let's put them together and download a (small!) range of stories.

Note: let's be nice to the UN and not have everyone download the whole archive at once.

In [67]:
# get all the links we want to scrape
all_links = []

for num in range(1, 5):
    lks = page_to_link(num)
    all_links.extend(lks) # extend! not append.
    
len(all_links)

all_links[12]

'https://reliefweb.int/report/afghanistan/afghanistan-hails-new-improved-hydrometerological-service'
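
The difference between `extend` and `append` matters because `page_to_link` returns a list. A minimal sketch, with placeholder strings standing in for URLs:

```python
page_1 = ["link-a", "link-b"]   # placeholder values standing in for URLs
page_2 = ["link-c"]

# extend adds each element of the new list to the existing list
flat = []
flat.extend(page_1)
flat.extend(page_2)
print(flat)     # ['link-a', 'link-b', 'link-c']

# append adds the new list itself as a single element
nested = []
nested.append(page_1)
nested.append(page_2)
print(nested)   # [['link-a', 'link-b'], ['link-c']]
```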

In [68]:
all_content = []

for link in all_links[10:20]: # be nice to reliefweb and only get some 
    try:
        content = page_scraper(link)
        all_content.append(content) # back to append!
    except Exception as e:
        # if something goes wrong, keep trucking,
        #  but print out the link so we can diagnose it.
        print(e)
        print(link)
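
One way to be nicer still is to pause between requests and keep a record of failures rather than only printing them. The sketch below is an assumption about how you might structure that, not part of the original notebook: `scrape_politely`, its `delay` parameter, and `fake_scraper` are hypothetical names, and the stand-in scraper lets the code run offline.

```python
import time

def scrape_politely(links, scrape_page, delay=1.0):
    """Scrape each link, pausing between requests and collecting failures."""
    results, failures = [], []
    for link in links:
        try:
            results.append(scrape_page(link))
        except Exception as e:
            # keep going, but remember what failed and why
            failures.append((link, str(e)))
        time.sleep(delay)   # pause so we don't hammer the server
    return results, failures


# Stand-in scraper (hypothetical) that fails on one made-up link.
def fake_scraper(link):
    if "broken" in link:
        raise ValueError("could not parse page")
    return {"url": link}

demo_links = ["https://example.com/a",
              "https://example.com/broken",
              "https://example.com/b"]
ok, failed = scrape_politely(demo_links, fake_scraper, delay=0)
print(len(ok), len(failed))   # 2 1
```

In real use you would pass `page_scraper` in place of `fake_scraper` and leave `delay` at a second or more.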

In [69]:
len(all_content)

10

In [70]:
all_content[4]

{'author': "European Commission's Directorate-General for European Civil Protection and Humanitarian Aid Operations, International Rescue Committee",
 'body': 'Assessment Finding and Recommendation in brief:\n\nThe security situation in Ghazni province started worsening during the month of July 2018 and finally On 10 August, AOGs attacked Ghazni City, the provincial capital of Ghazni Province 150 kilometers south-west of Kabul with an estimated 270,000 residents. Following attacks against key Government offices, critical infrastructure was damaged. Communications networks and the electricity supply got down in Ghazni, resulting in water shortages due to non-functional pumps. The reported presence of members of the Taliban in civilian homes and market places has heightened the risk of civilian casualties arising from any military aerial response, while the placement of improvised explosive devices (IEDs) along Highway 1 both north and south of the city has prevented civilians from safel

## Saving as CSV

If you're going to work with your text next in R, a CSV is probably the most useful form to save your text in. To save it as a CSV, we will convert it into a `DataFrame` using `pandas`, a package for working with data in Python. `pandas` will then let us easily write it out to a CSV.

In [71]:
import pandas as pd

content_df = pd.DataFrame(all_content)
#content_df
content_df.to_csv("all_content.csv")

Unnamed: 0,author,body,date,title
0,UN General Assembly,GA/12064\n26 SEPTEMBER 2018\nGENERAL ASSEMBLY ...,2018-09-26,"Climate Change, Economic Inequality, Systemic ..."
1,UN High Commissioner for Refugees,"Sixty-ninth session Geneva, 1-5 October 2018\n...",2018-09-26,69th Session of the Executive Committee of the...
2,World Meteorological Organization,A WMO-led project to establish a functioning h...,2018-09-26,"Afghanistan hails new, improved hydrometerolog..."
3,US Agency for International Development,Office of Press Relations\nTelephone: +1.202.7...,2018-09-25,Administrator Green Announces Finalists for th...
4,European Commission's Directorate-General for ...,Assessment Finding and Recommendation in brief...,2018-09-25,ERM Household Assessment Report: 1# assessment...
5,UN Assistance Mission in Afghanistan,KABUL - Preliminary findings from the UN Assis...,2018-09-25,Preliminary findings indicate airstrike killed...
6,US Agency for International Development,"Kabul, Afghanistan - The U.S. Government's Wom...",2018-09-25,U.S. Government Programs Recognize 300 Afghan ...
7,UN Assistance Mission in Afghanistan,KABUL - The UN Assistance Mission in Afghanist...,2018-09-25,Concern about rising number of civilian casual...
8,IRIN,Slim prospects for moving on or earning a livi...,2018-09-24,"Fleeing worsening war, Afghans find narrowing ..."
9,World Health Organization,"Kabul, 18 September 2018 – A new polio case ha...",2018-09-18,Ministry urges vaccination as 14th polio case ...
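
If you would rather not depend on `pandas`, the standard library's `csv` module can write a list of dictionaries directly. This is a sketch of that alternative, not the notebook's method; the two `records` are made up, and an in-memory `io.StringIO` buffer stands in for a real file so the example leaves nothing on disk.

```python
import csv
import io

# Two made-up records in the same shape as the scraper output
records = [
    {"title": "First, with a comma", "body": "Line one\nLine two",
     "date": "2018-09-26", "author": "Example Org"},
    {"title": "Second", "body": "Short body",
     "date": "2018-09-25", "author": "Another Org"},
]

# in-memory stand-in for open("all_content.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["author", "body", "date", "title"])
writer.writeheader()
writer.writerows(records)

# reading it back: quoting keeps commas and newlines inside fields intact
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["title"])
```

Article bodies are full of commas and line breaks, so it is worth confirming that whatever reads the file (R included) handles quoted fields properly.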


## Saving as JSON

We can also store it as a JSON file. JSON objects and Python dictionaries are almost equivalent, so JSON is a natural format for saving a list of dicts.

In [72]:
import json

In [73]:
FILENAME = "reliefweb.json"

with open(FILENAME, "w") as f:
    json.dump(all_content, f)

### Read it back in

If you want to load it back in later to analyze, you can do this:

In [74]:
FILENAME = "reliefweb.json"

with open(FILENAME, "r") as f:
    loaded_content = json.load(f)

In [75]:
# is it the same?
assert loaded_content[4] == all_content[4]
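
The round trip works because every value in our dictionaries is a plain string. A small sketch of why that matters, using a made-up `record`: the `json` module cannot serialize `datetime` objects by default, which is one reason to store dates as ISO-formatted strings.

```python
import json
from datetime import datetime

record = {"title": "Example", "date": "2018-01-06"}   # made-up record

# dict -> JSON text -> dict
text = json.dumps([record])
restored = json.loads(text)
print(restored == [record])   # True

# datetime objects are not JSON-serializable by default
try:
    json.dumps({"date": datetime(2018, 1, 6)})
    serializable = True
except TypeError:
    serializable = False
print(serializable)           # False
```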

In [76]:
list(range(0, 3))

[0, 1, 2]