# Web Scraping with Python

MIT Political Science Methods Series  
Spring 2018  
Andy Halterman

## Libraries and setup

In [None]:
# requests is for general HTTP loading
import requests
# BeautifulSoup is an HTML parser
from bs4 import BeautifulSoup
# JSON is a convenient format for writing out results.
# (ujson is a third-party drop-in replacement for the json module,
#  but the standard library version is all we need here.)
import json

In [None]:
# Sometimes you'll need the regular expressions library and a date library
import re
import dateutil.parser

## Page scraper

We'll write a function here that takes in the URL of an article or page, extracts the information we want from HTML, and structures the output.

Python things to learn:

- calling methods from objects
- what a dictionary is
- error handling (later)
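
As a quick warm-up for the first two items, here is a minimal sketch of a dictionary and a couple of method calls. The `article` contents are made up for illustration and are not taken from any real page.

```python
# A dictionary maps keys to values; here, one hypothetical scraped article.
article = {"title": "  Example headline  ", "author": None}

# Calling a method on an object: str.strip() returns a copy of the string
# with leading and trailing whitespace removed.
article["title"] = article["title"].strip()

# dict.get() returns a default value instead of raising KeyError
# when the key is missing.
theme = article.get("theme", "unknown")

print(article["title"])
print(theme)
```

Square brackets (`article["title"]`) raise a `KeyError` if the key is absent, while `.get()` quietly falls back to the default, which is often handy when some pages lack a field.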

HTML things to learn: 

- Chrome inspector
- what HTML tags look like
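
To make the idea of tags concrete, the sketch below parses a tiny, made-up HTML fragment with BeautifulSoup. The tag names and classes (`headline`, `body`) are assumptions for illustration only; the real page will use different ones, which is what the Chrome inspector is for.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment; real pages are larger but follow the same pattern
# of nested opening and closing tags with attributes.
html = """
<html><body>
  <h1 class="headline">Example headline</h1>
  <div class="body"><p>First paragraph.</p><p>Second paragraph.</p></div>
  <a href="/report/example">Read more</a>
</body></html>
"""

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; get_text() drops the markup.
title = soup.find("h1", class_="headline").get_text()

# find_all() returns every matching tag as a list.
paragraphs = [p.get_text() for p in soup.find("div", class_="body").find_all("p")]

# Tag attributes are read like dictionary keys.
href = soup.find("a")["href"]

print(title, paragraphs, href)
```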

In [None]:
# find the URL of an article to scrape
url = "https://reliefweb.int/report/afghanistan/more-200-displaced-families-receive-cash-assistance-laghman-province"

In [None]:
# download article page and get content
page = requests.get(url)
# .content holds the raw HTML of the response as bytes
content = page.content

In [None]:
# convert to BeautifulSoup (the "lxml" parser requires the lxml package)
soup = BeautifulSoup(content, "lxml")

In [None]:
# extract body text from page HTML (we'll do this together)
body = ""

In [None]:
# extract title from page HTML
title = ""

In [None]:
# extract author from page HTML
author = ""

In [None]:
# extract date from page HTML
raw_date = ""
# convert from raw text into a standardized form
date = dateutil.parser.parse(raw_date)
# Put it into a standard ISO format 
date = date.strftime("%Y-%m-%d")
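
To see what the date step does on its own, here is a small sketch using made-up date strings rather than text pulled from a page.

```python
import dateutil.parser

# A hypothetical raw date string, as it might appear in a page's byline.
raw_date = "27 Feb 2018"

# parse() guesses the format and returns a datetime object.
parsed = dateutil.parser.parse(raw_date)

# strftime() formats the datetime as an ISO-style string.
iso_date = parsed.strftime("%Y-%m-%d")
print(iso_date)  # 2018-02-27

# Purely numeric dates are ambiguous: by default the first number is read
# as the month; pass dayfirst=True for day/month/year sources.
month_first = dateutil.parser.parse("01/02/2018").strftime("%Y-%m-%d")
day_first = dateutil.parser.parse("01/02/2018", dayfirst=True).strftime("%Y-%m-%d")
print(month_first, day_first)  # 2018-01-02 2018-02-01
```

Note that `parse()` raises an error on an empty string, so the date cell will only run once `raw_date` holds real text.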

In [None]:
# put it all together!
def page_scraper(url):
    """Function to scrape a page"""
    # Put all your individual pieces from above into here.
    
    article = {
        "title" : None,
        "text" : None,
        "date" : None,
        "author" : None,
        "theme" : None}
    return article

In [None]:
# test it
url = ""
page_scraper(url)

## Link getter

Second, we need to get the URLs of all the pages we want to scrape. To do this, we find the site's directory (archive) pages, work out where the article links sit on each one, and figure out how to step through all the available directory pages.

Python things to learn here:

- getting values from dictionaries
- list comprehensions
- regex with the `re` library
- basic string operations
- for loops
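
The sketch below exercises most of these pieces on a made-up list of hrefs. The `/report/` pattern and the example paths are assumptions for illustration; check the real links before relying on any pattern.

```python
import re
from urllib.parse import urljoin

# Hypothetical hrefs of the kind a directory page might contain.
hrefs = [
    "/report/afghanistan/example-story",
    "/country/afg",
    "#content",
    "/report/afghanistan/another-story",
]

# re.match() only matches at the start of the string, so this list
# comprehension keeps hrefs that begin with "/report/" (an assumed pattern).
article_paths = [h for h in hrefs if re.match(r"/report/", h)]

# urljoin() takes care of the slash between the site root and the path.
full_urls = [urljoin("https://reliefweb.int/", p) for p in article_paths]

# .format() fills the {0} placeholder; a for loop builds one URL per page.
page_urls = []
for num in range(1, 3):
    page_urls.append("https://reliefweb.int/country/afg?format=8&page={0}#content".format(num))

print(full_urls)
print(page_urls)
```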

In [None]:
### Function to get all the article links from a single directory page

In [None]:
url = ""
# download the directory page and get its raw HTML
page = requests.get(url)
content = page.content

In [None]:
soup = BeautifulSoup(content, "lxml")

In [None]:
links = soup.find(?????).find_all("a")
# pull out just the links
links = [i['href'] for i in links]
links[0:5]

In [None]:
# uh oh! `links` is full of all sorts of garbage. Is there a term
#  that we can search for to reliably pull out article links only?
links = [i for i in links if bool(re.match("?????", i))]
links[0:5]

In [None]:
# These aren't complete URLs! We can use string operations plus a list comprehension to fix this.
# (This assumes each link already starts with "/", so the base has no trailing slash.)
links = ["https://reliefweb.int" + i for i in links]
links[0:5]

In [None]:
# Put it all together into a function that takes in a "page number" 
#  and returns all the links to scrape from it.
def page_to_link(page_num):
    # how to use .format()
    link = "https://reliefweb.int/country/afg?format=8&page={0}#content".format(page_num)
    # download the page
    # get its content
    # soupify
    # pull out links
    # clean links
    return None

## Putting it together

Now we have a function that'll take a page number for the archive page and return all the links.
We have another function that'll take in an article URL and give us the structured content from the page.

Let's put them together and download a (small!) range of stories.

Note: let's be nice to the UN's servers. Please don't everyone download the whole archive at once.
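
One common way to be polite is to pause between requests. The helper below is a minimal sketch of that idea, not part of the original workshop code: `fetch_politely` and its `delay_seconds` parameter are hypothetical names, and the stand-in `fetch` function is used so the example runs without touching the network.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Pause before every request except the first.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Stand-in fetch function for demonstration; in the notebook, the
# page-scraping function could be passed in instead.
demo = fetch_politely(["a", "b"], fetch=str.upper, delay_seconds=0.1)
print(demo)  # ['A', 'B']
```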

In [None]:
# get all the links we want to scrape
all_links = []

for num in range(1, 5):
    lks = page_to_link(num)
    all_links.extend(lks) # extend! not append.
    
len(all_links)
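
Why `extend` and not `append`? A small sketch with made-up link names shows the difference: `extend` adds each item of a list, while `append` adds the list itself as a single item.

```python
page_one = ["link-a", "link-b"]
page_two = ["link-c"]

# extend() adds the items one by one, giving a flat list of links.
extended = []
extended.extend(page_one)
extended.extend(page_two)

# append() adds each list as a single element, giving a list of lists.
appended = []
appended.append(page_one)
appended.append(page_two)

print(extended)  # ['link-a', 'link-b', 'link-c']
print(appended)  # [['link-a', 'link-b'], ['link-c']]
```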

In [None]:
all_content = []

for link in all_links[10:20]: # be nice to reliefweb and only get some 
    try:
        content = page_scraper(link)
        all_content.append(content) # back to append!
    except Exception as e:
        # if something goes wrong, keep trucking,
        #  but print out the link so we can diagnose it.
        print(e)
        print(link)

In [None]:
len(all_content)

In [None]:
all_content[4]

## Saving

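Now that we have some results as a list of dictionaries, we can store them as a JSON file. JSON objects and Python dictionaries are almost equivalent (as long as the keys are strings and the values are strings, numbers, booleans, `None`, lists, or other dictionaries), so JSON is a natural format for saving them.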
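
The "almost" matters in a few places. The sketch below, using a made-up record, shows what changes on a round trip through JSON and what cannot be written at all without converting it first.

```python
import datetime
import json

# A made-up record with a tuple value and a non-string key.
record = {"title": "Example", "tags": ("a", "b"), 1: "int key"}

# dumps() writes to a string and loads() reads it back, mirroring
# what dump() and load() do with files.
round_tripped = json.loads(json.dumps(record))

# Tuples come back as lists, and the integer key comes back as a string.
print(round_tripped)

# Date objects are not JSON serializable, which is one reason to convert
# dates to strings before saving.
try:
    json.dumps({"date": datetime.date(2018, 2, 27)})
    serializable = True
except TypeError:
    serializable = False
print(serializable)  # False
```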

In [None]:
import json

In [None]:
FILENAME = "reliefweb.json"

with open(FILENAME, "w") as f:
    json.dump(all_content, f)

### Read it back in

If you want to load it back in later to analyze, you can do this:

In [None]:
FILENAME = "reliefweb.json"

with open(FILENAME, "r") as f:
    loaded_content = json.load(f)

In [None]:
# is it the same?
assert loaded_content[4] == all_content[4]