# Python and Web Scraping

MIT Political Science Methods Series  
Fall 2019  
Andy Halterman

## Jupyter Notebooks

Jupyter notebooks are a way of writing code, running code, and displaying output in one convenient place. You can write code in code blocks, or write markdown/text in blocks like this. It's often useful to explain what you're doing and finding so when you or someone else picks up the notebook in the future, they'll know what's going on.

You can execute a code chunk by clicking the cell and hitting the "Run" button on the top bar, or by pressing shift-enter. You can always return to a previous code chunk, modify it, and re-run it.

You can write math in notebooks, just like in Rmarkdown: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$

Jupyter is great for prototyping, but for more heavy duty use, replication, running on a server, etc., I recommend re-writing the code into a `.py` file that can be called from the command line with `python mycode.py`. This process also gives you a chance to refactor your code, making it more efficient, readable, and dependable.

## Python vs. R

Python and R are fairly similar. This is a quick overview of the differences to help you get up to speed.

### Importing packages

- Importing packages is similar: instead of `[R] library(mypackage)`, do `[Python] import mypackage`.
- Python also lets you import specific functions from a package: `from mypackage import cool_function`
- You can also rename packages if they're too long: `import numpy as np`
- Installing packages is a bit different. In R, you run `install.packages("ggplot2")` from *inside* R. To install packages in Python, you need to open the command line and run the installation command from there: `pip install requests`.
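
As a small illustration of the three import styles above, here is a sketch that uses only modules from Python's standard library. The modules (`math`, `statistics`, `collections`) and the alias `cl` are chosen purely for the example and are not used elsewhere in this notebook.

```python
# 1. Import a whole module; its functions stay attached to the module name
import math

# 2. Import one specific function from a module
from statistics import mean

# 3. Import a module under a shorter alias
import collections as cl

print(math.sqrt(16))        # sqrt is called through the math namespace
print(mean([1, 2, 3, 4]))   # mean was imported directly, so no prefix is needed

word_counts = cl.Counter(["war", "peace", "war"])
print(word_counts["war"])   # how many times "war" appeared
```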

In [1]:
## Practice! We need to import:
# 1. a library called "requests"

# 2. a module called "BeautifulSoup" from a library called "bs4"


### What are all those dots for? (Or, methods, attributes, and namespaces)

- Dots have special meaning in Python. It's not like R, where people put dots in all sorts of names.
- Python is much more careful about keeping functions attached to the packages they come from. If the `requests` library has a function called `get`, you call it like this: `requests.get()`. This reminds you where the `get` function came from and prevents you from overwriting some other package's `get` function.
- Python is also more "object oriented" than R. Objects often have built-in or attached functions, called methods. 
- These methods are called with a dot notation. Compare: 
```
[R] strsplit("Andy Halterman", " ")
```
and

```
[Python] "Andy Halterman".split(" ")
```

- Objects can also have attributes, which are pieces of data attached to an object. Example: `andy.subfields = ['methods', 'comparative']`
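
To make the difference between attributes and methods concrete, here is a minimal sketch. The `Researcher` class and its `first_name` method are hypothetical and exist only for this illustration.

```python
class Researcher:
    """A hypothetical class, used only to illustrate attributes and methods."""

    def __init__(self, name):
        self.name = name       # attribute set when the object is created
        self.subfields = []    # attribute that starts out empty

    def first_name(self):
        # a method: it uses the string method .split() on the name attribute
        return self.name.split(" ")[0]


andy = Researcher("Andy Halterman")
andy.subfields = ["methods", "comparative"]  # set an attribute with dot notation

print(andy.subfields)     # attributes are accessed without parentheses
print(andy.first_name())  # methods are called with parentheses
```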

In [2]:
## Practice! Can you figure out how to make a string all upper case?


### Data Structures

- Python uses lists much as R uses vectors. Lists are ordered arrays. Note that Python indexing starts at 0, not 1!

```
my_list = ["x", "y", "z"]
> my_list[0] 
"x"
```

- Python has a data structure called a *dictionary*, which is like a list that you access by key name instead of by position (think of a more general form of R's dataframes). Example:

```
article = {"title": "Rivalry and Revenge",
           "author" : "Balcells",
           "year" : "2017"}

> article.keys()
dict_keys(['title', 'author', 'year'])

> article['author']
"Balcells"
```
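
A few more common dictionary operations, as a short sketch. The `topic` key and the `publisher` lookup are made up for the example.

```python
article = {"title": "Rivalry and Revenge",
           "author": "Balcells",
           "year": "2017"}

# add a new key (this value is invented for illustration)
article["topic"] = "civil war"

# .get() returns a default instead of raising a KeyError when a key is missing
publisher = article.get("publisher", "unknown")

# loop over key/value pairs
for key, value in article.items():
    print(key, "->", value)

# wrap .keys() in list() if you need an ordinary list
print(list(article.keys()))
```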

### Loops and functions


- Functions are only slightly different from R:

```
def my_function(x):
    # ** is exponentiation in Python (^ is bitwise XOR, not a power)
    z = (x + 2) ** 2
    return z
```

- Explicit loops are idiomatic in Python (unlike in R, where vectorized functions are usually preferred) and are easy to write:

```
for i in my_list:
    my_function(i)
```

- Pro move: list comprehensions:

```
[my_function(i) for i in my_list]
```
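
Here is a runnable sketch that puts a function, a `for` loop, and a list comprehension side by side. The function name `add_two_and_square` and the list `numbers` are made up for the example.

```python
def add_two_and_square(x):
    # ** is Python's exponent operator
    return (x + 2) ** 2


numbers = [1, 2, 3]

# a for loop that collects results into a list
squared_loop = []
for n in numbers:
    squared_loop.append(add_two_and_square(n))

# the equivalent list comprehension
squared_comp = [add_two_and_square(n) for n in numbers]

# a comprehension can also filter: keep only the odd inputs
odd_only = [add_two_and_square(n) for n in numbers if n % 2 == 1]

print(squared_loop, squared_comp, odd_only)
```

The comprehension builds the list in one expression, while the loop is easier to extend when each step needs several lines of work.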

In [None]:
## Practice! Can you write a loop that goes over a list of words and prints a capitalized version of each one?


### Whitespace

- As you can tell, Python makes heavy use of whitespace to set apart different levels of functions, for loops, etc. Use four spaces (Jupyter converts tabs to four spaces automatically).
- No need for curly braces!

```
def my_function(big_list):
    print(len(big_list))
    for l in big_list:
        for i in l:
            ...
    return stuff
```
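
A complete, runnable version of the same nesting pattern might look like the sketch below. The function `count_items` is hypothetical; the comments show which block each indentation level belongs to.

```python
def count_items(big_list):
    """Count the items across a list of lists."""
    total = 0
    for inner in big_list:   # one level in: inside the function
        for item in inner:   # two levels in: inside the outer loop
            total += 1       # three levels in: inside the inner loop
    return total             # back to one level: runs after both loops finish


print(count_items([["a", "b"], ["c"], []]))
```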

## Scraping Overview

Most news sites and similar web pages can be scraped through a three-step process:

1. given a link to an article, extracting and formatting all the needed info from the page;
2. given an archive-type page of links, finding all the links on the page;
3. iterating through each page of archives, scraping all the pages from it, and saving to disk.

We'll write one function for each.

## Libraries and setup

In [None]:
# requests is for general HTTP loading
import requests
# BeautifulSoup is an HTML parser
from bs4 import BeautifulSoup
# JSON is a nice format for writing out results.
# We use the built-in json module here; ujson is a third-party package
# that is intended as a drop-in replacement for it.
import json

In [None]:
# Sometimes you'll need the regular expressions library and a date library
import re
import dateutil.parser

## Page scraper

We'll write a function here that takes in the URL of an article or page, extracts the information we want from HTML, and structures the output.

Python things to learn:

- calling methods from objects
- what a dictionary looks like
- how to define a function

HTML things to learn: 

- Chrome inspector
- what HTML tags look like
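
Before working on a live page, it can help to see BeautifulSoup on a tiny piece of HTML. The sketch below parses a made-up snippet (the tag names, class names, and links are invented and will not match any real site) and uses Python's built-in `"html.parser"`.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a downloaded page
html = """
<html><body>
  <h1 class="headline">Example headline</h1>
  <div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
  <a href="/news/example-one">One</a>
  <a href="/about">About</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# .find() returns the first matching tag; .text gives its text content
title = soup.find("h1", {"class": "headline"}).text

# .find_all() returns a list of every matching tag
content_div = soup.find("div", {"class": "content"})
paragraphs = [p.text for p in content_div.find_all("p")]

# tag attributes such as href are accessed like dictionary keys
hrefs = [a["href"] for a in soup.find_all("a")]

print(title)
print(paragraphs)
print(hrefs)
```

On a real page, the Chrome inspector is how you find out which tags and classes to pass to `.find()`.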

In [None]:
# find the URL of an article to scrape
url = "https://reliefweb.int/report/afghanistan/more-200-displaced-families-receive-cash-assistance-laghman-province"

In [None]:
# download article page and get content
page = 
content = 
content

In [None]:
# convert to BeautifulSoup
soup = 

In [None]:
# extract body text from page HTML (we'll do this together)
body = 

In [None]:
# extract title from page HTML
title = 

In [None]:
# extract author from page HTML
author = 

In [None]:
# extract date from page HTML
raw_date = 
# convert from raw text into a standardized form
date = dateutil.parser.parse(raw_date)
# Put it into a standard ISO format 
date = date.strftime("%Y-%m-%d")
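
As a quick check of the date step on its own, the sketch below runs `dateutil.parser.parse` over a few made-up date strings. The strings are illustrative and are not taken from any particular page.

```python
import dateutil.parser

# made-up date strings in formats a web page might plausibly use
raw_dates = ["20 Feb 2019", "February 20, 2019", "2019-02-20T08:30:00"]

# parse each string into a datetime, then format it as an ISO date
iso_dates = [dateutil.parser.parse(d).strftime("%Y-%m-%d") for d in raw_dates]
print(iso_dates)
```

Storing dates as ISO-formatted strings keeps them sortable and easy to write out later.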

In [None]:
# put it all together!
def page_scraper(url):
    """Function to scrape a page"""
    # Code to download and soupify the page
    
    # All the code to extract pieces from the HTML
    title = ""
    ...
    article = {
        "title" : title,
        "body" : body,
        "date" : date,
        "author" : author
    }
    return article

In [None]:
# test it
page_scraper(url)

In [None]:
# test it on a different page
page_scraper("https://reliefweb.int/report/iraq/reconstruction-needed-displaced-iraqis-continue-return-iom-iraq")

## Link getter

Second, we need to get the URLs of all the pages we want to scrape. To do this, we need to find the directory pages, work out where the links sit on each one, and figure out how to reach every available directory page.

Python things to learn here:

- getting values from dictionaries
- for loops and list comprehensions
- regex with the `re` library
- basic string operations
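
The string and regex operations listed above can be tried out on a small made-up example first. In the sketch below, the `hrefs` values, the `/news/` pattern, and the `example.org` address are all invented; `urljoin` from the standard library is shown as one alternative to plain string concatenation, since it handles leading slashes for you.

```python
import re
from urllib.parse import urljoin

# made-up href values like those you might find on a directory page
hrefs = ["/news/country-a/some-story", "/about", "#top", "/news/country-b/another-story"]

# re.match only matches at the start of the string
article_paths = [h for h in hrefs if re.match(r"/news/", h)]

# turn relative paths into complete URLs
base = "https://example.org/"
full_urls = [urljoin(base, p) for p in article_paths]

# .format() fills in the {0} placeholder with its first argument
archive_url = "https://example.org/archive?page={0}".format(3)

print(full_urls)
print(archive_url)
```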

In [None]:
### Function to get all the article links from a single directory page

In [None]:
url = "https://reliefweb.int/country/afg?format=8&page=1"
page = 
content = 

In [None]:
soup = 

In [None]:
links = soup.find("?????").find_all("a")
print(links[0:5])

In [None]:
# pull out just the links
links = [i['href'] for i in links]
links[0:10]

In [None]:
# uh oh! `links` is full of all sorts of garbage. Is there a term
#  that we can search for to reliably pull out article links only?
links = [i for i in links if bool(re.match("?????", i))]
links[0:5]

In [None]:
# These aren't complete urls! We can use string operations plus a list comprehension to fix this:
links = ["https://reliefweb.int/" + i for i in links]
links[0:5]

In [None]:
# Put it all together into a function that takes in a "page number" 
#  and returns all the links to scrape from it.
def page_to_link(page_num):
    # how to use .format()
    link = "https://reliefweb.int/country/afg?format=8&page={0}#content".format(page_num)
    # download the page
    # get its content
    # soupify
    # pull out links
    # clean links
    return links

## Putting it together

Now we have a function that'll take a page number for the archive page and return all the links.
We have another function that'll take in an article URL and give us the structured content from the page.

Let's put them together and download a (small!) range of stories.

Note: let's be nice to the UN and not download the whole thing at once.
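
One common way to be polite is to pause between requests. The sketch below is a minimal illustration: `polite_map` and `fake_scraper` are hypothetical helpers (not part of this notebook's functions), and the one-second default delay is an arbitrary choice rather than a documented requirement of any site.

```python
import time


def polite_map(func, items, delay=1.0):
    """Apply func to each item, sleeping `delay` seconds between calls."""
    results = []
    for position, item in enumerate(items):
        if position > 0:
            time.sleep(delay)  # wait before every call except the first
        results.append(func(item))
    return results


# stand-in for a real scraping function, so this runs without network access
def fake_scraper(link):
    return {"url": link, "length": len(link)}


demo = polite_map(fake_scraper, ["a.html", "bb.html"], delay=0.1)
print(demo)
```

In real use you would pass your own scraping function and list of links in place of the stand-ins.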

In [None]:
# get all the links we want to scrape
all_links = []

for num in range(1, 5):
    lks = page_to_link(num)
    all_links.extend(lks) # extend! not append.
    
len(all_links)

In [None]:
all_content = []

for link in all_links[10:20]: # be nice to reliefweb and only get some 
    try:
        content = page_scraper(link)
        all_content.append(content) # back to append!
    except Exception as e:
        # if something goes wrong, keep trucking,
        #  but print out the link so we can diagnose it.
        print(e)
        print(link)

In [None]:
len(all_content)

In [None]:
all_content[4]

## Saving as CSV

If you're going to work with your text next in R, a CSV is probably the most useful format to save it in. To save it as a CSV, we will convert it into a `DataFrame` using `pandas`, a package for working with data in Python. `pandas` will then let us easily write it out to a CSV.

In [None]:
import pandas as pd

content_df = pd.DataFrame(all_content)
content_df.to_csv("all_content.csv")
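
To see what the conversion does, here is a small sketch with made-up records, written to an in-memory buffer instead of a file. Passing `index=False` is an optional choice shown here for illustration: it leaves out the row-number column that `to_csv` writes by default.

```python
import io

import pandas as pd

# made-up records shaped like scraper output
records = [
    {"title": "First story", "date": "2019-02-20"},
    {"title": "Second story", "date": "2019-02-21"},
]

df = pd.DataFrame(records)  # dictionary keys become column names

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False omits the row-number column
csv_text = buffer.getvalue()
print(csv_text)
```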

## Saving as JSON

We can also store it as a JSON file. JSON objects and Python dictionaries are almost equivalent, so JSON is a natural format for saving a list of dicts.
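
The sketch below shows the round trip in memory with a made-up article. It also shows one place where JSON and dictionaries differ: the built-in `json` module cannot serialize `datetime` objects by default, which is one reason to store dates as formatted strings.

```python
import json
from datetime import datetime

# a made-up article dictionary
article = {"title": "Example", "date": "2019-02-20"}

text = json.dumps([article])  # list of dicts -> JSON string
restored = json.loads(text)   # JSON string -> list of dicts
print(restored == [article])

# datetime objects are not JSON serializable by default
try:
    json.dumps({"date": datetime(2019, 2, 20)})
    datetime_ok = True
except TypeError:
    datetime_ok = False
print(datetime_ok)
```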

In [None]:
import json

In [None]:
FILENAME = "reliefweb.json"

with open(FILENAME, "w") as f:
    json.dump(all_content, f)

### Read it back in

If you want to load it back in later to analyze, you can do this:

In [None]:
FILENAME = "reliefweb.json"

with open(FILENAME, "r") as f:
    loaded_content = json.load(f)

In [None]:
# is it the same?
assert loaded_content[4] == all_content[4]