## Python vs R for Scraping

I found the dataset that we used previously in this [GitHub repository](https://github.com/BrianWeinstein/state-of-the-union/blob/master/get_transcripts.R). For your enjoyment, take a look at the R code that produced it (which, by the way, gives you incorrect results).

Now let's compare this to beautiful, beautiful soup.

In [None]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import datetime

In [None]:
url = "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union"
page = urlopen(url)
soup = BeautifulSoup(page, "lxml")
urls = [i.get("href") for i in soup.select("table a")]
urls = [u for u in urls if u is not None and u.startswith("http")]
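
The last line of that cell keeps only the absolute links and silently drops everything else. Here is a minimal sketch of that filtering step on a hand-written list of `href` values, so it runs without touching the network. The sample links, the `BASE` constant, and the helper names `keep_absolute` and `resolve_all` are all made up for illustration. `urljoin` is shown as one possible alternative if you would rather resolve relative links than discard them; the scraper above does not do this.

```python
from urllib.parse import urljoin

# Assumed base URL for resolving relative links (illustrative only)
BASE = "https://www.presidency.ucsb.edu/"

def keep_absolute(hrefs):
    """Same filter as the cell above: drop missing and non-http links."""
    return [h for h in hrefs if h is not None and h.startswith("http")]

def resolve_all(hrefs, base=BASE):
    """Alternative: turn relative links into absolute ones instead of dropping them."""
    return [urljoin(base, h) for h in hrefs if h]

# Hypothetical href values, as soup.select("table a") might return them
sample = [
    None,
    "/documents/example-address",
    "https://www.presidency.ucsb.edu/documents/another-example",
]

print(keep_absolute(sample))
print(resolve_all(sample))
```

With the filter, only the third link survives; with `urljoin`, the relative link is kept as well.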

In [None]:
l = []
for url in urls:
    print("Parsing "+url)
    page = urlopen(url)
    soup = BeautifulSoup(page, "lxml")
    # the actual extraction steps
    transcript = " ".join([e.get_text(strip=True) for e in soup.select("div.field-docs-content p")])
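    # get_text(strip=True) also works when the tag has nested children, where .string would return None
    title = soup.select_one("div.field-ds-doc-title h1").get_text(strip=True)
    name = soup.select_one("div.field-title a").get_text(strip=True)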
    # to ensure that the date format is exactly the same as what we've been using so far, I need to do some extra work
    date = soup.select_one("div.field-docs-start-date-time span").get_text(strip=True)
    # this reads in the date, e.g. "February 27, 2017" (format codes: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)
    date = datetime.datetime.strptime(date, "%B %d, %Y")
    # this outputs it in the format that we're used to: 2017-02-27
    date = date.strftime("%Y-%m-%d")
    l.append({
        "url": url,
        "title": title,
        "date": date,
        "president": name,
        "transcript": transcript
    })
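
The date handling in the loop is the only fiddly part, so here is that step on its own as a small sketch. The helper name `to_iso_date` and the sample strings are mine, not something from the archive page. Note that `%B` matches month names in the current locale, so this assumes English month names.

```python
from datetime import datetime

def to_iso_date(text):
    """Convert a date like 'February 27, 2017' to '2017-02-27'."""
    parsed = datetime.strptime(text.strip(), "%B %d, %Y")
    return parsed.strftime("%Y-%m-%d")

print(to_iso_date("February 27, 2017"))   # 2017-02-27
print(to_iso_date("  January 8, 1790 "))  # 1790-01-08
```

Because the result is zero-padded year-month-day, sorting the strings alphabetically also sorts them chronologically.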

In [None]:
import json
with open('transcripts-fixed.json', 'w') as f:
    json.dump(l, f, indent=2)
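
To check that a file like this reads back cleanly, a round trip is a quick sanity test. The sketch below uses a single made-up record with the same keys as the scraper's output and writes to a temporary file rather than `transcripts-fixed.json`. Passing `encoding="utf-8"` and `ensure_ascii=False` is my own choice here, to keep accented characters readable in the file; the cell above uses the defaults.

```python
import json
import os
import tempfile

# One hypothetical record with the same keys the scraper produces
records = [{
    "url": "https://example.com/a",
    "title": "Example Address",
    "date": "2017-02-27",
    "president": "Example President",
    "transcript": "Text with an accent: caf\u00e9.",
}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "transcripts-sample.json")
    # Write the records as UTF-8, leaving non-ASCII characters unescaped
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
    # Read them back and compare with what we wrote
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == records)
```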