Getting Data from the Web: Scraping for Economists

Material for the web scraping workshop at the 2018 Annual Congress of the European Economic Association.

[Image comparison: "What most tutorials are like" vs. "What this tutorial will be like"]

Key takeaways

  1. The standard workflow:

    • Read the page with read_html.
    • Use the inspector tool to find the relevant nodes.
    • Write a CSS selector query (using e.g. a CSS Selector Cheat Sheet to remind you of the syntax).
    • Fetch the nodes using html_nodes(page, css_query).
    • Extract the data using html_text(nodes, trim = TRUE).
    • If you want to extract an attribute of a node instead of just the text, use html_attr(node, attribute), e.g. html_attr(node, "href") to extract the href from an anchor (<a>) node.
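
Putting these steps together, a minimal sketch of the workflow might look like this (the URL and selector are placeholders, not a real target):

library(rvest)

page   <- read_html("https://example.com/listings")  # hypothetical URL
nodes  <- html_nodes(page, ".listing h2 a")          # CSS selector found with the inspector
titles <- html_text(nodes, trim = TRUE)              # visible text of each node
links  <- html_attr(nodes, "href")                   # href attribute of each anchor
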
  2. If you're lucky enough to have the data in a <table> node, try html_table(page).

    • Remember that this returns a list of data frames, even if there's only one table on the page.
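
For example (again, the URL is a placeholder):

library(rvest)

page   <- read_html("https://example.com/stats")  # hypothetical URL
tables <- html_table(page, fill = TRUE)           # a list of data frames, one per <table>
df     <- tables[[1]]                             # pick out the table you actually want
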
  3. If your data is spread across multiple pages, write a function that extracts the data for one page, then map that function over all the pages. For example,

library(purrr)  # map_dfr() lives here

get_data_for_specific_year <- function(year) { ... }  # scrape the data for one year

years <- 2010:2016
all_data <- map_dfr(years, get_data_for_specific_year)  # row-bind the results for all years
  4. Before you start scraping, kick the tires of the website a bit.

    • Remember how we changed the per_page query parameter to 600 instead of the 25, 50, or 100 options offered on the webpage.
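
One way to do this is to set the query parameters yourself; here is a hedged sketch with httr (the endpoint and parameter names are placeholders, not the workshop site):

library(httr)

# Ask for 600 results per page, even though the page UI only offers 25/50/100
resp <- GET("https://example.com/api/results",   # hypothetical endpoint
            query = list(per_page = 600, page = 1))
data <- content(resp)                            # parse the response body
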
  5. Make sure to look for a "Terms of Use" page on the website before you start scraping.

    • Cavalierly scraping a website that explicitly forbids it can be a very bad idea.
  6. Use the robotstxt package as a programmatic way to see what parts of a website are off-limits.
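
A minimal sketch, assuming the robotstxt package is installed (the domain and path are placeholders):

library(robotstxt)

# Check whether a given path may be scraped according to the site's robots.txt
paths_allowed(paths = "/search", domain = "example.com")
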

  7. When working with modern, more sophisticated websites, you may often have to use the lower-level httr package rather than rvest.

    • If you see data in the browser but it doesn't show up in R, the website is probably using JavaScript to generate that data.
    • If this is the case, look around in the "Network" tab of the inspector tool to see if you can find the data there.
    • Then use the GET function to send your own GET request, and the content function to extract the content of the response.
    • The GET function (and the other httr functions corresponding to HTTP verbs like POST, PUT, DELETE, etc.) is very general, so you can modify your request using e.g. user_agent, add_headers, etc.
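
A hedged sketch of that pattern (the URL and header values are placeholders):

library(httr)

resp <- GET("https://example.com/api/data.json",                # URL found in the "Network" tab
            user_agent("my-research-scraper (me@example.com)"), # identify yourself politely
            add_headers(Accept = "application/json"))
stop_for_status(resp)                 # fail loudly on a non-200 response
data <- content(resp, as = "parsed")  # parse JSON into R lists
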
  8. CSS Diner is the single best resource I've found for learning the syntax of CSS selectors. Highly recommended!

  9. I didn't spend much time explaining how lapply and map really work, and how these functions let you avoid writing explicit loops. In my view, these functions (and their siblings in the purrr package) are some of the most powerful in the entire R language, so I strongly advise you to spend some time getting comfortable with them. The Functionals chapter in Advanced R is the best resource I know for learning them.
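
To give a flavour of the idea, here is a toy example (not workshop code) computing the same squares three ways:

squares_loop <- numeric(3)
for (i in 1:3) squares_loop[i] <- i^2            # explicit loop

squares_lapply <- lapply(1:3, function(i) i^2)   # base R functional, returns a list

library(purrr)
squares_map <- map_dbl(1:3, ~ .x^2)              # purrr equivalent, returns a numeric vector
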

  10. We didn't have time to talk about web forms (such as how to scrape a website that first requires you to log in). The short answer is to use the html_form function in rvest. You can take a look at this page for an exercise, and see my solution at the bottom of the scripts/scrapethis.R file in the GitHub repo.
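
A rough sketch of the login pattern, using the rvest session functions as they existed at the time (the URL and field names are hypothetical and depend entirely on the site):

library(rvest)

session <- html_session("https://example.com/login")               # hypothetical login page
form    <- html_form(session)[[1]]                                  # first <form> on the page
filled  <- set_values(form, username = "me", password = "secret")   # field names vary by site
session <- submit_form(session, filled)                             # session now carries the login cookies

page <- jump_to(session, "https://example.com/members-only")        # subsequent requests are authenticated
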
