NICAR 2016 Training Material
Latest commit 72603f3 Mar 9, 2016 Robert Gebeloff ppt updates

Web Scraping Without Programming

In this presentation, Tom Johnson of the Institute for Analytic Journalism and I will demonstrate various ways of harvesting data from the Internet without programming. While we heartily recommend that reporters explore the power of programming languages such as Python, Ruby and R, we believe that no-programming software tools are also a valuable means of getting information that would otherwise be unobtainable.

You can download:
  • The primary handout
  • Our powerpoint
  • A detailed tutorial on using to scrape Web sites
  • An example of how to find and parse hidden XML or JSON data
  • A walkthrough of various methods for dealing with PDFs

An Introduction to Open Refine

Open Refine is a vital tool for cleaning dirty data. A typical example is a dataset that contains names of people or companies with inconsistent spellings that need to be standardized. At NICAR, Nils Mulvad and I will walk through a tutorial he created. The exercise is here, the practice data here and here.

Note: After clicking on a link, click "View Raw" if the file doesn't download immediately.
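
For readers curious about what standardizing inconsistent name spellings involves under the hood, here is a minimal Python sketch of a key-based clustering approach. It is only an illustration, not Open Refine's own code and not part of the tutorial: the function names `fingerprint` and `cluster` and the sample names are all made up for this example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(name: str) -> str:
    """Reduce a name to a normalized key so spelling variants collide.

    Hypothetical helper for illustration only.
    """
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, then turn punctuation into spaces.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", ascii_name.lower())
    # Sort the unique tokens so word order no longer matters.
    return " ".join(sorted(set(cleaned.split())))


def cluster(names):
    """Group names that share a fingerprint; keep only groups with variants."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return [variants for variants in groups.values() if len(set(variants)) > 1]


# Made-up sample data, not taken from the practice files.
sample = ["Acme Corp.", "ACME CORP", "Corp Acme", "Widget Co", "José Díaz", "Jose Diaz"]
clusters = cluster(sample)

for variants in clusters:
    print(variants)
```

Each printed group holds spellings that probably refer to the same person or company, so a reviewer can pick one standard form for the whole group. A key like this will miss true typos (for example a dropped letter), which is one reason an interactive tool with several clustering methods is helpful.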