How to find reporting leads and publishable facts in text data you already have

Slides | Examples

  • Jeff Ernsthausen
  • Jeremy Merrill
  • Youyou Zhou
Description

Let's discuss some published projects that have extracted useful, newsy information from big piles of text data — so you can use similar techniques. We'll walk you through real-world examples of every step of the process: gathering text data, dividing it into chunks the computer can understand, analyzing it with fancy or simple techniques, and the challenges you'll face in bulletproofing and presenting what you find. This session isn't quite hands-on, but the panelists will discuss the tools, practical techniques and tricks they used to transform giant piles of text into publishable insights and reporting leads. These techniques are often called "natural language processing," but we're going to keep it practical: no obscure mathematical formulas, guaranteed!

Notes

What kinds of insights can you get? What sorts of text do you have (or can you get)?

Example insights

  • Patterns across documents — which documents are most similar to documents you already know are interesting.
  • Outliers — what's distinctive about some documents compared to others.
  • Extracted meaning — topic and sentiment.

Example: Extracting the killer's motivation and mentions of the killer's race from 141 hours of TV coverage.

Example: A database of more than 100,000 disciplinary documents on doctor misconduct, with a model to tag them and statistically estimate the likelihood of misconduct.

Example: Press releases from government representatives — which legislators sound similar to one another, which topics are distinctive, and what the policy priorities are.

Pipeline (How)

  1. Getting the data
  2. Dividing it up
  3. Analyzing
  4. Bulletproofing
  5. Presentation

1. Getting data is harder than it sounds

Some sources:

  • Readily available (speeches, academics, libraries)
  • APIs
  • Scraping (a minimal sketch follows after this list)
  • Speech to text
  • FOIA
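
As a sketch of the scraping route, here is a minimal Python example using requests and BeautifulSoup; the URL and CSS selector are placeholders, not a real source:

```python
# Minimal scraping sketch: fetch a (hypothetical) press-release index
# page and pull the text of each linked release. The URL and the CSS
# selector are placeholders -- adapt them to your actual source.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://example.gov/press-releases"  # placeholder

resp = requests.get(INDEX_URL, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

for link in soup.select("a.press-release"):       # placeholder selector
    page = requests.get(urljoin(INDEX_URL, link["href"]), timeout=30)
    text = BeautifulSoup(page.text, "html.parser").get_text(" ", strip=True)
    print(text[:200])  # in practice, save each document to disk
```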

2. Dividing it up

  • Cleaning
    • OCR
    • Filtering stuff out like documents in other languages
    • Lowercasing, punctuation, stripping HTML, bylines, etc.
  • Tokenization (words -> columns)
    • "this is not comprehensible to the computer" -> ["this", "is", "not", "comprehensible", "to", "the", "computer"]
  • Stemming/lemmatization & part-of-speech tagging
  • Remove stopwords and meaningless words (a cleaning and tokenization sketch follows after this list)
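
Here is a minimal sketch of the cleaning and tokenizing steps in plain Python; the stopword set is a tiny stand-in for a real list (such as NLTK's), and stemming is left out:

```python
# Cleaning + tokenization sketch: lowercase, strip punctuation, split
# into tokens, drop stopwords. The stopword set is a tiny stand-in for
# a real list such as nltk.corpus.stopwords.words("english").
import re

STOPWORDS = {"this", "is", "not", "to", "the", "a", "an", "of", "and"}

def tokenize(text):
    tokens = re.findall(r"[a-z']+", text.lower())  # keep word characters only
    return [t for t in tokens if t not in STOPWORDS]

print(tokenize("This is not comprehensible to the computer"))
# -> ['comprehensible', 'computer']
```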

DocumentCloud (a hosted tool for uploading, OCRing, searching and annotating document sets) can handle much of this step.

3. Analysis

  • Counting words
  • TF-IDF (term frequency-inverse document frequency)
  • Keyword clustering
  • Sentiment
  • Vectorization
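
As a minimal TF-IDF sketch using scikit-learn (assuming a recent version; the three toy documents are invented for illustration), surfacing each document's most distinctive terms:

```python
# TF-IDF sketch with scikit-learn: rank terms by how distinctive they
# are to each document. The three toy documents are invented.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the senator praised the farm bill and rural broadband",
    "the senator condemned the estate tax at a town hall",
    "the committee held a hearing on hospital safety records",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)           # documents x terms
terms = vectorizer.get_feature_names_out()

for i, row in enumerate(matrix.toarray()):
    top = sorted(zip(row, terms), reverse=True)[:3]
    print(f"doc {i}:", [t for score, t in top if score > 0])
```

TfidfVectorizer handles lowercasing, tokenization and stopword removal internally, so it pairs naturally with the cleaning step described above.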

Why cleaning and stopword removal matter: before cleaning, one senator's most common words were his own last name and "previous_article" (leftover boilerplate from the scrape).

Vectorization: give the model lots of documents and it will figure out which words appear in similar contexts.

You can then ask analogy questions like "What is the Republican version of what the Democrats call an estate tax?"
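
A rough sketch of that analogy trick with gensim's word2vec (assuming the gensim 4.x API; the toy corpus here stands in for the thousands of real documents you would need for meaningful results):

```python
# Word-vector analogy sketch with gensim (4.x API). The toy corpus
# stands in for a real one; analogies only work with lots of text.
from gensim.models import Word2Vec

corpus = [
    ["democrats", "call", "it", "the", "estate", "tax"],
    ["republicans", "call", "it", "the", "death", "tax"],
] * 100  # repeated so the toy model has something to train on

model = Word2Vec(sentences=corpus, vector_size=50, window=5,
                 min_count=1, workers=1, seed=1)

# Analogy: "estate" is to Democrats as ??? is to Republicans
print(model.wv.most_similar(positive=["estate", "republicans"],
                            negative=["democrats"], topn=3))
```

On a real corpus of press releases, the hope is that the top results for this query would include terms like "death."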

4. Bulletproofing

Bulletproofing the doctor harassment-report model: periodically check whether the predictions are useful. Randomly select documents with low scores and read them, to catch cases the model missed. Be aware of whether you've unintentionally biased your algorithm. False negatives are OK (you just miss a lead); false positives are unacceptable (you'd risk publishing something wrong).
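
A minimal sketch of that spot-checking step; the scored_docs structure and its field names are invented for illustration:

```python
# Spot-check sketch: randomly sample low-scoring documents for manual
# review, to see what the model might be missing. The scored_docs
# structure and its field names are invented for illustration.
import random

scored_docs = [
    {"id": 1, "score": 0.02, "text": "routine license renewal"},
    {"id": 2, "score": 0.91, "text": "allegations of misconduct"},
    {"id": 3, "score": 0.05, "text": "complaint closed without action"},
    {"id": 4, "score": 0.01, "text": "address change on file"},
]

low_scorers = [d for d in scored_docs if d["score"] < 0.1]
for doc in random.sample(low_scorers, k=min(2, len(low_scorers))):
    print(doc["id"], doc["text"])  # read these by hand
```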

Beware of external factors: one outlet's stylebook requires "spokesperson" while another requires "spokesman/spokeswoman," so word frequencies can reflect editing rules rather than the speakers' own choices.

5. Presentation and visualization

Bar charts, bubbles, heat maps. Small multiples.

No word clouds. (Exception: street-name visualizations.)

Questions

Story planning process: it's incremental.

People will start coming to you with every pile of text. Use the easiest tool that will get the job done (often that's DocumentCloud).

Initial loading step: figure out what you actually need from the data. When dealing with large amounts of data, just moving it around can take a long time. Paring it down can be helpful.

What do you want to do with this next? JM: integrate it into a search engine (e.g., a search for "estate tax" would also surface documents about the "death tax").

Speakers

Jeff Ernsthausen is a data reporter at the Atlanta Journal-Constitution. He previously interned at The Nation and Harper's Magazine.

Jeremy Merrill is a news apps developer at ProPublica. He likes scraping data that's hard to get, maps, and public records. He lives in Atlanta, Georgia. He works on a variety of open-source newsroom tools, like Tabula, Stevedore and FOIA Lawya.

Youyou Zhou is a visual journalist at Quartz. She digs into data, writes stories, designs visuals and builds interactives. Youyou has a keen interest in the global transfer of knowledge and in text data. She is a Mizzou alum and previously built interactives and election apps for The Associated Press. @zhoyoyo

Description and speakers from official schedule