
Round Up: Ethics and Skepticism

Amanda Hickman edited this page Mar 4, 2019 · 15 revisions

There are a whole lot of different ways to misunderstand or be duped by data. This is my round-up of good links that illustrate some of the most common problems with relying on data. Additions are welcome, but I'm looking for news stories rather than theoretical examples.

Your job, as a reporter, is to put the data in context. Sometimes that means making an honest decision about whether or not maps are even the right way to tell the story, because how you tell a story matters.

For Starters

The Quartz guide to bad data is an excellent round-up of ways data goes bad and how to work with it responsibly. Samantha Sunne's review of challenges and possible pitfalls of data journalism is another nice overview, as is Jacob Harris's summary of Six Ways to Make Mistakes with Data.

Alberto Cairo's round-up of bad data journalism is full of great examples of irresponsible charts and headlines the data doesn't support.

Correlation is not Causation

I used to think correlation was causation but then I took a statistics course.

You knew that, right? So here are some Spurious Correlations for you.
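The mechanics behind those spurious correlations are easy to reproduce: compare enough unrelated variables and some pair will correlate strongly by pure chance. A minimal sketch (random invented data, not any real dataset):

```python
import numpy as np

# Short time series + many candidate variables = guaranteed "discoveries".
rng = np.random.default_rng(0)
n_years, n_series = 10, 200            # e.g. 200 unrelated yearly statistics
data = rng.normal(size=(n_series, n_years))

corr = np.corrcoef(data)               # pairwise correlation matrix
np.fill_diagonal(corr, 0)              # ignore each series vs. itself
best = np.max(np.abs(corr))

# With ~20,000 pairs of 10-point series, the strongest chance
# correlation will look impressively "real".
print(f"strongest chance correlation among {n_series} random series: {best:.2f}")
```

None of these series have anything to do with each other; the strong correlation falls out of the sheer number of comparisons, which is exactly how a nutmeg-consumption-vs-drownings chart gets made.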

Data is People

Databases do not burst fully formed from the head of Zeus. The data in them is either entered by people or gathered by devices built and placed by people. That means that you can't report on data without talking to people.

Incentives Corrupt

The fastest way to reduce the number of felony robberies in a single police precinct is to start classifying incidents as misdemeanors, and there’s good evidence that New York Police Department precincts did exactly that when the commissioner started rewarding precincts that got their serious crime rates down.

It isn't entirely clear why Baltimore County Police Department has more “unfounded” rape complaints than most departments nationwide, but BuzzFeed News found that many of those “unfounded” complaints were never really investigated. By marking them "unfounded," the department kept its "unsolved rape" statistics down. There is more reporting on unrecorded rapes and how they distort the data:

VA Hospital metrics are highly gameable, often in ways that make problems worse.

Inconsistent Data Entry

Sometimes there are just quirks in the way data gets recorded. One report found that coroners don't have solid standards for deciding whether to record a gun death as an accident or a homicide; as a result, accidental shootings are split between the two categories, making it hard to track down reliable data.

Cherry Picking Corrupts Your Results

The Fires (Joe Flood, 2011) is one of a few excellent accounts of the famous burning Bronx of the 1970s. One thing Flood covers in detail is the direct relationship between radical cutbacks in FDNY station funding in the Bronx and the rise in serious fires: the Bronx was burning because the infrastructure to put fires out had been decimated. Those closures were the outcome of a data-driven process that had its own holes. RAND's statistical team analyzed citywide response times and determined that the city could safely close a lot of Bronx firehouses, without acknowledging that those firehouses were some of the busiest in the city and often weren't responding to fires in their immediate vicinity because they were already out fighting other fires. So a very "pure" data-driven approach conveniently rationalized shuttering a lot of stations in poor areas. See also: Why The Bronx Burned (NY Post, May 2010) | Goodreads on The Fires | Reviews: A City on Fire (City Limits, June 2010)

Newsrooms cherry pick data, too. After the Wall Street Journal published a story arguing that Cellphones Are Eating The Family Budget, the Atlantic took a closer look at the data and begged to differ: What's Really Eating the Family Budget? It Ain't Smartphones (The Atlantic, 2012)

Test Data Only Captures Test Takers

Groups Are Not People

If I tell you that states with more foreign-born residents have more wealthy households, what's your next question? (Are foreign-born people more likely to be wealthy? No.) One study found a positive state-by-state correlation between literacy and foreign-born population: states with large immigrant populations tended to be more literate. That does not mean that immigrants are more likely to be literate.

Statwing's blog has a nice overview of the ecological fallacy, and statistics professor Andrew Gelman added some thoughts.
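To see the fallacy in numbers, here's a toy example (all figures invented, not from the study above) where the state-level correlation between foreign-born share and literacy is strongly positive even though, within every single state, foreign-born residents are less literate than natives:

```python
import numpy as np

# Per state: (literacy of natives, literacy of foreign-born, % foreign-born).
# In every state the foreign-born rate is LOWER than the native rate.
states = [
    (0.95, 0.85, 0.30),   # high-immigration state, high overall literacy
    (0.90, 0.80, 0.20),
    (0.85, 0.75, 0.10),
    (0.80, 0.70, 0.05),   # low-immigration state, lower overall literacy
]

pct_foreign = np.array([s[2] for s in states])
# Overall literacy is the population-weighted blend of the two groups.
overall_lit = np.array([nat * (1 - pf) + fb * pf for nat, fb, pf in states])

r = np.corrcoef(pct_foreign, overall_lit)[0, 1]
print(f"state-level correlation: {r:.2f}")  # strongly positive
```

The positive correlation is real at the group level (here because the states that attract immigrants also happen to have more literate natives), but it tells you nothing about individuals; inferring "immigrants are more literate" from it is the ecological fallacy.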

There are a few studies indicating that the best way to reduce urban bicycle injuries is to increase the number of riders (citation needed -- if you have one, I'm all ears), and some contested research indicating that helmet laws discourage cycling. One logical conclusion is that, while individual adults should be encouraged to wear helmets, mandatory helmet laws actually make cycling less safe in the big picture. I still need to distill the citations on that, and this is just an anecdote anyhow, but more than one person has drawn some funny conclusions from that research. Also keep in mind that readers are bad at nuance: when Vox published a story on helmet laws, a lot of people started hollering that Vox was saying cyclists shouldn't wear helmets.

N/A is Data Too

Look up 850 Bryant Street in San Francisco and see if you can guess why an exceptional number of crimes in San Francisco are mapped to that address.

Maps, IP addresses, and generalizations -- there are a few interesting ways to get data organized geographically by IP address, but you need to know how your data source handles unknown locations or you'll end up attributing a lot of things to the middle of Kansas.
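One cheap sanity check, sketched here with made-up records: count how often each exact coordinate pair appears, because a geocoder's fallback point (like that spot in the middle of Kansas, or a police department's headquarters address) shows up as one implausibly popular location:

```python
from collections import Counter

# Hypothetical records geolocated from IP addresses.
records = [
    {"ip": "203.0.113.5",  "lat": 38.0, "lon": -97.0},
    {"ip": "198.51.100.7", "lat": 38.0, "lon": -97.0},
    {"ip": "192.0.2.44",   "lat": 40.7, "lon": -74.0},
    {"ip": "203.0.113.9",  "lat": 38.0, "lon": -97.0},
]

counts = Counter((r["lat"], r["lon"]) for r in records)
(top_coord, top_n), = counts.most_common(1)

if top_n / len(records) > 0.25:  # arbitrary threshold for this sketch
    print(f"suspicious: {top_n}/{len(records)} records at {top_coord} "
          "-- possibly a geocoder default, not real locations")
```

Real coordinates almost never repeat exactly, so a single pair carrying a big share of your rows is worth a phone call before it's worth a map.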

Understand Your Data

  • 538 had to retract a story on broadband reach because they didn't understand the data they were working from. Asking good questions helps you avoid drawing conclusions that the data can't actually support.

Be Skeptical of Surveys

What did you ask first?

Pew has a lot of great research about survey design. One thing they've established is that question order changes how people respond to questions. In 2003, Americans were more likely to say they support civil unions if they had already been asked whether they support gay marriage. Read their full report on questionnaire design for more findings.

What was the question again?

Before you report on a survey, especially one with dramatic findings, make sure the actual questions line up with your reporting.

Keeping my powder dry on this one. I'd like to see how the question was worded. Did it use the word "source," without defining it? It's a term of art with a specific meaning that may have been misunderstood by some respondents.

— Scott Klein (@kleinmatic) February 26, 2019

Charts, tho!

We haven't even begun to dig into all the ways that we can use bad visualization to misrepresent perfectly good data, but I'd love to add your examples!

For now, here's QZ on a very disingenuous Apple chart.

Deciding what to publish

How you tell a story makes a big difference in what your readers take from your reporting.

And not everything should be published, even if you have the data. A few examples, but there are more:

There are also books

Not Yet Organized Notes

Unsorted / To Review

Copy pasta'd a bunch of notes when this came up on NICAR-L recently.

  • There's a chapter in Ethics for Digital Journalists on ethics in data journalism.
  • A paper from a Nordic data journalism conference.
  • Paul Bradshaw's list of resources and examples related to ethics in data journalism, including some related to big data.
  • A reflection on the gun map scenario and the role of forums like NICAR in discussing ethics.
  • A Markkula Center event.
  • The ethics of scraping.
