Round Up: Ethics and Skepticism
There are a whole lot of different ways to misunderstand or be duped by data. This is my round up of good links that illustrate some of the most common problems with relying on data. Additions are welcome, but I'm looking for news stories, rather than theoretical examples.
Your job, as a reporter, is to put the data in context. Sometimes that means making an honest decision about whether or not maps are even the right way to tell the story, because how you tell a story matters.
The Quartz guide to bad data is an excellent round up of ways data goes bad and how to work with it responsibly. Samantha Sunne's review of challenges and possible pitfalls of data journalism is another nice overview, as is Jacob Harris's summary of Six Ways to Make Mistakes with Data.
Alberto Cairo's round up of bad data journalism is full of great examples of irresponsible charts and headlines the data doesn't support.
Correlation is not Causation
You knew that, right? So here are some Spurious Correlations for you.
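Any two unrelated series that both trend in the same direction over time will correlate strongly, which is where most spurious correlations come from. A minimal sketch, with invented numbers (these are not real statistics):

```python
# Two made-up series that have nothing to do with each other, but both
# happen to trend upward over the same five years.
per_capita_cheese = [29.8, 30.1, 30.5, 31.3, 32.6]   # lbs/year (invented)
doctorates_awarded = [480, 501, 540, 552, 601]        # per year (invented)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

r = pearson(per_capita_cheese, doctorates_awarded)
print(round(r, 3))  # close to 1.0 -- yet cheese does not cause doctorates
```

The shared upward trend (time itself) does all the work; the correlation tells you nothing about causation.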
Data is People.
Databases do not burst fully formed from the head of Zeus. The data in them is either entered by people or gathered by devices built and placed by people. That means that you can't report on data without talking to people.
The fastest way to reduce the number of felony robberies in a single police precinct is to start classifying incidents as misdemeanors, and there’s good evidence that New York Police Department precincts did exactly that when the commissioner started rewarding precincts that got their serious crime rates down.
It isn't entirely clear why Baltimore County Police Department has more “unfounded” rape complaints than most departments nationwide, but BuzzFeed News found that many of those “unfounded” complaints were never really investigated. By marking them "unfounded" the department kept their "unsolved rape" statistics down. There's more reporting out there on unrecorded rapes and how they skew the data.
VA Hospital metrics are highly gameable, often in ways that make problems worse.
Inconsistent Data Entry
Sometimes there are just quirks in the way data gets recorded -- one report found that coroners don't have solid standards for deciding whether to record a gun death as an accident or a homicide, and as a result, accidental gun deaths are split between the two categories, making it hard to track down reliable data.
Cherry Picking Corrupts Your Results
The Fires (Joe Flood, 2011) is one of a few excellent stories about the famous burning Bronx of the 1970s. One thing he covers in detail is the direct relationship between radical cutbacks in FDNY station funding in the Bronx and the rise in serious fires. The Bronx was burning because the infrastructure to put fires out had been decimated. Those closures were the outcome of a data driven process that had its own holes: RAND's statistical team analyzed citywide response times and determined that the city could safely close a lot of Bronx firehouses -- without acknowledging that those firehouses were some of the busiest in the city and often weren't responding to fires in their immediate vicinity because they were already out fighting fires. So a very "pure" data-driven approach conveniently rationalized shuttering a lot of stations in poor areas. See also: Why The Bronx Burned (NY Post, May 2010) | Goodreads on The Fires | Reviews: A City on Fire (City Limits, June 2010)
Newsrooms cherry pick data, too. After the Wall Street Journal published a story arguing that Cellphones Are Eating The Family Budget, the Atlantic took a closer look at the data and begged to differ: What's Really Eating the Family Budget? It Ain't Smartphones (The Atlantic, 2012)
Test Data Only Captures Test Takers
A widely cited (but rarely sourced) statistic says that in the UK 90% of fetuses diagnosed with Down Syndrome are aborted. Worldwide, rates are lower but comparable. That sounds dramatic until you consider the fact that the only way to test for trisomy 21, amniocentesis, is relatively invasive. There's no reason to test a growing fetus for Down Syndrome if you don't plan to act on the test results.
Any time you see a hospital that advertises high surgical success rates, ask whether that means they have the best surgeons or that they only take easy cases.
Crime statistics always reflect enforcement priorities. The Chicago Sun-Times ran a story about exceptionally high crime rates at CTA stations, but the "crime spike" turned out to be turnstile jumping.
311 calls are not a random sample of lived experiences https://nextcity.org/daily/entry/who-is-most-likely-dial-311
In the 70s, researchers looked at admissions rates at UC Berkeley and found that women were far, far more likely to be rejected. A closer examination revealed that women were more likely to apply to more competitive programs, so department by department, there wasn’t evidence of discrimination. This is a classic in statistics texts. http://vudlab.com/simpsons/
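The Berkeley pattern (Simpson's paradox) is easy to reproduce with toy numbers. These figures are invented for illustration, not the actual admissions data:

```python
# Toy admissions data (invented, not Berkeley's real numbers).
# Each entry: (applicants, admitted).
admissions = {
    "easy dept": {"men": (100, 80), "women": (20, 17)},
    "hard dept": {"men": (20, 4),   "women": (100, 25)},
}

def rate(applied, admitted):
    return admitted / applied

# Within every department, women are admitted at a *higher* rate...
for dept, groups in admissions.items():
    m = rate(*groups["men"])
    w = rate(*groups["women"])
    print(dept, f"men {m:.0%}, women {w:.0%}")

# ...but in aggregate women fare far worse, because they mostly
# applied to the competitive department.
def overall(group):
    applied = sum(admissions[d][group][0] for d in admissions)
    admitted = sum(admissions[d][group][1] for d in admissions)
    return admitted / applied

print(f"overall: men {overall('men'):.0%}, women {overall('women'):.0%}")
```

The direction of the effect flips depending on whether you aggregate, which is why the department-by-department view matters.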
Groups Are Not People
If I tell you that states with more foreign born residents have more wealthy households, what's your next question? (Are foreign born people more likely to be wealthy? No.) One study found a positive state-by-state correlation between literacy and foreign born populations: areas with high immigrant populations were likely to be more literate. That does not mean that immigrants are more likely to be literate.
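This group-versus-individual trap (the ecological fallacy) is easy to see with invented numbers: in the sketch below, immigrants are less literate than natives in every single state, yet the states with the most immigrants have the highest overall literacy, simply because immigrants cluster in high-literacy states:

```python
# Invented numbers illustrating the ecological fallacy.
# Each state: (natives, native literacy rate, immigrants, immigrant literacy rate)
states = {
    "A": (1000, 0.95, 300, 0.80),
    "B": (1000, 0.85, 100, 0.70),
    "C": (1000, 0.75, 20, 0.60),
}

summary = {}
for name, (n, n_lit, i, i_lit) in states.items():
    foreign_share = i / (n + i)
    overall_lit = (n * n_lit + i * i_lit) / (n + i)
    summary[name] = (foreign_share, overall_lit)
    print(name, f"foreign-born {foreign_share:.0%}, literacy {overall_lit:.0%}")
    # Within every state, immigrants are the *less* literate group:
    assert i_lit < n_lit

# State-level view: more foreign-born residents, higher literacy.
# Individual-level truth: the opposite. Same data, different units of analysis.
```

Correlation at the group level tells you about groups, not about the individuals inside them.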
There are a few studies suggesting that the best way to reduce urban bicycle injuries is to increase the number of riders (citation needed -- if you have one, I'm all ears), and some contested research indicating that helmet laws discourage cycling. One logical conclusion is that, while individual adults should be encouraged to wear helmets, mandatory helmet laws actually make cycling less safe in the big picture. I still need to distill the citations for both claims, so treat this as an anecdote for now. But more than one person has come to some funny conclusions from that research. Also keep in mind that readers are bad at nuance: when Vox published a story on helmet laws, a lot of people started hollering that Vox was saying cyclists shouldn't wear helmets.
N/A is Data Too
Look up 850 Bryant Street in San Francisco and see if you can guess why an exceptional number of crimes in San Francisco are mapped to that address.
Maps, IP addresses, and generalizations -- there are a few interesting ways to get data organized geographically by IP address, but you need to know how your data source handles unknown locations or you'll end up attributing a lot of things to the middle of Kansas.
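Before mapping, it's worth flagging points that sit on a known "location unknown" default coordinate. A minimal sketch with made-up records -- the Kansas coordinate below is the widely reported historical MaxMind default for "somewhere in the US," but check your own provider's documentation for its actual defaults:

```python
# Known "unknown location" defaults to screen out before mapping.
KNOWN_DEFAULTS = {
    (37.751, -97.822),  # widely reported MaxMind US centroid (rural Kansas)
    (0.0, 0.0),         # "Null Island" -- a common failed-geocode value
}

def is_default_location(lat, lon, precision=3):
    """True if this point sits exactly on a known default coordinate."""
    return (round(lat, precision), round(lon, precision)) in KNOWN_DEFAULTS

# Made-up records using documentation IP ranges.
records = [
    {"ip": "203.0.113.7",  "lat": 37.751, "lon": -97.822},
    {"ip": "198.51.100.2", "lat": 40.713, "lon": -74.006},
    {"ip": "192.0.2.50",   "lat": 0.0,    "lon": 0.0},
]
mappable = [r for r in records if not is_default_location(r["lat"], r["lon"])]
print(len(mappable), "of", len(records), "records are safe to map")
```

The same idea applies to the 850 Bryant problem: find the address or coordinate your source uses as a dumping ground for unknowns, and treat those rows as "N/A," not as real locations.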
Understand Your Data
- 538 had to retract a story on broadband reach because they didn't understand the data they were working from. Asking good questions helps you avoid drawing conclusions that the data can't actually support.
Be Skeptical of Surveys
What did you ask first?
Pew has a lot of great research about survey design. One thing they've established is that question order changes how people respond to questions. In 2003, Americans were more likely to say they support civil unions if they had already been asked whether they support gay marriage. Read their full report on questionnaire design for more findings.
What was the question again?
Before you report on a survey, especially one with dramatic findings, make sure the actual questions line up with your reporting.
> Keeping my powder dry on this one. I'd like to see how the question was worded. Did it use the word "source," without defining it? It's a term of art with a specific meaning that may have been misunderstood by some respondents. -- Scott Klein (@kleinmatic), February 26, 2019: https://t.co/X6qzf3TA6x
We haven't even begun to dig into all the ways that we can use bad visualization to misrepresent perfectly good data, but I'd love to add your examples!
For now, here's QZ on a very disingenuous Apple chart.
Deciding what to publish
How you tell a story makes a big difference in what your readers take from your reporting.
And not everything should be published, even if you have the data. A few examples, but there are more:
- Gun Ownership: https://www.karmapeiro.com/2013/01/04/hasta-donde-llega-la-etica-en-el-periodismodedatos/
- Public employee salaries: https://www.americanpressinstitute.org/publications/reports/strategy-studies/challenges-data-journalism/
There are also books
- How to Lie with Statistics, Darrell Huff (1954)
- Bad Pharma, Ben Goldacre (2014)
- Numbers Rule Your World, Kaiser Fung (2011)
- The Tiger That Isn't, Michael Blastland and Andrew Dilnot (2009)
- The Victory Lab, Sasha Issenberg (2012)
- The Numbers Game: Why Everything You Know About Football Is Wrong, Chris Anderson and David Sally (2014)
- The Signal and the Noise: Why So Many Predictions Fail, but Some Don't, Nate Silver (2015)
Not Yet Organized Notes
- I haven't culled the good stuff from my older classes but I will. https://github.com/amandabee/CUNY-data-storytelling/blob/master/lecture%20notes/skepticism.md
Unsorted / To Review
Copy-pasted a bunch of notes when this came up on NICAR-L recently.
- Distrust Your Data: Jacob Harris on Six Ways to Make Mistakes with Data
- Bad Science, Ben Goldacre
- The Data Journalism module at ONA ethics
- There's a chapter on data journalism ethics in Ethics for Digital Journalists: https://www.palgrave.com/br/book/9783319972824 -- see also: https://onlinejournalismblog.com/2013/09/13/ethics-in-data-journalism-accuracy
- A paper from a Nordic data journalism conference: https://jyx.jyu.fi/bitstream/handle/123456789/58616/ETHICS%20OF%20DATA%20JOURNALISMI.pdf
- Paul Bradshaw's list of resources and examples related to ethics in data journalism, including some related to big data: https://pinboard.in/u:paulbradshaw/t:ethics+dj
- A reflection on the gun map scenario and the role of forums like NICAR in discussing ethics: http://datadrivenjournalism.net/news_and_analysis/ethical_questions_in_data_journalism_and_the_power_of_online_discussion
- Markkula Center event:
- Ethics of scraping: