# A Big Data Fail

Consider the 1936 federal presidential election of FDR vs. Alf Landon. The magazine
Literary Digest’s straw poll had correctly predicted the outcome of the previous five
presidential elections. Running up to the election, they polled over 10 million
individuals, including

- magazine subscribers
- registered automobile owners
- telephone owners

and received responses from about 2.4 million of those polled. The Literary Digest
predicted Landon would win in a landslide. By contrast, George Gallup’s quota
sample consisted of bi-weekly surveys of 2000 individuals, and correctly predicted
a landslide for FDR.

### 1. What are some potential sources of bias in each of these polling schemes?

#### Literary Digest
None of the three lists was a probability sample of the electorate: we can't assume that any of the three subpopulations (magazine subscribers, registered automobile owners, and telephone owners) has the same distribution of voting intentions as the whole voting population. In fact, each was very likely biased towards the Republican candidate, given that these groups were wealthier than the population at large. On top of this, only about 2.4 million of the roughly 10 million people polled responded, which adds a non-response (self-selection) bias to the sample, given that more passionate individuals are typically more likely to air their opinions.

#### Gallup
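A quota sample is closer to an SRS than the Digest's sample, because the quotas force the sample to resemble the population on the chosen characteristics. However, it is still likely to be more biased than a true SRS: within each quota, interviewers choose whom to poll rather than selecting at random, so only a subset of all possible samples of a given size can be selected.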

### Note

According to [Case Study I: The 1936 Literary Digest Poll](https://www.math.upenn.edu/~deturck/m170/wk4/lecture/case1.html), Gallup's sample size was 50,000.
According to [The First Measured Century](http://www.pbs.org/fmc/timeline/pgallup.htm), it was 3,000.
The bottom line is still the same though: a small random sample is much better than a huge biased sample.
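
To convince myself of this, here is a minimal simulation sketch. Everything in it is made up for illustration: the electorate, the 30% share of "wealthy" voters, the support rates, and the sample sizes (50,000 vs. 2,000) are assumptions, not the historical figures. The biased poll draws only from wealthy voters, mimicking a frame built from car and telephone owners.

```python
import random


def simulate_polls(pop_size=200_000, seed=0):
    """Compare a huge sample from a biased frame with a small simple random sample."""
    rng = random.Random(seed)

    # Hypothetical electorate: 30% "wealthy" voters who support candidate A
    # at 35%; everyone else supports A at 70%. All numbers are assumptions.
    wealthy_flags = [rng.random() < 0.30 for _ in range(pop_size)]
    votes = [rng.random() < (0.35 if w else 0.70) for w in wealthy_flags]
    true_share = sum(votes) / pop_size

    # Huge but biased sample: the sampling frame contains wealthy voters only.
    frame = [v for v, w in zip(votes, wealthy_flags) if w]
    big_biased = rng.sample(frame, 50_000)

    # Small simple random sample drawn from the whole electorate.
    small_random = rng.sample(votes, 2_000)

    return {
        "true": true_share,
        "big_biased": sum(big_biased) / len(big_biased),
        "small_random": sum(small_random) / len(small_random),
    }


if __name__ == "__main__":
    for name, share in simulate_polls().items():
        print(f"{name:>12}: {share:.3f}")
```

Under these assumptions the small random sample lands close to the true share of support, while the large sample misses by a wide margin and calls the wrong winner. Making the biased sample larger does not help, because the error comes from the frame and not from the sample size.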

# Data-Driven Study Design: COMPAS Algorithm for Predicting Recidivism
Recidivism is the tendency of a convicted criminal to reoffend. The COMPAS
(Correctional Offender Management Profiling for Alternative Sanctions) algorithm,
developed by the company Northpointe (now equivant), predicts recidivism risk
based on variables related to criminal history, drug involvement, and juvenile
delinquency. It is used by US courts for the purpose of case management, to predict
a defendant’s risk of committing more crimes.

We will examine the COMPAS algorithm and, in particular, a ProPublica study
pointing to racial biases associated with it (https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm).

We will discuss general issues raised by the application of such algorithms, in terms of ethics, privacy, security, and governance. We will also
walk through steps you might take to address questions related to the accuracy and
potential racial bias of the COMPAS algorithm.
The questions are meant to be discussed with the people around you as a group and
there is no right or wrong answer.
#### (a) What is the population of interest for COMPAS?
Pre-trial, sentencing and parole defendants in the US.
#### (b) What is the imagined utility of the algorithm in contrast to a human judge?
An algorithm is supposed to be more objective and fair.
#### (c) What are some features or attributes that were used by COMPAS to design the algorithm? Are there features or attributes that you think should’ve been included or taken out?
Crime, sex, age, crime degree, prior crimes, and juvenile felonies and misdemeanors. Demographic data is also collected.
However, in spite of what its documentation claims, COMPAS uses only 6 features to predict recidivism.
The questionnaire filled out by defendants has 137 questions, most of which are intended for the judge rather than the algorithm (and perhaps for future use).

The "Crime" feature is known to be a proxy for race. However, it is also the most predictive feature for recidivism.
Black defendants are also more likely to have prior crimes on record, and more of them, and these are features for COMPAS.
There are two options for features that are proxies for race. Either remove them (or correct them mathematically so that the proxy effect is neutralized), or accept that, no matter how hard we try, some proxy for race will always creep in, and explicitly include race as a feature so that it is easier to assess and compare the results to detect racial bias.
#### (d) How do you define "accuracy" and "racial bias"?
Accuracy is hard to define, because there are multiple plausible measures, and optimizing for one often degrades another. COMPAS's authors claim that the results are calibrated (a given score on the 1-10 recidivism scale has the same meaning for every race).
The main claim of racial bias is based on the finding that false positives are roughly twice as likely for Black defendants, and false negatives roughly twice as likely for White defendants. The tradeoff is that, mathematically, a score cannot be calibrated across groups and also have equal false positive and false negative rates across groups when recidivism rates differ between them (unless it predicts perfectly).
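
The tension can be checked with a little arithmetic. The sketch below uses an identity that follows directly from the confusion-matrix definitions: FPR = p / (1 - p) × (1 - PPV) / PPV × (1 - FNR), where p is the group's recidivism (base) rate and PPV is the share of "high risk" predictions that turn out correct. The function names and all numbers are hypothetical, chosen only for illustration; they are not ProPublica's or Northpointe's figures.

```python
def rates(tp, fp, tn, fn):
    """Base rate, PPV, FPR and FNR from confusion-matrix counts."""
    n = tp + fp + tn + fn
    return {
        "base_rate": (tp + fn) / n,   # share of people who actually reoffend
        "ppv": tp / (tp + fp),        # share of "high risk" labels that are correct
        "fpr": fp / (fp + tn),        # non-reoffenders labeled "high risk"
        "fnr": fn / (fn + tp),        # reoffenders labeled "low risk"
    }


def implied_fpr(base_rate, ppv, fnr):
    """FPR forced by the base rate, PPV and FNR (algebraic identity)."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Two hypothetical groups with the same PPV and the same FNR,
# but different base rates (0.5 vs. 0.4).
fpr_group_a = implied_fpr(base_rate=0.5, ppv=0.6, fnr=0.3)
fpr_group_b = implied_fpr(base_rate=0.4, ppv=0.6, fnr=0.3)

if __name__ == "__main__":
    print(f"group A FPR: {fpr_group_a:.3f}")
    print(f"group B FPR: {fpr_group_b:.3f}")
```

With equal PPV and equal FNR, the group with the higher base rate necessarily ends up with the higher false positive rate. Equalizing the false positive rates instead would require giving up equal PPV or equal FNR.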
#### (e) How does the history of criminal justice institutions inform the data used by the algorithm?
Some of the features used by the algorithm, like prior crimes, are higher for Black defendants. If we assume that human bias against Black people did affect sentencing in the past, then many would have been wrongly or too harshly sentenced for crimes, and too often. Then, the data that the algorithm uses today would be biased, and the resulting recidivism predictions would be hopelessly biased against Black defendants, and this is not something that can be fixed by throwing more (biased) data at the algorithm.
#### (f) How should data be collected or obtained to assess the accuracy of predictors like COMPAS? Would you sample at random from the population of interest?
Any data that might have been tainted by human biases must not be used by the algorithm. For instance, data from an era when bias was worse should be removed, because it will not reflect current attitudes.
Random sampling will not fix this if the sampled data still include features that might contain human bias.
However, once the features that cause the bias have been removed, or somehow corrected mathematically, random sampling might be considered.
Is the data a census, or an administrative sample? If it is the latter, then some kind of random sampling should be considered.
#### (g) What are some ways we can assess the accuracy of COMPAS?
Ideally, an experiment involving judges should be run to compare their decisions to COMPAS's.
The MIT Media Lab did a crowdsourcing experiment with non-experts, and their accuracy was almost exactly the same as COMPAS's.
They also reverse engineered COMPAS (the algorithm is proprietary). Using only the 7 features they had access to, they found that all of the following models were more or less as accurate as COMPAS: SVM(7), LR(7) and LR(2), where the number in parentheses is the number of features.
The most remarkable finding was that vanilla linear regression with only two features ("Crime" and "Age") was as accurate as COMPAS.
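
To compare predictors like these on an equal footing, each one has to be scored against the same observed outcomes. Below is a minimal sketch with two common measures, plain accuracy at a cutoff and AUC (the probability that a randomly chosen reoffender gets a higher score than a randomly chosen non-reoffender). The choice of metrics, the cutoff of 5, and the toy data are my assumptions for illustration, not necessarily what the study used.

```python
def accuracy(predictions, outcomes):
    """Share of binary predictions that match the observed outcomes."""
    correct = sum(p == y for p, y in zip(predictions, outcomes))
    return correct / len(outcomes)


def auc(scores, outcomes):
    """Probability that a positive case outscores a negative one (ties count half)."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = reoffended within the follow-up period, 0 = did not.
outcomes = [1, 0, 1, 1, 0, 0, 1, 0]
scores_a = [8, 3, 6, 4, 5, 2, 9, 1]  # hypothetical 1-10 risk scores, predictor A
scores_b = [7, 4, 5, 6, 6, 1, 8, 2]  # hypothetical 1-10 risk scores, predictor B
CUTOFF = 5  # assumed: predict "will reoffend" when score >= CUTOFF

if __name__ == "__main__":
    for name, scores in [("A", scores_a), ("B", scores_b)]:
        preds = [int(s >= CUTOFF) for s in scores]
        print(name, accuracy(preds, outcomes), auc(scores, outcomes))
```

Accuracy depends on the cutoff, while AUC summarizes ranking quality across all cutoffs, so two predictors can swap places depending on which measure is used. In this toy data predictor B wins on accuracy and predictor A wins on AUC.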
#### (h) Think about the concepts of false positives and false negatives in this scenario. What are the ramifications or costs of a false positive and/or false negative?
A false positive could result in unfairly harsher sentencing, or in bail or parole being unfairly denied. It will also add to the defendant's "Prior Crimes" feature, so any future algorithmic predictions will also be unfairly harsher.
A false negative gives the defendant the opportunity to commit more crimes when they would otherwise have been in custody.
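
One way to make this tradeoff explicit is to attach a cost to each kind of error and see which score cutoff minimizes the average cost. The sketch below does that. The cost values, the function names, and the toy data are hypothetical; how to weigh the two errors is a value judgment, not something the data can settle.

```python
def expected_cost(scores, outcomes, cutoff, cost_fp, cost_fn):
    """Average cost per person when everyone scoring >= cutoff is flagged."""
    fp = sum(1 for s, y in zip(scores, outcomes) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, outcomes) if s < cutoff and y == 1)
    return (cost_fp * fp + cost_fn * fn) / len(outcomes)


def best_cutoff(scores, outcomes, cost_fp, cost_fn, cutoffs=range(1, 12)):
    """Cutoff with the lowest expected cost (the lowest cutoff wins ties)."""
    return min(
        cutoffs,
        key=lambda c: expected_cost(scores, outcomes, c, cost_fp, cost_fn),
    )


# Toy data: hypothetical 1-10 risk scores and observed outcomes (1 = reoffended).
scores = [8, 3, 6, 4, 5, 2, 9, 1]
outcomes = [1, 0, 1, 1, 0, 0, 1, 0]

if __name__ == "__main__":
    # Treating a missed reoffender as 5x worse pushes the cutoff down.
    print(best_cutoff(scores, outcomes, cost_fp=1, cost_fn=5))
    # Treating a wrongly flagged defendant as 5x worse pushes it up.
    print(best_cutoff(scores, outcomes, cost_fp=5, cost_fn=1))
```

The same scores lead to different decisions depending on whose costs are counted, which is why the question of who bears each kind of error matters as much as the predictor itself.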
#### (i) Is the COMPAS algorithm fair? For whom? According to what/whose definition of fairness?
If we judge by false positive and false negative rates, it is unfair to Black defendants. If we judge by calibration, its authors claim it is fair. See (d) for further elaboration.

### Note
This exercise was originally intended as a group discussion with an instructor. As a self-learner, I replaced the discussion by watching a very relevant YouTube video from the MIT Media Lab that presents, discusses and extends ProPublica's criticism, and includes a long Q&A session: [The Accuracy, Fairness, and Limits of Predicting Recidivism](https://www.youtube.com/watch?v=G0OE8p-fc10)

An important issue raised in the video and not mentioned above is that the algorithm is a black box. Judges don't know why or how it made a prediction. Given the increasing influence of algorithmic predictions in society, and that many of these algorithms are closed and proprietary, it is worrying when they can't be independently assessed. We must not assume that they just work and are fair. COMPAS should be a lesson: more oversight is needed.

Someone raised an interesting question in the Q&A. In spite of the discouraging results, to what extent can we claim that the algorithm is bad, or worse than humans? Judges are often elected and worry about reelection. Their chances are more likely to be harmed by false negatives than by false positives.