In a 2022 lecture at Princeton University, associate professor of computer science Arvind Narayanan stated: "currently, quantitative methods are primarily used to justify the status quo. I would argue that they do more harm than good" (@narayanan2022limits). As machine learning algorithms become more prevalent in our lives, and are used to aid in high-stakes decisions like prison sentencing, credit assessment, and pricing, it's essential that we have a way to measure bias and fairness. Narayanan believes that the quantitative methods which we currently use to assess bias and fairness in machine learning are inadequate. In this essay, I will dissect Narayanan's position on quantitative methods in bias and fairness, discuss some of the benefits of these quantitative methods, discuss some of the limitations of these quantitative methods, and unpack my own views on these methods. 

In his lecture, Narayanan discusses some key limitations that he believes exist in our current quantitative methods for assessing fairness and bias in machine learning. He uses the COMPAS study that we have worked with in class as an example -- he was initially excited about the data that exhibited bias in the COMPAS algorithm, but after further consideration he identified 7 serious limitations of the way quantitative methods are used to study discrimination. Although Narayanan believes that we should continue to use quantitative methods, he thinks these limitations need to be addressed if these methods continue to be used: 

1. What counts as evidence of discrimination is a subjective choice
2. Null hypotheses allocate the burden of proof unfairly
3. Compounding inequality is not detectable by quantitative methods
4. Snapshot datasets can hide discrimination
5. Statistical controls can mask discrimination
6. Different fairness metrics can produce conflicting results
7. Numbers are the language of policymaking, even when they are misleading

We'll unpack these different ideas along the way as we bring in some more sources to highlight both the benefits and drawbacks of using these methods. First, let's understand some of the ways in which these quantitative methods have been effective, despite their potential limitations. 

In *Fairness and Machine Learning*, Barocas, Hardt, and Narayanan outline 3 quantitative definitions of fairness in machine learning: independence, separation, and sufficiency (@barocasFairnessMachineLearning2023). We'll focus on independence here, which requires the score to be statistically independent of the sensitive characteristic. In the early 2010s, Amazon created an AI tool to aid its recruiting process by scoring candidates based on their resumes, and this tool did not meet the definition of independence across genders (@ReutersAmazonAIRecruiting). If we define $R$ as the score function and let $\hat{Y} = I\{R > t\}$ be the decision obtained by thresholding the score at $t$, then: 

$P\{\hat{Y} = 1 \mid \text{male candidate}\} \neq P\{\hat{Y} = 1 \mid \text{female candidate}\}$

That is, the probability of assigning a male candidate a score over the threshold was not the same as the probability of assigning a female candidate a score over the threshold. In this case, the male candidates were more likely to be assigned a higher score. This happened because the model was trained on resumes submitted to the company over a 10-year period, most of which came from men. Resumes that contained the word "women's", as in "women's chess club captain", were likely to be penalized. Once Amazon identified that the independence criterion was not being met for this system, they edited the tool for gender-neutrality, and decided to use the scores generated by the tool as one part of the hiring process, but not to rule out any candidates based on the AI process. This example highlights some of the benefits of using quantitative methods to assess fairness and bias in machine learning algorithms. Amazon was able to use the quantitative criterion described above to identify a problem with its algorithm, and make changes to both the algorithm itself and the context in which it was used. Even though methods like the independence criterion certainly have some limitations, which we will discuss next, they can still quickly flag a problem and prompt developers to address it. Now that we've seen an example of the ways in which quantitative methods can be useful, we can start to understand their drawbacks and limitations. 
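
To make the independence check above concrete, here is a minimal Python sketch. The scores, group labels, and threshold are invented for illustration (they are not Amazon's data), and `selection_rates` is a hypothetical helper name. It estimates the probability of scoring above the threshold within each group and reports the gap between the two.

```python
from collections import defaultdict


def selection_rates(scores, groups, threshold):
    """Estimate P(score > threshold | group) for every group."""
    selected = defaultdict(int)
    total = defaultdict(int)
    for score, group in zip(scores, groups):
        total[group] += 1
        if score > threshold:
            selected[group] += 1
    return {group: selected[group] / total[group] for group in total}


# Hypothetical resume scores, made up purely for illustration.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.5, 0.3, 0.2]
groups = ["male"] * 4 + ["female"] * 4

rates = selection_rates(scores, groups, threshold=0.55)
gap = abs(rates["male"] - rates["female"])

print(rates)                                # {'male': 0.75, 'female': 0.25}
print(f"independence gap: {gap:.2f}")       # independence gap: 0.50
```

Independence would require the gap to be (approximately) zero. How large a gap counts as a violation is itself a judgment call, which connects to Narayanan's first limitation.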

The ProPublica article on machine bias, which we have looked at a few times in class and which analyzed the COMPAS algorithm for bias, is a great example of some of the concerns that Narayanan raises about quantitative methods for analyzing bias (@ProPublicaMachineBias). Even though the algorithm had similar accuracy rates across Black and white defendants, people using the algorithm failed to consider disparities in false positive and false negative rates between these groups. Many of the issues that Narayanan raises are at play here. First, there was a difference in what the algorithm's creators thought counted as "fair" and what ProPublica thought was "fair". To frame this in language from *Fairness and Machine Learning*, let $\hat{Y}$ be an indicator denoting whether the algorithm predicts that a defendant is likely to recommit a crime, and let $Y$ be an indicator denoting whether that defendant actually did recommit. COMPAS's creators were looking for the fact that: 

$P\{\hat{Y} = Y \mid \text{white defendant}\} = P\{\hat{Y} = Y \mid \text{Black defendant}\}$

That is, the accuracy of the algorithm is the same for each racial group. In moral terms, this corresponds to the narrow view of equality as defined in *Fairness and Machine Learning*. But the ProPublica authors wanted a more demanding definition of fairness that aligns with the middle view of equality: equal false positive rates across racial groups. In more technical terms:

$P\{\hat{Y} = 1 \mid Y = 0, \text{white defendant}\} = P\{\hat{Y} = 1 \mid Y = 0, \text{Black defendant}\}$
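
The two criteria above, equal accuracy and equal false positive rates, can disagree on the same predictions. The sketch below uses hypothetical confusion-matrix counts chosen to make the point; they are not the ProPublica figures, and `error_rates` is an illustrative helper name.

```python
import math


def error_rates(tp, fp, tn, fn):
    """Compute accuracy, false positive rate, and false negative rate from counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "fpr": fp / (fp + tn),  # share of non-reoffenders flagged as high risk
        "fnr": fn / (fn + tp),  # share of reoffenders flagged as low risk
    }


# Hypothetical counts for two groups of 100 defendants each (illustration only).
group_a = error_rates(tp=30, fp=10, tn=40, fn=20)
group_b = error_rates(tp=40, fp=20, tn=30, fn=10)

# Both groups have the same accuracy, so the accuracy criterion is satisfied...
same_accuracy = math.isclose(group_a["accuracy"], group_b["accuracy"])
# ...but the false positive rates differ by a factor of two.
fpr_ratio = group_b["fpr"] / group_a["fpr"]

print(group_a)
print(group_b)
print(same_accuracy, round(fpr_ratio, 2))
```

With these made-up counts, both groups are classified with 70% accuracy, yet group B's false positive rate is double group A's. A tool can therefore pass the first check while failing the second.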

For the algorithm's creators, similar accuracy rates across racial groups counted as a fair algorithm, but ProPublica argued that the false positive rate for Black defendants was nearly twice that of white defendants -- causing unfair differences in how Black defendants were treated. Second, the creators of COMPAS assumed the null hypothesis that their algorithm was fair unless proven otherwise, and their defense was that no direct racial variables were used in the model, which shifted the burden of proof to critics like ProPublica. This ignores the reality that historical and structural biases are embedded in the data itself, so a model does not need explicit race-based inputs to produce biased outcomes. Third, COMPAS's methods of assessing fairness could not detect compounding inequality. They assessed individual risk without considering the structural inequalities that led to higher arrest rates among Black individuals, like the fact that areas with high concentrations of Black residents are more likely to be over-policed and over-charged. The COMPAS algorithm treated past arrests and convictions as neutral indicators, failing to account for how past injustice compounds over time. Fourth, COMPAS only used a snapshot dataset -- a dataset that only considered historical arrest records, not defendants' actual long-term outcomes. The algorithm did not consider how a Black defendant's false high-risk classification could lead to a longer incarceration time, which might in turn reduce future employment opportunities and reinforce systemic inequality. Fifth, the COMPAS algorithm controlled for factors like criminal history, age, and gender, in order to argue that any disparities across racial groups were due to legitimate risk factors rather than systemic racism. But these risk factors were themselves shaped by systemic racism, making them a proxy for assigning different scores based on race. 
Sixth, similar to the first issue, using different fairness metrics, in this case predictive accuracy vs. false positive rates, created conflicting results about whether or not the algorithm was fair. Finally, the seventh issue showed up as COMPAS continued to be used in criminal justice decisions because its numerical outputs carried authority -- judges, parole officers, and policymakers trusted the algorithm's scores despite evidence of bias. People often feel that data is unbiased and algorithms can't be racist, but by applying each of Narayanan's issues to the quantitative methods used to assess the COMPAS algorithm, we can see some of the drawbacks of relying on those methods to judge its fairness across racial groups.
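
The proxy problem described above can be illustrated with a toy example. This sketch is not the COMPAS model: the records, the cutoff, and the rule `flag_high_risk` are all invented, under the assumption that one group has more recorded prior arrests because of heavier policing. The rule never looks at group membership, yet it flags the two groups at very different rates.

```python
def flag_high_risk(prior_arrests, cutoff=2):
    """A 'race-blind' rule: it only ever sees the number of prior arrests."""
    return prior_arrests >= cutoff


# Hypothetical (group, prior_arrests) records. Group B is assumed to have more
# recorded arrests because of heavier policing, not because of more offending.
records = [
    ("A", 0), ("A", 0), ("A", 1), ("A", 1), ("A", 2),
    ("B", 1), ("B", 2), ("B", 2), ("B", 3), ("B", 3),
]


def flag_rate(records, group):
    """Share of a group's records that the rule flags as high risk."""
    flags = [flag_high_risk(arrests) for g, arrests in records if g == group]
    return sum(flags) / len(flags)


rate_a = flag_rate(records, "A")
rate_b = flag_rate(records, "B")
print(rate_a, rate_b)  # 0.2 0.8
```

Removing the sensitive attribute from the inputs does not remove its influence when another input carries the same information.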

Ultimately, I'd agree with Narayanan's claim about quantitative methods for assessing bias. Although quantitative methods are important and we can use them to understand limitations of our machine learning algorithms, there are many other dimensions of algorithmic bias that need to be considered. As discussed in *Data Feminism*, power dynamics that can exacerbate bias are embedded in all aspects of data collection, analysis, and interpretation of results (@datafeminism2020). Narayanan claims that the burden of proof serves to favor the status quo, and *Data Feminism* goes even further, arguing that quantitative methods are not neutral -- they are shaped by and continue to hold up existing power structures. In my experience, people often view quantitative methods as unbiased: they can either be "right" or "wrong". But when we begin to dig deeper into the systemic power imbalances that are embedded in collecting and interpreting data, even outside of machine learning algorithms, we can see that when quantitative methods are used on datasets that reflect biases in our society, the methods themselves begin to reinforce these inequalities. We need to use context, history, and lived experiences in addition to quantitative methods in order to truly capture bias in these algorithms -- as D'Ignazio and Klein write, "context is queen" (@datafeminism2020). This past J-term, I took a class called Questioning Technology, where we discussed algorithmic bias at length. One of the most interesting issues that we discussed, brought up in a book called *More Than A Glitch: Confronting Race, Gender, and Ability Bias in Tech*, was technochauvinism, a bias that considers computational solutions to be superior to all other solutions (@morethanaglitch2023). I think that this issue lies at the center of the entire discussion around machine learning bias: people are so willing to accept technological solutions as unbiased and correct. 
So, as Narayanan claims, I think we can continue to use machine learning algorithms and use quantitative methods to assess their fairness. In fact, I think we *should*, in order to learn how we can improve these systems. But we can't come at this from a technochauvinist angle. Instead, we need to approach machine learning with skepticism, bringing our knowledge of the bias intrinsically present in our datasets to our assessment of machine learning algorithms, especially when using them to inform consequential decisions that can have immense impact on people's lives. To put this into context, in the case of the ProPublica study, technochauvinism was certainly at play. The system we are looking at, the criminal justice system, is already known to have a problem with human bias. We know that judges tend to be biased against people of color, even when they don't do it intentionally. So, when presented with an algorithmic solution, the criminal justice system was immediately inclined to accept this solution as potentially less biased than human judges. The algorithm is assumed not to include racial bias because developers don't think that it has been exposed to years of systemic racism like humans have. Therefore, people started using this algorithm without much investigation into its performance beyond its accuracy rates. It required skilled researchers doing an in-depth study to expose the problems with the algorithm. But, as we can see from the study, people's lives had already been negatively impacted by this algorithm by the time it was called into question. Instead of finding these issues after the fact, a more skeptical approach would mean asking the critical questions that the ProPublica researchers asked, but doing so before the algorithm is deployed. Or it would mean using the algorithm first on a very small subset of cases to understand its limitations and impacts before allowing it to be used on any case without restriction. 
Furthermore, developers should be skeptical of their data -- if they are looking to reduce racial bias, they should take the time to understand the underlying bias that might be present in the variables they are using to train their model. Ultimately, we cannot continue acting with technochauvinist views and accepting algorithms as unbiased without rigorous testing that involves exploring all the types of potential bias that an algorithm can exhibit. These algorithms have the potential to ruin lives, and we must use them with care. 