## Lesson 1.3: Equality, diversity and inclusion in data science

Data science has grown substantially over the last decade. Its applications now span most scientific fields and play an important role in industry and government. As a result, data science and data scientists have growing influence and power. Decisions taken using data affect individuals and communities around the world in more ways than ever before.

Despite this influence, a number of important topics around the ethics of data science and its impact on equality, diversity and inclusion have been under-discussed.

In this lesson we:
- Discuss and criticise some simplistic but widely used metaphors about the role of data science in today's world.
- Discuss power and its relationship to data science. We try to capture some of the ways in which data science reflects, reproduces or causes inequalities and oppression in society.
- Demonstrate how data scientists can detect and challenge practices, ideas and privileges that reinforce inequality.
- Give examples of real-world data science projects where EDI principles have been applied with or without success and demonstrate how to do participatory data science.

### Data is the new oil

The importance and value of data are often highlighted in the press with a popular metaphor: 'Data is the new oil' (e.g. see [this](https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data) article in The Economist). This is meant to convey that data is a resource that is out there, available to be extracted, and of great value for fuelling the modern digital economy.

![data_oil](../../figures/m1/data_oil.jpeg)

<i>Team exercise (split into groups)</i>

The metaphor sounds appealing and accurate: Data is indeed a resource that (like oil) we can mine, process and use to generate profit. 

But the metaphor can serve to hide some aspects of the role of data in the real world.

Specifically, let's discuss the following questions in groups:
- Who benefits from this 'new oil'?  Who doesn't and why?
- Who has control over the process of gathering and analysing data and making data-driven decisions? 
- How is data 'extracted'? Is it fair to treat data as a free resource available to be extracted?
- What are the risks of using data to solve problems?



### Data and power

Data is indeed a valuable resource that increasingly plays a major role in fuelling our economies. It can be, and is being, used to make the world a better place. At the same time, **data and data science can be oppressive** and there are multiple instances and historical examples of this in various fields.

To understand how data oppression operates in today's world, it is useful to examine **power**: How does it operate in society? How are data used within existing power structures?

Here, we provide one definition of power which is useful within our context. We do not hope to even remotely capture the nature of power, which has been the topic of countless philosophical, sociological and other studies. But the following definition can help us examine the relationship between power and data science. According to the book [Data Feminism](https://data-feminism.mitpress.mit.edu/):

> Power is the current configuration of structural privilege and structural oppression, in which some groups experience unearned advantages — because various systems have been designed by people like them and work for people like them — and other groups experience systematic disadvantages — because those same systems were not designed by them or with people like them in mind. 

To better understand how power is organised and experienced by people in societies, we can use the following **matrix of oppression** proposed in [Black Feminist Thought](https://projects.iq.harvard.edu/hksdigitalbookdisplay/publications/black-feminist-thought-knowledge-consciousness-and-politics) and used in Data Feminism:
![matrix](../../figures/m1/matrix.png)

Data science overlaps with these four domains in various ways. These forces of oppression are encountered in our daily lives but are also present in our datasets, our data science industry, our research and our code. Some examples are provided below.

### Examples

#### What data do we collect?
Political and cultural factors have a strong influence on what types of data are collected and not collected. The choices our governments, organisations and corporations make say a lot about which problems are prioritised in our societies and in our data science communities. The disciplinary and hegemonic domains are often important in this discussion.

Examples:
- Many datasets that one would expect to exist do not. For example, see this [Missing Datasets list](https://github.com/MimiOnuoha/missing-datasets). Our decisions not to collect these data often express biases, systematic failures and oppression.
- Up to 2018, there was still no national system in the US for tracking complications sustained in pregnancy and childbirth, even though similar systems had long been in place for tracking any number of other health issues, such as teen pregnancy, hip replacements, or heart attacks ([USA Today](https://www.usatoday.com/series/deadlydeliveries/)). Recent research has shown that black women are over 3 times more likely than white women to die from such complications ([ProPublica article](https://www.propublica.org/article/nothing-protects-black-women-from-dying-in-pregnancy-and-childbirth)). It took a social media post from Serena Williams, who experienced complications when giving birth to her daughter, to ignite a public conversation.
- A lot of the data we collect for research and industrial purposes are predominantly male (see [this](https://data2x.org/wp-content/uploads/2019/05/Data2X_MappingGenderDataGaps_FullReport.pdf) and Caroline Criado Perez, Invisible Women: Exposing Data Bias in a World Designed for Men). For example, car crash dummies were until recently designed to represent male bodies, which meant a significant increase in the risk of injury for women.
- Many of the datasets that organisations and states publish are missing important variables, do not break down numbers by gender, age, race etc. (which can hide many biases) and/or exclude people who should be included (e.g. children's mental health is often not measured, and data used for designing products or conducting medical research have historically been male-dominated).
- A well-known example of biased data collection and use comes from the domain of face recognition, where it was recently [shown](http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf) that commercial facial recognition software misclassifies darker-skinned people significantly more often than lighter-skinned people, due to biased training and benchmarking data (78% male and 84% white).
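
The last two examples point to a practical habit: report results per group as well as overall, because a single aggregate number can hide large disparities. The following is a minimal sketch of that idea. The records, the group names `group_a` and `group_b`, and the helper functions are all hypothetical and invented purely for illustration; they do not come from any of the studies cited above.

```python
from collections import defaultdict

# Hypothetical toy records, invented purely for illustration.
# Each tuple is (group, true_label, predicted_label).
records = (
    [("group_a", 1, 1)] * 76    # group A, classified correctly
    + [("group_a", 1, 0)] * 4   # group A, misclassified
    + [("group_b", 1, 1)] * 13  # group B, classified correctly
    + [("group_b", 1, 0)] * 7   # group B, misclassified
)


def error_rate(rows):
    """Share of rows whose prediction differs from the true label."""
    wrong = sum(1 for _, y_true, y_pred in rows if y_true != y_pred)
    return wrong / len(rows)


def error_rate_by_group(rows):
    """Compute the error rate separately for each group."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[0]].append(row)
    return {group: error_rate(group_rows) for group, group_rows in by_group.items()}


overall = error_rate(records)
per_group = error_rate_by_group(records)

print(f"overall error rate: {overall:.2f}")
for group, rate in sorted(per_group.items()):
    print(f"{group}: {rate:.2f}")
```

In this made-up dataset the overall error rate is 11%, which may look acceptable, but it is 5% for the larger group and 35% for the smaller one. The aggregate figure is dominated by the group that contributes most of the records.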

#### What do we use them for?
Many applications of data science have helped shed light on important societal problems and make the world a better place. Nevertheless, some of the most common motivations for collecting and using data involve generating profit, using it for surveillance, using it to administer scarcity or using it for scientific purposes (in many cases aiming to benefit specific groups).

Examples:
- There is a major drive to convert even the most mundane aspects of human lives and experience to data in pursuit of profit. This is often described as 'datafication'. Data are collected about every simple action we take online (e.g. how many seconds a user looks at a Facebook post or what their searches are), about our behaviour in the workplace (e.g. see [Amazon's tracking of workers' movements](https://www.theverge.com/2019/4/25/18516004/amazon-warehouse-fulfillment-centers-productivity-firing-terminations) and the increasingly prevalent [monitoring of employees' usage of their computers](https://desktime.com/employee-time-tracking-guide)), about a large number of things that happen in our cities and roads, and about crime and the police reaction to it. There are connections here to the concept of ['biopolitics'](https://en.wikipedia.org/wiki/Biopolitics) as described by Michel Foucault, but also to the more recent concept of 'psychopolitics' as described in Byung-Chul Han's [Psychopolitics: Neoliberalism and New Technologies of Power](https://www.worldcat.org/title/psychopolitics-neoliberalism-and-new-technologies-of-power/oclc/1004206745).
- Social media platforms like Facebook use data collected from users in ways that have been challenged on ethical grounds. For example, see recent stories about the [Cambridge Analytica scandal](https://en.wikipedia.org/wiki/Facebook%E2%80%93Cambridge_Analytica_data_scandal) and the documentary [The Social Dilemma](https://www.thesocialdilemma.com/) about how Facebook designs its platform to be addictive and the impact this can have on society and individuals (also see Facebook's reply to the allegations [here](https://about.fb.com/wp-content/uploads/2020/10/What-The-Social-Dilemma-Gets-Wrong.pdf)).
- There are multiple instances where data and algorithms have reinforced existing oppression and injustice. For example, [this](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) widely circulated report by ProPublica demonstrated how a machine learning algorithm built to predict recidivism of convicted criminals was racially biased. Despite not using race as a feature, the algorithm used various other features which acted as a proxy for race. It is scary that algorithms like this one have been used in many US states and influenced judges' decisions.
![propublica](../../figures/m1/propublica.png)
- A similar example of racial bias comes from algorithms used to predict high crime areas with the purpose of focusing police presence there. [PredPol](https://www.predpol.com/technology/) is an example of such a tool, used by the City of Los Angeles for nearly a decade to determine which neighbourhoods to patrol more heavily. Like many tools based on historic data, PredPol actually predicts the **past**, rather than the future. Historically, police presence has disproportionately focused on black neighbourhoods. This racist practice now finds its way through the algorithm to the present: higher crime is predicted in those neighbourhoods, so police presence is increased; but increased police presence leads to more crime being detected and reported, creating a feedback loop that perpetuates the same practices.
- There is a flip side to the coin of biased facial recognition software mentioned above. These systems are increasingly used for aggressive surveillance by states around the world; you might not want your face to be recognisable if this is going to lead to violence against you or unfair prosecution!
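
The feedback loop described in the predictive policing example can be made concrete with a small simulation. This is a deliberately simplified toy model and not PredPol's actual algorithm: the function `simulate_patrols`, the two areas, the starting counts and the detection rule are all assumptions made for illustration. Both areas are given the same underlying crime rate; the only difference is that one starts with more recorded crime because it was patrolled more heavily in the past.

```python
def simulate_patrols(recorded, true_rate, patrols_per_round, rounds):
    """Toy feedback loop: patrols follow recorded crime, records follow patrols."""
    recorded = list(recorded)
    shares = []
    for _ in range(rounds):
        total = sum(recorded)
        # Allocate patrols in proportion to the crime recorded so far.
        share = [r / total for r in recorded]
        shares.append(share)
        # Crime is only recorded where patrols are present to observe it.
        detected = [
            patrols_per_round * s * rate for s, rate in zip(share, true_rate)
        ]
        recorded = [r + d for r, d in zip(recorded, detected)]
    return recorded, shares


# Hypothetical starting point: area 0 has more *recorded* crime (60 vs 40)
# purely because of past over-policing; the true rates are identical.
final_recorded, shares = simulate_patrols(
    recorded=[60, 40],
    true_rate=[0.1, 0.1],
    patrols_per_round=100,
    rounds=20,
)

print("patrol share in the last round:", [round(s, 2) for s in shares[-1]])
print("recorded crime after 20 rounds:", [round(r) for r in final_recorded])
```

Under these assumptions the historically over-policed area keeps receiving 60% of the patrols in every round, even though both areas have the same true crime rate, and the gap in recorded crime grows from 60 vs 40 to roughly 180 vs 120. The data never get the chance to correct the initial bias, because the data are themselves produced by the patrol allocation.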

#### Who controls the data and algorithms?

If we think about who collects and controls data in today's world and with whose benefit in mind, we can see some worrying patterns.
- The collection and control of large and valuable datasets is increasingly concentrated in the hands of a few major organisations (e.g. Google, Facebook, Apple, Alibaba) and various smaller ones, most of which are not under democratic control. These organisations have accumulated power in the form of data and algorithms that creates dangerous imbalances.
- Despite the high value of data, users and citizens have limited choice when it comes to giving away their data if they want to maintain access to certain platforms; these platforms are essentially monopolies.
- Regulation and legislation are still lagging behind, leaving a lot of space for misuse. Many decisions are left in the hands of the organisations that control these platforms.

#### Who works in data science?
Data science has a clear diversity problem. People employed in the field are predominantly white and male and come from a relatively limited set of academic backgrounds (mostly STEM). This is true across sectors of the economy. Indicatively:
- 

### Data myths and privilege

- Converting life experiences to data entails a reduction of that experience.
- No dataset, algorithm or visualisation is the work of one person. There are people who offer up their experience to be analysed and counted, people who perform the counting and analysis, and people who visualise and promote the results.
- Data scientists are not wizards or superstars. They are often strangers to the problems they try to solve and their success depends on a large number of contributions from other people and a rich set of open, community-developed tools.
- Data are not raw or neutral: they are the biased output of unequal social, historical and economic conditions and they should be treated as such.
- 

### How to challenge bias and oppression as a data scientist