<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Intro to Data Science Ethics

_Authors: Greg Baker (SYD) and Justin Pounders_

---

### Learning Objectives

- Define some of the political, ethical and economic ramifications of using black-box models in data science.
- Describe the features of explainable vs non-explainable (black-box) models.
- Define and discuss the global implications of the EU GDPR.

### Lesson Guide
- [Introduction: legal and ethical responsibilities](#intro)
- [Black-box models and explainability](#blackbox)
- [GDPR](#gdpr)
- [Options for Machine Learning on EU Citizen Data](#options)
- [LIME Example](#lime)
- [Resources](#resources)

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
```

<a id='intro'></a>

## Introduction: legal and ethical responsibilities

As data scientists, we often have access to very personal information about 
our customers or users and **we are asked or required to make decisions based on that data**.

Many countries have strict laws on how such data must be handled. Beyond what the law requires, we all have a professional responsibility to act *ethically*.

### Warning

Laws change -- sometimes quite suddenly -- and General Assembly doesn't
promise that anything in this topic won't be out-of-date by the time you
finish the class.

Also, _I AM NOT A LAWYER_.

## What is the context?

Data science involves gathering data and using that data to draw inferences and make predictions.

- Creditors want to know how reliable you are.
- Insurance companies want to know how risky you are.
- Dating sites want to know what your "type" is.
- Retailers want to know what you will buy.

![](./assets/target.png)

> https://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/#53d014186668

**The upshot**

- Target assigned each customer an ID.
- It tracked all purchases.
- It linked IDs to demographic data when available.
- It learned to predict pregnancy, and even due dates.
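To see how little machinery this kind of linkage needs, here is a minimal pandas sketch. The table names, columns, items and values are all made up for illustration; nothing here reflects Target's actual data or method.

```python
import pandas as pd

# Hypothetical purchase log keyed by an internal customer ID (made-up data).
purchases = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, 103, 103],
    "item": ["unscented lotion", "vitamins", "coffee",
             "unscented lotion", "cotton balls", "vitamins"],
})

# Hypothetical demographic table, available for only some customers.
demographics = pd.DataFrame({
    "customer_id": [101, 103],
    "age_band": ["25-34", "18-24"],
})

# Count how often each customer bought each item.
basket = pd.crosstab(purchases["customer_id"], purchases["item"])

# A left join keeps every customer and adds demographics "when available".
profile = basket.reset_index().merge(demographics, on="customer_id", how="left")
print(profile)
```

Each row of `profile` is now a per-customer feature vector that could be fed to a predictive model, which is why seemingly harmless purchase logs deserve the same care as explicitly personal data.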

> "But even if you’re following the law, you can do things where people get queasy."
> --- Target statistician Andrew Pole

## Why do we care?

Machine learning and other data-based decision-making algorithms can have...

- **immediate** ramifications, because humans are bypassed and machines work fast and at large scale;
- direct **impact** on people, in ways that affect their livelihoods;
- **invisible** reach: how much of your data is collected? Who owns it? Where is it stored?

![](assets/chowdhury.png)

> Taken from Rumman Chowdhury's presentation at the Southern Data Science Conference, 2018.
>
> Links to articles
> - [Here's how Centrelink can win back Australians' trust after the robo-debt debacle](http://www.abc.net.au/news/2017-03-21/how-centrelink-can-win-back-trust-after-the-robo-debt-debacle/8372788); analysis [here](http://www.rogerclarke.com/DV/CRD17.html)
> - [Facebook Lets Advertisers Exclude Users by Race](https://www.propublica.org/article/facebook-lets-advertisers-exclude-users-by-race); NYTimes article [here](https://www.nytimes.com/2016/11/12/business/media/facebook-will-stop-some-ads-from-targeting-users-by-race.html).
> - [I asked Tinder for my data. It sent me 800 pages of my deepest, darkest secrets](https://www.theguardian.com/technology/2017/sep/26/tinder-personal-data-dating-app-messages-hacked-sold)

### Food for thought:

- Did Centrelink do anything "wrong"?  What did they _not_ consider?
- What about Facebook?
- Who owns your data?  (Tinder or otherwise.)

---

## OKCupid

![](assets/okcupid.png)

> http://fortune.com/2014/07/28/okcupid-we-experiment-on-users-too/

### Food for thought:

- Websites use "A/B testing" all the time.  Is this different?  Why?
- How could OKCupid have performed this experiment in a more ethical manner?

## Cambridge Analytica

> **"The firm offered tools that could identify the personalities of American voters and influence their behavior."**

![](assets/cambridge1.png)

---

![](assets/cambridge2.png)

> https://www.nytimes.com/2018/03/19/technology/facebook-cambridge-analytica-explained.html

### Food for thought:

- Did users consent to have their data scraped?
- Is Facebook culpable?
- What would you do if you saw this happen, either at Facebook, the university or Cambridge Analytica?

----

## Racial "profiling" by country

<img src="assets/France.png" style="float: left; margin: 10px; height: 50px">

### France - you must not ask about race

<br>

<blockquote>Race is such a taboo term that a 1978 law specifically banned the collection and computerized storage of race-based data without the express consent of the interviewees or a waiver by a state committee. France therefore collects no census or other data on the race (or ethnicity) of its citizens.  <a href="https://www.brookings.edu/articles/race-policy-in-france/">Race policy in France</a>
</blockquote>

Even if you don't use race as a feature in your model, you generally can't store it in a database, or even
let it appear in a pandas DataFrame, without that consent or waiver.
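One way to keep such a field out of a DataFrame is to exclude it at load time. The sketch below is illustrative only: the column names, the `PROTECTED_COLUMNS` set and the CSV contents are hypothetical, and whether this is sufficient for compliance is a legal question, not a pandas one.

```python
import io
import pandas as pd

# Hypothetical column names that must never be loaded; adjust to your data.
PROTECTED_COLUMNS = {"race", "ethnicity"}

# Made-up CSV standing in for a raw file from an upstream system.
raw_csv = io.StringIO(
    "customer_id,race,postcode,spend\n"
    "1,A,75001,120.5\n"
    "2,B,69002,80.0\n"
)

# usecols accepts a callable: only columns for which it returns True are parsed.
df = pd.read_csv(raw_csv, usecols=lambda name: name.lower() not in PROTECTED_COLUMNS)
print(df.columns.tolist())
```

Note that this only keeps the column out of memory. The raw file still contains it, and other columns (a postcode, for instance) may act as proxies for the excluded field.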


### Malaysia - you must ask about race

<br>

<blockquote>
Race is such an important term in Malaysia, and the history of discrimination so strong, that great efforts are made to ensure that companies aren't being discriminatory or racist. It is quite common to collect some quite personal data about every customer, student or user:

<ul>
<li><em>What is your race?</em></li>
<li><em>What is your religion?</em></li>
</ul>

These are written on everyone's national identity card and stored in numerous databases.
Changing religion requires completing a formal application and being issued with a new
identity card.
</blockquote>

If you are dealing with data from Malaysia, it is important to show that any algorithm you
use doesn't adversely affect people of a particular race or religion, because a court can easily request data
on your per-race outcomes. For example, you might analyse student education
outcomes to confirm there is no racial or religious bias.

So you will definitely need to generate some descriptive statistics on race.
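A minimal sketch of what such descriptive statistics might look like in pandas follows. The data, group labels and the `passed` outcome column are invented for illustration, and the ratio at the end is just one simple summary, not a legal test.

```python
import pandas as pd

# Made-up student outcomes; group labels and values are purely illustrative.
students = pd.DataFrame({
    "race": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "religion": ["X", "Y", "X", "Y", "X", "X", "Y", "Y"],
    "passed": [1, 1, 1, 0, 1, 0, 1, 0],
})

def pass_rates(df, column):
    """Pass rate and group size for each value of `column`."""
    return df.groupby(column)["passed"].agg(pass_rate="mean", n="size")

by_race = pass_rates(students, "race")
by_religion = pass_rates(students, "religion")
print(by_race)
print(by_religion)

# Ratio of the lowest to the highest pass rate across races;
# a value well below 1 suggests the outcome deserves a closer look.
ratio = by_race["pass_rate"].min() / by_race["pass_rate"].max()
print(round(ratio, 2))
```

With groups this small the numbers mean little; on real data you would also want to report group sizes and some measure of uncertainty before drawing conclusions.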

### Side-note:

This makes it very difficult for a French university to open a campus in Malaysia!

## What is the underlying ethical motivation, and how does it affect data scientists?


- Generally, countries set up laws so that the colour of your skin or what you believe doesn't affect how
you are treated by other human beings.
- But many decisions are now being made by _algorithms_, rather than human beings.

> That's the first rule of algorithms.
> Algorithms are **opinions embedded in code**.
> ...
> (As data scientists), we should not be the arbiters of truth. We should be translators of ethical discussions that happen in larger society.
> --- [Cathy O'Neil TED Talk](https://www.ted.com/talks/cathy_o_neil_the_era_of_blind_faith_in_big_data_must_end/transcript)

- Data scientists have a unique responsibility to build algorithms that **treat everyone fairly**, just as we would
expect a human being to treat everyone fairly.
- In Europe, this is now a **legal responsibility**, and it will mostly fall on data scientists to justify why a
model is reasonable and fair.
 

## How do you **know** you're right?  Can you know?

![](./assets/compas.png)

> https://www.washingtonpost.com/news/monkey-cage/wp/2016/10/17/can-an-algorithm-be-racist-our-analysis-is-more-cautious-than-propublicas/?noredirect=on&utm_term=.475a0b30d2aa

Also see this analysis: [The accuracy, fairness, and limits of predicting recidivism](http://advances.sciencemag.org/content/4/1/eaao5580.full)


### Food for thought:

- Algorithmic decision making is **not easy**.
- Did Northpointe's "product" (COMPAS) affect people's livelihoods?
- Could they _explain why_ they were making the predictions that were being made?
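Part of what makes this hard is that "fair" can be measured in more than one way, and the measures can disagree. The sketch below uses made-up labels and predictions (not COMPAS data) to compare per-group error rates; the column names and the `error_rates` helper are assumptions for illustration.

```python
import pandas as pd

# Made-up outcomes: 1 = reoffended (actual) or flagged as high risk (predicted).
df = pd.DataFrame({
    "group":     ["A"] * 6 + ["B"] * 6,
    "actual":    [1, 1, 1, 1, 0, 0,   1, 1, 0, 0, 0, 0],
    "predicted": [1, 1, 0, 0, 1, 0,   1, 1, 1, 0, 0, 0],
})

def error_rates(g):
    """Error rates for one group's actual and predicted labels."""
    negatives = g[g["actual"] == 0]
    positives = g[g["actual"] == 1]
    flagged = g[g["predicted"] == 1]
    return pd.Series({
        # Share of people who did not reoffend but were flagged anyway.
        "false_positive_rate": (negatives["predicted"] == 1).mean(),
        # Share of people who did reoffend but were not flagged.
        "false_negative_rate": (positives["predicted"] == 0).mean(),
        # Share of flagged people who actually reoffended.
        "precision": (flagged["actual"] == 1).mean(),
    })

rates = df.groupby("group")[["actual", "predicted"]].apply(error_rates)
print(rates)
```

In this toy data both groups have the same precision, yet group A's false positive rate is twice group B's. Which of those numbers you treat as the definition of fairness changes the conclusion.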

## General Considerations

- Privacy/anonymity
- Data ownership
- Consent
- Explainability

<a id='resources'></a>

## Additional resources

---


- [Is there a right to explanation for Machine Learning in the GDPR?](https://iapp.org/news/a/is-there-a-right-to-explanation-for-machine-learning-in-the-gdpr)
- [Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2903469)
- [Towards Interpretable Reliable Models](https://blog.kjamistan.com/towards-interpretable-reliable-models/)
- [GDPR and you](https://blog.kjamistan.com/gdpr-you-my-talk-at-cloudera-sessions-munchen/)
- [Hold Your Machine Learning and AI Models Accountable](https://medium.com/pachyderm-data/hold-your-machine-learning-and-ai-models-accountable-de887177174c)
- [How GDPR Affects Data Science](https://kdnuggets.com/2017/07/gdpr-affects-data-science.html)
- [Scalable Bayesian rule lists](https://arxiv.org/pdf/1602.08610v2.pdf)

- [Why Should I Trust You? Explaining the Predictions of Any Classifier](https://www.youtube.com/watch?v=hUnRCxnydCc)
- [Explaining Complex Machine Learning Models with LIME](https://datascienceplus.com/explaining-complex-machine-learning-models-with-lime/)

