<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Explainable Models with LIME

_Authors: Greg Baker (SYD)_

---

### Learning Objectives
- Be aware that there are often political and legal implications to data science
- Develop a sense of scale of the implications of the EU GDPR
- Identify explainable vs non-explainable models
- ...

### Lesson Guide
- [Introduction: National Data Requirements](#intro)
- [GDPR](#gdpr)
- [Options for Machine Learning on EU Citizen Data](#options)
- [Resources](#resources)

In [14]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

<a id='intro'></a>

## Introduction: National Data Requirements

As data scientists, we often have access to very personal information about 
our customers or users. 

Many countries have very strict laws on how such data must be handled.

Laws change -- sometimes quite suddenly -- and General Assembly doesn't
promise that anything in this topic won't be out-of-date by the time you
finish the class.

Also, IANAL.

We'll start with two countries with extremely different approaches and then we'll look at 
some important laws in the EU.

----

<img src="assets/malaysia.png" style="float: left; margin: 10px; height: 50px">

## Malaysia

<br>

To make sure that companies aren't being discriminatory or racist, it is
common to collect some very personal data about every customer, student or user:

- _What is your race?_

- _What is your religion?_

These are written on everyone's national identity card and stored in numerous databases.
Changing these requires completing a formal application and being issued with a new 
identity card.

The Malaysian government cares very deeply about discrimination against particular
races.

If you are dealing with data from Malaysia, it is important to show that any algorithm you 
use doesn't adversely affect people of particular races or religions, as a court can easily request data 
on your per-race outcomes. For example, it is important to analyse student education 
outcomes to confirm that there is no racial or religious bias.
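
A per-group outcome check of this kind could be sketched as follows. The column names (`race`, `passed`), the group labels and the six rows are all invented for illustration; they are not real data.

```python
import pandas as pd

# Hypothetical toy data: one row per student, invented for illustration only.
students = pd.DataFrame({
    'race':   ['A', 'A', 'A', 'B', 'B', 'B'],
    'passed': [1, 1, 0, 1, 0, 0],
})

# Pass rate and group size for each group.
per_group = students.groupby('race')['passed'].agg(['mean', 'count'])
per_group.columns = ['pass_rate', 'n_students']

# Gap between the best- and worst-served groups. A large gap is a prompt
# for further investigation, not proof of bias on its own.
rate_gap = per_group['pass_rate'].max() - per_group['pass_rate'].min()
print(per_group)
print('Largest gap in pass rate:', round(rate_gap, 3))
```

With real data the group sizes matter as much as the rates: a gap computed from a handful of students says very little.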

----

<img src="assets/France.png" style="float: left; margin: 10px; height: 50px">

## France

<br>

<blockquote>Race is such a taboo term that a 1978 law specifically banned the collection and computerized storage of race-based data without the express consent of the interviewees or a waiver by a state committee. France therefore collects no census or other data on the race (or ethnicity) of its citizens.  [Race policy in France](https://www.brookings.edu/articles/race-policy-in-france/)
</blockquote>

Even if you don't use race in your model, you can't store it in a database or data frame without express consent or a state waiver.

This makes it very difficult for a French university to open a campus in Malaysia!

<a id='gdpr'></a>

----

<img src="assets/eu.png" style="float: left; margin: 20px; height: 50px">

# EU GDPR

<br>

The General Data Protection Regulation was agreed to in April 2016. It goes into effect on **25th May 2018**.

It puts some very serious constraints on what can and can't be kept in a database, and makes some
common machine learning algorithms difficult to implement.

## Article 22

<blockquote>
<p>
The data subject shall have the right not to be subject to a **decision based solely on automated processing**, including profiling, which produces legal effects concerning him or her or similarly significantly affects him or her.
</p>

<p>
The data controller shall implement suitable measures to safeguard the data subject's rights and freedoms and legitimate interests, at least the **right to obtain human intervention** on the part of the controller, to express his or her point of view and to contest the decision. 
</p>
</blockquote>

Since a decision made by a machine learning classifier with no human review is a "decision based solely on automated processing", EU citizens have the right to ask to bypass that wonderful new machine learning algorithm you created and have a human being make the decision instead.

The legislation doesn't strictly define what a "legal effect" is, but at the extremes:

- Choosing the optimal image for a landing page to match someone's preferences is probably OK and probably not subject to Article 22.

- If you have different pricing models for different customers and you use a machine learning algorithm to
 choose the offer that you think will best suit them (or makes your company the most money!), you are probably subject to
 Article 22.

## Article 9

<blockquote>
<p>
 Processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership ... data concerning health or data concerning a natural person's sex life or sexual orientation shall be prohibited. 
 </p>
</blockquote>

This applies to _any company, anywhere in the world_ working with data from EU citizens. Your input dataframe ($X$) in 
your model _can't_ have columns for sexual orientation, race, politics or religion. You can't even have columns that
are strongly correlated with any of these things.
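
As a minimal sketch, the obvious columns can be stripped by name before a frame ever reaches a model. The column names and the toy frame here are hypothetical; adjust the list to your own schema.

```python
import pandas as pd

# Hypothetical column names for special-category data; adjust to your schema.
SPECIAL_CATEGORY_COLUMNS = ['race', 'religion', 'political_opinion',
                            'union_member', 'sexual_orientation', 'health']

def drop_special_categories(df, blocked=SPECIAL_CATEGORY_COLUMNS):
    """Return a copy of df without any of the blocked columns."""
    present = [col for col in blocked if col in df.columns]
    return df.drop(columns=present)

# Toy frame, invented for illustration.
customers = pd.DataFrame({
    'age': [34, 51],
    'religion': ['x', 'y'],
    'postcode': ['2000', '2010'],
})
X_safe = drop_special_categories(customers)
print(list(X_safe.columns))
```

Dropping by name does nothing about proxies: a column such as `postcode` may still correlate strongly with a protected attribute, and that has to be assessed separately.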


## Article 83

<blockquote>
Infringements of the ... provisions shall ... be subject to administrative fines ... up to 4 % of the total worldwide annual turnover of the preceding financial year
</blockquote>

For reference, this is the maximum potential fine for various companies:

| Company | Annual Revenue | Maximum Fine |
|----------|---------------|---------------|
| Microsoft (including LinkedIn) | USD 100B | USD 4,000,000,000 |
| Alphabet (Google) | USD 90B | USD 3,600,000,000 |
| Facebook | USD 27B | USD 1,080,000,000 |
| Netflix | USD 8.8B | USD 352,000,000 |
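
The maximum fines are simply 4% of annual revenue. The arithmetic can be reproduced with a few lines, using the revenue figures quoted in the table (integer arithmetic avoids floating-point rounding):

```python
# Annual revenue in USD, as quoted in the table.
annual_revenue_usd = {
    'Microsoft (including LinkedIn)': 100000000000,
    'Alphabet (Google)': 90000000000,
    'Facebook': 27000000000,
    'Netflix': 8800000000,
}

MAX_FINE_PERCENT = 4  # "up to 4 %" of worldwide annual turnover

max_fine_usd = {company: revenue * MAX_FINE_PERCENT // 100
                for company, revenue in annual_revenue_usd.items()}

for company, fine in max_fine_usd.items():
    print('{}: USD {:,}'.format(company, fine))
```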


## Recital 71

<blockquote>
...processing should be subject to suitable safeguards, which should include ... **an explanation of the decision** reached after such assessment and [the option] to challenge the decision.
</blockquote>

Recitals are not law; they are additional text explaining the purposes and thinking behind the Articles. It's possible that the EU courts will not interpret the law the same way as the lawmakers intended!

But this key phrase, _an explanation of the decision_, has caused a lot of angst in the machine learning community.
Some machine learning classifiers are very hard to explain in language that a non-technical audience can follow:

- Deep learning

- Random forests

Some are (relatively) easy:

- Logistic regression

- Bayesian methods

- Decision trees

- Rule lists
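
To see why logistic regression counts as relatively easy to explain, here is a small sketch on synthetic data. The data and the feature names are invented for illustration: the outcome is built to depend on the first feature and to ignore the second.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data, invented for illustration: the outcome depends strongly
# on feature 0 and not at all on feature 1.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 2))
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

model = LogisticRegression()
model.fit(X_demo, y_demo)

# Each coefficient is the change in log-odds per unit increase in a feature.
# Exponentiating gives an odds ratio, which is easier to put into words.
feature_names = ['feature_0', 'feature_1']
for name, coef in zip(feature_names, model.coef_[0]):
    print('{}: coefficient {:+.2f}, odds ratio {:.2f}'.format(
        name, coef, np.exp(coef)))
```

The whole model is two numbers and an intercept, so an explanation such as "each extra unit of `feature_0` multiplies the odds by this much" can be read straight off the fitted object.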





## Recital 71 (continued)

<blockquote>
... the controller should use appropriate mathematical or statistical procedures ... [to take] account of the potential risks involved for the interests and rights of the data subject and that prevents, inter alia, discriminatory effects on natural persons on the basis of racial or ethnic origin, political opinion, religion or beliefs, trade union membership, genetic or health status or sexual orientation, or that result in measures having such an effect.
</blockquote>

Not only do you have to show that you didn't deliberately include race, religion or sexual orientation in your model, you also have to show that your model isn't discriminating against anyone accidentally. Of course, this is a bit challenging
for your French users, because you won't be able to store what race they identify with in the first place.

<a id="options"></a>

----

# Options for Machine Learning on EU Citizens' Data


There are two approaches that are likely to satisfy the GDPR:

- Only use machine learning algorithms that can be understood very easily by non-technical users

- Retro-fit explainability over other algorithms
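
The second approach is the idea behind LIME: explain one prediction of an opaque model by fitting a simple model in its neighbourhood. The sketch below uses only NumPy and a made-up `black_box` function. It illustrates the idea and is not the `lime` package's API; the function names, noise scale and kernel width are all assumptions chosen for the example.

```python
import numpy as np

def black_box(X):
    """Stand-in for an opaque model: a non-linear score in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-(X[:, 0] ** 2 - 3.0 * X[:, 1])))

def explain_locally(predict, instance, n_samples=2000, scale=0.5,
                    kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear model around one instance."""
    rng = np.random.RandomState(seed)
    # 1. Perturb the instance with Gaussian noise.
    samples = instance + scale * rng.normal(size=(n_samples, len(instance)))
    # 2. Ask the black box for its predictions on the perturbed points.
    targets = predict(samples)
    # 3. Weight each sample by its closeness to the instance.
    distances = np.linalg.norm(samples - instance, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # 4. Weighted least squares: scale rows by sqrt(weight), then solve.
    design = np.hstack([np.ones((n_samples, 1)), samples - instance])
    root_w = np.sqrt(weights)
    coefs = np.linalg.lstsq(design * root_w[:, None], targets * root_w,
                            rcond=None)[0]
    return coefs[1:]  # drop the intercept: one local slope per feature

instance = np.array([1.0, 0.5])
local_slopes = explain_locally(black_box, instance)
print(local_slopes)
```

The slopes describe how the prediction moves near this one instance only. A different instance can produce a different explanation from the same model.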


## Decision Lists

Decision lists are one of the easiest classifiers to understand. Training
creates a list of if-then-else statements and their associated probabilities.
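
Applying a decision list is just a walk down the rules until one matches. In this plain-Python sketch the rules and probabilities are illustrative placeholders, not the output of a trained model.

```python
# Illustrative rules only: the conditions and probabilities are placeholders,
# not the output of a trained model.
decision_list = [
    (lambda p: p['Sex'] == 'female' and p['Pclass'] == 'First Class', 0.95),
    (lambda p: p['Sex'] == 'female', 0.70),
    (lambda p: p['Age'] < 10, 0.60),
]
DEFAULT_PROBABILITY = 0.20  # the final ELSE branch

def predict_probability(passenger):
    """Return the probability attached to the first rule that matches."""
    for condition, probability in decision_list:
        if condition(passenger):
            return probability
    return DEFAULT_PROBABILITY

passenger = {'Sex': 'female', 'Pclass': 'Third Class', 'Age': 30}
print(predict_probability(passenger))
```

Order matters: a passenger who satisfies several rules gets the probability of the first one, which is why the list can be read aloud as IF ... ELSE IF ... ELSE.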

The sklearn-expertsys library (which is not yet integrated well into sklearn)
is a convenient way of creating decision lists.

In [2]:
!git clone https://github.com/tmadl/sklearn-expertsys

Cloning into 'sklearn-expertsys'...
remote: Counting objects: 186, done.
remote: Total 186 (delta 0), reused 0 (delta 0), pack-reused 186
Receiving objects: 100% (186/186), 81.40 KiB | 122.00 KiB/s, done.
Resolving deltas: 100% (116/116), done.


In [11]:
!curl http://www.borgelt.net/bin64/py2/fim.so --output sklearn-expertsys/fim.so
# I should make a MacOS version as well, and provide instructions for Windows users

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  790k  100  790k    0     0   131k      0  0:00:06  0:00:06 --:--:--  165k


In [3]:
import sys
sys.path.append('sklearn-expertsys')

In [12]:
import RuleListClassifier

In [149]:
rlc = RuleListClassifier.RuleListClassifier(class1label='alive', listlengthprior=7, n_chains=7, 
                                            listwidthprior=2,
                                            minsupport=3, maxcardinality=3)

In [150]:
titanic = pd.read_csv('datasets/titanic.csv', index_col=['PassengerId'])
titanic.dropna(inplace=True)
class_dict = {1: "First Class", 2: "Second Class", 3: "Third Class"}
X = pd.DataFrame({
    'Pclass': titanic.Pclass.map(class_dict.get),
    'Sex': titanic.Sex,
    'Siblings' : titanic.SibSp,
    'Parch': titanic.Parch,
    'Age': titanic.Age
})
Y = pd.DataFrame({
    'Survived': titanic.Survived
})
X.reset_index(inplace=True)
Y.reset_index(inplace=True)
del X['PassengerId']
Y = Y.Survived

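In [156]:
import sklearn.model_selection
# Split the DataFrame and Series directly; neither has a `.data` attribute.
(Xtrain, Xtest, Ytrain, Ytest) = sklearn.model_selection.train_test_split(X, Y)
# Assumption: RuleListClassifier looks rows up by position, so give the shuffled
# training data a fresh 0..n-1 index (the likely cause of a KeyError during fit).
Xtrain = Xtrain.reset_index(drop=True)
Ytrain = Ytrain.reset_index(drop=True)

In [157]:
#rlc.fit(X,Y, 
#        feature_labels=['Pclass', 'Sex'], 
#        undiscretized_features=['Pclass'])

# Take the labels from the frame itself so they always match its column order.
rlc.fit(Xtrain, Ytrain, feature_labels=list(Xtrain.columns))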

In [152]:
print(rlc)

Trained RuleListClassifier for detecting alive
IF male THEN probability of alive: 43.3% (33.6%-53.2%)
ELSE IF Third Class THEN probability of alive: 57.1% (22.3%-88.2%)
ELSE IF male AND First Class THEN probability of alive: 50.0% (2.5%-97.5%)
ELSE IF female THEN probability of alive: 94.1% (88.3%-98.0%)
ELSE IF First Class THEN probability of alive: 50.0% (2.5%-97.5%)
ELSE IF Second Class THEN probability of alive: 50.0% (2.5%-97.5%)
ELSE probability of alive: 50.0% (2.5%-97.5%)



<a id='resources'></a>

## Additional resources

---


- [Is there a right to explanation for Machine Learning in the GDPR?](https://iapp.org/news/a/is-there-a-right-to-explanation-for-machine-learning-in-the-gdpr)
- [Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2903469)
- [Towards Interpretable Reliable Models](https://blog.kjamistan.com/towards-interpretable-reliable-models/)
- [GDPR and you](https://blog.kjamistan.com/gdpr-you-my-talk-at-cloudera-sessions-munchen/)
- [Hold Your Machine Learning and AI Models Accountable](https://medium.com/pachyderm-data/hold-your-machine-learning-and-ai-models-accountable-de887177174c)
- [How GDPR Affects Data Science](https://kdnuggets.com/2017/07/gdpr-affects-data-science.html)
- [Scalable Bayesian Rule Lists](https://arxiv.org/pdf/1602.08610v2.pdf)