# Sam SHAP
#### Out of pure hubris, I decided to try to replicate tabular ```shap``` in a more lightweight form. I did it pretty poorly, but it is certainly more lightweight (storage-wise)...

## Implementation:
To explain a whole input, you should only need to call ```ShapleyApproximator.shap_vals```; the other methods are helpers. Here is a quick API reference:

```ShapleyApproximator(predict, data_matrix)```
- ```predict```: your model's prediction function. It must accept a 2D array, where axis 0 indexes the datapoints and axis 1 indexes the features, and return a 2D array whose axis 1 indexes the classes (the code reads ```predict(batch)[:, class_index]```).
- ```data_matrix```: the background data from which random samples are drawn. This *must* be a 2D array, preferably an ```np.ndarray```, since the code relies on ```.shape``` and row indexing.
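
As a rough sketch of the contract ```predict``` has to satisfy, here is a toy stand-in model. ```toy_predict``` and its weights are made up for illustration and are not part of this repo; the only point is the shapes: a 2D batch in, a 2D array of per-class scores out.

```python
import numpy as np

# Hypothetical weights for a 3-feature, 2-class toy model (illustration only).
W = np.array([[0.5, -0.5],
              [1.0, -1.0],
              [-2.0, 2.0]])

def toy_predict(batch):
    """Map a (n_rows, n_features) batch to (n_rows, n_classes) probabilities."""
    batch = np.asarray(batch, dtype=float)
    logits = batch @ W
    # Softmax over the class axis, shifted for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

data_matrix = np.random.default_rng(0).normal(size=(100, 3))
probs = toy_predict(data_matrix)
print(probs.shape)  # (100, 2)
```

Any callable with this shape behaviour should work, including a plain list of rows as input, since the class builds its batches as Python lists.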

```powerset``` (Builds the powerset, i.e. every possible coalition, of the feature indices. Returns a ```list[tuple[int, ...]]```)
- ```num_features```: number of features in your input.
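
A standalone version of the same idea, for a 3-feature input, looks roughly like this (the function is a copy of the method's logic without the class around it):

```python
from itertools import chain, combinations

def powerset(num_features):
    """Every subset of feature indices 0..num_features-1, smallest first."""
    indices = range(num_features)
    return list(chain.from_iterable(
        combinations(indices, r) for r in range(num_features + 1)
    ))

coalitions = powerset(3)
print(coalitions)
# [(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
print(len(coalitions))  # 8, i.e. 2 ** 3
```

The list has 2^n entries, so it gets large quickly: 20 features already means 1,048,576 coalitions.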

```shapley_value_fast``` (Samples both the coalitions and the replacement values at random, so it is the higher-variance of the two estimators. Returns ```float```)
- ```index```: index of the feature of focus (FoF)
- ```x```: the input being explained (1D)
- ```class_index```: which class we are explaining (type ```int```)
- ```num_coalitions```: number of coalitions to draw at random (with replacement).
- ```samples```: number of random rows to draw from ```data_matrix``` per coalition (default 10).
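
For a single coalition and a single random row, the two hybrid inputs are assembled roughly as follows. This sketch copies the list comprehensions from the class below into a standalone helper; ```build_pair``` is a name made up here, not part of the API.

```python
def build_pair(x, random_row, coalition, index):
    """Build the 'with feature j' and 'without feature j' inputs for one draw."""
    n = len(x)
    # With feature j: coalition features come from x, the rest from the random row,
    # and the feature of focus is forced to its value in x.
    x_plus_j = [x[i] if i in coalition else random_row[i] for i in range(n)]
    x_plus_j[index] = x[index]
    # Without feature j: as written in the class, the roles are swapped here
    # (coalition features come from the random row), and the feature of focus
    # is forced to the random row's value.
    x_minus_j = [random_row[i] if i in coalition else x[i] for i in range(n)]
    x_minus_j[index] = random_row[index]
    return x_plus_j, x_minus_j

x = [1, 2, 3, 4]
random_row = [10, 20, 30, 40]
plus, minus = build_pair(x, random_row, coalition=(0, 2), index=1)
print(plus)   # [1, 2, 3, 40]
print(minus)  # [10, 20, 30, 4]
```

The estimate for the feature is then the summed model output over the "plus" rows minus the summed output over the "minus" rows, divided by the number of draws.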

```shapley_value``` (Samples *only* the replacement values at random and iterates over every coalition in the powerset. Returns ```float```)
- ```index```: index of the feature of focus
- ```x```: the input being explained (1D)
- ```class_index```: which class we are explaining (type ```int```)
- ```samples```: number of random rows to draw from ```data_matrix``` per coalition (default 10).
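
To get a feel for the cost gap between the two estimators, one can count the rows sent to ```predict``` per feature. The counts below are read off the class as written (one seed row plus one row per coalition per sample, for each of the two batches); ```rows_per_feature``` is a hypothetical helper, not part of the API.

```python
def rows_per_feature(samples, num_features=None, num_coalitions=None):
    """Rows passed to predict for one feature: two batches, each with one seed row."""
    if num_coalitions is None:
        # Exhaustive mode walks the full powerset of the features.
        num_coalitions = 2 ** num_features
    return 2 * (1 + samples * num_coalitions)

# Exhaustive: 10 features, 10 samples per coalition.
print(rows_per_feature(samples=10, num_features=10))     # 20482
# Fast: 2500 random coalitions, 2 samples each.
print(rows_per_feature(samples=2, num_coalitions=2500))  # 10002
```

The exhaustive count doubles with every extra feature, while the fast count depends only on ```num_coalitions``` and ```samples```.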

```shap_vals``` (Approximates the Shapley value of every feature of an input. Returns a ```dict``` with keys ```"E[f(x)]"```, ```"f(x)"```, ```"Shapley Values"``` and ```"Error"```)
- ```x```: your input. A 1D array is assumed.
- ```class_index```: which class we are explaining (default 0).
- ```num_coalitions```: (only used if ```fast``` is ```True```) number of coalitions to draw at random (default 500).
- ```num_samples```: number of random rows to draw per coalition (default 10).
- ```fast```: use ```shapley_value_fast``` (```True```, the default) or the exhaustive ```shapley_value``` (```False```)?
- ```cheat```: (boolean, default ```False```) should we cheat by rescaling the values so that they sum exactly to ```f(x) - E[f(x)]``` (```True```), or leave them as estimated (```False```)?
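
The ```cheat``` rescaling amounts to multiplying every value by one common ratio. A minimal standalone sketch, with made-up numbers and a hypothetical helper name (```rescale_to_gap```); the zero-sum guard is an addition for this sketch and is not in the class:

```python
def rescale_to_gap(shapley_values, f_x, e_f_x):
    """Scale raw estimates so that they sum exactly to f(x) - E[f(x)]."""
    total = sum(shapley_values.values())
    if total == 0:
        # Assumption for this sketch: leave the values alone rather than divide by zero.
        return dict(shapley_values)
    ratio = (f_x - e_f_x) / total
    return {i: v * ratio for i, v in shapley_values.items()}

raw = {0: 0.02, 1: -0.05, 2: 0.01}  # made-up raw estimates, sum = -0.02
scaled = rescale_to_gap(raw, f_x=0.10, e_f_x=0.14)
print(scaled)
print(sum(scaled.values()))         # approximately -0.04
```

Relative magnitudes are preserved, but note that if the raw sum has the opposite sign to ```f(x) - E[f(x)]```, the ratio is negative and every value flips sign.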

In [9]:
from tqdm import tqdm
import numpy as np
from itertools import chain, combinations
import random

class ShapleyApproximator:
    def __init__(self, predict, data_matrix):
        self.predict = predict
        self.data_matrix = data_matrix
    
    # Creates all possible coalitions for Shapley values calculation
    def powerset(self, num_features):
        s = list(range(num_features))
        return list(chain.from_iterable(combinations(s, r) for r in range(len(s)+1)))

    def shapley_value_fast(self, index: int, x: list, class_index: int, num_coalitions: int, samples: int = 10):
        pos = [np.mean(self.data_matrix, axis=0).tolist()]
        # Seed both batches with the mean row: the empty coalition for x^n_{-j},
        # and {j} alone for x^n_{+j}, to anchor the Shapley value a little
        pos[0][index] = x[index]
        neg = [np.mean(self.data_matrix, axis=0).tolist()]
        coalitions = self.powerset(len(x))
        for _ in range(num_coalitions):
            # Pull coalition:
            coalition = random.choice(coalitions)
            for _ in range(samples): 
                # Pull sample:
                random_row = self.data_matrix[np.random.choice(self.data_matrix.shape[0])]               
                # x^n_{+j}
                xn_plus_j = [random_row[i] if i not in coalition else x[i] for i in range(len(x))]
                xn_plus_j[index] = x[index]
                # x^n_{-j}
                xn_min_j = [random_row[i] if i in coalition else x[i] for i in range(len(x))]
                xn_min_j[index] = random_row[index]
                
                pos.append(xn_plus_j)
                neg.append(xn_min_j)
        return (np.sum(self.predict(pos)[:, class_index]) - np.sum(self.predict(neg)[:, class_index])) / (samples * num_coalitions)

    # Calculates the Shapley value for a single feature
    def shapley_value(self, index: int, x, class_index: int, samples: int = 10):
        pos = [np.mean(self.data_matrix, axis=0).tolist()]
        # Seed both batches with the mean row: the empty coalition for x^n_{-j},
        # and {j} alone for x^n_{+j}, to anchor the Shapley value a little
        pos[0][index] = x[index]
        neg = [np.mean(self.data_matrix, axis=0).tolist()]
        coalitions = self.powerset(len(x))
        for coalition in coalitions:
            for _ in range(samples):
                # Pull sample:
                random_row = self.data_matrix[np.random.choice(self.data_matrix.shape[0])]
                # x^n_{+j}
                xn_plus_j = [random_row[i] if i not in coalition else x[i] for i in range(len(x))]
                xn_plus_j[index] = x[index]
                # x^n_{-j}
                xn_min_j = [random_row[i] if i in coalition else x[i] for i in range(len(x))]
                xn_min_j[index] = random_row[index]
                
                pos.append(xn_plus_j)
                neg.append(xn_min_j)
        
        return (np.sum(self.predict(pos)[:, class_index]) - np.sum(self.predict(neg)[:, class_index])) / (samples * len(coalitions))
    
    def shap_vals(self, x: list, class_index: int= 0, num_coalitions: int = 500, num_samples: int = 10, fast=True, cheat: bool = False):
        shapley_values = {}
        for i in tqdm(range(len(x))):
            if fast:
                shapley_values[i] = self.shapley_value_fast(i, x, class_index, num_coalitions, num_samples)
            else:
                shapley_values[i] = self.shapley_value(i, x, class_index, num_samples)
        
        # E[f(x)] and f(x)      
        e_f_x = np.mean(self.predict(self.data_matrix)[:, class_index])
        f_x = self.predict(np.array([x]))[0, class_index]
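        # Relative error of the raw estimates, computed before any `cheat` rescaling.
        # Note: the reference here is E[f(x)] - f(x), the opposite sign of the
        # f(x) - E[f(x)] target used by `cheat` below.
        error = abs((e_f_x - f_x) - sum(shapley_values.values())) / (e_f_x - f_x)
        # Optionally cheat a little by rescaling the Shapley values so they sum to f(x) - E[f(x)]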
        if cheat:
            ratio = (f_x - e_f_x) / sum(shapley_values.values())
            shapley_values = {i: shapley_values[i] * ratio for i in shapley_values}
                    
        return {
            "E[f(x)]": e_f_x,
            "f(x)": f_x,
            "Shapley Values": shapley_values,
            "Error": error
        }


## Example usage on one of our canned models:

In [4]:
import keras
import sys

sys.path.append('../src/')
from data_loader import DataLoader
X_train, X_test, y_train, y_test = DataLoader().load("../data/")

# Load the model
model = keras.models.load_model('../src/keras_mlp.keras')
pfunction = lambda x: model.predict(x, verbose=0)




In [10]:
x = X_test[0]  # Input to explain
# Create the Shapley approximator
approximator = ShapleyApproximator(pfunction, X_train)
# fast=True is faster but less accurate
# shapley_values = approximator.shap_vals(x, class_index=1, num_coalitions=2500, num_samples=2, fast=True)
# fast=False is slower but more accurate
# shapley_values = approximator.shap_vals(x, class_index=1, num_samples=10, fast=False)
# cheat=True rescales the Shapley values so they sum to f(x) - E[f(x)]
shapley_values = approximator.shap_vals(x, class_index=1, num_samples=10, fast=False, cheat=True)

for key in shapley_values:
    print(key, ":", shapley_values[key])
print("f(x) - E[f(x)]:", shapley_values["f(x)"] - shapley_values["E[f(x)]"])
print("Sum of Shapley Values:", np.sum(list(shapley_values["Shapley Values"].values())))

100%|██████████| 10/10 [00:21<00:00,  2.18s/it]


E[f(x)] : 0.033213135
f(x) : 5.2700335e-07
Shapley Values : {0: 0.0016679241773405122, 1: -0.0007817714125470969, 2: 0.00184488491325342, 3: -0.01129692363521916, 4: -0.005596782742638152, 5: -0.014374065098999281, 6: -0.006869393410860417, 7: -0.0024487372137574847, 8: 0.0008827292420177977, 9: 0.0037595255923099768}
Error : 1.5315589905693963
f(x) - E[f(x)]: -0.03321261
Sum of Shapley Values: -0.03321260958909988
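
To read a result like the one above, it can help to rank the features by the size of their contribution. A small sketch, with the values copied from the output above and rounded to 6 decimals (these are the ```cheat```-rescaled values):

```python
# Values copied from the "Shapley Values" output above (rounded to 6 decimals).
shapley = {
    0: 0.001668, 1: -0.000782, 2: 0.001845, 3: -0.011297, 4: -0.005597,
    5: -0.014374, 6: -0.006869, 7: -0.002449, 8: 0.000883, 9: 0.003760,
}

# Sort feature indices by absolute contribution, largest first.
ranked = sorted(shapley.items(), key=lambda kv: abs(kv[1]), reverse=True)
for feature, value in ranked[:3]:
    print(f"feature {feature}: {value:+.6f}")
# feature 5: -0.014374
# feature 3: -0.011297
# feature 6: -0.006869
```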
