In [1]:
import os
import sys
import numpy as np
import pandas as pd
from sobol_indices.dataset_analyser import analyze
from sobol_indices.test_sensivity_indices import gaussian_data_generator

Introduction
==========

In this notebook we show how Sobol indices can be used to characterize the influence of each input variable on the output of a function. Let's first describe how the experiments will be performed.

Experimental setup
------------------

The analysis of a dataset and a function is performed with the `analyze` function.

In [2]:
def pretty_print(table):
    """Format each index as "value [lower bound, upper bound]" and keep only those columns."""
    table = table.copy()
    for index in ["S", "ST", "S_ind", "ST_ind"]:
        table[index] = np.vectorize(
            lambda index_val, index_inf_val, index_sup_val: "%.2f [%.2f, %.2f]" %
                                                            (index_val, index_inf_val, index_sup_val)
        )(table[index], table[index+"_inf"], table[index+"_sup"])
    return table[["S", "ST", "S_ind", "ST_ind"]]

def run_experiment(experiment_name, function, data_generator, data_generator_kwargs):
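    """
    Compute the Sobol indices for a given function and a given input data sample.

    `experiment_name` is only a label and is not used in the computation.
    `nsample` and `bootstrap_size` are global settings defined in the next cell.
    """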
    # use the generator to construct a data sample
    data = data_generator(**data_generator_kwargs)
    # compute the indices
    indices_df = analyze(function, data, nsample, bs=bootstrap_size)
    # return the dataframe with the results
    return pretty_print(indices_df)

To generate our data we use a function called `gaussian_data_generator`, which generates Gaussian data. It has 7 parameters:
- N: number of points in the output sample
- sigma12 (resp. sigma13, sigma23): correlation coefficient between the 1st and the 2nd variable (resp. 1st/3rd and 2nd/3rd)
- var1 (resp. var2, var3): variance of the 1st (resp. 2nd, 3rd) variable.

This allows us to control how correlation and variance affect the Sobol indices.

Now we define the parameters common to every experiment.

In [3]:
nsample = 1 * 10 ** 4  # number of points used to evaluate the Monte Carlo estimator of the Sobol indices
data_sample = 1 * 10 ** 4  # number of points sampled to build the dataset (used in the generator)
bootstrap_size = 150  # number of times the evaluation is repeated to construct confidence intervals

Trivial example
---------------------

A Sobol index can be read as:
> how much of the output variance can be explained by the variance of Xi?

This yields indices ranging from 0 to 1.
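
To make this reading concrete, here is a minimal sketch of a "pick-freeze" Monte Carlo estimator of the first-order index, assuming independent inputs. It is not necessarily how `analyze` computes its indices; `first_order_sobol` and `standard_normal` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def first_order_sobol(f, sampler, n, dim, seed=0):
    """Pick-freeze estimate of the first-order Sobol indices (independent inputs)."""
    rng = np.random.default_rng(seed)
    a = sampler(rng, n, dim)  # first independent sample
    b = sampler(rng, n, dim)  # second independent sample
    y_a = f(a)
    indices = np.empty(dim)
    for i in range(dim):
        # keep ("freeze") column i from sample a, take every other column from b
        mixed = b.copy()
        mixed[:, i] = a[:, i]
        y_mixed = f(mixed)
        # Cov(Y, Y_mixed) estimates Var(E[Y | X_i])
        cov = np.mean(y_a * y_mixed) - y_a.mean() * y_mixed.mean()
        indices[i] = cov / y_a.var()
    return indices

def standard_normal(rng, n, dim):
    """Independent, centered, unit-variance Gaussian inputs (an assumption of this sketch)."""
    return rng.standard_normal((n, dim))

estimates = first_order_sobol(lambda x: 2 * x[:, 0] + x[:, 1], standard_normal, 200_000, 3)
print(np.round(estimates, 2))
```

The two outputs `y_a` and `y_mixed` share only the value of the i-th input, so their covariance isolates the part of the output variance that this input explains on its own.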

The next cell shows the indices values for the following function:
$$ f(x) = 2x_0 + x_1 $$
when the inputs follow independent, centered Gaussian distributions.

In [4]:
func1 = lambda x: 2 * x[:, 0] + 1 * x[:, 1]
run_experiment(
        "simple example",
        function=func1,
        data_generator=gaussian_data_generator,
        data_generator_kwargs=dict(sigma12=0., sigma13=0., sigma23=0., N=data_sample))

Unnamed: 0,S,ST,S_ind,ST_ind
X0,"0.80 [0.74, 0.85]","0.80 [0.76, 0.84]","0.79 [0.75, 0.84]","0.79 [0.75, 0.84]"
X1,"0.20 [0.17, 0.24]","0.20 [0.19, 0.21]","0.20 [0.17, 0.24]","0.20 [0.17, 0.24]"
X2,"0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]"


This shows that most of the output variance can be explained by x0 and that some of it can be explained by x1.

As we can also see, S, ST, S_ind and ST_ind have the same values. This is because the inputs are independent and the function has no joint effects.
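
These values can be cross-checked by hand. For a linear function of independent inputs, each first-order index is the share of the output variance contributed by one term. The sketch below assumes the three inputs have equal (unit) variances, which the cell above does not set explicitly; `linear_first_order_indices` is a hypothetical helper, not part of the `sobol_indices` package.

```python
import numpy as np

def linear_first_order_indices(coefficients, variances):
    """First-order Sobol indices of f(x) = sum(a_i * x_i) with independent inputs."""
    coefficients = np.asarray(coefficients, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # each term a_i * x_i contributes a_i^2 * Var(x_i) to the output variance
    contributions = coefficients ** 2 * variances
    return contributions / contributions.sum()

# f(x) = 2 * x0 + 1 * x1 + 0 * x2, with unit variances assumed
indices = linear_first_order_indices([2.0, 1.0, 0.0], [1.0, 1.0, 1.0])
print(indices)
```

Under that assumption the shares are 4/5, 1/5 and 0, in line with the 0.80 and 0.20 reported in the table.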

Scale independence (outputs)
-----------------------------


Now we show a basic property of Sobol indices: scaling the output does not change the indices. This is done with two functions:
$$ f(x)=x_0 $$
$$ f(x)=10x_0 $$
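
The property can also be seen from the usual variance-ratio definition of the first-order index. The notebook does not state that formula, so the definition used here is an assumption. For any constant $c \neq 0$:
$$ S_i(cf) = \frac{\mathrm{Var}\left(E[c\,f(X) \mid X_i]\right)}{\mathrm{Var}\left(c\,f(X)\right)} = \frac{c^2\,\mathrm{Var}\left(E[f(X) \mid X_i]\right)}{c^2\,\mathrm{Var}\left(f(X)\right)} = S_i(f) $$
The factor $c^2$ appears in both the numerator and the denominator and cancels, so with $c = 10$ the two functions share the same indices.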

In [5]:
func2 = lambda x: x[:, 0]
func2bis = lambda x: 10 * x[:, 0]

In [6]:
run_experiment(
        "output_scaling_1",
        function=func2,
        data_generator=gaussian_data_generator,
        data_generator_kwargs=dict(sigma12=0., sigma13=0., sigma23=0., N=data_sample))

Unnamed: 0,S,ST,S_ind,ST_ind
X0,"1.00 [0.94, 1.00]","1.00 [0.94, 1.00]","1.00 [0.95, 1.00]","1.00 [0.95, 1.00]"
X1,"0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]"
X2,"0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]"


In [7]:
run_experiment(
        "output_scaling_2",
        function=func2bis,
        data_generator=gaussian_data_generator,
        data_generator_kwargs=dict(sigma12=0., sigma13=0., sigma23=0., N=data_sample))

Unnamed: 0,S,ST,S_ind,ST_ind
X0,"1.00 [0.96, 1.00]","1.00 [0.96, 1.00]","1.00 [0.94, 1.00]","1.00 [0.94, 1.00]"
X1,"0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]"
X2,"0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]"


Scale independence (inputs)
-----------------------------------------

If we scale **all** inputs by the **same** factor, the indices won't change.

In [8]:
run_experiment(
    "input_scaling_1",
    function=func2,
    data_generator=gaussian_data_generator,
    data_generator_kwargs=dict(sigma12=0., sigma13=0., sigma23=0., var1=1., N=data_sample))

Unnamed: 0,S,ST,S_ind,ST_ind
X0,"1.00 [0.94, 1.00]","1.00 [0.94, 1.00]","1.00 [0.94, 1.00]","1.00 [0.94, 1.00]"
X1,"0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]"
X2,"0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]"


Now we change the variance of x0 from 1 to 10. Since `func2` depends only on x0, the indices are not expected to change.

In [9]:
run_experiment(
    "input_scaling_2",
    function=func2,
    data_generator=gaussian_data_generator,
    data_generator_kwargs=dict(sigma12=0., sigma13=0., sigma23=0., var1=10, N=data_sample))

Unnamed: 0,S,ST,S_ind,ST_ind
X0,"1.00 [0.95, 1.00]","1.00 [0.95, 1.00]","1.00 [0.96, 1.00]","1.00 [0.95, 1.00]"
X1,"0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]"
X2,"0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]"


Impact of input distribution
---------------------------------------

The indices can be affected by changes in the input distribution. For instance, the function
$$ f(x) = x_0 + x_1 $$
does not give the same result when the input variances are $$ \sigma_0^2 = 1 \quad \sigma_1^2 = 1 $$ and when they are $$ \sigma_0^2 = 10 \quad \sigma_1^2 = 1 $$

In [10]:
func3 = lambda x: x[:, 0] + x[:, 1]

In [11]:
run_experiment(
    "distribution_1",
    function=func3,
    data_generator=gaussian_data_generator,
    data_generator_kwargs=dict(sigma12=0., sigma13=0., sigma23=0., var1=1., N=data_sample))

Unnamed: 0,S,ST,S_ind,ST_ind
X0,"0.50 [0.46, 0.54]","0.50 [0.47, 0.52]","0.51 [0.47, 0.56]","0.51 [0.46, 0.55]"
X1,"0.49 [0.44, 0.54]","0.49 [0.47, 0.52]","0.50 [0.46, 0.56]","0.50 [0.45, 0.55]"
X2,"0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]"


In [12]:
run_experiment(
        "distribution_2",
        function=func3,
        data_generator=gaussian_data_generator,
        data_generator_kwargs=dict(sigma12=0., sigma13=0., sigma23=0., var1=10., N=data_sample))

Unnamed: 0,S,ST,S_ind,ST_ind
X0,"0.85 [0.80, 0.91]","0.86 [0.81, 0.90]","0.86 [0.81, 0.92]","0.86 [0.82, 0.91]"
X1,"0.14 [0.11, 0.17]","0.14 [0.13, 0.15]","0.14 [0.11, 0.17]","0.14 [0.12, 0.17]"
X2,"0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]"


Impact of correlations
-------------------------------

The indices are also affected by correlations between the inputs. Comparing the values of S with S_ind and of ST with ST_ind can reveal this.

We illustrate this with the function:
$$ f(x)=x_0 $$

In [13]:
run_experiment(
        "correlation_1",
        function=func2,
        data_generator=gaussian_data_generator,
        data_generator_kwargs=dict(sigma12=0., sigma13=0., sigma23=0., N=data_sample))

Unnamed: 0,S,ST,S_ind,ST_ind
X0,"1.00 [0.95, 1.00]","1.00 [0.95, 1.00]","1.00 [0.96, 1.00]","1.00 [0.96, 1.00]"
X1,"0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]"
X2,"0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]"


Now we add a correlation coefficient of 0.6 between x0 and x1.

In [14]:
run_experiment(
        "correlation_2",
        function=func2,
        data_generator=gaussian_data_generator,
        data_generator_kwargs=dict(sigma12=0.6, sigma13=0., sigma23=0., N=data_sample))

Unnamed: 0,S,ST,S_ind,ST_ind
X0,"1.00 [0.95, 1.00]","1.00 [0.95, 1.00]","0.65 [0.61, 0.70]","0.65 [0.61, 0.70]"
X1,"0.35 [0.31, 0.38]","0.35 [0.33, 0.37]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]"
X2,"0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]","0.00 [0.00, 0.00]"
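
The value of S for X1 can be sanity-checked without the library. For jointly Gaussian variables, the conditional expectation of x0 given x1 is linear in x1, so the share of the variance of f(x) = x0 explained by x1 is the squared correlation, 0.6² = 0.36. The sketch below is a standalone check with NumPy under unit variances; it does not call `analyze` or `gaussian_data_generator`.

```python
import numpy as np

rho = 0.6
rng = np.random.default_rng(0)
cov = np.array([[1.0, rho],
                [rho, 1.0]])  # unit variances, correlation rho between x0 and x1
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=200_000)
y = x[:, 0]  # f(x) = x0

# Var(E[Y | X1]) / Var(Y) equals corr(Y, X1)^2 in the jointly Gaussian case
s_x1 = np.corrcoef(y, x[:, 1])[0, 1] ** 2
print(round(s_x1, 2), round(rho ** 2, 2))
```

The table reports 0.35 for X1, close to 0.36. The S_ind of 0.65 for X0 is likewise close to 1 - 0.36 = 0.64, the share of x0's variance not shared with x1. Reading S_ind that way is an assumption about the library, not something the notebook states.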


Joint effects
------------------

Sometimes a variable has no influence by itself but affects the output jointly with another variable.

One example of this is the following function:
$$ f(x) = x_0 \text{ when } x_1 > 0.5 \text{ and } x_2 > 0.5 $$
$$ f(x) = -x_0 \text{ otherwise} $$

In [15]:
func4 = lambda x: x[:, 0] * (((x[:, 1] > 0.5) * (x[:, 2] > 0.5) * 2) - 1)
run_experiment(
    "joint_effect_1",
    function=func4, data_generator=gaussian_data_generator,
    data_generator_kwargs=dict(sigma12=0., sigma13=0., sigma23=0., var1=1., N=data_sample))

Unnamed: 0,S,ST,S_ind,ST_ind
X0,"0.65 [0.58, 0.72]","1.00 [0.95, 1.00]","0.65 [0.55, 0.73]","1.00 [0.95, 1.00]"
X1,"0.00 [0.00, 0.04]","0.27 [0.20, 0.34]","0.00 [0.00, 0.04]","0.27 [0.21, 0.33]"
X2,"0.00 [0.00, 0.03]","0.27 [0.22, 0.32]","0.00 [0.00, 0.04]","0.27 [0.21, 0.33]"
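
These numbers can be reproduced analytically, assuming independent standard normal inputs (unit variances, zero means). Write q = P(x_1 > 0.5) and p = q² for the probability that the sign is +1. Then Var(f) = 1, E[f | x_0] = (2p - 1) x_0, and the variance of f once x_0 and x_2 are fixed is 4q(1 - q) x_0² when x_2 > 0.5 and 0 otherwise. The helper `normal_tail` is introduced only for this sketch.

```python
import math

def normal_tail(threshold):
    """P(Z > threshold) for a standard normal Z."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))

q = normal_tail(0.5)   # P(x1 > 0.5) = P(x2 > 0.5)
p = q * q              # probability that f(x) = +x0

# first-order index of x0: Var(E[f | x0]) / Var(f), with Var(f) = 1
s_x0 = (2.0 * p - 1.0) ** 2

# total index of x1: E[Var(f | x0, x2)] / Var(f)
st_x1 = 4.0 * q * q * (1.0 - q)

print(round(s_x0, 3), round(st_x1, 3))
```

The results, roughly 0.66 and 0.26, agree with the 0.65 and 0.27 in the table within the reported confidence intervals. The first-order indices of x1 and x2 are zero because E[f | x_1] = 0 for every value of x_1.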


CVM indices
==========

There are two main differences between Sobol indices and CVM indices:
1. Sobol indices are variance-based, while CVM indices are based on ranks.
2. Sobol indices need access to the function, while CVM indices need only a sample of evaluations.

In [23]:
from CVM_indices.CVM_draft import analyze as cvm_analyse

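def run_cvm_experiment(experiment_name, function, data_generator, data_generator_kwargs, real_function=None):
    """
    Compute the CVM indices of a function, or of its absolute error against `real_function`.

    `function` and `real_function` are (callable, description) tuples; only the callable is used.
    """
    data = data_generator(**data_generator_kwargs)
    if real_function is not None:
        # analyse the absolute prediction error instead of the raw output
        y = np.abs(function[0](data) - real_function[0](data))
    else:
        y = function[0](data)
    # one column per input variable, plus the analysed output in column "Y"
    x = pd.DataFrame(data, columns=["x_{}".format(i) for i in range(data.shape[1])])
    x["Y"] = y
    return cvm_analyse(x=x, output_var="Y")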

We now show how to analyse the error of a predictor using CVM indices.
This is done by using two functions:
- the *true* function, which is sampled in the dataset (our `y_true`)
- the *estimated* function, which is learned (our `y_pred`)

In [21]:
# each entry is a (callable, description) tuple
func_real1 = (lambda x: x[:, 0] + x[:, 1], 'f(x) -> X_0 + X_1')
func_pred1 = (lambda x: x[:, 0] + x[:, 1], 'f(x) -> X_0 + X_1')
func_pred2 = (lambda x: x[:, 0], 'f(x) -> X_0')

In [24]:
run_cvm_experiment(
    "CVM_error_1",
    function=func_pred1, real_function=func_real1, data_generator=gaussian_data_generator,
    data_generator_kwargs=dict(sigma12=0., sigma13=0., sigma23=0., var1=1., N=data_sample))

Unnamed: 0,CVM,CVM_indep
x_0,0.0068,0.006405
x_1,0.004391,0.0
x_2,0.00625,0.0


In [26]:
run_cvm_experiment(
    "CVM_error_2",
    function=func_pred2, real_function=func_real1, data_generator=gaussian_data_generator,
    data_generator_kwargs=dict(sigma12=0., sigma13=0., sigma23=0., var1=1., N=data_sample))

Unnamed: 0,CVM,CVM_indep
x_0,0.006956,0.0
x_1,0.999562,0.909522
x_2,0.002992,0.0
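
The two results follow from what the error looks like in each case. With `func_pred1` the prediction equals the true function, so the error is identically zero and no variable stands out. With `func_pred2` the error is |x_0 - (x_0 + x_1)| = |x_1|, which depends on x_1 alone. The short check below uses a plain NumPy sample in place of `gaussian_data_generator`, which is an assumption made to keep the snippet self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 3))  # stand-in for the generated dataset

real = lambda x: x[:, 0] + x[:, 1]     # true function
pred = lambda x: x[:, 0]               # estimated function that ignores x1

error = np.abs(pred(data) - real(data))
# the absolute error is exactly |x1|: x0 cancels and x2 never appears
print(np.allclose(error, np.abs(data[:, 1])))
```

This is why x_1 receives a CVM index close to 1 in the second experiment while x_0 and x_2 stay near 0.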


Conclusions
==========

- Sobol indices range from 0 (no influence) to 1 (all the output variance is explained by the variable).
- They account for both the input distribution and the function.
- Correlations are captured by the regular indices but not by the independent indices.
- Joint effects are captured by the total indices but not by the regular indices.

Main limitations:
- Interactions are modeled with a Gaussian copula, which captures only linear correlations and may not work for degenerate cases. The theory allows the use of other copulas and other decompositions (the OpenTURNS library contains an API for those).
- They do not allow constructing fairness criteria based on errors (e.g. equality of odds), because the sampling evaluates the function on points that are not necessarily in the dataset.

Applications
==========

- Detection of biases: traditional fairness
- Integration of domain knowledge: by regularising training
- Dataset shift: by comparing indices between train and test sets
- Improving applications where correlation is used: for instance, explainability