# Measure performance
This notebook loads a file of precomputed measures (*qmeans*, *qbas* and *qinv*) for a set of rankings of a given dataset instance, and measures how well the alternative measures (*qbas* and *qinv*) perform.

## 1. Load libraries, model and data

In [146]:
FILENAME = 'avila_70_measures.npz'

# Import the necessary libraries
import sys
import os
PROJ_DIR = os.path.realpath(os.path.dirname(os.path.abspath('')))
sys.path.append(os.path.join(PROJ_DIR,'src'))
import xai_faithfulness_experiments_lib_edits as fl
import numpy as np

# Load data
data = fl.load_generated_data(os.path.join(PROJ_DIR, 'results', FILENAME))
qmeans = data['qmeans']
qmeans_basX = [data['qmean_bas']]
qmeans_inv = data['qmean_invs']

# Compute qmeans_bas[2-10]
def compute_qbas(measure, num_samples):
    '''
    Centres each value of `measure` on the mean of num_samples values
    drawn at random (with replacement) from `measure`.
    '''
    random_indices = np.random.randint(0, measure.shape[0], (measure.shape[0], num_samples))
    random_qmeans = measure[random_indices]
    mean = np.mean(random_qmeans, axis=1)

    # Options for dealing with std==0 (both currently disabled):
    #   1. Add some epsilon:
    #        std = np.std(random_qmeans, axis=1) + 1e-10
    #   2. Ignore std (divide by 1) only where it is 0:
    #        std = np.std(random_qmeans, axis=1); std[std==0] = 1

    # Current choice: always ignore std (divide by 1)
    std = 1
    return (measure - mean) / std
for i in range(2,11):
    qmeans_basX.append(compute_qbas(qmeans, i))

# Compute the z-scores of qmeans
qmean_mean = np.mean(qmeans)
qmean_std = np.std(qmeans)
z_scores = ((qmeans - qmean_mean) / qmean_std).flatten()

# Stratify the z-scores to be able to compare performance on different parts of the spectrum
indices = np.arange(z_scores.shape[0])
z_scores_numbered = np.vstack((z_scores, indices))
level_indices = []
boundaries = [float('-inf'), 0, 0.5, 1, 1.5, 2, 2.5]
for i in range(1,len(boundaries)+1):
    bottom_limit = boundaries[i-1]
    top_limit = float('inf')
    if i < len(boundaries):
        top_limit = boundaries[i]
    level_indices.append((z_scores_numbered[:,np.logical_and(bottom_limit<=z_scores, z_scores<top_limit)][1,:].astype(int),(bottom_limit, top_limit)))
exceptional_indices = level_indices[-1][0]

## 2. Measure performance
### 2.1. Order preservation
 1. The issue with using qmean directly is that it has no fixed scale, so it gives no idea of how good your explanation is compared to other explanations.
 2. To address this, ideally you would determine the distribution of all qmeans and then compute the z-score. That is very costly, so you either:
    1. Estimate the qmeans distribution with X samples $\rightarrow$ qbasX
    2. Calculate an alternative to the z-score directly $\rightarrow$ qinv
 3. The problem with both alternatives is that they distort the value of the original qmean measurement, so you may end up in a situation where $qmean_i<qmean_j$ but $qinv_i>qinv_j$, which is undesirable.
 4. Hence, we measure how often that happens for each measure.

 (This may be measuring much the same thing as the Spearman correlation, which is computed below)
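
The pairwise check described above is closely related to Kendall's tau: when there are no ties, the fraction $c$ of concordant pairs satisfies $\tau = 2c - 1$. The sketch below illustrates this on synthetic data. The helper `concordant_fraction` and the toy arrays are hypothetical names introduced only for this illustration, and it enumerates all pairs exactly rather than sampling them as the notebook's function does.

```python
import numpy as np
from scipy.stats import kendalltau

def concordant_fraction(truths, estimators):
    """Exact fraction of pairs (i < j) that both arrays order the same way."""
    truths = np.asarray(truths, dtype=float).ravel()
    estimators = np.asarray(estimators, dtype=float).ravel()
    # All index pairs with i < j
    i, j = np.triu_indices(truths.size, k=1)
    same_order = np.sign(truths[i] - truths[j]) == np.sign(estimators[i] - estimators[j])
    return same_order.mean()

# Synthetic stand-ins for qmeans and a noisy estimator (assumption: not real data)
rng = np.random.default_rng(0)
toy_truths = rng.normal(size=200)
toy_estimators = toy_truths + rng.normal(scale=0.5, size=200)

frac = concordant_fraction(toy_truths, toy_estimators)
tau = kendalltau(toy_truths, toy_estimators)[0]
print(f'concordant fraction: {frac:.4f}, 2c - 1: {2 * frac - 1:.4f}, Kendall tau: {tau:.4f}')
```

Enumerating all pairs is quadratic in the number of rankings, so for large sets the sampled estimate used in this notebook is the cheaper option.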

In [149]:
def measure_correct_orderings(truths, estimators):
    '''
    Draws truths.size random (x, y) index pairs and computes the fraction of them
    for which truths and estimators agree on whether element x is smaller than element y,
    i.e. (truths[x]<truths[y]) == (estimators[x]<estimators[y])
    Inputs:
        - truths & estimators contain num_elems floats each
    Output:
        - Float representing the fraction of correctly ordered pairings
    '''
    xs = np.random.permutation(truths.size)
    ys = np.random.permutation(truths.size)
    truthX_lt_Y = truths[xs] < truths[ys]
    estimatorX_lt_Y = estimators[xs] < estimators[ys]
    hits = truthX_lt_Y==estimatorX_lt_Y
    return hits.sum()/truths.size

correct_pairings_basX = []
for i in range(len(qmeans_basX)):
    correct_pairings_basX.append(measure_correct_orderings(qmeans, qmeans_basX[i]))
    print(f'qmeans_bas{i+1}: {correct_pairings_basX[i]:.4f}')
correct_pairings_inv = measure_correct_orderings(qmeans, qmeans_inv)
print(f'qmeans_inv: {correct_pairings_inv:.4f}')

qmeans_bas1: 0.7505
qmeans_bas2: 0.7888
qmeans_bas3: 0.8142
qmeans_bas4: 0.8317
qmeans_bas5: 0.8445
qmeans_bas6: 0.8553
qmeans_bas7: 0.8630
qmeans_bas8: 0.8696
qmeans_bas9: 0.8758
qmeans_bas10: 0.8808
qmeans_inv: 0.8341


### 2.2. Spearman correlation
Same question as above: is the order of qmeans preserved in qbasX/qinv?
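
As a reminder of why Spearman correlation answers an ordering question: it is the Pearson correlation of the ranks, so it is unchanged by any strictly increasing transformation of either variable. A minimal sketch on synthetic data (the arrays `x` and `y` are illustrative assumptions, not the notebook's measures):

```python
import numpy as np
from scipy.stats import spearmanr, rankdata

# Synthetic data standing in for a measure and a noisy estimator of it
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x + rng.normal(scale=0.3, size=100)

rho = spearmanr(x, y)[0]
# Pearson correlation computed on the ranks gives the same value
rho_from_ranks = np.corrcoef(rankdata(x), rankdata(y))[0, 1]
# A strictly increasing transformation (exp) leaves the ranks, and hence rho, unchanged
rho_monotone = spearmanr(np.exp(x), y)[0]

print(f'{rho:.4f} {rho_from_ranks:.4f} {rho_monotone:.4f}')
```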

In [151]:
from scipy.stats import spearmanr
spearman_basX = []
for i in range(len(qmeans_basX)):
    spearman_basX.append(spearmanr(qmeans, qmeans_basX[i])[0])
    print(f'qmeans_bas{i+1}: {spearman_basX[i]:.4f}')
spearman_inv = spearmanr(qmeans, qmeans_inv)[0]
print(f'qmeans_inv: {spearman_inv:.4f}')

qmeans_bas1: 0.6758
qmeans_bas2: 0.7690
qmeans_bas3: 0.8197
qmeans_bas4: 0.8509
qmeans_bas5: 0.8723
qmeans_bas6: 0.8881
qmeans_bas7: 0.8997
qmeans_bas8: 0.9091
qmeans_bas9: 0.9168
qmeans_bas10: 0.9231
qmeans_inv: 0.8474


### 2.3. Ability to detect exceptionally good rankings
As shown above, the estimators make some ordering errors. Are those errors in the relevant part of the distribution? That is, do they affect the ability to identify exceptionally good rankings?
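
One possible way to approach this question, offered only as a sketch and not as the notebook's intended method, is to compare the top-k rankings according to the ground truth with the top-k according to an estimator. The function `top_k_overlap` and its parameter `k` are hypothetical; in this notebook `k` could, for instance, be set to `len(exceptional_indices)`, which is an assumption.

```python
import numpy as np

def top_k_overlap(truths, estimators, k):
    """Fraction of the k highest-valued elements of `truths` that are also
    among the k highest-valued elements of `estimators`."""
    truths = np.asarray(truths).ravel()
    estimators = np.asarray(estimators).ravel()
    top_true = set(np.argsort(truths)[-k:])
    top_est = set(np.argsort(estimators)[-k:])
    return len(top_true & top_est) / k

# Toy usage: the estimator swaps the two best elements
toy_truths = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_estimators = np.array([1.0, 2.0, 3.0, 5.0, 4.0])
print(top_k_overlap(toy_truths, toy_estimators, 1))  # the single best element is missed
print(top_k_overlap(toy_truths, toy_estimators, 2))  # both top-2 elements are recovered
```

Because both sets have size k, this overlap equals both the precision and the recall of the estimator at detecting the top-k elements.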

In [None]:
#TODO

### 2.4. Ability to rank exceptionally good rankings
How well is the order preserved for exceptionally good rankings?
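
A possible starting point, given here as a hedged sketch, is to compute a rank correlation restricted to a subset of indices. The helper `subset_spearman` is a hypothetical name; in this notebook the `indices` argument could be `exceptional_indices`, which is an assumption about how the cell might eventually be filled in.

```python
import numpy as np
from scipy.stats import spearmanr

def subset_spearman(truths, estimators, indices):
    """Spearman correlation between truths and estimators,
    computed only over the elements selected by `indices`."""
    truths = np.asarray(truths).ravel()[indices]
    estimators = np.asarray(estimators).ravel()[indices]
    return spearmanr(truths, estimators)[0]

# Toy usage: the estimator scrambles the low end but preserves the order at the top
toy_truths = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
toy_estimators = np.array([5.0, 4.0, 3.0, 10.0, 11.0, 12.0])
top_indices = np.array([3, 4, 5])

rho_all = spearmanr(toy_truths, toy_estimators)[0]
rho_top = subset_spearman(toy_truths, toy_estimators, top_indices)
print(f'all: {rho_all:.4f}, top subset: {rho_top:.4f}')
```

With very few exceptional rankings the subset correlation becomes noisy, so it would be worth reporting the subset size alongside it.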

In [None]:
#TODO