# Feature Threshold Analysis
In this section, we design a functional extension to LIME/SHAP that reports the limit (boundary) values of a chosen feature, or set of features, at which the model outcome changes to the opposite class (or to one chosen class).

## Developed with miniconda Python 3.9.12

## Import Libraries

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

## Download and prepare data
url: https://archive.ics.uci.edu/ml/datasets/banknote+authentication

In [2]:
notes = pd.read_csv("/Users/binayak/Projects/BITS_MTech_Dissertation/data/data_banknote_authentication.txt", 
                    header=None, names=['variance','skewness','kurtosis','entropy','class'])
X_train, X_test, y_train, y_test = train_test_split(notes.iloc[:,0:4], notes.iloc[:,4], test_size = 0.3, 
                                                    random_state=0)

In [3]:
# Reset indexes for handling by index keys
X_train.reset_index(inplace=True, drop = True)
y_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop = True)
y_test.reset_index(inplace=True, drop=True)

In [4]:
# Initialize and train model
model = RandomForestClassifier()
model.fit(X_train,y_train)
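
A random forest trained without a fixed `random_state` can differ from run to run, which also moves any threshold derived from it. The following is a minimal sketch of how fixing the seed makes predictions repeatable. The synthetic data from `make_classification`, the helper `fit_and_predict`, and `n_estimators=20` are assumptions made for illustration and are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the banknote data set)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def fit_and_predict(seed):
    """Train a forest with a fixed seed and return its test predictions."""
    clf = RandomForestClassifier(n_estimators=20, random_state=seed)
    clf.fit(X_tr, y_tr)
    return clf.predict(X_te)

# Two independent fits with the same seed on the same data
first = fit_and_predict(seed=0)
second = fit_and_predict(seed=0)
same_seed_agree = bool((first == second).all())
print("Same seed gives identical predictions:", same_seed_agree)
```

Passing `random_state` to `RandomForestClassifier` in the cell above would make the recorded outputs in this notebook reproducible in the same way.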

## Implement the Explainer

In [5]:
class ModelExplainer:
    '''
    This class implements feature threshold analysis which is limited to binary classification based on 
    numeric features
    '''
    def __init__(self, model, training_features):
        '''
        Method initializes the explainer with the model and the training feature set. Calculates and stores 
        the mean and standard deviation of the numeric features 
        '''
        self.model = model
        self.training_features = training_features
        self.feature_details = dict()
        for feature in self.training_features.columns:
            self.feature_details[feature] = dict()
            self.feature_details[feature]['mean'] = training_features.loc[:,feature].mean()
            self.feature_details[feature]['std'] = training_features.loc[:,feature].std()
        print("Explainer initialized")
    
    def describe(self):
        '''
        Describes the model and the feature set
        '''
        print("Description:")
        print("Explainer for Model {}".format(self.model))
        print("Feature set is [{}]".format(self.training_features.columns))
        
    def analyze_one_feature_change(self,feature_data, var_feature_name, outcome, step_size=0, limit=(0,0)):
        '''
        Input:
            feature_data: data representing an instance of data set of feature vector
            var_feature_name: name of the feature which is chosen for threshold analysis
            outcome: desired outcome for which threshold analysis should be performed
            step_size: iteration step which when defaulted, is derived in the method
            limit: tuple (lower, upper) storing the lower and upper limits of iteration. Derived when defaulted
        Output:
            upper_boundary_flag: Boolean flag indicating if upper boundary is found
            upper_boundary_value: Closest upper value for outcome. Valid only when upper_boundary_flag is set
            lower_boundary_flag: Boolean flag indicating if lower boundary is found
            lower_boundary_value: Closest lower value for outcome. Valid only when lower_boundary_flag is set
            description: descriptive text of explainer output
        '''
        
        # Initialize response variables
        resp = dict()   
        upper_boundary_flag, lower_boundary_flag = False, False
        upper_boundary_value, lower_boundary_value = np.inf, -np.inf
        
        var = var_feature_name
        if (step_size == 0):  # step_size not supplied
            step_size = self.feature_details[var]['std']/10   # Here N=10 and can be made configurable

        if limit[0] == limit[1]:   #limits not supplied
            lower_limit = self.feature_details[var]['mean'] - 3 * self.feature_details[var]['std']
            upper_limit = self.feature_details[var]['mean'] + 3 * self.feature_details[var]['std']
        else:
            lower_limit = limit[0]
            upper_limit = limit[1]
        
        # Implement iterative increments to find upper value
        Z = feature_data.copy()
        Z.reset_index(inplace=True, drop=True)
        while Z.loc[0, var] < upper_limit :
            Z.loc[0, var] += step_size
            if self.model.predict(Z)[0] == outcome:
                upper_boundary_flag = True
                upper_boundary_value = Z.loc[0, var]
                break

        # Implement iterative decrements to find lower value
        Z = feature_data.copy()
        Z.reset_index(inplace=True, drop=True)
        while Z.loc[0, var] > lower_limit :
            Z.loc[0, var] -= step_size
            if self.model.predict(Z)[0] == outcome:
                lower_boundary_flag = True
                lower_boundary_value = Z.loc[0,var]
                break
                
        # Prepare response
        resp_str = "With other feature values held constant,"
        if upper_boundary_flag:
            resp_str += "Outcome can be changed to [{}] for value of [{}] set to [{}],".format(
            outcome, var, upper_boundary_value)
        else:
            resp_str += "there are no higher values of [{}] to change outcome to [{}],".format(
            var,outcome)
        if lower_boundary_flag:
            resp_str += "Outcome can be changed to [{}] for value of [{}] set to [{}]".format(
            outcome, var, lower_boundary_value)
        else:
            resp_str += "there are no lower values of [{}] to change outcome to [{}]".format(
            var,outcome)

        resp['upper_boundary_flag'] = upper_boundary_flag
        resp['upper_boundary_value'] = upper_boundary_value
        resp['lower_boundary_flag'] = lower_boundary_flag
        resp['lower_boundary_value'] = lower_boundary_value
        resp['description'] = resp_str
        return(resp)
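
When `step_size` and `limit` are left at their defaults, the method above derives them from the training statistics: the step is one tenth of the feature's standard deviation, and the search range is the mean plus or minus three standard deviations. The sketch below reproduces that arithmetic in isolation. The helper name `default_search_params` and the sample values are hypothetical, chosen only to make the numbers easy to check.

```python
import numpy as np

def default_search_params(values, n_steps_per_std=10, n_std=3):
    """Return (step_size, lower_limit, upper_limit) derived from a 1-D sample."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std(ddof=1)  # ddof=1 matches the default of pandas' Series.std()
    step_size = std / n_steps_per_std
    return step_size, mean - n_std * std, mean + n_std * std

# Toy sample (made-up numbers, not the banknote data)
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
step, lower, upper = default_search_params(sample)
print("step:", step, "lower:", lower, "upper:", upper)
```

With these defaults the search covers at most about 30 steps in each direction from the mean, so a boundary that lies outside three standard deviations is reported as not found.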

## Test output

In [6]:
# Let us take a feature instance from the test/validation set and execute model prediction
print("Predicted output for feature \n{} \nis \n[{}]".format(X_test.loc[1:1], 
                                                             model.predict(X_test.loc[1:1])))

Predicted output for feature 
   variance  skewness  kurtosis  entropy
1    5.1321 -0.031048   0.32616   1.1151 
is 
[[0]]


In [7]:
# As a use case, we will now use the ModelExplainer to check whether there are values of "kurtosis"
# for which the model prediction changes to [1] and, if so, find the nearest ones

In [8]:
# Initialize ModelExplainer class
E = ModelExplainer(model, X_train)

Explainer initialized


In [9]:
# Describe the explainer
E.describe()

Description:
Explainer for Model RandomForestClassifier()
Feature set is [Index(['variance', 'skewness', 'kurtosis', 'entropy'], dtype='object')]


In [10]:
# Analyze threshold for kurtosis for desired outcome [1]
E.analyze_one_feature_change(X_test.loc[1:1].copy(),'kurtosis',1, step_size = 0, limit = (0,0))

{'upper_boundary_flag': False,
 'upper_boundary_value': inf,
 'lower_boundary_flag': False,
 'lower_boundary_value': -inf,
 'description': 'With other feature values held constant,there are no higher values of [kurtosis] to change outcome to [1],there are no lower values of [kurtosis] to change outcome to [1]'}

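## Interpretation
In the output shown above, with the other features held constant, the explainer found neither a higher nor a lower value of kurtosis (currently 0.32616) within the default search range of mean ± 3 standard deviations that changes the prediction to 1. Note that `RandomForestClassifier()` is trained here without a fixed `random_state`, so the fitted model, and with it any boundary, can differ between runs. A lower-side value of approximately -4.77 is tested in the next section as a candidate for reversing the prediction from 0 to 1.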

## Testing the results

In [11]:
temp = X_test.loc[1:1].copy()
temp['kurtosis'] = -4.77
print(temp)
model.predict(temp)

   variance  skewness  kurtosis  entropy
1    5.1321 -0.031048     -4.77   1.1151


array([0])
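
The check above tests a single value. A sweep over a range of values shows where, if anywhere, the prediction flips. The sketch below illustrates the idea with a hypothetical stand-in model, `ToyThresholdModel`, whose decision rule (predict 1 when the third feature is below -2.0) is invented for illustration. The helper `sweep_feature` is likewise an assumption and works with any object that exposes a `predict` method accepting a 2-D array.

```python
import numpy as np

class ToyThresholdModel:
    """Hypothetical stand-in for a trained classifier: predicts 1 when column 2 < -2.0."""
    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return (X[:, 2] < -2.0).astype(int)

def sweep_feature(model, instance, col, grid):
    """Predict for copies of `instance` whose column `col` takes each value in `grid`."""
    rows = np.tile(np.asarray(instance, dtype=float), (len(grid), 1))
    rows[:, col] = grid  # vary only the chosen feature, hold the others constant
    return model.predict(rows)

instance = [5.1321, -0.031048, 0.32616, 1.1151]  # feature values of the test instance
grid = np.linspace(-5.0, 5.0, 11)                # candidate values -5, -4, ..., 5
preds = sweep_feature(ToyThresholdModel(), instance, col=2, grid=grid)
flips = grid[preds == 1]                         # grid values that yield outcome 1
print(flips)
```

Replacing the toy model with the trained forest (and passing a DataFrame row instead of a list) would show the full set of kurtosis values that flip the prediction, rather than a single tested point.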

## Current limitations and Next steps
As detailed in the dissertation research paper, the present implementation is limited to varying a single feature from a set of all-numeric features, and to a model that outputs a binary classification.
However, the intuition and design could be extended to combinations of features, discrete features, or multi-class classification.
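
As one possible direction for the multi-feature case, the search could run over a grid of value pairs and keep the candidate closest to the original instance, measured in units of each feature's standard deviation. This is only a sketch under stated assumptions, not the method from the dissertation: `ToyModel`, `nearest_flip_two_features`, the grids, and the unit standard deviations are all hypothetical.

```python
import itertools
import numpy as np

class ToyModel:
    """Hypothetical stand-in classifier: predicts 1 when x0 + x1 < 0."""
    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return (X[:, 0] + X[:, 1] < 0).astype(int)

def nearest_flip_two_features(model, instance, cols, grids, stds, outcome):
    """Return (distance, values) of the closest grid point that yields `outcome`, or None."""
    base = np.asarray(instance, dtype=float)
    best = None
    for v0, v1 in itertools.product(*grids):
        cand = base.copy()
        cand[cols[0]], cand[cols[1]] = v0, v1
        if model.predict(cand.reshape(1, -1))[0] != outcome:
            continue
        # Distance from the original instance in units of each feature's standard deviation
        dist = np.hypot((v0 - base[cols[0]]) / stds[0], (v1 - base[cols[1]]) / stds[1])
        if best is None or dist < best[0]:
            best = (dist, (v0, v1))
    return best

instance = [2.0, 1.5]
grids = (np.arange(-3.0, 3.5, 0.5), np.arange(-3.0, 3.5, 0.5))
result = nearest_flip_two_features(ToyModel(), instance, (0, 1), grids, (1.0, 1.0), outcome=1)
print(result)
```

The exhaustive grid grows quickly with the number of features, so a practical extension would likely need a coarser grid or a guided search.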