# <center> Welcome to PyExplainer Quickstart Guide </center>

# Top Note - MUST READ !!
#### 1. When initialising the PyExplainer object, prepare the following 5 parameters with the data types shown
(1) X_train (pd.core.frame.DataFrame) - feature columns from the training data <br><br> 
(2) y_train (pd.core.series.Series) - label column from the training data <br><br>
(3) indep (pd.core.indexes.base.Index) - names of the feature columns; most of the time you can get them from 'X_train.columns' <br><br>
(4) dep (str) - name of the label column<br><br>
(5) blackbox_model - any supervised classification model trained with the sklearn library<br><br>
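
A minimal sketch of a pre-flight check for these five arguments. The helper name `check_init_args` and the `DummyModel` class are hypothetical and not part of PyExplainer; the check only mirrors the types listed above and assumes that any object with a `predict` method can stand in for a trained sklearn classifier.

```python
import pandas as pd


def check_init_args(X_train, y_train, indep, dep, blackbox_model):
    """Hypothetical helper (not part of PyExplainer): list any type problems."""
    problems = []
    if not isinstance(X_train, pd.DataFrame):
        problems.append("X_train should be a pandas DataFrame")
    if not isinstance(y_train, pd.Series):
        problems.append("y_train should be a pandas Series")
    if not isinstance(indep, pd.Index):
        problems.append("indep should be a pandas Index, e.g. X_train.columns")
    if not isinstance(dep, str):
        problems.append("dep should be a str")
    # Assumption: a trained sklearn classifier exposes a predict() method.
    if not hasattr(blackbox_model, "predict"):
        problems.append("blackbox_model should expose predict()")
    return problems


class DummyModel:
    """Stand-in for a trained classifier, used only for this demo."""

    def predict(self, X):
        return [False] * len(X)


X_train = pd.DataFrame({"la": [10, 20], "nd": [1, 2]})
y_train = pd.Series([False, True], name="defect")

# Correct types: no problems reported.
ok = check_init_args(X_train, y_train, X_train.columns, "defect", DummyModel())
# A plain list for y_train and for indep: two problems reported.
bad = check_init_args(X_train, y_train.tolist(), list(X_train.columns), "defect", DummyModel())
print(ok)
print(bad)
```

Running a check like this before constructing the object can make type mistakes easier to spot than an error raised deep inside the library.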

#### 2. When calling the explain() function of the PyExplainer object, prepare the following 2 parameters with the data types shown
(1) X_explain (pd.core.frame.DataFrame) - one row of feature data <br><br> 
(2) y_explain (pd.core.series.Series) - one row of label data for the same instance (in PART B the predicted and actual labels of that row are identical)
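
The sketch below shows one way to obtain a one-row DataFrame and a one-element Series with pandas. The toy data and file names are made up for illustration; the point is that indexing with a list of positions keeps the container type, while a scalar position does not.

```python
import pandas as pd

X = pd.DataFrame(
    {"la": [10, 20, 30], "nd": [1, 2, 3]},
    index=["a.java", "b.java", "c.java"],
)
y = pd.Series([False, True, True], index=X.index, name="RealBug")

row = 1
X_explain = X.iloc[[row]]   # list of positions -> one-row DataFrame
y_explain = y.iloc[[row]]   # list of positions -> one-element Series
not_a_frame = X.iloc[row]   # scalar position -> Series, not a DataFrame

print(type(X_explain).__name__, X_explain.shape)
print(type(y_explain).__name__, y_explain.shape)
print(type(not_a_frame).__name__)
```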

#### 3. Be careful when using a custom pandas index for Series and DataFrame 
In our Full Tutorial (PART B) example, the File column is used as the custom index.<br>  
It is fine if you don't have a custom index; pandas will generate a default row index starting from 0.<br><br>
If you do use a custom index, use it consistently throughout your data processing.<br><br>
Otherwise, some of your data may carry the pandas default index while the rest carries your custom index,
which will trigger errors or misaligned rows whenever you try to combine your DataFrame and Series. 
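
A small sketch of the problem, using made-up file names and values. In this particular case pandas does not raise an exception; the mismatch shows up as missing values after the join, which can then cause failures in later steps.

```python
import pandas as pd

# Features indexed by file name (custom index).
X = pd.DataFrame(
    {"CountLine": [171, 123]},
    index=pd.Index(["A.java", "B.java"], name="File"),
)

# Predictions built without the index fall back to the default 0, 1, ... labels.
preds_default = pd.DataFrame({"PredictedBug": [False, True]})
# Predictions built with the same index line up row by row.
preds_aligned = pd.DataFrame({"PredictedBug": [False, True]}, index=X.index)

mismatched = X.join(preds_default)  # no shared index labels -> all missing
aligned = X.join(preds_aligned)     # shared index labels -> values carried over

print(mismatched)
print(aligned)
```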

---

# PART A - Quick Start

## 1. Prepare data and model

Note: We use the default data and model here as an example.

### 1.1 Import required library

In [1]:
from pyexplainer import pyexplainer_pyexplainer

### 1.2 Obtain default dataset and global model (Random Forest), then initialise PyExplainer

In [2]:
default_data_and_model = pyexplainer_pyexplainer.get_dflt()
py_explainer = pyexplainer_pyexplainer.PyExplainer(X_train = default_data_and_model['X_train'],
                           y_train = default_data_and_model['y_train'],
                           indep = default_data_and_model['indep'],
                           dep = default_data_and_model['dep'],
                           blackbox_model = default_data_and_model['blackbox_model'])

## 2. Explain an instance with the PyExplainer object

### 2.1 Prepare the instance to be explained

In [3]:
X_explain = default_data_and_model['X_explain']
y_explain = default_data_and_model['y_explain']

### 2.2 Create rules

In [4]:
created_rules = py_explainer.explain(X_explain=X_explain,
                                     y_explain=y_explain,
                                     search_function='crossoverinterpolation')

## 3. Create interactive visualization

Move the sliders to change feature values and observe how the risk score changes.

In [5]:
py_explainer.visualise(created_rules)

[Interactive widget output: a "Risk Score" progress bar and one slider per rule: "#1 The value of rtime is less than 32.52", "#2 The value of rrexp is more than 2374.0", "#3 The value of age is more than 0.67"]

# PART B - Full Tutorial

## 1. Prepare sample data and model

### 1.1 For simplicity, we load the sample DataFrame that is already included in the package

In [17]:
import pandas as pd
import numpy as np
from pyexplainer import pyexplainer_pyexplainer

df = pyexplainer_pyexplainer.load_sample_data()
df.head(3)

| | File | CountDeclMethodPrivate | AvgLineCode | CountLine | MaxCyclomatic | CountDeclMethodDefault | AvgEssential | CountDeclClassVariable | SumCyclomaticStrict | AvgCyclomatic | ... | OWN_LINE | OWN_COMMIT | MINOR_COMMIT | MINOR_LINE | MAJOR_COMMIT | MAJOR_LINE | RealBug | HeuBug | HeuBugCount | RealBugCount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | activemq-console/src/main/java/org/apache/acti... | 0 | 10 | 171 | 5 | 0 | 2 | 0 | 18 | 2 | ... | 1.0 | 1.0 | 0 | 1 | 1 | 0 | False | False | 0 | 0 |
| 1 | activemq-console/src/main/java/org/apache/acti... | 0 | 8 | 123 | 5 | 0 | 1 | 1 | 15 | 3 | ... | 0.98374 | 0.5 | 0 | 1 | 2 | 1 | False | False | 0 | 0 |
| 2 | activemq-console/src/main/java/org/apache/acti... | 0 | 7 | 136 | 5 | 0 | 1 | 1 | 16 | 2 | ... | 1.0 | 1.0 | 0 | 1 | 1 | 0 | False | False | 0 | 0 |


### 1.2 Define index column (OPTIONAL) and drop unwanted columns
##### First, we set the 'File' column as the index, since it identifies the file we want to inspect and is neither a feature nor a label
##### We use 'RealBug' as the label column, and the columns before 'RealBug' as feature columns
##### Then we drop the unnecessary columns (e.g. File, HeuBug, HeuBugCount, RealBugCount)

In [18]:
df = df.set_index(df['File'])
df = df.drop(['File', 'HeuBug', 'HeuBugCount', 'RealBugCount'], axis=1)
df.head(3)

| File | CountDeclMethodPrivate | AvgLineCode | CountLine | MaxCyclomatic | CountDeclMethodDefault | AvgEssential | CountDeclClassVariable | SumCyclomaticStrict | AvgCyclomatic | AvgLine | ... | DDEV | Added_lines | Del_lines | OWN_LINE | OWN_COMMIT | MINOR_COMMIT | MINOR_LINE | MAJOR_COMMIT | MAJOR_LINE | RealBug |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| activemq-console/src/main/java/org/apache/activemq/console/command/AbstractAmqCommand.java | 0 | 10 | 171 | 5 | 0 | 2 | 0 | 18 | 2 | 18 | ... | 1 | 32 | 18 | 1.0 | 1.0 | 0 | 1 | 1 | 0 | False |
| activemq-console/src/main/java/org/apache/activemq/console/command/AbstractCommand.java | 0 | 8 | 123 | 5 | 0 | 1 | 1 | 15 | 3 | 17 | ... | 2 | 30 | 28 | 0.98374 | 0.5 | 0 | 1 | 2 | 1 | False |
| activemq-console/src/main/java/org/apache/activemq/console/command/AbstractJmxCommand.java | 0 | 7 | 136 | 5 | 0 | 1 | 1 | 16 | 2 | 13 | ... | 1 | 8 | 8 | 1.0 | 1.0 | 0 | 1 | 1 | 0 | False |
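
As a side note, passing the column name to `set_index` moves that column into the index and removes it from the columns in one step. The sketch below uses a made-up toy frame rather than the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "File": ["A.java", "B.java"],
    "CountLine": [171, 123],
    "RealBug": [False, True],
    "HeuBug": [False, False],
})

# 'File' becomes the index and is no longer a regular column,
# so only the other unwanted columns need to be dropped.
indexed = df.set_index("File").drop(columns=["HeuBug"])
print(indexed)
```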


### 1.3 Define feature cols (X), and label col (y)

In [26]:
# select all rows, and all feature cols
# the last col, which is label col, is not selected
X = df.iloc[:, :-1]
# select all rows, and the last label col
y = df.iloc[:, -1]

print('feature cols:', '\n\n', X.head(1), '\n\n')
print('label col:', '\n\n', y.head(1))

feature cols: 

                                                     CountDeclMethodPrivate  \
File                                                                         
activemq-console/src/main/java/org/apache/activ...                       0   

                                                    AvgLineCode  CountLine  \
File                                                                         
activemq-console/src/main/java/org/apache/activ...           10        171   

                                                    MaxCyclomatic  \
File                                                                
activemq-console/src/main/java/org/apache/activ...              5   

                                                    CountDeclMethodDefault  \
File                                                                         
activemq-console/src/main/java/org/apache/activ...                       0   

                                                    AvgEssential  \
[output truncated]
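
Positional slicing with `iloc` relies on the label being the last column. A sketch of an alternative that selects by column name, shown on a made-up toy frame; it gives the same split as long as 'RealBug' is the label column.

```python
import pandas as pd

df = pd.DataFrame(
    {"CountLine": [171, 123], "MaxCyclomatic": [5, 5], "RealBug": [False, True]},
    index=pd.Index(["A.java", "B.java"], name="File"),
)

label = "RealBug"
X = df.drop(columns=[label])  # every column except the label
y = df[label]                 # the label column as a Series

print(list(X.columns))
print(y.name)
```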

### 1.4 Split data into training and testing sets

In [89]:
from sklearn.model_selection import train_test_split
# 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) 

## 2. Training and Predicting

### 2.1 Train a RandomForest model using sklearn

In [97]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, random_state=0)
rf_model.fit(X_train, y_train)

RandomForestClassifier(random_state=0)

### 2.2 Generate predictions

In [98]:
# generate predictions from the model; predict() returns a NumPy array of predicted labels
y_preds = rf_model.predict(X_test) 
# create a DataFrame which only has predicted label column
y_preds = pd.DataFrame(data={'PredictedBug': y_preds}, index=y_test.index) 
y_preds.head(3)

Unnamed: 0_level_0,PredictedBug
File,Unnamed: 1_level_1
activemq-core/src/main/java/org/apache/activemq/thread/Scheduler.java,False
activemq-core/src/main/java/org/apache/activemq/store/TransactionRecoveryListener.java,False
activemq-core/src/test/java/org/apache/activemq/openwire/v3/KeepAliveInfoTest.java,False


## 3. Prediction post processing

### 3.1 Combine feature cols, label col, and the predicted col in testing set

In [99]:
combined_testing_data = X_test.join(y_test.to_frame())
combined_testing_data = combined_testing_data.join(y_preds)
combined_testing_data.head(3)
# total num of rows
total_rows = len(combined_testing_data)

### 3.2 Filter out wrongly predicted rows 

In [119]:
correctly_predicted_data = combined_testing_data[combined_testing_data['RealBug']==combined_testing_data['PredictedBug']]
correctly_predicted_rows = len(correctly_predicted_data)
print('The model correctly predicted ', round(100 * correctly_predicted_rows / total_rows, 1), '% of testing data')

The model correctly predicted  88.9 % of testing data
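
The same accuracy figure can also be computed without filtering first, by taking the mean of a boolean comparison. A sketch on made-up labels (the column names match the tutorial, the values do not):

```python
import pandas as pd

combined = pd.DataFrame({
    "RealBug":      [True, False, False, True, False],
    "PredictedBug": [True, False, True,  True, False],
})

# Comparing the two columns gives a boolean Series; its mean is the accuracy.
is_correct = combined["RealBug"] == combined["PredictedBug"]
accuracy = is_correct.mean()
print(f"The model correctly predicted {accuracy:.1%} of the testing data")

# Cross-tabulation of actual vs predicted labels.
confusion = pd.crosstab(combined["RealBug"], combined["PredictedBug"])
print(confusion)
```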


### 3.3 We focus on buggy files, so we filter out the non-buggy files

In [122]:
correctly_predicted_bug = correctly_predicted_data[correctly_predicted_data['RealBug']==True]
correctly_predicted_bug.head(3)

| File | CountDeclMethodPrivate | AvgLineCode | CountLine | MaxCyclomatic | CountDeclMethodDefault | AvgEssential | CountDeclClassVariable | SumCyclomaticStrict | AvgCyclomatic | AvgLine | ... | Added_lines | Del_lines | OWN_LINE | OWN_COMMIT | MINOR_COMMIT | MINOR_LINE | MAJOR_COMMIT | MAJOR_LINE | RealBug | PredictedBug |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| activemq-core/src/test/java/org/apache/activemq/transport/fanout/FanoutTransportBrokerTest.java | 0 | 10 | 203 | 2 | 0 | 1 | 1 | 12 | 1 | 14 | ... | 38 | 44 | 0.738916 | 0.8 | 0 | 2 | 2 | 1 | True | True |
| activemq-core/src/main/java/org/apache/activemq/broker/region/policy/PendingDurableSubscriberMessageStoragePolicy.java | 0 | 1 | 41 | 1 | 1 | 1 | 0 | 1 | 1 | 10 | ... | 21 | 18 | 0.536585 | 1.0 | 0 | 2 | 1 | 0 | True | True |
| activemq-core/src/main/java/org/apache/activemq/kaha/impl/container/BaseContainerImpl.java | 0 | 5 | 230 | 6 | 0 | 1 | 1 | 43 | 2 | 5 | ... | 108 | 111 | 0.613043 | 0.666667 | 0 | 2 | 2 | 0 | True | True |


### 3.4 Define feature cols and label col using correctly predicted testing data

In [123]:
# select all rows and feature cols
feature_cols = correctly_predicted_bug.iloc[:, :-2]
# select all rows and one label col; -2 is RealBug (PredictedBug would also work since the two are identical here)
label_col = correctly_predicted_bug.iloc[:, -2]

### 3.5 Select one row of correctly predicted bug to be explained

In [124]:
# decide which row to select
selected_row = 0
# select that row from feature_cols; the double brackets keep it as a one-row DataFrame
X_explain = feature_cols.iloc[[selected_row]]
# select the corresponding label from label_col as a one-element Series
y_explain = label_col.iloc[[selected_row]]
print('one row of feature:', '\n\n', X_explain, '\n')
print('one row of label:', '\n\n', y_explain)

one row of feature: 

                                                     CountDeclMethodPrivate  \
File                                                                         
activemq-core/src/test/java/org/apache/activemq...                       0   

                                                    AvgLineCode  CountLine  \
File                                                                         
activemq-core/src/test/java/org/apache/activemq...           10        203   

                                                    MaxCyclomatic  \
File                                                                
activemq-core/src/test/java/org/apache/activemq...              2   

                                                    CountDeclMethodDefault  \
File                                                                         
activemq-core/src/test/java/org/apache/activemq...                       0   

                                                    AvgEssential

## 4. Create rules (explanations) and visualise them!

### 4.1 Initialise a PyExplainer object

In [125]:
from pyexplainer import pyexplainer_pyexplainer

py_explainer = pyexplainer_pyexplainer.PyExplainer(X_train = X_train,
                                                   y_train = y_train,
                                                   indep = X_train.columns,
                                                   dep = 'RealBug',
                                                   blackbox_model = rf_model)

### 4.2 Create rules by calling the explain function of the PyExplainer object
##### Attention: This step can be time-consuming

In [126]:
rules = py_explainer.explain(X_explain=X_explain,
                             y_explain=y_explain,
                             search_function='crossoverinterpolation')

##### The created rules are stored in a dictionary. For more information about what each key contains, please refer to the 'Appendix' section.

In [129]:
rules.keys()

dict_keys(['synthetic_data', 'synthetic_predictions', 'X_explain', 'y_explain', 'indep', 'dep', 'top_k_positive_rules', 'top_k_negative_rules', 'local_rulefit_model'])
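
A quick way to get an overview of such a dictionary is to list the type of each value. The sketch below uses a small stand-in dictionary with only three of the keys above and made-up values; a real object would come from explain().

```python
import numpy as np
import pandas as pd

# Stand-in with a few of the keys listed above (values are made up).
rules = {
    "synthetic_data": pd.DataFrame({"la": [194.0, 130.0]}),
    "synthetic_predictions": np.array([False, True]),
    "dep": "defect",
}

# Map each key to the type name of its value.
summary = {key: type(value).__name__ for key, value in rules.items()}
for key, type_name in summary.items():
    print(f"{key:<24}{type_name}")
```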

### 4.3 Call the visualise function of the PyExplainer object to visualise the created rules 

In [128]:
py_explainer.visualise(rules)

[Interactive widget output: a "Risk Score" progress bar and a slider for the rule "#1 The value of MAJOR_COMMIT is more than 2"]

# Appendix

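## Details of the variables stored in the created rule object

In the cells below, 'created_rule_obj' is the dictionary returned by explain(); it is named 'created_rules' in PART A.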

### Synthetic_data

Synthetic_data is the data generated by PyExplainer using one of the following approaches:

1. Crossover and Interpolation
2. Random Perturbation

After Synthetic_data is generated, it is stored as a pandas DataFrame object. 
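
To give a rough intuition for these two families of techniques, here is a generic NumPy sketch. It is not PyExplainer's internal implementation: the parent instances, the 0.5 crossover probability and the noise scales are all assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical neighbouring instances with three numeric features.
parent_a = np.array([194.0, 2.0, 0.80])
parent_b = np.array([130.0, 4.0, 0.83])

# Interpolation: a random point on the line segment between the two parents.
alpha = rng.uniform(0.0, 1.0)
interpolated = parent_a + alpha * (parent_b - parent_a)

# Crossover: each feature is copied from one parent or the other.
take_from_a = rng.random(parent_a.shape) < 0.5
crossed = np.where(take_from_a, parent_a, parent_b)

# Random perturbation: Gaussian noise added per feature (scales are arbitrary).
scale = np.array([10.0, 0.5, 0.05])
perturbed = parent_a + rng.normal(0.0, scale)

print(interpolated)
print(crossed)
print(perturbed)
```

In both cases the idea is to produce new instances close to the one being explained, so that a local model can be fitted on them.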

In [17]:
print("Type of pyExp_rule_obj['synthetic_data'] - ", type(created_rule_obj['synthetic_data']), "\n")

print('Example')
display(created_rule_obj['synthetic_data'].head(2))

Type of pyExp_rule_obj['synthetic_data'] -  <class 'pandas.core.frame.DataFrame'> 

Example


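| | la | nd | ns | ent | nrev | rtime | self | ndev | age | app | rrexp | asawr | rsawr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 194.0 | 2.0 | 1.0 | 0.8 | 12.0 | 9.4 | 0.0 | 44.0 | 1.97 | 2.0 | 1290.0 | 0.01 | 0.61 |
| 1 | 130.0 | 4.0 | 1.0 | 0.83 | 15.0 | 14.47 | 0.0 | 55.0 | 0.06 | 2.0 | 1206.0 | 0.0 | 0.54 |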


### Synthetic_predictions

Synthetic_predictions holds the predictions for Synthetic_data, obtained from the global model inside PyExplainer.

In [12]:
print("Type of pyExp_rule_obj['synthetic_predictions'] - ", type(created_rule_obj['synthetic_predictions']), "\n")
print("Example", "\n\n", created_rule_obj['synthetic_predictions'])

Type of pyExp_rule_obj['synthetic_predictions'] -  <class 'numpy.ndarray'> 

Example 

 [False  True False ...  True  True  True]
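
Because the printed array is abbreviated, it can help to count how many synthetic instances fall into each class. A sketch with a made-up stand-in array:

```python
import numpy as np

# Stand-in for the 'synthetic_predictions' array (values are made up).
synthetic_predictions = np.array([False, True, False, True, True, True])

# Count occurrences of each predicted label.
labels, counts = np.unique(synthetic_predictions, return_counts=True)
balance = {bool(label): int(count) for label, count in zip(labels, counts)}
print(balance)
```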


### X_explain

X_explain is the instance to be explained (a defective commit in this context).

In [15]:
print("Type of pyExp_rule_obj['X_explain'] - ", type(created_rule_obj['X_explain']), "\n")

print('Example')
display(created_rule_obj['X_explain'])

Type of pyExp_rule_obj['X_explain'] -  <class 'pandas.core.frame.DataFrame'> 

Example


| commit_id | la | nd | ns | ent | nrev | rtime | self | ndev | age | app | rrexp | asawr | rsawr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a9a59ccbacafd6eb94f57861cfc28f5a24f474db | 155 | 3 | 1 | 0.736602 | 12.0 | 32.516586 | 0 | 70.0 | 0.669703 | 3.0 | 2374.0 | 0.143199 | 0.390069 |


### y_explain

y_explain is the label of X_explain.

In [18]:
print("Type of pyExp_rule_obj['y_explain'] - ", type(created_rule_obj['y_explain']), "\n")
print("Example", "\n\n", created_rule_obj['y_explain'])

Type of pyExp_rule_obj['y_explain'] -  <class 'pandas.core.series.Series'> 

Example 

 commit_id
a9a59ccbacafd6eb94f57861cfc28f5a24f474db    True
Name: defect, dtype: bool


### indep
#### indep holds the feature names of X_explain

In [19]:
print("Type of pyExp_rule_obj['indep'] - ", type(created_rule_obj['indep']), "\n")
print("Example", "\n\n", created_rule_obj['indep'])

Type of pyExp_rule_obj['indep'] -  <class 'pandas.core.indexes.base.Index'> 

Example 

 Index(['la', 'nd', 'ns', 'ent', 'nrev', 'rtime', 'self', 'ndev', 'age', 'app',
       'rrexp', 'asawr', 'rsawr'],
      dtype='object')


### dep
#### dep is the name of the label column

In [20]:
print("Type of pyExp_rule_obj['dep'] - ", type(created_rule_obj['dep']), "\n")
print("Example", "\n\n", created_rule_obj['dep'])

Type of pyExp_rule_obj['dep'] -  <class 'str'> 

Example 

 defect


### top_k_positive_rules

top_k_positive_rules holds the top-k rules generated by PyExplainer to explain why a commit is predicted as defective.

Here we show the top-3 rules that lead to defective commits.

In [26]:
print("Type of pyExp_rule_obj['top_k_positive_rules'] - ", type(created_rule_obj['top_k_positive_rules']), "\n")
print('Example')
display(created_rule_obj['top_k_positive_rules'].head(3))

Type of pyExp_rule_obj['top_k_positive_rules'] -  <class 'pandas.core.frame.DataFrame'> 

Example


| | index | rule | type | coef | support | importance | is_satisfy_instance |
|---|---|---|---|---|---|---|---|
| 0 | 572 | la <= 104.84500122070312 & ndev > 71.319999694... | rule | 0.220501 | 0.232376 | 0.093128 | True |
| 1 | 359 | la > 103.68999862670898 & nrev > 8.11499977111... | rule | 0.167275 | 0.237598 | 0.071194 | True |
| 2 | 98 | app <= 3.9550000429153442 & ndev <= 103.875 & ... | rule | 0.136704 | 0.631854 | 0.065933 | True |
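
The rule strings combine feature comparisons with '&'. A sketch of how such a string could be checked against data using pandas' `DataFrame.eval`; the data and the rule here are hypothetical (the real rule strings above are truncated in the display), and reading 'support' as the fraction of rows that satisfy a rule is an assumption based on common usage, not something this guide states.

```python
import pandas as pd

# Hypothetical data and rule, written in the same style as the rules above.
data = pd.DataFrame({
    "la":   [155, 90, 120, 300],
    "nrev": [12.0, 3.0, 9.0, 1.0],
})
rule = "la > 100 & nrev > 8"

# DataFrame.eval parses the comparison/& syntax and returns a boolean Series.
satisfied = data.eval(rule)
# Assumption: support = fraction of rows that satisfy the rule.
support = satisfied.mean()

print(satisfied.tolist())
print(support)
```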


### top_k_negative_rules

top_k_negative_rules holds the top-k negative rules generated by PyExplainer to explain why a commit is predicted as clean.

The default number of generated rules is 3.


In [27]:
print("Type of pyExp_rule_obj['top_k_negative_rules'] - ", type(created_rule_obj['top_k_negative_rules']), "\n")
print('Example')
display(created_rule_obj['top_k_negative_rules'])

Type of pyExp_rule_obj['top_k_negative_rules'] -  <class 'pandas.core.frame.DataFrame'> 

Example


| | rule | type | coef | support | importance | Class |
|---|---|---|---|---|---|---|
| 820 | ent <= 0.9350000023841858 & app > 2.9900000095... | rule | -0.216263 | 0.389034 | 0.105435 | Clean |
| 1609 | app <= 4.009999990463257 & app > 2.99000000953... | rule | -0.206549 | 0.428198 | 0.102204 | Clean |
| 323 | la <= 104.84500122070312 & ndev <= 71.31999969... | rule | -0.173105 | 0.391645 | 0.084496 | Clean |
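
In the two example tables, positive rules have a positive coef and negative rules a negative coef, and rows appear in descending order of importance. The sketch below reproduces that kind of selection on a made-up rule table; whether PyExplainer selects its top-k rules exactly this way is an assumption.

```python
import pandas as pd

# Hypothetical rule table with columns named like those in the output above.
rule_table = pd.DataFrame({
    "rule": ["r1", "r2", "r3", "r4"],
    "coef": [0.22, -0.21, 0.13, -0.17],
    "importance": [0.093, 0.105, 0.066, 0.084],
})

k = 1
# Split by the sign of the coefficient, then keep the k most important rules.
positive = rule_table[rule_table["coef"] > 0].nlargest(k, "importance")
negative = rule_table[rule_table["coef"] < 0].nlargest(k, "importance")

print(positive)
print(negative)
```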


# Bug Report Channel
#### Please report bugs <a href="https://github.com/awsm-research/pyExplainer/issues">here</a>
#### 📧 or email your report to michaelfu1998@gmail.com