# <center> Welcome to the PyExplainer Quickstart Guide </center>

# Top Note - MUST READ !!
#### 1. When initialising the PyExplainer object, prepare the following 5 required parameters with the data types shown
(1) X_train (pd.core.frame.DataFrame) - feature columns from the training data <br><br> 
(2) y_train (pd.core.series.Series) - label column from the training data <br><br>
(3) indep (pd.core.indexes.base.Index) - names of the feature columns; most of the time you can get them from 'X_train.columns' <br><br>
(4) dep (str) - name of the label column<br><br>
(5) blackbox_model - any supervised classification model trained with the sklearn library<br><br>
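
Passing the wrong type for one of these parameters is an easy mistake. The following is a minimal sketch of a check you could run before creating the object. `check_pyexplainer_inputs` and `DummyModel` are hypothetical names introduced here for illustration and are not part of PyExplainer; the `predict()` check is an assumption about what a usable model exposes.

```python
import pandas as pd


def check_pyexplainer_inputs(X_train, y_train, indep, dep, blackbox_model):
    """Hypothetical helper: return a list of problems found in the five inputs."""
    problems = []
    if not isinstance(X_train, pd.DataFrame):
        problems.append("X_train must be a pandas DataFrame")
    if not isinstance(y_train, pd.Series):
        problems.append("y_train must be a pandas Series")
    if not isinstance(indep, pd.Index):
        problems.append("indep must be a pandas Index, e.g. X_train.columns")
    if not isinstance(dep, str):
        problems.append("dep must be a str")
    # Assumption: a usable model exposes a predict() method, as sklearn classifiers do.
    if not callable(getattr(blackbox_model, "predict", None)):
        problems.append("blackbox_model must provide a predict() method")
    return problems


class DummyModel:
    """Stand-in for a trained sklearn classifier, used only for this demo."""

    def predict(self, X):
        return [False] * len(X)


X_train = pd.DataFrame({"CountLine": [171, 123], "DDEV": [1, 2]})
y_train = pd.Series([False, True], name="RealBug")

ok = check_pyexplainer_inputs(X_train, y_train, X_train.columns, "RealBug", DummyModel())
bad = check_pyexplainer_inputs(X_train, y_train.to_frame(), list(X_train.columns), "RealBug", DummyModel())
print(ok)   # no problems reported for correctly typed inputs
print(bad)  # y_train passed as a DataFrame and indep passed as a list
```

An empty list means the inputs have the expected types; it does not guarantee that the data itself is suitable.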

#### 2. When calling the explain() function of the PyExplainer object, prepare the following 2 parameters with the data types shown
(1) X_explain (pd.core.frame.DataFrame) - one row of feature data <br><br> 
(2) y_explain (pd.core.series.Series) - the label of that row (in the Full Tutorial the row is correctly predicted, so the true and predicted labels agree)
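
Whether a single row comes back as a DataFrame or a Series depends on how it is selected. The sketch below uses a small made-up DataFrame (the file names and values are illustrative only) to show that indexing `iloc` with a list keeps the container type, while a scalar position does not.

```python
import pandas as pd

X = pd.DataFrame(
    {"CountLine": [171, 123, 136], "DDEV": [1, 2, 1]},
    index=["A.java", "B.java", "C.java"],
)
y = pd.Series([False, True, False], index=X.index, name="RealBug")

row_as_frame = X.iloc[[1]]      # list of positions -> one-row DataFrame
row_as_series = X.iloc[1]       # scalar position   -> Series of feature values
label_as_series = y.iloc[[1]]   # list of positions -> one-element Series
label_as_scalar = y.iloc[1]     # scalar position   -> bare value, not a Series

print(type(row_as_frame).__name__, row_as_frame.shape)
print(type(row_as_series).__name__)
print(type(label_as_series).__name__, len(label_as_series))
```

The double-bracket form is the one that matches the data types listed above.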

#### 3. Be careful when using a custom pandas index for Series and DataFrame 
In our Full Tutorial (PART B) example, the File column is used as the custom index.<br>  
It is fine if you don't have a custom index; pandas will generate a default row index starting from 0.<br><br>
If you do use a custom index, use it consistently throughout your data processing.<br><br>
Otherwise, some of your data may carry the pandas default index while the rest carries your custom index,
which will trigger errors whenever you try to combine your DataFrame and Series. 
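
A mismatch does not always raise immediately. In the following sketch (made-up file names and values), joining on mismatched indexes produces only missing values, which can then cause failures further down the pipeline. Reusing the custom index avoids this.

```python
import pandas as pd

X_test = pd.DataFrame({"DDEV": [1, 2]}, index=pd.Index(["A.java", "B.java"], name="File"))
predictions = [False, True]

# Default RangeIndex (0, 1): its labels do not match the File index.
mismatched = pd.DataFrame({"PredictedBug": predictions})
broken = X_test.join(mismatched)

# Reusing the custom index keeps the rows aligned.
aligned = pd.DataFrame({"PredictedBug": predictions}, index=X_test.index)
fixed = X_test.join(aligned)

print(broken["PredictedBug"].isna().all())
print(fixed["PredictedBug"].tolist())
```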

---

# PART A - Quick Start

## 1. Prepare data and model

Note: we use the default data and model here as an example.

### 1.1 Import required library

In [1]:
from pyexplainer import pyexplainer_pyexplainer

### 1.2 Obtain default dataset and global model (Random Forest), then create the PyExplainer object

In [2]:
default_data_and_model = pyexplainer_pyexplainer.get_dflt()
py_explainer = pyexplainer_pyexplainer.PyExplainer(X_train = default_data_and_model['X_train'],
                           y_train = default_data_and_model['y_train'],
                           indep = default_data_and_model['indep'],
                           dep = default_data_and_model['dep'],
                           blackbox_model = default_data_and_model['blackbox_model'])



## 🔧2. Create rules with the PyExplainer object 

### 2.1 Prepare the data to be explained

In [3]:
X_explain = default_data_and_model['X_explain']
y_explain = default_data_and_model['y_explain']

### 2.2 Create rules

In [4]:
created_rules = py_explainer.explain(X_explain=X_explain,
                                     y_explain=y_explain,
                                     search_function='crossoverinterpolation')

## 3. Create interactive visualization

Move the sliders to change feature values and observe how the risk score changes.

In [5]:
py_explainer.visualise(created_rules)

[Interactive widget output, not rendered here: a "Risk Score" progress bar and three sliders labelled "#1 The value of ndev is more than 70.0", "#2 The value of rrexp is more than 2374.0", and "#3 The value of age is less than 0.67".]

# PART B - Full Tutorial

## 1. Prepare sample data and model

### 1.1 For simplicity, we load the sample DataFrame that is already included in the package

In [6]:
import pandas as pd
import numpy as np
from pyexplainer import pyexplainer_pyexplainer

df = pyexplainer_pyexplainer.load_sample_data()
df.head(3)

| | File | CountDeclMethodPrivate | AvgLineCode | CountLine | MaxCyclomatic | CountDeclMethodDefault | AvgEssential | CountDeclClassVariable | SumCyclomaticStrict | AvgCyclomatic | ... | OWN_LINE | OWN_COMMIT | MINOR_COMMIT | MINOR_LINE | MAJOR_COMMIT | MAJOR_LINE | RealBug | HeuBug | HeuBugCount | RealBugCount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | activemq-console/src/main/java/org/apache/acti... | 0 | 10 | 171 | 5 | 0 | 2 | 0 | 18 | 2 | ... | 1.0 | 1.0 | 0 | 1 | 1 | 0 | False | False | 0 | 0 |
| 1 | activemq-console/src/main/java/org/apache/acti... | 0 | 8 | 123 | 5 | 0 | 1 | 1 | 15 | 3 | ... | 0.98374 | 0.5 | 0 | 1 | 2 | 1 | False | False | 0 | 0 |
| 2 | activemq-console/src/main/java/org/apache/acti... | 0 | 7 | 136 | 5 | 0 | 1 | 1 | 16 | 2 | ... | 1.0 | 1.0 | 0 | 1 | 1 | 0 | False | False | 0 | 0 |


### 1.2 Define index column (OPTIONAL) and drop unwanted columns
##### First, we set the 'File' column as the index, since it identifies the file we want to inspect and is neither a feature nor the label
##### We use 'RealBug' as the label column, and the columns before 'RealBug' as feature columns
##### Then we drop the unnecessary columns (File, HeuBug, HeuBugCount, RealBugCount)

In [7]:
df = df.set_index(df['File'])
df = df.drop(['File', 'HeuBug', 'HeuBugCount', 'RealBugCount'], axis=1)
df.head(3)

| File | CountDeclMethodPrivate | AvgLineCode | CountLine | MaxCyclomatic | CountDeclMethodDefault | AvgEssential | CountDeclClassVariable | SumCyclomaticStrict | AvgCyclomatic | AvgLine | ... | DDEV | Added_lines | Del_lines | OWN_LINE | OWN_COMMIT | MINOR_COMMIT | MINOR_LINE | MAJOR_COMMIT | MAJOR_LINE | RealBug |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| activemq-console/src/main/java/org/apache/activemq/console/command/AbstractAmqCommand.java | 0 | 10 | 171 | 5 | 0 | 2 | 0 | 18 | 2 | 18 | ... | 1 | 32 | 18 | 1.0 | 1.0 | 0 | 1 | 1 | 0 | False |
| activemq-console/src/main/java/org/apache/activemq/console/command/AbstractCommand.java | 0 | 8 | 123 | 5 | 0 | 1 | 1 | 15 | 3 | 17 | ... | 2 | 30 | 28 | 0.98374 | 0.5 | 0 | 1 | 2 | 1 | False |
| activemq-console/src/main/java/org/apache/activemq/console/command/AbstractJmxCommand.java | 0 | 7 | 136 | 5 | 0 | 1 | 1 | 16 | 2 | 13 | ... | 1 | 8 | 8 | 1.0 | 1.0 | 0 | 1 | 1 | 0 | False |


### 1.3 Define feature cols (X), and label col (y)

In [8]:
# select all rows, and all feature cols
# the last col, which is label col, is not selected
X = df.iloc[:, :-1]
# select all rows, and the last label col
y = df.iloc[:, -1]

print('feature cols:', '\n\n', X.head(1), '\n\n')
print('label col:', '\n\n', y.head(1))

feature cols: 

                                                     CountDeclMethodPrivate  \
File                                                                         
activemq-console/src/main/java/org/apache/activ...                       0   

                                                    AvgLineCode  CountLine  \
File                                                                         
activemq-console/src/main/java/org/apache/activ...           10        171   

                                                    MaxCyclomatic  \
File                                                                
activemq-console/src/main/java/org/apache/activ...              5   

                                                    CountDeclMethodDefault  \
File                                                                         
activemq-console/src/main/java/org/apache/activ...                       0   

                                                    AvgEssential  \
Fi
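
Positional slicing with `iloc` depends on the label being the last column. As an alternative sketch, the same split can be expressed by column name, which keeps working if the column order changes. The small DataFrame below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"CountLine": [171, 123], "DDEV": [1, 2], "RealBug": [False, True]},
    index=pd.Index(["A.java", "B.java"], name="File"),
)

label_name = "RealBug"
X = df.drop(columns=[label_name])  # every column except the label
y = df[label_name]                 # the label column as a Series

print(list(X.columns))
print(y.name)
```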

### 1.4 Split data into training and testing set

In [9]:
from sklearn.model_selection import train_test_split
# 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) 
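
The split above is random, so results differ from run to run. If reproducibility matters, one option is to pass `random_state`, and optionally `stratify` to preserve the class ratio. The sketch below uses made-up data; the parameter values are illustrative choices, not settings taken from this tutorial.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({"DDEV": list(range(10))})
y = pd.Series([False] * 6 + [True] * 4, name="RealBug")

# random_state fixes the shuffle; stratify=y keeps the class ratio in both parts.
split_a = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
split_b = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

X_train, X_test, y_train, y_test = split_a
print(len(X_train), len(X_test))
print(X_test.index.equals(split_b[1].index))  # same rows on both runs
```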

## 2. Training and Predicting

### 2.1 Train a RandomForest model using sklearn

In [10]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, random_state=0)
rf_model.fit(X_train, y_train)

RandomForestClassifier(random_state=0)

### 2.2 Generate predictions

In [11]:
# generate predictions from the model, which returns a NumPy array of predicted labels
y_preds = rf_model.predict(X_test) 
# create a DataFrame which only has predicted label column
y_preds = pd.DataFrame(data={'PredictedBug': y_preds}, index=y_test.index) 
y_preds.head(3)

| File | PredictedBug |
|---|---|
| activemq-core/src/main/java/org/apache/activemq/command/SubscriptionInfo.java | False |
| activemq-core/src/main/java/org/apache/activemq/kaha/impl/container/ContainerListIterator.java | False |
| activemq-core/src/test/java/org/apache/activemq/camel/CamelJmsTest.java | True |


## 3. Prediction post processing

### 3.1 Combine feature cols, label col, and the predicted col in testing set

In [12]:
combined_testing_data = X_test.join(y_test.to_frame())
combined_testing_data = combined_testing_data.join(y_preds)
combined_testing_data.head(3)
# total num of rows
total_rows = len(combined_testing_data)

### 3.2 Filter out wrongly predicted rows 

In [13]:
correctly_predicted_data = combined_testing_data[combined_testing_data['RealBug']==combined_testing_data['PredictedBug']]
correctly_predicted_rows = len(correctly_predicted_data)
accuracy = correctly_predicted_rows / total_rows
print(f'The model correctly predicted {accuracy:.1%} of testing data')

The model correctly predicted 90.6% of testing data
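
The same share of correct predictions can also be computed without filtering first. This sketch, on made-up labels, compares the two columns element-wise and takes the mean of the resulting booleans.

```python
import pandas as pd

combined = pd.DataFrame(
    {
        "RealBug": [True, False, True, False, False],
        "PredictedBug": [True, False, False, False, True],
    }
)

# Comparing the two columns yields booleans; their mean is the share of matches.
accuracy = (combined["RealBug"] == combined["PredictedBug"]).mean()
print(f"The model correctly predicted {accuracy:.1%} of testing data")
```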


### 3.3 We focus on buggy files, so we filter out the non-buggy files

In [14]:
correctly_predicted_bug = correctly_predicted_data[correctly_predicted_data['RealBug']==True]
correctly_predicted_bug.head(3)

| File | CountDeclMethodPrivate | AvgLineCode | CountLine | MaxCyclomatic | CountDeclMethodDefault | AvgEssential | CountDeclClassVariable | SumCyclomaticStrict | AvgCyclomatic | AvgLine | ... | Added_lines | Del_lines | OWN_LINE | OWN_COMMIT | MINOR_COMMIT | MINOR_LINE | MAJOR_COMMIT | MAJOR_LINE | RealBug | PredictedBug |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| activemq-core/src/test/java/org/apache/activemq/broker/BrokerTest.java | 0 | 17 | 1666 | 4 | 0 | 1 | 0 | 95 | 2 | 23 | ... | 850 | 877 | 0.535414 | 0.857143 | 0 | 2 | 2 | 3 | True | True |
| activemq-core/src/main/java/org/apache/activemq/broker/jmx/ManagedRegionBroker.java | 0 | 10 | 500 | 8 | 0 | 1 | 1 | 81 | 2 | 10 | ... | 316 | 327 | 0.654 | 0.625 | 0 | 3 | 2 | 1 | True | True |
| activemq-core/src/main/java/org/apache/activemq/transport/tcp/TcpTransport.java | 0 | 5 | 481 | 9 | 0 | 1 | 1 | 71 | 2 | 7 | ... | 199 | 119 | 0.571726 | 0.7 | 0 | 3 | 2 | 1 | True | True |
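
To see how many rows fall into each combination of actual and predicted label, a cross-tabulation can help. This is a sketch on made-up labels, not output from the tutorial data; the count of rows where both columns are True corresponds to the correctly predicted buggy files selected above.

```python
import pandas as pd

combined = pd.DataFrame(
    {
        "RealBug": [True, False, True, False, False, True],
        "PredictedBug": [True, False, False, False, True, True],
    }
)

# Rows are actual labels, columns are predicted labels (False first, then True).
table = pd.crosstab(combined["RealBug"], combined["PredictedBug"])
# Rows where the file is buggy and the model also predicted buggy.
true_positives = int((combined["RealBug"] & combined["PredictedBug"]).sum())

print(table)
print("correctly predicted buggy rows:", true_positives)
```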


### 3.4 Define feature cols and label col using correctly predicted testing data

In [15]:
# select all rows and feature cols
feature_cols = correctly_predicted_bug.iloc[:, :-2]
# select all rows and one label col (either RealBug or PredictedBug is fine since they are the same)
label_col = correctly_predicted_bug.iloc[:, -2]

### 3.5 Select one row of correctly predicted bug to be explained

In [16]:
# decide which row to select
selected_row = 0
# select that row from feature_cols as a one-row DataFrame with all feature values
X_explain = feature_cols.iloc[[selected_row]]
# select the corresponding label from label_col as a one-element Series
y_explain = label_col.iloc[[selected_row]]
print('one row of feature:', '\n\n', X_explain, '\n')
print('one row of label:', '\n\n', y_explain)

one row of feature: 

                                                     CountDeclMethodPrivate  \
File                                                                         
activemq-core/src/test/java/org/apache/activemq...                       0   

                                                    AvgLineCode  CountLine  \
File                                                                         
activemq-core/src/test/java/org/apache/activemq...           17       1666   

                                                    MaxCyclomatic  \
File                                                                
activemq-core/src/test/java/org/apache/activemq...              4   

                                                    CountDeclMethodDefault  \
File                                                                         
activemq-core/src/test/java/org/apache/activemq...                       0   

                                                    AvgEssential

## 4. Create rules (explanations) and visualise them

### 4.1 Initialise a PyExplainer object

In [17]:
from pyexplainer import pyexplainer_pyexplainer

py_explainer = pyexplainer_pyexplainer.PyExplainer(X_train = X_train,
                                                   y_train = y_train,
                                                   indep = X_train.columns,
                                                   dep = 'RealBug',
                                                   blackbox_model = rf_model)

### 4.2 Create rules by calling the explain function of the PyExplainer object
##### Attention: This step can be time-consuming

In [18]:
rules = py_explainer.explain(X_explain=X_explain,
                             y_explain=y_explain,
                             search_function='crossoverinterpolation')

##### The created rules are stored in a dictionary. For more information about what each key contains, please refer to the 'Appendix' part

In [19]:
rules.keys()

dict_keys(['synthetic_data', 'synthetic_predictions', 'X_explain', 'y_explain', 'indep', 'dep', 'top_k_positive_rules', 'top_k_negative_rules', 'local_rulefit_model'])
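
For a quick overview of such a dictionary, you could list the type and length of each value. `summarise_rules` is a hypothetical helper written for this guide, and `demo_rules` is a stand-in dictionary with made-up values; in practice you would pass the dictionary returned by `explain()`.

```python
def summarise_rules(rules):
    """Hypothetical helper: map each key to the type name and length of its value."""
    summary = {}
    for key, value in rules.items():
        size = len(value) if hasattr(value, "__len__") else None
        summary[key] = (type(value).__name__, size)
    return summary


# Stand-in dictionary for the demo; real values come from explain().
demo_rules = {"dep": "RealBug", "indep": ["CountLine", "DDEV"], "local_rulefit_model": object()}
for key, info in summarise_rules(demo_rules).items():
    print(key, info)
```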

### 4.3 Call the visualise function of the PyExplainer object to visualise the created rules 

In [20]:
py_explainer.visualise(rules)

[Interactive widget output, not rendered here: a "Risk Score" progress bar and a slider labelled "#1 The value of DDEV is more than 2".]

# Appendix

## Details of the variables in the rules dictionary created by PyExplainer

### Synthetic_data

Synthetic_data is the data generated by PyExplainer using one of the following approaches:

1. Crossover and Interpolation
2. Random Perturbation

Once generated, Synthetic_data is stored as a pandas DataFrame object. 
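
To give an intuition for these two families of techniques, here is a generic sketch in NumPy. It is not PyExplainer's implementation: the function names, the 0.5 crossover probability, the Gaussian noise, and the example vectors are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def interpolate(parent_a, parent_b, rng):
    """New point on the line segment between two parents."""
    alpha = rng.uniform(0.0, 1.0)
    return parent_a + alpha * (parent_b - parent_a)


def crossover(parent_a, parent_b, rng):
    """Each feature is copied from one parent or the other."""
    take_from_a = rng.random(parent_a.shape) < 0.5
    return np.where(take_from_a, parent_a, parent_b)


def perturb(instance, scale, rng):
    """Add zero-mean Gaussian noise to every feature."""
    return instance + rng.normal(0.0, scale, size=instance.shape)


a = np.array([17.0, 1666.0, 2.0])
b = np.array([10.0, 500.0, 1.0])
print(interpolate(a, b, rng))
print(crossover(a, b, rng))
print(perturb(a, 1.0, rng))
```

In both cases the idea is to create many instances near the one being explained, so that a local model can be fitted on them.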

In [22]:
print("Type of rules['synthetic_data'] - ", type(rules['synthetic_data']), "\n")

print('Example')
display(rules['synthetic_data'].head(2))

Type of rules['synthetic_data'] -  <class 'pandas.core.frame.DataFrame'> 

Example


| | CountDeclMethodPrivate | AvgLineCode | CountLine | MaxCyclomatic | CountDeclMethodDefault | AvgEssential | CountDeclClassVariable | SumCyclomaticStrict | AvgCyclomatic | AvgLine | ... | ADEV | DDEV | Added_lines | Del_lines | OWN_LINE | OWN_COMMIT | MINOR_COMMIT | MINOR_LINE | MAJOR_COMMIT | MAJOR_LINE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 14.0 | 334.0 | 4.0 | 0.0 | 1.0 | 1.0 | 34.0 | 2.0 | 17.0 | ... | 4.0 | 1.0 | 37.0 | 45.0 | 0.82 | 1.0 | 0.0 | 2.0 | 1.0 | 1.0 |
| 1 | 0.0 | 3.0 | 263.0 | 2.0 | 0.0 | 1.0 | 0.0 | 53.0 | 1.0 | 3.0 | ... | 7.0 | 2.0 | 193.0 | 176.0 | 0.39 | 0.57 | 0.0 | 3.0 | 2.0 | 1.0 |


### Synthetic_predictions

Synthetic_predictions holds the predictions for Synthetic_data, obtained from the global model inside PyExplainer.

In [23]:
print("Type of rules['synthetic_predictions'] - ", type(rules['synthetic_predictions']), "\n")
print("Example", "\n\n", rules['synthetic_predictions'])

Type of rules['synthetic_predictions'] -  <class 'numpy.ndarray'> 

Example 

 [False  True  True ...  True False  True]


### X_explain

X_explain is the instance to be explained (a defective file in this tutorial).

In [25]:
print("Type of rules['X_explain'] - ", type(rules['X_explain']), "\n")

print('Example')
display(rules['X_explain'])

Type of rules['X_explain'] -  <class 'pandas.core.frame.DataFrame'> 

Example


| File | CountDeclMethodPrivate | AvgLineCode | CountLine | MaxCyclomatic | CountDeclMethodDefault | AvgEssential | CountDeclClassVariable | SumCyclomaticStrict | AvgCyclomatic | AvgLine | ... | ADEV | DDEV | Added_lines | Del_lines | OWN_LINE | OWN_COMMIT | MINOR_COMMIT | MINOR_LINE | MAJOR_COMMIT | MAJOR_LINE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| activemq-core/src/test/java/org/apache/activemq/broker/BrokerTest.java | 0 | 17 | 1666 | 4 | 0 | 1 | 0 | 95 | 2 | 23 | ... | 7 | 2 | 850 | 877 | 0.535414 | 0.857143 | 0 | 2 | 2 | 3 |


### y_explain

y_explain is the label of X_explain.

In [26]:
print("Type of rules['y_explain'] - ", type(rules['y_explain']), "\n")
print("Example", "\n\n", rules['y_explain'])

Type of rules['y_explain'] -  <class 'pandas.core.series.Series'> 

Example 

 File
activemq-core/src/test/java/org/apache/activemq/broker/BrokerTest.java    True
Name: RealBug, dtype: bool


### indep
#### indep holds the feature names of X_explain

In [27]:
print("Type of rules['indep'] - ", type(rules['indep']), "\n")
print("Example", "\n\n", rules['indep'])

Type of rules['indep'] -  <class 'pandas.core.indexes.base.Index'> 

Example 

 Index(['CountDeclMethodPrivate', 'AvgLineCode', 'CountLine', 'MaxCyclomatic',
       'CountDeclMethodDefault', 'AvgEssential', 'CountDeclClassVariable',
       'SumCyclomaticStrict', 'AvgCyclomatic', 'AvgLine',
       'CountDeclClassMethod', 'AvgLineComment', 'AvgCyclomaticModified',
       'CountDeclFunction', 'CountLineComment', 'CountDeclClass',
       'CountDeclMethod', 'SumCyclomaticModified', 'CountLineCodeDecl',
       'CountDeclMethodProtected', 'CountDeclInstanceVariable',
       'MaxCyclomaticStrict', 'CountDeclMethodPublic', 'CountLineCodeExe',
       'SumCyclomatic', 'SumEssential', 'CountStmtDecl', 'CountLineCode',
       'CountStmtExe', 'RatioCommentToCode', 'CountLineBlank', 'CountStmt',
       'MaxCyclomaticModified', 'CountSemicolon', 'AvgLineBlank',
       'CountDeclInstanceMethod', 'AvgCyclomaticStrict',
       'PercentLackOfCohesion', 'MaxInheritanceTree', 'CountClassDerived',
 

### dep
#### dep is the name of the label column

In [28]:
print("Type of rules['dep'] - ", type(rules['dep']), "\n")
print("Example", "\n\n", rules['dep'])

Type of rules['dep'] -  <class 'str'> 

Example 

 RealBug


### top_k_positive_rules

top_k_positive_rules holds the top-k rules generated by PyExplainer to explain why an instance (a file in this tutorial) is predicted as defective.

Here we show the top-3 rules that lead to a defective prediction.

In [29]:
print("Type of rules['top_k_positive_rules'] - ", type(rules['top_k_positive_rules']), "\n")
print('Example')
display(rules['top_k_positive_rules'].head(3))

Type of rules['top_k_positive_rules'] -  <class 'pandas.core.frame.DataFrame'> 

Example


| | index | rule | type | coef | support | importance | is_satisfy_instance |
|---|---|---|---|---|---|---|---|
| 0 | 438 | SumCyclomaticModified > 38.095001220703125 & M... | rule | 2.569322e-23 | 0.565333 | 1.2736470000000001e-23 | True |
| 1 | 360 | SumCyclomaticModified > 36.19499969482422 & MA... | rule | 2.5677540000000002e-23 | 0.565333 | 1.2728700000000002e-23 | True |
| 2 | 72 | SumCyclomatic > 24.269999504089355 & DDEV > 1.... | rule | 2.5831660000000002e-23 | 0.594667 | 1.2682220000000001e-23 | True |
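
Rules written in this `feature > threshold & ...` style can be checked against an instance with pandas' `DataFrame.eval`. The sketch below is only an illustration: the two rule strings and the one-row DataFrame are hypothetical, and this is not necessarily how PyExplainer computes `is_satisfy_instance`.

```python
import pandas as pd

instance = pd.DataFrame(
    {"SumCyclomaticStrict": [95], "DDEV": [2]},
    index=pd.Index(["BrokerTest.java"], name="File"),
)

# Hypothetical rules written in the same "feature > threshold & ..." style.
satisfied_rule = "SumCyclomaticStrict > 24.27 & DDEV > 1.5"
violated_rule = "DDEV <= 1.5"

# eval returns one boolean per row; the instance has a single row.
is_satisfied = bool(instance.eval(satisfied_rule).iloc[0])
is_violated_rule_met = bool(instance.eval(violated_rule).iloc[0])
print(is_satisfied, is_violated_rule_met)
```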


### top_k_negative_rules

top_k_negative_rules holds the top-k negative rules generated by PyExplainer to explain why an instance (a file in this tutorial) is predicted as clean.

The default number of generated rules is 3.


In [30]:
print("Type of rules['top_k_negative_rules'] - ", type(rules['top_k_negative_rules']), "\n")
print('Example')
display(rules['top_k_negative_rules'])

Type of rules['top_k_negative_rules'] -  <class 'pandas.core.frame.DataFrame'> 

Example


| | rule | type | coef | support | importance | Class |
|---|---|---|---|---|---|---|
| 635 | MAJOR_COMMIT <= 1.5 | rule | -2.038902e-23 | 0.376 | 9.876034e-24 | Clean |
| 593 | DDEV <= 1.5099999904632568 | rule | -2.038902e-23 | 0.354667 | 9.754356e-24 | Clean |
| 1263 | DDEV <= 1.5049999952316284 & DDEV <= 1.4950000... | rule | -2.0318680000000003e-23 | 0.349333 | 9.687124e-24 | Clean |
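
The `importance` values in the example tables appear consistent with the usual RuleFit-style definition, |coef| multiplied by the standard deviation of a 0/1 rule feature, sqrt(support * (1 - support)). This is an observation from the numbers shown in this guide, not a statement about PyExplainer's source code. The sketch below recomputes two rows; `rule_importance` is a hypothetical helper.

```python
import math


def rule_importance(coef, support):
    """|coef| times the standard deviation of a 0/1 rule feature with mean `support`."""
    return abs(coef) * math.sqrt(support * (1.0 - support))


# (coef, support, importance) copied from the example tables in this guide.
rows = [
    (2.569322e-23, 0.565333, 1.2736470e-23),
    (-2.038902e-23, 0.376, 9.876034e-24),
]
results = []
for coef, support, shown in rows:
    computed = rule_importance(coef, support)
    results.append((computed, shown))
    print(f"{computed:.6e} vs {shown:.6e}")
```

Under this reading, a rule matters most when its coefficient is large and it applies to roughly half of the synthetic instances.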


# Bug Report Channel
#### Please report <a href="https://github.com/awsm-research/pyExplainer/issues">here</a>
#### 📧 or email your report to michaelfu1998@gmail.com