# <center> 👩‍💻 Welcome to PyExplainer Quickstart Guide 👨‍💻 </center>

# 🛠 Installation
#### The pyexplainer package is not yet stable enough to publish on PyPI, but it can be installed from <a href="https://test.pypi.org/project/pyexplainer/">TestPyPI</a>.

### ☠️Warning☠️ >>> Do not use the command shown on <a href="https://test.pypi.org/project/pyexplainer/">TestPyPI</a> (pip install -i https://test.pypi.org/simple/ pyexplainer)! It may trigger a "distribution not found" error, because pip would then look for the dependencies only on TestPyPI, where they may be missing.

### 🤖 Run the cell below to install pyexplainer 0.1.0

In [None]:
!pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple pyexplainer

### 🤖 If the command above did not work, run the cell below. Otherwise, you are good to go!

In [None]:
!pip3 install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple pyexplainer
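
To confirm that the install succeeded, a small check like the one below may help. It is only a sketch that relies on Python's standard `importlib.metadata` module; the helper name `installed_version` is made up for this guide and is not part of pyexplainer.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if pip did not install it."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the install worked, otherwise None
print(installed_version("pyexplainer"))
```

If this prints `None`, the package is not visible to the Python interpreter running the notebook, which usually means pip installed it into a different environment.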

# Let's Start!

## 👩🏻‍🔧 1. Build a blackbox model (Here we use Random Forest as an example)

### 1.1 Import Libraries Needed

In [1]:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

### 1.2 Load Sample Dataset

In [2]:
data = pd.read_csv('../tests/example-datasets/activemq-5.0.0.csv', index_col = 'File')

### 1.3 Slicing Data to Prepare X_train, y_train Before Constructing Blackbox Model 

In [3]:
dep = data.columns[-4]
indep = data.columns[0:(len(data.columns) - 4)]

print("data type of 'dep' - ", type(dep), "\n")
print("data type of 'indep' - ", type(indep), "\n")
print("content of 'dep' - ", dep, "\n")
print("content of 'indep' - ", indep)

data type of 'dep' -  <class 'str'> 

data type of 'indep' -  <class 'pandas.core.indexes.base.Index'> 

content of 'dep' -  RealBug 

content of 'indep' -  Index(['CountDeclMethodPrivate', 'AvgLineCode', 'CountLine', 'MaxCyclomatic',
       'CountDeclMethodDefault', 'AvgEssential', 'CountDeclClassVariable',
       'SumCyclomaticStrict', 'AvgCyclomatic', 'AvgLine',
       'CountDeclClassMethod', 'AvgLineComment', 'AvgCyclomaticModified',
       'CountDeclFunction', 'CountLineComment', 'CountDeclClass',
       'CountDeclMethod', 'SumCyclomaticModified', 'CountLineCodeDecl',
       'CountDeclMethodProtected', 'CountDeclInstanceVariable',
       'MaxCyclomaticStrict', 'CountDeclMethodPublic', 'CountLineCodeExe',
       'SumCyclomatic', 'SumEssential', 'CountStmtDecl', 'CountLineCode',
       'CountStmtExe', 'RatioCommentToCode', 'CountLineBlank', 'CountStmt',
       'MaxCyclomaticModified', 'CountSemicolon', 'AvgLineBlank',
       'CountDeclInstanceMethod', 'AvgCyclomaticStrict',
       '
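
The slicing in 1.3 relies on the label columns sitting at the end of the CSV. The toy example below sketches the same idea on a tiny DataFrame; the feature names `FeatureA` and `FeatureB` are invented, while the last four column names mirror the sample dataset.

```python
import pandas as pd

# Toy frame: two made-up feature columns followed by four label-related columns,
# mirroring the column layout of the sample dataset.
toy = pd.DataFrame({
    "FeatureA": [1, 2],
    "FeatureB": [3, 4],
    "RealBug": [False, True],
    "HeuBug": [False, False],
    "HeuBugCount": [0, 0],
    "RealBugCount": [0, 1],
})

dep = toy.columns[-4]      # fourth column from the end: the label name
indep = toy.columns[:-4]   # every column before the last four: the features

print(dep)
print(list(indep))
```

`toy.columns[:-4]` selects the same columns as `toy.columns[0:(len(toy.columns) - 4)]`, just written more compactly.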

### 1.4 Prepare X_train, y_train Data for the Blackbox Model Based on the Data from 1.3

In [4]:
X_train = data.loc[:, indep]
y_train = data.loc[:, dep]

print("data type of 'X_train' - ", type(X_train), "\n")
print("data type of 'y_train' - ", type(y_train), "\n")
print("--------------------", "content of 'X_train' -------------------- ", "\n\n", X_train, "\n")
print("--------------------", "content of 'y_train' -------------------- ", "\n\n", y_train)

data type of 'X_train' -  <class 'pandas.core.frame.DataFrame'> 

data type of 'y_train' -  <class 'pandas.core.series.Series'> 

-------------------- content of 'X_train' --------------------  

                                                     CountDeclMethodPrivate  \
File                                                                         
activemq-console/src/main/java/org/apache/activ...                       0   
activemq-console/src/main/java/org/apache/activ...                       0   
activemq-console/src/main/java/org/apache/activ...                       0   
activemq-console/src/main/java/org/apache/activ...                       0   
activemq-console/src/main/java/org/apache/activ...                       0   
...                                                                    ...   
assembly/src/test/java/org/apache/activemq/benc...                       0   
assembly/src/test/java/org/apache/activemq/benc...                       0   
assembly/src/test/java/

### 1.5 Build a Sample Blackbox Random Forest Model Using Data from 1.4

In [5]:
blackbox_model = RandomForestClassifier(max_depth=3, random_state=0)
blackbox_model.fit(X_train, y_train)

RandomForestClassifier(max_depth=3, random_state=0)
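
Before handing a model to an explainer, it can be worth checking what the model itself returns. The sketch below trains the same kind of classifier on synthetic data from scikit-learn's `make_classification` (not on the ActiveMQ dataset) and inspects `classes_` and `predict_proba`; the names `X_toy`, `y_toy`, and `toy_model` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, used here only as a stand-in
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

toy_model = RandomForestClassifier(max_depth=3, random_state=0)
toy_model.fit(X_toy, y_toy)

# classes_ lists the labels in the order used by predict_proba
print(toy_model.classes_)

# One row in, one row of class probabilities out
proba = toy_model.predict_proba(X_toy[:1])
print(proba.shape)
```

The order of `classes_` is the order in which scikit-learn reports class probabilities, which is useful to keep in mind when naming classes later.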

## 🚀  2. Create a PyExplainer Object

### 2.1 Import Library Needed

In [6]:
from pyexplainer_pyexplainer import PyExplainer

### 2.2 Define Class Labels for Classification Model

In [7]:
class_label = ['Clean', 'Defect']

### 2.3 Create an Object

In [8]:
pyExp = PyExplainer(X_train,
            y_train,
            indep,
            dep,
            class_label,
            blackbox_model = blackbox_model)
pyExp

<pyexplainer_pyexplainer.PyExplainer at 0x193eca5cca0>

## 💾 3. Prepare Testing Data to be Explained by PyExplainer

### 3.1 Load Testing Data as a pandas DataFrame

In [9]:
sample_files = pd.read_csv('../tests/example-datasets/activemq-5.0.0.csv', index_col = 'File')
sample_files

File,CountDeclMethodPrivate,AvgLineCode,CountLine,MaxCyclomatic,CountDeclMethodDefault,AvgEssential,CountDeclClassVariable,SumCyclomaticStrict,AvgCyclomatic,AvgLine,...,OWN_LINE,OWN_COMMIT,MINOR_COMMIT,MINOR_LINE,MAJOR_COMMIT,MAJOR_LINE,RealBug,HeuBug,HeuBugCount,RealBugCount
activemq-console/src/main/java/org/apache/activemq/console/command/AbstractAmqCommand.java,0,10,171,5,0,2,0,18,2,18,...,1.000000,1.000000,0,1,1,0,False,False,0,0
activemq-console/src/main/java/org/apache/activemq/console/command/AbstractCommand.java,0,8,123,5,0,1,1,15,3,17,...,0.983740,0.500000,0,1,2,1,False,False,0,0
activemq-console/src/main/java/org/apache/activemq/console/command/AbstractJmxCommand.java,0,7,136,5,0,1,1,16,2,13,...,1.000000,1.000000,0,1,1,0,False,False,0,0
activemq-console/src/main/java/org/apache/activemq/console/command/AmqBrowseCommand.java,0,29,241,17,0,4,5,29,9,53,...,1.000000,1.000000,0,1,1,0,False,False,0,0
activemq-console/src/main/java/org/apache/activemq/console/command/BrowseCommand.java,0,24,212,17,0,3,5,26,8,44,...,1.000000,1.000000,0,1,1,0,False,False,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
assembly/src/test/java/org/apache/activemq/benchmark/BenchmarkSupport.java,0,5,215,6,0,1,0,35,1,5,...,0.846512,0.666667,0,2,2,1,True,False,0,1
assembly/src/test/java/org/apache/activemq/benchmark/Consumer.java,0,9,108,7,0,1,0,16,3,11,...,0.814815,1.000000,0,2,1,1,False,False,0,0
assembly/src/test/java/org/apache/activemq/benchmark/Producer.java,0,9,180,8,0,1,0,30,2,10,...,0.866667,0.666667,0,2,2,2,True,False,0,1
assembly/src/test/java/org/apache/activemq/benchmark/ProducerConsumer.java,0,6,80,7,0,1,0,13,2,6,...,0.837500,1.000000,0,2,1,1,False,False,0,0


### 3.2 Slicing Data to Prepare X_test, y_test Before Making Predictions 

In [10]:
X_test = sample_files.loc[:, indep]
y_test = sample_files.loc[:, dep]

print("data type of 'X_test' - ", type(X_test), "\n")
print("data type of 'y_test' - ", type(y_test), "\n")
print("--------------------", "content of 'X_test' -------------------- ", "\n\n", X_test, "\n")
print("--------------------", "content of 'y_test' -------------------- ", "\n\n", y_test)

data type of 'X_test' -  <class 'pandas.core.frame.DataFrame'> 

data type of 'y_test' -  <class 'pandas.core.series.Series'> 

-------------------- content of 'X_test' --------------------  

                                                     CountDeclMethodPrivate  \
File                                                                         
activemq-console/src/main/java/org/apache/activ...                       0   
activemq-console/src/main/java/org/apache/activ...                       0   
activemq-console/src/main/java/org/apache/activ...                       0   
activemq-console/src/main/java/org/apache/activ...                       0   
activemq-console/src/main/java/org/apache/activ...                       0   
...                                                                    ...   
assembly/src/test/java/org/apache/activemq/benc...                       0   
assembly/src/test/java/org/apache/activemq/benc...                       0   
assembly/src/test/java/org

### 3.3 Choose One File to be Explained
##### Each row of X_test and y_test holds the information for one file.

In [19]:
# Valid positions run from 0 to len(X_test) - 1 (0 to 1883 for this dataset)
explain_index = 10
X_explain = X_test.iloc[[explain_index]]
X_explain

File,CountDeclMethodPrivate,AvgLineCode,CountLine,MaxCyclomatic,CountDeclMethodDefault,AvgEssential,CountDeclClassVariable,SumCyclomaticStrict,AvgCyclomatic,AvgLine,...,ADEV,DDEV,Added_lines,Del_lines,OWN_LINE,OWN_COMMIT,MINOR_COMMIT,MINOR_LINE,MAJOR_COMMIT,MAJOR_LINE
activemq-console/src/main/java/org/apache/activemq/console/command/ShellCommand.java,0,10,134,10,0,1,0,19,3,13,...,1,1,32,33,0.798507,1.0,0,2,1,1


In [20]:
y_explain = y_test.iloc[[explain_index]]
y_explain

File
activemq-console/src/main/java/org/apache/activemq/console/command/ShellCommand.java    False
Name: RealBug, dtype: bool
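
Note the double brackets in `X_test.iloc[[explain_index]]` above. The toy example below illustrates why they matter: a list of positions keeps the result as a one-row DataFrame, while a single position returns a Series. The frame `toy_X` and its values are made up for illustration.

```python
import pandas as pd

toy_X = pd.DataFrame(
    {"CountLine": [171, 123, 136], "AvgLine": [18, 17, 13]},
    index=["A.java", "B.java", "C.java"],
)

row_as_frame = toy_X.iloc[[1]]   # list of positions: DataFrame with one row
row_as_series = toy_X.iloc[1]    # single position: Series indexed by column name

print(type(row_as_frame).__name__, row_as_frame.shape)
print(type(row_as_series).__name__, row_as_series.shape)
```

Keeping the row as a DataFrame preserves the column layout and the `File` index, so it has the same shape conventions as `X_train`.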

### 🔧3.4 Create a Rule Object Manually 
#### 📝Note. The Rule Object is important: it is the input to .visualise(...) in section 4.
#### 📝Note. This cell may take a while to run because .explain(...) generates synthetic data and fits a model on it.

In [23]:
%%time
# Create the Rule Object for the file chosen in 3.3 (X_explain, y_explain)
create_pyExp_rule_obj = pyExp.explain(X_explain,
                                      y_explain,
                                      search_function='crossoverinterpolation',
                                      top_k=3,
                                      max_rules=30,
                                      max_iter=5,
                                      cv=5,
                                      debug=False)

# Notes from earlier runs (kept for reference):
#   - "unexpected keyword argument" errors were seen for max_iter and n_jobs
#   - ConvergenceWarning: Liblinear failed to converge, increase the number of iterations
#     (seen for index 0 and 10)

### ⏳ 3.5 Alternatively, we can directly load a Rule Object if we already have it 

#### 3.5.1 Create Reading and Writing Functions

In [14]:
import pickle
import os.path

# Util functions for reading and writing data
def save_object(object_i, filename):
    with open(filename, 'wb') as file:
        pickle.dump(object_i, file)

def load_object(filename):
    with open(filename, 'rb') as file:
        object_o = pickle.load(file)
    return (object_o)
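
A quick round trip shows how the two helpers are meant to be used together. In this sketch the helpers are repeated so the snippet runs on its own, and `toy_rule_obj` is a made-up stand-in; a real Rule Object also holds DataFrames.

```python
import os
import pickle
import tempfile


def save_object(object_i, filename):
    with open(filename, 'wb') as file:
        pickle.dump(object_i, file)


def load_object(filename):
    with open(filename, 'rb') as file:
        return pickle.load(file)


# Hypothetical stand-in for a Rule Object
toy_rule_obj = {"dep": "RealBug", "indep": ["CountLine", "AvgLine"]}

# Write to a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_rule_obj.pyobject")
    save_object(toy_rule_obj, path)
    restored = load_object(path)

print(restored == toy_rule_obj)
```

Only load pickle files that come from a source you trust, because unpickling can execute arbitrary code.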

#### 3.5.2 Load Sample Rule Object - we use pyExplainer_obj.pyobject file as an example

In [15]:
# load rule obj
if os.path.isfile('../tests/pyExplainer_obj.pyobject'):
    load_pyExp_rule_obj = load_object('../tests/pyExplainer_obj.pyobject')

## 👩🏽‍🎨 4. Pass Rule Object to .visualise(rule_obj) to Generate the Bullet Chart and Interactive Slider
#### 📝Note. The interactive slider is not available in this version

#### 🔧 Visualise the Rule Object we created manually using .explain(...) method

In [16]:
pyExp.visualise(create_pyExp_rule_obj)


#### ⏳ Visualise the Rule Object we loaded from pyExplainer_obj.pyobject file

In [17]:
pyExp.visualise(load_pyExp_rule_obj)

Min 2 Max 987 threshold 97.83 Actual 10 Plot_min 2.0 Plot_max 196.0
Min 1 Max 4 threshold 1.55 Actual 1 Plot_min 1.0 Plot_max 2.0
Min 1 Max 23 threshold 5.5 Actual 1 Plot_min 1 Plot_max 8.0


HBox(children=(Label(value='Risk Score: '), FloatProgress(value=0.0, bar_style='info', layout=Layout(width='40…

FloatSlider(value=10.0, continuous_update=False, description='#1 Increase the values of CountStmt to more than…

FloatSlider(value=1.0, continuous_update=False, description='#2 Decrease the values of MAJOR_COMMIT to less th…

FloatSlider(value=1.0, continuous_update=False, description='#3 Decrease the values of COMM to less than 1', l…

Output(layout=Layout(border='3px solid black'))

## 🕵🏻 What's in the Rule Object (pyExp_rule_obj)? Let's unbox it! 📦

### 1. Basic Data Check

In [16]:
# Use the Rule Object loaded in section 3.5.2
pyExp_rule_obj = load_pyExp_rule_obj

print("Type of Rule Object: ", type(pyExp_rule_obj))
print()
print("All of the keys in Rule Object")
for i, k in enumerate(pyExp_rule_obj.keys(), start=1):
    print("Key ", i, " - ", k)

Type of Rule Object:  <class 'dict'>

All of the keys in Rule Object
Key  1  -  synthetic_data
Key  2  -  synthetic_predictions
Key  3  -  X_explain
Key  4  -  y_explain
Key  5  -  indep
Key  6  -  dep
Key  7  -  top_k_positive_rules
Key  8  -  top_k_negative_rules


### 🔑 Key 1 - synthetic_data
#### As shown below, synthetic_data holds generated values for the feature columns
#### PyExplainer generates this data internally when the .explain(...) method is called
#### There are currently 2 approaches to generating synthetic_data
#### Approach (1) Crossover and Interpolation
#### Approach (2) Random Perturbation
#### Whichever approach is used, synthetic_data is stored as a DataFrame like the one below
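
To give a rough feel for what these two approach names usually mean, here is a generic sketch of interpolation, crossover, and random perturbation on two toy rows. This is not PyExplainer's internal code: the functions `interpolate`, `crossover`, and `perturb`, the parent rows, and the parameter values are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)


def interpolate(parent_a, parent_b, alpha):
    """Move a fraction alpha of the way from parent_a towards parent_b."""
    return parent_a + alpha * (parent_b - parent_a)


def crossover(parent_a, parent_b, rng):
    """Take each feature value from one parent or the other at random."""
    take_from_a = rng.random(len(parent_a)) < 0.5
    return np.where(take_from_a, parent_a, parent_b)


def perturb(row, scale, rng):
    """Add Gaussian noise with standard deviation `scale` to every feature."""
    return row + rng.normal(0.0, scale, size=len(row))


features = ["CountLine", "AvgLine", "CountStmt"]
parent_a = np.array([100.0, 10.0, 40.0])
parent_b = np.array([200.0, 20.0, 80.0])

synthetic_toy = pd.DataFrame(
    [
        interpolate(parent_a, parent_b, 0.25),
        crossover(parent_a, parent_b, rng),
        perturb(parent_a, 1.0, rng),
    ],
    columns=features,
)
print(synthetic_toy)
```

Each generated row stays close to real rows, which is the general motivation for building synthetic neighbours around the instance being explained.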

In [19]:
print("Type of pyExp_rule_obj['synthetic_data'] - ", type(pyExp_rule_obj['synthetic_data']), "\n")
print("Example", "\n\n", pyExp_rule_obj['synthetic_data'].head(2))

Type of pyExp_rule_obj['synthetic_data'] -  <class 'pandas.core.frame.DataFrame'> 

Example 

    CountDeclMethodPrivate  AvgLineCode  CountLine  MaxCyclomatic  \
0                     0.0          9.0      157.0            5.0   
1                     0.0          9.0       78.0            2.0   

   CountDeclMethodDefault  AvgEssential  CountDeclClassVariable  \
0                     0.0           1.0                     1.0   
1                     0.0           1.0                     0.0   

   SumCyclomaticStrict  AvgCyclomatic  AvgLine  ...  DDEV  Added_lines  \
0                 13.0            2.0     21.0  ...   1.0         35.0   
1                  6.0            2.0     18.0  ...   2.0         18.0   

   Del_lines  OWN_LINE  OWN_COMMIT  MINOR_COMMIT  MINOR_LINE  MAJOR_COMMIT  \
0       27.0       1.0         1.0           0.0         1.0           1.0   
1       12.0       1.0         0.5           0.0         1.0           2.0   

   MAJOR_LINE  RealBug  
0         0.0  

### 🔑 Key 2 - synthetic_predictions
#### As shown below, synthetic_predictions holds the predicted labels for the synthetic data
#### PyExplainer generates these predictions internally when the .explain(...) method is called
#### They come from the blackbox model we passed to PyExplainer when initialising it (sections 1.5 & 2.3)
#### Because the predictions are made on the synthetic data above, they are called synthetic_predictions
#### >>> e.g. synthetic_predictions = blackbox_model.predict(synthetic_data)  Note. only the feature columns of synthetic_data are needed

In [23]:
print("Type of pyExp_rule_obj['synthetic_predictions'] - ", type(pyExp_rule_obj['synthetic_predictions']), "\n")
print("Example", "\n\n", pyExp_rule_obj['synthetic_predictions'])

Type of pyExp_rule_obj['synthetic_predictions'] -  <class 'numpy.ndarray'> 

Example 

 [False False  True ... False False  True]
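
The relationship described above can be sketched with toy data. Everything here is a stand-in: the feature names `f1` to `f3`, the random training data, and `toy_model` are assumptions, and the point is only that the label column is dropped before calling `predict`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_cols = ["f1", "f2", "f3"]

# Toy training data with a boolean label, similar in spirit to RealBug
train = pd.DataFrame(rng.random((100, 3)), columns=feature_cols)
labels = train["f1"] > 0.5
toy_model = RandomForestClassifier(max_depth=3, random_state=0).fit(train, labels)

# Toy synthetic data that also carries a label column
synthetic_toy = pd.DataFrame(rng.random((5, 3)), columns=feature_cols)
synthetic_toy["RealBug"] = False  # placeholder label column

# Predict on the feature columns only
synthetic_predictions = toy_model.predict(synthetic_toy.loc[:, feature_cols])
print(type(synthetic_predictions).__name__, synthetic_predictions.dtype)
```

As in the real output above, the result is a NumPy array of booleans with one entry per synthetic row.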


### 🔑 Key 3 - X_explain
#### This X_explain is exactly the same as the one we passed to the .explain(...) method (sections 3.3 & 3.4)

In [29]:
print("Type of pyExp_rule_obj['X_explain'] - ", type(pyExp_rule_obj['X_explain']), "\n")
print("Example", "\n\n", pyExp_rule_obj['X_explain'])

Type of pyExp_rule_obj['X_explain'] -  <class 'pandas.core.frame.DataFrame'> 

Example 

                                                     CountDeclMethodPrivate  \
File                                                                         
activemq-camel-loadtest/src/test/java/org/apach...                       0   

                                                    AvgLineCode  CountLine  \
File                                                                         
activemq-camel-loadtest/src/test/java/org/apach...            3         36   

                                                    MaxCyclomatic  \
File                                                                
activemq-camel-loadtest/src/test/java/org/apach...              1   

                                                    CountDeclMethodDefault  \
File                                                                         
activemq-camel-loadtest/src/test/java/org/apach...                       0  

### 🔑 Key 4 - y_explain
#### This y_explain is exactly the same as the one we passed to the .explain(...) method (sections 3.3 & 3.4)

In [30]:
print("Type of pyExp_rule_obj['y_explain'] - ", type(pyExp_rule_obj['y_explain']), "\n")
print("Example", "\n\n", pyExp_rule_obj['y_explain'])

Type of pyExp_rule_obj['y_explain'] -  <class 'pandas.core.series.Series'> 

Example 

 File
activemq-camel-loadtest/src/test/java/org/apache/activemq/soaktest/LoadTest.java    False
Name: RealBug, dtype: bool


### 🔑 Key 5 - indep
#### Names of the Selected Feature Cols

In [36]:
print("Type of pyExp_rule_obj['indep'] - ", type(pyExp_rule_obj['indep']), "\n")
print("Example", "\n\n", pyExp_rule_obj['indep'])

Type of pyExp_rule_obj['indep'] -  <class 'pandas.core.indexes.base.Index'> 

Example 

 Index(['CountDeclMethodPrivate', 'AvgLineCode', 'CountLine', 'MaxCyclomatic',
       'CountDeclMethodDefault', 'AvgEssential', 'CountDeclClassVariable',
       'SumCyclomaticStrict', 'AvgCyclomatic', 'AvgLine',
       'CountDeclClassMethod', 'AvgLineComment', 'AvgCyclomaticModified',
       'CountDeclFunction', 'CountLineComment', 'CountDeclClass',
       'CountDeclMethod', 'SumCyclomaticModified', 'CountLineCodeDecl',
       'CountDeclMethodProtected', 'CountDeclInstanceVariable',
       'MaxCyclomaticStrict', 'CountDeclMethodPublic', 'CountLineCodeExe',
       'SumCyclomatic', 'SumEssential', 'CountStmtDecl', 'CountLineCode',
       'CountStmtExe', 'RatioCommentToCode', 'CountLineBlank', 'CountStmt',
       'MaxCyclomaticModified', 'CountSemicolon', 'AvgLineBlank',
       'CountDeclInstanceMethod', 'AvgCyclomaticStrict',
       'PercentLackOfCohesion', 'MaxInheritanceTree', 'CountClassDerived',
 

### 🔑 Key 6 - dep
#### Name of the Label Column (Prediction Column)

In [37]:
print("Type of pyExp_rule_obj['dep'] - ", type(pyExp_rule_obj['dep']), "\n")
print("Example", "\n\n", pyExp_rule_obj['dep'])

Type of pyExp_rule_obj['dep'] -  <class 'str'> 

Example 

 RealBug


### 🔑 Key 7 - top_k_positive_rules
#### These are the top k positive rules generated by the RuleFit model inside the .explain(...) method
#### The value of 'top_k' can be set when we create a Rule Object manually (section 3.4); the default value is 3

In [41]:
print("Type of pyExp_rule_obj['top_k_positive_rules'] - ", type(pyExp_rule_obj['top_k_positive_rules']), "\n")
print("Example", "\n\n", pyExp_rule_obj['top_k_positive_rules'])

Type of pyExp_rule_obj['top_k_positive_rules'] -  <class 'pandas.core.frame.DataFrame'> 

Example 

                                                  rule  type      coef  \
92  MaxNesting_Mean > 0.1850000023841858 & CountDe...  rule  1.898139   
69  CountLineCodeDecl <= 46.209999084472656 & SumC...  rule  1.352392   
87  SumCyclomaticModified > 24.84500026702881 & DD...  rule  0.471191   

     support  importance   Class  
92  0.101828    0.574038  Defect  
69  0.028721    0.225877  Defect  
87  0.130548    0.158747  Defect  


### 🔑 Key 8 - top_k_negative_rules
#### These are the top k negative rules generated by the RuleFit model inside the .explain(...) method
#### The value of 'top_k' can be set when we create a Rule Object manually (section 3.4); the default value is 3
#### In the current version, the same top_k value applies to both negative and positive rules; this may be improved in a future version

In [49]:
print("Type of pyExp_rule_obj['top_k_negative_rules'] - ", type(pyExp_rule_obj['top_k_negative_rules']), "\n")
print("Example", "\n\n", pyExp_rule_obj['top_k_negative_rules'])

Type of pyExp_rule_obj['top_k_negative_rules'] -  <class 'pandas.core.frame.DataFrame'> 

Example 

                                                  rule  type      coef  \
85  CountStmt > 97.83000183105469 & MAJOR_COMMIT <...  rule -7.757467   
81       COMM <= 5.5 & CountStmt <= 97.83000183105469  rule -2.193372   
79  SumCyclomatic <= 31.5 & CountLineCodeDecl <= 4...  rule -1.491272   

     support  importance  Class  
85  0.049608    1.684413  Clean  
81  0.796345    0.883305  Clean  
79  0.817232    0.576341  Clean  
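
In the two outputs above, rules with a positive coefficient are labelled Defect, rules with a negative coefficient are labelled Clean, and the rows appear in descending order of importance. The sketch below reproduces that pattern on a made-up rules table. How PyExplainer actually ranks its rules is an assumption here, and `top_k_by_sign` is a hypothetical helper.

```python
import pandas as pd

# Made-up rules table with the same kind of columns as the outputs above
rules = pd.DataFrame({
    "rule": ["A > 1", "B <= 2", "C > 3", "D <= 4", "E > 5"],
    "coef": [1.9, -7.8, 0.5, -2.2, 1.4],
    "importance": [0.57, 1.68, 0.16, 0.88, 0.23],
})


def top_k_by_sign(rules, k, positive):
    """Keep rules whose coefficient has the requested sign, ranked by importance."""
    mask = rules["coef"] > 0 if positive else rules["coef"] < 0
    return rules[mask].sort_values("importance", ascending=False).head(k)


top_positive = top_k_by_sign(rules, k=2, positive=True)
top_negative = top_k_by_sign(rules, k=2, positive=False)

print(list(top_positive["rule"]))  # ['A > 1', 'E > 5']
print(list(top_negative["rule"]))  # ['B <= 2', 'D <= 4']
```

Using a single `k` for both calls matches the current behaviour described above, where top_k is shared by positive and negative rules.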


# 🤡 Important - Bug Report Channel 🤡
#### 📧 Please email your report to michaelfu1998@gmail.com
#### ✈️ More channels will be opened soon


# <center> 🙏Thanks for playing around with PyExplainer, I really appreciate your time! 🙏 </center>
#### <center> 🔥 More Features will be Released Soon 🔥 </center>