<a href="https://colab.research.google.com/github/ahmed-boutar/interpreting-rule-based-models/blob/main/interpreting_rule_based_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

The Google Colab notebook uses the imodels library to interpret these rule-based models:
- OneR rule list (oneR)
- Slipper rule set (SLIPPER)
- Optimal rule list (CORELS) 
<br>

Source: https://github.com/csinva/imodels
<br>

Details on how these models work can be found in the slides included in the repo under Interpreting-Decision-Rules.pptx
<br>

One thing I definitely found interesting is that downgrading the versions of imodels and scikit-learn changes the accuracy, precision, and recall of each model.
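
Since the metrics depend on the installed versions, it may help to record them next to the results. The cell below is a minimal sketch using only the standard library; the helper name `installed_versions` and the list of packages are my own choices, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a {package: version} mapping, marking packages that are missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


# Package names are an assumption based on the libraries this notebook imports.
versions = installed_versions(["imodels", "scikit-learn", "corels"])
for name, ver in versions.items():
    print(f"{name}=={ver}")
```

Saving this output together with the reported metrics makes it easier to tell whether a change in accuracy comes from the code or from the environment.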

In [58]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Please use this to connect your GitHub repository to your Google Colab notebook
# Connects to any needed files from GitHub and Google Drive
import os

# Remove Colab default sample_data
!rm -r ./sample_data

# Clone GitHub files to colab workspace
repo_name = "interpreting-rule-based-models" # Change to your repo name
git_path = 'https://github.com/ahmed-boutar/interpreting-rule-based-models.git' #Change to your path
!git clone "{git_path}"

%cd "{repo_name}"

# Install dependencies from requirements.txt file
!pip install -r requirements.txt #Add if using requirements.txt


In [None]:
# Install corels explicitly to allow OptimalRuleListClassifier to use the CORELS algorithm
!pip install corels --quiet

After running the previous cell, restart the Colab session and run everything except for the first cell to avoid cloning the repo twice

In [60]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score
from graphviz import Digraph

from imodels import OptimalRuleListClassifier, SlipperClassifier, OneRClassifier
import os

## Dataset Description

The dataset I will be using to train the models is the Breast Cancer Wisconsin (Diagnostic) dataset, which ships with the scikit-learn library. It is a popular dataset in machine learning, particularly for binary classification tasks. In this case, the target is whether a breast tumor is benign or malignant. <br>
(Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html)

#### Provenance 

This dataset contains features computed from digitized images of samples of breast mass tissue. The dataset is used to predict whether a breast tumor is malignant or benign based on various characteristics of the cell nuclei present in the images. 

#### Authors & License 

The Breast Cancer Wisconsin (Diagnostic) dataset is part of scikit-learn, which is distributed under the *BSD 3-Clause license*, allowing it to be freely used for academic, commercial, and personal projects (provided the original copyright notice and the BSD 3-Clause license text are included).

The dataset was created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin.

#### Overview 
The dataset includes 30 numerical features derived from 10 measurements of each cell nucleus (such as radius, texture, perimeter, and area). Each measurement is taken for every nucleus in an image, and its mean, standard error, and "worst" (largest) value across the nuclei are reported, giving 3 features per measurement.
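
To make the three summary statistics concrete, here is a small sketch on made-up numbers. The `radii` values and the `summarize` helper are hypothetical. Two assumptions are built in: the standard error is computed as the sample standard deviation divided by the square root of the sample size, and "worst" is the mean of the three largest values, as described in the dataset documentation.

```python
import math
import statistics


def summarize(values):
    """Summarize one measurement taken over all nuclei in an image."""
    mean = statistics.mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n) (assumed definition).
    std_error = statistics.stdev(values) / math.sqrt(len(values))
    # "Worst" is taken here as the mean of the three largest values.
    worst = statistics.mean(sorted(values)[-3:])
    return {"mean": mean, "standard error": std_error, "worst": worst}


# Hypothetical radius measurements for the nuclei in one image.
radii = [10.0, 12.0, 11.0, 15.0, 14.0, 13.0]
summary = summarize(radii)
for name, value in summary.items():
    print(f"{name} radius: {value:.3f}")
```

In the real dataset these three numbers would become the columns `mean radius`, `radius error`, and `worst radius`.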


In [None]:
breast_cancer = load_breast_cancer()
df = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
# Add the target variable, where 1 is benign and 0 is malignant
df['diagnosis'] = breast_cancer.target
df.head()

In [None]:
df.describe()

In [None]:
print(df['diagnosis'].value_counts())

#### Modeling

In [64]:
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

All of the models used here to classify the breast cancer diagnosis are provided by the imodels library. (Source: https://github.com/csinva/imodels?tab=readme-ov-file)

## One R 

This algorithm is often used as a benchmark for other methods. It is considered one of the simplest rule-based classification algorithms. 

From all the features, OneR selects the one that carries the most information about the outcome of interest and creates decision rules from that **single feature**.

Source: https://csinva.io/imodels/rule_list/one_r.html
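
The idea can be sketched in plain Python. This is a simplified single-threshold version written for illustration, not the imodels implementation: for every feature it tries each midpoint between consecutive values, predicts the majority class on each side, and keeps the feature whose best rule makes the fewest training errors. The toy rows and the helper names are hypothetical.

```python
from collections import Counter


def best_split_for_feature(values, labels):
    """Return (errors, threshold, left_label, right_label) for one feature."""
    best = None
    distinct = sorted(set(values))
    # Candidate thresholds sit halfway between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best


def one_r(rows, labels, feature_names):
    """Pick the single feature whose best threshold rule makes the fewest errors."""
    best_rule = None
    for index, name in enumerate(feature_names):
        column = [row[index] for row in rows]
        split = best_split_for_feature(column, labels)
        if split is None:
            continue  # constant feature: no threshold to try
        errors, threshold, left_label, right_label = split
        if best_rule is None or errors < best_rule["errors"]:
            best_rule = {"feature": name, "threshold": threshold, "if_below": left_label,
                         "otherwise": right_label, "errors": errors}
    return best_rule


# Toy data (hypothetical): 1 = benign, 0 = malignant.
toy_features = ["radius", "texture"]
toy_rows = [[11.0, 20.0], [12.0, 14.0], [13.0, 22.0], [18.0, 15.0], [19.0, 21.0], [20.0, 16.0]]
toy_labels = [1, 1, 1, 0, 0, 0]

rule = one_r(toy_rows, toy_labels, toy_features)
print(f"IF {rule['feature']} <= {rule['threshold']} THEN {rule['if_below']} ELSE {rule['otherwise']}")
```

On this toy data `radius` separates the classes perfectly while `texture` does not, so the rule is built from `radius` alone.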

In [None]:
oneR_model = OneRClassifier()
oneR_model.fit(X_train, y_train, feature_names=breast_cancer.feature_names)

In [None]:
# Make predictions on the test set
y_pred = oneR_model.predict(X_test)

# Evaluate the model
acc = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
print(f'Accuracy: {acc:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
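
One detail worth keeping in mind when reading these numbers: `precision_score` and `recall_score` treat label `1` as the positive class by default, and in this dataset `1` means benign. The reported precision and recall therefore describe the benign class; passing `pos_label=0` would report them for the malignant class instead. The sketch below recomputes both metrics by hand on made-up labels to show how much the two views can differ. The label lists and the helper `precision_recall` are hypothetical.

```python
def precision_recall(y_true, y_pred, positive):
    """Compute precision and recall, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 1 = benign, 0 = malignant.
toy_true = [1, 1, 1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 1, 1, 0, 1, 1, 0]

benign_precision, benign_recall = precision_recall(toy_true, toy_pred, positive=1)
malignant_precision, malignant_recall = precision_recall(toy_true, toy_pred, positive=0)
print(f"Benign as positive:    precision={benign_precision:.2f}, recall={benign_recall:.2f}")
print(f"Malignant as positive: precision={malignant_precision:.2f}, recall={malignant_recall:.2f}")
```

For a diagnosis task, recall on the malignant class is usually the number of most interest, since a missed malignant tumor is the costlier error.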

## Optimal Rule List (CORELS)

This algorithm aims to generate optimal rule lists. Unlike OneR, which focuses on a single feature, CORELS considers multiple features and searches for the rule list that best balances accuracy against the length of the list, which keeps the result interpretable.

Source: https://csinva.io/imodels/rule_list/corels_wrapper.html#imodels.rule_list.corels_wrapper.OptimalRuleListClassifier
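
A rule list is read top to bottom like an if / else-if chain: the first rule whose conditions all hold decides the prediction, and a default applies when nothing matches. The sketch below only shows how such a list is evaluated, not how CORELS searches for it. The binary feature names, the rules, and the default are all hypothetical.

```python
def predict_with_rule_list(sample, rule_list, default):
    """Return the prediction of the first rule whose conditions all hold."""
    for conditions, prediction in rule_list:
        if all(sample[name] for name in conditions):
            return prediction
    return default


# Hypothetical rule list over binary features: 1 = benign, 0 = malignant.
toy_rule_list = [
    (("radius_high",), 0),
    (("texture_high", "concavity_high"), 0),
]
toy_default = 1

samples = [
    {"radius_high": True, "texture_high": False, "concavity_high": False},
    {"radius_high": False, "texture_high": True, "concavity_high": True},
    {"radius_high": False, "texture_high": True, "concavity_high": False},
]
rule_list_predictions = [predict_with_rule_list(s, toy_rule_list, toy_default) for s in samples]
print(rule_list_predictions)
```

Because order matters, swapping two rules can change predictions, which is one difference between a rule list and an unordered rule set such as the one SLIPPER produces.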

In [None]:
optimal_rule_list_model = OptimalRuleListClassifier()
optimal_rule_list_model.fit(X_train, y_train, feature_names=breast_cancer.feature_names)

In [None]:
# Make predictions on the test set
y_pred = optimal_rule_list_model.predict(X_test)
# Evaluate the model
acc = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
print(f'Accuracy: {acc:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')

## SLIPPER

SLIPPER is designed to generate decision rule sets for binary classification. It constructs a set of simple rules that work together to classify new observations. It is a **boosting-based rule-learning algorithm**. 

The boosting process is a loop that generates simple rules, selects the best rule based on the **weighted error** of each rule, and updates the observation weights (increasing the weights of misclassified observations and decreasing the weights of correctly classified ones).

Source: https://csinva.io/imodels/rule_set/slipper.html
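
The reweighting step can be sketched for a single rule. This follows my understanding of the confidence-rated update used by SLIPPER and is not taken from the imodels source, so treat the exact formula as an assumption. Labels are +1 / -1, `covered` marks the observations the rule fires on, and the function name `slipper_weight_update` is hypothetical.

```python
import math


def slipper_weight_update(weights, labels, covered):
    """Return (rule confidence, renormalized weights) after one boosting round."""
    n = len(weights)
    triples = list(zip(weights, labels, covered))
    # Total weight of covered positives and covered negatives.
    w_plus = sum(w for w, y, c in triples if c and y == 1)
    w_minus = sum(w for w, y, c in triples if c and y == -1)
    # Smoothing keeps the ratio finite when a rule covers no negatives (assumed 1 / (2n)).
    smoothing = 1 / (2 * n)
    confidence = 0.5 * math.log((w_plus + smoothing) / (w_minus + smoothing))
    # Covered examples the rule gets right shrink, covered mistakes grow, the rest are unchanged.
    updated = [w * math.exp(-y * confidence) if c else w for w, y, c in triples]
    total = sum(updated)
    return confidence, [w / total for w in updated]


# Four observations with equal starting weights; the rule covers the first three.
start_weights = [0.25, 0.25, 0.25, 0.25]
toy_labels = [1, 1, -1, 1]
toy_covered = [True, True, True, False]

confidence, new_weights = slipper_weight_update(start_weights, toy_labels, toy_covered)
print(f"confidence={confidence:.3f}")
print([round(w, 3) for w in new_weights])
```

The third observation is covered by the rule but has the opposite label, so its weight ends up larger than that of the correctly covered observations, and the next rule is pushed to handle it.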

In [None]:
slipper_model = SlipperClassifier()
slipper_model.fit(X_train, y_train, feature_names=breast_cancer.feature_names)

In [None]:
# Make predictions on the test set
y_pred = slipper_model.predict(X_test)

# Evaluate the model
acc = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
print(f'Accuracy: {acc:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')


## Creating graphs to visualize the models' outputs

In [149]:
# Used a combination of this documentation https://networkx.org/documentation/stable/reference/classes/digraph.html
# And some help from Claude to figure out the correct way to create the graph
def create_rule_graph(model_name, model):
    # Create a new directed graph
    graph = Digraph(comment=f'Decision Rule Visualization for {model_name}')
    graph.attr(rankdir='TB', size='8,8')

    rules = model.rules_
    
    # Add nodes and edges based on the rules
    for i, rule in enumerate(rules):
        node_id = f"node_{i}"
        try:
            # Added this exception to work around the default rule as the format made it harder to include 
            # in the Digraph
            graph.node(node_id, f"{rule['col']}\n≤ {rule['cutoff']:.5f}")
        except KeyError:
            break

        
        # Add left (False) branch
        left_id = f"leaf_{i}_left"
        graph.node(left_id, f"Value: {rule['val']:.4f}\nPoints: {rule['num_pts'] - rule['num_pts_right']}")
        graph.edge(node_id, left_id, label='False')
        
        # Add right (True) branch
        right_id = f"leaf_{i}_right"
        graph.node(right_id, f"Value: {rule['val_right']:.4f}\nPoints: {rule['num_pts_right']}")
        graph.edge(node_id, right_id, label='True')
        
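        # Connect to the next rule only if it is a real split
        # (the default rule has no 'col' key and gets no node of its own)
        if i < len(rules) - 1 and 'col' in rules[i + 1]:
            graph.edge(left_id, f"node_{i+1}", style='dashed')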

    return graph

In [150]:
# Save the graph as a DOT source file
# The contents of the .dot file were pasted into https://www.devtoolsdaily.com/graphviz/ to render the diagram
def save_graph(dot, filename, format='pdf', graphviz_path=None):
    # `format` is currently unused: only the DOT source is written, nothing is rendered
    if graphviz_path:
        # Make the Graphviz binaries discoverable; `dot.engine` expects an engine name, not a path
        os.environ["PATH"] += os.pathsep + graphviz_path

    dot.save(f'{filename}.dot')
    print(f"DOT file saved as '{filename}.dot'")
        

In [None]:
# Try to save the graph
oneR_graph = create_rule_graph('One R', oneR_model)
save_graph(oneR_graph, 'oneR_box_diagram')

In [None]:
optimal_rule_list_graph= create_rule_graph('Corels', optimal_rule_list_model)
save_graph(optimal_rule_list_graph, 'CORELS_diagram')

In [None]:
dot = Digraph(comment='Slipper Model Decision Tree')
dot.attr(rankdir='TB', size='12,12')
rules = slipper_model.rules_
formatted_rules = []
# Format the numbers in the rules to have 3 digits after the decimal point
for rule in rules:
    split_rule = rule.rule.split(' ')
    # Use a separate index name so the inner loop does not shadow the outer one
    for j in range(len(split_rule)):
        try:
            split_rule[j] = '%.3f' % float(split_rule[j])
        except ValueError:
            # Not a number (feature name or operator): keep the token unchanged
            continue
    formatted_rules.append(' '.join(split_rule))

print(formatted_rules)

# Followed almost exact same visualization code as above to create the graph 
# Had to do it separately here since the output of the SLIPPER model is a decision rule set, which is different than the other models
# Add nodes and edges based on the rules
for i, rule in enumerate(formatted_rules):
    node_id = f"rule_{i}"
    dot.node(node_id, f"Rule {i+1}\n{rule}", shape='box')
    
    # Add Yes/No branches
    yes_id = f"yes_{i}"
    no_id = f"no_{i}"
    dot.node(yes_id, "Yes", shape='ellipse')
    dot.node(no_id, "No", shape='ellipse')
    dot.edge(node_id, yes_id, label='True')
    dot.edge(node_id, no_id, label='False')
    
    # Connect 'No' to the next rule if it's not the last rule
    if i < len(rules) - 1:
        dot.edge(no_id, f"rule_{i+1}", style='dashed')

save_graph(dot, 'SLIPPER')
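
The number formatting in the SLIPPER cell above splits each rule on spaces and tries to parse every token as a float. A regular expression is one possible alternative that leaves the spacing of the rule untouched. This is a sketch only: the example rule string is hypothetical, and unlike the loop above it rewrites only tokens that contain a decimal point, so integers stay as they are.

```python
import re

# Matches decimal numbers such as 14.98765 or -0.5 (integers are left alone).
DECIMAL_PATTERN = re.compile(r"-?\d+\.\d+")


def round_numbers(rule_text, digits=3):
    """Rewrite every decimal number in `rule_text` with a fixed number of digits."""
    return DECIMAL_PATTERN.sub(lambda m: f"{float(m.group()):.{digits}f}", rule_text)


# Hypothetical rule string, for illustration only.
example_rule = "mean radius > 14.98765 and worst area <= 880.123456"
print(round_numbers(example_rule))
```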