# Example of using the BSD, CBSD and CPBSD algorithms

## About this document

This document shows an example of using the BSD, CBSD and CPBSD algorithms.

The following sections introduce these algorithms and explain how to install the `subgroups` library. They then describe how to run the algorithms, including the steps required. Finally, they present the results, covering both the information written to the output file and the information available through the model properties.

## BSD algorithm

BSD [[1]](#1) is a subgroup discovery algorithm that introduces the concept of dominance relation between subgroups. This algorithm also uses a list of the $k$ best subgroups along with an optimistic estimation to prune the search space.
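As a rough illustration of how a list of the $k$ best subgroups and an optimistic estimate can work together, the sketch below keeps the best candidates in a min-heap and skips any candidate whose estimate cannot beat the worst entry. The function names and the counts are hypothetical, and the bound used here (keep every covered positive, drop every covered negative) is one valid upper bound for WRAcc; it is not necessarily the estimate that the `subgroups` library implements.

```python
import heapq

def wracc(tp, fp, TP, FP):
    """Weighted Relative Accuracy from subgroup (tp, fp) and dataset (TP, FP) counts."""
    n, N = tp + fp, TP + FP
    return 0.0 if n == 0 else (n / N) * (tp / n - TP / N)

def wracc_upper_bound(tp, TP, FP):
    """Upper bound on the WRAcc of any refinement: all tp positives kept, no negatives."""
    N = TP + FP
    return (tp / N) * (1 - TP / N)

def can_prune(top_k, k, estimate):
    """True when the list is full and the estimate cannot beat its worst entry."""
    return len(top_k) == k and estimate <= top_k[0][0]

def offer(top_k, k, quality, description):
    """Add a candidate to the min-heap, evicting the worst entry if the heap is full."""
    entry = (quality, description)
    if len(top_k) < k:
        heapq.heappush(top_k, entry)
    elif quality > top_k[0][0]:
        heapq.heapreplace(top_k, entry)

# Hypothetical dataset with 50 positive and 50 negative rows.
TP, FP, k = 50, 50, 2
top_k = []
offer(top_k, k, wracc(30, 10, TP, FP), "a")  # quality 0.1
offer(top_k, k, wracc(20, 5, TP, FP), "b")   # quality about 0.075

# A branch covering only 10 positives can reach at most 0.05, so it can be skipped.
prune_small = can_prune(top_k, k, wracc_upper_bound(10, TP, FP))
# A branch covering 40 positives could still reach 0.2, so it must be explored.
prune_large = can_prune(top_k, k, wracc_upper_bound(40, TP, FP))
print(prune_small, prune_large)  # True False
```

The min-heap keeps the worst of the current best subgroups at index 0, which makes the pruning test a single comparison.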

To handle the dominances between subgroups, BSD uses a bitset that stores for each discovered pattern and each row of the dataset whether the pattern appears in the row or not.
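The idea of a bitset can be sketched with plain Python integers, where bit $i$ is set when the pattern appears in row $i$. The rows and the helper name `selector_bitset` below are hypothetical and only meant to illustrate the representation, not the internal data structure of the `subgroups` library.

```python
# Toy rows (hypothetical data, for illustration only).
rows = [
    {"bread": "yes", "milk": "no"},
    {"bread": "yes", "milk": "yes"},
    {"bread": "no", "milk": "yes"},
]

def selector_bitset(rows, attribute, value):
    """Return an int whose bit i is 1 when row i satisfies attribute == value."""
    bits = 0
    for i, row in enumerate(rows):
        if row[attribute] == value:
            bits |= 1 << i
    return bits

bread = selector_bitset(rows, "bread", "yes")  # rows 0 and 1 -> 0b011
milk = selector_bitset(rows, "milk", "yes")    # rows 1 and 2 -> 0b110

# The rows covered by a conjunction are the bitwise AND of its selectors.
both = bread & milk                            # row 1 only   -> 0b010
support = bin(both).count("1")
print(bin(both), support)  # 0b10 1
```

With this encoding, refining a pattern with one more condition costs a single bitwise AND, and counting covered rows is a population count.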

Regarding the dominance relation, a subgroup $S$ makes another subgroup $S'$ irrelevant by dominance if and only if every positive instance covered by $S'$ is also covered by $S$, and every negative instance covered by $S$ is also covered by $S'$.
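The dominance condition translates directly into two subset tests on bitsets. The following is a minimal sketch under the assumption that each subgroup's cover and the set of positive rows are stored as Python integers (bit $i$ for row $i$); the function name `dominates` and the example covers are hypothetical.

```python
def dominates(cover_s, cover_s_prime, positives):
    """True if S makes S' irrelevant: pos(S') within pos(S) and neg(S) within neg(S')."""
    pos_s, pos_sp = cover_s & positives, cover_s_prime & positives
    neg_s, neg_sp = cover_s & ~positives, cover_s_prime & ~positives
    positives_included = (pos_sp & ~pos_s) == 0   # no positive of S' is missing from S
    negatives_included = (neg_s & ~neg_sp) == 0   # no negative of S is missing from S'
    return positives_included and negatives_included

# Hypothetical example with 5 rows: rows 0-3 are positive, row 4 is negative.
positives = 0b01111
cover_s = 0b00111        # S covers rows 0, 1, 2
cover_s_prime = 0b10011  # S' covers rows 0, 1, 4

print(dominates(cover_s, cover_s_prime, positives))  # True
print(dominates(cover_s_prime, cover_s, positives))  # False
```

Here $S$ covers every positive that $S'$ covers and no negatives at all, so $S'$ adds nothing once $S$ is known.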

## CBSD and CPBSD algorithms

These algorithms are variants of the BSD algorithm that implement the Closed and Closed on Positives relations, respectively.
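This document does not define the two relations, so the sketch below assumes the usual definitions from the closed-pattern literature: a description is closed if adding any further selector changes the set of covered rows, and closed on positives if adding any further selector changes the set of covered positive rows. All names and the example covers are hypothetical, and the library's implementation may differ.

```python
from functools import reduce

def cover(description, selector_covers, all_rows):
    """Rows covered by a conjunction of selectors, as a bitset."""
    return reduce(lambda acc, sel: acc & selector_covers[sel], description, all_rows)

def is_closed(description, selector_covers, all_rows, mask=None):
    """Closed if no extra selector leaves the (masked) cover unchanged.

    With mask=None every row counts (closed); with mask=positives only
    positive rows count (closed on positives).
    """
    mask = all_rows if mask is None else mask
    current = cover(description, selector_covers, all_rows) & mask
    for sel, bits in selector_covers.items():
        if sel not in description and (current & bits) == current:
            return False  # adding sel keeps the same covered rows
    return True

# Hypothetical example with 4 rows: rows 0 and 1 are positive.
ALL_ROWS, POSITIVES = 0b1111, 0b0011
covers = {"a": 0b0111, "b": 0b1011}  # "a" covers rows 0-2, "b" covers rows 0, 1, 3

print(is_closed({"a"}, covers, ALL_ROWS))                  # True
print(is_closed({"a"}, covers, ALL_ROWS, mask=POSITIVES))  # False
```

In this example, adding `"b"` to `{"a"}` drops row 2, so `{"a"}` is closed, but the positive rows stay the same, so it is not closed on positives.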

## Installing the `subgroups` library

To install the `subgroups` library, simply execute the following cell:

In [None]:
!pip install subgroups

To verify that the installation was successful, you can run the following cell:

In [None]:
import subgroups.tests as st
st.run_all_tests()

## Running the BSD algorithm

To run the BSD algorithm on a dataset, the following steps are necessary:

- Load the dataset into a Pandas `DataFrame` object.
- Set the target, which must be a tuple of the form `(column_name, value)`.
- Select the quality measure and optimistic estimation to use.
- Create the BSD model with the desired parameters and run it.

Below is an example of running this algorithm on a small dataset:

In [None]:
from pandas import DataFrame
from subgroups.algorithms import BSD
from subgroups.quality_measures import WRAcc
from subgroups.quality_measures import WRAccOptimisticEstimate1

dataset = DataFrame({'bread': ['yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes'], 'milk': ['yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'], 'beer': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'no'], 'coke': ['no', 'no', 'yes', 'no', 'yes', 'no', 'yes'], 'diaper': ['no', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes']})
target = ("diaper", "yes")

bsd_model = BSD(min_support=0, quality_measure=WRAcc(), optimistic_estimate=WRAccOptimisticEstimate1(),
                num_subgroups=8, max_depth=100, write_results_in_file=True, file_path="./results_BSD.txt")
bsd_model.fit(dataset, target)

## Results

Running the following cell, we get the output of the first subgroups found by the algorithm:

In [None]:
from itertools import islice

subgroups_to_read = 10
with open("./results_BSD.txt", "r") as file:
    # Print at most `subgroups_to_read` lines; stop early if the file is shorter.
    for current_line in islice(file, subgroups_to_read):
        print(current_line.strip())

Each of these lines represents a subgroup discovered by the algorithm. Taking the second result as an example, we have the following characteristics:
- The subgroup is described by the condition `milk = 'yes'`.
- The target is the one we defined in the first place, i.e., `diaper = 'yes'`.
- The quality of the subgroup is measured with the WRAcc measure, which has a value of -0.020408163265306048.
- The values of tp, fp, TP, and FP are as follows: tp = 5; fp = 1; TP = 6; FP = 1.
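The WRAcc value above can be reproduced by hand from the four counts. The snippet below is a small check using the standard WRAcc formula, $\frac{tp+fp}{TP+FP}\left(\frac{tp}{tp+fp}-\frac{TP}{TP+FP}\right)$; the function name `wracc` is only illustrative and is not the library's API.

```python
def wracc(tp, fp, TP, FP):
    """Weighted Relative Accuracy from subgroup (tp, fp) and dataset (TP, FP) counts."""
    n = tp + fp  # rows covered by the subgroup
    N = TP + FP  # rows in the dataset
    return (n / N) * (tp / n - TP / N)

# Counts for the subgroup milk = 'yes' with target diaper = 'yes'.
quality = wracc(tp=5, fp=1, TP=6, FP=1)
print(round(quality, 6))  # -0.020408
```

The subgroup covers 6 of the 7 rows, and its share of positives (5/6) is slightly below the overall share (6/7), which is why the quality is slightly negative.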

These values match the output file produced by running the BSD algorithm on the toy dataset above.

We can also access different statistics about the results:

In [None]:
print("Selected subgroups: ", bsd_model.selected_subgroups) # Number of selected subgroups
print("Unselected subgroups: ", bsd_model.unselected_subgroups) # Number of unselected subgroups
print("Visited subgroups: ", bsd_model.visited_subgroups) # Number of generated subgroups

## Running the CBSD and CPBSD algorithms

These variants of the BSD algorithm are also available in the `subgroups` library, and they are used in the same way as BSD. Below is the same example using these algorithms:

In [None]:
from pandas import DataFrame
from subgroups.algorithms import CBSD
from subgroups.algorithms import CPBSD
from subgroups.quality_measures import WRAcc
from subgroups.quality_measures import WRAccOptimisticEstimate1


dataset = DataFrame({'bread': ['yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes'], 'milk': ['yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'], 'beer': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'no'], 'coke': ['no', 'no', 'yes', 'no', 'yes', 'no', 'yes'], 'diaper': ['no', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes']})
target = ("diaper", "yes")

cbsd_model = CBSD(min_support=0, quality_measure=WRAcc(), optimistic_estimate=WRAccOptimisticEstimate1(),
                  num_subgroups=8, max_depth=100, write_results_in_file=True, file_path="./results_CBSD.txt")
cbsd_model.fit(dataset, target)

cpbsd_model = CPBSD(min_support=0, quality_measure=WRAcc(), optimistic_estimate=WRAccOptimisticEstimate1(),
                    num_subgroups=8, max_depth=100, write_results_in_file=True, file_path="./results_CPBSD.txt")
cpbsd_model.fit(dataset, target)

## Results

Running the following cell, we get the output of the first subgroups found by the algorithms:

In [None]:
from itertools import islice

subgroups_to_read = 10
with open("./results_CBSD.txt", "r") as file:
    # Print at most `subgroups_to_read` lines; stop early if the file is shorter.
    for current_line in islice(file, subgroups_to_read):
        print(current_line.strip())

In [None]:
from itertools import islice

subgroups_to_read = 10
with open("./results_CPBSD.txt", "r") as file:
    # Print at most `subgroups_to_read` lines; stop early if the file is shorter.
    for current_line in islice(file, subgroups_to_read):
        print(current_line.strip())

We can also access different statistics about the results:

In [None]:
print("Selected subgroups: ", cbsd_model.selected_subgroups) # Number of selected subgroups
print("Unselected subgroups: ", cbsd_model.unselected_subgroups) # Number of unselected subgroups
print("Visited subgroups: ", cbsd_model.visited_subgroups) # Number of generated subgroups

In [None]:
print("Selected subgroups: ", cpbsd_model.selected_subgroups) # Number of selected subgroups
print("Unselected subgroups: ", cpbsd_model.unselected_subgroups) # Number of unselected subgroups
print("Visited subgroups: ", cpbsd_model.visited_subgroups) # Number of generated subgroups

## References

<a id="1">[1]</a>
Florian Lemmerich, Mathias Rohlfs, & Martin Atzmueller. (2010, May). Fast discovery of relevant subgroup patterns. In Twenty-Third International FLAIRS Conference, pp. 428-433.