# SDMap* usage example

## About this document

This document shows an example of how to use the SDMap* algorithm.

The following section introduces the SDMap* algorithm and is followed by instructions for installing the `subgroups` library in this environment. The process of running the algorithm is then described, including the necessary steps to consider.

Finally, the results of applying the SDMap* algorithm are presented, covering both the contents of the output file and the information available through the properties of the model.

## The SDMap* algorithm

SDMap* is a subgroup discovery algorithm that modifies the original design of SDMap to add some pruning techniques that improve performance. These changes are:

- A list of best subgroups is used to prune patterns whose optimistic estimate is worse than the quality of the worst subgroup in the list.
- In each call to FpTree where the tree does not have a single path, selectors are sorted so that those with a better optimistic estimate are processed first.
- Each new pattern generated after this sorting is checked so that its optimistic estimate is better than the quality of the worst subgroup in the list of best subgroups. If it is not, the pattern is discarded.
- In the construction of conditional FpTrees, branches whose optimistic estimate is worse than the quality of the worst subgroup in the list of best subgroups are discarded.
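
The pruning rules above can be illustrated with a minimal, library-independent sketch. It is not the implementation used by `subgroups`: the `BestSubgroups` class, the candidate names, and all numeric values are hypothetical and only show how a bounded list of best subgroups yields a pruning threshold, and why processing candidates with a better optimistic estimate first is useful.

```python
import heapq


class BestSubgroups:
    """Hypothetical helper that keeps the k best (quality, description) pairs."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # min-heap: the worst retained subgroup is at index 0

    def worst_quality(self):
        # Until the list is full there is no threshold, so nothing is pruned.
        if len(self._heap) < self.k:
            return float("-inf")
        return self._heap[0][0]

    def can_prune(self, optimistic_estimate):
        # Prune when even the optimistic estimate is worse than the worst kept quality.
        return optimistic_estimate < self.worst_quality()

    def offer(self, quality, description):
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (quality, description))
        elif quality > self._heap[0][0]:
            heapq.heapreplace(self._heap, (quality, description))

    def ranked(self):
        # Best subgroup first.
        return sorted(self._heap, reverse=True)


# Hypothetical candidates: (description, quality, optimistic estimate).
candidates = [
    ("a", 0.10, 0.30),
    ("b", 0.25, 0.40),
    ("c", 0.05, 0.08),
    ("d", 0.20, 0.22),
]

best = BestSubgroups(k=2)
pruned = []
# Candidates with a better optimistic estimate are processed first,
# so the threshold rises early and later candidates are pruned sooner.
for description, quality, estimate in sorted(candidates, key=lambda c: c[2], reverse=True):
    if best.can_prune(estimate):
        pruned.append(description)
        continue
    best.offer(quality, description)

print("Best subgroups:", best.ranked())
print("Pruned candidates:", pruned)
```

In this toy run only candidate `c` is pruned, because its optimistic estimate falls below the quality of the worst subgroup retained once the list is full.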

The SDMap* algorithm was introduced in [[1]](#1) and can be described as follows:

[![SDMap* Algorithm](https://i.imgur.com/wuQZ1kx.png)](https://i.imgur.com/wuQZ1kx.png)



## Installation of the `subgroups` library

To install the `subgroups` library in this environment, simply run the following cell:


In [None]:
!pip install subgroups

To verify that the installation was successful, we may run the following cell:

In [None]:
import subgroups.tests as st
st.run_all_tests()

test_Operator_evaluate_method (tests.core.test_operator.TestOperator) ... ok
test_Operator_evaluate_method_with_pandasSeries (tests.core.test_operator.TestOperator) ... ok
test_Operator_generate_from_str_method (tests.core.test_operator.TestOperator) ... ok
test_Operator_string_representation (tests.core.test_operator.TestOperator) ... ok
test_Pattern_general (tests.core.test_pattern.TestPattern) ... ok
test_Pattern_is_contained_method (tests.core.test_pattern.TestPattern) ... ok
test_Pattern_is_refinement_method (tests.core.test_pattern.TestPattern) ... ok
test_Selector_attributes (tests.core.test_selector.TestSelector) ... ok
test_Selector_comparisons (tests.core.test_selector.TestSelector) ... ok
test_Selector_creation_process (tests.core.test_selector.TestSelector) ... ok
test_Selector_deletion_process (tests.core.test_selector.TestSelector) ... ok
test_Selector_generate_from_str_method (tests.core.test_selector.TestSelector) ... ok
test_Selector_match_method (tests.core.test_selec



##################################
########## CORE PACKAGE ##########
##################################


##############################################
########## QUALITY MEASURES PACKAGE ##########
##############################################


#############################################
########## DATA STRUCTURES PACKAGE ##########
#############################################


ok
test_FPTreeForSDMap_build_tree_2 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap) ... ok
test_FPTreeForSDMap_build_tree_3 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap) ... ok
test_FPTreeForSDMap_build_tree_4 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap) ... ok
test_FPTreeForSDMap_generate_conditional_fp_tree_1 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap) ... ok
test_FPTreeForSDMap_generate_conditional_fp_tree_2 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap) ... ok
test_FPTreeForSDMap_generate_set_of_frequent_selectors_1 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap) ... ok
test_FPTreeForSDMap_generate_set_of_frequent_selectors_2 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap) ... ok
test_FPTreeForSDMap_generate_set_of_frequent_selectors_3 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap) ... ok
test_FPTreeNode_general (tests.data_struc



########################################
########## ALGORITHMS PACKAGE ##########
########################################


ok
test_CPBSD_fit3 (tests.algorithms.individual_subgroups.nominal_target.test_cpbsd.TestCPBSD) ... ok
test_CPBSD_fit4 (tests.algorithms.individual_subgroups.nominal_target.test_cpbsd.TestCPBSD) ... ok
test_CPBSD_init_method (tests.algorithms.individual_subgroups.nominal_target.test_cpbsd.TestCPBSD) ... ok
test_SDMap_additional_parameters_in_fit_method (tests.algorithms.individual_subgroups.nominal_target.test_sdmap.TestSDMap) ... ok
test_SDMap_fit_method_1 (tests.algorithms.individual_subgroups.nominal_target.test_sdmap.TestSDMap) ... ok
test_SDMap_fit_method_10 (tests.algorithms.individual_subgroups.nominal_target.test_sdmap.TestSDMap) ... ok
test_SDMap_fit_method_11 (tests.algorithms.individual_subgroups.nominal_target.test_sdmap.TestSDMap) ... ok
test_SDMap_fit_method_2 (tests.algorithms.individual_subgroups.nominal_target.test_sdmap.TestSDMap) ... ok
test_SDMap_fit_method_3 (tests.algorithms.individual_subgroups.nominal_target.test_sdmap.TestSDMap) ... ok
test_SDMap_fit_method_4 (t



###################################
########## UTILS PACKAGE ##########
###################################
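
As a lighter check than running the whole test suite, the installed version of the package can be queried with the standard library. This is only a sketch: it assumes the distribution is registered under the name `subgroups`, as in the `pip install` command above, and `installed_version` is a hypothetical helper, not part of the library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Assumption: the distribution name matches the one used with pip above.
found = installed_version("subgroups")
print("subgroups version:", found if found is not None else "not installed")
```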


## Running the algorithm

To run the SDMap* algorithm on a dataset, it is necessary to follow these steps:

- Load the dataset into a pandas `DataFrame` object and define the target as a `(column, value)` tuple.
- Select the quality measure and optimistic estimate to use.
- Create the SDMap* model with the desired parameters and run it.

The following is an example of running this algorithm on a small dataset:


In [None]:
from pandas import DataFrame
from subgroups.algorithms.individual_subgroups.nominal_target.sdmapstar import SDMapStar
from subgroups.quality_measures.wracc import WRAcc
from subgroups.quality_measures.wracc_optimistic_estimate_1 import WRAccOptimisticEstimate1


# Toy dataset with three nominal attributes and the 'class' column used as target.
dataset = DataFrame({
    'att1': ['v3', 'v2', 'v1', 'v3', 'v4', 'v4'],
    'att2': ['1', '2', '3', '3', '5', '6'],
    'att3': ['B', 'A', 'A', 'B', 'A', 'B'],
    'class': ['0', '1', '0', '0', '1', '1'],
})
target = ("class", "1")

model = SDMapStar(
    WRAcc(), WRAccOptimisticEstimate1(), 0.01,
    num_subgroups=3, minimum_n=0,
    write_results_in_file=True, file_path="./results.txt",
)
model.fit(dataset, target)

## Results

Running the following cell prints the first `N` lines of the output file, i.e., the first subgroups found by the algorithm:

In [None]:
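from itertools import islice

N = 10  # Number of result lines to print
with open("./results.txt", "r") as file:
    # islice stops after N lines without reading the rest of the file.
    for line in islice(file, N):
        print(line.strip())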

Description: [att1 = 'v4'], Target: class = '1' ; Quality Measure WRAcc = 0.16666666666666666 ; tp = 2 ; fp = 0 ; TP = 3 ; FP = 3
Description: [att1 = 'v4', att3 = 'B'], Target: class = '1' ; Quality Measure WRAcc = 0.08333333333333333 ; tp = 1 ; fp = 0 ; TP = 3 ; FP = 3
Description: [att1 = 'v4', att3 = 'A'], Target: class = '1' ; Quality Measure WRAcc = 0.08333333333333333 ; tp = 1 ; fp = 0 ; TP = 3 ; FP = 3
Description: [att3 = 'A'], Target: class = '1' ; Quality Measure WRAcc = 0.08333333333333331 ; tp = 2 ; fp = 1 ; TP = 3 ; FP = 3
Description: [att2 = '6'], Target: class = '1' ; Quality Measure WRAcc = 0.08333333333333333 ; tp = 1 ; fp = 0 ; TP = 3 ; FP = 3
Description: [att1 = 'v4', att2 = '6'], Target: class = '1' ; Quality Measure WRAcc = 0.08333333333333333 ; tp = 1 ; fp = 0 ; TP = 3 ; FP = 3
Description: [att2 = '6', att3 = 'B'], Target: class = '1' ; Quality Measure WRAcc = 0.08333333333333333 ; tp = 1 ; fp = 0 ; TP = 3 ; FP = 3
Description: [att1 = 'v4', att2 = '6', att3 =

Each of these lines represents a subgroup discovered by the algorithm. If we take the second result as an example, we have the following characteristics:
- The subgroup is described by the pattern `[att1 = 'v4', att3 = 'B']`.
- The target is the one we defined initially, i.e., `class = '1'`.
- The quality of the subgroup is measured by the WRAcc measure, which has a value of 0.0833...
- The values of tp, fp, TP, and FP are as follows: tp = 1, fp = 0, TP = 3, FP = 3.
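
A result line can also be split into these fields programmatically. The sketch below assumes that every line follows exactly the layout of the lines printed above; the regular expression and the `parse_result_line` helper are illustrative and not part of the `subgroups` library.

```python
import re

# Pattern inferred from the output lines shown in this document (an assumption).
LINE_PATTERN = re.compile(
    r"Description: \[(?P<description>.*)\], "
    r"Target: (?P<target>.*?) ; "
    r"Quality Measure (?P<measure>\w+) = (?P<quality>\S+) ; "
    r"tp = (?P<tp>\d+) ; fp = (?P<fp>\d+) ; TP = (?P<TP>\d+) ; FP = (?P<FP>\d+)"
)


def parse_result_line(line):
    """Return the fields of one result line as a dict, or None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["quality"] = float(fields["quality"])
    for key in ("tp", "fp", "TP", "FP"):
        fields[key] = int(fields[key])
    return fields


# Second result line from the output above.
example = (
    "Description: [att1 = 'v4', att3 = 'B'], Target: class = '1' ; "
    "Quality Measure WRAcc = 0.08333333333333333 ; tp = 1 ; fp = 0 ; TP = 3 ; FP = 3"
)
parsed = parse_result_line(example)
print(parsed["description"], parsed["measure"], parsed["quality"])
```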

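These values are consistent with the toy dataset: the only row matching `att1 = 'v4'` and `att3 = 'B'` belongs to class `'1'` (tp = 1, fp = 0), and the dataset contains three rows of class `'1'` and three of class `'0'` (TP = 3, FP = 3).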
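
The reported quality values can be recomputed by hand from the four counts. The sketch below uses the standard definition of WRAcc, `(n / N) * (tp / n - TP / N)` with `n = tp + fp` and `N = TP + FP`. It assumes the library follows this definition, which agrees with the values in the output above. The `wracc` function is a hypothetical helper written for this check, not the library's `WRAcc` class.

```python
import math


def wracc(tp, fp, TP, FP):
    """Weighted Relative Accuracy computed from subgroup and dataset counts."""
    n = tp + fp  # instances covered by the subgroup
    N = TP + FP  # instances in the whole dataset
    if n == 0:
        return 0.0  # convention chosen for this sketch: empty subgroups score 0
    return (n / N) * (tp / n - TP / N)


# (tp, fp, TP, FP) taken from three of the result lines above.
checks = {
    "[att1 = 'v4']": (2, 0, 3, 3),
    "[att1 = 'v4', att3 = 'B']": (1, 0, 3, 3),
    "[att3 = 'A']": (2, 1, 3, 3),
}
for description, counts in checks.items():
    print(description, "->", wracc(*counts))
```

For the second subgroup this gives (1/6) * (1 - 1/2) = 1/12, matching the reported value of 0.0833...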

We can also access different statistics about the result:


In [None]:
print("Pruned branches: ", model.pruned_branches) # Number of branches pruned by the threshold of the best subgroups
print("Conditional pruned branches: ", model.conditional_pruned_branches) # Number of branches pruned from some conditional FPTree by the threshold of the best subgroups
print("Selected subgroups: ", model.selected_subgroups) # Number of selected subgroups
print("Unselected subgroups: ", model.unselected_subgroups) # Number of unselected subgroups due to not meeting the minimum quality threshold
print("Visited nodes: ", model.visited_nodes) # Number of nodes (subgroups) visited from the search space


Pruned subgroups:  4
Conditional pruned branches:  0
Selected subgroups:  18
Unselected subgroups:  1
Visited nodes:  19


## References
<a id="1">[1]</a> 
Atzmueller, M., Lemmerich, F. (2009). Fast Subgroup Discovery for Continuous Target Concepts. In: Rauch, J., Raś, Z.W., Berka, P., Elomaa, T. (eds) Foundations of Intelligent Systems. ISMIS 2009. Lecture Notes in Computer Science, vol 5722. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04125-9_7