# Example of using the CN2SD algorithm

## About this document

The purpose of this document is to show an example of using the CN2SD algorithm.

The following section introduces the CN2SD algorithm and is followed by instructions for installing the `subgroups` library. The document then describes how to run the algorithm and the steps this requires.

Finally, the results of applying the algorithm are presented, covering both the information written to the output file and the statistics that can be accessed through the model properties.

## CN2SD Algorithm

The CN2SD algorithm [[1]](#1) is a subgroup discovery algorithm obtained by adapting CN2, a rule induction system for learning classification rules.

The modifications carried out to adapt the CN2 algorithm are:

- Replacement of the precision-based search heuristic with a new unusualness heuristic that combines the generality and the precision of the rule.
- Incorporation of example weights into the covering algorithm.
- Incorporation of example weights into the unusualness search heuristic.
- Use of probabilistic classification based on the class distribution of the examples covered by individual rules.
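
To make the example weighting more concrete, the following sketch shows the two weighting schemes usually described for CN2-SD: an additive scheme, where an example covered `i` times gets weight `1 / (i + 1)`, and a multiplicative scheme, where it gets weight `gamma ** i`. The function names and the value of `gamma` are illustrative assumptions, and this is not the internal code of the `subgroups` library.

```python
def additive_weight(times_covered):
    """Additive scheme: weight 1 / (i + 1) for an example covered i times."""
    return 1.0 / (times_covered + 1)


def multiplicative_weight(times_covered, gamma=0.5):
    """Multiplicative scheme: weight gamma ** i, with 0 < gamma < 1."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    return gamma ** times_covered


# Weights of a single example as more and more selected rules cover it.
for i in range(4):
    print(i, additive_weight(i), multiplicative_weight(i, gamma=0.5))
```

In both schemes an example that has never been covered keeps weight 1, and its weight shrinks each time a selected rule covers it, so later rules are pushed towards examples that have not been covered yet.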

The WRAcc (weighted relative accuracy) quality measure is used to evaluate the quality of the subgroups.
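
As a minimal sketch, WRAcc can be computed from four counts using its standard definition: the coverage of the subgroup multiplied by the difference between the precision inside the subgroup and the share of positives in the whole dataset. The function name `wracc` is hypothetical. The parameter names follow the `tp`, `fp`, `TP` and `FP` fields of the output file, under the assumption that `tp` and `fp` are the (possibly weighted) positives and negatives covered by the subgroup and `TP` and `FP` are the corresponding totals in the dataset.

```python
def wracc(tp, fp, TP, FP):
    """Weighted relative accuracy computed from (possibly weighted) counts."""
    covered = tp + fp  # examples covered by the subgroup
    total = TP + FP    # all examples in the dataset
    if covered == 0 or total == 0:
        return 0.0     # convention chosen here for empty subgroups or datasets
    return (covered / total) * (tp / covered - TP / total)


# Counts taken from the first result line shown later in this document.
print(wracc(tp=1, fp=0, TP=1, FP=6))
```

With those counts the expression reduces to (1/7) * (1 - 1/7) = 6/49, approximately 0.1224, which agrees with the value reported in the output file.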

## Installing the `subgroups` library

To install the `subgroups` library, simply execute the following cell:

In [1]:
!pip install subgroups



To verify that the installation was successful, you can run the following cell:

In [2]:
import subgroups.tests as st
st.run_all_tests()

test_Operator_evaluate_method (tests.core.test_operator.TestOperator) ... ok
test_Operator_evaluate_method_with_pandasSeries (tests.core.test_operator.TestOperator) ... ok
test_Operator_generate_from_str_method (tests.core.test_operator.TestOperator) ... ok
test_Operator_string_representation (tests.core.test_operator.TestOperator) ... ok
test_Pattern_contains_method (tests.core.test_pattern.TestPattern) ... ok
test_Pattern_general (tests.core.test_pattern.TestPattern) ... ok
test_Pattern_is_contained_method (tests.core.test_pattern.TestPattern) ... ok
test_Pattern_is_refinement_method (tests.core.test_pattern.TestPattern) ... ok
test_Selector_attributes (tests.core.test_selector.TestSelector) ... ok
test_Selector_comparisons (tests.core.test_selector.TestSelector) ... ok
test_Selector_creation_process (tests.core.test_selector.TestSelector) ... ok
test_Selector_deletion_process (tests.core.test_selector.TestSelector) ... ok
test_Selector_generate_from_str_method (tests.core.test_selec



##################################
########## CORE PACKAGE ##########
##################################


##############################################
########## QUALITY MEASURES PACKAGE ##########
##############################################


#############################################
########## DATA STRUCTURES PACKAGE ##########
#############################################


ok
test_FPTreeForSDMap_generate_conditional_fp_tree_2 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap) ... ok
test_FPTreeForSDMap_generate_set_of_frequent_selectors_1 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap) ... ok
test_FPTreeForSDMap_generate_set_of_frequent_selectors_2 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap) ... ok
test_FPTreeForSDMap_generate_set_of_frequent_selectors_3 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap) ... ok
test_FPTreeNode_general (tests.data_structures.test_fp_tree_node.TestFPTreeNode) ... ok
test_subgroup_list_1 (tests.data_structures.test_subgroup_list.TestSubgroupList) ... ok
test_subgroup_list_2 (tests.data_structures.test_subgroup_list.TestSubgroupList) ... ok
test_subgroup_list_3 (tests.data_structures.test_subgroup_list.TestSubgroupList) ... ok
test_subgroup_list_4 (tests.data_structures.test_subgroup_list.TestSubgroupList) ... ok
test_vertical_list_1 (tests.data_structures



########################################
########## ALGORITHMS PACKAGE ##########
########################################


ok
test_BSD_cardinality (tests.algorithms.subgroup_sets.test_bsd.TestBSD) ... ok
test_BSD_checkRel (tests.algorithms.subgroup_sets.test_bsd.TestBSD) ... ok
test_BSD_checkRelevancies (tests.algorithms.subgroup_sets.test_bsd.TestBSD) ... ok
test_BSD_fit1 (tests.algorithms.subgroup_sets.test_bsd.TestBSD) ... ok
test_BSD_fit2 (tests.algorithms.subgroup_sets.test_bsd.TestBSD) ... ok
test_BSD_fit3 (tests.algorithms.subgroup_sets.test_bsd.TestBSD) ... ok
test_BSD_fit4 (tests.algorithms.subgroup_sets.test_bsd.TestBSD) ... ok
test_BSD_init_method (tests.algorithms.subgroup_sets.test_bsd.TestBSD) ... ok
test_BSD_logicalAnd (tests.algorithms.subgroup_sets.test_bsd.TestBSD) ... ok
test_CBSD_checkRel (tests.algorithms.subgroup_sets.test_cbsd.TestCBSD) ... ok
test_CBSD_checkRelevancies (tests.algorithms.subgroup_sets.test_cbsd.TestCBSD) ... ok
test_CBSD_fit1 (tests.algorithms.subgroup_sets.test_cbsd.TestCBSD) ... ok
test_CBSD_fit2 (tests.algorithms.subgroup_sets.test_cbsd.TestCBSD) ... ok
test_CBSD_



###################################
########## UTILS PACKAGE ##########
###################################


ok

----------------------------------------------------------------------
Ran 1 test in 0.007s

OK


## Running the algorithm

To run the CN2SD algorithm on a dataset, the following steps are necessary:

- Load the dataset into a pandas `DataFrame` object and specify the target column.
- Create the CN2SD model with the desired parameters and run it.

Below is an example of running this algorithm on a small dataset:


In [3]:
from pandas import DataFrame
from subgroups.algorithms import CN2SD

# Toy market-basket dataset: each row is one transaction.
input_dataframe = DataFrame({
    'bread': ['yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes'],
    'milk': ['yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
    'beer': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'no'],
    'coke': ['no', 'no', 'yes', 'no', 'yes', 'no', 'yes'],
    'diaper': ['no', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes'],
})
target = "diaper"  # name of the target column

# Create the model and write the discovered subgroups to ./results.txt.
model = CN2SD(beam_width=2, weighting_scheme='aditive', write_results_in_file=True, file_path="./results.txt")
binary_attributes = ["bread", "milk", "beer", "coke", "diaper"]
result = model.fit(input_dataframe, target, binary_attributes)

ImportError: cannot import name 'CN2SD' from 'subgroups.algorithms' (C:\Users\anlopezmc\anaconda3\lib\site-packages\subgroups\algorithms\__init__.py)
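
The `ImportError` above indicates that the installed `subgroups` package does not expose `CN2SD` from `subgroups.algorithms`. One plausible cause, stated here as an assumption, is an older installed version of the library. The following sketch checks whether a name is importable before running the example; the helper `exports_name` is hypothetical and uses only the standard library.

```python
import importlib


def exports_name(module_name, attribute):
    """Return True if `module_name` can be imported and exposes `attribute`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself (or one of its parents) is not installed.
        return False
    return hasattr(module, attribute)


if exports_name("subgroups.algorithms", "CN2SD"):
    print("CN2SD is available in the installed subgroups package.")
else:
    print("CN2SD is not available; the installed subgroups version may need upgrading.")
```

If the check fails, upgrading the package (for example with `pip install --upgrade subgroups`) may help, provided a release that includes CN2SD is available.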

## Results

Running the following cell prints the first subgroups found by the algorithm:

In [None]:
subgroups_to_read = 10
with open("./results.txt", "r") as file:
    # Print the first lines of the results file, one subgroup per line.
    for _ in range(subgroups_to_read):
        print(file.readline().strip())

Each of these lines represents a subgroup discovered by the algorithm. Taking the first result as an example, we have the following characteristics:
- The subgroup is described by the condition `[beer = 'no', coke = 'no']`.
- The target is the value `diaper = 'no'` of the target column we defined in the first place, i.e., `diaper`.
- The quality of the subgroup is measured with the WRAcc measure, which has a value of approximately 0.122.
- The values of tp, fp, TP, and FP are as follows: tp = 1 ; fp = 0 ; TP = 1 ; FP = 6.

These values match the output file produced by running the CN2SD algorithm on this toy dataset.
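
To work with these fields programmatically, a result line can be split into its parts. The sketch below assumes that every line follows exactly the layout of the output shown in this document (`Description: ..., Target: ... ; name = value ; ...`); the helper `parse_result_line` is hypothetical and is not part of the `subgroups` library.

```python
def parse_result_line(line):
    """Split one result line into (description, target, stats)."""
    head, *fields = line.strip().split(" ; ")
    description, target = head.split(", Target: ")
    description = description[len("Description: "):]
    stats = {}
    for field in fields:
        # Each field looks like "tp = 1" or "Quality Measure WRACC = 0.12...".
        name, value = field.rsplit(" = ", 1)
        stats[name] = float(value)
    return description, target, stats


example_line = (
    "Description: [beer = 'no', coke = 'no'], Target: diaper = 'no' ; "
    "Quality Measure WRACC = 0.12244897959183673 ; "
    "tp = 1 ; fp = 0 ; TP = 1 ; FP = 6"
)
description, target, stats = parse_result_line(example_line)
print(description)
print(target)
print(stats)
```

Keeping the numeric fields in a dictionary makes it easy to filter or sort the subgroups, for example by their `Quality Measure WRACC` value.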

We can also access different statistics about the result:


In [None]:
print(model.selected_subgroups)    # Number of selected subgroups
print(model.unselected_subgroups)  # Number of unselected subgroups
print(model.visited_nodes)         # Number of generated subgroups

```
Description: [beer = 'no', coke = 'no'], Target: diaper = 'no' ; Quality Measure WRACC = 0.12244897959183673 ; tp = 1 ; fp = 0 ; TP = 1 ; FP = 6
Description: [beer = 'no', coke = 'no'], Target: diaper = 'no' ; Quality Measure WRACC = 0.12244897959183673 ; tp = 1 ; fp = 0 ; TP = 1 ; FP = 6
Description: [beer = 'no'], Target: diaper = 'no' ; Quality Measure WRACC = 0.04733727810650888 ; tp = 0.5 ; fp = 2 ; TP = 0.5 ; FP = 6
.
.
.
Description: [coke = 'yes'], Target: diaper = 'yes' ; Quality Measure WRACC = 0.16000000000000003 ; tp = 0.25 ; fp = 0 ; TP = 0.25 ; FP = 1
Description: [coke = 'yes'], Target: diaper = 'yes' ; Quality Measure WRACC = 0.16000000000000003 ; tp = 0.25 ; fp = 0 ; TP = 0.25 ; FP = 1
Description: [coke = 'yes'], Target: diaper = 'yes' ; Quality Measure WRACC = 0.1487603305785124 ; tp = 0.2222222222222222 ; fp = 0 ; TP = 0.2222222222222222 ; FP = 1
Description: [coke = 'yes'], Target: diaper = 'yes' ; Quality Measure WRACC = 0.1487603305785124 ; tp = 0.2222222222222222 ; fp = 0 ; TP = 0.2222222222222222 ; FP = 1
Description: [coke = 'yes'], Target: diaper = 'yes' ; Quality Measure WRACC = 0.1487603305785124 ; tp = 0.2222222222222222 ; fp =
```

## References
<a id="1">[1]</a>
Clark, Peter & Niblett, Tim. (2000). Induction in Noisy Domains.