# Example of using the SDIGA algorithm

## About this document

The purpose of this document is to provide a simple example of how to use the SDIGA algorithm.

In the following sections, the algorithm is introduced, followed by instructions for installing the `subgroups` library. Then, the execution process of the SDIGA algorithm is described, including the necessary steps. Finally, the results are presented, covering both the information written to the output file and the information available through the model properties.

## SDIGA algorithm

SDIGA (Subgroup Discovery with Iterative Genetic Algorithms) is a subgroup discovery algorithm that uses an iterative genetic algorithm to find interesting subgroups in a dataset. It is designed to work with large datasets and can handle categorical results. The algorithm uses a fitness function to evaluate the quality of the discovered subgroups and iteratively refines the search space to find the most interesting ones. Once the genetic algorithm (GA) has converged, the algorithm performs a local search to find the best subgroup.

## Installing the `subgroups` library
To install the `subgroups` library, run the following cell:

In [1]:
!pip install subgroups



After that, to verify that the installation was successful, you can run the following cell:

In [2]:
import subgroups.tests as st
st.run_all_tests()

test_Operator_evaluate_method (tests.core.test_operator.TestOperator.test_Operator_evaluate_method) ... ok
test_Operator_evaluate_method_with_pandasSeries (tests.core.test_operator.TestOperator.test_Operator_evaluate_method_with_pandasSeries) ... ok
test_Operator_generate_from_str_method (tests.core.test_operator.TestOperator.test_Operator_generate_from_str_method) ... ok
test_Operator_string_representation (tests.core.test_operator.TestOperator.test_Operator_string_representation) ... ok
test_Pattern_contains_method (tests.core.test_pattern.TestPattern.test_Pattern_contains_method) ... ok
test_Pattern_general (tests.core.test_pattern.TestPattern.test_Pattern_general) ... ok
test_Pattern_is_contained_method (tests.core.test_pattern.TestPattern.test_Pattern_is_contained_method) ... ok
test_Pattern_is_refinement_method (tests.core.test_pattern.TestPattern.test_Pattern_is_refinement_method) ... ok
test_Selector_attributes (tests.core.test_selector.TestSelector.test_Selector_attributes) ..



##################################
########## CORE PACKAGE ##########
##################################


##############################################
########## QUALITY MEASURES PACKAGE ##########
##############################################


#############################################
########## DATA STRUCTURES PACKAGE ##########
#############################################


test_vertical_list_1 (tests.data_structures.test_vertical_list_with_bitsets.TestVerticalListWithBitsets.test_vertical_list_1) ... ok
test_vertical_list_2 (tests.data_structures.test_vertical_list_with_bitsets.TestVerticalListWithBitsets.test_vertical_list_2) ... ok
test_vertical_list_3 (tests.data_structures.test_vertical_list_with_bitsets.TestVerticalListWithBitsets.test_vertical_list_3) ... ok
test_vertical_list_str_method (tests.data_structures.test_vertical_list_with_bitsets.TestVerticalListWithBitsets.test_vertical_list_str_method) ... ok
test_vertical_list_1 (tests.data_structures.test_vertical_list_with_sets.TestVerticalListWithSets.test_vertical_list_1) ... ok
test_vertical_list_2 (tests.data_structures.test_vertical_list_with_sets.TestVerticalListWithSets.test_vertical_list_2) ... ok
test_vertical_list_3 (tests.data_structures.test_vertical_list_with_sets.TestVerticalListWithSets.test_vertical_list_3) ... ok
test_vertical_list_str_method (tests.data_structures.test_vertical_li



########################################
########## ALGORITHMS PACKAGE ##########
########################################


ok
test_BSD_cardinality (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_cardinality) ... ok
test_BSD_checkRel (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_checkRel) ... ok
test_BSD_checkRelevancies (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_checkRelevancies) ... ok
test_BSD_fit1 (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_fit1) ... ok
test_BSD_fit2 (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_fit2) ... ok
test_BSD_fit3 (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_fit3) ... ok
test_BSD_fit4 (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_fit4) ... ok
test_BSD_init_method (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_init_method) ... ok
test_BSD_logicalAnd (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_logicalAnd) ... ok
test_CBSD_checkRel (tests.algorithms.subgroup_sets.test_cbsd.TestCBSD.test_CBSD_checkRel) ... ok
test_CBSD_checkRelevancies (tests.algorithms.subgroup_sets.test



###################################
########## UTILS PACKAGE ##########
###################################


## Running the SDIGA algorithm

To run the SDIGA algorithm on a dataset, it is necessary to follow these steps:

- Load the dataset into a Pandas `DataFrame` object.
- Set the target, which must be a tuple of the form `(column_name, value)`.
- Create the SDIGA model with the desired parameters and run it.

The following is an example of running this algorithm on a dataset:

In [3]:
import pandas as pd
from subgroups.algorithms import SDIGA

dataset = pd.DataFrame(
    {
        'att1': ['v3', 'v2', 'v1'],
        'att2': ['v1', 'v2', 'v3'],
        'att3': ['v2', 'v1', 'v1'],
        'class': ['no', 'yes', 'no']
    }
)

target = ('class', 'yes')

model = SDIGA(
    max_generation=10,
    population_size=5,
    crossover_prob=0.7,
    mutation_prob=0.01,
    confidence_weight=0.4,
    support_weight=0.3,
    min_confidence=0.6,
    write_results_in_file=True,
    file_path="./sdiga_results.txt",
)
model.fit(dataset, target)
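
Before calling `fit`, it can help to confirm that the target really is a `(column_name, value)` tuple and that the value occurs in the dataset. The sketch below uses only pandas; `check_target` is a hypothetical helper written for this example and is not part of the `subgroups` library.

```python
import pandas as pd


def check_target(dataset, target):
    """Hypothetical helper (not part of `subgroups`): validate a (column_name, value) target.

    Returns the number of rows that match the target, or raises ValueError.
    """
    if not (isinstance(target, tuple) and len(target) == 2):
        raise ValueError("target must be a tuple of the form (column_name, value)")
    column, value = target
    if column not in dataset.columns:
        raise ValueError(f"column {column!r} is not in the dataset")
    # Count the rows whose target column equals the target value.
    positives = int((dataset[column] == value).sum())
    if positives == 0:
        raise ValueError(f"value {value!r} never appears in column {column!r}")
    return positives


# Same toy dataset as in the cell above.
dataset = pd.DataFrame(
    {
        'att1': ['v3', 'v2', 'v1'],
        'att2': ['v1', 'v2', 'v3'],
        'att3': ['v2', 'v1', 'v1'],
        'class': ['no', 'yes', 'no'],
    }
)
print(check_target(dataset, ('class', 'yes')))
```

A check like this catches a misspelled column name or target value before the genetic algorithm starts.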

## Results

Running the following cell, we get the subgroups obtained by the algorithm:

In [4]:
with open("./sdiga_results.txt", "r") as file:
    for current_line in file:
        print(current_line.strip())

Description: [att1 = 'v2', att2 = 'v2'], Target: class = 'yes' ; Quality Measure Fitness = 0.7142857142857143 ; tp = 1 ; fp = 0 ; TP = 1 ; FP = 2 ; unchecked_tp = 1 ; unchecked_fp = 0 ; TP_unchecked = 1 ; FP_unchecked = 2


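Each line of the output file represents one subgroup discovered by the algorithm and contains the following fields:

- The description of the subgroup found (here, `att1 = 'v2'` and `att2 = 'v2'`).
- The target, which is the one defined initially, i.e., `class = 'yes'`.
- The quality of the subgroup, given by the fitness measure.
- The values of `tp`, `fp`, `TP`, and `FP`. In this example, `tp` and `fp` count the rows covered by the subgroup that do and do not match the target (1 and 0), while `TP` and `FP` count the rows of the whole dataset that do and do not match the target (1 and 2).
- The same four counts computed on the unchecked dataset: `unchecked_tp`, `unchecked_fp`, `TP_unchecked`, and `FP_unchecked`.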
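
To work with these fields programmatically, a result line can be split into its parts. The sketch below assumes the exact line layout shown in the output above (fields separated by ` ; `, each metric written as `name = value`). `parse_result_line` is a hypothetical helper for this example, not a function of the `subgroups` library.

```python
import re


def parse_result_line(line):
    """Hypothetical helper: split one SDIGA result line into description, target and metrics."""
    head, *metric_parts = [part.strip() for part in line.strip().split(" ; ")]
    match = re.fullmatch(r"Description: \[(.*)\], Target: (.*)", head)
    if match is None:
        raise ValueError(f"Unrecognized result line: {line!r}")
    description, target = match.groups()
    metrics = {}
    for part in metric_parts:
        # Each metric looks like "tp = 1" or "Quality Measure Fitness = 0.71".
        name, value = part.rsplit(" = ", 1)
        metrics[name] = float(value)
    return {"description": description, "target": target, "metrics": metrics}


# Result line copied from the output above.
sample = (
    "Description: [att1 = 'v2', att2 = 'v2'], Target: class = 'yes' ; "
    "Quality Measure Fitness = 0.7142857142857143 ; tp = 1 ; fp = 0 ; TP = 1 ; FP = 2 ; "
    "unchecked_tp = 1 ; unchecked_fp = 0 ; TP_unchecked = 1 ; FP_unchecked = 2"
)
parsed = parse_result_line(sample)
print(parsed["description"])
print(parsed["target"])
print(parsed["metrics"])
```

This approach would break if an attribute value itself contained ` ; `, so it is only suitable for outputs like the ones in this document.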

These results have been verified in the output file of the SDIGA algorithm run on a toy dataset.
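
The fitness values printed in this document can be reproduced with a weighted average of support and confidence. The formula below is inferred from the printed numbers, not taken from the library source, so treat it as an assumption: support is taken as `unchecked_tp / (TP_unchecked + FP_unchecked)` and confidence as `tp / (tp + fp)`. `weighted_fitness` is a hypothetical name used only here.

```python
import math


def weighted_fitness(tp, fp, population, support_weight, confidence_weight):
    """Assumed formula: weighted average of support and confidence.

    Inferred from the printed results; not confirmed against the library source.
    """
    support = tp / population
    confidence = tp / (tp + fp) if (tp + fp) else 0.0
    total_weight = support_weight + confidence_weight
    return (support_weight * support + confidence_weight * confidence) / total_weight


# Toy dataset result: unchecked_tp = 1, unchecked_fp = 0, TP_unchecked = 1, FP_unchecked = 2.
toy = weighted_fitness(tp=1, fp=0, population=1 + 2,
                       support_weight=0.3, confidence_weight=0.4)

# Second line of the car evaluation output shown later in this document:
# unchecked_tp = 4, unchecked_fp = 0, TP_unchecked = 68, FP_unchecked = 1659.
car = weighted_fitness(tp=4, fp=0, population=68 + 1659,
                       support_weight=0.3, confidence_weight=0.4)

print(toy, car)
print(math.isclose(toy, 0.7142857142857143, rel_tol=1e-9))
print(math.isclose(car, 0.5724212093638846, rel_tol=1e-9))
```

Under this assumption, the fitness depends on the unchecked counts, which would explain why two subgroups with the same `tp` and `fp` can receive slightly different fitness values.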

We can also access different statistics about the result:

In [5]:
print("Selected subgroups: ", model.selected_subgroups) # Number of selected subgroups.
print("Unselected subgroups: ", model.unselected_subgroups) # Number of unselected subgroups due to not meeting the minimum quality threshold.

Selected subgroups:  1
Unselected subgroups:  48


You can also access the unchecked dataset and encoded dictionary:

In [6]:
print("Encode Dictionary: ", model.encoded_dict)
print("Unchecked Dataframe: ", model.unchecked_dataframe)

Encode Dictionary:  {'att1': ['v1', 'v2', 'v3'], 'att2': ['v1', 'v2', 'v3'], 'att3': ['v1', 'v2']}
Unchecked Dataframe:     att1  att2  att3 class
0     3     1     2    no
2     1     3     1    no
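
Comparing the two outputs suggests that each integer in the unchecked dataframe is the 1-based position of the original value in the corresponding list of the encoded dictionary (for instance, `att1 = 3` corresponds to `'v3'`). The sketch below decodes the dataframe under that assumption; `decode_dataframe` is a hypothetical helper, not part of the `subgroups` library.

```python
import pandas as pd


def decode_dataframe(encoded_df, encoded_dict):
    """Hypothetical helper: map integer codes back to labels.

    Assumes code k stands for the k-th value (1-based) listed in encoded_dict.
    """
    decoded = encoded_df.copy()
    for column, values in encoded_dict.items():
        code_to_label = dict(enumerate(values, start=1))
        decoded[column] = decoded[column].map(code_to_label)
    return decoded


# Values copied from the output above.
encoded_dict = {'att1': ['v1', 'v2', 'v3'], 'att2': ['v1', 'v2', 'v3'], 'att3': ['v1', 'v2']}
unchecked = pd.DataFrame(
    {'att1': [3, 1], 'att2': [1, 3], 'att3': [2, 1], 'class': ['no', 'no']},
    index=[0, 2],
)

decoded = decode_dataframe(unchecked, encoded_dict)
print(decoded)
```

The decoded rows match rows 0 and 2 of the original toy dataset. Row 1, the only row covered by the selected subgroup, is absent from the unchecked dataframe.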


## Another example

In [7]:
from subgroups import datasets

dataset_car = datasets.load_car_evaluation_csv()
target_car = ('class', 'good')

model_car = SDIGA(
    max_generation=25,
    population_size=1000,
    crossover_prob=0.9,
    mutation_prob=0.01,
    confidence_weight=0.4,
    support_weight=0.3,
    min_confidence=0.9,
    write_results_in_file=True,
    file_path="./sdiga_result_car.txt",
)
model_car.fit(dataset_car, target_car)

In [8]:
with open("./sdiga_result_car.txt", "r") as file:
    for current_line in file:
        print(current_line.strip())

Description: [buying = 'low', doors = '5more', lug_boot = 'med', maint = 'low', persons = '4', safety = 'med'], Target: class = 'good' ; Quality Measure Fitness = 0.5716765873015874 ; tp = 1 ; fp = 0 ; TP = 69 ; FP = 1659 ; unchecked_tp = 1 ; unchecked_fp = 0 ; TP_unchecked = 69 ; FP_unchecked = 1659
Description: [buying = 'low', lug_boot = 'small', maint = 'low', persons = '4', safety = 'high'], Target: class = 'good' ; Quality Measure Fitness = 0.5724212093638846 ; tp = 4 ; fp = 0 ; TP = 69 ; FP = 1659 ; unchecked_tp = 4 ; unchecked_fp = 0 ; TP_unchecked = 68 ; FP_unchecked = 1659
Description: [buying = 'med', lug_boot = 'big', maint = 'low', persons = 'more', safety = 'med'], Target: class = 'good' ; Quality Measure Fitness = 0.5724235138048256 ; tp = 4 ; fp = 0 ; TP = 69 ; FP = 1659 ; unchecked_tp = 4 ; unchecked_fp = 0 ; TP_unchecked = 64 ; FP_unchecked = 1659
Description: [buying = 'low', doors = '4', lug_boot = 'med', maint = 'low', persons = 'more', safety = 'med'], Target: cla

In [9]:
print("Selected subgroups: ", model_car.selected_subgroups) # Number of selected subgroups.
print("Unselected subgroups: ", model_car.unselected_subgroups) # Number of unselected subgroups due to not meeting the minimum quality threshold.

Selected subgroups:  35
Unselected subgroups:  37764


In [10]:
print("Encode Dictionary: ", model_car.encoded_dict)
print("Unchecked Dataframe: ", model_car.unchecked_dataframe)

Encode Dictionary:  {'buying': ['high', 'low', 'med', 'vhigh'], 'maint': ['high', 'low', 'med', 'vhigh'], 'doors': ['2', '3', '4', '5more'], 'persons': ['2', '4', 'more'], 'lug_boot': ['big', 'med', 'small'], 'safety': ['high', 'low', 'med']}
Unchecked Dataframe:        buying  maint  doors  persons  lug_boot  safety  class
0          4      4      1        1         3       2  unacc
1          4      4      1        1         3       3  unacc
2          4      4      1        1         3       1  unacc
3          4      4      1        1         2       2  unacc
4          4      4      1        1         2       3  unacc
...      ...    ...    ...      ...       ...     ...    ...
1720       2      2      4        3         3       3    acc
1722       2      2      4        3         2       2  unacc
1724       2      2      4        3         2       1  vgood
1725       2      2      4        3         1       2  unacc
1727       2      2      4        3         1       1  vgood

[1

## References
<a id="1">[1]</a>
Carmona, C.J., González, P., del Jesus, M.J. and Herrera, F. (2014) - Overview on evolutionary subgroup discovery: analysis of the suitability and potential of the search performed by evolutionary algorithms. WIREs Data Mining Knowl Discov, 4: 87-103. https://doi.org/10.1002/widm.1118

<a id="2">[2]</a> 
del Jesus, M.J., González, P., Herrera, F. and Mesonero, M. (2007) - Evolutionary Fuzzy Rule Induction Process for Subgroup Discovery: A Case Study in Marketing. IEEE Transactions on Fuzzy Systems, 15: 578-592. https://doi.org/10.1109/TFUZZ.2006.890662