# Example of using the GMSL algorithm

## About this document

The purpose of this document is to show an example of using the GMSL algorithm.

The following sections introduce the algorithm and explain how to install the `subgroups` library. They then describe how to run the GMSL algorithm, including the necessary steps, and finally present the results, highlighting the information contained in the output file.

## GMSL algorithm

GMSL (Generation of Multiple Subgroup Lists) [[1]](#1) is an algorithm that generates diverse top-k subgroup lists. This algorithm combines the subgroup discovery (SD) technique and the Minimum Description Length (MDL) principle and uses the Subgroup List model.

## Installing the `subgroups` library

To install the `subgroups` library, you have to execute the following cell:

In [1]:
!pip install subgroups



After that, to verify that the installation was successful, you can run the following cell:

In [2]:
import subgroups.tests as st
st.run_all_tests()

test_Operator_evaluate_method (tests.core.test_operator.TestOperator.test_Operator_evaluate_method) ... ok
test_Operator_evaluate_method_with_pandasSeries (tests.core.test_operator.TestOperator.test_Operator_evaluate_method_with_pandasSeries) ... ok
test_Operator_generate_from_str_method (tests.core.test_operator.TestOperator.test_Operator_generate_from_str_method) ... ok
test_Operator_string_representation (tests.core.test_operator.TestOperator.test_Operator_string_representation) ... ok
test_Pattern_contains_method (tests.core.test_pattern.TestPattern.test_Pattern_contains_method) ... ok
test_Pattern_general (tests.core.test_pattern.TestPattern.test_Pattern_general) ... ok
test_Pattern_is_contained_method (tests.core.test_pattern.TestPattern.test_Pattern_is_contained_method) ... ok
test_Pattern_is_refinement_method (tests.core.test_pattern.TestPattern.test_Pattern_is_refinement_method) ... ok
test_Selector_attributes (tests.core.test_selector.TestSelector.test_Selector_attributes) ..



##################################
########## CORE PACKAGE ##########
##################################


##############################################
########## QUALITY MEASURES PACKAGE ##########
##############################################


#############################################
########## DATA STRUCTURES PACKAGE ##########
#############################################


test_FPTreeForSDMap_generate_set_of_frequent_selectors_2 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap.test_FPTreeForSDMap_generate_set_of_frequent_selectors_2) ... ok
test_FPTreeForSDMap_generate_set_of_frequent_selectors_3 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap.test_FPTreeForSDMap_generate_set_of_frequent_selectors_3) ... ok
test_FPTreeNode_general (tests.data_structures.test_fp_tree_node.TestFPTreeNode.test_FPTreeNode_general) ... ok
test_subgroup_list_1 (tests.data_structures.test_subgroup_list.TestSubgroupList.test_subgroup_list_1) ... ok
test_subgroup_list_2 (tests.data_structures.test_subgroup_list.TestSubgroupList.test_subgroup_list_2) ... ok
test_subgroup_list_3 (tests.data_structures.test_subgroup_list.TestSubgroupList.test_subgroup_list_3) ... ok
test_subgroup_list_4 (tests.data_structures.test_subgroup_list.TestSubgroupList.test_subgroup_list_4) ... ok
test_subgroup_list_5 (tests.data_structures.test_subgroup_list.TestSubgroupLis



########################################
########## ALGORITHMS PACKAGE ##########
########################################


ok
test_BSD_cardinality (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_cardinality) ... ok
test_BSD_checkRel (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_checkRel) ... ok
test_BSD_checkRelevancies (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_checkRelevancies) ... ok
test_BSD_fit1 (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_fit1) ... ok
test_BSD_fit2 (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_fit2) ... ok
test_BSD_fit3 (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_fit3) ... ok
test_BSD_fit4 (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_fit4) ... ok
test_BSD_init_method (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_init_method) ... ok
test_BSD_logicalAnd (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_logicalAnd) ... ok
test_CBSD_checkRel (tests.algorithms.subgroup_sets.test_cbsd.TestCBSD.test_CBSD_checkRel) ... ok
test_CBSD_checkRelevancies (tests.algorithms.subgroup_sets.test

test_VLSD_fit_method_3 (tests.algorithms.subgroup_sets.test_vlsd.TestVLSD.test_VLSD_fit_method_3) ... ok
test_VLSD_fit_method_4 (tests.algorithms.subgroup_sets.test_vlsd.TestVLSD.test_VLSD_fit_method_4) ... ok
test_VLSD_fit_method_5 (tests.algorithms.subgroup_sets.test_vlsd.TestVLSD.test_VLSD_fit_method_5) ... ok
test_VLSD_fit_method_6 (tests.algorithms.subgroup_sets.test_vlsd.TestVLSD.test_VLSD_fit_method_6) ... ok
test_VLSD_fit_method_7 (tests.algorithms.subgroup_sets.test_vlsd.TestVLSD.test_VLSD_fit_method_7) ... ok
test_VLSD_init_method_1 (tests.algorithms.subgroup_sets.test_vlsd.TestVLSD.test_VLSD_init_method_1) ... ok
test_VLSD_init_method_2 (tests.algorithms.subgroup_sets.test_vlsd.TestVLSD.test_VLSD_init_method_2) ... ok

----------------------------------------------------------------------
Ran 91 tests in 2.379s

OK
test_dataframe_filters_general (tests.utils.test_dataframe_filters.TestDataFrameFilter.test_dataframe_filters_general) ... ok
test_to_input_format_for_subgroup_li



###################################
########## UTILS PACKAGE ##########
###################################


## Running the GMSL algorithm

To run the GMSL algorithm on a dataset, it is necessary to follow these steps:

- Load the dataset into a Pandas `DataFrame` object.
- Set the target, which must be a tuple of the form `(column_name, value)`.
- Execute an SD algorithm (for example, the VLSD algorithm) to mine the initial collection of candidates.
- Create the GMSL model with the desired parameters and run it.

The following is an example of running this algorithm on a dataset:

In [3]:
import pandas as pd
from subgroups import datasets

dataset = datasets.load_car_evaluation_csv()
target = ('class', 'acc')
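
Before mining, it can help to confirm that the target tuple actually matches the data. The following is a minimal sketch on a small hand-made `DataFrame` (not the car evaluation dataset); the helper `count_target_instances` is hypothetical and is not part of the `subgroups` library.

```python
import pandas as pd


def count_target_instances(df, target):
    """Count the rows that match and do not match a (column_name, value) target."""
    column_name, value = target
    if column_name not in df.columns:
        raise ValueError(f"Column {column_name!r} is not in the DataFrame.")
    positives = int((df[column_name] == value).sum())
    negatives = len(df) - positives
    return positives, negatives


# Toy data invented for illustration only.
toy = pd.DataFrame({
    "safety": ["low", "high", "med", "high"],
    "class": ["unacc", "acc", "unacc", "acc"],
})
print(count_target_instances(toy, ("class", "acc")))  # (2, 2)
```

Applied to `dataset` and `target` from the cell above, the same check should agree with the positive and negative counts that the GMSL output file reports.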

In [4]:
from subgroups.quality_measures import WRAcc
from subgroups.quality_measures import WRAccOptimisticEstimate1
from subgroups.algorithms import VLSD

# First, we execute the VLSD algorithm to mine the initial collection of candidates.
vlsd_model = VLSD(quality_measure = WRAcc(), q_minimum_threshold  = -1, optimistic_estimate = WRAccOptimisticEstimate1(), oe_minimum_threshold = -1, sort_criterion_in_s1 = VLSD.SORT_CRITERION_NO_ORDER, sort_criterion_in_other_sizes = VLSD.SORT_CRITERION_NO_ORDER, vertical_lists_implementation = VLSD.VERTICAL_LISTS_WITH_BITSETS, write_results_in_file = True, file_path = "./vlsd_result.txt")
vlsd_model.fit(dataset, target)
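
For reference, the WRAcc (Weighted Relative Accuracy) of a subgroup can be computed from four counts as WRAcc = (n / N) * (tp / n - TP / N), where n is the number of instances covered by the subgroup, tp the number of covered positive instances, N the dataset size and TP the total number of positive instances. Its value always lies in [-0.25, 0.25], so a minimum threshold of -1, as used above, should not prune any candidate on quality grounds. The sketch below is plain Python and is independent of the library's `WRAcc` class; the function name `wracc` is hypothetical.

```python
def wracc(tp, n, TP, N):
    """Weighted Relative Accuracy computed from raw counts.

    tp: positive instances covered by the subgroup
    n:  instances covered by the subgroup
    TP: positive instances in the dataset
    N:  instances in the dataset
    """
    if n == 0:
        # Convention chosen for this sketch: an empty subgroup has quality 0.
        return 0.0
    return (n / N) * (tp / n - TP / N)


# Counts for [persons = '2'] as reported in the results section of this document:
# 576 instances covered, 0 of them positive, in a dataset of 1728 instances
# with 384 positives.
print(round(wracc(tp=0, n=576, TP=384, N=1728), 4))  # -0.0741
```

The negative value shows why a threshold of 0 would already discard this subgroup, while -1 keeps it in the candidate collection.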

In [5]:
# The file generated by the VLSD algorithm is verbose and contains a lot of information.
# However, the input file of the GMSL algorithm needs to have a specific format.
# This means that we have to transform the VLSD output before passing it to GMSL.
from subgroups.utils.file_format_transformations import to_input_format_for_subgroup_list_algorithms

subgroups_correctly_read, subgroups_not_correctly_read = to_input_format_for_subgroup_list_algorithms("./vlsd_result.txt", "./vlsd_result_transformed.txt")
print("Subgroups correctly read: " + str(subgroups_correctly_read))
print("Subgroups not correctly read: " + str(subgroups_not_correctly_read))

Subgroups correctly read: 7999
Subgroups not correctly read: 0


In [6]:
from subgroups.algorithms import GMSL

# Now, we execute the GMSL algorithm to mine diverse top-k subgroup lists.
gmsl_model = GMSL(input_file_path = "./vlsd_result_transformed.txt", max_sl = 3, beta = 0.0, output_file_path = "gmsl_result.txt")
gmsl_model.fit(dataset, target)

## Results

Running the following cell, we get the subgroup lists obtained by the algorithm:

In [7]:
with open("./gmsl_result.txt", "r") as file:
    print(file.read())

Dataset information:
	- Number of instances: 1728.
	- Number of positive instances: 384.
	- Number of negative instances: 1344.
	- Total number of attributes (including the target): 7.


Reading input file.
Read subgroups: 7999.
Input file read.


## Subgroup list (4 subgroups) ##
s1: Description: [persons = '2'], Target: class = 'acc'
	Considering its position in the list:
	- positive instances covered: 0
	- negative instances covered: 576
	- total instances covered: 576
	Considering it individually:
	- positive instances covered: 0
	- negative instances covered: 576
	- total instances covered: 576
s2: Description: [safety = 'low'], Target: class = 'acc'
	Considering its position in the list:
	- positive instances covered: 0
	- negative instances covered: 384
	- total instances covered: 384
	Considering it individually:
	- positive instances covered: 0
	- negative instances covered: 576
	- total instances covered: 576
s3: Description: [safety = 'high'], Target: class = 'acc'
	Consider
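
The two blocks of counts reported for each subgroup differ because a subgroup list is ordered: an instance is assigned to the first subgroup whose description it satisfies, so later subgroups only cover what earlier ones left uncovered. The following sketch reproduces the total coverage figures of `s1` and `s2` on a toy `DataFrame` in which every combination of `persons` and `safety` appears 192 times (an assumption made here for illustration). The helper `coverage_in_list` is hypothetical and is not part of the `subgroups` library.

```python
import itertools

import pandas as pd

# Toy data: 3 x 3 attribute combinations, 192 rows each, 1728 rows in total.
rows = [
    {"persons": persons, "safety": safety}
    for persons, safety in itertools.product(["2", "4", "more"], ["low", "med", "high"])
    for _ in range(192)
]
toy = pd.DataFrame(rows)


def coverage_in_list(df, descriptions):
    """Return (individual, positional) coverage for each (column, value) description."""
    remaining = pd.Series(True, index=df.index)  # rows not yet covered by the list
    result = []
    for column, value in descriptions:
        individual = df[column] == value
        positional = individual & remaining
        result.append((int(individual.sum()), int(positional.sum())))
        remaining &= ~individual
    return result


print(coverage_in_list(toy, [("persons", "2"), ("safety", "low")]))
# [(576, 576), (576, 384)]
```

The second subgroup covers 576 rows on its own, but only 384 within the list, because 192 of its rows were already covered by `persons = '2'`.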

In [8]:
import os
os.remove("./vlsd_result.txt")
os.remove("./vlsd_result_transformed.txt")
os.remove("./gmsl_result.txt")

## References

<a id="1">[1]</a>
Lopez-Martinez-Carrasco, A., Proença, H.M., Juarez, J.M., van Leeuwen, M., Campos, M. (2023). Discovering Diverse Top-K Characteristic Lists. In: Advances in Intelligent Data Analysis XXI. IDA 2023. Lecture Notes in Computer Science, vol 13876. https://doi.org/10.1007/978-3-031-30047-9_21