# Example of using the DSLM algorithm

## About this document

The purpose of this document is to show an example of using the DSLM algorithm.

The following sections introduce the algorithm and explain how to install the `subgroups` library. They then describe, step by step, how to run the DSLM algorithm. Finally, they present the results, highlighting the information written to the output file.

## DSLM algorithm

DSLM (Diverse Subgroup Lists Miner) [[1]](#1) is an algorithm that generates diverse top-k subgroup lists. This algorithm combines the subgroup discovery (SD) technique and the Minimum Description Length (MDL) principle and uses the Subgroup List model.
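
To make the Subgroup List model more concrete, the following minimal sketch shows the usual reading of a subgroup list: an ordered list of subgroups in which each instance is assigned to the first subgroup whose description it satisfies, while the remaining instances fall under a default rule. The helper functions `covers` and `assign` and the toy instances are assumptions made for illustration; they are not part of the `subgroups` library and do not reproduce the DSLM implementation.

```python
# Toy illustration of the subgroup list idea (not the library's implementation).
# A subgroup description is modelled here as a dict: attribute -> required value.

def covers(description, instance):
    """Return True if the instance satisfies every condition of the description."""
    return all(instance.get(attr) == value for attr, value in description.items())

def assign(subgroup_list, instance):
    """Return the position of the first subgroup covering the instance,
    or None if the instance falls under the default rule."""
    for position, description in enumerate(subgroup_list):
        if covers(description, instance):
            return position
    return None

# Hypothetical subgroup list with two subgroups, in order.
subgroup_list = [
    {"safety": "high", "persons": "4"},
    {"safety": "high"},
]

# Hypothetical instances.
instances = [
    {"safety": "high", "persons": "4"},
    {"safety": "high", "persons": "2"},
    {"safety": "low", "persons": "4"},
]

assignments = [assign(subgroup_list, instance) for instance in instances]
print(assignments)  # [0, 1, None]
```

Because the list is ordered, the first instance is assigned to the first subgroup even though the second subgroup would also cover it.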

## Installing the `subgroups` library

To install the `subgroups` library, run the following cell:

In [None]:
!pip install subgroups

After that, to verify that the installation was successful, you can run the following cell:

In [None]:
import subgroups.tests as st
st.run_all_tests()
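
As a lighter complementary check, we could also ask Python which version of the package is installed. The sketch below relies only on the standard library module `importlib.metadata`; the helper name `installed_version` is an assumption introduced for this example, and the distribution name is assumed to be `subgroups`, as in the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None

# Prints the version string if the package is installed, otherwise None.
print(installed_version("subgroups"))
```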

## Running the DSLM algorithm

To run the DSLM algorithm on a dataset, it is necessary to follow these steps:

- Load the dataset into a Pandas `DataFrame` object.
- Set the target, which must be a tuple of the form `(column_name, value)`.
- Execute an SD algorithm (for example, the VLSD algorithm) to mine the initial collection of candidates.
- Create the DSLM model with the desired parameters and run it.
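
To illustrate what the target tuple expresses, here is a small sketch under the usual subgroup discovery reading: instances whose `column_name` equals `value` are the positive ones, and all the others are negative. Plain dictionaries stand in for the rows of a `DataFrame`, and the toy rows are assumptions made for this example; they are not taken from the dataset used in this document.

```python
# Hypothetical rows standing in for a DataFrame (illustration only).
rows = [
    {"safety": "high", "class": "acc"},
    {"safety": "low", "class": "unacc"},
    {"safety": "med", "class": "acc"},
    {"safety": "high", "class": "good"},
]

# The target is a tuple of the form (column_name, value).
target = ("class", "acc")
column_name, value = target

# Positive instances match the target value; the rest are negative.
positives = [row for row in rows if row[column_name] == value]
negatives = [row for row in rows if row[column_name] != value]

print(len(positives), len(negatives))  # 2 2
```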

The following is an example of running this algorithm on a dataset:

In [None]:
import pandas as pd
from subgroups import datasets

dataset = datasets.load_car_evaluation_csv()
target = ('class', 'acc')

In [None]:
from subgroups.quality_measures import WRAcc
from subgroups.quality_measures import WRAccOptimisticEstimate1
from subgroups.algorithms import VLSD

# First, we execute the VLSD algorithm to mine the initial collection of candidates.
vlsd_model = VLSD(quality_measure = WRAcc(),
                  q_minimum_threshold = -1,
                  optimistic_estimate = WRAccOptimisticEstimate1(),
                  oe_minimum_threshold = -1,
                  sort_criterion_in_s1 = VLSD.SORT_CRITERION_NO_ORDER,
                  sort_criterion_in_other_sizes = VLSD.SORT_CRITERION_NO_ORDER,
                  vertical_lists_implementation = VLSD.VERTICAL_LISTS_WITH_BITSETS,
                  write_results_in_file = True,
                  file_path = "./vlsd_result.txt")
vlsd_model.fit(dataset, target)
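
The threshold of -1 used above is easier to interpret with the standard definition of WRAcc in mind: the coverage of the subgroup multiplied by the difference between the target share inside the subgroup and the target share in the whole dataset. The sketch below is a plain Python version of that formula, not the library's `WRAcc` class; the function name `wracc` and the example counts are assumptions. It also checks by brute force, on small datasets, that the values stay within [-0.25, 0.25]. Assuming the threshold acts as a lower bound on quality, a value of -1 therefore discards no candidate.

```python
def wracc(tp, fp, TP, FP):
    """Weighted Relative Accuracy.

    tp, fp: positive and negative instances covered by the subgroup.
    TP, FP: positive and negative instances in the whole dataset.
    """
    n = TP + FP
    covered = tp + fp
    if covered == 0:
        return 0.0
    return (covered / n) * (tp / covered - TP / n)

# Hypothetical counts: 100 instances, 40 of them positive;
# the subgroup covers 25 instances, 20 of them positive.
example = wracc(tp=20, fp=5, TP=40, FP=60)
print(example)

# Brute-force check of the range on all small datasets with up to 10 positives and 10 negatives.
values = [
    wracc(tp, fp, TP, FP)
    for TP in range(11)
    for FP in range(11)
    if TP + FP > 0
    for tp in range(TP + 1)
    for fp in range(FP + 1)
]
print(min(values), max(values))
```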

In [None]:
# The result file generated by the VLSD algorithm is verbose and contains a lot of information.
# However, the input file of the DSLM algorithm needs to have a specific format,
# so the VLSD result file has to be transformed first.
from subgroups.utils.file_format_transformations import to_input_format_for_subgroup_list_algorithms

subgroups_correctly_read, subgroups_not_correctly_read = to_input_format_for_subgroup_list_algorithms("./vlsd_result.txt", "./vlsd_result_transformed.txt")
print("Subgroups correctly read: " + str(subgroups_correctly_read))
print("Subgroups not correctly read: " + str(subgroups_not_correctly_read))

In [None]:
from subgroups.algorithms import DSLM

# Now, we execute the DSLM algorithm to mine diverse top-k subgroup lists.
dslm_model = DSLM(input_file_path = "./vlsd_result_transformed.txt",
                  max_sl = 3,
                  sl_max_size = 10,
                  beta = 0.0,
                  maximum_positive_overlap = 0.06,
                  maximum_negative_overlap = 0.06,
                  output_file_path = "dslm_result.txt")
dslm_model.fit(dataset, target)

## Results

Running the following cell prints the subgroup lists obtained by the algorithm:

In [None]:
# The context manager closes the file even if reading fails.
with open("./dslm_result.txt", "r") as file:
    print(file.read())

In [None]:
import os

# Finally, we remove the files generated in this example.
os.remove("./vlsd_result.txt")
os.remove("./vlsd_result_transformed.txt")
os.remove("./dslm_result.txt")

## References

<a id="1">[1]</a>
Lopez-Martinez-Carrasco, A., Proença, H.M., Juarez, J.M., Leeuwen, M.v., Campos, M. (2023). Novel Approach for Phenotyping Based on Diverse Top-K Subgroup Lists. In: Artificial Intelligence in Medicine. AIME 2023. Lecture Notes in Computer Science, vol 13897. https://doi.org/10.1007/978-3-031-34344-5_6