# Example of using the QFinder algorithm

## About this document

The purpose of this document is to show an example of using the QFinder algorithm.

The following sections introduce the algorithm and give instructions for installing the `subgroups` library. They then describe how to run the QFinder algorithm, including the necessary steps. Finally, they present the results, covering both the information written to the output file and the statistics accessible through the model's properties.

## QFinder algorithm

QFinder [[1]](#1) is a subgroup discovery algorithm that aims to generate statistically credible subgroups. It combines an exhaustive search with a cascade of filters based on metrics that assess key credibility criteria: relative risk reduction, adjustment for confounding factors, each individual feature's contribution to the subgroup's effect, interaction tests for between-subgroup treatment effect interactions, and adjustment for multiple testing.

## Installing the `subgroups` library

To install the `subgroups` library, simply execute the following cell:

In [1]:
!pip install subgroups



To verify that the installation was successful, we may run the following cell:

In [2]:
import subgroups.tests as st
st.run_all_tests()

test_Operator_evaluate_method (tests.core.test_operator.TestOperator.test_Operator_evaluate_method) ... ok
test_Operator_evaluate_method_with_pandasSeries (tests.core.test_operator.TestOperator.test_Operator_evaluate_method_with_pandasSeries) ... ok
test_Operator_generate_from_str_method (tests.core.test_operator.TestOperator.test_Operator_generate_from_str_method) ... ok
test_Operator_string_representation (tests.core.test_operator.TestOperator.test_Operator_string_representation) ... ok
test_Pattern_contains_method (tests.core.test_pattern.TestPattern.test_Pattern_contains_method) ... ok
test_Pattern_general (tests.core.test_pattern.TestPattern.test_Pattern_general) ... ok
test_Pattern_is_contained_method (tests.core.test_pattern.TestPattern.test_Pattern_is_contained_method) ... ok
test_Pattern_is_refinement_method (tests.core.test_pattern.TestPattern.test_Pattern_is_refinement_method) ... ok
test_Selector_attributes (tests.core.test_selector.TestSelector.test_Selector_attributes) ..



##################################
########## CORE PACKAGE ##########
##################################


##############################################
########## QUALITY MEASURES PACKAGE ##########
##############################################


#############################################
########## DATA STRUCTURES PACKAGE ##########
#############################################


test_FPTreeForSDMap_generate_conditional_fp_tree_1 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap.test_FPTreeForSDMap_generate_conditional_fp_tree_1) ... ok
test_FPTreeForSDMap_generate_conditional_fp_tree_2 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap.test_FPTreeForSDMap_generate_conditional_fp_tree_2) ... ok
test_FPTreeForSDMap_generate_set_of_frequent_selectors_1 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap.test_FPTreeForSDMap_generate_set_of_frequent_selectors_1) ... ok
test_FPTreeForSDMap_generate_set_of_frequent_selectors_2 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap.test_FPTreeForSDMap_generate_set_of_frequent_selectors_2) ... ok
test_FPTreeForSDMap_generate_set_of_frequent_selectors_3 (tests.data_structures.test_fp_tree_for_sdmap.TestFPTreeForSDMap.test_FPTreeForSDMap_generate_set_of_frequent_selectors_3) ... ok
test_FPTreeNode_general (tests.data_structures.test_fp_tree_node.TestFPTreeNode.test_FPTr



########################################
########## ALGORITHMS PACKAGE ##########
########################################


ok
test_BSD_cardinality (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_cardinality) ... ok
test_BSD_checkRel (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_checkRel) ... ok
test_BSD_checkRelevancies (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_checkRelevancies) ... ok
test_BSD_fit1 (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_fit1) ... ok
test_BSD_fit2 (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_fit2) ... ok
test_BSD_fit3 (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_fit3) ... ok
test_BSD_fit4 (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_fit4) ... ok
test_BSD_init_method (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_init_method) ... ok
test_BSD_logicalAnd (tests.algorithms.subgroup_sets.test_bsd.TestBSD.test_BSD_logicalAnd) ... ok
test_CBSD_checkRel (tests.algorithms.subgroup_sets.test_cbsd.TestCBSD.test_CBSD_checkRel) ... ok
test_CBSD_checkRelevancies (tests.algorithms.subgroup_sets.test

test_VLSD_fit_method_3 (tests.algorithms.subgroup_sets.test_vlsd.TestVLSD.test_VLSD_fit_method_3) ... ok
test_VLSD_fit_method_4 (tests.algorithms.subgroup_sets.test_vlsd.TestVLSD.test_VLSD_fit_method_4) ... ok
test_VLSD_fit_method_5 (tests.algorithms.subgroup_sets.test_vlsd.TestVLSD.test_VLSD_fit_method_5) ... ok
test_VLSD_fit_method_6 (tests.algorithms.subgroup_sets.test_vlsd.TestVLSD.test_VLSD_fit_method_6) ... ok
test_VLSD_fit_method_7 (tests.algorithms.subgroup_sets.test_vlsd.TestVLSD.test_VLSD_fit_method_7) ... ok
test_VLSD_init_method_1 (tests.algorithms.subgroup_sets.test_vlsd.TestVLSD.test_VLSD_init_method_1) ... ok
test_VLSD_init_method_2 (tests.algorithms.subgroup_sets.test_vlsd.TestVLSD.test_VLSD_init_method_2) ... ok

----------------------------------------------------------------------
Ran 91 tests in 2.354s

OK
test_dataframe_filters_general (tests.utils.test_dataframe_filters.TestDataFrameFilter.test_dataframe_filters_general) ... ok
test_to_input_format_for_subgroup_li



###################################
########## UTILS PACKAGE ##########
###################################
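A lighter check than running the whole test suite is to ask Python which version of the package is installed. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and it assumes the distribution is named `subgroups`, as in the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the library is installed, otherwise None.
print(installed_version("subgroups"))
```

A result of `None` means the installation did not succeed in the current environment.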


## Running the QFinder algorithm

To run the QFinder algorithm on a dataset, it is necessary to follow these steps:

- Load the dataset into a Pandas `DataFrame` object.
- Set the target, which must be a tuple of the form (column_name, value).
- Create the QFinder model with the desired parameters (in the example below, the number of subgroups and the results file) and run it by calling `fit`.
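The first two steps can be sketched as follows. This is a minimal illustration that assumes the data arrives as CSV text; the `csv_text` content and the `check_target` helper are hypothetical and are not part of the `subgroups` library.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice a file path would be passed to read_csv.
csv_text = """bread,milk,diaper
yes,yes,no
yes,no,yes
no,yes,yes
"""


def check_target(df, target):
    """Return True if target is a (column_name, value) tuple that occurs in df."""
    if not (isinstance(target, tuple) and len(target) == 2):
        return False
    column, value = target
    return column in df.columns and bool((df[column] == value).any())


# Step 1: load the dataset into a pandas DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Step 2: define the target as a (column_name, value) tuple and sanity-check it.
target = ("diaper", "yes")
is_valid_target = check_target(df, target)
print(is_valid_target)
```

Checking the target up front catches misspelled column names or values before the model is created.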

The following is an example of running this algorithm on a small dataset:

In [3]:
from pandas import DataFrame
from subgroups.algorithms import QFinder

# Toy dataset with five categorical ('yes'/'no') columns and seven rows.
df = DataFrame({
    'bread':  ['yes', 'yes', 'no',  'yes', 'yes', 'yes', 'yes'],
    'milk':   ['yes', 'no',  'yes', 'yes', 'yes', 'yes', 'yes'],
    'beer':   ['no',  'yes', 'yes', 'yes', 'no',  'yes', 'no'],
    'coke':   ['no',  'no',  'yes', 'no',  'yes', 'no',  'yes'],
    'diaper': ['no',  'yes', 'yes', 'yes', 'yes', 'yes', 'yes'],
})
target = ("diaper", "yes")  # (column_name, value)

# Request 5 subgroups and write them to results.txt.
model = QFinder(num_subgroups=5, write_results_in_file=True, file_path='results.txt')
model.fit(df, target)
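To build intuition for what the algorithm evaluates, the coverage of a single-selector subgroup (the fraction of rows it matches) and the target counts inside it can be computed by hand with pandas. This sketch is independent of the `subgroups` library: the helper name `describe_selector` is hypothetical, and the library's own metric computations may differ.

```python
from pandas import DataFrame

# Same toy dataset as in the example above.
df = DataFrame({
    'bread':  ['yes', 'yes', 'no',  'yes', 'yes', 'yes', 'yes'],
    'milk':   ['yes', 'no',  'yes', 'yes', 'yes', 'yes', 'yes'],
    'beer':   ['no',  'yes', 'yes', 'yes', 'no',  'yes', 'no'],
    'coke':   ['no',  'no',  'yes', 'no',  'yes', 'no',  'yes'],
    'diaper': ['no',  'yes', 'yes', 'yes', 'yes', 'yes', 'yes'],
})
target = ("diaper", "yes")


def describe_selector(df, column, value, target):
    """Return (coverage, positives, negatives) for the selector column = value."""
    in_subgroup = df[column] == value
    is_positive = df[target[0]] == target[1]
    coverage = float(in_subgroup.mean())  # fraction of rows matched
    positives = int((in_subgroup & is_positive).sum())
    negatives = int((in_subgroup & ~is_positive).sum())
    return coverage, positives, negatives


coverage, positives, negatives = describe_selector(df, "bread", "yes", target)
print(coverage, positives, negatives)
```

For `bread = 'yes'` this gives a coverage of 6/7, with 5 rows matching the target and 1 row not matching it.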

## Results

Running the following cell prints the subgroups that the algorithm wrote to the results file:

In [4]:
with open("./results.txt", "r") as file:
    for current_line in file:
        print(current_line.strip())

Description: [bread = 'yes'], Target: diaper = 'yes' ; coverage: 0.8571428571428571 ; odds_ratio: 4.999999996293546 ; p_value: 0.14176982554891598 ; absolute_contribution: 1.0 ; contribution_ratio: 1.0 ; adjusted_p_value: 6.946721451896884 ;
Description: [milk = 'yes'], Target: diaper = 'yes' ; coverage: 0.8571428571428571 ; odds_ratio: 4.999999996293546 ; p_value: 0.14176982554891598 ; absolute_contribution: 1.0 ; contribution_ratio: 1.0 ; adjusted_p_value: 6.946721451896884 ;
Description: [coke = 'no'], Target: diaper = 'yes' ; coverage: 0.5714285714285714 ; odds_ratio: 2.9999999999910956 ; p_value: 0.34138767460093966 ; absolute_contribution: 1.0 ; contribution_ratio: 1.0 ; adjusted_p_value: 16.72799605544604 ;
Description: [beer = 'no'], Target: diaper = 'yes' ; coverage: 0.42857142857142855 ; odds_ratio: 1.9999999999995954 ; p_value: 0.5714261343156742 ; absolute_contribution: 1.0 ; contribution_ratio: 1.0 ; adjusted_p_value: 27.999880581468037 ;
Description: [beer = 'yes', bread 
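For further analysis it can be convenient to turn each line of the results file into a dictionary of numeric metrics. The sketch below assumes the line layout shown in the output above (fields separated by ` ; `, each metric written as `name: value`). The `parse_result_line` helper is hypothetical, not part of the library, and the file format may differ between versions.

```python
def parse_result_line(line):
    """Split one results line into the subgroup text and its numeric metrics."""
    # Drop the trailing separator, then split on the ' ; ' field delimiter.
    fields = [field.strip() for field in line.strip().rstrip(";").split(" ; ")]
    record = {"subgroup": fields[0]}
    for field in fields[1:]:
        name, _, value = field.partition(":")
        record[name.strip()] = float(value)
    return record


# First line of the output shown above.
sample = ("Description: [bread = 'yes'], Target: diaper = 'yes' ; "
          "coverage: 0.8571428571428571 ; odds_ratio: 4.999999996293546 ; "
          "p_value: 0.14176982554891598 ; absolute_contribution: 1.0 ; "
          "contribution_ratio: 1.0 ; adjusted_p_value: 6.946721451896884 ;")

record = parse_result_line(sample)
print(record["subgroup"])
print(record["coverage"], record["p_value"])
```

A list of such records can be passed directly to `pandas.DataFrame` to sort or filter the subgroups by any metric.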

We can also access different statistics about the result:

In [5]:
print("Selected subgroups: ", model.selected_subgroups) # Number of selected subgroups
print("Unselected subgroups: ", model.unselected_subgroups) # Number of unselected subgroups due to not meeting the minimum quality threshold
print("Visited nodes: ", model.visited_subgroups) # Number of nodes (subgroups) visited from the search space

Selected subgroups:  5
Unselected subgroups:  44
Visited nodes:  49
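The `adjusted_p_value` entries in the results file exceed 1, which can look surprising for a p-value. Numerically, the reported values equal `p_value` multiplied by the 49 visited nodes, which is what an uncapped Bonferroni-style correction would produce. The sketch below checks that arithmetic. It is an observation about the numbers shown in this document, not a description of the library's internals, and `bonferroni` is a hypothetical helper.

```python
# Values copied from the results file and the statistics shown above.
visited_nodes = 49
p_values = [0.14176982554891598, 0.34138767460093966, 0.5714261343156742]
reported = [6.946721451896884, 16.72799605544604, 27.999880581468037]


def bonferroni(p_value, num_tests, cap=False):
    """Multiply the p-value by the number of tests; optionally cap it at 1."""
    adjusted = p_value * num_tests
    return min(adjusted, 1.0) if cap else adjusted


adjusted = [bonferroni(p, visited_nodes) for p in p_values]
matches = [abs(a - r) < 1e-9 for a, r in zip(adjusted, reported)]
print(matches)
```

With `cap=True` the same helper returns the conventional capped value, so every adjusted p-value above would become 1.0.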


## References

<a id="1">[1]</a>
Esnault, C., Gadonna, M. L., Queyrel, M., Templier, A., & Zucker, J. D. (2020). Q-Finder: An Algorithm for Credible Subgroup Discovery in Clinical Data Analysis - An Application to the International Diabetes Management Practice Study. Frontiers in Artificial Intelligence, 3, 559927. https://doi.org/10.3389/frai.2020.559927