In [12]:
import sd4py

In [13]:
import numpy as np
import pandas as pd

## Generate some data

To showcase sd4py, we need some data in a pandas dataframe.

The code below generates a dataframe with a variety of data types: floats, unsigned integers, strings, timestamps, integers, timedeltas, and booleans.

In [14]:
df2 = pd.DataFrame(
    {
        "A": pd.Series(np.random.randn(100), dtype="float16"),
        "B": pd.Series(np.random.randn(100)),
        "C": pd.Series(np.array(np.random.randn(100), dtype="uint8")),
        "D": ["foo"]*50 + ["bar"]*50,
        "E": pd.date_range("2018-01-01", periods=100, freq="H"),
        "F": ["red", "yellow", "green", "blue","magenta"] * 20,
        "G": ([1]*5 + [10]*3 + [5,5]) * 10,
        "H": [pd.Timedelta(hours=x) for x in [1,3,2,6,5,2,5,9,1,6]*10],
        "I": [True, False] * 50
    }
)

In [15]:
df2.head()

Unnamed: 0,A,B,C,D,E,F,G,H,I
0,0.936035,1.806944,255,foo,2018-01-01 00:00:00,red,1,0 days 01:00:00,True
1,-0.120239,0.589709,0,foo,2018-01-01 01:00:00,yellow,1,0 days 03:00:00,False
2,-1.049805,-2.842112,0,foo,2018-01-01 02:00:00,green,1,0 days 02:00:00,True
3,-0.537109,0.162823,0,foo,2018-01-01 03:00:00,blue,1,0 days 06:00:00,False
4,0.168701,0.975684,0,foo,2018-01-01 04:00:00,magenta,1,0 days 05:00:00,True
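
Because `np.random.randn` is unseeded, columns A to C change on every run, which is why the values differ between outputs in this notebook. The following is a minimal sketch of a reproducible variant, assuming NumPy's `default_rng`. The seed value and the name `demo` are illustrative choices, not part of the original notebook, and only a subset of the columns is rebuilt.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, chosen only for repeatability

demo = pd.DataFrame(
    {
        "A": pd.Series(rng.standard_normal(100), dtype="float16"),
        "B": pd.Series(rng.standard_normal(100)),
        "D": ["foo"] * 50 + ["bar"] * 50,
        # One timestamp per hour, expressed as a Timedelta step
        "E": pd.date_range("2018-01-01", periods=100, freq=pd.Timedelta(hours=1)),
        "H": [pd.Timedelta(hours=x) for x in [1, 3, 2, 6, 5, 2, 5, 9, 1, 6] * 10],
        "I": [True, False] * 50,
    }
)

# Inspect the column types that a subgroup discovery run would have to handle
print(demo.dtypes)
```

Checking `dtypes` before running discovery is a cheap way to confirm that each column has the type you intended.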


## Discover subgroups

Given a pandas dataframe containing the data we are interested in, subgroup discovery needs only the dataframe and the name of the target column. Below is the minimal code to perform subgroup discovery, using column `A` as the target.

In [17]:
subgroups = sd4py.discover_subgroups(df2, "A")

SDMapNum
Time - took: 12 ms. (11 ms. CPU time)
Steps: 173


The results from the `discover_subgroups()` function are returned in a custom `PySubgroupResults` object, which has three important attributes: 

 * `subgroups`, a list of custom `PySubgroup` objects 
 * `population_value`, the average target value of the full dataset, and
 * `population_size`, the size of the full dataset. 

It also has a convenient `to_df()` function that converts the results into a dataframe for easy viewing.

A `PySubgroup` object contains the quality, subgroup size, average target value, and a list of custom selector objects. 

The selector objects come in two varieties. `PyNominalSelector` objects contain a column name and a corresponding value. `PyNumericSelector` objects contain a column name, a minimum value, and a maximum value.
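
To make the two selector varieties concrete, here is a rough sketch of how they could be modelled as boolean masks over a dataframe. The classes `NominalSelector` and `NumericSelector` and the helper `matches` are hypothetical stand-ins written for this illustration. They are not sd4py's `PyNominalSelector` or `PyNumericSelector`, and the half-open interval (lower bound inclusive, upper bound exclusive) is an assumption based on how patterns such as `-0.80 <= B < 0.83` are printed in this notebook.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class NominalSelector:  # hypothetical stand-in, not sd4py's PyNominalSelector
    column: str
    value: object

    def mask(self, df):
        return df[self.column] == self.value


@dataclass
class NumericSelector:  # hypothetical stand-in, not sd4py's PyNumericSelector
    column: str
    lower: float
    upper: float

    def mask(self, df):
        # Assumed convention: lower bound inclusive, upper bound exclusive
        col = df[self.column]
        return (col >= self.lower) & (col < self.upper)


def matches(df, selectors):
    """Return a boolean mask of rows that satisfy every selector (logical AND)."""
    mask = pd.Series(True, index=df.index)
    for selector in selectors:
        mask &= selector.mask(df)
    return mask


toy = pd.DataFrame({"D": ["foo", "foo", "bar", "bar"], "G": [1, 10, 1, 5]})
pattern = [NominalSelector("D", "foo"), NumericSelector("G", float("-inf"), 4.0)]
print(toy[matches(toy, pattern)])
```

A subgroup pattern is then simply the conjunction of its selectors, which is how the `AND` in the printed patterns can be read.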

Now, let's call the `PySubgroupResults.to_df()` function to take a look at the results.  

In [6]:
subgroups.to_df()

Unnamed: 0,pattern,target_evaluation,size,quality
0,G < 4.00 AND -0.80 <= B < 0.83 AND H < 0 days ...,0.529589,17.0,10.082837
1,G < 4.00 AND H < 0 days 03:40:00,0.248931,30.0,9.373483
2,-0.80 <= B < 0.83,0.106018,54.0,9.154955
3,G < 4.00,0.107092,50.0,8.530542
4,H < 0 days 03:40:00 AND -0.80 <= B < 0.83,0.274852,25.0,8.459271
5,G < 4.00 AND -0.80 <= B < 0.83,0.221383,29.0,8.262143
6,E < 2018-01-02 09:00:00 AND G < 4.00,0.383887,18.0,8.053307
7,E < 2018-01-02 09:00:00 AND G < 4.00 AND D = foo,0.383887,18.0,8.053307
8,E < 2018-01-02 09:00:00 AND G < 4.00 AND C < 8...,0.425888,16.0,7.830506
9,G < 4.00 AND H < 0 days 03:40:00 AND C < 85.00,0.246596,24.0,7.44275
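
The notebook does not say which quality function the default call uses. As a hedged sanity check, the numbers in the table above are consistent with a score of the form size × (subgroup mean − population mean): solving that formula for the population mean gives nearly the same value for every row. The sketch below does this arithmetic for the first four rows. The formula is an inference from the output, not something stated by the source.

```python
# (target_evaluation, size, quality) copied from rows 0 to 3 of the table above
rows = [
    (0.529589, 17, 10.082837),
    (0.248931, 30, 9.373483),
    (0.106018, 54, 9.154955),
    (0.107092, 50, 8.530542),
]

# Assumed formula: quality = size * (mean - population_mean).
# Rearranged: population_mean = mean - quality / size.
implied = [mean - quality / size for mean, size, quality in rows]

for value in implied:
    print(round(value, 4))
```

Each row yields a population mean of roughly -0.0635, which can be compared against `subgroups.population_value` when running the notebook.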


## Selecting rows from a dataframe based on a subgroup

After finding subgroups, it is possible to use them to select rows from a dataframe (i.e. select rows matching the pattern). 

This can be done on the same dataframe that was originally used to discover subgroups, or another dataframe with the same columns (which might be useful if, e.g., using a train/test split). 

In [7]:
results = sd4py.discover_subgroups(df2, "A")

# Select rows based on the first subgroup
print(results.subgroups[0])

results.subgroups[0].get_rows(df2)

SDMapNum
Time - took: 40 ms. (39 ms. CPU time)
Steps: 103
G < 4.00 AND -0.80 <= B < 0.83 AND H < 0 days 03:40:00


Unnamed: 0,A,B,C,D,E,F,G,H,I
0,1.37207,-0.369828,0,foo,2018-01-01 00:00:00,red,1,0 days 01:00:00,True
1,1.823242,-0.228837,0,foo,2018-01-01 01:00:00,yellow,1,0 days 03:00:00,False
2,0.317383,-0.115102,0,foo,2018-01-01 02:00:00,green,1,0 days 02:00:00,True
10,1.987305,-0.322194,1,foo,2018-01-01 10:00:00,red,1,0 days 01:00:00,True
21,0.439209,0.724878,0,foo,2018-01-01 21:00:00,yellow,1,0 days 03:00:00,False
40,-0.182373,-0.512337,1,foo,2018-01-02 16:00:00,red,1,0 days 01:00:00,True
51,1.693359,-0.545217,254,bar,2018-01-03 03:00:00,yellow,1,0 days 03:00:00,False
61,-0.803711,-0.492396,255,bar,2018-01-03 13:00:00,yellow,1,0 days 03:00:00,False
70,-0.411621,-0.350567,255,bar,2018-01-03 22:00:00,red,1,0 days 01:00:00,True
71,0.088989,-0.312703,1,bar,2018-01-03 23:00:00,yellow,1,0 days 03:00:00,False
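
The train/test use case mentioned above can be sketched without sd4py by applying the same row filter to two disjoint parts of a dataset and comparing the target mean in each. In the notebook this role is played by `get_rows()`. Here a hand-written filter for a pattern such as `G < 4.00` stands in for it, and the toy data, the seed, and the 70/30 split are all arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed for a repeatable toy dataset
data = pd.DataFrame(
    {
        "A": rng.standard_normal(200),
        "G": rng.choice([1, 5, 10], size=200),
    }
)

# Shuffle, then split by position; the 70/30 ratio is an arbitrary choice
shuffled = data.sample(frac=1.0, random_state=1)
train, test = shuffled.iloc[:140], shuffled.iloc[140:]


def g_below_4(df):
    """Hand-written equivalent of selecting rows for a pattern like 'G < 4.00'."""
    return df[df["G"] < 4.0]


for name, part in [("train", train), ("test", test)]:
    selected = g_below_4(part)
    # Subgroup size, subgroup target mean, and overall target mean for this part
    print(name, len(selected), round(selected["A"].mean(), 3), round(part["A"].mean(), 3))
```

If a subgroup's target mean stands out in the training part but not in the held-out part, the pattern may simply reflect noise.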


## Examples using more parameters

The examples below use more of the available parameters, such as the search method, the quality function, the number of bins for numeric attributes, and post-filtering.

In [8]:
sd4py.discover_subgroups(
    df2, 
    target="I",
    included_attributes=["B", "E", "H", "I", "A"],
    nbins=5,
    method="bsd",
    qf="gain",
    k=100,
    minqual=0,
    minsize=0,
    mintp=1,
    max_selectors=7,
    ignore_defaults=True,
    filter_irrelevant=True,
    postfilter="weighted_covering",
    postfilter_param=5
).to_df()

Unnamed: 0,pattern,target_evaluation,size,quality
0,H < 0 days 02:36:00,0.75,40.0,0.124511
1,0.51 <= B < 1.49 AND H < 0 days 02:36:00,0.833333,12.0,0.047251
2,H < 0 days 02:36:00 AND -0.48 <= B < 0.51,0.818182,11.0,0.038731
3,-0.53 <= A < 0.33 AND H < 0 days 02:36:00,0.769231,13.0,0.032752
4,0.33 <= A < 1.18 AND H < 0 days 02:36:00,0.833333,6.0,0.022227
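
The cut point `02:36:00` for column H in the table above is what equal-width binning with `nbins=5` would produce, since H ranges from 1 to 9 hours. The sketch below reproduces the bin edges and the first subgroup's size and share of `I = True` with plain pandas. That sd4py discretizes this way is an inference from the printed patterns, not something the notebook documents.

```python
import numpy as np
import pandas as pd

# Columns H (in hours) and I exactly as constructed in the data generation cell
hours = pd.Series([1, 3, 2, 6, 5, 2, 5, 9, 1, 6] * 10, dtype="float64")
is_true = pd.Series([True, False] * 50)

# Five equal-width bins need six edges between the minimum and maximum of H
edges = np.linspace(hours.min(), hours.max(), 6)
print([str(pd.Timedelta(minutes=round(e * 60))) for e in edges])

# Rows in the lowest bin, i.e. the pattern "H < 0 days 02:36:00"
first_bin = hours < edges[1]
print(int(first_bin.sum()), is_true[first_bin].mean())
```

The lowest bin contains 40 rows, 75% of which have `I = True`, matching the `size` and `target_evaluation` columns of row 0.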


In [9]:
sd4py.discover_subgroups(
    df2, 
    target="A",
    included_attributes=["B", "E", "H", "I", "A"],
    nbins=5,
    method="beam",
    qf="wracc",
    k=50,
    minqual=0,
    minsize=2,
    mintp=0,
    max_selectors=7,
    ignore_defaults=False,
    filter_irrelevant=False,
    postfilter="min_improve_global",
    postfilter_param=0.3
).to_df()

Unnamed: 0,pattern,target_evaluation,size,quality
0,-0.48 <= B < 0.51 AND E < 2018-01-01 19:48:00,0.639221,6.0,0.042164
1,0 days 02:36:00 <= H < 0 days 04:12:00 AND -0....,0.904205,4.0,0.038709
2,0 days 07:24:00 <= H,0.29554,10.0,0.035906
3,E < 2018-01-01 19:48:00 AND H < 0 days 02:36:00,0.329391,8.0,0.031433
4,0 days 02:36:00 <= H < 0 days 04:12:00,0.249872,10.0,0.031339
5,2018-01-04 07:12:00 <= E AND I = False AND -0....,0.611206,3.0,0.020242
6,0 days 07:24:00 <= H AND 2018-01-02 15:36:00 <...,0.947571,2.0,0.020222
7,0 days 02:36:00 <= H < 0 days 04:12:00 AND E <...,0.855865,2.0,0.018388
8,2018-01-04 07:12:00 <= E AND 0 days 02:36:00 <...,0.852295,2.0,0.018316


In [10]:
sd4py.discover_subgroups(
    df2, 
    target="I",
    included_attributes=["B", "E", "H", "I", "A"],
    nbins=5,
    method="bsd",
    qf="chi2",
    k=10,
    minqual=0,
    minsize=0,
    mintp=10
).to_df()

Unnamed: 0,pattern,target_evaluation,size,quality
0,0 days 04:12:00 <= H < 0 days 05:48:00,1.0,20.0,25.0
1,H < 0 days 02:36:00,0.75,40.0,16.666667
2,-0.53 <= A < 0.33 AND 0 days 04:12:00 <= H < 0...,1.0,11.0,12.359551
3,0.51 <= B < 1.49 AND H < 0 days 02:36:00,0.833333,12.0,6.060606
4,-0.53 <= A < 0.33 AND H < 0 days 02:36:00,0.769231,13.0,4.332449
5,-0.48 <= B < 0.51,0.551724,29.0,0.437105
6,-0.53 <= A < 0.33,0.525,40.0,0.166667
7,0.33 <= A < 1.18,0.47619,21.0,0.060277
8,2018-01-04 07:12:00 <= E,0.5,20.0,0.0
9,-1.39 <= A < -0.53,0.5,24.0,0.0
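
The `quality` values in the table above can be checked by hand with the standard Pearson chi-squared statistic for a 2×2 table (subgroup membership against the binary target). The helper `chi2_2x2` below is written for this illustration and is not part of sd4py. The counts come from the table and from the fact that column I is true for 50 of the 100 rows.

```python
def chi2_2x2(sub_pos, sub_size, pop_pos, pop_size):
    """Pearson chi-squared statistic for subgroup membership vs. a binary target."""
    a = sub_pos                      # in subgroup, target true
    b = sub_size - sub_pos           # in subgroup, target false
    c = pop_pos - sub_pos            # outside subgroup, target true
    d = (pop_size - sub_size) - c    # outside subgroup, target false
    numerator = pop_size * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Row 0: size 20, all 20 rows have I = True
print(chi2_2x2(20, 20, 50, 100))
# Row 1: size 40, 75% (30 rows) have I = True
print(chi2_2x2(30, 40, 50, 100))
```

The two results, 25.0 and about 16.67, agree with the first two rows of the table.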
