# AA test tutorial
An AA test is an important part of a randomized controlled experiment such as an AB test.

The objectives of the AA test are to verify that the applied partitioning method produces uniform (homogeneous) samples, to select the best partition from the available ones, and to verify that the statistical criteria used to check uniformity are applicable.

For example, one such assumption is that the features are independent of each other. If it is violated, the AA test will fail.

See the [AA test wiki page](https://github.com/sb-ai-lab/HypEx/wiki/%D0%90%D0%90-Test) for a more detailed description of the terms used in AA testing.

<ul>
  <li><a href="#creation-of-a-new-test-dataset-with-synthetic-data">Creation of a new test dataset with synthetic data.</a></li>
  <li><a href="#one-split-of-aa-test">One split of AA test.</a></li>
  <li><a href="#aa-test">AA test.</a></li>
  <li><a href="#aa-test-with-stratification">AA test with stratification.</a></li>
</ul>

In [15]:
from hypex import AATest
from hypex.dataset import (
    ConstGroupRole,
    Dataset,
    InfoRole,
    StratificationRole,
    TargetRole,
    TreatmentRole,
)
from hypex.utils import create_test_data

## Creation of a new test dataset with synthetic data. 

In order to work with our data in HypEx, we first need to convert it into a `Dataset`. It is important to mark the data fields by assigning the appropriate `roles`:
- TargetRole: a role for columns that contain the target metrics. These are the columns compared between the control and test groups.
- TreatmentRole: a role for columns that show the treatment or intervention.
- InfoRole: a role for columns that contain information about the data, such as user IDs.
- StratificationRole: a role for categorical columns used to stratify the split (see the stratification section below).

Columns without an explicit role get a default role, as the `data.roles` output below shows.

In [16]:
data = Dataset(
    roles={
        "user_id": InfoRole(int),
        "pre_spends": TargetRole(),
        "post_spends": TargetRole(),
        "gender": StratificationRole(str),
    }, data=create_test_data(),
)
data

Unnamed: 0,user_id,signup_month,treat,pre_spends,post_spends,age,gender,industry
0,0.0,11.0,1.0,476.0,436.888889,28.0,F,E-commerce
1,1.0,1.0,1.0,519.5,525.222222,36.0,F,Logistics
2,2.0,0.0,0.0,498.5,414.333333,69.0,F,Logistics
3,3.0,10.0,1.0,473.0,445.888889,43.0,F,E-commerce
4,4.0,11.0,1.0,495.0,428.111111,56.0,F,E-commerce
...,...,...,...,...,...,...,...,...
9995,9995.0,0.0,0.0,475.0,408.111111,51.0,M,Logistics
9996,9996.0,0.0,0.0,472.5,414.666667,22.0,F,E-commerce
9997,9997.0,0.0,0.0,474.0,419.222222,63.0,M,E-commerce
9998,9998.0,4.0,1.0,481.0,519.888889,21.0,F,Logistics


In [17]:
data.roles

{'user_id': Info(<class 'int'>),
 'pre_spends': Target(<class 'float'>),
 'post_spends': Target(<class 'float'>),
 'gender': Stratification(<class 'str'>),
 'signup_month': Default(<class 'float'>),
 'treat': Default(<class 'float'>),
 'age': Default(<class 'float'>),
 'industry': Default(<class 'str'>)}

## AA test
Then we run the experiment on our prepared dataset, wrapped into `ExperimentData`. In this case we use one of the pre-assembled pipelines, `AATest`.
We can set the number of iterations for a simple run. In this case the random state of each iteration is its index (0 to 9 in the run below).

In [18]:
test = AATest(n_iterations=10)
result = test.execute(data)

100%|██████████| 10/10 [00:02<00:00,  3.48it/s]


In [19]:
result.resume

Unnamed: 0,feature,group,TTest aa test,KSTest aa test,TTest best split,KSTest best split,result,control mean,test mean,difference,difference %
0,pre_spends,test_1,OK,OK,OK,OK,OK,487.513637,487.369368,-0.144269,-0.029593
1,post_spends,test_1,OK,OK,OK,OK,OK,452.264227,452.160458,-0.103769,-0.022944


**Interpretation of AA test results**

Each row in the table corresponds to a target feature being tested for equality between the control and test groups. Two statistical tests are used:

- **TTest**: tests if means are statistically different.
- **KSTest**: tests if distributions differ.

The `OK` / `NOT OK` labels show whether the difference is statistically significant. A `NOT OK` result indicates a possible imbalance.

Typical threshold:
- If p-value < 0.05 → `NOT OK` (statistically significant difference)
- If p-value ≥ 0.05 → `OK` (no significant difference)

If a metric has a `NOT OK` status in one of the `aa test` columns (`TTest aa test`, `KSTest aa test`), at least one iteration showed a significant difference.
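
To make the `OK` / `NOT OK` logic concrete, here is a minimal sketch that does not use HypEx. It repeatedly splits a synthetic metric in half and labels each split by its p-value. The data, the helper names `two_sample_p_value` and `label`, and the normal approximation used in place of a t-test are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, fmean, variance


def two_sample_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def label(p_value, alpha=0.05):
    """Map a p-value to the labels used in the result tables."""
    return "NOT OK" if p_value < alpha else "OK"


# Synthetic metric; the mean and spread are arbitrary illustration values.
rng = random.Random(0)
values = [rng.gauss(487, 19) for _ in range(10_000)]

labels = []
for random_state in range(10):
    # One iteration: shuffle with its own random state, then cut in half.
    shuffled = values[:]
    random.Random(random_state).shuffle(shuffled)
    half = len(shuffled) // 2
    control, test = shuffled[:half], shuffled[half:]
    labels.append(label(two_sample_p_value(control, test)))

print(labels)
```

Because both groups come from the same data, most iterations should be labeled `OK`, but an occasional `NOT OK` is expected by chance at the 0.05 threshold.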


In [20]:
result.aa_score

Unnamed: 0,score,pass
pre_spends TTest test_1,0.95,True
post_spends TTest test_1,0.95,True
pre_spends KSTest test_1,0.95,True
post_spends KSTest test_1,0.95,True


**Interpreting `aa_score`**

This output shows p-values and the overall pass/fail status for each test type and feature. A high p-value (close to 1.0) means the test passed, that is, the groups are similar.

- `score`: p-value of the statistical test.
- `pass`: True if no iterations showed significant differences.

Note: even if the average p-value is high, `pass` might still be False if at least one of the iterations had a p-value < 0.05.
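
As a rough illustration of summarizing p-values across iterations, the sketch below takes the `pre_spends` TTest p-values from the `result.experiments` output of this run (shown further below) and computes the share of iterations at or above the 0.05 threshold, together with the smallest p-value. This is only one simple way to aggregate; it is not claimed to be the formula HypEx uses for `score`.

```python
# pre_spends TTest p-values, copied from the `result.experiments` output of the first run.
p_values = [
    0.169323, 0.604036, 0.171891, 0.698958, 0.053089,
    0.784536, 0.775070, 0.197756, 0.720752, 0.354097,
]

alpha = 0.05

# Share of iterations in which no significant difference was detected.
share_not_significant = sum(p >= alpha for p in p_values) / len(p_values)

# The smallest p-value shows how close the worst iteration came to the threshold.
print(share_not_significant, min(p_values))
```

Here every iteration stays above the threshold, although iteration 4 comes close with a p-value of about 0.053.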


In [21]:
result.best_split

Unnamed: 0,user_id,signup_month,treat,pre_spends,post_spends,age,gender,industry,split
0,0.0,11.0,1.0,476.0,436.888889,28.0,F,E-commerce,test_1
1,1.0,1.0,1.0,519.5,525.222222,36.0,F,Logistics,control
2,2.0,0.0,0.0,498.5,414.333333,69.0,F,Logistics,control
3,3.0,10.0,1.0,473.0,445.888889,43.0,F,E-commerce,control
4,4.0,11.0,1.0,495.0,428.111111,56.0,F,E-commerce,test_1
...,...,...,...,...,...,...,...,...,...
9995,9995.0,0.0,0.0,475.0,408.111111,51.0,M,Logistics,test_1
9996,9996.0,0.0,0.0,472.5,414.666667,22.0,F,E-commerce,control
9997,9997.0,0.0,0.0,474.0,419.222222,63.0,M,E-commerce,test_1
9998,9998.0,4.0,1.0,481.0,519.888889,21.0,F,Logistics,test_1


**About `best_split`**

This shows the best found split of the dataset, where control and test groups are as similar as possible in terms of target metrics.

You can use this split for future modeling or as a validation check before proceeding to actual experiments.


In [22]:
result.best_split_statistic

Unnamed: 0,feature,group,control mean,test mean,difference,difference %,TTest pass,TTest p-value,KSTest pass,KSTest p-value
0,pre_spends,test_1,487.51363737983456,487.3693683745583,-0.1442690052762714,-0.0295928142752366,OK,0.7207517943718674,OK,
1,post_spends,test_1,452.2642273393448,452.1604583824107,-0.1037689569340614,-0.0229443211868685,OK,0.9012098780057256,OK,


**Understanding `best_split_statistic`**

This table contains detailed statistics for the best (most balanced) split found across all iterations. You can compare:

- Mean values in control vs test group.
- Absolute and relative differences.
- p-values for both tests.

Ideally, all rows should have `OK` in both TTest and KSTest columns, and small difference values (<1%).
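
The `difference` and `difference %` columns can be reproduced from the two means. The sketch below does this for the `pre_spends` row above, assuming the difference is taken as test minus control and the percentage is relative to the control mean. That assumption is consistent with the numbers in the table.

```python
import math

# Means of the pre_spends row, copied from the table above.
control_mean = 487.51363737983456
test_mean = 487.3693683745583

# Absolute difference: test minus control.
difference = test_mean - control_mean

# Relative difference, in percent of the control mean.
difference_pct = difference / control_mean * 100

print(round(difference, 6), round(difference_pct, 6))
```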

In [23]:
result.experiments

Unnamed: 0,splitter_id,pre_spends GroupDifference control mean test_1,pre_spends GroupDifference test mean test_1,pre_spends GroupDifference difference test_1,pre_spends GroupDifference difference % test_1,post_spends GroupDifference control mean test_1,post_spends GroupDifference test mean test_1,post_spends GroupDifference difference test_1,post_spends GroupDifference difference % test_1,pre_spends TTest p-value test_1,...,post_spends TTest pass test_1,pre_spends KSTest p-value test_1,pre_spends KSTest pass test_1,post_spends KSTest p-value test_1,post_spends KSTest pass test_1,mean TTest p-value,mean TTest pass,mean KSTest p-value,mean KSTest pass,mean test score
0,AASplitter┴rs 0┴,487.163329,487.717994,0.554665,0.113856,452.817559,451.608239,-1.20932,-0.267066,0.169323,...,False,,False,,False,0.158645,0.0,0,0.0,0.052882
1,AASplitter┴rs 1┴,487.545424,487.336119,-0.209306,-0.042931,452.23207,452.19187,-0.040201,-0.008889,0.604036,...,False,,False,,False,0.78284,0.0,0,0.0,0.260947
2,AASplitter┴rs 2┴,487.165482,487.716826,0.551344,0.113174,452.108051,452.31607,0.208019,0.046011,0.171891,...,False,,False,,False,0.487685,0.0,0,0.0,0.162562
3,AASplitter┴rs 3┴,487.519111,487.36303,-0.156081,-0.032015,452.304988,452.119085,-0.185903,-0.041101,0.698958,...,False,,False,,False,0.761484,0.0,0,0.0,0.253828
4,AASplitter┴rs 4┴,487.053508,487.834079,0.780571,0.160264,452.248088,452.175456,-0.072632,-0.01606,0.053089,...,False,,False,,False,0.491926,0.0,0,0.0,0.163975
5,AASplitter┴rs 5┴,487.49612,487.385772,-0.110348,-0.022636,451.992067,452.432915,0.440848,0.097534,0.784536,...,False,,False,,False,0.691234,0.0,0,0.0,0.230411
6,AASplitter┴rs 6┴,487.383559,487.498886,0.115327,0.023663,452.334441,452.088929,-0.245513,-0.054277,0.77507,...,False,,False,,False,0.772029,0.0,0,0.0,0.257343
7,AASplitter┴rs 7┴,487.70203,487.182231,-0.5198,-0.106581,452.771481,451.657151,-1.11433,-0.246113,0.197756,...,False,,False,,False,0.190129,0.0,0,0.0,0.063376
8,AASplitter┴rs 8┴,487.513637,487.369368,-0.144269,-0.029593,452.264227,452.160458,-0.103769,-0.022944,0.720752,...,False,,False,,False,0.810981,0.0,0,0.0,0.270327
9,AASplitter┴rs 9┴,487.254543,487.628536,0.373993,0.076755,453.384506,451.033539,-2.350967,-0.518537,0.354097,...,True,,False,,False,0.179502,0.5,0,0.0,0.059834
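
In the outputs shown in this tutorial, the best split is the iteration with the highest `mean test score`. Here it is `rs 8`, whose group means match the `result.resume` table. The sketch below reproduces that selection on values copied from the table above. Whether HypEx applies exactly this rule internally is an assumption.

```python
# `mean test score` per iteration, copied from the `result.experiments` table above.
mean_test_score = {
    "rs 0": 0.052882,
    "rs 1": 0.260947,
    "rs 2": 0.162562,
    "rs 3": 0.253828,
    "rs 4": 0.163975,
    "rs 5": 0.230411,
    "rs 6": 0.257343,
    "rs 7": 0.063376,
    "rs 8": 0.270327,
    "rs 9": 0.059834,
}

# Pick the iteration with the highest score.
best = max(mean_test_score, key=mean_test_score.get)
print(best, mean_test_score[best])
```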


## AA Test with random states

We can also adjust some of the preset parameters of the experiment by passing them when creating it. For example, here we set the list of random states we want to run our AA test for.

In [24]:
test = AATest(random_states=[56, 72, 2, 43])
result = test.execute(data)

100%|██████████| 4/4 [00:01<00:00,  3.54it/s]


In [25]:
result.resume

Unnamed: 0,feature,group,TTest aa test,KSTest aa test,TTest best split,KSTest best split,result,control mean,test mean,difference,difference %
0,pre_spends,test_1,OK,OK,OK,OK,OK,487.65672,487.22314,-0.433579,-0.088911
1,post_spends,test_1,OK,OK,OK,OK,OK,452.223868,452.20006,-0.023808,-0.005265


In [26]:
result.aa_score

Unnamed: 0,score,pass
pre_spends TTest test_1,0.95,True
post_spends TTest test_1,0.95,True
pre_spends KSTest test_1,0.95,True
post_spends KSTest test_1,0.95,True


In [27]:
result.best_split

Unnamed: 0,user_id,signup_month,treat,pre_spends,post_spends,age,gender,industry,split
0,0.0,11.0,1.0,476.0,436.888889,28.0,F,E-commerce,control
1,1.0,1.0,1.0,519.5,525.222222,36.0,F,Logistics,control
2,2.0,0.0,0.0,498.5,414.333333,69.0,F,Logistics,test_1
3,3.0,10.0,1.0,473.0,445.888889,43.0,F,E-commerce,test_1
4,4.0,11.0,1.0,495.0,428.111111,56.0,F,E-commerce,control
...,...,...,...,...,...,...,...,...,...
9995,9995.0,0.0,0.0,475.0,408.111111,51.0,M,Logistics,control
9996,9996.0,0.0,0.0,472.5,414.666667,22.0,F,E-commerce,test_1
9997,9997.0,0.0,0.0,474.0,419.222222,63.0,M,E-commerce,control
9998,9998.0,4.0,1.0,481.0,519.888889,21.0,F,Logistics,control


In [28]:
result.best_split_statistic

Unnamed: 0,feature,group,control mean,test mean,difference,difference %,TTest pass,TTest p-value,KSTest pass,KSTest p-value
0,pre_spends,test_1,487.65671971706456,487.22314049586777,-0.4335792211967941,-0.0889107447239467,OK,0.282682875917564,OK,
1,post_spends,test_1,452.2238677669712,452.2000595636959,-0.0238082032752799,-0.0052646940978284,OK,0.9772788561318062,OK,


In [29]:
result.experiments

Unnamed: 0,splitter_id,pre_spends GroupDifference control mean test_1,pre_spends GroupDifference test mean test_1,pre_spends GroupDifference difference test_1,pre_spends GroupDifference difference % test_1,post_spends GroupDifference control mean test_1,post_spends GroupDifference test mean test_1,post_spends GroupDifference difference test_1,post_spends GroupDifference difference % test_1,pre_spends TTest p-value test_1,...,post_spends TTest pass test_1,pre_spends KSTest p-value test_1,pre_spends KSTest pass test_1,post_spends KSTest p-value test_1,post_spends KSTest pass test_1,mean TTest p-value,mean TTest pass,mean KSTest p-value,mean KSTest pass,mean test score
0,AASplitter┴rs 56┴,487.65672,487.22314,-0.433579,-0.088911,452.223868,452.20006,-0.023808,-0.005265,0.282683,...,False,,False,,False,0.629981,0.0,0,0.0,0.209994
1,AASplitter┴rs 72┴,487.717835,487.163365,-0.55447,-0.113687,452.997042,451.424389,-1.572654,-0.347166,0.169474,...,False,,False,,False,0.114689,0.0,0,0.0,0.03823
2,AASplitter┴rs 2┴,487.165482,487.716826,0.551344,0.113174,452.108051,452.31607,0.208019,0.046011,0.171891,...,False,,False,,False,0.487685,0.0,0,0.0,0.162562
3,AASplitter┴rs 43┴,487.321524,487.56068,0.239156,0.049076,452.398564,452.025364,-0.3732,-0.082494,0.553469,...,False,,False,,False,0.604371,0.0,0,0.0,0.201457


## AA Test with stratification

Depending on your requirements, it is possible to stratify the data. To run the test with stratification, set `stratification=True` in `AATest` and assign `StratificationRole` to the relevant columns in `Dataset`.

Stratified AA tests ensure that both groups (control/test) have the same proportions of categories (e.g. same % of genders or regions). This prevents imbalances in categorical features that can distort results.

Make sure to assign `StratificationRole` to relevant columns in your dataset before enabling stratification.

In [30]:
test = AATest(random_states=[56, 72, 2, 43], stratification=True)
result = test.execute(data)

100%|██████████| 4/4 [00:01<00:00,  2.25it/s]


In [31]:
result.resume

Unnamed: 0,feature,group,TTest aa test,KSTest aa test,TTest best split,KSTest best split,result,control mean,test mean,difference,difference %
0,pre_spends,test_1,NOT OK,OK,OK,OK,OK,487.458825,487.363377,-0.095448,-0.019581
1,post_spends,test_1,OK,OK,OK,OK,OK,452.886758,451.647464,-1.239294,-0.273643


In [32]:
result.aa_score

Unnamed: 0,score,pass
pre_spends TTest test_1,0.8,False
post_spends TTest test_1,0.95,True
pre_spends KSTest test_1,0.95,True
post_spends KSTest test_1,0.95,True


In [33]:
result.best_split

Unnamed: 0,user_id,signup_month,treat,pre_spends,post_spends,age,gender,industry,split
0,0.0,11.0,1.0,476.0,436.888889,28.0,F,E-commerce,test_1
1,1.0,1.0,1.0,519.5,525.222222,36.0,F,Logistics,test_1
2,2.0,0.0,0.0,498.5,414.333333,69.0,F,Logistics,control
3,3.0,10.0,1.0,473.0,445.888889,43.0,F,E-commerce,control
4,4.0,11.0,1.0,495.0,428.111111,56.0,F,E-commerce,control
...,...,...,...,...,...,...,...,...,...
9995,9995.0,0.0,0.0,475.0,408.111111,51.0,M,Logistics,
9996,9996.0,0.0,0.0,472.5,414.666667,22.0,F,E-commerce,
9997,9997.0,0.0,0.0,474.0,419.222222,63.0,M,E-commerce,
9998,9998.0,4.0,1.0,481.0,519.888889,21.0,F,Logistics,


In [34]:
result.best_split_statistic

Unnamed: 0,feature,group,control mean,test mean,difference,difference %,TTest pass,TTest p-value,KSTest pass,KSTest p-value
0,pre_spends,test_1,487.4588249754179,487.3633771386065,-0.0954478368113882,-0.0195806972652978,OK,0.8226591358784061,OK,
1,post_spends,test_1,452.8867584398558,451.6474639777392,-1.2392944621165611,-0.2736433421868578,OK,0.1602986902329776,OK,


In [35]:
result.experiments

Unnamed: 0,splitter_id,pre_spends GroupDifference control mean test_1,pre_spends GroupDifference test mean test_1,pre_spends GroupDifference difference test_1,pre_spends GroupDifference difference % test_1,post_spends GroupDifference control mean test_1,post_spends GroupDifference test mean test_1,post_spends GroupDifference difference test_1,post_spends GroupDifference difference % test_1,pre_spends TTest p-value test_1,...,post_spends TTest pass test_1,pre_spends KSTest p-value test_1,pre_spends KSTest pass test_1,post_spends KSTest p-value test_1,post_spends KSTest pass test_1,mean TTest p-value,mean TTest pass,mean KSTest p-value,mean KSTest pass,mean test score
0,AASplitterWithStratification┴rs 56┴,486.965432,487.857072,0.89164,0.183101,453.008916,451.530843,-1.478073,-0.326279,0.036259,...,False,,False,,False,0.065131,0.5,0,0.0,0.02171
1,AASplitterWithStratification┴rs 72┴,487.618177,487.20459,-0.413587,-0.084818,452.497077,452.042668,-0.45441,-0.100423,0.331448,...,False,,False,,False,0.469069,0.0,0,0.0,0.156356
2,AASplitterWithStratification┴rs 2┴,487.458825,487.363377,-0.095448,-0.019581,452.886758,451.647464,-1.239294,-0.273643,0.822659,...,False,,False,,False,0.491479,0.0,0,0.0,0.163826
3,AASplitterWithStratification┴rs 43┴,487.582697,487.239367,-0.34333,-0.070415,451.915866,452.624849,0.708983,0.156884,0.420121,...,False,,False,,False,0.420984,0.0,0,0.0,0.140328
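
The idea behind a stratified split can be sketched without HypEx: shuffle and divide each category separately, so that every category ends up with the same share in both groups. The function name `stratified_split` and the toy data are hypothetical, and HypEx's own splitter may work differently.

```python
import random
from collections import Counter, defaultdict


def stratified_split(categories, random_state=0):
    """Assign 'control'/'test_1' so that each category is split in half."""
    rng = random.Random(random_state)

    # Collect row indices per category.
    by_category = defaultdict(list)
    for index, category in enumerate(categories):
        by_category[category].append(index)

    # Shuffle inside each category and cut it in half.
    assignment = [None] * len(categories)
    for indices in by_category.values():
        rng.shuffle(indices)
        half = len(indices) // 2
        for i in indices[:half]:
            assignment[i] = "control"
        for i in indices[half:]:
            assignment[i] = "test_1"
    return assignment


# Toy data: 600 "F" rows and 400 "M" rows.
genders = ["F"] * 600 + ["M"] * 400
split = stratified_split(genders, random_state=56)

# Count rows per (group, gender) pair.
counts = Counter(zip(split, genders))
print(counts)
```

Both groups receive 300 `F` rows and 200 `M` rows, so the gender proportions are identical by construction.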


## AA Test by samples

Depending on your requirements and the size of the data, it is possible to run the AA test on a sample of the data. Set `sample_size` to do so (`sample_size=0.3` in the example below).

In [36]:
test = AATest(n_iterations=10, sample_size=0.3)
result = test.execute(data)

100%|██████████| 10/10 [00:03<00:00,  2.67it/s]
100%|██████████| 10/10 [00:03<00:00,  2.72it/s]


In [37]:
result.resume

Unnamed: 0,feature,group,TTest aa test,KSTest aa test,TTest best split,KSTest best split,result,control mean,test mean,difference,difference %
0,pre_spends,test_1,OK,OK,OK,OK,OK,486.886228,487.537769,0.651542,0.133818
1,post_spends,test_1,OK,OK,OK,OK,OK,452.077428,452.235486,0.158057,0.034962


In [38]:
result.aa_score

Unnamed: 0,score,pass
pre_spends TTest test_1,0.95,True
post_spends TTest test_1,0.95,True
pre_spends KSTest test_1,0.95,True
post_spends KSTest test_1,0.95,True


In [39]:
result.best_split

Unnamed: 0,user_id,signup_month,treat,pre_spends,post_spends,age,gender,industry,split
0,0.0,11.0,1.0,476.0,436.888889,28.0,F,E-commerce,control
1,1.0,1.0,1.0,519.5,525.222222,36.0,F,Logistics,control
2,2.0,0.0,0.0,498.5,414.333333,69.0,F,Logistics,test_1
3,3.0,10.0,1.0,473.0,445.888889,43.0,F,E-commerce,test_1
4,4.0,11.0,1.0,495.0,428.111111,56.0,F,E-commerce,test_1
...,...,...,...,...,...,...,...,...,...
9995,9995.0,0.0,0.0,475.0,408.111111,51.0,M,Logistics,test_1
9996,9996.0,0.0,0.0,472.5,414.666667,22.0,F,E-commerce,test_1
9997,9997.0,0.0,0.0,474.0,419.222222,63.0,M,E-commerce,test_1
9998,9998.0,4.0,1.0,481.0,519.888889,21.0,F,Logistics,test_1


In [40]:
result.best_split_statistic

Unnamed: 0,feature,group,control mean,test mean,difference,difference %,TTest pass,TTest p-value,KSTest pass,KSTest p-value
0,pre_spends,test_1,486.8862275449102,487.53776908023485,0.6515415353246681,0.1338180253341869,OK,0.2510028127854083,OK,
1,post_spends,test_1,452.0774284763806,452.235485975212,0.1580574988313969,0.0349624840514817,OK,0.8930513511282197,OK,


In [41]:
result.experiments

Unnamed: 0,splitter_id,pre_spends GroupDifference control mean test_1,pre_spends GroupDifference test mean test_1,pre_spends GroupDifference difference test_1,pre_spends GroupDifference difference % test_1,post_spends GroupDifference control mean test_1,post_spends GroupDifference test mean test_1,post_spends GroupDifference difference test_1,post_spends GroupDifference difference % test_1,pre_spends TTest p-value test_1,...,post_spends TTest pass test_1,pre_spends KSTest p-value test_1,pre_spends KSTest pass test_1,post_spends KSTest p-value test_1,post_spends KSTest pass test_1,mean TTest p-value,mean TTest pass,mean KSTest p-value,mean KSTest pass,mean test score
0,AASplitter┴rs 0┴,487.492604,487.431952,-0.060652,-0.012442,450.998685,452.42649,1.427805,0.316587,0.914487,...,False,,False,,False,0.568377,0.0,0,0.0,0.189459
1,AASplitter┴rs 1┴,487.911853,487.35922,-0.552633,-0.113265,453.169792,452.045528,-1.124264,-0.248089,0.330682,...,False,,False,,False,0.335019,0.0,0,0.0,0.111673
2,AASplitter┴rs 2┴,487.263529,487.47236,0.208832,0.042858,453.065398,452.061582,-1.003817,-0.221561,0.711837,...,False,,False,,False,0.551563,0.0,0,0.0,0.183854
3,AASplitter┴rs 3┴,486.886228,487.537769,0.651542,0.133818,452.077428,452.235486,0.158057,0.034962,0.251003,...,False,,False,,False,0.572027,0.0,0,0.0,0.190676
4,AASplitter┴rs 4┴,486.762325,487.561764,0.799439,0.164236,452.70959,452.123542,-0.586048,-0.129453,0.156058,...,False,,False,,False,0.385857,0.0,0,0.0,0.128619
5,AASplitter┴rs 5┴,487.136667,487.494772,0.358105,0.073512,451.589053,452.321948,0.732894,0.162292,0.526323,...,False,,False,,False,0.528788,0.0,0,0.0,0.176263
6,AASplitter┴rs 6┴,486.904303,487.535607,0.631304,0.129657,452.432245,452.173236,-0.259009,-0.057248,0.264264,...,False,,False,,False,0.544629,0.0,0,0.0,0.181543
7,AASplitter┴rs 7┴,487.354074,487.456411,0.102337,0.020998,453.439671,451.995411,-1.44426,-0.318512,0.856311,...,False,,False,,False,0.536788,0.0,0,0.0,0.178929
8,AASplitter┴rs 8┴,486.892884,487.536525,0.643641,0.132194,451.726758,452.296533,0.569775,0.126133,0.256943,...,False,,False,,False,0.442485,0.0,0,0.0,0.147495
9,AASplitter┴rs 9┴,488.48357,487.258875,-1.224695,-0.250714,454.938262,451.735593,-3.20267,-0.703979,0.030778,...,True,,False,,False,0.018582,1.0,0,0.0,0.006194


## AATest with Target Role for a categorical feature

It is possible to assign `TargetRole` to categorical features, since a categorical feature can also be the target or outcome variable. In this case, the Chi-square test is added to the `AATest` pipeline.

In [42]:
data = Dataset(
    roles={
        "user_id": InfoRole(int),
        "treat": TreatmentRole(int),
        "pre_spends": TargetRole(),
        "post_spends": TargetRole(),
        "gender": TargetRole(str)
    }, data=create_test_data(),
)
data

Unnamed: 0,user_id,signup_month,treat,pre_spends,post_spends,age,gender,industry
0,0.0,2.0,1.0,507.0,514.111111,59.0,M,E-commerce
1,1.0,8.0,1.0,501.5,450.777778,69.0,M,Logistics
2,2.0,0.0,0.0,496.0,424.222222,45.0,M,E-commerce
3,3.0,0.0,0.0,461.0,441.444444,51.0,M,E-commerce
4,4.0,0.0,0.0,489.0,410.444444,35.0,M,Logistics
...,...,...,...,...,...,...,...,...
9995,9995.0,7.0,1.0,477.5,467.000000,27.0,M,E-commerce
9996,9996.0,0.0,0.0,455.5,426.888889,47.0,M,E-commerce
9997,9997.0,6.0,1.0,473.0,482.444444,20.0,M,E-commerce
9998,9998.0,4.0,1.0,489.5,499.333333,60.0,F,Logistics


In [43]:
test = AATest(n_iterations=10)
result = test.execute(data)

100%|██████████| 10/10 [00:04<00:00,  2.26it/s]


In [44]:
result.resume

Unnamed: 0,feature,group,TTest aa test,KSTest aa test,Chi2Test aa test,TTest best split,KSTest best split,Chi2Test best split,result,control mean,test mean,difference,difference %
0,pre_spends,test_1,OK,OK,,OK,OK,,OK,487.356,487.460231,0.104231,0.021387
1,post_spends,test_1,OK,OK,,OK,OK,,OK,451.664938,452.415809,0.750871,0.166245
2,gender,test_1,,,OK,,,OK,OK,,,,


In [45]:
result.aa_score

Unnamed: 0,score,pass
pre_spends TTest test_1,0.95,True
post_spends TTest test_1,0.95,True
pre_spends KSTest test_1,0.95,True
post_spends KSTest test_1,0.95,True
gender Chi2Test test_1,0.95,True


In [46]:
result.best_split

Unnamed: 0,user_id,signup_month,treat,pre_spends,post_spends,age,gender,industry,split
0,0.0,2.0,1.0,507.0,514.111111,59.0,M,E-commerce,control
1,1.0,8.0,1.0,501.5,450.777778,69.0,M,Logistics,control
2,2.0,0.0,0.0,496.0,424.222222,45.0,M,E-commerce,test_1
3,3.0,0.0,0.0,461.0,441.444444,51.0,M,E-commerce,test_1
4,4.0,0.0,0.0,489.0,410.444444,35.0,M,Logistics,test_1
...,...,...,...,...,...,...,...,...,...
9995,9995.0,7.0,1.0,477.5,467.000000,27.0,M,E-commerce,test_1
9996,9996.0,0.0,0.0,455.5,426.888889,47.0,M,E-commerce,test_1
9997,9997.0,6.0,1.0,473.0,482.444444,20.0,M,E-commerce,test_1
9998,9998.0,4.0,1.0,489.5,499.333333,60.0,F,Logistics,test_1


In [47]:
result.best_split_statistic

Unnamed: 0,feature,group,control mean,test mean,difference,difference %,TTest pass,TTest p-value,KSTest pass,KSTest p-value,Chi2Test pass,Chi2Test p-value
0,pre_spends,test_1,487.356,487.4602310597645,0.1042310597645155,0.0213870476129418,OK,0.7960766529784996,OK,,,
1,post_spends,test_1,451.664938271605,452.4158088326051,0.7508705610000561,0.1662450408201676,OK,0.3683969500285644,OK,,,
2,gender,test_1,,,,,,,,,OK,1.0


In [48]:
result.experiments

Unnamed: 0,splitter_id,pre_spends GroupDifference control mean test_1,pre_spends GroupDifference test mean test_1,pre_spends GroupDifference difference test_1,pre_spends GroupDifference difference % test_1,post_spends GroupDifference control mean test_1,post_spends GroupDifference test mean test_1,post_spends GroupDifference difference test_1,post_spends GroupDifference difference % test_1,pre_spends TTest p-value test_1,...,post_spends KSTest pass test_1,gender Chi2Test p-value test_1,gender Chi2Test pass test_1,mean TTest p-value,mean TTest pass,mean KSTest p-value,mean KSTest pass,mean Chi2Test p-value,mean Chi2Test pass,mean test score
0,AASplitter┴rs 0┴,487.286493,487.529399,0.242906,0.049849,451.798052,452.28208,0.484028,0.107134,0.547002,...,False,0.272458,False,0.55452,0.0,0,0.0,0.272458,0.0,0.219887
1,AASplitter┴rs 1┴,487.528141,487.287433,-0.240708,-0.049373,452.549573,451.528421,-1.021151,-0.225644,0.550636,...,False,0.269199,False,0.385931,0.0,0,0.0,0.269199,0.0,0.184866
2,AASplitter┴rs 2┴,487.489227,487.326962,-0.162265,-0.033286,451.582062,452.499074,0.917012,0.203066,0.68745,...,False,0.759602,False,0.479715,0.0,0,0.0,0.759602,0.0,0.399784
3,AASplitter┴rs 3┴,487.356,487.460231,0.104231,0.021387,451.664938,452.415809,0.750871,0.166245,0.796077,...,False,1.0,False,0.582237,0.0,0,0.0,1.0,0.0,0.516447
4,AASplitter┴rs 4┴,487.586055,487.22768,-0.358375,-0.0735,451.678043,452.407896,0.729854,0.161587,0.374249,...,False,0.347871,False,0.378107,0.0,0,0.0,0.347871,0.0,0.21477
5,AASplitter┴rs 5┴,487.189579,487.627589,0.43801,0.089905,451.538162,452.544793,1.006631,0.222934,0.277469,...,False,0.486647,False,0.252667,0.0,0,0.0,0.486647,0.0,0.245192
6,AASplitter┴rs 6┴,487.50277,487.312946,-0.189824,-0.038938,451.985228,452.09591,0.110682,0.024488,0.637894,...,False,0.268135,False,0.766208,0.0,0,0.0,0.268135,0.0,0.260496
7,AASplitter┴rs 7┴,487.615016,487.202921,-0.412095,-0.084512,452.044177,452.036685,-0.007492,-0.001657,0.306896,...,False,0.552795,False,0.649868,0.0,0,0.0,0.552795,0.0,0.351092
8,AASplitter┴rs 8┴,486.93729,487.873233,0.935943,0.19221,451.540825,452.533937,0.993112,0.219938,0.020295,...,False,0.964712,False,0.127237,0.5,0,0.0,0.964712,0.0,0.411332
9,AASplitter┴rs 9┴,487.308289,487.508465,0.200176,0.041078,451.348084,452.736294,1.38821,0.30757,0.619674,...,False,0.360739,False,0.357989,0.0,0,0.0,0.360739,0.0,0.215893
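
For intuition about the Chi-square check on a categorical target, here is a minimal sketch of Pearson's chi-square test for a 2x2 table of group by category. The counts are hypothetical, the helper `chi2_2x2` is not part of HypEx, and no continuity correction is applied. Whether HypEx applies one is not stated here.

```python
import math


def chi2_2x2(table):
    """Pearson chi-square test for a 2x2 table (1 degree of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    # With 1 degree of freedom the survival function reduces to erfc.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows are groups (control, test_1), columns are gender (F, M).
balanced = [[2500, 2500], [2500, 2500]]
skewed = [[2700, 2300], [2300, 2700]]

print(chi2_2x2(balanced))
print(chi2_2x2(skewed))
```

A perfectly balanced table gives a statistic of 0 and a p-value of 1.0, while the skewed table gives a p-value far below 0.05.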


## AATest with unequal group sizes

AATest can produce a split with groups of different sizes by using the `control_size` argument. The Welch correction can also be applied by passing `t_test_equal_var=False` when creating the `AATest` instance.

In [49]:
test = AATest(n_iterations=10, control_size=0.3, t_test_equal_var=False)
result = test.execute(data)

100%|██████████| 10/10 [00:04<00:00,  2.25it/s]


In [50]:
result.best_split.data.groupby("split").agg("count")

Unnamed: 0_level_0,user_id,signup_month,treat,pre_spends,post_spends,age,gender,industry
split,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
control,2731,2731,2731,2731,2731,2731,2731,2731
test_1,6270,6270,6270,6270,6270,6270,6270,6270


In [51]:
result.best_split_statistic

Unnamed: 0,feature,group,control mean,test mean,difference,difference %,TTest pass,TTest p-value,KSTest pass,KSTest p-value,Chi2Test pass,Chi2Test p-value
0,pre_spends,test_1,487.5825704870011,487.3321371610845,-0.2504333259165605,-0.0513622391519086,OK,0.5639126299904302,OK,,,
1,post_spends,test_1,451.7230969526831,452.1786283891547,0.4555314364716309,0.1008430694699136,OK,0.6168334001746815,OK,,,
2,gender,test_1,,,,,,,,,OK,1.0
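
The Welch correction matters mostly when groups differ in both size and variance. The sketch below compares Welch's t statistic with the pooled-variance Student's t statistic on two small hypothetical samples. The helpers `welch_t` and `pooled_t` are written for illustration and are not HypEx functions.

```python
from statistics import fmean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (fmean(a) - fmean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def pooled_t(a, b):
    """Student's t statistic with pooled variance (equal-variance assumption)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (fmean(a) - fmean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5
    return t, na + nb - 2


# Hypothetical samples: a small noisy group and a larger stable group.
small_noisy = [12.0, 16.0, 8.0, 20.0, 4.0]
large_stable = [9.0, 10.0, 11.0, 10.0, 9.0, 11.0, 10.0, 10.0, 9.0, 11.0]

print(welch_t(small_noisy, large_stable))
print(pooled_t(small_noisy, large_stable))
```

In this example the pooled version reports a larger statistic with more degrees of freedom, because it lets the stable group mask the noise of the small one. Welch's version is more conservative here.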


## AAnTest

AAnTest is an extension of AATest that splits the dataset into several test groups in addition to the control group. The group sizes are passed through the `groups_sizes` argument.

In [52]:
test = AATest(groups_sizes=[0.3, 0.2, 0.2, 0.3])
result = test.execute(data)

100%|██████████| 10/10 [00:08<00:00,  1.22it/s]


In [53]:
result.best_split.data.groupby("split").agg("count")

Unnamed: 0_level_0,user_id,signup_month,treat,pre_spends,post_spends,age,gender,industry
split,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
control,2710,2710,2710,2710,2710,2710,2710,2710
test_1,1792,1792,1792,1792,1792,1792,1792,1792
test_2,1802,1802,1802,1802,1802,1802,1802,1802
test_3,2697,2697,2697,2697,2697,2697,2697,2697


In [54]:
result.best_split_statistic

Unnamed: 0,feature,group,control mean,test mean,difference,difference %,TTest pass,TTest p-value,KSTest pass,KSTest p-value,Chi2Test pass,Chi2Test p-value
0,pre_spends,test_1,487.51162361623614,487.4553571428572,-0.0562664733789688,-0.0115415655039363,OK,0.9237459969594596,OK,,,
1,pre_spends,test_2,487.51162361623614,487.2347391786904,-0.2768844375457888,-0.0567954535097947,OK,0.6346033057388143,OK,,,
2,pre_spends,test_3,487.51162361623614,487.38857990359656,-0.1230437126395713,-0.0252391341414304,OK,0.8113756749318759,OK,,,
3,post_spends,test_1,451.119147191472,452.28211805555554,1.1629708640836045,0.2577968306873845,OK,0.3294332426709395,OK,,,
4,post_spends,test_2,451.119147191472,453.4614009125663,2.34225372109438,0.5192095559846122,OK,0.0544601690785347,OK,,,
5,post_spends,test_3,451.119147191472,451.8560952498661,0.7369480583941481,0.1633599599977442,OK,0.4908841810671904,OK,,,
6,gender,test_1,,,,,,,,,OK,0.8717290989479588
7,gender,test_2,,,,,,,,,OK,0.8911496337110267
8,gender,test_3,,,,,,,,,OK,0.7053738742058828
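
The group counts above are close to, but not exactly, the requested shares of 0.3, 0.2, 0.2 and 0.3. One way to get this kind of result is to draw a group label for every row with the given probabilities, as in the hypothetical sketch below. The function `assign_groups` and the assumption that the first share belongs to `control` are illustrative, and HypEx's actual mechanism may differ.

```python
import random
from collections import Counter


def assign_groups(n_rows, groups_sizes, random_state=0):
    """Draw a group label for each row with probabilities from groups_sizes."""
    # Assumption: the first share is the control group, the rest are test groups.
    names = ["control"] + [f"test_{i}" for i in range(1, len(groups_sizes))]
    rng = random.Random(random_state)
    return rng.choices(names, weights=groups_sizes, k=n_rows)


target = [0.3, 0.2, 0.2, 0.3]
split = assign_groups(10_000, target)

# Observed share of each group.
shares = {name: count / len(split) for name, count in Counter(split).items()}
print(shares)
```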


## AATest with partially pre-defined groups

Certain users can be pre-assigned to either the test or the control group, so that they are not randomly assigned. This can be done using the `ConstGroupRole` role. In the column with this role, users pre-assigned to the control group should have the value `control`, and users pre-assigned to the test group should have the value `test`. Users that are not pre-assigned to either group should have `None`, so that they are assigned randomly.

In [59]:
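pd_data = create_test_data()
# Pre-assign users to a group according to their `treat` value.
pd_data.loc[pd_data["treat"] == 0, "const_grp"] = "control"
pd_data.loc[pd_data["treat"] == 1, "const_grp"] = "test"
# Keep the pre-assignment only for the first 2000 rows; the rest are assigned randomly.
pd_data.loc[2000:, "const_grp"] = None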

data = Dataset(
    roles={
        "user_id": InfoRole(int),
        "const_grp": ConstGroupRole(str),
        "pre_spends": TargetRole(),
        "post_spends": TargetRole(),
        "gender": StratificationRole(str),
        "industry": TargetRole(str),
    }, data=pd_data,
)
data

Unnamed: 0,user_id,signup_month,treat,pre_spends,post_spends,age,gender,industry,const_grp
0,0.0,0.0,0.0,498.0,405.111111,30.0,M,E-commerce,control
1,1.0,0.0,0.0,494.0,416.000000,68.0,F,Logistics,control
2,2.0,10.0,1.0,469.0,437.777778,25.0,F,Logistics,test
3,3.0,0.0,0.0,442.0,414.333333,68.0,F,Logistics,control
4,4.0,0.0,0.0,483.0,418.333333,35.0,M,E-commerce,control
...,...,...,...,...,...,...,...,...,...
9995,9995.0,6.0,1.0,479.0,505.888889,55.0,F,Logistics,
9996,9996.0,5.0,1.0,516.5,499.333333,56.0,F,Logistics,
9997,9997.0,3.0,1.0,489.5,526.000000,61.0,M,Logistics,
9998,9998.0,0.0,0.0,468.0,434.222222,52.0,M,E-commerce,


In [60]:
test = AATest(n_iterations=1)
result = test.execute(data)

100%|██████████| 1/1 [00:00<00:00,  1.41it/s]


In [61]:
result.resume

Unnamed: 0,feature,group,TTest aa test,KSTest aa test,Chi2Test aa test,TTest best split,KSTest best split,Chi2Test best split,result,control mean,test mean,difference,difference %
0,pre_spends,test_1,OK,OK,,OK,OK,,OK,486.906311,487.295794,0.389482,0.079991
1,post_spends,test_1,NOT OK,OK,,NOT OK,OK,,OK,445.172017,458.400095,13.228078,2.971453
2,industry,test_1,,,OK,,,OK,OK,,,,


In [62]:
result.best_split

Unnamed: 0,user_id,signup_month,treat,pre_spends,post_spends,age,gender,industry,const_grp,split
0,0.0,0.0,0.0,498.0,405.111111,30.0,M,E-commerce,control,control
1,1.0,0.0,0.0,494.0,416.000000,68.0,F,Logistics,control,control
2,2.0,10.0,1.0,469.0,437.777778,25.0,F,Logistics,test,test_1
3,3.0,0.0,0.0,442.0,414.333333,68.0,F,Logistics,control,control
4,4.0,0.0,0.0,483.0,418.333333,35.0,M,E-commerce,control,control
...,...,...,...,...,...,...,...,...,...,...
9995,9995.0,6.0,1.0,479.0,505.888889,55.0,F,Logistics,,control
9996,9996.0,5.0,1.0,516.5,499.333333,56.0,F,Logistics,,control
9997,9997.0,3.0,1.0,489.5,526.000000,61.0,M,Logistics,,test_1
9998,9998.0,0.0,0.0,468.0,434.222222,52.0,M,E-commerce,,control
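
The behaviour seen in the `split` column above can be sketched as follows: rows with a pre-assigned label keep it (`test` becomes `test_1`), and rows with `None` are assigned at random. The function `assign_with_constants` is a hypothetical illustration, not HypEx code.

```python
import random


def assign_with_constants(const_grp, random_state=0):
    """Keep pre-assigned labels; randomly assign rows whose label is None."""
    rng = random.Random(random_state)
    # Map the values of the constant-group column to split labels.
    mapping = {"control": "control", "test": "test_1"}
    return [
        mapping[label] if label is not None else rng.choice(["control", "test_1"])
        for label in const_grp
    ]


const_grp = ["control", "test", None, "control", None, None]
split = assign_with_constants(const_grp)
print(split)
```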


## Common issues and tips

- **Missing roles**: Make sure all target variables are assigned `TargetRole`. Columns without an explicit role get a default role (see the `data.roles` output above) and do not appear in the result tables.
- **Stratification**: If your dataset contains categorical features (e.g. `gender`, `region`) that may affect the outcome, use `StratificationRole` and enable `stratification=True` in `AATest(...)`.
- **Imbalanced categories**: If some categories have too few samples, stratified splits may become unstable. Consider filtering or merging rare categories.
- **Random fluctuations**: On small datasets, it's normal to see occasional `NOT OK` results. Use more iterations (e.g. `n_iterations=50`) for stability.
- **Missing values**: NaNs in stratification columns may be treated as separate categories. Clean or fill missing values before stratified AA tests.