# AB test 

A/B testing is a research method for measuring the effect of a particular change in a product. The study shows which of two versions of the product or offer performs better on the selected metrics and whether the difference is statistically significant.

<ul>
  <li><a href="#creation-of-a-new-test-dataset-with-synthetic-data">Creation of a new test dataset with synthetic data.</a></li>
  <li><a href="#ab-test">AB test.</a></li>
  <li><a href="#additional-tests-in-ab-test">Additional tests in AB Test.</a></li>
  <li><a href="#abn-test">ABn Test.</a></li>
</ul>

In [1]:
import random

from hypex import ABTest
from hypex.dataset import Dataset, InfoRole, TargetRole, TreatmentRole
from hypex.utils import create_test_data

## Creation of a new test dataset with synthetic data. 

To work with our data in HypEx, we first need to convert it into a `Dataset`. It is important to mark the data fields by assigning the appropriate `roles`:
- FeatureRole: a role for columns that contain features or predictor variables. Our split will be based on them. Applied by default if the role is not specified for the column.
- TreatmentRole: a role for columns that show the treatment or intervention.
- TargetRole: a role for columns that show the target or outcome variable.
- InfoRole: a role for columns that contain information about the data, such as user IDs. 

In [26]:
df = create_test_data()
df["treat"] = [random.choice([0, 1, 2]) for _ in range(len(df))]
data = Dataset(
    roles={
        "user_id": InfoRole(int),
        "treat": TreatmentRole(),
        "pre_spends": TargetRole(),
        "post_spends": TargetRole(),
        "gender": TargetRole()
    }, data=df,
)
data

Unnamed: 0,user_id,signup_month,treat,pre_spends,post_spends,age,gender,industry
0,0.0,5.0,0,491.0,491.555556,64.0,F,E-commerce
1,1.0,4.0,0,493.0,515.222222,19.0,F,E-commerce
2,2.0,1.0,2,529.0,520.222222,62.0,M,Logistics
3,3.0,0.0,1,486.5,418.222222,67.0,M,E-commerce
4,4.0,0.0,1,467.0,418.333333,62.0,F,E-commerce
...,...,...,...,...,...,...,...,...
9995,9995.0,0.0,1,482.0,413.333333,35.0,M,Logistics
9996,9996.0,0.0,2,493.5,413.333333,41.0,M,Logistics
9997,9997.0,0.0,0,497.0,423.222222,51.0,M,Logistics
9998,9998.0,1.0,2,542.0,534.888889,67.0,M,E-commerce


data["treat"] = [random.choice([0, 1, 2]) for _ in range(len(data))]
data

The data types of the roles can be inferred automatically, as shown below. Fields that were not assigned a role receive the default feature role (shown as `Default` in the output).

In [27]:
data.roles

{'user_id': Info(<class 'int'>),
 'treat': Treatment(<class 'int'>),
 'pre_spends': Target(<class 'float'>),
 'post_spends': Target(<class 'float'>),
 'gender': Target(<class 'str'>),
 'signup_month': Default(<class 'float'>),
 'age': Default(<class 'float'>),
 'industry': Default(<class 'str'>)}

## AB test
Then we select one of the pre-assembled pipelines, in our case `ABTest`. A custom pipeline with custom executors can also be created for specific needs and requirements.
We then execute the test, passing the prepared `Dataset` directly as the argument.

In [28]:
test = ABTest()
result = test.execute(data)

Note: HypEx automatically assumes the smallest value in the `TreatmentRole` column as the control group (typically `0`), and compares each other group (e.g. `1`, `2`) against it. Ensure treatment labels are correctly assigned.
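The control-selection rule described in the note can be sketched in plain Python. This is only an illustration of the stated behaviour, not HypEx code; the helper name `plan_comparisons` is hypothetical.

```python
def plan_comparisons(treatment_labels):
    """Return the control label and the (control, test) pairs to compare."""
    groups = sorted(set(treatment_labels))
    if len(groups) < 2:
        raise ValueError("need at least two distinct treatment labels")
    control = groups[0]  # assumption from the note: smallest label is the control
    return control, [(control, group) for group in groups[1:]]


control, pairs = plan_comparisons([0, 1, 2, 1, 0, 2, 2])
print(control, pairs)
```

If your control group is not the smallest label, relabel the treatment column before building the dataset.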


### Experiment results
To show a summary report of the test, we access the `resume` attribute of the experiment output.

It displays the results of the test in the form of a table with the following columns:
- `feature`: name of the target feature whose change we want to analyze.
- `group`: name of the test group that is compared with the control group.
- `TTest pass`: whether the TTest result is statistically significant.
- `TTest p-value`: the probability of obtaining a result at least as extreme as the observed one when the null hypothesis is true. The lower the value, the stronger the evidence against the null hypothesis.
- `control mean`: the mean of the feature value across the control group.
- `test mean`: the mean of the feature value across the test group.
- `difference`: the mean of the test group minus the mean of the control group.
- `difference %`: the difference expressed as a percentage of the control mean.
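To make the columns concrete, here is a minimal sketch that computes them from two raw samples. It is not HypEx's implementation: the function name `summarize` is hypothetical, the data is synthetic, and the p-value uses a normal approximation to the t-test, which is only reasonable for large groups.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def summarize(control, test):
    """Build one report row from raw control and test samples."""
    c_mean, t_mean = mean(control), mean(test)
    diff = t_mean - c_mean
    # Standard error of the difference in means (unequal variances allowed).
    se = sqrt(variance(control) / len(control) + variance(test) / len(test))
    z = diff / se
    # Two-sided p-value; normal approximation instead of the t distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "control mean": c_mean,
        "test mean": t_mean,
        "difference": diff,
        "difference %": diff / c_mean * 100,
        "p-value (approx.)": p_value,
    }


rng = random.Random(0)
control = [rng.gauss(487, 20) for _ in range(500)]  # synthetic spends
test = [rng.gauss(487, 20) for _ in range(500)]
row = summarize(control, test)
print(row)
```

The same formula reproduces the first row of the report above: a difference of -0.551064 on a control mean of 487.106527 is about -0.113 %.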

In [29]:
result.resume

Unnamed: 0,feature,group,control mean,test mean,difference,difference %,TTest pass,TTest p-value
0,pre_spends,1,487.106527,486.555463,-0.551064,-0.11313,NOT OK,0.260714
1,pre_spends,2,487.106527,486.777283,-0.329244,-0.067592,NOT OK,0.506114
2,post_spends,1,452.378687,452.589616,0.210929,0.046627,NOT OK,0.837795
3,post_spends,2,452.378687,452.192109,-0.186578,-0.041244,NOT OK,0.856633


The `TTest pass` column shows whether the difference between groups is statistically significant at the 5% level. 
- `OK` means the difference is significant (p < 0.05).
- `NOT OK` means no significant difference was found.

However, significance does not imply practical importance. Always examine the `difference` and `difference %` columns to assess business relevance.
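The distinction between statistical and practical significance can be expressed as a small decision rule. The sketch below is illustrative only: `assess` and the `min_effect_pct` threshold are hypothetical and should be replaced by whatever effect size matters for your business.

```python
def assess(p_value, diff_pct, alpha=0.05, min_effect_pct=1.0):
    """Combine the significance flag with a minimum relevant effect size."""
    significant = p_value < alpha
    large_enough = abs(diff_pct) >= min_effect_pct
    return {
        "pass": "OK" if significant else "NOT OK",
        "practically relevant": significant and large_enough,
    }


# First row of the report above: not significant.
print(assess(p_value=0.260714, diff_pct=-0.11313))
# Hypothetical result: significant, but the effect is too small to matter.
print(assess(p_value=0.01, diff_pct=0.2))
```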


The `sizes` attribute shows statistics on the sizes of the groups.

The columns are:
- `control size`: the size of the control group.
- `test size`: the size of the test group.
- `control size %`: the share of the control group within the compared pair (control plus test group).
- `test size %`: the share of the test group within the compared pair (control plus test group).
- `group`: name of the test group.
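A quick arithmetic check helps when reading the percentage columns. The sketch below assumes that the shares are computed within each control/test pair and truncated to whole numbers; that assumption is inferred from the group 1 sizes in the output (3262 and 3362), not from the HypEx source.

```python
def size_shares(control_size, test_size):
    """Share of each group within the compared pair, in percent."""
    total = control_size + test_size
    return control_size / total * 100, test_size / total * 100


control_pct, test_pct = size_shares(3262, 3362)
print(round(control_pct, 2), round(test_pct, 2))
print(int(control_pct), int(test_pct))  # truncated to whole percent
```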

In [30]:
result.sizes

Unnamed: 0,control size,test size,control size %,test size %,group
1,3262,3362,49,50,1
2,3262,3376,49,50,2


In [31]:
result.multitest

Unnamed: 0,field,test,old p-value,new p-value,correction,rejected,group
0,pre_spends,TTest,0.260714,1.0,0.260714,False,1
1,post_spends,TTest,0.506114,1.0,0.506114,False,1
2,pre_spends,TTest,0.837795,1.0,0.837795,False,2
3,post_spends,TTest,0.856633,1.0,0.856633,False,2


### Multiple Testing Correction

When multiple metrics or test groups are analyzed, the chance of false positives increases. The `result.multitest` output shows corrected p-values using Holm's method (default) or Bonferroni if specified. The column `rejected` indicates whether the null hypothesis was rejected after correction.
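For intuition, the two corrections can be written in a few lines of plain Python. This is a textbook sketch, not the HypEx implementation; it takes the raw p-values and returns the adjusted ones.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment: the smallest p-value gets the largest factor."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


raw = [0.260714, 0.506114, 0.837795, 0.856633]  # p-values from the report
print(holm(raw))
print(bonferroni(raw))
```

With p-values this large, both methods push every adjusted value to 1.0, which matches the `new p-value` column.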

To change the correction method:
```python
test = ABTest(multitest_method="bonferroni")
```


## Additional tests in AB Test 

A u-test and a chi2-test can be added to the pipeline.

Use `u-test` for numeric variables that are skewed or non-normally distributed. It is a non-parametric alternative to the t-test.

Use `chi2-test` for categorical variables (e.g. gender or a converted/not converted flag). Note: the t-test is not appropriate for categorical outcomes.
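To show what a chi-square test does with a categorical variable, here is a minimal sketch for a 2x2 table (for example gender by group). The counts are hypothetical, no continuity correction is applied, and HypEx's own `chi2-test` may differ in such details.

```python
from math import erfc, sqrt


def chi2_test_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row_total in enumerate(row_totals):
        for j, col_total in enumerate(col_totals):
            expected = row_total * col_total / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Rows: control, test. Columns: F, M. Hypothetical counts.
stat, p_value = chi2_test_2x2(((10, 20), (20, 10)))
print(stat, p_value)
```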

In [32]:
test = ABTest(additional_tests=['t-test', 'u-test', 'chi2-test'])
result = test.execute(data)

The additional columns are:
- `UTest pass`: whether the UTest result is statistically significant.
- `UTest p-value`: the probability of obtaining a result at least as extreme as the observed one when the null hypothesis of the UTest is true. The lower the value, the stronger the evidence against the null hypothesis.
- `Chi2Test pass`: whether the Chi2Test result is statistically significant.
- `Chi2Test p-value`: the probability of obtaining a result at least as extreme as the observed one when the null hypothesis of the Chi2Test is true. The lower the value, the stronger the evidence against the null hypothesis.
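As a rough picture of what stands behind the UTest columns, the sketch below computes a Mann-Whitney U statistic with a normal-approximation p-value. It is a simplified textbook version (no tie or continuity correction) with hypothetical data, and is not necessarily how HypEx computes it.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return the U statistic for x and a two-sided approximate p-value."""
    combined = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # tied values share the average rank
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return u, 2 * (1 - NormalDist().cdf(abs(z)))


u, p_value = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(u, p_value)
```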

In [33]:
result.resume

Unnamed: 0,feature,group,control mean,test mean,difference,difference %,TTest pass,TTest p-value,UTest pass,UTest p-value,Chi2Test pass,Chi2Test p-value
0,pre_spends,1,487.106527,486.555463,-0.551064,-0.11313,NOT OK,0.260714,NOT OK,,,
1,pre_spends,2,487.106527,486.777283,-0.329244,-0.067592,NOT OK,0.506114,NOT OK,,,
2,post_spends,1,452.378687,452.589616,0.210929,0.046627,NOT OK,0.837795,NOT OK,,,
3,post_spends,2,452.378687,452.192109,-0.186578,-0.041244,NOT OK,0.856633,NOT OK,,,
4,gender,1,,,,,,,,,NOT OK,0.68188
5,gender,2,,,,,,,,,NOT OK,0.574357


In [34]:
result.multitest

Unnamed: 0,field,test,old p-value,new p-value,correction,rejected,group
0,pre_spends,TTest,0.260714,1.0,0.260714,False,1
1,post_spends,TTest,0.506114,1.0,0.506114,False,1
2,pre_spends,TTest,0.837795,1.0,0.837795,False,2
3,post_spends,TTest,0.856633,1.0,0.856633,False,2
4,pre_spends,UTest,,,,False,1
5,post_spends,UTest,,,,False,1
6,pre_spends,UTest,,,,False,2
7,post_spends,UTest,,,,False,2


In [35]:
result.sizes

Unnamed: 0,control size,test size,control size %,test size %,group
1,3262,3362,49,50,1
2,3262,3376,49,50,2


## ABn Test 

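Finally, with more than one test group (an ABn test), each test group is compared against the control group. Here we rerun the test with the Bonferroni correction for multiple comparisons.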

In [36]:
test = ABTest(multitest_method="bonferroni")
result = test.execute(data)

In [37]:
result.resume

Unnamed: 0,feature,group,control mean,test mean,difference,difference %,TTest pass,TTest p-value
0,pre_spends,1,487.106527,486.555463,-0.551064,-0.11313,NOT OK,0.260714
1,pre_spends,2,487.106527,486.777283,-0.329244,-0.067592,NOT OK,0.506114
2,post_spends,1,452.378687,452.589616,0.210929,0.046627,NOT OK,0.837795
3,post_spends,2,452.378687,452.192109,-0.186578,-0.041244,NOT OK,0.856633


In [38]:
result.sizes

Unnamed: 0,control size,test size,control size %,test size %,group
1,3262,3362,49,50,1
2,3262,3376,49,50,2


In [39]:
result.multitest

Unnamed: 0,field,test,old p-value,new p-value,correction,rejected,group
0,pre_spends,TTest,0.260714,1.0,0.260714,False,1
1,post_spends,TTest,0.506114,1.0,0.506114,False,1
2,pre_spends,TTest,0.837795,1.0,0.837795,False,2
3,post_spends,TTest,0.856633,1.0,0.856633,False,2


## Advanced Variance Reduction Techniques

For improved statistical power and more sensitive A/B tests, consider using covariate adjustment methods:

### CUPED and CUPAC
**CUPED** (Controlled-experiment Using Pre-Experiment Data) and **CUPAC** (Control Using Predictions As Covariates) are advanced techniques that use historical data to reduce variance in your metrics, allowing you to:

- Detect smaller effects with the same sample size
- Reduce the sample size needed to detect the same effect  
- Increase statistical power of your experiments

These methods work by adjusting your target metrics using correlated historical features that are unaffected by the treatment.
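The core CUPED adjustment fits in a few lines. The sketch below is a generic single-covariate version with hypothetical data and a hypothetical helper name `cuped_adjust`; it does not use the HypEx API, and in a real experiment the coefficient would typically be estimated on the pooled data of all groups.

```python
from statistics import mean, variance


def cuped_adjust(target, covariate):
    """Subtract the part of the target explained by a pre-experiment covariate."""
    x_mean, y_mean = mean(covariate), mean(target)
    cov_xy = sum(
        (x - x_mean) * (y - y_mean) for x, y in zip(covariate, target)
    ) / (len(target) - 1)
    theta = cov_xy / variance(covariate)  # regression slope of target on covariate
    return [y - theta * (x - x_mean) for x, y in zip(covariate, target)]


pre = [10.0, 12.0, 9.0, 15.0, 11.0, 14.0]    # pre-experiment metric
post = [11.0, 13.5, 9.0, 16.0, 12.5, 14.0]   # metric during the experiment
adjusted = cuped_adjust(post, pre)
print(variance(post), variance(adjusted))
```

The adjusted metric keeps the same mean but has a smaller variance, which is what makes the subsequent test more sensitive.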

**For a comprehensive guide on implementing these techniques, see the [CUPED & CUPAC Tutorial](СUPED&CUPAC.ipynb).**

Key benefits:
- **CUPED**: Simple single-covariate adjustment using linear regression
- **CUPAC**: Advanced multi-covariate adjustment with flexible model selection (linear, ridge, lasso, CatBoost).

## Common Pitfalls and Recommendations

- Always assign correct roles: use `TreatmentRole` for group labels, `TargetRole` for outcome metrics. Missing roles may cause incorrect test logic.
- For categorical targets, avoid using `t-test`. Instead, include `chi2-test` in `additional_tests`.
- HypEx does not automatically balance groups. Ensure group sizes are roughly equal and comparable.
- Check for missing values. NaNs may silently affect metric calculation.
- If testing many metrics/groups, interpret results only after multiple testing correction.
- Use `result.sizes` to confirm group balance, and consider A/A testing to verify setup before real A/B.
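The A/A check mentioned in the last item can be simulated without any library support. The sketch below repeatedly splits one synthetic sample into two random halves and counts how often a difference is flagged; with a sound setup the rate should be close to the significance level. All names and the data are hypothetical, and the p-value uses a normal approximation rather than HypEx's TTest.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def approx_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(b) - mean(a)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_false_positive_rate(values, n_runs=200, alpha=0.05, seed=0):
    """Share of random A/A splits in which a 'significant' difference appears."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        shuffled = values[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        if approx_p_value(shuffled[:half], shuffled[half:]) < alpha:
            hits += 1
    return hits / n_runs


data_rng = random.Random(42)
spends = [data_rng.gauss(480, 20) for _ in range(400)]  # synthetic metric
rate = aa_false_positive_rate(spends)
print(rate)
```

A rate far above `alpha` would suggest a problem with the split or with the test itself.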

