tea-tasting: statistical analysis of A/B tests


tea-tasting is a Python package for statistical analysis of A/B tests that features:

  • Student's t-test and Z-test out of the box.
  • Extensible API: Define and use statistical tests of your choice.
  • Delta method for ratio metrics.
  • Variance reduction with CUPED/CUPAC (also in combination with delta method for ratio metrics).
  • Confidence intervals for both absolute and percent change.
  • Sample ratio mismatch check.

tea-tasting calculates statistics within data backends such as BigQuery, ClickHouse, PostgreSQL, Snowflake, and Spark, as well as any other of the 20+ backends supported by Ibis. This approach eliminates the need to import granular data into a Python environment, though Pandas DataFrames are also supported.

tea-tasting is still in alpha, but it already includes all the features listed above. The following features are coming soon:

  • More statistical tests:
    • Bootstrap.
    • Quantile test (using Bootstrap).
    • Asymptotic and exact tests for frequency data.
    • Mann–Whitney U test.
  • Power analysis.
  • A/A tests and simulations.

Installation

pip install tea-tasting

Basic usage

Begin with this simple example to understand the basic functionality:

import tea_tasting as tt


data = tt.make_users_data(seed=42)

experiment = tt.Experiment(
    sessions_per_user=tt.Mean("sessions"),
    orders_per_session=tt.RatioOfMeans("orders", "sessions"),
    orders_per_user=tt.Mean("orders"),
    revenue_per_user=tt.Mean("revenue"),
)

result = experiment.analyze(data)
print(result)
#>             metric control treatment rel_effect_size rel_effect_size_ci pvalue
#>  sessions_per_user    2.00      1.98          -0.66%      [-3.7%, 2.5%]  0.674
#> orders_per_session   0.266     0.289            8.8%      [-0.89%, 19%] 0.0762
#>    orders_per_user   0.530     0.573            8.0%       [-2.0%, 19%]  0.118
#>   revenue_per_user    5.24      5.73            9.3%       [-2.4%, 22%]  0.123

In the following sections, each step of this process will be explained in detail.

Input data

The make_users_data function creates synthetic data for demonstration purposes. This data mimics what you might encounter in an A/B test for an online store. Each row represents an individual user, with the following columns:

  • user: The unique identifier for each user.
  • variant: The specific variant (e.g., 0 or 1) assigned to each user in the A/B test.
  • sessions: The total number of the user's sessions.
  • orders: The total number of the user's orders.
  • revenue: The total revenue generated by the user.
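
You can verify this structure by inspecting the column names. Both Pandas DataFrames and Ibis Tables expose a columns attribute, so this quick, illustrative check works either way:

import tea_tasting as tt


data = tt.make_users_data(seed=42)

# Expect: user, variant, sessions, orders, revenue.
print(data.columns)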

tea-tasting accepts data as either a Pandas DataFrame or an Ibis Table. Ibis is a Python package that serves as a DataFrame API to various data backends. It supports 20+ backends including BigQuery, ClickHouse, DuckDB, Polars, PostgreSQL, Snowflake, and Spark. You can write an SQL query, wrap it as an Ibis Table, and pass it to tea-tasting.
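
For illustration, here is a minimal sketch of wrapping in-memory data as an Ibis Table (assuming Ibis 4.0+, where ibis.memtable is available; a remote backend or an SQL query would be wrapped through an Ibis connection in a similar way):

import ibis
import pandas as pd


# Toy data with the expected columns (illustrative values only).
df = pd.DataFrame({
    "user": [1, 2, 3, 4],
    "variant": [0, 1, 0, 1],
    "sessions": [2, 3, 1, 4],
    "orders": [0, 1, 0, 2],
    "revenue": [0.0, 9.9, 0.0, 25.2],
})

# Wrap the DataFrame as an Ibis Table. The resulting table can be
# passed to Experiment.analyze in place of the DataFrame itself.
users = ibis.memtable(df)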

Many statistical tests, like Student's t-test or Z-test, don't need granular data for analysis. For such tests, tea-tasting will query aggregated statistics, like mean and variance, instead of downloading all the detailed data.

tea-tasting assumes that:

  • The data is grouped by randomization units, such as individual users.
  • There is a column indicating the variant of the A/B test (typically labeled as A, B, etc.).
  • All necessary columns for metric calculations (like the number of orders, revenue, etc.) are included in the table.

A/B test definition

The Experiment class defines the parameters of an A/B test: metrics and a variant column name. There are two ways to define metrics:

  • Using keyword parameters, with metric names as parameter names and metric definitions as parameter values, as in the example above.
  • Using the first argument, metrics, which accepts metrics in the form of a dictionary with metric names as keys and metric definitions as values.

By default, tea-tasting assumes that the A/B test variant is stored in a column named "variant". You can change this using the variant parameter of the Experiment class.

Example usage:

experiment = tt.Experiment(
    {
        "sessions per user": tt.Mean("sessions"),
        "orders per session": tt.RatioOfMeans("orders", "sessions"),
        "orders per user": tt.Mean("orders"),
        "revenue per user": tt.Mean("revenue"),
    },
    variant="variant",
)

Metrics

Metrics are instances of metric classes which define how metrics are calculated. These calculations include the effect size, the confidence interval, the p-value, and other statistics.

Use the Mean class to compare averages between variants of an A/B test: for example, the average number of orders per user, where a user is the randomization unit of the experiment. Specify the column containing the metric values using the first parameter, value.

Use the RatioOfMeans class to compare ratios of averages between variants of an A/B test. For example, average number of orders per average number of sessions. Specify the columns containing the numerator and denominator values using the parameters numer and denom.

Use the following parameters of Mean and RatioOfMeans to customize the analysis:

  • alternative: Alternative hypothesis. The following options are available:
    • "two-sided" (default): the means are unequal.
    • "greater": the mean in the treatment variant is greater than the mean in the control variant.
    • "less": the mean in the treatment variant is less than the mean in the control variant.
  • confidence_level: Confidence level of the confidence interval. Default is 0.95.
  • equal_var: Defines whether equal variance is assumed. If True, pooled variance is used for the calculation of the standard error of the difference between two means. Default is False.
  • use_t: Defines whether to use the Student's t-distribution (True) or the Normal distribution (False). Default is True.

Example usage:

experiment = tt.Experiment(
    sessions_per_user=tt.Mean("sessions", alternative="greater"),
    orders_per_session=tt.RatioOfMeans("orders", "sessions", confidence_level=0.9),
    orders_per_user=tt.Mean("orders", equal_var=True),
    revenue_per_user=tt.Mean("revenue", use_t=False),
)

You can change the default values of these four parameters using global settings (see details below).

Analyzing and retrieving experiment results

After defining an experiment and metrics, you can analyze the experiment data using the analyze method of the Experiment class. This method takes the data as input and returns an ExperimentResult object.

result = experiment.analyze(data)

By default, tea-tasting assumes that the variant with the lowest ID is the control. Change this behavior using the control parameter:

result = experiment.analyze(data, control=0)

ExperimentResult is a mapping. Get a metric's analysis result using the metric name as a key.

print(result["orders_per_user"])
#> MeanResult(control=0.5304003954522986, treatment=0.5730905412240769,
#> effect_size=0.04269014577177832, effect_size_ci_lower=-0.010800201598205564,
#> effect_size_ci_upper=0.0961804931417622, rel_effect_size=0.08048664016431273,
#> rel_effect_size_ci_lower=-0.019515294044062048,
#> rel_effect_size_ci_upper=0.19068800612788883, pvalue=0.11773177998716244,
#> statistic=1.5647028839586694)

The fields of the result depend on the metric. For Mean and RatioOfMeans, the fields include:

  • metric: Metric name.
  • control: Mean or ratio of means in the control variant.
  • treatment: Mean or ratio of means in the treatment variant.
  • effect_size: Absolute effect size. Difference between two means.
  • effect_size_ci_lower: Lower bound of the absolute effect size confidence interval.
  • effect_size_ci_upper: Upper bound of the absolute effect size confidence interval.
  • rel_effect_size: Relative effect size. Difference between two means, divided by the control mean.
  • rel_effect_size_ci_lower: Lower bound of the relative effect size confidence interval.
  • rel_effect_size_ci_upper: Upper bound of the relative effect size confidence interval.
  • pvalue: P-value.
  • statistic: Statistic (standardized effect size).
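
Judging by the repr above, individual statistics can also be read as attributes of a metric's result object. A small usage sketch (attribute access assumed from the repr shown):

orders_result = result["orders_per_user"]
print(orders_result.pvalue)
#> 0.11773177998716244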

ExperimentResult provides the following methods to serialize and view the experiment result:

  • to_dicts: Convert the result to a sequence of dictionaries.
  • to_pandas: Convert the result to a Pandas DataFrame.
  • to_pretty: Convert the result to a Pandas DataFrame with formatted values (as strings).
  • to_string: Convert the result to a string.
  • to_html: Convert the result to HTML.

print(result) is the same as print(result.to_string()).

print(result)
#>             metric control treatment rel_effect_size rel_effect_size_ci pvalue
#>  sessions_per_user    2.00      1.98          -0.66%      [-3.7%, 2.5%]  0.674
#> orders_per_session   0.266     0.289            8.8%      [-0.89%, 19%] 0.0762
#>    orders_per_user   0.530     0.573            8.0%       [-2.0%, 19%]  0.118
#>   revenue_per_user    5.24      5.73            9.3%       [-2.4%, 22%]  0.123
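
The other serialization methods work similarly. For example, to_pandas returns the raw, unformatted values, and to_dicts returns plain dictionaries (outputs omitted here):

results_df = result.to_pandas()    # Pandas DataFrame with raw values
results_dicts = result.to_dicts()  # a sequence of dictionaries, one per metric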

By default, the to_pretty, to_string, and to_html methods include a predefined list of attributes. This list can be customized:

print(result.to_string(names=(
    "control",
    "treatment",
    "effect_size",
    "effect_size_ci",
)))
#>             metric control treatment effect_size     effect_size_ci
#>  sessions_per_user    2.00      1.98     -0.0132  [-0.0750, 0.0485]
#> orders_per_session   0.266     0.289      0.0233 [-0.00246, 0.0491]
#>    orders_per_user   0.530     0.573      0.0427  [-0.0108, 0.0962]
#>   revenue_per_user    5.24      5.73       0.489     [-0.133, 1.11]

In Jupyter and IPython, evaluating result on its own line renders the output as an HTML table.

More features

Variance reduction with CUPED/CUPAC

tea-tasting supports variance reduction with CUPED/CUPAC in both the Mean and RatioOfMeans classes.

Example usage:

import tea_tasting as tt


data = tt.make_users_data(seed=42, covariates=True)

experiment = tt.Experiment(
    sessions_per_user=tt.Mean("sessions", "sessions_covariate"),
    orders_per_session=tt.RatioOfMeans(
        numer="orders",
        denom="sessions",
        numer_covariate="orders_covariate",
        denom_covariate="sessions_covariate",
    ),
    orders_per_user=tt.Mean("orders", "orders_covariate"),
    revenue_per_user=tt.Mean("revenue", "revenue_covariate"),
)

result = experiment.analyze(data)
print(result)
#>             metric control treatment rel_effect_size rel_effect_size_ci  pvalue
#>  sessions_per_user    2.00      1.98          -0.68%      [-3.2%, 1.9%]   0.603
#> orders_per_session   0.262     0.293             12%        [4.2%, 21%] 0.00229
#>    orders_per_user   0.523     0.581             11%        [2.9%, 20%] 0.00733
#>   revenue_per_user    5.12      5.85             14%        [3.8%, 26%] 0.00675

Set the covariates parameter of the make_users_data function to True to add the following columns with pre-experimental data:

  • sessions_covariate: Number of sessions before the experiment.
  • orders_covariate: Number of orders before the experiment.
  • revenue_covariate: Revenue before the experiment.

Define the metrics' covariates:

  • In Mean, specify the covariate using the covariate parameter.
  • In RatioOfMeans, specify the covariates for the numerator and denominator using the numer_covariate and denom_covariate parameters, respectively.

Sample ratio mismatch check

The SampleRatio class in tea-tasting detects mismatches in the sample ratios of different variants of an A/B test.

Example usage:

import tea_tasting as tt


experiment = tt.Experiment(
    sample_ratio=tt.SampleRatio(),
)

data = tt.make_users_data(seed=42)
result = experiment.analyze(data)
print(result.to_string(("control", "treatment", "pvalue")))
#>       metric control treatment pvalue
#> sample_ratio    2023      1977  0.477

By default, SampleRatio expects an equal number of observations across all variants. To specify a different ratio, use the ratio parameter. It accepts two types of values:

  • The ratio of the number of observations in treatment relative to control, as a positive number. Example: SampleRatio(0.5).
  • A dictionary with variants as keys and expected ratios as values. Example: SampleRatio({"A": 2, "B": 1}).
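
For instance, with a positive number (an illustrative split expecting half as many observations in treatment as in control):

experiment = tt.Experiment(
    sample_ratio=tt.SampleRatio(0.5),
)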

The method parameter determines the statistical test to apply:

  • "auto": Apply exact binomial test if the total number of observations is less than 1000, or normal approximation otherwise.
  • "binom": Apply exact binomial test.
  • "norm": Apply normal approximation of the binomial distribution.

The result of the sample ratio mismatch check includes the following attributes:

  • metric: Metric name.
  • control: Number of observations in control.
  • treatment: Number of observations in treatment.
  • pvalue: P-value.

Global settings

In tea-tasting, you can change defaults for the following parameters:

  • alternative: Alternative hypothesis.
  • confidence_level: Confidence level of the confidence interval.
  • equal_var: If False, assume unequal population variances when calculating the standard deviation and the number of degrees of freedom. Otherwise, assume equal population variances and calculate the pooled standard deviation.
  • use_t: If True, use Student's t-distribution in p-value and confidence interval calculations. Otherwise use Normal distribution.

Use get_config with the option name as a parameter to get a global option value:

import tea_tasting as tt


tt.get_config("equal_var")
#> False

Use get_config without parameters to get a dictionary of global options:

global_config = tt.get_config()

Use set_config to set a global option value:

tt.set_config(equal_var=True, use_t=False)

experiment = tt.Experiment(
    sessions_per_user=tt.Mean("sessions"),
    orders_per_session=tt.RatioOfMeans("orders", "sessions"),
    orders_per_user=tt.Mean("orders"),
    revenue_per_user=tt.Mean("revenue"),
)

experiment.metrics["orders_per_user"]
#> Mean(value='orders', covariate=None, alternative='two-sided',
#> confidence_level=0.95, equal_var=True, use_t=False)

Use config_context to temporarily set a global option value within a context:

with tt.config_context(equal_var=True, use_t=False):
    experiment = tt.Experiment(
        sessions_per_user=tt.Mean("sessions"),
        orders_per_session=tt.RatioOfMeans("orders", "sessions"),
        orders_per_user=tt.Mean("orders"),
        revenue_per_user=tt.Mean("revenue"),
    )

experiment.metrics["orders_per_user"]
#> Mean(value='orders', covariate=None, alternative='two-sided',
#> confidence_level=0.95, equal_var=True, use_t=False)

More than two variants

In tea-tasting, it's possible to analyze experiments with more than two variants. However, the variants are compared in pairs using two-sample statistical tests.

How variant pairs are determined:

  • Default control variant: When the control parameter of the analyze method is set to None, tea-tasting compares each pair of variants, treating the variant with the lowest ID in each pair as the control.
  • Specified control variant: If a specific variant is set as the control, it is compared against each of the other variants.

The result of the analysis is a dictionary of ExperimentResult objects with tuples (control, treatment) as keys.
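
A hypothetical sketch, assuming data with variants 0, 1, and 2:

# Compare the specified control (0) against each of the other variants.
results = experiment.analyze(data, control=0)

# Access each pairwise comparison by its (control, treatment) tuple.
result_0_vs_1 = results[(0, 1)]
result_0_vs_2 = results[(0, 2)]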

Keep in mind that tea-tasting does not adjust for multiple comparisons. When dealing with multiple variant pairs, additional steps may be necessary to account for this, depending on your analysis needs.

Package name

The package name "tea-tasting" is a play of words which refers to two subjects:

  • The lady tasting tea is a famous experiment devised by Ronald Fisher. In this experiment, Fisher developed the null hypothesis significance testing framework to analyze a lady's claim that she could discern whether the tea or the milk was added to a cup first.
  • "tea-tasting" phonetically resembles "t-testing", as in Student's t-test, a statistical test developed by William Gosset.