# `spearmint` Basics

## Observations data
Spearmint takes as input a [pandas](https://pandas.pydata.org/) `DataFrame` containing experiment observations data. Each record represents an observation/trial recorded in the experiment and has the following columns:

- **One or more `treatment` columns**: each treatment column contains two or more distinct, discrete values that are used to identify the different groups in the experiment
- **One or more `metric` columns**: these are the values associated with each observation, and are used as the metric to compare groups in the experiment.
- **Zero or more `attributes` columns**: these define additional discrete properties assigned to the observations. These attributes can be used to perform additional segmentation across groups.

To demonstrate, let's generate some fake experiment observations data. The `metric` column--also named `"metric"`--is a series of binary outcomes (i.e. `True`/`False`). This binary `metric` is analogous to *conversion* or *success* in AB testing.


---
> 💡 These fake observations are simulated from three different [Bernoulli distributions](https://en.wikipedia.org/wiki/Bernoulli_distribution), one for each of the three `treatment`s (`"A"`, `"B"`, or `"C"`), with each successive distribution having a higher average probability of *conversion*.

---

In [1]:
import numpy as np
from spearmint.utils import generate_fake_observations
from spearmint import Experiment, HypothesisTest

experiment_observations = generate_fake_observations(
    distribution="bernoulli",
    n_treatments=3,
    n_attributes=4,
    n_observations=120,
    random_seed=123
)
experiment_observations.head()

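| | id | treatment | metric | attr_0 | attr_1 | attr_2 | attr_3 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | C | True | A0a | A1b | A2a | A3a |
| 1 | 1 | B | True | A0a | A1b | A2a | A3b |
| 2 | 2 | C | True | A0a | A1a | A2a | A3b |
| 3 | 3 | C | True | A0a | A1a | A2a | A3b |
| 4 | 4 | A | True | A0a | A1b | A2a | A3a |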


The fake data's `treatment` column is named `"treatment"`, and the dataset also contains four `attribute` columns, named `"attr_*"`, that can potentially be used for segmentation.
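
For intuition, here is a minimal standard-library sketch of how observations like these could be simulated. It is not the implementation of `generate_fake_observations`; the function name `simulate_bernoulli_observations` and the conversion probabilities are assumptions chosen only for illustration.

```python
import random


def simulate_bernoulli_observations(n_observations=120, seed=123):
    """Simulate binary outcomes for three treatments (illustrative only)."""
    rng = random.Random(seed)
    # Hypothetical conversion probabilities that increase from "A" to "C"
    conversion_probs = {"A": 0.2, "B": 0.4, "C": 0.6}
    rows = []
    for obs_id in range(n_observations):
        treatment = rng.choice(sorted(conversion_probs))
        # A Bernoulli draw: True with probability conversion_probs[treatment]
        converted = rng.random() < conversion_probs[treatment]
        rows.append({"id": obs_id, "treatment": treatment, "metric": converted})
    return rows


observations = simulate_bernoulli_observations()
for group in "ABC":
    outcomes = [row["metric"] for row in observations if row["treatment"] == group]
    print(group, len(outcomes), round(sum(outcomes) / len(outcomes), 3))
```

Each row mirrors the `id`, `treatment`, and `metric` columns shown above; attribute columns are omitted to keep the sketch short.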

## Running an AB test in `spearmint` is as easy as 1-2-3:

The three key components of running an AB test in `spearmint` are:

1. Initialize an **`Experiment`**, which holds the raw observations and any metadata associated with an AB experiment.
2. Define the **`HypothesisTest`**, which declares the configuration of the statistical inference procedure.
3. Run the `HypothesisTest` against the `Experiment` and interpret the resulting **`InferenceResults`**. `InferenceResults` hold the parameter estimates of the inference procedure and are used to summarize, visualize, and save the results of the hypothesis test.

## Example Workflow

We'll demonstrate the basic workflow with an example.

### 1. Initialize an `Experiment`

In [2]:
experiment = Experiment(data=experiment_observations)

### 2. Initialize the `HypothesisTest`

Spearmint allows the scientist to configure many aspects of the hypothesis test, including:
- the specific `metric` used -- we can even use `CustomMetrics` that are derived from multiple columns of the dataset (see below)
- `control` and `variation` groups. The `control` group can be thought of as the baseline or NULL hypothesis group.
- the specific `hypothesis`
- `variable_type` -- this can be explicitly configured, otherwise `spearmint` will attempt to infer it from the distribution of the `metric` values
- `inference_method`, and any specific configuration for the `inference_method` (particularly helpful when specifying priors in Bayesian hypothesis tests).

In [3]:
ab_test = HypothesisTest(
    metric="metric",
    treatment="treatment",
    control="A", variation="B",
    hypothesis="unequal",
    variable_type="binary",
    inference_method="frequentist"
)

#### Specifying the `hypothesis`
| `hypothesis`  | Hypothesis Interpretation | Hypothesis Type | 
|---|---|---|
| `"larger"` (default) | "The treatment is larger than the control" | one-tailed |
| `"smaller"` | "The treatment is smaller than the control" | one-tailed |
| `"unequal"` | "The treatment is not equal to the control" | two-tailed |

In this example, we specify the `"unequal"` hypothesis, which tests for any statistically significant difference between groups `"A"` and `"B"`. `"B"` could therefore be either smaller or larger than `"A"`, and the hypothesis will be accepted if the difference in either direction is large enough.
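
To make the one-tailed versus two-tailed distinction concrete, the sketch below maps each `hypothesis` value to a p-value for a given z-statistic. The helper `p_value` is hypothetical and assumes a standard normal test statistic; it is not part of `spearmint`.

```python
from statistics import NormalDist


def p_value(z_statistic, hypothesis="larger"):
    """Convert a z-statistic into a p-value for the given hypothesis type."""
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    if hypothesis == "larger":
        # one-tailed: probability of a z at least this large under the null
        return 1.0 - std_normal.cdf(z_statistic)
    if hypothesis == "smaller":
        # one-tailed: probability of a z at least this small under the null
        return std_normal.cdf(z_statistic)
    if hypothesis == "unequal":
        # two-tailed: probability of a z at least this extreme in either direction
        return 2.0 * (1.0 - std_normal.cdf(abs(z_statistic)))
    raise ValueError(f"unknown hypothesis: {hypothesis!r}")


for name in ("larger", "smaller", "unequal"):
    print(name, round(p_value(2.0, name), 4))
```

For the same positive z-statistic, the `"unequal"` p-value is twice the `"larger"` p-value, which is why a two-tailed test needs a bigger difference to reach significance.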

---

> 💡 Note that we can also exclude the `hypothesis` argument, in which case `spearmint` will use the value configured in `$SPEARMINT_HOME/spearmint.cfg::hypothesis_test::default_hypothesis`. See the **Configuring `spearmint`** section below

---

#### Specifying the `variable_type`s and `inference_method`s

The specific inference procedure used depends on the `variable_type` of the observations and the `inference_method` argument. The supported `variable_type`s and their associated `inference_method`s are shown below:

| `variable_type` | `inference_method`| Available Models |
|---|---|---|
| `"continuous"` | `"frequentist"` (default) | `"means_delta"` (t-test) |
|  | `"bayesian"` | `"gaussian"`, `"student_t"`|
| `"binary"` | `"frequentist"` (default) | `"proportions_delta"` (z-test) |
|  | `"bayesian"`| `"binomial"`, `"bernoulli"`  |
| `"counts"`  | `"frequentist"` (default) | `"rates_ratio"`  |
|  |`"bayesian"`| `"poisson"`  |
| `Any`  | `"bootstrap"`| `"bootstrap_delta"` |

In the example above we specified `variable_type="binary"`, but this isn't necessary. If `variable_type` is not explicitly defined, `spearmint` will infer it from the distribution of observations in the `Experiment`. We also defined `inference_method="frequentist"`, which tells `spearmint` to use the Frequentist inference procedure that is specific to binary variables.

---

> 💡 Note that we could have also excluded the `inference_method` argument, in which case `spearmint` will use the value configured in `$SPEARMINT_HOME/spearmint.cfg::hypothesis_test::default_inference_method`. See the **Configuring `spearmint`** section below.

---


### 3. Run the `HypothesisTest` against the `Experiment` and interpret the resulting `InferenceResults`

We run the hypothesis test against the experiment's data using the `run_test` method. This method allows you to set your Type I error rate, `alpha`. In this example we run the test with the standard `alpha=0.05`, which means we're willing to accept a false positive from this test five times out of one hundred.

In [4]:
ab_test_results = experiment.run_test(ab_test, alpha=.05)

# Check the test results decision
assert ab_test_results.accept_hypothesis
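
As a rough illustration of what a `"proportions_delta"` z-test decision involves, here is a standard-library sketch of a pooled two-proportion z-test with a two-tailed `alpha` threshold. The function `proportions_z_test` and the success counts are assumptions for illustration; `spearmint`'s internal computation may differ in its details.

```python
from math import sqrt
from statistics import NormalDist


def proportions_z_test(successes_a, n_a, successes_b, n_b, alpha=0.05):
    """Two-tailed, pooled two-proportion z-test (illustrative sketch)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    p_val = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "delta": p_b - p_a,
        "z": z,
        "p_value": p_val,
        "accept_hypothesis": p_val < alpha,
    }


# Hypothetical counts: 16/40 conversions for control, 28/40 for variation
result = proportions_z_test(successes_a=16, n_a=40, successes_b=28, n_b=40)
print(result)
```

The hypothesis is accepted only when the p-value falls below `alpha`, so lowering `alpha` makes the test more conservative.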

#### Interpreting results
Each `InferenceResults` instance has `.display()` and `.visualize()` methods that can be used to interpret the results of the test. The `.display()` method prints out the results to the console, while `.visualize()` plots a visual summary of the results.

In [5]:
# Print the test results to the console
ab_test_results.display()

In [6]:
# Visualize the results
layout = ab_test_results.visualize()
layout



The resulting frequentist test results (displayed and visualized above) indicate that hypothesis `"B != A"` should be accepted. A breakdown of the results plot is as follows:

#### Left Plot: Sample Distributions & Central Tendency Estimates
The left plot compares the _parametric description of the samples_, with parameters that are estimated from the experiment observations. In this case, the parametric distribution used is the [Binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution), giving the distribution over the number of successful trials for each group when we run as many trials as were observed in the dataset.

The left plot also includes _central tendency estimates_ for the two sample groups. Central tendencies, along with their confidence intervals, are plotted as line intervals under each distribution. In this example, the central tendency is the expected number of successful trials given the observations, along with the 95% confidence interval (CI) around that expected number. Note that the 95% CI is derived from the test's `alpha=0.05` (i.e. `confidence = 1 - alpha`).

We see that there is a large amount of separation between the two distributions. Furthermore, we can see that there is little-to-no overlap of the CIs, further indicating that the two groups are likely different, and, more specifically, that `"B"` is `"larger"` than `"A"` in terms of the observed metric values.

#### Right Plot: Deltas
The right plot shows the distribution of the estimated _difference in central tendencies_ as the black curve. One can think of this curve as the difference of the two curves in the left plot. The plot also includes the mean of the delta distribution and the 95% CI around that mean.

We can see that the confidence interval on the difference between the two groups does not intersect the `ProportionsDelta=0` line indicated in red. This is another clue supporting that there is a statistically significant difference between the two samples.
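
The "interval excludes zero" check can be sketched with a normal-approximation confidence interval on the difference of two proportions. The helper `delta_confidence_interval` and the counts are hypothetical, and the unpooled standard error used here is one common choice rather than a statement of what `spearmint` computes.

```python
from math import sqrt
from statistics import NormalDist


def delta_confidence_interval(successes_a, n_a, successes_b, n_b, alpha=0.05):
    """Normal-approximation CI for the difference in proportions (B - A)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Unpooled standard error of the difference in proportions
    std_error = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Critical value for a (1 - alpha) two-sided interval, e.g. ~1.96 for 95%
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    delta = p_b - p_a
    return delta - z_crit * std_error, delta + z_crit * std_error


# Hypothetical counts: 16/40 conversions for control, 28/40 for variation
lower, upper = delta_confidence_interval(16, 40, 28, 40)
excludes_zero = lower > 0 or upper < 0
print(round(lower, 3), round(upper, 3), excludes_zero)
```

When the interval excludes zero at `confidence = 1 - alpha`, the two-tailed test at that `alpha` will typically agree that the difference is significant.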

All the visual evidence supports the result of accepting the `hypothesis` that `B != A`.

## Bootstrap Hypothesis Tests

If your samples do not follow standard parametric distributions (e.g. Gaussian, Binomial, Poisson), or if you're comparing more exotic or custom statistics (e.g. variance, skew, etc.), then you might want to consider using a non-parametric [Bootstrap Hypothesis Test](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)). Running bootstrap tests is easy in `spearmint`: you simply use the `"bootstrap"` `inference_method`. By default, `spearmint` bootstraps the mean of the metric distribution, but the scientist can use any `statistic_function` they like, as demonstrated below.

In [7]:
def my_test_statistic(samples):
    """Bootstrap tests support custom test statistics. Here
    we simply re-define the mean in order to compare with the parametric
    tests above.

    That said, we could test the difference in something more exotic,
    like the variance, or something custom, e.g.

        return np.var(samples)
    """
    return np.mean(samples)

bootstrap_ab_test = ab_test.copy(
    inference_method='bootstrap',
    inference_procedure_params=dict(statistic_function=my_test_statistic)
)

---

> 💡
> Above we use the `.copy()` method to copy over all parameters from the original `ab_test`, but update the `inference_method` and `inference_procedure_params`. We could have also defined the `HypothesisTest` from scratch
> ```python
> bootstrap_ab_test = HypothesisTest(
>     metric="metric",
>     treatment="treatment",
>     control="A", variation="B",
>     hypothesis="unequal",
>     variable_type="binary",
>     inference_method="bootstrap",
>     statistic_function=my_test_statistic
> )
> ```

---

In [8]:
# Run the bootstrap AB test
bootstrap_ab_test_results = experiment.run_test(bootstrap_ab_test)
bootstrap_ab_test_results.display()
bootstrap_ab_test_results.visualize()

The `"bootstrap"` `inference_method` uses non-parametric methods--namely resampling with replacement--to estimate the distribution of the mean of each group's sample (this is because `my_test_statistic` simply returns the `mean`). Compared to the visualization for the `"frequentist"` `inference_method` for `"binary"` variables, the left plot for the `"bootstrap"` `inference_method` shows the distribution of _conversion rates_, rather than a Binomial distribution over the number of successful trials defined by parameters estimated from the data.

The `"bootstrap"` hypothesis test results are very similar to the results returned by the `"frequentist"` `inference_method`. The results from the different `inference_methods` should converge as the sample size of each group grows.
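
The resampling idea can be sketched in a few lines of standard-library Python. This is a generic percentile bootstrap of the difference in a statistic, not `spearmint`'s implementation; `bootstrap_delta`, the number of resamples, and the binary samples are all assumptions for illustration.

```python
import random
from statistics import mean


def bootstrap_delta(control, variation, statistic_function=mean,
                    n_bootstraps=2000, seed=0):
    """Bootstrap the difference statistic(variation) - statistic(control)."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_bootstraps):
        # Resample each group with replacement, keeping the group sizes fixed
        control_resample = rng.choices(control, k=len(control))
        variation_resample = rng.choices(variation, k=len(variation))
        deltas.append(
            statistic_function(variation_resample) - statistic_function(control_resample)
        )
    return sorted(deltas)


# Hypothetical binary samples: 16/40 conversions vs. 28/40 conversions
control = [1] * 16 + [0] * 24
variation = [1] * 28 + [0] * 12
deltas = bootstrap_delta(control, variation)

# Percentile-based 95% interval on the bootstrapped deltas
lower = deltas[int(0.025 * len(deltas))]
upper = deltas[int(0.975 * len(deltas)) - 1]
print(round(lower, 3), round(upper, 3))
```

Swapping `statistic_function` for another callable (for example `statistics.variance`) bootstraps a different statistic without changing the rest of the procedure.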

## Bayesian Hypothesis Tests
In addition to Frequentist and Bootstrap tests, `spearmint` supports Bayesian hypothesis tests. To run a Bayesian test, simply initialize the `HypothesisTest` (or `.copy` one) with `inference_method="bayesian"`.

In [9]:
# Here we again use the `.copy` method
bayesian_ab_test = ab_test.copy(inference_method='bayesian')

# Run the Bayesian test
bayesian_ab_test_results = experiment.run_test(bayesian_ab_test)
assert bayesian_ab_test_results.accept_hypothesis
assert bayesian_ab_test_results.prob_greater_than_zero > .95

##### Displaying Bayesian Results

In [10]:
bayesian_ab_test_results.display()

In the Bayesian results, we see that p(B > A) = 0.974. This states that there is a 97.4% probability that the conversion rate for group `"B"` is greater than that of group `"A"`. Given this evidence we can infer that `B != A`, and can accept the `hypothesis="unequal"`.
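
A quantity like p(B > A) can be approximated by sampling from each group's posterior and counting how often the variation's sample exceeds the control's. The sketch below assumes a conjugate Beta posterior for each conversion rate and uses hypothetical counts; `prob_b_greater_than_a` is an illustrative helper, and `spearmint` may obtain its posterior samples differently.

```python
import random


def prob_b_greater_than_a(successes_a, n_a, successes_b, n_b,
                          prior_alpha=1.0, prior_beta=1.0,
                          n_samples=20000, seed=0):
    """Monte Carlo estimate of P(p_B > p_A) under Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        # Conjugate update: Beta(alpha + successes, beta + failures)
        p_a = rng.betavariate(prior_alpha + successes_a,
                              prior_beta + n_a - successes_a)
        p_b = rng.betavariate(prior_alpha + successes_b,
                              prior_beta + n_b - successes_b)
        wins += p_b > p_a
    return wins / n_samples


# Hypothetical counts: 16/40 conversions for control, 28/40 for variation
prob = prob_b_greater_than_a(16, 40, 28, 40)
print(round(prob, 3))
```

The estimate is simply the fraction of paired posterior draws in which the variation's rate is larger, which is why it reads directly as a probability.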

---

> ❓ Bayesian vs Frequentist Tests
>
> In Frequentist tests, we calculate some test statistic that is a function of the observed data (e.g. z-statistic, t-statistic), then use the p-value associated with the value of that calculated test statistic in order to make a statement about statistical significance.
>
> In Bayesian tests we instead build a generative model that we believe could have generated the observed data, and estimate the parameters of that model such that it's as accurate as possible at capturing the distribution of observed data. The generative model's parameters all have their own probability distributions, and we compare the distributions of those parameters in order to make statistical statements about the data.
>
---

##### Visualizing Bayesian Results

In [11]:
bayesian_ab_test_results.visualize()

Bayesian visualizations have a similar layout to Frequentist and Bootstrap tests. Namely, we have a left plot which shows estimates of the central tendencies amongst the groups, along with error intervals around those estimates. Note that here these are not statistics (e.g. mean) of the data per se, but expected value parameters of a generative model that tries to describe the data. The difference is subtle, but using a generative model under the hood allows the researcher to incorporate prior knowledge into the analysis, which can be useful in scenarios with few observations.

Similar to the Frequentist/Bootstrap visualization, we also include a Delta distribution. This plot carries similar semantics to the Delta plot for Frequentist/Bootstrap tests.

We can also see that rather than using Confidence Intervals (CIs), Bayesian results use Highest Density Intervals (HDIs). CIs and HDIs are not equivalent, but the differences between them are subtle. From a bird's-eye view, CIs make parametric assumptions about the shape of the errors of our estimates, while HDIs estimate error distributions directly from the probability distributions that define the underlying Bayesian model. In terms of pragmatic interpretation of the results, you can use CIs and HDIs in a similar fashion, as indicated by the similar results reported by Frequentist/Bootstrap CIs and Bayesian HDIs.
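
One common way to estimate an HDI from posterior samples is to find the narrowest window that contains the requested share of the sorted samples. The sketch below shows that approach under the assumption of a unimodal posterior; `highest_density_interval` and the Beta(29, 13) example posterior are illustrative choices, not `spearmint` internals.

```python
import random
from math import ceil


def highest_density_interval(samples, credible_mass=0.95):
    """Narrowest interval containing `credible_mass` of the samples."""
    ordered = sorted(samples)
    n_included = ceil(credible_mass * len(ordered))
    n_windows = len(ordered) - n_included + 1
    # Slide a window of n_included points and keep the narrowest one
    start = min(
        range(n_windows),
        key=lambda i: ordered[i + n_included - 1] - ordered[i],
    )
    return ordered[start], ordered[start + n_included - 1]


rng = random.Random(0)
# Hypothetical posterior over a conversion rate
posterior_samples = [rng.betavariate(29, 13) for _ in range(10000)]
hdi_lower, hdi_upper = highest_density_interval(posterior_samples)
print(round(hdi_lower, 3), round(hdi_upper, 3))
```

Because the interval is read directly off the samples, no assumption about the shape of the error distribution is needed beyond unimodality.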

### Bayesian Model Specification
Bayesian models allow the experimenter to incorporate prior beliefs. This can be helpful when you have little data, or when you can provide sound domain knowledge of baselines. Specifying custom priors is also straightforward in `spearmint`: simply pass in a `bayesian_model_params` argument during `HypothesisTest` initialization. Below we demonstrate by running another Bayesian hypothesis test, this time with a hierarchical [Beta-Binomial model](https://en.wikipedia.org/wiki/Beta-binomial_distribution#:~:text=In%20probability%20theory%20and%20statistics,is%20either%20unknown%20or%20random.). This model allows the user to specify a prior over the base probability $p$ by setting two hyperparameters of the Beta distribution, $\alpha$ and $\beta$, such that the prior mean has a value of

$$ p = \frac{\alpha}{\alpha + \beta}$$

where larger values of $\alpha$ and $\beta$ correspond to a stronger (more concentrated) prior. Let's put a super-strong prior on $p$ and see how it affects the inference results.
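
To see how the magnitudes of $\alpha$ and $\beta$ control the strength of a Beta prior, the short sketch below computes the prior mean and standard deviation for a few settings using the standard Beta distribution formulas. The helper name `beta_prior_summary` is hypothetical.

```python
def beta_prior_summary(prior_alpha, prior_beta):
    """Return the (mean, standard deviation) of a Beta(alpha, beta) prior."""
    total = prior_alpha + prior_beta
    mean = prior_alpha / total
    variance = prior_alpha * prior_beta / (total ** 2 * (total + 1))
    return mean, variance ** 0.5


for alpha, beta in [(1, 1), (10, 10), (100, 100)]:
    mean, std = beta_prior_summary(alpha, beta)
    print(f"Beta({alpha}, {beta}): mean={mean:.2f}, std={std:.3f}")
```

All three priors share the same mean of 0.5, but the standard deviation shrinks as $\alpha$ and $\beta$ grow, so the prior becomes increasingly concentrated around its mean.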

Below we print the default prior parameters for the Bayesian Hypothesis test above.

In [12]:
print("Default Bayesian model hyperparams", bayesian_ab_test_results.model_hyperparams)

Default Bayesian model hyperparams {'prior_alpha': 1.0, 'prior_beta': 1.0}


We see that we use a "Beta" prior with prior parameters `prior_alpha=1.0` and `prior_beta=1.0`. This is equivalent to a non-informative prior, essentially a uniform prior over all possible conversion rates. You can verify this by adding the `include_prior=True` flag to the `bayesian_ab_test_results.visualize()` method:

In [13]:
bayesian_ab_test_results.visualize(include_prior=True)

The default prior, plotted in gray, is more-or-less uniform across the entire probability space.

Below we rerun our Bayesian inference, but with a much stronger prior, one that defines a Beta prior tightly bound around 0.5. In this scenario we either need a lot of data to overcome the prior, or _very_ strong effects to pull the estimates away from that prior.

In [14]:
# Run Bayesian test with custom prior

# strong prior that p = alpha / (alpha + beta) = 0.5
strong_prior_model_params = dict(prior_alpha=100, prior_beta=100)

custom_bayesian_ab_test = HypothesisTest(
    metric='metric',
    control='A', variation='C',
    inference_method='bayesian',
    bayesian_model_params=strong_prior_model_params
)

# run the test with strong prior
custom_bayesian_ab_test_results = experiment.run_test(custom_bayesian_ab_test)
assert not custom_bayesian_ab_test_results.prob_greater_than_zero > .95  # strong prior dominates data
custom_bayesian_ab_test_results.display()
custom_bayesian_ab_test_results.visualize(include_prior=True)  # also plot the prior distribution

Here we see that the strong prior of $p=0.5$ pulls the proportion parameters toward values around 0.5. This causes the deltas in the delta distribution (right plot) to be much smaller, because both groups' models are forced to be near the prior mean (here 0.5), and thus are located closer to one another than if we had used a weaker prior.

If there were more data in the experiment, these parameter estimates would move toward the data distribution, rather than the super-confident prior distribution, providing results that are similar to the results provided by the Frequentist and weak-prior Bayesian models.
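
The pull of a strong prior, and the way additional data overcomes it, can be illustrated with the closed-form posterior mean of a conjugate Beta-Binomial update. The counts below are hypothetical and `posterior_mean` is an illustrative helper rather than part of `spearmint`.

```python
def posterior_mean(successes, n_trials, prior_alpha=1.0, prior_beta=1.0):
    """Posterior mean of p after a conjugate Beta-Binomial update."""
    return (prior_alpha + successes) / (prior_alpha + prior_beta + n_trials)


# Observed conversion rate is 70% in every case below
weak_small = posterior_mean(28, 40)                  # weak prior, little data
strong_small = posterior_mean(28, 40, 100, 100)      # strong prior, little data
strong_large = posterior_mean(2800, 4000, 100, 100)  # strong prior, lots of data

print(round(weak_small, 3), round(strong_small, 3), round(strong_large, 3))
```

With only 40 trials, the `prior_alpha=100, prior_beta=100` prior keeps the estimate close to 0.5; with 4000 trials the same prior has little influence and the estimate moves back toward the observed rate.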

## Including Segmentations
`spearmint` supports the ability to segment experiment observations based on one or more attributes in your dataset using the `segmentation` argument to `HypothesisTest`. The segmentation can be a string or a list of string expressions, each of which follows the [pandas query API](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html).

In [15]:
# Initialize an A/B test with additional segmentation on the 'attr_1' attribute
ab_test_segmented = HypothesisTest(
    metric='metric',
    control='A', variation='C',
    inference_method='bayesian',
    hypothesis='larger',
    segmentation="attr_1 == 'A1a'"
)

# Run the segmented test
ab_test_segmented_results = experiment.run_test(ab_test_segmented)

# Display results (notice reduced sample sizes)
ab_test_segmented_results.display()
ab_test_segmented_results.visualize()

We now see that if we dig into a particular segment, namely the segment defined by `"attr_1 == 'A1a'"`, we can no longer accept the hypothesis that `"C is larger"`. Since we are using a Bayesian test (recommended for segmentation because Bayesian tests aren't affected by multiple comparisons errors; see **Running multiple Frequentist tests, and Multiple Comparison control** below), evidence for rejecting the hypothesis comes from the fact that p(C > A) = 0.936, which is less than the `Credible Mass` of 0.95 required by our experiment (as defined by `alpha`). Further evidence comes from the fact that the HDI of the Delta distribution (right plot) overlaps the zero value associated with no difference.

## Running multiple Frequentist tests, and Multiple Comparison control
When running multiple Frequentist hypothesis tests on the same metric, you'll need to control for [multiple comparisons](https://en.wikipedia.org/wiki/Multiple_comparisons_problem). This is handled by running a `HypothesisTestGroup` against the experiment. The `HypothesisTestGroup` is a list of hypothesis tests. When the `Experiment.run_test_group` method is applied to the `HypothesisTestGroup`, the value of each test's `alpha` is adjusted so that it is more conservative, in order to avoid the inflated Type I error rates caused by multiple comparisons.

### Example
In the example below we run 3 independent tests comparing A to A, B to A, and C to A, and set the correction `method` to `'bonferroni'`, which simply sets the effective $\alpha_{corrected} = \frac{\alpha}{N_{tests}}$. Our original value is `alpha=0.05`, so the corrected value is $\frac{0.05}{3} \approx 0.0167$.
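
The arithmetic behind the correction is small enough to check by hand. The sketch below computes the Bonferroni-corrected alpha alongside the Šidák-corrected alpha for comparison, using the standard textbook formulas; the function names are illustrative and not part of `spearmint`.

```python
def bonferroni_alpha(alpha, n_tests):
    """Bonferroni correction: divide alpha evenly across the tests."""
    return alpha / n_tests


def sidak_alpha(alpha, n_tests):
    """Sidak correction: exact family-wise control for independent tests."""
    return 1 - (1 - alpha) ** (1 / n_tests)


alpha, n_tests = 0.05, 3
print("bonferroni:", round(bonferroni_alpha(alpha, n_tests), 4))
print("sidak:     ", round(sidak_alpha(alpha, n_tests), 4))
```

Both corrections give a per-test alpha of roughly 0.017 for three tests, with Šidák being very slightly less conservative than Bonferroni.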

In [16]:
from spearmint.hypothesis_test import HypothesisTestGroup

# Use the `HypothesisTest.copy` method for duplicating test
# configurations, while overwriting specific parameters, in
# this case `variation` parameter
aa_test = ab_test.copy(variation='A')
ac_test = ab_test.copy(variation='C')

# Initialize the `HypothesisTestGroup`
test_group = HypothesisTestGroup(
    tests=[aa_test, ab_test, ac_test],
    correction_method='bonferroni'
)

# Run tests
test_suite_results = experiment.run_test_group(test_group)

# Print results
test_suite_results.display()

------------------------------------------------------------
Test 1 of 3


------------------------------------------------------------
Test 2 of 3


------------------------------------------------------------
Test 3 of 3


Note that the alpha has been `(corrected)` to a value of 0.017, using `MC Correction='bonferroni_correction'`

The `HypothesisTestGroup` supports the following multiple comparison strategies:
- ['sidak'](http://en.wikipedia.org/wiki/%C5%A0id%C3%A1k_correction) (default)
- ['bonferroni'](http://en.wikipedia.org/wiki/Bonferroni_correction)
- [Benjamini-Hochberg false-discovery rate ('bh_fdr')](http://pdfs.semanticscholar.org/af6e/9cd1652b40e219b45402313ec6f4b5b3d96b.pdf)
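
Unlike the per-test alpha adjustments, the Benjamini-Hochberg procedure works on the ranked p-values of the whole group. Below is a minimal sketch of the standard step-up procedure with hypothetical p-values; `benjamini_hochberg` is an illustrative helper and does not reflect how `spearmint` implements `'bh_fdr'`.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return, per test, whether its null hypothesis is rejected under BH."""
    n_tests = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value
    order = sorted(range(n_tests), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / n_tests) * alpha
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / n_tests * alpha:
            max_rank = rank
    # Reject the nulls of all tests ranked at or below max_rank
    null_rejected = [False] * n_tests
    for index in order[:max_rank]:
        null_rejected[index] = True
    return null_rejected


# Hypothetical p-values for three tests
decisions = benjamini_hochberg([0.9, 0.03, 0.001])
print(decisions)
```

With these p-values the procedure flags the second and third tests, whereas a Bonferroni threshold of 0.05 / 3 would flag only the third, illustrating that false-discovery-rate control is less conservative.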

## Custom Metrics
`spearmint` also supports the use of custom metrics, which can transform and combine information from one or more columns.

### Example
In the example below we create a `CustomMetric` that makes the `control` larger than the `variation` most of the time, by adding a constant offset to the (noisy) values of the `control`.

In [17]:
from spearmint import CustomMetric

def custom_metric(row):
    """
    Define a custom 'metric' where the control is larger
    than the variation most of the time. This metric should
    result in an observed delta of ~-4.0.
    """
    return 4 + np.random.rand() if row['treatment'] == 'A' else np.random.rand()

custom_test = HypothesisTest(
    metric=CustomMetric(custom_metric),
    control='A',
    variation='B',
    inference_method='frequentist',  # Note we use a t-test here.
    hypothesis='larger'
)

custom_test_results = experiment.run_test(custom_test)
custom_test_results.display()

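As expected, the observed delta is large and negative (approximately -4.0). Because the control `'A'` is larger than the variation `'B'`, this one-tailed `'larger'` test should not accept its hypothesis; a `'smaller'` or `'unequal'` hypothesis would be needed to flag this difference as significant.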

## Working with other types of variables
The examples above demonstrate running AB tests for variables that take on binary values--i.e. values in the set $\{0, 1\}$. This is a pretty common scenario, as a lot of AB tests measure metrics like conversions at various stages in a UX funnel.

However, `spearmint` also supports inference methods for other types of variables, like `"continuous"` variables (e.g. time spent on a page) and `"counts"` (e.g. number of clicks on a button per unit time).

### Continuous Variables
When testing the difference in means between samples of continuous variables, we often model those differences using Gaussian-distributed random variables. The most common statistical inference procedure for this scenario is the [Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test) (referred to as `"means_delta"` under the hood, as it tests for differences in the means of the underlying Gaussian distributions for each of the treatments).

Below we generate some Gaussian-distributed observations and show how ✨spearmint✨can be used to run a t-test using the common `Experiment->HypothesisTest->InferenceResults` workflow.

In [18]:
# generate some fake Gaussian-distributed trial data
continuous_data = generate_fake_observations(
    distribution='gaussian',  # continuous data
    n_treatments=3,
    n_attributes=2,
    n_observations=100
)
continuous_data.head()

| | id | treatment | metric | attr_0 | attr_1 |
|---|---|---|---|---|---|
| 0 | 0 | C | 2.093461 | A0a | A1c |
| 1 | 1 | B | 0.760331 | A0a | A1b |
| 2 | 2 | C | 3.459589 | A0a | A1b |
| 3 | 3 | C | 3.395353 | A0a | A1c |
| 4 | 4 | A | 1.959411 | A0a | A1c |


Here we use the `variable_type="continuous"` argument to run the t-test.

In [19]:
# Initialize the Experiment
continuous_experiment = Experiment(data=continuous_data)

# Initialize the A/B test
continuous_ab_test = HypothesisTest(
    metric="metric",
    treatment="treatment",
    control="A", variation="B",
    hypothesis="unequal",
    variable_type="continuous"
)

# Run the test with an alpha of 0.05
continuous_ab_test_results = continuous_experiment.run_test(continuous_ab_test, alpha=.05)

# Check the test results decision
assert continuous_ab_test_results.accept_hypothesis
continuous_ab_test_results.display()
continuous_ab_test_results.visualize()

#### Bayesian models for continuous variables
The Bayesian analog to the t-test is called the "Hierarchical Gaussian" model, and involves modeling the observations as a generative process where each point is sampled from a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. The model is "hierarchical" because it also places distributions over $\mu$ and $\sigma$, namely $\mu \sim \text{Normal}(\bar{x}, \text{std(x)})$ and $\sigma \sim \text{Uniform}(0, \sigma_{max})$, where $\bar{x}$ and $\text{std(x)}$ are the empirical mean and standard deviation of the observations, and $\sigma_{max}$ is a user-specified hyperparameter.
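
The generative story can be read as a short forward simulation: draw $\mu$ and $\sigma$ from their priors, then draw an observation from the resulting Gaussian. The sketch below only simulates from the model as described and performs no inference; `sample_from_hierarchical_gaussian` and the `sigma_max` value are assumptions for illustration.

```python
import random
from statistics import mean, stdev


def sample_from_hierarchical_gaussian(observations, sigma_max,
                                      n_draws=5, seed=0):
    """Forward-simulate draws from the hierarchical Gaussian model."""
    rng = random.Random(seed)
    x_bar, x_std = mean(observations), stdev(observations)
    draws = []
    for _ in range(n_draws):
        mu = rng.gauss(x_bar, x_std)          # mu ~ Normal(x_bar, std(x))
        sigma = rng.uniform(0.0, sigma_max)   # sigma ~ Uniform(0, sigma_max)
        draws.append(rng.gauss(mu, sigma))    # x ~ Normal(mu, sigma)
    return draws


# The five metric values shown in the table above
observed = [2.093461, 0.760331, 3.459589, 3.395353, 1.959411]
draws = sample_from_hierarchical_gaussian(observed, sigma_max=5.0)
print([round(x, 3) for x in draws])
```

Bayesian inference runs this story in reverse: given the observations, it estimates which values of $\mu$ and $\sigma$ are plausible for each treatment.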


That all sounds pretty complicated, right? Well, in ✨spearmint✨ it's easy to run inference using this model. We simply update the `inference_method`:

In [20]:
# copying parameters from original AB test -- note we must update the `variable_type`, as we're now using continuous data
bayesian_continuous_ab_test = ab_test.copy(inference_method='bayesian', variable_type='continuous')

# Run the test with an alpha of 0.05; get back an InferenceResults object
bayesian_continuous_ab_test_results = continuous_experiment.run_test(bayesian_continuous_ab_test, alpha=.05)

# Check the test results decision
assert bayesian_continuous_ab_test_results.accept_hypothesis
bayesian_continuous_ab_test_results.display()
bayesian_continuous_ab_test_results.visualize()

### Counts / Rates variables
✨spearmint✨ also supports analysis of counts variables such as clicks or page views per standard unit of time. These discrete, countable variables are often modeled with a Poisson distribution. In the Frequentist setting, rather than testing if the _difference_ between the two groups' expected values is greater than zero, we instead test whether the _ratio_ of their expected values is different from one. The reasoning is that if the two treatments have an equal expected number of events per unit of time (and thus the same _rate_), then their ratio will be close to one. (Accordingly, in ✨spearmint✨, this underlying comparison model is called a "Rates Ratio".)

Below we'll run an AB test on synthetic data drawn from a Poisson distribution, and test to see if the two distributions are statistically different.

In [21]:
# generate some fake Poisson-distributed trial data
counts_data = generate_fake_observations(
    distribution='poisson',  # count data
    n_treatments=3,
    n_observations=1000
)
counts_data.head()

| | id | treatment | metric | attr_0 | attr_1 |
|---|---|---|---|---|---|
| 0 | 0 | C | 8 | A0a | A1a |
| 1 | 1 | B | 1 | A0a | A1a |
| 2 | 2 | C | 1 | A0a | A1b |
| 3 | 3 | C | 5 | A0a | A1a |
| 4 | 4 | A | 3 | A0a | A1b |


In [22]:
# Initialize the A/B test
counts_experiment = Experiment(data=counts_data)

counts_ab_test = HypothesisTest(
    metric="metric",
    treatment="treatment",
    control="B", variation="C",
    hypothesis="unequal",
    # variable_type='counts'  # Note: spearmint will infer variable type from data
)

# Run the test with an alpha of 0.05; get back an InferenceResults object
poisson_ab_test_results = counts_experiment.run_test(counts_ab_test, alpha=.05)

# Check the test results decision
poisson_ab_test_results.display()
poisson_ab_test_results.visualize()

Here we can see that the ratio of the two rate parameters ranges between 1.38 and 1.68 (95% confidence), with no overlap with the value 1. This indicates that the rate parameter of the variation `"C"` is approximately 1.5x that of the control `"B"`, which makes sense, given that the mean estimates for the two treatments are approximately 3 and 2, respectively.
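
To show how an interval on a ratio of rates can arise, here is a sketch of a Wald-style confidence interval computed on the log scale, a common textbook approach for Poisson rate ratios. It is an assumption that this resembles the `"rates_ratio"` procedure; the event totals and the helper `rates_ratio_interval` are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def rates_ratio_interval(events_control, n_control,
                         events_variation, n_variation, alpha=0.05):
    """Ratio of Poisson rates with a log-scale Wald confidence interval."""
    rate_control = events_control / n_control
    rate_variation = events_variation / n_variation
    ratio = rate_variation / rate_control
    # Approximate standard error of log(ratio) for Poisson counts
    se_log_ratio = sqrt(1 / events_control + 1 / events_variation)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower = exp(log(ratio) - z_crit * se_log_ratio)
    upper = exp(log(ratio) + z_crit * se_log_ratio)
    return ratio, lower, upper


# Hypothetical totals: control averages 2 events, variation averages 3
ratio, lower, upper = rates_ratio_interval(660, 330, 1020, 340)
print(round(ratio, 2), round(lower, 2), round(upper, 2))
```

An interval that lies entirely above one, as in this example, indicates that the variation's rate is significantly higher than the control's at the chosen `alpha`.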

#### Bayesian models for count variables
The Bayesian analog to the rates ratio test is what's called the [Gamma-Poisson model](http://www.math.wm.edu/~leemis/chart/UDR/PDFs/Gammapoisson.pdf). In this model the observations are assumed to be generated from a Poisson distribution with rate parameter $\lambda$. Similar to the "Hierarchical Gaussian" Bayesian model, there is a prior distribution associated with $\lambda$ (that's what makes it Bayesian!), namely $\lambda \sim \text{Gamma}(\alpha, \beta)$. Here the hyperparameters $\alpha$ and $\beta$ can be set by the experimenter (as `bayesian_model_params=dict(prior_alpha=..., prior_beta=...)`) to encode any intuitions or domain knowledge about the problem.

Though this sounds complicated, implementing a hypothesis test using an inference method based off of the Gamma-Poisson model is not:

In [23]:
bayesian_counts_ab_test = counts_ab_test.copy(inference_method='bayesian')

bayesian_counts_ab_test_results = counts_experiment.run_test(bayesian_counts_ab_test)

bayesian_counts_ab_test_results.display()
bayesian_counts_ab_test_results.visualize()

Note here that samples drawn from the Bayesian model provide similar central tendency interval estimates to those calculated by the analytical rates ratio model. However, unlike the rates ratio model, which looks at the ratio of central tendencies, the Bayesian AB test provides _deltas_, or differences, amongst the _Poisson rate parameter_ samples drawn from the model. Thus differences in $\lambda$ samples that are far away from zero (in this case the difference is--and should be--approximately equal to one) indicate a significant difference between the treatments when interpreting the Bayesian counts AB test.
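
The delta of rate parameters can be sketched with the conjugate Gamma-Poisson update, where a $\text{Gamma}(\alpha, \beta)$ prior and observed counts give a $\text{Gamma}(\alpha + \sum x, \beta + n)$ posterior. The counts, the default prior values, and the helper `posterior_rate_samples` are assumptions for illustration; `spearmint` may sample its posterior differently.

```python
import random


def posterior_rate_samples(counts, prior_alpha=1.0, prior_beta=1.0,
                           n_samples=10000, seed=0):
    """Sample the Poisson rate from its conjugate Gamma posterior."""
    rng = random.Random(seed)
    shape = prior_alpha + sum(counts)   # alpha + total events
    rate = prior_beta + len(counts)     # beta + number of observations
    # random.gammavariate takes (shape, scale), where scale = 1 / rate
    return [rng.gammavariate(shape, 1.0 / rate) for _ in range(n_samples)]


# Hypothetical counts with means of 2 (control) and 3 (variation)
control_counts = [2, 1, 3, 2, 2, 1, 3, 2] * 10
variation_counts = [3, 4, 2, 3, 3, 4, 2, 3] * 10

samples_control = posterior_rate_samples(control_counts, seed=1)
samples_variation = posterior_rate_samples(variation_counts, seed=2)
deltas = [v - c for v, c in zip(samples_variation, samples_control)]

mean_delta = sum(deltas) / len(deltas)
prob_greater = sum(d > 0 for d in deltas) / len(deltas)
print(round(mean_delta, 2), round(prob_greater, 3))
```

The mean delta lands near one, the difference between the two underlying rates, and nearly all sampled deltas are positive.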

## Conversion rate variables
If you are testing conversion rate data directly--i.e. floats in the range (0, 1), rather than binary values--you can use either the `"binary"` or `"continuous"` `variable_types`. This is because conversion rates are essentially means, and the [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem) allows the scientist to model those conversion rates as Gaussian distributions. Additionally all the models for `"binary"` data should support conversion rates as well.

### Example

Below we model conversion rates using the `"bayesian"` `inference_method` for `"continuous"` variables.

In [24]:
conversion_rate_data = counts_data.copy()

# convert to proportions
conversion_rate_data.metric = conversion_rate_data.metric / conversion_rate_data.metric.max()

conversion_rate_experiment = Experiment(data=conversion_rate_data)
conversion_rate_test = HypothesisTest(
    metric="metric",
    treatment="treatment",
    control="A", variation="B",
    hypothesis="unequal",
    variable_type='continuous',
    inference_method='bayesian'
)

conversion_rate_test_results = conversion_rate_experiment.run_test(conversion_rate_test)
conversion_rate_test_results.display()
conversion_rate_test_results.visualize()


## Configuring `spearmint`
Upon first import, `spearmint` creates a `spearmint.cfg` file in your `SPEARMINT_HOME` directory. This directory can be set with an environment variable:

```bash
export SPEARMINT_HOME=PATH/TO/SPEARMINT
```

otherwise `spearmint` will use `/USER_HOME/.spearmint/` as the location of the configuration file. The `spearmint.cfg` file allows the scientist to configure many global settings and default behaviors.
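
The lookup described above can be sketched as an environment-variable check with a fallback to a directory under the user's home. This is an assumption about the behavior based on the description, not `spearmint`'s actual code, and `resolve_spearmint_home` is a hypothetical helper.

```python
import os
from pathlib import Path


def resolve_spearmint_home(environ=os.environ):
    """Return the configured home directory, or the default fallback."""
    default_home = Path.home() / ".spearmint"
    # Prefer the SPEARMINT_HOME environment variable when it is set
    return Path(environ.get("SPEARMINT_HOME", default_home))


config_path = resolve_spearmint_home() / "spearmint.cfg"
print(config_path)
```

Passing the environment as an argument keeps the helper easy to check with a plain dictionary instead of modifying the real environment.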