# Property-based + statistical testing

This section introduces property-based testing, a technique for testing code and validating data against general properties rather than hand-picked examples.
We will use the [Hypothesis](https://hypothesis.readthedocs.io/en/latest/)
library and its integration with pandas and Pandera.
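
To make the idea concrete before dataframes enter the picture, here is a minimal sketch of a property-based test on plain Python lists. The test name and the chosen properties are illustrative assumptions, not part of the original material; only `given`, `settings`, and `strategies` come from Hypothesis.

```python
from hypothesis import given, settings, strategies as st


@settings(max_examples=50, deadline=None)
@given(values=st.lists(st.integers()))
def test_sorting_properties(values: list) -> None:
    once = sorted(values)
    # Property 1: sorting an already sorted list changes nothing.
    assert sorted(once) == once
    # Property 2: sorting neither adds nor drops elements.
    assert len(once) == len(values)


# Hypothesis supplies the `values` argument, so the test is called
# without arguments and runs against many generated lists.
test_sorting_properties()
```

Instead of listing inputs by hand, the test states what must hold for any input and lets Hypothesis search for a counterexample.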

## Data Synthesis

Pandera provides a utility for generating synthetic data purely from a Pandera schema or schema component object. Under the hood, it collects the schema metadata and builds a data-generating strategy with [Hypothesis](https://hypothesis.readthedocs.io/), a property-based testing library.

### Basic Usage

Let's define an example dataframe model and use its `.example()` method to generate a sample dataframe:

In [None]:
import pandera as pa

class DFModel(pa.DataFrameModel):
    # Note: the annotations below omit the `Series[...]` type hint.
    # This works because pandera assumes the type is `Series` by default.
    column1: int = pa.Field(lt=10)
    column2: float = pa.Field(gt=0.25, nullable=True)
    column3: str = pa.Field(str_contains="spam")

    # We can define custom checks as class methods (the commented-out example
    # below would also need `from pandera.typing import Series`).
    # However, custom checks slow down the strategies unless we also
    # create a custom strategy, which is an advanced topic.

    # @pa.check("column1", name="foobar")
    # def is_odd(cls, column1: Series[int]) -> Series[bool]:
    #     return column1.mod(2).eq(1)

In [None]:
DFModel.example(size=3)

A few observations:
* The generated data conforms to the schema constraints.
* The data may look a bit strange; this is on purpose as hypothesis is basically trying to break your code by generating edge cases 😏

**Exercise:**
1. Generate a few more examples (by simply re-running the cell) and look at what is generated.
2. Constrain `column2` values to lie between 0.25 and 1.0.
3. Make `column3` values end with `"spam"`.

*Optional:*

4. Add an index type to the schema.
5. Make the index of type `str`, unique, and consisting only of single lowercase letters `a-z`.

### Using strategies in property-based testing

Pandera models also expose a `strategy` method that returns a Hypothesis strategy for generating data from the schema. This strategy can be used in property-based tests to generate the input data.

Say we would like to test a function like this:

In [None]:
import pandas as pd

def column2_remainder(df: pd.DataFrame) -> float:
    return (df["column2"] - 0.25).sum()

One property we expect to hold is that the result is always >= 0: every non-null `column2` value is greater than 0.25, and `sum()` skips nulls. We can use the `strategy` method to generate data for testing this:

In [None]:
from hypothesis import given

@given(df=DFModel.strategy())
def test_column2_remainder_is_positive(df: pd.DataFrame) -> None:
    assert column2_remainder(df) >= 0

Because `@given` supplies the `df` argument, we can call this test directly here, without a test runner (which would not be possible for a plain pytest test that takes arguments).

In [None]:
test_column2_remainder_is_positive()

**Exercise:** For the functions defined below:
1. Write a property-based test for `remove_spam` that checks there is no more spam (i.e. that `column3` values do not include "spam") in the output.
2. Write a property-based test for `multiply_large` that checks the number of rows in the output.
3. If you find a bug, try to fix it.

*Optional:*

4. Write more property-based tests for the two functions.

In [None]:
from pandera.typing import DataFrame
import pandas as pd

def remove_spam(df: DataFrame[DFModel]) -> DataFrame[DFModel]:
    return df.assign(column3=df["column3"].str.replace("spam", ""))


def multiply_large(df: DataFrame[DFModel], limit: int = 5) -> DataFrame[DFModel]:
    # repeat the rows with column1 > limit
    # e.g. if limit = 5, and there are 2 rows with column1 > 5
    # then the resulting dataframe will have 4 rows with column1 > 5
    large_rows = df[df["column1"] > limit]
    return pd.concat([df, large_rows] * 2).reset_index(drop=True)

### Hypothesis strategies for the scientific stack

The Hypothesis package has a number of strategies already implemented, and in particular
many useful ones for the scientific stack. See the [Hypothesis documentation](https://hypothesis.readthedocs.io/en/latest/numpy.html) for more details.
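
For instance, the NumPy extra provides an `arrays` strategy. The sketch below is an illustration under assumed bounds (vector length up to 20, values within ±1e6) and a hypothetical test name; it checks that the mean of a finite vector lies between its minimum and maximum.

```python
import numpy as np
from hypothesis import given, settings, strategies as st
from hypothesis.extra.numpy import arrays

# 1-D float64 arrays of length 1 to 20 with finite, bounded values.
finite_vectors = arrays(
    dtype=np.float64,
    shape=st.integers(min_value=1, max_value=20),
    elements=st.floats(min_value=-1e6, max_value=1e6, allow_nan=False),
)


@settings(max_examples=50, deadline=None)
@given(x=finite_vectors)
def test_mean_lies_between_min_and_max(x: np.ndarray) -> None:
    # A small tolerance absorbs floating-point rounding in the mean.
    assert x.min() - 1e-6 <= x.mean() <= x.max() + 1e-6


test_mean_lies_between_min_and_max()
```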

As an example, we will define a strategy for generating dataframes that can be used to test `column2_remainder`:


In [None]:
from hypothesis.extra.pandas import column, data_frames, range_indexes

df_strategy = data_frames(
    [
        column("column1", dtype=int),
        column("column2", dtype=float),
        column("column3", dtype=str),
    ],
    index=range_indexes(min_size=1),
)

In [None]:
df_strategy.example()

In [None]:
from hypothesis import given

@given(df=df_strategy)
def test_column2_remainder_is_positive_2(df: pd.DataFrame) -> None:
    assert column2_remainder(df) >= 0

In [None]:
test_column2_remainder_is_positive_2()
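
The unconstrained `column2` strategy above can produce negative or infinite values, for which the `>= 0` property need not hold. One way to encode the precondition is to constrain the column's elements, as in this sketch. The bounds (0.25 to 1e6, at most 10 rows) and the names `constrained_df_strategy` and `test_column2_remainder_with_precondition` are assumptions made for illustration.

```python
import pandas as pd
from hypothesis import given, settings
from hypothesis.extra.pandas import column, data_frames, range_indexes
from hypothesis.strategies import floats


def column2_remainder(df: pd.DataFrame) -> float:
    # Same function as defined earlier, repeated to keep the sketch self-contained.
    return (df["column2"] - 0.25).sum()


constrained_df_strategy = data_frames(
    [
        column(
            "column2",
            dtype=float,
            # Only finite values >= 0.25, so each remainder is non-negative.
            elements=floats(min_value=0.25, max_value=1e6, allow_nan=False),
        ),
    ],
    index=range_indexes(min_size=1, max_size=10),
)


@settings(max_examples=50, deadline=None)
@given(df=constrained_df_strategy)
def test_column2_remainder_with_precondition(df: pd.DataFrame) -> None:
    assert column2_remainder(df) >= 0


test_column2_remainder_with_precondition()
```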

We can also use a Hypothesis strategy to test a Pandera model itself, for example to check that it rejects invalid data:

In [None]:
from hypothesis.extra.pandas import column, data_frames, range_indexes
from hypothesis.strategies import floats
import pandera.errors
import pytest


invalid_df_strategy = data_frames(
    [
        column("column1", dtype=int),
        # column2 is at most 0.24 and never NaN, so it always violates `gt=0.25`
        column("column2", dtype=float, elements=floats(max_value=0.24, allow_nan=False)),
        column("column3", dtype=str),
    ],
    index=range_indexes(min_size=1),
)

@given(df=invalid_df_strategy)
def test_df_model_validation_fails(df: pd.DataFrame) -> None:
    with pytest.raises(pandera.errors.SchemaError):
        # ensure the validation always fails
        DFModel.validate(df)

In [None]:
test_df_model_validation_fails()

**Exercise:** Write one or more property-based tests for the `pandas.DataFrame.drop_duplicates()` method. You probably do not need a Pandera model for this. Let's see if we can find a bug in pandas 😈!

## Hypothesis Data Validation

Pandera also enables you to perform statistical hypothesis tests on your data (not to be confused with the Hypothesis library used above).
We will not cover this topic here; please read more in the [Pandera documentation](https://pandera.readthedocs.io/en/stable/hypothesis.html).

Also note that other tools can be used for this purpose, such as [Great Expectations](https://greatexpectations.io/), [Frictionless Data](https://frictionlessdata.io/), or [pydeequ](https://pypi.org/project/pydeequ/).