<a href="https://colab.research.google.com/github/dalexa10/FINDER_Summer_School_2023/blob/main/3_Program_analysis_methods/Hypothesis_testing_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hypothesis Tutorial for FinDeR Summer School

by Mark Fuge, 2023

This notebook provides some initial examples of how to use the [Hypothesis Library](https://hypothesis.readthedocs.io/) to test functions, from simple fuzz testing to property-based testing. We will start with simple examples and then build up in complexity as we go.


First, let's install the library:

In [None]:
#!pip install hypothesis
# Do the below instead if you want to test the optional auto-writing features
!pip install hypothesis[cli]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting hypothesis[cli]
  Downloading hypothesis-6.76.0-py3-none-any.whl (414 kB)
Collecting black>=19.10b0 (from hypothesis[cli])
  Downloading black-23.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting mypy-extensions>=0.4.3 (from black>=19.10b0->hypothesis[cli])
  Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)
Collecting pathspec>=0.9.0 (from black>=19.10b0->hypothesis[cli])
  Downloading pathspec-0.11.1-py3-none-any.whl (29 kB)
Installing collected packages: pathspec, mypy-extensions, hypothesis, black
Successfully installed black-23.3.0 hypothesis-6.76.0 mypy-extensions-1.0.0 pathspec-0.11.1


In [None]:
from operator import add
add(1,2) == 3
add(2,2) == 4
add(3,2) == 5

True

How would we check whether the addition is happening correctly? We could write a simple test that asserts the result matches a reference solution:

In [None]:
def test_addition(a,b):
    assert add(a,b) == a+b

What kind of conditions would you run the above test against to check whether add was working properly?

Here is an example where there is a pretty bad and obvious bug in the code:

In [None]:
def bad_add(a,b):
    if (b>500):
        print(a,b)
        return -1
    else:
        print(a,b)
        return add(a,b)

Let's write this down the way we would in a typical hand-written test (see `tests_normal.py`):

In [None]:
!pytest tests_normal.py

platform linux -- Python 3.10.11, pytest-7.2.2, pluggy-1.0.0
rootdir: /content
plugins: hypothesis-6.76.0, anyio-3.6.2
collected 2 items

tests_normal.py ..                                                       [100%]
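
The contents of `tests_normal.py` aren't shown in this notebook. As a rough sketch (the test names and hand-picked values below are assumptions, not the actual file), a conventional test file might look something like this. Both tests pass, because none of the hand-picked inputs happen to use `b > 500`:

```python
from operator import add


def bad_add(a, b):
    # Same deliberate bug as above (prints omitted for brevity)
    if b > 500:
        return -1
    return add(a, b)


# Hypothetical hand-written tests: a few inputs the author happened to think of
def test_add_by_hand():
    assert add(1, 2) == 3
    assert add(-4, 4) == 0


def test_bad_add_by_hand():
    assert bad_add(1, 2) == 3
    assert bad_add(100, 200) == 300  # passes: the bug only shows up for b > 500


test_add_by_hand()
test_bad_add_by_hand()
```

This is the weakness of hand-picked cases: the tests are only as good as the inputs we thought to try.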



# Using Property Based Testing instead of manually writing the tests down

Here, we can use a library like Hypothesis to generate possible test cases.

In [None]:
from hypothesis import given
from hypothesis.strategies import integers

In [None]:
from hypothesis.strategies import floats
import numpy as np
from hypothesis.extra.numpy import arrays
# It can sample integers
print(integers().example())

# It can sample integers -- with some basic restrictions:
print(integers(10, 20).example())

# It can sample integers -- even with fairly complex restrictions
# (e.g., Even numbers)
print(integers().filter(lambda x: x % 2 == 0).example())

# Or even NumPy objects
print(arrays(dtype=np.float64,shape=(10,1)).example())
# You can put some limits on these
print(arrays(dtype=np.float64,shape=(10,1),
             elements=floats(1,200)).example())


21821
11
-78
[[-9.99990000e-001]
 [ 1.17549435e-038]
 [-3.33333333e-001]
 [ 4.07065917e+016]
 [-1.76754284e+258]
 [             nan]
 [-2.22507386e-309]
 [             nan]
 [ 2.00001000e+000]
 [             inf]]
[[1.29619889]
 [1.29619889]
 [1.29619889]
 [1.29619889]
 [1.29619889]
 [1.29619889]
 [1.29619889]
 [1.29619889]
 [1.29619889]
 [1.29619889]]
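
A side note on `.filter`: it works by discarding generated values that fail the predicate, which can get slow when most values are rejected. Where possible, it is often nicer to construct valid values directly with `.map`. A small sketch (the `evens` strategy, the `checked` list, and the test name are just for illustration):

```python
from hypothesis import given
from hypothesis.strategies import integers

# Build even numbers directly instead of filtering out the odd ones
evens = integers().map(lambda x: 2 * x)

checked = []  # record what the test body received


@given(x=evens)
def test_evens_are_even(x):
    checked.append(x)
    assert x % 2 == 0


# Calling a @given-decorated function runs it on many generated inputs
test_evens_are_even()
print(f"checked {len(checked)} even numbers")
```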


In [None]:
import hypothesis.strategies as st
# List the other kinds of things it can generate:
print(dir(st))

['DataObject', 'DrawFn', 'SearchStrategy', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_internal', '_strategies', 'binary', 'booleans', 'builds', 'characters', 'complex_numbers', 'composite', 'data', 'dates', 'datetimes', 'decimals', 'deferred', 'dictionaries', 'emails', 'fixed_dictionaries', 'floats', 'fractions', 'from_regex', 'from_type', 'frozensets', 'functions', 'integers', 'ip_addresses', 'iterables', 'just', 'lists', 'none', 'nothing', 'one_of', 'permutations', 'random_module', 'randoms', 'recursive', 'register_type_strategy', 'runner', 'sampled_from', 'sets', 'shared', 'slices', 'text', 'timedeltas', 'times', 'timezone_keys', 'timezones', 'tuples', 'uuids']
$xL!8f@Z1.RealesTAtE


Now, instead of manually specifying the test cases, you can just specify a strategy for each input, and Hypothesis will test the function on values drawn at random from it:

In [None]:
@given(a=integers(), b=integers())
def test_addition(a,b):
    assert bad_add(a,b) == a+b
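
If you'd rather stay inside the notebook, a `@given`-decorated function can also be called directly with no arguments; Hypothesis then generates the inputs and raises the error for the failing example it ends up with. A minimal sketch, with `bad_add` repeated (minus the prints) so the cell is self-contained. The name `check_bad_add_matches_plus` and the `found_bug` flag are just for illustration:

```python
from operator import add

from hypothesis import given
from hypothesis.strategies import integers


def bad_add(a, b):
    # Same deliberate bug as above (prints omitted for brevity)
    if b > 500:
        return -1
    return add(a, b)


@given(a=integers(), b=integers())
def check_bad_add_matches_plus(a, b):
    assert bad_add(a, b) == a + b


try:
    check_bad_add_matches_plus()  # no arguments: Hypothesis supplies them
except AssertionError:
    found_bug = True
    print("Hypothesis found an input where bad_add(a, b) != a + b")
else:
    found_bug = False
    print("No failing input found this time")
```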


OK, now let's run the tests with pytest to get the failing examples (see `tests_hypo.py`), printing the inputs it tries so that we can see what it is doing:

In [None]:
!pytest tests_hypo.py

platform linux -- Python 3.10.11, pytest-7.2.2, pluggy-1.0.0
rootdir: /content
plugins: hypothesis-6.76.0, anyio-3.6.2
collected 2 items

tests_hypo.py ..                                                         [100%]



Once Hypothesis reports a failing input, we can attempt to fix the bug. To make sure the same case keeps being checked in the future, we can tell Hypothesis to always try it explicitly with `@example`:

In [None]:
from hypothesis import example
@given(a=integers(), b=integers())
@example(a=0, b=501)  # b=501 is just past the b>500 threshold in bad_add
def test_bad_addition(a,b):
    assert bad_add(a,b) == a+b
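
To see what `@example` does, we can record the inputs a test receives. In this sketch (the `seen` list and the test name are just for illustration, and `b=501` is chosen because `bad_add` above misbehaves for `b > 500`), the explicit example is tried first, before any generated inputs:

```python
from hypothesis import example, given, settings
from hypothesis.strategies import integers

seen = []  # every (a, b) pair the test body receives, in order


@given(a=integers(), b=integers())
@example(a=0, b=501)
@settings(max_examples=5, database=None)  # keep the run short and self-contained
def test_records_inputs(a, b):
    seen.append((a, b))


test_records_inputs()
print("first input tried:", seen[0])
```

Pinning a previously failing input this way turns it into a regression test that no longer depends on random generation.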

In [None]:
!pytest tests_hypo.py

platform linux -- Python 3.10.11, pytest-7.2.2, pluggy-1.0.0
rootdir: /content
plugins: hypothesis-6.76.0, anyio-3.6.2
collected 2 items

tests_hypo.py .F                                                         [100%]

______________________________ test_bad_addition _______________________________

    @given(a=floats(min_value=-500,max_value=500),
>          b=integers(min_value=-500,max_value=500))

tests_hypo.py:40: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

a = 0.0, b = 51

    @given(a=floats(min_value=-500,max_value=500),
           b=integers(min_value=-500,max_value=500))
    def

How many of these tests should we do? We can specify this via the `settings` decorator. Too few and we may not catch the error. Too many and we may take forever!

In [None]:
from hypothesis import settings
@given(a=integers(), b=integers())
@settings(max_examples=10)
def test_bad_addition(a,b):
    assert bad_add(a,b) == a+b

OK, fine, but what if my function will never really deal with extreme values like the huge integers we see here? Can't I make this more helpful by restricting it to just the range of values that I care about? Absolutely! We can just change the testing strategy:

In [None]:
@given(a=integers(min_value=-500,max_value=500),
       b=integers(min_value=-500,max_value=500))
def test_bad_addition(a,b):
    assert bad_add(a,b) == a+b
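
One thing to keep in mind: narrowing the strategy also narrows what Hypothesis can find. With the `bad_add` defined earlier, the bug only triggers for `b > 500`, so a test restricted to `[-500, 500]` passes even though the bug is still there. A small sketch (with `bad_add` repeated, minus the prints, so the cell runs on its own; the test name and `passed` flag are just for illustration):

```python
from operator import add

from hypothesis import given
from hypothesis.strategies import integers


def bad_add(a, b):
    # Same deliberate bug as above (prints omitted for brevity)
    if b > 500:
        return -1
    return add(a, b)


@given(a=integers(min_value=-500, max_value=500),
       b=integers(min_value=-500, max_value=500))
def test_bad_addition_restricted(a, b):
    assert bad_add(a, b) == a + b


test_bad_addition_restricted()  # no error: b never exceeds 500
passed = True
print("restricted test passed; bad_add(0, 501) still returns", bad_add(0, 501))
```

So the bounds should reflect the inputs the function can really receive, not just the ones that are convenient to test.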

We can also change the type of testing strategy that we use:

In [None]:
from hypothesis.strategies import floats
@given(a=floats(min_value=-500,max_value=500),
       b=integers(min_value=-500,max_value=500))
def test_bad_addition(a,b):
    assert bad_add(a,b) == a+b

# A (slightly) more complex example

We'll test a couple of functions to see if we can identify cases where they are not their own inverse (i.e., where $f(f(x)) \neq x$).

In [None]:
# Check whether a function is its own inverse
def check_inverse(f,x):
    assert f(f(x)) == x
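
The file `test_inverse.py` isn't reproduced in this notebook. As a hedged sketch of the kind of tests it might contain (the two functions and the names below are assumptions), negation is its own inverse for every non-NaN float, while the reciprocal `1/x` is not: it raises at `x = 0.0`, and other floats break it too (for example very small ones, whose reciprocal overflows to infinity):

```python
from hypothesis import given
from hypothesis.strategies import floats


def check_inverse(f, x):
    assert f(f(x)) == x


@given(x=floats(allow_nan=False))
def test_negation(x):
    check_inverse(lambda a: -a, x)


@given(x=floats(allow_nan=False))
def reciprocal_is_self_inverse(x):
    check_inverse(lambda a: 1 / a, x)


test_negation()  # passes: -(-x) == x for every non-NaN float

try:
    reciprocal_is_self_inverse()
except Exception as err:  # ZeroDivisionError, AssertionError, or a group of both
    reciprocal_failed = True
    print("reciprocal is not its own inverse:", type(err).__name__)
else:
    reciprocal_failed = False
```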

In [None]:
!pytest test_inverse.py

platform linux -- Python 3.10.11, pytest-7.2.2, pluggy-1.0.0
rootdir: /content
plugins: hypothesis-6.76.0, anyio-3.6.2
collected 3 items

test_inverse.py .FF                                                      [100%]

___________________________________ test_1x ____________________________________

    @given(x = floats(allow_nan=False))
>   def test_1x(x):

test_inverse.py:13: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_inverse.py:15: in test_1x
    check_inverse(f,x)
test_inverse.py:5: in check_inverse
    assert f(f(x)) == x
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

a = 0.0

>   f = lambda a:

# A More Design/Engineering example: Detecting Logical Paradoxes

This example is taken directly from the [Hypothesis documentation](https://hypothesis.readthedocs.io/en/latest/examples.html#condorcet-s-paradox), wherein we can automatically discover a subtle paradox in social choice theory/voting called the [Condorcet Paradox](https://en.wikipedia.org/wiki/Condorcet_paradox). In this, we can find a case that breaks the formal property that a given voting system's preferences should be transitive. (See `condorcet_paradox.py`)
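
To make the paradox concrete, here is a standard-library-only hand check of one such election: the same five ballots that appear as the falsifying example in the pytest output in this section. The helper names `prefers` and `majority_prefers` are just for illustration and are not taken from `condorcet_paradox.py`:

```python
from itertools import permutations

# Each ballot ranks the candidates from most to least preferred
election = [
    ["A", "C", "B"],
    ["A", "C", "B"],
    ["B", "A", "C"],
    ["B", "A", "C"],
    ["C", "B", "A"],
]
candidates = ["A", "B", "C"]


def prefers(ballot, x, y):
    """True if this ballot ranks x above y."""
    return ballot.index(x) < ballot.index(y)


def majority_prefers(election, x, y):
    """True if strictly more than half of the ballots rank x above y."""
    votes_for_x = sum(prefers(ballot, x, y) for ballot in election)
    return 2 * votes_for_x > len(election)


majority = {
    (x, y)
    for x, y in permutations(candidates, 2)
    if majority_prefers(election, x, y)
}
print(sorted(majority))  # [('A', 'C'), ('B', 'A'), ('C', 'B')]

# A cycle x > y > z > x means the majority preference is not transitive
has_cycle = any(
    (x, y) in majority and (y, z) in majority and (z, x) in majority
    for x, y, z in permutations(candidates, 3)
)
print("cycle found:", has_cycle)
```

The majority prefers B to A and A to C, yet also prefers C to B, so no candidate beats both of the others.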

In [None]:
!pytest condorcet_paradox.py

platform linux -- Python 3.10.11, pytest-7.2.2, pluggy-1.0.0
rootdir: /content
plugins: hypothesis-6.76.0, anyio-3.6.2
collected 1 item

condorcet_paradox.py F                                                   [100%]

________________________ test_elections_are_transitive _________________________

    @given(lists(permutations(["A", "B", "C"]), min_size=4))
>   def test_elections_are_transitive(election):

condorcet_paradox.py:10: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

election = [['A', 'C', 'B'], ['A', 'C', 'B'], ['B', 'A', 'C'], ['B', 'A', 'C'], ['C', 'B', 'A']]

    @given(lists

# Bonus: Writing Property Test code for us

For certain types of functions, if we annotate them with appropriate types, we can have Hypothesis construct some of the testing code for us. This becomes more powerful for more complex data structures, and is aided by the use of typing, as we will see below.

(Note, this part requires the `!pip install hypothesis[cli]` at the beginning of the notebook.)

Let's say we have a simple subtraction function:

In [None]:
def subtract(a,b):
    return a-b

We can ask Hypothesis to generate some tests for us for this function:

In [None]:
!hypothesis write mark_math.subtract

# This test code was written by the `hypothesis.extra.ghostwriter` module
# and is provided under the Creative Commons Zero public domain dedication.

import mark_math
from hypothesis import given, strategies as st

# TODO: replace st.nothing() with appropriate strategies


@given(a=st.nothing(), b=st.nothing())
def test_fuzz_subtract(a, b):
    mark_math.subtract

This is pretty disappointing, since it basically does nothing. Not very helpful! However, we didn't really give the library much information about the function. Here, we can use Python's typing system to provide some hints:

In [None]:
def subtract_wtypes(a: int, b: int):
    return a-b
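
To see what extra information the annotations carry, we can inspect them with the standard library's `typing.get_type_hints`. This is only an illustration of what a tool can read off the function; how the ghostwriter uses this information internally isn't shown here:

```python
from typing import get_type_hints


def subtract(a, b):
    return a - b


def subtract_wtypes(a: int, b: int):
    return a - b


# The untyped version exposes nothing to work with ...
print(get_type_hints(subtract))         # {}
# ... while the typed version says both arguments are ints
print(get_type_hints(subtract_wtypes))  # {'a': <class 'int'>, 'b': <class 'int'>}
```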

In [None]:
!hypothesis write mark_math.subtract_wtypes > test_subtract.py

In [None]:
!pytest test_subtract.py

platform linux -- Python 3.10.11, pytest-7.2.2, pluggy-1.0.0
rootdir: /content
plugins: hypothesis-6.76.0, anyio-3.6.2
collected 3 items

test_subtract.py FFF                                                     [100%]

______________ test_associative_binary_operation_subtract_wtypes _______________

    @given(
>       a=subtract_wtypes_operands, b=subtract_wtypes_operands, c=subtract_wtypes_operands
    )

test_subtract.py:11: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

a = 0.0, b = 0.0, c = 1.0

    @given(
        a=subtract_wtypes_operands, b=subtract_wtypes_operands, c=subtract_wtypes_operands
    )
    def test_associative_binary_operation_subtract_wtypes(a: 

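OK! Now we see that because we specified that the inputs were typed as integers, Hypothesis can guess that `subtract_wtypes` is a binary operation and write tests checking whether it (1) is commutative, (2) is associative, and (3) has an identity element. All three tests fail (`FFF` above), and rightly so: subtraction is neither commutative nor associative, and it has no two-sided identity element.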
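
The three generated tests above fail (`FFF`) because plain subtraction doesn't have those properties, so in practice we would replace them with properties that should hold. A sketch of two such properties for integers (the test names are just for illustration): swapping the arguments flips the sign, and adding `b` back undoes the subtraction:

```python
from hypothesis import given
from hypothesis.strategies import integers


def subtract_wtypes(a: int, b: int):
    return a - b


@given(a=integers(), b=integers())
def test_subtract_is_anticommutative(a, b):
    # a - b == -(b - a)
    assert subtract_wtypes(a, b) == -subtract_wtypes(b, a)


@given(a=integers(), b=integers())
def test_adding_back_undoes_subtract(a, b):
    # (a - b) + b == a
    assert subtract_wtypes(a, b) + b == a


test_subtract_is_anticommutative()
test_adding_back_undoes_subtract()
properties_hold = True
print("both properties held on all generated inputs")
```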

Pretty cool. While this isn't terribly useful in this case, for more complex functions this can quickly give you a starting set of tests for the properties you expect to hold.