
## Testing

![](https://github.com/khuyentran1401/Efficient_Python_tricks_and_tools_for_data_scientists/blob/master/img/test.png?raw=1)

### Efficiently Resume Work After Breaks with Failing Tests

Do you forget what feature to implement when taking a break from work?

To keep your train of thought, write a unit test that describes the desired behavior of the feature and make it fail intentionally.

This will give you a clear idea of what to work on when returning to the project, allowing you to get back on track faster.

```python
def calculate_average(nums: list):
    return sum(nums)/len(nums)
    # TODO: code to handle an empty list

def test_calculate_average_two_nums():
    # Will work
    nums = [2, 3]
    assert calculate_average(nums) == 2.5

def test_calculate_average_empty_list():
    # Will fail intentionally
    nums = []
    assert calculate_average(nums) == 0
```
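
When you come back to the project, the failing test points straight at the missing branch. Here is a minimal sketch of one way to finish the function, assuming (as the intentionally failing test implies) that the average of an empty list should be 0:

```python
def calculate_average(nums: list) -> float:
    # Assumption: an empty list averages to 0, which is what the failing test expects
    if not nums:
        return 0
    return sum(nums) / len(nums)


def test_calculate_average_two_nums():
    assert calculate_average([2, 3]) == 2.5


def test_calculate_average_empty_list():
    # Passes once the empty-list branch exists
    assert calculate_average([]) == 0
```

Returning 0 is only one possible design. Raising a `ValueError` would be equally reasonable, in which case the reminder test should assert on the exception instead.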

### Choose a Descriptive Name Over a Short One When Naming Your Function

Using a short and unclear name for a testing function may lead to confusion and misunderstandings. To make your tests more readable, use a descriptive name instead, even if it results in a longer name.

Instead of this:

```python
def contain_word(word: str, text: str):
    return word in text


def test_contain_word_1():
    assert contain_word(word="duck", text="This is a duck")


def test_contain_word_2():
    assert contain_word(word="duck", text="This is my coworker, Mr. Duck")
```

Write this:

```python
def contain_word(word: str, text: str):
    return word in text


def test_contain_word_exact():
    assert contain_word(word="duck", text="This is a duck")


def test_contain_word_different_case():
    assert contain_word(word="duck", text="This is my coworker, Mr. Duck")
```

### pytest benchmark: A Pytest Fixture to Benchmark Your Code

In [None]:
!pip install pytest-benchmark

Collecting pytest-benchmark
  Downloading pytest_benchmark-4.0.0-py3-none-any.whl (43 kB)
Installing collected packages: pytest-benchmark
Successfully installed pytest-benchmark-4.0.0


If you want to benchmark your code while testing with pytest, try pytest-benchmark.

To use pytest-benchmark, add the `benchmark` fixture as an argument to the test function that you want to benchmark, then call `benchmark` with the function to time.

In [None]:
%%writefile pytest_benchmark_example.py
def list_comprehension(len_list=5):
    return [i for i in range(len_list)]


def test_concat(benchmark):
    res = benchmark(list_comprehension)
    assert res == [0, 1, 2, 3, 4]

Writing pytest_benchmark_example.py


On your terminal, type:
```bash
$ pytest pytest_benchmark_example.py
```
Now you should see the statistics of the time it takes to execute the test functions on your terminal:

In [None]:
!pytest pytest_benchmark_example.py

platform linux -- Python 3.10.6, pytest-7.2.2, pluggy-1.2.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /content
plugins: benchmark-4.0.0, anyio-3.7.1
collecting ... collected 1 item

pytest_benchmark_example.py .                                            [100%]


----------------------------------------------------------- benchmark: 1 tests ----------------------------------------------------------
Name (time in ns)          Min             Max        Mean       StdDev      Median       IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------
test_concat           658.0000  7,641,841.0000

[Link to pytest-benchmark](https://github.com/ionelmc/pytest-benchmark).
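
To get a feel for what the Min, Max, Mean, and Median columns summarize, here is a rough standard-library sketch that times the same function over several rounds. It is not how pytest-benchmark is implemented; `summarize_timings` and its parameters are hypothetical names used only for illustration.

```python
import statistics
import timeit


def list_comprehension(len_list=5):
    return [i for i in range(len_list)]


def summarize_timings(func, rounds=5, iterations=1000):
    """Time `func` and return per-call statistics in seconds."""
    # timeit.repeat returns one total per round, each covering `iterations` calls
    totals = timeit.repeat(func, repeat=rounds, number=iterations)
    per_call = [total / iterations for total in totals]
    return {
        "min": min(per_call),
        "max": max(per_call),
        "mean": statistics.mean(per_call),
        "median": statistics.median(per_call),
        "rounds": rounds,
    }


stats = summarize_timings(list_comprehension)
print(stats)
```

The exact numbers vary from run to run and from machine to machine, which is why a benchmark reports a spread of statistics rather than a single value.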

### pytest.mark.parametrize: Test Your Functions with Multiple Inputs

In [None]:
!pip install pytest



If you want to test your function with different examples, use the `pytest.mark.parametrize` decorator.

To use `pytest.mark.parametrize`, add `@pytest.mark.parametrize` to the test function that you want to experiment with.

In [None]:
%%writefile pytest_parametrize.py
import pytest

def text_contain_word(word: str, text: str):
    '''Find whether the text contains a particular word'''

    return word in text

test = [
    ('There is a duck in this text',True),
    ('There is nothing here', False)
    ]

@pytest.mark.parametrize('sample, expected', test)
def test_text_contain_word(sample, expected):

    word = 'duck'

    assert text_contain_word(word, sample) == expected

Writing pytest_parametrize.py


In the code above, I expect the first sentence to contain the word “duck” and expect the second sentence not to contain that word. Let's see if my expectations are correct by running:
```bash
$ pytest pytest_parametrize.py
```

In [None]:
!pytest -v pytest_parametrize.py

platform linux -- Python 3.10.6, pytest-7.2.2, pluggy-1.2.0 -- /usr/bin/python3
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /content
plugins: benchmark-4.0.0, anyio-3.7.1
collecting ... collected 2 items

pytest_parametrize.py::test_text_contain_word[There is a duck in this text-True] PASSED [ 50%]
pytest_parametrize.py::test_text_contain_word[There is nothing here-False] PASSED [100%]



Sweet! 2 tests passed when running pytest.

[Link to my article about pytest](https://towardsdatascience.com/pytest-for-data-scientists-2990319e55e6?sk=2d3a81903b154db0c7ca832b9f29fee8).



### pytest parametrize twice: Test All Possible Combinations of Two Sets of Parameters

In [None]:
!pip install pytest



If you want to test the combinations of two sets of parameters, writing all possible combinations can be time-consuming and is difficult to read.

```python
import pytest

def average(n1, n2):
    return (n1 + n2) / 2

def perc_difference(n1, n2):
    return (n2 - n1)/n1 * 100

# Test the combinations of operations and inputs
@pytest.mark.parametrize("operation, n1, n2", [(average, 1, 2), (average, 2, 3), (perc_difference, 1, 2), (perc_difference, 2, 3)])
def test_is_float(operation, n1, n2):
    assert isinstance(operation(n1, n2), float)
```

You can save time by using `pytest.mark.parametrize` twice instead.

In [None]:
%%writefile pytest_combination.py
import pytest

def average(n1, n2):
    return (n1 + n2) / 2

def perc_difference(n1, n2):
    return (n2 - n1)/n1 * 100

# Test the combinations of operations and inputs
@pytest.mark.parametrize("operation", [average, perc_difference])
@pytest.mark.parametrize("n1, n2", [(1, 2), (2, 3)])
def test_is_float(operation, n1, n2):
    assert isinstance(operation(n1, n2), float)

Writing pytest_combination.py


On your terminal, run:
```bash
$ pytest -v pytest_combination.py
```

In [None]:
!pytest -v pytest_combination.py

platform linux -- Python 3.10.6, pytest-7.2.2, pluggy-1.2.0 -- /usr/bin/python3
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /content
plugins: benchmark-4.0.0, anyio-3.7.1
collecting ... collected 4 items

pytest_combination.py::test_is_float[1-2-average] PASSED                 [ 25%]
pytest_combination.py::test_is_float[1-2-perc_difference] PASSED         [ 50%]
pytest_combination.py::test_is_float[2-3-average] PASSED                 [ 75%]
pytest_combination.py::test_is_float[2-3-perc_difference] PASSED         [100%]



From the output above, we can see that all possible combinations of the given operations and inputs are tested.
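
Stacking the two decorators amounts to taking the Cartesian product of the two parameter sets. The following standard-library sketch mirrors the four cases listed in the output above using `itertools.product`. It illustrates the idea only and says nothing about pytest's internals.

```python
from itertools import product


def average(n1, n2):
    return (n1 + n2) / 2


def perc_difference(n1, n2):
    return (n2 - n1) / n1 * 100


operations = [average, perc_difference]
inputs = [(1, 2), (2, 3)]

# Pair every input tuple with every operation: 2 inputs x 2 operations = 4 cases
combinations = [
    (operation, n1, n2) for (n1, n2), operation in product(inputs, operations)
]

for operation, n1, n2 in combinations:
    result = operation(n1, n2)
    assert isinstance(result, float)
    print(f"{operation.__name__}({n1}, {n2}) = {result}")
```

Adding a third operation or a third input pair grows the number of cases multiplicatively, which is exactly why listing the combinations by hand stops scaling.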

### Assign IDs to Test Cases

When using pytest parametrize, it can be difficult to understand the role of each test case.

In [None]:
%%writefile pytest_without_ids.py
from pytest import mark


def average(n1, n2):
    return (n1 + n2) / 2

@mark.parametrize(
    "n1, n2",
    [(-1, -2), (2, 3), (0, 0)],
)
def test_is_float(n1, n2):
    assert isinstance(average(n1, n2), float)

Writing pytest_without_ids.py


```bash
$ pytest -v pytest_without_ids.py
```

In [None]:
!pytest -v pytest_without_ids.py

platform linux -- Python 3.10.6, pytest-7.2.2, pluggy-1.2.0 -- /usr/bin/python3
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /content
plugins: benchmark-4.0.0, anyio-3.7.1
collecting ... collected 3 items

pytest_without_ids.py::test_is_float[-1--2] PASSED                       [ 33%]
pytest_without_ids.py::test_is_float[2-3] PASSED                         [ 66%]
pytest_without_ids.py::test_is_float[0-0] PASSED                         [100%]



You can add `ids` to pytest parametrize to assign a name to each test case.

In [None]:
%%writefile pytest_ids.py
from pytest import mark

def average(n1, n2):
    return (n1 + n2) / 2

@mark.parametrize(
    "n1, n2",
    [(-1, -2), (2, 3), (0, 0)],
    ids=["neg and neg", "pos and pos", "zero and zero"],
)
def test_is_float(n1, n2):
    assert isinstance(average(n1, n2), float)

Writing pytest_ids.py


```bash
$ pytest -v pytest_ids.py
```

In [None]:
!pytest -v pytest_ids.py

platform linux -- Python 3.10.6, pytest-7.2.2, pluggy-1.2.0 -- /usr/bin/python3
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /content
plugins: benchmark-4.0.0, anyio-3.7.1
collecting ... collected 3 items

pytest_ids.py::test_is_float[neg and neg] PASSED                         [ 33%]
pytest_ids.py::test_is_float[pos and pos] PASSED                         [ 66%]
pytest_ids.py::test_is_float[zero and zero] PASSED                       [100%]



We can see that instead of `[-1--2]`, the first test case is shown as `neg and neg`. This makes it easier for others to understand the roles of your test cases.  

If you want to specify the test IDs together with the actual data, instead of listing them separately, use `pytest.param`.

In [None]:
%%writefile pytest_param.py
import pytest


def average(n1, n2):
    return (n1 + n2) / 2


examples = [
    pytest.param(-1, -2, id="neg-neg"),
    pytest.param(2, 3, id="pos-pos"),
    pytest.param(0, 0, id="0-0"),
]


@pytest.mark.parametrize("n1, n2", examples)
def test_is_float(n1, n2):
    assert isinstance(average(n1, n2), float)


Writing pytest_param.py


```bash
$ pytest -v pytest_param.py
```

In [None]:
!pytest -v pytest_param.py

platform linux -- Python 3.10.6, pytest-7.2.2, pluggy-1.2.0 -- /usr/bin/python3
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /content
plugins: benchmark-4.0.0, anyio-3.7.1
collecting ... collected 3 items

pytest_param.py::test_is_float[neg-neg] PASSED                           [ 33%]
pytest_param.py::test_is_float[pos-pos] PASSED                           [ 66%]
pytest_param.py::test_is_float[0-0] PASSED                               [100%]



### Pytest Fixtures: Use The Same Data for Different Tests

In [None]:
!pip install pytest textblob



If you want to use the same data to test different functions, use pytest fixtures.

To use pytest fixtures, add the decorator `@pytest.fixture` to the function that creates the data you want to reuse.

In [None]:
%%writefile pytest_fixture.py
import pytest
from textblob import TextBlob

def extract_sentiment(text: str):
    """Extract sentiment using textblob. Polarity is within range [-1, 1]"""

    text = TextBlob(text)
    return text.sentiment.polarity

@pytest.fixture
def example_data():
    return 'Today I found a duck and I am happy'

def test_extract_sentiment(example_data):
    sentiment = extract_sentiment(example_data)
    assert sentiment > 0

Writing pytest_fixture.py


On your terminal, type:
```bash
$ pytest pytest_fixture.py
```
Output:

In [None]:
!pytest pytest_fixture.py

platform linux -- Python 3.10.6, pytest-7.2.2, pluggy-1.2.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /content
plugins: benchmark-4.0.0, anyio-3.7.1
collecting ...

### Execute a Fixture Only Once per Session

By default, a pytest fixture is executed again for every test that uses it.

```python
# example.py
import pytest

@pytest.fixture
def my_data():
    print("Reading data...")
    return 1

def test_division(my_data):
    print("Test division...")
    assert my_data / 2 == 0.5

def test_modulus(my_data):
    print("Test modulus...")
    assert my_data % 2 == 1
```
From the output, we can see that the fixture `my_data` is executed twice.

```bash
$ pytest example.py -s
Reading data...
Test division...
Reading data...
Test modulus...
```

If a fixture is expensive to execute, you can make it run only once per session by using `scope="session"`.

In [None]:
%%writefile pytest_scope.py
import pytest

@pytest.fixture(scope="session")
def my_data():
    print("Reading data...")
    return 1

def test_division(my_data):
    print("Test division...")
    assert my_data / 2 == 0.5

def test_modulus(my_data):
    print("Test modulus...")
    assert my_data % 2 == 1

From the output, we can see that the fixture `my_data` is executed only once.
```bash
$ pytest pytest_scope.py -s
Reading data...
Test division...
Test modulus...
```

### Pytest skipif: Skip a Test When a Condition is Not Met

If you want to skip a test when a condition is not met, use pytest's `skipif` marker. For example, in the code below, I use `skipif` to skip a test if the Python version is lower than 3.9.

In [None]:
%%writefile pytest_skip.py
import sys
import pytest

def add_two(num: int):
    return num + 2

@pytest.mark.skipif(sys.version_info < (3, 9), reason="Requires Python 3.9 or higher")
def test_add_two():
    assert add_two(3) == 5

On your terminal, type:
```bash
$ pytest pytest_skip.py -v
```

Output:

In [None]:
!pytest pytest_skip.py -v

### Pytest xfail: Mark a Test as Expected to Fail

If you expect a test to fail, use pytest's `xfail` marker. This prevents pytest from reporting the test as failed when it raises an exception.

To be more specific about what exception you expect to see, use the `raises` argument.

In [None]:
%%writefile pytest_mark_xfail.py
import pytest

def divide_two_nums(num1, num2):
    return num1 / num2

@pytest.mark.xfail(raises=ZeroDivisionError)
def test_divide_by_zero():
    res = divide_two_nums(2, 0)

On your terminal, type:

```bash
$ pytest pytest_mark_xfail.py
```

We can see that no test failed.

In [None]:
!pytest pytest_mark_xfail.py

### Verify Logging Error with pytest

To ensure that your application logs an error under a specific condition, use the built-in fixture called `caplog` in pytest.

This fixture allows you to capture log messages generated during the execution of your test.

In [None]:
%%writefile test_logging.py
from logging import getLogger

logger = getLogger(__name__)

def divide(num1: float, num2: float) -> float:
    if num2 == 0:
        logger.error(f"Can't divide {num1} by 0")
    else:
        logger.info(f"Divide {num1} by {num2}")
        return num1 / num2

def test_divide_by_0(caplog):
    divide(1, 0)
    assert "Can't divide 1 by 0" in caplog.text

```bash
$ pytest test_logging.py
```

In [None]:
!pytest test_logging.py
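
For suites written with `unittest` rather than pytest, the standard library offers a comparable tool, `assertLogs`. The sketch below swaps it in for `caplog`. The logger name `divide_example` is a hypothetical choice made for this illustration, and `divide` returns `None` explicitly in the error branch.

```python
import logging
import unittest

logger = logging.getLogger("divide_example")  # hypothetical logger name


def divide(num1: float, num2: float):
    if num2 == 0:
        logger.error(f"Can't divide {num1} by 0")
        return None
    logger.info(f"Divide {num1} by {num2}")
    return num1 / num2


class TestDivideLogging(unittest.TestCase):
    def test_divide_by_0_logs_error(self):
        # assertLogs fails the test if no ERROR record is emitted inside the block
        with self.assertLogs(logger, level="ERROR") as captured:
            divide(1, 0)
        self.assertIn("Can't divide 1 by 0", captured.output[0])
```

Pinning the level to `ERROR` makes the check stricter than a plain substring search, because the same message logged at `INFO` would no longer satisfy it.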

### Pytest repeat


In [None]:
!pip install pytest-repeat

It is good practice to test your functions to make sure they work as expected, but sometimes you need to run a test 100 times before you hit the rare case in which it fails. That is when pytest-repeat comes in handy.

To use pytest-repeat, add the decorator `@pytest.mark.repeat(N)` to the test function you want to repeat `N` times.

In [None]:
%%writefile pytest_repeat_example.py
import pytest
import random

def generate_numbers():
    return random.randint(1, 100)

@pytest.mark.repeat(100)
def test_generate_numbers():
    number = generate_numbers()
    # randint is inclusive on both ends, so 1 and 100 are valid results
    assert 1 <= number <= 100

On your terminal, type:
```bash
$ pytest pytest_repeat_example.py
```
We can see that 100 experiments are executed and passed:

In [None]:
!pytest pytest_repeat_example.py

[Link to pytest-repeat](https://github.com/pytest-dev/pytest-repeat)

### pytest-sugar: Show the Failures and Errors Instantly With a Progress Bar

In [None]:
!pip install pytest-sugar

It can be frustrating to wait for a lot of tests to run before knowing the status of the tests. If you want to see the failures and errors instantly with a progress bar, use pytest-sugar.

pytest-sugar is a plugin for pytest. To see how pytest-sugar works, assume we have several test files in the `pytest_sugar_example` directory.

In [None]:
%ls pytest_sugar_example

The code below shows what the output looks like when running pytest.

```bash
$ pytest pytest_sugar_example
```

In [None]:
!pytest pytest_sugar_example

[Link to pytest-sugar](https://github.com/Teemu/pytest-sugar).

### pytest-steps: Share Data Between Tests

Have you ever wanted to use the result of one test for another test? That is when pytest_steps comes in handy.

![](https://github.com/khuyentran1401/Efficient_Python_tricks_and_tools_for_data_scientists/blob/master/img/pytest_steps.png?raw=1)

In the code below, I use the result of `sum_test` as the input of `average_2_nums`. The argument `steps_data` allows me to share the data between 2 tests.

In [None]:
%%writefile test_steps.py
from pytest_steps import test_steps


def sum(n1, n2):
    return n1 + n2


def average_2_nums(sum):
    return sum / 2


def sum_test(steps_data):
    res = sum(1, 3)
    assert res == 4
    steps_data.res = res


def perc_difference_test(steps_data):
    avg = average_2_nums(steps_data.res)
    assert avg == 2


@test_steps(sum_test, perc_difference_test)
def test_calc_suite(test_step, steps_data):
    if test_step == 'sum_test':
        sum_test(steps_data)
    elif test_step == 'perc_difference_test':
        perc_difference_test(steps_data)

```bash
$ pytest test_steps.py
```

In [None]:
!pytest test_steps.py

[Link to pytest_steps](https://smarie.github.io/python-pytest-steps/).

### pytest-picked: Run the Tests Related to the Unstaged Files in Git

It can be time-consuming to run all tests in your project. Wouldn't it be nice if you could run only the tests related to the unstaged files in Git? That is when pytest-picked comes in handy.

In the code below, only tests in the file `test_picked.py` are executed because it is an unstaged file.

In [None]:
%%writefile test_picked.py
def plus_one(num: int):
    return num + 1


def test_plus_one():
    assert plus_one(2) == 3

```bash
$ git status
```

In [None]:
!git status

```bash
$ pytest --picked
```

In [None]:
!pytest --picked

[Link to pytest-picked](https://github.com/anapaulagomes/pytest-picked).

### Efficient Testing of Python Class with setUp Method

When testing a Python class, it can be repetitive and time-consuming to create multiple instances to test a large number of instance methods.

In [None]:
%%writefile get_dog.py
class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def walk(self):
        return f"{self.name} is walking"

    def bark(self):
        return f"{self.name} is barking"

In [None]:
%%writefile test_get_dog.py
import unittest
from get_dog import Dog

class TestDog(unittest.TestCase):
    def test_walk(self):
        dog = Dog("Max", 3)
        self.assertEqual(dog.walk(), "Max is walking")

    def test_bark(self):
        dog = Dog("Max", 3)
        self.assertEqual(dog.bark(), "Max is barking")

A better approach is to use the `setUp` method to instantiate a class object before running each test.

In [None]:
%%writefile test_get_dog.py
import unittest
from get_dog import Dog

class TestDog(unittest.TestCase):
    def setUp(self):
        self.dog = Dog("Max", 3)

    def test_walk(self):
        self.assertEqual(self.dog.walk(), "Max is walking")

    def test_bark(self):
        self.assertEqual(self.dog.bark(), "Max is barking")
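
Because `setUp` runs before every test method, each test receives a fresh instance, so a test that mutates the object cannot leak state into another test. The sketch below illustrates this with a `birthday` method, which is a hypothetical addition to `Dog` made only for this example.

```python
import unittest


class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def birthday(self):  # hypothetical method, added only for this illustration
        self.age += 1
        return self.age


class TestDogIsolation(unittest.TestCase):
    def setUp(self):
        # Runs before every test method, so each test starts from age 3
        self.dog = Dog("Max", 3)

    def test_birthday_increments_age(self):
        self.assertEqual(self.dog.birthday(), 4)

    def test_age_starts_at_three(self):
        # Passes regardless of test order because setUp rebuilt the Dog
        self.assertEqual(self.dog.age, 3)
```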

### FreezeGun: Freeze Dynamic Time in Unit Testing

In [None]:
!pip install freezegun

Unit tests require static input, but time is dynamic and constantly changing. With FreezeGun, you can freeze time to a specific point, ensuring accurate verification of the tested features.

In [None]:
%%writefile test_freezegun.py
from freezegun import freeze_time
import datetime

def get_day_of_week():
    return datetime.datetime.now().weekday()

@freeze_time("2023-06-13")
def test_get_day_of_week():
    assert get_day_of_week() == 1


```bash
$ pytest test_freezegun.py
```

In [None]:
!pytest test_freezegun.py

[Link to FreezeGun](https://github.com/spulec/freezegun).

### Simulate External Services in Testing with Mock Objects

Testing code that relies on external services, like a database, can be difficult since the behaviors of these services can change.

A mock object can control the behavior of a real object in a testing environment by simulating responses from external services.

The following code uses a mock object to test the `get_data` function's behavior when calling an API that may either succeed or fail.

```python
from unittest.mock import patch
import requests
from requests.exceptions import ConnectionError


def get_data():
    """Make an API call to Postgres"""
    try:
        response = requests.get("http://localhost:5432")
        return response.json()
    except ConnectionError:
        return None


def test_get_data_fails():
    """Test the get_data function when the API call fails"""
    # Mock the requests.get function
    with patch("requests.get") as mock_get:
        # Define what happens when the function is called
        mock_get.side_effect = ConnectionError
        assert get_data() is None


def test_get_data_succeeds():
    """Test the get_data function when the API call succeeds"""
    # Mock the requests.get function
    with patch("requests.get") as mock_get:
        # Define the return value of the function
        mock_get.return_value.json.return_value = {"data": "test"}
        assert get_data() == {"data": "test"}

```

[Link to mock](https://docs.python.org/3/library/unittest.mock.html).
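
A mock can also simulate a service that fails first and then recovers, and it records how many times it was called. The sketch below uses only `unittest.mock.Mock`. The helper `fetch_with_retry` is hypothetical and stands in for any code that retries a flaky call.

```python
from unittest.mock import Mock


def fetch_with_retry(fetch, retries=3):
    """Call `fetch` until it succeeds or `retries` attempts are used up."""
    for _ in range(retries):
        try:
            return fetch()
        except ConnectionError:
            continue
    return None


def test_fetch_succeeds_on_second_attempt():
    # An iterable side_effect is consumed one item per call; exception items are raised
    fake_fetch = Mock(side_effect=[ConnectionError, {"data": "test"}])
    assert fetch_with_retry(fake_fetch) == {"data": "test"}
    assert fake_fetch.call_count == 2


def test_fetch_gives_up_after_all_retries():
    # A single exception as side_effect is raised on every call
    fake_fetch = Mock(side_effect=ConnectionError)
    assert fetch_with_retry(fake_fetch, retries=3) is None
    assert fake_fetch.call_count == 3
```

Passing the callable in as an argument keeps the example free of network libraries. In code that calls `requests.get` directly, `patch` plays the same role by swapping the real function for a mock.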

### pyfakefs: Create Fake File System in Memory for Testing

Sometimes you might want to test if the function that interacts with files is working properly but don't want the tests to touch the real disk.

pyfakefs allows your tests to operate on a file system in memory without touching the real disk.

In the code below, I created a fake directory and tested if `save_result` is creating a new file in the fake directory and writing the result to that file.

In [None]:
%%writefile test_pyfakefs.py
from pathlib import Path


def save_result(folder: str, file_name: str, result: str):
    # Create new file inside the folder
    file = Path(folder) / file_name
    file.touch()

    # Write result to the new file
    file.write_text(result)

def test_save_result(fs):
    folder = "new"
    file_name = "my_file.txt"
    result = "The accuracy is 0.9"

    fs.create_dir(folder)

    save_result(folder=folder, file_name=file_name, result=result)
    res = Path(f"{folder}/{file_name}").read_text()
    assert res == result

```bash
$ pytest test_pyfakefs.py
```

In [None]:
!pytest test_pyfakefs.py

[Link to pyfakefs](https://github.com/jmcgeheeiv/pyfakefs/).

### Pandera: a Python Library to Validate Your Pandas DataFrame

In [None]:
!pip install pandera

The outputs of your pandas DataFrame might not be what you expected, either because of an error in your code or because of a change in the data format. Using data that differs from what you expected can cause errors or lead to decreased performance.

Thus, it is important to validate your data before using it. A good tool to validate pandas DataFrame is pandera. Pandera is easy to read and use.

In [None]:
import pandera as pa
from pandera import check_input
import pandas as pd

df = pd.DataFrame({"col1": [5.0, 8.0, 10.0], "col2": ["text_1", "text_2", "text_3"]})
schema = pa.DataFrameSchema(
    {
        "col1": pa.Column(float, pa.Check(lambda minute: 5 <= minute)),
        "col2": pa.Column(str, pa.Check.str_startswith("text_")),
    }
)
validated_df = schema(df)
validated_df

You can also use pandera's `check_input` decorator to validate the input pandas DataFrame before it enters the function.

In [None]:
@check_input(schema)
def plus_three(df):
    df["col1_plus_3"] = df["col1"] + 3
    return df


plus_three(df)

[Link to Pandera](https://pandera.readthedocs.io/en/stable/)

### DeepDiff: Find Deep Differences of Python Objects

In [None]:
!pip install deepdiff

When testing the outputs of your functions, it can be frustrating to see your tests fail because of something you don't care much about, such as:

- order of items in a list

- different ways to specify the same thing such as abbreviation

- exact value up to the last decimal point, etc


Is there a way that you can exclude certain parts of the object from the comparison? That is when DeepDiff comes in handy.

In [None]:
from deepdiff import DeepDiff

DeepDiff can output a meaningful comparison like below:

In [None]:
price1 = {'apple': 2, 'orange': 3, 'banana': [3, 2]}
price2 = {'apple': 2, 'orange': 3, 'banana': [2, 3]}

DeepDiff(price1, price2)

With DeepDiff, you also have full control of which characteristics of the Python object DeepDiff should ignore. In the example below, since the order is ignored `[3, 2]` is equivalent to `[2, 3]`.

In [None]:
# Ignore orders

DeepDiff(price1, price2, ignore_order=True)

We can also exclude certain parts of our object from the comparison. In the code below, we ignore `ml` and `machine learning` since `ml` is an abbreviation of `machine learning`.

In [None]:
experience1 = {"machine learning": 2, "python": 3}
experience2 = {"ml": 2, "python": 3}

DeepDiff(
    experience1,
    experience2,
    exclude_paths={"root['ml']", "root['machine learning']"},
)

Compare 2 numbers up to a specific decimal point:

In [None]:
num1 = 0.258
num2 = 0.259

DeepDiff(num1, num2, significant_digits=2)

[Link to DeepDiff](https://github.com/seperman/deepdiff).

### dirty-equals: Write Declarative Assertions in Your Unit Tests

In [None]:
!pip install dirty-equals

If you want to write declarative assertions and avoid boilerplate code in your unit tests, try dirty_equals.

In [None]:
from dirty_equals import IsNow, IsPartialDict, IsList, IsStr, IsTrueLike

In [None]:
from datetime import datetime
from datetime import timedelta

shopping = {
    "time": datetime.now(),
    "quantity": {"apple": 1, "banana": 2, "orange": 1},
    "locations": ["Walmart", "Aldi"],
    "is_male": 1
}


In [None]:
assert shopping == {
    "time": IsNow(delta=timedelta(hours=1)),
    "quantity": IsPartialDict(apple=1, orange=1),
    "locations": IsList("Aldi", "Walmart", check_order=False),
    "is_male": IsTrueLike
}


[Link to dirty-equals](https://github.com/samuelcolvin/dirty-equals).

### hypothesis: Property-based Testing in Python

In [None]:
!pip install hypothesis

If you want to test some properties or assumptions, it can be cumbersome to write a wide range of scenarios. To automatically run your tests against a wide range of scenarios and find edge cases in your code that you would otherwise have missed, use hypothesis.

In the code below, I test if the addition of two floats is commutative. The test fails when either `x` or `y` is `NaN`.

In [None]:
%%writefile test_hypothesis.py
from hypothesis import given
from hypothesis.strategies import floats



@given(floats(), floats())
def test_floats_are_commutative(x, y):
    assert x + y == y + x

```bash
$ pytest test_hypothesis.py
```

In [None]:
!pytest test_hypothesis.py

Now I can rewrite my code to make it more robust against these edge cases.

[Link to hypothesis](https://hypothesis.readthedocs.io/en/latest/quickstart.html).
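
The failure follows from floating-point rules rather than from a bug in addition: `NaN` never compares equal to anything, including itself. The standard-library sketch below reproduces the edge case and shows one possible way to tolerate it. The helper `sums_match` is a hypothetical name used for illustration.

```python
import math

nan = float("nan")

# NaN never compares equal, not even to itself, so x + y == y + x is False
assert (nan + 1.0) != (1.0 + nan)
assert math.isnan(nan + 1.0) and math.isnan(1.0 + nan)

# Opposite infinities also produce NaN when added
assert math.isnan(float("inf") + float("-inf"))


def sums_match(x: float, y: float) -> bool:
    """Hypothetical helper: treat two NaN results as matching."""
    left, right = x + y, y + x
    if math.isnan(left) and math.isnan(right):
        return True
    return left == right
```

Another option is to keep the plain equality check and narrow the generated inputs instead, for example with the `allow_nan` and `allow_infinity` arguments of the `floats` strategy.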

### Deepchecks: Check Category Mismatch Between Train and Test Set

In [None]:
!pip install deepchecks

Sometimes, it is important to know whether your test set contains the same categories as the train set. If you want to check the category mismatch between the train and test set, use Deepchecks's `CategoryMismatchTrainTest`.

In the example below, the result shows that there are 2 new categories in the test set. They are 'd' and 'e'.

In [None]:
from deepchecks.checks.integrity.new_category import CategoryMismatchTrainTest
from deepchecks.base import Dataset
import pandas as pd

In [None]:
train = pd.DataFrame({"col1": ["a", "b", "c"]})
test = pd.DataFrame({"col1": ["c", "d", "e"]})

train_ds = Dataset(train, cat_features=["col1"])
test_ds = Dataset(test, cat_features=["col1"])

In [None]:
CategoryMismatchTrainTest().run(train_ds, test_ds)

[Link to Deepchecks](https://docs.deepchecks.com/en/stable/)

### Check Conflicting Labels with Deepchecks

Sometimes, your data might have identical samples with different labels. This might be because the data was mislabeled.

It is good to identify these conflicting labels in your data before using the data to train your ML model. To check conflicting labels in your data, use deepchecks.

In the example below, deepchecks identified that samples 0 and 1 have the same features but different labels.

In [None]:
import pandas as pd
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import ConflictingLabels

In [None]:
df = pd.DataFrame({
    "value1": [1, 1, 3],
    "value2": [2, 2, 4],
    "label": ["a", "b", "c"]
})
df

In [None]:
dataset = Dataset(df, label='label')
ConflictingLabels().run(dataset)

### Evaluate Your ML Model Performance with Simple Model Comparison

In [None]:
!pip install deepchecks

How do you check if your ML model is trained properly? One approach is to use a simple model for comparison.

A simple model establishes a minimum performance benchmark for the given task. If your model scores lower than, or about the same as, the simple model, that indicates a possible problem with your model.

The following code shows how to evaluate a model's performance using Deepchecks' simple model comparison.

In [None]:
from deepchecks.tabular.datasets.classification.phishing import (
    load_data, load_fitted_model)

train_dataset, test_dataset = load_data()
model = load_fitted_model()


In [None]:
model.steps

In [None]:
from deepchecks.tabular.checks import SimpleModelComparison

# Using tree model as a simple model
check = SimpleModelComparison(strategy='tree')
check.run(train_dataset, test_dataset, model)

[Link to Deepchecks](https://docs.deepchecks.com/en/stable/)

In [None]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np
# Original data: one feature with five samples (MinMaxScaler scales each column)
data = np.array([[1], [3], [5], [7], [9]])

# Scaling transformation
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(data)

# Inverse transformation
original_data = scaler.inverse_transform(scaled_data)

print("Original data:", data)
print("Scaled data:", scaled_data)
print("Restored data:", original_data)


### leAB: AB Testing Analysis in Python

In [None]:
!pip install leab

AB testing is crucial for assessing the effectiveness of changes in a controlled environment. With the leAB library, you can compute the appropriate sample size before launching the test.  

In [None]:
from leab import before

# What is the number of samples needed per variation to detect a 1% result
# difference in a population with a 15% conversion rate?
ab_test = before.leSample(conversion_rate=15, min_detectable_effect=1)
ab_test.get_size_per_variation()
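
For intuition about where such a number comes from, here is a standard-library sketch of a common textbook approximation for comparing two proportions with a two-sided z-test. It is not taken from leAB and may differ from what leAB computes. The function name, the default `alpha` and `power`, and the reading of the effect as an absolute difference are all assumptions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, min_detectable_effect, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided two-proportion z-test.

    Rates are fractions (0.15 for 15%). The effect is treated as an absolute
    difference, which is an assumption and may differ from leAB's convention.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)  # critical value for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


n_small_effect = sample_size_per_variation(0.15, 0.01)
n_large_effect = sample_size_per_variation(0.15, 0.05)
print(n_small_effect, n_large_effect)
```

The required sample size grows with the inverse square of the effect, so halving the detectable difference roughly quadruples the number of samples needed.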


After reaching the sample size, you can compare the successes between group A and group B.

In [None]:
from leab import after, leDataset

# Import sample data for A and B
data = leDataset.SampleLeSuccess()
data.A.head()


In [None]:
ab_test = after.leSuccess(data.A, data.B, confidence_level=0.95)

# Get the conclusion on the test
ab_test.get_verdict()

[Link to leAB](https://github.com/tlentali/leab).

### pytest-postgresql: Incorporate Database Testing into Your pytest Test Suite

In [None]:
!pip install pytest-postgresql

If you want to incorporate database testing seamlessly within your pytest test suite, use pytest-postgresql.

pytest-postgresql provides fixtures that manage the setup and cleanup of test databases, ensuring repeatable tests. Additionally, each test runs in isolation, preventing any impact on the production database from testing changes.

In [None]:
%%writefile test_postgres.py
def test_query_results(postgresql):
    """Check that the query results are as expected."""
    with postgresql.cursor() as cur:
        cur.execute("CREATE TABLE test_table (id SERIAL PRIMARY KEY, name VARCHAR);")
        cur.execute("INSERT INTO test_table (name) VALUES ('John'), ('Jane'), ('Alice');")

        # Assert the results
        cur.execute("SELECT * FROM test_table;")
        assert cur.fetchall() == [(1, 'John'), (2, 'Jane'), (3, 'Alice')]

```bash
$ pytest test_postgres.py
```

In [None]:
!pytest test_postgres.py

[Link to pytest-postgresql](https://github.com/ClearcodeHQ/pytest-postgresql).