AI Generated Data Tests

Ivan Zhang edited this page Nov 1, 2023 · 2 revisions

Writing data tests from scratch can be a complicated process. Panda Patrol can automatically generate a few initial data tests based on your table headers, a sample of your data, and optional context. This lets you get started with data tests quickly, then iterate and add to them as needed.

Usage

To generate data tests, you can use the generate_data_tests method. This method takes the following parameters:

  • columns: List[str] - The column titles of the data
  • data_preview: str - The preview of the data
  • context: str - Any additional context (e.g. business context) about the data and which tests may be useful
  • output_file: str - The file to save the generated data tests to. If output_file already exists, no data tests are generated. Once the tests have been generated, replace the original generate_data_tests call with them.
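
The columns and data_preview arguments can be assembled from rows you already have in memory. The following is a minimal sketch using only the Python standard library. The build_preview helper and the sample rows are hypothetical and not part of Panda Patrol, and the sketch assumes that a short CSV-style string is an acceptable preview format, as in the example further down this page. It does not call generate_data_tests, since that requires an account and environment variables.

```python
import csv
import io


def build_preview(rows, columns, max_rows=3):
    """Hypothetical helper: return (columns, data_preview) for the first max_rows rows.

    rows is a list of dicts keyed by column name; the preview is CSV text
    without a header line.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows[:max_rows]:
        writer.writerow([row[col] for col in columns])
    return list(columns), buffer.getvalue()


# Illustrative sample rows (made up for this sketch)
rows = [
    {"id": 1, "name": "John", "age": 25},
    {"id": 2, "name": "Jane", "age": 30},
    {"id": 3, "name": "Joe", "age": 35},
    {"id": 4, "name": "Jill", "age": 28},
]

columns, data_preview = build_preview(rows, ["id", "name", "age"])
print(data_preview)
```

The resulting columns and data_preview values can then be passed as the arguments of the same name. Keeping the preview to a few rows limits how much data leaves your pipeline.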

These data tests can be generated anywhere in your code or data pipeline. For example, you can generate them in a Jupyter notebook or in a Dagster pipeline.

# Import the generate_data_tests method
from panda_patrol.ai import generate_data_tests
...
# Generate data tests with the columns, a preview of the data, and some useful context
generate_data_tests(
    columns=["id", "name", "age"],
    data_preview="""
    1,John,25
    2,Jane,30
    3,Joe,35
    """,
    context="This is a table of young users who have signed up for a special service for people whose names start with J.",
    output_file="generated_user_data_tests.py"
)

This may generate the following data tests in generated_user_data_tests.py:

# Write any required imports in comments
# import ...

...
with patrol_group("Data Tests") as patrol:

    # Test 1 - Age range
    @patrol("test_age_range")
    def test_age_range():
        """
        This test ensures that the age of the users falls within the expected range of 20-40.
        """
        expected_age_range = (20, 40)
        for age_value in data['age']:
            assert expected_age_range[0] <= age_value <= expected_age_range[1], \
                f"Age {age_value} is outside the expected range of {expected_age_range[0]}-{expected_age_range[1]}."

    # Test 2 - Name starts with 'J'
    @patrol("test_name_starts_with_J")
    def test_name_starts_with_J():
        """
        This test ensures that all names in the dataset start with the letter 'J'.
        """
        for name_value in data['name']:
            assert name_value.startswith('J'), f"Name {name_value} does not start with the letter 'J'."

    # Test 3 - Count of users
    @patrol("test_user_count")
    def test_user_count():
        """
        This test ensures that the count of users in the dataset is accurate.
        """
        expected_user_count = 3
        assert len(data) == expected_user_count, \
            f"The count of users is {len(data)}, which does not match the expected count of {expected_user_count}."

Replace the original generate_data_tests call with these generated data tests (fixing any potential issues in the generated code). Then run your data pipeline as you normally would. The data tests will be run and the results will be displayed in the Panda Patrol UI.
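
When reviewing generated tests for issues, it can help to run the same checks as plain functions against sample data first. The sketch below is an illustration under stated assumptions, not Panda Patrol's API. It assumes data is a column-oriented mapping (the generated code above uses data without defining it), and the function names are hypothetical. It also shows why a test such as test_user_count probably needs adjusting: its expected count of 3 appears to come from the three-row preview, so it would fail on a larger table.

```python
def check_age_range(data, low=20, high=40):
    """Return the ages that fall outside the inclusive range [low, high]."""
    return [age for age in data["age"] if not low <= age <= high]


def check_name_prefix(data, prefix="J"):
    """Return the names that do not start with prefix."""
    return [name for name in data["name"] if not name.startswith(prefix)]


def row_count(data):
    """Number of rows, taken from the length of the 'id' column."""
    return len(data["id"])


# Illustrative data: the three preview rows plus two more (made up for this sketch)
data = {
    "id": [1, 2, 3, 4, 5],
    "name": ["John", "Jane", "Joe", "Jill", "Alice"],
    "age": [25, 30, 35, 28, 52],
}

age_failures = check_age_range(data)
name_failures = check_name_prefix(data)
count_matches_preview = row_count(data) == 3

print("Ages out of range:", age_failures)
print("Names without prefix:", name_failures)
print("Row count equals preview size:", count_matches_preview)
```

Checks that fail here for a legitimate reason (for example, a hard-coded row count) are candidates for editing or removal before the tests go into the pipeline.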

❗IMPORTANT
To generate data tests, you need an account here and must set the required environment variables (which are shown after you log in).

❗IMPORTANT
If you are generating data tests in a Dagster pipeline, make sure to replace the original generate_data_tests call with the generated data tests. Otherwise, the data tests will not be run.