# Creating Data Unit Tests

A **data unit test** is a test designed to validate the quality, accuracy, and integrity of data at a granular level, similar to how software unit tests validate individual pieces of code. It is not a common concept yet, but there is a growing community around providing this kind of data validation test. The goal is to catch errors and ensure data meets expected criteria before it’s used in downstream processes or models. We identified three key advantages of unit tests for data pipelines:

1. **Granularity:** Checking individual pieces of data or small subsets for specific rules or constraints at the row or column level.
2. **Specificity:** Each test is focused on a specific aspect of the data, like checking for null values, data types, valid ranges, or business logic constraints.
3. **Automated Validation:** Like software unit tests, data unit tests can be automated to run continuously in a data pipeline or in response to new data ingestion.

A data unit test is a fundamental piece of "fairness assurance" since it provides a semi-automated approach to auditing "fairness evidence." When auditing the evidence, we should consider [^assurancereview]:
- Whether the provided evidence is free of bugs
- Whether it has been comprehensively reviewed
- Whether it is presented by competent personnel
- Whether it is derived from a good-quality tool or method

[^assurancereview]: Kelly, T. P. “Reviewing Assurance Arguments – A Step-By-Step Approach.” (2007).

In [1]:
import pandas as pd

In [2]:
FILE = "../credit-scoring-german/data/german_credit_data.csv"

In [3]:
df = pd.read_csv(FILE)

In [4]:
df.head(3)

Unnamed: 0,CheckingStatus,LoanDuration,CreditHistory,LoanPurpose,LoanAmount,ExistingSavings,EmploymentDuration,InstallmentPercent,Sex,OthersOnLoan,...,OwnsProperty,Age,InstallmentPlans,Housing,ExistingCreditsCount,Job,Dependents,Telephone,ForeignWorker,Risk
0,less_0,6,outstanding_credit,radio_tv,1169,unknown,greater_7,4,male,none,...,real_estate,67,none,own,2,skilled,1,yes,yes,No Risk
1,no_checking,12,outstanding_credit,education,2096,less_100,4_to_7,2,male,none,...,real_estate,49,none,own,1,unskilled,2,none,yes,No Risk
2,less_0,42,credits_paid_to_date,furniture,7882,less_100,4_to_7,2,male,guarantor,...,savings_insurance,45,none,free,1,skilled,2,none,yes,No Risk


In [5]:
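# You can create simple tests using the assert statement.
# is_integer_dtype accepts any integer width (int32, int64, ...), so the check does not depend on the platform.
assert pd.api.types.is_integer_dtype(df['Age'])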

In [6]:
# Verifying that critical columns do not contain null or missing values.
assert df['Risk'].isnull().sum() == 0

The `test_unawareness` function defined further below is one unit test that can be included to test unawareness. But, as you might realise, it is an extremely simplified version of testing that assumes sensitive characteristics appear in the column names. In real-life applications, the data might include relational information, have an unstructured format, or encode sensitive characteristics through proxy variables. In such cases, more sophisticated test methods are needed.

We can define several unit tests to check data types, null values, and value ranges, like the following:

In [7]:
# Checking that the values in a column are within a certain range.
def test_column_min():
    assert df['Age'].min() == 18

def test_column_max():
    assert df['Age'].max() == 66 # let's say we are considering the credit scoring use cases before retirement age

def test_column_range():
    assert df['Age'].between(18, 66).all()

# Alternatively, you can check if the min or max values are within a certain range.

# This is also a good opportunity to discuss if the data is representative of the population it is supposed to represent.
# In a facial biometric system you can also check if the data includes min and max possible values for skin colour hue and saturation.
# In an investment system you can check if the data includes min and max possible values for stock prices.

In [8]:
# Verifying that values in a categorical column belong to a predefined set of allowed values.
assert df['Sex'].isin(['male', 'female']).all()

In [9]:
# Verifying that columns do not contain sensitive demographic characteristics.
def test_unawareness():
    # convert df columns to lowercase
    columns = [col.lower() for col in df.columns]

    assert 'gender' not in columns
    assert 'sex' not in columns
    assert 'age' not in columns
    assert 'marital_status' not in columns

In [10]:
# Verify the selected column mean and median are within a certain range.
def test_column_mean():
    # Let's say the mean age of the working population is 42.
    assert df['Age'].mean() >= 39
    assert df['Age'].mean() <= 45
# You can similarly check the median age of the working population in the UK.

In [11]:
# Check value distribution in a column.
def test_value_distribution_is_balanced():
    assert df['Sex'].value_counts(normalize=True).min() >= 0.45
    assert df['Sex'].value_counts(normalize=True).max() <= 0.55

In [12]:
# Verify the selected column contains all unique values from a given set.
# pytest treats test-function arguments as fixtures, so the expected set is read from the module-level list instead.
ethnicities = ['white', 'black', 'asian', 'hispanic', 'other']
def test_column_completeness():
    assert set(ethnicities).issubset(set(df['Ethnicity'].unique()))

In [13]:
# Verify the distribution of unique values in a column is within a certain range given statistics.
ethnicity_distribution = {
    'white': 0.8,
    'black': 0.1,
    'asian': 0.05,
    'hispanic': 0.04,
    'other': 0.01
}

def test_column_distribution():
    # The expected proportions are read from the module-level dict above, since pytest treats arguments as fixtures.
    actual_distribution = df['Ethnicity'].value_counts(normalize=True).to_dict()
    for ethnicity, expected_proportion in ethnicity_distribution.items():
        # Allow a relative tolerance of 10% around each expected proportion.
        actual = actual_distribution.get(ethnicity, 0)
        assert expected_proportion * 0.9 <= actual <= expected_proportion * 1.1

In [14]:
# Test that the quantiles of a column are within a certain range.
def test_quantiles():
    assert df['Age'].quantile(0.25) >= 30
    assert df['Age'].quantile(0.75) <= 50

Further, using this kind of unit test we can assure *uniqueness*, *referential integrity* (foreign key relationships are valid), *value set validation*, and other aspects of the data that we want to continuously verify. Using data unit tests allows developers to detect errors early, monitor data quality automatically, and improve the integrity and reproducibility of their data pipelines.
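
Uniqueness and referential integrity checks can be sketched with plain pandas as follows. The two toy tables and the helper names (`check_unique`, `check_foreign_key`) are hypothetical and only serve to illustrate the idea; the credit dataset used in this notebook is a single table.

```python
import pandas as pd

# Hypothetical toy tables: each loan refers to a customer through `customer_id`.
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
loans = pd.DataFrame({"loan_id": [10, 11, 12], "customer_id": [1, 1, 3]})


def check_unique(data: pd.DataFrame, column: str) -> bool:
    """Uniqueness: no value in `column` appears more than once."""
    return bool(data[column].is_unique)


def check_foreign_key(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> bool:
    """Referential integrity: every `key` value in `child` also exists in `parent`."""
    return bool(child[key].isin(parent[key]).all())


def test_loan_id_is_unique():
    assert check_unique(loans, "loan_id")


def test_loans_reference_existing_customers():
    assert check_foreign_key(loans, customers, "customer_id")
```

Keeping the check logic in small helpers that return a boolean makes it reusable across tables, while the `test_` functions stay one-line assertions that a test runner can collect.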

In [1]:
# You can use pytest-html to generate a report of the test results. (https://github.com/pytest-dev/pytest-html)
#!pip install pytest-html
# Then generate the report using the following command.
#!pytest --html=report.html --self-contained-html
# Note that pytest is not working in this notebook. You can run the command in your terminal.

### Using Great Expectations

**Great Expectations** is an open-source Python library designed for **data quality assurance**. It provides a flexible framework for defining, testing, and maintaining **"expectations"** about your data. These expectations are assertions or tests that describe what your data should look like and how it should behave.

With this library, we can apply pre-defined validation expectations and create a reproducible context using its "expectation suite." In this notebook, we will use the core library, which is open-source and free of charge. You can use existing expectations from the core library or community contributions: https://greatexpectations.io/expectations/

In [None]:
import great_expectations as gx

In [7]:
# -- Set GX constants for artifact creation
NAME_DATA_SOURCE = "credit_score_source"
NAME_DATA_ASSET = "credit_score_data"
NAME_BATCH_DEF = "credit_score_batch_definition"
NAME_EXPECTATION_SUITE = "credit_score_expectation_suite"
NAME_VALIDATION_DEF = "credit_score_validation_definition"
NAME_CHECKPOINT = "credit_score_checkpoint"

# -- 1. Initialize GX for configuration
context = gx.get_context(mode="file")

data_source = context.data_sources.add_pandas(name=NAME_DATA_SOURCE)

data_asset = data_source.add_dataframe_asset(name=NAME_DATA_ASSET)
batch_definition = data_asset.add_batch_definition_whole_dataframe(NAME_BATCH_DEF)

# -- 2. Configure expectation suite to be called over runtime data later
expectation_suite = gx.ExpectationSuite(name=NAME_EXPECTATION_SUITE)
expectation_suite = context.suites.add(expectation_suite)

# -- 2.1. Define table level expectations
columns = list(df.columns)
exp0 = gx.expectations.ExpectTableColumnsToMatchSet(column_set=columns)
expectation_suite.add_expectation(exp0)

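# -- 2.2. Define column level expectations
exp1 = gx.expectations.ExpectColumnValuesToBeBetween(column="Age", max_value=100, min_value=18)
expectation_suite.add_expectation(exp1)

# Note: this expectation measures (number of distinct values) / (number of rows),
# so it does not check how balanced the categories of the column are.
exp2 = gx.expectations.ExpectColumnProportionOfUniqueValuesToBeBetween(
    column="Sex",
    min_value=0.4,
    max_value=0.6
)
expectation_suite.add_expectation(exp2)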

# -- 2.3. Evaluate results on test dataset
batch_parameters = {"dataframe": df}
batch = batch_definition.get_batch(batch_parameters=batch_parameters)
validation_results = batch.validate(expectation_suite)

# -- 3. Bundle suite and batch into validation definition and checkpoint w/ bundled
# --    actions for easy execution later
validation_definition = gx.ValidationDefinition(
    data=batch_definition, suite=expectation_suite, name=NAME_VALIDATION_DEF
)
validation_definition = context.validation_definitions.add(validation_definition)

action_list = [
    gx.checkpoint.UpdateDataDocsAction(
        name="update_all_data_docs",
    ),
]
checkpoint = gx.Checkpoint(
    name=NAME_CHECKPOINT,
    validation_definitions=[validation_definition],
    actions=action_list,
    result_format={
        "result_format": "COMPLETE",
    },
)
context.checkpoints.add(checkpoint)

# -- 4. Run checkpoint to validate if everything works properly
runid = gx.RunIdentifier(run_name="Configuration run")
results = checkpoint.run(batch_parameters=batch_parameters, run_id=runid)


In [12]:
# Print the results of the validation
# You can also see the results, expectations, and validations with their respective checkpoint information in the gx/ folder
# When you run a validation, the results are stored in the checkpoint folder
import pprint

pp = pprint.PrettyPrinter(indent=2)
pp.pprint(results.describe())

('{\n'
 '    "success": false,\n'
 '    "statistics": {\n'
 '        "evaluated_validations": 1,\n'
 '        "success_percent": 0.0,\n'
 '        "successful_validations": 0,\n'
 '        "unsuccessful_validations": 1\n'
 '    },\n'
 '    "validation_results": [\n'
 '        {\n'
 '            "success": false,\n'
 '            "statistics": {\n'
 '                "evaluated_expectations": 3,\n'
 '                "successful_expectations": 2,\n'
 '                "unsuccessful_expectations": 1,\n'
 '                "success_percent": 66.66666666666666\n'
 '            },\n'
 '            "expectations": [\n'
 '                {\n'
 '                    "expectation_type": '
 '"expect_table_columns_to_match_set",\n'
 '                    "success": true,\n'
 '                    "kwargs": {\n'
 '                        "batch_id": '
 '"credit_score_source-credit_score_data",\n'
 '                        "column_set": [\n'
 '                            "CheckingStatus",\n'
 '           

In [13]:
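# As you can see from the output, the validation results are stored in the results object.
# One expectation failed: the check on the Sex column, which was intended to test the gender distribution.
# That expectation compares the proportion of distinct values rather than the category shares, so let's see the real distribution:
print(df['Sex'].value_counts(normalize=True))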

Sex
male      0.69
female    0.31
Name: proportion, dtype: float64


In this notebook, we briefly introduced data unit tests and how we can utilise the Great Expectations library. It is a good tool with lots of functionality. We can save checkpoints and run them in our CI/CD pipeline as part of the deployment process. Despite the advantages, I found some limitations during this tutorial:

- The first and most important issue is the **complexity of setting things up**. Even for a single dataset, we define a complex environment, and migrating complex datasets can be time-consuming, especially for larger projects. Further, defining custom expectations requires detailed knowledge of your data and how it should behave.
- The library is focused on tabular data and has very **limited support for non-tabular data**, which can be a drawback in the current era of multimodal structures.
- I didn't experience it myself, but in some forums users mentioned **performance overhead**, particularly when many complex checks are applied. For big data workloads, this could slow down your pipeline.
- Not a major concern for the library itself, but be aware that creating custom expectations can be challenging, especially if you need to implement highly domain-specific or advanced logic that goes beyond the built-in features. The quality of the expectations depends on the skills of the team writing them.

The library has limitations. However, it is still a powerful tool for maintaining data quality. So, it's useful to explore the library and consider the use cases for data engineers and analysts who want to ensure that their data pipelines produce reliable, clean data.
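
To use a checkpoint as a gate in a CI/CD pipeline, the validation outcome has to be turned into a process exit code. The sketch below assumes the object returned by `checkpoint.run(...)` exposes a boolean `success` attribute, in line with the `"success": false` field in the output printed earlier; `exit_code_for` and `fake_results` are hypothetical names, and a stand-in object is used so the example runs without Great Expectations installed.

```python
from types import SimpleNamespace


def exit_code_for(result) -> int:
    """Map a validation result to a process exit code: 0 on success, 1 otherwise."""
    # A missing `success` attribute is treated as a failure, to stay on the safe side.
    return 0 if getattr(result, "success", False) else 1


# Stand-in for the object returned by checkpoint.run(...) (assumed to have `.success`).
fake_results = SimpleNamespace(success=False)

print(exit_code_for(fake_results))

# In a CI script you would end with something like:
#   import sys
#   sys.exit(exit_code_for(results))
```

A non-zero exit code is what most CI systems use to mark a step as failed, so the pipeline stops before data that failed validation reaches downstream steps.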

# Using FAID

In [2]:
import sys
sys.path.append('../../')
from faid import logging as faidlog
faidlog.init_log()

Logging initialized
