The Pandas data frame is probably the most popular tool used to model tabular data in Python. For in-memory data, Pandas serves a role that might normally fall to a relational database, though Pandas data frames are typically manipulated through methods instead of with a relational query language. One can even extend Pandas to accept query languages or operator algebras, as we have done with [the data algebra](https://github.com/WinVector/data_algebra).

However, a common missing component remains: a general "Pythonic" [data schema](https://en.wikipedia.org/wiki/Database_schema) definition, documentation, and invariant enforcement mechanism.

It turns out it is quite simple to add such functionality using Python decorators. This isn't particularly useful for general functions (such as `pd.merge()`), where the function is supposed to support arbitrary data schemas. However, it can be *very* useful in adding checks and safety to specific applications and analysis workflows built on top of such generic functions. In fact, it is a good way to copy schema details from external data sources, such as databases or CSV files, into enforced application invariants. Application code that transforms fixed tables into expected exported results can benefit greatly from schema documentation and enforcement.

I propose the following simple check criteria for both function signatures and data frames, applied to both inputs and outputs:

  * Data must have *at least* the set of argument names or column names specified.
  * Each column must have *no more* types (for non-null values) than the types specified.

In this note I will demonstrate how to add such schema documentation and enforcement to Python functions working over data frames using Python decorators.

Let's import our modules.

In [None]:
# import modules
from pprint import pprint
import numpy as np
import pandas as pd
import polars as pl
import data_algebra as da
from data_algebra.data_schema import SchemaCheckSwitch

As I said, we are interested in documenting that the data frames we work with have:

  * At least the columns we expect.
  * No types we don't expect in those columns.

These two covariant constraints are what we need to ensure we can write the operations over columns (which we need to know exist), and to not get unexpected results (from unexpected types). Instead of getting downstream signaling or non-signaling errors during column operations, we get useful exceptions on columns and values. This can be particularly useful for data science code near external data sources such as databases or CSV (comma separated value) files. Many of these sources themselves have data schemas and schema documentation that one can copy into the application.
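
To make the two rules concrete, here is a minimal sketch of them in plain Python. It is not the `data_algebra` implementation: `check_columns` is a hypothetical helper, a dictionary of lists stands in for a data frame, and only `None` is treated as a null value.

```python
from typing import Any, Dict, Iterable, List, Set


def check_columns(
    table: Dict[str, Iterable[Any]],
    expected: Dict[str, Set[type]],
) -> List[str]:
    """Return a list of schema problems (empty if the table conforms).

    `table` maps column names to values; it is a hypothetical stand-in
    for a data frame.
    """
    problems = []
    for col_name, allowed_types in expected.items():
        # rule 1: every expected column must be present (extra columns are fine)
        if col_name not in table:
            problems.append(f"missing required column {col_name!r}")
            continue
        # rule 2: non-null values may only have the allowed types
        seen_types = {type(v) for v in table[col_name] if v is not None}
        unexpected = seen_types - allowed_types
        if unexpected:
            names = sorted(t.__name__ for t in unexpected)
            problems.append(f"column {col_name!r} has unexpected types {names}")
    return problems


schema = {"x": {int}, "y": {int, float}}
# conforming table: nulls and extra columns are allowed
print(check_columns({"x": [1, None, 3], "y": [1.5, 2], "extra": ["a"]}, schema))
# non-conforming table: a float in "x", and "y" is missing
print(check_columns({"x": [1, 2.0]}, schema))
```

The first call reports no problems; the second reports both the unexpected `float` in `x` and the missing column `y`.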

We also want to be able to turn enforcement on or off in an entire code base easily. To do this we define an indirect importer called [`schema_check.py`](https://github.com/WinVector/data_algebra/blob/main/Examples/data_schema/schema_check.py). Its code looks like the following:

```
   from data_schema import SchemaCheckSwitch
   # from data_schema import SchemaMock as SchemaCheck
   from data_schema import SchemaRaises as SchemaCheck
   SchemaCheckSwitch().on()
```

Isolating these lines in a shared import lets all other code switch behavior by only editing this file.
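
For intuition, a global switch of this kind can be sketched with class-level state. The sketch below is an assumption about the general idea, not a description of how `SchemaCheckSwitch` is actually implemented; `CheckSwitch` and `require_int` are hypothetical names.

```python
class CheckSwitch:
    """Hypothetical global on/off switch shared by all instances."""

    _enabled = True  # class-level state, so every instance sees the same value

    def on(self) -> None:
        CheckSwitch._enabled = True

    def off(self) -> None:
        CheckSwitch._enabled = False

    def is_on(self) -> bool:
        return CheckSwitch._enabled


def require_int(value):
    """Raise TypeError for non-int values, but only while checks are on."""
    if CheckSwitch().is_on() and not isinstance(value, int):
        raise TypeError(f"expected int, got {type(value).__name__}")
    return value


CheckSwitch().off()
print(require_int("not an int"))  # passes through unchecked
CheckSwitch().on()  # restore checking
```

Because the state lives on the class, any fresh `CheckSwitch()` instance controls the same flag, which is what lets a single call change behavior everywhere.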

Let's go ahead and import `schema_check`.

In [None]:
# use an indirect import, so entire package behavior
# can be changed globally all at once
import schema_check


The usual way to define a function in Python is as follows.

In [None]:
# standard define of a function
def fn(a, /, b, *, c, d=None):
    """doc"""
    return d


Let's instead define the same function with the `SchemaCheck` decoration. The details of this decorator are documented [here](https://github.com/WinVector/Examples/tree/main/arg_types#readme).

In [None]:
# same function definition, now with schema decorator
@schema_check.SchemaCheck({
        'a': int, 
        'b': {int, float}, 
        'c': {'x': int},
        },
        return_spec={'z': float})
def fn(a, /, b, *, c, d=None):
    """doc"""
    return d

The decorator defines the type schemas of at least a subset of positional and named arguments. Declarations are either values (converted to Python types), Python types, or sets of types. A special case is dictionaries, which specify a subset of the column structure of function signatures or data frames. `return_spec` is reserved to name the return schema of the function.

Our decorator documentation declares that `fn()` expects at least:

  * an argument `a` of type `int`.
  * an argument `b` of type `int` or `float`.
  * an argument `c` that is a data frame (implied by the dictionary argument), and that data frame contains a column `x` that has no non-null elements of type other than `int`.
  * to return a data frame (indicated by the dictionary argument) that has at least a column `z` that contains no non-null elements of type other than `float`.

This gives us some enforceable invariants that can improve our code.
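
For readers curious about the mechanism, here is a minimal sketch of how a decorator can check scalar argument types. It is not the `SchemaCheck` implementation: `check_args` is a hypothetical name, it handles only sets of exact types, and it does no data frame or return value checking.

```python
import functools
import inspect


def check_args(arg_specs):
    """Hypothetical decorator factory: arg_specs maps argument name -> set of allowed types."""

    def decorator(fn):
        sig = inspect.signature(fn)

        @functools.wraps(fn)  # keep the original name and docstring
        def wrapper(*args, **kwargs):
            # bind_partial maps positional and keyword values to parameter
            # names without complaining about missing arguments
            bound = sig.bind_partial(*args, **kwargs)
            for name, allowed in arg_specs.items():
                if name not in bound.arguments:
                    raise TypeError(
                        f"{fn.__name__}() missing expected argument {name!r}"
                    )
                value = bound.arguments[name]
                if type(value) not in allowed:
                    raise TypeError(
                        f"{fn.__name__}() argument {name!r} has type "
                        f"{type(value).__name__}, expected one of "
                        f"{sorted(t.__name__ for t in allowed)}"
                    )
            return fn(*args, **kwargs)

        return wrapper

    return decorator


@check_args({"a": {int}, "b": {int, float}})
def add(a, /, b, *, scale=1):
    """Add two numbers and scale the result."""
    return (a + b) * scale


print(add(1, 2.5, scale=2))  # → 7.0
try:
    add(1, "two")
except TypeError as e:
    print(e)
```

Using `inspect.signature` lets the check work by argument name regardless of whether the caller passed a value positionally or by keyword.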

We can see the schema repeated back in the decorator-altered `help()` text.

In [None]:
# show altered help text
help(fn)

This is a learnable schema specification convention.

Let's see it catch an error. We show what happens if we call `fn()` with none of the expected arguments.

In [None]:
# catch schema mismatch
threw = False
try:
    fn()
except TypeError as e:
    print(e)
    threw = True
assert threw

Or, and this is where we start to get benefits, we can call with a wrong argument type.

In [None]:
# catch schema mismatch
threw = False
try:
    fn(1, 2, c=3)
except TypeError as e:
    print(e)
    threw = True
assert threw

And we show that this checking pushes down into the structure of data frame arguments! In our next example we see the argument is missing a required column.


In [None]:
# catch schema mismatch
threw = False
try:
    fn(1, 2, c=pd.DataFrame({'z': [7]}))
except TypeError as e:
    print(e)
    threw = True
assert threw

We can check column and cell types in addition to mere column names.

In [None]:
# catch schema mismatch
threw = False
try:
    fn(1, 2, c=pd.DataFrame({'x': [3.0]}))
except TypeError as e:
    print(e)
    threw = True
assert threw

And, we can check return types.

In [None]:
# catch schema mismatch
rv = None
threw = False
try:
    fn(
        1, 
        2, 
        c=pd.DataFrame({'x': [30], "z": [17.2]}), 
        d=pd.DataFrame({'q': [7.0]}))
except TypeError as e:
    print(e.args[0])
    rv = e.args[1]
    threw = True
assert threw

# the return value is available for inspection
rv

Notice the rejected return value is attached to the `TypeError` to help with diagnosis and debugging.
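
Attaching a value to an exception relies only on standard Python behavior: extra positional arguments given to an exception are kept in its `args` tuple. A minimal sketch follows, where `checked_result` is a hypothetical helper and not part of the package.

```python
def checked_result(value, allowed_types):
    """Return value if its type is allowed; otherwise raise TypeError carrying the value."""
    if type(value) not in allowed_types:
        # extra positional arguments to an exception are kept in e.args
        raise TypeError(
            f"return value has type {type(value).__name__}, "
            f"expected one of {sorted(t.__name__ for t in allowed_types)}",
            value,
        )
    return value


rejected = None
try:
    checked_result("17.2", {float})
except TypeError as e:
    message, rejected = e.args  # args[0] is the message, args[1] the value
    print(message)
print(repr(rejected))
```

This is why the example above reads the message from `e.args[0]` and the rejected value from `e.args[1]`.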

Again, these sorts of checks are not for generic utility methods (such as `pd.merge()`), which are designed to work over a larger variety of schemas. However, they are very useful near client interfaces, APIs, and database tables. This technique and [data algebra](https://github.com/WinVector/data_algebra) processing may naturally live near data sources. There is an under-appreciated design principle that package code should be generic, and application code should be specific (even in the same project).

Let's show a successful call.

In [None]:
fn(
    1, 
    b=2, 
    c=pd.DataFrame({'x': [3]}), 
    d=pd.DataFrame({'z': [7.0]}))

We can turn off the checking with a single global command.

In [None]:
# turn off checking globally
SchemaCheckSwitch().off()

Now notice a previously failing call is no longer checked.

In [None]:
# show wrong return value is now allowed
fn(
    1, 
    2, 
    c=pd.DataFrame({'x': [30], "z": [17.2]}), 
    d=pd.DataFrame({'q': [7.0]}))

The return value is missing the required `z` column, but with checks off the function is not interfered with.

When checks are on, failures are detected much closer to their causes, making debugging and diagnosis much easier. Also, the decorations are an easy way to document, in human-readable form, some basics of the expected input and output schemas.

The input and output schemas are also attached to the function as objects.

In [None]:
# show argument schema specifications
pprint(fn.data_schema.arg_specs)

In [None]:
# show return value schema
pprint(fn.data_schema.return_spec)

This makes the schema data available for other uses.
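
As one illustration of such "other uses", a schema stored on a function can drive generated documentation. The sketch below uses hypothetical names (`with_schema`, a `schema` attribute, and `describe`) and a simplified one-type-per-argument layout rather than the package's actual `data_schema` object.

```python
def with_schema(arg_specs, return_spec=None):
    """Hypothetical decorator that records a schema on the function (no checking)."""

    def decorator(fn):
        # store the declarations where other tools can find them
        fn.schema = {"arg_specs": dict(arg_specs), "return_spec": return_spec}
        return fn

    return decorator


@with_schema({"a": int, "b": float}, return_spec=float)
def ratio(a, b):
    """Divide a by b."""
    return a / b


def describe(fn):
    """Render the recorded schema as lines of text, for example for a report."""
    spec = fn.schema
    lines = [f"{fn.__name__}:"]
    for name, typ in spec["arg_specs"].items():
        lines.append(f"  {name}: {typ.__name__}")
    lines.append(f"  returns: {spec['return_spec'].__name__}")
    return lines


print("\n".join(describe(ratio)))
```

The same stored information could equally feed test generators or cross-checks against an external database schema.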

A downside is that the technique *can* run into what I call "the first rule of meta-programming": meta-programming only works as long as it doesn't run into other meta-programming (also called the "it's only funny when I do it" theorem). That being said, I feel these decorators can be very valuable in Python data science projects.

This documentation and demo can be found [here](https://github.com/WinVector/data_algebra/tree/main/Examples/data_schema).


The system also works with Polars data frames instead of Pandas as the data frame realization.

In [None]:
# turn back on checking globally
SchemaCheckSwitch().on()

In [None]:
# failing example in Polars
threw = False
try:
    fn(1, 2, c=pl.DataFrame({'z': [7]}))
except TypeError as e:
    print(e)
    threw = True
assert threw

In [None]:
# failing example in Polars
rv = None
threw = False
try:
    fn(
        1, 
        2, 
        c=pl.DataFrame({'x': [30], "z": [17.2]}), 
        d=pl.DataFrame({'q': [7.0]}))
except TypeError as e:
    print(e.args[0])
    rv = e.args[1]
    threw = True
assert threw

# the return value is available for inspection
rv

In [None]:
# good example in Polars
fn(
    1, 
    b=2, 
    c=pl.DataFrame({'x': [3]}), 
    d=pl.DataFrame({'z': [7.0]}))

And we also have simple "types in data frame" inspection tools [here](https://github.com/WinVector/data_algebra/blob/main/Examples/data_schema/df_types.ipynb).

In conclusion: the `SchemaCheck` decoration is a simple and effective tool to add schema documentation and enforcement to your analytics projects.

In [None]:
# show some relevant versions
pprint({
    'pd': pd.__version__, 
    'pl': pl.__version__, 
    'np': np.__version__, 
    'da': da.__version__})