The Pandas data frame is one tool used to model tabular data in Python. For in-memory data it serves the role a relational database might otherwise serve, using methods instead of a relational query language.

The difference between Pandas and relational tables is not fundamental, and one can even extend Pandas to accept different operator algebras, as we have done with [the data algebra](https://github.com/WinVector/data_algebra). However, a commonly missing component is [data schema](https://en.wikipedia.org/wiki/Database_schema) definition and invariant enforcement.

It turns out it is quite simple to add such functionality using Python decorators. This isn't particularly useful for general functions (such as `pd.merge()`), where the function is supposed to support arbitrary data schemas. However, it can be *very* useful in adding checks and safety to specific applications and analysis workflows built on top of such generic functions. In fact, it is a good way to copy schema details from external data sources, such as databases or CSV files, into enforced application invariants. Application code that transforms fixed tables into expected exported results can benefit greatly from schema documentation and enforcement.

In this note I will demonstrate how to add schema documentation and enforcement to Python functions working over data frames using Python decorators.

Let's import our modules.

In [1]:
# import modules
from pprint import pprint
import numpy as np
import pandas as pd
import polars as pl
import data_algebra as da
from data_algebra.data_schema import SchemaCheckSwitch

We supply a lightweight implementation of a schema enforcement system.

This is a bit different than having a type system. We are not interested in what is and what is not a data frame. Instead, we are interested in documenting that the data frames we work with have:

  * At least the columns we expect.
  * No types we don't expect in those columns.

These two covariant constraints are what we need to ensure that we can write operations over columns (which we need to know exist) and not get unexpected results (from unexpected types). The idea is: can we document and enforce (at least partial) schemas both on function signatures and data frames? This can be particularly useful for data science code near external data sources such as databases or CSV (comma separated value) files. Many of these sources themselves have data schemas and schema documentation.
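As a rough illustration of these two constraints (not the `data_algebra` implementation), the sketch below checks a plain dictionary of column lists that stands in for a data frame. The helper name `find_schema_issues` and the dictionary-based table are assumptions made only for this example.

```python
def find_schema_issues(table, column_spec):
    """Return a list of schema problems (an empty list means the table passes).

    table: dict mapping column name -> list of values (a stand-in for a data frame).
    column_spec: dict mapping column name -> a type or a set of types.
    """
    issues = []
    for name, allowed in column_spec.items():
        # constraint 1: every expected column must be present
        if name not in table:
            issues.append(f"missing required column {name!r}")
            continue
        allowed_types = tuple(allowed) if isinstance(allowed, set) else (allowed,)
        # constraint 2: no non-null value may have an unexpected type
        bad = sorted({
            type(v).__name__
            for v in table[name]
            if v is not None and not isinstance(v, allowed_types)
        })
        if bad:
            issues.append(f"column {name!r} has unexpected types: {bad}")
    return issues


# extra columns are allowed; only the declared ones are checked
print(find_schema_issues({"x": [1, 2, None], "y": ["a", "b", "c"]}, {"x": int}))
print(find_schema_issues({"y": [1.5]}, {"x": int, "y": {int, float}}))
print(find_schema_issues({"x": [1, 2.5]}, {"x": int}))
```

Note that the check is deliberately one-sided: undeclared columns and null values pass, which matches the "at least these columns, no unexpected types" reading above.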

I've started experimenting with a Python module to automate this task as a debugging feature. As it is a debugging feature, we want to be able to turn it on or off across an entire code base easily. To do this we define an indirect importer called [`schema_check.py`](https://github.com/WinVector/data_algebra/blob/main/Examples/data_schema/schema_check.py). Its code looks like the following:

```
   from data_schema import SchemaCheckSwitch
   # from data_schema import SchemaMock as SchemaCheck
   from data_schema import SchemaRaises as SchemaCheck
   SchemaCheckSwitch().on()
```

This file picks which class is called "`SchemaCheck`" (either a mock that does nothing, or a real one that raises exceptions on irregularities). It also sets globally whether the system is "on" or "off" (only relevant for the "raises", or checking, implementation). Putting these lines in a shared import lets all other code switch behavior by editing only this file.
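The on/off mechanism can be pictured with a small sketch. This is not the `data_schema` source: the names `CheckSwitch` and `checked` are hypothetical, and the "check" here is just a callable that returns a list of problems.

```python
import functools


class CheckSwitch:
    """Process-wide on/off flag; every instance shares the same state."""

    _is_on = True  # class attribute, so the state is shared by all instances

    def on(self):
        CheckSwitch._is_on = True

    def off(self):
        CheckSwitch._is_on = False

    def is_on(self):
        return CheckSwitch._is_on


def checked(validate):
    """Decorator factory: run validate(*args, **kwargs) first, but only while the switch is on."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if CheckSwitch().is_on():
                issues = validate(*args, **kwargs)
                if issues:
                    raise TypeError(
                        f"function {fn.__name__}(), issues:\n" + "\n".join(issues)
                    )
            return fn(*args, **kwargs)

        return wrapper

    return decorator


@checked(lambda x: [] if isinstance(x, int) else ["arg x expected int"])
def double(x):
    return 2 * x


CheckSwitch().off()
print(double("ab"))  # not checked while the switch is off
CheckSwitch().on()
try:
    double("ab")
except TypeError as e:
    print(e)
```

Because the flag is consulted at call time rather than at decoration time, flipping the switch affects functions that were already decorated.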

Let's go ahead and import that code.

In [2]:
# use an indirect import, so entire package behavior
# can be changed all at once
import schema_check

What we have imported is a decorator that declares the type schemas of at least a subset of positional and named arguments. Declarations are either Python types or sets of types. A special case is dictionaries, which specify a subset of the column structure of data frames. The name "return_spec" is reserved for the return schema of the function.


In [3]:
# a standard function definition
def fn(a, /, b, *, c, d=None):
    """doc"""
    return d


Let's decorate a function with `SchemaCheck`. The details of this decorator are documented [here](https://github.com/WinVector/Examples/tree/main/arg_types#readme).

In [4]:
# same function definition, now with schema decorator
@schema_check.SchemaCheck({
        'a': int, 
        'b': {int, float}, 
        'c': {'x': int},
        },
        return_spec={'z': float})
def fn(a, /, b, *, c, d=None):
    """doc"""
    return d

This declares that `fn()` expects at least:

  * an argument `a` of type `int`.
  * an argument `b` of type `int` or `float`.
  * an argument `c` that is a data frame (implied by the dictionary argument), and that data frame contains a column `x` that has no non-null elements of type other than `int`.
  * The function returns a data frame (indicated by the dictionary argument) that has at least a column `z` that contains no non-null elements of type other than `float`.

This gives us some enforceable invariants that can improve our code.
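To make the declaration semantics concrete, here is a minimal sketch of how such a decorator could check scalar arguments using only the standard library. It is not the `SchemaCheck` implementation: `simple_arg_check` is a hypothetical name, and data frame (dictionary) specifications are deliberately left out.

```python
import functools
import inspect


def simple_arg_check(arg_specs):
    """Check declared arguments against a type or a set of types (hypothetical sketch)."""

    def decorator(fn):
        sig = inspect.signature(fn)

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # bind_partial tolerates missing arguments, so we can report them ourselves
            bound = sig.bind_partial(*args, **kwargs)
            issues = []
            for name, spec in arg_specs.items():
                if name not in bound.arguments:
                    issues.append(f"expected arg {name} missing")
                    continue
                allowed = tuple(spec) if isinstance(spec, set) else (spec,)
                value = bound.arguments[name]
                if not isinstance(value, allowed):
                    issues.append(
                        f"arg {name} had unexpected type {type(value).__name__}"
                    )
            if issues:
                raise TypeError(
                    f"function {fn.__name__}(), issues:\n" + "\n".join(issues)
                )
            return fn(*args, **kwargs)

        return wrapper

    return decorator


@simple_arg_check({"a": int, "b": {int, float}})
def g(a, /, b, *, c=None):
    return a + b


print(g(1, 2.5))
try:
    g("1")
except TypeError as e:
    print(e)
```

Collecting all issues before raising, instead of stopping at the first one, lets a single failed call report every problem at once.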

We can see some of this repeated back in the decorator-altered `help()`.

In [5]:
# show altered help text
help(fn)

Help on function fn in module __main__:

fn(a, /, b, *, c, d=None)
     arg specifications
    {'a': <class 'int'>,
     'b': {<class 'float'>, <class 'int'>},
     'c': {'x': <class 'int'>}}
     return specification:
    {'z': <class 'float'>}
    
    
    doc



It is a learnable schema specification convention.
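One way a decorator can surface its declarations in `help()` is by rewriting the wrapped function's docstring. The sketch below shows that general idea under assumed names (`document_spec` is hypothetical); it is not how `SchemaCheck` is necessarily implemented.

```python
import functools
from pprint import pformat


def document_spec(arg_specs):
    """Decorator that prepends a pretty-printed spec to the function's docstring."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)

        # functools.wraps copied fn.__doc__; extend it with the declared spec
        wrapper.__doc__ = (
            "arg specifications:\n" + pformat(arg_specs) + "\n\n" + (fn.__doc__ or "")
        )
        return wrapper

    return decorator


@document_spec({"a": int})
def h(a):
    """doc"""
    return a


print(h.__doc__)
```

Since `help()` reads `__doc__`, the declared specification then appears alongside the original documentation.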

Let's see it catch an error. We show what happens if we call `fn()` with none of the expected arguments.

In [6]:
# catch schema mismatch
threw = False
try:
    fn()
except TypeError as e:
    print(e)
    threw = True
assert threw


function fn(), issues:
expected arg a missing  
expected arg b missing  
expected arg c missing


We can try with just one argument missing.

In [7]:
# catch schema mismatch
threw = False
try:
    fn(1, 2)
except TypeError as e:
    print(e)
    threw = True
assert threw


function fn(), issues:
expected arg c missing


Or, and this is where we start to get benefits, we can call with a wrong argument type.

In [8]:
# catch schema mismatch
threw = False
try:
    fn(1, 2, c=3)
except TypeError as e:
    print(e)
    threw = True
assert threw


function fn(), issues:
arg c expected a Pandas or Polars data frame, had int


And we show that this checking pushes down into the structure of data frame arguments! 

These sorts of checks are not for generic utility methods (such as `pd.merge()`), which are designed to work over a larger variety of schemas. However, they are very useful near client interfaces, APIs, and database tables. These are places where there is fixed schema information, and one can benefit from preserving it for just a bit longer. This technique and [the data algebra](https://github.com/WinVector/data_algebra) may naturally live near data sources.

In data science the natural types are data frame schemas. Knowing only the type of the outer variable (that it is a data frame) just isn't an interesting invariant.

In [9]:
# catch schema mismatch
threw = False
try:
    fn(1, 2, c=pd.DataFrame({'z': [7]}))
except TypeError as e:
    print(e)
    threw = True
assert threw


function fn(), issues:
arg c missing required column 'x'


We can check column types in addition to mere column names.

In [10]:
# catch schema mismatch
threw = False
try:
    fn(1, 2, c=pd.DataFrame({'x': [3.0]}))
except TypeError as e:
    print(e)
    threw = True
assert threw


function fn(), issues:
arg c  column 'x' expected type int, found type float


And we check return types.

In [11]:
# catch schema mismatch
rv = None
threw = False
try:
    fn(
        1, 
        2, 
        c=pd.DataFrame({'x': [30], "z": [17.2]}), 
        d=pd.DataFrame({'q': [7.0]}))
except TypeError as e:
    print(e.args[0])
    rv = e.args[1]
    threw = True
assert threw

# the return value is available for inspection
rv

fn() return value: missing required column 'z'


     q
0  7.0


Notice the rejected return value is attached to the `TypeError` to help with diagnosis and debugging.
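Attaching the value relies only on ordinary Python exception behavior: extra positional arguments given to an exception are kept in its `args` tuple. A small sketch, with a hypothetical `check_return` helper and a plain dictionary standing in for the returned data frame:

```python
def check_return(value, required_keys):
    """Raise TypeError carrying the rejected value when required keys are missing."""
    missing = [k for k in required_keys if k not in value]
    if missing:
        # the second positional argument rides along in e.args[1]
        raise TypeError(
            f"return value: missing required column {missing[0]!r}", value
        )
    return value


rejected = None
try:
    check_return({"q": [7.0]}, ["z"])
except TypeError as e:
    print(e.args[0])
    rejected = e.args[1]

# the rejected value can now be inspected
print(rejected)
```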

Now, let's show a successful call.

In [12]:
fn(
    1, 
    b=2, 
    c=pd.DataFrame({'x': [3]}), 
    d=pd.DataFrame({'z': [7.0]}))

     z
0  7.0


We can turn off the checking with a single global command.

In [13]:
# turn off checking globally
SchemaCheckSwitch().off()

And now notice a previously failing call is no longer checked.

In [14]:
# show wrong return value is now allowed
fn(
    1, 
    2, 
    c=pd.DataFrame({'x': [30], "z": [17.2]}), 
    d=pd.DataFrame({'q': [7.0]}))

     q
0  7.0


The return value is missing the required `z` column, but with checks off the function is not interfered with.

The idea is: when checks are on, failures are detected much closer to their causes, making debugging and diagnosis much easier. The decorations are also an easy way to document, in human-readable form, some basics of the expected input and output schemas.

And the input and output schema are attached to the function as objects.

In [15]:
# show argument schema specifications
pprint(fn.data_schema.arg_specs)

{'a': <class 'int'>,
 'b': {<class 'float'>, <class 'int'>},
 'c': {'x': <class 'int'>}}


In [16]:
# show return value schema
pprint(fn.data_schema.return_spec)

{'z': <class 'float'>}


This makes the schema data available for other use, even some automatic checking of function composition conditions!
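As one hedged illustration of such a composition check: if a producer's return specification and a consumer's column specification are both dictionaries like those printed above, their compatibility can be tested before any data flows. The helper `composition_issues` below is hypothetical and not part of `data_algebra`.

```python
def _as_type_set(spec):
    """Normalize a type or a set of types to a frozenset of types."""
    return frozenset(spec) if isinstance(spec, (set, frozenset)) else frozenset([spec])


def composition_issues(producer_return_spec, consumer_column_spec):
    """List reasons the producer's output might not satisfy the consumer's input."""
    issues = []
    for name, wanted in consumer_column_spec.items():
        if name not in producer_return_spec:
            issues.append(f"column {name!r} is not guaranteed by the producer")
            continue
        # any type the producer may emit but the consumer does not accept is a problem
        extra = _as_type_set(producer_return_spec[name]) - _as_type_set(wanted)
        if extra:
            names = sorted(t.__name__ for t in extra)
            issues.append(f"column {name!r} may contain unaccepted types: {names}")
    return issues


# the producer promises a float column 'z'; each consumer states what it accepts
print(composition_issues({"z": float}, {"z": {int, float}}))
print(composition_issues({"z": float}, {"x": int}))
print(composition_issues({"z": {int, float}}, {"z": int}))
```

An empty list means every column the consumer requires is promised by the producer, with no type the consumer would reject.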


The technique can run into what I call "the first rule of meta-programming". Meta-programming only works as long as it doesn't run into other meta-programming. That being said, I feel these decorators can be very valuable in Python data science projects.

The implementation, documentation, and demo of this methodology can be found [here](https://github.com/WinVector/data_algebra/tree/main/Examples/data_schema).


Note, the system works about the same with Polars instead of Pandas as the data frame realization.

In [17]:
# turn back on checking globally
SchemaCheckSwitch().on()

In [18]:
# failing example in Polars
threw = False
try:
    fn(1, 2, c=pl.DataFrame({'z': [7]}))
except TypeError as e:
    print(e)
    threw = True
assert threw


function fn(), issues:
arg c missing required column 'x'


In [19]:
# failing example in Polars
rv = None
threw = False
try:
    fn(
        1, 
        2, 
        c=pl.DataFrame({'x': [30], "z": [17.2]}), 
        d=pl.DataFrame({'q': [7.0]}))
except TypeError as e:
    print(e.args[0])
    rv = e.args[1]
    threw = True
assert threw

# the return value is available for inspection
rv

fn() return value: missing required column 'z'


q
f64
7.0


In [20]:
# good example in Polars
fn(
    1, 
    b=2, 
    c=pl.DataFrame({'x': [3]}), 
    d=pl.DataFrame({'z': [7.0]}))

z
f64
7.0


The `SchemaCheck` decoration is a simple and effective tool for adding schema documentation and enforcement to your analytics projects.

In [21]:
# show some relevant versions
pprint({
    'pd': pd.__version__, 
    'pl': pl.__version__, 
    'np': np.__version__, 
    'da': da.__version__})

{'da': '1.6.9', 'np': '1.25.2', 'pd': '2.0.3', 'pl': '0.19.2'}
