The Pandas data frame has a number of interesting features, including complex indices (which allow implicit joins) and time-oriented features.

However, it also embodies a number of design decisions that pose problems for classic (non-finance) data science. Some undesirable sharp edges include:


  * Separate types for atomic columns (such as `int`, `bool`, and `float`) and columns of objects (such as `str`).
  * No out-of-band representation of missing values. Instead, missingness must be signaled by inserting a value that represents missingness. This causes problems for types that have no such value, such as `int` and `bool`.

To work around the above, the Pandas data frame promotes integer columns to float if missing values are present. Let's take a look at a problem data frame.

In [1]:
from pprint import pprint
import numpy as np
import pandas as pd
from type_signature import non_null_types_in_frame, TypeCheckSwitch


In [2]:
d =  pd.DataFrame({
        'b': [1, 3, 4],
        'q': np.nan,
        'r': [1, None, 3],
        's': [np.nan, 2.0, 3.0],
        'x': [1, 7.0, 2],
        'y': ["a", None, np.nan],
        'z': [1, 1.0, False],
    })

d

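|   | b | q | r | s | x | y | z |
|---|---|---|---|---|---|---|---|
| 0 | 1 | NaN | 1.0 | NaN | 1.0 | a | 1 |
| 1 | 3 | NaN | NaN | 2.0 | 7.0 | None | 1.0 |
| 2 | 4 | NaN | 3.0 | 3.0 | 2.0 | NaN | False |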


Notice that `None` has been converted to `NaN` in column `r`, but not in column `y`. The declared column types tell part of the story.

In [3]:
d.dtypes

b      int64
q    float64
r    float64
s    float64
x    float64
y     object
z     object
dtype: object

Directly inspecting the types found in the data frame cells shows a bit more detail.

In [4]:
non_null_types_in_frame(d)

{'b': {numpy.int64},
 'q': None,
 'r': {numpy.float64},
 's': {numpy.float64},
 'x': {numpy.float64},
 'y': {str},
 'z': {bool, float, int}}

Notice the scary profusion of "nearly compatible types" such as `float` vs. `numpy.float64` and `int` vs. `numpy.int64`.

It is frankly a bit hard to predict what types will be in a Pandas data frame given a set of values and operations. There are promotion rules: mixtures of `int`s and `float`s become `float` columns, and other mixtures become `object` columns. Notice, though, that in `object` columns the extra freedom is used to preserve the distinction between `int`s and `float`s!
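
The promotion rules can be checked directly. The following is a minimal sketch that rebuilds three of the columns from the example frame (the name `demo` is ours, not part of the original notebook). Integer widths can vary by platform, so the comments describe the kind of dtype rather than quote exact output.

```python
import pandas as pd

# Rebuild three columns from the example frame above.
demo = pd.DataFrame({
    'r': [1, None, 3],       # ints plus a missing value
    'x': [1, 7.0, 2],        # ints mixed with a float
    'z': [1, 1.0, False],    # int, float, and bool mixed
})

# A missing value or an int/float mixture promotes the column to a float dtype.
print(demo['r'].dtype, demo['x'].dtype)

# Other mixtures fall back to object, where each cell keeps its own Python type.
print(demo['z'].dtype)
print([type(v).__name__ for v in demo['z']])
```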

With this in mind, one may want to consider inspecting incoming and outgoing column types. We supply a lightweight implementation of a schema enforcement system.

This is a bit different than having a type system. We are not interested in what is and what is not a data frame. Instead, we are interested in documenting that the data frames we work with have:

  * At least the columns we expect.
  * No types we don't expect in those columns.

These two covariant constraints are what we need to ensure we can write operations over columns (which we need to know exist) and not get unexpected results (from unexpected types).
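
To make the two constraints concrete, here is a minimal sketch of checking them by hand. The helper `column_issues()` is hypothetical and is not the `type_signature` implementation. It also assumes cell types are compared as Python scalars (via `Series.tolist()`), which may differ from how the real module treats `numpy` scalar types.

```python
import pandas as pd


def column_issues(frame, expected):
    """Hypothetical helper: list the ways `frame` violates the `expected` column spec.

    `expected` maps column names to a type or a set of allowed types.
    """
    issues = []
    for col, allowed in expected.items():
        if col not in frame.columns:
            issues.append(f"missing required column '{col}'")
            continue
        allowed = allowed if isinstance(allowed, set) else {allowed}
        # tolist() yields Python scalars for numeric columns; null cells are skipped.
        found = {type(v) for v in frame[col].tolist() if not pd.isna(v)}
        for t in sorted(found - allowed, key=lambda t: t.__name__):
            issues.append(f"column '{col}' has unexpected type {t.__name__}")
    return issues


# A missing column, an unexpected type, and a passing frame with an extra column.
print(column_issues(pd.DataFrame({'z': [7]}), {'x': int}))
print(column_issues(pd.DataFrame({'x': [3.0]}), {'x': int}))
print(column_issues(pd.DataFrame({'x': [3, 4], 'extra': ['a', 'b']}), {'x': {int, float}}))
```

Extra columns are deliberately not reported, matching the "at least the columns we expect" reading of the first constraint.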

I've started experimenting with a Python module to automate this task as a debugging feature. As it is a debugging feature, we want to be able to turn it on or off across an entire code base easily. To do this we define an indirect importer called [`import_tc.py`](https://github.com/WinVector/Examples/blob/main/arg_types/import_tc.py). Its code looks like the following:

```
  from type_signature import TypeCheckSwitch
  # from type_signature import TypeSignatureNOOP as TypeSignature
  from type_signature import TypeSignatureRaises as TypeSignature
  TypeCheckSwitch().on()
```

It picks which class is called "`TypeSignature`" (either a mock that does nothing, or a real one that raises exceptions on irregularities). It also globally sets whether the system is "on" or "off" (only relevant for the "raises", or checking, system). Putting these lines in a shared import lets all other code switch behavior by editing only this file.
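
The pattern itself (an aliased decorator plus a shared switch) can be sketched in a few lines. Everything below is a hypothetical stand-in written for illustration: `CheckSwitch`, `signature_noop`, and `signature_raises` are our names, and the real `type_signature` classes may be built quite differently. This sketch only checks keyword arguments against single types.

```python
import functools


class CheckSwitch:
    """Hypothetical global switch; the flag lives on the class, so all instances share it."""
    _on = False

    def on(self):
        CheckSwitch._on = True

    def off(self):
        CheckSwitch._on = False

    def is_on(self):
        return CheckSwitch._on


def signature_noop(arg_specs, *, return_spec=None):
    """Stand-in decorator factory that ignores the specs and returns the function untouched."""
    def decorate(fn):
        return fn
    return decorate


def signature_raises(arg_specs, *, return_spec=None):
    """Stand-in decorator factory that checks keyword argument types while the switch is on."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if CheckSwitch().is_on():
                for name, expected in arg_specs.items():
                    if name in kwargs and not isinstance(kwargs[name], expected):
                        raise TypeError(f"arg {name} expected type {expected.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorate


# Choosing the alias in one place lets a shared import flip behavior everywhere.
Signature = signature_raises  # or: Signature = signature_noop
CheckSwitch().on()


@Signature({'count': int})
def show(*, count):
    return count
```

With the switch on, `show(count='three')` raises a `TypeError`; after `CheckSwitch().off()` the same call passes straight through.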

Let's go ahead and import `import_tc`.

In [5]:
import import_tc

What we have imported is a decorator that declares the type schemas of at least a subset of positional and named arguments. Declarations are either Python types or sets of types. A special case is dictionaries, which specify a subset of the column structure of data frames. `return_spec` is reserved to name the return schema of the function.

Let's decorate a function with `TypeSignature`. The details of this decorator are documented [here](https://github.com/WinVector/Examples/tree/main/arg_types#readme).

In [6]:
@import_tc.TypeSignature({
        'a': int, 
        'b': {int, float}, 
        'c': {'x': int},
        },
        return_spec={'z': float})
def fn(a, /, b, *, c, d=None):
    """doc"""
    return d

This declares that `fn()` expects at least:

  * an argument `a` of type `int`.
  * an argument `b` of type `int` or `float`.
  * an argument `c` that is a data frame (implied by the dictionary argument), and that data frame contains a column `x` that has no non-null elements of type other than `int`.
  * The function returns a data frame (indicated by the dictionary argument) that has at least a column `z` that contains no non-null elements of type other than `float`.

This gives us some enforceable invariants that can improve our code.

We can see much of this repeated back in the decorator-altered `help()`.

In [7]:
help(fn)

Help on function fn in module __main__:

fn(a, /, b, *, c, d=None)
     arg specifications
    {'a': <class 'int'>,
     'b': {<class 'float'>, <class 'int'>},
     'c': {'x': <class 'int'>}}
     return specification:
    {'z': <class 'float'>}
    
    
    doc



It is a learnable schema specification convention.

Let's see it catch an error. We show what happens if we call `fn()` with none of the expected arguments.

In [8]:
threw = False
try:
    fn()
except TypeError as e:
    print(e)
    threw = True
assert threw


function fn(), issues:
expected arg a missing  
expected arg b missing  
expected arg c missing


We can try with just one argument missing.

In [9]:
threw = False
try:
    fn(1, 2)
except TypeError as e:
    print(e)
    threw = True
assert threw


function fn(), issues:
expected arg c missing
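
One way a decorator could work out which declared arguments a call failed to supply is the standard library's `inspect.signature(...).bind_partial(...)`. This is a sketch of that idea under our own names (`fn_plain`, `missing_args`); we have not confirmed that `type_signature` works this way.

```python
import inspect


def fn_plain(a, /, b, *, c, d=None):
    """Undecorated function with the same signature as fn() above."""
    return d


def missing_args(fn, expected_names, *args, **kwargs):
    """Hypothetical helper: names from `expected_names` not supplied by this call."""
    # bind_partial() maps positional and keyword values to parameter names
    # without requiring every parameter to be present.
    bound = inspect.signature(fn).bind_partial(*args, **kwargs)
    return [name for name in expected_names if name not in bound.arguments]


print(missing_args(fn_plain, ['a', 'b', 'c']))        # ['a', 'b', 'c']
print(missing_args(fn_plain, ['a', 'b', 'c'], 1, 2))  # ['c']
```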


Or, and this is where we start to get benefits, we can call with a wrong argument type.

In [10]:
threw = False
try:
    fn(1, 2, c=3)
except TypeError as e:
    print(e)
    threw = True
assert threw


function fn(), issues:
arg c expected type pandas.DataFrame, found type int


And we show that this checking pushes down into the structure of data frame arguments! 

In data science the natural types are data frame schemas; knowing the type of the outer variables just isn't an interesting invariant.

In [11]:
threw = False
try:
    fn(1, 2, c=pd.DataFrame({'z': [7]}))
except TypeError as e:
    print(e)
    threw = True
assert threw


function fn(), issues:
arg c missing required column 'x'


We can check column types in addition to mere column names.

In [12]:
threw = False
try:
    fn(1, 2, c=pd.DataFrame({'x': [3.0]}))
except TypeError as e:
    print(e)
    threw = True
assert threw


function fn(), issues:
arg c  column 'x' expected type int, found type float


And we check return types.

In [13]:

rv = None
threw = False
try:
    fn(1, 2, c=pd.DataFrame({'x': [30], "z": [17.2]}), d=pd.DataFrame({'q': [7.0]}))
except TypeError as e:
    print(e.args[0])
    rv = e.args[1]
    threw = True
assert threw

rv

fn() return value: missing required column 'z'


|   | q |
|---|---|
| 0 | 7.0 |


Notice the rejected return value is attached to the `TypeError` to help with diagnosis and debugging.
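
Attaching a value this way needs nothing beyond standard Python exceptions: extra constructor arguments are kept in `e.args`. The sketch below uses a plain dictionary in place of a data frame, and `checked_result()` is a hypothetical helper rather than part of `type_signature`.

```python
def checked_result(value, required_key):
    """Hypothetical check: raise with the offending value as a second exception argument."""
    if required_key not in value:
        raise TypeError(f"return value: missing required key '{required_key}'", value)
    return value


try:
    checked_result({'q': [7.0]}, 'z')
except TypeError as e:
    # args[0] is the message, args[1] is the rejected value itself.
    message, rejected = e.args
    print(message)
    print(rejected)
```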

Now, let's show a successful call.

In [14]:
fn(1, b=2, c=pd.DataFrame({'x': [3]}), d=pd.DataFrame({'z': [7.0]}))

|   | z |
|---|---|
| 0 | 7.0 |


We can turn off the checking with a single global command.

In [15]:
TypeCheckSwitch().off()

And now notice a previously failing call is no longer checked.

In [16]:
fn(1, 2, c=pd.DataFrame({'x': [30], "z": [17.2]}), d=pd.DataFrame({'q': [7.0]}))

|   | q |
|---|---|
| 0 | 7.0 |


The return value is missing the required `z` column, but with checks off the function is not interfered with.

The idea is: when checks are on, failures are detected much closer to their causes, making debugging and diagnosis much easier. Also, the decorations are an easy way to document, in human-readable form, some basics of the expected input and output schemas.

And the input and output schemas are attached to the function as objects.

In [17]:
pprint(fn.type_schema.arg_specs)

{'a': <class 'int'>,
 'b': {<class 'float'>, <class 'int'>},
 'c': {'x': <class 'int'>}}


In [18]:
pprint(fn.type_schema.return_spec)

{'z': <class 'float'>}


This makes the schema data available for other use, even some automatic checking of function composition conditions! 
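
As an illustration of what a composition check might look like, the sketch below compares a producer's return schema with a consumer's column schema, both given as plain dictionaries of the kind printed above. `composition_issues()` is a hypothetical helper of our own and is not provided by the module.

```python
def composition_issues(return_spec, arg_spec):
    """Hypothetical check that a producer's return schema satisfies a consumer's column spec.

    Both specs map column names to a type or a set of types.
    """
    def as_set(spec):
        return spec if isinstance(spec, set) else {spec}

    issues = []
    for col, allowed in arg_spec.items():
        if col not in return_spec:
            issues.append(f"column '{col}' is not guaranteed by the producer")
            continue
        # Any type the producer may emit that the consumer does not allow is a problem.
        extra = as_set(return_spec[col]) - as_set(allowed)
        for t in sorted(extra, key=lambda t: t.__name__):
            issues.append(f"column '{col}' may contain {t.__name__}")
    return issues


# fn() returns {'z': float}, while its own argument c requires {'x': int}.
print(composition_issues({'z': float}, {'x': int}))
print(composition_issues({'z': float}, {'z': {int, float}}))
```

The first call reports that column `x` is not guaranteed, so feeding the output of `fn()` back in as `c` would not be safe; the second call reports no issues.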

The implementation, documentation, and demo of this methodology can be found [here](https://github.com/WinVector/Examples/tree/main/arg_types).

This does run into what I call "the first rule of meta-programming": meta-programming only works as long as it doesn't run into other meta-programming. That being said, I feel these decorators can be very valuable in Python data science projects.