In [58]:
import os

import pandas as pd
import pandera as pa

Welcome to this mini-tutorial on the Python library pandera. Pandera validates dataframe objects against schemas that you define. You can integrate these checks into your data pipelines to test your data continuously, catch data errors early, and support data quality monitoring.

Let us start by reading in a small dataframe. Its column names already hint at the intended data type of each column.

In [59]:
df = pd.read_csv(os.path.join("dummy-data", "dummy.csv"))

In [60]:
df

Unnamed: 0,var_bool,var_signed_int,var_unsigned_int,var_float,var_str,var_date,var_timestamp,var_categorical,var_constant
0,True,2,2,-3.5,this,12.02.1992,1994-10-21 04:03:37,a,a
1,False,3,3,2.7,is,01.03.1989,2017-05-03 01:35:46,b,a
2,True,-4,4,4.6,a,13.05.2006,1981-10-23 06:55:57,b,a
3,False,-10,5,-100.7,test,11.12.2022,2013-04-12 11:44:26,a,a
4,True,200,6,,hello,02.04.1987,2001-04-29 16:04:09,a,a
5,False,1,7,4.3,hi,14.03.1996,1997-01-17 09:16:10,b,a
6,True,2,8,4.3,hi,12.02.1992,1961-10-29 11:12:22,b,a
7,True,-2,9,4.3,hi,01.03.1989,1978-07-11 08:32:47,a,a
8,True,-3,0,4.3,hi,13.05.2006,2018-10-16 12:32:45,b,a
9,False,-4,1,4.3,hi,25.08.2009,2012-05-01 06:50:06,b,a


To remove any doubt, here is the intended schema:

| Column/variable name | Data type                                     |
|----------------------|-----------------------------------------------|
| var_bool             | boolean                                       |
| var_signed_int       | signed integer (positive and negative values) |
| var_unsigned_int     | unsigned integer (only non-negative values)   |
| var_float            | floating point number                         |
| var_str              | string                                        |
| var_date             | date                                          |
| var_timestamp        | timestamp                                     |
| var_categorical      | categorical                                   |
| var_constant         | constant                                      |


In [None]:
# DataFrameSchema.update_column

## Schema inference

Imagine you have a complex dataset and you do not want to write the schema from scratch. After all, time is limited and manual processes are error-prone. Luckily, pandera comes with a function that can generate a first draft of a pandera schema.

In [61]:
schema = pa.infer_schema(df)
print(schema)
# schema.to_script("dummy-schema.py")


<Schema DataFrameSchema(
    columns={
        'var_bool': <Schema Column(name=var_bool, type=DataType(bool))>
        'var_signed_int': <Schema Column(name=var_signed_int, type=DataType(int64))>
        'var_unsigned_int': <Schema Column(name=var_unsigned_int, type=DataType(int64))>
        'var_float': <Schema Column(name=var_float, type=DataType(float64))>
        'var_str': <Schema Column(name=var_str, type=DataType(object))>
        'var_date': <Schema Column(name=var_date, type=DataType(object))>
        'var_timestamp': <Schema Column(name=var_timestamp, type=DataType(object))>
        'var_categorical': <Schema Column(name=var_categorical, type=DataType(object))>
        'var_constant': <Schema Column(name=var_constant, type=DataType(object))>
    },
    checks=[],
    coerce=True,
    dtype=None,
    index=<Schema Index(name=None, type=DataType(int64))>,
    strict=False,
    name=None,
    ordered=False,
    unique_column_names=False,
    metadata=None, 
    add_missing_colum