# A Brief Intro to Data Validation in Python with Pandera

## Pandera is a Data Validation Package for Python

- https://pandera.readthedocs.io/en/latest/index.html#
- Checks data in dataframes against schema (set of rules) 
- Schema (DataFrameSchema object) contains suite of tests that you specify
- You test data in dataframe against schema to see if it conforms to expectations
- Can be used with dataframes from:
  - pandas
  - polars
  - PySpark SQL
- A lot of functionality, we'll just cover basics here
- "validate" package for R is similar:
  - https://cran.r-project.org/web/packages/validate/vignettes/cookbook.html

In [1]:
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, Check
import json

### We'll use the Titanic Dataset for this Brief Tutorial

In [2]:
# read in titanic dataset 
df = pd.read_csv('../datasets/titanic.csv')

In [3]:
# browse df
df.head(10)

Unnamed: 0,survived,pclass,name,sex,age,fare,sibsp,parch
0,0,3,"Braund, Mr. Owen Harris",male,22.0,7.25,1,0
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,1,0
2,1,3,"Heikkinen, Miss. Laina",female,26.0,7.925,0,0
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1,1,0
4,0,3,"Allen, Mr. William Henry",male,35.0,8.05,0,0
5,0,1,"McCarthy, Mr. Timothy J",male,54.0,51.8625,0,0
6,0,3,"Palsson, Master. Gosta Leonard",male,2.0,21.075,3,1
7,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,11.1333,0,2
8,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,30.0708,1,0
9,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,16.7,1,1


In [4]:
# check data types of df
df.dtypes

survived      int64
pclass        int64
name         object
sex          object
age         float64
fare        float64
sibsp         int64
parch         int64
dtype: object

## We'll Create a simple DataFrameSchema for the Titanic Dataset

### We'll use these datatypes:

- survived: int
- pclass: int
- name: str
- sex: str
- age: float
- fare: float
- sibsp: int
- parch: int

In [5]:
# Create simple DataFrameSchema for titanic dataset

schema = DataFrameSchema(
                            {"survived": Column(int),
                             "pclass": Column(int),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )

In [6]:
try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

Validated!


## Column-Level Rules

### Testing for Presence/Absence of Columns
- required = True (default)
- required = False (allows absence of column)

In [7]:
# By default, all columns specified by schema are required
# implicit 'required = True'

try:
    schema.validate(df.drop('survived', axis = 1))
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

column 'survived' not in dataframe. Columns in dataframe: ['pclass', 'name', 'sex', 'age', 'fare', 'sibsp', 'parch']


In [8]:
# change the schema: setting required=False
# for a given column makes that column optional
schema = DataFrameSchema(
                            {"survived": Column(int, required=False),
                             "pclass": Column(int),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )

In [9]:
# after change to schema, we can validate the dataframe with missing column

try:
    schema.validate(df.drop('survived', axis = 1))
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

Validated!


### Testing for Data Type of Column

In [None]:
# What if I change the data type of the age column to str 
# and try to validate the df with our schema?
df['age'] = df['age'].astype(str)

try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

expected series 'age' to have type float64, got object


In [11]:
# Let's revert the data type of age to float, as expected by the schema
df['age'] = df['age'].astype(float)

try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

Validated!


### Testing for Null Values in Column

In [12]:
# show there are no null values
df.isnull().sum()

survived    0
pclass      0
name        0
sex         0
age         0
fare        0
sibsp       0
parch       0
dtype: int64

In [13]:
df['age'].head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: age, dtype: float64

In [14]:
# by default, null values won't validate

df.loc[0, 'age'] = None
df['age'].head()

0     NaN
1    38.0
2    26.0
3    35.0
4    35.0
Name: age, dtype: float64

In [15]:
try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

non-nullable series 'age' contains null values:
0   NaN
Name: age, dtype: float64


In [16]:
# use of nullable = True
# allows null values to pass validation
schema = DataFrameSchema(
                            {"survived": Column(int),
                             "pclass": Column(int),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float, nullable=True),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )

In [17]:
try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

Validated!


In [18]:
df.loc[0, 'age'] = 22.0
df['age'].head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: age, dtype: float64

### Testing that Values Are Greater Than/Less Than Specific Value

In [19]:
schema = DataFrameSchema(
                            {"survived": Column(int), 
                             "pclass": Column(int),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float, Check.less_than_or_equal_to(125)),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )

try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

Validated!


### Testing that Values Are In a Specific Range

In [20]:
schema = DataFrameSchema(
                            {"survived": Column(int), 
                             "pclass": Column(int),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float, Check.between(0, 125, include_min=True, include_max=True)),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )

try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

Validated!


### Testing that Values in Column are Members of Specified Set

In [21]:
schema = DataFrameSchema(
                            {"survived": Column(int), 
                             "pclass": Column(int, Check.isin([1, 2, 3])),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )

try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

Validated!


### Custom Tests on Column

In [30]:
schema = DataFrameSchema(
                            {"survived": Column(int), 
                             "pclass": Column(int),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )

# ADD A CUSTOM TEST WITH LAMBDA FUNCTION

### More Sophisticated Tests on Columns are Possible

- Using hypothesis package and scipy for statistical tests
  - Detecting data drift
- Checks involving multiple columns at once

## Dataframe-Level Tests

### Testing Presence/Absence of All Columns

- Columns not specified in schema aren't checked
- Can add tests at dataframe level to check for presence/absence of columns
  - strict = True
  - Only columns listed in schema can be present

In [31]:
# add column not in our schema
df['is_child'] = df['age'] < 14

In [32]:
df.head(10)

Unnamed: 0,survived,pclass,name,sex,age,fare,sibsp,parch,is_child
0,0,3,"Braund, Mr. Owen Harris",male,22.0,7.25,1,0,False
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,1,0,False
2,1,3,"Heikkinen, Miss. Laina",female,26.0,7.925,0,0,False
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1,1,0,False
4,0,3,"Allen, Mr. William Henry",male,35.0,8.05,0,0,False
5,0,1,"McCarthy, Mr. Timothy J",male,54.0,51.8625,0,0,False
6,0,3,"Palsson, Master. Gosta Leonard",male,2.0,21.075,3,1,True
7,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,11.1333,0,2,False
8,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,30.0708,1,0,False
9,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,16.7,1,1,True


In [33]:
# try to validate with new column (should validate successfully)
try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

Validated!


In [None]:
# add strict=True to schema
schema = DataFrameSchema(
                            {"survived": Column(int), 
                             "pclass": Column(int),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            },
                            strict=True
                        )

# with strict = True, won't validate with unspecified column
try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

column 'is_child' not in DataFrameSchema {'survived': <Schema Column(name=survived, type=DataType(int64))>, 'pclass': <Schema Column(name=pclass, type=DataType(int64))>, 'name': <Schema Column(name=name, type=DataType(str))>, 'sex': <Schema Column(name=sex, type=DataType(str))>, 'age': <Schema Column(name=age, type=DataType(float64))>, 'fare': <Schema Column(name=fare, type=DataType(float64))>, 'sibsp': <Schema Column(name=sibsp, type=DataType(int64))>, 'parch': <Schema Column(name=parch, type=DataType(int64))>}


In [38]:
df = df.drop('is_child', axis=1)

### Testing Order of Cols

- Used for enforcing column order specified in schema
  - ordered = True

In [39]:
schema = DataFrameSchema(
                            {"survived": Column(int), 
                             "pclass": Column(int),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            },
                            ordered = True
                        )

In [41]:
# change order of columns
# so 'survived' col is last

df = df[['pclass',
        'name',
        'sex',
        'age',
        'fare',
        'sibsp',
        'parch',
        'survived']]

In [42]:
# with ordered = True, new order of cols won't validate
try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

column 'pclass' out-of-order


In [43]:
# change order of columns
# back to match schema

df = df[['survived',
        'pclass',
        'name',
        'sex',
        'age',
        'fare',
        'sibsp',
        'parch']]

### Testing Uniqueness of Columns
- To make sure the combination of values in the listed columns isn't duplicated across rows
- unique = ["col_1", "col_2"]

### Removing Invalid Rows

- drop_invalid_rows = True (set on the schema; failing rows are dropped only when validating with lazy = True)

## Pandera can attempt to infer a schema automatically

In [None]:
#%%
# have pandera automatically infer the DataFrameSchema from the dataframe
schema_inferred = pa.infer_schema(df)

#%%
# inspect DataFrameSchema
print(schema_inferred)

## Lazy Validation
- By default, the first failed test raises an error
- Can set lazy = True to accumulate results of all tests and see all errors that have occurred
  - schema.validate(df, lazy = True)
- Easiest to view accumulated results as JSON


In [None]:
schema = DataFrameSchema(
                            {"survived": Column(int), 
                             "pclass": Column(int, Check.isin([2, 3])),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float, Check.between(5, 65, include_min=True, include_max=True)),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )

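# toggle between lazy=False and lazy=True to see difference
# lazy=False raises SchemaError on the first failure;
# lazy=True raises SchemaErrors with all failures collected
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    print(json.dumps(e.message, indent=2))
except pa.errors.SchemaError as e:
    print(e)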

{
  "DATA": {
    "DATAFRAME_CHECK": [
      {
        "schema": null,
        "column": "pclass",
        "check": "isin([2, 3])",
        "error": "Column 'pclass' failed element-wise validator number 0: isin([2, 3]) failure cases: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1"
      },
      {
        "schema": null,
        "column": "age",
        "check": "in_range(5, 65)",
        "error": "Column 'age' failed element-wise validator number 0: in_range(5, 65) failure cases: 2.0,

## Generating Error Reports
- Can write error reports to json file

In [47]:
schema = DataFrameSchema(
                            {"survived": Column(int), 
                             "pclass": Column(int, Check.isin([2, 3])),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float, Check.between(5, 65, include_min=True, include_max=True)),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )


try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    json_string = json.dumps(e.message, indent=2)
    print(json_string)

    # write the report only when validation failed;
    # json_string is undefined otherwise
    with open("../schema_error_reports/schema_error_log.json", "w") as file:
        file.write(json_string)

{
  "DATA": {
    "DATAFRAME_CHECK": [
      {
        "schema": null,
        "column": "pclass",
        "check": "isin([2, 3])",
        "error": "Column 'pclass' failed element-wise validator number 0: isin([2, 3]) failure cases: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1"
      },
      {
        "schema": null,
        "column": "age",
        "check": "in_range(5, 65)",
        "error": "Column 'age' failed element-wise validator number 0: in_range(5, 65) failure cases: 2.0,

## Schema Transformations
- Can be used to modify schemas without needing to redefine entire schema
  - schema.add_columns()
  - schema.remove_columns()
  - schema.update_columns()
  - schema.rename_columns()
  - schema.set_index()
  - schema.reset_index()

## Saving Schema Specifications
- Schema specifications can be saved to file
  - Python script (.py)
  - YAML file (.yaml)

In [None]:
# storing schema specification as python script
schema.to_script('../schemas/titanic_schema_inferred.py')

# storing schema specification as yaml file
schema.to_yaml('../schemas/titanic_schema_inferred.yaml')

## Using Pandera decorators for easy integration of data validation into data pipelines

### What's a decorator?
- a higher-order function that modifies the behavior of a function or class without directly changing its code
- essentially a wrapper that can be used to extend/alter original function
- denoted in python by the @ symbol

In [None]:
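import functools

# toy example of what a decorated function looks like
# (my_decorator is a placeholder that just calls the wrapped function)
def my_decorator(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@my_decorator
def my_function():
    pass

# behind the scenes, the decorator syntax above is equivalent to:
def my_function():
    pass
my_function = my_decorator(my_function)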

### Pandera Decorators
- @check_input
- @check_output
- @check_io