# A Brief Intro to Data Validation in Python with Pandera

## Pandera is a Data Validation Package for Python

- https://pandera.readthedocs.io/en/latest/index.html#
- Checks data in dataframes against schema (set of rules) 
- Schema (DataFrameSchema object) contains suite of tests that you specify
- You test data in dataframe against schema to see if it conforms to expectations
- Can be used with dataframes from:
  - pandas
  - polars
  - PySpark SQL
- Offers a lot of functionality; we'll just cover the basics here
- "validate" package for R is similar:
  - https://cran.r-project.org/web/packages/validate/vignettes/cookbook.html

In [2]:
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, Check

### We'll use the Titanic Dataset for this Brief Tutorial

In [3]:
# read in titanic dataset 
df = pd.read_csv('../datasets/titanic.csv')

In [4]:
# browse df
df.head(10)

Unnamed: 0,survived,pclass,name,sex,age,fare,sibsp,parch
0,0,3,"Braund, Mr. Owen Harris",male,22.0,7.25,1,0
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,1,0
2,1,3,"Heikkinen, Miss. Laina",female,26.0,7.925,0,0
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1,1,0
4,0,3,"Allen, Mr. William Henry",male,35.0,8.05,0,0
5,0,1,"McCarthy, Mr. Timothy J",male,54.0,51.8625,0,0
6,0,3,"Palsson, Master. Gosta Leonard",male,2.0,21.075,3,1
7,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,11.1333,0,2
8,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,30.0708,1,0
9,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,16.7,1,1


In [5]:
# check data types of df
df.dtypes

survived      int64
pclass        int64
name         object
sex          object
age         float64
fare        float64
sibsp         int64
parch         int64
dtype: object

## We'll Create a simple DataFrameSchema for the Titanic Dataset

### We'll use these datatypes:

- survived: int
- pclass: int
- name: str
- sex: str
- age: float
- fare: float
- sibsp: int
- parch: int

In [None]:
# Create simple DataFrameSchema for titanic dataset

schema = DataFrameSchema(
                            {"survived": Column(int),
                             "pclass": Column(int),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )

In [8]:
try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

Validated!


## Column-Level Rules

### Testing for Presence/Absence of Columns
- required = True (default)
- required = False (allows absence of column)

In [9]:
# By default, all columns specified by schema are required
# implicit 'required = True'

try:
    schema.validate(df.drop('survived', axis = 1))
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

column 'survived' not in dataframe. Columns in dataframe: ['pclass', 'name', 'sex', 'age', 'fare', 'sibsp', 'parch']


In [10]:
# change the schema: setting required = False
# for a given column makes that column optional
schema = DataFrameSchema(
                            {"survived": Column(int, required=False),
                             "pclass": Column(int),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )

In [12]:
# after change to schema, we can validate the dataframe with missing column

try:
    schema.validate(df.drop('survived', axis = 1))
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

Validated!


### Testing for Data Type of Column

In [13]:
# What if I change the data type of the age column to str 
# and try to validate the df with our schema?
df['age'] = df['age'].astype(str)

try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

expected series 'age' to have type float64, got object


In [15]:
# Let's revert the data type of age to be float as expected by schema
df['age'] = df['age'].astype(float)

try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

Validated!


### Testing for Null Values in Column

In [None]:
# show there are no null values
df.isnull().sum()

survived    0
pclass      0
name        0
sex         0
age         0
fare        0
sibsp       0
parch       0
dtype: int64

In [20]:
df['age'].head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: age, dtype: float64

In [22]:
# by default, null values won't validate

df.loc[0, 'age'] = None
df['age'].head()

0     NaN
1    38.0
2    26.0
3    35.0
4    35.0
Name: age, dtype: float64

In [23]:
try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

non-nullable series 'age' contains null values:
0   NaN
Name: age, dtype: float64


In [24]:
# use of nullable = True on a column
# allows null values in that column to pass validation
schema = DataFrameSchema(
                            {"survived": Column(int),
                             "pclass": Column(int),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float, nullable=True),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )

In [25]:
try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

Validated!


In [27]:
df.loc[0, 'age'] = 22.0
df['age'].head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: age, dtype: float64

### Testing that Values Are Greater Than/Less Than Specific Value

In [28]:
schema = DataFrameSchema(
                            {"survived": Column(int), 
                             "pclass": Column(int),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float, Check.less_than_or_equal_to(125)),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )

try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

Validated!


### Testing that Values Are In a Specific Range

In [None]:
schema = DataFrameSchema(
                            {"survived": Column(int), 
                             "pclass": Column(int),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float, Check.between(0, 125, include_min=True, include_max=True)),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )

try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

### Testing that Values in Column are Members of Specified Set

In [34]:
schema = DataFrameSchema(
                            {"survived": Column(int), 
                             "pclass": Column(int, Check.isin([1, 2, 3])),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )

try:
    schema.validate(df)
    print('Validated!')
except pa.errors.SchemaError as schema_error:
    print(schema_error)

Validated!


### Custom Tests on Column

In [None]:
schema = DataFrameSchema(
                            {"survived": Column(int), 
                             "pclass": Column(int),
                             "name": Column(str),
                             "sex": Column(str),
                             "age": Column(float),
                             "fare": Column(float),
                             "sibsp": Column(int),
                             "parch": Column(int)
                            }
                        )

### More Sophisticated Tests on Columns are Possible

- Using hypothesis package and scipy for statistical tests
  - Detecting data drift
- Checks involving multiple columns at once

### Element-Wise vs Vectorized Column Checks

## Dataframe-Level Rules

### Testing Presence/Absence of All Columns

- Used for enforcing presence/absence of columns at the dataframe level
- Only columns listed in the schema can be present
  - strict = True

### Testing Order of Columns

- Used for enforcing column order specified in schema
  - ordered = True

### Testing Uniqueness of Columns

### Removing Unspecified Columns
- strict = 'filter'

### Adding Missing Columns
- add_missing_columns = True

### Removing Invalid Rows

- drop_invalid_rows = True

## DataFrameSchema Transformations
- schema.add_columns()
- schema.remove_columns()
- schema.update_columns()
- schema.rename_columns()
- schema.set_index()
- schema.reset_index()

## Pandera can attempt to infer a schema automatically

In [None]:
#%%
# have pandera automatically infer the DataFrameSchema from the dataframe
schema_inferred = pa.infer_schema(df)

#%%
# inspect DataFrameSchema
print(schema_inferred)

## Lazy Validation and Generating Error Reports
- Lazy Validation
  - By default, the first failed check/test raises an error (SchemaError)
  - If lazy = True, pandera waits until all checks/tests have run, then raises a single SchemaErrors exception so you can see every failure
  - schema.validate(df, lazy = True)


## Saving Schema Specifications
- Schema specifications can be saved to file
  - Python script (.py)
  - YAML file (.yaml)

In [None]:
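# storing inferred schema specification as python script
# (to_script and to_yaml rely on pandera's optional io dependencies)
schema_inferred.to_script('../schemas/titanic_schema_inferred.py')

# storing inferred schema specification as yaml file
schema_inferred.to_yaml('../schemas/titanic_schema_inferred.yaml')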

## Using Pandera decorators for easy integration of data validation into data pipelines

### What's a decorator?
- a higher-order function that modifies the behavior of a function or class without directly changing its code
- essentially a wrapper that can be used to extend/alter the original function
- applied in Python with the @ symbol

In [None]:
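# toy example of what a decorated function looks like
import functools

def my_decorator(func):
    # wrapper is where extra behavior would go, around the original call
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@my_decorator
def my_function():
    pass

# behind the scenes, the @ syntax above is equivalent to:
def my_function():
    pass
my_function = my_decorator(my_function)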

### Pandera Decorators
- @check_input
- @check_output
- @check_io