In [1]:
import pandas as pd
import pandera as pa
import numpy as np
from pandera.typing import Series, DataFrame

## What is Pandera?

pandera is a Union.ai open source project that provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust.

Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical data pipelines or reproducible research settings. With pandera, you can:
- Define a schema once and use it to validate different dataframe types including pandas, polars, dask, modin, and pyspark.pandas.
- Check the types and properties of columns in a pd.DataFrame or values in a pd.Series.
- Perform more complex statistical validation like hypothesis testing.
- Parse data to standardize the preprocessing steps needed to produce valid data.
- Seamlessly integrate with existing data analysis/processing pipelines via function decorators.
- Define dataframe models with the class-based API with pydantic-style syntax and validate dataframes using the typing syntax.
- Synthesize data from schema objects for property-based testing with pandas data structures.
- Lazily validate dataframes so that all validation rules are executed before raising an error.
- Integrate with a rich ecosystem of Python tools like pydantic, FastAPI and mypy.


We validate the cleaned data rather than the raw data, because raw data is likely to contain plenty of nonsense values. Only after cleaning does it make sense to validate the data against domain rules, for example that market values cannot be below 0.

In [2]:
# load cleaned data
data = pd.read_parquet('../data/player_info_cleaned.parquet')

In [3]:
data.head()

Unnamed: 0,dob,age,country,current_club,height,name,position,value_euro_m,joined_date,number,signed_from,signing_fee_euro_m,tm_id,tm_name,foot,season,team
0,1993-08-17,30,Brazil,Manchester City,188.0,Ederson,Goalkeeper,35.0,2017-07-01,31,SL Benfica,40.0,238223,ederson,left,2023,manchester-city
1,1992-11-06,31,Germany,Manchester City,185.0,Stefan Ortega,Goalkeeper,9.0,2022-07-01,18,Arminia Bielefeld,0.0,85941,stefan-ortega,right,2023,manchester-city
2,1995-04-02,29,United States,Colorado Rapids,191.0,Zack Steffen,Goalkeeper,2.0,2019-07-09,13,Columbus Crew SC,7.0,221624,zack-steffen,right,2023,manchester-city
3,2005-11-02,18,England,Manchester City U21,,True Grant,Goalkeeper,0.0,NaT,0,,0.0,919438,true-grant,unknown,2023,manchester-city
4,1985-09-03,38,England,Manchester City,188.0,Scott Carson,Goalkeeper,0.0,2021-07-20,33,Derby County,0.0,14555,scott-carson,right,2023,manchester-city


In [4]:
data['position'].unique().tolist()

['Goalkeeper',
 'Centre-Back',
 'Left-Back',
 'Right-Back',
 'Defensive Midfield',
 'Central Midfield',
 'Attacking Midfield',
 'Left Winger',
 'Right Winger',
 'Centre-Forward',
 'Left Midfield',
 'Second Striker',
 'Right Midfield']

## Create Pandera validation schema

There are two ways to validate a dataframe. The first is the DataFrameSchema, an object-based API for specifying a schema that validates the index and columns of a pandas dataframe. The second, and the primary focus of this notebook, is the DataFrameModel, a class-based API very similar to Pydantic. The DataFrameModel validates our dataframes directly, using both pandera and pandas data types. In my opinion, the DataFrameModel is a much more intuitive way of validating our data, particularly because it integrates seamlessly with pandas. A DataFrameModel can also be converted into a DataFrameSchema if necessary, so it gives you the best of both worlds.

In [5]:
data.dtypes

dob                   datetime64[ns]
age                            int64
country                     category
current_club                category
height                         Int16
name                          object
position                    category
value_euro_m                 float64
joined_date           datetime64[ns]
number                         Int16
signed_from                 category
signing_fee_euro_m           float64
tm_id                          Int64
tm_name                       object
foot                        category
season                         int64
team                        category
dtype: object

In [6]:
class PlayerSchema(pa.DataFrameModel):
    
    dob: Series[pd.Timestamp] = pa.Field(nullable=False, ge=pd.Timestamp('1975-01-01'))
    age: Series[pa.Int64] = pa.Field(ge=0, le=50, nullable=False)
    country: Series[pa.Category] = pa.Field(nullable=False)
    current_club: Series[pa.Category] = pa.Field(nullable=True)
    height: Series[pa.Int16] = pa.Field(ge=120, le=250, nullable=True)
    name: Series[pa.String] = pa.Field(nullable=False)
    position: Series[pa.Category] = pa.Field(nullable=False)
    value_euro_m: Series[pa.Float64] = pa.Field(ge=0, le=200)
    joined_date: Series[pd.Timestamp] = pa.Field(nullable=True, ge=pd.Timestamp('2000-01-01'))
    number: Series[pa.Int16] = pa.Field(ge=0, le=99)
    signed_from: Series[pa.Category] = pa.Field(nullable=True)
    signing_fee_euro_m: Series[pa.Float64] = pa.Field(ge=0, le=300, nullable=True)
    tm_id: Series[pa.Int64] = pa.Field(ge=0, unique=False, nullable=False)
    foot: Series[pa.Category] = pa.Field(nullable=False, isin=['right', 'left', 'both', 'unknown'])
    season: Series[pa.Int64] = pa.Field(ge=2003, le=2025, nullable=False)
    team: Series[pa.Category] = pa.Field(nullable=False)
        

Above, we define a schema by subclassing pa.DataFrameModel, in the same way you subclass BaseModel in Pydantic. We then populate the schema with the columns in our dataset, giving the data type expected in each column and defining bounds and nullability with pa.Field. Columns that are not declared in the schema, such as tm_name, are not checked.

In [7]:
@pa.check_types
def load_data() -> DataFrame[PlayerSchema]:
    return pd.read_parquet('../data/player_info_cleaned.parquet')

In [8]:
def validate_data() -> None:
    try:
        stats = load_data()
        print(stats.head())
    except pa.errors.SchemaError as e:
        print(e)

In [9]:
validate_data()

error in check_types decorator of function 'load_data': Column 'height' failed element-wise validator number 0: greater_than_or_equal_to(120) failure cases: 17


## Reusing Field objects

In [10]:
from functools import partial

NullableField = partial(pa.Field, nullable=True)
NotNullableField = partial(pa.Field, nullable=False)

In [11]:
class PlayerSchema(pa.DataFrameModel):
    
    dob: Series[pd.Timestamp] = pa.Field(nullable=False, ge=pd.Timestamp('1975-01-01'))
    age: Series[pa.Int64] = pa.Field(ge=0, le=50, nullable=False)
    country: Series[pa.Category] = NotNullableField()
    current_club: Series[pa.Category] = NullableField() 
    height: Series[pa.Int16] = pa.Field(ge=120, le=250, nullable=True)
    name: Series[pa.String] = NotNullableField()
    position: Series[pa.Category] = NotNullableField()
    value_euro_m: Series[pa.Float64] = pa.Field(ge=0, le=200)
    joined_date: Series[pd.Timestamp] = pa.Field(nullable=True, ge=pd.Timestamp('2000-01-01'))
    number: Series[pa.Int16] = pa.Field(ge=0, le=99)
    signed_from: Series[pa.Category] = NullableField()
    signing_fee_euro_m: Series[pa.Float64] = pa.Field(ge=0, le=300, nullable=True)
    tm_id: Series[pa.Int64] = pa.Field(ge=0, unique=False, nullable=False)
    foot: Series[pa.Category] = pa.Field(nullable=False, isin=['right', 'left', 'both', 'unknown'])
    season: Series[pa.Int64] = pa.Field(ge=2003, le=2025, nullable=False)
    team: Series[pa.Category] = NotNullableField()

## Adding in specific column checks

In [12]:
class PlayerSchema(pa.DataFrameModel):
    
    dob: Series[pd.Timestamp] = pa.Field(nullable=False, ge=pd.Timestamp('1975-01-01'))
    age: Series[pa.Int64] = pa.Field(ge=0, le=50, nullable=False)
    country: Series[pa.Category] = NotNullableField()
    current_club: Series[pa.Category] = NullableField() 
    height: Series[pa.Int16] = pa.Field(ge=120, le=250, nullable=True)
    name: Series[pa.String] = NotNullableField()
    position: Series[pa.Category] = NotNullableField()
    value_euro_m: Series[pa.Float64] = pa.Field(ge=0, le=200)
    joined_date: Series[pd.Timestamp] = pa.Field(nullable=True, ge=pd.Timestamp('2000-01-01'))
    number: Series[pa.Int16] = pa.Field(ge=0, le=99)
    signed_from: Series[pa.Category] = NullableField()
    signing_fee_euro_m: Series[pa.Float64] = pa.Field(ge=0, le=300, nullable=True)
    tm_id: Series[pa.Int64] = pa.Field(ge=0, unique=False, nullable=False)
    foot: Series[pa.Category] = pa.Field(nullable=False, isin=['right', 'left', 'both', 'unknown'])
    season: Series[pa.Int64] = pa.Field(ge=2003, le=2025, nullable=False)
    team: Series[pa.Category] = NotNullableField()
        
    # @pa.check("height")
    # def height_check(cls, height: Series[pa.Int16]) -> Series[bool]:
    #     return height > 120
    

In [13]:
validate_data()

error in check_types decorator of function 'load_data': Column 'height' failed element-wise validator number 0: greater_than_or_equal_to(120) failure cases: 17


## Implementing checks in processing step

#### Check heights

In [14]:
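def clean_height(df: pd.DataFrame) -> pd.DataFrame:
    data = df.copy()
    # round the median so it can be stored in the Int16 height column
    median_height = int(round(data[PlayerSchema.height].median()))
    # replace implausibly small heights with the median height
    data.loc[data[PlayerSchema.height] < 120, PlayerSchema.height] = median_height
    return data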

In [15]:
data = clean_height(data)

In [16]:
def validate_data_2(df: pd.DataFrame) -> DataFrame[PlayerSchema]:
    try:
        # validate explicitly so that a SchemaError is raised inside the try block
        return PlayerSchema.validate(df)
    except pa.errors.SchemaError as e:
        print(e)

In [17]:
validate_data_2(data)

Unnamed: 0,dob,age,country,current_club,height,name,position,value_euro_m,joined_date,number,signed_from,signing_fee_euro_m,tm_id,tm_name,foot,season,team
0,1993-08-17,30,Brazil,Manchester City,188,Ederson,Goalkeeper,35.0,2017-07-01,31,SL Benfica,40.0,238223,ederson,left,2023,manchester-city
1,1992-11-06,31,Germany,Manchester City,185,Stefan Ortega,Goalkeeper,9.0,2022-07-01,18,Arminia Bielefeld,0.0,85941,stefan-ortega,right,2023,manchester-city
2,1995-04-02,29,United States,Colorado Rapids,191,Zack Steffen,Goalkeeper,2.0,2019-07-09,13,Columbus Crew SC,7.0,221624,zack-steffen,right,2023,manchester-city
3,2005-11-02,18,England,Manchester City U21,,True Grant,Goalkeeper,0.0,NaT,0,,0.0,919438,true-grant,unknown,2023,manchester-city
4,1985-09-03,38,England,Manchester City,188,Scott Carson,Goalkeeper,0.0,2021-07-20,33,Derby County,0.0,14555,scott-carson,right,2023,manchester-city
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
851,1998-04-10,26,Scotland,Luton Town,178,Jacob Brown,Centre-Forward,4.0,2023-08-10,19,Stoke City,3.0,469958,jacob-brown,right,2023,luton-town
852,1994-12-02,29,England,Luton Town,184,Cauley Woodrow,Centre-Forward,1.0,2022-07-01,10,Barnsley FC,0.0,169801,cauley-woodrow,right,2023,luton-town
853,2002-11-18,21,Wales,Luton Town,,Joe Taylor,Centre-Forward,0.0,2023-01-31,0,Peterborough United,0.0,944551,joe-taylor,right,2023,luton-town
854,1998-08-21,25,Zimbabwe,No Club,183,Admiral Muskwe,Centre-Forward,0.0,2021-07-15,0,Leicester City U21,0.0,314378,admiral-muskwe,right,2023,luton-town


In [18]:
# validate dataframe
DataFrame[PlayerSchema](data)

Unnamed: 0,dob,age,country,current_club,height,name,position,value_euro_m,joined_date,number,signed_from,signing_fee_euro_m,tm_id,tm_name,foot,season,team
0,1993-08-17,30,Brazil,Manchester City,188,Ederson,Goalkeeper,35.0,2017-07-01,31,SL Benfica,40.0,238223,ederson,left,2023,manchester-city
1,1992-11-06,31,Germany,Manchester City,185,Stefan Ortega,Goalkeeper,9.0,2022-07-01,18,Arminia Bielefeld,0.0,85941,stefan-ortega,right,2023,manchester-city
2,1995-04-02,29,United States,Colorado Rapids,191,Zack Steffen,Goalkeeper,2.0,2019-07-09,13,Columbus Crew SC,7.0,221624,zack-steffen,right,2023,manchester-city
3,2005-11-02,18,England,Manchester City U21,,True Grant,Goalkeeper,0.0,NaT,0,,0.0,919438,true-grant,unknown,2023,manchester-city
4,1985-09-03,38,England,Manchester City,188,Scott Carson,Goalkeeper,0.0,2021-07-20,33,Derby County,0.0,14555,scott-carson,right,2023,manchester-city
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
851,1998-04-10,26,Scotland,Luton Town,178,Jacob Brown,Centre-Forward,4.0,2023-08-10,19,Stoke City,3.0,469958,jacob-brown,right,2023,luton-town
852,1994-12-02,29,England,Luton Town,184,Cauley Woodrow,Centre-Forward,1.0,2022-07-01,10,Barnsley FC,0.0,169801,cauley-woodrow,right,2023,luton-town
853,2002-11-18,21,Wales,Luton Town,,Joe Taylor,Centre-Forward,0.0,2023-01-31,0,Peterborough United,0.0,944551,joe-taylor,right,2023,luton-town
854,1998-08-21,25,Zimbabwe,No Club,183,Admiral Muskwe,Centre-Forward,0.0,2021-07-15,0,Leicester City U21,0.0,314378,admiral-muskwe,right,2023,luton-town
