## Pandera and how to integrate it with Pandas
Introduction to Pandera
Pandera is a Python library for data validation that integrates with Pandas. It lets you define validation schemas declaratively and run checks against DataFrames and Series.

Key Features
1. Pandas Integration: Works directly with DataFrames and Series
2. Declarative Validation: Defines rules clearly and legibly
3. Data Types: Supports validation of native Python and Pandas data types
4. Custom Validation: Allows you to create custom rules
5. Error Messages: Provides detailed messages about validation failures

In [1]:
pip install pandera


In [None]:
import pandas as pd
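# Recent pandera releases recommend `import pandera.pandas as pa`;
# the top-level import below still works but emits a FutureWarning
import pandera as pa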

# Creating a sample DataFrame
df = pd.DataFrame({
    'age': [25, 30, 35],
    'salary': [5000, 6000, 7000],
    'email': ['joao@email.com', 'maria@email.com', 'pedro@email.com']
})

# Defining the validation schema
schema = pa.DataFrameSchema({
    'age': pa.Column(int, checks=pa.Check.ge(0)),  # age >= 0
    # salary > 0; the sample values are ints, so this float dtype check fails unless coerce=True
    'salary': pa.Column(float, checks=pa.Check.gt(0)),
    'email': pa.Column(str, checks=pa.Check.str_matches(r'^[^@]+@[^@]+\.[^@]+$'))
})

# Validating the DataFrame
try:
    schema.validate(df)
    print("Valid data!")
except pa.errors.SchemaError as e:
    print(f"Validation error: {e}")


Validation error: expected series 'salary' to have type float64, got int64


FutureWarning: Importing pandas-specific classes and functions from the top-level pandera module will be **removed in a future version of pandera**.
If you're using pandera to validate pandas objects, we highly recommend updating
your import:

```
# old import
import pandera as pa

# new import
import pandera.pandas as pa
```

If you're using pandera to validate objects from other compatible libraries
like pyspark or polars, see the supported libraries section of the documentation
for more information on how to import pandera:

https://pandera.readthedocs.io/en/stable/supported_libraries.html





# Supported Validation Types
Column Validation:

- Data type
- Null values
- Unique values
- Values in a specific list

Row Validation:

- Relationships between columns
- Aggregations
- Complex conditions

DataFrame Validation:

- DataFrame structure
- Number of rows/columns
- Indexes

# Advanced Validation Example

In [4]:
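# More complex schema (shown for illustration; `df` has no 'department' column,
# so it is not validated against this schema here)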
advanced_schema = pa.DataFrameSchema({
    'age': pa.Column(
        int,
        checks=[
            pa.Check.ge(0),   # Age must be greater than or equal to zero
            pa.Check.le(120)  # Age must be less than or equal to 120
        ],
        nullable=False  # Null values are not allowed
    ),
    'salary': pa.Column(
        float,
        checks=[
            pa.Check.gt(0),        # Salary must be greater than zero
            pa.Check.lt(1_000_000) # Salary must be less than 1 million
        ]
    ),
    'email': pa.Column(
        str,
        checks=pa.Check.str_matches(r'^[^@]+@[^@]+\.[^@]+$'),  # Validates email format
        nullable=False  # Null values are not allowed
    ),
    'department': pa.Column(
        str,
        checks=pa.Check.isin(['IT', 'HR', 'Sales', 'Finance'])  # Department must be in this allowed list
    )
})

# Transformation to apply after validation
def adjust_salary(df):
    df = df.copy()  # avoid mutating the caller's DataFrame
    df['salary'] = df['salary'] * 1.1  # 10% increase
    return df

# Schema that coerces 'salary' to float before checking it
schema_with_transformation = pa.DataFrameSchema({
    'salary': pa.Column(
        float,
        checks=pa.Check.gt(0),
        coerce=True
    )
})

# Example usage:
validated_df = schema_with_transformation.validate(df)  # Coerces salary to float and checks salary > 0
transformed_df = validated_df.pipe(adjust_salary)       # Applies the transformation with pipe


In [5]:
validated_df

|   | age | salary | email |
|---|-----|--------|-------|
| 0 | 25 | 5000.0 | joao@email.com |
| 1 | 30 | 6000.0 | maria@email.com |
| 2 | 35 | 7000.0 | pedro@email.com |


In [6]:
transformed_df

|   | age | salary | email |
|---|-----|--------|-------|
| 0 | 25 | 5500.0 | joao@email.com |
| 1 | 30 | 6600.0 | maria@email.com |
| 2 | 35 | 7700.0 | pedro@email.com |
