## Defining Schemas and Data Validations
### Basic Structure of a Schema
A schema in Pandera is defined using the DataFrameSchema class and consists of:

- Columns and their types
- Validation rules
- Additional settings

In [1]:
!pip install pandera


In [3]:
import pandera as pa

## Basic Validations
### Type Validation

In [4]:
schema_type = pa.DataFrameSchema({
    'age': pa.Column(int),      # must be an integer
    'name': pa.Column(str),     # must be a string
    'salary': pa.Column(float)  # must be a float
})


FutureWarning: Importing pandas-specific classes and functions from the
top-level pandera module will be **removed in a future version of pandera**.
If you're using pandera to validate pandas objects, we highly recommend updating
your import:

```
# old import
import pandera as pa

# new import
import pandera.pandas as pa
```

If you're using pandera to validate objects from other compatible libraries
like pyspark or polars, see the supported libraries section of the documentation
for more information on how to import pandera:

https://pandera.readthedocs.io/en/stable/supported_libraries.html



### Range Validation

In [5]:
schema_range = pa.DataFrameSchema({
    'age': pa.Column(
        int,
        checks=[
            pa.Check.ge(0),    # greater than or equal to 0
            pa.Check.le(120)   # less than or equal to 120
        ]
    ),
    'score': pa.Column(
        float,
        checks=[
            pa.Check.ge(0),    # greater than or equal to 0
            pa.Check.le(10)    # less than or equal to 10
        ]
    )
})


## Advanced Validations
### Format Validation

In [6]:
schema_format = pa.DataFrameSchema({
    'email': pa.Column(
        str,
        checks=pa.Check.str_matches(r'^[^@]+@[^@]+\.[^@]+$')  # email format
    ),
    'cpf': pa.Column(
        str,
        checks=pa.Check.str_matches(r'^\d{3}\.\d{3}\.\d{3}-\d{2}$')  # CPF format
    )
})


### Validation of Unique Values

In [7]:
schema_unique = pa.DataFrameSchema({
    'id': pa.Column(
        int,
        checks=pa.Check(lambda s: s.is_unique),  # values must be unique (is_unique is a property, not a method)
        nullable=False
    )
})


### Validating Values in a List

In [8]:
schema_list = pa.DataFrameSchema({
    'state': pa.Column(
        str,
        checks=pa.Check.isin(['SP', 'RJ', 'MG', 'RS'])
    ),
    'status': pa.Column(
        str,
        checks=pa.Check.isin(['active', 'inactive', 'pending'])
    )
})


## Custom Validations
### Custom Validation Functions

In [9]:
def check_column_sum(df):
    return df['column1'] + df['column2'] == df['total']

schema_custom = pa.DataFrameSchema({
    'column1': pa.Column(float),
    'column2': pa.Column(float),
    'total': pa.Column(float)
}, checks=pa.Check(check_column_sum))


### Validation with Lambda Expressions

In [10]:
schema_lambda = pa.DataFrameSchema({
    'age': pa.Column(
        int,
        checks=pa.Check(lambda x: x > 0)  # must be greater than 0
    ),
    'salary': pa.Column(
        float,
        checks=pa.Check(lambda x: x > 1000)  # must be greater than 1000
    )
})


### Full DataFrame Validation

In [11]:
schema_complete = pa.DataFrameSchema({
    'id': pa.Column(
        int,
        checks=[
            pa.Check(lambda x: x.is_unique),    # must be unique (property, not a method call)
            pa.Check.gt(0)                      # must be greater than 0
        ],
        nullable=False
    ),
    'name': pa.Column(
        str,
        checks=pa.Check.str_length(min_value=3, max_value=100),  # length 3–100
        nullable=False
    ),
    'age': pa.Column(
        int,
        checks=[
            pa.Check.ge(0),    # >= 0
            pa.Check.le(120)   # <= 120
        ]
    ),
    'email': pa.Column(
        str,
        checks=pa.Check.str_matches(r'^[^@]+@[^@]+\.[^@]+$'),  # valid email format
        nullable=False
    ),
    'salary': pa.Column(
        float,
        checks=[
            pa.Check.gt(0),         # > 0
            pa.Check.lt(1_000_000)  # < 1,000,000
        ]
    )
})


## Tips for Defining Efficient Schemas
1. Start with the Basics: Define the essential types and rules first
2. Add Validations Gradually: Implement more complex validations as needed
3. Document the Rules: Add comments explaining the purpose of each validation
4. Test the Schemas: Verify that the validations are working as expected
5. Keep Performance in Mind: Avoid overly complex validations that could impact performance