# Type annotations and dataframe models

We make use of Python type annotations (also called type hints).

Why type annotations are good:
* They make code easier to read and understand, both for humans and for tools such as IDEs.
* They make code easier to maintain and refactor.
* Static type checkers and code analysis can catch errors before the code is run.
* Libraries like dataclasses, attrs, Pydantic or Pandera can use type annotations to define types and validate data.

Why type annotations are sometimes hard:
* Python's type system is not perfect, as the language was not designed for static typing from the beginning.
* Libraries like Pandas use a lot of dynamic typing and are not easy to type hint and check.

## Type annotations use cases

### A simple case for functions

```python
def add_integers(a: int, b: int) -> int:
    return a + b
```
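
Note that the interpreter does not enforce annotations at run time; they are metadata for tools. The sketch below uses a deliberately wrong, purely illustrative call to show that a mistyped invocation still runs, which is the kind of error a static checker such as mypy is meant to report before execution:

```python
def add_integers(a: int, b: int) -> int:
    return a + b


# The interpreter does not check the annotations: str + str is valid Python,
# so this call concatenates the strings instead of failing.
result = add_integers("1", "2")  # a static type checker would flag this call
print(result)

# The annotations are only stored as metadata on the function object.
print(add_integers.__annotations__)
```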


### Dataclass definitions

```python
from dataclasses import dataclass
import datetime

@dataclass
class Person:
    name: str
    date_of_birth: datetime.date
```

### A generics example

```python
from typing import TypeVar

T = TypeVar('T')

def first_element(items: list[T]) -> T:
    return items[0]
```
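
A short usage sketch for the function above: the type variable `T` ties the return type to the element type of the argument, so a checker can infer `int` for a list of integers and `str` for a list of strings. The sample lists are illustrative only.

```python
from typing import TypeVar

T = TypeVar("T")


def first_element(items: list[T]) -> T:
    return items[0]


# A type checker infers T = int here, so `number` is treated as an int.
number = first_element([3, 1, 2])

# Here T = str, so string methods are accepted on `word`.
word = first_element(["north", "west"])

print(number + 1, word.upper())
```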

## Type hints for Pandas

Pandas is a bit special as it uses a lot of dynamic typing and is not easy to type hint and check.
E.g., `DataFrame` is quite a complex type with parametrised (d)types of columns or index.
That's why the [pandas-stubs](https://github.com/pandas-dev/pandas-stubs) project 
was only recently adopted by the Pandas team and is still incomplete and in development.

Example:

In [None]:
import pandas as pd
import numpy as np

temp_wind = pd.DataFrame(
    {"temperature": [15, 16.5, 17.3], "wind_direction": ["N", "N", "W"]},
    index=pd.date_range("2021-01-01", "2021-01-03"),   
).astype({"temperature": np.float32, "wind_direction": pd.StringDtype("pyarrow")})

A straightforward type hint for a `DataFrame` would be:

```python
import pandas as pd

def analyse_wind(wind_data: pd.DataFrame) -> pd.DataFrame:
    ...
```

This is definitely a good minimum. However, can you guess the correct type annotation for `wind_data` that would describe the column and index names and types?
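
To illustrate the limitation, here is a minimal sketch with a hypothetical helper `mean_temperature` and made-up data. Both frames satisfy the `pd.DataFrame` annotation, yet only one has the column the function needs, and the problem surfaces only at run time:

```python
import pandas as pd


def mean_temperature(wind_data: pd.DataFrame) -> float:
    # The annotation says nothing about which columns must exist.
    return float(wind_data["temperature"].mean())


good = pd.DataFrame({"temperature": [15.0, 16.5, 17.3]})
bad = pd.DataFrame({"temp": [15.0, 16.5, 17.3]})  # wrong column name

print(mean_temperature(good))

try:
    mean_temperature(bad)
    missing_column_detected = False
except KeyError as err:
    # A plain DataFrame annotation cannot express the required column.
    missing_column_detected = True
    print(f"KeyError: {err}")
```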

Note that we do not include `pandas-stubs` in the requirements (although we tested it): it would simply be too much effort, and `# type: ignore` would likely be needed in many places.

## Pandera DataFrame Models

Pandera introduced [DataFrame Models](https://pandera.readthedocs.io/en/stable/dataframe_models.html),
a concept derived from Pydantic models. This concept provides a means to type-annotate Pandas DataFrames
and to validate data. An [integration with mypy](https://pandera.readthedocs.io/en/stable/mypy_integration.html#mypy-integration) exists. Unfortunately, at the time of writing, there was a bug in this integration for `Union` (hence also `Optional`)
types, so we do not use it. See the corresponding [GitHub issue](https://github.com/unionai-oss/pandera/issues/1204).


Quoting the documentation of Pandera:

> Pandera provides a class-based API that’s heavily inspired by pydantic. In contrast to the object-based API, you can define dataframe models in much the same way you’d define pydantic models.

* This is also similar to how `dataclasses` or `attrs` work. 
* We will see how it differs from the object-based API later.
 
> `DataFrameModel`s are annotated with the `pandera.typing` module using the standard typing syntax. Models can be explicitly converted to a `DataFrameSchema` or used to validate a `DataFrame` directly.

> Note: Due to current limitations in the pandas library (see discussion here), pandera annotations are only used for run-time validation and has limited support for static-type checkers like mypy. See the Mypy Integration for more details. 

### Basic usage

This is an example from the Pandera documentation:

In [None]:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series


class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)

class OutputSchema(InputSchema):
    revenue: Series[float]

@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=100.0)


**Exercise**: 
1. Try to pass an invalid DataFrame to the `transform` function and see what happens.
2. Try to feed a dataframe with an extra column. It should not fail.
3. Add a `strict` option so that the validation fails if there are extra columns.

In [None]:
# invalid column name
invalid_df_1 = pd.DataFrame(
    {
        "year": [2021, 2021, 2021],
        "month": [1, 1, 1],
        "date": [1, 1, 1],
    }
)

# invalid value
invalid_df_2 = pd.DataFrame(
    {
        "year": [2021, 2021, 2021],
        "month": [1, 1, 0],
        "day": [1, 1, 1],
    }
)

transform(invalid_df_1)

In [None]:
df_with_extra_column = pd.DataFrame(
    {
        "year": [2021, 2021, 2021],
        "month": [1, 1, 1],
        "day": [1, 1, 1],
        "weekday": ["Monday", "Tuesday", "Wednesday"],
    }
)
transform(df_with_extra_column)

In [None]:
# Add a `strict` option so that the validation fails if there are extra columns.

class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)

    class Config:
        strict = True

# we need to re-define the function so that the updated model is used
@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=100.0)

# this will raise an error now
transform(df_with_extra_column)

### Parametrized dtypes

Pandas supports a couple of parametrized dtypes, e.g. `DatetimeTZDtype`, `StringDtype` or `CategoricalDtype`.
These type parameters can be provided via `typing.Annotated`.

In [None]:
import numpy as np
import pandas as pd
import pandera as pa
from pandera.typing import Index, Series
from typing import Annotated

class TempWindModel(pa.DataFrameModel):
    temperature: Series[np.float32] = pa.Field(coerce=True)
    wind_direction: Series[Annotated[pd.StringDtype, "pyarrow"]] = pa.Field(coerce=True)
    idx: Index[pd.Timestamp] = pa.Field(coerce=True, check_name=False)

    class Config:
        strict = True

**Exercise**:
1. Try to validate the `temp_wind` dataframe defined below using the `TempWindModel` model.
2. Verify that `temp_wind.astype({"wind_direction": pd.StringDtype("python")})` (i.e. the dataframe with `wind_direction` as a python string backend dtype) is still successfully validated.
3. Modify the `Field` parameters of the `wind_direction` column so that the validation in point 2 fails.

In [None]:
temp_wind = pd.DataFrame(
    {"temperature": [15, 16.5, 17.3], "wind_direction": ["N", "N", "W"]},
    index=pd.date_range("2021-01-01", "2021-01-03"),
).astype({"wind_direction": "string[pyarrow]", "temperature": "float32"})

In [None]:
# 1. Try to validate the `temp_wind` dataframe defined above using the `TempWindModel` model.

TempWindModel.validate(temp_wind)

In [None]:
# 2. Verify that `temp_wind.astype({"wind_direction": pd.StringDtype("python")})`
# (i.e. the dataframe with `wind_direction` as a python string backend dtype)
# is still successfully validated.

TempWindModel.validate(temp_wind.astype({"wind_direction": pd.StringDtype("python")}))

In [None]:
# 3. Modify the `Field` parameters of the `wind_direction` column so that the validation in point 2 fails.

class TempWindModel(pa.DataFrameModel):
    temperature: Series[np.float32] = pa.Field(coerce=True)
    # coerce=False will fail the validation
    wind_direction: Series[Annotated[pd.StringDtype, "pyarrow"]] = pa.Field(coerce=False)
    idx: Index[pd.Timestamp] = pa.Field(coerce=True, check_name=False)

    class Config:
        strict = True

TempWindModel.validate(temp_wind.astype({"wind_direction": pd.StringDtype("python")}))