## pydantic + pandas

"simple" goal: pydantic schema for generating a dataframe



In [1]:
from inspect import getfullargspec
import pandas as pd
import typing
import json
from pydantic import BaseModel, create_model, validator, Field, ValidationError
import numpy as np
from enum import Enum

first, check out the instantiation args for a dataframe:

pd.DataFrame

In [2]:
pd.DataFrame?

Init signature:
pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
)
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data s

So there's some complexity here...
1. the `data` arg can be many things and is not explicitly typed. It's validated within `DataFrame.__init__()`, but left untyped because it accepts so many kinds of input.
2. except for `copy`, the arguments use internal pandas types. We can check what the `Axes` and `Dtype` types are with:


In [3]:
pd._typing.Axes

typing.Collection[typing.Any]

In [4]:
pd._typing.Dtype

typing.Union[ForwardRef('ExtensionDtype'), str, numpy.dtype, typing.Type[typing.Union[str, float, int, complex, bool, object]]]
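The printed `Dtype` union can also be picked apart programmatically rather than read by eye. The sketch below uses a stand-in union built only from builtins (an assumption: it mirrors the `typing.Type[...]` part of the output above and leaves out `ExtensionDtype` and `numpy.dtype`), and the helper name `builtin_type_names` is hypothetical.

```python
import typing

# Stand-in for the builtin part of the printed Dtype union (assumption:
# the real pandas alias also includes ExtensionDtype and numpy.dtype).
BuiltinDtype = typing.Type[typing.Union[str, float, int, complex, bool, object]]
DtypeLike = typing.Union[str, BuiltinDtype]


def builtin_type_names(union_type):
    """Collect the names of the builtin types wrapped in typing.Type[...]."""
    names = []
    for arg in typing.get_args(union_type):
        # typing.Type[X] reports `type` as its origin; plain classes report None
        if typing.get_origin(arg) is type:
            inner = typing.get_args(arg)[0]  # the inner Union[...]
            names.extend(t.__name__ for t in typing.get_args(inner))
    return names


names = builtin_type_names(DtypeLike)
print(names)
```

A list like this is one possible starting point for deciding which dtype strings a schema should allow.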

### initial manual `DataFrameModel`

a simple attempt at building a pydantic model. Adding a `dtype` attribute is proving difficult... for now, we'll use a string declaration approach with `Enum`. So let's construct a `DtypeEnum` from a list of strings corresponding to data types that we'll allow. When we get to trying to instantiate a true pandas `DataFrame`, we'll use `eval()` to get an actual type. 

In [5]:
allowed_types = ['int', 'float', 'str', 'complex', 'np.int64', 'np.float64']
DtypeEnum = Enum("DtypeEnum", dict(zip(allowed_types, allowed_types)))
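To see what the functional `Enum` call above produces, here is a small standalone sketch using only the standard library. It rebuilds the same `DtypeEnum` and shows lookup by value, lookup by name for the dotted entries, and rejection of a string that is not in `allowed_types`. The variable names `member`, `np_member`, and `rejected` are just for illustration.

```python
from enum import Enum

allowed_types = ['int', 'float', 'str', 'complex', 'np.int64', 'np.float64']
DtypeEnum = Enum("DtypeEnum", dict(zip(allowed_types, allowed_types)))

# Lookup by value: turns an allowed string into the matching member.
member = DtypeEnum("complex")
print(member.name, member.value)

# Names containing a dot are not valid attribute names, so use item access.
np_member = DtypeEnum["np.int64"]
print(np_member.value)

# Strings outside allowed_types are rejected with a ValueError.
try:
    DtypeEnum("bool")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Because each member's name equals its value, either lookup style gives the same member for a given string.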

In [6]:
class DataFrameModel(BaseModel):
    data: dict # for simplicity for now, only allow data dict
    index: typing.Optional[pd._typing.Axes] = None
    columns: typing.Optional[pd._typing.Axes] = None    
    dtype: typing.Optional[DtypeEnum] = None
    copy_: typing.Optional[bool] = Field(None, alias='copy')
    
    class Config:
        arbitrary_types_allowed = True  ## needed for Axes type        

In [7]:
df = DataFrameModel.construct()

In [8]:
df.schema()

{'title': 'DataFrameModel',
 'type': 'object',
 'properties': {'data': {'title': 'Data', 'type': 'object'},
  'index': {'title': 'Index'},
  'columns': {'title': 'Columns'},
  'dtype': {'$ref': '#/definitions/DtypeEnum'},
  'copy': {'title': 'Copy', 'type': 'boolean'}},
 'required': ['data'],
 'definitions': {'DtypeEnum': {'title': 'DtypeEnum',
   'description': 'An enumeration.',
   'enum': ['int', 'float', 'str', 'complex', 'np.int64', 'np.float64']}}}

In [9]:
with open('test_schema.json', 'w') as fi:
    fi.write(df.schema_json())

In [10]:
DataFrameModel(data={"a":[1,2,3]}, dtype="complex")

DataFrameModel(data={'a': [1, 2, 3]}, index=None, columns=None, dtype=<DtypeEnum.complex: 'complex'>, copy_=None)

In [11]:
DataFrameModel(data={"a":[1,2,3]}, dtype="complex").json()

'{"data": {"a": [1, 2, 3]}, "index": null, "columns": null, "dtype": "complex", "copy_": null}'
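For reference, a JSON payload filled in against this schema can be written with the standard library alone. The sketch below is an assumption about what such a file might contain: the values are copied from the `valid_model` repr shown in the next section, the original file itself is not shown in this notebook, and the temp-directory path is only there to keep the sketch free of side effects.

```python
import json
import tempfile
from pathlib import Path

# Values copied from the `valid_model` repr in this notebook.
payload = {
    "data": {"col_1": [1, 2, 3, 4], "col_2": [-1, 20, 30, -20]},
    "dtype": "np.int64",  # must be one of the DtypeEnum values
    "copy": True,         # the schema exposes the alias `copy`, not `copy_`
}

# Written to the temp directory here; the notebook itself reads
# 'filled_schema.json' from the working directory.
out_path = Path(tempfile.gettempdir()) / "filled_schema.json"
out_path.write_text(json.dumps(payload, indent=2))

# Reading it back confirms the keys a consumer of the file would see.
loaded = json.loads(out_path.read_text())
print(sorted(loaded))
```

Note that the `.json()` output above uses the attribute name `copy_`, while the schema lists `copy`; passing `by_alias=True` to `.json()` emits the alias instead.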

## instantiating a dataframe

Assuming we've used our schema above to write a json to `filled_schema.json`, let's actually instantiate a dataframe:

In [12]:
valid_model = DataFrameModel.parse_file('filled_schema.json')

In [13]:
valid_model

DataFrameModel(data={'col_1': [1, 2, 3, 4], 'col_2': [-1, 20, 30, -20]}, index=None, columns=None, dtype=<DtypeEnum.np.int64: 'np.int64'>, copy_=True)

in the yt analysis schema approach, we attached a `._run` attribute to the pydantic classes. but it may be clearer to have a separate ingestion process:

In [14]:
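def instantiate_df(pandantic_model: DataFrameModel) -> pd.DataFrame:
    enum_dtype = pandantic_model.dtype # e.g., <DtypeEnum.int: 'int'>, or None if unset
    # dtype is optional, so only resolve it when one was provided
    actual_dtype = None
    if enum_dtype is not None:
        dtype_str = enum_dtype.value # e.g., 'int'
        # eval() only ever sees strings that DtypeEnum already restricted to allowed_types
        actual_dtype = eval(dtype_str) # e.g., int
    return pd.DataFrame(pandantic_model.data,
                        index=pandantic_model.index,
                        columns=pandantic_model.columns,
                        dtype=actual_dtype,
                        copy=pandantic_model.copy_
                       )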
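If `eval()` feels too permissive, the dtype string could instead be resolved with an explicit lookup. This is only a sketch of one alternative, not what the notebook does: `MODULE_ALIASES`, `ALLOWED`, and `resolve_dtype` are hypothetical names, and the `np` to `numpy` mapping assumes the same import alias used at the top of the notebook.

```python
import builtins
import importlib

# Hypothetical alias table: maps the prefix used in the dtype strings
# to the real module name.
MODULE_ALIASES = {"np": "numpy"}
ALLOWED = {'int', 'float', 'str', 'complex', 'np.int64', 'np.float64'}


def resolve_dtype(dtype_str):
    """Return the type named by dtype_str without calling eval()."""
    if dtype_str is None:
        return None  # no dtype requested; let pandas infer it
    if dtype_str not in ALLOWED:
        raise ValueError(f"dtype {dtype_str!r} is not allowed")
    if "." not in dtype_str:
        # plain names such as 'int' are looked up among the builtins
        return getattr(builtins, dtype_str)
    alias, attr = dtype_str.split(".", 1)
    # the module is imported only when a dotted name is actually requested
    module = importlib.import_module(MODULE_ALIASES[alias])
    return getattr(module, attr)


print(resolve_dtype("int"))
print(resolve_dtype("complex"))
```

With a helper like this, `eval(dtype_str)` in `instantiate_df` could be swapped for `resolve_dtype(dtype_str)`, and unexpected strings fail with a clear error even if the enum is bypassed.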

In [15]:
df = instantiate_df(valid_model)

In [16]:
df.head()

|   | col_1 | col_2 |
|---|-------|-------|
| 0 | 1 | -1 |
| 1 | 2 | 20 |
| 2 | 3 | 30 |
| 3 | 4 | -20 |
