# Skooma

Skooma is a lightweight type validation tool for Pandas DataFrames.

First, import the `Schema` class and the `@validate` decorator:

In [1]:
import pandas as pd
from skooma import Schema, validate

For simplicity's sake, we'll create a small Pandas DataFrame with dummy values for testing:

In [2]:
df = pd.DataFrame({
    'nums': range(5),
    'chars': list('abcde')
})

df

| | nums | chars |
|---|---|---|
| 0 | 0 | a |
| 1 | 1 | b |
| 2 | 2 | c |
| 3 | 3 | d |
| 4 | 4 | e |


Next, we'll create a schema for our DataFrame using the `Schema` class. To create a new `Schema`, we pass in a dictionary whose keys correspond to the columns in `df`. The value for each key is a lambda function, optionally wrapped in a type such as `Number` or `String`, that evaluates to `True` or `False` for each unique value in the column. This lets us set very specific constraints beyond basic data types.

Here, we'll define a schema for `df` that requires the `nums` column to contain numbers less than 100 and the `chars` column to contain single-character strings:

In [30]:
example_schema = Schema({
    'nums': Number(lambda x: x < 100),
    'chars': String(lambda x: len(x) == 1)
})

example_schema

<skooma.Schema at 0x11800ca50>

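Other type wrappers are listed in the cell below. Note that in this run the cell raised an `AttributeError`:

In [36]: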
Integer(lambda x: x < 100)
Float()
Number()
Boolean()
String()
Object()
DateTime()

AttributeError: module 'pandas' has no attribute 'int64'
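
To make the idea of a type wrapper around a predicate more concrete, here is a minimal sketch in plain Python of how something like `Number(lambda x: x < 100)` could behave. This is not Skooma's implementation; `typed_rule` is a hypothetical helper introduced only for illustration, and it assumes that a wrapper checks the type first and then applies an optional predicate.

```python
import numbers
from typing import Any, Callable, Optional


def typed_rule(
    expected_type: type,
    predicate: Optional[Callable[[Any], bool]] = None,
) -> Callable[[Any], bool]:
    """Hypothetical helper: test the type first, then an optional predicate."""

    def check(value: Any) -> bool:
        # bool is a subclass of int, so reject it unless bool was requested
        if isinstance(value, bool) and expected_type is not bool:
            return False
        if not isinstance(value, expected_type):
            return False
        # With no predicate, passing the type check is enough
        return True if predicate is None else bool(predicate(value))

    return check


small_number = typed_rule(numbers.Number, lambda x: x < 100)
single_char = typed_rule(str, lambda x: len(x) == 1)

print(small_number(5), small_number(250), small_number("5"))  # True False False
print(single_char("a"), single_char("ab"))                    # True False
```

Checking the type before the predicate means a predicate such as `x < 100` is never called on a value it cannot compare, such as a string.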

In [4]:
example_schema.strict

True

Every `Schema` instance has a `.validate()` method, which takes a DataFrame and tests it against the schema's requirements. It returns `True` if the DataFrame passes validation and `False` otherwise, printing a message for each invalid value it finds.

In [31]:
example_schema.validate(df)

True

In [18]:
example_schema.validate(df.assign(nums=df['nums'] + 99))

Invalid value in column 'nums': 100
Invalid value in column 'nums': 101
Invalid value in column 'nums': 102
Invalid value in column 'nums': 103


False
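
The behaviour above (one message per offending value, then a boolean result) can be approximated with plain pandas. The following is a rough sketch under that assumption, not Skooma's actual code; `check_frame` and `rules` are hypothetical names, and the rules are bare predicates rather than Skooma type wrappers.

```python
import pandas as pd


def check_frame(frame: pd.DataFrame, rules: dict) -> bool:
    """Hypothetical stand-in for a validate step: report bad values, return a bool."""
    ok = True
    for column, predicate in rules.items():
        # Apply the predicate to every value to build a boolean mask
        passed = frame[column].map(predicate).astype(bool)
        # Report each value that failed the predicate
        for value in frame.loc[~passed, column]:
            print(f"Invalid value in column {column!r}: {value}")
            ok = False
    return ok


df = pd.DataFrame({"nums": range(5), "chars": list("abcde")})
rules = {"nums": lambda x: x < 100, "chars": lambda x: len(x) == 1}

result_ok = check_frame(df, rules)
result_bad = check_frame(df.assign(nums=df["nums"] + 99), rules)
print(result_ok, result_bad)
```

Here `result_ok` is `True` and `result_bad` is `False`, because shifting `nums` by 99 pushes four of the five values to 100 or above.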

In [45]:
df2 = df.replace({1: None})
df2[(df2['nums'] < 100)]

| | nums | chars |
|---|---|---|
| 0 | 0 | a |
| 2 | 2 | c |
| 3 | 3 | d |
| 4 | 4 | e |


Note that, by default, a `Schema` must have a key for every column in the DataFrame being validated. We can disable this, and only define validation requirements for a subset of columns, by passing the optional argument `strict=False`.

In [7]:
permissive_schema = Schema(
    {'nums': lambda x: x < 100}, 
    strict=False
)

permissive_schema.validate(df)

True
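
The strict check described above amounts to comparing the DataFrame's columns with the schema's keys. A minimal sketch of that comparison follows, assuming a schema is just a dictionary of column names to predicates; `columns_covered` is a hypothetical helper and is not part of Skooma.

```python
import pandas as pd


def columns_covered(frame: pd.DataFrame, rules: dict, strict: bool = True) -> bool:
    """Hypothetical helper: in strict mode, every column needs a rule."""
    missing = [col for col in frame.columns if col not in rules]
    if strict and missing:
        for col in missing:
            print(f"Column {col!r} has no rule")
        return False
    # In permissive mode, columns without a rule are simply ignored
    return True


df = pd.DataFrame({"nums": range(5), "chars": list("abcde")})
partial_rules = {"nums": lambda x: x < 100}

strict_result = columns_covered(df, partial_rules, strict=True)
permissive_result = columns_covered(df, partial_rules, strict=False)
print(strict_result, permissive_result)  # False True
```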

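The `@validate` decorator applies schemas to a function's inputs and output. In the example below, `args` holds one entry per positional argument (`None` skips validation for that argument), and `returns` holds the schema for the return value:

In [8]: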
@validate(
    args=(example_schema, None), 
    returns=Schema({'nums': lambda x: x % 2 == 0}, strict=False)
)
def multiply(df: pd.DataFrame, x: int) -> pd.DataFrame:
    return df * x

multiply(df, 2)

Validating input at index 0...
Passed!
Validating output...
Passed!


| | nums | chars |
|---|---|---|
| 0 | 0 | aa |
| 1 | 2 | bb |
| 2 | 4 | cc |
| 3 | 6 | dd |
| 4 | 8 | ee |


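With `strict=True` on the return schema, output validation fails because `chars` has no key in that schema. In this run the decorated call then returned `None`, so chaining `.head()` raises an `AttributeError`:

In [9]: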
@validate(
    args=(example_schema, None), 
    returns=Schema({'nums': lambda x: x % 2 == 0}, strict=True)
)
def multiply(df: pd.DataFrame, x: int) -> pd.DataFrame:
    return df * x

multiply(df, 2).head()

Validating input at index 0...
Passed!
Validating output...
Column 'chars' not found in Schema


AttributeError: 'NoneType' object has no attribute 'head'
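
The decorator pattern used in the two cells above can be sketched in plain Python with `functools.wraps`. This is an illustration under stated assumptions, not Skooma's implementation: `validate_sketch` is a hypothetical name, each check is any callable returning `True` or `False`, and, unlike the run shown above, a failed check raises a `ValueError` instead of returning `None`.

```python
import functools
from typing import Any, Callable, Optional, Sequence

Check = Callable[[Any], bool]


def validate_sketch(
    args: Sequence[Optional[Check]] = (),
    returns: Optional[Check] = None,
):
    """Hypothetical decorator: one check per positional argument, one on the result."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*call_args, **call_kwargs):
            # Pair each check with the positional argument at the same index
            for index, (check, value) in enumerate(zip(args, call_args)):
                if check is not None and not check(value):
                    raise ValueError(f"input at index {index} failed validation")
            result = func(*call_args, **call_kwargs)
            if returns is not None and not returns(result):
                raise ValueError("output failed validation")
            return result

        return wrapper

    return decorator


@validate_sketch(
    args=(lambda xs: all(x < 100 for x in xs), None),  # None skips the second argument
    returns=lambda xs: all(x % 2 == 0 for x in xs),
)
def multiply(values, factor):
    return [v * factor for v in values]


doubled = multiply(range(5), 2)
print(doubled)  # [0, 2, 4, 6, 8]
```

Raising on failure is a design choice for this sketch: the caller sees the validation error directly rather than a later `AttributeError` on a `None` result.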