# Constraints generation

## Type detection

We infer feature types from the dtype of each Pandas column. This has some quirks:

- By default, an integer column that contains missing values is cast to `float64` rather than to a nullable integer type such as `Int32`. Inferring the feature type directly from the column dtype is therefore unreliable for integer columns with missing values. See [here](https://stackoverflow.com/questions/21287624/convert-pandas-column-containing-nans-to-dtype-int)

- The `object` dtype is used both for pure string columns and for mixed-type columns, so the dtype alone does not identify string columns. We attempt to detect string columns for small datasets. Pandas recommends assigning the `string` dtype to columns with text data. See [here](https://pandas.pydata.org/docs/user_guide/text.html#text-types)
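
Both quirks can be reproduced with plain Pandas. The sketch below is illustrative only: `looks_like_string_column` is a hypothetical helper built on `pandas.api.types.infer_dtype`, and it is not necessarily how `flare` detects string columns.

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

# Quirk 1: a missing value upcasts an integer column to float64.
with_nan = pd.Series([42, 80, np.nan])
print(with_nan.dtype)

# Requesting a nullable extension dtype keeps the column integral.
nullable = pd.Series([42, 80, None], dtype="Int32")
print(nullable.dtype)

# Quirk 2: pure-string and mixed columns can both carry dtype `object`.
strings = pd.Series(["eeny", "meeny", None], dtype=object)
mixed = pd.Series(["eeny", 2.0, 1], dtype=object)
print(strings.dtype, mixed.dtype)


def looks_like_string_column(column: pd.Series) -> bool:
    """Hypothetical helper: True if every non-missing value is a str.

    infer_dtype scans the values, so the cost grows with column length.
    """
    return infer_dtype(column, skipna=True) == "string"


print(looks_like_string_column(strings))
print(looks_like_string_column(mixed))
```

Because the check inspects values rather than the dtype, it is linear in the number of rows, which is one plausible reason to restrict it to small datasets.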



In [19]:
from flare.generators import gen_constraints, gen_statistics
from flare.examples import generate_example_dataframe

from dataclasses import asdict
from pprint import pprint

In [20]:
full_df = generate_example_dataframe()
full_df.head()

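|   | int_0 | int_1 | positive_int_0 | positive_int_1 | float_0 | float_1 | str_0 | str_1 | mixed_0 | mixed_1 |
|---|-------|-------|----------------|----------------|---------|---------|-------|-------|---------|---------|
| 0 | 42.0 | 5 | 73 | 47 | 0.356265 | 0.512995 | eeny | eeny | 2.0 | eeny |
| 1 | 80.0 | 5 | 96 | 16 |  | 0.620325 |  | moe | 1 | eeny |
| 2 | 8.0 | 17 | 99 | 19 | 0.546542 | 0.579696 |  | meeny | 2.0 | 2.0 |
| 3 | 16.0 | 88 | 8 | 57 | 0.451448 | 0.015139 | eeny | miny | 2.0 | 2.0 |
| 4 |  | 82 | 86 | 25 | 0.726925 | 0.21657 | miny | eeny | eeny | eeny |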


In [21]:
constraints = gen_constraints(full_df)

In [30]:
pprint(asdict(constraints), indent=2, depth=4, width=30)

{ 'features': [ { 'completeness': 0.9,
                  'inferred_type': 'Fractional',
                  'monitoringConfigOverrides': None,
                  'name': 'int_0',
                  'num_constraints': { 'is_non_negative': True},
                  'string_constraints': None},
                { 'completeness': 1.0,
                  'inferred_type': 'Integral',
                  'monitoringConfigOverrides': None,
                  'name': 'int_1',
                  'num_constraints': { 'is_non_negative': True},
                  'string_constraints': None},
                { 'completeness': 1.0,
                  'inferred_type': 'Integral',
                  'monitoringConfigOverrides': None,
                  'name': 'positive_int_0',
                  'num_constraints': { 'is_non_negative': True},
                  'string_constraints': None},
                { 'completeness': 1.0,
                  'inferred_type': 'Integral',
                  'monitoringConfigOverrides'
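
The `completeness` and `is_non_negative` fields in the output above can be approximated in plain Pandas. This is a minimal sketch that assumes `completeness` is the fraction of non-missing values and that `is_non_negative` ignores missing values. Both are assumptions, and the functions below are hypothetical stand-ins rather than the `flare` implementation. The sample values are the first five rows of `int_0` shown earlier.

```python
import numpy as np
import pandas as pd


def completeness(column: pd.Series) -> float:
    """Assumed definition: share of values in the column that are not missing."""
    return float(column.notna().mean())


def is_non_negative(column: pd.Series) -> bool:
    """Assumed definition: every non-missing value is >= 0."""
    return bool((column.dropna() >= 0).all())


# First five rows of `int_0` from the example dataframe; the last is missing.
int_0_head = pd.Series([42.0, 80.0, 8.0, 16.0, np.nan])

print(completeness(int_0_head))
print(is_non_negative(int_0_head))
```

With four of five values present, the sketch reports a completeness of 0.8 for this sample.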

In [13]:
stats = gen_statistics(full_df)

In [31]:
pprint(asdict(stats), indent=2, depth=4, width=30)

{ 'dataset': { 'item_count': 1000},
  'features': [ { 'inferred_type': 'Fractional',
                  'name': 'int_0',
                  'numerical_statistics': { 'common': { 'num_missing': 100,
                                                        'num_present': 900},
                                            'distribution': None,
                                            'max': 99.0,
                                            'mean': 48.318888888888885,
                                            'min': 0.0,
                                            'std_dev': 28.302896059399178,
                                            'sum': 43487.0},
                  'string_statistics': None},
                { 'inferred_type': 'Integral',
                  'name': 'int_1',
                  'numerical_statistics': { 'common': { 'num_missing': 0,
                                                        'num_present': 1000},
                                            'distribution': 