# Engarde

> Engarde is a package for defensive data analysis.

Facts of life:
1. Data are messy.
1. An analysis relies on certain (invariant) assumptions about our data and the workflow.

We need to
1. explicitly state these assumptions.
1. check that they're true.

Ideally, without messing up our beautiful code.
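
One way to keep such checks out of the function body is a decorator that validates the return value. The following is a minimal sketch in plain Python and pandas, not Engarde's implementation; the names `check_result`, `no_missing`, `load_clean` and `load_messy` are hypothetical.

```python
import functools

import pandas as pd


def check_result(check, message):
    """Hypothetical decorator factory: assert check(result) on the return value."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # The assumption is stated next to the function and verified on every call.
            assert check(result), message
            return result
        return wrapper
    return decorator


def no_missing(df):
    """Hypothetical check: True if the frame contains no missing values."""
    return not df.isna().any().any()


@check_result(no_missing, "data contains missing values")
def load_clean():
    return pd.DataFrame({'x': [0.0, 1.0], 'y': [1.0, 2.0]})


@check_result(no_missing, "data contains missing values")
def load_messy():
    return pd.DataFrame({'x': [0.0, 1.0], 'y': [1.0, float('nan')]})


clean = load_clean()  # passes the check and returns the frame unchanged

try:
    load_messy()
    failed = False
except AssertionError as err:
    failed = True
    error_message = str(err)
```

The function bodies stay free of validation logic; the assumption lives in the decorator line above each definition.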

### Example

Imagine you are at the beginning of your data processing pipeline.

You do something like

```python
>>> data = load_data()
>>> data_prepared = prepare_data(data)
```

To finally end up with
```python
>>> extract_all_answers(data_prepared)
42
```

We want to be sure that `load_data` and `prepare_data` do a proper job, i.e. the resulting data items have the required properties.

In [1]:
import numpy as np
import pandas as pd
import engarde.decorators as ed


def load_data():
    """Complicated data loading procedure.
    
    We require:
    - x, yl and yu should be of type float
    - x should be increasing
    """
    # this could be a rather messy data loading procedure
    df = pd.DataFrame({'x': [5, 1, 2, 3, 4, 0],
                       'yl': [1, 0, 1, np.nan, np.nan, 0],
                       'yu': [4, 2, 2, 3, 4, 1]})
    return df

data = load_data()
data

|   | x | yl  | yu |
|---|---|-----|----|
| 0 | 5 | 1.0 | 4  |
| 1 | 1 | 0.0 | 2  |
| 2 | 2 | 1.0 | 2  |
| 3 | 3 | NaN | 3  |
| 4 | 4 | NaN | 4  |
| 5 | 0 | 0.0 | 1  |


In [37]:
@ed.is_monotonic(items={'x': (True, True)})
@ed.has_dtypes(items={'x': float, 'yl': float, 'yu': float})
def load_data():
    """Complicated data loading procedure.
    
    We require:
    - x, yl and yu should be of type float
    - x should be increasing
    """
    # this could be a rather messy data loading procedure
    df = pd.DataFrame({'x': [5, 1, 2, 3, 4, 0],
                       'yl': [1, 0, 1, np.nan, np.nan, 0],
                       'yu': [4, 2, 2, 3, 4, 1]})
    return df

data = load_data()
data

AssertionError: x has the wrong dtype (<class 'float'>)
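
The dtype failure can be reproduced with plain pandas. This sketch does not use Engarde; it only inspects the dtypes that pandas infers for the same data. Columns built from integer literals come out as integers, while `yl` becomes float because it contains NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [5, 1, 2, 3, 4, 0],
                   'yl': [1, 0, 1, np.nan, np.nan, 0],
                   'yu': [4, 2, 2, 3, 4, 1]})

# Which columns are already float? Only yl, because NaN forces a float column.
is_float = {col: pd.api.types.is_float_dtype(df[col]) for col in df.columns}

# Casting the whole frame makes all three columns float.
df_float = df.astype(float)
all_float = all(pd.api.types.is_float_dtype(df_float[col])
                for col in df_float.columns)
```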

In [34]:
@ed.is_monotonic(items={'x': (True, True)})
@ed.has_dtypes(items={'x': float, 'yl': float, 'yu': float})
def load_data():
    """Complicated data loading procedure.
    
    We require:
    - x, yl and yu should be of type float
    - x should be increasing
    """
    # this could be a rather messy data loading procedure
    df = pd.DataFrame({'x': [5, 1, 2, 3, 4, 0],
                       'yl': [1, 0, 1, np.nan, np.nan, 0],
                       'yu': [4, 2, 2, 3, 4, 1]}).astype(float)  # <-- dtypes
    return df

data = load_data()
data

AssertionError: 
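
This assertion carries no message. With the columns cast to float, the requirement that presumably still fails is the monotonicity of `x`. A quick way to confirm this with plain pandas is sketched below; it does not rely on Engarde, and strictness is approximated by additionally requiring unique values.

```python
import pandas as pd

x = pd.Series([5, 1, 2, 3, 4, 0], dtype=float)

# Unsorted: the values are not monotonically increasing.
unsorted_ok = x.is_monotonic_increasing

# After sorting, x is increasing; unique values make it strictly increasing.
x_sorted = x.sort_values()
sorted_ok = x_sorted.is_monotonic_increasing and x_sorted.is_unique
```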

In [36]:
@ed.is_monotonic(items={'x': (True, True)})
@ed.has_dtypes(items={'x': float, 'yl': float, 'yu': float})
def load_data():
    """Complicated data loading procedure.
    
    We require:
    - x, yl and yu should be of type float
    - x should be increasing
    """
    # this could be a rather messy data loading procedure
    df = pd.DataFrame({'x': [5, 1, 2, 3, 4, 0],
                       'yl': [1, 0, 1, np.nan, np.nan, 0],
                       'yu': [4, 2, 2, 3, 4, 1]}).astype(float).sort_values('x')  # <-- sort
    return df

data = load_data()
data

|   | x   | yl  | yu  |
|---|-----|-----|-----|
| 5 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 2.0 |
| 2 | 2.0 | 1.0 | 2.0 |
| 3 | 3.0 | NaN | 3.0 |
| 4 | 4.0 | NaN | 4.0 |
| 0 | 5.0 | 1.0 | 4.0 |


### Exercises

- Implement the following requirements:
```python
def prepare_data(data):
    """Data preparation.
    
    We require:
    - no missing values
    - yl <= yu
    - x is index
    - a new variable x_valid = x < 4; it must assume only the values {False, True}
    """
    return data_prepared
```
by employing [appropriate decorators](http://engarde.readthedocs.io/en/latest/api.html).

### Solutions

In [33]:
def has_index_name(df, index_name):
    return df.index.name == index_name

def is_yl_leq_yu(df):
    # Row-wise comparison: one boolean per row, all of which must be True.
    return df['yl'] <= df['yu']


@ed.verify_all(is_yl_leq_yu)
@ed.verify(has_index_name, 'x')
@ed.none_missing()
@ed.within_set({'x_valid': {True, False}})
@ed.has_dtypes({'x_valid': bool})
def prepare_data(data):
    """Data preparation.
    
    We require:
    - no missing values
    - yl <= yu
    - x is index
    - a new variable x_valid = x < 4; it must assume only the values {False, True}
    """
    data_prepared = data.ffill()  # forward-fill missing values
    data_prepared['x_valid'] = data_prepared['x'] < 4
    return data_prepared.set_index('x')

prepare_data(load_data())

| x   | yl  | yu  | x_valid |
|-----|-----|-----|---------|
| 0.0 | 0.0 | 1.0 | True    |
| 1.0 | 0.0 | 2.0 | True    |
| 2.0 | 1.0 | 2.0 | True    |
| 3.0 | 1.0 | 3.0 | True    |
| 4.0 | 1.0 | 4.0 | False   |
| 5.0 | 1.0 | 4.0 | False   |
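
The solution passes a check that returns a single boolean to `ed.verify` and a check that returns one boolean per row to `ed.verify_all`. The difference can be sketched in plain pandas, assuming (as the usage above suggests) that `verify` asserts the truth of a single result while `verify_all` asserts that every element is true. The helpers `verify_sketch`, `verify_all_sketch` and `lower_below_upper` are hypothetical stand-ins, not Engarde's code.

```python
import numpy as np
import pandas as pd


def verify_sketch(df, check, *args):
    """Hypothetical stand-in: the check returns one boolean."""
    assert check(df, *args), "{} is not true".format(check.__name__)
    return df


def verify_all_sketch(df, check, *args):
    """Hypothetical stand-in: the check returns one boolean per row."""
    assert np.all(check(df, *args)), "{} is not true for all rows".format(check.__name__)
    return df


def has_index_name(df, index_name):
    return df.index.name == index_name


def lower_below_upper(df):
    return df['yl'] <= df['yu']


df = pd.DataFrame({'x': [0.0, 1.0, 2.0],
                   'yl': [0.0, 0.0, 1.0],
                   'yu': [1.0, 2.0, 2.0]}).set_index('x')

# Both checks pass and hand the frame back unchanged, so they can be chained.
checked = verify_all_sketch(verify_sketch(df, has_index_name, 'x'), lower_below_upper)

# A single violating row is enough for the row-wise check to fail.
bad = df.copy()
bad.loc[1.0, 'yl'] = 5.0
try:
    verify_all_sketch(bad, lower_below_upper)
    row_check_failed = False
except AssertionError:
    row_check_failed = True
```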


### Summary

- [Engarde](http://engarde.readthedocs.io/en/latest/) is one way to state and verify data-related assertions.
- It will not stop your code from failing, but it can help it fail gracefully.
- It can serve as inspiration for how to structure your exploration at an early stage.
- A similar effect may be achieved with a customized container, most likely with more effort.
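
The custom-container alternative from the last point could look like the following sketch: a small wrapper class that validates its data on construction, so every instance is known to satisfy the assumptions. The class name `IncreasingXFrame` and its checks are hypothetical and chosen to mirror the `load_data` requirements.

```python
import pandas as pd


class IncreasingXFrame:
    """Hypothetical container: float columns x, yl, yu with strictly increasing x."""

    def __init__(self, df):
        for col in ('x', 'yl', 'yu'):
            if not pd.api.types.is_float_dtype(df[col]):
                raise TypeError("{} must be float".format(col))
        if not (df['x'].is_monotonic_increasing and df['x'].is_unique):
            raise ValueError("x must be strictly increasing")
        self.df = df


raw = pd.DataFrame({'x': [5, 1, 2, 3, 4, 0],
                    'yl': [1, 0, 1, float('nan'), float('nan'), 0],
                    'yu': [4, 2, 2, 3, 4, 1]})

# The raw frame is rejected because x is an integer column.
try:
    IncreasingXFrame(raw)
    rejected = None
except TypeError as err:
    rejected = str(err)

# After casting and sorting, construction succeeds.
container = IncreasingXFrame(raw.astype(float).sort_values('x'))
```

Compared with decorators, every requirement has to be written and maintained by hand, which is where the extra effort comes from.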