# 1. Make your assertions explicit

## `assert`

The `assert` statement is the simplest way to state an assumption in code.

### Example

In [1]:
def inverse_not_supporting_0(x):
    assert x != 0, 'We do not support 0, sorry'
    return 1 / x

In [2]:
inverse_not_supporting_0(2)

0.5

In [3]:
inverse_not_supporting_0(0)

AssertionError: We do not support 0, sorry

### Summary

- `assert` is very explicit.

- Use it for conditions that should never occur, so that the program crashes early (and gracefully) if one does.

- Documents cases to be taken care of later.

- Not for production code: Python removes `assert` statements when it runs with the `-O` flag.
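
One reason not to rely on `assert` in production can be checked directly. The sketch below runs the same one-line script in a child interpreter, once normally and once with `-O`, which strips assertions. The script text and the variable names are made up for this illustration.

```python
import subprocess
import sys

# A tiny script whose assertion always fails.
script = "assert False, 'boom'; print('assertion was skipped')"

# Normal run: the assertion fires and the interpreter exits with an error.
normal = subprocess.run([sys.executable, '-c', script],
                        capture_output=True, text=True)

# Optimized run: -O removes assert statements, so the print is reached.
optimized = subprocess.run([sys.executable, '-O', '-c', script],
                           capture_output=True, text=True)

print(normal.returncode)         # non-zero, stderr mentions AssertionError
print(optimized.stdout.strip())  # assertion was skipped
```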

---

What about exceptions? These are for errors that can happen (further reading [here](https://stackoverflow.com/questions/5142418/what-is-the-use-of-assert-in-python) and [here](https://stackoverflow.com/questions/944592/best-practice-for-python-assert)).
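
As a minimal sketch of that distinction, the hypothetical function `inverse` below (not part of the original notebook) raises a `ValueError` for an input that a caller can realistically supply, instead of asserting that it never occurs.

```python
def inverse(x):
    """Return 1 / x; hypothetical counterpart to inverse_not_supporting_0."""
    if x == 0:
        # An error that can happen: raise an exception the caller can handle.
        raise ValueError('x must be non-zero')
    return 1 / x


try:
    inverse(0)
except ValueError as err:
    # The caller decides how to recover and the program carries on.
    print(f'Could not invert: {err}')
```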


## Engarde

> [Engarde](https://github.com/TomAugspurger/engarde) is a package for defensive data analysis.

Facts of life:
1. Data are messy.
1. An analysis relies on certain (invariant) assumptions about our data and the workflow.

We need to
1. explicitly state these assumptions.
1. check that they're true.

Ideally, without messing up our beautiful code.


---


*Engarde serves as an example; you can implement similar functionality on your own. Engarde focuses on `pandas.DataFrame` only.*

### Example

Imagine, at some point in our data processing pipeline:

```python
>>> data = load()
>>> prepared = prepare(data)
>>> extract_all_answers(prepared)
42
```

At each stage, we require some properties to hold, i.e. that some assertions are valid.

In [4]:
import numpy as np
import pandas as pd
import engarde.decorators as ed


def load():
    """Complicated data loading procedure.
    
    We require:
    - x, y_lower and y_upper should be of type float
    - x should be increasing
    """
    # this could be a rather messy data loading procedure
    df = pd.DataFrame({'x': [5, 1, 2, 3, 4, 0], 
                       'y_lower': [1, 0, 1, np.nan, np.nan, 0],
                       'y_upper': [4, 2, 2, 3, 4, 1]})
    return df

data = load()
data

|   | x | y_lower | y_upper |
|---|---|---------|---------|
| 0 | 5 | 1.0 | 4 |
| 1 | 1 | 0.0 | 2 |
| 2 | 2 | 1.0 | 2 |
| 3 | 3 | NaN | 3 |
| 4 | 4 | NaN | 4 |
| 5 | 0 | 0.0 | 1 |


In [5]:
@ed.is_monotonic(items={'x': (True, True)})
@ed.has_dtypes(items={'x': float, 'y_lower': float, 'y_upper': float})
def load():
    """Complicated data loading procedure.
    
    We require:
    - x, y_lower and y_upper should be of type float
    - x should be increasing
    """
    # this could be a rather messy data loading procedure
    df = pd.DataFrame({'x': [5, 1, 2, 3, 4, 0], 
                       'y_lower': [1, 0, 1, np.nan, np.nan, 0],
                       'y_upper': [4, 2, 2, 3, 4, 1]})
    return df

data = load()
data

AssertionError: x has the wrong dtype (<class 'float'>)

In [7]:
@ed.is_monotonic(items={'x': (True, True)})
@ed.has_dtypes(items={'x': float, 'y_lower': float, 'y_upper': float})
def load():
    """Complicated data loading procedure.
    
    We require:
    - x, y_lower and y_upper should be of type float
    - x should be increasing
    """
    # this could be a rather messy data loading procedure
    df = pd.DataFrame({'x': [5, 1, 2, 3, 4, 0], 
                       'y_lower': [1, 0, 1, np.nan, np.nan, 0],
                       'y_upper': [4, 2, 2, 3, 4, 1]}).sort_values('x').astype(float)
    return df

data = load()
data

|   | x | y_lower | y_upper |
|---|---|---------|---------|
| 5 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 2.0 |
| 2 | 2.0 | 1.0 | 2.0 |
| 3 | 3.0 | NaN | 3.0 |
| 4 | 4.0 | NaN | 4.0 |
| 0 | 5.0 | 1.0 | 4.0 |
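
The same two requirements can also be checked with plain pandas, without Engarde. The sketch below uses a hypothetical helper `check_loaded` (not part of the notebook) and assumes that the `(True, True)` tuple passed to `is_monotonic` means "increasing" and "strict".

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def check_loaded(df):
    """Hypothetical helper: the requirements of load() as plain asserts."""
    for column in ['x', 'y_lower', 'y_upper']:
        assert is_float_dtype(df[column]), f'{column} is not of type float'
    # Strictly increasing: monotonic and free of repeated values.
    assert df['x'].is_monotonic_increasing and df['x'].is_unique, \
        'x is not strictly increasing'
    return df


nan = float('nan')
raw = pd.DataFrame({'x': [5, 1, 2, 3, 4, 0],
                    'y_lower': [1, 0, 1, nan, nan, 0],
                    'y_upper': [4, 2, 2, 3, 4, 1]})

# The unsorted integer data would fail; the cleaned-up version passes.
checked = check_loaded(raw.sort_values('x').astype(float))
```

The checks now live in the function body rather than in decorators, which is simpler but mixes validation with the data processing itself.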


# Exercises

- Implement the following requirements:
```python
def prepare(data):
    """Data preparation.
    
    We require:
    - no missing values
    - y_lower <= y_upper
    - x is the index variable
    - a new variable x_valid = x < 4; it must assume only the values {False, True}
    """
    return prepared
```
by employing [appropriate decorators](http://engarde.readthedocs.io/en/latest/api.html).

---

*You do not have to implement all requirements. Pick the ones you find most interesting!*

# Solutions

In [8]:
def has_index_name(df, index_name):
    return df.index.name == index_name

def is_y_lower_seq_y_upper(df):
    return df['y_lower'] <= df['y_upper']


@ed.verify_all(is_y_lower_seq_y_upper)
@ed.verify(has_index_name, 'x')
@ed.none_missing()
@ed.within_set({'x_valid': {True, False}})  # overlaps with the dtype check below
@ed.has_dtypes({'x_valid': bool})
def prepare(data):
    """Data preparation.
    
    We require:
    - no missing values
    - y_lower <= y_upper
    - x is index
    - a new variable x_valid = x < 4; it must assume only the values {False, True}
    """
    data_prepared = data.ffill()
    data_prepared['x_valid'] = data_prepared['x'] < 4
    return data_prepared.set_index('x')

prepare(load())

| x | y_lower | y_upper | x_valid |
|---|---------|---------|---------|
| 0.0 | 0.0 | 1.0 | True |
| 1.0 | 0.0 | 2.0 | True |
| 2.0 | 1.0 | 2.0 | True |
| 3.0 | 1.0 | 3.0 | True |
| 4.0 | 1.0 | 4.0 | False |
| 5.0 | 1.0 | 4.0 | False |


# Summary

- [Engarde](http://engarde.readthedocs.io/en/latest/) is one way to state and verify data-related assertions.


- It will not stop your code from failing, but it helps the code fail early and gracefully.


- It can serve as inspiration for how to structure your exploration, even at an early stage.


- You can implement similar assertion checking mechanisms:
  - Using decorators as above
  - Using customized container types
  - ...
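
A decorator-based check can be sketched in a few lines of standard-library Python. The names `verify`, `is_increasing` and `load_x` are hypothetical and only mimic the style used above; this is not Engarde's implementation. The decorator works on any return value, so the check function could just as well receive a `DataFrame`.

```python
from functools import wraps


def verify(check, *check_args, **check_kwargs):
    """Hypothetical decorator: assert check(result, ...) on the return value."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            assert check(result, *check_args, **check_kwargs), \
                f'{check.__name__} failed for {func.__name__}'
            return result
        return wrapper
    return decorate


def is_increasing(values):
    """Return True if each value is strictly greater than its predecessor."""
    return all(a < b for a, b in zip(values, values[1:]))


@verify(is_increasing)
def load_x():
    # Stand-in for a messy loading step; sorting establishes the invariant.
    return sorted([5, 1, 2, 3, 4, 0])


print(load_x())  # [0, 1, 2, 3, 4, 5]
```

The assumption stays next to the function it constrains, while the function body contains only the processing logic.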

# Questions?