# Generating Realistic Tabular Errors using `tab_err`

`tab_err` is an implementation of a tabular data error model that disentangles error mechanism and error type.
It generalizes the formalization of missing values, which makes missing values only one of many possible error types implemented here.
`tab_err` gives the user full control over the error generation process and makes it possible to model realistic errors with complex dependency structures.

This notebook briefly explains the concept and its implementation.

## Error Model

An error model combines an error mechanism and an error type, and defines the fraction of the column that should be perturbed (`error_rate`).

```python
from tab_err import ErrorModel, error_type, error_mechanism

error_model = ErrorModel(
    error_mechanism=error_mechanism.ECAR(),
    error_type=error_type.MissingValue(),
    error_rate=0.1
)
```

After initialization, the error model can be applied to a `DataFrame`.

```python
corrupted_data, error_mask = error_model.apply(data=data_frame, column="name")
```

## Error Mechanism

The error mechanism determines the distribution of errors. We distinguish between Erroneous Not At Random (ENAR), Erroneous At Random (EAR) and Erroneous Completely At Random (ECAR).
Error mechanisms are used to generate binary error masks that determine where errors will be inserted.

### Erroneous Completely At Random (ECAR)

If the distribution of errors is independent of the data in the table, it is called Erroneous Completely At Random (ECAR).
Imagine a table containing application data.
Depending on the device and software that the application's users rely on, their content may contain encoding errors.
Because the table does not include information on the device or software, the errors appear completely at random.

```python
from tab_err.error_mechanism import ECAR

ecar = ECAR()
error_mask = ecar.sample(data=data, column="text", error_rate=0.2)
```
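
To build intuition for what an ECAR mask looks like, here is a minimal, dependency-free sketch. It is not `tab_err`'s implementation: the helper `ecar_mask` is hypothetical, and it assumes that exactly `round(n_rows * error_rate)` rows are chosen uniformly at random.

```python
import random


def ecar_mask(n_rows: int, error_rate: float, seed: int = 0) -> list[bool]:
    """Hypothetical helper: mark rows as erroneous independently of any data."""
    rng = random.Random(seed)
    n_errors = round(n_rows * error_rate)
    # Every row is equally likely to be picked; no column value is consulted.
    error_rows = set(rng.sample(range(n_rows), n_errors))
    return [row in error_rows for row in range(n_rows)]


mask = ecar_mask(n_rows=10, error_rate=0.2)
print(sum(mask))  # 2 of the 10 rows are marked
```

Because the function never looks at the table, the resulting mask cannot correlate with any column, which is the defining property of ECAR.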


### Erroneous At Random (EAR)

If the distribution of errors in one column depends on the values of another column, we call the error distribution Erroneous At Random (EAR).
Imagine several typists manually digitizing a table.
One of the typists might make more errors than the others, so whether a cell is erroneous depends on the value of the `typist` column.

```python
from tab_err.error_mechanism import EAR

ear = EAR(condition_to_column="typist")
error_mask = ear.sample(data=data, column="name", error_rate=0.2)
```
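
The dependence on another column can be illustrated with a small, dependency-free sketch. It is only an illustration under simplifying assumptions, not how `tab_err` samples EAR masks: the helper `ear_mask`, the typist names, and the rule that all errors come from a single error-prone typist are hypothetical.

```python
import random


def ear_mask(condition_values, error_prone_value, error_rate, seed=0):
    """Hypothetical helper: only rows with a given condition value can be erroneous."""
    rng = random.Random(seed)
    # Candidate rows are selected by the *other* column, not by the target column.
    candidates = [i for i, v in enumerate(condition_values) if v == error_prone_value]
    n_errors = min(len(candidates), round(len(condition_values) * error_rate))
    error_rows = set(rng.sample(candidates, n_errors))
    return [i in error_rows for i in range(len(condition_values))]


typist = ["alice", "bob", "alice", "carol", "bob", "bob", "carol", "alice", "bob", "carol"]
mask = ear_mask(typist, error_prone_value="bob", error_rate=0.2)
# Every marked row was typed by "bob", so the errors depend on the `typist` column.
print([t for t, m in zip(typist, mask) if m])
```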


### Erroneous Not At Random (ENAR)

The distribution of errors that are Erroneous Not At Random (ENAR) depends on the erroneous value itself. For example, imagine three services writing data into one table.
The table contains a column `service`, into which the services write their name and the date when inserting a value, following the format `${SERVICE}-YYYY-MM-DD`.
Now, imagine that one of the services uses the incorrect format `${SERVICE}-DD-MM-YYYY`.
In this scenario, the distribution of the error depends on the erroneous value itself.

```python
from tab_err.error_mechanism import ENAR

enar = ENAR()
error_mask = enar.sample(data=data, column="service", error_rate=1.0)
```
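
The service scenario can be sketched without any dependencies. This is an illustration only, not `tab_err`'s ENAR implementation: the helpers `enar_mask` and `to_day_first` as well as the service names are hypothetical.

```python
def enar_mask(values, is_error_prone):
    """Hypothetical helper: decide per row from the column's own value only."""
    return [is_error_prone(value) for value in values]


def to_day_first(value):
    """Rewrite SERVICE-YYYY-MM-DD as SERVICE-DD-MM-YYYY.

    Assumes the service name itself contains no "-".
    """
    service, year, month, day = value.split("-")
    return f"{service}-{day}-{month}-{year}"


service = [
    "billing-2024-01-05",
    "search-2024-01-05",
    "auth-2024-01-06",
    "search-2024-01-07",
]
# Only values written by the (hypothetical) "search" service are erroneous.
mask = enar_mask(service, lambda value: value.startswith("search-"))
perturbed = [to_day_first(v) if m else v for v, m in zip(service, mask)]
print(perturbed)
```

The mask is computed from the `service` column alone, so whether a cell is erroneous depends on the value of that very cell.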

## Error Type

The ways in which cells are incorrect are represented by `ErrorType` objects. To apply them to data, an error mask is necessary, which needs to be generated upfront. See [Error_Types.ipynb](Error_Types.ipynb) for more information and examples.

```python
from tab_err.error_type import MissingValue

missing_value = MissingValue()
perturbed_column = missing_value.apply(data=data, error_mask=error_mask, column="income")
```

Note that, for efficiency reasons, `apply` returns only the perturbed column instead of the whole data.
The recommended way of generating errors is through the APIs, which take care of applying the error types correctly.
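
Conceptually, applying an error type means rewriting exactly those cells that the mask marks. The following dependency-free sketch illustrates this for missing values. It is not `tab_err`'s implementation: the helper `apply_missing_value` is hypothetical and works on plain lists instead of a `DataFrame`.

```python
def apply_missing_value(column, error_mask, missing=None):
    """Hypothetical helper: return a perturbed copy of a single column."""
    if len(column) != len(error_mask):
        raise ValueError("column and error_mask must have the same length")
    # Marked cells become missing; all other cells keep their value.
    return [missing if is_error else value for value, is_error in zip(column, error_mask)]


income = [1000, 2000, 3000]
error_mask = [False, True, False]
perturbed_column = apply_missing_value(income, error_mask)
print(perturbed_column)  # [1000, None, 3000]
```

As with `apply` above, only the perturbed column is returned and the input is left untouched, so the caller decides how to write it back into the table.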

## API Implementations

We offer three APIs to conveniently introduce realistic errors into tabular data.
1. **Low-Level API**: Applies a single error model with a given error rate
2. **Mid-Level API**: Applies multiple error models, prevents conflicting error insertions and ensures the correct error rate
3. **High-Level API**: Perturbs a dataset with a given error rate, uses random error mechanisms and types, and prevents conflicting error insertions

### Low-Level API

Perturbs a dataset without explicitly building an error model.
```python
from tab_err import error_mechanism, error_type
from tab_err.api import low_level

perturbed_data, error_mask = low_level.create_errors(
    data=data,
    column="income",
    error_rate=0.5,
    error_mechanism=error_mechanism.ECAR(),
    error_type=error_type.MissingValue()
)
```

This is equivalent to creating and applying an `ErrorModel` object.
```python
from tab_err import ErrorModel, error_type, error_mechanism

error_model = ErrorModel(
    error_mechanism=error_mechanism.ECAR(),
    error_type=error_type.MissingValue(),
    error_rate=0.5
)
corrupted_data, error_mask = error_model.apply(data=data, column="income")
```

### Mid-Level API

Binds multiple error models together using a `MidLevelConfig` object, which maps column names to lists of `ErrorModel`s.
Because `MidLevelConfig` is a thin wrapper around `dict`, it is also possible to pass a `dict` directly.
The mid-level API prevents conflicting error insertions and ensures the correct error rate.

```python
from tab_err import ErrorModel, error_mechanism, error_type
from tab_err.api import MidLevelConfig, mid_level

config = MidLevelConfig(
    {
        "typist": [
            ErrorModel(
                error_mechanism=error_mechanism.ENAR(),
                error_type=error_type.Mojibake(),
                error_rate=0.3
            )
        ],
        "book_title": [
            ErrorModel(
                error_mechanism=error_mechanism.EAR(condition_to_column="typist"),
                error_type=error_type.Typo(),
                error_rate=0.1
            ),
            ErrorModel(
                error_mechanism=error_mechanism.ECAR(),
                error_type=error_type.MissingValue(),
                error_rate=0.01
            ),
        ],
    }
)

corrupted_data, error_mask = mid_level.create_errors(data=data, config=config)
```
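
One way to think about preventing conflicting error insertions is that each error model may only pick cells that no earlier model has already perturbed. The following dependency-free sketch illustrates that idea for a single column. It is an assumption about the general approach, not a description of how `mid_level.create_errors` works internally, and the helper `non_overlapping_masks` is hypothetical.

```python
import random


def non_overlapping_masks(n_rows, error_rates, seed=0):
    """Hypothetical helper: sample one mask per error rate without overlaps."""
    rng = random.Random(seed)
    free_rows = list(range(n_rows))  # rows that no error model has used yet
    masks = []
    for rate in error_rates:
        n_errors = round(n_rows * rate)
        if n_errors > len(free_rows):
            raise ValueError("not enough unperturbed rows left for this error rate")
        chosen = set(rng.sample(free_rows, n_errors))
        # Remove the chosen rows so that later masks cannot select them again.
        free_rows = [row for row in free_rows if row not in chosen]
        masks.append([row in chosen for row in range(n_rows)])
    return masks


# Mirrors the two error rates configured for "book_title" above.
typo_mask, missing_mask = non_overlapping_masks(n_rows=100, error_rates=[0.1, 0.01])
print(sum(typo_mask), sum(missing_mask))  # 10 1
```

Since every mask is sampled from the remaining rows only, each error rate is met exactly and no cell receives two different errors.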


### High-Level API

Not yet implemented. We are working on this!