In [1]:
%load_ext autoreload
%autoreload 2

In [25]:
import pandas as pd

from error_generation.api import low_level, mid_level
from error_generation.error_mechanism import EAR, ECAR, ENAR
from error_generation.error_type import Butterfinger, Mislabel, MissingValue, Mistype, Mojibake, Permutate, WrongUnit
from error_generation.utils import ErrorModel, MidLevelConfig

# Error Generation

This notebook demonstrates how the `error_generation` package works.
One Error Mechanism and one Error Type are combined into an Error Model, which is then used to insert errors into a table.

## Error Mechanism
The Error Mechanism determines the distribution of errors. 
We distinguish between Erroneous Not At Random (ENAR), Erroneous At Random (EAR) and Erroneous Completely At Random (ECAR).
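
To make the distinction concrete, here is a minimal plain-Python sketch of how the three mechanisms could choose which rows to corrupt. The helpers `ecar_rows`, `ear_rows` and `enar_rows` are hypothetical and not part of `error_generation`; the package's own sampling strategy may differ, and the "contiguous window of sorted values" used for ENAR is only an assumption for illustration.

```python
import random


def ecar_rows(n_rows: int, rate: float, rng: random.Random) -> list[int]:
    """Completely at random: every row is equally likely to be picked."""
    return sorted(rng.sample(range(n_rows), int(n_rows * rate)))


def ear_rows(condition_column: list, rate: float, rng: random.Random) -> list[int]:
    """At random: only rows sharing one value of *another* column are candidates."""
    n_errors = int(len(condition_column) * rate)
    chosen_value = rng.choice(sorted(set(condition_column)))
    candidates = [i for i, v in enumerate(condition_column) if v == chosen_value]
    return sorted(rng.sample(candidates, min(n_errors, len(candidates))))


def enar_rows(target_column: list, rate: float, rng: random.Random) -> list[int]:
    """Not at random: rows are picked based on the target column's own values."""
    n_errors = int(len(target_column) * rate)
    # Order rows by their value and take a contiguous window of that ordering.
    order = sorted(range(len(target_column)), key=lambda i: target_column[i])
    start = rng.randint(0, len(order) - n_errors)
    return sorted(order[start:start + n_errors])


rng = random.Random(0)
typists = ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob"]
print("ECAR:", ecar_rows(len(typists), 0.5, rng))
print("EAR: ", ear_rows(typists, 0.5, rng))
print("ENAR:", enar_rows(typists, 0.5, rng))
```

The only difference between the three helpers is what the choice of rows is allowed to depend on: nothing, another column, or the corrupted column itself.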

### Erroneous Not At Random (ENAR)
The distribution of ENAR errors depends on the erroneous value itself. For example, imagine three services write data into one table.
The table's schema contains a column `service`, into which each service writes its name and the date when inserting a value, following the format `${SERVICE}-YYYY-MM-DD`.
Now, imagine that one of the services uses the incorrect format `${SERVICE}-DD-MM-YYYY`.
In this scenario, whether a value is erroneous depends on the value itself, namely on which service wrote it.

Let's go ahead and simulate it using `error_generation`.

In [26]:
df_enar = pd.DataFrame(
    {
        "service": [
            "Aservice-2024-02-01",
            "Aservice-2024-02-02",
            "Aservice-2024-02-03",
            "Bservice-2024-02-01",
            "Bservice-2024-02-02",
            "Bservice-2024-02-03",
            "Cservice-2024-02-01",
            "Cservice-2024-02-02",
            "Cservice-2024-02-03",
        ]
    }
)
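# Split each value on "-" and reorder the tokens: keep the service name, reverse the date parts.
enar, permutate = ENAR(), Permutate({"permutation_separator": "-", "permutation_pattern": [0, 3, 2, 1]})

# 0.34 is the error rate: roughly a third of the nine rows get corrupted.
df_corrupted, error_mask = low_level.create_errors(df_enar, "service", 0.34, enar, permutate)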

In [4]:
df_corrupted

Unnamed: 0,service
0,Aservice-01-02-2024
1,Aservice-02-02-2024
2,Aservice-03-02-2024
3,Bservice-2024-02-01
4,Bservice-2024-02-02
5,Bservice-2024-02-03
6,Cservice-2024-02-01
7,Cservice-2024-02-02
8,Cservice-2024-02-03
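
The effect of the permutation pattern on a single value can be reproduced by hand. The sketch below is plain Python and does not call `error_generation`; `permute_tokens` is a hypothetical helper that mirrors what the output above suggests the `permutation_separator` and `permutation_pattern` settings do.

```python
def permute_tokens(value: str, separator: str, pattern: list[int]) -> str:
    """Split `value` on `separator` and rejoin the tokens in the order given by `pattern`."""
    tokens = value.split(separator)
    return separator.join(tokens[i] for i in pattern)


# Token 0 (the service name) stays in place; tokens 1-3 (the date) are reversed.
print(permute_tokens("Aservice-2024-02-01", "-", [0, 3, 2, 1]))  # Aservice-01-02-2024
```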


### Erroneous At Random (EAR)
If the distribution of errors in one column depends on the values of another column, we call the error distribution EAR. For example, imagine several typists manually digitizing a table. One of the typists might make errors while typing. Let's simulate this.

In [27]:
df_ear = pd.DataFrame(
    {
        "typist": ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob"],
        "book_title": ["To Kill a Mockingbird", "1984", "Pride and Prejudice", "The Great Gatsby", "Moby-Dick", "The Catcher in the Rye"],
    }
)
ear, butterfinger = EAR(), Butterfinger()

df_corrupted, error_mask = low_level.create_errors(df_ear, "book_title", 0.5, ear, butterfinger)



In [28]:
df_corrupted

Unnamed: 0,typist,book_title
0,Alice,To Kill q Mockingbird
1,Alice,1983
2,Alice,Pride ans Prejudice
3,Bob,The Great Gatsby
4,Bob,Moby-Dick
5,Bob,The Catcher in the Rye


### Erroneous Completely At Random (ECAR)
If the distribution of errors depends neither on the erroneous column nor on any other column in the table, we call the distribution ECAR. For example, imagine a table containing application data. Depending on the device and software used by the application's users, the users' content may contain encoding errors. Because the table does not include information on the device or software, the errors appear completely at random. We can use `error_generation` to simulate this.

In [29]:
df_ecar = pd.DataFrame(
    {
        "user": ["Alice", "Alice", "Bob", "Bob", "Clara", "David"],
        "content": ["¿Cómo estás?", "Привет, как дела?", "今日はどうですか", "Ça va bien, merci.", "¡Nos vemos mañana!", "Ich hätte Hunger."],
    }
)
ecar, mojibake = ECAR(), Mojibake({"encoding_sender": "utf-8", "encoding_receiver": "iso-8859-1"})

df_corrupted, error_mask = low_level.create_errors(df_ecar, "content", 0.5, ecar, mojibake)

In [30]:
df_corrupted

Unnamed: 0,user,content
0,Alice,¿Cómo estás?
1,Alice,"ÐÑÐ¸Ð²ÐµÑ, ÐºÐ°Ðº Ð´ÐµÐ»Ð°?"
2,Bob,ä»æ¥ã¯ã©ãã§ãã
3,Bob,"Ça va bien, merci."
4,Clara,¡Nos vemos mañana!
5,David,Ich hÃ¤tte Hunger.
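
The garbled strings above are what you get when bytes written in one encoding are read back in another. The following sketch reproduces the effect with Python's built-in codecs only; the function `mojibake` is a hypothetical stand-in, and its parameter names merely echo the `encoding_sender` and `encoding_receiver` settings used above.

```python
def mojibake(text: str, encoding_sender: str = "utf-8", encoding_receiver: str = "iso-8859-1") -> str:
    """Encode with the sender's codec, then decode the bytes with the receiver's codec."""
    return text.encode(encoding_sender).decode(encoding_receiver)


print(mojibake("Ich hätte Hunger."))  # Ich hÃ¤tte Hunger.
print(mojibake("Haus"))               # Haus (pure ASCII is unaffected)
```

Only characters outside ASCII change, because UTF-8 and ISO-8859-1 agree on the ASCII range.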


## Error Type
The Error Type complements the Error Mechanism: where the Error Mechanism determines whether a cell contains an error, the Error Type defines how the correct value is transformed into the erroneous one.

Below, we demonstrate the different Error Types.

### Mistype

In [31]:
mistype = Mistype({"mistype_dtype": "float64"})
ecar = ECAR()
df_mistype = pd.DataFrame({"a": [1, 2, 3], "b": ["blau", "gelb", "blau"]})
df_corrupted, error_mask = low_level.create_errors(df_mistype, "a", 0.5, ecar, mistype)

In [32]:
df_corrupted

Unnamed: 0,a,b
0,1.0,blau
1,2.0,gelb
2,3.0,blau


### Permutation

In [33]:
data = {"A": ["apple", "banana", "cherry", "pineapple"], "B": ["red apple", "yellow banana", "dark cherry", "blue pineapple"], "C": [10, 20, 30, 40]}
df_permutate = pd.DataFrame(data)
permutate = Permutate({"permutation_separator": " ", "permutation_automation_pattern": "fixed"})
df_corrupted, error_mask = low_level.create_errors(df_permutate, "B", 1.0, ecar, permutate)

In [34]:
df_corrupted

Unnamed: 0,A,B,C
0,apple,apple red,10
1,banana,banana yellow,20
2,cherry,cherry dark,30
3,pineapple,pineapple blue,40


### Mojibake

In [35]:
mojibake = Mojibake()
df_mojibake = pd.DataFrame({"a": [0, 1, 2], "b": ["Ente", "Haus", "Grünfelder Straße 17, 13357 Öppeln"]})
df_corrupted, error_mask = low_level.create_errors(df_mojibake, "b", 1.0, ecar, mojibake)

In [36]:
df_corrupted

Unnamed: 0,a,b
0,0,Ente
1,1,Haus
2,2,"Grnfelder Strae 17, 13357 ppeln"


### Butterfinger

In [37]:
butterfinger = Butterfinger()
df_butterfinger = pd.DataFrame({"a": [0, 1, 2], "b": ["Entspannung", "Genugtuung", "Ausgeglichenheit"]})
df_corrupted, error_mask = low_level.create_errors(df_butterfinger, "b", 1.0, ecar, butterfinger)

In [38]:
df_corrupted

Unnamed: 0,a,b
0,0,Entspannujg
1,1,Genigtuung
2,2,Ausgeglichenbeit
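
The typos above look like slips onto adjacent keys. As a rough illustration, the sketch below replaces one character with a keyboard neighbour. It is not the package's implementation: `NEIGHBOURS` is a deliberately partial, hypothetical QWERTY map, and the helper `butterfinger` is an assumption made for this example.

```python
import random

# Hypothetical, partial QWERTY neighbour map (not the table used by the package).
NEIGHBOURS = {
    "a": "qwsz",
    "d": "serfcx",
    "h": "gyujnb",
    "n": "bhjm",
    "u": "yhji",
}


def butterfinger(text: str, rng: random.Random) -> str:
    """Replace one character that has known neighbours with an adjacent key."""
    positions = [i for i, ch in enumerate(text) if ch.lower() in NEIGHBOURS]
    if not positions:
        return text  # no character we know how to mistype
    i = rng.choice(positions)
    replacement = rng.choice(NEIGHBOURS[text[i].lower()])
    return text[:i] + replacement + text[i + 1:]


print(butterfinger("Entspannung", random.Random(0)))
```

Passing a seeded `random.Random` keeps the corruption reproducible across runs.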


### Wrong Unit

In [39]:
wrong_unit = WrongUnit({"wrong_unit_scaling": lambda x: x / 1000})
df_wrong_unit = pd.DataFrame({"a": [0, 1, 2], "b": [40, 50, 60]})
df_corrupted, error_mask = low_level.create_errors(df_wrong_unit, 1, 1.0, ecar, wrong_unit)

In [40]:
df_corrupted

Unnamed: 0,a,b
0,0,0.04
1,1,0.05
2,2,0.06


### Mislabel

In [41]:
mislabel = Mislabel()
df_mislabel = pd.DataFrame({"a": [1, 2, 3], "b": ["blau", "gelb", "blau"]})
df_mislabel["b"] = df_mislabel["b"].astype("category")
df_corrupted, error_mask = low_level.create_errors(df_mislabel, "b", 1.0, ecar, mislabel)

In [42]:
df_corrupted

Unnamed: 0,a,b
0,1,gelb
1,2,blau
2,3,gelb
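
Conceptually, mislabeling swaps a categorical value for a different category from the same column. The sketch below shows that idea in plain Python; `mislabel` is a hypothetical helper, and how the package actually draws the replacement is not shown in this notebook.

```python
import random


def mislabel(label: str, categories: list[str], rng: random.Random) -> str:
    """Return a category that differs from the current label."""
    alternatives = [c for c in categories if c != label]
    if not alternatives:
        raise ValueError("need at least two categories to mislabel")
    return rng.choice(alternatives)


column = ["blau", "gelb", "blau"]
categories = sorted(set(column))
rng = random.Random(0)
# With only two categories, every label flips to the other one.
print([mislabel(value, categories, rng) for value in column])  # ['gelb', 'blau', 'gelb']
```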


### Missing

In [43]:
missing = MissingValue()
df_missing = pd.DataFrame({"a": [1, 2, 3], "b": ["blau", "gelb", "blau"]})
df_corrupted, error_mask = low_level.create_errors(df_missing, "b", 1.0, ecar, missing)

In [44]:
df_corrupted

Unnamed: 0,a,b
0,1,
1,2,
2,3,


## Error Generation APIs
There are three APIs available to generate errors:
- Low Level API `api.low_level`
- Mid Level API `api.mid_level`
- High Level API `api.high_level` (not implemented yet)

### Low Level API
The Low Level API applies a single Error Model to a table, as demonstrated in the examples above.

### Mid Level API
The Mid Level API allows the user to apply several Error Models to one table.
It prevents conflicting error insertions and ensures that as many errors are generated as the user requested.
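
One way to picture conflict-free insertion is to give each Error Model its own disjoint set of rows within a column, so that no cell is corrupted twice and every model reaches its requested count. The sketch below illustrates that idea only; `assign_rows` is a hypothetical helper and makes no claim about how `mid_level` is implemented.

```python
import random


def assign_rows(n_rows: int, rates: list[float], rng: random.Random) -> list[list[int]]:
    """Give each error model its own disjoint set of rows within one column."""
    free = list(range(n_rows))  # rows that are still clean
    assignments = []
    for rate in rates:
        n_errors = int(n_rows * rate)
        if n_errors > len(free):
            raise ValueError("requested more errors than there are clean cells left")
        picked = rng.sample(free, n_errors)
        free = [i for i in free if i not in picked]
        assignments.append(sorted(picked))
    return assignments


# Two models on a six-row column: three rows for the first, one for the second.
print(assign_rows(6, [0.5, 0.25], random.Random(0)))
```

Sampling only from the rows that are still clean is what guarantees both properties at once: no overlap, and exactly the requested number of errors per model.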

In [45]:
df_mid_level = pd.DataFrame(
    {
        "typist": ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob"],
        "book_title": ["To Kill a Mockingbird", "1984", "Pride and Prejudice", "The Great Gatsby", "Moby-Dick", "The Catcher in the Rye"],
    }
)

config = MidLevelConfig(
    {
        "typist": [ErrorModel(ENAR(), MissingValue(), 0.5)],
        "book_title": [ErrorModel(EAR(condition_to_column="typist"), Butterfinger(), 0.5)],
    }
)

df_corrupt, error_mask = mid_level.create_errors(df_mid_level, config)

In [46]:
df_corrupt

Unnamed: 0,typist,book_title
0,Alice,To Kill a Mockingbird
1,Alice,2984
2,,Pride anr Prejudice
3,,The Great Gatwby
4,,Moby-Dick
5,Bob,The Catcher in the Rye


In [47]:
error_mask

Unnamed: 0,typist,book_title
0,False,False
1,False,True
2,True,True
3,True,True
4,True,False
5,False,False
