# Error Types Examples

This notebook gives an overview of the existing error types.
Some of them take further configuration, which is passed to the constructor.
For this, you can create an [ErrorTypeConfig](../tab_err/error_type/_config.py) object or simply use a dict.

The examples below concatenate the original dataframe, the perturbed dataframe, and the error mask column-wise to make the changes easier to see.

In [1]:
from __future__ import annotations

import pandas as pd

from tab_err import error_type
from tab_err.api import low_level
from tab_err.error_mechanism import ECAR
from tab_err.error_type import ErrorTypeConfig

## Utils for Cleaner Notebook

In [2]:
def show_result(original_df: pd.DataFrame, perturbed_df: pd.DataFrame, error_mask: pd.DataFrame | None = None) -> pd.DataFrame:
    """Simple helper function to show DataFrames after perturbing them."""
    return (
        pd.concat([original_df, perturbed_df], keys=["original", "perturbed"], axis=1)
        if error_mask is None
        else pd.concat([original_df, perturbed_df, error_mask], keys=["original", "perturbed", "error_mask"], axis=1)
    )

In [3]:
df_user_content = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3, 4],
        "user": ["Alice", "Alice", "Bob", "Bob", "Clara", "David"],
        "content": ["¿Cómo estás?", "Привет, как дела?", "今日はどうですか", "Ça va bien, merci.", "¡Nos vemos mañana!", "Ich hätte Hunger."],
        "timestamp": ["12 a.m.", "3 p.m.", "3 p.m.", "4 a.m.", "1 p.m.", "1 p.m."],
    }
)
df_typist_book_title = pd.DataFrame(
    {
        "typist": ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob"],
        "book_title": ["To Kill a Mockingbird", "1984", "Pride and Prejudice", "The Great Gatsby", "Moby-Dick", "The Catcher in the Rye"],
        "rating": [1.0, 3.0, 3.0, 4.0, 2.0, 1.0],
    }
)
df_services = pd.DataFrame(
    {
        "service": [
            "service-A-2024-02-01",
            "service-A-2024-02-02",
            "service-A-2024-02-03",
            "service-A-2024-02-01",
            "service-B-2024-02-02",
            "service-B-2024-02-03",
            "service-C-2024-02-01",
            "service-C-2024-02-02",
            "service-C-2024-02-03",
        ]
    }
)

## Add Delta

If an instrument is incorrectly calibrated or wrongly used, systematic measurement errors can occur, e.g., measurements that are consistently too high.


This error type needs some configuration.
Here we use an explicit `ErrorTypeConfig` object.

In [4]:
add_delta_error = error_type.AddDelta(ErrorTypeConfig(add_delta_value=0.1))

df_corrupted, error_mask = low_level.create_errors(df_typist_book_title, "rating", 0.2, ECAR(), add_delta_error)

show_result(df_typist_book_title, df_corrupted, error_mask)

Unnamed: 0_level_0,original,original,original,perturbed,perturbed,perturbed,error_mask,error_mask,error_mask
Unnamed: 0_level_1,typist,book_title,rating,typist,book_title,rating,typist,book_title,rating
0,Alice,To Kill a Mockingbird,1.0,Alice,To Kill a Mockingbird,1.0,False,False,False
1,Alice,1984,3.0,Alice,1984,3.0,False,False,False
2,Alice,Pride and Prejudice,3.0,Alice,Pride and Prejudice,3.0,False,False,False
3,Bob,The Great Gatsby,4.0,Bob,The Great Gatsby,4.0,False,False,False
4,Bob,Moby-Dick,2.0,Bob,Moby-Dick,2.0,False,False,False
5,Bob,The Catcher in the Rye,1.0,Bob,The Catcher in the Rye,1.1,False,False,True
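
To make the mechanics concrete, here is a minimal plain-Python sketch of what happens conceptually: a mask selects cells completely at random, and the delta is added only to the selected cells. The helper names `sample_error_mask` and `add_delta` are hypothetical and not part of `tab_err`. Rounding the number of errors down is an assumption that happens to be consistent with the outputs in this notebook (one error for rate `0.2` on six rows).

```python
import random


def sample_error_mask(n_rows, error_rate, seed=0):
    """Mark int(error_rate * n_rows) rows as erroneous, chosen uniformly at random.

    Rounding down is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    n_errors = int(error_rate * n_rows)
    chosen = set(rng.sample(range(n_rows), n_errors))
    return [i in chosen for i in range(n_rows)]


def add_delta(values, mask, delta):
    """Add a constant offset to the masked cells and leave the rest untouched."""
    return [v + delta if m else v for v, m in zip(values, mask)]


ratings = [1.0, 3.0, 3.0, 4.0, 2.0, 1.0]
mask = sample_error_mask(len(ratings), error_rate=0.2, seed=0)
perturbed = add_delta(ratings, mask, delta=0.1)
```

Which row gets perturbed depends on the seed, but exactly one of the six ratings ends up shifted by `0.1`.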


We can also configure error types using a `dict`.

In [5]:
add_delta_error = error_type.AddDelta({"add_delta_value": 0.1})

df_corrupted, error_mask = low_level.create_errors(df_typist_book_title, "rating", 0.2, ECAR(), add_delta_error)

show_result(df_typist_book_title, df_corrupted, error_mask)

Unnamed: 0_level_0,original,original,original,perturbed,perturbed,perturbed,error_mask,error_mask,error_mask
Unnamed: 0_level_1,typist,book_title,rating,typist,book_title,rating,typist,book_title,rating
0,Alice,To Kill a Mockingbird,1.0,Alice,To Kill a Mockingbird,1.0,False,False,False
1,Alice,1984,3.0,Alice,1984,3.0,False,False,False
2,Alice,Pride and Prejudice,3.0,Alice,Pride and Prejudice,3.0,False,False,False
3,Bob,The Great Gatsby,4.0,Bob,The Great Gatsby,4.1,False,False,True
4,Bob,Moby-Dick,2.0,Bob,Moby-Dick,2.0,False,False,False
5,Bob,The Catcher in the Rye,1.0,Bob,The Catcher in the Rye,1.0,False,False,False


## MissingValue

If data transmission is interrupted or unstable, some cells can go missing. Users may also simply leave some fields empty.

Most error types are configurable and offer sane defaults.

In [6]:
missing_value_error = error_type.MissingValue()

df_corrupted, error_mask = low_level.create_errors(df_typist_book_title, "rating", 0.2, ECAR(), missing_value_error)

show_result(df_typist_book_title, df_corrupted, error_mask)

Unnamed: 0_level_0,original,original,original,perturbed,perturbed,perturbed,error_mask,error_mask,error_mask
Unnamed: 0_level_1,typist,book_title,rating,typist,book_title,rating,typist,book_title,rating
0,Alice,To Kill a Mockingbird,1.0,Alice,To Kill a Mockingbird,1.0,False,False,False
1,Alice,1984,3.0,Alice,1984,,False,False,True
2,Alice,Pride and Prejudice,3.0,Alice,Pride and Prejudice,3.0,False,False,False
3,Bob,The Great Gatsby,4.0,Bob,The Great Gatsby,4.0,False,False,False
4,Bob,Moby-Dick,2.0,Bob,Moby-Dick,2.0,False,False,False
5,Bob,The Catcher in the Rye,1.0,Bob,The Catcher in the Rye,1.0,False,False,False


If necessary, we can change the missing value representation, for example to `9999`.

In [7]:
missing_value_error = error_type.MissingValue({"missing_value": 9999})

df_corrupted, error_mask = low_level.create_errors(df_typist_book_title, "rating", 0.2, ECAR(), missing_value_error)

show_result(df_typist_book_title, df_corrupted, error_mask)

Unnamed: 0_level_0,original,original,original,perturbed,perturbed,perturbed,error_mask,error_mask,error_mask
Unnamed: 0_level_1,typist,book_title,rating,typist,book_title,rating,typist,book_title,rating
0,Alice,To Kill a Mockingbird,1.0,Alice,To Kill a Mockingbird,1.0,False,False,False
1,Alice,1984,3.0,Alice,1984,9999.0,False,False,True
2,Alice,Pride and Prejudice,3.0,Alice,Pride and Prejudice,3.0,False,False,False
3,Bob,The Great Gatsby,4.0,Bob,The Great Gatsby,4.0,False,False,False
4,Bob,Moby-Dick,2.0,Bob,Moby-Dick,2.0,False,False,False
5,Bob,The Catcher in the Rye,1.0,Bob,The Catcher in the Rye,1.0,False,False,False
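
The difference between the two representations can be sketched in plain Python. The helper `insert_missing` is hypothetical and only illustrates the idea; it is not how `tab_err` implements `MissingValue`. The sketch also shows why sentinel values such as `9999` are risky: they look like valid numbers and silently distort aggregates.

```python
import statistics


def insert_missing(values, mask, missing_value=None):
    """Replace the masked cells with a missing-value marker."""
    return [missing_value if m else v for v, m in zip(values, mask)]


ratings = [1.0, 3.0, 3.0, 4.0, 2.0, 1.0]
mask = [False, True, False, False, False, False]

with_none = insert_missing(ratings, mask)
with_sentinel = insert_missing(ratings, mask, missing_value=9999)

# A real missing marker has to be handled explicitly before aggregating ...
mean_without_missing = statistics.mean(v for v in with_none if v is not None)
# ... whereas a numeric sentinel is aggregated like any other value.
mean_with_sentinel = statistics.mean(with_sentinel)
```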


## Category Swap

Choosing the wrong value from a drop-down menu leads to a category swap error.

Some of the error types change values randomly.
If needed, we can set a seed to obtain reproducible results.
The same is true for the error mechanisms.

Note that category swap requires the to-be-perturbed column to be of type `category`.

In [8]:
category_swap_error = error_type.CategorySwap(seed=42)

df_corrupted, error_mask = low_level.create_errors(
    df_user_content.assign(user=lambda df: df["user"].astype("category")), "user", 0.6, ECAR(seed=42), category_swap_error
)

show_result(df_user_content, df_corrupted, error_mask)

Unnamed: 0_level_0,original,original,original,original,perturbed,perturbed,perturbed,perturbed,error_mask,error_mask,error_mask,error_mask
Unnamed: 0_level_1,user_id,user,content,timestamp,user_id,user,content,timestamp,user_id,user,content,timestamp
0,1,Alice,¿Cómo estás?,12 a.m.,1,David,¿Cómo estás?,12 a.m.,False,True,False,False
1,1,Alice,"Привет, как дела?",3 p.m.,1,Alice,"Привет, как дела?",3 p.m.,False,False,False,False
2,2,Bob,今日はどうですか,3 p.m.,2,Bob,今日はどうですか,3 p.m.,False,False,False,False
3,2,Bob,"Ça va bien, merci.",4 a.m.,2,Alice,"Ça va bien, merci.",4 a.m.,False,True,False,False
4,3,Clara,¡Nos vemos mañana!,1 p.m.,3,Clara,¡Nos vemos mañana!,1 p.m.,False,False,False,False
5,4,David,Ich hätte Hunger.,1 p.m.,4,Clara,Ich hätte Hunger.,1 p.m.,False,True,False,False
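
The role of the seed can be illustrated with a small sketch using only the standard library. The function `swap_category` is hypothetical. It assumes that a swap always picks a different category than the original one, which matches the flagged cells in the output above but is not taken from the `tab_err` source.

```python
import random


def swap_category(value, categories, rng):
    """Return a randomly chosen category that differs from the original value."""
    alternatives = [c for c in categories if c != value]
    return rng.choice(alternatives)


categories = ["Alice", "Bob", "Clara", "David"]

# Two generators created with the same seed produce the same swap.
first = swap_category("Alice", categories, random.Random(42))
second = swap_category("Alice", categories, random.Random(42))
```

Without a seed, repeated runs may pick different replacement categories, which makes results harder to compare across experiments.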


## Extraneous

If a user enters street name and house number in the same field instead of separate fields, the value contains extraneous information.

In [9]:
extraneous_error = error_type.Extraneous({"extraneous_value_template": "11/10 {value}"})

df_corrupted, error_mask = low_level.create_errors(df_user_content, "timestamp", 0.6, ECAR(), extraneous_error)

show_result(df_user_content, df_corrupted, error_mask)

Unnamed: 0_level_0,original,original,original,original,perturbed,perturbed,perturbed,perturbed,error_mask,error_mask,error_mask,error_mask
Unnamed: 0_level_1,user_id,user,content,timestamp,user_id,user,content,timestamp,user_id,user,content,timestamp
0,1,Alice,¿Cómo estás?,12 a.m.,1,Alice,¿Cómo estás?,11/10 12 a.m.,False,False,False,True
1,1,Alice,"Привет, как дела?",3 p.m.,1,Alice,"Привет, как дела?",3 p.m.,False,False,False,False
2,2,Bob,今日はどうですか,3 p.m.,2,Bob,今日はどうですか,3 p.m.,False,False,False,False
3,2,Bob,"Ça va bien, merci.",4 a.m.,2,Bob,"Ça va bien, merci.",11/10 4 a.m.,False,False,False,True
4,3,Clara,¡Nos vemos mañana!,1 p.m.,3,Clara,¡Nos vemos mañana!,11/10 1 p.m.,False,False,False,True
5,4,David,Ich hätte Hunger.,1 p.m.,4,David,Ich hätte Hunger.,1 p.m.,False,False,False,False
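
Assuming the template behaves like a regular Python format string with a `{value}` placeholder (an assumption based on its syntax), the perturbation above can be reproduced in plain Python. The mask is copied from the output table.

```python
template = "11/10 {value}"

timestamps = ["12 a.m.", "3 p.m.", "3 p.m.", "4 a.m.", "1 p.m.", "1 p.m."]
mask = [True, False, False, True, True, False]

# Wrap the original value into the template for every masked cell.
perturbed = [template.format(value=t) if m else t for t, m in zip(timestamps, mask)]
# perturbed == ["11/10 12 a.m.", "3 p.m.", "3 p.m.", "11/10 4 a.m.", "11/10 1 p.m.", "1 p.m."]
```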


## Wrong Dtype

Updating software components can introduce bugs, for example a component that no longer returns the correct data type.

In [10]:
mistype_error = error_type.Mistype({"mistype_dtype": "float64"})

df_corrupted, error_mask = low_level.create_errors(df_user_content, "user_id", 0.6, ECAR(), mistype_error)

show_result(df_user_content, df_corrupted, error_mask)

Unnamed: 0_level_0,original,original,original,original,perturbed,perturbed,perturbed,perturbed,error_mask,error_mask,error_mask,error_mask
Unnamed: 0_level_1,user_id,user,content,timestamp,user_id,user,content,timestamp,user_id,user,content,timestamp
0,1,Alice,¿Cómo estás?,12 a.m.,1.0,Alice,¿Cómo estás?,12 a.m.,True,False,False,False
1,1,Alice,"Привет, как дела?",3 p.m.,1.0,Alice,"Привет, как дела?",3 p.m.,False,False,False,False
2,2,Bob,今日はどうですか,3 p.m.,2.0,Bob,今日はどうですか,3 p.m.,True,False,False,False
3,2,Bob,"Ça va bien, merci.",4 a.m.,2.0,Bob,"Ça va bien, merci.",4 a.m.,True,False,False,False
4,3,Clara,¡Nos vemos mañana!,1 p.m.,3.0,Clara,¡Nos vemos mañana!,1 p.m.,False,False,False,False
5,4,David,Ich hätte Hunger.,1 p.m.,4.0,David,Ich hätte Hunger.,1 p.m.,False,False,False,False
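
A plain-Python illustration shows why this kind of error is easy to miss: the mistyped cells still compare equal to the originals, and only a type check reveals the difference. This is a sketch with a hand-written mask, not the `tab_err` implementation.

```python
user_ids = [1, 1, 2, 2, 3, 4]
mask = [True, False, True, True, False, False]

# Cast only the masked cells to float.
perturbed = [float(v) if m else v for v, m in zip(user_ids, mask)]

# Comparing values does not detect the error, because 1 == 1.0 in Python.
values_equal = perturbed == user_ids

# Comparing types does.
type_changed = [type(new) is not type(old) for old, new in zip(user_ids, perturbed)]
```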


## [Mojibake](https://en.wikipedia.org/wiki/Mojibake)

Mojibake occurs if the sender and receiver do not use the same text encoding.

In [11]:
mojibake_error = error_type.Mojibake()

df_corrupted, error_mask = low_level.create_errors(df_user_content, "content", 0.6, ECAR(), mojibake_error)

show_result(df_user_content, df_corrupted, error_mask)

Unnamed: 0_level_0,original,original,original,original,perturbed,perturbed,perturbed,perturbed,error_mask,error_mask,error_mask,error_mask
Unnamed: 0_level_1,user_id,user,content,timestamp,user_id,user,content,timestamp,user_id,user,content,timestamp
0,1,Alice,¿Cómo estás?,12 a.m.,1,Alice,¿Cómo estás?,12 a.m.,False,False,False,False
1,1,Alice,"Привет, как дела?",3 p.m.,1,Alice,"§±§â§Ú§Ó§Ö§ä, §Ü§Ñ§Ü §Õ§Ö§Ý§Ñ?",3 p.m.,False,False,True,False
2,2,Bob,今日はどうですか,3 p.m.,2,Bob,º£Æü¤Ï¤É¤¦¤Ç¤¹¤«,3 p.m.,False,False,True,False
3,2,Bob,"Ça va bien, merci.",4 a.m.,2,Bob,"ª®a va bien, merci.",4 a.m.,False,False,True,False
4,3,Clara,¡Nos vemos mañana!,1 p.m.,3,Clara,¡Nos vemos mañana!,1 p.m.,False,False,False,False
5,4,David,Ich hätte Hunger.,1 p.m.,4,David,Ich hätte Hunger.,1 p.m.,False,False,False,False
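
The underlying effect can be reproduced with the standard library alone: encode the text with one encoding and decode the bytes with another. The encoding pair below (UTF-8 read as Latin-1) is chosen for illustration; the output above was evidently produced with other encodings.

```python
def mojibake(text, sent_as="utf-8", read_as="latin-1"):
    """Encode with the sender's encoding and decode with the receiver's."""
    return text.encode(sent_as).decode(read_as)


garbled = mojibake("Ich hätte Hunger.")
# garbled == "Ich hÃ¤tte Hunger."

# As long as no bytes were lost, reversing the two steps restores the text.
recovered = garbled.encode("latin-1").decode("utf-8")
```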


## Outlier

Values can be much higher or lower than they normally are.


In [12]:
outlier_error = error_type.Outlier()

df_corrupted, error_mask = low_level.create_errors(df_typist_book_title, "rating", 0.6, ECAR(), outlier_error)

show_result(df_typist_book_title, df_corrupted, error_mask)

Unnamed: 0_level_0,original,original,original,perturbed,perturbed,perturbed,error_mask,error_mask,error_mask
Unnamed: 0_level_1,typist,book_title,rating,typist,book_title,rating,typist,book_title,rating
0,Alice,To Kill a Mockingbird,1.0,Alice,To Kill a Mockingbird,-2.866422,False,False,True
1,Alice,1984,3.0,Alice,1984,6.206065,False,False,True
2,Alice,Pride and Prejudice,3.0,Alice,Pride and Prejudice,6.055841,False,False,True
3,Bob,The Great Gatsby,4.0,Bob,The Great Gatsby,4.0,False,False,False
4,Bob,Moby-Dick,2.0,Bob,Moby-Dick,2.0,False,False,False
5,Bob,The Catcher in the Rye,1.0,Bob,The Catcher in the Rye,1.0,False,False,False
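
One simple way to think about outliers is as values pushed away from the bulk of the column by a multiple of its spread. The sketch below does exactly that with the population standard deviation. It is an assumption made for illustration and not the algorithm `tab_err` uses, so it does not reproduce the numbers above.

```python
import statistics


def make_outlier(value, column, n_std=3.0, direction=1):
    """Shift a value by n_std standard deviations of its column (illustrative only)."""
    return value + direction * n_std * statistics.pstdev(column)


ratings = [1.0, 3.0, 3.0, 4.0, 2.0, 1.0]

high = make_outlier(4.0, ratings, direction=1)   # pushed above the observed range
low = make_outlier(1.0, ratings, direction=-1)   # pushed below the observed range
```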


## Permutate

This error occurs if concatenated values are wrongly ordered.

In [13]:
permutate_error = error_type.Permutate({"permutation_separator": "-", "permutation_pattern": [2, 3, 4, 0, 1]})

df_corrupted, error_mask = low_level.create_errors(df_services, "service", 0.6, ECAR(), permutate_error)

show_result(df_services, df_corrupted, error_mask)

Unnamed: 0_level_0,original,perturbed,error_mask
Unnamed: 0_level_1,service,service,service
0,service-A-2024-02-01,service-A-2024-02-01,False
1,service-A-2024-02-02,service-A-2024-02-02,False
2,service-A-2024-02-03,service-A-2024-02-03,False
3,service-A-2024-02-01,2024-02-01-service-A,True
4,service-B-2024-02-02,2024-02-02-service-B,True
5,service-B-2024-02-03,2024-02-03-service-B,True
6,service-C-2024-02-01,2024-02-01-service-C,True
7,service-C-2024-02-02,service-C-2024-02-02,False
8,service-C-2024-02-03,2024-02-03-service-C,True
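
The explicit pattern can be read as "position `i` of the result takes the token at index `pattern[i]`". A minimal sketch of this reading, with the hypothetical helper `permutate`, reproduces the perturbed values shown above.

```python
def permutate(value, separator, pattern):
    """Split the value, reorder the tokens according to the pattern, and join them again."""
    parts = value.split(separator)
    return separator.join(parts[i] for i in pattern)


result = permutate("service-A-2024-02-01", "-", [2, 3, 4, 0, 1])
# result == "2024-02-01-service-A"
```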


Random pattern example

In [14]:
permutate_error = error_type.Permutate({"permutation_separator": "-", "permutation_automation_pattern": "random"})

df_corrupted, error_mask = low_level.create_errors(df_services, "service", 0.6, ECAR(), permutate_error)

show_result(df_services, df_corrupted, error_mask)

Unnamed: 0_level_0,original,perturbed,error_mask
Unnamed: 0_level_1,service,service,service
0,service-A-2024-02-01,A-01-2024-service-02,True
1,service-A-2024-02-02,02-02-service-2024-A,True
2,service-A-2024-02-03,service-A-2024-02-03,False
3,service-A-2024-02-01,service-A-2024-02-01,False
4,service-B-2024-02-02,B-02-2024-service-02,True
5,service-B-2024-02-03,service-B-2024-02-03,False
6,service-C-2024-02-01,01-02-service-C-2024,True
7,service-C-2024-02-02,service-C-2024-02-02,False
8,service-C-2024-02-03,2024-C-service-02-03,True


Fixed pattern example

In [15]:
permutate_error = error_type.Permutate({"permutation_separator": "-", "permutation_automation_pattern": "fixed"})

df_corrupted, error_mask = low_level.create_errors(df_services, "service", 0.6, ECAR(), permutate_error)

show_result(df_services, df_corrupted, error_mask)

Unnamed: 0_level_0,original,perturbed,error_mask
Unnamed: 0_level_1,service,service,service
0,service-A-2024-02-01,service-A-2024-02-01,False
1,service-A-2024-02-02,service-A-2024-02-02,False
2,service-A-2024-02-03,service-02-03-A-2024,True
3,service-A-2024-02-01,service-02-01-A-2024,True
4,service-B-2024-02-02,service-02-02-B-2024,True
5,service-B-2024-02-03,service-B-2024-02-03,False
6,service-C-2024-02-01,service-C-2024-02-01,False
7,service-C-2024-02-02,service-02-02-C-2024,True
8,service-C-2024-02-03,service-02-03-C-2024,True
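
Judging from the two outputs, `"random"` appears to draw a new token order for every perturbed row, while `"fixed"` appears to draw one order and reuse it for all rows. The sketch below illustrates that interpretation with hypothetical helpers. It is an inference from the outputs, not a description of the `tab_err` source.

```python
import random


def random_pattern(n_parts, rng):
    """Draw a token order that differs from the original order."""
    identity = list(range(n_parts))
    pattern = identity[:]
    while pattern == identity:
        rng.shuffle(pattern)
    return pattern


def apply_pattern(value, pattern, separator="-"):
    parts = value.split(separator)
    return separator.join(parts[i] for i in pattern)


services = ["service-A-2024-02-01", "service-B-2024-02-02", "service-C-2024-02-03"]
rng = random.Random(0)

# "random": every row gets its own pattern.
random_rows = [apply_pattern(s, random_pattern(5, rng)) for s in services]

# "fixed": one pattern is drawn once and applied to every row.
fixed_pattern = random_pattern(5, rng)
fixed_rows = [apply_pattern(s, fixed_pattern) for s in services]
```

Note that rows with repeated tokens, such as `02-02`, can look unchanged after a permutation that only swaps the identical tokens.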


## Replace

Some characters or substrings are replaced with different ones.

In [16]:
replace_error = error_type.Replace(ErrorTypeConfig(replace_what="-", replace_with="_"))

df_corrupted, error_mask = low_level.create_errors(df_services, "service", 0.6, ECAR(), replace_error)

show_result(df_services, df_corrupted, error_mask)

Unnamed: 0_level_0,original,perturbed,error_mask
Unnamed: 0_level_1,service,service,service
0,service-A-2024-02-01,service-A-2024-02-01,False
1,service-A-2024-02-02,service-A-2024-02-02,False
2,service-A-2024-02-03,service_A_2024_02_03,True
3,service-A-2024-02-01,service_A_2024_02_01,True
4,service-B-2024-02-02,service-B-2024-02-02,False
5,service-B-2024-02-03,service_B_2024_02_03,True
6,service-C-2024-02-01,service_C_2024_02_01,True
7,service-C-2024-02-02,service-C-2024-02-02,False
8,service-C-2024-02-03,service_C_2024_02_03,True
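
In the output above, every occurrence of the separator is replaced within a perturbed cell. In plain Python, the same result follows from `str.replace`, shown here as a sketch with a hand-written mask.

```python
services = ["service-A-2024-02-02", "service-A-2024-02-03"]
mask = [False, True]

# str.replace substitutes all occurrences of the substring, not just the first one.
perturbed = [s.replace("-", "_") if m else s for s, m in zip(services, mask)]
# perturbed == ["service-A-2024-02-02", "service_A_2024_02_03"]
```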


## Typo

Manually typing on a keyboard can lead to accidentally hitting the wrong keys.

In [17]:
typo_error = error_type.Typo()

df_corrupted, error_mask = low_level.create_errors(df_typist_book_title, "book_title", 0.6, ECAR(), typo_error)

show_result(df_typist_book_title, df_corrupted, error_mask)

Unnamed: 0_level_0,original,original,original,perturbed,perturbed,perturbed,error_mask,error_mask,error_mask
Unnamed: 0_level_1,typist,book_title,rating,typist,book_title,rating,typist,book_title,rating
0,Alice,To Kill a Mockingbird,1.0,Alice,To Kill a Mockingbird,1.0,False,False,False
1,Alice,1984,3.0,Alice,1984,3.0,False,False,False
2,Alice,Pride and Prejudice,3.0,Alice,Pride and Prejydice,3.0,False,True,False
3,Bob,The Great Gatsby,4.0,Bob,The Great Gatsby,4.0,False,False,False
4,Bob,Moby-Dick,2.0,Bob,Moby-Dicj,2.0,False,True,False
5,Bob,The Catcher in the Rye,1.0,Bob,The Catcher in fhe Rye,1.0,False,True,False
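
The typos above replace a character with one that sits next to it on the keyboard. A minimal sketch of that idea follows. The neighbor map is a small, hand-written excerpt of a QWERTY layout and, like the helper `add_typo`, purely hypothetical.

```python
import random

# Partial QWERTY neighbor map, for illustration only.
NEIGHBORS = {"u": "yij", "k": "jlm", "t": "rfy", "e": "wrd", "a": "qsz", "o": "ipl"}


def add_typo(text, rng):
    """Replace one character that has known neighbors with a neighboring key."""
    candidates = [i for i, ch in enumerate(text) if ch.lower() in NEIGHBORS]
    if not candidates:
        return text  # nothing to mistype, e.g. a purely numeric title
    i = rng.choice(candidates)
    wrong_key = rng.choice(NEIGHBORS[text[i].lower()])
    return text[:i] + wrong_key + text[i + 1 :]


typo = add_typo("Moby-Dick", random.Random(1))
unchanged = add_typo("1984", random.Random(1))
```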


## Wrong Unit

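This error occurs if values are stored in the wrong unit, for example in cm instead of m. The example below scales the values by a factor of 10.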

In [18]:
wrong_unit = error_type.WrongUnit({"wrong_unit_scaling": lambda x: x * 10})

df_corrupted, error_mask = low_level.create_errors(df_typist_book_title, "rating", 0.6, ECAR(), wrong_unit)

show_result(df_typist_book_title, df_corrupted, error_mask)

Unnamed: 0_level_0,original,original,original,perturbed,perturbed,perturbed,error_mask,error_mask,error_mask
Unnamed: 0_level_1,typist,book_title,rating,typist,book_title,rating,typist,book_title,rating
0,Alice,To Kill a Mockingbird,1.0,Alice,To Kill a Mockingbird,10.0,False,False,True
1,Alice,1984,3.0,Alice,1984,30.0,False,False,True
2,Alice,Pride and Prejudice,3.0,Alice,Pride and Prejudice,30.0,False,False,True
3,Bob,The Great Gatsby,4.0,Bob,The Great Gatsby,4.0,False,False,False
4,Bob,Moby-Dick,2.0,Bob,Moby-Dick,2.0,False,False,False
5,Bob,The Catcher in the Rye,1.0,Bob,The Catcher in the Rye,1.0,False,False,False
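
For an actual cm-instead-of-m mix-up, the scaling function would multiply by 100. The following plain-Python sketch uses made-up heights and a hand-written mask to show the effect; all names in it are hypothetical.

```python
def meters_to_centimeters(x):
    """Scaling that mimics storing centimeters in a column that should hold meters."""
    return x * 100


heights_m = [1.5, 2.0, 1.25]
mask = [True, False, True]

perturbed = [meters_to_centimeters(h) if m else h for h, m in zip(heights_m, mask)]
# perturbed == [150.0, 2.0, 125.0]
```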
