In [1]:
import pandas as pd

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 1000)

from IPython.display import display, Markdown

# Data Fabricator

The data fabricator utility was created to mock a set of tables declaratively,
so that join integrity between tables is easy to define and maintain. This lets
full integration tests of pipelines be run and scaled without hand-crafting
individual data points for every table.

Many libraries, such as [faker](https://github.com/joke2k/faker),
[hypothesis](https://github.com/HypothesisWorks/hypothesis/tree/master/hypothesis-python),
or the newer [GAN](https://github.com/sdv-dev/TGAN)-based approaches,
either address the problem of mocking a **single** table realistically or rely on having a
dataset beforehand. `data_fabricator` needs no real data, only knowledge of what that
data looks like.

## Simple Example

Let's say we want to mock the following set of tables and their relationships:

![Entity relationship diagram](images/simple_erd.png)


The data fabricator configuration will look like:  

### YAML API

Example YAML API syntax:

In [2]:
import yaml

yaml_path = "tests/v1/scenarios/scenario2/single_yaml_config.yml"
with open(yaml_path, "r") as yaml_file:
    yaml_string = yaml_file.read()
    display(Markdown("\n".join(["```yaml", yaml_string, "```"])))

```yaml
tables:
  - _target_: data_fabricator.v1.core.mock_generator.create_table
    name: classes
    columns:
      student_id:
        _metadata_:
          foreign_key: True
        _target_: data_fabricator.v1.core.mock_generator.RowApply
        list_of_values: "students.student_id"
        row_func: "lambda x:x"
      course:
        _target_: data_fabricator.v1.core.mock_generator.RowApply
        list_of_values: "faculty.course"
        row_func: "lambda x:x"
        resize: True

  - _target_: data_fabricator.v1.core.mock_generator.create_table
    name: faculty
    num_rows: 5
    columns:
      faculty_id:
        _metadata_:
          primary_key: True
        _target_: data_fabricator.v1.core.mock_generator.UniqueId
      name:
        _target_: data_fabricator.v1.core.mock_generator.Faker
        provider: name
        faker_seed: 1
      course:
        _target_: data_fabricator.v1.core.mock_generator.ValuesFromSamples
        sample_values: ["engineering", "computer science", "mathematics"]

  - _target_: data_fabricator.v1.core.mock_generator.create_table
    name: students
    num_rows: 10
    columns:
      enrollment_date:
        end_dt: "2022-12-31"
        freq: M
        _target_: data_fabricator.v1.core.mock_generator.Date
        start_dt: "2021-01-01"
      name:
        _target_: data_fabricator.v1.core.mock_generator.Faker
        provider: name
        faker_seed: 1
      student_id:
        _metadata_:
          primary_key: True
        _target_: data_fabricator.v1.core.mock_generator.UniqueId

```

See [02_kedro_integration](02_kedro_integration.md) for more details. 

Hopefully, that was intuitive and easy to follow! Some intricacies are explained in the
following sections.

### Python API

Define the tables as subclasses of `BaseTable`:

In [3]:
from data_fabricator.v1.core.mock_generator import (
    BaseTable,
    Date,
    Faker,
    RowApply,
    UniqueId,
    ValuesFromSamples,
)


class Students(BaseTable):
    num_rows = 10
    _metadata_ = {
        "description": "Student table with enrollment info",
    }
    student_id = UniqueId()
    name = Faker(provider="name", faker_seed=1)
    enrollment_date = Date(start_dt="2019-01-01", end_dt="2020-12-31", freq="M")


class Faculty(BaseTable):
    num_rows = 5
    _metadata_ = {
        "description": "Faculty info along with departments",
    }
    faculty_id = UniqueId()
    name = Faker(provider="name", faker_seed=1)
    course = ValuesFromSamples(
        sample_values=["engineering", "computer science", "mathematics"],
        prob_null_kwargs={"seed": 1},
    )


class Classes(BaseTable):
    student_id = RowApply(list_of_values="Students.student_id", row_func=lambda x: x)
    course = RowApply(
        list_of_values="Faculty.course", row_func=lambda x: x, resize=True
    )

To generate the data, run the following:

In [4]:
from tabulate import tabulate
from data_fabricator.v1.core.mock_generator import MockDataGenerator

mdg = MockDataGenerator(tables=[Classes, Faculty, Students])
mdg.generate_all()

for table_name, table in mdg.tables.items():
    print(f"Table: {table_name}")
    print(tabulate(table.dataframe, headers=table.dataframe.columns, tablefmt="psql"))
    print("\n")

Resizing list from 5 to 10


Table: Classes
+----+--------------+------------------+
|    |   student_id | course           |
|----+--------------+------------------|
|  0 |            1 | mathematics      |
|  1 |            2 | mathematics      |
|  2 |            3 | computer science |
|  3 |            4 | engineering      |
|  4 |            5 | engineering      |
|  5 |            6 | computer science |
|  6 |            7 | mathematics      |
|  7 |            8 | computer science |
|  8 |            9 | engineering      |
|  9 |           10 | mathematics      |
+----+--------------+------------------+


Table: Faculty
+----+------------------+--------------+------------------+
|    | course           |   faculty_id | name             |
|----+------------------+--------------+------------------|
|  0 | engineering      |            1 | Ryan Gallagher   |
|  1 | mathematics      |            2 | Jon Cole         |
|  2 | mathematics      |            3 | Rachel Davis     |
|  3 | engineering      |         
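
Because `Classes.student_id` is drawn from `Students.student_id`, every row in `Classes` should join back to a student. A minimal sketch of that check in plain pandas is shown below. `find_orphans` is a hypothetical helper and not part of `data_fabricator`, and the small frames are hand-typed stand-ins for the generated `table.dataframe` objects.

```python
import pandas as pd


def find_orphans(
    child: pd.DataFrame, parent: pd.DataFrame, child_key: str, parent_key: str
) -> pd.DataFrame:
    """Return the child rows whose key has no match in the parent table."""
    has_parent = child[child_key].isin(parent[parent_key])
    return child[~has_parent]


# Stand-ins for the generated tables (assumed shapes, hand-typed values).
students = pd.DataFrame({"student_id": range(1, 11)})
classes = pd.DataFrame(
    {
        "student_id": [1, 2, 3, 10],
        "course": ["mathematics", "mathematics", "computer science", "mathematics"],
    }
)

# Every class row points at an existing student, so nothing is orphaned.
orphans = find_orphans(classes, students, "student_id", "student_id")

# Appending a row with an unknown student id breaks join integrity.
broken = pd.concat(
    [classes, pd.DataFrame({"student_id": [99], "course": ["engineering"]})],
    ignore_index=True,
)
broken_orphans = find_orphans(broken, students, "student_id", "student_id")
```

In an integration test, an empty result from a check like this is a cheap assertion that the fabricated tables can be joined safely.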

## V1 API Summary

- Dual Python/YAML API. The Python API allows for IDE autocomplete.
- Resolves table generation independently of declaration order, which means tables no
    longer need to be declared in order.
- Supports multiple config files, which helps break bigger sets of tables down into smaller,
    more manageable pieces.
- The YAML API makes `data_fabricator` compatible with other configuration management
    systems such as `Hydra`.
- Captures table- and column-level metadata under the `_metadata_` field for
    better documentation support.
- Various quality-of-life changes for less typing and more doing.
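
Order-independent generation can be pictured as a topological sort over the `Table.column` references. The sketch below uses the standard-library `graphlib` module (Python 3.9+) and illustrates the idea only; it is not taken from the `data_fabricator` source. The `references` mapping and `generation_order` are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical summary of the simple example: each table maps to the
# "Table.column" references used by its columns.
references = {
    "Classes": ["Students.student_id", "Faculty.course"],
    "Faculty": [],
    "Students": [],
}


def generation_order(references):
    """Order tables so that every referenced table precedes its dependants."""
    graph = {
        table: {ref.split(".")[0] for ref in refs}
        for table, refs in references.items()
    }
    # TopologicalSorter expects a mapping of node -> predecessors.
    return list(TopologicalSorter(graph).static_order())


order = generation_order(references)
print(order)
```

`Classes` is listed first in `references` but is always emitted after `Students` and `Faculty`, which is why the declaration order does not matter.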

## Configuration Structure

There are two ways to use the `data_fabricator` module: the Python API and the YAML API.

### Python API

In [5]:
class Table1(BaseTable):
    num_rows: ...
    _metadata_ = {
        "description": "...",
    }
    column_a: ...
    column_b: ...
    column_c: ...


class Table2(BaseTable):
    num_rows: ...
    _metadata_ = {
        "description": "...",
    }
    column_a: ...
    column_b: ...
    column_c: ...

### YAML API

Note that this syntax may change if you bring your own dependency injection (DI) framework.
The following examples use Hydra instantiation, which relies on the `_target_` keyword.

```yaml
tables:
  - _target_: data_fabricator.v1.core.mock_generator.create_table
    name: table1
    num_rows: ...
    _metadata_: 
      description: ...
    columns:
      column_a: ...
      column_b: ...
      column_c: ...

  - _target_: data_fabricator.v1.core.mock_generator.create_table
    name: table2
    num_rows: ...
    _metadata_: 
      description: ...
    columns:
      column_a: ...
      column_b: ...
      column_c: ...
```

A few points to note:
* Every table is represented by an object injected with `create_table`, which returns 
a dataclass exactly like the one declared with the Python API.
* Every column of that table is represented as a key under `columns`.
* All the columns of a table are declared explicitly. 
* The `_metadata_` field also holds other information about the table, such as a description. 
* Depending on the context, `num_rows` may be inferred, as for `classes` in the simple
example, which declares none.
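
To make the role of `_target_` concrete, here is a deliberately simplified sketch of target-based instantiation. It is not Hydra's implementation (the real entry point is `hydra.utils.instantiate`, which handles far more cases); the local `instantiate` function is hypothetical, and standard-library classes stand in for the `data_fabricator` column classes.

```python
import importlib


def instantiate(config):
    """Recursively build objects from dicts that carry a ``_target_`` key."""
    if isinstance(config, list):
        return [instantiate(item) for item in config]
    if not isinstance(config, dict):
        return config
    # Resolve nested values first so inner objects exist before the outer call.
    kwargs = {
        key: instantiate(value) for key, value in config.items() if key != "_target_"
    }
    if "_target_" not in config:
        return kwargs
    # Split "package.module.Name" into the module path and the attribute name.
    module_path, _, attr_name = config["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    return target(**kwargs)


# Standard-library targets stand in for data_fabricator classes here.
config = {
    "durations": [
        {"_target_": "datetime.timedelta", "days": 1, "hours": 12},
        {"_target_": "datetime.timedelta", "minutes": 30},
    ]
}
built = instantiate(config)
```

Every key other than `_target_` becomes a keyword argument, which is why column parameters sit directly beside `_target_` in the YAML.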


## Migration Guide from V0 to V1

If you are using the `v0` API, you can simply install the new version. The `v1` API 
exists side by side with the `v0` API, so you can convert one function at a time from 
`v0` to `v1`. 

Most of the work is porting the old structure to the `v1` YAML API. As 
an illustration, a before-and-after comparison is shown below. 

Before:
```yaml
customers:
  num_rows: 10
  columns:
    hcp_id:
      type: generate_unique_id
      prefix: hcp
      id_start_range: 1
      id_end_range: 11
    ftr1:
      type: generate_random_numbers
      start_range: 0
      end_range: 1
      prob_null: 0.25
```

After:
```yaml
customers:
  - _target_: data_fabricator.v1.core.mock_generator.create_table
    name: customers
    num_rows: 10
    columns:
      hcp_id:
        _target_: data_fabricator.v1.core.mock_generator.PrimaryKey
        prefix: hcp
        id_start_range: 1
        id_end_range: 11
      ftr1:
        _target_: data_fabricator.v1.core.mock_generator.RandomNumbers
        start_range: 0
        end_range: 1
        prob_null_kwargs: 
          prob_null: 0.25
```

A `v0_converter` function is provided in the package `data_fabricator.v1.utils` to ease the transition. It converts the functions that are implemented in both `v0` and `v1`. 

Note that it will not correctly migrate functions whose implementation changed in `v1`, for example `drop_filtered_condition_rows`, `cross_product`, and `drop_duplicates`.

In these cases, you may need to adjust the converted configuration manually to account for the implementation differences between `v0` and `v1`.


In [6]:
from data_fabricator.v1.utils import v0_converter

yaml_string = """
students:
  num_rows: 10
  columns:
    student_id:
      type: generate_unique_id
      seed: 1 # defaults to None
      prob_null: 0.5 # defaults to 0
      null_value: unassigned # defaults to None
"""
config = yaml.safe_load(yaml_string)
config = v0_converter(config)
v1_string = yaml.safe_dump(config, default_flow_style=False, sort_keys=False)

Result:

In [7]:
display(Markdown("\n".join(["```yaml", v1_string, "```"])))

```yaml
tables:
- _target_: data_fabricator.v1.core.mock_generator.create_table
  name: students
  num_rows: 10
  columns:
    student_id:
      _target_: data_fabricator.v1.core.mock_generator.UniqueId
      prob_null_kwargs:
        prob_null: 0.5
        null_value: unassigned
        seed: 1

```

Notice:
* Each YAML file must contain a dictionary key that returns a list of `create_table` 
objects. 
* Each object takes four table attributes: `name`, `num_rows`, `_metadata_`, and 
`columns`. 
* Each column is now represented by an object with the following format: `_target_: path.to.ColumnClass`.
* Each parameter of the respective class is passed directly below its `_target_`.
* Arguments to the wrapper function `probability_null`, such as `prob_null`, `seed`, 
and `null_value`, should go inside the `prob_null_kwargs` argument, as illustrated.
* A node for instantiating YAML config files as objects using `hydra` is provided in `data_fabricator.v1.nodes.hydra.fabricate_datasets`. `hydra` is currently an optional dependency; to use this functionality, run `pip install brix.data_fabricator[hydra]`.
* If you prefer another object injection framework, use the clean node provided in `data_fabricator.v1.nodes.fabrication.fabricate_datasets`. In that case, you may need to adjust the syntax to suit your chosen framework.
* The function signature for `data_fabricator.v1.nodes.hydra.fabricate_datasets` is
different from `data_fabricator.v0.nodes.fabrication.fabricate_datasets`.
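
The move of `prob_null`, `seed`, and `null_value` into `prob_null_kwargs` can be sketched in a few lines of plain Python. This illustrates the reshaping only: `nest_null_arguments` is a hypothetical helper, it does not translate `type` into `_target_`, and it assumes `seed` always belongs to the null wrapper, as in the converted example above. For real migrations, use `v0_converter`.

```python
# Keys that, in this sketch, are assumed to belong to the null wrapper.
NULL_KEYS = ("prob_null", "null_value", "seed")


def nest_null_arguments(column):
    """Move v0-style null arguments into a ``prob_null_kwargs`` mapping."""
    converted = {key: value for key, value in column.items() if key not in NULL_KEYS}
    null_kwargs = {key: column[key] for key in NULL_KEYS if key in column}
    if null_kwargs:
        converted["prob_null_kwargs"] = null_kwargs
    return converted


v0_column = {
    "type": "generate_unique_id",
    "seed": 1,
    "prob_null": 0.5,
    "null_value": "unassigned",
}
v1_like = nest_null_arguments(v0_column)

# A column without null arguments is returned unchanged.
untouched = nest_null_arguments({"type": "generate_unique_id", "prefix": "hcp"})
```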

### Node Difference
The function signature for the `v0` `fabricate_datasets`:

```python
def fabricate_datasets(
    fabrication_params: Dict[str, Any],
    ignore_prefix: List[str] = _IGNORE_DATAFRAMES_WITH_PREFIX,
    seed: int = None,
    **source_dfs: Dict[str, Union[pd.DataFrame, pyspark.sql.DataFrame]]
) -> Dict[str, pd.DataFrame]:
```

The function signature for the `v1` `fabricate_datasets`:

```python
def fabricate_datasets(
    ignore_prefix: List[str] = _IGNORE_DATAFRAMES_WITH_PREFIX,
    seed: int = None,
    **fabricator_params: Dict[str, List[BaseTable]]
) -> Dict[str, pd.DataFrame]:
```

Notice:
* `fabricator_params` is now a dictionary of lists of `BaseTable` objects, which means
   you can split your tables into different files. 
* `source_dfs` is now deprecated.
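
Since `fabricator_params` arrives as named lists of tables, a node has to combine them before generation. The following is a rough sketch of that flattening step, not the library's code: `collect_tables` is hypothetical, plain dictionaries with a `name` key stand in for `BaseTable` objects, and rejecting duplicate names is an assumption made for the example.

```python
def collect_tables(**fabricator_params):
    """Flatten several named lists of tables and reject duplicate names."""
    tables = {}
    for group, group_tables in fabricator_params.items():
        for table in group_tables:
            name = table["name"]
            if name in tables:
                raise ValueError(
                    f"Table {name!r} is declared more than once (seen again in {group!r})"
                )
            tables[name] = table
    return tables


# Two groups, as if they had been loaded from two separate config files.
people = [{"name": "students"}, {"name": "faculty"}]
enrolment = [{"name": "classes"}]

tables = collect_tables(people=people, enrolment=enrolment)
print(sorted(tables))
```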

## Available Data Fabricator Column Classes and Functions

In [8]:
import importlib
import inspect
import pkgutil
from types import FunctionType

from tabulate import tabulate

import data_fabricator.v1.core.functions
import data_fabricator.v1.core.mock_generator
from data_fabricator.v1 import core

# Discover every module that lives directly under data_fabricator.v1.core.
pkgname = core.__name__
pkgpath = core.__path__[0]
found_packages = list(pkgutil.iter_modules([pkgpath], prefix="{}.".format(pkgname)))
sub_packages = [x.split(".")[-1] for _, x, _ in found_packages]

func_row = []
columns_row = []
for idx, name in enumerate(sub_packages):
    col_classes = []
    list_of_functions = []
    # Import each module by its fully qualified name.
    module = importlib.import_module(found_packages[idx][1])
    clsmembers = inspect.getmembers(module, inspect.isclass)
    col_classes.extend(
        [
            cls[1]
            for cls in clsmembers
            if (
                issubclass(cls[1], data_fabricator.v1.core.mock_generator.BaseColumn)
                and cls[1] != data_fabricator.v1.core.mock_generator.BaseColumn
            )
        ]
    )
    func_members = inspect.getmembers(
        data_fabricator.v1.core.functions, inspect.isfunction
    )

    for col in col_classes:
        col_doc = col.__doc__.split("\n")[0]
        columns_row.append(
            {"column class": str(col).split("'")[1], "description": col_doc}
        )

    if "functions" in module.__name__:
        list_of_functions.extend(
            [
                (".".join([module.__name__, f[0]]), f[1])
                for f in func_members
                if isinstance(f[1], FunctionType)
                and not f[0].startswith("_")
                and f[0]
                not in [
                    "load_callable_with_libraries",
                    "load_function_if_string",
                    "probability_null",
                    "wraps",
                    "deepcopy",
                ]
            ]
        )

        for f in list_of_functions:
            f_doc = f[1].__doc__.split("\n")[0]
            func_row.append({"function": f[0], "description": f_doc})

### Column Classes: 

In [9]:
table = pd.DataFrame(columns_row)
print(tabulate(table, headers=table.columns, tablefmt="psql"))

+----+------------------------------------------------------------------+------------------------------------------------------------------------------+
|    | column class                                                     | description                                                                  |
|----+------------------------------------------------------------------+------------------------------------------------------------------------------|
|  0 | data_fabricator.v1.core.mock_generator.ColumnApply               | Generic wrapper to call a function on a list.                                |
|  1 | data_fabricator.v1.core.mock_generator.CrossProduct              | Given a list of lists, returns all possible combinations of values.          |
|  2 | data_fabricator.v1.core.mock_generator.CrossProductWithSeparator | Given a list of lists, return possible combinations values.                  |
|  3 | data_fabricator.v1.core.mock_generator.Date                      | Create a

### Functions:

In [10]:
table = pd.DataFrame(func_row)
print(tabulate(table, headers=table.columns, tablefmt="psql"))

+----+---------------------------------------------------------------------+--------------------------------------------------------------------------+
|    | function                                                            | description                                                              |
|----+---------------------------------------------------------------------+--------------------------------------------------------------------------|
|  0 | data_fabricator.v1.core.functions.column_apply                      | Generic wrapper to call a function on a list.                            |
|  1 | data_fabricator.v1.core.functions.conditional_generate_from_weights | Generate a new distribution conditional on ``value``.                    |
|  2 | data_fabricator.v1.core.functions.conditional_string                | Remap value based on the provided mapping.                               |
|  3 | data_fabricator.v1.core.functions.cross_product                     | Given a lis