In [1]:
import yaml
from IPython.display import display, Markdown

# Kedro Integration

## TLDR

As part of the package, we provide a node that simplifies using the
`data_fabricator` utility. Below is a simple example of how this
node can be used in a pipeline.

* `hydra` is currently an optional dependency. To use this functionality, run `pip install brix.data_fabricator[hydra]`. Alternatively, if you prefer another object injection framework, you can use the clean node provided in `data_fabricator.v1.nodes.fabrication.fabricate_datasets`. In that case, you may need to adjust the configuration syntax to suit your chosen framework.

* The parameters of each `_target_` class are listed directly alongside it in the configuration.
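
To make the `_target_` convention concrete, the sketch below imitates the idea with the standard library only: a dotted path names a callable, and the remaining keys become its keyword arguments. The `resolve_target` helper is hypothetical and is not Hydra's implementation; `datetime.timedelta` merely stands in for a `data_fabricator` class.

```python
import importlib
from typing import Any


def resolve_target(config: Any) -> Any:
    """Recursively build objects from dicts that carry a ``_target_`` key.

    Hypothetical helper for illustration; not part of Hydra or data_fabricator.
    """
    if isinstance(config, list):
        return [resolve_target(item) for item in config]
    if not isinstance(config, dict):
        return config
    # Resolve nested values first so inner objects exist before the outer call.
    resolved = {key: resolve_target(value) for key, value in config.items()}
    if "_target_" not in resolved:
        return resolved
    # Split "package.module.Name" into the module path and the attribute name.
    module_path, _, attr_name = resolved.pop("_target_").rpartition(".")
    factory = getattr(importlib.import_module(module_path), attr_name)
    # Every remaining key is passed as a keyword argument.
    return factory(**resolved)


example = {"_target_": "datetime.timedelta", "days": 1, "hours": 12}
duration = resolve_target(example)
print(duration)  # 1 day, 12:00:00
```

Nested `_target_` entries, such as the column definitions inside a table, are resolved before the enclosing entry in this sketch, so the outer callable receives already-built objects.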

Suppose the following is defined in `parameters.yml`:

In [2]:
yaml_input_param = """
customers:
  - _target_: data_fabricator.v1.core.mock_generator.create_table
    name: customers
    num_rows: 10
    columns:
      hcp_id:
        _target_: data_fabricator.v1.core.mock_generator.PrimaryKey
        prefix: hcp
        id_start_range: 1
        id_end_range: 11
      ftr1:
        _target_: data_fabricator.v1.core.mock_generator.RandomNumbers
        start_range: 0
        end_range: 1
        prob_null_kwargs: 
          prob_null: 0.25
  """
my_df_params = yaml.safe_load(yaml_input_param)
display(Markdown("\n".join(["```yaml", yaml_input_param, "```"])))

```yaml

customers:
  - _target_: data_fabricator.v1.core.mock_generator.create_table
    name: customers
    num_rows: 10
    columns:
      hcp_id:
        _target_: data_fabricator.v1.core.mock_generator.PrimaryKey
        prefix: hcp
        id_start_range: 1
        id_end_range: 11
      ftr1:
        _target_: data_fabricator.v1.core.mock_generator.RandomNumbers
        start_range: 0
        end_range: 1
        prob_null_kwargs: 
          prob_null: 0.25
  
```

In a Kedro project, this parameter would be defined in a `parameters.yml` file and
passed as an input to the node. For this example, we pass it directly to the
node.

In [3]:
from data_fabricator.v1.nodes.hydra import fabricate_datasets

# Setting a seed is not recommended for general use; consider carefully when to use one.
output_datasets = fabricate_datasets(**my_df_params, seed=1)

`output_datasets` now contains a dictionary of datasets, just as `kedro` would expect.
Let us show the generated customer dataset:

In [4]:
from tabulate import tabulate

customer_table = output_datasets["customers"]
print(tabulate(customer_table, headers=customer_table.columns, tablefmt="psql"))

+----+----------+-------------+
|    | hcp_id   |        ftr1 |
|----+----------+-------------|
|  0 | hcp1     |   0.134364  |
|  1 | hcp2     |   0.847434  |
|  2 | hcp3     |   0.763775  |
|  3 | hcp4     | nan         |
|  4 | hcp5     |   0.495435  |
|  5 | hcp6     |   0.449491  |
|  6 | hcp7     | nan         |
|  7 | hcp8     |   0.788723  |
|  8 | hcp9     |   0.0938596 |
|  9 | hcp10    | nan         |
+----+----------+-------------+
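
The `seed=1` argument above is what makes this table reproducible across runs. The sketch below illustrates the general effect with Python's `random` module. `random_numbers_with_nulls` is a hypothetical stand-in for a nullable numeric column and makes no claim about how `data_fabricator` generates values internally.

```python
import random
from typing import List, Optional


def random_numbers_with_nulls(
    num_rows: int, prob_null: float, seed: Optional[int] = None
) -> List[Optional[float]]:
    """Draw uniform values in [0, 1], replacing some with None.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # a private generator, so global state is untouched
    values: List[Optional[float]] = []
    for _ in range(num_rows):
        value = rng.uniform(0, 1)
        # Each value is independently nulled with probability `prob_null`.
        values.append(None if rng.random() < prob_null else value)
    return values


first = random_numbers_with_nulls(10, prob_null=0.25, seed=1)
second = random_numbers_with_nulls(10, prob_null=0.25, seed=1)
print(first == second)  # True
```

Without a seed, each call draws fresh values, which is usually what you want for mock data; a fixed seed is mainly useful for documentation and tests.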


## Introduction
If you have a Kedro project, you can generate mock data with `data_fabricator`.
This section walks through the steps required to integrate it with a
Kedro project, using the example from `README.md`.

## Step-by-step
To configure your Kedro project to use `data_fabricator`, follow these steps:

* Modify your `parameters.yml` file.
* Modify your `catalog.yml` file.
* Create your `pipeline.py` file.
* Register your pipeline in the `pipeline_registry.py` file.
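
As a quick orientation, the sketch below lists where each of these four files lives in the project used in this tutorial. The paths mirror the ones written later in this document; the `files_to_edit` mapping itself is only an illustrative convenience.

```python
from pathlib import Path

# Project and package names as used in this tutorial.
project_root = Path("my_kedro_project")
package_name = "my_kedro_project"

# One entry per step, in the order the steps are carried out.
files_to_edit = {
    "parameters": project_root / "conf" / "base" / "parameters.yml",
    "catalog": project_root / "conf" / "base" / "catalog.yml",
    "pipeline": project_root / "src" / package_name / "pipelines" / "pipeline.py",
    "registry": project_root / "src" / package_name / "pipeline_registry.py",
}

for step, path in files_to_edit.items():
    print(f"{step}: {path.as_posix()}")
```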

In [5]:
import os

if os.environ.get("CIRCLECI"):
    default_env = os.environ.get("CONDA_DEFAULT_ENV")
    os.environ[
        "PYSPARK_DRIVER_PYTHON"
    ] = f"/home/circleci/miniconda/envs/{default_env}/bin/python"
    os.environ[
        "PYSPARK_PYTHON"
    ] = f"/home/circleci/miniconda/envs/{default_env}/bin/python"
os.environ["NUMEXPR_MAX_THREADS"] = "32"

In [6]:
import os
import sys
from pathlib import Path

# Make the current directory importable in this notebook and in subprocesses.
current_path = Path(os.curdir).absolute()
sys.path.insert(0, str(current_path))

os.environ["PYTHONPATH"] = (
    f"{os.getenv('PYTHONPATH')}{os.pathsep}" if os.getenv("PYTHONPATH") else ""
) + str(current_path)

In [7]:
import shutil
import subprocess
import tempfile
from pathlib import Path


def subprocess_call(cmd: str) -> None:
    """Run a shell command, raising an error if it fails."""
    print("=========================================")
    print(f"Calling: {cmd}")
    print("=========================================")
    subprocess.run(cmd, check=True, shell=True)


PROJECT_NAME = "my_kedro_project"

# Create a fresh temporary directory to hold the generated Kedro project.
tmp_path = Path(tempfile.mkdtemp())

# Answers for the prompts of `kedro new`.
filepath = tmp_path / "prompt.yml"

prompt_text = f"""
project_name: {PROJECT_NAME}
repo_name: {PROJECT_NAME}
python_package: {PROJECT_NAME}
"""

filepath.parent.mkdir(parents=True, exist_ok=True)
filepath.write_text(prompt_text)
subprocess_call(f"cd {tmp_path} && kedro new --config=prompt.yml")

# Copy the utility source into the new project so it can be imported there.
utility_path = Path.cwd().parent

shutil.copytree(
    utility_path,
    tmp_path / PROJECT_NAME / "src" / "data_fabricator",
)

Calling: cd /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmpupg52bvi && kedro new --config=prompt.yml

The project name 'my_kedro_project' has been applied to: 
- The project title in /private/var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmpupg52bvi/my_kedro_project/README.md 
- The folder created for your project in /private/var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmpupg52bvi/my_kedro_project 
- The project's python package in /private/var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmpupg52bvi/my_kedro_project/src/my_kedro_project

A best-practice setup includes initialising git and creating a virtual environment before running 'pip install -r src/requirements.txt' to install project-specific dependencies. Refer to the Kedro documentation: https://kedro.readthedocs.io/

Change directory to the project generated in /private/var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmpupg52bvi/my_kedro_project by entering 'cd /private/var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/

PosixPath('/var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmpupg52bvi/my_kedro_project/src/data_fabricator')

## Simple Example from `README.md`

Once you have created your Kedro project with `kedro new`, specify the following configuration in your `parameters.yml`:


In [8]:
import yaml

yaml_path = "tests/v1/scenarios/scenario2/single_yaml_config.yml"
with open(yaml_path, "r") as yaml_file:
    yaml_string = yaml_file.read()

yaml_string = "\n".join(
    [
        yaml_string,
        "# Setting seed is not recommended for general use, please consider when to use seed",
        "seed_val: 1",
    ]
)
filepath = tmp_path / PROJECT_NAME / "conf" / "base" / "parameters.yml"
filepath.parent.mkdir(parents=True, exist_ok=True)
filepath.write_text(yaml_string)
print(f"filepath: {filepath}")
display(Markdown("\n".join(["```yaml", yaml_string, "```"])))

filepath: /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmpupg52bvi/my_kedro_project/conf/base/parameters.yml


```yaml
tables:
  - _target_: data_fabricator.v1.core.mock_generator.create_table
    name: classes
    columns:
      student_id:
        _metadata_:
          foreign_key: True
        _target_: data_fabricator.v1.core.mock_generator.RowApply
        list_of_values: "students.student_id"
        row_func: "lambda x:x"
      course:
        _target_: data_fabricator.v1.core.mock_generator.RowApply
        list_of_values: "faculty.course"
        row_func: "lambda x:x"
        resize: True

  - _target_: data_fabricator.v1.core.mock_generator.create_table
    name: faculty
    num_rows: 5
    columns:
      faculty_id:
        _metadata_:
          primary_key: True
        _target_: data_fabricator.v1.core.mock_generator.UniqueId
      name:
        _target_: data_fabricator.v1.core.mock_generator.Faker
        provider: name
        faker_seed: 1
      course:
        _target_: data_fabricator.v1.core.mock_generator.ValuesFromSamples
        sample_values: ["engineering", "computer science", "mathematics"]

  - _target_: data_fabricator.v1.core.mock_generator.create_table
    name: students
    num_rows: 10
    columns:
      enrollment_date:
        end_dt: "2022-12-31"
        freq: M
        _target_: data_fabricator.v1.core.mock_generator.Date
        start_dt: "2021-01-01"
      name:
        _target_: data_fabricator.v1.core.mock_generator.Faker
        provider: name
        faker_seed: 1
      student_id:
        _metadata_:
          primary_key: True
        _target_: data_fabricator.v1.core.mock_generator.UniqueId

# Setting seed is not recommended for general use, please consider when to use seed
seed_val: 1
```
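
In this configuration, the `classes` table has no `num_rows`, and its `list_of_values` entries appear to point at columns of other tables in `table.column` form. Assuming that reading is correct, the sketch below lists which tables each table refers to. The `referenced_tables` helper is hypothetical, and the configuration is abridged to the relevant keys.

```python
from typing import Any, Dict, Set

# Abridged copy of the `tables` parameter above, keeping only the relevant keys.
tables = [
    {
        "name": "classes",
        "columns": {
            "student_id": {"list_of_values": "students.student_id"},
            "course": {"list_of_values": "faculty.course"},
        },
    },
    {"name": "faculty", "columns": {"faculty_id": {}, "name": {}, "course": {}}},
    {"name": "students", "columns": {"enrollment_date": {}, "name": {}, "student_id": {}}},
]


def referenced_tables(table: Dict[str, Any]) -> Set[str]:
    """Collect table names mentioned in `table.column` style references.

    Hypothetical helper; it assumes string `list_of_values` entries name a
    column of another table.
    """
    refs: Set[str] = set()
    for column in table["columns"].values():
        reference = column.get("list_of_values")
        if isinstance(reference, str) and "." in reference:
            refs.add(reference.split(".", 1)[0])
    return refs


for table in tables:
    print(table["name"], "->", sorted(referenced_tables(table)))
# classes -> ['faculty', 'students']
# faculty -> []
# students -> []
```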

Next, specify the output location and format in the `catalog.yml` file. In this case, each table is saved as a CSV file in the `raw` layer:

In [9]:
import yaml

catalog_yaml_string = """
students:
  type: pandas.CSVDataSet
  filepath: data/01_raw/students.csv
  layer: raw

faculty:
  type: pandas.CSVDataSet
  filepath: data/01_raw/faculty.csv
  layer: raw

classes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/classes.csv
  layer: raw
"""

filepath = tmp_path / PROJECT_NAME / "conf" / "base" / "catalog.yml"
filepath.parent.mkdir(parents=True, exist_ok=True)
filepath.write_text(catalog_yaml_string)
print(f"filepath: {filepath}")
display(Markdown("\n".join(["```yaml", catalog_yaml_string, "```"])))

filepath: /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmpupg52bvi/my_kedro_project/conf/base/catalog.yml


```yaml

students:
  type: pandas.CSVDataSet
  filepath: data/01_raw/students.csv
  layer: raw

faculty:
  type: pandas.CSVDataSet
  filepath: data/01_raw/faculty.csv
  layer: raw

classes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/classes.csv
  layer: raw

```
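
The three catalog entries differ only in the table name, so they can also be thought of as one template applied to each table. The sketch below builds the same mapping as a plain Python dictionary. `csv_catalog_entry` is a hypothetical helper for illustration and is not part of Kedro or `data_fabricator`.

```python
from typing import Dict


def csv_catalog_entry(table_name: str) -> Dict[str, str]:
    """Return a catalog entry matching the layout shown above."""
    return {
        "type": "pandas.CSVDataSet",
        "filepath": f"data/01_raw/{table_name}.csv",
        "layer": "raw",
    }


# One entry per table produced by the fabrication node.
catalog = {name: csv_catalog_entry(name) for name in ("students", "faculty", "classes")}

for name, entry in catalog.items():
    print(name, entry["filepath"])
```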

We can now create the pipeline. The `pipeline.py` file looks like this:

In [10]:
pipeline_file_txt = """
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline
from data_fabricator.v1.nodes.hydra import fabricate_datasets

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=fabricate_datasets,
                inputs=dict(fabrication_params="params:tables", seed="params:seed_val"),
                outputs=dict(students="students", faculty="faculty", classes="classes"),
                name="data_fabricator_node",
            )
        ]
    )
"""

filepath = tmp_path / PROJECT_NAME / "src" / PROJECT_NAME / "pipelines" / "pipeline.py"
filepath.parent.mkdir(parents=True, exist_ok=True)
filepath.write_text(pipeline_file_txt)
print(f"filepath: {filepath}")
display(Markdown("\n".join(["```python", pipeline_file_txt, "```"])))

filepath: /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmpupg52bvi/my_kedro_project/src/my_kedro_project/pipelines/pipeline.py


```python

from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline
from data_fabricator.v1.nodes.hydra import fabricate_datasets

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=fabricate_datasets,
                inputs=dict(fabrication_params="params:tables", seed="params:seed_val"),
                outputs=dict(students="students", faculty="faculty", classes="classes"),
                name="data_fabricator_node",
            )
        ]
    )

```
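
The node above uses dictionaries for both `inputs` and `outputs`: input keys are argument names of the function, and output keys are keys of the dictionary the function returns. The sketch below imitates that mapping in plain Python. `run_node` and `fake_fabricate` are hypothetical stand-ins, not Kedro's or `data_fabricator`'s implementation.

```python
from typing import Any, Callable, Dict


def run_node(
    func: Callable[..., Dict[str, Any]],
    inputs: Dict[str, str],
    outputs: Dict[str, str],
    data_store: Dict[str, Any],
) -> Dict[str, Any]:
    """Simplified imitation of running a node with dict inputs and outputs."""
    # Input keys are argument names; values are dataset names to load.
    kwargs = {argument: data_store[dataset] for argument, dataset in inputs.items()}
    result = func(**kwargs)
    # Output keys select entries of the returned dict; values are dataset names to save.
    return {dataset: result[key] for key, dataset in outputs.items()}


def fake_fabricate(fabrication_params: list, seed: int) -> Dict[str, str]:
    """Stand-in for the real node: returns one placeholder per table."""
    return {
        table["name"]: f"{table['name']} generated with seed {seed}"
        for table in fabrication_params
    }


data_store = {
    "params:tables": [{"name": "students"}, {"name": "faculty"}, {"name": "classes"}],
    "params:seed_val": 1,
}
saved = run_node(
    fake_fabricate,
    inputs=dict(fabrication_params="params:tables", seed="params:seed_val"),
    outputs=dict(students="students", faculty="faculty", classes="classes"),
    data_store=data_store,
)
print(sorted(saved))  # ['classes', 'faculty', 'students']
```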

Finally, register the pipeline in `pipeline_registry.py`:

In [11]:
pipeline_registry = """
from typing import Dict
from kedro.pipeline import Pipeline
from my_kedro_project.pipelines.pipeline import create_pipeline

def register_pipelines() -> Dict[str, Pipeline]:
    return dict(__default__= create_pipeline())
"""
filepath = tmp_path / PROJECT_NAME / "src" / PROJECT_NAME / "pipeline_registry.py"
filepath.parent.mkdir(parents=True, exist_ok=True)
filepath.write_text(pipeline_registry)
print(f"filepath: {filepath}")
display(Markdown("\n".join(["```python", pipeline_registry, "```"])))

filepath: /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmpupg52bvi/my_kedro_project/src/my_kedro_project/pipeline_registry.py


```python

from typing import Dict
from kedro.pipeline import Pipeline
from my_kedro_project.pipelines.pipeline import create_pipeline

def register_pipelines() -> Dict[str, Pipeline]:
    return dict(__default__= create_pipeline())

```

Now we can execute the pipeline with `kedro run`:

In [12]:
only_kedro = "kedro run"
cmd = f"cd {tmp_path}/{PROJECT_NAME} && {only_kedro}"
subprocess_call(cmd)

Calling: cd /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmpupg52bvi/my_kedro_project && kedro run
                             commands from
                             EntryPoint(name='deploy',
                             value='kedro_deploy.cli:cli',
                             group='kedro.project_commands'). Full
                             exception: No module named
                             'kedro_deploy'
[07/05/23 09:34:54] INFO     Kedro project my_kedro_project

In [13]:
import re

print(only_kedro)
log_file = tmp_path / PROJECT_NAME / "info.log"
with open(log_file) as f:
    logs_txt = f.read()

# Strip the leading "YYYY-MM-DD HH:MM:SS,mmm - " timestamp from each log line.
logs_txt = re.sub(
    r"\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2},\d{3}\s+-\s+", "", logs_txt
)
print(logs_txt)

kedro run
kedro.framework.session.session - INFO - Kedro project my_kedro_project
  return _bootstrap._gcd_import(name[level:], package, level)

kedro.io.data_catalog - INFO - Loading data from 'params:seed_val' (MemoryDataSet)...
kedro.io.data_catalog - INFO - Loading data from 'params:tables' (MemoryDataSet)...
kedro.pipeline.node - INFO - Running node: data_fabricator_node: fabricate_datasets([params:seed_val,params:tables]) -> [students,faculty,classes]
kedro.io.data_catalog - INFO - Saving data to 'students' (CSVDataSet)...
kedro.io.data_catalog - INFO - Saving data to 'faculty' (CSVDataSet)...
kedro.io.data_catalog - INFO - Saving data to 'classes' (CSVDataSet)...
kedro.runner.sequential_runner - INFO - Completed 1 out of 1 tasks
kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.
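
The log lines above are shown without timestamps because the cell removes them with a regular expression. The sketch below applies the same pattern to a single line; the timestamp in `sample` is a made-up value used only for illustration.

```python
import re

# Matches a leading "YYYY-MM-DD HH:MM:SS,mmm - " prefix.
TIMESTAMP_PREFIX = re.compile(r"\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2},\d{3}\s+-\s+")

# Hypothetical log line in the format written to info.log.
sample = (
    "2023-07-05 09:34:54,123 - kedro.runner.sequential_runner - INFO - "
    "Completed 1 out of 1 tasks"
)

cleaned = TIMESTAMP_PREFIX.sub("", sample)
print(cleaned)
# kedro.runner.sequential_runner - INFO - Completed 1 out of 1 tasks
```

Removing the timestamps keeps the rendered documentation stable between runs, since only the message text remains.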



The tables are generated in the directory configured for the `raw` layer:

In [14]:
cmd = f"ls {tmp_path}/{PROJECT_NAME}/data/01_raw/"
subprocess_call(cmd)

Calling: ls /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmpupg52bvi/my_kedro_project/data/01_raw/
classes.csv
faculty.csv
students.csv


To validate the output, we can inspect the first rows of the `classes.csv` file:

In [15]:
cmd = f"head {tmp_path}/{PROJECT_NAME}/data/01_raw/classes.csv"
subprocess_call(cmd)

Calling: head /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmpupg52bvi/my_kedro_project/data/01_raw/classes.csv
student_id,course
1,engineering
2,computer science
3,mathematics
4,computer science
5,engineering
6,mathematics
7,computer science
8,engineering
9,computer science
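
Beyond eyeballing the first rows, a simple consistency check is to confirm that every `student_id` in `classes.csv` also exists in `students.csv`. The sketch below does this with the standard `csv` module on small inline samples. The `students_csv` content and the `column_values` helper are hypothetical; in a real project you would read the files from `data/01_raw/` instead.

```python
import csv
import io
from typing import Set

# First rows of classes.csv as shown above.
classes_csv = """student_id,course
1,engineering
2,computer science
3,mathematics
"""

# Hypothetical, abridged students table used only for this illustration.
students_csv = """student_id,name
1,Student A
2,Student B
3,Student C
"""


def column_values(csv_text: str, column: str) -> Set[str]:
    """Return the distinct values of one column in a CSV document."""
    return {row[column] for row in csv.DictReader(io.StringIO(csv_text))}


# Any id left over here would be a dangling foreign key.
missing = column_values(classes_csv, "student_id") - column_values(students_csv, "student_id")
print(missing)  # set()
```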


In [16]:
# Final cleanup after running this document
subprocess_call(f"rm -r {tmp_path}/")

Calling: rm -r /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmpupg52bvi/
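
As an alternative to removing the directory by hand, `tempfile.TemporaryDirectory` can be used as a context manager so that cleanup happens automatically, even if an intermediate step fails. This is a minimal sketch of that pattern, not how this document is actually built; the file written inside is a placeholder.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    project_dir = Path(tmp_dir) / "my_kedro_project"
    project_dir.mkdir()
    # Placeholder for the project files created during the tutorial.
    (project_dir / "prompt.yml").write_text("project_name: my_kedro_project\n")
    existed_inside = project_dir.exists()

# The whole directory tree is deleted when the `with` block exits.
existed_after = project_dir.exists()
print(existed_inside, existed_after)  # True False
```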
