# Introduction
If you have a Kedro project, you can generate mocked data with data_fabricator. This document walks through the steps required to integrate it with a Kedro project, using the example from `README.md`.

## Kedro Integration
To configure your Kedro project to use data_fabricator, follow these steps:

* Modify your `parameters.yml` file.
* Modify your `catalog.yml` file.
* Create your `pipeline.py` file.
* Register your pipeline in the `pipeline_registry.py` file.
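
Each of the four steps above touches one file in the Kedro project. As a minimal sketch, the helper below lists where each file lives, using the same paths this tutorial writes to later. The function name `integration_files` is hypothetical and is not part of Kedro or data_fabricator.

```python
from pathlib import Path


def integration_files(project_root: Path, package_name: str) -> dict:
    """Map each integration step to the file it edits (paths as used in this tutorial)."""
    return {
        "parameters": project_root / "conf" / "base" / "parameters.yml",
        "catalog": project_root / "conf" / "base" / "catalog.yml",
        "pipeline": project_root / "src" / package_name / "pipelines" / "pipeline.py",
        "registry": project_root / "src" / package_name / "pipeline_registry.py",
    }


files = integration_files(Path("my_kedro_project"), "my_kedro_project")
for step, path in files.items():
    print(f"{step}: {path}")
```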

In [2]:
import os

if os.environ.get("CIRCLECI"):
    default_env = os.environ.get("CONDA_DEFAULT_ENV")
    os.environ[
        "PYSPARK_DRIVER_PYTHON"
    ] = f"/home/circleci/miniconda/envs/{default_env}/bin/python"
    os.environ[
        "PYSPARK_PYTHON"
    ] = f"/home/circleci/miniconda/envs/{default_env}/bin/python"
os.environ["NUMEXPR_MAX_THREADS"] = "32"

In [3]:
import os
import sys
from pathlib import Path

# Make the current directory importable, both in this notebook and in subprocesses.
current_path = Path(os.curdir).absolute()
sys.path.insert(0, str(current_path))

os.environ["PYTHONPATH"] = (
    f"{os.getenv('PYTHONPATH')}:" if os.getenv("PYTHONPATH") else ""
) + str(current_path)

In [4]:
import subprocess
import os
import shutil
import tempfile

from pathlib import Path

current_path = Path(os.curdir).absolute()


def subprocess_call(cmd: str) -> None:
    """Call subprocess with error check."""
    print("=========================================")
    print(f"Calling: {cmd}")
    print("=========================================")
    subprocess.run(cmd, check=True, shell=True)


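PROJECT_NAME = "my_kedro_project"

# mkdtemp() creates a directory that persists until it is removed explicitly,
# rather than relying on when a TemporaryDirectory object is garbage-collected.
tmp_path = Path(tempfile.mkdtemp())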


if tmp_path.exists() and tmp_path.is_dir():
    shutil.rmtree(tmp_path)

filepath = tmp_path / "prompt.yml"

prompt_text = f"""
project_name: {PROJECT_NAME}
repo_name: {PROJECT_NAME}
python_package: {PROJECT_NAME}
"""

filepath.parent.mkdir(parents=True, exist_ok=True)
filepath.write_text(prompt_text)
subprocess_call(f"cd {tmp_path} && kedro new --config=prompt.yml")

utility_path = Path.cwd() / "data_fabricator"

shutil.copytree(
    utility_path,
    tmp_path / PROJECT_NAME / "src" / "data_fabricator",
)

Calling: cd /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmp0n34l2vu && kedro new --config=prompt.yml

The project name 'my_kedro_project' has been applied to: 
- The project title in /private/var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmp0n34l2vu/my_kedro_project/README.md 
- The folder created for your project in /private/var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmp0n34l2vu/my_kedro_project 
- The project's python package in /private/var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmp0n34l2vu/my_kedro_project/src/my_kedro_project

A best-practice setup includes initialising git and creating a virtual environment before running 'pip install -r src/requirements.txt' to install project-specific dependencies. Refer to the Kedro documentation: https://kedro.readthedocs.io/

Change directory to the project generated in /private/var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmp0n34l2vu/my_kedro_project by entering 'cd /private/var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/

PosixPath('/var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmp0n34l2vu/my_kedro_project/src/data_fabricator')
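
The scratch project above lives in a temporary directory that is deleted by hand at the end of the notebook. As an alternative sketch (not what this notebook does), `tempfile.TemporaryDirectory` can be used as a context manager so that the directory is removed automatically. The file name `prompt.yml` and its keys match the cell above. The rest is illustrative.

```python
import tempfile
from pathlib import Path

PROJECT_NAME = "my_kedro_project"

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = Path(tmp_dir)
    prompt_file = tmp_path / "prompt.yml"
    prompt_file.write_text(
        f"project_name: {PROJECT_NAME}\n"
        f"repo_name: {PROJECT_NAME}\n"
        f"python_package: {PROJECT_NAME}\n"
    )
    existed_inside = prompt_file.exists()

# The directory and everything in it are removed once the block exits.
print(existed_inside, tmp_path.exists())
```

The trade-off is that everything that needs the project, including the `kedro run` call, has to happen inside the `with` block.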

## Simple Example from `README.md`

Once you have created your Kedro project with `kedro new`, specify the following configuration in your `parameters.yml`:


In [5]:

config_yaml_string = """
my_config:
  students:
    num_rows: 10
    columns:
      student_id:
        type: generate_unique_id
      name:
        type: faker
        provider: name
        # Setting seed is not recommended for general use, please consider when to use seed
        faker_seed: 1
      enrollment_date:
        type: generate_dates
        start_dt: 2019-01-01
        end_dt: 2020-12-31
        freq: M

  faculty:
    num_rows: 5
    columns:
      faculty_id:
        type: generate_unique_id
      name:
        type: faker
        provider: name
        # Setting seed is not recommended for general use, please consider when to use seed
        faker_seed: 1
      class:
        type: generate_values
        sample_values:
          - engineering
          - computer science
          - mathematics

  classes:
    columns:
      student_id:
        type: row_apply
        list_of_values: students.student_id
        row_func: "lambda x: x"
      class:
        type: row_apply
        list_of_values: faculty.class
        row_func: "lambda x: x"
        resize: True
# Setting seed is not recommended for general use, please consider when to use seed
seed_val: 1

"""

filepath = tmp_path / PROJECT_NAME / "conf" / "base" / "parameters.yml"
filepath.parent.mkdir(parents=True, exist_ok=True)
filepath.write_text(config_yaml_string)
print(f"filepath: {filepath}")
print(config_yaml_string)

filepath: /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmp0n34l2vu/my_kedro_project/conf/base/parameters.yml

my_config:
  students:
    num_rows: 10
    columns:
      student_id:
        type: generate_unique_id
      name:
        type: faker
        provider: name
        # Setting seed is not recommended for general use, please consider when to use seed
        faker_seed: 1
      enrollment_date:
        type: generate_dates
        start_dt: 2019-01-01
        end_dt: 2020-12-31
        freq: M

  faculty:
    num_rows: 5
    columns:
      faculty_id:
        type: generate_unique_id
      name:
        type: faker
        provider: name
        # Setting seed is not recommended for general use, please consider when to use seed
        faker_seed: 1
      class:
        type: generate_values
        sample_values:
          - engineering
          - computer science
          - mathematics

  classes:
    columns:
      student_id:
        type: row_apply
        list_of_va
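
In the configuration above, the `classes` table refers to columns of other tables through `list_of_values` entries such as `students.student_id`. The sketch below checks that such references point at columns that are actually defined. It is not part of data_fabricator: the function name `unresolved_references` is hypothetical, the dict is a hand-copied subset of `my_config`, and it assumes that a string value of `list_of_values` is always a `table.column` reference, as it is in this example.

```python
# A hand-copied subset of the `my_config` block, written as a plain dict.
my_config = {
    "students": {"num_rows": 10, "columns": {"student_id": {"type": "generate_unique_id"}}},
    "faculty": {"num_rows": 5, "columns": {"class": {"type": "generate_values"}}},
    "classes": {
        "columns": {
            "student_id": {"type": "row_apply", "list_of_values": "students.student_id"},
            "class": {"type": "row_apply", "list_of_values": "faculty.class"},
        }
    },
}


def unresolved_references(config: dict) -> list:
    """Return `table.column` references that do not point at a defined column."""
    missing = []
    for table, spec in config.items():
        for column, column_spec in spec["columns"].items():
            ref = column_spec.get("list_of_values")
            if not isinstance(ref, str):
                continue  # no reference on this column
            ref_table, _, ref_column = ref.partition(".")
            if ref_column not in config.get(ref_table, {}).get("columns", {}):
                missing.append(f"{table}.{column} -> {ref}")
    return missing


print(unresolved_references(my_config))
```

An empty list means every reference resolves. A misspelled reference would be reported as `table.column -> reference`.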

Next, specify the output locations and formats in the `catalog.yml` file. In this case, each table is saved as a CSV file in the `raw` layer:


In [6]:

catalog_yaml_string = """
students:
  type: pandas.CSVDataSet
  filepath: data/01_raw/students.csv
  layer: raw

faculty:
  type: pandas.CSVDataSet
  filepath: data/01_raw/faculty.csv
  layer: raw

classes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/classes.csv
  layer: raw
"""

filepath = tmp_path / PROJECT_NAME / "conf" / "base" / "catalog.yml"
filepath.parent.mkdir(parents=True, exist_ok=True)
filepath.write_text(catalog_yaml_string)
print(f"filepath: {filepath}")
print(catalog_yaml_string)

filepath: /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmp0n34l2vu/my_kedro_project/conf/base/catalog.yml

students:
  type: pandas.CSVDataSet
  filepath: data/01_raw/students.csv
  layer: raw

faculty:
  type: pandas.CSVDataSet
  filepath: data/01_raw/faculty.csv
  layer: raw

classes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/classes.csv
  layer: raw



We can now create the pipeline. The `pipeline.py` file looks like this:

In [7]:
pipeline_file_txt = """
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline
from data_fabricator.v0.nodes.fabrication import fabricate_datasets

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=fabricate_datasets,
                inputs=dict(fabrication_params="params:my_config", seed="params:seed_val"),
                outputs=dict(students="students", faculty="faculty", classes="classes"),
                name="data_fabricator_node",
            )
        ]
    )
"""

filepath = tmp_path / PROJECT_NAME / "src" / PROJECT_NAME / "pipelines" / "pipeline.py"
filepath.parent.mkdir(parents=True, exist_ok=True)
filepath.write_text(pipeline_file_txt)
print(f"filepath: {filepath}")
print(pipeline_file_txt)

filepath: /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmp0n34l2vu/my_kedro_project/src/my_kedro_project/pipelines/pipeline.py

from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline
from data_fabricator.v0.nodes.fabrication import fabricate_datasets

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=fabricate_datasets,
                inputs=dict(fabrication_params="params:my_config", seed="params:seed_val"),
                outputs=dict(students="students", faculty="faculty", classes="classes"),
                name="data_fabricator_node",
            )
        ]
    )
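
The node's `outputs` dict maps the keys returned by `fabricate_datasets` to dataset names, and only names that also appear in `catalog.yml` are written to disk. Kedro keeps any unregistered output in memory. As a small sketch, the check below compares the two sets of names. The dicts are copied by hand from this tutorial, and the variable names are illustrative.

```python
# Dataset names produced by the node (values of the `outputs` dict).
node_outputs = {"students": "students", "faculty": "faculty", "classes": "classes"}

# Dataset names declared in `catalog.yml`, with their file paths.
catalog_entries = {
    "students": "data/01_raw/students.csv",
    "faculty": "data/01_raw/faculty.csv",
    "classes": "data/01_raw/classes.csv",
}

# Any name listed here would stay in memory instead of being saved as a CSV file.
not_persisted = sorted(set(node_outputs.values()) - set(catalog_entries))
print(not_persisted)
```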



Finally, register the pipeline in `pipeline_registry.py`:

In [8]:
pipeline_registry = """
from typing import Dict
from kedro.pipeline import Pipeline
from my_kedro_project.pipelines.pipeline import create_pipeline

def register_pipelines() -> Dict[str, Pipeline]:
    return dict(__default__= create_pipeline())
"""
filepath = tmp_path / PROJECT_NAME / "src" / PROJECT_NAME / "pipeline_registry.py"
filepath.parent.mkdir(parents=True, exist_ok=True)
filepath.write_text(pipeline_registry)
print(f"filepath: {filepath}")
print(pipeline_registry)

filepath: /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmp0n34l2vu/my_kedro_project/src/my_kedro_project/pipeline_registry.py

from typing import Dict
from kedro.pipeline import Pipeline
from my_kedro_project.pipelines.pipeline import create_pipeline

def register_pipelines() -> Dict[str, Pipeline]:
    return dict(__default__= create_pipeline())



Now we can run the pipeline with `kedro run`:

In [9]:
only_kedro = "kedro run"
cmd = f"cd {tmp_path}/{PROJECT_NAME} && {only_kedro}"
subprocess_call(cmd)

Calling: cd /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmp0n34l2vu/my_kedro_project && kedro run
                             commands from
                             EntryPoint(name='deploy',
                             value='kedro_deploy.cli:cli',
                             group='kedro.project_commands'). Full
                             exception: No module named
                             'kedro_deploy'
[07/06/23 11:06:55] INFO     Kedro project my_kedro_project

In [10]:
import re

print(only_kedro)
log_file = tmp_path / PROJECT_NAME / "info.log"
with log_file.open() as f:
    logs_txt = f.read()

# Strip the leading "YYYY-MM-DD HH:MM:SS,mmm - " timestamp so the output is reproducible.
logs_txt = re.sub(r"\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2},\d{3}\s+-\s+", "", logs_txt)
print(logs_txt)

kedro run
kedro.framework.session.session - INFO - Kedro project my_kedro_project
  from ..core.fabricator import MockDataGenerator

  return _bootstrap._gcd_import(name[level:], package, level)

kedro.io.data_catalog - INFO - Loading data from 'params:my_config' (MemoryDataSet)...
kedro.io.data_catalog - INFO - Loading data from 'params:seed_val' (MemoryDataSet)...
kedro.pipeline.node - INFO - Running node: data_fabricator_node: fabricate_datasets([params:my_config,params:seed_val]) -> [students,faculty,classes]
kedro.io.data_catalog - INFO - Saving data to 'students' (CSVDataSet)...
kedro.io.data_catalog - INFO - Saving data to 'faculty' (CSVDataSet)...
kedro.io.data_catalog - INFO - Saving data to 'classes' (CSVDataSet)...
kedro.runner.sequential_runner - INFO - Completed 1 out of 1 tasks
kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.
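
The log lines above are printed without their timestamps because the cell removes them with a regular expression. The sketch below applies an equivalent pattern to a single line to show its effect. The sample line is hypothetical (its timestamp is made up), and the pattern is written as a raw string so that `\d` and `\s` are not treated as string escapes.

```python
import re

# Matches a prefix such as "2023-07-06 11:06:55,123 - ".
TIMESTAMP_PREFIX = re.compile(r"\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2},\d{3}\s+-\s+")

# Hypothetical log line; the timestamp value is made up for illustration.
sample = (
    "2023-07-06 11:06:55,123 - kedro.runner.sequential_runner - INFO - "
    "Completed 1 out of 1 tasks"
)

stripped = TIMESTAMP_PREFIX.sub("", sample)
print(stripped)
```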



The tables are generated in the directory specified for the `raw` layer:

In [11]:
cmd = f"ls {tmp_path}/{PROJECT_NAME}/data/01_raw/"
subprocess_call(cmd)

Calling: ls /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmp0n34l2vu/my_kedro_project/data/01_raw/
classes.csv
faculty.csv
students.csv


To validate the output, we can inspect the `classes.csv` file:

In [12]:
cmd = f"head {tmp_path}/{PROJECT_NAME}/data/01_raw/classes.csv"
subprocess_call(cmd)

Calling: head /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmp0n34l2vu/my_kedro_project/data/01_raw/classes.csv
student_id,class
1,computer science
2,engineering
3,engineering
4,engineering
5,computer science
6,engineering
7,mathematics
8,mathematics
9,mathematics
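
Beyond reading the file by eye, the same check can be scripted. As a minimal sketch, the code below parses the first rows printed above with the standard `csv` module and confirms that student IDs are unique and that every class is one of the `sample_values` from the configuration. The rows are inlined so the sketch is self-contained. In practice, the file under `data/01_raw/` would be opened instead.

```python
import csv
import io

# First rows of `classes.csv` as printed above, inlined for a self-contained example.
classes_csv = """student_id,class
1,computer science
2,engineering
3,engineering
"""

# The `sample_values` configured for `faculty.class`.
allowed_classes = {"engineering", "computer science", "mathematics"}

rows = list(csv.DictReader(io.StringIO(classes_csv)))
student_ids = [row["student_id"] for row in rows]

ids_unique = len(student_ids) == len(set(student_ids))
classes_valid = all(row["class"] in allowed_classes for row in rows)
print(ids_unique, classes_valid)
```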


In [13]:
# Final clean-up once this doc has run
subprocess_call(f"rm -r {tmp_path}/")

Calling: rm -r /var/folders/x8/_9l2j54n1lx_71w8kncv7mc80000gp/T/tmp0n34l2vu/
