In [1]:
import pandas as pd

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 1000)

# Case Study - Generate Transactions Data

This case study shows how to fabricate the kind of transaction data one might encounter in a typical supermarket
operations study.

## Entity relationship diagram

![ER Diagram](images/transactions_case_study_erd.svg)


## Table definition

Product: Dimension table containing product information.

| Column name  | Column logic                               |
|--------------|--------------------------------------------|
| product_id   | unique ID prefixed with PROD_              |
| product_name | name of the product                        |
| category     | category of the product                    |
| price        | price of the product                       |


Customer: Dimension table containing customer information.

| Column name | Column logic                    |
|-------------|---------------------------------|
| customer_id | unique ID prefixed with CUST_   |
| name        | name of the customer            |
| address     | customer address                |
| membership  | membership type of the customer |


Transactions: Fact table containing customer transaction details.

| Column name             | Column logic                                      |
|-------------------------|---------------------------------------------------|
| transaction_id          | unique ID prefixed with TRNSCT_                   |
| customer_id             | customer ID from customer master                  |
| product_id              | product ID from product master                    |
| product_price           | price associated with product id in product table |
| quantity                | integer denoting total number of items            |
| transaction_amount      | total price (quantity*product price)              |
| transaction_date        | date and time of transaction                      |
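
To make the column logic of the fact table concrete, here is a minimal sketch in plain Python of how one transaction row relates to the product dimension. The product rows, the `build_transaction` helper, and all IDs and values are hypothetical and written by hand for illustration; they are not produced by the data fabricator.

```python
# Hypothetical product dimension, keyed by product_id (hand-written values).
product = {
    "PROD_00001": {"product_name": "APPLE", "category": "CATEG1", "price": 2.50},
    "PROD_00002": {"product_name": "BREAD", "category": "CATEG2", "price": 4.00},
}


def build_transaction(transaction_id, customer_id, product_id, quantity, transaction_date):
    """Build one fact row following the column logic in the table above."""
    # product_price is looked up from the product table via product_id.
    product_price = product[product_id]["price"]
    return {
        "transaction_id": transaction_id,
        "customer_id": customer_id,
        "product_id": product_id,
        "product_price": product_price,
        "quantity": quantity,
        # transaction_amount is quantity * product price.
        "transaction_amount": quantity * product_price,
        "transaction_date": transaction_date,
    }


row = build_transaction("TRNSCT_00001", "CUST_00001", "PROD_00002", 3, "2021-01-15")
print(row["product_price"], row["transaction_amount"])  # prints: 4.0 12.0
```

The point of the sketch is that `product_price` and `transaction_amount` are derived columns: both depend on which `product_id` the row references.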

In [2]:
import yaml

yaml_string = """
tables:
- _target_: data_fabricator.v1.core.mock_generator.create_table
  name: product
  num_rows: 5
  columns:
    product_id:
      _target_: data_fabricator.v1.core.mock_generator.UniqueId
      prefix: PROD_
      id_start_range: 1
      id_end_range: 100
      id_length: 10
      _metadata_: {
        "description": 'unique ID prefixed with "PROD_"'
      }
    product_name:
      _target_: data_fabricator.v1.core.mock_generator.Faker
      provider: pystr_format
      provider_args:
        string_format: ?????????
        letters: ABCDEFGHIJKLMNOPQRSTUVWXYZ
      faker_seed: 1
      _metadata_: {
        "description": 'name of the product'
      }
    category:
      _target_: data_fabricator.v1.core.mock_generator.ValuesFromSamples
      prob_null_kwargs:
        seed: 1
      sample_values:
      - CATEG1
      - CATEG2
      - CATEG3
      _metadata_: {
        "description": 'category of the product'
      }
    price:
      _target_: data_fabricator.v1.core.mock_generator.RandomNumbers
      start_range: 2
      end_range: 20
      _metadata_: {
        "description": 'price of the product'
      }
- _target_: data_fabricator.v1.core.mock_generator.create_table
  name: customer
  num_rows: 5
  columns:
    customer_id:
      _target_: data_fabricator.v1.core.mock_generator.UniqueId
      prefix: CUST_
      id_start_range: 1
      id_end_range: 100
      id_length: 10
      _metadata_: {
        "description": 'unique ID prefixed with "CUST_"'
      }
    name:
      _target_: data_fabricator.v1.core.mock_generator.Faker
      provider: name
      faker_seed: 1
      _metadata_: {
        "description": 'name of the customer'
      }
    address:
      _target_: data_fabricator.v1.core.mock_generator.Faker
      provider: address
      faker_seed: 1
      _metadata_: {
        "description": 'customer address'
      }
    membership:
      _target_: data_fabricator.v1.core.mock_generator.ValuesFromSamples
      sample_values:
        - PLATINUM
        - GOLD
        - SILVER
        - BRONZE
      _metadata_: {
        "description": 'membership type of the customer'
      }
- _target_: data_fabricator.v1.core.mock_generator.create_table
  name: transactions
  num_rows: 50
  columns:
    transaction_id:
      _target_: data_fabricator.v1.core.mock_generator.UniqueId
      prefix: TRNSCT_
      id_start_range: 1
      id_end_range: 500
      id_length: 10
      _metadata_: {
        "description": 'unique ID prefixed with "TRNSCT_"'
      }
    customer_id:
      _target_: data_fabricator.v1.core.mock_generator.RowApply
      prob_null_kwargs:
        seed: 1
      list_of_values:
      - customer.customer_id
      row_func: 'lambda x: x'
      resize: true
      _metadata_: {
        "description": 'customer ID from customer master'
      }
    product_id:
      _target_: data_fabricator.v1.core.mock_generator.RowApply
      prob_null_kwargs:
        seed: 1
      list_of_values:
      - product.product_id
      - product.price
      resize: true
      row_func: 'lambda x, y: x'
      _metadata_: {
        "description": 'product ID from product master'
      }
    product_price:
      _target_: data_fabricator.v1.core.mock_generator.RowApply
      prob_null_kwargs:
        seed: 1
      list_of_values:
      - product.product_id
      - product.price
      resize: true
      row_func: 'lambda x, y: y'
      _metadata_: {
        "description": 'price associated with product id in product table'
      }
    quantity:
      _target_: data_fabricator.v1.core.mock_generator.RandomNumbers
      start_range: 1
      end_range: 5
      dtype: Int64
      _metadata_: {
        "description": 'integer denoting total number of items'
      }
    transaction_amount:
      _target_: data_fabricator.v1.core.mock_generator.RowApply
      list_of_values:
      - transactions.product_price
      - transactions.quantity
      row_func: 'lambda x,y: x*y'
      _metadata_: {
        "description": 'total price (quantity*product price)'
      }
    transaction_date:
      _target_: data_fabricator.v1.core.mock_generator.Date
      start_dt: 2021-01-01
      end_dt: 2021-06-30
      freq: D
      _metadata_: {
        "description": 'date and time of transaction'
      }

"""
config = yaml.safe_load(yaml_string)
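
In the configuration above, `product_id` and `product_price` in the `transactions` table use the same `list_of_values`, the same `seed`, and `resize: true`, and differ only in which argument their `row_func` returns. A plausible reading is that the shared seed keeps the two columns aligned row by row. The sketch below illustrates that idea with the standard library only. The `resize_rows` helper is hypothetical, and sampling with replacement is an assumption; the notebook does not show how the library actually resizes.

```python
import random


def resize_rows(columns, num_rows, seed):
    """Hypothetical stand-in for resizing: sample row positions with
    replacement using a seeded RNG and return the selected row tuples."""
    rng = random.Random(seed)
    source_len = len(columns[0])
    positions = [rng.randrange(source_len) for _ in range(num_rows)]
    return [tuple(col[i] for col in columns) for i in positions]


# Hand-written stand-ins for product.product_id and product.price.
product_id = ["PROD_A", "PROD_B", "PROD_C"]
price = [10.0, 20.0, 30.0]

# Two independent columns, each resized with the same seed.
rows_for_id = resize_rows([product_id, price], num_rows=8, seed=1)
rows_for_price = resize_rows([product_id, price], num_rows=8, seed=1)

# Same role as row_func 'lambda x, y: x' and 'lambda x, y: y' in the config.
ids = [(lambda x, y: x)(*row) for row in rows_for_id]
prices = [(lambda x, y: y)(*row) for row in rows_for_price]

# Because both columns drew the same row positions, every price matches its ID.
lookup = dict(zip(product_id, price))
aligned = all(lookup[i] == p for i, p in zip(ids, prices))
print(aligned)  # prints: True
```

Under this assumption, giving the two columns different seeds would draw different row positions, and a price could end up paired with the wrong product ID.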

The data fabricator configuration will look like:

In [3]:
print(yaml_string)


tables:
- _target_: data_fabricator.v1.core.mock_generator.create_table
  name: product
  num_rows: 5
  columns:
    product_id:
      _target_: data_fabricator.v1.core.mock_generator.UniqueId
      prefix: PROD_
      id_start_range: 1
      id_end_range: 100
      id_length: 10
      _metadata_: {
        "description": 'unique ID prefixed with "PROD_"'
      }
    product_name:
      _target_: data_fabricator.v1.core.mock_generator.Faker
      provider: pystr_format
      provider_args:
        string_format: ?????????
        letters: ABCDEFGHIJKLMNOPQRSTUVWXYZ
      faker_seed: 1
      _metadata_: {
        "description": 'name of the product'
      }
    category:
      _target_: data_fabricator.v1.core.mock_generator.ValuesFromSamples
      prob_null_kwargs:
        seed: 1
      sample_values:
      - CATEG1
      - CATEG2
      - CATEG3
      _metadata_: {
        "description": 'category of the product'
      }
    price:
      _target_: data_fabricator.v1.core.mock_generato

The data will look like:


In [4]:
from data_fabricator.v1.core.mock_generator import MockDataGenerator
from data_fabricator.v1.nodes.hydra import hydra_instantiate_dictionary
from tabulate import tabulate


config = hydra_instantiate_dictionary(config)

# A fixed seed makes the output reproducible but is not recommended for general use; decide deliberately when to set one
mock_generator = MockDataGenerator(tables=config["tables"], seed=1)
mock_generator.generate_all()

for table_name in mock_generator.tables:
    if not table_name.startswith("_"):
        df = mock_generator.tables[table_name].dataframe
        print(f"Table: {table_name}")
        print(tabulate(df, headers=df.columns, tablefmt="psql"))
        print("\n")

Resizing list from 5 to 50
Resizing list from 5 to 50
Resizing list from 5 to 50


Table: product
+----+--------------+----------------+------------+----------+
|    | product_id   | product_name   | category   |    price |
|----+--------------+----------------+------------+----------|
|  0 | PROD_00009   | DWTGMLQUC      | CATEG1     | 10.0908  |
|  1 | PROD_00018   | AVLTALSFY      | CATEG3     | 13.7287  |
|  2 | PROD_00033   | XAAOYJFKA      | CATEG3     | 16.197   |
|  3 | PROD_00073   | FLMGGFLHA      | CATEG1     |  3.68947 |
|  4 | PROD_00098   | VOQEZWDIS      | CATEG2     |  2.51025 |
+----+--------------+----------------+------------+----------+


Table: customer
+----+---------------+------------------+-------------------------------+--------------+
|    | customer_id   | name             | address                       | membership   |
|----+---------------+------------------+-------------------------------+--------------|
|  0 | CUST_00050    | Ryan Gallagher   | 417 Kennedy Isle Apt. 706     | PLATINUM     |
|    |               |                  | La