In [1]:
import pandas as pd

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 1000)

# Case Study - Generate Transactions Data

This case study shows how to fabricate the kind of transaction data found in a typical supermarket operations
study.

## Entity relationship diagram

![ER Diagram](../images/transactions_case_study_erd.svg)


## Table definition

Product: Dimension table containing product information.

| Column name  | Column logic                               |
|--------------|--------------------------------------------|
| product_id   | unique ID prefixed with PROD_              |
| product_name | name of the product                        |
| category     | category of the product                    |
| price        | price of the product                       |
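
The IDs in the generated output shown later in this case study (for example `PROD_00009` and `CUST_00050`) suggest that `id_length` counts the prefix as well as the zero-padded number. The sketch below reproduces that format in plain Python. `format_id` is a hypothetical helper written for illustration, not part of data_fabricator, and the padding rule is inferred from the sample output rather than taken from the library's documentation.

```python
def format_id(number: int, prefix: str, id_length: int) -> str:
    """Hypothetical helper: zero-pad `number` so prefix + digits spans `id_length` characters."""
    digits = id_length - len(prefix)
    if digits <= 0 or len(str(number)) > digits:
        raise ValueError("id_length is too short for this prefix and number")
    return f"{prefix}{number:0{digits}d}"


print(format_id(9, "PROD_", 10))       # PROD_00009
print(format_id(50, "CUST_", 10))      # CUST_00050
print(format_id(500, "TRNSCT_", 10))   # TRNSCT_500
```

If the same rule applies to the longer `TRNSCT_` prefix, only three digits remain, which is still enough for the values up to 500 configured for `transaction_id`.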


Customer: Dimension table containing customer information.

| Column name | Column logic                       |
|-------------|------------------------------------|
| customer_id | unique ID prefixed with CUST_      |
| name        | full name of the customer          |
| address     | customer address                   |
| membership  | membership type of the customer    |


Transactions: Fact table containing customer transaction details.

| Column name             | Column logic                             |
|-------------------------|------------------------------------------|
| transaction_id          | unique ID prefixed with TRNSCT_          |
| customer_id             | customer ID from customer master         |
| product_id              | product ID from product master           |
| product_price           | price of the product from product master |
| quantity                | integer denoting total number of items   |
| transaction_amount      | total price (quantity * product_price)   |
| transaction_date        | date and time of transaction             |
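
The relationships described in the tables above can be sketched in plain Python, assuming tiny hand-written dimension tables. The IDs, names, and prices below are made up for illustration and are not fabricator output, and `build_transaction` is a hypothetical helper. Each transaction looks up its price through `product_id`, and `transaction_amount` is derived as quantity times price.

```python
# Hand-written stand-ins for the dimension tables (illustrative values only).
products = {
    "PROD_00001": {"product_name": "AAAAAAAAA", "category": "CATEG1", "price": 2.5},
    "PROD_00002": {"product_name": "BBBBBBBBB", "category": "CATEG2", "price": 4.0},
}
customers = {
    "CUST_00001": {"name": "Jane Doe", "membership": "GOLD"},
}


def build_transaction(transaction_id, customer_id, product_id, quantity, transaction_date):
    """Build one fact-table row, failing early if a foreign key has no match."""
    if customer_id not in customers:
        raise KeyError(f"unknown customer_id: {customer_id}")
    if product_id not in products:
        raise KeyError(f"unknown product_id: {product_id}")
    price = products[product_id]["price"]
    return {
        "transaction_id": transaction_id,
        "customer_id": customer_id,
        "product_id": product_id,
        "quantity": quantity,
        "transaction_amount": quantity * price,
        "transaction_date": transaction_date,
    }


row = build_transaction("TRNSCT_001", "CUST_00001", "PROD_00001", 3, "2021-01-01")
print(row["transaction_amount"])  # 7.5
```

Keeping the price in the product table and deriving the amount per row means a price only has to be defined once, which is the usual reason for splitting dimension and fact tables.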


In [2]:
import yaml

yaml_string = """
product:
  num_rows: 5
  columns:
    product_id:
      type: generate_unique_id
      prefix: PROD_
      id_start_range: 1
      id_end_range: 100
      id_length: 10
    product_name:
      type: faker
      provider: pystr_format
      provider_args:
        string_format: "?????????"
        letters: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      # Setting seed is not recommended for general use, please consider when to use seed
      faker_seed: 1
    category:
      type: generate_values
      sample_values: ["CATEG1", "CATEG2", "CATEG3"]
      seed: 1
    price:
      type: generate_random_numbers
      start_range: 2
      end_range: 20
      integer: False

customer:
  num_rows: 5
  columns:
    customer_id:
      type: generate_unique_id
      prefix: CUST_
      id_start_range: 1
      id_end_range: 100
      id_length: 10
    name:
      type: faker
      provider: name
      # Setting seed is not recommended for general use, please consider when to use seed
      faker_seed: 1
    address:
      type: faker
      provider: address
      # Setting seed is not recommended for general use, please consider when to use seed
      faker_seed: 1
    membership:
      type: generate_values
      sample_values: ["PLATINUM", "GOLD", "SILVER", "BRONZE"]


transactions:
  num_rows: 50
  columns:
    transaction_id:
      type: generate_unique_id
      prefix: TRNSCT_
      id_start_range: 1
      id_end_range: 500
      id_length: 10
    customer_id:
      type: row_apply
      list_of_values:
      - customer.customer_id
      row_func: "lambda x: x"
      resize: True
      seed: 1
    product_id:
      type: row_apply
      list_of_values: [product.product_id, product.price]
      resize: True
      row_func: "lambda x, y: x"
      seed: 1
    product_price:
      type: row_apply
      list_of_values: [product.product_id, product.price]
      resize: True
      row_func: "lambda x, y: y"
      seed: 1
    quantity:
      type: generate_random_numbers
      start_range: 1
      end_range: 5
      integer: True
    transaction_amount:
      type: row_apply
      list_of_values: [transactions.product_price, transactions.quantity]
      row_func: "lambda x,y: x*y"
    transaction_date:
      type: generate_dates
      start_dt: 2021-01-01
      end_dt: 2021-06-30
      freq: D

"""
config = yaml.safe_load(yaml_string)
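
In the configuration above, `product_id` and `product_price` in `transactions` use the same `list_of_values`, `resize: True`, and a seed of 1. A plausible reading is that the shared seed makes both columns draw the same rows from `product` when its 5 rows are resized to 50, so each transaction's price stays paired with its product. The sketch below illustrates that idea with the standard library only. `resize_rows` is a hypothetical stand-in that samples with replacement, and data_fabricator's actual resizing logic may differ.

```python
import random


def resize_rows(rows, num_rows, seed):
    """Hypothetical stand-in for `resize: True`: sample rows with replacement."""
    rng = random.Random(seed)
    return rng.choices(rows, k=num_rows)


# (product_id, price) pairs; three rows are enough to show the idea.
product_rows = [
    ("PROD_00009", 10.0908),
    ("PROD_00018", 13.7287),
    ("PROD_00033", 16.197),
]

# Two separate draws with the same seed pick the same rows in the same order,
# so a column built from the IDs lines up with a column built from the prices.
ids = [row[0] for row in resize_rows(product_rows, 10, seed=1)]
prices = [row[1] for row in resize_rows(product_rows, 10, seed=1)]

price_by_id = dict(product_rows)
paired = all(price_by_id[i] == p for i, p in zip(ids, prices))
print(paired)  # True
```

Under this assumption, giving the two columns different seeds would sample different rows and could attach the wrong price to a product, which would also make `transaction_amount` inconsistent with the product table.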


The data fabricator configuration will look like:

In [3]:
print(yaml_string)


product:
  num_rows: 5
  columns:
    product_id:
      type: generate_unique_id
      prefix: PROD_
      id_start_range: 1
      id_end_range: 100
      id_length: 10
    product_name:
      type: faker
      provider: pystr_format
      provider_args:
        string_format: "?????????"
        letters: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      # Setting seed is not recommended for general use, please consider when to use seed
      faker_seed: 1
    category:
      type: generate_values
      sample_values: ["CATEG1", "CATEG2", "CATEG3"]
      seed: 1
    price:
      type: generate_random_numbers
      start_range: 2
      end_range: 20
      integer: False

customer:
  num_rows: 5
  columns:
    customer_id:
      type: generate_unique_id
      prefix: CUST_
      id_start_range: 1
      id_end_range: 100
      id_length: 10
    name:
      type: faker
      provider: name
      # Setting seed is not recommended for general use, please consider when to use seed
      faker_seed: 1
    a

The data will look like:

In [4]:
from data_fabricator.v0.core.fabricator import MockDataGenerator
from tabulate import tabulate

# A fixed seed makes the output reproducible; it is not recommended for general use, so set one deliberately
mock_generator = MockDataGenerator(instructions=config, seed=1)
mock_generator.generate_all()

for table_name, table in mock_generator.all_dataframes.items():
    if not table_name.startswith("_"):
        print(f"Table: {table_name}")
        print(tabulate(table, headers=table.columns, tablefmt="psql"))
        print("\n")

Resizing list from 5 to 50
Resizing list from 5 to 50
Resizing list from 5 to 50


Table: product
+----+--------------+----------------+------------+----------+
|    | product_id   | product_name   | category   |    price |
|----+--------------+----------------+------------+----------|
|  0 | PROD_00009   | DWTGMLQUC      | CATEG1     | 10.0908  |
|  1 | PROD_00018   | AVLTALSFY      | CATEG3     | 13.7287  |
|  2 | PROD_00033   | XAAOYJFKA      | CATEG3     | 16.197   |
|  3 | PROD_00073   | FLMGGFLHA      | CATEG1     |  3.68947 |
|  4 | PROD_00098   | VOQEZWDIS      | CATEG2     |  2.51025 |
+----+--------------+----------------+------------+----------+


Table: customer
+----+---------------+------------------+-------------------------------+--------------+
|    | customer_id   | name             | address                       | membership   |
|----+---------------+------------------+-------------------------------+--------------|
|  0 | CUST_00050    | Ryan Gallagher   | 417 Kennedy Isle Apt. 706     | PLATINUM     |
|    |               |                  | La