# Fluke Data Generator

# Fluke Data Generation Documentation

## Introduction

This Python script generates synthetic data for a Fluke product management system. It creates test data for products, customers, and orders, which can be used for development, testing, and demonstration purposes. The data is generated with the Faker library and Python's `random` module, then written to Delta tables in a Databricks Unity Catalog schema.



In [0]:
%pip install faker
%restart_python

%md
# Code Overview

### Data Generation System

The code creates three main tables in a Databricks Unity Catalog:

1. **fluke_products**: Stores product information
2. **fluke_customers**: Contains customer details
3. **fluke_orders**: Records order transactions

### Table Relationships

The tables are related through the following key fields:
- `fluke_orders.customer_id` → `fluke_customers.customer_id`
- `fluke_orders.product_id` → `fluke_products.product_id`

Each foreign key is a many-to-one link from an order to its parent row, so `fluke_orders` effectively relates customers and products many-to-many:
- One customer can place multiple orders
- One product can be ordered multiple times
- Each order must be associated with exactly one customer and one product
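
The relationships above can be checked with a small referential integrity test. The following is a minimal sketch in plain Python, assuming the rows are available as lists of dictionaries; `find_orphan_orders` is a hypothetical helper and the sample IDs are hand-written, not output of the generator.

```python
def find_orphan_orders(orders, customers, products):
    """Return orders whose customer_id or product_id has no matching parent row."""
    customer_ids = {c["customer_id"] for c in customers}
    product_ids = {p["product_id"] for p in products}
    return [
        o for o in orders
        if o["customer_id"] not in customer_ids
        or o["product_id"] not in product_ids
    ]


# Hand-written sample rows (illustrative values only)
customers = [{"customer_id": "CUST-0001"}]
products = [{"product_id": "FLUKE-87V-0001"}]
orders = [
    {"order_id": "ORD-0001", "customer_id": "CUST-0001", "product_id": "FLUKE-87V-0001"},
    {"order_id": "ORD-0002", "customer_id": "CUST-9999", "product_id": "FLUKE-87V-0001"},
]

orphans = find_orphan_orders(orders, customers, products)
print([o["order_id"] for o in orphans])  # ['ORD-0002']
```

Because the generator samples both keys from rows it has already created, this check should return an empty list for generated data.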

### Data Generation Details

#### Products Table
- Generates 1000 product records by default
- Includes product categories specific to Fluke's product line:
  - Multimeters (87V series)
  - Thermal Cameras (Ti series)
  - Clamp Meters (376 series)
  - Power Quality Analyzers (435 series)
  - Oscilloscopes (190 series)
  - Calibration Tools (55 series)
- Each product has a unique ID, price, warranty period, and status
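
One caveat on uniqueness: the generator code builds each ID from the first 8 hexadecimal characters of a UUID4, which leaves 16^8 possible values, so duplicates are unlikely but not impossible. The sketch below estimates that risk with the standard birthday-bound approximation; `collision_probability` is a hypothetical helper, not part of the script.

```python
import math


def collision_probability(num_ids, hex_chars=8):
    """Approximate chance that at least two of num_ids random IDs collide."""
    space = 16 ** hex_chars  # number of distinct truncated values
    exponent = num_ids * (num_ids - 1) / (2 * space)
    return -math.expm1(-exponent)  # 1 - exp(-exponent), computed stably


# Default record counts documented for each table
for label, n in [("products", 1000), ("customers", 500), ("orders", 2000)]:
    print(f"{label}: {collision_probability(n):.2e}")
```

At the default sizes the estimate is on the order of 1 in 10,000 for products, which is usually acceptable for test data. If the record counts grow substantially, using more of the UUID would reduce the risk.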

#### Customers Table
- Generates 500 customer records by default
- Includes various customer types:
  - Industrial
  - Commercial
  - Educational
  - Government
  - Retail
- Contains complete contact information and registration details

#### Orders Table
- Generates 2000 order records by default
- Links products and customers
- Includes order details such as:
  - Quantity
  - Unit price
  - Total amount
  - Payment status
  - Shipping status

### Technical Implementation

The code uses several key technologies:
- **PySpark**: For data processing and table creation
- **Faker**: For generating realistic synthetic data
- **Delta Lake**: For storing the data in a reliable format
- **Unity Catalog**: For managing the data catalog

### Data Generation Process

1. The script first generates the base data (products and customers)
2. Then creates orders that reference the generated products and customers
3. Finally, writes the three Delta tables to the specified Unity Catalog schema in `overwrite` mode, replacing any existing tables with the same names
4. Every table includes a `last_updated` timestamp, set at generation time, as an audit field
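
The ordering in steps 1 and 2 matters because orders take their foreign keys from rows that already exist. The following is a simplified sketch of that parent-first pattern using only the standard library; the helper names, sequential IDs, and fixed seed are assumptions for illustration and differ from the script's Faker- and UUID-based rows.

```python
import random


def make_parent_rows(prefix, count):
    """Build placeholder parent rows with sequential IDs (illustrative only)."""
    return [{"id": f"{prefix}-{i:04d}"} for i in range(count)]


def make_order_rows(count, products, customers, rng):
    """Build orders that sample their foreign keys from existing parent rows."""
    if not products or not customers:
        raise ValueError("Generate products and customers before orders")
    return [
        {
            "order_id": f"ORD-{i:04d}",
            "product_id": rng.choice(products)["id"],
            "customer_id": rng.choice(customers)["id"],
        }
        for i in range(count)
    ]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
products = make_parent_rows("FLUKE", 5)   # step 1: parent rows first
customers = make_parent_rows("CUST", 3)
orders = make_order_rows(10, products, customers, rng)  # step 2: child rows

print(len(orders), "orders generated")
```

Sampling from the parent lists guarantees that every order points at a real product and customer, at the cost of requiring the parents to be held in memory while orders are built.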

### Usage

To use this script:
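1. Ensure you have access to a Databricks environment with Unity Catalog, and that the target catalog and schema already exist (the script does not create them)
2. **Set `catalog_name` and `schema_name` in the notebook widgets** (the defaults are `default_catalog` and `default_schema`)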
3. Run the script to generate and populate the tables
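
The widget values are interpolated directly into three-part table names such as `catalog.schema.fluke_products`. As a hedged sketch, the values could be validated before use; `qualified_table_name` is a hypothetical helper, and the identifier pattern is a deliberately conservative assumption rather than the full set of names Unity Catalog accepts.

```python
import re

# Conservative assumption: letters, digits, and underscores only
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualified_table_name(catalog_name, schema_name, table_name):
    """Build catalog.schema.table, rejecting parts with unexpected characters."""
    for part in (catalog_name, schema_name, table_name):
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"Unexpected identifier: {part!r}")
    return f"{catalog_name}.{schema_name}.{table_name}"


print(qualified_table_name("default_catalog", "default_schema", "fluke_products"))
# default_catalog.default_schema.fluke_products
```

Rejecting unexpected characters early gives a clearer error than a failed table write, for example when a widget value contains a stray space.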

The generated data can be used for:
- Testing data pipelines
- Developing analytics dashboards
- Training machine learning models
- Demonstrating system capabilities 

In [0]:
from faker import Faker
import random
from datetime import datetime, timedelta
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType, TimestampType
from pyspark.sql import SparkSession  # needed by main()
from pyspark.sql.functions import lit, col
import logging
import uuid

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize Faker
fake = Faker()

# Constants
PRODUCT_CATEGORIES = [
    "Multimeters", "Thermal Cameras", "Clamp Meters", 
    "Power Quality Analyzers", "Oscilloscopes", "Calibration Tools"
]

PRODUCT_STATUS = ["In Stock", "Low Stock", "Out of Stock", "Discontinued"]

CUSTOMER_TYPES = ["Industrial", "Commercial", "Educational", "Government", "Retail"]

# Add dbutils widgets
dbutils.widgets.text("catalog_name", "default_catalog", "Catalog Name")
dbutils.widgets.text("schema_name", "default_schema", "Schema Name")

def generate_products_data(num_records=1000):
    """Generate product data"""
    products = []
    # Model prefix used in the product ID for each category
    model_prefixes = {
        "Multimeters": "87V",
        "Thermal Cameras": "Ti",
        "Clamp Meters": "376",
        "Power Quality Analyzers": "435",
        "Oscilloscopes": "190",
        "Calibration Tools": "55"
    }
    for _ in range(num_records):
        category = random.choice(PRODUCT_CATEGORIES)
        model_prefix = model_prefixes[category]
        
        # Use UUID for unique product IDs
        unique_id = str(uuid.uuid4())[:8]
        products.append({
            "product_id": f"FLUKE-{model_prefix}-{unique_id}",
            "product_name": f"{category} {random.randint(100, 999)}",
            "category": category,
            "description": fake.text(max_nb_chars=200),
            "price": round(random.uniform(100, 10000), 2),
            "warranty_period": random.randint(1, 5),
            "status": random.choice(PRODUCT_STATUS),
            "release_date": fake.date_between(start_date='-5y', end_date='today'),
            "last_updated": datetime.now()
        })
    return products

def generate_customers_data(num_records=500):
    """Generate customer data"""
    customers = []
    for i in range(num_records):
        # Use UUID for unique customer IDs
        unique_id = str(uuid.uuid4())[:8]
        customers.append({
            "customer_id": f"CUST-{unique_id}",
            "company_name": fake.company(),
            "customer_type": random.choice(CUSTOMER_TYPES),
            "contact_name": fake.name(),
            "email": fake.email(),
            "phone": fake.phone_number(),
            "address": fake.address(),
            "country": fake.country(),
            "registration_date": fake.date_between(start_date='-3y', end_date='today'),
            "last_updated": datetime.now()
        })
    return customers

def generate_orders_data(num_records=2000, products=None, customers=None):
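    """Generate order data that references existing products and customers"""
    # Orders need parent rows to sample from; fail early with a clear message
    if not products or not customers:
        raise ValueError("products and customers must be non-empty lists")
    orders = []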
    for i in range(num_records):
        product = random.choice(products)
        customer = random.choice(customers)
        quantity = random.randint(1, 10)
        order_date = fake.date_between(start_date='-2y', end_date='today')
        
        # Use UUID for unique order IDs
        unique_id = str(uuid.uuid4())[:8]
        orders.append({
            "order_id": f"ORD-{unique_id}",
            "customer_id": customer["customer_id"],
            "product_id": product["product_id"],
            "order_date": order_date,
            "quantity": quantity,
            "unit_price": product["price"],
            "total_amount": round(quantity * product["price"], 2),
            "payment_status": random.choice(["Paid", "Pending", "Cancelled"]),
            "shipping_status": random.choice(["Shipped", "Processing", "Delivered"]),
            "last_updated": datetime.now()
        })
    return orders

def create_tables(spark, catalog_name, schema_name):
    """Create and populate the tables in Unity Catalog"""
    logger.info("Generating data...")
    
    # Generate data
    products = generate_products_data()
    customers = generate_customers_data()
    orders = generate_orders_data(products=products, customers=customers)
    
    # Define schemas
    product_schema = StructType([
        StructField("product_id", StringType(), False),
        StructField("product_name", StringType(), False),
        StructField("category", StringType(), False),
        StructField("description", StringType(), True),
        StructField("price", DoubleType(), False),
        StructField("warranty_period", IntegerType(), False),
        StructField("status", StringType(), False),
        StructField("release_date", DateType(), False),
        StructField("last_updated", TimestampType(), False)
    ])
    
    customer_schema = StructType([
        StructField("customer_id", StringType(), False),
        StructField("company_name", StringType(), False),
        StructField("customer_type", StringType(), False),
        StructField("contact_name", StringType(), False),
        StructField("email", StringType(), False),
        StructField("phone", StringType(), False),
        StructField("address", StringType(), False),
        StructField("country", StringType(), False),
        StructField("registration_date", DateType(), False),
        StructField("last_updated", TimestampType(), False)
    ])
    
    order_schema = StructType([
        StructField("order_id", StringType(), False),
        StructField("customer_id", StringType(), False),
        StructField("product_id", StringType(), False),
        StructField("order_date", DateType(), False),
        StructField("quantity", IntegerType(), False),
        StructField("unit_price", DoubleType(), False),
        StructField("total_amount", DoubleType(), False),
        StructField("payment_status", StringType(), False),
        StructField("shipping_status", StringType(), False),
        StructField("last_updated", TimestampType(), False)
    ])
    
    # Create DataFrames
    products_df = spark.createDataFrame(products, product_schema)
    customers_df = spark.createDataFrame(customers, customer_schema)
    orders_df = spark.createDataFrame(orders, order_schema)
    
    # Write to Delta tables
    logger.info("Writing to Delta tables...")
    
    # Products table
    products_df.write.mode("overwrite").option(
        "overwriteSchema", "true"
    ).saveAsTable(f"{catalog_name}.{schema_name}.fluke_products")
    
    # Customers table
    customers_df.write.mode("overwrite").option(
        "overwriteSchema", "true"
    ).saveAsTable(f"{catalog_name}.{schema_name}.fluke_customers")
    
    # Orders table
    orders_df.write.mode("overwrite").option(
        "overwriteSchema", "true"
    ).saveAsTable(f"{catalog_name}.{schema_name}.fluke_orders")
    
    logger.info("Data generation and table creation completed successfully!")

def main():
    """Main function to run the data generation"""
    # Initialize Spark session
    spark = SparkSession.builder.getOrCreate()
    
    # Get values from widgets
    catalog_name = dbutils.widgets.get("catalog_name")
    schema_name = dbutils.widgets.get("schema_name")
    
    # Create tables
    create_tables(spark, catalog_name, schema_name)

if __name__ == "__main__":
    main()