# Create Sensor Bronze Table with Line Data Generator

Requires a classic compute cluster with DBR >= 16.4 LTS, or serverless compute with environment version >= 4.

##### Install required libraries

In [0]:
# Install custom data generator library
%pip install -r ./line_data_generator/requirements.txt
%pip install ./line_data_generator

In [0]:
%run ./0-Parameters

### Step 1: Create Delta Table in Unity Catalog

##### Provide required information using the widget at the top of the notebook

- Your Unity Catalog table name
  - Identify the target table you want to ingest data into.

In [0]:
spark.sql(f"""
CREATE OR REPLACE TABLE {BRONZE_TABLE} (
    sensor_rotation DOUBLE,
    sensor_flow DOUBLE,
    sensor_temperature DOUBLE,
    sensor_speed DOUBLE,
    sensor_vibration DOUBLE,
    sensor_pressure DOUBLE,
    component_yield_output DOUBLE,
    timestamp STRING,
    component_id STRING,
    damaged_component BOOLEAN,
    abnormal_sensor STRING,
    machine_id STRING,
    line_id STRING
)
""")

### Step 2: Generate sensor data 

We will use a data generator to simulate data coming from a complex ball bearing production system, organized as follows.

The core elements of this production system are:
- Production Line
  - Machine
    - Component
      - Sensor

<br>
<img src="./images/ball-bearing-diagram.png" width="600"/>



**How does the data generator work?**
<br>
The data generator produces data matching the production line setup you defined in the Digital Twin frontend. The following variables are used:

- Number of lines
- Number of machines per line: each line can have a different number of machines.
- Number of components per machine: each machine has the same number of components.
- Each component has 6 different sensors (fixed) generating data:
  - Temperature
  - Pressure
  - Vibration
  - Speed
  - Rotation
  - Flow
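
The hierarchy described above can be pictured as a nested mapping. The following is a minimal sketch in plain Python. `build_hierarchy_sketch` and the ID formats (`line_1`, `line_1_machine_1`, ...) are hypothetical and used only for illustration; this is not the `generate_equipment_mapping` function from `line_data_generator`, whose actual output structure may differ.

```python
# The six fixed sensors attached to every component (taken from the list above).
SENSORS = ["temperature", "pressure", "vibration", "speed", "rotation", "flow"]


def build_hierarchy_sketch(num_lines, machines_per_line, num_components):
    """Illustrative only: nest lines -> machines -> components -> sensors."""
    if len(machines_per_line) != num_lines:
        raise ValueError("machines_per_line needs exactly one entry per line")

    hierarchy = {}
    for line_idx in range(1, num_lines + 1):
        line_id = f"line_{line_idx}"  # hypothetical ID format
        machines = {}
        for machine_idx in range(1, machines_per_line[line_idx - 1] + 1):
            machine_id = f"{line_id}_machine_{machine_idx}"
            # Every machine gets the same number of components,
            # and every component gets the same six sensors.
            machines[machine_id] = {
                f"{machine_id}_component_{c}": list(SENSORS)
                for c in range(1, num_components + 1)
            }
        hierarchy[line_id] = machines
    return hierarchy


hierarchy = build_hierarchy_sketch(4, [3, 3, 4, 2], 3)
total_components = sum(
    len(components)
    for machines in hierarchy.values()
    for components in machines.values()
)
print(total_components)  # 36
```

With 4 lines holding 3, 3, 4, and 2 machines and 3 components per machine, the sketch yields 12 machines and 36 components.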

In [0]:
import pandas as pd
import time
from line_data_generator import generate_all_lines, generate_equipment_mapping, table_size_estimator

In [0]:
# Define Production Line configuration -- TODO: Read from config file
num_lines = 4
machines_per_line = [3, 3, 4, 2]  # Number of machines per line
num_components = 3  

<br>You can adjust the size of the generated dataset by setting the sample_size parameter.

For instance, if you choose a sample size of 1000, the data generator will produce a dataset with these characteristics:

- Each row represents a specific component at a specific timestamp, with component_id and timestamp serving as unique identifiers.
- Each component will have 1000 rows, corresponding to the sample size.
- Consecutive rows for a given component are spaced 1 millisecond apart. Thus, for a sample size of 1000, the total duration covered per component is 1 second (1000 * 0.001).
- The total number of rows in the dataset depends on the number of components in your production line. For example, with 36 components (4 lines, 12 machines, 3 components per machine), the dataset will have 36,000 rows (36 * 1000).
- The overall time span for the dataset remains 1 second (1000 * 0.001), meaning that sensor data for different components is generated in parallel at the same timestamps.
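
The arithmetic in the list above can be checked with a few lines of plain Python. This is a sketch under the assumptions stated in the text (one row per component per timestamp, 1 millisecond between consecutive rows); `dataset_shape_sketch` is a hypothetical helper, not the library's `table_size_estimator`, and it says nothing about the estimated size in MB.

```python
def dataset_shape_sketch(machines_per_line, num_components, sample_size, step_ms=1):
    """Return (total components, total rows, time span in seconds)."""
    total_components = sum(machines_per_line) * num_components
    # Each component contributes sample_size rows.
    total_rows = total_components * sample_size
    # All components share the same timestamps, so the span
    # depends only on the sample size, not on the component count.
    time_span_seconds = sample_size * step_ms / 1000
    return total_components, total_rows, time_span_seconds


components, rows, span = dataset_shape_sketch([3, 3, 4, 2], 3, 1000)
print(components, rows, span)  # 36 36000 1.0
```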

In [0]:
# Define sample size
sample_size = 1000

With this configuration you will generate a dataset with the following size:

In [0]:
# Generate equipment mapping: lines -> machine -> components -> sensors 
equipment_mapping = generate_equipment_mapping(num_lines, machines_per_line, num_components)

# Estimate table size
tot_num_rows, est_table_size, line_num_rows, est_line_table_size = table_size_estimator(machines_per_line, num_components, sample_size)

print(f"Number of rows: {tot_num_rows}")
print(f"Estimated table size: {est_table_size:.2f} MB")

In [0]:
# Run data generator and display the first 10 rows
batch_df_lines = generate_all_lines(equipment_mapping, sample_size, time.time())
display(batch_df_lines.head(10))

In [0]:
## Write generated data into delta table in UC
spark.createDataFrame(batch_df_lines).write.mode("append").saveAsTable(BRONZE_TABLE)

In [0]:
display(spark.table(BRONZE_TABLE).count())

---
### _STOP_ - If you want to use Zerobus Ingest to ingest data, go directly to notebook "2-Ingest-Data-Zerobus"
---
### Optional: Simulate Continuous Data Ingestion

In this section, we will simulate continuous data ingestion. To avoid creating a large dataset in a single data generation job, we will run multiple ingestion batches. Each batch is generated separately and appended to the Delta table one at a time.
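
Because consecutive rows are spaced 1 millisecond apart, a batch of `sample_size` rows per component covers `sample_size * 0.001` seconds, and the next batch should start after that window to keep timestamps from overlapping. The following is a minimal sketch of that timing rule; `next_batch_start`, `seconds_to_wait`, and `margin_seconds` are hypothetical names chosen for illustration, and fixed clock values stand in for `time.time()`.

```python
def next_batch_start(prev_start, sample_size, step_ms=1, margin_seconds=0.0):
    """Earliest start time (in seconds) that avoids overlapping the previous batch."""
    batch_span_seconds = sample_size * step_ms / 1000.0
    return prev_start + batch_span_seconds + margin_seconds


def seconds_to_wait(now, prev_start, sample_size, step_ms=1, margin_seconds=0.0):
    """How long to pause before starting the next batch; never negative."""
    earliest = next_batch_start(prev_start, sample_size, step_ms, margin_seconds)
    return max(0.0, earliest - now)


# A batch of 10000 rows per component started at t=100.0 covers 100.0 to 110.0.
print(seconds_to_wait(now=103.0, prev_start=100.0, sample_size=10000))  # 7.0
print(seconds_to_wait(now=112.0, prev_start=100.0, sample_size=10000))  # 0.0
```

A safety margin can be added through `margin_seconds` if the clock or the write latency is not fully trusted.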

In [0]:
### Define batch configuration

sample_size = 10000    # rows (readings) per component in each batch
batch_count = 5       # number of writes (int)

With this configuration you will generate a dataset with the following size:

In [0]:
# Calculate total number of rows to be generated 
tot_num_rows, est_table_size, line_num_rows, est_line_table_size = table_size_estimator(machines_per_line, num_components, sample_size)

print(f"Number of rows in each batch: {tot_num_rows}")
print(f"Estimated table size in each batch: {est_table_size:.2f} MB")
print(f"Number of rows for the total dataset: {tot_num_rows * batch_count}")

Run the data generator loop

In [0]:
# Generate data in batches and append each batch to the bronze table

min_batch_wait = sample_size * 0.001   # minimum seconds between batch starts to avoid overlapping timestamps across batches

for i in range(int(batch_count)):

  current_time = time.time()

  # batch_time is set at the end of the previous iteration, so it is only read when i > 0
  if i > 0 and current_time <= batch_time + min_batch_wait:
    wait = int(batch_time + min_batch_wait - current_time + 10) # add 10 seconds just in case
    # Announce the pause before sleeping so the message is visible while waiting
    print(f"Pausing {wait} seconds to avoid overlapping timestamps across batches")
    time.sleep(wait)

  print(f"--- Generating batch {i + 1} / {int(batch_count)} ---")

  batch_time = time.time()
  batch = generate_all_lines(equipment_mapping, sample_size, batch_time)

  spark.createDataFrame(batch).write.mode("append").saveAsTable(BRONZE_TABLE)

In [0]:
display(spark.table(BRONZE_TABLE).count())