## Understanding Eager Batching on Hailo Devices with DeGirum PySDK

This guide demonstrates how **eager batching** affects inference performance on Hailo-8 and Hailo-8L devices using DeGirum PySDK. **Eager batching** queues multiple inference requests and sends them to the accelerator as a batch. This can improve inference throughput, **but only for multi-context models**. The guide runs live benchmarks that compare a single-context model and a multi-context model at two batch sizes.
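To make the queueing idea concrete, the sketch below groups pending requests into batches of at most `eager_batch_size`. It is a toy model, not PySDK's scheduler: the helper names are hypothetical, and we assume a batch is formed from whatever requests are already waiting, so a short queue produces a smaller batch rather than a stall.

```python
from collections import deque
from typing import Deque, List


def drain_batch(queue: Deque[int], eager_batch_size: int) -> List[int]:
    """Pop up to eager_batch_size pending requests from the front of the queue.

    Toy model only (an assumption, not PySDK internals): the batch is built
    from requests already waiting, so a short queue yields a smaller batch.
    """
    if eager_batch_size < 1:
        raise ValueError("eager_batch_size must be >= 1")
    batch = []
    while queue and len(batch) < eager_batch_size:
        batch.append(queue.popleft())
    return batch


def split_into_batches(requests: List[int], eager_batch_size: int) -> List[List[int]]:
    """Dispatch every queued request as a sequence of batches, preserving order."""
    queue = deque(requests)
    batches = []
    while queue:
        batches.append(drain_batch(queue, eager_batch_size))
    return batches


if __name__ == "__main__":
    frames = list(range(20))  # 20 hypothetical queued frame IDs
    for size in (1, 8):
        batches = split_into_batches(frames, size)
        print(f"eager_batch_size={size}: {len(batches)} dispatches, "
              f"sizes={[len(b) for b in batches]}")
```

With 20 queued frames, a batch size of 8 yields three dispatches (8, 8, and 4 frames) instead of 20 single-frame dispatches. Fewer dispatches only pay off when each dispatch carries a fixed overhead, which is what context switching adds for multi-context models.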


### Understanding Context Size

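In Hailo's architecture, **context size** refers to the number of partitions (contexts) a compiled model is split into. A context is the portion of the network that is loaded onto the accelerator and executed at one time; when a model does not fit on the chip all at once, the device switches between its contexts at runtime:

- **Single-context models**: Fit entirely on the device, so no context switching occurs. Batching does not improve throughput because the model already keeps the accelerator busy.
- **Multi-context models**: Too large to fit at once, so the Hailo runtime executes them one context at a time and reconfigures the device between contexts. Each switch costs time; batching lets every context process several frames before the next switch, spreading that cost over more frames, which **can significantly improve FPS**.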
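The context-switching argument can be checked with a back-of-the-envelope cost model. The functions and parameters below are hypothetical: they assume switching costs a fixed `switch_ms` per context, that every frame in a batch passes through a context before the next one is loaded, and that per-frame compute does not depend on batch size. The numbers are illustrative, not measurements of any Hailo device.

```python
def toy_batch_time_ms(num_contexts: int, batch_size: int,
                      compute_ms_per_frame: float, switch_ms: float) -> float:
    """Time to push one batch through every context of a model.

    Hypothetical cost model (not Hailo measurements):
    - compute_ms_per_frame is the total compute per frame across all contexts;
    - a single-context model stays resident, so it never pays switch_ms;
    - a multi-context model pays switch_ms once per context per batch.
    """
    if num_contexts < 1 or batch_size < 1:
        raise ValueError("num_contexts and batch_size must be >= 1")
    switches = num_contexts if num_contexts > 1 else 0
    return switches * switch_ms + batch_size * compute_ms_per_frame


def toy_fps(num_contexts: int, batch_size: int,
            compute_ms_per_frame: float, switch_ms: float) -> float:
    """Frames per second implied by toy_batch_time_ms."""
    batch_ms = toy_batch_time_ms(num_contexts, batch_size, compute_ms_per_frame, switch_ms)
    return batch_size * 1000.0 / batch_ms


if __name__ == "__main__":
    # Hypothetical parameters chosen only to illustrate the trend.
    for label, contexts, compute_ms in [("single-context", 1, 5.0), ("multi-context", 3, 10.0)]:
        for bs in (1, 8):
            fps = toy_fps(contexts, bs, compute_ms, switch_ms=5.0)
            print(f"{label:14s} batch={bs}: {fps:6.1f} FPS")
```

Under these assumptions the single-context FPS is identical at both batch sizes, while the multi-context FPS rises because the three switches are shared by eight frames instead of one. `toy_batch_time_ms` also grows with batch size, which hints at the latency cost of batching.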


## What This Guide Demonstrates

We run two models:
- A **YOLOv8n object detection model** (context = 1)
- A **YOLOv8m object detection model** (context > 1)

Each is evaluated under:
- `eager_batch_size = 1` (no batching)
- `eager_batch_size = 8` (batched)

We measure **frames per second (FPS)** with `degirum_tools.model_time_profile` and plot the results with Matplotlib.
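Observed FPS is the number of completed frames divided by wall-clock time. The hypothetical helper below sketches that measurement for any streaming inference function that takes an iterator of frames and yields results; it is not the implementation of `degirum_tools.model_time_profile`. Streaming matters here: a loop that waits for each result before submitting the next frame keeps at most one request in flight, leaving nothing to group into a batch.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream_fps(run_stream: Callable[[Iterator], Iterable], frames: Iterable) -> float:
    """Return observed FPS for a streaming inference function.

    `run_stream` receives an iterator of frames and yields one result per
    frame, so several requests can be in flight at once. Observed FPS is
    completed results divided by elapsed wall-clock seconds.
    """
    frames = list(frames)
    if not frames:
        raise ValueError("frames must not be empty")
    start = time.perf_counter()
    completed = sum(1 for _ in run_stream(iter(frames)))  # drain all results
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("timer resolution too coarse; use more frames")
    return completed / elapsed
```

With PySDK, `run_stream` would presumably be `model.predict_batch`, which accepts an iterator of frames and yields results; treat that pairing as an assumption to check against the PySDK documentation.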


## Installation & imports

In [None]:
!pip install -U degirum degirum_tools matplotlib

In [None]:
import degirum as dg
import degirum_tools

## Set benchmark parameters

In [None]:
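iterations = 1000  # number of inference calls per benchmark run
model_zoo = 'degirum/hailo'  # DeGirum AI Hub model zoo with Hailo models
inference_host_address = '@local'  # run inference on the locally attached Hailo device
single_context_model = 'yolov8n_coco--640x640_quant_hailort_multidevice_1'  # single-context model
multi_context_model = 'yolov8m_coco--640x640_quant_hailort_multidevice_1'  # multi-context model
token = ''  # AI Hub token; may be needed to download models from the cloud zoo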

## Eager batching on single context model

#### batch size = 1

In [None]:
# Load the model
model = dg.load_model(
    model_name=single_context_model,
    inference_host_address=inference_host_address,
    zoo_url=model_zoo,
    token=token,
    eager_batch_size=1,  # Set eager batch size to 1 for single context model
)

# Turn off C++-based post-processing so the benchmark reflects inference throughput
# (this does not affect models that use a Python 'PythonFile' postprocessor)
model.output_postprocess_type = "None"

single_context_bs_1_results = degirum_tools.model_time_profile(model, iterations)
print(f"Running with batch size {model._model_parameters.EagerBatchSize}, observed FPS: {single_context_bs_1_results.observed_fps:5.2f}")

#### batch size 8

In [None]:
# Load the model
model = dg.load_model(
    model_name=single_context_model,
    inference_host_address=inference_host_address,
    zoo_url=model_zoo,
    token=token,
    eager_batch_size=8,  # Set eager batch size to 8 for single context model
)

# Turn off C++-based post-processing so the benchmark reflects inference throughput
# (this does not affect models that use a Python 'PythonFile' postprocessor)
model.output_postprocess_type = "None"

single_context_bs_8_results = degirum_tools.model_time_profile(model, iterations)
print(f"Running with batch size {model._model_parameters.EagerBatchSize}, observed FPS: {single_context_bs_8_results.observed_fps:5.2f}")

### Visualise the results for single context model

In [None]:
import matplotlib.pyplot as plt


# Labels and data
labels = ['FPS - bs 1', 'FPS - bs 8']
values = [single_context_bs_1_results.observed_fps, single_context_bs_8_results.observed_fps]

# Plotting
plt.bar(labels, values, color=['blue', 'orange'])

# Customizing
plt.ylabel('FPS')
plt.title('Impact of batch size on single context model')

# Show plot
plt.show()
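Bar heights alone make it hard to tell a real gain from run-to-run noise. The small helper below (hypothetical, with an assumed 5% noise band) reports the speedup and whether it falls outside that band.

```python
from typing import Tuple


def batching_gain(fps_bs1: float, fps_batched: float, tolerance: float = 0.05) -> Tuple[float, bool]:
    """Return (speedup, is_meaningful) for a batched vs. unbatched FPS pair.

    `tolerance` is a hypothetical noise band: speedups within +/- tolerance
    of 1.0 are treated as no real change (run-to-run jitter).
    """
    if fps_bs1 <= 0:
        raise ValueError("fps_bs1 must be positive")
    speedup = fps_batched / fps_bs1
    return speedup, abs(speedup - 1.0) > tolerance
```

For example, call `batching_gain(single_context_bs_1_results.observed_fps, single_context_bs_8_results.observed_fps)`; for the single-context model, the guide's reasoning predicts a speedup close to 1.0.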


## Eager batching on multi context model

#### batch size 1

In [None]:
# Load the model
model_a = dg.load_model(
    model_name=multi_context_model,
    inference_host_address=inference_host_address,
    zoo_url=model_zoo,
    token=token,
    eager_batch_size=1,  # Set eager batch size to 1 for multi context model
)

# Turn off C++-based post-processing so the benchmark reflects inference throughput
# (this does not affect models that use a Python 'PythonFile' postprocessor)
model_a.output_postprocess_type = "None"

multi_context_bs_1_results = degirum_tools.model_time_profile(model_a, iterations)
print(f"Running with batch size {model_a._model_parameters.EagerBatchSize}, observed FPS: {multi_context_bs_1_results.observed_fps:5.2f}")

#### batch size 8

In [None]:
# Load the model
model_b = dg.load_model(
    model_name=multi_context_model,
    inference_host_address=inference_host_address,
    zoo_url=model_zoo,
    token=token,
    eager_batch_size=8,  # Set eager batch size to 8 for multi context model
)

# Turn off C++-based post-processing so the benchmark reflects inference throughput
# (this does not affect models that use a Python 'PythonFile' postprocessor)
model_b.output_postprocess_type = "None"

multi_context_bs_8_results = degirum_tools.model_time_profile(model_b, iterations)
print(f"Running with batch size {model_b._model_parameters.EagerBatchSize}, observed FPS: {multi_context_bs_8_results.observed_fps:5.2f}")

### Visualise the results for multi context model

In [None]:
import matplotlib.pyplot as plt

# Labels and data
labels = ['FPS - bs 1', 'FPS - bs 8']
values = [multi_context_bs_1_results.observed_fps, multi_context_bs_8_results.observed_fps]

# Plotting
plt.bar(labels, values, color=['blue', 'orange'])

# Customizing
plt.ylabel('FPS')
plt.title('Impact of batch size on multi-context model')

# Show plot
plt.show()
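The four benchmark cells differ only in the model name and batch size, so the same experiment can be written as one loop. In this sketch `sweep_fps` is a hypothetical helper; the loader and profiler are passed in as arguments so the loop itself can be exercised without a Hailo device. It assumes the profiler returns an object with an `observed_fps` attribute, as `degirum_tools.model_time_profile` does in the cells above.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def sweep_fps(
    load_model: Callable[..., Any],
    profile: Callable[[Any, int], Any],
    model_names: Iterable[str],
    batch_sizes: Iterable[int],
    iterations: int,
    **load_kwargs: Any,
) -> Dict[Tuple[str, int], float]:
    """Profile every (model name, eager batch size) pair and collect observed FPS."""
    batch_sizes = list(batch_sizes)  # reused for every model, so materialize once
    results: Dict[Tuple[str, int], float] = {}
    for name in model_names:
        for bs in batch_sizes:
            # Load a fresh model per configuration, as the cells above do
            model = load_model(model_name=name, eager_batch_size=bs, **load_kwargs)
            model.output_postprocess_type = "None"  # same setting as the cells above
            results[(name, bs)] = profile(model, iterations).observed_fps
    return results
```

In the notebook, the call would look like `sweep_fps(dg.load_model, degirum_tools.model_time_profile, [single_context_model, multi_context_model], [1, 8], iterations, inference_host_address=inference_host_address, zoo_url=model_zoo, token=token)`.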

## Conclusion

Eager batching can significantly improve inference throughput on Hailo devices, but only for models large enough to require multiple contexts. For smaller, single-context models that fit entirely on the accelerator, batching adds no throughput benefit and may introduce unnecessary latency. By checking how many contexts a model uses and measuring FPS at different batch sizes, you can choose the batch size that best balances throughput and latency for your use case.
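One way to act on these measurements is to prefer the smallest batch size that delivers nearly the best FPS, since larger batches can add latency. The sketch below assumes you have already collected FPS per batch size (for example, from runs like the ones above); the function name and the 5% tolerance are hypothetical choices.

```python
from typing import Dict


def pick_batch_size(fps_by_batch: Dict[int, float], tolerance: float = 0.05) -> int:
    """Return the smallest batch size whose FPS is within `tolerance` of the best.

    Smaller batches are preferred because batching can add latency;
    `tolerance` (default 5%) is a hypothetical threshold, not a recommendation.
    """
    if not fps_by_batch:
        raise ValueError("fps_by_batch must not be empty")
    if not 0.0 <= tolerance < 1.0:
        raise ValueError("tolerance must be in [0, 1)")
    best = max(fps_by_batch.values())
    threshold = best * (1.0 - tolerance)
    # The best batch size always meets the threshold, so this is never empty
    return min(bs for bs, fps in fps_by_batch.items() if fps >= threshold)
```

If batching barely helps, as the guide expects for single-context models, this returns 1; if FPS keeps climbing with batch size, it returns the point where further batching stops paying off.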
