### OpenVINO Utilities Notebook

Various utilities live in this notebook to help users of OpenArc understand the properties of their devices. Understanding the available data types, quantization strategies, and optimization techniques is only one part of learning to use OpenVINO on different kinds of hardware.

OpenArc handles some of the work of serving inference but is opinionated in its approach. It doesn't hold your hand the way more plug-and-play, entry-level Intel applications like [Intel AI Playground](https://github.com/intel/AI-Playground) do.

Thanks again for checking out my project. 

## Introduction to working with Intel Devices

This document discusses lessons learned from months of working with Intel GPU devices; *hours* of blood, sweat, and tears went into setting up this project and it's a good place to share what I've learned. At this stage in the Intel AI stack it seems like a necessary contribution to the community.


### What is OpenVINO?

OpenVINO is an inference backend for *accelerating* inference deployments of machine learning models on Intel hardware. The documentation can be hard to follow: the Intel AI stack has many staff engineers and contributors working across the open source ecosystem, and much of the stack is evolving without the massive community contributions we have seen with llama.cpp.

Many factors have contributed to the decline of Intel's dominance and popularity in the hardware space over the past few years. Even so, Intel contributes extensively to open source AI and ML, and has done so since before [Attention Is All You Need](https://arxiv.org/abs/1706.03762). AI didn't start in 2017, but the demand for faster inference on existing infrastructure has never been higher. Plus, Arc chips are cheap but come with a steep learning curve. Sure, you can settle for Vulkan... but you aren't here to download a GGUF and send it.



### OpenVINO Utilities

Check out the [Guide to the OpenVINO IR] and then use my [Command Line Tool tool](https://huggingface.co/spaces/Echo9Zulu/Optimum-CLI-Tool_tool) to perform conversion. There are default approaches that "work", but to really leverage available compute you have to dig deeper and convert models yourself.

## Diagnostic: Device Query


Recommended uses for this diagnostic:

- Troubleshooting driver issues
- Checking device access permissions
- Testing hardware access from containers
- Verifying Python path visibility
- Confirming proper environment variable configuration
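
The checks in that list can be partly scripted. Below is a minimal sketch, assuming a Linux host where Intel GPUs appear as DRM render nodes under `/dev/dri`. The helper name `environment_report` is hypothetical, and the environment variables it reads (`ONEAPI_DEVICE_SELECTOR`, `ZE_AFFINITY_MASK`, `LD_LIBRARY_PATH`) are only examples of settings that commonly affect device visibility; swap in whichever ones matter for your setup.

```python
import glob
import os
import sys


def environment_report(env_vars=("ONEAPI_DEVICE_SELECTOR", "ZE_AFFINITY_MASK", "LD_LIBRARY_PATH")):
    """Collect basic facts about interpreter, device nodes and environment."""
    # Assumption: GPUs show up as /dev/dri/renderD* (typical on Linux).
    render_nodes = sorted(glob.glob("/dev/dri/renderD*"))
    return {
        # Which interpreter is actually running (conda env, venv, container).
        "python_executable": sys.executable,
        # True means the current user can read and write the device node.
        "render_nodes": {node: os.access(node, os.R_OK | os.W_OK) for node in render_nodes},
        # None means the variable is not set in this process.
        "env": {name: os.environ.get(name) for name in env_vars},
    }


report = environment_report()
for section, value in report.items():
    print(f"{section}: {value}")
```

An empty `render_nodes` entry inside a container usually means the device was not passed through; a `False` value points at a permissions problem rather than a driver problem.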

#### Example use cases:

1. Evaluating conflicting dependencies
    - With careful dependency management you can control hardware across the Intel AI stack.
    - However 


2. Say you need to have PyTorch, IPEX and OpenVINO in one conda env.
    - This test alongside an XPU device query creates useful diagnostic information.


In [None]:
# Diagnostic Device Query

import openvino as ov

core = ov.Core()
available_devices = core.available_devices

print(available_devices)
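
For the mixed PyTorch, IPEX and OpenVINO case mentioned earlier, it can help to run both queries side by side. The following is a rough sketch rather than part of OpenArc: the function names are hypothetical, and it assumes a PyTorch build that exposes `torch.xpu` (natively in recent releases, or after importing `intel_extension_for_pytorch` on older stacks). Every import is guarded so the cell still reports something useful when one library is missing or broken.

```python
def query_openvino():
    """Return the devices OpenVINO can see, or the reason it cannot load."""
    try:
        import openvino as ov
        return {"ok": True, "devices": list(ov.Core().available_devices), "error": None}
    except Exception as exc:  # missing package, driver, or shared library
        return {"ok": False, "devices": [], "error": repr(exc)}


def query_torch_xpu():
    """Return the XPU devices PyTorch can see, or the reason it cannot."""
    try:
        import torch
        try:
            # Assumption: older stacks need IPEX imported to register torch.xpu.
            import intel_extension_for_pytorch  # noqa: F401
        except Exception:
            pass
        xpu = getattr(torch, "xpu", None)
        if xpu is None or not xpu.is_available():
            return {"ok": False, "devices": [], "error": "torch.xpu is not available"}
        names = [xpu.get_device_name(i) for i in range(xpu.device_count())]
        return {"ok": True, "devices": names, "error": None}
    except Exception as exc:
        return {"ok": False, "devices": [], "error": repr(exc)}


def collect_device_report():
    """Run both queries so the results can be compared in one place."""
    return {"openvino": query_openvino(), "torch_xpu": query_torch_xpu()}


for backend, result in collect_device_report().items():
    print(backend, result)
```

If one backend lists the GPU and the other does not, the problem is more likely a dependency conflict inside the environment than a driver or permissions issue.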

## Understanding your device: Device Query

Working with OpenVINO requires understanding facts about your device.

OpenVINO converts a model graph into its Intermediate Representation (IR), OpenVINO's own format, which is what the C++ runtime consumes.

OpenArc takes the optimization process a step further by offering tools for converting models which embrace the complexity of the task. 

While the excellent CLI tool streamlines the process, each parameter requires careful consideration of several facts that the Device Query makes easier to discover. Seriously: use [Intel Ark](https://www.intel.com/content/www/us/en/ark.html) for hardware you don't own and the Device Query for every other conceivable use case.

Here's what's most important to consider:

### Supported Datatypes

Most Intel Devices will support FP32 natives as well 
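
One way to see which data types a device reports is to read its `OPTIMIZATION_CAPABILITIES` property, which is one of the properties the full Device Query prints. This is a small sketch with a hypothetical helper name, `supported_precisions`; it returns an empty dict when OpenVINO is not installed and an empty list for any device that does not expose the property.

```python
def supported_precisions():
    """Map each visible OpenVINO device to the precisions it reports."""
    try:
        import openvino as ov
    except ImportError:
        return {}

    core = ov.Core()
    result = {}
    for device in core.available_devices:
        try:
            caps = core.get_property(device, "OPTIMIZATION_CAPABILITIES")
        except Exception:
            caps = []  # property not exposed by this device or plugin
        if isinstance(caps, str):
            caps = [caps]
        result[device] = [str(c) for c in caps]
    return result


for device, precisions in supported_precisions().items():
    print(f"{device}: {', '.join(precisions) or 'none reported'}")
```

The values differ between CPU, GPU and NPU plugins, so check the output for each target device before choosing a weight format.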


### Quantization

The same rules, practices and 








In [None]:
# Device Query: 


# Taken from https://github.com/openvinotoolkit/openvino/blob/master/samples/python/hello_query_device/hello_query_device.py

import logging as log
import sys

import openvino as ov


def param_to_string(parameters) -> str:
    """Convert a list / tuple of parameters returned from OV to a string."""
    if isinstance(parameters, (list, tuple)):
        return ', '.join([str(x) for x in parameters])
    else:
        return str(parameters)


def main():
    log.basicConfig(format='[ %(levelname)s ] %(message)s', level=log.INFO, stream=sys.stdout)

    # --------------------------- Step 1. Initialize OpenVINO Runtime Core --------------------------------------------
    core = ov.Core()

    # --------------------------- Step 2. Get metrics of available devices --------------------------------------------
    log.info('Available devices:')
    for device in core.available_devices:
        log.info(f'{device} :')
        log.info('\tSUPPORTED_PROPERTIES:')
        for property_key in core.get_property(device, 'SUPPORTED_PROPERTIES'):
            if property_key != 'SUPPORTED_PROPERTIES':
                try:
                    property_val = core.get_property(device, property_key)
                except TypeError:
                    property_val = 'UNSUPPORTED TYPE'
                log.info(f'\t\t{property_key}: {param_to_string(property_val)}')
        log.info('')

    # -----------------------------------------------------------------------------------------------------------------
    return 0


if __name__ == '__main__':
    sys.exit(main())

## Model Conversion

OpenVINO is an inference engine for leveraging diverse types of compute. Squeezing as much performance as possible from any hardware takes a bit more work than the naive approach, especially once you have a use case in mind and know what hardware you are using.

### The Naive Approach

When no weight_format argument is passed, OpenVINO defaults to **int8_asym**, both when setting "export=True" in **OVModelForCausalLM.from_pretrained()** and when using the Optimum CLI Export Tool.
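
For reference, the naive approach looks roughly like the sketch below. It assumes `optimum-intel` is installed; the wrapper name `export_with_defaults` and the model id in the usage comment are placeholders, not part of OpenArc. The import is deferred so the cell can be defined without triggering a download or requiring the package up front.

```python
def export_with_defaults(model_id: str):
    """Convert a Hugging Face model on the fly using the library defaults."""
    # Imported lazily: defining this function does not require optimum-intel.
    from optimum.intel import OVModelForCausalLM

    # export=True converts the checkpoint to OpenVINO IR at load time.
    # No weight format or quantization config is given, so defaults apply.
    return OVModelForCausalLM.from_pretrained(model_id, export=True)


# Usage (downloads and converts the model, so it is left commented out):
# model = export_with_defaults("your-org/your-model")  # placeholder model id
```

Nothing here tells the converter about the target device or use case, which is exactly why this path leaves performance on the table.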

OpenArc has been designed for use cases that wander toward the bleeding edge of AI. Users are expected to understand the nuance of data types, quantization strategies, and calibration datasets, and how these parameters contribute to accuracy loss. They may have just come from IPEX or (as of 2.5) 'vanilla' PyTorch and are looking to optimize a deployment.

For convenience "export=False" is exposed on the /model/load endpoint; however I **strongly discourage** using it. To get the best performance from OpenVINO you have to get into the weeds.

### The Less Naive Approach to Model Conversion

Many Intel CPUs support INT8 but it isn't always the best choice. 

The OpenVINO notebooks demonstrate that INT4 weight-only compression, coupled with quantization strategies like AWQ + Scale Estimation, achieves better performance across the Intel device ecosystem with negligible accuracy loss. Still, different model architectures perform differently regardless of the chosen data type. In practice it can be hard to predict how a model will perform, so understanding how these parameters work is essential: you maximize throughput by testing different configurations on the same target model.
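
To make that kind of side-by-side testing repeatable, the export command can be assembled in code. The sketch below builds an `optimum-cli export openvino` invocation as a list of arguments. The helper `build_export_command` is hypothetical, and the defaults (group size 128, ratio 1.0, `wikitext2` as calibration dataset) are illustrative starting points rather than recommendations; the model id and output directory are placeholders.

```python
import shlex


def build_export_command(model_id, output_dir, weight_format="int4",
                         awq=True, scale_estimation=True, dataset="wikitext2",
                         group_size=128, ratio=1.0, sym=False):
    """Assemble an optimum-cli export command for one configuration."""
    cmd = ["optimum-cli", "export", "openvino",
           "--model", model_id,
           "--weight-format", weight_format]
    if weight_format == "int4":
        # These options only matter for 4-bit weight compression.
        cmd += ["--group-size", str(group_size), "--ratio", str(ratio)]
        if sym:
            cmd.append("--sym")
        if awq:
            cmd.append("--awq")
        if scale_estimation:
            cmd.append("--scale-estimation")
        if awq or scale_estimation:
            # Data-aware methods need a calibration dataset.
            cmd += ["--dataset", dataset]
    cmd.append(output_dir)
    return cmd


# Placeholder model id and output directory, for illustration only.
command = build_export_command("your-org/your-model", "your-model-int4-awq-ov")
print(shlex.join(command))
```

Passing the returned list to `subprocess.run` runs the export; changing one argument at a time and re-benchmarking gives a comparable set of configurations for the same model.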


### Why Speed Matters

NVIDIA GPUs are faster and have a better open source backbone than Intel's. However, Intel devices are cheaper by comparison. Even so, I don't want speed for its own sake. OpenArc has been tooled for agentic use cases and synthetic data generation, where low throughput can damage workflow execution.

If I want to dump some problem into a RoundRobin style multi-turn chat I am not sitting there waiting for 



Note: If you are using cloud compute it should still work.