<h1>Testing Fine-Tuned Llama 3.2 Vision Served by vLLM</h1>
This notebook runs inference on the Watchman dataset using the fine-tuned model<br>
that was created and deployed with vLLM in the fine_tune.ipynb notebook.<br>
It walks through the whole dataset, runs inference through the 'vllm-complex' interface<br>
from Watchman's model_interfaces.py, and then summarizes the results to show how the model performs now.<br>
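
The dataset walk can be sketched as a small generator. This is a minimal sketch, assuming the layout that the code cells in this notebook rely on: samples live in `<dataset>/<channel>/<object>/<id>/`, a `skip` marker file excludes a sample, and a `no` marker file labels it as negative. The name `iter_samples` is hypothetical and is not part of Watchman.

```python
import os

def iter_samples(dataset, chans, objs):
    """Yield (subdir, channel, object, expected) for every usable sample.

    Hypothetical helper; assumes the skip/no marker-file convention.
    """
    for c in chans:
        for o in objs:
            obj_dir = os.path.join(dataset, c, o)
            # Sort by name so the walk order is reproducible
            for entry in sorted(os.scandir(obj_dir), key=lambda e: e.name):
                if not entry.is_dir():
                    continue
                if os.path.exists(os.path.join(entry.path, "skip")):
                    continue  # sample explicitly excluded
                is_no = os.path.exists(os.path.join(entry.path, "no"))
                yield entry.path, c, o, "no" if is_no else "yes"
```

A loop such as `for s, c, o, expected in iter_samples(dataset, chans, objs): ...` would then replace the nested loops, keeping the labeling rule in one place.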

---

In [None]:
# Check if we can poke the vLLM server
!curl http://127.0.0.1:5050/version
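
The same reachability check can be done from Python, which lets later cells react to a failure instead of only printing it. This is a minimal sketch using only the standard library; the helper name `probe_server` is hypothetical, and the URL is the one from the `curl` call above.

```python
import urllib.request

def probe_server(url="http://127.0.0.1:5050/version", timeout=3.0):
    """Return the response body as text, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, HTTPError, timeouts and refused connections are all OSError subclasses
        return None
```

Calling `probe_server()` and checking for `None` before building the client would give an early, explicit signal that the server is down.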

In [1]:
import os
import sys
import base64
from pathlib import Path
import pandas as pd
from PIL import Image
from io import BytesIO
from openai import OpenAI

In [None]:
# Confirm that vLLM is working and get the ID of the served model.
# We expect it to serve only our fine-tuned model; if it serves more,
# we use the last one in the returned model list.

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "NONE_NEEDED"
openai_api_base = "http://localhost:5050/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

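models = client.models.list()

print("Served models:")
model_ids = [m.id for m in models]
for model_id in model_ids:
    print(f"  {model_id}")
# Fail early with a clear message instead of a NameError on an empty list
if not model_ids:
    raise RuntimeError("vLLM is not serving any models")
vllm_model = model_ids[-1]
print(f"Will use: {vllm_model}")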

In [None]:
# load model interfaces we can use for testing
print(f"Working dir: {os.getcwd()}")
sys.path.append(os.path.abspath(".."))
sys.path.append(os.path.abspath("../orchestrator"))
from shared_settings import *
from model_interfaces import *
print(MODELS)

In [None]:
# set up vars for finding train data
dataset = "../.data/dataset" # dataset folder location
chans = ["porch"] # list of channels to load the data for
objs = ["person"] # list of objects to load the data for
model_name = "vllm-complex" # model interface name to use for inference experimentation
c_desc = {
    "porch": "Porch",
}
o_desc = {
    "person": "a person",
}
MODEL_INTERFACE = MODELS[model_name](model_to_use=vllm_model)

In [None]:
# Run a test inference on one negative and one positive sample
from IPython.display import display

def test_inf(s, c, o, exp_res):
    print(f"Subdir: {s} Expecting: {exp_res}")
    image_pname = f"{s}/image.jpg"
    img_data = base64.b64encode(Path(image_pname).read_bytes()).decode()
    res, msg = MODEL_INTERFACE.locate(img_data, o_desc[o], c_desc[c])
    img = Image.open(image_pname)
    w, h = img.size
    # Display at quarter size to keep the cell output compact
    display(img.resize((w // 4, h // 4)))
    print("Inference result: ", "yes" if res else "no")
    print("Location: ", msg if msg is not None else "N/A")
    print("---------------------------")

for c in chans:
    for o in objs:
        obj_dir = f"{dataset}/{c}/{o}"
        subdirs = [f.path for f in os.scandir(obj_dir) if f.is_dir()]
        shown_no = False   # a negative sample has been tested
        shown_yes = False  # a positive sample has been tested
        for s in subdirs:
            if shown_no and shown_yes:
                break
            if os.path.exists(f"{s}/skip"):
                continue
            is_no = os.path.exists(f"{s}/no")
            if is_no and not shown_no:
                test_inf(s, c, o, "no")
                shown_no = True
            elif not is_no and not shown_yes:
                test_inf(s, c, o, "yes")
                shown_yes = True

In [13]:
# Do a full walk-through using the vLLM-served model.
# Print the location description for every positive result, and flag every error.

def check_inf(s, c, o, exp_res):
    image_pname = f"{s}/image.jpg"
    img_data = base64.b64encode(Path(image_pname).read_bytes()).decode()
    res, msg = MODEL_INTERFACE.locate(img_data, o_desc[o], c_desc[c])
    error = 0
    if (res and exp_res != 'yes') or (not res and exp_res != 'no'):
        error = 1
    if res or error > 0:
        msg = "none" if msg is None else msg.replace("\n", "")
        print(f"Subdir: {s} Expecting: {exp_res} {'(error)' if error > 0 else '(ok)'} Location: {msg}")
    return error

total = 0
total_16b_no = 0
total_16b_yes = 0
for c in chans:
    for o in objs:
        obj_dir = f"{dataset}/{c}/{o}"
        subdirs = [f.path for f in os.scandir(obj_dir) if f.is_dir()]
        for s in subdirs:
            if os.path.exists(f"{s}/skip"):
                continue
            if os.path.exists(f"{s}/no"):
                cur_err_no = check_inf(s, c, o, "no")
                cur_err_yes = 0
            else:
                cur_err_yes = check_inf(s, c, o, "yes")
                cur_err_no = 0
            total_16b_no += cur_err_no
            total_16b_yes += cur_err_yes
            total += 1
            _, dn = os.path.split(s)
            #print(f"\rTotal:{total} err16b_no:{cur_err_no} err16b_yes:{cur_err_yes} Subdir:{c}/{o}/{dn}      ", end="")

print("\n----------------------------------------------------------")
print(f"\nSummary: out of {total} samples, errors on No: {total_16b_no}, errors on Yes: {total_16b_yes}")


Subdir: ../.data/dataset/porch/person/325 Expecting: yes (ok) Location: the person is located on the porch
Subdir: ../.data/dataset/porch/person/238 Expecting: yes (ok) Location: the person is located on the porch
Subdir: ../.data/dataset/porch/person/975 Expecting: yes (ok) Location: the person is located on the driveway
Subdir: ../.data/dataset/porch/person/34 Expecting: yes (ok) Location: the person is located on the porch
Subdir: ../.data/dataset/porch/person/721 Expecting: yes (ok) Location: the person is located on the driveway
Subdir: ../.data/dataset/porch/person/5 Expecting: yes (ok) Location: the person is located in the driveway
Subdir: ../.data/dataset/porch/person/302 Expecting: yes (ok) Location: the person is located on the driveway
Subdir: ../.data/dataset/porch/person/411 Expecting: yes (ok) Location: the person is located on the porch
Subdir: ../.data/dataset/porch/person/170 Expecting: yes (ok) Location: the person is located on the porch
Subdir: ../.data/dataset/por

The accuracy has clearly improved. It is unclear why the confusion matrix shows no errors while inference under vLLM still produces a few.<br>
Those errors are false negatives, in samples where the object being looked for is reasonably difficult to recognize.<br>
They could be true positives that were eliminated when balancing the dataset.<br>
Alternatively, they may be related to the model conversion to f16 for vLLM, or to the vLLM image processing pipeline.<br>
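
To compare the vLLM run with the confusion matrix more directly, the per-class error counts can be turned into confusion-matrix cells and rates. This is a minimal sketch, assuming the numbers of "yes" and "no" samples are counted separately (the walk above only tracks the overall total). The function `summarize_errors` and its parameter names are hypothetical, and the numbers in the usage line are illustrative only, not results from this notebook.

```python
def summarize_errors(n_yes, n_no, err_yes, err_no):
    """Derive confusion-matrix cells and rates from per-class error counts.

    n_yes / n_no: number of samples labelled "yes" / "no".
    err_yes: "yes" samples the model answered "no" to (false negatives).
    err_no: "no" samples the model answered "yes" to (false positives).
    """
    tp, fn = n_yes - err_yes, err_yes
    tn, fp = n_no - err_no, err_no
    total = n_yes + n_no
    return {
        "tp": tp, "fn": fn, "tn": tn, "fp": fp,
        "accuracy": (tp + tn) / total if total else None,
        "recall": tp / n_yes if n_yes else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
    }

# Illustrative numbers only, not results from this notebook
stats = summarize_errors(n_yes=50, n_no=50, err_yes=2, err_no=0)
print(stats)
```

Because the errors seen here are all false negatives, recall is the rate to watch; precision stays at 1.0 as long as there are no errors on "no" samples.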

---

The "production" testing went well for the first 3 days.<br>
On the 4th day the model generated a series of false positives (for a "person in the driver's seat in the car").<br>
It then continued without errors, so far for another 3 days.<br>

---