# Inference Test: Query the Deployed Model

This notebook sends prediction requests to our TensorFlow model running on **NVIDIA Triton Inference Server**.

**What's happening under the hood:**
- The model was trained in `train-and-upload.py` and uploaded to MinIO (S3-compatible object storage)
- KServe pulled the model from S3 into a Triton pod
- Triton loaded the SavedModel onto the **GPU** and serves it over its REST API
- We're calling it from inside the cluster using Triton's [v2 inference protocol](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/customization_guide/inference_protocols.html)

## 1. Discover the InferenceService & Check Model Health

First, auto-detect the InferenceService name — the RHOAI Dashboard generates the name from the model registry entry, so we can't hardcode it. Then query Triton's metadata endpoint to discover the input/output tensor names.

> **Why auto-detect tensors?** TF/Keras appends a numeric suffix to tensor names each time you export (`keras_tensor`, `keras_tensor_9`, `keras_tensor_12`, etc.). Hardcoding the name would break on re-export.
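
As a minimal sketch of that discovery step, the helper below reads the first input and output tensor from a v2 model-metadata response and fails loudly if the model exposes a different number of tensors than this notebook expects. The function name `pick_tensors` and the sample metadata are illustrative assumptions, not output from the deployed model.

```python
def pick_tensors(metadata):
    """Return (input_name, input_shape, output_name) from a v2 model-metadata dict."""
    inputs = metadata.get("inputs", [])
    outputs = metadata.get("outputs", [])
    if len(inputs) != 1 or len(outputs) != 1:
        raise ValueError(
            f"expected exactly 1 input and 1 output, got {len(inputs)} and {len(outputs)}"
        )
    return inputs[0]["name"], inputs[0]["shape"], outputs[0]["name"]


# Hypothetical metadata shaped like a v2 response; real tensor names will differ.
sample_metadata = {
    "name": "demo-model",
    "inputs": [{"name": "keras_tensor_9", "datatype": "FP32", "shape": [-1, 5]}],
    "outputs": [{"name": "output_0", "datatype": "FP32", "shape": [-1, 1]}],
}
input_name_demo, input_shape_demo, output_name_demo = pick_tensors(sample_metadata)
print(input_name_demo, input_shape_demo, output_name_demo)
```

Checking the tensor count up front turns a silent mismatch (for example, a model re-exported with two inputs) into an immediate, readable error.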

In [None]:
import requests
import json
import subprocess

# --- Auto-detect InferenceService name ---
# Kubernetes mounts the namespace, token, and CA bundle into the pod here
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"
with open(f"{SA_DIR}/namespace") as f:
    NAMESPACE = f.read().strip()

# Try the oc CLI first (available in RHOAI workbench images)
try:
    result = subprocess.run(
        ["oc", "get", "inferenceservice", "-o", "jsonpath={.items[0].metadata.name}"],
        capture_output=True, text=True, timeout=10
    )
    if result.returncode == 0 and result.stdout.strip():
        isvc_name = result.stdout.strip()
    else:
        raise RuntimeError(result.stderr)
except Exception as e:
    # Fall back to the Kubernetes API, authenticating as the pod's service account
    print(f"oc lookup failed ({e}), trying k8s API...")
    with open(f"{SA_DIR}/token") as f:
        TOKEN = f.read().strip()
    k8s_url = f"https://kubernetes.default.svc/apis/serving.kserve.io/v1beta1/namespaces/{NAMESPACE}/inferenceservices"
    resp = requests.get(
        k8s_url,
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=f"{SA_DIR}/ca.crt",
        timeout=10,
    )
    isvc_list = resp.json()
    # Covers both API errors (no "items" key) and an empty list
    if not isvc_list.get("items"):
        raise RuntimeError(f"No InferenceService found: {json.dumps(isvc_list, indent=2)}")
    isvc_name = isvc_list["items"][0]["metadata"]["name"]

print(f"Found InferenceService: {isvc_name}")

# --- Build Triton URL ---
MODEL_NAME = "demo-model"
TRITON_URL = f"http://{isvc_name}-predictor.{NAMESPACE}.svc.cluster.local:8080/v2/models/{MODEL_NAME}"
print(f"Triton endpoint:       {TRITON_URL}")

# --- Query model metadata ---
resp = requests.get(TRITON_URL, timeout=10)
resp.raise_for_status()  # fail here if the model is missing or not loaded
metadata = resp.json()

input_name = metadata['inputs'][0]['name']
input_shape = metadata['inputs'][0]['shape']
output_name = metadata['outputs'][0]['name']

print(f"\nModel:    {metadata['name']}")
print(f"Version:  {metadata['versions'][0]}")
print(f"Platform: {metadata['platform']}")
print(f"Input:    {input_name} shape={input_shape}")
print(f"Output:   {output_name} shape={metadata['outputs'][0]['shape']}")
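
The cell above reads model metadata but does not call a health endpoint. The v2 inference protocol also defines liveness and readiness routes, and the sketch below builds their URLs using the same in-cluster naming pattern as `TRITON_URL`. The helper name `readiness_urls` and the placeholder service and namespace names are assumptions for illustration.

```python
def readiness_urls(isvc_name, namespace, model_name, port=8080):
    """Build the v2 health-check URLs for a predictor service inside the cluster."""
    base = f"http://{isvc_name}-predictor.{namespace}.svc.cluster.local:{port}"
    return {
        "server_live": f"{base}/v2/health/live",
        "server_ready": f"{base}/v2/health/ready",
        "model_ready": f"{base}/v2/models/{model_name}/ready",
    }


# Placeholder names; in the notebook these would be isvc_name, NAMESPACE, MODEL_NAME.
health_urls = readiness_urls("my-isvc", "my-namespace", "demo-model")
for label, url in health_urls.items():
    print(f"{label:13s} {url}")
```

Each route answers a plain `GET`, and a 200 status code indicates success, so a check such as `requests.get(url, timeout=5).status_code == 200` is enough to gate the inference calls that follow.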

## 2. Send a Prediction

Our model takes **5 float features** and returns a **sigmoid probability** (0 to 1).

The request format follows Triton's v2 protocol:
- `name`: the input tensor name (auto-detected above)
- `shape`: `[1, 5]` — one sample with 5 features
- `datatype`: `FP32` — 32-bit floating point
- `data`: the actual feature values
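
The fields above can be assembled by a small helper that derives `shape` from the data instead of hardcoding it. This is a sketch under assumptions: `build_infer_request` is a hypothetical name, and whether the deployed model accepts more than one row per request depends on its batching configuration.

```python
def build_infer_request(input_name, rows, datatype="FP32"):
    """Build a v2 inference request body from a list of equal-length feature rows."""
    if not rows:
        raise ValueError("rows must contain at least one sample")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of features")
    return {
        "inputs": [{
            "name": input_name,
            "shape": [len(rows), width],
            "datatype": datatype,
            # Flattened in row-major order so it matches the declared shape
            "data": [float(v) for row in rows for v in row],
        }]
    }


demo_payload = build_infer_request("keras_tensor", [[0.1, 0.5, 0.3, 0.7, 0.2]])
print(demo_payload["inputs"][0]["shape"])  # [1, 5]
```

Deriving the shape from the rows keeps `shape` and `data` consistent, which avoids a common cause of rejected requests.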

In [None]:
data_1 = [0.1, 0.5, 0.3, 0.7, 0.2]

payload = {
    "inputs": [{
        "name": input_name,
        "shape": [1, 5],
        "datatype": "FP32",
        "data": data_1
    }]
}

resp = requests.post(f"{TRITON_URL}/infer", json=payload, timeout=10)
resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
result = resp.json()
prediction_1 = result['outputs'][0]['data'][0]

print(f"Input:      {data_1}")
print(f"Prediction: {prediction_1:.6f}")

## 3. Try Different Inputs

Changing the feature values should produce a different prediction, which shows that the model is computing on its input rather than returning a static value.

In [None]:
data_2 = [0.9, 0.1, 0.8, 0.2, 0.95]

payload["inputs"][0]["data"] = data_2
resp = requests.post(f"{TRITON_URL}/infer", json=payload, timeout=10)
resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
result = resp.json()
prediction_2 = result['outputs'][0]['data'][0]

print(f"Input:      {data_2}")
print(f"Prediction: {prediction_2:.6f}")
print(f"\nDifferent inputs → different predictions ({prediction_1:.4f} vs {prediction_2:.4f})")
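
Both cells index straight into `result['outputs'][0]['data'][0]`. A slightly more defensive way to read a v2 response is sketched below: it checks for an `error` field, optionally selects an output by name, and regroups the flat `data` list into rows using the reported `shape`. The helper name `extract_output` and the numbers in the sample response are made up for illustration.

```python
def extract_output(response, name=None):
    """Return one output tensor of a v2 response as a list of rows."""
    if "error" in response:
        raise RuntimeError(f"inference failed: {response['error']}")
    outputs = response["outputs"]
    if name is None:
        tensor = outputs[0]
    else:
        matches = [o for o in outputs if o["name"] == name]
        if not matches:
            raise KeyError(f"no output tensor named {name!r}")
        tensor = matches[0]
    shape, flat = tensor["shape"], tensor["data"]
    # Last dimension is the row width; a 1-D tensor is treated as rows of width 1
    width = shape[-1] if len(shape) > 1 else 1
    return [flat[i:i + width] for i in range(0, len(flat), width)]


# Hypothetical response for a batch of two samples; values are not real predictions.
sample_response = {
    "model_name": "demo-model",
    "outputs": [{"name": "output_0", "datatype": "FP32", "shape": [2, 1], "data": [0.25, 0.75]}],
}
rows_demo = extract_output(sample_response)
print(rows_demo)
```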

## What Just Happened

1. Python `requests` sent JSON over HTTP to the **Triton pod** inside the cluster
2. Triton ran the **forward pass** through our neural network on the **A10G GPU**
3. Result: a sigmoid probability between 0 and 1

**In production, this same pattern powers:**
- Fraud detection scores on transactions
- Credit risk assessments
- Real-time pricing models

The main differences would be a real trained model, external access via Gateway/HTTPRoute, and token authentication.
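
As a rough sketch of what the token-authenticated variant might look like from the client side, the helper below assembles keyword arguments for `requests.post` with a bearer token. The hostname, the token value, and the helper name `external_infer_kwargs` are all placeholders, and the actual route and authentication scheme depend on how the gateway is configured.

```python
def external_infer_kwargs(base_url, model_name, token, payload, timeout=10):
    """Assemble keyword arguments for requests.post against an external v2 endpoint."""
    return {
        "url": f"{base_url.rstrip('/')}/v2/models/{model_name}/infer",
        "json": payload,
        "headers": {"Authorization": f"Bearer {token}"},
        "timeout": timeout,
    }


# Placeholder host and token; substitute the values for your own cluster.
external_kwargs = external_infer_kwargs(
    "https://models.example.com/",
    "demo-model",
    "REPLACE_ME",
    {"inputs": []},
)
print(external_kwargs["url"])
```

The request itself would then be sent with `requests.post(**external_kwargs)`, leaving the payload format unchanged from the in-cluster calls.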