# Inference

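## Copy tokenizer

The cell below saves the tokenizer of the base model `Qwen/Qwen3-0.6B` into `./final_chess_model`, the directory that is compiled in the next step.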

In [None]:
from transformers import AutoTokenizer

# 1. Specify the name of the Hugging Face model
model_name = "Qwen/Qwen3-0.6B"

# 2. Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 3. Define the local directory where you want to save the tokenizer
local_directory = "./final_chess_model"

# 4. Save the tokenizer to the local directory
tokenizer.save_pretrained(local_directory)

print(f"Tokenizer for '{model_name}' saved to '{local_directory}'")
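
Before compiling, it can help to confirm that the tokenizer files actually landed in the model directory. The sketch below uses a hypothetical helper, `missing_tokenizer_files`, which is not part of `transformers`. It assumes `save_pretrained` wrote at least `tokenizer_config.json`; the full set of files depends on the tokenizer class and library version, so extend the tuple as needed.

```python
from pathlib import Path

# Assumption: save_pretrained writes tokenizer_config.json. Other files vary
# by tokenizer class and transformers version, so they are not listed here.
EXPECTED_FILES = ("tokenizer_config.json",)


def missing_tokenizer_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


# Report what is missing from the directory used in the cell above.
missing = missing_tokenizer_files("./final_chess_model")
print("Missing tokenizer files:", missing or "none")
```

An empty list means every file named in `EXPECTED_FILES` was found.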

# Compilation

After completing the fine-tuning process, the next step is to compile the trained model for AWS Trainium inference using the Hugging Face Optimum Neuron toolchain.
Neuron compilation optimizes the model graph and converts it into a Neuron Executable File Format (NEFF), enabling efficient execution on NeuronCores.

In [None]:
!optimum-cli export neuron \
  --model "./final_chess_model" \
  --task text-generation \
  --sequence_length 2048 \
  --batch_size 1 \
  ./final_chess_model_compiled
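
The inference cell later in this notebook passes `max_num_seqs=1` and `max_model_len=2048`, the same values used here for `--batch_size` and `--sequence_length`. Assuming those settings are meant to mirror the compile-time shapes, one way to keep them in sync is to define them once. The `CompileConfig` class below is a hypothetical convenience, not part of Optimum Neuron or vLLM; it only builds the argument list and keyword dictionary and does not run the export. `tensor_parallel_size` is left out because the export command above does not set it.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class CompileConfig:
    """Single source of truth for the shapes used at compile and inference time."""

    model_dir: str = "./final_chess_model"
    output_dir: str = "./final_chess_model_compiled"
    sequence_length: int = 2048
    batch_size: int = 1

    def export_command(self):
        """Argument list mirroring the optimum-cli call in the cell above."""
        return [
            "optimum-cli", "export", "neuron",
            "--model", self.model_dir,
            "--task", "text-generation",
            "--sequence_length", str(self.sequence_length),
            "--batch_size", str(self.batch_size),
            self.output_dir,
        ]

    def vllm_kwargs(self):
        """Keyword arguments for vllm.LLM derived from the same values."""
        return {
            "model": self.output_dir,
            "max_num_seqs": self.batch_size,
            "max_model_len": self.sequence_length,
        }


config = CompileConfig()
print(shlex.join(config.export_command()))  # shell-ready command string
print(config.vllm_kwargs())
```

The list returned by `export_command()` can be passed to `subprocess.run`, and `vllm_kwargs()` can be unpacked into the `LLM(...)` call together with any other arguments you need.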

# Inference

Install Optimum Neuron with its vLLM extra, then run inference with the compiled model.

In [None]:
%pip install -q "optimum-neuron[vllm]" matplotlib

Now run the batch inference example below.

In [None]:
from vllm import LLM
from transformers import AutoTokenizer
from inference_helper import (
    load_chess_data,
    create_prompts,
    run_inference,
    process_inference_results,
    save_evaluation_results,
    print_sample_results
)

# Load the compiled model
llm = LLM(
    model="./final_chess_model_compiled",
    max_num_seqs=1,
    max_model_len=2048,
    tensor_parallel_size=2,
)

# Load chess dataset
chess_data = load_chess_data('/home/ubuntu/environment/neuron-distillation-sample/distillation/data/chess_output.json')

# Create prompts
tokenizer = AutoTokenizer.from_pretrained("./final_chess_model_compiled")
prompts = create_prompts(chess_data, tokenizer)

# Run inference
outputs = run_inference(llm, prompts, max_tokens=2048, temperature=0.0)

# Process results
results, teacher_correct, student_correct = process_inference_results(outputs, chess_data)

# Save results and print summary
summary = save_evaluation_results(results, teacher_correct, student_correct)

# Show sample results
print_sample_results(results, num_samples=5)


In [None]:
import json
from inference_helper import (
    print_detailed_statistics,
    create_visualization,
    print_sample_predictions
)

# Load results
with open('static/chess_evaluation_results.json', 'r') as f:
    data = json.load(f)

# Print detailed statistics
print_detailed_statistics(data)

# Create visualization
create_visualization(data)

# Show sample predictions
print_sample_predictions(data, num_samples=5)
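
The layout of `static/chess_evaluation_results.json` is defined by `inference_helper` and is not shown in this notebook. If you want to inspect it before writing your own analysis, the sketch below summarizes the structure by type without assuming any particular keys. `summarize_structure` is a hypothetical helper written for this illustration.

```python
import json
from pathlib import Path


def summarize_structure(obj, max_depth=2, _depth=0):
    """Describe nested JSON data by type, without assuming specific keys."""
    if isinstance(obj, dict):
        if _depth >= max_depth:
            return f"dict({len(obj)} keys)"
        return {
            key: summarize_structure(value, max_depth, _depth + 1)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return f"list({len(obj)} items)"
    return type(obj).__name__


# Only read the file if the evaluation cell has already produced it.
results_path = Path("static/chess_evaluation_results.json")
if results_path.is_file():
    with results_path.open() as f:
        print(json.dumps(summarize_structure(json.load(f)), indent=2))
```

Raise `max_depth` to expand more levels of nested dictionaries.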


## Optional

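You can also add the results from the original model to the comparison.

To serve the model, start an OpenAI-compatible inference endpoint on the current instance. Replace the `--model` path with the directory of your compiled model: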

In [None]:
!python -m vllm.entrypoints.openai.api_server \
    --model="/home/ubuntu/environment/ml/qwen/compiled_model" \
    --max-num-seqs=1 \
    --max-model-len=2048 \
    --tensor-parallel-size=2 \
    --port=8080 \
    --device "neuron"

Query the endpoint from a separate terminal, since the server cell keeps running until it is interrupted:

In [None]:
curl 127.0.0.1:8080/v1/completions \
    -H 'Content-Type: application/json' \
    -X POST \
    -d '{"prompt":"<|im_start|>system\nYou are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.<|im_end|>\n<|im_start|>user\nThe service at this restaurant exceeded all my expectations!<|im_end|>\n<|im_start|>assistant\n", "temperature": 0.8, "max_tokens":128}'
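
The same request can be built from Python with the standard library. This is a minimal sketch, assuming the server from the previous step is listening on `127.0.0.1:8080` and that the model expects the ChatML layout in which each turn is closed with `<|im_end|>`. The helper names `build_chatml_prompt` and `build_completion_request` are hypothetical; the request is only constructed here, and the lines that send it are left commented out.

```python
import json
import urllib.request


def build_chatml_prompt(system, user):
    """Format a single-turn prompt in ChatML, ending with the assistant header."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


def build_completion_request(prompt, url="http://127.0.0.1:8080/v1/completions",
                             temperature=0.8, max_tokens=128):
    """Build (but do not send) a POST request with the same fields as the curl call."""
    payload = {"prompt": prompt, "temperature": temperature, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


prompt = build_chatml_prompt(
    "You are a sentiment classifier. You take input strings and return the "
    "sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.",
    "The service at this restaurant exceeded all my expectations!",
)
request = build_completion_request(prompt)

# Sending the request requires the endpoint to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Building the prompt in a function keeps the special tokens in one place, which makes it easier to swap the system and user text.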
