# AWS Neuron Inference Demo: Qwen3-8B Model

## What is AWS Neuron?
AWS Neuron is the SDK and runtime for running machine learning workloads on **AWS Inferentia** and **Trainium** chips, purpose-built silicon for ML. Inferentia targets inference and Trainium targets training, but both can serve models, as this demo does on a trn1 instance. Compared with general-purpose GPUs, these chips offer:

- **Better cost-performance**: Up to 70% lower cost per inference vs comparable GPU instances
- **Predictable performance**: Consistent latency without the variability of shared GPU resources  
- **High throughput**: Optimized for batch inference workloads

## What is NeuronX Distributed Inference (NxDI)?
NxDI is a PyTorch-based library that simplifies deploying large language models on Neuron hardware. It provides:
- **Production-ready models** (Llama, Qwen, Mixtral, etc.)
- **Advanced inference features** (continuous batching, speculative decoding, KV caching)
- **Distributed strategies** (tensor parallelism across multiple Neuron cores)
- **Seamless integration** with existing PyTorch workflows

## Key Concepts You'll Learn:
- **Model Compilation**: Converting PyTorch models to Neuron-optimized format (one-time process)
- **Tensor Parallelism**: Splitting model layers across multiple Neuron cores for larger models
- **Bucketing**: Pre-compiling for different sequence lengths to avoid recompilation
- **On-device Sampling**: Performing text generation sampling directly on Neuron hardware

## ⚠️ Supported Qwen3 Models

**IMPORTANT**: At the time of writing, NxDI supports only the following official Qwen3 model checkpoints:

- [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
- [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) 
- [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)
- [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) ✅ *Used in this demo*
- [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B)
- [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B)

**Note**: 
- Other Qwen3 variants, fine-tuned models, or custom checkpoints may not be compatible with NeuronX Distributed Inference
- For larger models (14B, 32B), you'll need instances with more Neuron cores (inf2.24xlarge or inf2.48xlarge)
- This demo uses **Qwen3-8B** as it provides a good balance of capability and resource requirements
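
As a rough sizing aid for the note above, the sketch below estimates weight memory in bfloat16 (2 bytes per parameter) and the per-core share under tensor parallelism. The parameter counts are nominal values read off the model names, not exact checkpoint sizes, and the estimate assumes even sharding and ignores the KV cache and activations, so treat it as a lower bound only.

```python
BYTES_PER_PARAM_BF16 = 2

# Nominal parameter counts taken from the model names (assumption, not exact).
NOMINAL_PARAMS = {
    "Qwen3-8B": 8e9,
    "Qwen3-14B": 14e9,
    "Qwen3-32B": 32e9,
}

def weight_gib(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Approximate size of the model weights in GiB."""
    return num_params * bytes_per_param / 2**30

def per_core_weight_gib(num_params: float, tp_degree: int) -> float:
    """Approximate weight share per Neuron core, assuming even sharding."""
    return weight_gib(num_params) / tp_degree

for name, params in NOMINAL_PARAMS.items():
    print(f"{name}: ~{weight_gib(params):.1f} GiB of weights, "
          f"~{per_core_weight_gib(params, 2):.1f} GiB per core at TP=2")
```

Comparing the per-core figure with the device memory of the target instance gives a first indication of whether a higher TP degree is needed.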

## 1. Environment Setup and Imports

First, we'll set up the environment and import necessary libraries.

In [1]:
import time
import torch
import psutil
from pathlib import Path
from huggingface_hub import snapshot_download, login

from transformers import AutoTokenizer, GenerationConfig

def validate_environment():
    """Validate that we're running on a Neuron-enabled instance."""
    try:
        import torch_neuronx
        import neuronx_distributed_inference
        print("✅ Neuron environment validated")
        return True
    except ImportError as e:
        print(f"❌ Neuron environment not found: {e}")
        print("💡 Make sure you're running on an inf2/trn1 instance with Neuron SDK installed")
        return False

if not validate_environment():
    raise RuntimeError("Please run this notebook on a Neuron-enabled instance")

✅ Neuron environment validated


## 2. Instance Configuration and Neuron Concepts

### Understanding Neuron Cores
Each AWS Inferentia/Trainium instance contains multiple **Neuron cores** - the compute units that execute your model:
- **inf2.xlarge**: 2 cores (good for development/testing)
- **inf2.8xlarge**: 2 cores (cost-effective production)  
- **inf2.24xlarge**: 12 cores (high-throughput production)
- **inf2.48xlarge**: 24 cores (maximum single-instance performance)

### Tensor Parallelism (TP)
For models too large for a single core, we split them across multiple cores using **tensor parallelism**:
- TP degree = number of cores to use
- Higher TP = can run larger models, but with communication overhead
- For Qwen3-8B: TP=2 is the smallest setting in this demo's instance profiles; the recorded run below uses TP=32 on trn1.32xlarge
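
The idea behind tensor parallelism can be sketched in plain Python. This is only an illustration of column-wise sharding of one linear layer, not how NxDI implements it: each "core" holds a slice of the weight columns, computes its part of the output, and the parts are concatenated (the step that costs communication on real hardware). All function names here are hypothetical.

```python
def matmul(x, w):
    """Multiply a row vector x (length d_in) by a matrix w (d_in x d_out)."""
    d_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(d_out)]

def split_columns(w, tp_degree):
    """Split the columns of w into tp_degree equal shards, one per core."""
    d_out = len(w[0])
    assert d_out % tp_degree == 0, "output width must divide evenly across cores"
    width = d_out // tp_degree
    return [[row[r * width:(r + 1) * width] for row in w] for r in range(tp_degree)]

def tensor_parallel_matmul(x, w, tp_degree):
    """Compute x @ w by letting each shard produce a slice of the output."""
    partials = [matmul(x, shard) for shard in split_columns(w, tp_degree)]
    # Concatenating the slices stands in for the all-gather between cores.
    return [value for part in partials for value in part]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
assert tensor_parallel_matmul(x, w, tp_degree=2) == matmul(x, w)
```

Each core stores only `1 / tp_degree` of the weights, which is why a higher TP degree lets larger models fit, at the price of the gather step.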

In [3]:
# 🔹  INSTANCE SELECTION  🔹
# ---------------------------------------------------------------------
# Supported instances and their Neuron-core counts
INSTANCE_PROFILES = {
    "inf2.xlarge"   : dict(cores=2 , tp=2 , batch_size=1),
    "inf2.8xlarge"  : dict(cores=2, tp=2 , batch_size=1),
    "inf2.24xlarge" : dict(cores=12, tp=12 , batch_size=4),
    "inf2.48xlarge" : dict(cores=24, tp=24 , batch_size=8),
    "trn1.32xlarge" : dict(cores=32, tp=32 , batch_size=16),
}

# Choose your target instance here
INSTANCE_TYPE   = "trn1.32xlarge"
assert INSTANCE_TYPE in INSTANCE_PROFILES, f"Unsupported instance {INSTANCE_TYPE}"

profile         = INSTANCE_PROFILES[INSTANCE_TYPE]
NUM_CORES       = profile["cores"]
TP_DEGREE       = profile["tp"]
BATCH_SIZE      = profile["batch_size"]

print(f"🖥️  Target instance : {INSTANCE_TYPE} "
      f"(Neuron cores={NUM_CORES}, tp={TP_DEGREE}, batch_size={BATCH_SIZE})")
# ---------------------------------------------------------------------

# Paths ----------------------------------------------------------------
MODEL_ID              = "Qwen/Qwen3-8B"
BASE_DIR              = Path("/home/ubuntu")
ORIGINAL_MODEL_PATH   = BASE_DIR / "model_hf_qwen" / "qwen3-8b"
COMPILED_MODEL_PATH   = BASE_DIR / "traced_model_qwen3" / "qwen3-8b" / str(profile["tp"]) / str(profile["batch_size"])
ORIGINAL_MODEL_PATH.mkdir(parents=True, exist_ok=True)
COMPILED_MODEL_PATH.mkdir(parents=True, exist_ok=True)

🖥️  Target instance : trn1.32xlarge (Neuron cores=32, tp=32, batch_size=16)


## 3. Model Download

Download the pre-trained model from Hugging Face Hub.

In [4]:
def download_model_if_needed(model_id: str, local_dir: Path) -> None:
    """Download model if not already present locally."""
    if not (local_dir / "config.json").exists():
        print(f"📥 Downloading {model_id} to {local_dir}...")
        snapshot_download(model_id, local_dir=str(local_dir))
        print("✅ Download complete")
    else:
        print(f"✅ Model already exists at {local_dir}")

download_model_if_needed(MODEL_ID, ORIGINAL_MODEL_PATH)

📥 Downloading Qwen/Qwen3-8B to /home/ubuntu/model_hf_qwen/qwen3-8b...


Fetching 15 files:   0%|          | 0/15 [00:00<?, ?it/s]

LICENSE: 0.00B [00:00, ?B/s]

README.md: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/728 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

.gitattributes: 0.00B [00:00, ?B/s]

model-00001-of-00005.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00002-of-00005.safetensors:   0%|          | 0.00/3.99G [00:00<?, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

model-00004-of-00005.safetensors:   0%|          | 0.00/3.19G [00:00<?, ?B/s]

model-00005-of-00005.safetensors:   0%|          | 0.00/1.24G [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

model-00003-of-00005.safetensors:   0%|          | 0.00/3.96G [00:00<?, ?B/s]

✅ Download complete


## 4. Tokenizer and Generation Configuration

Set up the tokenizer and generation parameters.

In [5]:
def setup_tokenizer_and_generation_config(model_path: Path) -> tuple:
    """Initialize tokenizer and generation configuration."""
    tokenizer = AutoTokenizer.from_pretrained(str(model_path), padding_side="right")
    tokenizer.pad_token = tokenizer.eos_token
    
    generation_config = GenerationConfig.from_pretrained(str(model_path))
    generation_config_kwargs = {
        "do_sample": True,
        "top_k": 1,
        "pad_token_id": tokenizer.pad_token_id,
    }
    generation_config.update(**generation_config_kwargs)
    
    print(f"✅ Tokenizer setup complete. Vocab size: {tokenizer.vocab_size}")
    return tokenizer, generation_config

tokenizer, generation_config = setup_tokenizer_and_generation_config(ORIGINAL_MODEL_PATH)

✅ Tokenizer setup complete. Vocab size: 151643
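
The tokenizer above pads on the right and reuses the EOS token as the pad token. Because the two share one ID, padding cannot be told apart from a real EOS by looking at the IDs alone, so an explicit attention mask is needed for batched inputs. The sketch below shows the idea on toy token IDs; `pad_right` is a hypothetical helper, and in practice the tokenizer produces the mask itself when called with `padding=True`.

```python
def pad_right(batch_ids, pad_id):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

# Toy token IDs; 151645 is the EOS/pad ID reported in this notebook's generation logs.
ids, mask = pad_right([[11, 12, 13], [21]], pad_id=151645)
print(ids)
print(mask)
```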


## 5. Neuron Configuration

This is where we configure Neuron-specific parameters:

- **tp_degree**: Tensor parallelism degree (number of Neuron cores to use)
- **batch_size**: Number of sequences to process in parallel
- **max_context_length**: Maximum input sequence length
- **seq_len**: Maximum total sequence length (input + output)
- **bucketing**: Pre-compile for different sequence lengths for optimal performance
- **on_device_sampling**: Perform sampling on Neuron device for better performance
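
To make the bucketing idea concrete, here is a minimal sketch, assuming a hypothetical set of buckets (this demo compiles a single context bucket of 1024). An input is routed to the smallest pre-compiled bucket that fits and padded up to that length, so no new graph has to be compiled at request time. It illustrates the concept only and is not NxDI's internal logic.

```python
from bisect import bisect_left

def select_bucket(length, buckets):
    """Return the smallest bucket that can hold `length` tokens."""
    ordered = sorted(buckets)
    i = bisect_left(ordered, length)
    if i == len(ordered):
        raise ValueError(f"input of {length} tokens exceeds largest bucket {ordered[-1]}")
    return ordered[i]

def pad_to_bucket(ids, buckets, pad_id):
    """Pad a token list on the right up to its bucket length."""
    bucket = select_bucket(len(ids), buckets)
    return ids + [pad_id] * (bucket - len(ids))

example_buckets = [128, 512, 1024]  # hypothetical bucket sizes
print(select_bucket(100, example_buckets))
print(len(pad_to_bucket(list(range(300)), example_buckets, pad_id=0)))
```

More buckets mean less wasted padding per request but a longer compilation, since each bucket is compiled separately.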

In [6]:
from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig

def create_neuron_config() -> NeuronConfig:
    """Create Neuron-specific configuration for optimal performance."""
    return NeuronConfig(
        # Parallelism configuration
        tp_degree=TP_DEGREE,
        batch_size=BATCH_SIZE,
        
        # Sequence length configuration
        max_context_length=1024,  # Maximum input tokens
        seq_len=2048,  # Maximum total sequence length
        
        # Performance optimizations
        enable_bucketing=True,  # Enable bucketing for different sequence lengths
        context_encoding_buckets=[1024],  # Pre-compile for these context lengths
        token_generation_buckets=[2048],  # Pre-compile for these generation lengths
        
        # Sampling configuration
        on_device_sampling_config=OnDeviceSamplingConfig(top_k=5),
        
        # Model-specific optimizations
        flash_decoding_enabled=False,  # Disable for this demo
        torch_dtype=torch.bfloat16,  # Use bfloat16 for better performance
        attn_kernel_enabled=True,  # Enable optimized attention kernels
        attn_cls="NeuronQwen3Attention"  # Use Qwen3-specific attention implementation
    )

neuron_config = create_neuron_config()
print("✅ Neuron configuration created")
print(f"   - Tensor parallelism degree: {neuron_config.tp_degree}")
print(f"   - Batch size: {neuron_config.batch_size}")
print(f"   - Max context length: {neuron_config.max_context_length}")
print(f"   - Sequence length: {neuron_config.seq_len}")


✅ Neuron configuration created
   - Tensor parallelism degree: 32
   - Batch size: 16
   - Max context length: 1024
   - Sequence length: 2048
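
The configuration above requests on-device sampling with `top_k=5`. What top-k sampling computes can be sketched on the host in a few lines; this is an illustration with hypothetical names and toy logits, not the Neuron kernel. With `top_k=1` the procedure reduces to greedy decoding, which is what the `top_k: 1` setting in Section 4 amounts to.

```python
import math
import random

def top_k_sample(logits, top_k, temperature=1.0, rng=None):
    """Sample one token id from the top_k highest-scoring logits."""
    if rng is None:
        rng = random.Random()
    # Keep only the indices of the top_k largest logits.
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the survivors (shifted by the max for numerical stability).
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]

toy_logits = [0.1, 2.0, -1.0, 1.5]  # hypothetical scores for a 4-token vocabulary
greedy = top_k_sample(toy_logits, top_k=1)  # always the argmax
sampled = top_k_sample(toy_logits, top_k=2, rng=random.Random(0))
```

Running this step on the device avoids copying the full logits tensor back to the host for every generated token.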



## 6. Model Compilation

This step converts the PyTorch model to a Neuron-optimized format. It is a one-time process per configuration and can take 10-30 minutes depending on model size and settings; in the run recorded below, building took about 5 minutes (295 seconds) on trn1.32xlarge.

**Note**: Compilation creates optimized compute graphs specifically for your hardware and configuration.
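
Because compiled artifacts are tied to the configuration, this demo encodes the TP degree and batch size in `COMPILED_MODEL_PATH`. One way to extend that idea to more settings is to derive the directory from a hash of everything that affects compilation. The sketch below uses a hypothetical helper, `compiled_dir_for`, and a hypothetical base directory; which settings actually require recompilation is an assumption you should check for your setup.

```python
import hashlib
import json
from pathlib import Path

def compiled_dir_for(base_dir: Path, model_id: str, settings: dict) -> Path:
    """Map a model and its compile-time settings to a stable cache directory."""
    # sort_keys makes the hash independent of dictionary ordering.
    canonical = json.dumps(settings, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return base_dir / model_id.replace("/", "--") / digest

settings = {
    "tp_degree": 32,
    "batch_size": 16,
    "max_context_length": 1024,
    "seq_len": 2048,
    "dtype": "bfloat16",
}
cache_root = Path("/tmp/compiled")  # hypothetical location
path_a = compiled_dir_for(cache_root, "Qwen/Qwen3-8B", settings)
path_b = compiled_dir_for(cache_root, "Qwen/Qwen3-8B", {**settings, "seq_len": 4096})
print(path_a)
print(path_b)
```

Changing any setting yields a different directory, so a stale artifact is never loaded for a new configuration.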

In [7]:
from neuronx_distributed_inference.models.qwen3.modeling_qwen3 import Qwen3InferenceConfig, NeuronQwen3ForCausalLM
from neuronx_distributed_inference.utils.hf_adapter import HuggingFaceGenerationAdapter, load_pretrained_config

# 🔹  compile or load flag  🔹
COMPILE_MODEL = 1  # 1 = compile, 0 = only load

def compile_or_load(model_path: Path, compiled_path: Path, neuron_cfg: NeuronConfig):
    """Compile if requested, else only load."""

    # Safety: inf2.xlarge does not have enough DRAM for compilation
    if INSTANCE_TYPE == "inf2.xlarge" and COMPILE_MODEL:
        raise RuntimeError("Compilation on inf2.xlarge is not supported. "
                           "Set COMPILE_MODEL = 0 and use a pre-compiled model.")

    if COMPILE_MODEL:
        if (compiled_path / "model.pt").exists():  # NxDI is expected to save the traced model as model.pt
            print("⚠️  Compiled model already exists – skipping compilation.")
        else:
            print("🔨 Compiling model … this can take ~30 min.")
            cfg = Qwen3InferenceConfig(
                neuron_cfg,
                load_config=load_pretrained_config(str(model_path)),
            )
            model = NeuronQwen3ForCausalLM(str(model_path), cfg)
            model.compile(str(compiled_path))
            tokenizer.save_pretrained(str(compiled_path))
            
            print("✅ Compilation finished.")
    else:
        print("🚫 Compilation skipped (COMPILE_MODEL = 0).")

    # --- load compiled artefacts ---
    model = NeuronQwen3ForCausalLM(str(compiled_path))
    model.load(str(compiled_path))
    print("✅ Model loaded from disk.")
    return model

# run it
neuron_model = compile_or_load(ORIGINAL_MODEL_PATH, COMPILED_MODEL_PATH, neuron_config)

Neuron: Saving the neuron_config to /home/ubuntu/traced_model_qwen3/qwen3-8b/32/16/
Neuron: Generating HLOs for the following models: ['context_encoding_model', 'token_generation_model']


🔨 Compiling model … this can take ~30 min.
[2025-09-02 07:46:38.169: I neuronx_distributed/parallel_layers/parallel_state.py:628] > initializing tensor model parallel with size 32
[2025-09-02 07:46:38.170: I neuronx_distributed/parallel_layers/parallel_state.py:629] > initializing pipeline model parallel with size 1
[2025-09-02 07:46:38.170: I neuronx_distributed/parallel_layers/parallel_state.py:630] > initializing context model parallel with size 1
[2025-09-02 07:46:38.171: I neuronx_distributed/parallel_layers/parallel_state.py:631] > initializing data parallel with size 1
[2025-09-02 07:46:38.172: I neuronx_distributed/parallel_layers/parallel_state.py:632] > initializing world size to 32
[2025-09-02 07:46:38.173: I neuronx_distributed/parallel_layers/parallel_state.py:379] [rank_0_pp-1_tp-1_dp-1_cp-1] Chosen Logic for replica groups ret_logic=<PG_Group_Logic.LOGIC1: (<function ascending_ring_PG_group at 0x72ac063539a0>, 'Ascending Ring PG Group')>
[2025-09-02 07:46:38.175: I neuro

Neuron: Generating 1 hlos for key: context_encoding_model
Neuron: Started loading module context_encoding_model
Neuron: Finished loading module context_encoding_model in 0.08832788467407227 seconds
Neuron: generating HLO: context_encoding_model, input example shape = torch.Size([16, 1024])
  with torch.cuda.amp.autocast(enabled=False):
Neuron: Finished generating HLO for context_encoding_model in 7.0562803745269775 seconds, input example shape = torch.Size([16, 1024])
Neuron: Generating 1 hlos for key: token_generation_model
Neuron: Started loading module token_generation_model
Neuron: Finished loading module token_generation_model in 0.0668025016784668 seconds
Neuron: generating HLO: token_generation_model, input example shape = torch.Size([16, 1])
Neuron: Finished generating HLO for token_generation_model in 1.0833957195281982 seconds, input example shape = torch.Size([16, 1])
Neuron: Generated all HLOs in 8.371405601501465 seconds
Neuron: Starting compilation for the priority HLO
Ne

2025-09-02 07:46:47.000464:  12221  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/nxd_model/token_generation_model/_tp0_bk0/model.MODULE_c4f0c212f54294e84e33+617f6939.hlo_module.pb --output /tmp/nxd_model/token_generation_model/_tp0_bk0/model.MODULE_c4f0c212f54294e84e33+617f6939.neff --target=trn1 --auto-cast=none --model-type=transformer --tensorizer-options=--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=1 --vectorize-strided-dma  --lnc=1 -O2 --internal-hlo2tensorizer-options=--verify-hlo=true --logfile=/tmp/nxd_model/token_generation_model/_tp0_bk0/log-neuron-cc.txt --enable-internal-neff-wrapper --verbose=35
...........Completed run_backend_driver.


Neuron: Done compilation for the priority HLO in 216.08494329452515 seconds



Compiler status PASS


Neuron: Updating the hlo module with optimized layout
Neuron: Done optimizing weight layout for all HLOs in 4.648864030838013 seconds
Neuron: Starting compilation for all HLOs
Neuron: Neuron compiler flags: --auto-cast=none --model-type=transformer  --tensorizer-options='--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=2 --vectorize-strided-dma ' --lnc=1 -O1  --internal-hlo2tensorizer-options=' --modular-flow-mac-threshold=10  --verify-hlo=true'  --logfile=/tmp/nxd_model/context_encoding_model/_tp0_bk0/log-neuron-cc.txt


2025-09-02 07:50:27.000396:  12221  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/nxd_model/context_encoding_model/_tp0_bk0/model.MODULE_6020fd81e9865b09a888+ad9e832d.hlo_module.pb --output /tmp/nxd_model/context_encoding_model/_tp0_bk0/model.MODULE_6020fd81e9865b09a888+ad9e832d.neff --target=trn1 --auto-cast=none --model-type=transformer --tensorizer-options=--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=2 --vectorize-strided-dma  --lnc=1 -O1 --internal-hlo2tensorizer-options= --modular-flow-mac-threshold=10  --verify-hlo=true --logfile=/tmp/nxd_model/context_encoding_model/_tp0_bk0/log-neuron-cc.txt --verbose=35




..Completed run_backend_driver.


Neuron: Finished Compilation for all HLOs in 38.93697476387024 seconds



Compiler status PASS
..

Neuron: Done preparing weight layout transformation


Completed run_backend_driver.

Compiler status PASS


Neuron: Finished building model in 295.1843423843384 seconds
Neuron: SKIPPING pre-sharding the checkpoints. The checkpoints will be sharded during load time.
root: NeuronConfig init: Unexpected keyword arguments: {'apply_seq_ids_mask': False, 'enable_long_context_mode': False, 'enable_output_completion_notifications': False, 'enable_token_tree': False, 'is_chunked_prefill': False, 'is_prefill_stage': None, 'kv_cache_tiling': False, 'scratchpad_page_size': None, 'skip_warmup': False, 'tile_cc': False, 'weights_to_skip_layout_optimization': []}
Neuron: Sharding weights on load...
Neuron: Sharding Weights for ranks: 0...31


✅ Compilation finished.
[2025-09-02 07:51:33.550: I neuronx_distributed/parallel_layers/parallel_state.py:628] > initializing tensor model parallel with size 32
[2025-09-02 07:51:33.551: I neuronx_distributed/parallel_layers/parallel_state.py:629] > initializing pipeline model parallel with size 1
[2025-09-02 07:51:33.551: I neuronx_distributed/parallel_layers/parallel_state.py:630] > initializing context model parallel with size 1
[2025-09-02 07:51:33.552: I neuronx_distributed/parallel_layers/parallel_state.py:631] > initializing data parallel with size 1
[2025-09-02 07:51:33.552: I neuronx_distributed/parallel_layers/parallel_state.py:632] > initializing world size to 32
[2025-09-02 07:51:33.553: I neuronx_distributed/parallel_layers/parallel_state.py:379] [rank_0_pp-1_tp-1_dp-1_cp-1] Chosen Logic for replica groups ret_logic=<PG_Group_Logic.LOGIC1: (<function ascending_ring_PG_group at 0x72ac063539a0>, 'Ascending Ring PG Group')>
[2025-09-02 07:51:33.555: I neuronx_distributed/para

Neuron: Done Sharding weights in 2.1348061939997933
Neuron: Finished weights loading in 44.940252185999725 seconds
Neuron: Warming up the model.


2025-Sep-02 07:52:19.0001 12221:14033 [8] int nccl_net_ofi_create_plugin(nccl_net_ofi_plugin_t**):213 CCOM WARN NET/OFI Failed to initialize sendrecv protocol
2025-Sep-02 07:52:19.0005 12221:14033 [8] int nccl_net_ofi_create_plugin(nccl_net_ofi_plugin_t**):354 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Sep-02 07:52:19.0010 12221:14033 [8] ncclResult_t nccl_net_ofi_init_no_atexit_fini_v6(ncclDebugLogger_t):183 CCOM WARN NET/OFI Initializing plugin failed
2025-Sep-02 07:52:19.0015 12221:14033 [8] net_plugin.cc:97 CCOM WARN OFI plugin initNet() failed is EFA enabled?


Neuron: Warmup completed in 2.6343612670898438 seconds.


✅ Model loaded from disk.


## 7. Inference Demonstration

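Now let's run inference with our Neuron-optimized model. The example below uses regular (non-thinking) mode; `run_inference` also accepts `enable_thinking=True`, in which case `parse_thinking_output` separates the reasoning from the final answer.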

In [8]:
from neuronx_distributed_inference.utils.hf_adapter import HuggingFaceGenerationAdapter

def setup_inference_components(model, model_path: Path):
    """Setup tokenizer and generation adapter for inference."""
    tokenizer = AutoTokenizer.from_pretrained(str(model_path))
    tokenizer.pad_token = tokenizer.eos_token
    
    generation_config = GenerationConfig.from_pretrained(str(ORIGINAL_MODEL_PATH))
    generation_config_kwargs = {
        "do_sample": False,
        "temperature": 0.9,
        "top_k": 5,
        "pad_token_id": tokenizer.pad_token_id,
    }
    generation_config.update(**generation_config_kwargs)
    
    generation_model = HuggingFaceGenerationAdapter(model)
    
    return tokenizer, generation_model

def parse_thinking_output(output_ids: list, tokenizer) -> tuple:
    """Parse thinking content from model output."""
    try:
        # Find the end of thinking token (151668 = </think>)
        think_end_token = 151668
        index = len(output_ids) - output_ids[::-1].index(think_end_token)
    except ValueError:
        index = 0
    
    thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
    response_content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
    
    return thinking_content, response_content

def run_inference(model, messages: list, enable_thinking: bool = False, max_new_tokens: int = 512):
    """Run inference with the Neuron model."""
    tokenizer, generation_model = setup_inference_components(model, COMPILED_MODEL_PATH)
    
    # Prepare input
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=enable_thinking
    )
    
    inputs = tokenizer([text], return_tensors="pt")
    input_ids = inputs['input_ids']
    
    print(f"🔄 Running inference (thinking={'enabled' if enable_thinking else 'disabled'})...")
    start_time = time.time()
    
    # Generate response
    outputs = generation_model.generate(
        input_ids=input_ids,
        max_new_tokens=max_new_tokens
    )
    
    inference_time = time.time() - start_time
    
    # Extract generated tokens
    output_ids = outputs[0][len(inputs.input_ids[0]):].tolist()
    
    if enable_thinking:
        thinking_content, response_content = parse_thinking_output(output_ids, tokenizer)
        return thinking_content, response_content, inference_time
    else:
        response_content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
        return None, response_content, inference_time

print("✅ Inference functions ready")

✅ Inference functions ready
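
The index arithmetic in `parse_thinking_output` is easier to follow on a toy list. The sketch below isolates it in a hypothetical helper, `split_at_last`, with small made-up IDs instead of real token IDs: the list is searched from the end, everything up to and including the last marker goes to the first part, and if the marker is absent the first part is empty.

```python
def split_at_last(ids, marker):
    """Split ids after the last occurrence of marker; empty first part if absent."""
    try:
        # Index in the reversed list counts from the end, so this lands just past the marker.
        cut = len(ids) - ids[::-1].index(marker)
    except ValueError:
        cut = 0
    return ids[:cut], ids[cut:]

MARKER = 9  # stands in for the </think> token ID
print(split_at_last([1, 2, MARKER, 3, 4], MARKER))
print(split_at_last([1, 2, 3], MARKER))
```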


### 7.1 Simple Question (No Thinking Mode)

In [9]:
# Simple question without thinking mode
messages = [{'role': 'user', 'content': "What's your name?"}]

thinking, response, inference_time = run_inference(
    neuron_model, 
    messages, 
    enable_thinking=False, 
    max_new_tokens=512
)

print(f"\n📊 Performance: {inference_time:.2f} seconds")
print(f"\n🤖 Response: {response}")

HuggingFaceGenerationAdapter has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as e

🔄 Running inference (thinking=disabled)...

📊 Performance: 1.35 seconds

🤖 Response: My name is Qwen, and I'm a large language model developed by Alibaba Cloud. How can I assist you today?


## 8. Performance Analysis

Let's run a few more examples to analyze performance characteristics.

In [10]:
def monitor_system_resources():
    """Monitor system resources during inference."""
    return {
        'cpu_percent': psutil.cpu_percent(),
        'memory_percent': psutil.virtual_memory().percent,
        'available_memory_gb': psutil.virtual_memory().available / (1024**3)
    }

def benchmark_inference(model, num_runs: int = 3):
    """Benchmark inference performance."""
    print(f"🔬 Running performance benchmark ({num_runs} runs)...")
    
    test_cases = [
        {"messages": [{'role': 'user', 'content': "Explain quantum computing in simple terms."}], "max_tokens": 256},
        {"messages": [{'role': 'user', 'content': "Write a short poem about machine learning."}], "max_tokens": 128},
        {"messages": [{'role': 'user', 'content': "What are the benefits of using AWS Neuron?"}], "max_tokens": 200}
    ]
    
    results = []
    
    for i, test_case in enumerate(test_cases[:num_runs]):
        model.reset()
        
        # Monitor resources before inference
        pre_resources = monitor_system_resources()
        
        start_time = time.time()
        _, response, inference_time = run_inference(
            model, 
            test_case["messages"], 
            enable_thinking=False, 
            max_new_tokens=test_case["max_tokens"]
        )
        
        # Calculate tokens generated (approximate)
        tokens_generated = len(response.split()) * 1.3  # Rough token count
        tokens_per_second = tokens_generated / inference_time
        
        post_resources = monitor_system_resources()
        
        result = {
            'test_case': i + 1,
            'inference_time': inference_time,
            'tokens_generated': int(tokens_generated),
            'tokens_per_second': tokens_per_second,
            'cpu_usage': post_resources['cpu_percent'],
            'memory_usage': post_resources['memory_percent']
        }
        results.append(result)
        
        print(f"   Test {i+1}: {inference_time:.2f}s, {tokens_per_second:.1f} tokens/s")
    
    # Summary statistics
    avg_time = sum(r['inference_time'] for r in results) / len(results)
    avg_tokens_per_sec = sum(r['tokens_per_second'] for r in results) / len(results)
    
    print(f"\n📊 Performance Summary:")
    print(f"   Average inference time: {avg_time:.2f} seconds")
    print(f"   Average throughput: {avg_tokens_per_sec:.1f} tokens/second")
    print(f"   Instance type: {INSTANCE_TYPE}")
    print(f"   Tensor parallelism: {TP_DEGREE} cores")
    
    return results

# Run benchmark (the function returns the list of per-test results)
benchmark_results = benchmark_inference(neuron_model)

🔬 Running performance benchmark (3 runs)...


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


🔄 Running inference (thinking=disabled)...
   Test 1: 2.94s, 83.2 tokens/s


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


🔄 Running inference (thinking=disabled)...
   Test 2: 1.75s, 70.0 tokens/s


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


🔄 Running inference (thinking=disabled)...
   Test 3: 2.40s, 77.6 tokens/s

📊 Performance Summary:
   Average inference time: 2.36 seconds
   Average throughput: 76.9 tokens/second
   Instance type: trn1.32xlarge
   Tensor parallelism: 32 cores


## 9. Production Deployment Guidelines

### Compilation Strategy
- **Development**: Compile on larger instances (inf2.8xlarge+), then copy artifacts
- **Production**: Load pre-compiled models to minimize startup time
- **CI/CD**: Include compilation step in your model deployment pipeline

### Performance Optimization Tips
1. **Right-size your instance**: For the 8B model, start with inf2.8xlarge for most workloads
2. **Optimize sequence lengths**: Use bucketing for variable-length inputs
3. **Batch similar requests**: Group requests with similar token counts
4. **Monitor utilization**: Use Neuron tools such as `neuron-top` and `neuron-monitor` to track core and memory utilization
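
Tip 3 can be sketched as a small scheduling helper. It assumes requests arrive with a known prompt length in tokens, groups them by the sequence-length bucket they fall into, and then cuts each group into batches of at most `batch_size`. The function name, bucket sizes, and request IDs are all hypothetical; a production scheduler would also need to bound how long a request waits for its batch to fill.

```python
from bisect import bisect_left
from collections import defaultdict

def group_into_batches(prompt_lengths, buckets, batch_size):
    """Group request ids by length bucket, then chunk each group into batches.

    prompt_lengths maps request id -> prompt length in tokens.
    Returns a list of (bucket, [request ids]) tuples.
    """
    ordered = sorted(buckets)
    by_bucket = defaultdict(list)
    for request_id, length in prompt_lengths.items():
        i = bisect_left(ordered, length)
        if i == len(ordered):
            raise ValueError(f"request {request_id!r} exceeds largest bucket")
        by_bucket[ordered[i]].append(request_id)
    batches = []
    for bucket in ordered:
        ids = by_bucket.get(bucket, [])
        for start in range(0, len(ids), batch_size):
            batches.append((bucket, ids[start:start + batch_size]))
    return batches

lengths = {"a": 20, "b": 900, "c": 100, "d": 130, "e": 30}
batches = group_into_batches(lengths, buckets=[128, 1024], batch_size=2)
for bucket, ids in batches:
    print(bucket, ids)
```

Grouping this way keeps short prompts from being padded to the length of the longest prompt in an unrelated request.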

### Cost Optimization
- **Reserved Instances**: For predictable workloads, use Reserved Instances (up to 70% savings)
- **Spot Instances**: For fault-tolerant batch processing
- **Auto Scaling**: Scale Neuron instances based on request volume

### Next Steps
- Integrate with your existing inference pipeline (FastAPI, vLLM, etc.)
- Set up monitoring with CloudWatch and Neuron Monitor
- Consider multi-model serving for better resource utilization

In [11]:
# Final cleanup
print("🧹 Cleaning up resources...")
if 'neuron_model' in locals():
    neuron_model.reset()
print("✅ Demo completed successfully!")
print(f"\n📁 Compiled model available at: {COMPILED_MODEL_PATH}")
print("💡 You can reuse the compiled model for future inference without recompilation.")

🧹 Cleaning up resources...
✅ Demo completed successfully!

📁 Compiled model available at: /home/ubuntu/traced_model_qwen3/qwen3-8b/32/16
💡 You can reuse the compiled model for future inference without recompilation.
