
# Install LLM Environment on Ubuntu 24.04 with Intel Extension for PyTorch (IPEX) on i5-11400 and on macOS with M1 GPU Acceleration
    


## Step 1: Download and Install Miniconda for Linux
```sh
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```
Follow the on-screen instructions.
When prompted, allow the installer to update `.bashrc` (and other shell startup files) so that Conda is activated on shell startup.
Then reload your shell once, for example with `exec fish`.


### macOS Installation
```sh
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh
bash Miniconda3-latest-MacOSX-arm64.sh
```
Follow the on-screen instructions.
When prompted, allow the installer to update `.zshrc` (and other shell startup files) so that Conda is activated on shell startup.
Then reload your shell once with `exec zsh`.


## Step 2: Create a Conda Environment for LLM
```sh
conda create -n llm python=3.10 -y
conda activate llm
```
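To confirm that the `llm` environment is the one actually running, a short check can be run with its Python interpreter. This is a minimal sketch: `environment_summary` is a hypothetical helper name, and it relies on `conda activate` setting the `CONDA_DEFAULT_ENV` environment variable.

```python
import os
import sys


def environment_summary():
    """Collect the interpreter version and the active Conda environment name."""
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        # Set by `conda activate`; None when no Conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "executable": sys.executable,
    }


info = environment_summary()
print(info)
if info["conda_env"] != "llm":
    print("Warning: the 'llm' environment does not appear to be active.")
```

If the warning appears, run `conda activate llm` again in the same shell before continuing.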
On Colab, check which CUDA version is available on your instance (the output below comes from a CPU-only instance, where `nvidia-smi` is not installed):

In [None]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found



## Step 3: Install PyTorch with Conda
```sh
conda install pytorch torchvision torchaudio -c pytorch-nightly
```
On Colab, install with pip instead:
```python
!pip install torch torchvision torchaudio
```


## Step 4: Install Intel Extension for PyTorch (IPEX, Intel Only)
```python
!pip install intel_extension_for_pytorch
```
    


### Other Required Dependencies
```python
!pip install sympy==1.13.1
!pip install 'dnspython>=2.0.0'
!pip install 'python-dateutil>=2.5.3'
!pip install 'oauthlib>=3.0.0'
!pip install 'pydantic<2.0.0,>=1.6.1'
!pip install 'markdown<3.4,>=3.2.1'
```
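After installing, it can help to see which versions were actually resolved. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and it only reports versions rather than checking them against the constraints above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the commands above.
PACKAGES = ["sympy", "dnspython", "python-dateutil", "oauthlib", "pydantic", "markdown"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:16s} {ver or 'not installed'}")
```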
    


## Step 5: Install Rust (macOS Only, Required for Transformers and Datasets)
```sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
    


## Step 6: Install Transformers
```python
!conda install -y transformers
```
    


## Step 7: Verify the Installation
The cell below also reports which accelerator is available: CUDA, MPS, XPU (IPEX), or just the CPU.

    

In [None]:
import torch
import transformers

print(f"PyTorch version: {torch.__version__}")
print(f"PyTorch is using: {torch.get_num_threads()} threads")
print(f"Transformers version: {transformers.__version__}")

try:
    import intel_extension_for_pytorch as ipex
    print(f"Intel Extension for PyTorch version: {ipex.__version__}")
    print(f"IPEX optimize() available: {hasattr(ipex, 'optimize')}")
except ImportError:
    print("Intel Extension for PyTorch is not available.")
    ipex = None  # Assign None to ipex if import fails

# Check for CUDA
if torch.cuda.is_available():
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"Current device: {torch.cuda.get_device_name(0)}")  # This line will only execute if CUDA is available
else:
    print("CUDA is not available on this device.")

# Check if MPS (Metal Performance Shaders) is available
try:
    print("MPS Available:", torch.backends.mps.is_available())
    # Check if PyTorch is using MPS
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    print(f"Using device: {device}")
except AttributeError as e:
    print(f"Error checking for MPS: {e}")

# Check if PyTorch is using XPU
try:
    device = torch.device("xpu" if torch.xpu.is_available() else "cpu")
    print(f"Selected device: {device}")
except AttributeError as e:
    print(f"Error checking for XPU: {e}")

try:
    print(f"Using MKL: {torch.backends.mkl.is_available()}")
except AttributeError as e:
    print(f"Error checking for MKL: {e}")

try:
    print(f"Using MKLDNN: {torch.backends.mkldnn.is_available()}")
except AttributeError as e:
    print(f"Error checking for MKLDNN: {e}")



PyTorch version: 2.5.1+cu124
PyTorch is using: 1 threads
Transformers version: 4.47.1
Intel Extension for PyTorch is not available.
CUDA is not available on this device.
MPS Available: False
Using device: cpu
Selected device: cpu
Using MKL: True
Using MKLDNN: True
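
The verification cell reports each backend separately and reassigns `device` in more than one place. When a script needs exactly one device, the checks can be folded into a single choice. This is a minimal sketch: `pick_device` and `detect_device` are hypothetical helper names, and the priority order (CUDA, then XPU, then MPS, then CPU) is an assumption rather than something the setup above prescribes.

```python
def pick_device(cuda: bool, xpu: bool, mps: bool) -> str:
    """Return one device name given which backends are available (assumed priority)."""
    if cuda:
        return "cuda"
    if xpu:
        return "xpu"
    if mps:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for available backends; fall back to "cpu" if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    cuda = torch.cuda.is_available()
    # torch.xpu and torch.backends.mps may be absent in some builds.
    xpu = hasattr(torch, "xpu") and torch.xpu.is_available()
    mps_backend = getattr(torch.backends, "mps", None)
    mps = mps_backend is not None and mps_backend.is_available()
    return pick_device(cuda, xpu, mps)


print(f"Selected device: {detect_device()}")
```

The returned string can be passed to `torch.device(...)` or to `model.to(...)`.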



## Final Notes
- **Shell:** With Fish or Bash, Conda is activated automatically in new sessions once the installer has updated your shell startup files.
- **Performance:** This setup targets running large language models on an Intel i5-11400 CPU with IPEX. If you have an Intel Arc A770 GPU, enable XPU acceleration via `torch.xpu` to speed up model inference.
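
As a rough illustration of the performance note, IPEX is typically applied to a model in evaluation mode through `ipex.optimize`. The sketch below is an assumption about how that could be wired into this setup, not a tested configuration for the i5-11400: `optimize_for_inference` is a hypothetical helper, and it simply returns the model unchanged when IPEX is not installed.

```python
import importlib.util


def optimize_for_inference(model):
    """Put `model` in eval mode and apply ipex.optimize when IPEX is installed."""
    model.eval()
    try:
        import intel_extension_for_pytorch as ipex
    except ImportError:
        # No IPEX available: keep the plain PyTorch model.
        return model
    return ipex.optimize(model)


# Small demonstration, run only when PyTorch itself is installed.
if importlib.util.find_spec("torch") is not None:
    import torch

    net = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.ReLU())
    net = optimize_for_inference(net)
    with torch.no_grad():
        out = net(torch.randn(2, 8))
    print(f"Output shape: {tuple(out.shape)}")
```

On a machine with an XPU, the model and its inputs would additionally need to be moved with `.to("xpu")` before the forward pass.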
    

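The following example applies dynamic INT8 quantization to a DistilBERT classifier and compares model size and inference time before and after quantization.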

In [None]:
import torch
import time
import os
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoModelForCausalLM

print("Quantization Test Program")
print(f"PyTorch version: {torch.__version__}")

# Force the Use of Intel OneDNN Backend
torch.backends.quantized.engine = 'onednn'
print(f"Quantized engine: {torch.backends.quantized.engine}")
print(f"CPU backend; MKL enabled: {torch.backends.mkl.is_available()}, MKL-DNN enabled: {torch.backends.mkldnn.is_available()}")

# test loading of models
model_name = "facebook/opt-125m"  # A small model for testing
model = AutoModelForCausalLM.from_pretrained(model_name)
print("Model loaded successfully")
# another way to load models
#model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased')


print( "distilbert-base-uncased is a small pretrained base model that does not include a classifier.")
model_name = "distilbert-base-uncased"
# or use a fine-tuned model instead
# model_name = "distilbert-base-uncased-finetuned-sst-2-english"

print( "using AutoModelForSequenceClassification, the model adds a randomly initialized classification head")
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# for uncased non fine tuned model, reset the classifier
model.classifier.reset_parameters()

tokenizer = AutoTokenizer.from_pretrained(model_name)

# larger input batch size
inputs = tokenizer(["I love AI!", "Embedded systems are great!", "Quantization speeds up inference."], return_tensors="pt", padding=True, truncation=True)


# Measure original model size
original_model_path = "original_model.pt"
torch.save(model.state_dict(), original_model_path)
original_model_size = os.path.getsize(original_model_path) / (1024 * 1024)  # Convert to MB
print(f"Original model size: {original_model_size:.2f} MB")

# Evaluate original model inference time
model.eval()
start_time = time.time()
with torch.no_grad():
    original_output = model(**inputs)
end_time = time.time()
original_inference_time = end_time - start_time
print(f"Original model inference time: {original_inference_time:.6f} seconds")
print(f"Original model output: {original_output}")

# Perform dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Measure quantized model size
quantized_model_path = "quantized_model.pt"
torch.save(quantized_model.state_dict(), quantized_model_path)
quantized_model_size = os.path.getsize(quantized_model_path) / (1024 * 1024)  # Convert to MB
print(f"Quantized model size: {quantized_model_size:.2f} MB")

# Evaluate quantized model inference time
start_time = time.time()
with torch.no_grad():
    quantized_output = quantized_model(**inputs)
end_time = time.time()
quantized_inference_time = end_time - start_time
print(f"Quantized model inference time: {quantized_inference_time:.6f} seconds")
print(f"Quantized model output: {quantized_output}")

# Compare speedup
speedup = original_inference_time / quantized_inference_time if quantized_inference_time > 0 else float('inf')
print(f"Speedup from quantization: {speedup:.2f}x")

# Clean up temporary model files
os.remove(original_model_path)
os.remove(quantized_model_path)


Quantization Test Program
PyTorch version: 2.5.1+cu124
Quantized engine: onednn
CPU backend; MKL enabled: True, MKL-DNN enabled: True





Model loaded successfully
distilbert-base-uncased is a small pretrained base model that does not include a classifier.
using AutoModelForSequenceClassification, the model adds a randomly initialized classification head



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Original model size: 255.45 MB
Original model inference time: 0.486773 seconds
Original model output: SequenceClassifierOutput(loss=None, logits=tensor([[ 0.0252,  0.0650],
        [ 0.0432,  0.0450],
        [-0.0822,  0.0279]]), hidden_states=None, attentions=None)
Quantized model size: 132.29 MB
Quantized model inference time: 0.147930 seconds
Quantized model output: SequenceClassifierOutput(loss=None, logits=tensor([[ 0.0198,  0.0573],
        [ 0.0205,  0.0574],
        [-0.1163,  0.0182]]), hidden_states=None, attentions=None)
Speedup from quantization: 3.29x
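
The timings above come from a single forward pass each, so the first measurement also includes one-time warm-up costs and the reported speedup is only indicative. A steadier comparison runs a few untimed warm-up calls and then takes the median of several timed calls. The following is a minimal sketch; `benchmark` is a hypothetical helper, and the `warmup` and `repeats` defaults are arbitrary.

```python
import statistics
import time


def benchmark(fn, warmup=3, repeats=10):
    """Return the median wall-clock time in seconds of calling `fn()`."""
    for _ in range(warmup):
        fn()  # untimed calls to absorb one-time setup costs
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload; replace with a model forward pass.
median_s = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"Median time: {median_s:.6f} seconds")
```

With the objects from the cell above, the same helper could be called inside `torch.no_grad()` as `benchmark(lambda: model(**inputs))` and `benchmark(lambda: quantized_model(**inputs))`.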


# AMD Ryzen™ AI NPU
AMD Ryzen AI processors include an NPU (also known as XDNA) for accelerating AI workloads, similar in role to the Intel (IPEX) and Apple (MPS) accelerators handled earlier in this notebook. However, software support is still emerging.

## Notebook Flow for AMD Ryzen AI
1. Train or load a model in PyTorch

2. Export to ONNX:
```python
torch.onnx.export(model, sample_input, "model.onnx", input_names=["input"], output_names=["output"])
```
3. Install AMD Ryzen AI tools (via their SDK)

4. Run inference via ONNX Runtime:
```python
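import onnxruntime as ort
# input_array: a NumPy array matching the exported model's input shape and dtype
sess = ort.InferenceSession("model.onnx", providers=["VitisAIExecutionProvider"])
output = sess.run(None, {"input": input_array})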
```
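
Because the NPU provider is only present when the Ryzen AI tools are installed, it can be safer to check what ONNX Runtime actually offers and fall back to the CPU otherwise. This is a minimal sketch: `choose_providers` is a hypothetical helper, and the fallback to `CPUExecutionProvider` is an assumption about the desired behaviour.

```python
def choose_providers(available, preferred="VitisAIExecutionProvider"):
    """Return a provider list with `preferred` first if available, then CPU."""
    providers = []
    if preferred in available:
        providers.append(preferred)
    if "CPUExecutionProvider" not in providers:
        providers.append("CPUExecutionProvider")
    return providers


try:
    import onnxruntime as ort

    available = ort.get_available_providers()
except ImportError:
    # ONNX Runtime is not installed; nothing to probe.
    available = []

print(f"Providers to request: {choose_providers(available)}")
```

The resulting list can be passed as the `providers` argument of `ort.InferenceSession`.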

## ⚠️ Important Limitations
- No Linux support yet (the Ryzen AI SDK is Windows-only)

- Only ONNX models can currently run on the NPU

- You might need to quantize the model (INT8) for NPU execution

- Your code cannot use `torch.device("npu")` directly, the way it can for MPS or IPEX

