# Running Qwen3-8B on macOS (Apple Silicon)

The previous notebooks targeted NVIDIA GPUs via CUDA. Macs with Apple Silicon are also
well suited to running reasoning LLMs locally.

Apple developed its own array framework for machine learning, [mlx](https://github.com/ml-explore/mlx).
It can be used from Python, and the `mlx_lm` package used below builds on it to load and run LLMs.

Apple Silicon suits LLM inference because the CPU and GPU share one pool of unified memory whose
bandwidth is typically higher than that of main memory in a conventional PC, and token generation
is largely limited by memory bandwidth. `mlx` is optimized for this hardware and runs models
quickly, especially quantized ones.
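
To make the benefit of quantization concrete, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It assumes about 8 billion parameters and ignores the KV cache, activations, and the small overhead of quantization scales, so real numbers will be somewhat higher. The helper name `weight_memory_gb` is made up for this sketch.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: roughly 8 billion parameters

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Because each generated token requires reading most of the weights once, a smaller weight footprint usually also means faster generation, not just a model that fits into memory.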

The `mlx_lm` interface differs from `transformers`, although some pieces will look familiar.

In [None]:
from mlx_lm import load, generate

Conveniently, `load` returns `model` and `tokenizer` as a pair:

In [None]:
model, tokenizer = load("mlx-community/DeepSeek-R1-0528-Qwen3-8B-4bit-DWQ")

In [None]:
prompt = "How many 'r's are in 'strawberry'?"

`apply_chat_template` also exists in the `tokenizer`, similar to Hugging Face `transformers`.

In [None]:
# Wrap the raw prompt in the model's chat format, if the tokenizer defines one
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )
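
To illustrate what a chat template does, the following sketch renders messages in a ChatML-style layout by hand. The function `render_chat` and the `<|im_start|>` / `<|im_end|>` markers are assumptions chosen for illustration; the actual template is stored with the tokenizer and may use different special tokens for this model.

```python
def render_chat(messages, add_generation_prompt=True):
    """Render chat messages in a ChatML-style layout (illustrative only)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its answer
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo = render_chat([{"role": "user", "content": "Hello"}])
print(demo)
```

The point of `add_generation_prompt=True` is the final line: it leaves an open assistant turn, so the model answers instead of continuing the user's message.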


Generation also works in much the same way. With `verbose=True`, the output is printed as it is generated.

In [None]:
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=2500)

The `response` variable contains the complete generated text:

In [None]:
print(response)

In [None]:
from IPython.display import display, Markdown

# Render the response as formatted Markdown instead of plain text
display(Markdown(response))
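
Reasoning models often emit their chain of thought before the final answer. Assuming the reasoning is terminated by a `</think>` tag (an assumption about the output format, worth checking against the actual `response`), a small helper can separate the two parts. `split_reasoning` is a hypothetical name, not part of `mlx_lm`.

```python
def split_reasoning(text, close_tag="</think>"):
    """Split text into (reasoning, answer) at the first closing think tag.

    If the tag is missing, for example because generation stopped at
    max_tokens, the reasoning is returned empty and the whole text is
    treated as the answer.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


sample = "<think>s-t-r-a-w-b-e-r-r-y: three r's.</think>\n\nThere are 3."
reasoning, answer = split_reasoning(sample)
print(answer)
```

With the notebook's own output, `split_reasoning(response)` could be used to display only the final answer via `display(Markdown(answer))`.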