# Local LLM Inference with MLX-LM on Apple Silicon

This notebook demonstrates how to run Concordia simulations using MLX-LM for
local inference on Apple Silicon (M1/M2/M3/M4 Macs).

**Requirements:**
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.12+
- MLX and MLX-LM installed

**Benefits of local inference:**
- No API keys required
- No per-token costs
- Data stays on your machine
- Fast inference using the Apple Silicon GPU and unified memory

In [None]:
# @title Install dependencies (if needed)
# Uncomment and run if you haven't installed the dependencies yet
# %pip install 'gdm-concordia[mlx_lm]' sentence-transformers

In [None]:
# @title Imports

from concordia.contrib.language_models.mlx_lm import mlx_lm_model
import concordia.prefabs.entity as entity_prefabs
import concordia.prefabs.game_master as game_master_prefabs
from concordia.prefabs.simulation import generic as simulation
from concordia.typing import prefab as prefab_lib
from concordia.utils import helper_functions
from IPython import display
import numpy as np
import sentence_transformers

## Model Selection

MLX-LM supports many quantized models from the `mlx-community` organization on Hugging Face.
Popular choices include:

| Model | Parameters | Approx. memory | Notes |
|-------|------|--------|-------|
| `mlx-community/Llama-3.2-3B-Instruct-4bit` | 3B | ~2GB | Fast, good for testing |
| `mlx-community/Llama-3.1-8B-Instruct-4bit` | 8B | ~5GB | Good balance |
| `mlx-community/Mistral-7B-Instruct-v0.3-4bit` | 7B | ~4GB | Strong performance |
| `mlx-community/Qwen2.5-7B-Instruct-4bit` | 7B | ~4GB | Excellent reasoning |
| `mlx-community/gemma-2-9b-it-4bit` | 9B | ~6GB | Google's Gemma 2 |

The model will be downloaded automatically on first use.
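
The memory figures in the table can be roughly sanity-checked from the parameter count. The sketch below assumes about 4.5 bits per weight for a 4-bit quantized model (4 bits plus some per-group scale and bias overhead). That figure and the helper name `estimate_weight_memory_gb` are illustrative assumptions, and real usage is higher once the KV cache and activations are included.

```python
def estimate_weight_memory_gb(params_billions, bits_per_weight=4.5):
    """Rough size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    # bits_per_weight=4.5 is an assumed average, not a measured value.
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


for name, size in [('3B', 3.0), ('8B', 8.0), ('9B', 9.0)]:
    print(f'{name}: ~{estimate_weight_memory_gb(size):.1f} GB of weights')
```

The estimates come out somewhat below the table's values, which is consistent with the table leaving headroom for runtime buffers.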

In [None]:
# @title Model Configuration

# Choose a model from mlx-community on Hugging Face
# Smaller models are faster but less capable
MODEL_NAME = 'mlx-community/Llama-3.2-3B-Instruct-4bit'

# Optional: Path to a LoRA adapter for fine-tuned behavior
ADAPTER_PATH = None  # e.g., '/path/to/my/adapter'

# System message for the model
SYSTEM_MESSAGE = (
    'You always continue sentences provided by the user and you never repeat '
    'what the user already said.'
)

In [None]:
# @title Initialize the MLX-LM model

print(f'Loading model: {MODEL_NAME}')
print('This may take a moment on first run as the model is downloaded...')

model = mlx_lm_model.MLXLMLanguageModel(
    model_name=MODEL_NAME,
    adapter_path=ADAPTER_PATH,
    system_message=SYSTEM_MESSAGE,
)

print('Model loaded successfully!')

In [None]:
# @title Setup sentence encoder for memory retrieval

st_model = sentence_transformers.SentenceTransformer(
    'sentence-transformers/all-mpnet-base-v2'
)
embedder = lambda x: st_model.encode(x, show_progress_bar=False)
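
To illustrate what the embedder is for, here is a minimal sketch of similarity-based retrieval: each memory is embedded, and a query is matched against the memories by cosine similarity. The `toy_embedder` (a hashed bag of words) and the `retrieve` helper are hypothetical stand-ins so the example runs without downloading a model. Concordia's actual memory implementation may differ.

```python
import numpy as np


def toy_embedder(text, dim=64):
    """Hypothetical stand-in for `embedder`: a hashed bag-of-words vector."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        # Deterministic bucket per word (sum of character codes).
        vec[sum(ord(ch) for ch in word) % dim] += 1.0
    return vec


def retrieve(query, memories, embed, k=1):
    """Return the k memories most similar to the query by cosine similarity."""
    query_vec = embed(query)
    scores = []
    for memory in memories:
        memory_vec = embed(memory)
        denom = np.linalg.norm(query_vec) * np.linalg.norm(memory_vec)
        scores.append(float(query_vec @ memory_vec / denom) if denom else 0.0)
    ranked = np.argsort(scores)[::-1][:k]  # Highest score first.
    return [memories[i] for i in ranked]


memories = [
    'Socrates asked a question about virtue',
    'Aristotle observed the natural world',
]
top_memory = retrieve('a question about virtue', memories, toy_embedder)
print(top_memory)
```

In the notebook itself, the real `embedder` defined above would take the place of `toy_embedder`.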

In [None]:
# @title Test the model with a simple prompt

test_prompt = 'What makes a good leader?'
print(f'Prompt: {test_prompt}\n')

response = model.sample_text(
    test_prompt,
    max_tokens=150,
    temperature=0.7,
)
print(f'Response:\n{response}')
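
The `temperature` argument controls how much randomness goes into token selection. The following is a minimal sketch of temperature sampling over a toy set of logits, not MLX-LM's actual sampler. The `sample_token` helper and the example logits are hypothetical.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits at the given temperature."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # Subtract the max for numerical stability.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.1]  # Made-up scores for three candidate tokens.
greedy = [sample_token(logits, 0, rng) for _ in range(5)]
sampled = [sample_token(logits, 0.7, rng) for _ in range(5)]
print('temperature=0:  ', greedy)
print('temperature=0.7:', sampled)
```

At temperature 0 the same token is chosen every time, while higher temperatures spread choices across lower-scoring tokens.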

In [None]:
# @title Test the choice selection

prompt = 'When faced with a difficult decision, the wisest course of action is usually to'
choices = [
    'act quickly and decisively',
    'gather more information first',
    'seek advice from others',
    'trust your intuition',
]

idx, choice, debug = model.sample_choice(prompt, choices)

print(f'Prompt: {prompt}\n')
print(f'Selected: "{choice}" (index {idx})\n')
print('Log probabilities:')
for c, logprob in debug['logprobs'].items():
    print(f'  {c}: {logprob:.4f}')
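
Log probabilities are easier to compare once normalized into a distribution over the choices. A minimal sketch, using made-up values in place of `debug['logprobs']` (the helper name `normalize_logprobs` is hypothetical):

```python
import math


def normalize_logprobs(logprobs):
    """Convert {choice: logprob} into probabilities that sum to 1."""
    peak = max(logprobs.values())  # Shift by the max to avoid underflow.
    weights = {c: math.exp(lp - peak) for c, lp in logprobs.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


example_logprobs = {  # Illustrative numbers, not real model output.
    'act quickly and decisively': -12.5,
    'gather more information first': -9.8,
    'seek advice from others': -10.4,
    'trust your intuition': -11.9,
}
probs = normalize_logprobs(example_logprobs)
for c, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f'  {c}: {p:.1%}')
```

A near-uniform distribution suggests the model has no strong preference among the choices, which can be worth knowing when interpreting a simulation.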

In [None]:
# @title Load prefabs

prefabs = {
    **helper_functions.get_package_classes(entity_prefabs),
    **helper_functions.get_package_classes(game_master_prefabs),
}

## Simple Simulation: Two Philosophers Debating

Let's run a simple simulation in which two philosophers discuss the nature of knowledge and virtue.

In [None]:
# @title Configure the simulation

instances = [
    prefab_lib.InstanceConfig(
        prefab='conversational__Entity',
        role=prefab_lib.Role.ENTITY,
        params={
            'name': 'Socrates',
            'conversation_style': (
                'Speaks with careful reasoning, often using questions to '
                'guide the conversation. Values wisdom and self-knowledge.'
            ),
        },
    ),
    prefab_lib.InstanceConfig(
        prefab='conversational__Entity',
        role=prefab_lib.Role.ENTITY,
        params={
            'name': 'Aristotle',
            'conversation_style': (
                'Speaks methodically, categorizing and analyzing concepts. '
                'Appreciates empirical observation and practical wisdom.'
            ),
        },
    ),
    prefab_lib.InstanceConfig(
        prefab='dialogic__GameMaster',
        role=prefab_lib.Role.GAME_MASTER,
        params={
            'name': 'conversation rules',
            'next_game_master_name': 'conversation rules',
            'acting_order': 'fixed',
            'can_terminate_simulation': False,
        },
    ),
    prefab_lib.InstanceConfig(
        prefab='formative_memories_initializer__GameMaster',
        role=prefab_lib.Role.INITIALIZER,
        params={
            'name': 'initial setup rules',
            'next_game_master_name': 'conversation rules',
            'shared_memories': [
                'Socrates and Aristotle are meeting in the agora to discuss philosophy.',
                'The topic of their discussion is the nature of knowledge and virtue.',
            ],
            'player_specific_memories': {
                'Socrates': [
                    'Believes that true knowledge comes from within through questioning.',
                    'Famous for saying "I know that I know nothing."',
                ],
                'Aristotle': [
                    'Was once a student at Plato\'s Academy.',
                    'Believes knowledge comes from observing the natural world.',
                ],
            },
        },
    ),
]

config = prefab_lib.Config(
    default_premise=(
        'Two great philosophers meet in ancient Athens to discuss the '
        'fundamental nature of knowledge and how one can live a virtuous life.'
    ),
    default_max_steps=10,
    prefabs=prefabs,
    instances=instances,
)
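
A typo in a name under `player_specific_memories` is easy to miss, so it may be worth checking that every key matches a configured entity before running. A minimal sketch on plain Python data (the helper `find_unknown_players` is hypothetical and not part of Concordia):

```python
def find_unknown_players(entity_names, player_specific_memories):
    """Return memory keys that do not match any configured entity name."""
    return sorted(set(player_specific_memories) - set(entity_names))


entity_names = ['Socrates', 'Aristotle']
memories_by_player = {
    'Socrates': ['Believes that true knowledge comes from within.'],
    'Aristotel': ["Was once a student at Plato's Academy."],  # Deliberate typo.
}
unknown = find_unknown_players(entity_names, memories_by_player)
print(unknown)
```

With the notebook's objects, the entity names could be gathered from `instances` by reading `params['name']` from each entry whose `role` is `prefab_lib.Role.ENTITY`, assuming `InstanceConfig` exposes those fields as attributes.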

In [None]:
# @title Initialize the simulation

runnable_simulation = simulation.Simulation(
    config=config,
    model=model,
    embedder=embedder,
)

In [None]:
# @title Run the simulation

print('Running simulation...')
print('(This may take a while depending on your model and hardware)\n')

raw_log = []
results_log = runnable_simulation.play(
    max_steps=6,  # Overrides default_max_steps from the config.
    raw_log=raw_log,
)

In [None]:
# @title Display the results

display.HTML(results_log)

## Performance Tips

1. **Use quantized models**: 4-bit quantized models (ending in `-4bit`) use much less memory than full-precision weights and run faster.

2. **Match model size to your RAM**:
   - 8GB RAM: Use 3B models
   - 16GB RAM: Use 7-8B models
   - 32GB+ RAM: Can use larger models

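3. **Reduce `max_tokens`**: Generation time grows with the number of tokens produced, so shorter generations finish sooner.

4. **Use lower temperatures**: Setting `temperature=0` selects greedy decoding, which is slightly faster and makes outputs more reproducible.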

5. **Consider using LoRA adapters**: Fine-tune a small adapter instead of using a larger base model.
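
The RAM guidance in tip 2 can be sketched as a small lookup. The helper `suggest_model` is hypothetical, and the thresholds simply mirror the list above using models from the Model Selection table.

```python
def suggest_model(ram_gb):
    """Map total RAM (GB) to a 4-bit model, following the guidance above."""
    if ram_gb >= 16:
        # 16 GB and up: 7-8B models. Machines with 32 GB+ can go larger,
        # but no specific larger model is assumed here.
        return 'mlx-community/Llama-3.1-8B-Instruct-4bit'
    # 8 GB (or less): stick with a 3B model.
    return 'mlx-community/Llama-3.2-3B-Instruct-4bit'


for ram in (8, 16, 64):
    print(f'{ram} GB -> {suggest_model(ram)}')
```

The returned string can be assigned to `MODEL_NAME` before the model is initialized.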

```
Copyright 2025 DeepMind Technologies Limited.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```