In [8]:
import readline # Enables input line editing
from pathlib import Path
 
import lmstudio as lms
 
# Define a function the model can call, then pass it to the model as a tool.
# Tools are just regular Python functions; they can do anything at all.
def create_file(name: str, content: str):
    """Create a file with the given name and content."""
    dest_path = Path(name)
    if dest_path.exists():
        return "Error: File already exists."
    try:
        dest_path.write_text(content, encoding="utf-8")
    except Exception as exc:
        return f"Error: {exc!r}"
    return "File created."
 
def print_fragment(fragment, round_index=0):
    # .act() supplies the round index as the second parameter
    # Setting a default value means the callback is also
    # compatible with .complete() and .respond().
    print(fragment.content, end="", flush=True)
 
model = lms.llm("openai/gpt-oss-20b")
chat = lms.Chat("You are a helpful assistant running on the user's computer.")
 
while True:
    try:
        user_input = input("User (leave blank to exit): ")
    except EOFError:
        print()
        break
    if not user_input:
        break
    chat.add_user_message(user_input)
    print("Assistant: ", end="", flush=True)
    model.act(
        chat,
        [create_file],
        on_message=chat.append,
        on_prediction_fragment=print_fragment,
    )
    print()
 

Assistant: <|channel|>analysis<|message|>User says Hello. Just respond politely.<|end|><|start|>assistant<|channel|>final<|message|>Hello! How can I help you today?
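
The streamed output above still contains the model's raw channel markers, including its `analysis` text. The following is a minimal post-processing sketch that keeps only the `final` channel text, assuming the marker layout is exactly what appears in that output. `extract_final_text` is a hypothetical helper written for illustration, not part of the LM Studio SDK, and other models or runtimes may emit different markers or none at all.

```python
import re

# Marker layout assumed from the output shown above (an assumption, not a spec).
FINAL_MARKER = "<|channel|>final<|message|>"
# Any "<|...|>" token is treated as the end of the final-channel text.
ANY_MARKER = re.compile(r"<\|[^|]*\|>")


def extract_final_text(raw: str) -> str:
    """Return only the final-channel text from a raw response string."""
    # rpartition picks the last final-channel marker if several are present.
    _, found, tail = raw.rpartition(FINAL_MARKER)
    if not found:
        # No final-channel marker: return the text unchanged.
        return raw
    # Keep everything up to the next marker, if there is one.
    return ANY_MARKER.split(tail, maxsplit=1)[0].strip()


raw = (
    "<|channel|>analysis<|message|>User says Hello. Just respond politely."
    "<|end|><|start|>assistant<|channel|>final<|message|>"
    "Hello! How can I help you today?"
)
print(extract_final_text(raw))  # prints: Hello! How can I help you today?
```

Because the helper needs the complete string, it would apply to a fully collected response rather than to individual streamed fragments.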


In [7]:
from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed"  # LM Studio does not require an API key
)
 
result = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what MXFP4 quantization is."}
    ]
)
 
print(result.choices[0].message.content)

**MXFP4 Quantization – A Quick Overview**

> **MXFP4** stands for **Mixed‑Precision Floating‑Point 4‑bit** quantization.  
> It’s a way of compressing deep‑learning models (especially those that run on edge devices or specialized hardware) by representing most of the network’s weights and activations with only **4 bits** of precision, while still retaining a small set of higher‑precision “anchor” parameters to keep accuracy high.

---

## Why Quantize at All?

- **Memory Footprint:** Full 32‑bit floating‑point tensors can be huge. Reducing them to 4 bits cuts memory by **8×**.
- **Compute Efficiency:** Many modern accelerators (e.g., Google’s Edge TPU, NVIDIA’s TensorRT, or custom ASICs) have fast integer‑only kernels that run much faster than floating‑point ones when data is in low precision.
- **Energy Savings:** Less data movement and simpler arithmetic mean lower power consumption—critical for mobile/IoT devices.

---

## The “Mixed‑Precision” Twist

Traditional 4‑bit quantization 