## Getting Started with LM Studio and Prompt Calls

**Overview:** In this hands-on lab, you will set up your local LLM environment and make your first prompt calls. The lab is structured as a Jupyter notebook with explanatory text and code cells. You’ll practice adjusting parameters such as temperature and max tokens, and see how prompt formatting (system/user messages) affects the output. The lab is aimed at mid-level developers, so we assume you’re comfortable running Python code and making HTTP requests.

By the end of this lab, you should be able to:
- Connect to the local LM Studio server from code.
- Send both completion and chat requests to the OpenChat-3.5 model.
- Experiment with different generation settings and observe their effects.
- Understand how token limits can impact your prompts and outputs.

Let’s get started!

### **Step 1: Verify the Local LLM Server**

Before sending complex prompts, do a quick check that we can reach the LM Studio server and that our model is listed.

**Exercise 1.1: List models via API.** Run the following cell to list the models:


In [None]:
import requests

BASE_URL = "http://localhost:1234"  # LM Studio's default API base URL

res = requests.get(f"{BASE_URL}/v1/models")
if res.status_code == 200:
    models = res.json()
    print("Available models:", models)
else:
    print("Could not fetch models. Is LM Studio running and the server enabled?")
    print("Status code:", res.status_code, res.text)


- **Expected Result:** You should see a JSON response containing at least one model. Look for the `"id"` field – that is the model’s identifier. It might be something like `"openchat-3.5-0106"` or similar. If you see an empty list or an error, double-check that LM Studio’s local server is on (in the app’s settings) and the port matches.
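
If the raw JSON is hard to scan, a small helper can pull out just the identifiers. This is a minimal sketch that assumes the response follows the OpenAI-style list shape (`{"data": [{"id": ...}, ...]}`); the helper name `extract_model_ids` and the sample payload are illustrative, not part of LM Studio.

```python
def extract_model_ids(models_response):
    """Return the model IDs found in a parsed /v1/models response."""
    # Assumption: models are listed under the "data" key, OpenAI-style.
    entries = models_response.get("data", [])
    return [entry["id"] for entry in entries if "id" in entry]


# Hypothetical sample payload, used here instead of a live request.
sample_response = {
    "object": "list",
    "data": [{"id": "openchat-3.5-0106", "object": "model"}],
}
model_ids = extract_model_ids(sample_response)
print("Model IDs:", model_ids)
```

In the notebook you would call `extract_model_ids(models)` on the result of the cell above and copy the ID you want into later payloads.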

**Question:** *If no models are listed or you got an error, how would you troubleshoot?*  
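*(Hint: Check that LM Studio is running the server on the expected port. You can also try `curl http://localhost:1234/v1/models` from a terminal to rule out problems in your Python environment (or a health endpoint such as `/v1/health`, if your version provides one), and make sure your firewall isn’t blocking local calls.)*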
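
One way to separate "server is not listening" from "server is up but the request is wrong" is a plain TCP check before any HTTP call. The sketch below uses only the standard library; the function name `is_port_open` and its defaults (host `localhost`, port `1234`) are assumptions that mirror the `BASE_URL` used in this lab.

```python
import socket


def is_port_open(host="localhost", port=1234, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `is_port_open()` returns `False`, nothing is listening on that port, so the fix is in LM Studio (start the server or change the port). If it returns `True` but the models request still fails, look at the URL path and the response body instead.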

Once the model is confirmed, we can proceed to querying it.


### **Step 2: Your First Completion Call**

Now, let’s send a simple prompt to the model to make sure everything is working end-to-end.

**Exercise 2.1: Make a completion request.** We’ll ask the model a basic question: *“What is prompt engineering?”* This will use the `/v1/completions` endpoint, treating the query as a single-turn prompt.

In [None]:
payload = {
    "model": "openchat-3.5-0106",  # replace with your model ID if different
    "prompt": "What is prompt engineering?",
    "max_tokens": 100,   # expecting a short paragraph answer
    "temperature": 0.5   # moderate temperature for a factual answer
}
response = requests.post(f"{BASE_URL}/v1/completions", json=payload)
if response.status_code == 200:
    data = response.json()
    answer = data.get("choices", [{}])[0].get("text", "")
    print("Model's answer:\n", answer.strip())
else:
    print("Request failed:", response.status_code, response.text)

- **Run the cell** and observe the output. You should get a coherent explanation of what prompt engineering is. If the answer is cut off or seems incomplete, check if `max_tokens` might have been too low. If it’s too verbose or wandering, note that for later when we adjust temperature.
- This is your “Hello, world” with the LLM. You sent a prompt and got a completion back. 🎉
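
Rather than guessing whether an answer was cut off, you can often read it from the response. The sketch below assumes the server fills in the OpenAI-style `finish_reason` field on each choice, where `"length"` means the `max_tokens` cap was reached; the helper name `was_truncated` and both sample responses are made up for illustration.

```python
def was_truncated(response_json):
    """Return True if the first choice stopped because it hit max_tokens."""
    choices = response_json.get("choices") or [{}]
    # Assumption: "length" marks a token-cap stop, as in the OpenAI API.
    return choices[0].get("finish_reason") == "length"


# Hypothetical response fragments for illustration only.
cut_off = {"choices": [{"text": "Prompt engineering is the", "finish_reason": "length"}]}
complete = {"choices": [{"text": "Prompt engineering is ...", "finish_reason": "stop"}]}
print("cut_off truncated:", was_truncated(cut_off))
print("complete truncated:", was_truncated(complete))
```

Calling `was_truncated(data)` after the request above tells you whether raising `max_tokens` is likely to help.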

**Think About:** The model likely gave an answer defining prompt engineering (e.g. “crafting effective inputs to guide AI models to produce desired outputs”). Does it align with your understanding? This is a good sanity check that the model knows the basics.

### **Step 3: Chat API – Structured Conversation**

Next, let’s use the chat format, which is more powerful for interactive prompts or setting context via roles.

**Exercise 3.1: Use a system message to influence style.** We’ll ask the same question, but this time instruct the model (via a system role) to respond in a particular style – say, a formal academic tone.


In [6]:
chat_payload = {
    "model": "openchat-3.5-0106",
    "messages": [
        {"role": "system", "content": "You are a concise and formal academic assistant."},
        {"role": "user", "content": "What is prompt engineering?"}
    ],
    "temperature": 0.5,
    "max_tokens": 100
}
chat_res = requests.post(f"{BASE_URL}/v1/chat/completions", json=chat_payload)
answer = None
if chat_res.status_code == 200:
    chat_data = chat_res.json()
    answer = chat_data.get("choices", [{}])[0].get("message", {}).get("content", "")
    print("Assistant's answer (formal style):\n", answer.strip())
else:
    print("Chat request failed:", chat_res.status_code, chat_res.text)

Assistant's answer (formal style):
 Prompt engineering refers to the process of designing, developing, and optimizing input prompts for AI language models such as GPT-3 or GPT-4 in order to generate more accurate, relevant, and useful responses. This involves understanding the capabilities and limitations of the model, as well as the specific context and goals of the user's query. By carefully crafting prompts, prompt engineers can guide the AI to produce higher quality outputs and improve overall user experience. Techniques used


- **Observe the style/tone:** The content of the answer should still explain prompt engineering, but perhaps the wording is more formal or succinct (because we set the persona to “concise and formal academic”). If you remove or change the system message to something like “You are a casual, friendly assistant” and run again, you’d likely see a difference in tone. Feel free to try that as an experiment!
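
To make the persona experiment easy to repeat, the payload construction can be wrapped in a small function. This is only a sketch: `build_chat_payload` and the `personas` dictionary are hypothetical names, and the defaults copy the values used in the cell above.

```python
def build_chat_payload(system_prompt, user_prompt,
                       model="openchat-3.5-0106",
                       temperature=0.5, max_tokens=100):
    """Build a chat completion payload with one system and one user message."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


# Two personas to compare; only the system message differs.
personas = {
    "formal": "You are a concise and formal academic assistant.",
    "casual": "You are a casual, friendly assistant.",
}
question = "What is prompt engineering?"
payloads = {name: build_chat_payload(prompt, question)
            for name, prompt in personas.items()}
print(payloads["casual"]["messages"][0])
```

Each entry in `payloads` can then be posted to `/v1/chat/completions` exactly as in the cell above, so any difference in tone comes from the system message alone.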

**Exercise 3.2: Multi-turn conversation.** Let’s extend the conversation. Imagine we follow up with another question. In a chat scenario, we need to include the conversation history in the next request (since the API is stateless). We’ll append a new user message and send the whole message list again:

In [7]:
# Continue the conversation with a follow-up question
if answer:
    followup_messages = chat_payload["messages"] + [
        {"role": "assistant", "content": answer},
        {"role": "user", "content": "Why is it important for AI developers?"}
    ]
    follow_payload = {
        "model": "openchat-3.5-0106",
        "messages": followup_messages,
        "temperature": 0.5,
        "max_tokens": 100
    }
    follow_res = requests.post(f"{BASE_URL}/v1/chat/completions", json=follow_payload)
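    if follow_res.status_code == 200:
        follow_data = follow_res.json()
        follow_answer = follow_data.get("choices", [{}])[0].get("message", {}).get("content", "")
        print("Assistant's follow-up answer:\n", follow_answer.strip())
    else:
        # Report failures the same way as the earlier cells
        print("Follow-up request failed:", follow_res.status_code, follow_res.text)
else:
    print("No previous answer available; run the Exercise 3.1 cell first.")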

Assistant's follow-up answer:
 Prompt engineering is crucial for AI developers because it significantly impacts the performance, efficiency, and effectiveness of AI language models in addressing user queries or tasks. The importance lies in several aspects:

1. Enhanced Accuracy: Well-designed prompts help AI models generate more accurate responses by providing clear and concise information, reducing ambiguity and misinterpretation.
2. Contextual Relevance: Prompt engineering ensures that the AI model understands the context of the


Here we:
- Took the original messages (system + first user question),
- Added the assistant’s answer as a message with role `"assistant"`,
- Then a new user question *“Why is it important for AI developers?”*,
- Sent the whole sequence again to get a follow-up answer.

- **Run the above cell** and see the response. The assistant’s answer should reference the context (it knows we were talking about prompt engineering) and explain why it’s important, perhaps building on the prior answer. This shows how the model uses the conversation history. The model does not remember anything between requests; it sees the earlier turns only because we sent them again, and only as long as they fit within the context window.

This multi-turn approach is essential when you want the model to have a back-and-forth dialogue, or when you want to feed the model its own previous answer (for verification or refining further).
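
The append-and-resend pattern can be captured in a tiny helper so each new turn is a one-liner. This is a sketch with a hypothetical function name (`extend_history`); it returns a new list so the original message list stays untouched.

```python
def extend_history(messages, assistant_reply, next_user_message):
    """Return a new message list with the last reply and the next question added."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]


# Illustrative history; the assistant text is a placeholder, not real output.
history = [
    {"role": "system", "content": "You are a concise and formal academic assistant."},
    {"role": "user", "content": "What is prompt engineering?"},
]
history_turn_2 = extend_history(
    history,
    "Prompt engineering is the practice of designing effective prompts.",
    "Why is it important for AI developers?",
)
print([m["role"] for m in history_turn_2])
```

Because the helper does not mutate its input, you can branch a conversation (for example, ask two different follow-ups from the same starting point) without the branches interfering.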


### **Step 4: Playing with Temperature**

Now that basic Q&A is working, let’s systematically explore the effect of `temperature`. 

**Exercise 4.1: High vs Low Temperature Output.** We’ll use a creative prompt to see how the style changes. Run the following cell:


In [8]:
test_prompt = "Describe a sunset over a futuristic city skyline."
for temp in [0.2, 0.9]:
    payl = {
        "model": "openchat-3.5-0106",
        "prompt": test_prompt,
        "max_tokens": 60,
        "temperature": temp
    }
    resp = requests.post(f"{BASE_URL}/v1/completions", json=payl).json()
    text = resp.get("choices", [{}])[0].get("text", "").strip()
    print(f"\nTemperature {temp}:\n{text}\n")


Temperature 0.2:
The sun was setting over the horizon, casting an orange glow across the futuristic city skyline. The towering buildings, adorned with neon lights and holographic advertisements, stretched up into the sky like a forest of glass and steel. As the sun


Temperature 0.9:
It was the first time that I had seen anything like it, and it took my breath away. The sky was a brilliant swirl of reds, yellows, pinks, purples, and oranges, creating a spectacular scene that would be ingrained in my



The cell above sends the prompt *“Describe a sunset over a futuristic city skyline.”* twice: once with temperature 0.2 (quite low) and once with 0.9 (high).

- **Compare the outputs:** 
  - At **0.2**, the description might be straightforward, perhaps focusing on clear details like “The sun sets, casting orange hues on glass skyscrapers... It is calm and peaceful.”
  - At **0.9**, the description might become more vivid, imaginative or unexpected, maybe introducing creative elements like “neon lights flicker on as the sky burns red-orange, reflecting off hovering vehicles...”.
  
Are the two descriptions different in phrasing or imagery? Note these differences. This reinforces how temperature can make the model either play it safe or go wild with the description.
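
To see why temperature has this effect, it can help to look at the arithmetic in isolation. The sketch below shows the common textbook formulation, where scores are divided by the temperature before a softmax turns them into probabilities. It is a simplification: the `logits` values are invented, and a real inference backend typically combines temperature with other sampling settings.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities after scaling by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]
for temp in (0.2, 0.9):
    probs = softmax_with_temperature(logits, temp)
    print(f"temperature={temp}: {[round(p, 3) for p in probs]}")
```

At the lower temperature almost all of the probability lands on the top-scoring token, so sampling rarely strays from it. At the higher temperature the alternatives keep a meaningful share, which is where the extra variety comes from.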



**Exercise 4.2: When to use what?** Consider a scenario: You’re building a factual Q&A bot versus a poetry generator. Which temperature setting would you choose for each and why? 


### **Step 5: Understanding Token Limits**

Finally, let’s ensure we understand token limits practically. Our model has a context window of 8192 tokens. It’s hard to hit that in a quick test without a huge input, but we can simulate the effect of smaller token limits using `max_tokens` (for output) and by observing input lengths.

**Exercise 5.1: Long Prompt Truncation (Thought Experiment).** We won’t actually use an 8000-token prompt here, but consider: the prompt and the generated answer share the same 8192-token window. If their combined length exceeds it, the outcome depends on the server’s settings: it may drop the oldest parts of the conversation, stop generating, or return an error. In practice, if you have a *very* long conversation, you need to start summarizing or dropping earlier turns once you approach the limit.
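
Dropping earlier turns can be done on the client before the request is sent. The following is a rough sketch, not a recommended implementation: it assumes about 4 characters per token (a heuristic, not the model's tokenizer), keeps system messages, and always keeps the most recent turn. The names `estimate_tokens` and `trim_history` are hypothetical.

```python
def estimate_tokens(text):
    """Rough estimate: about 4 characters per token (heuristic only)."""
    return max(1, len(text) // 4)


def trim_history(messages, token_budget):
    """Drop the oldest non-system messages until the estimate fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(estimate_tokens(m["content"]) for m in msgs)

    # Always keep at least the most recent turn, even if it is over budget.
    while len(turns) > 1 and total(system + turns) > token_budget:
        turns.pop(0)
    return system + turns


# Illustrative conversation with two long early turns.
history = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_history(history, token_budget=120)
print([m["role"] for m in trimmed])
```

A production version would count tokens with the model's real tokenizer and would usually drop user/assistant messages in pairs, or replace them with a summary, rather than cutting single messages.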

To get a feel for counting tokens, here’s a simple way using an approximate method (since we can’t easily get the model’s tokenizer without additional libraries):

In [9]:
text = "Hello, world!" * 50  # a repetitive text
char_count = len(text)
approx_tokens = char_count / 4  # using 1 token ~ 4 chars rule
print(f"Char count: {char_count}, approx token count: {approx_tokens}")

Char count: 650, approx token count: 162.5


- **Try with different text lengths:** Change the multiplier or text content to see how the token estimate grows. For example, try 500 repetitions instead of 50 and see the estimate. Keep in mind this is a rough estimate, but it helps gauge scale.

**Exercise 5.2: Using max_tokens to simulate context cut-off.** We did an exercise above listing Python features with small `max_tokens`. Let’s explicitly observe truncation:

In [10]:
long_prompt = "Python is an interpreted, high-level programming language." * 20  # repeat to make it long
payload = {
    "model": "openchat-3.5-0106",
    "prompt": long_prompt + "\nIn summary, Python is known for:",
    "max_tokens": 50,
    "temperature": 0.3
}
res = requests.post(f"{BASE_URL}/v1/completions", json=payload).json()
summary = res.get("choices", [{}])[0].get("text", "")
print(summary)


1. Being easy to read and write
2. Having a large standard library
3. Supporting multiple programming paradigms
4. Being open source and having a strong community
5. Being cross-platform (runs on


Here we gave a long prompt (by repeating a sentence many times) and then asked for a summary. We limited the answer to 50 tokens. Likely, the model will start listing some known Python features but stop partway due to the token cap (you might see it stop at an incomplete point like “easy syntax, large community, and...” and then cut off).

- **Reflect:** If you needed the full summary, you’d raise `max_tokens`. If the input itself was too long (say we repeated that sentence 1000 times which might exceed context), the model might ignore the later part of the prompt or not be able to include everything in the summary. Thankfully 8192 tokens is quite a lot (our prompt above is far smaller). But if you did have, say, a 50-page document to summarize, you’d have to chunk it because it wouldn’t fit in one go.
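
Chunking can be as simple as grouping paragraphs until a size limit is reached. The sketch below is one possible approach under stated assumptions: it splits on blank lines, uses the rough 4-characters-per-token rule from earlier, and leaves a single oversized paragraph as its own chunk rather than splitting it. The function name `chunk_text` and the sample document are illustrative.

```python
def chunk_text(text, max_tokens_per_chunk, chars_per_token=4):
    """Group paragraphs into chunks that stay under an approximate token limit."""
    max_chars = max_tokens_per_chunk * chars_per_token
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Note: a paragraph longer than max_chars is kept whole.
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


# Five 100-character paragraphs standing in for a long document.
document = "\n\n".join(["x" * 100] * 5)
chunks = chunk_text(document, max_tokens_per_chunk=60)
print([len(c) for c in chunks])
```

Each chunk could then be summarized with its own request, and the partial summaries combined in a final request.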

**Question:** *What happens if you ask the model to produce more tokens than the context window allows? For example, setting `max_tokens` to 10000 with this model.*  
*(Hint: The model will stop at its context limit; it can’t output beyond roughly 8192 tokens minus the input length. Some APIs return an error if `max_tokens` is set too high, while others silently cap it at the limit.)*
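
The arithmetic behind that hint is short enough to write down. This sketch assumes you already have an estimate of the prompt's token count; the helper name `clamp_max_tokens` is hypothetical, and the default window of 8192 is the figure used throughout this lab.

```python
def clamp_max_tokens(requested, prompt_tokens, context_window=8192):
    """Limit a requested max_tokens value to what the context window can hold."""
    available = max(0, context_window - prompt_tokens)
    return min(requested, available)


# Asking for 10000 tokens with a prompt estimated at 300 tokens.
print(clamp_max_tokens(10000, prompt_tokens=300))
```

Clamping on the client side means the request behaves the same way regardless of whether the server would have rejected or silently capped an oversized value.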

### **Conclusion of Lab**

In this lab, you:
- Set up and connected to a local LLM service (LM Studio running OpenChat-3.5).
- Made both completion and chat requests through a standard API interface.
- Used system messages to guide the model’s style and witnessed multi-turn conversation handling.
- Adjusted temperature to see its effect on output randomness vs. determinism.
- Played with `max_tokens` to control output length and discussed how the context window is a hard limit for the model’s memory.

These hands-on experiments should give you a solid intuition for how LLMs respond to prompts and generation settings, and how we control their behavior. Keep these lessons in mind as we proceed to more advanced prompt engineering tasks. The principles of **prompt design** (clear instructions, providing context, etc.) combined with **parameter tuning** (like we did with temperature and max tokens) will be the key to getting great results from LLMs in your applications.
