# Controlling LLM Output with Penalties and Limits

### Setup and Initialization

In [1]:
import os
from openai import OpenAI
from IPython.display import Markdown, display
from dotenv import load_dotenv


load_dotenv()

# Initialize the client; OpenAI() reads OPENAI_API_KEY from the environment (loaded from .env above)
client = OpenAI()


### Testing Function
A helper that sends a prompt with a configurable token limit, frequency and presence penalties, and temperature.

In [None]:
def get_llm_response(prompt, max_tokens=100, freq_penalty=0.0, pres_penalty=0.0, temperature=0.8):
    """Send a single-turn prompt and return the full ChatCompletion object.

    The full response is returned (not just the text) so callers can also
    inspect fields such as usage and finish_reason.
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,  # Hard cap on generated tokens
        frequency_penalty=freq_penalty,
        presence_penalty=pres_penalty,
        temperature=temperature,  # Slight randomness to see penalties in action
    )
    return response

### Max Token Limits
The max_tokens parameter is a hard cap on the number of generated tokens. It does not tell the model to "wrap it up"; generation simply stops when the limit is reached, often mid-sentence.

In [7]:
prompt = "Write a long, detailed paragraph about the history of the Roman Empire."

# --- Test 1: Short Response ---
print("--- Max Tokens: 20 (The 'Cliff-Hanger' Effect) ---")
response_short = get_llm_response(prompt, max_tokens=20)
print(response_short.choices[0].message.content)  # Print the text
print(f"Tokens Used: {response_short.usage.total_tokens}")

# --- Test 2: Longer Response ---
print("\n--- Max Tokens: 100 (The Standard Response) ---")
response_long = get_llm_response(prompt, max_tokens=100)
print(response_long.choices[0].message.content)  # Print the text

# Displaying detailed breakdown
print(f"\nInput (Prompt) tokens: {response_long.usage.prompt_tokens}")
print(f"Output (Response) tokens: {response_long.usage.completion_tokens}")
print(f"Total tokens used: {response_long.usage.total_tokens}")

--- Max Tokens: 20 (The 'Cliff-Hanger' Effect) ---
The history of the Roman Empire is a captivating tale of growth, power, and eventual decline that spans
Tokens Used: 41

--- Max Tokens: 100 (The Standard Response) ---
The history of the Roman Empire is a rich tapestry of power, innovation, and transformation, spanning several centuries and leaving an indelible mark on the world. It began with the end of the Roman Republic, which had been plagued by internal strife and civil wars. In 27 BC, Julius Caesar's adopted heir, Octavian, later known as Augustus, emerged victorious from the turmoil and was granted the title of the first emperor by the Roman Senate. This marked the beginning of the Pax Rom

Input (Prompt) tokens: 21
Output (Response) tokens: 100
Total tokens used: 121
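
Both outputs above stop mid-sentence. The Chat Completions API reports this through the `finish_reason` field of each choice, which is `"length"` when the token limit ended the generation and `"stop"` when the model finished on its own. The following is a minimal sketch of such a check; `was_truncated` and `fake_response` are hypothetical helper names, and the stand-in objects only mimic the shape of a real response so the check can be tried without an API call.

```python
from types import SimpleNamespace


def was_truncated(response):
    """Return True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"


def fake_response(text, finish_reason):
    """Build a stand-in object shaped like a chat completion response (illustration only)."""
    message = SimpleNamespace(content=text)
    choice = SimpleNamespace(message=message, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


cut_off = fake_response("This marked the beginning of the Pax Rom", "length")
complete = fake_response("Hello there!", "stop")

print(was_truncated(cut_off))   # True
print(was_truncated(complete))  # False
```

With a real client, the same function can be passed the object returned by `get_llm_response`, for example to decide whether to retry with a larger limit.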


### Frequency Penalty
Frequency Penalty (Range: -2.0 to 2.0) penalizes each token in proportion to how many times it has already appeared in the text so far. The more often a token has been used, the less likely it is to be sampled again; negative values have the opposite effect and encourage repetition.
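
To make the scaling concrete, here is a minimal sketch of how a frequency penalty can be applied to token scores before sampling, following the "subtract count times penalty from the logit" description in OpenAI's documentation. The function name `apply_frequency_penalty` and the toy logits are assumptions for illustration; the actual server-side implementation is not visible.

```python
from collections import Counter


def apply_frequency_penalty(logits, generated_tokens, freq_penalty):
    """Subtract freq_penalty once for every prior occurrence of each token."""
    counts = Counter(generated_tokens)
    return {tok: logit - counts[tok] * freq_penalty for tok, logit in logits.items()}


# Toy scores for three candidate next tokens (made-up numbers).
logits = {"hello": 2.0, "hi": 1.5, "hey": 1.0}
generated = ["hello", "hello", "hello", "hi"]

adjusted = apply_frequency_penalty(logits, generated, freq_penalty=0.5)
print(adjusted)  # {'hello': 0.5, 'hi': 1.0, 'hey': 1.0}
```

The token used three times loses three times as much score as the token used once, and the unused token is untouched.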

In [8]:
# We'll use a prompt that usually causes repetition
prompt = (
    "List 10 ways to say 'hello' using only the word 'hello' and variations of 'hello'."
)

print("--- Frequency Penalty: 0.0 (Standard/Repetitive) ---")
print(get_llm_response(prompt, freq_penalty=0.0).choices[0].message.content)

print("\n--- Frequency Penalty: 2.0 (Forced Variety) ---")
print(get_llm_response(prompt, freq_penalty=2.0).choices[0].message.content)

--- Frequency Penalty: 0.0 (Standard/Repetitive) ---
Certainly! Here are ten variations of saying "hello" using forms of the word:

1. Hello!
2. Hello there!
3. Hello, hello!
4. Hellooo!
5. Hey, hello!
6. Well, hello!
7. Hello everyone!
8. Oh, hello!
9. Why, hello!
10. Hello, friend!

---

### Presence Penalty
Presence Penalty (Range: -2.0 to 2.0) applies a flat penalty to any token that has appeared at least once so far. It does not matter how many times the token appeared; the penalty simply pushes the model toward tokens, and therefore topics, it has not used yet.
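
The flat, one-time nature of this penalty can be sketched as follows. This is an illustrative model of the logit adjustment described in OpenAI's documentation, not the service's actual code; `apply_presence_penalty` and the toy logits are hypothetical.

```python
def apply_presence_penalty(logits, generated_tokens, pres_penalty):
    """Subtract pres_penalty once from every token that has already appeared."""
    seen = set(generated_tokens)
    return {
        tok: logit - (pres_penalty if tok in seen else 0.0)
        for tok, logit in logits.items()
    }


# Toy scores for three candidate next tokens (made-up numbers).
logits = {"trees": 2.0, "oxygen": 1.5, "soil": 1.0}
generated = ["trees", "trees", "trees", "oxygen"]

adjusted = apply_presence_penalty(logits, generated, pres_penalty=0.5)
print(adjusted)  # {'trees': 1.5, 'oxygen': 1.0, 'soil': 1.0}
```

Here "trees" appeared three times and "oxygen" once, yet both lose the same 0.5, while the unseen "soil" keeps its score.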

In [11]:
# The prompt is designed to see if the model wanders off-topic
prompt = "Tell me about the importance of trees in 4 sentences."

print("--- Presence Penalty: 0.0 (Likely to stay on one point) ---")
print(get_llm_response(prompt, pres_penalty=0.0).choices[0].message.content)

print("\n--- Presence Penalty: 2.0 (Forced to switch to new sub-topics) ---")
print(get_llm_response(prompt, pres_penalty=2.0).choices[0].message.content)

--- Presence Penalty: 0.0 (Likely to stay on one point) ---
Trees are vital to the environment as they produce oxygen through photosynthesis and absorb carbon dioxide, helping mitigate climate change. They provide habitat and food for a wide range of wildlife, enhancing biodiversity. Trees also contribute to human well-being by offering shade, reducing urban heat, and improving air quality. Additionally, they prevent soil erosion, conserve water, and can enhance the aesthetic value of landscapes, contributing to the overall health of ecosystems and communities.

### Brainstorming Mode
Combining a high presence penalty with a high temperature pushes the model toward more varied, exploratory output, which makes a useful "Brainstorming Mode."

In [None]:
# Scenario: Brainstorming a new sci-fi movie concept
prompt = "Give me a 5-sentence brainstorm for a unique sci-fi movie premise."

# High Temp (Creative) + High Presence (Diverse Topics)
# Note: the helper's keyword is `temperature`, not `temp`
print("--- THE BRAINSTORMER (Temp 1.2, Presence 1.5) ---")
print(get_llm_response(prompt, pres_penalty=1.5, temperature=1.2, max_tokens=150).choices[0].message.content)

### Comparison Summary Table

| Parameter | Range | Primary Goal | Behavior |
| :---------- | :---------- | :----------- | :----------- |
| **Max Tokens** | 1 up to the model's output limit | **Length Control** | A hard stop that cuts generation off at the given token count, even mid-sentence. |
| **Frequency Penalty** | -2.0 to 2.0 | **Anti-Repetition** | Penalty scales with **repetition count**: each additional use of a token increases its penalty. |
| **Presence Penalty** | -2.0 to 2.0 | **Topic Diversity** | Flat, one-time penalty applied to any token that has already appeared at least once. |



### Implementation Guide

| Goal | Parameter | Value |
| :---------- | :---------- | :---------- |
| **Stop cut-off sentences** | Max Tokens | Increase (e.g., 500) |
| **Avoid repetitive words** | Frequency Penalty | 0.5 to 1.5 |
| **Force new topics** | Presence Penalty | 0.5 to 1.0 |
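
One way to apply the guide in code is to keep the settings as named presets and merge them into the request arguments. The sketch below is only an illustration: `PRESETS`, the goal names, and `build_request_kwargs` are hypothetical, and the specific values are picks from within the ranges in the table rather than recommendations from the API.

```python
# Hypothetical presets; values are chosen from the ranges in the table above.
PRESETS = {
    "stop_cutoffs": {"max_tokens": 500},
    "avoid_repetition": {"frequency_penalty": 1.0},
    "force_new_topics": {"presence_penalty": 0.75},
}


def build_request_kwargs(prompt, goals, model="gpt-4o"):
    """Merge the presets for the given goals into one dict of request arguments."""
    kwargs = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    for goal in goals:
        kwargs.update(PRESETS[goal])
    return kwargs


kwargs = build_request_kwargs("Tell me about trees.", ["stop_cutoffs", "force_new_topics"])
print(kwargs)

# With a configured client, the result could be sent as:
# response = client.chat.completions.create(**kwargs)
```

Parameters for goals that were not requested are left out of the dict, so the API defaults apply to them.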