# Controlling LLM Output with Penalties and Limits

### Setup and Initialization

In [13]:
import os
from openai import OpenAI
from IPython.display import Markdown, display
from dotenv import load_dotenv


load_dotenv()

# Initialize the client; OpenAI() reads OPENAI_API_KEY from the environment (loaded from .env above)
client = OpenAI()


### Testing Function
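A helper that sends a single prompt with configurable `max_tokens`, frequency penalty, presence penalty, and temperature. It returns the full response object so that callers can read both the generated text and the token usage.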

In [14]:
def get_llm_response(prompt, max_tokens=100, freq_penalty=0.0, pres_penalty=0.0, temperature=0.8):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        frequency_penalty=freq_penalty,
        presence_penalty=pres_penalty,
        temperature=temperature,  # Slight randomness to see penalties in action
    )
    # Return the full response object (not just the text) so callers can
    # inspect token usage; the text is at response.choices[0].message.content
    return response

### Max Token Limits
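The `max_tokens` parameter is a hard cap on the number of generated tokens. It does not tell the model to "wrap it up"; generation simply stops when the limit is reached, even mid-sentence. When that happens, the API reports `finish_reason='length'` on the choice instead of `'stop'`.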

In [15]:
prompt = "Write a long, detailed paragraph about the history of the Roman Empire."

# --- Test 1: Short Response ---
print("--- Max Tokens: 20 (The 'Cliff-Hanger' Effect) ---")
response_short = get_llm_response(prompt, max_tokens=20)
print(response_short.choices[0].message.content)  # Print the text
print(f"Tokens Used: {response_short.usage.total_tokens}")

print("\n--- Max Tokens: 100 (The Standard Response) ---")
# --- Test 2: Longer Response ---
response_long = get_llm_response(prompt, max_tokens=100)
print(response_long.choices[0].message.content)  # Print the text

# Displaying detailed breakdown
print(f"\nInput (Prompt) tokens: {response_long.usage.prompt_tokens}")
print(f"Output (Response) tokens: {response_long.usage.completion_tokens}")
print(f"Total tokens used: {response_long.usage.total_tokens}")

--- Max Tokens: 20 (The 'Cliff-Hanger' Effect) ---
The history of the Roman Empire is a grand narrative that chronicles the evolution of a small city-state into
Tokens Used: 41

--- Max Tokens: 100 (The Standard Response) ---
The Roman Empire, one of history's most influential civilizations, emerged from the Roman Republic's decline in the first century BCE, marking a significant transformation in governance and territorial expansion. Its inception is traditionally dated to 27 BCE when Gaius Octavius, later known as Augustus, was granted imperium by the Senate, ushering in the era of the principate, a system that maintained republican forms while vesting substantial power in the emperor. Augustus implemented reforms that stabilized the empire after years of

Input (Prompt) tokens: 21
Output (Response) tokens: 100
Total tokens used: 121
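
Because a truncated response looks like ordinary text, it can help to check for the cut-off programmatically. The sketch below assumes only that each choice carries a `finish_reason` field, as Chat Completions responses do; the helper name `was_truncated` and the stand-in response object are invented for illustration so the example runs without an API call.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """Return True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"


# Stand-in for a real ChatCompletion object (hypothetical data, no API call).
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="length",
            message=SimpleNamespace(content="The history of the Roman Empire is"),
        )
    ]
)

if was_truncated(fake_response):
    print("Response hit max_tokens; consider raising the limit.")
```

With the notebook's helper, the same check would be `was_truncated(get_llm_response(prompt, max_tokens=20))`.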


### Frequency Penalty
Frequency Penalty (Range: -2.0 to 2.0) penalizes each token in proportion to how many times it has already appeared in the text. The more often a token has been used, the less likely it is to be chosen again. Negative values have the opposite effect and encourage repetition.
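
One rough way to compare responses generated at different penalty values is to measure how much they repeat themselves. The helper below is a hypothetical sketch, not part of the OpenAI library: it counts whole words rather than model tokens, so it only approximates what the penalty actually acts on.

```python
import re
from collections import Counter


def repetition_stats(text):
    """Return (distinct-word ratio, three most common words) for a text."""
    # Split on lowercase letters and apostrophes; this is a word-level
    # approximation, since the model penalizes tokens, not words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    distinct_ratio = len(counts) / len(words) if words else 0.0
    return distinct_ratio, counts.most_common(3)


ratio, top_words = repetition_stats("Hello hello hi")
print(f"Distinct-word ratio: {ratio:.2f}")
print(f"Most common words: {top_words}")
```

A lower ratio means more repetition, so a response generated with a high frequency penalty would be expected to score higher than one generated with none.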

In [16]:
# We'll use a prompt that usually causes repetition
prompt = (
    "List 10 ways to say 'hello' using only the word 'hello' and variations of 'hello'."
)

print("--- Frequency Penalty: 0.0 (Standard/Repetitive) ---")
print(get_llm_response(prompt, freq_penalty=0.0).choices[0].message.content)

print("\n--- Frequency Penalty: 2.0 (Forced Variety) ---")
print(get_llm_response(prompt, freq_penalty=2.0).choices[0].message.content)

--- Frequency Penalty: 0.0 (Standard/Repetitive) ---
Certainly! Here are ten ways to say "hello" using variations of the word:

1. Hello!
2. Hi!
3. Hey!
4. Hello there!
5. Hiya!
6. Hey there!
7. Heya!
8. Hello, hello!
9. Hi there!
10. Hellooo!

--- Frequency Penalty: 2.0 (Forced Variety) -

### Presence Penalty
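Presence Penalty (Range: -2.0 to 2.0) applies a flat, one-time penalty to any token that has already appeared at least once. The number of occurrences does not matter, so the effect is to nudge the model toward new words and topics rather than to suppress heavy repetition of a single token.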
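
The difference between the two penalties is easiest to see on numbers. OpenAI's API documentation describes the adjustment to each token's logit roughly as subtracting `count * frequency_penalty` plus a flat `presence_penalty` whenever `count > 0`. The sketch below works under that assumption; the function name, the word-level "tokens", and the logit values are invented for illustration, and this is not the service's actual implementation.

```python
from collections import Counter


def apply_penalties(logits, generated_tokens, freq_penalty=0.0, pres_penalty=0.0):
    """Return a copy of `logits` with frequency and presence penalties subtracted."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts[token]
        adjusted[token] = (
            logit
            - count * freq_penalty                         # grows with each repeat
            - (1.0 if count > 0 else 0.0) * pres_penalty   # flat, applied once
        )
    return adjusted


# Hypothetical logits and generation history.
logits = {"hello": 2.0, "hi": 1.5, "hey": 1.0}
history = ["hello", "hello", "hello", "hi"]

freq_only = apply_penalties(logits, history, freq_penalty=0.5)
pres_only = apply_penalties(logits, history, pres_penalty=0.5)
print(freq_only)  # {'hello': 0.5, 'hi': 1.0, 'hey': 1.0}
print(pres_only)  # {'hello': 1.5, 'hi': 1.0, 'hey': 1.0}
```

With the frequency penalty, "hello" (used three times) loses three times as much as "hi" (used once). With the presence penalty, both lose the same amount, and the unused "hey" is untouched in either case.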

In [17]:
# The prompt is designed to see if the model wanders off-topic
prompt = "Tell me about the importance of trees in 4 sentences."

print("--- Presence Penalty: 0.0 (Likely to stay on one point) ---")
print(get_llm_response(prompt, pres_penalty=0.0).choices[0].message.content)

print("\n--- Presence Penalty: 2.0 (Forced to switch to new sub-topics) ---")
print(get_llm_response(prompt, pres_penalty=2.0).choices[0].message.content)

--- Presence Penalty: 0.0 (Likely to stay on one point) ---
Trees play a crucial role in maintaining ecological balance by absorbing carbon dioxide and releasing oxygen, which is essential for the survival of most life forms on Earth. Their roots help prevent soil erosion and maintain water cycles by facilitating groundwater recharge and reducing runoff. Additionally, trees provide habitats and food for countless species, supporting biodiversity and ecosystem health. Beyond environmental benefits, trees also offer economic value through resources like timber, fruits, and medicinal products, while enhancing human well-being by providing shade, reducing urban heat, and improving

### Brainstorming Mode
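Combining a high presence penalty with a high temperature pushes the model away from tokens it has already used while also making sampling more varied. Together they act as a "Brainstorming Mode."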

In [20]:
import textwrap

# 1. Call the function (which returns the full object)
response = get_llm_response(prompt, pres_penalty=1.5, max_tokens=150, temperature=1.2)

# 2. Extract the generated text from the first choice
response_text = response.choices[0].message.content.strip()

# 3. Wrap the text to 80 columns so it prints as readable lines
print("--- THE BRAINSTORMER ---")
print("\n" + textwrap.fill(response_text, width=80))

--- THE BRAINSTORMER ---

In a future where memories are traded as commodities, society is ravaged by
identity theft and emotional detachment. Our protagonist, a disenchanted memory
designer, stumbles upon an unaltered memory of Earth's forgotten distant past
revealing the existence of intergalactic ancestors. This memory becomes highly
sought after by factions that either wish to preserve human history or exploit
forbidden knowledge for power. As minds merge in unforeseen ways, consciousness
begins transcending physical form, challenging the very notion of humanity.
Racing against time, the designer must decide whether restoring collective
memory is worth potentially losing individuality forever.


### Comparison Summary Table

| Parameter | Range | Primary Goal | Behavior |
| :---------- | :---------- | :----------- | :----------- |
| **Max Tokens** | 1 up to the model's output limit | **Length Control** | A "Hard Stop" that cuts the generation off at a specific token count, even mid-sentence. |
| **Frequency Penalty** | -2.0 to 2.0 | **Anti-Repetition** | Penalty scales with **repetition count**. More uses of a token = higher penalty. |
| **Presence Penalty** | -2.0 to 2.0 | **Topic Diversity** | Flat, one-time penalty. Any token that has appeared at least once is penalized by the same amount. |



### Implementation Guide

| Goal | Parameter | Value |
| :---------- | :---------- | :---------- |
| **Stop cut-off sentences** | Max Tokens | Increase (e.g., 500) |
| **Avoid repetitive words** | Frequency Penalty | 0.5 to 1.5 |
| **Force new topics** | Presence Penalty | 0.5 to 1.0 |
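
The guide above can be turned into reusable presets. This is a minimal sketch: the preset names and the `build_params` helper are hypothetical, and the specific values are simply picked from within the ranges in the table. The keys match the keyword arguments of the `get_llm_response` helper defined earlier.

```python
# Hypothetical presets; values chosen from the ranges in the guide above.
PRESETS = {
    "avoid_cutoffs": {"max_tokens": 500},
    "avoid_repetition": {"freq_penalty": 1.0},
    "force_new_topics": {"pres_penalty": 0.75},
}


def build_params(*goals, **overrides):
    """Merge the presets for the given goals, then apply explicit overrides."""
    params = {}
    for goal in goals:
        params.update(PRESETS[goal])
    params.update(overrides)
    return params


params = build_params("avoid_cutoffs", "avoid_repetition")
print(params)  # {'max_tokens': 500, 'freq_penalty': 1.0}
```

The result can be passed straight to the helper, for example `get_llm_response(prompt, **params)`.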