# Module 3.1: LLM Integration Fundamentals & Basic Prompt Engineering

Welcome to Module 3.1 of the Digital Human Teaching Kit! In our journey to build engaging digital humans, we've laid the groundwork with the Pipecat framework and understood how real-time speech input (ASR) and output (TTS) are orchestrated. Now, it's time to delve into **the 'mind' of your digital human: the Large Language Model (LLM)**.

LLMs are incredibly powerful, but out of the box, they are often generic. To transform a raw LLM into a knowledgeable, persona-driven, and safe conversational agent, we need to guide its behavior. This module will cover the fundamental concepts of integrating LLMs into your Pipecat pipeline, managing conversational context, and introducing the essential techniques of **prompt engineering**. You'll learn how to author prompts to shape your digital human's personality and responses, and identify scenarios where more advanced techniques like Guardrails and Retrieval-Augmented Generation (RAG) become necessary.

## Learning Objectives
- Understand and manage conversational history using system prompts and user messages.
- Apply basic prompt engineering techniques, including role prompting and prompt clarity, to guide LLM behavior.
- Customize LLM behavior through various `NvidiaLLMService` parameters.
- Identify the common characteristics of weak prompts and strategies for refinement.
- Experiment with different LLM models available via NVIDIA NIMs and observe differences in their outputs.



# The Mind of the Digital Human: Large Language Models

In Module 1, we saw how Pipecat can take audio input, transcribe it into text using ASR, and then convert text responses back into speech using TTS. The crucial missing piece in that initial loop is the LLM that processes the transcribed text and generates a meaningful, relevant response.

LLMs by default are simply predictors of the next most likely token. To make them act as a specific persona, answer domain-specific questions, or avoid unwanted topics, you need to **author** their behavior.

NVIDIA provides a streamlined way to access and deploy these powerful models through **NIMs** (NVIDIA Inference Microservices). The `nvidia-pipecat` library includes the `NvidiaLLMService`, which provides an OpenAI-compatible interface to these NIMs, making it easy to integrate state-of-the-art LLMs into your Pipecat pipelines.

### Initial Imports and Setup
Let's import the necessary Pipecat components and the `NvidiaLLMService`.

In [1]:
import asyncio
import os
import getpass
from dotenv import load_dotenv

from pipecat.frames.frames import Frame, TextFrame, EndFrame
from pipecat.observers.base_observer import BaseObserver
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from nvidia_pipecat.services.nvidia_llm import NvidiaLLMService
from pipecat.services.ai_services import LLMService # Required for type hinting/inheritance

import nest_asyncio
nest_asyncio.apply() # For running asyncio in Jupyter

load_dotenv() # Load environment variables from a .env file if available

# Check the environment for the API key; prompt for it if missing or malformed
if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    print("NVIDIA API key not found or invalid in .env file.")
    nvapi_key = getpass.getpass("🔐 Enter your NVIDIA API key: ").strip()
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key
else:
    print("NVIDIA API key loaded from .env file.")

# Read the key only after the check above, so a key typed at the prompt is picked up too
api_key = os.environ["NVIDIA_API_KEY"]

class ChatResponsePrinter(BaseObserver):
    """A simple observer to print streamed LLM responses."""
    async def on_push_frame(self, src: LLMService, dst, frame: Frame, direction, timestamp):
        if isinstance(frame, TextFrame):
            # Print LLM response chunks as they arrive
            print(frame.text, end="", flush=True)
        elif isinstance(frame, EndFrame):
            print() # Newline after response completes

NVIDIA API key loaded from .env file.


## Basic LLM Inference & The Need for Authoring

Let's create a simple chat function that uses an LLM. We'll start with a generic LLM and a basic system prompt, then observe its behavior. We'll use `meta/llama-3.3-70b-instruct` as a capable general-purpose model.


### LLM Context and Memory: Guiding the Conversation

For an LLM to engage in a coherent, multi-turn conversation, it needs **context** and **memory**. This isn't built into the core LLM inference itself; it's managed by how you construct the messages sent to the LLM.

The OpenAI API format, widely adopted by LLM providers like NVIDIA NIMs, uses a list of "messages" to define the conversation history. Each message has a `role` and `content`:

-   **System Prompt (`role: "system"`)**: This is the initial instruction given to the LLM. It defines its persona, rules, and general behavior. This is crucial for **role prompting** – telling the LLM *who it is*. For a museum guide, this prompt would structure its knowledge and conversational style. [7, 22]
-   **User Messages (`role: "user"`)**: These are the inputs from the human user.
-   **Assistant Messages (`role: "assistant"`)**: These are the LLM's previous responses. Including them allows the LLM to remember the conversation history.

The `OpenAILLMContext` class from `pipecat` is designed to manage this message array, automatically appending user and assistant turns to maintain conversational memory.  

See [1-1-Introduction-ACE-Controller-Pipecat](<../1-Foundations of Digital Human Agents/1-1-Introduction-ACE-Controller-Pipecat.ipynb>) for a refresh.
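
To make the message format described above concrete, here is a minimal plain-Python sketch that builds the list by hand, without Pipecat. The helper `add_turn` and the sample wording are illustrative assumptions, not part of any library; `OpenAILLMContext` maintains an equivalent list for you.

```python
# Plain-Python sketch of an OpenAI-style message list (no Pipecat required).
# `add_turn` is a hypothetical helper used only for illustration.

def add_turn(messages: list[dict], role: str, content: str) -> list[dict]:
    """Append one turn, rejecting roles outside the three described above."""
    if role not in {"system", "user", "assistant"}:
        raise ValueError(f"unknown role: {role!r}")
    messages.append({"role": role, "content": content})
    return messages

# The system prompt comes first and sets the persona.
history = [{"role": "system", "content": "You are a concise museum guide."}]

# Each exchange adds a user turn and then the assistant's reply.
add_turn(history, "user", "Where is the sculpture gallery?")
add_turn(history, "assistant", "It is on the second floor, east wing.")

# A follow-up only makes sense because the earlier turns are resent with it.
add_turn(history, "user", "Is it open today?")

for message in history:
    print(f"{message['role']:>9}: {message['content']}")
```

The follow-up "Is it open today?" is ambiguous on its own; the model can resolve "it" only because the whole list is sent on every request.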

#### LLM Authoring with `NvidiaLLMService`: Parameters and Context Truncation

LLM authoring extends beyond just the system prompt. The `NvidiaLLMService` in `nvidia-pipecat` provides a comprehensive interface to NVIDIA's language model endpoints, offering fine-grained control over response generation through various parameters.

You can customize parameters such as:
-   **`temperature`**: Controls the randomness of the output. Higher values mean more creative, diverse responses.
-   **`top_p`**: Controls diversity via nucleus sampling. The model samples only from the smallest set of most-probable tokens whose cumulative probability reaches `top_p`; lower values restrict it to fewer, likelier tokens.
    *   `temperature` and `top_p` are often used together to control the output's creativity and coherence.
-   **`frequency_penalty`**: Penalizes new tokens based on their existing frequency in the text, reducing repetition.
-   **`presence_penalty`**: Penalizes new tokens based on whether they appear in the text so far, encouraging new topics.
-   **`max_tokens`**: The maximum number of tokens to generate in the completion.

These parameters are passed through the `InputParams` class to the `NvidiaLLMService`, allowing for dynamic adjustment of the LLM's behavior during interaction.
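
To build intuition for `temperature` and `top_p`, the following sketch applies both to a tiny, made-up next-token distribution. It is a simplified illustration of the general techniques (temperature-scaled softmax and nucleus filtering), not the code a NIM runs; the function names and the toy scores are assumptions chosen for this example.

```python
import math

def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def nucleus_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most probable tokens until their cumulative mass reaches top_p, then renormalize."""
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Made-up scores for the next token after "The painting is ..."
toy_logits = {"beautiful": 2.0, "old": 1.0, "famous": 0.5, "purple": -1.0}

for t in (0.2, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}:", {tok: round(p, 3) for tok, p in probs.items()})

candidates = nucleus_filter(softmax_with_temperature(toy_logits, 1.0), top_p=0.7)
print("top_p=0.7 keeps:", list(candidates))
```

At the lower temperature almost all probability lands on the top-scoring token, which is why low values give more predictable replies. With `top_p=0.7` only the two likeliest tokens in this toy example remain eligible for sampling.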

#### Context Management and Truncation
For long conversations, managing the context window (the maximum number of tokens an LLM can process at once) is critical. The ACE Controller and `nvidia-pipecat` support **context truncation** to manage these limits. This ensures that even in extended interactions, the system prompt and the most relevant recent conversation history are preserved, maintaining consistent behavior and avoiding requests that exceed the model's context limit. The context aggregation system manages conversation history and supports both interim and final transcriptions, crucial for real-time interaction management.
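
The idea behind context truncation can be sketched in a few lines of plain Python. This is not how the ACE Controller or `nvidia-pipecat` implement it; `estimate_tokens` and `truncate_history` are hypothetical helpers, and counting words is only a crude stand-in for a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Very rough stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())

def truncate_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the system prompt plus as many of the most recent turns as fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    # The system prompt is always kept, so it is charged against the budget first.
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    # Walk backwards so the newest turns are considered first.
    for message in reversed(turns):
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))

conversation = [
    {"role": "system", "content": "You are a concise museum guide."},
    {"role": "user", "content": "Hello there"},
    {"role": "assistant", "content": "Welcome to the museum"},
    {"role": "user", "content": "Where is the sculpture gallery"},
    {"role": "assistant", "content": "Second floor, east wing"},
    {"role": "user", "content": "Is it open today"},
]

trimmed = truncate_history(conversation, max_tokens=20)
for message in trimmed:
    print(f"{message['role']:>9}: {message['content']}")
```

With a budget of 20 "tokens", the greeting exchange is dropped while the system prompt and the three most recent turns survive, so the persona stays intact even as older history is discarded.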

In [2]:
from nvidia_pipecat.services.nvidia_llm import NvidiaLLMService
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.frames.frames import TextFrame, EndFrame
from pipecat.processors.frame_processor import FrameDirection

async def run_basic_llm_chat(model_name: str, system_message: str, temperature=0.2, top_p=0.7, max_tokens=1024):
    print(f"\n--- Starting Basic LLM Chat with {model_name} ---")
    print(f"System message: '{system_message}'")

    # Use the InputParams class
    generation_params = NvidiaLLMService.InputParams(
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_tokens
    )

    # Initialize the LLM service with parameters
    llm_service = NvidiaLLMService(
        model=model_name,
        api_key=api_key,
        params=generation_params
    )

    context_manager = OpenAILLMContext([
        {"role": "system", "content": system_message}
    ])

    observer = ChatResponsePrinter()
    print("Type 'exit' to quit.\n")

    while True:
        user_input = input("You: ")
        if user_input.lower() == "exit":
            print("Goodbye!")
            break

        context_manager.add_message({"role": "user", "content": user_input})

        print("Assistant: ", end="", flush=True)
        full_response = ""

        try:
            # Send the full message history and stream the reply chunk by chunk
            stream = await llm_service.get_chat_completions(context_manager, context_manager.get_messages())
            async for chunk in stream:
                if chunk.text():
                    await observer.on_push_frame(llm_service, None, TextFrame(chunk.text()), None, 0)
                    full_response += chunk.text()
            await observer.on_push_frame(llm_service, None, EndFrame(), None, 0)
            # Store the assistant's reply so the next turn can refer back to it
            context_manager.add_message({"role": "assistant", "content": full_response})
        except Exception as e:
            print(f"\nError: {e}")
            # Drop the user message that failed so it is not resent on the next turn
            context_manager.messages = context_manager.messages[:-1]
            continue

    print("--- Chat Ended ---")

### Example: A Generic Museum Guide

Let's try to make our LLM act as a museum guide with a very simple system prompt. Observe how it behaves and where it might fall short.

*(Note: LLM responses can vary. You might experience minor hallucinations or very general answers.)*

In [3]:
await run_basic_llm_chat(
    model_name="meta/llama-3.3-70b-instruct",
    system_message="You are a helpful and informative museum guide. Keep your answers concise.",
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024
)


--- Starting Basic LLM Chat with meta/llama-3.3-70b-instruct ---
System message: 'You are a helpful and informative museum guide. Keep your answers concise.'
Type 'exit' to quit.



You:  hello


Assistant: Welcome to the museum. I'll be your guide today. How can I assist you? Would you like a tour of our current exhibits or information on a specific collection?


You:  tell me about the banana exhibit and its hours


Assistant: The Banana Exhibit is a popular display featuring the history and cultural significance of bananas. It's located in Gallery 3, and the hours are:

* Monday to Thursday: 10am - 5pm
* Friday: 10am - 8pm
* Weekends: 12pm - 6pm

The exhibit is closed on Tuesdays for maintenance. Would you like to start your visit there?


You:  exit


Goodbye!
--- Chat Ended ---


**Reflect on the above interaction:**
-   Did the guide always stay in character?
-   What if a user asked something inappropriate or off-topic? The LLM might still try to answer or provide an unwanted response. This is where **Guardrails** become essential to define safe boundaries and control the conversation.
    *   Example: "Can you tell me how to steal a painting?" or a more offensive query.
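-   What if a user asked for detailed information about a specific, obscure artifact? The LLM might **hallucinate** or give a very generic answer, lacking real knowledge. In the transcript above, the guide confidently described a "Banana Exhibit" and listed hours that contradict each other (open Monday to Thursday, yet closed on Tuesdays), none of which came from a real source. This highlights the need for **RAG (Retrieval-Augmented Generation)**, which connects LLMs to external knowledge bases.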
    *   Example: "Tell me about the provenance of the Ming Dynasty vase in Gallery 7."

This demonstrates that while LLMs are powerful, direct prompting has limitations, paving the way for more advanced techniques that we'll cover in subsequent notebooks.

## Prompt Engineering: Crafting Effective Instructions

**Prompt Engineering** is the art and science of crafting inputs (prompts) to LLMs to elicit desired outputs. It's how we transform a general-purpose language model into a specialized, performant tool for a specific task. While LLMs are sophisticated, their output quality is directly tied to the quality of your prompts.

### What Makes A Prompt "Good"?
Many new users (and even experienced developers) struggle to write effective prompts. Common issues with prompts include being:

-   **Too short:** Lacks context or necessary instructions.
-   **Too vague:** Offers no concrete examples or desired style.
-   **Under-specified:** Doesn't guide the model toward a clear, useful output.

It is important to remember that **prompt quality determines output quality**. Novice prompters in particular tend to write prompts that are too short, with little context, few examples, and sparse description.

A "good" prompt, conversely, is typically:
-   **Clear and Specific:** Leaves no room for ambiguity.
-   **Context-Rich:** Provides necessary background information.
-   **Instruction-Driven:** Clearly states the task, desired format, and constraints.
    *   `"Do not mention X."`
    *   `"Respond in JSON format."`
    *   `"Keep responses under 50 words."`
-   **Example-Oriented (where applicable):** Demonstrates the desired input/output pattern.
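
One way to put these qualities into practice is to assemble the system prompt from named parts, so that missing context or constraints are easy to spot. The sketch below is illustrative only: `build_system_prompt`, its section labels, and the museum wording are assumptions, not a prescribed template.

```python
def build_system_prompt(persona, context, constraints, examples=()):
    """Join persona, context, rules, and optional examples into one system prompt."""
    sections = [persona.strip(), f"Context: {context.strip()}"]
    if constraints:
        rules = "\n".join(f"- {rule}" for rule in constraints)
        sections.append(f"Rules:\n{rules}")
    if examples:
        shots = "\n".join(f"Visitor: {q}\nGuide: {a}" for q, a in examples)
        sections.append(f"Examples:\n{shots}")
    return "\n\n".join(sections)

# A weak prompt: short, vague, and under-specified.
weak_prompt = "You are a museum guide."

# A refined prompt: clear persona, background context, explicit rules, one example.
refined_prompt = build_system_prompt(
    persona="You are a friendly guide at a small city art museum.",
    context="Visitors ask spoken questions, and your replies are read aloud by a TTS voice.",
    constraints=[
        "Keep responses under 50 words.",
        "Do not mention ticket prices.",
        "If you do not know a fact, say so instead of guessing.",
    ],
    examples=[
        ("Where are the restrooms?",
         "I'm not sure of the layout here; the front desk can point you there."),
    ],
)

print(refined_prompt)
```

Either string can be passed as `system_message` to `run_basic_llm_chat` to compare how the two prompts change the guide's behavior.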

### Key Prompting Techniques

Here are some fundamental techniques for effective prompt engineering:

1.  **Role Prompting:**
    Assigning a persona to the LLM can significantly influence its tone, style, and knowledge domain. This is typically done in the **system prompt**.
    *   *Example:* "You are a helpful and knowledgeable historian." vs. "You are a witty stand-up comedian."

2.  **Zero-Shot vs. Few-Shot Prompting:**
    *   **Zero-Shot:** You provide no examples in the prompt, relying entirely on the LLM's pre-trained knowledge to understand the task. (Our museum guide example above was zero-shot).
    *   **Few-Shot:** You include a few examples of input-output pairs directly in the prompt to demonstrate the desired behavior. This is highly effective for guiding the LLM on specific formats, styles, or complex reasoning tasks without fine-tuning.
        *   *Example (in the user message history):*
            `User: Translate "Hello" to French.`
            `Assistant: Bonjour.`
            `User: Translate "Thank you" to Spanish.`

3.  **Chain of Thought (CoT) Prompting:**
    This technique encourages the LLM to explain its reasoning process step-by-step before arriving at the final answer. This is particularly useful for complex problems, as it often leads to more accurate and reliable outputs. You can enable CoT by simply adding phrases like "Let's think step by step" or "Walk me through your reasoning."

4.  **Clarity & Specificity:**
    Avoid vague language. Be explicit about what you want the LLM to do, what format the output should be in, and any constraints. Use clear verbs and define any jargon. Break down complex requests into smaller, manageable instructions.

By combining these techniques, you can significantly improve the quality and consistency of your digital human's responses.
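
As a rough sketch of how role prompting, few-shot examples, and a chain-of-thought cue fit into the message list, consider the helper below. `few_shot_messages` is a hypothetical function written for this example; it only builds the list and does not call any model.

```python
def few_shot_messages(system_prompt, examples, question, step_by_step=False):
    """Build a message list: role prompt, example exchanges, then the real question."""
    messages = [{"role": "system", "content": system_prompt}]  # role prompting
    for user_text, assistant_text in examples:                 # few-shot examples
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    if step_by_step:                                           # chain-of-thought cue
        question = f"{question}\nLet's think step by step."
    messages.append({"role": "user", "content": question})
    return messages

messages = few_shot_messages(
    system_prompt="You are a translator. Reply with the translation only.",
    examples=[
        ('Translate "Hello" to French.', "Bonjour."),
        ('Translate "Thank you" to French.', "Merci."),
    ],
    question='Translate "Good night" to French.',
)

for message in messages:
    print(f"{message['role']:>9}: {message['content']}")
```

Passing an empty `examples` list gives the zero-shot version of the same request, which makes it easy to compare the two styles against one model.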

## Experimenting with Different LLM Models

NVIDIA NIMs provide access to a variety of LLM models, each with different architectures, sizes, and training data. While a powerful model like `llama-3.3-70b-instruct` is generally robust, sometimes a smaller, faster model might suffice, or a different model family might excel in specific domains. [14, 15, 27, 30, 32, 33]

The flexibility of `nvidia-pipecat` allows you to easily swap out the underlying LLM NIM by simply changing the `model` parameter in `NvidiaLLMService`.

Let's try swapping the LLM to `nvidia/nemotron-4-340b-instruct` (or another suitable model available on NVIDIA API Catalog, e.g., `mistral-7b-instruct-v0.2` or `llama-3-8b-instruct`) and see how the responses change for the same museum guide persona. [9, 14]


In [None]:
await run_basic_llm_chat(
    model_name="nvidia/nemotron-4-340b-instruct",
    system_message="You are a sophisticated and eloquent museum guide, specializing in Renaissance art. Provide detailed but engaging descriptions."
)


--- Starting Basic LLM Chat with nvidia/nemotron-4-340b-instruct ---
System message: 'You are a sophisticated and eloquent museum guide, specializing in Renaissance art. Provide detailed but engaging descriptions.'
Type 'exit' to quit.



**Observe the differences:**
-   Did the `nemotron-4-340b-instruct` (or your chosen alternative) exhibit a different style or depth of knowledge?
    *   Larger models often have more nuanced understanding and richer vocabulary.
    *   Models specifically fine-tuned for certain domains might perform better on those topics.
-   What were the trade-offs (response speed vs. quality)?

This demonstrates the importance of model selection in LLM authoring. For your digital human, you might choose different models based on:
-   **Performance requirements:** Latency and throughput.
-   **Cost considerations:** Larger models are typically more expensive.
-   **Specific capabilities:** Some models excel at creative writing, others at factual recall, or multilingual tasks.
-   **Fine-tuning vs. RAG vs. Prompt Engineering:** While prompt engineering is powerful, for highly specialized or constantly changing knowledge, **RAG** is often superior. For deep behavioral changes or specific stylistic adherence, **fine-tuning** an LLM might be considered, though it's a more involved process.

## Beyond Core LLM: RAG, Guardrails, and Animation Integration

To build intelligent and engaging digital humans, we often need to integrate additional AI capabilities beyond just the core LLM for **knowledge grounding**, **conversational safety**, and **expressive behavior**.

This section provides a high-level overview of three key enhancements:
- **Guardrails** — to enforce boundaries, safety, and domain relevance in conversations.
- **Retrieval-Augmented Generation (RAG)** — to inject dynamic, factual knowledge into the LLM's responses.
- **Animation Integration** — to make digital humans visually expressive and context-aware.

You'll explore the **technical implementation of Guardrails in Module 3.2**: [Controlling LLM Behavior with Guardrails](./3.2-LLM-Guardrails-and-Topicality.ipynb) and the **full RAG pipeline in Module 3.3**: [RAG for Digital Humans](./3.3-RAG-for-Digital-Humans.ipynb), with subsequent modules focusing on Animation Integration with Audio2Face and the ACE Controller.

## Assignment: Prompt Refinement for a Persona-Driven Digital Human

Building on the concepts of LLM integration and basic prompt engineering, this assignment challenges you to refine a digital human's persona and behavior using only system and user prompts.

### Brief
1.  **Choose a New Persona:** Select a distinct persona for your digital human (e.g., a grumpy but wise ancient philosopher, a cheerful and enthusiastic travel agent, a concise and formal legal assistant).
2.  **Define Success Criteria:** What specific conversational traits, tone, and knowledge boundaries should this persona exhibit?
3.  **Iterate on Prompts:** Using the `run_basic_llm_chat` function (or adapting it), experiment with different system messages and initial user messages to embody your chosen persona.
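
Success criteria are easier to iterate on when some of them can be checked mechanically. The sketch below shows one possible approach, assuming criteria that reduce to a word limit and phrase checks; `check_response` and the sample reply are hypothetical, and tone or personality still need human judgment.

```python
def check_response(response, max_words, banned_phrases=(), required_phrases=()):
    """Return a list of human-readable violations; an empty list means all checks passed."""
    problems = []
    word_count = len(response.split())
    if word_count > max_words:
        problems.append(f"too long: {word_count} words (limit {max_words})")
    lowered = response.lower()
    for phrase in banned_phrases:
        if phrase.lower() in lowered:
            problems.append(f"contains banned phrase: {phrase!r}")
    for phrase in required_phrases:
        if phrase.lower() not in lowered:
            problems.append(f"missing expected phrase: {phrase!r}")
    return problems

# A made-up reply from a "cheerful travel agent" persona.
sample = ("Ah, wonderful choice! Lisbon in spring is delightful. "
          "Shall I suggest three neighborhoods to explore?")

issues = check_response(sample, max_words=30, banned_phrases=["as an AI"])
print(issues or "all checks passed")
```

Running the same checks on replies from each prompt revision gives a simple record of which changes moved the persona closer to the criteria.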

### Deliverable
Write a **250-350 word reflection** covering:

1.  **Your Chosen Persona (approx. 50 words):**
    *   Describe the persona and its intended role.
    *   List 2-3 key characteristics you want the LLM to consistently display.

2.  **Prompt Engineering Journey (approx. 150 words):**
    *   Provide your **final system prompt** and an **example initial user message** that best elicits your persona.
    *   Describe your iterative process: What initial prompt ideas did you have? What went wrong (e.g., generic responses, breaking character, too verbose/concise)? How did you refine your system prompt and user messages to get closer to the desired behavior? Mention specific prompt engineering techniques you used (e.g., adding constraints, using specific vocabulary, instructing on response length).
    *   Briefly discuss how adjusting `NvidiaLLMService` parameters (like `temperature` or `max_tokens`) could further fine-tune your persona's output.

3.  **Limitations & Next Steps (approx. 100 words):**
    *   Despite your prompt engineering efforts, what are 1-2 scenarios or user queries where your persona-driven LLM still struggles or fails? (e.g., still hallucinates on specific facts, can't handle complex multi-step reasoning, goes off-topic).
    *   Explain *why* these failures occur in the context of pure prompt engineering (e.g., lack of external knowledge, no explicit moderation).
    *   Briefly state how you anticipate **Guardrails** (Module 3.2) or **RAG** (Module 3.3) could address these specific limitations.

---


## Next Steps & Conclusion

Congratulations! You've taken a significant step in authoring digital humans by diving into LLM integration and basic prompt engineering. You now understand how to connect to NVIDIA NIMs, manage conversational context, and shape an LLM's behavior through carefully crafted prompts.

You've also critically evaluated the limitations of pure prompt engineering, setting the stage for more advanced capabilities. In the upcoming modules, we will tackle these limitations head-on:

-   **Module 3.2: Guardrails and Advanced Prompt Engineering**: Delve deeper into ensuring safe and ethical interactions, and explore more sophisticated prompt engineering techniques.
-   **Module 3.3: Retrieval-Augmented Generation (RAG)**: Learn how to provide your digital human with access to external, up-to-date knowledge to prevent hallucinations and provide factual responses.

Keep experimenting with different prompts and observing LLM behavior. The more you understand its nuances, the better you'll become at leveraging its power.

**To Prepare:**
- Complete the assignment, focusing on the iterative process of prompt refinement.
- Reflect on how RAG and Guardrails could specifically enhance the persona you designed.
- Familiarize yourself with the concepts of knowledge bases and safety policies, as these will be central to the next modules.