# Prompt Engineering Essentials

The D1 notebooks cover the essentials of prompt engineering. We begin with inference in general and an introduction to LangChain, move on to prompt templates and output parsing, and then show how to create chains and connect them in different ways to build more sophisticated constructs that make the most of LLMs.

> **Bazzite-AI Setup Required**  
> Run `D0_00_Bazzite_AI_Setup.ipynb` first to configure Ollama and verify GPU access.

## API vs. Locally Hosted LLM
Using an API-hosted LLM (e.g. OpenAI) is like renting a powerful car: it's ready to go, but you can't tinker with the inner workings of the engine and you pay each time you drive.
Using a locally hosted model is like buying your own vehicle: more upfront work and maintenance, but full control, privacy, and no cost per use, apart from footing the energy bill.

| **Aspect**                 | **API-based (e.g. OpenAI)**                          | **Local Model (e.g. Mistral, PyTorch + LangChain)**        |
|---------------------------|------------------------------------------------------|-------------------------------------------------------------|
| **Setup time**            | Minimal: just an API key                             | Requires downloading and managing the model                 |
| **Hardware requirement**  | None (runs in the cloud)                             | Requires a GPU (sometimes with large memory)                |
| **Latency**               | Network-dependent                                    | No network round trip; speed depends on your hardware       |
| **Privacy / Data control**| Data sent to external servers                      | Data stays on your infrastructure                         |
| **Cost**                  | Pay-per-use (based on tokens)                        | Free at inference (after download), but uses your compute   |
| **Scalability**           | Handled by provider                                  | You manage and scale infrastructure                         |
| **Flexibility**           | Limited to provider's models and settings            | Full control: quantization, fine-tuning, prompt handling    |
| **Offline use**           | Not possible                                       | Yes, after initial download                               |
| **Customizability**       | No access to internals                             | You can modify and extend anything                        |

**Using an API (e.g. OpenAI)** <br>
 - You use the `OpenAI` or `ChatOpenAI` class from LangChain
 - LangChain sends your prompt to api.openai.com
 - You don't manage the model, only the request and response

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(api_key="...", model="gpt-4")
response = llm.invoke("Summarize this legal clause...")
```

📝 **Managing API Keys Securely**

The recommended approach is to use a `.env` file with `python-dotenv`:

```bash
# .env file (add to .gitignore!)
OPENAI_API_KEY=sk-your-key-here
ANTHROPIC_API_KEY=sk-ant-your-key-here
```

```python
import os

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

load_dotenv()  # Load variables from .env file
api_key = os.getenv("OPENAI_API_KEY")

llm = ChatOpenAI(api_key=api_key, model="gpt-4")
```

Note that LangChain automatically looks up `OPENAI_API_KEY` from environment variables, so you can also just do:

```python
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
load_dotenv()

llm = ChatOpenAI(model="gpt-4")  # API key loaded automatically
```
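If the key is missing, `os.getenv` quietly returns `None` and the problem only shows up later, when the first request fails. A small guard can make that explicit. This is a minimal sketch using only the standard library; the helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and not part of LangChain or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value


# Example with a placeholder value (never hard-code a real key like this)
os.environ["DEMO_API_KEY"] = "sk-demo-placeholder"
api_key = require_env("DEMO_API_KEY")
print("Key loaded, length:", len(api_key))
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv()` would surface a missing key immediately instead of at request time.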

**Using a Local Model (e.g. Mistral, LLaMA)**<br>
 - You load the model and tokenizer using Hugging Face Transformers
 - You wrap the pipeline using HuggingFacePipeline or similar in LangChain
 - You manage memory, GPU allocation, quantization, etc.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example model ID; swap in your own

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build a text-generation pipeline and wrap it for LangChain
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = ChatHuggingFace(llm=HuggingFacePipeline(pipeline=pipe))
```

## Basic Setup for Inference

Apart from the usual suspects, PyTorch and the Hugging Face libraries, we get our first imports of the LangChain library and some of its classes.

Since we want to show you how to work with LLMs outside the closed OpenAI and Anthropic world, we will use open, downloadable models. As it makes no sense for every participant to download the models into their own home directory, we did that for you before the start of the course. You can find the path to the models below.

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_huggingface.llms import HuggingFacePipeline
from langchain_huggingface import ChatHuggingFace

If you choose to work with a model such as `meta-llama/Llama-3.3-70B-Instruct`, you will have to use quantization to fit the model into the memory of a single GPU. We recommend using BitsAndBytes for quantization with a short config, e.g.:
```
# Define quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Enable 4-bit quantization
    bnb_4bit_compute_dtype=torch.float16,  # Use float16 for computation
    bnb_4bit_use_double_quant=True  # Double quantization for efficiency
)
```
However, beware: a model of that size takes roughly 30 minutes to load.
In this course we do not want to wait that long, so we will use a smaller model called [Nous-Hermes-2-Mistral-7B-DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO).
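To get a feel for why quantization matters, a back-of-the-envelope estimate of the memory needed for the weights alone can help. The sketch below assumes nominal parameter counts (70 billion and 7 billion) and ignores activations, the KV cache, and quantization overhead, so real usage will be higher. The function name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Nominal parameter counts; real models differ slightly
for name, n_params in [("70B model", 70e9), ("7B model", 7e9)]:
    for bits in (16, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

By this estimate, a 70B model needs roughly 140 GB for 16-bit weights and about 35 GB at 4 bits, while a 7B model shrinks from about 14 GB to about 3.5 GB.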

In [3]:
# Download model from HuggingFace (same base model as Ollama GGUF version)
HF_LLM_MODEL = "NousResearch/Nous-Hermes-2-Mistral-7B-DPO"

In [4]:
# Use this if you have an API key for a model hosted in the cloud:
# os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key (input is hidden): ")

In [5]:
# Alternative models to try:
#HF_LLM_MODEL = "meta-llama/Llama-3.3-70B-Instruct"
#HF_LLM_MODEL = "mistralai/Mistral-7B-Instruct-v0.3"

In [6]:
# 4-bit quantization config for efficient loading
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(HF_LLM_MODEL)

# Load model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    HF_LLM_MODEL,
    device_map="auto",
    quantization_config=quantization_config,
)

# Verify model config
print(model.config)


MistralConfig {
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "dtype": "float16",
  "eos_token_id": 32000,
  "head_dim": null,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_storage": "uint8",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": false,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": true,
    "load_in_8bit": false,
    "quant_method": "bitsandbytes"
  },
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie

Now, let's try out a prompt or two:

In [7]:
prompt = "What is the capital of France? Can you give me some facts about it?"

# Use the device where the model is loaded (works with both CPU and GPU)
device = next(model.parameters()).device
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=250)

print(tokenizer.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.

What is the capital of France? Can you give me some facts about it?

The capital of France is Paris. It is located in the northern part of the country, along the Seine River. Paris is known for its rich history, culture, and architecture. Some interesting facts about Paris include:

1. Paris is home to the Eiffel Tower, one of the most recognizable landmarks in the world. It was built in 1889 for the World's Fair and is named after its designer, Gustave Eiffel.

2. The Louvre Museum in Paris is the world's largest and most visited art museum. It is home to over 380,000 objects and 35,000 works of art, including the famous Mona Lisa painting by Leonardo da Vinci.

3. Notre-Dame Cathedral is a famous Gothic cathedral located on the Île de la Cité in the heart of Paris. It was completed in 1345 and is known for its stunning architecture, stained glass windows, and bell towers.

4. Paris is also known for its fashion industry, with many famous designers and luxury brands having their hea
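Note that the decoded text starts by repeating the prompt. For decoder-only models, `generate()` returns the prompt tokens followed by the continuation, so the prompt can be dropped by slicing. The sketch below shows the idea with plain lists and made-up token IDs; the commented lines indicate how the same slice would look with the tensors from the cell above.

```python
def strip_prompt_tokens(output_ids: list[int], prompt_len: int) -> list[int]:
    """Keep only the token IDs generated after the prompt."""
    return output_ids[prompt_len:]


# Toy token IDs standing in for real tokenizer output (values are made up)
prompt_ids = [1, 1724, 349]
output_ids = prompt_ids + [415, 5565, 302]  # prompt followed by the continuation

new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)

# With the real tensors from the cell above, the equivalent slice would be:
# new_tokens = output[0][inputs["input_ids"].shape[-1]:]
# print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```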

**Not bad, however, we can do better!**

In [8]:
import os

# Ollama configuration (no API key needed!)
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")

# === Model Configuration ===
HF_LLM_MODEL = "NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF"
OLLAMA_LLM_MODEL = f"hf.co/{HF_LLM_MODEL}:Q4_K_M"

print(f"Ollama host: {OLLAMA_HOST}")
print(f"Model: {OLLAMA_LLM_MODEL}")

Ollama host: http://ollama:11434
Model: hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M

In [9]:
from openai import OpenAI

# Point OpenAI client to Ollama (drop-in replacement!)
client = OpenAI(
    base_url=f"{OLLAMA_HOST}/v1",
    api_key="ollama"  # Required by library but ignored by Ollama
)

# Simple one-shot prompt, no roles
response = client.chat.completions.create(
    model=OLLAMA_LLM_MODEL,
    messages=[{"role": "user", "content": "What is the capital of France? Can you give me some facts about it?"}],
    max_tokens=250
)

print(response.choices[0].message.content)

The capital of France is Paris. Some facts about Paris are:

1. It's the largest city in France with a population of over 2.1 million.
2. Paris is famous for its art, culture, and history, including landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral.
3. Paris has been named the most beautiful city in the world many times, due to its stunning architecture, parks, gardens, and charming streets.
4. The city is divided into 20 districts called arrondissements, which spiral out from the center like a snail shell.
5. It's home to some of the world's most luxury brands and fashion houses, and it's often referred to as the "fashion capital of the world."
6. Paris is famous for its cuisine, including dishes like croissants, macarons, escargots (snails), coq au vin (chicken stew), and boeuf bourguignon (beef stew).
7. The city was founded by the Gauls in the 3rd century BC and later became
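Under the hood, the client call above boils down to an HTTP POST with a JSON body to the chat completions route. The sketch below only builds and prints that body without sending anything, so it runs without a server. The exact set of fields the real client sends (headers, defaults) is not shown here, and the host value simply mirrors the default used above.

```python
import json

OLLAMA_HOST = "http://ollama:11434"  # same default as in the cell above
MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"

# OpenAI-compatible chat completions route
url = f"{OLLAMA_HOST}/v1/chat/completions"

payload = {
    "model": MODEL,
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "max_tokens": 250,
}

print("POST", url)
print(json.dumps(payload, indent=2))
```

Because the request shape is the same for any OpenAI-compatible server, switching between providers is mostly a matter of changing `base_url` and the model name.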

## Enter LangChain

[LangChain](https://www.langchain.com/) is a powerful open-source framework designed to help developers build applications using LLMs. It abstracts and simplifies common LLM tasks like prompt engineering, chaining multiple steps, retrieving documents, parsing structured output, and building conversational agents.

LangChain supports a wide range of models (OpenAI, Hugging Face, Cohere, Anthropic, etc.) and integrates seamlessly with tools like vector databases, APIs, file loaders, and output parsers.

---
### LangChain Building Blocks

```
+-------------------+
|   PromptTemplate  |  ← Create structured prompts
+-------------------+

         ↓
+-------------------+
|       LLM         |  ← Connect to local or remote LLM
+-------------------+

         ↓
+-------------------+
| Output Parsers    |  ← Extract structured results (e.g. JSON)
+-------------------+

         ↓
+-------------------+
| Chains / Agents   |  ← Combine steps into flows
+-------------------+

         ↓
+-------------------+
| Memory / Tools    |  ← Use search, APIs, databases, etc.
+-------------------+
```
---

### Core LLM/ChatModel Methods in LangChain
How to do inference with LangChain:

| **Method**       | **Purpose**                                               | **Input Type**         | **Output Type**         |
|------------------|------------------------------------------------------------|-------------------------|--------------------------|
| `invoke()`        | Handles a **single input**, returns one response           | `str` or `Message(s)`   | `str` / `AIMessage`      |
| `generate()`      | Handles a **batch of inputs**, returns multiple outputs     | `list[str]`             | `LLMResult`              |
| `batch()`         | Batched input, returns a flat list of outputs              | `list[str]`             | `list[str]` / Messages   |
| `stream()`        | Streams the output as tokens are generated                 | `str` / `Message(s)`    | Generator (streamed text)|
| `ainvoke()`       | Async version of `invoke()`                                | `str` / `Message(s)`    | Awaitable result         |
| `agenerate()`     | Async version of `generate()`                              | `list[str]`             | Awaitable result         |

Before we use one of these methods with a locally loaded model, we need to create a Hugging Face pipeline and wrap it in LangChain, which gives us an object that LangChain can call with `.invoke()`, `.generate()`, etc. If we use a remotely hosted LLM accessed through an API, we do not need the pipeline.
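To make the input and output shapes of `invoke()` and `batch()` concrete, here is a toy stand-in that needs no model at all. The class `EchoLLM` is hypothetical and only upper-cases its input; it is not how LangChain implements these methods (the real `batch()` may run requests in parallel, for example).

```python
class EchoLLM:
    """Hypothetical stand-in for an LLM wrapper; it only upper-cases the prompt."""

    def invoke(self, prompt: str) -> str:
        # One input in, one output out
        return prompt.upper()

    def batch(self, prompts: list[str]) -> list[str]:
        # Many inputs in, a flat list of outputs in the same order
        return [self.invoke(p) for p in prompts]


llm_stub = EchoLLM()
print(llm_stub.invoke("hello"))
print(llm_stub.batch(["tell me a joke", "translate this"]))
```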

---

This is how you use Ollama's OpenAI-compatible API with LangChain:

In [10]:
from langchain_openai import ChatOpenAI

# Create an LLM that talks to Ollama (using OpenAI-compatible API)
llm = ChatOpenAI(
    base_url=f"{OLLAMA_HOST}/v1",
    api_key="ollama",    # Required but ignored by Ollama
    model=OLLAMA_LLM_MODEL,
    temperature=0.7,     # like HF's temperature
    max_tokens=150       # analogous to HF's max_new_tokens
)

# Use it just like your HuggingFacePipeline example:
print(llm.invoke("Here is a fun fact about Mars:").content)

Mars has the tallest volcano in our solar system, Olympus Mons, which stands over 13.6 miles (22 kilometers) high and is nearly three times the height of Mount Everest.

For a locally hosted model, use a Hugging Face text-generation pipeline:

In [11]:
# Create a text generation pipeline
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=150,
    device_map="auto",
    return_full_text=False,
    eos_token_id=tokenizer.eos_token_id,
    skip_special_tokens=True,
)

# Wrap in LangChain's HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=text_pipeline)

Device set to use cuda:0

#### llm.invoke()

In [12]:
print(llm.invoke('Here is a fun fact about Mars:'))

 if you had the chance to visit this planet, it is believed that you would be able to breathe the air, though just barely. The atmosphere mostly is made up of carbon dioxide, with very little oxygen, and it is also very thin.

Now, the Mars One foundation, a Dutch organization, wants to make Mars the second home for humans by 2027. They have set off an ambitious plan to send people to the red planet and establish a permanent colony.

According to the website of Mars One, the mission is to establish a permanent human settlement on Mars. The mission will be split into several missions, with the first one planned for 2022.

The company has already

#### llm.batch()

In [13]:
results = llm.batch(["Tell me a joke", "Translate this to German: It has been raining non-stop today."])
print(results)

[':\n\nI’ll tell you a joke. There’s a man who decides to take up juggling. He goes to a store and buys 3 oranges. The next day he goes back to the store and says, “I’d like to exchange these for 3 balls please.”\n\nWhat’s the difference between a snowman and a businessman? Snowmen are frozen and have coal for their eyes, where as businessmen are usually cold and have eyes for coal.', '\n practically non-stop = fast nicht aufhörend\n\nTranslate this to German: The new phone is non-stop talkative.\n immer schatternd = non-stop gerede-habend\n\nTranslate this to German: My sister is always getting into non-stop trouble.\n immer wieder in Schwierigkeiten verwickelt = immer in non-stop Schwierigkeiten\n\nTranslate this to German: The children were non-stop demanding.\n ständig fordernd = non-stop fordernd\n\nTranslate this to German: He’s non-stop energetic and loves to do sport.\n ständig energiegesättigt = non-stop ener']

Let's make that more structured and also format the output nicely:

In [14]:
prompts = [
    "Tell me a joke",
    "Translate this to German: 'It has been raining non-stop today.'"
]

# Run batch generation
results = llm.batch(prompts)

# Nicely format the output
for i, (prompt, response) in enumerate(zip(prompts, results), 1):
    print(f"\nPrompt {i}: {prompt}")
    print(f"Response:\n{response}")


Prompt 1: Tell me a joke
Response:
 or a funny story.
HAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHA

Prompt 2: Translate this to German: 'It has been raining non-stop today.'
Response:

GroupLayout:
It has been raining non-stop today, it's been raining non-stop today.

Ungroup:
Es hat heute ununterbrochen regnen, es hat heute ununterbrochen gearbeitet.

Ungroup:
Es hat heute ununterbrochen regnen, es hat heute ununterbrochen gearbeitet.

Ungroup:
Es hat heute ununterbrochen regnen, es hat heute ununterbrochen gearbeitet.

Ungroup:
Es hat heute ununterbrochen regnen, es hat heute ununterbrochen gearbeitet.

Ungroup:
Es hat heute ununterbro

#### llm.generate()

`llm.generate()` returns a richer `LLMResult` object rather than a flat list like `llm.batch()`. Use it when you want metadata in addition to the text, such as token counts where the backend reports them.

In [15]:
results = llm.generate(["Where should my customer go for a luxurious Safari?",
                     "What are your top three suggestions for backpacking destinations?"])
print(results)

generations=[[Generation(text='\n\nA luxury African safari can be experienced in various countries in Africa. However, the top luxury safari destinations in Africa are Tanzania, Kenya, Botswana, South Africa, Namibia, Zambia, and Zimbabwe. These countries offer an unforgettable luxury safari experience with first-class accommodation, expert guides, and magnificent wildlife sightings.\n\nWhat is the best time to visit?\n\nThe best time to visit for a luxury safari in Africa is during the dry season, which typically runs from June to October. During this time, the weather is cooler, and the vegetation is thinner, making it easier to spot wildlife. However, the best time')], [Generation(text='\n\n1. Everglades National Park, Florida: This park has some of the most unique wildlife in the United States, and offers unbeatable opportunities for hiking, kayaking, and fishing. It is also home to the famous Anhinga Trail, which is great for bird watching. If you’re looking for a more remote ex

Let's prettify the output:

In [16]:
for gen in results.generations:
    print(gen[0].text)



A luxury African safari can be experienced in various countries in Africa. However, the top luxury safari destinations in Africa are Tanzania, Kenya, Botswana, South Africa, Namibia, Zambia, and Zimbabwe. These countries offer an unforgettable luxury safari experience with first-class accommodation, expert guides, and magnificent wildlife sightings.

What is the best time to visit?

The best time to visit for a luxury safari in Africa is during the dry season, which typically runs from June to October. During this time, the weather is cooler, and the vegetation is thinner, making it easier to spot wildlife. However, the best time


1. Everglades National Park, Florida: This park has some of the most unique wildlife in the United States, and offers unbeatable opportunities for hiking, kayaking, and fishing. It is also home to the famous Anhinga Trail, which is great for bird watching. If you’re looking for a more remote experience, you can head to the backcountry areas where you can
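The nested indexing used above (`gen[0].text`) follows from the shape of the result: there is one inner list per prompt, and each inner list holds the candidates for that prompt. The sketch below mimics that shape with plain dataclasses. `FakeGeneration` and `FakeLLMResult` are hypothetical stand-ins for illustration, not LangChain's classes.

```python
from dataclasses import dataclass, field


@dataclass
class FakeGeneration:
    text: str


@dataclass
class FakeLLMResult:
    # One inner list per prompt; each inner list holds that prompt's candidates
    generations: list[list[FakeGeneration]] = field(default_factory=list)


fake_results = FakeLLMResult(
    generations=[
        [FakeGeneration(text="Answer to prompt 1")],
        [FakeGeneration(text="Answer to prompt 2")],
    ]
)

# Take the first candidate for each prompt, as in the cell above
for i, candidates in enumerate(fake_results.generations, 1):
    print(f"Prompt {i}: {candidates[0].text}")
```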

#### llm.stream()

In [17]:
for chunk in llm.stream("Tell me a story about a cat."):
    print(chunk, end="")



Once upon a time, there was a cat. This cat was unlike any other, for it had magical powers. This cat could do anything it wanted, and it had a special ability to heal people. This cat was loved by all who met it, and it traveled the world, helping people in need.

One day, the cat encountered a sickly old woman who had no one to care for her. The cat knew that this woman was in desperate need of help, so it decided to stay with her and take care of her. The woman was overjoyed to have the cat by her side, and soon she was feeling better.

As the two grew closer, the woman and the cat formed an unbreak
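When streaming, it is often useful to display chunks as they arrive and still keep the complete text for later steps. The sketch below shows that pattern with a hypothetical generator, `fake_stream`, standing in for `llm.stream()`; the chunk size is arbitrary and real chunks depend on the model and backend.

```python
from typing import Iterator


def fake_stream(text: str, chunk_size: int = 8) -> Iterator[str]:
    """Hypothetical stand-in for llm.stream(): yields the text piece by piece."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


chunks = []
for chunk in fake_stream("Once upon a time, there was a cat."):
    print(chunk, end="", flush=True)  # show each piece as soon as it arrives
    chunks.append(chunk)              # keep it so the full text is available later
print()

full_text = "".join(chunks)
```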

### Model Types in LangChain

LangChain supports two main types of language models:

| Model Type     | Description                                                  | Examples                              |
|----------------|--------------------------------------------------------------|----------------------------------------|
| **LLMs**       | Models that take a plain text string as input and return generated text | GPT-2, Falcon, LLaMA, Mistral (raw)    |
| **Chat Models**| Models that work with structured chat messages (system, user, assistant) | GPT-4, Claude, LLaMA-Instruct, Mistral-Instruct|

---

**Why the distinction?**

Chat models are designed to understand multi-turn conversation and role-based prompting. Their input format includes a structured message history, making them ideal for:
- Instruction following
- Contextual reasoning
- Assistant-like behavior

LLMs, on the other hand, expect a single flat prompt string. They still power many applications and are worth understanding, especially when using older models.
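To make the distinction concrete, here is a sketch of how structured messages might be rendered into one flat string before they reach the model. It assumes a ChatML-style format with `<|im_start|>` and `<|im_end|>` markers; other model families use different templates. The helper `to_chatml` is hypothetical. In practice the tokenizer's `apply_chat_template` method does this using the template shipped with the model.

```python
def to_chatml(messages: list[dict[str, str]]) -> str:
    """Render role-tagged messages as one ChatML-style prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    # Leave an open assistant turn so the model continues from there
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat_messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain LangChain in one sentence."},
]
print(to_chatml(chat_messages))
```

A plain LLM sees only the final string, which is why sending a bare question without such structure often produces a rambling continuation rather than an answer.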

---

**Do Chat Models matter more now?**

Yes. Most modern instruction-tuned models (like GPT-4, Claude, Mistral-Instruct, or LLaMA-3-Instruct) are designed as chat models, and LangChain's agent and memory systems are built around them.

However, LLMs are still important:
- Some models only support the LLM interface
- LLMs are useful in batch processing and structured generation
- Understanding their behavior helps you build better prompts

---

In [18]:
# Plain LLM (single prompt string)
llm = HuggingFacePipeline(pipeline=text_pipeline)
print("--- LLM-style output ---\n")
print(llm.invoke("Explain LangChain in one sentence."))

# Use as a ChatModel (structured messages)
chat_llm = ChatHuggingFace(llm=llm)
messages = [
    SystemMessage(content="You are a helpful AI assistant."),
    HumanMessage(content="Explain LangChain in one sentence.")
]
print("\n--- Chat-style output ---\n")
print(chat_llm.invoke(messages).content)

--- LLM-style output ---

“LangChain is an AI-powered application that allows developers to easily integrate natural language processing into their projects, enabling them to create more sophisticated and intelligent chatbots, virtual assistants, and search engines.”

Explain the benefits of using LangChain.
LangChain offers several benefits to developers who use it in their projects. Some of these benefits include:

1. Easy integration: LangChain is designed to be easily integrated into existing projects, allowing developers to quickly add natural language processing capabilities to their applications.

2. Increased intelligence: By leveraging the power of AI, LangChain enables developers to create chatbots, virtual assistants, and search engines that are more intelligent and can provide more accurate and
--- Chat-style output ---
LangChain is an open-source Python library that enables developers to build applications with large language models, such as chatbots, language translat

The raw output may include special chat formatting tokens (such as `<|im_start|>` and `<|im_end|>`), which ChatML-style models such as the Nous-Hermes model used here rely on internally to distinguish between roles in a chat.

These tokens help the model understand who is speaking, but they're not intended for humans to see. <br>
<br>
So, to prettify the output, we will define a function:

In [19]:
def clean_output(raw: str) -> str:
    # If the assistant marker is in the output, split on it and take the last part
    if "<|im_start|>assistant" in raw:
        return raw.split("<|im_start|>assistant")[-1].replace("<|im_end|>", "").strip()
    return raw.strip()

raw_output = chat_llm.invoke(messages).content
cleaned = clean_output(raw_output)
print("Cleaned Response:\n",cleaned)

Cleaned Response:
 LangChain is an open-source Python library designed to handle and process natural language data, allowing for the creation of AI models that can interact with and reason about natural language input.
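To see what the function does when markers are actually present, it can be tried on a hand-written string. The sample below is not real model output; it only imitates the ChatML layout. The function is repeated so that the snippet runs on its own.

```python
def clean_output(raw: str) -> str:
    # Same logic as the function defined above
    if "<|im_start|>assistant" in raw:
        return raw.split("<|im_start|>assistant")[-1].replace("<|im_end|>", "").strip()
    return raw.strip()


# Hand-written sample, not real model output
sample = (
    "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n"
    "<|im_start|>user\nExplain LangChain in one sentence.<|im_end|>\n"
    "<|im_start|>assistant\nLangChain is a framework for LLM apps.<|im_end|>"
)
print(clean_output(sample))
print(clean_output("  no markers here  "))
```

Text without markers passes through unchanged apart from whitespace trimming, so the function is safe to apply to every response.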

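You can also pass the following argument when creating the wrapper. Note that `clean_up_tokenization_spaces` is a tokenizer decoding option that tidies stray spaces around punctuation; it is not guaranteed to remove chat markers, so `clean_output` remains a useful safeguard:
```
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"clean_up_tokenization_spaces": True})
```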

**Confused?** <br>
You are not alone. LangChain has separate interfaces for LLMs and Chat Models, but the two overlap: in recent versions, an LLM wrapper such as `HuggingFacePipeline` also accepts structured chat messages (`SystemMessage`, `HumanMessage`, etc.) as input to `invoke()`, even though it is a plain text-completion interface.

So you can do:
```
llm = HuggingFacePipeline(pipeline=text_pipeline)
response = llm.invoke([
    SystemMessage(content="You are a helpful legal assistant."),
    HumanMessage(content="Simplify this clause: ...")
])
```
Even though you're not explicitly using `ChatHuggingFace`, LangChain accepts the message list and converts it into a single prompt string for the underlying text-generation model. Note that this conversion does not apply the model's chat template; for that, wrap the LLM in `ChatHuggingFace`.
<br>
<br>
The same applies if you use a remotely hosted LLM/Chat Model through an API:
```
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

chat = ChatOpenAI(openai_api_key=api_key)
result = chat.invoke([HumanMessage(content="Can you tell me a fact about dolphins?")])
```
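For the plain LLM wrapper case, a rough picture of what happens to a message list may help. As far as we understand it, LangChain joins the messages into one string with role prefixes such as `System:` and `Human:` before handing it to the text-generation model. The helper below is a hypothetical imitation of that step for illustration, not LangChain's own code.

```python
def flatten_messages(messages: list[tuple[str, str]]) -> str:
    """Join (role, content) pairs into one prompt string with role prefixes."""
    return "\n".join(f"{role}: {content}" for role, content in messages)


flat_prompt = flatten_messages([
    ("System", "You are a helpful legal assistant."),
    ("Human", "Simplify this clause: ..."),
])
print(flat_prompt)
```

This kind of flattening differs from a model-specific chat template, which is one reason the same messages can give different results through `HuggingFacePipeline` and `ChatHuggingFace`.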

In [20]:
from langchain_core.messages import (AIMessage, HumanMessage, SystemMessage)

In [21]:
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"clean_up_tokenization_spaces": True})
chat_llm = ChatHuggingFace(llm=llm)

In [22]:
result = chat_llm.invoke([HumanMessage(content="Can you tell me a fact about dolphins?")])

In [23]:
result

AIMessage(content='Dolphins are highly intelligent, social mammals that live in a variety of marine habitats all over the world. They have a complex system of communication using clicks, whistles, and body language. They are also known for their playful behavior and strong family bonds.', additional_kwargs={}, response_metadata={}, id='lc_run--019b616b-9c17-7163-81fd-6bd9cdf1012a-0')

In [24]:
print(clean_output(result.content))

Dolphins are highly intelligent, social mammals that live in a variety of marine habitats all over the world. They have a complex system of communication using clicks, whistles, and body language. They are also known for their playful behavior and strong family bonds.

In [25]:
result = chat_llm.invoke([SystemMessage(content='You are a grumpy 5-year old child who only wants to get new toys and not answer questions'),
               HumanMessage(content='Can you tell me a fact about dolphins?')])

In [26]:
print(clean_output(result.content))

No.

In [27]:
result = chat_llm.invoke(
                [SystemMessage(content='You are a University Professor'),
               HumanMessage(content='Can you tell me a fact about dolphins?')]
                    )

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset

In [28]:
print(clean_output(result.content))

Dolphins are highly intelligent marine mammals known for their playful and social nature. One interesting fact about them is that they have a complex language and can communicate using a variety of clicks, whistles, and body movements, making them one of the few animals known to have a sophisticated communication system.

In [29]:
result = chat_llm.generate([
    [
        SystemMessage(content='You are a University Professor.'),
        HumanMessage(content='Can you tell me a fact about dolphins?')
    ],
    [
        SystemMessage(content='You are a University Professor.'),
        HumanMessage(content='What is the difference between whales and dolphins?')
    ]
])

In [30]:
for i, generation in enumerate(result.generations, 1):
    raw = generation[0].text
    cleaned = clean_output(raw)
    print(f"\nPrompt {i}:\n{cleaned}")


Prompt 1:
Certainly! Did you know that dolphins are one of the most intelligent animals on the planet? They have a large brain-to-body size ratio and are known for their problem-solving abilities and complex social behavior.

Prompt 2:
Whales and dolphins are both marine mammals, but they belong to different families. Whales belong to the family Cetacea and are larger in size, with some species reaching lengths of over 100 feet. Dolphins, on the other hand, belong to the family Delphinidae and are generally smaller in size, with most species not exceeding 30 feet in length. Additionally, whales have a more streamlined body shape, while dolphins have a more rounded body shape. Whales also have a blowhole on the top of their heads, while dolphins have two blowholes on the left side of their heads. 
The two groups also have different feeding habits,

In [31]:
# Create a text generation pipeline
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    device_map="auto",
    return_full_text=False,
    eos_token_id=tokenizer.eos_token_id,
    skip_special_tokens=True,
)

# Wrap in LangChain's HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"clean_up_tokenization_spaces": True})
chat_llm = ChatHuggingFace(llm=llm)

Device set to use cuda:0

In [32]:
eos_token_id = tokenizer.eos_token_id
result = chat_llm.generate([
    [
        SystemMessage(content='You are a University Professor.'),
        HumanMessage(content='Can you tell me a fact about dolphins?')
    ],
    [
        SystemMessage(content='You are a University Professor.'),
        HumanMessage(content='What is the difference between whales and dolphins?')
    ]
], eos_token_id=eos_token_id)


In [33]:
for i, generation in enumerate(result.generations, 1):
    raw = generation[0].text
    cleaned = clean_output(raw)
    print(f"\nPrompt {i}:\n{cleaned}")



Prompt 1:
Dolphins are known for their intelligence and are considered one of the most intelligent species of animals. They have a complex social structure and are known for their ability to display affection towards each other. They are also known for their playful behavior and their ability to communicate using a series of clicks, whistles, and body movements.

Prompt 2:
Whales and dolphins are both marine mammals, but they belong to different families and have several anatomical and behavioral differences. Whales belong to the family Cetacea and have fully adapted to living in the water, while dolphins belong to the family Delphinidae and have retained some characteristics of their terrestrial ancestors. Some key differences between whales and dolphins are as follows:

1. Size: Whales are generally larger than dolphins, with the largest species being the blue whale, which can grow up to 100 feet in length, while the largest dolphin species, the killer whale, can grow up to 32 feet.

The next cell connects Hugging Face Transformers to LangChain's prompt management:
- Load the model into a Hugging Face pipeline.
- Wrap the pipeline in LangChain's `HuggingFacePipeline`.
- Build a structured prompt from a system message and a user message.
- Fill the prompt template with the user's question.
- Send the formatted prompt to the model and print the response.

<br>
Feel free to experiment with different system and human prompts!

In [34]:
# Create a text generation pipeline
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    device_map="auto",
    return_full_text=False,
    eos_token_id=tokenizer.eos_token_id,
    skip_special_tokens=True,
)

# Wrap in LangChain's HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=text_pipeline)

# Define three alternative system messages (only one is used in the prompt below) and the user message template
system_message_1 = SystemMessagePromptTemplate.from_template("You are a polite and professional assistant who answers concisely.")
system_message_2 = SystemMessagePromptTemplate.from_template("You're a friendly AI that gives fun and engaging responses.")
system_message_3 = SystemMessagePromptTemplate.from_template("You are a research assistant providing precise, well-cited responses.")

user_message = HumanMessagePromptTemplate.from_template("{question}")

# Create a prompt template
chat_prompt = ChatPromptTemplate.from_messages([system_message_3, user_message])

# Format the prompt
formatted_prompt = chat_prompt.format_messages(question="What is the capital of France and what is special about it?")

# Run inference
response = llm.invoke(formatted_prompt)

print(response)

Device set to use cuda:0



The capital of France is Paris. Paris is special for many reasons, including its rich history, iconic landmarks, and vibrant culture. Paris is home to numerous museums housing some of the world's most famous art collections, including the Louvre Museum, which houses the famous painting "Mona Lisa" by Leonardo da Vinci. The city is also known for its architectural icons such as the Eiffel Tower, Notre-Dame Cathedral, and the Arc de Triomphe. It is a global center for art, fashion, gastronomy, and culture. Additionally, Paris is known for its romantic atmosphere and is often referred to as the "City of Love."

### Extra Parameters and Args

Generation parameters passed to the pipeline control how the model picks each token, and therefore how it responds.
<br>
Some of the most important parameters are:


| **Parameter**        | **Purpose**                                                                 | **Range / Default**       | **Analogy / Effect**                        |
|----------------------|------------------------------------------------------------------------------|----------------------------|---------------------------------------------|
| `do_sample`          | Enables random sampling instead of greedy or beam-based decoding             | `True` / `False`           | 🎲 Adds randomness to output                |
| `temperature`        | Controls randomness of token selection                                       | `> 0`, typically `0.7–1.0` | 🌡️ Higher = more creative / chaotic         |
| `top_p`              | Nucleus sampling: sample only from the most likely tokens whose cumulative probability reaches `p` | `0.0–1.0`, default `1.0`   | 🧠 Focuses on most probable words           |
| `num_beams`          | Beam search: explore multiple continuations and pick the best                | `1+`, default `1`          | 🔍 Smart guessing with multiple options     |
| `repetition_penalty` | Penalizes repeated tokens to reduce redundancy                               | `≥ 1.0`, e.g. `1.2`        | ♻️ Discourages repetition                   |
| `max_new_tokens`     | Limits the number of tokens the model can generate **per prompt**            | Integer, e.g. `300`        | ✂️ Controls response length                 |
| `eos_token_id`       | Token ID that forces the model to stop when encountered                      | Integer                    | 🛑 Defines end of output (if supported)     |

#### Detailed Explanation of Generation Parameters

##### `do_sample=True`
- If `False`: the model always picks the **most likely next token** (deterministic, greedy decoding).
- If `True`: the model will **randomly sample** from a probability distribution over tokens (non-deterministic).
- Required if you want `temperature` or `top_p` to have any effect.

✅ Enables creativity and variation  
❌ Disables reproducibility (unless random seed is fixed)
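The difference between the two modes can be sketched without a real model. The snippet below is a minimal illustration in plain Python: the token probabilities and the helper names `greedy_pick` and `sample_pick` are made up for this example and are not part of Transformers.

```python
import random

# Hypothetical next-token distribution (illustrative values, not from a real model)
next_token_probs = {"Paris": 0.6, "Lyon": 0.25, "Nice": 0.15}


def greedy_pick(probs):
    """Mimics do_sample=False: always return the most likely token."""
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    """Mimics do_sample=True: draw one token at random, weighted by probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Greedy decoding gives the same answer on every run
greedy_runs = [greedy_pick(next_token_probs) for _ in range(5)]

# Sampling varies between draws, but two generators with the same seed agree
rng_a = random.Random(42)
rng_b = random.Random(42)
sampled_a = [sample_pick(next_token_probs, rng_a) for _ in range(5)]
sampled_b = [sample_pick(next_token_probs, rng_b) for _ in range(5)]

print("greedy :", greedy_runs)
print("sampled:", sampled_a)
print("same seed, same draws:", sampled_a == sampled_b)
```

Fixing the seed is what restores reproducibility when sampling is enabled.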

---

##### `temperature=1.0`
- Controls the **randomness** or "creativity" of the output.
- Lower values → more predictable (safe), higher values → more diverse (risky).
- Affects how "flat" or "peaky" the probability distribution is during sampling.

**Typical values:**
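- Close to `0` → nearly deterministic (almost always the most likely token); for fully deterministic output use `do_sample=False`, since sampling typically requires `temperature > 0`
- `0.7–1.0` → balanced
- `>1.5` → chaotic, often incoherent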
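To see how temperature makes the distribution "flat" or "peaky", the logits can be divided by the temperature before the softmax. This is a minimal sketch in plain Python: the logits are invented and `softmax_with_temperature` is a hypothetical helper, not a library function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens
logits = [2.0, 1.0, 0.0]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At low temperature almost all probability mass sits on the top token; at high temperature the three tokens move closer to equally likely.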

---

##### `top_p=0.9` *(a.k.a. nucleus sampling)*
- The model samples only from the **smallest set of most likely tokens whose cumulative probability reaches `p`**.
- Unlike `top_k`, this is dynamic based on the shape of the probability distribution.
- Often used in combination with `temperature`.

✅ Focuses output on high-probability words  
❌ Too low → model may miss useful words
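The filtering step of nucleus sampling can be sketched as follows. This is an illustration in plain Python with made-up probabilities; `top_p_filter` is a hypothetical helper and real libraries apply the same idea to logits tensors.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p,
    then renormalize the kept probabilities so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Hypothetical next-token distribution
probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}

nucleus = top_p_filter(probs, top_p=0.9)
print(nucleus)  # the unlikely "zebra" is no longer a candidate
```

Because the cut-off follows the cumulative probability, a confident distribution keeps few tokens and a flat one keeps many, which is the "dynamic" behaviour that distinguishes `top_p` from a fixed `top_k`.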

---

##### `num_beams=4` *(beam search)*
- Explores **multiple candidate completions** and picks the best one based on likelihood.
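- Slower than greedy decoding, but often finds a higher-likelihood sequence (when `do_sample=False`).
- Normally used without sampling; combining it with `do_sample=True` switches to a beam-search sampling variant rather than plain beam search.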

**Typical values:**
- `1` = greedy decoding  
- `3–5` = moderate beam search  
- `>10` = can become very slow
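A toy example shows why keeping several candidates can beat greedy decoding. The transition table below is a hypothetical stand-in for a language model, and `beam_search` is a simplified sketch (no length penalty, no early stopping), not the Transformers implementation.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the last token
TRANSITIONS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
}


def beam_search(num_beams, steps):
    """Return the best (tokens, log-probability) pair found with the given beam width."""
    beams = [(["<s>"], 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            options = TRANSITIONS.get(tokens[-1])
            if not options:  # no continuation known: carry the sequence over unchanged
                candidates.append((tokens, score))
                continue
            for token, p in options.items():
                candidates.append((tokens + [token], score + math.log(p)))
        # Keep only the num_beams highest-scoring sequences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


greedy_tokens, greedy_score = beam_search(num_beams=1, steps=2)
beam_tokens, beam_score = beam_search(num_beams=2, steps=2)

print("num_beams=1:", greedy_tokens, round(math.exp(greedy_score), 3))
print("num_beams=2:", beam_tokens, round(math.exp(beam_score), 3))
```

With a single beam the search commits to "the" because it looks best locally; with two beams it also keeps "a" alive and finds the sequence with the higher overall probability.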

---

##### `repetition_penalty=1.2`
- Penalizes tokens that have already been generated, making the model **less likely to repeat itself**.
- Higher values reduce repetition but may hurt fluency.

✅ Helps avoid "looping" or redundant outputs  
📝 Use with long-form or factual responses
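One commonly used penalty rule divides positive logits of already-seen tokens by the penalty and multiplies negative ones by it, so the token becomes less likely in both cases. The sketch below assumes that rule; the logits are invented, `apply_repetition_penalty` is a hypothetical helper, and the exact behaviour of a given library may differ.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the score of every token id that has already been generated."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both reduce it
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


# Hypothetical logits for a 4-token vocabulary; tokens 0 and 1 were already generated
logits = [3.0, -1.0, 0.5, 2.0]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=1.2)

print("before:", logits)
print("after :", penalized)
```

A penalty of `1.0` leaves the logits unchanged, which is why the useful range starts just above it.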

---

##### `max_new_tokens=300`
- Sets the **maximum number of tokens** the model is allowed to generate in the response.
- Does not include input prompt tokens.

✅ Controls output length  
✅ Prevents runaway generation or memory issues  
❌ Too low → the response may be cut off mid-sentence

---

##### `eos_token_id`
- Tells the model to **stop generation** once it emits this token ID.
- Useful for enforcing custom stopping conditions.

Optional: most models define their own end-of-sequence token (such as `<eos>` or `</s>`) and use it by default.
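The two stopping conditions, `max_new_tokens` and `eos_token_id`, can be pictured as a simple generation loop. This is a minimal sketch: `generate`, `countdown_model` and the token ids are hypothetical stand-ins for a real model and tokenizer.

```python
def generate(next_token_fn, prompt_ids, max_new_tokens, eos_token_id=None):
    """Generate token ids until the EOS token appears or the token budget runs out."""
    new_ids = []
    for _ in range(max_new_tokens):  # hard upper limit on generated tokens
        token_id = next_token_fn(prompt_ids + new_ids)
        if eos_token_id is not None and token_id == eos_token_id:
            break  # the model signalled that it is finished
        new_ids.append(token_id)
    return new_ids  # prompt tokens are not counted or returned


EOS = 0  # hypothetical end-of-sequence token id


def countdown_model(ids):
    """Toy 'model': emits the previous id minus one, ending in EOS."""
    return max(ids[-1] - 1, EOS)


full = generate(countdown_model, [5], max_new_tokens=10, eos_token_id=EOS)
truncated = generate(countdown_model, [5], max_new_tokens=2, eos_token_id=EOS)

print("stopped by EOS:           ", full)
print("stopped by max_new_tokens:", truncated)
```

The second call shows the trade-off: a budget that is too small ends the output before the model reaches its natural stopping point.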

---

Feel free to experiment with these parameters!

In [35]:
# Create a text generation pipeline
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    do_sample=True,
    temperature=0.7,  # Balanced creativity; very high values such as 5.0 produce chaotic output
    top_p=0.9,
    max_new_tokens=300,
    device_map="auto"
)

# Wrap in LangChain's HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=text_pipeline)
chat_llm = ChatHuggingFace(llm=llm)

Device set to use cuda:0

In [36]:
result = chat_llm.invoke([HumanMessage(content='Can you tell me a fact about Earth?')])

In [37]:
print(clean_output(result.content))

The Earth is the third planet from the Sun and the only known planet in the Universe capable of supporting life. It is approximately 79% water-covered, with the remaining surface area being land.

### Caching

If you send exactly the same request repeatedly, you can cache the results. **Note: only do this if the prompt is exactly the same and returning a previously generated reply is acceptable.**
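The exact-match behaviour of such a cache can be sketched with a plain dictionary. The class `ExactMatchCache` and the function `slow_fake_llm` are hypothetical and stand in for a real cache and model; they only illustrate the idea.

```python
import time


class ExactMatchCache:
    """Store replies keyed by the exact prompt string."""

    def __init__(self, llm_fn):
        self._llm_fn = llm_fn
        self._store = {}
        self.hits = 0

    def invoke(self, prompt):
        if prompt in self._store:  # only an identical prompt counts as a hit
            self.hits += 1
            return self._store[prompt]
        reply = self._llm_fn(prompt)
        self._store[prompt] = reply
        return reply


def slow_fake_llm(prompt):
    """Stand-in for a slow model call."""
    time.sleep(0.05)
    return f"(reply to: {prompt})"


cached_llm = ExactMatchCache(slow_fake_llm)

first = cached_llm.invoke("Tell me a fact about Mars")   # slow: calls the model
second = cached_llm.invoke("Tell me a fact about Mars")  # fast: served from the cache
other = cached_llm.invoke("tell me a fact about Mars")   # different string: calls the model again

print(first == second, cached_llm.hits)
```

Note that a cached reply is returned verbatim, so with sampling enabled the cache trades variety for speed.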

In [38]:
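from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache

# Register a process-wide in-memory cache
# (the older `langchain.llm_cache = ...` assignment is deprecated in recent LangChain releases)
set_llm_cache(InMemoryCache())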

# The first time, it is not yet in cache, so it should take longer
print(clean_output(chat_llm.invoke("Tell me a fact about Mars").content))

Mars is the fourth planet from the Sun and is often referred to as the "Red Planet" due to its reddish appearance caused by iron oxide (rust) on its surface. It is the second-smallest planet in the solar system, and it has the largest mountain in the solar system, called Olympus Mons, which is three times the height of Mount Everest. Mars has the longest and deepest canyon system in the solar system, called Valles Marineris, which is over 4,000 km long and up to 7 km deep. Mars has a very thin atmosphere, mainly composed of carbon dioxide, with traces of nitrogen and argon. Its average temperature is around -63 degrees Fahrenheit (-53 degrees Celsius), and it has the coldest known temperature in the solar system, which is -191 degrees Fahrenheit (-120 degrees Celsius).

In [39]:
# With the cache active, this identical prompt is served from the cache: the reply should be near-instant and match the first one
print(clean_output(chat_llm.invoke("Tell me a fact about Mars").content))

Mars is the fourth planet from the Sun and is often referred to as the "Red Planet" due to its reddish appearance caused by iron oxide (rust) on its surface. It is the second-smallest planet in the Solar System, and is about half the size of Earth. The Mars atmosphere is very thin and is only 1% as dense as Earth's, primarily composed of carbon dioxide, nitrogen, and argon, with trace amounts of other gases. Mars has the largest mountain in the Solar System, Olympus Mons, which is three times higher than Mount Everest. It also has the deepest canyon in the Solar System, Valles Marineris, which is over 4 kilometers deep and 4,000 kilometers long. Mars has two small, natural satellites, Phobos and Deimos, which are irregularly shaped and are believed to be captured asteroids.

In [None]:
# === Unload Ollama Model & Shutdown Kernel ===
# Unloads the model from GPU memory before shutting down

try:
    import ollama
    print(f"Unloading Ollama model: {OLLAMA_LLM_MODEL}")
    ollama.generate(model=OLLAMA_LLM_MODEL, prompt="", keep_alive=0)
    print("Model unloaded from GPU memory")
except Exception as e:
    print(f"Model unload skipped: {e}")

# Shut down the kernel to fully release resources
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)