# Introduction to Prompt Engineering with Large Language Models

In this notebook, you'll learn how to interact with large language models (LLMs) using prompting.

We will explore:

- Basic prompts
- Improving prompts with context
- Few-shot examples
- Using prompt templates

<a href="https://colab.research.google.com/github/cbadenes/semantic-report-search/blob/main/data/analysis/40_prompting_basics.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
</a>


In [1]:
from huggingface_hub import login
import getpass

token = getpass.getpass("🔑 Enter your Hugging Face token: ")
login(token)


🔑 Enter your Hugging Face token: ··········


### Parameters for `pipeline("text-generation", ...)`

- **model**: A loaded model instance (e.g., created with `AutoModelForCausalLM.from_pretrained`).
- **tokenizer**: The tokenizer that matches the model (e.g., created with `AutoTokenizer.from_pretrained`).

#### Generation Parameters:

- **max_length**:
  - Caps the total sequence length: prompt tokens plus generated tokens.
  - Use this when you need a hard limit on the overall sequence size.
  - ⚠️ If `max_new_tokens` is also set, it takes precedence over `max_length`.

- **max_new_tokens**:
  - Limits the number of *new* tokens the model should generate.
  - Use this instead of `max_length` when input varies in length.
  - Recommended for most prompting scenarios.

- **truncation**:
  - If `True`, truncates the tokenized input so it fits within the model's maximum input length.
  - Useful when the prompt contains long text.

- **do_sample**:
  - If `True`, the model samples from the probability distribution (randomness).
  - If `False`, it picks the most likely next token (greedy decoding).
  - For more varied, creative output, set it to `True` and tune `temperature`.

- **temperature**:
  - Controls the randomness of sampling.
  - Lower values (e.g., 0.3) make output more deterministic; higher values (e.g., 0.8) make it more varied.
  - Only takes effect when `do_sample=True`; greedy decoding ignores it.

- **return_full_text**:
  - If `True`, the result includes both the prompt and generated text.
  - If `False`, it returns only the model's output.
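
To make the effect of `temperature` and `do_sample` more concrete, here is a minimal sketch in plain Python. It is not how `transformers` implements sampling internally; the function name `softmax_with_temperature` and the three scores are hypothetical, chosen only to show that dividing the scores by a lower temperature concentrates probability on the top token.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens (not real model output).
logits = [2.0, 1.0, 0.5]

sharp = softmax_with_temperature(logits, 0.3)  # lower temperature
flat = softmax_with_temperature(logits, 0.8)   # higher temperature

# Greedy decoding (do_sample=False) always picks the highest-scoring token,
# so temperature plays no role in that choice.
greedy_index = max(range(len(logits)), key=lambda i: logits[i])

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(greedy_index)
```

With the lower temperature, the first token receives a larger share of the probability mass, so sampled outputs vary less from run to run.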


In [5]:
!pip install -q transformers accelerate

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", token=token)

Device set to use cuda:0


In [8]:
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,          # Cap on prompt + generated tokens; overridden for generation by max_new_tokens below
    truncation=True,         # Truncate the input if it exceeds the model's maximum length
    do_sample=True,          # Sample from the distribution; set False for greedy decoding
    return_full_text=True,   # True returns prompt + output; False returns only the generated part
    temperature=0.9,         # Sampling randomness; lower = more deterministic
    max_new_tokens=10        # Generate at most 10 new tokens; takes precedence over max_length
)


Device set to use cuda:0
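
The pipeline call above sets both `max_length` and `max_new_tokens`, so it helps to see how the two limits interact. The sketch below is plain arithmetic, not library code: `generation_budget` is a hypothetical helper, and it assumes the precedence described earlier (when both are set, `max_new_tokens` wins).

```python
def generation_budget(prompt_tokens, max_length=None, max_new_tokens=None):
    """Return how many new tokens may be generated for a prompt of the given size."""
    if max_new_tokens is not None:
        # Assumption: max_new_tokens takes precedence when both limits are set.
        return max_new_tokens
    if max_length is not None:
        # max_length counts the prompt too, so a longer prompt leaves less room.
        return max(max_length - prompt_tokens, 0)
    raise ValueError("set max_length or max_new_tokens")

# A hypothetical 40-token prompt under the three possible configurations.
print(generation_budget(40, max_length=512))                     # only max_length
print(generation_budget(40, max_length=512, max_new_tokens=10))  # both set
print(generation_budget(600, max_length=512))                    # prompt exceeds the cap
```

With only `max_length`, the budget shrinks as the prompt grows and can reach zero. With `max_new_tokens`, the budget stays fixed regardless of prompt length, which is why it suits prompts of varying size.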


### Basic Prompt

In [10]:
prompt = (
    "<|system|>\n"
    "You are a helpful assistant.\n"
    "<|user|>\n"
    "Summarize the purpose of a report that analyzes the distribution flows in feeder markets.\n"
    "<|assistant|>\n"
)
response = generator(prompt)[0]['generated_text']
print(response)


<|system|>
You are a helpful assistant.
<|user|>
Summarize the purpose of a report that analyzes the distribution flows in feeder markets.
<|assistant|>
A report analyzing the distribution flows in feeder
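
The hand-written prompt string and the echoed prompt in the output can both be handled with small helpers. The sketch below uses hypothetical functions, `build_chat_prompt` and `extract_reply`, that reproduce the `<|role|>` layout used in this notebook and strip the prompt from a full-text result. It does not call the model; the generated text is a stand-in copied from the output above. In practice, the tokenizer's own chat template (for example via `tokenizer.apply_chat_template`) is the safer source of the exact format, since it may include end-of-turn tokens that a hand-written string omits.

```python
def build_chat_prompt(messages):
    """Join role-tagged messages into the <|role|> layout used in this notebook."""
    parts = [f"<|{m['role']}|>\n{m['content']}\n" for m in messages]
    # End with an open assistant tag so the model continues from there.
    return "".join(parts) + "<|assistant|>\n"

def extract_reply(full_text, prompt):
    """Drop the echoed prompt from a result produced with return_full_text=True."""
    if full_text.startswith(prompt):
        return full_text[len(prompt):].strip()
    return full_text.strip()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": "Summarize the purpose of a report that analyzes "
                   "the distribution flows in feeder markets.",
    },
]
prompt = build_chat_prompt(messages)

# Stand-in for generator(prompt)[0]["generated_text"]; no model is called here.
full_text = prompt + "A report analyzing the distribution flows in feeder"
print(extract_reply(full_text, prompt))
```

The reply stops mid-sentence because the pipeline was configured with `max_new_tokens=10`; raising that value allows a complete answer.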
