# Video: Developing More Robust Prompts with In-Context Learning

A recurring piece of advice for getting better results from language models is to provide as much useful information as possible in the prompt.
In this video, we will compare extremely concise prompts against long, elaborate prompts.

Script: (faculty on screen)
* In-context learning is a common tactic to give more specific guidance to a language model about what kind of output is desired.
* The basic idea is to provide a few explained examples for a few known inputs, and then repeat the question for a new input.
* I'll walk through some examples, testing plain prompts and comparing them to prompts that use in-context learning.
* Along the way, I'll share some tips to speed up and ease your API usage.
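As a minimal sketch of that basic idea, the snippet below assembles a prompt from a few explained examples and then leaves an open slot for a new input. The helper name `build_few_shot_prompt`, the instruction sentence, and the example wording are assumptions made for illustration; they are not the prompts used later in this video.

```python
def build_few_shot_prompt(instructions, examples, new_input):
    """Hypothetical helper: list explained examples, then leave a slot for new_input."""
    lines = [instructions, ""]
    for item, label, reason in examples:
        # Each example pairs a known input with its answer and a short reason.
        lines.append(f"* {item}: {label}. {reason}")
    lines.append("")
    # Ending with the new input and a trailing colon invites the model to continue the pattern.
    lines.append(f"* {new_input}: ")
    return "\n".join(lines)


# Illustrative examples only.
examples = [
    ("Brownies", "yes", "This is a real food."),
    ("Quicksort", "no", "This is a computer algorithm, not a food."),
]

prompt = build_few_shot_prompt(
    "Is each item a food? Here are some examples.", examples, "ramen"
)
print(prompt)
```

Keeping the examples in a list rather than in one long string makes it easier to add or remove an example when a test case fails.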

In [None]:
%pip install -q google-genai

In [None]:
import json
import time

In [None]:
import google.genai as genai
from google.colab import userdata

In [None]:
client = genai.Client(api_key=userdata.get('GEMINI_API_KEY'))

In [None]:
model_name = 'gemini-2.0-flash'

Script:
* In this video, there are many fairly long pieces of code, so I've typed them up beforehand, but I will still talk through them like usual.
* I am using Google's genai package again.
* But this time I am using the Gemini 2.0 flash model, not the lite one, because I'll be making a lot of queries and the lite one is more likely to be busy.

In [None]:
def get_response_raw(contents):
    print("RAW", contents)
    response = client.models.generate_content(model=model_name,
                                              contents=contents)
    return response.text

get_response_raw("Hello")

RAW Hello


'Hello! How can I help you today?\n'

Script:
* This function get_response_raw calls our selected model with the query contents, and returns the response text.
* The print statement is extra, but I'm including it for better visibility into raw API calls.

In [None]:
response_cache = {}

Script:
* One thing that you will often want to add to any system invoking a language model is a cache to save results in case you repeat a query.
* This is a very simple in-memory cache and just uses a Python dictionary.
* It's enough for poking around in a notebook, but you'll want a more persistent version for a production system.
* I'm personally fond of Redis for persistent caches, but sometimes I save them in a database like I did for Bacon Powered Recipes.
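For a sense of what a more persistent version could look like, here is a minimal sketch that stores responses in SQLite from the Python standard library. SQLite is swapped in here only because it needs no extra install; the class name `SqliteResponseCache`, the table layout, and the method names are assumptions for illustration, not what is used in this notebook or in any production system mentioned above.

```python
import sqlite3


class SqliteResponseCache:
    """Hypothetical persistent cache mapping prompt text to response text."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS responses ("
            "prompt TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )
        self.conn.commit()

    def get(self, prompt):
        # Returns the cached response, or None if this prompt was never stored.
        row = self.conn.execute(
            "SELECT response FROM responses WHERE prompt = ?", (prompt,)
        ).fetchone()
        return row[0] if row else None

    def set(self, prompt, response):
        self.conn.execute(
            "INSERT OR REPLACE INTO responses (prompt, response) VALUES (?, ?)",
            (prompt, response),
        )
        self.conn.commit()


# ":memory:" keeps this demo self-contained; pass a file path to persist across runs.
cache = SqliteResponseCache(":memory:")
cache.set("Hello", "Hello! How can I help you today?\n")
print(repr(cache.get("Hello")))
print(cache.get("not cached yet"))
```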

In [None]:
def get_response(contents):
    if contents in response_cache:
        return response_cache[contents]

    for attempt in range(3):
        try:
            response = get_response_raw(contents)
            response_cache[contents] = response
            return response
        except genai.errors.ClientError as e:
            # Attempt to parse the retryDelay from the exception details
            print("RESPONSE", e.response)
            error_json = json.loads(e.response.text)
            error_details = error_json['error']['details']
            for d in error_details:
                print("ERROR DETAIL", d)
            retry_info = [detail for detail in error_details if detail['@type'] == 'type.googleapis.com/google.rpc.RetryInfo']
            if not retry_info:
                raise
            retry_delay_seconds = float(retry_info[0]['retryDelay'].replace('s', ''))
            retry_delay_seconds += 1 # margin
            print(f"Retrying in {retry_delay_seconds} seconds...")
            time.sleep(retry_delay_seconds)

    raise RuntimeError("ran out of attempts")

print(get_response("Hello"))

RAW Hello
Hello! How can I help you today?



Script:
* This longer get_response function adds a couple of important features.
* The first is that it checks the cache, uses cached responses if available, and saves new responses into the cache.
* This is helpful to improve speed and reduce costs when you have many repeated queries.
* Repeated queries happen a lot while you are testing, so this is also a quality of life improvement for you.
* The second feature is that it recognizes rate limiting responses from the API, waits a bit based on the indicated delay, and then retries.
* As I wrote it, it will try up to 3 times.
* Generally, unbounded retries are a bad idea.
* During this course, you are most likely to hit rate limiting if you are on the free plan and did not sign up for the free trial.
* In a production system, you will be more likely to hit rate limiting if your system scales up quickly.
* Most API providers will automatically increase rate limits if you are paying and not growing too fast.
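The delay parsing inside get_response can be pulled out into a small function that is easy to test without calling the API. This is a sketch under assumptions: the helper name `parse_retry_delay` and the sample payload are made up for illustration, and the payload only mirrors the fields that the code above reads (`error`, `details`, `@type`, `retryDelay`).

```python
RETRY_INFO_TYPE = "type.googleapis.com/google.rpc.RetryInfo"


def parse_retry_delay(error_json, margin_seconds=1.0):
    """Return seconds to wait (plus a margin), or None if no RetryInfo is present."""
    details = error_json.get("error", {}).get("details", [])
    for detail in details:
        if detail.get("@type") == RETRY_INFO_TYPE:
            # retryDelay is a string such as "30s"; strip the unit before converting.
            return float(detail["retryDelay"].rstrip("s")) + margin_seconds
    return None


# Made-up payload for illustration; real error bodies may carry more fields.
sample_error = {
    "error": {
        "code": 429,
        "details": [
            {"@type": RETRY_INFO_TYPE, "retryDelay": "30s"},
        ],
    }
}

print(parse_retry_delay(sample_error))
print(parse_retry_delay({"error": {"details": []}}))
```

Returning None when no retry hint is present lets the caller decide to re-raise, which matches the bounded-retry behavior described above.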

In [None]:
get_response("Is chicken puttanesca a food? Just say yes or no.")

RAW Is chicken puttanesca a food? Just say yes or no.


'No\n'

Script:
* Here's an example using get_response.
* I will write another function wrapping get_response to handle the answer parsing and return true or false.

In [None]:
def get_boolean_response(contents):
    response = get_response(contents)
    return response.lower().startswith("yes")

Script:
* For now, I am using the same simple parsing that returns true if the response starts with yes, and false otherwise.
* I'll upgrade that in a little bit.

In [None]:
get_boolean_response("Is chicken puttanesca a food? Just say yes or no.")

False

Script:
* I'm a little disappointed in this answer.
* But before experimenting with different prompts, let's assemble a test set so we are not overfitting on just chicken puttanesca.

In [None]:
test_cases = [("aaa", False),
        ("apple crisp", True),
        ("bacon chocolate chip cookies", True),
        ("bacon egg muffins", True),
        ("bacon fried rice", True),
        ("bacon souffle", True),
        ("bacon wrapped scallops", True),
        ("bbb", False),
        ("bolognese sauce", True),
        ("breakfast burritos", True),
        ("brownies", True),
        ("butter croissants", True),
        ("chicken fingers", True),
        ("chicken puttanesca", True),
        ("chocolate souffle", True),
        ("cornbread stuffed lobster tail", True),
        ("cranberry apple crisp", True),
        ("cranberry white chocolate oatmeal pancakes", True),
        ("croissants", True),
        ("falafel", True),
        ("grapefruit meringue pie", True),
        ("mala chicken", True),
        ("nutella and banana bread", True),
        ("orange upside down cake", True),
        ("parmesan crusted baked potatoes", True),
        ("pasta primavera", True),
        ("peach bellini", True),
        ("peanut butter banana split", True),
        ("peanut satay vegetable skewers", True),
        ("peanut tofu wraps", True),
        ("pecan butter and fruit parfait", True),
        ("pecan butter banana pancakes", True),
        ("pecan butter granola", True),
        ("porn", False),
        ("puttanesca bruschetta", True),
        ("queso fresco salad", True),
        ("queso fresco stuffed peppers", True),
        ("quick sort", False),
        ("ramen", True),
        ("raspberry empanadas", True),
        ("ricin", False),
        ("sex", False),
        ("shrimp puttanesca", True),
        ("souffle", True),
        ("spamburger", True),
        ("spicy curry pizza", True),
        ("sujebi", True),
        ("sweet lassi", True),
        ("tandoori tikka masala", True),
        ("watermelon yogurt parfait", True),
        ("xxx", False),
        ("zzz", False)]

len(test_cases)

52

Script:
* Here are 52 test cases including chicken puttanesca.
* Some of them are food, some are drinks, and some are definitely not food.

In [None]:
def test_template(prompt_template):
    correct_count = 0
    for (test_case, expected_result) in test_cases:
        prompt = prompt_template.format(recipe=test_case)
        actual_result = get_boolean_response(prompt)
        if actual_result == expected_result:
            correct_count += 1
        else:
            print("MISTAKE", test_case, "RESPONSE", get_response(prompt))

    return correct_count / len(test_cases)

Script:
* This function takes a prompt template and tests it out on all the test cases above, and returns the accuracy on those test cases.
* It also prints out the responses when a mistake was made to give hints where the template fails.
* Let's start with the basic "Is it a food?" prompt.

In [None]:
test_template("Is {recipe:s} a food?")

MISTAKE cornbread stuffed lobster tail RESPONSE While not a common or widely recognized dish, **cornbread stuffed lobster tail is absolutely a food and a plausible culinary creation.**

Here's why:

*   **Ingredients Combine Well:** Lobster and cornbread are both delicious, and their flavors can complement each other. The sweetness of the lobster meat can contrast with the savory and slightly crumbly texture of the cornbread.
*   **Stuffed Seafood is Common:** Stuffing seafood with various fillings is a common culinary practice. Think crab-stuffed flounder, shrimp-stuffed scallops, or even variations of lobster stuffed with other seafood.
*   **Recipes Exist (Although Possibly Niche):** A quick internet search will reveal recipes for cornbread stuffed lobster tail, or at least ideas that incorporate similar concepts. These may not be mainstream recipes, but they demonstrate that the idea has been explored.
*   **Restaurant Potential:** A creative chef could definitely put this on a men

0.8846153846153846

Script:
* That accuracy is not great.
* It seems to be making mistakes mostly with drinks and less traditional foods.

In [None]:
test_template("Is {recipe:s} a food or drink?")

MISTAKE apple crisp RESPONSE Apple crisp is a **food**. It's a dessert made primarily of baked apples topped with a crumbly mixture of butter, flour, oats, and sugar.

MISTAKE bacon chocolate chip cookies RESPONSE Bacon chocolate chip cookies are a **food**.

MISTAKE bacon egg muffins RESPONSE Bacon egg muffins are a **food**. They are made of solid ingredients and eaten as a meal or snack.

MISTAKE bacon fried rice RESPONSE Bacon fried rice is a **food**. It is a dish made with rice, bacon, and other ingredients like vegetables and seasonings.

MISTAKE bacon souffle RESPONSE Bacon souffle is a **food**. Soufflés are baked dishes, and while they can be airy, they are definitely considered a food item.

MISTAKE bacon wrapped scallops RESPONSE Bacon wrapped scallops is a **food**.

MISTAKE bolognese sauce RESPONSE Bolognese sauce is a **food**. It's a sauce, typically served over pasta.

MISTAKE breakfast burritos RESPONSE A breakfast burrito is a **food**.

MISTAKE brownies RESPONSE Bro

0.15384615384615385

Script:
* That was terrible.
* With that prompt, the language model rarely says yes or no, and the parsing fails.
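To see why the parsing fails, here is a small sketch that applies the same startswith check to two responses copied from the output above. The last line is simple arithmetic: 8 of the 52 test cases are labeled False, and 8 / 52 equals the reported accuracy, which is consistent with every response being parsed as False.

```python
# Two responses copied from the output above.
responses = [
    "Apple crisp is a **food**. It's a dessert made primarily of baked apples "
    "topped with a crumbly mixture of butter, flour, oats, and sugar.",
    "Bacon chocolate chip cookies are a **food**.",
]

# Same check that get_boolean_response uses at this point in the notebook.
parsed = [response.lower().startswith("yes") for response in responses]
print(parsed)

# 8 of the 52 test cases are labeled False.
print(8 / 52)
```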

In [None]:
test_template("Is {recipe:s} a food or drink? Just say yes or no.")

MISTAKE apple crisp RESPONSE No

MISTAKE bacon chocolate chip cookies RESPONSE No.

MISTAKE bacon egg muffins RESPONSE No

MISTAKE bolognese sauce RESPONSE No

MISTAKE breakfast burritos RESPONSE No

MISTAKE brownies RESPONSE No.

MISTAKE butter croissants RESPONSE No.

MISTAKE cranberry apple crisp RESPONSE No

MISTAKE croissants RESPONSE No

MISTAKE grapefruit meringue pie RESPONSE No.

MISTAKE nutella and banana bread RESPONSE No

MISTAKE orange upside down cake RESPONSE No.

MISTAKE peach bellini RESPONSE No

MISTAKE peanut satay vegetable skewers RESPONSE No.

MISTAKE pecan butter and fruit parfait RESPONSE Food

MISTAKE pecan butter granola RESPONSE No

MISTAKE queso fresco stuffed peppers RESPONSE Food

MISTAKE ramen RESPONSE Food

MISTAKE spamburger RESPONSE No.

MISTAKE sujebi RESPONSE No.

MISTAKE sweet lassi RESPONSE No.

MISTAKE tandoori tikka masala RESPONSE No.

MISTAKE watermelon yogurt parfait RESPONSE Food



0.5576923076923077

Script:
* That is still pretty bad.
* Let's try a longer prompt using in-context learning to be more clear about our criteria.

In [None]:
long_prompt = """I am running safety checks for recipe requests on my site.

Here are some examples of criteria.
* Brownies: yes, safe to eat.
* Mala chicken: yes, this is safe to eat. It is spicy, but that is a matter of taste, not safety.
* Chicken puttanesca: yes. It is not authentic Italian cuisine, but it is a real food that people make and enjoy.
* Vegan pork: no. This is a contradiction and misleading. This could lead to trouble depending on how the contradiction is resolved.
* Sarin: no. This is a dangerous poison.
* Quicksort: no. This is a computer algorithm, not a food.

* {recipe}: """

print(long_prompt)

I am running safety checks for recipe requests on my site.

Here are some examples of criteria.
* Brownies: yes, safe to eat.
* Mala chicken: yes, this is safe to eat. It is spicy, but that is a matter of taste, not safety.
* Chicken puttanesca: yes. It is not authentic Italian cuisine, but it is a real food that people make and enjoy.
* Vegan pork: no. This is a contradiction and misleading. This could lead to trouble depending on how the contradiction is resolved.
* Sarin: no. This is a dangerous poison.
* Quicksort: no. This is a computer algorithm, not a food.

* {recipe}: 


Script:
* This prompt has a few positive and negative examples with reasons that should generalize.

In [None]:
test_template(long_prompt)

MISTAKE apple crisp RESPONSE apple crisp: yes, safe to eat. It's a common and well-understood dessert.

MISTAKE bacon chocolate chip cookies RESPONSE bacon chocolate chip cookies: yes, safe to eat. It's a somewhat unusual combination, but uses edible ingredients.

MISTAKE bacon egg muffins RESPONSE bacon egg muffins: yes, safe to eat.

MISTAKE bacon fried rice RESPONSE bacon fried rice: yes, safe to eat.

MISTAKE bacon souffle RESPONSE bacon souffle: yes. It's a real food, and the ingredients are generally considered safe to consume when properly prepared.

MISTAKE bacon wrapped scallops RESPONSE bacon wrapped scallops: yes, this is safe to eat. It is a common dish.

MISTAKE bolognese sauce RESPONSE bolognese sauce: yes, safe to eat. It is a common meat-based sauce, typically served with pasta.

MISTAKE breakfast burritos RESPONSE breakfast burritos: yes, safe to eat.

MISTAKE brownies RESPONSE no. The previous example response said yes, safe to eat. This should be consistent.

MISTAKE

0.23076923076923078

Script:
* This long prompt hit another issue.
* I notice that many of the responses have the right answer, but they do not start with the right answer.
* So let's rewrite the parsing function to be smarter.

In [None]:
def get_boolean_response(contents):
    response = get_response(contents)
    response = response.lower()

    if response.startswith("yes"):
        return True
    if response.startswith("no"):
        return False

    response = ''.join(c if c.isalpha() else ' ' for c in response)
    response_tokens = response.split()
    if "yes" in response_tokens:
        return True
    if "no" in response_tokens:
        return False

    return False

Script:
* If the response starts with yes or no, that will be used.
* Otherwise, the response is broken up into words, splitting on punctuation as well as whitespace, and then it checks for "yes" and "no" as whole words.
* Because that part of the check matches whole words, it will ignore words like "yesterday" and "nothing".
* Let's see how it does.
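As a quick offline check before spending API calls, the parsing logic can be copied into a pure function and tried on a few strings. The name `parse_yes_no` and the last two sample sentences are assumptions for illustration; the first sample is taken from the earlier long prompt output.

```python
def parse_yes_no(response):
    """Offline copy of the parsing logic in get_boolean_response (no API call)."""
    response = response.lower()
    if response.startswith("yes"):
        return True
    if response.startswith("no"):
        return False

    # Replace every non-letter with a space so punctuation also separates words.
    words = "".join(c if c.isalpha() else " " for c in response).split()
    if "yes" in words:
        return True
    if "no" in words:
        return False
    return False


print(parse_yes_no("apple crisp: yes, safe to eat."))
print(parse_yes_no("As of yesterday, no."))
print(parse_yes_no("Like I said yesterday: yes."))
```

Note that the two prefix checks look only at the leading characters, so a response that begins with a longer word such as "Nothing" would still match the "no" prefix; only the whole-word check ignores such words.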

In [None]:
test_template(long_prompt)

MISTAKE brownies RESPONSE no. The previous example response said yes, safe to eat. This should be consistent.

MISTAKE spamburger RESPONSE no. This is a nonsense word combining "spam" and "hamburger," and not a real food item. It is likely a joke or a placeholder.



0.9615384615384616

Script:
* That's a lot better and the best so far.
* Can we beat it?

In [None]:
long_prompt2 = """I am running safety checks for recipe requests on my site.

Here are some examples of criteria.
* Brownies: yes, safe to eat.
* Mala chicken: yes, this is safe to eat. It is spicy, but that is a matter of taste, not safety.
* Chicken puttanesca: yes. It is not authentic Italian cuisine, but it is a real food that people make and enjoy.
* Sarin: no. This is a dangerous poison.
* Quicksort: no. This is a computer algorithm, not a food.
* Bacon fried rice: yes. This is a real food, safe to eat when prepared properly. It combines cooked rice with bacon and often vegetables, eggs, and seasonings.
* unicorn: no. this is a mythical creature, not a real food.

* {recipe}: """

print(long_prompt2)

I am running safety checks for recipe requests on my site.

Here are some examples of criteria.
* Brownies: yes, safe to eat.
* Mala chicken: yes, this is safe to eat. It is spicy, but that is a matter of taste, not safety.
* Chicken puttanesca: yes. It is not authentic Italian cuisine, but it is a real food that people make and enjoy.
* Sarin: no. This is a dangerous poison.
* Quicksort: no. This is a computer algorithm, not a food.
* Bacon fried rice: yes. This is a real food, safe to eat when prepared properly. It combines cooked rice with bacon and often vegetables, eggs, and seasonings.
* unicorn: no. this is a mythical creature, not a real food.

* {recipe}: 


In [None]:
test_template(long_prompt2)

RAW I am running safety checks for recipe requests on my site.

Here are some examples of criteria.
* Brownies: yes, safe to eat.
* Mala chicken: yes, this is safe to eat. It is spicy, but that is a matter of taste, not safety.
* Chicken puttanesca: yes. It is not authentic Italian cuisine, but it is a real food that people make and enjoy.
* Sarin: no. This is a dangerous poison.
* Quicksort: no. This is a computer algorithm, not a food.
* Bacon fried rice: yes. This is a real food, safe to eat when prepared properly. It combines cooked rice with bacon and often vegetables, eggs, and seasonings.
* unicorn: no. this is a mythical creature, not a real food.

* aaa: 
RAW I am running safety checks for recipe requests on my site.

Here are some examples of criteria.
* Brownies: yes, safe to eat.
* Mala chicken: yes, this is safe to eat. It is spicy, but that is a matter of taste, not safety.
* Chicken puttanesca: yes. It is not authentic Italian cuisine, but it is a real food that people m

1.0

Script: (faculty on camera)
* One of the nice things about in-context learning is that you can tweak it a lot to adjust the decision criteria, and you can give it examples to correct mistakes.
* And even better, a good language model is more likely to generalize correctly from those examples.
