# Prompt Engineering
### **Course Name:** Natural Language Processing **<font color="red">(CSC4100)</font>**




Hello, everyone.
In this tutorial, we'll explore *Prompt Engineering* techniques.

**<font color="blue">What is prompt engineering in AI?</font>**

An AI prompt is a carefully crafted instruction given to an AI model to generate a specific output. These inputs can range from text and images to videos or even music.

Prompt engineering means writing precise instructions that guide AI models like **ChatGPT** to produce specific and useful responses. It involves designing inputs that an AI can easily understand and act upon, ensuring the output is relevant and accurate.

## **The First Step: Mastering the Use of APIs**



### About API Keys
#### DeepSeek API Keys
❗ Note that you can go to [DeepSeek](https://platform.deepseek.com/api_keys) to get your API key. Alternatively, you can go to [SiliconFlow](https://cloud.siliconflow.cn/) for free credit:
- Go to the website [https://platform.deepseek.com/api_keys](https://platform.deepseek.com/api_keys).
- Set your API key through the environment variable `os.environ["DEEPSEEK_API_KEY"]`.
- Remember to update `DEEPSEEK_BASE_URL` to https://api.siliconflow.cn/v1/chat/completions when using the API from SiliconFlow.
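
As a hedged illustration of the environment-variable approach described above, the sketch below reads a key from the environment and fails with a clear message if it is missing. The helper `require_env` and the variable name `DEMO_API_KEY` are hypothetical names introduced only for this example; in practice you would export your real key in the shell rather than assign it in code.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value


# Hypothetical placeholder for demonstration only; normally the key is
# exported in the shell before the notebook starts.
os.environ["DEMO_API_KEY"] = "sk-placeholder"
api_key = require_env("DEMO_API_KEY")
print(f"Loaded a key of length {len(api_key)}")
```

Keeping the key out of the notebook source this way makes it less likely to be committed or shared by accident.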

#### OpenAI API Keys
Note that we provide a key with 100 US dollars of credit. If it is used up, you will need to buy a key yourself (it may cost a little money). Here is how to buy one:
- Go to the website [https://eylink.cn/buy/7](https://eylink.cn/buy/7).
- Purchase a 14 RMB key (worth 10 US dollars); 10 dollars is enough.
- Fill in the `OPENAI_API_KEY` below with the key you purchased.

(As a student, you can apply for $100 in free API credit at https://azure.microsoft.com/en-us/free/students. Remember to update `OPENAI_BASE_URL` to https://api.openai.com/v1/chat/completions when using the API from Azure.)

🔅 To simplify access to the model APIs, we use the popular framework LangChain.

In [None]:
!pip install langchain
!pip install langchain-openai
!pip install langchain-deepseek
!pip install retrying

In [1]:
import json
import os
import random
import time

import requests
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_deepseek import ChatDeepSeek
from langchain_openai import ChatOpenAI
from retrying import retry

DeepSeek Key

In [2]:
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an AI assistant, please answer user's question."),
        ("user", "{input}")
    ]
)

In [None]:
# Set DeepSeek API key and base URL
os.environ["DEEPSEEK_API_KEY"] = "sk-"
os.environ["DEEPSEEK_BASE_URL"] = "https://api.deepseek.com/v1"
deepseek_chat = ChatDeepSeek(model="deepseek-chat", temperature=1)

model = ChatDeepSeek(model="deepseek-chat", temperature=1)

chain = prompt | model

😊 You can now engage directly with DeepSeek-Chat using the chain's `invoke` method.

In [4]:
response = chain.invoke({"input": "Hello"})
print(response)

content='Hello! How can I help you today? 😊' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 11, 'prompt_tokens': 17, 'total_tokens': 28, 'completion_tokens_details': None, 'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 0}, 'prompt_cache_hit_tokens': 0, 'prompt_cache_miss_tokens': 17}, 'model_provider': 'deepseek', 'model_name': 'deepseek-chat', 'system_fingerprint': 'fp_ffc7281d48_prod0820_fp8_kvcache', 'id': 'aa08df44-143f-4701-a54b-91d4f958c2df', 'finish_reason': 'stop', 'logprobs': None} id='lc_run--43875c1d-9c35-4ef6-9f36-5044d107a499-0' usage_metadata={'input_tokens': 17, 'output_tokens': 11, 'total_tokens': 28, 'input_token_details': {'cache_read': 0}, 'output_token_details': {}}


OpenAI

In [None]:
from openai import OpenAI
client = OpenAI(api_key="sk-", base_url="http://66.206.9.230:4000/v1")

models = client.models.list()
for m in models.data:
    print(m.id)


gpt-3.5-turbo-0125
gpt-4
gpt-4-0613
gpt-4-turbo-2024-04-09
gpt-4o-2024-05-13
gpt-4o-mini-2024-07-18
gpt-4o-2024-08-06
gpt-5-2025-08-07
o1-2024-12-17
gpt-5-mini-2025-08-07
gpt-5-nano-2025-08-07
gpt-4.1-2025-04-14
gpt-4.1-nano-2025-04-14
gpt-4o-2024-11-20
o3-mini-2025-01-31
gpt-5-chat-2025-08-07
o4-mini-2025-04-16


In [None]:
from langchain_openai import ChatOpenAI

model_4o_mini = ChatOpenAI(
    model="gpt-4o-mini-2024-07-18",
    temperature=1,
    openai_api_key="sk-",
    openai_api_base="http://66.206.9.230:4000/v1"   # explicit argument
)

chain = prompt | model_4o_mini 



In [7]:
response = chain.invoke({"input": "Hello"})
print(response)


content='Hello! How can I assist you today?' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 10, 'prompt_tokens': 23, 'total_tokens': 33, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_ee1d74bde0', 'id': 'chatcmpl-CXl7ICdUErXa1neL08P9LuSafQJq6', 'finish_reason': 'stop', 'logprobs': None} id='lc_run--6eae65d4-076d-4528-9cf0-0bbc826483fa-0' usage_metadata={'input_tokens': 23, 'output_tokens': 10, 'total_tokens': 33, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}


To make the output easier to use, we can apply an output parser to the original output!

In [8]:
chain = prompt | model | StrOutputParser()

response = chain.invoke({"input": "Hello"})
print(response)

Hello! How can I assist you today?


### **LLM Setting**

**1. Model Selection**

You are free to choose from several models. Below are the models DeepSeek offers:
- deepseek-chat (currently pointing to deepseek-v3, offering a balance of capability and cost; highly recommended!)
- deepseek-reasoner (currently pointing to deepseek-r1; extremely sophisticated and intelligent)

For detailed information, please visit [DeepSeek's Website](https://api-docs.deepseek.com/zh-cn/news/news250120).

Note that DeepSeek offers a discount for API usage around midnight; see the [pricing page](https://api-docs.deepseek.com/zh-cn/quick_start/pricing) for details.

Below are the models OpenAI offers:

- gpt-4o-mini (the most cost-effective model, highly recommended!)
- gpt-4o (offers a balance of capability and cost)
- gpt-4-turbo (extremely sophisticated and intelligent)

For detailed information, please visit [OpenAI's Website](https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4).

In [9]:
deepseek_reasoner = ChatDeepSeek(model="deepseek-reasoner", temperature=1)
chain = prompt | deepseek_reasoner | StrOutputParser()
chain.invoke({"input": "hello! what's your name?"})  # Note: the reasoner model is expensive

"Hello! I'm an AI assistant created by DeepSeek. You can call me Assistant or any name you like! 😊 How can I help you today? And what's your name?"

**2. Temperature**
This controls the randomness of the model's output. Lower values make responses more deterministic and focused, while higher values allow for more creative and diverse outputs. Use low values for factual tasks and high values for creative tasks.

In [11]:
model = ChatDeepSeek(model="deepseek-chat", temperature=0)
chain = prompt | model | StrOutputParser()

chain.invoke({"input": "hello! what's your name?"})

"Hello! I'm an AI assistant created by DeepSeek. I don't have a personal name, but you can think of me as your helpful AI companion! How can I assist you today? 😊"
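
To make the effect of temperature concrete, here is a minimal sketch of the textbook formulation, in which logits are divided by the temperature before the softmax. The logits are made-up values, and the function name `softmax_with_temperature` is introduced only for this illustration; hosted APIs may differ in details such as how a temperature of 0 is handled.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature the distribution concentrates on the top-scoring token, and at a high temperature it flattens, which is why sampled outputs become more varied.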

**3. Top P**
This is nucleus sampling: at each step, only the most probable tokens whose cumulative probability mass reaches `top_p` are considered. Lower values encourage more focused responses, while higher values increase the diversity of possible outputs.

In [21]:
model = ChatDeepSeek(model="deepseek-chat", top_p=0.9)
chain = prompt | model | StrOutputParser()

chain.invoke({"input": "hello! what's your name?"})

"Hello! I'm an AI assistant created by DeepSeek. I don't have a personal name, but you can just call me DeepSeek! I'm here to help answer your questions and assist with anything you need. What can I help you with today? 😊"
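
The nucleus filtering step can be sketched in a few lines. This is an illustration under stated assumptions, not the provider's implementation: the token probabilities are invented, and `top_p_filter` is a hypothetical helper that keeps the smallest set of most probable tokens whose cumulative probability reaches `top_p`, then renormalizes.

```python
def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their cumulative mass reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token distribution.
probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}
print(top_p_filter(probs, 0.9))  # the unlikely tail token is dropped
print(top_p_filter(probs, 0.5))  # only the single most likely token survives
```

Sampling then happens only among the surviving tokens, so a smaller `top_p` narrows the choice.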

**4. Max Length** This limits the total number of tokens the model can generate, helping control response length and prevent irrelevant output.

In [13]:
model = ChatDeepSeek(model="deepseek-chat", max_tokens=5)
chain = prompt | model | StrOutputParser()

chain.invoke({"input": "hello! what's your name?"})

"Hello! I'm an"

### **Prompting Techniques**

**1. Zero-Shot Prompting** Large language models (LLMs) today, such as GPT-3.5 Turbo, GPT-4, and Claude 3, are tuned to follow instructions and are trained on large amounts of data. Large-scale training makes these models capable of performing some tasks in a "zero-shot" manner. Zero-shot prompting means that the prompt used to interact with the model won't contain examples or demonstrations. The zero-shot prompt directly instructs the model to perform a task without any additional examples to steer it.

In [22]:
model = ChatDeepSeek(model="deepseek-chat")

chain = prompt | model | StrOutputParser()

your_prompt = """Classify the text into neutral, negative or positive.
Text: I think the vacation is okay.
Sentiment:"""

chain.invoke({"input": your_prompt})

'neutral'

**2. Few-Shot Prompting** While large language models demonstrate remarkable zero-shot capabilities, they still fall short on more complex tasks in the zero-shot setting. Few-shot prompting can be used as a technique to enable in-context learning, where we provide demonstrations in the prompt to steer the model to better performance. The demonstrations serve as conditioning for subsequent examples where we would like the model to generate a response.

According to [Touvron et al. 2023](https://arxiv.org/pdf/2302.13971.pdf), few-shot properties first appeared when models were scaled to a sufficient size ([Kaplan et al., 2020](https://arxiv.org/abs/2001.08361)).

In [15]:
your_prompt = """This is bad! // Negative
This is awesome! // Positive
Wow that movie was rad! // Positive
What a horrible show! //"""
chain.invoke({"input": your_prompt})

'Negative'
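
When the demonstrations live in a list rather than a hand-written string, a small helper can assemble the same prompt. The sketch below is one possible way to do this; `build_few_shot_prompt` is a hypothetical name, and the `text // label` layout simply mirrors the cell above.

```python
def build_few_shot_prompt(demonstrations, query):
    """Format (text, label) demonstrations followed by the unlabeled query."""
    lines = [f"{text} // {label}" for text, label in demonstrations]
    lines.append(f"{query} //")  # leave the label open for the model to complete
    return "\n".join(lines)


demos = [
    ("This is bad!", "Negative"),
    ("This is awesome!", "Positive"),
    ("Wow that movie was rad!", "Positive"),
]
few_shot_prompt = build_few_shot_prompt(demos, "What a horrible show!")
print(few_shot_prompt)
```

The resulting string can be passed to `chain.invoke({"input": few_shot_prompt})` in the same way as the hand-written prompt.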



**3. Chain-of-Thought Prompting** Introduced in [Wei et al. (2022)](https://arxiv.org/abs/2201.11903), chain-of-thought (CoT) prompting enables complex reasoning capabilities through intermediate reasoning steps. You can combine it with few-shot prompting to get better results on more complex tasks that require reasoning before responding.

![](https://www.promptingguide.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fcot.1933d9fe.png&w=1920&q=75)  

In [24]:
your_prompt = """I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?
Let's think step by step."""
chain.invoke({"input": your_prompt})

'Let’s break it down:  \n\n1. Start: 10 apples.  \n2. Gave 2 to the neighbor: \\( 10 - 2 = 8 \\) apples left.  \n3. Gave 2 to the repairman: \\( 8 - 2 = 6 \\) apples left.  \n4. Bought 5 more: \\( 6 + 5 = 11 \\) apples.  \n5. Ate 1: \\( 11 - 1 = 10 \\) apples.  \n\nSo, you remained with **10 apples**.'

**4. Self-Consistency** Perhaps one of the more advanced techniques out there for prompt engineering is self-consistency. Proposed by [Wang et al. (2022)](https://arxiv.org/abs/2203.11171), self-consistency aims "to replace the naive greedy decoding used in chain-of-thought prompting". The idea is to sample multiple, diverse reasoning paths through few-shot CoT, and use the generations to select the most consistent answer. This helps to boost the performance of CoT prompting on tasks involving arithmetic and commonsense reasoning.

In [None]:
your_prompt = """**Problem: Calculate the total cost if you buy 3 notebooks at $2 each and 2 pens at $1.50 each.**

**Examples:**

1. **Example Problem:** How many apples can you buy with $10 if each costs $2?
   **Solution:**
   - I have $10.
   - Each apple costs $2.
   - $10 / $2 = 5 apples.
   **Answer: You can buy 5 apples.**

2. **Example Problem:** If a train travels 50 miles in an hour, how far will it travel in 4 hours?
   **Solution:**
   - The train travels 50 miles in one hour.
   - In 4 hours, it will travel 50 miles/hour * 4 hours = 200 miles.
   **Answer: The train will travel 200 miles.**

**Your Task:**

- Use the chain of thought to break down the cost calculation.
- Sample multiple reasoning paths.
- Determine the most consistent calculation across different samples.

**Reasoning:**
- Start by identifying the cost of one category of items:
  - 3 notebooks at $2 each = $6.
- Then, calculate the cost for the other category:
  - 2 pens at $1.50 each = $3.
- Add both amounts to find the total cost:
  - $6 (notebooks) + $3 (pens) = $9.

**Consistency Check:**
- Sample several reasoning paths. For example:
  1. Calculate total cost for notebooks first, then pens, and sum.
  2. Calculate total cost for pens first, then notebooks, and sum.
  3. Directly multiply and add the costs of notebooks and pens.
- Compare the answers and select the most frequently occurring result.

**Final Answer:**
- After verifying consistency across samples, conclude with the most consistent answer.
"""
chain.invoke({"input": your_prompt})


**Chain of Thought:**

1. **Cost of notebooks:**  
   - 3 notebooks × $2 each = $6.

2. **Cost of pens:**  
   - 2 pens × $1.50 each = $3.

3. **Total cost:**  
   - $6 + $3 = $9.

---

**Multiple Reasoning Paths:**

- **Path 1:**  
  - Notebooks: 3 × $2 = $6  
  - Pens: 2 × $1.50 = $3  
  - Total: $6 + $3 = $9

- **Path 2:**  
  - Pens: 2 × $1.50 = $3  
  - Notebooks: 3 × $2 = $6  
  - Total: $3 + $6 = $9

- **Path 3:**  
  - Direct calculation:  
    (3 × $2) + (2 × $1.50) = $6 + $3 = $9

---

**Consistency Check:**  
All three reasoning paths yield the same result: **$9**.

---

**Final Answer:**  
The total cost is **$9**.
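
The prompt above asks the model to imagine several reasoning paths within a single response. Self-consistency as described by Wang et al. instead draws several independent samples and takes a majority vote over their final answers. The sketch below shows only that voting step, under stated assumptions: `self_consistent_answer` is a hypothetical helper, and `noisy_solver` is a stand-in for a sampled chain-of-thought completion reduced to its final answer (in this notebook it would wrap a call such as `chain.invoke(...)` with a temperature above 0).

```python
import random
from collections import Counter


def self_consistent_answer(sample_fn, n_samples=5):
    """Call sample_fn n_samples times and return the majority answer with its vote share."""
    answers = [sample_fn() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


# Hypothetical stand-in for one sampled reasoning path: usually right, sometimes wrong.
rng = random.Random(0)


def noisy_solver():
    return "$9" if rng.random() < 0.8 else "$8"


answer, share = self_consistent_answer(noisy_solver, n_samples=11)
print(answer, share)
```

The vote share doubles as a rough confidence signal: a narrow majority suggests the question deserves more samples or a better prompt.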




**5. Tree of Thoughts (ToT)** For complex tasks that require exploration or strategic lookahead, traditional or simple prompting techniques fall short. [Yao et al. (2023)](https://arxiv.org/abs/2305.10601) and [Long (2023)](https://arxiv.org/abs/2305.08291) recently proposed Tree of Thoughts (ToT), a framework that generalizes over chain-of-thought prompting and encourages exploration over thoughts that serve as intermediate steps for general problem solving with language models.

ToT maintains a tree of thoughts, where thoughts represent coherent language sequences that serve as intermediate steps toward solving a problem. This approach enables an LM to self-evaluate the progress through intermediate thoughts made towards solving a problem through a deliberate reasoning process. The LM's ability to generate and evaluate thoughts is then combined with search algorithms (e.g., breadth-first search and depth-first search) to enable systematic exploration of thoughts with lookahead and backtracking.



![](https://www.promptingguide.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2FTOT.3b13bc5e.png&w=3840&q=75)  

In [18]:
your_prompt = """Imagine you are solving a complex logic puzzle where you need to arrange six objects in a specific order based on a set of rules. Each object can only be placed once, and certain objects must be placed before others according to the rules provided below. You need to explore different arrangements systematically to find the correct solution. Follow these steps:

1. **Step 1: Start by listing possible first moves**: Consider each of the six objects and think through which ones could logically come first based on the rules.
2. **Step 2: Generate multiple intermediate thoughts for the second position**: After selecting a first object, think about which objects can go next while considering the constraints. Explore at least three different possibilities.
3. **Step 3: Evaluate each option**: After placing the first two objects, evaluate whether the current arrangement aligns with the rules. If it does, proceed to explore the third position. If not, backtrack and try another path.
4. **Step 4: Search using Breadth-First Search (BFS)**: Expand on each potential arrangement one step at a time. Use BFS to keep track of multiple arrangements at once, evaluating their adherence to the rules as you go.
5. **Step 5: Look ahead and refine**: After placing three objects, look ahead to the remaining positions and evaluate potential placements. If a placement leads to a conflict, backtrack and explore a different arrangement. Use depth-first search (DFS) if necessary to explore more deeply.
6. **Step 6: Complete the arrangement and find the correct solution**: Continue exploring arrangements, self-evaluating each step, until the correct order is found. Summarize your reasoning process after completing the task."""
chain.invoke({"input": your_prompt})

'Let’s break this down step by step.  \n\n**Step 1: List possible first moves**  \nI don’t have the actual puzzle rules or objects, so I’ll make up a simple example to illustrate the method.  \n\nSuppose the objects are **A, B, C, D, E, F** and the rules are:  \n1. A must come before B.  \n2. C must come before D.  \n3. E must come before F.  \n4. B must come before D.  \n5. No other constraints.  \n\nPossible first moves:  \n- A can be first (no rule says something must be before A).  \n- C can be first.  \n- E can be first.  \nBut B cannot be first (A must be before B), D cannot be first (C and B before D), F cannot be first (E before F).  \n\nSo possible first objects: **A, C, E**.  \n\n---\n\n**Step 2: Generate multiple intermediate thoughts for the second position**  \n\nLet’s pick **A** as first.  \nRemaining objects: B, C, D, E, F.  \n\nPossible second positions:  \n- **B** (A before B is satisfied).  \n- **C** (no conflict).  \n- **E** (no conflict).  \nNot D yet (B mus
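
The search loop that ToT wraps around the model can be sketched without any LLM call. In the illustration below, the objects and rules are borrowed from the example the model invented in its response above, and two deterministic functions stand in for the parts an LLM would play: `propose` as the thought generator and `is_promising` as the evaluator. All names here are hypothetical, and this is a simplified breadth-first variant with a fixed beam width, not the authors' implementation.

```python
OBJECTS = "ABCDEF"
# Hypothetical rules from the model's made-up example: (x, y) means x must precede y.
RULES = [("A", "B"), ("C", "D"), ("E", "F"), ("B", "D")]


def is_promising(partial):
    """Evaluator: reject a partial arrangement that places y before its required x."""
    placed = set()
    for obj in partial:
        if any(y == obj and x not in placed for x, y in RULES):
            return False
        placed.add(obj)
    return True


def propose(partial):
    """Thought generator: extend the arrangement with each unused object."""
    return [partial + obj for obj in OBJECTS if obj not in partial]


def tree_of_thoughts_bfs(beam_width=5):
    """Breadth-first search over partial arrangements, keeping at most beam_width per level."""
    frontier = [""]
    for _ in OBJECTS:
        candidates = [c for p in frontier for c in propose(p) if is_promising(c)]
        frontier = candidates[:beam_width]  # prune to the first few surviving thoughts
    return frontier


solutions = tree_of_thoughts_bfs()
print(solutions)
```

Pruning invalid partial arrangements at each level plays the role of backtracking: a branch that violates a rule is never expanded further.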

**6. Retrieval Augmented Generation (RAG)** General-purpose language models can be fine-tuned to achieve several common tasks such as sentiment analysis and named entity recognition. These tasks generally don't require additional background knowledge.

For more complex and knowledge-intensive tasks, it's possible to build a language model-based system that accesses external knowledge sources to complete tasks. This enables more factual consistency, improves reliability of the generated responses, and helps to mitigate the problem of "hallucination".

Meta AI researchers introduced a method called [Retrieval Augmented Generation (RAG)](https://ai.facebook.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/) to address such knowledge-intensive tasks. RAG combines an information retrieval component with a text generator model. RAG can be fine-tuned and its internal knowledge can be modified in an efficient manner and without needing retraining of the entire model.


![](https://www.promptingguide.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Frag.c6528d99.png&w=1920&q=75)  

A recommended repository of implementations: [LangChain](https://github.com/langchain-ai/langchain).

In [19]:
your_prompt = """## Instruction: Use the provided retrieval content to answer the question.
### Retrieval Content:
1. The retrieved document discusses the impact of climate change on polar bear populations in the Arctic, detailing factors like ice melting, loss of habitat, and changes in prey availability.

### Question:
What are the primary reasons for the decline in polar bear populations as discussed in the retrieval content, and what measures can be implemented to mitigate this issue?
"""
chain.invoke({"input": your_prompt})

'Based on the retrieval content provided, the primary reasons for the decline in polar bear populations are:\n\n1. Ice melting\n2. Loss of habitat\n3. Changes in prey availability\n\nThese factors are all consequences of climate change affecting the Arctic region.\n\nTo mitigate this issue, comprehensive measures addressing climate change would be necessary, including:\n- Reducing greenhouse gas emissions globally\n- Protecting critical polar bear habitats\n- Implementing conservation programs\n- Supporting international agreements on climate action\n- Monitoring polar bear populations and their ecosystems\n\nThe effectiveness of these measures depends on coordinated global efforts to address the root causes of climate change while implementing targeted conservation strategies in the Arctic region.'
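
In the cell above the retrieval content was written by hand. The sketch below adds a toy retrieval step in front of the same prompt layout, so the two stages of RAG are visible. It is an illustration under stated assumptions: the corpus is invented, the helper names are hypothetical, and ranking by word overlap is a deliberately simple substitute for the dense retriever used in real systems.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, documents, k=1):
    """Rank documents by the number of word tokens shared with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_rag_prompt(question, passages):
    """Assemble a prompt in the same layout as the notebook cell above."""
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(passages, start=1))
    return (
        "## Instruction: Use the provided retrieval content to answer the question.\n"
        f"### Retrieval Content:\n{numbered}\n\n"
        f"### Question:\n{question}\n"
    )


# Hypothetical mini corpus for illustration.
corpus = [
    "Polar bear populations decline as Arctic sea ice melts and prey becomes scarce.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "Transformers use self-attention to model long-range dependencies.",
]
question = "Why are polar bear populations declining in the Arctic?"
rag_prompt = build_rag_prompt(question, retrieve(question, corpus))
print(rag_prompt)
```

The generated `rag_prompt` can then be sent through `chain.invoke({"input": rag_prompt})` exactly like the hand-written version.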

**7. Automatic Reasoning and Tool-use (ART)** Combining CoT prompting and tools in an interleaved manner has been shown to be a strong and robust approach for addressing many tasks with LLMs. These approaches typically require hand-crafting task-specific demonstrations and carefully scripted interleaving of model generations with tool use. Paranjape et al. (2023) propose a new framework that uses a frozen LLM to automatically generate intermediate reasoning steps as a program.

ART works as follows:

- given a new task, it selects demonstrations of multi-step reasoning and tool use from a task library
- at test time, it pauses generation whenever external tools are called, and integrates their output before resuming generation

ART encourages the model to generalize from demonstrations to decompose a new task and use tools in appropriate places, in a zero-shot fashion. In addition, ART is extensible as it also enables humans to fix mistakes in the reasoning steps or add new tools by simply updating the task and tool libraries. The process is demonstrated below:

![](https://www.promptingguide.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2FART.3b30f615.png&w=1200&q=75)  
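
The pause-and-resume behaviour described in the second step can be illustrated with a scripted example. This is only a sketch: the `[tool:argument]` call syntax, the `TOOLS` library, and the scripted steps standing in for a frozen LLM's generated program are all assumptions made for illustration, and the selection of demonstrations from a task library is not covered.

```python
import re

# Hypothetical tool library: each tool maps a string argument to a string result.
TOOLS = {
    "add": lambda arg: str(sum(int(x) for x in arg.split(","))),
    "upper": lambda arg: arg.strip().upper(),
}

TOOL_CALL = re.compile(r"\[(\w+):([^\]]*)\]")


def run_with_tools(generation_steps):
    """Walk through scripted steps, pausing at each tool call to splice in its output."""
    transcript = []
    for step in generation_steps:
        match = TOOL_CALL.search(step)
        if match:  # pause: this step requests a tool
            name, arg = match.groups()
            result = TOOLS[name](arg)
            step = f"{step} -> {result}"  # integrate the tool output before resuming
        transcript.append(step)
    return "\n".join(transcript)


# Scripted stand-in for the step-by-step program a frozen LLM would generate.
steps = [
    "Q1: How many apples in total? [add:10,5]",
    "Q2: Format the unit. [upper:apples]",
    "Answer: combine the results above.",
]
print(run_with_tools(steps))
```

Adding a new tool in this sketch only requires a new entry in `TOOLS`, which echoes the extensibility point made above.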

## **Task 2: LLMs for AI Feedback**

The final stage of large language model training involves reinforcement learning from feedback. Such feedback can come from either human experts or AI. This feedback is used to learn a reward model, with data defined in triplet form. Each triplet is derived from a question, two answers, and a choice by a human or AI of which answer is better.

The triplet consists of three elements: a *question*, the *chosen* answer, and the *rejected* answer. You are asked to use ChatGPT to provide the feedback, namely, to choose the preferred answer. Note that the feedback is highly biased by the order in which the answers are placed, so please shuffle the order of the answers when using ChatGPT for preference feedback.
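
One way to shuffle the answer order while keeping the ground-truth label aligned is sketched below. It assumes the record fields `Question`, `Answer1`, `Answer2`, and `Preference` used by the dataset loaded in the following cells; `shuffle_answers` is a hypothetical helper, not part of the assignment code.

```python
import random


def shuffle_answers(example, rng):
    """Randomly swap Answer1 and Answer2, remapping the Preference label accordingly."""
    if rng.random() < 0.5:
        return dict(example)  # keep the original order
    swapped = dict(example)
    swapped["Answer1"], swapped["Answer2"] = example["Answer2"], example["Answer1"]
    swapped["Preference"] = "Answer1" if example["Preference"] == "Answer2" else "Answer2"
    return swapped


rng = random.Random(42)  # fixed seed so the shuffle is reproducible
example = {"Question": "Q", "Answer1": "first", "Answer2": "second", "Preference": "Answer2"}
shuffled = shuffle_answers(example, rng)
# The preferred text is unchanged whether or not a swap occurred.
print(shuffled[shuffled["Preference"]])  # prints "second"
```

Applying this to every record before building the prompt means a model that always favours one position can no longer score well by position alone.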

In [None]:
!wget  https://NLP-course-cuhksz.github.io/Assignments/Assignment1/task2/data/1.rlhf.jsonl

In [7]:

import json

import re
from tqdm import tqdm


with open('1.rlhf.jsonl') as f:
  data = [json.loads(l) for l in f]
data[0] # sample

{'Question': 'Human: What are some aspects of religious Islam that are compatible with western civilization?',
 'Answer1': "Assistant: I'm not sure what 'western civilization' means in this context, but if we assume it means the values of modern Western culture, it seems there are many facets of Islamic religious practice that are compatible with those values.  For example, a key principle of Islam is that there is one God who should be worshiped, which is consistent with the principle of religious tolerance, as well as the value of equality.  There are different branches of Islam, but many of them require",
 'Answer2': 'Assistant: Well in a basic sense it is possible for Muslims to live in the Western world and take part in Western civilization. In fact, many Muslims have been at the center of important cultural developments of the last centuries.',
 'Preference': 'Answer2'}

Zero Shot

In [9]:
your_prompt = '''
[Question]:
{Question}

[Answer1]:
{Answer1}

[Answer2]:
{Answer2}

A good response should be generally helpful, accurate, correct, and safe.
Choose which answer is overall better. Output only 'Answer1' or 'Answer2'.
'''

def get_query(da):
  return your_prompt.format_map(da)

testdata = data[0]

print(get_query(testdata))

print('---- Model response ----')

print(chain.invoke(get_query(testdata)).content)


[Question]:
Human: What are some aspects of religious Islam that are compatible with western civilization?

[Answer1]:
Assistant: I'm not sure what 'western civilization' means in this context, but if we assume it means the values of modern Western culture, it seems there are many facets of Islamic religious practice that are compatible with those values.  For example, a key principle of Islam is that there is one God who should be worshiped, which is consistent with the principle of religious tolerance, as well as the value of equality.  There are different branches of Islam, but many of them require

[Answer2]:
Assistant: Well in a basic sense it is possible for Muslims to live in the Western world and take part in Western civilization. In fact, many Muslims have been at the center of important cultural developments of the last centuries.

A good response should be generally helpful, accurate, correct, and safe.
Choose which answer is overall better. Output only 'Answer1' or 'Answer

Zero Shot (Entire Dataset)

In [12]:
your_prompt = '''
[Question]:
{Question}

[Answer1]:
{Answer1}

[Answer2]:
{Answer2}

A good response should be generally helpful, accurate, correct, and safe.
Choose which answer is overall better. Output only 'Answer1' or 'Answer2'.
'''

def get_query(da):
  return your_prompt.format_map(da)

correct_num = 0
total_num = 0

# Test on the entire dataset
for da in tqdm(data):
  da['deepseek_ans'] =  chain.invoke(get_query(da))
  if da['deepseek_ans'].content == da['Preference']:
    correct_num += 1
  total_num += 1
print(f"Model consistency rate with human: {correct_num/total_num:.2%}")

100%|██████████| 100/100 [01:36<00:00,  1.04it/s]

Model consistency rate with human: 67.00%
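
The loop above counts a prediction as correct only when the model's reply is exactly `Answer1` or `Answer2`, so a reply with extra words or reasoning is scored as wrong even if its verdict is right. A more tolerant comparison can be sketched as follows. `extract_label` is a hypothetical helper; it accepts either a plain string or a message object with a `.content` attribute, and taking the last mentioned label is a heuristic that may misfire if the reply names an answer again after its verdict.

```python
import re


def extract_label(response):
    """Return the last 'Answer1'/'Answer2' mentioned in a response, or None if absent."""
    text = getattr(response, "content", response)  # message object or plain string
    matches = re.findall(r"Answer\s*([12])", str(text))
    return f"Answer{matches[-1]}" if matches else None


print(extract_label("Answer1"))
print(extract_label("Both are weak, but the safer one is Answer 2."))
print(extract_label("I cannot decide."))
```

In the evaluation loop, the comparison would then read `extract_label(da['deepseek_ans']) == da['Preference']`.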





<font color="blue">You need to optimize the prompt to improve the performance (consistency rate) of large language models (LLMs).</font>

Few Shot (Entire Dataset)

In [13]:
your_prompt = '''

==== Demonstrations ====
[Question]:
What is the common side effect of aspirin?

[Answer1]:
Aspirin may cause stomach irritation and bleeding.

[Answer2]:
Aspirin is a vitamin supplement.

[Better]:
Answer1


[Question]:
What is the purpose of antibiotics?

[Answer1]:
They kill or inhibit bacterial growth.

[Answer2]:
They relieve joint pain.

[Better]:
Answer1


[Question]:
Where is the capital of France?

[Answer1]:
The capital of France is Paris.

[Answer2]:
France is in Europe.

[Better]:
Answer1


==== Evaluate ====
[Question]:
{Question}

[Answer1]:
{Answer1}

[Answer2]:
{Answer2}

A good response should be generally helpful, accurate, correct, and safe.
Choose which answer is overall better. Output only 'Answer1' or 'Answer2'.
'''

correct_num = 0
total_num = 0
for da in tqdm(data):
  da['deepseek_ans'] =  chain.invoke(get_query(da))
  if da['deepseek_ans'].content == da['Preference']:
    correct_num += 1
  total_num += 1
print(f"Model consistency rate with human: {correct_num/total_num:.2%}")

100%|██████████| 100/100 [01:35<00:00,  1.05it/s]

Model consistency rate with human: 68.00%





CoT + Few Shot

In [10]:
your_prompt = '''

==== Demonstrations ====
[Question]:
What is the common side effect of aspirin?

[Answer1]:
Aspirin may cause stomach irritation and bleeding.

[Answer2]:
Aspirin is a vitamin supplement.

[Better]:
Answer1


[Question]:
What is the purpose of antibiotics?

[Answer1]:
They kill or inhibit bacterial growth.

[Answer2]:
They relieve joint pain.

[Better]:
Answer1


[Question]:
Where is the capital of France?

[Answer1]:
The capital of France is Paris.

[Answer2]:
France is in Europe.

[Better]:
Answer1


==== Evaluate ====
[Question]:
{Question}

[Answer1]:
{Answer1}

[Answer2]:
{Answer2}

A good response should be generally helpful, accurate, correct, and safe.
Choose which answer is overall better. Output only 'Answer1' or 'Answer2'.

Let's think step-by-step
'''

correct_num = 0
total_num = 0
for da in tqdm(data):
  da['deepseek_ans'] =  chain.invoke(get_query(da))
  if da['deepseek_ans'].content == da['Preference']:
    correct_num += 1
  total_num += 1
print(f"Model consistency rate with human: {correct_num/total_num:.2%}")



100%|██████████| 100/100 [02:02<00:00,  1.23s/it]

Model consistency rate with human: 60.00%





CoT + Zero Shot

In [16]:
your_prompt = '''
[Question]:
{Question}

[Answer1]:
{Answer1}

[Answer2]:
{Answer2}

A good response should be generally helpful, accurate, correct, and safe.
Choose which answer is overall better. Output only 'Answer1' or 'Answer2'.

Let's think step-by-step
'''
correct_num = 0
total_num = 0


for da in tqdm(data):
  da['deepseek_ans'] =  chain.invoke(get_query(da))
  if da['deepseek_ans'].content == da['Preference']:
    correct_num += 1
  total_num += 1
print(f"Model consistency rate with human: {correct_num/total_num:.2%}")

100%|██████████| 100/100 [02:01<00:00,  1.21s/it]

Model consistency rate with human: 63.00%





In [None]:
from langchain_openai import ChatOpenAI

model_4o_mini = ChatOpenAI(
    model="gpt-4o-mini-2024-07-18",
    temperature=1,
    openai_api_key="sk-",
    openai_api_base="http://66.206.9.230:4000/v1"   # explicit argument
)

chain = prompt | model_4o_mini 



Self-consistency

In [25]:
your_prompt = """**Problem: Decide which answer better satisfies the user's question.**

**Examples:**

1. **Example Question:** What is the common side effect of aspirin?
   **Answer1:** Aspirin may cause stomach irritation and bleeding.
   **Answer2:** Aspirin is a vitamin supplement.
   **Better:** Answer1

2. **Example Question:** What is the purpose of antibiotics?
   **Answer1:** They relieve joint pain.
   **Answer2:** They kill or inhibit bacterial growth.
   **Better:** Answer2

3. **Example Question:** Where is the capital of France?
   **Answer1:** The capital of France is Paris.
   **Answer2:** France is in Europe.
   **Better:** Answer1


**Evaluate:**
[Question]:
{Question}

[Answer1]:
{Answer1}

[Answer2]:
{Answer2}

**Your Task:**
- You are an impartial judge for a *single* triplet (Question, Answer1, Answer2) from a preference dataset.
- Your goal is to choose which answer better satisfies the question.
- Judge by: relevance to the question, factual accuracy, helpfulness/clarity, and safety (avoid harmful or misleading content).

**Reasoning (do this silently):**
- Think step-by-step *internally* (do not reveal your chain of thought).
- Consider whether each answer is on-topic, correct, and sufficiently informative.
- Prefer precise, directly responsive, and safe content.
- If both are weak, choose the *less* harmful/misleading and more relevant one.

**Consistency Check (do this silently):**
- Internally sample multiple reasoning paths and compare your conclusions.
- If any paths disagree, reconcile them and choose the answer supported by the strongest consistent reasoning.
- Do not output your reasoning; output only the final label.



**Output Format (STRICT):**
Output exactly one token on a single line: Answer1 or Answer2.
"""


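correct_num = 0
total_num = 0

for da in tqdm(data):
    # The key is still named 'deepseek_ans' although the chain now calls an OpenAI model.
    da['deepseek_ans'] = chain.invoke(get_query(da))
    # Strip whitespace so a trailing newline does not count as a mismatch.
    if da['deepseek_ans'].content.strip() == da['Preference']:
        correct_num += 1
    total_num += 1
print(f"Model consistency rate with human: {correct_num/total_num:.2%}")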


100%|██████████| 100/100 [01:40<00:00,  1.00s/it]

Model consistency rate with human: 70.00%
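
The prompt above asks the model to compare reasoning paths internally, within a single call. Self-consistency as it is usually described instead samples several independent completions and takes a majority vote over their final labels. A minimal sketch of that voting step follows. The helper names `normalize_label` and `majority_vote` are illustrative, and `sample_label` is a hypothetical callable wrapping one model call (for instance `lambda: chain.invoke(get_query(da)).content`):

```python
from collections import Counter
from typing import Callable, Optional

VALID_LABELS = ("Answer1", "Answer2")

def normalize_label(text: str) -> Optional[str]:
    """Map raw model output to 'Answer1' / 'Answer2', or None if unclear."""
    cleaned = text.strip().lower()
    found = [label for label in VALID_LABELS if label.lower() in cleaned]
    # Accept only outputs that mention exactly one of the two labels.
    return found[0] if len(found) == 1 else None

def majority_vote(sample_label: Callable[[], str], n_samples: int = 5) -> Optional[str]:
    """Call the sampler n_samples times and return the most frequent valid label."""
    votes = Counter()
    for _ in range(n_samples):
        label = normalize_label(sample_label())
        if label is not None:
            votes[label] += 1
    if not votes:
        return None  # every sample violated the output format
    # On a tie, the label that was seen first wins; an odd n_samples makes ties rarer.
    return votes.most_common(1)[0][0]

# Stand-in sampler that replays canned outputs instead of calling a model.
canned = iter(["Answer1", " answer2\n", "Answer1", "Answer1 or Answer2", "Answer1."])
print(majority_vote(lambda: next(canned), n_samples=5))
```

Voting only helps when the samples differ, so this approach assumes a temperature above 0; it also multiplies the number of API calls by `n_samples`.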





Change to another model: GPT-4 Turbo

However, the consistency rate decreased (67% here versus 70% with gpt-4o-mini), possibly due to sampling noise or distracting content in the prompt.

In [None]:
from langchain_openai import ChatOpenAI

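# The variable is named model_4o, but it holds GPT-4 Turbo.
model_4o = ChatOpenAI(
    model="gpt-4-turbo-2024-04-09",
    temperature=1,                                  # sampling is stochastic, so scores vary between runs
    openai_api_key="sk-",                           # placeholder: fill in your own key
    openai_api_base="http://66.206.9.230:4000/v1"   # custom endpoint, passed explicitly
)

# Rebuild the chain so that later cells use this model.
chain = prompt | model_4o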


In [17]:
your_prompt = """**Problem: Decide which answer better satisfies the user's question.**

**Examples:**

1. **Example Question:** What is the common side effect of aspirin?
   **Answer1:** Aspirin may cause stomach irritation and bleeding.
   **Answer2:** Aspirin is a vitamin supplement.
   **Better:** Answer1

2. **Example Question:** What is the purpose of antibiotics?
   **Answer1:** They relieve joint pain.
   **Answer2:** They kill or inhibit bacterial growth.
   **Better:** Answer2

3. **Example Question:** Where is the capital of France?
   **Answer1:** The capital of France is Paris.
   **Answer2:** France is in Europe.
   **Better:** Answer1


**Evaluate:**
[Question]:
{Question}

[Answer1]:
{Answer1}

[Answer2]:
{Answer2}

**Your Task:**
- You are an impartial judge for a *single* triplet (Question, Answer1, Answer2) from a preference dataset.
- Your goal is to choose which answer better satisfies the question.
- Judge by: relevance to the question, factual accuracy, helpfulness/clarity, and safety (avoid harmful or misleading content).

**Reasoning (do this silently):**
- Think step-by-step *internally* (do not reveal your chain of thought).
- Consider whether each answer is on-topic, correct, and sufficiently informative.
- Prefer precise, directly responsive, and safe content.
- If both are weak, choose the *less* harmful/misleading and more relevant one.

**Consistency Check (do this silently):**
- Internally sample multiple reasoning paths and compare your conclusions.
- If any paths disagree, reconcile them and choose the answer supported by the strongest consistent reasoning.
- Do not output your reasoning, only the final label.



**Output Format (STRICT):**
Output exactly one token on a single line: Answer1 or Answer2.
"""


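correct_num = 0
total_num = 0

for da in tqdm(data):
    # The key is still named 'deepseek_ans' although the chain now calls an OpenAI model.
    da['deepseek_ans'] = chain.invoke(get_query(da))
    # Strip whitespace so a trailing newline does not count as a mismatch.
    if da['deepseek_ans'].content.strip() == da['Preference']:
        correct_num += 1
    total_num += 1
print(f"Model consistency rate with human: {correct_num/total_num:.2%}")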


100%|██████████| 100/100 [02:04<00:00,  1.24s/it]

Model consistency rate with human: 67.00%
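
Whether a 3-point drop is meaningful depends on the sample size. A rough check, assuming each of the 100 judgments is an independent trial, is the normal-approximation 95% interval for a proportion. The helper name `accuracy_interval` is illustrative:

```python
import math

def accuracy_interval(correct: int, total: int, z: float = 1.96):
    """Normal-approximation confidence interval for an accuracy estimate."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Scores reported in this notebook for the same prompt on 100 examples.
for name, correct in [("gpt-4o-mini", 70), ("gpt-4-turbo", 67)]:
    low, high = accuracy_interval(correct, 100)
    print(f"{name}: {correct}% (95% CI roughly {low:.1%} to {high:.1%})")
```

With 100 examples each interval spans roughly 9 points on either side of the estimate, so the two results overlap and the difference may simply be run-to-run variation.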





RAG (Retrieval-Augmented Generation)

In [26]:
# pip install langchain-community wikipedia
from langchain_community.retrievers import WikipediaRetriever
from tqdm import tqdm
import re

wiki = WikipediaRetriever(top_k_results=2)

def rag_judge(question, answer1, answer2, llm, temperature=0.0):
    """Retrieve Wikipedia evidence and decide which answer is better."""
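    try:
        docs = wiki.invoke(question)[:2]
    except Exception:
        # Fall back to no evidence if retrieval fails (network error, no match).
        docs = []
    # Flatten newlines outside the f-string: a backslash inside an f-string
    # expression is a SyntaxError before Python 3.12.
    snippets = [getattr(d, "page_content", "").strip().replace("\n", " ") for d in docs]
    evidence = "\n".join(
        f"[E{i+1}] {text}" for i, text in enumerate(snippets) if text
    ) or "No evidence found."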

    prompt = f"""
==== Evidence ====
{evidence}

==== Evaluate ====
[Question]:
{question}

[Answer1]:
{answer1}

[Answer2]:
{answer2}

Rules:
- Prefer the answer that best matches the Evidence (accuracy first).
- Also consider relevance, clarity, and safety.
- If evidence is unclear, choose using your own reasoning.
Output exactly one line:
Final: Answer1
or
Final: Answer2
""".strip()

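    # Note: `llm` here is the `prompt | model` chain. Extra keyword arguments to a
    # chain's invoke() may be consumed by its first step instead of the model, so
    # confirm that `temperature` actually takes effect.
    try:
        resp = llm.invoke({"input": prompt}, temperature=temperature)
    except TypeError:
        resp = llm.invoke({"input": prompt})
    text = getattr(resp, "content", str(resp)).strip()
    # Extract the label; return None (scored as incorrect) if the format is violated.
    m = re.search(r"\bFinal\s*:\s*(Answer1|Answer2)\b", text, re.I)
    return m.group(1).title() if m else None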


# =========================================================
# Evaluation loop
# =========================================================
correct_num = 0
total_num = 0

for da in tqdm(data):
    pred = rag_judge(da["Question"], da["Answer1"], da["Answer2"], chain, temperature=0.0)
    da["deepseek_ans"] = pred
    if pred == da["Preference"]:
        correct_num += 1
    total_num += 1

print(f"Model consistency rate with human: {correct_num/total_num:.2%}")


100%|██████████| 100/100 [11:37<00:00,  6.98s/it]

Model consistency rate with human: 66.00%
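
Retrieved pages can be long, and a long evidence block may crowd out the question and the two answers. A minimal sketch of building the evidence string with a per-document character cap follows. The `format_evidence` helper and its `max_chars` limit are illustrative assumptions, not part of the cell above, and `SimpleNamespace` objects stand in for retrieved documents:

```python
from types import SimpleNamespace

def format_evidence(docs, max_chars=500):
    """Join non-empty documents into numbered, single-line, length-capped snippets."""
    lines = []
    for doc in docs:
        # Collapse all whitespace (including newlines) into single spaces.
        text = " ".join(getattr(doc, "page_content", "").split())
        if not text:
            continue
        if len(text) > max_chars:
            text = text[:max_chars].rstrip() + " ..."
        lines.append(f"[E{len(lines) + 1}] {text}")
    return "\n".join(lines) or "No evidence found."

# Stand-in documents; a real run would pass the retriever's results instead.
docs = [
    SimpleNamespace(page_content="Paris is the capital\nand largest city of France."),
    SimpleNamespace(page_content="   "),
]
print(format_evidence(docs))
```

Whether a cap helps the judge is an empirical question; it is worth comparing the consistency rate with and without it.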





Self-consistency, with temperature and top-p adjusted for a better score

In [28]:
your_prompt = """**Problem: Decide which answer better satisfies the user's question.**

**Examples:**

1. **Example Question:** What is the common side effect of aspirin?
   **Answer1:** Aspirin may cause stomach irritation and bleeding.
   **Answer2:** Aspirin is a vitamin supplement.
   **Better:** Answer1

2. **Example Question:** What is the purpose of antibiotics?
   **Answer1:** They relieve joint pain.
   **Answer2:** They kill or inhibit bacterial growth.
   **Better:** Answer2

3. **Example Question:** Where is the capital of France?
   **Answer1:** The capital of France is Paris.
   **Answer2:** France is in Europe.
   **Better:** Answer1


**Evaluate:**
[Question]:
{Question}

[Answer1]:
{Answer1}

[Answer2]:
{Answer2}

**Your Task:**
- You are an impartial judge for a *single* triplet (Question, Answer1, Answer2) from a preference dataset.
- Your goal is to choose which answer better satisfies the question.
- Judge by: relevance to the question, factual accuracy, helpfulness/clarity, and safety (avoid harmful or misleading content).

**Reasoning (do this silently):**
- Think step-by-step *internally* (do not reveal your chain of thought).
- Consider whether each answer is on-topic, correct, and sufficiently informative.
- Prefer precise, directly responsive, and safe content.
- If both are weak, choose the *less* harmful/misleading and more relevant one.

**Consistency Check (do this silently):**
- Internally sample multiple reasoning paths and compare your conclusions.
- If any paths disagree, reconcile them and choose the answer supported by the strongest consistent reasoning.
- Do not output your reasoning, only the final label.



**Output Format (STRICT):**
Output exactly one token on a single line: Answer1 or Answer2.
"""

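temperature = 0.2   # 0 = near-deterministic; 0.3–0.7 = some diversity
top_p = 0.9         # 1.0 = full sampling; try 0.9 for narrower focus

correct_num = 0
total_num = 0

for da in tqdm(data):
    # Caution: extra keyword arguments to a `prompt | model` chain's invoke() may be
    # consumed by the first step rather than the model, so confirm that these
    # sampling settings actually take effect (setting them on the model is safer).
    da['deepseek_ans'] = chain.invoke(
        get_query(da),
        temperature=temperature,
        top_p=top_p,
    )
    # Strip whitespace so a trailing newline does not count as a mismatch.
    if da['deepseek_ans'].content.strip() == da['Preference']:
        correct_num += 1
    total_num += 1
print(f"Model consistency rate with human: {correct_num/total_num:.2%}")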


100%|██████████| 100/100 [01:51<00:00,  1.12s/it]

Model consistency rate with human: 74.00%
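
To make the two sampling parameters concrete, here is a toy illustration of what temperature and top-p do to a next-token distribution. It is a conceptual sketch over three made-up logits, not the API's actual implementation; the function names `apply_temperature` and `top_p_filter` are illustrative:

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0).

    Lower temperatures sharpen the distribution toward the top token.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (1.0, 0.2):
    probs = apply_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in top_p_filter(probs, 0.9)])
```

At temperature 0.2 almost all probability mass sits on the top token, which is why low temperatures make a single-label judgment more repeatable across runs.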



