In [1]:
%%capture
# update or install the necessary libraries
!pip install --upgrade openai
!pip install --upgrade python-dotenv

In [2]:
import json
from openai import OpenAI
import pandas as pd
import re
client = OpenAI(api_key="")

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Load datasets
with open(r'/content/drive/MyDrive/Assignment 5/Assignment_50_questions.json', 'r') as file:
    test_data = json.load(file)

with open(r'/content/drive/MyDrive/Assignment 5/train.json', 'r') as file:
    train_data = json.load(file)

# Display the first few examples of the datasets
print("Test Data Example:")
print(test_data[:2])

print("\nTrain Data Example:")
print(train_data[:2])

Test Data Example:
[{'qid': 'fc0bbbfd4467bd868714', 'question': 'Is a curling iron necessary in curling?', 'answer': False}, {'qid': 'ed04a34363f248900c18', 'question': 'Can the Persian Gulf fit in New Jersey?', 'answer': False}]

Train Data Example:
[{'qid': '2bc9c4f9c19c167187f2', 'term': 'Genghis Khan', 'description': 'founder and first Great Khan of the Mongol Empire', 'question': 'Are more people today related to Genghis Khan than Julius Caesar?', 'answer': True, 'facts': ['Julius Caesar had three children.', 'Genghis Khan had sixteen children.', 'Modern geneticists have determined that  out of every 200 men today has DNA that can be traced to Genghis Khan.'], 'decomposition': ['How many kids did Julius Caesar have?', 'How many kids did Genghis Khan have?', 'Is #2 greater than #1?'], 'evidence': [[[['Caesarion-2', 'Julia (daughter of Caesar)-1']], [['Alakhai Bekhi-1', 'Tolui-1'], 'no_evidence'], ['operation']], [[['Julius Caesar-75']], [['Genghis Khan-17']], ['operation']], [[['Ga

## Zero-shot Prompting

Zero-shot prompting is a technique where a model is asked to solve a problem or answer a question without being shown any examples of that task in the prompt. The model relies only on the instruction and on what it learned during training.

In [None]:
def zero_shot_prompt(question):
    messages = [
        {"role": "system", "content": "You are a knowledgeable and thorough assistant."},
        {"role": "user", "content": f"Give Boolean answer True or False without additional information and additional symbols: {question}"}
    ]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=messages
    )
    answer =  response.choices[0].message.content
    if answer.lower() == 'true':
        return True
    elif answer.lower() == 'false':
        return False
    else:
        return None
#Test with a single question
question = test_data[0]['question']
answer = zero_shot_prompt(question)
print(f"Question: {question}\nZero-shot Answer: {answer}")

Question: Is a curling iron necessary in curling?
Zero-shot Answer: False


In [33]:
def calculate_accuracy(df, method):
    correct_answers = (df['true_answer'] == df[method]).sum()
    total_questions = len(df)
    accuracy = correct_answers / total_questions
    return accuracy

In [None]:
results_zero_shot = []

for item in test_data:
    question = item['question']
    true_answer = item['answer']
    zero_shot_answer = zero_shot_prompt(question)
    results_zero_shot.append({
        'question': question,
        'true_answer': true_answer,
        'zero_shot': zero_shot_answer
    })

results_zero_shot_df = pd.DataFrame(results_zero_shot)

zero_shot_accuracy = calculate_accuracy(results_zero_shot_df, 'zero_shot')

print(f"Zero-shot Prompting Accuracy: {zero_shot_accuracy:.2%}")

Zero-shot Prompting Accuracy: 58.00%


In [None]:
results_zero_shot

[{'question': 'Is a curling iron necessary in curling?',
  'true_answer': False,
  'zero_shot': False},
 {'question': 'Can the Persian Gulf fit in New Jersey?',
  'true_answer': False,
  'zero_shot': False},
 {'question': 'Is Fiat Chrysler associated with Japanese cars?',
  'true_answer': False,
  'zero_shot': False},
 {'question': 'Is Lionel Richie related to Sheila E?',
  'true_answer': False,
  'zero_shot': False},
 {'question': 'Are blue lips normal?',
  'true_answer': False,
  'zero_shot': False},
 {'question': 'Can professional boxers expect to have low dental bills?',
  'true_answer': False,
  'zero_shot': False},
 {'question': 'Would an explosion at a gunpowder storage facility result in a supersonic shock wave?',
  'true_answer': False,
  'zero_shot': True},
 {'question': 'Would it be unusual to find a yellow perch in the Red Sea?',
  'true_answer': True,
  'zero_shot': None},
 {'question': 'Could Brooke Shields succeed at University of Pennsylvania?',
  'true_answer': True,
 

The zero-shot prompting method achieved an accuracy of 58.00% on the given test dataset. This indicates that the method correctly answered slightly more than half of the questions with boolean answers. While this method offers a baseline performance, it does not leverage examples to guide the model's reasoning process.
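The `None` in the results above (the yellow perch question) is counted as a wrong answer, because the parser accepts only an exact `true` or `false`. A minimal sketch of a more forgiving parser follows. `parse_boolean_reply` is a hypothetical helper, not part of the notebook, and it assumes that such replies differ from the expected form only by whitespace or surrounding punctuation, which has not been checked against the raw model output.

```python
import string

def parse_boolean_reply(reply):
    """Map a short model reply to True/False, or None if it is neither.

    Hypothetical helper: tolerates surrounding whitespace, quotes and
    trailing punctuation such as "True." or " false\n".
    """
    cleaned = reply.strip(string.punctuation + string.whitespace).lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    return None

print(parse_boolean_reply("True."))       # True
print(parse_boolean_reply(" false\n"))    # False
print(parse_boolean_reply("It depends"))  # None
```

Replies that contain a full sentence still map to `None`, so the unanswered cases remain visible instead of being guessed.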

## Few-shot Prompting (3-shot)

The goal of few-shot prompting is to give the language model a few examples of the task so that it better understands how to respond. Here, the prompt is built from StrategyQA training records: for each question, the model receives the three supporting questions from 'decomposition' together with the associated 'facts', answers those first, and then responds to the main question. Only records with exactly three decomposition questions are used.
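For comparison, a conventional few-shot prompt places solved examples in the conversation before the new question. The sketch below builds such a message list from three training records. `build_few_shot_messages` is a hypothetical helper and is not the prompt used in this notebook, which instead passes the decomposition questions and facts of the question being asked.

```python
def build_few_shot_messages(question, examples):
    """Build a chat message list with solved examples before the new question.

    `examples` is a list of dicts with 'question' and 'answer' keys,
    the same shape as the records in train.json.
    """
    instruction = "Answer with True or False only: "
    messages = [{"role": "system", "content": "You are a knowledgeable and thorough assistant."}]
    for example in examples:
        # Each solved example becomes a user turn followed by an assistant turn
        messages.append({"role": "user", "content": instruction + example["question"]})
        messages.append({"role": "assistant", "content": str(example["answer"])})
    messages.append({"role": "user", "content": instruction + question})
    return messages

# Toy records in the train.json shape (questions taken from outputs shown in this notebook)
demo_examples = [
    {"question": "Can you buy Casio products at Petco?", "answer": False},
    {"question": "Did the Space Race use relay batons?", "answer": False},
    {"question": "Would a dog respond to bell before Grey seal?", "answer": True},
]
demo_messages = build_few_shot_messages("Does Biochemistry study gluons?", demo_examples)
print(len(demo_messages))  # 8: one system message, three example pairs, one final question
```

The returned list could be passed as `messages` to `client.chat.completions.create`; the examples should not include the question being evaluated.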

In [31]:
# Filter records that have exactly 3 questions in decomposition
filtered_data = [record for record in train_data if len(record['decomposition']) == 3]

In [None]:
def few_shot_prompt(question, shots, fact):
    messages = [
        {"role": "system", "content": "You are a knowledgeable and thorough assistant."},
        {"role": "assistant", "content": f"Answer the three questions:{shots} using the information provided in {fact}. Use these answers as supporting information for your responses" },
        {"role": "user", "content": f"Give Boolean answer True or False without additional information, additional questions and additional symbols: {question}"}
    ]
    response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=messages
    )
    answer =  response.choices[0].message.content
    if answer.lower() == 'true':
        return True
    elif answer.lower() == 'false':
        return False
    else:
        return None
# Test with a single question
question = train_data[0]['question']
shots = train_data[0]['decomposition']
fact = train_data[0]['facts']
answer = few_shot_prompt(question, shots, fact)
print(f"Question: {question}\nFew-shot Answer: {answer}")

Question: Are more people today related to Genghis Khan than Julius Caesar?
Few-shot Answer: False


In [None]:
results_few_shot = []

for item in filtered_data:
    question = item['question']
    shots = item['decomposition']
    fact = item['facts']
    true_answer = item['answer']
    few_shot_answer = few_shot_prompt(question, shots, fact)
    results_few_shot.append({
        'question': question,
        'true_answer': true_answer,
        'few_shot': few_shot_answer
    })

results_few_shot_df = pd.DataFrame(results_few_shot)
few_shot_accuracy = calculate_accuracy(results_few_shot_df, 'few_shot')

print(f"Few-shot Prompting Accuracy: {few_shot_accuracy:.2%}")

Few-shot Prompting Accuracy: 71.72%


In [None]:
results_few_shot

[{'question': 'Are more people today related to Genghis Khan than Julius Caesar?',
  'true_answer': True,
  'few_shot': True},
 {'question': 'Would a Monoamine Oxidase candy bar cheer up a depressed friend?',
  'true_answer': False,
  'few_shot': False},
 {'question': 'Would a dog respond to bell before Grey seal?',
  'true_answer': True,
  'few_shot': True},
 {'question': 'Is shrimp scampi definitely free of plastic?',
  'true_answer': False,
  'few_shot': False},
 {'question': 'Do the anchors on Rede Globo speak Chinese?',
  'true_answer': False,
  'few_shot': False},
 {'question': 'Is a Boeing 737 cost covered by Wonder Woman (2017 film) box office receipts?',
  'true_answer': True,
  'few_shot': False},
 {'question': 'Can you buy Casio products at Petco?',
  'true_answer': False,
  'few_shot': False},
 {'question': 'Did the Space Race use relay batons?',
  'true_answer': False,
  'few_shot': False},
 {'question': 'Does Biochemistry study gluons?',
  'true_answer': False,
  'few_sho

Few-shot prompting reached 71.72% accuracy, higher than the zero-shot score. However, the two numbers are not directly comparable: the few-shot run was evaluated on training records with three decomposition questions and had access to their facts, whereas the zero-shot run used the 50-question test set without any supporting information. Even so, the result suggests that supplying supporting questions and facts helps the model understand the task and improve its responses.
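Part of any gap between methods can come from replies that could not be parsed: these are stored as `None` and count as wrong in `calculate_accuracy`. The sketch below separates the two effects. `summarize_results` is a hypothetical helper that works on result lists of the same shape as `results_zero_shot` and `results_few_shot`; the demo rows are made up.

```python
def summarize_results(results, method):
    """Return accuracy, share of unparsed replies, and accuracy on parsed replies."""
    total = len(results)
    answered = [r for r in results if r[method] is not None]
    correct = sum(1 for r in answered if r[method] == r["true_answer"])
    return {
        "accuracy": correct / total,
        "unanswered_rate": (total - len(answered)) / total,
        "accuracy_when_answered": correct / len(answered) if answered else None,
    }

# Made-up rows, for illustration only
demo = [
    {"true_answer": False, "zero_shot": False},
    {"true_answer": False, "zero_shot": True},
    {"true_answer": True, "zero_shot": None},
    {"true_answer": True, "zero_shot": True},
]
print(summarize_results(demo, "zero_shot"))
# {'accuracy': 0.5, 'unanswered_rate': 0.25, 'accuracy_when_answered': 0.6666666666666666}
```

A high `unanswered_rate` points to a formatting problem in the replies, not to poor reasoning. The function assumes a non-empty result list.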

## Zero-shot Chain-of-Thought (CoT)

Zero-shot CoT (Chain of Thought) involves prompting a language model to generate a reasoning process for a problem it hasn't been explicitly trained to solve. Instead of providing the answer directly, the model is encouraged to articulate the reasoning process, which helps it arrive at the correct solution through logical steps. I used the same dataset with 50 questions from the Zero-shot task. The prompt was modified to provide an explanation and a chain of thought before giving an answer.

In [None]:
def zero_shot_cot_prompt(question):
    messages = [
        {"role": "system", "content": "You are a knowledgeable and thorough assistant."},
        {"role": "user", "content": f"Give Boolean answer beginning with True or False and provide step-by-step explanation of why your give this answer and your chain of thought: {question}"}
    ]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=messages
    )
    answer =  response.choices[0].message.content
    # Extract the leading True/False; return None if the reply starts with neither
    # (calling .group() directly on a failed match would raise AttributeError)
    match = re.match(r'^(True|False)', answer.strip())
    if match is None:
        return None
    return match.group(1) == 'True'
#Test with a single question
question = test_data[8]['question']
answer = zero_shot_cot_prompt(question)
print(f"Question: {question}\nZero-shot Answer: {answer}")

Question: Could Brooke Shields succeed at University of Pennsylvania?
Zero-shot Answer: False


In [None]:
results_zero_shot_cot = []
for item in test_data:
    question = item['question']
    true_answer = item['answer']
    zero_shot_cot_answer = zero_shot_cot_prompt(question)
    results_zero_shot_cot.append({
        'question': question,
        'true_answer': true_answer,
        'zero_shot_cot': zero_shot_cot_answer
    })

results_zero_shot_cot_df = pd.DataFrame(results_zero_shot_cot)

zero_shot_cot_accuracy = calculate_accuracy(results_zero_shot_cot_df, 'zero_shot_cot')

print(f"Zero-shot CoT Prompting Accuracy: {zero_shot_cot_accuracy:.2%}")

Zero-shot CoT Prompting Accuracy: 72.00%


In [None]:
results_zero_shot_cot

[{'question': 'Is a curling iron necessary in curling?',
  'true_answer': False,
  'zero_shot_cot': False},
 {'question': 'Can the Persian Gulf fit in New Jersey?',
  'true_answer': False,
  'zero_shot_cot': False},
 {'question': 'Is Fiat Chrysler associated with Japanese cars?',
  'true_answer': False,
  'zero_shot_cot': False},
 {'question': 'Is Lionel Richie related to Sheila E?',
  'true_answer': False,
  'zero_shot_cot': False},
 {'question': 'Are blue lips normal?',
  'true_answer': False,
  'zero_shot_cot': False},
 {'question': 'Can professional boxers expect to have low dental bills?',
  'true_answer': False,
  'zero_shot_cot': False},
 {'question': 'Would an explosion at a gunpowder storage facility result in a supersonic shock wave?',
  'true_answer': False,
  'zero_shot_cot': True},
 {'question': 'Would it be unusual to find a yellow perch in the Red Sea?',
  'true_answer': True,
  'zero_shot_cot': True},
 {'question': 'Could Brooke Shields succeed at University of Pennsylv

Accuracy rises from 58% to 72% on the same 50 questions. By encouraging the model to generate a chain of thought or reasoning steps, it was able to arrive at more accurate answers. This suggests that guiding the model to articulate its reasoning can enhance its problem-solving capabilities, even without explicit examples.

## Few-shot Chain-of-Thought (CoT)

In Few-shot CoT (Chain of Thought), the model is provided with a few examples that illustrate the step-by-step reasoning process for solving a particular type of problem. These examples serve as a guide, enabling the model to understand how to decompose and solve similar problems.
The StrategyQA dataset was used again for testing this technique. The assistant message again supplies the decomposition questions and facts, now asking for a detailed explanation of each answer, while the user prompt was changed to a step-by-step guide on how to answer the final question.


In [29]:
def few_shot_cot_prompt(question, shots, fact):
    messages = [
        {"role": "system", "content": "You are a knowledgeable and thorough assistant."},
        {"role": "assistant", "content": f"Using the information provided in {fact}, answer the following three questions: {shots}. Provide detailed explanations for each answer, supporting your conclusions with the relevant information from {fact}." },
        {"role": "user", "content": f"Please follow these steps to answer {question}:\n1) Re-check the answers in the explanation, and make a summary.\n2) Discuss {question} based only on the summary in step 1. This should be followed by a step-by-step explanation of why you arrived at this answer, including your chain of thought.\n3) Provide a Boolean answer, which MUST always begin with either 'TRUE' or 'FALSE' without additional information or symbols on the question in step 2."}
    ]
    response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=messages,
    )
    answer =  response.choices[0].message.content
    match = re.search(r'\b(TRUE|FALSE)\b', answer)
    if match:
        boolean_answer = match.group(1)
        if boolean_answer.lower() == 'true':
            return True
        elif boolean_answer.lower() == 'false':
            return False
# Test with a single question
question = train_data[56]['question']
shots = train_data[56]['decomposition']
fact = train_data[56]['facts']
answer = few_shot_cot_prompt(question, shots, fact)
print(f"Question: {question}\nFew-shot Answer: {answer}")

Question: Does Ukrainian Greek Catholic Church recognize Alexander Nevsky as a saint?
Few-shot Answer: True


In [34]:
results_few_shot_cot = []

for item in filtered_data:
    question = item['question']
    shots = item['decomposition']
    fact = item['facts']
    true_answer = item['answer']
    few_shot_cot_answer = few_shot_cot_prompt(question, shots, fact)
    results_few_shot_cot.append({
        'question': question,
        'true_answer': true_answer,
        'few_shot': few_shot_cot_answer
    })

results_few_shot_cot_df = pd.DataFrame(results_few_shot_cot)
few_shot_cot_accuracy = calculate_accuracy(results_few_shot_cot_df, 'few_shot')

print(f"Few-shot Chain-of-Thought Prompting Accuracy: {few_shot_cot_accuracy:.2%}")

Few-shot Chain-of-Thought Prompting Accuracy: 76.55%


This technique shows the best result so far, at 76.55%. The combination of examples and explicit reasoning steps enables the model to decompose and solve problems more effectively, leading to the best performance among the tested techniques.
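One detail of the answer extraction is worth noting: `re.search` returns the first standalone `TRUE` or `FALSE` in the reply, while the prompt asks for the verdict in step 3, at the end. The sketch below takes the last occurrence instead. `extract_final_verdict` is a hypothetical helper, the demo reply is made up, and the approach assumes the model follows the requested step order.

```python
import re

def extract_final_verdict(reply):
    """Return the last standalone TRUE/FALSE in the reply as a bool, or None."""
    matches = re.findall(r"\b(TRUE|FALSE)\b", reply)
    if not matches:
        return None
    return matches[-1] == "TRUE"

# Made-up reply in which an earlier step mentions the opposite verdict
demo_reply = (
    "1) Summary: the claim that the answer is FALSE does not hold up.\n"
    "2) Discussion: the facts support the statement.\n"
    "3) TRUE"
)
print(extract_final_verdict(demo_reply))  # True
```

With `re.search`, the same reply would be read as `False`, because the word appears in step 1.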

## Summary and Discussion

In [41]:
# Print the accuracies
print(f"Zero-shot Prompting Accuracy: {zero_shot_accuracy:.2%}")
print(f"Few-shot Prompting Accuracy: {few_shot_accuracy:.2%}")
print(f"Zero-shot CoT Prompting Accuracy: {zero_shot_cot_accuracy:.2%}")
print(f"Few-shot CoT Prompting Accuracy: {few_shot_cot_accuracy:.2%}")


Zero-shot Prompting Accuracy: 58.00%
Few-shot Prompting Accuracy: 71.72%
Zero-shot CoT Prompting Accuracy: 72.00%
Few-shot CoT Prompting Accuracy: 76.55%


The results of the experiment clearly indicate that Few-shot Chain-of-Thought (CoT) Prompting yields the highest accuracy at 76.55%. This approach outperforms the other methods by leveraging both examples and a structured reasoning process. The significant improvement in accuracy highlights the importance of guiding the model through a logical chain of thought while also providing illustrative examples.
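The two scores measured on the 50-question test set carry sampling uncertainty because the set is small. The sketch below computes the binomial standard error of each accuracy, assuming the questions are independent. `accuracy_standard_error` is a hypothetical helper. The few-shot scores are left out because the size of the filtered training set is not shown in this notebook.

```python
import math

def accuracy_standard_error(accuracy, n_questions):
    """Binomial standard error of an accuracy estimated from n_questions items."""
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)

for name, acc in [("Zero-shot", 0.58), ("Zero-shot CoT", 0.72)]:
    se = accuracy_standard_error(acc, 50)
    print(f"{name}: {acc:.2%} +/- {se:.2%}")
# Zero-shot: 58.00% +/- 6.98%
# Zero-shot CoT: 72.00% +/- 6.35%
```

A standard error of roughly 6 to 7 percentage points means that differences of a few points on this test set should be read with caution.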

For the few-shot settings, an additional assistant role was introduced. This role was tasked with providing supporting questions from 'decomposition' and using 'facts' as clues to answer these questions. This setup aimed to improve the model's reasoning process before it addressed the main question.

Tuning of the best-performing setup is described below, after the notes on prompt development.

## Behind the Scenes

Before achieving these results, several prompts and approaches were tried. It was found that a structured guide was the best prompt variation for this task. In previous attempts, the model would often discuss the answer and use non-provided additional information, failing to provide the correct answer. Another difficulty was the evaluation of the result; even with direct requests to provide a 'True' or 'False' answer, the model would eventually give unexpected responses after several attempts.
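One way to handle replies that cannot be parsed is to ask again a limited number of times. The sketch below wraps any prompt function that returns `True`, `False` or `None`. `ask_with_retries` and `fake_ask` are hypothetical names, and retrying was not part of the runs reported in this notebook.

```python
def ask_with_retries(ask, question, max_attempts=3):
    """Call ask(question) until it returns True/False or attempts run out."""
    for _ in range(max_attempts):
        answer = ask(question)
        if answer is not None:
            return answer
    return None

# Stand-in for a prompt function such as zero_shot_prompt, so the sketch runs offline:
# it fails to produce a parseable answer once, then succeeds.
replies = iter([None, True])

def fake_ask(question):
    return next(replies)

print(ask_with_retries(fake_ask, "Would it be unusual to find a yellow perch in the Red Sea?"))  # True
```

Each retry is an extra API call, and scores obtained with retries are not comparable to single-attempt scores.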

## Fine Tuning of Best Model

Taking into account the challenges above, let's try to make the model's output more focused by reducing the temperature from the API default of 1 to 0.7. A lower temperature makes sampling less random, so the answers should stay closer to the expected reasoning.
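The effect of temperature can be illustrated with the standard temperature-scaled softmax, in which scores are divided by the temperature before being turned into probabilities. This is a toy sketch: the logits are made up, `softmax_with_temperature` is a hypothetical helper, and it is an assumption that the API applies temperature in this textbook form.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (1.0, 0.7):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.7 the highest-scoring token receives a larger share of the probability than at 1.0, so sampled outputs vary less from run to run.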

In [43]:
def few_shot_cot_prompt_tuned(question, shots, fact):
    messages = [
        {"role": "system", "content": "You are a knowledgeable and thorough assistant."},
        {"role": "assistant", "content": f"Using the information provided in {fact}, answer the following three questions: {shots}. Provide detailed explanations for each answer, supporting your conclusions with the relevant information from {fact}." },
        {"role": "user", "content": f"Please follow these steps to answer {question}:\n1) Re-check the answers in the explanation, and make a summary.\n2) Discuss {question} based only on the summary in step 1. This should be followed by a step-by-step explanation of why you arrived at this answer, including your chain of thought.\n3) Provide a Boolean answer, which MUST always begin with either 'TRUE' or 'FALSE' without additional information or symbols on the question in step 2."}
    ]
    response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=messages,
    temperature=0.7,
    )
    answer =  response.choices[0].message.content
    match = re.search(r'\b(TRUE|FALSE)\b', answer)
    if match:
        boolean_answer = match.group(1)
        if boolean_answer.lower() == 'true':
            return True
        elif boolean_answer.lower() == 'false':
            return False

In [45]:
results_few_shot_cot_tuned = []

for item in filtered_data:
    question = item['question']
    shots = item['decomposition']
    fact = item['facts']
    true_answer = item['answer']
    few_shot_cot_answer_tuned = few_shot_cot_prompt_tuned(question, shots, fact)
    results_few_shot_cot_tuned.append({
        'question': question,
        'true_answer': true_answer,
        'few_shot': few_shot_cot_answer_tuned
    })

results_few_shot_cot_tuned_df = pd.DataFrame(results_few_shot_cot_tuned)
few_shot_cot_tuned_accuracy = calculate_accuracy(results_few_shot_cot_tuned_df, 'few_shot')

print(f"Few-shot Chain-of-Thought Prompting Accuracy: {few_shot_cot_tuned_accuracy:.2%}")

Few-shot Chain-of-Thought Prompting Accuracy: 79.38%


The result improves to 79.38%, although this is a single run and sampling is random. Now, let's limit max_tokens to 100. The model will have limited capacity to generate overly long responses, which may help keep the answers focused and relevant.

In [35]:
def few_shot_cot_prompt_tuned(question, shots, fact):
    messages = [
        {"role": "system", "content": "You are a knowledgeable and thorough assistant."},
        {"role": "assistant", "content": f"Using the information provided in {fact}, answer the following three questions: {shots}. Provide detailed explanations for each answer, supporting your conclusions with the relevant information from {fact}." },
        {"role": "user", "content": f"Please follow these steps to answer {question}:\n1) Re-check the answers in the explanation, and make a summary.\n2) Discuss {question} based only on the summary in step 1. This should be followed by a step-by-step explanation of why you arrived at this answer, including your chain of thought.\n3) Provide a Boolean answer, which MUST always begin with either 'TRUE' or 'FALSE' without additional information or symbols on the question in step 2."}
    ]
    response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=messages,
    temperature=0.7,
    max_tokens=100
    )
    answer =  response.choices[0].message.content
    match = re.search(r'\b(TRUE|FALSE)\b', answer)
    if match:
        boolean_answer = match.group(1)
        if boolean_answer.lower() == 'true':
            return True
        elif boolean_answer.lower() == 'false':
            return False

In [36]:
results_few_shot_cot_tuned = []

for item in filtered_data:
    question = item['question']
    shots = item['decomposition']
    fact = item['facts']
    true_answer = item['answer']
    few_shot_cot_answer = few_shot_cot_prompt_tuned(question, shots, fact)
    results_few_shot_cot_tuned.append({
        'question': question,
        'true_answer': true_answer,
        'few_shot': few_shot_cot_answer
    })

results_few_shot_cot_tuned_df = pd.DataFrame(results_few_shot_cot_tuned)
few_shot_cot_tuned_accuracy = calculate_accuracy(results_few_shot_cot_tuned_df, 'few_shot')

print(f"Few-shot Chain-of-Thought Prompting Accuracy: {few_shot_cot_tuned_accuracy:.2%}")

Few-shot Chain-of-Thought Prompting Accuracy: 9.85%


In [42]:
results_few_shot_cot_tuned

[{'question': 'Are more people today related to Genghis Khan than Julius Caesar?',
  'true_answer': True,
  'few_shot': True},
 {'question': 'Would a Monoamine Oxidase candy bar cheer up a depressed friend?',
  'true_answer': False,
  'few_shot': None},
 {'question': 'Would a dog respond to bell before Grey seal?',
  'true_answer': True,
  'few_shot': None},
 {'question': 'Is shrimp scampi definitely free of plastic?',
  'true_answer': False,
  'few_shot': None},
 {'question': 'Do the anchors on Rede Globo speak Chinese?',
  'true_answer': False,
  'few_shot': None},
 {'question': 'Is a Boeing 737 cost covered by Wonder Woman (2017 film) box office receipts?',
  'true_answer': True,
  'few_shot': None},
 {'question': 'Can you buy Casio products at Petco?',
  'true_answer': False,
  'few_shot': None},
 {'question': 'Did the Space Race use relay batons?',
  'true_answer': False,
  'few_shot': None},
 {'question': 'Does Biochemistry study gluons?',
  'true_answer': False,
  'few_shot': No

The Few-shot Chain-of-Thought (CoT) prompting accuracy dropped dramatically to 9.85%. A max_tokens limit of 100 appears to be insufficient: the prompt asks for a summary and a discussion before the final TRUE or FALSE, so the response is likely cut off before the verdict is written. When no verdict is found, the function returns None, which is counted as a wrong answer. As observed in the results, most responses did not yield any answer at all.
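Truncation can be detected directly: each choice in a chat completion response has a `finish_reason` field, which is `"length"` when generation stopped at the token limit. The sketch below checks it. `was_truncated` is a hypothetical helper, and the response object is a stand-in built with `SimpleNamespace` so that the example runs without an API call.

```python
from types import SimpleNamespace

def was_truncated(response):
    """Return True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"

# Stand-in object with the same attribute layout as a chat completion response
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(
        finish_reason="length",
        message=SimpleNamespace(content="1) Summary: the facts state that"),
    )]
)
print(was_truncated(fake_response))  # True
```

Counting truncated responses separately from wrong answers would show how much of the drop is caused by the token limit alone.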