# **PromptCraft: Exploring LLM Behavior through Custom Prompting**
---





## Overview

This project demonstrates how large language models (LLMs), such as ChatGPT, can be applied to sentiment classification through prompt engineering—without the need for traditional supervised learning.

By leveraging LLMs’ ability to perform zero-shot reasoning, the notebook explores how different styles of prompts impact the accuracy and consistency of model outputs.

---

## What This Notebook Covers

- Walkthrough of example prompts for sentiment classification  
- Design and evaluation of custom prompts  
- Accuracy comparison between different prompt designs  
- Analysis of prompt structure and its effect on model behavior  

---

## Notes

- This notebook uses a hosted LLM API (the Hugging Face Inference API).
- Designed to run efficiently in Google Colab on a CPU runtime.
- Be aware of API rate limits (typically 300 queries per hour).
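
One way to stay under an hourly quota is to space requests evenly. The sketch below is a minimal illustration, not part of the notebook's pipeline: `min_interval_seconds` and `paced_map` are hypothetical helper names, and the default of 300 requests per hour is simply the figure quoted above (actual limits depend on the provider and account).

```python
import time


def min_interval_seconds(limit_per_hour):
    """Smallest gap between requests that stays within an hourly quota."""
    return 3600.0 / limit_per_hour


def paced_map(func, items, limit_per_hour=300, sleep=time.sleep):
    """Call func on each item, pausing between calls to respect the quota."""
    interval = min_interval_seconds(limit_per_hour)
    results = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(interval)  # wait before every call except the first
        results.append(func(item))
    return results
```

With a limit of 300 requests per hour the pause is 12 seconds, so 100 prompts take roughly 20 minutes. The `sleep` argument is injectable so the pacing can be checked without actually waiting.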

---

By the end of this notebook, you will have insight into how prompt design can shape the performance of large language models in classification tasks.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd '/content/drive/MyDrive/'

/content/drive/MyDrive


In [None]:
!pip install datasets==3.1.0
# datasets 3.1.0 requires tqdm>=4.66.3, so do not pin an older tqdm
!pip install "tqdm>=4.66.3"


In [None]:
import pandas as pd
import requests
from datasets import load_dataset
from tqdm import tqdm
from huggingface_hub import InferenceClient



In [None]:
data = load_dataset("imdb")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
def extract_input_outputs(dataset):
    """
    input: dataset -> List[Dict]
    Each dict contains two keys - text and label.
    text contains the text for which sentiment has to be labeled.
    label is an integer denoting the gold sentiment.
    output: inputs -> List, outputs -> List
    inputs is a list of texts from dataset
    outputs is a list of labels from dataset
    """

    inputs, outputs = [], []
    for element in dataset:
        inputs.append(element["text"])
        outputs.append(element["label"])

    assert len(inputs) == len(outputs)

    return inputs, outputs

In [None]:
def create_prompt_inputs(input_texts, prompt_str):
    """
    input: input_texts -> List, prompt_str -> str
    input_texts is an array of reviews from the imdb dataset
    prompt_str is a template prompt with a placeholder to be filled in using a review
    Together, the prompt_str template and a review is used to create an input for the LLM
    output: prompt_inputs -> List
    prompt_inputs is an array of strings.
    Each element is the prompt template prompt_str with the placeholder filled by the review.
    """
    prompt_inputs = []
    for input_text in input_texts:
        # fill the {input_text} placeholder explicitly instead of passing every local variable
        prompt = prompt_str.format(input_text=input_text)
        prompt_inputs.append(prompt)
    return prompt_inputs

In [None]:
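def calculate_accuracy(transformed_model_outputs, ground_truth_labels):
    """Return the percentage of predicted labels that match the ground truth, rounded to 2 decimals."""
    total, correct = 0, 0
    for predicted_label, ground_truth in zip(transformed_model_outputs, ground_truth_labels):
        if predicted_label == ground_truth:
            correct += 1
        total += 1
    if total == 0:
        # no predictions (e.g. the API call failed), so avoid dividing by zero
        return 0.0
    accuracy = round(correct * 100 / total, 2)
    return accuracy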

In [None]:
def query(payload):
    # utility function to query HuggingFace Inference API to generate model outputs
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

In [None]:
def get_prompting_outputs(model_inputs):
    # input: model_inputs -> List
    # model_inputs is an array of the inputs to be fed to the LLM to obtain corresponding outputs (here, sentiment labels)
    # this function, in turn, calls the query function, which queries the HuggingFace Inference API to generate outputs from the specified model
    predictions = []
    for model_input in tqdm(model_inputs):
        prompt_output = query({"inputs": f"{model_input}"})
        # handle rate limit errors
        if 'error' in prompt_output:
            print(f"Error: {prompt_output['error']}.\nPlease re-run this method in 1 hour.")
            print(f'Number of predictions generated: {len(predictions)}')
            return predictions
        predictions.append(prompt_output[0]['generated_text'])
    return predictions

In [None]:
import os

# Read the Hugging Face token from the environment instead of hard-coding it in the notebook
hf_key = os.environ.get("HF_TOKEN", "")

In [None]:
API_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3"
headers = {"Authorization": f"Bearer {hf_key}"}

**Working Example**
--

First, we will walk through an example with just one data point and one prompt. Next, you will be required to experiment with designing different prompts on 100 data points, calculating the accuracy of sentiment classification using that prompt, and analyzing the outputs.

In [None]:
test_set_slice_size = 1

In [None]:
data_slice = data["test"].shuffle(seed=42).select(range(test_set_slice_size))

In [None]:
input_texts, ground_truth_labels = extract_input_outputs(data_slice)

Below is an example of how to write a prompt template, use it to generate input for the LLM and obtain outputs according to it. '{input_text}' is used as a placeholder for the input from the dataset. Make sure to retain it in your constructed prompt.

For example,
```
input_text = 'When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story'
prompt = 'review: {input_text}\n\n what is the sentiment (1 for positive/0 for negative)? :'
```
renders as the following:
```
prompt = 'review: When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story

what is the sentiment (1 for positive/0 for negative)? :'
```
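
The rendering step can be checked in isolation. This is a small sketch using only Python's built-in `str.format`; the `review` string is made up for illustration. It also shows that braces inside the review itself are left untouched, because only the template is parsed for placeholders.

```python
template = 'review: {input_text}\n\n what is the sentiment (1 for positive/0 for negative)? :'

# Hypothetical review text; the braces here are not treated as placeholders.
review = 'A {great} film'

filled = template.format(input_text=review)
print(filled)
```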

In [None]:
prompt = 'review: {input_text}\n\n what is the sentiment (1 for positive/0 for negative)? :'
prompt_inputs = create_prompt_inputs(input_texts, prompt)
raw_generations = get_prompting_outputs(prompt_inputs)

100%|██████████| 1/1 [00:00<00:00,  4.39it/s]


In [None]:
prompt_inputs, raw_generations

(["review: <br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could go wrong?<br /><br />Very quickly, however, I realized that this story was about A Thousand Other Things besides just Acres. I started crying and couldn't stop until long after the movie ended. Thank you Jane, Laura and Jocelyn, for bringing us such a wonderfully subtle and compassionate movie! Thank you cast, for being involved and portraying the characters with such depth and gentleness!<br /><br />I recognized the Angry sister; the Runaway sister and the sister in Denial. I recognized the Abusive Husband and why he was there and then the Father, oh oh the Father... all superbly played. I also recognized myself and this movie was an eye-opener, a relief, a chance to face my OWN truth and finally doing something about it. I truly hope A Thousand Acres has had the same effect on some others out there.<br /><

Model output looks similar to:
```
raw_generation = 'review: When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story

what is the sentiment (1 for positive/0 for negative)? : 1'
```
We then need to extract just the model output so that we can transform it to 1/0 (integer) if needed and evaluate the accuracy of the prompt.
As the example above shows, the raw generation is the model input concatenated with the model output.
So, to extract just the output, we remove the model input from the raw generation; whatever is left is the output. **Note: The model's output may contain some additional text (hallucinations), so we may need to perform additional preprocessing steps to extract the desired outputs.**
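
As a rough illustration of this step, the sketch below strips an echoed prompt and then looks for the first standalone `0` or `1` in what remains. `extract_label` is a hypothetical helper, and taking the first digit is an assumption about where the answer appears; it is one possible parsing strategy, not the only one.

```python
import re


def extract_label(prompt, raw_generation):
    """Return 0 or 1 from the text after the prompt, or None if neither appears."""
    # Drop the echoed prompt when the generation starts with it.
    if raw_generation.startswith(prompt):
        completion = raw_generation[len(prompt):]
    else:
        completion = raw_generation
    # Take the first standalone 0 or 1 in the completion.
    match = re.search(r'\b[01]\b', completion)
    return int(match.group()) if match else None
```

Returning `None` for unparseable generations keeps failures visible, so they can be counted separately rather than silently mapped to a class.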

In [None]:
def extract_outputs_from_raw_generations(input_texts, raw_generations):
    model_outputs = []
    if not raw_generations:  # Check if raw_generations is empty
        print("No generations to process.")
    for model_input, model_generation in zip(input_texts, raw_generations):
        print("Processing generation:", model_generation)  # Print the raw generation being processed
        # TODO Start: Ensure `model_generation` is sanitized or preprocessed if necessary - Replace the input text in `model_generation` with an empty string, perform any additional preprocessing if needed
        model_output = model_generation.replace(model_input, '').strip() # replacing input text in the generated output
        print("Stripped output:", model_output)  # See what remains after stripping input text

        # Further processing to extract the numeric sentiment value
        # Assuming the model outputs are formatted like "1" or "0" after the question
        try:
            # Split the output on whitespace and take the last element as the predicted sentiment
            # This is simplistic and assumes the last piece of text is always the sentiment.
            model_output = model_output.split()[-1]
            model_output = int(model_output)  # Converting the sentiment to an integer
        except (ValueError, IndexError):
            # int() fails on non-numeric text; split()[-1] fails when the stripped output is empty
            print("Failed to extract an integer from the model output:", model_output)
            model_output = None  # or set a default value, e.g., -1
        # TODO End
        model_outputs.append(model_output)
    return model_outputs

model_outputs = extract_outputs_from_raw_generations(prompt_inputs, raw_generations)
print("Model outputs:", model_outputs)


Processing generation: review: <br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could go wrong?<br /><br />Very quickly, however, I realized that this story was about A Thousand Other Things besides just Acres. I started crying and couldn't stop until long after the movie ended. Thank you Jane, Laura and Jocelyn, for bringing us such a wonderfully subtle and compassionate movie! Thank you cast, for being involved and portraying the characters with such depth and gentleness!<br /><br />I recognized the Angry sister; the Runaway sister and the sister in Denial. I recognized the Abusive Husband and why he was there and then the Father, oh oh the Father... all superbly played. I also recognized myself and this movie was an eye-opener, a relief, a chance to face my OWN truth and finally doing something about it. I truly hope A Thousand Acres has had the same effect on some othe

In [None]:
# print model_outputs to validate
model_outputs

[1]

In [None]:
if model_outputs:  # Check if model_outputs is not empty
    print(f'Type of one item of model output: {type(model_outputs[0])}, one model output item: {model_outputs[0]}')
else:
    print("The model_outputs list is empty.")

Type of one item of model output: <class 'int'>, one model output item: 1


Above, you can see that our prompt design leads the LLM to generate 0 or 1, and that the extraction step has already converted this value to an integer.

With the model outputs extracted, we need to map them to the 1/0 (integer) classes so that we can compare the predicted labels to the ground truth labels and calculate the accuracy of the prompt.

In [None]:
def transform_demo_prompt_model_outputs_to_ground_truth_classes(model_outputs):
    # this is a custom function that works for the simple example we have been following till now
    # you will have to write your own custom function to transform model outputs from your prompt to map to the ground truth classes
    predicted_labels = []
    for model_output in model_outputs:
        # the extraction step returns integers (or None), so normalize to a string before comparing: "1" to 1, "0" to 0
        normalized_output = str(model_output).strip()
        if normalized_output == '1':
            predicted_labels.append(1)
        elif normalized_output == '0':
            predicted_labels.append(0)
        else:
            # if the model has hallucinated and generated something other than 0/1, treat it as a failure case
            # (this demo maps it to 0, which coincides with the negative label)
            predicted_labels.append(0)
    return predicted_labels

In [None]:
transformed_model_outputs = transform_demo_prompt_model_outputs_to_ground_truth_classes(model_outputs)

In [None]:
print(f'Type of one item of model output after transformation: {type(transformed_model_outputs[0])}, one item of model output after transformation: {transformed_model_outputs[0]}')

Type of one item of model output after transformation: <class 'int'>, one item of model output after transformation: 0


**Sample Output:** Type of one item of model output after transformation: <class 'int'>, one item of model output after transformation: 1

Above, you can see that the transformed model outputs are of type integer. These can now be successfully used to calculate accuracy of our method.
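
When unparseable outputs are mapped to a sentinel such as `-1` or `None`, it can help to report how often that happens alongside accuracy. The function below is a minimal sketch under that assumption; `accuracy_and_invalid_rate` is a hypothetical helper and not the notebook's `calculate_accuracy`.

```python
def accuracy_and_invalid_rate(predicted_labels, ground_truth_labels, valid_labels=(0, 1)):
    """Return (accuracy %, invalid-prediction %), both rounded to 2 decimals."""
    total = len(ground_truth_labels)
    if total == 0:
        return 0.0, 0.0
    # A prediction counts as correct only when it equals the gold label.
    correct = sum(p == g for p, g in zip(predicted_labels, ground_truth_labels))
    # Anything outside the valid label set is an unparseable or off-format output.
    invalid = sum(p not in valid_labels for p in predicted_labels)
    return round(correct * 100 / total, 2), round(invalid * 100 / total, 2)


print(accuracy_and_invalid_rate([1, 0, -1, 1], [1, 0, 1, 0]))  # (50.0, 25.0)
```

A high invalid rate suggests the prompt or the parsing needs work, whereas a low invalid rate with low accuracy points at the model's judgments themselves.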

In [None]:
accuracy = calculate_accuracy(transformed_model_outputs, ground_truth_labels)
print(accuracy)

0.0


**Prompt Design**
--

Tips on types of prompts to experiment with:
1. Instead of 1/0, we can prompt the model to output True/False, or positive/negative. Including information about the types of outputs you want nudges the model generation in the correct direction.
2. We can add an instruction at the start of the model input that the model should follow. It is good to have instructions that describe how you yourself would perform sentiment classification given a movie review.
3. LLMs have been shown to respond to certain types of artifacts as part of their prompts. For example, try adding "Let's think step by step" to your prompt and see whether performance of this LLM also improves. Note that the artifacts differ from model to model, and there is no guarantee that what works on one might also work on the other.
4. It also helps to prompt the models to reason about their answers in a structured manner. For example, you can prompt the model to generate a JSON object where one key denotes the value and another denotes the explanation for the answer.
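
Tip 4 can be sketched as follows. The template wording, the `label`/`reason` keys, and the `parse_json_label` helper are all assumptions made for illustration. Note that literal JSON braces in the template are doubled (`{{` and `}}`) so that `str.format` leaves them in place and fills only `{input_text}`.

```python
import json

# Hypothetical template: literal JSON braces are doubled so str.format keeps them.
json_prompt = (
    'Classify the sentiment of the movie review.\n'
    'Answer with JSON like {{"label": "positive", "reason": "..."}}.\n\n'
    'review: {input_text}\n\nJSON:'
)


def parse_json_label(model_output):
    """Return 1/0 for positive/negative, or None if no valid JSON label is found."""
    # Keep only the span from the first '{' to the last '}'.
    start, end = model_output.find('{'), model_output.rfind('}')
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(model_output[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    label = str(parsed.get('label', '')).strip().lower()
    return {'positive': 1, 'negative': 0}.get(label)
```

The parser should be applied to the model's completion only (after the echoed prompt has been removed), since the prompt itself contains an example JSON object.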

In [None]:
test_set_slice_size = 100

In [None]:
data_slice = data["test"].shuffle(seed=42).select(range(test_set_slice_size))

In [None]:
input_texts, ground_truth_labels = extract_input_outputs(data_slice)

## **Run Baseline Prompt on Test Data**

In [None]:
# Run the baseline prompt - compute accuracy
prompt_baseline = 'review: {input_text}\n\n what is the sentiment (1 for positive/0 for negative)? :'
prompt_baseline_inputs = create_prompt_inputs(input_texts, prompt_baseline)
raw_generations = get_prompting_outputs(prompt_baseline_inputs)

# check if the output is being generated as expected
print(f"Raw Generation Example : {raw_generations[0]}")

model_outputs = extract_outputs_from_raw_generations(prompt_baseline_inputs, raw_generations)
print(f'Type of one item of model output: {type(model_outputs[0])}, one model output item: {model_outputs[0]}')

transformed_model_outputs = transform_demo_prompt_model_outputs_to_ground_truth_classes(model_outputs)
print(f'Type of one item of model output after transformation: {type(transformed_model_outputs[0])}, one item of model output after transformation: {transformed_model_outputs[0]}')

accuracy = calculate_accuracy(transformed_model_outputs, ground_truth_labels)
print(f'Accuracy of sentiment classification by using this prompt (Baseline): {accuracy}%')

100%|██████████| 100/100 [02:07<00:00,  1.28s/it]

Raw Generation Example : review: <br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could go wrong?<br /><br />Very quickly, however, I realized that this story was about A Thousand Other Things besides just Acres. I started crying and couldn't stop until long after the movie ended. Thank you Jane, Laura and Jocelyn, for bringing us such a wonderfully subtle and compassionate movie! Thank you cast, for being involved and portraying the characters with such depth and gentleness!<br /><br />I recognized the Angry sister; the Runaway sister and the sister in Denial. I recognized the Abusive Husband and why he was there and then the Father, oh oh the Father... all superbly played. I also recognized myself and this movie was an eye-opener, a relief, a chance to face my OWN truth and finally doing something about it. I truly hope A Thousand Acres has had the same effect on some ot

Reference: **The accuracy of the baseline prompt on 100 data points is 63%.**


**Experiment 1: Designing Prompt 1**
---

In [None]:
def extract_outputs_from_raw_generations_prompt1(input_texts, raw_generations):
    model_outputs = []
    for model_input, model_generation in zip(input_texts, raw_generations):
        # TODO Start: Ensure `model_generation` is sanitized or preprocessed if necessary - Replace the input text in `model_generation` with an empty string, perform any additional preprocessing if needed
        # Strip the input text from the raw output to isolate the model's response
        sanitized_output = model_generation.replace(model_input, '').strip()
        # Extract the first word, assuming it's the sentiment; this is simplistic and may need adjustment based on actual model output formatting
        model_output = sanitized_output.split()[0].lower() if sanitized_output else ''
        # TODO End
        model_outputs.append(model_output)
    return model_outputs


def transform_prompt1_model_outputs_to_ground_truth_classes(model_outputs):
    predicted_labels = []
    for model_output in model_outputs:
        # Mapping the textual output to numerical classes
        if model_output == 'positive':
            predicted_labels.append(1)
        elif model_output == 'negative':
            predicted_labels.append(0)
        else:
            # Handle cases where output is neither "positive" nor "negative"
            predicted_labels.append(-1)  # Considered as an invalid or unclear response
    return predicted_labels


In [None]:
prompt1_experiment = 'review: {input_text}\n\n what is the sentiment (positive or negative)? :' ## design prompt; the Prompt 1 parser expects the word "positive" or "negative"
prompt1_inputs = create_prompt_inputs(input_texts, prompt1_experiment)
prompt1_raw_generations = get_prompting_outputs(prompt1_inputs)

# Check if the output is being generated as expected and ensure it's safely accessed
if prompt1_raw_generations:  # Check if not empty
    print(f"Raw Generation Example : {prompt1_raw_generations[0]}")
else:
    print("No generations were produced or available to display.")

prompt1_model_outputs = extract_outputs_from_raw_generations_prompt1(prompt1_inputs, prompt1_raw_generations)
if prompt1_model_outputs:  # Check if not empty
    print(f'Type of one item of model output: {type(prompt1_model_outputs[0])}, one model output item: {prompt1_model_outputs[0]}')
else:
    print("No model outputs were extracted or available to display.")

prompt1_transformed_model_outputs = transform_prompt1_model_outputs_to_ground_truth_classes(prompt1_model_outputs)
if prompt1_transformed_model_outputs:  # Check if not empty
    print(f'Type of one item of model output after transformation: {type(prompt1_transformed_model_outputs[0])}, one item of model output after transformation: {prompt1_transformed_model_outputs[0]}')
else:
    print("No transformed model outputs were available to display.")

prompt1_accuracy = calculate_accuracy(prompt1_transformed_model_outputs, ground_truth_labels)
print(f'Accuracy of sentiment classification by using this prompt (Method 1): {prompt1_accuracy}%')


100%|██████████| 100/100 [00:19<00:00,  5.00it/s]

Raw Generation Example : review: <br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could go wrong?<br /><br />Very quickly, however, I realized that this story was about A Thousand Other Things besides just Acres. I started crying and couldn't stop until long after the movie ended. Thank you Jane, Laura and Jocelyn, for bringing us such a wonderfully subtle and compassionate movie! Thank you cast, for being involved and portraying the characters with such depth and gentleness!<br /><br />I recognized the Angry sister; the Runaway sister and the sister in Denial. I recognized the Abusive Husband and why he was there and then the Father, oh oh the Father... all superbly played. I also recognized myself and this movie was an eye-opener, a relief, a chance to face my OWN truth and finally doing something about it. I truly hope A Thousand Acres has had the same effect on some ot




## **Experiment 2: Designing Prompt 2**

In [None]:
def extract_outputs_from_raw_generations_prompt2(input_texts, raw_generations):
    model_outputs = []
    for model_input, model_generation in zip(input_texts, raw_generations):
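        # Remove the input text to isolate the generated output, then take its first word
        sanitized_output = model_generation.replace(model_input, '').strip()
        # Lowercase and drop surrounding punctuation so "Positive." matches "positive"; empty outputs map to ''
        model_output = sanitized_output.split()[0].lower().strip('.,:;!') if sanitized_output else ''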

        model_outputs.append(model_output)
    return model_outputs


def transform_prompt2_model_outputs_to_ground_truth_classes(model_outputs):
    predicted_labels = []
    for model_output in model_outputs:
        if model_output == 'positive':
            predicted_labels.append(1)
        elif model_output == 'negative':
            predicted_labels.append(0)
        else:
            # Handling unexpected outputs
            predicted_labels.append(-1)  # Consider using another value for unclear responses
    return predicted_labels


In [None]:

prompt2_experiment = 'Please analyze the review and indicate the sentiment. Is it positive or negative? Give a brief reason for your conclusion: {input_text} Conclusion:'  ## design prompt
prompt2_inputs = create_prompt_inputs(input_texts, prompt2_experiment)
prompt2_raw_generations = get_prompting_outputs(prompt2_inputs)

# Check if the output is being generated as expected and ensure it's safely accessed
if prompt2_raw_generations:
    print(f"Raw Generation Example : {prompt2_raw_generations[0]}")
else:
    print("No generations were produced or available to display.")

# Extract and transform model outputs
prompt2_model_outputs = extract_outputs_from_raw_generations_prompt2(prompt2_inputs, prompt2_raw_generations)
if prompt2_model_outputs:  # Check if not empty
    print(f'Type of one item of model output: {type(prompt2_model_outputs[0])}, one model output item: {prompt2_model_outputs[0]}')
else:
    print("No model outputs were extracted or available to display.")

prompt2_transformed_model_outputs = transform_prompt2_model_outputs_to_ground_truth_classes(prompt2_model_outputs)
if prompt2_transformed_model_outputs:  # Check if not empty
    print(f'Type of one item of model output after transformation: {type(prompt2_transformed_model_outputs[0])}, one item of model output after transformation: {prompt2_transformed_model_outputs[0]}')
else:
    print("No transformed model outputs were available to display.")

# Calculate and print the accuracy
prompt2_accuracy = calculate_accuracy(prompt2_transformed_model_outputs, ground_truth_labels)
print(f'Accuracy of sentiment classification by using this prompt (Method 2): {prompt2_accuracy}%')


100%|██████████| 100/100 [03:09<00:00,  1.89s/it]

Raw Generation Example : Please analyze the review and indicate the sentiment. Is it positive or negative? Give a brief reason for your conclusion: <br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could go wrong?<br /><br />Very quickly, however, I realized that this story was about A Thousand Other Things besides just Acres. I started crying and couldn't stop until long after the movie ended. Thank you Jane, Laura and Jocelyn, for bringing us such a wonderfully subtle and compassionate movie! Thank you cast, for being involved and portraying the characters with such depth and gentleness!<br /><br />I recognized the Angry sister; the Runaway sister and the sister in Denial. I recognized the Abusive Husband and why he was there and then the Father, oh oh the Father... all superbly played. I also recognized myself and this movie was an eye-opener, a relief, a chance to face 




## Observations



- Observation 1

- Observation 2

- Observation 3

In [None]:

df = pd.DataFrame({
    'review': input_texts,
    'ground_truth_label': ground_truth_labels,
    'model_input': prompt2_inputs,  # Replace with the inputs used for your best prompt
    'raw_model_output': prompt2_raw_generations,  # Assuming this holds the raw outputs from the best prompt
    'transformed_model_output': prompt2_transformed_model_outputs  # Outputs after transformation to ground truth classes
})
df.to_csv('cse354_assignment4_submission_customprompt_outputs.csv', index=False)
