# Assignment: Advanced Prompt Engineering and Systematic Optimization

<a target="_blank" href="https://colab.research.google.com/github/chenwh0/Natural-Language-Processing-work/blob/main/module7/AdvancedPromptEngineeringandSystematicOptimization.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**Background:**  This assignment explores **advanced prompt engineering** as both an art and a science. Moving beyond basic instruction-following, you will investigate systematic approaches to prompt optimization, analyze the cognitive and linguistic mechanisms underlying effective prompts, and develop frameworks for evaluating prompt performance across diverse tasks and domains. The focus is on **understanding the theoretical foundations** of prompt engineering while developing practical expertise in optimization techniques.

## Instructions and Point Breakdown

### 1. **Prompt Engineering Methodology Framework (2 points)**

- **Develop a Systematic Approach:**
  - Create a structured methodology for prompt design that includes: template design, iterative refinement, and performance evaluation
  - Implement **three different prompting paradigms**:
    - Zero-shot with precise instructions
    - Few-shot with strategic example selection  
    - Chain-of-thought with explicit reasoning steps

- **Theoretical Foundation:**
  - Analyze the cognitive science principles underlying effective prompts
  - Investigate how prompt structure affects model reasoning and output quality

- **Critical Questions:**
  - How does the sequence and placement of information within prompts affect model performance?
  - What are the trade-offs between prompt complexity and output reliability?
  - How do different prompting paradigms align with various types of reasoning tasks?

### 2. **Multi-Domain Prompt Optimization (3 points)**

- **Cross-Domain Implementation:**
  - Select **three distinct domains** (e.g., creative writing, technical documentation, logical reasoning, data analysis)
  - For each domain, develop and iteratively optimize prompts using multiple techniques:
    - **Meta-prompting**: Use LLMs to suggest prompt improvements
    - **Recursive self-improvement**: Multi-iteration refinement cycles
    - **Template-based optimization**: Structured formats with placeholders
    - **Context-aware decomposition**: Breaking complex tasks into sub-prompts

- **Comparative Analysis:**
  - Measure performance across domains using appropriate metrics (accuracy, creativity, coherence, etc.)
  - Document the optimization process and effectiveness of each technique
  - Analyze which techniques work best for different types of tasks

- **Critical Questions:**
  - How do optimal prompting strategies vary across different cognitive domains?
  - What patterns emerge in successful prompt structures across diverse applications?
  - How can you systematically identify when a prompt has reached optimal performance?

### 3. **Prompt Interpretability and Failure Analysis (2 points)**

- **Deep Analysis of Prompt Mechanisms:**
  - Investigate attention patterns and model behavior with different prompt structures
  - Analyze failure modes: when and why do optimized prompts fail?
  - Study the relationship between prompt complexity and model interpretability

- **Systematic Error Analysis:**
  - Identify classes of problems where prompt engineering breaks down
  - Analyze the role of prompt length, specificity, and linguistic complexity
  - Investigate bias introduction through prompt design choices

- **Critical Questions:**
  - How do different prompt engineering techniques affect model reasoning transparency?
  - What are the limits of prompt-based optimization versus model fine-tuning?
  - How can we detect when prompts are exploiting spurious correlations rather than genuine understanding?

### 4. **Automated Prompt Evaluation and Scaling (2 points)**

- **Evaluation Framework Development:**
  - Design metrics for prompt quality that go beyond task-specific performance
  - Implement automated evaluation systems for prompt comparison
  - Develop methods for detecting prompt robustness across edge cases

- **Scalability Investigation:**
  - Analyze computational costs of different optimization approaches
  - Investigate techniques for prompt transfer across related tasks
  - Explore automated prompt generation and refinement systems

- **Critical Questions:**
  - How can we design evaluation metrics that capture both effectiveness and robustness?
  - What are the trade-offs between human evaluation and automated assessment of prompts?
  - How do we balance prompt optimization costs with performance gains in production systems?

### 5. **Future Directions and Theoretical Implications (1 point)**

- **Research Synthesis:**
  - Synthesize insights from your experiments into broader principles of prompt engineering
  - Identify limitations of current prompting approaches and propose solutions
  - Discuss implications for human-AI interaction design

- **Critical Analysis Questions:**
  - How might prompt engineering evolve as model capabilities advance?
  - What are the ethical implications of highly optimized prompts that may manipulate model outputs?
  - How does effective prompt engineering relate to theories of human cognition and communication?
  - What role should prompt engineering play in AI safety and alignment?

## Submission Requirements

- **Comprehensive Jupyter Notebook** containing:
  - Implementation of all optimization techniques across multiple domains
  - Detailed experimental results with statistical analysis
  - Visualizations of optimization trajectories and performance comparisons  
  - Code for automated evaluation systems
  - Thorough written analysis addressing all critical questions (**Maximum** 3-4 paragraphs each)

- **Technical Rigor:** Include proper experimental controls, statistical significance testing, and reproducibility measures

- **Libraries:** Use `transformers`, `openai`, `langchain`, `pandas`, `matplotlib`, `seaborn`, and statistical analysis packages

## Advanced Extensions (Optional)

- **Multi-model Optimization:** Compare prompt effectiveness across different LLM architectures
- **Dynamic Prompt Adaptation:** Implement systems that modify prompts based on real-time performance feedback
- **Prompt Compression:** Investigate methods for maintaining effectiveness while reducing prompt length
- **Cross-lingual Prompt Engineering:** Analyze prompt optimization across different languages

**Grading Rubric:**

| Section                                        | Points |
|:-----------------------------------------------|:------:|
| Prompt engineering methodology framework      | 2      |
| Multi-domain prompt optimization              | 3      |
| Prompt interpretability & failure analysis    | 2      |
| Automated prompt evaluation & scaling         | 2      |
| Future directions & theoretical implications  | 1      |
| **Total**                                     | **10** |

**Evaluation Criteria:**
- **Methodological Rigor (30%):** Systematic approach to prompt design and optimization
- **Theoretical Understanding (35%):** Depth of analysis of prompt engineering principles and mechanisms  
- **Innovation and Insights (35%):** Novel approaches, unexpected findings, and meaningful contributions to the field

**Learning Objectives:**
Students will develop expertise in advanced prompt engineering as a systematic discipline, understanding both the practical techniques and theoretical foundations necessary for designing effective human-AI interaction systems. This includes mastery of optimization methods, evaluation frameworks, and the ability to analyze and predict prompt effectiveness across diverse applications.

# Imports and Installs

In [None]:
!pip install tqdm textattack sentence-transformers -q

In [None]:
# Data preprocessing libraries
import pandas

# LLM libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm import trange
import torch
import torch.nn.functional as F

# Libraries for cosine similarity
from sentence_transformers import SentenceTransformer
from sentence_transformers import util
miniLM_model = SentenceTransformer('all-MiniLM-L6-v2') # Load pre-trained model (you can choose different models based on your needs)

# Text augmentation library
from textattack.augmentation import EasyDataAugmenter

# 1. **Prompt Engineering Methodology Framework**

## Information placement order within prompt matters
Information placed at the beginning or the end of a prompt tends to receive more attention. Tokens at the beginning set the context in which all subsequent tokens are encoded, while tokens near the end are the most recent and are weighted more heavily when the model predicts the next token.
Important information placed in the middle of a prompt influences fewer subsequent tokens and tends to receive less attention.
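
The placement effect described above can be tested by generating the same prompt with the instruction moved to different positions. The following is a minimal sketch; `place_instruction`, the sample context rows, and the sample instruction are hypothetical names and values used only for illustration, not part of the original experiment.

```python
def place_instruction(instruction, context_chunks, position="end"):
    """Return one prompt string with the instruction at the chosen position."""
    chunks = list(context_chunks)  # copy so the caller's list is not modified
    if position == "start":
        parts = [instruction] + chunks
    elif position == "end":
        parts = chunks + [instruction]
    elif position == "middle":
        mid = len(chunks) // 2
        parts = chunks[:mid] + [instruction] + chunks[mid:]
    else:
        raise ValueError(f"unknown position: {position}")
    return "\n".join(parts)


# Hypothetical context and instruction, used only for illustration
context = ["Row 1: lead, 245.7 ppm", "Row 2: lead, 312.4 ppm"]
instruction = "Summarize the trend in one sentence."
variants = {pos: place_instruction(instruction, context, pos)
            for pos in ("start", "middle", "end")}
```

Each entry in `variants` could then be passed to the same model with the same sampling settings, so that any difference in output can be attributed to placement alone.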

## Prompt complexity vs Output reliability tradeoff
The more complex a prompt is (for example, a prompt that requires multiple steps to solve), the more likely the model is to misinterpret or misrepresent it, which makes its output less reliable.

## Different prompting paradigms for different reasoning tasks
**Zero-shot prompting**: Instruct the LLM to perform a task without any examples. The LLM may misunderstand the question and give erroneous outputs.

**Few-shot prompting**: Instruct the LLM to perform a task and include sample input-output pairs. The examples show the LLM the general input format and the expected output format.

**Chain of Thought (CoT) prompting**: Instruct the LLM to reason step by step. This uses more tokens in the context window but guides the LLM more effectively by breaking the problem down into small sequential tasks.
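
The three paradigms differ only in how the prompt string is assembled, which a few small builder functions can make explicit. This is a sketch under assumptions: the function names are hypothetical, and the `Prompt: ... Answer: ...` layout simply mirrors the format used in this notebook's few-shot prompt.

```python
def build_zero_shot(task):
    """Zero-shot: the task alone, with no examples."""
    return task


def build_few_shot(task, examples):
    """Few-shot: (prompt, answer) pairs shown before the real task."""
    shots = [f"Prompt: {p} Answer: {a}" for p, a in examples]
    shots.append(f"Prompt: {task} Answer:")
    return "\n".join(shots)


def build_chain_of_thought(task, steps):
    """Chain of thought: numbered guiding questions placed before the task."""
    numbered = " ".join(f"{i}. {s}" for i, s in enumerate(steps, start=1))
    return f"Think step by step. {numbered} {task}"


# Illustrative inputs only
task = "List 3 methods to find geospatial hotspots."
examples = [("List 1 method to classify images.",
             "1. Convolutional Neural Networks (CNNs)")]
few = build_few_shot(task, examples)
cot = build_chain_of_thought(task, ["What methods find spatial patterns?",
                                    "Which of them scale to large data?"])
```

Keeping the builders separate from the model call makes it easier to hold the task text fixed while varying only the paradigm.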



In [None]:
class ModelLLM:
    def __init__(self, name):
        self.name = name # Hugging Face model ID
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.model = AutoModelForCausalLM.from_pretrained(name)

    def generate(self, prompt, cache=False, temperature=1): # Generate output from 1 string (prompt)
        prompt += "<|assistant|>" # Add role assignment tag for TinyLlama
        tokenized_inputs = self.tokenizer(prompt, return_tensors="pt")
        token_ids = self.model.generate(**tokenized_inputs, max_new_tokens=300, use_cache=cache, do_sample=True, temperature=temperature)
        decoded_tokens = self.tokenizer.decode(token_ids[0], skip_special_tokens=True)
        return decoded_tokens

    def multi_generate(self, prompts, cache=False, temperature=1): # Generate outputs from list of strings (prompts)
        outputs = []
        for i in trange(len(prompts)):
            decoded_tokens = self.generate(prompts[i], cache=cache, temperature=temperature)
            outputs.append(decoded_tokens)
        return outputs

    def substep_generate(self, prompts, cache=False, temperature=1): # Generate outputs from list of strings (subdivided task prompts)
        # Note: the last entry of `prompts` is modified in place so that print_responses can strip the full prompt later.
        outputs = []
        for i in trange(len(prompts)):
            if i == len(prompts) - 1 and i >= 2: # If processing last prompt, which combines the 2 previous outputs
                # Strip each echoed prompt from the output it produced (outputs[-1] came from prompts[i-1], outputs[-2] from prompts[i-2])
                outputs[-1] = outputs[-1].replace(prompts[i-1]+"<|assistant|>", "")
                outputs[-2] = outputs[-2].replace(prompts[i-2]+"<|assistant|>", "")
                prompts[i] += f" output 1: {outputs[-2]}. output 2: {outputs[-1]}." # Add previous 2 outputs as inputs to last prompt for combining.

            decoded_tokens = self.generate(prompts[i], cache=cache, temperature=temperature)
            outputs.append(decoded_tokens)
        return outputs

    def tokenize(self, prompt): # Tokenize 1 string (prompt)
        token_ids = self.tokenizer.encode(prompt)
        decoded_tokens = self.tokenizer.convert_ids_to_tokens(token_ids)
        return {"input": prompt, "token_ids": token_ids, "decoded_tokens": decoded_tokens}

    def multi_tokenize(self, prompts): # Tokenize list of strings (prompts)
        rows = []
        for i in range(len(prompts)):
            row = self.tokenize(prompts[i])
            print("Input text:", prompts[i])
            print("Token ids:", row["token_ids"])
            print("Decoded tokens:", row["decoded_tokens"])
            row["prompt_num"] = i
            rows.append(row)
        return pandas.DataFrame.from_dict(rows, orient="columns")

tinyLlama = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tinyLlama_model = ModelLLM(tinyLlama)
print(tinyLlama_model.model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rot

In [None]:
def print_responses(responses, prompts):
    for i, response in enumerate(responses):
        print()
        print()
        print("Prompt", i, "-", prompts[i])
        print()
        print("Response", i, "-", response.replace(prompts[i]+"<|assistant|>", "")) # Delete the prompt itself from the response

In [None]:
zero_shot_prompt = "List 3 machine learning methods to analyze a large geospatial database and find geospatial hotspots or interesting patterns."
few_shot_prompt = f"""Prompt: List 1 machine learning method to classify a large database of images. Answer: 1. Convolutional Neural Networks (CNNs) are deep learning models specifically designed for image data. They automatically learn spatial hierarchies of features through layers of convolutions, pooling, and nonlinear activations.
Prompt: List 1 machine learning method to identify patterns across geospatial SQL rows. Answer: 1. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) groups together points that are closely packed and marks points in low-density areas as outliers.
Prompt: {zero_shot_prompt} Answer: ?"""
chain_of_thought_prompt = f"Think step by step. 1. What are good machine learning methods to identify patterns and hotspots across any geospatial data? 2. What are good machine learning methods for Big Data? 3. What are machine learning or generative AI methods that can combine both of these techniques? {zero_shot_prompt}"

In [None]:
prompts = [zero_shot_prompt, few_shot_prompt, chain_of_thought_prompt]
responses = tinyLlama_model.multi_generate(prompts, cache=True, temperature=0.2)
print_responses(responses, prompts)



Prompt 0 - List 3 machine learning methods to analyze a large geospatial database and find geospatial hotspots or interesting patterns.

Response 0 - 

1. K-means Clustering: K-means clustering is a popular technique for grouping data points based on their similarity. It is a non-parametric method that partitions the data into k clusters, where k is a hyperparameter that can be tuned. K-means clustering is a good choice for large geospatial datasets as it can handle large datasets efficiently.

2. Hierarchical Clustering: Hierarchical clustering is a clustering technique that groups data points based on their similarity. It is a hierarchical clustering algorithm that partitions the data into k clusters, where k is a hyperparameter that can be tuned. Hierarchical clustering is a good choice for large geospatial datasets as it can handle large datasets efficiently.

3. Decision Trees: Decision trees are a type of supervised learning algorithm that can be used to classify data points ba

# 2. **Multi-Domain Prompt Optimization**

**Meta-prompting**: Use LLMs to suggest prompt improvements

**Recursive self-improvement**: Multi-iteration refinement cycles

**Template-based optimization**: Structured formats with placeholders

**Context-aware decomposition**: Breaking complex tasks into sub-prompts
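
Template-based optimization, as listed above, relies on prompts with named placeholders that are filled in per task. A minimal sketch follows, assuming the square-bracket placeholder style used in this notebook's prompts; `fill_template` and the sample values are hypothetical and shown only for illustration.

```python
import re


def fill_template(template, values):
    """Replace each [placeholder] with its value; fail loudly on missing keys."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder [{key}]")
        return str(values[key])
    # Match text between one pair of square brackets, with no nested brackets
    return re.sub(r"\[([^\[\]]+)\]", substitute, template)


template = "Write a story about [name], a [character role], who has a [goal]."
filled = fill_template(template, {
    "name": "Emily",
    "character role": "student",
    "goal": "dream of studying music",
})
```

Raising on a missing key, rather than leaving the placeholder in place, is a design choice that keeps half-filled templates from silently reaching the model.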

- **Comparative Analysis:**
  - Measure performance across domains using appropriate metrics (accuracy, creativity, coherence, etc.)
  - Document the optimization process and effectiveness of each technique
  - Analyze which techniques work best for different types of tasks

### Creative prompt
| Prompt optimization technique      | accuracy | creativity | coherence |
|:-----------------------------------|:--------:|:----------:|:---------:|
| Meta-prompting                     |    10    |     4      |     10    |
| Recursive self-improvement         |    9     |     5      |     10    |
| Template-based optimization        |    9     |     5      |     10    |
| Context-aware decomposition        |    5     |     10     |     3     |

### Data Analysis prompt
| Prompt optimization technique      | accuracy | creativity | coherence |
|:-----------------------------------|:--------:|:----------:|:---------:|
| Meta-prompting                     |    6     |     0      |     10    |
| Recursive self-improvement         |    9     |     0      |     10    |
| Template-based optimization        |    9     |     0      |     10    |
| Context-aware decomposition        |    5     |     0      |     3     |

### Documentation prompt
| Prompt optimization technique      | accuracy | creativity | coherence |
|:-----------------------------------|:--------:|:----------:|:---------:|
| Meta-prompting                     |    7     |     0      |     10    |
| Recursive self-improvement         |    9     |     0      |     10    |
| Template-based optimization        |    9     |     0      |     10    |
| Context-aware decomposition        |    3     |     0      |     5     |


## Various prompting strategies for different cognitive domains
 * **Meta-prompting** - Generally works well across all cognitive domains for questions that call for a specific format and type of answer rather than an open-ended one.
 * **Recursive self-improvement** - Generally works well across all cognitive domains because it helps identify the specific ways the model misunderstands a prompt, so the prompt can be corrected over successive iterations.
 * **Template-based optimization** - Works especially well for the data analysis and documentation prompts. However, an overly structured response is suboptimal for a creative writing prompt.
 * **Context-aware decomposition** - In this experiment, I had to expand the token window so that the final sub-prompt, which combines the previous outputs, could fit. However, tinyLlama ended up generating incoherent code for the data analysis and documentation prompts (filled with unnecessary imports and repetitive code). Subdividing the creative writing prompt helped generate a very creative but sometimes inconsistent script.
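
The recursive self-improvement cycle described in the list above can be written as a small loop that revises a prompt, scores the result, and keeps the best version. This is a sketch, not the procedure used to produce the tables: `refine_prompt` is a hypothetical helper, and the `toy_*` functions are stand-ins for a real model call, a real quality metric, and a real revision step (for example an LLM asked to critique the prompt).

```python
def refine_prompt(prompt, generate, score, revise, max_rounds=3):
    """Keep the best-scoring prompt seen over a fixed number of revision rounds."""
    best_prompt = prompt
    best_score = score(generate(prompt))
    history = [(prompt, best_score)]
    for _ in range(max_rounds):
        candidate = revise(best_prompt)
        candidate_score = score(generate(candidate))
        history.append((candidate, candidate_score))
        if candidate_score > best_score:  # accept only strict improvements
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, history


# Toy stand-ins so the loop can run without a model (assumptions, not real metrics)
def toy_generate(prompt):
    return prompt  # an "echo" model


def toy_score(output):
    return min(len(output.split()), 8)  # saturating score


def toy_revise(prompt):
    return prompt + " Use bullet points."


best, history = refine_prompt("List three methods.", toy_generate, toy_score, toy_revise)
```

The `history` list records every candidate and its score, which gives the optimization trajectory that the assignment asks to document.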


## A generally successful prompt structure
A generally successful prompt structure is concise. It leaves little room for misinterpretation by providing short one-shot or few-shot examples that show the model the input and output formatting and the style of answer the prompter wants.

## Systematic method to identify optimal prompt
To evaluate how close a prompt is to optimal, one can examine the LLM's responses and check:
* *Is the LLM's response factually accurate?*
* *Does the LLM's response comply with the instructions?*
* *Is the LLM's response coherent and relevant to the prompt, or is the LLM generating irrelevant content?*
* *Are the LLM's responses generally consistent across multiple generations from the same prompt?*
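
The last check in the list above, consistency across generations, can be approximated numerically. The sketch below uses word-level Jaccard overlap as a dependency-free stand-in; this is an assumption made for illustration, and an embedding-based cosine similarity would capture meaning rather than shared wording. The function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-level Jaccard overlap between two strings (case-insensitive)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty responses are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def consistency_score(responses):
    """Mean pairwise similarity; 1.0 means every response uses the same words."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # a single response is trivially consistent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A prompt whose score stays high over several sampled generations is behaving more predictably than one whose responses share few words, although lexical overlap alone says nothing about factual accuracy.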

In [None]:
# Original prompts
prompt_creative_writing = "Write a story using Outline → characters → scenes → dialogue."
prompt_data_analysis = """headers = ["Sample_ID", "Latitude", "Longitude", "Collection_Date", "Detected", "Concentration_ppm"]
data = [ [1, 40.7, -74.5, "2022-05-15", "Lead", 245.7], [2, 40.7, -74.8, "2023-06-22", "Lead", 312.4], [3, 40.6, -74.4, "2024-07-10", "Lead", 398.1] ] Does this data reveal anything?"""
prompt_documentation = """Convert code into a python function: embedding = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(embedding, corpus_embeddings, top_k=5)[0]
top_k_keys = "Fetched features and values:\n"
for hit in hits:
    key = corpus_keys[hit['corpus_id']]
    top_k_keys += f"{key} - {response_map_descriptions[key]["value"]}"
"""

In [None]:
# Meta-prompting
prompt_creative_writing = "Write a story about a character with a certain role who must fulfill a need or want. But, they're prevented by an external or internal conflict."
prompt_data_analysis = """Analyze the trend in this contamination time series data and summarize the key findings.
headers = ["Sample_ID", "Latitude", "Longitude", "Collection_Date", "Detected", "Concentration_ppm"]
data = [ [1, 40.7, -74.5, "2022-05-15", "Lead", 245.7], [2, 40.7, -74.8, "2023-06-22", "Lead", 312.4], [3, 40.6, -74.4, "2024-07-10", "Lead", 398.1] ]"""
prompt_documentation = """Convert this into a well-structured Python function with proper documentation. embedding = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(embedding, corpus_embeddings, top_k=5)[0]
top_k_keys = "Fetched features and values:\n"
for hit in hits:
    key = corpus_keys[hit['corpus_id']]
    top_k_keys += f"{key} - {response_map_descriptions[key]["value"]}"
"""

In [None]:
prompts = [prompt_creative_writing, prompt_data_analysis, prompt_documentation]
responses = tinyLlama_model.multi_generate(prompts, cache=True, temperature=0.2)
print_responses(responses, prompts)

100%|██████████| 3/3 [04:03<00:00, 81.26s/it]



Prompt 0 - Write a story about a character with a certain role who must fulfill a need or want. But, they're prevented by an external or internal conflict.

Response 0 - 

Sophie was a successful businesswoman who had built her own empire from the ground up. She had a reputation for being tough and unyielding, but deep down, she was a kind and compassionate person. She had a heart of gold, and her employees loved her for it.

One day, Sophie received a call from her boss, who informed her that her company was facing a major crisis. The company's main product had been discontinued, and they were struggling to find a replacement. Sophie knew that this was a critical need for her company, but she was prevented from fulfilling it by an internal conflict.

Sophie's boss had been pushing for a new product that would be more profitable and popular. Sophie had been hesitant to go along with this plan, as she had invested a lot of time and resources into developing the original product.


Pro




In [None]:
# Recursive self-improvement and Template-based optimization
prompt_creative_writing =  "Write a story about [name], a [character role], who has a [goal]. But, they're prevented by this [antagonistic force]."
prompt_data_analysis = """Analyze the trend in this contamination time series data and summarize the key findings.
headers = ["Sample_ID", "Latitude", "Longitude", "Collection_Date", "Detected", "Concentration_ppm"]
data = [ [1, 40.7, -74.5, "2022-05-15", "Lead", 245.7], [2, 40.7, -74.8, "2023-06-22", "Lead", 312.4], [3, 40.6, -74.4, "2024-07-10", "Lead", 398.1] ]
Response should be [Geographical observations from spatial context] [Summary of contamination level and its implications]"""
prompt_documentation = """Convert code into  python function def cosine_similarity_fetch(args): [code] return top_k_keys.
Code:
embedding = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(embedding, corpus_embeddings, top_k=5)[0]
top_k_keys = "Fetched features and values:\n"
for hit in hits:
    key = corpus_keys[hit['corpus_id']]
    top_k_keys += f"{key} - {response_map_descriptions[key]["value"]}"
"""

In [None]:
prompts = [prompt_creative_writing, prompt_data_analysis, prompt_documentation]
responses = tinyLlama_model.multi_generate(prompts, cache=True, temperature=0.2)
print_responses(responses, prompts)

100%|██████████| 3/3 [04:23<00:00, 87.74s/it]



Prompt 0 - Write a story about [name], a [character role], who has a [goal]. But, they're prevented by this [antagonistic force].

Response 0 - 

Name: Emily

Character Role: Student

Goal: Get into a prestigious university

Antagonistic Force: The college admissions process is rigorous and competitive, and Emily's application is rejected multiple times.

Emily's parents are both successful businesspeople, and they have always encouraged her to pursue her dreams. Emily's passion for music has always been her greatest strength, and she dreams of becoming a professional musician. However, the college admissions process is a daunting task, and Emily's application is rejected multiple times.

Emily's parents are devastated, but they know that Emily has what it takes to succeed. They encourage her to keep trying, and Emily begins to focus on her studies and her passion for music. She starts to attend music classes and auditions, and she even starts playing in a local band.


Prompt 1 - An




In [None]:
# Context-aware decomposition
prompts_creative_writing = ["Introduce protagonist [name] as a [character role]. Establish their motivation for pursuing a [goal].",
 "Introduce [antagonistic force] and the moment when the protagonist overcomes [antagonistic force].",
 "Assemble the previous outputs into one cohesive and interesting story premise."]


geospatial_data = """headers = ["Sample_ID", "Latitude", "Longitude", "Collection_Date", "Detected", "Concentration_ppm"]
data = [ [1, 40.7, -74.5, "2022-05-15", "Lead", 245.7], [2, 40.7, -74.8, "2023-06-22", "Lead", 312.4], [3, 40.6, -74.4, "2024-07-10", "Lead", 398.1] ]"""
prompts_data_analysis = [f"Describe the geographical and temporal area covered by data like: Based on longitude and latitude, this data describes [geographical area] across [timeframe]. {geospatial_data}",
 f"Contextualize the soil quality data like: The underlying [pattern] has [implication] which can lead to [consequence]. {geospatial_data}",
 "Combine the previous outputs into a single, concise summary of the key findings."]


code = """embedding = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(embedding, corpus_embeddings, top_k=5)[0]
top_k_keys = "Fetched features and values:\n"
for hit in hits:
    key = corpus_keys[hit['corpus_id']]
    top_k_keys += f"{key} - {response_map_descriptions[key]["value"]}"
"""
prompts_documentation = [f"List all variables in this code that should become function parameters and the final value that should be returned. {code}",
                         f"Extract the core logic of the code, focusing on the search and retrieval steps, independent of string formatting and variables. {code}",
                         "Assemble the identified parameters, core logic, and formatted output into the final function: def cosine_similarity_fetch(args)"]

In [None]:
responses = tinyLlama_model.substep_generate(prompts_creative_writing, cache=True, temperature=1)
print_responses(responses, prompts_creative_writing)



Prompt 0 - Introduce protagonist [name] as a [character role]. Establish their motivation for pursuing a [goal].

Response 0 - 

ACT ONE
INT. ABANDONED BUILDING - DAY
[Scene: Introductions and set up for Jared, a 28-year-old who recently graduated from college. He's dressed in his school's letterman's jacket - the kind worn at the end of football games.]

JARED (To himself, in a deep voice). I just have to finish this thesis. I'll see the light at the end of the tunnel.

INT. CURRENT ABANDONED BUILDING - NIGHT

MORGAN (In a cold, measured tone). Jared, I need you to step aside and let me handle this.

JARED (In a defensive tone, not wanting to be taken from under the carnage). Who said I wanted a piece of you? You were always better than me.

MORGAN (Punching Jared in the arm). Jared, give me a chance to explain. These are my students you're talking about. How many were there?

JARED (Angry, defensive voice). You think you're better than us? All you care about is money and fame. You'

In [None]:
responses = tinyLlama_model.substep_generate(prompts_data_analysis, cache=True, temperature=0.1)
print_responses(responses, prompts_data_analysis)

100%|██████████| 3/3 [08:41<00:00, 173.73s/it]



Prompt 0 - Describe the geographical and temporal area covered by data like: Based on longitude and latitude, this data describes [geographical area] across [timeframe]. headers = ["Sample_ID", "Latitude", "Longitude", "Collection_Date", "Detected", "Concentration_ppm"]
data = [ [1, 40.7, -74.5, "2022-05-15", "Lead", 245.7], [2, 40.7, -74.8, "2023-06-22", "Lead", 312.4], [3, 40.6, -74.4, "2024-07-10", "Lead", 398.1] ]

Response 0 - 

# 4. Data cleaning: Remove any duplicates, missing values, or invalid data. Use appropriate data cleaning techniques like:

# 1. Dropping duplicates:
data = data.drop_duplicates(subset=["Sample_ID"])

# 2. Removing missing values:
data = data.dropna(subset=["Concentration_ppm"])

# 3. Validating data:
data = data.dropna(subset=["Concentration_ppm"])

# 4. Checking for invalid data:
data = data.dropna(subset=["Concentration_ppm"])

# 5. Checking for missing values:
data = data.dropna(subset=["Concentration_ppm"])

# 6. Checking for duplicates:
data = data




In [None]:
responses = tinyLlama_model.substep_generate(prompts_documentation, cache=True, temperature=0.001)
print_responses(responses, prompts_documentation)

100%|██████████| 3/3 [08:37<00:00, 172.61s/it]



Prompt 0 - List all variables in this code that should become function parameters and the final value that should be returned. embedding = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(embedding, corpus_embeddings, top_k=5)[0]
top_k_keys = "Fetched features and values:
"
for hit in hits:
    key = corpus_keys[hit['corpus_id']]
    top_k_keys += f"{key} - {response_map_descriptions[key]["value"]}"


Response 0 - 
Here's the full code with the variables and their corresponding values:

```python
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_sc




# 3. **Prompt Interpretability and Failure Analysis**

## Attention patterns and model behavior with different prompt structures
* *Prepend instruction to prompt* - Generated undesired code for the data analysis prompt.
* *Append instruction to prompt* - Improved the output for the data analysis prompt; the generated output followed the template structure given in the instructions.
* *Insert instruction into prompt* - Generated incoherent and repetitive output for the data analysis prompt.

## When and why do optimized prompts fail?
Optimized prompts may still fail to generate coherent, accurate, high-quality responses if the task is too complex for the model (see data analysis prompt below), because it may require more steps than the model can keep track of. A prompt may also fail if it requires up-to-date information that falls after the cutoff date of the model's training data.

## Prompt complexity and model interpretability
In the case of tinyLlama, the more complex the prompt is, the less interpretable the model's output is. A likely reason is that a suboptimal token generated near the beginning tends to propagate its error into later tokens, and because the steps of a complex task build on one another, the model may end up generating output that is both uninterpretable and incorrect.

## Systematic Error Analysis
* *Where prompt engineering breaks down* - Problems involving complex calculations, development of mathematical proofs, logic puzzles, and numerical data analysis and processing (as observed in this experiment).
* *An optimal prompt* - Fits within the token window, includes specific examples of the various potential input-output pairs, and uses concise language that has a low chance of being misinterpreted (as demonstrated by the first creative writing prompt, which generated an acceptable output that followed the prompt's instructions).
* *Biased prompts* - A prompt may accidentally introduce bias by stating assumptions that are untrue. LLMs are built to generate the next most likely token at each pass through the transformer architecture, so a biased assumption may result in a biased token being generated, and that bias then propagates through the rest of the generated output via the self-attention mechanism.

## Model reasoning transparency using prompt techniques
By using techniques such as context-aware decomposition, users can narrow a problem down to a specific step. This makes it easier to follow the model's path of reasoning step by step and to debug it.
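The decomposition idea can be sketched as a loop that sends each sub-instruction as its own prompt and records a trace per step. This is only an illustration: `run_decomposed`, `fake_generate`, and the three sub-instructions are hypothetical names chosen here, and `fake_generate` stands in for a real model call so the sketch runs without a model.

```python
from typing import Callable, List


def run_decomposed(data: str, steps: List[str], generate: Callable[[str], str]) -> List[dict]:
    """Run each sub-instruction as its own prompt and keep a per-step trace."""
    trace = []
    context = data
    for i, step in enumerate(steps):
        # The instruction is placed after the data, one sub-task per prompt.
        prompt = f"{context}\n{step}"
        output = generate(prompt)
        trace.append({"step": i, "instruction": step, "prompt": prompt, "output": output})
        # Feed the previous answer forward so the next step can build on it.
        context = f"{data}\nPrevious finding: {output}"
    return trace


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a model call: echoes the last line of the prompt."""
    return f"[answer to: {prompt.splitlines()[-1]}]"


steps = [
    "Note geographical observations from the spatial context.",
    "Summarize the contamination level.",
    "State the implications of that contamination level.",
]
trace = run_decomposed('headers = ["Sample_ID", "Latitude"]', steps, fake_generate)
for record in trace:
    print(record["step"], record["output"])
```

Because every step has its own prompt and output in `trace`, a bad final answer can be traced back to the first step whose output went wrong.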
## Prompt-based optimization limitations compared to fine-tuning
In prompt-based optimization, the instructions for how to process and execute a task are constrained by the size of the input window and must be provided again at each new generation. When a large amount of data requires processing, fine-tuning a model may improve the model's ability to generate the desired output.

## Genuine understanding? Or spurious correlations?
The experiment below demonstrates the model's inability to understand the required task consistently. For example, even a minor alteration of the prompt (placing the instruction before the data instead of after it) causes the model to assume that the Python list brackets [ ] mean the prompt is asking for code to analyse the list data. In other cases, the model may hallucinate and provide incorrect or incoherent information.

In [None]:
instruction = "Given the data, note geographical observations from spatial context and a summary of contamination level and its implications"
data1 = """headers = ["Sample_ID", "Latitude", "Longitude", "Collection_Date", "Detected", "Concentration_ppm"]"""
data2 = """[ [1, 40.7, -74.5, "2022-05-15", "Lead", 245.7], [2, 40.7, -74.8, "2023-06-22", "Lead", 312.4], [3, 40.6, -74.4, "2024-07-10", "Lead", 398.1] ]"""
prompt_prepend_instruction = f"{instruction} {data1} {data2}"
prompt_append_instruction = f"{data1} {data2} {instruction}"
prompt_insert_instruction = f"{data1} {instruction} {data2}"
prompts = [prompt_prepend_instruction, prompt_append_instruction, prompt_insert_instruction]

In [None]:
responses = tinyLlama_model.multi_generate(prompts, cache=True, temperature=0.1)
print_responses(responses, prompts)

100%|██████████| 3/3 [08:00<00:00, 160.31s/it]



Prompt 0 - Given the data, note geographical observations from spatial context and a summary of contamination level and its implications headers = ["Sample_ID", "Latitude", "Longitude", "Collection_Date", "Detected", "Concentration_ppm"] [ [1, 40.7, -74.5, "2022-05-15", "Lead", 245.7], [2, 40.7, -74.8, "2023-06-22", "Lead", 312.4], [3, 40.6, -74.4, "2024-07-10", "Lead", 398.1] ]

Response 0 - 

# 4. Calculate the mean and standard deviation of the detected lead concentration for each sample
mean_lead = sum(lead_concentration_data) / len(lead_concentration_data)
std_lead = math.sqrt(sum((lead_concentration_data - mean_lead) ** 2) / len(lead_concentration_data))

# 5. Plot the contamination level and its implications
plt.scatter(lead_concentration_data[:, 0], lead_concentration_data[:, 1], c="red", s=100)
plt.xlabel("Latitude")
plt.ylabel("Longitude")
plt.title("Contamination Level and Its Implications")
plt.show()

# 6. Save the contamination level and its implications as a CSV file
w




# 4. **Automated Prompt Evaluation and Scaling**

## Metrics for prompt quality that go beyond task-specific performance
As stated previously, the quality of any prompt can be judged by evaluating the response the LLM gives to it:
* *Is the LLM's response factually accurate?*
* *Does the LLM's response comply with the instructions?*
* *Is the LLM's response coherent and relevant to the prompt, or is the LLM generating irrelevant data?*
* *Is the LLM's response generally consistent across multiple generations using the same prompt?*
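
The last question, consistency across generations, can be turned into a number by averaging a pairwise similarity over several responses to the same prompt. The sketch below is a minimal illustration, not the notebook's evaluation code: `jaccard` and `consistency_score` are hypothetical helpers, and word-level Jaccard overlap is a crude stand-in for an embedding-based similarity.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap; a crude stand-in for embedding similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consistency_score(generations, similarity=jaccard) -> float:
    """Mean pairwise similarity over all generations for one prompt."""
    pairs = list(combinations(generations, 2))
    if not pairs:
        return 1.0  # A single generation is trivially consistent with itself.
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical generations for one prompt: two agree, one drifts off-topic.
samples = [
    "lead levels are rising each year",
    "lead levels are rising each year",
    "the plot shows latitude and longitude",
]
print(round(consistency_score(samples), 3))
```

A score near 1 means the generations largely agree; a low score signals that the prompt leaves the model too much room to vary.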

## Computational costs of different optimization approaches
| Prompt optimization technique      |Computational costs (1-10)|
|------------------------------------|:------------------------:|
| Meta-prompting                     |             2            |
| Recursive self-improvement         |             7            |
| Template-based optimization        |             1            |
| Context-aware decomposition        |             7            |

## Techniques for prompt transfer across related tasks
* Append one-shot or few-shot examples at the end of a prompt to tailor the model's generation to the needs of the specific task.
* Fine-tune the model to specialize it in completing a similar set of tasks.

**Automated prompt generation**: Program the same or another model to generate the prompts that are then fed to a model to complete a certain task. This saves manual effort because the generating model can quickly produce multiple variations of the same prompt.

**Refinement systems**: Program the same or another model to perform recursive self-improvement, running through multi-iteration refinement cycles and stopping once an acceptable prompt is generated.
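
A refinement cycle of this kind can be sketched as a loop that rewrites the prompt, scores the candidate, and stops at a threshold or an iteration cap. Everything here is an assumption made for illustration: `refine_prompt`, `toy_rewrite`, `toy_score`, and the `REQUIRED` terms are hypothetical, and in practice the rewrite and scoring steps would be model calls rather than string checks.

```python
REQUIRED = ("summary", "implications", "plain prose")  # Hypothetical quality criteria


def toy_score(prompt: str) -> float:
    """Toy scorer: fraction of required terms the prompt mentions."""
    return sum(term in prompt.lower() for term in REQUIRED) / len(REQUIRED)


def toy_rewrite(prompt: str) -> str:
    """Toy rewriter: add the first missing requirement to the prompt."""
    for term in REQUIRED:
        if term not in prompt.lower():
            return f"{prompt} Include {term}."
    return prompt


def refine_prompt(prompt, rewrite, score, threshold=1.0, max_iters=5):
    """Keep the best-scoring prompt; stop at the threshold or after max_iters."""
    best_prompt, best_score = prompt, score(prompt)
    for _ in range(max_iters):
        if best_score >= threshold:
            break
        candidate = rewrite(best_prompt)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # Accept only strict improvements.
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score


final_prompt, final_score = refine_prompt("Describe the data.", toy_rewrite, toy_score)
print(final_prompt, final_score)
```

The iteration cap matters because each cycle costs at least one rewrite and one evaluation, which is why recursive self-improvement rates higher on computational cost than a fixed template.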


In [None]:
def compare_output_quality(actual_output, desired_output, model):
    """Return the similarity score between the embeddings of two texts."""
    # Generate embeddings for both texts
    actual_embedding = model.encode(actual_output, convert_to_tensor=True)
    desired_output_embedding = model.encode(desired_output, convert_to_tensor=True)

    # With one query and one corpus entry, semantic search returns a single hit
    # (scored with cosine similarity by default)
    hits = util.semantic_search(actual_embedding, desired_output_embedding, top_k=1)[0]

    similarity_score = hits[0]['score'] # Score of the single hit
    return similarity_score
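
Assuming `util.semantic_search` is left at its default scoring function, the score returned above is the cosine similarity between the two embeddings. The sketch below shows that computation on small hand-written vectors; `cosine_similarity` is a hypothetical helper for illustration and is not part of the notebook.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # Same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # Orthogonal directions
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # Same direction, different length
```

Because the score depends only on direction, it measures how close the two texts are in embedding space. It does not check whether the response is factually correct.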

In [None]:
desired_output = """The samples collected show increasing lead contamination levels which pose a clear risk to public health, particularly in an urban area like New York City, near Manhattan.
Lead is a toxic substance, especially dangerous to young children and pregnant women. Chronic exposure to lead can result in developmental delays, cognitive impairment, and other serious health conditions.
This level of lead contamination is significantly above safe thresholds for drinking water."""

In [None]:
print("Evaluation system to compare prompt effectiveness:")
print("Prompt 0 = prepend instructions. Prompt 1 = append instructions. Prompt 2 = insert instructions")
print()
for i, response in enumerate(responses):
    similarity = compare_output_quality(response, desired_output, miniLM_model)
    print(f"Cosine Similarity Score of actual output vs desired output from using prompt {i}: {similarity:.4f}")

Evaluation system to compare prompt effectiveness:
Prompt 0 = prepend instructions. Prompt 1 = append instructions. Prompt 2 = insert instructions

Cosine Similarity Score of actual output vs desired output from using prompt 0: 0.4316
Cosine Similarity Score of actual output vs desired output from using prompt 1: 0.6827
Cosine Similarity Score of actual output vs desired output from using prompt 2: 0.5443


In [None]:
# Method for testing prompt robustness across edge cases: introduce text noise into the prompts
eda_augmenter = EasyDataAugmenter(transformations_per_example=1) # Initialize the EasyDataAugmenter
augmented_prompts = [eda_augmenter.augment(prompt)[0] for prompt in prompts] # Apply word-level noise: random insertion, deletion, swap, or synonym replacement
responses = tinyLlama_model.multi_generate(augmented_prompts, cache=True, temperature=0.1)
print_responses(responses, augmented_prompts) # Same (responses, prompts) argument order as the earlier calls

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
100%|██████████| 3/3 [07:09<00:00, 143.11s/it]



Prompt 0 - Given the data, note geographical observations from spatial parallel context and a summary of contamination level and its implications headers = ["twoscore Sample_ID", "Latitude", "Longitude", "Collection_Date", "Detected", "Concentration_ppm"] [ [1, 40.7, -74.5, "2022-05-15", "Lead", 245.7], [2, 40.7, -74.8, "2023-06-22", "pollution Lead", 312.4], [3, 40.6, -74.4, "2024-07-10", "Lead", 398.1] ]<|assistant|>

# 4. Calculate the average concentration of lead in the given data
avg_lead = sum(map(float, data[0])) / len(data[0])
print("Average concentration of lead in the data:", avg_lead)

# 5. Calculate the standard deviation of lead concentration in the data
std_lead = math.sqrt(sum((map(float, data[0]) - avg_lead) ** 2) / (len(data[0]) - 1))
print("Standard deviation of lead concentration in the data:", std_lead)

# 6. Calculate the correlation coefficient between the concentration of lead and the geographical location
correlation_coefficient = math.sqrt(sum((map(float, da




In [None]:
print("Evaluation system to compare prompt effectiveness:")
print("Prompt 0 = prepend instructions. Prompt 1 = append instructions. Prompt 2 = insert instructions")
print()
for i, response in enumerate(responses):
    similarity = compare_output_quality(response, desired_output, miniLM_model)
    print(f"Cosine Similarity Score of actual output vs desired output from using prompt {i}: {similarity:.4f}")

Evaluation system to compare prompt effectiveness:
Prompt 0 = prepend instructions. Prompt 1 = append instructions. Prompt 2 = insert instructions

Cosine Similarity Score of actual output vs desired output from using prompt 0: 0.4152
Cosine Similarity Score of actual output vs desired output from using prompt 1: 0.5847
Cosine Similarity Score of actual output vs desired output from using prompt 2: 0.5793
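
The two sets of scores can be compared directly to see how much each prompt layout moved under noise. The short sketch below only does arithmetic on the scores printed above; the variable names are chosen for illustration.

```python
# Scores copied from the two evaluation outputs above.
labels = ["prepend", "append", "insert"]
clean = [0.4316, 0.6827, 0.5443]  # Original prompts
noisy = [0.4152, 0.5847, 0.5793]  # Augmented (noisy) prompts

# Change in similarity after adding noise (negative = worse).
deltas = [round(n - c, 4) for c, n in zip(clean, noisy)]
for label, c, n, d in zip(labels, clean, noisy, deltas):
    print(f"{label:8s} clean={c:.4f} noisy={n:.4f} change={d:+.4f}")

mean_abs_change = sum(abs(d) for d in deltas) / len(deltas)
print(f"mean absolute change: {mean_abs_change:.4f}")
```

The append layout loses the most under noise but still scores highest, while the insert layout rises slightly. With a single run per prompt at a non-zero temperature, differences this small may partly reflect sampling variation rather than a real effect.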


# 5. **Future Directions and Theoretical Implications**

## Broad principles of prompt engineering
1. Draft minimal workflow
2. Identify missing components, adding iteratively
3. Give few-shot examples if outputs deviate
4. Split tasks across chained prompts
5. Validate with another LLM or rule-based checks
6. Automate workflow with LangChain or LlamaIndex
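
A rule-based check of the kind mentioned in step 5 can be very small. The sketch below flags responses that look like code when prose was requested, which is the failure seen with the prepended-instruction prompt. `CODE_PATTERNS` and `looks_like_code` are hypothetical names, and the patterns are a rough heuristic rather than a complete detector.

```python
import re

# Heuristic patterns that suggest a line is Python code rather than prose.
CODE_PATTERNS = [
    r"^\s*import\s+\w+",             # import statements
    r"^\s*from\s+\w+\s+import\b",    # from ... import statements
    r"^\s*def\s+\w+\(",              # function definitions
    r"\bplt\.\w+\(",                 # matplotlib calls
    r"^\s*\w+\s*=\s*.+\(.*\)\s*$",   # assignment from a function call
]


def looks_like_code(response: str, min_hits: int = 2) -> bool:
    """Return True if at least min_hits lines match a code pattern."""
    hits = 0
    for line in response.splitlines():
        if any(re.search(pattern, line) for pattern in CODE_PATTERNS):
            hits += 1
    return hits >= min_hits


code_response = 'plt.xlabel("Latitude")\nplt.ylabel("Longitude")\nplt.show()'
prose_response = "The samples show increasing lead contamination levels."
print(looks_like_code(code_response), looks_like_code(prose_response))
```

A check like this is cheap enough to run on every response, so it can gate a retry or a human review before the output reaches a downstream step.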

## Limitations of current prompting approaches
In production, outputs must be reliable and controlled in order to prevent failures and maintain the quality of the system. However, generative models often produce inconsistent outputs. Existing and proposed solutions use external extensions that add templates to guide the output format of generative AI.

## Human-AI interaction design
Human-in-the-loop is a practice that ensures that if a generative AI outputs something undesirable, a human can manually adjust the prompt and resubmit it to the AI.

## **Critical Analysis:**
* Prompt engineering has already evolved in multiple respects over the years, and it will continue to evolve: as already seen with today's state-of-the-art models, LLMs are increasingly able to make accurate assumptions about what users want given little to no instructions or context. In the future, there could be a smaller AI, or just one sophisticated AI, that handles both the prompt engineering and the generation itself without the need for human input.
* An ethical implication to consider: if a highly optimized prompt contains incorrect bias, how can that bias be caught downstream in production environments when the technology is moving further and further toward automation and AI, removing humans from the loop?
* Similar to humans, generative AI can have more difficulty executing a task properly and creating responses if the prompt lacks clarity or does not state the requirements for the desired response format.
* Prompt engineering can specify **guardrails**, which are instructions meant to deter the model from responding to harmful, hurtful, or irrelevant prompts.