# Exemplar Answer Generation Project - OpenAI Integration Section

This notebook preprocesses the training data, formats it for the OpenAI API, integrates the API, and finally evaluates the results.

## Environment Setup and Dependency Import

This cell installs the third-party packages that later sections import. They cover data manipulation, preprocessing, OpenAI API integration, and evaluation.

In [9]:
! pip install numpy scikit-learn openai tiktoken nltk spacy textstat rouge-score



# Section 1: Training Data Retrieval

In [2]:
# Dependency Imports for section 1

import json
from collections import Counter
from statistics import mean
from sklearn.model_selection import train_test_split


## 1.1 DataSet Processing

### 1.1.1 Data Loading

In [3]:
# Load the training data
with open('data/cura-llm-training-data.json', 'r') as file:
    training_data = json.load(file)
    
    print(f"Loaded {len(training_data)} training samples")

Loaded 117 training samples


### 1.1.2 Data Preprocessing and Preparation

We'll first preprocess our data to clean and standardize it, then format it for the OpenAI API.

In [4]:
# Preprocess a single data item
def preprocess_data(data_item):
    
    processed = {}
    
    # 1. Clean and truncate task content
    task_content = data_item['task_content']
    
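    # Replace non-breaking-space entities with plain spaces
    # (other HTML entities, e.g. &amp;, are left untouched)
    task_content = task_content.replace('&nbsp;', ' ')
    
    # Truncate if too long (keeps only the first 11,000 characters)
    if len(task_content) > 11000:
        task_content = task_content[:11000]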
    
    processed['task_content'] = task_content
    
    # 2. Format question
    processed['question'] = data_item['question'].strip()
    
    # 3. Process rubric
    rubric = json.loads(data_item['rubric'])
    
    # Normalize scoring criteria
    processed['rubric'] = {
        'criteria': rubric['criteria'],
        'total_score': int(rubric['total_score']),
        'items': [item.strip() for item in rubric['items']]
    }
    
    # 4. Clean exemplar answer
    answer = data_item['answer']
    
    # Remove extra quotes
    answer = answer.strip('"')
    
    # Remove multiple spaces
    answer = ' '.join(answer.split())
    processed['answer'] = answer
    
    return processed

In [5]:
# Process all data items
processed_data = []

for item in training_data:
    try:
        processed = preprocess_data(item)
        processed_data.append(processed)
    except Exception as e:
        # Report the failure and move on to the next item
        print(f"Error processing item: {e}")

print(f"Successfully preprocessed {len(processed_data)} items")

Successfully preprocessed 117 items
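
`preprocess_data` replaces only `&nbsp;`, so other entities such as `&amp;` pass through unchanged. The following is a minimal sketch, assuming full entity decoding is wanted; `clean_task_content` is a hypothetical helper and not part of the notebook.

```python
import html


def clean_task_content(text: str, max_chars: int = 11000) -> str:
    """Hypothetical helper: decode HTML entities, then truncate."""
    # html.unescape turns &nbsp; into U+00A0 (a non-breaking space),
    # so map that character to a plain space afterwards.
    text = html.unescape(text).replace("\u00a0", " ")
    # Keep only the first max_chars characters, as preprocess_data does.
    return text[:max_chars]


sample = "Fins &amp; a nose cone&nbsp;design"
print(clean_task_content(sample))
```

Decoding before truncating means the 11,000-character limit applies to the text the model actually sees rather than to the longer entity-encoded form.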


In [6]:
# Verify preprocessing results
example_item = processed_data[0]
print("Example of preprocessed item:")
print(json.dumps(example_item, indent=2))

# Calculate preprocessing statistics
preprocessed_stats = {
    'task_content_lengths': [len(item['task_content']) for item in processed_data],
    'question_lengths': [len(item['question']) for item in processed_data],
    'answer_lengths': [len(item['answer']) for item in processed_data]
    }

print("\nPreprocessing Statistics:")
for key, values in preprocessed_stats.items():
    print(f"\n{key}:")
    print(f"  Mean: {mean(values):.2f}")
    print(f"  Max: {max(values)}")

Example of preprocessed item:
{
  "task_content": "Designing your rocket    Building phase     The shape, weight, and size of a rocket, and it\u2019s design of nose cone and fins, all affect how aerodynamic or efficient it will be.     Being efficient allows a rocket to use less fuel while travelling long distances or overcoming gravity to take off and escape our atmosphere.     Rockets need to go straight up when launching and not veer to one side or roll when travelling through space. An effective  nose cone  and  fins  will help to stabilise your rocket.     So, to create a rocket that can be launched into space, you must design:     A rocket body    Fins &amp; a nose cone    A final rocket design with all elements     In your teams, you must:     Design your nose cones and fins/tail. Think about which materials to use and how to attach them before constructing their designs    Two team members could construct the nose cone and the other two could construct the fins/tail design. You

### 1.1.3 Format Data for OpenAI API

In [7]:
# Format preprocessed data for OpenAI API training
def prepare_training_format(processed_data):
    
    formatted_data = []
    
    for item in processed_data:
        formatted_item = {
            'context': {
                'task_content': item['task_content'],
                'rubric': item['rubric'],
                'question': item['question']
            },
            'exemplar_answer': item['answer']
            }
        formatted_data.append(formatted_item)
    
    return formatted_data
        
# Format the preprocessed data
formatted_data = prepare_training_format(processed_data)
print("Example of formatted training data:")
print(json.dumps(formatted_data[0], indent=2))

Example of formatted training data:
{
  "context": {
    "task_content": "Designing your rocket    Building phase     The shape, weight, and size of a rocket, and it\u2019s design of nose cone and fins, all affect how aerodynamic or efficient it will be.     Being efficient allows a rocket to use less fuel while travelling long distances or overcoming gravity to take off and escape our atmosphere.     Rockets need to go straight up when launching and not veer to one side or roll when travelling through space. An effective  nose cone  and  fins  will help to stabilise your rocket.     So, to create a rocket that can be launched into space, you must design:     A rocket body    Fins &amp; a nose cone    A final rocket design with all elements     In your teams, you must:     Design your nose cones and fins/tail. Think about which materials to use and how to attach them before constructing their designs    Two team members could construct the nose cone and the other two could construct th

### 1.1.4 Split Data for Training and Validation

In [8]:
# Split data into training and validation sets
train_data, val_data = train_test_split(formatted_data, test_size=0.2, random_state=42)

print(f"Training set size: {len(train_data)}")
print(f"Validation set size: {len(val_data)}")

# Save processed and split data
with open('data/processed_train_data.json', 'w') as f:
    json.dump(train_data, f, indent=2)

with open('data/processed_val_data.json', 'w') as f:
    json.dump(val_data, f, indent=2)

print("Saved processed training and validation data")

Training set size: 93
Validation set size: 24
Saved processed training and validation data
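
The 93/24 split can be checked by hand. The sketch below assumes scikit-learn's rounding for a fractional `test_size` (test count rounded up, remainder assigned to training); `split_sizes` is a hypothetical helper used only for this check.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, test_size: float) -> Tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the training set
    receives whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


print(split_sizes(117, 0.2))  # (93, 24)
```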


# Section 2: OpenAI API Integration

In [12]:
# Dependency Imports for Section 2

import os
import tiktoken
import time
import logging
import asyncio
import numpy as np
from openai import OpenAI
from typing import Dict, List, Optional

# Set up API key: read it from the environment, never hard-code it in the notebook
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

## 2.1 Set Up OpenAI Handler

### 2.1.1 Base OpenAI Handler

In [13]:
class BaseOpenAIHandler:
    def __init__(self, api_key: str, model: str = "gpt-4o-mini"):
        self.client = OpenAI(api_key=api_key)
        self.model = model
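        # NOTE: cl100k_base is the GPT-4 / GPT-3.5 encoding; gpt-4o models use
        # o200k_base, so counts for the default model are approximate
        self.encoding = tiktoken.get_encoding("cl100k_base")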
        self.total_tokens_used = 0
        self.token_limit = 5_000_000
        
        # Setup logging
        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger(__name__)
        
        # Initialize token counter file
        os.makedirs('outputs', exist_ok=True)
        self._init_token_counter()
    
    def _init_token_counter(self):
        """Initialize or load token counter from file"""
        try:
            with open('outputs/token_usage.json', 'r') as f:
                usage_data = json.load(f)
                self.total_tokens_used = usage_data.get('total_tokens', 0)
        except FileNotFoundError:
            self._save_token_usage()
    
    def _save_token_usage(self):
        """Save token usage to file"""
        with open('outputs/token_usage.json', 'w') as f:
            json.dump({'total_tokens': self.total_tokens_used}, f)
    
    def count_tokens(self, text: str) -> int:
        """Count tokens in a text string"""
        return len(self.encoding.encode(text))
    
    def get_token_usage_stats(self) -> Dict:
        """Get current token usage statistics"""
        return {
            'total_tokens_used': self.total_tokens_used,
            'remaining_tokens': self.token_limit - self.total_tokens_used,
            'percentage_used': (self.total_tokens_used / self.token_limit) * 100
        }

### 2.1.2 OpenAI Prompt Handler

In [14]:
class PromptHandler(BaseOpenAIHandler):
    def __init__(self, api_key: str, model: str = "gpt-4o-mini"):
        super().__init__(api_key, model)
        self.prompt_template = None
        # Few-shot examples; filled in by TrainingHandler.select_best_examples()
        self.best_examples = []
        
    def format_prompt(self, context: Dict) -> str:
        """Basic prompt format"""
        prompt = f"""Given a teaching context, generate a high-quality exemplar answer.

Task Content:
{context['task_content']}

Question:
{context['question']}

Rubric Criteria:
- Total Score: {context['rubric']['total_score']}
- Assessment Criteria: {context['rubric']['criteria']}
- Scoring Items:
{chr(10).join([f"  {i+1}. {item}" for i, item in enumerate(context['rubric']['items'])])}

Generate an exemplar answer that:
1. Directly addresses the question
2. Meets the highest scoring criteria in the rubric
3. Demonstrates clear understanding and comprehensive coverage
4. Uses appropriate academic language
5. Is concise yet complete

Exemplar Answer:"""
        return prompt
    
    def _create_basic_template(self, context: Dict) -> str:
        """Basic Prompt Template"""
        return self.format_prompt(context)
    
    def _create_few_shot_template(self, context: Dict) -> str:
        """Template with few-shot examples"""
        examples = "\n\n".join([
            f"Example {i+1}:\n"
            f"Question: {ex['question']}\n"
            f"Rubric: {ex['rubric']}\n"
            f"Answer: {ex['answer']}\n"
            for i, ex in enumerate(self.best_examples)
        ])
        
        prompt = f"""Given a teaching context, generate a high-quality exemplar answer.

Previous successful examples:
{examples}

Now, generate an answer for:
Task Content:
{context['task_content']}

Question:
{context['question']}

Rubric Criteria:
- Total Score: {context['rubric']['total_score']}
- Assessment Criteria: {context['rubric']['criteria']}
- Scoring Items:
{chr(10).join([f"  {i+1}. {item}" for i, item in enumerate(context['rubric']['items'])])}

Generate an exemplar answer that:
1. Directly addresses the question
2. Meets the highest scoring criteria in the rubric
3. Demonstrates clear understanding and comprehensive coverage
4. Uses appropriate academic language
5. Is concise yet complete

Exemplar Answer:"""
        return prompt
    
    def _create_detailed_template(self, context: Dict) -> str:
        """Prompt template with detailed instructions"""
        rubric_items = context['rubric']['items']
        highest_score_criteria = rubric_items[0]
        
        prompt = f"""As an expert education content creator, generate a high-quality exemplar answer.

Task Context:
{context['task_content']}

Question to Answer:
{context['question']}

To achieve the highest score ({context['rubric']['total_score']} points), your answer must:
- Meet this specific criteria: {highest_score_criteria}
- Demonstrate deep understanding of: {context['rubric']['criteria']}
- Include clear evidence and explanation
- Use precise academic language
- Be well-structured and coherent

Scoring Guide:
{chr(10).join([f"Level {i+1}: {item}" for i, item in enumerate(rubric_items)])}

Additional Requirements:
1. Start with a clear main point
2. Support with specific evidence
3. Explain relationships between concepts
4. Use subject-specific vocabulary
5. Conclude with a summary if appropriate

Your Exemplar Answer:"""
        return prompt
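
All three templates build the numbered rubric list with `chr(10).join(...)`, a workaround for the rule (lifted in Python 3.12) that f-string expressions may not contain a backslash. A minimal sketch of one alternative, assuming the list is built outside the f-string; `format_rubric_items` is a hypothetical helper.

```python
from typing import List


def format_rubric_items(items: List[str]) -> str:
    """Hypothetical helper: number rubric items, one per line."""
    return "\n".join(f"  {i}. {item}" for i, item in enumerate(items, start=1))


items = ["Explains which fin design worked best", "Gives a reason"]
# The backslash now lives in the helper, so the f-string expression stays simple.
prompt = f"Scoring Items:\n{format_rubric_items(items)}"
print(prompt)
```

A shared helper would also remove the duplicated expression from the basic and few-shot templates.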

### 2.1.3 OpenAI Generation Handler

In [15]:
class GenerationHandler(PromptHandler):
          
    def generate_answer(self, context: Dict, 
                       max_retries: int = 3, 
                       temperature: float = 0.7) -> Optional[str]:
        
        """Generate answer with optimized prompt"""
        if self.prompt_template:
            prompt = self.prompt_template(context)
        else:
            prompt = self.format_prompt(context)
            
        prompt_tokens = self.count_tokens(prompt)
        
        if self.total_tokens_used + prompt_tokens > self.token_limit:
            self.logger.error("Token limit reached!")
            return None
        
        for attempt in range(max_retries):
            try:
                response = self.client.chat.completions.create(
                    model=self.model,
                    messages=[
                        {"role": "system", "content": "You are an expert education content creator specializing in generating exemplar answers for student assessment."},
                        {"role": "user", "content": prompt}
                    ],
                    temperature=temperature,
                    max_tokens=1000
                )
                
                self.total_tokens_used += response.usage.total_tokens
                self._save_token_usage()
                
                return response.choices[0].message.content
                
            except Exception as e:
                self.logger.error(f"Attempt {attempt + 1} failed: {str(e)}")
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)
             
    def estimate_prompt_tokens(self, context: Dict) -> int:
        """Estimate tokens for a prompt before sending"""
        # Use the same template that generate_answer() would send
        template = self.prompt_template or self.format_prompt
        return self.count_tokens(template(context))
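
The retry loop in `generate_answer` waits `2 ** attempt` seconds between attempts. The same pattern can be sketched in isolation; `call_with_backoff` and its `sleep` parameter are hypothetical names, with the sleep function injectable so the example runs without waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T], max_retries: int = 3,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Hypothetical helper: retry fn, doubling the wait after each failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise ValueError("max_retries must be at least 1")


# Usage: a function that fails twice and then succeeds.
attempts: List[int] = []
waits: List[float] = []


def flaky_call() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("temporary failure")
    return "ok"


print(call_with_backoff(flaky_call, sleep=waits.append), waits)
```

Here the recorded waits stand in for real sleeps; passing the default `time.sleep` gives the behavior of the loop above.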

### 2.1.4 OpenAI Training Handler

In [27]:
class TrainingHandler(GenerationHandler):
    
    def optimize_prompt_template(self, validation_data: List[Dict]):
        """Optimise prompt templates"""
        templates = [
            self._create_basic_template,
            self._create_few_shot_template,
            self._create_detailed_template
        ]
        
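        # Start below any achievable score so a template is always selected,
        # even if every candidate scores 0.0
        best_score = float('-inf')
        best_template = None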
        
        for template_func in templates:
            score = self._evaluate_template(template_func, validation_data)
            if score > best_score:
                best_score = score
                best_template = template_func
        
        self.prompt_template = best_template
        self.logger.info(f"Selected best template with score: {best_score}")
    
    def _evaluate_template(self, template_func, validation_data: List[Dict]) -> float:
        """Evaluate a prompt template on a small validation sample"""
        scores = []
        # generate_answer() builds its prompt from self.prompt_template, so
        # switch to the candidate template while it is being evaluated
        previous_template = self.prompt_template
        self.prompt_template = template_func
        try:
            for item in validation_data[:5]:
                try:
                    generated_answer = self.generate_answer(item['context'], temperature=0.3)
                    if generated_answer:
                        quality_score = self._evaluate_example_quality({
                            'answer': generated_answer,
                            'rubric': json.dumps(item['context']['rubric']),
                            'question': item['context']['question']
                        })
                        scores.append(quality_score)
                except Exception as e:
                    self.logger.error(f"Template evaluation error: {e}")
                    continue
        finally:
            # Restore whatever template was active before the evaluation
            self.prompt_template = previous_template
        
        return float(np.mean(scores)) if scores else 0.0
    
    def _evaluate_example_quality(self, example: Dict) -> float:
        """Example quality assessment"""
        score = 0.0
        answer = example['answer']
        rubric = example['rubric']
        rubric_data = json.loads(rubric)
        
        # 1. Content relevance (up to 0.4 points)
        rubric_items = rubric_data['items']
        criteria = rubric_data['criteria'].lower()
        answer_lower = answer.lower()
        
        # Count rubric items that appear verbatim in the answer
        keyword_matches = sum(1 for item in rubric_items if item.lower() in answer_lower)
        match_ratio = keyword_matches / len(rubric_items) if rubric_items else 0.0
        content_score = min(0.4, match_ratio * 0.5)
        score += content_score
        
        # 2. Completeness of answer (up to 0.3 points)
        # Expected length scales with the total score: 50 characters per point
        expected_length = int(rubric_data['total_score']) * 50
        actual_length = len(answer)
        if actual_length >= expected_length:
            score += 0.3
        else:
            score += 0.3 * (actual_length / expected_length)
        
        # 3. Language quality (up to 0.2 points)
        sentences = [s.strip() for s in answer.split('.') if s.strip()]
        
        # Sentence count scoring
        if len(sentences) >= 2:
            score += 0.1
        
        # Lexical diversity
        words = answer.split()
        unique_words = set(words)
        vocabulary_ratio = len(unique_words) / len(words) if words else 0
        if vocabulary_ratio >= 0.6:  # Lexical richness thresholds
            score += 0.1
        
        return score
    
    def select_best_examples(self, training_data: List[Dict], n_examples: int = 3):
        """Select the best few-shot example"""
        selected_examples = []
        for item in training_data:
            
            example = {
                'answer': item['exemplar_answer'],
                'rubric': json.dumps(item['context']['rubric']),
                'question': item['context']['question']
            }
            # Assessing quality
            quality_score = self._evaluate_example_quality(example)
            example['quality_score'] = quality_score
            selected_examples.append(example)
        
        # Sort and select the best examples
        selected_examples.sort(key=lambda x: x['quality_score'], reverse=True)
        self.best_examples = selected_examples[:n_examples]
        
        # Save the best examples
        with open('outputs/best_examples.json', 'w') as f:
            json.dump(self.best_examples, f, indent=2)
            
    def train(self, training_data: List[Dict], validation_data: List[Dict]):
        """Training process"""
        self.logger.info("Starting training process...")
        
        # Select the best examples
        self.logger.info("Selecting best examples...")
        self.select_best_examples(training_data)
        
        # Optimise the prompt template
        self.logger.info("Optimizing prompt template...")
        self.optimize_prompt_template(validation_data)
        
        # Save the training results
        self.logger.info("Saving training results...")
        self._save_training_results()
        
        self.logger.info("Training completed!")
    
    def _save_training_results(self):
        """Save training results to file"""
        results = {
            'best_examples': self.best_examples,
            # prompt_template may still be None if no template was selected
            'selected_template': getattr(self.prompt_template, '__name__', None),
            'token_usage': self.get_token_usage_stats()
        }
        
        with open('outputs/training_results.json', 'w') as f:
            json.dump(results, f, indent=2)
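
To make the scoring in `_evaluate_example_quality` concrete, the sketch below restates the heuristic as a standalone function and scores a toy answer. `heuristic_quality` is a hypothetical name, and the sample answer and rubric item are invented for illustration.

```python
from typing import List


def heuristic_quality(answer: str, rubric_items: List[str], total_score: int) -> float:
    """Standalone restatement of the quality heuristic, for illustration."""
    text = answer.lower()

    # Content: share of rubric items found verbatim, weighted 0.5, capped at 0.4
    matches = sum(1 for item in rubric_items if item.lower() in text)
    score = min(0.4, matches / len(rubric_items) * 0.5) if rubric_items else 0.0

    # Completeness: 50 characters expected per rubric point, up to 0.3
    expected = total_score * 50
    score += 0.3 if len(answer) >= expected else 0.3 * len(answer) / expected

    # Language: at least two sentences (+0.1), unique-word ratio >= 0.6 (+0.1)
    sentences = [s for s in answer.split('.') if s.strip()]
    if len(sentences) >= 2:
        score += 0.1
    words = answer.split()
    if words and len(set(words)) / len(words) >= 0.6:
        score += 0.1
    return score


answer = "The student uses evidence. It is clear."  # 39 characters
# 0.4 (content) + 0.3 * 39/50 (completeness) + 0.1 + 0.1 (language)
print(round(heuristic_quality(answer, ["uses evidence"], total_score=1), 3))  # 0.834
```

The maximum reachable score is 0.9. Because a rubric item counts only when its full text appears in the answer, the content term is likely to be 0 for most real rubric descriptors.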

### 2.1.5 Final OpenAI Handler

In [28]:
class OpenAIHandler(TrainingHandler):
    """
    Final OpenAIHandler class that inherits all functionality
    """
    pass

## 2.2 Train the OpenAI API Integration

In [32]:
# Initialize the handler
handler = OpenAIHandler(OPENAI_API_KEY)

handler.train(train_data, val_data)

INFO:__main__:Starting training process...
INFO:__main__:Selecting best examples...
INFO:__main__:Optimizing prompt template...
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/

## 2.3 Test the OpenAI API Integration

### 2.3.1 Test on the First Example

In [35]:
# Test the number of tokens with the first example from our processed data
test_context = formatted_data[0]['context']

# Estimate tokens
estimated_tokens = handler.estimate_prompt_tokens(test_context)
print(f"Estimated tokens for first example: {estimated_tokens}")

Estimated tokens for first example: 662


In [33]:
# Run the test on the first example
test_context = formatted_data[0]['context']
test_answer = handler.generate_answer(test_context)
print("Generated answer:", test_answer)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Generated answer: **Which fin design worked best? Why?**

In our rocket design experiment, the fin design that worked best was the triangular fin configuration. This design outperformed the others in terms of stability and straightness of flight during the launch trials. 

The triangular fins were strategically placed at an angle, which allowed for optimal airflow around the rocket. This aerodynamic shape minimized drag, enabling the rocket to ascend more efficiently. Throughout the trials, we observed that rockets equipped with triangular fins maintained a much straighter trajectory compared to those with rectangular or circular fins. Specifically, the triangular fins reduced the tendency of the rocket to veer to one side or roll, which are critical factors for achieving successful launches. 

In terms of variables, we controlled the size and weight of the fins, ensuring that all designs were made from the same material—lightweight cardboard. This allowed us to isolate the fin shape a

In [36]:
# Check token usage
usage_stats = handler.get_token_usage_stats()

print("Token usage statistics:")
print(json.dumps(usage_stats, indent=2))

Token usage statistics:
{
  "total_tokens_used": 145584,
  "remaining_tokens": 4854416,
  "percentage_used": 2.91168
}
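
The statistics above follow directly from the 5,000,000-token budget set in `BaseOpenAIHandler`. A minimal sketch reproducing them, plus the pre-flight check that `generate_answer` performs; `usage_stats` and `would_exceed` are hypothetical standalone functions.

```python
from typing import Dict

TOKEN_LIMIT = 5_000_000  # same budget as BaseOpenAIHandler.token_limit


def usage_stats(total_used: int, limit: int = TOKEN_LIMIT) -> Dict[str, float]:
    """Hypothetical helper mirroring get_token_usage_stats()."""
    return {
        "total_tokens_used": total_used,
        "remaining_tokens": limit - total_used,
        "percentage_used": total_used / limit * 100,
    }


def would_exceed(total_used: int, prompt_tokens: int, limit: int = TOKEN_LIMIT) -> bool:
    """True if sending a prompt of this size would pass the budget."""
    return total_used + prompt_tokens > limit


stats = usage_stats(145_584)
print(stats["remaining_tokens"], round(stats["percentage_used"], 5))
print(would_exceed(145_584, 662))
```

The check counts only prompt tokens, so a request can still overshoot the budget by up to the completion's `max_tokens`.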


### 2.3.2 Batch Processing Function

Create a function to process multiple examples.

In [37]:
def process_batch(handler, data_batch, batch_size=5):
    """Generate answers batch by batch, saving each batch to its own file."""
    results = []
    
    for i in range(0, len(data_batch), batch_size):
        batch = data_batch[i:i + batch_size]
        batch_results = []
        
        # Process each item in the batch
        for item in batch:
            try:
                result = handler.generate_answer(item['context'])
                batch_results.append(result)
                results.append(result)
                print(f"Processed item {len(results)}")
                
                # Save only this batch's results; the file is rewritten after
                # every item so partial progress survives an interruption
                with open(f'outputs/generated_answers_batch_{i//batch_size + 1}.json', 'w') as f:
                    json.dump(batch_results, f, indent=2)
                
                # Check token usage
                usage_stats = handler.get_token_usage_stats()
                print(f"Remaining tokens: {usage_stats['remaining_tokens']}")
                
                # Sleep for a second to avoid rate limits
                time.sleep(1)
                
            except Exception as e:
                print(f"Error processing item: {e}")
    
    return results
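
The slicing in `process_batch` can be isolated into a small generator, which makes the batch numbering used in the output file names easier to check. This is a sketch; `iter_batches` is a hypothetical helper.

```python
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Tuple[int, Sequence[T]]]:
    """Yield (1-based batch number, slice of items) pairs."""
    for start in range(0, len(items), batch_size):
        # start // batch_size + 1 matches the number used in the file name
        yield start // batch_size + 1, items[start:start + batch_size]


for number, batch in iter_batches(list(range(7)), batch_size=3):
    print(number, batch)
```

The last batch is simply shorter when the item count is not a multiple of `batch_size`.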

In [38]:
# Process a small test batch
test_batch = formatted_data[:4]
test_results = process_batch(handler, test_batch, batch_size=1)

print("\nTest batch results:")
for i, result in enumerate(test_results):
    print(f"\nExample {i+1}:")
    print(result)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Processed item 1
Remaining tokens: 4853376


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Processed item 2
Remaining tokens: 4851765


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Processed item 3
Remaining tokens: 4850208


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Processed item 4
Remaining tokens: 4848826

Test batch results:

Example 1:
**Exemplar Answer: Which fin design worked best? Why?**

The fin design that proved to be the most effective in stabilizing our rocket during flight was the triangular fin configuration. This design not only enhanced the rocket's stability but also ensured that it maintained a straight trajectory during launch, which is crucial for achieving optimal height and aerodynamics.

**Supporting Evidence:**
During our trials, the triangular fins demonstrated superior performance compared to other designs, such as rectangular and square fins. Specifically, when we conducted our first trial with the triangular fins, the rocket ascended vertically with minimal lateral movement, reaching an altitude of approximately 15 meters. In contrast, the rectangular fins led to significant rolling and veering off course, resulting in a maximum altitude of only 10 meters.

**Explanation of Relationships:**
The effectiveness of the tri

In [39]:
# Check token usage
usage_stats = handler.get_token_usage_stats()

print("Token usage statistics:")
print(json.dumps(usage_stats, indent=2))

Token usage statistics:
{
  "total_tokens_used": 151174,
  "remaining_tokens": 4848826,
  "percentage_used": 3.0234799999999997
}


# Section 3: Exemplar Answer Quality Evaluation

In [None]:
# Dependency Imports for Section 3

from sklearn.metrics.pairwise import cosine_similarity
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
import spacy
import textstat
from typing import Dict, List, Tuple