In [1]:
# upgrade boto3 
# %pip install --upgrade pip --quiet
# %pip install boto3 --upgrade --quiet

In [2]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

# Model Distillation for Question Answering with Cited Text

This notebook is part of a series demonstrating advanced model distillation techniques for creating specialized, citation-aware question-answering models. The goal is to distill the knowledge from a large language model (Amazon Nova Premier) into a smaller, more efficient model while maintaining high-quality citation capabilities.

## Learning Objectives
- Prepare training data for citation model distillation
- Design structured XML output formats for consistent answer generation
- Implement source citation tracking in model responses
- Create evaluation datasets for measuring citation accuracy

## Dataset: SQuAD v2.0
We use the [Stanford Question Answering Dataset (SQuAD) v2.0](https://rajpurkar.github.io/SQuAD-explorer/) as our base dataset. SQuAD v2.0 is particularly suitable for citation-aware model training because:

1. Contains explicit answer spans within source text
2. Includes "impossible" questions to test model reliability
3. Provides diverse question types and domains
4. Enables source attribution tracking

The dataset is published through [Hugging Face Datasets](https://huggingface.co/docs/datasets/) as Parquet files. The next cell reads those files directly with `pd.read_parquet` over `hf://` paths, one file for the training split and one for the validation split.
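
To make the distinction between answerable and "impossible" questions concrete, the sketch below builds two hand-written rows in the shape used throughout this notebook: an `answers` dict holding `text` and `answer_start`. The rows and the helpers `is_impossible` and `spans_match` are illustrative assumptions, and plain lists stand in for the NumPy arrays that pandas returns from Parquet.

```python
# Two hand-written rows mimicking the SQuAD v2.0 layout (illustrative data only).
rows = [
    {
        "question": "What is the capital of France?",
        "context": "Paris is the capital of France.",
        "answers": {"text": ["Paris"], "answer_start": [0]},
    },
    {
        "question": "What is the capital of Spain?",
        "context": "Paris is the capital of France.",
        "answers": {"text": [], "answer_start": []},  # unanswerable from this context
    },
]


def is_impossible(answers):
    """Hypothetical helper: a question is unanswerable when no answer span is given."""
    return len(answers["text"]) == 0


def spans_match(row):
    """Check that every answer_start offset really points at its answer text."""
    context = row["context"]
    pairs = zip(row["answers"]["text"], row["answers"]["answer_start"])
    return all(context[start:start + len(text)] == text for text, start in pairs)


for row in rows:
    print(row["question"], "| impossible:", is_impossible(row["answers"]),
          "| spans ok:", spans_match(row))
```

The same emptiness test is what the filtering cells later in this notebook apply to `df_train` and `df_eval`.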

In [3]:
import json
import sys
import os
import re
import pandas as pd
import numpy as np

current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
sys.path.append(parent_dir)
from utils import read_jsonl_to_dataframe

splits = {'train': 'squad_v2/train-00000-of-00001.parquet', 'validation': 'squad_v2/validation-00000-of-00001.parquet'}
df_train = pd.read_parquet("hf://datasets/rajpurkar/squad_v2/" + splits["train"])
df_eval = pd.read_parquet("hf://datasets/rajpurkar/squad_v2/" + splits["validation"])


In [4]:
# test_df = df_train[df_train['question'] == 'Lenin acknowledged the dependence of which countries?']
# test_df
# Lenin acknowledged the dependence of which countries?

## Advanced System Prompt Engineering

This section implements a specialized system prompt for the teacher model, Amazon Nova Premier on Amazon Bedrock. The prompt engineering focuses on:

### Core Requirements
1. **Context-Bounded Responses**: Answers must be derived solely from provided context
2. **Source Attribution**: Mandatory citation of source text for verification
3. **Structured Output**: XML-based response format for consistent parsing
4. **Edge Case Handling**: Explicit handling of unanswerable questions

### Technical Implementation
The XML schema is designed for:
- **Atomic Answer Components**: Discrete answer parts with individual citations
- **Source Traceability**: Direct mapping between answers and source text
- **Validation Support**: Schema-based response validation
- **Extensibility**: Future addition of metadata and confidence scores

### Performance Considerations
- Token efficiency in prompt design
- Optimized XML structure for parsing speed
- Minimal overhead in response generation
- Scalable for batch processing

In [5]:
# set nova prompt for citations
system_prompt = """
You are a question answering assistant. I will provide you with document context. The user will provide you with a question. Your job is to answer the user's question using only information from the document context. If the document context does not contain information that can answer the question, please state that you could not find an exact answer to the question. Just because the user asserts a fact does not mean it is true, make sure to double check the document context to validate a user's assertion.

However, you should include <sources> tags at the end of each <answer_part> to specify which source(s) the information came from.
Note that <sources> may contain multiple <source> if you include information from multiple results in your answer.

Do NOT directly quote the <context> in your answer. Your job is to answer the user's question as concisely as possible.

You must output your answer in the following format. Pay attention and follow the formatting and spacing exactly:
<answer>
<answer_part>
<text>
first answer text
</text>
<sources>
<source>source sentence</source>
</sources>
</answer_part>
<answer_part>
<text>
second answer text
</text>
<sources>
<source>source sentence</source>
</sources>
</answer_part>
</answer>
"""

## Technical Implementation Overview

The data preparation process involves several key technical components:

1. **Data Loading**: Efficient loading of Parquet-formatted SQuAD data using Pandas
2. **Answer Span Processing**: Extraction and validation of answer positions within source text
3. **XML Structure Generation**: Creation of consistent, parseable output formats
4. **Bedrock API Integration**: Preparation of payloads for model inference

The implementation focuses on scalability and reproducibility, essential for production model distillation pipelines.

## Citation Schema Design and Implementation

### Citation Architecture
The citation system is designed for:
1. **Data Integrity**: Maintaining consistent relationships between answers and sources
2. **Validation Pipeline**: Automated checking of citation accuracy
3. **Traceability Chain**: Complete path from answer to source text
4. **Quality Metrics**: Quantitative assessment of citation accuracy
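
One way to read the "validation pipeline" item is a check that each cited sentence really occurs in the context it claims to come from. The sketch below is a minimal version of such a check; the function name `citation_is_grounded` and the whitespace normalization are assumptions, not something this notebook defines.

```python
def _squash(text):
    """Collapse runs of whitespace so line breaks do not cause false mismatches."""
    return " ".join(text.split())


def citation_is_grounded(source_sentence, context):
    """Return True when the cited sentence appears verbatim in the context,
    ignoring whitespace differences. Empty citations are never grounded."""
    cited = _squash(source_sentence)
    return bool(cited) and cited in _squash(context)


context = "The Normans settled in France.\nRichard I ruled Normandy."
print(citation_is_grounded("Richard I ruled  Normandy.", context))
print(citation_is_grounded("Richard II ruled Normandy.", context))
```

A placeholder such as "No relevant source found" would fail this check by design, which makes it easy to count examples where sentence lookup did not succeed.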

### Technical Constraints and Extensions
The current implementation uses sentence-level citations because SQuAD v2.0 supplies only a character offset (`answer_start`) into a single context passage. Production deployments should consider additional metadata levels:

**Document-Level Metadata**
- Page numbers for multi-page documents
- Section identifiers for document structure
- Document timestamps for versioning

**Segment-Level Metadata**
- Paragraph identifiers for context
- Line numbers for precise location
- Character offsets for exact spans

**Semantic-Level Metadata**
- Topic tags for content classification
- Confidence scores for answer reliability
- Relevance scores for source matching
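
As a rough illustration of how these metadata levels could travel with a citation, the sketch below models one citation as a dataclass with optional fields. Every field name here is hypothetical; the notebook itself stores only the source sentence.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Citation:
    """Hypothetical citation record; only `sentence` exists in the current schema."""
    sentence: str
    page: Optional[int] = None            # document-level
    section_id: Optional[str] = None      # document-level
    paragraph_id: Optional[str] = None    # segment-level
    char_start: Optional[int] = None      # segment-level, inclusive offset
    char_end: Optional[int] = None        # segment-level, exclusive offset
    confidence: Optional[float] = None    # semantic-level


def citation_metadata(citation):
    """Return only the metadata fields that were actually filled in."""
    fields = asdict(citation)
    fields.pop("sentence")
    return {name: value for name, value in fields.items() if value is not None}


example = Citation(sentence="Richard I ruled Normandy.", char_start=31, char_end=56)
print(citation_metadata(example))
```

Keeping the extra fields optional means records produced from SQuAD, which has no page or section information, remain valid.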

### XML Schema Implementation
```xml
<question>Who ruled the duchy of Normandy?</question>
<answer>
<answer_part>
<text>Richard I</text>
<sources>
<source>The Duchy of Normandy, which they formed by treaty with the French crown, was a great fief of medieval France, and under Richard I of Normandy was forged into a cohesive and formidable principality in feudal tenure.</source>
</sources>
</answer_part>
</answer>
```

### Schema Benefits
- **Atomic Structure**: Each answer component is self-contained
- **Validation Support**: XML schema enables automated checks
- **Processing Pipeline**: Consistent format for batch operations
- **Metric Generation**: Structured data for performance analysis
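
To show what the consistent format buys downstream, here is a minimal parsing sketch using the standard library's `xml.etree.ElementTree`. The function name `parse_answer_xml` and the returned dict layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def parse_answer_xml(xml_text):
    """Parse an <answer> document into a list of {"text", "sources"} dicts.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed XML,
    for example when it contains an unescaped '&' or '<'.
    """
    root = ET.fromstring(xml_text.strip())
    if root.tag != "answer":
        raise ValueError(f"expected <answer> root, got <{root.tag}>")
    parts = []
    for part in root.findall("answer_part"):
        text = (part.findtext("text") or "").strip()
        sources = [(s.text or "").strip() for s in part.findall("sources/source")]
        parts.append({"text": text, "sources": sources})
    return parts


sample = """<answer>
<answer_part>
<text>
Richard I
</text>
<sources>
<source>Under Richard I of Normandy the duchy became a cohesive principality.</source>
</sources>
</answer_part>
</answer>"""

print(parse_answer_xml(sample))
```

An answer part without a `<sources>` element, as in the no-answer case, simply yields an empty `sources` list.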

## Data Processing Pipeline Implementation

### Core Processing Functions
The following code implements a robust data processing pipeline:

1. **Answer Structure Processing**
   - Parse complex answer structures
   - Validate data integrity
   - Handle nested JSON formats

2. **XML Generation Engine**
   - Construct valid XML outputs
   - Maintain proper escaping
   - Ensure schema compliance

3. **Edge Case Management**
   - No-answer scenarios
   - Multiple valid answers
   - Conflicting source texts

4. **Training Data Generation**
   - Consistent example formatting
   - Batch processing support
   - Quality validation checks

These components form the core data preparation infrastructure for the distillation process.
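
The "maintain proper escaping" item can be sketched with the standard library's `xml.sax.saxutils.escape`. Note that `create_xml_answer` in the next cell inserts answer and source text as-is, so this illustrates the stated goal rather than the notebook's code; the helper name `source_element` is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def source_element(sentence):
    """Wrap a sentence in <source> tags, escaping &, < and > so the result parses."""
    return f"<source>{escape(sentence)}</source>"


raw = "AT&T reported revenue < $40 billion."
element = source_element(raw)
print(element)

# Round trip: parsing the escaped element recovers the original sentence.
recovered = ET.fromstring(element).text
print(recovered == raw)
```

Whether to escape depends on the consumer: a strict XML parser needs it, while a model trained on unescaped text will reproduce unescaped text.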

In [6]:
def parse_answer_structure(answers_dict):
    """
    Parse different formats of answer dictionaries and extract text and start positions.
    Returns lists of texts and start positions.
    """
    # Case 1: NumPy arrays with direct keys
    if 'text' in answers_dict and isinstance(answers_dict['text'], np.ndarray):
        texts = answers_dict['text'].tolist()
        starts = answers_dict['answer_start'].tolist()
        
    # Case 2: Lists or single values with direct keys
    elif 'text' in answers_dict:
        texts = answers_dict['text'] if isinstance(answers_dict['text'], list) else [answers_dict['text']]
        starts = answers_dict['answer_start'] if isinstance(answers_dict['answer_start'], list) else [answers_dict['answer_start']]
        
    # Anything else is unsupported (JSON strings are decoded by the caller before this point)
    else:
        raise ValueError(f"Unknown answer format: {answers_dict}")
        
    return texts, starts

def create_xml_answer(row, no_answer_text='I could not find an exact answer to the question.'):
    try:
        # Decode answers stored as a JSON string (json is imported at the top of the notebook)
        answers_dict = row['answers']
        if isinstance(answers_dict, str):
            answers_dict = json.loads(answers_dict)
            
        # Parse answer structure using our helper function
        texts, starts = parse_answer_structure(answers_dict)
        context = row['context']
        
        # Split context into sentences more accurately
        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s', context)
        
        # Build XML structure
        xml_parts = ['<answer>']
        
        if len(texts) > 0:
            for i, (text, start) in enumerate(zip(texts, starts)):
                xml_parts.append('<answer_part>')
                xml_parts.append('<text>')
                xml_parts.append(str(text))
                xml_parts.append('</text>')
                xml_parts.append('<sources>')
                
                # Find the sentence containing the answer based on the start position
                char_count = 0
                source_sentence = "No relevant source found"
                for sentence in sentences:
                    sentence_len = len(sentence) + 1  # +1 for the space after sentence
                    if char_count <= int(start) < (char_count + sentence_len):
                        source_sentence = sentence.strip()
                        break
                    char_count += sentence_len
                
                xml_parts.append(f'<source>{source_sentence}</source>')
                xml_parts.append('</sources>')
                xml_parts.append('</answer_part>')
        
            xml_parts.append('</answer>')
        else:  # no answer available: emit the fallback text, keeping one tag per line
            xml_parts.append(f"<answer_part>\n<text>\n{no_answer_text}\n</text>\n</answer_part>\n</answer>")
        return '\n'.join(xml_parts)
    except Exception as e:
        return f"<answer>\n<error>Error generating XML: {str(e)}</error>\n</answer>"
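

The sentence lookup above advances a running character count by `len(sentence) + 1`, which assumes exactly one whitespace character between sentences. An alternative sketch, shown below, records each sentence's real start and end offsets while splitting. It uses a simpler boundary pattern than the cell above (no abbreviation guards), and the names `sentence_spans` and `find_source_sentence` are hypothetical.

```python
import re

# Simplified boundary: whitespace that follows ., ? or ! (an assumption).
SENTENCE_BOUNDARY = re.compile(r"(?<=[.?!])\s+")


def sentence_spans(context):
    """Yield (start, end, sentence) with offsets into the original context."""
    start = 0
    for match in SENTENCE_BOUNDARY.finditer(context):
        yield start, match.start(), context[start:match.start()]
        start = match.end()
    if start < len(context):
        yield start, len(context), context[start:]


def find_source_sentence(context, answer_start):
    """Return the sentence whose span contains answer_start, or None."""
    for start, end, sentence in sentence_spans(context):
        if start <= answer_start < end:
            return sentence
    return None


context = "The Normans settled in France.  They later conquered England. Richard I ruled Normandy."
print(find_source_sentence(context, context.index("Richard")))
```

Because offsets come from the regex matches, double spaces or newlines between sentences do not shift later lookups.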



In [7]:
def create_bedrock_payload(row, model_type="conversation", system_prompt=None, include_answer=False, additional_params=None):
    """
    Create a payload dictionary for Amazon Bedrock API requests.
    
    Args:
        row: A row from the pandas DataFrame containing context, question, and optionally answers
        model_type: The type of model payload to create ("conversation" or "invoke")
        system_prompt: The system message to include (for conversation-based models)
        include_answer: Whether to include the answer in the conversation (for evaluation)
        additional_params: Dictionary of additional parameters to include in the payload
    
    Returns:
        dict: A formatted payload dictionary ready for Bedrock API
    """
    try:
        # Extract needed information
        context = row['context']
        question = row['question']
        
        # Create the user prompt with context and question
        user_prompt = f"""<context>{context}</context> <question>{question}</question>"""
        
        # Get the answer if needed
        assistant_response = create_xml_answer(row) if include_answer else None
        
        # Create appropriate payload based on model_type
        if model_type == "conversation":
            # For conversation-based models (Claude, etc.)
            payload = {
                "schemaVersion": "bedrock-conversation-2024",
                "system": [{"text": system_prompt}] if system_prompt else [],
                "messages": [
                    {
                        "role": "user",
                        "content": [{"text": user_prompt}]
                    }
                ]
            }
            
            # Add assistant response if needed (for evaluation)
            if include_answer and assistant_response:
                payload["messages"].append({
                    "role": "assistant",
                    "content": [{"text": assistant_response}]
                })
                
        elif model_type == "invoke":
            # For basic invoke request (non-conversation models like Titan, etc.)
            payload = {
                "system": [{"text": system_prompt}] if system_prompt else [],
                "messages": [
                    {
                        "role": "user",
                        "content": [{"text": user_prompt}]
                    }
                ],
                "inferenceConfig":{ 
                    # "maxTokens": int,  # greater than 0, at most 5k (default: dynamic)
                    "temperature": 0.1,  # greater than 0 and less than 1.0 (default: 0.7)
                    "topP": 0.9,  # greater than 0, at most 1.0 (default: 0.9)
                    "topK": 50,  # 0 or greater (default: 50)
                    "stopSequences": ['</answer>']
                }
            }
            if include_answer and assistant_response:
                payload["messages"].append({
                    "role": "assistant",
                    "content": [{"text": assistant_response}]
                })
            
            # Add optional parameters specific to invoke requests
            if additional_params:
                payload.update(additional_params)
                
        else:
            raise ValueError(f"Unsupported model_type: {model_type}")
            
        # Add any additional parameters passed
        if additional_params and model_type == "conversation":
            # For conversation models, additional params might need to be added at the root level
            for key, value in additional_params.items():
                if key not in payload:
                    payload[key] = value
                    
        return payload
        
    except Exception as e:
        print(f"Error creating payload for row: {str(e)}")
        return None

In [8]:
def create_batch_inf_record(row, system_prompt, include_answer=False): 
    conversation = create_bedrock_payload(
                                row=row, 
                                system_prompt=system_prompt, 
                                model_type="invoke", 
                                additional_params={},
                                include_answer=include_answer)
    return {
        "recordId": row['id'],
        "modelInput": conversation
    }

In [9]:
# Split the training set into unanswerable and answerable questions
# Filter for empty answers (the "impossible" questions)
empty_answers_df = df_train[df_train['answers'].apply(lambda x: 
    len(x['text']) == 0 and len(x['answer_start']) == 0)]

# Filter for rows with actual answers
with_answers_df = df_train[df_train['answers'].apply(lambda x: len(x['text']) > 0)]

# take 7500 of each dataframe and combine to use in distillation. 
df_train_revised = pd.concat([
    empty_answers_df.sample(n=7500, random_state=42), 
    with_answers_df.sample(n=7500, random_state=42)], ignore_index=True) # max 15k for bedrock distillation

## Generate Model Distillation Dataset

### Training Data Generation Process
1. **Data Transformation**: Convert SQuAD format to Bedrock Batch Inference JSONL structure
   - Preserve answer spans and context relationships
   - Maintain source attribution information

2. **XML Formatting**:
   - Implement consistent answer structure
   - Include source citations with proper context
   - Handle edge cases (no answers, multiple answers)

3. **Conversation Format**:
   - Generate teacher model examples
   - Structure interactions for optimal learning
   - Balance answerable vs. impossible questions

4. **Output Generation**:
   - Save in JSONL format for efficient processing
   - Maintain data integrity and relationships
   - Enable streaming for large-scale training

The resulting dataset forms the foundation for training a specialized model that combines efficient inference with reliable citation capabilities.
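
Before submitting a file, it can help to re-read the JSONL and check each line's shape. The sketch below checks the structure that `create_bedrock_payload` produces for `model_type="conversation"`. The rules in `validate_record` are inferred from that function, not taken from an official validator, and the sample records are made up.

```python
import io
import json

EXPECTED_SCHEMA = "bedrock-conversation-2024"  # value set by create_bedrock_payload


def validate_record(record):
    """Return a list of problems found in one conversation record (empty = OK)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    if record.get("schemaVersion") != EXPECTED_SCHEMA:
        problems.append("unexpected schemaVersion")
    messages = record.get("messages", [])
    roles = [m.get("role") for m in messages]
    if roles not in (["user"], ["user", "assistant"]):
        problems.append(f"unexpected role sequence: {roles}")
    for message in messages:
        texts = [part.get("text", "") for part in message.get("content", [])]
        if not texts or not all(t.strip() for t in texts):
            problems.append(f"empty content in {message.get('role')} message")
    return problems


good = {"schemaVersion": EXPECTED_SCHEMA, "system": [{"text": "You are..."}],
        "messages": [{"role": "user", "content": [{"text": "<context>c</context> <question>q</question>"}]}]}
bad = {"schemaVersion": EXPECTED_SCHEMA, "system": [],
       "messages": [{"role": "assistant", "content": [{"text": ""}]}]}

# A failed payload (create_bedrock_payload returns None) would be expected to appear as null.
jsonl = io.StringIO(json.dumps(good) + "\n" + json.dumps(bad) + "\nnull\n")
report = {n: validate_record(json.loads(line)) for n, line in enumerate(jsonl, start=1)}
print(report)
```

To run this against a real file, replace the `io.StringIO` object with `open('distillation_data.jsonl')`.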

## Next Steps

### Continue with Model Distillation Pipeline
1. Proceed to [`02_distill.ipynb`](02_distill.ipynb) to:
   - Configure the teacher model (Amazon Nova Premier on Amazon Bedrock)
   - Set up the student model architecture
   - Implement the distillation training loop

### Key Aspects to Monitor
- Citation accuracy metrics
- Answer quality vs. original model
- Inference latency improvements
- Model size reduction ratio

### Preparation Checklist
- [x] Training data generated and validated
- [x] XML schema implemented and tested
- [x] Evaluation dataset prepared
- [ ] Review distillation hyperparameters in next notebook
- [ ] Configure AWS resources for training

It is a best practice to include a ground truth answer for about 10% of the training set. We take a random 10% sample and call `create_bedrock_payload` with `include_answer=True`. For the remaining 90%, we leave out the ground truth answer.

In [19]:
row_count = len(df_train_revised)
ground_truth_included = 0.1 * row_count

# here we'll take 10% of our training data set and add the answers
training_data_with_gt_df = df_train_revised.sample(n=int(ground_truth_included), random_state=17)

# next we'll drop our ground truth examples so as not to mix with our labels excluding answers.
training_data_without_gt_df = df_train_revised.drop(training_data_with_gt_df.index)

# next we'll build our training data with ground truth
training_data_with_gt_df['conversation'] = training_data_with_gt_df.apply(lambda row: create_bedrock_payload(row=row, model_type="conversation", system_prompt=system_prompt, include_answer=True), axis=1)


In [20]:
# then we'll build our training data without ground truth
training_data_without_gt_df['conversation'] = training_data_without_gt_df.apply(lambda row: create_bedrock_payload(row=row, model_type="conversation", system_prompt=system_prompt, include_answer=False), axis=1)



In [21]:
# Now we'll concatenate the dataframes
final_training_dataset = pd.concat([training_data_with_gt_df, training_data_without_gt_df], axis=0, ignore_index=True).sort_index()

In [22]:
# now we'll output to .jsonl to use in distillation job
final_training_dataset['conversation'].to_json('distillation_data.jsonl', orient='records', lines=True)

## Evaluation Dataset Creation

### Batch Inference Dataset Design
The evaluation dataset is carefully constructed to assess model performance across key dimensions:

1. **Data Distribution**:
   - Balanced mix of answerable (250) and impossible (250) questions
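   - Each group is drawn by simple random sampling with a fixed seed (`random_state=15`), so the selection is reproducible
   - No explicit stratification by domain or question complexity is applied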

2. **Evaluation Metrics Focus**:
   - Answer accuracy and relevance
   - Citation precision and recall
   - Source attribution reliability
   - Handling of impossible questions

3. **Comparative Analysis Support**:
   - Teacher model (Amazon Nova Premier) baseline
   - Distilled model performance
   - Inference efficiency metrics

The evaluation dataset enables comprehensive performance assessment in notebook `04_evaluate.ipynb`.
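
As one possible reading of "citation precision and recall", the sketch below compares the set of sentences a model cites against the set of reference sentences. The evaluation notebook may define these metrics differently; the function name and the normalization (lowercasing and whitespace collapsing) are assumptions.

```python
def citation_precision_recall(predicted_sources, reference_sources):
    """Set-based precision and recall over normalized source sentences.

    Returns (1.0, 1.0) when both sides are empty, which treats a correctly
    uncited "no answer" response as a perfect match.
    """
    def norm(sentence):
        return " ".join(sentence.split()).lower()

    predicted = {norm(s) for s in predicted_sources}
    reference = {norm(s) for s in reference_sources}
    if not predicted and not reference:
        return 1.0, 1.0
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


print(citation_precision_recall(
    ["Richard I ruled Normandy.", "The Normans settled in France."],
    ["Richard I ruled Normandy."],
))
```

Here one of the two cited sentences is correct and the single reference sentence is covered, so precision is 0.5 and recall is 1.0.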

## Next Steps
1. Proceed to `02_distill.ipynb` to train the distilled model using the prepared datasets
2. The distillation process will focus on preserving citation capabilities while reducing model size
3. Key metrics to monitor during training:
   - Answer accuracy
   - Citation precision
   - Inference latency

In [11]:
eval_empty_answers_df = df_eval[df_eval['answers'].apply(lambda x: 
    len(x['text']) == 0 and len(x['answer_start']) == 0)]

# Filter for rows with actual answers
eval_with_answers_df = df_eval[df_eval['answers'].apply(lambda x: len(x['text']) > 0)]

batch_inf_df = pd.concat([
    eval_empty_answers_df.sample(n=250, random_state=15), 
    eval_with_answers_df.sample(n=250, random_state=15)], ignore_index=True)


batch_inf_df.apply(lambda row: create_batch_inf_record(row, system_prompt), axis=1).to_json('batch_inf_data.jsonl', orient='records', lines=True)

## Labeled Dataset for BYOI Bedrock Evaluation
This section creates a labeled dataset for a bring-your-own-inference (BYOI) evaluation in [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/custom-models.html). Key features:

1. Includes ground truth answers for accuracy assessment
2. Maintains XML formatting for consistency
3. Enables automated evaluation metrics
4. Supports both quantitative and qualitative analysis

The labeled dataset will be used to:
- Calculate model performance metrics
- Compare against baseline models
- Validate citation accuracy
- Assess answer relevance and completeness
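
For the answer-text side of these comparisons, a common approach with SQuAD-style data is normalized exact match: lowercase, strip punctuation and English articles, then compare. The sketch below follows that convention as an assumption about how the labeled answers might be scored; it is not taken from the evaluation notebook.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and the articles a/an/the, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truths):
    """True when the prediction equals any ground truth after normalization."""
    target = normalize_answer(prediction)
    return any(target == normalize_answer(truth) for truth in ground_truths)


print(exact_match("The Richard I.", ["Richard I"]))
print(exact_match("William", ["Richard I"]))
```

Exact match is strict by design; because the system prompt asks the model not to quote the context directly, a softer overlap score may be a useful companion metric.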

In [12]:
batch_inf_df.apply(lambda row: create_batch_inf_record(row, system_prompt=system_prompt, include_answer=True), axis=1).to_json('labeled_data.jsonl', orient='records', lines=True)

In [13]:
# import boto3
# import json
# model_id = 'us.amazon.nova-premier-v1:0'
# bedrock_runtime = boto3.client(
#     service_name="bedrock-runtime",
#     region_name="us-east-1"
# )

# # Make the invoke call to Bedrock
# response = bedrock_runtime.invoke_model(
#     modelId=model_id,
#     body=json.dumps(sample_payload)
# )

# # Parse and return the response
# response_body = json.loads(response.get('body').read())
# print(response_body['output']['message']['content'][0])

In [14]:
# response_body['output']['message']['content'][0]['text']

In [15]:
# check for dupes
# df_train_revised[df_train_revised.duplicated(subset=['conversation'], keep=False)]