## LLM migration driven by data-aware prompt optimization - Summarization task

Example notebook 
- model migration from Claude 3 Haiku to Nova-Lite for a summarization task
- optimization by Bedrock APO and data-aware optimization after migration
- evaluation metric: LJ-score

### Install Python packages required for benchmarking

If you have not installed the required Python libraries, uncomment the following command and run it.

In [None]:
#!pip install -r ../peccyben/requirements.txt

### Import libraries

In [None]:
import sys, os, time
import pandas as pd
from sklearn.model_selection import train_test_split
import json, copy
import warnings 

sys.path.insert(0, '../')

from peccymig.migration import Prompt_Opt_Template_Gen, lj_metric_summ
from peccymig.migration import model_initialize, model_inference, data_aware_optimization, update_prompt_catalog
from peccyben.summarizationtask import Summ_Ben 
from peccyben.promptcatalog import Prompt_Template_Gen

### Data preparation

Split training and testing data for prompt optimizer

In [None]:
# Load dataset
INPUT_FILE = 'xsum_100.csv'
df_input = pd.read_csv('./data/'+INPUT_FILE, encoding='utf8')

In [None]:
df_input_train, df_input_test = train_test_split(df_input, test_size=0.9, random_state=42)

df_input_train = df_input_train.reset_index()
df_input_test = df_input_test.reset_index()

Create a DSPy dataset from a DataFrame for the summarization task

In [None]:
import dspy

def create_summ_dataset(prompt_template,df):
    """
    Create DSPy dataset from DataFrame for summarization task 
    Args: 
        prompt_template: the prompt template for the summarization task defined by user
        df: input dataset in DataFrame format
    Return: DSPy dataset
    """        
    
    dspy_dataset = []

    for i in range(len(df)):
        example = dspy.Example(
            question = prompt_template.format(document=df['Section_text'][i]),
            answer=df['Section_text'][i]
        ).with_inputs("question")

        dspy_dataset.append(example)
        #print(len(dspy_dataset))
        
    return dspy_dataset
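
To make the shape of each training example concrete, the sketch below mirrors `create_summ_dataset` with plain dictionaries in place of `dspy.Example`, so it runs without DSPy. The template text and the two rows are hypothetical; only the `Section_text` column name and the `{document}` placeholder come from the function above.

```python
# Hypothetical template and rows, for illustration only.
template = "Summarize the document in <doc></doc> tags.\n<doc>\n{document}\n</doc>"
rows = [
    {"Section_text": "First article text."},
    {"Section_text": "Second article text."},
]


def build_examples(prompt_template, rows):
    """Return one {'question', 'answer'} dict per row, mirroring create_summ_dataset."""
    return [
        {
            # The document is substituted into the {document} placeholder.
            "question": prompt_template.format(document=row["Section_text"]),
            # As in the notebook, the answer field carries the source text.
            "answer": row["Section_text"],
        }
        for row in rows
    ]


examples = build_examples(template, rows)
print(examples[0]["question"])
```

In the notebook itself, the positional lookup `df['Section_text'][i]` relies on the index running from 0 to n-1, which is why both splits are passed through `reset_index()` first.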

### Evaluation before migration

Establish baseline evaluation metrics for the source model before migration

* **BENCH_KEY**: a unique keyname for your benchmarking in this round 
* **S3_BUCKET**: the S3 bucket you created for the benchmarking
* **TASK_FOLDER**: the task folder you created under the S3 bucket   
* **INPUT_FILE**: the file name of the dataset you prepared for benchmarking    
* **METRICS_LIST**: the metrics we provide for the summarization task
* **BEDROCK_REGION**: the AWS region that the model benchmarking runs on Bedrock
* **COST_FILE**: the price file used for calculating model inference cost 

In [None]:
BENCH_KEY = 'PeccyMig_202503_test'
S3_BUCKET = 'genai-sdo-llm-ben-20240310'
TASK_FOLDER = 'ben-summ'
INPUT_FILE = 'xsum_20.csv'
METRICS_LIST = ['Inference_Time','Input_Token','Output_Token','Throughput','RougeL-sum','Semantic_Similarity','BERT-F1','LJ_Score','Toxicity','Cost','Cache_Input_Token','Cache_Output_Token']
BEDROCK_REGION = 'us-east-1'
COST_FILE = 'bedrock_od_public.csv'

Results_summ = pd.DataFrame()
Results_summ = Results_summ.assign(metric_name=METRICS_LIST) 
Results_summ = Results_summ.set_index('metric_name')

#### Task specific setting

* Configure your **prompt** in the prompt catalog (prompt_catalog.json), and set prompt_catalog_id to the ID of that prompt
* Set the **LLM hyperparameters** in model_kwargs. For the models on Bedrock, refer to [inferenceConfig](https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference-call.html)
* Set two **LLM-judge models** (judge_model_1, judge_model_2) to act as LLM judges for the summarization task

In [None]:
prompt_catalog_id = "summ-1"

model_kwargs = {
    'maxTokens': 512, 
    'topP': 0.9, 
    'temperature': 0
}   

judge_model_1 = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
judge_model_2 = "us.deepseek.r1-v1:0"

#### Specify the model and other settings for benchmarking

Invoke the **Summ_Ben** function to benchmark one selected model

* **method**: set "Bedrock" for the models on Bedrock
* **region**: configured in the previous step
* **model_id**: specify the Model ID for the model endpoint
* **model_kwargs**: configured in previous step
* **prompt_template**: prompt template based on the prompt configured in previous step
* **s3_bucket**: configured in previous step
* **file_name**: configured in previous step
* **BENCH_KEY**: configured in previous step
* **task_folder**: configured in previous step
* **cost_key**: set "public" when using AWS public pricing to calculate the cost
* **save_id**: the model name displayed in the report 
* **SLEEP_SEC**: configures "sleep and retry" for throttling; for example, set SLEEP_SEC = 10 to wait 10 seconds between inferences
* **SAMPLE_LEN**: you can configure the number of samples for inference
* **PP_TIME**: if you want to run model inference for multiple rounds, set the number of rounds here.  
* **cacheconf**: set "default" to enable Bedrock Prompt Caching in the inference, "None" to disable
* **latencyOpt**: set "optimized" to enable Bedrock Latency Optimized Inference, "None" to disable

In [None]:
BEFORE_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"
before_save_id = 'haiku-3'

In [None]:
prompt_template = Prompt_Template_Gen(BEFORE_MODEL_ID, prompt_catalog_id)
print(prompt_template)

In [None]:
Results_summ[before_save_id] = Summ_Ben(method="Bedrock",
                                 region=BEDROCK_REGION,
                                 model_id=BEFORE_MODEL_ID,
                                 jm1=judge_model_1,
                                 jm2=judge_model_2,
                                 model_kwargs=model_kwargs,
                                 prompt_template=prompt_template,
                                 s3_bucket=S3_BUCKET,
                                 file_name=INPUT_FILE,
                                 BENCH_KEY=BENCH_KEY,
                                 task_folder=TASK_FOLDER,
                                 cost_key=COST_FILE,
                                 save_id=before_save_id,
                                 SLEEP_SEC=20,
                                 SAMPLE_LEN=3,   #len(df_input_test),
                                 PP_TIME=1,
                                 cacheconf="None",latencyOpt="None")

Results_summ

### Migration starts here ...

* **MODEL_ID**: specify the target model for migration
* **opt_model_id**: Bedrock model ID for optimizer
* **OPT_ITERATION**: specify the iteration number for target model prompt optimization 

In [None]:
MODEL_ID = 'us.amazon.nova-lite-v1:0'
opt_model_id = 'amazon.nova-lite-v1:0'
OPT_ITERATION = 6

#### Step 1: Bedrock APO 

Run Bedrock APO to get an optimized prompt for the target model

In [None]:
#apo_prompt_template = Prompt_Opt_Template_Gen('us-west-2', opt_model_id, prompt_template)

# Alternatively you can manually create the prompt following the Nova model prompt best practices
apo_prompt_template = """
## Task
Your task is to summarize the given document enclosed in <doc></doc> tags in a brief and concise manner, without adding any information not mentioned in the document. Do not provide a preamble - start directly with the summarization.

## Guidelines
- Read the document carefully to understand its main points and key information.
- Identify the core ideas, facts, and arguments presented in the document.
- Synthesize the essential information into a clear and succinct summary.
- Use your own words to paraphrase the key points from the document.
- Omit unnecessary details or examples to keep the summary focused on the central concepts.
- If you cannot summarize the document, simply respond "I don't know" without making up an answer.

## Document to Summarize
<doc>
{{document}}
</doc>

Please provide your concise summary immediately without any preamble:"""

# Check the prompt output and manually adjust its wording and formatting as needed.
# Convert the {{document}} placeholder to {document} so that str.format() can fill it later.
apo_prompt_template = apo_prompt_template.replace("{{document}}", "{document}")

print(apo_prompt_template)
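
The placeholder replacement matters because Python's `str.format` treats `{{` and `}}` as escaped literal braces. The short sketch below uses a shortened, hypothetical template to show the difference.

```python
# Shortened stand-in for the APO output, which uses double braces.
apo_style = "<doc>\n{{document}}\n</doc>"

# "{{" and "}}" are escapes for literal braces, so nothing is substituted here.
unfilled = apo_style.format(document="Some article text.")

# After the replacement, "{document}" is a real placeholder.
fixed = apo_style.replace("{{document}}", "{document}")
filled = fixed.format(document="Some article text.")

print(unfilled)
print(filled)
```

Without the replacement, the document text would silently be left out of the prompt instead of raising an error.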

Evaluate the target model performance using the **Summ_Ben** function

In [None]:
save_id = 'nova-lite(apo)'
Results_summ[save_id] = Summ_Ben(method="Bedrock",
                                 region=BEDROCK_REGION,
                                 model_id=MODEL_ID,
                                 jm1=judge_model_1,
                                 jm2=judge_model_2,
                                 model_kwargs=model_kwargs,
                                 prompt_template=apo_prompt_template,
                                 s3_bucket=S3_BUCKET,
                                 file_name=INPUT_FILE,
                                 BENCH_KEY=BENCH_KEY,
                                 task_folder=TASK_FOLDER,
                                 cost_key=COST_FILE,
                                 save_id=save_id,
                                 SLEEP_SEC=20,
                                 SAMPLE_LEN=3,   #len(df_input_test),
                                 PP_TIME=1,
                                 cacheconf="None",latencyOpt="None")

Results_summ

#### Step 2: data-aware optimization

Prepare the training dataset and initialize the model

In [None]:
train_set = create_summ_dataset(apo_prompt_template,df_input_train)
model = model_initialize(MODEL_ID)

Data-aware optimization process: you can specify the following parameters for the optimizer 

* **model**: the initialized target model in the previous step
* **train_set**: training dataset prepared in the previous step 
* **eval_metric_accuracy**: the evaluation function (lj_metric_summ in this notebook)
* **num_candidates**: number of prompt candidates for the optimizer to generate and evaluate
* **num_trials**: number of optimization trials for the optimizer to run
* **minibatch_size**: size of the minibatch (a subset of the full training set) on which prompt candidates are optimized and evaluated
* **minibatch_full_eval_steps**: number of steps between full evaluations of the prompt candidates with the best average scores
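
How the minibatch parameters interact can be illustrated with a toy search loop. This is a simplified sketch under stated assumptions, not the implementation used by `data_aware_optimization` or DSPy: the function `toy_minibatch_search`, the random choice of candidates, and the length-based scoring function are all hypothetical.

```python
import random


def toy_minibatch_search(candidates, train_set, score, num_trials,
                         minibatch_size, full_eval_steps, seed=0):
    """Score candidates on minibatches; periodically fully evaluate the current best."""
    rng = random.Random(seed)
    minibatch_scores = {c: [] for c in candidates}
    full_scores = {}
    for trial in range(1, num_trials + 1):
        # Each trial evaluates one candidate on a small random subset.
        candidate = rng.choice(candidates)
        batch = rng.sample(train_set, minibatch_size)
        avg = sum(score(candidate, x) for x in batch) / minibatch_size
        minibatch_scores[candidate].append(avg)
        # Every full_eval_steps trials, evaluate the best-averaging candidate on all data.
        if trial % full_eval_steps == 0:
            averages = {c: sum(s) / len(s) for c, s in minibatch_scores.items() if s}
            top = max(averages, key=averages.get)
            full_scores[top] = sum(score(top, x) for x in train_set) / len(train_set)
    return full_scores


# Hypothetical candidates and a toy metric that favours longer prompts.
candidates = ["short", "medium prompt", "a much longer prompt"]
train_set = list(range(10))
toy_score = lambda prompt, x: float(len(prompt) > x)

full_scores = toy_minibatch_search(candidates, train_set, toy_score,
                                   num_trials=7, minibatch_size=5, full_eval_steps=7)
print(full_scores)
```

With `num_trials = 7` and a full evaluation every 7 steps, this toy loop performs exactly one full evaluation, at the final trial. Smaller minibatches make each trial cheaper but noisier, which is the trade-off these parameters control.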

In [None]:
with warnings.catch_warnings():
    warnings.filterwarnings('ignore')

    # Each iteration reruns the optimizer; optimized_model keeps only the latest result.
    for k in range(OPT_ITERATION):
        optimized_model = data_aware_optimization(model, train_set, lj_metric_summ,
                                                  num_candidates=5,
                                                  num_trials=7,
                                                  minibatch_size=5,
                                                  minibatch_full_eval_steps=7)

        print("========== Optimized prompt instruction ===============")
        print(optimized_model.prog.predict.signature.instructions)
        print("=======================================================")
        print("retry ", k, "...")
        # Pause between iterations (skipped after the last one).
        if k < OPT_ITERATION - 1:
            time.sleep(60)

Add the data-aware optimized prompt template to the prompt catalog 

In [None]:
update_prompt_catalog(prompt_catalog_id,optimized_model)

dao_prompt_template = Prompt_Template_Gen(MODEL_ID, prompt_catalog_id+'-dao')
print(dao_prompt_template)

Evaluate the target model performance using the **Summ_Ben** function. Based on the results, you can go back to step 2 and re-run the optimization with a different training set and/or different optimizer parameters.

In [None]:
save_id = 'nova-lite(data-aware-opt)'
Results_summ[save_id] = Summ_Ben(method="Bedrock",
                                 region=BEDROCK_REGION,
                                 model_id=MODEL_ID,
                                 jm1=judge_model_1,
                                 jm2=judge_model_2,
                                 model_kwargs=model_kwargs,
                                 prompt_template=dao_prompt_template,
                                 s3_bucket=S3_BUCKET,
                                 file_name=INPUT_FILE,
                                 BENCH_KEY=BENCH_KEY,
                                 task_folder=TASK_FOLDER,
                                 cost_key=COST_FILE,
                                 save_id=save_id,
                                 SLEEP_SEC=20,
                                 SAMPLE_LEN=3,   #len(df_input_test),
                                 PP_TIME=1,
                                 cacheconf="None",latencyOpt="None")

Results_summ

#### Select your prompt for the target model based on the performance evaluation 

In [None]:
print("==== Optimized prompt from step 1 =====")
print(apo_prompt_template)
print("\n===== Optimized prompt from step 2 =====")
print(dao_prompt_template)
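
One simple way to make the selection is to compare a single metric across the candidate prompts and keep the highest-scoring one. The sketch below assumes that a higher LJ score is better and uses made-up numbers, not measured results; in practice the values would be read from the `LJ_Score` row of `Results_summ`. The helper `pick_best` is hypothetical.

```python
def pick_best(scores, candidates):
    """Return the candidate name with the highest score (ties go to the first listed)."""
    return max(candidates, key=lambda name: scores[name])


# Made-up scores for illustration only; replace with values from Results_summ.
made_up_scores = {
    "nova-lite(apo)": 0.5,
    "nova-lite(data-aware-opt)": 0.6,
}

best = pick_best(made_up_scores, ["nova-lite(apo)", "nova-lite(data-aware-opt)"])
print(best)
```

Other rows of the results table, such as cost or inference time, may also matter for the final choice, so a single-metric comparison is only a starting point.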