# Hands-on Tutorial
## Domain-Driven LLM Development: Insights into RAG and Fine-Tuning Practices
### Lab3: RAG + Fine-tuning and Benchmarking
#### Summary: 
This lab explores tools that support the decision between RAG and domain fine-tuning for LLM tasks such as Q&A or document summarization.

Based on results from `Lab 1 - Advanced Techniques in Retrieval Augmented Generation (RAG)` and `Lab 2 - LLM fine-tuning`, we will explore how to compare both approaches to support the decision about which solution to choose.

Choosing between a fine-tuned model and a RAG model should take into account model performance as well as pricing.

In the first session of this notebook we compare model performance to assess whether one or both models achieve a minimum acceptable performance, based on metrics appropriate to the task at hand.

The second session offers pricing analysis tools to help decide between both models.
## Important note
### The pricing values used here are simulated and DO NOT reflect any real AWS pricing
The values used are consistent with the references below: 
> https://aws.amazon.com/bedrock/pricing/ <br>
> https://aws.amazon.com/kendra/pricing/

---

#### Installing and importing needed libraries

In [None]:
%%capture
!pip install -qU pinecone-client==2.2.1 ipywidgets==7.0.0
# install packages needed for plotting
! pip install -U kaleido --quiet
! pip install plotly --quiet
! pip install spacy --quiet

In [None]:
import pandas as pd
import os
import json
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'notebook'

#### Load RAG and fine-tuned data from lab 1 and lab 2
These files contain samples with evaluation metrics as well as associated prompts used. 

For this short lab we are considering 30 interaction samples from each model, with the prompts and associated metric values.

In [None]:
#FINETUNED_FILE = "../lab-data/"+"ft_evaldata.csv"  
#RAG_FILE = "../lab-data/"+"rag_evaldata.csv"  
FINETUNED_FILE = "../lab-data/"+"sft_trn_result.csv"  
RAG_FILE = "../lab-data/"+"naive_rag_result.csv"


df_rag = pd.read_csv(RAG_FILE)
df_finetuned = pd.read_csv(FINETUNED_FILE)

In [None]:
df_rag.head()

In [None]:
df_finetuned.head()

#### Now we get the average metrics over all samples

In [None]:
# metric averages
def get_metric_averages(metric_df):
    # return list of metrics
    average_metrics = metric_df.loc[:, ['semantic_similarity', 'token_overlap_recall','rouge_l_recall']].mean()
    return [average_metrics.iloc[0],
            average_metrics.iloc[1],
            average_metrics.iloc[2]
            ]

In [None]:
l_finetuned_metrics = get_metric_averages(df_finetuned)
print ("semantic_similarity:", l_finetuned_metrics[0],
       "\ntoken_overlap_recall:", l_finetuned_metrics[1],
       "\nrouge_l_recall:", l_finetuned_metrics[2])

In [None]:
l_rag_metrics = get_metric_averages(df_rag)
print ("semantic_similarity:", l_rag_metrics[0],
       "\ntoken_overlap_recall:", l_rag_metrics[1],
       "\nrouge_l_recall:", l_rag_metrics[2])

## Session 1 - Performance analysis
#### Plot RAG vs fine-tuned performance metrics
We use a radar plot as an easy way to present all metrics together for comparison.
We are using 3 metrics in this hands-on lab:
+ semantic_similarity
+ token_overlap_recall
+ rouge_l_recall
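
To give an intuition for the two recall metrics, here is a minimal sketch of one common way to compute them. It assumes whitespace tokenization and lowercase matching; the function names are illustrative, and the metrics produced in Lab 1 and Lab 2 may have been computed differently. `semantic_similarity` is left out because it depends on an embedding model.

```python
def token_overlap_recall(reference: str, response: str) -> float:
    """Share of distinct reference tokens that also appear in the response."""
    ref_tokens = set(reference.lower().split())
    resp_tokens = set(response.lower().split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & resp_tokens) / len(ref_tokens)


def rouge_l_recall(reference: str, response: str) -> float:
    """Longest common subsequence (LCS) length divided by the reference length."""
    ref = reference.lower().split()
    resp = response.lower().split()
    if not ref:
        return 0.0
    # dynamic-programming table: lcs[i][j] = LCS length of ref[:i] and resp[:j]
    lcs = [[0] * (len(resp) + 1) for _ in range(len(ref) + 1)]
    for i, ref_tok in enumerate(ref, start=1):
        for j, resp_tok in enumerate(resp, start=1):
            if ref_tok == resp_tok:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return lcs[-1][-1] / len(ref)


# illustrative strings only, not data from the labs
reference = "the cat sat on the mat"
response = "the cat is on the mat"
print("token_overlap_recall:", token_overlap_recall(reference, response))
print("rouge_l_recall:", rouge_l_recall(reference, response))
```

Both values are recalls: they measure how much of the reference answer is recovered by the model response, not how concise the response is.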


In [None]:
# metric names, model labels, and scores (all three lists share the same order)
nme = ["semantic_similarity", "token_overlap_recall", "rouge_l_recall",
      "semantic_similarity", "token_overlap_recall", "rouge_l_recall"
      ]
deg = ["finetuned", "finetuned", "finetuned", "rag", "rag", "rag"]
# fine-tuned scores first, then RAG scores, to match the labels in `deg`
scr = [l_finetuned_metrics[0], l_finetuned_metrics[1], l_finetuned_metrics[2],
       l_rag_metrics[0], l_rag_metrics[1], l_rag_metrics[2]
       ]

# dictionary of lists (named to avoid shadowing the built-in `dict`)
evaluation_data = {'metric': nme, 'model': deg, 'value': scr}

df_evaluation = pd.DataFrame(evaluation_data)

df_evaluation

In [None]:
def visualize_metrics(df):
    # radar (polar line) plot with one closed trace per model
    fig = px.line_polar(df, r="value",
                        theta="metric",
                        color="model",
                        line_close=True,
                        color_discrete_sequence=["#00eb93", "#4ed2ff"],
                        template="plotly_dark")

    fig.update_polars(angularaxis_showgrid=False,
                      radialaxis_gridwidth=0,
                      gridshape='linear',
                      bgcolor="#494b5a",
                      radialaxis_showticklabels=True
                      )
    fig.update_layout(paper_bgcolor="#2c2f36")
    # save the figure after all styling is applied (requires kaleido)
    fig.write_image("../lab-data/radarplot.pdf")
    fig.show()

In [None]:
# compare metrics for both models
visualize_metrics(df_evaluation);

## Analysis considerations
This plot gives you a quick view of the key metrics considered for this task.

Before going through a cost analysis, it is important to check that the models are above the expected performance thresholds for the metrics that matter for your use case.

As long as both are above adequate thresholds, you can proceed to the cost analysis.
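
A threshold check like the one described above can be sketched as follows. The threshold values and the example averages are hypothetical placeholders, not results from this lab; choose thresholds that fit your own use case.

```python
# Hypothetical minimum acceptable values; adjust them for your use case.
MIN_THRESHOLDS = {
    "semantic_similarity": 0.70,
    "token_overlap_recall": 0.50,
    "rouge_l_recall": 0.40,
}


def passes_thresholds(metrics, thresholds=MIN_THRESHOLDS):
    """Return (overall_pass, per_metric_pass) for a dict of metric averages."""
    per_metric = {name: metrics[name] >= minimum
                  for name, minimum in thresholds.items()}
    return all(per_metric.values()), per_metric


# Made-up averages for illustration only.
example_rag = {"semantic_similarity": 0.82,
               "token_overlap_recall": 0.61,
               "rouge_l_recall": 0.45}
example_finetuned = {"semantic_similarity": 0.78,
                     "token_overlap_recall": 0.44,
                     "rouge_l_recall": 0.41}

for name, metrics in [("rag", example_rag), ("finetuned", example_finetuned)]:
    ok, detail = passes_thresholds(metrics)
    print(name, "passes" if ok else "fails", detail)
```

In this notebook, the metric dictionaries could be built from the averages computed earlier, for example with `dict(zip(MIN_THRESHOLDS, l_rag_metrics))`, since both use the same metric order.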

---


## Session 2 - Cost analysis
Once both models have passed the performance test, we can look at the costs involved.

We make some assumptions to proceed with this example:
+ As we need to use small datasets to run these labs, we will extrapolate RAG dataset sizes for a more realistic scenario
+ We also define a number of accesses per month for both solutions in order to calculate expected costs
+ Pricing considered for the RAG model:
    + Monthly hosting costs for the RAG datasets
    + Monthly input and output token usage, based on the average number of input and output tokens
    + We are considering pricing based on the use of Amazon Kendra
+ Pricing considered for the fine-tuned model:
    + Fixed training cost
    + Monthly model hosting
    + Monthly input and output token usage
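
The cost components listed above can be summarized as two cumulative cost functions. This is a sketch in our own notation (the symbols are not part of the lab material), assuming a training run happens in the first month of every $k$-month period:

$$C_{\text{RAG}}(n) = n\,\bigl(h_{\text{RAG}} + t_{\text{RAG}}\bigr)$$

$$C_{\text{FT}}(n) = \left\lceil \frac{n}{k} \right\rceil F + n\,\bigl(h_{\text{FT}} + t_{\text{FT}}\bigr)$$

Here $n$ is the number of months elapsed, $h$ is the monthly hosting cost, $t$ is the monthly token (throughput) cost, $F$ is the fixed cost of one fine-tuning run, and $k$ is the number of months between fine-tuning runs.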
    
---

#### Process the input and output prompts from the RAG and fine-tuned datasets from Lab 1 and Lab 2

In [None]:
df_finetuned_prompts = df_finetuned[["question", "response"]]
df_finetuned_prompts.head()

In [None]:
df_rag_prompts = df_rag[["question", "response"]]
df_rag_prompts.head()

#### Counting tokens from RAG and FT input and output prompts
We now proceed to counting the input and output tokens and calculating the sample mean for both.

In [None]:
# import spaCy for tokenization
import spacy
# create a blank English pipeline; it only provides the tokenizer,
# which is all we need to count tokens
nlp = spacy.blank("en")

In [None]:
# enable copy-on-write so adding columns to the sliced DataFrames is safe
pd.options.mode.copy_on_write = True
# count tokens in the questions (input) and responses (output) of the fine-tuned model
df_finetuned_prompts['num_input_prompt'] = df_finetuned_prompts['question'].map(lambda a: len(nlp(a)))
df_finetuned_prompts['num_output_prompt'] = df_finetuned_prompts['response'].map(lambda a: len(nlp(a)))

# same counts for RAG; only the question is counted as input here,
# not the retrieved context (column `ctx_input_prompt`)
df_rag_prompts['num_input_prompt'] = df_rag_prompts['question'].map(lambda a: len(nlp(a)))
df_rag_prompts['num_output_prompt'] = df_rag_prompts['response'].map(lambda a: len(nlp(a)))


In [None]:
# sample means of the input and output token counts
ft_mean_input_size = df_finetuned_prompts.loc[:, 'num_input_prompt'].mean()
ft_mean_output_size = df_finetuned_prompts.loc[:, 'num_output_prompt'].mean()
rag_mean_input_size = df_rag_prompts.loc[:, 'num_input_prompt'].mean()
rag_mean_output_size = df_rag_prompts.loc[:, 'num_output_prompt'].mean()

In [None]:
# model labels, prompt types, and mean token counts (same order in all three lists)
model = ["finetuned", "finetuned", "rag", "rag"]
prompt = ["input", "output", "input", "output"]
num_tokens = [ft_mean_input_size, ft_mean_output_size, rag_mean_input_size, rag_mean_output_size]

# dictionary of lists (named to avoid shadowing the built-in `dict`)
prompt_size_data = {'prompt_type': prompt, 'model': model, 'num_tokens': num_tokens}

df_comp_in_out = pd.DataFrame(prompt_size_data)

### Comparing prompt sizes for RAG and Fine-tuned models

In [None]:
fig = px.bar(df_comp_in_out, title = "Input and output mean token sizes", x="prompt_type", y="num_tokens",
             color="model", barmode="group", width=700, height=350)
fig.show()

## Can you see a difference above in average prompt sizes between RAG and FT?
+ For input prompts, RAG prompts are generally bigger because the solution incorporates context text into the prompt. Note that the cell above counts only the `question` column, so this difference shows up in the plot only if the context-augmented prompt is counted instead.
+ For output prompts the difference is less obvious, but you can see that the fine-tuned model's mean output size is usually smaller. This is supported by research results.

## Fine-tuning scenario
+ An application developer customizes the `Llama3 8B Chat` Pretrained (8B) model using 1,000,000 tokens of data (the `dataset_size` used in the cost calculation).
+ After training, the developer uses custom model provisioned throughput for 1 hour to evaluate the performance of the model.
+ The fine-tuned model is stored for 1 month.
+ After evaluation, the developer uses provisioned throughput (1-month commitment) to host the customized model.

#### Fixed fine-tuning training cost
For each fine-tuning task we consider:
+ Fine tuning training cost 
+ Fine-tuned model storage per month 
+ 1 hour of custom model inference for performance evaluation

#### Monthly Provisioned Throughput pricing
An application developer buys one model unit for their text summarization use case.

+ Total monthly cost incurred = 1 model unit * inference throughput cost (e.g., $21.18 per hour * 24 hours * 31 days)
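
The arithmetic in the bullet above can be checked with a short sketch. The hourly price is the simulated value from this lab, not a real AWS price, and the function name is ours.

```python
HOURLY_PRICE_PER_MODEL_UNIT = 21.18  # simulated price used in this lab
HOURS_PER_DAY = 24
DAYS_IN_MONTH = 31


def monthly_throughput_cost(model_units: int = 1) -> float:
    """Provisioned throughput cost for a 31-day month."""
    return (model_units * HOURLY_PRICE_PER_MODEL_UNIT
            * HOURS_PER_DAY * DAYS_IN_MONTH)


print(f"1 model unit: ${monthly_throughput_cost(1):,.2f} per month")
```

Because the cost is billed per hour of provisioned capacity, it does not depend on how many tokens are actually processed during the month.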

---

**Note: We consider a one-year period of cumulative costs in this analysis, but the time period can easily be changed to consider different timeframes.**


In [None]:
# timeframe considered
start_date = '1/1/2023'
end_date = '12/31/2023'

In [None]:
# fine-tuning training cost (fixed cost per training run), including 1 month of model storage
num_hours_month = 744 # 24h * 31 days
price_per_token = 0.00799 # simulated price
dataset_size = 1000000 # number of tokens
model_storage_per_month = 1.95 # 1 month
evaluation_throuput_hour = 21.18 # 1 hour of throughput for model evaluation

ft_training_cost = price_per_token * dataset_size + model_storage_per_month + evaluation_throuput_hour
print("Fine-tuned fixed training cost: ", ft_training_cost)


# fine-tuned model monthly hosting and throughput costs
ft_model_hosting_cost_month = model_storage_per_month
ft_token_cost_per_month = evaluation_throuput_hour * num_hours_month + model_storage_per_month

print("Fine-tuned model monthly hosting cost: ", ft_model_hosting_cost_month)
print("Fine-tuned model monthly throughput cost: ", ft_token_cost_per_month)

## RAG scenario
For this pricing simulation we consider Bedrock using a knowledge base supported by Kendra.
An application developer uses a pre-trained `Llama3 8B Chat` model supported by RAG with the following assumptions:
+ 90,000 documents in the knowledge base
+ 7,000 searches per day
+ As we have fewer than 8k queries per day and fewer than 100k documents, the pricing per hour is $1.4
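
As a sketch of the tier check described above, the function below verifies that the workload fits the assumed tier and converts the hourly price into a monthly hosting cost. The limits and the hourly price are the simulated values from this lab, and the function and constant names are ours. Prices for larger tiers are not covered here, so the sketch raises an error instead of guessing.

```python
MAX_QUERIES_PER_DAY = 8_000   # tier limit assumed in this lab
MAX_DOCUMENTS = 100_000       # tier limit assumed in this lab
PRICE_PER_HOUR = 1.4          # simulated hourly price for this tier
HOURS_PER_MONTH = 744         # 24 hours * 31 days


def rag_monthly_hosting_cost(num_documents: int, searches_per_day: int) -> float:
    """Monthly hosting cost if the workload fits the assumed tier."""
    if num_documents > MAX_DOCUMENTS or searches_per_day > MAX_QUERIES_PER_DAY:
        raise ValueError("workload exceeds the tier assumed in this lab")
    return PRICE_PER_HOUR * HOURS_PER_MONTH


print(f"${rag_monthly_hosting_cost(90_000, 7_000):,.2f} per month")
```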

In [None]:
# RAG hosting calculation (simulating Amazon Kendra)

# workload assumptions for the knowledge base hosted in Kendra (RAG)
num_rag_doc_to_search = 90000 # below the 100k document limit of the pricing tier
num_searches_day = 7000 # below the 8k queries per day limit of the pricing tier
pricing_per_hour = 1.4 # up to 8k queries per day, up to 100k documents
# RAG monthly hosting cost
rag_hosting_princing_per_month = pricing_per_hour * num_hours_month # $1.4 per hour x 744 hours/month
print("RAG monthly hosting cost: ", rag_hosting_princing_per_month)
# RAG monthly throughput cost (same simulated throughput pricing as the fine-tuned model)
rag_token_cost_per_month = evaluation_throuput_hour * num_hours_month + model_storage_per_month
print("RAG monthly throughput cost: ", rag_token_cost_per_month)

### Plotting RAG pricing
First, we look at the relative contributions of RAG dataset hosting and token usage.

In [None]:
# stacked bar plot of cumulative RAG costs per month (hosting and tokens)
def plot_pricing_rag_stacked(df):
    fig = px.bar(df, title="RAG stacked pricing",  x=df.month,  y=["RAG_hosting_monthly_pricing","RAG_tokens_monthly_pricing"])
    fig.show()

In [None]:
# build cumulative monthly RAG costs with columns:
# month, RAG_hosting_monthly_pricing, RAG_tokens_monthly_pricing
def prep_rag(plot=True):
    num_months = pd.date_range(start=start_date, end=end_date, freq='ME').to_frame().shape[0]

    l_accum_rag_token_pricing = []
    l_accum_rag_hosting_pricing = []
    rag_hosting_accum = 0
    rag_token_accum = 0

    for i in range(num_months):
        rag_hosting_accum += rag_hosting_princing_per_month
        rag_token_accum += rag_token_cost_per_month

        l_accum_rag_token_pricing.append(rag_token_accum)
        l_accum_rag_hosting_pricing.append(rag_hosting_accum)

    # create dataframe for plotting
    dt = pd.date_range(start=start_date, end=end_date, freq='ME')
    df_rag = dt.to_frame(index=False)
    df_rag.rename(columns={0: "month"}, inplace=True)
    df_rag['RAG_hosting_monthly_pricing'] = l_accum_rag_hosting_pricing
    df_rag['RAG_tokens_monthly_pricing'] = l_accum_rag_token_pricing

    if plot:
        plot_pricing_rag_stacked(df_rag)
    return df_rag

In [None]:
# build and plot cumulative RAG costs
# (this reuses the name df_rag, replacing the evaluation data loaded earlier)
df_rag = prep_rag(plot=True)
df_rag.head()

### Analysis considerations
We can see that throughput contributes much more than hosting to the cumulative costs of a RAG solution.

So cost reduction efforts should focus on token (throughput) costs.

**For example, if you have a yearly budget of $\$150K$, this use case DOESN'T fit in your budget, because the cumulative RAG cost reaches about $\$200K$ by the end of the year.
Fine-tuning may be an alternative.**
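
The budget check above can be reproduced with a few lines. This is a sketch that repeats the simulated prices from the RAG scenario so it runs on its own; the budget value is the example figure mentioned above.

```python
HOURS_PER_MONTH = 744  # 24 hours * 31 days

# simulated monthly RAG costs, same formulas as in the cells above
rag_hosting_per_month = 1.4 * HOURS_PER_MONTH
rag_tokens_per_month = 21.18 * HOURS_PER_MONTH + 1.95

yearly_rag_cost = 12 * (rag_hosting_per_month + rag_tokens_per_month)
yearly_budget = 150_000  # example budget

fits_budget = yearly_rag_cost <= yearly_budget
print(f"Yearly RAG cost: ${yearly_rag_cost:,.2f}")
print("Fits budget:", fits_budget)
```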

## Plotting fine-tuning pricing
Now we look at the relative contributions of the fine-tuning fixed and monthly costs:
+ fine-tuning fixed training cost
+ fine-tuning monthly hosting cost
+ fine-tuning monthly token cost


In [None]:
# stacked bar plot of cumulative fine-tuning costs per month
# (hosting, tokens, and fixed training costs)
def plot_pricing_ft_stacked(every_n_months, df):
    fig = px.bar(df, title="Fine-tuned stacked pricing - training frequency: every " + str(every_n_months) + " months",
                 x=df.month,  
                 y=["FT_hosting_monthly_pricing","FT_tokens_monthly_pricing", "FT_fixed_training_pricing"])
    fig.show()

#### Comparing costs considering monthly fine-tuning with same dataset size

In [None]:
num_months = pd.date_range(start=start_date, end=end_date, freq='ME').to_frame().shape[0]

def prep_finetuned(every_n_months, plot=True):
    num_months = pd.date_range(start=start_date, end=end_date, freq='ME').to_frame().shape[0]
    l_accum_ft_training_pricing = []
    l_accum_ft_token_pricing = []
    l_accum_ft_hosting_pricing = []
    ft_hosting_accum = 0
    ft_token_accum = 0
    ft_training_accum = 0
    # every_n_months sets the training frequency; change it to see how it affects pricing
    for i in range(num_months):
        #Accumulate fixed fine-tuning according to periodicity
        if i % every_n_months == 0:
            ft_training_accum += ft_training_cost
        
        ft_hosting_accum += ft_model_hosting_cost_month
        ft_token_accum += ft_token_cost_per_month
        l_accum_ft_token_pricing.append(ft_token_accum)
        l_accum_ft_hosting_pricing.append(ft_hosting_accum)
        l_accum_ft_training_pricing.append(ft_training_accum)
    # create dataframe for plotting
    dt = pd.date_range(start=start_date, end=end_date, freq='ME')
    df_ft = dt.to_frame(index=False)
    df_ft.rename(columns={0: "month"}, inplace=True)
    df_ft['FT_hosting_monthly_pricing'] = l_accum_ft_hosting_pricing
    df_ft['FT_tokens_monthly_pricing'] = l_accum_ft_token_pricing
    df_ft['FT_fixed_training_pricing'] = l_accum_ft_training_pricing
    if plot:
        plot_pricing_ft_stacked(every_n_months, df_ft)
    return df_ft

In [None]:
# input parameter: frequency of fine-tuning training (months)
df_ft_1 = prep_finetuned(1)
df_ft_1.head()

### Analysis considerations
With monthly retraining, the fixed training cost (about $\$8K$ per run with the simulated prices) is smaller than the monthly throughput cost (about $\$15.8K$), but it is not negligible.

For the same use case used here for RAG, the cumulative costs are of the same order of magnitude.

So far, this does not clearly favor either the RAG or the fine-tuned solution.

But let's explore different fine-tuning frequencies and see how they affect cumulative costs.

## Exploring different fine-tuning frequencies
Let's first consider fine-tuning every quarter.

In [None]:
# input parameter: frequency of fine-tuning training (months)
df_ft_3 = prep_finetuned(3)
df_ft_3.head()

Now consider fine-tuning twice a year, followed by fine-tuning once a year.

In [None]:
# input parameter: frequency of fine-tuning training (months)
df_ft_6 = prep_finetuned(6)
df_ft_6.head()

In [None]:
# input parameter: frequency of fine-tuning training (months); 12 = once a year
df_ft_12 = prep_finetuned(12)
df_ft_12.head()

### Analysis considerations
We can clearly see that the fine-tuning frequency has a significant impact on cumulative training costs.
Considering the final yearly costs, we see a reduction of around $\$88K$ (from $\$96K$ to $\$8K$) when reducing the fine-tuning frequency from once a month to once a year.
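
The yearly training costs quoted above follow directly from the number of training runs per year. The sketch below repeats the simulated fixed training cost so it runs on its own, and assumes, like `prep_finetuned`, that a run happens in the first month of every period. The function name is ours.

```python
import math

# simulated fixed cost per training run: tokens + 1 month storage + 1 hour evaluation
FT_TRAINING_COST = 0.00799 * 1_000_000 + 1.95 + 21.18


def yearly_training_cost(every_n_months: int, num_months: int = 12) -> float:
    """Cumulative fixed training cost after num_months."""
    training_runs = math.ceil(num_months / every_n_months)
    return training_runs * FT_TRAINING_COST


for every_n_months in (1, 3, 6, 12):
    cost = yearly_training_cost(every_n_months)
    print(f"every {every_n_months:>2} month(s): ${cost:,.2f}")
```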

#### Plotting fine-tuning frequency against cumulative annual training costs

In [None]:
df_ft_1['frequency'] = "monthly"
df_ft_3['frequency'] = "quarterly"
df_ft_6['frequency'] = "semi-annualy"
df_ft_12['frequency'] = "annualy"
df_rag['frequency'] = "RAG"

all_df = [df_ft_1, df_ft_3, df_ft_6, df_ft_12]
final_df = pd.concat(all_df, ignore_index=True)

fig = px.line(final_df, x="month", y="FT_fixed_training_pricing", color='frequency')
fig.show()

#### Plotting total cumulative costs for RAG and different fine-tuning training frequencies

In [None]:
# sum RAG costs
df_rag["cummulative_cost"] = df_rag["RAG_hosting_monthly_pricing"] + df_rag["RAG_tokens_monthly_pricing"]
# sum FT costs
df_ft_1["cummulative_cost"] = df_ft_1["FT_hosting_monthly_pricing"] + df_ft_1["FT_tokens_monthly_pricing"] + \
                                df_ft_1["FT_fixed_training_pricing"]
df_ft_3["cummulative_cost"] = df_ft_3["FT_hosting_monthly_pricing"] + df_ft_3["FT_tokens_monthly_pricing"] +  \
                                df_ft_3["FT_fixed_training_pricing"]
df_ft_6["cummulative_cost"] = df_ft_6["FT_hosting_monthly_pricing"] + df_ft_6["FT_tokens_monthly_pricing"] + \
                                df_ft_6["FT_fixed_training_pricing"]
df_ft_12["cummulative_cost"] = df_ft_12["FT_hosting_monthly_pricing"] + df_ft_12["FT_tokens_monthly_pricing"] + \
                                df_ft_12["FT_fixed_training_pricing"]

all_df = [df_ft_1, df_ft_3, df_ft_6, df_ft_12, df_rag]
final_df = pd.concat(all_df, ignore_index=True)

fig = px.line(final_df, x="month", y="cummulative_cost", color='frequency')
fig.show()
final_df.head()

### Breakeven analysis
We can also perform a break-even analysis, i.e., find the point where the cumulative costs of RAG reach the cumulative costs of the fine-tuned solution.

Based on the plot above, for the one-year period considered, let's compare the model that is fine-tuned once a year against the RAG one.
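
When the fine-tuned model is trained only once, the breakeven point can also be estimated in closed form: the fixed training cost divided by the monthly amount RAG costs more than the fine-tuned model. The sketch below uses the simulated prices from this lab and assumes costs accrue evenly per month; the function name is ours.

```python
import math

HOURS_PER_MONTH = 744  # 24 hours * 31 days

# simulated prices, same formulas as in the cost cells above
fixed_training_cost = 0.00799 * 1_000_000 + 1.95 + 21.18
tokens_per_month = 21.18 * HOURS_PER_MONTH + 1.95   # same for both solutions
rag_monthly = 1.4 * HOURS_PER_MONTH + tokens_per_month
ft_monthly = 1.95 + tokens_per_month


def breakeven_months(fixed_cost, rag_monthly_cost, ft_monthly_cost):
    """Months until cumulative RAG cost equals cumulative fine-tuning cost.

    Returns None when RAG is not more expensive per month, because the
    fixed training cost is then never recovered.
    """
    monthly_difference = rag_monthly_cost - ft_monthly_cost
    if monthly_difference <= 0:
        return None
    return fixed_cost / monthly_difference


months = breakeven_months(fixed_training_cost, rag_monthly, ft_monthly)
print("breakeven after", round(months, 2), "months")
print("first full month at or past breakeven:", math.ceil(months))
```

The plot-based approach in the next cells finds the month-end date closest to this point, so it can only resolve the breakeven to the nearest month.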

In [None]:
df_ft_vs_rag = final_df[(final_df.frequency == "annualy") | (final_df.frequency == "RAG")]
# selecting only needed columns
df_ft_vs_rag = df_ft_vs_rag[["month", "cummulative_cost", "frequency"]]

fig = px.line(df_ft_vs_rag, x="month", y="cummulative_cost", color='frequency')
fig.show()
df_ft_vs_rag.head()

In [None]:
df_rag = df_ft_vs_rag.loc[(df_ft_vs_rag['frequency'] == "RAG")]
df_annual_ft = df_ft_vs_rag.loc[(df_ft_vs_rag['frequency'] == "annualy")]

# merge both on month to calculate the cost difference
# (suffix _x = RAG, suffix _y = annual fine-tuning)
df_temp = pd.merge(df_rag, df_annual_ft, on='month')

df_temp = df_temp[['month', 'cummulative_cost_x', 'cummulative_cost_y']].copy()
df_temp["cost_diff"] = df_temp["cummulative_cost_x"] - df_temp["cummulative_cost_y"]
df_temp.head()

In [None]:
# find the month where the cost difference is closest to zero
import numpy as np
df_closest = df_temp.iloc[df_temp['cost_diff'].abs().argsort()[:1]]
df_closest

The data above shows that, for this use case, we reach breakeven in costs around the end of August 2023.

In [None]:
# keep only the breakeven month and label it so it can be plotted as its own line
df_closest = df_closest.drop(columns=["cummulative_cost_x", "cummulative_cost_y"])
df_closest.columns = ["month", "cummulative_cost"]
df_closest["frequency"] = "Breakeven"

# repeat the breakeven month once per plotted month to draw a vertical marker
df_closest = pd.DataFrame(np.repeat(df_closest.values, df_rag.shape[0], axis=0))
df_closest.columns = ["month", "cummulative_cost", "frequency"]
# spread the marker over the plotted cost range
df_closest["cummulative_cost"] = np.linspace(0, df_ft_vs_rag["cummulative_cost"].max(), len(df_closest))

both_df = [df_ft_vs_rag, df_closest]
breakeven_df = pd.concat(both_df, ignore_index=True)

fig = px.line(breakeven_df, x="month", y="cummulative_cost", color='frequency')
fig.show()
breakeven_df.head()

### Analysis considerations
We can see in this example that by the end of August the accumulated RAG costs reach the cumulative costs of the fine-tuned solution with one training run per year. This means that RAG costs less up front, because you do not pay for fine-tuning, while the annually fine-tuned model becomes the cheaper option after the breakeven point.

If you fine-tune more often than once a year, RAG remains the cheaper solution over the one-year period for this use case.
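
To check how the breakeven month depends on the fine-tuning frequency, the comparison can be repeated for each frequency. This is a minimal sketch using the simulated prices from this lab and the same training schedule as `prep_finetuned` (a training run in the first month of every period); the function name is ours. It prints, for each frequency, the first month in which the cumulative RAG cost reaches the cumulative fine-tuning cost, or `None` if that does not happen within the year.

```python
import math

HOURS_PER_MONTH = 744  # 24 hours * 31 days

# simulated prices, same formulas as in the cost cells above
FT_TRAINING_COST = 0.00799 * 1_000_000 + 1.95 + 21.18
TOKENS_PER_MONTH = 21.18 * HOURS_PER_MONTH + 1.95   # same for both solutions
RAG_MONTHLY = 1.4 * HOURS_PER_MONTH + TOKENS_PER_MONTH
FT_MONTHLY = 1.95 + TOKENS_PER_MONTH


def first_breakeven_month(every_n_months, num_months=12):
    """First month (1-based) where cumulative RAG cost >= cumulative FT cost."""
    for month in range(1, num_months + 1):
        training_runs = math.ceil(month / every_n_months)
        rag_total = month * RAG_MONTHLY
        ft_total = training_runs * FT_TRAINING_COST + month * FT_MONTHLY
        if rag_total >= ft_total:
            return month
    return None


results = {k: first_breakeven_month(k) for k in (1, 3, 6, 12)}
for every_n_months, month in results.items():
    print(f"fine-tuning every {every_n_months:>2} month(s): breakeven month = {month}")
```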