# Documentation
This notebook reads in a dataframe and calculates how much it cost, in €, to prompt the LLMs that generated the dataframe.

In [2]:
%pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting regex>=2022.1.18 (from tiktoken)
  Downloading regex-2024.9.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Downloading regex-2024.9.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (792 kB)
Installing collected packages: regex, tiktoken
Successfully installed regex-2024.9.11 tiktoken-0.8.0
Note: you may need to res

In [3]:
import pandas as pd
import os
from datetime import datetime
import plotly.express as px
import json
import tiktoken

In [4]:
qa_pair_prompt = """
        You are an expert in the Validation Services division of a pharmaceutical company. You hold a PhD in medicine, pharmacy, and biochemistry and possess extensive knowledge of regulatory compliance and quality assurance in the pharmaceutical industry. You are adept at analyzing and interpreting regulatory documents, extractable guides, and scientific research papers in biotech, biology, and pharmacy. Your expertise spans regulatory compliance, quality assurance, and scientific research methodologies in the pharmaceutical industry. You work at a company that provides products and services for drug development, biotechnology, and life science research, including laboratory instruments, consumables, and process technologies. 
        Instructions:
        You are given a text extract and should generate 2 different questions from that text, which relate to a product in the text. The questions don't have to be related to the same product. You should also generate the answers to the questions.
        Your answer is always of exactly the format: "[{'question': 'the question', 'answer': 'the answer'}, {'question': 'the question', 'answer': 'the answer'},]"
        """
evaluator_prompt = """
        You are tasked with evaluating the quality of answers generated by a Retrieval-Augmented Generation (RAG)-based chatbot designed to answer product-related questions in the validation service department of a pharmaceutical company. The evaluation involves comparing the generated answers against the true answers provided by experts.
        Your goal is to evaluate how well the generated answers match the true answers based on the following criteria:
        - Accuracy: How closely does the generated answer align with the true answer in terms of factual correctness?
        - Completeness: Does the generated answer fully address the question, or does it omit important details?
        - Clarity: Is the generated answer clear and understandable, or is it ambiguous or confusing?
        - Relevance: Is the generated answer directly relevant to the question asked, or does it include unnecessary or off-topic information?
        Instructions:
        Compare the true answer and the generated answer using the criteria above.
        Start by writing an Evaluation Text that provides a detailed comparison, explaining why the generated answer received a particular score. Your reasoning should highlight the strengths and weaknesses of the generated answer in relation to the true answer.
        Based on the comparison, assign an Evaluation Score on a scale from 1 to 5:
        5: The generated answer is nearly identical to the true answer in accuracy, completeness, clarity, and relevance.
        4: The generated answer is very close to the true answer but may miss some minor details or contain slight inaccuracies.
        3: The generated answer provides a reasonably correct response but has noticeable gaps in completeness, clarity, or relevance.
        2: The generated answer contains significant inaccuracies or omissions but has some elements that are correct or relevant.
        1: The generated answer is mostly incorrect or irrelevant.
        Output Example:
        At the end of the evaluation, return the result as a JSON object with the following structure:
        {
        "reasoning": "Your detailed comparison goes here, explaining why the generated answer received its score based on accuracy, completeness, clarity, and relevance.",
        "score": "Your score from 1 to 5 goes here."
        }
        Here are the question and answers for evaluation: 
            Question: , 
            True answer: , 
            Generated answer: .
        """

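The evaluator prompt asks the model to return a JSON object with the keys `reasoning` and `score`, with the score written as a string. A minimal sketch of how such a reply could be parsed and validated is shown below. `parse_evaluation` is a hypothetical helper that is not part of this notebook, and it assumes the reply contains exactly one JSON object, possibly surrounded by extra text.

```python
import json


def parse_evaluation(reply: str) -> tuple[int, str]:
    """Extract (score, reasoning) from an evaluator reply (hypothetical helper).

    Assumes the reply contains one JSON object with the keys "reasoning"
    and "score", as requested in evaluator_prompt.
    """
    # Take the outermost {...} span so that text around the JSON is tolerated.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    payload = json.loads(reply[start:end + 1])

    # The prompt shows the score as a string, so convert it explicitly.
    score = int(payload["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score, str(payload["reasoning"])


# Made-up reply, for illustration only.
example_reply = 'Evaluation: {"reasoning": "Matches the true answer.", "score": "4"}'
score, reasoning = parse_evaluation(example_reply)
print(score, reasoning)
```

Validating the range up front keeps malformed replies out of the `evaluation_score` column instead of letting them surface later as odd token counts.
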
In [6]:
# pricing from https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/
azure_openai_api_key = "example"  # dbutils.secrets.get(scope='keyvault-link', key='azure-openai-api-key')
MODELS = {
    'MODEL_GPT4o': {
        "endpoint": "https://appprodsagopenaigpt4weu.openai.azure.com/openai/deployments/gpt-4o-cad-extraction/chat/completions?api-version=2024-02-15-preview",
        "api_key": azure_openai_api_key,
        "prompt_token_cost": 0.0047/1000,
        "completion_token_cost": 0.0139/1000
    },
    'MODEL_GPT4o_mini': {
        "endpoint": "https://appprodsagopenaigpt4weu.openai.azure.com/openai/deployments/evaluation_gpt4o-mini/chat/completions?api-version=2023-03-15-preview",
        "api_key": azure_openai_api_key,
        "prompt_token_cost": 0.00014/1000,
        "completion_token_cost": 0.0006/1000
    }
    # Add more endpoints and api keys here
}

In [7]:
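def calculate_costs(token_counts: dict, model: str = 'MODEL_GPT4o'):
    """Return the cost in € for the given prompt and completion token counts."""
    completion_tokens = token_counts['completion_tokens']
    prompt_tokens = token_counts['prompt_tokens']
    costs_in_euro = (prompt_tokens * MODELS[model]["prompt_token_cost"]
                     + completion_tokens * MODELS[model]["completion_token_cost"])
    return round(costs_in_euro, 4)


def count_tokens(text):
    # The "gpt-4" encoding is cl100k_base; GPT-4o models use o200k_base,
    # so these counts only approximate the tokens actually billed.
    encoding = tiktoken.encoding_for_model("gpt-4")
    return len(encoding.encode(text))


def transform_dataframe(df):
    """Return a copy of df in which every cell holds the token count of its string form."""
    df = df.copy()  # leave the caller's dataframe untouched
    for column in df.columns:
        df[column] = df[column].astype(str).apply(count_tokens)
    return df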

In [None]:
# Read the combined parquet (QA pairs + ProductAI answers + evaluation results).
# The folder defaults to today's date; the second assignment pins the run from 2024-09-28.
base_folder = "/Volumes/uc-catalog-dev/advancedanalytics-productai-dev/transformed_dev/llm-evaluation/" + datetime.now().strftime("%Y-%m-%d") + "/"
base_folder = "/Volumes/uc-catalog-dev/advancedanalytics-productai-dev/transformed_dev/llm-evaluation/2024-09-28/"
combined_df = pd.read_parquet(base_folder + "question_answer_pairs+productai_answers+evaluation_results.parquet")
display(combined_df)

#### Calculation variables

##### Generating QA Pairs
- Model = GPT-4o-mini
- Input = (Prompt + Chunk size) * number of chunks
- Output = Questions + Answers

##### Generating Product AI response
- Model = GPT-4o
- Input = Questions
- Output = Product AI response

##### Evaluating answer
- Model = GPT-4o-mini
- Input = (Prompt + Question + Answer + Product AI response) * number of questions
- Output = (Score + Reasoning) * number of questions
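
The three formulas above can be sketched as plain functions over per-row token counts. The function names and the toy numbers below are illustrative assumptions, not values from the evaluation run. The formulas follow the bullet lists literally, counting one prompt per row.

```python
def qa_pair_tokens(prompt, chunks, questions, answers):
    """QA pair generation: the prompt is sent once per chunk, together with the chunk."""
    prompt_tokens = prompt * len(chunks) + sum(chunks)
    completion_tokens = sum(questions) + sum(answers)
    return prompt_tokens, completion_tokens


def productai_tokens(questions, responses):
    """Product AI response: questions go in, responses come out."""
    return sum(questions), sum(responses)


def evaluation_tokens(prompt, questions, answers, responses, scores, reasonings):
    """Evaluation: the prompt is sent once per question, with both answers attached."""
    prompt_tokens = (prompt * len(questions)
                     + sum(questions) + sum(answers) + sum(responses))
    completion_tokens = sum(scores) + sum(reasonings)
    return prompt_tokens, completion_tokens


# Toy per-row token counts (hypothetical, two rows).
chunks = [500, 450]
questions = [20, 25]
answers = [40, 35]
responses = [120, 100]
scores = [1, 1]
reasonings = [90, 110]

print(qa_pair_tokens(300, chunks, questions, answers))   # (1550, 120)
print(productai_tokens(questions, responses))            # (45, 220)
print(evaluation_tokens(400, questions, answers, responses, scores, reasonings))  # (1140, 202)
```

Each function returns a `(prompt_tokens, completion_tokens)` pair, which maps directly onto the two keys that `calculate_costs` reads.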


In [None]:
token_df = transform_dataframe(combined_df)
display(token_df)

In [17]:
combined_df = pd.read_parquet("/home/jovyan/LLM_evaluation/runs/complete_anonymized_results_240928142914.parquet")
token_df = pd.read_parquet("/home/jovyan/LLM_evaluation/runs/tokenized_results_240928142914.parquet")

In [18]:
input_tokens_qa_pairs = count_tokens(qa_pair_prompt) * len(token_df) + token_df['chunk'].sum()
output_tokens_qa_pairs = token_df['question'].sum() + token_df['answer'].sum()
total_tokens_qa_pairs = input_tokens_qa_pairs + output_tokens_qa_pairs
token_counts_qa_pairs = {
    'prompt_tokens': input_tokens_qa_pairs,
    'completion_tokens': output_tokens_qa_pairs,
    'total_tokens': total_tokens_qa_pairs
}
# Priced at GPT-4o rates; the calculation notes above list GPT-4o-mini for this step.
costs_qa_pairs = calculate_costs(token_counts_qa_pairs, model='MODEL_GPT4o')
print(token_counts_qa_pairs)

{'prompt_tokens': 1784504, 'completion_tokens': 32828, 'total_tokens': 1817332}


In [14]:
input_tokens_productai = token_df['question'].sum()
output_tokens_productai = token_df['productai_response'].sum()
total_tokens_productai = input_tokens_productai + output_tokens_productai
token_counts_productai = {
    'prompt_tokens': input_tokens_productai,
    'completion_tokens': output_tokens_productai,
    'total_tokens': total_tokens_productai
}
costs_productai = calculate_costs(token_counts_productai, model='MODEL_GPT4o')
print(token_counts_productai)

{'prompt_tokens': 14738, 'completion_tokens': 85926, 'total_tokens': 100664}


In [15]:
input_tokens_evaluation = count_tokens(evaluator_prompt) * len(token_df) + token_df['question'].sum() + token_df['answer'].sum() + token_df['productai_response'].sum()
output_tokens_evaluation = token_df['evaluation_score'].sum() + token_df['evaluation_reasoning'].sum()
total_tokens_evaluation = input_tokens_evaluation + output_tokens_evaluation
token_counts_evaluation = {
    'prompt_tokens': input_tokens_evaluation,
    'completion_tokens': output_tokens_evaluation,
    'total_tokens': total_tokens_evaluation
}
# Priced at GPT-4o rates; the calculation notes above list GPT-4o-mini for this step.
costs_evaluation = calculate_costs(token_counts_evaluation, model='MODEL_GPT4o')
print(token_counts_evaluation)

{'prompt_tokens': 422612, 'completion_tokens': 87897, 'total_tokens': 510509}


In [16]:
display(f"Costs for QA pair generation: {costs_qa_pairs}€. Total tokens used: {total_tokens_qa_pairs}")
display(f"Costs for ProductAI prompts: {costs_productai}€. Total tokens used: {total_tokens_productai}")
display(f"Costs for Evaluation: {costs_evaluation}€. Total tokens used: {total_tokens_evaluation}")
display(f"Total costs: {round(costs_qa_pairs + costs_productai + costs_evaluation, 4)}€")
display(f"Total tokens used: {total_tokens_qa_pairs + total_tokens_productai + total_tokens_evaluation}")

'Costs for QA pair generation: 8.8435€. Total tokens used: 1817332'

'Costs for ProductAI prompts: 1.2636€. Total tokens used: 100664'

'Costs for Evaluation: 3.208€. Total tokens used: 510509'

'Total costs: 13.3151€'

'Total tokens used: 2428505'
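
As a cross-check, the stage costs can be recomputed from the token totals printed above and the per-1000-token prices in `MODELS`. The sketch below is illustrative: `PRICES`, `STAGES`, `stage_cost`, and the use of `decimal.Decimal` are assumptions that do not appear in the notebook. It prices every stage at GPT-4o rates, as the cells above do, and then shows a what-if with GPT-4o-mini rates for QA pair generation and evaluation, the model the calculation notes list for those two steps.

```python
from decimal import Decimal

# Prices per 1000 tokens (prompt, completion), copied from the MODELS dictionary.
PRICES = {
    "MODEL_GPT4o": (Decimal("0.0047"), Decimal("0.0139")),
    "MODEL_GPT4o_mini": (Decimal("0.00014"), Decimal("0.0006")),
}

# (prompt_tokens, completion_tokens) as printed by the cells above.
STAGES = {
    "qa_pairs": (1784504, 32828),
    "productai": (14738, 85926),
    "evaluation": (422612, 87897),
}


def stage_cost(stage: str, model: str) -> Decimal:
    """Cost of one stage in €, rounded to four decimal places."""
    prompt_price, completion_price = PRICES[model]
    prompt_tokens, completion_tokens = STAGES[stage]
    cost = (prompt_tokens * prompt_price + completion_tokens * completion_price) / 1000
    return cost.quantize(Decimal("0.0001"))


# Every stage at GPT-4o rates, matching the calls to calculate_costs above.
gpt4o_costs = {stage: stage_cost(stage, "MODEL_GPT4o") for stage in STAGES}
total_gpt4o = sum(gpt4o_costs.values(), Decimal("0"))
print(f"All stages at GPT-4o rates: {total_gpt4o}€")

# What-if: GPT-4o-mini rates for QA pairs and evaluation, GPT-4o for ProductAI.
mixed_costs = {
    "qa_pairs": stage_cost("qa_pairs", "MODEL_GPT4o_mini"),
    "productai": stage_cost("productai", "MODEL_GPT4o"),
    "evaluation": stage_cost("evaluation", "MODEL_GPT4o_mini"),
}
total_mixed = sum(mixed_costs.values(), Decimal("0"))
print(f"Mixed GPT-4o / GPT-4o-mini rates: {total_mixed}€")
```

Decimal arithmetic also avoids the binary floating-point residue that can appear when rounded float costs are added together.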