# Model Evaluation

## Model hosting research

As mentioned in the previous progress report, our local machines lacked the compute power needed to run the models, so we explored a cloud solution. After some research we decided to use Vertex AI on GCP. Its managed endpoints let us test Mistral and Llama models without worrying about local compute resources.
With the managed endpoints we call the models running in GCP and pay only for the requests made. As new GCP users we received a 300 USD credit, which should make model usage free for this project; if it does not apply, the per-request cost is still very low for the purposes of this POC. After a successful implementation of the RAG system, the company will need to evaluate the best way to host the models.

## Requirements for running Notebook Locally

Anyone who wants to run the notebook needs to provision a service account with user access to the Vertex AI platform. To get this access, perform the following steps:

    1. Log in to GCP (if the user does not have an account, create a new one)
    2. Select the project where the Vertex AI instance will be used
    3. Open the Service Accounts option in the left navigation menu under the IAM section
    4. Create a new service account
    5. Once the account is created, go to the IAM section in the left navigation menu
    6. Select the user/mail of the newly created service account
    7. Edit the access and grant the Vertex AI User permission to the service account
    8. Go back to the service account page
    9. Select the service account and go to the Keys tab
    10. Click the create new key button and generate a new key in JSON format
    11. Once the key is created it will be downloaded to the local machine

After downloading the service account credentials to the local machine, they need to be added to the environment variables; this is done later in the notebook.
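
Before the credentials are added to the environment variables, it can help to confirm that the downloaded file looks like a service account key. The sketch below is a hypothetical helper (`check_service_account_key` is not part of the original notebook); it only checks for fields that such a key file normally carries and never contacts GCP.

```python
import json
import os

# Fields normally present in a service account JSON key file.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(path):
    """Return a list of problems found in the key file (empty list: looks fine)."""
    if not os.path.isfile(path):
        return [f"file not found: {path}"]
    try:
        with open(path, "r", encoding="utf-8") as handle:
            key_data = json.load(handle)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(key_data, dict):
        return ["top-level JSON value is not an object"]

    problems = [
        f"missing field: {name}"
        for name in REQUIRED_KEY_FIELDS
        if name not in key_data
    ]
    # Only compare the type when the field exists, to avoid a duplicate report.
    if "type" in key_data and key_data["type"] != "service_account":
        problems.append("field 'type' is not 'service_account'")
    return problems
```

Calling `check_service_account_key("./credenciales_google.json")` before running the rest of the notebook should return an empty list when the key was downloaded as described in the steps above.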


## Install needed libraries

In [None]:
#!pip install pandas numpy matplotlib seaborn nltk textstat chromadb torch sentence-transformers hf_xet transformers accelerate langchain-community huggingface_hub google.auth



In [1]:
import pandas as pd
from langchain_community.embeddings import HuggingFaceEmbeddings
import matplotlib.pyplot as plt
import seaborn as sns
import unicodedata
import textstat
import re
import chromadb
import torch
import transformers
from huggingface_hub import notebook_login
import requests
import os
import json
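from google.auth import credentials  # Credentials base classes
from google.auth.transport.requests import Request  # Transport used to refresh the access token
from google.auth import default  # Loads Application Default Credentials
# Point Application Default Credentials at the downloaded service account key
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./credenciales_google.json"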

## Prepare Model

### Define the embedding class, reused from previous progress

In [2]:
class Generate_Embeddings:
    def __init__(self):
        mps_available = hasattr(torch.backends, 'mps') and torch.backends.mps.is_available()
        print(f"MPS disponible: {mps_available}")

        cuda_available = torch.cuda.is_available()
        print(f"CUDA disponible: {cuda_available}")

        if cuda_available:
            device = torch.device("cuda")
            print(f"Using GPU NVIDIA: {torch.cuda.get_device_name(0)}")
            print(f"Total Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
        elif mps_available:
            device = torch.device("mps")
            print(f"Usando GPU Apple Silicon via MPS")
        else:
            device = torch.device("cpu")
            print("Usando CPU")

        print(f"Dispositivo activo: {device}")
        self.embedding_model = HuggingFaceEmbeddings(
            #model_name="sentence-transformers/all-MiniLM-L6-v2",  # Lighter option
            model_name="sentence-transformers/all-mpnet-base-v2",  # Option with 768 dimensions
            model_kwargs={'device': device},
            # This parameter normalizes each embedding to length 1
            encode_kwargs={'normalize_embeddings': True}
        )

    def generate_embedding_for_query(self,query):
        return self.embedding_model.embed_query(query)
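
Because `normalize_embeddings` is set to `True`, every vector returned by the embedding model has length 1, so the dot product of two embeddings equals their cosine similarity. The snippet below illustrates this with made-up 3-dimensional vectors (the real model produces 768 dimensions); the helper names are hypothetical and not part of the notebook.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector):
    """Scale a vector to unit length, as normalize_embeddings does."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors standing in for a query embedding and a document embedding.
query_vec = [3.0, 4.0, 0.0]
doc_vec = [1.0, 2.0, 2.0]

# After normalization the plain dot product is already the cosine similarity.
similarity_from_unit_vectors = dot(l2_normalize(query_vec), l2_normalize(doc_vec))
```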

### Define class to connect to Chroma DB, reused from previous progress

In [3]:
class Chroma_Connection:
    def __init__(self,collection_name,persist_directory = "./chroma_db2"):
        self.persist_directory = persist_directory
        self.collection_name = collection_name
        self.client = chromadb.PersistentClient(path=self.persist_directory)
        self.generate_embeddings = Generate_Embeddings()

    def query_chroma(self, query,n_documents=5):
        try:
            collection = self.client.get_collection(name=self.collection_name)
        except ValueError:
            print(
                f"Collection '{self.collection_name}' not found.  Returning empty results."
            )
            return []
        embedded_query = self.generate_embeddings.generate_embedding_for_query(query)
        results = collection.query(
            query_embeddings=[embedded_query],
            n_results=n_documents,
            include=["documents", "metadatas", "distances"],  #  Get the text and metadata, and distance
        )
        return results
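
The dictionary returned by `collection.query` is nested: `documents`, `metadatas`, and `distances` each hold one inner list per query embedding. As a sketch of how the result could be unpacked into one record per retrieved chunk, the hypothetical helper below works on a plain dictionary with that shape. The sample metadata keys are invented for illustration.

```python
def flatten_query_results(results, query_index=0):
    """Turn the nested query output into a flat list of hits for one query."""
    # query_chroma returns [] when the collection is missing.
    if not results:
        return []
    documents = results["documents"][query_index]
    metadatas = results["metadatas"][query_index]
    distances = results["distances"][query_index]
    return [
        {"document": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(documents, metadatas, distances)
    ]


# Made-up result with the same nesting as a single-query Chroma response.
sample_results = {
    "documents": [["chunk about Lambda", "chunk about Kinesis"]],
    "metadatas": [[{"lens": "Serverless Application"}, {"lens": "Analytics"}]],
    "distances": [[0.21, 0.35]],
}
hits = flatten_query_results(sample_results)
```

Each element of `hits` keeps a chunk together with its metadata and distance, which makes it easier to inspect which lens a retrieved chunk came from.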

### Define function to do Retrieval, reused from previous progress

In [4]:
def retrival(query):
    collection_name = "C1_RAG_AWS_LENSES"
    context_retrival = Chroma_Connection(collection_name)
    context = context_retrival.query_chroma(query)
    return context

### Define initial prompt

Because the next iteration will test different prompts and different models, we define here the initial prompt for Mistral, which sets the baseline for evaluating the other prompt and model combinations.

The first prompt we evaluate with the Mistral model is the following:

    [INST]You are an expert Cloud architect specializing in AWS cloud solutions. Analyze the provided context and the architectural requirements to propose the best AWS-based solution, adhering to AWS Well-Architected Framework principles. Ensure your response uses only AWS services and does not rely on external knowledge beyond the provided context.
    Context:
        {context_content}
    Architectural Requirements:
        {query}
    Provide your architectural decision in the following format:

    1.  **Proposed AWS Architecture:**
    2.  **Justification:**
    3.  **AWS Services:**
    4.  **AWS Only:**

    If the context lacks sufficient information to make a confident decision, state: "Insufficient context to provide a confident architectural decision." and briefly explain what information is missing.

    [/INST]
    **BEGIN ASSISTANT RESPONSE:**
    Here's my architectural decision:
    [/INST]

### Define function to log in to GCP

In [5]:
def get_gcp_token():
    # Scope required to call Vertex AI; add any other needed scopes here
    SCOPES = ['https://www.googleapis.com/auth/cloud-platform']
    try:
        creds, project_id = default(scopes=SCOPES)
        # Refresh the credentials to obtain a short-lived access token
        creds.refresh(Request())
        return creds.token, project_id
    except Exception as e:
        print(f"Error obtaining credentials: {e}")
        # Return a pair so callers that unpack two values do not raise a TypeError
        return None, None

### Define class to call models on Vertex AI

In [6]:
class call_vertex_model:
    def __init__(self,model_api,model_name):
        self.token, self.project_id = get_gcp_token()
        self.model_name = model_name
        region = "us-central1"
        self.model_api = model_api.format(REGION=region,PROJECT_ID=self.project_id,MODEL_ID=self.model_name)

    def call_model(self,prompt):
        try:
            headers = {
                "Authorization": f"Bearer {self.token}",
                "Content-Type": "application/json",
                "Accept": "application/json"
            }
            payload = {
              "model": self.model_name,
              "messages": [
              {
                "role": "user",
                "content": [
                    {
                      "type": "text", "text": prompt
                    }
                  ]
                }
              ]
            }
            response = requests.post(url=self.model_api, headers=headers, json=payload)
            response.raise_for_status()  # Raise an exception for bad status codes
            response_dict = response.json()
            generated_text = response_dict["choices"][0]["message"]["content"]
            return generated_text
        except Exception as e:
            print(f"Error calling Vertex AI endpoint: {e}")
            return None
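
Since `call_model` returns `None` when a request fails, a long batch run can lose answers to transient errors. One possible way to soften this, shown here only as a sketch, is a small retry wrapper with exponential backoff. `call_with_retry` and its parameters are hypothetical and not part of the original notebook.

```python
import time


def call_with_retry(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call() until it returns a non-None value or the attempts run out.

    The wait doubles after every failed attempt: base_delay, 2*base_delay, ...
    The sleep function is injectable so the behaviour can be checked quickly.
    """
    for attempt in range(attempts):
        result = call()
        if result is not None:
            return result
        # Do not wait after the final failed attempt.
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None
```

With the class above, a call could be wrapped as `call_with_retry(lambda: model.call_model(prompt))`, where `model` is an instance of `call_vertex_model`.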


### Define RAG system class, reuse some parts of the previous progress

In [32]:
class rag_model:
    def __init__(self,model_api,model_name,base_prompt):
        self.base_prompt = base_prompt
        self.call_vertex_model = call_vertex_model(model_api,model_name)

    def generate(self,query):
        context_content = (retrival(query))["documents"]
        prompt = self.base_prompt.format(context_content=context_content, query=query)
        generated_text = self.call_vertex_model.call_model(prompt)
        
        return {
            "response": generated_text,
            "context": context_content
        }
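
In `generate`, the value of `context_content` is a list of lists (one inner list per query), so `str.format` inserts its Python representation, brackets and quotes included, into the prompt. A possible alternative, sketched below with a hypothetical `format_context` helper, is to join the retrieved chunks into numbered plain-text passages before filling the template.

```python
def format_context(documents):
    """Join retrieved chunks into numbered plain-text passages."""
    # Flatten the per-query inner lists and skip empty chunks.
    chunks = [chunk for per_query in documents for chunk in per_query if chunk]
    return "\n\n".join(
        f"[{position}] {chunk.strip()}"
        for position, chunk in enumerate(chunks, start=1)
    )


# Made-up retrieval output with the same nesting as the "documents" field.
sample_documents = [["Use AWS Lambda for event-driven compute. ", "Enable S3 versioning."]]
context_text = format_context(sample_documents)
```

Whether numbered passages actually improve the answers is something the evaluation would have to confirm; the sketch only changes how the context is rendered in the prompt.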


### Create first full model

In [33]:
first_base_prompt = """
    [INST]You are an expert Cloud architect specializing in AWS cloud solutions. Analyze the provided context and the architectural requirements to propose the best AWS-based solution, adhering to AWS Well-Architected Framework principles. Ensure your response uses only AWS services and does not rely on external knowledge beyond the provided context.
    Context:
        {context_content}
    Architectural Requirements:
        {query}
    Provide your architectural decision in the following format:

    1.  **Proposed AWS Architecture:**
    2.  **Justification:**
    3.  **AWS Services:**
    4.  **AWS Only:**
    5.  **Context Justification**

    If the context lacks sufficient information to make a confident decision, state: "Insufficient context to provide a confident architectural decision." and briefly explain what information is missing.
    Do Not use anything outside the proided Context, only services mentioned in the provided context.
    [/INST]
    **BEGIN ASSISTANT RESPONSE:**
    Here's my architectural decision:
    [/INST]
"""
mistral_model_name = "mistral-small-2503"
mistral_model_api = "https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/publishers/mistralai/models/{MODEL_ID}:rawPredict"
mistral_first_prompt = rag_model(mistral_model_api,mistral_model_name,first_base_prompt)

### Test the RAG system

In [37]:
query = "For a financial institution processing real-time trading data, we require sub-millisecond latency for transaction processing and strict data immutability for auditing purposes. Which architectural pattern ensures both high performance and non-repudiation?"
response = mistral_first_prompt.generate(query)
response_text = response['response']
response_context = response['context']
print("Response: ", response_text)
print("Context: ", response_context)

MPS disponible: False
CUDA disponible: True
Using GPU NVIDIA: NVIDIA GeForce RTX 3070 Ti
Total Memory: 8.59 GB
Dispositivo activo: cuda
Response:  **1. Proposed AWS Architecture:**

    - **High-Performance Compute Layer:** Utilize EC2 instances with Enhanced Networking (ENA Express) to ensure sub-millisecond latency for transaction processing.
    - **Data Storage Layer:** Use Amazon Kinesis Data Streams for real-time data ingestion and processing, ensuring data immutability and durability.
    - **Data Processing Layer:** Implement Amazon EMR with Apache Spark for real-time data processing and analytics.
    - **Data Storage and Auditing:** Store processed data in Amazon S3 with versioning enabled for immutability and auditing purposes.
    - **Networking:** Use Amazon VPC with dedicated Direct Connect for low-latency and secure connectivity.
    - **Monitoring and Scaling:** Use Amazon CloudWatch for monitoring performance metrics and AWS Auto Scaling to handle varying workloads.

*

## Evaluation

### Test Data Set preparation

As mentioned in the previous progress report, we need a data set to evaluate the performance of the different models we will generate as combinations of prompt and LLM. Because the source documents behind the base RAG are in the public domain, we could ask a more powerful model to generate the questions and expected answers, and then have a human validate their correctness.
For this initial POC the data and the architectural patterns are public, which is why this approach is valid in this scenario. Once internal architectural patterns are used, the questions and answers will need to be written manually by experts in the respective domains.

For this initial project we decided to use Gemini to generate the base set of questions. The following prompt was used to generate them.
Prompt:

*Can you Create 174 Questions about Architectural Decisions based on intent and Non FUnctional Requirments that the expected answer is and Architectural Pattern Decision based on aws only taking into consideration the following siz AWS Well Architected Lenses (Serverless Application, Financial Services Industry, Generative AI, Machine Learning, Migration and Analytics)*



*Give me the input in a format I can pass to pandas to use to evaluate an RAG Model I'm creating.*

*As well provide from Which lens should the RAG pull the context for the specific Question.*

The initial set of questions needed to be fixed because the answers did not include a justification. The justification is critical for the evaluation: different models may give different but equally valid justifications, so the judge model needs that extra context to decide properly whether an answer is correct.
We therefore ran the following prompt in the same chat context of the Gemini model.

*Can you fix the answers to include only the expected services and the justification of each service, no need to include the pattern name.*

With this final prompt we obtained the right questions. After a human review confirmed that the answers and justifications were correct, we built our test data set.

### Pull questions from the Json into a Data Frame

In [None]:
questions_json_file = "./questions.json"
questions_df =  pd.read_json(questions_json_file)
questions_df.head()

| | question | expected_answer_pattern | relevant_lens |
|---|---|---|---|
| 0 | A new e-commerce platform needs to handle high... | [{'name': 'AWS Lambda', 'justification': 'Runs... | Serverless Application |
| 1 | For a financial institution processing real-ti... | [{'name': 'Amazon Kinesis Data Streams', 'just... | Financial Services Industry |
| 2 | We are building an application for real-time t... | [{'name': 'Amazon API Gateway', 'justification... | Generative AI |
| 3 | An anomaly detection system for industrial IoT... | [{'name': 'Amazon Kinesis Data Streams', 'just... | Machine Learning |
| 4 | A legacy on-premises application with a monoli... | [{'name': 'AWS Application Migration Service (... | Migration |


The questions were copied from Gemini's output into a JSON file, which was then loaded into the data frame. The cell above shows the first 5 questions with the relevant data, so we can evaluate both the RAG context retrieval and the full RAG system performance.
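
Since the JSON file was assembled by hand from the chat output, a quick structural check can catch records that lost a field along the way. The sketch below is a hypothetical validator, not part of the notebook. The lens names are taken from the generation prompt and the table above, and the exact spelling of the "Analytics" label in the file is an assumption.

```python
EXPECTED_KEYS = ("question", "expected_answer_pattern", "relevant_lens")

# Lens labels as they appear in the prompt; "Analytics" spelling is assumed.
KNOWN_LENSES = {
    "Serverless Application",
    "Financial Services Industry",
    "Generative AI",
    "Machine Learning",
    "Migration",
    "Analytics",
}


def validate_question(record):
    """Return a list of structural problems for one question record."""
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in record]
    if problems:
        return problems
    if record["relevant_lens"] not in KNOWN_LENSES:
        problems.append(f"unknown lens: {record['relevant_lens']}")
    services = record["expected_answer_pattern"]
    if not isinstance(services, list) or not services:
        problems.append("expected_answer_pattern must be a non-empty list")
        return problems
    for position, service in enumerate(services):
        # Every expected service needs both a name and a justification.
        if not isinstance(service, dict) or not {"name", "justification"} <= set(service):
            problems.append(f"service {position} lacks name or justification")
    return problems
```

Running it over `questions_df.to_dict("records")` would give one list of problems per question, with empty lists for well-formed records.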

In [12]:
questions_df.shape

(174, 3)

In the cell above we can see that the data frame contains the 174 generated questions that will be used to evaluate the different parts of the full model, so we can adjust the embedding model, the prompt, and/or the LLM to get the best results.

## Judge Model

Now that we have a validated set of questions and expected answers, we define our judge model, which will evaluate the outputs of our base model (Mistral) using LLaMA 3.3 70B deployed on Vertex AI.

This model acts as an LLM-as-a-judge, a widely adopted technique for evaluating the quality of LLM-generated outputs in the absence of deterministic ground truth. It compares each model-generated response against the expected answer and rates it across three key dimensions:

**Technical Accuracy:** Does the answer use correct and contextually appropriate AWS services?

**Clarity:** Is the response well-structured, concise, and easy to follow?

**Completeness:** Does it address all relevant aspects of the architectural requirement in sufficient detail, without introducing unrelated elements?

The evaluation prompt enforces constraints aligned with the AWS Well-Architected Framework and penalizes the use of out-of-context services, deviation from required structure, or lack of configuration detail. The output is returned in a structured JSON format, enabling automated analysis and scoring across the dataset.

In [44]:
class judge_model:
    def __init__(self,model_api,model_name):
        self.call_vertex_model = call_vertex_model(model_api,model_name)
    
    def build_judge_prompt(self, question, mistral_response, expected_answer, context):
        return f"""
            You are an expert in cloud architecture for financial institutions. Your task is to evaluate a response generated by a language model, given a context, a question (architectural requirement), and an expected answer.

            The model was instructed to propose AWS-based solutions that:
            - Use **only services found in the provided context**
            - Follow the AWS Well-Architected Framework
            - Use the following structure: 
                1. Proposed AWS Architecture
                2. Justification
                3. AWS Services
                4. AWS Only
                5. Context Justification
            - If the context is insufficient, the model must reply: "Insufficient context to provide a confident architectural decision" and explain what's missing.

            Below is the evaluation task:

            Context:
            {context}

            Architectural Requirement (Question):
            {question}

            Model-generated answer:
            {mistral_response}

            Expected answer:
            {expected_answer}

            Evaluate the model-generated answer using these dimensions (scale 1–5):

            - Technical Accuracy
            - Clarity
            - Completeness

            **Important Evaluation Rules**:
            - Penalize any use of services not found in the context.
            - Penalize if the required structure is not followed.
            - Only reward completeness if the answer includes all key components **explicitly relevant to the question**, and provides sufficient detail on how they are used or configured. Including off-topic services or omitting key configuration details should reduce the completeness score.
            - Do not reward unnecessary or off-topic information.
            - Accuracy should reflect alignment with the Well-Architected Framework and proper AWS service usage.
            - Clarity should reflect whether the response is concise, readable, and logically structured.

            Provide your evaluation in the following JSON format:

            {{
            "accuracy": <1 to 5>,
            "clarity": <1 to 5>,
            "completeness": <1 to 5>,
            "justification": {{
                "accuracy": "Your justification here.",
                "clarity": "Your justification here.",
                "completeness": "Your justification here."
            }}
            }}
        """.strip()

    def generate(self, question, base_llm_response, expected_answer, context):
        self.base_prompt = self.build_judge_prompt(question, base_llm_response, expected_answer, context)
        generated_text = self.call_vertex_model.call_model(self.base_prompt)
        return generated_text

In this step, we instantiate our judge model using Meta’s LLaMA 3.3 70B Instruct, served through Vertex AI’s OpenAI-compatible chat completions endpoint.

The judge_model_name specifies the model to be used for evaluation, while judge_model_api defines the full endpoint for accessing the Vertex AI chat interface. We then initialize the judge_model class with these parameters, enabling us to perform structured evaluation of the base model's responses using the custom prompt logic defined earlier.

This setup allows our system to decouple the evaluation logic from the underlying model infrastructure, making it modular and easily switchable across different judge LLMs or deployment backends (e.g., Gemini, GPT, Claude).

In [45]:
judge_model_name = "meta/llama-3.3-70b-instruct-maas"
judge_model_api = "https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi/chat/completions"
judge_first_prompt = judge_model(judge_model_api,judge_model_name)

We perform a manual test of the evaluation process by selecting the first question from the dataset. The response is generated using the base model (Mistral) along with its context, and evaluated by the judge model (LLaMA 3.3 70B) against the expected answer. This step is useful to inspect the evaluation logic before applying it across the full dataset.

In [46]:
test_question = questions_df['question'].iloc[0]
test_expected_answer = questions_df['expected_answer_pattern'].iloc[0]
test_mistral_response = mistral_first_prompt.generate(test_question)
test_mistral_response_text = test_mistral_response['response']
test_mistral_response_context = test_mistral_response['context']

# Pass only the generated text to the judge, not the whole response dict
response = judge_first_prompt.generate(test_question, test_mistral_response_text, test_expected_answer, test_mistral_response_context)
print(response)

MPS disponible: False
CUDA disponible: True
Using GPU NVIDIA: NVIDIA GeForce RTX 3070 Ti
Total Memory: 8.59 GB
Dispositivo activo: cuda
Here is the evaluation of the model-generated answer:

```
{
    "accuracy": 4,
    "clarity": 4,
    "completeness": 3,
    "justification": {
        "accuracy": "The model-generated answer is mostly accurate, as it suggests using AWS Lambda for serverless compute and Amazon EC2 Auto Scaling Groups for predictable workloads, which aligns with the Well-Architected Framework. However, it includes some unnecessary information and services not directly relevant to the question, such as AWS Glue and Redshift.",
        "clarity": "The response is generally clear and well-structured, but it is lengthy and includes some off-topic information, which reduces its overall clarity. The use of headings and bullet points helps to organize the content, but some sections are not directly relevant to the question.",
        "completeness": "The answer is partially co

This function extracts and parses the JSON evaluation returned by the judge model. It looks for a JSON block enclosed in triple backticks (```) and converts it into a Python dictionary. If no valid JSON is found or parsing fails, it returns an empty dictionary and prints the error. This step ensures structured access to the evaluation scores and justifications for further processing.

In [None]:
def extract_judge_data(judge_text):
    try:
        # Allow an optional "json" language tag after the opening fence
        match = re.search(r"```(?:json)?\s*([\s\S]*?)```", judge_text)
        if match:
            json_str = match.group(1).strip()
            return json.loads(json_str)
        else:
            print("No JSON block found between triple backticks.")
            return {}
    except json.JSONDecodeError as e:
        print(f"JSON decode error: {e}")
        return {}
    except Exception as e:
        print(f"Unexpected parsing error: {e}")
        return {}
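
Judge models do not always wrap their JSON in triple backticks. As a possible fallback, sketched here under that assumption, the standard library's `json.JSONDecoder.raw_decode` can pull the first JSON object out of free text. `find_first_json_object` is a hypothetical helper and not part of the original notebook.

```python
import json


def find_first_json_object(text):
    """Return the first JSON object embedded in text, or {} if none parses."""
    if not isinstance(text, str):
        return {}
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at position start.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Try again from the next opening brace.
        start = text.find("{", start + 1)
    return {}
```

It could be called when `extract_judge_data` returns an empty dictionary, so that an unfenced but otherwise valid evaluation is not scored as zero.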

This function determines whether a model response is considered acceptable based on evaluation scores. It checks if the average score meets a minimum threshold (default: 4.0) and that each individual score (accuracy, clarity, and completeness) is at least a given minimum (default: 3). The result is a Boolean value indicating whether the response passes the defined quality criteria.

In [50]:
def is_acceptable(row, threshold=4.0, min_each=3):
    scores = [row['accuracy_score'], row['clarity_score'], row['completeness_score']]
    return row['avg_score'] >= threshold and all(score >= min_each for score in scores)
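
As a worked example of the rule, the scores from the manual test above (accuracy 4, clarity 4, completeness 3) average 3.67, which is below the 4.0 threshold, so that response would not be accepted even though every individual score is at least 3. The snippet repeats `is_acceptable` so it runs on its own; `make_row` is a hypothetical helper that builds a plain dictionary in place of a data frame row.

```python
def is_acceptable(row, threshold=4.0, min_each=3):
    # Same rule as the notebook cell, repeated so the example is self-contained.
    scores = [row['accuracy_score'], row['clarity_score'], row['completeness_score']]
    return row['avg_score'] >= threshold and all(score >= min_each for score in scores)


def make_row(accuracy, clarity, completeness):
    """Build a dict with the same keys the evaluation data frame uses."""
    return {
        'accuracy_score': accuracy,
        'clarity_score': clarity,
        'completeness_score': completeness,
        'avg_score': (accuracy + clarity + completeness) / 3.0,
    }


manual_test = make_row(4, 4, 3)      # average 3.67: fails the average threshold
one_weak_score = make_row(5, 5, 2)   # average 4.0: fails the per-score minimum
all_fours = make_row(4, 4, 4)        # average 4.0: passes both checks
```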

In [52]:
eval_df = questions_df.copy()

Next, this block performs a batch evaluation loop over the dataset of questions using a base model (Mistral) and a judge model (LLaMA 3.3 70B). For each row in the dataset:

1.- The base model generates a response and context based on the input question.

2.- The judge model evaluates this response against the expected answer using a structured scoring prompt.

3.- The evaluation result is parsed to extract scores (accuracy, clarity, completeness) and their justifications.

4.- An average score is computed, and a Boolean flag (is_correct) is assigned based on predefined quality thresholds.

This process populates eval_df with all relevant outputs and metrics, enabling downstream analysis of model performance at both individual and aggregate levels.

In [None]:
from tqdm import tqdm

eval_df['base_model_response'] = ""
eval_df['judge_model_eval'] = ""
eval_df['accuracy_score'] = 0
eval_df['clarity_score'] = 0
eval_df['completeness_score'] = 0
eval_df['accuracy_just'] = ""
eval_df['clarity_just'] = ""
eval_df['completeness_just'] = ""
eval_df['avg_score'] = 0.0
eval_df['is_correct'] = False

for idx, row in tqdm(eval_df.iterrows(), total=eval_df.shape[0]):
    question = row['question']
    expected = row['expected_answer_pattern']

    base_model_output = mistral_first_prompt.generate(question)
    eval_df.at[idx, 'base_model_response'] = base_model_output['response']

    evaluation_text = judge_first_prompt.generate(question, base_model_output['response'], expected, base_model_output['context'])
    eval_df.at[idx, 'judge_model_eval'] = evaluation_text

    eval_dict = extract_judge_data(evaluation_text)

    eval_df.at[idx, 'accuracy_score'] = eval_dict.get('accuracy', 0)
    eval_df.at[idx, 'clarity_score'] = eval_dict.get('clarity', 0)
    eval_df.at[idx, 'completeness_score'] = eval_dict.get('completeness', 0)
    eval_df.at[idx, 'accuracy_just'] = eval_dict.get('justification', {}).get('accuracy', '')
    eval_df.at[idx, 'clarity_just'] = eval_dict.get('justification', {}).get('clarity', '')
    eval_df.at[idx, 'completeness_just'] = eval_dict.get('justification', {}).get('completeness', '')

    avg = (eval_dict.get('accuracy', 0) +
           eval_dict.get('clarity', 0) +
           eval_dict.get('completeness', 0)) / 3.0
    eval_df.at[idx, 'avg_score'] = avg

    eval_df.at[idx, 'is_correct'] = is_acceptable(eval_df.loc[idx])

This snippet calculates the model accuracy over a subset of the dataset (rows 0 to 51). It counts how many responses were marked as is_correct = True and divides that by the total number of samples in the slice. The result is printed as a percentage, reflecting how well the model performed within this evaluation window.

In [74]:
total = len(eval_df.iloc[0:52])
correct = eval_df.iloc[0:52]['is_correct'].sum()
model_accuracy = correct / total

print(f"Model Accuracy over {total} questions: {model_accuracy:.2%}")

Model Accuracy over 52 questions: 59.62%


The model achieved an accuracy of 59.62% over the evaluated subset of 52 questions, meaning that 31 of the 52 generated responses met the predefined quality criteria across technical accuracy, clarity, and completeness. While this indicates a baseline level of competence, there is still significant room for improvement, particularly in generating more focused, precise, and context-aligned answers. The results suggest that the current base model (Mistral) can serve as a starting point, but further refinement, prompt engineering, or model selection may be necessary to reach production-level reliability.
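
A single overall figure hides differences between lenses. As one possible next step, the sketch below groups the `is_correct` flag by `relevant_lens`. `accuracy_by_lens` is a hypothetical helper, and the sample rows are made up and do not reflect the actual results.

```python
from collections import defaultdict


def accuracy_by_lens(rows):
    """Return {lens: (correct, total, accuracy)} for a list of evaluated rows."""
    counts = defaultdict(lambda: [0, 0])  # lens -> [correct, total]
    for row in rows:
        bucket = counts[row["relevant_lens"]]
        bucket[0] += 1 if row["is_correct"] else 0
        bucket[1] += 1
    return {
        lens: (correct, total, correct / total)
        for lens, (correct, total) in counts.items()
    }


# Made-up rows with the two columns the helper needs.
sample_rows = [
    {"relevant_lens": "Migration", "is_correct": True},
    {"relevant_lens": "Migration", "is_correct": False},
    {"relevant_lens": "Analytics", "is_correct": True},
]
per_lens = accuracy_by_lens(sample_rows)
```

For the notebook's data, the rows could be obtained with `eval_df.iloc[0:52].to_dict("records")`.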

In [75]:
try:
    eval_df.to_excel('subset_0_to_52.xlsx', sheet_name='Evaluation', index=False)
    print("✅ File saved successfully as 'subset_0_to_52.xlsx'")
except Exception as e:
    print(f"❌ Error saving file: {e}")

✅ File saved successfully as 'subset_0_to_52.xlsx'
