# Model Evaluation

## Model hosting research

As mentioned in the previous progress report, our local machines lacked the compute power needed to run the models, so we explored a cloud solution. After some research we decided to use Vertex AI on GCP. Its managed endpoints let us test Mistral and Llama models without depending on the compute resources of our local machines.
With the managed endpoints we call the models running in GCP and pay only for the requests made. As new GCP users we received a 300 USD voucher, which should make the usage of the models free; if the voucher does not apply, the charge for the models is still extremely low for the POC purpose of the project. After a successful implementation of the RAG system, the company will need to evaluate the best way to host the models.

## Requirements for running the Notebook locally

Anyone trying to run the Notebook needs to provision a service account that has access to the Vertex AI platform as a user. To get this access, perform the following steps:

    1. Log in to GCP (if the user does not have an account, create a new one).
    2. Select the project where the Vertex AI instance will be used.
    3. Open the Service Accounts option in the left navigation menu, under the IAM section.
    4. Create a new service account.
    5. Once the account is created, go to the IAM section in the left navigation menu.
    6. Select the user/email of the newly created service account.
    7. Edit its access and grant the Vertex AI User permission to the service account.
    8. Go back to the Service Accounts page.
    9. Select the service account and go to the Keys tab.
    10. Click the button to create a new key and generate it in JSON format.
    11. Once the key is created, it is downloaded to the local machine.
    
Once the service account credentials are available locally, their path needs to be added to the environment variables; this is done later in the notebook.
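
As a rough sketch of that last step, the key file can be sanity-checked before `GOOGLE_APPLICATION_CREDENTIALS` is pointed at it. The helper names `check_service_account_key` and `use_service_account_key` are hypothetical, and the list of required fields is an assumed minimal set of what a downloaded service account key normally contains.

```python
import json
import os

# Assumed minimal set of fields found in a downloaded service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(path):
    """Return the project_id stored in the key file, or raise ValueError if it looks wrong."""
    with open(path, "r", encoding="utf-8") as handle:
        key_data = json.load(handle)
    missing = [field for field in REQUIRED_FIELDS if field not in key_data]
    if missing:
        raise ValueError(f"Key file is missing fields: {missing}")
    if key_data["type"] != "service_account":
        raise ValueError(f"Unexpected credential type: {key_data['type']}")
    return key_data["project_id"]


def use_service_account_key(path):
    """Validate the key file, then expose its path through the environment variable."""
    project_id = check_service_account_key(path)
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path
    return project_id
```

Calling `use_service_account_key("./credenciales_google.json")` before any request is made would surface a wrong or incomplete file early, instead of as an authentication error later on.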


## Install needed libraries

In [1]:
!pip install pandas numpy matplotlib seaborn nltk textstat chromadb torch sentence-transformers hf_xet transformers accelerate langchain-community huggingface_hub google-auth



In [2]:
import pandas as pd
from langchain.embeddings import HuggingFaceEmbeddings
import matplotlib.pyplot as plt
import seaborn as sns
import unicodedata
import textstat
import re
import chromadb
import torch
import transformers
from huggingface_hub import notebook_login
import requests
import os
import json
from google.auth.transport.requests import Request  # Transport used to refresh the access token
from google.auth import default  # Loads the Application Default Credentials

# Path to the service account key downloaded in the steps above
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./credenciales_google.json"

## Prepare Model

### Define the embedding class, reused from previous progress

In [3]:
class Generate_Embeddings:
    def __init__(self):
        mps_available = hasattr(torch.backends, 'mps') and torch.backends.mps.is_available()
        print(f"MPS disponible: {mps_available}")

        cuda_available = torch.cuda.is_available()
        print(f"CUDA disponible: {cuda_available}")

        if cuda_available:
            device = torch.device("cuda")
            print(f"Using GPU NVIDIA: {torch.cuda.get_device_name(0)}")
            print(f"Total Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
        elif mps_available:
            device = torch.device("mps")
            print(f"Usando GPU Apple Silicon via MPS")
        else:
            device = torch.device("cpu")
            print("Usando CPU")

        print(f"Dispositivo activo: {device}")
        self.embedding_model = HuggingFaceEmbeddings(
            #model_name="sentence-transformers/all-MiniLM-L6-v2",  # Lighter option
            model_name="sentence-transformers/all-mpnet-base-v2",  # Option with 768 dimensions
            model_kwargs={'device': device},
            # This parameter normalizes each embedding to unit length
            encode_kwargs={'normalize_embeddings': True}
        )

    def generate_embedding_for_query(self,query):
        return self.embedding_model.embed_query(query)

### Define class to connect to Chroma DB, reused from previous progress

In [4]:
class Chroma_Connection:
    def __init__(self,collection_name,persist_directory = "./chroma_db2"):
        self.persist_directory = persist_directory
        self.collection_name = collection_name
        self.client = chromadb.PersistentClient(path=self.persist_directory)
        self.generate_embeddings = Generate_Embeddings()

    def query_chroma(self, query,n_documents=5):
        try:
            collection = self.client.get_collection(name=self.collection_name)
        except ValueError:
            print(
                f"Collection '{self.collection_name}' not found.  Returning empty results."
            )
            return []
        embedded_query = self.generate_embeddings.generate_embedding_for_query(query)
        results = collection.query(
            query_embeddings=[embedded_query],
            n_results=n_documents,
            include=["documents", "metadatas", "distances"],  #  Get the text and metadata, and distance
        )
        return results

### Define function to do Retrieval, reused from previous progress

In [5]:
def retrival(query):
    collection_name = "C1_RAG_AWS_LENSES"
    context_retrival = Chroma_Connection(collection_name)
    context = context_retrival.query_chroma(query)
    return context

### Define initial prompt

Because the next iterations will test different prompts and different models, it is important to define an initial prompt for Mistral that serves as the baseline for evaluating the other combinations.

The first prompt we will evaluate with the Mistral model is the following:

    [INST]You are an expert Cloud architect specializing in AWS cloud solutions. Analyze the provided context and the architectural requirements to propose the best AWS-based solution, adhering to AWS Well-Architected Framework principles. Ensure your response uses only AWS services and does not rely on external knowledge beyond the provided context.
    Context:
        {context_content}
    Architectural Requirements:
        {query}
    Provide your architectural decision in the following format:

    1.  **Proposed AWS Architecture:**
    2.  **Justification:**
    3.  **AWS Services:**
    4.  **AWS Only:**

    If the context lacks sufficient information to make a confident decision, state: "Insufficient context to provide a confident architectural decision." and briefly explain what information is missing.

    [/INST]
    **BEGIN ASSISTANT RESPONSE:**
    Here's my architectural decision:
    [/INST]
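
The two placeholders in the prompt, `{context_content}` and `{query}`, are filled with Python's `str.format`. A minimal sketch of that step follows, using a shortened stand-in template and a hypothetical helper `build_prompt`; the sample context and query strings are illustrative only.

```python
# Shortened stand-in for the prompt above; only the two placeholders matter here.
TEMPLATE = (
    "[INST]You are an expert Cloud architect.\n"
    "Context:\n    {context_content}\n"
    "Architectural Requirements:\n    {query}\n"
    "[/INST]"
)


def build_prompt(template, context_content, query):
    """Fill the two named placeholders of the template using str.format."""
    return template.format(context_content=context_content, query=query)


prompt = build_prompt(
    TEMPLATE,
    context_content="(retrieved passages go here)",
    query="Which pattern gives low latency and immutable audit data?",
)
print(prompt)
```

One thing to keep in mind with this approach: any literal `{` or `}` added to the template itself must be doubled (`{{`, `}}`), while braces inside the substituted context or query are inserted as-is.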

### Define function to log in to GCP

In [6]:
def get_gcp_token():
    """Return [access_token, project_id] from the Application Default Credentials."""
    try:
        scopes = ['https://www.googleapis.com/auth/cloud-platform']  # Add all needed scopes
        creds, project_id = default(scopes=scopes)
        auth_req = Request()
        creds.refresh(auth_req)  # Fetch a fresh access token
        access_token = creds.token
        return [access_token, project_id]
    except Exception as e:
        print(f"Error obtaining credentials: {e}")
        # Return a pair so callers that unpack two values do not fail with a TypeError
        return [None, None]

### Define class to call the model from Vertex

In [7]:
class call_vertex_model:
    def __init__(self,model_api,model_name):
        self.token, self.project_id = get_gcp_token()
        self.model_name = model_name
        region = "us-central1"
        self.model_api = model_api.format(REGION=region,PROJECT_ID=self.project_id,MODEL_ID=self.model_name)

    def call_model(self,prompt):
        try:
            headers = {
                "Authorization": f"Bearer {self.token}",
                "Content-Type": "application/json",
                "Accept": "application/json"
            }
            payload = {
              "model": self.model_name,
              "messages": [
              {
                "role": "user",
                "content": [
                    {
                      "type": "text", "text": prompt
                    }
                  ]
                }
              ]
            }
            response = requests.post(url=self.model_api, headers=headers, json=payload)
            response.raise_for_status()  # Raise an exception for bad status codes
            response_dict = response.json()
            generated_text = response_dict["choices"][0]["message"]["content"]
            return generated_text
        except Exception as e:
            print(f"Error calling Vertex AI endpoint: {e}")
            return None
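
The parsing step in `call_model` assumes the response follows the shape `choices[0].message.content`. As a hedged sketch, a hypothetical helper `extract_generated_text` (not part of the notebook) could isolate that assumption and return `None` when the body has another shape, for example an error payload. The sample dictionaries below are illustrative, not real responses.

```python
def extract_generated_text(response_dict):
    """Return choices[0].message.content, or None when the response has another shape."""
    try:
        return response_dict["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        return None


# Illustrative payloads only, written in the shape the notebook code expects.
ok_response = {"choices": [{"message": {"role": "assistant", "content": "Hello"}}]}
error_response = {"error": {"message": "something went wrong"}}

print(extract_generated_text(ok_response))
print(extract_generated_text(error_response))
```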


### Define RAG system class, reusing some parts of the previous progress

In [8]:
class rag_model:
    def __init__(self,model_api,model_name,base_prompt):
        self.base_prompt = base_prompt
        self.call_vertex_model = call_vertex_model(model_api,model_name)

    def generate(self,query):
        context_content = (retrival(query))["documents"]
        prompt = self.base_prompt.format(context_content=context_content, query=query)
        generated_text = self.call_vertex_model.call_model(prompt)
        return generated_text
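
The `documents` entry returned by a Chroma query is a nested list (one inner list per query embedding), so formatting it directly into the prompt inserts its Python list representation. A minimal sketch of flattening it into plain text first is shown below; `format_context` and the separator are hypothetical choices, and `fake_results` is made-up data in the shape of a single-query result.

```python
def format_context(results, separator="\n\n---\n\n"):
    """Join the passages returned for the first (only) query into one text block."""
    if not results or not results.get("documents"):
        return ""  # Covers the empty result returned when the collection is missing
    passages = results["documents"][0]  # One inner list per query embedding
    return separator.join(passage.strip() for passage in passages)


# Made-up result in the shape of a single-query Chroma response.
fake_results = {
    "documents": [["First passage. ", "Second passage."]],
    "metadatas": [[{}, {}]],
    "distances": [[0.12, 0.34]],
}
print(format_context(fake_results))
```

In `generate`, the call would become `format_context(retrival(query))` in place of indexing `["documents"]` directly; whether this changes answer quality would need to be checked in the evaluation.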

### Create first full model

In [9]:
first_base_prompt = """
    [INST]You are an expert Cloud architect specializing in AWS cloud solutions. Analyze the provided context and the architectural requirements to propose the best AWS-based solution, adhering to AWS Well-Architected Framework principles. Ensure your response uses only AWS services and does not rely on external knowledge beyond the provided context.
    Context:
        {context_content}
    Architectural Requirements:
        {query}
    Provide your architectural decision in the following format:

    1.  **Proposed AWS Architecture:**
    2.  **Justification:**
    3.  **AWS Services:**
    4.  **AWS Only:**
    5.  **Context Justification**

    If the context lacks sufficient information to make a confident decision, state: "Insufficient context to provide a confident architectural decision." and briefly explain what information is missing.
    Do Not use anything outside the proided Context, only services mentioned in the provided context.
    [/INST]
    **BEGIN ASSISTANT RESPONSE:**
    Here's my architectural decision:
    [/INST]
"""
mistral_model_name = "mistral-small-2503"
mistral_model_api = "https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/publishers/mistralai/models/{MODEL_ID}:rawPredict"
mistral_first_prompt = rag_model(mistral_model_api,mistral_model_name,first_base_prompt)
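
To make the endpoint template concrete, the sketch below expands it the same way `call_vertex_model.__init__` does, with `str.format`. The helper `build_endpoint` and the project id `my-demo-project` are placeholders for illustration; the real project id comes from the service account credentials.

```python
ENDPOINT_TEMPLATE = (
    "https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}"
    "/locations/{REGION}/publishers/mistralai/models/{MODEL_ID}:rawPredict"
)


def build_endpoint(template, region, project_id, model_id):
    """Expand the REGION, PROJECT_ID and MODEL_ID placeholders of the template."""
    return template.format(REGION=region, PROJECT_ID=project_id, MODEL_ID=model_id)


# "my-demo-project" is a placeholder project id.
url = build_endpoint(ENDPOINT_TEMPLATE, "us-central1", "my-demo-project", "mistral-small-2503")
print(url)
```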

### Test the RAG system

In [11]:
query = "For a financial institution processing real-time trading data, we require sub-millisecond latency for transaction processing and strict data immutability for auditing purposes. Which architectural pattern ensures both high performance and non-repudiation?"
response = mistral_first_prompt.generate(query)
print(response)

MPS disponible: False
CUDA disponible: False
Usando CPU
Dispositivo activo: cpu
1. **Proposed AWS Architecture:**
    - **High-Performance, Low-Latency Trading System with Data Immutability:**
        - Utilize AWS Local Zones for sub-millisecond latency.
        - Implement Amazon Kinesis Data Streams for real-time data ingestion and processing.
        - Use Amazon EC2 instances with Enhanced Networking (ENA) and Elastic Network Adapter (ENA Express) for high-performance compute.
        - Deploy Amazon Managed Streaming for Apache Kafka (MSK) for durable and immutable data storage.
        - Leverage Amazon CloudFront for low-latency data delivery.
        - Implement Amazon S3 with versioning and object lock for immutable storage of trading data.
        - Use AWS Lambda for serverless processing of trading data.
        - Employ Amazon CloudWatch for monitoring and AWS CloudTrail for auditing.

2. **Justification:**
    - **Performance Efficiency:** AWS Local Zones and EC2 instanc

## Evaluation

### Test Data Set preparation

As mentioned in the previous progress report, we need a data set to evaluate the performance of the different models we will generate as combinations of prompt and LLM. Because the knowledge base of the RAG is public, we could ask a more powerful model to generate the questions and then have a human validate the correctness of the answers.
It is important to mention that in this initial POC the data is public and the architectural patterns being used are public patterns, which is why this is a valid option in this scenario. When moving to the internal architectural patterns, the questions and answers will need to be written manually by experts from the different domains.

For this initial project we decided to use Gemini to generate the base set of questions, using the following prompt:

*Can you Create 174 Questions about Architectural Decisions based on intent and Non FUnctional Requirments that the expected answer is and Architectural Pattern Decision based on aws only taking into consideration the following siz AWS Well Architected Lenses (Serverless Application, Financial Services Industry, Generative AI, Machine Learning, Migration and Analytics)*



*Give me the input in a format I can pass to pandas to use to evaluate an RAG Model I'm creating.*

*As well provide from Which lens should the RAG pull the context for the specific Question.*

The initial set of questions needed to be fixed because it did not include the justification. The justification is critical for the evaluation: the models might give different justifications that are all valid, so the judge model needs that extra context to judge properly whether an answer is correct.
We therefore ran the following prompt in the same chat context of the Gemini model.

*Can you fix the answers to include only the expected services and the justification of each service, no need to include the pattern name.*

With this final prompt we got the right questions. After a human review confirmed that the justifications and answers were correct, we were able to build our test data set.
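
Part of the human review can be supported by a small structural check on each record before it enters the test set. The sketch below assumes the record layout visible in the data frame further down (`question`, `expected_answer_pattern` as a list of `name`/`justification` pairs, and `relevant_lens`); the function `validate_record` and the sample record are hypothetical.

```python
REQUIRED_KEYS = {"question", "expected_answer_pattern", "relevant_lens"}


def validate_record(record):
    """Return a list of problems found in one test-set record (empty list means OK)."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    services = record["expected_answer_pattern"]
    if not isinstance(services, list) or not services:
        return ["expected_answer_pattern must be a non-empty list"]
    problems = []
    for index, service in enumerate(services):
        if not isinstance(service, dict):
            problems.append(f"service {index} is not an object")
            continue
        for key in ("name", "justification"):
            if not service.get(key):
                problems.append(f"service {index} has no {key}")
    return problems


# Made-up record, used only to show the expected layout.
sample_record = {
    "question": "Example architectural question?",
    "expected_answer_pattern": [
        {"name": "AWS Lambda", "justification": "Example justification."}
    ],
    "relevant_lens": "Serverless Application",
}
print(validate_record(sample_record))
```

A check like this only confirms that every expected service carries a justification; whether the justification is correct still depends on the human review.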

### Pull questions from the JSON into a Data Frame

In [16]:
questions_json_file = "./questions.json"
questions_df = pd.read_json(questions_json_file)
questions_df.head()

| | question | expected_answer_pattern | relevant_lens |
|---|---|---|---|
| 0 | A new e-commerce platform needs to handle high... | [{'name': 'AWS Lambda', 'justification': 'Runs... | Serverless Application |
| 1 | For a financial institution processing real-ti... | [{'name': 'Amazon Kinesis Data Streams', 'just... | Financial Services Industry |
| 2 | We are building an application for real-time t... | [{'name': 'Amazon API Gateway', 'justification... | Generative AI |
| 3 | An anomaly detection system for industrial IoT... | [{'name': 'Amazon Kinesis Data Streams', 'just... | Machine Learning |
| 4 | A legacy on-premises application with a monoli... | [{'name': 'AWS Application Migration Service (... | Migration |


The questions were imported from Gemini by copying the result into a JSON file, which was then loaded into the data frame. The cell above shows the first 5 questions with the relevant data, so we can evaluate both the RAG context retrieval and the full RAG system performance.

In [17]:
questions_df.shape

(174, 3)

The cell above shows that the data frame contains the 174 generated questions. They will be used to evaluate the different parts of the full model, so we can adjust the embedding model, the prompt and/or the LLM to get the best results.
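
As one possible starting point for the retrieval part of that evaluation, the `relevant_lens` column can be compared with the lens of each retrieved chunk. The sketch below is only an illustration under assumptions: the function `lens_hit_rate` is hypothetical, and so is the metadata key `lens`, since the metadata stored in the Chroma collection is not shown in this notebook.

```python
def lens_hit_rate(retrieved_metadatas, relevant_lens, lens_key="lens"):
    """Fraction of retrieved chunks whose metadata lens equals the expected lens."""
    if not retrieved_metadatas:
        return 0.0
    hits = sum(
        1 for meta in retrieved_metadatas
        if (meta or {}).get(lens_key) == relevant_lens  # Tolerate chunks without metadata
    )
    return hits / len(retrieved_metadatas)


# Made-up metadata for four retrieved chunks; the "lens" key is an assumption.
example_metadatas = [
    {"lens": "Migration"},
    {"lens": "Analytics"},
    {"lens": "Migration"},
    None,
]
print(lens_hit_rate(example_metadatas, "Migration"))
```

For a real question, `retrieved_metadatas` would be `results["metadatas"][0]` from `query_chroma`, and averaging the score over the 174 questions would give a single retrieval figure to compare embedding models.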