## Install needed Libraries

In [1]:
!pip install pandas numpy matplotlib seaborn nltk textstat chromadb torch sentence-transformers hf_xet transformers accelerate langchain-community huggingface_hub



### Import needed Libraries

In [2]:
import pandas as pd
from langchain_community.embeddings import HuggingFaceEmbeddings
import matplotlib.pyplot as plt
import seaborn as sns
import unicodedata
import textstat
import re
import chromadb
import torch
import transformers
from huggingface_hub import notebook_login



## Prepare RAG System

The data for the RAG system was preprocessed and stored as embeddings in a vector database in a previous notebook, so this notebook does no further data processing. Here we build the first end-to-end iteration of the RAG system and define the evaluation technique. We do not run any evaluation in this notebook: the evaluation data still has to be prepared, so evaluation is the focus of next week's iteration. This week's goal is a fully working system that we can evaluate next week.

### Define embedding class with methods

Because the RAG context data is stored in the vector DB as embeddings, the user query must be converted to the same kind of embedding so that we can pull the related context from Chroma DB. The first part of the system is therefore an embedding function for user queries. It must be the same function that was used to store the documents in the vector DB, so we reuse the one from the previous notebook.

In [3]:
class Generate_Embeddings:
    def __init__(self):
        mps_available = hasattr(torch.backends, 'mps') and torch.backends.mps.is_available()
        print(f"MPS disponible: {mps_available}")

        cuda_available = torch.cuda.is_available()
        print(f"CUDA disponible: {cuda_available}")

        if cuda_available:
            device = torch.device("cuda")
            print(f"Using GPU NVIDIA: {torch.cuda.get_device_name(0)}")
            print(f"Total Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
        elif mps_available:
            device = torch.device("mps")
            print(f"Usando GPU Apple Silicon via MPS")
        else:
            device = torch.device("cpu")
            print("Usando CPU")

        print(f"Dispositivo activo: {device}")
        self.embedding_model = HuggingFaceEmbeddings(
            #model_name="sentence-transformers/all-MiniLM-L6-v2",  # Lighter option
            model_name="sentence-transformers/all-mpnet-base-v2",  # Option with 768 dimensions
            model_kwargs={'device': device},
            # This parameter normalizes each embedding to length 1
            encode_kwargs={'normalize_embeddings': True}
        )

    def generate_embedding_for_query(self,query):
        return self.embedding_model.embed_query(query)
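
The `normalize_embeddings` setting matters for retrieval: once every vector has length 1, the dot product of two embeddings equals their cosine similarity. The snippet below is a minimal sketch of that property using NumPy only. The helper names `l2_normalize` and `cosine_similarity` and the 3-dimensional toy vectors are assumptions made for illustration; they are not part of the notebook or of the embedding library.

```python
import numpy as np


def l2_normalize(vec):
    """Scale a vector to unit length, leaving an all-zero vector unchanged."""
    arr = np.asarray(vec, dtype=float)
    norm = np.linalg.norm(arr)
    if norm == 0.0:
        return arr  # avoid dividing by zero
    return arr / norm


def cosine_similarity(a, b):
    """Cosine similarity computed from the definition (no normalization assumed)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 3-dimensional vectors standing in for real 768-dimensional embeddings.
query_vec = l2_normalize([3.0, 4.0, 0.0])
doc_vec = l2_normalize([1.0, 2.0, 2.0])

dot = float(np.dot(query_vec, doc_vec))
print("norm of query vector:", np.linalg.norm(query_vec))
print("dot product:", dot)
print("cosine similarity:", cosine_similarity(query_vec, doc_vec))
```

For unit-length vectors the last two printed values agree up to floating-point error, which is why the query must be embedded with the same normalization that was used when the documents were stored.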

#### Test embedding function

In [4]:
generate_embeddings = Generate_Embeddings()
query_text = "What are the best practices for serverless applications?"
print(generate_embeddings.generate_embedding_for_query(query_text))

MPS disponible: True
CUDA disponible: False
Usando GPU Apple Silicon via MPS
Dispositivo activo: mps



[-0.03093838505446911, 0.03200007230043411, 0.05128869414329529, 0.014629404991865158, -0.05159478262066841, -0.005359634757041931, 0.008787281811237335, 0.040541768074035645, -0.010652217082679272, -0.0007046241080388427, 0.06030704826116562, -0.014727733097970486, 0.0049048056825995445, 0.0032119920942932367, 0.02013835310935974, 0.015705561265349388, 0.014108223840594292, 0.05449442192912102, -0.03879433125257492, 0.003764125518500805, 0.021612007170915604, -0.0265512652695179, 0.01489956583827734, -0.026848578825592995, 0.008851809427142143, -0.08655820041894913, -0.006769308354705572, 0.08772694319486618, -0.01011625211685896, -0.0049053337424993515, -0.09073223173618317, -0.05987256392836571, -0.06200317293405533, 0.044645316898822784, 1.0445569387229625e-06, -0.0014718231977894902, 0.004710671026259661, 0.007927129045128822, -0.023071158677339554, -0.00956744235008955, 0.04283610358834267, -0.014399752952158451, -0.017172420397400856, 0.028648700565099716, -0.00463675893843174, 

### Define class to connect to Chroma DB

Once the query is in embedding format, we need to pull the context from Chroma DB. To do that we connect to Chroma DB and query it with the default technique, which ranks results by similarity score; we specify how many results we want to retrieve.

#### Extract ZIP file of Chroma DB
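
To make the ranking step concrete, here is a small NumPy sketch of a top-k nearest-neighbour search. It assumes the collection uses squared L2 distance, which is Chroma's default when no other distance is configured; the configuration of the collection built in the previous notebook is not shown here, so treat that as an assumption. The function `top_k_by_distance` and the toy vectors are hypothetical, and a real vector database uses an approximate index rather than this brute-force scan.

```python
import numpy as np


def top_k_by_distance(query_vec, doc_vecs, k=5):
    """Return (index, squared L2 distance) pairs for the k closest documents."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    # Squared Euclidean distance between the query and every document vector.
    distances = np.sum((docs - query) ** 2, axis=1)
    # Smaller distance means more similar, so sort ascending and keep k entries.
    order = np.argsort(distances, kind="stable")[:k]
    return [(int(i), float(distances[i])) for i in order]


# Toy 2-dimensional unit vectors standing in for stored document embeddings.
toy_docs = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
]
toy_query = [0.8, 0.6]

ranked = top_k_by_distance(toy_query, toy_docs, k=2)
print(ranked)
```

For unit-length vectors, squared L2 distance equals `2 - 2 * cosine_similarity`, so ranking by ascending distance gives the same order as ranking by descending cosine similarity.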

In [5]:
import zipfile
import os

zip_path = "/content/chroma_db2.zip"  # Replace with the actual name of your zip file
extract_path = "/content/"  # Choose a directory to extract the contents to

try:
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_path)
    print(f"ChromaDB data unzipped successfully to: {extract_path}")
except FileNotFoundError:
    print(f"Error: Zip file not found at {zip_path}")
except Exception as e:
    print(f"Error unzipping: {e}")

# Now, you would initialize your ChromaDB PersistentClient pointing to the extracted directory
# client = chromadb.PersistentClient(path=extract_path)

Error: Zip file not found at /content/chroma_db2.zip


In [6]:
class Chroma_Connection:
    def __init__(self,collection_name,persist_directory = "./chroma_db2"):
        self.persist_directory = persist_directory
        self.collection_name = collection_name
        self.client = chromadb.PersistentClient(path=self.persist_directory)
        self.generate_embeddings = Generate_Embeddings()

    def query_chroma(self, query,n_documents=5):
        try:
            collection = self.client.get_collection(name=self.collection_name)
        except ValueError:
            print(
                f"Collection '{self.collection_name}' not found. Returning empty results."
            )
            return []
        embedded_query = self.generate_embeddings.generate_embedding_for_query(query)
        results = collection.query(
            query_embeddings=[embedded_query],
            n_results=n_documents,
            include=["documents", "metadatas", "distances"],  #  Get the text and metadata, and distance
        )
        return results


### Define function to do Retrieval


In [7]:
def retrival(query):
    collection_name = "C1_RAG_AWS_LENSES"
    context_retrival = Chroma_Connection(collection_name)
    context = context_retrival.query_chroma(query)
    return context

#### Test retrieval function

In [8]:
query_text = "I need to create a web api that will pull data from the source, what sevices should I use, for the financial services api using serveless application"
print(retrival(query_text))

MPS disponible: True
CUDA disponible: False
Usando GPU Apple Silicon via MPS
Dispositivo activo: mps
{'ids': [['Serverless Applications-The pillars of the Well-Architected Framework-Security pillar-Resources-NA-0-c3316ac2', 'Serverless Applications-Scenarios-Web application-NA-NA-0-071d48b5', 'Serverless Applications-Definitions-NA-NA-NA-0-ab4ca6af', 'Serverless Applications-Definitions-Edge layer-NA-NA-0-13267f50', 'Serverless Applications-Scenarios-RESTful microservices-NA-NA-0-1c8bf310']], 'embeddings': None, 'documents': [['resources - serverless applications lensresources - serverless applications lensdocumentationaws well-architectedaws well-architected frameworkdocumentation blogswhitepapersthird-party toolsresources refer following resources learn best practices security . documentation blogs lambda permissions api gateway request validation api gateway lambda authorizers building fine-grained authorization using cognito , api gateway , iam configuring vpc access lambda using s

This confirms that the documents are successfully stored in the DB and that we can pull the relevant documents based on user questions.
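
The query result is a dictionary of parallel nested lists (`ids`, `documents`, `metadatas`, `distances`), with one inner list per query. The sketch below flattens the first query's results into one record per document, which is easier to inspect and log. The helper names `_field` and `flatten_results` are hypothetical, and the `sample` dictionary is made-up data that only mimics the shape of the output above.

```python
def _field(results, name, query_index, size):
    """Return one query's values for a field, or placeholders if it was not included."""
    values = results.get(name)
    if values is None:
        return [None] * size
    return values[query_index]


def flatten_results(results, query_index=0):
    """Turn Chroma-style parallel lists into a list of per-document dictionaries."""
    ids = results["ids"][query_index]
    size = len(ids)
    documents = _field(results, "documents", query_index, size)
    metadatas = _field(results, "metadatas", query_index, size)
    distances = _field(results, "distances", query_index, size)
    return [
        {"id": i, "document": d, "metadata": m, "distance": s}
        for i, d, m, s in zip(ids, documents, metadatas, distances)
    ]


# Made-up data with the same shape as a query result for a single query.
sample = {
    "ids": [["chunk-a", "chunk-b"]],
    "embeddings": None,
    "documents": [["first chunk text", "second chunk text"]],
    "metadatas": [[{"source": "doc-1"}, {"source": "doc-2"}]],
    "distances": [[0.21, 0.47]],
}

records = flatten_results(sample)
for record in records:
    print(record["id"], record["distance"], record["document"][:40])
```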

## LLM Model


For the first iteration we use a widely known open-weight LLM that is approved for this specific use case as a baseline (the cells below load `mistralai/Mistral-7B-Instruct-v0.2`); later iterations will explore different models.

### Pull model from HuggingFace

In [15]:
#notebook_login()

In [10]:
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

pipeline = transformers.pipeline(
    task="text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the disk.
Device set to use mps


In [11]:
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

query_content = "I have the requirment to create a new application that will run serverless which services can I use that will guarantee a 5 second logic"
context_content = """[['optimizing time - serverless applications lensoptimizing time - serverless applications lensdocumentationaws well-architectedaws well-architected frameworkoptimizing time see well-architected framework whitepaper cost optimization best practices optimizing time section apply serverless applications . topicslambda cost performance optimizationlogging ingestion storageleverage vpc endpointsdynamodb on-demand provisioned capacityaws step functions express workflowsdirect integrationscode optimization document conventionsexpenditure usage awareness lambda cost performance optimizationdid page help ?', 'definitions - serverless applications lensdefinitions - serverless applications lensdocumentationaws well-architectedaws well-architected frameworkdefinitions well-architected framework based six pillars : operational excellence , security , reliability , performance efficiency , cost optimization , sustainability . serverless workloads , provides multiple core components ( serverless non-serverless ) allow design robust architectures serverless applications . section , present overview services used throughout document . eight areas consider building serverless workload : compute layerdata layermessaging streaming layeruser management identity layeredge layersystems monitoring deploymentdeployment approacheslambda version control', 'cost-effective resources - serverless applications lenscost-effective resources - serverless applications lensdocumentationaws well-architectedaws well-architected frameworkcost-effective resources cost 1 : optimize costs ? serverless architectures easier manage terms correct resource allocation . due pay-per-value pricing model scale based demand , serverless effectively reduces capacity planning effort . covered operational excellence performance pillars , optimizing serverless application direct impact value produces cost . 
lambda proportionally allocates cpu , network , storage iops based memory , faster initiation , cheaper value function produces due 1-ms billing incremental dimension .', 'selection - serverless applications lensselection - serverless applications lensdocumentationaws well-architectedaws well-architected frameworkselection per 1 : optimized performance serverless application ? run performance tests serverless application using steady burst rates . using result , try tuning capacity units provisioning model , load test changes help select best configuration : api gateway : use edge endpoints geographically dispersed customers . use regional regional customers using services within region . lambda : test different memory settings since cpu , network , storage iops allocated proportionally . optimize static initialization consider provisioned concurrency . step functions : test standard express workflows , consider per second rates execution start rate state transition rate . dynamodb : use on-demand unpredictable application traffic , otherwise provisioned mode consistent traffic . kinesis : use enhanced-fan-out dedicated input/output channels per consumer multiple consumer scenarios . use extended batch window low volume transactions lambda .', 'conclusion - serverless applications lensconclusion - serverless applications lensdocumentationaws well-architectedaws well-architected frameworkconclusion serverless applications take undifferentiated heavy-lifting developers , still important principles apply . reliability , regular testing failure paths provides better chance catching errors reach production . performance , starting backward customer expectations allow design optimal experience . number tools help optimize performance . cost optimization , reduce waste within serverless application right-sizing resources support traffic demands , improve value optimizing application . operations , architecture strive toward automation responding events . 
finally , secure application protect organization ’ sensitive information assets meet compliance requirements every layer . serverless landscape continues evolve growth maturation tooling , processes , adoption . continue update paper ensure resources knowledge needed build operate well-architected serverless systems .']]"""

prompt = f"""You are an expert Cloud architect specializing in AWS cloud solutions. Analyze the provided context and the architectural requirements to propose the best AWS-based solution, adhering to AWS Well-Architected Framework principles. Ensure your response uses only AWS services and does not rely on external knowledge beyond the provided context.

Context:
{context_content}

Architectural Requirements:
{query_content}

Provide your architectural decision in the following format:

1.  **Proposed AWS Architecture:**
2.  **Justification:**
3.  **AWS Services:**
4.  **AWS Only:**
5.  **No External Knowledge:**

If the context lacks sufficient information to make a confident decision, state: "Insufficient context to provide a confident architectural decision." and briefly explain what information is missing."""

messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Greedy decoding: temperature has no effect when do_sample=False, so it is omitted
output = pipeline(formatted_prompt, max_new_tokens=512, truncation=True, do_sample=False)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the disk.
Device set to use mps
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [12]:
print(pipeline("Plants create energy through a process known as",max_length=200,truncation=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[{'generated_text': 'Plants create energy through a process known as photosynthesis. During photosynthesis, plants absorb carbon dioxide from the air and water from the soil, and use sunlight to convert these substances into glucose, a type of sugar that provides energy for the plant. The glucose is then transported throughout the plant to provide energy for growth and other metabolic processes.\n\nPhotosynthesis occurs in the chloroplasts, which are specialized organelles found in the cells of green plants. Chloroplasts contain a pigment called chlorophyll, which is responsible for absorbing sunlight. The sunlight is used to generate ATP (adenosine triphosphate), which is the energy currency of the cell, and NADPH (nicotinamide adenine dinucleotide phosphate), which is used to reduce the carbon dioxide into glucose.\n\n'}]


In [13]:
def generate_answer(query):
    # Retrieve the top-ranked context documents for the query
    context_content = retrival(query)["documents"]
    print(context_content)
    # Mistral instruction format: the whole instruction sits between one [INST] and one [/INST]
    prompt = f"""[INST]You are an expert Cloud architect specializing in AWS cloud solutions. Analyze the provided context and the architectural requirements to propose the best AWS-based solution, adhering to AWS Well-Architected Framework principles. Ensure your response uses only AWS services and does not rely on external knowledge beyond the provided context.

Context:
{context_content}

Architectural Requirements:
{query}

Provide your architectural decision in the following format:

1.  **Proposed AWS Architecture:**
2.  **Justification:**
3.  **AWS Services:**
4.  **AWS Only:**
5.  **No External Knowledge:**

If the context lacks sufficient information to make a confident decision, state: "Insufficient context to provide a confident architectural decision." and briefly explain what information is missing.[/INST]"""
    print(prompt)
    # Greedy decoding (do_sample=False), so no temperature is passed
    return pipeline(prompt, max_new_tokens=500, truncation=True, do_sample=False, repetition_penalty=1.1)
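
The prompt in `generate_answer` interpolates the nested list returned by Chroma directly, so the model sees Python list syntax (brackets and quotes) around the context. One possible refinement, sketched below, is to join the retrieved chunks into labeled plain-text blocks before building the prompt. The helper `format_context`, its `max_chars_per_doc` parameter, and the `[Document N]` labels are assumptions for illustration and are not part of the notebook.

```python
def format_context(documents, max_chars_per_doc=1500):
    """Join retrieved chunks into labeled plain-text blocks for the prompt.

    Accepts either Chroma's nested shape ([[chunk, chunk, ...]]) or a flat list.
    """
    if documents and isinstance(documents[0], list):
        chunks = documents[0]  # results for the first (only) query
    else:
        chunks = documents
    parts = []
    for number, text in enumerate(chunks, start=1):
        # Trim very long chunks so the prompt stays within the model's context window.
        trimmed = text[:max_chars_per_doc].strip()
        parts.append(f"[Document {number}]\n{trimmed}")
    return "\n\n".join(parts)


# Made-up chunks with the same nesting as results["documents"].
example_documents = [["lambda scales with memory", "use express workflows for short tasks"]]
print(format_context(example_documents))
```

The returned string could then replace `{context_content}` in the prompt; whether this improves answers is something the planned evaluation would have to confirm.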

In [14]:
query = "I have the requirment to create a new application that will run serverless which services can I use that will guarantee a 5 second logic"
generate_answer(query)

MPS disponible: True
CUDA disponible: False
Usando GPU Apple Silicon via MPS
Dispositivo activo: mps


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[['optimizing time - serverless applications lensoptimizing time - serverless applications lensdocumentationaws well-architectedaws well-architected frameworkoptimizing time see well-architected framework whitepaper cost optimization best practices optimizing time section apply serverless applications . topicslambda cost performance optimizationlogging ingestion storageleverage vpc endpointsdynamodb on-demand provisioned capacityaws step functions express workflowsdirect integrationscode optimization document conventionsexpenditure usage awareness lambda cost performance optimizationdid page help ?', 'definitions - serverless applications lensdefinitions - serverless applications lensdocumentationaws well-architectedaws well-architected frameworkdefinitions well-architected framework based six pillars : operational excellence , security , reliability , performance efficiency , cost optimization , sustainability . serverless workloads , provides multiple core components ( serverless non

[{'generated_text': '[INST]You are an expert Cloud architect specializing in AWS cloud solutions. Analyze the provided context and the architectural requirements to propose the best AWS-based solution, adhering to AWS Well-Architected Framework principles. Ensure your response uses only AWS services and does not rely on external knowledge beyond the provided context.\n\nContext:\n[[\'optimizing time - serverless applications lensoptimizing time - serverless applications lensdocumentationaws well-architectedaws well-architected frameworkoptimizing time see well-architected framework whitepaper cost optimization best practices optimizing time section apply serverless applications . topicslambda cost performance optimizationlogging ingestion storageleverage vpc endpointsdynamodb on-demand provisioned capacityaws step functions express workflowsdirect integrationscode optimization document conventionsexpenditure usage awareness lambda cost performance optimizationdid page help ?\', \'defin
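
As the output above shows, `generated_text` starts with the full prompt, so the answer has to be separated from it before it can be displayed or scored. The sketch below does this with plain string handling. The helper `extract_answer` and the fake pipeline output are hypothetical. As an alternative, the Transformers text-generation pipeline accepts `return_full_text=False`, which should return only the newly generated text.

```python
def extract_answer(pipeline_output, prompt):
    """Return only the model's answer from a text-generation pipeline result."""
    generated = pipeline_output[0]["generated_text"]
    # Usual case: the pipeline echoes the prompt verbatim before the answer.
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    # Fallback: keep whatever follows the last closing instruction tag.
    marker = "[/INST]"
    if marker in generated:
        return generated.rsplit(marker, 1)[1].strip()
    return generated.strip()


# Fake output with the same shape as the pipeline result shown above.
example_prompt = "[INST]Which AWS service runs code without servers?[/INST]"
example_output = [{"generated_text": example_prompt + " AWS Lambda."}]
print(extract_answer(example_output, example_prompt))
```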

#### Next Steps

In the next phase of our project, we will leverage Vertex AI from Google to conduct a comprehensive evaluation of our RAG model. This approach will allow us to test different Large Language Models (LLMs) within the RAG framework, assessing their performance and suitability. The evaluation process will focus on the following key aspects:

*Retrieval Accuracy*: We will assess how effectively the model retrieves relevant documents in response to a given query, focusing on the quality of the retrieved information regardless of the LLM used.

*Generation Quality*: The quality of the text generated will be measured using metrics such as coherence, factual correctness, and relevance to the query. We will use human evaluations and automated scoring to determine this quality.

*Latency and Efficiency*: Since we will be using Vertex AI, we will measure the response time of the model for each query.

*Scalability and Flexibility*: Vertex AI allows us to easily switch between different LLMs. We will evaluate the ability of the system to integrate various models and assess their comparative performance.
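
As a possible starting point for the retrieval accuracy aspect, the sketch below computes two common retrieval metrics: hit rate at k (did any relevant document appear in the top k?) and mean reciprocal rank (how high was the first relevant document?). The choice of these two metrics, the function names, and the toy document ids are all assumptions; the evaluation data set has not been prepared yet, so none of the numbers are project results.

```python
def hit_rate_at_k(retrieved_ids, relevant_ids, k):
    """1.0 if any relevant id appears in the top-k retrieved ids, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in retrieved_ids[:k]) else 0.0


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(examples, k=5):
    """Average both metrics over (retrieved_ids, relevant_ids) pairs."""
    if not examples:
        return {"hit_rate_at_k": 0.0, "mrr": 0.0}
    hits = [hit_rate_at_k(retrieved, relevant, k) for retrieved, relevant in examples]
    ranks = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in examples]
    return {"hit_rate_at_k": sum(hits) / len(hits), "mrr": sum(ranks) / len(ranks)}


# Toy examples: each pair is (ids returned by the retriever, ids judged relevant).
toy_examples = [
    (["a", "b", "c"], {"b"}),
    (["d", "e", "f"], {"f"}),
    (["g", "h"], {"z"}),
]
scores = evaluate_retrieval(toy_examples, k=2)
print(scores)
```

Both metrics depend only on the retriever's output, so they can be computed once and reused across every LLM under comparison.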

This evaluation will provide insights into the model's strengths and weaknesses, guiding further improvements in future phases of the project.
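
For the latency aspect, a simple wall-clock measurement around each call is often enough for a first comparison. The sketch below times a callable over a list of queries and summarizes the results. The helpers `time_calls` and `summarize_latencies` and the stand-in function `fake_answer` are hypothetical; in the real system the callable would be something like `generate_answer`, and the nearest-rank percentile used here is one convention among several.

```python
import math
import statistics
import time


def time_calls(fn, inputs):
    """Call fn on each input and return the elapsed wall-clock seconds per call."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize_latencies(latencies):
    """Mean, median, nearest-rank 95th percentile, and maximum latency."""
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


def fake_answer(query):
    """Stand-in for the real RAG call; it only sleeps briefly."""
    time.sleep(0.01)
    return f"answer to: {query}"


measured = time_calls(fake_answer, ["query one", "query two", "query three"])
print(summarize_latencies(measured))
```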