## Install needed Libraries

### Install Libraries from pip

In [3]:
!pip install langchain langchain-community pandas numpy matplotlib seaborn nltk textstat chromadb torch sentence-transformers hf_xet



### Import needed Libraries

In [5]:
import pandas as pd
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from langchain.embeddings import HuggingFaceEmbeddings
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
# Download the needed nltk corpus 
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
import unicodedata
import textstat
import re
import chromadb
import torch

USER_AGENT environment variable not set, consider setting it to identify your requests.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\soyel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\soyel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\soyel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Convert Excel Spreadsheet to pandas Data Frame

In [9]:
# Read the Excel file containing the list of URLs with architectural pattern and metadata.
url_df = pd.read_excel("./URLs.xlsx", sheet_name="Sheet1")
# Show shape of DataFrame
print("Shape: ",url_df.shape)
# Show the Format of the Data Frame
url_df.head()

Shape:  (628, 6)


Unnamed: 0,URL,1st Level,2nd Level,3rd Level,4th Level,Lens
0,https://docs.aws.amazon.com/wellarchitected/la...,Abstract and Introducción,,,,Serverless Applications
1,https://docs.aws.amazon.com/wellarchitected/la...,Definitions,,,,Serverless Applications
2,https://docs.aws.amazon.com/wellarchitected/la...,Definitions,Compute Layers,,,Serverless Applications
3,https://docs.aws.amazon.com/wellarchitected/la...,Definitions,Data Layer,,,Serverless Applications
4,https://docs.aws.amazon.com/wellarchitected/la...,Definitions,Messaging and streaming layer,,,Serverless Applications


## Read each link and store the Data in correct Format

### Create Function to add Level to metadata

In [12]:
# Add a hierarchy level to the metadata only if the row has a value for it
def createMetadataLevel(level,url_line,metadata):
    # Check that the level is not empty (NaN)
    if(not pd.isna(url_line[level])):
        # The level has a value, so add it to the metadata
        metadata[level]=url_line[level]
    # Return the modified metadata.
    return metadata

### Create function to load the URL with the extra metadata.

In [16]:
def loadURLWithMetaData(url_line):
    # Define the loader that fetches the page, using the LangChain library.
    loader = WebBaseLoader(
        # URL to read and load.
        url_line["URL"],
    )
    # Fetch the page; the loader returns its text as a list of Document objects.
    docs = loader.load()
    # Define the metadata to attach to the documents read from this page
    metadata = {
        "Lens": url_line["Lens"],
        "1st Level": url_line["1st Level"]
    }
    # Add all levels of metadata, validating the level exists.
    metadata = createMetadataLevel("2nd Level",url_line,metadata)
    metadata = createMetadataLevel("3rd Level",url_line,metadata)
    metadata = createMetadataLevel("4th Level",url_line,metadata)

    for doc in docs:
        doc.metadata.update(metadata)

    return docs
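
The metadata logic can be tried without fetching any page. The following is a minimal sketch, assuming a plain dictionary stands in for a DataFrame row; `is_missing` and `build_metadata` are hypothetical helpers for illustration and are not part of the notebook.

```python
import math

OPTIONAL_LEVELS = ("2nd Level", "3rd Level", "4th Level")

def is_missing(value):
    # Stand-in for pd.isna on scalars: treats None and float NaN as missing.
    return value is None or (isinstance(value, float) and math.isnan(value))

def build_metadata(row):
    # "Lens" and "1st Level" are always copied; deeper levels only when present.
    metadata = {"Lens": row["Lens"], "1st Level": row["1st Level"]}
    for level in OPTIONAL_LEVELS:
        if not is_missing(row.get(level)):
            metadata[level] = row[level]
    return metadata

# Made-up row mirroring the spreadsheet columns shown earlier.
sample_row = {
    "Lens": "Serverless Applications",
    "1st Level": "Definitions",
    "2nd Level": "Data Layer",
    "3rd Level": float("nan"),
    "4th Level": float("nan"),
}
sample_metadata = build_metadata(sample_row)
print(sample_metadata)
```

Skipping empty levels keeps the metadata compact, which can make later filtering by level simpler.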

### Cycle through all URLs in the list and load them

In [19]:
# Define a variable to store all the content extracted from the URLs, with its metadata.
all_docs = []
# Cycle through all URLs to load them as text and add the desired metadata.
for index, row in url_df.iterrows():
    # Read the content of the URL and add it to the list with its metadata, in the proper format for later processing.
    all_docs.extend(loadURLWithMetaData(row))
print("This is a sample of the content extracted from the URLs", all_docs[0])

This is a sample of the content extracted from the URLs page_content='
Serverless Applications Lens - AWS Well-Architected Framework - Serverless Applications LensServerless Applications Lens - AWS Well-Architected Framework - Serverless Applications LensDocumentationAWS Well-ArchitectedAWS Well-Architected FrameworkIntroductionCustom lens availabilityServerless Applications Lens - AWS Well-Architected FrameworkPublication date: July 14, 2022 (Document revisions)
    This document describes the Serverless Applications Lens for
    the AWS
      Well-Architected Framework. The document covers common
    serverless applications scenarios and identifies key elements to
    ensure that your workloads are architected according to best
    practices.
  
Introduction

      The AWS Well-Architected Framework helps you understand the pros and
      cons of decisions you make while building systems on AWS. By using
      the Framework, you will learn architectural best practices for
      desi

**Preprocessing for RAG System:**

Having structured and explored our data, the next crucial step is to prepare it for our Retrieval-Augmented Generation (RAG) system. This involves applying targeted preprocessing techniques informed by the findings of our Exploratory Data Analysis (EDA). The goal of these actions is to refine the text, making it more suitable for generating high-quality embeddings and ultimately enhancing the performance of our RAG model. Following these preprocessing steps, we will proceed to create the embeddings and load them into a vector database.

The following specific actions have been identified and will be performed in this notebook before the embedding and loading phase, which will occur at the conclusion of this notebook:

1.  **Apply Text Normalization and Clean Unusual Characters:** Standardize the text by applying Unicode normalization, collapsing extra whitespace, and converting everything to lowercase.
2.  **Handle the Word "data" and Other Repetitive Words:** Because "data" is very frequent in this corpus, we will initially treat it as a stopword. If this does not help retrieval, we can later re-evaluate how to treat this scenario.
3.  **Handle Repetitive Phrases:** Identify and process or remove frequently recurring phrases that may not contribute significant semantic value.

## Normalize text and clean Unusual Characters

### Define function to Normalize text

In [23]:
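def normalize_text(text, include_line_breaks = True):
    # NFKC folds compatibility characters (ligatures, non-breaking spaces, full-width forms)
    normalized = unicodedata.normalize('NFKC', text)
    if include_line_breaks:
        # Collapse runs of spaces and tabs but keep line breaks
        normalized = re.sub(r"[ \t]+", " ", normalized)
    else:
        # Collapse all whitespace, including line breaks, into single spaces
        normalized = re.sub(r"\s+", " ", normalized)
    return normalized.lower().strip()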
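
To see what the two whitespace modes do, here is a small standalone sketch that applies the same steps to a made-up string (the sample text is an assumption chosen to contain a non-breaking space, a ligature, a tab and a blank line).

```python
import re
import unicodedata

# Made-up sample: non-breaking space, "fi" ligature, tab, blank line, extra spaces.
sample = "AWS\u00a0Lambda  \ufb01le\tcache\n\nNext   line"

# NFKC turns the non-breaking space into a regular space and the ligature into "fi".
nfkc = unicodedata.normalize("NFKC", sample)

# Mode 1: collapse spaces and tabs only, so line breaks survive.
with_breaks = re.sub(r"[ \t]+", " ", nfkc).lower().strip()

# Mode 2: collapse every kind of whitespace, so the text becomes a single line.
flattened = re.sub(r"\s+", " ", nfkc).lower().strip()

print(repr(with_breaks))
print(repr(flattened))
```

Keeping the line breaks is useful when a later step wants to split the text on `\n`; flattening is simpler when only word order matters.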

### Test text normalization

In [27]:
text = all_docs[0].page_content
normalize_text(text)
# normalize_text(text, include_line_breaks = False)

"serverless applications lens - aws well-architected framework - serverless applications lensserverless applications lens - aws well-architected framework - serverless applications lensdocumentationaws well-architectedaws well-architected frameworkintroductioncustom lens availabilityserverless applications lens - aws well-architected frameworkpublication date: july 14, 2022 (document revisions)\n this document describes the serverless applications lens for\n the aws\n well-architected framework. the document covers common\n serverless applications scenarios and identifies key elements to\n ensure that your workloads are architected according to best\n practices.\n \nintroduction\n\n the aws well-architected framework helps you understand the pros and\n cons of decisions you make while building systems on aws. by using\n the framework, you will learn architectural best practices for\n designing and operating reliable, secure, efficient, and\n cost-effective systems in the cloud. it prov

## Handle repetitive Words

### Declare Function to remove repetitive words

In [31]:
def remove_repetitive_words(text,words_to_remove):
    # Tokenize, drop every token whose lowercase form is in words_to_remove, and rejoin with spaces
    word_tokens = word_tokenize(text)
    filtered_sentence = [w for w in word_tokens if w.lower() not in words_to_remove]
    return " ".join(filtered_sentence)
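
Because the tokens are rejoined with single spaces, this approach drops line breaks and puts spaces around punctuation (the pipeline output further down shows `date : july 14 , 2022`). If the original layout matters, one possible alternative is a regex-based removal. This is a minimal sketch, not the notebook's method; `remove_words_keep_layout` is a hypothetical helper.

```python
import re

def remove_words_keep_layout(text, words_to_remove):
    # Nothing to remove: return the text untouched.
    if not words_to_remove:
        return text
    # Match any listed word as a whole word, plus the spaces or tabs right after it.
    alternatives = "|".join(re.escape(w) for w in words_to_remove)
    pattern = re.compile(rf"\b(?:{alternatives})\b[ \t]*", re.IGNORECASE)
    return pattern.sub("", text)

example = "The data layer stores data.\nData changes trigger events."
cleaned = remove_words_keep_layout(example, ["data"])
print(cleaned)
```

The line break in `example` is preserved, which matters if a later step splits the text on `\n`.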
    

### Test Removal of repetitive words

In [34]:
stop_words = ["data"]
print("Original Text \n",all_docs[3].page_content,"\n")
print("Text without data word\n",remove_repetitive_words(all_docs[3].page_content,stop_words))

Original Text 
 
Data layer - Serverless Applications LensData layer - Serverless Applications LensDocumentationAWS Well-ArchitectedAWS Well-Architected FrameworkData layer The data layer of your workload manages persistent storage from within a system. It
      provides a secure mechanism to store the states that your business logic will need. It
      provides a mechanism to trigger events in response to data changes. 
Amazon DynamoDB helps you build serverless
      applications by providing a managed NoSQL database for persistent storage. Combined
        with DynamoDB Streams, you can respond in near
      real-time to changes in your DynamoDB table by
      invoking Lambda functions. DynamoDB Accelerator (DAX) adds a highly available
      in-memory cache for DynamoDB that delivers up to
      10x performance improvement from milliseconds to microseconds.  With Amazon Simple Storage Service (Amazon S3), you can build serverless
      web applications and websites by providing a h

## Handle Repetitive Phrases

### Identify Repetitive phrases

#### Define function to get repetitive phrases

In [39]:
def identify_repetitive_phrases(text, repetition_ngram=6):
    # Build the stopword set once instead of reloading the list for every token
    english_stopwords = set(stopwords.words('english'))
    tokens = [word.lower() for word in word_tokenize(text) if word.isalpha()]
    tokens_clean = [t for t in tokens if t not in english_stopwords]

    if len(tokens_clean) > 10:
        # Slide a window of repetition_ngram tokens over the text and count each phrase
        ngrams = [' '.join(tokens_clean[i:i+repetition_ngram]) for i in range(len(tokens_clean) - repetition_ngram + 1)]
        ngram_counts = Counter(ngrams)
        # Keep only the phrases that occur more than 20 times
        repeated_phrases = {k: v for k, v in ngram_counts.items() if v > 20}
        if repeated_phrases:
            print(f"Repeated phrases: {list(repeated_phrases.keys())}")
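
The sliding-window counting can be illustrated on a tiny example. The sketch below uses a hypothetical helper, `repeated_ngrams`, that returns the counts instead of printing them; the token list and the thresholds are made up for illustration.

```python
from collections import Counter

def repeated_ngrams(tokens, n=6, min_count=2):
    # Every window of n consecutive tokens becomes one candidate phrase.
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    # Keep only phrases seen at least min_count times.
    return {phrase: count for phrase, count in counts.items() if count >= min_count}

tokens = "did this page help you yes did this page help you no".split()
repeats = repeated_ngrams(tokens, n=3)
print(repeats)
```

Overlapping windows mean one long repeated sentence shows up as a chain of n-grams that each shift by one word, which is the pattern visible in the notebook output.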

##### Get repetitive phrases

In [42]:
all_text = [doc.page_content for doc in all_docs]
combined_text = " ".join(all_text)
identify_repetitive_phrases(combined_text)

Repeated phrases: ['javascript disabled unavailable use amazon web', 'disabled unavailable use amazon web services', 'unavailable use amazon web services documentation', 'use amazon web services documentation javascript', 'amazon web services documentation javascript must', 'web services documentation javascript must enabled', 'services documentation javascript must enabled please', 'documentation javascript must enabled please refer', 'javascript must enabled please refer browser', 'must enabled please refer browser help', 'enabled please refer browser help pages', 'page help yesthanks letting us know', 'help yesthanks letting us know good', 'yesthanks letting us know good job', 'letting us know good job got', 'us know good job got moment', 'know good job got moment please', 'good job got moment please tell', 'job got moment please tell us', 'got moment please tell us right', 'moment please tell us right page', 'please tell us right page help', 'tell us right page help nothanks', 'us 

Looking at the specific repeated phrases, we can place them all in one general category:

1. Navigation/UI elements

Phrases in this category need to be removed so they do not interfere with the content: they are user-interface text and carry no valuable information.

The full text of the block is shown below; the pipeline has to strip it in addition to normalizing the text:
To determine if a custom lens is available for the lens described in this whitepaper,
reach out to your Technical Account Manager (TAM), Solutions Architect (SA), or Support.
Javascript is disabled or is unavailable in your browser.To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.Document ***** this page help you? - YesThanks for letting us know we're doing a good job!If you've got a moment, please tell us what we did right so we can do more of it.Did this page help you? - NoThanks for letting us know this page needs work. We're sorry we let you down.If you've got a moment, please tell us how we can make the documentation better.

We will remove this whole section from every lens. A closer review showed that it was being pulled in with each page and adds little context to the system.

### Define Function to handle repetitive phrase

In [46]:
def remove_repetitive_phrases(rep_phrases,text):
    # Each phrase is treated as a regular expression and removed case-insensitively, in list order
    for phrase in rep_phrases:
        text = re.sub(phrase, "", text, flags=re.IGNORECASE)
    return text
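
Since every phrase is passed to `re.sub` as a pattern, characters such as `.` act as wildcards unless they are escaped. One possible refinement, sketched here under the assumption that literal phrases and real patterns are kept in separate lists, is to escape the literal ones with `re.escape`. `strip_boilerplate` and the sample page text are hypothetical.

```python
import re

def strip_boilerplate(text, literal_phrases, regex_patterns=()):
    # Literal phrases are escaped so "." and "?" only match themselves.
    for phrase in literal_phrases:
        text = re.sub(re.escape(phrase), "", text, flags=re.IGNORECASE)
    # Genuine patterns are applied as written.
    for pattern in regex_patterns:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return text

# Made-up page text; "placeholder" stands in for whatever precedes the question.
page = "Useful content. We're sorry we let you down. Document placeholder Did this page help you?"
stripped = strip_boilerplate(
    page,
    literal_phrases=["We're sorry we let you down."],
    regex_patterns=[r"Document (.+?) this page help you\?"],
).strip()
print(stripped)

# An unescaped dot matches any character; an escaped dot matches only a dot.
unescaped = re.sub("v1.0", "", "v1x0 and v1.0")
escaped = re.sub(re.escape("v1.0"), "", "v1x0 and v1.0")
```

For the boilerplate sentences in this notebook the wildcard behaviour is unlikely to remove real content, so the escaping mainly guards against surprises with shorter phrases.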

### Test Repetitive phrase handler

In [49]:
rep_phrases = ["Javascript is disabled or is unavailable in your browser.To use the Amazon Web Services Documentation, Javascript must be enabled.",
               "Please refer to your browser's Help pages for instructions.",
               r"Document (.+?) this page help you\?",
               "YesThanks for letting us know we're doing a good job!",
               "- If you've got a moment, please tell us what we did right so we can do more of it.",
               r"Did this page help you\? - NoThanks for letting us know this page needs work.",
               "We're sorry we let you down.",
               "If you've got a moment, please tell us how we can make the documentation better."
              ]
text = all_docs[2].page_content
print(remove_repetitive_phrases(rep_phrases,text))


Compute layer - Serverless Applications LensCompute layer - Serverless Applications LensDocumentationAWS Well-ArchitectedAWS Well-Architected FrameworkCompute layer The compute layer of your workload manages requests from external systems, controlling
      access and verifying that requests are appropriately authorized. Your business logic will be
      deployed and started by the runtime environment that it contains. 
AWS Lambda lets you run stateless serverless
      applications on a managed platform that supports microservice architectures, deployment, and
      management of execution at the function layer.  With Amazon API Gateway, you can run a fully
      managed REST API that integrates with Lambda to
      apply your business logic, and includes traffic management, authorization and access control,
      monitoring, and API versioning. 
AWS Step Functions orchestrates serverless workflows including
      coordination, state, and function chaining as well as combining long-r

We can see that the repetitive phrases at the end were successfully removed, and we now have only the content that adds value to the context.

## Create Langchain data processing pipeline

In [53]:
def processing_pipeline(urls_df):
    # Load every URL in the DataFrame passed as argument, together with its metadata
    documents = []
    for index, url_line in urls_df.iterrows():
        documents.extend(loadURLWithMetaData(url_line))

    # Boilerplate phrases (regular expressions) stripped from every page, applied in order
    rep_phrases = [
        "Javascript is disabled or is unavailable in your browser.To use the Amazon Web Services Documentation, Javascript must be enabled.",
        "Please refer to your browser's Help pages for instructions.",
        r"Document (.+?) this page help you\?",
        "YesThanks for letting us know we're doing a good job!",
        "- If you've got a moment, please tell us what we did right so we can do more of it.",
        r"Did this page help you\? - NoThanks for letting us know this page needs work.",
        "We're sorry we let you down.",
        "If you've got a moment, please tell us how we can make the documentation better."
    ]
    # English stopwords plus corpus-specific words that are too frequent to be informative
    stop_words = set(stopwords.words('english'))
    stop_words.update(["aws", "amazon", "us", "data"])

    processed_documents = []
    for doc in documents:
        # 1) strip boilerplate, 2) normalize, 3) drop stopwords
        text_no_repetitive_phrases = remove_repetitive_phrases(rep_phrases,doc.page_content)
        normalized_text = normalize_text(text_no_repetitive_phrases)
        cleaned_text = remove_repetitive_words(normalized_text, stop_words)
        processed_documents.append(
            Document(page_content=cleaned_text, metadata=doc.metadata)
        )

    return processed_documents
        

### Run LangChain pipeline with the data

In [56]:
prepro_docs = processing_pipeline(url_df) 
print(prepro_docs[0])

page_content='serverless applications lens - well-architected framework - serverless applications lensserverless applications lens - well-architected framework - serverless applications lensdocumentationaws well-architectedaws well-architected frameworkintroductioncustom lens availabilityserverless applications lens - well-architected frameworkpublication date : july 14 , 2022 ( document revisions ) document describes serverless applications lens well-architected framework . document covers common serverless applications scenarios identifies key elements ensure workloads architected according best practices . introduction well-architected framework helps understand pros cons decisions make building systems . using framework , learn architectural best practices designing operating reliable , secure , efficient , cost-effective systems cloud . provides way consistently measure architectures best practices identify areas improvement . believe well-architected systems greatly increases lik

## Run Vector Data Base

We selected Chroma because it is well suited to local development with small datasets and to initial testing, which aligns with the scope of this project. Chroma runs locally and can be installed with pip; the following cells initialize and run it.

### Initialize Chroma DB

In [61]:
# Specify the path to store the database
persist_directory = "./chroma_db2"

# Initialize the persistent client
client = chromadb.PersistentClient(path=persist_directory)

#### Create Collection

##### Define Collection Name

In [65]:
# Define collection name
collection_name = "C1_RAG_AWS_LENSES"

##### Define function to validate collection exists

In [68]:
def validate_collection(collection_name,client):
    return collection_name in [c.name for c in client.list_collections()]


##### Validate if collection exists, if it does not exist create it.

In [71]:
# flag stays True when the collection already exists, so its records are updated instead of added
flag=True
if(not validate_collection(collection_name,client)):
    print("entered")
    collection = client.create_collection(name=collection_name)
    flag=False
else:
    collection = client.get_collection(name=collection_name)

entered


#### Load or Update Data to Chroma Db

##### Define Function to Load to Chroma DB

In [75]:
def load_to_chroma(chunks,ids,embeddings,collection):
    chunks_text = [chunk.page_content for chunk in chunks]
    chunks_metadata = [chunk.metadata for chunk in chunks]

    collection.add(
        ids=ids,
        documents=chunks_text,
        metadatas=chunks_metadata,
        embeddings=embeddings,  # Pass the embeddings here
    )
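
For larger collections it may be preferable to add the records in several smaller calls rather than one. The following is a minimal sketch of such batching; `iter_batches` is a hypothetical helper, the batch size of 100 is an arbitrary assumption, and the data below is made up.

```python
def iter_batches(ids, documents, metadatas, embeddings, batch_size=100):
    # All four lists must describe the same records, position by position.
    if not (len(ids) == len(documents) == len(metadatas) == len(embeddings)):
        raise ValueError("ids, documents, metadatas and embeddings must have the same length")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield ids[start:end], documents[start:end], metadatas[start:end], embeddings[start:end]

# Made-up records standing in for real chunks and embeddings.
demo_ids = [f"id-{i}" for i in range(5)]
demo_docs = [f"text {i}" for i in range(5)]
demo_meta = [{"Lens": "demo"} for _ in range(5)]
demo_emb = [[float(i), 0.0] for i in range(5)]

batches = list(iter_batches(demo_ids, demo_docs, demo_meta, demo_emb, batch_size=2))
batch_sizes = [len(batch[0]) for batch in batches]
print(batch_sizes)

# In the notebook, each batch could then be passed to the collection, for example:
# for b_ids, b_docs, b_meta, b_emb in iter_batches(ids, texts, metas, embs):
#     collection.add(ids=b_ids, documents=b_docs, metadatas=b_meta, embeddings=b_emb)
```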
    

##### Define Function to Update Chroma DB

In [78]:
def update_chroma(chunks,ids,embeddings,collection):
    chunks_text = [chunk.page_content for chunk in chunks]
    chunks_metadata = [chunk.metadata for chunk in chunks]

    collection.update(
        ids=ids,
        documents=chunks_text,
        metadatas=chunks_metadata,
        embeddings=embeddings,  # Pass the embeddings here
    )

##### Define function to chunk content

In [81]:
from typing import List
import hashlib

def chunk_documents(prepro_docs: List[Document], chunk_size: int = 500, overlap: int = 100, method: str = 'paragraph') -> List[Document]:
    chunks = []

    for doc in prepro_docs:
        text = doc.page_content
        metadata = doc.metadata

        # ID prefix built from the lens and the hierarchy levels ('NA' when a level is missing)
        lens = metadata.get('Lens', '')
        levels = [metadata.get(f'{n} Level', 'NA') for n in ('1st', '2nd', '3rd', '4th')]
        id_prefix = "-".join(str(part) for part in [lens, *levels])

        def make_chunk(units, chunk_index):
            # Join the units and derive a deterministic ID from position and content hash
            chunk_text = " ".join(units)
            digest = hashlib.sha256(chunk_text.encode()).hexdigest()[:8]
            chunk_id = f"{id_prefix}-{chunk_index}-{digest}"
            return Document(page_content=chunk_text, metadata={"id": chunk_id, **metadata})

        # Split the text into units: sentences or non-empty lines
        if method == 'sentence':
            units = nltk.sent_tokenize(text)
        elif method == 'paragraph':
            units = [p for p in text.split('\n') if p.strip()]
        else:
            units = []

        chunk = []
        chunk_index = 0
        for unit in units:
            chunk.append(unit)
            # chunk_size is measured in whitespace-separated words
            if len(" ".join(chunk).split()) >= chunk_size:
                chunks.append(make_chunk(chunk, chunk_index))
                # NOTE: overlap counts units (sentences or paragraphs), not words,
                # so a large value can carry over the entire previous chunk
                chunk = chunk[-overlap:] if overlap > 0 else []
                chunk_index += 1

        # Emit whatever is left as the final chunk
        if chunk:
            chunks.append(make_chunk(chunk, chunk_index))

    return chunks
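
In `chunk_documents`, the slice `chunk[-overlap:]` is applied to the list of sentences or paragraphs, so `overlap` counts units rather than words. If a word-based overlap is wanted instead, a sliding window over the words is one option. This is a minimal sketch under that assumption, not the notebook's method; `chunk_words` is a hypothetical helper.

```python
def chunk_words(text, chunk_size=500, overlap=100):
    # The window must advance, so overlap has to be smaller than chunk_size.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks

# Tiny demonstration: ten "words", windows of four, one word shared between neighbours.
demo = chunk_words(" ".join(str(i) for i in range(10)), chunk_size=4, overlap=1)
print(demo)
```

With this scheme each chunk repeats exactly `overlap` words from the end of the previous one, which keeps chunk sizes predictable.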

##### Get chunks from all preprocessed documents

In [84]:
chunks = chunk_documents(prepro_docs, chunk_size=500, overlap=100)
# chunks = chunk_documents(prepro_docs, chunk_size=500, overlap=100, method='sentence')
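
The `id` stored in each chunk's metadata is later used as the record ID in the vector database, where duplicate IDs are likely to cause problems. A quick uniqueness check before loading can help. The sketch below uses a hypothetical helper, `find_duplicate_ids`, and made-up IDs; in the notebook it could be called with `[chunk.metadata["id"] for chunk in chunks]`.

```python
from collections import Counter

def find_duplicate_ids(ids):
    # Count every ID and report the ones that appear more than once.
    counts = Counter(ids)
    return sorted(chunk_id for chunk_id, count in counts.items() if count > 1)

# Made-up IDs for illustration only.
sample_ids = ["lens-a-0-1a2b3c4d", "lens-a-1-5e6f7a8b", "lens-a-0-1a2b3c4d"]
duplicates = find_duplicate_ids(sample_ids)
print(duplicates)
```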

In [86]:
print(f"Length of Chunks: {len(chunks)}")
print(f"Example of chunk content: {chunks[300]}")

Length of Chunks: 701
Example of chunk content: page_content='genrel04-bp02 implement model catalog - well-architectedgenrel04-bp02 implement model catalog - well-architecteddocumentationaws well-architectedaws well-architected frameworkimplementation guidanceresourcesgenrel04-bp02 implement model catalog model catalogs store manage model versions . act reliable store models may need deployed rolled back time . also facilitate decoupled deployment automation . desired outcome : implemented , best practice improves reliability generative ai workload helping make sure deployed model appropriate model given use case . benefits establishing best practice : manage change automation - implementing model catalog helps automate process deploying rolling back model versions . level risk exposed best practice established : low implementation guidance model catalogs provide centralized location review models , model version , model cards . traditionally , model catalogs meant store model artifact

In [88]:
mps_available = hasattr(torch.backends, 'mps') and torch.backends.mps.is_available()
print(f"MPS available: {mps_available}")

cuda_available = torch.cuda.is_available()
print(f"CUDA available: {cuda_available}")

if cuda_available:
    device = torch.device("cuda")
    print(f"Using NVIDIA GPU: {torch.cuda.get_device_name(0)}")
    print(f"Total Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
elif mps_available:
    device = torch.device("mps")
    print("Using Apple Silicon GPU via MPS")
else:
    device = torch.device("cpu")
    print("Using CPU")

print(f"Active device: {device}")

MPS available: False
CUDA available: False
Using CPU
Active device: cpu


The all-mpnet-base-v2 model has shown strong performance on multiple benchmarks (Semantic Textual Similarity, clustering, QA retrieval) at capturing nuances of meaning in short sentences and paragraphs. This translates into more discriminative vectors that are more useful when querying the vector database.

With 768 dimensions, it offers a good trade-off between representational capacity and computational cost. More dimensions usually bring better separation in the vector space, but also a larger index; 768 is widely adopted as a standard for RAG applications and semantic search systems.
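
The effect of dimensionality on storage can be estimated with simple arithmetic. The sketch below assumes each value is stored as a 4-byte float and ignores any index or metadata overhead; the count of 10,000 vectors is a made-up figure, and `raw_embedding_bytes` is a hypothetical helper.

```python
def raw_embedding_bytes(num_vectors, dimensions, bytes_per_value=4):
    # Raw size only: vectors x dimensions x bytes per value (float32 assumed).
    return num_vectors * dimensions * bytes_per_value

# all-mpnet-base-v2 produces 768-dimensional vectors, all-MiniLM-L6-v2 produces 384.
for name, dims in [("all-mpnet-base-v2", 768), ("all-MiniLM-L6-v2", 384)]:
    size_mb = raw_embedding_bytes(10_000, dims) / 1e6
    print(f"{name}: {size_mb:.1f} MB for 10,000 vectors")
```

Halving the dimensions halves the raw storage, which is the trade-off behind keeping the lighter model as a commented-out option.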

In [91]:
embedding_model = HuggingFaceEmbeddings(
    #model_name="sentence-transformers/all-MiniLM-L6-v2",  # Lighter option
    model_name="sentence-transformers/all-mpnet-base-v2", # 768-dimensional option
    model_kwargs={'device': device},
    # This parameter normalizes each embedding to unit length
    encode_kwargs={'normalize_embeddings': True}
)


print("Generating embeddings for chunks...")
chunk_texts = [chunk.page_content for chunk in chunks]
embeddings = embedding_model.embed_documents(chunk_texts)


ids = [chunk.metadata["id"] for chunk in chunks]

print(f"Generated {len(embeddings)} embeddings")
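
Setting `normalize_embeddings` to `True` scales every vector to unit length. One consequence is that the dot product of two such vectors equals their cosine similarity. The sketch below checks this on two made-up 2-dimensional vectors using plain Python; the helper names are hypothetical and non-zero vectors are assumed.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    # Divide by the Euclidean norm so the result has length 1 (v must be non-zero).
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

cosine_raw = cosine(a, b)          # cosine similarity of the original vectors
dot_unit = dot(a_unit, b_unit)     # plain dot product of the normalized vectors
print(cosine_raw, dot_unit)
```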


Generating embeddings for chunks...
Generated 701 embeddings


#### Execute the right function

In [94]:
if flag:
    update_chroma(chunks,ids,embeddings,collection)
else:
    load_to_chroma(chunks,ids,embeddings,collection)

## Data Preparation Phase (CRISP-ML)

1. **Ingestion and consolidation**  
   The master list of URLs was imported from an Excel file into a single pandas DataFrame, and the raw content of each web page was retrieved. This provided a single source of truth for all documents.

2. **Cleaning and normalization**  
   Redundant text patterns (repetitive headers and footers) were removed and formatting inconsistencies were corrected (lowercasing, removal of special characters and "loading…" sequences), reducing noise and improving the quality of the corpus.

3. **Structural enrichment**  
   Custom functions added hierarchical metadata (heading levels) to each text fragment, so that each block can be associated with its context within the page architecture.

4. **Segmentation and chunking**  
   The text was split into chunks of controlled length with overlap, balancing semantic precision against computational efficiency, to obtain fragments of suitable size for vector indexing.

5. **Vectorization and storage**  
   Each fragment was transformed into a high-dimensional vector using an embedding model and loaded into a Chroma database, preparing the data for similarity-based retrieval.

---

## General conclusion

> Thanks to this preparation process, the dataset went from a heterogeneous, noisy state to a structured, clean, and enriched collection. Reproducible workflows were established and the quality and consistency of the data were ensured, laying the groundwork for the Modeling phase, where these vectors and metadata will be used to develop effective RAG and question-answering solutions.

In [292]:
collection.count()

701