## Install needed Libraries

### Install Libraries from pip

In [8]:
!pip install langchain langchain-community pandas numpy matplotlib seaborn nltk textstat



### Import needed Libraries

In [11]:
import pandas as pd
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
# Download the needed nltk corpus 
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
import unicodedata
import textstat
import re

USER_AGENT environment variable not set, consider setting it to identify your requests.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\soyel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\soyel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\soyel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Convert Excel Spreadsheet to pandas Data Frame

In [14]:
# Read Excel containig List of URL's with Architectural Pattern and Metadata.
url_df = pd.read_excel("./URLs.xlsx", sheet_name="Sheet1")
# Show shape of DataFrame
print("Shape: ",url_df.shape)
# Show the Format of the Data Frame
url_df.head()

Shape:  (628, 6)


Unnamed: 0,URL,1st Level,2nd Level,3rd Level,4th Level,Lens
0,https://docs.aws.amazon.com/wellarchitected/la...,Abstract and Introducción,,,,Serverless Applications
1,https://docs.aws.amazon.com/wellarchitected/la...,Definitions,,,,Serverless Applications
2,https://docs.aws.amazon.com/wellarchitected/la...,Definitions,Compute Layers,,,Serverless Applications
3,https://docs.aws.amazon.com/wellarchitected/la...,Definitions,Data Layer,,,Serverless Applications
4,https://docs.aws.amazon.com/wellarchitected/la...,Definitions,Messaging and streaming layer,,,Serverless Applications


## Read each link and store the Data in correct Format

### Create Function to add Level to metadata

In [18]:
# We create a function to validate if a level exist in a row of the dataframe
def createMetadataLevel(level,url_line,metadata):
    #Validate if the Level is enot empty
    if(not pd.isna(url_line[level])):
        #If level is not empty add the level to the metadata
        metadata[level]=url_line[level]
    #Return the modified metadata.
    return metadata

### Create function to load the URL with the extra metadata.

In [21]:
def loadURLWithMetaData(url_line):
    # We define the loader, which will read the information in the URL's leveraging the langchain library.
    loader = WebBaseLoader(
        # We say, which URL will be read and loaded.
        url_line["URL"],
    )
    # We will read the URL and get different documents from all the paragraphs.
    docs = loader.load()
    # We define all the metadata to add to the docs read from this page
    metadata = {
        "Lens": url_line["Lens"],
        "1st Level": url_line["1st Level"]
    }
    # Add all levels of metadata, validating the level exists.
    metadata = createMetadataLevel("2nd Level",url_line,metadata)
    metadata = createMetadataLevel("3rd Level",url_line,metadata)
    metadata = createMetadataLevel("4th Level",url_line,metadata)

    for doc in docs:
        doc.metadata.update(metadata)

    return docs

### Cycle trough all URL's in the list and load them

In [24]:
#Define Variable to store all information extracted from the URL's with the metadata.
all_docs = []
#Cycle trough all URL's to load them as text and add the desired metadata.
for index, row in url_df.iterrows():
    #Read the content of the URL, add it to the list with it's needed metadata in the propper format, to be able to process it later.
    all_docs.extend(loadURLWithMetaData(row))
print("This is a sample of the content extracted from the URL's", all_docs[0])

This is a sample of the content extracted from the URL's page_content='
Serverless Applications Lens - AWS Well-Architected Framework - Serverless Applications LensServerless Applications Lens - AWS Well-Architected Framework - Serverless Applications LensDocumentationAWS Well-ArchitectedAWS Well-Architected FrameworkIntroductionCustom lens availabilityServerless Applications Lens - AWS Well-Architected FrameworkPublication date: July 14, 2022 (Document revisions)
    This document describes the Serverless Applications Lens for
    the AWS
      Well-Architected Framework. The document covers common
    serverless applications scenarios and identifies key elements to
    ensure that your workloads are architected according to best
    practices.
  
Introduction

      The AWS Well-Architected Framework helps you understand the pros and
      cons of decisions you make while building systems on AWS. By using
      the Framework, you will learn architectural best practices for
      desi

**Preprocessing for RAG System:**

Having structured and explored our data, the next crucial step is to prepare it for our Retrieval-Augmented Generation (RAG) system. This involves applying targeted preprocessing techniques informed by the findings of our Exploratory Data Analysis (EDA). The goal of these actions is to refine the text, making it more suitable for generating high-quality embeddings and ultimately enhancing the performance of our RAG model. Following these preprocessing steps, we will proceed to create the embeddings and load them into a vector database.

The following specific actions have been identified and will be performed in this notebook before the embedding and loading phase, which will occur at the conclusion of this notebook:

1.  **Apply Text Normalization and Clean Unusal Characters:** Standardize the text by removing html characters and setting all to lower case.
2.  **Evaluate "data" Word:** Analyze the prevalence and context of the word "data" to determine if it should be treated as a stop word.
3.  **Handle Repetitive Phrases:** Identify and process or remove frequently recurring phrases that may not contribute significant semantic value.

## Normalize text and clean Unusal Characters

### Define function to Normalize text

In [102]:
def normalize_text(text):
    nomalize_text = unicodedata.normalize('NFKC', text)
    nomalize_text = re.sub(r"\s+", " ", nomalize_text)  # Normalize whitespace
    return text

### Test text normalization

In [108]:
text = all_docs[0].page_content
print(normalize_text(text))


Serverless Applications Lens - AWS Well-Architected Framework - Serverless Applications LensServerless Applications Lens - AWS Well-Architected Framework - Serverless Applications LensDocumentationAWS Well-ArchitectedAWS Well-Architected FrameworkIntroductionCustom lens availabilityServerless Applications Lens - AWS Well-Architected FrameworkPublication date: July 14, 2022 (Document revisions)
    This document describes the Serverless Applications Lens for
    the AWS
      Well-Architected Framework. The document covers common
    serverless applications scenarios and identifies key elements to
    ensure that your workloads are architected according to best
    practices.
  
Introduction

      The AWS Well-Architected Framework helps you understand the pros and
      cons of decisions you make while building systems on AWS. By using
      the Framework, you will learn architectural best practices for
      designing and operating reliable, secure, efficient, and
      cost-effecti

## Explore "data" word

## Handle Repetitive Phrases

### Identify Repetitive phrases

#### Define function to get repetitive phrases

In [216]:
def identify_repetitive_phrases(text, repetition_ngram=6):
    # Normalize Unicode (e.g., accented characters)
    normalized_text = unicodedata.normalize('NFKC', text)

    tokens = [word.lower() for word in word_tokenize(normalized_text) if word.isalpha()]
    tokens_clean = [t for t in tokens if t not in stopwords.words('english')]

    if len(tokens_clean) > 10:
        ngrams = [' '.join(tokens_clean[i:i+repetition_ngram]) for i in range(len(tokens_clean) - repetition_ngram + 1)]
        ngram_counts = Counter(ngrams)
        repeated_phrases = {k: v for k, v in ngram_counts.items() if v > 20}
        if repeated_phrases:
            print(f"Repeated phrases: {list(repeated_phrases.keys())}")

##### Get repeitive phrases

In [219]:
all_text = [doc.page_content for doc in all_docs]
combined_text = " ".join(all_text)
identify_repetitive_phrases(combined_text)

Repeated phrases: ['javascript disabled unavailable use amazon web', 'disabled unavailable use amazon web services', 'unavailable use amazon web services documentation', 'use amazon web services documentation javascript', 'amazon web services documentation javascript must', 'web services documentation javascript must enabled', 'services documentation javascript must enabled please', 'documentation javascript must enabled please refer', 'javascript must enabled please refer browser', 'must enabled please refer browser help', 'enabled please refer browser help pages', 'page help yesthanks letting us know', 'help yesthanks letting us know good', 'yesthanks letting us know good job', 'letting us know good job got', 'us know good job got moment', 'know good job got moment please', 'good job got moment please tell', 'job got moment please tell us', 'got moment please tell us right', 'moment please tell us right page', 'please tell us right page help', 'tell us right page help nothanks', 'us 

Seeing the specific repeated phrases we can categorize them into 1 general area:

1. Navigation/UI Element

For the category, we would need to remove them so it does not disturb with the content, as it is content from an UI persepctive, that does not contain valuable data.

The full sentences is, we will need to hande it after it is normalized:
To determine if a custom lens is available for the lens described in this whitepaper,
reach out to your Technical Account Manager (TAM), Solutions Architect (SA), or Support.
Javascript is disabled or is unavailable in your browser.To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.Document ConventionsDefinitionsDid this page help you? - YesThanks for letting us know we're doing a good job!If you've got a moment, please tell us what we did right so we can do more of it.Did this page help you? - NoThanks for letting us know this page needs work. We're sorry we let you down.If you've got a moment, please tell us how we can make the documentation better.

We will remove the whole section in each of the lenses, after a deep dive review we realized this was beeing pulled and does not add much context to the system.

### Define Function to handle repetitive phrase

In [222]:
def remove_repetitive_phrases(rep_phrases,text):
    for phrase in rep_phrases:
        text = re.sub(phrase, "", text, flags=re.IGNORECASE)
    return text

### Test Repetitive phrase handler

In [209]:
rep_phrases = ["Javascript is disabled or is unavailable in your browser.To use the Amazon Web Services Documentation, Javascript must be enabled.",
               "Please refer to your browser's Help pages for instructions.",
               r"Document (.+?) this page help you\?",
               "YesThanks for letting us know we're doing a good job!",
               "- If you've got a moment, please tell us what we did right so we can do more of it.",
               r"Did this page help you\? - NoThanks for letting us know this page needs work.",
               "We're sorry we let you down.",
               "If you've got a moment, please tell us how we can make the documentation better."
              ]
text = normalize_text(all_docs[2].page_content)
print(remove_repetitive_phrases(rep_phrases,text))


Compute layer - Serverless Applications LensCompute layer - Serverless Applications LensDocumentationAWS Well-ArchitectedAWS Well-Architected FrameworkCompute layer The compute layer of your workload manages requests from external systems, controlling
      access and verifying that requests are appropriately authorized. Your business logic will be
      deployed and started by the runtime environment that it contains. 
AWS Lambda lets you run stateless serverless
      applications on a managed platform that supports microservice architectures, deployment, and
      management of execution at the function layer.  With Amazon API Gateway, you can run a fully
      managed REST API that integrates with Lambda to
      apply your business logic, and includes traffic management, authorization and access control,
      monitoring, and API versioning. 
AWS Step Functions orchestrates serverless workflows including
      coordination, state, and function chaining as well as combining long-r

We can see that the repetitive phrases at the end where succesfully removed and we now have the content that adds value to the context.