# Building the RAG system

Here we will build a RAG system that accepts any web URL, such as a blog post. The system reads the article, summarises it, and adds it to our knowledge base.

That way, when we need to look something up later, we can draw on the articles we have already read.

In [None]:
! pip install beautifulsoup4 langchain_core langchain_community langchain_groq python-dotenv pinecone

## Importing all libraries

In [None]:
import os
import re

from bs4 import BeautifulSoup
from dotenv import load_dotenv
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.document_loaders import WebBaseLoader
from langchain_groq import ChatGroq
from pinecone import Pinecone, ServerlessSpec

load_dotenv()
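
Both API keys used later (`GROQ_API_KEY` and `PINECONE_API_KEY`) are read with `os.getenv`, which quietly returns `None` when a variable is missing. A minimal sketch of a fail-fast check follows. The helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    `require_env` is a hypothetical helper used for illustration only.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check your .env file."
        )
    return value
```

It could be used as `groq_api_key = require_env("GROQ_API_KEY")`, so that a missing key surfaces here rather than as an authentication error later on.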

## Utility function

In [3]:
def clean_html_content(html_content: str):
    """
    This function takes an HTML content as input and returns a clean text.
    It removes script, nav, and footer tags from the HTML content.
    """
    soup = BeautifulSoup(html_content, "html.parser")

    # Remove script, nav, and footer tags
    for tag in soup(["script", "nav", "footer"]):
        tag.decompose()  # Completely removes the tag from the DOM

    return soup.get_text(separator="\n")  # Extracts clean text

In [4]:
def clean_scraped_text(text: str) -> str:
    """
    Cleans up the text scraped from websites, removing unnecessary newlines, spaces, and special characters.

    Args:
        text (str): Raw text scraped from a website.

    Returns:
        str: Cleaned text.
    """
    # Remove carriage returns
    text = re.sub(r"\r+", "", text)

    # Replace multiple newlines with a single newline
    text = re.sub(r"\n+", "\n", text)

    # Remove multiple spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)

    # Remove lines that are empty or contain only whitespace
    text = re.sub(r"\n\s*\n", "\n", text)

    # Remove leading/trailing whitespace from each line
    text = "\n".join(line.strip() for line in text.splitlines())

    # Remove leading/trailing whitespace from the entire text
    text = text.strip()

    # Remove unwanted characters (non-ASCII, control characters)
    text = re.sub(r"[^\x00-\x7F]+", " ", text)

    # Remove HTML entities like &nbsp;, &amp;, etc.
    text = re.sub(r"&\w+;", " ", text)

    # Replace multiple punctuation marks with a single one
    text = re.sub(r"[\.\,\!\?\;\:]{2,}", ".", text)

    # Ensure consistent spacing after punctuation
    text = re.sub(r"([.!?])([^\s])", r"\1 \2", text)

    return text
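
The last rule in `clean_scraped_text` inserts a space after every `.`, `!` or `?` that is followed by a non-space character, so it also splits decimals and domain names. The sketch below reproduces that rule in isolation and compares it with a narrower variant. The function `space_after_sentence_end` and its uppercase-letter heuristic are assumptions for illustration, not part of the original function.

```python
import re


def space_after_punctuation(text: str) -> str:
    """The rule used in clean_scraped_text: add a space after . ! ? when a non-space follows."""
    return re.sub(r"([.!?])([^\s])", r"\1 \2", text)


def space_after_sentence_end(text: str) -> str:
    """Hypothetical narrower rule: only add the space when an uppercase letter follows."""
    return re.sub(r"([.!?])([A-Z])", r"\1 \2", text)


sample = "Docker is useful.It runs v3.14 at example.com"
print(space_after_punctuation(sample))   # Docker is useful. It runs v3. 14 at example. com
print(space_after_sentence_end(sample))  # Docker is useful. It runs v3.14 at example.com
```

The narrower rule is only a heuristic. It misses sentences that start with a digit or lowercase letter, and it still splits abbreviations such as `U.S.A`, so whether it is an improvement depends on the articles being ingested.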

## Getting the content from URL

This is where we scrape the URL for content and then use different techniques to clean the HTML returned by the web page.

In [None]:
url = (
    "https://www.amitavroy.com/articles/2024-07-25-Importance-of-Docker-as-a-developer"
)
loader = WebBaseLoader(url)
documents = loader.load()

Here we strip out the remaining HTML.

I also remove the content of some unwanted tags (`script`, `nav`, and `footer`) so that the text we keep is more relevant.

Finally, I concatenate everything into a single block of text so that we can summarise it.

In [25]:
cleaned_texts = [clean_html_content(doc.page_content) for doc in documents]
cleaned_texts = clean_scraped_text(" ".join(cleaned_texts))

## Summarising the content with our LLM

To keep only the most useful information in our vector database, we summarise the content first and then send it to Pinecone.

In [11]:
groq_api_key = os.getenv("GROQ_API_KEY")
model = ChatGroq(api_key=groq_api_key, model="llama-3.2-3b-preview")

In [26]:
prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """
     You are a technical person who will read content and try to summarise it.
     You need to capture all the important information from the content so that you can answer questions about it later.
     You also need to make sure that the summary is concise and easy to understand.
     This content will later be embedded and stored inside a vector database for retrieval.
     Try to also add a section where you mention the key entities of the content as comma-separated tags, for example:
     Topics: PHP, Laravel, Queued jobs
     """,
        ),
        (
            "user",
            "Summarise the following content:\n{context}",
        ),
    ]
)

In [27]:
formatted_prompt = prompt_template.invoke({"context": cleaned_texts})
summary = model.invoke(formatted_prompt)
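
The system prompt asks the model to include a `Topics:` line with comma-separated tags. A minimal sketch for pulling those tags out of the summary text, assuming the model follows that format (it may not, in which case an empty list is returned). The helper name `extract_topics` is hypothetical, and the sample string is made up for illustration.

```python
import re

# Matches a line such as "Topics: a, b" or "**Topics:** a, b" (markdown bold is tolerated).
TOPICS_LINE = re.compile(
    r"^[*\t ]*Topics[*\t ]*:[*\t ]*(.+)$", re.IGNORECASE | re.MULTILINE
)


def extract_topics(summary_text: str) -> list[str]:
    """Return the tags from the first 'Topics:' line, or [] if there is none."""
    match = TOPICS_LINE.search(summary_text)
    if match is None:
        return []
    # Trim whitespace, markdown asterisks and a trailing full stop from each tag.
    tags = (tag.strip(" \t*.") for tag in match.group(1).split(","))
    return [tag for tag in tags if tag]


sample_summary = "A short summary.\nTopics: PHP, Laravel, Queued jobs "
print(extract_topics(sample_summary))  # ['PHP', 'Laravel', 'Queued jobs']
```

In the notebook this would be called as `extract_topics(summary.content)`, and the resulting list could be stored as record metadata alongside the text.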

In [28]:
summary.response_metadata

{'token_usage': {'completion_tokens': 373,
  'prompt_tokens': 1379,
  'total_tokens': 1752,
  'completion_time': 0.231928694,
  'prompt_time': 0.22172015,
  'queue_time': 0.021728798999999993,
  'total_time': 0.453648844},
 'model_name': 'llama-3.2-3b-preview',
 'system_fingerprint': 'fp_a926bfdce1',
 'finish_reason': 'stop',
 'logprobs': None}

In [29]:
summary.content

'**Summary:**\n\nThe importance of Docker as a developer cannot be overstated. This article highlights six key reasons why every developer should get comfortable with Docker:\n\n1. **Consistency Across Environments**: Docker provides a consistent environment across all stages of development, reducing stress and headaches caused by different configurations.\n2. **Portability**: Docker\'s infrastructure is defined as code, making it easy to replicate setups or roll back to previous configurations.\n3. **Easy to Experiment**: Docker containers allow developers to spin up a container with any configuration, experiment freely, and remove it when done, with no traces left behind.\n4. **Easy on Resources**: Docker containers have lower memory requirements, allowing multiple containers to run on a single machine without overloading it.\n5. **Security**: Docker provides a layer of security by isolating applications and dependencies within containers, minimizing the risk of conflicts and unautho

## Saving the information to Pinecone

At this point we have the summarised version of the article.
Now we will embed it with a Pinecone-hosted embedding model and save it into our vector database.

In [19]:
def get_pinecone_index(index_name: str):
    pinecone_api_key = os.getenv("PINECONE_API_KEY")
    pc = Pinecone(api_key=pinecone_api_key)
    if index_name not in pc.list_indexes().names():
        pc.create_index(
            name=index_name,
            dimension=1024,
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),
        )
    return pc.Index(index_name)
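
The index is created with `metric="cosine"`, so matches are ranked by the cosine similarity between the query vector and each stored vector. The `dimension=1024` setting has to equal the length of the vectors produced by the embedding model. Pinecone computes the score on its side. The pure-Python sketch below only illustrates what that score measures, using toy 2-dimensional vectors.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction, length is ignored)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```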

In [20]:
def store_data_to_pinecone(data):
    pinecone_api_key = os.getenv("PINECONE_API_KEY")
    pc = Pinecone(api_key=pinecone_api_key)
    embedding_model = "multilingual-e5-large"
    pcone_index = get_pinecone_index("ragtutorial")

    embeddings = pc.inference.embed(
        model=embedding_model,
        inputs=[d["text"] for d in data],
        parameters={"input_type": "passage", "truncate": "END"},
    )

    records = []
    for d, e in zip(data, embeddings):
        records.append(
            {"id": d["id"], "values": e["values"], "metadata": {"text": d["text"]}}
        )

    return pcone_index.upsert(vectors=records, namespace="example-namespace")
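
Two things can go wrong quietly in the record-building loop: `zip` stops at the shorter input, so a length mismatch between `data` and `embeddings` would drop records, and a vector whose length differs from the index dimension is rejected at upsert time. A minimal sketch of a record builder that checks both up front. The helper name `build_records` is hypothetical, and copying the `category` field into the metadata is an assumption about how that field is meant to be used.

```python
EXPECTED_DIMENSION = 1024  # must match the dimension the index was created with


def build_records(data: list[dict], embeddings: list[dict],
                  dimension: int = EXPECTED_DIMENSION) -> list[dict]:
    """Pair each input item with its embedding and validate the shapes.

    `embeddings` is assumed to be a sequence of items exposing e["values"].
    """
    if len(data) != len(embeddings):
        raise ValueError(
            f"got {len(data)} items but {len(embeddings)} embeddings"
        )
    records = []
    for d, e in zip(data, embeddings):
        values = e["values"]
        if len(values) != dimension:
            raise ValueError(
                f"record {d['id']}: expected {dimension} values, got {len(values)}"
            )
        metadata = {"text": d["text"]}
        if "category" in d:  # assumption: keep the category for later filtering
            metadata["category"] = d["category"]
        records.append({"id": d["id"], "values": values, "metadata": metadata})
    return records
```

The returned list has the same shape as `records` in `store_data_to_pinecone`, so it could be passed to `upsert` unchanged.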

In [30]:
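# Note: this stores the full cleaned text, which is what the query output further down shows.
# To store the LLM summary instead, use summary.content as the "text" value.
data = [
    {"id": "2", "text": cleaned_texts, "category": "ragtutorial"},
]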
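
The record uses a hard-coded id of `"2"`. Because an upsert overwrites any existing record with the same id, every new article needs its own id, and re-ingesting the same URL should ideally map back to the same one. A minimal sketch, assuming a hypothetical helper `url_to_id` that derives the id from a SHA-256 hash of the URL:

```python
import hashlib


def url_to_id(url: str) -> str:
    """Derive a stable 32-character record id from a URL.

    `url_to_id` is a hypothetical helper; the same URL always yields the same id.
    """
    digest = hashlib.sha256(url.strip().encode("utf-8")).hexdigest()
    return digest[:32]


article_url = "https://www.amitavroy.com/articles/2024-07-25-Importance-of-Docker-as-a-developer"
print(len(url_to_id(article_url)))  # 32
```

With this, the record could be built as `{"id": url_to_id(url), "text": ..., "category": ...}`. Only surrounding whitespace is trimmed, so URLs that differ in case or by a trailing slash get different ids.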

In [31]:
store_data_to_pinecone(data)

{'upserted_count': 1}

## Searching our data

In [32]:
pcone_index = get_pinecone_index("ragtutorial")

In [36]:
pinecone_api_key = os.getenv("PINECONE_API_KEY")
pc = Pinecone(api_key=pinecone_api_key)
query = "why I should learn docker"

search_result = pc.inference.embed(
    model="multilingual-e5-large", inputs=[query], parameters={"input_type": "query"}
)

In [37]:
results = pcone_index.query(
    namespace="example-namespace",
    vector=search_result[0].values,
    top_k=1,
    include_metadata=True,
)
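
A query response carries a list of `matches`, each with an `id`, a similarity `score`, and the `metadata` stored at upsert time. To use the results in a RAG prompt, the stored text has to be pulled back out. A minimal sketch, written against a plain dictionary with that shape (a client response object may need converting to a dictionary first). The helper name `matches_to_context`, the `min_score` cut-off, and the sample data are all assumptions for illustration.

```python
def matches_to_context(results: dict, min_score: float = 0.0) -> str:
    """Join the stored text of all matches scoring at least `min_score`."""
    texts = []
    for match in results.get("matches", []):
        if match.get("score", 0.0) < min_score:
            continue  # skip weak matches
        text = match.get("metadata", {}).get("text")
        if text:
            texts.append(text)
    return "\n\n".join(texts)


# Made-up response used only to show the expected shape.
sample_results = {
    "matches": [
        {"id": "a", "score": 0.9, "metadata": {"text": "Docker keeps environments consistent."}},
        {"id": "b", "score": 0.2, "metadata": {"text": "Unrelated text."}},
    ]
}
print(matches_to_context(sample_results, min_score=0.5))
# Docker keeps environments consistent.
```

The returned string can then be placed into a prompt template as the context for answering the user's question.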

In [38]:
results

{'matches': [{'id': '2',
              'metadata': {'text': 'The importance of Docker as a developer - '
                                   'AMITAV ROY BLOGThe importance of Docker as '
                                   'a developer - AMITAV ROY BLOGAMITAV ROY '
                                   'BLOGHOMEPOSTSABOUTThe importance of Docker '
                                   'as a developerDocker has become a key tool '
                                   'for any developer. It is no more something '
                                   'that only the Devops person should know. '
                                   'In this article I will talk about the '
                                   'reasons why Docker is so important to know '
                                   'as a developer. 25 July, '
                                   '2024DEVOPSDEVLIFEDOCKERAs a software '
                                   'developer, understanding Docker is not '
                                   "just a