# **Contextual Compression Retriever:**

**Contextual Compression Retriever** is a retrieval component that reduces the amount of retrieved text passed downstream while keeping what is relevant to the user's query. This matters when working with large volumes of data, because only the most pertinent information is passed along for further processing and response generation.<br>

**To use the Contextual Compression Retriever, you'll need:**<br>

* a base retriever
* a Document Compressor

The Contextual Compression Retriever passes queries to the base retriever, takes the initial documents and passes them through the Document Compressor. The Document Compressor takes a list of documents and shortens it by reducing the contents of documents or dropping documents altogether.<br><br>
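The flow described above can be sketched without any framework. This is a minimal illustration, not LangChain's implementation: `keyword_retriever`, `sentence_compressor`, and `compression_retrieve` are hypothetical names, and documents are plain strings rather than `Document` objects.

```python
from typing import Callable, List

# Hypothetical stand-ins: a "document" is just a string in this sketch.
Retriever = Callable[[str], List[str]]
Compressor = Callable[[List[str], str], List[str]]


def keyword_retriever(corpus: List[str]) -> Retriever:
    """Toy base retriever: return every document sharing a word with the query."""
    def retrieve(query: str) -> List[str]:
        terms = set(query.lower().split())
        return [doc for doc in corpus if terms & set(doc.lower().split())]
    return retrieve


def sentence_compressor(docs: List[str], query: str) -> List[str]:
    """Toy compressor: keep sentences that mention a query term, drop documents left empty."""
    terms = set(query.lower().split())
    compressed = []
    for doc in docs:
        kept = [s.strip() for s in doc.split(".") if terms & set(s.lower().split())]
        if kept:
            compressed.append(". ".join(kept) + ".")
    return compressed


def compression_retrieve(query: str, retriever: Retriever, compressor: Compressor) -> List[str]:
    """Pass the query to the base retriever, then run the results through the compressor."""
    initial_docs = retriever(query)
    return compressor(initial_docs, query)


corpus = [
    "Reset your password from the login page. Billing is handled monthly.",
    "Our office is closed on public holidays.",
]
result = compression_retrieve("password reset", keyword_retriever(corpus), sentence_compressor)
print(result)  # ['Reset your password from the login page.']
```

The second corpus entry is never retrieved, and the billing sentence is trimmed from the first, which mirrors the two ways a compressor can shorten its input.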

**Key Points:**<br>
* **Purpose:** To compress and filter out irrelevant parts of the retrieved documents, making them more manageable and focused on the context of the user's query.

* **Importance:** Enhances efficiency by reducing the amount of data that needs to be processed by downstream components, such as language models, without losing the essence of the information.

* **Usage:** Particularly useful in applications with large datasets or when dealing with documents that contain a lot of extraneous information.<br><br>


**Example Scenario:**<br>
**Scenario:** A user queries a customer support chatbot about **"How to reset the password for my account."**<br>

* **Data Retrieval:** The system retrieves several documents, including user manuals, support tickets, and FAQ entries related to account management.

* **Contextual Compression:**
  * **Original Document:** The retrieved document is a user manual with 100 pages, including sections on various account-related issues like registration, account settings, security measures, and password reset.

  * **Compression Process:** The retriever identifies and extracts the most relevant sections specifically about password reset procedures.

  * **Compressed Document:** The resulting document now consists of only 5 pages focusing solely on the password reset process, eliminating unrelated sections.

* **Output:** The compressed document is then passed to the language model to generate a concise, accurate response to the user's query.<br><br>



**Benefits:**
* **Efficiency:** Reduces processing time and computational load by focusing on relevant information.

* **Accuracy:** Increases the likelihood of generating accurate responses by filtering out irrelevant data.

* **Scalability:** Allows the system to handle larger datasets effectively.<br><br>


By implementing a Contextual Compression Retriever, the overall system becomes more efficient and responsive, providing users with precise and relevant answers while minimizing unnecessary data processing.

## **Data Ingestion:**

In [94]:
# Install necessary libraries:

!pip install langchain langchain-community unstructured pinecone-client boto3 langchain_aws requests beautifulsoup4 -qU


### **Get the List of URLs:**

In [None]:
# Extract URLs from a specific website:

import requests
from bs4 import BeautifulSoup
import urllib.parse

def retrieve_urls(website_url):
    try:
        # Send a GET request to the website
        response = requests.get(website_url, timeout=10)  # avoid hanging on a slow server
        response.raise_for_status()  # Raise an exception for HTTP errors

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find all anchor tags with href attributes
        anchor_tags = soup.find_all('a', href=True)

        # Extract URLs and make them absolute if they are relative
        urls = set()
        for tag in anchor_tags:
            url = tag['href']
            full_url = urllib.parse.urljoin(website_url, url)
            urls.add(full_url)

        return urls

    except requests.exceptions.RequestException as e:
        print(f"Error fetching the website: {e}")
        return set()

# Example usage
website_url = "https://edzlms.com/"
urls = retrieve_urls(website_url)
urls = list(urls)
urls
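The raw link set usually contains off-site links, fragment-only variants, and trailing-slash duplicates. One possible way to tidy it before loading is sketched below. `filter_same_site` is a hypothetical helper that is not part of the original notebook, and treating `/plans` and `/plans/` as the same page is an assumption that may not hold for every site.

```python
from urllib.parse import urldefrag, urlparse


def filter_same_site(urls, website_url):
    """Keep http(s) URLs on the same host as website_url, dropping fragments and duplicates."""
    site_host = urlparse(website_url).netloc.lower()
    kept = set()
    for url in urls:
        clean_url, _fragment = urldefrag(url)  # strip "#section" anchors
        parsed = urlparse(clean_url)
        if parsed.scheme in ("http", "https") and parsed.netloc.lower() == site_host:
            # Assumption: ".../plans" and ".../plans/" serve the same page
            kept.add(clean_url.rstrip("/"))
    return sorted(kept)


sample = [
    "https://edzlms.com/plans/",
    "https://edzlms.com/plans",
    "https://edzlms.com/plans/#pricing",
    "https://calendly.com/edzlms/30min",
    "mailto:info@example.com",
]
print(filter_same_site(sample, "https://edzlms.com/"))  # ['https://edzlms.com/plans']
```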

### **Load Documents from URLs:**

In [5]:
from langchain_community.document_loaders import UnstructuredURLLoader

URLs = [
    'https://edzlms.com/edzlms-features/',
    'https://edzlms.com/index.php/contact/',
    'https://edzlms.com/detailed-features-list/',
    'https://edzlms.com/index.php/detailed-feature-list/',
    'https://edzlms.com/index.php/terms-of-use/',
    'https://edzlms.com/index.php/user-management/',
    'https://edzlms.com/lms-security/',
    'https://edzlms.com/other-industry/',
    'https://edzlms.com/lms-overview/',
    'https://edzlms.com/reporting-and-tracking/',
    'https://edzlms.com/integration/',
    'https://edzlms.com/index.php/about-the-team/',
    'https://edzlms.com/index.php/course-creation/',
    'https://edzlms.com/sales-training/',
    'https://edzlms.com',
    'https://edzlms.com/university/',
    'https://edzlms.com/index.php/portals/',
    'https://edzlms.com/insurance-sectors/',
    'https://edzlms.com/monetise-content-with-ecommerce/',
    'https://edzlms.com/index.php/extended-enterprise/',
    'https://edzlms.com/gamification-and-learners-engagement/',
    'https://edzlms.com/blogs/',
    'https://edzlms.com/index.php/blogs/',
    'https://edzlms.com/index.php/hospital-and-medical/',
    'https://edzlms.com/index.php/super-fast-management-of-user/',
    'https://edzlms.com/index.php/lms-features/',
    'https://edzlms.com/healthcare/',
    'https://edzlms.com/continuous-learning-for-employee/',
    'https://edzlms.com/index.php/insurance-training/',
    'https://edzlms.com/index.php/sales-training/',
    'https://edzlms.com/index.php/gamification-and-learners-engagement/',
    'https://edzlms.com/index.php/reporting/',
    'https://edzlms.com/train-your-customers/',
    'https://edzlms.com/index.php/university-lms/',
    'https://edzlms.com/index.php/plans/',
    'https://edzlms.com/training-delivery-methodologies/',
    'https://edzlms.com/super-fast-management-of-user/',
    'https://edzlms.com/contact/',
    'https://edzlms.com/index.php/learner-engagement/',
    'https://edzlms.com/index.php/customer-training/',
    'https://edzlms.com/multiple-environment-portal/',
    'https://edzlms.com/customer-support/',
    'https://edzlms.com/index.php/training-delivery/',
    'https://edzlms.com/ai-powered-learning/',
    'https://edzlms.com/index.php/case-study/',
    'https://calendly.com/edzlms/30min?month=2024-06',
    'https://edzlms.com/index.php/security-2/',
    'https://edzlms.com/index.php/employee-training/',
    'https://edzlms.com/extended-training-portal/',
    'https://edzlms.com/case-study/',
    'https://edzlms.com/index.php/ebooks/',
    'https://edzlms.com/plans/',
    'https://edzlms.com/our-team/',
    'https://edzlms.com/index.php/ai-powered-learning/',
    'https://edzlms.com/index.php/other-industry/',
    'https://edzlms.com/course-content-creation/',
    'https://edzlms.com/index.php/detailed-features-list/',
    'https://edzlms.com/index.php/lms-with-ecommerce/',
    'https://www.linkedin.com/company/edzlms/mycompany/',
    'https://edzlms.com/index.php/university-lms',
    'https://edzlms.com/index.php/customer-experience/',
    'https://calendly.com/edzlms/30min',
    'https://edzlms.com/index.php/integrations-partnership/'
]

loader = UnstructuredURLLoader(urls = URLs)
documents = loader.load()

In [6]:
len(documents)

63

### **Preprocessed Documents:**

In [7]:
import datetime
# import time

date = datetime.datetime.now().strftime("%Y-%m-%d")
time = datetime.datetime.now().strftime("%H:%M:%S")

print(date)
print(time)

2024-08-05
09:49:07


In [8]:
# Collapse newlines, tabs, and repeated spaces into single spaces:

import re

def clean_text(text):
    # Remove newlines, tabs, and extra spaces
    text = re.sub(r'[\n\t\r]+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


def preprocess_documents(documents, description=None):
    processed_docs = []
    for doc in documents:
        # Clean the page content
        cleaned_content = clean_text(doc.page_content)

        # Build a structured record for this document:
        processed_doc = {
            "source": doc.metadata.get('source', ''),
            "date": datetime.datetime.now().strftime("%Y-%m-%d"),
            "time": datetime.datetime.now().strftime("%H:%M:%S"),
            "description": description,
            "content": cleaned_content
        }

        processed_docs.append(processed_doc)

    return processed_docs


docs = preprocess_documents(documents=documents)
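As a quick check of what the cleaning step does, the snippet below repeats `clean_text` from the cell above and applies it to a made-up string (the sample text is illustrative, not taken from the scraped pages).

```python
import re


def clean_text(text):
    # Same logic as the notebook cell: collapse line breaks, tabs, and runs of whitespace
    text = re.sub(r'[\n\t\r]+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


raw = "Skip to content\n\n\tFree Trial   Menu\r\n"
cleaned = clean_text(raw)
print(repr(cleaned))  # 'Skip to content Free Trial Menu'

# The second pattern alone gives the same result, because \s already matches \n, \t and \r
assert re.sub(r'\s+', ' ', raw).strip() == cleaned
```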

In [None]:
docs[11]

### **Get Embeddings Model, LLM & Pinecone Index:**

In [96]:
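import os
from pinecone import Pinecone

# Read credentials from environment variables; never hard-code secrets in a notebook.
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
AWS_ACCESS_KEY = os.environ["AWS_ACCESS_KEY_ID"]
AWS_SECRET_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]
AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")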

In [97]:
# Get the embedding model and LLM from Amazon Bedrock:

import boto3
from langchain.llms.bedrock import Bedrock
from langchain.embeddings import BedrockEmbeddings


bedrock_client = boto3.client(
    service_name = "bedrock-runtime",
    region_name = AWS_REGION,
    aws_access_key_id = AWS_ACCESS_KEY,
    aws_secret_access_key = AWS_SECRET_KEY,
)


bedrock_embedding = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", client=bedrock_client) # embedding model
llm = Bedrock(model_id = "meta.llama3-8b-instruct-v1:0", client = bedrock_client)
pinecone = Pinecone(api_key=PINECONE_API_KEY) # pinecone client
pinecone_index = pinecone.Index("chatindex") # pinecone index

### **Store Embeddings to Vector Database:**

In [88]:
# Store the Embeddings into Pinecone:

def store_embeddings(documents, pinecone_index, bedrock_embedding, pinecone_namespace='test1'):
  try:
    vectors = []
    for i, doc in enumerate(documents):
      # Check if the content is empty and skip if it is
      if not doc['content']:
        print(f"Skipping document {i} due to empty content.")
        continue

      embedding = bedrock_embedding.embed_query(doc['content'])
      metadata = {
          "source": doc['source'],
          "date": doc['date'],
          "time": doc['time'],
          # Handle the case where this document's 'description' might be None
          "description": 'None' if doc.get('description') is None else doc.get('description'),
          "content": doc['content']
      }
      vectors.append(
          (str(i), embedding, metadata)
      )

    # Upsert the vectors in Pinecone
    pinecone_index.upsert(vectors, namespace=pinecone_namespace)
    print("Document embeddings stored in Pinecone successfully.")

    return vectors

  except Exception as e:
    # Print the exception message for debugging
    print(e)
    raise e # Reraise the exception to stop execution and allow for inspection


vector_docs = store_embeddings(documents=docs, pinecone_index=pinecone_index, bedrock_embedding=bedrock_embedding)

Skipping document 45 due to empty content.
Skipping document 58 due to empty content.
Skipping document 61 due to empty content.
Document embeddings stored in Pinecone successfully.
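The cell above sends every vector in a single `upsert` call, which is fine for about 60 documents. For larger collections it may be safer to upsert in batches. The sketch below shows only the batching logic: `batched` is a hypothetical helper and the batch size of 100 is an assumption, so check the current Pinecone limits before relying on it.

```python
from typing import Iterator, List, Sequence, Tuple

# (id, embedding, metadata) tuples, matching the shape built in store_embeddings
Vector = Tuple[str, List[float], dict]


def batched(items: Sequence[Vector], batch_size: int = 100) -> Iterator[Sequence[Vector]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Dummy vectors for illustration only
vectors = [(str(i), [0.0, 0.0], {"source": f"doc-{i}"}) for i in range(250)]
sizes = [len(batch) for batch in batched(vectors)]
print(sizes)  # [100, 100, 50]

# In the notebook, the single upsert call could then become:
# for batch in batched(vectors):
#     pinecone_index.upsert(batch, namespace="test1")
```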


## **Base Retriever (Normal Retriever):**

### **Base Retriever:**

In [51]:
import os
from pinecone import Pinecone


pc = Pinecone(api_key=PINECONE_API_KEY, environment='us-east-1')

pc_index_name = "chatindex"
pc_namespace = "test1"

index = pc.Index(pc_index_name)
print(index.describe_index_stats())

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'test1': {'vector_count': 60}},
 'total_vector_count': 60}


In [82]:
!pip install langchain-pinecone -qU

Installing collected packages: langchain-pinecone
Successfully installed langchain-pinecone-0.1.3


In [91]:
from langchain_pinecone import PineconeVectorStore

knowledge = PineconeVectorStore(index=index, embedding=bedrock_embedding, namespace='test1', text_key='content')
base_retriever = knowledge.as_retriever(search_kwargs={"k": 7}, search_type="mmr")

In [None]:
base_retriever.get_relevant_documents("What is EdzLMS?")

In [123]:
def print_docs(retriever, query):
  docs = retriever.get_relevant_documents(query)
  print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

print_docs(base_retriever, "what is edzlsm?")

Document 1:

Skip to content Free Trial Menu Drive your success, expand your training world Extended enterprise EdzLMS refers to training provided to a business’s external partners, including vendors, customers, and learners. With EdzLMS, you can share your training resources to enrich the knowledge of your extended enterprise. Book Demo Trusted by more than 400 global clients EdzLMS FOR Implementing Extended Enterprise across different branches, departments, channels, or partners of your company allows for customization of each instance using EdzLMS features. This customization can include branding, language options, and more, ensuring a cohesive and tailored experience for each tenant. Each tenant can have unique aesthetics, colors, settings, and information. Rather than using multiple LMS platforms for different tenants, the comprehensive Extended Enterprise LMS provided by EdzLMS supports all needs in a unified solution. This approach offers personalized themes and features without

### **Generate Response using LLM:**

In [93]:
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

In [100]:
# Define the prompt
prompt = """
You are an AI-powered virtual assistant, your name is 'EdzLsm', designed by EdzLearn Service Private Limited.
Your task is to help the students with their study purposes.
Use the following pieces of context to answer the question at the end and you can also answer the question based on your knowledge.
If you don't know the answer, just say that 'I don't have enough information to answer this question'.

Whenever people ask a general question, you must answer it as well, for example:
Question: Hi
Answer: Hello! How can I assist you with your studies today?

Question: What is your name?
Answer: I am EdzLsm, your virtual assistant designed by EdzLearn Service Private Limited.

Context: `{context}`
Question: `{question}`
"""

prompt_template = PromptTemplate(
    template=prompt,
    input_variables=['context', 'question']
)

llm_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=base_retriever,
    chain_type_kwargs={"prompt": prompt_template}
)

In [104]:
%%time
response = llm_chain.invoke("What is EdzLsm?")

CPU times: user 267 ms, sys: 92.5 ms, total: 359 ms
Wall time: 7.95 s


In [120]:
import re
from IPython.display import display, Markdown

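def extract_first_answer(query: str) -> None:
    """Run the QA chain and render the first backtick-quoted answer as Markdown."""
    response = llm_chain.invoke(query)
    response = response['result']
    pattern = r"Answer:\s*`([^`]+)`"

    match_ = re.search(pattern, response)

    # display() returns None, so render the result (or a fallback message) instead of returning it
    text = match_.group(1).strip() if match_ else "No answer found."
    display(Markdown(text))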
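The pattern used above only matches answers wrapped in backticks, which the model appears to produce because the prompt wraps the context and question the same way. The sketch below exercises the same regular expression on two made-up responses; `first_answer` is a hypothetical helper that returns the text instead of displaying it.

```python
import re

ANSWER_PATTERN = r"Answer:\s*`([^`]+)`"


def first_answer(text):
    """Return the first backtick-quoted answer, or None when the pattern does not match."""
    match_ = re.search(ANSWER_PATTERN, text)
    return match_.group(1).strip() if match_ else None


quoted = "Answer: `EdzLMS is a learning management system.`\nQuestion: `Hi`"
unquoted = "Answer: EdzLMS is a learning management system."

print(first_answer(quoted))    # EdzLMS is a learning management system.
print(first_answer(unquoted))  # None
```

If the model answers without backticks, the extraction finds nothing even though an answer is present, so a fallback that returns the raw response may be worth considering.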

In [111]:
extract_first_answer(query="What is EdzLsm?")

'EdzLsm is a virtual assistant designed by EdzLearn Service Private Limited. It is a Learning Management System (LMS) that helps educational institutions and organizations deliver high-impact training and learning experiences to their students, employees, or customers. EdzLsm is a collaborative learning platform that enables users to create, manage, and deliver eLearning content, track progress, and collaborate effectively.'

In [117]:
%%time
extract_first_answer(query="Hi")

CPU times: user 206 ms, sys: 115 ms, total: 321 ms
Wall time: 7.97 s


'Hello! How can I assist you with your studies today?'

In [121]:
# %%time
extract_first_answer(query="Tell me something about yourself in details")

Hello! I am EdzLsm, your virtual assistant designed by EdzLearn Service Private Limited. I'm here to help you with your study-related queries. I can assist you in various ways, such as providing information on our features, explaining how to use our platform, and even answering any questions you may have.

I'd like to tell you a bit more about myself. I'm a highly advanced AI-powered virtual assistant, designed to provide personalized support to learners and educators alike. My primary goal is to make learning more efficient, engaging, and enjoyable for everyone.

I'm built on top of a robust and secure platform, ensuring that your data is protected and secure. I'm constantly learning and improving, so I can provide you with the most accurate and up-to-date information.

I can assist you with a wide range of topics, including learning management, course creation, training delivery, gamification, and more. I'm also happy to help you with any questions you may have about our features, such as our AI-based learning, course content creation, and reporting and tracking integrations.

I'm excited to work with you and help you achieve your learning goals. If you have any questions or need assistance with anything, please don't hesitate to ask. I'm here to help!

In [122]:
# %%time
extract_first_answer(query="What are the features do you have?")

I have the following features: 
- I can help with any subject-related queries.
- I can provide study materials, notes, and other resources.
- I can assist with homework and assignment-related tasks.
- I can help with time management and organization.
- I can provide guidance on how to use various educational tools and software.
- I can assist with language-related tasks such as grammar correction, translation, and summarization.
- I can provide information on various educational topics and subjects.
- I can assist with research-related tasks.
- I can help with note-taking and organization.
- I can assist with creating study plans and schedules.
- I can provide information on various educational institutions and programs.
- I can assist with language-related tasks such as proofreading and editing.

## **Contextual Compression Retriever:**

In [124]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [126]:
llm_compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=llm_compressor, base_retriever=base_retriever)

In [131]:
docs = compression_retriever.invoke("What are features of EdzLms?")
print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

Document 1:

NO_OUTPUT
Reason: There is no relevant part of the context that answers the question "What are features of EdzLMS?" as the context does not provide a detailed list of features of EdzLMS. The context only mentions "Detailed Feature List" but does not provide it. The question is asking for a specific list of features, which is not provided in the context. Therefore, the answer is NO_OUTPUT.
----------------------------------------------------------------------------------------------------
Document 2:

NO_OUTPUT
```
There is no relevant part of the context that answers the question "What are features of EdzLms?". The context is about the features of EdzLms, but it is not extracted as is. The extracted parts are not relevant to answer the question. Therefore, the output is NO_OUTPUT. 

Note: The context is a long text, and it's not possible to extract the relevant parts without editing the text. However, according to the instructions, the extracted parts should be returned as
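In the output above, the model emits `NO_OUTPUT` followed by an explanation, and the whole string is kept as document content instead of being discarded. One possible workaround, sketched here on plain strings, is to drop any compressed text that begins with the marker. `drop_no_output` is a hypothetical helper, not a LangChain API, and it assumes the marker always appears at the start of the text.

```python
def drop_no_output(texts, marker="NO_OUTPUT"):
    """Discard compressed texts that are empty or start with the marker."""
    kept = []
    for text in texts:
        stripped = text.strip()
        if stripped and not stripped.startswith(marker):
            kept.append(stripped)
    return kept


compressed = [
    "NO_OUTPUT\nReason: There is no relevant part of the context.",
    "EdzLMS integrates points, badges, and leaderboards.",
    "   ",
]
print(drop_no_output(compressed))  # ['EdzLMS integrates points, badges, and leaderboards.']
```

Applied to the `page_content` of each compressed document, a filter like this would keep the explanation text from reaching the answer prompt.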

In [133]:
from langchain.retrievers.document_compressors import LLMChainFilter

llm_filter = LLMChainFilter.from_llm(llm)
compression_retriever2 = ContextualCompressionRetriever(base_compressor=llm_filter, base_retriever=base_retriever)

In [134]:
docs = compression_retriever2.invoke("Tell me something about EdzLms?")
print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

Document 1:

Skip to content Free Trial Menu LEARNER ENGAGEMENT AND GAMIFICATION Motivate Learners with Gamification and Rewards At EdzLMS, we’re transforming learning by making it fun, engaging, and rewarding. Our platform integrates game elements like points, badges, leaderboards, and challenges to create an immersive, interactive learning environment that keeps learners motivated and on track to reach their goals. By incorporating community engaged learning, we ensure that our learners not only gain knowledge but also actively contribute to and benefit from their learning communities, fostering a deeper connection and commitment to their educational journey. Give an experience, not an LMS EdzLMS offers an intuitive learning management system focused on the learner, incorporating features that inspire, eliminate obstacles, and transform training into an enjoyable and accessible activity that your learners are eager to engage with. Applying Game Dynamics and Leaderboard to Training Tr

In [141]:
# Define the prompt
prompt = """
You are an AI-powered virtual assistant, your name is 'EdzLms', designed by EdzLearn Service Private Limited.
Your task is to help the students with their study purposes.
Use the following pieces of context to answer the question at the end and you can also answer the question based on your knowledge.
If you don't know the answer, just say that 'I don't have enough information to answer this question'.

Whenever people ask a general question, you must answer it as well, for example:
Question: Hi
Answer: Hello! How can I assist you with your studies today?

Question: What is your name?
Answer: I am EdzLms, your virtual assistant designed by EdzLearn Service Private Limited.

Context: `{context}`
Question: `{question}`
"""

prompt_template = PromptTemplate(
    template=prompt,
    input_variables=['context', 'question']
)

llm_chain2 = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    chain_type_kwargs={"prompt": prompt_template}
)


In [136]:
import re
from IPython.display import display, Markdown

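def extract_first_answer(query: str) -> None:
    """Run the compression-based QA chain and render the first backtick-quoted answer as Markdown."""
    response = llm_chain2.invoke(query)
    response = response['result']
    pattern = r"Answer:\s*`([^`]+)`"

    match_ = re.search(pattern, response)

    # display() returns None, so render the result (or a fallback message) instead of returning it
    text = match_.group(1).strip() if match_ else "No answer found."
    display(Markdown(text))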

In [142]:
extract_first_answer(query="What is EdzLms?")

NO_OUTPUT

In [138]:
extract_first_answer(query="What are features you provide?")

I don't have enough information to answer this question. The context does not provide any information about the features provided by EdzLMS.

In [None]:
# https://github.com/sunnysavita10/Generative-AI-Indepth-Basic-to-Advance/blob/main/basic_retrieval_and_contextual_compression_retrieval.ipynb

