- Tuning hyperparameters
  - Parsers = pdfplumber (header and footer removed)
  - Embedding model = FastEmbedEmbeddings(BAAI/bge-small-en-v1.5)
  - Vectorstore = FAISS (fast), ChromaDB (slow)
  - LLMs = llama3-70b-8192 (robust), mixtral-8x7b-32768 (specific tasks)
  - Other hyperparameters = k, chunk_size
  - RAG implementations = (1) using RecursiveCharacterTextSplitter, (2) using section-wise chunking

- Improvements
  - better embedding model
  - make it conversational
  - reranking, query transformation techniques

# Installing Dependencies and libraries

In [1]:
import time
import warnings
warnings.filterwarnings("ignore")

In [2]:
%%time
# 3 min
!pip install -q langchain
!pip install -q langchain-core
!pip install -q langchain-community
!pip install -q fastembed
!pip install -q pypdf
!pip install -q langchain_groq
!pip install -q faiss-gpu
!pip install -q sentence_transformers
!pip install -q pdfplumber

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 24.4.1 requires cubinlinker, which is not installed.
cudf 24.4.1 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 24.4.1 requires ptxcompiler, which is not installed.
cuml 24.4.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 24.4.1 requires cupy-cuda11x>=12.0.0, which is not installed.
keras-cv 0.9.0 requires keras-core, which is not installed.
keras-nlp 0.12.1 requires keras-core, which is not installed.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.8 which is incompatible.
apache-beam 2.46.0 requires numpy<1.25.0,>=1.14.3, but you have numpy 1.26.4 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 14.0.2 which is incompatible

In [3]:
import time
import numpy as np 
import pandas as pd
import random
import pdfplumber
import re
from sklearn.metrics.pairwise import cosine_similarity
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain.chains.base import Chain
from langchain_community.document_loaders import PyPDFLoader
from langchain.docstore.document import Document
from langchain_community.vectorstores import FAISS
from langchain.prompts import PromptTemplate
from langchain_groq import ChatGroq
from langchain.chains import RetrievalQA
from langchain.retrievers import ParentDocumentRetriever
from IPython.display import Markdown, display
# from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.storage import InMemoryStore
from tqdm.autonotebook import tqdm, trange

### Environment Variables

In [4]:
# AEBS PDFs
path1 = "/kaggle/input/pdffiles/GB AEBS.pdf"
path2 = "/kaggle/input/pdffiles/UN AEBS.pdf"

# Light PDFs
path3 = "/kaggle/input/pdffiles/GB Lighting installation.pdf"
path4 = "/kaggle/input/pdffiles/R048r12e.pdf"

In [36]:
# LLMs (the Groq API key is read from the environment instead of being hard-coded here)
import os
groq_api_key = os.environ["GROQ_API_KEY"]
mixtral = ChatGroq(groq_api_key=groq_api_key, model='mixtral-8x7b-32768', temperature=0.05)
llama3 = ChatGroq(groq_api_key=groq_api_key, model='llama3-70b-8192', temperature=0.05)

# For question answering
template1 = """
You are the Vehicle Regulation Assistant, a helpful AI assistant. Your task is to answer the given questions from the provided relevant part of the PDF. The answer should be highly detailed and well-structured. If possible, refer to specific section numbers within the context (e.g., "According to section 4.1.2,..."). Do not begin your response with phrases like "Based on the provided context, the answer to the question is:". If the context does not contain information related to the question, explicitly state that there is no relevant information in the provided context. Be polite and helpful.

CONTEXT: {context}

QUESTION: {question}
"""
template2 = """
Your task is to answer the question accurately and in detail, using only the information provided in the given context. Where applicable, refer to specific section numbers within the context (e.g., "According to section 4.1.2,..."). If the answer is not found in the provided context, simply state that there is no relevant information available without sharing details about the context.

CONTEXT: {context}

QUESTION: {question}
"""
# Avoid unnecessary phrases like "Based on the provided context, the answer to the question is:".

template3 = """
Please provide a detailed and well-structured response to the question below, using only the information provided in the context. If the context does not contain information related to the question, explicitly state that there is no relevant information in the provided context. Be polite and helpful. Also, provide a confidence level from 0 to 100% in your response based on how certain you are about the information you have provided.

CONTEXT: {context}

QUESTION: {question}
"""
prompt = PromptTemplate(template=template2, input_variables=["question", "context"])

combine_template = """
Your task is to answer the question accurately and in detail, by synthesizing relevant information from the provided answers.
Where applicable, refer to specific section numbers within the context (e.g., "According to section 4.1.2,...").
Do not reveal that the information comes from multiple answers, directly answer the question.

QUESTION: {question}

ANSWER 1: {answer1}

ANSWER 2: {answer2}
"""
# For comparison RAG 1
comparison_template = """
We have provided a question and two answers to it. Generate a comparison section without a heading that states whether the two answers are the same, partially the same, or different. If they are partially the same, state what is the same and what is different. This comparison is based on the answers generated from both contexts. Accuracy and precision are crucial for this task.

QUESTION: {question}

ANSWER 1: {answer1}

ANSWER 2: {answer2}
"""
# For comparison RAG 2
def get_comparision_prompt(query, context1, context2):
    comparison_template = """
    Respond in three sections.
    
    ANSWER 1: This is the first section; answer the question from context 1 here.
    
    ANSWER 2: This is the second section; answer the question from context 2 here.
    
    COMPARISON: This is the third section; state whether the two answers are the same, partially the same, or different. If they are partially the same, state what is the same and what is different.
    This section is based entirely on the answers generated in the first and second sections.
    
    Please answer the question solely based on the provided context. If you cannot answer the question from one of the contexts, state that there is no answer in that context. This is very important for my life, so be very precise and accurate in answering the question and in the comparison.

    QUESTION: {question}

    CONTEXT1: {context1}

    CONTEXT2: {context2}
    """
    comparison_prompt = comparison_template.format(context1 = context1,context2 = context2, question = query)
    return comparison_prompt

# Lighting
queries = ["Whats the difference between Grouped and Combined lamps?", "Can dipped-beam headlamp and main-beam headlamp for front lighting system?", "what is color of End Outline marker lamp?", "Can yellow lamp used as front fog lamp?", "Can red color light placed in the front of the vehicle?", "Can white light can be placed at the back of the vehicle?", "What are 1,1,a,1b,2a,2b,5,6 in direction indicator lamps?", "is cornering lamp mandatory?", "does reflective tape come under light and light signalling?", "standard weight of a person for testing?","can dipped beam uses as a main beam?", "what are the light functions to be kept rear of the vehicle?", "What lamp should be fitted for passenger vehicles?"]

In [6]:
# Embedding Model
# embedding_model = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-large-en",
#                                model_kwargs={'device': 'cuda'},
#                                encode_kwargs={'normalize_embeddings': False})

# Section wise chunking
- after removing headers and footers

In [7]:
embedding_model= FastEmbedEmbeddings()

In [8]:
def embed_texts(texts):
    # Embed a list of strings with the shared FastEmbed model.
    return embedding_model.embed_documents(texts)

def get_header_footer(pdf_path, threshold=0.71):
    """Estimate how many leading/trailing lines per page are repeated headers/footers."""
    with pdfplumber.open(pdf_path) as pdf:
        # Sample up to 10 pages, skipping the first 5 (title page, contents) when possible.
        n_pages = len(pdf.pages)
        candidates = range(5, n_pages) if n_pages > 5 else range(n_pages)
        page_nos = random.sample(candidates, min(10, len(candidates)))
        # Extract each sampled page once; extract_text() can return None for empty pages.
        pages_lines = [(pdf.pages[n].extract_text() or '').split('\n') for n in page_nos]

    def count_repeated(from_top):
        # Walk inward one line at a time while that line stays similar across the sampled pages.
        avg_similarity, n_lines = 1, -1
        while avg_similarity > threshold and n_lines < 4:
            n_lines += 1
            same_position = [lines[n_lines] if from_top else lines[-(n_lines + 1)]
                             for lines in pages_lines if len(lines) > n_lines]
            if len(same_position) < 2:
                break  # nothing left to compare
            similarities = cosine_similarity(embed_texts(same_position))
            avg_similarity = np.mean(similarities[np.triu_indices(len(similarities), k=1)])
        return n_lines

    return count_repeated(from_top=True), count_repeated(from_top=False)

def extract_text(pdf_path):
    header_lines, footer_lines = get_header_footer(pdf_path)
    with pdfplumber.open(pdf_path) as pdf:
        text = ''
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                lines = page_text.split('\n')
                # Note: this slice drops footer_lines + 1 trailing lines, one more than was detected.
                page_text = '\n'.join(lines[header_lines:-(footer_lines+1)])
                text += page_text + '\n'
        return text
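
The header/footer detection above can be tried out without a PDF or an embedding model. The following is only a minimal sketch of the same idea: it assumes `difflib.SequenceMatcher` as a stand-in for the embedding cosine similarity, uses made-up page text, and the helper names `mean_pairwise_similarity` and `count_repeated_lines` are hypothetical, not part of the notebook.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(lines):
    """Average SequenceMatcher ratio over all unordered pairs of lines."""
    pairs = list(combinations(lines, 2))
    if not pairs:
        return 0.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def count_repeated_lines(pages, from_top=True, threshold=0.71, max_lines=4):
    """Count leading (or trailing) lines that look alike across pages."""
    count = 0
    while count < max_lines:
        candidates = []
        for page in pages:
            lines = page.split("\n")
            if len(lines) > count:
                candidates.append(lines[count] if from_top else lines[-(count + 1)])
        if mean_pairwise_similarity(candidates) <= threshold:
            break
        count += 1
    return count


# Made-up pages: two header lines, one body line, one footer line each.
pages = [
    "Sample Regulation Document\nPage 12\nLamps shall be fitted so that they work.\nDraft for review",
    "Sample Regulation Document\nPage 13\nGrouped lamps share a common lamp body.\nDraft for review",
    "Sample Regulation Document\nPage 14\nThe dipped-beam may remain switched on.\nDraft for review",
]
n_header = count_repeated_lines(pages, from_top=True)
n_footer = count_repeated_lines(pages, from_top=False)

# Strip exactly the detected number of header and footer lines from each page.
body = []
for page in pages:
    lines = page.split("\n")
    body.append("\n".join(lines[n_header:len(lines) - n_footer]))
print(n_header, n_footer, body)
```

A string-similarity stand-in like this handles page numbers that differ by a digit or two; the embedding-based version in the notebook is presumably more tolerant of headers whose wording varies between pages.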

In [9]:
print("approximate no. of tokens", len(extract_text(path4).split()))

approximate no. of tokens 39382


In [10]:
pattern = re.compile(r'\n([1-9]|1[0-9])\. [A-Z][a-zA-Z]+')   #United nations
# pattern = re.compile('\d\. [A-Z]')                          # Smb United nations
# pattern = re.compile(r'\n([1-9]|1[0-9]) [A-Z][a-zA-Z]+')    #Chinese
# pattern = re.compile(r'(\n(1[0-9]|[1-9])\s+[A-Z][a-zA-Z]+.*?)(?=\n(?:[1-9]|1[0-5])\s+[A-Z]|$)', re.DOTALL) #Chinese
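
To see what the active (United Nations) pattern treats as a section heading, it can be run on a short string. The sample text below is made up for illustration; only the pattern itself comes from the cell above.

```python
import re

# Same pattern as the active one above: a newline, a number from 1 to 19,
# a dot, a space, then a capitalised word.
pattern = re.compile(r'\n([1-9]|1[0-9])\. [A-Z][a-zA-Z]+')

sample = (
    "Preamble text\n"
    "1. Scope\n"
    "This Regulation applies to vehicles.\n"
    "2. Definitions\n"
    "2.1. Sub-clause text that should not start a new section.\n"
    "12. Transitional provisions\n"
)

headings = [m.group(0).strip() for m in pattern.finditer(sample)]
print(headings)
```

Only top-level headings match; `2.1.` is skipped because the dot is not followed by a space. Two limitations follow from the pattern as written: a heading on the very first line of the text (with no preceding newline) is not matched, and section numbers of 20 or more are not matched.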

In [11]:
def section_wise_chunking(pdf_path):
    text = extract_text(pdf_path)
    matches = list(pattern.finditer(text))
    
    # Use the positions of the matches to split the text into sections
    sections = []
    last_index = 0
    for match in matches:
        start, end = match.span()
        section_text = text[last_index:start].strip()
        if section_text:
            sections.append(section_text)
        last_index = start
    if last_index < len(text):
        sections.append(text[last_index:].strip())
    
    # Merge sections under 400 words into the previous chunk; split sections over 800 words
    text_chunks = []
    for i, section in enumerate(sections):
        if i != 0 and (len(section.split()) < 400):
            text_chunks[-1] += "\n"+section
        elif len(section.split()) > 800:
            splitted_chunks = RecursiveCharacterTextSplitter(chunk_size=5000, chunk_overlap=300).split_text(section)
            # Prefix every continuation chunk with the section heading so it keeps its context
            heading = splitted_chunks[0].split('\n')[0]
            text_chunks += splitted_chunks[:1] + [heading+' (Partial)\n'+ chunk for chunk in splitted_chunks[1:]]
        else:
            text_chunks.append(section)
    return text_chunks
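
The merge-and-split step can be checked on toy data. This is a sketch under stated assumptions, not the notebook's implementation: a naive word-window splitter (`split_by_words`, hypothetical) replaces `RecursiveCharacterTextSplitter`, and the thresholds are shrunk from 400/800 words to 5/10 so the behaviour is visible on a few lines of made-up text.

```python
def split_by_words(text, max_words, overlap=0):
    """Naive fixed-size word windows; a stand-in for RecursiveCharacterTextSplitter."""
    words = text.split()
    step = max_words - overlap
    pieces = []
    start = 0
    while start < len(words):
        pieces.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += step
    return pieces


def normalize_sections(sections, min_words, max_words, overlap=0):
    """Merge short sections into the previous chunk and split long ones."""
    chunks = []
    for i, section in enumerate(sections):
        n_words = len(section.split())
        if i != 0 and n_words < min_words:
            # Too short to stand alone: append to the previous chunk.
            chunks[-1] += "\n" + section
        elif n_words > max_words:
            # Too long: split, and tag continuation pieces with the heading.
            heading = section.split("\n")[0]
            pieces = split_by_words(section, max_words, overlap)
            chunks.append(pieces[0])
            chunks.extend(f"{heading} (Partial)\n{piece}" for piece in pieces[1:])
        else:
            chunks.append(section)
    return chunks


sections = [
    "1. Scope\n" + " ".join(["alpha"] * 5),          # 7 words: kept as is
    "2. Definitions\n" + " ".join(["beta"] * 2),     # 4 words: merged into section 1
    "3. Requirements\n" + " ".join(["gamma"] * 18),  # 20 words: split in two
]
chunks = normalize_sections(sections, min_words=5, max_words=10)
for chunk in chunks:
    print(repr(chunk))
```

As in the notebook's version, a short section that follows a split section is merged into the last "(Partial)" piece, since it is simply appended to whatever chunk came before it.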

In [12]:
print("chunking....")
chunks = section_wise_chunking(path4)
print("No. of chunks created:", len(chunks))


chunking....
No. of chunks created: 53


In [13]:
# print("No. of chunks:", len(chunks))
# for i, chunk in enumerate(chunks):
#     print(chunk)
#     print("\n" + "_"*80,len(chunk.split()), "Words\n")

# Get vectorstore
- FAISS.from_documents takes a list of Document objects as its argument


In [26]:
# for RecursiveCharacterTextSplitter
def get_vectorstore1(path):
    text = extract_text(path)
    texts = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_text(text)
    docs = [Document(text) for text in texts if text.strip()]
#     docs = PyPDFLoader(path).load_and_split(RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, add_start_index=True))
    vectorstore = FAISS.from_documents(docs, embedding_model)
    return vectorstore

In [27]:
# for section_wise_chunking
def get_vectorstore2(path):
    texts = section_wise_chunking(path)
    docs = [Document(text) for text in texts if text.strip()]
    vectorstore = FAISS.from_documents(docs, embedding_model)
    return vectorstore

# Double retrieval RAG

In [28]:
%%time
retriever1 = get_vectorstore1(path4).as_retriever(search_kwargs={"k": 5})

CPU times: user 3min, sys: 599 ms, total: 3min
Wall time: 1min 43s


In [24]:
%%time
retriever2 = get_vectorstore2(path4).as_retriever()

CPU times: user 49.6 s, sys: 73.6 ms, total: 49.7 s
Wall time: 38 s


In [39]:
def Double_RAG(query):
    
    chain1 = RetrievalQA.from_llm(llm=llama3, retriever=retriever1, prompt= prompt)
    chain2 = RetrievalQA.from_llm(llm=llama3, retriever=retriever2, prompt= prompt)
    
    answer1 = chain1.invoke(query)['result']
    answer2 = chain2.invoke(query)['result']
    
    print("ANSWER1:", answer1)
    print("ANSWER2:", answer2)

    combine_prompt = combine_template.format(question = query,answer1 = answer1, answer2 = answer2)
    response = llama3.invoke(combine_prompt).content
    
    return response
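
The control flow of `Double_RAG` can be exercised without Groq or a vector store by passing the two answering steps and the LLM in as plain callables. The sketch below is illustrative only: `double_rag`, the stub functions, and their canned answers are hypothetical stand-ins for the two `RetrievalQA` chains and the `llama3` model, and the template is a shortened version of `combine_template`.

```python
COMBINE_TEMPLATE = (
    "QUESTION: {question}\n\n"
    "ANSWER 1: {answer1}\n\n"
    "ANSWER 2: {answer2}\n"
)


def double_rag(query, answer_fine, answer_section, llm):
    """Answer with two retrieval pipelines, then ask the LLM to merge the answers."""
    answer1 = answer_fine(query)
    answer2 = answer_section(query)
    prompt = COMBINE_TEMPLATE.format(question=query, answer1=answer1, answer2=answer2)
    return llm(prompt)


# Stubs standing in for the two RetrievalQA chains and the Groq model.
seen_prompts = []


def stub_fine(query):
    return "Per section 6.20.1, a cornering lamp is optional."


def stub_section(query):
    return "There is no relevant information available."


def stub_llm(prompt):
    seen_prompts.append(prompt)  # record what the combining LLM would receive
    return "A cornering lamp is optional (section 6.20.1)."


response = double_rag("is cornering lamp mandatory?", stub_fine, stub_section, stub_llm)
print(response)
```

Passing the chains in as arguments also means they can be built once outside the function instead of on every query, which may matter when looping over many queries.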

In [43]:
saved_responses = []
for query in queries:
    print("QUERY:",query)
    # Retrieve once per retriever and report the word count of every returned chunk
    print("---- RecursiveCharacterTextSplitter tokens:",[len(doc.page_content.split()) for doc in retriever1.invoke(query)])
    print("---- section_wise_chunking tokens:",[len(doc.page_content.split()) for doc in retriever2.invoke(query)])
    s = time.time()
    response = Double_RAG(query) 
    print("COMBINED ANSWER:")
    display(Markdown(response))
    print("_"*60, " generated in", time.time()-s)
    saved_responses.append({"query": query, "response": response})

QUERY: Whats the difference between Grouped and Combined lamps?
---- RecursiveCharacterTextSplitter tokens: [140, 150, 151, 149, 161]
---- section_wise_chunking tokens: [752, 784, 756, 789]
ANSWER1: According to section 2.7.4 and 2.7.5 of the context, the difference between Grouped and Combined lamps is as follows:

* Grouped lamps (2.7.4) are devices having separate apparent surfaces in the direction of the reference axis and separate light sources, but a common lamp body.
* Combined lamps (2.7.5) are devices having separate apparent surfaces in the direction of the reference axis, but a common light source and a common lamp body.

In other words, Grouped lamps have separate light sources, while Combined lamps share a common light source.
ANSWER2: According to the provided context, specifically section 2.7.3 and 2.7.5, the difference between Grouped and Combined lamps is as follows:

**Grouped lamps** (section 2.7.3) are devices having separate apparent surfaces in the direction of th

The difference between Grouped and Combined lamps lies in the number of light sources they have. Grouped lamps are devices that have separate apparent surfaces in the direction of the reference axis, separate light sources, but a common lamp body. On the other hand, Combined lamps are devices that have separate apparent surfaces in the direction of the reference axis, but a common light source and a common lamp body. In other words, Grouped lamps have multiple light sources, one for each apparent surface, whereas Combined lamps have a single light source that serves all apparent surfaces.

____________________________________________________________  generated in 10.609696865081787
QUERY: Can dipped-beam headlamp and main-beam headlamp for front lighting system?
---- RecursiveCharacterTextSplitter tokens: [151, 138, 145, 141, 142]
---- section_wise_chunking tokens: [820, 799, 784, 762]
ANSWER1: According to section 6.2.7.2, "The dipped-beam may remain switched on at the same time as the main beams." This implies that it is allowed to have both dipped-beam headlamps and main-beam headlamps switched on simultaneously for the front lighting system.
ANSWER2: According to the provided context, the answer is yes. 

In section 6.1.7.4, it is stated that "The main-beam headlamps may be switched on either simultaneously or in pairs." and "For changing over from the dipped to the main beam at least one pair of main-beam headlamps shall be switched on." This implies that both dipped-beam headlamps and main-beam headlamps can be used for front lighting systems.

Additionally, in sec

Yes, it is allowed to have both dipped-beam headlamps and main-beam headlamps for the front lighting system. According to section 6.2.7.2, the dipped-beam may remain switched on at the same time as the main beams, implying that it is permissible to have both types of headlamps switched on simultaneously. Furthermore, section 6.1.7.4 states that the main-beam headlamps may be switched on either simultaneously or in pairs, and that at least one pair of main-beam headlamps shall be switched on when changing over from the dipped to the main beam. This suggests that both dipped-beam headlamps and main-beam headlamps can be used for front lighting systems, and that they can be used together or separately as needed.

____________________________________________________________  generated in 69.77866220474243
QUERY: what is color of End Outline marker lamp?
---- RecursiveCharacterTextSplitter tokens: [156, 158, 142, 170, 164]
---- section_wise_chunking tokens: [756, 792, 810, 802]
ANSWER1: According to section 5.15, the color of the End-outline marker lamp is White in front and Red at the rear.
ANSWER2: According to the provided context, the color of the End Outline marker lamp is not explicitly specified. However, it is mentioned in section 5 that "Conspicuity marking: White to the front; White or yellow to the side; Red or yellow to the rear." Since the End Outline marker lamp is a type of conspicuity marking, it can be inferred that it may be white or yellow, but the exact color is not specified.
COMBINED ANSWER:


According to section 5.15, the color of the End-outline marker lamp is White in front and Red at the rear.

____________________________________________________________  generated in 67.16660213470459
QUERY: Can yellow lamp used as front fog lamp?
---- RecursiveCharacterTextSplitter tokens: [162, 161, 157, 167, 142]
---- section_wise_chunking tokens: [784, 756, 789, 1095]
ANSWER1: According to the provided context, there is no relevant information available that suggests a yellow lamp can be used as a front fog lamp. The context only mentions the orientation, vertical inclination, and electrical connections of front fog lamps, but it does not specify the color of the lamp. Therefore, it cannot be determined from the provided context whether a yellow lamp can be used as a front fog lamp.
ANSWER2: According to section 5.15 of the provided context, a front fog lamp can emit either white or selective yellow light. Therefore, a yellow lamp can be used as a front fog lamp.
COMBINED ANSWER:


According to section 5.15, a front fog lamp can emit either white or selective yellow light, which implies that a yellow lamp can be used as a front fog lamp.

____________________________________________________________  generated in 71.42583799362183
QUERY: Can red color light placed in the front of the vehicle?
---- RecursiveCharacterTextSplitter tokens: [168, 142, 151, 170, 131]
---- section_wise_chunking tokens: [802, 756, 833, 788]
ANSWER1: According to section 5.10.1, "For the visibility of red light towards the front of a vehicle, with the exception of a red rearmost side-marker lamp, there shall be no direct visibility of the apparent surface of a red lamp if viewed by an observer moving within Zone 1 as specified in Annex 4."

This implies that red color light cannot be placed in the front of the vehicle, except for a red rearmost side-marker lamp.
ANSWER2: According to the provided context, there is no specific information that prohibits the use of red color light in the front of a vehicle. However, it is mentioned in section 5 of the context that "Conspicuity marking: White to the front; White or yellow to the side; Red or yellow 

According to section 5.10.1, red color light cannot be placed in the front of a vehicle, with the exception of a red rearmost side-marker lamp. This implies that direct visibility of a red lamp towards the front of a vehicle is not allowed, as viewed by an observer moving within Zone 1 as specified in Annex 4. While section 5 suggests that red color is not typically used for conspicuity markings in the front of a vehicle, it does not explicitly prohibit its use. However, the explicit restriction in section 5.10.1 takes precedence, indicating that red color light should not be placed in the front of a vehicle, except for the specified exception.

____________________________________________________________  generated in 70.59026646614075
QUERY: Can white light can be placed at the back of the vehicle?
---- RecursiveCharacterTextSplitter tokens: [142, 150, 168, 151, 170]
---- section_wise_chunking tokens: [756, 802, 788, 807]
ANSWER1: According to section 5.10.2, "For the visibility of white light towards the rear, with the exception of reversing lamps and white side conspicuity markings fitted to the vehicle, there shall be no direct visibility of the apparent surface of a white lamp if viewed by an observer moving within Zone 2 in a transverse plane situated 25 m behind the vehicle (see Annex 4)". This implies that white light is not allowed to be placed at the back of the vehicle, except for reversing lamps and white side conspicuity markings.
ANSWER2: According to section 5 of the provided context, "Conspicuity marking: ... Red or yellow to the rear." This implies that white light is not allowed at the back of the vehicle.
C

According to the regulations, white light cannot be placed at the back of the vehicle, with the exception of reversing lamps and white side conspicuity markings. This is because, as stated in section 5.10.2, there shall be no direct visibility of the apparent surface of a white lamp if viewed by an observer moving within Zone 2 in a transverse plane situated 25 m behind the vehicle. Additionally, section 5 of the context specifies that conspicuity markings on the rear of the vehicle should be red or yellow, further supporting the notion that white light is not permitted at the back of the vehicle.

____________________________________________________________  generated in 66.83231687545776
QUERY: What are 1,1,a,1b,2a,2b,5,6 in direction indicator lamps?
---- RecursiveCharacterTextSplitter tokens: [158, 179, 192, 151, 170]
---- section_wise_chunking tokens: [876, 827, 789, 756]
ANSWER1: According to the provided context, 1, 1a, 1b, 2a, 2b, 5, and 6 refer to categories of direction-indicator lamps.

Specifically:

* Categories 1, 1a, and 1b refer to front direction-indicator lamps (section 6.5.3).
* Categories 2a and 2b refer to rear direction-indicator lamps (section 6.5.3).
* Categories 5 and 6 refer to side direction-indicator lamps (section 6.5.3).

These categories are used to define the arrangement and requirements for direction-indicator lamps on vehicles.
ANSWER2: According to the provided context, 1, 1a, 1b, 2a, 2b, 5, and 6 are categories of direction-indicator lamps.

Specifically, they refer to the following:

* 1, 1a, and 1b: Front direction-indicator lamp categories
*

In direction indicator lamps, the numbers 1, 1a, 1b, 2a, 2b, 5, and 6 refer to specific categories of direction-indicator lamps. These categories define the arrangement and requirements for direction-indicator lamps on vehicles. Specifically, categories 1, 1a, and 1b refer to front direction-indicator lamps, categories 2a and 2b refer to rear direction-indicator lamps, and categories 5 and 6 refer to side direction-indicator lamps. These categories are used to ensure that direction-indicator lamps are properly installed and function correctly on vehicles.

____________________________________________________________  generated in 70.31753826141357
QUERY: is cornering lamp mandatory?
---- RecursiveCharacterTextSplitter tokens: [154, 143, 153, 149, 137]
---- section_wise_chunking tokens: [789, 802, 833, 756]
ANSWER1: According to section 6.20.1, the presence of a cornering lamp is optional on motor vehicles. Therefore, the answer is no, the cornering lamp is not mandatory.
ANSWER2: According to section 6.20.1, the presence of a cornering lamp is optional on motor vehicles.
COMBINED ANSWER:


According to section 6.20.1, the presence of a cornering lamp is optional on motor vehicles. Therefore, the answer is no, the cornering lamp is not mandatory.

____________________________________________________________  generated in 67.57276678085327
QUERY: does reflective tape come under light and light signalling?
---- RecursiveCharacterTextSplitter tokens: [167, 174, 145, 168, 160]
---- section_wise_chunking tokens: [783, 784, 784, 756]
ANSWER1: According to the provided context, there is no relevant information available that directly answers the question of whether reflective tape comes under light and light signalling. The context primarily focuses on lamps, light-signalling devices, and their photometric requirements, but it does not mention reflective tape.
ANSWER2: Based on the provided context, reflective tape or conspicuity markings are mentioned in Section 5.15, which lists the colors of light emitted by various lamps. Conspicuity markings are specified as having the following colors: White to the front, White or yellow to the side, and Red or yellow to the rear.

However, conspicuity markings are not considered as light sources

Reflective tape, also referred to as conspicuity markings, does not come under light and light signalling. While it is mentioned in Section 5.15, it is not considered a light source or light-signalling device. Instead, it is a passive reflective material that reflects light from other sources, such as headlights, to increase visibility. The colors of conspicuity markings are specified as white to the front, white or yellow to the side, and red or yellow to the rear. In contrast, light and light signalling refer to lamps and light-signalling devices that emit light, which are listed in Section 5.15, but do not include conspicuity markings.

____________________________________________________________  generated in 66.5406403541565
QUERY: standard weight of a person for testing?
---- RecursiveCharacterTextSplitter tokens: [206, 161, 153, 156, 194]
---- section_wise_chunking tokens: [535, 791, 1095, 810]
ANSWER1: According to paragraph 1 of the context, the standard weight of a person for testing is 75 kg.
ANSWER2: There is no relevant information available in the provided context about the standard weight of a person for testing.
COMBINED ANSWER:


According to the available information, the standard weight of a person for testing is 75 kg.

____________________________________________________________  generated in 68.38029956817627
QUERY: can dipped beam uses as a main beam?
---- RecursiveCharacterTextSplitter tokens: [151, 151, 142, 145, 141]
---- section_wise_chunking tokens: [799, 820, 783, 784]
ANSWER1: According to the provided context, there is no information that suggests a dipped beam can be used as a main beam. In fact, the context consistently distinguishes between dipped-beam headlamps and main-beam headlamps, and provides separate regulations for each.

For example, section 6.2.7.1 states that "The control for changing over to the dipped-beam shall switch off all main-beam headlamps simultaneously." This implies that the dipped beam and main beam are two separate entities that cannot be used interchangeably.

Additionally, section 6.1.7.4 states that "For changing over from the dipped to the main beam at least one pair of main-beam headlamps shall be switched on. For changing over from the main-beam to the dip

Based on the provided context, it appears that a dipped beam cannot be used as a main beam. The regulations consistently distinguish between dipped-beam headlamps and main-beam headlamps, and provide separate regulations for each. For example, section 6.2.7.1 states that the control for changing over to the dipped-beam shall switch off all main-beam headlamps simultaneously, implying that the dipped beam and main beam are two separate entities that cannot be used interchangeably.

Additionally, the main purpose of a dipped-beam headlamp is to provide a lower beam of light that does not dazzle oncoming traffic, whereas a main-beam headlamp is designed to provide a higher beam of light for better visibility at higher speeds. Using a dipped-beam as a main beam might not provide the same level of visibility and could potentially cause glare for oncoming traffic.

Although section 6.2.7.2 allows the dipped-beam to remain switched on at the same time as the main-beams, it does not explicitly state that a dipped-beam can be used as a main beam. Therefore, based on the provided context, it is not recommended to use a dipped-beam as a main beam, as it may not provide the same level of visibility and could cause glare for oncoming traffic.

____________________________________________________________  generated in 75.6738498210907
QUERY: what are the light functions to be kept rear of the vehicle?
---- RecursiveCharacterTextSplitter tokens: [151, 153, 156, 150, 170]
---- section_wise_chunking tokens: [802, 756, 819, 770]
ANSWER1: According to the provided context, the light functions to be kept rear of the vehicle are:

* Rear position lamp (Section 6.10)
* Rear light-signalling devices (Section 6.1.7.3)
* End-outline marker lamp (Section 6.13)

Note that these sections provide specific requirements and regulations for these light functions, including their orientation, electrical connections, and tell-tale indicators.
ANSWER2: According to the provided context, the light functions to be kept at the rear of the vehicle are:

1. Rear position lamps (Regulation No. 7) - mandatory on motor vehicles and trailers (Section 6.10).
2. Rear fog lamp (Regulation No. 38) - mandatory on motor vehicles, optional on trailers (Section 6

The light functions to be kept rear of the vehicle are:

* Rear position lamp (mandatory on motor vehicles and trailers)
* Rear fog lamp (mandatory on motor vehicles, optional on trailers)
* Rear retro-reflector, non-triangular (mandatory on motor vehicles, optional on trailers)
* End-outline marker lamp (mandatory on vehicles exceeding 2.10 m in width, optional on vehicles between 1.80 and 2.10 m in width, and on chassis-cabs)
* Rear light-signalling devices
* Parking lamp (optional on motor vehicles not exceeding 6 m in length and not exceeding 2 m in width, prohibited on all other vehicles)

These light functions are subject to specific requirements and regulations, including their orientation, electrical connections, and tell-tale indicators, as outlined in the relevant sections.

____________________________________________________________  generated in 69.54209089279175
QUERY: What lamp should be fitted for passenger vehicles?
---- RecursiveCharacterTextSplitter tokens: [170, 142, 158, 148, 164]
---- section_wise_chunking tokens: [770, 802, 788, 756]
ANSWER1: According to the provided context, there is no specific requirement for a particular lamp to be fitted for passenger vehicles. However, it can be inferred that certain lamps are optional or mandatory depending on the vehicle's dimensions and category.

For example, parking lamps are optional for motor vehicles not exceeding 6 m in length and not exceeding 2 m in width (Section 6.12.1). End-outline marker lamps are mandatory for vehicles exceeding 2.10 m in width and optional for vehicles between 1.80 and 2.10 m in width (Section 6.13.1).

It is also important to note that the context provides requirements for various lamps, such as rear fog-lamps, stop-lamps, and courtesy lamps, but it does not specify a 

Based on the provided context, there is no specific requirement for a particular lamp to be fitted for passenger vehicles. However, certain lamps are optional or mandatory depending on the vehicle's dimensions and category. For instance, parking lamps are optional for motor vehicles not exceeding 6 m in length and not exceeding 2 m in width (Section 6.12.1), while end-outline marker lamps are mandatory for vehicles exceeding 2.10 m in width and optional for vehicles between 1.80 and 2.10 m in width (Section 6.13.1). The context provides general specifications and individual specifications for various lamps, such as rear fog-lamps, stop-lamps, and courtesy lamps, but it does not specify a particular lamp that must be fitted for passenger vehicles.

____________________________________________________________  generated in 67.76967453956604


In [None]:
df = pd.DataFrame(saved_responses)
df.to_excel('responses_DRAG_detailed2.xlsx', index=False)

In [42]:
# query = "Whats the difference between Grouped and Combined lamps?"
# for i in range(4):
#     print(retriever2.invoke(query)[i].page_content)
#     print("_"*80)

## RAG 1
- Using FAISS retriever
- Advantage: supports any k
- RetrievalQA chain

In [None]:
%%time
# retriever = get_vectorstore_2(path4).as_retriever(search_kwargs={"k": 4})
retriever = vectorstore_path4.as_retriever()

In [None]:
def RAG1(query):
    qa_chain = RetrievalQA.from_llm(llm=llama3, retriever=retriever, prompt= prompt)
    return qa_chain.invoke(query)['result']

In [None]:
data = []
for query in queries4:
    print("Query:",query)
    print("Tokens:",[len(retriever.invoke(query)[i].page_content.split()) for i in range(4)])   
    response = RAG1(query) 
    display(Markdown(response))
    print("_"*100)
    data.append({"query": query, "response": response})

In [None]:
df = pd.DataFrame(data)
df.to_excel('responses10_prompt3.xlsx', index=False)

In [None]:
q = "what is color of End Outline marker lamp?"
for context in retriever.invoke(q):
    print(context.page_content)
    print("_"*80)

# Comparison_RAG2
- using PDR (ParentDocumentRetriever)
- Fixed k = 4
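
The parent-document idea (match the query against small child chunks, then hand the larger parent chunk to the LLM) can be sketched without a vector store. This is a minimal illustration, not LangChain's implementation: `split_text`, `build_index`, `retrieve_parents` and the word-overlap score are hypothetical stand-ins for the text splitter, the embedding search and the docstore.

```python
from dataclasses import dataclass
from typing import List, Tuple


def split_text(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Fixed-width character windows; a stand-in for a real text splitter."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


@dataclass
class ChildChunk:
    text: str
    parent_id: int  # index of the parent chunk this child was cut from


def build_index(document: str, parent_size: int = 1500, child_size: int = 300,
                child_overlap: int = 100) -> Tuple[List[str], List[ChildChunk]]:
    """Cut the document into parents, then cut every parent into children."""
    parents = split_text(document, parent_size)
    children = [
        ChildChunk(piece, parent_id)
        for parent_id, parent in enumerate(parents)
        for piece in split_text(parent, child_size, child_overlap)
    ]
    return parents, children


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score (shared lowercase words); assumption, not an embedding."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query: str, parents: List[str], children: List[ChildChunk],
                     k: int = 4) -> List[str]:
    """Rank the children, then return the distinct parents of the top-k children."""
    ranked = sorted(children, key=lambda c: overlap_score(query, c.text), reverse=True)
    parent_ids: List[int] = []
    for child in ranked[:k]:
        if child.parent_id not in parent_ids:
            parent_ids.append(child.parent_id)
    return [parents[pid] for pid in parent_ids]
```

Because several top-ranked children can share one parent, the number of parents returned may be smaller than k.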

In [None]:
def PDR(path):
    documents = PyPDFLoader(path).load()
    combined_text = "\n".join(document.page_content for document in documents)
    document = [Document(page_content=combined_text, metadata={"source": path})]

    retriever = ParentDocumentRetriever(vectorstore=Chroma(collection_name="full_documents", embedding_function=FastEmbedEmbeddings()),
                                        docstore=InMemoryStore(),
                                        child_splitter=RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=100),
                                        parent_splitter=RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100))
    
    retriever.add_documents(document, ids=None)
    return retriever

In [None]:
%%time
retriever2 = PDR(path4)

In [None]:
def Comparison_RAG2(query):
    qa_chain = RetrievalQA.from_llm(llm=llama3, retriever=retriever2, prompt= prompt)
    return qa_chain.invoke(query)['result']

In [None]:
data = []
for query in queries4:
    print("#",query)
    response = Comparison_RAG2(query) 
    print(response)
    print("---------------------------------------------------------------------")
    data.append({"query": query, "response": response})

In [None]:
# Save responses and export
df = pd.DataFrame(data)
df.to_excel('query_responses10.xlsx', index=False)

# Comparison_RAG 1
- RAG for comparison
- by FAISS Retrieval
- by calling LLM 3 times
- chain - RetrievalQA
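
The three-call flow listed above (one answer per document, then one comparison call) can be sketched with plain callables. This is only an outline under stated assumptions: `compare_answers` and `COMPARISON_TEMPLATE` are hypothetical names, and the notebook's own `comparison_template` is defined elsewhere and may be worded differently.

```python
from typing import Callable, Dict

# Hypothetical template; the notebook's comparison_template is not shown here.
COMPARISON_TEMPLATE = (
    "Question: {question}\n\n"
    "Answer from document 1:\n{answer1}\n\n"
    "Answer from document 2:\n{answer2}\n\n"
    "Compare the two answers and list the differences."
)


def compare_answers(query: str,
                    answer_doc1: Callable[[str], str],
                    answer_doc2: Callable[[str], str],
                    llm: Callable[[str], str]) -> Dict[str, str]:
    """Run two per-document QA calls, then a third call that compares them."""
    # Calls 1 and 2: one retrieval-augmented answer per document.
    answer1 = answer_doc1(query)
    answer2 = answer_doc2(query)
    # Call 3: the LLM sees only the question and the two answers, not the raw context.
    comparison = llm(COMPARISON_TEMPLATE.format(
        question=query, answer1=answer1, answer2=answer2))
    return {"answer1": answer1, "answer2": answer2, "comparison": comparison}
```

Keeping the comparison call free of raw context keeps its prompt short, at the cost of two extra LLM round trips per query.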

In [None]:
%%time
retriever1 = get_vectorstore(path1).as_retriever(search_kwargs={"k": 6})
retriever2 = get_vectorstore(path2).as_retriever(search_kwargs={"k": 6})

In [None]:
def Comparison_RAG1(query):
    llm = llama3
    
    qa_chain1 = RetrievalQA.from_llm(llm=llm, retriever=retriever1, prompt= prompt,)
    qa_chain2 = RetrievalQA.from_llm(llm=llm, retriever=retriever2, prompt= prompt)

    answer1 = qa_chain1.invoke(query)['result']
    answer2 = qa_chain2.invoke(query)['result']
    
    comparison_prompt = comparison_template.format(question = query,answer1 = answer1, answer2 = answer2)
    comparison = llm.invoke(comparison_prompt).content
    
    response = f"**ANSWER 1**: {answer1}\n\n**ANSWER 2**: {answer2}\n\n**COMPARISON**: {comparison}"
    return response

In [None]:
# AEBS
queries2 = ["Explain the test procedures in detail, in details.", "What are warning indications, in details."]
for query in queries2:
    display(Markdown(Comparison_RAG1(query)))

# Comparison_RAG 2
- by PDR
- 3 LLM calls

In [None]:
retriever1 = PDR(path1)
retriever2 = PDR(path2)

In [None]:
def Comparison_RAG2(query):
    context1 = get_context(query, retriever1)
    context2 = get_context(query, retriever2)

    qa_prompt1 = qa_template.format(question = query,context = context1)
    qa_prompt2 = qa_template.format(question = query,context = context2)
    
    answer1 = llama3.invoke(qa_prompt1).content
    answer2 = llama3.invoke(qa_prompt2).content
    
    comparison_prompt = comparison_template.format(question = query,answer1 = answer1, answer2 = answer2)
    comparison = llama3.invoke(comparison_prompt).content
    
    response = f"**ANSWER 1**: {answer1}\n\n**ANSWER 2**: {answer2}\n\n**COMPARISON**: {comparison}"
    return response

# Comparison_RAG 3
- Vectorstore to context, built from scratch
- Disadvantage: one prompt and one LLM call, so too much load falls on generation
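
A single-call variant has to fit both contexts and the question into one prompt. The sketch below shows one way that prompt could be assembled, with a per-document word budget to limit the load on generation. `build_comparison_prompt`, `truncate_words`, the tag names and the 600-word default are all assumptions made for illustration; the notebook's `get_comparision_prompt` is defined elsewhere and may differ.

```python
def truncate_words(text: str, max_words: int) -> str:
    """Keep at most max_words whitespace-separated words."""
    return " ".join(text.split()[:max_words])


def build_comparison_prompt(query: str, context1: str, context2: str,
                            max_words_per_context: int = 600) -> str:
    """Place both (truncated) contexts and the question in one prompt."""
    c1 = truncate_words(context1, max_words_per_context)
    c2 = truncate_words(context2, max_words_per_context)
    return (
        "Answer the question for each document, then compare the answers.\n\n"
        f"<document1>\n{c1}\n</document1>\n\n"
        f"<document2>\n{c2}\n</document2>\n\n"
        f"Question: {query}"
    )
```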

In [None]:
def get_context(query, vectorstore):
    # similarity_search_with_relevance_scores returns (Document, score) pairs
    retrieved_docs = vectorstore.similarity_search_with_relevance_scores(query, k = 6)
    context = ""
    for doc, _score in retrieved_docs:
        context += doc.page_content
    return context

In [None]:
vectorstore1 = get_vectorstore(path1)
vectorstore2 = get_vectorstore(path2)

In [None]:
def Comparison_RAG3(query):
    context1 = get_context(query, vectorstore1)
    context2 = get_context(query, vectorstore2)
    
    prompt = get_comparision_prompt(query, context1, context2)
    
    return llama3.invoke(prompt).content

In [None]:
#AEBS
queries1 = ["Explain the test procedures in detail.","When collision early warning signal shall be sent?","What are warning indications, in details.","What AEBS should do in vehicle ignition?","The total speed reduction of the subject vehicle at the time of the collision with the stationary target shall be not less than how many kilometers per hour?"]

In [None]:
for query in queries1:
    print("#",query)
    display(Markdown(Comparison_RAG3(query)))

### Some notes

In [None]:
## FAISS
# Neither method takes k as a parameter
# Both return the list of retrieved documents
            
# retriever1.get_relevant_documents(query) # deprecated
# retriever1.invoke(query) # use this instead

In [None]:
retriever.invoke

In [None]:
retriever.get_relevant_documents("degrees on a sphere", k =1)

In [None]:
import inspect
signature = inspect.signature(retriever.invoke)

for param_name, param in signature.parameters.items():
    print(f"Parameter: {param_name}")
    print(f"  Default: {param.default}")
    print(f"  Annotation: {param.annotation}")
    print()

In [None]:
import inspect
signature = inspect.signature(retriever.get_relevant_documents)

for param_name, param in signature.parameters.items():
    print(f"Parameter: {param_name}")
    print(f"  Default: {param.default}")
    print(f"  Annotation: {param.annotation}")
    print()

In [None]:
import inspect
signature = inspect.signature(retriever1.get_relevant_documents)

for param_name, param in signature.parameters.items():
    print(f"Parameter: {param_name}")
    print(f"  Default: {param.default}")
    print(f"  Annotation: {param.annotation}")
    print()

In [None]:
import inspect
signature = inspect.signature(get_vectorstore(path4).as_retriever)

for param_name, param in signature.parameters.items():
    print(f"Parameter: {param_name}")
    print(f"  Default: {param.default}")
    print(f"  Annotation: {param.annotation}")
    print()
    

In [None]:
#multivector embeddings have best performance for retrieval https://www.rungalileo.io/blog/mastering-rag-how-to-select-an-embedding-model

In [None]:
print(llama3.invoke("Do you know how retrieval-augmented generation (RAG) works? You (the LLM) will be given some context extracted from the whole document (a PDF) to generate the answer. The problem is that sometimes the extracted context is not very relevant, and we cannot be sure whether that is the retriever's fault (the relevant information may be present in the PDF, but the retriever failed to extract it). So if the context is irrelevant, you cannot say that the PDF does not provide information about the query. Do you understand what I am saying? If I tell you in the prompt that you are being used in a RAG implementation, will you be able to act accordingly? Will it improve the responses? Write a good prompt that addresses this issue, and ensure that the prompt is short and understandable by an LLM.").content)

In [None]:
print(llama3.invoke("Telling an LLM that it is being used in a retrieval-augmented generation implementation: how will this affect the performance or behaviour of the LLM?").content)

In [None]:
# Optional way for RAG2
# def get_context(query, retriever):
#     retrieved_docs = retriever.get_relevant_documents(query)
#     context = " ".join([doc.page_content for doc in retrieved_docs])
#     return context

# def RAG2(query):
#     context = get_context(query,retriever1)
#     qa_prompt = template2.format(question = query,context = context)
#     return llama3.invoke(qa_prompt).content

- llama3 can be served from Groq or Ollama
- Possible reasons for latency in the pdr-llama2-chromadb code
  - llama2 downloaded in local computer
  - slow parent document retriever
  - small cpu
- Giskard for evaluation
  - By default uses gpt4 but llama3 also could be connected
  - It finds contexts in the PDF/URL and generates questions from them; each context is then matched against the RAG response to evaluate the RAG.
- Use an image processing model (e.g., CLIP, a Vision Transformer, or a CNN) to convert images into embeddings or textual descriptions.

# Code given (ss)

In [None]:
!pip install pdfplumber

In [None]:
! pip install anthropic

In [None]:
# %%
import re
import json
import time
import pdfplumber
import boto3
from botocore.config import Config
from anthropic import Anthropic

In [None]:
# Initialize Anthropic client
client = Anthropic()

# Constants
MAX_ATTEMPTS = 1
session_cache = {}

# Configure boto3 clients with extended timeout
my_config = Config(
    connect_timeout=60 * 3,
    read_timeout=60 * 3,
)
bedrock = boto3.client(service_name='bedrock-runtime', region_name='eu-central-1', config=my_config)
bedrock_service = boto3.client(service_name='bedrock', config=my_config, region_name='eu-central-1')

def ask_claude(messages, system="", DEBUG=False, model='sonnet'):
    '''
    Send a prompt to Bedrock and return the response.

    Args:
    - messages (str or list): A single message or a list of role/message pairs.
    - system (str): Optional. The system prompt sent with the messages.
    - DEBUG (bool): Optional. If True, print debug information.
    - model (str): Optional. Currently ignored; the model ID is hard-coded below.

    Returns:
    - list: [raw_prompt_text, response_text] containing the original prompt and the response received.
    '''
    raw_prompt_text = str(messages)
    
    if isinstance(messages, str):
        messages = [{"role": "user", "content": messages}]
    
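    prompt_json = {
        "system": system,
        "messages": messages,
        # Claude 3 Sonnet caps output at 4096 tokens; larger values are rejected
        "max_tokens": 4096,
        "temperature": 0.1,
        # Bedrock's Anthropic Messages API requires this version string
        "anthropic_version": "bedrock-2023-05-31",
        "top_k": 500,
        "stop_sequences": ["\n\nHuman:"]
    }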
    
    # if DEBUG:
    #     print("Sending:\nSystem:\n", system, "\nMessages:\n", "\n".join(messages))
    
    modelId = 'anthropic.claude-3-sonnet-20240229-v1:0'
    
    # if raw_prompt_text in session_cache:
    #     return [raw_prompt_text, session_cache[raw_prompt_text]]
    
    attempt = 1
    while True:
        try:
            response = bedrock.invoke_model(body=json.dumps(prompt_json), modelId=modelId, accept='application/json', contentType='application/json')
            response_body = json.loads(response['body'].read())
            results = response_body.get("content", [{}])[0].get("text", "")
            if DEBUG:
                print("Received:", results)
            break
        except Exception as e:
            print("Error with calling Bedrock: " + str(e))
            attempt += 1
            if attempt > MAX_ATTEMPTS:
                print("Max attempts reached!")
                results = str(e)
                break
            else:
                time.sleep(10)
    
    # session_cache[raw_prompt_text] = results
    return [raw_prompt_text, results]
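
The retry loop in `ask_claude` can be factored into a small reusable helper. The sketch below is an alternative design, not the code above: `call_with_retries` is a hypothetical name, and unlike `ask_claude` it re-raises the last exception instead of returning the error text as if it were a model response. The `sleep` argument is injectable so the helper can be exercised without waiting.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T],
                      max_attempts: int = 3,
                      delay_seconds: float = 10.0,
                      sleep: Callable[[float], None] = time.sleep) -> Tuple[T, int]:
    """Call fn() until it succeeds or max_attempts is reached.

    Returns (result, attempts_used). Re-raises the last exception when
    every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(), attempt
        except Exception:
            # Out of attempts: surface the error to the caller.
            if attempt == max_attempts:
                raise
            # Otherwise wait a fixed delay before trying again.
            sleep(delay_seconds)
    raise AssertionError("unreachable")
```

A call such as `call_with_retries(lambda: bedrock.invoke_model(...), max_attempts=3)` would then wrap the Bedrock request, assuming the request is safe to repeat.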

### Section-wise chunking (ss)

In [None]:
# Initialize Anthropic client
client = Anthropic()

def count_tokens(text):
    '''
    Count the number of tokens in the provided text using Anthropic's client.

    Args:
    - text (str): The text for which tokens need to be counted.

    Returns:
    - int: Number of tokens in the text.
    '''
    return client.count_tokens(text)

def extract_all_text(pdf_path):
    '''
    Extract all text from a PDF document.

    Args:
    - pdf_path (str): Path to the PDF file.

    Returns:
    - dict: Dictionary where keys are page numbers and values are extracted text strings.
    '''
    with pdfplumber.open(pdf_path) as pdf:
        text_dict = {}
        for i, page in enumerate(pdf.pages, start=1):
            if i == 1:
                text_dict[i] = page.extract_text(layout=False, strip=True, return_chars=True)
                
            else:
                text_dict[i] = '\n'.join(line for line in page.extract_text().split('\n') if 'Official Journal of the European Union' not in line)
    return text_dict

# Function to concatenate lines where the element before the period (.) is the same
def concat_lines_by_same_element(text):
    '''
    Concatenate lines where the element before the period (.) is the same into blocks.

    Args:
    - text (str): Text to process.

    Returns:
    - list: List of concatenated blocks.
    '''
    lines = text.split('\n')
    concatenated_blocks = []
    current_block = ""
    current_element = None
    
    for line in lines:
        match = re.match(r'^(\d+)\.', line.strip())
        if match:
            element = match.group(1)
            if current_element is None:
                if current_block:
                    concatenated_blocks.append(current_block.strip())
                current_element = element
                current_block = line
            elif element == current_element:
                current_block += " " + line
            else:
                concatenated_blocks.append(current_block.strip())
                current_element = element
                current_block = line
        else:
            current_block += " " + line
    
    if current_block:
        concatenated_blocks.append(current_block.strip())
    
    return concatenated_blocks

def concatenate_blocks(blocks, max_tokens=2000):
    '''
    Concatenate text blocks ensuring the total token count for each concatenated block does not exceed the specified limit.

    Args:
    - blocks (list): List of text blocks to concatenate.
    - max_tokens (int): Maximum number of tokens allowed per concatenated block (default is 2000).

    Returns:
    - list: List of concatenated blocks where each block's total token count is within the specified limit.
    '''
    concatenated_blocks = []
    temp_block = []
    total_value = 0
    
    for ele in blocks:
        value = count_tokens(ele)
        if total_value + value <= max_tokens:
            temp_block.append(ele)
            total_value += value
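        else:
            # Flush the current group; skip the empty group that would otherwise
            # be emitted when the very first block already exceeds max_tokens
            if temp_block:
                print(f"Total tokens for current block: {total_value}")
                concatenated_blocks.append(' '.join(temp_block))
            temp_block = [ele]
            total_value = value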
    
    # Add the last block if it's not empty
    if temp_block:
        concatenated_blocks.append(' '.join(temp_block))
    
    return concatenated_blocks

# %%
# Example usage of extract_all_text function
pdf_path = path1
all_text = extract_all_text(pdf_path)

# %%
full_text = '\n'.join(all_text.values())
# Encode some misc unicode characters
full_text = full_text.encode('utf-8').decode()
# Example usage of concat_lines_by_same_element function
blocks = concat_lines_by_same_element(full_text)  # Assuming all_text is a dictionary with page numbers
print(blocks)
blocks_1 = concatenate_blocks(blocks)
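
The token-budgeted grouping done by `concatenate_blocks` can be tried offline with a simple counter. The sketch below is a stand-alone illustration, assuming whitespace-separated words as a rough proxy for tokens; `pack_blocks` and `whitespace_tokens` are hypothetical names, and real token counts from the Anthropic client will differ.

```python
from typing import Callable, List


def whitespace_tokens(text: str) -> int:
    """Rough token proxy: the number of whitespace-separated words (assumption)."""
    return len(text.split())


def pack_blocks(blocks: List[str], max_tokens: int = 2000,
                count: Callable[[str], int] = whitespace_tokens) -> List[str]:
    """Greedily merge consecutive blocks while the running total fits the budget."""
    packed: List[str] = []
    current: List[str] = []
    used = 0
    for block in blocks:
        size = count(block)
        # Start a new group when adding this block would exceed the budget.
        if current and used + size > max_tokens:
            packed.append(" ".join(current))
            current, used = [], 0
        # A block larger than the budget still gets a group of its own.
        current.append(block)
        used += size
    if current:
        packed.append(" ".join(current))
    return packed
```

Because blocks are never split, a single section longer than `max_tokens` ends up as one oversized chunk; splitting such sections further would need a separate step.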

In [None]:
len(blocks)

In [None]:
blocks_1

In [None]:
len(blocks_1)

In [None]:
# %%
long_prompt_template = """Consider the following portion of the document.
<document>
{{document}}
</document>

Please explain in detail all the test conditions mentioned in the test procedure.
Extract all the content related to the query, then summarize the extracted content into one response.
Return the answer in a <response> tag; if you cannot find the answer, return an empty <response> tag.
Also return your confidence in the answer as a percentage in a <confidence> tag.
"""

long_prompt = long_prompt_template.replace("{{document}}", full_text)
long_response = ask_claude(long_prompt, model="sonnet")[1]
print(long_response)
# %%
# Ask the same question of every chunk and collect the per-chunk answers
answer_output = []
for ele in blocks_1:
    long_prompt = long_prompt_template.replace("{{document}}", ele)
    long_response = ask_claude(long_prompt, model="sonnet")[1]
    answer_output.append(long_response)
    
answer_all = "".join(answer_output)
# %%
summarize_template = """Consider the following responses. They were fetched for a single query from different sections of
the document.

<response>
{{response}}
</response>

Summarize these responses into one response.
Return the answer in a <response> tag; if you cannot find the answer, return "Could not find the answer" in the <response> tag.
Also return your confidence in the answer as a percentage in a <confidence> tag.
"""

output = ask_claude(summarize_template.replace("{{response}}",answer_all), model="sonnet")[1]


# %%
print(output)
# %%
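
Since the prompts ask for `<response>` and `<confidence>` tags, the per-chunk replies can be parsed before they are summarized, so that empty answers are dropped instead of being concatenated. The sketch below is one possible way to do this; `parse_reply` and `keep_answers` are hypothetical helpers, and it assumes the model follows the tag format exactly.

```python
import re
from typing import List, Optional, Tuple

RESPONSE_RE = re.compile(r"<response>(.*?)</response>", re.DOTALL)
CONFIDENCE_RE = re.compile(r"<confidence>\s*(\d+(?:\.\d+)?)\s*%?\s*</confidence>")


def parse_reply(reply: str) -> Tuple[str, Optional[float]]:
    """Return (answer_text, confidence_percent); missing tags give ("", None)."""
    response_match = RESPONSE_RE.search(reply)
    confidence_match = CONFIDENCE_RE.search(reply)
    text = response_match.group(1).strip() if response_match else ""
    confidence = float(confidence_match.group(1)) if confidence_match else None
    return text, confidence


def keep_answers(replies: List[str]) -> List[str]:
    """Keep only the replies whose <response> tag is non-empty."""
    answers = []
    for reply in replies:
        text, _confidence = parse_reply(reply)
        if text:
            answers.append(text)
    return answers
```

The filtered list could then be joined and passed to the summarization prompt in place of the raw concatenation.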

### Section-wise chunking (smb)

In [None]:
import pdfplumber
import re

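def read_pdf_pagewise(file_path):
    """Group the PDF's lines by section heading, skipping the first two pages
    and each page's first line (header) and last two lines (footer)."""
    # A heading starts with one digit, a space and a capital letter, or with "Annex <letter>"
    section_pattern = re.compile(r'(\d [A-Z]|Annex [A-Z])')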

    sections = {}
    with pdfplumber.open(file_path) as pdf:
        num_pages = len(pdf.pages)
        print(f'Total pages: {num_pages}')

        current_section = None
        for page_num in range(2, num_pages):
            page = pdf.pages[page_num]
            page_text = page.extract_text()

            if page_text:
                lines = page_text.split('\n')
                if len(lines) > 3:  
                    lines = lines[1:-2]
                else:
                    lines = [] 
                
                for line in lines:
                    match = section_pattern.match(line)
                    if match:
                        current_section = match.group(1)
                        if current_section not in sections:
                            sections[current_section] = []
                    
                    if current_section:
                        sections[current_section].append(line)
    return sections
#     for section, content in sections.items():
#         print(f"--- {section} ---")
#         print('\n'.join(content))
#         print("\n\n")

In [None]:
import pdfplumber
import re

def read_pdf_pagewise(file_path):
    
    section_pattern = re.compile(r'(\d [A-Z]|Annex [A-Z])')

    sections = {}
    with pdfplumber.open(file_path) as pdf:
        num_pages = len(pdf.pages)
        print(f'Total pages: {num_pages}')

        current_section = None
        for page_num in range(2, num_pages):
            page = pdf.pages[page_num]
            page_text = page.extract_text()

            if page_text:
                lines = page_text.split('\n')
                if len(lines) > 3:  
                    lines = lines[1:-2]
                else:
                    lines = [] 
                
                for line in lines:
                    match = section_pattern.match(line)
                    if match:
                        current_section = match.group(1)
                        if current_section not in sections:
                            sections[current_section] = []
                    
                    if current_section:
                        sections[current_section].append(line)
    for section, content in sections.items():
        print(f"--- {section} ---")
        print('\n'.join(content))
        print("\n\n")
        
read_pdf_pagewise(path4)
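
The grouping step inside `read_pdf_pagewise` can be checked on a handful of lines without opening a PDF. The sketch below reuses the same regular expression on an in-memory list; `group_lines_by_section` is a hypothetical helper and the sample lines are made up for illustration.

```python
import re
from typing import Dict, List

# Same pattern as above: one digit, a space and a capital letter, or "Annex <letter>".
SECTION_PATTERN = re.compile(r'(\d [A-Z]|Annex [A-Z])')


def group_lines_by_section(lines: List[str]) -> Dict[str, List[str]]:
    """Assign each line to the most recent heading; lines before the first heading are dropped."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        match = SECTION_PATTERN.match(line)  # match() anchors at the start of the line
        if match:
            current = match.group(1)
            sections.setdefault(current, [])
        if current:
            sections[current].append(line)
    return sections


sample = [
    "Preamble text",
    "1 SCOPE",
    "Applies to vehicles.",
    "2 DEFINITIONS",
    "Lamp means a device.",
    "Annex A",
    "Test procedure.",
]
for key, content in group_lines_by_section(sample).items():
    print(key, "->", content)
```

Note that the section key is only the matched prefix (for example `1 S`), and that a heading with a multi-digit number such as `12 SCOPE` does not match this pattern, so its lines stay attached to the previous section.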