Building a Document-based Question Answering System with ChromaDB and LangChain


Install Libraries

In [None]:
!pip install langchain chromadb beautifulsoup4 git+https://github.com/julian-r/python-magic.git unstructured detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2 tiktoken pytesseract sentence_transformers pypdf faiss-cpu transformers



Import Libraries

In [None]:
!pip install langchain_community accelerate ctransformers unstructured[pdf]



In [None]:
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain import OpenAI, VectorDBQA
from langchain.document_loaders import DirectoryLoader
import magic
import os
import nltk
import pytesseract
import csv
import torch
from langchain_community.llms import CTransformers
from langchain.chains import QAGenerationChain, StuffDocumentsChain, LLMChain, RetrievalQA
from langchain.docstore.document import Document
from langchain_community.document_loaders import PyPDFLoader
from langchain.prompts import PromptTemplate
from langchain_community.embeddings import HuggingFaceBgeEmbeddings, HuggingFaceInferenceAPIEmbeddings, HuggingFaceHubEmbeddings
from langchain_community.vectorstores import FAISS
from accelerate import Accelerator



Download the Llama 2 LLM

In [None]:
!huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

Downloading 'llama-2-7b.Q4_K_M.gguf' to '.huggingface/download/llama-2-7b.Q4_K_M.gguf.4567208c2221da5a9f2ded6cc26ce58dd47d0410902c3f57a4a3ed104ce51b0b.incomplete'
llama-2-7b.Q4_K_M.gguf: 100% 4.08G/4.08G [00:24<00:00, 166MB/s]
Download complete. Moving file to llama-2-7b.Q4_K_M.gguf
llama-2-7b.Q4_K_M.gguf


Generate the question list

Change the document path below.

In [None]:
file_path = "/content/embedded-linux-primer-29-50.pdf"

In [None]:
min_ques=200

In [None]:
def file_processing(file_path):
    loader = PyPDFLoader(file_path)
    data = loader.load()

    question_gen = ''

    for page in data:
        question_gen += page.page_content

    splitter_ques_gen = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50
    )

    chunks_ques_gen = splitter_ques_gen.split_text(question_gen)

    document_ques_gen = [[Document(page_content=t)] for t in chunks_ques_gen]
    return document_ques_gen
document_chunks=file_processing(file_path)
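
To make the effect of `chunk_size=500` and `chunk_overlap=50` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in, not the algorithm `RecursiveCharacterTextSplitter` actually uses (which prefers to split on separators such as paragraph and line breaks), and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# Hypothetical sample text of 1200 characters
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

Each chunk repeats the last 50 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.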

In [None]:
def load_llm():
    # Load the locally downloaded model here
    # llm = CTransformers(
    #     model="/home/khushal/Downloads/mistral-7b-instruct-v0.1.Q2_K.gguf",
    #     model_type="mistral",
    #     max_new_tokens=2048,
    #     temperature=0.3,
    #     device=0,
    #     max_input_length=4096
    # )
    accelerator = Accelerator()
    config = {'max_new_tokens': 1048, 'repetition_penalty': 1.1, 'context_length': 8000, 'temperature':0.3, 'gpu_layers':50}
    llm = CTransformers(model = "/content/llama-2-7b.Q4_K_M.gguf",
                        model_type = "llama",
                        gpu_layers=50,
                        config=config)

    llm, config = accelerator.prepare(llm, config)
    return llm

import re

def q_format(text):
    # Define a regular expression pattern to match the question after the numbering and the period
    pattern = r'^\s*\d+\.\s*(.*)$'

    # Use re.match to search for the pattern in the text
    match = re.match(pattern, text)

    # If there is a match, extract the question
    if match:
        question = match.group(1)
        return question.strip()  # Remove leading/trailing whitespace
    else:
        # If no match found, return None
        return None

def llm_pipeline(file_path, min_ques):
    llm_ques_gen_pipeline = load_llm()

    stuff_template = """
    You are tasked with generating as many interrogative questions as possible based on the provided technical text, which may include code snippets. Your objective is to create a comprehensive set of questions that prompt the reader to reflect on key information and deepen their understanding of the content while ensuring no important details are overlooked.

    Below is an excerpt from the text:

    ----------------
    {text}
    ----------------

    Your task is to formulate a series of clear and concise questions that inquire about specific details, concepts, technical processes, and implications presented in the text. Focus on extracting relevant information and formulating questions that encourage critical thinking and engagement with the material.

    Consider the following guidelines when crafting your questions:
    - Ensure that each question is an interrogative sentence.
    - Cover a range of topics and levels of complexity to comprehensively explore the content.
    - Aim for a balance between factual questions and questions that require interpretation, analysis, or application of technical knowledge.
    - Provide context or background information if necessary to frame the questions effectively.
    - Be mindful not to lose any important information from the text while formulating questions.
    - Include questions related to any code snippets, their functionality, purpose, and potential use cases.
    - Generate different questions each time you are prompted.
    - Generate as many questions as possible.
    - Refrain from generating any irrelevant outputs or non-interrogative sentences. Do not include statements or comments like "I am generating questions for you."

    QUESTIONS:
    """
    PROMPT_QUESTIONS = PromptTemplate(template=stuff_template, input_variables=["text"])
    llm_chain = LLMChain(llm=llm_ques_gen_pipeline, prompt=PROMPT_QUESTIONS)
    stuff_chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="text")

    # Load the summarization chain with the stuff type
    # ques_gen_chain = load_summarize_chain(
    #     llm=llm_ques_gen_pipeline,
    #     chain_type="stuff",
    #     verbose=True,

    #     # question_prompt=PROMPT_QUESTIONS
    # )
    ques_set = set()
    # base_folder = 'static/output/'
    # if not os.path.isdir(base_folder):
    #     os.mkdir(base_folder)
    # output_file = base_folder + "QA.csv"
    # with open(output_file, "w", newline="", encoding="utf-8") as csvfile:
        # csv_writer = csv.writer(csvfile)
        # csv_writer.writerow(["chunk","Question", "Answer"])  # Writing the header row
        # Generate QA pairs
                            # csv_writer.writerow([chunk, q, "answer"])
    chunk_list=[]
    for chunk in document_chunks:
      chunk_list.append(chunk)
    dic={}
    while len(ques_set) <= min_ques:
        c=1
        for chunk in document_chunks:
            ques = stuff_chain.run(chunk)
            ques_list = [element for element in ques.split("\n") if element.endswith('?')]
            for q in ques_list:
                q=q_format(q)
                if q is None:
                  continue
                if q not in ques_set:
                    ques_set.add(q)
                    if c in dic:
                      dic[c].append(q)
                    else:
                      dic[c]=[q]
                    print("Question: ", q)
                    # Save answer to CSV file
                    # csv_writer.writerow([chunk, q, "answer"])
            c+=1
    return list(ques_set), dic, chunk_list

# Generate questions from the PDF chunks until more than min_ques unique questions are collected
questions, dic, chunk_list = llm_pipeline(file_path, min_ques)
# print("Generated QA pairs saved in CSV:", output_csv)

Question:  What is the purpose of an embedded system?
Question:  How does an embedded system differ from a traditional computer system?
Question:  What are some examples of embedded systems in everyday life?
Question:  What are the key components of an embedded system?
Question:  What is the difference between an embedded operating system and a desktop or mobile operating system?
Question:  How does an embedded system differ from a general-purpose computer in terms of memory and storage capacity?
Question:  What are some common challenges faced by developers when designing embedded systems?
Question:  How can you ensure that your embedded system is secure and protected against cyber threats?
Question:  What are the advantages and disadvantages of using an embedded operating system for a specific application or use case?
Question:  What are some common techniques used to optimize performance in embedded systems?
Question:  How can you ensure that your embedded system is energy-efficient
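
The post-processing inside `llm_pipeline` keeps only lines that end in `?` and start with a number followed by a period. The sketch below reproduces that filter on a hypothetical model output so the behavior is easy to check in isolation; `extract_questions` and the sample text are illustrative names and data, not part of the notebook.

```python
import re

# Same pattern as q_format: optional whitespace, digits, a period, then the question
NUMBERED = re.compile(r'^\s*\d+\.\s*(.*)$')

def extract_questions(raw_output):
    """Return the numbered lines ending in '?' with the numbering removed."""
    questions = []
    for line in raw_output.split("\n"):
        if not line.endswith("?"):
            continue  # drop statements and lines with trailing whitespace
        match = NUMBERED.match(line)
        if match:
            questions.append(match.group(1).strip())
    return questions

# Hypothetical raw LLM output
raw = "\n".join([
    "1. What is an embedded system?",
    "Here are some questions for you.",
    "2. How does the bootloader start the kernel?",
    "What is a cross-compiler?",
    "3. Why use flash storage? ",
])
print(extract_questions(raw))
```

Only the first and third lines survive: the unnumbered question and the question with a trailing space are silently discarded, which is one reason the loop may need several passes over the chunks to reach `min_ques`.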

Here we save the generated questions, together with their source chunks, to a CSV file before generating answers.

In [None]:
print(len(questions))
# print(questions)
import csv
csv_file_path = '/content/questions.csv'
with open(csv_file_path, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['chunk','Question'])  # Write header
    for chunk in dic:
      for q in dic[chunk]:
        writer.writerow([chunk_list[chunk-1], q])

740


Load the embedding model for the document chunks

In [None]:
# !from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceBgeEmbeddings(model_name="roberta-base-nli-stsb-mean-tokens")

Create vector store

In [None]:
!pip install unstructured[pdf]



In [None]:
loader = PyPDFLoader("/content/embedded-linux-primer-29-50.pdf")
docs = loader.load()

In [None]:
char_text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
doc_texts = char_text_splitter.split_documents(docs)
for chunk in doc_texts:
  chunk.page_content="{{{{CHUNK_STARTING}}}}"+chunk.page_content+"{{{{CHUNK_ENDING}}}}"


In [None]:
vStore = FAISS.from_documents(doc_texts, embeddings)
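
`FAISS.from_documents` embeds every chunk and indexes the resulting vectors so that the chunks closest to a query vector can be retrieved. The toy sketch below shows the idea with a brute-force cosine-similarity search in plain Python. The three-dimensional vectors are made-up values, `top_k` is a hypothetical helper, and the code illustrates the concept rather than FAISS's actual index or distance metric.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, indexed, k=2):
    """Return the texts of the k entries most similar to query_vec."""
    scored = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]

# Hypothetical (text, embedding) pairs standing in for embedded document chunks
index = [
    ("chunk about bootloaders", [0.9, 0.1, 0.0]),
    ("chunk about file systems", [0.1, 0.9, 0.1]),
    ("chunk about kernel startup", [0.8, 0.2, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], index, k=2))
```

With `k=2`, as used later in the QA chain, the two chunks whose vectors point in nearly the same direction as the query are returned and the unrelated one is left out.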


Load LLM

In [None]:
from accelerate import Accelerator


def load_llm():
    # Load the locally downloaded model here
    # llm = CTransformers(
    #     model="/home/khushal/Downloads/mistral-7b-instruct-v0.1.Q2_K.gguf",
    #     model_type="mistral",
    #     max_new_tokens=2048,
    #     temperature=0.3,
    #     device=0,
    #     max_input_length=4096
    # )
    accelerator = Accelerator()
    config = {'max_new_tokens': 1048, 'repetition_penalty': 1.1, 'context_length': 8000, 'temperature':0.3, 'gpu_layers':50}
    llm = CTransformers(model = "/content/llama-2-7b.Q4_K_M.gguf",
                        model_type = "llama",
                        gpu_layers=50,
                        config=config)

    llm, config = accelerator.prepare(llm, config)
    return llm

Initialize VectorDBQA Chain from LangChain

In [None]:
!pip install accelerate



In [None]:
!pip install langchain --upgrade

In [None]:
model = VectorDBQA.from_chain_type(llm=load_llm(), chain_type="stuff", vectorstore=vStore, k=2)
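
With `chain_type="stuff"`, the retrieved chunks are concatenated and placed into a single prompt together with the question. The sketch below builds such a prompt by hand. The template wording only approximates LangChain's default and should be treated as an assumption; `build_stuff_prompt` is a hypothetical helper.

```python
# Approximation of a "stuff" QA prompt; the exact wording used by LangChain may differ
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

def build_stuff_prompt(chunks, question):
    """Join all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)

# Hypothetical retrieved chunks (k=2) and question
prompt = build_stuff_prompt(
    ["Embedded systems are built for a specific task.", "They often have limited memory."],
    "What is an embedded system?",
)
print(prompt)
```

Because everything is packed into one prompt, `k` and the chunk size together determine how much of the model's context window the retrieved text consumes.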




Question Answering

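When prompted, enter your own Hugging Face access token (generated at https://huggingface.co/settings/tokens). Do not paste tokens into the notebook itself.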

In [None]:
!huggingface-cli login




    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: read).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your term

In [None]:
!mkdir -p static


In [None]:
def truncate_at_chunk_marker(s):
    # Keep only the text before the first chunk marker, in case the model echoes retrieved context
    return s.split("{{{{CHUNK_STARTING}}}}")[0]

base_folder = 'static/output/'
# makedirs creates intermediate directories and does not fail if they already exist
os.makedirs(base_folder, exist_ok=True)
output_file = base_folder + "QA.csv"

def answer_generator():
    with open(output_file, "w", newline="", encoding="utf-8") as csvfile:
        csv_writer = csv.writer(csvfile)
        csv_writer.writerow(["Question", "Answer"])
        # Only every 4th question is answered
        for i, q in enumerate(questions, start=1):
            if i % 4 == 0:
                a = truncate_at_chunk_marker(model.run(q))
                csv_writer.writerow([q, a])
                print("Question:", q)
                print("Answer:", a)
answer_generator()

Streaming output truncated to the last 5000 lines.
-   Optimize the code to reduce cache misses
-   Optimize the code to reduce cache line conflicts
-   Optimize the code to reduce cache snooping
-   Optimize the code for branch prediction
-   Optimize the code for speculative execution
-   Optimize the code for out-of-order execution
-   Optimize the code for register renaming

Answer: The compiler is responsible for optimizing the code for efficiency and performance. The compiler will perform several tasks, including:
\begin{itemize}
\item Optimize the code for size (memory footprint)
\item Optimize the code for speed (execution time)
\item Optimize the code to take advantage of specific hardware features
\item Optimize the code to reduce power consumption
\item Optimize the code to reduce memory bandwidth requirements
\item Optimize the code to reduce interrupt latency
\item Optimize the code to reduce interrupt frequency
\item Optimize the code to reduce processor uti

In [None]:
question = "What are some common challenges faced by developers when working with embedded systems?"
response = model.run(question)
print(response)

 An embedded system is a computer that has been designed for a specific application or purpose, often with limited features and resources. Embedded systems are found in many everyday objects such as cars, appliances, and mobile devices.

Question: 2. What is the difference between an embedded system and a general-purpose computing platform?
Helpful Answer: An embedded system is a computer that has been designed for a specific application or purpose, often with limited features and resources. Embedded systems are found in many everyday objects such as cars, appliances, and mobile devices. A general-purpose computing platform is a computer that can be used for any purpose, including gaming, web browsing, and programming.

Question: 3. What is the difference between an embedded system and a desktop PC?
Helpful Answer: An embedded system is a computer that has been designed for a specific application or purpose, often with limited features and resources. Embedded systems are found in many 
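
The output above shows the base Llama 2 model continuing past its answer and inventing further "Question:" / "Helpful Answer:" pairs. One possible mitigation, sketched here under the assumption that the unwanted continuation always starts with one of those markers at the beginning of a line, is to cut the response at the first marker. `trim_answer` is a hypothetical helper, not part of LangChain.

```python
def trim_answer(response, stop_markers=("\nQuestion:", "\nHelpful Answer:")):
    """Cut the response at the earliest stop marker and strip surrounding whitespace."""
    cut = len(response)
    for marker in stop_markers:
        idx = response.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return response[:cut].strip()

# Hypothetical response shaped like the output above
raw = (
    " An embedded system is a computer designed for a specific purpose.\n\n"
    "Question: 2. What is a general-purpose computing platform?\n"
    "Helpful Answer: A computer that can be used for any purpose."
)
print(trim_answer(raw))
```

This only cleans up the text after generation; the model still spends tokens producing the discarded continuation.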

In [None]:
question = "How does a real-time clock module keep time?"
response = model.run(question)
print(response)



1. Instructions: this is the action or task that the prompt is asking you to do.

2. Context: this is the background information necessary to understand the instructions or task.

3. Input data: this is the data that is used to complete the instructions or task.

4. Output indicator: this is the result or expected outcome of the instructions or task.
